Quick Start Guide – Mio Phi Nodes
Running Phi/MIC examples:
We have examples that illustrate the primary modes of operation of the MIC/Phi nodes. To use the Phi/MIC-enabled nodes you must use Intel MPI ("impi"), and you must specify "phi" as the partition (run queue) in your batch script. For example, the preamble of a run script might look something like:
#!/bin/sh
#SBATCH --time=1
#SBATCH --nodes=1
#SBATCH -n 2
#SBATCH --export=ALL
#SBATCH -p phi
#SBATCH --overcommit
The examples cover the following modes of operation:
- Co-processor Offload Infrastructure (COI)
- MPI
- OpenMP (threads)
- Hybrid MPI/OpenMP
- Offload of MKL calls
- Directives based offload
Obtaining the examples:
The files are available on mio001. Copy the file /opt/utility/quickstart/phi.tgz to your home directory and run the command:
tar -xzf phi.tgz
This creates the directory phi containing:
[tkaiser@phi001 phi]$ ls -R
.:
coi  directive  index.html  micrun  mpi_openmp  offload

./coi:
hello_world  index.html

./coi/hello_world:
Makefile  do_coi  hello_world_sink.cpp  hello_world_source.cpp  index.html

./directive:
do_off  dooff.c  index.html  makefile

./mpi_openmp:
StomOmpf_00d.f  do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  index.html  makefile  runthread  st.in

./offload:
book  index.html  orsl_for_ao_and_cao

./offload/book:
auto.c  dosubscript  index.html  makefile  output  subscript

./offload/orsl_for_ao_and_cao:
4096.log  8192.log  do_off  index.html  makefile  run_s  run_t  t.c  t.simple.c
Examples:
Directory ~/phi/coi/hello_world
Contains a Co-processor Offload Infrastructure (COI) "CUDA-like" coprocessor example. It has a CPU part, hello_world_source.cpp, and a Phi/MIC part, hello_world_sink.cpp. When the source program is run on the CPU it launches the sink program on the cards.
To run:
make
sbatch do_coi
Typical output:
[tkaiser@mio001 hello_world]$ make
mkdir -p debug
g++ -L/opt/intel/mic/coi/host-linux-debug/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_source_host hello_world_source.cpp
mkdir -p debug
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-debug/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -g -O0 -D_DEBUG -o debug/hello_world_sink_mic hello_world_sink.cpp
mkdir -p release
g++ -L/opt/intel/mic/coi/host-linux-release/lib -Wl,-rpath=/opt/intel/mic/coi/host-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_host -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_source_host hello_world_source.cpp
mkdir -p release
/usr/linux-k1om-4.7/bin/x86_64-k1om-linux-g++ -L/opt/intel/mic/coi/device-linux-release/lib -I /opt/intel/mic/coi/include -lcoi_device -rdynamic -Wl,--enable-new-dtags -DNDEBUG -O3 -o release/hello_world_sink_mic hello_world_sink.cpp
[tkaiser@mio001 hello_world]$ sbatch do_coi
Submitted batch job 187
[tkaiser@mio001 hello_world]$ ls *out
slurm-187.out
[tkaiser@mio001 hello_world]$ cat *out
phi001
Hello from the sink!
4 engines available
Got engine handle
Sink process created, press enter to destroy it.
Sink process returned 0
Sink exit reason SHUTDOWN OK
[tkaiser@mio001 hello_world]$
Directory ~/phi/mpi_openmp
- MPI
- OpenMP (threads)
- Hybrid MPI/OpenMP
Contains MPI and OpenMP examples. The MPI example runs "hello world" on both the CPU and the Phi/MIC processors at the same time. The script do_mpi runs the MPI example, and do_openmp runs an OpenMP version of the "Stommel" code.
The program hybrid.f90 is a hybrid MPI/OpenMP program. Each thread prints its thread and MPI id. The program also shows how to create a collection of node-specific MPI communicators based on the name of the node on which a task is running. Each node has its own "node_com", so each thread also prints its MPI rank in the node-specific communicator.
To run:
make
sbatch do_mpi
sbatch do_openmp
sbatch do_hybrid
Typical Output:
[tkaiser@mio001 phi]$ cd ~/phi/mpi_openmp/
[tkaiser@mio001 mpi_openmp]$ ls
do_hybrid  do_mpi  do_openmp  helloc.c  hellof.f  hybrid.f90  makefile  runthread  st.in  StomOmpf_00d.f
[tkaiser@mio001 mpi_openmp]$ make
ifort -free -mmic -openmp -O3 StomOmpf_00d.f -o StomOmpf_00d.mic
rm *mod
ifort -free -openmp -O3 StomOmpf_00d.f -o StomOmpf_00d.x86
rm *mod
mpiicc -mmic helloc.c -o helloc.mic
mpiicc helloc.c -o helloc.x86
mpiifort -mmic hellof.f -o hellof.mic
mpiifort hellof.f -o hellof.x86
mpiifort -mmic -openmp hybrid.f90 -o hybrid.mic
rm *mod
mpiifort -openmp hybrid.f90 -o hybrid.x86
rm *mod
[tkaiser@mio001 mpi_openmp]$ sbatch do_mpi
Submitted batch job 188
[tkaiser@mio001 mpi_openmp]$ ls *188*
188.script  hosts.188  slurm-188.out
[tkaiser@mio001 mpi_openmp]$ cat slurm-188.out
phi001
Hello from phi001 1 20
Hello from phi001 0 20
Hello from phi001 3 20
Hello from phi001 2 20
Hello from phi001-mic2 12 20
Hello from phi001-mic1 8 20
Hello from phi001-mic2 13 20
Hello from phi001-mic0 4 20
Hello from phi001-mic1 9 20
Hello from phi001-mic2 14 20
Hello from phi001-mic3 16 20
Hello from phi001-mic0 5 20
Hello from phi001-mic1 10 20
Hello from phi001-mic2 15 20
Hello from phi001-mic3 17 20
Hello from phi001-mic0 6 20
Hello from phi001-mic1 11 20
Hello from phi001-mic3 18 20
Hello from phi001-mic0 7 20
Hello from phi001-mic3 19 20
Tue Nov 26 10:14:53 MST 2013
[tkaiser@mio001 mpi_openmp]$ sbatch do_openmp
Submitted batch job 189
[tkaiser@mio001 mpi_openmp]$ ls *189*
189.script  hosts.189  slurm-189.out
[tkaiser@mio001 mpi_openmp]$ head slurm-189.out
phi001
threads= 50
  750   168917584.6
 1500   144578230.3
 2250   123773356.0
 3000   105180096.6
 3750   88327143.96
 4500   73054749.29
 5250   59389704.70
 6000   47430832.06
[tkaiser@mio001 mpi_openmp]$ tail slurm-189.out
69750   0.000000000
70500   0.000000000
71250   0.000000000
72000   0.000000000
72750   0.000000000
73500   0.000000000
74250   0.000000000
75000   0.000000000
run time = 6.19109999999637
1 50
Tue Nov 26 10:16:21 MST 2013
[tkaiser@mio001 mpi_openmp]$ sbatch do_hybrid
Submitted batch job 234
[tkaiser@mio001 mpi_openmp]$ ls -lt
total 4284
-rw-rw-r-- 1 tkaiser tkaiser 6480 Dec  2 10:26 slurm-234.out
-rwx------ 1 tkaiser tkaiser  580 Dec  2 10:26 tmpmz19Pk
-rw-rw-r-- 1 tkaiser tkaiser  697 Dec  2 10:26 234.script
-rw-rw-r-- 1 tkaiser tkaiser    7 Dec  2 10:26 hosts.234
...
[tkaiser@mio001 mpi_openmp]$ for m in mic0 mic1 mic2 mic3 ; do echo $m output ; cat slurm-234.out | grep -a $m | head -2 ; echo "..." ; echo "..." ; cat slurm-234.out | grep -a $m | tail -2 ; done
mic0 output
0000 08 phi001-mic0 0000 0000
0000 02 phi001-mic0 0000 0000
...
...
0003 06 phi001-mic0 0000 0003
0003 07 phi001-mic0 0000 0003
mic1 output
0004 00 phi001-mic1 0004 0000
0004 04 phi001-mic1 0004 0000
...
...
0006 07 phi001-mic1 0004 0002
0006 01 phi001-mic1 0004 0002
mic2 output
0008 00 phi001-mic2 0008 0000
0008 09 phi001-mic2 0008 0000
...
...
0011 01 phi001-mic2 0008 0003
0009 09 phi001-mic2 0008 0001
mic3 output
0015 00 phi001-mic3 0012 0003
0015 05 phi001-mic3 0012 0003
...
...
0014 07 phi001-mic3 0012 0002
0014 08 phi001-mic3 0012 0002
[tkaiser@mio001 mpi_openmp]$
Directory ~/phi/offload/book
- Offload of MKL calls
The program auto.c calls MKL's SGEMM; with automatic offload, MKL divides the work between the host CPU and the cards (see the Workdivision lines in the output below). See: Parallel Programming and Optimization with Intel® Xeon Phi
To run:
make
sbatch dosubscript
Typical Output:
[tkaiser@mio001 book]$ ls
auto.c  dosubscript  makefile  output  subscript
[tkaiser@mio001 book]$ make
icc -mkl -DSIZE=8192 auto.c -o offit
[tkaiser@mio001 book]$ ls -l
total 192
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript
[tkaiser@mio001 book]$ sbatch dosubscript
Submitted batch job 203
[tkaiser@mio001 book]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[tkaiser@mio001 book]$ ls -lt
total 216
-rw-rw-r-- 1 tkaiser tkaiser   5876 Nov 26 11:02 out.203
-rw-rw-r-- 1 tkaiser tkaiser    352 Nov 26 11:02 203.script
-rw-rw-r-- 1 tkaiser tkaiser      7 Nov 26 11:02 hosts.203
-rw-rw-r-- 1 tkaiser tkaiser   6828 Nov 26 11:02 slurm-203.out
-rwxrwxr-x 1 tkaiser tkaiser 166526 Nov 26 11:01 offit
-rwxr-xr-x 1 tkaiser tkaiser    352 Nov 26 10:58 dosubscript
-rwx------ 1 tkaiser tkaiser    237 Nov 26 10:50 subscript
-rw-rw-r-- 1 tkaiser tkaiser   5876 Jul 25 10:38 output
-rw-rw-r-- 1 tkaiser tkaiser     89 Jul 25 10:36 makefile
-rw-rw-r-- 1 tkaiser tkaiser   1580 Jul 25 10:35 auto.c
[tkaiser@mio001 book]$ head out.203
Intializing matrix data
size= 8192, GFlops= 403.568
Intializing matrix data
[MKL] [MIC --] [AO Function]    SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision]  0.10 0.23 0.23 0.23 0.23
[MKL] [MIC 00] [AO SGEMM CPU Time]      4.588332 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time]      0.691480 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 335544320 bytes
[tkaiser@mio001 book]$ tail out.203
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 67108864 bytes
[MKL] [MIC 02] [AO SGEMM CPU Time]      0.876555 seconds
[MKL] [MIC 02] [AO SGEMM MIC Time]      0.261351 seconds
[MKL] [MIC 02] [AO SGEMM CPU->MIC Data] 335544320 bytes
[MKL] [MIC 02] [AO SGEMM MIC->CPU Data] 67108864 bytes
[MKL] [MIC 03] [AO SGEMM CPU Time]      0.876555 seconds
[MKL] [MIC 03] [AO SGEMM MIC Time]      0.260408 seconds
[MKL] [MIC 03] [AO SGEMM CPU->MIC Data] 335544320 bytes
[MKL] [MIC 03] [AO SGEMM MIC->CPU Data] 67108864 bytes
size= 8192, GFlops= 1235.405
[tkaiser@mio001 book]$
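As a sanity check on the two GFlops figures in out.203: a square matrix multiply of size N performs roughly 2N³ floating-point operations, so the automatic-offload run is about a 3× speedup over the host-only run:

```latex
% Operation count for an 8192 x 8192 SGEMM
\mathrm{FLOPs} \approx 2N^{3} = 2 \times 8192^{3} \approx 1.10 \times 10^{12}

% Host-only vs. automatic-offload run time, from the reported GFlops
t_{\mathrm{cpu}} \approx \frac{1.10\times10^{12}}{403.568\times10^{9}} \approx 2.7\ \mathrm{s},\qquad
t_{\mathrm{AO}} \approx \frac{1.10\times10^{12}}{1235.405\times10^{9}} \approx 0.89\ \mathrm{s},\qquad
\frac{1235.405}{403.568} \approx 3.06
```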
Directory ~/phi/offload/orsl_for_ao_and_cao
- Offload of MKL calls
This example runs MKL automatic offload (AO) and compiler-assisted offload (CAO) together, comparing serial versus concurrent coprocessor access with manual synchronization off and on (the 0_0, 0_1, 1_0, and 1_1 output files below correspond to the four combinations).
To run:
make
sbatch do_off
Typical Output:
[tkaiser@mio001 orsl_for_ao_and_cao]$ ls
4096.log  8192.log  do_off  makefile  run_s  run_t  t.c  t.simple.c
[tkaiser@mio001 orsl_for_ao_and_cao]$ make
icc -O0 -std=c99 -Wall -g -mkl -openmp t.simple.c -o t.sim
icc -O0 -std=c99 -Wall -g -mkl -openmp t.c -o t.out
[tkaiser@mio001 orsl_for_ao_and_cao]$ sbatch do_off
Submitted batch job 205
[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
  205       phi do_off tkaiser R 0:07 1 phi001
[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
  205       phi do_off tkaiser R 1:03 1 phi001
[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
  205       phi do_off tkaiser R 1:37 1 phi001
[tkaiser@mio001 orsl_for_ao_and_cao]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[tkaiser@mio001 orsl_for_ao_and_cao]$ ls -l *205*
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2536 Nov 26 11:12 0_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2537 Nov 26 11:12 0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2541 Nov 26 11:12 1_0_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2240 Nov 26 11:13 1_1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser 2542 Nov 26 11:11 1_4096.205
-rw-rw-r-- 1 tkaiser tkaiser  645 Nov 26 11:11 205.script
-rw-rw-r-- 1 tkaiser tkaiser    7 Nov 26 11:11 hosts.205
-rw-rw-r-- 1 tkaiser tkaiser   36 Nov 26 11:13 slurm-205.out
[tkaiser@mio001 orsl_for_ao_and_cao]$ head 1_1_4096.205
Coprocessor access: concurrent
Manual synchronization: on
N: 4096
Offload 4096x4096 DGEMM: 475.49 GFlops
[MKL] [MIC --] [AO Function]    DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]      7.864075 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      0.708530 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 129761280 bytes
[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 1_1_4096.205
[Offload] [MIC 0] [MIC Time]        0.206782 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)
[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288746 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.206486 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)
[tkaiser@mio001 orsl_for_ao_and_cao]$ head 0_0_4096.205
Coprocessor access: serial
Manual synchronization: off
N: 4096
[MKL] [MIC --] [AO Function]    DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision]  0.00 1.00
[MKL] [MIC 00] [AO DGEMM CPU Time]      7.670934 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time]      0.716403 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 263979008 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 129761280 bytes
[MKL] [MIC --] [AO Function]    DGEMM
[tkaiser@mio001 orsl_for_ao_and_cao]$ tail 0_0_4096.205
[Offload] [MIC 0] [MIC Time]        0.206475 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)
[Offload] [MIC 0] [File]            t.c
[Offload] [MIC 0] [Line]            35
[Offload] [MIC 0] [CPU Time]        0.288567 (seconds)
[Offload] [MIC 0] [CPU->MIC Data]   415236128 (bytes)
[Offload] [MIC 0] [MIC Time]        0.205986 (seconds)
[Offload] [MIC 0] [MIC->CPU Data]   138412032 (bytes)
[tkaiser@mio001 orsl_for_ao_and_cao]$
Directory ~/phi/directive
- Directives based offload
This shows how you can compile and offload your own functions to the cards using directives. There are a few things of note.
- Functions and data for the cards need the __attribute__((target(mic))) specification.
- We do not need the -mmic compile line option.
- If a card is not available then the function will be run on the CPU.
This example initializes an array on the CPU, modifies a portion of it in a function running on the card, and then prints part of the array from the CPU. It also reports the number of threads available on both the CPU and the card.
To run:
make
sbatch do_off
[tkaiser@mio001 directive]$ make
icc dooff.c -o dooff
[tkaiser@mio001 directive]$ ls
dooff  do_off  dooff.c  index.html  makefile
[tkaiser@mio001 directive]$ sbatch do_off
Submitted batch job 253
Typical Output:
[tkaiser@mio001 directive]$ cat slurm-253.out
phi001
Hello world! I have 240 logical cores.
Hello world! I have 12 logical cores.
enter k:0 1 2 3 4
1234 1234 1234 1234 1234
[tkaiser@mio001 directive]$