Application development (Hexagon)
- 1 Modules
- 2 Compilers and programming languages
- 3 Debugging tools
- 4 Application optimization
- 4.1 Performance optimization. General recommendations.
- 4.2 Performance analysis
- 5 Parallel applications
- 6 Checkpoint and restart of applications
- 7 Recommended reading
Modules
Environment Modules allows you to dynamically modify your user environment by using information provided by "modulefiles". This makes it easy to change between environments or settings, e.g. between the Intel compiler environment and the PGI compiler environment. If you have problems while compiling, running the "module list" command can help you see whether environment modules are missing or the wrong ones are loaded.
When writing a PBS job script (see Job execution for more information), the required environment has to be set inside the script using the module command, because the user environment is not inherited by the PBS script. The same applies to interactive jobs (i.e. qsub -I).
The "module" command has several subcommands, e.g. "module avail".
The following list shows some of the subcommands used with "module".
|avail||Lists all available modules|
|list||Lists the modules you are using|
|load "module_name"||Loads module "module_name"|
|unload "module_name"||Unloads module "module_name"|
|show "module_name"||Displays "module_name"'s configuration settings|
|swap "old_mod" "new_mod"||Unloads the "old_mod" and loads the "new_mod"|
To load the netcdf module into your environment you type:
module load netcdf
If you want a specific version of the module you instead specify:
module load netcdf/3.6.2
Please avoid using version numbers unless strictly necessary since older versions of packages may be removed at a later time.
If you want to change from the Cray compiler (default) to the Intel compiler you type:
module swap PrgEnv-cray PrgEnv-intel
You should also use swap if you want to load a different version of the same module. For example, this replaces your current pgi version with 12.2.0:
module swap pgi pgi/12.2.0
A complete list of subcommands can be found in the module man page (man module).
Please note, if the module command does not work inside your job scripts, add the line "export -f module" to your ~/.bashrc file. This should be automatically set for new users and is only valid if your shell is bash. For other shells you may source the corresponding file in /opt/modules/default/init/ inside your qsub script before you use any "module" command.
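As a minimal sketch of the advice above, a bash job script could check that the module command exists before using it. The init file name "bash" under /opt/modules/default/init/ and the module_status variable are assumptions made for illustration; netcdf is the module used in the examples above.

```shell
#!/bin/bash
# Sketch only: make the "module" command usable inside a job script.
# Assumption: the init file for bash is named "bash" in the init directory.
init_file=/opt/modules/default/init/bash

# Source the init file only if "module" is not already defined.
if ! type module >/dev/null 2>&1 && [ -r "$init_file" ]; then
    . "$init_file"
fi

if type module >/dev/null 2>&1; then
    module load netcdf
    module list
    module_status=available
else
    echo "module command not found; check your shell init files" >&2
    module_status=missing
fi
```

Placing this near the top of the script, before any other module command, keeps the rest of the job script unchanged.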
Compilers and programming languages
Four different compilers are available on Hexagon:
- Cray (default)
- PGI
- GNU
- Intel
All compilation for compute nodes must be done using the compiler wrappers. To switch between compilers, use the module command, e.g. to switch from the PGI to the GNU environment:
module switch PrgEnv-pgi PrgEnv-gnu
By default the latest available version will be loaded. You can switch to another compiler version with e.g.:
module switch pgi pgi/12.2.0
How to invoke the compiler
Compiling an application for use on the compute node should be done by the wrappers specified below. Running the command "module list" will give you one entry like "PrgEnv-###", where ### is either cray, pgi, gnu or intel.
Compiling programs for compute nodes
The compiler wrappers take care of MPI and of the switches/settings for all additional loaded modules automatically.
|Compute node compiler wrappers|
|C programs||cc|
|C++ programs||CC|
|Fortran 90/95 programs||ftn|
|Fortran 77 programs||f77|
NOTE: These wrappers also handle MPI and OpenMP, so you should not compile with mpicc, mpif90 or similar, nor should you need to add any reference to MPI libraries in CFLAGS or similar variables.
Compiling the C program test.c can be done by the command:
cc -o test.out test.c
where test.out is the chosen name of the executable file.
Compiling programs for login nodes
An executable compiled for the login nodes will not be able to run on the compute nodes, and neither OpenMP nor MPI will be supported.
The general rule in this case is to call the compiler directly (like pgcc for PGI).
NOTE: You can compile code for login nodes using compute node wrappers, just keep in mind that in this case you will include MPI and other libraries which are loaded as modules.
Currently installed Programming Environments for compilers:
Frequently used compiler options
Compiling OpenMP programs
To activate OpenMP directives, compile and link with:
Fortran:
|-mp=nonuma||for the PGI compiler|
C and C++:
|-mp||for the PGI compiler|
Recommended compiler options
Normally if you use compiler wrappers all recommended options will be included.
In some cases you may need to use "--enable-static" during configure for running on compute nodes.
Useful optimization flags for the AMD "Interlagos"
When using PGI, the "-tp bulldozer-64" flag will improve the performance of your code. This flag is provided automatically by the module xtpe-interlagos. NOTE: To compile code that should run on the login nodes, this module should NOT be loaded.
Recommended environment variable settings
We recommend that you keep the module xtpe-interlagos loaded. It automatically adds the recommended optimization flags. NOTE: To compile code that should run on the login nodes, this module should NOT be loaded.
Additionally, the "xt-libsci" module contains optimized versions of common scientific/math libraries (e.g. LAPACK, BLAS).
Debugging tools
List of tools and usage summary
Several tools are available on Hexagon for debugging.
Abnormal Termination Processing (ATP) is a system that monitors Cray XT System user applications. Should an application take a system trap, ATP performs analysis on the dying application. With release 1.0, all of the stack backtraces of the application processes are gathered into a merged stack backtrace tree and written to disk as the file "atpMergedBT.dot". The stack backtrace for the first process to die is sent to stderr, as is the number of the signal that caused the death.
You can load ATP environment with:
module load atp
Further information on ATP can be found in the intro_atp man page.
lgdb is a gdb-based debugger and launcher that allows users to attach to and debug codes which execute multiple processes or threads.
You can load lgdb environment with:
module load xt-lgdb
Usage documentation can be found in the manpage:
The following example shows how to connect to an already running program:
qstat -f JOBID | grep exec_host
ssh loginX                       # take the login node from exec_host of the previous command
ps x | grep aprun                # find your aprun
module load xt-lgdb
# to connect to the first rank
lgdb --pes=0 --pid=APRUNPID      # use APRUNPID from the "ps x" command above
# to connect to a list of ranks (from the first to the 8th)
lgdb --pes=0-7 --pid=APRUNPID    # use APRUNPID from the "ps x" command above
TotalView is a graphical, source-level, multiprocess debugger.
The license limits the number of cores; the maximum is 66.
When using this debugger you need to turn on X forwarding when you log in via ssh. This is done by adding the -Y option on newer ssh versions, and -X on older ones. The following is an example using a new version of ssh.
ssh -Y email@example.com
If you don't know if you have an old or new version of ssh, you should run "man ssh" and look for an explanation of "-X" and/or "-Y".
The program you want to debug has to be compiled with the debug option. Normally this is the "-g" option, but that depends on the compiler. The executable from this compilation will in the following examples be called "filename".
First, load the totalview module to get the correct environment variables set:
module load xt-totalview
If you are going to run TotalView on more than 64 cores (up to 512):
module load xt-totalview-notur
To start debugging run:
Which will start a graphical user interface.
Once inside the debugger, if you cannot see any source code, and keep the source files in a separate directory, add the search path to this directory via the main menu item File->Search path.
Source lines where it is possible to insert a breakpoint are marked with a box in the left column. Click on a box to toggle a breakpoint.
Double clicking a function/subroutine name in a source file should open the corresponding source file. You can go back to the previous view by clicking on the left arrow at the top of the window.
The button "Go" runs the program from the beginning until the first breakpoint. "Next" and "Step" take you one line / statement forward. "Out" will continue until the end of the current subroutine/function. "Run to" will continue until the next breakpoint.
The value of variables can be inspected by right clicking on the name, then choose "add to expression list". The variable will now be shown in a pop up window. Scalar variables will be shown with their value, arrays with their dimensions and type. To see all values in the array, right click on the variable in the pop up window and choose "dive". You can now scroll through the list of values. Another useful option is to visualize the array: after choosing "dive", open the menu item "Tools->Visualize" of the pop up window. If you did this with a 2D array, use middle button and drag mouse to rotate the surface that popped up, shift+middle button to pan, Ctrl+middle button to zoom in/out.
Running totalview inside the batch system (compute nodes)
qsub -I -l mppwidth=[#procs],walltime=[time] -A [account] -j oe -X
mkdir -p /work/$USER/test_dir
cp $HOME/test_dir/a.out /work/$USER/test_dir
cd /work/$USER/test_dir
module load xt-totalview
totalview aprun -a -B ./a.out
Replace [#procs] with the core count for the job. Note that TotalView is licensed for a limited number of cores.
Note: When TotalView starts it will bring up 'aprun' first. Click GO and YES.
More information about Totalview can be found in the product knowledge base at http://www.roguewave.com/support/knowledge-base.aspx
Totalview documentation is available at http://www.roguewave.com/support/product-documentation/totalview.aspx#totalview
Application optimization
Performance optimization. General recommendations.
Compilation flags and environment settings
Correct optimization flags will be automatically selected if you use compiler wrappers and module craype-interlagos.
Enable FMA optimizations
Users should be aware that results obtained using FMA operations may differ in the lowest bits from results obtained on other X64 processors. The intermediate result fed from the multiplier to the adder is not rounded to 64 bits. Article at PGI.
Cray compiler: with -hfp3 or -hfp2
PGI: enabled by default when you have -tp=bulldozer
GNU (set of optimizations): -march=bdver1 -Ofast -mprefer-avx128 -funroll-all-loops -ftree-vectorize
Please always verify that the result produced by the optimized version is correct. If it is not, try reducing the optimizations.
Please also check AMD Compiler Options Quick Reference Guide
Recommended optimized libraries
The following modules are optimized by Cray and are therefore recommended to use:
- xt-libsci - BLAS, LAPACK, ScaLAPACK, BLACS, IRT, SuperLU, CRAFFT. See : General software and libraries (Hexagon)#Performance libraries
- petsc - MUMPS, SuperLU, ParMETIS, HYPRE. See : General software and libraries (Hexagon)#Performance libraries
- acml - ACML: Fast Fourier Transform (FFT) routines for real and complex data, etc. See : General software and libraries (Hexagon)#Performance libraries
Correct use of file systems
There is no local disk available on the compute nodes.
Only a shared file system is available: the /work file system, which is a Lustre file system. Note that this file system is not optimized for use as local scratch. Please avoid many small reads and writes; use an access pattern with larger chunks instead, creating well-formed I/O.
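The difference between the two access patterns can be sketched in a job script as follows. This is only an illustration: the temporary directory stands in for a directory under /work, and the file names, record format and record count are made up for the example.

```shell
#!/bin/bash
# Sketch: batch many small records into one larger write.
# Assumption: a temporary directory stands in for a directory under /work.
out_dir=$(mktemp -d)

# Pattern to avoid: the file is reopened and extended once per record.
i=1
while [ "$i" -le 1000 ]; do
    echo "record $i" >> "$out_dir/many_small_writes.txt"
    i=$((i + 1))
done

# Preferred: collect the records in memory, then write them with one command.
buffer=$(
    i=1
    while [ "$i" -le 1000 ]; do
        echo "record $i"
        i=$((i + 1))
    done
)
printf '%s\n' "$buffer" > "$out_dir/one_large_write.txt"

# Both files hold the same data; only the access pattern differs.
if cmp -s "$out_dir/many_small_writes.txt" "$out_dir/one_large_write.txt"; then
    same_contents=yes
else
    same_contents=no
fi
echo "same contents: $same_contents"
rm -rf "$out_dir"
```

The same idea applies inside application code: buffer output in memory and flush it in large blocks rather than record by record.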
Dedicating FPUs per core
Due to the Interlagos design, one FPU is shared by two cores (see more about Bulldozer at Wiki).
If your code performs heavy floating-point calculations, you can gain performance by dedicating each FPU to a single core. (This will double your CPU time usage.) This is how to do it:
If you are using Cray compiler, you can make it aware of your plans by:
module load craype-interlagos-cu
With other compilers, just compile the code as usual.
Next, you need to allocate tasks per core properly with aprun and the queuing system.
Run w/o OpenMP
#PBS -l mppwidth=xx
#PBS -l mppnppn=16
#PBS -l mppdepth=2
aprun -n xx -N 16 -d 2 -S 4 ./mycode
where xx is the number of cores you want to use.
In other words, you pretend to run two OpenMP threads per process, which reserves one core in each pair so that the FPU is not shared. The -S 4 option sets how many processes to place in each NUMA domain, of which there are 4 on a node.
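The cost of this layout can be worked out with a little shell arithmetic. The sketch below uses a hypothetical value of 64 for xx; the variable names are chosen for illustration and are not PBS or aprun settings.

```shell
#!/bin/bash
# Sketch: resources reserved when each MPI process gets its own FPU.
ranks=64            # hypothetical value for "xx" (mppwidth / aprun -n)
ranks_per_node=16   # mppnppn / aprun -N
depth=2             # mppdepth / aprun -d: two cores reserved per process

cores_reserved=$((ranks * depth))
# Round up so that a partially filled node is still counted.
nodes=$(( (ranks + ranks_per_node - 1) / ranks_per_node ))

echo "processes: $ranks, cores reserved: $cores_reserved, nodes: $nodes"
```

With these values the job reserves 128 cores on 4 nodes for 64 processes, which is the doubling of CPU time usage mentioned above.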
Run with OpenMP or with precise placement
If you are using OpenMP you need to specifically map your cores (since the depth in -d is used to specify openmp), the equivalent to the above "-d 2 -N 16" is:
#PBS -l mppwidth=xx
#PBS -l mppnppn=16
aprun -n xx -N 16 -S 4 -cc 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30 ./mycode
This example will use 16 MPI processes per node, 1 per FPU, leaving 16 cores on each node unused.
You may also want to add "-ss", which requests strict memory placement (memory may not be allocated in another NUMA domain), although you may then get out-of-memory errors depending on your memory usage. Avoiding cross-NUMA memory access helps the code by lowering memory latency.
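Rather than typing the core list by hand, it can be generated in the job script. The following is a sketch under the assumption that a node has 32 cores numbered 0 to 31 and that the two cores sharing an FPU are numbered consecutively, as the list in the example above implies; cc_list and cores_per_node are illustrative names.

```shell
#!/bin/bash
# Sketch: build the core list for "aprun -cc", one core per FPU.
cores_per_node=32   # assumption: 32 cores per node, numbered 0-31
cc_list=""
core=0
while [ "$core" -lt "$cores_per_node" ]; do
    # Append with a comma separator, except before the first entry.
    cc_list="${cc_list:+$cc_list,}$core"
    core=$((core + 2))   # skip the sibling core that shares the FPU
done

echo "aprun -n xx -N 16 -S 4 -cc $cc_list ./mycode"
```

The echoed command line matches the aprun call shown above, with xx still to be replaced by the number of processes.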
Performance analysis
List of tools and usage summary
The Cray performance analysis tool.
CrayPat is a performance analysis tool for evaluating program execution on Cray systems. CrayPat consists of three major components:
- pat_build - used to instrument the program to be analyzed (see "man pat_build")
- pat_report - a standalone text report generator that can be used to further explore the data generated by instrumented program execution (see "man pat_report")
- Apprentice2 - a graphical analysis tool that can be used, in addition to pat_report to further explore and visualize the data generated by instrumented program execution (see "man app2")
- Load the newest version of CrayPat:
module load perftools
- Compile your application:
make clean
make
- Instrument the application to generate a sampling profile:
pat_build -O apa a.out
This will create an executable "a.out+pat".
- Run your application (in batch) using the executable "a.out+pat".
This will create the file "a.out+pat+<*>.xf".
- Create Sampling report files:
pat_report a.out+pat+<*>.xf > my_report.txt
This command will automatically create a report file "a.out+pat+<*>.ap2", which can be viewed by Apprentice2.
The command will also create two text files in ascii format: "a.out+pat+<*>.apa" and "my_report.txt".
- For Hardware Counting, instrument application for further analysis:
pat_build -O a.out+pat+<*>.apa
This will create an executable "a.out+apa".
- Modify run script to run the executable "a.out+apa", and add the environment variables
export PAT_RT_MPI_SYNC=0
export PAT_RT_HWPC=[2|3|...]
Running this instrumented application will create a file "a.out+apa+<*>.xf".
- Convert raw data:
pat_report a.out+apa+<*>.xf > my_hwcp_report.txt
This command will automatically create a report file "a.out+apa+<*>.ap2", which can be viewed by Apprentice2. The command will also create a new text file in ascii format: "my_hwcp_report.txt"
- View the results by Apprentice2:
app2 a.out+pat+<*>.ap2 &    # for visualizing sampling results
app2 a.out+apa+<*>.ap2 &    # for visualizing hardware counting results
Apprentice2 generates a variety of interactive graphical reports. For more information, see "man app2".
This summary is based on the slides of Luiz DeRose at the Cray XT4 workshop.
More information can be found in the corresponding manpages (man intro_craypat) or at http://docs.cray.com.
A short version of the IPM usage on Hexagon is given below. Loading the module adds all libraries required for linking to the cc wrapper.
module load ipm
cc -o a.out main.c
The next time you execute your binary, it will generate an IPM report.
To parse results into HTML:
module load ipm
ipm_parse -html IPM_result_file.0
More detailed IPM usage is covered on the NOTUR pages.
Parallel applications
Hexagon has wrappers that should be used when compiling programs for the compute nodes. More information about the wrappers can be found under Compilers and programming languages. These wrappers handle MPI automatically by using a module called xt-mpt. MPT is based on MPICH2.
If you want to change from the default Cray compiler to PGI, GNU or Intel, you can do that by changing the PrgEnv module with the module command.
Not all MPI-2 features are supported, for a complete list - see:
On Hexagon you can run OpenMP jobs within a node, i.e. on a maximum of 32 cores. Since Hexagon is intended for jobs with high core counts, the use of pure OpenMP is discouraged; see below for an explanation of the MPI/OpenMP hybrid.
To activate OpenMP directives, compile with:
Fortran:
|-h omp||Cray compiler|
C and C++:
|-mp||for the PGI compiler|
|-h omp||Cray compiler|
In the batch-script set (replace "threads_per_node" with 1-31)
#PBS -l mppnppn=1,mppwidth=1,mppdepth=threads_per_node
export OMP_NUM_THREADS=threads_per_node
This number should correspond to
aprun ... -d threads_per_node ...
You can run a hybrid MPI + OpenMP job where MPI is used between the nodes and OpenMP within the node.
No special compiler directives are needed to activate MPI, but to activate the OpenMP directives, compile and link with the following.
Fortran:
|-mp=nonuma||for the PGI compiler|
C and C++:
|-mp||for the PGI compiler|
In the batch-script set
#PBS -l mppnppn=mpi_processes_per_node
#PBS -l mppdepth=threads_per_node
#PBS -l mppwidth=number_of_nodes
export OMP_NUM_THREADS=threads_per_node
These numbers should correspond to
aprun ... -n number_of_nodes -d threads_per_node ...
Note: the mppnppn and mppdepth values must be chosen such that mppnppn x mppdepth <= 32.
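A quick way to check a planned layout against this limit before submitting is sketched below. check_layout and report are hypothetical helper functions written for this example; they are not part of PBS or aprun.

```shell
#!/bin/bash
# Sketch: check a hybrid layout against the per-node core limit.
# check_layout and report are hypothetical helpers, not PBS or aprun commands.
check_layout() {
    nppn=$1    # MPI processes per node (mppnppn)
    depth=$2   # OpenMP threads per process (mppdepth)
    [ $((nppn * depth)) -le 32 ]
}

report() {
    if check_layout "$1" "$2"; then
        echo "mppnppn=$1 mppdepth=$2: fits on a node"
    else
        echo "mppnppn=$1 mppdepth=$2: exceeds 32 cores per node"
    fi
}

report 4 8    # 4 x 8 = 32
report 2 16   # 2 x 16 = 32
report 8 8    # 8 x 8 = 64
```

The first two layouts fill a node exactly, while the third asks for twice as many cores as a node provides.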
Checkpoint and restart of applications
To use the checkpointing feature the application must be compiled with blcr:
module load blcr
With the module loaded, all necessary options will be automatically added by the compiler wrapper. Please recompile your application to include the blcr support. Note that only MPI and SHMEM programming models are supported.
The Cray checkpoint/restart solution uses the BLCR software from Berkeley Lab and inherits its limitations. For more information, refer to the BLCR documentation: http://upc-bugs.lbl.gov/blcr/doc/html/index.html.
The job must be submitted with the "-c enabled" parameter. Please see Job execution (Hexagon)#List of useful job script parameters.
Recommended reading
Cray XT Programming Environment User's Guide - contains everything needed to start working with examples on the Cray XT machine.