STAT is a useful tool for collecting stack trace data at scale and aggregating the information into a call graph for the entire set of parallel processes. See [STAT and ATP] for more information on using STAT.

In Situ

Debugging MPI problems at scale can be a major headache, especially on heavily subscribed batch systems. On Cray systems, STAT can be run from a PBS job script by attaching it to the parallel launcher, in this case aprun. The following job submission script runs a libsim-instrumented simulation and attaches STAT to it periodically, sampling a set of stack traces from which a graph of program execution at each of those points in time can be built. If an MPI hang were to occur, the stack traces would show which ranks are stuck in which MPI calls, giving a better idea of how the code arrived in that situation.

  • This PBS script runs STAT in batch against a simulation on Hopper at NERSC
#PBS -q regular
#PBS -l mppwidth=41472
#PBS -l walltime=02:00:00
#PBS -A m636

aprun -n 41472 -e LD_LIBRARY_PATH=/global/u1/w/whitlocb/Development/install_dyn2/2.9.0/linux-x86_64/lib:/opt/gcc/4.9.2/snos/lib64 ./flamev.x &

sleep 2200
set APRUN=`ps | grep aprun | awk '{print $1}' | grep -E '[0-9]' | head -n 1`
echo "APRUN pid: $APRUN"
module load stat
# Sample stack traces every 20 minutes until the job's walltime expires
while (1)
    echo "Invoking STAT -i $APRUN"
    /opt/cray/stat/ -i $APRUN
    sleep 1200
end
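
The sampling pattern used in the job script (launch in the background, find the launcher's PID, then run a tool against it at a fixed interval) can also be sketched in Bourne shell (sh/bash) rather than csh. This is only an illustration under stated assumptions: the function name `sample_while_running` and the sampler command `my_sampler` are hypothetical, and nothing here is STAT's own interface. Unlike the `while (1)` loop above, the sketch stops when the launcher exits or after a maximum number of samples.

```shell
# sample_while_running PID INTERVAL MAX_SAMPLES CMD [ARGS...]
# Runs CMD every INTERVAL seconds while process PID is alive,
# stopping after MAX_SAMPLES runs. (Hypothetical helper, not part of STAT.)
sample_while_running() {
    pid=$1
    interval=$2
    max_samples=$3
    shift 3
    count=0
    # kill -0 sends no signal; it only tests whether the process can be signalled
    while [ "$count" -lt "$max_samples" ] && kill -0 "$pid" 2>/dev/null; do
        "$@"                      # take one sample
        count=$((count + 1))
        sleep "$interval"
    done
}

# Usage sketch (commented out so the block has no side effects):
#   some_launcher ... &           # hypothetical stand-in for the aprun line
#   launcher_pid=$!               # $! is the PID of the last background command
#   sample_while_running "$launcher_pid" 1200 6 my_sampler "$launcher_pid"
```

Taking the PID from `$!` immediately after launching avoids parsing `ps` output, and the sample cap keeps the loop from outliving its usefulness if the launcher never exits.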

Each time STAT runs, it creates a new results directory under stat_results. You can view the resulting call graph with STATview.
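
Because every sample adds another directory, it can help to locate the newest one before opening it. The following is a minimal Bourne-shell sketch, assuming that each run directory sits directly under stat_results and that the call graphs are stored as `.dot` files; the function names `latest_stat_run` and `latest_stat_graphs` are hypothetical helpers, not part of STAT.

```shell
# Print the most recently modified run directory under a results tree.
# Usage: latest_stat_run [results_dir]   (defaults to ./stat_results)
latest_stat_run() {
    results_dir="${1:-stat_results}"
    # ls -td sorts newest first; the trailing slash in the glob matches directories only
    newest=$(ls -td "$results_dir"/*/ 2>/dev/null | head -n 1)
    [ -n "$newest" ] || return 1
    printf '%s\n' "${newest%/}"   # strip the trailing slash
}

# List the call-graph files in the newest run directory.
# Assumes the graphs use a .dot extension.
latest_stat_graphs() {
    run_dir=$(latest_stat_run "${1:-stat_results}") || return 1
    find "$run_dir" -name '*.dot' | sort
}
```

Running `latest_stat_graphs` from the job's working directory prints the files from the latest sample, which can then be opened in STATview.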