Debugging and Performance Analysis

AMDuProf Guide

Run and Collect Data

Run a batch job and collect data:

#SBATCH --constraint=perf  # perf,nvperf for gpu nodes

export PATH=/sw/external/amd/AMDuProf_Linux_x64_4.0.341/bin:$PATH

set -v
srun AMDuProfCLI collect --config tbp  -o `pwd`/uprof_tbp  `pwd`/stream.22gb
srun AMDuProfCLI collect --config inst_access  -o `pwd`/uprof_inst_access  `pwd`/stream.22gb
srun AMDuProfCLI collect --config assess  -o `pwd`/uprof_assess  `pwd`/stream.22gb
srun AMDuProfCLI collect --config assess_ext  -o `pwd`/uprof_assess_ext  `pwd`/stream.22gb
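
The four collection runs above differ only in the --config value and the output directory. As a minimal sketch (not part of AMD's tooling), the command lines can be generated from a list of configurations. The helper name build_collect_commands is hypothetical, and the only flags used are the ones that appear in the batch script above.

```python
import shlex
from pathlib import Path

# Configurations taken from the batch script above.
CONFIGS = ["tbp", "inst_access", "assess", "assess_ext"]


def build_collect_commands(binary, workdir, configs=CONFIGS):
    """Return one 'srun AMDuProfCLI collect' command line per configuration.

    build_collect_commands is a hypothetical helper for illustration only.
    """
    workdir = Path(workdir)
    commands = []
    for config in configs:
        argv = [
            "srun", "AMDuProfCLI", "collect",
            "--config", config,
            "-o", str(workdir / f"uprof_{config}"),
            str(workdir / binary),
        ]
        # shlex.join quotes any argument that contains shell metacharacters.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them.
    for line in build_collect_commands("stream.22gb", Path.cwd()):
        print(line)
```

Printing rather than executing keeps the sketch safe to run on a login node; the printed lines can be pasted into a batch script.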

Generate Report

After a batch job has collected data, generate a report with the AMDuProfCLI report command:

[arnoldg@dt-login03 uprof_tbp]$ export PATH=/sw/external/amd/AMDuProf_Linux_x64_4.0.341/bin:$PATH
[arnoldg@dt-login03 uprof_tbp]$ AMDuProfCLI report -i AMDuProf-stream-TBP_Dec-19-2022_09-40-27/
Translation started ...
Translation finished
Generated database file : cpu
Report generation started...
Generating report file...

Report generation completed...

Generated report file: /projects/bbka/slurm_test_scripts/cpu/stream/uprof_tbp/AMDuProf-stream-TBP_Dec-19-2022_09-40-27/report.csv
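
The session directory name embeds a timestamp, so it changes with every run. A minimal sketch for locating the newest session directory to pass to AMDuProfCLI report -i follows. It assumes that session directories start with "AMDuProf-", as in the example above; newest_session_dir is a hypothetical helper, not part of AMD uProf.

```python
from pathlib import Path


def newest_session_dir(output_dir, prefix="AMDuProf-"):
    """Return the most recently modified session directory, or None.

    Assumes session directories start with 'AMDuProf-', as in the
    example above; newest_session_dir is a hypothetical helper.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.glob(prefix + "*") if p.is_dir()]
    if not candidates:
        return None
    # Pick the directory with the latest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    session = newest_session_dir("uprof_tbp")
    if session is None:
        print("no session directory found")
    else:
        # Print the report command rather than running it.
        print(f"AMDuProfCLI report -i {session}/")
```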

Visualize and Explore Report Data

You can view the data in AMDuProf on Delta or locally with a copy you install on your desktop system. If you install locally, you may need to replicate some paths or add paths to the binary in order to get full functionality.

Launch AMDuProf (no CLI suffix for the GUI) and import the profile session from a completed batch job run with AMDuProfCLI collect.

[Figure: import profile session]

The summary view gives a high-level overview of how time was spent. This is the time-based-profile (tbp) summary.

[Figure: summary view]

The Analyze tab shows hot routines or lines in more detail. The tbp, assess, and inst_access Analyze views follow.

[Figures: Analyze tab (tbp), assess summary, inst_access]

Selecting one of the lines or routines will take you to the Sources view where you can see the assembly used in that portion of the code.

[Figure: Sources view]

The Session Info is under the Summary tab and displays more detail about the profiling session.

[Figure: session summary info]


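Roofline analysis is currently disabled. Information on rooflines is in the AMD uProf user guide (section 3.5.2). The attempt below shows the error: AMDuProfPcm refuses to run while the NMI watchdog is enabled, and disabling the watchdog requires root privilege.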

srun AMDuProfPcm roofline -o stream-roofline.csv -- ./stream.22gb
Error: NMI watchdog is enabled. NMI uses one Core HW PMC counter.
Please disable NMI watchdog - run with root privilege: echo 0 > /proc/sys/kernel/nmi_watchdog
srun: error: cn061: task 0: Exited with exit code 255


AMD uProf user guide

NVIDIA Nsight Systems

Installation (Delta System, rgpu02 Preliminary Documentation)

For admins/sw team: install CUDA with Spack; the nsys command for Nsight Systems is included with it.

[arnoldg@rgpu02 rgpu02]$ module load cuda
[arnoldg@rgpu02 rgpu02]$ which nsys
[arnoldg@rgpu02 rgpu02]$

Installation (NVIDIA Nsight Systems Client on Local Desktop/Laptop)

  1. Open the NVIDIA developer tools overview and navigate to the Developer Tools Downloads button.

  2. Select Nsight Systems and your operating system. If you do not have an NVIDIA developer account, set one up when prompted. When you have completed the forms, your download will begin.

  3. Install the application on your local machine. You will download the output files that the command-line application writes on the server and open them in the GUI on your local machine.

Run Application on Delta

nsys with Serial or Python CUDA Code

$ srun nsys profile -o /path/to/mynysys.out --stats=true ./a.out

nsys Wrapper for MPI and HPC CUDA Codes

[arnoldg@dt-login03 gromacs]$ cat
# Use $PMI_RANK for MPICH, $OMPI_COMM_WORLD_RANK for openmpi, and $SLURM_PROCID with srun.
if [ "$SLURM_PROCID" -eq 1 ]; then
  nsys profile -e NSYS_MPI_STORE_TEAMS_PER_RANK=1 -o gmx.nsys --gpu-metrics-set=2 "$@"
else
  "$@"   # all other ranks run the application without the profiler
fi
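
The wrapper above learns its rank from a launcher-specific environment variable and profiles only one rank. A minimal sketch of the same decision in Python follows, assuming the three variables named in the wrapper's comment. The function names and the lookup order are illustrative choices, not part of nsys, Slurm, or any MPI library.

```python
import os
import shlex
import sys

# Rank variables named in the wrapper comment above, checked in this order.
RANK_VARS = ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK")


def detect_rank(env=None):
    """Return the rank found in the environment, or None if none is set."""
    env = os.environ if env is None else env
    for name in RANK_VARS:
        value = env.get(name)
        if value is not None and value.isdigit():
            return int(value)
    return None


def wrap_command(argv, profiled_rank=1, env=None):
    """Prefix argv with 'nsys profile' on the chosen rank only."""
    if detect_rank(env) == profiled_rank:
        # Same nsys options as the shell wrapper above.
        return ["nsys", "profile",
                "-e", "NSYS_MPI_STORE_TEAMS_PER_RANK=1",
                "-o", "gmx.nsys", "--gpu-metrics-set=2", *argv]
    # Every other rank runs the application unchanged.
    return list(argv)


if __name__ == "__main__":
    # Print the command this rank would run instead of executing it.
    print(shlex.join(wrap_command(sys.argv[1:])))
```

To launch the command instead of printing it, the final line could hand the list to os.execvp; printing first makes it easy to check which rank would be profiled.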

Batch script with --constraint=perf,nvperf

#SBATCH --constraint=perf,nvperf
# the slurm script should run the wrapper above instead of "nsys ..."
time srun $SLURM_SUBMIT_DIR/ \
  gmx_mpi mdrun -nb gpu -pin on -notunepme -dlb yes -v -resethway -noconfout -nsteps 4000 -s water_pme.tpr

# see

MPI Rank Example Result (Viewing with Nsight on Local Desktop)

[Figure: MPI rank example summary]

Copy Resultant Files to Your Local Laptop (Downloads/ or Documents/)

sftp is shown below. You could also use Globus Online, scp, or an sshfs mount from your laptop.

# Delta
[arnoldg@rgpu02 rgpu02]$ ls /tmp/nsys*
/tmp/nsys-report-988d.sqlite  /tmp/nsys-report-b26d.nsys-rep
[arnoldg@rgpu02 rgpu02]$

# local laptop (MacOS example)
(base) galen@macbookair-m1-042020 ~ % cd Downloads
(base) galen@macbookair-m1-042020 Downloads % pwd
(base) galen@macbookair-m1-042020 Downloads % sftp [email protected]

NCSA Delta System

Login with NCSA Kerberos + Duo multi-factor.

DUO Documentation:

([email protected]) Password:
([email protected]) Duo two-factor login for arnoldg

Enter a passcode or select one of the following options:

 1. Duo Push to XXX-XXX-1120
 2. Duo Push to Ipad mini (iOS)
 3. Duo Push to red ipod (iOS)

Passcode or option (1-3): 1
Connected to
sftp> cd /tmp
sftp> mget nsys*
Fetching /tmp/nsys-report-988d.sqlite to nsys-report-988d.sqlite
/tmp/nsys-report-988d.sqlite                  100%  748KB   2.7MB/s   00:00
Fetching /tmp/nsys-report-b26d.nsys-rep to nsys-report-b26d.nsys-rep
/tmp/nsys-report-b26d.nsys-rep                100%  288KB   1.7MB/s   00:00

Open NVIDIA Nsight Systems

Under the File menu, select Open, navigate to your Downloads/ folder, and select the nsys* file of interest (nsys-report-b26d.nsys-rep in this example). Explore the data in the GUI application.

[Figure: timeline analysis]

See also: NVTX source code annotations blog article at NVIDIA (NVTX can annotate C, C++, or Python code running on the GPU or CPU)

Python with NVTX

Installing NVTX via pip

[arnoldg@rgpu02 nvtx]$ module load python cuda
[arnoldg@rgpu02 nvtx]$ C_INCLUDE_PATH=$CUDA_HOME/include pip install nvtx
Collecting nvtx
  Using cached nvtx-0.2.3.tar.gz (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: nvtx
  Building wheel for nvtx (pyproject.toml) ... done
  Created wheel for nvtx: filename=nvtx-0.2.3-cp39-cp39-linux_x86_64.whl size=177533 sha256=875e0f9d4322d07db4bce397b4281ce301f348cf72e00629b0d7bc23a7db0231
  Stored in directory: /u/arnoldg/.cache/pip/wheels/66/7a/44/68c48f02433263010768b540b0e90bf5a224dd7e6612d88887
Successfully built nvtx
Installing collected packages: nvtx
Successfully installed nvtx-0.2.3
[arnoldg@rgpu02 nvtx]$
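
Once nvtx is installed, ranges can be added to Python code with nvtx.annotate, which works both as a decorator and as a context manager. The following is a minimal sketch. The range names f() and loop are chosen to match the names that appear in the NVTX Range Statistics of the run shown in this section, but this is not necessarily the script that produced that output. The pause parameter and the no-op fallback class (used only when nvtx cannot be imported) are additions for illustration.

```python
import time
from contextlib import ContextDecorator

try:
    import nvtx
    annotate = nvtx.annotate
except ImportError:
    # Fallback so the sketch still runs where nvtx is not installed.
    # It records nothing and exists only for illustration.
    class annotate(ContextDecorator):
        def __init__(self, message=None, color=None):
            self.message = message

        def __enter__(self):
            return self

        def __exit__(self, *exc_info):
            return False


@annotate("f()", color="purple")
def f(pause=0.1):
    """Run five annotated iterations, sleeping 0, 1, 2, 3, 4 times pause."""
    iterations = 0
    for i in range(5):
        # Each pass through the loop is recorded as one "loop" range.
        with annotate("loop", color="red"):
            time.sleep(i * pause)
        iterations += 1
    return iterations


if __name__ == "__main__":
    f()
```

Run under nsys profile with --stats=true, a script like this should report one f() range and five loop ranges in the NVTX range statistics. Raising pause lengthens each range proportionally.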

Run with NSYS CLI

[arnoldg@rgpu02 nvtx]$ nsys profile -o nvtx_simple.profile --stats=true ./

Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
Failed to create '/u/arnoldg/rgpu02/cuda/nvtx/nvtx_simple.profile.nsys-rep': File exists.
Use `--force-overwrite true` to overwrite existing files.
Generating '/tmp/nsys-report-1c93.qdstrm'
[1/8] [========================100%] nsys-report-d073.nsys-rep
Failed to create '/u/arnoldg/rgpu02/cuda/nvtx/nvtx_simple.profile.sqlite': File exists.
Use `--force-overwrite true` to overwrite existing files.
[2/8] [========================100%] nsys-report-e498.sqlite
SKIPPED: /tmp/nsys-report-e498.sqlite does not contain CUDA trace data.
SKIPPED: /tmp/nsys-report-e498.sqlite does not contain CUDA kernel data.
SKIPPED: /tmp/nsys-report-e498.sqlite does not contain GPU memory data.
SKIPPED: /tmp/nsys-report-e498.sqlite does not contain GPU memory data.
[3/8] Executing 'nvtxsum' stats report

NVTX Range Statistics:

 Time (%)  Total Time (ns)  Instances      Avg (ns)          Med (ns)         Min (ns)        Max (ns)       StdDev (ns)     Style   Range
 --------  ---------------  ---------  ----------------  ----------------  --------------  --------------  ---------------  -------  -----
     50.0   10,010,633,188          1  10,010,633,188.0  10,010,633,188.0  10,010,633,188  10,010,633,188              0.0  PushPop  f()
     50.0   10,010,401,574          5   2,002,080,314.8   2,002,090,885.0          15,729   4,004,111,558  1,582,756,979.0  PushPop  loop

[4/8] Executing 'osrtsum' stats report

Operating System Runtime API Statistics:

 Time (%)  Total Time (ns)  Num Calls     Avg (ns)         Med (ns)      Min (ns)    Max (ns)       StdDev (ns)           Name
 --------  ---------------  ---------  ---------------  ---------------  --------  -------------  ---------------  -------------------
    100.0   10,010,198,683          5  2,002,039,736.6  2,002,047,874.0     3,025  4,004,056,124  1,582,740,553.2  select
      0.0        1,005,734         46         21,863.8         21,656.0    18,866         27,070          1,608.1  open64
      0.0          495,879         49         10,120.0          4,960.0     1,262         67,747         12,669.1  read
      0.0           38,843         10          3,884.3          3,957.5     3,186          4,559            408.1  mmap64
      0.0           34,164          1         34,164.0         34,164.0    34,164         34,164              0.0  write
      0.0           27,391          4          6,847.8          4,182.5     2,655         16,371          6,410.6  fopen64
      0.0            6,602          3          2,200.7          1,232.0     1,172          4,198          1,730.0  pthread_cond_signal
      0.0            3,647          1          3,647.0          3,647.0     3,647          3,647              0.0  sigaction
      0.0            2,013          1          2,013.0          2,013.0     2,013          2,013              0.0  fread
      0.0            1,923          1          1,923.0          1,923.0     1,923          1,923              0.0  fclose
      0.0            1,472          1          1,472.0          1,472.0     1,472          1,472              0.0  fflush

[5/8] Executing 'cudaapisum' stats report
[6/8] Executing 'gpukernsum' stats report
[7/8] Executing 'gpumemtimesum' stats report
[8/8] Executing 'gpumemsizesum' stats report
[arnoldg@rgpu02 nvtx]$
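
The summary tables above are plain text, which is awkward to compare across runs. As a minimal sketch, one data row of the Operating System Runtime API table can be converted to typed fields. It assumes the column layout shown above (eight numeric columns followed by the name); parse_osrt_row and OsrtRow are hypothetical names and are not part of nsys.

```python
from typing import NamedTuple


class OsrtRow(NamedTuple):
    time_percent: float
    total_time_ns: int
    num_calls: int
    avg_ns: float
    med_ns: float
    min_ns: int
    max_ns: int
    stddev_ns: float
    name: str


def _number(text, kind):
    """Convert text such as '10,010,198,683' to int or float."""
    return kind(text.replace(",", ""))


def parse_osrt_row(line):
    """Parse one data row of the osrtsum table shown above."""
    # Split into at most nine fields; the last field is the name.
    fields = line.split(maxsplit=8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    kinds = (float, int, int, float, float, int, int, float)
    numbers = [_number(f, k) for f, k in zip(fields, kinds)]
    return OsrtRow(*numbers, name=fields[8].strip())


if __name__ == "__main__":
    # Row copied from the osrtsum output above.
    sample = ("100.0 10,010,198,683 5 2,002,039,736.6 2,002,047,874.0 "
              "3,025 4,004,056,124 1,582,740,553.2 select")
    print(parse_osrt_row(sample))
```

Header and separator lines do not match this layout and raise ValueError, so a caller would skip them when reading a saved copy of the output.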
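
To collect GPU metrics, use nsys for metrics from the CUDA libraries and API, or ncu for all metrics from the hardware:

nsys profile --gpu-metrics-device=all \
    --gpu-metrics-frequency=20000 <application>   # get metrics from the cuda libs/api

ncu --metrics "regex:.*" <application>   # get all gpu metrics from the hardware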

Delta Script and Nsight Systems View of the Resulting Report

#SBATCH --job-name="numba_profile"
#SBATCH --partition=gpuA100x4-interactive
#SBATCH --mem=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2   # spread out to use 1 core per numa
#SBATCH --constraint="projects"
#SBATCH --gpus-per-node=1
#SBATCH --gpu-bind=closest   # select a cpu close to gpu on pci bus topology
#SBATCH --account=account_name    # <- match to a "Project" returned by the "accounts" command
#SBATCH -t 00:10:00

module load anaconda3_gpu

dcgmi profile --pause

srun nsys profile \
  --gpu-metrics-device=all \

srun ncu \
  --metrics "regex:.*" \
  --target-processes all \

dcgmi profile --resume

Transfer the resulting report1.nsys-rep back to your local system using Globus Online, sftp, or a similar tool.


Nsight Systems Setup on Local Workstation to Use with Delta

  1. Log in to the NVIDIA Nsight Systems developer page (create an account if you need to), and download the client for your local macOS, Windows, or Linux system.

    To make profile files available to the local Nsight Systems client, transfer them with Globus Online, rsync, or sftp, or mount the remote filesystem locally with sshfs (Linux).

    sshfs Mount Example for Linux Box to Delta:

    galen@galen-HP-ProBook-455-G6:~$ sshfs [email protected]:/projects/bbka delta_projects/
    [email protected]'s password:
    ([email protected]) Duo two-factor login for arnoldg
    Enter a passcode or select one of the following options:
     1. Duo Push to XXX-XXX-1120
     2. Duo Push to Ipad mini (iOS)
     3. Duo Push to red ipod (iOS)
     4. Duo Push to Android
    Passcode or option (1-4): 115489
    galen@galen-HP-ProBook-455-G6:~$ df -h delta_projects/
    Filesystem                                                 Size  Used Avail Use% Mounted on
    [email protected]:/projects/bbka 1000T   60T  941T   6% /home/galen/delta_projects
  2. Launch Nsight Systems and define a target in the default opening view. Define the target even if Nsight Systems cannot SSH to it; this is what makes Nsight Systems offer the .nsys-rep file type when you open a profile from Delta, whether the file was transferred locally (Globus Online, sftp, rsync) or is visible through an sshfs FUSE mount like the one shown above:

    [Figure: project target]
  3. Open the profile report generated by an srun nsys … run on Delta (navigate to your Downloads folder or to the live sshfs FUSE mount).

    [Figure: profile report]
  4. Proceed to use Nsight Systems. The stats view of the GPU summary is shown below. It is usually a good starting point for performance analysis because it compares the time spent in kernels with the time spent transferring data between the host and the GPU.

    [Figure: GPU stats summary]

NVIDIA CUDA C++ programming guide

NVIDIA Nsight Systems user guide (nsys: higher level, CUDA API)

NVIDIA Nsight Compute CLI documentation (ncu: lower level, counters)

GitHub - quasiben/nvtx-examples (sample Python test codes)

Debugging MPI (OpenMPI) Codes

See: Debugging applications in parallel (OpenMPI FAQ on debugging MPI code)

Debugging Open OnDemand Problems

For internal staff debugging (also useful when developing new OOD applications): debugging JupyterLab, Open OnDemand.