Numba JIT compilation to CUDA kernels
April 16th, 10:00am–12:00pm Pacific Time
Abstract: In our second GPU Python course this month, participants will learn how to use the Numba compiler to harness the full power of NVIDIA GPUs without leaving the Python ecosystem. We will cover high-level array acceleration, writing CUDA kernels, managing data transfer and on-device memory, and basic performance tuning techniques. As a capstone exercise, we will implement prime factorization on a GPU, with a focus on monitoring and optimizing GPU utilization on a cluster.
Introduction to Numba and GPU Architecture
In the CuPy course last week, Marie-Hélène Burle gave an introduction to the GPU architecture, the available GPU programming frameworks, and using GPUs on our clusters. For more details, see the corresponding sections of that course.
Briefly:
- A CUDA kernel is a function invoked by the CPU (host) and executed massively in parallel on the GPU (device). Each thread applies this function to its own data element, with thousands of threads running in parallel at the same time.
- For thread hierarchy, CUDA cores are organized into streaming multiprocessors (SMs), which are responsible for executing thread blocks. One SM can run multiple warps, typically with 32 threads per warp, with the exact numbers depending on the card’s architecture.
- Two hardware examples:
- Current-gen NVIDIA H100 SXM5 card on Fir has 132 SMs, each supporting 2,048 threads, giving a theoretical max of 270,336 concurrent threads.
- Older NVIDIA A100 SXM4 card has 108 SMs, each supporting 2,048 threads, giving a theoretical maximum of 221,184 concurrent threads.
- However, the GPU does not execute all of these threads at once: execution is interleaved with data movement, and at any given moment many threads are waiting on memory. As a programmer, you want to keep the SMs as busy as possible (high thread occupancy).
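The theoretical maxima quoted above are just the product of the SM count and the per-SM thread limit; a quick sanity check in plain Python:

```python
# maximum resident threads = number of SMs × threads per SM
h100_sms, a100_sms = 132, 108   # from the hardware examples above
threads_per_sm = 2048

print(h100_sms * threads_per_sm)  # 270336 concurrent threads on the H100 SXM5
print(a100_sms * threads_per_sm)  # 221184 concurrent threads on the A100 SXM4
```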
Installation
On a laptop:
uv venv ~/env-numba --python 3.12 # create a new virtual env
source ~/env-numba/bin/activate
uv pip install numba
...
deactivate
On Fir:
module load python/3.12.4
python -m venv ~/env-numba
source ~/env-numba/bin/activate
python -m pip install --upgrade --no-index pip
python -m pip install --no-index numba
...
deactivate
Running on …
… a production cluster
To list all available GPU types, including MIG instances and full GPUs, you can use the following command introduced in Marie’s course last week:
sinfo -o "%G" | grep gpu | sed 's/gpu://g' | sed 's/),/\n/g' | cut -d: -f1 | sort | uniq
For example, on Fir this will return:
h100
nvidia_h100_80gb_hbm3_1g.10gb
nvidia_h100_80gb_hbm3_2g.20gb
nvidia_h100_80gb_hbm3_3g.40gb
Then you would submit a job with:
source ~/env-numba/bin/activate
salloc --time=... --mem=... --gpus-per-node=<gpuType>:1 --account=...
On Fir:
cd ~/numba
module load cuda/12.6
source ~/env-numba/bin/activate
salloc --time=1:0:0 --mem=3600 --gpus-per-node=h100:1 --reservation=gpu_mig_test --account=def-razoumov-ac
sbatch submit.sh
Since each MIG instance comes with its own overhead, multi-MIG jobs are no longer allowed on our production clusters. If you need more GPU resources, schedule your job on a bigger MIG partition or an entire card.
… the training cluster
sinfo -o "%G" | grep gpu | sed 's/gpu://g' | sed 's/),/\n/g' | cut -d: -f1 | sort | uniq
source /project/def-sponsor00/shared/env-numba/bin/activate
salloc --time=2:0:0 --mem=3600 --gpus-per-node=2g.10gb:1
python
from numba import cuda
print(cuda.gpus) # list of numerical IDs
cuda.select_device(device_id) # if several GPUs, select which GPU to use
… systems without an NVIDIA GPU
Out of the box, Numba’s core library supports only NVIDIA GPUs through CUDA. Support for AMD GPUs is currently more experimental than plug-and-play, and there is no native support for Apple’s Metal API for GPU acceleration.
Having said that, you can run any Numba-compiled code through a CUDA simulator on CPUs, with zero changes to your code. To enable the built-in CUDA simulator:
source ~/env-numba/bin/activate
export NUMBA_ENABLE_CUDASIM=1
from numba import cuda
print(cuda.gpus) # will return the list of GPUs
The CUDA simulator runs your Numba-compiled code much slower than the same Numba code compiled for a CPU, and ~4–5 orders of magnitude slower than on a physical GPU.
You cannot use CPU multithreading in this mode, as the simulator was designed for testing and debugging, not for performance, so it runs on a single thread. With a bit of extra code, you can wrap the simulator into a multi-threaded function call, but this is only useful if you really want to speed up testing and don’t have access to a physical GPU.
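The multi-threaded wrapper mentioned above could look roughly like this sketch; `fakeKernel` is a hypothetical stand-in for a simulated kernel launch, since the real call would depend on your kernel and simulator setup:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fakeKernel(chunk):
    # hypothetical stand-in for launching the simulated kernel on one chunk
    chunk += 1

data = np.zeros(1_000)
chunks = np.array_split(data, 4)  # four views into `data`, one per worker thread

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(fakeKernel, chunks))  # each thread processes its own chunk

print(data.sum())  # 1000.0: every element was incremented exactly once
```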
High-level array acceleration with @vectorize and @guvectorize
You might have come across numpy ufuncs (universal functions) that operate on ndarrays in an element-by-element fashion. Numba’s @vectorize compiles Python functions taking scalar input arguments into NumPy ufuncs that run as fast as traditional ufuncs written in C.
There are two ways to use @vectorize:
- For eager / decoration-time compilation, you pass one or more type signatures to the decorator.
- For lazy / call-time compilation, you do not specify type signatures in the function decorator, resulting in a ufunc that dynamically compiles a new kernel when called with a previously unsupported input type.
Here is the decoration-time compilation:
from numba import vectorize
import numpy as np
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
np.add(a, b)
@vectorize(['int64(int64, int64)'], target='cuda') # output(input,input) as a list of strings
def add(x, y):
return x + y
print(add(a, b))
You can even specify multiple types (overloading):
@vectorize(['int32(int32, int32)', 'int64(int64, int64)',
'float32(float32, float32)', 'float64(float64, float64)'])
def add(x, y):
return x + y
a = np.arange(6) # integer array
add(a,a)
a = np.linspace(0, 1, 6) # float array
add(a,a)
a = np.linspace(0, 1+1j, 6) # complex array
add(a,a) # error
TypeError: ufunc 'add' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
@vectorize is useful for creating fast, compiled functions that operate on numpy arrays element by element. This will give you additional ndarray features such as reduction, accumulation or broadcasting. For more general functions, you will want to use the @cuda.jit decorator.
You can use the following compile targets in @vectorize:
- cpu for a single CPU core
- parallel for multi-core CPUs (via multi-threading)
- cuda for Nvidia GPUs
You can also pass nopython=True to the decorator to ensure the generated code does not fall back to the slow object mode.
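The extra ndarray features mentioned above (reduction, accumulation, broadcasting) are standard ufunc methods; here they are on NumPy’s built-in np.add, and a ufunc produced by @vectorize exposes the same interface:

```python
import numpy as np

a = np.arange(6)
print(np.add.reduce(a))      # 15: reduction, same as a.sum()
print(np.add.accumulate(a))  # [ 0  1  3  6 10 15]: running sum
print(np.add(a, 10))         # [10 11 12 13 14 15]: scalar broadcast against the array
```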
While @vectorize is for element-by-element math, the more general @guvectorize decorator lets you build functions that operate on entire array slices. A good example is a moving average (blur): you cannot do it with a standard ufunc, because calculating one output point requires knowing its neighbours.
from numba import guvectorize
import numpy as np
# () represents a scalar, (n) represents an array of size n
@guvectorize(['f8[:], i8, f8[:]'], '(n),()->(n)')
def movingAverage(x, width, res):
    n = x.shape[0]
    for i in range(n):
        start = max(0, i - width // 2)
        end = min(n, i + width // 2 + 1)
        total, count = 0.0, 0
        for j in range(start, end):
            total += x[j]
            count += 1
        res[i] = total / count
data = np.array([10, 20, 30, 40, 50, 60], dtype=np.float64)
result = movingAverage(data, 3)
print(result) # [15. 20. 30. 40. 50. 55.]
Writing custom kernels with @cuda.jit
Numba supports CUDA GPU programming by directly compiling a restricted subset of Python code into CUDA kernels. It does not yet implement the full CUDA API, but CUDA support in Numba is being actively developed.
A kernel function is a GPU function that is called from a CPU. It has two fundamental characteristics:
- kernels cannot return a value; all results must be written to an array passed to the function, and
- kernels must declare their thread hierarchy: the number of blocks and the number of threads per block.
Here is what a kernel declaration looks like:
from numba import cuda
@cuda.jit
def kernelFunction(data):
    ...
The input/output array data can be defined on either the CPU or the GPU; we will see examples of both.
Let’s assume we are processing a large number of data elements on a GPU. We’ll be doing this on a massively parallel scale, with each thread on a GPU applying the same calculation to its own array element. There are other ways to divide the workload, e.g. using a fixed number of tasks (each processing multiple data elements) and then having each task run on a separate thread.
In all examples in this workshop, we use a separate thread to process each data element individually.
A common technique in Numba is to define your block size and then calculate the number of blocks needed to cover your entire data array when launching the kernel:
data = ... # whether defined on a CPU or GPU
threadsPerBlock = 256 # block size; multiple of 32 (32 threads per warp), usually 128 or 256,
# higher number for maximum occupancy for simple kernels;
# 1024 is usually the hardware limit;
# memory is shared among all threads in the block
blocksPerGrid = (data.size + (threadsPerBlock - 1)) // threadsPerBlock
# number of thread blocks in the grid; make sure there are enough
# threads on the GPU to cover every single element in your data
# this one-liner is equivalent to the following two lines:
# blocksPerGrid = data.size // threadsPerBlock
# if data.size > blocksPerGrid*threadsPerBlock: blocksPerGrid += 1
kernelFunction[blocksPerGrid, threadsPerBlock](data) # start the kernel
print(data)
A kernel launch is a non-blocking (asynchronous) operation: the function returns control to the caller before the kernel finishes executing or the data is synchronized back. If you want the CPU to wait for the GPU to finish, e.g. to time your GPU code, insert cuda.synchronize() after the kernel launch. Transferring data back to the CPU with result = deviceArray.copy_to_host() also forces synchronization: the CPU waits for the GPU to complete all pending work before the data can be safely copied into system memory.
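To convince yourself that this launch configuration covers every element exactly once, here is a CPU-only emulation of the same index arithmetic (no GPU required):

```python
import numpy as np

size = 1_000
threadsPerBlock = 256
blocksPerGrid = (size + threadsPerBlock - 1) // threadsPerBlock  # ceiling division

data = np.zeros(size)
for bx in range(blocksPerGrid):          # emulate the blocks sequentially
    for tx in range(threadsPerBlock):    # emulate the threads sequentially
        pos = bx * threadsPerBlock + tx  # flattened index, as in a real kernel
        if pos < size:                   # boundary check: the last block is partial
            data[pos] += 1

print(blocksPerGrid)                     # 4 blocks, i.e. 1024 threads for 1000 elements
print(data.min() == data.max() == 1.0)   # True: every element touched exactly once
```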
Besides CUDA kernels, you can also define so-called device functions. These are functions that execute on the GPU and can only be called from other GPU code, such as a kernel or another device function. Unlike CUDA kernels, device functions can accept scalar arguments. We will see an example of a device function in the prime factorization section.
Thread positioning
The kernel function is executed by every thread once. From inside the kernel function, you likely want to calculate the thread ID to know which data element(s) this thread is responsible for.
@cuda.jit
def increment(data): # kernel function processing a 1D array
    tx = cuda.threadIdx.x # threadID in the current thread block, between 0..cuda.blockDim.x-1
    bx = cuda.blockIdx.x # blockID in a 1D grid
    bw = cuda.blockDim.x # number of threads per block (block width)
    pos = bx * bw + tx # flattened index inside the array
    if pos < data.size: # check array boundaries, in case there are more threads than data.size
        data[pos] += 1 # do the computation
The code below launches 1024 threads to process 1000 elements of the data array. Because the number of threads exceeds the number of elements, an if condition inside the kernel is necessary to prevent out-of-bounds memory access.
from numba import cuda
import numpy as np
@cuda.jit
def increment(data): # kernel function processing a 1D array
...
data = np.ones(1_000)
threadsPerBlock = 256
blocksPerGrid = (data.size + (threadsPerBlock - 1)) // threadsPerBlock
increment[blocksPerGrid, threadsPerBlock](data)
Consider the following code with the same number of data elements and the same thread hierarchy – what will its output be? Try to answer this question without running the code.
from numba import cuda
import numpy as np
@cuda.jit
def test(data):
    tx = cuda.threadIdx.x # threadID in the current thread block, between 0..255
    print(tx, cuda.blockIdx.x, cuda.blockDim.x, cuda.blockDim.y, cuda.blockDim.z)
data = np.ones(1_000)
threadsPerBlock = 256
blocksPerGrid = (data.size + (threadsPerBlock - 1)) // threadsPerBlock
test[blocksPerGrid, threadsPerBlock](data)
Hint: how many threads will run?
Try running the last example removing any reference to data, i.e. the kernel-launching line will be:
test[blocksPerGrid, threadsPerBlock]()
You can simplify the last increment(data) to:
@cuda.jit
def increment(data): # 1D example
    pos = cuda.grid(1) # absolute position of the current thread in the entire grid of threads
    if pos < data.size:
        data[pos] += 1 # do the computation
Let’s check out a working 2D example:
from numba import cuda
import numpy as np
import math
# CUDA kernel
@cuda.jit
def increment2d(data):
    i, j = cuda.grid(2) # x, y position of this thread in the entire 2D grid of threads
    # print(i,j)
    rows, cols = data.shape
    if i < rows and j < cols:
        data[i,j] += 1
data = np.ones((30,30))
# arrange a 2D grid of threads
threadsPerBlock = (16,16)
blocksPerGrid = (math.ceil(data.shape[0] / threadsPerBlock[0]),
                 math.ceil(data.shape[1] / threadsPerBlock[1]))
increment2d[blocksPerGrid, threadsPerBlock](data)
print(data)
In this example, CUDA will launch blocksPerGrid[0] * blocksPerGrid[1] * threadsPerBlock[0] * threadsPerBlock[1] = 2*2*16*16 = 1024 threads, which is more than the \(30^2=900\) array elements, i.e. some of the threads won’t be doing any real computation. This is because 30 is not divisible by 16.
Similarly to the 1D example, if we removed the array bounds check if i < rows and j < cols inside the CUDA kernel, the threads in the last block along each dimension with local (inside their block) indices cuda.threadIdx.x >= 14 or cuda.threadIdx.y >= 14 – precisely the threads with no computation to do – would try to write into memory outside the array, leading to either an “out of bounds” runtime error or memory corruption.
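The thread-count arithmetic for this 2D launch is easy to verify on the CPU:

```python
import math

n = 30
threadsPerBlock = (16, 16)
blocksPerGrid = (math.ceil(n / threadsPerBlock[0]), math.ceil(n / threadsPerBlock[1]))

totalThreads = (blocksPerGrid[0] * blocksPerGrid[1]
                * threadsPerBlock[0] * threadsPerBlock[1])
print(blocksPerGrid)          # (2, 2)
print(totalThreads)           # 1024 threads launched
print(totalThreads - n * n)   # 124 idle threads guarded by the bounds check
```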
Memory management and data transfer
In our previous examples, we would pass a data array to a CUDA kernel, which at the end of the computation would return the modified array back to the host. In other words, when launching a kernel, data was automatically transferred between the host and the device.
In some cases you might want to have more precise control over data movement. For example, you might be launching multiple kernels and want to avoid unnecessary data transfer. In such cases you can allocate an array directly on the GPU with cuda.device_array():
device_array = cuda.device_array(1_000_000, dtype=np.int32) # allocate memory on the GPU
kernelFunction[blocksPerGrid, threadsPerBlock](device_array) # launch the kernel without copying data
We still must pass the array to the kernel function, even though the array now resides on the GPU.
Alternatively, you can declare an array on a CPU and then move it manually to the GPU with cuda.to_device:
a = np.zeros(1_000_000, dtype=np.int32)
device_array = cuda.to_device(a)
kernelFunction[blocksPerGrid, threadsPerBlock](device_array) # launch the kernel without copying data
After the computation, the modified array is still on the GPU, so you can copy it back to the host:
A = device_array.copy_to_host() # copy the result back to the host
print(A)
Benchmarking
Unlike in CuPy with its cupyx.profiler.benchmark, Numba does not have a single “one-stop” convenience function that handles warm-ups and multiple kernel launches (to get an accurate estimate) automatically. Having said that, you have several options.
Timing on the GPU with CUDA events
The most precise way to time a Numba kernel is with CUDA events. This method measures time directly on the GPU hardware, avoiding CPU-GPU communication and Python overheads.
from numba import cuda
import numpy as np
import math
@cuda.jit
def logs(data):
    i, j = cuda.grid(2) # x, y position of this thread in the entire 2D grid of threads
    rows, cols = data.shape
    if i < rows and j < cols:
        data[i,j] = math.log10(i+j+1)
# warm-up
n = 3
data = np.ones((n,n))
threadsPerBlock = (3,3)
blocksPerGrid = (1,1)
logs[blocksPerGrid, threadsPerBlock](data)
# benchmark the kernel
n = 10_000
data = np.ones((n,n))
threadsPerBlock = (16,16)
blocksPerGrid = (math.ceil(data.shape[0] / threadsPerBlock[0]), math.ceil(data.shape[1] / threadsPerBlock[1]))
startEvent, endEvent = cuda.event(), cuda.event() # create events
startEvent.record()
logs[blocksPerGrid, threadsPerBlock](data)
endEvent.record()
endEvent.wait() # make sure the GPU has finished the work
print("Time in ms:", cuda.event_elapsed_time(startEvent, endEvent)) # 80.928ms 80.723ms 80.208ms
Timing on the CPU
To time from the CPU side, you must use cuda.synchronize() to ensure the GPU has actually finished its work before you stop the clock. In the following example we launch the kernel several times and compute the average time per run:
from numba import cuda
import numpy as np
import math
from time import time
# warm-up
data = np.ones((3,3))
logs[(1,1),(3,3)](data)
n = 10_000
data = np.ones((n,n))
threadsPerBlock = (16,16)
blocksPerGrid = (math.ceil(data.shape[0] / threadsPerBlock[0]), math.ceil(data.shape[1] / threadsPerBlock[1]))
numIterations = 10
cuda.synchronize()
start = time()
for _ in range(numIterations):
    logs[blocksPerGrid, threadsPerBlock](data)
cuda.synchronize()
end = time()
print("Average time per call in seconds:", round((end-start)/numIterations,6)) # 0.079077s
Prime factorization
Now let’s tackle a more interesting problem in which we do some significant processing of each element of a large array, independently of the other elements – this ports nicely to a GPU. We teach this problem in several programming languages.
Prime factorization of an integer number is finding all its prime factors. For example, the prime factors of 60 are 2, 2, 3, and 5. Let’s write a function that takes an integer number and returns the sum of all its prime factors. For example, for 60 it will return 2+2+3+5 = 12, for 91 it will return 7+13 = 20, for 101 it will return 101, and so on.
def primeFactorizationSum(n):
    total = 0
    while n % 2 == 0: # handle the number of 2s that divide n
        total += 2
        n //= 2
    divisor = 3 # n must be odd at this point, so we can step over even divisors
    while divisor * divisor <= n:
        while n % divisor == 0:
            total += divisor
            n //= divisor
        divisor += 2
    if n > 2: # whatever remains is itself a prime factor
        total += n
    return total
We can test it quickly:
primeFactorizationSum(60) # 12
primeFactorizationSum(91) # 20
primeFactorizationSum(101) # 101
primeFactorizationSum(100_000_000) # 56
Since 1 has no prime factors, we will start computing from 2, and then apply this function to all integers in the range 2..n, where n is a large number. We will do all computations from scratch for each number, i.e. we will not cache our results (caching could significantly speed up the calculation, but the point here is to focus on brute-force computing).
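As a sanity check on the divisor-walking logic, here is a hypothetical variant primeFactors (not part of the workshop code) that returns the factors themselves, so their product can be compared against the input:

```python
import math

def primeFactors(n):
    # same divisor walk as primeFactorizationSum, but collecting the factors
    factors = []
    while n % 2 == 0:
        factors.append(2)
        n //= 2
    divisor = 3
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 2
    if n > 2:  # whatever remains is itself a prime factor
        factors.append(n)
    return factors

for n in (60, 91, 101, 100_000_000):
    f = primeFactors(n)
    assert math.prod(f) == n  # the factors multiply back to the original number
    print(n, f, sum(f))
```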
Let’s first do this with native Python loops on a CPU:
import numpy as np
from time import time
def primeFactorizationSum(n):
...
n = 1_000_000
A = np.zeros(n-1, dtype=np.int32) # to fill 2..n
start = time()
for i in range(n-1):
    A[i] = primeFactorizationSum(i+2)
end = time()
print("Primes sum array:", A)
print("Time in seconds:", round(end-start,3))
On Fir I am getting: 5.063s 5.054 5.038.
The numerical output is:
Primes sum array: [ 2 3 4 ... 287 77 42]
Next let’s run this in Numba on a CPU:
import numpy as np
from numba import njit
from time import time
@njit
def primeFactorizationSum(n):
    ...
@njit
def primes(n):
    A = np.zeros(n-1, dtype=np.int32) # to fill 2..n
    for i in range(n-1):
        A[i] = primeFactorizationSum(i+2)
    return A
primes(10) # dry run to pre-compile
n = 1_000_000
start = time()
result = primes(n)
end = time()
print("Primes sum array:", result)
print("Time in seconds:", round(end-start,6))
On Fir the time went down to: 172.202ms 172.820 172.306
Finally, let’s run this on the GPU:
from numba import cuda
import numpy as np
from time import time
# device function
@cuda.jit(device=True)
def primeFactorizationSum(n):
    ...
@cuda.jit
def primes(result):
    # calculate the unique index for this thread
    # pos = (block_index * block_dimension) + thread_index
    i = cuda.grid(1)
    # check boundary to avoid accessing memory outside the array
    if i < result.size:
        result[i] = primeFactorizationSum(i+2)
n = 1_000_000
device_array = cuda.device_array(n-1, dtype=np.int64) # allocate memory on the GPU
threadsPerBlock = 256 # thread hierarchy
blocksPerGrid = (n-1 + (threadsPerBlock - 1)) // threadsPerBlock
start = time()
primes[blocksPerGrid, threadsPerBlock](device_array) # launch the kernel
cuda.synchronize() # ensure GPU finishes before stopping clock
end = time()
A = device_array.copy_to_host() # copy the result back to the host
print(f"Primes sum array: {A}")
print("Time in seconds:", round(end-start,6))
Running the big kernel without a warm-up, on Fir my runtimes are: 628.017ms 170.453ms 169.146ms 166.548ms – notice how the first number (probably) includes the compilation time, but even the subsequent numbers are large: you need the compiled code cached on the GPU to get maximum GPU performance.
One way to do this is to provide the kernel signature (types) to the decorator by changing @cuda.jit to @cuda.jit('(int64[:],)') – we are passing an array of int64. With this, our runtimes drop to 844μs 835μs 830μs.
The second solution is to run a warm-up calculation. For example, I can run my entire calculation as a warm-up:
primes[blocksPerGrid, threadsPerBlock](device_array)
cuda.synchronize()
but probably a better strategy is to warm up the GPU with a small calculation:
dummyArray = cuda.device_array(1, dtype=np.int64) # tiny "dummy" array
primes[1, 1](dummyArray) # launch with a single thread to trigger compilation
cuda.synchronize()Then our runtime becomes 754μs 752μs 752μs.
- Run the same prime factorization problem on the GPU, increasing the problem size to n=10_000_000 and n=100_000_000. Do you see a linear speedup, and why do you think that is?
- Try running n=1_000_000_000 – you can run out of memory both on the host and on the device. How would you solve these two separate problems?
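For the host-side memory part of the last exercise, one possible direction is to process the range in chunks, so that only one batch of results is in memory at a time. Here is a CPU-only sketch of the batching logic; the per-chunk kernel launch is elided, and a placeholder fill stands in for the real results:

```python
import numpy as np

n = 10_000         # stand-in for the real 1_000_000_000
chunkSize = 4_000  # tune so that one chunk of results fits in host/device memory

running = 0
for lo in range(2, n + 1, chunkSize):
    hi = min(lo + chunkSize, n + 1)
    chunk = np.zeros(hi - lo, dtype=np.int64)  # per-chunk result buffer
    # ... launch the kernel on the range [lo, hi) and copy this chunk back here ...
    chunk[:] = np.arange(lo, hi)               # placeholder for the real results
    running += chunk.sum()                     # reduce, then let the chunk be freed

print(running)  # with the placeholder fill: the sum of 2..n
```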
GPU utilization monitoring
For your past jobs on the production clusters, we encourage you to use our Metrix portal, currently available for Rorqual, Narval and Nibi; support for Fir will hopefully come soon.
However, you can also monitor running jobs in real time via command line. The two tools useful for this are NVIDIA’s nvidia-smi and nvtop.
Here is how you would use nvidia-smi:
- connect to a currently running job with srun --jobid=... --pty bash
- run watch -n 1 nvidia-smi to execute nvidia-smi every second, refreshing the terminal output each time
- press Ctrl-C to go back to the prompt on the login node
nvidia-smi
Write a bash function job-nvidia-smi() that lists all your currently running jobs, picks the latest (highest jobID) job from that list, connects to that job and automatically runs watch -n 1 nvidia-smi. You can also try running nvidia-smi -l 2 for a scrolling output style. Try this command on a long-running GPU code. How’s your GPU compute and memory utilization?
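A possible sketch of job-nvidia-smi(), assuming Slurm’s squeue, srun and nvidia-smi are available on your cluster; treat it as a starting point rather than a polished tool:

```shell
job-nvidia-smi() {
    # list your running jobs, keep only the job IDs, and pick the highest one
    local jobid
    jobid=$(squeue -u "$USER" -t RUNNING -h -o "%i" | sort -n | tail -1)
    [ -n "$jobid" ] || { echo "no running jobs" >&2; return 1; }
    # connect to that job and refresh nvidia-smi every second
    srun --jobid="$jobid" --pty watch -n 1 nvidia-smi
}
```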
nvtop
Next, write a bash function job-nvtop() that runs nvtop instead. This will show a graph of GPU’s compute and memory utilization vs. time inside the terminal.
For more detailed GPU profiling via a GUI, you can run NVIDIA Nsight Systems on one of the cluster’s GPU interactive nodes via JupyterLab – we will not cover it here.
Basic performance tuning techniques
Consecutive memory access
When running on a GPU, you want coalesced memory access: all threads in a warp should be accessing consecutive memory addresses. Let me show an example in which – instead of consecutive addresses – we use random indices stored in a permutation array on the same GPU.
First, consider this fast code:
from numba import cuda
import numpy as np
import math as m
from time import time
@cuda.jit
def squares(result):
    i = cuda.grid(1)
    if i < result.size: # if more threads than array elements, check boundary
        result[i] = m.log10(i+1)
# warm-up
devs = cuda.device_array(1, dtype=np.float64)
squares[1, 1](devs)
n = 100_000_000
devr = cuda.device_array(n, dtype=np.float64) # allocate memory on the GPU
# define thread hierarchy
threadsPerBlock = 256
blocksPerGrid = (n + (threadsPerBlock - 1)) // threadsPerBlock
# launch the kernel
start = time()
squares[blocksPerGrid, threadsPerBlock](devr)
end = time()
A = devr.copy_to_host() # copy the result back to the host
print(f"Squares array: {A}")
print("Time in μs:", round((end-start)*1e6,3))Running on Fir’s H100, I get these runtimes: 74.625μs 74.863μs 76.294μs.
Next, let’s randomize memory access, using an integer permutation array on the device:
diff fillNumbaGPU.py fillNumbaGPU-slow.py
< def squares(result):
---
> def squares(result, devp):
< result[i] = m.log10(i+1)
---
> result[devp[i]] = m.log10(i+1)
< squares[1, 1](devs)
---
> devp = np.zeros(1, dtype=np.int64)
> squares[1, 1](devs, devp)
> perm = np.random.permutation(np.arange(0, n))
> devp = cuda.to_device(perm)
< squares[blocksPerGrid, threadsPerBlock](devr)
---
> squares[blocksPerGrid, threadsPerBlock](devr, devp)
Now, on the same GPU, my runtimes jump to 332.117μs 338.316μs 317.812μs – this just shows the importance of consecutive memory access within a warp.
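The permuted-write pattern is easy to reason about on the CPU: the same values end up in the array, just via scattered addresses, which is why only the timing changes, not the result:

```python
import numpy as np
import math

n = 1_000
rng = np.random.default_rng(0)
perm = rng.permutation(n)  # random target index for each "thread"

seq = np.array([math.log10(i + 1) for i in range(n)])  # consecutive writes
scat = np.empty(n)
for i in range(n):
    scat[perm[i]] = math.log10(i + 1)  # scattered writes through the permutation

print(np.allclose(np.sort(seq), np.sort(scat)))  # True: same values, different order
```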
CPU ⇄ GPU data transfer
… is relatively slow.
- Don’t pass NumPy arrays directly to a kernel, as Numba will copy them to the GPU and back automatically on every call. Instead, use cuda.to_device() to keep data on the GPU between multiple kernel launches.
- Organize your data transfers and kernel launches into CUDA streams. This lets CUDA overlap data transfers with kernel execution, e.g. preparing the next data transfer while the current batch is being computed. Here is an example with two overlapping streams:
stream1 = cuda.stream()
stream2 = cuda.stream()
# define arrays a1, a2 on the host
d1 = cuda.to_device(a1, stream=stream1)
d2 = cuda.to_device(a2, stream=stream2)
kernel[blocks, threads, stream1](d1)
kernel[blocks, threads, stream2](d2)
r1 = d1.copy_to_host(stream=stream1)
r2 = d2.copy_to_host(stream=stream2)
stream1.synchronize()
stream2.synchronize()
Other best practices
- Always “warm up” your kernel.
- Use float32 math rather than float64 whenever possible.
- Minimize heavy if branching in the code (it leads to load imbalance), or at least write your code in such a way that each thread has roughly the same amount of computing to do.
- If using more threads than data elements, don’t forget to check for array boundaries.