If you’ve ever tuned a CUDA kernel, you know the dance: pick a block size, count registers per thread (or let nvcc tell you with --ptxas-options=-v), figure out how much shared memory you’re using, and then work out how many of those blocks can actually live on an SM at once. NVIDIA used to ship a spreadsheet for this - it was great, but a spreadsheet is exactly the friction I don’t want when I’m halfway through optimising a kernel and just want a quick “is the limiting factor registers or shared memory here?” answer.

The CUDA occupancy calculator on the Toolbelt is the same idea, just as a page you can pull up next to your editor. You pick a GPU, plug in block size, registers per thread, and shared memory per block, and it gives you the resident blocks per SM, the theoretical occupancy (resident warps as a fraction of the SM's maximum), and - the bit I actually use - which resource is the bottleneck. There’s also a sweep over candidate block sizes so you can see at a glance whether jumping from 128 to 256 threads buys you anything.
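The arithmetic behind that kind of calculator is simple enough to sketch. Here's a rough version in Python: the per-SM limits below are illustrative numbers loosely modelled on a recent NVIDIA SM, not any specific GPU (real hardware also rounds register and shared-memory allocations up to a granularity, which this sketch ignores, so treat its numbers as an upper bound).

```python
# Illustrative per-SM limits - assumptions, not any specific GPU.
REGS_PER_SM = 65536          # 32-bit registers per SM
SMEM_PER_SM = 100 * 1024     # bytes of shared memory per SM
MAX_THREADS_PER_SM = 2048
MAX_BLOCKS_PER_SM = 32
WARP_SIZE = 32

def occupancy(block_size, regs_per_thread, smem_per_block):
    """Return (resident blocks per SM, theoretical occupancy, bottleneck)."""
    # How many blocks fit under each resource limit.
    limits = {
        "threads": MAX_THREADS_PER_SM // block_size,
        "blocks": MAX_BLOCKS_PER_SM,
        "registers": REGS_PER_SM // (regs_per_thread * block_size)
            if regs_per_thread else MAX_BLOCKS_PER_SM,
        "shared memory": SMEM_PER_SM // smem_per_block
            if smem_per_block else MAX_BLOCKS_PER_SM,
    }
    resident = min(limits.values())
    bottleneck = min(limits, key=limits.get)
    warps = resident * block_size // WARP_SIZE
    occ = warps / (MAX_THREADS_PER_SM // WARP_SIZE)
    return resident, occ, bottleneck

# Single configuration: 256 threads/block, 64 regs/thread, 16 KiB smem/block.
print(occupancy(256, 64, 16 * 1024))

# The block-size sweep is just the same function in a loop.
for bs in (64, 128, 256, 512):
    print(bs, occupancy(bs, 64, 16 * 1024))
```

With these made-up limits, the 256-thread configuration comes out register-bound at 50% occupancy, and the sweep shows that dropping to 128 threads shifts the bottleneck to shared memory instead - exactly the kind of "which knob do I turn" answer the page is for.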

It’s not a replacement for actually profiling with Nsight Compute (nothing is), but it’s a really fast way to rule out the obvious “you’re register-bound, drop the block size” class of mistake before you spend an hour on a profile.