The question I get asked most often when someone wants to ship an LLM-powered feature is, basically, “okay but what’s this going to cost?” And the honest answer is it depends on a lot of things you haven’t decided yet: which model, what precision, how many tokens per request, how many requests per second at peak, whether you self-host or pay an API provider per token, and whether you can tolerate the cold-start of a serverless GPU. Most of those have order-of-magnitude effects, so a back-of-envelope number can be off by 10x in either direction.
The LLM inference cost estimator on the Toolbelt is my attempt at making that back-of-envelope a bit less hand-wavy. You describe the workload (model, precision, traffic profile) and it gives you an engineering estimate of how many GPUs you’d need and what the monthly bill looks like across a few common providers and self-hosted GPU options. The point isn’t to produce a number you’d staple to a contract; it’s to give you a defensible number you can use to decide whether to keep going, switch to a smaller model, or just call the OpenAI API and move on.
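To make the shape of that estimate concrete, here’s a minimal sketch of the kind of arithmetic involved. Everything here is an illustrative assumption of mine, not the tool’s actual reference data: the function name, the per-GPU throughput, and the hourly price are all made up for the example, and real throughput varies a lot with batch size, sequence length, and serving stack.

```python
import math

def estimate_self_hosted(peak_rps, tokens_per_request,
                         gpu_tokens_per_sec, gpu_hourly_usd,
                         hours_per_month=730):
    """Back-of-envelope GPU count and monthly cost at sustained peak load.

    All four inputs are assumptions the caller supplies; none of these
    defaults or figures come from the estimator's real dataset.
    """
    # Tokens the fleet must generate per second at peak.
    required_tokens_per_sec = peak_rps * tokens_per_request
    # Round up: you can't provision a fractional GPU.
    gpus = math.ceil(required_tokens_per_sec / gpu_tokens_per_sec)
    # Always-on pricing; serverless or scale-to-zero would differ.
    monthly_usd = gpus * gpu_hourly_usd * hours_per_month
    return gpus, monthly_usd

# Hypothetical workload: 20 req/s at peak, ~800 tokens per request,
# ~2500 tokens/s per GPU, $2.00/hr per GPU.
gpus, cost = estimate_self_hosted(20, 800, 2500, 2.00)
# → 7 GPUs, $10,220/month
```

Note how coarse this is: a 2x error in the assumed per-GPU throughput roughly doubles or halves the bill, which is exactly why the inputs matter more than the formula.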
All the reference data (GPU specs, provider prices, model presets) is shipped with the page and dated, so you can see exactly when the numbers were last refreshed. As with the rest of the Toolbelt, nothing about your workload leaves your browser.