Fixed-Cost LLM Inference: Options for Predictable AI Bills
Three real paths to a flat LLM bill: self-hosted open models on rented GPUs, provider capacity commitments, and subscription-backed Codex. Honest tradeoff table included.
Fixed-cost LLM inference exists in three real forms: self-hosted open models on rented GPUs, reserved-capacity commitments from providers, and subscription-backed capacity through Codex on a ChatGPT plan. All three buy capacity instead of tokens; they differ in ceiling, model quality, and ops burden. Discounts like batch and caching make a metered bill smaller, but only capacity purchases make it predictable.
Why per-token bills resist prediction
A metered bill is tokens x rate, and tokens scale with everything good: more users, more agent steps, more retries, longer contexts. The variance is the problem as much as the level; finance can absorb a known $3,000, not a $1,200-to-$5,000 range that moves with engagement. The decision framework for when metered pricing stops fitting is in per-token vs flat-rate LLM pricing. What follows is the survey of what to buy once you have decided you want a flat number.
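To make the variance concrete, here is a minimal sketch of the tokens x rate arithmetic. The function name and the quiet-month and busy-month token volumes are hypothetical, chosen only so the result spans the $1,200-to-$5,000 range mentioned above; the rates are the GPT-5 figures quoted later in this article.

```python
def metered_bill(input_tokens_m: float, output_tokens_m: float,
                 input_rate: float, output_rate: float) -> float:
    """Monthly metered cost in dollars. Token counts are in millions,
    rates are dollars per million tokens."""
    return input_tokens_m * input_rate + output_tokens_m * output_rate


# Hypothetical volumes: same product, same rates, different engagement.
quiet_month = metered_bill(400, 70, 1.25, 10.0)
busy_month = metered_bill(1800, 275, 1.25, 10.0)

print(f"quiet: ${quiet_month:,.0f}  busy: ${busy_month:,.0f}")
```

Nothing about the rate changed between the two months; the spread comes entirely from usage, which is the part finance cannot forecast.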
Path 1: self-hosted open models
Rent GPUs, serve open-weight models (Llama, Qwen, Mistral families), and the bill is the GPU invoice regardless of token count. As a mid-2026 estimate, dedicated H100s rent for roughly $2 to $3 per GPU-hour on commodity clouds, call it $1,500 to $2,200 a month per GPU before engineering time.
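The hourly-to-monthly conversion behind that range can be sketched as follows. The 730-hour month (8,760 hours divided by 12) is an assumption about how the figure was rounded, and the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # assumed average month: 8,760 hours / 12


def monthly_gpu_cost(rate_per_gpu_hour: float, gpus: int = 1) -> float:
    """Dollars per month for GPUs rented around the clock.
    Engineering time is not included."""
    return rate_per_gpu_hour * HOURS_PER_MONTH * gpus


low = monthly_gpu_cost(2.0)
high = monthly_gpu_cost(3.0)
print(f"${low:,.0f} to ${high:,.0f} per GPU per month")
```

At $2 and $3 per GPU-hour this gives $1,460 and $2,190, which rounds to the $1,500 to $2,200 estimate.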
The fixed cost is real; the catches are real too. Capacity is whatever throughput your hardware sustains, quality on hard tasks trails frontier closed models, and you own serving, scaling, evals, and 3 a.m. pages. Idle GPUs are the most expensive inference there is: the economics only work at high, steady utilization. Good fit: privacy-constrained workloads and high-volume mid-difficulty tasks a 70B-class model handles well.
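A rough way to see why idle GPUs are so expensive is to divide the fixed invoice by the tokens actually served. This is a sketch under stated assumptions: the 2,000 tokens-per-second peak throughput and the $1,825 monthly cost are hypothetical placeholders, not benchmarks, so substitute measured numbers for your own model and hardware.

```python
SECONDS_PER_MONTH = 730 * 3600  # assumed 730-hour month


def cost_per_million_tokens(monthly_gpu_cost: float,
                            peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective dollars per million tokens when the GPU bill is fixed
    and only a fraction of peak throughput is actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served_m = peak_tokens_per_sec * SECONDS_PER_MONTH * utilization / 1e6
    return monthly_gpu_cost / tokens_served_m


# Hypothetical inputs: $1,825/month GPU, 2,000 tokens/s at saturation.
for utilization in (1.0, 0.5, 0.1):
    cost = cost_per_million_tokens(1825, 2000, utilization)
    print(f"{utilization:>4.0%} utilized: ${cost:.2f} per million tokens")
```

The invoice stays the same in every row, so the unit cost scales inversely with utilization: a GPU that is 10% busy costs ten times as much per token as a saturated one.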
Path 2: provider capacity commitments
Cloud providers sell reserved throughput for frontier models: Azure OpenAI provisioned throughput and AWS Bedrock provisioned capacity are the canonical examples. You pay a fixed monthly rate for guaranteed capacity, with frontier quality and someone else’s pagers.
The catch is the entry price. Reserved capacity is sold in units sized for enterprises; commitments typically start in the four-to-five-figure monthly range with contract terms attached. Below that scale, the option effectively does not exist for you. Good fit: enterprises with steady, high, latency-sensitive load and a procurement team.
Path 3: subscription-backed Codex
ChatGPT plans bill flat and include Codex. Codex Hosted is our packaging of that fact: we run OpenAI’s official, unmodified Codex CLI on managed servers, signed into your own ChatGPT account through OpenAI’s device-code flow, and expose it as an OpenAI-compatible endpoint. Bulk workloads bill to the subscription instead of the meter; your API key stays connected as the overflow lane.
The numbers: plan price plus our $129 flat fee, no inference markup. As planning estimates, Plus ($20) absorbs roughly $700 of API-equivalent work a month, Pro 5x ($100) roughly $3,500, Pro 20x ($200) roughly $14,000. Estimates, not guarantees: capacity arrives as usage windows that OpenAI tunes over time. The Codex lane returns complete responses rather than streams, and the model surface is what Codex serves. Programmatic Codex use is documented OpenAI functionality, but OpenAI has the final call on its accounts. The wider landscape of flat OpenAI access, including resellers and DIY proxies, is mapped in flat-rate OpenAI API: does it exist?, and the mechanics are in what is Codex Hosted.
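As a planning aid, the figures above can be put into a small lookup. This is a sketch, not a sizing tool: the capacity numbers are the article's estimates, and the function names and selection rule (pick the cheapest plan whose estimate covers current metered spend) are illustrative assumptions.

```python
SERVICE_FEE = 129  # flat monthly fee, in dollars

# plan name -> (plan price, estimated API-equivalent capacity), dollars per month
PLANS = {
    "Plus": (20, 700),
    "Pro 5x": (100, 3500),
    "Pro 20x": (200, 14000),
}


def all_in_cost(plan: str) -> int:
    """Plan price plus the flat fee."""
    return PLANS[plan][0] + SERVICE_FEE


def cheapest_plan_for(metered_spend: float):
    """Cheapest plan whose capacity estimate covers the given metered spend,
    or None if even the largest estimate falls short."""
    candidates = [name for name, (_, capacity) in PLANS.items()
                  if capacity >= metered_spend]
    return min(candidates, key=all_in_cost) if candidates else None


for spend in (500, 2750, 20000):
    plan = cheapest_plan_for(spend)
    label = f"{plan} at ${all_in_cost(plan)}/month" if plan else "no single plan covers it"
    print(f"${spend:,} metered -> {label}")
```

Because the capacities are estimates, it is safer to treat a result near a plan's limit as a reason to expect overflow than as a guarantee of fit.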
The tradeoff table
| | Self-hosted open models | Provider commitments | Subscription-backed Codex |
|---|---|---|---|
| Monthly cost | ~$1,500-$2,200 per GPU (est.) | Contract; often $10k+ | $149-$329 all-in |
| Capacity ceiling | Your GPUs’ throughput | Reserved throughput | Plan windows (estimates) |
| Model quality | Open weights, below frontier | Frontier | What Codex serves (frontier) |
| Ops burden | High: serving, scaling, evals | Low to medium | Low: managed, logs included |
| Best fit | Privacy, steady mid-tier volume | Enterprise scale | OpenAI-centric bulk work |
| Main risk | Paying for idle capacity | Minimum commitments | Window limits; provider discretion |
One workload, three ways
A production workload of 1B input and 150M output tokens a month on GPT-5 ($1.25/$10 per million, June 2026, live prices at openai.com/api/pricing):
- Metered: 1,000 x $1.25 + 150 x $10 = $1,250 + $1,500 = $2,750/month
- Self-hosted: one to two H100 nodes ($1,800 to $4,000 estimated) likely sustain the throughput with an open model, if the quality gap is acceptable for the task and someone owns the serving stack.
- Commitment: entry minimums usually start above what a $2,750 workload justifies.
- Subscription-backed: Pro 5x ($100) plus the $129 ProxyLLM flat fee is $229 against a roughly $3,500 capacity estimate, with the API key catching overflow. A $2,750 metered workload fits inside a $229 subscription-backed setup, as an estimate rather than a guarantee.
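The subscription-backed comparison, including what happens when the workload outgrows the capacity estimate, can be sketched like this. The model is a simplifying assumption: it treats capacity as a monthly dollar figure and bills any excess dollar-for-dollar at metered rates, whereas real capacity arrives as usage windows.

```python
def metered_cost(input_tokens_m: float, output_tokens_m: float,
                 input_rate: float = 1.25, output_rate: float = 10.0) -> float:
    """Metered dollars per month; defaults are the GPT-5 rates quoted above."""
    return input_tokens_m * input_rate + output_tokens_m * output_rate


def blended_cost(metered_equivalent: float, flat_cost: float,
                 capacity_estimate: float) -> float:
    """Flat fee plus any overflow beyond the capacity estimate,
    with overflow assumed to be billed at metered rates."""
    overflow = max(0.0, metered_equivalent - capacity_estimate)
    return flat_cost + overflow


workload = metered_cost(1000, 150)  # 1B input, 150M output tokens
print(f"metered:            ${workload:,.0f}")
print(f"subscription lane:  ${blended_cost(workload, 229, 3500):,.0f}")
print(f"if usage hit $5,000: ${blended_cost(5000, 229, 3500):,.0f}")
```

Under these assumptions the $2,750 workload costs $229, and a month that grows to $5,000 of API-equivalent work costs $229 plus $1,500 of overflow.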
What fixed cost does not buy
A ceiling is the price of a floor. Fixed-cost paths are bounded: GPU throughput, reserved units, or usage windows. Honest designs keep a metered lane as the safety valve and watch which lane serves what, which is exactly what per-request logs are for. Note also that the popular “50% off” lever, the Batch API, is a discount on metered billing rather than a fixed cost; where it fits is covered in the Batch API discount explained.
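The two-lane idea can be illustrated with a toy router. Everything here is hypothetical: the class, the lane names, and the notion of a dollar-denominated window budget are stand-ins for whatever limit your fixed-cost path actually enforces, and a real system would react to the provider's own limit signals rather than a local counter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LaneRouter:
    """Sends requests to the flat lane until its budget is used up,
    then to the metered lane, logging every decision."""
    window_budget: float          # hypothetical capacity of the flat lane
    used: float = 0.0
    log: list = field(default_factory=list)

    def route(self, request_id: str, estimated_cost: float) -> str:
        if self.used + estimated_cost <= self.window_budget:
            self.used += estimated_cost
            lane = "flat"
        else:
            lane = "metered"
        self.log.append((request_id, lane, estimated_cost))
        return lane

    def spend_by_lane(self) -> dict:
        """Sum of estimated cost per lane, taken from the per-request log."""
        totals = Counter()
        for _, lane, cost in self.log:
            totals[lane] += cost
        return dict(totals)


router = LaneRouter(window_budget=10.0)
for request_id, cost in [("a", 6.0), ("b", 6.0), ("c", 4.0)]:
    print(request_id, router.route(request_id, cost))
print(router.spend_by_lane())
```

The log is the point of the exercise: it shows how much work the metered lane is absorbing, which tells you whether the fixed-cost ceiling is sized correctly.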
If your bill is mostly OpenAI-shaped, the calculator prices your current metered spend against the subscription-backed setup in about thirty seconds.
Frequently asked questions
Is there a way to get a fixed monthly cost for LLM inference?
Yes, three real paths: self-host open models on rented GPUs (fixed at the GPU bill, roughly $1,500 to $2,200 a month per H100 as a mid-2026 estimate), buy reserved capacity from a provider (fixed by contract, typically four to five figures monthly), or run workloads against a flat ChatGPT plan through Codex, at $149 to $329 a month all-in.
Is self-hosting an LLM cheaper than the OpenAI API?
Only at high, steady utilization. A rented H100 costs about the same whether it serves one request or saturates, so self-hosting wins when you keep open-weight models busy around the clock and the quality gap versus frontier models does not hurt your task. Idle GPUs are the most expensive inference there is.
What is subscription-backed inference?
Running workloads against a flat-priced ChatGPT plan instead of metered API billing. ChatGPT plans include Codex, which runs programmatically; Codex Hosted exposes that as an OpenAI-compatible endpoint using your own account, so bulk work bills to the subscription. Capacity comes as plan usage windows, and the API-equivalent figures are estimates.
Are flat LLM plans really unlimited?
No. Every fixed-cost path has a ceiling: GPU throughput for self-hosting, reserved capacity for commitments, and usage windows for subscription plans. Fixed cost means bounded cost and bounded capacity; honest designs keep a metered lane as overflow for the days the workload exceeds the ceiling.