Cerebras for latency-sensitive work.
Cerebras inference is fast on paper. Put cerebras/llama-3.1-70b behind the same endpoint as your other models, on your own key, and check whether the speed holds for your real prompts.
$129/month SaaS. Bring your own model keys. No inference markup.
Three steps to connect.
Pick the Cerebras speed lane
Cerebras sells raw inference speed. ProxyLLM routes requests to Cerebras-backed models through providers that expose them, using your own provider key. Native Cerebras key storage is not available yet.
One client surface
Send chat completions to https://api.proxyllm.ai/v1 with your ProxyLLM key, using the same request shape you use for every other model.
Measure latency honestly
Headline tokens-per-second is not end-to-end latency. Read real timings in ProxyLLM request logs before you move a workflow to Cerebras.
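A rough way to see the gap: end-to-end latency is approximately time to first token plus decode time. The sketch below is plain arithmetic on made-up numbers, not measurements from Cerebras or ProxyLLM. The `Timing` shape and `endToEndMs` are hypothetical names for illustration, not a ProxyLLM log schema.

```typescript
// Hypothetical timing record for one request. Field names are illustrative
// and are not ProxyLLM's log schema.
interface Timing {
  ttftMs: number;          // time to first token: network, queueing, prompt processing
  outputTokens: number;    // tokens generated in the response
  tokensPerSecond: number; // decode throughput, the "headline" number
}

// Approximate end-to-end latency: wait for the first token, then decode the output.
function endToEndMs(t: Timing): number {
  return t.ttftMs + (t.outputTokens * 1000) / t.tokensPerSecond;
}

// Invented numbers for a short, 20-token answer.
const fastDecode: Timing = { ttftMs: 600, outputTokens: 20, tokensPerSecond: 2000 };
const slowDecode: Timing = { ttftMs: 300, outputTokens: 20, tokensPerSecond: 100 };

console.log(endToEndMs(fastDecode)); // 610
console.log(endToEndMs(slowDecode)); // 500
```

With these invented inputs, the lane with 20x the decode speed still finishes later, because time to first token dominates a short response. Longer outputs shift the balance toward throughput, which is why your own timings matter more than the headline figure.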
Fast inference, measured.
Call cerebras/ models where your configured provider exposes them. Your key, your logs, no markup.
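import OpenAI from "openai";

// Point the OpenAI SDK at the ProxyLLM endpoint and authenticate with your ProxyLLM key.
const client = new OpenAI({
  baseURL: "https://api.proxyllm.ai/v1",
  apiKey: "pk_live_...",
});

// Same chat-completions request shape as any other model; only the model id changes.
const r = await client.chat.completions.create({
  model: "cerebras/llama-3.1-70b",
  messages: [{ role: "user", content: "Score these leads from 1 to 5." }],
});

Run your AI workloads on your ChatGPT subscription.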
ProxyLLM runs OpenAI's Codex for you, signed in with your own ChatGPT account. Your apps call one OpenAI-compatible endpoint and the work bills to your flat plan instead of per-token API pricing.
Choose speed with logs.
ProxyLLM records latency, tokens, and failures on every call, so the fast lane earns its place with production data. $129/month flat.
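One way to turn logged calls into a decision is to compare median latency, tail latency, and failure rate. This is a minimal sketch, assuming you have exported log rows into an array. The `LogRow` shape, `percentile`, and `summarize` are hypothetical helpers, not part of ProxyLLM, and the sample values are invented.

```typescript
// Hypothetical shape of an exported log row; not ProxyLLM's actual schema.
interface LogRow {
  latencyMs: number;
  failed: boolean;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  // Clamp so p = 0 and p = 100 stay inside the array.
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Latency percentiles over successful calls, failure rate over all calls.
function summarize(rows: LogRow[]) {
  const ok = rows.filter((r) => !r.failed).map((r) => r.latencyMs);
  return {
    p50: percentile(ok, 50),
    p95: percentile(ok, 95),
    failureRate: rows.filter((r) => r.failed).length / rows.length,
  };
}

// Invented sample: nine successful calls and one failure.
const rows: LogRow[] = [120, 80, 200, 95, 400, 110, 90, 105, 100].map(
  (latencyMs) => ({ latencyMs, failed: false }),
);
rows.push({ latencyMs: 30000, failed: true });

console.log(summarize(rows)); // { p50: 105, p95: 400, failureRate: 0.1 }
```

Run the same summary for each candidate model on the same prompts. A lane with a better median but a worse p95 or failure rate may not be the faster choice in practice.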