An inference-infrastructure benchmarking suite built on top of a reproducible text-to-SQL workload (originally a GitLab Agentic BI CLI POC). 15 LLMs across two serverless inference platforms — Fireworks (5 models) and Baseten (10 models) — measured on accuracy, TTFT, total latency, tokens, and cost-per-query. Every number on this page is reproducible from the repo.
Inference infrastructure choices are usually argued on price-per-token and marketing claims. This dashboard makes them empirical: a fixed 25-question gold set (tolerant multiset comparison) is run against every Model API LLM on Fireworks and Baseten using the same agent prompt, the same schema, and identical sampling. Outputs: which model wins on accuracy, which on TTFT, which on $/query, where the Pareto frontier sits, and where a model is cheap on paper but slow or unreliable in practice. The text-to-SQL workload is the harness; the conclusions are about platforms and inference.
Click any node to see what we chose, the alternatives considered, and the trade-off we accepted.
$ python -m src.cli · run locally — see README for setup.
Same gold set, same agent prompt, same sampling settings, run against every Model API LLM on Fireworks and Baseten. Click a column header to sort. Filter by platform to isolate one provider.
| Model | Platform | Accuracy | JSON OK | TTFT p50 | Lat p50 | Lat p90 | In tok | Out tok | $/query |
|---|---|---|---|---|---|---|---|---|---|
Protocol: one-shot per question, no agent repair loop, temperature=0,
max_tokens=1024, streaming (TTFT = time to first delta chunk).
Fireworks numbers from data/fireworks_sweep_v2.json and Baseten
numbers from data/baseten_sweep.json — both the full
25-question gold set, same streaming protocol, same scoring.
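The TTFT definition above (time to the first delta chunk of a streamed response) can be sketched as follows. This is a minimal illustration, not the repo's sweep code: `time_stream` and its injectable `clock` parameter are hypothetical names, and it assumes the caller passes a lazy iterator that only sends the request when iteration begins, so the timer starts before the request does.

```python
import time
from typing import Callable, Iterable


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume streamed text deltas; record TTFT and total latency in ms."""
    start = clock()
    ttft_ms = None
    parts = []
    for delta in chunks:
        # TTFT is stamped on the first non-empty delta, not on stream open.
        if ttft_ms is None and delta:
            ttft_ms = (clock() - start) * 1000.0
        parts.append(delta)
    total_ms = (clock() - start) * 1000.0
    return {"ttft_ms": ttft_ms, "total_ms": total_ms, "text": "".join(parts)}
```

Skipping empty deltas matters because some streaming APIs emit a role-only or keep-alive chunk before any content arrives; whether the platforms measured here do so is not stated in the source.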
Top-left = high accuracy at low cost. Hover for model details.
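For readers who want the "top-left" idea in code, here is a rough sketch of a Pareto filter over (accuracy, cost-per-query). The `Result` class and `pareto_frontier` function are illustrative names, not taken from the repo. The first two rows use figures quoted elsewhere on this page; the third row is a made-up model included only to show a dominated point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    accuracy: float        # fraction of gold questions passed
    cost_per_query: float  # USD


def pareto_frontier(results):
    """Keep models that no other model beats on accuracy and cost at once."""
    def dominated(r):
        return any(
            o is not r
            and o.accuracy >= r.accuracy
            and o.cost_per_query <= r.cost_per_query
            and (o.accuracy > r.accuracy or o.cost_per_query < r.cost_per_query)
            for o in results
        )
    return [r for r in results if not dominated(r)]


results = [
    Result("gpt-oss-120b", 0.96, 0.000187),
    Result("Nemotron-120B-A12B", 0.44, 0.000121),
    Result("hypothetical-model", 0.90, 0.000300),  # invented for illustration
]
for r in pareto_frontier(results):
    print(r.model, r.accuracy, r.cost_per_query)
```

A model stays on the frontier if it is the cheapest at its accuracy level, which is why a low-accuracy but very cheap model can still appear there.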
Same workload, same agent, same gold set, same scoring — both platforms swept on the full 25 questions with streaming and TTFT capture. Anything that differs below is platform-side: inference engine, model selection, pricing, networking, autoscaling behavior.
"Best" = highest accuracy, tie-broken by lower cost-per-query. The winner gets the apples-to-apples comparison row.
data/fireworks_sweep_v2.json.data/baseten_sweep.json.temperature=0, max_tokens=1024, the same compact+trim schema injection, scored by the same tolerant rows_match. Same 25 questions on both sides — a true head-to-head.404 NOT_FOUND on the Fireworks account and can't be benchmarked — see probed_unavailable in the data file.Seven quick questions an inference-infrastructure engineer would ask of the matrix data, answered with numbers from the sweep — not vibes. Then four deeper sections on fine-tuning economics, serverless-vs-dedicated tradeoffs, quantization, and how the platform actually behaves under concurrent load.
The first instinct on inference cost is "fine-tune the cheapest open-source model to match the expensive one." Sometimes that's right. Often it's not. Here's the framework, grounded in this sweep's numbers.
In this sweep, gpt-oss-120b hits 96% accuracy at $0.000187/query. The cheapest candidate you'd consider fine-tuning (Nemotron-120B-A12B at $0.000121/q) sits at 44% accuracy out of the box — a 52-point gap. To close that gap with SFT against a 25-question workload, you'd need labeled training data (call it 500–2000 examples for a "first SFT pass that won't ruin the model"), plus 2–3 rounds of iteration, plus an eval harness. Realistic minimum: 40–80 engineer-hours + a few hundred dollars of compute.
Cost-per-query difference between the two: $0.000066. At GitLab's projected 30k q/day, that's $1.98 saved per day — about $59/month. Engineer time to fine-tune amortizes at ~5–8 years. Don't fine-tune.
Pick the better model when (a) the accuracy gap is >5 points and (b) the cost gap is <3× and (c) you can ship today. Fine-tuning is a months-long commitment with maintenance overhead (re-tune on base-model updates, drift monitoring, eval-set rot). Fine-tune when the cost gap is >10× and you can articulate a verifiable signal (executor feedback, click-through, rubric) to drive RFT.
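The arithmetic and thresholds above can be written down as a small sketch. The function names are illustrative, and the hourly engineering rate is an assumption that the source does not state; the payback period scales linearly with it. Cases that match neither rule are returned as a judgment call rather than forced into one branch.

```python
def payback_years(engineer_hours, hourly_rate_usd,
                  saving_per_query_usd, queries_per_day):
    """Years of per-query savings needed to repay a one-off fine-tuning effort."""
    one_off_cost = engineer_hours * hourly_rate_usd
    yearly_saving = saving_per_query_usd * queries_per_day * 365
    return one_off_cost / yearly_saving


def recommend(accuracy_gap_points, cost_ratio,
              can_ship_today, has_verifiable_signal):
    """Encode the two rules of thumb from the text; everything else is unresolved."""
    if accuracy_gap_points > 5 and cost_ratio < 3 and can_ship_today:
        return "pick the better model"
    if cost_ratio > 10 and has_verifiable_signal:
        return "consider fine-tuning"
    return "judgment call"


saving = 0.000187 - 0.000121           # USD/query, from the sweep figures above
daily_saving = saving * 30_000         # projected 30k queries/day
ASSUMED_RATE = 100.0                   # USD/hour, an assumption for illustration
print(round(daily_saving, 2))
print(payback_years(40, ASSUMED_RATE, saving, 30_000))
print(recommend(96 - 44, 0.000187 / 0.000121, True, True))
```

With any plausible hourly rate the payback period lands in the multi-year range, which is the point of the comparison.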
In this benchmark, the answer is "pick gpt-oss-120b." If, in some future month, DeepSeek-V5 ships at $0.005/q with 98% accuracy while gpt-oss stays at 96% and $0.0002/q, the framework flips, but only because the cheap model is already within 2 points of the expensive one.
Every model in this matrix is serverless — shared multi-tenant inference, per-token billing, no GPU allocation on your side. Baseten (and Fireworks) also sell dedicated deployments billed per GPU-minute. The choice is workload-shape-dependent, not platform-dependent.
The single-call sweep on the Model Sweep tab shows median behavior. The Latency under load section below probes the next layer — does the platform actually parallelize, or does the apparent low latency at concurrency=1 collapse at concurrency=10? That's the question dedicated vs serverless turns on, and it's why we ran the concurrency benchmark.
Phase 3 deferred: the original brief included deploying a smaller model via Truss as a dedicated deployment to measure the crossover empirically. Deferred to stay within the serverless-only budget — but the breakeven math above frames where you'd start looking.
Every Baseten Model API LLM ships quantized. That's not a cost-cutting hack — it's why these models are fast enough to be priced per-token at all. Here's what the numbers mean and how they map onto the latencies we measured.
FP16: 2 bytes per weight. A 120B-parameter model needs ~240 GB of GPU memory just to hold the weights, which is 3× H100 (80 GB each) minimum. Highest fidelity to the model's training distribution. Rarely served at scale on cost-sensitive inference platforms because the GPU economics don't work outside research.
FP8: 1 byte per weight, ~½ the memory of FP16. A 120B model fits in ~120 GB → still 2× H100 minimum, but TFLOPS roughly double on Hopper-class GPUs. Typical quality regression vs FP16 is 0–1.5% on most benchmarks. Baseten serves DeepSeek-V3.1 and MiniMax-M2.5 at FP8.
FP4: ½ byte per weight, ~¼ the memory of FP16. A 120B model fits in ~60 GB → comfortably on a single H100 or H200. Throughput on Blackwell-class hardware is roughly 2× FP8 again. Quality regression is typically 1–5% and very model-dependent. 8 of 10 Baseten models in this matrix are FP4.
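The weight-memory figures above follow from simple arithmetic, sketched here. The helper names are illustrative. The calculation covers weights only; KV cache, activations, and runtime overhead need additional memory, so real deployments need headroom beyond these minimums.

```python
import math

BYTES_PER_WEIGHT = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weight_memory_gb(params_billions, precision):
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billions * BYTES_PER_WEIGHT[precision]


def min_gpus_for_weights(params_billions, precision, gpu_memory_gb=80.0):
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(params_billions, precision) / gpu_memory_gb)


for precision in ("FP16", "FP8", "FP4"):
    print(precision,
          weight_memory_gb(120, precision),
          min_gpus_for_weights(120, precision))
```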
For the platform: more concurrent users per GPU, cheaper per-token economics, the ability to price gpt-oss-120b at $0.10/M input tokens at all.
For you: fast prefill (the schema-stuffed system prompt ingests faster), high per-GPU throughput (which is what makes the concurrency curves below shaped like they are), and the assumption that the model's accuracy ranking on your workload still holds after the quantization. Verify the last assumption with an eval — which is what this whole dashboard is.
The top performer (gpt-oss-120b · FP4 · 96% acc · 228ms TTFT) beats the second (DeepSeek-V3.1 · FP8 · 92% acc · 343ms TTFT) on both axes despite running at the more aggressive quantization. That's the right answer for the wrong-feeling reason: FP4 doesn't kneecap well-trained models, and the model-architecture / training-data delta dominates the quantization-precision delta on this workload. The implication for model selection is: don't filter by quantization tier. Filter by accuracy and TTFT on your own workload; the quantization is upstream of both.
Quantization data sourced from Baseten's /v1/models
endpoint at sweep time (data/baseten_models_meta.json).
Single-call latency hides queue depth. We fired the full 25-question gold set at the top three Baseten models (gpt-oss-120b, DeepSeek-V3.1, Kimi-K2.6) at concurrency 1, 5, and 10 — same agent, full repair loop, ThreadPoolExecutor. How does P50 / P90 / P99 move?
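A minimal sketch of this kind of concurrency probe, assuming a blocking `call(question)` function that runs one full agent turn. `run_at_concurrency` and `percentile` are illustrative names, not the repo's benchmark script, and the nearest-rank percentile is one of several common definitions.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_at_concurrency(call, questions, concurrency):
    """Fire every question through a pool of `concurrency` worker threads."""
    def timed(question):
        t0 = time.perf_counter()
        call(question)  # blocking request; queueing shows up as added latency
        return (time.perf_counter() - t0) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed, questions))
    return {p: percentile(latencies, p) for p in (50, 90, 99)}
```

With only 25 samples per run, the nearest-rank P99 is simply the slowest call, so it is best read as a worst-case marker rather than a stable tail estimate.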
The customer's current prototype prompt
(Convert this question to SQL: {question} — no
schema, no JSON mode, no execute-and-repair) on the same Kimi K2.6
endpoint scored 2/10 on the dev set. The agent on the
same model scored —.
The lift comes from prompt engineering, not a model swap.
Click any card to load the question into the unified comparison tool on the Live Demo tab — pick models and prompt strategies, then run.
A real python scripts/agent_eval.py run from this repo — the full agent (compact schema, JSON mode, execute-and-repair) scored against the 25-question gold set, written to data/agent_eval.json. Rendered straight from that file.
Read column-by-column: qid · set · tier · pass/fail · agent latency in ms · repair turns used · diagnostic note.
Every question the agent missed in the run above — with the SQL it actually generated.
Fires 3 sequential calls to Fireworks (Kimi K2.6) against the same dev question and reports the median you see right now. Calls also stream into the Live Demo activity log.
Best clean dev-eval run hit median 2,901 ms separately (no pacing). Schema compaction (935 → 665 mean prompt tokens, −29%) is what made that possible — same accuracy.
| Run | P50 | P90 | Accuracy | Note |
|---|---|---|---|---|
| 1 | 9,067 ms | 22,065 ms | 10/10 | shared tier under load |
| 2 | 8,297 ms | 39,673 ms | 6/10 | three queries hit 429s and never recovered |
| 3 | 4,058 ms | 17,935 ms | 10/10 | quieter shared tier · cleanest run |
All three perf JSONs are checked into the repo (perf_compact_{1,2,3}.json). Median p50 across runs: 8,297 ms. Best clean p50: 4,058 ms.
Most of the per-call latency on shared serverless is prefill, so the prompt-token count is the highest-leverage knob short of moving off shared. Two compounding wins:
Render the schema as Album(AlbumId:integer[pk], Title:nvarchar, ArtistId:integer[fk→Artist.ArtistId]) instead of CREATE TABLE blocks. 62% schema-char reduction. No accuracy regression on the dev set.
Keyword-match table names against the question, expand via foreign keys to keep JOINs valid. Fires on 9/25 questions today (24.7% mean reduction across all 25, 68.5% on the 9 fired). On q_002 ("AC/DC albums"): 89.8% reduction.
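The two techniques above can be sketched together. This is an illustration, not the repo's `_format_schema_compact`: the schema dictionary format, the function names, the naive plural matching, and the fall-back to the full schema when nothing matches are all assumptions.

```python
import re

# Hypothetical schema format: table -> [(column, type, tag)], where tag is
# "pk", ("fk", "Table.Column"), or None.
SCHEMA = {
    "Artist": [("ArtistId", "integer", "pk"), ("Name", "nvarchar", None)],
    "Album": [("AlbumId", "integer", "pk"), ("Title", "nvarchar", None),
              ("ArtistId", "integer", ("fk", "Artist.ArtistId"))],
    "Genre": [("GenreId", "integer", "pk"), ("Name", "nvarchar", None)],
}


def render_table(name, columns):
    """Render one table in the compact Name(col:type[tag], ...) form."""
    parts = []
    for col, typ, tag in columns:
        if tag == "pk":
            parts.append(f"{col}:{typ}[pk]")
        elif tag:
            parts.append(f"{col}:{typ}[fk→{tag[1]}]")
        else:
            parts.append(f"{col}:{typ}")
    return f"{name}({', '.join(parts)})"


def fk_targets(columns):
    """Names of the tables this table's foreign keys point at."""
    return {tag[1].split(".")[0] for _, _, tag in columns if isinstance(tag, tuple)}


def trim_tables(question, schema):
    """Keep keyword-matched tables plus everything reachable through their FKs."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    keep = {t for t in schema if t.lower() in words or t.lower() + "s" in words}
    if not keep:
        return set(schema)  # assumption: no match means no trimming
    frontier = list(keep)
    while frontier:
        for target in fk_targets(schema[frontier.pop()]):
            if target in schema and target not in keep:
                keep.add(target)
                frontier.append(target)
    return keep


kept = trim_tables("AC/DC albums", SCHEMA)
print("\n".join(render_table(t, SCHEMA[t]) for t in sorted(kept)))
```

Following foreign keys outward is what keeps JOINs valid: a question that names only albums still needs the Artist table to filter by artist name.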
One slider, every benchmarked model on both platforms. Cost-per-day is computed from actually-measured token usage during the sweep × published per-token rates. Build-vs-Buy context for the GitLab POC is the section below.
Models with errored sweeps or unknown pricing are hidden. Sorted cheapest → most expensive. Highlighted row = best accuracy at this volume tier.
| Model | Platform | Accuracy | $/query | $/day | $/month | $/year | vs GPT-5.4 |
|---|---|---|---|---|---|---|---|
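The cost columns come from measured token counts multiplied by per-token rates, then scaled by query volume. A minimal sketch under stated assumptions: the function names are illustrative, the output rate in the example is a placeholder rather than a published price, and a month is taken as 30 days.

```python
def cost_per_query(in_tokens, out_tokens, in_rate_per_m, out_rate_per_m):
    """USD per query from token counts and USD-per-million-token rates."""
    return (in_tokens * in_rate_per_m + out_tokens * out_rate_per_m) / 1_000_000


def projections(per_query_usd, queries_per_day):
    """Scale a per-query cost to day, month (30 days), and year (365 days)."""
    per_day = per_query_usd * queries_per_day
    return {"day": per_day, "month": per_day * 30, "year": per_day * 365}


# 665 in / 78 out is the benchmark token profile quoted on this page.
# The 0.50 output rate is a placeholder, not a real price.
example = cost_per_query(665, 78, in_rate_per_m=0.10, out_rate_per_m=0.50)
print(example, projections(example, 30_000))
```

Because the placeholder output rate and this token profile are not the ones behind any specific row, the printed value is not expected to match the table.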
Originally the cost story for the customer email — same slider drives this section. Kept for context on the four-way choice (proprietary / direct API / self-host / managed platform).
All four columns recompute when you drag the slider. If you've made runs in the Live Demo, the token profile switches from the benchmark default to the average of your own runs — apples-to-apples across all four options.
Based on benchmark data (665 in / 78 out)
No infra to manage. Highest cost, proprietary lock-in, 7s latency reported by customer.
Per-token pricing is often cheaper going direct to the model provider — typically 20–40% cheaper than Fireworks. But for an enterprise deployment like GitLab's, the total cost includes more than tokens:
Per-token rates sourced from artificialanalysis.ai/models ↗
Full control at extreme scale. Requires ML infra team, GPU procurement, autoscaling, monitoring, model updates.
Higher per-token cost than some direct model-provider APIs, because you're paying for production infrastructure, not just inference: optimized inference (0.71s TTFT, fastest tracked provider), a fine-tuning pipeline, region pinning, and platform convenience.
Includes: FireAttention + speculative decoding, 18+ regions across 8 cloud providers, US/EU/APAC region pinning, BYOC option, single vendor for all open-source models, fine-tuning (SFT / RFT / DPO) at base serving price, on-demand tier for latency SLOs, Enterprise support with dedicated account team.
At extreme scale (1M+ queries/day), self-hosting eventually wins on pure compute cost — but doesn't account for the engineering team needed to run it. At the projected first cohort volume (30k queries/day) and even at 10× growth, Fireworks is the clear choice: zero ops overhead, instant model swapping, and fine-tuning without a training cluster.
Each phase maps to a concrete Fireworks platform feature. The POC is Phase 0.
Prompt engineering can only carry text-to-SQL so far. The two failure modes we hit (top-per-group, date-arithmetic ambiguity) are exactly the shape that fine-tuning fixes — the model needs to be taught the output discipline of the dialect, not more facts.
Generate ~1k synthetic (question, gold_sql) pairs across diverse schemas (synthetic schema generators are cheap). Fine-tune Qwen3-8B or DeepSeek-Coder-V2-Lite to internalize: no surrogate keys in the output, idiomatic LIMIT, correct GROUP BY shape, full-name concatenation. This is what the system prompt is currently bandaging.
SFT is the only stage that can teach a new format from scratch.
Text-to-SQL is the canonical RFT use case. The reward is mechanically verifiable — did the SQL run, and did its result-set match the gold answer? No human labels required. Every gold question we already have is a training example. Fireworks supports RFT directly.
Why this catches s_014: Top-per-group failure is silent. The SQL runs; it just returns the wrong rows. RFT punishes exactly that via the executor reward, which prompt engineering can't.
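The executor reward described above can be sketched with the standard-library `sqlite3` module. This is an illustration, not the repo's `rows_match`: the function names are hypothetical, and the comparison here is a strict multiset match on result rows, whereas the repo's scorer is described as tolerant, with rules the source does not spell out.

```python
import sqlite3
from collections import Counter


def run_sql(conn, sql):
    """Return result rows, or None if the statement fails to execute."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def executor_reward(conn, candidate_sql, gold_sql):
    """1.0 if the candidate runs and returns the same rows as gold, else 0.0."""
    gold = run_sql(conn, gold_sql)
    if gold is None:
        raise ValueError("gold SQL must execute")
    got = run_sql(conn, candidate_sql)
    if got is None:
        return 0.0  # the SQL did not run
    # Multiset comparison: row order is ignored, duplicate rows still count.
    return 1.0 if Counter(got) == Counter(gold) else 0.0
```

A query that executes but returns the wrong rows scores the same as one that fails to run, which is what lets this signal penalize silent top-per-group errors.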
Direct Preference Optimization on production replay pairs: the working query a user accepted vs the rejected/repaired query they got first. Closes residual style drift after RFT plateaus, using data we'd already be collecting in production.
Why last: DPO needs preference pairs at scale, and production gives us that for free.
service_tier="priority" is silently accepted but does nothing. Fireworks priority is account-provisioned, not a request-time flag.
format_answer_summary is heuristic. Renders rows as <entity> (col=val, col=val). Values are exact; wording may not match the gold set's phrasing. Eval correctness uses rows_match against the SQL result, not the answer field.
perf.py's PRICING_USD_PER_M dict; verify before publishing any cheap-model cost number.
HISTORY_TURN_LIMIT.
The Chinook database runs in a WebAssembly SQLite instance in your browser; the API key you paste below is used directly from this page to call the selected inference provider. Defaults to Baseten (10-model catalog, including the winners from the Model Sweep tab); toggle to Fireworks for the original take-home flow. Type any natural-language question and watch a baseline prompt and the agent prompt run side-by-side.
Pick any two combinations of model and prompt strategy and run them side-by-side on any of the 25 gold questions or your own free-form input. Defaults to Customer Baseline vs Agent on Kimi K2.6 — change either side to compare anything.
This is the compact schema sent in the system prompt — same as _format_schema_compact in src/agent.py. Trim happens per-question: keyword-matched table names + their FK closure.