Inference Benchmarking Suite · Text-to-SQL workload · Fireworks + Baseten
Overview

Same 25 queries. Same database.
Different models, different platforms.

An inference-infrastructure benchmarking suite built on top of a reproducible text-to-SQL workload (originally a GitLab Agentic BI CLI POC). 15 LLMs across two serverless inference platforms — Fireworks (5 models) and Baseten (10 models) — measured on accuracy, TTFT, total latency, tokens, and cost-per-query. Every number on this page is reproducible from the repo.

Best accuracy (across all models)
Cheapest /query (≥90% accuracy)
Cheaper than GPT-5.4 baseline
Models benchmarked (Fireworks + Baseten)

What this benchmark answers

Inference infrastructure choices are usually argued on price-per-token and marketing claims. This dashboard makes them empirical: a fixed 25-question gold set (tolerant multiset comparison) is run against every Model API LLM on Fireworks and Baseten using the same agent prompt, the same schema, and identical sampling. Outputs: which model wins on accuracy, which on TTFT, which on $/query, where the Pareto frontier sits, and where a model is cheap on paper but slow or unreliable in practice. The text-to-SQL workload is the harness; the conclusions are about platforms and inference.
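As an illustration of what a tolerant multiset comparison can look like, here is a minimal sketch. It is not the repo's rows_match: the function names, the float rounding, and the case-insensitive string handling are all assumptions made for the example.

```python
from collections import Counter
from typing import Any, Iterable, Sequence


def normalize_cell(value: Any, ndigits: int = 4) -> Any:
    """Normalize one cell so cosmetic differences do not fail a match (assumed rules)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return round(float(value), ndigits)  # 3 and 3.0 compare equal
    if isinstance(value, str):
        return value.strip().casefold()      # ignore case and outer whitespace
    return value


def rows_match_sketch(got: Iterable[Sequence[Any]], gold: Iterable[Sequence[Any]]) -> bool:
    """Compare two result sets as multisets: row order is ignored, duplicates count."""
    def as_multiset(rows: Iterable[Sequence[Any]]) -> Counter:
        return Counter(tuple(normalize_cell(c) for c in row) for row in rows)

    return as_multiset(got) == as_multiset(gold)


same = rows_match_sketch([("AC/DC", 2), ("Queen", 3)], [("queen", 3.0), ("ac/dc ", 2)])
missing_duplicate = rows_match_sketch([("a", 1)], [("a", 1), ("a", 1)])
```

Using a Counter rather than a set means a query that drops or adds duplicate rows still fails, which matters for aggregate-free SELECTs.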

Architecture · seven steps

Click any node to see what we chose, the alternatives considered, and the trade-off we accepted.

$ python -m src.cli · run locally — see README for setup.

Model Sweep

15 LLMs · 25 questions · one workload.

Same gold set, same agent prompt, same sampling settings, run against every Model API LLM on Fireworks and Baseten. Click a column header to sort. Filter by platform to isolate one provider.

Model · Platform · Accuracy · JSON OK · TTFT p50 · Lat p50 · Lat p90 · In tok · Out tok · $/query

Protocol: one-shot per question, no agent repair loop, temperature=0, max_tokens=1024, streaming (TTFT = time to first delta chunk). Fireworks numbers from data/fireworks_sweep_v2.json and Baseten numbers from data/baseten_sweep.json — both the full 25-question gold set, same streaming protocol, same scoring.

$/query vs accuracy — Pareto view

Top-left = high accuracy at low cost. Hover for model details.
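The Pareto view can be reproduced with a few lines. The sketch below assumes the usual definition (a model is on the frontier if no other model is at least as cheap and at least as accurate, and strictly better on one axis). The first two points use figures quoted elsewhere on this page; "model-x" is a made-up point added only to show a dominated model.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost_per_query: float  # USD, lower is better
    accuracy: float        # percent, higher is better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates, cheapest first."""
    frontier = []
    for p in points:
        dominated = any(
            q.cost_per_query <= p.cost_per_query
            and q.accuracy >= p.accuracy
            and (q.cost_per_query < p.cost_per_query or q.accuracy > p.accuracy)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_per_query)


points = [
    Point("gpt-oss-120b", 0.000187, 96.0),
    Point("Nemotron-120B-A12B", 0.000121, 44.0),
    Point("model-x", 0.000300, 90.0),  # hypothetical: pricier and less accurate than gpt-oss-120b
]
frontier_names = [p.model for p in pareto_frontier(points)]
```

Note that the cheapest model always sits on the frontier even at low accuracy, so the frontier is a shortlist, not a ranking.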

Platform Comparison

Fireworks vs Baseten · head-to-head.

Same workload, same agent, same gold set, same scoring — both platforms swept on the full 25 questions with streaming and TTFT capture. Anything that differs below is platform-side: inference engine, model selection, pricing, networking, autoscaling behavior.

Best model on each platform

"Best" = highest accuracy, tie-broken by lower cost-per-query. The winner gets the apples-to-apples comparison row.

What's actually being compared

  • Fireworks data: Full 25-question gold set, streaming with TTFT capture, 5 models — Kimi K2.5/K2.6, DeepSeek V4 Pro, GLM-5.1, Minimax M2.7. From data/fireworks_sweep_v2.json.
  • Baseten data: Full 25-question gold set, streaming with TTFT capture, 10 models. From data/baseten_sweep.json.
  • Identical methodology: Both sweeps run the same protocol — one-shot, no repair, no retry, JSON mode, temperature=0, max_tokens=1024, the same compact+trim schema injection, scored by the same tolerant rows_match. Same 25 questions on both sides — a true head-to-head.
  • Coverage note: GLM-5 and Qwen3-8B return 404 NOT_FOUND on the Fireworks account and can't be benchmarked — see probed_unavailable in the data file.

Inference Insights

What the matrix actually tells us.

Seven quick questions an inference-infrastructure engineer would ask of the matrix data, answered with numbers from the sweep — not vibes. Then four deeper sections on fine-tuning economics, serverless-vs-dedicated tradeoffs, quantization, and how the platform actually behaves under concurrent load.

Fine-tune a cheap model, or pay for a better one?

The first instinct on inference cost is "fine-tune the cheapest open-source model to match the expensive one." Sometimes that's right. Often it's not. Here's the framework, grounded in this sweep's numbers.

Step 1 · Read the matrix before reaching for SFT

In this sweep, gpt-oss-120b hits 96% accuracy at $0.000187/query. The cheapest candidate you'd consider fine-tuning (Nemotron-120B-A12B at $0.000121/q) sits at 44% accuracy out of the box — a 52-point gap. To close that gap with SFT against a 25-question workload, you'd need labeled training data (call it 500–2000 examples for a "first SFT pass that won't ruin the model"), plus 2–3 rounds of iteration, plus an eval harness. Realistic minimum: 40–80 engineer-hours + a few hundred dollars of compute.

Cost-per-query difference between the two: $0.000066. At GitLab's projected 30k q/day, that's $1.98 saved per day — about $59/month. Engineer time to fine-tune amortizes at ~5–8 years. Don't fine-tune.
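The arithmetic behind that verdict, as a small sketch. The per-query prices, volume, and 40-hour low end come from the text above; the $100/hour engineer rate is an assumption introduced for the example, so the payback figure moves with whatever rate you plug in.

```python
def finetune_payback_months(
    cost_gap_per_query: float,
    queries_per_day: int,
    engineer_hours: float,
    hourly_rate_usd: float,
    compute_usd: float = 0.0,
    days_per_month: int = 30,
) -> float:
    """Months of per-query savings needed to repay the upfront fine-tuning cost."""
    monthly_saving = cost_gap_per_query * queries_per_day * days_per_month
    upfront = engineer_hours * hourly_rate_usd + compute_usd
    return upfront / monthly_saving


gap = 0.000187 - 0.000121      # gpt-oss-120b minus Nemotron-120B-A12B, USD per query
daily_saving = gap * 30_000    # about $1.98/day at 30k queries/day
low_end_months = finetune_payback_months(gap, 30_000, engineer_hours=40, hourly_rate_usd=100.0)
```

At the assumed rate, the 40-hour low end alone takes more than five years to pay back, before counting compute or the accuracy risk.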

Step 2 · The cases where fine-tuning does win

  • Output format lock-in. The model needs to emit a specific JSON shape, dialect, or convention that prompting can't reliably enforce. SFT pins it permanently. (Text-to-SQL: the gold SQL style is one example — but our sweep shows several models already hit ≥90% with prompting alone, so it's not the binding constraint here.)
  • Domain vocabulary the base model never saw. Internal schema names, jargon, codebase-specific identifiers. Prompts can teach this in-context but eat tokens; SFT bakes it in.
  • RFT (Reinforcement Fine-Tuning) when correctness is verifiable. SQL is verifiable — the executor either returns the right rows or it doesn't. This is the Fireworks/OpenAI RFT sweet spot: no human labels, gradient comes from the executor's pass/fail signal. The s_014 "top-per-group" miss in our Quality tab is precisely this kind of pattern-learning target — but only after Step 1 fails to close it.
  • Real cost gaps (10×+), at real volume. If a small model at $0.00001/q could match a $0.001/q model after SFT, and you're doing 10M queries/day, then fine-tuning pays back in days.

Step 3 · Decision rule

Pick the better model when (a) the accuracy gap is >5 points and (b) the cost gap is <3× and (c) you can ship today. Fine-tuning is a months-long commitment with maintenance overhead (re-tune on base-model updates, drift monitoring, eval-set rot). Fine-tune when the cost gap is >10× and you can articulate a verifiable signal (executor feedback, click-through, rubric) to drive RFT.

In this benchmark, the answer is "pick gpt-oss-120b." If DeepSeek-V5 later ships at $0.005/q with 98% accuracy while gpt-oss stays at 96% and $0.0002/q, the framework flips toward keeping (and, if needed, fine-tuning) the cheap model, but only because the cheap model is already within 2 points of the expensive one.
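The decision rule can be written down directly. This is a sketch of the thresholds stated in Step 3, not code from the repo; the function name and return labels are hypothetical.

```python
def choose_strategy(
    accuracy_gap_points: float,
    cost_ratio: float,
    can_ship_today: bool,
    has_verifiable_signal: bool,
) -> str:
    """Apply the Step 3 thresholds; cost_ratio = expensive $/query divided by cheap $/query."""
    if cost_ratio > 10 and has_verifiable_signal:
        return "fine-tune"
    if accuracy_gap_points > 5 and cost_ratio < 3 and can_ship_today:
        return "pick-better-model"
    return "no-clear-rule"  # the stated rule is silent here; decide case by case


# This sweep: 96% vs 44% accuracy, $0.000187 vs $0.000121 per query.
sweep_choice = choose_strategy(96 - 44, 0.000187 / 0.000121, True, True)

# The hypothetical future case: 98% vs 96%, $0.005 vs $0.0002 per query.
future_choice = choose_strategy(98 - 96, 0.005 / 0.0002, True, True)
```

The two branches cannot both fire, since one requires a cost ratio above 10 and the other below 3; everything in between falls through to a judgment call.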

Serverless Model API vs dedicated deployment

Every model in this matrix is serverless — shared multi-tenant inference, per-token billing, no GPU allocation on your side. Baseten (and Fireworks) also sell dedicated deployments billed per GPU-minute. The choice is workload-shape-dependent, not platform-dependent.

Serverless wins

When the workload is bursty, mixed, or you're still picking a model

  • Variable traffic. 30k queries one day, 3k the next. Dedicated capacity sits idle and you pay anyway.
  • Model exploration. Swapping between 10 models for benchmarking is one config change on serverless. On dedicated, it's 10 deployments.
  • Cost discipline at low/medium volume. At the GitLab POC's 30k q/day on gpt-oss-120b, serverless = ~$168/month. A dedicated H100 = $1,440–2,160/month. Serverless wins by 8–13×.
  • No infra ops. No GPU autoscaling, no health checks, no replica fleet sizing.
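The cost comparison in the list above reduces to two multiplications. The sketch assumes a 30-day, 720-hour month and a $2 to $3 hourly H100 rate, which is the rate implied by the $1,440–2,160 range rather than a quoted price.

```python
HOURS_PER_MONTH = 24 * 30  # assumed 30-day month


def serverless_monthly(cost_per_query: float, queries_per_day: int, days: int = 30) -> float:
    """Per-token billing: cost scales with traffic."""
    return cost_per_query * queries_per_day * days


def dedicated_monthly(gpu_hourly_usd: float, gpus: int = 1) -> float:
    """Per-GPU billing: cost is flat regardless of traffic."""
    return gpu_hourly_usd * HOURS_PER_MONTH * gpus


serverless = serverless_monthly(0.000187, 30_000)  # gpt-oss-120b at 30k queries/day
dedicated_low = dedicated_monthly(2.0)
dedicated_high = dedicated_monthly(3.0)
ratio_low = dedicated_low / serverless
ratio_high = dedicated_high / serverless
```

Raising queries_per_day until serverless_monthly crosses dedicated_monthly gives a first estimate of the crossover volume, ignoring whether one GPU can actually serve it.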

Dedicated wins

When you need predictable tails or you can saturate a GPU

  • Tight P99 SLO. Serverless tail latency is governed by other tenants' bursts. Dedicated isolates you — no noisy neighbors.
  • Sustained high throughput. If a single GPU is >50% utilized 24/7, dedicated $/token beats serverless. Rough breakeven: ~1.5–3M tokens/hour per H100.
  • Custom weights. Fine-tuned, LoRA-adapted, or proprietary model — serverless catalogs only carry vendor-curated weights.
  • Data residency / compliance. Region-pin, BYOC, single-tenant guarantees — these are dedicated-tier features.
  • Predictable cost. Per-GPU-minute is a flat line; per-token is a stochastic function of traffic.

What this matrix can and can't tell you

The single-call sweep on the Model Sweep tab shows median behavior. The Latency under load section below probes the next layer — does the platform actually parallelize, or does the apparent low latency at concurrency=1 collapse at concurrency=10? That's the question dedicated vs serverless turns on, and it's why we ran the concurrency benchmark.

Phase 3 deferred: the original brief included deploying a smaller model via Truss as a dedicated deployment to measure the crossover empirically. Deferred to stay within the serverless-only budget — but the breakeven math above frames where you'd start looking.

Quantization · FP16 → FP8 → FP4 · what you're actually buying

Every Baseten Model API LLM ships quantized. That's not a cost-cutting hack — it's why these models are fast enough to be priced per-token at all. Here's what the numbers mean and how they map onto the latencies we measured.

FP16 / BF16 (the reference)

2 bytes per weight. A 120B-parameter model needs ~240 GB of GPU memory just to hold the weights: at least 3× H100 (80 GB each) before any KV cache or activations. Highest fidelity to the model's training distribution. Rarely served at scale on cost-sensitive inference platforms because the GPU economics don't work outside research.

FP8

1 byte per weight, ~½ the memory of FP16. A 120B model fits in ~120 GB → still 2× H100 minimum, but TFLOPS roughly double on Hopper-class GPUs. Typical quality regression vs FP16 is 0–1.5% on most benchmarks. Baseten serves DeepSeek-V3.1 and MiniMax-M2.5 at FP8.

FP4 (the workhorse here)

½ byte per weight, ~¼ the memory of FP16. A 120B model fits in ~60 GB → comfortably on a single H100 or H200. Throughput on Blackwell-class hardware roughly 2× FP8 again. Quality regression typically 1–5%, very model-dependent. 8 of 10 Baseten models in this matrix are FP4.
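The memory figures in the three tiers above follow from bytes-per-weight arithmetic. The sketch counts weights only; KV cache, activations, and runtime overhead are deliberately ignored, so real deployments need headroom beyond these minimums.

```python
import math

BITS_PER_WEIGHT = {"fp16": 16, "fp8": 8, "fp4": 4}


def weight_gb(params_billions: float, fmt: str) -> float:
    """GB needed to hold the weights alone (1B params at 8 bits = 1 GB)."""
    return params_billions * BITS_PER_WEIGHT[fmt] / 8


def min_gpus(params_billions: float, fmt: str, gpu_gb: int = 80) -> int:
    """Smallest GPU count whose combined memory fits the weights."""
    return math.ceil(weight_gb(params_billions, fmt) / gpu_gb)


footprint = {fmt: (weight_gb(120, fmt), min_gpus(120, fmt)) for fmt in BITS_PER_WEIGHT}
# footprint == {"fp16": (240.0, 3), "fp8": (120.0, 2), "fp4": (60.0, 1)}
```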

What this buys the platform — and you

For the platform: more concurrent users per GPU, cheaper per-token economics, the ability to price gpt-oss-120b at $0.10/M input tokens at all.

For you: fast prefill (the schema-stuffed system prompt ingests faster), high per-GPU throughput (which is what makes the concurrency curves below shaped like they are), and the assumption that the model's accuracy ranking on your workload still holds after the quantization. Verify the last assumption with an eval — which is what this whole dashboard is.

What our matrix tells us about FP4-vs-FP8 in practice

The top performer (gpt-oss-120b · FP4 · 96% acc · 228ms TTFT) beats the second (DeepSeek-V3.1 · FP8 · 92% acc · 343ms TTFT) on both axes despite running at the more aggressive quantization. That's the right answer for the wrong-feeling reason: FP4 doesn't kneecap well-trained models, and the model-architecture / training-data delta dominates the quantization-precision delta on this workload. The implication for model selection is: don't filter by quantization tier. Filter by accuracy and TTFT on your own workload; the quantization is upstream of both.

Quantization data sourced from Baseten's /v1/models endpoint at sweep time (data/baseten_models_meta.json).

Latency under load · concurrency 1 / 5 / 10

Single-call latency hides queue depth. We fired the full 25-question gold set at the top three Baseten models (gpt-oss-120b, DeepSeek-V3.1, Kimi-K2.6) at concurrency 1, 5, and 10 — same agent, full repair loop, ThreadPoolExecutor. How does P50 / P90 / P99 move?
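A minimal sketch of this kind of concurrency probe follows. It is not the repo's benchmark script: the function names are hypothetical, the percentile uses the nearest-rank method (an assumption), and a short sleep stands in for the real agent call.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def percentile(sorted_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_ms)))
    return sorted_ms[rank - 1]


def timed_call(fn: Callable[[str], object], item: str) -> float:
    """Wall-clock latency of one call, in milliseconds."""
    start = time.perf_counter()
    fn(item)
    return (time.perf_counter() - start) * 1000


def run_at_concurrency(fn: Callable[[str], object], items: Sequence[str], workers: int) -> dict:
    """Run every item through fn with a fixed-size thread pool and summarize latency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(lambda it: timed_call(fn, it), items))
    return {p: percentile(latencies, p) for p in (50, 90, 99)}


def fake_model_call(question: str) -> None:
    time.sleep(0.005)  # stand-in for the network round-trip to the model


questions = [f"q{i}" for i in range(10)]
results = {c: run_at_concurrency(fake_model_call, questions, c) for c in (1, 5, 10)}
```

Threads are sufficient here because each call is I/O-bound; with a real endpoint, a P50 that rises with the worker count suggests queueing on the platform side.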

Method notes

  • Both sweeps use one-shot calls (no repair loop) — this isolates raw model capability from agent compensation. The agent's repair loop adds 1–2 retry rounds in production and improves Kimi K2.6 from 92% to its measured 23/25 ceiling.
  • Cost-per-query is computed from actually-measured token usage during the sweep × published per-token rates. No averaging across queries — the model's real prompt/completion sizes drive the cost.
  • "Free" pricing (GLM-5.1 on Baseten) is flagged in the matrix — likely a beta / promotional rate. Don't make platform-selection decisions on prices that may not be load-bearing.
  • TTFT here is wall-clock from request open → first SSE delta with non-empty payload. Includes network round-trip from the test machine; not isolated to model prefill.
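The TTFT definition in the last note can be sketched as below. The code assumes the stream is a lazy iterator that opens the request on first iteration, so the clock starts before the request goes out; the fake stream and all names are hypothetical stand-ins for a real SSE client.

```python
import time
from typing import Iterable, Iterator


def measure_stream(chunks: Iterable[str]) -> dict:
    """Time a streamed response: TTFT is the first non-empty delta, total is stream end."""
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for delta in chunks:
        if delta and ttft_ms is None:  # skip empty keep-alive deltas
            ttft_ms = (time.perf_counter() - start) * 1000
        parts.append(delta)
    total_ms = (time.perf_counter() - start) * 1000
    return {"ttft_ms": ttft_ms, "total_ms": total_ms, "text": "".join(parts)}


def fake_stream() -> Iterator[str]:
    time.sleep(0.01)  # simulated network + prefill delay
    yield ""          # empty delta, ignored for TTFT
    yield "SELECT"
    time.sleep(0.01)  # simulated decode time
    yield " 1"


measured = measure_stream(fake_stream())
```

Because the timer wraps the whole iteration, the result includes client-side network time, matching the caveat that TTFT here is not isolated to model prefill.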

Quality

2/10 → , on the same model.

The customer's current prototype prompt (Convert this question to SQL: {question} — no schema, no JSON mode, no execute-and-repair) on the same Kimi K2.6 endpoint scored 2/10 on the dev set. The agent on the same model scored . The lift comes from prompt engineering, not a model swap.

Dev set
Synthetic set
Combined

All 25 questions · click any card to inspect

Click any card to load the question into the unified comparison tool on the Live Demo tab — pick models and prompt strategies, then run.

Dev set · 10 questions
Synthetic set · 15 questions

Evaluation log · actual terminal output

A real python scripts/agent_eval.py run from this repo — the full agent (compact schema, JSON mode, execute-and-repair) scored against the 25-question gold set, written to data/agent_eval.json. Rendered straight from that file.

Read column-by-column: qid · set · tier · pass/fail · agent latency in ms · repair turns used · diagnostic note.

Honest failure analysis

Every question the agent missed in the run above — with the SQL it actually generated.

Latency

The 3-second P50 question.

Run queries in Live Demo to see your own latency stats.

Live latency test · Kimi K2.6

Fires 3 sequential calls to Fireworks (Kimi K2.6) against the same dev question and reports the median you see right now. Calls also stream into the Live Demo activity log.

Set your Fireworks key in the Live Demo tab.

Latency stats

Benchmark data · pre-recorded · 30 calls across 3 runs
p50 · across 30 calls
p90 · tail dominated by cold-start
p99 · worst observed
% under 3 s · cleared the 3,000 ms target

A separate clean dev-eval run (no pacing) hit a 2,901 ms median. Schema compaction (935 → 665 mean prompt tokens, −29%) is what made that possible, with no change in accuracy.

Pre-recorded benchmark · 3 × 10-question runs on shared tier

Run  P50       P90        Accuracy  Note
1    9,067 ms  22,065 ms  10/10     shared tier under load
2    8,297 ms  39,673 ms  6/10      three queries hit 429s and never recovered
3    4,058 ms  17,935 ms  10/10     quieter shared tier · cleanest run

All three perf JSONs are checked into the repo (perf_compact_{1,2,3}.json). Median p50 across runs: 8,297 ms. Best clean p50: 4,058 ms.

Schema trimming · the latency lever

Most of the per-call latency on shared serverless is prefill, so the prompt-token count is the highest-leverage knob short of moving off shared. Two compounding wins:

Compact form

935 → 665 mean prompt tokens (−29%)

Render the schema as Album(AlbumId:integer[pk], Title:nvarchar, ArtistId:integer[fk→Artist.ArtistId]) instead of CREATE TABLE blocks. 62% schema-char reduction. No accuracy regression on the dev set.
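A minimal sketch of the one-line-per-table rendering. The real implementation is _format_schema_compact in src/agent.py and is not reproduced here; the Column dataclass and function name below are hypothetical, chosen only to emit the format shown above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    name: str
    sql_type: str
    pk: bool = False
    fk: Optional[str] = None  # referenced column as "Table.Column", if any


def format_table_compact(table: str, columns: list[Column]) -> str:
    """Render one table as Name(col:type[tag], ...) instead of a CREATE TABLE block."""
    parts = []
    for col in columns:
        if col.pk:
            tag = "[pk]"  # simplification: a column that is both pk and fk shows only pk
        elif col.fk:
            tag = f"[fk→{col.fk}]"
        else:
            tag = ""
        parts.append(f"{col.name}:{col.sql_type}{tag}")
    return f"{table}({', '.join(parts)})"


album_line = format_table_compact(
    "Album",
    [
        Column("AlbumId", "integer", pk=True),
        Column("Title", "nvarchar"),
        Column("ArtistId", "integer", fk="Artist.ArtistId"),
    ],
)
```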

Keyword + FK trim

68.5% mean reduction when it fires

Keyword-match table names against the question, expand via foreign keys to keep JOINs valid. Fires on 9/25 questions today (24.7% mean reduction across all 25, 68.5% on the 9 fired). On q_002 ("AC/DC albums"): 89.8% reduction.
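One way to sketch the keyword + FK trim: the matching rule (table name or its simple plural appears in the question), the direction of FK expansion (follow outgoing references), and the fallback to the full schema when nothing matches are all assumptions, not the repo's exact logic. The table list is a small Chinook subset.

```python
import re


def trim_schema(question: str, tables: list[str], fks: dict[str, set[str]]) -> list[str]:
    """Keep keyword-matched tables plus everything reachable through their foreign keys."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    seeds = {t for t in tables if t.lower() in words or t.lower() + "s" in words}
    if not seeds:
        return list(tables)  # trim does not fire; send the full schema

    keep = set(seeds)
    frontier = list(seeds)
    while frontier:  # FK closure, so JOIN targets stay in the prompt
        table = frontier.pop()
        for referenced in fks.get(table, ()):
            if referenced not in keep:
                keep.add(referenced)
                frontier.append(referenced)
    return [t for t in tables if t in keep]


TABLES = ["Album", "Artist", "Genre", "Track"]
FKS = {"Album": {"Artist"}, "Track": {"Album", "Genre"}}

acdc = trim_schema("List all AC/DC albums", TABLES, FKS)
no_match = trim_schema("How many rows are there?", TABLES, FKS)
```

Falling back to the full schema when no keyword matches trades latency for safety: a trimmed prompt that omits the needed table cannot produce correct SQL.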

Path to a firm sub-3s P50 SLO

  • Schema compaction (shipped). One-line-per-table form. This is what brought our best clean run to 2.9s.
  • Prompt caching on the static schema chunk — Fireworks's prefix cache is exactly this use case. Most of the 665 input tokens are identical across queries.
  • On-demand or dedicated deployment for predictable tail latency — shared serverless will always have load spikes. This is the SLO-grade lever.
  • Speculative decoding (Fireworks platform feature) — text-to-SQL has very predictable output structure. Ideal target.
  • Schema retrieval for production-scale schemas — on Chinook the 11-table dump is fine; on a 1,000-table customer DB the schema alone would break the latency budget.

Cost Explorer

Drag the volume. Watch the matrix re-rank.

One slider, every benchmarked model on both platforms. Cost-per-day is computed from actually-measured token usage during the sweep × published per-token rates. Build-vs-Buy context for the GitLab POC is the section below.
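The cost computation is measured tokens times published rates, then scaled by volume. In the sketch below, the $0.10/M input rate is the gpt-oss-120b figure quoted on this page, while the $0.50/M output rate is a placeholder assumption. The projection uses a 30-day month and a 12-month year, which is what the monthly and annual figures on this page appear to use.

```python
def cost_per_query(in_tokens: int, out_tokens: int, in_rate_per_m: float, out_rate_per_m: float) -> float:
    """USD for one query, given token counts and USD-per-million-token rates."""
    return (in_tokens * in_rate_per_m + out_tokens * out_rate_per_m) / 1_000_000


def project(cost_usd: float, queries_per_day: int) -> dict:
    """Scale a per-query cost to daily, monthly (30 days), and annual (12 months) totals."""
    day = cost_usd * queries_per_day
    month = day * 30
    return {"day": day, "month": month, "year": month * 12}


# 665 in / 78 out is the benchmark token profile; the output rate is hypothetical.
example_cost = cost_per_query(665, 78, in_rate_per_m=0.10, out_rate_per_m=0.50)

# Reproduces the Fireworks column from its quoted per-query cost.
fireworks = project(0.000942, 30_000)
```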

Volume
30,000 queries/day
Default 30,000 = 1,000 users × 30 q/day (GitLab POC projection).
Drag from 1k → 300k to see how the rankings shift.

All benchmarked models at your volume

Models with errored sweeps or unknown pricing are hidden. Sorted cheapest → most expensive. Highlighted row = best accuracy at this volume tier.

Model · Platform · Accuracy · $/query · $/day · $/month · $/year · vs GPT-5.4

Build vs Buy · the GitLab POC framing

Originally the cost story for the customer email — same slider drives this section. Kept for context on the four-way choice (proprietary / direct API / self-host / managed platform).

All four columns recompute when you drag the slider. If you've made runs in the Live Demo, the token profile switches from the benchmark default to the average of your own runs — apples-to-apples across all four options.

Based on benchmark data (665 in / 78 out)

GPT-5.4 current

  • Per query: $0.002862
  • Monthly: $2,576
  • Annual: $30,917

No infra to manage. Highest cost, proprietary lock-in, 7s latency reported by customer.

Direct model APIs

Moonshot, Alibaba, DeepSeek, etc.
  • Per query: Varies
  • Monthly (est. avg): $594
  • Annual (est. avg): $7,231

Per-token pricing is often cheaper going direct to the model provider — typically 20–40% cheaper than Fireworks. But for an enterprise deployment like GitLab's, the total cost includes more than tokens:

  • Vendor management. Each model comes from a different company (Moonshot in Beijing, Alibaba in Hangzhou, DeepSeek in Hangzhou). Multiple contracts, billing relationships, and SLA negotiations.
  • Data residency. Most leading open-source model providers are headquartered outside the US / EU. Enterprise customers may require guarantees about where their data is processed. Fireworks offers US / EU / APAC region pinning and BYOC.
  • Rate limits and scaling. Direct APIs often require prepaid recharge tiers and have manual scaling limits. Production traffic at 30k+ q/day needs automatic scaling.
  • No unified fine-tuning. Can't fine-tune across model families on one platform. Fireworks offers SFT / RFT / DPO for any hosted model at base serving price.
  • Reliability. Research labs ship models, not production infrastructure. No published SLAs. Fireworks processes 5T+ tokens/day.

Per-token rates sourced from artificialanalysis.ai/models ↗

Self-host open-source

  • GPUs needed: 1× H100
  • GPU/month (@ $2.50/hr): $1,800
  • + 0.5 FTE ML infra: $6,250
  • Monthly total: $8,050
  • Annual: $96,600

Full control at extreme scale. Requires ML infra team, GPU procurement, autoscaling, monitoring, model updates.

Fireworks (this POC)

  • Per query: $0.000942
  • Monthly: $848
  • Annual: $10,176

Higher per-token cost than some direct model provider APIs, because you're paying for production infrastructure and not just inference: optimized serving (0.71s TTFT, fastest tracked provider), a fine-tuning pipeline, region pinning, and platform convenience.

Includes: FireAttention + speculative decoding, 18+ regions across 8 cloud providers, US/EU/APAC region pinning, BYOC option, single vendor for all open-source models, fine-tuning (SFT / RFT / DPO) at base serving price, on-demand tier for latency SLOs, Enterprise support with dedicated account team.

At extreme scale (1M+ queries/day), self-hosting eventually wins on pure compute cost, but that comparison leaves out the engineering team needed to run it. At the projected first cohort volume (30k queries/day) and even at 10× growth, Fireworks is the clear choice: zero ops overhead, instant model swapping, and fine-tuning without a training cluster.

Production Roadmap

From shared serverless to a tuned production tier.

Each phase maps to a concrete Fireworks platform feature. The POC is Phase 0.

Fine-tuning pipeline · SFT → RFT → DPO

Prompt engineering can only carry text-to-SQL so far. The two failure modes we hit (top-per-group, date-arithmetic ambiguity) are exactly the shape that fine-tuning fixes — the model needs to be taught the output discipline of the dialect, not more facts.

Step 1 · SFT

Establish the format

Generate ~1k synthetic (question, gold_sql) pairs across diverse schemas (synthetic schema generators are cheap). Fine-tune Qwen3-8B or DeepSeek-Coder-V2-Lite to internalize: no surrogate keys in the output, idiomatic LIMIT, correct GROUP BY shape, full-name concatenation. This is what the system prompt is currently bandaging.

Why first

SFT is the only stage that can teach a new format from scratch.

Step 2 · RFT · the centerpiece

Use the executor as a reward

Text-to-SQL is the canonical RFT use case. The reward is mechanically verifiable — did the SQL run, and did its result-set match the gold answer? No human labels required. Every gold question we already have is a training example. Fireworks supports RFT directly.

Why this catches s_014

Top-per-group failure is silent — the SQL runs, just returns wrong rows. RFT punishes that exactly via the executor reward, which prompt engineering can't.
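A sketch of an executor-based reward, assuming a binary pass/fail signal and a multiset comparison of result rows. It uses a toy in-memory table rather than Chinook or the actual s_014 question, and it is not tied to any platform's RFT API; the function name is hypothetical.

```python
import sqlite3
from collections import Counter


def executor_reward(conn: sqlite3.Connection, candidate_sql: str, gold_sql: str) -> float:
    """1.0 if the candidate returns the same rows as the gold query, else 0.0."""
    try:
        got = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # SQL that does not run earns nothing
    gold = conn.execute(gold_sql).fetchall()
    return 1.0 if Counter(got) == Counter(gold) else 0.0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, rep TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", "ann", 10), ("east", "bob", 20), ("west", "cy", 5)],
)

GOLD = (
    "SELECT region, rep FROM sales s "
    "WHERE amount = (SELECT MAX(amount) FROM sales WHERE region = s.region)"
)
# Runs without error but returns only the global top row: a silent top-per-group miss.
SILENT_MISS = "SELECT region, rep FROM sales ORDER BY amount DESC LIMIT 1"

reward_gold = executor_reward(conn, GOLD, GOLD)
reward_miss = executor_reward(conn, SILENT_MISS, GOLD)
reward_broken = executor_reward(conn, "SELEC region FROM sales", GOLD)
```

The silent miss and the syntax error both score 0.0, which is the point: the reward depends on returned rows, not on whether the SQL executes.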

Step 3 · DPO

Polish the long tail

Direct Preference Optimization on production replay pairs: the working query a user accepted vs the rejected/repaired query they got first. Closes residual style drift after RFT plateaus, using data we'd already be collecting in production.

Why last

DPO needs preference pairs at scale — production gives us that for free.

Known limitations

  • Sub-3s P50 on shared serverless is not contractual. 23% of measured calls under 3s, 33% under 5s. Best clean p50 4,058 ms (run 3); a separate dev-eval pass hit 2,901 ms. Variance is load-side. SLO requires on-demand or dedicated deployment.
  • service_tier="priority" is silently accepted but does nothing. Fireworks priority is account-provisioned, not a request-time flag.
  • Schema trim is keyword-based, not retrieval. Fires on 9/25 questions (mean 68.5% reduction on those, 24.7% across all). For a 1,000-table customer DB this isn't enough — needs a real retrieval step or schema caching with a stable cache id.
  • format_answer_summary is heuristic. Renders rows as <entity> (col=val, col=val). Values are exact; wording may not match the gold set's phrasing. Eval correctness uses rows_match against the SQL result, not the answer field.
  • s_013 is run-to-run flaky at temperature 0. "Last 6 months of available data" is genuinely ambiguous; counted as a failure to be conservative.
  • Qwen3-8B's 50% should be read with care. Some failures (q_001 units vs revenue) are real semantic misses; others would likely improve with one or two few-shot examples. Plausible as a "fast first attempt" inside a router, not a solo replacement.
  • Pricing for Qwen3-8B is an estimate in perf.py's PRICING_USD_PER_M dict; verify before publishing any cheap-model cost number.
  • No multi-turn perf measurement. Single-turn cost number assumes a fresh conversation; production multi-turn adds ~200 tokens per prior exchange, capped at 6 turns by the CLI's HISTORY_TURN_LIMIT.
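The multi-turn caveat in the last item can be turned into a rough prompt-size estimate. The sketch treats one "turn" as one prior exchange and uses the ~200-token figure as a flat constant; both are simplifying assumptions, and HISTORY_TURN_LIMIT is redefined locally rather than imported from the CLI.

```python
HISTORY_TURN_LIMIT = 6  # mirrors the cap described above; redefined here for the sketch


def estimated_prompt_tokens(
    base_tokens: int,
    prior_exchanges: int,
    tokens_per_exchange: int = 200,
    limit: int = HISTORY_TURN_LIMIT,
) -> int:
    """Single-turn prompt size plus history, with history capped at `limit` exchanges."""
    return base_tokens + tokens_per_exchange * min(prior_exchanges, limit)


fresh = estimated_prompt_tokens(665, 0)      # 665, the benchmark single-turn profile
mid_chat = estimated_prompt_tokens(665, 3)   # 1265
long_chat = estimated_prompt_tokens(665, 10) # 1865, capped at 6 exchanges of history
```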

Live Demo

sql.js + Chinook.db running in your browser.

The Chinook database runs in a WebAssembly SQLite instance in your browser; the API key you paste below is used directly from this page to call the selected inference provider. Defaults to Baseten (10-model catalog, including the winners from the Model Sweep tab); toggle to Fireworks for the original take-home flow. Type any natural-language question and watch a baseline prompt and the agent prompt run side-by-side.

Never stored. Sent only to inference.baseten.co.

Unified comparison tool

Pick any two combinations of model and prompt strategy and run them side-by-side on any of the 25 gold questions or your own free-form input. Defaults to Customer Baseline vs Agent on Kimi K2.6 — change either side to compare anything.

Side A
vs
Side B
Live calls hit inference.baseten.co. Each side uses your selected model + prompt strategy.

Schema · what the model sees

This is the compact schema sent in the system prompt — same as _format_schema_compact in src/agent.py. Trim happens per-question: keyword-matched table names + their FK closure.
