Inference Benchmarking Suite · Text-to-SQL Workload

Inference Insights

What the matrix actually tells us.

Seven quick questions an inference-infrastructure engineer would ask of the matrix data, answered with numbers from the sweep — not vibes. Then four deeper sections on fine-tuning economics, serverless-vs-dedicated tradeoffs, quantization, and how the platform actually behaves under concurrent load.

Fine-tune a cheap model, or pay for a better one?

The first instinct on inference cost is "fine-tune the cheapest open-source model to match the expensive one." Sometimes that's right. Often it's not. Here's the framework, grounded in this sweep's numbers.

Step 1 · Read the matrix before reaching for SFT

In this sweep, gpt-oss-120b hits 96% accuracy at $0.000187/query. The cheapest candidate you'd consider fine-tuning (Nemotron-120B-A12B at $0.000121/q) sits at 44% accuracy out of the box — a 52-point gap. To close that gap with SFT against a 25-question workload, you'd need labeled training data (call it 500–2000 examples for a "first SFT pass that won't ruin the model"), plus 2–3 rounds of iteration, plus an eval harness. Realistic minimum: 40–80 engineer-hours + a few hundred dollars of compute.

Cost-per-query difference between the two: $0.000066. At GitLab's projected 30k q/day, that's $1.98 saved per day — about $59/month. Engineer time to fine-tune amortizes at ~5–8 years. Don't fine-tune.

Step 2 · The cases where fine-tuning does win

Output format lock-in. The model needs to emit a specific JSON shape, dialect, or convention that prompting can't reliably enforce. SFT pins it permanently. (Text-to-SQL: the gold SQL style is one example — but our sweep shows several models already hit ≥90% with prompting alone, so it's not the binding constraint here.)
Domain vocabulary the base model never saw. Internal schema names, jargon, codebase-specific identifiers. Prompts can teach this in-context but eat tokens; SFT bakes it in.
RFT (Reinforcement Fine-Tuning) when correctness is verifiable. SQL is verifiable — the executor either returns the right rows or it doesn't. This is the Fireworks/OpenAI RFT sweet spot: no human labels, gradient comes from the executor's pass/fail signal. The s_014 "top-per-group" miss in our Quality tab is precisely this kind of pattern-learning target — but only after Step 1 fails to close it.
Real cost gaps (10×+), at real volume. If a small model at $0.00001/q could match a $0.001/q model after SFT, and you're doing 10M queries/day, then fine-tuning pays back in days.

Step 3 · Decision rule

Pick the better model when (a) the accuracy gap is >5 points and (b) the cost gap is <3× and (c) you can ship today. Fine-tuning is a months-long commitment with maintenance overhead (re-tune on base-model updates, drift monitoring, eval-set rot). Fine-tune when the cost gap is >10× and you can articulate a verifiable signal (executor feedback, click-through, rubric) to drive RFT.

In this benchmark, the answer is "pick gpt-oss-120b." If a future month DeepSeek-V5 ships at $0.005/q with 98% accuracy and gpt-oss stays at 96% and $0.0002, the framework flips — but only because the cheap model is already within 2 points of the expensive one.

Serverless Model API vs dedicated deployment

Every model in this matrix is serverless — shared multi-tenant inference, per-token billing, no GPU allocation on your side. Baseten (and Fireworks) also sell dedicated deployments billed per GPU-minute. The choice is workload-shape-dependent, not platform-dependent.

Serverless wins

When the workload is bursty, mixed, or you're still picking a model

Variable traffic. 30k queries one day, 3k the next. Dedicated capacity sits idle and you pay anyway.
Model exploration. Swapping between 10 models for benchmarking is one config change on serverless. On dedicated, it's 10 deployments.
Cost discipline at low/medium volume. At the GitLab POC's 30k q/day on gpt-oss-120b, serverless = ~$168/month. A dedicated H100 = $1,440–2,160/month. Serverless wins by 8–13×.
No infra ops. No GPU autoscaling, no health checks, no replica fleet sizing.

Dedicated wins

When you need predictable tails or you can saturate a GPU

Tight P99 SLO. Serverless tail latency is governed by other tenants' bursts. Dedicated isolates you — no noisy neighbors.
Sustained high throughput. If a single GPU is >50% utilized 24/7, dedicated $/token beats serverless. Rough breakeven: ~1.5–3M tokens/hour per H100.
Custom weights. Fine-tuned, LoRA-adapted, or proprietary model — serverless catalogs only carry vendor-curated weights.
Data residency / compliance. Region-pin, BYOC, single-tenant guarantees — these are dedicated-tier features.
Predictable cost. Per-GPU-minute is a flat line; per-token is a stochastic function of traffic.

What this matrix can and can't tell you

The single-call sweep on the Model Sweep tab shows median behavior. The Latency under load section below probes the next layer — does the platform actually parallelize, or does the apparent low latency at concurrency=1 collapse at concurrency=10? That's the question dedicated vs serverless turns on, and it's why we ran the concurrency benchmark.

Phase 3 deferred: the original brief included deploying a smaller model via Truss as a dedicated deployment to measure the crossover empirically. Deferred to stay within the serverless-only budget — but the breakeven math above frames where you'd start looking.

Quantization · FP16 → FP8 → FP4 · what you're actually buying

Every Baseten Model API LLM ships quantized. That's not a cost-cutting hack — it's why these models are fast enough to be priced per-token at all. Here's what the numbers mean and how they map onto the latencies we measured.

FP16 / BF16 (the reference)

2 bytes per weight. A 120B-parameter model needs ~240 GB of GPU memory just to hold the weights — that's 3× H100 (80GB each) minimum. Highest fidelity to the model's training distribution. Rarely served at scale on cost-sensitive inference platforms because the GPU economics don't work outside research.

FP8

1 byte per weight, ~½ the memory of FP16. A 120B model fits in ~120 GB → still 2× H100 minimum, but TFLOPS roughly double on Hopper-class GPUs. Typical quality regression vs FP16 is 0–1.5% on most benchmarks. Baseten serves DeepSeek-V3.1 and MiniMax-M2.5 at FP8.

FP4 (the workhorse here)

½ byte per weight, ~¼ the memory of FP16. A 120B model fits in ~60 GB → comfortably on a single H100 or H200. Throughput on Blackwell-class hardware roughly 2× FP8 again. Quality regression typically 1–5%, very model-dependent. 8 of 10 Baseten models in this matrix are FP4.

What this buys the platform — and you

For the platform: more concurrent users per GPU, cheaper per-token economics, the ability to price gpt-oss-120b at $0.10/M input tokens at all.

For you: fast prefill (the schema-stuffed system prompt ingests faster), high per-GPU throughput (which is what makes the concurrency curves below shaped like they are), and the assumption that the model's accuracy ranking on your workload still holds after the quantization. Verify the last assumption with an eval — which is what this whole dashboard is.

What our matrix tells us about FP4-vs-FP8 in practice

The top performer (gpt-oss-120b · FP4 · 96% acc · 228ms TTFT) beats the second (DeepSeek-V3.1 · FP8 · 92% acc · 343ms TTFT) on both axes despite running at the more aggressive quantization. That's the right answer for the wrong-feeling reason: FP4 doesn't kneecap well-trained models, and the model-architecture / training-data delta dominates the quantization-precision delta on this workload. The implication for model selection is: don't filter by quantization tier. Filter by accuracy and TTFT on your own workload; the quantization is upstream of both.

Quantization data sourced from Baseten's /v1/models endpoint at sweep time (data/baseten_models_meta.json).

Latency under load · concurrency 1 / 5 / 10

Single-call latency hides queue depth. We fired the full 25-question gold set at the top three Baseten models (gpt-oss-120b, DeepSeek-V3.1, Kimi-K2.6) at concurrency 1, 5, and 10 — same agent, full repair loop, ThreadPoolExecutor. How does P50 / P90 / P99 move?

Loading latency-under-load data…

Method notes

Both sweeps use one-shot calls (no repair loop) — this isolates raw model capability from agent compensation. The agent's repair loop adds 1–2 retry rounds in production and improves Kimi K2.6 from 92% to its measured 23/25 ceiling.
Cost-per-query is computed from actually-measured token usage during the sweep × published per-token rates. No averaging across queries — the model's real prompt/completion sizes drive the cost.
"Free" pricing (GLM-5.1 on Baseten) is flagged in the matrix — likely a beta / promotional rate. Don't make platform-selection decisions on prices that may not be load-bearing.
TTFT here is wall-clock from request open → first SSE delta with non-empty payload. Includes network round-trip from the test machine; not isolated to model prefill.

Quality

2/10 → —, on the same model.

The customer's current prototype prompt (Convert this question to SQL: {question} — no schema, no JSON mode, no execute-and-repair) on the same Kimi K2.6 endpoint scored 2/10 on the dev set. The agent on the same model scored —. The lift comes from prompt engineering, not a model swap.

—

Dev set

—

Synthetic set

—

Combined

—

0/25

You re-ran live

Click any card below to verify yourself

All 25 questions · click any card to inspect

Click any card to load the question into the unified comparison tool on the Live Demo tab — pick models and prompt strategies, then run.

Dev set · 10 questions

Synthetic set · 15 questions

Evaluation log · actual terminal output

A real python scripts/agent_eval.py run from this repo — the full agent (compact schema, JSON mode, execute-and-repair) scored against the 25-question gold set, written to data/agent_eval.json. Rendered straight from that file.

loading…

Read column-by-column: qid · set · tier · pass/fail · agent latency in ms · repair turns used · diagnostic note.

Honest failure analysis

Every question the agent missed in the run above — with the SQL it actually generated.

Latency

The 3-second P50 question.

Run queries in Live Demo to see your own latency stats.

Live latency test · Kimi K2.6

Fires 3 sequential calls to Fireworks (Kimi K2.6) against the same dev question and reports the median you see right now. Calls also stream into the Live Demo activity log.

Model: Set your Fireworks key in the Live Demo tab.

Latency stats

Benchmark data · pre-recorded · 30 calls across 3 runs

p50

loading…

across 30 calls

p90

loading…

tail dominated by cold-start

p99

loading…

worst observed

% under 3 s

loading…

cleared the 3,000 ms target

Best clean dev-eval run hit median 2,901 ms separately (no pacing). Schema compaction (935 → 665 mean prompt tokens, −29%) is what made that possible — same accuracy.

Pre-recorded benchmark · 3 × 10-question runs on shared tier

Run	P50	P90	Accuracy	Note
1	9,067 ms	22,065 ms	10/10	shared tier under load
2	8,297 ms	39,673 ms	6/10	three queries hit 429s and never recovered
3	4,058 ms	17,935 ms	10/10	quieter shared tier · cleanest run

All three perf JSONs are checked into the repo (perf_compact_{1,2,3}.json). Median p50 across runs: 8,297 ms. Best clean p50: 4,058 ms.

Schema trimming · the latency lever

Most of the per-call latency on shared serverless is prefill, so the prompt-token count is the highest-leverage knob short of moving off shared. Two compounding wins:

Compact form

935 → 665 mean prompt tokens (−29%)

Render the schema as Album(AlbumId:integer[pk], Title:nvarchar, ArtistId:integer[fk→Artist.ArtistId]) instead of CREATE TABLE blocks. 62% schema-char reduction. No accuracy regression on the dev set.

Keyword + FK trim

68.5% mean reduction when it fires

Keyword-match table names against the question, expand via foreign keys to keep JOINs valid. Fires on 9/25 questions today (24.7% mean reduction across all 25, 68.5% on the 9 fired). On q_002 ("AC/DC albums"): 89.8% reduction.

Path to a firm sub-3s P50 SLO

Schema compaction (shipped). One-line-per-table form. This is what brought our best clean run to 2.9s.
Prompt caching on the static schema chunk — Fireworks's prefix cache is exactly this use case. Most of the 665 input tokens are identical across queries.
On-demand or dedicated deployment for predictable tail latency — shared serverless will always have load spikes. This is the SLO-grade lever.
Speculative decoding (Fireworks platform feature) — text-to-SQL has very predictable output structure. Ideal target.
Schema retrieval for production-scale schemas — on Chinook the 11-table dump is fine; on a 1,000-table customer DB the schema alone would break the latency budget.

Cost Explorer

Drag the volume. Watch the matrix re-rank.

One slider, every benchmarked model on both platforms. Cost-per-day is computed from actually-measured token usage during the sweep × published per-token rates. Build-vs-Buy context for the GitLab POC is the section below.

Volume

30,000 queries/day

Default 30,000 = 1,000 users × 30 q/day (GitLab POC projection).
Drag from 1k → 300k to see how the rankings shift.

All benchmarked models at your volume

Models with errored sweeps or unknown pricing are hidden. Sorted cheapest → most expensive. Highlighted row = best accuracy at this volume tier.

Model	Platform	Accuracy	$/query	$/day	$/month	$/year	vs GPT-5.4
Loading…

Build vs Buy · the GitLab POC framing

Originally the cost story for the customer email — same slider drives this section. Kept for context on the four-way choice (proprietary / direct API / self-host / managed platform).

All four columns recompute when you drag the slider. If you've made runs in the Live Demo, the token profile switches from the benchmark default to the average of your own runs — apples-to-apples across all four options.

Based on benchmark data (665 in / 78 out)

GPT-5.4 current

Per query$0.002862
Monthly$2,576
Annual$30,917

No infra to manage. Highest cost, proprietary lock-in, 7s latency reported by customer.

Direct model APIs

Moonshot, Alibaba, DeepSeek, etc.

Per queryVaries
Monthly (est. avg)$594
Annual (est. avg)$7,231

Per-token pricing is often cheaper going direct to the model provider — typically 20–40% cheaper than Fireworks. But for an enterprise deployment like GitLab's, the total cost includes more than tokens:

Vendor management. Each model comes from a different company (Moonshot in Beijing, Alibaba in Hangzhou, DeepSeek in Hangzhou). Multiple contracts, billing relationships, and SLA negotiations.
Data residency. Most leading open-source model providers are headquartered outside the US / EU. Enterprise customers may require guarantees about where their data is processed. Fireworks offers US / EU / APAC region pinning and BYOC.
Rate limits and scaling. Direct APIs often require prepaid recharge tiers and have manual scaling limits. Production traffic at 30k+ q/day needs automatic scaling.
No unified fine-tuning. Can't fine-tune across model families on one platform. Fireworks offers SFT / RFT / DPO for any hosted model at base serving price.
Reliability. Research labs ship models, not production infrastructure. No published SLAs. Fireworks processes 5T+ tokens/day.

Per-token rates sourced from artificialanalysis.ai/models ↗

Self-host open-source

GPUs needed1× H100
GPU/month (@ $2.50/hr)$1,800
+ 0.5 FTE ML infra$6,250
Monthly total$8,050
Annual$96,600

Full control at extreme scale. Requires ML infra team, GPU procurement, autoscaling, monitoring, model updates.

Fireworks (this POC)

Per query$0.000942
Monthly$848
Annual$10,176

Higher per-token cost than some direct model provider APIs — you're paying for production infrastructure, not just inference. You're paying for optimized inference (0.71s TTFT — fastest tracked provider), fine-tuning pipeline, region pinning, and platform convenience.

Includes: FireAttention + speculative decoding, 18+ regions across 8 cloud providers, US/EU/APAC region pinning, BYOC option, single vendor for all open-source models, fine-tuning (SFT / RFT / DPO) at base serving price, on-demand tier for latency SLOs, Enterprise support with dedicated account team.

At extreme scale (1M+ queries/day), self-hosting eventually wins on pure compute cost — but doesn't account for the engineering team needed to run it. At the projected first cohort volume (30k queries/day) and even at 10× growth, Fireworks is the clear choice: zero ops overhead, instant model swapping, and fine-tuning without a training cluster.

Production Roadmap

From shared serverless to a tuned production tier.

Each phase maps to a concrete Fireworks platform feature. The POC is Phase 0.

Fine-tuning pipeline · SFT → RFT → DPO

Prompt engineering can only carry text-to-SQL so far. The two failure modes we hit (top-per-group, date-arithmetic ambiguity) are exactly the shape that fine-tuning fixes — the model needs to be taught the output discipline of the dialect, not more facts.

Step 1 · SFT

Establish the format

Generate ~1k synthetic (question, gold_sql) pairs across diverse schemas (synthetic schema generators are cheap). Fine-tune Qwen3-8B or DeepSeek-Coder-V2-Lite to internalize: no surrogate keys in the output, idiomatic LIMIT, correct GROUP BY shape, full-name concatenation. This is what the system prompt is currently bandaging.

Why first

SFT is the only stage that can teach a new format from scratch.

Step 2 · RFT · the centerpiece

Use the executor as a reward

Text-to-SQL is the canonical RFT use case. The reward is mechanically verifiable — did the SQL run, and did its result-set match the gold answer? No human labels required. Every gold question we already have is a training example. Fireworks supports RFT directly.

Why this catches s_014

Top-per-group failure is silent — the SQL runs, just returns wrong rows. RFT punishes that exactly via the executor reward, which prompt engineering can't.

Step 3 · DPO

Polish the long tail

Direct Preference Optimization on production replay pairs: the working query a user accepted vs the rejected/repaired query they got first. Closes residual style drift after RFT plateaus, using data we'd already be collecting in production.

Why last

DPO needs preference pairs at scale — production gives us that for free.

Known limitations

Sub-3s P50 on shared serverless is not contractual. 23% of measured calls under 3s, 33% under 5s. Best clean p50 4,058 ms (run 3); a separate dev-eval pass hit 2,901 ms. Variance is load-side. SLO requires on-demand or dedicated deployment.
service_tier="priority" is silently accepted but does nothing. Fireworks priority is account-provisioned, not a request-time flag.
Schema trim is keyword-based, not retrieval. Fires on 9/25 questions (mean 68.5% reduction on those, 24.7% across all). For a 1,000-table customer DB this isn't enough — needs a real retrieval step or schema caching with a stable cache id.
format_answer_summary is heuristic. Renders rows as <entity> (col=val, col=val). Values are exact; wording may not match the gold set's phrasing. Eval correctness uses rows_match against the SQL result, not the answer field.
s_013 is run-to-run flaky at temperature 0. "Last 6 months of available data" is genuinely ambiguous; counted as a failure to be conservative.
Qwen3-8B's 50% should be read with care. Some failures (q_001 units vs revenue) are real semantic misses; others would likely improve with one or two few-shot examples. Plausible as a "fast first attempt" inside a router, not a solo replacement.
Pricing for Qwen3-8B is an estimate in perf.py's PRICING_USD_PER_M dict; verify before publishing any cheap-model cost number.
No multi-turn perf measurement. Single-turn cost number assumes a fresh conversation; production multi-turn adds ~200 tokens per prior exchange, capped at 6 turns by the CLI's HISTORY_TURN_LIMIT.

Same 25 queries. Same database.Different models, different platforms.

What this benchmark answers

Architecture · seven steps

15 LLMs · 25 questions · one workload.

$/query vs accuracy — Pareto view

Fireworks vs Baseten · head-to-head.

Best model on each platform

What's actually being compared

What the matrix actually tells us.

Fine-tune a cheap model, or pay for a better one?

Step 1 · Read the matrix before reaching for SFT

Step 2 · The cases where fine-tuning does win

Step 3 · Decision rule

Serverless Model API vs dedicated deployment

When the workload is bursty, mixed, or you're still picking a model

When you need predictable tails or you can saturate a GPU

What this matrix can and can't tell you

Quantization · FP16 → FP8 → FP4 · what you're actually buying

FP16 / BF16 (the reference)

FP8

FP4 (the workhorse here)

What this buys the platform — and you

What our matrix tells us about FP4-vs-FP8 in practice

Latency under load · concurrency 1 / 5 / 10

Method notes

2/10 → —, on the same model.

All 25 questions · click any card to inspect

Evaluation log · actual terminal output

Honest failure analysis

The 3-second P50 question.

Live latency test · Kimi K2.6

Latency stats

Pre-recorded benchmark · 3 × 10-question runs on shared tier

Schema trimming · the latency lever

935 → 665 mean prompt tokens (−29%)

68.5% mean reduction when it fires

Path to a firm sub-3s P50 SLO

Drag the volume. Watch the matrix re-rank.

All benchmarked models at your volume

Build vs Buy · the GitLab POC framing

GPT-5.4 current

Direct model APIs

Self-host open-source

Fireworks (this POC)

From shared serverless to a tuned production tier.

Fine-tuning pipeline · SFT → RFT → DPO

Establish the format

Use the executor as a reward

Polish the long tail

Known limitations

sql.js + Chinook.db running in your browser.

Unified comparison tool

Schema · what the model sees

Same 25 queries. Same database.
Different models, different platforms.