Results

This page shows the engineering envelope: which models load, what they weigh, how much KV cache is left, and how many requests fit. These figures are already on the Zenodo record and the model cards.

Performance numbers are held back. Throughput, latency, energy per token, and cross-model comparisons stay private until the papers are published. The plots that show them are not embedded here. What follows is load behaviour and capacity. That part is public.

Engineering envelope

Measured with vLLM 0.19.0+rocm721 and enforce_eager. The single-card variants run at TP=1 and max_seq_len 2048. The 70B family runs at TP=2 and max_seq_len 8192, so its context, weight, and concurrency sit on a different basis. See the note below the table. Max-concurrency is the capacity multiple vLLM reports at load time. It is a load figure, not a throughput measurement.
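To make the max-concurrency multiple concrete, here is a minimal sketch of the arithmetic. It assumes the figure is the KV cache size in tokens divided by the configured context length, which is how vLLM's load log phrases it ("maximum concurrency for N tokens per request"). The function name and the token count are illustrative and are not values from the table.

```python
def max_concurrency(kv_cache_tokens: int, max_seq_len: int) -> float:
    """Capacity multiple: how many full-length requests fit in the KV cache.

    Assumption: every request is charged the full context length, so the
    result is a load-time capacity bound, not a throughput measurement.
    """
    if kv_cache_tokens < 0 or max_seq_len <= 0:
        raise ValueError("kv_cache_tokens must be >= 0 and max_seq_len > 0")
    return kv_cache_tokens / max_seq_len


# Illustrative KV cache size only (hypothetical, not a measured figure).
kv_tokens = 65536

# The same cache gives different multiples at different context lengths.
print(f"ctx 2048: {max_concurrency(kv_tokens, 2048):.2f}x")
print(f"ctx 8192: {max_concurrency(kv_tokens, 8192):.2f}x")
```

Because the divisor is the context length, a multiple taken at max_seq_len 8192 and one taken at 2048 do not describe the same quantity of work per request.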

Family	Model	Quant	TP	Ctx	Weight (GiB)	KV budget (GiB)	Max conc.×	Gate-1	tok/s 🔒	p99 lat 🔒	mWh/tok 🔒

🔒 The locked columns hold back throughput, latency, and energy per token until the papers are published. They fill in by a data swap. The layout stays the same. The single-card rows are measured at max_seq_len 2048. The 70B row is from its model card at TP=2 and max_seq_len 8192. Its weight is the total across two cards. Its KV budget is reported in tokens there, so the GiB column shows n/a. Its max-concurrency is not directly comparable to the 2048 rows. Bielik-11B is still running its sweep.

Each single-card variant runs on one consumer card, which leaves the second R9700 free for a parallel sweep or training. The 70B family needs both cards (TP=2). Full per-variant figures are on each model card.

Gate 1 sanity

Each single-card variant passed 5/5 on a Polish-clinical prompt grid. The grid has one factual prompt and four clinical ones: definitional, syndrome, instructional, procedural. Prompts go through /v1/completions with the chat template bypassed. This checks that the model runs and stays coherent. It does not score quality. Raw outputs sit under environment/sanity-tests/ in the repo.
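A Gate 1 run of this shape could be sketched roughly as follows. This is an assumption-laden illustration, not the repo's harness: the server URL, model name, max_tokens value, and helper names are hypothetical, the prompt texts are left as placeholders, and the pass or fail verdicts are assumed to be assigned by whoever reviews the outputs. The only parts taken from the text are the five prompt kinds and the use of /v1/completions, which takes a raw prompt string rather than chat messages.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: locally served OpenAI-compatible API
MODEL = "example-model"             # hypothetical model name

# The five prompt kinds named above; the actual prompt texts are not reproduced here.
PROMPT_KINDS = ["factual", "definitional", "syndrome", "instructional", "procedural"]


def build_completion_request(prompt: str, model: str = MODEL, max_tokens: int = 128) -> dict:
    """Build a raw-text completion payload.

    A plain "prompt" string is sent instead of "messages", so no chat
    template is applied on the server side.
    """
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def run_prompt(prompt: str) -> str:
    """POST one prompt to /v1/completions and return the generated text."""
    body = json.dumps(build_completion_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["choices"][0]["text"]


def gate1_passed(verdicts: list[bool]) -> bool:
    """Gate 1 passes only when all five prompts are judged coherent (5/5)."""
    return len(verdicts) == len(PROMPT_KINDS) and all(verdicts)
```

Calling run_prompt once per prompt and recording a verdict for each output gives the list that gate1_passed checks. The check says nothing about answer quality, only that all five outputs were accepted.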

What we found

The gfx1201 engineering findings behind these figures are written up on the Conclusions page: enforce_eager being mandatory, AWQ running slower than BF16 here, and the sanity chat-template bypass. The write-up contains no performance numbers; those stay embargoed.

Why no plots

The scaling plots put concrete per-N values on their axes. That makes them paper-bound, so they wait for acceptance. Anything not shown above as public is held back too. The method page covers the plot design.