Methodology

A hardware-envelope and inference-benchmark protocol for consumer AMD RDNA 4 (gfx1201). The full document lives in the repository as METHODOLOGY.md. This page summarises the part that is public.

Scope and goals

The platform answers two questions. They are kept apart on purpose.

Engineering envelope (public). Which configurations of (model, quantization, max_model_len, gpu_memory_utilization, kv_cache_dtype, tensor_parallel_size) load on this hardware. For each, what is the VRAM footprint, the KV-cache capacity, and the per-GPU thermal and power profile under load.
Scaling sweep (embargoed until publication). At the best validated configurations, how throughput, latency, and power efficiency scale with concurrent request count N. Where the throughput knee sits. How these compare across quantizations, architectures, and backends.
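One way to make the Phase 1 search space explicit is to treat each configuration as an immutable record and enumerate the grid. The sketch below is illustrative only. The field names follow the tuple listed above, but the class name EnvelopeConfig, the helper config_grid, and every value in the usage note are hypothetical placeholders, not configurations taken from the repository.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EnvelopeConfig:
    """One candidate configuration for the hardware-envelope phase (hypothetical type)."""
    model: str
    quantization: str
    max_model_len: int
    gpu_memory_utilization: float
    kv_cache_dtype: str
    tensor_parallel_size: int


def config_grid(models, quantizations, max_model_lens,
                gpu_memory_utilizations, kv_cache_dtypes, tensor_parallel_sizes):
    """Return the Cartesian product of all parameter values as EnvelopeConfig records."""
    return [
        EnvelopeConfig(*combo)
        for combo in product(models, quantizations, max_model_lens,
                             gpu_memory_utilizations, kv_cache_dtypes,
                             tensor_parallel_sizes)
    ]


# Placeholder values only; none of these are taken from the benchmark itself.
grid = config_grid(
    models=["example-model"],
    quantizations=["quant-a", "quant-b"],
    max_model_lens=[4096, 8192],
    gpu_memory_utilizations=[0.90],
    kv_cache_dtypes=["auto"],
    tensor_parallel_sizes=[1, 2],
)
```

Freezing the record makes each configuration hashable, so it can serve as a dictionary key when load results are collected per configuration.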

Two-phase experimental design

Phase 1 · Hardware envelope

Determine what loads and what it costs. VRAM footprint, KV-cache capacity, max-concurrency, and per-GPU thermal and power profile per configuration. Public.

Phase 2 · Scaling sweep

At the best validated configurations, sweep concurrent request count N over a standard grid. Characterise throughput, latency, and efficiency scaling. Numerical results embargoed.

Statistical protocol, Tier A, n = 10

From the Run-3 / v0.5 series onward, each (quant, TP, N) cell runs REPS = 10 times. Each rep is a fresh vLLM process with a full model load and a cooldown between runs. Results are reported as descriptive statistics over the 10 reps, never as a single observation.
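The rep loop can be sketched as follows, under stated assumptions. The function run_cell, its parameters, and the 60-second default cooldown are hypothetical, and the command that launches the benchmark is left to the caller because the actual runner invocation is not given on this page. The only property the sketch demonstrates is that every rep runs in a new operating-system process, with a pause between reps.

```python
import subprocess
import sys
import time

REPS = 10  # reps per (quant, TP, N) cell, as stated above


def run_cell(command, reps=REPS, cooldown_s=60.0, timeout_s=None):
    """Run one cell: each rep is a fresh process, followed by a cooldown.

    `command` is an argument list for the benchmark runner (assumed, not the
    repository's actual command). `cooldown_s` is an illustrative default.
    """
    results = []
    for rep in range(reps):
        started = time.monotonic()
        try:
            proc = subprocess.run(command, capture_output=True, text=True,
                                  timeout=timeout_s)
            ok, stdout = proc.returncode == 0, proc.stdout
        except subprocess.TimeoutExpired:
            ok, stdout = False, ""
        results.append({
            "rep": rep,
            "ok": ok,  # failed reps are kept so the cell can be flagged later
            "wall_s": time.monotonic() - started,
            "stdout": stdout,
        })
        if rep < reps - 1:
            time.sleep(cooldown_s)  # let the GPUs cool before the next load
    return results


# Usage with a stand-in command; a real run would launch the benchmark runner.
demo = run_cell([sys.executable, "-c", "print('ok')"], reps=3, cooldown_s=0.0)
```

Keeping failed reps in the result list, rather than retrying silently, matches the rule that a cell with fewer than 10 valid reps is aggregated over what completed and flagged.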

Central tendency: median. The throughput and latency distributions are skewed by cold-cache, thermal, and preemption outliers, so the arithmetic mean would mislead.
Dispersion and tails: p95, p99, min, max (IQR where useful). The deployment-relevant figure is the p99 of per-request latency, not the mean.
n_runs per cell is recorded explicitly. A cell with fewer than 10 valid reps is aggregated over what completed and flagged.
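A minimal sketch of the per-cell summary described above, using only the Python standard library. The function names, the dictionary keys, and the linear-interpolation percentile rule are assumptions, since the page does not say which percentile method the analysis scripts use. Missing reps are represented here as None or NaN, which is also an assumption.

```python
import math
from statistics import median

REPS = 10  # expected reps per cell


def percentile(values, q):
    """Percentile by linear interpolation between order statistics (q in [0, 100])."""
    data = sorted(values)
    if len(data) == 1:
        return data[0]
    pos = (len(data) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def summarise_cell(values, expected_reps=REPS):
    """Descriptive statistics over the valid observations of one cell."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    if not valid:
        raise ValueError("cell has no valid reps")
    return {
        "n_runs": len(valid),                    # recorded explicitly
        "flagged": len(valid) < expected_reps,   # incomplete cell
        "median": median(valid),
        "q1": percentile(valid, 25),
        "q3": percentile(valid, 75),
        "p95": percentile(valid, 95),
        "p99": percentile(valid, 99),
        "min": min(valid),
        "max": max(valid),
    }


# Nine valid reps and one failed rep: the cell is summarised and flagged.
summary = summarise_cell([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, None])
```

With only 10 values per cell, p95 and p99 sit very close to the maximum. They are most informative when the input is the pooled per-request latencies rather than one value per rep.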

When the text calls a difference between configurations meaningful, the family-wise error rate across the N-ladder is controlled with Holm–Bonferroni¹. Without that test, a comparison stays descriptive and claims no significance.
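For reference, the Holm step-down procedure¹ can be sketched in a few lines. The function name holm_bonferroni and the default alpha of 0.05 are assumptions. The page does not state the significance level used.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one reject/keep decision per hypothesis, in input order.

    P-values are visited from smallest to largest. The k-th smallest
    (k = 1..m) is compared with alpha / (m - k + 1), and the procedure
    stops at the first p-value that exceeds its threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank is k - 1
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Four comparisons along a hypothetical N-ladder.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005], alpha=0.05)
```

In this example the two smallest p-values pass their thresholds (0.005 ≤ 0.0125 and 0.01 ≤ 0.0167), the third does not (0.03 > 0.025), and the procedure stops there.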

Statistical analysis and presentation

Tier-A cells are summarised and plotted as distributions, not point estimates. This follows current small-sample reporting guidance: Weissgerber et al. 2015, PLoS Biology ("show the data")², the Tukey box convention³, and the SAMPL guidelines⁴.

Central tendency and dispersion. Each cell is reported as median plus interquartile range (IQR, Q1–Q3) with min/max, and p95/p99 for per-request latency. Arithmetic mean ± SD is not the primary summary, because the distributions are right-skewed and n is small.
Default plot: scaling line and band. Median of the metric vs N (log-x), connected per TP, with a shaded min–max band over the 10 reps. Run-to-run variance is usually very small (CV ≈ 1 %), so a box-and-whisker would collapse below marker size. A box-and-whisker plot (box = IQR, whiskers reaching the most extreme points within 1.5 × IQR of the box) is kept for cells with larger spread, where the distribution shape matters.
Comparisons and inference. Where a difference is asserted as more than descriptive, distributions are compared with a non-parametric test (Mann–Whitney U / Wilcoxon)⁵,⁶. The family-wise error rate is controlled with Holm–Bonferroni¹. An effect size (median ratio or Hodges–Lehmann shift)⁷ is reported alongside any p-value.
Reproducibility. matplotlib (Agg backend), Python statistics / numpy. Every figure regenerates deterministically from the locally-held raw CSV.

These conventions are public. The per-model numerical values and the rendered figures (scaling-band and box-whisker plots) are embargoed pending peer-reviewed publication. They are not reproduced anywhere on this site.

What the benchmark does not measure

The benchmark measures throughput, latency, thermal envelope, and power efficiency under concurrent load. It does not measure model quality or clinical accuracy. The claims stop at the hardware.

Embargo policy

The platform separates engineering content from research content. Engineering is public and released immediately. Research is held until paper acceptance.

Public

Hardware-envelope tables (load yes/no, peak VRAM, KV tokens, max-concurrency)
Engineering workarounds (env vars, enforce_eager, ROCR_VISIBLE_DEVICES filter)
Sanity throughput numbers (single-prompt, memory-bandwidth baseline)
All scripts (runners, orchestrators, analysis, plotting)
The methodology document and the AI-disclosure structure
Knee-position observations (the knee is compute-saturation-driven)

Embargoed until publication

Phase 2 scaling tables: throughput@N for N ∈ {10…1000}
Latency distributions: P50/P95/P99 per N
KV-cache utilisation curves
Cross-model comparative claims with concrete numbers
Quantization-vs-quantization scaling-law interpretations
Paper-bound figures (scaling curves, thermal galleries, efficiency curves) with concrete y-axis values
Bielik and PLLuM scaling numbers, under a stricter hold. The Polish AI community is small and scoop risk is higher. PLLuM-70B sweep results stay local-only until co-author review and paper submission.

Every result-generating step is labelled at the time of generation as either "EMBARGO: paper figure" or "PUBLIC: engineering note for repo". This prevents accidental commits of paper-bound numbers.
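A label check of this kind could be automated along the following lines. This is a sketch under assumptions: the page does not say where the label is stored or how it is enforced, so the convention that it sits on the first non-empty line of an artifact (optionally behind a "#" comment marker) and the function names read_label and may_commit are hypothetical.

```python
PUBLIC_LABEL = "PUBLIC: engineering note for repo"
EMBARGO_LABEL = "EMBARGO: paper figure"


def read_label(text):
    """Return the label on the first non-empty line of `text`, or None."""
    for line in text.splitlines():
        stripped = line.strip().lstrip("#").strip()
        if not stripped:
            continue  # skip blank lines and bare comment markers
        for label in (PUBLIC_LABEL, EMBARGO_LABEL):
            if stripped.startswith(label):
                return label
        return None  # first content line carries no recognised label
    return None


def may_commit(text):
    """Fail closed: only artifacts explicitly labelled PUBLIC may be committed."""
    return read_label(text) == PUBLIC_LABEL
```

Failing closed means an unlabelled file is treated like an embargoed one, so a forgotten label blocks a commit instead of leaking a number.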

References

1. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70. jstor.org/stable/4615733
2. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol. 2015;13(4):e1002128. doi:10.1371/journal.pbio.1002128
3. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley; 1977.
4. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the SAMPL Guidelines. Int J Nurs Stud. 2015;52(1):5–9. doi:10.1016/j.ijnurstu.2014.09.006
5. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60. doi:10.1214/aoms/1177730491
6. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1(6):80–83. doi:10.2307/3001968
7. Hodges JL, Lehmann EL. Estimates of location based on rank tests. Ann Math Stat. 1963;34(2):598–611. doi:10.1214/aoms/1177704172