Methodology
A hardware-envelope and inference-benchmark protocol for consumer AMD RDNA 4 (gfx1201). The full document lives in the repository as METHODOLOGY.md. This page summarises the part that is public.
Scope and goals
The platform answers two questions. They are kept apart on purpose.
- Engineering envelope (public). Which configurations of (model, quantization, max_model_len, gpu_memory_utilization, kv_cache_dtype, tensor_parallel_size) load on this hardware. For each, what is the VRAM footprint, the KV-cache capacity, and the per-GPU thermal and power profile under load.
- Scaling sweep (embargoed until publication). At the best validated configurations, how throughput, latency, and power efficiency scale with concurrent request count N. Where the throughput knee sits. How these compare across quantizations, architectures, and backends.
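As an illustration of the envelope question, the six-field configuration tuple can be held as a small record and expanded into a candidate grid. This is a minimal sketch, not the repository's runner: the class name `EnvelopeConfig`, the helper `candidate_grid`, and every value in the usage example (model name, quantization labels, lengths) are hypothetical placeholders. Only the six field names come from the text above.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class EnvelopeConfig:
    """One candidate configuration; field names follow the tuple in the text."""
    model: str
    quantization: Optional[str]
    max_model_len: int
    gpu_memory_utilization: float
    kv_cache_dtype: str
    tensor_parallel_size: int


def candidate_grid(
    models: Sequence[str],
    quants: Sequence[Optional[str]],
    max_lens: Sequence[int],
    mem_utils: Sequence[float],
    kv_dtypes: Sequence[str],
    tp_sizes: Sequence[int],
) -> List[EnvelopeConfig]:
    """Cartesian product of all axes; each entry is one load attempt."""
    return [
        EnvelopeConfig(*combo)
        for combo in product(models, quants, max_lens, mem_utils, kv_dtypes, tp_sizes)
    ]


# Illustrative values only (assumptions, not taken from the benchmark).
grid = candidate_grid(
    models=["example-org/example-model"],
    quants=[None, "example-quant"],
    max_lens=[4096, 8192],
    mem_utils=[0.90],
    kv_dtypes=["auto"],
    tp_sizes=[1, 2],
)
print(len(grid))  # number of configurations to attempt
```

Each entry would then be paired with its measured outcome (loads yes/no, peak VRAM, KV-cache capacity, thermal and power profile) to form one row of the envelope table.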
Two-phase experimental design
Phase 1 · Hardware envelope
Determine what loads and what it costs. VRAM footprint, KV-cache capacity, max-concurrency, and per-GPU thermal and power profile per configuration. Public.
Phase 2 · Scaling sweep
At the best validated configurations, sweep concurrent request count N over a standard grid. Characterise throughput, latency, and efficiency scaling. Numerical results embargoed.
Statistical protocol, Tier A, n = 10
From the Run-3 / v0.5 series onward, each (quant, TP, N) cell runs REPS = 10 times. Each rep is a fresh vLLM process with a full model load and a cooldown between runs. Results are reported as descriptive statistics over the 10 reps, never a single observation.
- Central tendency: median. The throughput and latency distributions are skewed by cold-cache, thermal, and preemption outliers. The arithmetic mean would mislead.
- Dispersion and tails: p95, p99, min, max (IQR where useful). The deployment-relevant figure is the p99 of per-request latency, not the mean.
- n_runs per cell is recorded explicitly. A cell with fewer than 10 valid reps is aggregated over what completed and flagged.
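The per-cell aggregation described above can be sketched with the standard library alone. The function names, the output keys, and the convention that a failed rep is recorded as `None` are assumptions made for illustration. The percentile rule used here is linear interpolation, which may differ from the one the project's analysis scripts use.

```python
import statistics
from typing import Dict, Optional, Sequence

REPS = 10  # target repetitions per cell, as stated in the protocol


def percentile(values: Sequence[float], pct: int) -> float:
    """Linear-interpolation percentile for an integer pct in 1..99."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    return statistics.quantiles(ordered, n=100, method="inclusive")[pct - 1]


def summarise_cell(values: Sequence[Optional[float]]) -> Dict[str, object]:
    """Descriptive statistics over the valid reps of one (quant, TP, N) cell."""
    valid = [v for v in values if v is not None]  # None marks a failed rep (assumption)
    if not valid:
        return {"n_runs": 0, "flagged": True}
    if len(valid) > 1:
        q1, _, q3 = statistics.quantiles(valid, n=4, method="inclusive")
    else:
        q1 = q3 = valid[0]
    return {
        "n_runs": len(valid),
        "flagged": len(valid) < REPS,  # incomplete cells are aggregated but flagged
        "median": statistics.median(valid),
        "q1": q1,
        "q3": q3,
        "p95": percentile(valid, 95),
        "p99": percentile(valid, 99),
        "min": min(valid),
        "max": max(valid),
    }
```

A cell with two failed reps still yields a summary, with `n_runs` equal to 8 and `flagged` set, so an incomplete cell is never mistaken for a full one.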
When the text calls a difference between configurations meaningful, the family-wise error rate across the N-ladder is controlled with Holm–Bonferroni [1]. Without that test, a comparison stays descriptive and claims no significance.
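For reference, the Holm step-down procedure is short enough to write out. This is a generic sketch of the method, not the project's analysis code. The function name and the significance level `alpha = 0.05` are assumptions, since the text above does not state the level used.

```python
from typing import List, Sequence


def holm_bonferroni(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject/keep decision per hypothesis, in input order.

    P-values are visited from smallest to largest. The k-th smallest
    (k = 0, 1, ...) is compared with alpha / (m - k). The first failure
    stops the procedure, and every remaining hypothesis is retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# One p-value per rung of the N-ladder (made-up numbers for illustration).
print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

With these four made-up p-values, the two smallest pass their thresholds (0.0125 and about 0.0167) and the third fails at 0.025, so only two of the four comparisons would be called significant.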
Statistical analysis and presentation
Tier-A cells are summarised and plotted as distributions, not point estimates. This follows current small-sample reporting guidance: Weissgerber et al. 2015, PLoS Biology ("show the data") [2], the Tukey box convention [3], and the SAMPL guidelines [4].
- Central tendency and dispersion. Each cell is reported as median plus interquartile range (IQR, Q1–Q3) with min/max, and p95/p99 for per-request latency. Arithmetic mean ± SD is not the primary summary. The distributions are right-skewed and small-n.
- Default plot: scaling line and band. Median of the metric vs N (log-x), connected per TP, with a shaded min–max band over the 10 reps. Run-to-run variance is usually very small (CV ≈ 1 %). A box-and-whisker would collapse below marker size. A box-and-whisker plot (box = IQR, whiskers = 1.5 × IQR) is kept for cells with larger spread. There the distribution shape matters.
- Comparisons and inference. Where a difference is asserted as more than descriptive, distributions are compared with a non-parametric test (Mann–Whitney U / Wilcoxon) [5,6]. The family-wise error rate is controlled with Holm–Bonferroni [1]. An effect size (median ratio or Hodges–Lehmann shift) [7] is reported alongside any p-value.
- Reproducibility. matplotlib (Agg backend), Python statistics / numpy. Every figure regenerates deterministically from the locally-held raw CSV.
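The two effect sizes named in the list, and the U statistic behind the Mann–Whitney test, can be computed directly from two sets of reps. The sketch below is illustrative: the function names and sample values are invented, and it stops at the statistic. A p-value would normally come from a statistics library such as SciPy's `scipy.stats.mannwhitneyu`, a dependency the text above does not mention.

```python
import statistics
from typing import Sequence


def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> float:
    """U statistic for sample a: pairs with a > b count 1, ties count 0.5."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in a
        for y in b
    )


def hodges_lehmann_shift(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Hodges-Lehmann estimate: median of all pairwise differences a - b."""
    return statistics.median(x - y for x in a for y in b)


def median_ratio(a: Sequence[float], b: Sequence[float]) -> float:
    """Ratio of medians, read as 'a is this many times b'."""
    return statistics.median(a) / statistics.median(b)


# Made-up throughput reps for two configurations (illustration only).
config_a = [11.0, 12.0, 13.0]
config_b = [1.0, 2.0, 3.0]
print(mann_whitney_u(config_a, config_b))
print(hodges_lehmann_shift(config_a, config_b))
print(median_ratio(config_a, config_b))
```

Reporting the shift or ratio next to the p-value keeps the size of a difference visible even when, with n = 10 per cell, the test itself has limited resolution.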
What the benchmark does not measure
The benchmark measures throughput, latency, thermal envelope, and power efficiency under concurrent load. It does not measure model quality or clinical accuracy. The claims stop at the hardware.
Embargo policy
The platform separates engineering content from research content. Engineering is public and released immediately. Research is held until paper acceptance.
Public
- Hardware-envelope tables (load yes/no, peak VRAM, KV tokens, max-concurrency)
- Engineering workarounds (env vars, enforce_eager, ROCR_VISIBLE_DEVICES filter)
- Sanity throughput numbers (single-prompt, memory-bandwidth baseline)
- All scripts (runners, orchestrators, analysis, plotting)
- The methodology document and the AI-disclosure structure
- Knee-position observations (the knee is compute-saturation-driven)
Embargoed until publication
- Phase 2 scaling tables: throughput@N for N ∈ {10…1000}
- Latency distributions: P50/P95/P99 per N
- KV-cache utilisation curves
- Cross-model comparative claims with concrete numbers
- Quantization-vs-quantization scaling-law interpretations
- Paper-bound figures (scaling curves, thermal galleries, efficiency curves) with concrete y-axis values
- Bielik and PLLuM scaling numbers, under a stricter hold. The Polish AI community is small and scoop risk is higher. PLLuM-70B sweep results stay local-only until co-author review and paper submission.
Every result-generating step is labelled at the time of generation as either "EMBARGO: paper figure" or "PUBLIC: engineering note for repo". This prevents accidental commits of paper-bound numbers.
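One way such labels could be enforced mechanically is a small check run before anything is committed. This is a hypothetical sketch, not the repository's tooling. It assumes the label sits on the first non-empty line of each artifact, optionally behind a `#` comment marker, and it treats anything unlabelled as unsafe. Only the two label strings come from the text above.

```python
from typing import Optional

EMBARGO_LABEL = "EMBARGO: paper figure"
PUBLIC_LABEL = "PUBLIC: engineering note for repo"


def label_of(text: str) -> Optional[str]:
    """Return the label on the first non-empty line, or None if there is none."""
    for line in text.splitlines():
        stripped = line.strip().lstrip("#").strip()
        if not stripped:
            continue
        if stripped in (EMBARGO_LABEL, PUBLIC_LABEL):
            return stripped
        return None  # first real line is not a label
    return None


def is_safe_to_commit(text: str) -> bool:
    """Only explicitly public artifacts pass; embargoed or unlabelled ones do not."""
    return label_of(text) == PUBLIC_LABEL


print(is_safe_to_commit("# PUBLIC: engineering note for repo\nload,ok\n"))
print(is_safe_to_commit("# EMBARGO: paper figure\nN,throughput\n"))
```

Defaulting to "unsafe" for unlabelled files means a forgotten label blocks a commit instead of leaking a number.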
References
1. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70. jstor.org/stable/4615733
2. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol. 2015;13(4):e1002128. doi:10.1371/journal.pbio.1002128
3. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley; 1977.
4. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the SAMPL Guidelines. Int J Nurs Stud. 2015;52(1):5–9. doi:10.1016/j.ijnurstu.2014.09.006
5. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60. doi:10.1214/aoms/1177730491
6. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1(6):80–83. doi:10.2307/3001968
7. Hodges JL, Lehmann EL. Estimates of location based on rank tests. Ann Math Stat. 1963;34(2):598–611. doi:10.1214/aoms/1177704172