NaviMed-UMB

Local clinical-AI inference on consumer AMD GPUs. Polish LLM releases and an open benchmark method.

Methodology

A hardware-envelope and inference-benchmark protocol for consumer AMD RDNA 4 (gfx1201). The full document lives in the repository as METHODOLOGY.md. This page summarises the part that is public.

Scope and goals

The platform answers two questions. They are kept apart on purpose.

  1. Engineering envelope (public). Which configurations of (model, quantization, max_model_len, gpu_memory_utilization, kv_cache_dtype, tensor_parallel_size) load on this hardware. For each, what is the VRAM footprint, the KV-cache capacity, and the per-GPU thermal and power profile under load.
  2. Scaling sweep (embargoed until publication). At the best validated configurations, how throughput, latency, and power efficiency scale with concurrent request count N. Where the throughput knee sits. How these compare across quantizations, architectures, and backends.
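The envelope question can be pictured as a grid over the six parameters named in item 1. The sketch below is illustrative only: the class name `EnvelopeConfig`, the helper `enumerate_grid`, and every value in the example grid are placeholders, not the configurations the platform actually tests.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EnvelopeConfig:
    """One candidate configuration; field names follow the tuple in item 1."""
    model: str
    quantization: str
    max_model_len: int
    gpu_memory_utilization: float
    kv_cache_dtype: str
    tensor_parallel_size: int


def enumerate_grid(models, quantizations, max_model_lens,
                   gpu_mem_utils, kv_dtypes, tp_sizes):
    """Return the Cartesian product of all parameter choices."""
    return [
        EnvelopeConfig(*combo)
        for combo in product(models, quantizations, max_model_lens,
                             gpu_mem_utils, kv_dtypes, tp_sizes)
    ]


# Placeholder values for illustration only (not taken from the benchmark).
grid = enumerate_grid(
    models=["example-model"],
    quantizations=["fp16", "awq"],
    max_model_lens=[4096, 8192],
    gpu_mem_utils=[0.90],
    kv_dtypes=["auto", "fp8"],
    tp_sizes=[1, 2],
)
print(len(grid), "candidate configurations")
```

Each entry would then be attempted once in Phase 1 to record whether it loads and what it costs.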

Two-phase experimental design

Phase 1 · Hardware envelope

Determine what loads and what it costs. VRAM footprint, KV-cache capacity, max-concurrency, and per-GPU thermal and power profile per configuration. Public.

Phase 2 · Scaling sweep

At the best validated configurations, sweep concurrent request count N over a standard grid. Characterise throughput, latency, and efficiency scaling. Numerical results embargoed.

Statistical protocol, Tier A, n = 10

From the Run-3 / v0.5 series onward, each (quant, TP, N) cell runs REPS = 10 times. Each rep is a fresh vLLM process with a full model load and a cooldown between runs. Results are reported as descriptive statistics over the 10 reps, never a single observation.
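A minimal sketch of a per-cell summary follows. It assumes the ten measurements for one (quant, TP, N) cell are already collected as a list of numbers. The function name `summarise_cell` and the choice of statistics are illustrative; the exact set reported in the full protocol is not reproduced here.

```python
import statistics

REPS = 10  # repetitions per (quant, TP, N) cell, as stated in the protocol


def summarise_cell(values):
    """Descriptive statistics over one cell; refuses incomplete cells."""
    if len(values) != REPS:
        raise ValueError(f"expected {REPS} reps, got {len(values)}")
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
    }


# Synthetic numbers for illustration only, not benchmark results.
example = summarise_cell([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
print(example["median"], example["min"], example["max"])
```

Rejecting any cell with fewer than ten values keeps a partial run from being reported as if it were a complete one.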

When the text calls a difference between configurations meaningful, the family-wise error rate across the N-ladder is controlled with Holm–Bonferroni [1]. Without that test, a comparison stays descriptive and claims no significance.
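For readers unfamiliar with the procedure, Holm's step-down method can be sketched as below. The function name `holm_bonferroni` and the default `alpha = 0.05` are assumptions for illustration. The p-values are taken as given, one per rung of the N-ladder, from whichever pairwise test the analysis uses.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sort the p-values in ascending order and compare the k-th smallest
    (k = 0, 1, ...) with alpha / (m - k). Stop at the first failure;
    every later hypothesis is retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: no further rejections once one test fails
    return reject


# Synthetic p-values for illustration only.
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
print(decisions)
```

In this synthetic example the two smallest p-values pass their thresholds (0.0125 and about 0.0167), the third (0.03 against 0.025) does not, and the procedure stops there.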

Statistical analysis and presentation

Tier-A cells are summarised and plotted as distributions, not point estimates. This follows current small-sample reporting guidance: Weissgerber et al. 2015, PLoS Biology ("show the data") [2], the Tukey box convention [3], and the SAMPL guidelines [4].
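The Tukey box convention can be sketched as follows: the box spans the quartiles, the whiskers reach the most extreme observations within 1.5 × IQR of the box, and anything beyond is drawn as an individual point. Several quartile definitions exist; this sketch assumes the "inclusive" method from Python's `statistics` module, which may differ from the one used for the embargoed figures. The function name `tukey_box` is hypothetical.

```python
import statistics


def tukey_box(values):
    """Compute box, whisker, and outlier values in the Tukey convention."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # smallest value within the fences
        "whisker_high": inside[-1],  # largest value within the fences
        "outliers": outliers,
    }


# Synthetic data for illustration only: nine regular reps and one stray value.
box = tukey_box([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(box)
```

With n = 10, plotting the individual points alongside the box costs nothing and matches the "show the data" guidance.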

These conventions are public. The per-model numerical values and the rendered figures (scaling-band and box-whisker plots) are embargoed pending peer-reviewed publication. They are not reproduced anywhere on this site.

What the benchmark does not measure

The benchmark measures throughput, latency, thermal envelope, and power efficiency under concurrent load. It does not measure model quality or clinical accuracy. The claims stop at the hardware.

Embargo policy

The platform separates engineering content from research content. Engineering is public and released immediately. Research is held until paper acceptance.

Public

Embargoed until publication

Every result-generating step is labelled at the time of generation with one of two tags: "EMBARGO: paper figure" or "PUBLIC: engineering note for repo". This prevents accidental commits of paper-bound numbers.
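One way such labels could be checked before a commit is sketched below. This is a hypothetical guard, not the platform's actual tooling: the function names and the idea of reading the label from an artifact's header line are assumptions. Only the two label strings come from the policy above.

```python
EMBARGO_LABEL = "EMBARGO: paper figure"
PUBLIC_LABEL = "PUBLIC: engineering note for repo"


def classify_artifact(header_line):
    """Return 'embargo' or 'public'; unlabelled artifacts are an error."""
    if EMBARGO_LABEL in header_line:
        return "embargo"
    if PUBLIC_LABEL in header_line:
        return "public"
    raise ValueError(f"unlabelled artifact: {header_line!r}")


def safe_to_commit(header_lines):
    """True only if every artifact is explicitly labelled public."""
    return all(classify_artifact(line) == "public" for line in header_lines)


# Hypothetical header lines for illustration.
print(safe_to_commit(["# PUBLIC: engineering note for repo"]))
print(safe_to_commit(["# PUBLIC: engineering note for repo",
                      "# EMBARGO: paper figure"]))
```

Treating a missing label as an error, rather than as public by default, is the conservative choice for this kind of check.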

References

  1. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6(2):65–70. jstor.org/stable/4615733
  2. Weissgerber TL, Milic NM, Winham SJ, Garovic VD. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol. 2015;13(4):e1002128. doi:10.1371/journal.pbio.1002128
  3. Tukey JW. Exploratory Data Analysis. Reading, MA: Addison-Wesley; 1977.
  4. Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the SAMPL Guidelines. Int J Nurs Stud. 2015;52(1):5–9. doi:10.1016/j.ijnurstu.2014.09.006
  5. Mann HB, Whitney DR. On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat. 1947;18(1):50–60. doi:10.1214/aoms/1177730491
  6. Wilcoxon F. Individual comparisons by ranking methods. Biometrics Bull. 1945;1(6):80–83. doi:10.2307/3001968
  7. Hodges JL, Lehmann EL. Estimates of location based on rank tests. Ann Math Stat. 1963;34(2):598–611. doi:10.1214/aoms/1177704172