NaviMed-UMB

Local clinical-AI inference on consumer AMD GPUs. Polish LLM releases and an open benchmark method.

Results

This page shows the engineering envelope: which models load, what they weigh, how much KV cache is left, and how many requests fit. These figures are already on the Zenodo record and the model cards.

Performance numbers are held back. Throughput, latency, energy per token, and cross-model comparisons stay private until the papers are published. The plots that show them are not embedded here. What follows is load behaviour and capacity. That part is public.

Engineering envelope

Measured with vLLM 0.19.0+rocm721 and enforce_eager. The single-card variants run at TP=1 and max_seq_len 2048. The 70B family runs at TP=2 and max_seq_len 8192, so its context, weight, and concurrency sit on a different basis. See the note below the table. Max-concurrency is the capacity multiple vLLM reports at load time. It is a load figure, not a throughput measurement.
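As a rough sketch, the two load configurations described above could be written down as vLLM engine arguments. This assumes that what this page calls max_seq_len corresponds to vLLM's max_model_len argument. The model ids are placeholders, not the released checkpoint names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadConfig:
    """Load-time settings for one row of the envelope table."""
    model: str                  # placeholder id, not a real release name
    tensor_parallel_size: int   # the TP column
    max_model_len: int          # assumed equivalent of max_seq_len on this page
    enforce_eager: bool = True  # eager mode, as used for every measurement here

    def engine_kwargs(self) -> dict:
        """Keyword arguments in the shape vllm.LLM(...) accepts."""
        return {
            "model": self.model,
            "tensor_parallel_size": self.tensor_parallel_size,
            "max_model_len": self.max_model_len,
            "enforce_eager": self.enforce_eager,
        }


# Hypothetical names for the two bases used in the table.
SINGLE_CARD = LoadConfig("placeholder/single-card-model", 1, 2048)
TWO_CARD_70B = LoadConfig("placeholder/70b-model", 2, 8192)


def load(cfg: LoadConfig):
    """Build the engine. vLLM is imported lazily so the config works without it."""
    from vllm import LLM  # needs a working vLLM + ROCm install
    return LLM(**cfg.engine_kwargs())
```

Keeping the settings in one small object makes it harder to compare a TP=1 / 2048 row against a TP=2 / 8192 row by accident.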

Model | Quant | TP | Ctx | Weight (GiB) | KV budget (GiB) | Max conc. × | Gate-1 | tok/s 🔒 | p99 lat 🔒 | mWh/tok 🔒

🔒 The locked columns hold back throughput, latency, and energy per token until the papers are published. They will be filled in by a data swap, with the layout unchanged. The single-card rows are measured at max_seq_len 2048. The 70B row comes from its model card at TP=2 and max_seq_len 8192. Its weight is the total across two cards. Its KV budget is reported there in tokens, so the GiB column shows n/a, and its max-concurrency is not directly comparable to the 2048 rows. Bielik-11B is still running its sweep.
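A small arithmetic sketch shows why rows measured at different context lengths are not directly comparable. As an approximation, the capacity multiple is the number of tokens the KV cache can hold divided by the per-request context length. The token count below is made up for illustration and is not a measured value. vLLM's own calculation may also differ for hybrid-attention models.

```python
def max_concurrency(kv_cache_tokens: int, max_seq_len: int) -> float:
    """Approximate capacity multiple: full-length requests that fit in the KV cache."""
    if max_seq_len <= 0:
        raise ValueError("max_seq_len must be positive")
    return kv_cache_tokens / max_seq_len


# Hypothetical KV cache size in tokens, chosen only to make the arithmetic easy.
HYPOTHETICAL_KV_TOKENS = 65_536

at_2048 = max_concurrency(HYPOTHETICAL_KV_TOKENS, 2048)  # 32.0
at_8192 = max_concurrency(HYPOTHETICAL_KV_TOKENS, 8192)  # 8.0
```

The same cache yields a four times smaller multiple at 8192 than at 2048, so the multiple only makes sense alongside its context length.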

Each of the two single-card variants fits on one consumer card. That leaves the second R9700 free for a parallel sweep or training. The 70B family needs both cards (TP=2). Full per-variant figures are on each model card.

Gate 1 sanity

Each single-card variant passed 5/5 on a Polish-clinical prompt grid. The grid has one factual prompt and four clinical ones: definitional, syndrome, instructional, procedural. Prompts go through /v1/completions with the chat template bypassed. This checks that the model runs and stays coherent. It does not score quality. Raw outputs sit under environment/sanity-tests/ in the repo.
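The sanity pass described above could look roughly like the sketch below. The prompt texts, the base URL, and the pass criterion (any non-empty output) are placeholders introduced here for illustration. The actual prompts and raw outputs are the ones under environment/sanity-tests/.

```python
import json
import urllib.request

# Grid categories from the text; the prompt strings are placeholders.
GRID = {
    "factual": "<factual prompt>",
    "definitional": "<clinical definitional prompt>",
    "syndrome": "<clinical syndrome prompt>",
    "instructional": "<clinical instructional prompt>",
    "procedural": "<clinical procedural prompt>",
}


def build_request(base_url: str, model: str, prompt: str,
                  max_tokens: int = 128) -> urllib.request.Request:
    """POST to /v1/completions with a raw prompt, so no chat template is applied."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(req: urllib.request.Request) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


def tally(outputs: dict) -> str:
    """Count non-empty outputs. A stand-in check, not a quality score."""
    passed = sum(1 for text in outputs.values() if text.strip())
    return f"{passed}/{len(outputs)}"
```

Against a running server, the grid would be driven with something like tally({name: complete(build_request(url, model, p)) for name, p in GRID.items()}), where url and model are supplied by the caller.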

What we found

Engineering notes only. Performance numbers stay embargoed.

enforce_eager is mandatory on gfx1201

With hybrid attention, the default CUDA-graph path crashes on gfx1201 (HSA_STATUS_ERROR_INVALID_PACKET_FORMAT). enforce_eager=True fixes it. This is a runtime constraint, not a model defect.

AWQ runs slower than BF16 here

On this gfx1201 / ROCm 7.2.0 / vLLM 0.19.0 build, the AWQ kernels are slower than BF16 for the same model. The kernels are still young on RDNA 4. The size of the gap is embargoed.

Sanity bypasses the chat template

Prompts go straight to /v1/completions. The chat-template path produced failures that came from the harness, not the model, so the documented pattern skips it.

Why no plots

The scaling plots put concrete per-N values on their axes. That makes them paper-bound, so they wait for acceptance. Anything not shown above as public is held back too. The method page covers the plot design.