Clinical LLMs on hardware you own

NaviMed-UMB runs open-weight language models on a consumer dual AMD Radeon AI PRO R9700 workstation. It ships a benchmark suite, public model cards, and a documented method. Inference runs on-premise. No patient data leaves the building.

That privacy claim holds for on-premise inference with no external telemetry or API calls. Deployment hygiene is the operator's job. See Limitations.

Zenodo concept DOI GitHub repository HuggingFace · mozarcik Docs: CC BY 4.0

Clinical safety boundary. NaviMed-UMB is a research and engineering platform for local inference and benchmarking. It is not validated for diagnosis, treatment, triage, or any patient-care decision. Model quality and clinical accuracy are out of scope here. They are evaluated separately in Layer 3. See Limitations.

NaviMed-UMB should be understood as an exploratory research infrastructure. Its purpose is to investigate whether local, reproducible, sovereign LLM workflows can support future scientific work on medical-language processing, retrieval, benchmarking, and domain-specific evaluation.

The platform is Layer 1 of a three-layer design for on-premise medical AI. It is the layer that runs today.

If you read one thing.

Layer 1 is implemented and released (v0.5.0).
Layers 2 and 3 are scoped.
Public: ten model cards, one calibration dataset, the method, and the reproducibility stack.
Performance numbers are embargoed until publication.
No clinical-accuracy claim is made here.

How it fits together

Clinical SmPC corpus (EMA, no PHI)

AWQ W4A16 calibration

Quantized model cards (HuggingFace)

Local serving: vLLM, ROCm, R9700

Benchmark harness

Engineering envelope (public)

Layer 2 retrieval and Layer 3 arena (scoped)

Hardware

AMD Ryzen 9 9950X3D. 2× GIGABYTE Radeon AI PRO R9700 32 GB (gfx1201, RDNA 4). 96 GB DDR5-6000. Kubuntu 24.04, kernel 6.17, ROCm 7.2.0, vLLM 0.19.0.

Releases (v0.5.0)

Ten HuggingFace model cards. The Llama-PLLuM-70B family in eight AWQ variants for two-card deployment, plus two variants that fit on a single R9700. To the author's knowledge, the first public AWQ W4A16 quantization of each. See Models.
One calibration dataset. A reusable Polish clinical SmPC corpus. 418 fragments of EMA-published text. No PHI.
One Zenodo deposit. Concept DOI 10.5281/zenodo.19851346, current version v0.5.0.

Three layers

Each layer is a separate contribution. This site documents Layer 1.

L1 · Workstation

The hardware envelope, the benchmark platform, and the AWQ releases. Released as v0.5.0.

L2 · Retrieval

A clinical retrieval layer over the Polish SmPC corpus. Hybrid retrieval, not a single vector store. In scoping.

L3 · Arena

A model-quality evaluation method, kept separate from the throughput work. In scoping.

Scope

The benchmark measures throughput, latency, thermal envelope, and power under concurrent load. It does not measure model quality, reasoning, or clinical accuracy. A confidently wrong model runs as fast as a correct one. The benchmark cannot tell them apart. Quality belongs to the L3 Arena and to clinical validation. Not here.

Public now vs held back

Public now	Held until the papers
Model cards	Throughput scaling
Calibration corpus	Latency distributions
Hardware envelope	Energy per token
Reproducibility stack	Cross-model comparisons
Methodology	Paper-bound figures

What this site holds back. Performance numbers stay private until the papers are published. Throughput, latency distributions, energy per token, and cross-model comparisons are not here. The engineering side is: hardware limits, the method, the model cards, and how to reproduce them.