NaviMed-UMB

Local clinical-AI inference on consumer AMD GPUs. Polish LLM releases and an open benchmark method.

Reproduce

To reproduce a run, match the full software stack. A different stack is a different experiment. The values below are the minimum spec. The repository holds the exact dated manifests.

Pinned stack

Component | Pinned value
GPU | 2× AMD Radeon AI PRO R9700 32 GB (gfx1201, RDNA 4 / Navi 48)
CPU / RAM | AMD Ryzen 9 9950X3D · 96 GB DDR5-6000
OS / kernel | Kubuntu 24.04 · kernel 6.17
ROCm | 7.2.0
vLLM | 0.19.0+rocm721
Quantization | AWQ W4A16 (compressed-tensors); llm-compressor 0.10.0.2

Every benchmark JSON record captures the full version set (rocm_version, vllm_version, torch_version, torch_hip_version) plus the complete env-var dictionary. Dated software and system manifests (pip freeze, ROCm, kernel) live under environment/ in the repository.
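
A minimal sketch of how dated manifests of this kind might be captured. The function name capture_manifests, the output file names, and the /opt/rocm/.info/version path are assumptions for illustration. The repository's own manifests may be produced differently.

```shell
# Hypothetical helper: write dated kernel, pip, and ROCm manifests.
# File names and layout are illustrative, not the repository's actual scheme.
capture_manifests() {
  out_dir="${1:-environment}"
  stamp="$(date +%Y-%m-%d)"
  mkdir -p "$out_dir" || return 1

  # Running kernel release as reported by uname.
  uname -r > "$out_dir/kernel-$stamp.txt"

  # Python package set; skipped with a warning if pip is unavailable.
  if python3 -m pip --version >/dev/null 2>&1; then
    python3 -m pip freeze > "$out_dir/pip-freeze-$stamp.txt"
  else
    echo "pip not available; skipping pip freeze" >&2
  fi

  # ROCm version file; this path is an assumption about the install layout.
  if [ -r /opt/rocm/.info/version ]; then
    cp /opt/rocm/.info/version "$out_dir/rocm-$stamp.txt"
  else
    echo "ROCm version file not found; skipping" >&2
  fi
}
```

Calling capture_manifests with no argument writes into environment/ under the current directory. Passing a path writes there instead.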

Mandatory gfx1201 environment variables

These variables are required on this platform. They are public engineering content.

unset PYTORCH_ALLOC_CONF
export VLLM_ROCM_USE_AITER=0
export AMD_SERIALIZE_KERNEL=1            # NOT 3, rejected by current PyTorch
export HIP_LAUNCH_BLOCKING=1
export ROCR_VISIBLE_DEVICES=0,1          # exclude the iGPU (RAPHAEL)
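
Because a wrong value here changes the experiment, it can help to verify the variables before launching anything. The following is a minimal preflight sketch. The function name check_gfx1201_env is hypothetical and not part of the repository. It only compares the exported values against the list above.

```shell
# Hypothetical preflight check for the variables listed above.
# Runs in a subshell so its working variables do not leak into the caller.
check_gfx1201_env() (
  status=0

  # PYTORCH_ALLOC_CONF must be unset, not merely empty.
  if [ -n "${PYTORCH_ALLOC_CONF+set}" ]; then
    echo "PYTORCH_ALLOC_CONF is set; expected unset" >&2
    status=1
  fi

  # Each entry is NAME=expected-value, taken from the export lines above.
  for pair in VLLM_ROCM_USE_AITER=0 AMD_SERIALIZE_KERNEL=1 \
              HIP_LAUNCH_BLOCKING=1 ROCR_VISIBLE_DEVICES=0,1; do
    name="${pair%%=*}"
    expected="${pair#*=}"
    # printenv only sees exported variables, which is what child processes get.
    actual="$(printenv "$name")" || actual=""
    if [ "$actual" != "$expected" ]; then
      echo "$name='$actual'; expected '$expected'" >&2
      status=1
    fi
  done

  exit "$status"
)
```

A launch script could then run check_gfx1201_env || exit 1 before starting the server.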

Mandatory model-construction flag

For every model with hybrid attention, enforce_eager=True is mandatory. The default CUDA-graph capture crashes with HSA_STATUS_ERROR_INVALID_PACKET_FORMAT on gfx1201. This is a runtime constraint, not a model defect.

Serving a released model

The released AWQ checkpoints ship vLLM usage snippets on their Hugging Face model cards for two platforms: AMD ROCm (validated) and NVIDIA (portable, via the awq_marlin kernel). On this platform, a single-card variant is served roughly as shown below. The model card gives the exact per-variant invocation.

# after exporting the gfx1201 env-var floor above
vllm serve mozarcik/Llama-PLLuM-8B-chat-2512-awq \
  --tensor-parallel-size 1 \
  --max-model-len 2048 \
  --enforce-eager

Sanity checks go to the /v1/completions endpoint with the chat template bypassed. See Results for why.
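
A sanity check of this kind might look like the sketch below. It assumes the server started by the command above is listening on vLLM's default port 8000 on localhost. The helper names, the prompt, and the max_tokens and temperature values are placeholders, not the project's actual settings.

```shell
# Assumptions: local server on vLLM's default port; placeholder request values.
BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="mozarcik/Llama-PLLuM-8B-chat-2512-awq"

# Build the JSON body for /v1/completions. The prompt is sent as raw text,
# so no chat template is applied. $1 must not contain characters that need
# JSON escaping (quotes, backslashes, newlines); this sketch does not escape.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 32, "temperature": 0}\n' \
    "$MODEL" "$1"
}

# Send one raw-prompt request and print the server's JSON response.
sanity_check() {
  curl -sS "$BASE_URL/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1")"
}
```

For example, sanity_check "Stolica Polski to" sends a short Polish prefix and prints the raw completion response.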

What you can reproduce from public artifacts

The Phase 2 numerical results (per-N throughput, latency distributions, energy per token) and the figures needed to reproduce the scaling analysis are withheld until the papers are published. The raw results tree (benchmarks/results/) is gitignored and kept only locally.