NaviMed-UMB

Local clinical-AI inference on consumer AMD GPUs. Polish LLM releases and an open benchmark method.

Released models & dataset

Eleven public artifacts on HuggingFace under the mozarcik/ namespace: ten AWQ model cards and one reusable calibration dataset. To the author's knowledge, each model is the first public AWQ W4A16 / vLLM-native quantization of its variant.

All checkpoints are AWQ W4A16 (4-bit weights, 16-bit activations) in compressed-tensors format. Each is calibrated on the shared Polish clinical SmPC corpus below. Each ships with dual-platform vLLM usage snippets, validated on AMD ROCm and portable to NVIDIA via the awq_marlin kernel. Every card carries the same "to the author's knowledge, first public" qualifier.

The "first public" qualifier is based on a public HuggingFace search at release date. It is not a registry-level claim.

Llama-PLLuM-70B family: 8 variants

Quantized from the CYFRAGOVPL Llama-PLLuM-70B base / instruct / chat checkpoints. Calibration ran on an AMD Instinct MI300X via the AMD Developer Cloud, powered by DigitalOcean, under the AMD Developer Program. The MI300X's 192 GB of HBM3 holds the full 70B model in a single calibration pass. Deployment then runs locally on 2× R9700 (TP = 2). Released 2026-05-23. License: llama3.1 (Llama 3.1 Community License). Each card ships with full NOTICE / LICENSE / USE_POLICY.md compliance.

Variant         | Checkpoint                        | Base model                               | License
base · 2412     | Llama-PLLuM-70B-base-2412-awq     | CYFRAGOVPL/Llama-PLLuM-70B-base-2412     | llama3.1
base · 2508     | Llama-PLLuM-70B-base-2508-awq     | CYFRAGOVPL/Llama-PLLuM-70B-base-2508     | llama3.1
instruct · 2412 | Llama-PLLuM-70B-instruct-2412-awq | CYFRAGOVPL/Llama-PLLuM-70B-instruct-2412 | llama3.1
instruct · 2508 | Llama-PLLuM-70B-instruct-2508-awq | CYFRAGOVPL/Llama-PLLuM-70B-instruct-2508 | llama3.1
instruct · 2512 | Llama-PLLuM-70B-instruct-2512-awq | CYFRAGOVPL/Llama-PLLuM-70B-instruct-2512 | llama3.1
chat · 2412     | Llama-PLLuM-70B-chat-2412-awq     | CYFRAGOVPL/Llama-PLLuM-70B-chat-2412     | llama3.1
chat · 2508     | Llama-PLLuM-70B-chat-2508-awq     | CYFRAGOVPL/Llama-PLLuM-70B-chat-2508     | llama3.1
chat · 2512     | Llama-PLLuM-70B-chat-2512-awq     | CYFRAGOVPL/Llama-PLLuM-70B-chat-2512     | llama3.1
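
The hardware split for the 70B family (calibrate on one MI300X, deploy on two consumer cards) can be sanity-checked with back-of-envelope arithmetic. The sketch below is only a rough estimate under stated assumptions: it takes "70B" literally as 70e9 parameters, assumes calibration holds the weights in 16-bit, and ignores quantization scales, zero-points, any layers left unquantized, KV cache, and runtime overhead. The 32 GB per-card figure is taken from the Run-3 section of this page. The measured footprint figures are the ones on the Results page, not these.

```python
GB = 1e9  # decimal gigabytes, as HBM / VRAM sizes are usually quoted


def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (weights only, no scales or overhead)."""
    return n_params * bits_per_weight / 8 / GB


N_PARAMS_70B = 70e9    # assumption: nominal parameter count
MI300X_HBM_GB = 192    # from the text above
R9700_VRAM_GB = 32     # per card, from the Run-3 section
TP = 2                 # tensor-parallel size used for 70B deployment

bf16_gb = weight_footprint_gb(N_PARAMS_70B, 16)  # assumed 16-bit weights during calibration
w4_gb = weight_footprint_gb(N_PARAMS_70B, 4)     # 4-bit weights after AWQ W4A16

print(f"16-bit weights: {bf16_gb:.0f} GB of {MI300X_HBM_GB} GB HBM3")
print(f"4-bit weights:  {w4_gb:.0f} GB total, {w4_gb / TP:.1f} GB per GPU at TP={TP}")
print(f"Left per GPU before KV cache and overhead: {R9700_VRAM_GB - w4_gb / TP:.1f} GB")
```

Under these assumptions the 16-bit weights fit on the MI300X with room for calibration activations, while the 4-bit weights would not fit on a single 32 GB card but do fit when sharded across two.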

Run-3 consumer-GPU variants: single R9700

Released 2026-05-26. Unlike the 70B family, both variants were quantized locally on a single 32 GB R9700 (TP = 1), since they are small enough to calibrate on one consumer card. Both also deploy on one consumer card under vLLM, leaving the second GPU free for a parallel sweep or co-located training. Each passed a 5/5 Polish-clinical Gate 1 sanity grid (see Results).

Checkpoint                   | Base model                          | Base family            | License
Llama-PLLuM-8B-chat-2512-awq | CYFRAGOVPL/Llama-PLLuM-8B-chat-2512 | Llama 3.1              | llama3.1
PLLuM-12B-chat-2512-awq      | CYFRAGOVPL/PLLuM-12B-chat-2512      | Mistral-Nemo-Base-2407 | apache-2.0

The two Run-3 variants carry different licenses because they derive from different base families. The 8B is a Llama 3.1 derivative (Llama 3.1 Community License). The 12B is a Mistral-Nemo derivative (Apache 2.0).
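
As a rough illustration of the deployment shapes described on this page (TP = 2 for the 70B family, one card for the Run-3 variants), the sketch below builds `vllm serve` command lines. It is not the validated snippet from the model cards. The repo ids are assumed to be the mozarcik/ namespace joined with the checkpoint name, the `max_model_len` value is a hypothetical placeholder, and the sketch passes no quantization flag on the assumption that vLLM picks the scheme up from the checkpoint's own config.

```python
import shlex

NAMESPACE = "mozarcik"  # HuggingFace namespace named on this page

# Tensor-parallel size per checkpoint, following the deployment notes above.
# Assumption: repo id is "<namespace>/<checkpoint name>".
TP_SIZE = {
    "Llama-PLLuM-70B-instruct-2512-awq": 2,  # sharded across 2x R9700
    "Llama-PLLuM-8B-chat-2512-awq": 1,       # single R9700
    "PLLuM-12B-chat-2512-awq": 1,            # single R9700
}


def vllm_serve_command(checkpoint: str, max_model_len: int = 4096) -> list[str]:
    """Build an argv list for `vllm serve`; max_model_len is illustrative only."""
    return [
        "vllm", "serve", f"{NAMESPACE}/{checkpoint}",
        "--tensor-parallel-size", str(TP_SIZE[checkpoint]),
        "--max-model-len", str(max_model_len),
    ]


for name in TP_SIZE:
    print(shlex.join(vllm_serve_command(name)))
```

Keeping the tensor-parallel size in one table makes it harder to launch a 70B checkpoint with the single-card setting by accident. For the platform-specific options, the usage snippets on each model card remain the reference.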

Calibration dataset

One AWQ calibration corpus is shared across all 8B / 12B / 70B quantizations, so cross-variant quantization-quality comparisons stay corpus-controlled: differences between variants cannot be attributed to different calibration data.

mozarcik/clinical-pl-smpc-awq-calibration (license: other)

418 fragments of Polish Charakterystyka Produktu Leczniczego (ChPL / SmPC) text published by the European Medicines Agency (EMA), ~512 tokens each. The corpus covers 61 medicines drawn from a curated catalog of 81 INNs (international nonproprietary names) across 9 NFZ drug programmes, with a pulmonology and thoracic oncology focus. No PHI. Reusable as a calibration corpus for other Polish-clinical quantization work.
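
To give a sense of scale, the fragment count and length above imply a nominal calibration budget of roughly 214k tokens. The sketch below computes that figure and shows one simple way to cut a token sequence into fixed-length fragments. How the published fragments were actually segmented is not stated here, so the fixed-window `chunk_tokens` helper is a hypothetical illustration, and the integer list stands in for real tokenizer output.

```python
N_FRAGMENTS = 418     # from the text above
FRAGMENT_LEN = 512    # approximate tokens per fragment, from the text above

# Nominal upper bound on calibration tokens (fragments are "~512" tokens each).
nominal_tokens = N_FRAGMENTS * FRAGMENT_LEN
print(f"Nominal calibration budget: {nominal_tokens} tokens")


def chunk_tokens(token_ids: list[int], fragment_len: int = FRAGMENT_LEN) -> list[list[int]]:
    """Split a token sequence into consecutive windows of at most fragment_len tokens."""
    return [
        token_ids[start:start + fragment_len]
        for start in range(0, len(token_ids), fragment_len)
    ]


# Stand-in for one tokenized SmPC document of 1300 tokens.
fragments = chunk_tokens(list(range(1300)))
print([len(fragment) for fragment in fragments])
```

A fixed window like this leaves a short trailing fragment per document; whether such remainders are kept, padded, or dropped is a design choice this sketch does not settle.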

The dataset is licensed separately from the repository. It derives from third-party EMA-published regulatory documents and is governed by its own license, not the repository's root CC-BY-4.0 / MIT.

What is not on this page. Per-model and cross-model performance numbers are withheld until the papers are published. That covers per-N throughput, latency distributions, KV-occupancy curves, energy-per-token, and any quantization-vs-quantization scaling comparison. The engineering envelope figures that are already public (weight footprint, KV budget, max-concurrency) appear on the Results page.