Reproduce
To reproduce a run, match the full software stack. A different stack is a different experiment. The values below are the minimum spec. The repository holds the exact dated manifests.
Pinned stack
| Component | Pinned value |
|---|---|
| GPU | 2× AMD Radeon AI PRO R9700 32 GB (gfx1201, RDNA 4 / Navi 48) |
| CPU / RAM | AMD Ryzen 9 9950X3D · 96 GB DDR5-6000 |
| OS / kernel | Kubuntu 24.04 · kernel 6.17 |
| ROCm | 7.2.0 |
| vLLM | 0.19.0+rocm721 |
| Quantization | AWQ W4A16 (compressed-tensors); llm-compressor 0.10.0.2 |
Every benchmark JSON record captures the full version set (rocm_version, vllm_version, torch_version, torch_hip_version) plus the complete env-var dictionary. Dated software and system manifests (pip freeze, ROCm, kernel) live under environment/ in the repository.
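Because each record carries its own version set, two runs can be checked for stack drift before their numbers are compared. The sketch below is one way to do that. The four field names come from the text above; that they sit at the top level of the record is an assumption, and `stack_drift` is a hypothetical helper, not a function from the repository.

```python
# Version fields named in the text. Their top-level placement in the
# record is an assumption; adjust the lookup if the real layout is nested.
VERSION_KEYS = ("rocm_version", "vllm_version", "torch_version", "torch_hip_version")


def stack_drift(record_a: dict, record_b: dict) -> dict:
    """Return {field: (value_a, value_b)} for every version field that differs."""
    drift = {}
    for key in VERSION_KEYS:
        a, b = record_a.get(key), record_b.get(key)
        if a != b:
            drift[key] = (a, b)
    return drift
```

An empty result means the two records report the same version set. Anything else marks the runs as different experiments in the sense used at the top of this page.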
Mandatory gfx1201 environment variables
These variables are required on this platform. They are public engineering content.
unset PYTORCH_ALLOC_CONF
export VLLM_ROCM_USE_AITER=0
export AMD_SERIALIZE_KERNEL=1 # NOT 3, rejected by current PyTorch
export HIP_LAUNCH_BLOCKING=1
export ROCR_VISIBLE_DEVICES=0,1 # exclude the iGPU (RAPHAEL)
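A launcher script could verify this floor before starting a run. The following is a minimal sketch, assuming the values in the export lines above are the complete required set. `check_env_floor` is a hypothetical helper, not part of the repository.

```python
import os

# Required values, copied from the export lines above.
REQUIRED = {
    "VLLM_ROCM_USE_AITER": "0",
    "AMD_SERIALIZE_KERNEL": "1",
    "HIP_LAUNCH_BLOCKING": "1",
    "ROCR_VISIBLE_DEVICES": "0,1",
}
# Must be absent, mirroring `unset PYTORCH_ALLOC_CONF`.
MUST_BE_UNSET = ("PYTORCH_ALLOC_CONF",)


def check_env_floor(env=None) -> list:
    """Return a list of problems; an empty list means the floor is met."""
    env = os.environ if env is None else env
    problems = []
    for name, expected in REQUIRED.items():
        actual = env.get(name)
        if actual != expected:
            problems.append(f"{name}={actual!r}, expected {expected!r}")
    for name in MUST_BE_UNSET:
        if name in env:
            problems.append(f"{name} is set, expected unset")
    return problems
```

Calling `check_env_floor()` with no argument inspects the current process environment. Passing a dict makes the check easy to test in isolation.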
Mandatory model-construction flag
For every model with hybrid attention, enforce_eager=True is mandatory. The default CUDA-graph capture crashes with HSA_STATUS_ERROR_INVALID_PACKET_FORMAT on gfx1201. This is a runtime constraint, not a model defect.
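For offline use through the Python API, the flag can be fixed in one place so that no call site forgets it. This is a sketch under stated assumptions: `engine_kwargs` and `load_engine` are hypothetical helpers, and the defaults for `tensor_parallel_size` and `max_model_len` are illustrative, borrowed from the single-card serve command on this page.

```python
def engine_kwargs(model: str, tensor_parallel_size: int = 1, max_model_len: int = 2048) -> dict:
    """Keyword arguments for vllm.LLM with eager mode forced on."""
    return {
        "model": model,
        "tensor_parallel_size": tensor_parallel_size,
        "max_model_len": max_model_len,
        "enforce_eager": True,  # skip graph capture, per the gfx1201 constraint
    }


def load_engine(model: str, **overrides):
    """Construct the engine; needs vLLM installed and the env-var floor exported."""
    from vllm import LLM  # imported lazily so engine_kwargs works without vLLM

    return LLM(**engine_kwargs(model, **overrides))
```

`load_engine` only accepts the overrides that `engine_kwargs` names, so `enforce_eager` cannot be switched off by accident through this path.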
Serving a released model
The released AWQ checkpoints ship vLLM usage snippets for two platforms on their Hugging Face cards: validated on AMD ROCm, and portable to NVIDIA via the awq_marlin kernel. On this platform, a single-card variant is served roughly as below. The model card gives the exact per-variant invocation.
# after exporting the gfx1201 env-var floor above
vllm serve mozarcik/Llama-PLLuM-8B-chat-2512-awq \
--tensor-parallel-size 1 \
--max-model-len 2048 \
--enforce-eager
Sanity checks go to the /v1/completions endpoint with the chat template bypassed. See Results for why.
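A sanity check of that kind could look like the sketch below. It sends a raw prompt to /v1/completions, so no chat template is applied. The base URL (for example `http://localhost:8000`), the `max_tokens` value, and the zero temperature are assumptions for illustration. Both function names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str, max_tokens: int = 32):
    """Build a POST to /v1/completions carrying a raw, untemplated prompt."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,   # illustrative value
        "temperature": 0.0,         # illustrative: keeps the check repeatable
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def sanity_check(base_url: str, model: str, prompt: str) -> str:
    """Send the request to a running server and return the first completion's text."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Splitting request construction from sending keeps the payload inspectable without a running server.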
What you can reproduce from public artifacts
- The engineering envelope. Load behaviour, weight footprint, KV-cache budget, and max-concurrency for each released checkpoint. These figures are on the Results page and the model cards.
- The quantization. Re-quantize any base checkpoint against the published calibration corpus with the pinned llm-compressor version. The 8B and 12B were quantized locally on one R9700. The 70B family was quantized on an AMD Instinct MI300X (192 GB) via the AMD Developer Cloud on DigitalOcean, under the AMD Developer Program. A 70B AWQ pass needs more memory than two R9700 cards leave free. The scripts and their MI300X paths are under calibration/quantization/.
- The methodology. The full Phase 1 and Phase 2 protocol, statistical design, and plotting scripts are in the repository.