Reproduce

To reproduce a run, match the full software stack. A different stack is a different experiment. The values below are the minimum spec. The repository holds the exact dated manifests.

Pinned stack

Component	Pinned value
GPU	2× AMD Radeon AI PRO R9700 32 GB (gfx1201, RDNA 4 / Navi 48)
CPU / RAM	AMD Ryzen 9 9950X3D · 96 GB DDR5-6000
OS / kernel	Kubuntu 24.04 · kernel 6.17
ROCm	`7.2.0`
vLLM	`0.19.0+rocm721`
Quantization	AWQ W4A16 (compressed-tensors); `llm-compressor 0.10.0.2`

Every benchmark JSON record captures the full version set (rocm_version, vllm_version, torch_version, torch_hip_version) plus the complete env-var dictionary. Dated software and system manifests (pip freeze, ROCm, kernel) live under environment/ in the repository.

Mandatory gfx1201 environment variables

These variables are required on this platform. They are public engineering content.

unset PYTORCH_ALLOC_CONF
export VLLM_ROCM_USE_AITER=0
export AMD_SERIALIZE_KERNEL=1            # NOT 3, rejected by current PyTorch
export HIP_LAUNCH_BLOCKING=1
export ROCR_VISIBLE_DEVICES=0,1          # exclude the iGPU (RAPHAEL)

Mandatory model-construction flag

For every model with hybrid attention, enforce_eager=True is mandatory. The default CUDA-graph capture crashes with HSA_STATUS_ERROR_INVALID_PACKET_FORMAT on gfx1201. This is a runtime constraint, not a model defect.

Serving a released model

The released AWQ checkpoints ship vLLM usage snippets for two platforms on their HuggingFace cards: AMD ROCm validated, NVIDIA portable via the awq_marlin kernel. On this platform, a single-card variant is served roughly as below. The model card gives the exact per-variant invocation.

# after exporting the gfx1201 env-var floor above
vllm serve mozarcik/Llama-PLLuM-8B-chat-2512-awq \
  --tensor-parallel-size 1 \
  --max-model-len 2048 \
  --enforce-eager

Sanity checks go to the /v1/completions endpoint with the chat template bypassed. See Results for why.

What you can reproduce from public artifacts

The engineering envelope. Load behaviour, weight footprint, KV-cache budget, and max-concurrency for each released checkpoint. These figures are on the Results page and the model cards.
The quantization. Re-quantize any base checkpoint against the published calibration corpus with the pinned llm-compressor version. The 8B and 12B were quantized locally on one R9700. The 70B family was quantized on an AMD Instinct MI300X (192 GB) via the AMD Developer Cloud on DigitalOcean, under the AMD Developer Program. A 70B AWQ pass needs more memory than two R9700 cards leave free. The scripts and their MI300X paths are under calibration/quantization/.
The methodology. The full Phase 1 and Phase 2 protocol, statistical design, and plotting scripts are in the repository.

The Phase 2 numerical results (per-N throughput, latency distributions, energy per token) and the figures needed to reproduce the scaling analysis are held back until the papers are published. The raw results tree (benchmarks/results/) is gitignored and kept locally only.