Vulkan vs CUDA on NVIDIA Blackwell: The Benchmark Nobody Expected
Tested on RTX PRO 6000 Blackwell Server Edition (96 GB VRAM) with llama.cpp b7966, February 2026
TL;DR
We benchmarked 8 LLMs (1B to 70B parameters) across two backends (Vulkan with coopmat2 and CUDA 13.1), two operating systems (Linux and Windows), and six runs per configuration on NVIDIA's newest architecture. The results broke every assumption we had.
The headlines:
- 8B and smaller: use CUDA. It dominates prompt processing and throughput on small models.
- 32B and above: try Vulkan. Token generation, the speed you feel during interactive inference, is 18 to 41% faster.
- Mistral Nemo 12B: use Vulkan, no question. It was 8.3x faster than CUDA (not a typo).
- Thermal behavior differs by backend: CUDA throttled gracefully at 95°C, while Vulkan crashed with unrecoverable device-lost errors under the same load.
- Server GPUs should stay in server cases. A 600W passive GPU in a consumer mid-tower is a recipe for thermal disaster.
- OS choice barely matters: Linux and Windows CUDA performance stayed within 5 to 10% of each other.
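For readers who want to reproduce a comparison like this, llama.cpp selects its backend at build time. A minimal sketch, assuming llama.cpp's standard CMake flags (`GGML_CUDA`, `GGML_VULKAN`) and its bundled `llama-bench` tool; the model path and token counts here are illustrative, not our exact methodology:

```shell
# Build llama.cpp twice, once per backend.
cmake -B build-cuda   -DGGML_CUDA=ON   && cmake --build build-cuda   --config Release
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --config Release

# Benchmark the same model on each backend:
#   -p 512 = prompt-processing tokens, -n 128 = generated tokens, -r 6 = repetitions
./build-cuda/bin/llama-bench   -m models/model.gguf -p 512 -n 128 -r 6
./build-vulkan/bin/llama-bench -m models/model.gguf -p 512 -n 128 -r 6
```

`llama-bench` reports prompt-processing (pp) and token-generation (tg) rates separately, which is why the small-model and large-model recommendations above can point in opposite directions.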