

Vulkan vs CUDA on NVIDIA Blackwell: The Benchmark Nobody Expected

Tested on RTX PRO 6000 Blackwell Server Edition (96 GB VRAM) with llama.cpp b7966, February 2026


TL;DR

We benchmarked eight LLMs (1B to 70B parameters) across two backends, two operating systems, and six runs each. Vulkan with coopmat2 challenged CUDA 13.1 on NVIDIA's newest architecture. The results broke every assumption we had.

The headlines:

  • 8B and smaller: use CUDA. It dominates prompt processing and throughput on small models.
  • 32B and above: try Vulkan. Token generation is 18 to 41% faster, and that is the speed you feel during inference.
  • Mistral Nemo 12B: use Vulkan, no question. It is 8.3x faster than CUDA (not a typo).
  • Thermal behavior differs by backend. CUDA thermal-throttled gracefully at 95°C; Vulkan crashed with unrecoverable device-lost errors instead.
  • Server GPUs should stay in server cases. A 600W passive GPU in a consumer mid-tower is a recipe for thermal disaster.
  • OS choice barely matters. Linux and Windows CUDA performance are within 5 to 10% of each other.
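
If you want to reproduce a comparison like this yourself, llama.cpp ships a llama-bench tool that reports both numbers. Below is a minimal sketch driving it from Python; it assumes one build compiled with CUDA and one with Vulkan (the backend is chosen at compile time), and the binary paths and model file are placeholders for your own setup.

    # Minimal sketch: compare prompt-processing (pp) and token-generation (tg)
    # throughput across two llama.cpp builds. Binary paths and the model file
    # are placeholders; adjust them to your own builds and GGUF files.
    import json
    import subprocess

    BUILDS = {
        "cuda": "./build-cuda/bin/llama-bench",      # built with -DGGML_CUDA=ON
        "vulkan": "./build-vulkan/bin/llama-bench",  # built with -DGGML_VULKAN=ON
    }
    MODEL = "models/mistral-nemo-12b-Q4_K_M.gguf"    # any GGUF model to test

    for backend, bench in BUILDS.items():
        # -p 512: prompt-processing test, -n 128: generation test,
        # -r 6: six repetitions per test, -o json: machine-readable output
        run = subprocess.run(
            [bench, "-m", MODEL, "-p", "512", "-n", "128", "-r", "6", "-o", "json"],
            capture_output=True, text=True, check=True,
        )
        for result in json.loads(run.stdout):
            kind = "pp" if result["n_prompt"] > 0 else "tg"
            print(f"{backend} {kind}: {result['avg_ts']:.1f} t/s")

avg_ts is the mean tokens per second over the repetitions; comparing the tg lines between the two builds is what the bullets above summarize.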

Devstral 2 & vibe by Mistral AI: the hidden gems among AI coding agents

For about a year, I have been working daily with various coding assistants, choosing different tools depending on my mood, needs and constraints. My journey has included testing Windsurf and Tabnine professionally, while personally transitioning from being a fervent Copilot user to adopting Claude Code.

During this exploration, I discovered Devstral 2, which ultimately replaced Claude Code in my workflow for several compelling reasons:

  1. Aesthetic Excellence: The tool offers a beautiful user experience.
    From the blog post announcement to the API documentation and vibe itself, the color scheme, visual effects, and overall polish create a distinctly pleasant working environment.

  2. Comparable Performance: In my personal "me, myself & I" benchmark, Devstral 2's code suggestions are on par with Claude Code's.
    While both tend to occasionally overlook framework documentation, they deliver excellent results overall when refactoring, suggesting commit messages, or tweaking CSS.

  3. Cost-Effective and Open Source: Devstral 2 is significantly more affordable than Claude Code and is open source.
    Users receive 1 million tokens as a trial, with pricing at $0.10 per million input tokens and $0.30 per million output tokens for Devstral Small 2 past the first million.
    With Claude Code, I frequently hit usage limits, even after employing /compact commands and tracking my /usage.
    And even if you do blow through vibe's usage limits, it has a fourth advantage:

  4. Local Execution Capability: Although vibe's time to first token can be slower than Claude's, Mistral offers a crucial advantage!
    Both Devstral 2 and its small version are open source, with the ability to run entirely on local machines, providing greater control, privacy, and, if you have the gear, blazing-fast performance ⚡.

The documentation for running it locally is rather sparse, and Devstral Small 2 is still relatively resource-intensive, so some tweaks are needed.

Here are the instructions for running Devstral Small 2 + vibe on Ubuntu 24.04 with an NVIDIA L40S with 24 GB VRAM hosted by Scaleway.
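
Once the model is served locally behind an OpenAI-compatible endpoint (llama.cpp's llama-server and vLLM both expose one), any standard client can talk to it. Here is a minimal sketch, assuming a hypothetical server on localhost:8000 and a placeholder model id; use whatever id your own server registers.

    # Minimal sketch: query a locally served Devstral model through an
    # OpenAI-compatible endpoint. The base_url, port, and model id are
    # placeholders for your own server's configuration.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",  # local endpoint, no cloud round-trip
        api_key="not-needed-locally",         # local servers typically ignore the key
    )

    response = client.chat.completions.create(
        model="devstral-small-2",  # hypothetical id; use the one your server reports
        messages=[
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": "Suggest a commit message for a CSS-only refactor."},
        ],
        temperature=0.2,
    )
    print(response.choices[0].message.content)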

Welcome to LaFabrique.AI

Welcome to LaFabrique.AI! An evolution of storage-chaos.io. This blog tracks and documents the beginning of a journey through the world of artificial intelligence.

What to Expect

This blog will cover a wide range of AI-related topics (or not!):

  • 🤖 AI Tools – Reviews and tutorials on the latest AI tools
  • 🏗️ AI Infrastructure – Benchmark & architecture best practices
  • ☸️ Kubernetes – Insights on Kubernetes, storage & its use in the context of AI
  • 🏭 Industry Insights – Trends and developments in the AI space

Stay tuned for more content!