
Devstral 2 & vibe by Mistral AI: the hidden gems of AI coding agents

For about a year, I have been working daily with various coding assistants, choosing different tools depending on my mood, needs and constraints. My journey has included testing Windsurf and Tabnine professionally, while personally transitioning from being a fervent Copilot user to adopting Claude Code.

During this exploration, I discovered Devstral 2, which ultimately replaced Claude Code in my workflow for several compelling reasons:

  1. Aesthetic Excellence: The tool offers a beautiful user experience.
    From the blog post announcement to the API documentation and vibe itself, the color scheme, visual effects, and overall polish create a distinctly pleasant working environment.

  2. Comparable Performance: In my personal "me, myself & I" benchmark, Devstral 2's code suggestions are on par with Claude Code's.
    While both occasionally tend to overlook framework documentation, they deliver excellent results overall when refactoring, suggesting commit messages, or tweaking CSS.

  3. Cost-Effective and Open Source: Devstral 2 is significantly more affordable than Claude Code and is open source.
    Users receive 1 million tokens for trial, with pricing at $0.10/$0.30 for Devstral Small 2 past the 1st million.
    With Claude Code, I frequently hit usage limits, even after employing /compact commands and tracking my /usage.
    And even if you do hit vibe's usage limits, there is another option:

  4. Local Execution Capability: Although vibe's time to first token can be slower than Claude's, Mistral offers a crucial advantage!
    Both Devstral 2 and its Small version are open source and can run entirely on local machines, providing greater control, privacy, and, if you have the gear, blazing-fast performance ⚡.

The documentation for running it locally is rather sparse, and Devstral-Small-2 is still relatively resource-intensive, so some tweaks are needed.

Here are the instructions for running Devstral-Small-2 + vibe on Ubuntu 24.04 with an NVIDIA L40S with 48GB VRAM hosted by Scaleway.

Installation

The main resources for offline installation are:

  - Official vibe documentation
  - Hugging Face Devstral-Small-2 page 🤗

For this setup, I used vLLM (the recommended approach for running Devstral-Small-2). Alternatively, you can use Ollama, where the model is readily available: https://ollama.com/library/devstral-small-2 🦙.

This guide uses the vLLM nightly build with CUDA 12.9 and the uv package manager. Other configurations are available at: https://vllm.ai/#quick-start.

Let's get started:

  1. Install CUDA 12.9:

The installation process is straightforward. Select your platform from the NVIDIA CUDA archive and run:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.9.1/local_installers/cuda-repo-ubuntu2404-12-9-local_12.9.1-575.57.08-1_amd64.deb
dpkg -i cuda-repo-ubuntu2404-12-9-local_12.9.1-575.57.08-1_amd64.deb
cp /var/cuda-repo-ubuntu2404-12-9-local/cuda-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cuda-toolkit-12-9
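Optionally, you can confirm the toolkit landed before moving on; the package installs under the standard /usr/local/cuda-12.9 prefix:

# Check the installed CUDA compiler version (the GPU driver itself is checked later with nvidia-smi)
/usr/local/cuda-12.9/bin/nvcc --version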
  2. Install Python dependencies (for fresh Ubuntu installations):
curl -LsSf https://astral.sh/uv/install.sh | sh
apt-get -y install python3.12-venv
  3. Set up the vLLM environment:
mkdir vllm-cuda-12
cd vllm-cuda-12
python3 -m venv .
source bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly/cu129 --extra-index-url https://download.pytorch.org/whl/cu129 --index-strategy unsafe-best-match
  4. Run the model:
# Flag notes:
#   --max-model-len 8192: the full 200k-token context exceeded the 48GB VRAM capacity
#   --gpu-memory-utilization 0.95: reserve a little memory for other applications
#   --tensor-parallel-size 1: run on a single GPU
#   2>&1 | tee vllm.log: keep a log for troubleshooting
vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size 1 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice 2>&1 | tee vllm.log
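Before wiring up vibe, it is worth checking that the server answers on its OpenAI-compatible endpoint; vLLM listens on port 8000 by default, so a quick check could look like this:

# List the models served by vLLM (should include Devstral-Small-2)
curl http://127.0.0.1:8000/v1/models

# Send a minimal chat completion to the OpenAI-compatible API
curl http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "mistralai/Devstral-Small-2-24B-Instruct-2512", "messages": [{"role": "user", "content": "Say hello"}]}'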
  5. Install vibe and configure the model:

The final step involves configuring ~/.vibe/config.toml to use the correct model with the appropriate provider:

[[providers]]
name = "llamacpp"
api_base = "http://127.0.0.1:8000/v1"
api_key_env_var = ""
api_style = "openai"
backend = "generic"
reasoning_field_name = "reasoning_content"

[[models]]
name = "devstral-2-small"
provider = "llamacpp"
alias = "local"
temperature = 0.2
input_price = 0.0
output_price = 0.0

Within vibe, you can then select the local model (the alias defined above) using the /config command.

You are now ready to code with Devstral-Small-2!

Here's a snapshot of GPU usage while serving tokens at high speed:

nvidia-smi
Tue Jan 27 14:53:40 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.95.05              Driver Version: 580.95.05      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    Off |   00000000:01:00.0 Off |                    0 |
| N/A   40C    P0             84W /  350W |   42113MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            6409      C   VLLM::EngineCore                      42104MiB |
+-----------------------------------------------------------------------------------------+

Side Quest #1: Running Devstral-Small-2 on KVM with a 5070Ti

Spoiler alert: it failed, but I gained valuable insights. One key takeaway was learning how to tweak WSL2 settings via C:\Users\[username]\.wslconfig (the complete list of configuration options is documented in Microsoft's WSL docs).

[wsl2]
# Memory allocation - leave 28GB for Windows
memory=100GB

# Use most of my CPU cores
processors=14

# Swap space for safety during compilation
swap=24GB

# Nested virtualization (needed to run KVM inside WSL2)
nestedVirtualization=true

# Disable page reporting to improve memory performance
pageReporting=false

# Experimental settings: reclaim idle memory gradually and keep the virtual disk sparse
[experimental]
autoMemoryReclaim=gradual
sparseVhd=true
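Changes to .wslconfig only take effect after WSL is restarted, which you can force from a Windows terminal:

# Shut down all WSL2 distributions so the new .wslconfig is picked up on next start
wsl --shutdown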

Another valuable lesson was learning how to offload part of the model weights from GPU to CPU memory with --cpu-offload-gb:

vllm serve mistralai/Devstral-Small-2-24B-Instruct-2512 \
    --max-model-len 2048 \
    --gpu-memory-utilization 0.95 \
    --cpu-offload-gb 10 \
    --tool-call-parser mistral \
    --enable-auto-tool-choice

Side Quest #2: Running Devstral-Small-2 on openSUSE Tumbleweed with a 5070Ti and Ollama

I started this second side-quest three days after completing the post above and... Guess what? With Ollama, it just worked.
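
A minimal sketch of that setup, assuming Ollama's official Linux install script and the devstral-small-2 tag from the Ollama library page linked above:

# Install Ollama (official Linux install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model and run it locally
ollama pull devstral-small-2
ollama run devstral-small-2

Ollama also exposes an OpenAI-compatible API on http://127.0.0.1:11434/v1, so the same ~/.vibe/config.toml approach as above works by pointing api_base at that URL.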

And that’s it.

Performance was equivalent to the Scaleway Ubuntu instance I used earlier — fast, smooth, and reliable. Nice and easy. You can basically forget everything I wrote above 😄

The French Paradox of AI Coding

Devstral 2 is like driving a Bugatti Veyron at the price of a Renault Clio: elegant French engineering delivering supercar performance with everyday affordability.

Give it a try and Vive la révolution de l'IA! 🚀

Disclaimer: This blog post has been proofread with Devstral using the guidelines from the LaFabrique.ai GitHub repo.