On April 2, 2026, Vitalik Buterin published the most technically detailed account of a personal private AI stack by any public figure — and the reasoning behind it should concern anyone sending sensitive data to cloud AI providers. Buterin runs Qwen3.5:35B locally on a laptop with an Nvidia RTX 5090 GPU, achieves 90 tokens per second, and wraps the entire system in NixOS reproducible configs with bubblewrap sandboxes. His core argument: the privacy movement spent decades winning battles against surveillance, and cloud AI is quietly reversing all of it. This guide breaks down his exact setup, shows you how to build your own version at any budget from $0 to $2,000, and gives you 6 original prompts to audit and lock down your AI privacy right now.
Why Vitalik Buterin Went Local: The Privacy Argument
Buterin did not switch to local AI because of performance. He switched because of what he calls a “deep fear” that cloud AI services are erasing the gains of decades of privacy advocacy. His argument has three layers, and the third one is the most important.
Layer 1: Your prompts are training data. Every query you send to a cloud AI provider is, unless you explicitly opt out (and sometimes even then), potential training data. The aggregate of your prompts — your questions, your drafts, your code, your confessions to the AI — forms a portrait of your thinking that is more intimate than your email history. Cloud providers hold this data indefinitely, and their privacy policies allow them to modify retention terms with notice.
Layer 2: AI agents multiply the exposure. As AI moves from chat interfaces to autonomous agents — agents that read your files, browse the web on your behalf, send messages, and execute code — the volume and sensitivity of data flowing through AI systems increases by orders of magnitude. An AI agent that manages your calendar, reads your emails, and drafts responses has access to more of your life than any single application you currently use. If that agent runs on someone else’s infrastructure, the privacy exposure is total.
Layer 3: The supply chain is already compromised. Buterin cited a specific data point that should alarm anyone building with AI tooling: 15% of community-contributed tools for OpenClaw (an open-source AI agent framework) contained malicious instructions. Not bugs. Not poorly-written code. Deliberate prompt injections designed to exfiltrate data or manipulate agent behavior. When the tools your AI agent uses are themselves compromised, running those tools on infrastructure you do not control means you have no line of defense. Running locally does not eliminate the malicious tool problem, but it gives you a sandbox boundary that cloud execution cannot provide.
This is not a theoretical concern from a person unfamiliar with technology. Buterin is the co-founder of Ethereum and one of the most technically sophisticated public figures in the world. When he restructures his entire computing workflow around local AI to protect his privacy, the reasoning deserves serious examination. If you’re already concerned about data exposure, start with the basics: use our password generator to create strong credentials for every AI service you currently use, and our hash generator to verify the integrity of any local model weights you download.
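That weight-verification step needs nothing beyond the Python standard library. A minimal sketch that streams a downloaded weights file through SHA-256 and compares it against the checksum published by the model repository (the file path and checksum you pass in are placeholders for your own download):

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so multi-gigabyte model files
    never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: str, expected_sha256: str) -> bool:
    """Compare the local file's digest against the published checksum."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

Most model hosts publish per-file SHA-256 checksums alongside the weights; if the computed digest does not match, discard the download rather than loading it.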
Vitalik’s Exact Hardware and Software Stack
Hardware
Buterin runs his AI stack on a laptop equipped with an Nvidia RTX 5090 GPU with 24 GB of VRAM. This is not a server rack or a custom-built desktop — it is a portable machine that travels with him. The RTX 5090 mobile variant, released in early 2026, delivers enough compute to run a 35-billion parameter model at approximately 90 tokens per second, which Buterin describes as his target for “comfortable daily use.” For context, 90 tokens per second means the AI generates roughly 70 words per second — significantly faster than you can read. This is not the compromised, laggy experience people associate with local AI from 2024. It is functionally instant for interactive use.
Model: Qwen3.5:35B
Buterin chose Qwen3.5:35B as his primary model. Qwen3.5 is Alibaba’s open-weight model family, and the 35B variant sits in the sweet spot between capability and hardware requirements. At 35 billion parameters, it fits comfortably in 24 GB of VRAM with Q4 quantization, delivers strong reasoning and coding performance, and supports a 128K token context window. In community benchmarks from April 2026, Qwen3.5:35B scores within 5–10% of GPT-4o on most reasoning and coding tasks — more than sufficient for daily coding assistance, writing, analysis, and agent tasks.
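The claim that a 35B model fits in 24 GB at Q4 follows from a standard back-of-the-envelope calculation. A rough Python sketch, where the ~4.5 bits-per-weight figure approximates Q4_K_M-style quantization and the flat overhead allowance for KV cache and runtime buffers is an assumption, not a measurement:

```python
def est_vram_gb(params_billion: float,
                bits_per_weight: float = 4.5,
                overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weight size plus a flat allowance
    for KV cache and runtime buffers."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb + overhead_gb


# 35B at ~4.5 bits/weight is roughly 19.7 GB of weights; with a ~2 GB
# allowance the total stays under a 24 GB card's budget.
```

The same arithmetic explains why 70B-class models do not fit on a single 24 GB GPU without more aggressive quantization or offloading.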
Inference: llama-server via llama-swap
Rather than using Ollama or vLLM, Buterin runs inference through llama-server (part of the llama.cpp project) managed by llama-swap, a lightweight model-switching proxy. This setup allows him to hot-swap between multiple models without restarting the inference server — useful when different tasks benefit from different model specializations. The llama.cpp backend is C++ compiled, with no Python runtime overhead, which contributes to the high token-per-second throughput on his hardware.
Operating System: NixOS
The operating system choice is NixOS, a Linux distribution built around declarative, reproducible configuration. Every package, every system setting, and every service configuration is defined in a single configuration file that can be version-controlled, audited, and reproduced exactly on another machine. For a privacy-focused AI setup, this is significant: you can verify that your system configuration has not been modified, roll back to any previous state, and share your exact configuration with others for independent audit. NixOS eliminates the “it works on my machine” problem and the “I don’t know what’s running on my system” problem simultaneously.
Sandboxing: Bubblewrap
For AI agent tasks — situations where the model needs to execute code, read files, or interact with external services — Buterin uses bubblewrap (bwrap), a lightweight sandboxing tool that creates isolated environments with restricted filesystem access, no network connectivity unless explicitly granted, and limited system call access. This is the defense against the compromised-tool problem: even if an AI agent executes a malicious instruction from a community tool, the sandbox prevents it from accessing files outside its designated directory or making unauthorized network requests.
Communication Security: Human + LLM 2-of-2 Authorization
The most novel element of Buterin’s setup is a messaging daemon that implements “human + LLM 2-of-2” authorization for outgoing messages. When an AI agent wants to send a message on Buterin’s behalf, the message requires approval from both a human (Buterin himself) and a separate LLM instance acting as a security reviewer. Neither party alone can authorize an outgoing message. This is a cryptographic-style authorization pattern — similar to multi-signature cryptocurrency wallets — applied to AI agent communication. It prevents both accidental sends (human approves carelessly) and prompt injection attacks (malicious instruction bypasses human review by crafting plausible-looking messages).
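Buterin's daemon itself is not published as a reusable package, but the 2-of-2 gate is straightforward to sketch. In this hypothetical Python version, a message goes out only when the human reviewer and an independent LLM reviewer both approve; a veto from either side blocks the send:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Approval:
    approved: bool
    reason: str


def two_of_two_send(
    message: str,
    human_review: Callable[[str], Approval],
    llm_review: Callable[[str], Approval],
    send: Callable[[str], None],
) -> bool:
    """Release the message only if BOTH reviewers approve (2-of-2).
    Returns True if the message was sent, False if either side vetoed."""
    human = human_review(message)
    llm = llm_review(message)
    if human.approved and llm.approved:
        send(message)
        return True
    return False
```

In practice, llm_review would be a separate model instance prompted specifically to look for prompt-injection artifacts and out-of-character requests, so a single compromised context cannot authorize its own output.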
Build Your Own Private AI Stack: Three Budget Tiers
You do not need Buterin’s hardware budget to run AI locally. Here are three realistic paths at different price points, each delivering meaningfully private AI capability.
Tier 1: $0 — CPU-Only on Your Existing Machine
If you have a reasonably modern laptop or desktop with 16 GB of RAM, you can run smaller models entirely on CPU. The experience is slower — roughly 5–15 tokens per second depending on your CPU and the model — but functional for many tasks.
Setup:
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a capable small model
ollama pull qwen2.5:7b
# Start chatting
ollama run qwen2.5:7b

Recommended models for CPU: Qwen2.5:7B, Mistral Small 3.1 (24B with Q3 quantization if you have 32 GB RAM), or Phi-3.5 Mini (3.8B for machines with only 8 GB RAM). These models handle coding assistance, writing, and general Q&A at quality levels that would have been considered frontier-class in 2023. The key principle: your data never leaves your machine, and the model runs entirely offline once downloaded.
Tier 2: $300–$500 — Used GPU Acceleration
A used RTX 3060 12GB or RTX 3080 10GB, available for $200–$400 on secondary markets in April 2026, transforms local AI performance. With 10–12 GB of VRAM, you can run 7B–14B parameter models at 30–60 tokens per second — fast enough for comfortable interactive use.
Setup:
# Install Ollama (same as above)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull a mid-range model that fits 12GB VRAM
ollama pull qwen2.5:14b
# Or run Mistral Small for stronger reasoning
ollama pull mistral-small:latest
# Start the API server for integration with other tools
ollama serve

Recommended models: Qwen2.5:14B (excellent coding and reasoning), Mistral Small 3.1 (strong multilingual and instruction following), or Llama 3.3:8B (Meta’s efficient general-purpose model). At this tier, you get genuinely useful AI assistance for coding, writing, analysis, and research — all running locally with zero data exposure.
Tier 3: $1,500–$2,000 — RTX 4090 or Equivalent
An RTX 4090 with 24 GB of VRAM is the sweet spot for serious local AI in 2026. At this tier, you can run Qwen3.5:35B (Buterin’s model of choice), Llama 4 Scout at Q3 quantization, or any model up to approximately 35B parameters at full speed. Performance reaches 40–90 tokens per second depending on the model and quantization level. For a comprehensive guide on running Llama 4 Scout locally, see our Llama 4 Scout local deployment guide.
Setup with llama.cpp (Buterin’s approach):
# Clone and build llama.cpp with CUDA support
# (llama.cpp builds with CMake; the old Makefile path is deprecated)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
# Download a GGUF model (e.g., Qwen3.5:35B Q4_K_M)
# From Hugging Face or your preferred model repository
# Run the server
./build/bin/llama-server -m models/qwen3.5-35b-q4_k_m.gguf \
    --ctx-size 32768 \
    --n-gpu-layers 99 \
    --port 8080

This exposes an OpenAI-compatible API at localhost:8080. You can connect it to VS Code extensions, custom scripts, or any tool that supports the OpenAI chat completions format. The model runs entirely on your hardware, and the API never leaves your local network.
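Any language that speaks HTTP can drive that endpoint. A minimal stdlib-only Python client as one sketch; the /v1/chat/completions path follows llama-server's OpenAI-compatible API, while the model name and port are whatever your local server is configured with:

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen3.5-35b") -> bytes:
    """Build an OpenAI-style chat completions payload as UTF-8 JSON."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }).encode("utf-8")


def ask_local(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST to the local llama-server and return the assistant's reply.
    Requires the server from the setup above to be running."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint mimics the OpenAI schema, most editor plugins and SDKs only need their base URL pointed at localhost:8080 to switch to the local model.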
Apple Silicon Alternative
If you are on a Mac with Apple Silicon, the unified memory architecture gives you a significant advantage. A Mac Mini M4 Pro with 48 GB unified memory ($1,599 in April 2026) can run Qwen3.5:35B at Q4 quantization with approximately 20–25 tokens per second — slower than a discrete GPU but entirely silent, extremely power-efficient, and with no driver complexity. For developers who prefer the Apple ecosystem, this is the cleanest path to a Buterin-style setup.
Sandboxing Your AI Agents: Options for Every OS
Running the model locally is half the equation. The other half is ensuring that when your AI executes code or interacts with files, it cannot access anything beyond what you explicitly permit.
Linux: bubblewrap (bwrap)
# Run a command in a sandbox with limited filesystem access
bwrap --ro-bind /usr /usr \
--ro-bind /lib /lib \
--ro-bind /lib64 /lib64 \
--bind /tmp/ai-sandbox /workspace \
--unshare-net \
--die-with-parent \
/bin/bash

This creates an environment where the AI agent can only read system libraries (read-only) and write to a designated workspace directory. Network access is completely disabled (--unshare-net). If the agent tries to read your home directory, access your SSH keys, or phone home to a remote server, the attempt simply fails.
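If an agent framework launches sandboxed tasks from a script rather than an interactive shell, the same invocation can be assembled programmatically. A hypothetical Python helper mirroring the bwrap flags above, with networking off unless explicitly granted:

```python
import subprocess


def bwrap_args(workspace: str, command: list[str],
               allow_network: bool = False) -> list[str]:
    """Assemble a bubblewrap invocation: read-only system directories,
    a single writable workspace, and no network unless granted."""
    args = [
        "bwrap",
        "--ro-bind", "/usr", "/usr",
        "--ro-bind", "/lib", "/lib",
        "--ro-bind", "/lib64", "/lib64",
        "--bind", workspace, "/workspace",
        "--die-with-parent",
    ]
    if not allow_network:
        args.append("--unshare-net")
    return args + command


def run_sandboxed(workspace: str, command: list[str]) -> int:
    """Run a command inside the sandbox; requires bwrap to be installed."""
    return subprocess.run(bwrap_args(workspace, command)).returncode
```

Keeping the flag list in one function also makes the sandbox policy auditable: any loosening of the boundary shows up as a diff in a single place.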
macOS: Apple’s App Sandbox or Docker
# Run AI agent tasks in a Docker container with no network
docker run --rm --network none \
-v $(pwd)/sandbox:/workspace \
python:3.12-slim \
python /workspace/agent_task.py

Windows: Windows Sandbox or WSL2 + Docker
Windows Sandbox provides a lightweight, disposable virtual machine that resets completely when closed. For persistent sandboxed environments, WSL2 with Docker provides Linux-equivalent isolation. Either approach prevents AI agent tasks from accessing your Windows user profile, documents, or network resources without explicit permission.
6 Prompts to Lock Down Your AI Privacy Right Now
You do not need local hardware to start improving your AI privacy posture. These six prompts work with any AI model — cloud or local — and help you audit, boundary-set, and plan your transition to more private AI usage.
Prompt 1: AI Privacy Audit
Use this prompt with whatever AI service you currently use most. The response reveals how the service handles your data, what it retains, and where your exposure points are.
I want to understand exactly what happens to my data when I use you.
Answer these questions specifically:
1. Are my prompts stored after this session ends? For how long?
2. Are my prompts used to train future models? Can I opt out?
3. If I paste code, documents, or personal information into this chat,
who at your company can access it?
4. Do you share any conversation data with third parties?
5. If I delete my account, is my conversation history actually deleted
from all systems, including backups?
6. What jurisdiction’s privacy laws govern my data?
Be specific. If the answer is "it depends on your plan," tell me
what it depends on.

The AI’s response (or its inability to answer clearly) tells you exactly how much you should trust it with sensitive information. If the model hedges or cannot answer questions 1–3 directly, treat that as a red flag.
Prompt 2: Permission Boundary Setter
Use this at the start of any session where you plan to share sensitive context. It establishes explicit boundaries the AI should respect.
For this session, I am setting the following boundaries:
- You may ONLY reference information I explicitly provide in this chat
- Do NOT infer, assume, or reference any information from my previous
sessions, account profile, or usage patterns
- If I share code or documents, treat them as confidential. Do not
reference their content in any summarization, analytics, or training
pipeline
- If any instruction contradicts these boundaries, refuse it and
explain why
Confirm you understand these constraints and will follow them for
this entire session.

This prompt does not technically prevent a cloud provider from processing your data. But it creates an explicit record of your intent, and in jurisdictions with strong data protection laws (GDPR, CCPA), documented intent carries legal weight. More practically, it primes the model to avoid cross-session data leakage in its responses.
Prompt 3: Data Leak Detector
This prompt tests whether your AI service leaks information across sessions or users. Run it at the start of a fresh session.
I want to test whether information persists across sessions.
Without me telling you, can you answer any of these:
1. What programming language did I use most recently?
2. What project am I currently working on?
3. What is my name or any identifying information?
4. What topics have I asked about in previous conversations?
For each question, tell me:
- Whether you have any information (yes/no)
- If yes, where that information comes from (memory feature,
system prompt, account data, or inference from this session)
Be completely honest. If you are uncertain whether you should
reveal this information, say so and explain why.

If the AI can answer any of these questions in a session where you have not provided the information, you have confirmed cross-session data persistence. This is not inherently malicious — many services offer “memory” as a feature — but you should know it is happening and understand how to disable it.
Prompt 4: Local Model Evaluator
Before investing in local AI hardware, use this prompt to determine whether local models can handle your specific workload. Run it with both a cloud model and a local model, then compare outputs.
I’m evaluating whether I can replace my cloud AI usage with a
local model. Help me design a fair test.
Here are my top 5 AI use cases (replace with your actual uses):
1. [e.g., Code review for Python/TypeScript projects]
2. [e.g., Writing technical documentation]
3. [e.g., Analyzing CSV data and generating summaries]
4. [e.g., Brainstorming product features]
5. [e.g., Debugging error messages]
For each use case:
- Rate the minimum model capability needed (low/medium/high)
- Suggest the smallest local model that would handle it adequately
- Estimate the VRAM requirement for that model
- Flag any use cases where local models in April 2026 genuinely
cannot match cloud quality
Be honest about where local models fall short. I need accurate
assessment, not enthusiasm.

Prompt 5: Threat Model Generator
Have your AI create a threat model for your own AI usage patterns. This surfaces risks you may not have considered.
Act as a security analyst. I’m going to describe my current AI
usage, and I want you to build a threat model.
My setup:
- Primary AI: [e.g., ChatGPT Plus via browser]
- Secondary AI: [e.g., GitHub Copilot in VS Code]
- I use AI for: [list your actual uses]
- Sensitive data I regularly share with AI: [be honest]
- My industry: [e.g., fintech, healthcare, education]
Build a threat model with:
1. Attack surface map (every point where my data touches AI infra)
2. Top 5 realistic threats ranked by likelihood and impact
3. For each threat: what the attacker gains, how they exploit it,
and what evidence I’d see if it happened
4. Specific mitigations I can implement THIS WEEK
5. What changes if I move to local AI (which threats disappear,
which new ones appear)
Do not soften the assessment. I want the uncomfortable version.

Prompt 6: Escape Hatch Builder
This prompt generates scripts and procedures to extract all your data from any AI service, ensuring you are never locked in.
I want to build escape hatches for every AI service I use so I
can leave any of them within 24 hours without losing data.
Services I use:
- [e.g., ChatGPT — conversation history, custom GPTs, system prompts]
- [e.g., GitHub Copilot — suggestion history, settings]
- [e.g., Notion AI — enhanced documents]
- [Add your actual services]
For each service, give me:
1. Exact steps to export ALL my data (API calls, UI steps, or scripts)
2. A script I can run to automate the export where possible
3. The format the data exports in, and how to convert it to
a portable open format
4. What data CANNOT be exported and why
5. A local storage plan for the exported data
6. How to verify the export is complete (checksums, record counts)
Write the scripts in Python or Bash. Assume I’m on macOS or Linux.

Running this prompt once and saving the output gives you a documented exit strategy for every AI service in your stack. Update it quarterly as services change their export capabilities.
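Item 6 of that prompt, verifying completeness, is easy to script yourself. A minimal Python sketch that produces a checksum-plus-record-count manifest for an export; the JSON Lines format is an assumption for illustration, so adapt the record counting to whatever your service actually emits:

```python
import hashlib


def export_manifest(path: str) -> dict:
    """Record a SHA-256 checksum and record count for a JSONL export,
    so a later re-export can be compared against this baseline."""
    digest = hashlib.sha256()
    records = 0
    with open(path, "rb") as f:
        for line in f:
            digest.update(line)
            if line.strip():
                records += 1
    return {"file": path, "sha256": digest.hexdigest(), "records": records}
```

Store the manifest next to the export itself; if a future export of the same account shows fewer records or a changed checksum for supposedly immutable history, you know data was dropped or altered.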
The Bigger Picture: Self-Sovereign AI Is Not Paranoia
Buterin’s move to local AI is not an isolated technical decision. It is part of a broader pattern visible across the technology industry in 2026: the people who understand AI infrastructure most deeply are the ones most aggressively moving their personal AI usage off of cloud platforms.
The pattern makes sense when you consider the incentive structures. Cloud AI providers are building businesses on data. Their models improve when they have more data. Your prompts are data. The tension between “we protect your privacy” and “our model needs your data to improve” is structural, not accidental. It cannot be fully resolved by privacy policies or opt-out checkboxes because the fundamental business model depends on access to user data at scale.
Local AI resolves this tension by elimination. When the model runs on your hardware and your data never leaves your machine, there is no privacy policy to parse, no opt-out to verify, and no trust decision to make about a corporation’s future behavior. The security boundary is physical: your data stays on metal you own.
This does not mean cloud AI is useless or that everyone should immediately abandon commercial AI services. For many use cases — especially tasks that benefit from the largest frontier models, real-time web access, or multimodal capabilities that require datacenter-scale compute — cloud AI remains the better tool. The point is that the choice should be conscious, not default. Every prompt you send to a cloud service should pass a mental test: “Would I be comfortable if this prompt appeared in a data breach disclosure?” If the answer is no, that prompt belongs on local hardware.
Getting Started This Week
You do not need to replicate Buterin’s full NixOS setup to meaningfully improve your AI privacy. Here is a practical starting sequence that takes less than an hour:
- Run Prompt 1 (AI Privacy Audit) with your current AI service. Understand your exposure.
- Install Ollama on your current machine. It takes two minutes and works on macOS, Linux, and Windows.
- Pull a small model (ollama pull qwen2.5:7b) and try it for your most common tasks.
- Run Prompt 4 (Local Model Evaluator) to determine which of your tasks can move to local AI without meaningful quality loss.
- Move your most sensitive tasks first. Code review of proprietary code, drafting confidential documents, brainstorming competitive strategy — these should run locally regardless of any quality tradeoff.
- Run Prompt 5 (Threat Model Generator) to understand your remaining exposure and plan further migration.
The goal is not perfection. The goal is to stop sending your most sensitive data to infrastructure you do not control, starting with the highest-risk use cases and expanding from there. Buterin’s setup represents one endpoint of that spectrum. But every step along the spectrum — from running your first local model to sandboxing your AI agents to implementing 2-of-2 message authorization — meaningfully reduces your exposure.
The tools exist. The models are good enough. The hardware is affordable enough. The only remaining variable is whether you decide your AI privacy is worth an afternoon of setup.