Running Local Models
Free, private, no API costs. How to run powerful AI models on your own hardware — overnight tasks, bulk work, nothing leaves your machine.
This is an advanced section. It requires installing new software (Ollama) and assumes you're comfortable with a terminal. If you're just getting started, build the OS and knowledge wiki first — come back here when you've outgrown Claude alone.
I added this section because I truly believe local models are now good enough to do some serious grunt work without paying a higher-end model to do it. Want to stay in Claude? Skip to the Haiku section below.
Why local models
Cloud APIs charge per token. For a single research question, the cost is negligible. For bulk work — ingesting 50 documents, running overnight analysis, checking data sources every hour — the costs add up fast.
Local models run for free once installed. No API key, no usage limits, no bill at the end of the month. The tradeoff is real: local models are less capable than frontier models like Claude, they have no internet access, and they need decent hardware. But for the right tasks, they're more than good enough — and the economics are fundamentally different.
The goal isn't to replace Claude with local models. It's to route work intelligently — free local compute for volume tasks, paid cloud for quality tasks. Once you have this routing in place, your AI costs drop while capability stays high.
What hardware you need
A Mac with Apple Silicon (M1, M2, M3, or M4 chip) is currently the best consumer hardware for running local models. Apple's unified memory architecture means the GPU and CPU share the same RAM — which matters for model inference. 16GB RAM is the minimum; 32GB or more is recommended for larger models.
A Windows or Linux PC with a dedicated GPU also works; NVIDIA cards with 16GB+ VRAM are the standard choice. You only need an internet connection for the initial model downloads; after that, everything runs offline.
If you have a Mac Studio, Mac Pro, or M-series MacBook Pro with 32GB+ RAM, you already have excellent local model hardware. You don't need anything new.
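Not sure what you have? A quick way to check (macOS and NVIDIA commands shown; adapt as needed):

# macOS: total unified memory, in bytes
sysctl hw.memsize

# macOS: arm64 means Apple Silicon, x86_64 means Intel
uname -m

# Linux or Windows with an NVIDIA card: VRAM per GPU
nvidia-smi --query-gpu=memory.total --format=csv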
Ollama — the easiest way to run local models
Ollama (ollama.ai) is the simplest way to get started. Install it, pull a model with one command, and it runs locally. Ollama serves models via a local API — so any tool that can call an API can use your local model, exactly as it would call Claude.
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5:32b    # pull a model
ollama run qwen2.5:32b     # run it interactively

# Or call it via API (same interface as OpenAI):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen2.5:32b", "messages": [{"role": "user", "content": "Hello"}]}'

Ollama has a model library at ollama.ai/library where you can browse available models, see their sizes, and check what hardware they need. Check there for current recommendations — the landscape changes fast.
Which models to run
Model recommendations shift as new versions are released — treat these as starting points. The general principle holds: bigger models are more capable but slower and need more RAM.
For general tasks and research: a 30B+ parameter model with strong reasoning. For coding: a code-specialist model in the 30B range. For quick checks and monitoring: a lightweight 7B model that runs fast.
Check the Ollama model library for current best options in each category.
# General tasks / research (32-35B range)
ollama pull qwen2.5:32b

# Coding tasks (code-specialist)
ollama pull qwen2.5-coder:32b

# Quick checks / monitoring (fast, lightweight)
ollama pull qwen2.5:7b
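A rough sizing rule of thumb: at the 4-bit quantization most Ollama library models ship with, expect roughly 0.5-0.7 GB of RAM per billion parameters, plus headroom for the context window. That puts a 32B model at around 20GB, which is why it runs comfortably on a 32GB machine but not a 16GB one.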
What to use local models for
Local models are the workhorse — high volume, low judgment. Things that need to run many times, unattended, without watching the API bill:
Bulk document ingestion — processing fund docs into the knowledge wiki overnight (see the sketch below).
Recurring monitoring — checking data sources every hour, flagging changes.
Data pipeline work — fetching, parsing, structuring data.
Draft writing that you'll review before using.
If a task runs more than 10 times a day or you want it to run while you're not there — that's a local model job. If quality matters more than cost, or you need judgment — that's Claude.
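To make the first of those concrete, here is a minimal sketch of an overnight ingestion loop against Ollama's native API. The model name, paths, and prompt are placeholders; adapt them to your own setup.

# Minimal overnight ingestion sketch (model, paths, and prompt are placeholders).
# Summarises every extracted-text file in docs/ and writes one summary per file.
# (For long documents you'd chunk first; this is the simplest version.)
mkdir -p summaries
for f in docs/*.txt; do
  jq -n --arg text "$(cat "$f")" \
    '{model: "qwen2.5:32b", stream: false,
      prompt: ("Summarise the key facts in this document:\n\n" + $text)}' \
    | curl -s http://localhost:11434/api/generate -d @- \
    | jq -r '.response' > "summaries/$(basename "$f")"
done

Kick it off before you leave, and the summaries are waiting in the morning. The same loop pointed at a paid API would have a bill attached; here it costs nothing to rerun.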
What to keep on Claude
Judgment calls. Nuanced analysis. Anything client-facing. Complex reasoning chains. Anything where getting it slightly wrong has consequences.
Local models are excellent at pattern-following, summarisation, and structured tasks with clear inputs and outputs. They're weaker at novel reasoning, creative synthesis, and knowing when something doesn't quite add up.
The routing decision is simple: if you'd be comfortable with a junior analyst doing the task unsupervised, it's a local model job. If you'd want a senior person to think it through — use Claude.
Not ready for local models? Use Haiku instead
If installing Ollama feels like a step too far right now, Claude Haiku is a good middle ground for bulk tasks. It's Anthropic's fastest, cheapest model — roughly 25x cheaper per token than Sonnet — and it handles ingestion, summarisation, and structured extraction well.
Not as capable as Sonnet for judgment calls, but for processing documents into a knowledge wiki it's more than good enough. You stay within the Claude ecosystem, no new software to install, and the cost stays manageable even at volume.
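If you want to see what that looks like in practice, here's a sketch of a bulk-task call via the Anthropic Messages API. The model name is illustrative; check Anthropic's docs for the current Haiku identifier.

# Bulk-task call to Haiku (model name is illustrative; check current docs).
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-3-5-haiku-latest",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarise the key facts in this document: ..."}]
  }'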
Local models start making sense when you want zero API costs, when data privacy is critical, or when you need tasks running continuously overnight without watching the bill.
Security and privacy
Nothing leaves your machine. The model runs locally, the data stays local, no API calls go out. For sensitive fund information this matters — you're not sending confidential docs to a cloud provider.
This is also why the separate machine approach makes sense (see The OS). A dedicated device running local models, with no credentials or sensitive data on it, is a clean security boundary.
My setup
Mac Studio 32GB. Ollama running three models:
35B general model — for research, analysis, document work.
32B code-specialist model — for building scripts and pipelines.
7B lightweight model — for quick checks and monitoring.
Total running cost: electricity. The hardware is already paid for.
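The 7B model carries the recurring checks. A sketch of how that might be scheduled (the prompt and log path are placeholders):

#!/bin/sh
# check.sh — an hourly spot check routed to the lightweight model.
# The prompt is a placeholder; adapt it to whatever you're monitoring.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen2.5:7b", "stream": false, "prompt": "PLACEHOLDER: describe the check"}' \
  | jq -r '.response' >> "$HOME/monitoring.log"

# Schedule it hourly (crontab -e):
# 0 * * * * /path/to/check.sh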