
Going Local: The Reality of Replacing Claude and GPT

Developers are swapping frontier models for local setups, but managing context caching and agent loops requires careful tuning.

Priya Nair

AI & Developer Experience Writer · Jun 15, 2026 · 5 min read

The promise of running a fully local, private, and free development assistant is highly appealing. For developers handling sensitive codebases or looking to escape recurring API subscriptions, the transition from proprietary frontier models to local hardware is no longer a distant dream.

However, a local model is not a drop-in replacement for Claude or GPT-4. Reports from developers running local setups point to the same trade-off: local models can deliver real productivity gains, but they demand specific hardware configurations, precise prompting, and active troubleshooting of context caching issues.

The Hardware and Backend Stack

To run capable coding models locally, developers are gravitating toward high-bandwidth unified memory architectures. Two primary hardware configurations dominate these setups:

Apple Silicon: Mac Studio configurations with 128GB of RAM or MacBooks with 36GB of RAM running containerized environments.
AMD Strix Halo: Laptops equipped with 128 GiB of unified memory, running llama.cpp inside containers.

On the software side, developers are running agentic coding harnesses—such as containerized and sandboxed instances of the Pi coding harness—to ensure the model operates completely offline without access to sensitive credentials.

On AMD Strix Halo hardware, developers report a counterintuitive result with backends. AMD's ROCm is the expected choice, yet multiple developers have found that the Vulkan backend in llama.cpp runs slightly faster and more reliably than ROCm when running Qwen models.

Model Selection: Finding the Sweet Spot

When it comes to model selection, bigger is not always better. While massive models exist, they are often too slow for interactive daily coding. Instead, developers are finding a sweet spot in hybrid Mixture of Experts (MoE) models:

Qwen 3.6 35B-A3B (35B total parameters, 3B active): This model has emerged as the favorite for daily agentic coding. It offers a strong balance of speed and capability on both Apple Silicon and AMD Strix Halo setups.
Qwen 3.5 122B (with 10B active parameters): Reserved for highly complex tasks. With 10B active parameters it runs significantly slower than the 35B-A3B model, which makes it a poor fit for rapid, iterative development.
Alternative Models: For non-coding tasks, developers keep a variety of models in rotation. Gemma 4 31B is frequently used for general chat and translation, Gemma 4 12B handles audio tasks, and models like Nemotron 3 Super 122B-A12B, Step 3.7 Flash, Minimax M2.7, and GPT-OSS 120B are kept on hand for benchmarking and specialized testing.
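
A rough way to see why these models suit the hardware above: an MoE model must keep all of its weights in memory, even though only the active parameters are used for each token. Memory follows the total parameter count, while speed follows the active count. The sketch below estimates weight memory only. The 4-bit and 8-bit widths are assumptions for illustration, not figures from the developers quoted here, and the estimate ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    total_bytes = total_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Total parameter counts come from the model names above.
# The bit-widths are assumed quantization levels, chosen only for illustration.
for name, params_b in [("Qwen 3.6 35B-A3B", 35), ("Qwen 3.5 122B", 122)]:
    for bits in (4, 8):
        gb = weight_memory_gb(params_b, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB of weights")
```

Under these assumptions the 35B model needs about 17.5 GB of weights at 4-bit, and the 122B model about 61 GB. That is consistent with the smaller model being usable on a 36GB MacBook and the larger one calling for a 128GB machine.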

The Prompt Caching and "Thinking" Gotcha

One of the biggest technical hurdles when running local hybrid models is prompt caching. Developers running Qwen hybrid models on llama.cpp frequently hit an issue where the model re-processes the entire context on every turn, destroying performance.

The cause is how reasoning tokens are handled between turns. Most local models are not trained to see their full reasoning trace from earlier turns, so by default the executor does not pass previous reasoning back to the LLM. After a long, interleaved chain of reasoning and tool calls, the prompt sent on the next turn no longer matches the token sequence the server has cached. The cached state cannot be reused, which forces a complete re-calculation of the KV cache.

To resolve this, developers must ensure their local executor is up to date and explicitly configure the model to preserve its thinking. For Qwen 3.6 models in llama.cpp, this is achieved by adding the following configuration to the models.ini file:

chat-template-kwargs = {"preserve_thinking": true}

Enabling this flag prevents the model from dropping its reasoning trace, allowing llama.cpp to reuse the cache efficiently instead of reprocessing the entire context on every turn.
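
The effect can be illustrated with a toy model of prefix-based cache reuse. This is a simplified sketch, not llama.cpp's actual implementation. The token strings and the `reusable_prefix_len` helper are hypothetical. It assumes only that a cache can be reused for the leading tokens a new prompt shares with what was already processed.

```python
def reusable_prefix_len(cached: list[str], new_prompt: list[str]) -> int:
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    shared = 0
    for old, new in zip(cached, new_prompt):
        if old != new:
            break
        shared += 1
    return shared


# Toy "tokens" standing in for what the server processed during turn 1.
turn1 = ["<sys>", "<user:fix bug>", "<think:plan>", "<tool:read>", "<result>", "<answer>"]
follow_up = ["<user:add test>"]

# Default behaviour: earlier reasoning is dropped when the prompt is rebuilt.
stripped = [t for t in turn1 if not t.startswith("<think")] + follow_up
# With reasoning preserved, the new prompt simply extends the cached sequence.
preserved = turn1 + follow_up

print(reusable_prefix_len(turn1, stripped))   # 2
print(reusable_prefix_len(turn1, preserved))  # 6
```

In this toy, a plain prefix cache could still reuse the two tokens before the divergence. The full re-processing developers report suggests that hybrid models cannot always resume from a point in the middle of the sequence, which this sketch does not model.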

Agentic Workflows: Junior vs. Senior

Even with optimized hardware and caching, the gap between local models and frontier models like Claude Opus remains highly visible in daily workflows.

The limits show up on tasks like redesigning a website homepage and blog with Django and Wagtail. Because Wagtail is less widely used than Django itself, an offline local agent that cannot look anything up online struggles to find correct patterns.

Developers describe the difference between Qwen 3.6 35B and Claude Opus as the difference between a junior developer and a senior architect:

Lack of Architectural Foresight: Local models do not "think ahead" for you. If assumptions are left open in a prompt, the model will take the path of least resistance to achieve the immediate goal (such as writing inline CSS directly in HTML) rather than designing a clean, maintainable architecture.
Tool Call Failures: Local agents frequently get edit tool calls wrong. When an edit fails, instead of retrying as instructed by the system prompt, they often enter loops, wasting thinking tokens and repeatedly re-reading files.
The Speedup Gap: While Claude Opus can provide an estimated 15x speedup for certain tasks, a fully offline local Qwen setup delivers closer to a 5x speedup.
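
One way a harness could contain the edit-loop problem described above is to cap retries and stop as soon as the model proposes the same failing edit twice. The sketch below is a hypothetical guard, not how Pi or any other harness works. `propose_edit` and `apply_edit` are assumed callables standing in for the model and the edit tool.

```python
from collections import Counter
from typing import Callable


def run_edit_with_guard(
    propose_edit: Callable[[str], dict],
    apply_edit: Callable[[dict], tuple[bool, str]],
    max_attempts: int = 3,
) -> tuple[bool, list[str]]:
    """Retry a failing edit a bounded number of times; abort on an exact repeat."""
    seen: Counter = Counter()
    errors: list[str] = []
    feedback = ""
    for _ in range(max_attempts):
        call = propose_edit(feedback)      # hypothetical: ask the model for an edit
        key = repr(sorted(call.items()))   # stable fingerprint of the tool call
        seen[key] += 1
        if seen[key] > 1:
            errors.append("identical edit repeated; aborting")
            break
        ok, message = apply_edit(call)     # hypothetical: the harness applies the edit
        if ok:
            return True, errors
        errors.append(message)
        feedback = message                 # give the error back to the model
    return False, errors


# Stub model that keeps proposing the same edit regardless of feedback.
def stuck_model(feedback: str) -> dict:
    return {"path": "home.html", "old": "<h1>Hi</h1>", "new": "<h1>Hello</h1>"}


# Stub edit tool that always fails.
def failing_apply(call: dict) -> tuple[bool, str]:
    return False, "old text not found in file"


print(run_edit_with_guard(stuck_model, failing_apply))
```

With the stubs shown, the guard stops on the second attempt instead of looping. A real harness would then hand control back to the developer or force a fresh read of the file rather than spend more thinking tokens.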

Despite these limitations, a 5x speedup on a completely free, private, and local stack represents a massive win for developers willing to write highly precise prompts and actively guide their local assistant.

Sources & further reading

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding? — news.ycombinator.com



Discussion 1


Marc Pope @marcpope · 7 hours ago

Until the local models can hang with Claude Code on a mac studio with 48GB RAM, I'm going to stick with Claude. I hate paying $200 a month, but I get the work of 5 developers.
