Skip to content
Dev Tools Tutorial

Run a Fast, Fully Local Coding Agent on macOS

Ditch the cloud and run Gemma 4 locally with llama.cpp and speculative decoding for blazing-fast terminal assistance.

Rachel Goldstein
Rachel Goldstein
Dev Tools Editor · Jun 12, 2026 · 5 min read

Cloud-based coding assistants are great until your internet drops, or you realize you are paying a premium to send your proprietary codebase to a third-party server. Running a fully local coding agent on macOS is the obvious cure, but historically, local models have been too sluggish for the rapid-fire tool calls that agents require.

That equation changes with Google's Gemma 4. By pairing the Gemma 4 26B-A4B model with Multi-Token Prediction (MTP) and speculative decoding, you can achieve highly usable generation speeds entirely on Apple Silicon.

The Secret Sauce: Multi-Token Prediction (MTP)

Speculative decoding is a classic optimization trick: use a tiny, fast draft model to guess tokens, and let the massive main model validate them in parallel. Gemma 4 introduces a Q8 MTP draft model (gemma-4-26B-A4B-it-Q8_0-MTP.gguf) that acts as this accelerator.

When running the main 16 GB gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf model on an Apple M1 Max (64 GB unified memory, macOS 15.7.7), a baseline run with llama.cpp and Metal acceleration yields about 58.2 tokens per second. That is usable, but painful when an agent is looping through multiple tool calls.

By adding the Q8 MTP draft model and tuning the speculative draft parameter (--spec-draft-n-max), generation speeds jump to 72.2 tokens per second—a clean 24% speedup. Crucially, prompt processing speeds remain virtually untouched at around 295 tokens per second.

The Benchmark: llama.cpp Outpaces MLX

You might expect Apple's own MLX framework to run circles around a cross-platform tool like llama.cpp on Apple Silicon. However, benchmarks show that llama.cpp's mature Metal optimizations and speculative decoding support give it a massive edge for this specific workload.

Runtime / Model Generation Speed (tok/s)
llama.cpp Metal + MTP (Unsloth GGUF Q4 + Q8 MTP) 72.2
llama.cpp Metal (Unsloth GGUF Q4) 58.2
MLX-LM (Unsloth UD MLX 4-bit) 45.8
MLX-LM (mlx-community 4-bit) 43.9
MLX-LM (mlx-community OptiQ 4-bit) 38.1

For this setup, llama.cpp with MTP is the clear winner. Attempts to run Gemma 4 MTP via gemma-4-swift-mlx can also run into weight key mismatches with 26B 4-bit MLX checkpoints, making llama.cpp the more robust path forward.

Adding Sight with a Multimodal Projector

A terminal coding agent like Pi becomes twice as useful if you can feed it screenshots of the UI it just generated. While the smaller Gemma 4 12B model is natively multimodal, the 26B variant requires a separate multimodal projector (mmproj-BF16.gguf) to handle images.

By loading the projector with the --mmproj flag, llama.cpp advertises multimodal capabilities to your agent. Best of all, adding the projector does not degrade text-generation performance, maintaining the same 72.2 tokens per second generation speed.

Advertisement

Step-by-Step Local Setup

Here is how to compile llama.cpp with Metal support and download the required Gemma 4 files.

First, install the necessary build tools and dependencies via Homebrew:

brew install cmake git tmux python@3.11

Next, clone and compile llama.cpp with Metal and Accelerate framework support enabled:

mkdir -p ~/Developer/ML-Models/Gemma4/repos
cd ~/Developer/ML-Models/Gemma4
git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp
cd repos/llama.cpp

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_METAL=ON \
  -DGGML_ACCELERATE=ON

cmake --build build --config Release -j

This configures the build with GGML_METAL=ON, GGML_ACCELERATE=ON, and Apple-specific BLAS support.

Now, set up a virtual environment and download the models from Hugging Face:

cd ~/Developer/ML-Models/Gemma4
python3.11 -m venv .venv
source .venv/bin/activate
pip install -U huggingface_hub hf_xet

mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF
huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it

(Note: Ensure you download the main GGUF, the MTP draft model, and the multimodal projector into your models directory.)

Tuning the Speculative Engine

To get the absolute most out of speculative decoding, you must tune the --spec-draft-n-max parameter. This controls how many draft tokens the engine attempts to predict before validation.

While Unsloth recommends starting at 2, performance is highly hardware-dependent. On an M1 Max, the sweep looks like this:

  • --spec-draft-n-max 1: 68.4 tok/s
  • --spec-draft-n-max 2: 72.0 tok/s
  • --spec-draft-n-max 3: 72.2 tok/s (Optimal)
  • --spec-draft-n-max 4: 70.7 tok/s
  • --spec-draft-n-max 5: 63.7 tok/s
  • --spec-draft-n-max 6: 61.2 tok/s

Setting this value too high actually degrades performance because the overhead of validating incorrect draft tokens outweighs the parallelization benefits. For an M1 Max, stick to 3 or 2.

Once configured, point your local agent (such as Pi) to the llama.cpp local server endpoint, and you will have a blazing-fast, private, and completely offline coding assistant ready for action.

Sources & further reading

  1. How to setup a local coding agent on macOS — ikyle.me
Rachel Goldstein
Written by
Rachel Goldstein · Dev Tools Editor

Rachel has been embedded in the developer tooling ecosystem for nearly eight years, covering everything from IDE wars and package-manager drama to the quiet rise of AI-assisted coding. She has a soft spot for open-source maintainers and an unhealthy number of terminal emulators installed on a single laptop.

Discussion 0

Join the discussion

Sign in or create an account to comment and vote.

No comments yet

Be the first to weigh in.

Related Reading