Self-Host a Local LLM with Ollama: From Install to Production-Style API
Stand up open-weight models on your own hardware, call them over an HTTP API from real code, and learn how quantization and VRAM actually constrain what you can run.
What you'll build
By the end you'll have a local Ollama server running an open-weight model (Llama 3.1, Qwen2.5, etc.), be calling it from Python via both Ollama's native API and the OpenAI-compatible endpoint, and have a working mental model for choosing quantization levels against your GPU/RAM budget. Everything runs on localhost — no data leaves the machine.
Prerequisites
- OS: macOS 12+ (Apple Silicon strongly preferred — Metal acceleration is automatic), Linux (x86_64 or arm64), or Windows 10/11.
- GPU acceleration (optional but recommended):
  - Apple Silicon: works out of the box via Metal; unified memory is shared with the GPU.
  - NVIDIA: a recent driver with CUDA support. Ollama bundles its own CUDA runtime, so you do not need to install the CUDA toolkit, only a working GPU driver (nvidia-smi must run).
  - AMD on Linux: ROCm-supported GPU.
- Disk: 10–50 GB free. An 8B model at 4-bit is ~5 GB; 70B models are 40 GB+.
- Python 3.9+ if you want to follow the code examples (python3 --version).
- RAM/VRAM: at least 8 GB to run small models comfortably; see the tradeoffs table below.
Step 1: Install Ollama
macOS
Use Homebrew or the official app. With Homebrew:
brew install ollama
This installs the ollama CLI and server. To run the background service:
brew services start ollama
Alternatively, download the .dmg from https://ollama.com/download and launch the menu-bar app, which manages the server for you.
Linux
The official installer is a shell script. Inspect it before piping to a shell — this is good practice for any remote script:
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh # review what it does
sh install-ollama.sh
The installer creates a systemd service (ollama.service) running as a dedicated ollama user and detects NVIDIA/AMD GPUs.
Windows
Download and run the installer from https://ollama.com/download. It installs a tray app that starts the server in the background automatically.
Verify the install
ollama --version
You should see a version string (e.g. ollama version is 0.x.x).
Step 2: Start the server (if not already running)
The Homebrew service, systemd unit, and desktop apps all start the server for you. To run it manually in the foreground (useful for watching logs):
ollama serve
By default the server listens on 127.0.0.1:11434. Confirm it's up:
curl http://localhost:11434
# -> Ollama is running
If you started it manually, open a second terminal for the commands below.
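The same check can be scripted, which is handy at the top of a test suite or deploy script. The following is a minimal sketch using only the standard library; the function name is_ollama_up is illustrative and not part of any Ollama tooling, and it assumes the root URL answers with the "Ollama is running" banner shown above.

```python
import http.client
import urllib.request


def is_ollama_up(base_url: str = "http://127.0.0.1:11434", timeout: float = 2.0) -> bool:
    """Return True if GET <base_url> answers with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, DNS failures, and HTTP error statuses.
        return False
```

Calling is_ollama_up() returns a plain boolean, so a script can exit early with a clear message instead of failing later with a connection error.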
Step 3: Pull and run a model
Models live in a registry addressed by name:tag. Pull an 8B Llama model (defaults to a 4-bit quantization):
ollama pull llama3.1:8b
Start an interactive chat:
ollama run llama3.1:8b
Type a prompt; use /bye to exit. To send a one-shot prompt without entering the REPL:
ollama run llama3.1:8b "Summarize the actor model in two sentences."
Useful management commands:
ollama list # installed models and sizes
ollama ps # currently loaded models + whether on GPU/CPU
ollama show llama3.1:8b # parameters, context length, quantization
ollama rm llama3.1:8b # delete a model
Good starting models:
| Model | Tag | Params | Good for |
|---|---|---|---|
| Llama 3.2 | llama3.2:3b | 3B | Fast, low-RAM, on-device |
| Llama 3.1 | llama3.1:8b | 8B | General-purpose default |
| Qwen2.5 | qwen2.5:7b | 7B | Strong coding/multilingual |
| Mistral | mistral:7b | 7B | Lean general model |
| Gemma 2 | gemma2:9b | 9B | Quality at mid size |
Step 4: Understand quantization and hardware tradeoffs
Ollama ships models in GGUF format with various quantization levels. Quantization compresses model weights from 16-bit floats down to ~4–8 bits, trading a small amount of quality for large reductions in memory and faster inference. The default tag (e.g. llama3.1:8b) is typically q4_K_M.
You can request a specific quant explicitly:
ollama pull llama3.1:8b-instruct-q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
Approximate sizes and characteristics for an 8B model:
| Quant | Bits/weight (approx) | ~Disk/RAM (8B) | Quality vs fp16 |
|---|---|---|---|
| q4_K_M | ~4.5 | ~4.9 GB | Good (the sweet spot) |
| q5_K_M | ~5.5 | ~5.7 GB | Slightly better |
| q6_K | ~6.5 | ~6.6 GB | Near-lossless |
| q8_0 | 8 | ~8.5 GB | Practically lossless |
| fp16 | 16 | ~16 GB | Full precision |
Rule of thumb for fit: you need roughly the model's on-disk size in VRAM (or unified memory on Apple Silicon), plus overhead for the KV cache, which grows with context length. If the model doesn't fit in VRAM, Ollama offloads layers to CPU/RAM — it still works but is much slower. Check what actually happened:
ollama ps
# The PROCESSOR column shows 100% GPU, a CPU/GPU split, or 100% CPU.
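The rule of thumb can be turned into rough arithmetic: weight size is approximately parameters × bits per weight ÷ 8. The sketch below uses the nominal bits-per-weight figures from the table; the function names and the flat 1.5 GB overhead allowance are assumptions chosen for illustration, not values Ollama reports.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, overhead_gb=1.5):
    """Rough check: weights plus a flat allowance for KV cache and runtime buffers.

    overhead_gb is an assumed placeholder; real overhead depends on context length.
    """
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


for quant, bits in [("q4_K_M", 4.5), ("q8_0", 8), ("fp16", 16)]:
    size = weights_gb(8, bits)
    print(f"8B {quant}: ~{size:.1f} GB weights, fits in 8 GB VRAM: {fits_in_vram(8, bits, 8)}")
```

The estimate for q4_K_M lands slightly below the ~4.9 GB in the table, most likely because the nominal figures round both the parameter count and the effective bits per weight. Treat the result as a first filter and confirm with ollama ps.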
Practical guidance:
- 8 GB VRAM / 8–16 GB Mac: run 3B–8B models at q4_K_M.
- 16–24 GB: comfortably run 8B–14B, or 70B at very low quant with heavy CPU offload (slow).
- 48 GB+: 70B at q4_K_M fits mostly on GPU.
Larger context windows cost real memory. Set context length per-request (next step) or per-model via a Modelfile rather than defaulting to the maximum.
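To see why context length matters, a back-of-the-envelope KV-cache estimate helps. For a standard transformer the cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and 16-bit cache entries; these numbers are assumptions taken from the model's public configuration, and the runtime may allocate somewhat differently.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed architecture for Llama 3.1 8B; check `ollama show` for your model.
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for num_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(num_ctx=num_ctx, **LLAMA_3_1_8B) / 2**30
    print(f"num_ctx={num_ctx:>6}: ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly: about 0.25 GiB at 2048 tokens, 1 GiB at 8192, and 4 GiB at 32768, on top of the weights.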
Step 5: Call the native HTTP API
Ollama exposes a REST API on port 11434. The /api/chat endpoint takes a messages array.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user", "content": "When should I prefer gRPC over REST?"}
  ],
  "stream": false,
  "options": { "temperature": 0.4, "num_ctx": 8192 }
}'
The options object maps to model runtime parameters: temperature, top_p, num_ctx (context window), num_predict (max output tokens), seed, and others.
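The same non-streaming request can be made from Python without third-party packages. This is a minimal sketch: build_chat_payload and chat_once are hypothetical helper names, and the code assumes the response carries the reply under message.content, as in the streaming chunks used later in this step.

```python
import json
import urllib.request


def build_chat_payload(model, user_prompt, system=None, **options):
    """Assemble the JSON body for POST /api/chat with streaming disabled."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_prompt})
    payload = {"model": model, "messages": messages, "stream": False}
    if options:
        payload["options"] = options  # e.g. temperature, num_ctx, num_predict, seed
    return payload


def chat_once(payload, base_url="http://localhost:11434", timeout=120):
    """Send the payload to a running server and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["message"]["content"]


# With the server running:
#   payload = build_chat_payload("llama3.1:8b", "When should I prefer gRPC over REST?",
#                                system="You are a terse senior engineer.",
#                                temperature=0.4, num_ctx=8192)
#   print(chat_once(payload))
```

Keeping payload construction separate from the network call makes the request shape easy to unit-test without a server.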
Streaming in Python with requests — each line is a JSON object:
import json
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Explain CAP theorem briefly."}],
        "stream": True,
    },
    stream=True,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if not chunk.get("done"):
        print(chunk["message"]["content"], end="", flush=True)
print()
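The final chunk of a stream (the one with done set to true) also carries timing counters, which makes it easy to measure throughput. The sketch below assumes that chunk includes eval_count (generated tokens) and eval_duration (nanoseconds); consume_stream is a hypothetical helper, and the sample lines are made-up stand-ins for resp.iter_lines().

```python
import json


def consume_stream(lines):
    """Join streamed content and compute tokens/sec from the final chunk's counters."""
    parts, stats = [], {}
    for line in lines:
        if not line:
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            stats = chunk
    eval_ns = stats.get("eval_duration", 0)  # assumed to be in nanoseconds
    tokens_per_sec = stats.get("eval_count", 0) / (eval_ns / 1e9) if eval_ns else None
    return "".join(parts), tokens_per_sec


# Canned lines for illustration only; the values are invented.
sample = [
    b'{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    b'{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    b'{"message": {"role": "assistant", "content": ""}, "done": true, '
    b'"eval_count": 2, "eval_duration": 500000000}',
]
print(consume_stream(sample))  # ('Hello', 4.0)
```

In real use, pass resp.iter_lines() from the streaming request in place of sample.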
Step 6: Use the OpenAI-compatible endpoint
Ollama also serves an OpenAI-compatible API at /v1, so existing SDKs work with a base-URL swap. The API key is ignored but must be non-empty.
Install the client:
python3 -m pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me a one-line definition of idempotency."},
    ],
    temperature=0.3,
)
print(resp.choices[0].message.content)
This path is handy for dropping a local model into frameworks (LangChain, LlamaIndex, etc.) that already speak the OpenAI protocol.
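For debugging, it can help to see roughly what the SDK sends over the wire. The sketch below builds the equivalent POST to /v1/chat/completions with the standard library; openai_style_request and first_choice_text are hypothetical helper names, and the Bearer value is a placeholder on the assumption (stated above) that Ollama ignores it.

```python
import json
import urllib.request


def openai_style_request(model, messages, base_url="http://localhost:11434/v1", **params):
    """Build (without sending) a chat-completions request in the OpenAI wire format."""
    body = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # placeholder key
        },
        method="POST",
    )


def first_choice_text(response: dict) -> str:
    """Pull the reply text out of an OpenAI-style chat completion response."""
    return response["choices"][0]["message"]["content"]


# To send it against a running server:
#   req = openai_style_request("llama3.1:8b", [{"role": "user", "content": "hi"}])
#   with urllib.request.urlopen(req, timeout=120) as resp:
#       print(first_choice_text(json.load(resp)))
```

Note that sampling parameters such as temperature sit at the top level of the body here, whereas the native /api/chat endpoint nests them under options.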
Step 7: Customize with a Modelfile
To bake in a system prompt and default parameters, create a Modelfile:
FROM llama3.1:8b
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
SYSTEM """You are CodeReviewBot. Respond only with actionable review comments."""
Build and run your derived model:
ollama create code-review-bot -f Modelfile
ollama run code-review-bot "def add(a,b): return a-b"
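If you maintain several derived models, generating Modelfiles from code keeps them consistent. This is a small sketch that only uses the three directives shown above (FROM, PARAMETER, SYSTEM); render_modelfile is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, system: str, **parameters) -> str:
    """Return Modelfile text built from FROM, PARAMETER, and SYSTEM directives."""
    if '"""' in system:
        # A triple quote inside the prompt would end the SYSTEM block early.
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}"]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "llama3.1:8b",
    "You are CodeReviewBot. Respond only with actionable review comments.",
    temperature=0.2,
    num_ctx=8192,
))
```

Write the returned text to a file named Modelfile and build it with the same ollama create command as above.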
Step 8: Tune the server (optional)
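Server behavior is controlled by environment variables. On Linux with systemd, set them with sudo systemctl edit ollama.service and restart the service. On macOS, export them in the shell before running ollama serve; for the menu-bar app, set them with launchctl setenv and restart the app.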
| Variable | Purpose |
|---|---|
| OLLAMA_HOST | Bind address, e.g. 0.0.0.0:11434 to expose on the LAN |
| OLLAMA_MODELS | Directory where model blobs are stored |
| OLLAMA_KEEP_ALIVE | How long an idle model stays in memory (e.g. 30m, -1 for forever) |
| OLLAMA_MAX_LOADED_MODELS | Number of models kept resident simultaneously |
| OLLAMA_NUM_PARALLEL | Concurrent requests per model |
| OLLAMA_FLASH_ATTENTION | Set to 1 to enable flash attention (lower memory use as context grows) |
Security note: only set OLLAMA_HOST=0.0.0.0 if you intend to expose the server. The API has no authentication, so put it behind a reverse proxy with TLS and auth (e.g. Caddy or nginx) before letting anything off-host reach it — never expose port 11434 directly to the internet.
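Client scripts often hard-code http://localhost:11434. One option is to derive the base URL from the same OLLAMA_HOST variable so that scripts follow the server's configuration. The sketch below is a simplified illustration: base_url_from_env is a hypothetical helper, and its parsing rules are assumptions rather than a copy of how Ollama itself interprets the variable.

```python
import os
from urllib.parse import urlsplit


def base_url_from_env(env=None, default_host="127.0.0.1", default_port=11434) -> str:
    """Turn an OLLAMA_HOST-style value into an http(s) base URL for client code."""
    env = os.environ if env is None else env
    raw = (env.get("OLLAMA_HOST") or "").strip()
    if not raw:
        return f"http://{default_host}:{default_port}"
    if "://" not in raw:
        raw = f"http://{raw}"  # bare host or host:port
    parts = urlsplit(raw)
    host = parts.hostname or default_host
    if host == "0.0.0.0":
        host = default_host  # a bind-all address is not a useful connect target
    port = parts.port or (443 if parts.scheme == "https" else default_port)
    return f"{parts.scheme}://{host}:{port}"
```

Passing a dictionary as env makes the function easy to test; in normal use, call it with no arguments so it reads the process environment.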
Verify it works
- Server reachable: curl http://localhost:11434/api/tags returns JSON listing your installed models.
- Inference works: ollama run llama3.1:8b "say hello" prints a response.
- GPU in use: ollama ps shows 100% GPU (or a split) in the PROCESSOR column while a model is loaded. On NVIDIA, nvidia-smi shows an ollama process consuming VRAM.
- Code path works: running the Python OpenAI example prints a one-line answer.
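The first check above can be automated as a smoke test. This sketch assumes the /api/tags response has a top-level models array whose entries include a name field; installed_models and check_model_installed are hypothetical helper names.

```python
import json
import urllib.request


def installed_models(tags_response: dict) -> list:
    """Extract model names from the JSON returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def check_model_installed(model, base_url="http://localhost:11434", timeout=5):
    """Return True if `model` appears in the running server's installed-model list."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model in installed_models(json.load(resp))


# With the server running:
#   print(check_model_installed("llama3.1:8b"))
```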
Troubleshooting
Error: could not connect to ollama app / connection refused
The server isn't running or is on another port. Start it (ollama serve, brew services start ollama, or sudo systemctl start ollama) and confirm with curl http://localhost:11434.
Model runs slowly / ollama ps shows 100% CPU
Either the model didn't fit in VRAM and was offloaded, or no usable GPU was detected. For the first case, pull a smaller quant (...-q4_K_M or a 3B model), reduce num_ctx, or close other GPU-heavy apps. For the second, verify the driver works with nvidia-smi on NVIDIA; if Ollama logs no compatible GPUs found, your driver is too old or missing.
out of memory / process killed (OOM)
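The combination of model size and context window exceeds available memory. Lower num_ctx, choose a smaller or more aggressively quantized model, and set OLLAMA_FLASH_ATTENTION=1 to reduce memory use at long context lengths. With flash attention enabled, you can also quantize the KV cache itself via OLLAMA_KV_CACHE_TYPE (for example q8_0).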
Pulls fail or stall behind a proxy
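Set HTTPS_PROXY in the server's environment (for systemd, via systemctl edit ollama.service), then restart the service. Model pulls go over HTTPS, so HTTP_PROXY is not needed, and setting it can interfere with client connections to the local server. Re-running ollama pull resumes partial downloads.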
Next steps
- Embeddings + RAG: pull an embedding model (ollama pull nomic-embed-text) and call POST /api/embed to build a local retrieval pipeline.
- Import custom weights: use a FROM ./model.gguf Modelfile to run any GGUF you've downloaded or quantized yourself.
- Scale concurrency: tune OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, and front the server with a reverse proxy for auth, rate limiting, and TLS.
- Benchmark quants: run the same eval prompts across q4_K_M, q5_K_M, and q8_0 to measure the quality/latency/memory curve for your specific workload before committing in production.
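As a starting point for the embeddings item, the core of a retrieval pipeline is small: embed the documents and the query, then rank by cosine similarity. The sketch below assumes POST /api/embed accepts a list of strings under input and returns vectors under embeddings; the helper names embed, cosine, and rank are illustrative.

```python
import json
import math
import urllib.request


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434", timeout=60):
    """POST /api/embed and return one vector per input string (needs a running server)."""
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["embeddings"]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)


# With the server running and the embedding model pulled:
#   docs = ["Ollama serves models over HTTP.", "Bread needs yeast."]
#   vecs = embed(docs + ["How do I call a local LLM?"])
#   print(rank(vecs[-1], vecs[:-1]))
```

A linear scan like this is fine for a few thousand chunks; beyond that, a vector index is the usual next step.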
Rachel has been embedded in the developer tooling ecosystem for nearly eight years, covering everything from IDE wars and package-manager drama to the quiet rise of AI-assisted coding. She has a soft spot for open-source maintainers and an unhealthy number of terminal emulators installed on a single laptop.