AI Advanced Tutorial

Self-Host a Local LLM with Ollama: From Install to Production-Style API

Stand up open-weight models on your own hardware, call them over an HTTP API from real code, and learn how quantization and VRAM actually constrain what you can run.

Rachel Goldstein

Dev Tools Editor · Jun 9, 2026 · 12 min read

What you'll build

By the end you'll have a local Ollama server running an open-weight model (Llama 3.1, Qwen2.5, etc.), be calling it from Python via both Ollama's native API and the OpenAI-compatible endpoint, and have a working mental model for choosing quantization levels against your GPU/RAM budget. Everything runs on localhost — no data leaves the machine.

Prerequisites

OS: macOS 12+ (Apple Silicon strongly preferred — Metal acceleration is automatic), Linux (x86_64 or arm64), or Windows 10/11.
GPU acceleration (optional but recommended):
- Apple Silicon: works out of the box via Metal; unified memory is shared with the GPU.
- NVIDIA: a recent driver with CUDA support. Ollama bundles its own CUDA runtime, so you do not need to install the CUDA toolkit — just a working GPU driver (nvidia-smi must run).
- AMD on Linux: ROCm-supported GPU.
Disk: 10–50 GB free. An 8B model at 4-bit is ~5 GB; 70B models are 40 GB+.
Python 3.9+ if you want to follow the code examples (python3 --version).
RAM/VRAM: at least 8 GB to run small models comfortably; see the tradeoffs table below.

Step 1: Install Ollama

macOS

Use Homebrew or the official app. With Homebrew:

brew install ollama

This installs the ollama CLI and server. To run the background service:

brew services start ollama

Alternatively, download the .dmg from https://ollama.com/download and launch the menu-bar app, which manages the server for you.

Linux

The official installer is a shell script. Inspect it before piping to a shell — this is good practice for any remote script:

curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh        # review what it does
sh install-ollama.sh

The installer creates a systemd service (ollama.service) running as a dedicated ollama user and detects NVIDIA/AMD GPUs.

Windows

Download and run the installer from https://ollama.com/download. It registers a background service automatically.

Verify the install

ollama --version

You should see a version string (e.g. ollama version is 0.x.x).

Step 2: Start the server (if not already running)

The Homebrew service, systemd unit, and desktop apps all start the server for you. To run it manually in the foreground (useful for watching logs):

ollama serve

By default the server listens on 127.0.0.1:11434. Confirm it's up:

curl http://localhost:11434
# -> Ollama is running

If you started it manually, open a second terminal for the commands below.
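If other scripts depend on the server, the same check can be done from Python before any requests are sent. This is a minimal sketch using only the standard library; the helper names `server_is_up` and `wait_for_server` are illustrative and not part of Ollama.

```python
import time
import urllib.request


def server_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the server's root endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, refused connections, and timeouts are all OSError subclasses.
        return False


def wait_for_server(base_url="http://localhost:11434", attempts=10, delay=0.5):
    """Poll the server until it responds or the attempts run out."""
    for _ in range(attempts):
        if server_is_up(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    print("Ollama reachable:", wait_for_server(attempts=3, delay=0.2))
```

Polling with a short delay is usually enough for a local service that is still starting up; raise `attempts` if the machine is slow to start.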

Step 3: Pull and run a model

Models live in a registry addressed by name:tag. Pull an 8B Llama model (defaults to a 4-bit quantization):

ollama pull llama3.1:8b

Start an interactive chat:

ollama run llama3.1:8b

Type a prompt; use /bye to exit. To send a one-shot prompt without entering the REPL:

ollama run llama3.1:8b "Summarize the actor model in two sentences."

Useful management commands:

ollama list          # installed models and sizes
ollama ps            # currently loaded models + whether on GPU/CPU
ollama show llama3.1:8b   # parameters, context length, quantization
ollama rm llama3.1:8b     # delete a model

Good starting models:

Model	Tag	Params	Good for
Llama 3.2	`llama3.2:3b`	3B	Fast, low-RAM, on-device
Llama 3.1	`llama3.1:8b`	8B	General-purpose default
Qwen2.5	`qwen2.5:7b`	7B	Strong coding/multilingual
Mistral	`mistral:7b`	7B	Lean general model
Gemma 2	`gemma2:9b`	9B	Quality at mid size

Step 4: Understand quantization and hardware tradeoffs

Ollama ships models in GGUF format with various quantization levels. Quantization compresses model weights from 16-bit floats down to ~4–8 bits, trading a small amount of quality for large reductions in memory and faster inference. The default tag (e.g. llama3.1:8b) is typically q4_K_M.

You can request a specific quant explicitly:

ollama pull llama3.1:8b-instruct-q8_0
ollama pull llama3.1:8b-instruct-q4_K_M

Approximate sizes and characteristics for an 8B model:

Quant	Effective bits/weight (approx)	~Disk/RAM (8B)	Quality vs fp16
`q4_K_M`	~4.9	~4.9 GB	Good; the sweet spot
`q5_K_M`	~5.7	~5.7 GB	Slightly better
`q6_K`	~6.6	~6.6 GB	Near-lossless
`q8_0`	~8.5	~8.5 GB	Practically lossless
`fp16`	16	~16 GB	Full precision
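The sizes in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below reproduces them. The effective bits-per-weight values are assumptions chosen to match the listed sizes; they include the per-block scaling metadata that quantized GGUF files store alongside the weights.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# Effective bits per weight (approximate, assumed for this sketch).
QUANT_BITS = {"q4_K_M": 4.9, "q5_K_M": 5.7, "q6_K": 6.6, "q8_0": 8.5, "fp16": 16.0}

if __name__ == "__main__":
    for name, bits in QUANT_BITS.items():
        print(f"{name:8s} 8B: ~{weights_gb(8.0, bits):.1f} GB   70B: ~{weights_gb(70.0, bits):.1f} GB")
```

The same function explains why 70B models at 4-bit land above 40 GB: the parameter count scales the result linearly.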

Rule of thumb for fit: you need roughly the model's on-disk size in VRAM (or unified memory on Apple Silicon), plus overhead for the KV cache, which grows with context length. If the model doesn't fit in VRAM, Ollama offloads layers to CPU/RAM — it still works but is much slower. Check what actually happened:

ollama ps
# The PROCESSOR column shows 100% GPU, a CPU/GPU split, or 100% CPU.
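The same placement information is available over HTTP from the `/api/ps` endpoint. The sketch below assumes each entry in the response carries `size` (total bytes) and `size_vram` (bytes resident on the GPU); verify the field names against your installed version. The helper names are illustrative.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of a loaded model that sits in VRAM, from 0.0 to 1.0."""
    size = model_entry.get("size", 0)
    if not size:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the list of currently loaded models from the server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


if __name__ == "__main__":
    try:
        for entry in loaded_models():
            print(f"{entry.get('name', '?')}: {gpu_fraction(entry):.0%} on GPU")
    except OSError as exc:
        print("Could not reach the server:", exc)
```

A fraction below 1.0 means some layers were offloaded to CPU, which is the slow path described above.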

Practical guidance:

8 GB VRAM / 8–16 GB Mac: run 3B–8B models at q4_K_M.
16–24 GB: comfortably run 8B–14B, or 70B at very low quant with heavy CPU offload (slow).
48 GB+: 70B at q4_K_M fits mostly on GPU.

Larger context windows cost real memory. Set context length per-request (next step) or per-model via a Modelfile rather than defaulting to the maximum.
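To see how quickly context eats memory, here is a rough KV-cache estimate. It assumes an unquantized fp16 cache (2 bytes per value) and the published Llama 3.1 8B shape of 32 layers, 8 key/value heads, and a head dimension of 128; these numbers are not from the steps above, and real usage also includes other runtime buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Keys plus values: 2 tensors per layer, each num_ctx x n_kv_heads x head_dim."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Assumed Llama 3.1 8B architecture (grouped-query attention with 8 KV heads).
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for ctx in (2048, 8192, 32768, 131072):
        gib = kv_cache_bytes(num_ctx=ctx, **LLAMA31_8B) / 2**30
        print(f"num_ctx={ctx:>6}: ~{gib:.2f} GiB of fp16 KV cache")
```

Under these assumptions each token costs 128 KiB, so an 8192-token window needs about 1 GiB on top of the weights, and the cost grows linearly with `num_ctx`.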

Step 5: Call the native HTTP API

Ollama exposes a REST API on port 11434. The /api/chat endpoint takes a messages array.

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a terse senior engineer."},
    {"role": "user", "content": "When should I prefer gRPC over REST?"}
  ],
  "stream": false,
  "options": { "temperature": 0.4, "num_ctx": 8192 }
}'

The options object maps to model runtime parameters: temperature, top_p, num_ctx (context window), num_predict (max output tokens), seed, and others.
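The same non-streaming request can be made from Python without extra dependencies. This is a sketch using the standard library; `build_chat_request` and `chat` are illustrative helper names, and keyword arguments are passed straight through as `options`.

```python
import json
import urllib.request


def build_chat_request(model, messages, **options):
    """Assemble the JSON body for /api/chat; options are only sent if given."""
    payload = {"model": model, "messages": messages, "stream": False}
    if options:
        payload["options"] = options
    return payload


def chat(payload, base_url="http://localhost:11434"):
    """POST the payload and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["message"]["content"]


if __name__ == "__main__":
    payload = build_chat_request(
        "llama3.1:8b",
        [{"role": "user", "content": "When should I prefer gRPC over REST?"}],
        temperature=0.4,
        num_ctx=8192,
        seed=42,  # a fixed seed makes sampling more repeatable
    )
    try:
        print(chat(payload))
    except OSError as exc:
        print("Request failed:", exc)
```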

Streaming in Python with requests (install it with python3 -m pip install requests). Each line of the response body is one JSON object:

import json
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Explain CAP theorem briefly."}],
        "stream": True,
    },
    stream=True,
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if not chunk.get("done"):
        print(chunk["message"]["content"], end="", flush=True)
print()
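The final streamed chunk (the one with `done` set to true) also carries timing statistics, including `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below collects the streamed text and derives tokens per second; the helper names and the sample chunks are illustrative only.

```python
def collect_stream(chunks):
    """Join streamed content and return (full_text, final_chunk)."""
    parts, final = [], {}
    for chunk in chunks:
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            final = chunk
    return "".join(parts), final


def tokens_per_second(final_chunk):
    """Generation speed from the final chunk's stats; 0.0 if stats are missing."""
    duration_ns = final_chunk.get("eval_duration", 0)
    if not duration_ns:
        return 0.0
    return final_chunk.get("eval_count", 0) / (duration_ns / 1e9)


if __name__ == "__main__":
    # Hand-written sample chunks standing in for a real response.
    sample = [
        {"message": {"content": "Hel"}, "done": False},
        {"message": {"content": "lo"}, "done": False},
        {"message": {"content": ""}, "done": True,
         "eval_count": 50, "eval_duration": 2_000_000_000},
    ]
    text, final = collect_stream(sample)
    print(text, f"({tokens_per_second(final):.1f} tokens/s)")
```

With a live response, pass `(json.loads(line) for line in resp.iter_lines() if line)` as `chunks`. Tokens per second is a convenient single number for comparing quantization levels on your own hardware.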

Step 6: Use the OpenAI-compatible endpoint

Ollama also serves an OpenAI-compatible API at /v1, so existing SDKs work with a base-URL swap. The API key is ignored but must be non-empty.

Install the client:

python3 -m pip install openai

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Give me a one-line definition of idempotency."},
    ],
    temperature=0.3,
)
print(resp.choices[0].message.content)

This path is handy for dropping a local model into frameworks (LangChain, LlamaIndex, etc.) that already speak the OpenAI protocol.
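Streaming works through the OpenAI client as well, by passing `stream=True` and reading the `delta` of each chunk. In this sketch the client is passed in as an argument so the function can be exercised without a server; `stream_text` is an illustrative name.

```python
def stream_text(client, model, prompt):
    """Yield text fragments as the server produces them."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


if __name__ == "__main__":
    try:
        from openai import OpenAI  # third-party: python3 -m pip install openai

        client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
        for piece in stream_text(client, "llama3.1:8b", "Define backpressure in one sentence."):
            print(piece, end="", flush=True)
        print()
    except Exception as exc:  # missing package or unreachable server
        print("Demo skipped:", exc)
```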

Step 7: Customize with a Modelfile

To bake in a system prompt and default parameters, create a Modelfile:

FROM llama3.1:8b

PARAMETER temperature 0.2
PARAMETER num_ctx 8192

SYSTEM """You are CodeReviewBot. Respond only with actionable review comments."""

Build and run your derived model:

ollama create code-review-bot -f Modelfile
ollama run code-review-bot "def add(a,b): return a-b"
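If you end up maintaining several variants of a derived model, it can be easier to generate the Modelfile text from code. This is a sketch under that assumption; `render_modelfile` is a hypothetical helper, and it only covers the FROM, PARAMETER, and SYSTEM instructions used above.

```python
def render_modelfile(base, system, **parameters):
    """Return Modelfile text with a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    lines += ["", f'SYSTEM """{system}"""']
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.1:8b",
        "You are CodeReviewBot. Respond only with actionable review comments.",
        temperature=0.2,
        num_ctx=8192,
    )
    print(text, end="")
```

Write the result to a file named Modelfile and build it with the ollama create command shown above.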

Step 8: Tune the server (optional)

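Server behavior is controlled by environment variables. On Linux with systemd, set them with sudo systemctl edit ollama.service and restart the service. On macOS, export them in the shell before running ollama serve; the menu-bar app does not read shell exports, so set its variables with launchctl setenv and restart the app.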

Variable	Purpose
`OLLAMA_HOST`	Bind address, e.g. `0.0.0.0:11434` to expose on the LAN
`OLLAMA_MODELS`	Directory where model blobs are stored
`OLLAMA_KEEP_ALIVE`	How long an idle model stays in memory (e.g. `30m`, `-1` for forever)
`OLLAMA_MAX_LOADED_MODELS`	Number of models kept resident simultaneously
`OLLAMA_NUM_PARALLEL`	Concurrent requests per model
`OLLAMA_FLASH_ATTENTION`	Set to `1` to enable flash attention (lower memory use at long context)

Security note: only set OLLAMA_HOST=0.0.0.0 if you intend to expose the server. The API has no authentication, so put it behind a reverse proxy with TLS and auth (e.g. Caddy or nginx) before letting anything off-host reach it — never expose port 11434 directly to the internet.

Verify it works

Server reachable: curl http://localhost:11434/api/tags returns JSON listing your installed models.
Inference works: ollama run llama3.1:8b "say hello" prints a response.
GPU in use: ollama ps shows 100% GPU (or a split) in the PROCESSOR column while a model is loaded. On NVIDIA, nvidia-smi shows an ollama process consuming VRAM.
Code path works: running the Python OpenAI example prints a one-line answer.
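The first check can also be scripted. The sketch below reads `/api/tags` and summarizes each installed model; it assumes every entry has `name`, `size` (bytes), and a `details.quantization_level` field, and the helper names are illustrative.

```python
import json
import urllib.request


def summarize_models(tags_response):
    """Return (name, size in GB, quantization) tuples from an /api/tags response."""
    rows = []
    for model in tags_response.get("models", []):
        details = model.get("details", {})
        rows.append((
            model.get("name", "?"),
            model.get("size", 0) / 1e9,
            details.get("quantization_level", "?"),
        ))
    return rows


def fetch_tags(base_url="http://localhost:11434"):
    """Fetch the installed-model listing from the server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        for name, size_gb, quant in summarize_models(fetch_tags()):
            print(f"{name:30s} {size_gb:6.1f} GB  {quant}")
    except OSError as exc:
        print("Could not reach the server:", exc)
```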

Troubleshooting

Error: could not connect to ollama app / connection refused

The server isn't running or is on another port. Start it (ollama serve, brew services start ollama, or sudo systemctl start ollama) and confirm with curl http://localhost:11434.

Model runs slowly / ollama ps shows 100% CPU

The model didn't fit in VRAM and was offloaded. Pull a smaller quant (...-q4_K_M or a 3B model), reduce num_ctx, or close other GPU-heavy apps. On NVIDIA, verify the driver works with nvidia-smi; if Ollama logs no compatible GPUs found, your driver is too old or missing.

out of memory / process killed (OOM)

The combination of model size and context window exceeds available memory. Lower num_ctx, choose a smaller or more aggressively quantized model, and set OLLAMA_FLASH_ATTENTION=1 to reduce attention memory at long context.

Pulls fail or stall behind a proxy

Set HTTPS_PROXY/HTTP_PROXY in the server's environment (for systemd, via systemctl edit ollama.service), then restart the service. Re-running ollama pull resumes partial downloads.

Next steps

Embeddings + RAG: pull an embedding model (ollama pull nomic-embed-text) and call POST /api/embed to build a local retrieval pipeline.
Import custom weights: use a FROM ./model.gguf Modelfile to run any GGUF you've downloaded or quantized yourself.
Scale concurrency: tune OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS, and front the server with a reverse proxy for auth, rate limiting, and TLS.
Benchmark quants: run the same eval prompts across q4_K_M, q5_K_M, and q8_0 to measure the quality/latency/memory curve for your specific workload before committing in production.
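As a starting point for the embeddings item, here is a sketch of the retrieval core: embed texts through `/api/embed`, then rank documents by cosine similarity. It assumes the endpoint accepts a list under `input` and returns vectors under `embeddings`; the helper names are illustrative, and a real pipeline would add chunking and a vector store.

```python
import json
import math
import urllib.request


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar to the query."""
    return sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )


def embed(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """Request one embedding vector per input text."""
    req = urllib.request.Request(
        f"{base_url}/api/embed",
        data=json.dumps({"model": model, "input": texts}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["embeddings"]


if __name__ == "__main__":
    docs = ["Ollama serves models over HTTP.", "Bananas are rich in potassium."]
    try:
        vectors = embed(["How do I call a local model?"] + docs)
        best = rank(vectors[0], vectors[1:])[0]
        print("Closest document:", docs[best])
    except OSError as exc:
        print("Could not reach the server:", exc)
```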


