Add Retrieval-Augmented Generation (RAG) to Your App

Build a minimal but production-shaped RAG loop in Python: chunk and embed documents, store vectors, retrieve the most relevant context, and feed it to an LLM to ground its answers.

Priya Nair

AI & Developer Experience Writer · Jun 9, 2026 · 9 min read

What you'll build

You'll build a small, self-contained RAG pipeline in Python that answers questions about your own documents. By the end you'll have a script that chunks text, creates embeddings with OpenAI, stores them locally, retrieves the most relevant chunks for a query via cosine similarity, and passes that context to a chat model to produce a grounded answer.

We deliberately use NumPy for the vector store so you can see exactly how retrieval works. Swapping in a real vector database is covered in Next steps.

Prerequisites

Python 3.10+ (python3 --version to check). The code uses standard typing and f-strings that work on 3.10–3.12.
An OpenAI account with an API key and a small amount of billing credit. Embedding a few pages costs fractions of a cent.
Basic familiarity with the terminal and pip.
macOS, Linux, or Windows. On macOS (Apple Silicon or Intel) and Linux the commands below work as written; on Windows use PowerShell and adjust the virtual-env activation line as noted.

Step 1: Set up the project

mkdir rag-demo && cd rag-demo
python3 -m venv .venv
source .venv/bin/activate        # Windows PowerShell: .venv\Scripts\Activate.ps1
pip install --upgrade pip
pip install openai numpy

Set your API key as an environment variable so it never ends up in source control:

export OPENAI_API_KEY="sk-..."   # Windows PowerShell: $env:OPENAI_API_KEY="sk-..."
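
A fail-fast check can catch a missing key before any API call is made. The sketch below is one way to guard the script, assuming you want a clear error instead of an authentication failure later on; `require_api_key` is a hypothetical helper for this tutorial, not part of the OpenAI SDK.

```python
from __future__ import annotations

import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] | None = None) -> str:
    """Return the API key, or raise a clear error if it is missing.

    Hypothetical helper for this tutorial; the SDK does not provide it.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set in this shell")
    return key


# Passing a plain dict lets you try the check without touching the real environment.
print(require_api_key({"OPENAI_API_KEY": "sk-test"}))
```

Calling `require_api_key()` with no argument reads the real environment, so it could sit at the top of the script before the client is created.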

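Create a sample document to query. In a real app this would be your docs, PDFs, or knowledge base. The heredoc below is Bash/zsh syntax; on Windows PowerShell, paste the same six lines into knowledge.txt with a text editor instead.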

cat > knowledge.txt <<'EOF'
Acme Cloud offers three plans: Free, Pro, and Enterprise.
The Free plan includes 1 GB of storage and community support.
The Pro plan costs $20 per month and includes 100 GB of storage and email support.
The Enterprise plan includes unlimited storage, SSO, and a dedicated account manager.
All plans include automatic daily backups retained for 30 days.
Data is encrypted at rest using AES-256 and in transit using TLS 1.2 or higher.
EOF

Step 2: Chunk the document

LLMs and embedding models work best on small, coherent passages. Chunking also keeps retrieval precise — you return only the relevant slice, not the whole file. We'll use a simple word-count chunker with overlap so context isn't lost at boundaries.

Create rag.py:

from __future__ import annotations

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EMBED_MODEL = "text-embedding-3-small"  # 1536-dimensional, low cost
CHAT_MODEL = "gpt-4o-mini"


def chunk_text(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    if not words:
        return []
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

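Our sample file is about 70 words, so with the default chunk_size of 80 it fits in a single chunk; larger files produce many. To watch retrieval rank several chunks, index with smaller settings such as chunk_size=20 and overlap=5. Tune chunk_size and overlap to your content: prose tolerates larger chunks, while dense reference material benefits from smaller ones.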
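
To see how the overlap works, it helps to run the chunker on input where the boundaries are obvious. The sketch below repeats `chunk_text` from above so it runs on its own, and feeds it ten numbered placeholder words with deliberately tiny, illustrative settings.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    if not words:
        return []
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks


# Ten numbered placeholder words make the window boundaries easy to see.
sample = " ".join(f"w{i}" for i in range(10))
demo_chunks = chunk_text(sample, chunk_size=4, overlap=1)
for c in demo_chunks:
    print(c)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each chunk starts with the last word of the previous one, which is the overlap at work. Note that `overlap` must stay below `chunk_size`; otherwise the step collapses to 1 and you get one chunk per word position.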

Step 3: Embed the chunks

An embedding is a vector that captures semantic meaning. Similar text produces similar vectors. We batch all chunks into a single API call — the input parameter accepts a list.

Add to rag.py:

def embed(texts: list[str]) -> np.ndarray:
    """Return an (n, dim) float32 array of embeddings for the given texts."""
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    vectors = [item.embedding for item in resp.data]
    return np.array(vectors, dtype=np.float32)

The API returns embeddings in input order, so resp.data[i] corresponds to texts[i]. Each item also carries an index field if you want to verify the pairing.
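
A single call is fine for a handful of chunks, but a larger corpus usually has to be split across several requests. The sketch below shows one way to do that while keeping results in input order. `embed_in_batches` is a hypothetical helper, the default `batch_size` of 100 is an illustrative value and not a documented limit, and the embedding call is passed in as `embed_fn` so the example can run offline with a stub.

```python
from __future__ import annotations

from typing import Callable, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,  # illustrative value, not a documented API limit
) -> list[list[float]]:
    """Embed `texts` in fixed-size batches, keeping results in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))  # one request per batch
    return vectors


# Offline stand-in for a real embedding call: one number per text (its length).
def fake_embed(batch: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in batch]


result = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(result)  # [[1.0], [2.0], [3.0], [4.0], [5.0]]
```

Passing the tutorial's `embed` as `embed_fn` should work too, since iterating a 2-D array yields its rows; wrap the result in `np.array(..., dtype=np.float32)` to get back the matrix shape the rest of the script expects.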

Step 4: Store the vectors

A "vector store" is just embeddings plus the text they came from, kept together so you can search. We'll hold them in memory and persist to a .npz file so you don't re-embed on every run.

def build_index(path: str, store_file: str = "index.npz") -> None:
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    chunks = chunk_text(text)
    vectors = embed(chunks)
    np.savez(store_file, vectors=vectors, chunks=np.array(chunks, dtype=object))
    print(f"Indexed {len(chunks)} chunks into {store_file}")


def load_index(store_file: str = "index.npz"):
    data = np.load(store_file, allow_pickle=True)
    return data["vectors"], data["chunks"]
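
The code above stores chunks as an object array, which is why loading needs allow_pickle=True; that is acceptable for a file you wrote yourself, but pickled data should only be loaded from sources you trust. As a variation, the sketch below stores the chunks as a plain string array, which NumPy can save and load without pickling. The toy vectors and chunk strings are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for real embeddings and chunk text.
vectors = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
chunks = ["first chunk", "second chunk"]

with tempfile.TemporaryDirectory() as tmp:
    store_file = os.path.join(tmp, "index.npz")
    # A list of str becomes a fixed-width Unicode array, which needs no pickling.
    np.savez(store_file, vectors=vectors, chunks=np.array(chunks))
    with np.load(store_file) as data:  # allow_pickle is False by default
        loaded_vectors = data["vectors"]
        loaded_chunks = [str(c) for c in data["chunks"]]

print(loaded_vectors.shape, loaded_chunks)
```

The trade-off is that a fixed-width string array pads every entry to the length of the longest chunk, which is negligible at this scale.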

Step 5: Retrieve relevant context

Retrieval ranks stored chunks by similarity to the query embedding. Cosine similarity is the cosine of the angle between two vectors, so it compares direction and ignores magnitude. It is the standard metric for text embeddings.

def cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    query_norm = query / (np.linalg.norm(query) + 1e-10)
    matrix_norm = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-10)
    return matrix_norm @ query_norm


def retrieve(question: str, vectors: np.ndarray, chunks: np.ndarray, k: int = 3) -> list[str]:
    q_vec = embed([question])[0]
    scores = cosine_similarity(q_vec, vectors)
    top_idx = np.argsort(scores)[::-1][:k]
    return [str(chunks[i]) for i in top_idx]

k controls how many chunks you feed the model. Start with 3–5; too many dilute the prompt and raise cost.
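
A worked example on tiny vectors makes the ranking concrete. The sketch below repeats `cosine_similarity` from above so it runs on its own, then scores four hand-picked 2-D vectors against a query; the vectors are illustrative and far smaller than real embeddings.

```python
import numpy as np


def cosine_similarity(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    query_norm = query / (np.linalg.norm(query) + 1e-10)
    matrix_norm = matrix / (np.linalg.norm(matrix, axis=1, keepdims=True) + 1e-10)
    return matrix_norm @ query_norm


query = np.array([1.0, 0.0])
matrix = np.array([
    [2.0, 0.0],   # same direction as the query, different length
    [1.0, 1.0],   # 45 degrees away
    [0.0, 3.0],   # orthogonal
    [-1.0, 0.0],  # opposite direction
])

scores = cosine_similarity(query, matrix)
top_idx = np.argsort(scores)[::-1][:2]  # indices of the two best matches

print(np.round(scores, 3))  # roughly 1, 0.707, 0, -1
print(top_idx)              # rows 0 and 1 rank highest
```

The first row scores about 1 even though it is twice as long as the query, which is the "ignores magnitude" property in practice.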

Step 6: Generate a grounded answer

Now assemble a prompt that includes the retrieved context and instructs the model to rely on it. The system message reduces hallucination by telling the model what to do when the answer isn't present.

def answer(question: str, context: list[str]) -> str:
    context_block = "\n\n".join(f"- {c}" for c in context)
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. Answer the question using ONLY the "
                "context provided. If the context does not contain the answer, "
                "say you don't know."
            ),
        },
        {
            "role": "user",
            "content": f"Context:\n{context_block}\n\nQuestion: {question}",
        },
    ]
    resp = client.chat.completions.create(
        model=CHAT_MODEL,
        messages=messages,
        temperature=0.2,
    )
    return resp.choices[0].message.content
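
Printing the assembled user message is a quick way to check what the model actually sees. The sketch below pulls the formatting out of `answer` into a hypothetical helper, `build_user_message`, so it can be inspected without an API call; the two context strings are made-up examples.

```python
from __future__ import annotations


def build_user_message(question: str, context: list[str]) -> str:
    """Format retrieved chunks and the question the same way `answer` does."""
    context_block = "\n\n".join(f"- {c}" for c in context)
    return f"Context:\n{context_block}\n\nQuestion: {question}"


demo = build_user_message(
    "How much does the Pro plan cost?",
    ["The Pro plan costs $20 per month.", "All plans include daily backups."],
)
print(demo)
# Context:
# - The Pro plan costs $20 per month.
#
# - All plans include daily backups.
#
# Question: How much does the Pro plan cost?
```

Keeping prompt construction in its own function also makes it easy to unit-test the format when you later add source citations or change the separator.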

Finally, wire it together with a CLI entry point:

if __name__ == "__main__":
    import sys

    if len(sys.argv) >= 2 and sys.argv[1] == "index":
        build_index("knowledge.txt")
    else:
        question = " ".join(sys.argv[1:]) or "How much does the Pro plan cost?"
        vectors, chunks = load_index()
        context = retrieve(question, vectors, chunks)
        print(answer(question, context))

Verify it works

First build the index, then ask a question:

python rag.py index
python rag.py "How much does the Pro plan cost and what support does it include?"

Expected output (wording will vary slightly between model runs):

Indexed 1 chunks into index.npz
The Pro plan costs $20 per month and includes 100 GB of storage and email support.

Now test the guardrail by asking something the document doesn't cover:

python rag.py "What is Acme Cloud's phone number?"

The model should respond that it doesn't know, because the context contains no phone number. That "I don't know" behavior is the whole point of grounding — the answer comes from your data, not the model's training.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| `openai.AuthenticationError` | `OPENAI_API_KEY` not set or invalid | Re-export the key in the current shell; confirm with `echo $OPENAI_API_KEY` (PowerShell: `echo $env:OPENAI_API_KEY`). |
| `FileNotFoundError: index.npz` | You ran a query before indexing | Run `python rag.py index` first. |
| `openai.RateLimitError` | No billing credit, or too many requests | Add credit in the OpenAI dashboard; batch embeddings (we already do) and add retries with backoff. |
| Answers ignore your docs | `k` too small, or chunks too large to be specific | Increase `k`, lower `chunk_size`, and confirm `build_index` ran after editing `knowledge.txt`. |
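
The "retries with backoff" fix in the table can be sketched as a small wrapper. This is a minimal illustration, not the SDK's own mechanism: `with_backoff` is a hypothetical helper, the delay values are illustrative, and the demo uses a simulated failure so it runs offline.

```python
from __future__ import annotations

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    retry_on: tuple[type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on `retry_on` with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # 1x, 2x, 4x ... the base delay, plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")


# Offline demo: a stand-in call that fails twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated transient failure")
    return "ok"


delays: list[float] = []  # record delays instead of actually sleeping
outcome = with_backoff(flaky, retry_on=(TimeoutError,), sleep=delays.append)
print(outcome, attempts["n"], len(delays))  # ok 3 2
```

With the tutorial's client you would pass something like `lambda: embed(chunks)` as `call` and `(openai.RateLimitError,)` as `retry_on`. The OpenAI client also accepts a `max_retries` setting, which may be enough on its own for light workloads.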

If you change knowledge.txt, always re-run python rag.py index — the stored vectors are a snapshot, not a live view.
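
One way to catch a stale snapshot automatically is to record a fingerprint of the document at index time and compare it before answering. The sketch below assumes a SHA-256 hash of the file's text is a good enough fingerprint; `content_fingerprint` and `index_is_stale` are hypothetical helpers, and where you persist the fingerprint is up to you.

```python
from __future__ import annotations

import hashlib


def content_fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def index_is_stale(current_text: str, stored_fingerprint: str) -> bool:
    """True when the document changed since the fingerprint was recorded."""
    return content_fingerprint(current_text) != stored_fingerprint


original = "The Pro plan costs $20 per month."
saved = content_fingerprint(original)  # record this at index time

print(index_is_stale(original, saved))                # False: text unchanged
print(index_is_stale(original + " Now $25.", saved))  # True: text was edited
```

The fingerprint could be saved as an extra array alongside the vectors in index.npz, so a query run can warn or re-index when the two disagree.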

Next steps

Use a real vector database. A NumPy linear scan is fine for thousands of chunks but not for millions. Drop in Chroma, Qdrant, or pgvector for Postgres to get persistence, metadata filtering, and approximate nearest-neighbor search.
Improve chunking. Split on semantic boundaries (headings, paragraphs) instead of fixed word counts, and store metadata like source filename and page so you can cite sources in answers.
Add re-ranking. Retrieve more candidates than you need, then re-rank with a cross-encoder or the LLM to push the best passages to the top.
Evaluate. Build a small set of question/expected-answer pairs and measure retrieval hit rate before tuning chunk size or k. RAG quality lives or dies on retrieval, so measure it.
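
The evaluation step above can start very small. The sketch below computes a retrieval hit rate: the fraction of test questions for which an expected snippet shows up in the retrieved chunks. Everything here is illustrative: `hit_rate` is a hypothetical helper, the test cases are made up, and `keyword_retrieve` is a naive offline stand-in for the embedding-based `retrieve` so the example runs without an API key.

```python
from __future__ import annotations

from typing import Callable


def hit_rate(
    cases: list[tuple[str, str]],
    retrieve_fn: Callable[[str], list[str]],
) -> float:
    """Fraction of cases whose expected snippet appears in a retrieved chunk."""
    if not cases:
        return 0.0
    hits = 0
    for question, expected_snippet in cases:
        retrieved = retrieve_fn(question)
        if any(expected_snippet.lower() in chunk.lower() for chunk in retrieved):
            hits += 1
    return hits / len(cases)


# Offline stand-in retriever: rank documents by naive keyword overlap.
DOCS = [
    "The Pro plan costs $20 per month and includes email support.",
    "All plans include automatic daily backups retained for 30 days.",
]


def keyword_retrieve(question: str, k: int = 1) -> list[str]:
    q_words = set(question.lower().split())
    ranked = sorted(
        DOCS,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


cases = [
    ("how much does the pro plan cost", "$20 per month"),
    ("how long are backups retained", "30 days"),
    ("what is the phone number", "555-"),  # not in the docs: should miss
]
score = hit_rate(cases, keyword_retrieve)
print(f"hit rate: {score:.2f}")  # hit rate: 0.67
```

To evaluate the real pipeline, pass a function that wraps `retrieve` with your loaded vectors and chunks, then rerun the same cases each time you change chunk_size or k.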

Written by

Priya Nair · AI & Developer Experience Writer

Priya covers AI frameworks, developer productivity tooling, and the startup ecosystem across South and Southeast Asia, bringing a researcher's rigour and a practitioner's empathy to every story. She is deeply sceptical of benchmarks and asks hard questions so her readers don't have to.
