Retrieval-Augmented Generation

Here's the thing about language models: they're trained once, then frozen. The model you're talking to right now learned from data collected up to some cutoff date. After that, it doesn't update. Its knowledge is a photograph, not a window.

RAG (Retrieval-Augmented Generation) is the fix. The idea is almost embarrassingly simple: before the model answers your question, it looks something up.

Imagine a brilliant friend who read voraciously for years and then went off the grid. They know a lot. They reason beautifully. But they don't know what happened last week, and they've never seen your company's internal docs. Now hand that friend a library card, plus a research assistant who can sprint to the shelves and bring back the right pages before they answer your question.

That's RAG.

The original paper was a research proposal. Five years later it's the default architecture for "talk to your docs": every customer-support bot, legal-research tool, and codebase assistant you've used probably has some flavor of it running underneath.

How it actually works

The simplest version has three steps.

Embed your documents. You take whatever documents you want the model to know about (your wiki, a research paper, your email archive, a thousand product reviews), split them into chunks, and convert each chunk of text into an embedding: a long list of numbers representing its meaning. Similar texts produce similar vectors. This is where semantic understanding lives.

Retrieve the relevant chunks. When a user asks a question, you embed that question too, then find the document chunks whose vectors are closest to it. Fast, approximate, surprisingly good.

Augment the prompt. You paste the retrieved chunks into the prompt, before the question, so the model has the relevant context in front of it when it answers. The model never "knows" the document. It reads it, right then, the way you'd read a passage before answering a question about it.

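# A minimal RAG loop
# embedder, vector_store, and llm are placeholders for whatever
# embedding model, vector index, and LLM client you use.
def answer_with_rag(question: str, top_k: int = 5) -> str:
    # 1. embed the question
    query_vec = embedder.embed(question)

    # 2. find the top-k closest document chunks (assumed to come back as strings)
    chunks = vector_store.search(query_vec, top_k=top_k)

    # 3. build the augmented prompt: retrieved context first, then the question
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    return llm.complete(prompt)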
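
To make the retrieve step concrete, here is a runnable toy version. It is a sketch under loud assumptions: `embed` is a hypothetical stand-in that just counts words (a real embedding model captures meaning, this one only captures shared vocabulary), and the three chunks are shortened from the sample corpus in this article. What it does show faithfully is the mechanics: turn text into vectors, rank by cosine similarity, paste the winners into the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks ranked by cosine similarity to the question."""
    query_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Put the retrieved context before the question."""
    context = "\n\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Employees may work remotely up to 3 days per week with manager approval.",
    "Full-time employees accrue 15 days of PTO annually.",
    "Business expenses under $75 can be submitted without prior approval.",
]
question = "How many days of PTO do employees accrue?"
top = retrieve(question, chunks, top_k=1)
print(build_prompt(question, top))
```

A real pipeline would embed the chunks once and store the vectors, instead of re-embedding every chunk on every query as this sketch does.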

Document corpus — 6 chunks

Remote Work

Work-from-Home Policy

Employees may work remotely up to 3 days per week with manager approval. Full-remote arrangements require VP sign-off and a quarterly in-person review. All remote employees must maintain core hours of 10 AM–3 PM in their local timezone.

Time Off

Vacation & PTO

Full-time employees accrue 15 days of PTO annually, increasing to 20 days after 3 years. PTO must be requested at least 2 weeks in advance for periods exceeding 5 days. Unused PTO carries over up to a maximum of 10 days per calendar year.

Performance

Review Cycle & Promotions

Performance reviews are conducted twice annually: in June and December. Each review includes a self-assessment, peer feedback, and a manager evaluation. Promotion eligibility requires a rating of "Exceeds Expectations" in at least two consecutive review cycles.

Benefits

Health & Dental Coverage

The company covers 90% of premiums for medical, dental, and vision insurance for employees, and 70% for dependents. Enrollment is open during the first 30 days of employment and each November. A $500 annual wellness stipend is available for gym memberships, therapy, or fitness equipment.

Finance

Expense Reimbursement

Business expenses under $75 can be submitted without prior approval. Expenses between $75–$500 require manager approval before purchase. All expenses must be submitted within 30 days via the Expensify portal with original receipts. Travel expenses follow a separate per-diem schedule.

Conduct

Conflicts of Interest

Employees must disclose any outside employment, financial interest, or personal relationship that could conflict with company interests. Disclosure forms must be submitted to HR within 14 days of a conflict arising. Failure to disclose may result in disciplinary action up to and including termination.

What this changes

Without RAG, asking a language model about something specific — your codebase, a paper you just published, anything that happened recently — is like asking someone who has never seen your document to write about it. They'll produce something plausible. And wrong.

With RAG, the model is working from sources. It can still make mistakes, but now it has something real to be wrong about. That's a meaningful improvement, and it changes how you should think about failure modes.

It also changes the economics. Fine-tuning a model on your proprietary data is expensive and slow, and it has to be repeated every time anything changes. RAG is leaner: update the vector store whenever your documents change, and the retrieval adapts immediately. No retraining cycle.
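
Here is roughly what "update the vector store" can look like, as a sketch only. `TinyVectorStore` and `toy_embed` are hypothetical names invented for this example, the embedding is a word count over a four-word vocabulary, and the changed policy number is made up. The point is that a document change costs one re-embed of one chunk.

```python
import re


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: counts over a tiny fixed vocabulary."""
    vocab = ["pto", "days", "remote", "expenses"]
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [float(words.count(term)) for term in vocab]


class TinyVectorStore:
    """Hypothetical in-memory store mapping a chunk id to (vector, text)."""

    def __init__(self, embed):
        self._embed = embed  # any callable: str -> list[float]
        self._rows = {}      # chunk_id -> (vector, text)

    def upsert(self, chunk_id: str, text: str) -> None:
        # Re-embedding the changed chunk is the entire update step.
        self._rows[chunk_id] = (self._embed(text), text)

    def delete(self, chunk_id: str) -> None:
        self._rows.pop(chunk_id, None)

    def search(self, query: str, top_k: int = 1) -> list[str]:
        query_vec = self._embed(query)

        def score(row):
            vec, _text = row
            return sum(x * y for x, y in zip(query_vec, vec))  # dot product

        ranked = sorted(self._rows.values(), key=score, reverse=True)
        return [text for _vec, text in ranked[:top_k]]


store = TinyVectorStore(toy_embed)
store.upsert("pto", "Full-time employees accrue 15 days of PTO annually.")
store.upsert("expenses", "Business expenses under $75 can be submitted without prior approval.")
before = store.search("How many PTO days?")

# The policy changes (18 is a made-up number): re-embed that one chunk.
store.upsert("pto", "Full-time employees accrue 18 days of PTO annually.")
after = store.search("How many PTO days?")
print(before, after)
```

The very next query sees the new text, with no training run in between.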

The parts most people skip

RAG sounds simple in the hello-world version. The hard parts are quieter.

Chunking strategy matters more than you'd expect. Split documents too finely and you lose context. Too coarsely and you dilute the signal. Good results often mean experimenting with chunk size, overlap, and structure-aware splitting — respecting headers, code blocks, tables — instead of blindly slicing every 512 tokens.
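
As a starting point for that experimentation, here is a minimal sliding-window chunker with overlap. It is a sketch, not a recommendation: `chunk_words` is a hypothetical helper, it counts whitespace-separated words where a real pipeline would count tokens, and it ignores document structure entirely.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# -> ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap means a sentence that straddles a boundary shows up whole in at least one chunk. A structure-aware splitter would first cut on headers or blank lines and only then apply a window like this inside each section.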

Retrieval quality is the ceiling. The model can only work with what it retrieves. If the wrong chunks come back — because the question was ambiguous, the embedding model didn't capture your domain, or the document was poorly formatted — the answer will be wrong even when the right information exists somewhere in your corpus.

Reranking helps a lot. The initial similarity search is fast but approximate. Production RAG pipelines often add a second step: a reranker that reads the question and each candidate chunk together and produces a more careful relevance score. First retrieval casts a wide net; the reranker curates.
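
The two-stage shape is easy to sketch. Everything named here is an assumption for illustration: `retrieve_then_rerank` is a hypothetical function, and both scorers are cheap text heuristics standing in for the real things (a vector search for stage one, a learned model for stage two). The structure is what matters: a cheap score over everything, then a costlier score over the survivors only.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]  # (question, chunk) -> relevance score


def _words(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def shared_words(question: str, chunk: str) -> float:
    """Stand-in for the fast first stage: number of distinct shared words."""
    return float(len(set(_words(question)) & set(_words(chunk))))


def shared_bigrams(question: str, chunk: str) -> float:
    """Stand-in for a reranker: shared adjacent word pairs, so word order counts."""
    q, c = _words(question), _words(chunk)
    return float(len(set(zip(q, q[1:])) & set(zip(c, c[1:]))))


def retrieve_then_rerank(
    question: str,
    chunks: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    candidates: int = 20,
    top_k: int = 3,
) -> list[str]:
    # Stage 1: cheap score over the whole corpus, keep a wide candidate set.
    wide = sorted(chunks, key=lambda c: first_stage(question, c), reverse=True)[:candidates]
    # Stage 2: more careful score, paid only for the candidates.
    return sorted(wide, key=lambda c: reranker(question, c), reverse=True)[:top_k]


chunks = [
    "Employees may work remotely up to 3 days per week with manager approval.",
    "All remote employees must maintain core hours in their local timezone.",
    "Expenses between $75–$500 require manager approval before purchase.",
]
question = "Do employees need manager approval before purchase?"
best = retrieve_then_rerank(question, chunks, shared_words, shared_bigrams, candidates=2, top_k=1)
print(best)
```

Because the reranker only ever sees `candidates` chunks, a chunk the first stage misses can never be rescued by the second. That is why the candidate set is kept wide.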

Hybrid search beats pure vector search. A vector store finds semantically similar things — but sometimes you just need a keyword match (a product SKU, an error code, a name). Combining vector and keyword search (often via reciprocal rank fusion) catches both.
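
Reciprocal rank fusion itself fits in a few lines. In this sketch each ranker contributes 1 / (k + rank) for every chunk it returns, and the contributions are summed. The function name and the chunk ids are hypothetical; k = 60 is the constant commonly used with this method, not something tuned for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # High ranks contribute more; k damps the gap between neighbors.
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two searches, best match first.
vector_ranking = ["expenses", "remote-work", "benefits"]
keyword_ranking = ["benefits", "expenses", "conduct"]

print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
# -> ['expenses', 'benefits', 'remote-work', 'conduct']
```

Only ranks are used, never raw scores, so the two searches don't need comparable scoring scales. Chunks that both searches return float to the top.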

Most AI tools that feel unusually sharp about a specific domain — your company's knowledge, a particular research area, your own uploaded files — are running some version of this. The underlying model might be identical to what everyone else is using. What's different is what it was handed right before it answered.

That's RAG. Not magic. Just a well-timed library run.