Retrieval-Augmented Generation in 45 minutes. From naive similarity search to chunking strategies, embeddings, and vector databases.

RAG Basics: Build a Production Retrieval System

RAG (Retrieval-Augmented Generation) is how you make LLMs know your data. This tutorial takes you from “naive similarity search” to a production system in 45 minutes.

What is RAG?

┌──────────┐     ┌────────┐     ┌──────────┐
│ Question │────►│ Search │────►│ LLM +    │────► Answer
└──────────┘     │ your   │     │ context  │
                │ docs   │     │ from     │
                └────────┘     │ search   │
                              └──────────┘

The LLM gets your data, not just its training data.

The naive approach (DON’T ship this)

# BAD: Stuff whole docs into prompt
context = "\n".join(load_all_documents())  # could be MBs
response = llm(f"Answer based on: {context}\n\nQ: {question}")

Problems: token limits, slow, expensive, irrelevant context.

The right approach

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# 1. Load & chunk docs
documents = SimpleDirectoryReader("docs/").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2. Query (auto-retrieves top-k chunks)
query_engine = index.as_query_engine()
response = query_engine.query("What is X?")
print(response)

That’s it for basics. 45 lines total.

Chunking strategies (the secret sauce)

Default chunking often fails. Try:

Strategy	When to use
Fixed-size (512 tokens)	Generic documents
Sentence splitter	Short answers
Semantic chunker	Long-form docs (papers)
Hierarchical	Documentation with headers

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),
    buffer_size=1,  # sentences between breakpoints
)
nodes = parser.get_nodes_from_documents(documents)

Vector DB choice

Scale	DB
< 100k docs	Chroma (in-memory)
100k-10M	Qdrant / Weaviate
> 10M	Pinecone / Milvus

Evaluation (most teams skip this!)

from llama_index.core.evaluation import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()
result = evaluator.evaluate_response(response=response)
print(f"Faithful: {result.passing} ({result.score})")

Key takeaways

RAG = search + LLM with your data
Chunking matters more than embeddings
Always evaluate (faithfulness + relevance)

RAG Basics: Build a Production Retrieval System

RAG Basics: Build a Production Retrieval System

What is RAG?

The naive approach (DON’T ship this)

The right approach

Chunking strategies (the secret sauce)

Vector DB choice

Evaluation (most teams skip this!)

Key takeaways

📦 开源项目

🛠️ Related Tools & Resources