RAG Basics: Build a Production Retrieval System
RAG (Retrieval-Augmented Generation) is how you make LLMs know your data. This tutorial takes you from โnaive similarity searchโ to a production system in 45 minutes.
What is RAG?
โโโโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโโโ
โ Question โโโโโโบโ Search โโโโโโบโ LLM + โโโโโโบ Answer
โโโโโโโโโโโโ โ your โ โ context โ
โ docs โ โ from โ
โโโโโโโโโโ โ search โ
โโโโโโโโโโโโ
The LLM gets your data, not just its training data.
The naive approach (DONโT ship this)
# BAD: Stuff whole docs into prompt
context = "\n".join(load_all_documents()) # could be MBs
response = llm(f"Answer based on: {context}\n\nQ: {question}")
Problems: token limits, slow, expensive, irrelevant context.
The right approach
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
# 1. Load & chunk docs
documents = SimpleDirectoryReader("docs/").load_data()
index = VectorStoreIndex.from_documents(documents)
# 2. Query (auto-retrieves top-k chunks)
query_engine = index.as_query_engine()
response = query_engine.query("What is X?")
print(response)
Thatโs it for basics. 45 lines total.
Chunking strategies (the secret sauce)
Default chunking often fails. Try:
| Strategy | When to use |
|---|---|
| Fixed-size (512 tokens) | Generic documents |
| Sentence splitter | Short answers |
| Semantic chunker | Long-form docs (papers) |
| Hierarchical | Documentation with headers |
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
parser = SemanticSplitterNodeParser(
embed_model=OpenAIEmbedding(),
buffer_size=1, # sentences between breakpoints
)
nodes = parser.get_nodes_from_documents(documents)
Vector DB choice
| Scale | DB |
|---|---|
| < 100k docs | Chroma (in-memory) |
| 100k-10M | Qdrant / Weaviate |
| > 10M | Pinecone / Milvus |
Evaluation (most teams skip this!)
from llama_index.core.evaluation import FaithfulnessEvaluator
evaluator = FaithfulnessEvaluator()
result = evaluator.evaluate_response(response=response)
print(f"Faithful: {result.passing} ({result.score})")
Key takeaways
- RAG = search + LLM with your data
- Chunking matters more than embeddings
- Always evaluate (faithfulness + relevance)