Complete tutorial on Retrieval-Augmented Generation (RAG). Learn how to build a system that lets AI answer questions about your documents with code examples.
You’ve probably had this experience: you ask ChatGPT about your company’s policies, and it confidently makes something up. That’s because LLMs only know what they were trained on — they don’t know about your documents, your data, or your business.
RAG (Retrieval-Augmented Generation) fixes this. It's the technique that lets AI answer questions based on your specific documents, databases, and knowledge bases. In 2026, it remains one of the most important AI architecture patterns to understand.
What Is RAG? (Simple Explanation)
Think of RAG like giving the AI a reference library before it answers your question:
- You ask a question
- The system searches your documents for relevant information
- The relevant chunks are given to the AI as context
- The AI answers using that context instead of guessing
It’s like the difference between asking someone a question from memory versus letting them look it up in a textbook first.
// Without RAG:
User: "What's our refund policy?"
AI: "I don't have that information." (or worse, makes something up)
// With RAG:
User: "What's our refund policy?"
System: [searches documents] → [finds refund-policy.pdf]
AI: "Based on your policy document, refunds are available within
30 days of purchase for unused products..." (accurate!)
How RAG Works (Technical Deep-Dive)
Step 1: Document Ingestion
First, you need to process your documents into a format the system can search efficiently.
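// Split documents into overlapping chunks
// (newer LangChain releases export this class from '@langchain/textsplitters')
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // Maximum characters per chunk
  chunkOverlap: 200, // Characters shared between consecutive chunks
  separators: ['\n\n', '\n', '. ', ' '], // Tried in order: paragraphs, lines, sentences, words
});

// `documents` is an array of LangChain Document objects you have already loaded
const chunks = await splitter.splitDocuments(documents);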
Why chunks? LLMs have limited context windows. Instead of feeding entire documents, you find and inject only the relevant sections.
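To make the chunk size and overlap settings concrete, here is a minimal sketch of fixed-size chunking with overlap. The `chunkText` helper is hypothetical and is not part of LangChain; the real `RecursiveCharacterTextSplitter` also tries to break on the configured separators, which this simplified version ignores.

```typescript
// Hypothetical helper (not a LangChain API): slide a fixed-size window over the text.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap; // How far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // This window already reached the end
  }
  return chunks;
}

// Tiny values so the overlap is easy to see; real settings are closer to 1000 / 200.
const sample = 'abcdefghijklmnopqrstuvwxyz';
const pieces = chunkText(sample, 10, 4);
console.log(pieces);
```

Each chunk starts with the last four characters of the previous one. That overlap is what keeps a sentence that straddles a boundary from being cut off in both chunks.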
Step 2: Creating Embeddings
Each chunk is converted into a vector embedding — a list of numbers that represents the chunk’s meaning.
// Generate embeddings using OpenAI
import { OpenAIEmbeddings } from '@langchain/openai';

const embeddings = new OpenAIEmbeddings({
  model: 'text-embedding-3-large',
});

// Each chunk becomes a 3072-dimensional vector
const vectors = await embeddings.embedDocuments(
  chunks.map(c => c.pageContent)
);
How embeddings work: Similar concepts end up close together in vector space. “refund policy” and “return guidelines” would have vectors pointing in similar directions, even though the words are different.
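"Pointing in similar directions" is usually measured with cosine similarity. The sketch below computes it for hand-picked three-dimensional toy vectors. These numbers are invented for illustration and are not real model output, since real embeddings have thousands of dimensions.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // Avoid dividing by zero for empty vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (assumed values, chosen by hand for this example)
const refundPolicy = [0.9, 0.1, 0.0];
const returnGuidelines = [0.8, 0.2, 0.1];
const officeParking = [0.0, 0.2, 0.9];

const simRelated = cosineSimilarity(refundPolicy, returnGuidelines);
const simUnrelated = cosineSimilarity(refundPolicy, officeParking);
console.log({ simRelated, simUnrelated });
```

With these values the two refund-related vectors score close to 1, while the unrelated pair scores close to 0.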
Step 3: Storing in a Vector Database
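// Store in Pinecone (or Qdrant, Weaviate, Chroma, etc.)
import { Pinecone } from '@pinecone-database/pinecone';
import { PineconeStore } from '@langchain/pinecone';

// 'company-docs-index' is a placeholder name. The index must already exist
// with 3072 dimensions to match text-embedding-3-large.
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index('company-docs-index');

// fromDocuments embeds the chunks itself and upserts them into the namespace
const vectorStore = await PineconeStore.fromDocuments(
  chunks,
  embeddings,
  {
    pineconeIndex: index,
    namespace: 'company-docs',
  }
);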
Step 4: Retrieval at Query Time
When a user asks a question, the same embedding model converts their question into a vector, and the system finds the closest matching document chunks.
// Find relevant chunks for a question
const relevantDocs = await vectorStore.similaritySearch(
  "What is the refund policy?",
  4 // Return top 4 most relevant chunks
);
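Conceptually, a similarity search ranks every stored chunk by how close its vector is to the query vector and keeps the best few. The sketch below does this by brute force in memory. The `StoredChunk` type, the `topK` function, and the toy vectors are all assumptions made for illustration. Production vector databases typically use approximate nearest-neighbor indexes instead of scanning every vector.

```typescript
interface StoredChunk {
  text: string;
  vector: number[];
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, x, i) => sum + x * b[i], 0);

// Scale a vector to length 1 so the dot product equals cosine similarity.
const normalize = (v: number[]): number[] => {
  const len = Math.sqrt(dot(v, v));
  return len === 0 ? v : v.map(x => x / len);
};

// Hypothetical brute-force search: score every chunk, sort, keep the top k.
function topK(queryVector: number[], store: StoredChunk[], k: number) {
  const q = normalize(queryVector);
  return store
    .map(chunk => ({ text: chunk.text, score: dot(q, normalize(chunk.vector)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Toy store with hand-picked vectors (not real embeddings)
const store: StoredChunk[] = [
  { text: 'Refunds are available within 30 days of purchase.', vector: [0.9, 0.1, 0.0] },
  { text: 'Parking passes are issued by the front desk.', vector: [0.0, 0.2, 0.9] },
  { text: 'Returned items must be unused.', vector: [0.7, 0.3, 0.1] },
];

// Pretend this is the embedding of "What is the refund policy?"
const queryVector = [0.85, 0.15, 0.05];
const results = topK(queryVector, store, 2);
console.log(results.map(r => r.text));
```

The two refund-related chunks rank ahead of the parking chunk, which is dropped because only the top two results are kept.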
Step 5: Generation with Context
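// Send the question + relevant context to the LLM
import { ChatOpenAI } from '@langchain/openai';

// The model name is only an example; any chat model can be used here
const llm = new ChatOpenAI({ model: 'gpt-4o-mini', temperature: 0 });

const response = await llm.invoke([
  {
    role: "system",
    content: `Answer the user's question based ONLY on the
provided context. If the context doesn't contain the answer,
say "I don't have that information."
Context:
${relevantDocs.map(d => d.pageContent).join('\n\n')}`
  },
  {
    role: "user",
    content: "What is the refund policy?"
  }
]);

console.log(response.content);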
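The prompt assembly in this step can be pulled into a small pure function, which makes it easy to test without calling a model. This is a sketch under assumptions: `buildRagMessages`, the `RetrievedDoc` and `ChatMessage` types, the numbered context format, and the empty-context placeholder are illustrative choices, not part of LangChain.

```typescript
interface RetrievedDoc {
  pageContent: string;
}

interface ChatMessage {
  role: 'system' | 'user';
  content: string;
}

// Hypothetical helper: turn a question and retrieved chunks into chat messages.
function buildRagMessages(question: string, docs: RetrievedDoc[]): ChatMessage[] {
  // Number each chunk so the model (and a human reviewer) can tell them apart.
  const context = docs.length > 0
    ? docs.map((d, i) => `[${i + 1}] ${d.pageContent.trim()}`).join('\n\n')
    : '(no relevant documents found)';

  const system = [
    `Answer the user's question based ONLY on the provided context.`,
    `If the context doesn't contain the answer, say "I don't have that information."`,
    '',
    'Context:',
    context,
  ].join('\n');

  return [
    { role: 'system', content: system },
    { role: 'user', content: question },
  ];
}

const messages = buildRagMessages('What is the refund policy?', [
  { pageContent: 'Refunds are available within 30 days of purchase.' },
]);
console.log(messages[0].content);
```

The returned array has the same role and content shape as the messages passed to the model in this step, so it could be handed to the same call.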