Your LLM hallucinates on proprietary documents? Or your chatbot only knows up to its training cutoff? The fix is RAG — Retrieval Augmented Generation. No expensive fine-tuning, full data control. At Meteora Web, we use LangChain to build production-grade RAG pipelines. Here's exactly how.
Why RAG beats fine-tuning
Imagine a 500-page product manual. Fine-tuning on it demands GPUs and hours of training, and it risks overfitting. RAG keeps your base model intact and feeds it the right context at query time. The benefits: real-time updates (just swap the documents), zero training, and transparency (you see the exact sources used). A basic pipeline also runs on stock LangChain components, with no custom model code.
RAG architecture in LangChain
Three components: Indexing (load, split, vectorize), Retrieval (semantic search), Generation (LLM + context). LangChain chains them into a coherent pipeline.
1. Document Indexing
Start with a PDF. We load it with PyPDFLoader or UnstructuredPDFLoader, then split it into chunks. Chunks that are too small lose context; chunks that are too big overflow the context window. Our sweet spot is ~1000 characters with a 200-character overlap, using RecursiveCharacterTextSplitter.
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("manual.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"{len(chunks)} chunks created")
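To see why the overlap matters, here is a minimal sketch of fixed-size chunking with overlap, using only the standard library. It is a simplified stand-in, not how RecursiveCharacterTextSplitter works internally: the real splitter also tries the separators in order so that cuts land on paragraph and line breaks. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows; each window starts with the
    last chunk_overlap characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 250  # 2500 characters of dummy text
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
```

A sentence that straddles a cut appears whole in at least one chunk as long as it is shorter than the overlap, which is the reason for repeating those 200 characters.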
Next, generate embeddings. We prefer text-embedding-3-small for its cost/quality balance, but local models via HuggingFaceEmbeddings work too. For the vector store, we use Chroma (lightweight, no server) or, in production, Qdrant.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# With chromadb >= 0.4, data is written to persist_directory automatically;
# the older explicit vectorstore.persist() call is deprecated.
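Conceptually, a vector store ranks stored chunks by how close their embeddings are to the query embedding. The sketch below shows that idea with cosine similarity over made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and real stores use approximate indexes rather than a full scan). The names cosine_similarity, top_k and toy_index are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Brute-force search: score every (text, vector) pair, keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up embeddings, for illustration only.
toy_index = [
    ("Wear gloves when replacing the blade.", [0.9, 0.1, 0.0]),
    ("The warranty lasts two years.", [0.0, 0.2, 0.9]),
    ("Unplug the device before cleaning.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of a safety-related question
nearest = top_k(query, toy_index, k=2)
print(nearest)
```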
2. Retrieval – find the right chunks
Use as_retriever with k=4 or 5. A common mistake is relying solely on cosine similarity: the top hits are often near-duplicates of each other. If chunk lengths vary a lot, normalize them. We often switch to Maximal Marginal Relevance (MMR) to diversify results and avoid redundancy.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "lambda_mult": 0.5}
)
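For intuition about what lambda_mult trades off, here is a small pure-Python sketch of the MMR selection loop. It assumes the usual formulation, score = lambda_mult * relevance - (1 - lambda_mult) * redundancy, and toy three-dimensional vectors; it is an illustration, not LangChain's own code, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance to the query
    against similarity to documents that were already picked."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.6, 0.5]
docs = [
    [1.0, 0.9, 0.0],  # most similar to the query
    [1.0, 1.0, 0.0],  # near-duplicate of the first document
    [1.0, 0.0, 1.0],  # less similar, but covers a different direction
]
plain_top2 = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
mmr_top2 = mmr(query, docs, k=2, lambda_mult=0.5)
print(plain_top2, mmr_top2)
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the chunk that adds new information. With lambda_mult=1.0 the redundancy term vanishes and the loop falls back to plain similarity ranking.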
3. Generation with context
Create a prompt that forces the LLM to use ONLY the provided context. Define it with ChatPromptTemplate, then build the RAG chain with create_stuff_documents_chain and create_retrieval_chain.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert assistant. Use only the provided context to answer. If you don't know, say so."),
    ("human", "Context: \n{context}\n\nQuestion: {input}")
])
combine_docs_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, combine_docs_chain)
result = qa_chain.invoke({"input": "What safety instructions are in the manual?"})
print(result["answer"])
Critical: explicitly forbid hallucination in the prompt by telling the model to say so when the context does not contain the answer. GPT-4o and Claude 3.5 comply well.
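The "stuff" strategy is simple at heart: the retrieved chunk texts are joined into one string and dropped into the {context} slot. The standard-library sketch below mimics that step so you can inspect the final prompt before any model sees it. The helper build_messages is hypothetical, and the "\n\n" separator is an assumption about the library default.

```python
SYSTEM_PROMPT = (
    "You are an expert assistant. Use only the provided context to answer. "
    "If you don't know, say so."
)


def build_messages(chunk_texts, question, separator="\n\n"):
    """Join retrieved chunk texts into one context string and fill the prompt."""
    context = separator.join(chunk_texts)
    return [
        ("system", SYSTEM_PROMPT),
        ("human", f"Context: \n{context}\n\nQuestion: {question}"),
    ]


messages = build_messages(
    ["Unplug the device before cleaning.", "Wear gloves when replacing the blade."],
    "What safety instructions are in the manual?",
)
for role, content in messages:
    print(f"[{role}] {content}")
```

Printing the assembled prompt like this is a cheap way to debug bad answers: if the right passage is not in the context, the problem is retrieval, not generation.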
Production optimizations
A working RAG is just the start. Before you deploy, apply the following optimizations.
Intelligent chunking
Naive chunking breaks tables, code blocks, and cross-references. Use semantic chunking that respects sentence boundaries (the NLTK or spaCy splitters). For code, split by function or class. Never cut in the middle of a sentence that refers to another section.
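As a rough illustration of sentence-aware chunking, the sketch below packs whole sentences into chunks up to a character budget. The regex sentence splitter is a deliberately naive assumption (it breaks on abbreviations such as "e.g."), which is exactly the gap NLTK or spaCy fill; the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(text, max_chars=1000):
    """Pack whole sentences into chunks of at most max_chars characters.
    A single sentence longer than max_chars becomes its own oversized chunk."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Unplug the unit. Wait ten minutes! Is the fan still warm? Then see section 4."
sentence_chunks = chunk_by_sentence(text, max_chars=40)
print(sentence_chunks)
```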
Hybrid search
Semantic similarity fails on exact terms like “order ID 12345”. Combine with BM25 full-text via EnsembleRetriever.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires: pip install rank_bm25

bm25 = BM25Retriever.from_documents(chunks, k=3)
semantic = vectorstore.as_retriever(search_kwargs={"k": 3})
ensemble = EnsembleRetriever(
    retrievers=[bm25, semantic],
    weights=[0.3, 0.7]
)
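To understand what the weights do, it helps to see how two ranked lists can be fused. The sketch below uses weighted Reciprocal Rank Fusion, which to our understanding is the scheme EnsembleRetriever applies (with a rank constant of 60); treat that as an assumption and check the API reference for your version. The document IDs and the function name weighted_rrf are made up.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs.
    Each list contributes weight / (rank + c) to a document's score."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for a query mentioning "order ID 12345".
bm25_hits = ["order-12345", "returns-policy", "shipping-faq"]
semantic_hits = ["returns-policy", "refund-howto", "order-12345"]
fused = weighted_rrf([bm25_hits, semantic_hits], weights=[0.3, 0.7])
print(fused)
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale. Documents found by both retrievers rise to the top, and the weights decide which retriever wins a disagreement.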
Reranking
Not all retrieved chunks are equally relevant. A reranker (e.g., Cohere Rerank) reorders by actual pertinence. LangChain supports it via ContextualCompressionRetriever or a Cohere wrapper. We use it when k>5.
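The pattern behind reranking is "retrieve wide, then keep the best few". The sketch below shows only that control flow. The scoring function is a toy word-overlap measure standing in for a real reranker such as Cohere Rerank or a cross-encoder, so its scores carry no real meaning. All names here are hypothetical.

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query words that occur in the passage.
    A stand-in for a real reranking model, for illustration only."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


def rerank(query, passages, score_fn, top_n=3):
    """Reorder retrieved passages by score_fn and keep the top_n."""
    ranked = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]


candidates = [
    "The warranty covers parts for two years.",
    "Replace the filter every six months.",
    "How to replace the filter safely: unplug first.",
]
best = rerank("how to replace the filter", candidates, overlap_score, top_n=2)
print(best)
```

In a real pipeline, score_fn would call the reranking model, and candidates would be the wider result set (for example 10 to 20 chunks) from the first-stage retriever.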
Common pitfalls and how to avoid them
- Over-stuffing: passing more than 8 chunks degrades answers. Start with k=4 or 5.
- Missing metadata: without document name and page, you lose source attribution. LangChain loaders can include them.
- Weak prompt: “Use context” is vague. Specify what to do with irrelevant or contradictory context.
- Wrong embeddings: for domain-specific knowledge, consider fine-tuned embedding models (e.g., legal embeddings).
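On the metadata point, here is a minimal sketch of turning chunk metadata into a source list to show under each answer. It assumes each retrieved chunk carries "source" and a zero-based "page" key, as PyPDFLoader provides; plain dicts stand in for LangChain Document objects, and format_sources is a hypothetical helper.

```python
def format_sources(retrieved):
    """Build a de-duplicated, human-readable source list from chunk metadata."""
    seen, lines = set(), []
    for doc in retrieved:
        meta = doc.get("metadata", {})
        key = (meta.get("source", "unknown"), meta.get("page"))
        if key in seen:
            continue  # several chunks can come from the same page
        seen.add(key)
        source, page = key
        # Assumed zero-based page index, so add 1 for display.
        lines.append(source if page is None else f"{source}, p. {page + 1}")
    return lines


retrieved = [
    {"page_content": "Unplug the device...", "metadata": {"source": "manual.pdf", "page": 11}},
    {"page_content": "...before cleaning.", "metadata": {"source": "manual.pdf", "page": 11}},
    {"page_content": "Returns are free.", "metadata": {"source": "faq.md"}},
]
sources = format_sources(retrieved)
print(sources)
```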
Real-world stack we use
For clients demanding privacy, we run Ollama with local models (Llama 3, Mistral) and a local Chroma instance. For high traffic, we use Qdrant in the cloud. We always add a feedback loop: users rate answers, and the logs go to a database that we mine to improve retrieval. AI amplifies experts rather than replacing them: every output is reviewed by a domain expert.
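A feedback loop like this can start very small. The sketch below logs thumbs-up/down ratings to SQLite from the standard library and lists the questions with the most downvotes. The schema, the table name and the function names are assumptions for illustration, not the stack described above.

```python
import sqlite3
from datetime import datetime, timezone


def open_feedback_db(path=":memory:"):
    """Open (or create) the feedback database. ':memory:' keeps the demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               asked_at TEXT NOT NULL,
               question TEXT NOT NULL,
               answer TEXT NOT NULL,
               rating INTEGER NOT NULL CHECK (rating IN (-1, 1))
           )"""
    )
    return conn


def log_feedback(conn, question, answer, rating):
    """Store one rated answer; rating is 1 (helpful) or -1 (not helpful)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO feedback (asked_at, question, answer, rating) VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), question, answer, rating),
        )


def worst_questions(conn, limit=5):
    """Questions with the most downvotes: candidates for retrieval fixes."""
    return conn.execute(
        "SELECT question, COUNT(*) AS downvotes FROM feedback "
        "WHERE rating = -1 GROUP BY question "
        "ORDER BY downvotes DESC, question LIMIT ?",
        (limit,),
    ).fetchall()


conn = open_feedback_db()
log_feedback(conn, "How do I reset the device?", "Hold the button.", -1)
log_feedback(conn, "How do I reset the device?", "See page 3.", -1)
log_feedback(conn, "How long is the warranty?", "Two years.", 1)
print(worst_questions(conn))
```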
What to do now
1. Prepare a test dataset. Grab 3–5 internal documents (PDF, Markdown, web pages).
2. Install dependencies: pip install langchain langchain-openai langchain-community chromadb pypdf (add rank_bm25 if you use BM25Retriever), then set OPENAI_API_KEY.
3. Copy and adapt the code above to build your own RAG pipeline. Experiment with chunk size and k.
4. Measure quality. Ask 10 questions with known answers. Check correctness vs hallucinations. A good RAG should hit at least 80% accuracy on covered topics.
5. Deploy to production — FastAPI backend, React or Streamlit frontend. Show sources for every answer.
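For step 4, a tiny evaluation harness is enough to get started. The sketch below scores answers by checking that the expected keywords appear, a crude proxy for correctness that will miss paraphrases, so read the failures by hand. The test questions, the stub answer function and all names are hypothetical.

```python
def is_correct(answer, expected_keywords):
    """Crude check: every expected keyword must appear in the answer."""
    text = answer.lower()
    return all(keyword.lower() in text for keyword in expected_keywords)


def evaluate(answer_fn, test_set):
    """Run answer_fn on each question; return (accuracy, list of failed questions)."""
    outcomes = [
        (question, is_correct(answer_fn(question), keywords))
        for question, keywords in test_set
    ]
    accuracy = sum(ok for _, ok in outcomes) / len(outcomes) if outcomes else 0.0
    failures = [question for question, ok in outcomes if not ok]
    return accuracy, failures


# Hypothetical questions with known answers.
test_set = [
    ("How long is the warranty?", ["two years"]),
    ("What should I do before cleaning?", ["unplug"]),
    ("How often should the filter be replaced?", ["six months"]),
    ("What is the maximum load?", ["25 kg"]),
]


def stub_answer(question):
    """Stand-in for the real pipeline so the example runs offline."""
    canned = {
        "How long is the warranty?": "The warranty lasts two years.",
        "What should I do before cleaning?": "Unplug the device first.",
        "How often should the filter be replaced?": "Every six months.",
    }
    return canned.get(question, "I don't know.")


accuracy, failures = evaluate(stub_answer, test_set)
print(f"accuracy: {accuracy:.0%}, failed: {failures}")
```

To test the real pipeline, replace stub_answer with a function that calls qa_chain.invoke({"input": question}) and returns the "answer" field.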
Useful links:
- LangChain RAG Tutorial (official)
- GitHub Copilot: how we boost coding speed (Meteora Web)
- AI Agent Authorization: lessons from the Meta breach (Meteora Web)