RAG with LangChain: Practical Guide to Retrieval Augmented Generation • Meteora Web Agency

Does your LLM hallucinate on proprietary documents? Does your chatbot only know what it saw before its training cutoff? The fix is RAG (Retrieval-Augmented Generation): no expensive fine-tuning, and you keep full control of your data. At Meteora Web, we use LangChain to build production-grade RAG pipelines. Here's exactly how.

Why RAG beats fine-tuning

Imagine a 500-page product manual. Fine-tuning on it demands GPUs and hours of training, and it risks overfitting. RAG keeps your base model intact and feeds it the right context at query time. The benefits: fast updates (re-index the changed documents), no training run, and transparency (you can see exactly which sources were used). A basic pipeline also works with off-the-shelf components.

RAG architecture in LangChain

Three components: Indexing (load, split, vectorize), Retrieval (semantic search), Generation (LLM + context). LangChain chains them into a coherent pipeline.

1. Document Indexing

Start with a PDF. We load it with PyPDFLoader or UnstructuredPDFLoader, then split it into chunks. Chunks that are too small lose context; chunks that are too big dilute retrieval and eat into the context window. Our sweet spot is about 1000 characters with a 200-character overlap, using RecursiveCharacterTextSplitter.

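from langchain_community.document_loaders import PyPDFLoader
# The splitters live in the langchain-text-splitters package (a langchain dependency).
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page.
loader = PyPDFLoader("manual.pdf")
documents = loader.load()

# Split on paragraph breaks first, then lines, then words, then single characters.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""]
)
chunks = text_splitter.split_documents(documents)
print(f"{len(chunks)} chunks created")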
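
To make chunk_size and chunk_overlap concrete, here is a minimal sketch of fixed-size windowing in plain Python. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers natural separators such as paragraph breaks); the function name sliding_window_chunks and the sample text are illustrative assumptions.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical 2500-character sample standing in for extracted PDF text.
sample = "".join(str(i % 10) for i in range(2500))
demo_chunks = sliding_window_chunks(sample)
print([len(c) for c in demo_chunks])
```

Each window repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.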

Next, generate embeddings. We prefer text-embedding-3-small for its cost/quality balance, but local models via HuggingFaceEmbeddings work too. For the vector store, we use Chroma (lightweight, no server) or Qdrant in production.

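from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed every chunk and store the vectors on disk.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# With Chroma 0.4 and later, data is written to persist_directory automatically,
# so the explicit vectorstore.persist() call is deprecated and no longer needed.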

2. Retrieval – find the right chunks

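Use as_retriever with k=4 or 5. A common mistake is relying solely on plain cosine-similarity search, which tends to return several near-identical chunks. If your chunks vary widely in length, consider normalizing chunk sizes at indexing time. We often switch to Maximal Marginal Relevance (MMR) to diversify results and avoid redundancy; lambda_mult sets the trade-off, where 1 means pure relevance and 0 means maximum diversity.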

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "lambda_mult": 0.5}
)
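
For intuition about what MMR does with lambda_mult, here is a minimal sketch in plain Python over toy 2-D vectors. It is a simplified illustration, not LangChain's implementation (which, among other things, first fetches a larger candidate pool); the names cosine and mmr_select and the vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return indices of up to k documents, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.3]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With lambda_mult=1.0 the two near-duplicates are both returned; with 0.5 the second slot goes to the less similar but non-redundant document.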

3. Generation with context

Create a prompt that forces the LLM to use ONLY the provided context. Use ChatPromptTemplate and the RAG chain via create_stuff_documents_chain and create_retrieval_chain.

from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an expert assistant. Use only the provided context to answer. If you don't know, say so."),
    ("human", "Context: \n{context}\n\nQuestion: {input}")
])

combine_docs_chain = create_stuff_documents_chain(llm, prompt)
qa_chain = create_retrieval_chain(retriever, combine_docs_chain)

result = qa_chain.invoke({"input": "What safety instructions are in the manual?"})
print(result["answer"])

Critical: explicitly forbid hallucination by telling the model what to do when the context does not contain the answer. In our experience, GPT-4o and Claude 3.5 comply well.
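
To show what "stuffing" amounts to, here is a rough sketch in plain Python of how retrieved passages can be joined into one context block under explicit rules. It is an illustration, not the code behind create_stuff_documents_chain; the rule wording, the passage numbering, and the name build_prompt are assumptions.

```python
SYSTEM_RULES = (
    "Answer using only the numbered context passages. "
    "If the passages do not contain the answer, reply 'I don't know'. "
    "If passages contradict each other, say so and cite both."
)


def build_prompt(question, passages, separator="\n\n"):
    """Join passages into one context block and wrap it with the rules."""
    context = separator.join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return f"{SYSTEM_RULES}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt_text = build_prompt(
    "What safety instructions are in the manual?",
    ["Wear goggles when cutting.", "Unplug the device before cleaning."],
)
print(prompt_text)
```

Numbering the passages is a design choice that lets the model (and the reviewer) point at a specific chunk when answering.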

Production optimizations

A working RAG pipeline is just the start. Before deploying, apply the following optimizations.

Intelligent chunking

Naive chunking breaks tables, code blocks, and cross-references. Use semantic chunking that respects sentence boundaries (the NLTK or spaCy splitters). For code, split by function or class. Never cut in the middle of a sentence, especially one that refers to another section.

Hybrid search

Semantic similarity fails on exact terms like “order ID 12345”. Combine with BM25 full-text via EnsembleRetriever.

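from langchain.retrievers import EnsembleRetriever
# BM25Retriever lives in langchain_community and needs the rank_bm25 package.
from langchain_community.retrievers import BM25Retriever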

bm25 = BM25Retriever.from_documents(chunks, k=3)
semantic = vectorstore.as_retriever(search_kwargs={"k": 3})
ensemble = EnsembleRetriever(
    retrievers=[bm25, semantic],
    weights=[0.3, 0.7]
)
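
To see how two ranked lists can be merged with weights, here is a small sketch of weighted Reciprocal Rank Fusion, the approach EnsembleRetriever is documented to use. This is a standalone illustration rather than LangChain's code; the constant c=60, the function name weighted_rrf, and the document IDs are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher ranks contribute more; c dampens the gap between ranks.
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the query "order ID 12345".
bm25_hits = ["order-12345", "returns-policy", "shipping-faq"]
semantic_hits = ["warranty", "order-12345", "shipping-faq"]

fused = weighted_rrf([bm25_hits, semantic_hits], weights=[0.3, 0.7])
print(fused)
```

A document that ranks well in both lists beats one that tops only a single list, which is why the exact-match hit surfaces first here even though the semantic retriever preferred something else.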

Reranking

Not all retrieved chunks are equally relevant. A reranker (e.g., Cohere Rerank) reorders them by actual pertinence to the query. LangChain supports this by wrapping your retriever in a ContextualCompressionRetriever that uses a Cohere rerank compressor. We use it when k>5.
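
The shape of a reranking step can be sketched in a few lines of plain Python. The scoring function below is a deliberately crude keyword-overlap stand-in, not how Cohere Rerank or any cross-encoder scores text; rerank, keyword_overlap_score, and the sample chunks are hypothetical names and data for illustration.

```python
def keyword_overlap_score(query, chunk):
    """Toy relevance score: share of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def rerank(query, chunks, score_fn=keyword_overlap_score, top_n=3):
    """Reorder retrieved chunks by score_fn and keep the top_n best."""
    ordered = sorted(chunks, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ordered[:top_n]


candidates = [
    "Shipping usually takes three days",
    "Wear safety goggles before operating the saw",
    "The saw blade must be replaced yearly",
]
top = rerank("safety goggles saw", candidates, top_n=2)
print(top)
```

In a real pipeline you would swap score_fn for a call to a reranking model and retrieve more candidates than you finally pass to the LLM.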

Common pitfalls and how to avoid them

Over-stuffing: in our experience, passing more than 8 chunks degrades answers. Start with k=4 or 5.
Missing metadata: without the document name and page, you lose source attribution. LangChain loaders can include them; PyPDFLoader, for example, records source and page in each document's metadata.
Weak prompt: “Use context” is vague. Specify what the model should do with irrelevant or contradictory context.
Wrong embeddings: for domain-specific knowledge, consider fine-tuned embedding models (e.g., legal embeddings).
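
For the metadata pitfall, here is a minimal sketch of turning chunk metadata into a source list to display under an answer. It works on plain dictionaries shaped like the metadata PyPDFLoader attaches (source and a zero-based page, which is an assumption you should check against your loader); format_sources and the sample records are illustrative.

```python
def format_sources(docs_metadata):
    """Build a de-duplicated, human-readable source list, in retrieval order."""
    seen = set()
    lines = []
    for meta in docs_metadata:
        source = meta.get("source", "unknown source")
        page = meta.get("page")
        key = (source, page)
        if key in seen:
            continue  # two chunks from the same page count as one source
        seen.add(key)
        # Assumes zero-based page indices, so add 1 for display.
        label = f"{source}, page {page + 1}" if page is not None else source
        lines.append(label)
    return lines


retrieved = [
    {"source": "manual.pdf", "page": 11},
    {"source": "manual.pdf", "page": 11},
    {"source": "manual.pdf", "page": 41},
    {"source": "faq.md"},
]
sources = format_sources(retrieved)
print(sources)
```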

Real-world stack we use

For clients demanding privacy, we run Ollama with local models (Llama 3, Mistral) and a local Chroma instance. For high traffic, we use Qdrant in the cloud. We always add a feedback loop: users rate answers, and the logs go to a database that we use to improve retrieval. AI amplifies rather than replaces: every output is reviewed by a domain expert.
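
A feedback loop of this kind can be sketched with nothing more than the standard library's sqlite3 module. The table layout, the thumbs-up/thumbs-down rating scale (+1 / -1), and the function names are assumptions for illustration, not a description of our production schema.

```python
import sqlite3


def init_feedback_db(conn):
    """Create the feedback table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               question TEXT NOT NULL,
               answer TEXT NOT NULL,
               rating INTEGER NOT NULL CHECK (rating IN (-1, 1)),
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )


def log_feedback(conn, question, answer, rating):
    """Store one user rating (+1 helpful, -1 not helpful)."""
    conn.execute(
        "INSERT INTO feedback (question, answer, rating) VALUES (?, ?, ?)",
        (question, answer, rating),
    )
    conn.commit()


def worst_questions(conn, limit=5):
    """Return questions with a negative net rating, worst first."""
    return conn.execute(
        """SELECT question, SUM(rating) FROM feedback
           GROUP BY question
           HAVING SUM(rating) < 0
           ORDER BY SUM(rating) ASC
           LIMIT ?""",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")  # use a file path in a real deployment
init_feedback_db(conn)
log_feedback(conn, "How do I reset the device?", "Hold the button.", -1)
log_feedback(conn, "How do I reset the device?", "Hold the button.", -1)
log_feedback(conn, "What is the warranty?", "Two years.", 1)
worst = worst_questions(conn)
print(worst)
```

The questions that surface here are good candidates for inspecting which chunks were retrieved and why.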

What to do now

1. Prepare a test dataset. Grab 3–5 internal documents (PDF, Markdown, web pages).

2. Install dependencies: pip install langchain langchain-openai langchain-community chromadb pypdf (add rank_bm25 if you use BM25Retriever), then set the OPENAI_API_KEY environment variable.

3. Copy and adapt the code above to build your own RAG pipeline. Experiment with chunk size and k.

4. Measure quality. Ask 10 questions with known answers. Check correctness vs hallucinations. A good RAG should hit at least 80% accuracy on covered topics.

5. Deploy to production: a FastAPI backend with a React or Streamlit frontend. Show sources for every answer.
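
The quality check in step 4 can be sketched as a tiny evaluation harness. Keyword matching is a crude proxy for correctness and is an assumption of this example; evaluate, fake_answer, and the test questions are hypothetical, and in practice answer_fn would wrap a call to your RAG chain.

```python
def evaluate(answer_fn, test_cases):
    """Return (accuracy, failed questions) using expected-keyword matching."""
    passed = 0
    failures = []
    for question, keywords in test_cases:
        answer = answer_fn(question).lower()
        if all(keyword.lower() in answer for keyword in keywords):
            passed += 1
        else:
            failures.append(question)
    accuracy = passed / len(test_cases) if test_cases else 0.0
    return accuracy, failures


def fake_answer(question):
    """Stand-in for the real pipeline so the sketch runs offline."""
    canned = {
        "What voltage does the device use?": "The device runs on 230 V.",
        "How often should the filter be replaced?": "Replace the filter every 6 months.",
    }
    return canned.get(question, "I don't know.")


test_cases = [
    ("What voltage does the device use?", ["230 V"]),
    ("How often should the filter be replaced?", ["6 months"]),
    ("What is the warranty period?", ["2 years"]),
]
accuracy, failures = evaluate(fake_answer, test_cases)
print(f"accuracy: {accuracy:.0%}, failed: {failures}")
```

Reviewing the failed questions by hand tells you whether the problem was retrieval (wrong chunks) or generation (right chunks, wrong answer).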
