This project provides a minimal Retrieval-Augmented Generation (RAG) search system using:
- LangChain for document loading, splitting, embeddings, and retrieval
- Chroma as a local vector database for similarity search
- LangGraph-inspired orchestration (simple 2-node pipeline: retrieve -> synthesize)
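For orientation, the retrieve -> synthesize flow amounts to two small steps; the sketch below is illustrative only (the function names are hypothetical, and the real pipeline lives in src/rag_system/graph.py):

```python
# Illustrative sketch of the two-node pipeline (retrieve -> synthesize).
# Function names are hypothetical; the real graph is defined in src/rag_system/graph.py.
from langchain_core.documents import Document

def retrieve(question: str, retriever, k: int = 4) -> list[Document]:
    # Node 1: similarity search against the vector store (Chroma behind a LangChain retriever).
    return retriever.invoke(question)[:k]

def synthesize(question: str, docs: list[Document], llm) -> str:
    # Node 2: ask the LLM to answer strictly from the retrieved context.
    context = "\n\n".join(d.page_content for d in docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return llm.invoke(prompt).content

def answer(question: str, retriever, llm) -> str:
    return synthesize(question, retrieve(question, retriever), llm)
```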
Features
- Local-first: uses HuggingFace sentence-transformers by default; no API key required
- Local LLM via Ollama by default (pull a model like llama3.1:8b). If Ollama is not available, it can fall back to OpenAI (if OPENAI_API_KEY is set) or to an extractive response.
- Persistent vector store using Chroma
- Simple CLI for ingestion and queries
Prerequisites
- Python 3.10+
Setup
1. Create and activate a virtual environment (optional):
   - `python -m venv .venv`
   - `source .venv/bin/activate` (Windows: `.venv\Scripts\activate`)
2. Install dependencies:
   - `pip install -r requirements.txt`
3. Add some .txt files into data/ (some sample content is already in data/sample/).
Environment variables (optional)
- EMBED_MODEL: HuggingFace embeddings model (default: sentence-transformers/all-MiniLM-L6-v2)
- LLM_PROVIDER: choose 'ollama' (default) or 'openai'
- OLLAMA_MODEL: Ollama model to use (default: llama3.1:8b)
- OLLAMA_BASE_URL (or OLLAMA_HOST): e.g., http://localhost:11434
- OPENAI_API_KEY: if set and provider=openai, the system will use OpenAI Chat Completions
- OPENAI_MODEL: OpenAI model to use (default: gpt-4o-mini)
- CHROMA_URL: if set, the app connects to a Chroma Server (e.g., http://localhost:8000) instead of local .chroma
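For reference, here is a hedged sketch of how CHROMA_URL and EMBED_MODEL can drive the choice between a local embedded store and a Chroma Server; it uses the standard chromadb and langchain-chroma APIs, and the project's ingest.py may wire this up differently:

```python
# Sketch: pick a local embedded Chroma store or a Chroma Server based on CHROMA_URL.
# Illustrative only; the project's own code may use different packages or defaults.
import os
import chromadb
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name=os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
)

chroma_url = os.getenv("CHROMA_URL")  # e.g. http://localhost:8000
if chroma_url:
    host, port = chroma_url.split("//")[-1].rsplit(":", 1)
    client = chromadb.HttpClient(host=host, port=int(port))
    store = Chroma(client=client, collection_name="corpus", embedding_function=embeddings)
else:
    # Local embedded mode, persisted under .chroma
    store = Chroma(collection_name="corpus", embedding_function=embeddings,
                   persist_directory=".chroma")
```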
Usage

0. Extract OCR text from XML into .txt (if your data is XML with OCR); a Python sketch of this step appears after this list:
   - `python -m src.rag_system.cli extract_ocr --input data_sample --output data --glob "**/*.xml"`
   - `python -m src.rag_system.cli extract_ocr --input data_sample --output data --xpath ".//OCR"`
1. Ingest documents into Chroma (local embedded):
   - `python -m src.rag_system.cli ingest --source data`

   Or, with Chroma Server in Docker (recommended for shared access):
   - `docker compose up -d chroma`
   - `export CHROMA_URL=http://localhost:8000`
   - `python -m src.rag_system.cli ingest --source data --chroma_url $CHROMA_URL --collection corpus`
2. Run a query (Ollama by default). Make sure Ollama is installed and running: https://ollama.com/
   - Local Chroma: `python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b "What is this repository about?"`
   - Chroma Server: `python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --chroma_url $CHROMA_URL "What is this repository about?"`
   - OpenAI instead of Ollama: `export OPENAI_API_KEY=sk-...`, then `python -m src.rag_system.cli query --provider openai --model gpt-4o-mini --chroma_url $CHROMA_URL "What is this repository about?"`
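The OCR-extraction step referenced in item 0 above can be pictured as a small XML-to-text pass. The sketch below assumes the OCR text sits in elements matched by an ElementTree-style XPath such as .//OCR; the actual extract_ocr command may behave differently:

```python
# Sketch of OCR extraction from XML files into plain .txt files.
# Assumes OCR text lives in elements matched by an ElementTree-style XPath (e.g. ".//OCR");
# the real extract_ocr command may differ.
from pathlib import Path
import xml.etree.ElementTree as ET

def extract_ocr(input_dir: str, output_dir: str, xpath: str = ".//OCR",
                glob: str = "**/*.xml") -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for xml_path in Path(input_dir).glob(glob):
        root = ET.parse(xml_path).getroot()
        texts = [el.text.strip() for el in root.findall(xpath) if el.text and el.text.strip()]
        if texts:
            (out / f"{xml_path.stem}.txt").write_text("\n\n".join(texts), encoding="utf-8")

extract_ocr("data_sample", "data")
```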
Advanced options
- Choose a different embedding model: `python -m src.rag_system.cli ingest --embed_model sentence-transformers/all-mpnet-base-v2`
- Configure top-k and the model for queries:
  - `python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --k 8 "Explain the stack used here"`
  - `python -m src.rag_system.cli query --provider openai --model gpt-4o-mini --k 8 "Explain the stack used here"`
Project structure
- src/rag_system/ingest.py -> Ingestion pipeline (load, split, embed, index)
- src/rag_system/graph.py -> Retrieval + synthesis pipeline
- src/rag_system/cli.py -> Command-line interface
- docker-compose.yml -> Chroma Server (Docker) for remote vector DB
- data/ -> Put your .txt files here
Notes
- If you do not set OPENAI_API_KEY (and Ollama is unavailable), answers are generated by a simple extractive fallback (a concatenation of the top retrieved documents) to keep everything offline; see the sketch below.
- If you set OPENAI_API_KEY, the system uses OpenAI's Chat model configured via OPENAI_MODEL.
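The extractive fallback mentioned above can be as simple as returning the top retrieved chunks verbatim; a minimal sketch (the real fallback in src/rag_system/graph.py may format its output differently):

```python
# Minimal sketch of an extractive "answer": no LLM, just the top retrieved chunks.
# Illustrative only; the project's actual fallback may differ.
from langchain_core.documents import Document

def extractive_answer(docs: list[Document], max_chars: int = 2000) -> str:
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[{source}] {doc.page_content.strip()}")
    return "\n\n".join(parts)[:max_chars]
```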
If you see an error like `RuntimeError: Numpy is not available` when running ingest or query, install NumPy explicitly before the other packages and make sure pip/setuptools are recent:

- Upgrade build tooling: `python -m pip install --upgrade pip setuptools wheel`
- Install NumPy first (compatible range): `python -m pip install "numpy>=1.26,<2.1"`
- Install the project requirements: `python -m pip install -r requirements.txt`
Notes:
- On Apple Silicon (M1/M2/M3), use Python 3.10+ from python.org or pyenv and a recent pip (>=23).
- If you still hit issues, try recreating the venv and installing NumPy first, then requirements.
We use LangChain’s RecursiveCharacterTextSplitter during ingestion to break large documents into smaller, partially-overlapping chunks before embedding and indexing in Chroma.
Where it’s used here
- File: src/rag_system/ingest.py
- Code: `splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap, add_start_index=True)`
- You control chunk_size and chunk_overlap via CLI: --chunk_size and --chunk_overlap
What it does (high level)
- It tries to split text using a prioritized list of separators to preserve natural boundaries: by default ["\n\n", "\n", " ", ""].
- It picks the first separator in that list that actually appears in your text. If a resulting segment is still too long (its length, as measured by the length function, exceeds chunk_size), it recursively retries that segment with the next, “finer” separator.
- It merges segments back into chunks no larger than chunk_size, with chunk_overlap characters of overlap between consecutive chunks to improve retrieval recall.
Key parameters you’ll care about
- chunk_size: Target maximum size (in characters by default) of each chunk.
- chunk_overlap: Number of characters to overlap between adjacent chunks, preserving context across boundaries.
- separators: Optional custom list of separators to try in order (e.g., section headings, paragraphs, sentences, words, characters). Defaults to ["\n\n", "\n", " ", ""].
- keep_separator: Whether to keep the separator in chunks; can be True, False, "start", or "end". Defaults to True.
- is_separator_regex: Treat separators as regex patterns (advanced). Defaults to False.
- add_start_index: When True, each output Document receives metadata["start_index"] with its starting character offset relative to the original text. We set this to True so you can trace chunks back to the source.
- length_function: Function used to measure length (defaults to Python len on characters). You can customize (e.g., token counting) if you subclass or construct differently.
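To make these parameters concrete, here is a small, self-contained example using LangChain's splitter on a made-up text; the numbers are deliberately tiny so the effect is visible:

```python
# Demonstrates chunk_size, chunk_overlap, and add_start_index on a tiny text.
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Paragraph one about the ingestion pipeline.\n\n"
    "Paragraph two describes retrieval and synthesis in more detail, "
    "and is long enough that it must be split across several chunks."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=80,         # maximum characters per chunk
    chunk_overlap=20,      # characters shared between consecutive chunks
    add_start_index=True,  # record each chunk's offset in metadata["start_index"]
)
for doc in splitter.create_documents([text]):
    print(doc.metadata["start_index"], repr(doc.page_content))
```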
How the algorithm works (step-by-step)
- Choose a separator: scan the separators list in order and use the first one that occurs in the text. If none match, use the last (often the empty string), which falls back to character-level splitting.
- Split the text by that separator.
- For each resulting piece:
  - If the piece is shorter than chunk_size, add it to a temporary list of “good” splits.
  - If the piece is chunk_size or longer:
    - First, merge the current “good” splits into final chunks (respecting chunk_size and chunk_overlap).
    - Then recursively apply the same procedure to the long piece, using the remaining, finer separators.
- After processing all pieces, merge any remaining “good” splits into the final chunk list.
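The simplified sketch below mirrors that recursion; it deliberately skips keep_separator, regex handling, and the overlap-aware merge step that LangChain's real implementation performs:

```python
# Simplified recursive splitting: pick the coarsest separator present, split,
# and recurse with finer separators on any piece that is still too long.
# (No overlap merging or keep_separator handling; illustration only.)
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    separator, rest = separators[-1], []
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:          # "" always matches (character-level fallback)
            separator, rest = sep, separators[i + 1:]
            break

    pieces = list(text) if separator == "" else text.split(separator)

    chunks: list[str] = []
    for piece in pieces:
        if len(piece) < chunk_size:
            chunks.append(piece)              # small enough to keep as-is
        elif rest:
            chunks.extend(recursive_split(piece, rest, chunk_size))  # go finer
        else:
            chunks.append(piece)              # nothing finer left
    return chunks

print(recursive_split("alpha beta gamma\n\ndelta epsilon", ["\n\n", "\n", " ", ""], 12))
```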
Merging and overlap
- The splitter uses a sliding window over the accumulated splits to emit chunks whose size does not exceed chunk_size.
- It ensures consecutive chunks overlap by chunk_overlap characters, which helps retrieval models maintain context when a relevant sentence lies near a boundary.
Why it’s good for RAG
- Preserves semantic boundaries when possible (paragraphs, then lines, then words) while guaranteeing chunks are not too large for embedding/token limits.
- Overlap improves recall and robustness to query variations.
Practical tuning advice
- Start with chunk_size=800 and chunk_overlap=120 (our CLI defaults). Increase chunk_size for long-form technical docs; decrease for short notes.
- If your documents have strong structure (e.g., Markdown, headings), consider providing custom separators, e.g.:
- separators=["\n\n# ", "\n\n", "\n", " ", ""] with is_separator_regex=False
- If you have very long tokens/words (e.g., base64 or code blobs) and chunks exceed the limit, the recursion eventually falls back to character-level splitting via the final "" separator.
- keep_separator="end" can help keep phrase punctuation near the chunk end; "start" can help the next chunk’s beginning be self-contained.
- For token-aware sizing (e.g., tiktoken), consider a custom length_function or LangChain’s token-based splitters for more precise control.
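For token-aware sizing specifically, LangChain exposes a tiktoken-based constructor, so chunk_size and chunk_overlap are counted in tokens rather than characters (requires the tiktoken package):

```python
# Token-aware splitting: sizes are measured in tiktoken tokens, not characters.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # encoding used by many recent OpenAI models
    chunk_size=400,
    chunk_overlap=60,
)
chunks = splitter.split_text("Your long document text goes here...")
```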
Examples with this project
- Default ingestion: `python -m src.rag_system.cli ingest --source data --chunk_size 800 --chunk_overlap 120`
- Larger chunks for long technical reports: `python -m src.rag_system.cli ingest --source data --chunk_size 1200 --chunk_overlap 150`
- Smaller, tighter chunks for noisy OCR text: `python -m src.rag_system.cli ingest --source data --chunk_size 500 --chunk_overlap 100`
Customizing separators (code snippet)
If you want custom separators, edit src/rag_system/ingest.py and pass a separators list when constructing the splitter, for example:

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        add_start_index=True,
        separators=["\n\n## ", "\n\n", "\n", " ", ""],  # try headings, then paragraphs, lines, words, chars
        keep_separator=True,
    )
Edge cases
- Documents with no newlines: the splitter quickly falls back to splitting on spaces or characters.
- Regex separators: set is_separator_regex=True and pass patterns (make sure they actually match the text; otherwise the splitter simply moves on to finer separators).
- Extremely long single “words”: recursion will end up splitting at character level to respect chunk_size.
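A quick way to see the character-level fallback from the last edge case is to split a single long "word" with no separators at all:

```python
# One long "word" (e.g. base64) with no whitespace: the splitter falls back to
# character-level splitting and still respects chunk_size and chunk_overlap.
from langchain_text_splitters import RecursiveCharacterTextSplitter

blob = "A" * 50
splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=5)
print([len(c) for c in splitter.split_text(blob)])  # every chunk is at most 20 characters
```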
You can enable runtime logs for the ingestion process to monitor progress and performance.
- Set the log level via the CLI: `python -m src.rag_system.cli ingest --source data --log_level DEBUG`
- Or via an environment variable (INFO is the default if not provided): `export LOG_LEVEL=DEBUG`, then `python -m src.rag_system.cli ingest --source data`
The logs include:
- Start parameters (source, glob, chunking, collection, target Chroma location)
- Number of documents loaded
- Chunking stats (number of chunks and average length)
- Embedding model used
- Where data is being written (local Chroma directory or Chroma Server)
- Total ingestion time
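For orientation, the logging described above can be wired up with the standard logging module; the snippet below is illustrative, and the project's actual setup in cli.py/ingest.py may differ:

```python
# Illustrative logging setup honoring LOG_LEVEL / --log_level.
import logging
import os
import time

def setup_logging(cli_level: str | None = None) -> logging.Logger:
    level = (cli_level or os.getenv("LOG_LEVEL", "INFO")).upper()
    logging.basicConfig(level=level, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
    return logging.getLogger("rag_system.ingest")

logger = setup_logging()
start = time.perf_counter()
logger.info("Ingestion started: source=%s collection=%s", "data", "corpus")
# ... load, split, embed, index ...
logger.info("Ingestion finished in %.1fs", time.perf_counter() - start)
```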
You can run the Ollama server via Docker using the provided docker-compose.yml.
- Start Ollama (and optionally Chroma) in the background: `docker compose up -d ollama` or `docker compose up -d ollama chroma`
- Point the app to the Dockerized Ollama server: `export OLLAMA_BASE_URL=http://localhost:11434`
- Pull a model inside the container (one-time): `docker exec -it ollama ollama pull llama3.1:8b`
Then you can query with the CLI (provider=ollama), either against local Chroma or Chroma Server:
- Local Chroma: `python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b "What is this repository about?"`
- Chroma Server: `export CHROMA_URL=http://localhost:8000`, then `python -m src.rag_system.cli query --provider ollama --ollama_model llama3.1:8b --chroma_url $CHROMA_URL "What is this repository about?"`
You have three convenient options to download an Ollama model:
- If Ollama is installed on your host (macOS/Linux/WSL):
  - Ensure the Ollama daemon is running: `ollama serve` (usually started automatically)
  - Pull a model by name: `ollama pull llama3.1:8b`
- If you run Ollama via Docker Compose (this repo’s docker-compose.yml):
  - Start the service: `docker compose up -d ollama`
  - Pull the model inside the container: `docker exec -it ollama ollama pull llama3.1:8b`
  - Point the app to the Dockerized server: `export OLLAMA_BASE_URL=http://localhost:11434`
- Using this project’s CLI (talks to the Ollama HTTP API):
  - Host or Docker both work as long as the server is reachable.
  - Example (defaults to http://localhost:11434 if OLLAMA_BASE_URL is not set): `python -m src.rag_system.cli ollama_pull --model llama3.1:8b --base_url $OLLAMA_BASE_URL`
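Under the hood, pulling a model is a single HTTP call to the Ollama server's /api/pull endpoint; the sketch below shows what an ollama_pull-style command can do with requests (the project's actual implementation may differ):

```python
# Sketch: pull a model through the Ollama HTTP API (/api/pull) and stream progress.
# Illustrative only; the CLI's ollama_pull command may be implemented differently.
import json
import os
import requests

def ollama_pull(model: str, base_url: str | None = None) -> None:
    base_url = base_url or os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    with requests.post(f"{base_url}/api/pull", json={"model": model}, stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                status = json.loads(line)
                print(status.get("status", ""), status.get("completed", ""), status.get("total", ""))

ollama_pull("llama3.1:8b")
```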
Notes
- Model names are listed on https://ollama.com/library (e.g., llama3.1, mistral, codellama). Tags like :8b/:70b choose parameter sizes.
- The first pull downloads the weights; subsequent pulls are fast.
- If the server is remote, set OLLAMA_BASE_URL to that host, e.g., http://your-server:11434.