This project demonstrates how to build a simple RAG (Retrieval-Augmented Generation) system using a small language model (TinyLlama-1.1B-Chat-v1.0) and a FAISS vector database to answer questions about events that occurred after the model's knowledge cutoff date.
At just 1.1B parameters, TinyLlama is roughly 64x smaller than the 70B variants of Llama 2 and Llama 3, while still maintaining impressive performance on many tasks. This makes it ideal for resource-constrained environments and for rapid prototyping of RAG systems.
Open in Google Colab here
The project showcases how to:
- Use a compact 1.1B parameter model (TinyLlama) for RAG applications
- Set up a FAISS vector database for document retrieval
- Extract and chunk text from PDF documents
- Generate embeddings using sentence transformers
- Retrieve relevant context and generate accurate answers
- Language Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0 (4-bit quantized, ~700MB)
- Vector Database: FAISS (Facebook AI Similarity Search)
- Embeddings: all-MiniLM-L6-v2 sentence transformer
- Document Processing: PyMuPDF for PDF text extraction
- Google Colab (recommended) or local Python environment
- Python 3.8+
- GPU support (optional but recommended for faster inference; see the quick check below)
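If you're not sure whether a GPU is visible, a quick check (assuming PyTorch is already installed) looks like this:

```python
import torch

# Quick check: is a CUDA GPU visible to PyTorch?
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```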
For Google Colab:
- Create a `data` folder in your Google Colab workspace
- Upload the `2024–25_NFL_playoffs_pg_1.pdf` PDF file using the file upload interface
- Ensure the PDF is accessible at `./data/2024–25_NFL_playoffs_pg_1.pdf`
For Local Development:
- Create a `data` folder in your project directory if it doesn't exist
- Place the `2024–25_NFL_playoffs_pg_1.pdf` PDF file in the `./data/` directory
- Update the notebook to read the PDF from the appropriate location
In the notebook, update the PDF path to match your file name:
```python
import fitz  # PyMuPDF

# For local development (relative path)
pdf_text = "\n".join(block[4] for block in fitz.open("./data/your_document.pdf").load_page(0).get_text("blocks"))

# For Google Colab (absolute path)
pdf_text = "\n".join(block[4] for block in fitz.open("/content/data/your_document.pdf").load_page(0).get_text("blocks"))
```

Important: Replace `your_document.pdf` with the actual name of your PDF file.
The notebook will automatically install the required packages:

```bash
pip install transformers sentence_transformers faiss-cpu bitsandbytes PyMuPDF
```

- Open `tiny-rag.ipynb` in Google Colab or your local Jupyter environment
- Run all cells sequentially
- The system will:
  - Load the TinyLlama model
  - Extract text from your PDF in the `./data/` directory
  - Create embeddings and store them in FAISS
  - Demonstrate RAG functionality with example queries
```
tiny-rag/
├── README.md
├── tiny-rag.ipynb                      # Main notebook with RAG implementation
├── assets/
│   └── tiny-rag.png                    # Architecture diagram
└── data/
    ├── 2024–25_NFL_playoffs_pg_1.pdf   # Example PDF document
    └── your_document.pdf               # Add your PDFs here
```
Model loading:
- Uses 4-bit quantization for memory efficiency (see the loading sketch below)
- Loads the TinyLlama-1.1B-Chat-v1.0 model (~700MB on disk, ~1GB RAM)
- Configured for conversational chat format
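As a rough illustration, the loading step can be written with the standard transformers/bitsandbytes APIs; this is a minimal sketch, and the exact arguments in the notebook may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# 4-bit NF4 quantization with bfloat16 compute, matching the configuration listed below.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",  # places layers on the GPU if one is available
)
```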
Document processing:
- Extracts text from the PDF using PyMuPDF
- Chunks the text into smaller segments (150 words per chunk)
- Creates embeddings for each chunk using sentence transformers (sketch below)
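A minimal sketch of this step, assuming `pdf_text` holds the text extracted by the snippet above (variable names here are illustrative):

```python
from sentence_transformers import SentenceTransformer

CHUNK_SIZE = 150  # words per chunk, no overlap

# Split the extracted PDF text into fixed-size word chunks.
words = pdf_text.split()
chunks = [" ".join(words[i:i + CHUNK_SIZE]) for i in range(0, len(words), CHUNK_SIZE)]

# Encode each chunk into a dense vector with all-MiniLM-L6-v2 (384-dimensional embeddings).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_embeddings = embedder.encode(chunks)
```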
Vector database:
- Uses FAISS IndexFlatL2 for L2-distance similarity search
- Stores document embeddings for fast retrieval
- Returns the top-k most similar chunks as context (illustrated below)
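Index construction and retrieval might look like the following sketch, continuing from the `chunks`, `chunk_embeddings`, and `embedder` of the previous example (the query string is just the example question used later):

```python
import faiss
import numpy as np

# Exact L2-distance index over the chunk embeddings from the previous sketch.
dim = chunk_embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(np.asarray(chunk_embeddings, dtype="float32"))

# Embed the question and pull back the 3 closest chunks as context.
TOP_K = 3
query = "Who won the 2024 NFL Championship?"
query_vec = np.asarray(embedder.encode([query]), dtype="float32")
distances, ids = index.search(query_vec, TOP_K)
context = "\n".join(chunks[i] for i in ids[0])
```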
RAG pipeline:
- Query: the user asks a question
- Retrieval: the system finds the most relevant document chunks
- Generation: the model answers using the retrieved context (see the generation sketch below)
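Putting it together, here is a sketch of the generation step, reusing `tokenizer` and `model` from the loading sketch and `context`/`query` from the retrieval sketch (the prompt wording is illustrative, not the notebook's exact prompt):

```python
# Ground the question in the retrieved context via the tokenizer's chat template.
messages = [{
    "role": "user",
    "content": f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}",
}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300)

# Decode only the newly generated tokens (everything after the prompt).
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```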
The notebook demonstrates the RAG system with an NFL championship question:
Without RAG (Model hallucination):
Q: Who won the 2024 NFL Championship?
A: The Los Angeles Rams (incorrect)
With RAG (Accurate answer):
Q: Who won the 2024 NFL Championship?
A: The Philadelphia Eagles defeated the Kansas City Chiefs 40-22 in Super Bowl LIX
Model settings:
- Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Quantization: 4-bit (nf4)
- Compute dtype: bfloat16
- Max tokens: 300 (configurable)

Retrieval settings:
- Chunk size: 150 words (configurable)
- Overlap: none (can be added for better context)
- Top-k: 3 chunks (configurable)
- Similarity metric: L2 distance

Performance:
- Model size: ~700MB (4-bit quantized)
- Memory usage: ~1GB RAM
- Inference speed: fast on GPU, moderate on CPU
- Accuracy: good for factual questions with proper context

Tips:
- Use GPU acceleration when available
- Adjust the chunk size based on your document's characteristics
- Consider using multiple smaller documents instead of one large document
This project is open source and available under the MIT License.
Note: This project is for educational purposes and demonstrates the basic concepts of RAG systems. For production use, consider more robust solutions with better error handling, security, and scalability features.
