A high-performance Retrieval-Augmented Generation (RAG) system that leverages speculative decoding techniques to minimize latency while maintaining answer quality. This project uses fast language models for initial draft generation and more capable models for verification, all while providing comprehensive observability through Langfuse.
- Speculative RAG Architecture: Combines fast draft generation with quality verification to optimize response times
- Multi-Model Approach: Uses Groq's Llama models for both drafting (fast) and verification (accurate)
- Local Embeddings: HuggingFace embeddings with ChromaDB for efficient vector storage
- Observability: Full tracing and monitoring with Langfuse
- PDF Processing: Automated ingestion and chunking of PDF documents
- Latency Optimization: Designed to reduce response times compared to traditional single-model RAG systems
```
PDF Document → Text Splitting → Embedding (HuggingFace) → ChromaDB Vector Store
                                                                   ↓
User Query → Retrieval → Draft Generation (Groq Llama 3.1 8B) → Verification (Groq Llama 3.3 70B) → Final Answer
                                                                   ↓
                                                           Langfuse Tracing
```
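To make the draft-and-verify flow concrete, here is a minimal sketch of the pipeline in Python. It is illustrative only: `retriever`, `draft_llm`, and `verifier_llm` are hypothetical stand-ins for the components in the diagram, and the prompts are not the project's actual prompts.

```python
# Illustrative sketch of the speculative RAG flow; not the project's actual code.
def answer(question: str, retriever, draft_llm, verifier_llm) -> str:
    # 1. Retrieval: fetch relevant chunks from the ChromaDB vector store.
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))

    # 2. Draft phase: the fast 8B model produces a quick initial answer.
    draft = draft_llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    ).content

    # 3. Verification phase: the 70B model checks and refines the draft.
    return verifier_llm.invoke(
        f"Context:\n{context}\n\nQuestion: {question}\n\nDraft answer: {draft}\n\n"
        "Verify the draft against the context, fix any errors, and return the final answer."
    ).content
```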
- Python 3.9+ (recommended: 3.11 or 3.12, avoid 3.14)
- API Keys:
  - Groq API Key (for both the drafting and verification models)
  - Langfuse API Keys (optional, for tracing; works with Langfuse Cloud or a self-hosted instance)
- Clone and Set Up the Environment:

```bash
git clone <repository-url>
cd rag-latency-optimization
python -m venv .venv
# On Windows: .venv\Scripts\activate
# On Unix/Mac: source .venv/bin/activate
pip install -r requirements.txt
```
- Environment Configuration:

```bash
cp .env.example .env
```

Fill in your `.env` file:

```
GROQ_API_KEY=your_groq_api_key_here
LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
LANGFUSE_SECRET_KEY=your_langfuse_secret_key
LANGFUSE_HOST=https://cloud.langfuse.com
```
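To sanity-check that the keys are being picked up, here is a small snippet using python-dotenv (assuming the app reads its configuration from environment variables, as the `.env` file suggests):

```python
# Quick check that .env is readable; requires the python-dotenv package.
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory
assert os.getenv("GROQ_API_KEY"), "GROQ_API_KEY is missing from .env"
print("Langfuse host:", os.getenv("LANGFUSE_HOST", "not set (tracing disabled)"))
```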
- Run the Application:

Option A: CLI Mode

```bash
python -m app.main
# Enter the PDF path when prompted
```

Option B: API Server

```bash
uvicorn app.api:app --reload
```

The API will be available at http://localhost:8000.
- Swagger UI: http://localhost:8000/docs

To use the CLI, start it with:

```bash
python -m app.main
```

Once loaded, ask questions about your document:
```
User: What are the main findings in this paper?

--- Draft ---
[Fast initial response]

--- Final ---
[Refined, verified answer]
```
- Start the server:

```bash
uvicorn app.api:app
```

- Ingest a document:

```bash
curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{"pdf_path": "path/to/doc.pdf"}'
```

- Ask a question:

```bash
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is this about?"}'
```
- Document Ingestion: PDFs are loaded, split into chunks, and embedded with a local HuggingFace model; the resulting vectors are stored in ChromaDB
- Speculative Generation:
- Draft Phase: Fast Groq Llama 3.1 8B model generates initial answers based on retrieved context
- Verification Phase: More capable Groq Llama 3.3 70B model reviews and refines the draft for accuracy
- Tracing: All interactions are logged to Langfuse for performance monitoring and debugging
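As a rough illustration of the tracing step, Langfuse ships a LangChain callback handler that can be attached to any model or chain call. The import path below is the v2-style SDK and may differ in other langfuse versions; treat it as a sketch rather than the project's exact wiring:

```python
# Langfuse tracing sketch; import path assumes the v2-style SDK.
from langfuse.callback import CallbackHandler

handler = CallbackHandler()  # picks up LANGFUSE_* keys from the environment

# Attach to any LangChain call via its config, e.g.:
# draft_llm.invoke(prompt, config={"callbacks": [handler]})
```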
- Reduced Latency: Speculative approach provides faster initial responses
- Quality Assurance: Two-stage verification ensures accuracy
- Cost Efficiency: Balances speed and quality across different model capabilities
- Observability: Comprehensive tracing helps identify bottlenecks
The pipeline uses two Groq models:
- Drafter: `llama-3.1-8b-instant` (fast, cost-effective)
- Verifier: `llama-3.3-70b-versatile` (accurate, thorough)
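In langchain-groq terms, wiring up the two models might look like the sketch below. The model IDs come from this section; the `temperature` setting is an illustrative assumption, not taken from the project:

```python
from langchain_groq import ChatGroq

# Fast, cost-effective drafter.
drafter = ChatGroq(model="llama-3.1-8b-instant", temperature=0)

# Accurate, thorough verifier.
verifier = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)
```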
- Chunk Size: 1000 characters
- Overlap: 200 characters
- Model: `all-MiniLM-L6-v2` (local, no API costs)
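Putting the document-processing settings together, ingestion with the stated parameters might look like this. The class names come from the LangChain packages listed under the key libraries below, but exact import paths shift between versions, so treat them as assumptions:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# Load and chunk the PDF with the configured sizes.
docs = PyPDFLoader("path/to/doc.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200
).split_documents(docs)

# Embed locally and persist in ChromaDB.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = Chroma.from_documents(chunks, embeddings).as_retriever()
```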
- Python 3.14 Errors: Use Python 3.11 or 3.12. Version 3.14 lacks binary wheels for some dependencies.
- Rate Limits: Groq has rate limits; consider upgrading your plan for production use.
- Memory Issues: Large PDFs may require more RAM; consider chunk size adjustments.
If you encounter dependency issues:
```bash
# Recreate the virtual environment
rm -rf .venv
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Key libraries:

- `langchain`: RAG pipeline framework
- `langchain-groq`: Groq model integration
- `langchain-huggingface`: local embeddings
- `chromadb`: vector database
- `langfuse`: observability and tracing
- `pypdf`: PDF processing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
[Add your license here]
- Groq for providing fast inference APIs
- Langfuse for observability tools
- LangChain for the RAG framework