Speculative RAG Pipeline for Latency Optimization

A high-performance Retrieval-Augmented Generation (RAG) system that uses a speculative draft-and-verify approach to minimize latency while maintaining answer quality. A fast language model generates the initial draft, a more capable model verifies and refines it, and comprehensive observability is provided through Langfuse.

Key Features

  • Speculative RAG Architecture: Combines fast draft generation with quality verification to optimize response times
  • Multi-Model Approach: Uses Groq's Llama models for both drafting (fast) and verification (accurate)
  • Local Embeddings: HuggingFace embeddings with ChromaDB for efficient vector storage
  • Observability: Full tracing and monitoring with Langfuse
  • PDF Processing: Automated ingestion and chunking of PDF documents
  • Latency Optimization: Designed to reduce response times compared to traditional single-model RAG systems

Architecture

Ingestion:   PDF Document → Text Splitting → Embedding (HuggingFace) → ChromaDB Vector Store
Query path:  User Query → Retrieval (ChromaDB) → Draft Generation (Groq Llama 3.1 8B) → Verification (Groq Llama 3.3 70B) → Final Answer
Tracing:     All steps are traced in Langfuse

Prerequisites

  • Python 3.9+ (recommended: 3.11 or 3.12, avoid 3.14)
  • API Keys:
    • Groq API Key (for both drafting and verification models)
    • Langfuse API keys (optional, for tracing; works with Langfuse Cloud or a self-hosted instance)

Quick Start

  1. Clone and Setup Environment:

    git clone <repository-url>
    cd rag-latency-optimization
    python -m venv .venv
    # On Windows:
    .venv\Scripts\activate
    # On Unix/Mac:
    source .venv/bin/activate
    pip install -r requirements.txt
  2. Environment Configuration:

    cp .env.example .env

    Fill in your .env file (see the optional sanity check after this list):

    GROQ_API_KEY=your_groq_api_key_here
    LANGFUSE_PUBLIC_KEY=your_langfuse_public_key
    LANGFUSE_SECRET_KEY=your_langfuse_secret_key
    LANGFUSE_HOST=https://cloud.langfuse.com
    
  3. Run the Application:

    Option A: CLI Mode

    python -m app.main
    # Enter PDF path when prompted

    Option B: API Server

    uvicorn app.api:app --reload

    The API will be available at http://localhost:8000.

    • Swagger UI: http://localhost:8000/docs
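
Before moving on, you can optionally confirm that the keys from the .env created in step 2 are visible to Python. A minimal check, assuming python-dotenv is installed (it may or may not be pinned in requirements.txt):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("GROQ_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST"):
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")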

Usage

CLI Mode

python -m app.main

Once loaded, ask questions about your document:

User: What are the main findings in this paper?
--- Draft ---
[Fast initial response]

--- Final ---
[Refined, verified answer]

API Mode

  1. Start the server: uvicorn app.api:app
  2. Ingest a document:
    curl -X POST "http://localhost:8000/ingest" -H "Content-Type: application/json" -d '{"pdf_path": "path/to/doc.pdf"}'
  3. Ask a question:
    curl -X POST "http://localhost:8000/ask" -H "Content-Type: application/json" -d '{"question": "What is this about?"}'
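
The same two calls can be made from Python with the requests library (install it separately if needed). The endpoint paths and JSON fields are taken from the curl examples above; the response schema is not documented here, so the raw JSON is simply printed:

import requests

BASE_URL = "http://localhost:8000"

# Ingest a PDF (the path is resolved where the server is running)
print(requests.post(f"{BASE_URL}/ingest", json={"pdf_path": "path/to/doc.pdf"}).json())

# Ask a question about the ingested document
print(requests.post(f"{BASE_URL}/ask", json={"question": "What is this about?"}).json())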

How It Works

  1. Document Ingestion: PDFs are loaded, split into chunks, and embedded with a local HuggingFace model; the resulting vectors are stored in ChromaDB
  2. Speculative Generation (see the sketch after this list):
    • Draft Phase: Fast Groq Llama 3.1 8B model generates initial answers based on retrieved context
    • Verification Phase: More capable Groq Llama 3.3 70B model reviews and refines the draft for accuracy
  3. Tracing: All interactions are logged to Langfuse for performance monitoring and debugging
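
A minimal sketch of the draft-and-verify step, assuming retriever is a LangChain retriever over the ChromaDB store from step 1. The model names match the Configuration section below, but the prompts and function layout are illustrative, not this repo's actual code:

from langchain_groq import ChatGroq

drafter = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
verifier = ChatGroq(model="llama-3.3-70b-versatile", temperature=0)

def speculative_answer(retriever, question: str) -> dict:
    # Retrieve context once and share it between both phases
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))

    # Draft phase: the fast 8B model answers from the retrieved context
    draft = drafter.invoke(
        f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    ).content

    # Verification phase: the 70B model reviews and refines the draft
    final = verifier.invoke(
        f"Context:\n{context}\n\nQuestion: {question}\n\nDraft answer:\n{draft}\n\n"
        "Fix any inaccuracies in the draft and return the improved final answer."
    ).content

    return {"draft": draft, "final": final}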

Performance Benefits

  • Reduced Latency: The speculative approach provides a fast initial response while the verified answer is produced
  • Quality Assurance: The verification stage reviews every draft, helping maintain answer accuracy
  • Cost Efficiency: The inexpensive 8B model handles drafting, reserving the larger 70B model for verification
  • Observability: Comprehensive tracing helps identify bottlenecks
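
One way to attach that tracing to the LangChain calls is Langfuse's callback handler, which picks up the LANGFUSE_* variables from .env. The import path below is for the current Langfuse SDK; older 2.x SDKs expose it as langfuse.callback.CallbackHandler, and how this repo wires it up may differ:

from langchain_groq import ChatGroq
from langfuse.langchain import CallbackHandler

langfuse_handler = CallbackHandler()
drafter = ChatGroq(model="llama-3.1-8b-instant")

# Passing the handler in the config records the call, including its latency
reply = drafter.invoke(
    "Say hello in one short sentence.",
    config={"callbacks": [langfuse_handler]},
)
print(reply.content)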

Configuration

Model Selection

The pipeline uses two Groq models:

  • Drafter: llama-3.1-8b-instant (fast, cost-effective)
  • Verifier: llama-3.3-70b-versatile (accurate, thorough)

Chunking Parameters

  • Chunk Size: 1000 characters
  • Overlap: 200 characters
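
In LangChain terms, these values map onto the recursive character splitter. The exact splitter class and module paths used in this repo are assumptions based on recent LangChain releases:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("path/to/doc.pdf").load()   # one Document per PDF page
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(pages)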

Embedding Model

  • Model: all-MiniLM-L6-v2 (local, no API costs)
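
And the corresponding embedding and vector-store setup, continuing from the chunks above (again, module paths assume a recent LangChain layout rather than this repo's exact imports):

from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embeddings)   # runs locally, no API calls
retriever = vectorstore.as_retriever()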

Troubleshooting

Common Issues

  • Python 3.14 Errors: Use Python 3.11 or 3.12. Version 3.14 lacks binary wheels for some dependencies.
  • Rate Limits: Groq has rate limits; consider upgrading your plan for production use.
  • Memory Issues: Large PDFs may require more RAM; consider chunk size adjustments.

Environment Setup

If you encounter dependency issues:

# Recreate virtual environment
rm -rf .venv
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows
pip install -r requirements.txt

Dependencies

Key libraries:

  • langchain: RAG pipeline framework
  • langchain-groq: Groq model integration
  • langchain-huggingface: Local embeddings
  • chromadb: Vector database
  • langfuse: Observability and tracing
  • pypdf: PDF processing
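
A quick way to confirm they all resolve inside your virtual environment (import names follow the usual pip-package to module mapping):

# Each import corresponds to one of the key libraries listed above
import langchain, langchain_groq, langchain_huggingface, chromadb, langfuse, pypdf
print("All key dependencies import correctly")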

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

License

[Add your license here]

Acknowledgments

  • Groq for providing fast inference APIs
  • Langfuse for observability tools
  • LangChain for the RAG framework
