A modern chat-based application that analyzes customer data using LLM insights to identify Ideal Customer Profile (ICP) matches. Ask natural language questions about your data and get instant insights. Built with FastAPI, Next.js, PostgreSQL, and OpenAI/Anthropic.
- Quick Start
- Features
- Architecture
- Data Processing & Parsing
- Usage Guide
- API Endpoints
- Environment Variables
- Troubleshooting
- Deployment
- Contributing
Fastest way to get started locally:
# 1. Configure environment
cp env.docker.example .env
# Add your API key to .env (OPENAI_API_KEY or ANTHROPIC_API_KEY)
# 2. Start everything
docker-compose up -d
# 3. Open the app
# Visit http://localhost:3000/data-analyst
That's it! Backend, frontend, and database all running in containers.
Docker Commands:
# View logs
docker-compose logs -f
# Stop services
docker-compose down
# Restart after code changes
docker-compose up -d --build
# Access database
docker-compose exec postgres psql -U postgres -d render_data
Run services directly on your machine:
Prerequisites:
- Python 3.9+
- Node.js 18+
- PostgreSQL (optional)
- OpenAI or Anthropic API key
Backend:
# Install dependencies
pip install -e .
# or with uv: uv sync
# Configure
cp env.docker.example .env
# Add API key to .env
# Start backend
./run_api.sh
# Runs at http://localhost:8000
Frontend:
# Install dependencies
cd frontend
npm install
# Configure
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
# Start frontend
npm run dev
# Runs at http://localhost:3000
Usage:
- Open http://localhost:3000/data-analyst
- Upload your enrichment CSV file (sidebar)
- Start asking questions about your data!
Production Deployment: For production, we recommend deploying to Render. See the Deployment section for detailed instructions. Docker Compose is intended for local development only.
- Natural Language Queries - Ask questions in plain English
- Smart Responses - Get insights with tables, charts, and explanations
- Conversation History - Context-aware responses based on previous questions
- Dynamic Visualizations - Automatic chart generation for data insights
- CSV Upload - Upload enrichment data directly through the UI
- Database Integration - Connect to PostgreSQL for signup data
- ICP Scoring - AI-powered scoring (0-100 scale) with confidence levels
- Pattern Recognition - Identify trends and commonalities
- Smart Recommendations - Actionable insights from your data
- Multi-Provider Support - OpenAI and Anthropic
- Flexible Models - Choose GPT-4, Claude, or other models
- Context-Aware - Understands your data structure automatically
- Structured Outputs - Tables, charts, and SQL queries
- FastAPI Backend - High-performance, async Python API
- Next.js Frontend - Modern React with TypeScript
- OpenAPI Docs - Auto-generated API documentation
- Environment Variables - Secure configuration management
┌──────────────┐         ┌──────────────┐
│   Frontend   │ ──────▶ │   Backend    │
│  (Next.js)   │  HTTP   │  (FastAPI)   │
│              │ ◀────── │              │
│ • Chat UI    │  JSON   │ • REST API   │
│ • Upload     │         │ • LLM Logic  │
│ • Display    │         │ • Data Proc  │
└──────────────┘         └──────┬───────┘
                                │
                                ▼
                         ┌──────────────┐
                         │  PostgreSQL  │
                         │  (Optional)  │
                         └──────────────┘
Backend (FastAPI + Python)
- REST API with async support
- LLM integration (OpenAI/Anthropic)
- PostgreSQL database connector
- In-memory state management
- Auto-generated OpenAPI docs
Frontend (Next.js + React)
- Server-side rendering
- TypeScript for type safety
- Tailwind CSS for styling
- Real-time chat interface
- Recharts for visualizations
Data Flow
- User uploads CSV → Backend stores in memory
- User asks question → Frontend sends to `/api/chat`
- Backend processes with LLM → Returns structured response
- Frontend displays → Tables, charts, explanations
This section explains how the API combines and parses user data from multiple sources.
The system works with two primary data sources:
- Enrichment Data (CSV) - Customer enrichment data uploaded via the UI
- Signup Data (Database) - User signup records from PostgreSQL (optional)
When a CSV file is uploaded:
# File is read into memory
content = await file.read()
# Saved to temporary file
with tempfile.NamedTemporaryFile(mode='wb', delete=False, suffix='.csv') as tmp:
    tmp.write(content)
    tmp_path = tmp.name
# Loaded into pandas DataFrame
df = pd.read_csv(tmp_path)
# Stored in application state
state.enrichment_data = df
Key Steps:
- File validation (must be `.csv`)
- Temporary file creation for processing
- Pandas DataFrame conversion
- In-memory storage in `AppState.enrichment_data`
- Temporary file cleanup
When a chat query is made and database is connected:
# Check database connection
db = get_database_connection()
success, _ = db.test_connection()
if success:
    # Get available tables
    tables = db.get_tables()
    if tables:
        # Query signup data (default: first table, limit 1000 rows)
        table_name = tables[0]
        state.signup_data = db.query_signups(table_name, limit=1000)
Key Steps:
- Database connection test
- Table discovery
- Data querying (with row limit for performance)
- Storage in `AppState.signup_data`
Before analysis, data goes through cleaning via DataCleaner:
# Standardizes column names to lowercase with underscores
"Company Name" โ "company_name"
"Email Address" โ "email_address"
"Industry/Sector" โ "industry_sector"Process:
- Convert to lowercase
- Replace spaces with underscores
- Remove special characters
- Remove consecutive underscores
- Trim leading/trailing underscores
- Duplicate Removal
  - Removes duplicate rows
  - Handles unhashable types gracefully
  - Removes duplicate columns
- String Cleaning
  - Strips whitespace
  - Normalizes multiple spaces to a single space
  - Replaces null indicators (`'nan'`, `'None'`, `'NULL'`, `''`) with `NaN`
- Numeric Cleaning
  - Replaces infinite values (`inf`, `-inf`) with `NaN`
  - Preserves numeric types
- Missing Data Handling
  - Can drop columns with >50% missing values
  - Can fill missing values (mean, median, mode, forward fill)
  - Default: drops rows with any missing values
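The actual `DataCleaner` implementation lives in `src/data_processing.py` and isn't reproduced in this README. As a rough sketch, the steps above could be composed with plain pandas like this (the column rules and defaults here are illustrative assumptions, not the project's exact logic):

```python
import re
import numpy as np
import pandas as pd

def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass mirroring the steps described above."""
    df = df.copy()

    # 1. Standardize column names: lowercase, underscores, no special characters
    def standardize(name: str) -> str:
        name = re.sub(r"[^\w]+", "_", name.strip().lower())  # spaces/special chars -> _
        return re.sub(r"_+", "_", name).strip("_")            # collapse and trim underscores

    df.columns = [standardize(c) for c in df.columns]

    # 2. Remove duplicate columns and rows (skip row dedup gracefully if cells are unhashable)
    df = df.loc[:, ~df.columns.duplicated()]
    try:
        df = df.drop_duplicates()
    except TypeError:
        pass

    # 3. Clean strings and normalize null indicators
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype(str).str.strip().str.replace(r"\s+", " ", regex=True)
        df[col] = df[col].replace({"nan": np.nan, "None": np.nan, "NULL": np.nan, "": np.nan})

    # 4. Replace infinite numeric values with NaN
    df = df.replace([np.inf, -np.inf], np.nan)

    # 5. Handle missing data (default described above: drop rows with any missing values)
    return df.dropna()
```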
When combining enrichment and signup data, SchemaMapper aligns schemas:
# Automatically suggests column mappings based on similarity
mappings = SchemaMapper.suggest_column_mapping(
source_df=enrichment_data,
target_df=signup_data,
threshold=0.6 # Minimum similarity score
)
# Example mappings:
# "company" โ "company_name" (similarity: 0.85)
# "email" โ "email_address" (similarity: 0.90)
# "industry" โ "sector" (similarity: 0.75)Similarity Calculation:
- Exact match: 1.0
- Sequence matcher ratio (fuzzy matching)
- Pattern boosts (e.g., email/mail → 0.9, company/organization → 0.8)
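The precise scoring code inside `SchemaMapper` isn't shown in this README; a minimal sketch of the heuristics above, assuming Python's `difflib.SequenceMatcher` for the fuzzy part and a hypothetical table of synonym boosts, might look like:

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups and their boost scores (illustrative, not the real table)
PATTERN_BOOSTS = [
    ({"email", "mail"}, 0.9),
    ({"company", "organization"}, 0.8),
]

def column_similarity(source: str, target: str) -> float:
    """Score how likely two column names refer to the same field (0.0-1.0)."""
    a, b = source.lower(), target.lower()
    if a == b:
        return 1.0  # exact match

    score = SequenceMatcher(None, a, b).ratio()  # fuzzy string similarity

    # Boost the score when both names contain terms from the same synonym group
    for group, boost in PATTERN_BOOSTS:
        if any(term in a for term in group) and any(term in b for term in group):
            score = max(score, boost)
    return score

print(column_similarity("email", "email_address"))   # boosted to 0.9
print(column_similarity("company", "company_name"))  # fuzzy match plus boost
```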
# Aligns two DataFrames to have matching columns
df1_aligned, df2_aligned = SchemaMapper.align_schemas(
df1=enrichment_data,
df2=signup_data,
mapping=optional_explicit_mapping
)
# Result: Both DataFrames have the same column names
# Only common columns are kept
When processing a chat query, data is combined:
# Build context for LLM
context_parts = []
# Add enrichment data summary
context_parts.append("## Available Data\n")
context_parts.append(f"### Enrichment Data")
context_parts.append(f"- Rows: {len(enrichment_data)}")
context_parts.append(f"- Columns: {', '.join(enrichment_data.columns)}")
context_parts.append(f"- Sample:\n```\n{enrichment_data.head(3).to_string()}\n```\n")
# Add signup data summary (if available)
if signup_data is not None:
    context_parts.append(f"### Signup Data")
    context_parts.append(f"- Rows: {len(signup_data)}")
    context_parts.append(f"- Columns: {', '.join(signup_data.columns)}")
    context_parts.append(f"- Sample:\n```\n{signup_data.head(3).to_string()}\n```\n")
# Add conversation history (last 3 exchanges)
if len(conversation_history) > 1:
    context_parts.append("\n## Recent Conversation")
    for msg in conversation_history[-6:]:  # Last 6 messages
        context_parts.append(f"**{msg['role']}**: {msg['content']}")
Key Points:
- Both datasets are summarized (row count, columns, sample rows)
- Data is presented as text tables for LLM consumption
- Conversation history provides context for follow-up questions
- Limited to first 3 rows per dataset to manage token usage
For full ICP analysis (/api/analyze), data is processed differently:
# Limit data size for LLM context
signup_sample = signup_data.head(max_customers) # Default: 50
customer_sample = customer_data.head(max_customers)  # Default: 50
The system derives ICP from signup data:
# Signup data represents the ICP (Ideal Customer Profile)
# The LLM analyzes patterns in signups to understand:
# - Common industries
# - Company size ranges
# - Use cases
# - Behavioral patterns
Each customer in enrichment data is scored against the ICP:
# For each customer:
# 1. Compare attributes to ICP patterns
# 2. Calculate match score (0-100)
# 3. Identify matching attributes
# 4. Identify gaps
# 5. Assign confidence level (high/medium/low)
# 6. Generate reasoning
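The README doesn't pin down the exact shape the LLM returns for each scored customer; a plausible container for the six steps above, sketched as a dataclass with assumed field names, is:

```python
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class ICPMatch:
    """Hypothetical per-customer scoring result produced by the LLM."""
    company: str
    score: int                                    # 0-100 match score
    confidence: Literal["high", "medium", "low"]  # confidence level
    matching_attributes: List[str] = field(default_factory=list)
    gaps: List[str] = field(default_factory=list)
    reasoning: str = ""

example = ICPMatch(
    company="Acme Analytics",
    score=87,
    confidence="high",
    matching_attributes=["SaaS industry", "50-200 employees"],
    gaps=["No cloud infrastructure signal"],
    reasoning="Closely mirrors the industries and sizes seen in signup data.",
)
```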
The LLM identifies patterns across high-scoring customers:
patterns = {
    "common_industries": ["SaaS", "Technology", "Finance"],
    "common_sizes": ["50-200", "200-1000"],
    "common_attributes": ["Has engineering team", "Uses cloud infrastructure"],
    "key_indicators": ["High engagement", "Multiple users"]
}
Before processing, data is validated via DataValidator:
validation = DataValidator.validate_dataframe(df)
# Returns:
{
    'is_valid': True/False,
    'errors': ['Column X has all missing values'],
    'warnings': ['Found 5 duplicate rows', 'Missing data: 10%'],
    'stats': {
        'total_rows': 1000,
        'total_columns': 15,
        'numeric_columns': 8,
        'text_columns': 7,
        'missing_values': 150,
        'duplicate_rows': 5
    }
}
Quality Score Calculation:
score = DataValidator.get_data_quality_score(df)
# Factors:
# - Completeness (40%): No missing values
# - Uniqueness (30%): No duplicates
# - Validity (30%): No all-null columns
# Returns: 0-100 score
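`DataValidator`'s real formula isn't reproduced here, but the weighting above (completeness 40%, uniqueness 30%, validity 30%) can be read as something like this sketch:

```python
import pandas as pd

def data_quality_score(df: pd.DataFrame) -> float:
    """Illustrative 0-100 score using the weights described above."""
    total_cells = df.shape[0] * df.shape[1] or 1

    completeness = 1 - df.isna().sum().sum() / total_cells       # share of non-missing cells
    uniqueness = 1 - df.duplicated().sum() / max(len(df), 1)     # share of non-duplicate rows
    validity = df.notna().any().sum() / max(df.shape[1], 1)      # share of columns not all-null

    return round(100 * (0.4 * completeness + 0.3 * uniqueness + 0.3 * validity), 1)

print(data_quality_score(pd.DataFrame({"a": [1, 1, None], "b": [None, None, None]})))
```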
Current Implementation:
- Data stored in-memory (`AppState` class)
- Resets on server restart
- No persistence between sessions
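The `AppState` class itself isn't listed in this README; based on the attributes used in the snippets above, the in-memory model is plausibly as simple as this sketch:

```python
from typing import List, Optional
import pandas as pd

class AppState:
    """Hypothetical in-memory session state; cleared whenever the server restarts."""

    def __init__(self) -> None:
        self.enrichment_data: Optional[pd.DataFrame] = None  # uploaded CSV
        self.signup_data: Optional[pd.DataFrame] = None      # rows pulled from PostgreSQL
        self.conversation_history: List[dict] = []           # chat messages kept for context

    def reset(self) -> None:
        """Drop all loaded data and history (what a state reset would do)."""
        self.enrichment_data = None
        self.signup_data = None
        self.conversation_history = []

state = AppState()  # a single shared instance implies one session at a time
```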
Limitations:
- Single user session (no multi-user support)
- Data lost on restart
- No file persistence
Future Enhancements:
- Redis for session storage
- Database persistence for uploaded files
- File system storage for CSVs
- Multi-user support with session management
1. User uploads "customers.csv"
         ↓
2. Backend reads CSV → pandas DataFrame
         ↓
3. DataCleaner.clean_dataframe() standardizes columns
         ↓
4. Stored in state.enrichment_data
         ↓
5. User asks: "Who are my top ICP customers?"
         ↓
6. Backend loads signup_data from database (if connected)
         ↓
7. SchemaMapper aligns schemas (if needed)
         ↓
8. Context built with both datasets + conversation history
         ↓
9. LLM analyzes and generates response
         ↓
10. Response parsed into ChatResponse (tables, charts, text)
         ↓
11. Frontend displays formatted results
Backend (FastAPI):
./run_api.sh
# Starts at http://localhost:8000
# API docs at http://localhost:8000/docs
Frontend (Next.js):
cd frontend
npm run dev
# Starts at http://localhost:3000
- Open http://localhost:3000/data-analyst
- Click the CSV upload area in the sidebar
- Select your enrichment CSV file
- Wait for upload confirmation
- Check the "Connected Data" section shows green status
Type natural language questions in the chat box, such as:
- Basic Analysis: "Who are our highest ICP fit customers?"
- Segmentation: "Show me customers grouped by industry"
- Patterns: "What do high-scoring customers have in common?"
- Comparisons: "Compare customers by company size"
- Trends: "What percentage of customers are in tech?"
The agent responds with:
- Text explanations - Natural language insights
- Tables - Structured data summaries
- Charts - Visual representations of trends
- SQL queries - The underlying queries (optional)
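Under the hood these pieces are parsed into a `ChatResponse` object (mentioned in the example flow above). Its exact definition isn't included in this README, but based on the `/api/chat` response documented in the API Endpoints section, it plausibly resembles this Pydantic-style sketch (model and field layout are assumptions):

```python
from typing import List, Optional
from pydantic import BaseModel

class TableData(BaseModel):
    title: str
    columns: List[str]
    rows: List[list]

class ChartData(BaseModel):
    id: str
    title: str
    data: List[dict]
    xKey: str
    yKey: str

class ChatResponse(BaseModel):
    """Hypothetical response model matching the /api/chat JSON shown later."""
    content: str                       # natural-language explanation
    table: Optional[TableData] = None  # structured summary table
    charts: List[ChartData] = []       # chart specs rendered by Recharts
    sql: Optional[str] = None          # underlying query, when requested
```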
Ask follow-up questions to dive deeper:
- "Tell me more about the top 5"
- "What about customers in enterprise segment?"
- "Show me the breakdown by source"
You: "Upload enrichment CSV" โ Click sidebar upload
Agent: "โ
Successfully uploaded! 1,247 rows loaded"
You: "Who are our top ICP customers?"
Agent: [Shows table with top 10, includes scores and reasons]
You: "What patterns do they share?"
Agent: [Analyzes commonalities, shows charts]
You: "Show me customers in the SaaS industry"
Agent: [Filters and displays SaaS customers]
The project includes sample data to help you get started:
- `data/mock_signups.sql`: SQL script with mock user records
- `data/customer_data.csv`: Mock customer enrichment data
Health check endpoint.
Response:
{
"status": "healthy",
"service": "ICP Analysis Agent API",
"version": "1.0.0"
}
Get current data status.
Response:
{
"enrichment_loaded": true,
"enrichment_rows": 1247,
"enrichment_columns": 15,
"signup_loaded": true,
"signup_rows": 500,
"signup_columns": 12,
"database_connected": true
}
Upload enrichment CSV file.
Request:
curl -X POST http://localhost:8000/api/upload \
-F "file=@customer_data.csv"Response:
{
"status": "success",
"message": "Successfully uploaded customer_data.csv",
"data": {
"rows": 1247,
"columns": 15,
"column_names": ["company", "industry", "size", ...],
"sample": [...]
}
}
Send chat message for analysis.
Request:
{
"message": "Who are our top ICP customers?",
"include_sql": true
}
Response:
{
"content": "Based on your data, here are the top ICP customers...",
"table": {
"title": "Top ICP Customers",
"columns": ["Company", "Score", "Industry"],
"rows": [...]
},
"charts": [
{
"id": "score-distribution",
"title": "ICP Score Distribution",
"data": [...],
"xKey": "score_range",
"yKey": "count"
}
],
"sql": "SELECT ..."
}
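For reference, the same call can be made from Python; this sketch assumes the backend is running locally and uses the `requests` library as an example client:

```python
import requests

# Send a question to the chat endpoint and print the structured pieces
resp = requests.post(
    "http://localhost:8000/api/chat",
    json={"message": "Who are our top ICP customers?", "include_sql": True},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()

print(data["content"])               # natural-language answer
if data.get("table"):
    print(data["table"]["columns"])  # table headers
if data.get("sql"):
    print(data["sql"])               # underlying SQL, since include_sql=True
```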
Run full ICP analysis.
Response:
{
"status": "success",
"summary": "Analysis complete...",
"statistics": {
"total_analyzed": 50,
"avg_score": 72.5,
"matches_above_80": 12
},
"top_matches": [...],
"patterns": {...},
"recommendations": [...]
}
Reset all application state.
Response:
{
"status": "success",
"message": "State reset successfully"
}
Interactive API documentation (Swagger UI).
# LLM API Keys (at least one required)
OPENAI_API_KEY=sk-... # Get from https://platform.openai.com/api-keys
ANTHROPIC_API_KEY=sk-... # Get from https://console.anthropic.com/
# Database Configuration (optional - for signup data)
DATABASE_URL=postgresql://user:password@localhost:5432/dbname
DB_HOST=localhost
DB_PORT=5432
DB_NAME=your_database
DB_USER=postgres
DB_PASSWORD=your_password
# API Configuration
PORT=8000
LLM_PROVIDER=openai
LLM_MODEL=gpt-4-turbo-preview
# Frontend configuration (frontend/.env.local)
NEXT_PUBLIC_API_URL=http://localhost:8000
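In the project, configuration is handled by `src/config.py`, which isn't reproduced here; a minimal sketch of how these variables might be loaded (assuming `python-dotenv`, with the documented defaults) looks like:

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # pull values from .env into the environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
DATABASE_URL = os.getenv("DATABASE_URL")                   # optional
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")         # documented default
LLM_MODEL = os.getenv("LLM_MODEL", "gpt-4-turbo-preview")  # documented default
PORT = int(os.getenv("PORT", "8000"))

if not (OPENAI_API_KEY or ANTHROPIC_API_KEY):
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY in .env")
```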
Backend won't start:
- Check that port 8000 is available: `lsof -i :8000`
- Verify Python version: `python --version` (needs 3.9+)
- Check `.env` file has valid API keys
- Install dependencies: `pip install -e .`
- Check that port 3000 is available: `lsof -i :3000`
- Verify Node version: `node --version` (needs 18+)
- Install dependencies: `cd frontend && npm install`
- Check `.env.local` has correct API URL
- Ensure backend is running at http://localhost:8000
- Test backend: `curl http://localhost:8000/api/status`
- Check browser console for CORS errors
- Verify `NEXT_PUBLIC_API_URL` in frontend/.env.local
- Check file is valid CSV format
- Ensure CSV has header row
- Verify file size is reasonable (<300MB)
- Check backend logs for detailed error
- Verify API key is correct and active
- Check API quotas and billing
- Try switching providers (OpenAI ↔ Anthropic)
- Check backend logs for full error message
- Verify PostgreSQL is running
- Check credentials in `.env`
- Test connection: `psql -h localhost -U user -d dbname`
- Database is optional - the agent works with CSV only
Port conflicts:
# Find what's using the port
lsof -i :3000
lsof -i :8000
lsof -i :5432
# Kill the process or change ports in docker-compose.yml
Services won't start:
# Check logs
docker-compose logs
# Check specific service
docker-compose logs backend
docker-compose logs frontend
docker-compose logs postgres
# Restart services
docker-compose restart
Frontend can't connect to backend:
# Check if backend is running
curl http://localhost:8000/api/status
# Check frontend logs
docker-compose logs frontend
# Restart both
docker-compose restart backend frontend
Getting more help:
- Check backend logs in the terminal where `run_api.sh` is running
- Check frontend logs in the terminal where `npm run dev` is running
- Check browser console (F12) for frontend errors
- Review API docs: http://localhost:8000/docs
Render is the recommended production deployment platform for this application.
- Create a new Web Service on Render
- Connect your GitHub repository
- Configure settings:
  - Build Command: `pip install -e .` or `uv sync`
  - Start Command: `uvicorn api:app --host 0.0.0.0 --port $PORT`
  - Environment: Python 3
- Add environment variables:
  - `OPENAI_API_KEY` or `ANTHROPIC_API_KEY`
  - `DATABASE_URL` (if using PostgreSQL)
  - `LLM_PROVIDER` (default: `openai`)
  - `LLM_MODEL` (default: `gpt-4-turbo-preview`)
- Deploy! Render will automatically run migrations on each deploy.
- Create a new Static Site or Web Service on Render
- Connect your GitHub repository
- Set root directory to `frontend`
- Configure settings:
  - Build Command: `cd frontend && npm install && npm run build`
  - Publish Directory: `frontend/.next`
  - Environment: Node
- Add environment variable:
  - `NEXT_PUBLIC_API_URL`: Your backend Render URL (e.g., https://your-backend.onrender.com)
- Deploy!
Alternative Production Platforms:
- Railway: Auto-detects Python/Node, similar setup
- Fly.io: Containerize with Docker
- Vercel: Great for Next.js frontend (one-click deploy)
For local development, use Docker Compose:
# Start all services (backend, frontend, database)
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
# Rebuild after code changes
docker-compose up -d --build
Local Development Setup:
- Copy `env.docker.example` to `.env`
- Add your API keys to `.env`
- Run `docker-compose up -d`
- Access frontend at http://localhost:3000
- Access backend API at http://localhost:8000
Note: Docker Compose is intended for local development only. For production, use Render or another managed platform.
render-data-agent/
├── api.py                    # FastAPI backend application
├── run_api.sh                # Backend startup script
├── pyproject.toml            # Python dependencies
├── docker-compose.yml        # Docker setup for local development
├── Dockerfile                # Backend container
├── Dockerfile.frontend       # Frontend container
│
├── frontend/                 # Next.js frontend
│   ├── app/
│   │   ├── page.tsx          # Landing page
│   │   └── data-analyst/
│   │       └── page.tsx      # Main chat interface
│   ├── components/           # React components
│   ├── package.json          # Node dependencies
│   └── tsconfig.json         # TypeScript config
│
├── src/                      # Core Python modules
│   ├── __init__.py
│   ├── config.py             # Configuration management
│   ├── database.py           # PostgreSQL connection
│   ├── data_processing.py    # Data cleaning and transformation
│   ├── icp_analyzer.py       # LLM-powered ICP analysis
│   ├── llm_client.py         # LLM provider abstraction
│   ├── segmentation.py       # Customer segmentation logic
│   └── utils.py              # Helper functions
│
├── data/                     # Sample data
│   ├── customer_data.csv     # Sample customer data
│   ├── mock_signups.sql      # Database setup script
│   └── README.md             # Data documentation
│
└── tests/                    # Test files
    └── __init__.py
The agent can work with just CSV files, but connecting a PostgreSQL database allows comparison with signup data.
- Install PostgreSQL (if not already installed)
- Create database: `createdb render_data`
- Load sample data: `psql -U postgres -d render_data -f data/mock_signups.sql`
- Configure in `.env`: `DATABASE_URL=postgresql://postgres:password@localhost:5432/render_data`

The agent will automatically detect and use the database if configured.
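To sanity-check the connection string before starting the app, a quick standalone test from Python can help; this sketch uses `psycopg2` purely as an example driver (the project's own connector lives in `src/database.py`):

```python
import os
import psycopg2  # example driver; install with `pip install psycopg2-binary`

# Reuse the same DSN the app reads from .env
dsn = os.getenv("DATABASE_URL", "postgresql://postgres:password@localhost:5432/render_data")

with psycopg2.connect(dsn) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public'")
        print("Public tables found:", cur.fetchone()[0])
```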
Contributions are welcome! Please feel free to submit a Pull Request.
- Follow existing code style
- Add tests for new features
- Update documentation
- Test both backend and frontend
Backend formatting:
black src/ api.py
ruff check src/ api.py
Frontend formatting:
cd frontend
npm run lint
- API Documentation: Interactive docs at http://localhost:8000/docs
- Sample Data: See `data/README.md` for data documentation
- Docker Guide: See `DOCKER_QUICKSTART.md` for Docker commands
- Multi-file comparison and merging
- Historical tracking and trends
- Custom ICP criteria configuration
- Scheduled analysis and reporting
- Export to CRM systems
- Team collaboration features
- Session persistence (Redis)
- WebSocket streaming for real-time responses
- User authentication and multi-user support
Built with ❤️ using FastAPI, Next.js, and modern full-stack practices.