High-Fidelity Agent Data Pipeline for Technical Documentation Crawling
TechCyclopedia is an intelligent MCP (Model Context Protocol) server that provides advanced web crawling capabilities with interactive configuration, live progress tracking, and background task management for technical documentation.
- Smart Documentation Discovery: Automatically find docs for 20+ popular tools (Python, React, FastAPI, etc.)
- Intelligent URL Processing: Pass tool names or URLs; documentation URLs are discovered and validated automatically
- Interactive Configuration: Smart user preference system with persistent storage
- Live Progress Tracking: Real-time updates on crawling progress with detailed status
- Background Tasks: Run crawls in the background while continuing to chat
- Organized Output: Clean, structured file organization by domain
- Content Filtering: Automatic boilerplate removal and content optimization
- Comprehensive Error Handling: Robust validation, detailed error messages, graceful degradation
- File Size Management: Automatic size checking and filtering (10MB limit)
- Persistent Preferences: SQLite-based user preference storage
- Deep Crawling: BFS strategy for comprehensive documentation extraction
Visit the live website: https://nomanayeem.github.io/TechCyclopedia
The website showcases TechCyclopedia's features, usage examples, and provides comprehensive documentation.
- 🎯 Smart Tool Discovery: Just type `python` or `react` instead of full URLs
- ✅ Enhanced Validation: Comprehensive URL validation with detailed error messages
- 🛡️ Better Error Handling: Graceful failure handling, detailed error reports
- 📊 Rich Return Values: Get detailed statistics and metadata from all operations
- 🔍 New Tools: `discover_docs` and `list_supported_tools` for exploration
- 📏 File Size Management: Automatic size checking (10MB limit)
- 🐛 Bug Fixes: Fixed datetime deprecation warnings
See IMPROVEMENTS.md for detailed documentation.
- Python 3.9+
- pip package manager
- Clone the repository:
  git clone https://github.com/NoManNayeem/TechCyclopedia.git
  cd TechCyclopedia
- Create virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
  playwright install
- Start the server:
  python server/server.py

TechCyclopedia provides several powerful MCP tools:
Crawl technical documentation with automatic URL discovery and validation.
NEW: Now supports tool names! Pass "python" instead of full URLs.
Parameters:
- urls: List of URLs or tool names (e.g., ["python", "react", "https://example.com"])
- output_dir: Directory path where MD files will be saved
- user_id: User identifier for preferences (default: "default")
Examples:
// Using tool names (NEW!)
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["python", "fastapi"],
"output_dir": "results",
"user_id": "default"
}
}
// Using direct URLs (still works)
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["https://docs.python.org/3/tutorial/"],
"output_dir": "results",
"user_id": "default"
}
}
// Mix both!
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["python", "https://custom-docs.com"],
"output_dir": "results"
}
}

Returns:
{
"success": true,
"task_id": "uuid",
"urls_processed": 3,
"files_created": 45,
"output_directory": "/path/to/results",
"files": ["file1.md", "file2.md", ...],
"inputs_processed": ["Found 3 documentation URL(s) for 'python'"],
"validation_errors": null
}

Start a background crawling task that can run while the user continues chatting.
Parameters:
- urls: List of starting URLs for the documentation crawl
- output_dir: Directory path where MD files will be saved
- user_id: User identifier for preferences
Returns: Task ID for tracking the background task
Check the status of a background crawling task.
Parameters:
task_id: Task ID returned by start_background_crawl
Returns: Dictionary with task status and progress information
Get all crawling tasks and their status.
Returns: List of all tasks with their status information
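These task tools are meant to be chained: start a crawl, keep the returned task ID, then poll it while you keep working. The sketch below uses the official `mcp` Python client (the same pattern as the Python integration example later in this README); the tool names and parameters match the descriptions above, while the way the task ID is pulled out of the result and the status-polling logic are illustrative assumptions, not the project's guaranteed payload shape.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def crawl_in_background() -> None:
    # Launch the TechCyclopedia server over stdio (adjust the path to your checkout).
    server_params = StdioServerParameters(command="python", args=["server/server.py"])

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Kick off the crawl in the background and keep the returned task ID.
            started = await session.call_tool(
                "start_background_crawl",
                arguments={"urls": ["python"], "output_dir": "results", "user_id": "default"},
            )
            # The task ID arrives as text content on the tool result (mcp SDK convention).
            task_id = started.content[0].text
            print("Background task started:", task_id)

            # Poll the task while doing other work; the status strings checked here
            # are assumptions - inspect the real get_crawl_status return value.
            for _ in range(10):
                status = await session.call_tool("get_crawl_status", arguments={"task_id": task_id})
                text = status.content[0].text
                print("Status:", text)
                if "completed" in text or "failed" in text:
                    break
                await asyncio.sleep(5)


asyncio.run(crawl_in_background())
```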
Set the 'don't ask again' flag for user preferences.
Parameters:
user_id: User identifier
Returns: True if successfully set
Discover documentation URLs for a specific tool without starting a crawl.
Parameters:
tool_name: Name of tool/framework (e.g., "python", "react")
Example:
{
"tool": "discover_docs",
"parameters": {
"tool_name": "python"
}
}

Returns:
{
"success": true,
"tool_name": "python",
"urls": [
"https://docs.python.org/3/",
"https://docs.python.org/3/tutorial/",
"https://docs.python.org/3/library/"
],
"count": 3
}

Get a list of all tools with built-in documentation discovery.
No parameters required
Returns:
{
"success": true,
"tools": ["python", "react", "fastapi", "docker", ...],
"count": 20,
"categories": {
"Programming Languages": ["python", "typescript", "rust", "go"],
"Web Frameworks": ["react", "next.js", "vue", "angular", ...],
...
}
}

TechCyclopedia offers flexible configuration options:
- Deep Crawl (default): Crawl all related pages within the domain
- Shallow Crawl: Only crawl the specific URLs provided
- Single Page: Just extract content from the given URLs
- Remove Boilerplate: Automatic removal of navigation, footers, ads
- Organize by Domain: Create domain-specific subdirectories
- Content Optimization: Extract clean, LLM-ready markdown
- Max Depth: How deep to crawl (1-10 levels)
- Max Pages: Maximum number of pages to crawl
- Include External Links: Follow links to other domains
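These options correspond roughly to the knobs exposed by crawl4ai's deep-crawl support, which the project wires up in server/crawler_config.py. The snippet below is only a sketch of how the crawl strategies and advanced settings could map onto crawl4ai's public API; the class and parameter names are crawl4ai's, the specific values are illustrative.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Deep Crawl: breadth-first over in-domain links, bounded by depth and page count.
deep_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=5,             # "Max Depth" setting
        max_pages=500,           # "Max Pages" setting
        include_external=False,  # "Include External Links" setting
    ),
)

# Shallow Crawl / Single Page: drop the deep-crawl strategy and pass only the
# URLs that were provided.
single_page_config = CrawlerRunConfig()


async def crawl(url: str, config: CrawlerRunConfig):
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy attached, arun() yields one result per crawled page.
        return await crawler.arun(url=url, config=config)
```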
results/
├── docs.python.org/
│ ├── tutorial_index.md
│ ├── introduction.md
│ └── ...
├── ai-sdk.dev/
│ ├── docs_ai-sdk-ui.md
│ └── ...
└── ...
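Pages are grouped into one subdirectory per domain, and the storage layer described in the architecture below additionally uses deterministic MD5-hashed filenames for resource URIs. The helper below is a hypothetical illustration of both naming schemes; `output_path_for` is not the project's actual function.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, output_dir: str = "results", hashed: bool = False) -> Path:
    """Derive a per-domain output path for a crawled page (illustrative only)."""
    parsed = urlparse(url)
    domain_dir = Path(output_dir) / parsed.netloc  # e.g. results/docs.python.org/
    if hashed:
        # Deterministic MD5-based name, as used for resource URIs.
        name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".md"
    else:
        # Readable name derived from the URL path, as in the tree above.
        slug = parsed.path.strip("/").replace("/", "_") or "index"
        name = f"{slug}.md"
    return domain_dir / name


print(output_path_for("https://docs.python.org/3/tutorial/"))               # results/docs.python.org/3_tutorial.md
print(output_path_for("https://docs.python.org/3/tutorial/", hashed=True))  # results/docs.python.org/<md5>.md
```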
TechCyclopedia is built on a modular architecture with the following components:
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent / Client │
│ (MCP Protocol Consumer) │
└────────────────────────┬────────────────────────────────────┘
│ MCP Tool Call
│ crawl_tech_docs(urls, output_dir)
▼
┌─────────────────────────────────────────────────────────────┐
│ FastMCP2 Server (server.py) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ @mcp.tool async def crawl_tech_docs() │ │
│ │ - Input validation │ │
│ │ - Progress reporting via ctx.info() │ │
│ │ - Resource URI generation (MD5 hashing) │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ async/await
▼
┌─────────────────────────────────────────────────────────────┐
│ Crawl4ai Engine (AsyncWebCrawler) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ BFSDeepCrawlStrategy │ │
│ │ - Breadth-first link discovery │ │
│ │ - Domain scoping (include_external=False) │ │
│ │ - max_depth=5, max_pages=500 │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ Playwright Headless Browser │ │
│ │ - JavaScript rendering │ │
│ │ - Dynamic content handling │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Raw HTML
▼
┌─────────────────────────────────────────────────────────────┐
│ Multi-Stage Content Filter Pipeline │
│ │
│ Stage 1: HTML Pre-Exclusion │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Remove: nav, footer, header, script, style │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ Stage 2: PruningContentFilter │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ Heuristics: │ │
│ │ - Text density scoring │ │
│ │ - Link density penalty │ │
│ │ - Structural context analysis │ │
│ │ - Dynamic threshold adjustment │ │
│ │ - min_word_threshold=20 │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ Stage 3: Global Refinement │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ exclude_external_links=True │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Clean Markdown
▼
┌─────────────────────────────────────────────────────────────┐
│ Fit Markdown Output │
│ - 60-80% token reduction vs raw HTML │
│ - Semantic structure preserved │
│ - Technical details intact │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Storage Layer (results/) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ MD5-hashed filenames (deterministic) │ │
│ │ Example: 8a1b2c3d4e5f6789abcdef0123456789.md │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Resource URIs
▼
┌─────────────────────────────────────────────────────────────┐
│ MCP Resources (for Agent Retrieval) │
│ Agent can read specific resources by URI │
└─────────────────────────────────────────────────────────────┘
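The filter pipeline in the diagram maps onto crawl4ai's markdown generation: boilerplate tags are excluded up front, a PruningContentFilter trims low-value blocks, and the cleaned result is exposed as fit markdown. Below is a sketch of that configuration, assuming crawl4ai's documented class names; the thresholds shown are taken from the diagram, everything else is illustrative.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

filter_config = CrawlerRunConfig(
    # Stage 1: HTML pre-exclusion of structural boilerplate.
    excluded_tags=["nav", "footer", "header", "script", "style"],
    # Stage 2: heuristic pruning by text/link density with a dynamic threshold.
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold_type="dynamic",
            min_word_threshold=20,
        )
    ),
    # Stage 3: global refinement - drop links that leave the documentation domain.
    exclude_external_links=True,
)


async def fetch_fit_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=filter_config)
        # With a content filter attached, the filtered output is exposed as fit markdown.
        return result.markdown.fit_markdown
```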
- File: server/server.py - Responsibilities: MCP protocol handling, asynchronous tool execution, input validation, progress reporting, resource URI generation
- File: server/enhanced_crawler.py - Responsibilities: Advanced crawling with progress tracking, organized output, content filtering
- File: server/user_preferences.py - Responsibilities: SQLite-based preference storage, user configuration management
- File: server/interactive_config.py - Responsibilities: User interaction handling, configuration choices, preference management
- File: server/progress_tracker.py - Responsibilities: Task management, progress monitoring, background task coordination
- File: server/crawler_config.py - Responsibilities: Deep crawl strategy definition, URL filtering, content filter configuration
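As a rough illustration of the preference component, the class below sketches a SQLite-backed per-user store with the "don't ask again" flag exposed by the set_user_preference_dont_ask tool. The table name, columns, and method names are hypothetical and not the project's actual schema.

```python
import sqlite3
from typing import Optional


class UserPreferenceStore:
    """Minimal per-user key/value preference store backed by SQLite (illustrative)."""

    def __init__(self, db_path: str = "preferences.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS preferences ("
            " user_id TEXT NOT NULL,"
            " key TEXT NOT NULL,"
            " value TEXT,"
            " PRIMARY KEY (user_id, key))"
        )
        self.conn.commit()

    def set(self, user_id: str, key: str, value: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO preferences (user_id, key, value) VALUES (?, ?, ?)",
            (user_id, key, value),
        )
        self.conn.commit()

    def get(self, user_id: str, key: str, default: Optional[str] = None) -> Optional[str]:
        row = self.conn.execute(
            "SELECT value FROM preferences WHERE user_id = ? AND key = ?",
            (user_id, key),
        ).fetchone()
        return row[0] if row else default

    def set_dont_ask_again(self, user_id: str) -> bool:
        # Mirrors the behaviour of the set_user_preference_dont_ask MCP tool.
        self.set(user_id, "dont_ask_again", "true")
        return True


store = UserPreferenceStore()
store.set_dont_ask_again("default")
print(store.get("default", "dont_ask_again"))  # -> "true"
```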
- Test core components: python tests/simple_test.py
- Test MCP server: python tests/mcp_test.py
- Quick test: python tests/quick_test.py
The test suite covers:
- Enhanced Crawler functionality
- User Preferences system
- Progress Tracker
- Interactive Configuration
- MCP tool registration
- Direct function calls
- Tool parameter handling
TechCyclopedia works with any MCP-compatible client. Here are detailed setup instructions for popular clients:
Windows Location:
C:\Users\[YourUsername]\AppData\Roaming\Claude\claude_desktop_config.json
Create or edit the configuration file:
{
"mcpServers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

If you want to use the project's virtual environment:
{
"mcpServers": {
"techcyclopedia": {
"command": "C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\venv\\Scripts\\python.exe",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

- Save the configuration file
- Close Claude Desktop completely
- Reopen Claude Desktop
- TechCyclopedia tools will be available
Try this command in Claude Desktop:
"Use TechCyclopedia to crawl https://docs.python.org/3/tutorial/ and save the results to a folder called 'test_results'"
Create or edit: ~/.continue/config.json
{
"mcpServers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"]
}
}
}

- Open Continue panel in VS Code
- Ask: "Use MCP tool crawl_tech_docs to fetch Python docs"
- Continue will invoke TechCyclopedia automatically
Create or edit: ~/.cursor/mcp.json
{
"servers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"transport": "stdio"
}
}
}

pip install mcp

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def crawl_with_techcyclopedia():
    server_params = StdioServerParameters(
        command="python",
        args=["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
        env=None
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {tools}")

            # Call the crawl_tech_docs tool
            result = await session.call_tool(
                "crawl_tech_docs",
                arguments={
                    "urls": ["https://docs.python.org/3/"],
                    "output_dir": "results",
                    "user_id": "my_user"
                }
            )
            print(f"Crawl result: {result}")

            # Check background task
            task_id = await session.call_tool(
                "start_background_crawl",
                arguments={
                    "urls": ["https://ai-sdk.dev/docs/ai-sdk-ui"],
                    "output_dir": "background_results",
                    "user_id": "my_user"
                }
            )
            print(f"Background task started: {task_id}")

asyncio.run(crawl_with_techcyclopedia())

npm install @modelcontextprotocol/sdk

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
async function crawlWithTechCyclopedia() {
const transport = new StdioClientTransport({
command: "python",
args: ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
});
const client = new Client(
{ name: "techcyclopedia-client", version: "1.0.0" },
{ capabilities: {} }
);
await client.connect(transport);
// List available tools
const tools = await client.listTools();
console.log("Available tools:", tools);
// Call crawl_tech_docs tool
const result = await client.callTool({
name: "crawl_tech_docs",
arguments: {
urls: ["https://docs.python.org/3/"],
output_dir: "results",
user_id: "my_user"
},
});
console.log("Crawl result:", result);
// Start background task
const taskId = await client.callTool({
name: "start_background_crawl",
arguments: {
urls: ["https://ai-sdk.dev/docs/ai-sdk-ui"],
output_dir: "background_results",
user_id: "my_user"
},
});
console.log("Background task started:", taskId);
await client.close();
}
crawlWithTechCyclopedia().catch(console.error);

1. Client → Server: Initialize request
2. Server → Client: Initialize response
3. Client → Server: List tools
4. Server → Client: Tools list
5. Client → Server: Call tool
6. Server → Client: Tool result
import subprocess
import json
class SimpleMCPClient:
    def __init__(self, command, args):
        self.process = subprocess.Popen(
            [command] + args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        self.msg_id = 0

    def send_message(self, method, params=None):
        self.msg_id += 1
        message = {
            "jsonrpc": "2.0",
            "id": self.msg_id,
            "method": method,
            "params": params or {}
        }
        self.process.stdin.write(json.dumps(message) + "\n")
        self.process.stdin.flush()
        # Read response
        response = self.process.stdout.readline()
        return json.loads(response)

    def call_tool(self, name, arguments):
        return self.send_message("tools/call", {
            "name": name,
            "arguments": arguments
        })

# Usage
client = SimpleMCPClient("python", ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"])
client.send_message("initialize", {"clientInfo": {"name": "simple", "version": "1.0"}})
result = client.call_tool("crawl_tech_docs", {
    "urls": ["https://docs.python.org/3/"],
    "output_dir": "results",
    "user_id": "my_user"
})
print(result)

npm install -g @modelcontextprotocol/inspector
mcp-inspector python C:\Users\lenovo\Desktop\Mamdo\dumpo\server\server.py

# Test the server directly
python C:\Users\lenovo\Desktop\Mamdo\dumpo\server\server.py

Generic MCP Client Config:
{
"servers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"transport": "stdio",
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

With Virtual Environment:
{
"command": "C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\venv\\Scripts\\python.exe",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"]
}

- Path not found: Use absolute paths (full path starting with C:\)
- Python not found: Use full path to Python executable
- Permission denied: Run client as administrator
- Module not found: Ensure all dependencies are installed
- Test server manually: python server/server.py
- Check file paths are correct
- Verify Python environment
- Check client logs for errors
- Ensure MCP protocol compatibility
git clone https://github.com/NoManNayeem/TechCyclopedia.git
cd TechCyclopedia
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

python server/server.py

Choose your preferred client from the integration examples above and follow the setup instructions.
Use the MCP tools in your client to crawl technical documentation:
Example Commands:
- "Use TechCyclopedia to crawl Python docs"
- "Start a background crawl of React documentation"
- "Check the status of my crawling tasks"
- server/server.py: FastMCP server with tool definitions
- server/enhanced_crawler.py: Advanced crawling with progress tracking
- server/user_preferences.py: SQLite-based preference storage
- server/interactive_config.py: User-friendly configuration prompts
- server/progress_tracker.py: Real-time progress monitoring
User Request → MCP Client → TechCyclopedia Server → Enhanced Crawler → File System
↓ ↓ ↓ ↓ ↓
Configuration → Tool Call → Progress Updates → Content Processing → Organized Output
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Live Website
TechCyclopedia - Transforming web documentation into clean, AI-ready content for the future of intelligent systems.