High-Fidelity Agent Data Pipeline for Technical Documentation Crawling
TechCyclopedia is an intelligent MCP (Model Context Protocol) server that provides advanced web crawling capabilities with interactive configuration, live progress tracking, and background task management for technical documentation.
- Smart Documentation Discovery: Automatically find docs for 20+ popular tools (Python, React, FastAPI, etc.)
- Intelligent URL Processing: Pass tool names or URLs; documentation URLs are discovered and validated automatically
- Interactive Configuration: Smart user preference system with persistent storage
- Live Progress Tracking: Real-time updates on crawling progress with detailed status
- Background Tasks: Run crawls in the background while continuing to chat
- Organized Output: Clean, structured file organization by domain
- Content Filtering: Automatic boilerplate removal and content optimization
- Comprehensive Error Handling: Robust validation, detailed error messages, graceful degradation
- File Size Management: Automatic size checking and filtering (10MB limit)
- Persistent Preferences: SQLite-based user preference storage
- Deep Crawling: BFS strategy for comprehensive documentation extraction
Visit the live website: https://nomanayeem.github.io/TechCyclopedia
The website showcases TechCyclopedia's features, usage examples, and provides comprehensive documentation.
- 🎯 Smart Tool Discovery: Just type `python` or `react` instead of full URLs
- ✅ Enhanced Validation: Comprehensive URL validation with detailed error messages
- 🛡️ Better Error Handling: Graceful failure handling, detailed error reports
- 📊 Rich Return Values: Get detailed statistics and metadata from all operations
- 🔍 New Tools: `discover_docs` and `list_supported_tools` for exploration
- 📏 File Size Management: Automatic size checking (10MB limit)
- 🐛 Bug Fixes: Fixed datetime deprecation warnings
See IMPROVEMENTS.md for detailed documentation.
- Python 3.9+
- pip package manager
- Clone the repository:
  git clone https://github.com/NoManNayeem/TechCyclopedia.git
  cd TechCyclopedia
- Create virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
  playwright install
- Start the server:
  python server/server.py

TechCyclopedia provides several powerful MCP tools:
Crawl technical documentation with automatic URL discovery and validation.
NEW: Now supports tool names! Pass "python" instead of full URLs.
Parameters:
- urls: List of URLs or tool names (e.g., ["python", "react", "https://example.com"])
- output_dir: Directory path where MD files will be saved
- user_id: User identifier for preferences (default: "default")
Examples:
// Using tool names (NEW!)
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["python", "fastapi"],
"output_dir": "results",
"user_id": "default"
}
}
// Using direct URLs (still works)
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["https://docs.python.org/3/tutorial/"],
"output_dir": "results",
"user_id": "default"
}
}
// Mix both!
{
"tool": "crawl_tech_docs",
"parameters": {
"urls": ["python", "https://custom-docs.com"],
"output_dir": "results"
}
}

Returns:
{
"success": true,
"task_id": "uuid",
"urls_processed": 3,
"files_created": 45,
"output_directory": "/path/to/results",
"files": ["file1.md", "file2.md", ...],
"inputs_processed": ["Found 3 documentation URL(s) for 'python'"],
"validation_errors": null
}

Start a background crawling task that can run while the user continues chatting.
Parameters:
- urls: List of starting URLs for the documentation crawl
- output_dir: Directory path where MD files will be saved
- user_id: User identifier for preferences
Returns: Task ID for tracking the background task
Check the status of a background crawling task.
Parameters:
task_id: Task ID returned by start_background_crawl
Returns: Dictionary with task status and progress information
Get all crawling tasks and their status.
Returns: List of all tasks with their status information
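These task tools are meant to be chained: start a crawl, keep the returned task ID, then poll it while you keep working. The sketch below uses the official `mcp` Python client (the same pattern as the Python integration example later in this README); the tool names and parameters match the descriptions above, while the way the task ID is pulled out of the result and the status-polling logic are illustrative assumptions, not the project's guaranteed payload shape.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def crawl_in_background() -> None:
    # Launch the TechCyclopedia server over stdio (adjust the path to your checkout).
    server_params = StdioServerParameters(command="python", args=["server/server.py"])

    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Kick off the crawl in the background and keep the returned task ID.
            started = await session.call_tool(
                "start_background_crawl",
                arguments={"urls": ["python"], "output_dir": "results", "user_id": "default"},
            )
            # The task ID arrives as text content on the tool result (mcp SDK convention).
            task_id = started.content[0].text
            print("Background task started:", task_id)

            # Poll the task while doing other work; the status strings checked here
            # are assumptions - inspect the real get_crawl_status return value.
            for _ in range(10):
                status = await session.call_tool("get_crawl_status", arguments={"task_id": task_id})
                text = status.content[0].text
                print("Status:", text)
                if "completed" in text or "failed" in text:
                    break
                await asyncio.sleep(5)


asyncio.run(crawl_in_background())
```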
Set the 'don't ask again' flag for user preferences.
Parameters:
user_id: User identifier
Returns: True if successfully set
Discover documentation URLs for a specific tool without starting a crawl.
Parameters:
tool_name: Name of tool/framework (e.g., "python", "react")
Example:
{
"tool": "discover_docs",
"parameters": {
"tool_name": "python"
}
}

Returns:
{
"success": true,
"tool_name": "python",
"urls": [
"https://docs.python.org/3/",
"https://docs.python.org/3/tutorial/",
"https://docs.python.org/3/library/"
],
"count": 3
}

Get a list of all tools with built-in documentation discovery.
No parameters required
Returns:
{
"success": true,
"tools": ["python", "react", "fastapi", "docker", ...],
"count": 20,
"categories": {
"Programming Languages": ["python", "typescript", "rust", "go"],
"Web Frameworks": ["react", "next.js", "vue", "angular", ...],
...
}
}

TechCyclopedia offers flexible configuration options:
- Deep Crawl (default): Crawl all related pages within the domain
- Shallow Crawl: Only crawl the specific URLs provided
- Single Page: Just extract content from the given URLs
- Remove Boilerplate: Automatic removal of navigation, footers, ads
- Organize by Domain: Create domain-specific subdirectories
- Content Optimization: Extract clean, LLM-ready markdown
- Max Depth: How deep to crawl (1-10 levels)
- Max Pages: Maximum number of pages to crawl
- Include External Links: Follow links to other domains
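These options correspond roughly to the knobs exposed by crawl4ai's deep-crawl support, which the project wires up in server/crawler_config.py. The snippet below is only a sketch of how the crawl strategies and advanced settings could map onto crawl4ai's public API; the class and parameter names are crawl4ai's, the specific values are illustrative.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

# Deep Crawl: breadth-first over in-domain links, bounded by depth and page count.
deep_config = CrawlerRunConfig(
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=5,             # "Max Depth" setting
        max_pages=500,           # "Max Pages" setting
        include_external=False,  # "Include External Links" setting
    ),
)

# Shallow Crawl / Single Page: drop the deep-crawl strategy and pass only the
# URLs that were provided.
single_page_config = CrawlerRunConfig()


async def crawl(url: str, config: CrawlerRunConfig):
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy attached, arun() yields one result per crawled page.
        return await crawler.arun(url=url, config=config)
```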
results/
├── docs.python.org/
│ ├── tutorial_index.md
│ ├── introduction.md
│ └── ...
├── ai-sdk.dev/
│ ├── docs_ai-sdk-ui.md
│ └── ...
└── ...
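Pages are grouped into one subdirectory per domain, and the storage layer described in the architecture below additionally uses deterministic MD5-hashed filenames for resource URIs. The helper below is a hypothetical illustration of both naming schemes; `output_path_for` is not the project's actual function.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def output_path_for(url: str, output_dir: str = "results", hashed: bool = False) -> Path:
    """Derive a per-domain output path for a crawled page (illustrative only)."""
    parsed = urlparse(url)
    domain_dir = Path(output_dir) / parsed.netloc  # e.g. results/docs.python.org/
    if hashed:
        # Deterministic MD5-based name, as used for resource URIs.
        name = hashlib.md5(url.encode("utf-8")).hexdigest() + ".md"
    else:
        # Readable name derived from the URL path, as in the tree above.
        slug = parsed.path.strip("/").replace("/", "_") or "index"
        name = f"{slug}.md"
    return domain_dir / name


print(output_path_for("https://docs.python.org/3/tutorial/"))               # results/docs.python.org/3_tutorial.md
print(output_path_for("https://docs.python.org/3/tutorial/", hashed=True))  # results/docs.python.org/<md5>.md
```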
TechCyclopedia is built on a modular architecture with the following components:
┌─────────────────────────────────────────────────────────────┐
│ LLM Agent / Client │
│ (MCP Protocol Consumer) │
└────────────────────────┬────────────────────────────────────┘
│ MCP Tool Call
│ crawl_tech_docs(urls, output_dir)
▼
┌─────────────────────────────────────────────────────────────┐
│ FastMCP2 Server (server.py) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ @mcp.tool async def crawl_tech_docs() │ │
│ │ - Input validation │ │
│ │ - Progress reporting via ctx.info() │ │
│ │ - Resource URI generation (MD5 hashing) │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ async/await
▼
┌─────────────────────────────────────────────────────────────┐
│ Crawl4ai Engine (AsyncWebCrawler) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ BFSDeepCrawlStrategy │ │
│ │ - Breadth-first link discovery │ │
│ │ - Domain scoping (include_external=False) │ │
│ │ - max_depth=5, max_pages=500 │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ Playwright Headless Browser │ │
│ │ - JavaScript rendering │ │
│ │ - Dynamic content handling │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Raw HTML
▼
┌─────────────────────────────────────────────────────────────┐
│ Multi-Stage Content Filter Pipeline │
│ │
│ Stage 1: HTML Pre-Exclusion │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Remove: nav, footer, header, script, style │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ Stage 2: PruningContentFilter │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ Heuristics: │ │
│ │ - Text density scoring │ │
│ │ - Link density penalty │ │
│ │ - Structural context analysis │ │
│ │ - Dynamic threshold adjustment │ │
│ │ - min_word_threshold=20 │ │
│ └────────────────────┬─────────────────────────────────┘ │
│ │ │
│ Stage 3: Global Refinement │
│ ┌────────────────────▼─────────────────────────────────┐ │
│ │ exclude_external_links=True │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Clean Markdown
▼
┌─────────────────────────────────────────────────────────────┐
│ Fit Markdown Output │
│ - 60-80% token reduction vs raw HTML │
│ - Semantic structure preserved │
│ - Technical details intact │
└────────────────────────┬────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Storage Layer (results/) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ MD5-hashed filenames (deterministic) │ │
│ │ Example: 8a1b2c3d4e5f6789abcdef0123456789.md │ │
│ └────────────────────┬─────────────────────────────────┘ │
└────────────────────────┼─────────────────────────────────────┘
│ Resource URIs
▼
┌─────────────────────────────────────────────────────────────┐
│ MCP Resources (for Agent Retrieval) │
│ Agent can read specific resources by URI │
└─────────────────────────────────────────────────────────────┘
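The filter pipeline in the diagram maps onto crawl4ai's markdown generation: boilerplate tags are excluded up front, a PruningContentFilter trims low-value blocks, and the cleaned result is exposed as fit markdown. Below is a sketch of that configuration, assuming crawl4ai's documented class names; the thresholds shown are taken from the diagram, everything else is illustrative.

```python
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

filter_config = CrawlerRunConfig(
    # Stage 1: HTML pre-exclusion of structural boilerplate.
    excluded_tags=["nav", "footer", "header", "script", "style"],
    # Stage 2: heuristic pruning by text/link density with a dynamic threshold.
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold_type="dynamic",
            min_word_threshold=20,
        )
    ),
    # Stage 3: global refinement - drop links that leave the documentation domain.
    exclude_external_links=True,
)


async def fetch_fit_markdown(url: str) -> str:
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=filter_config)
        # With a content filter attached, the filtered output is exposed as fit markdown.
        return result.markdown.fit_markdown
```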
- File: server/server.py - Responsibilities: MCP protocol handling, asynchronous tool execution, input validation, progress reporting, resource URI generation
- File: server/enhanced_crawler.py - Responsibilities: Advanced crawling with progress tracking, organized output, content filtering
- File: server/user_preferences.py - Responsibilities: SQLite-based preference storage, user configuration management
- File: server/interactive_config.py - Responsibilities: User interaction handling, configuration choices, preference management
- File: server/progress_tracker.py - Responsibilities: Task management, progress monitoring, background task coordination
- File: server/crawler_config.py - Responsibilities: Deep crawl strategy definition, URL filtering, content filter configuration
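As a rough illustration of the preference component, the class below sketches a SQLite-backed per-user store with the "don't ask again" flag exposed by the set_user_preference_dont_ask tool. The table name, columns, and method names are hypothetical and not the project's actual schema.

```python
import sqlite3
from typing import Optional


class UserPreferenceStore:
    """Minimal per-user key/value preference store backed by SQLite (illustrative)."""

    def __init__(self, db_path: str = "preferences.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS preferences ("
            " user_id TEXT NOT NULL,"
            " key TEXT NOT NULL,"
            " value TEXT,"
            " PRIMARY KEY (user_id, key))"
        )
        self.conn.commit()

    def set(self, user_id: str, key: str, value: str) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO preferences (user_id, key, value) VALUES (?, ?, ?)",
            (user_id, key, value),
        )
        self.conn.commit()

    def get(self, user_id: str, key: str, default: Optional[str] = None) -> Optional[str]:
        row = self.conn.execute(
            "SELECT value FROM preferences WHERE user_id = ? AND key = ?",
            (user_id, key),
        ).fetchone()
        return row[0] if row else default

    def set_dont_ask_again(self, user_id: str) -> bool:
        # Mirrors the behaviour of the set_user_preference_dont_ask MCP tool.
        self.set(user_id, "dont_ask_again", "true")
        return True


store = UserPreferenceStore()
store.set_dont_ask_again("default")
print(store.get("default", "dont_ask_again"))  # -> "true"
```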
- Test core components: python tests/simple_test.py
- Test MCP server: python tests/mcp_test.py
- Quick test: python tests/quick_test.py
The test suite covers:
- Enhanced Crawler functionality
- User Preferences system
- Progress Tracker
- Interactive Configuration
- MCP tool registration
- Direct function calls
- Tool parameter handling
TechCyclopedia works with any MCP-compatible client. Here are detailed setup instructions for popular clients:
Windows Location:
C:\Users\[YourUsername]\AppData\Roaming\Claude\claude_desktop_config.json
Create or edit the configuration file:
{
"mcpServers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

If you want to use the project's virtual environment:
{
"mcpServers": {
"techcyclopedia": {
"command": "C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\venv\\Scripts\\python.exe",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

- Save the configuration file
- Close Claude Desktop completely
- Reopen Claude Desktop
- TechCyclopedia tools will be available
Try this command in Claude Desktop:
"Use TechCyclopedia to crawl https://docs.python.org/3/tutorial/ and save the results to a folder called 'test_results'"
Create or edit: ~/.continue/config.json
{
"mcpServers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"]
}
}
}

- Open Continue panel in VS Code
- Ask: "Use MCP tool crawl_tech_docs to fetch Python docs"
- Continue will invoke TechCyclopedia automatically
Create or edit: ~/.cursor/mcp.json
{
"servers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"transport": "stdio"
}
}
}

pip install mcp

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
async def crawl_with_techcyclopedia():
    server_params = StdioServerParameters(
        command="python",
        args=["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
        env=None
    )
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {tools}")

            # Call the crawl_tech_docs tool
            result = await session.call_tool(
                "crawl_tech_docs",
                arguments={
                    "urls": ["https://docs.python.org/3/"],
                    "output_dir": "results",
                    "user_id": "my_user"
                }
            )
            print(f"Crawl result: {result}")

            # Check background task
            task_id = await session.call_tool(
                "start_background_crawl",
                arguments={
                    "urls": ["https://ai-sdk.dev/docs/ai-sdk-ui"],
                    "output_dir": "background_results",
                    "user_id": "my_user"
                }
            )
            print(f"Background task started: {task_id}")

asyncio.run(crawl_with_techcyclopedia())

npm install @modelcontextprotocol/sdk

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
async function crawlWithTechCyclopedia() {
const transport = new StdioClientTransport({
command: "python",
args: ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
});
const client = new Client(
{ name: "techcyclopedia-client", version: "1.0.0" },
{ capabilities: {} }
);
await client.connect(transport);
// List available tools
const tools = await client.listTools();
console.log("Available tools:", tools);
// Call crawl_tech_docs tool
const result = await client.callTool({
name: "crawl_tech_docs",
arguments: {
urls: ["https://docs.python.org/3/"],
output_dir: "results",
user_id: "my_user"
},
});
console.log("Crawl result:", result);
// Start background task
const taskId = await client.callTool({
name: "start_background_crawl",
arguments: {
urls: ["https://ai-sdk.dev/docs/ai-sdk-ui"],
output_dir: "background_results",
user_id: "my_user"
},
});
console.log("Background task started:", taskId);
await client.close();
}
crawlWithTechCyclopedia().catch(console.error);

1. Client → Server: Initialize request
2. Server → Client: Initialize response
3. Client → Server: List tools
4. Server → Client: Tools list
5. Client → Server: Call tool
6. Server → Client: Tool result
import subprocess
import json
class SimpleMCPClient:
    def __init__(self, command, args):
        self.process = subprocess.Popen(
            [command] + args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        self.msg_id = 0

    def send_message(self, method, params=None):
        self.msg_id += 1
        message = {
            "jsonrpc": "2.0",
            "id": self.msg_id,
            "method": method,
            "params": params or {}
        }
        self.process.stdin.write(json.dumps(message) + "\n")
        self.process.stdin.flush()
        # Read response
        response = self.process.stdout.readline()
        return json.loads(response)

    def call_tool(self, name, arguments):
        return self.send_message("tools/call", {
            "name": name,
            "arguments": arguments
        })

# Usage
client = SimpleMCPClient("python", ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"])
client.send_message("initialize", {"clientInfo": {"name": "simple", "version": "1.0"}})
result = client.call_tool("crawl_tech_docs", {
    "urls": ["https://docs.python.org/3/"],
    "output_dir": "results",
    "user_id": "my_user"
})
print(result)

npm install -g @modelcontextprotocol/inspector
mcp-inspector python C:\Users\lenovo\Desktop\Mamdo\dumpo\server\server.py

# Test the server directly
python C:\Users\lenovo\Desktop\Mamdo\dumpo\server\server.py

Generic MCP Client Config:
{
"servers": {
"techcyclopedia": {
"command": "python",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
"transport": "stdio",
"env": {
"PYTHONUNBUFFERED": "1"
}
}
}
}

With Virtual Environment:
{
"command": "C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\venv\\Scripts\\python.exe",
"args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"]
}

- Path not found: Use absolute paths (full path starting with C:\)
- Python not found: Use full path to Python executable
- Permission denied: Run client as administrator
- Module not found: Ensure all dependencies are installed
- Test server manually: python server/server.py
- Check file paths are correct
- Verify Python environment
- Check client logs for errors
- Ensure MCP protocol compatibility
git clone https://github.com/NoManNayeem/TechCyclopedia.git
cd TechCyclopedia
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

python server/server.py

Choose your preferred client from the integration examples above and follow the setup instructions.
Use the MCP tools in your client to crawl technical documentation:
Example Commands:
- "Use TechCyclopedia to crawl Python docs"
- "Start a background crawl of React documentation"
- "Check the status of my crawling tasks"
- server/server.py: FastMCP server with tool definitions
- server/enhanced_crawler.py: Advanced crawling with progress tracking
- server/user_preferences.py: SQLite-based preference storage
- server/interactive_config.py: User-friendly configuration prompts
- server/progress_tracker.py: Real-time progress monitoring
User Request → MCP Client → TechCyclopedia Server → Enhanced Crawler → File System
↓ ↓ ↓ ↓ ↓
Configuration → Tool Call → Progress Updates → Content Processing → Organized Output
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: Live Website
TechCyclopedia - Transforming web documentation into clean, AI-ready content for the future of intelligent systems.