TechCyclopedia

High-Fidelity Agent Data Pipeline for Technical Documentation Crawling


TechCyclopedia is an intelligent MCP (Model Context Protocol) server that provides advanced web crawling capabilities with interactive configuration, live progress tracking, and background task management for technical documentation.

Features

  • Smart Documentation Discovery: Automatically finds documentation for 20+ popular tools (Python, React, FastAPI, etc.)
  • Intelligent URL Processing: Accepts tool names or URLs, automatically discovering and validating documentation links
  • Interactive Configuration: Smart user preference system with persistent storage
  • Live Progress Tracking: Real-time updates on crawling progress with detailed status
  • Background Tasks: Run crawls in the background while continuing to chat
  • Organized Output: Clean, structured file organization by domain
  • Content Filtering: Automatic boilerplate removal and content optimization
  • Comprehensive Error Handling: Robust validation, detailed error messages, graceful degradation
  • File Size Management: Automatic size checking and filtering (10MB limit)
  • Persistent Preferences: SQLite-based user preference storage
  • Deep Crawling: BFS strategy for comprehensive documentation extraction

🌐 Live Demo

Visit the live website: https://nomanayeem.github.io/TechCyclopedia

The website showcases TechCyclopedia's features, usage examples, and provides comprehensive documentation.

What's New in v2.0 🎉

  • 🎯 Smart Tool Discovery: Just type python or react instead of full URLs
  • ✅ Enhanced Validation: Comprehensive URL validation with detailed error messages
  • 🛡️ Better Error Handling: Graceful failure handling, detailed error reports
  • 📊 Rich Return Values: Get detailed statistics and metadata from all operations
  • 🔍 New Tools: discover_docs and list_supported_tools for exploration
  • 📏 File Size Management: Automatic size checking (10MB limit)
  • 🐛 Bug Fixes: Fixed datetime deprecation warnings

See IMPROVEMENTS.md for detailed documentation.

Quick Start

Prerequisites

  • Python 3.9+
  • pip package manager

Installation

  1. Clone the repository:
   git clone https://github.com/NoManNayeem/TechCyclopedia.git
   cd TechCyclopedia
  2. Create a virtual environment:
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
   pip install -r requirements.txt
   playwright install
  4. Start the server:
   python server/server.py

Usage

MCP Tools Available

TechCyclopedia provides several powerful MCP tools:

1. crawl_tech_docs - Smart Documentation Crawling

Crawl technical documentation with automatic URL discovery and validation.

NEW: Tool names are now supported! Pass "python" instead of a full URL.

Parameters:

  • urls: List of URLs or tool names (e.g., ["python", "react", "https://example.com"])
  • output_dir: Directory path where MD files will be saved
  • user_id: User identifier for preferences (default: "default")

Examples:

// Using tool names (NEW!)
{
  "tool": "crawl_tech_docs",
  "parameters": {
    "urls": ["python", "fastapi"],
    "output_dir": "results",
    "user_id": "default"
  }
}

// Using direct URLs (still works)
{
  "tool": "crawl_tech_docs",
  "parameters": {
    "urls": ["https://docs.python.org/3/tutorial/"],
    "output_dir": "results",
    "user_id": "default"
  }
}

// Mix both!
{
  "tool": "crawl_tech_docs",
  "parameters": {
    "urls": ["python", "https://custom-docs.com"],
    "output_dir": "results"
  }
}

Returns:

{
  "success": true,
  "task_id": "uuid",
  "urls_processed": 3,
  "files_created": 45,
  "output_directory": "/path/to/results",
  "files": ["file1.md", "file2.md", ...],
  "inputs_processed": ["Found 3 documentation URL(s) for 'python'"],
  "validation_errors": null
}

2. start_background_crawl - Background Tasks

Start a background crawling task that can run while the user continues chatting.

Parameters:

  • urls: List of starting URLs for documentation crawl
  • output_dir: Directory path where MD files will be saved
  • user_id: User identifier for preferences

Returns: Task ID for tracking the background task
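
Example (a sketch using the parameters listed above; "react" resolves through the same tool-name discovery as crawl_tech_docs):

{
  "tool": "start_background_crawl",
  "parameters": {
    "urls": ["react"],
    "output_dir": "background_results",
    "user_id": "default"
  }
}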

3. check_task_status - Monitor Progress

Check the status of a background crawling task.

Parameters:

  • task_id: Task ID returned by start_background_crawl

Returns: Dictionary with task status and progress information
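
Example (the task_id value here is a placeholder; pass the ID returned by start_background_crawl):

{
  "tool": "check_task_status",
  "parameters": {
    "task_id": "<task-id-from-start_background_crawl>"
  }
}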

4. get_all_tasks - View All Tasks

Get all crawling tasks and their status.

Returns: List of all tasks with their status information
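
Example (no arguments are needed):

{
  "tool": "get_all_tasks",
  "parameters": {}
}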

5. set_dont_ask_again - Save Preferences

Set the 'don't ask again' flag for user preferences.

Parameters:

  • user_id: User identifier

Returns: True if successfully set
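
Example (using the default user, as in the examples above):

{
  "tool": "set_dont_ask_again",
  "parameters": {
    "user_id": "default"
  }
}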

6. discover_docs - Find Documentation URLs (NEW!)

Discover documentation URLs for a specific tool without starting a crawl.

Parameters:

  • tool_name: Name of tool/framework (e.g., "python", "react")

Example:

{
  "tool": "discover_docs",
  "parameters": {
    "tool_name": "python"
  }
}

Returns:

{
  "success": true,
  "tool_name": "python",
  "urls": [
    "https://docs.python.org/3/",
    "https://docs.python.org/3/tutorial/",
    "https://docs.python.org/3/library/"
  ],
  "count": 3
}

7. list_supported_tools - List Available Tools (NEW!)

Get a list of all tools with built-in documentation discovery.

No parameters required
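
Example:

{
  "tool": "list_supported_tools",
  "parameters": {}
}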

Returns:

{
  "success": true,
  "tools": ["python", "react", "fastapi", "docker", ...],
  "count": 20,
  "categories": {
    "Programming Languages": ["python", "typescript", "rust", "go"],
    "Web Frameworks": ["react", "next.js", "vue", "angular", ...],
    ...
  }
}

Configuration Options

TechCyclopedia offers flexible configuration options:

Crawl Strategies

  • Deep Crawl (default): Crawl all related pages within the domain
  • Shallow Crawl: Only crawl the specific URLs provided
  • Single Page: Just extract content from the given URLs

Content Processing

  • Remove Boilerplate: Automatic removal of navigation, footers, ads
  • Organize by Domain: Create domain-specific subdirectories
  • Content Optimization: Extract clean, LLM-ready markdown

Advanced Options

  • Max Depth: How deep to crawl (1-10 levels)
  • Max Pages: Maximum number of pages to crawl
  • Include External Links: Follow links to other domains
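
As a sketch of how these options map onto the crawl4ai engine described under Architecture (the values mirror the diagram there; the actual wiring in server/crawler_config.py may differ):

# A minimal sketch, assuming crawl4ai's BFSDeepCrawlStrategy as referenced
# in the Architecture diagram; real values live in server/crawler_config.py.
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

strategy = BFSDeepCrawlStrategy(
    max_depth=5,             # Max Depth: how deep to follow links (1-10)
    max_pages=500,           # Max Pages: upper bound on pages per crawl
    include_external=False,  # Include External Links: stay within the domain
)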

Output Structure

results/
├── docs.python.org/
│   ├── tutorial_index.md
│   ├── introduction.md
│   └── ...
├── ai-sdk.dev/
│   ├── docs_ai-sdk-ui.md
│   └── ...
└── ...
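
A hypothetical helper illustrating the domain-grouped layout above (the real naming logic in server/enhanced_crawler.py may differ):

# Hypothetical sketch: group output files into a subdirectory named after
# the page's domain, as in the tree above. Actual naming may differ.
from pathlib import Path
from urllib.parse import urlparse

def output_path(url: str, output_dir: str = "results") -> Path:
    parts = urlparse(url)
    domain = parts.netloc                                   # e.g. "docs.python.org"
    slug = parts.path.strip("/").replace("/", "_") or "index"
    return Path(output_dir) / domain / f"{slug}.md"

print(output_path("https://docs.python.org/3/tutorial/"))  # results/docs.python.org/3_tutorial.md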

Architecture

System Overview

TechCyclopedia is built on a modular architecture with the following components:

┌─────────────────────────────────────────────────────────────┐
│                     LLM Agent / Client                       │
│                  (MCP Protocol Consumer)                     │
└────────────────────────┬────────────────────────────────────┘
                         │ MCP Tool Call
                         │ crawl_tech_docs(urls, output_dir)
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                  FastMCP2 Server (server.py)                 │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  @mcp.tool async def crawl_tech_docs()               │   │
│  │  - Input validation                                  │   │
│  │  - Progress reporting via ctx.info()                 │   │
│  │  - Resource URI generation (MD5 hashing)            │   │
│  └────────────────────┬─────────────────────────────────┘   │
└────────────────────────┼─────────────────────────────────────┘
                         │ async/await
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              Crawl4ai Engine (AsyncWebCrawler)               │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  BFSDeepCrawlStrategy                                │   │
│  │  - Breadth-first link discovery                      │   │
│  │  - Domain scoping (include_external=False)          │   │
│  │  - max_depth=5, max_pages=500                       │   │
│  └────────────────────┬─────────────────────────────────┘   │
│                       │                                      │
│  ┌────────────────────▼─────────────────────────────────┐   │
│  │  Playwright Headless Browser                         │   │
│  │  - JavaScript rendering                              │   │
│  │  - Dynamic content handling                          │   │
│  └────────────────────┬─────────────────────────────────┘   │
└────────────────────────┼─────────────────────────────────────┘
                         │ Raw HTML
                         ▼
┌─────────────────────────────────────────────────────────────┐
│            Multi-Stage Content Filter Pipeline               │
│                                                              │
│  Stage 1: HTML Pre-Exclusion                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  Remove: nav, footer, header, script, style          │   │
│  └────────────────────┬─────────────────────────────────┘   │
│                       │                                      │
│  Stage 2: PruningContentFilter                              │
│  ┌────────────────────▼─────────────────────────────────┐   │
│  │  Heuristics:                                          │   │
│  │  - Text density scoring                               │   │
│  │  - Link density penalty                               │   │
│  │  - Structural context analysis                        │   │
│  │  - Dynamic threshold adjustment                       │   │
│  │  - min_word_threshold=20                             │   │
│  └────────────────────┬─────────────────────────────────┘   │
│                       │                                      │
│  Stage 3: Global Refinement                                 │
│  ┌────────────────────▼─────────────────────────────────┐   │
│  │  exclude_external_links=True                          │   │
│  └────────────────────┬─────────────────────────────────┘   │
└────────────────────────┼─────────────────────────────────────┘
                         │ Clean Markdown
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                  Fit Markdown Output                         │
│  - 60-80% token reduction vs raw HTML                       │
│  - Semantic structure preserved                             │
│  - Technical details intact                                 │
└────────────────────────┬────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                Storage Layer (results/)                      │
│  ┌──────────────────────────────────────────────────────┐   │
│  │  MD5-hashed filenames (deterministic)                 │   │
│  │  Example: 8a1b2c3d4e5f6789abcdef0123456789.md        │   │
│  └────────────────────┬─────────────────────────────────┘   │
└────────────────────────┼─────────────────────────────────────┘
                         │ Resource URIs
                         ▼
┌─────────────────────────────────────────────────────────────┐
│              MCP Resources (for Agent Retrieval)             │
│  Agent can read specific resources by URI                   │
└─────────────────────────────────────────────────────────────┘
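
A minimal sketch of the pipeline above, assuming crawl4ai's public configuration objects and the values shown in the diagram; the project's actual setup lives in server/crawler_config.py:

# Sketch of the three filter stages plus the storage-layer naming scheme;
# values mirror the diagram and may differ from the project's real config.
import hashlib

from crawl4ai import CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

run_config = CrawlerRunConfig(
    # Stage 1: HTML pre-exclusion of boilerplate tags
    excluded_tags=["nav", "footer", "header", "script", "style"],
    # Stage 2: heuristic pruning (text/link density, dynamic threshold)
    markdown_generator=DefaultMarkdownGenerator(
        content_filter=PruningContentFilter(
            threshold_type="dynamic",
            min_word_threshold=20,
        )
    ),
    # Stage 3: global refinement
    exclude_external_links=True,
)

# Storage layer: deterministic MD5-hashed filename for a crawled URL
url = "https://docs.python.org/3/tutorial/"
filename = hashlib.md5(url.encode("utf-8")).hexdigest() + ".md"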

Component Details

1. FastMCP2 Server Layer

  • File: server/server.py
  • Responsibilities: MCP protocol handling, asynchronous tool execution, input validation, progress reporting, resource URI generation

2. Enhanced Crawler

  • File: server/enhanced_crawler.py
  • Responsibilities: Advanced crawling with progress tracking, organized output, content filtering

3. User Preferences System

  • File: server/user_preferences.py
  • Responsibilities: SQLite-based preference storage, user configuration management

4. Interactive Configuration

  • File: server/interactive_config.py
  • Responsibilities: User interaction handling, configuration choices, preference management

5. Progress Tracker

  • File: server/progress_tracker.py
  • Responsibilities: Task management, progress monitoring, background task coordination

6. Crawler Configuration

  • File: server/crawler_config.py
  • Responsibilities: Deep crawl strategy definition, URL filtering, content filter configuration

Testing

Running Tests

  1. Test core components:

    python tests/simple_test.py
  2. Test MCP server:

    python tests/mcp_test.py
  3. Quick test:

    python tests/quick_test.py

Test Coverage

The test suite covers:

  • Enhanced Crawler functionality
  • User Preferences system
  • Progress Tracker
  • Interactive Configuration
  • MCP tool registration
  • Direct function calls
  • Tool parameter handling

MCP Client Integration

TechCyclopedia works with any MCP-compatible client. Here are detailed setup instructions for popular clients:

Claude Desktop Integration

1. Find Claude Desktop Configuration

Windows Location:

C:\Users\[YourUsername]\AppData\Roaming\Claude\claude_desktop_config.json

2. Add TechCyclopedia Configuration

Create or edit the configuration file, replacing the path with the location of your clone:

{
  "mcpServers": {
    "techcyclopedia": {
      "command": "python",
      "args": ["C:\\path\\to\\TechCyclopedia\\server\\server.py"],
      "env": {
        "PYTHONUNBUFFERED": "1"
      }
    }
  }
}

3. Alternative: Using Virtual Environment

If you want to use the project's virtual environment:

{
  "mcpServers": {
    "techcyclopedia": {
      "command": "C:\\path\\to\\TechCyclopedia\\venv\\Scripts\\python.exe",
      "args": ["C:\\path\\to\\TechCyclopedia\\server\\server.py"],
      "env": {
        "PYTHONUNBUFFERED": "1"
      }
    }
  }
}

4. Save and Restart

  1. Save the configuration file
  2. Close Claude Desktop completely
  3. Reopen Claude Desktop
  4. TechCyclopedia tools will be available

5. Test in Claude Desktop

Try this command in Claude Desktop:

"Use TechCyclopedia to crawl https://docs.python.org/3/tutorial/ and save the results to a folder called 'test_results'"

Continue (VS Code Extension)

1. Configuration File

Create or edit: ~/.continue/config.json

{
  "mcpServers": {
    "techcyclopedia": {
      "command": "python",
      "args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"]
    }
  }
}

2. Usage in VS Code

  1. Open Continue panel in VS Code
  2. Ask: "Use MCP tool crawl_tech_docs to fetch Python docs"
  3. Continue will invoke TechCyclopedia automatically

Cursor IDE Integration

1. Configuration File

Create or edit: ~/.cursor/mcp.json

{
  "servers": {
    "techcyclopedia": {
      "command": "python",
      "args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
      "transport": "stdio"
    }
  }
}

Python MCP Client

1. Install MCP SDK

pip install mcp

2. Example Client Code

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def crawl_with_techcyclopedia():
    server_params = StdioServerParameters(
        command="python",
        args=["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
        env=None
    )
    
    async with stdio_client(server_params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            
            # List available tools
            tools = await session.list_tools()
            print(f"Available tools: {tools}")
            
            # Call the crawl_tech_docs tool
            result = await session.call_tool(
                "crawl_tech_docs",
                arguments={
                    "urls": ["https://docs.python.org/3/"],
                    "output_dir": "results",
                    "user_id": "my_user"
                }
            )
            print(f"Crawl result: {result}")
            
            # Start a background crawl; the result contains the task ID
            # to pass to check_task_status later
            bg_task = await session.call_tool(
                "start_background_crawl",
                arguments={
                    "urls": ["https://ai-sdk.dev/docs/ai-sdk-ui"],
                    "output_dir": "background_results",
                    "user_id": "my_user"
                }
            )
            print(f"Background task started: {bg_task}")

asyncio.run(crawl_with_techcyclopedia())

JavaScript/TypeScript MCP Client

1. Install MCP SDK

npm install @modelcontextprotocol/sdk

2. Example Client Code

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function crawlWithTechCyclopedia() {
  const transport = new StdioClientTransport({
    command: "python",
    args: ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
  });

  const client = new Client(
    { name: "techcyclopedia-client", version: "1.0.0" },
    { capabilities: {} }
  );

  await client.connect(transport);
  
  // List available tools
  const tools = await client.listTools();
  console.log("Available tools:", tools);
  
  // Call crawl_tech_docs tool
  const result = await client.callTool({
    name: "crawl_tech_docs",
    arguments: {
      urls: ["https://docs.python.org/3/"],
      output_dir: "results",
      user_id: "my_user"
    },
  });

  console.log("Crawl result:", result);
  
  // Start background task
  const taskId = await client.callTool({
    name: "start_background_crawl",
    arguments: {
      urls: ["https://ai-sdk.dev/docs/ai-sdk-ui"],
      output_dir: "background_results",
      user_id: "my_user"
    },
  });
  
  console.log("Background task started:", taskId);
  
  await client.close();
}

crawlWithTechCyclopedia().catch(console.error);

Custom MCP Client Integration

1. Basic Protocol Flow

1. Client → Server: Initialize request
2. Server → Client: Initialize response  
3. Client → Server: List tools
4. Server → Client: Tools list
5. Client → Server: Call tool
6. Server → Client: Tool result

2. Example: Minimal Python Client

import subprocess
import json

class SimpleMCPClient:
    def __init__(self, command, args):
        self.process = subprocess.Popen(
            [command] + args,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True
        )
        self.msg_id = 0

    def send_message(self, method, params=None):
        self.msg_id += 1
        message = {
            "jsonrpc": "2.0",
            "id": self.msg_id,
            "method": method,
            "params": params or {}
        }
        self.process.stdin.write(json.dumps(message) + "\n")
        self.process.stdin.flush()

        # Read one response line (simplified: assumes exactly one JSON-RPC
        # reply per request and no interleaved notifications)
        response = self.process.stdout.readline()
        return json.loads(response)

    def call_tool(self, name, arguments):
        return self.send_message("tools/call", {
            "name": name,
            "arguments": arguments
        })

# Usage
client = SimpleMCPClient("python", ["C:\\path\\to\\TechCyclopedia\\server\\server.py"])
# The MCP initialize request requires protocolVersion and capabilities; a
# complete client also sends a "notifications/initialized" notification
# before making tool calls.
client.send_message("initialize", {
    "protocolVersion": "2024-11-05",
    "capabilities": {},
    "clientInfo": {"name": "simple", "version": "1.0"}
})
result = client.call_tool("crawl_tech_docs", {
    "urls": ["https://docs.python.org/3/"],
    "output_dir": "results",
    "user_id": "my_user"
})
print(result)

Testing Your Integration

1. Using MCP Inspector

npm install -g @modelcontextprotocol/inspector
mcp-inspector python C:\path\to\TechCyclopedia\server\server.py

2. Manual Testing

# Test the server directly
python C:\path\to\TechCyclopedia\server\server.py

3. Configuration Examples

Generic MCP Client Config:

{
  "servers": {
    "techcyclopedia": {
      "command": "python",
      "args": ["C:\\Users\\lenovo\\Desktop\\Mamdo\\dumpo\\server\\server.py"],
      "transport": "stdio",
      "env": {
        "PYTHONUNBUFFERED": "1"
      }
    }
  }
}

With Virtual Environment:

{
  "command": "C:\\path\\to\\TechCyclopedia\\venv\\Scripts\\python.exe",
  "args": ["C:\\path\\to\\TechCyclopedia\\server\\server.py"]
}

Troubleshooting

Common Issues:

  1. Path not found: Use absolute paths (on Windows, a full path starting with C:\)
  2. Python not found: Use full path to Python executable
  3. Permission denied: Run client as administrator
  4. Module not found: Ensure all dependencies are installed

Debug Steps:

  1. Test server manually: python server/server.py
  2. Check file paths are correct
  3. Verify Python environment
  4. Check client logs for errors
  5. Ensure MCP protocol compatibility

Quick Start Guide

1. Installation

git clone https://github.com/NoManNayeem/TechCyclopedia.git
cd TechCyclopedia
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
playwright install

2. Test the Server

python server/server.py

3. Configure Your MCP Client

Choose your preferred client from the integration examples above and follow the setup instructions.

4. Start Crawling

Use the MCP tools in your client to crawl technical documentation:

Example Commands:

  • "Use TechCyclopedia to crawl Python docs"
  • "Start a background crawl of React documentation"
  • "Check the status of my crawling tasks"

Architecture Summary

Core Components

  • server/server.py: FastMCP server with tool definitions
  • server/enhanced_crawler.py: Advanced crawling with progress tracking
  • server/user_preferences.py: SQLite-based preference storage
  • server/interactive_config.py: User-friendly configuration prompts
  • server/progress_tracker.py: Real-time progress monitoring

Data Flow

User Request → MCP Client → TechCyclopedia Server → Enhanced Crawler → File System
     ↓              ↓              ↓                    ↓              ↓
Configuration → Tool Call → Progress Updates → Content Processing → Organized Output

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

TechCyclopedia - Transforming web documentation into clean, AI-ready content for the future of intelligent systems.
