This project develops a multi-modal AI system that tackles a persistent developer problem: keeping documentation up to date and informative. The core service is the Automatic README Generation AI, which transforms raw GitHub code into structured Markdown documentation. This feature is integrated into a larger career management platform, aiming to streamline developer workflows and enhance job-seeking capabilities.
Manually documenting extensive codebases is tedious and error-prone. Furthermore, feeding entire code files to Large Language Models (LLMs) is prohibitively expensive due to high token consumption.
Our solution: an Abstract Syntax Tree (AST) analysis pipeline that pre-processes code, drastically reducing the token budget while maximizing the information density of the input prompt.
Examples:

- `feat/login-api`
- `fix/comment-delete-bug`
- `test/user-service-test`
Type List:

| Type | Description |
|---|---|
| feat | Add a new feature |
| fix | Fix a bug |
| refactor | Improve code quality without changing functionality |
| test | Add or modify test code |
| hotfix | Apply an urgent fix |
This video showcases the end-to-end functionality, from inputting a GitHub repository URL to receiving the final, structured Markdown README file.
The system utilizes a specialized combination of models for task-specific performance, leveraging both local GPU infrastructure and external, high-throughput APIs.
The README generation process is engineered for efficiency and code comprehension:
| Stage | Process | Key Technology / Model | Rationale |
|---|---|---|---|
| I. Code Ingestion | Retrieve target code files from a linked GitHub repository. | GitHub API Integration | Ensures access to the most current codebase. |
| II. Token Optimization | Convert raw code (e.g., Python) into an Abstract Syntax Tree (AST). Only critical nodes (function definitions, library imports, key logic flow) are extracted. | AST Parser (libcst or equivalent) | CRITICAL for cost reduction and focus. Reduces LLM input token count by up to 90%. |
| III. Generation | The processed AST metadata is transformed into a concise, context-rich prompt and fed to the LLM. | QwenCoder Model | Selected for superior performance, as detailed below. |
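To make Stages I–II concrete, here is a minimal sketch, not the project's actual API: `fetch_file` and `compress_source` are illustrative names, and the digest format is an assumption. It pulls one file through the GitHub contents API, then keeps only imports, function signatures, and first docstring lines:

```python
import requests
import libcst as cst

def fetch_file(owner: str, repo: str, path: str, token: str | None = None) -> str:
    """Stage I (sketch): fetch one file's raw content via the GitHub contents API."""
    headers = {"Accept": "application/vnd.github.raw+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.text

class SignatureCollector(cst.CSTVisitor):
    """Stage II (sketch): keep only high-signal nodes (imports, function signatures)."""

    def __init__(self, module: cst.Module) -> None:
        self.module = module  # needed to render matched nodes back to source text
        self.lines: list[str] = []

    def visit_Import(self, node: cst.Import) -> None:
        self.lines.append(self.module.code_for_node(node).strip())

    def visit_ImportFrom(self, node: cst.ImportFrom) -> None:
        self.lines.append(self.module.code_for_node(node).strip())

    def visit_FunctionDef(self, node: cst.FunctionDef) -> None:
        params = self.module.code_for_node(node.params).strip()
        doc = (node.get_docstring() or "").splitlines()
        summary = doc[0] if doc else ""
        self.lines.append(f"def {node.name.value}({params})  # {summary}")

def compress_source(source: str) -> str:
    """Reduce a raw source file to a compact digest for the LLM prompt."""
    module = cst.parse_module(source)
    collector = SignatureCollector(module)
    module.visit(collector)
    return "\n".join(collector.lines)
```

Because the digest drops function bodies while preserving names, parameters, and docstrings, the LLM still sees the code's structure and intent at a fraction of the token cost.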
The QwenCoder model was chosen as the primary engine for README generation based on stringent performance criteria:
| Criterion | QwenCoder Performance | Justification |
|---|---|---|
| Code Understanding | High | Proven ability to grasp complex code structure and context. |
| Multilingual Support | High | Essential for processing code in diverse programming languages. |
| MultiPL-E Score | High Ranking | Verifies strong performance on the Multi-Programming Language Evaluation benchmark. |
| McEval Score | High Ranking | Verifies superior performance on the Massively Multilingual Code Evaluation benchmark. |
A separate ensemble system analyzes repository content to recommend technical tags, prioritizing accuracy and diversity.
| Model / System | Role | Execution Method | Rationale |
|---|---|---|---|
| Gemini-thinking | Inference & Reasoning | Multi-threaded API Call | Utilized for strong reasoning and structural interpretation. |
| Qwen-coder-32b | Code-Specific Analysis | Multi-threaded API Call | Provides robust, code-centric analysis. |
| llama-versatile | Auxiliary Analysis | Multi-threaded API Call | Contributes diverse perspectives and maintains high throughput. |
| Ensemble Aggregation | Final Decision | Bagging Technique (JSON Output) | Combines results (majority vote) to significantly boost tag reliability. |
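As an illustration of this aggregation step, here is a minimal sketch of the parallel fan-out and majority vote. `query_model` is a hypothetical stand-in for the real Gemini/Groq client calls, and the `{"tags": [...]}` JSON shape is an assumption:

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

MODELS = ["gemini-thinking", "qwen-coder-32b", "llama-versatile"]

def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical stand-in: the real project calls Gemini via google-generativeai
    and Qwen/Llama via the groq client. Each returns a JSON string of tags."""
    raise NotImplementedError("wire up the Gemini / Groq clients here")

def recommend_tags(prompt: str, min_votes: int = 2) -> list[str]:
    """Fan the same prompt out to all models in parallel, then majority-vote the tags."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        raw = list(pool.map(lambda m: query_model(m, prompt), MODELS))
    votes: Counter[str] = Counter()
    for payload in raw:
        try:
            votes.update(json.loads(payload).get("tags", []))
        except json.JSONDecodeError:
            continue  # a malformed response simply loses its vote
    return [tag for tag, count in votes.most_common() if count >= min_votes]
```

Requiring at least two of the three models to agree (`min_votes=2`) implements the majority vote described in the table.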
| Dependency | Purpose | Installation |
|---|---|---|
| Python | Core runtime environment. | Python 3.10+ |
| Libraries / SDKs | Required for LLM inference, AST analysis, and external model access. | pip install -r requirements.txt (See requirements.txt for exact versions) |
This project was developed and tested with the following key libraries:
| Package | Version | Purpose |
|---|---|---|
| python | 3.10.x | Core runtime environment |
| libcst | 1.1.0 | AST-based code parsing |
| requests | 2.31.0 | GitHub API communication |
| python-dotenv | 1.0.1 | Environment variable management |
| groq | 0.9.0 | Groq API client |
| google-generativeai | 0.5.2 | Gemini model access |
| tqdm | 4.66.1 | Progress bar visualization |
For the full dependency list, see `requirements.txt`.
To ensure secure access to external LLM services, create a file named .env in the project's root directory:
```env
# --- External AI Service Credentials ---

# Google AI (Gemini) API Key for the Code Analysis Ensemble
# Used for its robust inference capabilities.
GEMINI_API_KEY="YOUR_GEMINI_API_KEY_HERE"

# Groq API Key for the Code Analysis Ensemble
# Used for its high-speed inference in multi-model parallel processing.
GROQ_API_KEY="YOUR_GROQ_API_KEY_HERE"

# --- Local Server Configuration ---

# URL for the local vLLM server hosting the BART model (for summarization feature)
BART_VLLM_SERVER_URL="http://127.0.0.1:8000/v1/completions"
```
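For reference, a minimal sketch of how an application can read these values at startup with `python-dotenv`; the exact variable handling in this project's code may differ:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # loads key-value pairs from .env into the process environment

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not (GEMINI_API_KEY and GROQ_API_KEY):
    raise RuntimeError("Missing API keys: check your .env file")
```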
- Install dependencies
  - Recommended: use a Python 3.10 virtual environment (venv or conda)
  - PowerShell (Windows) example:

    ```powershell
    # venv example
    python -m venv .venv
    .\.venv\Scripts\Activate.ps1
    python -m pip install --upgrade pip
    python -m pip install -r .\requirements.txt

    # or conda example
    conda create -n gs-env python=3.10 -y
    conda activate gs-env
    python -m pip install -r .\requirements.txt
    ```

- API keys (`.env`)
  - Create a `.env` file in the project root and set required keys.
  - Main environment variables used by the code:
    - `GROQ_API_KEY` (used for README generation)
    - `GOOGLE_API_KEY` (used for Gemini / Google generativeai calls)
    - `GITHUB_TOKEN` (recommended for GitHub API requests, optional)
  - Example `.env`:

    ```env
    GROQ_API_KEY=your_groq_api_key_here
    GOOGLE_API_KEY=your_google_api_key_here
    GITHUB_TOKEN=ghp_xxx... # optional: use to increase rate limits / access private repos
    ```
- vLLM / local LLM server (optional)
  - This project uses external APIs (Groq, Google) by default for README/tag generation. A local vLLM server or GPU is not required.
  - Only set up a local vLLM server if you plan to run local models; doing so may require changes to `.env` and the code. A minimal client sketch appears after this list.
- Run the script
  - The entry point is `main.py`. Pass the GitHub repository URL as the argument.
  - Basic run example (PowerShell):

    ```powershell
    # Default run: generate README, extract tags, download image
    python .\main.py "https://github.com/owner/repo"
    ```

  - Options:
    - `--out <folder>`: output directory to save results (default: `output`)
    - `--no-readme`: skip README generation
    - `--no-tags`: skip tag extraction
    - `--no-image`: skip image selection/download
  - Example (specify output folder, skip image):

    ```powershell
    python .\main.py "https://github.com/owner/repo" --out .\results --no-image
    ```

- Output files and locations
  - Default output structure: `output/<owner__repo>/`
    - `GENERATED_README.md`: generated README (if produced)
    - `TAGS.json`: tag extraction result (if produced)
    - `repo_image.<ext>`: selected repository image (if produced)
  - If `.env` is missing or API keys are not set, some features (README generation, tag extraction) may not run. The script will print errors or skip those steps.
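If you do opt into the local vLLM route mentioned above, the server speaks the OpenAI-compatible completions protocol. A minimal client sketch, assuming the `BART_VLLM_SERVER_URL` value from `.env`; the model id below is illustrative and should match whatever model your server actually loads:

```python
import os
import requests

# Minimal client for the optional local vLLM server (OpenAI-compatible API).
url = os.getenv("BART_VLLM_SERVER_URL", "http://127.0.0.1:8000/v1/completions")
payload = {
    "model": "facebook/bart-large-cnn",  # assumption: replace with the served model id
    "prompt": "Summarize: ...",
    "max_tokens": 128,
}
resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```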
This project is grounded in prior research and official technical documentation related to code analysis, token optimization, and large language models.
- LibCST Documentation – Concrete Syntax Tree for Python. https://libcst.readthedocs.io/
- Baxter, I. D., et al. Clone Detection Using Abstract Syntax Trees. IEEE, 1998.
- Cassano et al. MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation. 2022. https://arxiv.org/abs/2208.08227
- Groq API Documentation – High-throughput inference platform used for fast parallel LLM execution. https://console.groq.com/docs
- Google AI Studio (Gemini API) – Official platform for accessing Gemini models and experiment management. https://aistudio.google.com/
- Dietterich, T. G. Ensemble Methods in Machine Learning. Springer, 2000.
