A high-performance Python CLI tool that fetches GitHub data (pull requests, reviews, issues, checks, and workflows) and stores it in BigQuery for analytics. Built on async/await for concurrent fetching, it supports batched loading, checkpoint-based resume, and rate-limit-safe operation.
- 🚀 High Performance: Async/concurrent data fetching with configurable parallelism
- 📊 Comprehensive Data: Collects PRs, reviews, issues, comments, checks, and workflow runs
- 🔄 Resume Support: Checkpoint-based recovery for interrupted harvests
- 🛡️ Rate Limit Safe: Automatic rate limit detection and backoff
- 📦 Batch Processing: Efficient bulk loading to BigQuery with MERGE operations
- 🎯 Idempotent: Safe to re-run - updates existing data without duplicates
- 📅 Time-windowed: Process specific date ranges with automatic chunking
- Python 3.12+
- Google Cloud Project with BigQuery enabled
- GitHub Personal Access Token with appropriate permissions
```bash
# Clone the repository
git clone https://github.com/your-org/github-stats.git
cd github-stats

# Install dependencies
pip install -r requirements.txt

# Or install as a package
pip install -e .
```
```bash
# GitHub Authentication
export GITHUB_TOKEN="ghp_your_personal_access_token"

# BigQuery Configuration
export BQ_PROJECT="your-gcp-project"
export BQ_DATASET="devprod_raw"

# Optional
export BQ_LOCATION="US"                                # BigQuery location (defaults to project location)
export DEVPROD_CHECKPOINT_TABLE="devprod_checkpoints"  # Checkpoint table name
export DEVPROD_STAGING_BUCKET="gs://your-bucket"       # For batch loading (optional)
export WORKFLOW_FILTER="bazel_pr_runner.yml"           # Filter to specific workflow (default: bazel_pr_runner.yml)
```

Your GitHub token needs the following scopes:

- `repo` - Access to private repositories (if needed)
- `read:org` - Read organization data
- `read:packages` - Read packages (optional)
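If you want to sanity-check the token before a long harvest, a quick standalone script like the sketch below works for classic personal access tokens, which report their granted scopes in the `X-OAuth-Scopes` response header (illustrative only, not part of the tool; fine-grained tokens do not send this header):

```python
"""Illustrative scope check for a classic GitHub personal access token."""
import os

import httpx

REQUIRED = {"repo", "read:org"}  # read:packages is optional

resp = httpx.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()

# Classic tokens list their granted scopes in this header, e.g. "read:org, repo".
granted = {s.strip() for s in resp.headers.get("x-oauth-scopes", "").split(",") if s.strip()}
missing = REQUIRED - granted
if missing:
    raise SystemExit(f"GITHUB_TOKEN is missing scopes: {sorted(missing)}")
print(f"Token looks good; granted scopes: {sorted(granted)}")
```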
First, create the dataset and tables in BigQuery:
```bash
# Create dataset and tables
devprod init-bq --create-dataset

# Or if the dataset already exists
devprod init-bq
```

Harvest a single repository for a specific date range:

```bash
devprod harvest \
  --repo owner/repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues
```

Harvest an entire organization:

```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues,checks,workflows \
  --concurrency 12
```

Resume an interrupted harvest from its last checkpoint:

```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues \
  --resume
```

For large historical ranges, use the backfill command, which automatically chunks the work:
```bash
devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues \
  --chunk-days 30
```
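What `--chunk-days` does conceptually is split the overall window into fixed-size sub-ranges that are harvested in sequence. A minimal sketch of that splitting (illustrative only; the tool's actual boundary handling lives in its planner and time utilities and may differ):

```python
from datetime import date, timedelta

def chunk_date_range(since: date, until: date, chunk_days: int):
    """Yield consecutive (start, end) windows of at most chunk_days days."""
    start = since
    while start < until:
        end = min(start + timedelta(days=chunk_days), until)
        yield start, end
        start = end

# Roughly what `--since 2024-01-01 --chunk-days 30` implies for a year of history:
for start, end in chunk_date_range(date(2024, 1, 1), date(2025, 1, 1), 30):
    print(f"harvest window: {start} -> {end}")
```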
View the progress of ongoing or completed harvests:

```bash
# Check all checkpoints
devprod status

# Filter by organization
devprod status --org my-organization

# Filter by specific repos and entities
devprod status \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --entities prs,issues
```

The tool can harvest the following GitHub entities:
| Entity | Description | Dependencies |
|---|---|---|
| `prs` | Pull requests with metadata | None |
| `reviews` | PR reviews (approved, changes requested, etc.) | Requires `prs` |
| `pr_comments` | Review comments on code diffs | Requires `prs` |
| `pr_commits` | PR head commits | Requires `prs` |
| `issues` | Issues with metadata | None |
| `issue_comments` | Comments on issues | Requires `issues` |
| `checks` | Check suites for commits | Requires `prs` |
| `workflows` | GitHub Actions workflow runs (filtered by the `WORKFLOW_FILTER` env var) | Requires `prs` |
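The Dependencies column is the contract to keep in mind when choosing `--entities`: dependent entities assume their parent data (for example `prs`) has already been harvested. The helper below is a hypothetical illustration of ordering a requested set so parents run first; it is not the tool's planner.

```python
# Hypothetical dependency map mirroring the table above (not the tool's internal code).
DEPENDENCIES: dict[str, set[str]] = {
    "prs": set(),
    "reviews": {"prs"},
    "pr_comments": {"prs"},
    "pr_commits": {"prs"},
    "issues": set(),
    "issue_comments": {"issues"},
    "checks": {"prs"},
    "workflows": {"prs"},
}

def order_entities(requested: list[str]) -> list[str]:
    """Return the requested entities with parents placed before their dependents."""
    ordered: list[str] = []

    def visit(entity: str) -> None:
        for parent in DEPENDENCIES[entity]:
            if parent in requested:
                visit(parent)
        if entity not in ordered:
            ordered.append(entity)

    for entity in requested:
        visit(entity)
    return ordered

print(order_entities(["reviews", "prs", "issue_comments", "issues"]))
# -> ['prs', 'reviews', 'issues', 'issue_comments']
```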
Data is stored in a raw, append-friendly layer with the following tables:
- `raw_pull_requests` - PR metadata
- `raw_pull_request_reviews` - Review events
- `raw_pull_request_comments` - Code review comments
- `raw_pr_commits` - Commit metadata
- `raw_issues` - Issue metadata
- `raw_issue_comments` - Issue comments
- `raw_check_suites` - CI check results
- `raw_workflow_runs` - GitHub Actions runs
- `devprod_checkpoints` - Harvest progress tracking

All tables are:

- Partitioned by `_ingest_date` (DATE)
- Clustered by `repo_full_name` and primary keys
- Tagged with `_ingest_run_id` for traceability
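Because of that layout, analytics queries stay cheap when they filter on `_ingest_date` and `repo_full_name`. A minimal sketch with the BigQuery client, assuming the `devprod_raw` dataset from the configuration example and the `raw_pull_requests` table listed above:

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # same value as BQ_PROJECT

# Count ingested PR rows per day for one repository; filtering on _ingest_date
# prunes partitions and filtering on repo_full_name benefits from clustering.
sql = """
SELECT _ingest_date, COUNT(*) AS rows_ingested
FROM `your-gcp-project.devprod_raw.raw_pull_requests`
WHERE _ingest_date BETWEEN @start AND @end
  AND repo_full_name = @repo
GROUP BY _ingest_date
ORDER BY _ingest_date
"""

job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start", "DATE", date(2025, 1, 1)),
            bigquery.ScalarQueryParameter("end", "DATE", date(2025, 1, 31)),
            bigquery.ScalarQueryParameter("repo", "STRING", "owner/repo"),
        ]
    ),
)
for row in job.result():
    print(row["_ingest_date"], row["rows_ingested"])
```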
Preview what will be harvested without executing:
```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --dry-run
```

Adjust parallelism based on your rate limits:

```bash
devprod harvest \
  --repo owner/large-repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --concurrency 16 \
  --batch-size 100
```

Harvest multiple specific repositories:
```bash
devprod harvest \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --repo owner/repo3 \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews
```

Run a daily incremental harvest via cron or a scheduler:

```bash
#!/bin/bash
# Run this daily via cron/scheduler
TODAY=$(date +%Y-%m-%d)
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)

devprod harvest \
  --org my-organization \
  --start "$YESTERDAY" \
  --end "$TODAY" \
  --entities prs,reviews,issues \
  --resume
```
```bash
# Backfill all of 2024 in monthly chunks
devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues,checks,workflows \
  --chunk-days 30 \
  --concurrency 8
```

```bash
# Harvest PR data for critical repositories
devprod harvest \
  --repo my-org/frontend \
  --repo my-org/backend \
  --repo my-org/api \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,pr_comments \
  --concurrency 12
```
```bash
# Set workflow filter for bazel_pr_runner
export WORKFLOW_FILTER="bazel_pr_runner.yml"

# Harvest PRs and workflow runs
devprod harvest \
  --repo askscio/scio \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,workflows \
  --concurrency 8 \
  --resume
```

A few tips for reliable runs:

- Start with smaller windows: Test with 7-day windows before running large backfills
- Monitor rate limits: Start with lower concurrency (4-8) and increase gradually
- Use batch loading: Set `DEVPROD_STAGING_BUCKET` for better performance on large datasets
- Run reviews/comments separately: After harvesting PRs, run dependent entities
- Use resume: Always use `--resume` for production runs to handle interruptions
The tool uses a two-phase approach for optimal API usage:
- Search Phase: Uses GraphQL search to find entities in date ranges (cheap API calls)
- Hydrate Phase: Batch fetches detailed data using node IDs (efficient batching)
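In GraphQL terms, the pattern looks roughly like the sketch below: a `search` query that returns only node IDs, followed by a batched `nodes(ids: ...)` query for the full objects. This is a simplified illustration against GitHub's public GraphQL API, not the tool's actual queries (those live in `gh_harvester/github/queries.py`):

```python
import asyncio
import os

import httpx

API = "https://api.github.com/graphql"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

SEARCH = """
query($q: String!) {
  search(query: $q, type: ISSUE, first: 50) {
    nodes { ... on PullRequest { id } }
  }
}
"""

HYDRATE = """
query($ids: [ID!]!) {
  nodes(ids: $ids) {
    ... on PullRequest { number title state createdAt mergedAt }
  }
}
"""

async def harvest_window(repo: str, start: str, end: str) -> list[dict]:
    # http2=True needs the httpx[http2] extra installed.
    async with httpx.AsyncClient(http2=True, headers=HEADERS, timeout=30) as client:
        # Phase 1 (search): a cheap call that only returns node IDs for the window.
        q = f"repo:{repo} is:pr created:{start}..{end}"
        resp = await client.post(API, json={"query": SEARCH, "variables": {"q": q}})
        resp.raise_for_status()
        ids = [n["id"] for n in resp.json()["data"]["search"]["nodes"]]

        # Phase 2 (hydrate): one batched nodes() call fetches the detailed fields.
        resp = await client.post(API, json={"query": HYDRATE, "variables": {"ids": ids}})
        resp.raise_for_status()
        return resp.json()["data"]["nodes"]

if __name__ == "__main__":
    print(asyncio.run(harvest_window("owner/repo", "2025-01-01", "2025-01-31")))
```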
Key components:
- Async HTTP: Uses httpx with HTTP/2 for maximum throughput
- Rate Limiting: Proactive rate limit checking with exponential backoff
- Checkpointing: Persistent progress tracking in BigQuery
- Idempotent Loading: MERGE operations prevent duplicates
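The MERGE-based loading can be pictured as the sketch below: staged rows are folded into the raw table keyed on a stable identifier, so re-running a window updates rows instead of duplicating them. The staging table name, `node_id` key, and `payload` column here are assumptions for illustration; the real schema and loader live in `bq/schema.py` and `bq/loader.py`.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")

# Illustrative upsert from a staging table into the raw layer. Column names other
# than _ingest_date/_ingest_run_id are assumptions made for this sketch.
merge_sql = """
MERGE `your-gcp-project.devprod_raw.raw_pull_requests` AS target
USING `your-gcp-project.devprod_raw._staging_pull_requests` AS source
ON target.node_id = source.node_id
WHEN MATCHED THEN
  UPDATE SET payload = source.payload,
             _ingest_run_id = source._ingest_run_id,
             _ingest_date = source._ingest_date
WHEN NOT MATCHED THEN
  INSERT ROW
"""
client.query(merge_sql).result()
```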
If you encounter rate limits:
- Reduce the `--concurrency` parameter
- The tool will automatically sleep and retry (see the sketch below)
- Use `--resume` to continue from the last checkpoint
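The sleep-and-retry behavior follows the usual GitHub pattern of honoring the rate-limit headers and backing off exponentially on transient errors. A simplified sketch of that idea (illustrative, not the tool's `github/client.py`):

```python
import asyncio
import random
import time

import httpx

async def request_with_backoff(client: httpx.AsyncClient, method: str, url: str, **kwargs) -> httpx.Response:
    """Retry a request, sleeping until the rate-limit window resets or backing off exponentially."""
    for attempt in range(6):
        resp = await client.request(method, url, **kwargs)

        # Primary rate limit exhausted: GitHub reports the budget and reset time in headers.
        if resp.status_code in (403, 429) and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
            await asyncio.sleep(max(reset_at - time.time(), 1))
            continue

        # Transient failures and secondary limits: exponential backoff with jitter.
        if resp.status_code in (429, 500, 502, 503, 504):
            await asyncio.sleep(2 ** attempt + random.random())
            continue

        resp.raise_for_status()
        return resp

    raise RuntimeError(f"Giving up on {url} after repeated retries")
```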
Ensure your GitHub token has the required permissions:
```bash
# Test your token
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user
```

Check your GCP credentials:
```bash
# Authenticate with GCP
gcloud auth application-default login

# Verify project
gcloud config get-value project
```
```bash
# Run all tests
pytest gh_harvester/tests/

# Run specific test file
pytest gh_harvester/tests/test_transforms.py

# Run with coverage
pytest --cov=gh_harvester gh_harvester/tests/
```

Project layout:

```
gh_harvester/
├── cli.py              # CLI commands
├── config.py           # Configuration management
├── github/
│   ├── client.py       # GitHub API client
│   ├── queries.py      # GraphQL query templates
│   ├── search.py       # Search operations
│   └── resolvers.py    # Data resolution
├── harvest/
│   ├── planner.py      # Task planning
│   ├── runner.py       # Async execution
│   ├── checkpoints.py  # Progress tracking
│   └── transforms.py   # Data normalization
├── bq/
│   ├── ddl.py          # Table creation
│   ├── schema.py       # Schema definitions
│   └── loader.py       # Data loading
└── utils/
    ├── time.py         # Time utilities
    ├── ids.py          # ID management
    └── logging.py      # Structured logging
```
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
[Add your license here]
For issues and questions:
- Open a GitHub issue
- Contact the development team
Built with:
- Typer - CLI framework
- httpx - Async HTTP client
- Google Cloud BigQuery - Data warehouse
- Rich - Terminal formatting