A high-performance Python CLI tool that fetches GitHub data (pull requests, reviews, issues, checks, and workflows) and stores it in BigQuery for analytics. Built on async/await for concurrent fetching, it supports batched loading, checkpoint-based resume, and rate-limit-safe operation.
- 🚀 High Performance: Async/concurrent data fetching with configurable parallelism
- 📊 Comprehensive Data: Collects PRs, reviews, issues, comments, checks, and workflow runs
- 🔄 Resume Support: Checkpoint-based recovery for interrupted harvests
- 🛡️ Rate Limit Safe: Automatic rate limit detection and backoff
- 📦 Batch Processing: Efficient bulk loading to BigQuery with MERGE operations
- 🎯 Idempotent: Safe to re-run - updates existing data without duplicates
- 📅 Time-windowed: Process specific date ranges with automatic chunking
- Python 3.12+
- Google Cloud Project with BigQuery enabled
- GitHub Personal Access Token with appropriate permissions
```bash
# Clone the repository
git clone https://github.com/your-org/github-stats.git
cd github-stats

# Install dependencies
pip install -r requirements.txt

# Or install as a package
pip install -e .
```
```bash
# GitHub Authentication
export GITHUB_TOKEN="ghp_your_personal_access_token"

# BigQuery Configuration
export BQ_PROJECT="your-gcp-project"
export BQ_DATASET="devprod_raw"

# Optional
export BQ_LOCATION="US"                                # BigQuery location (defaults to project location)
export DEVPROD_CHECKPOINT_TABLE="devprod_checkpoints"  # Checkpoint table name
export DEVPROD_STAGING_BUCKET="gs://your-bucket"       # For batch loading (optional)
export WORKFLOW_FILTER="bazel_pr_runner.yml"           # Filter to specific workflow (default: bazel_pr_runner.yml)
```

Your GitHub token needs the following scopes:

- `repo` - Access to private repositories (if needed)
- `read:org` - Read organization data
- `read:packages` - Read packages (optional)
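If you want to sanity-check the token before a long harvest, a quick standalone script like the sketch below works for classic personal access tokens, which report their granted scopes in the `X-OAuth-Scopes` response header (illustrative only, not part of the tool; fine-grained tokens do not send this header):

```python
"""Illustrative scope check for a classic GitHub personal access token."""
import os

import httpx

REQUIRED = {"repo", "read:org"}  # read:packages is optional

resp = httpx.get(
    "https://api.github.com/user",
    headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
)
resp.raise_for_status()

# Classic tokens list their granted scopes in this header, e.g. "read:org, repo".
granted = {s.strip() for s in resp.headers.get("x-oauth-scopes", "").split(",") if s.strip()}
missing = REQUIRED - granted
if missing:
    raise SystemExit(f"GITHUB_TOKEN is missing scopes: {sorted(missing)}")
print(f"Token looks good; granted scopes: {sorted(granted)}")
```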
First, create the dataset and tables in BigQuery:
```bash
# Create dataset and tables
devprod init-bq --create-dataset

# Or if the dataset already exists
devprod init-bq
```

Harvest a single repository for a specific date range:

```bash
devprod harvest \
  --repo owner/repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues
```

Harvest an entire organization:

```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues,checks,workflows \
  --concurrency 12
```

Resume an interrupted harvest from its last checkpoint:

```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues \
  --resume
```

For large historical ranges, use the backfill command, which automatically chunks the work:
```bash
devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues \
  --chunk-days 30
```
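What `--chunk-days` does conceptually is split the overall window into fixed-size sub-ranges that are harvested in sequence. A minimal sketch of that splitting (illustrative only; the tool's actual boundary handling lives in its planner and time utilities and may differ):

```python
from datetime import date, timedelta

def chunk_date_range(since: date, until: date, chunk_days: int):
    """Yield consecutive (start, end) windows of at most chunk_days days."""
    start = since
    while start < until:
        end = min(start + timedelta(days=chunk_days), until)
        yield start, end
        start = end

# Roughly what `--since 2024-01-01 --chunk-days 30` implies for a year of history:
for start, end in chunk_date_range(date(2024, 1, 1), date(2025, 1, 1), 30):
    print(f"harvest window: {start} -> {end}")
```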
View the progress of ongoing or completed harvests:

```bash
# Check all checkpoints
devprod status

# Filter by organization
devprod status --org my-organization

# Filter by specific repos and entities
devprod status \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --entities prs,issues
```

The tool can harvest the following GitHub entities:
| Entity | Description | Dependencies |
|---|---|---|
| `prs` | Pull requests with metadata | None |
| `reviews` | PR reviews (approved, changes requested, etc.) | Requires `prs` |
| `pr_comments` | Review comments on code diffs | Requires `prs` |
| `pr_commits` | PR head commits | Requires `prs` |
| `issues` | Issues with metadata | None |
| `issue_comments` | Comments on issues | Requires `issues` |
| `checks` | Check suites for commits | Requires `prs` |
| `workflows` | GitHub Actions workflow runs (filtered by the `WORKFLOW_FILTER` env var) | Requires `prs` |
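The Dependencies column is the contract to keep in mind when choosing `--entities`: dependent entities assume their parent data (for example `prs`) has already been harvested. The helper below is a hypothetical illustration of ordering a requested set so parents run first; it is not the tool's planner.

```python
# Hypothetical dependency map mirroring the table above (not the tool's internal code).
DEPENDENCIES: dict[str, set[str]] = {
    "prs": set(),
    "reviews": {"prs"},
    "pr_comments": {"prs"},
    "pr_commits": {"prs"},
    "issues": set(),
    "issue_comments": {"issues"},
    "checks": {"prs"},
    "workflows": {"prs"},
}

def order_entities(requested: list[str]) -> list[str]:
    """Return the requested entities with parents placed before their dependents."""
    ordered: list[str] = []

    def visit(entity: str) -> None:
        for parent in DEPENDENCIES[entity]:
            if parent in requested:
                visit(parent)
        if entity not in ordered:
            ordered.append(entity)

    for entity in requested:
        visit(entity)
    return ordered

print(order_entities(["reviews", "prs", "issue_comments", "issues"]))
# -> ['prs', 'reviews', 'issues', 'issue_comments']
```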
Data is stored in a raw, append-friendly layer with the following tables:
- `raw_pull_requests` - PR metadata
- `raw_pull_request_reviews` - Review events
- `raw_pull_request_comments` - Code review comments
- `raw_pr_commits` - Commit metadata
- `raw_issues` - Issue metadata
- `raw_issue_comments` - Issue comments
- `raw_check_suites` - CI check results
- `raw_workflow_runs` - GitHub Actions runs
- `devprod_checkpoints` - Harvest progress tracking

All tables are:

- Partitioned by `_ingest_date` (DATE)
- Clustered by `repo_full_name` and primary keys
- Tagged with `_ingest_run_id` for traceability
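Because of that layout, analytics queries stay cheap when they filter on `_ingest_date` and `repo_full_name`. A minimal sketch with the BigQuery client, assuming the `devprod_raw` dataset from the configuration example and the `raw_pull_requests` table listed above:

```python
from datetime import date

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # same value as BQ_PROJECT

# Count ingested PR rows per day for one repository; filtering on _ingest_date
# prunes partitions and filtering on repo_full_name benefits from clustering.
sql = """
SELECT _ingest_date, COUNT(*) AS rows_ingested
FROM `your-gcp-project.devprod_raw.raw_pull_requests`
WHERE _ingest_date BETWEEN @start AND @end
  AND repo_full_name = @repo
GROUP BY _ingest_date
ORDER BY _ingest_date
"""

job = client.query(
    sql,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("start", "DATE", date(2025, 1, 1)),
            bigquery.ScalarQueryParameter("end", "DATE", date(2025, 1, 31)),
            bigquery.ScalarQueryParameter("repo", "STRING", "owner/repo"),
        ]
    ),
)
for row in job.result():
    print(row["_ingest_date"], row["rows_ingested"])
```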
Preview what will be harvested without executing:
```bash
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --dry-run
```

Adjust parallelism based on your rate limits:

```bash
devprod harvest \
  --repo owner/large-repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --concurrency 16 \
  --batch-size 100
```

Harvest multiple specific repositories:
```bash
devprod harvest \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --repo owner/repo3 \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews
```

Run a daily incremental harvest via cron or a scheduler:

```bash
#!/bin/bash
# Run this daily via cron/scheduler
TODAY=$(date +%Y-%m-%d)
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)

devprod harvest \
  --org my-organization \
  --start "$YESTERDAY" \
  --end "$TODAY" \
  --entities prs,reviews,issues \
  --resume
```
```bash
# Backfill all of 2024 in monthly chunks
devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues,checks,workflows \
  --chunk-days 30 \
  --concurrency 8
```

```bash
# Harvest PR data for critical repositories
devprod harvest \
  --repo my-org/frontend \
  --repo my-org/backend \
  --repo my-org/api \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,pr_comments \
  --concurrency 12
```
```bash
# Set workflow filter for bazel_pr_runner
export WORKFLOW_FILTER="bazel_pr_runner.yml"

# Harvest PRs and workflow runs
devprod harvest \
  --repo askscio/scio \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,workflows \
  --concurrency 8 \
  --resume
```

A few tips for reliable runs:

- Start with smaller windows: Test with 7-day windows before running large backfills
- Monitor rate limits: Start with lower concurrency (4-8) and increase gradually
- Use batch loading: Set `DEVPROD_STAGING_BUCKET` for better performance on large datasets
- Run reviews/comments separately: After harvesting PRs, run dependent entities
- Use resume: Always use `--resume` for production runs to handle interruptions
The tool uses a two-phase approach for optimal API usage:
- Search Phase: Uses GraphQL search to find entities in date ranges (cheap API calls)
- Hydrate Phase: Batch fetches detailed data using node IDs (efficient batching)
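In GraphQL terms, the pattern looks roughly like the sketch below: a `search` query that returns only node IDs, followed by a batched `nodes(ids: ...)` query for the full objects. This is a simplified illustration against GitHub's public GraphQL API, not the tool's actual queries (those live in `gh_harvester/github/queries.py`):

```python
import asyncio
import os

import httpx

API = "https://api.github.com/graphql"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}

SEARCH = """
query($q: String!) {
  search(query: $q, type: ISSUE, first: 50) {
    nodes { ... on PullRequest { id } }
  }
}
"""

HYDRATE = """
query($ids: [ID!]!) {
  nodes(ids: $ids) {
    ... on PullRequest { number title state createdAt mergedAt }
  }
}
"""

async def harvest_window(repo: str, start: str, end: str) -> list[dict]:
    # http2=True needs the httpx[http2] extra installed.
    async with httpx.AsyncClient(http2=True, headers=HEADERS, timeout=30) as client:
        # Phase 1 (search): a cheap call that only returns node IDs for the window.
        q = f"repo:{repo} is:pr created:{start}..{end}"
        resp = await client.post(API, json={"query": SEARCH, "variables": {"q": q}})
        resp.raise_for_status()
        ids = [n["id"] for n in resp.json()["data"]["search"]["nodes"]]

        # Phase 2 (hydrate): one batched nodes() call fetches the detailed fields.
        resp = await client.post(API, json={"query": HYDRATE, "variables": {"ids": ids}})
        resp.raise_for_status()
        return resp.json()["data"]["nodes"]

if __name__ == "__main__":
    print(asyncio.run(harvest_window("owner/repo", "2025-01-01", "2025-01-31")))
```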
Key components:
- Async HTTP: Uses httpx with HTTP/2 for maximum throughput
- Rate Limiting: Proactive rate limit checking with exponential backoff
- Checkpointing: Persistent progress tracking in BigQuery
- Idempotent Loading: MERGE operations prevent duplicates
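The MERGE-based loading can be pictured as the sketch below: staged rows are folded into the raw table keyed on a stable identifier, so re-running a window updates rows instead of duplicating them. The staging table name, `node_id` key, and `payload` column here are assumptions for illustration; the real schema and loader live in `bq/schema.py` and `bq/loader.py`.

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")

# Illustrative upsert from a staging table into the raw layer. Column names other
# than _ingest_date/_ingest_run_id are assumptions made for this sketch.
merge_sql = """
MERGE `your-gcp-project.devprod_raw.raw_pull_requests` AS target
USING `your-gcp-project.devprod_raw._staging_pull_requests` AS source
ON target.node_id = source.node_id
WHEN MATCHED THEN
  UPDATE SET payload = source.payload,
             _ingest_run_id = source._ingest_run_id,
             _ingest_date = source._ingest_date
WHEN NOT MATCHED THEN
  INSERT ROW
"""
client.query(merge_sql).result()
```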
If you encounter rate limits:
- Reduce the `--concurrency` parameter
- The tool will automatically sleep and retry (see the sketch below)
- Use `--resume` to continue from the last checkpoint
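The sleep-and-retry behavior follows the usual GitHub pattern of honoring the rate-limit headers and backing off exponentially on transient errors. A simplified sketch of that idea (illustrative, not the tool's `github/client.py`):

```python
import asyncio
import random
import time

import httpx

async def request_with_backoff(client: httpx.AsyncClient, method: str, url: str, **kwargs) -> httpx.Response:
    """Retry a request, sleeping until the rate-limit window resets or backing off exponentially."""
    for attempt in range(6):
        resp = await client.request(method, url, **kwargs)

        # Primary rate limit exhausted: GitHub reports the budget and reset time in headers.
        if resp.status_code in (403, 429) and resp.headers.get("X-RateLimit-Remaining") == "0":
            reset_at = int(resp.headers.get("X-RateLimit-Reset", "0"))
            await asyncio.sleep(max(reset_at - time.time(), 1))
            continue

        # Transient failures and secondary limits: exponential backoff with jitter.
        if resp.status_code in (429, 500, 502, 503, 504):
            await asyncio.sleep(2 ** attempt + random.random())
            continue

        resp.raise_for_status()
        return resp

    raise RuntimeError(f"Giving up on {url} after repeated retries")
```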
Ensure your GitHub token has the required permissions:
```bash
# Test your token
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user
```

Check your GCP credentials:
```bash
# Authenticate with GCP
gcloud auth application-default login

# Verify project
gcloud config get-value project
```
```bash
# Run all tests
pytest gh_harvester/tests/

# Run specific test file
pytest gh_harvester/tests/test_transforms.py

# Run with coverage
pytest --cov=gh_harvester gh_harvester/tests/
```

Project layout:

```
gh_harvester/
├── cli.py              # CLI commands
├── config.py           # Configuration management
├── github/
│   ├── client.py       # GitHub API client
│   ├── queries.py      # GraphQL query templates
│   ├── search.py       # Search operations
│   └── resolvers.py    # Data resolution
├── harvest/
│   ├── planner.py      # Task planning
│   ├── runner.py       # Async execution
│   ├── checkpoints.py  # Progress tracking
│   └── transforms.py   # Data normalization
├── bq/
│   ├── ddl.py          # Table creation
│   ├── schema.py       # Schema definitions
│   └── loader.py       # Data loading
└── utils/
    ├── time.py         # Time utilities
    ├── ids.py          # ID management
    └── logging.py      # Structured logging
```
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
[Add your license here]
For issues and questions:
- Open a GitHub issue
- Contact the development team
Built with:
- Typer - CLI framework
- httpx - Async HTTP client
- Google Cloud BigQuery - Data warehouse
- Rich - Terminal formatting