GitHub Stats Collector

A high-performance Python CLI tool that fetches GitHub data (pull requests, reviews, issues, checks, and workflows) and stores it in BigQuery for analytics. Built on async/await for concurrent fetching, with batched loading, checkpoint-based resume, and rate-limit-aware backoff.

Features

  • 🚀 High Performance: Async/concurrent data fetching with configurable parallelism
  • 📊 Comprehensive Data: Collects PRs, reviews, issues, comments, checks, and workflow runs
  • 🔄 Resume Support: Checkpoint-based recovery for interrupted harvests
  • 🛡️ Rate Limit Safe: Automatic rate limit detection and backoff
  • 📦 Batch Processing: Efficient bulk loading to BigQuery with MERGE operations
  • 🎯 Idempotent: Safe to re-run - updates existing data without duplicates
  • 📅 Time-windowed: Process specific date ranges with automatic chunking

Installation

Prerequisites

  • Python 3.12+
  • Google Cloud Project with BigQuery enabled
  • GitHub Personal Access Token with appropriate permissions

Install from source

# Clone the repository
git clone https://github.com/your-org/github-stats.git
cd github-stats

# Install dependencies
pip install -r requirements.txt

# Or install as package
pip install -e .

Configuration

Required Environment Variables

# GitHub Authentication
export GITHUB_TOKEN="ghp_your_personal_access_token"

# BigQuery Configuration
export BQ_PROJECT="your-gcp-project"
export BQ_DATASET="devprod_raw"

# Optional
export BQ_LOCATION="US"  # BigQuery location (defaults to project location)
export DEVPROD_CHECKPOINT_TABLE="devprod_checkpoints"  # Checkpoint table name
export DEVPROD_STAGING_BUCKET="gs://your-bucket"  # For batch loading (optional)
export WORKFLOW_FILTER="bazel_pr_runner.yml"  # Filter to specific workflow (default: bazel_pr_runner.yml)

GitHub Token Permissions

Your GitHub token needs the following scopes:

  • repo - Access to private repositories (if needed)
  • read:org - Read organization data
  • read:packages - Read packages (optional)
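
For a classic personal access token, you can verify which scopes it actually carries by inspecting the X-OAuth-Scopes response header (a quick sanity check; fine-grained tokens do not report scopes this way):

# List the scopes attached to the token
curl -sI -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user \
  | grep -i x-oauth-scopes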

Usage

1. Initialize BigQuery

First, create the dataset and tables in BigQuery:

# Create dataset and tables
devprod init-bq --create-dataset

# Or if dataset already exists
devprod init-bq

2. Harvest Data

Basic harvest for a single repository

devprod harvest \
  --repo owner/repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues

Organization-wide harvest

devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues,checks,workflows \
  --concurrency 12

Resume an interrupted harvest

devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,issues \
  --resume

3. Backfill Historical Data

For large historical ranges, use the backfill command, which automatically chunks the work into smaller windows:

devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues \
  --chunk-days 30

4. Check Status

View the progress of ongoing or completed harvests:

# Check all checkpoints
devprod status

# Filter by organization
devprod status --org my-organization

# Filter by specific repos and entities
devprod status \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --entities prs,issues

Supported Entities

The tool can harvest the following GitHub entities:

Entity          Description                                                 Dependencies
prs             Pull requests with metadata                                 None
reviews         PR reviews (approved, changes requested, etc.)              Requires prs
pr_comments     Review comments on code diffs                               Requires prs
pr_commits      PR head commits                                             Requires prs
issues          Issues with metadata                                        None
issue_comments  Comments on issues                                          Requires issues
checks          Check suites for commits                                    Requires prs
workflows       GitHub Actions workflow runs (filtered by WORKFLOW_FILTER)  Requires prs
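
Because the dependent entities hydrate off previously harvested PRs and issues, one workable pattern (a sketch using only the flags shown above in Usage) is to harvest the parents first and the dependents in a second pass:

# Pass 1: harvest the parent entities
devprod harvest --org my-organization --start 2025-01-01 --end 2025-01-31 --entities prs,issues

# Pass 2: harvest the entities that depend on them
devprod harvest --org my-organization --start 2025-01-01 --end 2025-01-31 --entities reviews,pr_comments,issue_comments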

BigQuery Schema

Data is stored in a raw, append-friendly layer with the following tables:

  • raw_pull_requests - PR metadata
  • raw_pull_request_reviews - Review events
  • raw_pull_request_comments - Code review comments
  • raw_pr_commits - Commit metadata
  • raw_issues - Issue metadata
  • raw_issue_comments - Issue comments
  • raw_check_suites - CI check results
  • raw_workflow_runs - GitHub Actions runs
  • devprod_checkpoints - Harvest progress tracking

All tables are:

  • Partitioned by _ingest_date (DATE)
  • Clustered by repo_full_name and primary keys
  • Stamped with _ingest_run_id for traceability
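
Because every table is partitioned by _ingest_date, filtering on that column keeps scans cheap. A minimal sketch of a verification query (assuming the example BQ_PROJECT and BQ_DATASET values from the Configuration section):

# Count ingested PRs per repo, pruning to recent partitions
bq query --use_legacy_sql=false '
SELECT repo_full_name, COUNT(*) AS pr_rows
FROM `your-gcp-project.devprod_raw.raw_pull_requests`
WHERE _ingest_date >= DATE "2025-01-01"
GROUP BY repo_full_name
ORDER BY pr_rows DESC
LIMIT 20'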

Advanced Usage

Dry Run Mode

Preview what will be harvested without executing:

devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --dry-run

Custom Concurrency

Adjust parallelism based on your rate limits:

devprod harvest \
  --repo owner/large-repo \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --concurrency 16 \
  --batch-size 100

Specific Repository List

Harvest multiple specific repositories:

devprod harvest \
  --repo owner/repo1 \
  --repo owner/repo2 \
  --repo owner/repo3 \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews

Examples

Example 1: Daily incremental harvest

#!/bin/bash
# Run this daily via cron/scheduler

TODAY=$(date +%Y-%m-%d)
YESTERDAY=$(date -d "yesterday" +%Y-%m-%d)  # GNU date; on macOS use: date -v-1d +%Y-%m-%d

devprod harvest \
  --org my-organization \
  --start $YESTERDAY \
  --end $TODAY \
  --entities prs,reviews,issues \
  --resume

Example 2: Full historical backfill

# Backfill all of 2024 in monthly chunks
devprod backfill \
  --org my-organization \
  --since 2024-01-01 \
  --entities prs,reviews,issues,checks,workflows \
  --chunk-days 30 \
  --concurrency 8

Example 3: Targeted harvest for specific repos

# Harvest PR data for critical repositories
devprod harvest \
  --repo my-org/frontend \
  --repo my-org/backend \
  --repo my-org/api \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,reviews,pr_comments \
  --concurrency 12

Example 4: Harvest specific CI workflow for askscio/scio

# Set workflow filter for bazel_pr_runner
export WORKFLOW_FILTER="bazel_pr_runner.yml"

# Harvest PRs and workflow runs
devprod harvest \
  --repo askscio/scio \
  --start 2025-01-01 \
  --end 2025-01-31 \
  --entities prs,workflows \
  --concurrency 8 \
  --resume

Performance Tips

  1. Start with smaller windows: Test with 7-day windows before running large backfills
  2. Monitor rate limits: Start with lower concurrency (4-8) and increase gradually
  3. Use batch loading: Set DEVPROD_STAGING_BUCKET for better performance on large datasets
  4. Run reviews/comments separately: After harvesting PRs, run dependent entities
  5. Use resume: Always use --resume for production runs to handle interruptions
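
Putting these tips together, a cautious first production run might look like the following (a sketch; tune the window and concurrency to your own rate limits):

# Enable batch loading via GCS staging
export DEVPROD_STAGING_BUCKET="gs://your-bucket"

# Small window, low concurrency, parents only, resumable
devprod harvest \
  --org my-organization \
  --start 2025-01-01 \
  --end 2025-01-07 \
  --entities prs \
  --concurrency 4 \
  --resume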

Architecture

The tool uses a two-phase approach to keep API usage cheap:

  1. Search Phase: GraphQL search locates matching entities in the requested date range (inexpensive calls)
  2. Hydrate Phase: Detailed records are batch-fetched by node ID (many nodes per request)
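
For intuition, here is roughly what the two phases look like against GitHub's GraphQL API directly (illustrative only; the tool's actual queries live in gh_harvester/github/queries.py, and the node IDs below are placeholders):

# Phase 1: search returns lightweight hits with node IDs
curl -s -X POST https://api.github.com/graphql \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -d '{"query":"{ search(query: \"repo:owner/repo is:pr created:2025-01-01..2025-01-31\", type: ISSUE, first: 100) { nodes { ... on PullRequest { id } } } }"}'

# Phase 2: hydrate many nodes in a single request by ID
curl -s -X POST https://api.github.com/graphql \
  -H "Authorization: Bearer $GITHUB_TOKEN" \
  -d '{"query":"{ nodes(ids: [\"PR_NODE_ID_1\", \"PR_NODE_ID_2\"]) { ... on PullRequest { number title mergedAt } } }"}'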

Key components:

  • Async HTTP: Uses httpx with HTTP/2 for maximum throughput
  • Rate Limiting: Proactive rate limit checking with exponential backoff
  • Checkpointing: Persistent progress tracking in BigQuery
  • Idempotent Loading: MERGE operations prevent duplicates
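
The MERGE-based loading is what makes re-runs upsert rather than duplicate. Conceptually it behaves like the following (a hypothetical sketch with made-up staging table and key columns, not the tool's actual DDL):

bq query --use_legacy_sql=false '
MERGE `your-gcp-project.devprod_raw.raw_pull_requests` AS target
USING `your-gcp-project.devprod_raw.staging_pull_requests` AS source
ON target.repo_full_name = source.repo_full_name
   AND target.pr_number = source.pr_number
WHEN MATCHED THEN
  UPDATE SET state = source.state, _ingest_run_id = source._ingest_run_id
WHEN NOT MATCHED THEN
  INSERT ROW'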

Troubleshooting

Rate Limit Errors

If you encounter rate limits:

  1. Reduce the --concurrency value
  2. Wait: the tool sleeps and retries automatically on rate-limit responses
  3. Use --resume to continue from the last checkpoint
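
To see where your token currently stands, GitHub exposes a dedicated endpoint that does not count against your quota:

# Inspect remaining REST and GraphQL rate-limit budget
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit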

Authentication Errors

Ensure your GitHub token has the required permissions:

# Test your token
curl -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/user

BigQuery Errors

Check your GCP credentials:

# Authenticate with GCP
gcloud auth application-default login

# Verify project
gcloud config get-value project

Development

Running Tests

# Run all tests
pytest gh_harvester/tests/

# Run specific test file
pytest gh_harvester/tests/test_transforms.py

# Run with coverage
pytest --cov=gh_harvester gh_harvester/tests/

Code Structure

gh_harvester/
├── cli.py              # CLI commands
├── config.py           # Configuration management
├── github/
│   ├── client.py       # GitHub API client
│   ├── queries.py      # GraphQL query templates
│   ├── search.py       # Search operations
│   └── resolvers.py    # Data resolution
├── harvest/
│   ├── planner.py      # Task planning
│   ├── runner.py       # Async execution
│   ├── checkpoints.py  # Progress tracking
│   └── transforms.py   # Data normalization
├── bq/
│   ├── ddl.py          # Table creation
│   ├── schema.py       # Schema definitions
│   └── loader.py       # Data loading
└── utils/
    ├── time.py         # Time utilities
    ├── ids.py          # ID management
    └── logging.py      # Structured logging

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Submit a pull request

License

[Add your license here]

Support

For issues and questions:

  • Open a GitHub issue
  • Contact the development team
