
Conversation


@nirmitparikh8 nirmitparikh8 commented Dec 19, 2025

🎯 What We're Trying to Achieve

Replace manual visual inspection of experiment results with an automated statistical regression detection system that can:

  • Automatically detect performance degradations in circuit breaker experiments
  • Provide boolean pass/fail decisions for CI/CD integration
  • Eliminate the need for manual chart analysis and subjective interpretation
  • Catch regressions early before they impact production systems

🔧 How We're Achieving It

Statistical Approach: Percentile-Based Control Charts

  • Baseline Collection: Collect 10+ historical "good" experiment runs for each experiment type
  • Percentile Analysis: Calculate configurable percentiles (currently 5th-95th) for key metrics:
    • Deviation from Target: |actual_rate - target_rate| / target_rate * 100
    • Raw Error Rate: Direct error percentage
    • Raw Rejection Rate: Direct rejection percentage
  • Control Limits: Use percentile bounds as "normal operating range"
  • Violation Detection: Flag an experiment when more than X% of its time windows fall outside the bounds (see the sketch after this list)
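
A minimal Ruby sketch of this approach, assuming per-window metric values have already been extracted from the experiment results; the method names and sample numbers are illustrative, not the actual API of the scripts in this PR.

# Illustrative only: percentile control limits and violation detection.

# Linear-interpolation percentile over a sample.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0) * (sorted.length - 1)
  lower, upper = sorted[rank.floor], sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

# Deviation from target: |actual_rate - target_rate| / target_rate * 100
def deviation_pct(actual_rate, target_rate)
  (actual_rate - target_rate).abs / target_rate * 100.0
end

# Per-window error rates (%) from 10 historical "good" runs (made-up numbers).
baseline_error_rates = [0.8, 0.9, 1.2, 1.8, 2.2, 2.4, 3.1, 3.9, 4.5, 4.7]
lower_bound = percentile(baseline_error_rates, 5)   # ~0.85%
upper_bound = percentile(baseline_error_rates, 95)  # ~4.61%

# Violation detection: fraction of current windows outside [p5, p95].
current_error_rates = [1.1, 2.0, 6.3, 5.8, 2.5]
violations = current_error_rates.count { |r| r < lower_bound || r > upper_bound }
violation_rate = violations.to_f / current_error_rates.length  # => 0.4 (40%)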

Automated Pipeline Components

  1. collect_baseline_data.rb: Automated baseline collection (runs experiments N times, organizes results)
  2. compute_baselines.rb: Statistical analysis (calculates percentiles from historical data)
  3. detect_regressions.rb: Main detection engine (compares current results vs baselines)
  4. regression_config.rb: Centralized, tunable configuration
  5. GitHub Actions Integration: Fully automated CI checks on every PR (a hypothetical end-to-end invocation is sketched below)
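
For orientation, a hypothetical end-to-end invocation of these scripts might look like the following; the arguments, paths, and exit-code handling are assumptions, not the actual CLI of each script.

# Hypothetical driver tying the pipeline stages together (arguments are guesses).
experiment = "circuit_breaker_baseline"  # hypothetical experiment name

# 1. Collect historical baseline runs (slow; typically done ahead of CI).
system("ruby collect_baseline_data.rb #{experiment} 10") or abort("baseline collection failed")

# 2. Compute percentile control limits from the collected runs.
system("ruby compute_baselines.rb #{experiment}") or abort("baseline computation failed")

# 3. Compare the current experiment results against the baselines;
#    a non-zero exit status tells CI a regression was detected.
exit(system("ruby detect_regressions.rb #{experiment}") ? 0 : 1)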

📊 Pipeline Flow & Math

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────────┐
│   Historical    │    │   Statistical    │    │   Current Results   │
│ Baseline Runs   │───▶│   Analysis       │───▶│    Comparison       │
│ (10-15 runs)    │    │ (Percentiles)    │    │  (Pass/Fail)        │
└─────────────────┘    └──────────────────┘    └─────────────────────┘
         │                       │                        │
         │              ┌──────────▼────────────┐         │
         │              │ Control Limits:       │         │
         │              │ p5 = 0.79% errors     │         │
         │              │ p95 = 4.76% errors    │         │
         │              │ p5 = 0.0% rejected    │         │
         │              │ p95 = 51.47% rejected │         │
         │              └───────────────────────┘         │
         │                                                │
┌────────▼────────────────────────────────────────────────▼───┐
│               Regression Detection Logic:                   │
│  IF violation_rate > THRESHOLD → FAIL (Regression)          │
│  Current: 50-80% thresholds (very generous)                 │
└─────────────────────────────────────────────────────────────┘
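
The decision step reduces to comparing each metric's violation rate against its configured threshold. Below is a hedged Ruby sketch that reuses the constant names from regression_config.rb; the hash keys and helper method are hypothetical.

# Thresholds mirror the current values in regression_config.rb.
DEVIATION_VIOLATION_THRESHOLD = 0.8
ERROR_RATE_VIOLATION_THRESHOLD = 0.8
REJECTION_RATE_VIOLATION_THRESHOLD = 0.8

# violation_rates: fraction of windows outside the percentile bounds, per metric.
def regression?(violation_rates)
  violation_rates[:deviation] > DEVIATION_VIOLATION_THRESHOLD ||
    violation_rates[:error_rate] > ERROR_RATE_VIOLATION_THRESHOLD ||
    violation_rates[:rejection_rate] > REJECTION_RATE_VIOLATION_THRESHOLD
end

# Example: 85% of windows violated the error-rate bounds -> FAIL the CI check.
rates = { deviation: 0.10, error_rate: 0.85, rejection_rate: 0.05 }
if regression?(rates)
  puts "FAIL: regression detected"
  exit 1
else
  puts "PASS"
end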

🚧 Current State & Progress

✅ What's Complete

  • Full pipeline architecture implemented and tested
  • GitHub Actions integration - automatically runs on every PR
  • Dynamic percentile system - fully configurable percentile ranges
  • Comprehensive documentation - setup guides, troubleshooting, examples
  • Robust error handling - graceful failures, clear messaging

🔧 What We're Still Tuning

  1. Baseline Data Sufficiency: How many historical runs provide stable percentiles? (Currently requiring 10+)
  2. Optimal Percentile Ranges: 5th-95th vs 3rd-97th vs 10th-90th percentiles?
  3. Violation Thresholds: What % of windows can violate bounds before flagging regression?

🤖 Current Configuration (Intentionally Generous)

# Very generous settings to avoid blocking CI while we tune
LOWER_PERCENTILE = 5          # 5th percentile  
UPPER_PERCENTILE = 95         # 95th percentile
DEVIATION_VIOLATION_THRESHOLD = 0.8   # 80% of windows can violate (very high)
ERROR_RATE_VIOLATION_THRESHOLD = 0.8  # 80% of windows can violate (very high)  
REJECTION_RATE_VIOLATION_THRESHOLD = 0.8  # 80% of windows can violate (very high)

Why So Generous? I wanted to ship a baseline MVP for the team to iterate on before my internship ends, and I didn't have time to tune the thresholds more tightly.

📈 Next Steps

  1. Collect Production Baseline Data: Run collect_baseline_data.rb with 15 runs per experiment, spread across multiple weeks
  2. Analyze Natural Variation: Study percentile distributions to find optimal bounds
  3. Iterative Threshold Tuning: Gradually tighten violation thresholds (50% → 30% → 15%)
  4. Per-Experiment Calibration: Some experiments may need different sensitivity levels
  5. False Positive Monitoring: Track and eliminate unnecessary CI failures

🎯 Success Metrics

  • Zero False Negatives: Catch all real performance regressions
  • Minimal False Positives: <5% of PRs blocked unnecessarily
  • CI Integration: Seamless GitHub Actions workflow
  • Developer Experience: Clear, actionable regression reports

This PR establishes the foundation for data-driven circuit breaker regression detection. The generous initial configuration ensures we don't block development while we gather data to optimize the system.

@nirmitparikh8 nirmitparikh8 changed the base branch from main to pid-take-2 December 19, 2025 15:27
@nirmitparikh8 nirmitparikh8 changed the title from "Goodput analysis of experiments" to "Add Automated Circuit Breaker Regression Detection System" Dec 19, 2025
@nirmitparikh8 nirmitparikh8 force-pushed the goodput-analysis-of-experiments branch from be0d5c2 to 2d17fc3 December 19, 2025 17:30
@nirmitparikh8 (Contributor, Author) commented:

This is the file to tune configs. Let's iterate on this and make the configs more restrictive. Right now we allow 80 percent of the time windows to violate the error-rate bounds. We should figure out why these violation rates are so high when we compare a new CSV to the baseline and fix that. It could be something with the test itself or with how we check for regressions (the new system I built).

@nirmitparikh8 nirmitparikh8 force-pushed the goodput-analysis-of-experiments branch from b730d85 to 2d17fc3 December 19, 2025 17:39
@nirmitparikh8 nirmitparikh8 force-pushed the goodput-analysis-of-experiments branch from aeba5a1 to 2231b64 December 19, 2025 19:35