feat: Unified QoS System - Pluggable Health Checks + Reputation #500
Open: jorgecuesta wants to merge 12 commits into staging from feat/unified-qos
Conversation
Replace the legacy hydrator + sanctions system with a unified reputation-based QoS system that provides configurable health checks, endpoint recovery, and comprehensive observability.

- Score-based endpoint tracking (0-100) replacing binary sanctions
- Tiered selection (Tier 1/2/3) based on reputation scores
- Probation system for endpoint recovery (10% traffic sampling)
- Latency-aware scoring with service-specific profiles (EVM, Cosmos, Solana, LLM)
- Signal types: success, minor/major/critical/fatal errors, recovery_success
- YAML-configurable checks (replacing hardcoded Go logic)
- Execution through the protocol layer (tests the full relay path, including relay miners)
- Support for jsonrpc, rest, websocket check types
- External URL config support for centralized health check definitions
- Leader election for multi-instance deployments
- Full Prometheus metrics suite for reputation, probation, health checks, sessions
- Metrics labels: service_id, endpoint_domain, signal_type, endpoint_type
- Async observation pipeline for non-blocking response processing

Removed:
- Hydrator (gateway/hydrator*.go, cmd/hydrator.go)
- Permanent sanctions (protocol/shannon/sanction*.go)
- Hardcoded per-chain health checks

Added:
- Health check executor (gateway/health_check_*.go)
- QoS extractors (qos/*/extractor.go) for deep response parsing
- Documentation (docs/HOW_TO_RUN_PATH.md, docs/REPUTATION_SYSTEM.md)
- Metrics packages (metrics/healthcheck, metrics/session, metrics/retry)
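The score-based tracking described above can be sketched as follows. The 0-100 score range and the signal names come from this PR; the concrete per-signal impacts and the tier thresholds are illustrative assumptions standing in for the configurable values:

```go
package main

import "fmt"

// Reputation tracks a single endpoint's score on a 0-100 scale,
// replacing the old binary sanction (blocked/allowed) model.
type Reputation struct {
	Score float64
}

// signalImpact maps signal types to score deltas. These numbers are
// illustrative; the real system makes signal impacts configurable.
var signalImpact = map[string]float64{
	"success":          +1,
	"recovery_success": +5,
	"minor_error":      -2,
	"major_error":      -10,
	"critical_error":   -25,
	"fatal_error":      -100,
}

// Apply adjusts the score for a signal, clamping to [0, 100].
func (r *Reputation) Apply(signal string) {
	r.Score += signalImpact[signal]
	if r.Score > 100 {
		r.Score = 100
	}
	if r.Score < 0 {
		r.Score = 0
	}
}

// Tier buckets the endpoint for tiered selection. The thresholds here are
// placeholders for the configurable tier1_threshold / tier2_threshold.
func (r *Reputation) Tier() int {
	switch {
	case r.Score >= 80:
		return 1
	case r.Score >= 50:
		return 2
	default:
		return 3
	}
}

func main() {
	r := &Reputation{Score: 100}
	r.Apply("major_error")
	r.Apply("major_error")
	r.Apply("critical_error")
	fmt.Println(r.Score, r.Tier()) // prints: 55 2
}
```

An endpoint that degrades drops out of Tier 1 gradually instead of being permanently sanctioned; recovery_success signals earned during probation raise it back.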
This commit introduces a unified YAML configuration system that allows per-service overrides for all gateway settings.

Key changes:

Unified Service Configuration:
- New `gateway_config.defaults` for global service defaults
- New `gateway_config.services[]` for per-service overrides
- Merge logic: services inherit from defaults and override specific fields
- Named latency profiles (`fast`, `standard`, `slow`, `llm`)

Per-Service Configuration Support:
- Reputation config (initial_score, min_threshold, recovery_timeout)
- Tiered selection thresholds (tier1_threshold, tier2_threshold)
- Probation settings (threshold, traffic_percent, recovery_multiplier)
- Retry config (max_retries, retry_on_5xx, retry_on_timeout)
- Observation pipeline sample rates
- Latency profiles and inline latency config
- Active health checks with per-service rules

Health Check System:
- Leader election for distributed deployments (Redis-based)
- External health check rules URL with auto-refresh
- Per-service local health check overrides
- Latency integration in health check signals
- Comprehensive health check executor with configurable checks

Retry Logic:
- Full retry implementation with per-service configuration
- Retry on 5xx, timeout, and connection errors
- Configurable max retries per service

Code Quality:
- Removed hardcoded service configurations (service_qos_config.go)
- Removed scattered config files (health_check_defaults.go)
- Added comprehensive test coverage for new components
- Fixed private key logging (now redacted)
- Added extractor factory for QoS metric extraction

Breaking Changes:
- Config format changed: services now defined under `gateway_config.services[]`
- Removed deprecated `service_fallback` array (use `services[].fallback`)
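The defaults-plus-overrides merge can be sketched like this. The field names mirror config keys from this PR, but the struct and the merge rule (zero value means "inherit from defaults") are illustrative assumptions, not the repository's actual types:

```go
package main

import "fmt"

// ServiceConfig is a hypothetical slice of the per-service settings;
// field names mirror keys from gateway_config.
type ServiceConfig struct {
	InitialScore   int
	Tier1Threshold int
	MaxRetries     int
	LatencyProfile string
}

// merge returns defaults with any non-zero override fields applied.
// A zero value in the override means "inherit from defaults".
func merge(defaults, override ServiceConfig) ServiceConfig {
	out := defaults
	if override.InitialScore != 0 {
		out.InitialScore = override.InitialScore
	}
	if override.Tier1Threshold != 0 {
		out.Tier1Threshold = override.Tier1Threshold
	}
	if override.MaxRetries != 0 {
		out.MaxRetries = override.MaxRetries
	}
	if override.LatencyProfile != "" {
		out.LatencyProfile = override.LatencyProfile
	}
	return out
}

func main() {
	defaults := ServiceConfig{
		InitialScore:   100,
		Tier1Threshold: 80,
		MaxRetries:     3,
		LatencyProfile: "standard",
	}
	// A services[] entry that only overrides two fields.
	eth := merge(defaults, ServiceConfig{LatencyProfile: "fast", MaxRetries: 5})
	fmt.Println(eth) // prints: {100 80 5 fast}
}
```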
Implemented comprehensive retry system improvements:
- Retry endpoint rotation: never reuse the same endpoint on retry; select a new endpoint following QoS/reputation rules
- Latency budget: configurable max_retry_latency (default 500ms) to prevent retrying slow requests
- Concurrency configuration: made max_parallel_endpoints, max_concurrent_relays, and max_batch_payloads configurable via YAML
- Endpoint exhaustion handling: exponential backoff (100ms, 200ms, 400ms) when all endpoints have been tried

Bug fixes:
- Fixed empty response handling in retry loops
- Fixed attempt number in retry metrics (removed incorrect +1)
- Fixed break/return in select statements
- Fixed empty endpoint domain in metrics recording
- Fixed Redis test to use the testcontainer address
- Added missing error storage before loop breaks
- Enhanced context cancellation logging

Metrics improvements:
- Added endpoint_domain label to retry_latency metric
- Added retry_reason label to retry_budget_skipped metric
- New endpoint rotation metrics: shannon_retry_endpoint_switches_total, shannon_retry_endpoint_exhaustion_total
- All metrics follow lowercase_underscore naming convention

Configuration:
- Updated config schema with max_retry_latency and concurrency_config
- Added comprehensive examples in config.shannon_example.yaml
- Per-service override support for all retry and concurrency settings

Testing:
- Fixed Redis test conflicts by using testcontainer addresses
- Updated test mocks with GetConcurrencyConfig()
- E2E tests passing at 99.33% success rate (298/300 requests)
This commit fixes a critical bug in the RPC type fallback system and adds RPC-type-aware reputation tracking and configurable Cosmos QoS.

## Critical Bug Fix - Empty URL Issue

**Problem**: RPC type fallback was setting `actualRPCType` internally but not propagating it back to the relay context. When `endpoint.GetURL(originalRPCType)` was called with the original (unsupported) RPC type, it returned empty strings, causing "Post \"\": unsupported protocol scheme \"\"" errors.

**Impact**: 88-95% failure rate for Cosmos chains that relied on RPC fallback
- osmosis: 12.31% success (87.69% failures)
- xrplevm: 5.22% success (94.78% failures)

**Root Cause**: In `protocol/shannon/protocol.go`, `getUniqueEndpoints()` and `getSessionsUniqueEndpoints()` performed RPC type fallback (COMET_BFT → JSON_RPC) but didn't return the actual RPC type used. The relay context continued using the original unsupported RPC type, causing empty URL lookups.

**Solution**:
1. Extended `getUniqueEndpoints()` to return `actualRPCType` as a second return value
2. Extended `getSessionsUniqueEndpoints()` to return `actualRPCType` as a second return value
3. Updated all callers to capture and log `actualRPCType`
4. Added a runtime fallback safety net in `context.go` to try alternative RPC types

**Result**: 5-10x improvement in success rates
- osmosis: 12% → 62% (5x improvement)
- xrplevm: 5% → 100% (20x improvement after additional fixes)
- Zero empty URL errors across all 34 test runs

**Files Modified**:
- `protocol/shannon/protocol.go`: Extended return signatures
- `protocol/shannon/context.go`: Added runtime fallback handler
- `protocol/shannon/websocket_context.go`: Updated websocket endpoint selection
- `protocol/shannon/operational.go`: Updated GetServiceReadiness

## Enhancement - Cosmos QoS Configurable RPC Types

**Problem**: Cosmos SDK QoS was hardcoded to only accept REST and COMET_BFT RPC types. Hybrid chains like XRPLEVM (Cosmos SDK with EVM support) need to handle JSON_RPC for EVM methods. After RPC fallback to JSON_RPC, QoS rejected requests as "unsupported RPC type".

**Impact**: XRPLEVM showing "request uses unsupported RPC type" errors for EVM methods after RPC fallback.

**Solution**:
1. Added `NewSimpleQoSInstanceWithAPIs()` constructor accepting custom supported APIs
2. Added `convertRPCTypesToMap()` helper in `cmd/qos.go` to read RPC types from config
3. Updated Cosmos QoS initialization to read `rpc_types` from the unified service config
4. Updated `simpleCosmosConfig` to store and return configurable supported APIs

**Result**: XRPLEVM achieves 100% success (570/570 requests), handling both Cosmos and EVM methods seamlessly.

**Files Modified**:
- `qos/cosmos/qos.go`: Added configurable constructor
- `cmd/qos.go`: Added RPC type conversion logic

## Enhancement - RPC-Type-Aware Reputation Tracking

**Problem**: Reputation was tracked per (service, endpoint) only. If an endpoint URL served multiple RPC types with different reliability (e.g., WebSocket broken but JSON-RPC works), both protocols shared the same score, causing incorrect filtering decisions.

**Solution**: Extended the reputation key to include RPC type as a required dimension. All reputation scores are now tracked at (service, endpoint, rpcType) granularity.

**Key Format Changes**:
- Before: `"eth:pokt1abc-https://node.example.com"`
- After: `"eth:pokt1abc-https://node.example.com:json_rpc"`

**Benefits**:
- Separate reputation scores for different protocols at the same endpoint
- Better filtering decisions for hybrid chains
- More accurate supplier quality assessment

**Files Modified**:
- `reputation/reputation.go`: Added RPCType field to EndpointKey
- `reputation/key.go`: Updated all KeyBuilder implementations
- `reputation/key_test.go`: Added comprehensive tests for RPC-type-aware keys
- `reputation/*_test.go`: Updated all tests to include RPC type
- `reputation/storage/*.go`: Updated storage implementations
- `protocol/shannon/context.go`: Updated recording points to include RPC type
- `protocol/shannon/reputation.go`: Updated filtering to use RPC type
- `protocol/shannon/websocket_context.go`: Updated WebSocket recording

## Additional Features

**Target-Suppliers Header Support**: Added `parseAllowedSuppliersHeader()` to read and parse the `Target-Suppliers` HTTP header, allowing clients to restrict requests to specific suppliers. When specified, this bypasses reputation filtering and other selection logic.

**RPC Type Detection & Validation**: Added new gateway components for RPC type detection, validation, and error response generation:
- `gateway/rpc_type_detector.go`: Detects RPC type from HTTP requests
- `gateway/rpc_type_validator.go`: Validates RPC types against service config
- `gateway/rpc_type_error_response.go`: Generates proper error responses
- `gateway/rpctype_mapper.go`: Maps RPC types to/from wire format

**Error Classification**: Added a comprehensive error classification system for the Shannon protocol with detailed signal mapping and reputation scoring impact.

**WebSocket Monitoring**: Added WebSocket connection monitoring and health tracking.

**Test Coverage**: Added extensive tests for RPC type fallback, gateway modes, and reputation tracking.

## Test Results

**Unit Tests**: All passing (26 packages, 0 failures)
**Lint**: 0 issues
**Build**: Successful

**E2E Test Summary** (across 34 test runs):
- **Empty URL Errors**: 0 (was the primary failure mode before the fix)
- **RPC Fallback Success**: 100% working correctly

**Cosmos Chains** (48-100% success after fix):
- juno: 100% ✅
- persistence: 100% ✅
- akash: 100% ✅
- stargaze: 99.74% ✅
- xrplevm: 100% ✅ (hybrid Cosmos+EVM)
- fetch: 92.31% ✅
- osmosis: 62% (improved from 12%; remaining failures are supplier quality)

**EVM Chains** (75-83% success):
- eth, poly, avax, bsc, base: all passing, with supplier-quality-related failures only

**Other Chains**:
- solana: 95.42% ✅

**All remaining failures are supplier quality issues** (pruned state, missing trie nodes, 404 errors, timeouts), not PATH bugs.

## Breaking Changes

**Reputation Storage Format**: Reputation keys now include RPC type. Existing reputation data will be invalidated; the system will build new scores naturally as traffic flows (~5-10 minute settling period).

**Protocol Interface**: `AvailableHTTPEndpoints()` and `BuildHTTPRequestContextForEndpoint()` now require an `rpcType` parameter.

**QoS Interface**: `ParseHTTPRequest()` now receives a `detectedRPCType` parameter.

## Migration Notes

1. Reputation scores will reset on deployment (new key format)
2. The system reaches steady state within 5-10 minutes as new scores accumulate
3. No configuration changes required (backward compatible)
4. Storage schema unchanged (keys are stored as strings, so the format change is transparent)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
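The RPC-type-aware key format can be sketched directly from the before/after examples in the commit message. The key layout matches those examples; the `EndpointKey` field names are assumptions mirroring the description of `reputation/reputation.go`:

```go
package main

import "fmt"

// EndpointKey identifies a reputation entry at (service, endpoint, rpcType)
// granularity. RPCType is the new required dimension; field names are
// illustrative.
type EndpointKey struct {
	ServiceID    string
	EndpointAddr string // supplier address + URL, e.g. "pokt1abc-https://node.example.com"
	RPCType      string // e.g. "json_rpc", "websocket", "comet_bft"
}

// String renders the storage key. Appending the RPC type means the same
// endpoint URL gets independent scores per protocol, so a broken WebSocket
// no longer drags down a healthy JSON-RPC score.
func (k EndpointKey) String() string {
	return fmt.Sprintf("%s:%s:%s", k.ServiceID, k.EndpointAddr, k.RPCType)
}

func main() {
	k := EndpointKey{
		ServiceID:    "eth",
		EndpointAddr: "pokt1abc-https://node.example.com",
		RPCType:      "json_rpc",
	}
	fmt.Println(k.String())
	// prints: eth:pokt1abc-https://node.example.com:json_rpc
}
```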
…election fix

Major changes:
- Wire all metrics (health checks, relays, retries, batches, probation events)
- Add relay counter/histogram with 6 labels (domain, rpc_type, service_id, status_code, reputation_signal, request_type)
- Add mean score gauge for reputation tracking
- Wire WebSocket metrics (connections active, events, duration, messages)
- Wire probation events (entered, exited, routed)
- Fix tiered selection bug with per-domain key granularity
  - Build reverse mapping from domain keys to full endpoint addresses
  - Fixes "0 endpoints in selected tier" when using per-domain reputation
- Add configurable signal impacts in reputation config
- Fix death spiral for WebSocket (filterByReputation=false)
- Add RecoverySuccessSignal for health checks and probation requests
- Remove unused functions and clean up lint errors
- Optimize gRPC connections with keepalive, backoff, and flow control
  - Add production-ready dial options (keepalive pings, exponential backoff)
  - Increase flow control windows (1MB) and message sizes (4MB) for high throughput
- Refactor account cache with 1 hour TTL (was effectively infinite: ~292 years)
  - Add cache invalidation on signature verification failure
  - Add retry mechanism: invalidate cache and retry once before blacklisting
- Add new metrics for pubkey cache events:
  - path_supplier_nil_pubkey_total
  - path_supplier_pubkey_cache_events_total (invalidated/recovered)
- Fix lint errors: add missing mock methods, remove orphaned legacy files
- Add SignerContext caching in signer.go to reuse pre-computed crypto values
- Cache rings by (appAddress, sessionEndHeight) to avoid redundant ring creation
- Add session endpoint caching via getOrCreateSessionEndpoints()
- Improve e2e test container cleanup to handle interrupted runs
- Fix reputation key_test.go incorrect expectation for domain extraction
- Add unified metrics dashboard and relay tracking improvements
- Add ring signature generation for WebSocket handshake (eager validation)
- Include session metadata headers in RelayMiner connection for upfront validation
- Implement bidirectional close code propagation between client and endpoint
- Add panic recovery for observation channel during shutdown
Adds validation to detect when cached endpoints have a different session start height than the requested session, which can occur if session IDs are reused across different session periods.

When a collision is detected:
- Logs a detailed error with session IDs and start heights
- Invalidates the stale cache entry
- Creates fresh endpoints from the new session
- Returns correct endpoints to prevent supplier validation failures

This fixes the issue where relay miners reject requests with "supplier not found in session" even though the supplier address and session ID are correct, caused by mismatched session start heights.
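The collision check above amounts to comparing start heights before trusting a cache hit. A minimal sketch, with all type names assumed for illustration:

```go
package main

import "fmt"

// session carries the two fields the collision check compares.
type session struct {
	ID          string
	StartHeight int64
}

// endpointCache maps session ID to the session its endpoints were built for.
type endpointCache struct {
	entries map[string]session
}

// valid reports whether cached endpoints can be reused for the requested
// session. A matching ID with a different start height means the session ID
// was reused across periods: the stale entry is invalidated so fresh
// endpoints get built.
func (c *endpointCache) valid(req session) bool {
	cached, ok := c.entries[req.ID]
	if !ok {
		return false
	}
	if cached.StartHeight != req.StartHeight {
		delete(c.entries, req.ID) // stale: same ID, different session period
		return false
	}
	return true
}

func main() {
	c := &endpointCache{entries: map[string]session{
		"sess-1": {ID: "sess-1", StartHeight: 1000},
	}}
	// Same session ID arrives with a new start height: collision detected,
	// cache entry invalidated, caller rebuilds endpoints.
	fmt.Println(c.valid(session{ID: "sess-1", StartHeight: 1100})) // prints: false
	_, stillCached := c.entries["sess-1"]
	fmt.Println(stillCached) // prints: false
}
```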
Summary
Unified QoS system with per-service configuration, reputation-based endpoint selection, async observation pipeline, and comprehensive retry/fallback mechanisms including RPC type fallback with proper actualRPCType propagation.
New Features
- Named latency profiles: `fast`, `standard`, `slow`, `llm`
- Unified service configuration: `defaults` + `services[]`

Removed Systems
Breaking Changes
- Reputation Storage Format: keys now include the RPC type dimension (`serviceID:endpointAddr:rpcType`)
- Protocol Interface Changes: `AvailableHTTPEndpoints()` and `BuildHTTPRequestContextForEndpoint()` now require an `rpcType` parameter
- QoS Interface Changes: `ParseHTTPRequest()` now receives a `detectedRPCType` parameter
- Configuration Structure: new required sections: `reputation_config` (global), `active_health_checks` (global), `observation_pipeline` (global), `defaults` (service defaults), `services` (per-service array)
- Private Key Logging: now redacted (security fix)
Configuration
See `config/examples/config.shannon_example.yaml` for full configuration examples.

Test Results
Pre-Commit Checks: unit tests all passing (26 packages, 0 failures), lint clean (0 issues), build successful.

E2E Results (34 test runs): zero empty URL errors; RPC fallback working correctly.

Cosmos Chains (48-100%): juno 100%, persistence 100%, akash 100%, stargaze 99.74%, xrplevm 100% (hybrid Cosmos+EVM), fetch 92.31%, osmosis 62% (up from 12%).

EVM Chains (75-83%): eth, poly, avax, bsc, base all passing, with supplier-quality-related failures only.

Other Chains: solana 95.42%.
Remaining failures are supplier quality issues (pruned state, missing trie nodes, 404s, timeouts).
New Prometheus Metrics
Reputation:
- `shannon_reputation_signals_total`
- `shannon_reputation_endpoints_filtered_total`
- `shannon_reputation_score_distribution`
- `shannon_probation_endpoints`

Health Checks:
- `shannon_health_check_total`
- `shannon_health_check_duration_seconds`

Session:
- `shannon_active_sessions`
- `shannon_session_endpoints`

Retry:
- `shannon_retries_total`
- `shannon_retry_success_total`
- `shannon_retry_latency_seconds`
- `shannon_retry_budget_skipped_total`
- `shannon_retry_endpoint_switches_total`
- `shannon_retry_endpoint_exhaustion_total`