feat: Unified QoS System - Pluggable Health Checks + Reputation #500

jorgecuesta · 2025-12-04T10:10:00Z

Summary

Unified QoS system with per-service configuration, reputation-based endpoint selection, async observation pipeline, and comprehensive retry/fallback mechanisms including RPC type fallback with proper actualRPCType propagation.

New Features

Async Observation Pipeline - Response parsing off critical path with configurable sampling
Reputation System - Score-based endpoint quality tracking (0-100 scale)
Tiered Endpoint Selection - 3-tier cascading selection (Tier 1 → Tier 2 → Tier 3)
Probation System - Recovery mechanism for low-scoring endpoints (10% traffic sampling)
RPC Type Fallback - Automatic fallback to alternative RPC types with actualRPCType propagation
RPC-Type-Aware Reputation - Separate scores per (endpoint, RPC type) tuple
Configurable Cosmos QoS - Custom RPC type support for hybrid chains (Cosmos+EVM)
Session Rollover - Automatic endpoint refresh on session transitions with configurable rollover blocks
WebSocket Height Subscription - Real-time blockchain height monitoring via WebSocket instead of polling
Distributed Health Checks - Proactive endpoint monitoring with leader election (only one instance runs checks)
Latency-Aware Scoring - Fast endpoints get bonuses, slow ones penalized
Named Latency Profiles - Reusable configurations: fast, standard, slow, llm
Per-Service Configuration - Override any global setting via defaults + services[]
External Health Check Rules - Fetch health check definitions from remote URLs
Enhanced Retry System - Endpoint rotation, latency budget, configurable conditions
Retry Endpoint Rotation - Never retry same failed endpoint within request
Retry Latency Budget - Skip retries on slow requests (configurable threshold)
Concurrency Controls - Configurable limits: parallel endpoints, concurrent relays, batch payloads
Target-Suppliers Header - Filter requests to specific suppliers via HTTP header
Error Classification System - Comprehensive error categorization with reputation signals
WebSocket Monitoring - Connection health tracking and failure detection
RPC Type Detection - Automatic detection and validation from HTTP requests
Comprehensive Metrics - Prometheus metrics for reputation, health checks, retries, sessions

Removed Systems

Hydrator Command - Replaced by async observation pipeline
Sanctions System - Replaced by reputation-based filtering
Hardcoded Service Configs - Now fully YAML-driven configuration
Synchronous QoS Validation - Moved to async background processing

Breaking Changes

Reputation Storage Format: Keys now include RPC type dimension (serviceID:endpointAddr:rpcType)
- Existing reputation data invalidated
- System rebuilds scores naturally (5-10 min settling period)
Protocol Interface Changes:
- AvailableHTTPEndpoints() now requires rpcType parameter
- BuildHTTPRequestContextForEndpoint() now requires rpcType parameter
QoS Interface Changes:
- ParseHTTPRequest() now receives detectedRPCType parameter
Configuration Structure: New required sections
- reputation_config (global)
- active_health_checks (global)
- observation_pipeline (global)
- defaults (service defaults)
- services (per-service array)
Private Key Logging: Now redacted (security fix)

Configuration

See config/examples/config.shannon_example.yaml for full configuration examples including:

Global defaults and per-service overrides
RPC type fallback mappings
Reputation, retry, and health check settings
Latency profiles and concurrency controls

Test Results

Pre-Commit Checks:

✅ Unit Tests: All passing (26 packages)
✅ Lint: 0 issues
✅ Build: Successful

E2E Results (34 test runs):

Empty URL Errors: 0
RPC Fallback Success: 100%

Cosmos Chains (48-100%):

juno: 100% ✅
persistence: 100% ✅
akash: 100% ✅
stargaze: 99.74% ✅
xrplevm: 100% ✅ (hybrid Cosmos+EVM)
fetch: 92.31% ✅
osmosis: 62%

EVM Chains (75-83%):

eth, poly, avax, bsc, base: All passing

Other Chains:

solana: 95.42% ✅

Remaining failures are supplier quality issues (pruned state, missing trie nodes, 404s, timeouts).

New Prometheus Metrics

Reputation:

shannon_reputation_signals_total
shannon_reputation_endpoints_filtered_total
shannon_reputation_score_distribution
shannon_probation_endpoints

Health Checks:

shannon_health_check_total
shannon_health_check_duration_seconds

Session:

shannon_active_sessions
shannon_session_endpoints

Retry:

shannon_retries_total
shannon_retry_success_total
shannon_retry_latency_seconds
shannon_retry_budget_skipped_total
shannon_retry_endpoint_switches_total
shannon_retry_endpoint_exhaustion_total

Replace the legacy hydrator + sanctions system with a unified reputation-based QoS system that provides configurable health checks, endpoint recovery, and comprehensive observability. - Score-based endpoint tracking (0-100) replacing binary sanctions - Tiered selection (Tier 1/2/3) based on reputation scores - Probation system for endpoint recovery (10% traffic sampling) - Latency-aware scoring with service-specific profiles (EVM, Cosmos, Solana, LLM) - Signal types: success, minor/major/critical/fatal errors, recovery_success - YAML-configurable checks (replacing hardcoded Go logic) - Execute through protocol layer (tests full relay path including relay miners) - Support for jsonrpc, rest, websocket check types - External URL config support for centralized health check definitions - Leader election for multi-instance deployments - Full Prometheus metrics suite for reputation, probation, health checks, sessions - Metrics labels: service_id, endpoint_domain, signal_type, endpoint_type - Async observation pipeline for non-blocking response processing - Hydrator (gateway/hydrator*.go, cmd/hydrator.go) - Permanent sanctions (protocol/shannon/sanction*.go) - Hardcoded per-chain health checks - Health check executor (gateway/health_check_*.go) - QoS extractors (qos/*/extractor.go) for deep response parsing - Documentation (docs/HOW_TO_RUN_PATH.md, docs/REPUTATION_SYSTEM.md) - Metrics packages (metrics/healthcheck, metrics/session, metrics/retry)

This commit introduces a unified YAML configuration system that allows per-service overrides for all gateway settings. Key changes: Unified Service Configuration: - New `gateway_config.defaults` for global service defaults - New `gateway_config.services[]` for per-service overrides - Merge logic: services inherit from defaults and override specific fields - Named latency profiles (`fast`, `standard`, `slow`, `llm`) Per-Service Configuration Support: - Reputation config (initial_score, min_threshold, recovery_timeout) - Tiered selection thresholds (tier1_threshold, tier2_threshold) - Probation settings (threshold, traffic_percent, recovery_multiplier) - Retry config (max_retries, retry_on_5xx, retry_on_timeout) - Observation pipeline sample rates - Latency profiles and inline latency config - Active health checks with per-service rules Health Check System: - Leader election for distributed deployments (Redis-based) - External health check rules URL with auto-refresh - Per-service local health check overrides - Latency integration in health check signals - Comprehensive health check executor with configurable checks Retry Logic: - Full retry implementation with per-service configuration - Retry on 5xx, timeout, and connection errors - Configurable max retries per service Code Quality: - Removed hardcoded service configurations (service_qos_config.go) - Removed scattered config files (health_check_defaults.go) - Added comprehensive test coverage for new components - Fixed private key logging (now redacted) - Added extractor factory for QoS metric extraction Breaking Changes: - Config format changed: services now defined under gateway_config.services[] - Removed deprecated service_fallback array (use services[].fallback)

Implemented comprehensive retry system improvements: - Retry endpoint rotation: never reuse same endpoint on retry, select new endpoint following QoS/reputation rules - Latency budget: configurable max_retry_latency (default 500ms) to prevent retrying slow requests - Concurrency configuration: made max_parallel_endpoints, max_concurrent_relays, and max_batch_payloads configurable via YAML - Endpoint exhaustion handling: exponential backoff (100ms, 200ms, 400ms) when all endpoints tried Bug fixes: - Fixed empty response handling in retry loops - Fixed attempt number in retry metrics (removed incorrect +1) - Fixed break/return in select statements - Fixed empty endpoint domain in metrics recording - Fixed Redis test to use testcontainer address - Added missing error storage before loop breaks - Enhanced context cancellation logging Metrics improvements: - Added endpoint_domain label to retry_latency metric - Added retry_reason label to retry_budget_skipped metric - New endpoint rotation metrics: shannon_retry_endpoint_switches_total, shannon_retry_endpoint_exhaustion_total - All metrics follow lowercase_underscore naming convention Configuration: - Updated config schema with max_retry_latency and concurrency_config - Added comprehensive examples in config.shannon_example.yaml - Per-service override support for all retry and concurrency settings Testing: - Fixed Redis test conflicts by using testcontainer addresses - Updated test mocks with GetConcurrencyConfig() - E2E tests passing at 99.33% success rate (298/300 requests)

jorgecuesta · 2025-12-06T06:46:21Z

https://github.com/pokt-network/path/actions/runs/19984829010

This commit fixes a critical bug in the RPC type fallback system and adds RPC-type-aware reputation tracking and configurable Cosmos QoS. ## Critical Bug Fix - Empty URL Issue **Problem**: RPC type fallback was setting `actualRPCType` internally but not propagating it back to the relay context. When `endpoint.GetURL(originalRPCType)` was called with the original (unsupported) RPC type, it returned empty strings, causing "Post \"\": unsupported protocol scheme \"\"" errors. **Impact**: 88-95% failure rate for Cosmos chains that relied on RPC fallback - osmosis: 12.31% success (87.69% failures) - xrplevm: 5.22% success (94.78% failures) **Root Cause**: In `protocol/shannon/protocol.go`, `getUniqueEndpoints()` and `getSessionsUniqueEndpoints()` performed RPC type fallback (COMET_BFT → JSON_RPC) but didn't return the actual RPC type used. The relay context continued using the original unsupported RPC type, causing empty URL lookups. **Solution**: 1. Extended `getUniqueEndpoints()` to return `actualRPCType` as second return value 2. Extended `getSessionsUniqueEndpoints()` to return `actualRPCType` as second return value 3. Updated all callers to capture and log actualRPCType 4. Added runtime fallback safety net in `context.go` to try alternative RPC types **Result**: 5-10x improvement in success rates - osmosis: 12% → 62% (5x improvement) - xrplevm: 5% → 100% (20x improvement after additional fixes) - Zero empty URL errors across all 34 test runs **Files Modified**: - `protocol/shannon/protocol.go`: Extended return signatures - `protocol/shannon/context.go`: Added runtime fallback handler - `protocol/shannon/websocket_context.go`: Updated websocket endpoint selection - `protocol/shannon/operational.go`: Updated GetServiceReadiness ## Enhancement - Cosmos QoS Configurable RPC Types **Problem**: Cosmos SDK QoS was hardcoded to only accept REST and COMET_BFT RPC types. Hybrid chains like XRPLEVM (Cosmos SDK with EVM support) need to handle JSON_RPC for EVM methods. After RPC fallback to JSON_RPC, QoS rejected requests as "unsupported RPC type". **Impact**: XRPLEVM showing "request uses unsupported RPC type" errors for EVM methods after RPC fallback. **Solution**: 1. Added `NewSimpleQoSInstanceWithAPIs()` constructor accepting custom supported APIs 2. Added `convertRPCTypesToMap()` helper in `cmd/qos.go` to read RPC types from config 3. Updated Cosmos QoS initialization to read `rpc_types` from unified service config 4. Updated `simpleCosmosConfig` to store and return configurable supported APIs **Result**: XRPLEVM achieves 100% success (570/570 requests) handling both Cosmos and EVM methods seamlessly. **Files Modified**: - `qos/cosmos/qos.go`: Added configurable constructor - `cmd/qos.go`: Added RPC type conversion logic ## Enhancement - RPC-Type-Aware Reputation Tracking **Problem**: Reputation was tracked per (service, endpoint) only. If an endpoint URL served multiple RPC types with different reliability (e.g., WebSocket broken but JSON-RPC works), both protocols shared the same score, causing incorrect filtering decisions. **Solution**: Extended reputation key to include RPC type as required dimension. All reputation scores now tracked at (service, endpoint, rpcType) granularity. **Key Format Changes**: - Before: `"eth:pokt1abc-https://node.example.com"` - After: `"eth:pokt1abc-https://node.example.com:json_rpc"` **Benefits**: - Separate reputation scores for different protocols at same endpoint - Better filtering decisions for hybrid chains - More accurate supplier quality assessment **Files Modified**: - `reputation/reputation.go`: Added RPCType field to EndpointKey - `reputation/key.go`: Updated all KeyBuilder implementations - `reputation/key_test.go`: Added comprehensive tests for RPC-type-aware keys - `reputation/*_test.go`: Updated all tests to include RPC type - `reputation/storage/*.go`: Updated storage implementations - `protocol/shannon/context.go`: Updated recording points to include RPC type - `protocol/shannon/reputation.go`: Updated filtering to use RPC type - `protocol/shannon/websocket_context.go`: Updated WebSocket recording ## Additional Features **Target-Suppliers Header Support**: Added `parseAllowedSuppliersHeader()` to read and parse the `Target-Suppliers` HTTP header, allowing clients to restrict requests to specific suppliers. When specified, bypasses reputation filtering and other selection logic. **RPC Type Detection & Validation**: Added new gateway components for RPC type detection, validation, and error response generation: - `gateway/rpc_type_detector.go`: Detects RPC type from HTTP requests - `gateway/rpc_type_validator.go`: Validates RPC types against service config - `gateway/rpc_type_error_response.go`: Generates proper error responses - `gateway/rpctype_mapper.go`: Maps RPC types to/from wire format **Error Classification**: Added comprehensive error classification system for Shannon protocol with detailed signal mapping and reputation scoring impact. **WebSocket Monitoring**: Added WebSocket connection monitoring and health tracking. **Test Coverage**: Added extensive tests for RPC type fallback, gateway modes, and reputation tracking. ## Test Results **Unit Tests**: All passing (26 packages, 0 failures) **Lint**: 0 issues **Build**: Successful **E2E Test Summary** (across 34 test runs): - **Empty URL Errors**: 0 (was primary failure mode before fix) - **RPC Fallback Success**: 100% working correctly **Cosmos Chains** (48-100% success after fix): - juno: 100% ✅ - persistence: 100% ✅ - akash: 100% ✅ - stargaze: 99.74% ✅ - xrplevm: 100% ✅ (hybrid Cosmos+EVM) - fetch: 92.31% ✅ - osmosis: 62% (improved from 12%, remaining failures are supplier quality) **EVM Chains** (75-83% success): - eth, poly, avax, bsc, base: All passing with supplier-quality-related failures only **Other Chains**: - solana: 95.42% ✅ **All remaining failures are supplier quality issues** (pruned state, missing trie nodes, 404 errors, timeout issues), not PATH bugs. ## Breaking Changes **Reputation Storage Format**: Reputation keys now include RPC type. Existing reputation data will be invalidated. System will build new scores naturally as traffic flows (~5-10 minute settling period). **Protocol Interface**: `AvailableHTTPEndpoints()` and `BuildHTTPRequestContextForEndpoint()` now require `rpcType` parameter. **QoS Interface**: `ParseHTTPRequest()` now receives `detectedRPCType` parameter. ## Migration Notes 1. Reputation scores will reset on deployment (new key format) 2. System reaches steady state within 5-10 minutes as new scores accumulate 3. No configuration changes required (backward compatible) 4. Storage schema unchanged (keys stored as strings, format change transparent) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…election fix Major changes: - Wire all metrics (health checks, relays, retries, batches, probation events) - Add relay counter/histogram with 6 labels (domain, rpc_type, service_id, status_code, reputation_signal, request_type) - Add mean score gauge for reputation tracking - Wire WebSocket metrics (connections active, events, duration, messages) - Wire probation events (entered, exited, routed) - Fix tiered selection bug with per-domain key granularity - Build reverse mapping from domain keys to full endpoint addresses - Fixes "0 endpoints in selected tier" when using per-domain reputation - Add configurable signal impacts in reputation config - Fix death spiral for WebSocket (filterByReputation=false) - Add RecoverySuccessSignal for health checks and probation requests - Remove unused functions and clean up lint errors

- Optimize gRPC connections with keepalive, backoff, and flow control - Add production-ready dial options (keepalive pings, exponential backoff) - Increase flow control windows (1MB) and message sizes (4MB) for high throughput - Refactor account cache with 1 hour TTL (was infinite ~292 years) - Add cache invalidation on signature verification failure - Add retry mechanism: invalidate cache and retry once before blacklisting - Add new metrics for pubkey cache events: - path_supplier_nil_pubkey_total - path_supplier_pubkey_cache_events_total (invalidated/recovered) - Fix lint errors: add missing mock methods, remove orphaned legacy files

- Add SignerContext caching in signer.go to reuse pre-computed crypto values - Cache rings by (appAddress, sessionEndHeight) to avoid redundant ring creation - Add session endpoint caching via getOrCreateSessionEndpoints() - Improve e2e test container cleanup to handle interrupted runs - Fix reputation key_test.go incorrect expectation for domain extraction - Add unified metrics dashboard and relay tracking improvements

- Add ring signature generation for WebSocket handshake (eager validation) - Include session metadata headers in RelayMiner connection for upfront validation - Implement bidirectional close code propagation between client and endpoint - Add panic recovery for observation channel during shutdown

Adds validation to detect when cached endpoints have a different session start height than the requested session, which can occur if session IDs are reused across different session periods. When a collision is detected: - Logs detailed error with session IDs and start heights - Invalidates the stale cache entry - Creates fresh endpoints from the new session - Returns correct endpoints to prevent supplier validation failures This fixes the issue where relay miners reject requests with "supplier not found in session" even though the supplier address and session ID are correct, caused by mismatched session start heights.

oten91 changed the base branch from main to staging December 4, 2025 10:11

jorgecuesta requested a review from oten91 December 4, 2025 10:12

jorgecuesta self-assigned this Dec 4, 2025

jorgecuesta added the qos Intended to improve quality of service label Dec 4, 2025

jorgecuesta force-pushed the feat/unified-qos branch from a3eec4a to 9f8eb17 Compare December 4, 2025 10:15

jorgecuesta force-pushed the feat/unified-qos branch from 0487120 to b255323 Compare December 4, 2025 20:51

oten91 and others added 3 commits December 4, 2025 21:54

run make proto_regen

a8bd0e6

jorgecuesta force-pushed the feat/unified-qos branch from 84a32ba to a041247 Compare December 6, 2025 06:42

jorgecuesta and others added 8 commits December 16, 2025 05:10

refactor: ensure single session per app in centralized gateway mode

9bb004b

fix: prevent send on closed channel race in websocket connLoop

2b52115

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Unified QoS System - Pluggable Health Checks + Reputation #500

feat: Unified QoS System - Pluggable Health Checks + Reputation #500

Uh oh!

jorgecuesta commented Dec 4, 2025 •

edited

Loading

Uh oh!

jorgecuesta commented Dec 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

feat: Unified QoS System - Pluggable Health Checks + Reputation #500

Are you sure you want to change the base?

feat: Unified QoS System - Pluggable Health Checks + Reputation #500

Uh oh!

Conversation

jorgecuesta commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New Features

Removed Systems

Breaking Changes

Configuration

Test Results

New Prometheus Metrics

Uh oh!

jorgecuesta commented Dec 6, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jorgecuesta commented Dec 4, 2025 •

edited

Loading