Skip to content

Conversation

@jorgecuesta
Copy link
Contributor

@jorgecuesta jorgecuesta commented Dec 4, 2025

Summary

Unified QoS system with per-service configuration, reputation-based endpoint selection, async observation pipeline, and comprehensive retry/fallback mechanisms including RPC type fallback with proper actualRPCType propagation.

New Features

  1. Async Observation Pipeline - Response parsing off critical path with configurable sampling
  2. Reputation System - Score-based endpoint quality tracking (0-100 scale)
  3. Tiered Endpoint Selection - 3-tier cascading selection (Tier 1 → Tier 2 → Tier 3)
  4. Probation System - Recovery mechanism for low-scoring endpoints (10% traffic sampling)
  5. RPC Type Fallback - Automatic fallback to alternative RPC types with actualRPCType propagation
  6. RPC-Type-Aware Reputation - Separate scores per (endpoint, RPC type) tuple
  7. Configurable Cosmos QoS - Custom RPC type support for hybrid chains (Cosmos+EVM)
  8. Session Rollover - Automatic endpoint refresh on session transitions with configurable rollover blocks
  9. WebSocket Height Subscription - Real-time blockchain height monitoring via WebSocket instead of polling
  10. Distributed Health Checks - Proactive endpoint monitoring with leader election (only one instance runs checks)
  11. Latency-Aware Scoring - Fast endpoints get bonuses, slow ones penalized
  12. Named Latency Profiles - Reusable configurations: fast, standard, slow, llm
  13. Per-Service Configuration - Override any global setting via defaults + services[]
  14. External Health Check Rules - Fetch health check definitions from remote URLs
  15. Enhanced Retry System - Endpoint rotation, latency budget, configurable conditions
  16. Retry Endpoint Rotation - Never retry same failed endpoint within request
  17. Retry Latency Budget - Skip retries on slow requests (configurable threshold)
  18. Concurrency Controls - Configurable limits: parallel endpoints, concurrent relays, batch payloads
  19. Target-Suppliers Header - Filter requests to specific suppliers via HTTP header
  20. Error Classification System - Comprehensive error categorization with reputation signals
  21. WebSocket Monitoring - Connection health tracking and failure detection
  22. RPC Type Detection - Automatic detection and validation from HTTP requests
  23. Comprehensive Metrics - Prometheus metrics for reputation, health checks, retries, sessions

Removed Systems

  1. Hydrator Command - Replaced by async observation pipeline
  2. Sanctions System - Replaced by reputation-based filtering
  3. Hardcoded Service Configs - Now fully YAML-driven configuration
  4. Synchronous QoS Validation - Moved to async background processing

Breaking Changes

  1. Reputation Storage Format: Keys now include RPC type dimension (serviceID:endpointAddr:rpcType)

    • Existing reputation data invalidated
    • System rebuilds scores naturally (5-10 min settling period)
  2. Protocol Interface Changes:

    • AvailableHTTPEndpoints() now requires rpcType parameter
    • BuildHTTPRequestContextForEndpoint() now requires rpcType parameter
  3. QoS Interface Changes:

    • ParseHTTPRequest() now receives detectedRPCType parameter
  4. Configuration Structure: New required sections

    • reputation_config (global)
    • active_health_checks (global)
    • observation_pipeline (global)
    • defaults (service defaults)
    • services (per-service array)
  5. Private Key Logging: Now redacted (security fix)

Configuration

See config/examples/config.shannon_example.yaml for full configuration examples including:

  • Global defaults and per-service overrides
  • RPC type fallback mappings
  • Reputation, retry, and health check settings
  • Latency profiles and concurrency controls

Test Results

Pre-Commit Checks:

  • ✅ Unit Tests: All passing (26 packages)
  • ✅ Lint: 0 issues
  • ✅ Build: Successful

E2E Results (34 test runs):

  • Empty URL Errors: 0
  • RPC Fallback Success: 100%

Cosmos Chains (48-100%):

  • juno: 100% ✅
  • persistence: 100% ✅
  • akash: 100% ✅
  • stargaze: 99.74% ✅
  • xrplevm: 100% ✅ (hybrid Cosmos+EVM)
  • fetch: 92.31% ✅
  • osmosis: 62%

EVM Chains (75-83%):

  • eth, poly, avax, bsc, base: All passing

Other Chains:

  • solana: 95.42% ✅

Remaining failures are supplier quality issues (pruned state, missing trie nodes, 404s, timeouts).

New Prometheus Metrics

Reputation:

  • shannon_reputation_signals_total
  • shannon_reputation_endpoints_filtered_total
  • shannon_reputation_score_distribution
  • shannon_probation_endpoints

Health Checks:

  • shannon_health_check_total
  • shannon_health_check_duration_seconds

Session:

  • shannon_active_sessions
  • shannon_session_endpoints

Retry:

  • shannon_retries_total
  • shannon_retry_success_total
  • shannon_retry_latency_seconds
  • shannon_retry_budget_skipped_total
  • shannon_retry_endpoint_switches_total
  • shannon_retry_endpoint_exhaustion_total

@oten91 oten91 changed the base branch from main to staging December 4, 2025 10:11
@jorgecuesta jorgecuesta requested a review from oten91 December 4, 2025 10:12
@jorgecuesta jorgecuesta self-assigned this Dec 4, 2025
@jorgecuesta jorgecuesta added the qos Intended to improve quality of service label Dec 4, 2025
Replace the legacy hydrator + sanctions system with a unified reputation-based
QoS system that provides configurable health checks, endpoint recovery, and
comprehensive observability.

- Score-based endpoint tracking (0-100) replacing binary sanctions
- Tiered selection (Tier 1/2/3) based on reputation scores
- Probation system for endpoint recovery (10% traffic sampling)
- Latency-aware scoring with service-specific profiles (EVM, Cosmos, Solana, LLM)
- Signal types: success, minor/major/critical/fatal errors, recovery_success

- YAML-configurable checks (replacing hardcoded Go logic)
- Execute through protocol layer (tests full relay path including relay miners)
- Support for jsonrpc, rest, websocket check types
- External URL config support for centralized health check definitions
- Leader election for multi-instance deployments

- Full Prometheus metrics suite for reputation, probation, health checks, sessions
- Metrics labels: service_id, endpoint_domain, signal_type, endpoint_type
- Async observation pipeline for non-blocking response processing

- Hydrator (gateway/hydrator*.go, cmd/hydrator.go)
- Permanent sanctions (protocol/shannon/sanction*.go)
- Hardcoded per-chain health checks

- Health check executor (gateway/health_check_*.go)
- QoS extractors (qos/*/extractor.go) for deep response parsing
- Documentation (docs/HOW_TO_RUN_PATH.md, docs/REPUTATION_SYSTEM.md)
- Metrics packages (metrics/healthcheck, metrics/session, metrics/retry)
oten91 and others added 3 commits December 4, 2025 21:54
This commit introduces a unified YAML configuration system that allows
per-service overrides for all gateway settings. Key changes:

Unified Service Configuration:
- New `gateway_config.defaults` for global service defaults
- New `gateway_config.services[]` for per-service overrides
- Merge logic: services inherit from defaults and override specific fields
- Named latency profiles (`fast`, `standard`, `slow`, `llm`)

Per-Service Configuration Support:
- Reputation config (initial_score, min_threshold, recovery_timeout)
- Tiered selection thresholds (tier1_threshold, tier2_threshold)
- Probation settings (threshold, traffic_percent, recovery_multiplier)
- Retry config (max_retries, retry_on_5xx, retry_on_timeout)
- Observation pipeline sample rates
- Latency profiles and inline latency config
- Active health checks with per-service rules

Health Check System:
- Leader election for distributed deployments (Redis-based)
- External health check rules URL with auto-refresh
- Per-service local health check overrides
- Latency integration in health check signals
- Comprehensive health check executor with configurable checks

Retry Logic:
- Full retry implementation with per-service configuration
- Retry on 5xx, timeout, and connection errors
- Configurable max retries per service

Code Quality:
- Removed hardcoded service configurations (service_qos_config.go)
- Removed scattered config files (health_check_defaults.go)
- Added comprehensive test coverage for new components
- Fixed private key logging (now redacted)
- Added extractor factory for QoS metric extraction

Breaking Changes:
- Config format changed: services now defined under gateway_config.services[]
- Removed deprecated service_fallback array (use services[].fallback)
Implemented comprehensive retry system improvements:

- Retry endpoint rotation: never reuse same endpoint on retry, select new endpoint following QoS/reputation rules
- Latency budget: configurable max_retry_latency (default 500ms) to prevent retrying slow requests
- Concurrency configuration: made max_parallel_endpoints, max_concurrent_relays, and max_batch_payloads configurable via YAML
- Endpoint exhaustion handling: exponential backoff (100ms, 200ms, 400ms) when all endpoints tried

Bug fixes:
- Fixed empty response handling in retry loops
- Fixed attempt number in retry metrics (removed incorrect +1)
- Fixed break/return in select statements
- Fixed empty endpoint domain in metrics recording
- Fixed Redis test to use testcontainer address
- Added missing error storage before loop breaks
- Enhanced context cancellation logging

Metrics improvements:
- Added endpoint_domain label to retry_latency metric
- Added retry_reason label to retry_budget_skipped metric
- New endpoint rotation metrics: shannon_retry_endpoint_switches_total, shannon_retry_endpoint_exhaustion_total
- All metrics follow lowercase_underscore naming convention

Configuration:
- Updated config schema with max_retry_latency and concurrency_config
- Added comprehensive examples in config.shannon_example.yaml
- Per-service override support for all retry and concurrency settings

Testing:
- Fixed Redis test conflicts by using testcontainer addresses
- Updated test mocks with GetConcurrencyConfig()
- E2E tests passing at 99.33% success rate (298/300 requests)
@jorgecuesta
Copy link
Contributor Author

jorgecuesta and others added 8 commits December 16, 2025 05:10
This commit fixes a critical bug in the RPC type fallback system and adds
RPC-type-aware reputation tracking and configurable Cosmos QoS.

## Critical Bug Fix - Empty URL Issue

**Problem**: RPC type fallback was setting `actualRPCType` internally but not
propagating it back to the relay context. When `endpoint.GetURL(originalRPCType)`
was called with the original (unsupported) RPC type, it returned empty strings,
causing "Post \"\": unsupported protocol scheme \"\"" errors.

**Impact**: 88-95% failure rate for Cosmos chains that relied on RPC fallback
- osmosis: 12.31% success (87.69% failures)
- xrplevm: 5.22% success (94.78% failures)

**Root Cause**: In `protocol/shannon/protocol.go`, `getUniqueEndpoints()` and
`getSessionsUniqueEndpoints()` performed RPC type fallback (COMET_BFT → JSON_RPC)
but didn't return the actual RPC type used. The relay context continued using
the original unsupported RPC type, causing empty URL lookups.

**Solution**:
1. Extended `getUniqueEndpoints()` to return `actualRPCType` as second return value
2. Extended `getSessionsUniqueEndpoints()` to return `actualRPCType` as second return value
3. Updated all callers to capture and log actualRPCType
4. Added runtime fallback safety net in `context.go` to try alternative RPC types

**Result**: 5-10x improvement in success rates
- osmosis: 12% → 62% (5x improvement)
- xrplevm: 5% → 100% (20x improvement after additional fixes)
- Zero empty URL errors across all 34 test runs

**Files Modified**:
- `protocol/shannon/protocol.go`: Extended return signatures
- `protocol/shannon/context.go`: Added runtime fallback handler
- `protocol/shannon/websocket_context.go`: Updated websocket endpoint selection
- `protocol/shannon/operational.go`: Updated GetServiceReadiness

## Enhancement - Cosmos QoS Configurable RPC Types

**Problem**: Cosmos SDK QoS was hardcoded to only accept REST and COMET_BFT RPC
types. Hybrid chains like XRPLEVM (Cosmos SDK with EVM support) need to handle
JSON_RPC for EVM methods. After RPC fallback to JSON_RPC, QoS rejected requests
as "unsupported RPC type".

**Impact**: XRPLEVM showing "request uses unsupported RPC type" errors for EVM
methods after RPC fallback.

**Solution**:
1. Added `NewSimpleQoSInstanceWithAPIs()` constructor accepting custom supported APIs
2. Added `convertRPCTypesToMap()` helper in `cmd/qos.go` to read RPC types from config
3. Updated Cosmos QoS initialization to read `rpc_types` from unified service config
4. Updated `simpleCosmosConfig` to store and return configurable supported APIs

**Result**: XRPLEVM achieves 100% success (570/570 requests) handling both Cosmos
and EVM methods seamlessly.

**Files Modified**:
- `qos/cosmos/qos.go`: Added configurable constructor
- `cmd/qos.go`: Added RPC type conversion logic

## Enhancement - RPC-Type-Aware Reputation Tracking

**Problem**: Reputation was tracked per (service, endpoint) only. If an endpoint
URL served multiple RPC types with different reliability (e.g., WebSocket broken
but JSON-RPC works), both protocols shared the same score, causing incorrect
filtering decisions.

**Solution**: Extended reputation key to include RPC type as required dimension.
All reputation scores now tracked at (service, endpoint, rpcType) granularity.

**Key Format Changes**:
- Before: `"eth:pokt1abc-https://node.example.com"`
- After: `"eth:pokt1abc-https://node.example.com:json_rpc"`

**Benefits**:
- Separate reputation scores for different protocols at same endpoint
- Better filtering decisions for hybrid chains
- More accurate supplier quality assessment

**Files Modified**:
- `reputation/reputation.go`: Added RPCType field to EndpointKey
- `reputation/key.go`: Updated all KeyBuilder implementations
- `reputation/key_test.go`: Added comprehensive tests for RPC-type-aware keys
- `reputation/*_test.go`: Updated all tests to include RPC type
- `reputation/storage/*.go`: Updated storage implementations
- `protocol/shannon/context.go`: Updated recording points to include RPC type
- `protocol/shannon/reputation.go`: Updated filtering to use RPC type
- `protocol/shannon/websocket_context.go`: Updated WebSocket recording

## Additional Features

**Target-Suppliers Header Support**: Added `parseAllowedSuppliersHeader()` to
read and parse the `Target-Suppliers` HTTP header, allowing clients to restrict
requests to specific suppliers. When specified, bypasses reputation filtering
and other selection logic.

**RPC Type Detection & Validation**: Added new gateway components for RPC type
detection, validation, and error response generation:
- `gateway/rpc_type_detector.go`: Detects RPC type from HTTP requests
- `gateway/rpc_type_validator.go`: Validates RPC types against service config
- `gateway/rpc_type_error_response.go`: Generates proper error responses
- `gateway/rpctype_mapper.go`: Maps RPC types to/from wire format

**Error Classification**: Added comprehensive error classification system for
Shannon protocol with detailed signal mapping and reputation scoring impact.

**WebSocket Monitoring**: Added WebSocket connection monitoring and health tracking.

**Test Coverage**: Added extensive tests for RPC type fallback, gateway modes,
and reputation tracking.

## Test Results

**Unit Tests**: All passing (26 packages, 0 failures)
**Lint**: 0 issues
**Build**: Successful

**E2E Test Summary** (across 34 test runs):
- **Empty URL Errors**: 0 (was primary failure mode before fix)
- **RPC Fallback Success**: 100% working correctly

**Cosmos Chains** (48-100% success after fix):
- juno: 100% ✅
- persistence: 100% ✅
- akash: 100% ✅
- stargaze: 99.74% ✅
- xrplevm: 100% ✅ (hybrid Cosmos+EVM)
- fetch: 92.31% ✅
- osmosis: 62% (improved from 12%, remaining failures are supplier quality)

**EVM Chains** (75-83% success):
- eth, poly, avax, bsc, base: All passing with supplier-quality-related failures only

**Other Chains**:
- solana: 95.42% ✅

**All remaining failures are supplier quality issues** (pruned state, missing trie
nodes, 404 errors, timeout issues), not PATH bugs.

## Breaking Changes

**Reputation Storage Format**: Reputation keys now include RPC type. Existing
reputation data will be invalidated. System will build new scores naturally as
traffic flows (~5-10 minute settling period).

**Protocol Interface**: `AvailableHTTPEndpoints()` and
`BuildHTTPRequestContextForEndpoint()` now require `rpcType` parameter.

**QoS Interface**: `ParseHTTPRequest()` now receives `detectedRPCType` parameter.

## Migration Notes

1. Reputation scores will reset on deployment (new key format)
2. System reaches steady state within 5-10 minutes as new scores accumulate
3. No configuration changes required (backward compatible)
4. Storage schema unchanged (keys stored as strings, format change transparent)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…election fix

Major changes:
- Wire all metrics (health checks, relays, retries, batches, probation events)
- Add relay counter/histogram with 6 labels (domain, rpc_type, service_id, status_code, reputation_signal, request_type)
- Add mean score gauge for reputation tracking
- Wire WebSocket metrics (connections active, events, duration, messages)
- Wire probation events (entered, exited, routed)
- Fix tiered selection bug with per-domain key granularity
  - Build reverse mapping from domain keys to full endpoint addresses
  - Fixes "0 endpoints in selected tier" when using per-domain reputation
- Add configurable signal impacts in reputation config
- Fix death spiral for WebSocket (filterByReputation=false)
- Add RecoverySuccessSignal for health checks and probation requests
- Remove unused functions and clean up lint errors
- Optimize gRPC connections with keepalive, backoff, and flow control
- Add production-ready dial options (keepalive pings, exponential backoff)
- Increase flow control windows (1MB) and message sizes (4MB) for high throughput

- Refactor account cache with 1 hour TTL (was infinite ~292 years)
- Add cache invalidation on signature verification failure
- Add retry mechanism: invalidate cache and retry once before blacklisting

- Add new metrics for pubkey cache events:
  - path_supplier_nil_pubkey_total
  - path_supplier_pubkey_cache_events_total (invalidated/recovered)

- Fix lint errors: add missing mock methods, remove orphaned legacy files
- Add SignerContext caching in signer.go to reuse pre-computed crypto values
- Cache rings by (appAddress, sessionEndHeight) to avoid redundant ring creation
- Add session endpoint caching via getOrCreateSessionEndpoints()
- Improve e2e test container cleanup to handle interrupted runs
- Fix reputation key_test.go incorrect expectation for domain extraction
- Add unified metrics dashboard and relay tracking improvements
- Add ring signature generation for WebSocket handshake (eager validation)
- Include session metadata headers in RelayMiner connection for upfront validation
- Implement bidirectional close code propagation between client and endpoint
- Add panic recovery for observation channel during shutdown
Adds validation to detect when cached endpoints have a different session
start height than the requested session, which can occur if session IDs
are reused across different session periods.

When a collision is detected:
- Logs detailed error with session IDs and start heights
- Invalidates the stale cache entry
- Creates fresh endpoints from the new session
- Returns correct endpoints to prevent supplier validation failures

This fixes the issue where relay miners reject requests with "supplier
not found in session" even though the supplier address and session ID
are correct, caused by mismatched session start heights.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qos Intended to improve quality of service

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants