
Conversation


@MuncleUscles MuncleUscles commented Jan 6, 2026

Summary

  • Add retry logic (3x with exponential backoff) to POST /genvm/run for faster failure detection
  • Track consecutive GenVM failures; /health returns 503 after threshold → K8s restarts pod
  • Transaction is released on failure (not stuck) so another healthy worker picks it up immediately

Config (env vars)

Variable                                 Default  Purpose
GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS   10       Per-attempt timeout (was 30)
GENVM_MANAGER_RUN_RETRIES                3        Max retry attempts
GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS    1        Base delay for backoff
GENVM_FAILURE_UNHEALTHY_THRESHOLD        3        Failures before unhealthy
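
For context, a minimal sketch of how these variables could drive the retry loop. The function name, endpoint wiring, and helper are illustrative, not the PR's actual code:

import asyncio
import os

import aiohttp


def _env_float(name: str, default: float) -> float:
    # Fall back to the default on a missing or malformed value.
    try:
        return float(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default


TIMEOUT_S = _env_float("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0)
RETRIES = int(os.environ.get("GENVM_MANAGER_RUN_RETRIES", "3"))
BASE_DELAY_S = _env_float("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0)


async def post_genvm_run(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    """POST /genvm/run with a per-attempt timeout and exponential backoff."""
    for attempt in range(RETRIES):
        try:
            async with session.post(
                url, json=payload, timeout=aiohttp.ClientTimeout(total=TIMEOUT_S)
            ) as resp:
                data = await resp.json()
                if resp.status != 200:
                    # A non-200 reply is treated as fatal in this sketch (no retry).
                    raise RuntimeError(f"/genvm/run failed: {resp.status} {data}")
                return data
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == RETRIES - 1:
                raise  # retries exhausted: let the caller count the failure
            await asyncio.sleep(BASE_DELAY_S * (2**attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("GENVM_MANAGER_RUN_RETRIES must be >= 1")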

Test plan

  • Unit tests for retry config and callbacks (tests/unit/test_genvm_retry.py)
  • Unit tests for health degradation (tests/unit/test_worker_health_degradation.py)
  • Manual test: simulate GenVM Manager timeout and verify pod restart

Summary by CodeRabbit

  • New Features

    • GenVM operations now retry transient failures with exponential backoff and invoke success/failure callbacks to track consecutive failures.
    • Health checks now probe GenVM and mark the service degraded (503) when consecutive failures meet a configurable threshold; responses include failure count and threshold for diagnostics.
    • Startup logs indicate GenVM failure tracking is configured.
  • Tests

    • Added tests for retry behavior, callback handling, env config, and health-degradation scenarios.




coderabbitai bot commented Jan 6, 2026

Warning

Rate limit exceeded

@MuncleUscles has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 3 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between 53590f1 and 6c25bbe.

📒 Files selected for processing (5)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_genvm_retry_integration.py
  • tests/unit/test_worker_health_degradation.py
📝 Walkthrough


Adds retryable GenVM run logic with success/failure callbacks, tracks consecutive GenVM failures in the worker, and makes health checks return 503 when failures meet a configurable threshold. Tests for retries, callbacks, env parsing, and health-degradation behavior were added.

Changes

Cohort / File(s) Summary
Worker Service Failure Tracking
backend/consensus/worker_service.py
Adds module globals _genvm_consecutive_failures and _genvm_failure_unhealthy_threshold (env-driven), public APIs increment_genvm_failure(), reset_genvm_failures(), get_genvm_failure_count(), wires callbacks at startup via set_genvm_callbacks(), and updates health_check() to return 503 with error, count, and threshold when the threshold is reached.
GenVM Manager Retry & Callbacks
backend/node/genvm/origin/base_host.py
Adds _get_int() env helper, global callback hooks _on_genvm_success / _on_genvm_failure with set_genvm_callbacks(), and wraps run_genvm() in a retry loop (configurable max_retries, retry_base_delay_s, HTTP timeout). Calls callbacks on success or final failure and preserves timeout/cancellation behavior across retries.
Unit Tests
tests/unit/test_genvm_retry.py, tests/unit/test_worker_health_degradation.py
Adds tests for callback registration/invocation, env var parsing and timeout helpers, retry configuration/defaults, GenVM consecutive-failure tracking (increment/reset/get), and /health behavior for healthy vs. threshold-exceeded cases.
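
To make the health-degradation contract concrete, here is a rough sketch of the counter-and-threshold pattern described above. The FastAPI wiring and exact response fields are assumptions, not the PR's literal code:

import os

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

_genvm_consecutive_failures: int = 0
_threshold: int = int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))


def increment_genvm_failure() -> None:
    """Called by the GenVM failure callback once retries are exhausted."""
    global _genvm_consecutive_failures
    _genvm_consecutive_failures += 1


def reset_genvm_failures() -> None:
    """Called by the success callback; clears the failure streak."""
    global _genvm_consecutive_failures
    _genvm_consecutive_failures = 0


@app.get("/health")
async def health_check():
    # Return 503 once the streak reaches the threshold, so the
    # Kubernetes liveness probe restarts the pod.
    if _genvm_consecutive_failures >= _threshold:
        return JSONResponse(
            status_code=503,
            content={
                "status": "unhealthy",
                "error": "genvm_manager_failures",
                "count": _genvm_consecutive_failures,
                "threshold": _threshold,
            },
        )
    return {"status": "healthy"}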

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Worker as Worker Service
    participant Callbacks as GenVM Callbacks
    participant GenVM as GenVM Manager
    participant API as GenVM API

    Note over Worker,GenVM: Startup — wire callbacks
    Worker->>Callbacks: set_genvm_callbacks(on_success,on_failure)

    Note over GenVM,API: GenVM run with retries
    GenVM->>API: Attempt 1 /genvm/run
    alt Transient error
        API--xGenVM: Error/Timeout
        GenVM->>GenVM: backoff wait
        GenVM->>API: Attempt N /genvm/run
        alt Success
            API-->>GenVM: Success (genvm_id)
            GenVM->>Callbacks: _on_genvm_success()
            Callbacks->>Worker: reset_genvm_failures()
        else Exhausted retries
            API--xGenVM: Final error
            GenVM->>Callbacks: _on_genvm_failure()
            Callbacks->>Worker: increment_genvm_failure()
        end
    else Immediate success
        API-->>GenVM: Success (genvm_id)
        GenVM->>Callbacks: _on_genvm_success()
        Callbacks->>Worker: reset_genvm_failures()
    end

    Note over Client,Worker: Health probe
    Client->>Worker: GET /health
    alt failures < threshold
        Worker-->>Client: 200 OK
    else failures ≥ threshold
        Worker-->>Client: 503 {error, count, threshold}
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes


Poem

🐰 I hopped through retries, counted each fall,

Backoff and logs kept me on call,
Callbacks cheered when runs came through,
Or nudged the counter when errors grew,
Now health checks hum — steady, not small.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 57.14%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title accurately describes the main changes: adding retry logic and health degradation for GenVM Manager.
  • Description check ✅ Passed — The description covers main changes, configuration, and testing, but lacks the 'Why', 'Decisions made', and 'Reviewing tips' sections from the template.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/unit/test_genvm_retry.py (1)

95-126: Inconsistent environment variable isolation in tests.

test_default_timeout and test_default_retry_delay directly modify os.environ via pop() without a context manager, which could cause test pollution. Consider using patch.dict with clear=False and explicit key removal for consistency, or use a fixture for cleanup.

🔎 Proposed fix for consistent env var isolation
     def test_default_timeout(self):
         """Default timeout is 10 seconds"""
-        import os
-
-        os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None)
-        assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0
+        with patch.dict("os.environ", {}, clear=False):
+            import os
+            os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None)
+            assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0

     def test_custom_timeout(self):
         """Timeout can be configured via env var"""
         with patch.dict("os.environ", {"GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS": "5"}):
             assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 5.0

     def test_default_retry_delay(self):
         """Default retry delay is 1 second"""
-        import os
-
-        os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None)
-        assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0
+        with patch.dict("os.environ", {}, clear=False):
+            import os
+            os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None)
+            assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0
tests/unit/test_worker_health_degradation.py (2)

82-87: Async context manager mock may be fragile.

The mock pattern for aiohttp.request uses synchronous MagicMock for __aenter__ and __aexit__. While this often works, it's more robust to use AsyncMock or explicit async functions.

🔎 More robust async mock pattern
         with patch("aiohttp.request") as mock_request:
             mock_response = MagicMock()
             mock_response.status = 200
-            mock_response.__aenter__ = MagicMock(return_value=mock_response)
-            mock_response.__aexit__ = MagicMock(return_value=None)
+            async def aenter():
+                return mock_response
+            async def aexit(*args):
+                pass
+            mock_response.__aenter__ = aenter
+            mock_response.__aexit__ = aexit
             mock_request.return_value = mock_response

Or use unittest.mock.AsyncMock for a cleaner approach.


127-132: Consider asserting specific response type for healthy case.

The branching assertion handles both JSONResponse and dict returns, but it would be clearer to assert the exact expected type. If the healthy response should always be a dict, assert that directly.

🔎 Proposed clarification
             # Should be healthy (200 or dict response)
             # The endpoint returns dict for healthy, JSONResponse for unhealthy
-            if hasattr(response, "status_code"):
-                assert response.status_code != 503
-            else:
-                assert response.get("status") in ["healthy", "stopping"]
+            # Healthy response is a dict, not JSONResponse
+            assert isinstance(response, dict), f"Expected dict for healthy response, got {type(response)}"
+            assert response.get("status") in ["healthy", "stopping"]
backend/node/genvm/origin/base_host.py (2)

517-524: Remove unnecessary f-string prefix.

Line 518 uses an f-string without any placeholders. Remove the f prefix.

🔎 Proposed fix
                     logger.error(
-                        f"genvm manager /genvm/run failed",
+                        "genvm manager /genvm/run failed",
                         status=resp.status,
                         body=data,
                     )

564-566: Finally block logs on every retry attempt.

The finally block at line 564 executes after each loop iteration, including failed attempts before retries. This may produce confusing log entries showing "proc started" with genvm_id=None during retry scenarios.

Consider moving this logging outside the retry loop or adjusting the condition to only log meaningful state.

🔎 Proposed adjustment
-            finally:
-                if genvm_id_cell[0] is not None or last_exc is not None:
-                    logger.debug("proc started", genvm_id=genvm_id_cell[0])
+            finally:
+                # Only log when we have a valid genvm_id (successful start)
+                if genvm_id_cell[0] is not None:
+                    logger.debug("proc started", genvm_id=genvm_id_cell[0])

Or remove the finally block entirely since success already logs at line 527-529.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c00df52 and bd863c9.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • tests/unit/test_worker_health_degradation.py
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
  • tests/unit/test_genvm_retry.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
🧬 Code graph analysis (3)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
backend/node/genvm/origin/base_host.py (1)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
tests/unit/test_genvm_retry.py (1)
backend/node/genvm/origin/base_host.py (3)
  • set_genvm_callbacks (50-57)
  • _get_int (35-42)
  • _get_timeout_seconds (25-32)
🪛 GitHub Actions: pre-commit
backend/node/genvm/origin/base_host.py

[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.

tests/unit/test_genvm_retry.py

[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.

🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

518-518: f-string without any placeholders

Remove extraneous f prefix

(F541)


522-524: Create your own exception

(TRY002)


522-524: Avoid specifying long messages outside the exception class

(TRY003)


543-547: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: load-test
  • GitHub Check: test
  • GitHub Check: backend-unit-tests
🔇 Additional comments (11)
tests/unit/test_genvm_retry.py (3)

1-9: LGTM! Well-structured test file for GenVM retry logic.

The test organization into separate classes for callbacks, env var config, and retry config is clean and follows pytest conventions.


11-63: LGTM! Callback tests are thorough.

The test class properly resets state in setup_method and covers setting, clearing, and invoking callbacks. Testing direct invocation is appropriate for unit tests.


66-89: LGTM! Good coverage of env var parsing edge cases.

Tests cover valid values, invalid values, and missing values for both _get_int and _get_timeout_seconds.

tests/unit/test_worker_health_degradation.py (2)

12-48: LGTM! Comprehensive failure tracking tests.

The TestGenVMFailureTracking class provides good coverage for increment, reset, and get operations on the failure counter.


168-187: LGTM! Environment variable threshold tests.

The tests appropriately verify the parsing pattern. The comment on line 173 correctly notes the limitation regarding module import time initialization.

backend/consensus/worker_service.py (3)

80-108: LGTM! GenVM failure tracking implementation.

The module-level state and functions for tracking consecutive GenVM failures are well-structured. The logging provides good visibility into failure progression.

One note: if callbacks could ever be invoked from multiple threads simultaneously (unlikely in a single-event-loop async context), the increment operation wouldn't be atomic. For now, this appears safe given the FastAPI async context.
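
If the callbacks ever did move off a single event loop, a lock would make the update safe. A minimal sketch against the counter described in this PR (the lock itself is an assumption, not part of the change):

import threading

_failures_lock = threading.Lock()


def increment_genvm_failure() -> None:
    global _genvm_consecutive_failures
    with _failures_lock:  # make the read-modify-write atomic across threads
        _genvm_consecutive_failures += 1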


201-209: LGTM! Callback wiring is correct.

The callbacks are properly wired to connect GenVM retry outcomes in base_host.py with failure tracking in worker_service.py. The inline import avoids potential circular import issues.


513-524: LGTM! Health check degradation logic is correct.

The consecutive failure check properly returns 503 when the threshold is reached or exceeded. The response body includes all necessary debugging information.

backend/node/genvm/origin/base_host.py (3)

35-57: LGTM! Helper functions and callbacks are well-designed.

The _get_int helper follows the same defensive pattern as _get_timeout_seconds. The callback mechanism is simple and effective for decoupling failure tracking from the retry logic.


516-524: Non-200 responses bypass failure tracking.

When GenVM manager returns a non-200 status (e.g., 500), the code raises an exception immediately without triggering _on_genvm_failure. This means transient server errors from GenVM manager won't contribute to the consecutive failure count or trigger health degradation.

Consider whether non-200 responses should also be retried and/or trigger the failure callback.
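
One hedged way to fold 5xx replies into the same retry/callback path is sketched below; RetryableGenVMError is illustrative and not part of this PR:

class RetryableGenVMError(Exception):
    """Marks a GenVM manager response worth retrying."""


# inside the per-attempt body, after reading the response:
if resp.status >= 500:
    # Route 5xx through the same backoff path as aiohttp.ClientError /
    # asyncio.TimeoutError; the retry loop's except tuple would need to
    # include RetryableGenVMError for this to take effect.
    raise RetryableGenVMError(f"/genvm/run returned {resp.status}")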


473-480: LGTM! Retry configuration is well-designed.

The reduced per-attempt timeout (10s) combined with 3 retries and exponential backoff provides faster failure detection while maintaining similar total time budget. Configuration via environment variables allows tuning without code changes.
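
As a rough worked budget under the defaults: the worst case is three 10s attempts plus backoff waits of 1s and 2s, i.e. 10 + 1 + 10 + 2 + 10 = 33s, close to the old single 30s timeout, while a one-off transient failure now costs roughly 11s (one failed attempt plus the first backoff) instead of a full 30s.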

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from bd863c9 to 6a80e48 Compare January 6, 2026 17:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI Agents
In @backend/node/genvm/origin/base_host.py:
- Around line 486-487: The finally block uses last_exc (initialized to None) and
genvm_id_cell[0] but last_exc may remain None if a non-transient exception
(e.g., JSONDecodeError in the try body) occurs, causing the finally logic to
mis-fire or skip necessary cleanup; update the finally block to not rely solely
on last_exc: either set last_exc in an except Exception as e handler for all
non-transient errors, or change the finally logic to check genvm_id_cell[0]
directly and/or inspect a new explicit flag (e.g., proc_started or success)
instead of last_exc; locate symbols last_exc, genvm_id_cell, and the finally
block around the try in base_host.py to implement the chosen fix and ensure
proc-started logging and cleanup only run when the process actually started.
- Around line 473-483: The 10s value applies only to HTTP network calls, not
overall GenVM execution; update the declarations using _get_timeout_seconds so
their intent is explicit (e.g., rename or add a clear inline comment for
run_http_timeout_s and status_http_timeout_s to indicate they are HTTP request
timeouts for POST /genvm/run and GET status polling), and ensure you do NOT
change the separate overall execution timeout controlled by the timeout
parameter (leave any function/parameter named timeout untouched); also update
any callers or docs that reference these variables to reflect the “HTTP request
timeout” semantics so there’s no confusion during deployment.
🧹 Nitpick comments (6)
tests/unit/test_worker_health_degradation.py (2)

15-17: Consider using a fixture instead of setup_method for state reset.

The manual reset of worker_service._genvm_consecutive_failures in setup_method directly mutates module-level state. Consider using a pytest fixture with proper teardown to ensure test isolation, especially if tests run in parallel.

🔎 Proposed refactor using fixture
+@pytest.fixture(autouse=True)
+def reset_genvm_state():
+    """Reset GenVM state before each test."""
+    worker_service._genvm_consecutive_failures = 0
+    yield
+    # Cleanup after test
+    worker_service._genvm_consecutive_failures = 0
+
 class TestGenVMFailureTracking:
     """Test GenVM failure tracking functions"""
 
-    def setup_method(self):
-        # Reset failure count before each test
-        worker_service._genvm_consecutive_failures = 0
-
     def test_increment_genvm_failure(self):

171-187: Env var tests don't verify actual module behavior.

These tests only verify os.environ.get() behavior, not how worker_service actually parses GENVM_FAILURE_UNHEALTHY_THRESHOLD at module import time. Since the threshold is read when the module loads (lines 82-84 in worker_service.py), these tests don't confirm the integration works correctly.

Consider either:

  1. Testing the actual _genvm_failure_unhealthy_threshold value in worker_service after mocking the env var before import, or
  2. Removing these tests if they're redundant (since they only test standard library behavior).
🔎 Alternative approach to test actual module behavior
def test_threshold_uses_configured_value(self):
    """Verify worker_service respects the configured threshold"""
    # Direct test: check that worker_service._genvm_failure_unhealthy_threshold
    # matches the expected value from environment or default
    import backend.consensus.worker_service as ws
    # This assumes the module was loaded with default or set env var
    assert ws._genvm_failure_unhealthy_threshold >= 1

Note: Full integration testing would require module reloading, which can be complex.
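
For completeness, the reload pattern might look like the sketch below. This is illustrative only: reloading a module with import-time side effects (such as app construction) can be fragile.

import importlib
from unittest.mock import patch

import backend.consensus.worker_service as ws


def test_threshold_parsed_from_env_at_import():
    with patch.dict("os.environ", {"GENVM_FAILURE_UNHEALTHY_THRESHOLD": "5"}):
        importlib.reload(ws)  # re-run module-level parsing under the patched env
        assert ws._genvm_failure_unhealthy_threshold == 5
    importlib.reload(ws)  # restore the value derived from the real environment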

backend/node/genvm/origin/base_host.py (3)

523-523: Remove extraneous f-string prefix.

The f-string on line 523 has no placeholders.

🔎 Proposed fix
-                        logger.error(
-                            f"genvm manager /genvm/run failed",
+                        logger.error(
+                            "genvm manager /genvm/run failed",
                             status=resp.status,
                             body=data,
                         )

As per static analysis hints.


527-529: Consider using a custom exception class.

Raising Exception with a formatted message makes it harder to catch specific failure types programmatically. Consider defining a GenVMRunError exception class for better error handling.

🔎 Proposed refactor

Add near the top of the file:

class GenVMRunError(Exception):
    """Raised when GenVM /genvm/run request fails."""
    def __init__(self, status: int, body: dict):
        self.status = status
        self.body = body
        super().__init__(f"genvm manager /genvm/run failed: {status} {body}")

Then update line 527-529:

-                        raise Exception(
-                            f"genvm manager /genvm/run failed: {resp.status} {data}"
-                        )
+                        raise GenVMRunError(resp.status, data)

As per static analysis hints.


550-560: Use logging.exception for better traceback capture.

When logging errors with exceptions in except blocks, logger.exception() automatically includes the traceback, providing better debugging information than logger.error().

🔎 Proposed fix
                 if is_last_attempt:
                     # All retries exhausted - track failure and propagate
-                    logger.error(
+                    logger.exception(
                         "genvm manager request failed after all retries",
-                        error=str(exc),
                         attempts=max_retries,
                     )

Note: logger.exception() automatically captures exc, so error=str(exc) becomes redundant.

As per static analysis hints.

backend/consensus/worker_service.py (1)

80-84: Consider validation for threshold configuration.

The threshold parsing uses int() directly, which will raise ValueError if the env var contains invalid data. While this might be intentional (fail-fast on misconfiguration), it could also crash the worker at startup. The pattern differs from _get_timeout_seconds in base_host.py, which returns a default on parse failure.

🔎 Proposed defensive parsing
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+def _parse_threshold() -> int:
+    try:
+        return int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))
+    except ValueError:
+        logger.warning(
+            "Invalid GENVM_FAILURE_UNHEALTHY_THRESHOLD, using default 3"
+        )
+        return 3
+
+_genvm_failure_unhealthy_threshold: int = _parse_threshold()

This aligns with the defensive parsing pattern used in base_host.py's _get_int().

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd863c9 and 6a80e48.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_worker_health_degradation.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
🧬 Code graph analysis (2)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
  • increment_genvm_failure (87-93)
  • reset_genvm_failures (96-103)
  • get_genvm_failure_count (106-108)
  • health_check (385-591)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

523-523: f-string without any placeholders

Remove extraneous f prefix

(F541)


527-529: Create your own exception

(TRY002)


527-529: Avoid specifying long messages outside the exception class

(TRY003)


552-556: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: backend-unit-tests
🔇 Additional comments (10)
backend/node/genvm/origin/base_host.py (5)

35-43: LGTM! Clean integer parsing helper.

The _get_int helper mirrors the existing _get_timeout_seconds pattern and provides robust fallback behavior for invalid or missing environment variables.


45-57: LGTM! Callback mechanism is well-designed.

The callback globals and setter provide a clean integration point for external failure tracking without tight coupling. The optional callbacks pattern is appropriate here.


541-544: LGTM! Success callback integration is correct.

The success callback is invoked after the GenVM ID is captured and timeout watcher is started, ensuring the failure counter is reset only on confirmed success.


557-559: LGTM! Failure callback timing is appropriate.

The failure callback is invoked only after all retries are exhausted, which correctly tracks persistent failures rather than transient issues.


562-571: LGTM! Exponential backoff implementation is correct.

The retry delay calculation retry_base_delay_s * (2**attempt) properly implements exponential backoff (1s, 2s, 4s for attempts 0, 1, 2 with default base delay of 1s).

backend/consensus/worker_service.py (5)

87-93: LGTM! Clear failure tracking with informative logging.

The increment function properly updates global state and provides useful warning logs that include both current count and threshold for easy debugging.


96-103: LGTM! Conditional reset prevents log spam.

The check if _genvm_consecutive_failures > 0 before logging prevents unnecessary log entries on repeated successes, improving log quality.


106-108: LGTM! Testing helper with clear documentation.

The get_genvm_failure_count() function provides necessary test observability while keeping the implementation simple.


202-209: LGTM! Callback wiring is properly integrated into startup.

The callbacks are registered after genvm_manager is created and before the worker starts, ensuring the failure tracking is active throughout the worker's execution lifecycle. The startup log confirms the integration.


513-524: LGTM! Health degradation returns appropriate 503 with diagnostic info.

The health check correctly returns a 503 status code when the failure threshold is met or exceeded, and includes count and threshold in the response body for debugging. This enables Kubernetes to restart unhealthy pods.

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from 6a80e48 to 53590f1 Compare January 6, 2026 17:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
backend/node/genvm/origin/base_host.py (3)

546-547: Remove unused last_exc variable.

The variable last_exc is assigned but never read. This was flagged by static analysis (F841).

🔎 Proposed fix
             except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
-                last_exc = exc
                 is_last_attempt = attempt >= max_retries - 1

527-529: Consider using a custom exception class for GenVM errors.

Static analysis flagged this for creating an exception with a long message (TRY003) and using a generic Exception instead of a custom class (TRY002). A dedicated GenVMManagerError exception would improve error handling upstream.


550-560: Verify exception chain is preserved for debugging.

When raise is used without arguments on line 560, it re-raises the caught exception exc. This is correct, but consider using raise exc from exc or logging the full traceback with logger.exception instead of logger.error to preserve the stack trace for debugging (TRY400).

🔎 Proposed improvement
                 if is_last_attempt:
                     # All retries exhausted - track failure and propagate
-                    logger.error(
+                    logger.exception(
                         "genvm manager request failed after all retries",
-                        error=str(exc),
                         attempts=max_retries,
                     )
tests/unit/test_worker_health_degradation.py (2)

82-87: Async context manager mock is incomplete.

The __aenter__ and __aexit__ methods should return coroutines for proper async context manager behavior. While this may work due to MagicMock's flexibility, it's not technically correct and could cause issues with stricter async implementations.

🔎 Proposed fix using AsyncMock
-        with patch("aiohttp.request") as mock_request:
-            mock_response = MagicMock()
-            mock_response.status = 200
-            mock_response.__aenter__ = MagicMock(return_value=mock_response)
-            mock_response.__aexit__ = MagicMock(return_value=None)
-            mock_request.return_value = mock_response
+        from unittest.mock import AsyncMock
+        with patch("aiohttp.request") as mock_request:
+            mock_response = MagicMock()
+            mock_response.status = 200
+            mock_response.__aenter__ = AsyncMock(return_value=mock_response)
+            mock_response.__aexit__ = AsyncMock(return_value=None)
+            mock_request.return_value = mock_response

This pattern is repeated at lines 117-122 and 150-155.


168-187: Tests don't exercise actual module threshold parsing.

These tests only verify that os.environ.get() works correctly with defaults, not that worker_service._genvm_failure_unhealthy_threshold is correctly parsed at import time. Since the threshold is read at module import, patching the environment after import won't change the module's threshold value.

Consider adding a test that verifies the actual _genvm_failure_unhealthy_threshold value or documenting that this is testing the parsing pattern rather than the integrated behavior.

backend/consensus/worker_service.py (2)

80-84: Consider validating threshold is positive.

If GENVM_FAILURE_UNHEALTHY_THRESHOLD is set to 0 or a negative value, the health check would immediately return 503 even with zero failures. Consider adding validation or using the _get_int pattern with bounds checking.

🔎 Proposed improvement
 # GenVM Manager consecutive failure tracking
 _genvm_consecutive_failures: int = 0
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+_genvm_failure_unhealthy_threshold: int = max(
+    1, int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))
+)

82-84: Threshold parsing doesn't handle invalid values gracefully.

Unlike _get_int in base_host.py, this will raise ValueError if the environment variable contains a non-integer string. Consider using a try/except pattern for consistency.

🔎 Proposed fix
 # GenVM Manager consecutive failure tracking
 _genvm_consecutive_failures: int = 0
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+try:
+    _genvm_failure_unhealthy_threshold: int = int(
+        os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
+    )
+except ValueError:
+    _genvm_failure_unhealthy_threshold = 3
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a80e48 and 53590f1.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • tests/unit/test_worker_health_degradation.py
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
🧬 Code graph analysis (2)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
  • increment_genvm_failure (87-93)
  • reset_genvm_failures (96-103)
  • get_genvm_failure_count (106-108)
  • health_check (385-591)
backend/consensus/worker_service.py (3)
backend/node/genvm/origin/logger.py (2)
  • warning (20-21)
  • info (17-18)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

527-529: Create your own exception

(TRY002)


527-529: Avoid specifying long messages outside the exception class

(TRY003)


547-547: Local variable last_exc is assigned to but never used

Remove assignment to unused variable last_exc

(F841)


552-556: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: backend-unit-tests
🔇 Additional comments (9)
backend/node/genvm/origin/base_host.py (4)

35-43: LGTM!

The _get_int helper correctly mirrors _get_timeout_seconds for parsing integer environment variables with graceful fallback on missing or invalid values.


45-57: LGTM!

The callback mechanism is clean and well-documented. The type hints are correct, and the global state management is appropriate for this use case.


573-576: LGTM! The finally block logic is now correct.

The condition now only logs when genvm_id_cell[0] is not None, which addresses the previous concern about misleading "proc started" logs on failure paths.


488-544: LGTM! Retry loop with exponential backoff.

The retry logic is well-structured:

  • Exponential backoff with base delay
  • Proper success callback invocation
  • Clean exit on success via return
  • Attempt tracking in logs
tests/unit/test_worker_health_degradation.py (2)

12-48: LGTM! Failure tracking tests are comprehensive.

Tests cover increment, reset, reset when zero, and get count. Good coverage of the core tracking primitives.


53-60: Good test isolation via setup_method.

Resetting all relevant module-level state before each test ensures test isolation. This is important since the tests modify global state.

backend/consensus/worker_service.py (3)

87-108: LGTM! Clean failure tracking API.

The functions are simple, well-documented, and follow a clear pattern. The logging in increment_genvm_failure and reset_genvm_failures provides good observability for debugging.


201-209: LGTM! Callback wiring is correctly placed in lifespan.

The callbacks are wired after genvm_manager is created and before the worker starts, ensuring the tracking is active for all GenVM operations. The import placement inside the function avoids circular import issues.


513-524: LGTM! Health degradation response is well-structured.

The 503 response includes all necessary debugging information: status, worker_id, error type, current count, and threshold. This will help operators diagnose issues when Kubernetes restarts pods.

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from 53590f1 to f7a15ba Compare January 6, 2026 17:28
When POST /genvm/run times out, retry up to 3x with exponential backoff.
After consecutive failures exceed threshold, /health returns 503 triggering
K8s pod restart. Transaction is released (not stuck) so another worker picks it up.
@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from f7a15ba to 6c25bbe Compare January 6, 2026 17:34

sonarqubecloud bot commented Jan 6, 2026

Quality Gate failed

Failed conditions
50.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@MuncleUscles MuncleUscles merged commit 0389ca7 into main Jan 6, 2026
11 of 12 checks passed
MuncleUscles added a commit that referenced this pull request Jan 6, 2026
When POST /genvm/run times out, retry up to 3x with exponential backoff.
After consecutive failures exceed threshold, /health returns 503 triggering
K8s pod restart. Transaction is released (not stuck) so another worker picks it up.
