
Conversation


@MuncleUscles MuncleUscles commented Jan 6, 2026

Summary

  • Add retry logic (3x with exponential backoff) to POST /genvm/run for faster failure detection
  • Track consecutive GenVM failures; /health returns 503 after threshold → K8s restarts pod
  • Transaction is released on failure (not stuck) so another healthy worker picks it up immediately

Config (env vars)

Variable                                 Default  Purpose
GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS   10       Per-attempt timeout (was 30)
GENVM_MANAGER_RUN_RETRIES                3        Max retry attempts
GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS    1        Base delay for backoff
GENVM_FAILURE_UNHEALTHY_THRESHOLD        3        Failures before unhealthy
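
For context, a minimal sketch of how these variables could drive the retry loop. The function name, endpoint wiring, and helper are illustrative, not the PR's actual code:

import asyncio
import os

import aiohttp


def _env_float(name: str, default: float) -> float:
    # Fall back to the default on a missing or malformed value.
    try:
        return float(os.environ.get(name, default))
    except (TypeError, ValueError):
        return default


TIMEOUT_S = _env_float("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0)
RETRIES = int(os.environ.get("GENVM_MANAGER_RUN_RETRIES", "3"))
BASE_DELAY_S = _env_float("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0)


async def post_genvm_run(session: aiohttp.ClientSession, url: str, payload: dict) -> dict:
    """POST /genvm/run with a per-attempt timeout and exponential backoff."""
    for attempt in range(RETRIES):
        try:
            async with session.post(
                url, json=payload, timeout=aiohttp.ClientTimeout(total=TIMEOUT_S)
            ) as resp:
                data = await resp.json()
                if resp.status != 200:
                    # A non-200 reply is treated as fatal in this sketch (no retry).
                    raise RuntimeError(f"/genvm/run failed: {resp.status} {data}")
                return data
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == RETRIES - 1:
                raise  # retries exhausted: let the caller count the failure
            await asyncio.sleep(BASE_DELAY_S * (2**attempt))  # 1s, 2s, 4s, ...
    raise RuntimeError("GENVM_MANAGER_RUN_RETRIES must be >= 1")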

Test plan

  • Unit tests for retry config and callbacks (tests/unit/test_genvm_retry.py)
  • Unit tests for health degradation (tests/unit/test_worker_health_degradation.py)
  • Manual test: simulate GenVM Manager timeout and verify pod restart

Summary by CodeRabbit

  • New Features

    • GenVM operations now retry transient failures with exponential backoff and invoke success/failure callbacks to track consecutive failures.
    • Health checks now probe GenVM and mark the service degraded (503) when consecutive failures meet a configurable threshold; responses include failure count and threshold for diagnostics.
    • Startup logs indicate GenVM failure tracking is configured.
  • Tests

    • Added tests for retry behavior, callback handling, env config, and health-degradation scenarios.




coderabbitai bot commented Jan 6, 2026

Warning

Rate limit exceeded

@MuncleUscles has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minute and 3 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


📥 Commits

Reviewing files that changed from the base of the PR and between 53590f1 and 6c25bbe.

📒 Files selected for processing (5)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_genvm_retry_integration.py
  • tests/unit/test_worker_health_degradation.py
📝 Walkthrough


Adds retryable GenVM run logic with success/failure callbacks, tracks consecutive GenVM failures in the worker, and makes health checks return 503 when failures meet a configurable threshold. Tests for retries, callbacks, env parsing, and health-degradation behavior were added.

Changes

Cohort / File(s) Summary
Worker Service Failure Tracking
backend/consensus/worker_service.py
Adds module globals _genvm_consecutive_failures and _genvm_failure_unhealthy_threshold (env-driven), public APIs increment_genvm_failure(), reset_genvm_failures(), get_genvm_failure_count(), wires callbacks at startup via set_genvm_callbacks(), and updates health_check() to return 503 with error, count, and threshold when the threshold is reached.
GenVM Manager Retry & Callbacks
backend/node/genvm/origin/base_host.py
Adds _get_int() env helper, global callback hooks _on_genvm_success / _on_genvm_failure with set_genvm_callbacks(), and wraps run_genvm() in a retry loop (configurable max_retries, retry_base_delay_s, HTTP timeout). Calls callbacks on success or final failure and preserves timeout/cancellation behavior across retries.
Unit Tests
tests/unit/test_genvm_retry.py, tests/unit/test_worker_health_degradation.py
Adds tests for callback registration/invocation, env var parsing and timeout helpers, retry configuration/defaults, GenVM consecutive-failure tracking (increment/reset/get), and /health behavior for healthy vs. threshold-exceeded cases.
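
To make the health-degradation contract concrete, here is a rough sketch of the counter-and-threshold pattern described above. The FastAPI wiring and exact response fields are assumptions, not the PR's literal code:

import os

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

_genvm_consecutive_failures: int = 0
_threshold: int = int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))


def increment_genvm_failure() -> None:
    """Called by the GenVM failure callback once retries are exhausted."""
    global _genvm_consecutive_failures
    _genvm_consecutive_failures += 1


def reset_genvm_failures() -> None:
    """Called by the success callback; clears the failure streak."""
    global _genvm_consecutive_failures
    _genvm_consecutive_failures = 0


@app.get("/health")
async def health_check():
    # Return 503 once the streak reaches the threshold, so the
    # Kubernetes liveness probe restarts the pod.
    if _genvm_consecutive_failures >= _threshold:
        return JSONResponse(
            status_code=503,
            content={
                "status": "unhealthy",
                "error": "genvm_manager_failures",
                "count": _genvm_consecutive_failures,
                "threshold": _threshold,
            },
        )
    return {"status": "healthy"}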

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Worker as Worker Service
    participant Callbacks as GenVM Callbacks
    participant GenVM as GenVM Manager
    participant API as GenVM API

    Note over Worker,GenVM: Startup — wire callbacks
    Worker->>Callbacks: set_genvm_callbacks(on_success,on_failure)

    Note over GenVM,API: GenVM run with retries
    GenVM->>API: Attempt 1 /genvm/run
    alt Transient error
        API--xGenVM: Error/Timeout
        GenVM->>GenVM: backoff wait
        GenVM->>API: Attempt N /genvm/run
        alt Success
            API-->>GenVM: Success (genvm_id)
            GenVM->>Callbacks: _on_genvm_success()
            Callbacks->>Worker: reset_genvm_failures()
        else Exhausted retries
            API--xGenVM: Final error
            GenVM->>Callbacks: _on_genvm_failure()
            Callbacks->>Worker: increment_genvm_failure()
        end
    else Immediate success
        API-->>GenVM: Success (genvm_id)
        GenVM->>Callbacks: _on_genvm_success()
        Callbacks->>Worker: reset_genvm_failures()
    end

    Note over Client,Worker: Health probe
    Client->>Worker: GET /health
    alt failures < threshold
        Worker-->>Client: 200 OK
    else failures ≥ threshold
        Worker-->>Client: 503 {error, count, threshold}
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35 minutes


Poem

🐰 I hopped through retries, counted each fall,

Backoff and logs kept me on call,
Callbacks cheered when runs came through,
Or nudged the counter when errors grew,
Now health checks hum — steady, not small.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 57.14%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed — The title accurately describes the main changes: adding retry logic and health degradation for GenVM Manager.
  • Description check ✅ Passed — The description covers main changes, configuration, and testing, but lacks the 'Why', 'Decisions made', and 'Reviewing tips' sections from the template.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/unit/test_genvm_retry.py (1)

95-126: Inconsistent environment variable isolation in tests.

test_default_timeout and test_default_retry_delay directly modify os.environ via pop() without a context manager, which could cause test pollution. Consider using patch.dict with clear=False and explicit key removal for consistency, or use a fixture for cleanup.

🔎 Proposed fix for consistent env var isolation
     def test_default_timeout(self):
         """Default timeout is 10 seconds"""
-        import os
-
-        os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None)
-        assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0
+        with patch.dict("os.environ", {}, clear=False):
+            import os
+            os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None)
+            assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0

     def test_custom_timeout(self):
         """Timeout can be configured via env var"""
         with patch.dict("os.environ", {"GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS": "5"}):
             assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 5.0

     def test_default_retry_delay(self):
         """Default retry delay is 1 second"""
-        import os
-
-        os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None)
-        assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0
+        with patch.dict("os.environ", {}, clear=False):
+            import os
+            os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None)
+            assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0
tests/unit/test_worker_health_degradation.py (2)

82-87: Async context manager mock may be fragile.

The mock pattern for aiohttp.request uses synchronous MagicMock for __aenter__ and __aexit__. While this often works, it's more robust to use AsyncMock or explicit async functions.

🔎 More robust async mock pattern
         with patch("aiohttp.request") as mock_request:
             mock_response = MagicMock()
             mock_response.status = 200
-            mock_response.__aenter__ = MagicMock(return_value=mock_response)
-            mock_response.__aexit__ = MagicMock(return_value=None)
+            async def aenter():
+                return mock_response
+            async def aexit(*args):
+                pass
+            mock_response.__aenter__ = aenter
+            mock_response.__aexit__ = aexit
             mock_request.return_value = mock_response

Or use unittest.mock.AsyncMock for a cleaner approach.


127-132: Consider asserting specific response type for healthy case.

The branching assertion handles both JSONResponse and dict returns, but it would be clearer to assert the exact expected type. If the healthy response should always be a dict, assert that directly.

🔎 Proposed clarification
             # Should be healthy (200 or dict response)
             # The endpoint returns dict for healthy, JSONResponse for unhealthy
-            if hasattr(response, "status_code"):
-                assert response.status_code != 503
-            else:
-                assert response.get("status") in ["healthy", "stopping"]
+            # Healthy response is a dict, not JSONResponse
+            assert isinstance(response, dict), f"Expected dict for healthy response, got {type(response)}"
+            assert response.get("status") in ["healthy", "stopping"]
backend/node/genvm/origin/base_host.py (2)

517-524: Remove unnecessary f-string prefix.

Line 518 uses an f-string without any placeholders. Remove the f prefix.

🔎 Proposed fix
                     logger.error(
-                        f"genvm manager /genvm/run failed",
+                        "genvm manager /genvm/run failed",
                         status=resp.status,
                         body=data,
                     )

564-566: Finally block logs on every retry attempt.

The finally block at line 564 executes after each loop iteration, including failed attempts before retries. This may produce confusing log entries showing "proc started" with genvm_id=None during retry scenarios.

Consider moving this logging outside the retry loop or adjusting the condition to only log meaningful state.

🔎 Proposed adjustment
-            finally:
-                if genvm_id_cell[0] is not None or last_exc is not None:
-                    logger.debug("proc started", genvm_id=genvm_id_cell[0])
+            finally:
+                # Only log when we have a valid genvm_id (successful start)
+                if genvm_id_cell[0] is not None:
+                    logger.debug("proc started", genvm_id=genvm_id_cell[0])

Or remove the finally block entirely since success already logs at line 527-529.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c00df52 and bd863c9.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • tests/unit/test_worker_health_degradation.py
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
  • tests/unit/test_genvm_retry.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
🧬 Code graph analysis (3)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
backend/node/genvm/origin/base_host.py (1)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
tests/unit/test_genvm_retry.py (1)
backend/node/genvm/origin/base_host.py (3)
  • set_genvm_callbacks (50-57)
  • _get_int (35-42)
  • _get_timeout_seconds (25-32)
🪛 GitHub Actions: pre-commit
backend/node/genvm/origin/base_host.py

[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.

tests/unit/test_genvm_retry.py

[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.

🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

518-518: f-string without any placeholders

Remove extraneous f prefix

(F541)


522-524: Create your own exception

(TRY002)


522-524: Avoid specifying long messages outside the exception class

(TRY003)


543-547: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: load-test
  • GitHub Check: test
  • GitHub Check: backend-unit-tests
🔇 Additional comments (11)
tests/unit/test_genvm_retry.py (3)

1-9: LGTM! Well-structured test file for GenVM retry logic.

The test organization into separate classes for callbacks, env var config, and retry config is clean and follows pytest conventions.


11-63: LGTM! Callback tests are thorough.

The test class properly resets state in setup_method and covers setting, clearing, and invoking callbacks. Testing direct invocation is appropriate for unit tests.


66-89: LGTM! Good coverage of env var parsing edge cases.

Tests cover valid values, invalid values, and missing values for both _get_int and _get_timeout_seconds.

tests/unit/test_worker_health_degradation.py (2)

12-48: LGTM! Comprehensive failure tracking tests.

The TestGenVMFailureTracking class provides good coverage for increment, reset, and get operations on the failure counter.


168-187: LGTM! Environment variable threshold tests.

The tests appropriately verify the parsing pattern. The comment on line 173 correctly notes the limitation regarding module import time initialization.

backend/consensus/worker_service.py (3)

80-108: LGTM! GenVM failure tracking implementation.

The module-level state and functions for tracking consecutive GenVM failures are well-structured. The logging provides good visibility into failure progression.

One note: if callbacks could ever be invoked from multiple threads simultaneously (unlikely in a single-event-loop async context), the increment operation wouldn't be atomic. For now, this appears safe given the FastAPI async context.
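
If the callbacks ever did move off a single event loop, a lock would make the update safe. A minimal sketch against the counter described in this PR (the lock itself is an assumption, not part of the change):

import threading

_failures_lock = threading.Lock()


def increment_genvm_failure() -> None:
    global _genvm_consecutive_failures
    with _failures_lock:  # make the read-modify-write atomic across threads
        _genvm_consecutive_failures += 1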


201-209: LGTM! Callback wiring is correct.

The callbacks are properly wired to connect GenVM retry outcomes in base_host.py with failure tracking in worker_service.py. The inline import avoids potential circular import issues.


513-524: LGTM! Health check degradation logic is correct.

The consecutive failure check properly returns 503 when the threshold is reached or exceeded. The response body includes all necessary debugging information.

backend/node/genvm/origin/base_host.py (3)

35-57: LGTM! Helper functions and callbacks are well-designed.

The _get_int helper follows the same defensive pattern as _get_timeout_seconds. The callback mechanism is simple and effective for decoupling failure tracking from the retry logic.


516-524: Non-200 responses bypass failure tracking.

When GenVM manager returns a non-200 status (e.g., 500), the code raises an exception immediately without triggering _on_genvm_failure. This means transient server errors from GenVM manager won't contribute to the consecutive failure count or trigger health degradation.

Consider whether non-200 responses should also be retried and/or trigger the failure callback.
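
One hedged way to fold 5xx replies into the same retry/callback path is sketched below; RetryableGenVMError is illustrative and not part of this PR:

class RetryableGenVMError(Exception):
    """Marks a GenVM manager response worth retrying."""


# inside the per-attempt body, after reading the response:
if resp.status >= 500:
    # Route 5xx through the same backoff path as aiohttp.ClientError /
    # asyncio.TimeoutError; the retry loop's except tuple would need to
    # include RetryableGenVMError for this to take effect.
    raise RetryableGenVMError(f"/genvm/run returned {resp.status}")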


473-480: LGTM! Retry configuration is well-designed.

The reduced per-attempt timeout (10s) combined with 3 retries and exponential backoff provides faster failure detection while maintaining similar total time budget. Configuration via environment variables allows tuning without code changes.
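
As a rough worked budget under the defaults: the worst case is three 10s attempts plus backoff waits of 1s and 2s, i.e. 10 + 1 + 10 + 2 + 10 = 33s, close to the old single 30s timeout, while a one-off transient failure now costs roughly 11s (one failed attempt plus the first backoff) instead of a full 30s.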

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from bd863c9 to 6a80e48 Compare January 6, 2026 17:11

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI Agents
In @backend/node/genvm/origin/base_host.py:
- Around line 486-487: The finally block uses last_exc (initialized to None) and
genvm_id_cell[0] but last_exc may remain None if a non-transient exception
(e.g., JSONDecodeError in the try body) occurs, causing the finally logic to
mis-fire or skip necessary cleanup; update the finally block to not rely solely
on last_exc: either set last_exc in an except Exception as e handler for all
non-transient errors, or change the finally logic to check genvm_id_cell[0]
directly and/or inspect a new explicit flag (e.g., proc_started or success)
instead of last_exc; locate symbols last_exc, genvm_id_cell, and the finally
block around the try in base_host.py to implement the chosen fix and ensure
proc-started logging and cleanup only run when the process actually started.
- Around line 473-483: The 10s value applies only to HTTP network calls, not
overall GenVM execution; update the declarations using _get_timeout_seconds so
their intent is explicit (e.g., rename or add a clear inline comment for
run_http_timeout_s and status_http_timeout_s to indicate they are HTTP request
timeouts for POST /genvm/run and GET status polling), and ensure you do NOT
change the separate overall execution timeout controlled by the timeout
parameter (leave any function/parameter named timeout untouched); also update
any callers or docs that reference these variables to reflect the “HTTP request
timeout” semantics so there’s no confusion during deployment.
🧹 Nitpick comments (6)
tests/unit/test_worker_health_degradation.py (2)

15-17: Consider using a fixture instead of setup_method for state reset.

The manual reset of worker_service._genvm_consecutive_failures in setup_method directly mutates module-level state. Consider using a pytest fixture with proper teardown to ensure test isolation, especially if tests run in parallel.

🔎 Proposed refactor using fixture
+@pytest.fixture(autouse=True)
+def reset_genvm_state():
+    """Reset GenVM state before each test."""
+    worker_service._genvm_consecutive_failures = 0
+    yield
+    # Cleanup after test
+    worker_service._genvm_consecutive_failures = 0
+
 class TestGenVMFailureTracking:
     """Test GenVM failure tracking functions"""
 
-    def setup_method(self):
-        # Reset failure count before each test
-        worker_service._genvm_consecutive_failures = 0
-
     def test_increment_genvm_failure(self):

171-187: Env var tests don't verify actual module behavior.

These tests only verify os.environ.get() behavior, not how worker_service actually parses GENVM_FAILURE_UNHEALTHY_THRESHOLD at module import time. Since the threshold is read when the module loads (lines 82-84 in worker_service.py), these tests don't confirm the integration works correctly.

Consider either:

  1. Testing the actual _genvm_failure_unhealthy_threshold value in worker_service after mocking the env var before import, or
  2. Removing these tests if they're redundant (since they only test standard library behavior).
🔎 Alternative approach to test actual module behavior
def test_threshold_uses_configured_value(self):
    """Verify worker_service respects the configured threshold"""
    # Direct test: check that worker_service._genvm_failure_unhealthy_threshold
    # matches the expected value from environment or default
    import backend.consensus.worker_service as ws
    # This assumes the module was loaded with default or set env var
    assert ws._genvm_failure_unhealthy_threshold >= 1

Note: Full integration testing would require module reloading, which can be complex.
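
For completeness, the reload pattern might look like the sketch below. This is illustrative only: reloading a module with import-time side effects (such as app construction) can be fragile.

import importlib
from unittest.mock import patch

import backend.consensus.worker_service as ws


def test_threshold_parsed_from_env_at_import():
    with patch.dict("os.environ", {"GENVM_FAILURE_UNHEALTHY_THRESHOLD": "5"}):
        importlib.reload(ws)  # re-run module-level parsing under the patched env
        assert ws._genvm_failure_unhealthy_threshold == 5
    importlib.reload(ws)  # restore the value derived from the real environment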

backend/node/genvm/origin/base_host.py (3)

523-523: Remove extraneous f-string prefix.

The f-string on line 523 has no placeholders.

🔎 Proposed fix
-                        logger.error(
-                            f"genvm manager /genvm/run failed",
+                        logger.error(
+                            "genvm manager /genvm/run failed",
                             status=resp.status,
                             body=data,
                         )

As per static analysis hints.


527-529: Consider using a custom exception class.

Raising Exception with a formatted message makes it harder to catch specific failure types programmatically. Consider defining a GenVMRunError exception class for better error handling.

🔎 Proposed refactor

Add near the top of the file:

class GenVMRunError(Exception):
    """Raised when GenVM /genvm/run request fails."""
    def __init__(self, status: int, body: dict):
        self.status = status
        self.body = body
        super().__init__(f"genvm manager /genvm/run failed: {status} {body}")

Then update line 527-529:

-                        raise Exception(
-                            f"genvm manager /genvm/run failed: {resp.status} {data}"
-                        )
+                        raise GenVMRunError(resp.status, data)

As per static analysis hints.


550-560: Use logging.exception for better traceback capture.

When logging errors with exceptions in except blocks, logger.exception() automatically includes the traceback, providing better debugging information than logger.error().

🔎 Proposed fix
                 if is_last_attempt:
                     # All retries exhausted - track failure and propagate
-                    logger.error(
+                    logger.exception(
                         "genvm manager request failed after all retries",
-                        error=str(exc),
                         attempts=max_retries,
                     )

Note: logger.exception() automatically captures exc, so error=str(exc) becomes redundant.

As per static analysis hints.

backend/consensus/worker_service.py (1)

80-84: Consider validation for threshold configuration.

The threshold parsing uses int() directly, which will raise ValueError if the env var contains invalid data. While this might be intentional (fail-fast on misconfiguration), it could also crash the worker at startup. The pattern differs from _get_timeout_seconds in base_host.py, which returns a default on parse failure.

🔎 Proposed defensive parsing
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+def _parse_threshold() -> int:
+    try:
+        return int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))
+    except ValueError:
+        logger.warning(
+            "Invalid GENVM_FAILURE_UNHEALTHY_THRESHOLD, using default 3"
+        )
+        return 3
+
+_genvm_failure_unhealthy_threshold: int = _parse_threshold()

This aligns with the defensive parsing pattern used in base_host.py's _get_int().

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd863c9 and 6a80e48.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_worker_health_degradation.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
🧬 Code graph analysis (2)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
  • increment_genvm_failure (87-93)
  • reset_genvm_failures (96-103)
  • get_genvm_failure_count (106-108)
  • health_check (385-591)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

523-523: f-string without any placeholders

Remove extraneous f prefix

(F541)


527-529: Create your own exception

(TRY002)


527-529: Avoid specifying long messages outside the exception class

(TRY003)


552-556: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: backend-unit-tests
🔇 Additional comments (10)
backend/node/genvm/origin/base_host.py (5)

35-43: LGTM! Clean integer parsing helper.

The _get_int helper mirrors the existing _get_timeout_seconds pattern and provides robust fallback behavior for invalid or missing environment variables.


45-57: LGTM! Callback mechanism is well-designed.

The callback globals and setter provide a clean integration point for external failure tracking without tight coupling. The optional callbacks pattern is appropriate here.


541-544: LGTM! Success callback integration is correct.

The success callback is invoked after the GenVM ID is captured and timeout watcher is started, ensuring the failure counter is reset only on confirmed success.


557-559: LGTM! Failure callback timing is appropriate.

The failure callback is invoked only after all retries are exhausted, which correctly tracks persistent failures rather than transient issues.


562-571: LGTM! Exponential backoff implementation is correct.

The retry delay calculation retry_base_delay_s * (2**attempt) properly implements exponential backoff (1s, 2s, 4s for attempts 0, 1, 2 with default base delay of 1s).

backend/consensus/worker_service.py (5)

87-93: LGTM! Clear failure tracking with informative logging.

The increment function properly updates global state and provides useful warning logs that include both current count and threshold for easy debugging.


96-103: LGTM! Conditional reset prevents log spam.

The check if _genvm_consecutive_failures > 0 before logging prevents unnecessary log entries on repeated successes, improving log quality.


106-108: LGTM! Testing helper with clear documentation.

The get_genvm_failure_count() function provides necessary test observability while keeping the implementation simple.


202-209: LGTM! Callback wiring is properly integrated into startup.

The callbacks are registered after genvm_manager is created and before the worker starts, ensuring the failure tracking is active throughout the worker's execution lifecycle. The startup log confirms the integration.


513-524: LGTM! Health degradation returns appropriate 503 with diagnostic info.

The health check correctly returns a 503 status code when the failure threshold is met or exceeded, and includes count and threshold in the response body for debugging. This enables Kubernetes to restart unhealthy pods.

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from 6a80e48 to 53590f1 Compare January 6, 2026 17:20

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
backend/node/genvm/origin/base_host.py (3)

546-547: Remove unused last_exc variable.

The variable last_exc is assigned but never read. This was flagged by static analysis (F841).

🔎 Proposed fix
             except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
-                last_exc = exc
                 is_last_attempt = attempt >= max_retries - 1

527-529: Consider using a custom exception class for GenVM errors.

Static analysis flagged this for creating an exception with a long message (TRY003) and using a generic Exception instead of a custom class (TRY002). A dedicated GenVMManagerError exception would improve error handling upstream.


550-560: Verify exception chain is preserved for debugging.

When raise is used without arguments on line 560, it re-raises the caught exception exc. This is correct, but consider using raise exc from exc or logging the full traceback with logger.exception instead of logger.error to preserve the stack trace for debugging (TRY400).

🔎 Proposed improvement
                 if is_last_attempt:
                     # All retries exhausted - track failure and propagate
-                    logger.error(
+                    logger.exception(
                         "genvm manager request failed after all retries",
-                        error=str(exc),
                         attempts=max_retries,
                     )
tests/unit/test_worker_health_degradation.py (2)

82-87: Async context manager mock is incomplete.

The __aenter__ and __aexit__ methods should return coroutines for proper async context manager behavior. While this may work due to MagicMock's flexibility, it's not technically correct and could cause issues with stricter async implementations.

🔎 Proposed fix using AsyncMock
-        with patch("aiohttp.request") as mock_request:
-            mock_response = MagicMock()
-            mock_response.status = 200
-            mock_response.__aenter__ = MagicMock(return_value=mock_response)
-            mock_response.__aexit__ = MagicMock(return_value=None)
-            mock_request.return_value = mock_response
+        from unittest.mock import AsyncMock
+        with patch("aiohttp.request") as mock_request:
+            mock_response = MagicMock()
+            mock_response.status = 200
+            mock_response.__aenter__ = AsyncMock(return_value=mock_response)
+            mock_response.__aexit__ = AsyncMock(return_value=None)
+            mock_request.return_value = mock_response

This pattern is repeated at lines 117-122 and 150-155.


168-187: Tests don't exercise actual module threshold parsing.

These tests only verify that os.environ.get() works correctly with defaults, not that worker_service._genvm_failure_unhealthy_threshold is correctly parsed at import time. Since the threshold is read at module import, patching the environment after import won't change the module's threshold value.

Consider adding a test that verifies the actual _genvm_failure_unhealthy_threshold value or documenting that this is testing the parsing pattern rather than the integrated behavior.

backend/consensus/worker_service.py (2)

80-84: Consider validating threshold is positive.

If GENVM_FAILURE_UNHEALTHY_THRESHOLD is set to 0 or a negative value, the health check would immediately return 503 even with zero failures. Consider adding validation or using the _get_int pattern with bounds checking.

🔎 Proposed improvement
 # GenVM Manager consecutive failure tracking
 _genvm_consecutive_failures: int = 0
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+_genvm_failure_unhealthy_threshold: int = max(
+    1, int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3"))
+)

82-84: Threshold parsing doesn't handle invalid values gracefully.

Unlike _get_int in base_host.py, this will raise ValueError if the environment variable contains a non-integer string. Consider using a try/except pattern for consistency.

🔎 Proposed fix
 # GenVM Manager consecutive failure tracking
 _genvm_consecutive_failures: int = 0
-_genvm_failure_unhealthy_threshold: int = int(
-    os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
-)
+try:
+    _genvm_failure_unhealthy_threshold: int = int(
+        os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")
+    )
+except ValueError:
+    _genvm_failure_unhealthy_threshold = 3
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a80e48 and 53590f1.

📒 Files selected for processing (4)
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
  • tests/unit/test_genvm_retry.py
  • tests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency

**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code

Files:

  • tests/unit/test_worker_health_degradation.py
  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Place new backend tests in the closest scope folder and name them test_<feature>.py for Pytest auto-discovery

Use pytest with fixtures from tests/common/ for backend testing

Files:

  • tests/unit/test_worker_health_degradation.py
backend/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Align backend filenames with their behavior (e.g., validators/llm_validator.py) and mirror that pattern in tests

Files:

  • backend/consensus/worker_service.py
  • backend/node/genvm/origin/base_host.py
backend/consensus/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Implement validator rotation using VRF (Verifiable Random Function) in the consensus system

Files:

  • backend/consensus/worker_service.py
🧬 Code graph analysis (2)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
  • increment_genvm_failure (87-93)
  • reset_genvm_failures (96-103)
  • get_genvm_failure_count (106-108)
  • health_check (385-591)
backend/consensus/worker_service.py (3)
backend/node/genvm/origin/logger.py (2)
  • warning (20-21)
  • info (17-18)
backend/node/genvm/origin/base_host.py (1)
  • set_genvm_callbacks (50-57)
tests/unit/test_genvm_retry.py (2)
  • on_success (45-47)
  • on_failure (49-51)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py

527-529: Create your own exception

(TRY002)


527-529: Avoid specifying long messages outside the exception class

(TRY003)


547-547: Local variable last_exc is assigned to but never used

Remove assignment to unused variable last_exc

(F841)


552-556: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: backend-unit-tests
🔇 Additional comments (9)
backend/node/genvm/origin/base_host.py (4)

35-43: LGTM!

The _get_int helper correctly mirrors _get_timeout_seconds for parsing integer environment variables with graceful fallback on missing or invalid values.


45-57: LGTM!

The callback mechanism is clean and well-documented. The type hints are correct, and the global state management is appropriate for this use case.


573-576: LGTM! The finally block logic is now correct.

The condition now only logs when genvm_id_cell[0] is not None, which addresses the previous concern about misleading "proc started" logs on failure paths.


488-544: LGTM! Retry loop with exponential backoff.

The retry logic is well-structured:

  • Exponential backoff with base delay
  • Proper success callback invocation
  • Clean exit on success via return
  • Attempt tracking in logs
tests/unit/test_worker_health_degradation.py (2)

12-48: LGTM! Failure tracking tests are comprehensive.

Tests cover increment, reset, reset when zero, and get count. Good coverage of the core tracking primitives.


53-60: Good test isolation via setup_method.

Resetting all relevant module-level state before each test ensures test isolation. This is important since the tests modify global state.

backend/consensus/worker_service.py (3)

87-108: LGTM! Clean failure tracking API.

The functions are simple, well-documented, and follow a clear pattern. The logging in increment_genvm_failure and reset_genvm_failures provides good observability for debugging.


201-209: LGTM! Callback wiring is correctly placed in lifespan.

The callbacks are wired after genvm_manager is created and before the worker starts, ensuring the tracking is active for all GenVM operations. The import placement inside the function avoids circular import issues.


513-524: LGTM! Health degradation response is well-structured.

The 503 response includes all necessary debugging information: status, worker_id, error type, current count, and threshold. This will help operators diagnose issues when Kubernetes restarts pods.

@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from 53590f1 to f7a15ba Compare January 6, 2026 17:28
When POST /genvm/run times out, retry up to 3x with exponential backoff.
After consecutive failures exceed threshold, /health returns 503 triggering
K8s pod restart. Transaction is released (not stuck) so another worker picks it up.
@MuncleUscles MuncleUscles force-pushed the feat/genvm-retry-health-degradation branch from f7a15ba to 6c25bbe Compare January 6, 2026 17:34

sonarqubecloud bot commented Jan 6, 2026

Quality Gate failed

Failed conditions
50.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@MuncleUscles MuncleUscles merged commit 0389ca7 into main Jan 6, 2026
11 of 12 checks passed
MuncleUscles added a commit that referenced this pull request Jan 6, 2026
When POST /genvm/run times out, retry up to 3x with exponential backoff.
After consecutive failures exceed threshold, /health returns 503 triggering
K8s pod restart. Transaction is released (not stuck) so another worker picks it up.
