-
Notifications
You must be signed in to change notification settings - Fork 28
feat: add GenVM Manager retry logic and health degradation #1367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
You have run out of free Bugbot PR reviews for this billing cycle. This will reset on February 26. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
|
Warning Rate limit exceeded@MuncleUscles has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 1 minutes and 3 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (5)
📝 WalkthroughWalkthroughAdds retryable GenVM run logic with success/failure callbacks, tracks consecutive GenVM failures in the worker, and makes health checks return 503 when failures meet a configurable threshold. Tests for retries, callbacks, env parsing, and health-degradation behavior were added. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Worker as Worker Service
participant Callbacks as GenVM Callbacks
participant GenVM as GenVM Manager
participant API as GenVM API
Note over Worker,GenVM: Startup — wire callbacks
Worker->>Callbacks: set_genvm_callbacks(on_success,on_failure)
Note over GenVM,API: GenVM run with retries
GenVM->>API: Attempt 1 /genvm/run
alt Transient error
API--xGenVM: Error/Timeout
GenVM->>GenVM: backoff wait
GenVM->>API: Attempt N /genvm/run
alt Success
API-->>GenVM: Success (genvm_id)
GenVM->>Callbacks: _on_genvm_success()
Callbacks->>Worker: reset_genvm_failures()
else Exhausted retries
API--xGenVM: Final error
GenVM->>Callbacks: _on_genvm_failure()
Callbacks->>Worker: increment_genvm_failure()
end
else Immediate success
API-->>GenVM: Success (genvm_id)
GenVM->>Callbacks: _on_genvm_success()
Callbacks->>Worker: reset_genvm_failures()
end
Note over Client,Worker: Health probe
Client->>Worker: GET /health
alt failures < threshold
Worker-->>Client: 200 OK
else failures ≥ threshold
Worker-->>Client: 503 {error, count, threshold}
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35 minutes Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tests/unit/test_genvm_retry.py (1)
95-126: Inconsistent environment variable isolation in tests.
test_default_timeoutandtest_default_retry_delaydirectly modifyos.environviapop()without a context manager, which could cause test pollution. Consider usingpatch.dictwithclear=Falseand explicit key removal for consistency, or use a fixture for cleanup.🔎 Proposed fix for consistent env var isolation
def test_default_timeout(self): """Default timeout is 10 seconds""" - import os - - os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None) - assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0 + with patch.dict("os.environ", {}, clear=False): + import os + os.environ.pop("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", None) + assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 10.0 def test_custom_timeout(self): """Timeout can be configured via env var""" with patch.dict("os.environ", {"GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS": "5"}): assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDS", 10.0) == 5.0 def test_default_retry_delay(self): """Default retry delay is 1 second""" - import os - - os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None) - assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0 + with patch.dict("os.environ", {}, clear=False): + import os + os.environ.pop("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", None) + assert base_host._get_timeout_seconds("GENVM_MANAGER_RUN_RETRY_DELAY_SECONDS", 1.0) == 1.0tests/unit/test_worker_health_degradation.py (2)
82-87: Async context manager mock may be fragile.The mock pattern for
aiohttp.requestuses synchronousMagicMockfor__aenter__and__aexit__. While this often works, it's more robust to useAsyncMockor explicit async functions.🔎 More robust async mock pattern
with patch("aiohttp.request") as mock_request: mock_response = MagicMock() mock_response.status = 200 - mock_response.__aenter__ = MagicMock(return_value=mock_response) - mock_response.__aexit__ = MagicMock(return_value=None) + async def aenter(): + return mock_response + async def aexit(*args): + pass + mock_response.__aenter__ = aenter + mock_response.__aexit__ = aexit mock_request.return_value = mock_responseOr use
unittest.mock.AsyncMockfor a cleaner approach.
127-132: Consider asserting specific response type for healthy case.The branching assertion handles both
JSONResponseanddictreturns, but it would be clearer to assert the exact expected type. If the healthy response should always be a dict, assert that directly.🔎 Proposed clarification
# Should be healthy (200 or dict response) # The endpoint returns dict for healthy, JSONResponse for unhealthy - if hasattr(response, "status_code"): - assert response.status_code != 503 - else: - assert response.get("status") in ["healthy", "stopping"] + # Healthy response is a dict, not JSONResponse + assert isinstance(response, dict), f"Expected dict for healthy response, got {type(response)}" + assert response.get("status") in ["healthy", "stopping"]backend/node/genvm/origin/base_host.py (2)
517-524: Remove unnecessary f-string prefix.Line 518 uses an f-string without any placeholders. Remove the
fprefix.🔎 Proposed fix
logger.error( - f"genvm manager /genvm/run failed", + "genvm manager /genvm/run failed", status=resp.status, body=data, )
564-566: Finally block logs on every retry attempt.The
finallyblock at line 564 executes after each loop iteration, including failed attempts before retries. This may produce confusing log entries showing "proc started" withgenvm_id=Noneduring retry scenarios.Consider moving this logging outside the retry loop or adjusting the condition to only log meaningful state.
🔎 Proposed adjustment
- finally: - if genvm_id_cell[0] is not None or last_exc is not None: - logger.debug("proc started", genvm_id=genvm_id_cell[0]) + finally: + # Only log when we have a valid genvm_id (successful start) + if genvm_id_cell[0] is not None: + logger.debug("proc started", genvm_id=genvm_id_cell[0])Or remove the finally block entirely since success already logs at line 527-529.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.pytests/unit/test_genvm_retry.pytests/unit/test_worker_health_degradation.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency
**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code
Files:
tests/unit/test_worker_health_degradation.pybackend/consensus/worker_service.pybackend/node/genvm/origin/base_host.pytests/unit/test_genvm_retry.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Place new backend tests in the closest scope folder and name them
test_<feature>.pyfor Pytest auto-discoveryUse pytest with fixtures from tests/common/ for backend testing
Files:
tests/unit/test_worker_health_degradation.pytests/unit/test_genvm_retry.py
backend/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Align backend filenames with their behavior (e.g.,
validators/llm_validator.py) and mirror that pattern in tests
Files:
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.py
backend/consensus/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement validator rotation using VRF (Verifiable Random Function) in the consensus system
Files:
backend/consensus/worker_service.py
🧬 Code graph analysis (3)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
set_genvm_callbacks(50-57)tests/unit/test_genvm_retry.py (2)
on_success(45-47)on_failure(49-51)
backend/node/genvm/origin/base_host.py (1)
tests/unit/test_genvm_retry.py (2)
on_success(45-47)on_failure(49-51)
tests/unit/test_genvm_retry.py (1)
backend/node/genvm/origin/base_host.py (3)
set_genvm_callbacks(50-57)_get_int(35-42)_get_timeout_seconds(25-32)
🪛 GitHub Actions: pre-commit
backend/node/genvm/origin/base_host.py
[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.
tests/unit/test_genvm_retry.py
[error] 1-1: Black formatting check failed. 2 files were reformatted by Black during the pre-commit hook.
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py
518-518: f-string without any placeholders
Remove extraneous f prefix
(F541)
522-524: Create your own exception
(TRY002)
522-524: Avoid specifying long messages outside the exception class
(TRY003)
543-547: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: load-test
- GitHub Check: test
- GitHub Check: backend-unit-tests
🔇 Additional comments (11)
tests/unit/test_genvm_retry.py (3)
1-9: LGTM! Well-structured test file for GenVM retry logic.The test organization into separate classes for callbacks, env var config, and retry config is clean and follows pytest conventions.
11-63: LGTM! Callback tests are thorough.The test class properly resets state in
setup_methodand covers setting, clearing, and invoking callbacks. Testing direct invocation is appropriate for unit tests.
66-89: LGTM! Good coverage of env var parsing edge cases.Tests cover valid values, invalid values, and missing values for both
_get_intand_get_timeout_seconds.tests/unit/test_worker_health_degradation.py (2)
12-48: LGTM! Comprehensive failure tracking tests.The
TestGenVMFailureTrackingclass provides good coverage for increment, reset, and get operations on the failure counter.
168-187: LGTM! Environment variable threshold tests.The tests appropriately verify the parsing pattern. The comment on line 173 correctly notes the limitation regarding module import time initialization.
backend/consensus/worker_service.py (3)
80-108: LGTM! GenVM failure tracking implementation.The module-level state and functions for tracking consecutive GenVM failures are well-structured. The logging provides good visibility into failure progression.
One note: if callbacks could ever be invoked from multiple threads simultaneously (unlikely in a single-event-loop async context), the increment operation wouldn't be atomic. For now, this appears safe given the FastAPI async context.
201-209: LGTM! Callback wiring is correct.The callbacks are properly wired to connect GenVM retry outcomes in
base_host.pywith failure tracking inworker_service.py. The inline import avoids potential circular import issues.
513-524: LGTM! Health check degradation logic is correct.The consecutive failure check properly returns 503 when the threshold is reached or exceeded. The response body includes all necessary debugging information.
backend/node/genvm/origin/base_host.py (3)
35-57: LGTM! Helper functions and callbacks are well-designed.The
_get_inthelper follows the same defensive pattern as_get_timeout_seconds. The callback mechanism is simple and effective for decoupling failure tracking from the retry logic.
516-524: Non-200 responses bypass failure tracking.When GenVM manager returns a non-200 status (e.g., 500), the code raises an exception immediately without triggering
_on_genvm_failure. This means transient server errors from GenVM manager won't contribute to the consecutive failure count or trigger health degradation.Consider whether non-200 responses should also be retried and/or trigger the failure callback.
473-480: LGTM! Retry configuration is well-designed.The reduced per-attempt timeout (10s) combined with 3 retries and exponential backoff provides faster failure detection while maintaining similar total time budget. Configuration via environment variables allows tuning without code changes.
bd863c9 to
6a80e48
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🤖 Fix all issues with AI Agents
In @backend/node/genvm/origin/base_host.py:
- Around line 486-487: The finally block uses last_exc (initialized to None) and
genvm_id_cell[0] but last_exc may remain None if a non-transient exception
(e.g., JSONDecodeError in the try body) occurs, causing the finally logic to
mis-fire or skip necessary cleanup; update the finally block to not rely solely
on last_exc: either set last_exc in an except Exception as e handler for all
non-transient errors, or change the finally logic to check genvm_id_cell[0]
directly and/or inspect a new explicit flag (e.g., proc_started or success)
instead of last_exc; locate symbols last_exc, genvm_id_cell, and the finally
block around the try in base_host.py to implement the chosen fix and ensure
proc-started logging and cleanup only run when the process actually started.
- Around line 473-483: The 10s value applies only to HTTP network calls, not
overall GenVM execution; update the declarations using _get_timeout_seconds so
their intent is explicit (e.g., rename or add a clear inline comment for
run_http_timeout_s and status_http_timeout_s to indicate they are HTTP request
timeouts for POST /genvm/run and GET status polling), and ensure you do NOT
change the separate overall execution timeout controlled by the timeout
parameter (leave any function/parameter named timeout untouched); also update
any callers or docs that reference these variables to reflect the “HTTP request
timeout” semantics so there’s no confusion during deployment.
🧹 Nitpick comments (6)
tests/unit/test_worker_health_degradation.py (2)
15-17: Consider using a fixture instead of setup_method for state reset.The manual reset of
worker_service._genvm_consecutive_failuresinsetup_methoddirectly mutates module-level state. Consider using a pytest fixture with proper teardown to ensure test isolation, especially if tests run in parallel.🔎 Proposed refactor using fixture
+@pytest.fixture(autouse=True) +def reset_genvm_state(): + """Reset GenVM state before each test.""" + worker_service._genvm_consecutive_failures = 0 + yield + # Cleanup after test + worker_service._genvm_consecutive_failures = 0 + class TestGenVMFailureTracking: """Test GenVM failure tracking functions""" - def setup_method(self): - # Reset failure count before each test - worker_service._genvm_consecutive_failures = 0 - def test_increment_genvm_failure(self):
171-187: Env var tests don't verify actual module behavior.These tests only verify
os.environ.get()behavior, not howworker_serviceactually parsesGENVM_FAILURE_UNHEALTHY_THRESHOLDat module import time. Since the threshold is read when the module loads (line 82-84 in worker_service.py), these tests don't confirm the integration works correctly.Consider either:
- Testing the actual
_genvm_failure_unhealthy_thresholdvalue in worker_service after mocking the env var before import, or- Removing these tests if they're redundant (since they only test standard library behavior).
🔎 Alternative approach to test actual module behavior
def test_threshold_uses_configured_value(self): """Verify worker_service respects the configured threshold""" # Direct test: check that worker_service._genvm_failure_unhealthy_threshold # matches the expected value from environment or default import backend.consensus.worker_service as ws # This assumes the module was loaded with default or set env var assert ws._genvm_failure_unhealthy_threshold >= 1Note: Full integration testing would require module reloading, which can be complex.
backend/node/genvm/origin/base_host.py (3)
523-523: Remove extraneous f-string prefix.The f-string on line 523 has no placeholders.
🔎 Proposed fix
- logger.error( - f"genvm manager /genvm/run failed", + logger.error( + "genvm manager /genvm/run failed", status=resp.status, body=data, )As per static analysis hints.
527-529: Consider using a custom exception class.Raising
Exceptionwith a formatted message makes it harder to catch specific failure types programmatically. Consider defining aGenVMRunErrorexception class for better error handling.🔎 Proposed refactor
Add near the top of the file:
class GenVMRunError(Exception): """Raised when GenVM /genvm/run request fails.""" def __init__(self, status: int, body: dict): self.status = status self.body = body super().__init__(f"genvm manager /genvm/run failed: {status} {body}")Then update line 527-529:
- raise Exception( - f"genvm manager /genvm/run failed: {resp.status} {data}" - ) + raise GenVMRunError(resp.status, data)As per static analysis hints.
550-560: Use logging.exception for better traceback capture.When logging errors with exceptions in except blocks,
logger.exception()automatically includes the traceback, providing better debugging information thanlogger.error().🔎 Proposed fix
if is_last_attempt: # All retries exhausted - track failure and propagate - logger.error( + logger.exception( "genvm manager request failed after all retries", - error=str(exc), attempts=max_retries, )Note:
logger.exception()automatically capturesexc, soerror=str(exc)becomes redundant.As per static analysis hints.
backend/consensus/worker_service.py (1)
80-84: Consider validation for threshold configuration.The threshold parsing uses
int()directly, which will raiseValueErrorif the env var contains invalid data. While this might be intentional (fail-fast on misconfiguration), it could also crash the worker at startup. The pattern differs from_get_timeout_secondsin base_host.py which returns a default on parse failure.🔎 Proposed defensive parsing
-_genvm_failure_unhealthy_threshold: int = int( - os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3") -) +def _parse_threshold() -> int: + try: + return int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")) + except ValueError: + logger.warning( + "Invalid GENVM_FAILURE_UNHEALTHY_THRESHOLD, using default 3" + ) + return 3 + +_genvm_failure_unhealthy_threshold: int = _parse_threshold()This aligns with the defensive parsing pattern used in base_host.py's
_get_int().
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.pytests/unit/test_genvm_retry.pytests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency
**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code
Files:
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.pytests/unit/test_worker_health_degradation.py
backend/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Align backend filenames with their behavior (e.g.,
validators/llm_validator.py) and mirror that pattern in tests
Files:
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.py
backend/consensus/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement validator rotation using VRF (Verifiable Random Function) in the consensus system
Files:
backend/consensus/worker_service.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Place new backend tests in the closest scope folder and name them
test_<feature>.pyfor Pytest auto-discoveryUse pytest with fixtures from tests/common/ for backend testing
Files:
tests/unit/test_worker_health_degradation.py
🧬 Code graph analysis (2)
backend/consensus/worker_service.py (2)
backend/node/genvm/origin/base_host.py (1)
set_genvm_callbacks(50-57)tests/unit/test_genvm_retry.py (2)
on_success(45-47)on_failure(49-51)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
increment_genvm_failure(87-93)reset_genvm_failures(96-103)get_genvm_failure_count(106-108)health_check(385-591)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py
523-523: f-string without any placeholders
Remove extraneous f prefix
(F541)
527-529: Create your own exception
(TRY002)
527-529: Avoid specifying long messages outside the exception class
(TRY003)
552-556: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: backend-unit-tests
🔇 Additional comments (10)
backend/node/genvm/origin/base_host.py (5)
35-43: LGTM! Clean integer parsing helper.The
_get_inthelper mirrors the existing_get_timeout_secondspattern and provides robust fallback behavior for invalid or missing environment variables.
45-57: LGTM! Callback mechanism is well-designed.The callback globals and setter provide a clean integration point for external failure tracking without tight coupling. The optional callbacks pattern is appropriate here.
541-544: LGTM! Success callback integration is correct.The success callback is invoked after the GenVM ID is captured and timeout watcher is started, ensuring the failure counter is reset only on confirmed success.
557-559: LGTM! Failure callback timing is appropriate.The failure callback is invoked only after all retries are exhausted, which correctly tracks persistent failures rather than transient issues.
562-571: LGTM! Exponential backoff implementation is correct.The retry delay calculation
retry_base_delay_s * (2**attempt)properly implements exponential backoff (1s, 2s, 4s for attempts 0, 1, 2 with default base delay of 1s).backend/consensus/worker_service.py (5)
87-93: LGTM! Clear failure tracking with informative logging.The increment function properly updates global state and provides useful warning logs that include both current count and threshold for easy debugging.
96-103: LGTM! Conditional reset prevents log spam.The check
if _genvm_consecutive_failures > 0before logging prevents unnecessary log entries on repeated successes, improving log quality.
106-108: LGTM! Testing helper with clear documentation.The
get_genvm_failure_count()function provides necessary test observability while keeping the implementation simple.
202-209: LGTM! Callback wiring is properly integrated into startup.The callbacks are registered after
genvm_manageris created and before the worker starts, ensuring the failure tracking is active throughout the worker's execution lifecycle. The startup log confirms the integration.
513-524: LGTM! Health degradation returns appropriate 503 with diagnostic info.The health check correctly returns a 503 status code when the failure threshold is met or exceeded, and includes
countandthresholdin the response body for debugging. This enables Kubernetes to restart unhealthy pods.
6a80e48 to
53590f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (7)
backend/node/genvm/origin/base_host.py (3)
546-547: Remove unusedlast_excvariable.The variable
last_excis assigned but never read. This was flagged by static analysis (F841).🔎 Proposed fix
except (aiohttp.ClientError, asyncio.TimeoutError) as exc: - last_exc = exc is_last_attempt = attempt >= max_retries - 1
527-529: Consider using a custom exception class for GenVM errors.Static analysis flagged this for creating an exception with a long message (TRY003) and using a generic
Exceptioninstead of a custom class (TRY002). A dedicatedGenVMManagerErrorexception would improve error handling upstream.
550-560: Verify exception chain is preserved for debugging.When
raiseis used without arguments on line 560, it re-raises the caught exceptionexc. This is correct, but consider usingraise exc from excor logging the full traceback withlogger.exceptioninstead oflogger.errorto preserve the stack trace for debugging (TRY400).🔎 Proposed improvement
if is_last_attempt: # All retries exhausted - track failure and propagate - logger.error( + logger.exception( "genvm manager request failed after all retries", - error=str(exc), attempts=max_retries, )tests/unit/test_worker_health_degradation.py (2)
82-87: Async context manager mock is incomplete.The
__aenter__and__aexit__methods should return coroutines for proper async context manager behavior. While this may work due to MagicMock's flexibility, it's not technically correct and could cause issues with stricter async implementations.🔎 Proposed fix using AsyncMock
- with patch("aiohttp.request") as mock_request: - mock_response = MagicMock() - mock_response.status = 200 - mock_response.__aenter__ = MagicMock(return_value=mock_response) - mock_response.__aexit__ = MagicMock(return_value=None) - mock_request.return_value = mock_response + from unittest.mock import AsyncMock + with patch("aiohttp.request") as mock_request: + mock_response = MagicMock() + mock_response.status = 200 + mock_response.__aenter__ = AsyncMock(return_value=mock_response) + mock_response.__aexit__ = AsyncMock(return_value=None) + mock_request.return_value = mock_responseThis pattern is repeated at lines 117-122 and 150-155.
168-187: Tests don't exercise actual module threshold parsing.These tests only verify that
os.environ.get()works correctly with defaults, not thatworker_service._genvm_failure_unhealthy_thresholdis correctly parsed at import time. Since the threshold is read at module import, patching the environment after import won't change the module's threshold value.Consider adding a test that verifies the actual
_genvm_failure_unhealthy_thresholdvalue or documenting that this is testing the parsing pattern rather than the integrated behavior.backend/consensus/worker_service.py (2)
80-84: Consider validating threshold is positive.If
GENVM_FAILURE_UNHEALTHY_THRESHOLDis set to0or a negative value, the health check would immediately return 503 even with zero failures. Consider adding validation or using the_get_intpattern with bounds checking.🔎 Proposed improvement
# GenVM Manager consecutive failure tracking _genvm_consecutive_failures: int = 0 -_genvm_failure_unhealthy_threshold: int = int( - os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3") -) +_genvm_failure_unhealthy_threshold: int = max( + 1, int(os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3")) +)
82-84: Threshold parsing doesn't handle invalid values gracefully.Unlike
_get_intinbase_host.py, this will raiseValueErrorif the environment variable contains a non-integer string. Consider using a try/except pattern for consistency.🔎 Proposed fix
# GenVM Manager consecutive failure tracking _genvm_consecutive_failures: int = 0 -_genvm_failure_unhealthy_threshold: int = int( - os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3") -) +try: + _genvm_failure_unhealthy_threshold: int = int( + os.environ.get("GENVM_FAILURE_UNHEALTHY_THRESHOLD", "3") + ) +except ValueError: + _genvm_failure_unhealthy_threshold = 3
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.pytests/unit/test_genvm_retry.pytests/unit/test_worker_health_degradation.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/test_genvm_retry.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Target Python 3.12, use 4-space indentation, and rely on Black via pre-commit for formatting consistency
**/*.py: Apply Black formatter for Python code formatting
Include type hints in all Python code
Files:
tests/unit/test_worker_health_degradation.pybackend/consensus/worker_service.pybackend/node/genvm/origin/base_host.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Place new backend tests in the closest scope folder and name them
test_<feature>.pyfor Pytest auto-discoveryUse pytest with fixtures from tests/common/ for backend testing
Files:
tests/unit/test_worker_health_degradation.py
backend/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Align backend filenames with their behavior (e.g.,
validators/llm_validator.py) and mirror that pattern in tests
Files:
backend/consensus/worker_service.pybackend/node/genvm/origin/base_host.py
backend/consensus/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement validator rotation using VRF (Verifiable Random Function) in the consensus system
Files:
backend/consensus/worker_service.py
🧬 Code graph analysis (2)
tests/unit/test_worker_health_degradation.py (1)
backend/consensus/worker_service.py (4)
increment_genvm_failure(87-93)reset_genvm_failures(96-103)get_genvm_failure_count(106-108)health_check(385-591)
backend/consensus/worker_service.py (3)
backend/node/genvm/origin/logger.py (2)
warning(20-21)info(17-18)backend/node/genvm/origin/base_host.py (1)
set_genvm_callbacks(50-57)tests/unit/test_genvm_retry.py (2)
on_success(45-47)on_failure(49-51)
🪛 Ruff (0.14.10)
backend/node/genvm/origin/base_host.py
527-529: Create your own exception
(TRY002)
527-529: Avoid specifying long messages outside the exception class
(TRY003)
547-547: Local variable last_exc is assigned to but never used
Remove assignment to unused variable last_exc
(F841)
552-556: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: backend-unit-tests
🔇 Additional comments (9)
backend/node/genvm/origin/base_host.py (4)
35-43: LGTM!The
_get_inthelper correctly mirrors_get_timeout_secondsfor parsing integer environment variables with graceful fallback on missing or invalid values.
45-57: LGTM!The callback mechanism is clean and well-documented. The type hints are correct, and the global state management is appropriate for this use case.
573-576: LGTM! The finally block logic is now correct.The condition now only logs when
genvm_id_cell[0] is not None, which addresses the previous concern about misleading "proc started" logs on failure paths.
488-544: LGTM! Retry loop with exponential backoff.The retry logic is well-structured:
- Exponential backoff with base delay
- Proper success callback invocation
- Clean exit on success via
return- Attempt tracking in logs
tests/unit/test_worker_health_degradation.py (2)
12-48: LGTM! Failure tracking tests are comprehensive.Tests cover increment, reset, reset when zero, and get count. Good coverage of the core tracking primitives.
53-60: Good test isolation viasetup_method.Resetting all relevant module-level state before each test ensures test isolation. This is important since the tests modify global state.
backend/consensus/worker_service.py (3)
87-108: LGTM! Clean failure tracking API.The functions are simple, well-documented, and follow a clear pattern. The logging in
increment_genvm_failureandreset_genvm_failuresprovides good observability for debugging.
201-209: LGTM! Callback wiring is correctly placed in lifespan.The callbacks are wired after
genvm_manageris created and before the worker starts, ensuring the tracking is active for all GenVM operations. The import placement inside the function avoids circular import issues.
513-524: LGTM! Health degradation response is well-structured.The 503 response includes all necessary debugging information: status, worker_id, error type, current count, and threshold. This will help operators diagnose issues when Kubernetes restarts pods.
53590f1 to
f7a15ba
Compare
When POST /genvm/run times out, retry up to 3x with exponential backoff. After consecutive failures exceed threshold, /health returns 503 triggering K8s pod restart. Transaction is released (not stuck) so another worker picks it up.
f7a15ba to
6c25bbe
Compare
|
When POST /genvm/run times out, retry up to 3x with exponential backoff. After consecutive failures exceed threshold, /health returns 503 triggering K8s pod restart. Transaction is released (not stuck) so another worker picks it up.


Summary
/genvm/runfor faster failure detection/healthreturns 503 after threshold → K8s restarts podConfig (env vars)
GENVM_MANAGER_RUN_HTTP_TIMEOUT_SECONDSGENVM_MANAGER_RUN_RETRIESGENVM_MANAGER_RUN_RETRY_DELAY_SECONDSGENVM_FAILURE_UNHEALTHY_THRESHOLDTest plan
tests/unit/test_genvm_retry.py)tests/unit/test_worker_health_degradation.py)Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.