@howold-lab

When a peer's CLI process is alive but API calls are failing (400, 429, 500, etc.), the current crash detection doesn't trigger because the pane isn't dead.

This adds:

  • New module: orchestrator/api_error_recovery.py
  • Detects API error patterns in pane output during health checks
  • Auto-restarts peer when errors detected (with debounce and limits)
  • Notifies user via outbox when restart occurs

Reuses existing restart infrastructure (count_recent_restarts, restart_peer) and follows the existing module pattern (make() factory function).
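The flow described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: the regex, the `check` callback, the `notify` callable, and the `debounce_hits` / `max_restarts` parameters are all hypothetical, and the signatures assumed here for `restart_peer` and `count_recent_restarts` (each taking a peer name) may differ from the real ones. Debounce is modeled as "the error must show up in N consecutive health checks".

```python
import re
from typing import Callable, Dict

# Hypothetical pattern: the patterns actually used by the module are not
# shown in this PR description.
API_ERROR_RE = re.compile(
    r"\b(?:API error|HTTP|status(?: code)?)[:\s]+(?:400|429|5\d\d)\b",
    re.IGNORECASE,
)


def make(
    restart_peer: Callable[[str], None],          # assumed signature
    count_recent_restarts: Callable[[str], int],  # assumed signature
    notify: Callable[[str], None],                # hypothetical outbox writer
    *,
    debounce_hits: int = 2,   # hypothetical: consecutive failing checks required
    max_restarts: int = 3,    # hypothetical: cap on recent restarts
) -> Callable[[str, str], bool]:
    """Factory returning a health-check hook; True means a restart was issued."""
    hits: Dict[str, int] = {}

    def check(peer: str, pane_text: str) -> bool:
        if not API_ERROR_RE.search(pane_text):
            hits[peer] = 0  # a clean check resets the debounce counter
            return False
        hits[peer] = hits.get(peer, 0) + 1
        if hits[peer] < debounce_hits:
            return False  # not seen often enough yet
        if count_recent_restarts(peer) >= max_restarts:
            return False  # restart limit reached; avoid a restart loop
        hits[peer] = 0
        restart_peer(peer)
        notify(f"{peer}: restarted after repeated API errors in pane output")
        return True

    return check
```

A caller would build the hook once with `make(...)` and then call `check(peer, pane_text)` from each health check, passing the most recent pane output.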

@ChesterRa
Owner

Thanks a lot for your PR!
Would you please explain why RESTART helps when API calls are failing?
I'm a little confused...

@howold-lab
Author

hi, author
With certain models, such as CC, it's easy to run into 400 errors and the like. As you know, a VPN is required in some countries, so

@waterbang
Contributor

Hi, I feel we should let Foreman make autonomous decisions. What do you think?
