Conversation


@rshoemaker rshoemaker commented Nov 14, 2025

Summary

This PR enhances the "remove-host" endpoint with a "force" option that lets users recover when a host is lost. Previously, calling "remove-host" on a lost host would fail because the internal state still showed a db running on it, which blocked the removal.
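At the API level, the change boils down to the shapes sketched below. This is illustrative only: aside from the optional force flag and the slice of tasks in RemoveHostResponse described in this PR, the type and field names are placeholders, not the actual definitions.

```go
package api // hypothetical package name, for illustration only

// RemoveHostRequest is a placeholder for however the endpoint takes its
// input; the PR itself just adds an optional "force" parameter.
type RemoveHostRequest struct {
	HostID string `json:"host_id"`
	// Force skips the "db is still running on this host" check so a lost
	// host can be removed from cluster state.
	Force bool `json:"force,omitempty"`
}

// RemoveHostResponse now carries the tasks spawned by the removal so the
// caller can follow how the change propagates across the cluster.
type RemoveHostResponse struct {
	Tasks []Task `json:"tasks"`
}

// Task stands in for whatever the (currently db-specific) task-management
// layer returns; see PLAT-347 in the notes below.
type Task struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}
```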

Changes

  • Updated the API endpoint to:
    • include an optional "force" param
    • return a RemoveHostResponse containing a slice of tasks related to the host removal
  • Propagated the related changes through the client package
  • Updated logic in post_init_handler.go to bypass the instance check during host removal
  • Added a new RemoveHost workflow/task that propagates the db changes across the rest of the cluster
  • Added logic to prevent additional requests from being sent to the dead host's work queue (this guard and the instance-check bypass are sketched after this list)
  • Added a new test to the clustertest framework (more below)
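To make the last two server-side changes concrete, here is a minimal sketch of the force bypass and the work-queue guard. All identifiers below are placeholders, not the actual code in post_init_handler.go or the queue layer.

```go
package cluster // hypothetical package; everything here is a sketch

import (
	"errors"
	"fmt"
)

// ErrHostRemoved is returned when work is routed to a host that has been
// force-removed from the cluster.
var ErrHostRemoved = errors.New("host has been removed from the cluster")

type handler struct {
	removed map[string]bool // hosts already force-removed
}

func newHandler() *handler {
	return &handler{removed: make(map[string]bool)}
}

// RemoveHost sketches the force bypass: the running-instance check is
// skipped when force is set, so a lost host can still be removed.
func (h *handler) RemoveHost(hostID string, force bool) error {
	if !force && h.hostHasRunningInstances(hostID) {
		return fmt.Errorf("host %s still reports a running db; pass force to remove a lost host", hostID)
	}
	// Mark the host removed first so no new work is routed to its queue,
	// then start the RemoveHost workflow that propagates the db changes
	// across the rest of the cluster.
	h.removed[hostID] = true
	return h.startRemoveHostWorkflow(hostID)
}

// Enqueue sketches the guard that prevents additional requests from being
// sent to a dead host's work queue.
func (h *handler) Enqueue(hostID string, send func() error) error {
	if h.removed[hostID] {
		return ErrHostRemoved
	}
	return send()
}

// Stubs standing in for the real cluster-state checks and workflow kickoff.
func (h *handler) hostHasRunningInstances(hostID string) bool  { return false }
func (h *handler) startRemoveHostWorkflow(hostID string) error { return nil }
```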

Testing

Added a new test case and utils to the clustertest framework (a rough sketch of the test follows the steps below):

  • create a new 3 node cluster and database
  • wait for it to become healthy
  • forcibly kill one of the host containers
  • call remove-host --force to remove the missing container from the cluster state
  • wait for it to become healthy
  • assert that the cluster has the expected number of instances, nodes, etc., in the expected states
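In Go terms, the new case looks roughly like the sketch below; the cluster type and helper functions are stand-ins for the actual clustertest utils, not their real API.

```go
package clustertest_test

import (
	"context"
	"testing"
)

// Everything below is a sketch; the cluster type and helpers are stand-ins
// for the real clustertest utilities, not their actual signatures.

type testCluster struct{}

func newTestCluster(ctx context.Context, t *testing.T, hosts int) *testCluster        { return &testCluster{} }
func createDatabase(ctx context.Context, t *testing.T, c *testCluster) string         { return "db-1" }
func waitForClusterHealthy(ctx context.Context, t *testing.T, c *testCluster)         {}
func killRandomHostContainer(ctx context.Context, t *testing.T, c *testCluster) string { return "host-3" }
func removeHostForce(ctx context.Context, t *testing.T, c *testCluster, hostID string) {}
func assertClusterState(ctx context.Context, t *testing.T, c *testCluster, dbID string, wantNodes int) {}

// TestRemoveLostHostWithForce walks the same steps as the new test case.
func TestRemoveLostHostWithForce(t *testing.T) {
	ctx := context.Background()

	// Create a new 3-node cluster and database, then wait for it to settle.
	cluster := newTestCluster(ctx, t, 3)
	dbID := createDatabase(ctx, t, cluster)
	waitForClusterHealthy(ctx, t, cluster)

	// Forcibly kill one of the host containers to simulate a lost host.
	lostHostID := killRandomHostContainer(ctx, t, cluster)

	// remove-host --force removes the missing container from cluster state.
	removeHostForce(ctx, t, cluster, lostHostID)

	// Wait for the cluster to report healthy again, then assert that it has
	// the expected number of instances/nodes in the expected states.
	waitForClusterHealthy(ctx, t, cluster)
	assertClusterState(ctx, t, cluster, dbID, 2)
}
```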

Checklist

  • Tests added or updated (unit and/or e2e, as needed)
  • Documentation updated (if needed)
  • Issue is linked (branch name or URL in PR description)
  • Changelog entry added for user-facing behavior changes
  • Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

There are a couple of related follow-on tasks:

  1. the current task management layer is db-specific and needs some refactoring to allow tasks for additional entity types (like hosts): PLAT-347
  2. there is an issue preventing cleanup of the dead instance from the cluster after it is forcibly removed. The cluster is healthy, but when you call get-database you can see that there is still a leftover instance in the "unknown" state.
    • as a result, there is a workaround func in utils that allows the test to pass even though the extra zombie instance exists (a guess at its shape follows the snippet below); this will need to be cleaned up when the root issue is resolved.
    • once that happens, uncomment the verifyDatabaseHealth line and remove the WORKAROUND util func:
      tLog(t, "verifying database health with 2 nodes")
      // err = verifyDatabaseHealth(ctx, t, cluster.Client(), dbID, 2)
      err = verifyDatabaseHealthWORKAROUND(ctx, t, cluster.Client(), dbID, 2)
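For context, the workaround presumably amounts to counting only the instances that are not stuck in the "unknown" state. The sketch below is a guess at that shape, not the actual util; the Client/GetDatabase/Instance types are stubs so it stands alone.

```go
package clustertest

import (
	"context"
	"fmt"
	"testing"
)

// Stub types standing in for the real client; placeholders only.
type Instance struct{ State string }
type Database struct{ Instances []Instance }
type Client struct{}

func (c *Client) GetDatabase(ctx context.Context, dbID string) (*Database, error) {
	return &Database{}, nil // placeholder
}

// verifyDatabaseHealthWORKAROUND presumably mirrors verifyDatabaseHealth but
// ignores the leftover instance stuck in the "unknown" state.
func verifyDatabaseHealthWORKAROUND(ctx context.Context, t *testing.T, c *Client, dbID string, want int) error {
	db, err := c.GetDatabase(ctx, dbID)
	if err != nil {
		return err
	}
	healthy := 0
	for _, inst := range db.Instances {
		if inst.State == "unknown" {
			continue // skip the zombie instance left behind by the forced removal
		}
		healthy++
	}
	if healthy != want {
		return fmt.Errorf("expected %d instances, found %d healthy", want, healthy)
	}
	return nil
}
```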

@rshoemaker rshoemaker force-pushed the feat/PLAT-130/host-recovery branch from 44490ac to 2ab794e on December 1, 2025 20:17

@jason-lynch jason-lynch left a comment

This is close! Just one more change and I think we can call this done for now.

@rshoemaker rshoemaker marked this pull request as ready for review December 2, 2025 20:15
@rshoemaker rshoemaker merged commit 8ab6e6b into main Dec 2, 2025
2 checks passed
@rshoemaker rshoemaker deleted the feat/PLAT-130/host-recovery branch December 2, 2025 22:27