Conversation


@rshoemaker rshoemaker commented Nov 14, 2025

Summary

This PR enhances the "remove-host" endpoint with a "force" option that lets users recover when a host is lost. Previously, calling "remove-host" on a lost host would fail because the internal state still showed a db running on it, which blocked the removal.
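At the API level, the change boils down to the shapes sketched below. This is illustrative only: aside from the optional force flag and the slice of tasks in RemoveHostResponse described in this PR, the type and field names are placeholders, not the actual definitions.

```go
package api // hypothetical package name, for illustration only

// RemoveHostRequest is a placeholder for however the endpoint takes its
// input; the PR itself just adds an optional "force" parameter.
type RemoveHostRequest struct {
	HostID string `json:"host_id"`
	// Force skips the "db is still running on this host" check so a lost
	// host can be removed from cluster state.
	Force bool `json:"force,omitempty"`
}

// RemoveHostResponse now carries the tasks spawned by the removal so the
// caller can follow how the change propagates across the cluster.
type RemoveHostResponse struct {
	Tasks []Task `json:"tasks"`
}

// Task stands in for whatever the (currently db-specific) task-management
// layer returns; see PLAT-347 in the notes below.
type Task struct {
	ID     string `json:"id"`
	Status string `json:"status"`
}
```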

Changes

  • Updated the API endpoint to:
    • include an optional "force" param
    • return a RemoveHostResponse containing a slice of tasks related to the host removal
  • Propagated the related changes through the client package
  • Updated logic in post_init_handler.go to bypass the instance check during host removal
  • Added a new RemoveHost workflow/task that propagates the db changes across the rest of the cluster
  • Added logic to prevent additional requests from being sent to the dead host's work queue (this guard and the instance-check bypass are sketched after this list)
  • Added a new test to the clustertest framework (more below)
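To make the last two server-side changes concrete, here is a minimal sketch of the force bypass and the work-queue guard. All identifiers below are placeholders, not the actual code in post_init_handler.go or the queue layer.

```go
package cluster // hypothetical package; everything here is a sketch

import (
	"errors"
	"fmt"
)

// ErrHostRemoved is returned when work is routed to a host that has been
// force-removed from the cluster.
var ErrHostRemoved = errors.New("host has been removed from the cluster")

type handler struct {
	removed map[string]bool // hosts already force-removed
}

func newHandler() *handler {
	return &handler{removed: make(map[string]bool)}
}

// RemoveHost sketches the force bypass: the running-instance check is
// skipped when force is set, so a lost host can still be removed.
func (h *handler) RemoveHost(hostID string, force bool) error {
	if !force && h.hostHasRunningInstances(hostID) {
		return fmt.Errorf("host %s still reports a running db; pass force to remove a lost host", hostID)
	}
	// Mark the host removed first so no new work is routed to its queue,
	// then start the RemoveHost workflow that propagates the db changes
	// across the rest of the cluster.
	h.removed[hostID] = true
	return h.startRemoveHostWorkflow(hostID)
}

// Enqueue sketches the guard that prevents additional requests from being
// sent to a dead host's work queue.
func (h *handler) Enqueue(hostID string, send func() error) error {
	if h.removed[hostID] {
		return ErrHostRemoved
	}
	return send()
}

// Stubs standing in for the real cluster-state checks and workflow kickoff.
func (h *handler) hostHasRunningInstances(hostID string) bool  { return false }
func (h *handler) startRemoveHostWorkflow(hostID string) error { return nil }
```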

Testing

Added a new test case and utils to the clustertest framework (a rough sketch of the test follows the steps below):

  • create a new 3 node cluster and database
  • wait for it to become healthy
  • forcibly kill one of the host containers
  • call remove-host --force to remove the missing container from the cluster state
  • wait for it to become healthy
  • assert that the cluster has the expected number of instances, nodes, etc., in the expected states
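In Go terms, the new case looks roughly like the sketch below; the cluster type and helper functions are stand-ins for the actual clustertest utils, not their real API.

```go
package clustertest_test

import (
	"context"
	"testing"
)

// Everything below is a sketch; the cluster type and helpers are stand-ins
// for the real clustertest utilities, not their actual signatures.

type testCluster struct{}

func newTestCluster(ctx context.Context, t *testing.T, hosts int) *testCluster        { return &testCluster{} }
func createDatabase(ctx context.Context, t *testing.T, c *testCluster) string         { return "db-1" }
func waitForClusterHealthy(ctx context.Context, t *testing.T, c *testCluster)         {}
func killRandomHostContainer(ctx context.Context, t *testing.T, c *testCluster) string { return "host-3" }
func removeHostForce(ctx context.Context, t *testing.T, c *testCluster, hostID string) {}
func assertClusterState(ctx context.Context, t *testing.T, c *testCluster, dbID string, wantNodes int) {}

// TestRemoveLostHostWithForce walks the same steps as the new test case.
func TestRemoveLostHostWithForce(t *testing.T) {
	ctx := context.Background()

	// Create a new 3-node cluster and database, then wait for it to settle.
	cluster := newTestCluster(ctx, t, 3)
	dbID := createDatabase(ctx, t, cluster)
	waitForClusterHealthy(ctx, t, cluster)

	// Forcibly kill one of the host containers to simulate a lost host.
	lostHostID := killRandomHostContainer(ctx, t, cluster)

	// remove-host --force removes the missing container from cluster state.
	removeHostForce(ctx, t, cluster, lostHostID)

	// Wait for the cluster to report healthy again, then assert that it has
	// the expected number of instances/nodes in the expected states.
	waitForClusterHealthy(ctx, t, cluster)
	assertClusterState(ctx, t, cluster, dbID, 2)
}
```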

Checklist

  • Tests added or updated (unit and/or e2e, as needed)
  • Documentation updated (if needed)
  • Issue is linked (branch name or URL in PR description)
  • Changelog entry added for user-facing behavior changes
  • Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

There are a couple of related follow-on tasks:

  1. the current task management layer is db-specific and needs some refactoring to allow tasks for additional entity types (like hosts): PLAT-347
  2. there is an issue preventing cleanup of the dead instance from the cluster after it is forcibly removed. The cluster is healthy, but when you call get-database you can see that there is still a leftover instance in the "unknown" state.
    • as a result, there is a workaround func in utils that allows the test to pass even though the extra zombie instance exists (a guess at its shape follows the snippet below); this will need to be cleaned up when the root issue is resolved.
    • once that happens, uncomment the verifyDatabaseHealth line and remove the WORKAROUND util func:
      tLog(t, "verifying database health with 2 nodes")
      // err = verifyDatabaseHealth(ctx, t, cluster.Client(), dbID, 2)
      err = verifyDatabaseHealthWORKAROUND(ctx, t, cluster.Client(), dbID, 2)
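For context, the workaround presumably amounts to counting only the instances that are not stuck in the "unknown" state. The sketch below is a guess at that shape, not the actual util; the Client/GetDatabase/Instance types are stubs so it stands alone.

```go
package clustertest

import (
	"context"
	"fmt"
	"testing"
)

// Stub types standing in for the real client; placeholders only.
type Instance struct{ State string }
type Database struct{ Instances []Instance }
type Client struct{}

func (c *Client) GetDatabase(ctx context.Context, dbID string) (*Database, error) {
	return &Database{}, nil // placeholder
}

// verifyDatabaseHealthWORKAROUND presumably mirrors verifyDatabaseHealth but
// ignores the leftover instance stuck in the "unknown" state.
func verifyDatabaseHealthWORKAROUND(ctx context.Context, t *testing.T, c *Client, dbID string, want int) error {
	db, err := c.GetDatabase(ctx, dbID)
	if err != nil {
		return err
	}
	healthy := 0
	for _, inst := range db.Instances {
		if inst.State == "unknown" {
			continue // skip the zombie instance left behind by the forced removal
		}
		healthy++
	}
	if healthy != want {
		return fmt.Errorf("expected %d instances, found %d healthy", want, healthy)
	}
	return nil
}
```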

@rshoemaker rshoemaker force-pushed the feat/PLAT-130/host-recovery branch from 44490ac to 2ab794e on December 1, 2025 20:17

@jason-lynch jason-lynch left a comment

This is close! Just one more change and I think we can call this done for now.

@rshoemaker rshoemaker marked this pull request as ready for review December 2, 2025 20:15
@rshoemaker rshoemaker merged commit 8ab6e6b into main Dec 2, 2025
2 checks passed
@rshoemaker rshoemaker deleted the feat/PLAT-130/host-recovery branch December 2, 2025 22:27