How to deal with the split-brain problem in kvrocks #329

@Allen315

Hi guys,
while analyzing the use of a controller to manage Kvrocks clusters in a multi-AZ deployment, we identified a risk of split-brain.
For example, in the diagram below, when a network partition occurs:

  • The connection between the Load Balancer (LB) and node N1 remains functional.
  • The connection between the Controller Master and node N1 fails due to network issues.

This causes the Controller Master to mistakenly mark N1 as faulty and trigger a failover, promoting N1's slave to master. However, N1 is actually healthy and continues to accept writes from the LB.

Result: two active masters (split-brain) exist for the same shard, leading to data inconsistency.
Image

With the current solution, this can happen in many scenarios. Besides network partitions, an instance can also hang for a while, then recover during the failover and resume serving writes normally.
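
To make the scenario concrete, here is a toy simulation of the failure mode described above. It is only a sketch: `Node` and `controller_failover` are hypothetical names invented for illustration and do not correspond to the actual Kvrocks or controller code. It assumes the controller promotes the slave as soon as its own probe to the master fails, and that it has no way to demote a master it cannot reach.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one Kvrocks instance in a shard."""
    name: str
    role: str  # "master" or "slave"
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        # A node rejects writes only if it believes it is a slave.
        if self.role != "master":
            raise RuntimeError(f"{self.name} is read-only")
        self.data[key] = value


def controller_failover(master, slave, controller_can_reach_master):
    """Hypothetical controller logic: promote the slave when the probe fails."""
    if controller_can_reach_master:
        return False
    slave.role = "master"
    # The old master is unreachable from the controller, so it is never
    # told to step down and keeps its "master" role.
    return True


n1 = Node("N1", "master")
n1_slave = Node("N1-slave", "slave")

# Partition: Controller Master -> N1 is broken, LB -> N1 still works.
promoted = controller_failover(n1, n1_slave, controller_can_reach_master=False)

# The LB keeps sending writes to N1, which still considers itself master.
n1.write("k", "written-via-lb")
# Clients that follow the new topology write to the promoted node.
n1_slave.write("k", "written-via-new-master")

masters = [n.name for n in (n1, n1_slave) if n.role == "master"]
print("active masters:", masters)
print("N1 data:", n1.data)
print("N1-slave data:", n1_slave.data)
```

Both nodes accept the write to the same key and end up with different values, which is the data inconsistency we are worried about. The hang-and-recover case looks the same in this model: the probe fails while N1 is stalled, and N1 comes back still believing it is the master.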
