Skip to content

Conversation

@yongli-shopify
Copy link

@yongli-shopify yongli-shopify commented Mar 21, 2025

Solves https://github.com/Shopify/db-mobility/issues/776

Create new TLA+ model to prove that FOR SHARE NOWAIT provides similar safety guarantee as FOR UPDATE at least

The key insights from our verification are:

  1. Both locking modes maintain data consistency: The invariant DataConsistency holds in both cases, ensuring that copied data is accurate.
  2. FOR_SHARE_NOWAIT guarantees progress: By skipping locked rows and coming back to them later, the TableIterator can always make progress.
  3. FOR_UPDATE can lead to deadlocks: If the application holds locks indefinitely, the TableIterator can get stuck waiting.
  4. FOR_SHARE_NOWAIT is safer for production: It prevents the copy process from getting stuck, which is crucial for long-running migrations.

This verification confirms that using FOR_SHARE_NOWAIT is the safer option for Ghostferry, as it prevents deadlocks while maintaining data consistency.

Here is the test result locally:

Understanding the Ghostferry Lock Safety Model

This TLA+ model verifies the safety guarantees of different locking strategies in Ghostferry. Let me explain how it works and what insights it provides.

Model Overview

The model simulates two concurrent processes:

  1. TableIterator - copies rows from source to target table
  2. Application - modifies rows in the source table

We're comparing two locking strategies:

  • FOR_UPDATE - exclusive locks that block when unavailable
  • FOR_SHARE_NOWAIT - shared locks that fail immediately when unavailable

Key Components of the Model

State Variables

  • SourceTable and TargetTable - represent database tables
  • lockOwners - tracks which process owns locks on which rows
  • currentRow - the row TableIterator is currently processing
  • copyComplete - whether copying is finished
  • rowToModify - the row Application is trying to modify

Actions

  • CopyRow - TableIterator copies a row when it can get a lock
  • SkipLockedRow - TableIterator skips a locked row (FOR_SHARE_NOWAIT only)
  • WaitForRow - TableIterator waits for a lock (FOR_UPDATE only)
  • ModifyRow - Application modifies a row
  • PickNewRow - Application picks a new row when it can't get a lock

Properties Verified

  • DataConsistency - copied data matches source data
  • LockSafety - no conflicting locks are held simultaneously
  • CopyEventuallyCompletes - copying eventually finishes (FOR_SHARE_NOWAIT only)
  • FinalConsistency - when copying is complete, target matches source

How the Model Verifies Safety

The model checker explores all possible interleavings of actions to verify:

  1. Safety Properties - These must hold in all states:

    • TypeOK - variables have correct types
    • LockSafety - lock conflicts never occur
    • DataConsistency - copied data is always consistent
  2. Liveness Properties - These must eventually become true:

    • CopyEventuallyCompletes - copying finishes (FOR_SHARE_NOWAIT only)
    • ModificationProgress - application can make progress

Key Insights from the Model

  1. FOR_UPDATE can deadlock:

    • If Application holds a lock indefinitely, TableIterator gets stuck
    • The model shows this by allowing WaitForRow to execute indefinitely
    • No fairness is applied to WaitForRow, so it can wait forever
  2. FOR_SHARE_NOWAIT prevents deadlocks:

    • TableIterator skips locked rows and continues
    • It can always make progress, even when rows are locked
    • The model verifies this with the CopyEventuallyCompletes property
  3. Both strategies maintain consistency:

    • DataConsistency holds in all states for both strategies
    • This confirms that both approaches are safe for data integrity

Practical Implications

The model verification confirms that:

  1. FOR_SHARE_NOWAIT is safer for production use because:

    • It prevents the copy process from getting stuck
    • It maintains data consistency
    • It guarantees eventual completion
  2. FOR_UPDATE should be used with caution because:

    • It can lead to deadlocks if locks are held for too long
    • While it maintains data consistency, it may not complete

This formal verification gives us confidence that Ghostferry's FOR_SHARE_NOWAIT strategy provides better safety guarantees for production database migrations.

Copy link
Member

@grodowski grodowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some changes while reviewing and proposing them here: #385

General takeaway:

I think FOR SHARE NOWAIT should be OK to provide consistency (or lock errors :trollface:, and nothing inbetween) given the current implementation of cursor.go - we should carefully look at the retry mechanisms there, since the proof doesn't consider any sort of batching in contrast to what ghostferry does.

/\ WF_vars(CompleteCopy)
/\ WF_vars(ModifyRow)
/\ WF_vars(PickNewRow)
\* No fairness for WaitForRow or Stutter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What WF_vars does from https://ilyasergey.net/CS6213/week-06-tla.html

Weak fairness of action A asserts of a behavior: If A ever remains continuously enabled, then an A step must eventually occur. It’s written as WF_vars(A) in TLA+. The vars subscript ensures that this step is not stuttering, i.e., it will change some of the variables.

CONSTANTS
Records = {r1, r2}
TableCapacity = 2
LockMode = "FOR_UPDATE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this to FOR_SHARE_NOWAIT to alternate between modes 👀

I was running this with tlc -config ghostferry_share_safety.cfg ghostferry_share_safety.tla

Comment on lines +57 to +61
/\ lockOwners' = [lockOwners EXCEPT ![currentRow] = @ \cup {TableIterator}]
/\ IF SourceTable[currentRow] # NoRecordHere
THEN TargetTable' = [TargetTable EXCEPT ![currentRow] = SourceTable[currentRow]]
ELSE UNCHANGED TargetTable
/\ lockOwners' = [lockOwners' EXCEPT ![currentRow] = @ \ {TableIterator}]
Copy link
Member

@grodowski grodowski Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we both be acquiring and releasing currentRow in lockOwners' at one point in time? I'm not sure how this is supposed to work 🤔 This reads like a sequence, where we

  • acquire the lock for TableIterator
  • copy to TargetTable
  • release the lock

but in TLA+ all of the expressions in a state definitions are commutative.

I also think this could be the leading cause of why the generated graph only had a depth of 1 (see output logs).

/\ currentRow' = currentRow + 1
/\ UNCHANGED << SourceTable, copyComplete, rowToModify, newValue >>

\* TableIterator skips a row when it can't get a lock (FOR_SHARE_NOWAIT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment sounds like it's confusing FOR_SHARE_NOWAIT with SKIP LOCKED? I think we shouldn't be advancing currentRow' in this state?

And translating to real ghostferry, in this case the current row will error in the select with NOWAIT and will be retried (the whole cursor batch). We might need to look closer at DBReadRetries in cursor.go and potentially adjust the sleep parameter to be non-zero. Default retries is 5 and it might be insufficient to avoid crashing if some rows in the batch are locked and we're in NOWAIT mode.

@grodowski
Copy link
Member

I successfully ran the primary ghostferry.tla using the TLA+ toolbox (took ~40min) after failing to complete it using the tlc CLI. Makes me thing if it would be feasible to merge the shared lock model into the primary, we should look into it, besides considering the tweaks from earlier review.

Screenshot 2025-04-25 at 09 56 53

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants