Skip to content

Conversation

@virajmehta
Copy link
Member

@virajmehta virajmehta commented Dec 21, 2025

Problem

  1. Deadlock risk: Functions that touch both tasks and runs could deadlock if they acquired locks in inconsistent order
  2. Scattered cleanup logic: Terminal task cleanup (deleting waiters, emitting parent events, cascading cancellation) was duplicated across multiple functions
  3. Incomplete cascade cancellation: Auto-cancelled tasks (via max_duration) didn't cascade cancel their children

Solution

Lock Ordering: All functions now acquire locks in consistent order (task first, then run):

  • complete_run, fail_run, sleep_for, await_event: lock task FOR UPDATE before locking run
  • claim_task: lock task before run when handling expired claims
  • emit_event: lock sleeping tasks before waking runs (via locked_tasks CTE)

Cleanup Consolidation: New cleanup_task_terminal() function handles:

  • Deleting wait registrations for the task
  • Emitting completion event for parent ($child:<task_id>)
  • Optionally cascading cancellation to children

Used by: complete_run, fail_run, cancel_task, cascade_cancel_children, claim_task

Other improvements:

  • emit_event early return when event already exists (optimization)
  • sleep_for now takes task_id parameter for proper lock ordering

Tests

  • New lock_order_test.rs with 6 tests verifying lock ordering works correctly
  • New test_cascade_cancel_when_parent_auto_cancelled_by_max_duration in fanout_test.rs
  • New test helpers: single_conn_pool(), RunInfo, get_runs_for_task()

This PR also contains #39

   Add advisory locks to serialize concurrent await_event and emit_event
   operations on the same event. This prevents a race condition where:

   1. Task A checks if event exists (not yet)
   2. Task B emits the event and wakes waiters (none yet)
   3. Task A registers as a waiter (missed the wake)

   The lock_event() function uses pg_advisory_xact_lock with hashed
   queue_name and event_name to ensure atomicity.

   Also changes emit_event to first-writer-wins semantics (ON CONFLICT
   DO NOTHING) to maintain consistency - subsequent emits for the same
   event are no-ops.

   Tests:
   - test_event_functions_use_advisory_locks: Verifies both functions call lock_event
   - test_event_race_stress: Stress test with 128 concurrent tasks x 4 rounds
   - test_event_first_writer_wins: Renamed from test_event_last_write_wins
  Problem

  1. Deadlock risk: Functions that touch both tasks and runs could deadlock if they acquired locks in inconsistent order
  2. Scattered cleanup logic: Terminal task cleanup (deleting waiters, emitting parent events, cascading cancellation) was duplicated across multiple functions
  3. Incomplete cascade cancellation: Auto-cancelled tasks (via max_duration) didn't cascade cancel their children

  Solution

  Lock Ordering: All functions now acquire locks in consistent order (task first, then run):
  - complete_run, fail_run, sleep_for, await_event: lock task FOR UPDATE before locking run
  - claim_task: lock task before run when handling expired claims
  - emit_event: lock sleeping tasks before waking runs (via locked_tasks CTE)

  Cleanup Consolidation: New cleanup_task_terminal() function handles:
  - Deleting wait registrations for the task
  - Emitting completion event for parent ($child:<task_id>)
  - Optionally cascading cancellation to children

  Used by: complete_run, fail_run, cancel_task, cascade_cancel_children, claim_task

  Other improvements:
  - emit_event early return when event already exists (optimization)
  - sleep_for now takes task_id parameter for proper lock ordering

  Tests

  - New lock_order_test.rs with 6 tests verifying lock ordering works correctly
  - New test_cascade_cancel_when_parent_auto_cancelled_by_max_duration in fanout_test.rs
  - New test helpers: single_conn_pool(), RunInfo, get_runs_for_task()
@virajmehta virajmehta marked this pull request as draft December 21, 2025 00:40
@virajmehta virajmehta changed the base branch from viraj/event-race-fix to main December 21, 2025 18:01
@virajmehta virajmehta changed the title [STACKED] fix: lock ordering & cleanup fix: lock ordering & cleanup (also contains atomic checkpoints) Dec 21, 2025
@virajmehta virajmehta marked this pull request as ready for review December 21, 2025 18:02
@GabrielBianconi
Copy link
Member

@codex review

1 similar comment
@GabrielBianconi
Copy link
Member

@codex review

Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@virajmehta virajmehta added this pull request to the merge queue Dec 24, 2025
Merged via the queue into main with commit d564532 Dec 24, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants