-
Notifications
You must be signed in to change notification settings - Fork 0
fix: lock ordering & cleanup (also contains atomic checkpoints) #38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add advisory locks to serialize concurrent await_event and emit_event operations on the same event. This prevents a race condition where: 1. Task A checks if event exists (not yet) 2. Task B emits the event and wakes waiters (none yet) 3. Task A registers as a waiter (missed the wake) The lock_event() function uses pg_advisory_xact_lock with hashed queue_name and event_name to ensure atomicity. Also changes emit_event to first-writer-wins semantics (ON CONFLICT DO NOTHING) to maintain consistency - subsequent emits for the same event are no-ops. Tests: - test_event_functions_use_advisory_locks: Verifies both functions call lock_event - test_event_race_stress: Stress test with 128 concurrent tasks x 4 rounds - test_event_first_writer_wins: Renamed from test_event_last_write_wins
Problem 1. Deadlock risk: Functions that touch both tasks and runs could deadlock if they acquired locks in inconsistent order 2. Scattered cleanup logic: Terminal task cleanup (deleting waiters, emitting parent events, cascading cancellation) was duplicated across multiple functions 3. Incomplete cascade cancellation: Auto-cancelled tasks (via max_duration) didn't cascade cancel their children Solution Lock Ordering: All functions now acquire locks in consistent order (task first, then run): - complete_run, fail_run, sleep_for, await_event: lock task FOR UPDATE before locking run - claim_task: lock task before run when handling expired claims - emit_event: lock sleeping tasks before waking runs (via locked_tasks CTE) Cleanup Consolidation: New cleanup_task_terminal() function handles: - Deleting wait registrations for the task - Emitting completion event for parent ($child:<task_id>) - Optionally cascading cancellation to children Used by: complete_run, fail_run, cancel_task, cascade_cancel_children, claim_task Other improvements: - emit_event early return when event already exists (optimization) - sleep_for now takes task_id parameter for proper lock ordering Tests - New lock_order_test.rs with 6 tests verifying lock ordering works correctly - New test_cascade_cancel_when_parent_auto_cancelled_by_max_duration in fanout_test.rs - New test helpers: single_conn_pool(), RunInfo, get_runs_for_task()
|
@codex review |
1 similar comment
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Problem
Solution
Lock Ordering: All functions now acquire locks in consistent order (task first, then run):
Cleanup Consolidation: New cleanup_task_terminal() function handles:
Used by: complete_run, fail_run, cancel_task, cascade_cancel_children, claim_task
Other improvements:
Tests
This PR also contains #39