-
Notifications
You must be signed in to change notification settings - Fork 6.2k
8373100: Genshen: Control thread can miss allocation failure notification #28665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8373100: Genshen: Control thread can miss allocation failure notification #28665
Conversation
|
👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into |
|
@earthling-amzn This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the ➡️ To integrate this PR with the above commit message to the |
|
@earthling-amzn The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
kdnilsen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
|
||
| // Notifies the control thread, but does not update the requested cause or generation. | ||
| // The overloaded variant should be used when the _control_lock is already held. | ||
| void notify_cancellation(GCCause::Cause cause); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods were the root cause here. ShenandoahHeap::_canceled_gc is read/written atomically, but ShenandoahGenerationalControlThread::_requested_gc_cause is read/written under a lock. These notify_cancellation methods did not update _requested_gc_cause at all. So, in the failure I observed we had:
- Control thread finishes cycle and sees no cancellation is requested (no lock used).
- Mutator thread fails allocation, cancels GC (again, no lock used), and does not change
_requested_gc_cause. - Control thread takes
_control_lockand checks_requested_gc_causeand sees_no_gc(becausenotify_cancellationdidn't change it) andwaitsforever now.
The fix here is to replace notify_cancellation with notify_control_thread which serializes updates to _requested_gc_cause under _control_lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at the places where ShenandoahHeap::clear_cancelled_gc is called, I feel the problem is more likely from op_final_update_refs:
void ShenandoahConcurrentGC::op_final_update_refs() {
ShenandoahHeap* const heap = ShenandoahHeap::heap();
...
...
// Clear cancelled GC, if set. On cancellation path, the block before would handle
// everything.
if (heap->cancelled_gc()) {
heap->clear_cancelled_gc();
}
...
...
}
Let's say there is concurrent GC running, right before the final update refs safepoint, there is mutator allocation failure:
- The mutator tries to cancel the the concurrent GC and notify controller thread.
- The mutator block itself at
_alloc_failure_waiters_lock, claiming safepoint safe as well. - concurrent GC enter the final update refs (VM operation)
- in final update refs, VMThread sees cancelled_gc and clear it.
- concurrent GC finishes, but cancelled_gc has been cleared so it won't notify the mutator.
The fix seems to work in generational mode, but may not work in non-generational mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I was staring at the code ShenandoahController::handle_alloc_failure today, I found there is discrepancy between ShenandoahGenerationalControlThread and ShenandoahControlThread, I created a bug to unify the behavior, we could fix the issue in ShenandoahControlThread there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scenario I described wasn't supposition, that is actually what happened in the debugger. The scenario you describe with op_final_update_refs would also be fixed by this PR. The _requested_gc_cause field should always be accessed under a lock. The code change here fixes an issue where an allocation failure might not set _requested_gc_cause at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I understand the fix will solve the issue for genshen and also fix scenario I described.
I'll solve the potential issue in non-generational Shenandoah in the PR to fix the behavior differences in Genshen and non-generational Shenandoah.
kdnilsen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for diligent testing and analysis. Subtle code here.
pengxiaolong
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the digging and fixing the issue.
ysramakrishna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for cleaning this up.
Did you review the non-generational ShenandoahControlThread and uses thereof to make sure a similar issues doesn't exist there?
As Xiaolong states, it might be worthwhile to do a refactor that shares as much as needed and no more, and to do so cleanly.
This looks good; sorry for the delay in reviewing.
🚢
| void notify_control_thread(GCCause::Cause cause, ShenandoahGeneration* generation); | ||
| void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation); | ||
|
|
||
| // Notifies the control thread, but does not update the requested cause or generation. | ||
| // The overloaded variant should be used when the _control_lock is already held. | ||
| void notify_cancellation(GCCause::Cause cause); | ||
| void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause); | ||
| void notify_control_thread(GCCause::Cause cause); | ||
| void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
I'd (subjectively) order them thus: (nct = notify_control_thread)
- nct(cause)
- nct(ml, cause)
- nct(cause, generation)
- nct(ml, cause, generation)
For completeness in the documentation comment preceding, state that if an argument, cause or generation, is missing, it isn't updated.
I am assuming that there is a specific small subset of cause values where the generation isn't important to spell out and really implies "isn't necessary or is implicitly understood" for cancellation/request cause? Is there a call argument/consistency check that might be done in the nct:s where these bottom out to confirm this, or am I being unnecessarily paranoid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are two uses where we don't need the generation:
- It's important to not update the generation for an allocation failure (degenerated cycle needs to use same generation)
- We are shutting down the JVM and don't want to start another cycle.
All cases need to pass a GCCause.
|
@ysramakrishna , @pengxiaolong - The non-generational control thread is less susceptible to this sort of issue because it has the responsibility of evaluating trigger conditions. It's loop therefore sleeps with a timed wait when the GC cycle is complete. If it misses a cancelled gc request, it will see it on the next iteration. |
|
/integrate |
|
Going to push as commit ea6493c.
Your commit was automatically rebased without conflicts. |
|
@earthling-amzn Pushed as commit ea6493c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
In some cases, the control thread may fail to observe an allocation failure. This results in the thread which failed to allocate waiting forever for the control thread to run a cycle. Depending on which thread fails to allocate, the process may not make progress.
Progress
Issue
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28665/head:pull/28665$ git checkout pull/28665Update a local copy of the PR:
$ git checkout pull/28665$ git pull https://git.openjdk.org/jdk.git pull/28665/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28665View PR using the GUI difftool:
$ git pr show -t 28665Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28665.diff
Using Webrev
Link to Webrev Comment