Skip to content

Conversation

@earthling-amzn
Copy link
Contributor

@earthling-amzn earthling-amzn commented Dec 4, 2025

In some cases, the control thread may fail to observe an allocation failure. This results in the thread which failed to allocate waiting forever for the control thread to run a cycle. Depending on which thread fails to allocate, the process may not make progress.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8373100: Genshen: Control thread can miss allocation failure notification (Bug - P4)

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28665/head:pull/28665
$ git checkout pull/28665

Update a local copy of the PR:
$ git checkout pull/28665
$ git pull https://git.openjdk.org/jdk.git pull/28665/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28665

View PR using the GUI difftool:
$ git pr show -t 28665

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28665.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Dec 4, 2025

👋 Welcome back wkemper! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@earthling-amzn This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8373100: Genshen: Control thread can miss allocation failure notification

Reviewed-by: ysr, kdnilsen, xpeng

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the master branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk bot added hotspot-gc hotspot-gc-dev@openjdk.org shenandoah shenandoah-dev@openjdk.org labels Dec 4, 2025
@openjdk
Copy link

openjdk bot commented Dec 4, 2025

@earthling-amzn The following labels will be automatically applied to this pull request:

  • hotspot-gc
  • shenandoah

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 4, 2025
@mlbridge
Copy link

mlbridge bot commented Dec 4, 2025

Webrevs

Copy link
Contributor

@kdnilsen kdnilsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@earthling-amzn earthling-amzn marked this pull request as draft December 5, 2025 15:41
@openjdk openjdk bot removed the rfr Pull request is ready for review label Dec 5, 2025

// Notifies the control thread, but does not update the requested cause or generation.
// The overloaded variant should be used when the _control_lock is already held.
void notify_cancellation(GCCause::Cause cause);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods were the root cause here. ShenandoahHeap::_canceled_gc is read/written atomically, but ShenandoahGenerationalControlThread::_requested_gc_cause is read/written under a lock. These notify_cancellation methods did not update _requested_gc_cause at all. So, in the failure I observed we had:

  1. Control thread finishes cycle and sees no cancellation is requested (no lock used).
  2. Mutator thread fails allocation, cancels GC (again, no lock used), and does not change _requested_gc_cause.
  3. Control thread takes _control_lock and checks _requested_gc_cause and sees _no_gc (because notify_cancellation didn't change it) and waits forever now.

The fix here is to replace notify_cancellation with notify_control_thread which serializes updates to _requested_gc_cause under _control_lock.

Copy link

@pengxiaolong pengxiaolong Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at the places where ShenandoahHeap::clear_cancelled_gc is called, I feel the problem is more likely from op_final_update_refs:

void ShenandoahConcurrentGC::op_final_update_refs() {
  ShenandoahHeap* const heap = ShenandoahHeap::heap();
   ... 
  ...
  // Clear cancelled GC, if set. On cancellation path, the block before would handle
  // everything.
  if (heap->cancelled_gc()) {
    heap->clear_cancelled_gc();
  }
  ...
  ...
}

Let's say there is concurrent GC running, right before the final update refs safepoint, there is mutator allocation failure:

  1. The mutator tries to cancel the the concurrent GC and notify controller thread.
  2. The mutator block itself at _alloc_failure_waiters_lock, claiming safepoint safe as well.
  3. concurrent GC enter the final update refs (VM operation)
  4. in final update refs, VMThread sees cancelled_gc and clear it.
  5. concurrent GC finishes, but cancelled_gc has been cleared so it won't notify the mutator.

The fix seems to work in generational mode, but may not work in non-generational mode.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I was staring at the code ShenandoahController::handle_alloc_failure today, I found there is discrepancy between ShenandoahGenerationalControlThread and ShenandoahControlThread, I created a bug to unify the behavior, we could fix the issue in ShenandoahControlThread there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scenario I described wasn't supposition, that is actually what happened in the debugger. The scenario you describe with op_final_update_refs would also be fixed by this PR. The _requested_gc_cause field should always be accessed under a lock. The code change here fixes an issue where an allocation failure might not set _requested_gc_cause at all.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I understand the fix will solve the issue for genshen and also fix scenario I described.
I'll solve the potential issue in non-generational Shenandoah in the PR to fix the behavior differences in Genshen and non-generational Shenandoah.

@earthling-amzn earthling-amzn marked this pull request as ready for review December 5, 2025 18:49
@openjdk openjdk bot added the rfr Pull request is ready for review label Dec 5, 2025
Copy link
Contributor

@kdnilsen kdnilsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for diligent testing and analysis. Subtle code here.

Copy link

@pengxiaolong pengxiaolong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the digging and fixing the issue.

Copy link
Member

@ysramakrishna ysramakrishna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning this up.

Did you review the non-generational ShenandoahControlThread and uses thereof to make sure a similar issues doesn't exist there?

As Xiaolong states, it might be worthwhile to do a refactor that shares as much as needed and no more, and to do so cleanly.

This looks good; sorry for the delay in reviewing.

🚢

Comment on lines 140 to 143
void notify_control_thread(GCCause::Cause cause, ShenandoahGeneration* generation);
void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause, ShenandoahGeneration* generation);

// Notifies the control thread, but does not update the requested cause or generation.
// The overloaded variant should be used when the _control_lock is already held.
void notify_cancellation(GCCause::Cause cause);
void notify_cancellation(MonitorLocker& ml, GCCause::Cause cause);
void notify_control_thread(GCCause::Cause cause);
void notify_control_thread(MonitorLocker& ml, GCCause::Cause cause);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

I'd (subjectively) order them thus: (nct = notify_control_thread)

  1. nct(cause)
  2. nct(ml, cause)
  3. nct(cause, generation)
  4. nct(ml, cause, generation)

For completeness in the documentation comment preceding, state that if an argument, cause or generation, is missing, it isn't updated.

I am assuming that there is a specific small subset of cause values where the generation isn't important to spell out and really implies "isn't necessary or is implicitly understood" for cancellation/request cause? Is there a call argument/consistency check that might be done in the nct:s where these bottom out to confirm this, or am I being unnecessarily paranoid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are two uses where we don't need the generation:

  • It's important to not update the generation for an allocation failure (degenerated cycle needs to use same generation)
  • We are shutting down the JVM and don't want to start another cycle.

All cases need to pass a GCCause.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Dec 13, 2025
@earthling-amzn
Copy link
Contributor Author

@ysramakrishna , @pengxiaolong - The non-generational control thread is less susceptible to this sort of issue because it has the responsibility of evaluating trigger conditions. It's loop therefore sleeps with a timed wait when the GC cycle is complete. If it misses a cancelled gc request, it will see it on the next iteration.

@earthling-amzn
Copy link
Contributor Author

/integrate

@openjdk
Copy link

openjdk bot commented Dec 15, 2025

Going to push as commit ea6493c.
Since your change was applied there have been 20 commits pushed to the master branch:

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Dec 15, 2025
@openjdk openjdk bot closed this Dec 15, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Dec 15, 2025
@openjdk
Copy link

openjdk bot commented Dec 15, 2025

@earthling-amzn Pushed as commit ea6493c.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-gc hotspot-gc-dev@openjdk.org integrated Pull request has been integrated shenandoah shenandoah-dev@openjdk.org

Development

Successfully merging this pull request may close these issues.

4 participants