Modernize NVHPC CI job (to make it working again): Ubuntu-24.04 runner, NVHPC 25.11 #5935
Merged
+6
−6
Add explicit timeouts to the busy-wait coordination loops in the Per-Subinterpreter GIL test in tests/test_with_catch/test_subinterpreter.cpp. Previously those loops spun indefinitely waiting for shared atomics like `started` and `sync` to change, which is fine when CPython's free-threading and per-interpreter GIL behavior matches the test's expectations but becomes pathologically bad when that behavior regresses: the `test_with_catch` executable can then hang forever, causing our 3.14t CI jobs to time out after 90 minutes. This change keeps the structure and intent of the test but adds a std::chrono::steady_clock deadline to each of the coordination loops, using a conservative 10-second bound. Worker threads record a failure and return if they hit the timeout, while the main thread fails the test via Catch2 instead of hanging. That way, if future CPython free-threading patches change the semantics again, the test will fail quickly and produce a diagnosable error instead of wedging the CI job.
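For illustration only, the deadline pattern described in this commit message could look roughly like the sketch below. This is not the code from test_subinterpreter.cpp: the helper names `spin_until_set`, `worker`, and `run_coordination` are hypothetical, the shared atomics are assumed to be plain `std::atomic<bool>` flags, and the main thread returns a bool where the real test would fail through Catch2.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not from pybind11): busy-wait until `flag` becomes true
// or `timeout` elapses. Returns true if the flag was set, false on timeout.
inline bool spin_until_set(const std::atomic<bool> &flag,
                           std::chrono::steady_clock::duration timeout
                           = std::chrono::seconds(10)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!flag.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // give up instead of spinning forever
        }
        std::this_thread::yield(); // still a busy-wait, but yields the CPU between polls
    }
    return true;
}

// Worker side: signal `started`, then wait for `sync`. On timeout, record the
// failure and return rather than hanging the whole test binary.
inline void worker(std::atomic<bool> &started,
                   const std::atomic<bool> &sync,
                   std::atomic<bool> &failed,
                   std::chrono::steady_clock::duration timeout) {
    started.store(true, std::memory_order_release);
    if (!spin_until_set(sync, timeout)) {
        failed.store(true, std::memory_order_release);
        return;
    }
    // ... the real test body would run here ...
}

// Main-thread side: returns true only if both waits finished within the bound.
// `release_worker` exists purely so the timeout path can be exercised.
inline bool run_coordination(bool release_worker,
                             std::chrono::steady_clock::duration timeout) {
    std::atomic<bool> started{false}, sync{false}, failed{false};
    std::thread t(worker, std::ref(started), std::cref(sync), std::ref(failed), timeout);
    const bool saw_start = spin_until_set(started, timeout);
    if (release_worker) {
        sync.store(true, std::memory_order_release);
    }
    t.join(); // bounded: the worker returns after at most `timeout`
    return saw_start && !failed.load(std::memory_order_acquire);
}
```

Calling `run_coordination(true, std::chrono::seconds(10))` exercises the normal path, while `run_coordination(false, std::chrono::milliseconds(50))` never releases the worker and so takes the timeout path. `steady_clock` is used because it is monotonic, so the deadline is unaffected by wall-clock adjustments on the runner.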
This reverts commit 7847ada.
Introduce a custom Catch2 reporter for tests/test_with_catch that prints a simple one-line status for each test case as it starts and ends, and wire the cpptest CMake target to invoke test_with_catch with -r progress. This makes it much easier to see where the embedded/interpreter test binary is spending its time in CI logs, and in particular to pinpoint which test case is stuck when the free-threading builds hang. Compared to adding ad hoc timeouts around potentially infinite busy-wait loops in individual tests, a progress reporter is a more general and robust approach: it gives visibility into all tests (including future ones) without changing their behavior, and turns otherwise opaque 90-minute timeouts into locatable issues in the Catch output.
Print the CPython version once at the start of the Catch-based interpreter tests using Py_GetVersion(). This makes it trivial to confirm which free-threaded build a failing run is using when inspecting CI or local logs.
This reverts commit ad3e1c3.
Update the standard-small and standard-large GitHub Actions jobs to request python-version 3.14.0t instead of 3.14t. This forces setup-python to use the last-known-good 3.14.0 free-threaded build rather than the newer 3.14.1+ builds where subinterpreter finalization regressed.
This reverts commit 5281e1c.
This reverts commit ed11292.
This reverts commit 0fe6a42.
This reverts commit 60ae0e8.
Update the standard-small and standard-large GitHub Actions jobs to request python-version 3.14.0t instead of 3.14t. This forces setup-python to use the last-known-good 3.14.0 free-threaded build rather than the newer 3.14.1+ builds where subinterpreter finalization regressed.
…e all jobs except ubuntu-nvhpc7
I'll go ahead and merge this. In combination with #5934, our CI will be green again.
Description
This PR updates the NVHPC CI job to run on a newer Ubuntu image and the latest NVIDIA HPC SDK release, in order to fix a persistent "no space left on device" failure, while keeping the rest of the CI configuration unchanged.
Concretely:
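- Switch the NVHPC job's runner from `ubuntu-22.04` to `ubuntu-24.04`.
- Upgrade the NVHPC package from `nvhpc-23-5` to `nvhpc-25-11` and load the matching module file.
- Build with all available cores (`-j $(nproc)`).
Original failure (ubuntu-22.04 + NVHPC 23.5)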
On the existing CI configuration, the NVHPC job (`ubuntu-nvhpc7`) was running on `ubuntu-22.04` and using `nvhpc-23-5`. Starting in early December 2025, these jobs began to fail in a way that was clearly environmental rather than code-related:
- Failure during `nvc++` compilation of the C++ test suite.
- Compiler error: "catastrophic error: error while writing intermediate language file: No space left on device"
- Subsequent `gmake` / `cmake --build` errors and job termination.
- Runner warning: "You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 34 MB."
The failure was not specific to `pybind11` or to a particular test. Instead, it was a combination of factors, including the `ubuntu-22.04` runner image, whose available free disk space shrank with newer image releases.
Other `ubuntu-22.04` jobs in this repository did not hit the same limit, which is consistent with the NVHPC job simply being the heaviest consumer of disk space.
First experiment: change runner only
To separate runner effects from NVHPC version effects, the first experiment was to change only the runner image for the NVHPC job:
- Change `runs-on: ubuntu-22.04` → `runs-on: ubuntu-24.04`.
- Keep `nvhpc-23-5` and all CMake/test settings exactly the same.
Result: `ubuntu-24.04` with the old NVHPC package was not sufficient to get the job working again.
Second experiment: newer NVHPC on ubuntu-24.04
The next step was to also upgrade NVHPC to the latest release and adjust the module path accordingly:
- `nvhpc-23-5` → `nvhpc-25-11` in the `apt-get install` step.
- `module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/23.5` → `module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/25.11`.
- The `cmake --build` invocation was also updated to use all available cores (`-j $(nproc)`) instead of a fixed `-j 2`.
With that, the NVHPC job completed successfully.
Suggested changelog entry: