Collaborator

@rwgk rwgk commented Dec 14, 2025

Description

This PR updates the NVHPC CI job to run on a newer Ubuntu image and the latest NVIDIA HPC SDK release, in order to fix a persistent "no space left on device" failure, while keeping the rest of the CI configuration unchanged.

Concretely:

  • Switch the NVHPC job from ubuntu-22.04 to ubuntu-24.04.
  • Bump the NVHPC package from nvhpc-23-5 to nvhpc-25-11 and load the matching module file.
  • Keep the NVHPC job configuration (CMake arguments, test filter, etc.) otherwise identical to the previous setup, with only a minor build parallelism tweak (-j $(nproc)).

Original failure (ubuntu-22.04 + NVHPC 23.5)

On the existing CI configuration, the NVHPC job (ubuntu-nvhpc7) was running on ubuntu-22.04 and using nvhpc-23-5. Starting in early December 2025, these jobs began to fail in a way that was clearly environmental rather than code-related:

  • The job progressed into the CMake/nvc++ compilation of the C++ test suite.
  • Compilation failed with errors of the form:
    • catastrophic error: error while writing intermediate language file: No space left on device
    • Followed by gmake/cmake --build errors and job termination.
  • In some runs, the GitHub Actions runner itself also reported warnings like:
    • You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 34 MB.

The failure was not specific to pybind11 or to a particular test. Instead, it was the combination of:

  • The ubuntu-22.04 runner image, whose available free disk shrank with newer image releases.
  • The relatively heavy disk footprint of the NVHPC toolchain and its intermediate IR/object files when compiling the full C++ test suite.

Other ubuntu-22.04 jobs in this repository did not hit the same limit, which is consistent with the NVHPC job simply being the heaviest consumer of disk space.

First experiment: change runner only

To separate runner effects from NVHPC version effects, the first experiment was to change only the runner image for the NVHPC job:

  • runs-on: ubuntu-22.04 → runs-on: ubuntu-24.04.
  • Keep nvhpc-23-5 and all CMake/test settings exactly the same.

Result:

  • The job failed very quickly, but still with a disk-related failure rather than a compiler bug or test regression.
  • This confirmed that simply moving to ubuntu-24.04 with the old NVHPC package was not sufficient to get the job working again.

Second experiment: newer NVHPC on ubuntu-24.04

The next step was to also upgrade NVHPC to the latest release and adjust the module path accordingly:

  • nvhpc-23-5 → nvhpc-25-11 in the apt-get install step.
  • module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/23.5 → module load /opt/nvidia/hpc_sdk/modulefiles/nvhpc/25.11.
  • While touching the job, the cmake --build invocation was also updated to use all available cores (-j $(nproc)) instead of a fixed -j 2.

With these changes, the NVHPC job completed successfully.

Suggested changelog entry:

  • n/a

rwgk added 18 commits December 13, 2025 17:05
Add explicit timeouts to the busy-wait coordination loops in the
Per-Subinterpreter GIL test in tests/test_with_catch/test_subinterpreter.cpp.
Previously those loops spun indefinitely waiting for shared atomics like
`started` and `sync` to change, which is fine when CPython's free-threading
and per-interpreter GIL behavior matches the test's expectations but becomes
pathologically bad when that behavior regresses: the `test_with_catch`
executable can then hang forever, causing our 3.14t CI jobs to time out
after 90 minutes.

This change keeps the structure and intent of the test but adds a
std::chrono::steady_clock deadline to each of the coordination loops,
using a conservative 10 second bound. Worker threads record a failure and
return if they hit the timeout, while the main thread fails the test via
Catch2 instead of hanging. That way, if future CPython free-threading
patches change the semantics again, the test will fail quickly and
produce a diagnosable error instead of wedging the CI job.
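
The deadline pattern described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in tests/test_with_catch/test_subinterpreter.cpp: the helper name wait_for_flag and its signature are hypothetical, and the 10 second default only mirrors the bound mentioned above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of the pybind11 test suite): spin until
// `flag` becomes true or `timeout` elapses. Returns true if the flag was
// observed, false if the deadline passed first.
inline bool wait_for_flag(const std::atomic<bool> &flag,
                          std::chrono::steady_clock::duration timeout
                          = std::chrono::seconds(10)) {
    // steady_clock is monotonic, so wall-clock adjustments cannot
    // shorten or extend the wait.
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!flag.load(std::memory_order_acquire)) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // caller reports a failure instead of hanging
        }
        std::this_thread::yield(); // let the other thread make progress
    }
    return true;
}
```

A worker thread would record a failure and return when such a helper yields false, while the main thread would turn the same result into a Catch2 test failure.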

Introduce a custom Catch2 reporter for tests/test_with_catch that prints a
simple one-line status for each test case as it starts and ends, and wire the
cpptest CMake target to invoke test_with_catch with -r progress. This makes
it much easier to see where the embedded/interpreter test binary is spending
its time in CI logs, and in particular to pinpoint which test case is stuck
when the free-threading builds hang.

Compared to adding ad hoc timeouts around potentially infinite busy-wait
loops in individual tests, a progress reporter is a more general and robust
approach: it gives visibility into all tests (including future ones) without
changing their behavior, and turns otherwise opaque 90-minute timeouts into
locatable issues in the Catch output.

Print the CPython version once at the start of the Catch-based
interpreter tests using Py_GetVersion(). This makes it trivial to
confirm which free-threaded build a failing run is using when
inspecting CI or local logs.

Update the standard-small and standard-large GitHub Actions jobs to
request python-version 3.14.0t instead of 3.14t. This forces setup-python
to use the last-known-good 3.14.0 free-threaded build rather than the
newer 3.14.1+ builds where subinterpreter finalization regressed.
@rwgk rwgk requested a review from henryiii as a code owner December 14, 2025 16:59
@rwgk rwgk changed the title from "WIP: Modernization of NVHPC CI job" to "Modernize NVHPC CI job (to make it working again): Ubuntu-24.04 runner, NVHPC 25.11" Dec 15, 2025
Collaborator Author

rwgk commented Dec 15, 2025

I'll go ahead and merge this. In combination with #5934, our CI will be green again.

@rwgk rwgk merged commit d4f9cfb into pybind:master Dec 15, 2025
84 of 87 checks passed
@rwgk rwgk deleted the nvhpc_modernization branch December 15, 2025 03:01
@github-actions github-actions bot added the needs changelog Possibly needs a changelog entry label Dec 15, 2025
@rwgk rwgk removed the needs changelog Possibly needs a changelog entry label Dec 15, 2025