Description
NVIDIA Open GPU Kernel Modules Version
590.44.01 and 580.105.08, both from NVIDIA's official CUDA rhel10 repo
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Rocky Linux 10.1 (Red Quartz)
Kernel Release
6.12.0-124.13.1.el10_1.x86_64 (PREEMPT_DYNAMIC)
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 5080
Describe the bug
RTX 5080 connected via a Thunderbolt 5 eGPU enclosure (Sonnet Breakaway Box 850T5) initializes correctly and is visible to nvidia-smi at idle. Any CUDA operation causes an immediate system hard lock requiring a power cycle. No kernel panic, no Xid error logged, no SysRq response.
This appears related to closed issue #900 (RTX 5090 via OCuLink), which showed identical symptoms: GPU functional at idle, crash under load, GSP firmware bootstrap errors. That issue was resolved by switching OCuLink docks, but dock alternatives for Thunderbolt 5 are extremely limited.
The PCIe link negotiates correctly at 16 GT/s x4 (PCIe 4.0 x4), optimal for the Thunderbolt 4 host controller's bandwidth. BAR allocation succeeds with the hotplug resource reservation parameters. The driver loads and initializes without error.
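For anyone reproducing this, the negotiated link state can be double-checked straight from sysfs, independent of nvidia-smi. A minimal sketch; the BDF below is a placeholder, not from this system (find the real one with lspci | grep -i nvidia):

```python
# Sketch: read the negotiated PCIe link speed/width for the eGPU from sysfs.
from pathlib import Path

BDF = "0000:3f:00.0"  # placeholder: substitute the RTX 5080's actual address
dev = Path("/sys/bus/pci/devices") / BDF

for attr in ("current_link_speed", "current_link_width",
             "max_link_speed", "max_link_width"):
    path = dev / attr
    value = path.read_text().strip() if path.exists() else "n/a"
    print(f"{attr}: {value}")
```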
Minimal reproducer:
nvidia-smi # Works - GPU visible, ~2W idle
python3 -c "import torch; torch.zeros(1, device='cuda')" # Hard lock
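To isolate which call actually locks the machine (driver-level query, CUDA context creation, or the first allocation), a stepwise probe like the sketch below can be run over SSH; the last line printed before the lock identifies the culprit. It assumes the same PyTorch-with-CUDA install as the one-liner above:

```python
# Sketch: step through CUDA initialization with flushed prints so the
# last visible line identifies the call that hard-locks the system.
import torch

def step(label, fn):
    print(f"-> {label}", flush=True)
    print(f"   ok: {fn()}", flush=True)

step("torch.cuda.is_available()", torch.cuda.is_available)  # driver-level query
step("torch.cuda.device_count()", torch.cuda.device_count)
step("torch.cuda.init()", torch.cuda.init)                  # full context creation
step("torch.zeros(1, device='cuda')", lambda: torch.zeros(1, device="cuda"))
step("torch.cuda.synchronize()", torch.cuda.synchronize)    # forces completion
```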
Hardware:
- GPU: NVIDIA GeForce RTX 5080 (GB203)
- eGPU Enclosure: Sonnet Breakaway Box 850T5 (Thunderbolt 5)
- Host: Lenovo ThinkPad X1 Carbon Gen 11
- Thunderbolt Controller: Intel Raptor Lake-P Thunderbolt 4 (host), USB4/TB5 (enclosure)
- OS: Rocky Linux 10.1
Required kernel parameters: pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc
Without pcie_ports=native, the GPU enters D3cold and the driver fails with "Unable to change power state from D3cold to D0".
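Since the failure mode without pcie_ports=native is a runtime-PM transition, the runtime PM state of the GPU and its upstream port can be inspected (and, as a diagnostic, pinned in D0) from sysfs. A sketch; the BDF is again a placeholder and the write requires root:

```python
# Sketch: inspect runtime PM for the GPU and its upstream bridge; optionally
# pin them in D0 by disabling runtime suspend (requires root).
from pathlib import Path

BDF = "0000:3f:00.0"  # placeholder: substitute the RTX 5080's actual address
dev = Path("/sys/bus/pci/devices") / BDF

for node in (dev, dev.resolve().parent):  # the device, then its upstream port
    rs = node / "power" / "runtime_status"
    if not rs.exists():
        continue
    print(f"{node.name}: runtime_status={rs.read_text().strip()}")
    # Uncomment to forbid runtime suspend on this node:
    # (node / "power" / "control").write_text("on")
```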
To Reproduce
- Connect RTX 5080 to Sonnet Breakaway Box 850T5 (Thunderbolt 5 eGPU enclosure)
- Connect enclosure to host via Thunderbolt cable
- Boot system with kernel parameters:
pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc
- Confirm GPU detected: nvidia-smi (shows GPU at idle, ~2W, ~30°C)
- Run any CUDA operation: python3 -c "import torch; torch.zeros(1, device='cuda')"
- System hard-locks immediately. No kernel panic, no SysRq response. Power cycle required. Kernel messages can be mirrored to disk beforehand, as sketched below.
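Because the lock leaves nothing behind in the journal, any kernel messages emitted just before it can be caught by mirroring dmesg to disk with an fsync per line from a second terminal (run as root, then trigger the CUDA op). A sketch, assuming util-linux dmesg:

```python
# Sketch: stream kernel messages to disk, syncing each line so whatever
# is logged just before the hard lock survives the forced power cycle.
import os
import subprocess

with open("/var/tmp/dmesg-before-lock.log", "a") as out:
    proc = subprocess.Popen(["dmesg", "--follow"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        out.write(line)
        out.flush()
        os.fsync(out.fileno())  # force each line to stable storage
```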
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Also reported at https://forums.developer.nvidia.com/t/590-release-feedback-discussion/353310/53, cross-posted to https://forums.developer.nvidia.com/t/580-release-feedback-discussion/341205/898, and linux-bugs