Description
NVIDIA Open GPU Kernel Modules Version
590.44.01 and 580.105.08, both from NVIDIA's official CUDA rhel10 repo
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Rocky Linux 10.1 (Red Quartz)
Kernel Release
6.12.0-124.13.1.el10_1.x86_64 (PREEMPT_DYNAMIC)
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 5080
Describe the bug
RTX 5080 connected via a Thunderbolt 5 eGPU enclosure (Sonnet Breakaway Box 850T5) initializes correctly and is visible to nvidia-smi at idle. Any CUDA operation causes an immediate system hard lock requiring a power cycle. No kernel panic, no Xid error logged, no SysRq response.
This appears related to closed issue #900 (RTX 5090 via OCuLink), which showed identical symptoms: GPU functional at idle, crash under load, GSP firmware bootstrap errors. That issue was resolved by switching OCuLink docks, but dock alternatives for Thunderbolt 5 are extremely limited.
The PCIe link negotiates correctly at 16 GT/s x4 (PCIe 4.0 x4), optimal for the Thunderbolt 4 host controller's bandwidth. BAR allocation succeeds with the hotplug resource reservation parameters. The driver loads and initializes without error.
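For anyone reproducing this, the negotiated link state can be double-checked straight from sysfs, independent of nvidia-smi. A minimal sketch; the BDF below is a placeholder, not from this system (find the real one with lspci | grep -i nvidia):

```python
# Sketch: read the negotiated PCIe link speed/width for the eGPU from sysfs.
from pathlib import Path

BDF = "0000:3f:00.0"  # placeholder: substitute the RTX 5080's actual address
dev = Path("/sys/bus/pci/devices") / BDF

for attr in ("current_link_speed", "current_link_width",
             "max_link_speed", "max_link_width"):
    path = dev / attr
    value = path.read_text().strip() if path.exists() else "n/a"
    print(f"{attr}: {value}")
```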
Minimal reproducer:
nvidia-smi # Works - GPU visible, ~2W idle
python3 -c "import torch; torch.zeros(1, device='cuda')" # Hard lock
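To isolate which call actually locks the machine (driver-level query, CUDA context creation, or the first allocation), a stepwise probe like the sketch below can be run over SSH; the last line printed before the lock identifies the culprit. It assumes the same PyTorch-with-CUDA install as the one-liner above:

```python
# Sketch: step through CUDA initialization with flushed prints so the
# last visible line identifies the call that hard-locks the system.
import torch

def step(label, fn):
    print(f"-> {label}", flush=True)
    print(f"   ok: {fn()}", flush=True)

step("torch.cuda.is_available()", torch.cuda.is_available)  # driver-level query
step("torch.cuda.device_count()", torch.cuda.device_count)
step("torch.cuda.init()", torch.cuda.init)                  # full context creation
step("torch.zeros(1, device='cuda')", lambda: torch.zeros(1, device="cuda"))
step("torch.cuda.synchronize()", torch.cuda.synchronize)    # forces completion
```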
Hardware:
- GPU: NVIDIA GeForce RTX 5080 (GB203)
- eGPU Enclosure: Sonnet Breakaway Box 850T5 (Thunderbolt 5)
- Host: Lenovo ThinkPad X1 Carbon Gen 11
- Thunderbolt Controller: Intel Raptor Lake-P Thunderbolt 4 (host), USB4/TB5 (enclosure)
- OS: Rocky Linux 10.1
Required kernel parameters: pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc
Without pcie_ports=native, the GPU enters D3cold and the driver fails with "Unable to change power state from D3cold to D0".
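Since the failure mode without pcie_ports=native is a runtime-PM transition, the runtime PM state of the GPU and its upstream port can be inspected (and, as a diagnostic, pinned in D0) from sysfs. A sketch; the BDF is again a placeholder and the write requires root:

```python
# Sketch: inspect runtime PM for the GPU and its upstream bridge; optionally
# pin them in D0 by disabling runtime suspend (requires root).
from pathlib import Path

BDF = "0000:3f:00.0"  # placeholder: substitute the RTX 5080's actual address
dev = Path("/sys/bus/pci/devices") / BDF

for node in (dev, dev.resolve().parent):  # the device, then its upstream port
    rs = node / "power" / "runtime_status"
    if not rs.exists():
        continue
    print(f"{node.name}: runtime_status={rs.read_text().strip()}")
    # Uncomment to forbid runtime suspend on this node:
    # (node / "power" / "control").write_text("on")
```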
To Reproduce
- Connect RTX 5080 to Sonnet Breakaway Box 850T5 (Thunderbolt 5 eGPU enclosure)
- Connect enclosure to host via Thunderbolt cable
- Boot system with kernel parameters:
pcie_ports=native pcie_aspm=off pcie_port_pm=off pci=assign-busses,realloc
- Confirm GPU detected: nvidia-smi (shows GPU at idle, ~2W, ~30°C)
- Run any CUDA operation: python3 -c "import torch; torch.zeros(1, device='cuda')"
- System hard-locks immediately. No kernel panic, no SysRq response. Power cycle required. Kernel messages can be mirrored to disk beforehand, as sketched below.
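Because the lock leaves nothing behind in the journal, any kernel messages emitted just before it can be caught by mirroring dmesg to disk with an fsync per line from a second terminal (run as root, then trigger the CUDA op). A sketch, assuming util-linux dmesg:

```python
# Sketch: stream kernel messages to disk, syncing each line so whatever
# is logged just before the hard lock survives the forced power cycle.
import os
import subprocess

with open("/var/tmp/dmesg-before-lock.log", "a") as out:
    proc = subprocess.Popen(["dmesg", "--follow"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        out.write(line)
        out.flush()
        os.fsync(out.fileno())  # force each line to stable storage
```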
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Also reported at https://forums.developer.nvidia.com/t/590-release-feedback-discussion/353310/53, cross-posted to https://forums.developer.nvidia.com/t/580-release-feedback-discussion/341205/898, and linux-bugs