Description
NVIDIA Open GPU Kernel Modules Version
[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
[root@A11-R42-I61-42-5504045 ~]# cat /etc/openeuler-release
openeuler release 2.0 (LTS-SP2)
Kernel Release
[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
B200
Describe the bug
nvidia-smi hangs indefinitely after about 66 days 12 hours of uptime with driver 570.133.20 (OpenRM) on a B200. The first NVRM messages in dmesg are repeated NVLink Rx Detect link-mask failures:
[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[root@A11-R42-I61-42-5504045 ~]#
[root@A11-R42-I61-42-5504045 ~]# uptime
22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
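The "~66 days 12 hours" figure can be cross-checked against the logs above. The sketch below subtracts the boot time reported by `last reboot` from the timestamp of the first NVRM line in the dmesg excerpt. It assumes the year 2025 for the boot entry (which `last` does not print), that both timestamps are in the same time zone, and that the first line shown by `head` really is the earliest error; `dmesg -T` timestamps can also drift, so treat the result as approximate.

```python
from datetime import datetime

# Values copied from the report above.
# `last reboot` prints no year and only minute resolution; 2025 is assumed.
boot = datetime(2025, 9, 16, 16, 38)            # "Tue Sep 16 16:38"
first_error = datetime(2025, 11, 22, 5, 8, 50)  # first NVRM line in dmesg

elapsed = first_error - boot
hours, remainder = divmod(elapsed.seconds, 3600)
minutes = remainder // 60

print(f"{elapsed.days} days, {hours} h {minutes} min after boot")
```

With these inputs the gap comes out to 66 days, 12 h 30 min, which is consistent with the uptime quoted in the description.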
To Reproduce
Run driver 570.133.20 (OpenRM) on a B200 with kernel 6.6.0 for about 66 days 12 hours of uptime; nvidia-smi then hangs indefinitely.
Bug Incidence
Once
nvidia-bug-report.log.gz
no
More Info
No response