nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0 #971

@zheng199512

Description

NVIDIA Open GPU Kernel Modules Version

[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

[root@A11-R42-I61-42-5504045 ~]# cat /etc/openeuler-release
openeuler release 2.0 (LTS-SP2)

Kernel Release

[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

B200

Describe the bug

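nvidia-smi hangs indefinitely after about 66 days 12 hours of uptime with driver 570.133.20 (OpenRM) on B200. From that point the kernel log repeats the following NVRM NVLink errors every 4 seconds: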

[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[root@A11-R42-I61-42-5504045 ~]#
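The ten lines above consist of the same two NVRM messages repeating every 4 seconds. For anyone triaging a longer log, a minimal sketch of how such output could be summarized is shown here. The function name `summarize_nvrm` and the regular expression are illustrative assumptions, not part of any NVIDIA tool; the sample text is the first four lines copied from the output above, and the timestamp format assumes `dmesg -T` in an English locale.

```python
import re
from collections import Counter
from datetime import datetime

# First four NVRM lines copied from the dmesg output in this report.
SAMPLE = """\
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
"""

# Hypothetical pattern for "[<dmesg -T timestamp>] NVRM: <function>: <message>".
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\] NVRM: (?P<func>\w+): (?P<msg>.*)$")


def summarize_nvrm(text):
    """Count NVRM messages per function and list gaps (in seconds) between bursts."""
    counts = Counter()
    stamps = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore prompts and non-NVRM lines
        counts[m["func"]] += 1
        ts = datetime.strptime(m["ts"], "%a %b %d %H:%M:%S %Y")
        if not stamps or stamps[-1] != ts:
            stamps.append(ts)  # keep one entry per distinct timestamp
    gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    return counts, stamps, gaps


counts, stamps, gaps = summarize_nvrm(SAMPLE)
print(dict(counts))
print("first message:", stamps[0], "gaps (s):", gaps)
```

Run against the full `dmesg -T` output rather than the sample, this would show whether the 4-second cadence holds and when the first failure was logged.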

[root@A11-R42-I61-42-5504045 ~]# uptime
22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
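The "~66 days 12 hours" figure can be cross-checked from the two timestamps in this report: the boot time from `last reboot` and the first NVRM error shown in `dmesg`. The sketch below makes two assumptions: the boot year is 2025 (`last` prints no year; it is taken from the dmesg lines), and the first line returned by `head -n 10` is the earliest failure still in the kernel log. `last` only has minute resolution, so the result is approximate.

```python
from datetime import datetime

# "last reboot": Tue Sep 16 16:38 (year assumed from the dmesg timestamps).
boot = datetime(2025, 9, 16, 16, 38)
# First NVRM error line in the dmesg output.
first_error = datetime(2025, 11, 22, 5, 8, 50)

elapsed = first_error - boot
hours, rem = divmod(elapsed.seconds, 3600)
minutes = rem // 60

print(f"{elapsed.days} days {hours} h {minutes} min")  # 66 days 12 h 30 min
```

By the same arithmetic, the `uptime` reading of 67 days, 6:11 was taken at about 22:50 on Nov 22, roughly 17 hours 40 minutes after the first logged error.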

To Reproduce

Run driver 570.133.20 (OpenRM) on a B200 with kernel 6.6.0 and keep the host up without rebooting. After about 66 days 12 hours of uptime, nvidia-smi hangs indefinitely.
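Because the command never returns once the problem starts, a health check that calls nvidia-smi directly would block as well. A minimal sketch of a bounded probe follows. The helper name `probe` and the 30-second default are hypothetical choices, not something the driver or this report prescribes. The code deliberately does not wait for the child after `kill()`, on the assumption that a process blocked inside the kernel may not exit promptly.

```python
import subprocess


def probe(cmd, timeout_s=30.0):
    """Return 'ok', 'failed', 'hung', or 'missing' without blocking past timeout_s."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return "missing"  # the binary is not installed or not on PATH
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait again, since a task stuck in
        # uninterruptible sleep may never be reaped.
        proc.kill()
        return "hung"
    return "ok" if rc == 0 else "failed"
```

A monitoring job could call `probe(["nvidia-smi"])` periodically and alert on `"hung"`, which would also record how close to the 66-day mark the hang first appears on other hosts.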

Bug Incidence

Once

nvidia-bug-report.log.gz

no

More Info

No response

Metadata

    Labels

    NV-Triaged: An NVBug has been created for dev to investigate
    bug: Something isn't working
