Description
NVIDIA Open GPU Kernel Modules Version
580.95.05
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 24.04.3 LTS
The issue does not occur with Ubuntu 22.04 LTS.
Kernel Release
6.8.0-87-generic
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 4060
GPU 1: NVIDIA GeForce RTX 4060
Describe the bug
We are seeing a deadlock on two servers. It does not happen every time, but it occurs multiple times a week after the daily reboots. When starting the following compose file (cut down to the essential parts):
services:
  vllm-server-gpu0:
    image: vllm/vllm-openai:v0.11.0
    runtime: nvidia
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ
      --served-model-name Qwen/Qwen2.5-VL-3B-Instruct
      --gpu-memory-utilization 0.90
      --max_model_len 1600
      --tensor-parallel-size 1
      --load-format safetensors
      --enable-log-requests
      --enable-log-outputs
  vllm-server-gpu1:
    image: vllm/vllm-openai:v0.11.0
    runtime: nvidia
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: >
      --model google/gemma-3-1b-it-qat-int4-unquantized
      --served-model-name google/gemma-3-1b-it
      --gpu-memory-utilization 0.90
      --max_model_len 32768
      --tensor-parallel-size 1
      --load-format safetensors
      --enable-log-requests
      --enable-log-outputs

then the system completely stalls (not even keyboard input is possible any more).
The evidence that this is a deadlock:
Process 1: nv_open_q (trying to OPEN/initialize GPU)
task:nv_open_q state:D stack:0 pid:1014
Call Trace:
rwsem_down_write_slowpath+0x27e/0x550 ← Waiting for write lock
down_write+0x5c/0x80
os_acquire_rwlock_write+0x3c/0x70 ← NVIDIA trying to acquire lock
portSyncRwLockAcquireWrite+0x10/0x40
rmapiLockAcquire+0x294/0x360
kgspInitRm_IMPL+0xcad/0x1680 ← Initializing GPU
RmInitAdapter+0xff2/0x1e40 ← Opening adapter
rm_init_adapter+0xad/0xc0
nv_open_device+0x222/0xa80
Process 2: python3 (trying to CLOSE/shutdown GPU)
task:python3 state:D stack:0 pid:654119
Call Trace:
__down+0x1d/0x30
down+0x54/0x80
console_lock+0x25/0x70 ← Trying to acquire console lock
os_disable_console_access+0xe/0x20
RmShutdownAdapter+0x18d/0x3b0 ← Shutting down adapter
rm_shutdown_adapter+0x58/0x60
nv_shutdown_adapter+0xae/0x1d0
nv_close_device+0x132/0x180 ← Closing device
nvidia_close+0xf7/0x280
Why It's a Deadlock:
- nv_open_q is blocked at rwsem_down_write_slowpath, trying to get a write lock to initialize the GPU adapter
- python3 is blocked at console_lock, trying to disable console access during GPU shutdown
- Both are in state D (uninterruptible sleep) - they cannot be killed or interrupted
- They're waiting on resources held by each other or by the NVIDIA driver's internal lock management
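For completeness, a sketch of how these blocked-task traces can be retrieved. The full traces in the More Info section come from the kernel's hung task detector; the sysrq variant only works if some shell is still reachable during the hang, and assumes sysrq is not restricted and persistent journaling is enabled:

# Pull the hung-task traces of the previous boot out of the persistent journal
journalctl -k -b -1 | grep -B 2 -A 40 "blocked for more than"

# Or, while the hang is ongoing and a shell is still usable, dump all D-state tasks on demand
echo 1 | sudo tee /proc/sys/kernel/sysrq    # enable all sysrq functions
echo w | sudo tee /proc/sysrq-trigger       # "show blocked tasks"; traces land in the kernel log
sudo dmesg | grep -A 40 "state:D"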
To Reproduce
Start the compose file, or reboot the system (which also starts the two containers in parallel).
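Concretely (assuming the compose snippet above is saved as docker-compose.yml and the Docker Compose v2 CLI is in use), reproduction amounts to:

# Start both GPU containers in parallel
docker compose up -d

# ...or simply reboot; the daily reboot brings both containers up concurrently as well
sudo reboot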
Bug Incidence
Sometimes
nvidia-bug-report.log.gz
I cannot run anything after the error happens
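For context, the attached log is the output of NVIDIA's standard collection script; because the machine is unresponsive during the hang, it can only be generated after the system has come back up:

# Writes nvidia-bug-report.log.gz into the current directory
sudo nvidia-bug-report.sh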
More Info
Trace 1:
Nov 19 01:03:26.198647 myserver kernel: task:nv_open_q state:D stack:0 pid:1014 tgid:1014 ppid:2 flags:0x00004000
Nov 19 01:03:26.198686 myserver kernel: Call Trace:
Nov 19 01:03:26.198721 myserver kernel: <TASK>
Nov 19 01:03:26.198752 myserver kernel: __schedule+0x27c/0x6b0
Nov 19 01:03:26.198836 myserver kernel: schedule+0x33/0x110
Nov 19 01:03:26.198878 myserver kernel: schedule_preempt_disabled+0x15/0x30
Nov 19 01:03:26.198916 myserver kernel: rwsem_down_write_slowpath+0x27e/0x550
Nov 19 01:03:26.198948 myserver kernel: down_write+0x5c/0x80
Nov 19 01:03:26.198978 myserver kernel: os_acquire_rwlock_write+0x3c/0x70 [nvidia]
Nov 19 01:03:26.199004 myserver kernel: portSyncRwLockAcquireWrite+0x10/0x40 [nvidia]
Nov 19 01:03:26.199034 myserver kernel: rmapiLockAcquire+0x294/0x360 [nvidia]
Nov 19 01:03:26.200733 myserver kernel: kgspInitRm_IMPL+0xcad/0x1680 [nvidia]
Nov 19 01:03:26.200867 myserver kernel: ? down+0x36/0x80
Nov 19 01:03:26.200920 myserver kernel: RmInitAdapter+0xff2/0x1e40 [nvidia]
Nov 19 01:03:26.200952 myserver kernel: ? _raw_spin_lock_irqsave+0xe/0x20
Nov 19 01:03:26.200982 myserver kernel: ? up+0x58/0xa0
Nov 19 01:03:26.201683 myserver kernel: rm_init_adapter+0xad/0xc0 [nvidia]
Nov 19 01:03:26.202101 myserver kernel: nv_open_device+0x222/0xa80 [nvidia]
Nov 19 01:03:26.202924 myserver kernel: nvidia_open_deferred+0x39/0xf0 [nvidia]
Nov 19 01:03:26.202966 myserver kernel: _main_loop+0x7f/0x140 [nvidia]
Nov 19 01:03:26.202975 myserver kernel: ? __pfx__main_loop+0x10/0x10 [nvidia]
Nov 19 01:03:26.202983 myserver kernel: kthread+0xef/0x120
Nov 19 01:03:26.202990 myserver kernel: ? __pfx_kthread+0x10/0x10
Nov 19 01:03:26.202996 myserver kernel: ret_from_fork+0x44/0x70
Nov 19 01:03:26.203003 myserver kernel: ? __pfx_kthread+0x10/0x10
Nov 19 01:03:26.203019 myserver kernel: ret_from_fork_asm+0x1b/0x30
Nov 19 01:03:26.203026 myserver kernel: </TASK>
Nov 19 01:03:26.203033 myserver kernel: INFO: task python3:654119 blocked for more than 122 seconds.
Nov 19 01:03:26.203041 myserver kernel: Tainted: G O 6.8.0-87-generic #88-Ubuntu
Nov 19 01:03:26.203047 myserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 19 01:03:26.203053 myserver kernel: task:python3 state:D stack:0 pid:654119 tgid:654119 ppid:653435 flags:0x00004002
Trace 2:
Nov 19 01:03:26.203053 myserver kernel: task:python3 state:D stack:0 pid:654119 tgid:654119 ppid:653435 flags:0x00004002
Nov 19 01:03:26.203059 myserver kernel: Call Trace:
Nov 19 01:03:26.203065 myserver kernel: <TASK>
Nov 19 01:03:26.203070 myserver kernel: __schedule+0x27c/0x6b0
Nov 19 01:03:26.203076 myserver kernel: schedule+0x33/0x110
Nov 19 01:03:26.203081 myserver kernel: schedule_timeout+0x157/0x170
Nov 19 01:03:26.203088 myserver kernel: ___down_common+0xfd/0x160
Nov 19 01:03:26.203097 myserver kernel: __down_common+0x22/0xd0
Nov 19 01:03:26.203103 myserver kernel: __down+0x1d/0x30
Nov 19 01:03:26.203111 myserver kernel: down+0x54/0x80
Nov 19 01:03:26.203119 myserver kernel: console_lock+0x25/0x70
Nov 19 01:03:26.203126 myserver kernel: os_disable_console_access+0xe/0x20 [nvidia]
Nov 19 01:03:26.203133 myserver kernel: RmShutdownAdapter+0x18d/0x3b0 [nvidia]
Nov 19 01:03:26.203909 myserver kernel: rm_shutdown_adapter+0x58/0x60 [nvidia]
Nov 19 01:03:26.204004 myserver kernel: nv_shutdown_adapter+0xae/0x1d0 [nvidia]
Nov 19 01:03:26.204012 myserver kernel: nv_close_device+0x132/0x180 [nvidia]
Nov 19 01:03:26.204031 myserver kernel: nvidia_close_callback+0x99/0x1a0 [nvidia]
Nov 19 01:03:26.204037 myserver kernel: nvidia_close+0xf7/0x280 [nvidia]
Nov 19 01:03:26.204043 myserver kernel: __fput+0xa0/0x2e0
Nov 19 01:03:26.204049 myserver kernel: __fput_sync+0x1c/0x30
Nov 19 01:03:26.204055 myserver kernel: __x64_sys_close+0x3e/0x90
Nov 19 01:03:26.204065 myserver kernel: x64_sys_call+0x1fec/0x25a0
Nov 19 01:03:26.204073 myserver kernel: do_syscall_64+0x7f/0x180
Nov 19 01:03:26.204081 myserver kernel: ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
Nov 19 01:03:26.204087 myserver kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Nov 19 01:03:26.204093 myserver kernel: ? do_syscall_64+0x8c/0x180
Nov 19 01:03:26.204100 myserver kernel: ? __fput+0x160/0x2e0
Nov 19 01:03:26.204107 myserver kernel: ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
Nov 19 01:03:26.204112 myserver kernel: ? syscall_exit_to_user_mode+0x43/0x1e0
Nov 19 01:03:26.204117 myserver kernel: ? do_syscall_64+0x8c/0x180
Nov 19 01:03:26.204121 myserver kernel: ? do_syscall_64+0x8c/0x180