
Driver Deadlock when using 2 GPUs with 2 docker containers starting in parallel #968

@guenhter

Description


NVIDIA Open GPU Kernel Modules Version

580.95.05

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.3 LTS

With Ubuntu 22.04 LTS, the issue does not occur.

Kernel Release

6.8.0-87-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 4060
GPU 1: NVIDIA GeForce RTX 4060

Describe the bug

We have a deadlock on two servers (it does not happen every time, but it occurs multiple times a week with daily reboots). When starting up the following compose file (cut down to the essential parts):

services:

  vllm-server-gpu0:
    image: vllm/vllm-openai:v0.11.0
    runtime: nvidia
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen2.5-VL-3B-Instruct-AWQ
      --served-model-name Qwen/Qwen2.5-VL-3B-Instruct
      --gpu-memory-utilization 0.90
      --max_model_len 1600
      --tensor-parallel-size 1
      --load-format safetensors
      --enable-log-requests
      --enable-log-outputs

  vllm-server-gpu1:
    image: vllm/vllm-openai:v0.11.0
    runtime: nvidia
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
    command: >
      --model google/gemma-3-1b-it-qat-int4-unquantized
      --served-model-name google/gemma-3-1b-it
      --gpu-memory-utilization 0.90
      --max_model_len 32768
      --tensor-parallel-size 1
      --load-format safetensors
      --enable-log-requests
      --enable-log-outputs

then the system stalls completely (not even keyboard input is possible anymore).
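
As a possible mitigation we are considering serializing the startup of the two containers so that both GPUs are never initialized at the same time. This is only a sketch and we have not yet verified that it avoids the hang; it assumes the vLLM container exposes its default /health endpoint on port 8000 and that python3 is on PATH inside the image:

  vllm-server-gpu0:
    # unchanged, plus a healthcheck that reports healthy once the server answers /health
    healthcheck:
      test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 10s
      timeout: 5s
      retries: 60
      start_period: 120s

  vllm-server-gpu1:
    # start (and therefore open its GPU) only after gpu0 has finished initializing
    depends_on:
      vllm-server-gpu0:
        condition: service_healthy

With condition: service_healthy, docker compose starts vllm-server-gpu1 only after the gpu0 healthcheck passes, so the two device-open paths no longer run in parallel at boot.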

The evidence that points to a deadlock:

Process 1: nv_open_q (trying to OPEN/initialize GPU)

task:nv_open_q       state:D stack:0     pid:1014
Call Trace:
  rwsem_down_write_slowpath+0x27e/0x550   ← Waiting for write lock
  down_write+0x5c/0x80
  os_acquire_rwlock_write+0x3c/0x70       ← NVIDIA trying to acquire lock
  portSyncRwLockAcquireWrite+0x10/0x40
  rmapiLockAcquire+0x294/0x360
  kgspInitRm_IMPL+0xcad/0x1680            ← Initializing GPU
  RmInitAdapter+0xff2/0x1e40              ← Opening adapter
  rm_init_adapter+0xad/0xc0
  nv_open_device+0x222/0xa80

Process 2: python3 (trying to CLOSE/shutdown GPU)

task:python3         state:D stack:0     pid:654119
Call Trace:
  __down+0x1d/0x30
  down+0x54/0x80
  console_lock+0x25/0x70                  ← Trying to acquire console lock
  os_disable_console_access+0xe/0x20
  RmShutdownAdapter+0x18d/0x3b0           ← Shutting down adapter
  rm_shutdown_adapter+0x58/0x60
  nv_shutdown_adapter+0xae/0x1d0
  nv_close_device+0x132/0x180             ← Closing device
  nvidia_close+0xf7/0x280

Why It's a Deadlock:

  1. nv_open_q is blocked in rwsem_down_write_slowpath, waiting for a write lock while initializing the GPU adapter
  2. python3 is blocked in console_lock, trying to disable console access during GPU shutdown
  3. Both are in state D (uninterruptible sleep), so they cannot be killed or interrupted
  4. Each appears to be waiting on a resource held by the other, or by the NVIDIA driver's internal lock management

To Reproduce

Start the compose file, or reboot the system (which also starts the two containers in parallel).

Bug Incidence

Sometimes

nvidia-bug-report.log.gz

I cannot run anything after the error happens, so I was unable to collect a bug report.

More Info



Trace 1:

Nov 19 01:03:26.198647 myserver kernel: task:nv_open_q       state:D stack:0     pid:1014  tgid:1014  ppid:2      flags:0x00004000
Nov 19 01:03:26.198686 myserver kernel: Call Trace:
Nov 19 01:03:26.198721 myserver kernel:  <TASK>
Nov 19 01:03:26.198752 myserver kernel:  __schedule+0x27c/0x6b0
Nov 19 01:03:26.198836 myserver kernel:  schedule+0x33/0x110
Nov 19 01:03:26.198878 myserver kernel:  schedule_preempt_disabled+0x15/0x30
Nov 19 01:03:26.198916 myserver kernel:  rwsem_down_write_slowpath+0x27e/0x550
Nov 19 01:03:26.198948 myserver kernel:  down_write+0x5c/0x80
Nov 19 01:03:26.198978 myserver kernel:  os_acquire_rwlock_write+0x3c/0x70 [nvidia]
Nov 19 01:03:26.199004 myserver kernel:  portSyncRwLockAcquireWrite+0x10/0x40 [nvidia]
Nov 19 01:03:26.199034 myserver kernel:  rmapiLockAcquire+0x294/0x360 [nvidia]
Nov 19 01:03:26.200733 myserver kernel:  kgspInitRm_IMPL+0xcad/0x1680 [nvidia]
Nov 19 01:03:26.200867 myserver kernel:  ? down+0x36/0x80
Nov 19 01:03:26.200920 myserver kernel:  RmInitAdapter+0xff2/0x1e40 [nvidia]
Nov 19 01:03:26.200952 myserver kernel:  ? _raw_spin_lock_irqsave+0xe/0x20
Nov 19 01:03:26.200982 myserver kernel:  ? up+0x58/0xa0
Nov 19 01:03:26.201683 myserver kernel:  rm_init_adapter+0xad/0xc0 [nvidia]
Nov 19 01:03:26.202101 myserver kernel:  nv_open_device+0x222/0xa80 [nvidia]
Nov 19 01:03:26.202924 myserver kernel:  nvidia_open_deferred+0x39/0xf0 [nvidia]
Nov 19 01:03:26.202966 myserver kernel:  _main_loop+0x7f/0x140 [nvidia]
Nov 19 01:03:26.202975 myserver kernel:  ? __pfx__main_loop+0x10/0x10 [nvidia]
Nov 19 01:03:26.202983 myserver kernel:  kthread+0xef/0x120
Nov 19 01:03:26.202990 myserver kernel:  ? __pfx_kthread+0x10/0x10
Nov 19 01:03:26.202996 myserver kernel:  ret_from_fork+0x44/0x70
Nov 19 01:03:26.203003 myserver kernel:  ? __pfx_kthread+0x10/0x10
Nov 19 01:03:26.203019 myserver kernel:  ret_from_fork_asm+0x1b/0x30
Nov 19 01:03:26.203026 myserver kernel:  </TASK>
Nov 19 01:03:26.203033 myserver kernel: INFO: task python3:654119 blocked for more than 122 seconds.
Nov 19 01:03:26.203041 myserver kernel:       Tainted: G           O       6.8.0-87-generic #88-Ubuntu
Nov 19 01:03:26.203047 myserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 19 01:03:26.203053 myserver kernel: task:python3         state:D stack:0     pid:654119 tgid:654119 ppid:653435 flags:0x00004002

Trace 2:

Nov 19 01:03:26.203053 myserver kernel: task:python3         state:D stack:0     pid:654119 tgid:654119 ppid:653435 flags:0x00004002
Nov 19 01:03:26.203059 myserver kernel: Call Trace:
Nov 19 01:03:26.203065 myserver kernel:  <TASK>
Nov 19 01:03:26.203070 myserver kernel:  __schedule+0x27c/0x6b0
Nov 19 01:03:26.203076 myserver kernel:  schedule+0x33/0x110
Nov 19 01:03:26.203081 myserver kernel:  schedule_timeout+0x157/0x170
Nov 19 01:03:26.203088 myserver kernel:  ___down_common+0xfd/0x160
Nov 19 01:03:26.203097 myserver kernel:  __down_common+0x22/0xd0
Nov 19 01:03:26.203103 myserver kernel:  __down+0x1d/0x30
Nov 19 01:03:26.203111 myserver kernel:  down+0x54/0x80
Nov 19 01:03:26.203119 myserver kernel:  console_lock+0x25/0x70
Nov 19 01:03:26.203126 myserver kernel:  os_disable_console_access+0xe/0x20 [nvidia]
Nov 19 01:03:26.203133 myserver kernel:  RmShutdownAdapter+0x18d/0x3b0 [nvidia]
Nov 19 01:03:26.203909 myserver kernel:  rm_shutdown_adapter+0x58/0x60 [nvidia]
Nov 19 01:03:26.204004 myserver kernel:  nv_shutdown_adapter+0xae/0x1d0 [nvidia]
Nov 19 01:03:26.204012 myserver kernel:  nv_close_device+0x132/0x180 [nvidia]
Nov 19 01:03:26.204031 myserver kernel:  nvidia_close_callback+0x99/0x1a0 [nvidia]
Nov 19 01:03:26.204037 myserver kernel:  nvidia_close+0xf7/0x280 [nvidia]
Nov 19 01:03:26.204043 myserver kernel:  __fput+0xa0/0x2e0
Nov 19 01:03:26.204049 myserver kernel:  __fput_sync+0x1c/0x30
Nov 19 01:03:26.204055 myserver kernel:  __x64_sys_close+0x3e/0x90
Nov 19 01:03:26.204065 myserver kernel:  x64_sys_call+0x1fec/0x25a0
Nov 19 01:03:26.204073 myserver kernel:  do_syscall_64+0x7f/0x180
Nov 19 01:03:26.204081 myserver kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
Nov 19 01:03:26.204087 myserver kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Nov 19 01:03:26.204093 myserver kernel:  ? do_syscall_64+0x8c/0x180
Nov 19 01:03:26.204100 myserver kernel:  ? __fput+0x160/0x2e0
Nov 19 01:03:26.204107 myserver kernel:  ? arch_exit_to_user_mode_prepare.isra.0+0x1a/0xe0
Nov 19 01:03:26.204112 myserver kernel:  ? syscall_exit_to_user_mode+0x43/0x1e0
Nov 19 01:03:26.204117 myserver kernel:  ? do_syscall_64+0x8c/0x180
Nov 19 01:03:26.204121 myserver kernel:  ? do_syscall_64+0x8c/0x180
