Kernel Null Pointer Dereference in CAS Cache with RAID10 Devices #1671

@mingyuetian


Bug Summary

CAS Cache version 25.03.0.0963.release triggers a kernel null pointer dereference when a cache instance is created on an MD RAID10 device. The crash occurs consistently during the discard phase of cache initialization.

Environment

  • Operating System: Ubuntu 22.04.5 LTS
  • Linux Kernel: 5.15.0-144-generic
  • CAS Cache Version: 25.03.0.0963.release
  • Device Type: MD RAID10 array (4x NVMe devices)
  • RAID Configuration:
    • Level: RAID10
    • Layout: near=2
    • Chunk Size: 512K
    • Devices: 4x NVMe drives

Steps to Reproduce

  1. Create a RAID10 array with 4 NVMe devices:

    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  2. Clean the device:

    wipefs -a /dev/md1
    dd if=/dev/zero of=/dev/md1 bs=1M count=10
  3. Attempt to create CAS cache instance:

    casadm -S -i 1 -d /dev/disk/by-id/md-uuid-[uuid]
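Since the crash is in the discard path, it may help to record the discard limits that the kernel reports for the array and its members before running step 3. The following is a minimal sketch, not part of the original reproduction; the function name `print_discard_limits` and the overridable sysfs root are illustrative assumptions, while the attribute files are the standard block queue entries under `/sys/block/<dev>/queue/`.

```shell
#!/bin/sh
# Print the discard-related queue limits that the kernel exposes for a block device.
# $1: device name as it appears under /sys/block (e.g. md1)
# $2: sysfs root, defaults to /sys (overridable so the function can be tested)
print_discard_limits() {
    dev="$1"
    sysfs_root="${2:-/sys}"
    queue_dir="$sysfs_root/block/$dev/queue"
    for attr in discard_granularity discard_max_bytes discard_max_hw_bytes; do
        if [ -r "$queue_dir/$attr" ]; then
            printf '%s %s=%s\n' "$dev" "$attr" "$(cat "$queue_dir/$attr")"
        else
            # Attribute missing or unreadable on this kernel/device.
            printf '%s %s=unavailable\n' "$dev" "$attr"
        fi
    done
}
```

For the setup above, the array and its members could be compared with `for d in md1 nvme0n1 nvme1n1 nvme2n1 nvme3n1; do print_discard_limits "$d"; done`.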

Expected Behavior

The CAS cache instance should be created successfully, without crashing the kernel.

Actual Behavior

  • The casadm command hangs and never returns (observed for 5+ minutes)
  • The kernel reports a null pointer dereference
  • The system becomes unstable and requires a reboot

Crash Details

Kernel Panic Stack Trace

[ 4307.057075] CR2: 0000000000000000
[ 4307.057077] ---[ end trace e4a25646554913d5 ]---
[ 4307.118416] RIP: 0010:0x0
[ 4307.118421] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 4307.118423] RSP: 0018:ffffa451cd92b948 EFLAGS: 00010206
[ 4307.118425] RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000001

Call Stack Analysis

The crash occurs in the CAS Cache discard operation chain:

block_dev_forward_discard+0x184/0x290 [cas_cache]
ocf_volume_forward_discard+0x4d/0x80 [cas_cache]
ocf_req_forward_cache_discard+0x39/0x50 [cas_cache]
ocf_submit_cache_discard+0xa0/0x130 [cas_cache]
_ocf_mngt_attach_discard+0x7b/0xf0 [cas_cache]
_ocf_pipeline_run_step+0xeb/0x170 [cas_cache]
ocf_queue_run+0xf3/0x110 [cas_cache]
_cas_io_queue_thread+0x6f/0x110 [cas_cache]
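To map these frames back to source lines, the `symbol+offset/size` tokens can be pulled out of the log and passed to the kernel tree's `scripts/faddr2line`. This is a sketch under assumptions: the helper name `cas_frames` is hypothetical, and resolving lines requires a `cas_cache.ko` built with debug info, which may not be available for the release package.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the "symbol+0xoff/0xsize" token
# of every frame attributed to the cas_cache module, one per line.
cas_frames() {
    awk '/\[cas_cache\]/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[A-Za-z_][A-Za-z0-9_.]*\+0x[0-9a-f]+\/0x[0-9a-f]+$/)
                print $i
    }'
}
```

A possible use, with both paths as placeholders: `dmesg | cas_frames | while read -r frame; do /path/to/linux/scripts/faddr2line /path/to/cas_cache.ko "$frame"; done`.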

Root Cause Analysis

  • Error Type: Null pointer dereference (CR2: 0000000000000000)
  • Location: Function pointer call to address 0x0 (RIP: 0010:0x0)
  • Module: CAS Cache discard handling code
  • Trigger: RAID10 device discard operations during cache initialization

Technical Details

RAID10 Device Information

md1 : active raid10 nvme3n1[3] nvme2n1[2] nvme0n1[0] nvme1n1[1]
      4000532480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/30 pages [0KB], 65536KB chunk

Process State

$ ps aux | grep casadm
root  55713  0.0  0.0  12064  2128 pts/1  Sl+  14:57  0:00 casadm -S -i 1 -d /dev/disk/by-id/md-uuid-[uuid]

Reproduction Rate

  • 100% reproducible across multiple attempts
  • Multiple systems affected (tested on storage01, storage03)
  • Consistent crash location in discard operation chain

Workarounds Attempted

  1. Using --force flag: Still crashes
  2. Using --cache-mode wt: Still crashes
  3. Using --no-flush: Still crashes
  4. Different by-id paths: Still crashes

Impact Assessment

  • Severity: Critical - Kernel crash/system instability
  • Scope: RAID10 devices with CAS Cache 25.03.0.0963.release
  • Data Safety: No data corruption observed, but system requires reboot

Suggested Investigation Areas

  1. Null function pointer in CAS Cache discard handling code
  2. RAID10-specific discard operation compatibility
  3. Memory management in block_dev_forward_discard function
  4. Race condition during cache initialization with RAID10 devices

Additional Notes

  • Regular block devices (non-RAID) may not be affected
  • This appears to be a regression or compatibility issue specific to RAID10
  • The crash occurs during cache initialization, not during normal I/O operations

Request

Please investigate this critical kernel crash bug. The null pointer dereference in the discard handling path makes CAS Cache unusable with RAID10 devices in the current release.

Would you like crash dumps or additional debugging information?
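If more data is useful, something along these lines could gather the basics into one directory for attaching to this issue. It is only a sketch: the function names `capture` and `collect_cas_debug` and the output file names are made up for illustration, and the default of `/dev/md1` simply matches the array used above. The commands themselves (`uname -r`, `/proc/mdstat`, `mdadm --detail`, `casadm -V`, `dmesg`) are the standard ones.

```shell
#!/bin/sh
# Run a command and store its combined output in "$outdir/$name.txt".
# A failing or missing command is noted in the file instead of aborting.
capture() {
    outdir="$1"; name="$2"; shift 2
    if ! "$@" > "$outdir/$name.txt" 2>&1; then
        printf 'command failed: %s\n' "$*" >> "$outdir/$name.txt"
    fi
}

# $1: output directory; $2: MD device to describe (defaults to /dev/md1)
collect_cas_debug() {
    outdir="$1"
    md_dev="${2:-/dev/md1}"
    mkdir -p "$outdir" || return 1
    capture "$outdir" kernel uname -r
    capture "$outdir" mdstat cat /proc/mdstat
    capture "$outdir" md_detail mdadm --detail "$md_dev"
    capture "$outdir" cas_version casadm -V
    capture "$outdir" dmesg dmesg
}
```

Running `collect_cas_debug ./cas-debug /dev/md1` as root right after the oops should leave one text file per command in `./cas-debug`.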
