Kernel Null Pointer Dereference in CAS Cache with RAID10 Devices
Bug Summary
CAS Cache version 25.03.0.0963.release causes a kernel null pointer dereference crash when attempting to create a cache instance using RAID10 devices. The crash occurs consistently during the discard operation phase of cache initialization.
Environment
- Operating System: Ubuntu 22.04.5 LTS
- Linux Kernel: 5.15.0-144-generic
- CAS Cache Version: 25.03.0.0963.release
- Device Type: MD RAID10 array (4x NVMe devices)
- RAID Configuration:
  - Level: RAID10
  - Layout: near=2
  - Chunk Size: 512K
  - Devices: 4x NVMe drives
Steps to Reproduce
- Create a RAID10 array with 4 NVMe devices:
    mdadm --create /dev/md1 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
- Clean the device:
    wipefs -a /dev/md1
    dd if=/dev/zero of=/dev/md1 bs=1M count=10
- Attempt to create CAS cache instance:
    casadm -S -i 1 -d /dev/disk/by-id/md-uuid-[uuid]
Expected Behavior
CAS cache instance should be created successfully without system crashes.
Actual Behavior
- The casadm command hangs indefinitely (tested for 5+ minutes)
- Kernel crashes with null pointer dereference
- System becomes unstable
Crash Details
Kernel Panic Stack Trace
[ 4307.057075] CR2: 0000000000000000
[ 4307.057077] ---[ end trace e4a25646554913d5 ]---
[ 4307.118416] RIP: 0010:0x0
[ 4307.118421] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 4307.118423] RSP: 0018:ffffa451cd92b948 EFLAGS: 00010206
[ 4307.118425] RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000001
Call Stack Analysis
The crash occurs in the CAS Cache discard operation chain:
block_dev_forward_discard+0x184/0x290 [cas_cache]
ocf_volume_forward_discard+0x4d/0x80 [cas_cache]
ocf_req_forward_cache_discard+0x39/0x50 [cas_cache]
ocf_submit_cache_discard+0xa0/0x130 [cas_cache]
_ocf_mngt_attach_discard+0x7b/0xf0 [cas_cache]
_ocf_pipeline_run_step+0xeb/0x170 [cas_cache]
ocf_queue_run+0xf3/0x110 [cas_cache]
_cas_io_queue_thread+0x6f/0x110 [cas_cache]
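To compare traces across attempts or hosts, it can help to reduce a saved oops log to just the frames that belong to the module. The following is a minimal sketch and not part of any CAS tooling: `module_frames` is a hypothetical helper name, and it assumes frames are printed in the usual `symbol+0xoffset/0xsize [module]` form shown above.

```shell
#!/bin/sh
# Hypothetical helper: read a kernel log on stdin and print the function
# name of every stack frame attributed to the given module.
# Expected frame shape: "symbol+0xoff/0xsize [module]"
module_frames() {
    module="$1"
    awk -v mod="[$module]" '
    {
        # Walk adjacent field pairs: "<symbol+0x...>" followed by "[module]".
        for (i = 1; i < NF; i++) {
            pos = index($i, "+0x")
            if ($(i + 1) == mod && pos > 1) {
                # Keep only the symbol name in front of "+0x".
                print substr($i, 1, pos - 1)
            }
        }
    }'
}
```

Running `dmesg | module_frames cas_cache` after a crash should then list the frames from `block_dev_forward_discard` down to `_cas_io_queue_thread`, one per line, which makes it easy to diff the crash path between hosts.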
Root Cause Analysis
- Error Type: Null pointer dereference (CR2: 0000000000000000)
- Location: Function pointer call to address 0x0 (RIP: 0010:0x0)
- Module: CAS Cache discard handling code
- Trigger: RAID10 device discard operations during cache initialization
Technical Details
RAID10 Device Information
md1 : active raid10 nvme3n1[3] nvme2n1[2] nvme0n1[0] nvme1n1[1]
4000532480 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
bitmap: 0/30 pages [0KB], 65536KB chunk
Process State
$ ps aux | grep casadm
root 55713 0.0 0.0 12064 2128 pts/1 Sl+ 14:57 0:00 casadm -S -i 1 -d /dev/disk/by-id/md-uuid-[uuid]
Reproduction Rate
- 100% reproducible across multiple attempts
- Multiple systems affected (tested on storage01, storage03)
- Consistent crash location in discard operation chain
Workarounds Attempted
- Using --force flag: Still crashes
- Using --cache-mode wt: Still crashes
- Using --no-flush: Still crashes
- Different by-id paths: Still crashes
Impact Assessment
- Severity: Critical - Kernel crash/system instability
- Scope: RAID10 devices with CAS Cache 25.03.0.0963.release
- Data Safety: No data corruption observed, but system requires reboot
Suggested Investigation Areas
- Null function pointer in CAS Cache discard handling code
- RAID10-specific discard operation compatibility
- Memory management in block_dev_forward_discard function
- Race condition during cache initialization with RAID10 devices
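For the RAID10 discard compatibility angle, the discard limits that the kernel advertises for the md array and its NVMe members may be useful context. Below is a small sketch that reads the standard block-queue attributes from sysfs; `show_discard_limits` is a hypothetical helper name, and the optional second argument (an alternate sysfs root) exists only so the function can be exercised without real hardware. It does not show the cause of the crash, only data worth attaching to the report.

```shell
#!/bin/sh
# Hypothetical helper: print the discard-related queue limits of a block device.
#   $1  device name as it appears under /sys/block (e.g. md1, nvme0n1)
#   $2  sysfs root, defaults to /sys (override is for testing only)
show_discard_limits() {
    dev="$1"
    sysfs="${2:-/sys}"
    for attr in discard_granularity discard_max_bytes discard_max_hw_bytes; do
        f="$sysfs/block/$dev/queue/$attr"
        if [ -r "$f" ]; then
            printf '%s %s=%s\n' "$dev" "$attr" "$(cat "$f")"
        else
            # Attribute missing or unreadable on this kernel/device.
            printf '%s %s=unavailable\n' "$dev" "$attr"
        fi
    done
}
```

For the setup in this report, one would call `show_discard_limits md1` and then repeat it for `nvme0n1` through `nvme3n1`. A `discard_max_bytes` of 0 means the device does not advertise discard support at all, so a mismatch between the array and its members would be worth noting.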
Additional Notes
- Regular block devices (non-RAID) may not be affected
- This appears to be a regression or compatibility issue specific to RAID10
- The crash occurs during cache initialization, not during normal I/O operations
Request
Please investigate this critical kernel crash bug. The null pointer dereference in the discard handling path makes CAS Cache unusable with RAID10 devices in the current release.
Would you like crash dumps or additional debugging information?