Hi folks, I’m currently studying how GPU faults are handled and I’m trying to understand whether there is a practical way to trigger a UVM non-replayable fault.
As I understand it, UVM categorizes faults into replayable and non-replayable. Roughly speaking, faults coming from the Graphics Engine (SM) are replayable, while faults coming from the Copy Engine or PBDMA are non-replayable. So far, the only detailed explanation I’ve found is in the comments inside kernel-open/nvidia-uvm/uvm_gpu_non_replayable_faults.c (if there is any official documentation elsewhere, I’d really appreciate pointers).
The comment gives an example: “An example of a Copy Engine non-replayable fault is a memory copy between two virtual addresses on a GPU, in which either the source or destination pointers are not currently mapped to a physical address in the page tables of the GPU.”
I tried to reproduce this kind of Copy Engine fault in two ways:
1. Using cudaMallocManaged and then applying cuMemAdvise to make the CPU the preferred location for the destination pages. This approach does not guarantee that the physical page on the GPU has been evicted.
2. Using the VMM API (cuMemCreate etc.) to create a valid GPU VA range without backing it with physical memory. This approach should guarantee that the range is unmapped.
But neither attempt triggered a non-replayable fault. I monitored schedule_non_replayable_faults_handler in kernel-open/nvidia-uvm/uvm_gpu_isr.c, and it never returned 1 (a return value of 1 means a handler was scheduled). Instead, with the first approach I only got replayable faults, because UVM was trying to migrate the page from the CPU to the GPU. With the second approach I only got a segmentation fault on the CPU side :(
Before I keep digging, I wanted to ask:
Has anyone successfully triggered a UVM non-replayable fault, or does anyone have insight into the conditions that reliably cause one?
Any suggestions or thoughts would be greatly appreciated!