Fix ACT layer gradient computation on CUDA #3128
Conversation
Move effective_weights accumulation into update_act_state() and finalize_act_output() kernels so that the weights used in backward() match the actual forward pass computation. Previously, true_effective_weights_ was computed on the host using remainders_/cumulative_halting_ values that became stale after CUDA kernels updated them on the device. This caused gradient mismatches in test_layer() on GPU builds.
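To illustrate the shape of the change, here is a minimal CUDA sketch (not the actual dlib kernel; the signature, names such as `halting_probs`, and the per-element layout are assumptions for illustration). The point is only that the effective weight is computed and stored on the device, inside the same kernel that updates the halting state, so `backward()` reads exactly the values the forward pass used rather than a stale host-side copy.

```cpp
// Illustrative CUDA sketch of a per-step ACT state update, launched once per
// ACT step.  Not the dlib kernel itself: names and layout are hypothetical.
__global__ void update_act_state_sketch(
    const float* halting_probs,   // p_n for the current ACT step
    float* cumulative_halting,    // running sum of halting probabilities
    float* remainders,            // 1 - sum of previous p_n once a unit halts
    float* effective_weights,     // weight actually applied to this step's output
    float threshold,              // 1 - epsilon as in Graves (2016)
    size_t n)
{
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        const float p = halting_probs[i];
        const float cumulative = cumulative_halting[i];

        if (cumulative < threshold)
        {
            const bool halts_now = cumulative + p >= threshold;
            // Once the unit halts, the remainder (not p) weights this step.
            const float w = halts_now ? 1.0f - cumulative : p;
            effective_weights[i] = w;
            if (halts_now) remainders[i] = w;
            cumulative_halting[i] = cumulative + p;
        }
        else
        {
            // Already halted: this step contributes nothing.
            effective_weights[i] = 0.0f;
        }
    }
}
```

Because the weights are written on the device alongside the halting state, they can be cached as-is for the backward pass with no host round-trip.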
Codex has the following notes: This PR fixes the immediate test failure, but while investigating I noticed the ACT layer implementation diverges from Graves (2016) in some significant ways:
Thanks, I verified that this does indeed sort out that failing test. @Cydral FYI, it still seems like something is not quite right in the layer here.
@davisking, @joelnn, regarding the CUDA synchronization issue: I had indeed noticed similar problems in the ACT layer and also in RMSnorm. I initially introduced __syncthreads() calls to coordinate the computation, but this doesn't work across thread blocks: it only synchronizes threads within a single block, leading to race conditions whenever cross-block coordination is needed. The same issue was behind the general behavior problems I reported earlier in the attention block; it has been addressed in #3124 (though other functions may still suffer from similar issues). As for the other remarks from "Codex", I will take a look; I believe I had already made some improvements on those points, but I will verify and follow up. To sum up, with the version in #3124 there is no longer any issue with the test (at least on my side). @joelnn, could you please test and confirm?
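For anyone following along, the block-versus-grid distinction looks roughly like this (a generic CUDA sketch, not dlib code; the kernel names and the normalize-by-sum example are made up for illustration, and `*total` is assumed to be zeroed before launch):

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// WRONG for cross-block coordination: __syncthreads() is a barrier for the
// threads of ONE block only, so other blocks may not have finished their
// atomicAdd when this block reads *total.
__global__ void normalize_broken(float* data, float* total, size_t n)
{
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, data[i]);
    __syncthreads();                      // orders this block only
    if (i < n) data[i] /= *total;         // race: *total may be incomplete
}

// FIXED: a grid-wide barrier via cooperative groups (the kernel must be
// launched with cudaLaunchCooperativeKernel).  An equally valid and simpler
// fix is to split the accumulation and the normalization into two kernel
// launches, since the launch boundary is an implicit grid-wide sync point.
__global__ void normalize_fixed(float* data, float* total, size_t n)
{
    cg::grid_group grid = cg::this_grid();
    const size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, data[i]);
    grid.sync();                          // every block has finished writing
    if (i < n) data[i] /= *total;
}
```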
@joelnn, to clarify this ACT class a bit: the forward pass is fully compliant with the mean-field ACT formulation from Graves. Halting probabilities, cumulative halting, remainder handling, early stopping, and ponder statistics all follow the original proposal. A deliberate design choice was made to target Transformer-style architectures rather than recurrent ones: there is no evolving internal state across ACT steps. Instead, ACT is used as adaptive computation weighting over a fixed representation, which keeps the mechanism efficient and easy to integrate.
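For reference, that weighting looks roughly like this when written out for a single unit (a host-side C++ sketch of the Graves (2016) mean-field scheme, not the dlib implementation; `act_result`, `act_mean_field`, and the vector-based inputs are illustrative assumptions):

```cpp
#include <cstddef>
#include <vector>

// Simplified single-unit sketch of mean-field ACT weighting (Graves, 2016).
// step_outputs[n] is the representation produced at ACT step n and
// halting_probs[n] its halting probability; at least one step is assumed.
// There is no recurrent state: the steps are combined as an adaptive
// weighted average over a fixed representation.
struct act_result { std::vector<float> output; float ponder_cost; };

act_result act_mean_field(
    const std::vector<std::vector<float>>& step_outputs,
    const std::vector<float>& halting_probs,
    float epsilon = 0.01f)
{
    const float threshold = 1.0f - epsilon;   // lets halting occur on step 1
    std::vector<float> output(step_outputs[0].size(), 0.0f);
    float cumulative = 0.0f;     // sum of halting probabilities so far
    float remainder  = 1.0f;     // 1 - cumulative
    float ponder     = 0.0f;     // N + R, the quantity that is penalized

    for (std::size_t n = 0; n < step_outputs.size(); ++n)
    {
        const bool halts_now = cumulative + halting_probs[n] >= threshold
                               || n + 1 == step_outputs.size();
        // The halting step is weighted by the remainder, earlier steps by p_n.
        const float w = halts_now ? remainder : halting_probs[n];
        for (std::size_t j = 0; j < output.size(); ++j)
            output[j] += w * step_outputs[n][j];

        if (halts_now) { ponder = static_cast<float>(n + 1) + remainder; break; }
        cumulative += halting_probs[n];
        remainder  -= halting_probs[n];
    }
    return { output, ponder };
}
```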
Sounds good. Can you split out the other fixes you mentioned into a separate PR and send me those by themselves? :D
Because I'm working on a branch cascaded from master, all the fixes I've made end up in the large PR currently being finalized for the whole Transformer implementation. I'll need to fork fresh from the main repository and add just the updated versions of cuda_dlib.h and cuda_dlib.cu. That should work; fingers crossed it doesn't cause merge conflicts with my current changes...
@davisking, |
Move effective_weights accumulation into update_act_state() and finalize_act_output() kernels so that the weights used in backward() match the actual forward pass computation.
Previously, true_effective_weights_ was computed on the host using remainders_/cumulative_halting_ values that became stale after CUDA kernels updated them on the device. This caused gradient mismatches in test_layer() on GPU builds.