
Conversation

@joelnn (Contributor) commented Dec 28, 2025

Move effective_weights accumulation into update_act_state() and finalize_act_output() kernels so that the weights used in backward() match the actual forward pass computation.

Previously, true_effective_weights_ was computed on the host using remainders_/cumulative_halting_ values that became stale after CUDA kernels updated them on the device. This caused gradient mismatches in test_layer() on GPU builds.
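To illustrate the idea, here is a simplified sketch, not dlib's actual kernels: the buffer names are hypothetical and the update/finalize steps are folded into one kernel for brevity. The point is that the weight actually applied at each step is recorded on the device, in the same pass that updates the halting state, so backward() never depends on stale host-side copies.

```cpp
// Simplified sketch (hypothetical names, not dlib's real update_act_state()/
// finalize_act_output()).  The effective weight used this step is written to
// device memory alongside the halting-state update, so backward() reads the
// same values the forward pass produced.
__global__ void update_act_state_sketch(
    const float* halting_logits,   // per-position halting logits for this step
    float* cumulative_halting,     // running sum of halting probabilities
    float* remainders,             // remainder assigned when a position halts
    float* effective_weights,      // weight actually applied this step (hypothetical buffer)
    int n, float threshold)        // threshold is typically 1 - epsilon
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float p   = 1.0f / (1.0f + expf(-halting_logits[i]));  // sigmoid halting probability
    float cum = cumulative_halting[i];

    float w;
    if (cum + p >= threshold)      // position halts: spend the remainder
    {
        w = 1.0f - cum;
        remainders[i] = w;
        cumulative_halting[i] = 1.0f;
    }
    else                           // position keeps pondering: spend p
    {
        w = p;
        cumulative_halting[i] = cum + p;
    }
    effective_weights[i] = w;      // recorded on the device, reused by backward()
}
```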

@joelnn (Contributor, Author) commented Dec 28, 2025

Codex has the following notes:

This PR fixes the immediate test failure, but while investigating I noticed the ACT layer implementation diverges from Graves (2016) in some significant ways:

1. No state evolution: the paper defines intermediate states as s^n_t = S(s^{n-1}_t, x^n_t), i.e. the state evolves across pondering steps. The current implementation reuses the same input tensor for all steps, making the output effectively a weighted identity (the weights sum to 1 via the halt threshold plus the remainder). A sketch of the paper's recurrence follows this list.
2. Incomplete backward pass: params_grad is set to 0, so no gradients flow to W_halt/b_halt. The paper's gradient derivation (Section 3.2) shows how the halting probabilities should receive gradients from both the loss and the ponder cost.
3. Ponder penalty not backpropagated: grad_logits_ is declared but unused; the ponder cost τ·ρ(x) is computed but doesn't influence training.
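For reference, a minimal sketch of the paper's recurrence with illustrative names only (this is not dlib code, and the transition is a dummy placeholder):

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: one position's ponder loop with an evolving state, as in
// Graves (2016), where s_n = S(s_{n-1}, x) and the halting probability is read
// from the current state.
struct step_result { std::vector<float> state; float halt_prob; };

step_result ponder_step(const std::vector<float>& prev_state,
                        const std::vector<float>& x)
{
    // Placeholder transition: real code would apply the recurrent update S()
    // and the halting unit sigmoid(W_halt * s + b_halt).
    step_result r{prev_state, 0.3f};
    for (std::size_t i = 0; i < x.size(); ++i)
        r.state[i] = 0.5f * (r.state[i] + x[i]);
    return r;
}

std::vector<float> act_forward(const std::vector<float>& x,
                               int max_steps = 10, float eps = 0.01f)
{
    std::vector<float> s(x.size(), 0.0f);         // s_0
    std::vector<float> out(x.size(), 0.0f);       // sum_n p_n * s_n
    float cumulative = 0.0f;

    for (int n = 1; n <= max_steps; ++n)
    {
        step_result r = ponder_step(s, x);        // s_n evolves each step
        s = r.state;
        bool halt = (cumulative + r.halt_prob >= 1.0f - eps) || n == max_steps;
        float p = halt ? 1.0f - cumulative        // remainder on the final step
                       : r.halt_prob;             // ordinary step weight
        cumulative += p;
        for (std::size_t i = 0; i < s.size(); ++i)
            out[i] += p * s[i];                   // a true mixture of distinct states
        if (halt) break;
    }
    return out;
}
```

Because each intermediate state differs, the weighted sum is a genuine mixture; when the same tensor is reused at every step, the same weights reduce the sum to that tensor.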

@davisking (Owner) commented:

Thanks, I verified that this does indeed sort out that failing test. @Cydral FYI, it still seems like something isn't quite right in this layer.

@davisking merged commit 0f86c88 into davisking:master on Dec 28, 2025. 9 checks passed.
@Cydral (Contributor) commented Dec 29, 2025

@davisking, @joelnn, regarding the CUDA synchronization issue: I had indeed noticed similar problems in the ACT layer and also in RMSNorm. I initially introduced __syncthreads() calls to coordinate the computation, but that only synchronizes threads within a single block, not across blocks, so race conditions remain whenever cross-block coordination is needed. The same issue was causing the general behavior problems I reported earlier in the attention block, which has been addressed in #3124 (though other functions may still suffer from similar issues).

As for the other remarks from "Codex", I will take a look; I believe I had already made some improvements on those points, but I will verify and follow up.

To sum up, with the version in #3124 there is no longer an issue with the test part (at least on my side). @joelnn, could you please test and confirm?
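To illustrate the failure mode, here is a simplified sketch with made-up names, not dlib's actual kernel: a reduce-then-use pattern where the rows are spread over several blocks cannot be ordered by __syncthreads() alone.

```cpp
// Illustration only: row_sums is assumed zero-initialized before launch.
__global__ void normalize_rows_broken(float* data, float* row_sums,
                                      int rows, int cols)
{
    // Phase 1: every thread accumulates into the sums of the rows it touches.
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            atomicAdd(&row_sums[r], data[r * cols + c]);

    // BUG: this barrier only synchronizes the threads of *this* block.
    // Threads in other blocks may still be adding into row_sums[r] below.
    __syncthreads();

    // Phase 2: may read a row_sums[r] that is not complete yet -> race.
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            data[r * cols + c] /= row_sums[r];
}
```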

@Cydral (Contributor) commented Dec 29, 2025

@joelnn, to clarify this ACT class a bit... the forward pass is fully compliant with the mean-field ACT formulation from Graves. Halting probabilities, cumulative halting, remainder handling, early stopping, and ponder statistics all follow the original proposal. A deliberate design choice was made to target Transformer-style architectures rather than recurrent ones: there is no evolving internal state across ACT steps. Instead, ACT is used as adaptive computation weighting over a fixed representation, which keeps the mechanism efficient and easy to integrate.
At the moment, gradients for the halting mechanism itself are not implemented. In other words, the halting parameters are not yet learned and the ponder cost is not backpropagated. This part is acknowledged and currently under study, as applying halting gradients cleanly outside a recurrent setting requires care.
So the current layer should be seen as an "Adaptive Computation Weighting" variant: faithful to ACT in the forward pass, simplified by design, with learned halting planned next.
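For reference, a minimal sketch of the per-position ponder statistics in this mean-field variant, with illustrative names (not the dlib implementation): given the per-step halting probabilities over a fixed representation, the layer counts the steps N taken until the 1 − ε threshold is reached and the remainder R spent on the final step, and ρ = N + R is the ponder cost Graves penalizes.

```cpp
#include <vector>

// Illustrative only: per-position ponder statistics for the mean-field variant.
float ponder_cost(const std::vector<float>& halt_probs, float eps = 0.01f)
{
    float cumulative = 0.0f;
    int N = 0;
    for (float h : halt_probs)
    {
        ++N;
        if (cumulative + h >= 1.0f - eps) break;   // early stop: threshold reached
        cumulative += h;
    }
    float R = 1.0f - cumulative;                   // remainder spent on the last step
    return static_cast<float>(N) + R;              // rho = N + R
}
```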

@davisking (Owner) commented:

Sounds good. Can you split out the other fixes you mentioned in a separate PR and send me those by themselves? :D

@joelnn (Contributor, Author) commented Dec 29, 2025

@Cydral I don't have a direct interest in the ACT layer or the DNN pieces in general, I was just checking that the tests pass on my platform. So I will excuse myself from reviewing #3124 etc. If you have an alternative fix to this PR, that would be completely fine with me.

@Cydral (Contributor) commented Dec 29, 2025

> @Cydral I don't have a direct interest in the ACT layer or the DNN pieces in general, I was just checking that the tests pass on my platform. So I will excuse myself from reviewing #3124 etc. If you have an alternative fix to this PR, that would be completely fine with me.
I understand the merge was done to get the tests passing. Personally, I hadn't noticed a gradient issue with this layer specifically, other than the point I mentioned earlier: an intrinsic problem in the CUDA code that eventually produced incorrect results. It was a bug I had observed more broadly but couldn't isolate at the time; it was indeed caused by __syncthreads(), which only synchronizes within a single thread block, not across blocks.
We can probably leave the merge as is for now. I was planning to revisit this layer in particular, as it goes hand in hand with another example I'm preparing (Hierarchical Reasoning Model), but since I'm currently focused on other parts, this may take some time...

@Cydral (Contributor) commented Dec 29, 2025

> Sounds good. Can you split out the other fixes you mentioned in a separate PR and send me those by themselves? :D

Because I'm working on a branch cascaded from master, all the fixes I've made end up in the large PR currently being finalized for the whole Transformer implementation. I'll need to fork fresh from the main repository and add just the updated versions of cuda_dlib.h and cuda_dlib.cu. That should work—fingers crossed it doesn't cause merge conflicts with my current changes...

@Cydral (Contributor) commented Dec 30, 2025

@davisking,
Wait, I just pulled the current branch and, looking more closely, I see that the ACT utility function interfaces have changed. There are also code deletions within the layer itself. Could you please revert this so I can simply inject the "fixed" CUDA functions? For reference, other functions are affected by the same issue:
- _cuda_inverse_norms
- _cuda_dot_prods
- _cuda_multiply_conv2
- _cuda_layer_normalize
- _cuda_layer_normalize_gradient
This is the issue I mentioned earlier: __syncthreads() only synchronizes threads within a single block, not across blocks. When using grid_stride_range_y, work is distributed across multiple blocks, so the synchronization barriers don't actually ensure all data is ready before proceeding to the next phase. I've been chasing this bug for months—it becomes much more apparent under heavy GPU load, especially on high-performance cards like the RTX 6000 where the race conditions are more likely to manifest.
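For illustration, one common fix pattern is to split the reduce phase and the use phase into two kernel launches; launches on the same stream run in order, so the second kernel only starts once every block of the first has finished writing its results. This is a sketch with made-up names, not the actual dlib change:

```cpp
// row_sums is assumed zeroed (e.g. with cudaMemset) before the first launch.
__global__ void compute_row_sums(const float* data, float* row_sums,
                                 int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            atomicAdd(&row_sums[r], data[r * cols + c]);
}

__global__ void divide_by_row_sums(float* data, const float* row_sums,
                                   int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            data[r * cols + c] /= row_sums[r];
}
```

Alternatives include a grid-level synchronization via cooperative groups (where the hardware and launch configuration support it), or restructuring the kernel so that each row is owned by a single block, in which case __syncthreads() is sufficient again.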
I also noticed there were previous reports about precision issues with _cuda_multiply_conv2 (?). This could be related to the same underlying problem noted by @joelnn. I think it would be best to fix the synchronization issue first across all affected functions, and then apply any additional fixes if there are other issues here...
I can provide fixes for all of these functions, but it will take me some time.
