
Conversation

@joelnn (Contributor) commented Dec 28, 2025

Move effective_weights accumulation into update_act_state() and finalize_act_output() kernels so that the weights used in backward() match the actual forward pass computation.

Previously, true_effective_weights_ was computed on the host using remainders_/cumulative_halting_ values that became stale after CUDA kernels updated them on the device. This caused gradient mismatches in test_layer() on GPU builds.
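To illustrate the idea, here is a simplified sketch, not dlib's actual kernels: the buffer names are hypothetical and the update/finalize steps are folded into one kernel for brevity. The point is that the weight actually applied at each step is recorded on the device, in the same pass that updates the halting state, so backward() never depends on stale host-side copies.

```cpp
// Simplified sketch (hypothetical names, not dlib's real update_act_state()/
// finalize_act_output()).  The effective weight used this step is written to
// device memory alongside the halting-state update, so backward() reads the
// same values the forward pass produced.
__global__ void update_act_state_sketch(
    const float* halting_logits,   // per-position halting logits for this step
    float* cumulative_halting,     // running sum of halting probabilities
    float* remainders,             // remainder assigned when a position halts
    float* effective_weights,      // weight actually applied this step (hypothetical buffer)
    int n, float threshold)        // threshold is typically 1 - epsilon
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float p   = 1.0f / (1.0f + expf(-halting_logits[i]));  // sigmoid halting probability
    float cum = cumulative_halting[i];

    float w;
    if (cum + p >= threshold)      // position halts: spend the remainder
    {
        w = 1.0f - cum;
        remainders[i] = w;
        cumulative_halting[i] = 1.0f;
    }
    else                           // position keeps pondering: spend p
    {
        w = p;
        cumulative_halting[i] = cum + p;
    }
    effective_weights[i] = w;      // recorded on the device, reused by backward()
}
```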

@joelnn (Contributor, Author) commented Dec 28, 2025

Codex has the following notes:

This PR fixes the immediate test failure, but while investigating I noticed the ACT layer implementation diverges from Graves (2016) in some significant ways:

1. No state evolution: the paper defines intermediate states as s^n_t = S(s^{n-1}_t, x^n_t), i.e. the state evolves across pondering steps. The current implementation reuses the same input tensor for all steps, making the output effectively a weighted identity (the weights sum to 1 via the halt threshold plus the remainder). A sketch of the paper's recurrence follows this list.
2. Incomplete backward pass: params_grad is set to 0, so no gradients flow to W_halt/b_halt. The paper's gradient derivation (Section 3.2) shows how the halting probabilities should receive gradients from both the loss and the ponder cost.
3. Ponder penalty not backpropagated: grad_logits_ is declared but unused; the ponder cost τ·ρ(x) is computed but doesn't influence training.
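For reference, a minimal sketch of the paper's recurrence with illustrative names only (this is not dlib code, and the transition is a dummy placeholder):

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: one position's ponder loop with an evolving state, as in
// Graves (2016), where s_n = S(s_{n-1}, x) and the halting probability is read
// from the current state.
struct step_result { std::vector<float> state; float halt_prob; };

step_result ponder_step(const std::vector<float>& prev_state,
                        const std::vector<float>& x)
{
    // Placeholder transition: real code would apply the recurrent update S()
    // and the halting unit sigmoid(W_halt * s + b_halt).
    step_result r{prev_state, 0.3f};
    for (std::size_t i = 0; i < x.size(); ++i)
        r.state[i] = 0.5f * (r.state[i] + x[i]);
    return r;
}

std::vector<float> act_forward(const std::vector<float>& x,
                               int max_steps = 10, float eps = 0.01f)
{
    std::vector<float> s(x.size(), 0.0f);         // s_0
    std::vector<float> out(x.size(), 0.0f);       // sum_n p_n * s_n
    float cumulative = 0.0f;

    for (int n = 1; n <= max_steps; ++n)
    {
        step_result r = ponder_step(s, x);        // s_n evolves each step
        s = r.state;
        bool halt = (cumulative + r.halt_prob >= 1.0f - eps) || n == max_steps;
        float p = halt ? 1.0f - cumulative        // remainder on the final step
                       : r.halt_prob;             // ordinary step weight
        cumulative += p;
        for (std::size_t i = 0; i < s.size(); ++i)
            out[i] += p * s[i];                   // a true mixture of distinct states
        if (halt) break;
    }
    return out;
}
```

Because each intermediate state differs, the weighted sum is a genuine mixture; when the same tensor is reused at every step, the same weights reduce the sum to that tensor.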

@davisking (Owner) commented:

Thanks, I verified that this does indeed sort out that failing test. @Cydral FYI, it still seems like something isn't quite right in this layer.

@davisking merged commit 0f86c88 into davisking:master on Dec 28, 2025. 9 checks passed.
@Cydral (Contributor) commented Dec 29, 2025

@davisking, @joelnn, regarding the CUDA synchronization issue: I had indeed noticed similar problems in the ACT layer and also in RMSNorm. I initially introduced __syncthreads() calls to coordinate the computation, but that only synchronizes threads within a single block, not across blocks, so race conditions remain whenever cross-block coordination is needed. The same issue was causing the general behavior problems I reported earlier in the attention block, which has been addressed in #3124 (though other functions may still suffer from similar issues).

As for the other remarks from "Codex", I will take a look; I believe I had already made some improvements on those points, but I will verify and follow up.

To sum up, with the version in #3124 there is no longer an issue with the test part (at least on my side). @joelnn, could you please test and confirm?
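To illustrate the failure mode, here is a simplified sketch with made-up names, not dlib's actual kernel: a reduce-then-use pattern where the rows are spread over several blocks cannot be ordered by __syncthreads() alone.

```cpp
// Illustration only: row_sums is assumed zero-initialized before launch.
__global__ void normalize_rows_broken(float* data, float* row_sums,
                                      int rows, int cols)
{
    // Phase 1: every thread accumulates into the sums of the rows it touches.
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            atomicAdd(&row_sums[r], data[r * cols + c]);

    // BUG: this barrier only synchronizes the threads of *this* block.
    // Threads in other blocks may still be adding into row_sums[r] below.
    __syncthreads();

    // Phase 2: may read a row_sums[r] that is not complete yet -> race.
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            data[r * cols + c] /= row_sums[r];
}
```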

@Cydral (Contributor) commented Dec 29, 2025

@joelnn, to clarify this ACT class a bit... the forward pass is fully compliant with the mean-field ACT formulation from Graves. Halting probabilities, cumulative halting, remainder handling, early stopping, and ponder statistics all follow the original proposal. A deliberate design choice was made to target Transformer-style architectures rather than recurrent ones: there is no evolving internal state across ACT steps. Instead, ACT is used as adaptive computation weighting over a fixed representation, which keeps the mechanism efficient and easy to integrate.
At the moment, gradients for the halting mechanism itself are not implemented. In other words, the halting parameters are not yet learned and the ponder cost is not backpropagated. This part is acknowledged and currently under study, as applying halting gradients cleanly outside a recurrent setting requires care.
So the current layer should be seen as an "Adaptive Computation Weighting" variant: faithful to ACT in the forward pass, simplified by design, with learned halting planned next.
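For reference, a minimal sketch of the per-position ponder statistics in this mean-field variant, with illustrative names (not the dlib implementation): given the per-step halting probabilities over a fixed representation, the layer counts the steps N taken until the 1 − ε threshold is reached and the remainder R spent on the final step, and ρ = N + R is the ponder cost Graves penalizes.

```cpp
#include <vector>

// Illustrative only: per-position ponder statistics for the mean-field variant.
float ponder_cost(const std::vector<float>& halt_probs, float eps = 0.01f)
{
    float cumulative = 0.0f;
    int N = 0;
    for (float h : halt_probs)
    {
        ++N;
        if (cumulative + h >= 1.0f - eps) break;   // early stop: threshold reached
        cumulative += h;
    }
    float R = 1.0f - cumulative;                   // remainder spent on the last step
    return static_cast<float>(N) + R;              // rho = N + R
}
```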

@davisking (Owner) commented:

Sounds good. Can you split out the other fixes you mentioned in a separate PR and send me those by themselves? :D

@joelnn (Contributor, Author) commented Dec 29, 2025

@Cydral I don't have a direct interest in the ACT layer or the DNN pieces in general, I was just checking that the tests pass on my platform. So I will excuse myself from reviewing #3124 etc. If you have an alternative fix to this PR, that would be completely fine with me.

@Cydral (Contributor) commented Dec 29, 2025

> @Cydral I don't have a direct interest in the ACT layer or the DNN pieces in general, I was just checking that the tests pass on my platform. So I will excuse myself from reviewing #3124 etc. If you have an alternative fix to this PR, that would be completely fine with me.
I understand the merge was done to get the tests passing. Personally, I hadn't noticed a gradient issue with this layer specifically, other than the point I mentioned earlier: an intrinsic problem in the CUDA code that eventually produced incorrect results. It was a bug I had observed more broadly but couldn't isolate at the time; it was indeed caused by __syncthreads(), which only synchronizes within a single thread block, not across blocks.
We can probably leave the merge as is for now. I was planning to revisit this layer in particular, as it goes hand in hand with another example I'm preparing (Hierarchical Reasoning Model), but since I'm currently focused on other parts, this may take some time...

@Cydral (Contributor) commented Dec 29, 2025

> Sounds good. Can you split out the other fixes you mentioned in a separate PR and send me those by themselves? :D

Because I'm working on a branch cascaded from master, all the fixes I've made end up in the large PR currently being finalized for the whole Transformer implementation. I'll need to fork fresh from the main repository and add just the updated versions of cuda_dlib.h and cuda_dlib.cu. That should work—fingers crossed it doesn't cause merge conflicts with my current changes...

@Cydral (Contributor) commented Dec 30, 2025

@davisking,
Wait, I just pulled the current branch and, looking more closely, I see that the ACT utility function interfaces have changed. There are also code deletions within the layer itself. Could you please revert this so I can simply inject the "fixed" CUDA functions? For reference, other functions are affected by the same issue:
- _cuda_inverse_norms
- _cuda_dot_prods
- _cuda_multiply_conv2
- _cuda_layer_normalize
- _cuda_layer_normalize_gradient
This is the issue I mentioned earlier: __syncthreads() only synchronizes threads within a single block, not across blocks. When using grid_stride_range_y, work is distributed across multiple blocks, so the synchronization barriers don't actually ensure all data is ready before proceeding to the next phase. I've been chasing this bug for months—it becomes much more apparent under heavy GPU load, especially on high-performance cards like the RTX 6000 where the race conditions are more likely to manifest.
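For illustration, one common fix pattern is to split the reduce phase and the use phase into two kernel launches; launches on the same stream run in order, so the second kernel only starts once every block of the first has finished writing its results. This is a sketch with made-up names, not the actual dlib change:

```cpp
// row_sums is assumed zeroed (e.g. with cudaMemset) before the first launch.
__global__ void compute_row_sums(const float* data, float* row_sums,
                                 int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            atomicAdd(&row_sums[r], data[r * cols + c]);
}

__global__ void divide_by_row_sums(float* data, const float* row_sums,
                                   int rows, int cols)
{
    for (int r = blockIdx.y; r < rows; r += gridDim.y)
        for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < cols;
             c += gridDim.x * blockDim.x)
            data[r * cols + c] /= row_sums[r];
}
```

Alternatives include a grid-level synchronization via cooperative groups (where the hardware and launch configuration support it), or restructuring the kernel so that each row is owned by a single block, in which case __syncthreads() is sufficient again.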
I also noticed there were previous reports about precision issues with _cuda_multiply_conv2 (?). This could be related to the same underlying problem noted by @joelnn. I think it would be best to fix the synchronization issue first across all affected functions, and then apply any additional fixes if there are other issues here...
I can provide fixes for all of these functions, but it will take me some time.
