faster NS algorithm (hybrid with 4 iterations) #10
This is my attempt to improve the speed of the Newton-Schulz algorithm by making it converge with only four iterations.
I want to highlight that this approach changes the underlying algorithm. Extra verification may be desirable before merging the PR. Any tests and comments are welcome!
Changes
Fewer iterations:
We remove the previous normalization and switch to AOL rescaling, which is further explained in this paper: https://arxiv.org/pdf/2208.03160
This consists of computing W@W^t using ns_line_1 and then computing the scaling factors fast_inv_sqrt(reduce_sum(abs(W@W^t), axis=-1)), which is a vector. Since the main operation needed to compute these factors corresponds to ns_line_1, we can fuse it with the first Newton-Schulz iterate. Furthermore, this gives a better starting point for the Newton-Schulz iterations, as the matrix is closer to orthogonal. Thanks to this, we can save one iteration of Newton-Schulz. However, the non-linear nature of AOL prevents the use of Jiacheng's approach to computing new polynomial factors, so we rely on a genetic algorithm to optimize them (see https://github.com/thib-s/flash-newton-schulz). This is done in the file opt_params.py, which can be run to find better polynomials. A minimal sketch of the fused rescaling is shown below.
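For illustration, here is a minimal sketch of the fused step (not the exact code of this PR): the coefficients a, b, c stand in for the optimized first-iteration polynomial, and X is assumed to have at least as many columns as rows, as in the usual Newton-Schulz setup.

```python
import torch

def fused_aol_first_ns_step(X: torch.Tensor, a: float, b: float, c: float) -> torch.Tensor:
    # A = X @ X^T is needed both for the AOL scaling factors and for the
    # quintic Newton-Schulz step, so it is computed only once.
    A = X @ X.mT
    # AOL scaling factors: s_i = 1 / sqrt(sum_j |A_ij|).
    # Rescaling the rows of X by s bounds its spectral norm by 1,
    # replacing the previous Frobenius-norm normalization.
    s = A.abs().sum(dim=-1).clamp_min(1e-12).rsqrt()
    X = s[..., :, None] * X
    A = s[..., :, None] * A * s[..., None, :]  # keep A consistent with the rescaled X
    # First (quintic) Newton-Schulz iteration on the rescaled matrix.
    B = b * A + c * A @ A
    return a * X + B @ X
```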
Triton kernel for ns_line_3:
I noticed that the ns_line_3 function was reading X multiple times, so I wrote a Triton kernel to avoid loading the same data more than once. This gives a marginal speedup on small matrices, where loading data is the bottleneck. (It can be removed for increased code readability.) A simplified sketch of the kernel is shown below.
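For illustration, here is a simplified sketch of what such a kernel can look like (block sizes, names, and the launch wrapper are illustrative, not the exact code of this PR): it fuses the a * X term into the epilogue of the B @ X matmul, so no separate elementwise pass over X and the intermediate result is needed.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def ns_line_3_kernel(b_ptr, x_ptr, out_ptr, M, N,
                     stride_bm, stride_bk, stride_xk, stride_xn,
                     stride_om, stride_on, alpha,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of  out = alpha * X + B @ X,
    # where B is (M, M) and X is (M, N).
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, M, BLOCK_K):  # B is square, so the reduction length is M
        b_mask = (offs_m[:, None] < M) & ((k + offs_k)[None, :] < M)
        x_mask = ((k + offs_k)[:, None] < M) & (offs_n[None, :] < N)
        b_tile = tl.load(b_ptr + offs_m[:, None] * stride_bm + (k + offs_k)[None, :] * stride_bk,
                         mask=b_mask, other=0.0)
        x_tile = tl.load(x_ptr + (k + offs_k)[:, None] * stride_xk + offs_n[None, :] * stride_xn,
                         mask=x_mask, other=0.0)
        acc = tl.dot(b_tile, x_tile, acc)
    # Fused epilogue: add alpha * X to the tile before it is written back,
    # instead of launching a separate elementwise kernel that re-reads X and the matmul output.
    out_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    x_out = tl.load(x_ptr + offs_m[:, None] * stride_xk + offs_n[None, :] * stride_xn,
                    mask=out_mask, other=0.0)
    acc += alpha * x_out.to(tl.float32)
    tl.store(out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on,
             acc.to(out_ptr.dtype.element_ty), mask=out_mask)

def ns_line_3(B: torch.Tensor, X: torch.Tensor, alpha: float) -> torch.Tensor:
    # out = alpha * X + B @ X, with B square.
    M, N = X.shape
    out = torch.empty_like(X)
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    ns_line_3_kernel[grid](B, X, out, M, N,
                           B.stride(0), B.stride(1), X.stride(0), X.stride(1),
                           out.stride(0), out.stride(1), alpha,
                           BLOCK_M=64, BLOCK_N=64, BLOCK_K=64)
    return out
```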
Tests
Tests on the 160m training script do not show any direct regression.
While this is very promising, I cannot test it at a larger scale. It would be great if someone could confirm the absence of regression at larger scales.
Current results:
Using an L40S GPU, we obtain a decent speedup.
When tested on random uniform matrices, the outputs seem closer to orthogonal.
Extra tests also showed stable results on heavy-tailed distributions (Levy).
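For anyone who wants to reproduce these checks, here is a minimal sketch of the kind of measurement involved (orthogonalize is a placeholder for the Newton-Schulz routine under test, and the Cauchy samples are only a heavy-tailed stand-in for the Levy inputs mentioned above):

```python
import torch

def orthogonality_error(X: torch.Tensor) -> float:
    # Frobenius distance between X @ X^T and the identity; 0 means the rows are exactly orthonormal.
    m = X.shape[0]
    eye = torch.eye(m, device=X.device, dtype=X.dtype)
    return (X @ X.mT - eye).norm().item()

# Random uniform inputs, plus a heavy-tailed (Cauchy) proxy for the Levy case.
G_uniform = torch.rand(1024, 2048)
G_heavy = torch.distributions.Cauchy(0.0, 1.0).sample((1024, 2048))
# print(orthogonality_error(orthogonalize(G_uniform)), orthogonality_error(orthogonalize(G_heavy)))
```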