@dhardy dhardy commented Dec 23, 2025

Summary

Optimize fill_bytes and next_u64_from_u32

Details

Eliminates an impossible-to-reach panic in fill_bytes's chunks.into_remainder handling.
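
As a rough illustration of the kind of panic path involved (this is a sketch, not the code from this PR): when filling a byte slice from 32-bit words, the tail returned by `into_remainder` is always shorter than one word, but slicing the word with `..rem.len()` may still compile to a bounds check with a panic branch. The helper name `fill_bytes_from_u32` and the closure-based word source below are hypothetical, chosen only to keep the example self-contained.

```rust
/// Hypothetical helper (not the PR's implementation): fill `dest` with
/// little-endian bytes taken from a source of `u32` words.
pub fn fill_bytes_from_u32(mut next_u32: impl FnMut() -> u32, dest: &mut [u8]) {
    let mut chunks = dest.chunks_exact_mut(4);
    for chunk in &mut chunks {
        // Every chunk is exactly 4 bytes long, so the lengths always match.
        chunk.copy_from_slice(&next_u32().to_le_bytes());
    }

    // Fewer than 4 bytes remain at this point.
    let rem = chunks.into_remainder();
    if !rem.is_empty() {
        let word = next_u32().to_le_bytes();
        // Writing `rem.copy_from_slice(&word[..rem.len()])` relies on a
        // bounds check that can panic in principle. Zipping the two
        // sequences copies at most 4 bytes and has no panic branch.
        for (dst, src) in rem.iter_mut().zip(word) {
            *dst = src;
        }
    }
}
```

Called with a 7-byte buffer, this sketch draws two words: the first fills bytes 0..4 and the second supplies the 3-byte tail.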

Reduces next_u64_from_u32 to a single call instruction.
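
The body of `next_u64_from_u32` is not shown in this conversation. As an assumption for illustration only, a function with this name would typically combine two consecutive 32-bit outputs into one 64-bit value, low word first. The closure-based signature below is hypothetical and is not claimed to match the PR.

```rust
/// Hypothetical sketch: build a `u64` from two consecutive `u32` outputs,
/// using the first word as the low half (an assumed convention).
pub fn next_u64_from_u32(mut next_u32: impl FnMut() -> u32) -> u64 {
    let lo = u64::from(next_u32());
    let hi = u64::from(next_u32());
    (hi << 32) | lo
}
```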

I'm not sure the result is optimal (there are several possible jumps in fill_bytes's chunk-iterator / memcpy code) but it's all I have time for and likely good enough.

Code on godbolt.

Benchmarks in progress...

@dhardy dhardy requested a review from newpavlov December 23, 2025 12:32
dhardy commented Dec 23, 2025

Benchmarks (using https://github.com/rust-random/rand/pull/1697)
random_bytes/pcg32      time:   [314.37 ns 314.76 ns 315.27 ns]
                        thrpt:  [3.0250 GiB/s 3.0299 GiB/s 3.0336 GiB/s]
                 change:
                        time:   [+5.1186% +5.3365% +5.5636%] (p = 0.00 < 0.05)
                        thrpt:  [-5.2704% -5.0661% -4.8694%]
                        Performance has regressed.
random_bytes/pcg64      time:   [253.99 ns 254.13 ns 254.29 ns]
                        thrpt:  [3.7503 GiB/s 3.7527 GiB/s 3.7548 GiB/s]
                 change:
                        time:   [+0.2050% +0.2781% +0.3456%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3444% -0.2773% -0.2046%]
                        Change within noise threshold.
Found 24 outliers among 100 measurements (24.00%)
  6 (6.00%) low mild
  6 (6.00%) high mild
  12 (12.00%) high severe
random_bytes/pcg64mcg   time:   [212.15 ns 212.19 ns 212.22 ns]
                        thrpt:  [4.4937 GiB/s 4.4945 GiB/s 4.4953 GiB/s]
                 change:
                        time:   [-3.9152% -3.8398% -3.7600%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9069% +3.9932% +4.0747%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
random_bytes/pcg64dxsm  time:   [256.53 ns 256.66 ns 256.82 ns]
                        thrpt:  [3.7134 GiB/s 3.7157 GiB/s 3.7176 GiB/s]
                 change:
                        time:   [+4.4420% +4.5508% +4.6617%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4541% -4.3527% -4.2531%]
                        Performance has regressed.
random_bytes/chacha8    time:   [231.02 ns 231.09 ns 231.19 ns]
                        thrpt:  [4.1251 GiB/s 4.1269 GiB/s 4.1282 GiB/s]
                 change:
                        time:   [-12.047% -11.943% -11.841%] (p = 0.00 < 0.05)
                        thrpt:  [+13.432% +13.563% +13.697%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
random_bytes/chacha12   time:   [301.49 ns 301.58 ns 301.67 ns]
                        thrpt:  [3.1614 GiB/s 3.1623 GiB/s 3.1632 GiB/s]
                 change:
                        time:   [-10.197% -10.080% -9.9426%] (p = 0.00 < 0.05)
                        thrpt:  [+11.040% +11.210% +11.354%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
random_bytes/chacha20   time:   [439.97 ns 440.10 ns 440.24 ns]
                        thrpt:  [2.1663 GiB/s 2.1670 GiB/s 2.1676 GiB/s]
                 change:
                        time:   [-4.0692% -4.0131% -3.9606%] (p = 0.00 < 0.05)
                        thrpt:  [+4.1239% +4.1808% +4.2418%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
random_bytes/std        time:   [308.83 ns 308.91 ns 308.98 ns]
                        thrpt:  [3.0865 GiB/s 3.0873 GiB/s 3.0880 GiB/s]
                 change:
                        time:   [-3.9856% -3.9010% -3.8220%] (p = 0.00 < 0.05)
                        thrpt:  [+3.9739% +4.0593% +4.1511%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
random_bytes/small      time:   [171.11 ns 171.17 ns 171.23 ns]
                        thrpt:  [5.5696 GiB/s 5.5714 GiB/s 5.5733 GiB/s]
                 change:
                        time:   [-0.2274% -0.1485% -0.0733%] (p = 0.00 < 0.05)
                        thrpt:  [+0.0733% +0.1487% +0.2280%]
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
random_bytes/os         time:   [1.4060 µs 1.4061 µs 1.4063 µs]
                        thrpt:  [694.44 MiB/s 694.50 MiB/s 694.56 MiB/s]
                 change:
                        time:   [-0.2810% -0.2473% -0.2165%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2170% +0.2480% +0.2818%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
random_bytes/thread     time:   [313.99 ns 314.76 ns 315.48 ns]
                        thrpt:  [3.0229 GiB/s 3.0299 GiB/s 3.0372 GiB/s]
                 change:
                        time:   [-1.0191% -0.8806% -0.7515%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7572% +0.8884% +1.0295%]
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

random_u32/pcg32        time:   [1.0445 ns 1.0446 ns 1.0448 ns]
                        thrpt:  [3.5655 GiB/s 3.5661 GiB/s 3.5667 GiB/s]
                 change:
                        time:   [-0.0583% -0.0350% -0.0122%] (p = 0.00 < 0.05)
                        thrpt:  [+0.0122% +0.0350% +0.0583%]
                        Change within noise threshold.
Found 218 outliers among 1000 measurements (21.80%)
  17 (1.70%) high mild
  201 (20.10%) high severe
random_u32/pcg64        time:   [1.3556 ns 1.3565 ns 1.3573 ns]
                        thrpt:  [2.7447 GiB/s 2.7463 GiB/s 2.7481 GiB/s]
                 change:
                        time:   [-0.6592% -0.5683% -0.4817%] (p = 0.00 < 0.05)
                        thrpt:  [+0.4840% +0.5715% +0.6635%]
                        Change within noise threshold.
Found 206 outliers among 1000 measurements (20.60%)
  28 (2.80%) low severe
  19 (1.90%) high mild
  159 (15.90%) high severe
random_u32/pcg64mcg     time:   [944.56 ps 944.68 ps 944.81 ps]
                        thrpt:  [3.9429 GiB/s 3.9434 GiB/s 3.9439 GiB/s]
                 change:
                        time:   [-0.1245% -0.0994% -0.0739%] (p = 0.00 < 0.05)
                        thrpt:  [+0.0739% +0.0995% +0.1247%]
                        Change within noise threshold.
Found 154 outliers among 1000 measurements (15.40%)
  23 (2.30%) high mild
  131 (13.10%) high severe
random_u32/pcg64dxsm    time:   [1.4340 ns 1.4342 ns 1.4343 ns]
                        thrpt:  [2.5972 GiB/s 2.5975 GiB/s 2.5978 GiB/s]
                 change:
                        time:   [+0.2881% +0.3116% +0.3358%] (p = 0.00 < 0.05)
                        thrpt:  [-0.3347% -0.3107% -0.2873%]
                        Change within noise threshold.
Found 163 outliers among 1000 measurements (16.30%)
  20 (2.00%) low severe
  24 (2.40%) high mild
  119 (11.90%) high severe
random_u32/chacha8      time:   [1.1834 ns 1.1837 ns 1.1839 ns]
                        thrpt:  [3.1465 GiB/s 3.1473 GiB/s 3.1480 GiB/s]
                 change:
                        time:   [-0.6564% -0.6274% -0.5991%] (p = 0.00 < 0.05)
                        thrpt:  [+0.6027% +0.6314% +0.6608%]
                        Change within noise threshold.
Found 139 outliers among 1000 measurements (13.90%)
  58 (5.80%) low mild
  34 (3.40%) high mild
  47 (4.70%) high severe
random_u32/chacha12     time:   [1.2435 ns 1.2439 ns 1.2442 ns]
                        thrpt:  [2.9941 GiB/s 2.9949 GiB/s 2.9957 GiB/s]
                 change:
                        time:   [-1.2178% -1.1768% -1.1374%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1505% +1.1908% +1.2328%]
                        Performance has improved.
Found 101 outliers among 1000 measurements (10.10%)
  96 (9.60%) high mild
  5 (0.50%) high severe
random_u32/chacha20     time:   [1.7942 ns 1.7945 ns 1.7949 ns]
                        thrpt:  [2.0755 GiB/s 2.0759 GiB/s 2.0763 GiB/s]
                 change:
                        time:   [-0.6290% -0.5986% -0.5681%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5714% +0.6022% +0.6329%]
                        Change within noise threshold.
Found 57 outliers among 1000 measurements (5.70%)
  40 (4.00%) high mild
  17 (1.70%) high severe
random_u32/std          time:   [1.1958 ns 1.1961 ns 1.1964 ns]
                        thrpt:  [3.1137 GiB/s 3.1145 GiB/s 3.1153 GiB/s]
                 change:
                        time:   [+0.6913% +0.7184% +0.7474%] (p = 0.00 < 0.05)
                        thrpt:  [-0.7418% -0.7133% -0.6866%]
                        Change within noise threshold.
Found 46 outliers among 1000 measurements (4.60%)
  8 (0.80%) low mild
  29 (2.90%) high mild
  9 (0.90%) high severe
random_u32/small        time:   [636.64 ps 636.77 ps 636.90 ps]
                        thrpt:  [5.8491 GiB/s 5.8503 GiB/s 5.8515 GiB/s]
                 change:
                        time:   [-0.0082% +0.0193% +0.0487%] (p = 0.16 > 0.05)
                        thrpt:  [-0.0486% -0.0193% +0.0082%]
                        No change in performance detected.
Found 218 outliers among 1000 measurements (21.80%)
  21 (2.10%) high mild
  197 (19.70%) high severe
random_u32/os           time:   [14.479 ns 14.482 ns 14.486 ns]
                        thrpt:  [263.34 MiB/s 263.41 MiB/s 263.47 MiB/s]
                 change:
                        time:   [-1.1040% -1.0693% -1.0224%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0330% +1.0809% +1.1164%]
                        Performance has improved.
Found 29 outliers among 1000 measurements (2.90%)
  1 (0.10%) low severe
  24 (2.40%) high mild
  4 (0.40%) high severe
random_u32/thread       time:   [1.2000 ns 1.2003 ns 1.2006 ns]
                        thrpt:  [3.1029 GiB/s 3.1037 GiB/s 3.1045 GiB/s]
                 change:
                        time:   [+0.4258% +0.4542% +0.4839%] (p = 0.00 < 0.05)
                        thrpt:  [-0.4815% -0.4521% -0.4240%]
                        Change within noise threshold.
Found 26 outliers among 1000 measurements (2.60%)
  1 (0.10%) low mild
  24 (2.40%) high mild
  1 (0.10%) high severe

random_u64/pcg32        time:   [2.0870 ns 2.0877 ns 2.0882 ns]
                        thrpt:  [3.5679 GiB/s 3.5688 GiB/s 3.5699 GiB/s]
                 change:
                        time:   [-0.1059% -0.0577% -0.0153%] (p = 0.01 < 0.05)
                        thrpt:  [+0.0153% +0.0577% +0.1060%]
                        Change within noise threshold.
Found 167 outliers among 1000 measurements (16.70%)
  8 (0.80%) low severe
  28 (2.80%) high mild
  131 (13.10%) high severe
random_u64/pcg64        time:   [1.3354 ns 1.3391 ns 1.3426 ns]
                        thrpt:  [5.5495 GiB/s 5.5641 GiB/s 5.5792 GiB/s]
                 change:
                        time:   [-0.0613% +0.1906% +0.4601%] (p = 0.16 > 0.05)
                        thrpt:  [-0.4580% -0.1902% +0.0613%]
                        No change in performance detected.
Found 286 outliers among 1000 measurements (28.60%)
  141 (14.10%) low severe
  17 (1.70%) high mild
  128 (12.80%) high severe
random_u64/pcg64mcg     time:   [948.41 ps 948.86 ps 949.30 ps]
                        thrpt:  [7.8485 GiB/s 7.8522 GiB/s 7.8559 GiB/s]
                 change:
                        time:   [-0.2421% -0.1750% -0.1138%] (p = 0.00 < 0.05)
                        thrpt:  [+0.1139% +0.1753% +0.2427%]
                        Change within noise threshold.
Found 39 outliers among 1000 measurements (3.90%)
  36 (3.60%) high mild
  3 (0.30%) high severe
random_u64/pcg64dxsm    time:   [1.2211 ns 1.2214 ns 1.2218 ns]
                        thrpt:  [6.0982 GiB/s 6.1000 GiB/s 6.1016 GiB/s]
                 change:
                        time:   [-14.876% -14.855% -14.833%] (p = 0.00 < 0.05)
                        thrpt:  [+17.416% +17.447% +17.476%]
                        Performance has improved.
Found 43 outliers among 1000 measurements (4.30%)
  41 (4.10%) high mild
  2 (0.20%) high severe
random_u64/chacha8      time:   [1.6276 ns 1.6279 ns 1.6282 ns]
                        thrpt:  [4.5759 GiB/s 4.5768 GiB/s 4.5777 GiB/s]
                 change:
                        time:   [+3.2994% +3.3265% +3.3525%] (p = 0.00 < 0.05)
                        thrpt:  [-3.2438% -3.2194% -3.1940%]
                        Performance has regressed.
Found 93 outliers among 1000 measurements (9.30%)
  4 (0.40%) low mild
  26 (2.60%) high mild
  63 (6.30%) high severe
random_u64/chacha12     time:   [2.1290 ns 2.1293 ns 2.1296 ns]
                        thrpt:  [3.4986 GiB/s 3.4991 GiB/s 3.4996 GiB/s]
                 change:
                        time:   [-0.0708% -0.0436% -0.0160%] (p = 0.00 < 0.05)
                        thrpt:  [+0.0160% +0.0436% +0.0708%]
                        Change within noise threshold.
Found 92 outliers among 1000 measurements (9.20%)
  2 (0.20%) low severe
  4 (0.40%) low mild
  22 (2.20%) high mild
  64 (6.40%) high severe
random_u64/chacha20     time:   [3.2094 ns 3.2099 ns 3.2104 ns]
                        thrpt:  [2.3207 GiB/s 2.3211 GiB/s 2.3215 GiB/s]
                 change:
                        time:   [-0.2781% -0.2449% -0.2122%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2126% +0.2455% +0.2789%]
                        Change within noise threshold.
Found 245 outliers among 1000 measurements (24.50%)
  126 (12.60%) low severe
  14 (1.40%) low mild
  18 (1.80%) high mild
  87 (8.70%) high severe
random_u64/std          time:   [2.0725 ns 2.0767 ns 2.0808 ns]
                        thrpt:  [3.5807 GiB/s 3.5878 GiB/s 3.5949 GiB/s]
                 change:
                        time:   [+3.5284% +3.6897% +3.8350%] (p = 0.00 < 0.05)
                        thrpt:  [-3.6933% -3.5584% -3.4081%]
                        Performance has regressed.
random_u64/small        time:   [650.12 ps 650.26 ps 650.40 ps]
                        thrpt:  [11.455 GiB/s 11.458 GiB/s 11.460 GiB/s]
                 change:
                        time:   [-0.8041% -0.7763% -0.7482%] (p = 0.00 < 0.05)
                        thrpt:  [+0.7539% +0.7824% +0.8106%]
                        Change within noise threshold.
Found 48 outliers among 1000 measurements (4.80%)
  48 (4.80%) high mild
random_u64/os           time:   [20.870 ns 20.875 ns 20.881 ns]
                        thrpt:  [365.38 MiB/s 365.47 MiB/s 365.56 MiB/s]
                 change:
                        time:   [-1.4342% -1.3997% -1.3659%] (p = 0.00 < 0.05)
                        thrpt:  [+1.3848% +1.4195% +1.4551%]
                        Performance has improved.
Found 217 outliers among 1000 measurements (21.70%)
  22 (2.20%) low severe
  11 (1.10%) low mild
  16 (1.60%) high mild
  168 (16.80%) high severe
random_u64/thread       time:   [2.0048 ns 2.0052 ns 2.0056 ns]
                        thrpt:  [3.7148 GiB/s 3.7156 GiB/s 3.7164 GiB/s]
                 change:
                        time:   [+0.2088% +0.2418% +0.2755%] (p = 0.00 < 0.05)
                        thrpt:  [-0.2747% -0.2412% -0.2084%]
                        Change within noise threshold.
Found 44 outliers among 1000 measurements (4.40%)
  43 (4.30%) high mild
  1 (0.10%) high severe
reseeding_bytes/chacha20_4k
                        time:   [381.28 µs 381.36 µs 381.45 µs]
                        thrpt:  [2.5602 GiB/s 2.5607 GiB/s 2.5613 GiB/s]
                 change:
                        time:   [-1.2454% -1.1847% -1.1231%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1359% +1.1989% +1.2612%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
reseeding_bytes/chacha20_16k
                        time:   [366.15 µs 366.23 µs 366.32 µs]
                        thrpt:  [2.6659 GiB/s 2.6665 GiB/s 2.6671 GiB/s]
                 change:
                        time:   [-0.6439% -0.6060% -0.5690%] (p = 0.00 < 0.05)
                        thrpt:  [+0.5723% +0.6096% +0.6480%]
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
reseeding_bytes/chacha20_32k
                        time:   [363.90 µs 364.02 µs 364.13 µs]
                        thrpt:  [2.6819 GiB/s 2.6828 GiB/s 2.6836 GiB/s]
                 change:
                        time:   [-0.3650% -0.3314% -0.2986%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2995% +0.3325% +0.3663%]
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
reseeding_bytes/chacha20_64k
                        time:   [362.82 µs 363.09 µs 363.42 µs]
                        thrpt:  [2.6871 GiB/s 2.6896 GiB/s 2.6916 GiB/s]
                 change:
                        time:   [-0.5035% -0.4485% -0.3832%] (p = 0.00 < 0.05)
                        thrpt:  [+0.3846% +0.4505% +0.5060%]
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
reseeding_bytes/chacha20_256k
                        time:   [361.53 µs 361.67 µs 361.83 µs]
                        thrpt:  [2.6990 GiB/s 2.7002 GiB/s 2.7012 GiB/s]
                 change:
                        time:   [-0.3746% -0.3252% -0.2711%] (p = 0.00 < 0.05)
                        thrpt:  [+0.2719% +0.3263% +0.3760%]
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
reseeding_bytes/chacha20_1024k
                        time:   [361.01 µs 361.06 µs 361.12 µs]
                        thrpt:  [2.7042 GiB/s 2.7047 GiB/s 2.7051 GiB/s]
                 change:
                        time:   [-1.0130% -0.9190% -0.8275%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8344% +0.9275% +1.0234%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild

random_bytes throughput for ChaCha* looks a bit better (4-13.5% higher); otherwise little has changed. Most results deviate by under 1%, though there are a few spurious results, most notably random_u64/pcg64dxsm, which is ~15% faster.
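
When reading the figures above, note that the time and throughput changes are two views of the same measurement: for a fixed amount of data, throughput is inversely proportional to time. A minimal sketch of the conversion follows; the function name `throughput_change` is made up for this example.

```rust
/// Convert a relative change in time (e.g. -0.11943 for -11.943%)
/// into the corresponding relative change in throughput.
pub fn throughput_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

For example, the -11.943% time change for random_bytes/chacha8 corresponds to about +13.56% throughput, and the -14.855% time change for random_u64/pcg64dxsm to about +17.45%, matching the reported thrpt lines.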
