
Conversation

@mathieucarbou
Member

Reverts #89

@mathieucarbou
Member Author

This merge was a mistake - the ESP32 has a hardware-implemented rand().

@willmmiles

willmmiles commented Sep 18, 2025

I'd assumed you'd already benchmarked the new code and proven it was faster?

Before approving #89, I checked the ESP32 hardware RNG API. The documentation clearly states that it will busy-wait if it believes there's insufficient entropy to return cryptographic-grade random numbers, and that it rate-limits incoming calls. AsyncTCP doesn't need true randomness for poll throttling, and any busy-waiting would be inappropriate for this application, so using the hardware RNG isn't a good fit here.

It's the old /dev/urandom vs /dev/random debate ... a nonblocking and guaranteed fast non-cryptographic PRNG really is useful for a lot of applications!

@me-no-dev
Member

It busy-waits only if called again within about 16 us, which I doubt will ever be the case here

@mathieucarbou mathieucarbou changed the title Revert "Switch from rand() to an XOR Shift to favor speed over random quality" Switch from rand() to esp_random() on ESP32 and XOR Shift on Libretiny Sep 18, 2025
@mathieucarbou mathieucarbou marked this pull request as ready for review September 18, 2025 09:34
@mathieucarbou
Member Author

@me-no-dev @willmmiles PR updated

@willmmiles

It busy-waits only if called again within about 16 us, which I doubt will ever be the case here

I'd expect to hit that any time there are more than two connections open, especially on faster chips. Every timer tick, LwIP loops through the open connections, calling the poll callback for each back-to-back. It's not quite a tight loop but it'll be pretty darn fast, especially if it decides not to queue a poll event for a client.
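To illustrate the shape of that loop, here is a simplified paraphrase of what the slow-timer pass looks like (purely illustrative, not the actual lwIP source; the struct and names are invented):

#include <stdint.h>

// Hypothetical, simplified model of lwIP's slow-timer poll pass (not real lwIP code).
// The point: every active connection is visited in one back-to-back loop, so the
// poll callbacks can run only microseconds apart.
struct conn_t {
  conn_t *next;
  uint8_t polltmr;
  uint8_t pollinterval;
  void (*poll_cb)(conn_t *conn);  // in AsyncTCP this would be AsyncTCP_detail::tcp_poll
};

void slow_timer_tick(conn_t *active_list) {  // runs on each coarse timer tick
  for (conn_t *c = active_list; c != nullptr; c = c->next) {
    if (++c->polltmr >= c->pollinterval) {
      c->polltmr = 0;
      c->poll_cb(c);  // called back-to-back for each connection that is due
    }
  }
}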

In any event, personally I'd aim for a solution that's got guaranteed good performance rather than hoping our code is slow.

@mathieucarbou
Member Author

In any event, personally I'd aim for a solution that's got guaranteed good performance rather than hoping our code is slow.

I agree with this assessment too.

How can this be tested?
By doing a little bench test between these two solutions, after WiFi is connected and after we have started an AsyncTCP server, for example?

@willmmiles

In any event, personally I'd aim for a solution that's got guaranteed good performance rather than hoping our code is slow.

I agree with this assessment too.

How can this be tested? By doing a little bench test between these two solutions, after WiFi is connected and after we have started an AsyncTCP server, for example?

We're firmly in the realm of microbenchmarking on this ...

  • Set CONFIG_ASYNC_TCP_QUEUE_SIZE to a relatively low number, to encourage poll coalescence. (It's only used for that purpose now anyways - it's no longer a hard limit, so there's no safety risk here.)
  • Open many idle connections and disable timeouts. Polling is the only path that triggers the randomness calls, so we don't need any data traffic.
  • Count calls of AsyncTCP_detail::tcp_poll, poll events queued, and executions of the poll callback from the client (since there are two discard paths).
  • Count iterations of loop() -- ie. measure the "idle time" not consumed by AsyncTCP.
  • Periodically emit the statistics

The key metric is number of loops per tcp_poll, corrected for the cost of event queuing and execution. If we wanted to try a more invasive benchmark, we could arrange to always discard the event to eliminate those costs, while keeping the random conditional path.
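A minimal sketch of those counters, assuming invented names and hook points (this is not existing AsyncTCP code, just the shape of the instrumentation):

#include <Arduino.h>
#include <atomic>

// Hypothetical counters; each would be incremented from the corresponding
// AsyncTCP hook point. loop() iterations stand in for idle time.
static std::atomic<uint32_t> tcp_poll_calls{0};     // AsyncTCP_detail::tcp_poll entered
static std::atomic<uint32_t> poll_events_queued{0}; // poll events actually queued
static std::atomic<uint32_t> poll_callbacks_run{0}; // client poll callbacks executed
static uint32_t loop_iterations = 0;

void setup() {
  Serial.begin(115200);
  // ... bring up WiFi, start the AsyncTCP server, open many idle connections ...
}

void loop() {
  loop_iterations++;
  static uint32_t last_report = 0;
  if (millis() - last_report >= 5000) {
    last_report = millis();
    uint32_t polls = tcp_poll_calls.load();
    Serial.printf("loops=%lu tcp_poll=%lu queued=%lu callbacks=%lu loops/poll=%.1f\n",
                  (unsigned long)loop_iterations, (unsigned long)polls,
                  (unsigned long)poll_events_queued.load(),
                  (unsigned long)poll_callbacks_run.load(),
                  polls ? (double)loop_iterations / polls : 0.0);
  }
}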

@mathieucarbou
Member Author

mathieucarbou commented Sep 18, 2025

In any event, personally I'd aim for a solution that's got guaranteed good performance rather than hoping our code is slow.

I agree with this assessment too.
How can this be tested? By doing a little bench test between these two solutions, after WiFi is connected and after we have started an AsyncTCP server, for example?

We're firmly in the realm of microbenchmarking on this ...

  • Set CONFIG_ASYNC_TCP_QUEUE_SIZE to a relatively low number, to encourage poll coalescence. (It's only used for that purpose now anyways - it's no longer a hard limit, so there's no safety risk here.)
  • Open many idle connections and disable timeouts. Polling is the only path that triggers the randomness calls, so we don't need any data traffic.
  • Count calls of AsyncTCP_detail::tcp_poll, poll events queued, and executions of the poll callback from the client (since there are two discard paths).
  • Count iterations of loop() -- ie. measure the "idle time" not consumed by AsyncTCP.
  • Periodically emit the statistics

The key metric is number of loops per tcp_poll, corrected for the cost of event queuing and execution. If we wanted to try a more invasive benchmark, we could arrange to always discard the event to eliminate those costs, while keeping the random conditional path.

Why do we need all that? A main.cpp compiled with WiFi on (to have some entropy) is enough; then we trigger some calls to rand() and xorshift. We don't need to test within the context of AsyncTCP: we just want to know which one is faster, right?

At least with a loop we'll clearly see the impact of the busy wait, since it will be exaggerated, as if there were many concurrent connections.

@willmmiles

Why do we need all that? A main.cpp compiled with WiFi on (to have some entropy) is enough; then we trigger some calls to rand() and xorshift. We don't need to test within the context of AsyncTCP: we just want to know which one is faster, right?

I'm anticipating an argument that rand() would be equally fast when called in AsyncTCP's context because there are unavoidable delays in LwIP/AsyncTCP code paths that will ultimately meet esp_random()'s back-to-back call requirements. I don't think there's any disagreement that there can be a delay if it's called in a tight loop.

@me-no-dev
Member

As @willmmiles said, esp_random() blocks only if called within 16 microseconds of the previous call (and then for at most 16 microseconds)

@me-no-dev
Member

me-no-dev commented Sep 18, 2025

if the poll is called every 16 microseconds or more, then esp_random will take two register reads + one ccount read + add and compare
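To spell out the fast path vs the slow path, here is a rough paraphrase of that rate limiting (not the actual ESP-IDF source; hw_rng_read() and cycle_count() are placeholders for the hardware accesses):

#include <stdint.h>

#define MIN_CYCLES_BETWEEN_READS (16 * 240)  // ~16 us at 240 MHz (assumed CPU clock)

extern uint32_t hw_rng_read(void);  // placeholder: RNG register read
extern uint32_t cycle_count(void);  // placeholder: CPU cycle counter read

uint32_t esp_random_like(void) {
  static uint32_t last_cycles = 0;
  uint32_t result = 0, now;
  do {
    now = cycle_count();        // the ccount read
    result ^= hw_rng_read();    // first register read
  } while (now - last_cycles < MIN_CYCLES_BETWEEN_READS);  // add + compare; only loops
                                                           // (busy-waits) on back-to-back calls
  last_cycles = now;
  return result ^ hw_rng_read();  // second register read
}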

@willmmiles

if the poll is called every 16 microseconds or more, then esp_random will take two register reads + one ccount read + add and compare

LwIP's poll handler runs infrequently -- but every time it runs, it loops over all open connections and runs the poll callback back-to-back, so AsyncTCP_detail::tcp_poll gets called many times in short succession.

@mathieucarbou
Member Author

if the poll is called every 16 microseconds or more, then esp_random will take two register reads + one ccount read + add and compare

LwIP's poll handler runs infrequently -- but every time it runs, it loops over all open connections and runs the poll callback back-to-back, so AsyncTCP_detail::tcp_poll gets called many times in short succession.

that's why I wanted to test the worst case: a loop.

@me-no-dev
Member

that's why I wanted to test the worst case: a loop.

this is not realistic and we know the results already

@mathieucarbou
Member Author

mathieucarbou commented Sep 19, 2025

that's why I wanted to test the worst case: a loop.

this is not realistic and we know the results already

In that case, why not use an algorithm that is deterministic and won't risk slowing down the loop?

I get that this is a trade-off that hinges on whether the loop runs infrequently or not.

If we favor the 80% of use cases where there won't be a loop, then I think we can merge the PR as-is?

Is that ok for you @willmmiles and @me-no-dev ?

@willmmiles

If we favor the 80% of use cases where there won't be a loop, then I think we can merge the PR as-is?

That's not true. There is a loop when LwIP calls the AsyncTCP_detail::tcp_poll handlers for all open connections in short succession. If you have more than one open connection, then you've got a loop that will trigger the busy-wait behavior.

@mathieucarbou
Member Author

mathieucarbou commented Sep 20, 2025

If we favor the 80% of use cases where there won't be a loop, then I think we can merge the PR as-is?

That's not true. There is a loop when LwIP calls the AsyncTCP_detail::tcp_poll handlers for all open connections in short succession. If you have more than one open connection, then you've got a loop that will trigger the busy-wait behavior.

Ok! So then it goes quite against what @me-no-dev said:

that's why I wanted to test the worst case: a loop.

this is not realistic and we know the results already

If we have a loop as soon as we have more than one connection, then this becomes a relevant test, we already know the answer, and in that case the currently merged implementation will avoid this busy wait.

Now, what is perhaps more relevant to test is how much time is really lost in such a loop compared to an xorshift.

That's why I wanted to test both in a loop (worst case), to see by how much they differ.

If they are close, then the optimisation is useless and we just merge this PR, which rolls back to esp_random() for ESP32.

But if the xorshift allows far more iterations, then I think the discussion is closed and we close this PR.

@me-no-dev @willmmiles: does it make sense or not?

@willmmiles

If we have a loop as soon as we have more than one connection, then this becomes a relevant test, we already know the answer, and in that case the currently merged implementation will avoid this busy wait.

Now, what is perhaps more relevant to test is how much time is really lost in such a loop compared to an xorshift.

The confounding factor is that esp_random()'s busy-wait is measured based on the CPU cycle counter. So how much time is wasted depends entirely on how much other work is done in the loop. The wait time for a tight loop is documented (16 microseconds), so I don't think it's valuable to repeat that test. The cost of the other work being done in the loop is critical for understanding what the cost of busy-waiting would be for us.

I believe that LwIP's poll loop will take less than 16us between calls to AsyncTCP_detail::tcp_poll if a poll event is not queued. (It might even be that fast when a poll event is queued, too, on chips as fast as the ESP32-P4.) To test this directly, we'd need to instrument AsyncTCP_detail::tcp_poll with something that stores the CPU cycle counter as it passes through. I can see about sketching a little test program to validate this.
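Something along these lines for the instrumentation (a sketch; the hook point and buffer are assumptions, not existing AsyncTCP code):

#include <Arduino.h>
#include "esp_cpu.h"  // esp_cpu_get_cycle_count(), ESP-IDF v5 / Arduino core 3

// Records the CPU-cycle delta between successive entries into the poll handler,
// so we can check whether back-to-back polls land under ~16 us apart.
static constexpr size_t SAMPLES = 256;
static uint32_t poll_deltas[SAMPLES];
static volatile size_t poll_idx = 0;

// Call this at the top of AsyncTCP_detail::tcp_poll (or from a wrapper around it).
static inline void record_poll_entry() {
  static uint32_t last_cycles = 0;
  uint32_t now = esp_cpu_get_cycle_count();
  if (poll_idx < SAMPLES) {
    poll_deltas[poll_idx++] = now - last_cycles;  // first sample is meaningless; ignore it
  }
  last_cycles = now;
}

// Dump the collected deltas (converted to microseconds) periodically from loop().
void dump_poll_deltas() {
  for (size_t i = 0; i < poll_idx; i++) {
    Serial.printf("%u cycles (~%.2f us)\n", (unsigned)poll_deltas[i],
                  (double)poll_deltas[i] / (double)getCpuFrequencyMhz());
  }
  poll_idx = 0;
}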

@mathieucarbou
Member Author

If we have a loop as soon as we have more than one connection, then this becomes a relevant test, we already know the answer, and in that case the currently merged implementation will avoid this busy wait.
Now, what is perhaps more relevant to test is how much time is really lost in such a loop compared to an xorshift.

The confounding factor is that esp_random()'s busy-wait is measured based on the CPU cycle counter. So how much time is wasted depends entirely on how much other work is done in the loop. The wait time for a tight loop is documented (16 microseconds), so I don't think it's valuable to repeat that test. The cost of the other work being done in the loop is critical for understanding what the cost of busy-waiting would be for us.

I believe that LwIP's poll loop will take less than 16us between calls to AsyncTCP_detail::tcp_poll if a poll event is not queued. (It might even be that fast when a poll event is queued, too, on chips as fast as the ESP32-P4.) To test this directly, we'd need to instrument AsyncTCP_detail::tcp_poll with something that stores the CPU cycle counter as it passes through. I can see about sketching a little test program to validate this.

Thanks!
I understand your first idea about testing within AsyncTCP better now...

My idea was to quickly find a way to decide what to do by getting a rough idea of the relative cost of an xorshift call.

What I mean is that if a call to xor_shift_next() costs ~10 us, then we just switch back to esp_random(), no questions asked, because the two are similar.
But if the cost of a call to xor_shift_next() is really insignificant compared to the potential impact of the busy waits, then I don't see why we would keep esp_random().
That was my reasoning behind my idea to simplify the test.

@willmmiles
Copy link

willmmiles commented Sep 22, 2025

Ah, I see! As I'd mentioned above, I'd assumed you already had some measurements identifying rand() as a performance concern, and benchmarked xor_shift_next() as (at least sometimes) faster. I'd be surprised if xor_shift_next() took very many cycles to execute (I'd expect <100), but empirical proof is often helpful for confirming assumptions like that.
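If we want a per-call cycle count rather than wall-clock time, something like this would do (a sketch; the xor_shift_next() body is the one from the PR, the measurement loop is just illustrative):

#include <Arduino.h>
#include "esp_cpu.h"  // esp_cpu_get_cycle_count()

static uint32_t xor_shift_state = 31;
static uint32_t xor_shift_next() {
  uint32_t x = xor_shift_state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return xor_shift_state = x;
}

void setup() {
  Serial.begin(115200);
  const uint32_t N = 10000;
  volatile uint32_t sink = 0;  // keeps the calls from being optimized away
  uint32_t start = esp_cpu_get_cycle_count();
  for (uint32_t i = 0; i < N; i++) {
    sink = xor_shift_next();
  }
  uint32_t cycles = esp_cpu_get_cycle_count() - start;
  Serial.printf("xor_shift_next(): ~%.1f cycles/call (sink=%lu)\n",
                (double)cycles / N, (unsigned long)sink);
}

void loop() {}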

@mathieucarbou
Member Author

mathieucarbou commented Sep 22, 2025

Ah, I see! As I'd mentioned above, I'd assumed you already had some measurements identifying rand() as a performance concern, and benchmarked xor_shift_next() as (at least sometimes) faster. I'd be surprised if xor_shift_next() took very many cycles to execute (I'd expect <100), but empirical proof is often helpful for confirming assumptions like that.

Yes, here is a summary: https://www.perplexity.ai/search/benchmark-de-rand-random-on-es-21pmlAj4TKqCAl33xTQSnw

And I did a small benchmark to cross-check these findings, because the order of magnitude of the differences is huge.

#include <Arduino.h>
#include <WiFi.h>

static uint32_t xor_shift_state = 31;  // any nonzero seed will do
static uint32_t xor_shift_next() {
  uint32_t x = xor_shift_state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return xor_shift_state = x;
}

#define CALLS    1000UL
#define SLEEP_MS 0
static uint32_t generated = 0;

static int64_t bench_1() {
  int64_t start = esp_timer_get_time();
  for (size_t i = 0; i < CALLS; i++) {
    generated = esp_random();
#if SLEEP_MS > 0
    delay(20);
#endif
  }
  return esp_timer_get_time() - start;
}

static int64_t bench_2() {
  int64_t start = esp_timer_get_time();
  for (size_t i = 0; i < CALLS; i++) {
    generated = xor_shift_next();
#if SLEEP_MS > 0
    delay(20);
#endif
  }
  return esp_timer_get_time() - start;
}

void setup() {
  Serial.begin(115200);
  while (!Serial) {
    continue;
  }

  WiFi.begin("IoT", "");
  while (WiFi.status() != WL_CONNECTED) {
    delay(500);
  }
  Serial.println("Connected to WiFi!");
  Serial.println(WiFi.localIP());
}

static uint32_t last = 0;
void loop() {
  if (millis() - last >= 1000) {
#if SLEEP_MS > 0
    int64_t dt1 = bench_1() - CALLS * SLEEP_MS * 1000UL;
    int64_t dt2 = bench_2() - CALLS * SLEEP_MS * 1000UL;
#else
    int64_t dt1 = bench_1();
    int64_t dt2 = bench_2();
#endif
    Serial.printf("esp_random():     %lu calls in %lld us\n", CALLS, dt1);
    Serial.printf("xor_shift_next(): %lu calls in %lld us\n", CALLS, dt2);
    last = millis();
  }
}

Compiled on the latest Arduino Core 3 for the esp32dev board, with the same pio file we use to compile our examples.

With SLEEP_MS set to 0, so the calls happen in a tight loop, I get (after letting it run for a while):

esp_random():     10000 calls in 5273 us
xor_shift_next(): 10000 calls in 251 us
esp_random():     10000 calls in 5273 us
xor_shift_next(): 10000 calls in 251 us
esp_random():     10000 calls in 5273 us
xor_shift_next(): 10000 calls in 251 us
esp_random():     10000 calls in 5273 us
xor_shift_next(): 10000 calls in 251 us

Since our xorshift is even simpler than the 64-bit ones tested in the article, a call in a loop is about 21 times faster.

So I did another test, to try to figure out the cost of a single call to esp_random() vs xor_shift_next() when called at an interval of 20 ms, to avoid the busy wait.

With SLEEP_MS set to 20 ms and after accounting for the delays, I get (after letting it run for a while):

esp_random():     1000 calls in 543 us
xor_shift_next(): 1000 calls in -8 us
esp_random():     1000 calls in 7 us
xor_shift_next(): 1000 calls in -3 us
esp_random():     1000 calls in 4 us
xor_shift_next(): 1000 calls in 0 us
esp_random():     1000 calls in 4 us
xor_shift_next(): 1000 calls in 0 us
esp_random():     1000 calls in 3 us
xor_shift_next(): 1000 calls in 0 us
esp_random():     1000 calls in 3 us
xor_shift_next(): 1000 calls in 0 us
esp_random():     1000 calls in 4 us
xor_shift_next(): 1000 calls in 0 us

The first calls to esp_random(), or the delays (I don't know which), are costly, but after that xor_shift_next() is still faster by some unknown factor; its cost is negligible compared to calling esp_random().

So to me, in our case there is zero advantage to calling rand(), even if it is hardware-based. We can be highly impacted by the cost of the busy wait when looping, and moreover there is no considerable advantage to using it even when not looping.

That's why I didn't see the need to test within AsyncTCP... the order of magnitude between the two is, I think, significant enough to just decide to always use an xorshift.

@mathieucarbou mathieucarbou marked this pull request as draft September 22, 2025 08:40
@mathieucarbou mathieucarbou changed the title Switch from rand() to esp_random() on ESP32 and XOR Shift on Libretiny Should we switch from rand() to esp_random() on ESP32 and XOR Shift on Libretiny ? Sep 22, 2025
@willmmiles

I think, based on your results, that we should probably close this, then. xorshift is good enough, it's fast, it doesn't require any platform-selection #ifdefs (a big plus as far as I'm concerned!), and ultimately the long-term solution is to get rid of the stochastic behaviour anyway.

@mathieucarbou
Member Author

I agree. If @me-no-dev brings up some good arguments in favor of resurrecting this PR, I will do so, but for now I concur that we can close the PR and keep the xorshift in main.
