Hi~ I've noticed that timing kernels with torch.cuda.Event can show significant run-to-run variance, which makes the results less stable. Since torch.profiler gives a more stable and robust measurement of pure kernel execution time, was there a specific reason for choosing torch.cuda.Event in the benchmark's design?
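
For reference, here is a minimal sketch of the two measurement styles I'm comparing. The helper names, the matmul workload, and the warmup/iteration counts are just placeholders for illustration, not the benchmark's actual harness:

```python
import torch

def timed_with_events(fn, warmup=10, iters=100):
    """Time a CUDA callable with a torch.cuda.Event pair (event-based timing)."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

def timed_with_profiler(fn, warmup=10, iters=100):
    """Collect per-kernel CUDA time with torch.profiler instead."""
    for _ in range(warmup):
        fn()
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CUDA]
    ) as prof:
        for _ in range(iters):
            fn()
        torch.cuda.synchronize()
    # Aggregated per-kernel CUDA times, independent of host-side launch jitter.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

if __name__ == "__main__":
    x = torch.randn(4096, 4096, device="cuda")
    op = lambda: x @ x
    print(f"event timing: {timed_with_events(op):.3f} ms/iter")
    timed_with_profiler(op)
```

In my runs, the event-based numbers fluctuate noticeably more between invocations, while the profiler's per-kernel CUDA times stay fairly consistent, which is what prompted the question.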