Thank you for the contribution.
In the paper, you mention that DistServe does not consider preemption. During the experiments/benchmarking, how do you control the request rate and the number of tokens generated per request so that the decoding GPU does not hit its memory limit? Thanks.
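For concreteness, this is roughly the kind of client-side control I have in mind (a minimal sketch; the Poisson arrival process, the constants, and the `client.generate` call are illustrative assumptions on my part, not taken from the paper or the DistServe code):

```python
import asyncio
import random

# Illustrative knobs only: a per-second request rate and a per-request cap on
# generated tokens, which together bound how fast the decode-side KV cache grows.
REQUEST_RATE = 2.0       # requests per second (arrival rate)
MAX_OUTPUT_TOKENS = 256  # cap on tokens generated per request

async def send_request(client, prompt: str) -> None:
    # Placeholder for whatever generation API the serving system exposes.
    await client.generate(prompt, max_tokens=MAX_OUTPUT_TOKENS)

async def run_benchmark(client, prompts: list[str]) -> None:
    tasks = []
    for prompt in prompts:
        # Exponentially distributed inter-arrival times give a Poisson arrival
        # process at the configured request rate.
        await asyncio.sleep(random.expovariate(REQUEST_RATE))
        tasks.append(asyncio.create_task(send_request(client, prompt)))
    await asyncio.gather(*tasks)
```

Is the memory limit avoided purely through tuning knobs like these in the benchmark driver, or does the system itself enforce something when the decode GPU approaches its capacity?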