I noticed that cudaMemcpy2DAsync is used for KV cache transmission. However, in Paraworker:migrate_blocks there is no torch.cuda.synchronize().
I'm wondering how it is guaranteed that the KV cache transmission has finished before the destination blocks are used?
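For context, here is a minimal sketch of the kind of explicit synchronization I expected to see. The names (transfer_stream, the copy_ stand-in for the cudaMemcpy2DAsync call) are purely illustrative and not taken from the repo:

```python
import torch

# Hypothetical sketch, not the actual repo code: launch the async copy on a
# dedicated stream and return an event marking its completion.
transfer_stream = torch.cuda.Stream()

def migrate_blocks_sketch(src_blocks: torch.Tensor, dst_blocks: torch.Tensor) -> torch.cuda.Event:
    done = torch.cuda.Event()
    with torch.cuda.stream(transfer_stream):
        # stand-in for the cudaMemcpy2DAsync call inside the CUDA extension
        dst_blocks.copy_(src_blocks, non_blocking=True)
        done.record(transfer_stream)
    return done

# Before the blocks are consumed, I expected either
#   done.synchronize()               # block the host until the copy finishes, or
#   compute_stream.wait_event(done)  # order the copy before work on the compute stream
# rather than relying on implicit ordering.
```

Is the guarantee instead coming from issuing the copy and the later reads on the same CUDA stream, so no explicit synchronize is needed?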
Looking forward to your reply, thanks! @interestingLSY @PKUFlyingPig