I noticed that cudaMemcpy2DAsync is used for KV cache transmission. However, in Paraworker:migrate_blocks there is no torch.cuda.synchronize().
I'm wondering how it is guaranteed that the KV cache transmission has finished before the destination blocks are used?
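For context, here is a minimal sketch of the kind of explicit synchronization I expected to see. The names (transfer_stream, the copy_ stand-in for the cudaMemcpy2DAsync call) are purely illustrative and not taken from the repo:

```python
import torch

# Hypothetical sketch, not the actual repo code: launch the async copy on a
# dedicated stream and return an event marking its completion.
transfer_stream = torch.cuda.Stream()

def migrate_blocks_sketch(src_blocks: torch.Tensor, dst_blocks: torch.Tensor) -> torch.cuda.Event:
    done = torch.cuda.Event()
    with torch.cuda.stream(transfer_stream):
        # stand-in for the cudaMemcpy2DAsync call inside the CUDA extension
        dst_blocks.copy_(src_blocks, non_blocking=True)
        done.record(transfer_stream)
    return done

# Before the blocks are consumed, I expected either
#   done.synchronize()               # block the host until the copy finishes, or
#   compute_stream.wait_event(done)  # order the copy before work on the compute stream
# rather than relying on implicit ordering.
```

Is the guarantee instead coming from issuing the copy and the later reads on the same CUDA stream, so no explicit synchronize is needed?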
Looking forward to your reply, thanks! @interestingLSY @PKUFlyingPig