[DRAFT] CUDA: Improve performance via fewer synchronizations between tokens #17795
This PR suggests removing some superfluous synchronization calls between tokens to speed up the CUDA backend. I see between 1% and 2% performance gain depending on the model, GPU, and settings.
Mechanism
The performance impact is best explained visually. Below are "before" and "after" Nsight Systems traces (not to scale). The relevant part is the row with the green and red bubbles. Both images show the overhead between the GPU execution of two tokens: the generation of the n-th token ends on the left-hand side of the screenshot, in the green bubble titled `cudaStreamSynchronize`, and the computation of the (n+1)-th token starts on the right-hand side, at the green bar titled `cudaGraphLaunch`. In between, there is CPU orchestration overhead. This PR aims to shrink the time spent in that middle part, between GPU token generations.

Original:

In the middle of the above image, red and green bubbles alternate. Here, the green bubbles are synchronization steps and the red bubbles are asynchronous host-to-device copy calls. If an async operation is immediately followed by a synchronization call, it is effectively executed synchronously, which is not efficient. Removing the green synchronization operations between the asynchronous copy calls lets the copies actually run asynchronously and reduces the overhead between GPU token generations:
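To make the pattern concrete, here is a minimal, self-contained sketch (hypothetical helper, not the actual llama.cpp code) of the difference between synchronizing after every copy and synchronizing once after the last copy:

```cpp
// Minimal sketch of the pattern, not the actual llama.cpp code.
// Assumes the host buffers are pinned (cudaMallocHost/cudaHostAlloc);
// otherwise cudaMemcpyAsync degrades to a synchronous copy anyway.
#include <cuda_runtime.h>
#include <cstddef>
#include <utility>
#include <vector>

void upload_inputs(const std::vector<std::pair<void *, const void *>> & copies, // {device dst, host src}
                   size_t size, cudaStream_t stream) {
    // Before: every async copy is immediately followed by a synchronize,
    // so the copies are effectively serialized and the CPU blocks each time:
    //   cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);
    //   cudaStreamSynchronize(stream);

    // After: enqueue all copies asynchronously ...
    for (const auto & [dst, src] : copies) {
        cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream);
    }
    // ... and keep a single synchronization after the last copy for correctness.
    cudaStreamSynchronize(stream);
}
```

The single trailing `cudaStreamSynchronize` in the sketch corresponds to the synchronization call this PR retains after the last copy.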
Performance
I benchmarked on an RTX Pro 6000 Blackwell using `./llama-bench -m $models -p 0 -n 128,256,512 -fa 1`. My testing shows around a 1% improvement, with `gpt-oss-20b` gaining up to 1.4%. `llama 3B Q4_K - Medium` shows very high variance, prompting me to run the tests again with `-r 100`. At `-r 100`, a clearer trend of improved performance for `gemma3n E2B Q8_0` is also visible.

Details with default `-r 5`
Baseline:
PR:
Speedup:
Details with `-r 100`
Baseline:
PR:
Speedup:
Implementation Concerns
The approach here aims to minimize changes to the general backend code and to the other backends. However, since the synchronization calls originate from the general backend, some changes there are unavoidable, as is retaining a synchronization call after the last copy to ensure correctness across backends.
Additionally, AFAIK there is no documentation on the functional guarantees of a function like `ggml_copy_tensor`, and the current design proposal may violate existing assumptions, or practices around potentially breaking ABIs between ggml and llama.cpp. For this reason, this PR is a draft.

I also have not yet copied my `ggml_backend_buffer_i` interface changes (added `set_tensor_async` + whitespace) to the other backends; this causes the unrelated test failures. Please advise on the architectural choices of this implementation.
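For context, here is a rough sketch of what the buffer-interface addition could look like. It is illustrative only (stand-in declarations, surrounding members omitted), not the actual diff:

```c
// Illustrative sketch, not the actual ggml-backend-impl.h diff.
// Stand-in declarations so the snippet is self-contained.
#include <stddef.h>

typedef struct ggml_backend_buffer * ggml_backend_buffer_t;
struct ggml_tensor;

struct ggml_backend_buffer_i_sketch {
    // existing synchronous setter: the data is guaranteed to be in the
    // buffer when the call returns
    void (*set_tensor)      (ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);

    // proposed asynchronous setter: the copy is only enqueued; the caller
    // must synchronize (once, after the last copy) before the data is used
    void (*set_tensor_async)(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor,
                             const void * data, size_t offset, size_t size);

    // (remaining members such as free_buffer, get_base, init_tensor, ... omitted)
};
```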
For example, we could make `set_tensor` in the CUDA backend async by default. This would avoid interface changes, but would change the behavior of similar functions between backends; a rough sketch of that variant is shown below.

@ggerganov @JohannesGaessler
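For illustration, a minimal sketch of what that default-async variant could mean on the CUDA side, assuming the buffer's `set_tensor` currently issues a `cudaMemcpyAsync` followed by a `cudaStreamSynchronize` (names and error handling are simplified, not the actual ggml-cuda code):

```cpp
#include <cuda_runtime.h>
#include <cstddef>

// Simplified sketch, not the actual ggml-cuda implementation.
static void cuda_buffer_set_tensor_sketch(char * tensor_data, const void * data,
                                          size_t offset, size_t size, cudaStream_t stream) {
    // enqueue the host-to-device copy on the backend stream ...
    cudaMemcpyAsync(tensor_data + offset, data, size, cudaMemcpyHostToDevice, stream);
    // ... without a trailing cudaStreamSynchronize(stream): the setter becomes
    // asynchronous by default and the caller is responsible for synchronizing
    // before the data is consumed.
}
```

The trade-off is the one described above: no interface change, but `set_tensor` would no longer have the same blocking semantics as in the other backends.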