
Conversation

czhu15 commented Nov 26, 2025

Split fp8_fused_sdpa into two phases to decrease TTFT.
The first phase calls the fused_sdpa kernel without a mask for the prefix-cached part.
The second phase calls the fused_sdpa kernel with a mask for the new-prompt part.
Splitting fp8_fused_sdpa into two phases reduces memory consumption and also decreases TTFT with the current Synapse fused_sdpa kernel.
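
A minimal PyTorch sketch of the two-phase idea (plain PyTorch standing in for the Synapse fused_sdpa kernel; the function and variable names are illustrative, not the actual implementation): the new-prompt queries attend to the cached prefix without a mask, then to the new prompt tokens with a causal mask, and the two partial results are merged via their softmax statistics so the output matches a single full-sequence SDPA call.

```python
import torch

def two_phase_prefill_sdpa(q, k_prefix, v_prefix, k_new, v_new, scale):
    """Illustrative two-phase SDPA for prefix caching.

    q:        [B, H, T_new, D]  queries for the new prompt tokens
    k_prefix: [B, H, T_pre, D]  cached-prefix keys (fully attendable -> no mask)
    k_new:    [B, H, T_new, D]  new-prompt keys (causal mask required)
    """
    # Phase 1: attend over the prefix-cached part. No mask is needed,
    # since every new token may attend to every cached prefix token.
    s1 = torch.matmul(q, k_prefix.transpose(-2, -1)) * scale
    m1 = s1.amax(dim=-1, keepdim=True)
    p1 = torch.exp(s1 - m1)
    l1 = p1.sum(dim=-1, keepdim=True)
    o1 = torch.matmul(p1, v_prefix)

    # Phase 2: attend over the new prompt part with a causal mask.
    t_new = q.shape[-2]
    causal = torch.triu(
        torch.full((t_new, t_new), float("-inf"), device=q.device, dtype=q.dtype),
        diagonal=1,
    )
    s2 = torch.matmul(q, k_new.transpose(-2, -1)) * scale + causal
    m2 = s2.amax(dim=-1, keepdim=True)
    p2 = torch.exp(s2 - m2)
    l2 = p2.sum(dim=-1, keepdim=True)
    o2 = torch.matmul(p2, v_new)

    # Merge the two partial softmaxes so the result equals one full SDPA
    # over the concatenated [prefix | new] key/value sequence.
    m = torch.maximum(m1, m2)
    w1, w2 = torch.exp(m1 - m), torch.exp(m2 - m)
    return (w1 * o1 + w2 * o2) / (w1 * l1 + w2 * l2)
```

With this split, the attention bias the kernel has to materialize shrinks from covering the full [prefix + new] length to only the new-prompt length, which is where the memory saving comes from.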

czhu15 commented Nov 26, 2025

cc @yangulei

czhu15 marked this pull request as draft November 26, 2025 00:40
Co-authored-by: Youlei Yang <youlei.yang@intel.com>
Signed-off-by: Bob Zhu <bob.zhu@intel.com>
czhu15 marked this pull request as ready for review December 1, 2025 01:54
czhu15 commented Dec 1, 2025

The output of the APC example code is correct.
TTFT drops to ~2 seconds with the customer's test data.

chensuyue requested a review from xin3he December 3, 2025 08:02
xin3he requested a review from yiliu30 December 4, 2025 05:19
xin3he commented Dec 4, 2025

@czhu15 Thank you for raising this enhancement. We will double-check this change and make sure it does not break current usages.

linoybu left a comment

In the vLLM plugin, we are currently using FSDPA only during the prefill phase.
You can see this distinction here:
https://github.com/vllm-project/vllm-gaudi/blob/b8515d5fb8d5966768ad03e71bbbe1ad6661d7df/vllm_gaudi/attention/backends/hpu_attn.py#L262
It appears to be an attempt to separate decode and prefill operations to improve performance.
My question is: if we are not using FSDPA for decode, should we still expect any performance improvement?
Also, do you have a ticket that explains more about this issue?

xin3he commented Dec 4, 2025

Thank you for this contribution, @czhu15.
Per my understanding, this change applies purely to the prefill stage: it splits prefill into two steps based on VLLM_FUSEDSDPA_SPLIT_THLD to reduce peak memory usage, which also improves TTFT to some extent.
My suggestion would be to keep the original behavior by default and enable the split only when VLLM_FUSEDSDPA_SPLIT_THLD is explicitly set.
To make the split the default, we need more information.

czhu15 commented Dec 4, 2025

Yes, this PR applies only during the prefill phase; more specifically, the prefill phase when prefix caching is enabled. The current implementation passes a (big) atten_bias to the kernel, which can easily lead to OOM issues.
There is some discussion in the ticket below, though much of the discussion with B, Jayachandran (jayachandran.b@intel.com) took place over Teams.
https://jira.habana-labs.com/browse/SW-241376
To keep the non-split behavior, users can simply set VLLM_FUSEDSDPA_SPLIT_THLD to 0. Please feel free to check the performance in INC under different scenarios to decide how to set the default value.
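
A hedged sketch of how the threshold gate could look (the actual vllm-gaudi wiring may differ; `should_split_prefill` is a hypothetical helper, and the default value shown is a placeholder since the thread leaves it open pending INC benchmarking):

```python
import os

# VLLM_FUSEDSDPA_SPLIT_THLD comes from the PR discussion above; the
# default of "0" here is a placeholder, not a confirmed choice.
SPLIT_THLD = int(os.environ.get("VLLM_FUSEDSDPA_SPLIT_THLD", "0"))

def should_split_prefill(prefix_len: int) -> bool:
    # A threshold of 0 disables splitting, preserving the original
    # single-call fused_sdpa behavior with a full attention bias.
    return SPLIT_THLD > 0 and prefix_len >= SPLIT_THLD
```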

yiliu30 commented Dec 5, 2025

> Thank you for this contribution, @czhu15. Per my understanding, this change applies purely to the prefill stage: it splits prefill into two steps based on VLLM_FUSEDSDPA_SPLIT_THLD to reduce peak memory usage, which also improves TTFT to some extent. My suggestion would be to keep the original behavior by default and enable the split only when VLLM_FUSEDSDPA_SPLIT_THLD is explicitly set. To make the split the default, we need more information.

Hi @xin3he, this PR targets aice/v122 or v3.6.post.oot for now. It's okay to allow more flexibility in order to pursue maximum performance.

yiliu30 commented Dec 5, 2025

Hi @czhu15, please let me know once the local tests pass. I can help with the merge, or you're welcome to do it yourself.
