Describe the issue
Hi,
I got some interesting results and would appreciate your comments.
First, I followed your guide and tested Llama-3-8B-Instruct-262k with benchmark_e2e_vllm_tp.py on a single L20 card, and the results look right: as the table below shows, from 40k tokens onward sparse attention is faster than FlashAttention. But when I run the same test on 4 cards (TP=4), sparse is no faster than dense, which seems abnormal. I see the same behavior with Qwen2.5-32B. In your paper the sparse attention was evaluated on 8x A100; could you share the comparison against FlashAttention for that setup, or give some advice on how to fix this? Thanks. (A simplified sketch of the setup follows the environment info below.)
vllm: 0.9.2
minference: 0.1.6.0
triton: 3.3.0
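Roughly, the setup is equivalent to the sketch below (simplified from what benchmark_e2e_vllm_tp.py does; the patching call follows the MInference README, and the model path and lengths are just placeholders):

```python
from vllm import LLM
from minference import MInference

MODEL = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder model path

# Build the vLLM engine. tensor_parallel_size=4 for the multi-card runs,
# tensor_parallel_size=1 for the single-L20 baseline.
llm = LLM(
    model=MODEL,
    tensor_parallel_size=4,
    max_model_len=131072,
    enforce_eager=True,
)

# Patch the engine with MInference's sparse-attention prefill ("vllm" backend).
# Skipping this patch gives the dense / FlashAttention baseline.
minference_patch = MInference("vllm", MODEL)
llm = minference_patch(llm)
```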
Meta-Llama-3.1-70B-Instruct @ L20 (TTFT)
| Input length | Dense (TP=1) | Sparse (TP=1) | Dense (TP=4) | Sparse (TP=4) |
|---|---|---|---|---|
| 10k | 1.58 | 2.68 | 0.80 | 1.09 |
| 20k | 3.64 | 4.67 | 1.71 | 1.98 |
| 30k | 6.17 | 6.63 | 2.73 | 2.92 |
| 40k | 9.18 | 8.60 | 3.86 | 3.96 |
| 50k | 12.76 | 10.30 | 5.11 | 5.11 |
| 60k | 19.55 | 12.74 | 6.49 | 6.50 |
| 70k | 21.12 | 14.75 | 8.00 | 7.99 |
| 80k | 26.02 | 17.12 | 9.58 | 9.57 |
| 90k | 32.91 | 19.10 | 11.35 | 11.33 |
| 100k | 37.17 | 20.65 | 13.18 | 13.17 |
Qwen2.5-32B-Instruct, TP=4 (TTFT)
| Input length | Dense | Sparse |
|---|---|---|
| 50k | 15.78 | 17.10 |
| 60k | 19.76 | 20.66 |
| 70k | 23.94 | 23.95 |
| 80k | 28.7 | 27.85 |
| 90k | 33.75 | 32.27 |
| 120k | 49.80 | 42.93 |
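For reference, the numbers in both tables are wall-clock TTFT in seconds, i.e. roughly the cost of a one-token generation over the long prompt. A sketch of that kind of measurement is below (measure_ttft is an illustrative helper, not part of the benchmark script; the reported values come from benchmark_e2e_vllm_tp.py):

```python
import time
from vllm import LLM, SamplingParams

def measure_ttft(llm: LLM, prompt: str) -> float:
    """Approximate TTFT as the latency of generating a single token,
    so the timing is dominated by prefill over the long prompt."""
    params = SamplingParams(max_tokens=1)
    start = time.perf_counter()
    llm.generate([prompt], params)
    return time.perf_counter() - start

# e.g. measure_ttft(llm, long_prompt) for each input length, once with and
# once without the MInference patch applied.
```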