
[Question]: No speedup on multi-card with vllm #188

@Andy0422

Describe the issue

Hi,

I got some interesting results; could you share some comments?

First, I followed your guide and tested Llama-3-8B-Instruct-262k with benchmark_e2e_vllm_tp.py on a single L20 card, and the results seem right: as the table below shows, from 40k sequence length onward, sparse attention is faster than FlashAttention. But when I tested on 4 cards, sparse performed no better than dense, which seems abnormal. I saw the same results with Qwen2.5-32B. In your paper you tested the sparse path on 8 A100s; could you share your comparison results against FlashAttention, or give us some advice on how to fix this problem? Thanks.

vllm: 0.9.2
minference: 0.1.6.0
triton: 3.3.0
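
For context, the single-request timing reduces to something like the sketch below. This is a minimal sketch following the vLLM patching pattern from the MInference README; the model name, prompt, and engine arguments are illustrative stand-ins for what benchmark_e2e_vllm_tp.py actually sets up, not the exact benchmark code (the dense/FlashAttention baseline would simply skip the patching step).

```python
import time

from vllm import LLM, SamplingParams
from minference import MInference

# Illustrative settings; the real runs used benchmark_e2e_vllm_tp.py.
model_name = "gradientai/Llama-3-8B-Instruct-262k"
llm = LLM(
    model_name,
    max_num_seqs=1,
    enforce_eager=True,
    max_model_len=262144,
    tensor_parallel_size=4,  # 1 for the single-card runs
)

# Patch the vLLM engine with MInference's sparse attention.
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)

prompt = "x " * 50_000  # stand-in for a real long-context prompt
sampling_params = SamplingParams(temperature=0.0, max_tokens=1)

# With max_tokens=1, end-to-end latency approximates TTFT (prefill time).
start = time.perf_counter()
llm.generate([prompt], sampling_params)
print(f"TTFT: {time.perf_counter() - start:.2f}s")
```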


Meta-Llama-3.1-70B-Instruct on L20, TTFT in seconds:

| Input length | Dense (TP=1) | Sparse (TP=1) | Dense (TP=4) | Sparse (TP=4) |
|---|---|---|---|---|
| 10k  | 1.58  | 2.68  | 0.80  | 1.09  |
| 20k  | 3.64  | 4.67  | 1.71  | 1.98  |
| 30k  | 6.17  | 6.63  | 2.73  | 2.92  |
| 40k  | 9.18  | 8.60  | 3.86  | 3.96  |
| 50k  | 12.76 | 10.30 | 5.11  | 5.11  |
| 60k  | 19.55 | 12.74 | 6.49  | 6.50  |
| 70k  | 21.12 | 14.75 | 8.00  | 7.99  |
| 80k  | 26.02 | 17.12 | 9.58  | 9.57  |
| 90k  | 32.91 | 19.10 | 11.35 | 11.33 |
| 100k | 37.17 | 20.65 | 13.18 | 13.17 |

Qwen2.5-32B-Instruct, TP=4, TTFT in seconds:

| Input length | Dense | Sparse |
|---|---|---|
| 50k  | 15.78 | 17.10 |
| 60k  | 19.76 | 20.66 |
| 70k  | 23.94 | 23.95 |
| 80k  | 28.70 | 27.85 |
| 90k  | 33.75 | 32.27 |
| 120k | 49.80 | 42.93 |
