
Support GroupIndex to be random string to accelerate rollout #710

@hasB4K

Description


What would you like to be added:
The possibility for GroupIndex to be something other than an integer (for example, a random string) for deployments that don't need ordered indices. Something like this:

NAME       READY   STATUS    RESTARTS   AGE
vllm-8hc4x     1/1     Running   0          2s
vllm-8hc4x-1   1/1     Running   0          2s
vllm-mk9w6     1/1     Running   0          2s
vllm-mk9w6-1   1/1     Running   0          2s

Why is this needed:

Example from the doc:

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          2s
vllm-0-1   1/1     Running   0          2s
vllm-1     1/1     Running   0          2s
vllm-1-1   1/1     Running   0          2s

The group indices are the first 0 and 1 in the pod names, for example the 0 in vllm-0-1 and the 1 in vllm-1-1. (It's obvious, but just in case: GroupIndex == ReplicaIndex.)

The issue is that during a rollout, LWS always tries to keep the group indices consecutive, since it reconciles StatefulSets and those StatefulSets have consistent naming. So if I redeploy with maxSurge set to 2, we first get this:

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          20h
vllm-0-1   1/1     Running   0          20h
vllm-1     1/1     Running   0          20h
vllm-1-1   1/1     Running   0          20h
vllm-2     0/1     Pending   0          2s
vllm-2-1   0/1     Pending   0          2s
vllm-3     0/1     Pending   0          2s
vllm-3-1   0/1     Pending   0          2s

Then this:

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     0/1     Pending   0          2s
vllm-0-1   0/1     Pending   0          2s
vllm-1     0/1     Pending   0          2s
vllm-1-1   0/1     Pending   0          2s
vllm-2     1/1     Running   0          2m
vllm-2-1   1/1     Running   0          2m
vllm-3     1/1     Running   0          2m
vllm-3-1   1/1     Running   0          2m

and finally this again:

NAME       READY   STATUS    RESTARTS   AGE
vllm-0     1/1     Running   0          20s
vllm-0-1   1/1     Running   0          20s
vllm-1     1/1     Running   0          20s
vllm-1-1   1/1     Running   0          20s

This slows down the rollout a lot. Do you have any thoughts on this, @kerthcet? (cc @synthe102)
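
To make the cost concrete, here is a rough toy model of the two naming schemes. It only mirrors the three pod listings above and is not how the LWS controller is actually implemented; the function names (`ordinal_rollout`, `random_name_rollout`, `random_suffix`, `count_starts`) and the 5-character suffix alphabet are made up for illustration.

```python
import random
import string


def ordinal_rollout(replicas, max_surge):
    """Toy model of the rollout above when group indices must stay 0..replicas-1."""
    events = []
    # 1. Surge groups take the next consecutive indices.
    for i in range(replicas, replicas + max_surge):
        events.append(f"start vllm-{i} (new template)")
    # 2. The original indices are recreated with the new template.
    for i in range(replicas):
        events.append(f"restart vllm-{i} (new template)")
    # 3. Surge groups are removed so the indices are consecutive again.
    for i in range(replicas, replicas + max_surge):
        events.append(f"delete vllm-{i}")
    return events


def random_suffix(rng, length=5):
    """Hypothetical random group identifier, e.g. '8hc4x'."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_name_rollout(replicas, max_surge, rng):
    """Toy model where new groups simply replace old ones, max_surge at a time."""
    remaining = [f"vllm-{random_suffix(rng)}" for _ in range(replicas)]
    events = []
    while remaining:
        batch, remaining = remaining[:max_surge], remaining[max_surge:]
        # New groups get fresh names, so nothing has to be recreated later.
        for _ in batch:
            events.append(f"start vllm-{random_suffix(rng)} (new template)")
        for name in batch:
            events.append(f"delete {name}")
    return events


def count_starts(events):
    """Number of group start-ups (each one means waiting for pods to become Ready)."""
    return sum(1 for e in events if e.startswith(("start", "restart")))


if __name__ == "__main__":
    rng = random.Random(0)
    print("ordinal:", count_starts(ordinal_rollout(2, 2)), "group start-ups")
    print("random: ", count_starts(random_name_rollout(2, 2, rng)), "group start-ups")
```

In this model, with 2 replicas and maxSurge at 2, the integer scheme starts four groups (two surge groups plus two recreated ones) while the random-name scheme starts only two, which is the saving this issue is asking for.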

Labels: kind/feature (Categorizes issue or PR as related to a new feature.)
