@xinSky00

Purpose

This PR implements a Triton attention kernel for ReRoPE (Rectified Rotary Position Embeddings) to enable efficient context-length extension in vLLM. For long sequences, the ReRoPE-based approach delivers a 3-5x speedup.

The kernel computes attention segment-wise: tokens within a fixed window receive full rotary position embeddings, while relative positions beyond the window boundary are constrained (clamped) to the window size. This lets models handle sequences well beyond their pre-training length without fine-tuning.
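For illustration only, here is a minimal sketch (not the Triton kernel from this PR) of the relative-position matrix that ReRoPE-style rectification produces, assuming a window size `window`: distances inside the window keep their true value, and everything farther away is clamped to `window`.

```python
import torch

def rerope_relative_positions(seq_len: int, window: int) -> torch.Tensor:
    """Illustrative sketch: the relative-position matrix implied by ReRoPE.

    Entry [i, j] is the distance between query i and key j; distances past
    the window boundary are rectified (clamped) to `window`. Entries above
    the diagonal are masked out by causal attention anyway.
    """
    pos = torch.arange(seq_len)
    rel = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0)  # causal distance
    return rel.clamp(max=window)                              # ReRoPE rectification

# With window=4, distances 0..3 keep full RoPE; anything farther acts as distance 4.
print(rerope_relative_positions(seq_len=8, window=4))
```

Because RoPE rotates queries and keys separately, an arbitrary relative-position matrix like this cannot be realized by a single rotation; ReRoPE therefore computes two sets of attention scores (standard RoPE inside the window, a fixed relative position of `window` outside) and selects between them per entry, which is presumably what the segment-wise Triton kernel in this PR fuses into a single pass.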

Modifications

  1. Required Environment Variable: VLLM_ATTENTION_BACKEND
  • Must be set to TRITON_ATTN_VLLM_V1 to enable the Triton-based backend
  • Usage: export VLLM_ATTENTION_BACKEND="TRITON_ATTN_VLLM_V1"
  • Default: FLASH_ATTN (the standard FlashAttention backend)
  2. Modified Model Configuration Parameter: max_position_embeddings
  • Users must raise this parameter via --hf-overrides to match their target input length
  • This ensures the RoPE embeddings are computed for the extended sequence length
  • Example: --hf-overrides '{"max_position_embeddings": 327680}'
  3. ReRoPE-Specific Parameters: rerope_window and training_length
  • These should be configured based on the model's original pre-training length, since they determine the segment boundaries for the attention computation; a configuration sketch follows this list
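Putting the three items above together, a minimal Python sketch (not taken from this PR) could look like the following. VLLM_ATTENTION_BACKEND, the TRITON_ATTN_VLLM_V1 value, and the hf_overrides keyword on LLM are existing vLLM interfaces in recent releases; VLLM_USE_REROPE comes from this PR's test description; passing rerope_window and training_length through hf_overrides, and the values chosen for them, are assumptions made purely for illustration.

```python
import os

# The attention backend must be selected before vLLM is imported.
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
os.environ["VLLM_USE_REROPE"] = "true"  # ReRoPE switch used in the test below

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    hf_overrides={
        # Same effect as --hf-overrides '{"max_position_embeddings": 327680}'.
        "max_position_embeddings": 327680,
        # Hypothetical keys: the PR text does not say where rerope_window and
        # training_length are set; placement and values are illustrative only.
        "rerope_window": 512,
        "training_length": 32768,
    },
)
```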

Test

Run offline_inference_rerope.py with the following setup; a minimal sketch of such a script appears after the results list.

  • os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
  • os.environ["VLLM_USE_REROPE"] = "true"
  • model: Qwen2.5-14B-Instruct
  • Dataset: multifieldqa_zh.jsonl
  • results
    • prompt length: about 130k tokens (output shown in the attached image)
    • prompt length: about 315k tokens (output shown in the attached image)
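As a reference, here is a minimal sketch of what a test script along these lines might look like. This is not the offline_inference_rerope.py shipped in the PR; the "context" and "input" field names for multifieldqa_zh.jsonl and the prompt template are assumptions, and the engine configuration repeats the sketch above so the script is self-contained.

```python
import json
import os

# Backend/ReRoPE switches must be set before importing vLLM (see Modifications).
os.environ["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
os.environ["VLLM_USE_REROPE"] = "true"

from vllm import LLM, SamplingParams

# Assumes each line of multifieldqa_zh.jsonl is a JSON object with "context"
# and "input" fields (LongBench-style layout; field names are an assumption).
prompts = []
with open("multifieldqa_zh.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        prompts.append(f"{sample['context']}\n\nQuestion: {sample['input']}\nAnswer:")

llm = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    hf_overrides={"max_position_embeddings": 327680},  # extend the positional limit
)

outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=128))
for out in outputs:
    print(len(out.prompt_token_ids), "prompt tokens ->", out.outputs[0].text[:80])
```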
