Enabling sequence_parallel slows down training with fp16 #182

@cavdard

Description

I am testing GPT-2 model training using transformer_engine.pytorch.TransformerLayer.

Training slows down significantly when sequence_parallel=True, achieving roughly 1/5th of the throughput of training without sequence parallelism.
I also observe that sequence_parallel=True hits OOM at some batch sizes that run successfully with sequence_parallel=False.
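
A minimal sketch of how such a throughput/peak-memory comparison can be measured (the measure helper, input shapes, and iteration count are illustrative assumptions, not the actual script used):

import time
import torch

# Minimal probe, assuming a single already-constructed TransformerLayer on the
# current GPU and fp16 activations in the default (seq, batch, hidden) layout.
# With sequence_parallel=True the layer expects the local sequence shard, i.e.
# seq_len // tp_size tokens along the sequence dimension.
def measure(layer, batch_size, seq_len=2048, hidden=5120, iters=10):
    x = torch.randn(seq_len, batch_size, hidden,
                    dtype=torch.float16, device="cuda", requires_grad=True)
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        out = layer(x, attention_mask=None)  # forward through the layer
        out.sum().backward()                 # dummy loss to exercise backward
    torch.cuda.synchronize()
    tokens_per_s = iters * batch_size * seq_len / (time.time() - start)
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return tokens_per_s, peak_gib

Running this once with sequence_parallel=True (with the sequence dimension reduced to the local shard) and once with it disabled makes the throughput gap and the OOM threshold directly comparable.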

Do you have any recommendations for achieving better throughput with sequence_parallel and fp16?

The model is ~4.3B parameters with 12 layers, tp_size=4, fp16, seq_len=2048, trained on 8 A100 GPUs. Each layer is constructed as:

transformer_engine.pytorch.TransformerLayer(
    5120,                    # hidden_size
    20480,                   # ffn_hidden_size
    40,                      # num_attention_heads
    layer_number=(l + 1),    # l is the loop index over the 12 layers
    self_attn_mask_type="causal",
    tp_group=tp_group(),     # tensor-parallel process group from the training setup
    tp_size=tp_size,         # 4
    params_dtype=torch.float16,
    output_layernorm=True,
    layer_type="encoder",
    set_parallel_mode=True,
    fuse_qkv_params=True,
    sequence_parallel=True,
    qkv_weight_interleaved=False,
    attention_softmax_in_fp32=False,
)
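
Note: tp_group(), tp_size, and l above are defined by the surrounding training code, which is not shown. A minimal sketch of how such a tensor-parallel group could be built for 8 GPUs with tp_size=4 (an assumption about the setup, not the actual code):

import torch.distributed as dist

# Hypothetical stand-in for the tp_group()/tp_size helpers referenced above.
# Assumes torchrun launched one process per GPU, so 8 ranks split into two
# tensor-parallel groups of 4 consecutive ranks each.
def build_tp_group(tp_size: int = 4):
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    tp_group = None
    # new_group() must be called by every rank for every group, in the same
    # order; each rank keeps only the group it belongs to.
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            tp_group = group
    return tp_group

With sequence_parallel=True, the sequence dimension of the layer input is expected to be sharded across this same tensor-parallel group (seq_len // tp_size tokens per rank).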
