Update moe branch for torchrun and fix error while loading trainer.ds_module #7

pnunna93 · 2023-11-01T18:40:35Z

What does this PR do?

Fixes error while loading trainer.ds_module
Changes torch.distributed.launch to torchrun
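
One practical consequence of moving from torch.distributed.launch to torchrun is how a worker learns its local rank: torchrun exports it in the `LOCAL_RANK` environment variable, while the older launcher by default passes a `--local_rank` (or `--local-rank`) command-line argument. The sketch below is not code from this PR. It only illustrates one way a training script could accept both, and the helper name `resolve_local_rank` is hypothetical.

```python
import argparse
import os


def resolve_local_rank(argv=None, environ=None):
    """Return the local rank for this worker (hypothetical helper).

    Prefers the LOCAL_RANK environment variable set by torchrun and falls
    back to the --local_rank / --local-rank argument used by
    torch.distributed.launch. Defaults to 0 for single-process runs.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--local_rank", "--local-rank",
                        dest="local_rank", type=int, default=None)
    # Ignore every other argument; the real script parses those itself.
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    if args.local_rank is not None:
        return args.local_rank
    return 0


if __name__ == "__main__":
    print(resolve_local_rank())
```

Reading the environment variable first keeps the script working under torchrun without any launcher-specific flags.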

Fix the issue "[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!"

In DeepSpeed 0.15.0, commit 7260890452eb89185f9ab1e09550938f78ea91db changed the exp_counts tensor returned by deepspeed.moe.layer.MoE() from 'cpu' to the device. That change reduces CPU host overhead when using MoE. For DeepSpeed >= 0.15.0, the device type of the self.expert_counts tensor in the Fairseq transformer_moe_layer module therefore needs to be changed from cpu to cuda.

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
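
The version condition in the commit message can be sketched as a small helper. This is an illustration under stated assumptions, not the code from the commit: the function names `parse_version` and `expert_counts_device` are hypothetical, and the only fact taken from the message is the 0.15.0 threshold.

```python
import re


def parse_version(version):
    """Parse the leading 'X.Y[.Z]' of a version string into an int tuple.

    Local suffixes such as '+rocm' are ignored (hypothetical helper).
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) if part else 0 for part in match.groups())


def expert_counts_device(ds_version, accelerator="cuda"):
    """Pick the device name for the expert_counts accumulator.

    Assumption from the commit message: DeepSpeed >= 0.15.0 returns
    exp_counts on the accelerator, older releases return it on the CPU.
    """
    if parse_version(ds_version) >= (0, 15, 0):
        return accelerator
    return "cpu"


if __name__ == "__main__":
    for version in ("0.14.5", "0.15.0"):
        print(version, expert_counts_device(version))
```

In PyTorch code, allocating the accumulator on `exp_counts.device` is an alternative that would avoid the version check altogether.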

transformer_moe_layer: Fix Runtime error

Feature/deepspeed moe

pnunna93 and others added 15 commits October 27, 2023 16:11

Remove module for trainer.model

d1d81fa

Changes for torchrun

4b34146

Merge pull request #1 from jagadish-amd/fix-SWDEV-485020

9c68be1

transformer_moe_layer: Fix Runtime error

change version for fambench

e8ee21a

upd data class

cdfd220

upd data class

d47d475

remove np float

eb0d5ec

upd data class

1c67749

update requirement and config

e5de02a

update requirement

033b00f

update requirement

a0262f8

omegaconf utils update

8026835

add torchaudio

5f7a7d3

Merge pull request #2 from hakankiymaz-amd/feature/deepspeed_moe

b697bf1

Feature/deepspeed moe

3 participants
