@CarlosGomes98
Contributor

This fixes a bug in the reference implementation. For Flux, each DP rank must use a different seed so that different noise is sampled for each data sample. Failing to do so results in slower convergence.

The torchtitan code was set up so that seeds differed among DP ranks but were identical among FSDP ranks. This was reported and fixed. This MR pulls in the changes made to the torchtitan repository.
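To illustrate the difference, here is a minimal sketch of the two seeding schemes. It is not the torchtitan code: the function names, the `dp_replicate`/`dp_shard` rank arguments, and the 2 x 4 rank layout are all hypothetical, and Python's `random` module stands in for the actual noise sampler.

```python
import random


def buggy_seed(base_seed: int, dp_replicate_rank: int, dp_shard_rank: int) -> int:
    # Hypothetical reconstruction of the reported behaviour: the FSDP (shard)
    # rank is ignored, so every FSDP rank in a replica group gets the same seed.
    return base_seed + dp_replicate_rank


def fixed_seed(base_seed: int, dp_replicate_rank: int, dp_shard_rank: int,
               dp_shard_size: int) -> int:
    # Flatten both data-parallel dimensions into one index so that every rank
    # that loads different data also gets a different seed.
    return base_seed + dp_replicate_rank * dp_shard_size + dp_shard_rank


def sample_noise(seed: int, n: int = 4) -> list[float]:
    # Stand-in for the per-sample Gaussian noise drawn during training.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Assumed layout for illustration: 2 replica groups, 4 FSDP ranks in each.
BASE_SEED, REPLICAS, SHARDS = 42, 2, 4
ranks = [(r, s) for r in range(REPLICAS) for s in range(SHARDS)]

buggy_seeds = [buggy_seed(BASE_SEED, r, s) for r, s in ranks]
fixed_seeds = [fixed_seed(BASE_SEED, r, s, SHARDS) for r, s in ranks]

# Number of distinct seeds across the 8 ranks under each scheme.
n_unique_buggy = len(set(buggy_seeds))
n_unique_fixed = len(set(fixed_seeds))
```

Under these assumptions the buggy scheme yields only as many distinct seeds as there are replica groups, while the fixed scheme yields one per rank, so no two ranks draw the same noise for different data samples.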

This affects the RCPs, which will need to be recalculated. For GBS 1k and larger, the change in convergence is small (3-4%). For GBS 512, however, the difference is much larger, because that RCP was computed with fewer nodes. Since FSDP was used within nodes, that configuration is more affected by the change, and the fix speeds up its convergence by 14%.

@CarlosGomes98 requested a review from a team as a code owner on November 18, 2025 17:59
@github-actions

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@CarlosGomes98
Contributor Author

Relevant RCP update: mlcommons/logging#443

@ShriyaRishab
Contributor

Approved in 12/4 WG

@ShriyaRishab merged commit 803adc1 into mlcommons:master on Dec 4, 2025
1 check passed
@github-actions bot locked and limited conversation to collaborators on Dec 4, 2025