[Flux] fix incorrect seed setting for dp shard #844
Merged
+1
−1
This fixes a bug in the reference. For Flux, it is important that each DP rank has a different seed, so that different noise is sampled for each data sample. Failing to do this results in slower convergence.
The torchtitan code was set up in such a way that seeds differed among DP ranks but were the same among FSDP ranks. This was reported and fixed there. This MR pulls in the changes made to the torchtitan repository.
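To illustrate the kind of collision described above, here is a minimal sketch in plain Python. It is not the torchtitan code: the function names, the `base_seed` offset scheme, and the replicate/shard rank layout are all assumptions made for illustration only.

```python
# Hypothetical sketch: names and the seed-offset scheme are assumptions,
# not the actual torchtitan implementation.

def buggy_seed(base_seed, dp_replicate_rank, dp_shard_rank, dp_shard_size):
    # Only the replicate coordinate offsets the seed, so every
    # FSDP (shard) rank inside one replica gets the same seed.
    return base_seed + dp_replicate_rank


def fixed_seed(base_seed, dp_replicate_rank, dp_shard_rank, dp_shard_size):
    # Flatten both data-parallel coordinates into one unique DP rank,
    # so every rank that loads different data also draws different noise.
    dp_rank = dp_replicate_rank * dp_shard_size + dp_shard_rank
    return base_seed + dp_rank


def distinct_seeds(seed_fn, base_seed, dp_replicate_size, dp_shard_size):
    # Count how many different seeds appear across the whole DP grid.
    seeds = {
        seed_fn(base_seed, r, s, dp_shard_size)
        for r in range(dp_replicate_size)
        for s in range(dp_shard_size)
    }
    return len(seeds)
```

With, say, 2 replicas of 8 shard ranks each, `distinct_seeds(buggy_seed, 0, 2, 8)` counts only 2 distinct seeds, while `distinct_seeds(fixed_seed, 0, 2, 8)` counts 16, one per DP rank.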
This affects RCPs and will require their recalculation. For GBS 1k and larger, the convergence change is not very large (3-4%). For GBS 512, however, the difference is quite large, because this RCP was computed with a smaller number of nodes. Since FSDP was being used within nodes, that configuration is more affected by this change, and the fix speeds up its convergence by 14%.