@jpata jpata commented Aug 27, 2025

This PR moves the training from an epoch-based formulation to a step-based formulation. The advantage is that very long trainings can be resumed exactly where they left off as soon as the previous job ends, rather than having to restart from the last epoch boundary.

Technically, this is done by keeping track of the dataset and sampler state across multiple workers.
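For illustration, below is a minimal sketch (not the code in this PR) of what a step-based, resumable loop can look like in PyTorch: the checkpoint stores the global step, and on restart the loader is rebuilt for the interrupted epoch and fast-forwarded to the stored position. The helper names, the checkpoint path, and the `model(batch)` loss call are placeholders; a real implementation (as described above) would track the dataset and sampler state directly instead of re-iterating skipped batches.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler

def make_loader(dataset, batch_size, rank, world_size, epoch):
    # DistributedSampler keeps each worker on a disjoint shard; set_epoch
    # reproduces the same shuffle, so skipping is consistent across workers.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    sampler.set_epoch(epoch)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)

def train(model, optimizer, dataset, batch_size, rank, world_size, total_steps, state):
    step = state.get("step", 0)  # global step restored from the checkpoint
    steps_per_epoch = len(dataset) // (batch_size * world_size)
    while step < total_steps:
        epoch = step // steps_per_epoch
        skip = step % steps_per_epoch  # position inside the interrupted epoch
        loader = make_loader(dataset, batch_size, rank, world_size, epoch)
        for i, batch in enumerate(loader):
            if i < skip:
                continue  # naive fast-forward to where the previous job stopped
            loss = model(batch)  # placeholder for the forward pass + loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % 1000 == 0:  # in practice, save only on rank 0
                torch.save({"step": step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, "checkpoint.pt")
            if step >= total_steps:
                break
```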

As validation, I trained a model (light blue) for a very long time (500k steps, ~20 days) on 8x AMD MI200X cards. It can be compared to the previous model freeze at #437, or https://huggingface.co/jpata/particleflow/tree/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478 (dark blue). The cms_pf_qcd_nopu dataset was also increased in this run from 5M to 20M events. Overall, this corresponds to about 10 epochs over the full set of concatenated data, though the epochs in the two runs are not directly comparable. We see that the jet resolution metric (IQR) and the jet reco-gen matching fraction metric keep improving throughout the training.

[Screenshots: jet resolution (IQR) and jet reco-gen matching fraction vs. training step]

The training loss is stable across ~20 consecutive 24-hour restarts. The regression loss fluctuates somewhat; a more stable configuration could potentially be better for optimization.
[Screenshot: training loss vs. training step across restarts]

The validation loss is computed at regular step intervals over a subset of the data (to avoid spending too much time not training) and shows convergence.
[Screenshot: validation loss vs. training step]
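For context, a common way to implement this kind of periodic subset validation (a sketch, not necessarily the exact approach in this PR) is to evaluate a fixed number of validation batches every so many steps; `num_val_batches`, `val_interval`, and the `model(batch)` loss call below are illustrative placeholders.

```python
import itertools
import torch

@torch.no_grad()
def validate_subset(model, val_loader, num_val_batches=100):
    # Evaluate only a fixed number of validation batches so that periodic
    # validation does not dominate the wall time of the training job.
    model.eval()
    losses = [model(batch).item()  # placeholder forward pass + loss
              for batch in itertools.islice(val_loader, num_val_batches)]
    model.train()
    return sum(losses) / max(len(losses), 1)

# inside the step loop, e.g. every `val_interval` steps:
# if step % val_interval == 0:
#     val_loss = validate_subset(model, val_loader)
```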

@jpata jpata merged commit ea1bf04 into main Dec 23, 2025
2 checks passed
