@jpata jpata commented Aug 27, 2025

This PR moves the training from an epoch-based formulation to a step-based formulation. The advantage is that very long trainings can be resumed exactly where they left off as soon as the previous job ends, rather than having to restart from the last epoch boundary.

Technically, this is done by keeping track of the dataset and sampler state across multiple workers.
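For illustration, below is a minimal sketch (not the code in this PR) of what a step-based, resumable loop can look like in PyTorch: the checkpoint stores the global step, and on restart the loader is rebuilt for the interrupted epoch and fast-forwarded to the stored position. The helper names, the checkpoint path, and the `model(batch)` loss call are placeholders; a real implementation (as described above) would track the dataset and sampler state directly instead of re-iterating skipped batches.

```python
import torch
from torch.utils.data import DataLoader, DistributedSampler

def make_loader(dataset, batch_size, rank, world_size, epoch):
    # DistributedSampler keeps each worker on a disjoint shard; set_epoch
    # reproduces the same shuffle, so skipping is consistent across workers.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    sampler.set_epoch(epoch)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler, drop_last=True)

def train(model, optimizer, dataset, batch_size, rank, world_size, total_steps, state):
    step = state.get("step", 0)  # global step restored from the checkpoint
    steps_per_epoch = len(dataset) // (batch_size * world_size)
    while step < total_steps:
        epoch = step // steps_per_epoch
        skip = step % steps_per_epoch  # position inside the interrupted epoch
        loader = make_loader(dataset, batch_size, rank, world_size, epoch)
        for i, batch in enumerate(loader):
            if i < skip:
                continue  # naive fast-forward to where the previous job stopped
            loss = model(batch)  # placeholder for the forward pass + loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step % 1000 == 0:  # in practice, save only on rank 0
                torch.save({"step": step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, "checkpoint.pt")
            if step >= total_steps:
                break
```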

As validation, I trained a model (light blue) for a very long time (500k steps, ~20 days) on 8x AMD MI200X cards. It can be compared to the previous model freeze at #437, or https://huggingface.co/jpata/particleflow/tree/main/cms/v2.6.0pre1/pyg-cms_20250722_101813_274478 (dark blue). The cms_pf_qcd_nopu dataset was also increased in this run from 5M to 20M events. Overall, this corresponds to about 10 epochs over the full set of concatenated data, though the epochs in the two runs are not directly comparable. We see that the jet resolution metric (IQR) and the jet reco-gen matching fraction metric keep improving throughout the training.

[Screenshots: jet resolution (IQR) and jet reco-gen matching fraction vs. training step]

The training loss is stable across ~20 consecutive 24-hour restarts. The regression loss fluctuates somewhat; a more stable configuration could potentially be better for optimization.
[Screenshot: training loss vs. training step across restarts]

The validation loss is computed at regular step intervals over a subset of the data (to avoid spending too much time not training) and shows convergence.
[Screenshot: validation loss vs. training step]
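For context, a common way to implement this kind of periodic subset validation (a sketch, not necessarily the exact approach in this PR) is to evaluate a fixed number of validation batches every so many steps; `num_val_batches`, `val_interval`, and the `model(batch)` loss call below are illustrative placeholders.

```python
import itertools
import torch

@torch.no_grad()
def validate_subset(model, val_loader, num_val_batches=100):
    # Evaluate only a fixed number of validation batches so that periodic
    # validation does not dominate the wall time of the training job.
    model.eval()
    losses = [model(batch).item()  # placeholder forward pass + loss
              for batch in itertools.islice(val_loader, num_val_batches)]
    model.train()
    return sum(losses) / max(len(losses), 1)

# inside the step loop, e.g. every `val_interval` steps:
# if step % val_interval == 0:
#     val_loss = validate_subset(model, val_loader)
```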

@jpata jpata merged commit ea1bf04 into main Dec 23, 2025
2 checks passed
