
Conversation

@ma595 (Collaborator) commented May 23, 2025

Fixes: #49

Modify coupling example for high resolution:

  • Point to the model trained up to its 85th epoch.
  • Amend paths in get-model-and-data.sh.
  • Run inference.py, infer.f90 and infer.py.
  • Amend README.md with updated paths.
  • Add torch.no_grad() during inference to reduce memory requirements.
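The torch.no_grad() change can be sketched as follows. This is a minimal illustration of the technique, not the actual inference.py code; `model` and `loader` are placeholders:

```python
import torch

def run_inference(model, loader, device="cpu"):
    # eval() switches batch-norm/dropout layers to inference behaviour
    model.eval()
    outputs = []
    # no_grad() stops autograd from building a graph or retaining
    # intermediate activations, which is what reduces memory use
    with torch.no_grad():
        for batch in loader:
            outputs.append(model(batch.to(device)).cpu())
    return torch.cat(outputs)
```

The returned tensor carries no gradient history, so it can be written straight to disk without detaching.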

@ma595 ma595 marked this pull request as draft May 23, 2025 14:05
@ma595 (Collaborator, Author) commented May 27, 2025

After a few issues deploying CUDA and torch and then running poetry, I ran the inference.py script.

I get the following error, which seems to point to two problems:

  1. The batch norm layers should exist in the model definition but do not (I attempted to remove them from the checkpoint instead); and
  2. There is a dimension mismatch (the error is the same after removing the batch norm keys).

Regarding the dimension mismatch: it occurs because the checkpoint model was trained with input dimension idim = 551, whereas the 2015 data has input dimension 491. The current model derives this value from the 2015 input data, which suggests that this data does not have the same dimensionality (number of features) as the data used for training.

~/work/nonlocal_gwfluxes/era5_training # python3 inference.py -M ann -d global -v global -f uvthetaw -e 45 -m 1 -s 1 -t era5 -i inputs/ -c model-huggingface/ -o outputs/ --script
model=ann
horizontal=global
vertical=global
features=uvthetaw
epoch=45
stencil=1
month=1
checkpoint_dir=model-huggingface
input_dir=inputs
output_dir=outputs
script=True
Traceback (most recent call last):
  File "/home/matt-archer/work/nonlocal_gwfluxes/era5_training/inference.py", line 234, in <module>
    model.load_state_dict(checkpoint["model_state_dict"])
  File "/home/matt-archer/work/nonlocal_gwfluxes/.nlgw/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ANN_CNN:
	Unexpected key(s) in state_dict: "bnorm1.weight", "bnorm1.bias", "bnorm1.running_mean", "bnorm1.running_var", "bnorm1.num_batches_tracked", "bnorm2.weight", "bnorm2.bias", "bnorm2.running_mean", "bnorm2.running_var", "bnorm2.num_batches_tracked", "bnorm3.weight", "bnorm3.bias", "bnorm3.running_mean", "bnorm3.running_var", "bnorm3.num_batches_tracked", "bnorm4.weight", "bnorm4.bias", "bnorm4.running_mean", "bnorm4.running_var", "bnorm4.num_batches_tracked", "bnorm5.weight", "bnorm5.bias", "bnorm5.running_mean", "bnorm5.running_var", "bnorm5.num_batches_tracked", "bnorm6.weight", "bnorm6.bias", "bnorm6.running_mean", "bnorm6.running_var", "bnorm6.num_batches_tracked". 
	size mismatch for layer1.weight: copying a param with shape torch.Size([2204, 551]) from checkpoint, the shape in current model is torch.Size([1964, 491]).
	size mismatch for layer1.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer2.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer2.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer3.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer3.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer4.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer4.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer5.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer5.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer6.weight: copying a param with shape torch.Size([548, 2204]) from checkpoint, the shape in current model is torch.Size([548, 1964]).

What do you think @TomMelt? It's possible I'm missing something here.
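One way to diagnose errors like this is to diff the checkpoint's state_dict against the model's before calling load_state_dict. The helper below is hypothetical (not part of the repository), but it reproduces the three categories PyTorch reports:

```python
import torch

def diff_state_dicts(model_sd, ckpt_sd):
    # Keys present in the checkpoint but not the model ("Unexpected key(s)")
    unexpected = sorted(set(ckpt_sd) - set(model_sd))
    # Keys the model expects but the checkpoint lacks ("Missing key(s)")
    missing = sorted(set(model_sd) - set(ckpt_sd))
    # Shared keys whose tensor shapes disagree ("size mismatch")
    mismatched = {
        k: (tuple(model_sd[k].shape), tuple(ckpt_sd[k].shape))
        for k in set(model_sd) & set(ckpt_sd)
        if model_sd[k].shape != ckpt_sd[k].shape
    }
    return unexpected, missing, mismatched
```

Note that `load_state_dict(..., strict=False)` would only silence the unexpected-key complaint; the shape mismatches would still raise, so it is not a fix on its own here.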

@TomMelt (Collaborator) commented May 28, 2025

I reproduce the same error as you. It looks like it might be related to hard-coded values in the utils/dataloader_definition.py file, e.g.,

elif self.features == "uvthetaw":
    self.v = np.arange(0, 491)  # for u,v,theta,w

FWIW, I did try changing every instance of 491 -> 551, but that led to this error instead:

$ python inference.py -M ann -d global -v global -f uvthetaw -e 45 -m 1 -s 1 -t era5 -i inputs/ -c model-huggingface/ -o outputs/ --script
/home/melt/.envs/nlgw/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
model=ann
horizontal=global
vertical=global
features=uvthetaw
epoch=45
stencil=1
month=1
checkpoint_dir=model-huggingface
input_dir=inputs
output_dir=outputs
script=True
Traceback (most recent call last):
  File "/home/melt/sync/cambridge/projects/current/nlgw-cam/nonlocal_gwfluxes/era5_training/inference.py", line 239, in <module>
    model.load_state_dict(checkpoint["model_state_dict"])
  File "/home/melt/.envs/nlgw/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ANN_CNN:
        Unexpected key(s) in state_dict: "bnorm1.weight", "bnorm1.bias", "bnorm1.running_mean", "bnorm1.running_var", "bnorm1.num_batches_tracked", "bnorm2.weight", "bnorm2.bias", "bnorm2.running_mean", "bnorm2.running_var", "bnorm2.num_batches_tracked", "bnorm3.weight", "bnorm3.bias", "bnorm3.running_mean", "bnorm3.running_var", "bnorm3.num_batches_tracked", "bnorm4.weight", "bnorm4.bias", "bnorm4.running_mean", "bnorm4.running_var", "bnorm4.num_batches_tracked", "bnorm5.weight", "bnorm5.bias", "bnorm5.running_mean", "bnorm5.running_var", "bnorm5.num_batches_tracked", "bnorm6.weight", "bnorm6.bias", "bnorm6.running_mean", "bnorm6.running_var", "bnorm6.num_batches_tracked".

@ma595 (Collaborator, Author) commented May 28, 2025

Just confirming that I ran inference.py, infer.py and infer.f90 with minimal changes. As @TomMelt advised, setting the FTorch hash to e0727a7 was needed in order to run on GPU.

Suggested improvements:

  • Pass test_years as a command-line argument.
  • Currently we chop the batch norm layers out of the checkpoint, as they are not part of the model definition. Is this the correct approach?
  • In dataloader_definition.py we manually hard-code the dimensionality of the transformed data inputs to match the resolution of the data; consider reworking this. Check with @TomMelt what 551 corresponds to (presumably stacked u, v, theta, w, or 3 variables?). This could easily be checked against the data features.
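On the question of what 551 corresponds to: assuming the input dimension follows idim = nlev * nvars + 3 (nvars stacked vertical profiles plus 3 extra scalar/surface features; an assumption, not confirmed in the code), the numbers are at least self-consistent:

```python
def input_dim(nlev: int, nvars: int, n_extra: int = 3) -> int:
    # Assumed layout: nvars vertical profiles of nlev levels each,
    # plus n_extra scalar/surface features.
    return nlev * nvars + n_extra

# Under this assumption, with four variables (u, v, theta, w):
#   input_dim(122, 4) == 491  (the 2015 data)
#   input_dim(137, 4) == 551  (the checkpoint, i.e. 137 vertical levels)
```

If this holds, the checkpoint was trained on 137-level profiles while the 2015 file carries 122 levels, which would explain the 551 vs 491 mismatch directly.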

@ma595 ma595 marked this pull request as ready for review May 28, 2025 16:55
@ma595 ma595 requested review from TomMelt and amangupta2 May 28, 2025 16:58
@ma595 (Collaborator, Author) commented Jun 6, 2025

This now fixes #49.

@amangupta2 (Contributor) commented

Sorry about that. Going forward, maybe we can replace these hard-coded values with something like 3*nlev + 3 or 4*nlev + 3, so one would only need to specify the number of vertical levels for the training data.

@amangupta2 (Contributor) left a review comment

  1. There is an easy way to remove these hard-coded values. The uvw and uvtheta configs use three variables; uvthetaw uses four. Accordingly:
  • for three variables: np.arange(0, 369) --> np.arange(0, nlev*3 + 3)
  • for four variables: np.arange(0, 491) --> np.arange(0, nlev*4 + 3)
  • (np.arange(3, 247), np.arange(369, 491), axis=0) --> (np.arange(3, nlev*2 + 3), np.arange(nlev*3 + 3, nlev*4 + 3), axis=0)

Happy to add these changes to the code if we anticipate multiple retraining iterations.

  2. Somehow, I had to point to the "1x1_inputfeatures_u_v_theta_w_uw_vw_gcp_era5_training_data_hourly_2015_constant_mu_sigma_scaling01.nc" file instead of the original "1x1_inputfeatures_u_v_theta_w_uw_vw_era5_training_data_hourly_2015_constant_mu_sigma_scaling01.nc".

I am surprised the original file name worked for tracing, since its vertical dimension length might be different.
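The parameterisation suggested in point 1 could be sketched as below. This is a hedged sketch, not the repository code: the function names are hypothetical, and nlev is the number of vertical levels.

```python
import numpy as np

def feature_indices(features: str, nlev: int) -> np.ndarray:
    # uvw and uvtheta stack three variables; uvthetaw stacks four.
    # The +3 accounts for the extra non-profile features.
    if features in ("uvw", "uvtheta"):
        return np.arange(0, nlev * 3 + 3)
    if features == "uvthetaw":
        return np.arange(0, nlev * 4 + 3)
    raise ValueError(f"unknown feature set: {features}")

# The concatenated case quoted above, parameterised the same way:
def concat_indices(nlev: int) -> np.ndarray:
    return np.append(np.arange(3, nlev * 2 + 3),
                     np.arange(nlev * 3 + 3, nlev * 4 + 3))
```

With nlev = 122 these reproduce the current hard-coded values (369, 491, and the 3..247 / 369..491 ranges), so retraining at a different resolution would only require changing nlev.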

@amangupta2 amangupta2 self-requested a review June 17, 2025 23:33

Development

Successfully merging this pull request may close these issues.

add batch norms back into model def