
Conversation

@ma595 (Collaborator) commented May 23, 2025

Fixes: #49

Modify coupling example for high resolution:

  • Point to the model trained up to its 85th epoch.
  • Amend paths in get-model-and-data.sh.
  • Run inference.py, infer.f90 and infer.py.
  • Amend README.md with updated paths.
  • Add torch.no_grad() during inference to reduce memory requirements.
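The torch.no_grad() change can be sketched as follows. This is a minimal illustration of the technique, not the actual inference.py code; `model` and `loader` are placeholders:

```python
import torch

def run_inference(model, loader, device="cpu"):
    # eval() switches batch-norm/dropout layers to inference behaviour
    model.eval()
    outputs = []
    # no_grad() stops autograd from building a graph or retaining
    # intermediate activations, which is what reduces memory use
    with torch.no_grad():
        for batch in loader:
            outputs.append(model(batch.to(device)).cpu())
    return torch.cat(outputs)
```

The returned tensor carries no gradient history, so it can be written straight to disk without detaching.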

@ma595 ma595 marked this pull request as draft May 23, 2025 14:05
@ma595 (Collaborator, Author) commented May 27, 2025

After a few issues deploying CUDA and torch and then running poetry, I ran the inference.py script.

I get the following error, which seems to point to two problems:

  1. The batch norm layers should exist in the model definition but do not (I attempted to remove them from the checkpoint instead); and
  2. There is a dimension mismatch (the error is the same after removing the batch norm keys).

Regarding the dimension mismatch: it occurs because the checkpoint model was trained with input dimension idim = 551, whereas the 2015 data has input dimension 491. The current model derives this value from the 2015 input data, which suggests that this data does not have the same dimensionality (number of features) as the data used for training.

~/work/nonlocal_gwfluxes/era5_training # python3 inference.py -M ann -d global -v global -f uvthetaw -e 45 -m 1 -s 1 -t era5 -i inputs/ -c model-huggingface/ -o outputs/ --script
model=ann
horizontal=global
vertical=global
features=uvthetaw
epoch=45
stencil=1
month=1
checkpoint_dir=model-huggingface
input_dir=inputs
output_dir=outputs
script=True
Traceback (most recent call last):
  File "/home/matt-archer/work/nonlocal_gwfluxes/era5_training/inference.py", line 234, in <module>
    model.load_state_dict(checkpoint["model_state_dict"])
  File "/home/matt-archer/work/nonlocal_gwfluxes/.nlgw/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ANN_CNN:
	Unexpected key(s) in state_dict: "bnorm1.weight", "bnorm1.bias", "bnorm1.running_mean", "bnorm1.running_var", "bnorm1.num_batches_tracked", "bnorm2.weight", "bnorm2.bias", "bnorm2.running_mean", "bnorm2.running_var", "bnorm2.num_batches_tracked", "bnorm3.weight", "bnorm3.bias", "bnorm3.running_mean", "bnorm3.running_var", "bnorm3.num_batches_tracked", "bnorm4.weight", "bnorm4.bias", "bnorm4.running_mean", "bnorm4.running_var", "bnorm4.num_batches_tracked", "bnorm5.weight", "bnorm5.bias", "bnorm5.running_mean", "bnorm5.running_var", "bnorm5.num_batches_tracked", "bnorm6.weight", "bnorm6.bias", "bnorm6.running_mean", "bnorm6.running_var", "bnorm6.num_batches_tracked". 
	size mismatch for layer1.weight: copying a param with shape torch.Size([2204, 551]) from checkpoint, the shape in current model is torch.Size([1964, 491]).
	size mismatch for layer1.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer2.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer2.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer3.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer3.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer4.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer4.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer5.weight: copying a param with shape torch.Size([2204, 2204]) from checkpoint, the shape in current model is torch.Size([1964, 1964]).
	size mismatch for layer5.bias: copying a param with shape torch.Size([2204]) from checkpoint, the shape in current model is torch.Size([1964]).
	size mismatch for layer6.weight: copying a param with shape torch.Size([548, 2204]) from checkpoint, the shape in current model is torch.Size([548, 1964]).

What do you think @TomMelt? It's possible I'm missing something here.
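One way to diagnose errors like this is to diff the checkpoint's state_dict against the model's before calling load_state_dict. The helper below is hypothetical (not part of the repository), but it reproduces the three categories PyTorch reports:

```python
import torch

def diff_state_dicts(model_sd, ckpt_sd):
    # Keys present in the checkpoint but not the model ("Unexpected key(s)")
    unexpected = sorted(set(ckpt_sd) - set(model_sd))
    # Keys the model expects but the checkpoint lacks ("Missing key(s)")
    missing = sorted(set(model_sd) - set(ckpt_sd))
    # Shared keys whose tensor shapes disagree ("size mismatch")
    mismatched = {
        k: (tuple(model_sd[k].shape), tuple(ckpt_sd[k].shape))
        for k in set(model_sd) & set(ckpt_sd)
        if model_sd[k].shape != ckpt_sd[k].shape
    }
    return unexpected, missing, mismatched
```

Note that `load_state_dict(..., strict=False)` would only silence the unexpected-key complaint; the shape mismatches would still raise, so it is not a fix on its own here.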

@TomMelt (Collaborator) commented May 28, 2025

I reproduce the same error as you. It looks like it might be related to hard-coded values in the utils/dataloader_definition.py file, e.g.,

elif self.features == "uvthetaw":
    self.v = np.arange(0, 491)  # for u,v,theta,w

FWIW, I did try changing every instance of 491 -> 551, but that led to this error instead:

$ python inference.py -M ann -d global -v global -f uvthetaw -e 45 -m 1 -s 1 -t era5 -i inputs/ -c model-huggingface/ -o outputs/ --script
/home/melt/.envs/nlgw/lib/python3.10/site-packages/torch/cuda/__init__.py:138: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
model=ann
horizontal=global
vertical=global
features=uvthetaw
epoch=45
stencil=1
month=1
checkpoint_dir=model-huggingface
input_dir=inputs
output_dir=outputs
script=True
Traceback (most recent call last):
  File "/home/melt/sync/cambridge/projects/current/nlgw-cam/nonlocal_gwfluxes/era5_training/inference.py", line 239, in <module>
    model.load_state_dict(checkpoint["model_state_dict"])
  File "/home/melt/.envs/nlgw/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for ANN_CNN:
        Unexpected key(s) in state_dict: "bnorm1.weight", "bnorm1.bias", "bnorm1.running_mean", "bnorm1.running_var", "bnorm1.num_batches_tracked", "bnorm2.weight", "bnorm2.bias", "bnorm2.running_mean", "bnorm2.running_var", "bnorm2.num_batches_tracked", "bnorm3.weight", "bnorm3.bias", "bnorm3.running_mean", "bnorm3.running_var", "bnorm3.num_batches_tracked", "bnorm4.weight", "bnorm4.bias", "bnorm4.running_mean", "bnorm4.running_var", "bnorm4.num_batches_tracked", "bnorm5.weight", "bnorm5.bias", "bnorm5.running_mean", "bnorm5.running_var", "bnorm5.num_batches_tracked", "bnorm6.weight", "bnorm6.bias", "bnorm6.running_mean", "bnorm6.running_var", "bnorm6.num_batches_tracked".

@ma595 (Collaborator, Author) commented May 28, 2025

Just confirming that I ran inference.py, infer.py and infer.f90 with minimal changes. As @TomMelt advised, setting the FTorch hash to e0727a7 was needed in order to run on GPU.

Suggested improvements:

  • Pass test_years as a command-line argument.
  • Currently we chop the batch norm layers out of the checkpoint, as they are not part of the model definition. Is this the correct approach?
  • In dataloader_definition.py we manually hard-code the dimensionality of the transformed data inputs to match the resolution of the data; consider reworking this. Check with @TomMelt what 551 corresponds to (presumably stacked u, v, theta, w, or 3 variables?). This could easily be checked against the data features.
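On the question of what 551 corresponds to: assuming the input dimension follows idim = nlev * nvars + 3 (nvars stacked vertical profiles plus 3 extra scalar/surface features; an assumption, not confirmed in the code), the numbers are at least self-consistent:

```python
def input_dim(nlev: int, nvars: int, n_extra: int = 3) -> int:
    # Assumed layout: nvars vertical profiles of nlev levels each,
    # plus n_extra scalar/surface features.
    return nlev * nvars + n_extra

# Under this assumption, with four variables (u, v, theta, w):
#   input_dim(122, 4) == 491  (the 2015 data)
#   input_dim(137, 4) == 551  (the checkpoint, i.e. 137 vertical levels)
```

If this holds, the checkpoint was trained on 137-level profiles while the 2015 file carries 122 levels, which would explain the 551 vs 491 mismatch directly.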

@ma595 ma595 marked this pull request as ready for review May 28, 2025 16:55
@ma595 ma595 requested review from TomMelt and amangupta2 May 28, 2025 16:58
@ma595 (Collaborator, Author) commented Jun 6, 2025

This now fixes #49.

@amangupta2 (Contributor) commented

Sorry about that. Going forward, maybe we can replace these hard-coded values with something like 3*nlev + 3 or 4*nlev + 3, so one would only need to specify the number of vertical levels for the training data.

@amangupta2 (Contributor) left a review comment

  1. There is an easy way to remove these hard-coded values. The uvw and uvtheta configs use three variables; uvthetaw uses four. Accordingly:
  • for three variables: np.arange(0, 369) --> np.arange(0, nlev*3 + 3)
  • for four variables: np.arange(0, 491) --> np.arange(0, nlev*4 + 3)
  • (np.arange(3, 247), np.arange(369, 491), axis=0) --> (np.arange(3, nlev*2 + 3), np.arange(nlev*3 + 3, nlev*4 + 3), axis=0)

Happy to add these changes to the code if we anticipate multiple retraining iterations.

  2. Somehow, I had to point to the "1x1_inputfeatures_u_v_theta_w_uw_vw_gcp_era5_training_data_hourly_2015_constant_mu_sigma_scaling01.nc" file instead of the original "1x1_inputfeatures_u_v_theta_w_uw_vw_era5_training_data_hourly_2015_constant_mu_sigma_scaling01.nc".

I am surprised the original file name worked for tracing, since its vertical dimension length might be different.
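The parameterisation suggested in point 1 could be sketched as below. This is a hedged sketch, not the repository code: the function names are hypothetical, and nlev is the number of vertical levels.

```python
import numpy as np

def feature_indices(features: str, nlev: int) -> np.ndarray:
    # uvw and uvtheta stack three variables; uvthetaw stacks four.
    # The +3 accounts for the extra non-profile features.
    if features in ("uvw", "uvtheta"):
        return np.arange(0, nlev * 3 + 3)
    if features == "uvthetaw":
        return np.arange(0, nlev * 4 + 3)
    raise ValueError(f"unknown feature set: {features}")

# The concatenated case quoted above, parameterised the same way:
def concat_indices(nlev: int) -> np.ndarray:
    return np.append(np.arange(3, nlev * 2 + 3),
                     np.arange(nlev * 3 + 3, nlev * 4 + 3))
```

With nlev = 122 these reproduce the current hard-coded values (369, 491, and the 3..247 / 369..491 ranges), so retraining at a different resolution would only require changing nlev.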

@amangupta2 amangupta2 self-requested a review June 17, 2025 23:33

Development

Successfully merging this pull request may close these issues.

add batch norms back into model def