A spin-up acceleration tool for the ORCHIDEE family of land surface models (LSMs).
Concept: The proposed machine-learning (ML) enabled spin-up acceleration procedure (MLA) predicts the steady state of any land pixel of the full model domain after training on a representative subset of pixels. Because the computational cost of the current generation of LSMs scales linearly with the number of pixels and years simulated, MLA reduces the computation time quasi-linearly with the number of pixels whose steady state is predicted by ML rather than simulated.
The aims, concepts, and workflows are documented in Sun et al. (2022).
The SPINacc package includes:
- `main.py` - The main Python module that steers the execution of SPINacc.
- `DEF_*/` - Directories with configuration files for each of the supported ORCHIDEE versions:
  - `config.py` - Settings to configure the machine learning performance.
  - `varlist.json` - Configures paths to ORCHIDEE forcing output and climate data.
  - `varlist-explained.md` - Documentation of the data sources used in SPINacc.
- `Tools/*` - Modules called by `main.py`.
- `AuxilaryTools/SteadyState_checker.py` - Tool to assess the state of equilibration in ORCHIDEE simulations.
- `tests/` - Reproducibility and regression tests.
- `ORCHIDEE_cecill.txt` - ORCHIDEE's license file.
- `job` - Job file for a bash environment.
- `job_tcsh` - Job file for a tcsh environment.
Here are the steps to launch SPINacc end-to-end, including the optional tests.
SPINacc has been developed and tested using `Python==3.9.*`.
- Navigate to the location in which you wish to install and clone the repo:

```
git clone git@github.com:CALIPSO-project/SPINacc.git
```

- Create a virtual environment and activate it:

```
python3 -m venv ./venv3
source ./venv3/bin/activate
```

- Build all relevant dependencies:

```
cd SPINacc
pip install -r requirements.txt
```
These instructions apply regardless of the system you work on; however, if you already have access to the datasets on the Obelix supercomputer, SPINacc will likely run with minimal modification (see Running on Obelix if you believe this is the case). We provide a ZENODO repository containing the forcing data as well as reference output for reproducibility testing.
It includes:

- `ORCHIDEE_forcing_data` - Explained in `DEF_Trunk/varlist-explained.md`.
- Reference data - necessary to run the reproducibility checks (now OUTDATED, see Reproducibility tests).
The `setup-data.sh` script is provided to automate the download of the associated ZENODO repository and to set the paths to the forcing data and climate data in `DEF_Trunk/varlist.json`. The ZENODO repository does not include the climate data files (variable name `twodeg`); without these, initialisation will fail and SPINacc will be unable to proceed. The climate data will be made available upon request to Daniel Goll (https://www.lsce.ipsl.fr/en/pisp/daniel-goll/).
To ensure the script works without error, set the `MYTWODEG` and `MYFORCING` paths appropriately. The `MYFORCING` path points to where you want the forcing data to be extracted. The default location is `ORCHIDEE_forcing_data` in the project root.
The script runs the `sed` command to replace all occurrences of `/home/surface5/vbastri/` in `DEF_Trunk/varlist.json` with the path to the downloaded and extracted forcing data, e.g. `/your/path/to/forcing/vlad_files/vlad_files/`. This can also be done manually if desired.
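If you prefer to edit the file yourself, a minimal Python sketch along these lines performs the same substitution (the paths below are placeholders; adjust them to your extraction location):

```python
from pathlib import Path

# Placeholder paths: point new_root to wherever the ZENODO archive was extracted.
varlist = Path("DEF_Trunk/varlist.json")
old_root = "/home/surface5/vbastri/"
new_root = "/your/path/to/forcing/vlad_files/vlad_files/"

varlist.write_text(varlist.read_text().replace(old_root, new_root))
```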
These instructions are designed to get you up and running with SPINacc quickly and then run the accompanying tests. See the section below on Obtaining 'best' performance for a more detailed overview of how to optimally adjust ML performance.
- In `DEF_Trunk/config.py`, modify the `results_dir` variable to point to a different path if desired. To run SPINacc from end-to-end, ensure that the steps are set as follows:

```
tasks = [
    1,
    2,
    4,
    5,
]
# 1 = test clustering
# 2 = clustering
# 3 = compress forcing
# 4 = ML
# 5 = evaluation / visualisation
```

If running from scratch, ensure that `start_from_scratch` is set to `True` in `config.py`. The `start_from_scratch` step creates a `packdata.nc` file and only needs to be done once for a given version of ORCHIDEE. It is also possible to run just a single task, if desired.
- Then run:

```
python main.py DEF_Trunk/
```

By default, `main.py` will look for the `DEF_Trunk` directory. SPINacc supports passing other configuration / job directories as arguments to `main.py` (e.g. `python main.py DEF_CNP2/`). It is helpful to create copies of the default configurations and then modify them for your own purposes, to avoid continuously stashing work.

Results are located in your output directory under `MLacc_results.csv`. Visualisations of R2, slope and dNRMSE for each component can be found in `Eval_all_biomassCpool.png`, `Eval_all_litterCpool.png` and `Eval_all_somCpool.png`. For other versions of ORCHIDEE, e.g. CNP2, outputs will be structured similarly.
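For a quick look at the metrics table, something like the following works (a minimal sketch; it assumes pandas is installed and the path points to the `results_dir` you configured):

```python
import pandas as pd

# Hypothetical output location: replace with the results_dir set in config.py.
results = pd.read_csv("/your/results_dir/MLacc_results.csv")
print(results.head())
```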
It is possible to run a set of baseline checks that compare the code's output against the reference output. As of January 2025, the reference dataset has been updated and is now stored at https://github.com/ma595/SPINacc-results for CNP2 and Trunk. We are working towards a new Zenodo release.
These tests are useful to ensure that regressions have not been unexpectedly introduced during development.
- Begin by downloading the reference output from GitHub:

```
git clone https://github.com/ma595/SPINacc-results
```
- In `DEF_Trunk/config.py`, set the `reference_dir` variable to point to `SPINacc-results/Trunk`.
- [Optional] To execute the reproducibility checks at runtime, ensure that `True` values are set in all relevant steps in `DEF_Trunk/config.py`.
- Alternatively, the tests can be executed after the successful completion of a run by doing the following:

```
pytest --trunk=DEF_Trunk/ -v --capture=sys
```

It is possible to point to different output directories with the `--trunk` flag. To run a single test, do:

```
pytest --trunk=DEF_Trunk -v --capture=sys ./tests/test_task4.py
```

The command line arguments `-v` and `--capture=sys` make the test output more visible to users.
- The configuration `config.py` in branch `main` should be configured correctly, but if not, ensure that the following assignments have been made:

```
kmeans_clusters = 4
max_kmeans_clusters = 9
random_seed = 1000
algorithms = ['bt',]
take_year_average = True
take_unique = False
smote_bat = True
sel_most_PFTs = False
```

The SPINacc-results repo also contains the settings used to obtain the reference output: https://github.com/ma595/SPINacc-results/tree/main/jobs/DEF_Trunk.
- The checks are as follows:
  - `test_init.py`: Computes a recursive compare of `packdata.nc` against the reference `packdata.nc`.
  - `test_task1.py`: Checks `dist_all.npy` against the reference.
  - `test_task2.py`: Checks `IDloc.npy`, `IDSel.npy` and `IDx.npy` against the reference.
  - `test_task3.py`: Currently not checked.
  - `test_task4.py`: Compares the new `MLacc_results.csv` across all components. Tolerance is 1e-2.
  - `test_task4_2.py`: Compares the updated restart file `SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc` against the reference.
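The NetCDF comparisons follow the pattern sketched below (a minimal, hypothetical example using xarray and numpy; the actual assertions live in `tests/`):

```python
import numpy as np
import xarray as xr

def assert_netcdf_close(new_path, ref_path, rtol=1e-2):
    """Recursively compare every variable of a new NetCDF file against the reference."""
    new, ref = xr.open_dataset(new_path), xr.open_dataset(ref_path)
    for name, ref_var in ref.data_vars.items():
        np.testing.assert_allclose(new[name].values, ref_var.values, rtol=rtol)
```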
An automated test that runs the entire DEF_Trunk pipeline from end-to-end is executed when a release is tagged. It can be forced to run using GitHub's command line tool `gh`. See the official documentation for how to install it on your system. The remote workflow runs can then be listed as follows:
```
gh run list --workflow=build-and-run.yml
```
The following settings can change the performance of SPINacc:
- `algorithms`: ML algorithms. Multiple can be selected for any given run; the results will be stacked in `MLacc_results.csv`. Options include:
  - `bt`: Bagging tree
  - `rf`: Random forest
  - `nn`: Neural network
  - `ridge`: Ridge regression
  - `best`: A 'shotgun' approach that selects the best performing ML algorithm for the given target variable. This is assessed based on the performance on a subset of the data (see `select_best_model` in `train.py`), so worse performance may be exhibited on some variables compared to selecting `bt` directly.
- `take_year_average` (required): If `True`, all annual data is averaged into a single year's worth of data. If `False`, all years are used - this has the effect of multiplying the quantity of training data, X, for a given target variable Y, by the number of years (see the illustration after this list).
- `smote_bat` (required): Synthetic minority oversampling.
- `take_unique` (default - `True`): Take unique pixels only from the output of the clustering step - this will reduce the number of selected pixels, removing duplicates. This function was kept to maintain correspondence with a previous implementation of SPINacc.
- `old_cluster` (default - `True`): If `True`, the clustering step will use the old clustering method - i.e. randomly sample Nc examples, or take all samples if the number of samples is less than Nc. If `old_cluster = False`, the new clustering method will take max(Nc, 20% subset of locations).
- `sel_most_PFT_sites` (default - `False`): If `True` and `old_cluster = False`, it will preferentially select samples that contain more PFTs using the 20% rule detailed previously. If `old_cluster = True` and `sel_most_PFT_sites = True`, an error is thrown.
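As a rough illustration of what `take_year_average` does to the amount of training data (hypothetical array shapes, not the actual SPINacc data layout):

```python
import numpy as np

# Hypothetical shapes: 11 years of data for 200 selected pixels and 6 predictor variables.
n_years, n_pixels, n_features = 11, 200, 6
forcing = np.random.rand(n_years, n_pixels, n_features)

# take_year_average = True: one time-averaged sample per pixel -> shape (200, 6)
X_avg = forcing.mean(axis=0)

# take_year_average = False: every year is kept as a separate sample -> shape (2200, 6),
# i.e. the quantity of training data is multiplied by the number of years.
X_all = forcing.reshape(n_years * n_pixels, n_features)
```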
We recommend always setting `parallel = True` in `config.py` to speed up the execution of SPINacc. The serial and parallel executions give exactly the same results, but it may sometimes be useful to turn parallelism off for debugging purposes.
The following settings are recommended to obtain the best machine learning performance with SPINacc. Note that training time will be longer with `take_year_average` set to `False`.

```
algorithms = ["best"]
take_year_average = False  # this will take much longer to finish
take_unique = True
smote_bat = True
```
A new clustering approach is still being tested to see if performance is improved. See PR #93. To test the new implementation, set the following:

```
sel_most_PFTs = True
old_cluster = False
```
If you are already using the Obelix supercomputer, it is likely that SPINacc will work without much adjustment to the `varlist.json` file.
Jobs can be submitted using the provided PBS scripts, e.g. `job`:
- In `job`: set `setenv dirpython '/your/path/to/SPINacc/'` and `setenv dirdef 'DEF_Trunk/'`
- Then launch your first job using `qsub -q short job`, for task 1
- For tasks 3 and 4, it is better to use `qsub -q medium job`
An overview of the tasks is provided as follows:
Initialisation (`start_from_scratch`): extracts climatic variables over 11 years and stores them in a `packdata.nc` file. Subsequent steps cannot proceed unless this step completes successfully.
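A quick way to sanity-check this step is to open the resulting file (a minimal sketch, assuming xarray is installed; the path is a placeholder for your configured results directory):

```python
import xarray as xr

# Hypothetical path: packdata.nc is written to the results directory set in config.py.
packdata = xr.open_dataset("/your/results_dir/packdata.nc")
print(packdata)  # lists the extracted climatic variables and their dimensions
```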
Task 1 (test clustering): evaluates the impact of varying the number of K-means clusters on model performance, setting a default of 4 clusters and producing a `dist_all.png` graph.
Task 2 (clustering): performs the clustering using a K-means algorithm and saves the locations of the selected pixels (files starting with 'ID'). The locations of the selected pixels (red) for a given PFT, together with all pixels whose cover fraction exceeds `cluster_thres` [defined in `varlist.json`] (grey), are plotted in the figures 'ClustRes_PFT**.png'. An example for PFT2 is shown here:
Task 3 (compress forcing): creates compressed forcing files for ORCHIDEE, containing data for the selected pixels only, aligned on a global pseudo-grid for efficient pixel-level simulations, with file specifications listed in `varlist.json`.
Task 4 (ML):

- Performs the ML training on results from the ORCHIDEE simulation using the compressed forcing (production mode: `resp-format=compressed`) or the global forcing (debug mode: `resp-format=global`).
- Extrapolates to a global grid.
- Writes the state variables into global restart files for ORCHIDEE. For Trunk, this is `SBG_FGSPIN.340Y.ORC22v8034_22501231_stomate_rest.nc`.
- Evaluates the ML training outputs against the real model outputs and writes performance metrics to `MLacc_results.csv`.
Task 5 (evaluation / visualisation): visualises the ML performance from Task 4, offering two evaluation modes, global pixel evaluation and leave-one-out cross-validation (LOOCV) for training sites. It generates plots for various state variables at the PFT level, including comparisons of ML predictions with conventional spin-up data.



