Introduction to Atmospheric Machine Learning Benchmarking System

The Atmospheric Machine Learning Benchmarking System (AMBS) aims to provide state-of-the-art video prediction methods applied to the meteorological domain. In the scope of the study Temperature forecasting by deep learning methods (Gong et al., 2022), the hourly evolution of the 2m temperature over a user-defined region is targeted.

This repository contains the source code to reproduce the experiments in Gong et al., 2022 and also describes how to access the required datasets. This README describes how to train the video prediction model architectures (vanilla ConvLSTM, WeatherBench and SAVP) with ERA5 reanalysis data originally provided by the European Centre for Medium-Range Weather Forecasts. The deep neural networks are trained to predict the 2m temperature over the next 12 hours based on data from the preceding 12 hours. In the basic configuration, the 850 hPa temperature, the total cloud cover, and the 2m temperature itself serve as input variables. However, ablation studies in terms of the predictor variables, the target region, and the input sequence length are possible, as described in the paper.

Requirements

Recommended:

Getting started

Download of the repository

Two versions of the workflow are available.

  1. A frozen version is provided via zenodo under this DOI: 10.5281/zenodo.6901503. Download the provided zip-archive and unpack it:

    unzip ambs.zip
  2. The continuously updated repository can be cloned from gitlab:

    git clone --single-branch --branch Gong2022_temperature_forecasts https://gitlab.jsc.fz-juelich.de/esde/machine-learning/ambs.git

This will create a directory called ambs under which this README-file and three subdirectories are placed. The licenses for the software used in this repository are listed under LICENSES/. Two Jupyter Notebooks and a shell-script which have been used to do some extra evaluation following the reviewer comments can be found under the Jupyter_Notebooks/-directory. However, these evaluations have not been fully integrated into the workflow yet. The subdirectory video_prediction_tools/ contains everything which is needed in the workflow and is, therefore, called the top-level directory in the following.

Thus, change into this subdirectory after cloning:

cd ambs/video_prediction_tools/

Download of NVIDIA's TF1.15 singularity container

If your HPC-system allows the usage of singularity containers or if you have an NVIDIA GPU available, you can run the workflow with the help of NVIDIA's TensorFlow 1.15 containers. Note that this is the recommended approach! To get the correct container version, check your NVIDIA driver using the command nvidia-smi. Then search here for a suitable container version (try to get the latest possible container) and download the singularity image via

singularity pull <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:<version>-tf1-py3

where <version> must be set accordingly. Ensure that your target directory (<path_to_image>) offers enough disk space; the respective images are about 3-5 GB in size.
Then, create a symbolic link of the singularity container into the HPC_scripts- and no_HPC_scripts-directories, respectively:

ln -s <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif HPC_scripts/tensorflow_<version>-tf1-py3.sif
ln -s <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif no_HPC_scripts/tensorflow_<version>-tf1-py3.sif

Note the slightly different names used for the symbolic links; this is recommended so that the original file and the symbolic links can easily be distinguished.
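For illustration, with a hypothetical container release 21.09 (choose the release that matches your NVIDIA driver) and $HOME/containers as target directory, the three commands above would read:

singularity pull $HOME/containers/nvidia_tensorflow_21.09-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:21.09-tf1-py3
ln -s $HOME/containers/nvidia_tensorflow_21.09-tf1-py3.sif HPC_scripts/tensorflow_21.09-tf1-py3.sif
ln -s $HOME/containers/nvidia_tensorflow_21.09-tf1-py3.sif no_HPC_scripts/tensorflow_21.09-tf1-py3.sif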

Workflow without singularity containers

It is also possible to run our workflow without NVIDIA's singularity containers. The settings to enable this are outlined below where the set-up of the virtual environment is described.
However, GPU support for TensorFlow is generally recommended to avoid extremely long training times. If only a CPU is available, training can still be executed on a small toy dataset (e.g. for testing purposes on a private computer).

Virtual environment

On HPC-systems

The runscripts under HPC_scripts can be used provided that the HPC-system uses SLURM for managing jobs. Otherwise, you may try the runscripts under no_HPC_scripts or set up your own runscripts based on your operating system.

Case I - With TF1.15 singularity container

After retrieving a singularity container that fits your HPC-system (see above), create a virtual environment as follows:

source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -tf_container=<used_container>

Case II - Without TF1.15 singularity container, but with software modules

Provided that your HPC-system offers TF 1.13 (or later) via software modules, the virtual environment can be set up after adapting modules_train.sh. In addition to TensorFlow (and its dependencies), modules for opening and reading h5- and netCDF-files must be loaded as well. Afterwards, the virtual environment can be created by

source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer 

On other systems

On other systems with access to an NVIDIA GPU (e.g. a personal computer or a cluster without SLURM), the virtual environment can be set up by adding the flag -l_nohpc. When working with a CPU or with a GPU from another hardware manufacturer, the flag -l_nocontainer must also be set. In the latter case, the command to set up the virtual environment reads as

source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer -l_nohpc

Further details on the arguments

In the set-up commands for the virtual environment mentioned above, <my_virtual_env> corresponds to the user-defined name of the virtual environment. <my_target_dir> points to an (existing) directory which offers enough disk space to store large amounts of data (>>100 GB). This directory should also already hold the ERA5 data as described below. Besides, the basic directory tree for the output of the workflow steps should follow the description provided here. The argument -tf_container=<used_container> allows you to specify the singularity container to be used (in Case I only!). Thus, <used_container> should correspond to tensorflow_<version>-tf1-py3.sif as described in this section above.
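For illustration, a complete call could look like the following, where the environment name, the target directory and the container version are hypothetical and must be adapted to your setup:

source create_env.sh venv_ambs -base_dir=/p/scratch/myproject/ambs_data -tf_container=tensorflow_21.09-tf1-py3.sif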

Get the datasets

The experiments described in the GMD paper rely on the ERA5 dataset, from which 13 years are used to build the training, validation and test datasets for the video prediction models.

Complete ERA5 dataset (> 100 GB of disk space)

To obtain the complete input datasets, two approaches are possible:

  1. The raw ERA5 reanalysis data can be downloaded from the ECMWF MARS archive. Once access is granted, the data can be retrieved with the provided retrieval script by specifying a spatial resolution of 0.3° (see keyword GRID as described here). Different meteorological variables can be selected via the key param (see here), which also supports requests based on short names. A comprehensive overview is provided by the parameter database. In the experiments of this study, the variables 2t, tcc and t have been downloaded, where the latter has been interpolated onto the 850 hPa pressure level with CDO (a sketch of such an interpolation is given after this list). For further information on the ERA5 dataset, please consult the documentation.
  2. The extracted and preprocessed data used in the manuscript can also be downloaded via datapub. This allows the user to start directly with training the video prediction models, but restricts the possible applications to the experiments conducted in our study. Further details on the preprocessed data are outlined in the corresponding README.
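The interpolation onto the 850 hPa pressure level mentioned in point 1 could, for example, be done with CDO's ml2pl operator as sketched below (assuming the model-level file also contains the log surface pressure field that ml2pl requires; the file names are placeholders):

# interpolate all model-level fields (including the temperature t) onto 850 hPa (85000 Pa)
cdo ml2pl,85000 <era5_file>_ml.grb <era5_file>_t850.grb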

Toy dataset

For data-driven ML models, the typical use case is to work on large datasets. Nevertheless, we also prepared a toy dataset (data from one month in 2008 with a few variables) to help users run tests on their own machine or to do some quick tests.

Climatological reference data

To compute anomaly correlations in the postprocessing step (see below), the climatological mean of the evaluated data is required. This data constitutes the climatological mean for each hour of the day and for each month, based on the period 1990-2019. The data are also provided with our toy dataset and can be downloaded from b2share.
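For reference, the anomaly correlation is computed from anomalies with respect to this climatological mean. A sketch of the standard definition is given below, where c_{m,h} denotes the climatological mean for month m and hour of the day h; the exact implementation in the postprocessing code may differ in details such as spatial weighting:

f'_i = f_i - c_{m,h}, \qquad o'_i = o_i - c_{m,h}
\mathrm{ACC} = \frac{\sum_i f'_i \, o'_i}{\sqrt{\sum_i (f'_i)^2}\,\sqrt{\sum_i (o'_i)^2}}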

Run the workflow

Depending on the computing system you are working on, the workflow steps will be invoked by dedicated runscripts either from the directory HPC_scripts/ (with SLURM) or from no_HPC_scripts/ (without SLURM). The directory names are self-explanatory.

To help users conduct different experiments with varying configurations (e.g. input variables, hyperparameters etc.), each runscript can be set up conveniently with the help of the Python script generate_runscript.py. Its usage as well as the workflow runscripts are described subsequently. However, the runscript templates (e.g. HPC_scripts/preprocess_data_era5_step1_template.sh) can be consulted to retrieve the parameters that are set for the main experiment in our study (see Dataset ID 1-3 in the manuscript or this directory on datapub).

Create customized runscripts

Specific runscripts for each workflow step (see below) are generated conveniently by keyboard interaction.

The interactive Python script generate_runscript.py under the env_setup-directory has to be executed after running create_env.sh. Note that this script only creates a new virtual environment if <env_name> has not been used before. If the corresponding virtual environment already exists, it is simply activated.

After executing

python generate_runscript.py --venv_path <env_name>

you will be asked which workflow runscript shall be generated. You can choose one of the following workflow step names:

  • extract
  • preprocess1
  • preprocess2
  • train
  • postprocess

The subsequent keyboard interaction then allows the user to make individual settings for the workflow step at hand. By typing help, guidance for the keyboard interaction can be obtained.

Note: The runscript creation depends on the preceding steps (i.e. the arguments provided via keyboard interaction are checked). Thus, each runscript should be created only after the preceding workflow step has been completed successfully, rather than all at once at the beginning!

Running the workflow steps

Having created the runscripts by keyboard interaction, you can run the workflow steps sequentially.

Note that you have to adapt the batch parameters such as #SBATCH --account or #SBATCH --partition when running on an HPC-system with SLURM support.
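For illustration, the corresponding lines in the header of a generated runscript might be adapted as follows; the account and partition names are placeholders for your system:

#SBATCH --account=<your_compute_project>
#SBATCH --partition=<your_partition>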

The following steps are part of the workflow:

  1. Data Extraction:
    This script retrieves the requested variables for user-defined years from the ERA5 reanalysis grib-files and stores the data in netCDF-files.

    [sbatch] ./data_extraction_era5.sh
  2. Data Preprocessing:
    In this step, the ERA5 data is sliced to the region of interest (preprocessing step 1). All data is loaded into memory once, which allows computing some statistics (for later normalization); the data is then saved as pickle-files in the output directory. The TFRecord-files which are streamed to the neural network for training and postprocessing are created in preprocessing step 2. Thus, two (batch) scripts have to be executed:

    [sbatch] ./preprocess_data_era5_step1.sh
    [sbatch] ./preprocess_data_era5_step2.sh
  3. Training:
    Training of one of the available models with the preprocessed data happens in this step. Note that the exp_id is generated automatically when running generate_runscript.py.

    [sbatch] ./train_model_era5_<exp_id>.sh
  4. Postprocessing:
    At this step, the trained model is applied to the test dataset. The predictions are stored in netCDF-files and the models are also evaluated with several evaluation metrics. Besides, example plots covering the range of the MSE are created for visualization. Note that the exp_id is inferred from the chosen experiment when running generate_runscript.py. Furthermore, the climatology reference data for calculating the anomaly correlation coefficient is expected to be placed under <my_target_dir>.

    [sbatch] ./visualize_postprocess_era5_<exp_id>.sh
  5. Meta-Postprocessing:
    AMBS also provides a runscript to compare different models against each other (called meta-postprocessing). This happens in the meta_postprocess-step. The runscript generator currently cannot handle this step; instead, it can be configured by adapting the file meta_config.json in the meta_postprocess_config/-directory. The related runscript can be created from the template which is also provided under HPC_scripts/ and no_HPC_scripts/, respectively.

    [sbatch] ./meta_postprocess_era5.sh

Additional Jupyter Notebooks

Following up on the interactive discussion during the peer-review phase (click on discussion when opening the manuscript's landing page), some additional evaluations have been conducted. While training of one convolutional model from WeatherBench has been integrated into the workflow, some evaluations (e.g. the ERA5 short-range forecast evaluation) have been realized in Jupyter Notebooks. These Notebooks are provided in the Jupyter_Notebooks/-directory, where further (technical) details are given in the Notebooks themselves. The software requirements to run these Jupyter Notebooks are the same as for the workflow (see above).

Directory tree and naming convention

To successfully run the workflow and enable tracking the results from each workflow step, input and output directories as well as the file names should follow the convention depicted below.

At first, we show the directory structure for the ERA5 dataset which serves as the raw input data source in this study. In detail, the data is available at hourly resolution and stored in two different kinds of grib files. The files with suffix *_ml.grb provide data on the model levels of the underlying IFS model (to allow subsequent interpolation onto pressure levels), whereas the *_sf.grb files include data without a vertical dimension.

├── ERA5 dataset
│   ├── <YYYY>
│   │   ├── <MM>
│   │   │   ├── *_ml.grb 
│   │   │   ├── *_sf.grb 
│   │   │   ├── ...
│   │   ├── <MM>
│   │   │   ├── *_ml.grb 
│   │   │   ├── *_sf.grb 
│   │   │   ├── ...
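To quickly check which variables are actually contained in such grib files, an inspection with CDO may help (a sketch, assuming CDO is installed; the path is a placeholder following the tree above):

cdo sinfov <path_to_era5>/<YYYY>/<MM>/<file>_sf.grb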

The base output directory, where all the results of the workflow are stored, should be set up when running the workflow for the first time (see <my_target_dir>-parameter of create_env.sh as described here).

The structure of the base output directory (i.e. the directory tree) should be as follows. More details on the naming convention are provided below.

├── extractedData
│   ├── <YYYY>
│   │   ├── <MM>
│   │   │   ├── ecmwf_era5_<YYMMDDHH>.nc
├── preprocessedData
│   ├── <directory_name_convention>
│   │   ├── pickle
│   │   │   ├── <YYYY>
│   │   │   │   ├── X_<MM>.pkl
│   │   │   │   ├── T_<MM>.pkl
│   │   │   │   ├── stat_<MM>.pkl
│   │   ├── tfrecords_seq_len_<X>
│   │   │   ├── sequence_Y_<YYYY>_M_<MM>.tfrecords
│   │   │── metadata.json
│   │   │── options.json
├── models
│   ├── <directory_name_convention>
│   │   ├── <model_name>
│   │   │   ├── <timestamp>_<user>_<exp_id>
│   │   │   │   ├── checkpoint_<iteration_step>
│   │   │   │   │   ├── model_*
│   │   │   │   │── timing_per_iteration_time.pkl
│   │   │   │   │── timing_total_time.pkl
│   │   │   │   │── timing_training_time.pkl
│   │   │   │   │── train_losses.pkl
│   │   │   │   │── val_losses.pkl
│   │   │   │   │── *.json 
├── results
│   ├── <directory_name_convention>
│   │   ├── <model_name>
│   │   │   ├── <timestamp>_<user>_<exp_id>
│   │   │   │   ├── vfp_date_<YYYYMMDDHH>_*.nc
│   │   │   │   ├── evalutation_metrics.nc
│   │   │   │   ├── *.png
├── meta_postprocess
│   ├── <exp_id>

Overview on placeholders of the output directory tree

Placeholder                    Value
<YYYY>                         four-digit year, e.g. 2007, 2008 etc.
<MM>                           two-digit month, e.g. 01, 02, ..., 12
<DD>                           two-digit day, e.g. 01, 02, ..., 31
<HH>                           two-digit hour, e.g. 01, 02, ..., 24
<X>                            index of sequence (TFRecords)
<directory_name_convention>    name indicating the data period, the target domain and the selected variables
<model_name>                   convLSTM, savp or weatherBench
<timestamp>                    time stamp of the experiment (from the runscript generator)
<user>                         the user name on the operating system
<exp_id>                       experiment ID (customized by the user)

Directory name convention

The meaning of all components of the directory name convention Y<YYYY>-<YYYY>M<MM>to<MM>-<nx>x<ny>-<nn.nn>N<ee.ee>E-<var1>_<var2>_<var3> is:

  • Y<YYYY>-<YYYY>M<MM>to<MM>: data period defined by years and months
  • <nx>x<ny>: the size of the target domain, e.g. 92x56 means 92 grid points in longitude and 56 grid points in latitude direction
  • <nn.nn>N<ee.ee>E: the geolocation of the south-west corner of the target domain, e.g. 38.40N0.00E for the largest target domain
  • <var1>_<var2>_<var3>: short names of selected meteorological variables (channels)
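Putting these components together, a directory name could look as follows; this is an illustrative example only, assembled from the components above with 2t, tcc and t850 as variable short names:

Y2007-2019M01to12-92x56-38.40N0.00E-2t_tcc_t850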

Benchmarking architectures

Currently, the workflow includes the following ML architectures, and we are working on integrating more into the system:

  • vanilla ConvLSTM
  • WeatherBench (convolutional architecture)
  • SAVP (Stochastic Adversarial Video Prediction)

Contributors and contact

The project is currently developed by:

Former code developers are Scarlet Stadtler and Severin Hussmann.