diff --git a/README.md b/README.md
index e3d59386f83ec65a836e098b545ef98137894c67..c7ad202476bf88292f696d3cefa49ec86a2da153 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,4 @@
+
<img src="ambs_logo.jpg" width="1000" height="400">

[Check our video](https://www.youtube.com/watch?v=Tf2BDDlSDeQ)
@@ -5,269 +6,234 @@

## Table of Contents

-- [Introduction to Atmospheric Machine Learning Benchmarking System](#introduction-to-atmopsheric-machine-learning-benchmarking-system)
-- [Prepare your dataset](#prepare-your-dataset)
-  + [Access the ERA5 dataset (~TB)](#access-the-era5-dataset---tb-)
-  + [Dry run with small samples (~15 GB)](#dry-run-with-small-samples---15-gb-)
-  + [Climatological mean data](#climatological-mean-data)
-- [Prerequisites](#prerequisites)
-- [Installation](#installation)
-  * [Get NVIDIA's TF1.15 container](#get-nvidia-s-tf115-container)
-- [Start with AMBS](#start-with-ambs)
-  * [Set-up the virtual environment](#set-up-the-virtual-environment)
-    + [On JSC's HPC-system](#on-jsc-s-hpc-system)
-    + [On other HPC systems](#on-other-hpc-systems)
+- [Introduction to Atmospheric Machine Learning Benchmarking System](#introduction-to-atmospheric-machine-learning-benchmarking-system)
+- [Requirements](#requirements)
+- [Getting started](#getting-started)
+  * [Download of the repository](#download-of-the-repository)
+  * [Download of NVIDIA's TF1.15 singularity container](#download-of-nvidias-tf115-singularity-container)
+  * [Workflow without singularity containers](#workflow-without-singularity-containers)
+  * [Virtual environment](#virtual-environment)
+    + [On HPC-systems](#on-hpc-systems)
+    + [On other systems](#on-other-systems)
+    + [Further details on the arguments](#further-details-on-the-arguments)
      - [Case I - With TF1.15 singularity container](#case-i-with-tf115-singularity-container)
      - [Case II - Without TF1.15 singularity container, but with software modules](#case-ii-without-tf115-singularity-container-but-with-software-modules)
-    + [Other systems](#other-systems)
-      - [Case I - Usage of singularity TF1.15 container](#case-i---usage-of-singularity-tf115-container-1)
-      - [Case II - Usage of singularity TF1.15 container](#case-ii---usage-of-singularity-tf115-container-1)
-      - [Further details](#further-details)
-  * [Run the workflow](#run-the-workflow)
-    + [Create specific runscripts](#create-specific-runscripts)
-  * [Running the workflow substeps](#running-the-workflow-substeps)
-  * [Compare and visualize the results](#compare-and-visualize-the-results)
-  * [Input and Output folder structure and naming convention](#input-and-output-folder-structure-and-naming-convention)
-- [Benchmarking architectures](#benchmarking-architectures)
-- [Contributors and contact](#contributors-and-contact)
-- [On-going work](#on-going-work)
-
-
-## Introduction to Atmopsheric Machine Learning Benchmarking System
-
-**A**tmopsheric **M**achine Learning **B**enchmarking **S**ystem (AMBS) aims to provide state-of-the-art video prediction methods applied to the meteorological domain. In the scope of the current application, the hourly evolution of the 2m temperature over a used-defined region is focused.
-
-Different Deep Learning video prediction architectures such as ConvLSTM and SAVP are trained with ERA5 reanalysis to perform a prediction for 12 hours based on the previous 12 hours.
In addition to the 2m temperature (2t) itself, other variables can be fed to the video frame prediction models to enhance their capability to learn the complex physical processes driving the diurnal cycle of temperature. Currently, the recommended additional meteorological variables are the 850 hPa temperature (t850) and the total cloud cover (tcc) as described in our preprint GMD paper.
-
-
-## Prepare your dataset
-
-
-#### Access the ERA5 dataset (~TB)
-The experiments described in the GMD paper rely on the ERA5 dataset from which 13 years are used for the dataset of the video prediction models (training, validation and test datasets).
-
-- For users of JSC's HPC system: Access to the ERA5 dataset is possible via the data repository [meteocloud](https://datapub.fz-juelich.de/slcs/meteocloud/). The corresponding path the grib-data files (used for data extraction, see below) is: `/p/fastdata/slmet/slmet111/met_data/ecmwf/era5/grib`. If you meet access permission issues, please contact: Stein, Olaf <o.stein@fz-juelich.de>
-
-- For other users (also on other HPC-systems): You can retrieve the ERA5 data from the [ECMWF MARS archive](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-DataorganisationandhowtodownloadERA5). Once you have access to the archive, the data can be downloaded by specifying a resolution of 0.3° in the retrieval script (keyword "GRID", see [here](https://confluence.ecmwf.int/pages/viewpage.action?pageId=123799065)). The variable names and the corresponding paramID can be found in the ECMWF documentaation website [ERA5 documentations](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Howtoacknowledge,citeandrefertoERA5). For further informations on the ERA5 dataset, please consult the [documentation](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation) provided by ECMWF.
+- [Get the datasets](#get-the-datasets)
+  * [Complete ERA5 dataset (> 100 GB disk space)](#complete-era5-dataset-100-gb-disk-space)
+  * [Toy dataset](#toy-dataset)
+  * [Climatological reference data](#climatological-reference-data)
+- [Run the workflow](#run-the-workflow)
+  * [Create customized runscripts](#create-customized-runscripts)
+  * [Running the workflow steps](#running-the-workflow-steps)
+  * [Additional Jupyter Notebooks](#additional-jupyter-notebooks)
+- [Directory tree and naming convention](#directory-tree-and-naming-convention)
+- [Benchmarking architectures](#benchmarking-architectures)
+- [Contributors and contact](#contributors-and-contact)

-We recommend the users to store the data following the directory structure for the input data described [below](#Input-and-Output-folder-structure-and-naming-convention).

+## Introduction to Atmospheric Machine Learning Benchmarking System

-#### Dry run with small samples (~ 5 - ~ 15 GB)
+**A**tmospheric **M**achine Learning **B**enchmarking **S**ystem (AMBS) aims to provide state-of-the-art video prediction methods applied to the meteorological domain. In the scope of the study [Temperature forecasting by deep learning methods *(Gong et al., 2022)*](https://doi.org/10.5194/gmd-2021-430), the hourly evolution of the 2m temperature over a user-defined region is targeted.

-In our application, the typical use-case is to work on a large dataset. Nevertheless, we also prepared an example dataset (1 month data in 2007, 2008, 2009 respectively data with few variables) to help users to run tests on their own machine or to do some quick tests.
The data can be downloaded by requesting from Bing Gong <b.gong@fz-juelich.de>. Users of the deepacf-project at JSC can also access the files from `/p/project/deepacf/deeprain/video_prediction_shared_folder/GMD_samples`.

+This repository contains the source-code to reproduce the experiments in [*Gong et al., 2022*](https://doi.org/10.5194/gmd-2021-430) and also describes the access to the required datasets.
+This README describes how to train the video prediction model architectures (vanilla ConvLSTM, WeatherBench and SAVP) with [ERA5 reanalysis](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) data that are originally provided by the European Centre for Medium-Range Weather Forecasts. The deep neural networks are trained to predict the 2m temperature over the next 12 hours based on data from the preceding 12 hours. In its basic configuration, the 850 hPa temperature, the total cloud cover, and the 2m temperature itself serve as input variables. However, ablation studies in terms of the predictor variables, the target region, and the input sequence length are possible as described in the [paper](https://doi.org/10.5194/gmd-2021-430).

-#### Climatological mean data
-To compute anomaly correlations in the postprocessing step (see below), climatological mean data is required. This data constitutes the climatological mean for each daytime hour and for each month for the period 1990-2019.
-For convenince, the data is also provided with our frozon version of code and can be downloaded from the (link)[https://b2share.eudat.eu/records/744bbb4e6ee84a09ad368e8d16713118].

-## Prerequisites
+## Requirements

+- GPU (with a CPU, only the toy dataset is supported)
- Linux or macOS
-- Python>=3.6
-- NVIDIA GPU + CUDA CuDNN or CPU (small dataset only)
-- MPI
-- Tensorflow 1.13.1 or [CUDA-enabled NVIDIA](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/overview.html#overview) TensorFlow 1.15 within a (singularity)[https://sylabs.io/guides/3.5/user-guide/quick_start.html] container
-- [CDO](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html) >= 1.9.5

-## Installation
-
-Clone this repo by typing the following command in your personal target dirctory:
-
-```bash
-git clone https://gitlab.jsc.fz-juelich.de/esde/machine-learning/ambs.git
-```
-
-Since the project is continuously developed and make the experiments described in the GMD paper reproducible, we also provide a frozen version:
-
-```bash
-git clone https://gitlab.jsc.fz-juelich.de/esde/machine-learning/ambs_gmd1.git
-```
+- [MPI](https://mpi4py.readthedocs.io/en/stable/install.html#requirements)
+- [singularity](https://docs.sylabs.io/guides/3.0/user-guide/quick_start.html)
+- Python >= 3.6 with the following packages
+  - [tensorflow](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf) >= 1.13.1 and < 2.0 with GPU support
+  - [mpi4py](https://mpi4py.readthedocs.io/en/stable/) >= 3.0.1
+  - [numpy](https://numpy.org/) >= 1.17.3, < 1.19.5
+  - [xarray](https://docs.xarray.dev/en/stable/) >= 0.16.0
+  - [pandas](https://pandas.pydata.org/) >= 0.25.3
+  - [scikit-image](https://scikit-image.org/) >= 0.17.2
+  - [opencv-python-headless](https://pypi.org/project/opencv-python-headless/) >= 4.2.0.34
+  - [netcdf4](https://pypi.org/project/netCDF4/) >= 1.5.8
+  - [basemap](https://matplotlib.org/basemap/) >= 1.3.0
+  - [matplotlib](https://matplotlib.org/) >= 3.3.0
+- [CDO](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html) >= 1.9.6
+- [NCO](http://nco.sourceforge.net/) >= 4.9.5
+
+**Recommended:**
+- [Tensorflow 1.15 CUDA-enabled NVIDIA singularity container](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/overview.html#overview)
+
+## Getting started
+
+### Download of the repository
+
+Two versions of the workflow are available.
+:warning: Must be adapted once the zenodo-archive is renewed :warning:
+1) A frozen version is provided via [zenodo](https://doi.org/10.5281/zenodo.6833611) (DOI: 10.5281/zenodo.6833611). Download the provided zip-archive and unpack it:
+   ```bash
+   unzip ambs.zip
+   ```
+2) The continuously updated repository can be cloned from gitlab:
+   ```bash
+   git clone https://gitlab.jsc.fz-juelich.de/esde/machine-learning/ambs.git
+   ```

-This will create a directory called `ambs` under which this README-file and two subdirectories are placed. The subdirectory `[...]/ambs/test/` contains unittest-scripts for the workflow and is therefore of minor relevance for non-developers. The subdirectory `[...]/ambs/video_prediction_tools` contains everything which is needed in the workflow and is, therefore, called the top-level directory in the following.
+This will create a directory called `ambs` under which this README-file and three subdirectories are placed.
+The licenses for the software used in this repository are listed under `LICENSES/`.
+Two Jupyter Notebooks and a shell-script which have been used to do some extra evaluation following the reviewer comments can be found under the `Jupyter_Notebooks/`-directory. However, these evaluations have not been fully integrated into the workflow yet.
+The subdirectory `video_prediction_tools/` contains everything which is needed in the workflow and is, therefore, called the top-level directory in the following.

Thus, change into this subdirectory after cloning:
```bash
cd ambs/video_prediction_tools/
```

-### Get NVIDIA's TF1.15 container
+### Download of NVIDIA's TF1.15 singularity container

-In case, your HPC-system allows for the usage of singularity containers (such as JSC's HPC-system does) or if you have a NVIDIA GPU available, you can run the workflow with the help of NVIDIA's TensorFlow 1.15-containers. Note that this is the recommended approach!
-To get the correct container version, check your NVIDIA driver with the help of `nvidia-smi`. Then search [here](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html) for a suitable container version (try to get the latest possible container ) and download the singularity image via
+In case your HPC-system allows for the usage of singularity containers or if you have an NVIDIA GPU available, you can run the workflow with the help of NVIDIA's TensorFlow 1.15-containers. Note that this is the recommended approach!
+To get the correct container version, check your NVIDIA driver using the command `nvidia-smi`. Then search [here](https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/index.html) for a suitable container version (try to get the latest possible container) and download the singularity image via

```
singularity pull <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif docker://nvcr.io/nvidia/tensorflow:<version>-tf1-py3
```
-where `<version>` is set accordingly. Ensure that your current target directory (`<path_to_image>`) offers enough memory. The respective images are about 3-5 GB large.
-Then create a symbolic link of the singularity container into the `HPC_scripts` and `no_HPC_scripts`-directory, respectively:
+where `<version>` must be set accordingly.
+Ensure that your current target directory (`<path_to_image>`) offers enough disk space; the respective images are about 3-5 GB in size. <br>
+Then, create a symbolic link of the singularity container into the `HPC_scripts` and `no_HPC_scripts`-directory, respectively:
```
ln -s <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif HPC_scripts/tensorflow_<version>-tf1-py3.sif
ln -s <path_to_image>/nvidia_tensorflow_<version>-tf1-py3.sif no_HPC_scripts/tensorflow_<version>-tf1-py3.sif
```
-Note the slightly different name used for the symbolic link which is recommended to easily distinguish between the original file and the symbolic link.
-
-For users with access to JSC's HPC-system: The required singularity image is available from `ambs/video_prediction_tools/HPC_scripts`. Thus, simply set `<path_to_image>` accordingly in the commands above.
-Note that you need to log in [Judoor account]https://judoor.fz-juelich.de/login) and specifically request access to restricted container software beforehand!
+Note the slightly different names used for the symbolic links, which make it easy to distinguish between the original file and the symbolic links.

-In case, your operating system supports TF1.13 (or TF1.15) with GPU-support and does not allow for usage of NVIDIA's singularity containers, you can set your environment up as described below.
+### Workflow without singularity containers
+It is also possible to run our workflow without the usage of NVIDIA's singularity containers.
+The settings to enable this possibility are outlined below where the set-up of the virtual environment is described.
+However, GPU-support for Tensorflow is generally recommended to avoid extremely long training times.
+If only a CPU is available, training can still be executed on a small toy dataset (e.g. for testing purposes on a private computer).

+### Virtual environment

-## Start with AMBS
+#### On HPC-systems
+The runscripts under `HPC_scripts` can be used provided that the HPC-system uses SLURM for managing jobs. Otherwise, you may try to use the runscripts under `no_HPC_scripts` or set up your own runscripts based on your operating system.

-### Set-up the virtual environment
+##### Case I - With TF1.15 singularity container

-The workflow can be set-up on different operating systems. The related virtual environment can be set up with the help of the `create_env.sh`-script under the `env_setup`-directory.
-This script will place all virtual environments under the `virtual_envs`-directory.
-Depending on your system, you may do the following:
-
-#### On JSC's HPC-system
-After linking the TF1.15 singularity container in the directories for the runscript (see previous step), simply run
+After retrieving a singularity container that fits your operating HPC-system (see [above](#download-of-nvidias-tf115-singularity-container)), create a virtual environment as follows:
```
source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -tf_container=<used_container>
```
-where `<my_virtual_env>` corresponds to a user-defined name of the virtual environment.
-By default, the script assumes that all data (input and preprocessed data as well as trained models and data from postprocessing) will be stored in the shared directory `/p/project/deepacf/deeprain/video_prediction_shared_folder/`. This directory is called 'base-directory' in the following.
-
-In case that you (need to) deviate from this, you can set a customized base-directory.
-For this, add the `-base_dir`-flag to the call of `create_env.sh`, i.e.:
-```
-source create_env.sh <my_virtual_env> -base_dir=<my_target_dir>
-```
+##### Case II - Without TF1.15 singularity container, but with software modules
+Provided that your HPC-system offers TF 1.13 (or later) via software modules, the virtual environment can be set up after adapting `modules_train.sh`. In addition to Tensorflow (and its dependencies), modules for opening and reading h5- and netCDF-files must be loaded as well. Afterwards, the virtual environment can be created by
```
source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer
```
-**Note:** Suifficient read-write permissions and a reasonable amount of memory space are mandatory for alternative base-directories.
-
-#### On other HPC systems
-On other HPC-systems, the AMBS workflow can also be run. The runscripts under `HPC_scripts` can still be used provided that your HPC-system uses SLURM for managing jobs. Otherwise, you may try to use the runscripts under `no_HPC_scripts` or set-up own runscripts based on your operating system.
-
-##### Case I - Usage of singularity TF1.15 container
-
-After retrieving a singlualrity container that fits your operating HPC-system (see [above](#get-nVIDIA's-tF1.15-container)), create a virtual environment as follows:
-```
-source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -tf_container=<used_container>
+#### On other systems
+On other systems with access to an NVIDIA GPU (e.g. a personal computer or a cluster without SLURM), the virtual environment can be set up by adding the flag `-l_nohpc`. When working with a CPU or with a GPU from another hardware manufacturer, the flag `-l_nocontainer` must also be set.
+In the latter case, the command to set up the virtual environment reads as
+```bash
+source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer -l_nohpc
```
-Further details on the arguments are given after Case II.
-##### Case II - Usage of singularity TF1.15 container
-In case that running singularity containers is not possible for you, but your operating HPC-system provides the usage of TF 1.13 (or later) via modules, the source-code can still be run.
-However, this requires you to populate `modules_train.sh` where all modules are listed. Note that you also need to load modules for opening and reading h5- and netCDF-files as well . Afterwards, the virtual environment can be created by
-```
-source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer
-```
-##### Further details on the arguments
+#### Further details on the arguments
In the set-up commands for the virtual environment mentioned above, `<my_virtual_env>` corresponds to the user-defined name of the virtual environment. `<my_target_dir>` points to an (existing) directory which offers enough disk space to store large amounts of data (>> 100 GB).
-This directory should also already hold the ERA5-data as described [above](#Access-the-ERA5-dataset-(~TB)). Besides, the basic directory tree for the output of the workflow steps should follow the description provided [here]((#Input-and-Output-folder-structure-and-naming-convention)).
-The argument `-tf_container=<used_container>` allows you to specify the used singularity container (in Case I only!). Thus, `used_container` should correspond to `tensorflow_<version>-tf1-py3.sif` as described in this [section](#Get-NVIDIA's-TF1.15-container) above.
+This directory should also already hold the ERA5-data as described [below](#get-the-datasets). Besides, the basic directory tree for the output of the workflow steps should follow the description provided [here](#directory-tree-and-naming-convention).
+The argument `-tf_container=<used_container>` allows you to specify the used singularity container (in Case I only!). Thus, `used_container` should correspond to `tensorflow_<version>-tf1-py3.sif` as described in this [section](#download-of-nvidias-tf115-singularity-container) above.

-#### Other systems
-On other systems with access to a NVIDIA GPU, the virtual environment can be run as follows.
-In case that you don't have access to a NVIDIA GPU, you can still run TensorFlow on your CPU. However, training becomes very slow then and thus, we recommend to just test with the small dataset mentioned [above](#dry-run-with- small-samples-(~15-GB)).
+## Get the datasets

-Again, we describe the step to set-up the virtual environment separately in the following.
+The experiments described in the GMD paper rely on the ERA5 dataset, from which 13 years of data are used to construct the training, validation and test datasets of the video prediction models.

-##### Case I - Usage of singularity TF1.15 container
+### Complete ERA5 dataset (> 100 GB disk space)

-After retrieving a singlualrity container that fits your operating machine (see [above](#Get-NVIDIA's-TF1.15-container)), create a virtual environment as follows:
-```
-source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nohpc
-```
-Further details on the arguments are given after Case II.
+To obtain the complete input datasets, two approaches are possible:

-##### Case II - Usage of singularity TF1.15 container
+1) The raw ERA5 reanalysis data can be downloaded from the [ECMWF MARS archive](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-DataorganisationandhowtodownloadERA5). Once access is granted, the data can be retrieved with the provided retrieval script by specifying a spatial resolution of 0.3° (see keyword `GRID` as described [here](https://confluence.ecmwf.int/pages/viewpage.action?pageId=123799065)). Different meteorological variables can be selected via the key `param` (see [here](https://confluence.ecmwf.int/pages/viewpage.action?pageId=149335858)) which also supports requests based on short names. A comprehensive overview is provided by the [parameter database](https://apps.ecmwf.int/codes/grib/param-db/). In the experiments of this study, the variables `2t`, `tcc` and `t` have been downloaded, where the latter has been interpolated onto the 850 hPa pressure level with [CDO](https://code.mpimet.mpg.de/projects/cdo/embedded/index.html#x1-7060002.12.10). A minimal retrieval sketch is given below this list. For further information on the ERA5 dataset, please consult the [documentation](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation).
+2) The extracted and preprocessed data used in the [manuscript](https://doi.org/10.5194/gmd-2021-430) (see the workflow steps [below](#running-the-workflow-steps)) can also be downloaded via [datapub](https://datapub.fz-juelich.de/esde/esde-nfs/online_publication/2mT_by_DL/). This allows the user to start directly with training the video prediction models, but fixes the possible applications to our conducted experiments. Further details on the preprocessed data are outlined in the corresponding [README](https://datapub.fz-juelich.de/esde/esde-nfs/online_publication/2mT_by_DL/ReadMe.md).
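
The following sketch illustrates such a retrieval. It is only a hedged example and not part of the workflow (the repository's own retrieval script should be preferred): it assumes MARS access via the [ECMWF API client](https://pypi.org/project/ecmwf-api-client/) (`ecmwfapi`) and a local CDO installation, and all dates and file names are hypothetical placeholders.

```python
import subprocess

from ecmwfapi import ECMWFService

# Hedged sketch: retrieve one month of hourly ERA5 surface data (2t, tcc) at 0.3°
# resolution; all dates and target file names are placeholders.
mars = ECMWFService("mars")
mars.execute(
    {
        "class": "ea",        # ERA5
        "expver": "1",
        "stream": "oper",
        "type": "an",         # analysis fields
        "levtype": "sfc",     # surface data (use levtype=ml for model-level temperature)
        "param": "2t/tcc",    # short names, see the ECMWF parameter database
        "date": "2008-01-01/to/2008-01-31",
        "time": "00/to/23/by/1",
        "grid": "0.3/0.3",    # 0.3° resolution as used in the paper
    },
    "era5_sf_200801.grb",
)

# Interpolate the (separately retrieved) model-level temperature onto 850 hPa with CDO;
# ml2pl expects auxiliary fields (e.g. the logarithm of surface pressure) in the input.
subprocess.run(["cdo", "ml2pl,85000", "era5_ml_200801.grb", "era5_t850_200801.grb"], check=True)
```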
-
+### Toy dataset

-Without using a singularity container (and using your CPU instead), please run
-```
-source create_env.sh <my_virtual_env> -base_dir=<my_target_dir> -l_nocontainer -l_nohpc
-```
-**Note:** To reproduce the results of GMD paper, we recommend to use the case II.
+For data-driven ML models, the typical use-case is to work on large datasets. Nevertheless, we also prepared a [toy dataset](https://b2share.eudat.eu/records/744bbb4e6ee84a09ad368e8d16713118) (data from one month in 2008 with few variables) to help users to run tests on their own machine or to do some quick tests.

-##### Further details
-Futher details on the used arguments are provided [above](#Further-details-on-the-arguments). The only exception holds for the `l_nohpc`-flag that is used to indicate that you are not running on a HPC-system.
+### Climatological reference data

-### Run the workflow
+To compute anomaly correlations in the postprocessing step (see [below](#running-the-workflow-steps)), the climatological mean of the evaluated data is required. This data constitutes the climatological mean for each daytime hour and for each month based on the period 1990-2019. The data are also provided with our toy dataset and can be downloaded from [b2share](https://b2share.eudat.eu/records/744bbb4e6ee84a09ad368e8d16713118).

-Depending on the computing system you are working on, the workflow steps will be invoked by dedicated runscripts either from the directory `HPC_scripts/` or from `no_HPC_scripts`. The used directory names are self-explanatory.
+## Run the workflow

-To help the users conduct different experiments with varying configurations (e.g. input variables, hyperparameters etc), each runscript can be set up conveniently with the help of the Python-script `generate_runscript.py`. Its usage as well the workflow runscripts are described subsequently.
+Depending on the computing system you are working on, the workflow steps will be invoked by dedicated runscripts either from the directory `HPC_scripts/` (with SLURM) or from `no_HPC_scripts/` (without SLURM). The used directory names are self-explanatory.
+To help the users conduct different experiments with varying configurations (e.g. input variables, hyperparameters etc.), each runscript can be set up conveniently with the help of the Python-script `generate_runscript.py`. Its usage as well as the workflow runscripts are described subsequently. However, the runscript templates (e.g. `HPC_scripts/preprocess_data_era5_step1_template.sh`) can be consulted to retrieve the parameters that are set for the main experiment in our study (see Dataset ID 1-3 in the [manuscript](https://doi.org/10.5194/gmd-2021-430) or [this directory](https://datapub.fz-juelich.de/esde/esde-nfs/online_publication/2mT_by_DL/era5-Y2007-2019M01to12-92x56-3840N0000E-2t_tcc_t_850/) on datapub).

-#### Create specific runscripts
-Specific runscripts for each workflow substep (see below) are generated conveniently by keyboard interaction.
+### Create customized runscripts

-The interactive Python script under the folder `generate_runscript.py` thereby has to be executed after running `create_env.sh`. Note that this script only creates a new virtual environment if `<env_name>` has not been used before. If the corresponding virtual environment is already existing, it is simply activated.
+
+Specific runscripts for each workflow step (see below) are generated conveniently by keyboard interaction.
+
+The interactive Python script `generate_runscript.py` under the `env_setup`-directory thereby has to be executed after running `create_env.sh`. Note that this script only creates a new virtual environment if `<env_name>` has not been used before. If the corresponding virtual environment already exists, it is simply activated.

After invoking
```bash
python generate_runscript.py --venv_path <env_name>
```
you will be asked which workflow runscript shall be generated. You can choose one of the following workflow step names:
- extract
- preprocess1
- preprocess2
- train
- postprocess

-The subsequent keyboard interaction then allows the user to make individual settings to the workflow step at hand. By pressing simply Enter, the user may receive some guidance for the keyboard interaction.
-
-Note that the runscript creation of later workflow substeps depends on the preceding steps (i.e. by checking the arguments from keyboard interaction).
-Thus, they should be created sequentially instead of all at once at the beginning!
+The subsequent keyboard interaction then allows the user to make individual settings to the workflow step at hand. By typing `help`, guidance for the keyboard interaction can be received.
+**Note**: The runscript creation depends on the preceding steps (i.e. by checking the arguments from keyboard interaction).
+Thus, a runscript should only be created after the preceding workflow step has been conducted successfully, instead of creating all runscripts at once at the beginning!

-**NoteI**: The runscript creation depends on the preceding steps (i.e. by checking the arguments from keyboard interaction).
-Thus, they should be created sequentially instead of all at once at the beginning! Note that running the workflow step is also mandatory, before the runscript for the next workflow step can be created.
-
-**Note II**: Remember to enable your virtual environment before running `generate_runscripts.py`. For this, you can simply run
-```
-source create_env.sh <env_name>
-```
-where `<env_name>` corresponds to

-### Running the workflow substeps
+### Running the workflow steps

-Having created the runscript by keyboard interaction, the workflow substeps can be run sequentially.
+Having created the runscript by keyboard interaction, the workflow steps can be run sequentially.

-Note that you have to adapt the `account`, the `partition` as well as the e-mail address in case you running on a HPC-system other than JSC's HPC-systems (HDF-ML, Juwels Cluster and Juwels Booster).
+Note that you have to adapt the batch parameters such as `#SBATCH --account` or `#SBATCH --partition` when running on an HPC-system with SLURM support.

-Now, it is time to run the AMBS workflow
-1. **Data Extraction**:<br> This script retrieves the demanded variables for user-defined years from complete ERA% reanalysis grib-files and stores the data into netCDF-files.
-```bash
-[sbatch] ./data_extraction_era5.sh
-```
+The following steps are part of the workflow:
+1) **Data Extraction**:<br> This script retrieves the demanded variables for user-defined years from the ERA5 reanalysis grib-files and stores the data into netCDF-files.
+   ```bash
+   [sbatch] ./data_extraction_era5.sh
+   ```

-2. **Data Preprocessing**:<br> Crop the ERA 5-data (multiple years possible) to the region of interest (preprocesing step 1). All the year data will be touched once and the statistics are calculated and saved in the output folder.
-The TFrecord-files which are fed to the trained model (next workflow step) are created afterwards. Thus, two cases exist at this stage:
+2) **Data Preprocessing**:<br> In this step, the ERA5-data is sliced to the region of interest (preprocessing step 1). All data is loaded into memory once, which allows computing some statistics (for later normalization), and is then saved as pickle-files in the output directory. The TFRecord-files which are streamed to the neural network for training and postprocessing are created in preprocessing step 2. Thus, two (batch-) scripts have to be executed:
   ```bash
   [sbatch] ./preprocess_data_era5_step1.sh
   [sbatch] ./preprocess_data_era5_step2.sh
   ```

-3. **Training**:<br> Training of one of the available models with the preprocessed data. Note that the `exp_id` is generated automatically when running `generate_runscript.py`.
+3) **Training**:<br> Training of one of the available models with the preprocessed data happens in this step. Note that the `exp_id` is generated automatically when running `generate_runscript.py`.
   ```bash
   [sbatch] ./train_model_era5_<exp_id>.sh
   ```

-4. **Postprocessing**:<br> Create some plots and calculate the evaluation metrics for test dataset. Note that the `exp_id` is generated automatically when running `generate_runscript.py`.
+4) **Postprocessing**:<br> In this step, the trained model is applied to the test dataset. The predictions are stored in netCDF-files, and the model is also evaluated with several score metrics. Besides, example plots covering the range of the MSE are created for visualization. Note that the `exp_id` is inferred from the chosen experiment when running `generate_runscript.py`.
   ```bash
   [sbatch] ./visualize_postprocess_era5_<exp_id>.sh
   ```
+5) **Meta-Postprocessing**: <br> AMBS also provides a runscript to compare different models against each other in the `meta_postprocess`-step (called meta-postprocessing). While the runscript generator currently cannot handle this step, it can be configured by adapting the file `meta_config.json` in the `meta_postprocess_config`-directory. The related runscript can be created from the template which is provided under `HPC_scripts/` and `no_HPC_scripts/`, respectively.
   ```bash
   [sbatch] ./meta_postprocess_era5.sh
   ```

-### Compare and visualize the results
+### Additional Jupyter Notebooks

-AMBS also provides the tool (called meta-postprocessing) for the users to compare different experiments results and visualize the results as shown in GMD paper through the`meta_postprocess`-step. The runscript template are also prepared in the `HPC_scripts`, `no_HPC_scripts`.
+Following up the interactive discussion during the peer-review phase (click on `discussion` when opening the [manuscript's landing page](https://doi.org/10.5194/gmd-2021-430)), some additional evaluations have been conducted. While training of one convolutional model from WeatherBench has been integrated into the workflow, some evaluations (e.g. IFS forecast evaluation) have been realized in the scope of Jupyter Notebooks. These Notebooks are provided in the `Jupyter_Notebooks/`-directory where further (technical) details are given in the Notebooks. The software requirements to run these Jupyter Notebooks are the same as for the workflow (see [above](#requirements)).
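
For illustration, the sketch below shows how the climatological reference data (see [above](#climatological-reference-data)) enters the anomaly correlation evaluated during postprocessing. It is a minimal, hedged example with xarray; all file and variable names are hypothetical placeholders, and the workflow computes this metric internally in step 4:

```python
import xarray as xr

# Hypothetical placeholder files; the actual outputs follow the directory tree below.
clim = xr.open_dataset("climatology_1990-2019.nc")["t2m"]  # dims: (month, hour, lat, lon)
fcst = xr.open_dataset("forecasts.nc")["t2m_fcst"]         # dims: (time, lat, lon)
obs = xr.open_dataset("reference.nc")["t2m_ref"]           # dims: (time, lat, lon)

# Select the climatological mean matching the month and daytime hour of each sample.
clim_matched = clim.sel(month=fcst.time.dt.month, hour=fcst.time.dt.hour)

# Anomaly correlation: correlate forecast and observed anomalies w.r.t. climatology.
fcst_ano, obs_ano = fcst - clim_matched, obs - clim_matched
dims = ("lat", "lon")
acc = (fcst_ano * obs_ano).sum(dims) / ((fcst_ano**2).sum(dims) * (obs_ano**2).sum(dims)) ** 0.5
print(acc.mean("time"))  # ACC per verification time, averaged over the test period
```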
-
-### Input and Output folder structure and naming convention
-To successfully run the workflow and enable tracking the results from each workflow step, inputs and output directories, and the file name convention should be constructed as described below:
+## Directory tree and naming convention
+To successfully run the workflow and enable tracking the results from each workflow step, the input and output directories as well as the file names should follow the convention depicted below.

-Below, we show at first the input data structure for the ERA5 dataset. In detail, the data is recorded hourly and stored into two different kind of grib files. The file with suffix `*_ml.grb` consists of multi-layer data, whereas `*_sf.grb` only includes the surface data.
+At first, we show the input data structure for the ERA5 dataset. In detail, the data is recorded hourly and stored into two different kinds of grib files. The files with suffix `*_ml.grb` provide data on the model levels of the underlying IFS model (to allow subsequent interpolation onto pressure levels), whereas `*_sf.grb` include data without a vertical dimension.

```
├── ERA5 dataset
@@ -282,28 +248,30 @@
│ │ │ ├── ...
```
-The root output directory should be set up when you run the workflow at the first time as aformentioned.
+The root output directory should be set up when you run the workflow for the first time (see the `<my_target_dir>`-parameter of `create_env.sh` as described [here](#virtual-environment)).

-The output structure for each step of the workflow along with the file name convention are described below:
+The output structure (directory tree) for each workflow step, following the filename convention, should be:
```
-├── ExtractedData
-│ ├── [Year]
-│ │ ├── [Month]
-│ │ │ ├── **/*.netCDF
-├── PreprocessedData
-│ ├── [Data_name_convention]
+├── extractedData
+│ ├── [<YYYY>]
+│ │ ├── [<MM>]
+│ │ │ ├── ecmwf_era5_[<YYMMDDHH>].nc
+├── preprocessedData
+│ ├── [directory_name_convention]
│ │ ├── pickle
-│ │ │ ├── X_<Month>.pkl
-│ │ │ ├── T_<Month>.pkl
-│ │ │ ├── stat_<Month>.pkl
-│ │ ├── tfrecords
-│ │ │ ├── sequence_Y_<Year>_M_<Month>.tfrecords
+│ │ │ ├── [<YYYY>]
+│ │ │ │ ├── X_[<MM>].pkl
+│ │ │ │ ├── T_[<MM>].pkl
+│ │ │ │ ├── stat_[<MM>].pkl
+│ │ ├── tfrecords_seq_len_[X]
+│ │ │ ├── sequence_Y_[<YYYY>]_M_[<MM>].tfrecords
│ │ │── metadata.json
+│ │ │── options.json
-├── Models
-│ ├── [Data_name_convention]
+├── models
+│ ├── [directory_name_convention]
│ │ ├── [model_name]
│ │ │ ├── <timestamp>_<user>_<exp_id>
-│ │ │ │ ├── checkpoint_<iteration>
+│ │ │ │ ├── checkpoint_<iteration_step>
│ │ │ │ │ ├── model_*
│ │ │ │ │── timing_per_iteration_time.pkl
│ │ │ │ │── timing_total_time.pkl
@@ -311,53 +279,48 @@ The output structure for each step of the workflow along with the file name conv
│ │ │ │ │── train_losses.pkl
│ │ │ │ │── val_losses.pkl
│ │ │ │ │── *.json
-├── Results
-│ ├── [Data_name_convention]
-│ │ ├── [training_mode]
-│ │ │ ├── [source_data_name_convention]
-│ │ │ │ ├── [model_name]
-│ │ │ │ │ ├── *.nc
+├── results
+│ ├── [directory_name_convention]
+│ │ ├── [model_name]
+│ │ │ ├── <timestamp>_<user>_<exp_id>
+│ │ │ │ ├── vfp_date_[<YYYYMMDDHH>]_*.nc
+│ │ │ │ ├── evalutation_metrics.nc
+│ │ │ │ ├── *.png
├── meta_postprocoess
│ ├── [experiment ID]
```

-- ***Details of file name convention:***
+#### Overview on placeholders of the output directory tree

| Placeholder | Meaning |
|--- |--- |
-| [Year] | 2005;2006;2007,...,2019|
-| [Month] | 01;02;03,...,12|
-|[Data_name_convention]|Y[yyyy]to[yyyy]M[mm]to[mm]-[nx]_[ny]-[nn.nn]N[ee.ee]E-[var1]_[var2]_[var3]|
-|[model_name]| convLSTM, savp, ...|
+| [<YYYY>] | four-digit year, e.g. 2007, 2008, etc.|
+| [<MM>] | two-digit month, e.g. 01, 02, ..., 12|
+| [<DD>] | two-digit day, e.g. 01, 02, ..., 31|
+| [<HH>] | two-digit hour, e.g. 00, 01, ..., 23|
+|[directory_name_convention]| name indicating the data period, the target domain and the selected variables|
+|[model_name]| convLSTM, savp or weatherBench|

-- ***Data name convention***
-`Y[yyyy]to[yyyy]M[mm]to[mm]-[nx]_[ny]-[nn.nn]N[ee.ee]E-[var1]_[var2]_[var3]`
-  * Y[yyyy]to[yyyy]M[mm]to[mm]
-  * [nx]_[ny]: the size of images,e.g 64_64 means 64*64 pixels
-  * [nn.nn]N[ee.ee]E: the geolocation of selected regions with two decimal points. e.g : 0.00N11.50E
-  * [var1]_[var2]_[var3]: the abbrevation of selected variables
-
-Here we give some examples to explain the name conventions:
-| Examples | Name abbrevation |
-|--- |--- |
-|all data from March to June of the years 2005-2015 | Y2005toY2015M03to06 |
-|data from February to May of years 2005-2008 + data from March to June of year 2015| Y2005to2008M02to05_Y2015M03to06 |
-|Data from February to May, and October to December of 2005 | Y2005M02to05_Y2015M10to12 |
-|operational’ data base: whole year 2016 | Y2016M01to12 |
-|add new whole year data of 2017 on the operational data base |Y2016to2017M01to12 |
-|Note: Y2016to2017M01to12 = Y2016M01to12_Y2017M01to12
+#### Directory name convention
+The meaning of all components of the directory name convention
+`Y[yyyy]-[yyyy]M[mm]to[mm]-[nx]x[ny]-[nn.nn]N[ee.ee]E-[var1]_[var2]_[var3]` is:
+
+* `Y[yyyy]-[yyyy]M[mm]to[mm]`: data period defined by years and months
+* `[nx]x[ny]`: the size of the target domain, e.g. 92x56 means 92 grid points in longitude and 56 grid points in latitude direction
+* `[nn.nn]N[ee.ee]E`: the geolocation of the south-west corner of the target domain, e.g. 38.40N0.00E for the largest target domain
+* `[var1]_[var2]_[var3]`: short names of selected meteorological variables
+

## Benchmarking architectures

Currently, the workflow includes the following ML architectures, and we are working on integrating more into the system.

-- ConvLSTM: [paper](https://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation-nowcasting.pdf),[code](https://github.com/loliverhennigh/Convolutional-LSTM-in-Tensorflow)
-- Stochastic Adversarial Video Prediction (SAVP): [paper](https://arxiv.org/pdf/1804.01523.pdf),[code](https://github.com/alexlee-gk/video_prediction)
-- Variational Autoencoder:[paper](https://arxiv.org/pdf/1312.6114.pdf)
+- ConvLSTM: [paper](https://papers.nips.cc/paper/5955-convolutional-lstm-network-a-machine-learning-approach-for-precipitation-nowcasting.pdf), [code](https://github.com/loliverhennigh/Convolutional-LSTM-in-Tensorflow)
+- Stochastic Adversarial Video Prediction (SAVP): [paper](https://arxiv.org/pdf/1804.01523.pdf), [code](https://github.com/alexlee-gk/video_prediction)
+- WeatherBench: [paper](https://doi.org/10.1029/2020MS002203), [code](https://github.com/pangeo-data/WeatherBench)

## Contributors and contact

-The project is currently developed by Bing Gong, Michael Langguth, Amirpasha Mozafarri, and Yan Ji.
+The project is currently developed by: - Bing Gong: b.gong@fz-juelich.de - Michael Langguth: m.langguth@fz-juelich.de @@ -365,11 +328,3 @@ The project is currently developed by Bing Gong, Michael Langguth, Amirpasha Moz - Yan Ji: y.ji@fz-juelich.de Former code developers are Scarlet Stadtler and Severin Hussmann. - -## On-going work - -- Port to PyTorch version -- Parallel training neural network -- Integrate precipitation data and new architecture used in our submitted CVPR paper -- Integrate the ML benchmark datasets such as Moving MNIST -