    AMBS

    The Atmospheric Machine learning Benchmarking System (AMBS) aims to provide state-of-the-art video prediction methods applied to the meteorological domain. In the scope of the current application, the target is the hourly evolution of the 2m temperature over a user-defined region. Different Deep Learning video prediction architectures such as convLSTM, MCnet or SAVP are trained on ERA5 reanalysis data to perform a 10-hour prediction based on the previous 10 hours. In addition to the 2m temperature, further meteorological variables like the mean sea-level pressure and the 500 hPa geopotential are fed to the underlying neural networks in order to enhance the model's capability to capture the atmospheric state and its (expected) evolution over time.

    The project is currently developed by Amirpasha Mozaffari, Michael Langguth, Bing Gong, and Scarlet Stadtler.

    Prerequisites

    • Linux or macOS
    • Python 3
    • CPU or NVIDIA GPU + CUDA CuDNN
    • MPI
    • TensorFlow 1.13.1, or the CUDA-enabled NVIDIA TensorFlow 1.15 provided as a Singularity container (on Juwels Booster)

    Installation

    • Clone this repo:
    git clone https://gitlab.version.fz-juelich.de/toar/ambs.git

    Set-up env on Jülich's HPC systems and zam347

    The following commands will set up a user-specific virtual environment either on Juwels, HDF-ML (HPC clusters) or on zam347 for you. The script create_env.sh automatically detects on which machine it is executed and loads/installs all required Python (binary) modules and packages. The virtual environment is set up under the subfolder video_prediction_tools/<env_name>. Besides, user-specific runscripts for each step of the workflow are created. If no experimental identifier is passed as a second (optional) argument, the default identifier exp1 is used and runscripts such as train_era5_exp1.sh are set up, which enables the user to run sequentially through all workflow steps (see below).

    cd video_prediction_tools/env_setup
    source create_env.sh <env_name> [<exp_id>]
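
    For instance, to create an environment named venv_juwels (the name used in the specific runscript example further below) while keeping the default experiment identifier exp1, the second argument is simply omitted:

    source create_env.sh venv_juwels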

    Run the workflow

    Depending on the machine you are working on, change either to video_prediction_tools/HPC_scripts (on Juwels and HDF-ML) or to video_prediction_tools/Zam347_scripts. There, the runscripts for all steps of the workflow are located; their order is as follows (a complete example sequence is given after the list). Note that the [sbatch] prefix is only required on one of the HPC systems.

    1. Data Extraction: Retrieve ERA5 reanalysis data for one year. For multiple years, execute the runscript sequentially.
    [sbatch] ./data_extraction_era5_<exp_id>.sh
    2. Data Preprocessing: Crop all data (multiple years possible) to the region of interest, perform normalization and create TFRecord files. Note that the first preprocessing step involves an automatic determination of the experimental directory where the input data for training, the model and the output data will be stored. The directory paths in the subsequent workflow steps are adapted automatically as well.
    [sbatch] ./preprocess_data_step1_<exp_id>.sh
    [sbatch] ./preprocess_data_step2_<exp_id>.sh
    3. Training: Training of one of the available models (see below) with the preprocessed data.
    [sbatch] ./train_model_era5_<exp_id>.sh
    4. Postprocessing: Create some plots and calculate evaluation metrics for the test dataset.
    [sbatch] ./visualize_postprocess_era5_<exp_id>.sh
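
    As an illustration, a complete pass through the workflow with the default experiment identifier exp1 (an example value; substitute your own <exp_id>) corresponds to submitting the following jobs one after another on one of the HPC systems, waiting for each job to finish before launching the next one:

    sbatch ./data_extraction_era5_exp1.sh
    sbatch ./preprocess_data_step1_exp1.sh
    sbatch ./preprocess_data_step2_exp1.sh
    sbatch ./train_model_era5_exp1.sh
    sbatch ./visualize_postprocess_era5_exp1.sh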

    Create additional runscripts

    In case you want to perform experiments with varying configurations (e.g. another set of hyperparameters, but still the same input dataset for training), it is convenient to create individual runscripts from the templates. This can be done with the help of generate_workflow_runscripts.sh.

    The first argument <runscript_name> defines the relative path to the template runscript which should be converted to an executable one. Note that the template runscript has to be passed without the suffix of its name, e.g. ../HPC_scripts/train_era5 in order to create a runscript for the training substep. The second argument <venv_name> denotes the name of the virtual environment, which has to be set up in advance and which should be used by the runscript at hand.

    Additional optional arguments can be passed to control the experimental identifier and to manually set the relative path to the directory where the preprocessed data is stored (used for the training and postprocessing substeps). These optional arguments have to follow a naming convention in order to be identified by generate_workflow_runscripts.sh. The experimental identifier can be passed by adding -exp_id=<id>, while the path to the preprocessed data requires passing -exp_dir=<relative_path_to_dir>. Note that the default value exp1 is used as experimental identifier if -exp_id=<id> is omitted.

    ./generate_workflow_runscripts.sh <runscript_name> <venv_name> [-exp_id=<id>] [-exp_dir=<relative_path_to_dir>]

    Specific example:

    
    ./generate_workflow_runscripts.sh train_model_era5 venv_juwels -exp_id=exp_test -exp_dir=era5-Y2010toY2222M01to12-160x128-2970N1500W-T2_MSL_gph500

    Notes for Juwels Booster

    The computationally expensive training of the Deep Learning video prediction architectures is supposed to benefit from the Booster module, which was installed at JSC in autumn 2020. In order to test the potential speed-up on this state-of-the-art HPC system, optimized for massively parallel workloads, we selected the convLSTM architecture as a test candidate in the scope of the Juwels Booster Early Access program.

    To run the training on the Booster, change to the recent Booster-related working branch of this git repository.

    git checkout scarlet_issue#031_booster
    
    

    Currently, there is no system installation of TensorFlow available on Juwels Booster. As an intermediate solution, a Singularity container with a CUDA-enabled NVIDIA TensorFlow v1.15 was made available, which has to be reflected when setting up the virtual environment and when submitting the job. The corresponding Singularity image file is shared under /p/project/deepacf/deeprain/video_prediction_shared_folder/containers_juwels_booster/ and needs to be linked into the HPC_scripts directory. If you are not part of the deepacf project, the Singularity file has to be obtained from an alternative source.

    cd video_prediction_tools/HPC_scripts
    ln -sf /p/project/deepacf/deeprain/video_prediction_shared_folder/containers_juwels_booster/tf-1.15-ofed-venv.sif tf-1.15-ofed-venv.sif 

    Before setting up the virtual environment for the workflow, some required modules have to be loaded and an interactive shell has to be spawned within the container.

    cd ../env_setup
    source modules_train.sh 
    cd ../HPC_scripts/
    singularity shell tf-1.15-ofed-venv.sif

    Now, the virtual environment can be created as usual, i.e. by doing the following steps within the container shell:

    cd ../env_setup
    source create_env.sh <env_name> [<exp_id>]

    This creates a corresponding directory video_prediction_tools/<env_name>.

    Note that create_env.sh is system-aware and therefore only creates an executable runscript for the training step, e.g. video_prediction_tools/HPC_scripts/train_model_era5_booster_exp1.sh. Since all other substeps of the workflow are supposed to be performed on other systems so far, the data input and output directories cannot be determined automatically during the workflow. Thus, the user has to modify the corresponding runscript by manually adapting the script variables source_dir and destination_dir in lines 19 and 20 of train_model_era5_booster_<exp_id>.sh, as sketched below.
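
    A hypothetical excerpt of these two lines is sketched below; the paths merely combine the shared project folder and the data name convention from the specific example above and must be replaced by the directories of your actual experiment:

    # lines 19-20 of train_model_era5_booster_<exp_id>.sh (placeholder paths, adapt to your experiment)
    source_dir=/p/project/deepacf/deeprain/video_prediction_shared_folder/PreprocessedData/era5-Y2010toY2222M01to12-160x128-2970N1500W-T2_MSL_gph500
    destination_dir=/p/project/deepacf/deeprain/video_prediction_shared_folder/Models/era5-Y2010toY2222M01to12-160x128-2970N1500W-T2_MSL_gph500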

    Finally, the training job can be submitted to Juwels Booster after exiting from the container instance by

    exit 
    cd ../HPC_scripts
    salloc --partition=booster --nodes=1 --account=ea_deepacf --time=00:10:00 --reservation earlyaccess srun singularity exec --nv tf-1.15-ofed-venv.sif ./train_model_era5_booster_<exp_id>.sh

    Output folder structure and naming convention

    The details can be found in name_convention.

    ├── ExtractedData
    │   ├── [Year]
    │   │   ├── [Month]
    │   │   │   ├── **/*.netCDF
    ├── PreprocessedData
    │   ├── [Data_name_convention]
    │   │   ├── pickle
    │   │   │   ├── train
    │   │   │   ├── val
    │   │   │   ├── test
    │   │   ├── tfrecords
    │   │   │   ├── train
    │   │   │   ├── val
    │   │   │   ├── test
    ├── Models
    │   ├── [Data_name_convention]
    │   │   ├── [model_name]
    │   │   ├── [model_name]
    ├── Results
    │   ├── [Data_name_convention]
    │   │   ├── [training_mode]
    │   │   │   ├── [source_data_name_convention]
    │   │   │   │   ├── [model_name]
    
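    To give a concrete example, the TFRecord training data of the experiment from the specific example above (the directory name encodes, among others, the source dataset, the covered period, the domain size and the selected variables) would be located under a path of the form:

    PreprocessedData/era5-Y2010toY2222M01to12-160x128-2970N1500W-T2_MSL_gph500/tfrecords/train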

    Benchmarking architectures:

    Contact