MLAir - Machine Learning on Air Data

MLAir (Machine Learning on Air data) is an environment that simplifies and accelerates the creation of new machine learning (ML) models for the analysis and forecasting of meteorological and air quality time series.

Installation

MLAir is based on several Python frameworks. To work properly, you have to install all packages listed in the requirements.txt file. Additionally, to support the geographical plotting part, it is required to install geo packages built for your operating system. The names of these packages may differ between systems; we refer here to the openSUSE / Leap OS. The geo plot can be removed from the plot_list; in this case there is no need to install the geo packages.

  • (geo) Install proj on your machine using the console, e.g. for openSUSE / Leap: zypper install proj
  • (geo) A C++ compiler is required for the installation of the cartopy package
  • Install all requirements from requirements.txt, preferably in a virtual environment
  • (tf) Currently, TensorFlow-1.13 is mentioned in the requirements. We already tested the TensorFlow-1.15 version and couldn't find any compatibility errors. Please note that tf-1.13 and tf-1.15 each have two distinct branches: the default branch for CPU support and the "-gpu" branch for GPU support. If the GPU version is installed, MLAir will make use of the GPU device.
  • Installation of MLAir:
    • Either clone MLAir from the gitlab repository and use it without installation (besides the requirements)
    • or download the distribution file (?? .whl) and install it via pip install <??>. In this case, you can simply import MLAir in any Python script inside your virtual environment using import mlair.

How to start with MLAir

In this section, we show three examples of how to work with MLAir.

Example 1

We start MLAir in a dry run without any modification. Just import mlair and run it.

import mlair

# just give it a dry run without any modification 
mlair.run()

The logging output will show you a lot of information. Additional information (including debug messages) is collected inside the experiment path in the logging folder.

INFO: mlair started
INFO: ExperimentSetup started
INFO: Experiment path is: /home/<usr>/mlair/testrun_network 
...
INFO: load data for DEBW001 from JOIN 
...
INFO: Training started
...
INFO: mlair finished after 00:00:12 (hh:mm:ss)

Example 2

Now we update the stations and customise the window history size parameter.

import mlair

# our new stations to use
stations = ['DEBW030', 'DEBW037', 'DEBW031', 'DEBW015', 'DEBW107']

# expanded temporal context to 14 (days, because of default sampling="daily")
window_history_size = 14

# restart the experiment with little customisation
mlair.run(stations=stations, 
          window_history_size=window_history_size)

The output looks similar, but we can see that the new stations are loaded.

INFO: mlair started
INFO: ExperimentSetup started
...
INFO: load data for DEBW030 from JOIN 
INFO: load data for DEBW037 from JOIN 
...
INFO: Training started
...
INFO: mlair finished after 00:00:24 (hh:mm:ss)

Example 3

Let's now apply our trained model to new data. We therefore keep the window history size parameter but change the stations. In the run method, we need to disable the trainable and create_new_model parameters. MLAir will then use the model we have trained before. Note that this only works if the experiment path has not changed or a suitable trained model is placed inside the experiment path.

import mlair

# our new stations to use
stations = ['DEBY002', 'DEBY079']

# same setting for window_history_size
window_history_size = 14

# run experiment without training
mlair.run(stations=stations, 
          window_history_size=window_history_size, 
          create_new_model=False, 
          trainable=False)

We can see from the terminal that no training was performed. The analysis is now carried out on the new stations.

INFO: mlair started
...
INFO: No training has started, because trainable parameter was false. 
...
INFO: mlair finished after 00:00:06 (hh:mm:ss)

Customised workflows and models

Custom Workflow

MLAir provides a default workflow. If additional steps are to be performed, you have to append custom run modules to the workflow.

import mlair
import logging

class CustomStage(mlair.RunEnvironment):
    """A custom MLAir stage for demonstration."""

    def __init__(self, test_string):
        super().__init__()  # always call super init method
        self._run(test_string)  # call a class method
        
    def _run(self, test_string):
        logging.info("Just running a custom stage.")
        logging.info("test_string = " + test_string)
        epochs = self.data_store.get("epochs")
        logging.info("epochs = " + str(epochs))

        
# create your custom MLAir workflow
CustomWorkflow = mlair.Workflow()
# provide stages without initialisation
CustomWorkflow.add(mlair.ExperimentSetup, epochs=128)
# add also keyword arguments for a specific stage
CustomWorkflow.add(CustomStage, test_string="Hello World")
# finally execute custom workflow in order of adding
CustomWorkflow.run()

INFO: mlair started
...
INFO: ExperimentSetup finished after 00:00:12 (hh:mm:ss)
INFO: CustomStage started
INFO: Just running a custom stage.
INFO: test_string = Hello World
INFO: epochs = 128
INFO: CustomStage finished after 00:00:01 (hh:mm:ss)
INFO: mlair finished after 00:00:13 (hh:mm:ss)

Custom Model

Each model has to inherit from the abstract model class to ensure smooth training and evaluation behaviour. It is required to implement the set_model and set_compile_options methods. The latter has to set at least the loss.


import keras
from keras.losses import mean_squared_error as mse
from keras.optimizers import SGD

from mlair.model_modules import AbstractModelClass

class MyLittleModel(AbstractModelClass):
    """
    A customised model with a 1x1 Conv, and 3 Dense layers (32, 16
    window_lead_time). Dropout is used after Conv layer.
    """
    def __init__(self, window_history_size, window_lead_time, channels):
        super().__init__()
        # settings
        self.window_history_size = window_history_size
        self.window_lead_time = window_lead_time
        self.channels = channels
        self.dropout_rate = 0.1
        self.activation = keras.layers.PReLU
        self.lr = 1e-2
        # apply to model
        self.set_model()
        self.set_compile_options()
        self.set_custom_objects(loss=self.compile_options['loss'])

    def set_model(self):
        # add 1 to window_size to include current time step t0
        shape = (self.window_history_size + 1, 1, self.channels)
        x_input = keras.layers.Input(shape=shape)
        x_in = keras.layers.Conv2D(32, (1, 1), padding='same')(x_input)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Flatten()(x_in)
        x_in = keras.layers.Dropout(self.dropout_rate)(x_in)
        x_in = keras.layers.Dense(32)(x_in)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Dense(16)(x_in)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Dense(self.window_lead_time)(x_in)
        out = self.activation()(x_in)
        self.model = keras.Model(inputs=x_input, outputs=[out])

    def set_compile_options(self):
        self.compile_options = {"optimizer": SGD(lr=self.lr),
                                "loss": mse, 
                                "metrics": ["mse"]}

Transformation

There are two different approaches (called scopes) to transform the data:

  1. station: transform the data of each station independently (somewhat like batch normalisation)
  2. data: transform the data of all stations with shared metrics

The transformation is set via the transformation attribute. If transformation = None is given to ExperimentSetup, the data is not transformed at all. For all other setups, use the following dictionary structure to specify the transformation.

transformation = {"scope": <...>, 
                  "method": <...>,
                  "mean": <...>,
                  "std": <...>}
ExperimentSetup(..., transformation=transformation, ...)
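
For illustration, a completely filled-in version could look as follows; the particular combination of values is just an example (the available options are explained in the sections below).

# standardise all data; mean and std are estimated station-wise and aggregated
transformation = {"scope": "data",
                  "method": "standardise",
                  "mean": "estimate"}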

scopes

station: mean and std are not used

data: either provide already calculated values for mean and std (if required by transformation method), or choose from different calculation schemes, explained in the mean and std section.

supported transformation methods

Currently supported methods are:

  • standardise (default, if method is not given)
  • centre
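
Conceptually, these methods correspond to the usual scaling formulas. The lines below are not MLAir code, just a sketch of the underlying arithmetic.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

standardised = (x - x.mean()) / x.std()   # standardise: zero mean, unit std
centred = x - x.mean()                    # centre: zero mean only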

mean and std

"mean"="accurate": calculate the accurate values of mean and std (depending on method) by using all data. Although, this method is accurate, it may take some time for the calculation. Furthermore, this could potentially lead to memory issue (not explored yet, but could appear for a very big amount of data)

"mean"="estimate": estimate mean and std (depending on method). For each station, mean and std are calculated and afterwards aggregated using the mean value over all station-wise metrics. This method is less accurate, especially regarding the std calculation but therefore much faster.

We recommend using the latter method, estimate, for the following reasons (a short numerical sketch follows the list):

  • much faster calculation
  • real accuracy of mean and std is less important, because it is "just" a transformation / scaling
  • accuracy of the mean is almost as high as in the accurate case, because \overline{x_{ij}} = \overline{\left(\overline{x_i}\right)_j} holds when all stations are weighted equally. The only difference is that in the estimate case each station's mean is equally weighted, independently of the station's actual data count.
  • accuracy of the std is lower for estimate, because \operatorname{Var}(x_{ij}) \ne \overline{\left(\operatorname{Var}(x_i)\right)_j} in general, but the mean of all station-wise stds is still a decent estimate of the true std.
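
The toy calculation below (plain numpy, not MLAir code) illustrates the difference between the two aggregation schemes for two stations with different data counts.

import numpy as np

# two stations with different numbers of observations
station_a = np.array([1.0, 2.0, 3.0])
station_b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
all_data = np.concatenate([station_a, station_b])

# "accurate": use all data at once
mean_accurate, std_accurate = all_data.mean(), all_data.std()

# "estimate": aggregate station-wise values, each station weighted equally
mean_estimate = np.mean([station_a.mean(), station_b.mean()])
std_estimate = np.mean([station_a.std(), station_b.std()])

print(mean_accurate, mean_estimate)  # 19.5 vs 16.0: stations are weighted differently
print(std_accurate, std_estimate)    # the std estimate is generally less accurate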

"mean"=<value, e.g. xr.DataArray>: If mean and std are already calculated or shall be set manually, just add the scaling values instead of the calculation method. For method centre, std can still be None, but is required for the standardise method. Important: Format of given values must match internal data format of DataPreparation class: xr.DataArray with dims=["variables"] and one value for each variable.

Special Remarks

Special instructions for installation on Jülich HPC systems

Please note that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC setup files as a skeleton and customise them to your needs.

The following instructions guide you through the installation on JUWELS and HDFML.

  • Clone the repo to the HPC system (we recommend placing it in /p/projects/<project name>).
  • Set up the venv by executing source setupHPC.sh. This script loads all pre-installed modules and creates a venv for all other packages. Furthermore, it creates slurm/batch scripts to execute code on compute nodes.
    You have to enter the HPC project's budget name (--account flag).
  • The default external data path on JUWELS and HDFML is set to /p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>.
    To choose a different location, open run.py and add the keyword argument data_path=<your>/<custom>/<path> to ExperimentSetup (see the sketch after this list).
  • Execute python run.py on a login node to download example data. The program will throw an OSError after downloading.
  • Execute either sbatch run_juwels_develgpus.bash or sbatch run_hdfml_batch.bash to verify that the setup went well.
  • Currently, cartopy is not working on our HPC systems; therefore, PlotStations does not create any output.
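
A sketch of the data_path change mentioned above, assuming run.py calls ExperimentSetup as shipped in the repository; the path is a placeholder and all other arguments in run.py stay unchanged.

# inside run.py, add the keyword argument to the existing ExperimentSetup call, e.g.
ExperimentSetup(data_path="/p/project/<project name>/<user>/DATA/toar_daily")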

Note: The method PartitionCheck currently only checks if the hostname starts with ju or hdfmll. Therefore, it might be necessary to adapt the if statement in PartitionCheck._run.

Security using JOIN

  • To use hourly data from ToarDB via the JOIN interface, a private token is required. Request your personal access token and add it to src/join_settings.py in the hourly data section. Replace the TOAR_SERVICE_URL and the Authorization value. To make sure that this sensitive data is not uploaded to the remote server, use the following command to prevent git from tracking this file: git update-index --assume-unchanged src/join_settings.py