    MLAir - Machine Learning on Air Data

    MLAir (Machine Learning on Air data) is an environment that simplifies and accelerates the creation of new machine learning (ML) models for the analysis and forecasting of meteorological and air quality time series. You can find the docs here.

    Installation

    MLAir is based on several Python frameworks. To work properly, you have to install all packages listed in the requirements.txt file. Additionally, to support the geographical plotting part, it is required to install geo packages built for your operating system. The names of these packages may differ between systems; we refer here to openSUSE Leap. The geo plot can be removed from the plot_list; in this case there is no need to install the geo packages.

    • (geo) Install proj on your machine using the console, e.g. zypper install proj on openSUSE Leap.
    • (geo) A C++ compiler is required for the installation of the program cartopy.
    • Install all requirements from requirements.txt, preferably in a virtual environment.
    • (tf) Currently, TensorFlow-1.13 is listed in the requirements. We have already tested TensorFlow-1.15 and could not find any compatibility errors. Please note that tf-1.13 and 1.15 each come in two distinct variants: the default package for CPU support and the "-gpu" package for GPU support. If the GPU version is installed, MLAir will make use of the GPU device. A small check is sketched after this list.
    • Installation of MLAir:
      • Either clone MLAir from the GitLab repository and use it without installation (besides the requirements),
      • or download the distribution file (?? .whl) and install it via pip install <??>. In this case, you can simply import MLAir in any Python script inside your virtual environment using import mlair.
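
    To verify which device TensorFlow will use, a minimal check can be run after installation. This is only a sketch using the TensorFlow 1.x test API; it is not part of MLAir itself.

    import tensorflow as tf

    # prints True if TensorFlow can see a usable GPU device (TF 1.x API)
    print(tf.test.is_gpu_available())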

    How to start with MLAir

    In this section, we show three examples of how to work with MLAir. Note that for these examples MLAir was installed using the distribution file. If you are working from the git clone, you need to adjust the import path unless your script is executed directly inside the source directory of MLAir; a sketch of this adjustment follows below.
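
    For the git-clone case, the simplest workaround is to extend the Python path before importing mlair. This is a minimal sketch; the clone location /path/to/mlair is a placeholder, not a fixed path.

    import sys

    # hypothetical location of the cloned repository; adjust to your setup
    sys.path.append("/path/to/mlair")

    import mlair  # now importable from outside the source directory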

    Example 1

    We start MLAir in a dry run without any modification. Just import mlair and run it.

    import mlair
    
    # just give it a dry run without any modification 
    mlair.run()

    The logging output will show you a lot of information. Additional information (including debug messages) is collected inside the experiment path in the logging folder.

    INFO: mlair started
    INFO: ExperimentSetup started
    INFO: Experiment path is: /home/<usr>/mlair/testrun_network 
    ...
    INFO: load data for DEBW001 from JOIN 
    ...
    INFO: Training started
    ...
    INFO: mlair finished after 00:00:12 (hh:mm:ss)

    Example 2

    Now we update the stations and customise the window history size parameter.

    import mlair
    
    # our new stations to use
    stations = ['DEBW030', 'DEBW037', 'DEBW031', 'DEBW015', 'DEBW107']
    
    # expanded temporal context to 14 (days, because of default sampling="daily")
    window_history_size = 14
    
    # restart the experiment with little customisation
    mlair.run(stations=stations, 
              window_history_size=window_history_size)

    The output looks similar, but we can see that the new stations are loaded.

    INFO: mlair started
    INFO: ExperimentSetup started
    ...
    INFO: load data for DEBW030 from JOIN 
    INFO: load data for DEBW037 from JOIN 
    ...
    INFO: Training started
    ...
    INFO: mlair finished after 00:00:24 (hh:mm:ss)

    Example 3

    Let's just apply our trained model to new data. Therefore, we keep the window history size parameter but change the stations. In the run method, we need to disable the trainable and create_new_model parameters. MLAir will use the model we have trained before. Note that this only works if the experiment path has not changed or a suitable trained model is placed inside the experiment path.

    import mlair
    
    # our new stations to use
    stations = ['DEBY002', 'DEBY079']
    
    # same setting for window_history_size
    window_history_size = 14
    
    # run experiment without training
    mlair.run(stations=stations, 
              window_history_size=window_history_size, 
              create_new_model=False, 
              trainable=False)

    We can see from the terminal that no training was performed. The analysis is now carried out on the new stations.

    INFO: mlair started
    ...
    INFO: No training has started, because trainable parameter was false. 
    ...
    INFO: mlair finished after 00:00:06 (hh:mm:ss)

    Default Workflow

    MLAir is composed of so-called run_modules, which are executed in a distinct order called a workflow. MLAir provides a default_workflow. This workflow runs the run modules ExperimentSetup, PreProcessing, ModelSetup, Training, and PostProcessing one by one.

    Sketch of the default workflow.

    import mlair
    
    # create the default MLAir workflow
    DefaultWorkflow = mlair.DefaultWorkflow()
    # execute default workflow
    DefaultWorkflow.run()

    The output of running this default workflow will be structured like the following.

    INFO: mlair started
    INFO: ExperimentSetup started
    ...
    INFO: ExperimentSetup finished after 00:00:01 (hh:mm:ss)
    INFO: PreProcessing started
    ...
    INFO: PreProcessing finished after 00:00:11 (hh:mm:ss)
    INFO: ModelSetup started
    ...
    INFO: ModelSetup finished after 00:00:01 (hh:mm:ss)
    INFO: Training started
    ...
    INFO: Training finished after 00:02:15 (hh:mm:ss)
    INFO: PostProcessing started
    ...
    INFO: PostProcessing finished after 00:01:37 (hh:mm:ss)
    INFO: mlair finished after 00:04:05 (hh:mm:ss)

    Customised Run Module and Workflow

    It is possible to create new custom run modules. A custom run module is required to inherit from the base class RunEnvironment and to implement the constructor method __init__(). This method has to execute the module on call. In the following example, this is done by using the _run() method that is called by the initialiser. It is possible to pass arguments to the custom run module as shown.

    import mlair
    import logging
    
    class CustomStage(mlair.RunEnvironment):
        """A custom MLAir stage for demonstration."""
    
        def __init__(self, test_string):
            super().__init__()  # always call super init method
            self._run(test_string)  # call a class method
    
        def _run(self, test_string):
            logging.info("Just running a custom stage.")
            logging.info("test_string = " + test_string)
            epochs = self.data_store.get("epochs")
            logging.info("epochs = " + str(epochs))

    If a custom run module is defined, it is required to adjust the workflow. For this, you need to load the empty Workflow class and add each run module that is required. The order in which modules are added defines the order of execution when running the workflow.

    # create your custom MLAir workflow
    CustomWorkflow = mlair.Workflow()
    # provide stages without initialisation
    CustomWorkflow.add(mlair.ExperimentSetup, epochs=128)
    # add also keyword arguments for a specific stage
    CustomWorkflow.add(CustomStage, test_string="Hello World")
    # finally execute custom workflow in order of adding
    CustomWorkflow.run()

    The output will look like:

    INFO: mlair started
    ...
    INFO: ExperimentSetup finished after 00:00:12 (hh:mm:ss)
    INFO: CustomStage started
    INFO: Just running a custom stage.
    INFO: test_string = Hello World
    INFO: epochs = 128
    INFO: CustomStage finished after 00:00:01 (hh:mm:ss)
    INFO: mlair finished after 00:00:13 (hh:mm:ss)

    Custom Model

    Create your own model to run your personal experiment. To guarantee proper integration into the MLAir workflow, models are required to inherit from the AbstractModelClass. This ensures smooth training and evaluation behaviour.

    How to create a customised model?

    • Create a new model class inheriting from AbstractModelClass
    from mlair import AbstractModelClass
    import keras
    
    class MyCustomisedModel(AbstractModelClass):
    
        def __init__(self, shape_inputs: list, shape_outputs: list):
    
            super().__init__(shape_inputs[0], shape_outputs[0])
    
            # settings
            self.dropout_rate = 0.1
            self.activation = keras.layers.PReLU
    
            # apply to model
            self.set_model()
            self.set_compile_options()
            self.set_custom_objects(loss=self.compile_options['loss'])
    • Make sure to call super().__init__() and at least set_model() and set_compile_options() in your custom init method.
    • The shown model expects a single input and output branch provided in a list. Therefore, the shapes of input and output are extracted and then provided to the super class initialiser.
    • Some general settings like the dropout rate are additionally set in the init method.
    • If your model contains custom objects that are not part of the keras or tensorflow frameworks, you need to add them as custom objects. To do this, call set_custom_objects with arbitrary kwargs. In the shown example, the loss has been added for demonstration only, because we use a built-in loss function. Nonetheless, we always encourage you to add the loss as a custom object, to prevent potential errors when loading an already created model instead of training a new one.
    • Now build your model inside set_model() by using the instance attributes self.shape_inputs and self.shape_outputs and storing the model as self.model.
    class MyCustomisedModel(AbstractModelClass):
    
        def set_model(self):
            x_input = keras.layers.Input(shape=self.shape_inputs)
            x_in = keras.layers.Conv2D(32, (1, 1), padding='same', name='{}_Conv_1x1'.format("major"))(x_input)
            x_in = self.activation(name='{}_conv_act'.format("major"))(x_in)
            x_in = keras.layers.Flatten(name='{}'.format("major"))(x_in)
            x_in = keras.layers.Dropout(self.dropout_rate, name='{}_Dropout_1'.format("major"))(x_in)
            x_in = keras.layers.Dense(16, name='{}_Dense_16'.format("major"))(x_in)
            x_in = self.activation()(x_in)
            x_in = keras.layers.Dense(self.shape_outputs, name='{}_Dense'.format("major"))(x_in)
            out_main = self.activation()(x_in)
            self.model = keras.Model(inputs=x_input, outputs=[out_main])
    • You are free to design your model as you like. Just make sure to save it in the class attribute model.
    • Additionally, set your custom compile options including the loss definition.
    class MyCustomisedModel(AbstractModelClass):
    
        def set_compile_options(self):
            self.initial_lr = 1e-2
            self.optimizer = keras.optimizers.SGD(lr=self.initial_lr, momentum=0.9)
            self.lr_decay = mlair.model_modules.keras_extensions.LearningRateDecay(base_lr=self.initial_lr,
                                                                                   drop=.94,
                                                                                   epochs_drop=10)
            self.loss = keras.losses.mean_squared_error
            self.compile_options = {"metrics": ["mse", "mae"]}
    • The allocation of the instance parameters initial_lr, optimizer, and lr_decay could also be part of the model class' initialiser. The same applies to self.loss and compile_options, but we recommend using the set_compile_options method to define parameters that are related to the compile options.

    • More important is that the compile options are actually saved. There are three ways to achieve this.

      • (1): Set all compile options by passing a dictionary with all options to self.compile_options (a sketch follows after this list).

      • (2): Set all compile options as instance attributes. MLAir will search for these attributes and store them.

      • (3): Define your compile options partly as dictionary and instance attributes (as shown in this example).

      • If using (3) and defining the same compile option with different values, MLAir will raise an error.

        Incorrect: (Will raise an error because of a mismatch for the optimizer parameter.)

        def set_compile_options(self):
            self.optimizer = keras.optimizers.SGD()
            self.loss = keras.losses.mean_squared_error
            self.compile_options = {"optimizer": keras.optimizers.Adam()}
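
        For comparison, here is a minimal sketch of approach (1), where every compile option lives in the single compile_options dictionary; the concrete optimizer and metric choices below are illustrative assumptions, not prescribed values.

        def set_compile_options(self):
            # approach (1): every compile option is provided in one dictionary
            self.compile_options = {"optimizer": keras.optimizers.SGD(lr=1e-2, momentum=0.9),
                                    "loss": keras.losses.mean_squared_error,
                                    "metrics": ["mse", "mae"]}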

    Specials for Branched Models

    • If you have a branched model with multiple outputs, you need to either set a single loss that is used for all branch outputs or provide one loss function per output in the correct order.
    class MyCustomisedModel(AbstractModelClass):
    
        def set_model(self):
            ...
            self.model = keras.Model(inputs=x_input, outputs=[out_minor_1, out_minor_2, out_main])
    
        def set_compile_options(self):
            self.loss = [keras.losses.mean_absolute_error,  # for out_minor_1
                         keras.losses.mean_squared_error,   # for out_minor_2
                         keras.losses.mean_squared_error]   # for out_main

    How to access my customised model?

    Once the customised model is created, you can easily access the model with

    >>> MyCustomisedModel().model
    <your custom model>

    The loss is accessible via

    >>> MyCustomisedModel().loss
    <your custom loss>

    You can treat the instance of your model both as the instance and as the model itself. If you call a method that belongs to the model rather than to the model instance, you can apply it directly on the instance instead of going through the model attribute.

    >>> MyCustomisedModel().model.compile(**kwargs) == MyCustomisedModel().compile(**kwargs)
    True

    Special Remarks

    Special instructions for installation on Jülich HPC systems

    Please note that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC setup files as a skeleton and customise them to your needs.

    The following instructions guide you through the installation on JUWELS and HDFML.

    • Clone the repo to the HPC system (we recommend placing it in /p/projects/<project name>).
    • Set up the venv by executing source setupHPC.sh. This script loads all pre-installed modules and creates a venv for all other packages. Furthermore, it creates slurm/batch scripts to execute code on compute nodes.
      You have to enter the HPC project's budget name (--account flag).
    • The default external data path on JUWELS and HDFML is set to /p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>.
      To choose a different location, open run.py and add the following keyword argument to ExperimentSetup: data_path=<your>/<custom>/<path>. A sketch follows after this list.
    • Execute python run.py on a login node to download example data. The program will throw an OSError after downloading.
    • Execute either sbatch run_juwels_develgpus.bash or sbatch run_hdfml_batch.bash to verify that the setup went well.
    • Currently, cartopy is not working on our HPC systems; therefore, PlotStations does not create any output.
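
    As a hedged sketch only (the path below is the same placeholder used above, and the exact structure of run.py may differ), the data_path keyword can be supplied to the ExperimentSetup stage, e.g. when assembling a workflow as shown earlier:

    import mlair

    # placeholder; replace with your own storage location
    custom_data_path = "<your>/<custom>/<path>"

    workflow = mlair.Workflow()
    workflow.add(mlair.ExperimentSetup, data_path=custom_data_path)
    # add the remaining stages (PreProcessing, ModelSetup, Training, PostProcessing) as required
    workflow.run()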

    Note: The method PartitionCheck currently only checks if the hostname starts with ju or hdfmll. Therefore, it might be necessary to adapt the if statement in PartitionCheck._run.

    Security using JOIN

    • To use hourly data from ToarDB via the JOIN interface, a private token is required. Request your personal access token and add it to src/join_settings.py in the hourly data section. Replace the TOAR_SERVICE_URL and the Authorization value. To make sure that this sensitive data is not uploaded to the remote server, use the following command to prevent git from tracking this file: git update-index --assume-unchanged src/join_settings.py

    Remaining Things

    Transformation

    There are two different approaches (called scopes) to transform the data:

    1. station: transform data for each station independently (somewhat like batch normalisation)
    2. data: transform all data of each station with shared metrics

    Transformation must be set via the transformation attribute. If transformation = None is passed to ExperimentSetup, the data is not transformed at all. For all other setups, use the following dictionary structure to specify the transformation.

    transformation = {"scope": <...>, 
                      "method": <...>,
                      "mean": <...>,
                      "std": <...>}
    ExperimentSetup(..., transformation=transformation, ...)

    scopes

    station: mean and std are not used

    data: either provide already calculated values for mean and std (if required by transformation method), or choose from different calculation schemes, explained in the mean and std section.

    supported transformation methods

    Currently supported methods are:

    • standardise (default, if method is not given)
    • centre

    mean and std

    "mean"="accurate": calculate the accurate values of mean and std (depending on method) by using all data. Although, this method is accurate, it may take some time for the calculation. Furthermore, this could potentially lead to memory issue (not explored yet, but could appear for a very big amount of data)

    "mean"="estimate": estimate mean and std (depending on method). For each station, mean and std are calculated and afterwards aggregated using the mean value over all station-wise metrics. This method is less accurate, especially regarding the std calculation but therefore much faster.

    We recommend using the latter method, estimate, for the following reasons:

    • much faster calculation
    • the real accuracy of mean and std is less important, because it is "just" a transformation / scaling
    • the accuracy of the mean is almost as high as in the accurate case, because of \bar{x_{ij}} = \bar{\left(\bar{x_i}\right)_j}. The only difference is that in the estimate case each station-wise mean is weighted equally, independently of the actual data count of the station.
    • the accuracy of the std is lower for estimate, because \mathrm{Var}(x_{ij}) \ne \bar{\left(\mathrm{Var}(x_i)\right)_j}, but the mean of all station-wise stds is still a decent estimate of the true std.

    "mean"=<value, e.g. xr.DataArray>: If mean and std are already calculated or shall be set manually, just add the scaling values instead of the calculation method. For method centre, std can still be None, but is required for the standardise method. Important: Format of given values must match internal data format of DataPreparation class: xr.DataArray with dims=["variables"] and one value for each variable.