MLAir - Machine Learning on Air Data
MLAir (Machine Learning on Air data) is an environment that simplifies and accelerates the creation of new machine learning (ML) models for the analysis and forecasting of meteorological and air quality time series.
Installation
MLAir is based on several python frameworks. To work properly, you have to install all packages from the requirements.txt file. Additionally, to support the geographical plotting part, it is required to install geo packages built for your operating system. The names of these packages may differ between operating systems; we refer here to the openSUSE / Leap OS. The geo plot can be removed from the plot_list; in this case there is no need to install the geo packages (see the sketch below).
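As an illustration only, one possible way to run MLAir without the geographical plot is to pass a plot_list that leaves it out. The plot names below are placeholders (except PlotStations, the geographical plot mentioned in the HPC remarks further down); check the defaults of your MLAir version for the exact names.

import mlair

# hypothetical plot selection that leaves out the geographical overview plot
# ("PlotStations"); the remaining plot names are placeholders only
mlair.run(plot_list=["PlotMonthlySummary", "PlotTimeSeries"])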
- (geo) Install proj on your machine using the console, e.g. for openSUSE / Leap:
zypper install proj
- (geo) A C++ compiler is required for the installation of the program cartopy.
- Install all requirements from requirements.txt, preferably in a virtual environment (see the sketch below).
- (tf) Currently, TensorFlow-1.13 is mentioned in the requirements. We already tested the TensorFlow-1.15 version and couldn't find any compatibility errors. Please note that tf-1.13 and 1.15 each have two distinct branches: the default branch for CPU support and the "-gpu" branch for GPU support. If the GPU version is installed, MLAir will make use of the GPU device.
- Installation of MLAir:
  - Either clone MLAir from the gitlab repository and use it without installation (besides the requirements),
  - or download the distribution file (?? .whl) and install it via pip install <??>. In this case, you can simply import MLAir in any python script inside your virtual environment using import mlair.
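For orientation, the requirement installation could look like the following console session. Using venv is just one possible setup, and the tensorflow-gpu line is only needed if you want GPU support.

# create and activate a virtual environment (one possible setup)
python3 -m venv venv
source venv/bin/activate
# install all MLAir requirements
pip install -r requirements.txt
# optional: use the GPU branch of TensorFlow instead of the CPU branch
pip install tensorflow-gpu==1.15.0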
How to start with MLAir
In this section, we show three examples of how to work with MLAir.
Example 1
We start MLAir in a dry run without any modification. Just import mlair and run it.
import mlair
# just give it a dry run without any modification
mlair.run()
The logging output will show you a lot of information. Additional information (including debug messages) is collected inside the experiment path in the logging folder.
INFO: mlair started
INFO: ExperimentSetup started
INFO: Experiment path is: /home/<usr>/mlair/testrun_network
...
INFO: load data for DEBW001 from JOIN
...
INFO: Training started
...
INFO: mlair finished after 00:00:12 (hh:mm:ss)
Example 2
Now we update the stations and customise the window history size parameter.
import mlair
# our new stations to use
stations = ['DEBW030', 'DEBW037', 'DEBW031', 'DEBW015', 'DEBW107']
# expanded temporal context to 14 (days, because of default sampling="daily")
window_history_size = 14
# restart the experiment with little customisation
mlair.run(stations=stations,
          window_history_size=window_history_size)
The output looks similar, but we can see that the new stations are loaded.
INFO: mlair started
INFO: ExperimentSetup started
...
INFO: load data for DEBW030 from JOIN
INFO: load data for DEBW037 from JOIN
...
INFO: Training started
...
INFO: mlair finished after 00:00:24 (hh:mm:ss)
Example 3
Let's just apply our trained model to new data. Therefore, we keep the window history size parameter but change the stations. In the run method, we need to disable the trainable and create_new_model parameters. MLAir will then use the model we have trained before. Note that this only works if the experiment path has not changed or a suitable trained model is placed inside the experiment path.
import mlair
# our new stations to use
stations = ['DEBY002', 'DEBY079']
# same setting for window_history_size
window_history_size = 14
# run experiment without training
mlair.run(stations=stations,
          window_history_size=window_history_size,
          create_new_model=False,
          trainable=False)
We can see from the terminal that no training was performed. The analysis is now carried out on the new stations.
INFO: mlair started
...
INFO: No training has started, because trainable parameter was false.
...
INFO: mlair finished after 00:00:06 (hh:mm:ss)
Customised workflows and models
Custom Workflow
MLAir provides a default workflow. If additional steps are to be performed, you have to append custom run modules to the workflow.
import mlair
import logging
class CustomStage(mlair.RunEnvironment):
    """A custom MLAir stage for demonstration."""

    def __init__(self, test_string):
        super().__init__()  # always call super init method
        self._run(test_string)  # call a class method

    def _run(self, test_string):
        logging.info("Just running a custom stage.")
        logging.info("test_string = " + test_string)
        epochs = self.data_store.get("epochs")
        logging.info("epochs = " + str(epochs))


# create your custom MLAir workflow
CustomWorkflow = mlair.Workflow()
# provide stages without initialisation
CustomWorkflow.add(mlair.ExperimentSetup, epochs=128)
# add also keyword arguments for a specific stage
CustomWorkflow.add(CustomStage, test_string="Hello World")
# finally execute custom workflow in order of adding
CustomWorkflow.run()
INFO: mlair started
...
INFO: ExperimentSetup finished after 00:00:12 (hh:mm:ss)
INFO: CustomStage started
INFO: Just running a custom stage.
INFO: test_string = Hello World
INFO: epochs = 128
INFO: CustomStage finished after 00:00:01 (hh:mm:ss)
INFO: mlair finished after 00:00:13 (hh:mm:ss)
Custom Model
Each model has to inherit from the abstract model class to ensure a smooth training and evaluation behaviour. It is required to implement the set_model and set_compile_options methods. The latter has to set at least the loss.
import keras
from keras.losses import mean_squared_error as mse
from keras.optimizers import SGD
from mlair.model_modules import AbstractModelClass
class MyLittleModel(AbstractModelClass):
    """
    A customised model with a 1x1 Conv and 3 Dense layers (32, 16,
    window_lead_time). Dropout is used after the Conv layer.
    """

    def __init__(self, window_history_size, window_lead_time, channels):
        super().__init__()
        # settings
        self.window_history_size = window_history_size
        self.window_lead_time = window_lead_time
        self.channels = channels
        self.dropout_rate = 0.1
        self.activation = keras.layers.PReLU
        self.lr = 1e-2
        # apply to model
        self.set_model()
        self.set_compile_options()
        self.set_custom_objects(loss=self.compile_options['loss'])

    def set_model(self):
        # add 1 to window_history_size to include current time step t0
        shape = (self.window_history_size + 1, 1, self.channels)
        x_input = keras.layers.Input(shape=shape)
        x_in = keras.layers.Conv2D(32, (1, 1), padding='same')(x_input)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Flatten()(x_in)
        x_in = keras.layers.Dropout(self.dropout_rate)(x_in)
        x_in = keras.layers.Dense(32)(x_in)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Dense(16)(x_in)
        x_in = self.activation()(x_in)
        x_in = keras.layers.Dense(self.window_lead_time)(x_in)
        out = self.activation()(x_in)
        self.model = keras.Model(inputs=x_input, outputs=[out])

    def set_compile_options(self):
        self.compile_options = {"optimizer": SGD(lr=self.lr),
                                "loss": mse,
                                "metrics": ["mse"]}
Transformation
There are two different approaches (called scopes) to transform the data:
- station: transform data for each station independently (somehow like batch normalisation)
- data: transform all data of each station with shared metrics
Transformation must be set by the transformation attribute. If transformation = None is given to ExperimentSetup, data is not transformed at all. For all other setups, use the following dictionary structure to specify the transformation.
transformation = {"scope": <...>,
                  "method": <...>,
                  "mean": <...>,
                  "std": <...>}
ExperimentSetup(..., transformation=transformation, ...)
scopes
- station: mean and std are not used
- data: either provide already calculated values for mean and std (if required by the transformation method), or choose from different calculation schemes, explained in the mean and std section.
supported transformation methods
Currently supported methods are:
- standardise (default, if method is not given)
- centre
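For illustration, a possible transformation setup could look like the following sketch. It assumes, as in the examples above, that keyword arguments given to mlair.run are passed on to ExperimentSetup; the chosen scope and method values are taken from the lists above.

import mlair

# transform each station independently with the default standardise method
transformation = {"scope": "station", "method": "standardise"}
mlair.run(transformation=transformation)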
mean and std
"mean"="accurate": calculate the accurate values of mean and std (depending on the method) by using all data. Although this method is accurate, it may take some time for the calculation. Furthermore, it could potentially lead to memory issues (not explored yet, but this could appear for a very large amount of data).
"mean"="estimate": estimate mean and std (depending on the method). For each station, mean and std are calculated and afterwards aggregated using the mean value over all station-wise metrics. This method is less accurate, especially regarding the std calculation, but is therefore much faster.
We recommend using the latter method, estimate, for the following reasons:
- much faster calculation
- the real accuracy of mean and std is less important, because it is "just" a transformation / scaling
- the accuracy of the mean is almost as high as in the accurate case, because of \bar{x_{ij}} = \bar{\left(\bar{x_i}\right)_j}. The only difference is that in the estimate case each station's mean is equally weighted, independently of the actual data count of the station.
- the accuracy of the std is lower for estimate, because \mathrm{Var}(x_{ij}) \ne \bar{\left(\mathrm{Var}(x_i)\right)_j}, but the mean of all station-wise stds is still a decent estimate of the true std.
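As a sketch, requesting the estimate scheme could then look like this; again it is assumed that mlair.run passes the keyword argument on to ExperimentSetup.

import mlair

# shared scaling metrics for all data, estimated from the station-wise values
transformation = {"scope": "data",
                  "method": "standardise",
                  "mean": "estimate"}
mlair.run(transformation=transformation)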
"mean"=<value, e.g. xr.DataArray>
: If mean and std are already calculated or shall be set manually, just add the
scaling values instead of the calculation method. For method centre, std can still be None, but is required for the
standardise method. Important: Format of given values must match internal data format of DataPreparation
class: xr.DataArray
with dims=["variables"]
and one value for each variable.
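A small sketch of how such manually set values could be constructed; the variable names and numbers below are placeholders, only the xr.DataArray layout with dims=["variables"] is taken from the description above.

import xarray as xr
import mlair

# placeholder variables and scaling values, one entry per variable
variables = ["o3", "temp", "relhum"]
mean = xr.DataArray([47.1, 283.2, 76.0], coords={"variables": variables}, dims=["variables"])
std = xr.DataArray([20.5, 8.1, 15.2], coords={"variables": variables}, dims=["variables"])

transformation = {"scope": "data", "method": "standardise", "mean": mean, "std": std}
mlair.run(transformation=transformation)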
Special Remarks
Special instructions for installation on Jülich HPC systems
Please note that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC setup files as a skeleton and customise them to your needs.
The following instructions guide you through the installation on JUWELS and HDFML.
- Clone the repo to the HPC system (we recommend to place it in /p/projects/<project name>).
- Setup the venv by executing source setupHPC.sh. This script loads all pre-installed modules and creates a venv for all other packages. Furthermore, it creates slurm/batch scripts to execute code on compute nodes. You have to enter the HPC project's budget name (--account flag).
- The default external data path on JUWELS and HDFML is set to /p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>. To choose a different location, open run.py and add the keyword argument data_path=<your>/<custom>/<path> to ExperimentSetup (see the sketch after this list).
- Execute python run.py on a login node to download example data. The program will throw an OSError after downloading.
- Execute either sbatch run_juwels_develgpus.bash or sbatch run_hdfml_batch.bash to verify that the setup went well.
- Currently, cartopy is not working on our HPC system, therefore PlotStations does not create any output.
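For illustration, a minimal sketch of such a data path adjustment in run.py; the placeholder path is kept from above, and it is assumed that run.py starts the default workflow so that the keyword argument is passed through to ExperimentSetup.

import mlair

# placeholder path; replace <your>/<custom>/<path> with your data location
mlair.run(data_path="<your>/<custom>/<path>")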
Note: The method PartitionCheck currently only checks if the hostname starts with ju or hdfmll. Therefore, it might be necessary to adapt the if statement in PartitionCheck._run.
Security using JOIN
- To use hourly data from ToarDB via the JOIN interface, a private token is required. Request your personal access token and add it to src/join_settings.py in the hourly data section. Replace the TOAR_SERVICE_URL and the Authorization value. To make sure that this sensitive data is not uploaded to the remote server, use the following command to prevent git from tracking this file:
git update-index --assume-unchanged src/join_settings.py