diff --git a/README.md b/README.md
index 5c55b4094232908a56cdcf61ba437976f8714e8b..de55ada298c92a59f6c8ffcb1d4e7ed122f031ff 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,8 @@
 # MLAir - Machine Learning on Air Data
 
 MLAir (Machine Learning on Air data) is an environment that simplifies and accelerates the creation of new machine
-learning (ML) models for the analysis and forecasting of meteorological and air quality time series.
+learning (ML) models for the analysis and forecasting of meteorological and air quality time series. You can find the
+docs [here](http://toar.pages.jsc.fz-juelich.de/mlair/docs/).
 
 # Installation
 
@@ -27,7 +28,9 @@ install the geo packages.
 
 # How to start with MLAir
 
-In this section, we show three examples how to work with MLAir.
+In this section, we show three examples of how to work with MLAir. Note that for these examples MLAir was installed
+using the distribution file. If you are using the git clone, you need to adjust the import path unless the code is
+executed directly inside the source directory of MLAir.
 
 ## Example 1
 
@@ -112,12 +115,50 @@ INFO: No training has started, because trainable parameter was false.
 INFO: mlair finished after 00:00:06 (hh:mm:ss)
 ```
 
-# Customised workflows and models
 
-# Custom Workflow
+# Default Workflow
 
-MLAir provides a default workflow. If additional steps are to be performed, you have to append custom run modules to
-the workflow.
+MLAir is composed of so-called `run_modules`, which are executed in a distinct order called a `workflow`. MLAir
+provides a `default_workflow`. This workflow runs the run modules `ExperimentSetup`, `PreProcessing`,
+`ModelSetup`, `Training`, and `PostProcessing` one by one.
+
+```python
+import mlair
+
+# create the default MLAir workflow
+DefaultWorkflow = mlair.DefaultWorkflow()
+# execute default workflow
+DefaultWorkflow.run()
+```
+
+The output of running this default workflow is structured as follows.
+
+```log
+INFO: mlair started
+INFO: ExperimentSetup started
+...
+INFO: ExperimentSetup finished after 00:00:01 (hh:mm:ss)
+INFO: PreProcessing started
+...
+INFO: PreProcessing finished after 00:00:11 (hh:mm:ss)
+INFO: ModelSetup started
+...
+INFO: ModelSetup finished after 00:00:01 (hh:mm:ss)
+INFO: Training started
+...
+INFO: Training finished after 00:02:15 (hh:mm:ss)
+INFO: PostProcessing started
+...
+INFO: PostProcessing finished after 00:01:37 (hh:mm:ss)
+INFO: mlair finished after 00:04:05 (hh:mm:ss)
+```
+
+# Customised Run Module and Workflow
+
+It is possible to create new custom run modules. A custom run module must inherit from the base class
+`RunEnvironment` and implement the constructor method `__init__()`. This method has to execute the module when called.
+In the following example, this is done by the `_run()` method that is called by the initialiser. It is
+possible to pass arguments to the custom run module as shown.
 
 ```python
 import mlair
@@ -129,14 +170,19 @@ class CustomStage(mlair.RunEnvironment):
 
     def __init__(self, test_string):
         super().__init__()  # always call super init method
         self._run(test_string)  # call a class method
-        
+
     def _run(self, test_string):
         logging.info("Just running a custom stage.")
         logging.info("test_string = " + test_string)
         epochs = self.data_store.get("epochs")
         logging.info("epochs = " + str(epochs))
+```
 
-
+If a custom run module is defined, the workflow has to be adjusted accordingly. For this, you need to load the empty
+`Workflow` class and add each required run module. The order of adding modules defines the order of
+execution when running the workflow.
+
+```python
 # create your custom MLAir workflow
 CustomWorkflow = mlair.Workflow()
 # provide stages without initialisation
@@ -146,6 +192,9 @@ CustomWorkflow.add(CustomStage, test_string="Hello World")
 # finally execute custom workflow in order of adding
 CustomWorkflow.run()
 ```
+
+The output will look like:
+
 ```log
 INFO: mlair started
 ...
@@ -158,60 +207,175 @@ INFO: CustomStage finished after 00:00:01 (hh:mm:ss)
 INFO: mlair finished after 00:00:13 (hh:mm:ss)
 ```
 
-## Custom Model
+# Custom Model
 
-Each model has to inherit from the abstract model class to ensure a smooth training and evaluation behaviour. It is
-required to implement the set model and set compile options methods. The later has to set the loss at least.
+Create your own model to run your personal experiment. To guarantee a proper integration into the MLAir workflow,
+models are required to inherit from `AbstractModelClass`. This will ensure a smooth training and evaluation
+behaviour.
 
-```python
+## How to create a customised model?
+
+* Create a new model class inheriting from `AbstractModelClass`
+```python
+from mlair import AbstractModelClass
 import keras
-from keras.losses import mean_squared_error as mse
-from keras.optimizers import SGD
-
-from mlair.model_modules import AbstractModelClass
-
-class MyLittleModel(AbstractModelClass):
-    """
-    A customised model with a 1x1 Conv, and 3 Dense layers (32, 16
-    window_lead_time). Dropout is used after Conv layer.
-    """
-    def __init__(self, window_history_size, window_lead_time, channels):
-        super().__init__()
+
+class MyCustomisedModel(AbstractModelClass):
+
+    def __init__(self, shape_inputs: list, shape_outputs: list):
+
+        super().__init__(shape_inputs[0], shape_outputs[0])
+
         # settings
-        self.window_history_size = window_history_size
-        self.window_lead_time = window_lead_time
-        self.channels = channels
         self.dropout_rate = 0.1
         self.activation = keras.layers.PReLU
-        self.lr = 1e-2
+
         # apply to model
         self.set_model()
         self.set_compile_options()
         self.set_custom_objects(loss=self.compile_options['loss'])
+```
+
+* Make sure to call `super().__init__()` and at least `set_model()` and `set_compile_options()` in your
+  custom init method.
+* The shown model expects a single input and output branch provided in a list. Therefore, the shapes of input and
+  output are extracted and then provided to the super class initialiser.
+* Some general settings like the dropout rate are also set in the init method.
+* If you have custom objects in your model that are not part of the keras or tensorflow frameworks, you need to add
+  them as custom objects. To do this, call `set_custom_objects` with arbitrary kwargs. In the shown example, the
+  loss has been added for demonstration only, because we use a built-in loss function. Nonetheless, we encourage you
+  to always add the loss as a custom object to prevent potential errors when loading an already created model instead
+  of training a new one.
+* Now build your model inside `set_model()` by using the instance attributes `self.shape_inputs` and
+  `self.shape_outputs` and storing the model as `self.model`.
+
+```python
+class MyCustomisedModel(AbstractModelClass):
 
     def set_model(self):
-        # add 1 to window_size to include current time step t0
-        shape = (self.window_history_size + 1, 1, self.channels)
-        x_input = keras.layers.Input(shape=shape)
-        x_in = keras.layers.Conv2D(32, (1, 1), padding='same')(x_input)
-        x_in = self.activation()(x_in)
-        x_in = keras.layers.Flatten()(x_in)
-        x_in = keras.layers.Dropout(self.dropout_rate)(x_in)
-        x_in = keras.layers.Dense(32)(x_in)
+        x_input = keras.layers.Input(shape=self.shape_inputs)
+        x_in = keras.layers.Conv2D(32, (1, 1), padding='same', name='{}_Conv_1x1'.format("major"))(x_input)
+        x_in = self.activation(name='{}_conv_act'.format("major"))(x_in)
+        x_in = keras.layers.Flatten(name='{}'.format("major"))(x_in)
+        x_in = keras.layers.Dropout(self.dropout_rate, name='{}_Dropout_1'.format("major"))(x_in)
+        x_in = keras.layers.Dense(16, name='{}_Dense_16'.format("major"))(x_in)
         x_in = self.activation()(x_in)
-        x_in = keras.layers.Dense(16)(x_in)
-        x_in = self.activation()(x_in)
-        x_in = keras.layers.Dense(self.window_lead_time)(x_in)
-        out = self.activation()(x_in)
-        self.model = keras.Model(inputs=x_input, outputs=[out])
+        x_in = keras.layers.Dense(self.shape_outputs, name='{}_Dense'.format("major"))(x_in)
+        out_main = self.activation()(x_in)
+        self.model = keras.Model(inputs=x_input, outputs=[out_main])
+```
+
+* You are free to design your model however you like. Just make sure to save it in the class attribute `model`.
+* Additionally, set your custom compile options including the loss definition.
+
+```python
+class MyCustomisedModel(AbstractModelClass):
 
     def set_compile_options(self):
-        self.compile_options = {"optimizer": SGD(lr=self.lr),
-                                "loss": mse,
-                                "metrics": ["mse"]}
+        self.initial_lr = 1e-2
+        self.optimizer = keras.optimizers.SGD(lr=self.initial_lr, momentum=0.9)
+        self.lr_decay = mlair.model_modules.keras_extensions.LearningRateDecay(base_lr=self.initial_lr,
+                                                                               drop=.94,
+                                                                               epochs_drop=10)
+        self.loss = keras.losses.mean_squared_error
+        self.compile_options = {"metrics": ["mse", "mae"]}
 ```
 
+* The allocation of the instance parameters `initial_lr`, `optimizer`, and `lr_decay` could also be part of
+  the model class' initialiser. The same applies to `self.loss` and `compile_options`, but we recommend using
+  the `set_compile_options` method for the definition of parameters that are related to the compile options.
+* More importantly, the compile options have to be actually saved. There are three ways to achieve this.
+
+    * (1): Set all compile options by passing a dictionary with all options to `self.compile_options`.
+    * (2): Set all compile options as instance attributes. MLAir will search for these attributes and store them.
+    * (3): Define your compile options partly as dictionary and instance attributes (as shown in this example).
+    * If using (3) and defining the same compile option with different values, MLAir will raise an error.
+
+  Incorrect: (Will raise an error because of a mismatch for the `optimizer` parameter.)
+  ```python
+  def set_compile_options(self):
+      self.optimizer = keras.optimizers.SGD()
+      self.loss = keras.losses.mean_squared_error
+      self.compile_options = {"optimizer": keras.optimizers.Adam()}
+  ```
+
+
+## Specials for Branched Models
+
+* If you have a branched model with multiple outputs, you need to either set only a single loss for all branch outputs
+  or provide the same number of loss functions, considering the right order.
+
+```python
+class MyCustomisedModel(AbstractModelClass):
+
+    def set_model(self):
+        ...
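+        # '...' stands for the layers that build the three output branches
+        # (out_minor_1, out_minor_2, out_main); they are omitted here for brevity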
+        self.model = keras.Model(inputs=x_input, outputs=[out_minor_1, out_minor_2, out_main])
+
+    def set_compile_options(self):
+        self.loss = ([keras.losses.mean_absolute_error]    # for out_minor_1
+                     + [keras.losses.mean_squared_error]   # for out_minor_2
+                     + [keras.losses.mean_squared_error])  # for out_main
+```
+
+
+## How to access my customised model?
+
+Once the customised model is created, you can easily access the model with
+
+```python
+>>> MyCustomisedModel().model
+<your custom model>
+```
+
+The loss is accessible via
+
+```python
+>>> MyCustomisedModel().loss
+<your custom loss>
+```
+
+You can treat the instance of your model class not only as an instance but also as the model itself. If you call a
+method that belongs to the model rather than to the model class instance, you can call it directly on the instance
+instead of going through the `model` attribute.
+
+```python
+>>> MyCustomisedModel().model.compile(**kwargs) == MyCustomisedModel().compile(**kwargs)
+True
+```
+
+# Special Remarks
+
+## Special instructions for installation on Jülich HPC systems
+
+_Please note that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC
+setup files as a skeleton and customise them to your needs._
+
+The following instructions guide you through the installation on JUWELS and HDFML.
+* Clone the repo to the HPC system (we recommend placing it in `/p/projects/<project name>`).
+* Set up the venv by executing `source setupHPC.sh`. This script loads all pre-installed modules and creates a venv for
+all other packages. Furthermore, it creates slurm/batch scripts to execute code on compute nodes. <br>
+You have to enter the HPC project's budget name (`--account` flag).
+* The default external data path on JUWELS and HDFML is set to `/p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>`.
+<br>To choose a different location, open `run.py` and add the following keyword argument to `ExperimentSetup`:
+`data_path=<your>/<custom>/<path>`.
+* Execute `python run.py` on a login node to download example data. The program will throw an OSError after downloading.
+* Execute either `sbatch run_juwels_develgpus.bash` or `sbatch run_hdfml_batch.bash` to verify that the setup went well.
+* Currently, cartopy is not working on our HPC systems; therefore, PlotStations does not create any output.
+
+Note: The method `PartitionCheck` currently only checks if the hostname starts with `ju` or `hdfmll`.
+Therefore, it might be necessary to adapt the `if` statement in `PartitionCheck._run`.
+
+## Security using JOIN
+
+* To use hourly data from ToarDB via the JOIN interface, a private token is required. Request your personal access
+token and add it to `src/join_settings.py` in the hourly data section. Replace the `TOAR_SERVICE_URL` and the
+`Authorization` value. To make sure that this **sensitive** data is not uploaded to the remote server, use the
+following command to prevent git from tracking this file: `git update-index --assume-unchanged src/join_settings.py`
+
+
+# Remaining Things
 
 ## Transformation
 
@@ -267,33 +431,3 @@ class:
 `xr.DataArray` with `dims=["variables"]` and one value for each variable.
 
 
-
-
-# Special Remarks
-
-## Special instructions for installation on Jülich HPC systems
-
-_Please note, that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC
-setup files as a skeleton and customise it to your needs._
-
-The following instruction guide you through the installation on JUWELS and HDFML.
-* Clone the repo to HPC system (we recommend to place it in `/p/projects/<project name>`).
-* Setup venv by executing `source setupHPC.sh`. This script loads all pre-installed modules and creates a venv for
-all other packages. Furthermore, it creates slurm/batch scripts to execute code on compute nodes. <br>
-You have to enter the HPC project's budget name (--account flag).
-* The default external data path on JUWELS and HDFML is set to `/p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>`.
-<br>To choose a different location open `run.py` and add the following keyword argument to `ExperimentSetup`:
-`data_path=<your>/<custom>/<path>`.
-* Execute `python run.py` on a login node to download example data. The program will throw an OSerror after downloading.
-* Execute either `sbatch run_juwels_develgpus.bash` or `sbatch run_hdfml_batch.bash` to verify that the setup went well.
-* Currently cartopy is not working on our HPC system, therefore PlotStations does not create any output.
-
-Note: The method `PartitionCheck` currently only checks if the hostname starts with `ju` or `hdfmll`.
-Therefore, it might be necessary to adopt the `if` statement in `PartitionCheck._run`.
-
-## Security using JOIN
-
-* To use hourly data from ToarDB via JOIN interface, a private token is required. Request your personal access token and
-add it to `src/join_settings.py` in the hourly data section. Replace the `TOAR_SERVICE_URL` and the `Authorization`
-value. To make sure, that this **sensitive** data is not uploaded to the remote server, use the following command to
-prevent git from tracking this file: `git update-index --assume-unchanged src/join_settings.py`
diff --git a/docs/_source/customise.rst b/docs/_source/customise.rst
index 94a760ddff53e041645bf705f4265c90684c09d0..e8aea9554e5a1c96df2b2452b1f1ee7353780b42 100644
--- a/docs/_source/customise.rst
+++ b/docs/_source/customise.rst
@@ -117,6 +117,7 @@ How to create a customised model?
 .. code-block:: python
 
     from mlair import AbstractModelClass
+    import keras
 
     class MyCustomisedModel(AbstractModelClass):
 
@@ -148,8 +149,6 @@ How to create a customised model?
 
 .. code-block:: python
 
-    import keras
-
     class MyCustomisedModel(AbstractModelClass):
 
         def set_model(self):
@@ -240,5 +239,67 @@ parameter call.
 
     >>> MyCustomisedModel().model.compile(**kwargs) == MyCustomisedModel().compile(**kwargs)
     True
+
+
+Data Handler
+------------
+
+The basic concept of a data handler is to ensure an appropriate handling of input and target data. This includes the
+loading and preparation of data and their provision in a predefined format. The user is given free rein as to which
+steps the loading and preparation must include. The only constraint is that data is considered as a collection of
+stations. This means that one instance of the data handler is created per station. MLAir then takes over the iteration
+over the collection of stations or distributes the data during the training according to the given batch size. With very
+large data sets, memory problems may occur if all data is loaded and held in main memory. In such a case, it is
+recommended to open the data only temporarily. This has no effect on the training itself, as the data is then
+automatically distributed by MLAir.
+
+Interface of a data handler
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A data handler should inherit from the :py:`AbstractDataHandler` class. This class has some key features:
+
+* :py:`cls.requirements()` can be used to request all available :py:`args` and :py:`kwargs` from MLAir to build the
+  class.
+* :py:`cls.build(*args, **kwargs)` returns the class itself in default mode. This can be modified (=overwritten) to
+  execute some pre-build operations.
+* :py:`self.get_X(upsampling, as_numpy)` should return the input data either as NumPy array or xarray. With the
+  upsampling argument it is possible to implement a feature to weight inputs during training.
+* :py:`self.get_Y(upsampling, as_numpy)` does the same, but for the target data.
+* :py:`self.transformation(*args, **kwargs)` is a placeholder to execute any desired transformation. This class method
+  is called during the preprocessing stage in the default MLAir workflow. Note that a transformation operation is only
+  estimated on the train data subset and afterwards applied to all data subsets.
+* :py:`self.get_coordinates()` is a placeholder and can be used to return a position for a geographical overview plot.
+
+During the preprocessing stage the following is executed:
+
+1) MLAir requests all required parameters that should be set during the Experiment Setup stage by calling
+   :py:`data_handler.requirements()`.
+2) The data handler is built for each station using :py:`data_handler.build(station, **kwargs)` to check if data is
+   available for the given station.
+3) If valid, the built data handler is added to an internal data collection that gathers all contributing data
+   handlers.
+4) MLAir creates subsets for training, validation, and testing. For this, a separate data handler for each subset is
+   created using the subset parameters (e.g. start and end).
+
+Later on, during ModelSetup, Training, and PostProcessing, MLAir requests data using :py:`data_handler.get_X()` and
+:py:`data_handler.get_Y()`.
+
+Default Data Handler
+~~~~~~~~~~~~~~~~~~~~
+
+The default data handler accesses data from the TOAR database.
+
+
 Custom Data Handler
--------------------
+~~~~~~~~~~~~~~~~~~~
+
+* Choose your personal data source, either a web interface or locally available data.
+* Create your custom data handler class by inheriting from :py:`AbstractDataHandler`.
+* Implement the initialiser :py:`__init__(*args, **kwargs)` and make sure to call the super class initialiser as well.
+  After executing this method, the data should be ready to use. Apart from that, there are no further rules for the
+  initialiser.
+* (optionally) Modify the class method :py:`cls.build(*args, **kwargs)` to calculate pre-build operations. Otherwise,
+  the data handler calls the class initialiser. On modification, make sure to return the class at the end.
+* (optionally) Add names of required arguments to the :py:`cls._requirements` list. It is not required to add args and
+  kwargs from the initialiser; they are added automatically. Modifying the requirements is only necessary if the build
+  method is modified (see previous bullet).
+* (optionally) Overwrite the base class :py:`self.get_coordinates()` method to return coordinates as a dictionary with
+  keys *lon* and *lat*.
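For illustration, the interface described above could be implemented by a minimal custom data handler along the
following lines. This sketch is not part of the patch: the import path, the class name `MyRandomDataHandler`, the
random in-memory data, and the argument defaults are assumptions made purely for demonstration.

```python
import numpy as np
import xarray as xr

# assumed import path; adjust to where AbstractDataHandler lives in your MLAir version
from mlair.data_handler import AbstractDataHandler


class MyRandomDataHandler(AbstractDataHandler):
    """Hypothetical handler that serves random inputs/targets for a single station."""

    def __init__(self, station, n_samples=100, n_features=5, n_targets=1):
        super().__init__()  # always call the super class initialiser
        self.station = station
        # data is prepared here and is ready to use after the initialiser has run
        self._X = xr.DataArray(np.random.rand(n_samples, n_features), dims=["index", "variables"])
        self._Y = xr.DataArray(np.random.rand(n_samples, n_targets), dims=["index", "variables"])

    def get_X(self, upsampling=False, as_numpy=False):
        # upsampling could be used to weight inputs during training; ignored in this sketch
        return self._X.values if as_numpy else self._X

    def get_Y(self, upsampling=False, as_numpy=False):
        return self._Y.values if as_numpy else self._Y

    def get_coordinates(self):
        # position for the geographical overview plot (placeholder values)
        return {"lon": 6.41, "lat": 50.91}
```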
diff --git a/mlair/data_handler/advanced_data_handler.py b/mlair/data_handler/advanced_data_handler.py
index 1c6ff142c406a953bc4fc023c1cf0b36d3cd08e7..f0dc874a050c274b0b4b6692073d8f7332d27c1d 100644
--- a/mlair/data_handler/advanced_data_handler.py
+++ b/mlair/data_handler/advanced_data_handler.py
@@ -84,6 +84,7 @@ class AbstractDataHandler:
         return self.get_X(upsampling, as_numpy), self.get_Y(upsampling, as_numpy)
 
     def get_coordinates(self) -> Union[None, Dict]:
+        """Return coordinates as dictionary with keys `lon` and `lat`."""
         return None
 
 
@@ -91,7 +92,7 @@ class DefaultDataHandler(AbstractDataHandler):
 
     _requirements = remove_items(inspect.getfullargspec(DataHandlerSingleStation).args, ["self", "station"])
 
-    def __init__(self, id_class, data_path, min_length=0,
+    def __init__(self, id_class: DataHandlerSingleStation, data_path: str, min_length: int = 0,
                  extreme_values: num_or_list = None, extremes_on_right_tail_only: bool = False, name_affix=None):
         super().__init__()
         self.id_class = id_class
@@ -109,7 +110,7 @@ class DefaultDataHandler(AbstractDataHandler):
         self._store(fresh_store=True)
 
     @classmethod
-    def build(cls, station, **kwargs):
+    def build(cls, station: str, **kwargs):
         sp_keys = {k: copy.deepcopy(kwargs[k]) for k in cls._requirements if k in kwargs}
         sp = DataHandlerSingleStation(station, **sp_keys)
         dp_args = {k: copy.deepcopy(kwargs[k]) for k in cls.own_args("id_class") if k in kwargs}
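As a side note to the `_requirements`/`build` pattern visible in the hunk above, the following standalone sketch
mimics how initialiser arguments can be introspected and used to filter keyword arguments. All class and argument
names here are hypothetical; this is not MLAir code, only an illustration of the filtering idea.

```python
import copy
import inspect


class SingleStationHandler:
    """Stand-in for a per-station handler: the arguments of its initialiser
    define what the wrapping handler has to request."""

    def __init__(self, station, data_path, window_history_size=13, transformation=None):
        self.station = station
        self.data_path = data_path
        self.window_history_size = window_history_size
        self.transformation = transformation


class WrappingHandler:
    # collect all initialiser arguments except 'self' and 'station' as requirements
    _requirements = [arg for arg in inspect.getfullargspec(SingleStationHandler.__init__).args
                     if arg not in ("self", "station")]

    @classmethod
    def build(cls, station, **kwargs):
        # keep only the kwargs that the single-station handler actually understands
        sp_keys = {k: copy.deepcopy(kwargs[k]) for k in cls._requirements if k in kwargs}
        return SingleStationHandler(station, **sp_keys)


handler = WrappingHandler.build("DEBW107", data_path="/tmp/data", window_history_size=7,
                                epochs=100)  # 'epochs' is ignored, it is not a requirement
print(WrappingHandler._requirements)  # ['data_path', 'window_history_size', 'transformation']
```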