# MachineLearningTools

This is a collection of all relevant functions used for machine learning tasks in the ESDE group.

## Inception Model

See a description [here](https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202)
or take a look at the papers [Going Deeper with Convolutions (Szegedy et al., 2014)](https://arxiv.org/abs/1409.4842)
and [Network In Network (Lin et al., 2014)](https://arxiv.org/abs/1312.4400).


# Installation

* Install __proj__ on your machine using the console, e.g. for openSUSE Leap: `zypper install proj`.
* A C++ compiler is required for the cartopy installation.

## HPC - JUWELS and HDFML setup
The following instructions guide you through the installation on JUWELS and HDFML.
* Clone the repo to the HPC system (we recommend placing it in `/p/projects/<project name>`).
* Set up the venv by executing `source setupHPC.sh`. This script loads all pre-installed modules and creates a venv for 
all other packages. Furthermore, it creates slurm batch scripts to execute code on compute nodes. <br> 
You have to enter the HPC project's budget name (`--account` flag).
* The default external data path on JUWELS and HDFML is set to `/p/project/deepacf/intelliaq/<user>/DATA/toar_<sampling>`. 
<br>To choose a different location, open `run.py` and add the following keyword argument to `ExperimentSetup`: 
`data_path=<your>/<custom>/<path>` (see the sketch after this list). 
* Execute `python run.py` on a login node to download example data. The program will throw an OSError after downloading.
* Execute either `sbatch run_juwels_develgpus.bash` or `sbatch run_hdfml_batch.bash` to verify that the setup went well.
* Currently, cartopy is not working on our HPC systems; therefore, `PlotStations` does not create any output.
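
The `data_path` customisation from the list above is just a keyword argument to `ExperimentSetup`. Below is a minimal sketch; both the import path and the data location are assumptions and have to be adjusted to your checkout and project.
```
# Minimal sketch: point the experiment to a custom data location.
# Both the import path and the example path are assumptions; adjust them to
# match your run.py and your project storage.
from src.run_modules.experiment_setup import ExperimentSetup  # assumed import

ExperimentSetup(data_path="/p/scratch/<project>/<user>/DATA/toar_daily")  # hypothetical path
```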

### HPC JUWELS and HDFML remarks 
Please note that the HPC setup is customised for JUWELS and HDFML. When using another HPC system, you can use the HPC setup files as a skeleton and customise them to your needs. 

Note: The method `PartitionCheck` currently only checks if the hostname starts with `ju` or `hdfmll`. 
Therefore, it might be necessary to adapt the `if` statement in `PartitionCheck._run`.
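
As a rough illustration, the check described above corresponds to something like the following sketch; the actual code in `PartitionCheck._run` may differ.
```
import socket

# Rough sketch of the hostname check; the real PartitionCheck._run may differ.
hostname = socket.gethostname()
if hostname.startswith(("ju", "hdfmll")):
    print("running on JUWELS / HDFML")  # this is the branch to adapt for other HPC systems
```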


# Security

* To use hourly data from ToarDB via the JOIN interface, a private token is required. Request your personal access token and
add it to `src/join_settings.py` in the hourly data section. Replace the `TOAR_SERVICE_URL` and the `Authorization` 
value. To make sure that this **sensitive** data is not uploaded to the remote server, use the following command to
prevent git from tracking this file: `git update-index --assume-unchanged src/join_settings.py`

# Customise your experiment

This section summarises which parameters can be customised for a training run.

## Transformation

There are two different approaches (called scopes) to transform the data:
1) `station`: transform the data of each station independently (similar to batch normalisation)
2) `data`: transform the data of all stations with shared metrics

Transformation must be set by the `transformation` attribute. If `transformation = None` is given to `ExperimentSetup`, 
data is not transformed at all. For all other setups, use the following dictionary structure to specify the 
transformation.
```
transformation = {"scope": <...>, 
                  "method": <...>,
                  "mean": <...>,
                  "std": <...>}
ExperimentSetup(..., transformation=transformation, ...)
```

### scopes

**station**: mean and std are not used

**data**: either provide already calculated values for mean and std (if required by the transformation method), or choose 
from different calculation schemes, explained in the mean and std section.

### supported transformation methods
Currently supported methods are:
* standardise (default, if method is not given)
* centre
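
As an illustration of the scopes and methods above, two hypothetical transformation dictionaries could look like this (the mean options are explained in the next subsection):
```
# Hypothetical examples following the dictionary structure shown above.
# Scope "station": mean and std are not used, each station is transformed independently.
transformation_station = {"scope": "station", "method": "standardise"}

# Scope "data": shared metrics over all data, here estimated station-wise
# ("estimate", see the mean and std subsection below).
transformation_data = {"scope": "data", "method": "centre", "mean": "estimate"}
```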

### mean and std
`"mean"="accurate"`: calculate the accurate values of mean and std (depending on method) by using all data. Although, 
this method is accurate, it may take some time for the calculation. Furthermore, this could potentially lead to memory 
issue (not explored yet, but could appear for a very big amount of data)

`"mean"="estimate"`: estimate mean and std (depending on method). For each station, mean and std are calculated and
afterwards aggregated using the mean value over all station-wise metrics. This method is less accurate, especially 
regarding the std calculation but therefore much faster.

We recommend using the latter method *estimate* for the following reasons:
* much faster calculation
* the real accuracy of mean and std is less important, because it is "just" a transformation / scaling
* the accuracy of the mean is almost as high as in the *accurate* case, because 
$\overline{x_{ij}} = \overline{\left(\overline{x_i}\right)_j}$. The only difference is that in the *estimate* case each 
station's mean is equally weighted, independently of the actual data count of the station.
* the accuracy of the std is lower for *estimate* because 
$\operatorname{Var}\left(x_{ij}\right) \ne \overline{\left(\operatorname{Var}\left(x_i\right)\right)_j}$, but the mean of all 
station-wise stds is still a decent estimate of the true std.
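
The following toy sketch (plain NumPy with made-up station sizes) illustrates the difference: *estimate* averages the station-wise statistics with equal weight, while *accurate* computes them over all pooled data.
```
import numpy as np

rng = np.random.default_rng(0)
# Toy data: three stations with different numbers of samples.
stations = [rng.normal(loc=0.0, scale=1.0, size=n) for n in (100, 250, 50)]

# "estimate": equally weighted mean over the station-wise statistics.
mean_estimate = np.mean([s.mean() for s in stations])
std_estimate = np.mean([s.std() for s in stations])

# "accurate": statistics over all pooled data.
pooled = np.concatenate(stations)
mean_accurate, std_accurate = pooled.mean(), pooled.std()
```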

`"mean"=<value, e.g. xr.DataArray>`: If mean and std are already calculated or shall be set manually, just add the
scaling values instead of the calculation method. For method *centre*, std can still be None, but is required for the
*standardise* method. **Important**: Format of given values **must** match internal data format of DataPreparation 
class: `xr.DataArray` with `dims=["variables"]` and one value for each variable.
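
A hypothetical example of manually supplied scaling values in the required format (variable names and numbers are made up):
```
import xarray as xr

# Made-up values for two hypothetical variables; the required format is an
# xr.DataArray with dims=["variables"] and one value per variable.
mean = xr.DataArray([25.0, 283.15], coords={"variables": ["o3", "temp"]}, dims=["variables"])
std = xr.DataArray([10.0, 8.5], coords={"variables": ["o3", "temp"]}, dims=["variables"])

transformation = {"scope": "data", "method": "standardise", "mean": mean, "std": std}
```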