diff --git a/README.md b/README.md
index 31365da89169cfe2be58de89a574ae4b69e40224..3467a31f23b7f770d32afb91cb62d5207ccf3d62 100644
--- a/README.md
+++ b/README.md
@@ -20,4 +20,60 @@ and [Network In Network (Lin et al., 2014)](https://arxiv.org/abs/1312.4400).
 add it to `src/join_settings.py` in the hourly data section. Replace the `TOAR_SERVICE_URL` and the `Authorization`
 value. To make sure, that this **sensitive** data is not uploaded to the remote server, use the following command to
 prevent git from tracking this file: `git update-index --assume-unchanged src/join_settings.py
-`
\ No newline at end of file
+`
+
+# Customise your experiment
+
+This section summarises which parameters can be customised for a training.
+
+## Transformation
+
+There are two different approaches (called scopes) to transform the data:
+1) `station`: transform data for each station independently (somewhat like batch normalisation)
+1) `data`: transform the data of all stations with shared metrics
+
+The transformation is set by the `transformation` attribute. If `transformation = None` is given to `ExperimentSetup`,
+data is not transformed at all. For all other setups, use the following dictionary structure to specify the
+transformation.
+```
+transformation = {"scope": <...>,
+                  "method": <...>,
+                  "mean": <...>,
+                  "std": <...>}
+ExperimentSetup(..., transformation=transformation, ...)
+```
+
+### scopes
+
+**station**: mean and std are not used
+
+**data**: either provide already calculated values for mean and std (if required by transformation method), or choose
+from different calculation schemes, explained in the mean and std section.
+
+### supported transformation methods
+Currently supported methods are:
+* standardise (default, if method is not given)
+* centre
+
+### mean and std
+`"mean"="accurate"`: calculate the exact values of mean and std (depending on method) by using all data. This
+method is accurate, but the calculation may take some time. Furthermore, it could potentially lead to memory
+issues (not explored yet, but could appear for very large amounts of data).
+
+`"mean"="estimate"`: estimate mean and std (depending on method). For each station, mean and std are calculated and
+afterwards aggregated using the mean value over all station-wise metrics. This method is less accurate, especially
+regarding the std calculation, but in return much faster.
+
+We recommend using the latter method, *estimate*, for the following reasons:
+* much faster calculation
+* the exact accuracy of mean and std is less important, because it is "just" a transformation / scaling
+* accuracy of mean is almost as high as in the *accurate* case, because
+$\bar{x_{ij}} = \bar{\left(\bar{x_i}\right)_j}$ holds exactly for equal data counts per station. In the *estimate* case,
+each station mean is weighted equally, independently of the actual data count of the station.
+* accuracy of std is lower for *estimate* because of $\mathrm{Var}\left(x_{ij}\right) \ne \bar{\left(\mathrm{Var}\left(x_i\right)\right)_j}$, but still the mean of all
+station-wise std is a decent estimate of the true std.
+
+`"mean"=<value, e.g. xr.DataArray>`: If mean and std are already calculated or shall be set manually, just add the
+scaling values instead of the calculation method. For method *centre*, std can still be None, but is required for the
+*standardise* method. **Important**: The format of the given values **must** match the internal data format of the
+DataPreparation class: `xr.DataArray` with `dims=["variables"]` and one value for each variable.
\ No newline at end of file
diff --git a/src/data_handling/data_generator.py b/src/data_handling/data_generator.py
index 842846ae0753c52e254138edceae7ac0c0bc5e5a..79e1e7e72c1779d18a11652ab132c253e1dff806 100644
--- a/src/data_handling/data_generator.py
+++ b/src/data_handling/data_generator.py
@@ -104,8 +104,8 @@ class DataGenerator(keras.utils.Sequence):
         method = transformation.get("method", "standardise")
         mean = transformation.get("mean", None)
         std = transformation.get("std", None)
-        if isinstance(mean, str):
-            if scope == "data":
+        if scope == "data":
+            if isinstance(mean, str):
                 if mean == "accurate":
                     mean, std = self.calculate_accurate_transformation(method)
                 elif mean == "estimate":
@@ -113,10 +113,10 @@ class DataGenerator(keras.utils.Sequence):
                 else:
                     raise ValueError(f"given mean attribute must either be equal to strings 'accurate' or 'estimate' or"
                                      f"be an array with already calculated means. Given was: {mean}")
-            elif scope == "station":
-                mean, std = None, None
-            else:
-                raise ValueError(f"Scope argument can either be 'station' or 'data'. Given was: {scope}")
+        elif scope == "station":
+            mean, std = None, None
+        else:
+            raise ValueError(f"Scope argument can either be 'station' or 'data'. Given was: {scope}")
         transformation["method"] = method
         transformation["mean"] = mean
         transformation["std"] = std
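
Note on the change: before this patch, the scope check ran only when `mean` was a string, so an invalid scope combined with a non-string `mean` passed silently. After the patch, the scope is always checked first. The following is a minimal standalone sketch of that control flow, not the repository code. The functions `accurate`, `estimate`, and `resolve` and the toy data `STATIONS` are hypothetical stand-ins; the population standard deviation is an assumption (the diff does not show which estimator the real `calculate_*_transformation` methods use), and a missing scope is treated as invalid here because the real default is not visible in the diff.

```python
from statistics import fmean, pstdev

# Hypothetical toy data: one list of values per station (not from the repository).
STATIONS = {
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0],
}


def accurate(stations):
    """Pool all values first, then compute mean and std once."""
    pooled = [v for values in stations.values() for v in values]
    return fmean(pooled), pstdev(pooled)


def estimate(stations):
    """Compute mean and std per station, then average them with equal weights."""
    means = [fmean(values) for values in stations.values()]
    stds = [pstdev(values) for values in stations.values()]
    return fmean(means), fmean(stds)


def resolve(transformation, stations):
    """Mirror the post-change branching: scope first, then the type of mean."""
    scope = transformation.get("scope")  # assumption: no default scope
    mean = transformation.get("mean", None)
    std = transformation.get("std", None)
    if scope == "data":
        if isinstance(mean, str):
            if mean == "accurate":
                mean, std = accurate(stations)
            elif mean == "estimate":
                mean, std = estimate(stations)
            else:
                raise ValueError(f"unknown mean option: {mean}")
        # non-string mean/std are passed through unchanged
    elif scope == "station":
        mean, std = None, None  # mean and std are not used for this scope
    else:
        raise ValueError(f"unknown scope: {scope}")
    return mean, std


if __name__ == "__main__":
    print(resolve({"scope": "data", "mean": "accurate"}, STATIONS))
    print(resolve({"scope": "data", "mean": "estimate"}, STATIONS))
```

With this toy data the pooled mean is 11.8, while the equally weighted mean of the station means is 9.0, because station B contributes more values. The estimated std comes out smaller than the pooled std, since averaging station-wise stds ignores the spread between station means. For stations with similar levels and data counts, both gaps shrink.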