


To find the state of this project's repository at the time of any of these versions, check out the tags.


Changelog
All notable changes to this project will be documented in this file.

v2.3.0 -  2022-11-25  - new models and plots

general:

new model classes for ResNet and U-Net
new plots and variations of existing plots


new features:

new model classes: ResNet (#419 (closed)), U-Net (#423 (closed))
seasonal mse stack plot (#422 (closed))
new aggregated and line versions of Time Evolution Plot (#424 (closed), #427 (closed))
box-and-whisker plots are created for all error metrics (#431 (closed))
new split and frequency distribution versions of box-and-whisker plots for error metrics (#425 (closed), #434 (closed))
new evaluation metric: mean error / bias (#430 (closed))
conditional quantiles are now available for all competitors too (#435 (closed))
new map plot showing mse at locations (#432 (closed))


technical:

speed up in model setup (#421 (closed))
bugfix for boundary trim in FIR filter (#418 (closed))
persistence is now calculated only on demand (#426 (closed))
block mse are stored locally in a file (#428 (closed))
fix issue with boolean variables not recognized by argparse (#417 (closed))
renaming of ahead labels (#436 (closed))


v2.2.0 -  2022-08-16  - new data sources and python3.9

general:

new data sources: era5 data and ToarDB V2
CAMS competitor available
improved execution speed
MLAir is now updated to python3.9


new features:

new data loading method to load era5 data on Jülich systems (#393 (closed))
new data loading method to load data from ToarDB V2 (#396 (closed))
implemented competitor model using CAMS ensemble forecasts (#394 (closed))
OLS competitor is only calculated if provided in competitor list (#404 (closed))
experimental: snapshot creation to skip preprocessing stage (#346 (closed), #405 (closed), #406 (closed))
new workflow HyperSearchWorkflow stopping after training stage (#408 (closed))


technical:

fixed minor issues and improved execution speed in postprocessing (#401 (closed), #413 (closed))
improved speed in keras iterator creation (#409 (closed))
solved bug for very long competitor time series (#395 (closed))
updated python, HPC and CI environment (#402 (closed), #403 (closed), #407 (closed), #410 (closed))
fix for climateFIR data handler (#399 (closed))
fix for report model error (#416 (closed))


v2.1.0 -  2022-06-07  - new evaluation metrics and improved training

general:

new evaluation metrics, IOA and MNMB
advanced train options for early stopping
reduced execution time by refactoring


new features:

uncertainty estimation of MSE is now applied for each season separately (#374 (closed))
added different configurations of early stopping to use either last trained or best epoch (#378 (closed))
train monitoring plots now add a star for best epoch when using early stopping (#367 (closed))
new evaluation metric index of agreement, IOA (#376 (closed))
new evaluation metric modified normalised mean bias, MNMB (#380 (closed))
new plot available that shows temporal evolution of MSE for each station (#381 (closed))
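
For orientation, the two metrics added in this release can be sketched from their commonly used definitions. This is a minimal sketch, not MLAir's code: the function names are hypothetical, and it assumes IOA follows Willmott's original formulation and MNMB the usual `2/N * sum((f - o) / (f + o))` form. MLAir's own implementation may differ in detail.

```python
from statistics import fmean


def index_of_agreement(obs, pred):
    """Willmott's index of agreement (assumed definition, 1 = perfect match)."""
    obs_mean = fmean(obs)
    # squared differences between prediction and observation
    numerator = sum((p - o) ** 2 for o, p in zip(obs, pred))
    # potential error relative to the observed mean
    denominator = sum(
        (abs(p - obs_mean) + abs(o - obs_mean)) ** 2 for o, p in zip(obs, pred)
    )
    return 1.0 - numerator / denominator


def modified_normalised_mean_bias(obs, pred):
    """MNMB (assumed definition), bounded between -2 and 2 for positive data."""
    return 2.0 * fmean([(p - o) / (p + o) for o, p in zip(obs, pred)])


observations = [1.0, 2.0, 3.0]
forecast = [2.0, 4.0, 6.0]
ioa = index_of_agreement(observations, forecast)
mnmb = modified_normalised_mean_bias(observations, forecast)
```

Both functions take plain sequences of equal length; a forecast identical to the observations gives IOA = 1 and MNMB = 0.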


technical:

reduced loading of forecast path from data store (#328 (closed))
bug fix for uncaught error during transformation (#385 (closed))
bug fix for data handler with climate and FIR filter that caused the transformation to always be calculated with the FIR filter (#387 (closed))
improved duration for latex report creation at end of preprocessing (#388 (closed))
enhanced speed for make prediction in postprocessing (#389 (closed))
fix to always create version badge from version and not from tag name (#382 (closed))


v2.0.0 -  2022-04-08  - tf2 usage, new model classes, and improved uncertainty estimate

general:

MLAir now uses tensorflow v2
new customisable model classes for CNN and RNN
improved uncertainty estimate


new features:

MLAir now depends on tensorflow v2 (#331 (closed))
new CNN class that can be configured layer-wise (#368 (closed))
new RNN class that can be configured in more detail (#361 (closed))
new branched-input CNN class (#368 (closed))
new branched-input RNN class (#362 (closed))
set custom model display name that is used in plots (#341 (closed))
specify names of input branches to use in feature importance plots (#356 (closed))
uncertainty estimate of model error is now calculated for each forecast step additionally (#359 (closed))
data transformation properties are stored locally and can be loaded into an experiment run (#345 (closed))
uncertainty estimate now includes a Mann-Whitney U rank test (#355 (closed))
data handlers can now have access to "future" data specified by new parameter extend_length_opts (#339 (closed))


technical:

MLAir now uses python3.8 on Jülich HPC systems (#375 (closed))
no support of MLAir for tensorflow v1.X, replaced by tf v2.X (#331 (closed))
all data handlers with filters can return data as branches (#370 (closed))
bug fix to force model name and competitor names to be unique (#366 (closed), #369 (closed))
fix to use only a single forecast step (#315 (closed))
CI pipeline adjustments (#340 (closed), #365 (closed))
new option to set the level of the print logging (#364 (closed))
advanced logging for batch data creation and in postprocessing (#350 (closed), #360 (closed))
batch data creation is skipped on disabled training (#341 (closed))
multiprocessing pools are now closed properly (#342 (closed))
bug fix if no competitor data is available (#343 (closed))
bug fix for model loading (#343 (closed))
models plotted by PlotSampleUncertaintyFromBootstrap are now ordered by mean error (#344 (closed))
fix for usage of lazy data that caused unintended reloading of data (#347 (closed))
fix for latex reports not showing all stations and competitors (#349 (closed))
refactoring of hard coded dimension names in skill scores calculation (#357 (closed))
bug fix for order of bootstrap methods in feature importance calculation that caused errors (#358 (closed))
MLAir now distinguishes between window_history_offset (position of last time step), window_history_size (total length of input sample), and extend_length_opts ("future" data that is available at given time) (#353 (closed))


v1.5.0 -  2021-11-11  - new uncertainty estimation

general:

introduces method to estimate sample uncertainty
improved multiprocessing
last release with tensorflow v1 support


new features:

test set sample uncertainty estimation during postprocessing (#333 (closed))
support of Kolmogorov Zurbenko filter for data handlers with filters (#334 (closed))


technical:

new communication scheme for multiprocessing (#321 (closed), #322 (closed))
improved error reporting (#323 (closed))
feature importance now returns unaggregated results (#335 (closed))
error metrics are reported for all competitors (#332 (closed))
minor bugfixes and refactorings (#330 (closed), #326 (closed), #329 (closed), #325 (closed), #324 (closed), #320 (closed), #337 (closed))


v1.4.0 -  2021-07-27  - new model classes and data handlers, improved usability and transparency

general:

many technical adjustments to improve usability and transparency of MLAir
new FCN and CNN classes for easy NN model creation
new plots


new features:

new FCN class that can be customized in many ways (#284 (closed))
also new CNN class (#289 (closed))
added new bootstrap analysis method: mean bootstrapping (#300 (closed))
new data handler using FIR filters (#306 (closed))
performance measures are now stored in local files (#286 (closed))
histogram plots for inputs and targets (#299 (closed))
periodogram plots for filtered data (#298 (closed))


technical:

a calling run script can be stored inside the experiment folder if a reference to this script is passed as argument (#99 (closed))
new callback to track epoch-runtime (#312 (closed))
added switch to use multiprocessing (#297 (closed))
customize maximum number of parallel processes (#308 (closed))
support non-monotonic window lead times (#313 (closed))
resolved bug with FileExistsError (#311 (closed))
resolved bug if no chemical is used at all (#307 (closed))
min/max scaler now scales between -1 and 1 (#302 (closed))
added missing offset parameter to some data handlers (#305 (closed))
improved data store logging (#304 (closed))
improved logging message on station removal in preprocessing (#294 (closed))
limited number of retries in JOIN module (#296 (closed))
adjusted competing skill score plot (#301 (closed))
transformation parameter check (#295 (closed))
implemented lazy data preprocessing for selected data handlers (#292 (closed))
fix bug in separation of scales data handler (#290 (closed))
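
The min/max scaling to the range -1 to 1 mentioned in this list (#302) can be illustrated with the standard linear rescaling formula. This is a minimal sketch under that assumption: the function name and signature are hypothetical and not taken from MLAir.

```python
def min_max_scale(values, lower=-1.0, upper=1.0):
    """Linearly rescale values so that min maps to lower and max to upper."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # a constant series has no range to scale
        raise ValueError("cannot scale a constant series")
    return [lower + (upper - lower) * (v - v_min) / span for v in values]


scaled = min_max_scale([0.0, 5.0, 10.0])
```

With the default bounds, the smallest input maps to -1, the largest to 1, and the midpoint to 0.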


v1.3.0 -  2021-02-24  - competitors and improved transformation

general:

release of official MLAir logo (#274 (closed))
new transformation schema for better independence of MLAir and data handler (#272 (closed))
competing models can be included in postprocessing for direct comparison (#198 (closed))


new features:

new helper functions for geographic issues (#280 (closed))
default data handler and inheritances can use min/max and log transformation (#276 (closed), #275 (closed))
include IntelliO3-ts model as reference via automatic download (#131 (closed))


technical:

experiment name now always includes target sampling type (#263 (closed))
competitive skill score plot is refactored (#260 (closed))
bug fix for climatological skill scores (#259 (closed))
bug fix for custom objects handling (#277 (closed))
bug fix for monitoring plots when multiple output branches are used (#278 (closed))
update requirements to newer version and dependencies (#262 (closed), #273 (closed))
HPC scripts are updated to work properly with parallel data processing (#281 (closed))


v1.2.1 -  2021-02-08  - bug fix for recursive import error

general:

applied bug fix


technical:

bug fix for recursive import error (#269 (closed))


v1.2.0 -  2020-12-18  - parallel preprocessing and improved data handlers

general:

new plots
parallelism for faster preprocessing
improved data handler with mixed sampling types
enhanced test coverage


new features:

station map plot now highlights subsets on the map and displays number of stations for each subset (#227 (closed), #231 (closed))
two new data availability plots PlotAvailabilityHistogram (#191 (closed), #192 (closed), #223 (closed))
introduced parallel code in preprocessing if system supports parallelism (#164 (closed), #224 (closed), #225 (closed))
data handler DataHandlerMixedSampling (and inheritances) supports an offset parameter to end inputs at a different time than 00 hours (#220 (closed))
args for data handler DataHandlerMixedSampling (and inheritances) that differ for input and target can now be passed as a tuple (#229 (closed))


technical:

added templates for release and bug issues (#189 (closed))
improved test coverage (#236 (closed), #238 (closed), #239 (closed), #240 (closed), #241 (closed), #242 (closed), #243 (closed), #244 (closed), #245 (closed))
station map plot now includes the number of stations for each subset (#231 (closed))
postprocessing plots are encapsulated in try except statements (#107 (closed))
updated git settings (#213 (closed))
bug fix for data handler (#235 (closed))
reordering and bug fix for preprocessing reporting (#207 (closed), #232 (closed))
bug fix for outdated system path style (#226 (closed))
new plots are included in default plot list (#211 (closed))

helpers/join connection to ToarDB (e.g. used by DefaultDataHandler) now reports which variable could not be loaded (#222 (closed))
plot PlotBootstrapSkillScore can now additionally highlight specific variables, but this is not yet included in postprocessing (#201 (closed))
data handler DataHandlerMixedSampling now has reduced data loading (#221 (closed))


v1.1.0 -  2020-11-18  - hourly resolution support and new data handlers

general:

MLAir can be used with 1H resolution data from JOIN
new data handlers to use the Kolmogorov-Zurbenko filter and mixed sampling types


new features:

new data handler DataHandlerKzFilter to use Kolmogorov-Zurbenko filter (kz filter) on inputs (#195 (closed))
new data handler DataHandlerMixedSampling that can use mixed sampling types for input and target (#197 (closed))
new data handler DataHandlerMixedSamplingWithFilter that uses kz filter and mixed sampling (#197 (closed))
new data handler DataHandlerSeparationOfScales to use filter-dependent time step sizes on filtered inputs with mixed sampling (#196 (closed))


technical:

bug fix for very short time series in TimeSeriesPlot (#215 (closed))
bug fix for variable dictionary when using hourly resolution (#212 (closed))
variable naming for data from JOIN interface harmonised (#206 (closed))
transformation setup is now separated for inputs and targets (#202 (closed))
bug fix in PlotClimatologicalSkillScore if only single station is used (#193 (closed))
preprocessed data is now stored inside experiment and not in the data folder


v1.0.0 -  2020-10-08  - official release of new version 1.0.0

general:

This is the first official release of MLAir ready for use
updated license and installation instructions


technical:

restructured order of packages in requirements


v0.12.2 -  2020-10-01  - HDFML support

general:

HDFML support


technical:

installation script for HDFML adjusted, #183 (closed)


v0.12.1 -  2020-09-28  - examples in notebook

general:

introduced a notebook documentation for easy starting, #174 (closed)

updated special installation instructions for the Juelich HPC systems, #172 (closed)


new features:

names of input and output shape are renamed consistently to: input_shape, and output_shape, #175 (closed)


technical:

it is possible to assign a custom name to a run module (e.g. used in logging), #173 (closed)


v0.12.0 -  2020-09-21  - Documentation and Bugfixes

general:

improved documentation including installation instructions and many examples from the paper, #153 (closed)

bugfixes (see technical)


new features:


MyLittleModel is now a pure feed-forward network (before it had a CNN part), #168 (closed)


technical:

new compile options check to ensure its execution, #154 (closed)

bugfix for key errors in time series plot, #169 (closed)

bugfix for not used kwargs in DefaultDataHandler, #170 (closed)


trainable parameter is renamed to train_model to prevent confusion with the tf trainable parameter, #162 (closed)

fixed HPC installation failure, #159 (closed)


v0.11.0 -  2020-08-24  -  Advanced Data Handling for MLAir

general:

Introduce advanced data handling with much more flexibility (independent of TOAR DB, custom data handling is
pluggable), #144 (closed)

default data handler is still using TOAR DB


new features:

default data handler using TOAR DB refactored according to advanced data handling, #140 (closed), #141 (closed), #152 (closed)

data sets are handled as collections, #142 (closed), and are iterable in a standard way (StandardIterator) and optimised for
keras (KerasIterator), #143 (closed)

automatically moving station map plot, #136 (closed)


technical:

model modules available from package, #139 (closed)

renaming of parameter time dimension, #151 (closed)

refactoring of README.md, #138 (closed)


v0.10.0 -  2020-07-15  -  MLAir is official name, Workflows, easy Model plug-in

general:

Official project name is released: MLAir (Machine Learning on Air data)
a model class can now easily be plugged into MLAir, #121 (closed)

introduced new concept of workflows, #134 (closed)


new features:

workflows are used to execute a sequence of run modules, #134 (closed)

default workflows for standard and the Juelich HPC systems are available, custom workflows can be defined, #134 (closed)

seasonal decomposition is available for conditional quantile plot, #112 (closed)

map plot is created with coordinates, #108 (closed)


flatten_tails are now more general and easier to customise, #114 (closed)

model classes have custom compile options (replaces set_loss), #110 (closed)

model can be set in ExperimentSetup from outside, #121 (closed)

default experiment settings can be queried using get_defaults(), #123 (closed)

training and model settings are reported as MarkDown and Tex tables, #145 (closed)


technical:

Juelich HPC systems are supported and installation scripts are available, #106 (closed)

data store is tracked, I/O is saved and illustrated in a plot, #116 (closed)

batch size, epoch parameter have to be defined in ExperimentSetup, #127 (closed), #122 (closed)

automatic documentation with sphinx, #109 (closed)

default experiment settings are updated, #123 (closed)

refactoring of experiment path and its default naming, #124 (closed)

refactoring of some parameter names, #146 (closed)

preparation for package distribution with pip, #119 (closed)

all run scripts are updated to run with workflows, #134 (closed)

the experiment folder is restructured, #130 (closed)


v0.9.0  -  2020-04-15  -  faster bootstraps, extreme value upsampling

general:

improved and faster bootstrap workflow
new plot PlotAvailability
extreme values upsampling
improved runtime environment


new features:

entire bootstrap workflow has been refactored and is much faster now, can be skipped with evaluate_bootstraps=False, #60 (closed)

upsampling of extreme values, set with parameter extreme_values=[your_values_standardised] (e.g. [1, 2]) and
extremes_on_right_tail_only=<True/False> if only right tail of distribution is affected or both, #58 (closed), #87 (closed)

minimal data length property (in total and for all subsets), #76 (closed)

custom objects in model class to load customised model objects like padding class, loss, #72 (closed)

new plot for data availability: PlotAvailability, #103 (closed)

introduced (default) plot_list to specify which plots to draw
latex and markdown information on sample sizes for each station, #90 (closed)


technical:

implemented tests on gpu and from scratch for develop, release and master branches, #95 (closed)

usage of tensorflow 1.13.1 (gpu / cpu), separated in 2 different requirements, #81 (closed)

new abstract plot class to have uniform plot class design
new time tracking wrapper to use for functions or classes
improved logger (info on display, debug into file), #73 (closed), #85 (closed), #88 (closed)

improved run environment, especially for error handling, #86 (closed)

prefix general in data store scope is now optional and can be skipped. If given scope is not general, it is
treated as subscope, #82 (closed)

all 2D Padding classes are now selected by Padding2D(padding_name=<padding_type>) e.g.
Padding2D(padding_name="SymPad2D"), #78 (closed)

custom learning rate (or lr_decay) is optional now, #71 (closed)