On data handling
This README describes which function loads which data and where the data is stored.
experiment setup
data_path is the destination where all downloaded data is locally stored. Data is downloaded from TOARDB either using the JOIN interface or a direct connection to the underlying PostgreSQL DB. If data was already downloaded, no new download will be started. Missing data will be downloaded on the fly and saved in data_path.
data_path = src.helpers.prepare_host()
The current implementation leads to the following paths:
hostname | path | comment |
---|---|---|
ZAM144 | /home/{user}/Data/toar_daily/ | notebook Felix |
zam347 | /home/{user}/Data/toar_daily/ | ESDE server |
linux-gzsx | /home/{user}/machinelearningtools/data/toar_daily/ | notebook Lukas |
jureca | /p/project/cjjsc42/{user}/DATA/toar_daily/ | JURECA |
juwels | /p/home/jusers/{user}/juwels/intelliaq/DATA/toar_daily/ | JUWELS |
runner-6HmDp9Qd-project-2411-concurrent | /home/{user}/machinelearningtools/data/toar_daily/ | gitlab-runner |
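
The table above can be read as a lookup from hostname to a path template. The following is a minimal sketch of that idea and not the actual implementation of src.helpers.prepare_host: the names DATA_PATH_TEMPLATES and resolve_data_path are hypothetical, exact hostname matching is an assumption, and only the paths listed in the table are used.

```python
import getpass
import socket

# Path templates copied from the table above; keys are hostnames.
DATA_PATH_TEMPLATES = {
    "ZAM144": "/home/{user}/Data/toar_daily/",
    "zam347": "/home/{user}/Data/toar_daily/",
    "linux-gzsx": "/home/{user}/machinelearningtools/data/toar_daily/",
    "jureca": "/p/project/cjjsc42/{user}/DATA/toar_daily/",
    "juwels": "/p/home/jusers/{user}/juwels/intelliaq/DATA/toar_daily/",
    "runner-6HmDp9Qd-project-2411-concurrent": "/home/{user}/machinelearningtools/data/toar_daily/",
}


def resolve_data_path(hostname=None, user=None):
    """Hypothetical helper: return the data path for a host (exact match assumed)."""
    hostname = hostname if hostname is not None else socket.gethostname()
    user = user if user is not None else getpass.getuser()
    try:
        template = DATA_PATH_TEMPLATES[hostname]
    except KeyError:
        # Unknown hosts are rejected instead of guessing a location.
        raise OSError(f"unknown host: {hostname}") from None
    return template.format(user=user)
```

Passing hostname and user explicitly, e.g. resolve_data_path("jureca", "alice"), avoids any dependency on the current machine and is convenient for testing.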
experiment_path is the root folder in which all results from the experiment are saved. Each experiment should have its
own distinct folder. The experiment path can be set in ExperimentSetup: experiment_date can be set by parser_args and
experiment_path (this argument is not the same as the internally stored experiment_path!) as args. The resulting
experiment_path is the combination of both given arguments, os.path.join(experiment_path, f"{experiment_date}_network").
Inside this folder, several subfolders are created in the course of the program.
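
As a small illustration of how the two arguments combine, the sketch below applies the os.path.join expression quoted above. The function name build_experiment_path and the example values are hypothetical and only serve the illustration.

```python
import os


def build_experiment_path(experiment_path, experiment_date):
    """Hypothetical helper mirroring the join described above."""
    return os.path.join(experiment_path, f"{experiment_date}_network")


# Example values are made up for illustration.
print(build_experiment_path("/tmp/experiments", "2020-01-01"))
```

The argument passed in as experiment_path is therefore only the parent directory; the folder that actually holds the results carries the experiment date and the suffix _network.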
data_path
    <station1>_<var1>_<var2>_..._<varx>.nc
    <station1>_<var1>_<var2>_..._<varx>_meta.csv
    <station2>_<var1>_<var2>_..._<varx>.nc
    <station2>_<var1>_<var2>_..._<varx>_meta.csv
------
experiment_path
|    history.json
|    history_lr.json
|    <experiment_name>_model.pdf
|    <experiment_name>_model-best.h5
|    <experiment_name>_my_model.h5
├─── forecasts
|        forecasts_<station1>_test.nc
|        forecasts_<station2>_test.nc
|        ...
└─── plots
         conditional_quantiles_cali-ref_plot.pdf
         conditional_quantiles_like-bas_plot.pdf
         monthly_summary_box_plot.pdf
         skill_score_clim_all_terms_<architecture>.pdf
         skill_score_clim_<architecture>.pdf
         skill_score_competitive_<architecture>.pdf
         station_map.pdf
         <experiment_name>_history_learning_rate.pdf
         <experiment_name>_history_loss.pdf
         <experiment_name>_history_main_loss.pdf
         <experiment_name>_history_main_mse.pdf
         ...
plot_path contains all created plots. If not given, it is created inside experiment_path by default (as shown in
the folder structure above). It can be customised by ExperimentSetup(plot_path=<path>).
forecast_path is the place where all forecasts are stored as netCDF files. Each file contains exactly one
station. If not given, it is created inside experiment_path by default (as shown in the folder structure above). It can
be customised by ExperimentSetup(forecast_path=<path>).
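
The default behaviour of plot_path and forecast_path can be sketched as a simple fallback to subfolders of experiment_path. This is an illustration under assumptions, not the code inside ExperimentSetup: the helper resolve_output_paths is hypothetical, and the subfolder names plots and forecasts are taken from the folder structure above.

```python
import os


def resolve_output_paths(experiment_path, plot_path=None, forecast_path=None):
    """Hypothetical helper: fall back to subfolders of experiment_path."""
    if plot_path is None:
        # Default location as shown in the folder structure.
        plot_path = os.path.join(experiment_path, "plots")
    if forecast_path is None:
        forecast_path = os.path.join(experiment_path, "forecasts")
    return plot_path, forecast_path
```

A custom path that is passed explicitly is returned unchanged, so plots or forecasts can live outside the experiment folder if desired.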
pre-processing
Each requested station is checked for whether it is already included in data_path. All files follow the naming
convention <station_name>_<sorted_list_of_all_variables_split_by_underscore>.nc. E.g. the station DEBW013 with the
variables cloudcover, NO, NO2, O3 and temp (all TOARDB short names) is saved as DEBW013_cloudcover_no_no2_o3_temp.nc,
whereas the same station with only O3 and temperature becomes DEBW013_o3_temp.nc. Although all data of the latter file
is potentially also included in the former file, the program will always download the data for the new specification and
save this data into a new file. Only if the exactly matching file is available locally, no data is downloaded. NOTE:
There is no check on the data time range, only the name is compared. Set overwrite_local_data=True in
experiment_setup.py to overwrite local data by downloading new data.
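
The naming convention and the name-only check described above can be sketched as follows. The helpers local_file_name and needs_download are hypothetical names, and lower-casing the variable names is an assumption inferred from the DEBW013 example; the sketch compares file names only, just like the text states.

```python
import os


def local_file_name(station, variables):
    """Build <station>_<sorted variables joined by '_'>.nc (lower-casing is assumed)."""
    parts = sorted(v.lower() for v in variables)
    return f"{station}_{'_'.join(parts)}.nc"


def needs_download(data_path, station, variables, overwrite_local_data=False):
    """Return True if the exactly matching file is missing or must be overwritten."""
    file_path = os.path.join(data_path, local_file_name(station, variables))
    # Only the file name is checked; the time range inside the file is not inspected.
    return overwrite_local_data or not os.path.exists(file_path)
```

Because the check is purely name based, a request for O3 and temp triggers a download even when DEBW013_cloudcover_no_no2_o3_temp.nc already exists locally.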
model setup
The checkpoint is created inside experiment_path as <experiment_name>_model-best.h5.
The architecture of the model is plotted into experiment_path as <experiment_name>_model.pdf.
training
Training metrics are saved in history.json and history_lr.json.
The best model is saved in <experiment_name>_my_model.h5.
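
The saved training metrics can be inspected after a run. The sketch below assumes that history.json maps each metric name to a list with one value per epoch; this layout and the metric key val_loss are assumptions and are not confirmed by this README. The helper names load_history and best_epoch are hypothetical.

```python
import json
import os


def load_history(experiment_path, file_name="history.json"):
    """Read a metrics file from experiment_path (layout is an assumption)."""
    with open(os.path.join(experiment_path, file_name)) as f:
        return json.load(f)


def best_epoch(history, metric="val_loss"):
    """Return the zero-based index of the epoch with the lowest metric value."""
    values = history[metric]
    return min(range(len(values)), key=values.__getitem__)
```

If the file uses a different layout, only best_epoch needs to be adapted; load_history simply returns whatever JSON content is stored.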
post-processing
During the make_forecast method, all calculated forecasts of the neural network, persistence and ordinary least squares,
together with the target values for the respective lead time, are saved locally inside forecast_path as
forecasts_<station>_test.nc.
All plots are created inside plot_path.