# MLAir issues
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues (2022-08-24T15:16:12+02:00)

**Issue 420: DataHandler with multiple stats per variable** (Ghost User, 2022-08-24T15:16:12+02:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/420

Implement a new strategy to handle one variable with multiple statistics. For example: `stats_per_var = {'o3': ['dma8eu', 'perc95', 'perc05'], 'relhum': 'average_values'}`
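For illustration, renaming variables per statistic (see the task list below) could look like this minimal sketch; the function name `expand_stats_per_var` is hypothetical, not MLAir's actual API:

```python
# Hypothetical helper (not MLAir's API): expand a stats_per_var mapping into
# per-statistic variable names like "o3_dma8eu". A plain string value is
# treated as a single-statistic shortcut.
def expand_stats_per_var(stats_per_var):
    expanded = []
    for var, stats in stats_per_var.items():
        stats = [stats] if isinstance(stats, str) else stats
        expanded.extend(f"{var}_{stat}" for stat in stats)
    return expanded

stats_per_var = {'o3': ['dma8eu', 'perc95', 'perc05'], 'relhum': 'average_values'}
print(expand_stats_per_var(stats_per_var))
# ['o3_dma8eu', 'o3_perc95', 'o3_perc05', 'relhum_average_values']
```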
- [x] upload downloads
- [x] reset names with statistic name (e.g. o3_dma8eu, o3_perc95, o3_perc05)
- [x] update variable list in DataHandler with updated names
- [ ] implement statistics selector for target variable

**Issue 384: AQW data handler** (Ghost User, 2022-07-11T10:53:07+02:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/384

# Data Handler for AQWatch
*data handler designed for work of @li40*
## Structure of Data
**Inputs**: forecasts from CTMs
* root folder: `mod`
* structured per region of interest (each will be a separate experiment with a different NN), e.g. `/colorado`
* depending on region: different number and names of CTMs, and always the mean of the ensemble *(not used as input for now)*, e.g. `/lotos_tno`
* data are already interpolated on station level
* data are stored per forecast date `/$YYYYMMDD`
* single file per species, e.g. `$nvar_TNO_inp.nc`
* data files are structured as follows: first dimension is time (model time UTC), second dimension is station (named by station long name).

**Targets**: observations from measurement stations
* root folder: `/obs/obs_download_scripts`
* structured per region of interest (each will be a separate experiment with a different NN), e.g. `/Colorado`
* data are inside data folder `/Data`
* single file per species and date (including all stations and 24h), e.g. `obs_$nvar_$YYYYMMDD.nc`
* data files are structured as follows: index is date_utc, columns are stations indicated by id (size is timesteps x stations).

**Competitor**: ensemble mean calculated over all CTM forecasts
* root folder: `mod`
* structured per region of interest (each will be a separate experiment with a different NN), e.g. `/colorado`
* stored in directory `/ens`
* structure same as for inputs
```
|-- mod
| `-- colorado
| |-- ens (MEAN of CTM#1-3)
| | `-- $YYYYMMDD (%forecast_date)
| | `-- interpolated
| | `-- $nvar_ensmean_inp.nc # time series of ensemble mean (competitor)
| |-- lotos_tno (CTM#1)
| | |-- $YYYYMMDD (%forecast_date)
| | | `-- interpolated
| | | `-- $nvar_TNO_inp.nc # time series of CTM#1 (input feature)
| | `-- stations_colo.csv # Station info
| |-- silam_fmi (CTM#2)
| | |-- $YYYYMMDD (%forecast_date)
| | | `-- interpolated
| | | `-- $nvar_FMI_inp.nc # time series of CTM#2 (input feature)
| | `-- stations_colo.csv
| `-- wrf_ucar (CTM#3)
| |-- $YYYYMMDD (%forecast_date)
| | `-- interpolated
| | `-- $nvar_UCAR_inp.nc # time series of CTM#3 (input feature)
| `-- stations_colo.csv
`-- obs
    `-- obs_download_scripts
        `-- Colorado
            |-- stations_colo.csv
            `-- Data
                `-- obs_$nvar_$YYYYMMDD.nc (%obs_date) # time series of observation (target)
$nvar = ['co','so2','no2','o3','pm10','pm25'] # pollutant species available
%forecast_date = current day
%obs_date = one day before current day
```
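Under these conventions, the file paths could be assembled as in the following sketch; the helper names are hypothetical, and the `$nvar`/`$YYYYMMDD` placeholders are filled according to the legend above:

```python
from datetime import date

# Hypothetical path helpers matching the tree above. Folder names and file
# patterns are read off the listing; the function names themselves are assumed.
def mod_path(root, region, ctm, ctm_tag, nvar, forecast_date):
    """Path to a CTM forecast file, e.g. mod/colorado/lotos_tno/20220501/interpolated/o3_TNO_inp.nc"""
    day = forecast_date.strftime("%Y%m%d")
    return f"{root}/mod/{region}/{ctm}/{day}/interpolated/{nvar}_{ctm_tag}_inp.nc"

def obs_path(root, region, nvar, obs_date):
    """Path to an observation file, e.g. obs/obs_download_scripts/Colorado/Data/obs_no2_20220430.nc"""
    day = obs_date.strftime("%Y%m%d")
    return f"{root}/obs/obs_download_scripts/{region}/Data/obs_{nvar}_{day}.nc"
```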
## Start Script
This is a basic script that could be used for the AQWatch data handler. The script does not set up the NN explicitly but can be used to check whether the workflow passes.
```python
__author__ = "Lukas Leufen"
__date__ = '2022-05-18'
import argparse
import os
import sys
sys.path.append("<abs_path_to_mlair>")
from mlair.workflows import DefaultWorkflow
from mlair.data_handler.data_handler_aqwatch import DataHandlerAQWatch, DataHandlerAQWatchSingleStation
def main(parser_args):
    args = dict(
        data_handler=DataHandlerAQWatch,
        interpolation_limit=3,
        overwrite_local_data=False,
        overwrite_lazy_data=True,
        lazy_preprocessing=True,
        train_min_length=0,  # just replace defaults which are 90
        val_min_length=0,  # just replace defaults which are 90
        test_min_length=0,  # just replace defaults which are 90
        window_history_size=0,  # has to be 0 to indicate t0
        window_lead_time=0,  # has to be 0 to indicate t0
        start="2022-05-01",  # start and train_start should be the same
        train_start="2022-05-01",
        train_end="2022-05-02",
        val_start="2022-05-02",
        val_end="2022-05-10",
        test_start="2022-05-10",
        test_end="2022-05-20",
        end="2022-05-20",  # end and test_end should be the same
        region="colorado",  # specify the region
        variables=["no2"],  # this sets your variable, currently it is not possible to use more than one
        target_var=["no2"],  # this sets your variable, currently it is not possible to use more than one
        ctm_list=["test_ctm", "test2_ctm"],  # name models to use
        competitors=["aqw_ens_mean"],
        sampling="hourly",
        # stations=['80050006', '80131001', '80130014', '80350004', '80410017', '80410015', '80410013',
        #           '80830006', '80310002', '80310013', '80691004', '80690011', '80690009', '80310028',
        #           '80590011', '80519991', '80770017', '81230009', '81230006', '80050002', '80310027',
        #           '80310026', '80130003', '80410016', '80830101', '80770020', '81030006', '80450012',
        #           '80450007', '80590006', '80690007', '80699991', '80677001', '80677003', '80013001',
        #           '80970008'],
        stations=["Aurora East", "Boulder Reservoir", "Chatfield Park - 11500 N. Roxborough Park Rd.",
                  "Colorado Springs - USAF Academy", "Cortez Ozone", "Denver - CAMP - 2105 Broadway",
                  "Fort Collins - CSU - 708 S. Mason St.", "Fort Collins - West - Laporte Ave. & Overland Tr.",
                  "Golden - NREL - 2054 Quaker St.", "Gothic", "Greeley - Weld Co. Tower - 3101 35th Ave.",
                  "Highland Reservoir - 8100 S. University Blvd.", "La Casa NCORE - 4545 Navajo St.",
                  "Manitou Springs", "Mesa Verde NP", "Palisade Ozone", "Rangely, CO", "Rifle Ozone",
                  "Rocky Flats - N - 16600 W. Colo. Hwy. 128", "Rocky Mountain NP", "Ute 1", "Ute 3",
                  "Welby - 78th Ave. & Steele St."],
        transformation={
            "o3": {"method": "log"},
            "no": {"method": "log"},
            "no2": {"method": "log"},
        },
        data_path=os.path.join(".", "data", "aqw_data"),  # <- root folder of data containing obs and mod
        **parser_args.__dict__,
    )
    workflow = DefaultWorkflow(**args, start_script=__file__)
    workflow.run()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--experiment_date', metavar='--exp_date', type=str, default=None,
                        help="set experiment date as string")
    args = parser.parse_args(["--experiment_date", "testrun"])
    main(args)
```
## TODOs
* [ ] include competitor: maybe write a class that can load the AQWatch data (and stores them), similar to the IntelliO3 competitor
* [ ] check if workflow works from begin to end
* [ ] be able to load meta data such as lon/lat from meta files like `stations_colo.csv`

**Issue 363: calculate toar metrics on hourly forecasts** (Ghost User, 2022-02-22T14:17:51+01:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/363

# TOAR metrics on hourly data
* [ ] forecast e.g. 4x24h by model
* [ ] be able to apply toar metrics on this data to return daily-resolution data
* [ ] use this data as a forecast for comparison with other models

**Issue 354: Allow fixed variable list for feature importance boots** (Ghost User, 2022-01-19T12:35:41+01:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/354

Allow a fixed variable list for feature importance bootstraps.

**Issue 319: Apply TOAR Statistics on WRF-data handler** (Ghost User, 2021-09-07T11:25:02+02:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/319

Include calculation of [TOAR statistics](https://gitlab.jsc.fz-juelich.de/esde/toar-public/toarstats/) into the WRF data handler class.
- [x] Include time zone information
- [x] Convert to local time zone
- [x] Apply statistics (ensure correct time zones!)

**Issue 309: Class-based Oversampling technique** (Ghost User, 2021-06-23T16:24:31+02:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/309

# Target
Implement a class-based oversampling technique. Classes are defined by fixed ppb intervals, and the oversampling then (fully) balances the frequency of the classes. The method is added in pre-processing.
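A minimal sketch of the idea, assuming simple equal-width ppb binning and random duplication of under-represented samples (not MLAir's actual implementation):

```python
import random
from collections import defaultdict

# Assumed logic for class-based oversampling: bin target values into fixed
# ppb intervals, then duplicate samples from under-represented bins until
# every bin matches the largest one.
def oversample_indices(targets, bin_width=10.0, seed=0):
    rng = random.Random(seed)
    bins = defaultdict(list)
    for i, value in enumerate(targets):
        bins[int(value // bin_width)].append(i)  # class = fixed ppb interval
    max_count = max(len(idx) for idx in bins.values())
    selected = []
    for idx in bins.values():
        selected.extend(idx)  # keep every original sample
        selected.extend(rng.choices(idx, k=max_count - len(idx)))  # duplicate
    return sorted(selected)
```

Training on `targets[i] for i in oversample_indices(targets)` then sees each ppb class equally often.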
# Tasks
* [x] add method `apply_oversampling` in `PreProcessing`
* [x] store results of `apply_oversampling` in data store
* [x] make all hardcoded parameters (e.g. `bins` or `rates_cap`) more flexible
* [x] add parameter to experiment setup (init, run)
* [x] load information from data store within apply_oversampling by using `data_store.get_default(...)`
* [x] defaults could be either set in the experiment setup (by using the defaults file) or just in the get_default call
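The `data_store.get_default(...)` behaviour referenced above can be pictured with this toy store (assumed semantics, not MLAir's actual `DataStore` class):

```python
# Toy data store illustrating get_default (assumed semantics): return the
# stored object for a name if present, otherwise fall back to the given default.
class DataStore:
    def __init__(self):
        self._store = {}

    def set(self, name, obj):
        self._store[name] = obj

    def get_default(self, name, default=None):
        return self._store.get(name, default)
```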
The following steps are currently not specified: the DataHandler should be able to use the oversampling information.

**Issue 287: WRF-Datahandler should inherit from SingleStationDatahandler** (Ghost User, 2021-03-03T11:40:33+01:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/287

The WRF data handler should inherit from SingleStationDataHandler to ensure that transform methods etc. are available.

**Issue 271: add CDC database DataHandler** (Ghost User, 2021-03-08T15:38:05+01:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/271

The goal is to add a database access script inside the DataHandler structure for use in the master thesis project of Falco.

**Issue 267: Create WRF-Chem data handler** (Ghost User, 2021-03-15T11:45:19+01:00)
https://gitlab.jsc.fz-juelich.de/esde/machine-learning/mlair/-/issues/267

Create a default WRF-Chem data handler. The data handler should be able to
- [x] Read WRF-Chem data
- [x] Extract the column (height and time) for given lat/lon coordinates
- [ ] Create labels (e.g. concentrations in the surface cell at times $t_i$, where $i > 0$)
- [ ] Create inputs (e.g. variables from the column for times $t_{-J}$ to $t_{0}$, where $J$ is the total number of previous time steps)

Include WRF-Chem data handler and run pipeline.
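The label/input construction described in the last two items could be sketched as follows; the function name, shapes, and window arguments are assumed for illustration only:

```python
# Hypothetical windowing sketch: given a column time series (a list over time
# of per-height value lists, where column[t][0] is the surface cell), build
# inputs from the full column over t_-J .. t_0 and labels from the surface
# cell at t_1 .. t_lead.
def make_windows(column, history, lead):
    samples = []
    for t0 in range(history, len(column) - lead):
        inputs = column[t0 - history:t0 + 1]  # full column, t_-J .. t_0
        labels = [column[t][0] for t in range(t0 + 1, t0 + 1 + lead)]  # surface cell
        samples.append((inputs, labels))
    return samples
```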