Commit 30a86554 authored by Carsten Hinz

extended documentation

added documentation on station averaging
parent 3a8acccd
2 merge requests: !28 started adding an option to average timeseries at a specific station, !27 started adding an option to average timeseries at a specific station
......@@ -201,11 +201,25 @@ Once a request is finished, its status endpoint remains valid only for a limited time.
There is no check whether a request is already running. Therefore, submitting the same request multiple times leads to additional load on the system and slows down all requests.
The TOAR database has only a limited number of workers for performing a statistical analysis. It is therefore advised to run one request after another, especially for large requests covering a large number of stations and/or a long time range.
## A brief reminder on timeseries and stations
The TOAR database uses timeseries, which are associated with a station.
At an individual station, one or more physical sensors are mounted. These can measure different variables or, in some cases, the same variable with different techniques.
A station can also be part of different networks that contribute data to the TOAR database.
A more detailed description of the included data can be found in
[Chapter Three: The TOAR data processing workflow](https://toar-data.fz-juelich.de/sphinx/TOAR_TG_Vol02_Data_Processing/build/latex/toardataprocessing--technicalguide.pdf).
In the case of gridding, this can lead to systematic errors. For example, the statistical weight of a station increases if its data are contributed twice.
## Gridding
The gridding uses a user-defined grid to combine all stations within a cell.
For each cell, the mean, the standard deviation and the number of stations are reported in the resulting xarray dataset.
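
To illustrate the idea, the following sketch computes such per-cell statistics with pandas; the station values, column names and cell assignment are made up for this example and do not reflect the internal implementation of toargridding:

``` python
import numpy as np
import pandas as pd

# Illustrative station values; names and coordinates are made up.
stations = pd.DataFrame({
    "lat": [50.1, 50.4, 52.3],
    "lon": [6.2, 6.4, 13.1],
    "value": [30.0, 34.0, 41.0],
})

# Assign each station to a grid cell by flooring its coordinates
# to the cell origin of a regular grid.
lat_resolution, lon_resolution = 1.9, 2.5
stations["lat_cell"] = np.floor(stations["lat"] / lat_resolution) * lat_resolution
stations["lon_cell"] = np.floor(stations["lon"] / lon_resolution) * lon_resolution

# Per cell: mean, standard deviation and number of contributing stations.
per_cell = stations.groupby(["lat_cell", "lon_cell"])["value"].agg(["mean", "std", "count"])
dataset = per_cell.to_xarray()  # 2D dataset with one entry per cell (requires xarray)
print(dataset)
```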
### Station averaging
The timeseries at each station can be averaged before the gridding is done. This gives each station the same statistical weight.
Depending on the calculated statistical aggregates, this can introduce or remove systematic errors in the data analysis.
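
Conceptually, this averaging boils down to a groupby over stations before the cell statistics are computed. The following sketch assumes a simplified table with one value per timeseries and a station ID; it is not the actual toargridding internals:

``` python
import pandas as pd

# Illustrative data: two timeseries at station "A", one at station "B".
timeseries = pd.DataFrame({
    "station": ["A", "A", "B"],
    "timeseries_id": [1, 2, 3],
    "value": [30.0, 34.0, 41.0],
})

# Without averaging, station "A" enters the cell statistics twice.
# With averaging, each station contributes exactly one value.
per_station = timeseries.groupby("station")["value"].mean()
print(per_station)  # A: 32.0, B: 41.0
```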
## Contributors
The contributors include all projects, organizations and persons that are associated with any timeseries of a gridded dataset in the roles "contributor" and "originator". In offline mode, this information is preserved by saving the timeseries IDs in a dedicated file with one ID per line. The metadata of a dataset state this file name together with the contributors endpoint (at the moment: `https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors`) used to retrieve the actual names. To do so, the created contributors file needs to be submitted as a POST request.
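
A minimal sketch of such a request with the `requests` package; the file name is made up, and that the endpoint accepts the raw ID list as POST body is an assumption of this example:

``` python
import requests

endpoint = "https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors"

# The contributors file stores one timeseries ID per line;
# "mydataset.contributors" is a made-up example name.
with open("mydataset.contributors") as contributors_file:
    timeseries_ids = contributors_file.read()

# Assumption: the endpoint accepts the raw ID list in the POST body.
response = requests.post(endpoint, data=timeseries_ids)
response.raise_for_status()
print(response.text)  # names of the contributing projects, organizations and persons
```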
......
%% Cell type:markdown id: tags:
# Example with optional parameters
Toargridding has a number of required arguments for a dataset. Those include the time range, variable and statistical analysis. The TOAR-DB has a large number of metadata fields that can be used to further refine such a request.
A python dictionary can be provided to include these other fields. The analysis service provides an error message if a requested parameter does not exist (check for typos) or if the provided value is invalid.
In this example we want to obtain data from 2012.
The first block contains the imports and the setup of the logging.
%% Cell type:markdown id: tags:
#### Inclusion of packages
%% Cell type:code id: tags:
``` python
import logging
from datetime import datetime as dt
from collections import namedtuple
from pathlib import Path
from toargridding.toar_rest_client import AnalysisServiceDownload, Connection
from toargridding.grids import RegularGrid
from toargridding.gridding import get_gridded_toar_data
from toargridding.metadata import TimeSample
```
%% Cell type:markdown id: tags:
#### Setup of logging
In the next step we set up the logging, i.e. the level of information that is displayed as output.
We start with a default setup and select the debug level for the shell output, which includes informational messages as well as more critical output like warnings and errors.
We also add logging to a file. This will create a new log file at midnight and keep up to 7 log files.
%% Cell type:code id: tags:
``` python
from toargridding.defaultLogging import toargridding_defaultLogging
logger = toargridding_defaultLogging()
#logger.addShellLogger(logging.INFO)# alternative: show only infos, warnings and errors in the shell
logger.addShellLogger(logging.DEBUG)# select the level of detail for the shell output
logger.logExceptions()
log_path = Path("log")
log_path.mkdir(exist_ok=True)
logger.addRotatingLogFile( log_path / "produce_data_station_metadata.log")#we need to explicitly set a logfile
```
%% Cell type:markdown id: tags:
#### Setting up the analysis
We need to prepare our connection to the analysis service of the TOAR database, which will provide us with temporally and statistically aggregated data.
Besides the URL of the service, we also need to set up two directories on our computer:
- one to save the data provided by the analysis service (called cache)
- a second to store our gridded datasets (called results)
Those will be created as the directories examples/cache and examples/results.
%% Cell type:code id: tags:
``` python
stats_endpoint = "https://toar-data.fz-juelich.de/api/v2/analysis/statistics/"
cache_basepath = Path("cache")
result_basepath = Path("results")
cache_basepath.mkdir(exist_ok=True)
result_basepath.mkdir(exist_ok=True)
analysis_service = AnalysisServiceDownload(stats_endpoint=stats_endpoint, cache_dir=cache_basepath, sample_dir=result_basepath, use_downloaded=True)
```
%% Cell type:markdown id: tags:
Our following request will take some time, so we adjust the interval between two checks whether our data are ready for download, as well as the maximum duration of the checking.
We will check every 15 minutes for up to 12 hours.
%% Cell type:code id: tags:
``` python
analysis_service.connection.set_request_times(interval_min=15, max_wait_minutes=12*60)
```
%% Cell type:markdown id: tags:
#### Preparation of requests with station metadata
We restrict our request to one year of daily aggregated ozone data. In addition, we would like to only include rural stations.
We use a container class to keep the configurations together (type: namedtuple).
We also want to refine our station selection by using further metadata.
Therefore, we create the `station_metadata` dictionary. We can use the further metadata stored in the TOAR-DB by providing their name and our desired value. This also discards stations without a provided value for a metadata field. Information on the different metadata values can be found in the [documentation](https://toar-data.fz-juelich.de/sphinx/TOAR_UG_Vol03_Database/build/latex/toardatabase--userguide.pdf), for example on the *toar1_category* on page 18 and on the *type_of_area* on page 20.
In this way we can filter for all additional metadata which are supported by the [statistics endpoint of the analysis service](https://toar-data.fz-juelich.de/api/v2/analysis/#statistics), namely station metadata and timeseries metadata.
In the end, we have the request that we want to submit.
%% Cell type:code id: tags:
``` python
Config = namedtuple("Config", ["grid", "time", "variables", "stats", "station_metadata", "average_timeseries_at_station"])
#uncomment, if you want to change the metadata:
station_metadata = {
    #"toar1_category" : "Urban", #uncomment if wished:-)
    "type_of_area" : "Rural" #also test Suburban or Urban
}
grid = RegularGrid( lat_resolution=1.9, lon_resolution=2.5, )
configs = dict()
request_config = Config(
    grid,
    TimeSample( start=dt(2012,1,1), end=dt(2012,12,31), sampling="daily"),
    ["mole_fraction_of_ozone_in_air"],
    #[ "mean" ],
    [ "dma8epa_strict" ],
    station_metadata,
    average_timeseries_at_station=True
)
configs["test_ta"] = request_config
```
%% Cell type:markdown id: tags:
#### Execution of toargridding and saving of results
Now we want to request the data from the TOAR analysis service and create the gridded dataset.
Therefore, we call the function `get_gridded_toar_data` with everything we have prepared until now.
The request will be submitted to the analysis service, which will process it. On our side, we will check in intervals whether the processing is finished. After the maximum waiting time, we will stop checking; the setup for this can be found a few cells above.
Rerunning this cell allows continuing the look-up whether the data are available.
The obtained data are stored in the result directory (`result_basepath`). Before submitting a request, toargridding checks its cache whether the data have already been downloaded.
Last but not least, we want to save our dataset as a netCDF file.
In the global metadata of this file we can find a recipe on how to obtain the list of contributors from the contributors file created by `get_gridded_toar_data`. This function also creates the required file with the extension "*.contributors".
%% Cell type:code id: tags:
``` python
for config_id, config in configs.items():
    print(f"\nProcessing {config_id}:")
    print("--------------------")
    datasets, metadatas = get_gridded_toar_data(
        analysis_service=analysis_service,
        grid=config.grid,
        time=config.time,
        variables=config.variables,
        stats=config.stats,
        contributors_path=result_basepath,
        average_TS_at_station=config.average_timeseries_at_station,
        **config.station_metadata
    )
    for dataset, metadata in zip(datasets, metadatas):
        dataset.to_netcdf(result_basepath / f"{metadata.get_id()}_{config.grid.get_id()}.nc")
```
......
VERSION = "0.4.3"
VERSION = "0.4.4"
......@@ -45,6 +45,8 @@ def get_gridded_toar_data(
contributors_path:
    path for writing the contributors file. We advise to store the contributor files in the same directory as the resulting data.
    Without a provided path, it is assumed that toargridding is operated as a service and the contributors can be directly provided through the contributors endpoint. This is not yet implemented.
average_TS_at_station:
    enable the averaging of all timeseries at a station before the gridding. This gives each station the same statistical weight. Be careful: depending on your statistical aggregation, this can introduce or remove systematic errors.
kwargs:
    - history: allows a replacement of the history field in the metadata of the resulting datasets
    - all remaining kwargs are passed as filters to the request. This allows a refinement of the request.
......