diff --git a/README.md b/README.md
index 031c94431befcd8efca36423e465efc7c1034615..dbd1511a8aecc86f89d823201b7751bed318d56c 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ The request to the database also allows a statistical analysis of the requested
 The mean and standard deviation of all stations within a cell are computed.
 The tool handles the request to the database over the REST API and the subsequent processing.
 
-The results of the gridding are provided as xarray objects for subsequent processing by the user.
+The results of the gridding are provided as [xarray datasets](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html) for subsequent processing and visualization by the user.
 
 This project is in beta with the intended basic functionalities.
 The documentation is work in progress.
@@ -17,48 +17,52 @@ The documentation is work in progress.
 
 TBD, see pyproject.toml
 
+
 # Installation
 
-Move to the folder you want to create download this project to.
-We now need to download the source code (https://gitlab.jsc.fz-juelich.de/esde/toar-public/toargridding/-/tree/dev?ref_type=heads). Either as ZIP folder or via git:
+Move to the folder you want to download this project to.
+We now need to download the source code from the [repository](https://gitlab.jsc.fz-juelich.de/esde/toar-public/toargridding/-/tree/dev?ref_type=heads), either as a ZIP file or via git:
 
 ## Download with GIT
 Clone the project from its git repository:
-```
+```bash
 git clone https://gitlab.jsc.fz-juelich.de/esde/toar-public/toargridding.git
 ```
 With git we need to checkout the development branch (dev). Therefore we need to change to the project directory first:
-```
+```bash
 cd toargridding
 git checkout dev
 ```
 
-## Installing Dependencies and Setting up Virtual Enviorment
+## Installing Dependencies and Setting up a Virtual Environment
 
-The handling of required packages is done with poetry (https://python-poetry.org/).
-After installing poetry, you can simply install all required dependencies for this project by runing poetry in the project directory:
-```
+The handling of required packages is done with [poetry](https://python-poetry.org/).
+After installing poetry, you can simply install all required dependencies for this project by running poetry in the project directory:
+```bash
 poetry install
 ```
-This also creates a virtual enviorment, which ensures that different projects do not interfere with their dependencies.
-To run a jupyter notebook in the virtual enviorment execute
-```
+This also creates a virtual environment, which ensures that the dependencies of different projects do not interfere.
+To run a jupyter notebook in the virtual environment execute
+```bash
+# for selecting a notebook via the file browser in your web browser:
 poetry run jupyter notebook
+# or for directly opening a notebook:
+poetry run jupyter notebook [/path/to/notebook.ipynb]
 ```
 and to run a script use
-```
+```bash
 poetry run python [/path/to/scriptname.py]
 ```
 
 # How does this tool work?
 
-This tool has two main parts. The first handles requests to the TOAR database and the analysis of the data.
+This tool has two main parts. The first handles requests to the TOAR database via its analysis service.
 This includes the statistical analysis of the requested timeseries.
 The second part is the gridding, which is performed offline.
 
 ## Request to TOAR Database with Statistical Analysis
 
-Requests are send to the analysis service of the TOAR database. This allows a selection of different stations base on their metadata and performing a statistical analysis. 
-Whenever a request is submitted, it will be processed. The returned status endpoint will point ot the results as soon as the process is finished.
+Requests are sent to the analysis service of the TOAR database. This allows selecting stations based on their metadata and performing a statistical analysis.
+Whenever a request is submitted, it will be processed. The returned status endpoint will point to the results as soon as the analysis is finished.
 A request can take several hours, depending on time range and the number of requested stations.
 At the moment, there is no possibility implemented to check the status of a running job until it is finished (Date: 2024-05-14).
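+
+A request is defined by the variable, the sampled time range and the requested statistic; an optional dictionary can narrow down the station selection via metadata. As a minimal sketch (the import paths are an assumption for this sketch, please check the example notebooks for the exact imports):
+```python
+from datetime import datetime
+
+# import paths are an assumption for this sketch; see tests/get_sample_data.ipynb
+from toargridding.metadata import Metadata, TimeSample
+
+time = TimeSample(start=datetime(2010, 1, 1), end=datetime(2011, 1, 1), sampling="monthly")
+# variable name, time sampling and statistic; a dictionary with additional
+# metadata requirements can be passed as an optional fourth argument
+metadata = Metadata.construct("mole_fraction_of_ozone_in_air", time, "mean")
+```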
@@ -71,70 +75,78 @@ The TOAR database has only a limited number of workers for performing a statisti
 
 ## Gridding
 
 The gridding uses a user defined grid to combine all stations in a cell.
-Per cell mean, standard deviation and the number of stations are reported.
+Per cell, the mean, the standard deviation and the number of stations are reported in the resulting xarray dataset.
 
 # Example
 
-There are at the moment three example provided as jupyter notebooks (https://jupyter.org/).
+There are at the moment five examples provided as jupyter notebooks (https://jupyter.org/).
+Jupyter uses your web browser to display the code blocks and results. Here, the examples are provided in python.
+As an alternative, Visual Studio Code directly supports the execution of jupyter notebooks.
+For VS Code, please ensure that the kernel of the virtual environment is selected ([see the VS Code documentation](https://code.visualstudio.com/docs/datascience/jupyter-notebooks)).
 
-Running them with the python environment produced by poetry can be done by
-```
+Running the provided examples with the python environment created by poetry can be done by
+```bash
 poetry run jupyter notebook
 ```
+as pointed out previously.
 
 ## High level function
 
 ```
-tests/produce_data_withOptional.ipynb
+tests/produce_data_manyStations.ipynb
+#(please see the next notebook for a faster example)
 ```
-Provides an example on how to download data, apply gridding and save the results as netCDF files.
+This notebook provides an example of how to download data, apply the gridding and save the results as [netCDF files](https://docs.xarray.dev/en/stable/user-guide/io.html).
 The AnalysisServiceDownload caches already obtained data on the local machine.
-This allows different griddings without the necessity to repeat the request to the TOARDB and subsequent download.
+This allows different griddings without having to repeat the request to the TOAR database, the statistical analysis and the subsequent download.
 
-In total two requests are executed by requesting different different statistical quantities (mean & dma8epax).
-The example uses a dictionary to pass additional arguments to the request to the TAOR database (here: station category from TOAR 1).
-A detailed list can be found at https://toar-data.fz-juelich.de/api/v2/#stationmeta
+As an example, we calculated dma8epa_strict on a daily basis for the years 2000 to 2018 for all timeseries in the TOAR database.
+The first attempt for this example covered the full range of 19 years in a single request. It turned out that extracting the data year by year is more reliable.
+The subsequent requests function as a progress report and allow working with the data while further requests are processed.
+As the gridding is done offline, it is executed for the already downloaded files whenever the notebook is rerun. Please note that the file name also contains the day of creation.
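+
+Condensed to its core, the notebook performs the following steps (a sketch only; the import paths and the endpoint are assumptions, please refer to the notebook for the exact code):
+```python
+from datetime import datetime as dt
+from pathlib import Path
+
+# import paths are assumptions for this sketch; see the notebook imports
+from toargridding.grids import RegularGrid
+from toargridding.metadata import TimeSample
+from toargridding.toar_rest_client import AnalysisServiceDownload
+from toargridding.gridding import get_gridded_toar_data
+
+cache_basepath = Path("cache")
+result_basepath = Path("results")
+stats_endpoint = "..."  # URL of the analysis service, see the notebook
+
+# already downloaded results are reused instead of being requested again
+analysis_service = AnalysisServiceDownload(stats_endpoint=stats_endpoint,
+    cache_dir=cache_basepath, sample_dir=result_basepath, use_downloaded=True)
+grid = RegularGrid(lat_resolution=1.9, lon_resolution=2.5)
+
+# one request per year; the statistical analysis runs on the TOAR servers
+datasets, metadatas = get_gridded_toar_data(
+    analysis_service=analysis_service,
+    grid=grid,
+    time=TimeSample(start=dt(2000, 1, 1), end=dt(2000, 12, 31), sampling="daily"),
+    variables=["mole_fraction_of_ozone_in_air"],
+    stats=["dma8epa_strict"],
+)
+for dataset, metadata in zip(datasets, metadatas):
+    dataset.to_netcdf(result_basepath / f"{metadata.get_id()}_{grid.get_id()}.nc")
+```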
+
+```bash
+poetry run jupyter notebook tests/produce_data_withOptional.ipynb
 ```
-tests/produce_data_manyStations.ipynb
-```
-Uses a similar request, but without the restriction to the station type. Therefore, a much larger number of stations is requested (about 1000 compared to a few hundred, that have a "toar1_category" classification used in the previous example).
-Therefore, this example is restricted to the calculation of "dma8epax".
+This example is based on the previous one but uses additional arguments to reduce the number of stations per request. Two classifications of the stations are used as examples: first the toar1_category and second the type_of_area.
+Details can be found in the [documentation of the FastAPI REST interface](https://toar-data.fz-juelich.de/api/v2/#stationmeta) or in the [user guide](https://toar-data.fz-juelich.de/sphinx/TOAR_UG_Vol03_Database/build/latex/toardatabase--userguide.pdf).
+
+Selecting only a limited number of stations leads to significantly faster results. On the downside, the used classifications are not available for all stations.
 
 ## Retrieving data
 
+```bash
+poetry run jupyter notebook tests/get_sample_data_manual.ipynb
 ```
-tests/get_sample_data.ipynb
-```
-Downloads data from the TOAR database. The extracted data are written to disc. No further processing or gridding is done.
-The result is a ZIP-file containing two CSV files. The first one contains the time series and the second one the coordinates of the stations.
+Downloads data from the TOAR database by manually creating the request.
+The extracted data are written to disc. No further processing or gridding is done.
+The result is a ZIP file containing two CSV files. The first one contains the statistical analysis of the timeseries and the second one the coordinates of the stations.
 
-## Retrieving data
+## Retrieving data with this project's interface
 
+```bash
+poetry run jupyter notebook tests/get_sample_data.ipynb
 ```
-tests/get_sample_data_manual.ipynb
-```
-Downloads data from the TOAR database with a manual creation of the request to the TOAR database.
-As an example for addition parameters, the "toar1_category" is passed to the metadata object.
-This example does not perform any gridding.
-The result is a ZIP-file containing two CSV files. The first one contains the time series and the second one the coordinates of the stations.
+As a comparison to the previous example, this one performs the same request using the interface of this project.
 
 ## Retrieving data and visualization
 
-```
-tests/quality_controll.ipynb
+```bash
+poetry run jupyter notebook tests/quality_controll.ipynb
 ```
 Notebook for downloading and visualization of data.
 The data are downloaded and reused for subsequent executions of this notebook.
 The gridding is done on the downloaded data. Gridded data are not saved to disc.
 
-## Benchmarks Requests to TOAR Database
+# Benchmarks
 
-```
-tests/benchmark.py
+## Duration of Different Requests
+
+```bash
+poetry run python tests/benchmark.py
 ```
 This script requests datasets with different durations (days to month) from the TOAR Database and saves them to disc.
 It reports the duration for the different requests.
 There is no gridding involved.
 CAVE: This script can run several hours.
 
-
 # Supported Grids
 
 The first supported grid is a regular grid with longitude and latitude.
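+
+Such a grid can be created with a few lines (the import path is an assumption for this sketch, see the example notebooks for the exact imports):
+```python
+# import path is an assumption for this sketch; see the example notebooks
+from toargridding.grids import RegularGrid
+
+# regular longitude/latitude grid; resolutions are given in degrees
+my_grid = RegularGrid(lat_resolution=1.9, lon_resolution=2.5)
+```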
@@ -149,12 +161,12 @@ At the moment time differences larger than one day are working, i.e. start and e
 
 # Documentation of Source Code:
 
-At the moment Carsten Hinz is working on a documentation of the source code, while getting familiar with it.
-The aim is a brief overview on the functionalities and the arguments. As he personally does not like repetitions,
+At the moment Carsten Hinz is working on the documentation of the source code while getting familiar with this project.
+The aim is a brief overview of the functionalities and the arguments of individual functions. As he personally does not like repetitions,
 the documentations might not match other style guides. It will definitely be possible to extend the documentation:-)
 
-```
+```python
 class example:
     """An example class
 
@@ -189,7 +201,7 @@ class example:
 ```
 
-```
+```python
 @dataclass
 class dataClass:
     """Brief description
 
@@ -210,3 +222,6 @@
     secStr : str = "Default value"
 ```
 
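+# Working with the Results
+
+The gridded results are plain netCDF files and can be inspected with xarray. A minimal sketch (the file name is a placeholder; the example notebooks derive it from the metadata and grid ids):
+```python
+import xarray as xr
+
+# placeholder name; the produce_data notebooks write <metadata_id>_<grid_id>.nc
+dataset = xr.open_dataset("results/<metadata_id>_<grid_id>.nc")
+print(dataset)  # per cell: mean, standard deviation and number of stations
+```
+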
\"time\", \"variables\", \"stats\"])\n", "\n", - "valid_data = Config(\n", - " RegularGrid( lat_resolution=1.9, lon_resolution=2.5, ),\n", - " TimeSample( start=dt(2000,1,1), end=dt(2019,12,31), sampling=\"daily\"),#possibly adopt range:-)\n", - " [\"mole_fraction_of_ozone_in_air\"],#variable name\n", - " [ \"dma8epax\" ]# change to dma8epa_strict\n", - ")\n", + "grid = RegularGrid( lat_resolution=1.9, lon_resolution=2.5, )\n", "\n", - "configs = {\n", - " \"test_ta\" : valid_data\n", - "}\n", + "configs = dict()\n", + "for year in range (0,19):\n", + " valid_data = Config(\n", + " grid,\n", + " TimeSample( start=dt(2000+year,1,1), end=dt(2000+year,12,31), sampling=\"daily\"),#possibly adopt range:-)\n", + " [\"mole_fraction_of_ozone_in_air\"],#variable name\n", + " [ \"dma8epax\" ]# change to dma8epa_strict\n", + " )\n", + " \n", + " configs[f\"test_ta{year}\"] = valid_data\n", "\n", "#testing access:\n", "#config = configs[\"test_ta\"]\n", @@ -61,8 +63,12 @@ "analysis_service = AnalysisServiceDownload(stats_endpoint=stats_endpoint, cache_dir=cache_basepath, sample_dir=result_basepath, use_downloaded=True)\n", "\n", "Connection.DEBUG=True\n", + "minutes = 5\n", + "analysis_service.connection.wait_seconds = [minutes * 60 for i in range(5,61,minutes) ]\n", "\n", "for person, config in configs.items():\n", + " print(f\"\\nProcessing {person}:\")\n", + " print(f\"--------------------\")\n", " datasets, metadatas = get_gridded_toar_data(\n", " analysis_service=analysis_service,\n", " grid=config.grid,\n", @@ -72,7 +78,7 @@ " )\n", "\n", " for dataset, metadata in zip(datasets, metadatas):\n", - " dataset.to_netcdf(result_basepath / f\"{metadata.get_id()}.nc\")\n", + " dataset.to_netcdf(result_basepath / f\"{metadata.get_id()}_{config.grid.get_id()}.nc\")\n", " print(metadata.get_id())" ] } diff --git a/tests/produce_data_withOptional.ipynb b/tests/produce_data_withOptional.ipynb index 228a577e5a9cbfd6362c8583b6f2d8a2d8d32f2a..c7c7b99a8edfcb0d7b7a1cb035eaa43d4ebe89f0 100644 --- a/tests/produce_data_withOptional.ipynb +++ b/tests/produce_data_withOptional.ipynb @@ -40,22 +40,24 @@ " #\"type_of_area\" : \"Suburban\" #also test Rural, Suburban,\n", "}\n", "\n", - "valid_data = Config(\n", - " RegularGrid( lat_resolution=1.9, lon_resolution=2.5, ),\n", - " TimeSample( start=dt(2000,1,1), end=dt(2019,12,31), sampling=\"daily\"),#possibly adopt range:-)\n", - " [\"mole_fraction_of_ozone_in_air\"],#variable name\n", - " #[ \"mean\", \"dma8epax\"],# will start one request after another other...\n", - " [ \"dma8epa_strict\" ],\n", - " details4Query\n", - ")\n", + "grid = RegularGrid( lat_resolution=1.9, lon_resolution=2.5, )\n", "\n", - "configs = {\n", - " \"test_ta\" : valid_data\n", - "}\n", + "configs = dict()\n", + "for year in range (2,19):\n", + " valid_data = Config(\n", + " grid,\n", + " TimeSample( start=dt(2000+year,1,1), end=dt(2000+year,12,31), sampling=\"daily\"),#possibly adopt range:-)\n", + " [\"mole_fraction_of_ozone_in_air\"],#variable name\n", + " #[ \"mean\", \"dma8epax\"],# will start one request after another other...\n", + " [ \"dma8epa_strict\" ],\n", + " details4Query\n", + " )\n", + " \n", + " configs[f\"test_ta{year}\"] = valid_data\n", "\n", "#testing access:\n", - "config = configs[\"test_ta\"]\n", - "config.grid" + "#config = configs[\"test_ta2\"]\n", + "#config.grid" ] }, { @@ -64,7 +66,7 @@ "metadata": {}, "outputs": [], "source": [ - "#CAVE: this cell runs about 30minutes per requested year\n", + "#CAVE: this cell runs about 45minutes per requested year. 
+    "analysis_service.connection.wait_seconds = [minutes * 60 for i in range(5,61,minutes) ]\n",
     "\n",
     "for person, config in configs.items():\n",
+    "    print(f\"\\nProcessing {person}:\")\n",
+    "    print(\"--------------------\")\n",
     "    datasets, metadatas = get_gridded_toar_data(\n",
     "        analysis_service=analysis_service,\n",
     "        grid=config.grid,\n",
@@ -89,7 +95,7 @@
     "    )\n",
     "\n",
     "    for dataset, metadata in zip(datasets, metadatas):\n",
-    "        dataset.to_netcdf(result_basepath / f\"{metadata.get_id()}.nc\")\n",
+    "        dataset.to_netcdf(result_basepath / f\"{metadata.get_id()}_{config.grid.get_id()}.nc\")\n",
     "    print(metadata.get_id())"
    ]
   }
diff --git a/tests/quality_controll.ipynb b/tests/quality_controll.ipynb
index 744629b6c82b765395457ecc516619fb65765e70..b3ad13bcb96baf064ec9e4699977d89057db4c68 100644
--- a/tests/quality_controll.ipynb
+++ b/tests/quality_controll.ipynb
@@ -34,7 +34,7 @@
     "#maybe adopt the toargridding_base_path for your machine.\n",
     "toargridding_base_path = Path(\".\")\n",
     "cache_dir = toargridding_base_path / \"cache\"\n",
-    "data_download_dir = toargridding_base_path / \"data\"\n",
+    "data_download_dir = toargridding_base_path / \"results\"\n",
     "\n",
     "analysis_service = AnalysisServiceDownload(endpoint, cache_dir, data_download_dir)\n",
     "my_grid = RegularGrid(1.9, 2.5)\n",
@@ -204,7 +204,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.11.5"
+   "version": "3.11.7"
   }
  },
 "nbformat": 4,