    TOAR Gridding Tool

World map with mean ozone concentration

Visualization of the mean ozone concentration per grid cell on 3 January 2016.

    About

    The TOARgridding tool projects data from the TOAR-II database (https://toar-data.fz-juelich.de/) onto a grid. The user can select the

    • variable,
    • statistical aggregation,
    • time period,
    • rectangular lat-lon grid of custom resolution,
    • and (optional) filtering according to the station metadata.

This tool handles the requests to the database via the REST API and the subsequent processing. The results of the gridding are provided as xarray datasets for subsequent processing and visualization by the user. While this project provides ready-to-use examples, it is intended as a library to be used in dedicated analysis scripts. Furthermore, the long-term goal is to provide the gridding as a service over a RESTful API.

This project is in beta and provides the intended basic functionality.

    Requirements

    This project requires python 3.10 or higher. For more information see pyproject.toml.

This package relies on netCDF and HDF5 for saving data. You might need to install those as dependencies on your operating system.

The examples in this package rely on Jupyter notebooks; Jupyter is installed automatically. The visualization on a map in one example relies on cartopy, which has some additional dependencies, including system libraries.

    Installation

    1) Install from PyPI

We intend to provide this package on PyPI. This documentation is preliminary, as it was written as part of the preparations before the actual release. We intend to allow an installation with

    pip install toargridding

    or

    python -m pip install toargridding

We suggest using the TOAR Gridding tool within a virtual environment when creating your own data analysis scripts. Therefore, create a directory and navigate to it

    # on linux:
    mkdir /my/path/data/analysis
    cd  /my/path/data/analysis

    and create a virtual environment

    python -m venv .venv
    source .venv/bin/activate
    pip install toargridding

    2) Install from Source

Move to the folder you want to download this project to. We now need to download the source code from the repository, either as a ZIP file or via git:

    2.1) Download with GIT

    Clone the project from its git repository:

    git clone https://gitlab.jsc.fz-juelich.de/esde/toar-public/toargridding.git 

With git we need to check out the testing branch (testing), as the main branch is not yet populated. Therefore, change to the project directory first:

cd toargridding
git checkout testing

    2.2) Installing Dependencies and Setting up Virtual Environment

Set up a virtual environment for your code to avoid conflicts with other projects. Here, you can use your preferred tool or run:

    python -m venv .venv
    source .venv/bin/activate

The latter line activates the virtual environment for further usage. To deactivate your environment, call

    deactivate

To install all required dependencies, call (with the virtual environment activated):

    pip install -e .

To be able to execute the examples, which are provided as Jupyter notebooks, we need to install additional packages by calling

    pip install -e ."[interactive]"

    To run the example notebooks:

# to select a notebook via the file browser in your web browser:
jupyter notebook
# or to open a notebook directly:
jupyter notebook [/path/to/notebookname.ipynb]

    and to run a script use

    python [/path/to/scriptname.py]

    3) Known issues on Microsoft Windows

On a Windows system the installation with pip might fail if the netCDF and HDF5 libraries are not yet installed. Switching from pip to conda can resolve this issue.

    Examples

The git repository of this package includes a number of examples as Jupyter notebooks (https://jupyter.org/). Jupyter uses your web browser to display results and the code blocks. As an alternative, Visual Studio Code directly supports the execution of Jupyter notebooks. For VS Code, please make sure to select the kernel of the virtual environment.

    After activating the virtual environment the notebooks can be run by calling

     jupyter notebook

    as pointed out previously.

    00: Retrieving data and visualization

     jupyter notebook examples/00_download_and_visualization.ipynb

The aim of this first example is the creation of a gridded dataset and the visualization of one time point. For the visualization, cartopy is required, which might need dependencies that are not installed by pip. If you experience any issues, do not hesitate to continue with the next examples. If you are still curious about the results, we have uploaded the resulting map as the title image of this README.

Caveat: For a specific version and case we observed that the notebook runs into an exception from matplotlib.pyplot.pcolormesh. Please be aware that the error message complains about the usage of 'flat' instead of 'nearest' shading; here, the value changes at some point after calling the plot function. We did not investigate this further. There is also an export called examples/00_download_and_vis_export.py, which runs without an issue.

    01: Retrieval of one week:

    jupyter notebook examples/01_produce_data_one_week.ipynb

As a first example we want to download data for ozone covering one week. We calculate the daily mean for each station before combining the results onto a grid with a latitudinal resolution of about 1.9° and a longitudinal resolution of 2.5°. The results are saved as netCDF files.

As the gridding is done offline, it will be executed for already downloaded files whenever the notebook is rerun. Please note that the file name of the gridded data also contains the date of creation; therefore you might end up with several copies.

    02: Retrieval of several years:

     jupyter notebook examples/02_produce_data_manyStations.ipynb

This notebook provides an example of how to download data, apply the gridding and save the results as netCDF files. The AnalysisServiceDownload caches already obtained data on the local machine. This allows applying different grids without the need to repeat the request to the TOAR database, the statistical analysis and the subsequent download.

As an example we calculated dma8epa_strict on a daily basis for the years 2000 to 2001 for all timeseries in the TOAR database. The first attempt for this example covered the full range of 19 years in a single request. It turned out that extracting the data year by year is more reliable. The subsequent requests also function as a progress report and allow working with the data while further requests are processed.

    03: Retrieval of several years with selection of specific stations:

     jupyter notebook examples/03_produce_data_withOptional.ipynb 

This example is based on the previous one but uses additional arguments to refine the selection of stations. As an example, different classifications of the stations are used: first the "toar1_category" and second the "type_of_area". Details can be found in the documentation of the FastAPI REST interface or in the user guide.

    To speed up the execution we also restrict this example to one year.

Selecting only a limited number of stations leads to significantly faster results. On the downside, the classifications used are not available for all stations.

    How does this tool work?

    This tool has two main parts. The first part handles requests to the TOAR database via its analysis service. This includes the statistical analysis of the requested timeseries. The second part is the gridding, which is performed offline.

    Request to TOAR Database with Statistical Analysis

Requests are sent to the analysis service of the TOAR database. This allows selecting stations based on their metadata and performing a statistical analysis. Whenever a request is submitted, it will be processed. The returned status endpoint will point to the results as soon as the analysis is finished. A request can take several hours, depending on the time range and the number of requested stations. This module stores the requests and their status endpoints in a local cache file. These endpoints are used to check whether the processing by the analysis service is finished. Requests are deleted from the cache after 14 days. You can adjust this by using Cache.set_max_days_in_cache([max age in days]). At the moment there is no possibility to check the status of a running job until it is finished (date: 2024-05-14). It seems that crashed requests respond with an internal server error (HTTP status code 500); therefore, those requests are automatically deleted from the cache and resubmitted.
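For example, to keep cached requests for 30 days instead of the default 14 days (a minimal sketch; the import path within toargridding is an assumption, only the call Cache.set_max_days_in_cache is taken from the description above):

# minimal sketch: extend the cache lifetime from 14 to 30 days
# NOTE: the import path is an assumption; only Cache.set_max_days_in_cache
# is taken from the description above
from toargridding.toar_rest_client import Cache  # hypothetical module path

Cache.set_max_days_in_cache(30)  # requests older than 30 days are dropped from the cache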

Once a request is finished, its status endpoint will not stay valid forever; however, the data are stored for a longer time in a cache by the analysis service. When the same request is submitted again, this cache is checked first to see whether the results have already been calculated. The retrieval of the results from the cache can take some time, similar to the analysis itself.

There is no check whether a request is already running. Therefore, submitting a request multiple times leads to additional load on the system and slows down all requests. The TOAR database has only a limited number of workers for performing the statistical analysis. It is therefore advised to run one request after another, especially for large requests covering a large number of stations and/or a long time range.

    A brief reminder on timeseries and stations

The TOAR database uses timeseries, which are associated with a station. At an individual station, one or more physical sensors are mounted. These can measure different variables or, in some cases, the same variable with different techniques. A station can also be part of different networks that contribute data to the TOAR database. A more detailed description of the included data can be found in Chapter Three: The TOAR data processing workflow.

In the case of gridding, this can lead to systematic errors. For example, the statistical weight of a station is increased if its data are contributed twice.

    Merged Timeseries

The TOAR data infrastructure introduced a feature called timeseries_merged as an alternative to the classical extraction of individual timeseries. The implementation in the analysis service is part of Issue 24 and is still ongoing. The merging follows a logic described in more detail in the official documentation (TODO: add link).

    TOARGridding can use this feature by setting data_aggregation_mode="mergedTS".

    Gridding

The gridding uses a user-defined grid to combine all stations within a cell. Per cell, the mean, the standard deviation and the number of stations are reported in the resulting xarray dataset.
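Since the result is a plain xarray dataset (saved as a netCDF file in the examples), it can be inspected with standard xarray calls. A minimal sketch; the file name and the variable names are placeholders and not guaranteed by this package:

# minimal sketch for inspecting a gridded result with standard xarray calls;
# the file name and the variable name "mean" are placeholders
import xarray as xr

ds = xr.open_dataset("results/gridded_example.nc")  # placeholder path
print(ds)            # overview of variables, coordinates and global attributes
print(ds.data_vars)  # per-cell statistics, e.g. mean, standard deviation, number of stations

# select a single time step of the per-cell mean (replace "mean" by the actual variable name)
single_day = ds["mean"].sel(time="2016-01-03")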

    Station averaging

The timeseries at each station can be averaged before the gridding is done. This results in the same statistical weight for each station. Depending on the calculated statistical aggregates, this can introduce or remove systematic errors in the data analysis. This option is an alternative to the merged timeseries and is selected by setting data_aggregation_mode="meanTSByStation"; thus, the two operations cannot be combined.

    Contributors

The contributors include all projects, organizations and persons that are associated with any timeseries of a gridded dataset in the roles "contributor" and "originator". In offline mode, this information is preserved by saving the timeseries IDs in a dedicated file with one ID per line. In the metadata of a dataset, this file name is stated together with the contributors endpoint (at the moment: https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors) used to retrieve the actual names. For this, the created contributors file needs to be submitted in a POST request.

    This can for example be done with curl

    curl -X POST -F "file=@my.file_name.contributors" "https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors"
    #this is equivalent to:
    curl -X POST -F "file=@my.file_name.contributors" "https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors?format=text"

Please be aware that not all versions of curl support "," in file names.

    The default output format is a ready-to-use list of all programs, organizations and persons that contributed to this dataset. The list is alphabetically sorted within each of the three categories. The provided organizations include the affiliations of all individual persons, as stored in the TOAR database.

The second option is JSON (append ?format=json to the request URL), which provides the full information on all roles associated with the provided timeseries IDs. These data can then be processed to fit your needs.
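The same request can also be sent directly from Python, e.g. with the requests package (a sketch equivalent to the curl calls above; the file name is a placeholder):

# sketch equivalent to the curl calls above; the file name is a placeholder
import requests

url = "https://toar-data.fz-juelich.de/api/v2/timeseries/request_contributors"
with open("my.file_name.contributors", "rb") as contributors_file:
    # the default output format is the ready-to-use text list; use
    # params={"format": "json"} to obtain the full role information instead
    response = requests.post(url, files={"file": contributors_file})
response.raise_for_status()
print(response.text)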

In case you are combining several gridded datasets into a single dataset for your study, there is a tool that creates a combined contributors file.

    python tools/combine_contributor_files.py --output outputFileName.contributors /path/to/first.contributors /path/to/second.contributors /and/so/on.contributors

    This will combine any number of provided contributor files into a single file called outputFileName.contributors.

    Logging

Output created by the different modules and classes of this package uses the Python logging module. There is also an auxiliary class to reuse the same logger setup for the examples and scripts provided by this package. It can also be used for custom scripts using this library and configures logging to the shell as well as to the system log of a Linux system.
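If you prefer to set up logging yourself instead of using the auxiliary class, the loggers of this package can be configured with the standard logging module. A generic sketch (this is not the package's own helper; the logger name "toargridding" is an assumption):

# generic sketch using only the standard library; this is not the package's own
# helper class, and the logger name "toargridding" is an assumption
import logging
from logging.handlers import SysLogHandler

logger = logging.getLogger("toargridding")
logger.setLevel(logging.INFO)

# log to the shell
stream_handler = logging.StreamHandler()
stream_handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
logger.addHandler(stream_handler)

# optionally also log to the system log of a Linux system
logger.addHandler(SysLogHandler(address="/dev/log"))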

    Authentication and Authorization

TOAR Gridding requires the statistics endpoint of an analysis service for the TOAR database. These endpoints can require authentication and authorization. The default approach is the Helmholtz ID, which can be used by many researchers. The access requires a registration and the assignment of the required rights, which is documented here (authentication is not yet required; there is now documentation on how to obtain an account).

The access to the services is granted with access tokens. Local tokens can, for example, be created with the OpenID Connect agent suggested in the documentation of the Helmholtz ID. Please ensure the security of your tokens. We do not store tokens on disk.

    Supported Grids

The first supported grid is a regular grid with longitude and latitude covering the whole world.

    Supported Variables

This module supports all variables of the TOAR database (extraction: 2024-05-27). They can be identified by their cf_standardname or by their name as stored in the TOAR database. The second option is shorter; moreover, not all variables in the database have a cf_standardname. The up-to-date list of all available variables with their name and cf_standardname can be accessed by querying the TOAR database, e.g. with https://toar-data.fz-juelich.de/api/v2/variables/?limit=None. The configuration of toargridding can be updated by running the script tools/setupFunctions.py.
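The list of variables can also be retrieved programmatically, e.g. with the requests package (a sketch; the exact structure of the returned JSON is not guaranteed here):

# sketch: fetch the current list of variables from the TOAR database;
# the exact structure of the returned JSON is an assumption
import requests

response = requests.get(
    "https://toar-data.fz-juelich.de/api/v2/variables/",
    params={"limit": "None"},
)
response.raise_for_status()
for variable in response.json():
    # each entry is expected to carry at least a name and, if available, a cf_standardname
    print(variable.get("name"), variable.get("cf_standardname"))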

    Supported Time intervals

At the moment only time differences larger than one day work, i.e. a request with end = start + 1 day leads to a crash.

    Setup functions:

This package comes with all required information. There is a first function to fetch an update of the available variables from the TOAR database. This will overwrite the original file:

     python tools/setupFunctions.py

    Tested platforms

    This project has been tested on

    • Rocky Linux 9

    Automated Testing

At the moment, automated tests for this module are under development and not yet in an operational state.

    Acknowledgements

We would like to thank Tabish Ansari for his evaluation of and feedback on the first datasets.

    Documentation of Source Code:

The aim is a brief overview of the functionality and the arguments of individual functions while limiting the amount of repetition. The documentation might not match other style guides. It will definitely be possible to extend the documentation :-)

from dataclasses import dataclass


class example:
    """An example class

    A more detailed explanation of the purpose of this example class.
    """

    def __init__(self, varA: int, varB: str):
        """Constructor

        Attributes:
        varA:
            brief details and more context
        varB:
            same here.
        """
        ...  # implementation

    def func1(self, att1, att2):
        """Brief

        details

        Attributes:
        -----------
        att1:
            brief/details
        att2:
            brief/details
        """
        ...  # implementation


@dataclass
class dataClass:
    """Brief description

    optional details

    Parameters
    ----------
    anInt:
        brief description
    anStr:
        brief description
    secStr:
        brief description (explanation of default value, if this seems necessary)
    """
    anInt: int
    anStr: str
    secStr: str = "Default value"

    Citation

    We refer to the entries of the publication database of the Research Center Jülich for a citation in different formats: https://juser.fz-juelich.de/record/1033661