Rules for identifying stations have been fully implemented
Rule 2.3 is still missing!
from draft TG02:
Rule 2.1: check the combination of role: resource_provider (organisation) and station code (select id from stations where *** and ). If a station with the same identifier and provider is found in the database, we can be very sure that the new data records belong to the same station. As an additional check, we then control whether the station coordinates from the new record are within 10 m (changed from 200 m) of those stored in the database, and the database operator is informed about potential discrepancies.
Rule 2.2: if rule 2.1 did not lead to the identification of an existing station, the station coordinates are used as proxy for a station identifier. Based on a series of queries of the TOAR-1 database, where extensive work had been conducted to manually control station coordinates, we identified a threshold distance value of 10 m (changed from 30 m) as most suitable criterion to decide if new data records should belong to an existing station or not (select * from stationmeta_core where ST_DistanceSphere(stationmeta_core.coordinates, ST_GeomFromText('POINT(lon lat)',4326)) < 10;). To further confirm if data points that are within 30 m really belong to the same station, we run a similarity check on the station name and station codes if they are contained in the metadata (see rule 2.3). The output of the coordinate test and the similarity check are reported to the database operator.
Rule 2.3: if the coordinates of the new data record are more than 30 m but less than 200 m away from the nearest station stored in the TOAR database, we perform a similarity test on the station names and station codes of those records that are within 200 m of the new coordinates (select word_similarity() ***). If either similarity is > 0.75, we assume that the locations are identical and that the station coordinates have been specified with insufficient accuracy. Otherwise, the new data records are treated as a new station. Obviously, if station name and station ids are not provided we have to rely on the coordinate information and therefore treat the new data records as a new station. Again, the outcome of the similarity tests will be reported to the database operator.
Note: don't decide automatically, only generate a message!
Some information on such similarity tests:
https://www.postgresql.org/docs/current/pgtrgm.html
https://dev.to/kaleman15/fuzzy-searching-with-postgresql-97o
https://stackoverflow.com/questions/11249635/finding-similar-strings-with-postgresql-quickly
(not an exact fit, but interesting reading: https://railsware.com/blog/effective-similarity-search-in-postgresql/)
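To make the similarity test of Rule 2.3 more concrete, here is a minimal Python sketch of a trigram comparison. It mimics the trigram convention documented for pg_trgm's similarity() (lowercase, split into words, pad each word with two leading blanks and one trailing blank); it is not word_similarity(), which the draft mentions and which scores differently. The function names, the report wording and the example station names are hypothetical, and the 0.75 threshold is taken from the draft text. In line with the note on Rule 2.3, the sketch only produces a message for the operator and decides nothing.

```python
import re


def trigrams(text: str) -> set[str]:
    """Trigram set in the pg_trgm style: lowercase, split on
    non-alphanumeric characters, pad each word with '  ' and ' '."""
    grams = set()
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similarity_report(new_name, new_code, old_name, old_code, threshold=0.75):
    """Build an operator message; nothing is decided automatically."""
    if not (new_name or new_code):
        return "no station name or code provided: candidate new station"
    name_sim = similarity(new_name or "", old_name or "")
    code_sim = similarity(new_code or "", old_code or "")
    if max(name_sim, code_sim) > threshold:
        verdict = "probably the same station"
    else:
        verdict = "candidate new station"
    return f"name similarity {name_sim:.2f}, code similarity {code_sim:.2f}: {verdict}"


# Hypothetical example values, for illustration only
print(similarity_report("Station Alpha", "A001", "station alpha", "X17"))
```

In a production setting the comparison would more likely run inside PostgreSQL via the pg_trgm extension, so that the scores match what the database itself reports.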
The published document (30/09/2021) is different from the draft version:
Step 10: Identify Station
In order to ensure that data belonging to one measurement series are recorded as one time series at one station (and, conversely, data obtained at physically different locations are linked to different stations) the following set of rules has been implemented to decide if a new data record with station metadata information belongs to a station that is already recorded in the TOAR database. This seemingly easy problem is actually quite complicated in practice, because different monitoring networks may report station coordinates with different accuracy and sometimes the reported station coordinates are even wrong. Furthermore, there is no universal system of station identifiers established and in some cases, station identifiers are not even reported.
Rule 10.1: check the combination of role: resource_provider (organisation) and station code (db.query(models.Timeseries).filter(models.Timeseries.station_id == station_id).filter(models.Timeseries.variable_id == variable_id); role_num = get_value_from_str(toardb.toardb.RC_vocabulary,'ResourceProvider'); (contact.organisation.name == resource_provider) and (role_num == role.role)). If a station with the same identifier and provider is found in the database, we can be very sure that the new data records belong to the same station. As an additional check, we then control whether the station coordinates from the new record are within 10 m of those stored in the database, and the database operator is informed about potential discrepancies.
Rule 10.2: if rule 10.1 did not lead to the identification of an existing station, the station coordinates are used as proxy for a station identifier. Based on a series of queries of the TOAR-1 database, where extensive work had been conducted to manually control station coordinates, we identified a threshold distance value of 10 m as most suitable criterion to decide if new data records should belong to an existing station or not (select * from stationmeta_core where ST_DistanceSphere(stationmeta_core.coordinates, ST_GeomFromText('POINT(lon lat)',4326)) < 10;).
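The two published rules can be read as a small decision procedure. The following Python sketch works on an in-memory list of stations and is not the toardb implementation: the dictionary keys (provider, code, lon, lat), the function names and the message texts are assumptions made for illustration. The distance is a plain haversine formula with an assumed mean Earth radius, so values may differ slightly from what ST_DistanceSphere returns.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_008.8  # assumed mean Earth radius in metres


def distance_sphere(lon1, lat1, lon2, lat2):
    """Great-circle (haversine) distance in metres on a spherical Earth."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlam = radians(lon2 - lon1)
    h = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def identify_station(new, stations, threshold_m=10.0):
    """Return (matched station or None, list of operator messages)."""
    messages = []

    def dist(st):
        return distance_sphere(new["lon"], new["lat"], st["lon"], st["lat"])

    # Rule 10.1: same resource provider and same station code
    if new.get("code"):
        for st in stations:
            if (st["provider"], st["code"]) == (new["provider"], new["code"]):
                d = dist(st)
                if d > threshold_m:
                    messages.append(
                        f"rule 10.1: code {st['code']} matches, "
                        f"but coordinates differ by {d:.1f} m"
                    )
                return st, messages

    # Rule 10.2: coordinates serve as a proxy for the station identifier
    if stations:
        nearest = min(stations, key=dist)
        d = dist(nearest)
        if d < threshold_m:
            messages.append(f"rule 10.2: nearest station is {d:.1f} m away")
            return nearest, messages

    messages.append("no existing station identified")
    return None, messages


# Hypothetical example data, for illustration only
stations = [{"provider": "ORG_A", "code": "X1", "lon": 6.0, "lat": 50.0}]
record = {"provider": "ORG_A", "code": "X1", "lon": 6.0, "lat": 50.001}
print(identify_station(record, stations))
```

Returning the messages alongside the match keeps the operator informed in both branches, which is what the rules ask for when coordinates and identifiers disagree.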