BUG: wrong time extent when using lazy preprocessing

Bug

Error description

Although there should be data for a longer period, MLAir cannot find data for the val and test subsets.

Error message

Only an indirect error occurring at a later stage.

First guess on error origin

  • Something must be going wrong with the lazy preprocessing: either the loading is not executed properly, or the storing is erroneous. Therefore, check the loading carefully!

  • There is a mismatch in the hash, so the file is not overwritten. Inspecting the hash content suggests an issue with the data_origin parameter, which unintentionally changes to its default values instead of keeping the parameters given in the experiment setup. The malfunctioning lazy preprocessing might be a direct result of this issue. Let's check this out.

Error origin

There are two issues:

  1. With overwriting of lazy data enabled, an existing pickle file that should be overwritten is not removed. When storing the newly calculated pickle file, there is a check whether the file already exists, and in that case the file is not written (so the old file persists).

  2. There is a JOIN helper function, helpers/join.py:download_join, which works with the external parameter data_origin. This method unintentionally modifies that parameter. Since the parameter is a dictionary, which is always passed by reference and not as a copy, this also affects the self.data_origin attribute of a data handler. Consequently, the parameter changes between the existence check when loading lazy data and the existence check when storing lazy data, which leads to a deviating hash and thus to two separate files.
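Issue 2 can be reproduced in a few lines. The following is a minimal, self-contained sketch (the function names, parameter values, and hashing scheme are illustrative stand-ins, not MLAir's actual API):

```python
import hashlib


def stable_hash(d):
    """Illustrative stand-in for hashing a data handler's parameters."""
    return hashlib.md5(repr(sorted(d.items())).encode()).hexdigest()


def download_join(data_origin):
    """Illustrative helper that unintentionally mutates its argument in place."""
    # falling back to a default origin writes into the caller's dict,
    # because dicts are passed by reference
    data_origin["o3"] = data_origin.get("o3", "REA")


# the data handler keeps a reference to this very dict
handler_data_origin = {"no2": "JOIN"}

hash_before = stable_hash(handler_data_origin)  # hash used when loading lazy data
download_join(handler_data_origin)              # mutates the shared dict
hash_after = stable_hash(handler_data_origin)   # hash used when storing lazy data

assert hash_before != hash_after  # deviating hashes -> two separate pickle files
```

The two hashes differ because the dict contents changed between loading and storing, which is exactly why the lazy file is looked up under one name but stored under another.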

Solution

  1. Change the following lines in mlair/data_handler/data_handler_single_station.py. It is not necessary to catch the error if the file does not exist, since os.remove then raises a FileNotFoundError, which would be raised afterwards anyway.
class DataHandlerSingleStation(AbstractDataHandler):
    ....
    def load_lazy(self):
        hash = self._get_hash()
        filename = os.path.join(self.lazy_path, hash + ".pickle")
        try:
            if self.overwrite_lazy_data is True:
+               os.remove(filename)
                raise FileNotFoundError
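The remove-then-raise pattern above can be exercised end to end with a minimal runnable sketch. Function and variable names here are illustrative, not MLAir's actual code; only the control flow mirrors the fix:

```python
import os
import pickle
import tempfile


def load_or_create(filename, compute, overwrite=False):
    """Illustrative sketch of the fixed load/store logic for lazy data."""
    try:
        if overwrite:
            # remove any stale file first; if it does not exist, os.remove
            # itself raises FileNotFoundError, which we would raise anyway
            os.remove(filename)
            raise FileNotFoundError
        with open(filename, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        data = compute()
        # the existence check at store time now sees no stale file
        if not os.path.exists(filename):
            with open(filename, "wb") as f:
                pickle.dump(data, f)
        return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc.pickle")
    with open(path, "wb") as f:
        pickle.dump("stale", f)  # simulate an existing, outdated lazy file
    result = load_or_create(path, compute=lambda: "fresh", overwrite=True)
```

Without the os.remove call, the existence check at store time would find the stale file and silently keep it; with it, the recomputed data is actually written.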
  2. Add a line in mlair/helpers/join.py:download_join that copies the given data_origin dictionary.
def download_join(...):
    ....
    # make sure station_name parameter is a list
    station_name = helpers.to_list(station_name)
+
+   # also ensure that the given data_origin dict is not a reference
+   data_origin = None if data_origin is None else {k: v for (k, v) in data_origin.items()}

    # get data connection settings
    join_url_base, headers = join_settings(sampling)
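The effect of that copy can be demonstrated in isolation. This is a hypothetical sketch (dictionary keys and values are made up); it shows that the dict comprehension breaks the reference, so later in-place changes no longer leak back to the caller:

```python
# the caller's dict, e.g. from the experiment setup
experiment_origin = {"no2": "JOIN", "o3": "JOIN"}

# the fix from the diff above: shallow-copy the dict (None stays None)
data_origin = None if experiment_origin is None \
    else {k: v for (k, v) in experiment_origin.items()}

# later mutations inside download_join only touch the local copy ...
data_origin["o3"] = "REA"

# ... while the caller's dict, and thus the hash input, is unchanged
assert experiment_origin == {"no2": "JOIN", "o3": "JOIN"}
```

A shallow copy is sufficient here as long as the dict values are immutable (e.g. strings); if nested mutable values were involved, copy.deepcopy would be needed instead.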
