Bug: order in feature_importance_bootstrap_method causes crash

Bug

The order of the list elements passed to the argument feature_importance_bootstrap_method can crash the program.

Error description

In run.py:

feature_importance_bootstrap_method=["zero_mean", "shuffle"] fails.

feature_importance_bootstrap_method=["shuffle", "zero_mean"] works.

Error message

Traceback (most recent call last):
  File "[...]/mlair/run.py", line 46, in <module>
    main(args)
  File "[...]/mlair/run.py", line 38, in main
    workflow.run()
  File "[...]/mlair/mlair/workflows/abstract_workflow.py", line 30, in run
    stage(**self._registry_kwargs[pos])
  File "[...]/mlair/mlair/run_modules/post_processing.py", line 99, in __init__
    self._run()
  File "[...]/mlair/mlair/run_modules/post_processing.py", line 125, in _run
    self.report_feature_importance_results(self.feature_importance_skill_scores)
  File "[...]/mlair/mlair/run_modules/post_processing.py", line 1027, in report_feature_importance_results
    df = pd.DataFrame(res, columns=col_names)
  File "[...]/lib/python3.8/site-packages/pandas/core/frame.py", line 509, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "[...]/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 524, in to_arrays
    return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
  File "[...]/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 567, in _list_to_arrays
    raise ValueError(e) from e
ValueError: 6 columns passed, passed data had 25 columns
2022-01-28 15:29:19,809 - INFO: PostProcessing finished after 0:01:03 (hh:mm:ss)  [run_environment.py:__del__:118]

Process finished with exit code 1

First guess on error origin

Probably in post_processing.py, in the method report_feature_importance_results, where col_names is built.

Error origin

Currently, the number of columns is derived from the first element of res via *list(range(len(res[0]) - 5)). If this element is shorter than the longest element, col_names has too few entries and pandas raises the ValueError shown above.
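The failure mode can be reproduced in isolation. The snippet below is only a minimal sketch: the rows in res, the two leading label columns, and the short column names are made up for illustration and are not taken from MLAir.

```python
import pandas as pd

# Hypothetical stand-in for `res`: the first row is shorter than a later one.
res = [
    ["dnn", "zero_mean", 0.1],
    ["dnn", "shuffle", 0.1, 0.2, 0.3],
]

# Column names derived from the first row only, mimicking the current logic
# (2 label columns here instead of the 5 used in post_processing.py).
col_names = ["type", "method", *range(len(res[0]) - 2)]

try:
    pd.DataFrame(res, columns=col_names)
    raised = False
except ValueError as err:
    # pandas rejects the data because a row is longer than col_names
    raised = True
    print(err)
```

Swapping the two rows in res makes the first row the longest one, so the same call succeeds. This matches the order dependence described above.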

Solution

Look for the longest result element and use this length to create the data frame.

class PostProcessing(RunEnvironment):
    ...
    def report_feature_importance_results(self, results):
        ...
        res = []
+       max_cols = 0
        for boot_type, d0 in results.items():
            for boot_method, d1 in d0.items():
                for station_name, vals in d1.items():
                    for boot_var in vals.coords[self.boot_var_dim].values.tolist():
                        for ahead in vals.coords[self.ahead_dim].values.tolist():
                            res.append([boot_type, boot_method, station_name, boot_var, ahead,
                                        *vals.sel({self.boot_var_dim: boot_var,
                                                   self.ahead_dim: ahead}).values.round(5).tolist()])
+                           max_cols = max(max_cols, len(res[-1]))
        col_names = [self.model_type_dim, "method", "station", self.boot_var_dim, self.ahead_dim,
-                    *list(range(len(res[0]) - 5))]
+                    *list(range(max_cols - 5))]
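The effect of the proposed change can be checked with a small standalone sketch. The data and the two label columns are again hypothetical, and max_cols is computed in a single pass here rather than inside the nested loops, which is assumed to give the same result.

```python
import pandas as pd

# Hypothetical stand-in for `res` with rows of different length.
res = [
    ["dnn", "zero_mean", 0.1],
    ["dnn", "shuffle", 0.1, 0.2, 0.3],
]

# Use the longest row to decide how many value columns are needed.
max_cols = max(len(row) for row in res)
col_names = ["type", "method", *range(max_cols - 2)]

# Shorter rows are padded with missing values instead of raising an error.
df = pd.DataFrame(res, columns=col_names)
print(df)
```

With this approach the order of the rows no longer matters. The shorter rows simply contain missing values in the trailing columns.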