Commit 0bc00fc0 authored by Clara Betancourt's avatar Clara Betancourt
Merge branch 'devel'

%% Cell type:markdown id: tags:
# Ozone Mapping Introduction
The following Jupyter notebook introduces the AQ-Bench data set, its data preprocessing, and three baseline experiments: linear regression, neural network, and random forest. Let's first import some modules and set up.
%% Cell type:code id: tags:
``` python
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from importlib import reload
import ipywidgets as widgets
from IPython.display import display, clear_output, display_html

from settings import *
from dataset_preanalysis import PreVis, PreMis
from dataset_datasplit import DataSplit
from mapping_data import Data
from mapping_linear_regression import LinearRegression
from mapping_neural_network import NeuralNetwork
from mapping_random_forest import RandomForest

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
```
%% Cell type:markdown id: tags:
## AQ-Bench data set
Let's have a look at the AQ-Bench data set. You see several metadata fields and ozone metrics - each of them has its own column. Each row shows data from a different station and is linked to a unique station ID.
%% Cell type:code id: tags:
``` python
dataset = pd.read_csv(resources_dir + AQbench_dataset_file)
dataset.head()
```
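Since every row belongs to one station, it can be convenient to index the frame by the station ID. A minimal standalone sketch with a hypothetical mini-frame (the column names `id`, `lat`, and `o3_average_values` are assumed here, not taken verbatim from AQ-Bench):

``` python
import pandas as pd

# Toy frame mimicking the AQ-Bench layout: one row per station,
# identified by a unique id column.
df = pd.DataFrame({
    'id': [1001, 1002, 1003],
    'lat': [50.9, 48.1, 52.5],
    'o3_average_values': [28.3, 31.7, 25.4],
})

# Indexing by station id makes per-station lookups direct.
df = df.set_index('id')
row = df.loc[1002]
print(row['o3_average_values'])
```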
%% Cell type:markdown id: tags:
The following table gives you some information about the different variables. The first column, "column_name", contains all the variables you have seen before; the remaining columns describe them. For example, "input_target" shows whether a variable is metadata (input) or a metric (target).
%% Cell type:code id: tags:
``` python
information = pd.read_csv(resources_dir + AQbench_variables_file)
information
```
%% Cell type:markdown id: tags:
To explore the variables' distributions, feel free to play around with the following widget. Just choose a variable, and a logarithmic histogram of that variable will appear automatically:
%% Cell type:code id: tags:
``` python
%matplotlib inline

def plot_previs(column_name):
    previs = PreVis(resources_dir + AQbench_dataset_file, output_dir, resources_dir)
    previs.read_csv_to_df()
    previs.vis(column_name)
    plt.show()

options = information[information['input_target'].isin(['input', 'target'])]['column_name'].to_list()
column_name = widgets.Dropdown(options=options, description='variable:')
widgets.interact(plot_previs, column_name=column_name)
```
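The logarithmic y-axis is what makes sparsely populated tail bins visible for skewed variables. Stripped of the project-specific `PreVis` class, the idea can be sketched on synthetic data like this (the log-normal toy distribution is an assumption, not AQ-Bench data):

``` python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=0.5, size=500)  # skewed toy data

counts, edges = np.histogram(values, bins=20)
plt.stairs(counts, edges)
plt.yscale('log')  # log counts keep the sparse tail bins visible
plt.xlabel('value')
plt.ylabel('count (log scale)')
```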
%% Cell type:markdown id: tags:
The data set has some missing values. Common machine learning algorithms cannot handle these, so rows with missing values will be dropped later. The following plot gives you an overview of the missing data:
%% Cell type:code id: tags:
``` python
%matplotlib inline
premis = PreMis(resources_dir + AQbench_dataset_file, resources_dir + AQbench_variables_file, output_dir)
premis.fill_nan()
premis.missingno_matrix()
plt.show()
```
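Behind a plot like the missingno matrix sits a simple per-column count of gaps, plus a `dropna` for the rows that survive. A minimal sketch on a toy frame (the column names `alt` and `population_density` are illustrative, not guaranteed AQ-Bench columns):

``` python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for AQ-Bench metadata columns.
df = pd.DataFrame({
    'alt': [120.0, np.nan, 45.0, 300.0],
    'population_density': [1500.0, 800.0, np.nan, 2000.0],
})

missing_per_column = df.isna().sum()  # overview, like the missingno matrix
complete = df.dropna()                # only these rows would be kept for training
print(missing_per_column)
print(len(complete))
```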
%% Cell type:markdown id: tags:
## Preprocessing Data
To use AQ-Bench metadata as input for ML algorithms, it must be preprocessed. Since basic ML algorithms cannot handle missing data, rows with missing values are dropped completely (depending on the chosen target). The longitude is dropped, since it is a circular variable. Categorical metadata is one-hot encoded, which leads to 135 input features in total. Continuous metadata can be scaled - either with normalization or with robust scaling, which scales by the 25%-75% quantile range to reduce the influence of outliers. Feel free to change the parameters of `Data` with the following widgets.
%% Cell type:code id: tags:
``` python
def print_data(target, scaling, scale_target):
    data = Data(target=target, scaling=scaling, scale_target=scale_target)
    display_html(data.data_yx.head())

target = widgets.Dropdown(
    options=information[information['input_target'] == 'target']['column_name'].to_list(),
    value='o3_average_values', description='target')
scaling = widgets.RadioButtons(options=['robust', 'normalize', 'None'], value='robust', description='scaling')
scale_target = widgets.Checkbox(value=False, description='scale target')
widgets.interact(print_data, target=target, scaling=scaling, scale_target=scale_target)
```
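The two preprocessing steps described above - one-hot encoding the categorical columns and robust-scaling the continuous ones - can be sketched standalone with pandas (the column names `climatic_zone` and `alt` are illustrative assumptions; the project's `Data` class may implement the details differently):

``` python
import pandas as pd

df = pd.DataFrame({
    'climatic_zone': ['temperate', 'tropical', 'temperate', 'polar'],  # categorical
    'alt': [10.0, 200.0, 50.0, 1500.0],                                # continuous
})

# One-hot encode the categorical column ...
encoded = pd.get_dummies(df, columns=['climatic_zone'])

# ... and robust-scale the continuous one: centre on the median and
# divide by the interquartile range (25%-75%), damping outlier influence.
q25, q50, q75 = df['alt'].quantile([0.25, 0.50, 0.75])
encoded['alt'] = (df['alt'] - q50) / (q75 - q25)
```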
%% Cell type:markdown id: tags:
## Data split
For the baseline experiments you are going to perform later, the following data split will be used:
(The plot is interactive; zoom in using the buttons below. If this does not work directly, please run the cell two or three times.)
%% Cell type:code id: tags:
``` python
reload(plt)
%matplotlib notebook
datasplit = DataSplit()
datasplit.read_datasplit()
datasplit.plot_datasplit()
```
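`DataSplit` reads a predefined split from file, so the exact assignment is fixed by the project. As a generic illustration only, a random 60/20/20 split of station IDs into training, development, and test sets (the fractions and the random assignment are assumptions, not necessarily how AQ-Bench is split) could look like:

``` python
import numpy as np

rng = np.random.default_rng(42)
ids = np.arange(1000)  # hypothetical station IDs
rng.shuffle(ids)

# Disjoint 60/20/20 split by station.
n = len(ids)
ids_train = ids[: int(0.6 * n)]
ids_val = ids[int(0.6 * n): int(0.8 * n)]
ids_test = ids[int(0.8 * n):]
```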
%% Cell type:markdown id: tags:
If you like, you can compare the variables' distributions for the three data sets.
%% Cell type:code id: tags:
``` python
%matplotlib inline
datasplit.read_datasplit()
ids = {'training': datasplit.ids_train, 'development': datasplit.ids_val, 'test': datasplit.ids_test}

def plot_previs_sets(column_name, set_):
    previs = PreVis(resources_dir + AQbench_dataset_file, output_dir, resources_dir)
    previs.read_csv_to_df()
    previs.data = previs.data[previs.data.index.isin(ids[set_])]
    previs.vis(column_name, plot_naming=set_)

options = information[information['input_target'].isin(['input', 'target'])]['column_name'].to_list()
column_name = widgets.Dropdown(options=options, description='variable:')
set_ = widgets.RadioButtons(options=['training', 'development', 'test'],
                            value='training', description='data set:')
widgets.interact(plot_previs_sets, column_name=column_name, set_=set_)
```
%% Cell type:markdown id: tags:
## ML Algorithms
Run the following cell to play with all three ML algorithms: linear regression, neural network, and random forest. There are four tabs, described in the following:
**Data Preparation:** Before running any ML algorithm, please set up your data. Choose the parameters you want and then click the yellow "Setup Data" button. If you change any parameters later, click this button again afterwards.
**Linear Regression:** You can train the linear regression by clicking the green button. You can also randomly shuffle the labels before training, to test how the algorithm behaves when there is no correlation between input and output values.
**Neural Network:** Before training your neural network, you can choose its hyperparameters. The suggested values are restored when you set up the data; different hyperparameters are suggested depending on the target. In addition to the green and red buttons, there is also a blue one to overfit your model on a very small number of data examples.
**Random Forest:** Works like the linear regression tab.
(If the output, including the widgets, scrolls out of view, click to the left of the output to prevent this.)
%% Cell type:code id: tags:
``` python
%run mapping_jupyter.ipynb
```
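The tabs above wrap the project's own classes. As a rough standalone sketch of two of the baselines on synthetic data (using scikit-learn rather than the AQ-Bench code, with invented data, so results do not reflect AQ-Bench), including the shuffled-label sanity check from the linear regression tab:

``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Sanity check: shuffling the labels destroys the input-output relation,
# so the test score should collapse towards zero.
y_shuffled = rng.permutation(y_train)
lin_shuffled = LinearRegression().fit(X_train, y_shuffled)

print(r2_score(y_test, lin.predict(X_test)))
print(r2_score(y_test, rf.predict(X_test)))
print(r2_score(y_test, lin_shuffled.predict(X_test)))
```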