Commit 0bc00fc0 authored by Clara Betancourt's avatar Clara Betancourt
Merge branch 'devel'

%% Cell type:markdown id: tags:
# Ozone Mapping Introduction
The following Jupyter notebook introduces the AQ-Bench data set, its data preprocessing, and three baseline experiments: linear regression, neural network, and random forest. Let's first import some modules and set up.
%% Cell type:code id: tags:
``` python
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from importlib import reload
import ipywidgets as widgets
from IPython.display import display, clear_output, display_html

from settings import *
from dataset_preanalysis import PreVis, PreMis
from dataset_datasplit import DataSplit
from mapping_data import Data
from mapping_linear_regression import LinearRegression
from mapping_neural_network import NeuralNetwork
from mapping_random_forest import RandomForest

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 200)
```
%% Cell type:markdown id: tags:
## AQ-Bench data set
Let's have a look at the AQ-Bench data set. You see several metadata fields and ozone metrics - each of them has its own column. Each row shows data from a different station and is linked to a unique station ID.
%% Cell type:code id: tags:
``` python
dataset = pd.read_csv(resources_dir + AQbench_dataset_file)
dataset.head()
```
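Since every row belongs to one station, it can be convenient to index the frame by the station ID. A minimal standalone sketch with a hypothetical mini-frame (the column names `id`, `lat`, and `o3_average_values` are assumed here, not taken verbatim from AQ-Bench):

``` python
import pandas as pd

# Toy frame mimicking the AQ-Bench layout: one row per station,
# identified by a unique id column.
df = pd.DataFrame({
    'id': [1001, 1002, 1003],
    'lat': [50.9, 48.1, 52.5],
    'o3_average_values': [28.3, 31.7, 25.4],
})

# Indexing by station id makes per-station lookups direct.
df = df.set_index('id')
row = df.loc[1002]
print(row['o3_average_values'])
```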
%% Cell type:markdown id: tags:
The following table gives you some information about the different variables. The first column, "column_name", contains all the variables you have seen before; the remaining columns describe them. For example, "input_target" shows whether a variable is metadata (input) or a metric (target).
%% Cell type:code id: tags:
``` python
information = pd.read_csv(resources_dir + AQbench_variables_file)
information
```
%% Cell type:markdown id: tags:
To explore the variables' distributions, feel free to play around with the following widget. Just choose a variable, and a logarithmic histogram of that variable will appear automatically:
%% Cell type:code id: tags:
``` python
%matplotlib inline

def plot_previs(column_name):
    previs = PreVis(resources_dir + AQbench_dataset_file, output_dir, resources_dir)
    previs.read_csv_to_df()
    previs.vis(column_name)
    plt.show()

options = information[information['input_target'].isin(['input', 'target'])]['column_name'].to_list()
column_name = widgets.Dropdown(options=options, description='variable:')
widgets.interact(plot_previs, column_name=column_name)
```
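The logarithmic y-axis is what makes sparsely populated tail bins visible for skewed variables. Stripped of the project-specific `PreVis` class, the idea can be sketched on synthetic data like this (the log-normal toy distribution is an assumption, not AQ-Bench data):

``` python
import numpy as np
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
values = rng.lognormal(mean=3.0, sigma=0.5, size=500)  # skewed toy data

counts, edges = np.histogram(values, bins=20)
plt.stairs(counts, edges)
plt.yscale('log')  # log counts keep the sparse tail bins visible
plt.xlabel('value')
plt.ylabel('count (log scale)')
```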
%% Cell type:markdown id: tags:
The data set has some missing values. Common machine learning algorithms cannot handle these, so rows with missing values will be dropped later. The following plot gives you an overview of the missing data:
%% Cell type:code id: tags:
``` python
%matplotlib inline
premis = PreMis(resources_dir + AQbench_dataset_file, resources_dir + AQbench_variables_file, output_dir)
premis.fill_nan()
premis.missingno_matrix()
plt.show()
```
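Behind a plot like the missingno matrix sits a simple per-column count of gaps, plus a `dropna` for the rows that survive. A minimal sketch on a toy frame (the column names `alt` and `population_density` are illustrative, not guaranteed AQ-Bench columns):

``` python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for AQ-Bench metadata columns.
df = pd.DataFrame({
    'alt': [120.0, np.nan, 45.0, 300.0],
    'population_density': [1500.0, 800.0, np.nan, 2000.0],
})

missing_per_column = df.isna().sum()  # overview, like the missingno matrix
complete = df.dropna()                # only these rows would be kept for training
print(missing_per_column)
print(len(complete))
```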
%% Cell type:markdown id: tags:
## Preprocessing Data
To use AQ-Bench metadata as input for ML algorithms, it must be preprocessed. Since basic ML algorithms cannot handle missing data, rows with missing values are dropped completely (depending on the chosen target). The longitude is dropped, since it is a circular variable. Categorical metadata is one-hot encoded, which leads to 135 input features in total. Continuous metadata can be scaled - either with normalization or with robust scaling, which scales by the 25%-75% quantile range to reduce the influence of outliers. Feel free to change the parameters of `Data` with the following widgets.
%% Cell type:code id: tags:
``` python
def print_data(target, scaling, scale_target):
    data = Data(target=target, scaling=scaling, scale_target=scale_target)
    display_html(data.data_yx.head())

target = widgets.Dropdown(
    options=information[information['input_target'] == 'target']['column_name'].to_list(),
    value='o3_average_values', description='target')
scaling = widgets.RadioButtons(options=['robust', 'normalize', 'None'], value='robust', description='scaling')
scale_target = widgets.Checkbox(value=False, description='scale target')
widgets.interact(print_data, target=target, scaling=scaling, scale_target=scale_target)
```
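The two preprocessing steps described above - one-hot encoding the categorical columns and robust-scaling the continuous ones - can be sketched standalone with pandas (the column names `climatic_zone` and `alt` are illustrative assumptions; the project's `Data` class may implement the details differently):

``` python
import pandas as pd

df = pd.DataFrame({
    'climatic_zone': ['temperate', 'tropical', 'temperate', 'polar'],  # categorical
    'alt': [10.0, 200.0, 50.0, 1500.0],                                # continuous
})

# One-hot encode the categorical column ...
encoded = pd.get_dummies(df, columns=['climatic_zone'])

# ... and robust-scale the continuous one: centre on the median and
# divide by the interquartile range (25%-75%), damping outlier influence.
q25, q50, q75 = df['alt'].quantile([0.25, 0.50, 0.75])
encoded['alt'] = (df['alt'] - q50) / (q75 - q25)
```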
%% Cell type:markdown id: tags:
## Data split
For the baseline experiments you are going to perform later, the following data split will be used:
(The plot is interactive; zoom in using the buttons below. If this does not work directly, please run the cell two or three times.)
%% Cell type:code id: tags:
``` python
reload(plt)
%matplotlib notebook
datasplit = DataSplit()
datasplit.read_datasplit()
datasplit.plot_datasplit()
```
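`DataSplit` reads a predefined split from file, so the exact assignment is fixed by the project. As a generic illustration only, a random 60/20/20 split of station IDs into training, development, and test sets (the fractions and the random assignment are assumptions, not necessarily how AQ-Bench is split) could look like:

``` python
import numpy as np

rng = np.random.default_rng(42)
ids = np.arange(1000)  # hypothetical station IDs
rng.shuffle(ids)

# Disjoint 60/20/20 split by station.
n = len(ids)
ids_train = ids[: int(0.6 * n)]
ids_val = ids[int(0.6 * n): int(0.8 * n)]
ids_test = ids[int(0.8 * n):]
```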
%% Cell type:markdown id: tags:
If you like, you can compare the variables' distributions for the three data sets.
%% Cell type:code id: tags:
``` python
%matplotlib inline
datasplit.read_datasplit()
ids = {'training': datasplit.ids_train, 'development': datasplit.ids_val, 'test': datasplit.ids_test}

def plot_previs_sets(column_name, set_):
    previs = PreVis(resources_dir + AQbench_dataset_file, output_dir, resources_dir)
    previs.read_csv_to_df()
    previs.data = previs.data[previs.data.index.isin(ids[set_])]
    previs.vis(column_name, plot_naming=set_)

options = information[information['input_target'].isin(['input', 'target'])]['column_name'].to_list()
column_name = widgets.Dropdown(options=options, description='variable:')
set_ = widgets.RadioButtons(options=['training', 'development', 'test'],
                            value='training', description='data set:')
widgets.interact(plot_previs_sets, column_name=column_name, set_=set_)
```
%% Cell type:markdown id: tags:
## ML Algorithms
Run the following cell to play with all three ML algorithms: linear regression, neural network, and random forest. There are four tabs, described in the following:
**Data Preparation:** Before running any ML algorithm, please set up your data. Choose the parameters you want and then click the yellow "Setup Data" button. If you change any parameters later, click this button again afterwards.
**Linear Regression:** You can train the linear regression by clicking the green button. You can also randomly shuffle the labels before training, to test how the algorithm behaves when there is no correlation between input and output values.
**Neural Network:** Before training your neural network, you can choose its hyperparameters. The suggested values are restored when you set up the data; different hyperparameters are suggested depending on the target. In addition to the green and red buttons, there is also a blue one to overfit your model on a very small number of data examples.
**Random Forest:** Works like the linear regression tab.
(If the output, including the widgets, scrolls out of view, click to the left of the output to prevent this.)
%% Cell type:code id: tags:
``` python
%run mapping_jupyter.ipynb
```
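The tabs above wrap the project's own classes. As a rough standalone sketch of two of the baselines on synthetic data (using scikit-learn rather than the AQ-Bench code, with invented data, so results do not reflect AQ-Bench), including the shuffled-label sanity check from the linear regression tab:

``` python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Sanity check: shuffling the labels destroys the input-output relation,
# so the test score should collapse towards zero.
y_shuffled = rng.permutation(y_train)
lin_shuffled = LinearRegression().fit(X_train, y_shuffled)

print(r2_score(y_test, lin.predict(X_test)))
print(r2_score(y_test, rf.predict(X_test)))
print(r2_score(y_test, lin_shuffled.predict(X_test)))
```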