"For the baseline experiments, you are going to perform later, the following data split will be used:\n",
"For the baseline experiments, you are going to perform later, the following data split will be used:\n",
"\n",
"\n",
"(The plot is interactive, zoom in unsing the buttons below)"
"(The plot is interactive, zoom in unsing the buttons below. If this does not work directly, please run the cell 2 - 3 times)"
...
%% Cell type:markdown id: tags:
# Ozone Mapping Introduction
The following Jupyter notebook introduces the AQ-Bench data set, its data preprocessing and three baseline experiments: linear regression, neural network and random forest. Let's first import some modules and set things up.
Let's have a look at the AQ-Bench data set. You see several metadata variables and ozone metrics, each in its own column. Each row shows data from a different station and is linked to a unique ID.
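A minimal sketch of loading and inspecting the table with pandas is shown below. The file name and the name of the ID column are placeholders; adjust them to your local copy of the data set.

```python
import pandas as pd

# Load the AQ-Bench table; file name and ID column name are placeholders.
aq_bench = pd.read_csv('AQbench_dataset.csv', index_col='id')

print(aq_bench.shape)   # (number of stations, number of metadata and ozone-metric columns)
aq_bench.head()         # first few stations, one row per station ID
```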
The following table provides information about the different variables. The first column, "column name", contains all variables you have seen before; the other columns describe them. Example: "input_target" shows whether the variable is metadata (input) or a metric (target).
If you would like to see the variables' distributions, feel free to play around with the following widget. Just choose a variable, and a histogram with logarithmic counts will appear automatically:
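A minimal sketch of such a widget, assuming the `aq_bench` DataFrame from above, could look like this (the exact layout of the notebook's widget may differ):

```python
import matplotlib.pyplot as plt
from ipywidgets import interact

# Dropdown over the numeric columns of the data set.
numeric_columns = list(aq_bench.select_dtypes('number').columns)

@interact(variable=numeric_columns)
def plot_histogram(variable):
    aq_bench[variable].dropna().hist(bins=50)
    plt.yscale('log')                 # logarithmic count axis
    plt.xlabel(variable)
    plt.ylabel('count (log scale)')
    plt.show()
```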
The data set has some missing values. Common machine learning algorithms cannot handle them, so rows with missing values will be dropped later. The following plot gives you an overview of the missing data:
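If you prefer numbers over the plot, a quick way to count missing values per column (again assuming `aq_bench` from above) is:

```python
# Overview of missing values per column.
missing = aq_bench.isna().sum().sort_values(ascending=False)
print(missing[missing > 0])                      # columns with at least one missing value

missing[missing > 0].plot.barh(figsize=(6, 8))   # quick bar chart of missing counts
```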
To use the AQ-Bench metadata as input for ML algorithms, it must be preprocessed. Since basic ML algorithms cannot handle missing data, rows with missing values are removed completely (depending on the chosen target). The longitude is dropped, since it is a circular variable. Categorical metadata is one-hot encoded, which leads to 135 input features in total. Continuous metadata can be scaled, either with normalization or with robust scaling, which scales by the 25% to 75% quantile range to reduce the influence of outliers. Feel free to change the parameters of Data with the following widgets.
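The following sketch mirrors these preprocessing steps with pandas and scikit-learn. The column names (`'lon'` for longitude, `'o3_average_values'` for the target) are assumptions; check the variable table above for the real names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Placeholder names: longitude column and target are assumptions.
target = 'o3_average_values'
data = aq_bench.drop(columns=['lon']).dropna()        # drop longitude and rows with missing values

continuous = data.select_dtypes('number').columns.drop(target)
categorical = data.select_dtypes(exclude='number').columns

one_hot = pd.get_dummies(data[list(categorical)])     # one-hot encode categorical metadata
scaler = RobustScaler(quantile_range=(25.0, 75.0))    # or MinMaxScaler() for plain normalization
scaled = pd.DataFrame(scaler.fit_transform(data[continuous]),
                      columns=continuous, index=data.index)

X = pd.concat([scaled, one_hot], axis=1)              # final input features
y = data[target]                                      # chosen ozone metric as target
```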
Run the following cell to play with all three ML algorithms: linear regression, neural network and random forest. There are four tabs, described below (a small scikit-learn sketch of the baselines follows after the tab descriptions):
**Data Preparation:** Before running any ML algorithm, please set up your data. Choose the parameters you want and then click the yellow "Setup Data" button. If you want to change any parameters afterwards, click this button again.
**Linear Regression:** You can train the linear regression by clicking the green button. You can also randomly shuffle the labels before training, to test how the algorithm behaves when there is no correlation between input and output values.
**Neural Network:** Before training your neural network, you can choose its hyperparameters. The suggested values are restored when you set up the data, and different hyperparameters are suggested depending on the target. In addition to the green and red buttons, there is also a blue one: clicking it overfits your model on a very small number of data examples.
**Random Forest:** Works similarly to the linear regression tab.
(If the output, including the widgets, scrolls out of view, you can click to the left of the output to prevent this.)
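For reference, here is the baseline sketch announced above: it trains the three algorithms outside the widget GUI. The split names (`X_train`, `y_train`, `X_test`, `y_test`) are hypothetical and the hyperparameters are illustrative, not the values suggested by the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Illustrative hyperparameters; the notebook widgets suggest their own values.
models = {
    'linear regression': LinearRegression(),
    'neural network': MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
    'random forest': RandomForestRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f'{name}: R^2 = {r2_score(y_test, model.predict(X_test)):.3f}')

# Counterpart to the "shuffle labels" option: break the relation between inputs
# and outputs; the test score should drop towards zero.
shuffled = LinearRegression().fit(X_train, np.random.default_rng(0).permutation(y_train))
print(f'shuffled labels: R^2 = {r2_score(y_test, shuffled.predict(X_test)):.3f}')
```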