Alina Bazarova / Bayesian Statistical Learning 2 / Commits / 8c690d3d

Commit 8c690d3d, authored 3 weeks ago by Alina Bazarova
Merge branch 'feature-gp'
Parents: aa631ea9, ac660f5b
1 merge request: !3 Update 03_one_dim_SVI
Showing 2 changed files with 52 additions and 12 deletions:

BLcourse2.3/03_one_dim_SVI.ipynb (26 additions, 6 deletions)
BLcourse2.3/03_one_dim_SVI.py (26 additions, 6 deletions)
BLcourse2.3/03_one_dim_SVI.ipynb (+26 −6)
...
...
@@ -17,6 +17,8 @@
"$\\newcommand{\\predve}[1]{\\mathbf{#1}}$\n",
"$\\newcommand{\\test}[1]{#1_*}$\n",
"$\\newcommand{\\testtest}[1]{#1_{**}}$\n",
"$\\newcommand{\\dd}{{\\rm{d}}}$\n",
"$\\newcommand{\\lt}[1]{_{\\text{#1}}}$\n",
"$\\DeclareMathOperator{\\diag}{diag}$\n",
"$\\DeclareMathOperator{\\cov}{cov}$"
]
...
...
@@ -141,6 +143,9 @@
"2015](https://proceedings.mlr.press/v38/hensman15.html). The model is\n",
"\"sparse\" since it works with a set of *inducing* points $(\\ma Z, \\ve u),\n",
"\\ve u=f(\\ma Z)$ which is much smaller than the train data $(\\ma X, \\ve y)$.\n",
"See also [the GPJax\n",
"docs](https://docs.jaxgaussianprocesses.com/_examples/uncollapsed_vi) for a\n",
"nice introduction.\n",
"\n",
"We have the same hyper parameters as before\n",
"\n",
...
...
@@ -260,16 +265,31 @@
"source": [
"# Fit GP to data: optimize hyper params\n",
"\n",
"Now we optimize the GP hyper parameters by doing a GP-specific variational inference (VI),\n",
"where we optimize not the log marginal likelihood (ExactGP case),\n",
"but an ELBO (evidence lower bound) objective\n",
"Now we optimize the GP hyper parameters by doing a GP-specific variational\n",
"inference (VI), where we don't maximize the log marginal likelihood (ExactGP\n",
"case), but an ELBO (\"evidence lower bound\") objective -- a lower bound on the\n",
"marginal likelihood (the \"evidence\"). In variational inference, an ELBO objective\n",
"shows up when minimizing the KL divergence between\n",
"an approximate and the true posterior\n",
"\n",
"$$\n",
"
\\max_\\ve\\zeta\\left(\\mathbb E_{q_{\\ve\\psi}(\\ve u)}\\left[\\ln p(\\ve y|\\ve u) \\right] -
\n",
"
D_{\\text{KL}}(q_{\\ve\\psi}(\\ve u)\\Vert p(\\ve u))\\right)
\n",
"
p(w|y) = \\frac{p(y|w)\\,p(w)}{\\int p(y|w)\\,p(w)\\,\\dd w}
\n",
"
= \\frac{p(y|w)\\,p(w)}{p(y)}
\n",
"$$\n",
"\n",
"with respect to\n",
"$$\n",
" \\ve\\zeta^* = \\text{arg}\\min_{\\ve\\zeta} D\\lt{KL}(q_{\\ve\\zeta}(w)\\,\\Vert\\, p(w|y))\n",
"$$\n",
"\n",
"to obtain the optimal variational parameters $\\ve\\zeta^*$ to approximate\n",
"$p(w|y)$ with $q_{\\ve\\zeta^*}(w)$.\n",
"\n",
"In our case the two distributions are the approximate\n",
"\n",
"$$q_{\\ve\\zeta}(\\mathbf f)=\\int p(\\mathbf f|\\ve u)\\,q_{\\ve\\psi}(\\ve u)\\,\\dd\\ve u\\quad(\\text{\"variational strategy\"})$$\n",
"\n",
"and the true $p(\\mathbf f|\\mathcal D)$ posterior over function values. We\n",
"optimize with respect to\n",
"\n",
"$$\\ve\\zeta = [\\ell, \\sigma_n^2, s, c, \\ve\\psi] $$\n",
"\n",
...
...
%% Cell type:markdown id:b146bc52 tags:
# About

In this notebook, we replace the ExactGP inference and log marginal
likelihood optimization by using sparse stochastic variational inference.
This serves as an example of the many methods `gpytorch` offers to make GPs
scale to large data sets.

$\newcommand{\ve}[1]{\mathit{\boldsymbol{#1}}}$
$\newcommand{\ma}[1]{\mathbf{#1}}$
$\newcommand{\pred}[1]{\rm{#1}}$
$\newcommand{\predve}[1]{\mathbf{#1}}$
$\newcommand{\test}[1]{#1_*}$
$\newcommand{\testtest}[1]{#1_{**}}$
$\newcommand{\dd}{{\rm{d}}}$
$\newcommand{\lt}[1]{_{\text{#1}}}$
$\DeclareMathOperator{\diag}{diag}$
$\DeclareMathOperator{\cov}{cov}$
%% Cell type:markdown id:f96e3304 tags:
# Imports, helpers, setup
%% Cell type:code id:193b2f13 tags:
``` python
##%matplotlib notebook
%matplotlib widget
##%matplotlib inline
```
%% Cell type:code id:c11c1030 tags:
``` python
import math
from collections import defaultdict
from pprint import pprint

import torch
import gpytorch
from matplotlib import pyplot as plt
from matplotlib import is_interactive
import numpy as np
from torch.utils.data import TensorDataset, DataLoader

from utils import extract_model_params, plot_samples


torch.set_default_dtype(torch.float64)

torch.manual_seed(123)
```
%% Cell type:markdown id:460d4c3d tags:
# Generate toy 1D data
Now we generate 10x more points than in the ExactGP case, yet the inference
won't be much slower (exact GPs scale roughly as $N^3$). Note that the data we
use here is still tiny (1000 points is easy even for an exact GP), so the
method's usefulness cannot be fully demonstrated in this example
-- also, we don't even use a GPU yet :).
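As a rough, back-of-the-envelope illustration of why sparse methods help at all
(assumed textbook complexities, not a `gpytorch` benchmark): exact GP training
costs on the order of $N^3$, while sparse methods with $m$ inducing points cost
on the order of $N\,m^2$.

``` python
# Assumed asymptotic costs (not measured): exact GP ~ N^3, sparse GP ~ N * m^2.
N = 1000  # number of training points generated below
m = 100   # roughly 10% of N, as used for the inducing points below
print(f"rough cost ratio: {N**3 / (N * m**2):.0f}x")  # ~100x in favor of sparse
```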
%% Cell type:code id:f1a9647d tags:
``` python
def ground_truth(x, const):
    return torch.sin(x) * torch.exp(-0.2 * x) + const


def generate_data(x, gaps=[[1, 3]], const=None, noise_std=None):
    noise_dist = torch.distributions.Normal(loc=0, scale=noise_std)
    y = ground_truth(x, const=const) + noise_dist.sample(
        sample_shape=(len(x),)
    )
    msk = torch.tensor([True] * len(x))
    if gaps is not None:
        for g in gaps:
            msk = msk & ~((x > g[0]) & (x < g[1]))
    return x[msk], y[msk], y


const = 5.0
noise_std = 0.1
x = torch.linspace(0, 4 * math.pi, 1000)
X_train, y_train, y_gt_train = generate_data(
    x, gaps=[[6, 10]], const=const, noise_std=noise_std
)
X_pred = torch.linspace(
    X_train[0] - 2, X_train[-1] + 2, 200, requires_grad=False
)
y_gt_pred = ground_truth(X_pred, const=const)

print(f"{X_train.shape=}")
print(f"{y_train.shape=}")
print(f"{X_pred.shape=}")

fig, ax = plt.subplots()
ax.scatter(X_train, y_train, marker="o", color="tab:blue", label="noisy data")
ax.plot(X_pred, y_gt_pred, ls="--", color="k", label="ground truth")
ax.legend()
```
%% Cell type:markdown id:4ad60c5e tags:
# Define GP model
The model follows [this
example](https://docs.gpytorch.ai/en/stable/examples/04_Variational_and_Approximate_GPs/SVGP_Regression_CUDA.html)
based on [Hensman et al., "Scalable Variational Gaussian Process Classification",
2015](https://proceedings.mlr.press/v38/hensman15.html). The model is
"sparse" since it works with a set of *inducing* points $(\ma Z, \ve u),
\ve u=f(\ma Z)$ which is much smaller than the train data $(\ma X, \ve y)$.
See also [the GPJax
docs](https://docs.jaxgaussianprocesses.com/_examples/uncollapsed_vi) for a
nice introduction.

We have the same hyper parameters as before:

* $\ell$ = `model.covar_module.base_kernel.lengthscale`
* $\sigma_n^2$ = `likelihood.noise_covar.noise`
* $s$ = `model.covar_module.outputscale`
* $m(\ve x) = c$ = `model.mean_module.constant`

plus additional ones, introduced by the approximations used:

* the learnable inducing points $\ma Z$ for the variational distribution
  $q_{\ve\psi}(\ve u)$
* learnable parameters of the variational distribution
  $q_{\ve\psi}(\ve u)=\mathcal N(\ve m_u, \ma S)$: the variational mean
  $\ve m_u$ and the covariance $\ma S$, parameterized in form of a lower
  triangular matrix $\ma L$ such that $\ma S=\ma L\,\ma L^\top$ (see the short
  sketch below)
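The lower-triangular parameterization in the last point is what keeps $\ma S$ a
valid (symmetric, positive semi-definite) covariance during unconstrained
optimization. A minimal sketch of the idea in plain `torch`; the names used here
(`m`, `L_raw`, ...) are illustrative only and not part of the `gpytorch` API,
which handles this internally via
`gpytorch.variational.CholeskyVariationalDistribution`:

``` python
import torch

m = 5                                          # number of inducing points (toy size)
L_raw = torch.randn(m, m, requires_grad=True)  # unconstrained parameters
L = torch.tril(L_raw)                          # keep only the lower triangle
S = L @ L.T                                    # S = L L^T is symmetric PSD by construction
print(torch.linalg.eigvalsh(S).min())          # smallest eigenvalue, >= 0 up to round-off
```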
%% Cell type:code id:444929a3 tags:
``` python
class ApproxGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, Z):
        # Approximate inducing value posterior q(u), u = f(Z), Z = inducing
        # points (subset of X_train)
        variational_distribution = (
            gpytorch.variational.CholeskyVariationalDistribution(Z.size(0))
        )
        # Compute q(f(X)) from q(u)
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self,
            Z,
            variational_distribution,
            learn_inducing_locations=True,
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)


likelihood = gpytorch.likelihoods.GaussianLikelihood()

n_train = len(X_train)
# Start values for inducing points Z, use 10% random sub-sample of X_train.
ind_points_fraction = 0.1
ind_idxs = torch.randperm(n_train)[: int(n_train * ind_points_fraction)]
model = ApproxGPModel(Z=X_train[ind_idxs])
```
%% Cell type:code id:5861163d tags:
``` python
# Inspect the model
print(model)
```
%% Cell type:code id:98032457 tags:
``` python
# Inspect the likelihood. In contrast to ExactGP, the likelihood is not part of
# the GP model instance.
print(likelihood)
```
%% Cell type:code id:7ec97687 tags:
``` python
# Default start hyper params
print("model params:")
pprint(extract_model_params(model))
print("likelihood params:")
pprint(extract_model_params(likelihood))
```
%% Cell type:code id:167d584c tags:
``` python
# Set new start hyper params (scalars only)
model.mean_module.constant = 3.0
model.covar_module.base_kernel.lengthscale = 1.0
model.covar_module.outputscale = 1.0
likelihood.noise_covar.noise = 0.3
```
%% Cell type:markdown id:6704ce2f tags:
# Fit GP to data: optimize hyper params
Now we optimize the GP hyper parameters by doing a GP-specific variational
inference (VI), where we maximize not the log marginal likelihood (as in the
ExactGP case) but an ELBO ("evidence lower bound") objective, which is a lower
bound on the marginal likelihood (the "evidence"). In variational inference, an
ELBO objective shows up when minimizing the KL divergence between an
approximate posterior $q_{\ve\zeta}(w)$ and the true posterior

$$
p(w|y) = \frac{p(y|w)\,p(w)}{\int p(y|w)\,p(w)\,\dd w}
       = \frac{p(y|w)\,p(w)}{p(y)}
$$

with respect to the variational parameters $\ve\zeta$,

$$
\ve\zeta^* = \text{arg}\min_{\ve\zeta} D\lt{KL}(q_{\ve\zeta}(w)\,\Vert\, p(w|y)),
$$

to obtain the optimal variational parameters $\ve\zeta^*$ and approximate
$p(w|y)$ with $q_{\ve\zeta^*}(w)$.

In our case the two distributions are the approximate

$$q_{\ve\zeta}(\mathbf f)=\int p(\mathbf f|\ve u)\,q_{\ve\psi}(\ve u)\,\dd\ve u\quad(\text{"variational strategy"})$$

and the true posterior $p(\mathbf f|\mathcal D)$ over function values. We
optimize with respect to

$$\ve\zeta = [\ell, \sigma_n^2, s, c, \ve\psi]$$

with

$$\ve\psi = [\ve m_u, \ma Z, \ma L]$$

the parameters of the variational distribution $q_{\ve\psi}(\ve u)$.

In addition, we perform a stochastic optimization by using a deep learning type
mini-batch loop, hence "stochastic" variational inference (SVI). The latter
speeds up the optimization since we only look at a fraction of the data per
optimizer step to calculate an approximate loss gradient (`loss.backward()`).
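To make the mini-batch step concrete, the quantity that one optimizer step
estimates is, up to an overall constant scaling, the following (a sketch,
assuming a random batch $B\subset\{1,\dots,N\}$ of size $|B|$):

$$
\frac{N}{|B|}\sum_{i\in B}\mathbb E_{q_{\ve\zeta}(f_i)}\left[\ln p(y_i|f_i)\right]
- D\lt{KL}(q_{\ve\psi}(\ve u)\,\Vert\, p(\ve u))
$$

The data term decomposes over training points, which is what makes
mini-batching possible, while the KL term does not depend on the batch at all.
`gpytorch.mlls.VariationalELBO(likelihood, model, num_data=...)` takes care of
the relative weighting of the two terms (hence the `num_data` argument); in the
loop below we only pass it the current batch and minimize the negative of its
output.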
%% Cell type:code id:f39d86f8 tags:
``` python
# Train mode
model.train()
likelihood.train()

optimizer = torch.optim.Adam(
    [dict(params=model.parameters()), dict(params=likelihood.parameters())],
    lr=0.1,
)
loss_func = gpytorch.mlls.VariationalELBO(
    likelihood, model, num_data=X_train.shape[0]
)

train_dl = DataLoader(
    TensorDataset(X_train, y_train), batch_size=128, shuffle=True
)

n_iter = 200
history = defaultdict(list)
for i_iter in range(n_iter):
    for i_batch, (X_batch, y_batch) in enumerate(train_dl):
        batch_history = defaultdict(list)
        optimizer.zero_grad()
        loss = -loss_func(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()
        param_dct = dict()
        param_dct.update(extract_model_params(model, try_item=True))
        param_dct.update(extract_model_params(likelihood, try_item=True))
        for p_name, p_val in param_dct.items():
            if isinstance(p_val, float):
                batch_history[p_name].append(p_val)
        batch_history["loss"].append(loss.item())
    for p_name, p_lst in batch_history.items():
        history[p_name].append(np.mean(p_lst))
    if (i_iter + 1) % 10 == 0:
        print(f"iter {i_iter + 1}/{n_iter}, {loss=:.3f}")
```
%% Cell type:code id:dc39b5d9 tags:
``` python
# Plot scalar hyper params and loss (ELBO) convergence
ncols = len(history)
fig, axs = plt.subplots(
    ncols=ncols, nrows=1, figsize=(ncols * 3, 3), layout="compressed"
)
with torch.no_grad():
    for ax, (p_name, p_lst) in zip(axs, history.items()):
        ax.plot(p_lst)
        ax.set_title(p_name)
        ax.set_xlabel("iterations")
```
%% Cell type:code id:c4907e39 tags:
``` python
# Values of optimized hyper params
print("model params:")
pprint(extract_model_params(model))
print("likelihood params:")
pprint(extract_model_params(likelihood))
```
%% Cell type:markdown id:4ae35e11 tags:
# Run prediction
%% Cell type:code id:629b4658 tags:
``` python
# Evaluation (predictive posterior) mode
model.eval()
likelihood.eval()

with torch.no_grad():
    M = 10
    post_pred_f = model(X_pred)
    post_pred_y = likelihood(model(X_pred))

    fig, axs = plt.subplots(ncols=2, figsize=(14, 5), sharex=True, sharey=True)
    fig_sigmas, ax_sigmas = plt.subplots()
    for ii, (ax, post_pred, name, title) in enumerate(
        zip(
            axs,
            [post_pred_f, post_pred_y],
            ["f", "y"],
            ["epistemic uncertainty", "total uncertainty"],
        )
    ):
        yf_mean = post_pred.mean
        yf_samples = post_pred.sample(sample_shape=torch.Size((M,)))

        yf_std = post_pred.stddev
        lower = yf_mean - 2 * yf_std
        upper = yf_mean + 2 * yf_std
        ax.plot(
            X_train.numpy(),
            y_train.numpy(),
            "o",
            label="data",
            color="tab:blue",
        )
        ax.plot(
            X_pred.numpy(),
            yf_mean.numpy(),
            label="mean",
            color="tab:red",
            lw=2,
        )
        ax.plot(
            X_pred.numpy(),
            y_gt_pred.numpy(),
            label="ground truth",
            color="k",
            lw=2,
            ls="--",
        )
        ax.fill_between(
            X_pred.numpy(),
            lower.numpy(),
            upper.numpy(),
            label="confidence",
            color="tab:orange",
            alpha=0.3,
        )
        ax.set_title(f"confidence = {title}")
        if name == "f":
            sigma_label = r"epistemic: $\pm 2\sqrt{\mathrm{diag}(\Sigma_*)}$"
            zorder = 1
        else:
            sigma_label = (
                r"total: $\pm 2\sqrt{\mathrm{diag}(\Sigma_* + \sigma_n^2\,I)}$"
            )
            zorder = 0
        ax_sigmas.fill_between(
            X_pred.numpy(),
            lower.numpy(),
            upper.numpy(),
            label=sigma_label,
            color="tab:orange" if name == "f" else "tab:blue",
            alpha=0.5,
            zorder=zorder,
        )
        y_min = y_train.min()
        y_max = y_train.max()
        y_span = y_max - y_min
        ax.set_ylim([y_min - 0.3 * y_span, y_max + 0.3 * y_span])
        plot_samples(ax, X_pred, yf_samples, label="posterior pred. samples")
        if ii == 1:
            ax.legend()
    ax_sigmas.set_title("total vs. epistemic uncertainty")
    ax_sigmas.legend()
```
%% Cell type:markdown id:60abb7ec tags:
# Let's check the learned noise
%% Cell type:code id:d8a32f5b tags:
``` python
# Target noise to learn
print("data noise:", noise_std)

# The two below must be the same
print(
    "learned noise:",
    (post_pred_y.stddev**2 - post_pred_f.stddev**2).mean().sqrt().item(),
)
print(
    "learned noise:",
    np.sqrt(
        extract_model_params(likelihood, try_item=True)["noise_covar.noise"]
    ),
)
```
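The first "learned noise" estimate works because, with a Gaussian likelihood,
the predictive distribution over $y$ differs from the one over $f$ only by the
constant noise variance on the diagonal,

$$\diag(\ma\Sigma_* + \sigma_n^2\,\ma I) - \diag(\ma\Sigma_*) = \sigma_n^2\,\ve 1,$$

so averaging the difference of the squared stddevs over the prediction grid and
taking the square root recovers $\sigma_n$. The second estimate simply reads off
the learned likelihood parameter directly.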
%% Cell type:code id:0637729a tags:
``` python
# When running as script
if not is_interactive():
    plt.show()
```
...
...
BLcourse2.3/03_one_dim_SVI.py (+26 −6)
...
...
@@ -25,6 +25,8 @@
# $\newcommand{\predve}[1]{\mathbf{#1}}$
# $\newcommand{\test}[1]{#1_*}$
# $\newcommand{\testtest}[1]{#1_{**}}$
# $\newcommand{\dd}{{\rm{d}}}$
# $\newcommand{\lt}[1]{_{\text{#1}}}$
# $\DeclareMathOperator{\diag}{diag}$
# $\DeclareMathOperator{\cov}{cov}$
...
...
@@ -108,6 +110,9 @@ ax.legend()
# 2015](https://proceedings.mlr.press/v38/hensman15.html). The model is
# "sparse" since it works with a set of *inducing* points $(\ma Z, \ve u),
# \ve u=f(\ma Z)$ which is much smaller than the train data $(\ma X, \ve y)$.
# See also [the GPJax
# docs](https://docs.jaxgaussianprocesses.com/_examples/uncollapsed_vi) for a
# nice introduction.
#
# We have the same hyper parameters as before
#
...
...
@@ -183,16 +188,31 @@ likelihood.noise_covar.noise = 0.3
# # Fit GP to data: optimize hyper params
#
# Now we optimize the GP hyper parameters by doing a GP-specific variational inference (VI),
# where we optimize not the log marginal likelihood (ExactGP case),
# but an ELBO (evidence lower bound) objective
# Now we optimize the GP hyper parameters by doing a GP-specific variational
# inference (VI), where we don't maximize the log marginal likelihood (ExactGP
# case), but an ELBO ("evidence lower bound") objective -- a lower bound on the
# marginal likelihood (the "evidence"). In variational inference, an ELBO objective
# shows up when minimizing the KL divergence between
# an approximate and the true posterior
#
# $$
# \max_\ve\zeta\left(\mathbb E_{q_{\ve\psi}(\ve u)}\left[\ln p(\ve y|\ve u) \right] -
# D_{\text{KL}}(q_{\ve\psi}(\ve u)\Vert p(\ve u))\right)
# p(w|y) = \frac{p(y|w)\,p(w)}{\int p(y|w)\,p(w)\,\dd w}
#        = \frac{p(y|w)\,p(w)}{p(y)}
# $$
#
# with respect to
# $$
# \ve\zeta^* = \text{arg}\min_{\ve\zeta} D\lt{KL}(q_{\ve\zeta}(w)\,\Vert\, p(w|y))
# $$
#
# to obtain the optimal variational parameters $\ve\zeta^*$ to approximate
# $p(w|y)$ with $q_{\ve\zeta^*}(w)$.
#
# In our case the two distributions are the approximate
#
# $$q_{\ve\zeta}(\mathbf f)=\int p(\mathbf f|\ve u)\,q_{\ve\psi}(\ve u)\,\dd\ve u\quad(\text{"variational strategy"})$$
#
# and the true $p(\mathbf f|\mathcal D)$ posterior over function values. We
# optimize with respect to
#
# $$\ve\zeta = [\ell, \sigma_n^2, s, c, \ve\psi] $$
#
...
...