diff --git a/BLcourse2.3/01_one_dim.ipynb b/BLcourse2.3/01_one_dim.ipynb index 97a6a5892e3b9065fa3a04b5e2f0de588ddfdedf..7c993e8f48af0f824f9bc92a6af1fde90023830b 100644 --- a/BLcourse2.3/01_one_dim.ipynb +++ b/BLcourse2.3/01_one_dim.ipynb @@ -3,7 +3,9 @@ { "cell_type": "markdown", "id": "2a8ed8ae", - "metadata": {}, + "metadata": { + "lines_to_next_cell": 2 + }, "source": [ "# Notation\n", "$\\newcommand{\\ve}[1]{\\mathit{\\boldsymbol{#1}}}$\n", @@ -25,6 +27,19 @@ "$\\ma X$ to keep the notation consistent with the slides." ] }, + { + "cell_type": "markdown", + "id": "8a7cc3c8", + "metadata": {}, + "source": [ + "# About\n", + "\n", + "In this notebook, we explore the material presented in [the\n", + "slides](https://doi.org/10.6084/m9.figshare.25988176) with code, using the\n", + "[gpytorch](https://gpytorch.ai) library. We cover exact GP inference and\n", + "hyper parameter optimization using the marginal likelihood." + ] + }, { "cell_type": "markdown", "id": "ae440984", @@ -226,7 +241,7 @@ "outputs": [], "source": [ "# Default start hyper params\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -244,7 +259,7 @@ "model.covar_module.outputscale = 1.0\n", "model.likelihood.noise_covar.noise = 1e-3\n", "\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -256,7 +271,7 @@ "\n", "We sample a number of functions $f_m, m=1,\\ldots,M$ from the GP prior and\n", "evaluate them at all $\\ma X$ = `X_pred` points, of which we have $N=200$. So\n", - "we effectively generate samples from $p(\\predve f|\\ma X) = \\mathcal N(\\ve\n", + "we effectively generate samples from `pri_f` = $p(\\predve f|\\ma X) = \\mathcal N(\\ve\n", "c, \\ma K)$. Each sampled vector $\\predve f\\in\\mathbb R^{N}$ represents a\n", "sampled *function* $f$ evaluated the $N=200$ points in $\\ma X$. The\n", "covariance (kernel) matrix is $\\ma K\\in\\mathbb R^{N\\times N}$. 
Its diagonal\n", @@ -474,7 +489,7 @@ " optimizer.step()\n", " if (ii + 1) % 10 == 0:\n", " print(f\"iter {ii + 1}/{n_iter}, {loss=:.3f}\")\n", - " for p_name, p_val in extract_model_params(model).items():\n", + " for p_name, p_val in extract_model_params(model, try_item=True).items():\n", " history[p_name].append(p_val)\n", " history[\"loss\"].append(loss.item())" ] @@ -488,11 +503,14 @@ "source": [ "# Plot hyper params and loss (negative log marginal likelihood) convergence\n", "ncols = len(history)\n", - "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 5, 5))\n", - "for ax, (p_name, p_lst) in zip(axs, history.items()):\n", - " ax.plot(p_lst)\n", - " ax.set_title(p_name)\n", - " ax.set_xlabel(\"iterations\")" + "fig, axs = plt.subplots(\n", + " ncols=ncols, nrows=1, figsize=(ncols * 3, 3), layout=\"compressed\"\n", + ")\n", + "with torch.no_grad():\n", + " for ax, (p_name, p_lst) in zip(axs, history.items()):\n", + " ax.plot(p_lst)\n", + " ax.set_title(p_name)\n", + " ax.set_xlabel(\"iterations\")" ] }, { @@ -503,7 +521,7 @@ "outputs": [], "source": [ "# Values of optimized hyper params\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -548,7 +566,7 @@ " post_pred_f = model(X_pred)\n", " post_pred_y = likelihood(model(X_pred))\n", "\n", - " fig, axs = plt.subplots(ncols=2, figsize=(12, 5))\n", + " fig, axs = plt.subplots(ncols=2, figsize=(14, 5), sharex=True, sharey=True)\n", " fig_sigmas, ax_sigmas = plt.subplots()\n", " for ii, (ax, post_pred, name, title) in enumerate(\n", " zip(\n", @@ -668,7 +686,9 @@ "print(\n", " \"learned noise:\",\n", " np.sqrt(\n", - " extract_model_params(model, raw=False)[\"likelihood.noise_covar.noise\"]\n", + " extract_model_params(model, try_item=True)[\n", + " \"likelihood.noise_covar.noise\"\n", + " ]\n", " ),\n", ")" ] @@ -694,6 +714,18 @@ "display_name": "bayes-ml-course", "language": "python", "name": "bayes-ml-course" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" } }, "nbformat": 4, diff --git a/BLcourse2.3/02_two_dim.ipynb b/BLcourse2.3/02_two_dim.ipynb index fdebd011672d5e62a2939579ae935f2658cd5463..9ae348f938d2f337bd74ce53ed79520695bb6fc7 100644 --- a/BLcourse2.3/02_two_dim.ipynb +++ b/BLcourse2.3/02_two_dim.ipynb @@ -5,7 +5,21 @@ "id": "8c19b309", "metadata": {}, "source": [ - "In this notebook, we use a GP to fit a 2D data set." + "# About\n", + "\n", + "In this notebook, we use a GP to fit a 2D data set. We use the same ExactGP\n", + "machinery as in the 1D case and show how GPs can be used for 2D interpolation\n", + "(when data is free of noise) or regression (noisy data). Think of this as a\n", + "toy geospatial data setting. 
Actually, in geostatistics, Gaussian process\n", + "regression is known as [Kriging](https://en.wikipedia.org/wiki/Kriging).\n", + "$\\newcommand{\\ve}[1]{\\mathit{\\boldsymbol{#1}}}$\n", + "$\\newcommand{\\ma}[1]{\\mathbf{#1}}$\n", + "$\\newcommand{\\pred}[1]{\\rm{#1}}$\n", + "$\\newcommand{\\predve}[1]{\\mathbf{#1}}$\n", + "$\\newcommand{\\test}[1]{#1_*}$\n", + "$\\newcommand{\\testtest}[1]{#1_{**}}$\n", + "$\\DeclareMathOperator{\\diag}{diag}$\n", + "$\\DeclareMathOperator{\\cov}{cov}$" ] }, { @@ -61,7 +75,14 @@ "lines_to_next_cell": 2 }, "source": [ - "# Generate toy 2D data" + "# Generate toy 2D data\n", + "\n", + "Our ground truth function is $f(\\ve x) = \\sin(r) / r$ with $\\ve x =\n", + "[x_0,x_1] \\in\\mathbb R^2$ and the radial distance\n", + "$r=\\sqrt{\\ve x^\\top\\,\\ve x}$, also known as \"Mexican hat\" function, which is a\n", + "radial wave-like pattern which decays with distance from the center $\\ve\n", + "x=\\ve 0$. We generate data by random sampling 2D points $\\ve x_i$ and calculating\n", + "$y_i = f(\\ve x_i)$, optionally adding Gaussian noise further down." ] }, { @@ -289,6 +310,15 @@ "ax.set_ylabel(\"X_1\")" ] }, + { + "cell_type": "markdown", + "id": "b7ff9e6e", + "metadata": {}, + "source": [ + "The gray surface is the ground truth function. The blue points are the\n", + "training data." + ] + }, { "cell_type": "markdown", "id": "a00bf4e4", @@ -353,7 +383,7 @@ "outputs": [], "source": [ "# Default start hyper params\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -369,7 +399,7 @@ "model.covar_module.outputscale = 8.0\n", "model.likelihood.noise_covar.noise = 0.1\n", "\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -403,7 +433,7 @@ " optimizer.step()\n", " if (ii + 1) % 10 == 0:\n", " print(f\"iter {ii + 1}/{n_iter}, {loss=:.3f}\")\n", - " for p_name, p_val in extract_model_params(model).items():\n", + " for p_name, p_val in extract_model_params(model, try_item=True).items():\n", " history[p_name].append(p_val)\n", " history[\"loss\"].append(loss.item())" ] @@ -416,11 +446,14 @@ "outputs": [], "source": [ "ncols = len(history)\n", - "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 5, 5))\n", - "for ax, (p_name, p_lst) in zip(axs, history.items()):\n", - " ax.plot(p_lst)\n", - " ax.set_title(p_name)\n", - " ax.set_xlabel(\"iterations\")" + "fig, axs = plt.subplots(\n", + " ncols=ncols, nrows=1, figsize=(ncols * 3, 3), layout=\"compressed\"\n", + ")\n", + "with torch.no_grad():\n", + " for ax, (p_name, p_lst) in zip(axs, history.items()):\n", + " ax.plot(p_lst)\n", + " ax.set_title(p_name)\n", + " ax.set_xlabel(\"iterations\")" ] }, { @@ -431,7 +464,7 @@ "outputs": [], "source": [ "# Values of optimized hyper params\n", - "pprint(extract_model_params(model, raw=False))" + "pprint(extract_model_params(model))" ] }, { @@ -483,9 +516,42 @@ "id": "bff55240", "metadata": {}, "source": [ - "When `use_noise=False`, then the GP's prediction is an almost pefect\n", + "When `use_noise=False`, then the GP's prediction is an almost perfect\n", "reconstruction of the ground truth function (in-distribution, so where we\n", - "have data)." + "have data).\n", + "In this case, the plot makes the GP prediction look like a perfect\n", + "*interpolation* of the noise-free data, so $\\test{\\ve\\mu} = \\ve y$ at the\n", + "train points $\\test{\\ma X} = \\ma X$. 
This\n", + "would be true if our GP model had exactly zero noise, so the likelihood's\n", + "$\\sigma_n^2$ would be zero. However `print(model`)\n", + "\n", + "```\n", + "ExactGPModel(\n", + " (likelihood): GaussianLikelihood(\n", + " (noise_covar): HomoskedasticNoise(\n", + " (raw_noise_constraint): GreaterThan(1.000E-04)\n", + " )\n", + " )\n", + " ...\n", + " ```\n", + "\n", + "shows that actually the min value is $10^{-4}$, so we technically always have\n", + "a regression setting, just with very small noise. The reason is that in the\n", + "GP equations, we have\n", + "\n", + "$$\\test{\\ve\\mu} = \\test{\\ma K}\\,\\left(\\ma K+\\sigma_n^2\\,\\ma I_N\\right)^{-1}\\,\\ve y$$\n", + "\n", + "where $\\sigma_n^2$ acts as a *regularization* parameter (also called \"jitter\n", + "term\" sometimes), which improves the\n", + "numerical stability of the linear system solve step\n", + "\n", + "$$\\left(\\ma K+\\sigma_n^2\\,\\ma I_N\\right)^{-1}\\,\\ve y\\:.$$\n", + "\n", + "Also we always keep $\\sigma_n^2$ as hyper parameter that we learn, and the\n", + "smallest value the hyper parameter optimization can reach is $10^{-4}$.\n", + "\n", + "While 3D plots are fun, they are not optimal for judging how well\n", + "the GP model represents the ground truth function." ] }, { @@ -493,7 +559,10 @@ "id": "591e453d", "metadata": {}, "source": [ - "# Plot difference to ground truth and uncertainty" + "# Plot difference to ground truth and uncertainty\n", + "\n", + "Let's use contour plots to visualize the difference between GP prediction and\n", + "ground truth, as well as epistemic, total and aleatoric uncertainty." ] }, { @@ -504,7 +573,9 @@ "outputs": [], "source": [ "ncols = 4\n", - "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 7, 5))\n", + "fig, axs = plt.subplots(\n", + " ncols=ncols, nrows=1, figsize=(ncols * 5, 4), layout=\"compressed\"\n", + ")\n", "\n", "vmax = post_pred_y.stddev.max()\n", "cs = []\n", @@ -590,7 +661,9 @@ "print(\n", " \"learned noise:\",\n", " np.sqrt(\n", - " extract_model_params(model, raw=False)[\"likelihood.noise_covar.noise\"]\n", + " extract_model_params(model, try_item=True)[\n", + " \"likelihood.noise_covar.noise\"\n", + " ]\n", " ),\n", ")" ] @@ -631,20 +704,17 @@ "* Exercise 3: `use_noise=True`, `use_gap=True`\n", " * in-distribution (where we have data)\n", " * The distinction between\n", - " epistemic and aleatoric in the way we define it is less meaningful,\n", + " epistemic and aleatoric uncertainty in the way we define it is less meaningful,\n", " hence, `f_std` doesn't correlate well with `y_pred - y_true`. The\n", " reason is that the noise $\\sigma_n$ shows up in two parts: (a) in the\n", " equation of $\\test{\\ma\\Sigma}$ itself, so the \"epistemic\" uncertainty\n", " `f_std` = $\\sqrt{\\diag\\test{\\ma\\Sigma}}$ is bigger just because we have\n", " noise (regression) and (b) we add it in $\\sqrt{\\diag(\\test{\\ma\\Sigma} +\n", " \\sigma_n^2\\,\\ma I_N)}$ to get the total `y_std`\n", - " * We learn the value of `noise_std` ($\\sigma_n$) quite well and add **its\n", - " square** as a constant ($\\test{\\ma\\Sigma} + \\sigma_n^2\\,\\ma I_N$). The\n", - " `y_std` plot looks like the `f_std` one, but shifted by a constant. 
But\n", - " this is not the case because we compare standard deviations and not\n", + " * The `y_std` plot looks like the `f_std` one, but shifted by a constant.\n", + " But this is not the case because we compare standard deviations and not\n", " variances, hence `y_std` - `f_std` is not constant, and in particular\n", - " $\\neq \\sigma_n$, but both are in the same numerical range (0.15 vs.\n", - " 0.2).\n", + " $\\neq \\sigma_n$, but both are in the same numerical range (0.15 vs. 0.2).\n", " * out-of-distribution: `f_std` (epistemic) dominates" ] }, @@ -653,7 +723,7 @@ "id": "08e949a2", "metadata": {}, "source": [ - "Exercises\n", + "# Exercises\n", "\n", "Go back up, switch on the settings for Exercise 2 and re-run the notebook.\n", "Same with Exercise 3." @@ -716,6 +786,18 @@ "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.13.3" } }, "nbformat": 4, diff --git a/BLcourse2.3/03_one_dim_SVI.ipynb b/BLcourse2.3/03_one_dim_SVI.ipynb index cdb7789d7ffdad559216fcb5701592637467479f..0520fa0366033d88482c76a09f7bf646e7ba4eed 100644 --- a/BLcourse2.3/03_one_dim_SVI.ipynb +++ b/BLcourse2.3/03_one_dim_SVI.ipynb @@ -246,7 +246,7 @@ "metadata": {}, "outputs": [], "source": [ - "# Set new start hyper params\n", + "# Set new start hyper params (scalars only)\n", "model.mean_module.constant = 3.0\n", "model.covar_module.base_kernel.lengthscale = 1.0\n", "model.covar_module.outputscale = 1.0\n",