diff --git a/BLcourse2.3/01_one_dim.ipynb b/BLcourse2.3/01_one_dim.ipynb
index 97a6a5892e3b9065fa3a04b5e2f0de588ddfdedf..7c993e8f48af0f824f9bc92a6af1fde90023830b 100644
--- a/BLcourse2.3/01_one_dim.ipynb
+++ b/BLcourse2.3/01_one_dim.ipynb
@@ -3,7 +3,9 @@
   {
    "cell_type": "markdown",
    "id": "2a8ed8ae",
-   "metadata": {},
+   "metadata": {
+    "lines_to_next_cell": 2
+   },
    "source": [
     "# Notation\n",
     "$\\newcommand{\\ve}[1]{\\mathit{\\boldsymbol{#1}}}$\n",
@@ -25,6 +27,19 @@
     "$\\ma X$ to keep the notation consistent with the slides."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8a7cc3c8",
+   "metadata": {},
+   "source": [
+    "# About\n",
+    "\n",
+    "In this notebook, we explore the material presented in [the\n",
+    "slides](https://doi.org/10.6084/m9.figshare.25988176) with code, using the\n",
+    "[gpytorch](https://gpytorch.ai) library. We cover exact GP inference and\n",
+    "hyper parameter optimization using the marginal likelihood."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "ae440984",
@@ -226,7 +241,7 @@
    "outputs": [],
    "source": [
     "# Default start hyper params\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -244,7 +259,7 @@
     "model.covar_module.outputscale = 1.0\n",
     "model.likelihood.noise_covar.noise = 1e-3\n",
     "\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -256,7 +271,7 @@
     "\n",
     "We sample a number of functions $f_m, m=1,\\ldots,M$ from the GP prior and\n",
     "evaluate them at all $\\ma X$ = `X_pred` points, of which we have $N=200$. So\n",
-    "we effectively generate samples from $p(\\predve f|\\ma X) = \\mathcal N(\\ve\n",
+    "we effectively generate samples from `pri_f` =  $p(\\predve f|\\ma X) = \\mathcal N(\\ve\n",
     "c, \\ma K)$. Each sampled vector $\\predve f\\in\\mathbb R^{N}$ represents a\n",
     "sampled *function* $f$ evaluated the $N=200$ points in $\\ma X$. The\n",
     "covariance (kernel) matrix is $\\ma K\\in\\mathbb R^{N\\times N}$. Its diagonal\n",
@@ -474,7 +489,7 @@
     "    optimizer.step()\n",
     "    if (ii + 1) % 10 == 0:\n",
     "        print(f\"iter {ii + 1}/{n_iter}, {loss=:.3f}\")\n",
-    "    for p_name, p_val in extract_model_params(model).items():\n",
+    "    for p_name, p_val in extract_model_params(model, try_item=True).items():\n",
     "        history[p_name].append(p_val)\n",
     "    history[\"loss\"].append(loss.item())"
    ]
@@ -488,11 +503,14 @@
    "source": [
     "# Plot hyper params and loss (negative log marginal likelihood) convergence\n",
     "ncols = len(history)\n",
-    "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 5, 5))\n",
-    "for ax, (p_name, p_lst) in zip(axs, history.items()):\n",
-    "    ax.plot(p_lst)\n",
-    "    ax.set_title(p_name)\n",
-    "    ax.set_xlabel(\"iterations\")"
+    "fig, axs = plt.subplots(\n",
+    "    ncols=ncols, nrows=1, figsize=(ncols * 3, 3), layout=\"compressed\"\n",
+    ")\n",
+    "with torch.no_grad():\n",
+    "    for ax, (p_name, p_lst) in zip(axs, history.items()):\n",
+    "        ax.plot(p_lst)\n",
+    "        ax.set_title(p_name)\n",
+    "        ax.set_xlabel(\"iterations\")"
    ]
   },
   {
@@ -503,7 +521,7 @@
    "outputs": [],
    "source": [
     "# Values of optimized hyper params\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -548,7 +566,7 @@
     "    post_pred_f = model(X_pred)\n",
     "    post_pred_y = likelihood(model(X_pred))\n",
     "\n",
-    "    fig, axs = plt.subplots(ncols=2, figsize=(12, 5))\n",
+    "    fig, axs = plt.subplots(ncols=2, figsize=(14, 5), sharex=True, sharey=True)\n",
     "    fig_sigmas, ax_sigmas = plt.subplots()\n",
     "    for ii, (ax, post_pred, name, title) in enumerate(\n",
     "        zip(\n",
@@ -668,7 +686,9 @@
     "print(\n",
     "    \"learned noise:\",\n",
     "    np.sqrt(\n",
-    "        extract_model_params(model, raw=False)[\"likelihood.noise_covar.noise\"]\n",
+    "        extract_model_params(model, try_item=True)[\n",
+    "            \"likelihood.noise_covar.noise\"\n",
+    "        ]\n",
     "    ),\n",
     ")"
    ]
@@ -694,6 +714,18 @@
    "display_name": "bayes-ml-course",
    "language": "python",
    "name": "bayes-ml-course"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.3"
   }
  },
  "nbformat": 4,
diff --git a/BLcourse2.3/02_two_dim.ipynb b/BLcourse2.3/02_two_dim.ipynb
index fdebd011672d5e62a2939579ae935f2658cd5463..9ae348f938d2f337bd74ce53ed79520695bb6fc7 100644
--- a/BLcourse2.3/02_two_dim.ipynb
+++ b/BLcourse2.3/02_two_dim.ipynb
@@ -5,7 +5,21 @@
    "id": "8c19b309",
    "metadata": {},
    "source": [
-    "In this notebook, we use a GP to fit a 2D data set."
+    "# About\n",
+    "\n",
+    "In this notebook, we use a GP to fit a 2D data set. We use the same ExactGP\n",
+    "machinery as in the 1D case and show how GPs can be used for 2D interpolation\n",
+    "(when data is free of noise) or regression (noisy data). Think of this as a\n",
+    "toy geospatial data setting. Actually, in geostatistics, Gaussian process\n",
+    "regression is known as [Kriging](https://en.wikipedia.org/wiki/Kriging).\n",
+    "$\\newcommand{\\ve}[1]{\\mathit{\\boldsymbol{#1}}}$\n",
+    "$\\newcommand{\\ma}[1]{\\mathbf{#1}}$\n",
+    "$\\newcommand{\\pred}[1]{\\rm{#1}}$\n",
+    "$\\newcommand{\\predve}[1]{\\mathbf{#1}}$\n",
+    "$\\newcommand{\\test}[1]{#1_*}$\n",
+    "$\\newcommand{\\testtest}[1]{#1_{**}}$\n",
+    "$\\DeclareMathOperator{\\diag}{diag}$\n",
+    "$\\DeclareMathOperator{\\cov}{cov}$"
    ]
   },
   {
@@ -61,7 +75,14 @@
     "lines_to_next_cell": 2
    },
    "source": [
-    "# Generate toy 2D data"
+    "# Generate toy 2D data\n",
+    "\n",
+    "Our ground truth function is $f(\\ve x) = \\sin(r) / r$ with $\\ve x =\n",
+    "[x_0,x_1] \\in\\mathbb R^2$ and the radial distance\n",
+    "$r=\\sqrt{\\ve x^\\top\\,\\ve x}$, also known as \"Mexican hat\" function, which is a\n",
+    "radial wave-like pattern which decays with distance from the center $\\ve\n",
+    "x=\\ve 0$. We generate data by random sampling 2D points $\\ve x_i$ and calculating\n",
+    "$y_i = f(\\ve x_i)$, optionally adding Gaussian noise further down."
    ]
   },
   {
@@ -289,6 +310,15 @@
     "ax.set_ylabel(\"X_1\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "b7ff9e6e",
+   "metadata": {},
+   "source": [
+    "The gray surface is the ground truth function. The blue points are the\n",
+    "training data."
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "a00bf4e4",
@@ -353,7 +383,7 @@
    "outputs": [],
    "source": [
     "# Default start hyper params\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -369,7 +399,7 @@
     "model.covar_module.outputscale = 8.0\n",
     "model.likelihood.noise_covar.noise = 0.1\n",
     "\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -403,7 +433,7 @@
     "    optimizer.step()\n",
     "    if (ii + 1) % 10 == 0:\n",
     "        print(f\"iter {ii + 1}/{n_iter}, {loss=:.3f}\")\n",
-    "    for p_name, p_val in extract_model_params(model).items():\n",
+    "    for p_name, p_val in extract_model_params(model, try_item=True).items():\n",
     "        history[p_name].append(p_val)\n",
     "    history[\"loss\"].append(loss.item())"
    ]
@@ -416,11 +446,14 @@
    "outputs": [],
    "source": [
     "ncols = len(history)\n",
-    "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 5, 5))\n",
-    "for ax, (p_name, p_lst) in zip(axs, history.items()):\n",
-    "    ax.plot(p_lst)\n",
-    "    ax.set_title(p_name)\n",
-    "    ax.set_xlabel(\"iterations\")"
+    "fig, axs = plt.subplots(\n",
+    "    ncols=ncols, nrows=1, figsize=(ncols * 3, 3), layout=\"compressed\"\n",
+    ")\n",
+    "with torch.no_grad():\n",
+    "    for ax, (p_name, p_lst) in zip(axs, history.items()):\n",
+    "        ax.plot(p_lst)\n",
+    "        ax.set_title(p_name)\n",
+    "        ax.set_xlabel(\"iterations\")"
    ]
   },
   {
@@ -431,7 +464,7 @@
    "outputs": [],
    "source": [
     "# Values of optimized hyper params\n",
-    "pprint(extract_model_params(model, raw=False))"
+    "pprint(extract_model_params(model))"
    ]
   },
   {
@@ -483,9 +516,42 @@
    "id": "bff55240",
    "metadata": {},
    "source": [
-    "When `use_noise=False`, then the GP's prediction is an almost pefect\n",
+    "When `use_noise=False`, then the GP's prediction is an almost perfect\n",
     "reconstruction of the ground truth function (in-distribution, so where we\n",
-    "have data)."
+    "have data).\n",
+    "In this case, the plot makes the GP prediction look like a perfect\n",
+    "*interpolation* of the noise-free data, so $\\test{\\ve\\mu} = \\ve y$ at the\n",
+    "train points $\\test{\\ma X} = \\ma X$. This\n",
+    "would be true if our GP model had exactly zero noise, so the likelihood's\n",
+    "$\\sigma_n^2$ would be zero. However `print(model`)\n",
+    "\n",
+    "```\n",
+    "ExactGPModel(\n",
+    " (likelihood): GaussianLikelihood(\n",
+    "   (noise_covar): HomoskedasticNoise(\n",
+    "     (raw_noise_constraint): GreaterThan(1.000E-04)\n",
+    "   )\n",
+    " )\n",
+    " ...\n",
+    " ```\n",
+    "\n",
+    "shows that actually the min value is $10^{-4}$, so we technically always have\n",
+    "a regression setting, just with very small noise. The reason is that in the\n",
+    "GP equations, we have\n",
+    "\n",
+    "$$\\test{\\ve\\mu} = \\test{\\ma K}\\,\\left(\\ma K+\\sigma_n^2\\,\\ma I_N\\right)^{-1}\\,\\ve y$$\n",
+    "\n",
+    "where $\\sigma_n^2$ acts as a *regularization* parameter (also called \"jitter\n",
+    "term\" sometimes), which improves the\n",
+    "numerical stability of the linear system solve step\n",
+    "\n",
+    "$$\\left(\\ma K+\\sigma_n^2\\,\\ma I_N\\right)^{-1}\\,\\ve y\\:.$$\n",
+    "\n",
+    "Also we always keep $\\sigma_n^2$ as hyper parameter that we learn, and the\n",
+    "smallest value the hyper parameter optimization can reach is $10^{-4}$.\n",
+    "\n",
+    "While 3D plots are fun, they are not optimal for judging how well\n",
+    "the GP model represents the ground truth function."
    ]
   },
   {
@@ -493,7 +559,10 @@
    "id": "591e453d",
    "metadata": {},
    "source": [
-    "# Plot difference to ground truth and uncertainty"
+    "# Plot difference to ground truth and uncertainty\n",
+    "\n",
+    "Let's use contour plots to visualize the difference between GP prediction and\n",
+    "ground truth, as well as epistemic, total and aleatoric uncertainty."
    ]
   },
   {
@@ -504,7 +573,9 @@
    "outputs": [],
    "source": [
     "ncols = 4\n",
-    "fig, axs = plt.subplots(ncols=ncols, nrows=1, figsize=(ncols * 7, 5))\n",
+    "fig, axs = plt.subplots(\n",
+    "    ncols=ncols, nrows=1, figsize=(ncols * 5, 4), layout=\"compressed\"\n",
+    ")\n",
     "\n",
     "vmax = post_pred_y.stddev.max()\n",
     "cs = []\n",
@@ -590,7 +661,9 @@
     "print(\n",
     "    \"learned noise:\",\n",
     "    np.sqrt(\n",
-    "        extract_model_params(model, raw=False)[\"likelihood.noise_covar.noise\"]\n",
+    "        extract_model_params(model, try_item=True)[\n",
+    "            \"likelihood.noise_covar.noise\"\n",
+    "        ]\n",
     "    ),\n",
     ")"
    ]
@@ -631,20 +704,17 @@
     "* Exercise 3: `use_noise=True`, `use_gap=True`\n",
     "  * in-distribution (where we have data)\n",
     "    * The distinction between\n",
-    "      epistemic and aleatoric in the way we define it is less meaningful,\n",
+    "      epistemic and aleatoric uncertainty in the way we define it is less meaningful,\n",
     "      hence, `f_std` doesn't correlate well with `y_pred - y_true`. The\n",
     "      reason is that the noise $\\sigma_n$ shows up in two parts: (a) in the\n",
     "      equation of $\\test{\\ma\\Sigma}$ itself, so the \"epistemic\" uncertainty\n",
     "      `f_std` = $\\sqrt{\\diag\\test{\\ma\\Sigma}}$ is bigger just because we have\n",
     "      noise (regression) and (b) we add it in $\\sqrt{\\diag(\\test{\\ma\\Sigma} +\n",
     "      \\sigma_n^2\\,\\ma I_N)}$ to get the total `y_std`\n",
-    "    * We learn the value of `noise_std` ($\\sigma_n$) quite well and add **its\n",
-    "      square** as a constant ($\\test{\\ma\\Sigma} + \\sigma_n^2\\,\\ma I_N$). The\n",
-    "      `y_std` plot looks like the `f_std` one, but shifted by a constant. But\n",
-    "      this is not the case because we compare standard deviations and not\n",
+    "    * The `y_std` plot looks like the `f_std` one, but shifted by a constant.\n",
+    "      But this is not the case because we compare standard deviations and not\n",
     "      variances, hence `y_std` - `f_std` is not constant, and in particular\n",
-    "      $\\neq \\sigma_n$, but both are in the same numerical range (0.15 vs.\n",
-    "      0.2).\n",
+    "      $\\neq \\sigma_n$, but both are in the same numerical range (0.15 vs. 0.2).\n",
     "  * out-of-distribution: `f_std` (epistemic) dominates"
    ]
   },
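+  {
+   "cell_type": "markdown",
+   "id": "c4d5e6f7",
+   "metadata": {},
+   "source": [
+    "A small numerical check of the `y_std` vs. `f_std` point above (the\n",
+    "variance values are made up for illustration, not taken from the trained\n",
+    "model):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "sigma_n = 0.2\n",
+    "f_var = np.array([0.01, 0.05, 0.1])  # some epistemic variances diag(Sigma_*)\n",
+    "f_std = np.sqrt(f_var)\n",
+    "y_std = np.sqrt(f_var + sigma_n**2)  # total std, not f_std + sigma_n\n",
+    "print(y_std - f_std)  # not constant, and != sigma_n\n",
+    "```"
+   ]
+  },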
@@ -653,7 +723,7 @@
    "id": "08e949a2",
    "metadata": {},
    "source": [
-    "Exercises\n",
+    "# Exercises\n",
     "\n",
     "Go back up, switch on the settings for Exercise 2 and re-run the notebook.\n",
     "Same with Exercise 3."
@@ -716,6 +786,18 @@
    "display_name": "Python 3 (ipykernel)",
    "language": "python",
    "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.3"
   }
  },
  "nbformat": 4,
diff --git a/BLcourse2.3/03_one_dim_SVI.ipynb b/BLcourse2.3/03_one_dim_SVI.ipynb
index cdb7789d7ffdad559216fcb5701592637467479f..0520fa0366033d88482c76a09f7bf646e7ba4eed 100644
--- a/BLcourse2.3/03_one_dim_SVI.ipynb
+++ b/BLcourse2.3/03_one_dim_SVI.ipynb
@@ -246,7 +246,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Set new start hyper params\n",
+    "# Set new start hyper params (scalars only)\n",
     "model.mean_module.constant = 3.0\n",
     "model.covar_module.base_kernel.lengthscale = 1.0\n",
     "model.covar_module.outputscale = 1.0\n",