Section 1.1: Notations and Definitions
This book uses the following notations.
Let us denote by \(\|\cdot\|_2\) the Euclidean norm, and by \(\|\cdot\|\) a general norm.
For a differentiable function \(f\), let \(\nabla f(\mathbf{x})\) denote its gradient at \(\mathbf{x}\); for a (possibly non-differentiable) convex function \(f\), let \(\partial f(\mathbf{x})\) denote its subdifferential set at \(\mathbf{x}\).
Let \(\partial_1 f(\mathbf{w}, \mathbf{u})\) and \(\partial_2 f(\mathbf{w}, \mathbf{u})\) denote the partial subgradients of \(f\) with respect to the first variable \(\mathbf{w}\) and the second variable \(\mathbf{u}\), respectively.
Define the \(d\)-dimensional probability simplex as
\[ \Delta_d = \left\{ \mathbf{x} \in \mathbb{R}^d : x_i \geq 0\ \forall i,\ \sum_{i=1}^d x_i = 1 \right\}. \]
Let \(\mathbb{I}(\cdot)\) denote the standard indicator function, which returns 1 if the input condition is true and 0 otherwise. Let \(\mathbb{I}_{0-\infty}(\cdot)\) denote the zero-infinity indicator function, which returns 0 if the input condition is true and \(\infty\) otherwise.
Denote by \(\mathbf 1\) a vector of all ones. Let \(\mathbf{e}_i\) denote the standard basis vector with a 1 in the \(i\)-th coordinate and 0 in all other entries.
Let \(\mathbf x\sim\mathbb P\) denote a random variable that follows a distribution \(\mathbb P\).
\([n]\) denotes the set of all integers from \(1\) to \(n\), i.e., \([n]=\{1,\ldots, n\}\).
We use \(\langle\mathbf x, \mathbf y\rangle\) interchangeably with \(\mathbf x^{\top}\mathbf y\) to denote the inner product of two vectors.
\(\log(x)\) denotes the natural logarithm, i.e., the logarithm with base \(e\).
"w.r.t." is short for "with respect to".
"s.t." is short for "subject to".
Definition 1.1 (Dual Norm). Let \(\|\cdot\|\) be a norm on \(\mathbb{R}^d\). Its dual norm \(\|\cdot\|_* : \mathbb{R}^d \rightarrow \mathbb{R}\) is defined as \[ \|\mathbf{y}\|_* := \sup \left\{ \mathbf{x}^{\top}\mathbf{y} : \|\mathbf{x}\| \le 1 \right\}. \]
Example 1.1:
- \(\|\cdot\|_2\) is the dual norm of itself due to the Cauchy-Schwarz inequality: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2\).
Example 1.2:
- \(\|\cdot\|_\infty\) and \(\|\cdot\|_1\) are dual norms of each other: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_1 \|\mathbf{y}\|_\infty\).
Example 1.3:
- Let \(\|\mathbf{x}\|_A = \sqrt{\mathbf{x}^{\top}A\mathbf{x}}\) where \(A \succ 0\). Then the dual is \(\|\mathbf{y}\|_* = \sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\), because \[ \mathbf{x}^{\top}\mathbf{y} = \mathbf{x}^{\top}A^{1/2}A^{-1/2}\mathbf{y} \leq \|A^{1/2}\mathbf{x}\|_2 \|A^{-1/2}\mathbf{y}\|_2. \]
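The duality in Example 1.2 can be checked numerically. The sketch below (an informal illustration, not part of the text; it assumes NumPy) verifies that the supremum of \(\mathbf{x}^{\top}\mathbf{y}\) over the \(\ell_1\) unit ball equals \(\|\mathbf{y}\|_\infty\) and is attained at a signed standard basis vector \(\pm\mathbf{e}_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
y = rng.normal(size=d)

linf = np.max(np.abs(y))                          # ||y||_inf
# The sup of x^T y over the l1 ball is attained at some +/- e_i.
candidates = [s * np.eye(d)[i] for i in range(d) for s in (1.0, -1.0)]
attained = max(float(x @ y) for x in candidates)

# Random points on the l1 unit sphere never exceed ||y||_inf.
samples = rng.normal(size=(1000, d))
samples /= np.abs(samples).sum(axis=1, keepdims=True)
assert np.all(samples @ y <= linf + 1e-12)
assert np.isclose(attained, linf)
```

The same pattern (maximize \(\mathbf{x}^{\top}\mathbf{y}\) over the unit ball of the primal norm) can be used to probe the dual of any norm.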
Definition 1.2 (Convex Set). A set \(\mathcal{C}\) is said to be convex if the line segment between any two points in \(\mathcal{C}\) lies entirely in \(\mathcal{C}\); that is, for all \(\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{C}\) and \(\theta \in [0,1]\),
\[ \theta \mathbf{x}_1 + (1-\theta)\mathbf{x}_2 \in \mathcal{C}. \]
Definition 1.3 (Convex Function). A function \(f : \mathbb{R}^d \rightarrow \mathbb{R}\) is convex if its domain \(\mathrm{dom}(f)\) is convex and
\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}), \quad \forall \mathbf{x}, \mathbf{y} \in \mathrm{dom}(f),\ \theta \in [0,1]. \]
It is strictly convex if strict inequality holds whenever \(\mathbf{x} \neq \mathbf{y}\) and \(\theta \in (0,1)\).
This inequality means that the graph of a convex function lies below the chord connecting any two points on the graph, like a bowl: a chopstick placed across its rim stays above the surface of the bowl.
Lemma 1.1 (First-order condition). Suppose \(f\) is differentiable (i.e., its gradient \(\nabla f\) exists at each point in \(\operatorname{dom}f\)). Then \(f\) is convex if and only if \(\operatorname{dom}f\) is convex and
\[\begin{equation}\label{eq:first_order_convexity}
f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})
\end{equation}\]
holds for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\).
Proof.
We first prove that for a one-dimensional differentiable convex function \(\phi:\mathbb{R}\to\mathbb{R}\), we have
\[\begin{align}\label{eqn:1conv}
\phi(t)\ge\phi(s)+\phi'(s)(t-s).
\end{align}\]
By the definition of convexity, for any \(\alpha\in(0,1]\),
\[
\phi(s+\alpha(t-s))=\phi\big((1-\alpha)s+\alpha t\big)\le(1-\alpha)\phi(s)+\alpha\phi(t),
\]
which rearranges to
\[
\phi(t)\ge\phi(s)+\frac{\phi(s+\alpha(t-s))-\phi(s)}{\alpha}.
\]
Taking the limit \(\alpha\to0\) yields (\(\ref{eqn:1conv}\)).
(\(\Rightarrow\)) Assume \(f\) is convex and differentiable on the open convex set \(\operatorname{dom}f\). Fix \(\mathbf{x}\in\operatorname{dom}f\) and any \(\mathbf{y}\in\operatorname{dom}f\). Define \(\phi:[0,1]\to\mathbb{R}\) by \[ \phi(t)=f\big(\mathbf{x}+t(\mathbf{y}-\mathbf{x})\big). \] Since \(f\) is convex and the map \(t\mapsto\mathbf{x}+t(\mathbf{y}-\mathbf{x})\) is affine, \(\phi\) is a convex function on \([0,1]\). For a convex (one-dimensional) differentiable function, we have proved that \[ \phi(1)\ge\phi(0)+\phi'(0)(1-0). \] By the chain rule, \[ \phi'(0)=\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Thus \[ f(\mathbf{y})=\phi(1)\ge\phi(0)+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})=f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \]
(\(\Leftarrow\)) Assume \(\operatorname{dom}f\) is convex and for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\), \[ f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Take any \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\) and \(\theta\in[0,1]\), and set \(\mathbf{z}=\theta\mathbf{x}+(1-\theta)\mathbf{y}\in\operatorname{dom}f\). Apply the assumption with \((\mathbf{x},\mathbf{z})\) and \((\mathbf{y},\mathbf{z})\): \[ f(\mathbf{x})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}),\qquad f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{y}-\mathbf{z}). \] Multiply the first by \(\theta\) and the second by \((1-\theta)\) and add: \[ \theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}\big(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})\big). \] Since \(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})=0\), we get \[ f(\mathbf{z})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y}), \] i.e., \(f(\theta\mathbf{x}+(1-\theta)\mathbf{y})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\). Hence \(f\) is convex.■
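As a quick numerical illustration of Lemma 1.1 (an informal check, not part of the proof), one can sample random point pairs and verify the first-order inequality for a smooth convex function such as log-sum-exp, whose gradient is the softmax:

```python
import numpy as np

def f(x):
    # Convex log-sum-exp function f(x) = log(sum_i exp(x_i)).
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    # Its gradient is the softmax of x (shifted for numerical stability).
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
# Check f(y) >= f(x) + grad f(x)^T (y - x) on many random pairs.
ok = all(
    f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
    for x, y in (rng.normal(size=(2, 4)) for _ in range(1000))
)
assert ok
```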
Definition 1.4 (Subgradient). Let \(f\) be a convex (possibly non-differentiable) function. A vector \(\mathbf{v} \in \partial f(\mathbf{x})\) is called a subgradient of \(f\) at \(\mathbf{x}\) if
\[ f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y} \in \mathrm{dom}(f). \]
When no confusion arises, we often write \[f(\mathbf y) \geq f(\mathbf x) + \partial f(\mathbf x)^{\top}(\mathbf y - \mathbf x), \quad \forall \mathbf x, \mathbf y\in\mathrm{dom}(f),\] where \(\partial f(\mathbf x)\) refers to some specific element of the subdifferential set.
Example 1.4:
\(f(x) = [x]_+ = \max(0, x)\).
- At \(x=0\), \(\partial f(0) = [0,1]\)
- For \(x > 0\), \(\partial f(x) = \{1\}\)
- For \(x < 0\), \(\partial f(x) = \{0\}\)
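The three cases above can be encoded in a small helper (illustrative; the function name is our own) that returns the subdifferential of \([x]_+\) as a closed interval, together with a check that every element satisfies the subgradient inequality:

```python
def relu(x):
    return max(0.0, x)

def relu_subdifferential(x):
    """Subdifferential of f(x) = max(0, x), returned as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)  # at x = 0, every slope in [0, 1] supports the graph

# Any v in the subdifferential must satisfy f(y) >= f(x) + v * (y - x).
for x in (-2.0, 0.0, 3.0):
    lo, hi = relu_subdifferential(x)
    for v in (lo, hi, 0.5 * (lo + hi)):
        assert all(relu(y) >= relu(x) + v * (y - x) - 1e-12
                   for y in (-5.0, -0.1, 0.0, 0.1, 5.0))
```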
Definition 1.5 (Strongly Convex Function). A function \(f : \mathbb{R}^d \to \mathbb{R}\) is said to be \(\mu\)-strongly convex with respect to a norm \(\|\cdot\|\) if there exists a constant \(\mu > 0\) such that, for all \(\mathbf{x}, \mathbf{y} \in \mathrm{dom}(f)\) and any \(\mathbf{v} \in \partial f(\mathbf{x})\), \[ f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2. \]
Example 1.5:
The function \(f(\mathbf x) = \frac{1}{2}\|\mathbf x\|_2^2\) is 1-strongly convex with respect to the Euclidean norm \(\|\cdot\|_2\). This follows directly from the identity: \[ \frac{1}{2}\|\mathbf y\|_2^2 = \frac{1}{2}\|\mathbf x\|_2^2 + \mathbf x^{\top}(\mathbf y - \mathbf x) + \frac{1}{2}\|\mathbf x - \mathbf y\|_2^2, \] which satisfies the definition of strong convexity with parameter 1.
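The identity in Example 1.5 is pure algebra and can be confirmed numerically (an informal sketch assuming NumPy):

```python
import numpy as np

# Check: 1/2||y||^2 = 1/2||x||^2 + x^T(y - x) + 1/2||x - y||^2
# for random vectors x, y, matching the strong-convexity identity.
rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 6))
lhs = 0.5 * y @ y
rhs = 0.5 * x @ x + x @ (y - x) + 0.5 * (x - y) @ (x - y)
assert np.isclose(lhs, rhs)
```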
Definition 1.6 (Smooth Function). A function \(f:\mathbb R^d\to \mathbb R\) is called \(L\)-smooth with respect to a norm \(\|\cdot\|\) if it is differentiable and its gradient is \(L\)-Lipschitz continuous, i.e., there exists a positive real constant \(L\) such that, for any \(\mathbf x, \mathbf y\in\mathbb R^d\), we have \(\|\nabla f(\mathbf x) -\nabla f(\mathbf y)\|_*\leq L\|\mathbf x - \mathbf y\|\), which implies \[\begin{align} |f(\mathbf x) - f(\mathbf y) - \nabla f(\mathbf y)^{\top}(\mathbf x- \mathbf y)|\leq \frac{L}{2}\|\mathbf x- \mathbf y\|^2. \end{align}\]
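A standard concrete case: the quadratic \(f(\mathbf x)=\frac{1}{2}\mathbf x^{\top}A\mathbf x\) with symmetric \(A\succeq 0\) is \(L\)-smooth in the Euclidean norm with \(L=\lambda_{\max}(A)\), since \(\|\nabla f(\mathbf x)-\nabla f(\mathbf y)\|_2=\|A(\mathbf x-\mathbf y)\|_2\le\lambda_{\max}(A)\|\mathbf x-\mathbf y\|_2\). The sketch below (illustrative, assuming NumPy) checks both bounds on random points:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
A = B @ B.T                     # symmetric positive semidefinite
L = np.linalg.eigvalsh(A)[-1]   # largest eigenvalue = smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.normal(size=(2, 5))
    # Gradient Lipschitz bound.
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # Quadratic upper/lower bound from the definition.
    assert abs(f(x) - f(y) - grad(y) @ (x - y)) <= 0.5 * L * (x - y) @ (x - y) + 1e-9
```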
Definition 1.7 (Bregman Divergence).
Let \(\varphi:\Omega\rightarrow\mathbb{R}\) be a continuously differentiable, strictly convex function defined on a convex set \(\Omega\). The Bregman divergence induced by \(\varphi(\cdot)\) is defined as
\[\begin{equation*} D_\varphi(\mathbf{x}, \mathbf{y}) := \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y}). \end{equation*}\]
Example 1.6: Euclidean distance
\(\varphi(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2\) induces the Euclidean distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2. \]
Example 1.7: KL divergence
\(\varphi(\mathbf{x}) = \sum_i x_i \log x_i\) for \(\mathbf x\in\Delta_d\) induces the Kullback–Leibler (KL) divergence:
\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i x_i \log \frac{x_i}{y_i}. \]
Example 1.8: Itakura–Saito distance
\(\varphi(\mathbf{x}) = -\sum_i \log x_i\) for \(\mathbf x > 0\) induces the Itakura–Saito distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i \left(\frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1\right). \]
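Definition 1.7 translates directly into code. The generic helper below (illustrative; the function names are our own, assuming NumPy) evaluates \(D_\varphi\) from \(\varphi\) and \(\nabla\varphi\), and recovers Examples 1.6 and 1.7:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

rng = np.random.default_rng(4)

# phi(x) = 1/2 ||x||^2 recovers the squared Euclidean distance.
x, y = rng.normal(size=(2, 4))
d_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)
assert np.isclose(d_euc, 0.5 * (x - y) @ (x - y))

# phi(x) = sum_i x_i log x_i on the simplex recovers the KL divergence
# (the linear terms cancel because both points sum to 1).
p, q = rng.random((2, 4))
p, q = p / p.sum(), q / q.sum()
neg_entropy = lambda z: np.sum(z * np.log(z))
grad_ne = lambda z: np.log(z) + 1.0
d_kl = bregman(neg_entropy, grad_ne, p, q)
assert np.isclose(d_kl, np.sum(p * np.log(p / q)))
```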