
Section 1.1: Notations and Definitions

This book uses the following notations.

Definition 1.1 (Dual Norm). Let \(\|\cdot\|\) be a norm on \(\mathbb{R}^d\). Its dual norm \(\|\cdot\|_* : \mathbb{R}^d \rightarrow \mathbb{R}\) is defined as \[ \|\mathbf{y}\|_* := \sup \left\{ \mathbf{x}^{\top}\mathbf{y} : \|\mathbf{x}\| \le 1 \right\}. \]

Example 1.1: The Euclidean norm is self-dual: if \(\|\cdot\| = \|\cdot\|_2\), then \(\|\cdot\|_* = \|\cdot\|_2\), by the Cauchy–Schwarz inequality.

Example 1.2: The \(\ell_1\) and \(\ell_\infty\) norms are dual to each other: the supremum of \(\mathbf{x}^{\top}\mathbf{y}\) over \(\|\mathbf{x}\|_1 \le 1\) is attained at a signed coordinate vector, so \(\|\mathbf{y}\|_* = \|\mathbf{y}\|_\infty\).

Example 1.3: More generally, the dual of the \(\ell_p\) norm (\(p \ge 1\)) is the \(\ell_q\) norm with \(1/p + 1/q = 1\), by Hölder's inequality.
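As a quick numerical sanity check of Definition 1.1, the following Python sketch (illustrative, not part of the text) verifies two standard dual pairs: the \(\ell_2\) norm is self-dual, and the dual of \(\ell_1\) is \(\ell_\infty\).

```python
import math
import random

# Numerical check of two standard dual-norm pairs (Definition 1.1).
random.seed(0)
y = [random.gauss(0, 1) for _ in range(5)]

# l2 is self-dual: sup over ||x||_2 <= 1 of x^T y equals ||y||_2,
# attained at x = y / ||y||_2 (equality case of Cauchy-Schwarz).
norm2 = math.sqrt(sum(v * v for v in y))
x_star = [v / norm2 for v in y]
val2 = sum(a * b for a, b in zip(x_star, y))
assert math.isclose(val2, norm2)

# The dual of l1 is l_inf: sup over ||x||_1 <= 1 of x^T y equals
# max_i |y_i|, attained at a signed coordinate vector e_k.
k = max(range(len(y)), key=lambda i: abs(y[i]))
val1 = abs(y[k])  # equals (sign(y_k) e_k)^T y
assert math.isclose(val1, max(abs(v) for v in y))
```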

Definition 1.2 (Convex Set). A set \(\mathcal{C}\) is said to be convex if the line segment between any two points in \(\mathcal{C}\) lies entirely in \(\mathcal{C}\); that is, for all \(\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{C}\) and \(\theta \in [0,1]\),

\[ \theta \mathbf{x}_1 + (1-\theta)\mathbf{x}_2 \in \mathcal{C}. \]

Definition 1.3 (Convex Function). A function \(f : \mathbb{R}^d \rightarrow \mathbb{R}\) is convex if its domain \(\mathrm{dom}(f)\) is convex and

\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}), \quad \forall \mathbf{x}, \mathbf{y} \in \mathrm{dom}(f),\ \theta \in [0,1]. \]

It is strictly convex if strict inequality holds whenever \(\mathbf{x} \neq \mathbf{y}\) and \(\theta \in (0,1)\).

This inequality means that the graph of a convex function lies on or below the chord connecting any two points on the graph, like a bowl: a chopstick laid across the rim stays above the bowl's surface.

Lemma 1.1 (First-order condition)
Suppose \(f\) is differentiable (i.e., its gradient \(\nabla f\) exists at each point in \(\operatorname{dom}f\)). Then \(f\) is convex if and only if \(\operatorname{dom}f\) is convex and \[\begin{equation}\label{eq:first_order_convexity} f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}) \end{equation}\] holds for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\).

Proof.
We first show that, for a differentiable one-dimensional convex function \(\phi:\mathbb{R}\to\mathbb{R}\), \[\begin{align}\label{eqn:1conv} \phi(t)\ge\phi(s)+\phi'(s)(t-s). \end{align}\] By the definition of convexity, for any \(\alpha\in(0,1]\) we have \(\phi(s+\alpha(t-s))\le(1-\alpha)\phi(s)+\alpha\phi(t)\), which rearranges to \[ \phi(t)\ge\phi(s)+\frac{\phi(s+\alpha(t-s))-\phi(s)}{\alpha}. \] Taking the limit \(\alpha\to0\) yields (\(\ref{eqn:1conv}\)).

(\(\Rightarrow\)) Assume \(f\) is convex and differentiable on the open convex set \(\operatorname{dom}f\). Fix \(\mathbf{x}\in\operatorname{dom}f\) and any \(\mathbf{y}\in\operatorname{dom}f\). Define \(\phi:[0,1]\to\mathbb{R}\) by \[ \phi(t)=f\big(\mathbf{x}+t(\mathbf{y}-\mathbf{x})\big). \] Since \(f\) is convex and the map \(t\mapsto\mathbf{x}+t(\mathbf{y}-\mathbf{x})\) is affine, \(\phi\) is a convex function on \([0,1]\). For a convex (one-dimensional) differentiable function, we have proved that \[ \phi(1)\ge\phi(0)+\phi'(0)(1-0). \] By the chain rule, \[ \phi'(0)=\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Thus \[ f(\mathbf{y})=\phi(1)\ge\phi(0)+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})=f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \]

(\(\Leftarrow\)) Assume \(\operatorname{dom}f\) is convex and for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\), \[ f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Take any \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\) and \(\theta\in[0,1]\), and set \(\mathbf{z}=\theta\mathbf{x}+(1-\theta)\mathbf{y}\in\operatorname{dom}f\). Apply the assumption with \((\mathbf{x},\mathbf{z})\) and \((\mathbf{y},\mathbf{z})\): \[ f(\mathbf{x})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}),\qquad f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{y}-\mathbf{z}). \] Multiply the first by \(\theta\) and the second by \((1-\theta)\) and add: \[ \theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}\big(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})\big). \] Since \(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})=0\), we get \[ f(\mathbf{z})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y}), \] i.e., \(f(\theta\mathbf{x}+(1-\theta)\mathbf{y})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\). Hence \(f\) is convex.
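The inequality of Lemma 1.1 can also be checked numerically. The sketch below uses the convex function \(f(x) = x^4\) (an illustrative choice, not from the text) and verifies on sampled points that no tangent line exceeds the graph.

```python
import random

# Sanity check of the first-order condition (Lemma 1.1) for the
# convex function f(x) = x**4, with gradient f'(x) = 4*x**3:
# the tangent line at any point never lies above the graph.
def f(x):
    return x ** 4

def grad(x):
    return 4 * x ** 3

random.seed(1)
pts = [random.uniform(-2.0, 2.0) for _ in range(100)]
ok = all(f(y) >= f(x) + grad(x) * (y - x) - 1e-9
         for x in pts for y in pts)
assert ok
```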

Definition 1.4 (Subgradient). Let \(f\) be a convex (possibly non-differentiable) function. A vector \(\mathbf{v}\) is called a subgradient of \(f\) at \(\mathbf{x}\) if

\[ f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y} \in \mathrm{dom}(f). \]

The set of all subgradients of \(f\) at \(\mathbf{x}\) is denoted by \(\partial f(\mathbf{x})\).

When no confusion arises, we often write \[f(\mathbf y) \geq f(\mathbf x) + \partial f(\mathbf x)^{\top}(\mathbf y - \mathbf x), \quad \forall \mathbf x, \mathbf y\in\mathrm{dom}(f),\] where, by a slight abuse of notation, \(\partial f(\mathbf x)\) refers to some specific element of the subgradient set.

Example 1.4:

\(f(x) = [x]_+ = \max(0, x)\).
- At \(x=0\), \(\partial f(0) = [0,1]\)
- For \(x > 0\), \(\partial f(x) = \{1\}\)
- For \(x < 0\), \(\partial f(x) = \{0\}\)
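Example 1.4 can be verified numerically: every \(\xi \in [0,1]\) must satisfy the subgradient inequality \(f(y) \ge f(0) + \xi\, y\) at \(x = 0\). A minimal Python check:

```python
# Every xi in [0, 1] is a subgradient of f(x) = max(0, x) at x = 0:
# f(y) >= f(0) + xi * (y - 0) must hold for all y (Example 1.4).
def f(x):
    return max(0.0, x)

ys = [i / 10.0 - 5.0 for i in range(101)]   # grid on [-5, 5]
xis = [i / 10.0 for i in range(11)]         # candidate slopes in [0, 1]
ok = all(f(y) >= f(0.0) + xi * y for xi in xis for y in ys)
assert ok
```

For \(y > 0\) the inequality reads \(y \ge \xi y\), which needs \(\xi \le 1\); for \(y < 0\) it reads \(0 \ge \xi y\), which needs \(\xi \ge 0\). This is exactly why the subdifferential at the kink is the whole interval \([0,1]\).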

Definition 1.5 (Strongly Convex Function). A function \(f : \mathbb{R}^d \to \mathbb{R}\) is said to be \(\mu\)-strongly convex with respect to a norm \(\|\cdot\|\) if there exists a constant \(\mu > 0\) such that, for all \(\mathbf{x}, \mathbf{y} \in \mathrm{dom}(f)\) and any \(\mathbf{v} \in \partial f(\mathbf{x})\), \[ f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2. \]

Example 1.5:

The function \(f(\mathbf x) = \frac{1}{2}\|\mathbf x\|_2^2\) is 1-strongly convex with respect to the Euclidean norm \(\|\cdot\|_2\). This follows directly from the identity: \[ \frac{1}{2}\|\mathbf y\|_2^2 = \frac{1}{2}\|\mathbf x\|_2^2 + \mathbf x^{\top}(\mathbf y - \mathbf x) + \frac{1}{2}\|\mathbf x - \mathbf y\|_2^2, \] which satisfies the definition of strong convexity with parameter 1.
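For completeness, the identity can be verified by expanding the squared norm:

\[ \frac{1}{2}\|\mathbf x\|_2^2 + \mathbf x^{\top}(\mathbf y - \mathbf x) + \frac{1}{2}\|\mathbf x - \mathbf y\|_2^2 = \frac{1}{2}\|\mathbf x\|_2^2 + \mathbf x^{\top}\mathbf y - \|\mathbf x\|_2^2 + \frac{1}{2}\|\mathbf x\|_2^2 - \mathbf x^{\top}\mathbf y + \frac{1}{2}\|\mathbf y\|_2^2 = \frac{1}{2}\|\mathbf y\|_2^2. \]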

Definition 1.6 (Smooth Function). A function \(f:\mathbb R^d\to \mathbb R\) is called \(L\)-smooth with respect to a norm \(\|\cdot\|\) if it is differentiable and its gradient is \(L\)-Lipschitz continuous, i.e., there exists a constant \(L > 0\) such that, for any \(\mathbf x, \mathbf y\in\mathbb R^d\), we have \(\|\nabla f(\mathbf x) -\nabla f(\mathbf y)\|_*\leq L\|\mathbf x - \mathbf y\|\), which implies the quadratic bound \[\begin{align} |f(\mathbf x) - f(\mathbf y) - \nabla f(\mathbf y)^{\top}(\mathbf x- \mathbf y)|\leq \frac{L}{2}\|\mathbf x- \mathbf y\|^2. \end{align}\]
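As a numerical illustration of Definition 1.6 (with a standard example not taken from the text): the logistic loss \(f(x) = \log(1 + e^x)\) has \(f''(x) = \sigma(x)(1-\sigma(x)) \le \tfrac{1}{4}\), so it is \(\tfrac{1}{4}\)-smooth, and the quadratic bound can be checked on sampled points.

```python
import math
import random

# The logistic loss f(x) = log(1 + e^x) has f''(x) <= 1/4, so it is
# (1/4)-smooth; check the quadratic bound of Definition 1.6.
def f(x):
    return math.log1p(math.exp(x))

def grad(x):
    return 1.0 / (1.0 + math.exp(-x))  # the sigmoid function

L = 0.25
random.seed(2)
pts = [random.uniform(-5.0, 5.0) for _ in range(100)]
ok = all(
    abs(f(x) - f(y) - grad(y) * (x - y)) <= 0.5 * L * (x - y) ** 2 + 1e-9
    for x in pts for y in pts
)
assert ok
```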

Definition 1.7 (Bregman Divergence). Let \(\varphi:\Omega\rightarrow\mathbb{R}\) be a continuously differentiable, strictly convex function defined on a convex set \(\Omega\). The Bregman divergence induced by \(\varphi(\cdot)\) is defined as

\[\begin{equation*} D_\varphi(\mathbf{x}, \mathbf{y}) := \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y}). \end{equation*}\]

Example 1.6: Euclidean distance

\(\varphi(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2\) induces the Euclidean distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2. \]

Example 1.7: KL divergence

\(\varphi(\mathbf{x}) = \sum_i x_i \log x_i\) for \(\mathbf x\in\Delta_d\) (the probability simplex) induces the Kullback–Leibler (KL) divergence:

\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i x_i \log \frac{x_i}{y_i}. \]

Example 1.8: Itakura–Saito distance

\(\varphi(\mathbf{x}) = -\sum_i \log x_i\) for \(\mathbf x > 0\) induces the Itakura–Saito distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i \left( \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right). \]
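The three examples above can be checked against a generic implementation of Definition 1.7. The following Python sketch (illustrative, with hand-coded gradients for each \(\varphi\)) recovers each closed form on two points of the simplex:

```python
import math

# Generic Bregman divergence (Definition 1.7), checked against the
# closed forms of Examples 1.6-1.8.
def bregman(phi, grad, x, y):
    inner = sum(g * (a - b) for g, a, b in zip(grad(y), x, y))
    return phi(x) - phi(y) - inner

x = [0.2, 0.3, 0.5]
y = [0.5, 0.25, 0.25]   # both points lie on the probability simplex

# Euclidean distance: phi = (1/2)||.||_2^2, gradient = identity.
d_euc = bregman(lambda v: 0.5 * sum(t * t for t in v),
                lambda v: v, x, y)
assert math.isclose(d_euc, 0.5 * sum((a - b) ** 2 for a, b in zip(x, y)))

# KL divergence: phi = sum x_i log x_i; the "+1" terms in the
# gradient cancel because both points sum to one.
d_kl = bregman(lambda v: sum(t * math.log(t) for t in v),
               lambda v: [math.log(t) + 1.0 for t in v], x, y)
assert math.isclose(d_kl, sum(a * math.log(a / b) for a, b in zip(x, y)))

# Itakura-Saito distance: phi = -sum log x_i, gradient_i = -1/x_i.
d_is = bregman(lambda v: -sum(math.log(t) for t in v),
               lambda v: [-1.0 / t for t in v], x, y)
assert math.isclose(d_is, sum(a / b - math.log(a / b) - 1.0
                              for a, b in zip(x, y)))
```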
