Section 1.1: Notations and Definitions
This book uses the following notations.
Let us denote by \(\|\cdot\|_2\) the Euclidean norm, and by \(\|\cdot\|\) a general norm.
For a differentiable function \(f\), let \(\nabla f(\mathbf{x})\) denote its gradient at \(\mathbf{x}\); for a (possibly non-differentiable) convex function \(f\), let \(\partial f(\mathbf{x})\) denote its subdifferential set at \(\mathbf{x}\).
Let \(\partial_1 f(\mathbf{w}, \mathbf{u})\) and \(\partial_2 f(\mathbf{w}, \mathbf{u})\) denote the partial subgradients of \(f\) with respect to the first variable \(\mathbf{w}\) and the second variable \(\mathbf{u}\), respectively.
Define the \(d\)-dimensional probability simplex as
\[ \Delta_d = \left\{ \mathbf{x} \in \mathbb{R}^d : x_i \geq 0\ \forall i,\ \sum_{i=1}^d x_i = 1 \right\}. \]
Let \(\mathbb{I}(\cdot)\) denote the standard indicator function, which returns 1 if the input condition is true and 0 otherwise. Let \(\mathbb{I}_{0-\infty}(\cdot)\) denote the zero-infinity indicator function, which returns 0 if the input condition is true and \(\infty\) otherwise.
Denote by \(\mathbf 1\) a vector of all ones. Let \(\mathbf{e}_i\) denote the standard basis vector with a 1 in the \(i\)-th coordinate and 0 in all other entries.
Let \(\mathbf x\sim\mathbb P\) denote a random variable that follows a distribution \(\mathbb P\).
\([n]\) denotes the set of all integers from \(1\) to \(n\), i.e., \([n]=\{1,\ldots, n\}\).
We use \(\langle\mathbf x, \mathbf y\rangle\) interchangeably with \(\mathbf x^{\top}\mathbf y\) to denote the inner product of two vectors.
\(\log(x)\) denotes the natural logarithm, i.e., the logarithm with base \(e\).
"w.r.t." is short for "with respect to".
"s.t." is short for "subject to".
Definition 1.1 (Dual Norm). Let \(\|\cdot\|\) be a norm on \(\mathbb{R}^d\). Its dual norm \(\|\cdot\|_* : \mathbb{R}^d \rightarrow \mathbb{R}\) is defined as \[ \|\mathbf{y}\|_* := \sup \left\{ \mathbf{x}^{\top}\mathbf{y} : \|\mathbf{x}\| \le 1 \right\}. \]
Example 1.1:
- \(\|\cdot\|_2\) is the dual norm of itself due to the Cauchy-Schwarz inequality: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2\).
Example 1.2:
- \(\|\cdot\|_\infty\) and \(\|\cdot\|_1\) are dual norms of each other: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_1 \|\mathbf{y}\|_\infty\).
Example 1.3:
- Let \(\|\mathbf{x}\|_A = \sqrt{\mathbf{x}^{\top}A\mathbf{x}}\) where \(A \succ 0\). Then the dual is \(\|\mathbf{y}\|_* = \sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\), because \[ \mathbf{x}^{\top}\mathbf{y} = \mathbf{x}^{\top}A^{1/2}A^{-1/2}\mathbf{y} \leq \|A^{1/2}\mathbf{x}\|_2 \|A^{-1/2}\mathbf{y}\|_2. \]
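The duality in Example 1.2 can be checked numerically. The sketch below (an informal illustration, not part of the text; it assumes NumPy) verifies that the supremum of \(\mathbf{x}^{\top}\mathbf{y}\) over the \(\ell_1\) unit ball equals \(\|\mathbf{y}\|_\infty\) and is attained at a signed standard basis vector \(\pm\mathbf{e}_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
y = rng.normal(size=d)

linf = np.max(np.abs(y))                          # ||y||_inf
# The sup of x^T y over the l1 ball is attained at some +/- e_i.
candidates = [s * np.eye(d)[i] for i in range(d) for s in (1.0, -1.0)]
attained = max(float(x @ y) for x in candidates)

# Random points on the l1 unit sphere never exceed ||y||_inf.
samples = rng.normal(size=(1000, d))
samples /= np.abs(samples).sum(axis=1, keepdims=True)
assert np.all(samples @ y <= linf + 1e-12)
assert np.isclose(attained, linf)
```

The same pattern (maximize \(\mathbf{x}^{\top}\mathbf{y}\) over the unit ball of the primal norm) can be used to probe the dual of any norm.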
Definition 1.2 (Convex Set). A set \(\mathcal{C}\) is said to be convex if the line segment between any two points in \(\mathcal{C}\) lies entirely in \(\mathcal{C}\); that is, for all \(\mathbf{x}_1, \mathbf{x}_2 \in \mathcal{C}\) and \(\theta \in [0,1]\),
\[ \theta \mathbf{x}_1 + (1-\theta)\mathbf{x}_2 \in \mathcal{C}. \]
Definition 1.3 (Convex Function). A function \(f : \mathbb{R}^d \rightarrow \mathbb{R}\) is convex if its domain \(\mathrm{dom}(f)\) is convex and
\[ f(\theta \mathbf{x} + (1-\theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}), \quad \forall \mathbf{x}, \mathbf{y} \in \mathrm{dom}(f),\ \theta \in [0,1]. \]
It is strictly convex if strict inequality holds whenever \(\mathbf{x} \neq \mathbf{y}\) and \(\theta \in (0,1)\).
This inequality means that the graph of a convex function lies below the chord connecting any two points on the graph, like a bowl: a chopstick placed across its rim stays above the surface of the bowl.
Lemma 1.1 (First-order condition). Suppose \(f\) is differentiable (i.e., its gradient \(\nabla f\) exists at each point in \(\operatorname{dom}f\)). Then \(f\) is convex if and only if \(\operatorname{dom}f\) is convex and
\[\begin{equation}\label{eq:first_order_convexity}
f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})
\end{equation}\]
holds for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\).
Proof.
We first prove that for a one-dimensional differentiable convex function \(\phi:\mathbb{R}\to\mathbb{R}\), we have
\[\begin{align}\label{eqn:1conv}
\phi(t)\ge\phi(s)+\phi'(s)(t-s).
\end{align}\]
By the definition of convexity, for any \(\alpha\in(0,1]\),
\[
\phi(s+\alpha(t-s))=\phi\big((1-\alpha)s+\alpha t\big)\le(1-\alpha)\phi(s)+\alpha\phi(t),
\]
which rearranges to
\[
\phi(t)\ge\phi(s)+\frac{\phi(s+\alpha(t-s))-\phi(s)}{\alpha}.
\]
Taking the limit \(\alpha\to0\) yields (\(\ref{eqn:1conv}\)).
(\(\Rightarrow\)) Assume \(f\) is convex and differentiable on the open convex set \(\operatorname{dom}f\). Fix \(\mathbf{x}\in\operatorname{dom}f\) and any \(\mathbf{y}\in\operatorname{dom}f\). Define \(\phi:[0,1]\to\mathbb{R}\) by \[ \phi(t)=f\big(\mathbf{x}+t(\mathbf{y}-\mathbf{x})\big). \] Since \(f\) is convex and the map \(t\mapsto\mathbf{x}+t(\mathbf{y}-\mathbf{x})\) is affine, \(\phi\) is a convex function on \([0,1]\). For a convex (one-dimensional) differentiable function, we have proved that \[ \phi(1)\ge\phi(0)+\phi'(0)(1-0). \] By the chain rule, \[ \phi'(0)=\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Thus \[ f(\mathbf{y})=\phi(1)\ge\phi(0)+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})=f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \]
(\(\Leftarrow\)) Assume \(\operatorname{dom}f\) is convex and for all \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\), \[ f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \] Take any \(\mathbf{x},\mathbf{y}\in\operatorname{dom}f\) and \(\theta\in[0,1]\), and set \(\mathbf{z}=\theta\mathbf{x}+(1-\theta)\mathbf{y}\in\operatorname{dom}f\). Apply the assumption with \((\mathbf{x},\mathbf{z})\) and \((\mathbf{y},\mathbf{z})\): \[ f(\mathbf{x})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{x}-\mathbf{z}),\qquad f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}(\mathbf{y}-\mathbf{z}). \] Multiply the first by \(\theta\) and the second by \((1-\theta)\) and add: \[ \theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\ge f(\mathbf{z})+\nabla f(\mathbf{z})^{\top}\big(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})\big). \] Since \(\theta(\mathbf{x}-\mathbf{z})+(1-\theta)(\mathbf{y}-\mathbf{z})=0\), we get \[ f(\mathbf{z})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y}), \] i.e., \(f(\theta\mathbf{x}+(1-\theta)\mathbf{y})\le\theta f(\mathbf{x})+(1-\theta)f(\mathbf{y})\). Hence \(f\) is convex.■
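As a quick numerical illustration of Lemma 1.1 (an informal check, not part of the proof), one can sample random point pairs and verify the first-order inequality for a smooth convex function such as log-sum-exp, whose gradient is the softmax:

```python
import numpy as np

def f(x):
    # Convex log-sum-exp function f(x) = log(sum_i exp(x_i)).
    return np.log(np.sum(np.exp(x)))

def grad_f(x):
    # Its gradient is the softmax of x (shifted for numerical stability).
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(1)
# Check f(y) >= f(x) + grad f(x)^T (y - x) on many random pairs.
ok = all(
    f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
    for x, y in (rng.normal(size=(2, 4)) for _ in range(1000))
)
assert ok
```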
Definition 1.4 (Subgradient). Let \(f\) be a convex (possibly non-differentiable) function. A vector \(\mathbf{v} \in \partial f(\mathbf{x})\) is called a subgradient of \(f\) at \(\mathbf{x}\) if
\[ f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y} \in \mathrm{dom}(f). \]
When no confusion arises, we often write \[f(\mathbf y) \geq f(\mathbf x) + \partial f(\mathbf x)^{\top}(\mathbf y - \mathbf x), \quad \forall \mathbf x, \mathbf y\in\mathrm{dom}(f),\] where \(\partial f(\mathbf x)\) refers to some specific element of the subdifferential set.
Example 1.4:
\(f(x) = [x]_+ = \max(0, x)\).
- At \(x=0\), \(\partial f(0) = [0,1]\)
- For \(x > 0\), \(\partial f(x) = \{1\}\)
- For \(x < 0\), \(\partial f(x) = \{0\}\)
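The three cases above can be encoded in a small helper (illustrative; the function name is our own) that returns the subdifferential of \([x]_+\) as a closed interval, together with a check that every element satisfies the subgradient inequality:

```python
def relu(x):
    return max(0.0, x)

def relu_subdifferential(x):
    """Subdifferential of f(x) = max(0, x), returned as an interval (lo, hi)."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (0.0, 0.0)
    return (0.0, 1.0)  # at x = 0, every slope in [0, 1] supports the graph

# Any v in the subdifferential must satisfy f(y) >= f(x) + v * (y - x).
for x in (-2.0, 0.0, 3.0):
    lo, hi = relu_subdifferential(x)
    for v in (lo, hi, 0.5 * (lo + hi)):
        assert all(relu(y) >= relu(x) + v * (y - x) - 1e-12
                   for y in (-5.0, -0.1, 0.0, 0.1, 5.0))
```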
Definition 1.5 (Strongly Convex Function). A function \(f : \mathbb{R}^d \to \mathbb{R}\) is said to be \(\mu\)-strongly convex with respect to a norm \(\|\cdot\|\) if there exists a constant \(\mu > 0\) such that, for all \(\mathbf{x}, \mathbf{y} \in \mathrm{dom}(f)\) and any \(\mathbf{v} \in \partial f(\mathbf{x})\), \[ f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2}\|\mathbf{x} - \mathbf{y}\|^2. \]
Example 1.5:
The function \(f(\mathbf x) = \frac{1}{2}\|\mathbf x\|_2^2\) is 1-strongly convex with respect to the Euclidean norm \(\|\cdot\|_2\). This follows directly from the identity: \[ \frac{1}{2}\|\mathbf y\|_2^2 = \frac{1}{2}\|\mathbf x\|_2^2 + \mathbf x^{\top}(\mathbf y - \mathbf x) + \frac{1}{2}\|\mathbf x - \mathbf y\|_2^2, \] which satisfies the definition of strong convexity with parameter 1.
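The identity in Example 1.5 is pure algebra and can be confirmed numerically (an informal sketch assuming NumPy):

```python
import numpy as np

# Check: 1/2||y||^2 = 1/2||x||^2 + x^T(y - x) + 1/2||x - y||^2
# for random vectors x, y, matching the strong-convexity identity.
rng = np.random.default_rng(2)
x, y = rng.normal(size=(2, 6))
lhs = 0.5 * y @ y
rhs = 0.5 * x @ x + x @ (y - x) + 0.5 * (x - y) @ (x - y)
assert np.isclose(lhs, rhs)
```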
Definition 1.6 (Smooth Function). A function \(f:\mathbb R^d\to \mathbb R\) is called \(L\)-smooth with respect to a norm \(\|\cdot\|\) if it is differentiable and its gradient is \(L\)-Lipschitz continuous, i.e., there exists a positive real constant \(L\) such that, for any \(\mathbf x, \mathbf y\in\mathbb R^d\), we have \(\|\nabla f(\mathbf x) -\nabla f(\mathbf y)\|_*\leq L\|\mathbf x - \mathbf y\|\), which implies \[\begin{align} |f(\mathbf x) - f(\mathbf y) - \nabla f(\mathbf y)^{\top}(\mathbf x- \mathbf y)|\leq \frac{L}{2}\|\mathbf x- \mathbf y\|^2. \end{align}\]
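A standard concrete case: the quadratic \(f(\mathbf x)=\frac{1}{2}\mathbf x^{\top}A\mathbf x\) with symmetric \(A\succeq 0\) is \(L\)-smooth in the Euclidean norm with \(L=\lambda_{\max}(A)\), since \(\|\nabla f(\mathbf x)-\nabla f(\mathbf y)\|_2=\|A(\mathbf x-\mathbf y)\|_2\le\lambda_{\max}(A)\|\mathbf x-\mathbf y\|_2\). The sketch below (illustrative, assuming NumPy) checks both bounds on random points:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
A = B @ B.T                     # symmetric positive semidefinite
L = np.linalg.eigvalsh(A)[-1]   # largest eigenvalue = smoothness constant

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(100):
    x, y = rng.normal(size=(2, 5))
    # Gradient Lipschitz bound.
    assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
    # Quadratic upper/lower bound from the definition.
    assert abs(f(x) - f(y) - grad(y) @ (x - y)) <= 0.5 * L * (x - y) @ (x - y) + 1e-9
```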
Definition 1.7 (Bregman Divergence).
Let \(\varphi:\Omega\rightarrow\mathbb{R}\) be a continuously differentiable, strictly convex function defined on a convex set \(\Omega\). The Bregman divergence induced by \(\varphi(\cdot)\) is defined as
\[\begin{equation*} D_\varphi(\mathbf{x}, \mathbf{y}) := \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y}). \end{equation*}\]
Example 1.6: Euclidean distance
\(\varphi(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2\) induces the Euclidean distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2. \]
Example 1.7: KL divergence
\(\varphi(\mathbf{x}) = \sum_i x_i \log x_i\) for \(\mathbf x\in\Delta_d\) induces the Kullback–Leibler (KL) divergence:
\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i x_i \log \frac{x_i}{y_i}. \]
Example 1.8: Itakura–Saito distance
\(\varphi(\mathbf{x}) = -\sum_i \log x_i\) for \(\mathbf x > 0\) induces the Itakura–Saito distance: \[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i \left(\frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1\right). \]
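Definition 1.7 translates directly into code. The generic helper below (illustrative; the function names are our own, assuming NumPy) evaluates \(D_\varphi\) from \(\varphi\) and \(\nabla\varphi\), and recovers Examples 1.6 and 1.7:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

rng = np.random.default_rng(4)

# phi(x) = 1/2 ||x||^2 recovers the squared Euclidean distance.
x, y = rng.normal(size=(2, 4))
d_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)
assert np.isclose(d_euc, 0.5 * (x - y) @ (x - y))

# phi(x) = sum_i x_i log x_i on the simplex recovers the KL divergence
# (the linear terms cancel because both points sum to 1).
p, q = rng.random((2, 4))
p, q = p / p.sum(), q / q.sum()
neg_entropy = lambda z: np.sum(z * np.log(z))
grad_ne = lambda z: np.log(z) + 1.0
d_kl = bregman(neg_entropy, grad_ne, p, q)
assert np.isclose(d_kl, np.sum(p * np.log(p / q)))
```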