Section 1.1: Notations and Definitions
Let us denote by \(\|\cdot\|_2\) the Euclidean norm, and by \(\|\cdot\|\) a general norm. For a differentiable function \(f\), let \(\nabla f(\mathbf{x})\) denote its gradient at \(\mathbf{x}\); for a (possibly non-differentiable) function \(f\), let \(\partial f(\mathbf{x})\) denote its subdifferential set at \(\mathbf{x}\).
Let \(\partial_1 f(\mathbf{w}, \mathbf{u})\) and \(\partial_2 f(\mathbf{w}, \mathbf{u})\) denote the partial subgradients of \(f\) with respect to the first variable \(\mathbf{w}\) and the second variable \(\mathbf{u}\), respectively.
Define the \(d\)-dimensional
probability simplex as
\[
\Delta_d = \left\{ \mathbf{x} \in \mathbb{R}^d : x_i \geq 0\ \forall i,\
\sum_{i=1}^d x_i = 1 \right\}.
\]
Let \(\mathbb{I}(\cdot)\) denote the
standard indicator function, which returns 1 if the input condition is
true and 0 otherwise.
Let \(\mathbb{I}_{0-\infty}(\cdot)\)
denote the zero-infinity indicator function, which returns 0 if the
input condition is true and \(\infty\)
otherwise.
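As a quick numerical illustration of this notation (a minimal sketch; the helper names `indicator` and `indicator_0_inf` are made up for this example), the condition checked below is membership in \(\Delta_3\):

```python
import numpy as np

def indicator(cond):
    """Standard indicator I(cond): returns 1 if cond is true, 0 otherwise."""
    return 1.0 if cond else 0.0

def indicator_0_inf(cond):
    """Zero-infinity indicator: returns 0 if cond is true, +inf otherwise."""
    return 0.0 if cond else np.inf

x = np.array([0.2, 0.3, 0.5])                               # a point in the simplex
in_simplex = bool(np.all(x >= 0) and np.isclose(x.sum(), 1.0))
print(indicator(in_simplex), indicator_0_inf(in_simplex))   # 1.0 0.0
```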
Definition: Dual Norm
Let \(\|\cdot\|\) be a norm on \(\mathbb{R}^d\), then its dual norm \(\|\cdot\|_*\) is defined as
\[
\|\mathbf{y}\|_*:= \sup\{ \mathbf{x}^{\top}\mathbf{y}:
\|\mathbf{x}\|\leq 1\}.
\]
Example 1:
- \(\|\cdot\|_2\) is the dual norm of itself due to the Cauchy-Schwarz inequality: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_2 \|\mathbf{y}\|_2\).
Example 2:
- \(\|\cdot\|_\infty\) and \(\|\cdot\|_1\) are dual norms of each other due to Hölder's inequality: \(\mathbf{x}^{\top}\mathbf{y} \leq \|\mathbf{x}\|_1 \|\mathbf{y}\|_\infty\).
Example 3:
- Let \(\|\mathbf{x}\|_A = \sqrt{\mathbf{x}^{\top}A\mathbf{x}}\) where \(A \succ 0\). Then the dual norm is \(\|\mathbf{y}\|_* = \sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\), because for any \(\mathbf{x}\) with \(\|\mathbf{x}\|_A \leq 1\), \[ \mathbf{x}^{\top}\mathbf{y} = (A^{1/2}\mathbf{x})^{\top}(A^{-1/2}\mathbf{y}) \leq \|A^{1/2}\mathbf{x}\|_2 \|A^{-1/2}\mathbf{y}\|_2 = \|\mathbf{x}\|_A \|A^{-1/2}\mathbf{y}\|_2 \leq \|A^{-1/2}\mathbf{y}\|_2 = \sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}, \] with equality attained at \(\mathbf{x} = A^{-1}\mathbf{y}/\sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\) (a numerical check is sketched below).
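Example 3 can be verified numerically. The sketch below (assuming a randomly generated positive definite \(A\)) evaluates the closed form \(\sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\) and confirms it is attained by the feasible point \(\mathbf{x}^\star = A^{-1}\mathbf{y}/\sqrt{\mathbf{y}^{\top}A^{-1}\mathbf{y}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random positive definite matrix A and a test vector y
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)          # A is symmetric positive definite
y = rng.standard_normal(4)

# Closed-form dual of the A-norm: ||y||_* = sqrt(y^T A^{-1} y)
dual_closed = np.sqrt(y @ np.linalg.solve(A, y))

# The supremum of x^T y over ||x||_A <= 1 is attained at
# x* = A^{-1} y / sqrt(y^T A^{-1} y).
x_star = np.linalg.solve(A, y) / dual_closed
print(np.sqrt(x_star @ A @ x_star))      # ~1.0, so x* is feasible
print(x_star @ y, dual_closed)           # the two values coincide
```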
Definition: Convex Set
A set \(\mathcal{C}\) is convex if
for all \(\mathbf{x}_1, \mathbf{x}_2 \in
\mathcal{C}\) and \(\theta \in [0,
1]\),
\[
\theta \mathbf{x}_1 + (1-\theta) \mathbf{x}_2 \in \mathcal{C}.
\]
Definition: Convex Function
A function \(f: \mathbb{R}^d \rightarrow \mathbb{R}\) is convex if for all \(\mathbf{x}, \mathbf{y}\) and \(\theta \in [0,1]\), \[ f(\theta \mathbf{x} + (1-\theta) \mathbf{y}) \leq \theta f(\mathbf{x}) + (1-\theta) f(\mathbf{y}). \]
Strict convexity holds with strict inequality when \(\mathbf{x} \neq \mathbf{y}\) and \(\theta \in (0,1)\).
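As a sanity check of the definition, the short sketch below samples random pairs and verifies the convexity inequality for the log-sum-exp function; the choice of function is only for illustration:

```python
import numpy as np

def f(x):
    """log-sum-exp, a standard convex function on R^d."""
    return np.log(np.sum(np.exp(x)))

rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    t = rng.uniform()
    # Convexity: f(t x + (1-t) y) <= t f(x) + (1-t) f(y)
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12
print("convexity inequality held on all sampled pairs")
```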
Definition: Subgradient
For a convex (possibly non-differentiable) function \(f\), a vector \(\mathbf{v}\) is a subgradient of \(f\) at \(\mathbf{x}\), written \(\mathbf{v} \in \partial f(\mathbf{x})\), if \[ f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}), \quad \forall \mathbf{y} \in \text{dom}(f). \]
Example 4:
\(f(x) = [x]_+ = \max(0, x)\).
- At \(x = 0\), \(\partial f(0) = [0, 1]\)
- For \(x > 0\), \(\partial f(x) = \{1\}\)
- For \(x < 0\), \(\partial f(x) = \{0\}\)
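The subgradient inequality from the definition can be checked numerically for this example; the sketch below samples \(v \in [0, 1]\) and verifies \(f(y) \geq f(0) + v\,(y - 0)\) at random points \(y\):

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

rng = np.random.default_rng(2)
x0 = 0.0
for _ in range(1000):
    v = rng.uniform(0.0, 1.0)        # any v in [0, 1] is a subgradient at x = 0
    y = rng.uniform(-5.0, 5.0)
    # Subgradient inequality: f(y) >= f(x0) + v (y - x0)
    assert relu(y) >= relu(x0) + v * (y - x0) - 1e-12
print("subgradient inequality held for all sampled v in [0, 1]")
```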
Definition: Strongly Convex Function
A function \(f: \mathbb{R}^d \to \mathbb{R}\) is \(\mu\)-strongly convex w.r.t. norm \(\|\cdot\|\) if for any \(\mathbf{x}, \mathbf{y}\) and \(\mathbf{v} \in \partial f(\mathbf{x})\), \[ f(\mathbf{y}) \geq f(\mathbf{x}) + \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{\mu}{2} \|\mathbf{x} - \mathbf{y}\|^2. \]
Example 5:
\(f(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2\) is 1-strongly convex w.r.t. \(\|\cdot\|_2\) because the defining inequality holds with equality: \[ \frac{1}{2} \|\mathbf{y}\|_2^2 = \frac{1}{2} \|\mathbf{x}\|_2^2 + \mathbf{x}^{\top}(\mathbf{y} - \mathbf{x}) + \frac{1}{2} \|\mathbf{x} - \mathbf{y}\|_2^2. \]
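A quick numerical check of this identity (a minimal sketch with random vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.standard_normal(6), rng.standard_normal(6)

lhs = 0.5 * np.dot(y, y)
rhs = 0.5 * np.dot(x, x) + np.dot(x, y - x) + 0.5 * np.dot(x - y, x - y)
print(np.isclose(lhs, rhs))   # True: the identity holds exactly, hence mu = 1
```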
Definition: Smooth Function
A function \(f\) is \(L\)-smooth if its gradient is \(L\)-Lipschitz w.r.t. \(\|\cdot\|\): \[ \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_* \leq L\|\mathbf{x} - \mathbf{y}\|, \] or equivalently, \[ |f(\mathbf{x}) - f(\mathbf{y}) - \nabla f(\mathbf{y})^{\top}(\mathbf{x} - \mathbf{y})| \leq \frac{L}{2} \|\mathbf{x} - \mathbf{y}\|^2. \]
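For a concrete instance, \(f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\top}A\mathbf{x}\) with symmetric \(A \succeq 0\) is \(L\)-smooth w.r.t. \(\|\cdot\|_2\) with \(L = \lambda_{\max}(A)\); the sketch below (the example function is an assumption chosen for illustration) checks the quadratic bound at random points:

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((5, 5))
A = B @ B.T                              # symmetric positive semidefinite
L = np.linalg.eigvalsh(A).max()          # smoothness constant w.r.t. ||.||_2

f = lambda z: 0.5 * z @ A @ z
grad = lambda z: A @ z

for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    # |f(x) - f(y) - <grad f(y), x - y>| <= (L/2) ||x - y||_2^2
    gap = abs(f(x) - f(y) - grad(y) @ (x - y))
    assert gap <= 0.5 * L * np.dot(x - y, x - y) + 1e-9
print("quadratic bound held with L = lambda_max(A)")
```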
Definition: Bregman Divergence
Given a strictly convex and differentiable function \(\varphi: \Omega \to \mathbb{R}\), \[ D_\varphi(\mathbf{x}, \mathbf{y}) := \varphi(\mathbf{x}) - \varphi(\mathbf{y}) - \nabla \varphi(\mathbf{y})^{\top}(\mathbf{x} - \mathbf{y}). \]
Example 6: Euclidean distance (\(\varphi(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|_2^2\)):
\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \frac{1}{2}\|\mathbf{x} - \mathbf{y}\|_2^2. \]
Example 7: KL divergence (\(\varphi(\mathbf{x}) = \sum_i x_i \log x_i\) on \(\Delta_d\)):
\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i x_i \log \frac{x_i}{y_i}. \]
Example 8: Itakura–Saito distance (\(\varphi(\mathbf{x}) = -\sum_i \log x_i\) for \(x_i > 0\)):
\[ D_\varphi(\mathbf{x}, \mathbf{y}) = \sum_i \left( \frac{x_i}{y_i} - \log \frac{x_i}{y_i} - 1 \right). \]
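All three examples can be recovered from one generic implementation of the definition; the sketch below (the helper name `bregman` is made up for this example) compares the generic formula against each closed form:

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

x = np.array([0.2, 0.3, 0.5])   # both points lie in the simplex
y = np.array([0.4, 0.4, 0.2])

# Euclidean distance: phi = 0.5 ||.||_2^2
d_euc = bregman(lambda z: 0.5 * z @ z, lambda z: z, x, y)
print(np.isclose(d_euc, 0.5 * np.sum((x - y) ** 2)))            # True

# KL divergence on the simplex: phi = sum_i z_i log z_i
d_kl = bregman(lambda z: np.sum(z * np.log(z)),
               lambda z: np.log(z) + 1.0, x, y)
print(np.isclose(d_kl, np.sum(x * np.log(x / y))))              # True

# Itakura-Saito distance: phi = -sum_i log z_i
d_is = bregman(lambda z: -np.sum(np.log(z)), lambda z: -1.0 / z, x, y)
print(np.isclose(d_is, np.sum(x / y - np.log(x / y) - 1.0)))    # True
```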