Section 1.5: Basic Lemmas

Below, we present some basic lemmas that are useful for the presentation and analysis in later chapters.

Lemma 1.4.

For a \(L\)-smooth convex function w.r.t. \(\|\cdot\|_2\), the following conditions are equivalent:

1. \(0\leq f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x})\leq \frac{L}{2}\|\mathbf{x} - \mathbf{y}\|_2^2\)
1. \(\frac{1}{2L}\|\nabla f(\mathbf{y}) - \nabla f(\mathbf{x})\|_2^2\leq f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x})\)
1. \(\frac{1}{L}\|\nabla f(\mathbf{x}) -\nabla f(\mathbf{y})\|_2^2\leq (\nabla f(\mathbf{x}) - \nabla f(\mathbf{y}))^{\top}(\mathbf{x} - \mathbf{y})\leq L\|\mathbf{x} - \mathbf{y}\|_2^2\)
1. \(\frac{\alpha(1-\alpha)}{2L}\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2 \leq \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y}) - f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \leq \alpha(1-\alpha)\frac{L}{2}\|\mathbf{x} - \mathbf{y}\|_2^2\)

Proof

Let us prove (a). Since \(\frac{d f(\mathbf{x} + \gamma \mathbf{p})}{d\gamma} = \nabla f(\mathbf{x} + \gamma \mathbf{p})^{\top} \mathbf{p}\), according to Taylor Theory:

\[ f(\mathbf{x} + \mathbf{p}) = f(\mathbf{x}) + \int_0^1 \nabla f(\mathbf{x} + \gamma \mathbf{p})^{\top} \mathbf{p} \, d\gamma \]

Let \(\mathbf{y} = \mathbf{x} + \mathbf{p}\):

\[ \begin{align*} f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) &= \int_0^1 \nabla f(\mathbf{x} + \gamma(\mathbf{y} - \mathbf{x}))^{\top}(\mathbf{y} - \mathbf{x}) \, d\gamma - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \\ &= \int_0^1 (\nabla f(\mathbf{x} + \gamma(\mathbf{y} - \mathbf{x})) - \nabla f(\mathbf{x}))^{\top}(\mathbf{y} - \mathbf{x}) \, d\gamma \\ &\leq \int_0^1 L\|\gamma \mathbf{p}\|_2 \|\mathbf{p}\|_2 \, d\gamma = \frac{L}{2} \|\mathbf{y} - \mathbf{x}\|_2^2 \end{align*} \]

Let us prove (b). Define \(\phi(\mathbf{z}) = f(\mathbf{z}) - \nabla f(\mathbf{x})^{\top} \mathbf{z}\). Since \(\phi\) is convex and \(L\)-smooth:

\[\begin{align*} \phi(\mathbf{x}) &= \min_{\mathbf{z}} \phi(\mathbf{z}) \leq \min_{\mathbf{z}} \left\{ \phi(\mathbf{y}) + \nabla \phi(\mathbf{y})^{\top} (\mathbf{z} - \mathbf{y}) + \frac{L}{2} \|\mathbf{z} - \mathbf{y}\|_2^2 \right\} \\ &\leq \min_{r} \left\{ \phi(\mathbf{y}) + \|\nabla \phi(\mathbf{y})\|_2 r + \frac{L}{2} r^2 \right\} \\ &= \phi(\mathbf{y}) - \frac{\|\nabla \phi(\mathbf{y})\|_2^2}{2L} \end{align*}\]

Then, \(2L(\phi(\mathbf{y}) - \phi(\mathbf{x})) \geq \|\nabla \phi(\mathbf{y})\|_2^2\). Plug in definitions to finish the proof.

Let us prove (c):

\[\begin{align*} \frac{1}{2L} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2 &\leq f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top} (\mathbf{y} - \mathbf{x}) \\ \frac{1}{2L} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2 &\leq f(\mathbf{x}) - f(\mathbf{y}) - \nabla f(\mathbf{y})^{\top} (\mathbf{x} - \mathbf{y}) \end{align*}\]

Summing up both:

\[ \frac{1}{L} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2 \leq (\nabla f(\mathbf{x}) - \nabla f(\mathbf{y}))^{\top} (\mathbf{x} - \mathbf{y}) \]

Let us prove (d). Let \(\mathbf{x}_\alpha = \alpha \mathbf{x} + (1 - \alpha) \mathbf{y}\):

\[\begin{align*} \frac{1}{2L} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{x}_\alpha)\|_2^2 &\leq f(\mathbf{x}) - \left( f(\mathbf{x}_\alpha) + \nabla f(\mathbf{x}_\alpha)^{\top} (1 - \alpha)(\mathbf{x} - \mathbf{y}) \right) \\ &\leq \frac{L}{2} \|(1 - \alpha)(\mathbf{x} - \mathbf{y})\|_2^2 \end{align*}\]

\[\begin{align*} \frac{1}{2L} \|\nabla f(\mathbf{y}) - \nabla f(\mathbf{x}_\alpha)\|_2^2 &\leq f(\mathbf{y}) - \left( f(\mathbf{x}_\alpha) + \nabla f(\mathbf{x}_\alpha)^{\top} \alpha(\mathbf{y} - \mathbf{x}) \right) \\ &\leq \frac{L}{2} \|\alpha(\mathbf{y} - \mathbf{x})\|_2^2 \end{align*}\]

Multiply first by \(\alpha\), second by \(1 - \alpha\), and combine:

\[ \frac{\alpha(1 - \alpha)}{2L} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2 \leq \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) - f(\mathbf{x}_\alpha) \]

Lemma 1.5.

If \(f\) is differentiable and \(\mu\)-strongly convex w.r.t. \(\|\cdot\|_2\), the following conditions are equivalent:

1. \(f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq \frac{\mu}{2} \|\mathbf{x} - \mathbf{y}\|_2^2\)
1. \(f(\mathbf{y}) - f(\mathbf{x}) - \nabla f(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \leq \frac{1}{2\mu} \|\nabla f(\mathbf{y}) - \nabla f(\mathbf{x})\|_2^2\)
1. \(\mu \|\mathbf{x} - \mathbf{y}\|_2^2 \leq (\nabla f(\mathbf{x}) - \nabla f(\mathbf{y}))^{\top}(\mathbf{x} - \mathbf{y}) \leq \frac{1}{\mu} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2\)
1. \(\frac{\alpha(1 - \alpha)\mu}{2} \|\mathbf{x} - \mathbf{y}\|_2^2 \leq \alpha f(\mathbf{x}) + (1 - \alpha) f(\mathbf{y}) - f(\alpha \mathbf{x} + (1 - \alpha)\mathbf{y}) \leq \alpha(1 - \alpha)\frac{1}{2\mu} \|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2^2\)

Proof:

Part (a) is the definition. Let us prove part (b). Define \(\phi(\mathbf{z}) = f(\mathbf{z}) - \nabla f(\mathbf{x})^{\top} \mathbf{z}\). We can conclude that \(\mathbf{z}^* = \mathbf{x}\) (by first-order optimality), and that \(\phi(\mathbf{z})\) is also convex and \(\mu\)-strongly convex since \(f\) is convex and \(\mu\)-strongly convex.

\[\begin{align*} \phi(\mathbf{x}) = \min_{\mathbf{z}} \phi(\mathbf{z}) &\geq \min_{\mathbf{z}} \left\{ \phi(\mathbf{y}) + \nabla \phi(\mathbf{y})^{\top}(\mathbf{z} - \mathbf{y}) + \frac{\mu}{2} \left\| \mathbf{z} - \mathbf{y} \right\|_2^2 \right\} \\ &\stackrel{r = \|\mathbf{z} - \mathbf{y}\|_2}{=} \min_{r} \left\{ \phi(\mathbf{y}) + \nabla \phi(\mathbf{y})^{\top} r + \frac{\mu}{2} r^2 \right\} \\ &\stackrel{\text{solve } r}{=} \phi(\mathbf{y}) - \frac{\left\| \nabla \phi(\mathbf{y}) \right\|_2^2}{2\mu}. \end{align*}\]

Then we have \[ 2\mu(\phi(\mathbf{y}) - \phi(\mathbf{x})) \leq \left\| \nabla \phi(\mathbf{y}) \right\|_2^2, \] which proves the result by plugging in \(\phi(\mathbf{z}) = f(\mathbf{z}) - \nabla f(\mathbf{x})^{\top} \mathbf{z}\) and \(\nabla \phi(\mathbf{z}) = \nabla f(\mathbf{z}) - \nabla f(\mathbf{x})\).

Parts (c), and (d) can be proved similarly to the previous lemma.

Lemma 1.6.

If \(r(\cdot)\) is \(\mu\)-strongly convex and \[\begin{align} \label{eqn:pert1} \text{prox}_{\eta r}(\mathbf{z}_1) &= \arg\min_{\mathbf{w}} r(\mathbf{w}) + \frac{1}{2\eta} \|\mathbf{w} - \mathbf{z}_1\|^2,\\ \text{prox}_{\eta r}(\mathbf{z}_2) &= \arg\min_{\mathbf{w}} r(\mathbf{w}) + \frac{1}{2\eta} \|\mathbf{w} - \mathbf{z}_2\|^2, \end{align}\] then we have \[ \|\text{prox}_{\eta r}(\mathbf{z}_1) - \text{prox}_{\eta r}(\mathbf{z}_2)\| \leq \frac{1}{1 + \mu\eta} \|\mathbf{z}_1 - \mathbf{z}_2\|. \]

Proof

First, we can see that when \(r = 0\), the conclusion trivially holds. Next, we prove it when \(r\) is present.

By the optimality of \(\operatorname{prox}_{\eta r}(\mathbf{z}_1)\) and \(\operatorname{prox}_{\eta r}(\mathbf{z}_2)\), we have \[\begin{align*} \mathbf{u} := \frac{\mathbf{z}_1 - \operatorname{prox}_{\eta r}(\mathbf{z}_1)}{\eta} &\in \partial r(\operatorname{prox}_{\eta r}(\mathbf{z}_1)), \\ \mathbf{v} := \frac{\mathbf{z}_2 - \operatorname{prox}_{\eta r}(\mathbf{z}_2)}{\eta} - \mathbf{b} &\in \partial r(\operatorname{prox}_{\eta r}(\mathbf{z}_2)). \end{align*}\]

Since \(r(\mathbf{x})\) is \(\mu\)-strongly convex, we have \[\begin{align*} r(\operatorname{prox}_{\eta r}(\mathbf{z}_1)) &\geq r(\operatorname{prox}_{\eta r}(\mathbf{z}_2)) + \mathbf{v}^{\top} \left( \operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2) \right) \\ &\quad + \frac{\mu}{2} \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|^2, \\ r(\operatorname{prox}_{\eta r}(\mathbf{z}_2)) &\geq r(\operatorname{prox}_{\eta r}(\mathbf{z}_1)) + \mathbf{u}^{\top} \left( \operatorname{prox}_{\eta r}(\mathbf{z}_2) - \operatorname{prox}_{\eta r}(\mathbf{z}_1) \right) \\ &\quad + \frac{\mu}{2} \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|^2. \end{align*}\]

Adding them together, we have \[\begin{align*} &\mu \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|^2 \leq (\mathbf{u} - \mathbf{v})^{\top} \left( \operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2) \right) \\ &= \frac{1}{\eta} (\mathbf{z}_1 - \mathbf{z}_2 + \operatorname{prox}_{\eta r}(\mathbf{z}_2) - \operatorname{prox}_{\eta r}(\mathbf{z}_1))^{\top} \left( \operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2) \right). \end{align*}\]

This implies \[\begin{align*} &\left(\mu + \frac{1}{\eta}\right) \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|^2 \leq \frac{1}{\eta} (\mathbf{z}_1 - \mathbf{z}_2)^{\top} \left( \operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2) \right) \\ &\leq \frac{1}{\eta} \|\mathbf{z}_1 - \mathbf{z}_2\|_2 \cdot \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2. \end{align*}\]

Thus, \[ \|\operatorname{prox}_{\eta r}(\mathbf{z}_1) - \operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2 \leq \frac{1}{\mu\eta + 1} \|\mathbf{z}_1 - \mathbf{z}_2\|. \]

Lemma 1.7.

If \(f\) is \(\mu\)-strongly convex w.r.t. \(\|\cdot\|_2\), then its Fenchel conjugate is \(1/\mu\)-smooth. Similarly, if \(f\) is \(L\)-smooth and convex w.r.t. \(\|\cdot\|_2\), then its Fenchel conjugate is \(1/L\)-strongly convex.

Proof

Let \(f^*(\mathbf{y}) = \max_{\mathbf{x}} \mathbf{x}^{\top} \mathbf{y} - f(\mathbf{x})\) be the Fenchel conjugate of \(f\).

Suppose \(f\) is \(\mu\)-strongly convex. Let \(\mathbf{x}(\mathbf{y}) = \arg\max_{\mathbf{x}} \mathbf{x}^{\top} \mathbf{y} - f(\mathbf{x})\). Then by the Danskin Theorem, \(\nabla f^*(\mathbf{y}) = \mathbf{x}(\mathbf{y})\). Similar to the previous lemma, we can show \[ \|\mathbf{x}(\mathbf{y}_1) - \mathbf{x}(\mathbf{y}_2)\|_2 \leq \frac{1}{\mu} \|\mathbf{y}_1 - \mathbf{y}_2\|_2, \] which proves the Lipschitz continuity of \(\nabla f^*(\mathbf{y})\) and hence the smoothness of \(f^*\).

Now suppose \(f\) is \(L\)-smooth and convex. Let us prove that \(f^*\) is \(1/L\)-strongly convex. Consider \(\mathbf{y}_1, \mathbf{y}_2\). Let \(\mathbf{x}_1 \in \arg\max_{\mathbf{x}} \mathbf{x}^{\top} \mathbf{y}_1 - f(\mathbf{x})\) and \(\mathcal{X}_2 = \arg\max_{\mathbf{x}} \mathbf{x}^{\top} \mathbf{y}_2 - f(\mathbf{x})\). Then \(\nabla f(\mathbf{x}_1) = \mathbf{y}_1\). For any \(\mathbf{x}_2 \in \mathcal{X}_2\), we have \(\nabla f(\mathbf{x}_2) = \mathbf{y}_2\), and \[\begin{align*} & f^*(\mathbf{y}_1) - f^*(\mathbf{y}_2) - \mathbf{x}_2^{\top}(\mathbf{y}_1 - \mathbf{y}_2) \\ &= \mathbf{x}_1^{\top} \mathbf{y}_1 - f(\mathbf{x}_1) - (\mathbf{x}_2^{\top} \mathbf{y}_2 - f(\mathbf{x}_2)) - \mathbf{x}_2^{\top}(\nabla f(\mathbf{x}_1) - \nabla f(\mathbf{x}_2)) \\ &= f(\mathbf{x}_2) - f(\mathbf{x}_1) + \mathbf{x}_1^{\top} \nabla f(\mathbf{x}_1) - \mathbf{x}_2^{\top} \nabla f(\mathbf{x}_2) - \mathbf{x}_2^{\top}(\nabla f(\mathbf{x}_1) - \nabla f(\mathbf{x}_2)) \\ &= f(\mathbf{x}_2) - f(\mathbf{x}_1) + (\mathbf{x}_1 - \mathbf{x}_2)^{\top} \nabla f(\mathbf{x}_1) \\ &\geq \frac{1}{2L} \|\nabla f(\mathbf{x}_1) - \nabla f(\mathbf{x}_2)\|_2^2 = \frac{1}{2L} \|\mathbf{y}_1 - \mathbf{y}_2\|_2^2, \end{align*}\] where the last inequality follows from part (b) of Lemma 1.4. Hence, we conclude the proof by noting that \(\partial f^*(\mathbf{y}_2) = \operatorname{conv}(\mathcal{X}_2)\) due to the generalized Danskin theorem.

Lemma 1.8.

For \(\mathbf{p} \in \Delta_n\), the negative entropy function \(R(\mathbf{p}) = \sum_{i=1}^n p_i \log p_i\) is \(1\)-strongly convex w.r.t. the \(\ell_1\) norm \(\|\cdot\|_1\).

Proof

For any \(\mathbf{x}, \mathbf{y} \in \Delta_n\), let \(f(t) = R(\mathbf{y} + t(\mathbf{x} - \mathbf{y}))\). By the second-order Taylor expansion, for some \(t \in (0,1)\), we have \[ \begin{align*} &R(\mathbf{x}) = f(1) = f(0) + f'(0) + \frac{1}{2} f''(t)\\ &= R(\mathbf{y}) + \nabla R(\mathbf{y})^{\top}(\mathbf{x} - \mathbf{y}) + \frac{1}{2} (\mathbf{x} - \mathbf{y})^{\top} \nabla^2 R(\mathbf{y} + t(\mathbf{x} - \mathbf{y}))(\mathbf{x} - \mathbf{y}). \end{align*} \]

Hence, it suffices to prove that for any \(\mathbf{v}\) and \(\mathbf{p} \in \Delta_n\), \[ \mathbf{v}^{\top} \nabla^2 R(\mathbf{p}) \mathbf{v} \geq \|\mathbf{v}\|_1^2. \]

This follows from the computation: \[ \mathbf{v}^{\top} \nabla^2 R(\mathbf{p}) \mathbf{v} = \sum_{i=1}^n v_i^2 p_i^{-1} = \left[\sum_i v_i^2 p_i^{-1}\right] \left[\sum_i p_i\right] \geq \left[\sum_i (p_i^{-1/2} |v_i|) p_i^{1/2}\right]^2 = \left[\sum_i |v_i|\right]^2, \] where the inequality follows from the Cauchy–Schwarz inequality.