
Section 1.5: Basic Lemmas

Below, we present some basic lemmas that are useful for the presentation and analysis in later chapters.

Lemma 1.5 For an \(L\)-smooth convex function \(f\) w.r.t. \(\|\cdot\|_2\), the following conditions hold for all \(\mathbf{x},\mathbf{y}\) and any \(\alpha\in[0,1]\):

(a) \(f(\mathbf{y})\le f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{L}{2}\|\mathbf{y}-\mathbf{x}\|_2^2\);

(b) \(f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{1}{2L}\|\nabla f(\mathbf{y})-\nabla f(\mathbf{x})\|_2^2\);

(c) \((\nabla f(\mathbf{x})-\nabla f(\mathbf{y}))^{\top}(\mathbf{x}-\mathbf{y})\ge\frac{1}{L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2\);

(d) \(\alpha f(\mathbf{x})+(1-\alpha)f(\mathbf{y})\ge f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y})+\frac{\alpha(1-\alpha)}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2\).

Proof.
Let us prove (a). Since \(\frac{df(\mathbf{x}+\gamma\mathbf{p})}{d\gamma}=\nabla f(\mathbf{x}+\gamma\mathbf{p})^{\top}\mathbf{p}\), the fundamental theorem of calculus gives \[ f(\mathbf{x}+\mathbf{p})=f(\mathbf{x})+\int_0^1\nabla f(\mathbf{x}+\gamma\mathbf{p})^{\top}\mathbf{p}\,d\gamma. \] Let \(\mathbf{y}=\mathbf{x}+\mathbf{p}\): \[\begin{align*} f(\mathbf{y})-f(\mathbf{x})-\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}) &=\int_0^1\nabla f(\mathbf{x}+\gamma(\mathbf{y}-\mathbf{x}))^{\top}(\mathbf{y}-\mathbf{x})\,d\gamma-\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})\\ &=\int_0^1\left(\nabla f(\mathbf{x}+\gamma(\mathbf{y}-\mathbf{x}))-\nabla f(\mathbf{x})\right)^{\top}(\mathbf{y}-\mathbf{x})\,d\gamma\\ &\le\int_0^1\|\nabla f(\mathbf{x}+\gamma(\mathbf{y}-\mathbf{x}))-\nabla f(\mathbf{x})\|_2\|\mathbf{y}-\mathbf{x}\|_2\,d\gamma\\ &\le\int_0^1 L\gamma\|\mathbf{y}-\mathbf{x}\|_2^2\,d\gamma =\frac{L}{2}\|\mathbf{y}-\mathbf{x}\|_2^2, \end{align*}\] where the first inequality is the Cauchy–Schwarz inequality and the second uses the \(L\)-Lipschitz continuity of \(\nabla f\).

Let us prove (b). Define \(\phi(\mathbf{z})=f(\mathbf{z})-\nabla f(\mathbf{x})^{\top}\mathbf{z}\). Since \(\nabla\phi(\mathbf{x})=0\), the first-order optimality condition implies that \(\mathbf{z}^*=\mathbf{x}\) minimizes \(\phi\); moreover, \(\phi\) is also convex and \(L\)-smooth if \(f\) is convex and \(L\)-smooth. \[\begin{align*} \phi(\mathbf{x}) &=\min_{\mathbf{z}}\phi(\mathbf{z}) \le\min_{\mathbf{z}}\left\{\phi(\mathbf{y})+\nabla\phi(\mathbf{y})^{\top}(\mathbf{z}-\mathbf{y})+\frac{L}{2}\|\mathbf{z}-\mathbf{y}\|_2^2\right\}\\ &\stackrel{\mathbf{r}=\mathbf{z}-\mathbf{y}}{=} \min_{\mathbf{r}}\left\{\phi(\mathbf{y})+\nabla\phi(\mathbf{y})^{\top}\mathbf{r}+\frac{L}{2}\|\mathbf{r}\|_2^2\right\}\\ &\stackrel{\text{solve }\mathbf{r}}{=} \phi(\mathbf{y})-\frac{\|\nabla\phi(\mathbf{y})\|_2^2}{L}+\frac{\|\nabla\phi(\mathbf{y})\|_2^2}{2L} =\phi(\mathbf{y})-\frac{\|\nabla\phi(\mathbf{y})\|_2^2}{2L}, \end{align*}\] where the minimum over \(\mathbf{r}\) is attained at \(\mathbf{r}=-\nabla\phi(\mathbf{y})/L\). Then, we have \(2L(\phi(\mathbf{y})-\phi(\mathbf{x}))\ge\|\nabla\phi(\mathbf{y})\|_2^2\), which proves the result by plugging in \(\phi(\mathbf{z})=f(\mathbf{z})-\nabla f(\mathbf{x})^{\top}\mathbf{z}\) and \(\nabla\phi(\mathbf{z})=\nabla f(\mathbf{z})-\nabla f(\mathbf{x})\).

Let us prove (c). According to part (b) we have \[\begin{align*} \frac{1}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2 \le f(\mathbf{y})-f(\mathbf{x})-\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x}). \end{align*}\] Similarly, \[\begin{align*} \frac{1}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2 \le f(\mathbf{x})-f(\mathbf{y})-\nabla f(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y}). \end{align*}\] Summing up the above two inequalities leads to \[\begin{align}\nonumber \frac{1}{L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2 \le(\nabla f(\mathbf{x})-\nabla f(\mathbf{y}))^{\top}(\mathbf{x}-\mathbf{y}). \end{align}\]

Let us prove (d). Let \(\mathbf{x}_\alpha=\alpha\mathbf{x}+(1-\alpha)\mathbf{y}\). From (a) and (b), we have \[\begin{align*} \frac{1}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{x}_\alpha)\|_2^2 &\le f(\mathbf{x})-\left(f(\mathbf{x}_\alpha)+\nabla f(\mathbf{x}_\alpha)^{\top}(1-\alpha)(\mathbf{x}-\mathbf{y})\right) \le\frac{L}{2}\|(1-\alpha)(\mathbf{x}-\mathbf{y})\|_2^2,\\ \frac{1}{2L}\|\nabla f(\mathbf{y})-\nabla f(\mathbf{x}_\alpha)\|_2^2 &\le f(\mathbf{y})-\left(f(\mathbf{x}_\alpha)+\nabla f(\mathbf{x}_\alpha)^{\top}\alpha(\mathbf{y}-\mathbf{x})\right) \le\frac{L}{2}\|\alpha(\mathbf{y}-\mathbf{x})\|_2^2. \end{align*}\] Multiplying the first inequality by \(\alpha\), the second by \(1-\alpha\), and summing proves part (d), where the lower bound is \[ \frac{\alpha}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{x}_\alpha)\|_2^2+\frac{1-\alpha}{2L}\|\nabla f(\mathbf{y})-\nabla f(\mathbf{x}_\alpha)\|_2^2 \ge\frac{\alpha(1-\alpha)}{2L}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2, \] which follows by applying Young's inequality \(\|\mathbf{a}-\mathbf{b}\|_2^2\le(1+\beta)\|\mathbf{a}-\mathbf{c}\|_2^2+(1+\frac{1}{\beta})\|\mathbf{b}-\mathbf{c}\|_2^2\) with \(\beta=\alpha/(1-\alpha)\).
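As a numerical sanity check (not part of the proof), the four bounds of Lemma 1.5 can be verified at random points for a concrete smooth convex function. The sketch below assumes NumPy and uses the hypothetical test function \(f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x}\) with positive semidefinite \(A\), for which \(L=\lambda_{\max}(A)\).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T                       # PSD Hessian, so f is convex
L = np.linalg.eigvalsh(A).max()   # smoothness constant of f(x) = x^T A x / 2

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(5), rng.standard_normal(5)
gap = f(y) - f(x) - grad(x) @ (y - x)
gdiff = np.sum((grad(x) - grad(y)) ** 2)

assert gap <= L / 2 * np.sum((y - x) ** 2) + 1e-9        # (a) upper bound
assert gap >= gdiff / (2 * L) - 1e-9                     # (b) lower bound
assert (grad(x) - grad(y)) @ (x - y) >= gdiff / L - 1e-9 # (c) co-coercivity

alpha = 0.3
x_a = alpha * x + (1 - alpha) * y
# (d) interpolation bound
assert alpha * f(x) + (1 - alpha) * f(y) - f(x_a) >= \
    alpha * (1 - alpha) / (2 * L) * gdiff - 1e-9
```

Any other smooth convex function (e.g., logistic loss) with a valid \(L\) would work equally well; the quadratic is chosen only because its \(L\) is available in closed form.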

Lemma 1.6 If \(f\) is differentiable and \(\mu\)-strongly convex w.r.t. \(\|\cdot\|_2\), the following conditions hold for all \(\mathbf{x},\mathbf{y}\) and any \(\alpha\in[0,1]\):

(a) \(f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{\mu}{2}\|\mathbf{y}-\mathbf{x}\|_2^2\);

(b) \(f(\mathbf{y})\le f(\mathbf{x})+\nabla f(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})+\frac{1}{2\mu}\|\nabla f(\mathbf{y})-\nabla f(\mathbf{x})\|_2^2\);

(c) \((\nabla f(\mathbf{x})-\nabla f(\mathbf{y}))^{\top}(\mathbf{x}-\mathbf{y})\le\frac{1}{\mu}\|\nabla f(\mathbf{x})-\nabla f(\mathbf{y})\|_2^2\);

(d) \(\alpha f(\mathbf{x})+(1-\alpha)f(\mathbf{y})\ge f(\alpha\mathbf{x}+(1-\alpha)\mathbf{y})+\frac{\mu\alpha(1-\alpha)}{2}\|\mathbf{x}-\mathbf{y}\|_2^2\).

From (a) we can derive a useful inequality for strongly convex optimization. Let \(\mathbf{x}_*=\arg\min_{\mathbf{x}}f(\mathbf{x})\). Then, for any \(\mathbf{x}\), we have

\[\begin{align}\label{eqn:scx-opt} f(\mathbf{x})\ge f(\mathbf{x}_*)+\nabla f(\mathbf{x}_*)^{\top}(\mathbf{x}- \mathbf{x}_*)+\frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_*\|_2^2\ge f(\mathbf{x}_*)+\frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_*\|_2^2, \end{align}\]

where the second inequality follows from the first-order optimality condition \(\nabla f(\mathbf{x}_*)^{\top}(\mathbf{x}-\mathbf{x}_*)\ge0\).

Proof (of (b)).
Define \(\phi(\mathbf{z})=f(\mathbf{z})-\nabla f(\mathbf{x})^{\top}\mathbf{z}\). Since \(\nabla\phi(\mathbf{x})=0\), the first-order optimality condition implies that \(\mathbf{z}^*=\mathbf{x}\) minimizes \(\phi\); moreover, \(\phi\) is also \(\mu\)-strongly convex since \(f\) is \(\mu\)-strongly convex. \[\begin{align*} \phi(\mathbf{x})=\min_{\mathbf{z}}\phi(\mathbf{z}) &\ge\min_{\mathbf{z}}\left\{\phi(\mathbf{y})+(\nabla\phi(\mathbf{y}))^{\top}(\mathbf{z}-\mathbf{y})+\frac{\mu}{2}\|\mathbf{z}-\mathbf{y}\|_2^2\right\}\\ &\stackrel{\mathbf{r}=\mathbf{z}-\mathbf{y}}{=} \min_{\mathbf{r}}\left\{\phi(\mathbf{y})+(\nabla\phi(\mathbf{y}))^{\top}\mathbf{r}+\frac{\mu}{2}\|\mathbf{r}\|_2^2\right\}\\ &\stackrel{\text{solve }\mathbf{r}}{=} \phi(\mathbf{y})-\frac{\|\nabla\phi(\mathbf{y})\|_2^2}{2\mu}. \end{align*}\] Then, we have \(2\mu(\phi(\mathbf{y})-\phi(\mathbf{x}))\le\|\nabla\phi(\mathbf{y})\|_2^2\), which proves the result by plugging in \(\phi(\mathbf{y})=f(\mathbf{y})-\nabla f(\mathbf{x})^{\top}\mathbf{y}\), \(\phi(\mathbf{x})=f(\mathbf{x})-\nabla f(\mathbf{x})^{\top}\mathbf{x}\) and \(\nabla\phi(\mathbf{y})=\nabla f(\mathbf{y})-\nabla f(\mathbf{x})\).

Parts (c) and (d) can be proved similarly to the previous lemma.
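The quadratic-growth bound \(f(\mathbf{x})\ge f(\mathbf{x}_*)+\frac{\mu}{2}\|\mathbf{x}-\mathbf{x}_*\|_2^2\) derived above can be checked numerically. This sketch (assuming NumPy) uses a hypothetical strongly convex quadratic \(f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x}+\mathbf{b}^{\top}\mathbf{x}\), for which \(\mu=\lambda_{\min}(A)\) and \(\mathbf{x}_*=-A^{-1}\mathbf{b}\).

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
A = M @ M.T + 0.5 * np.eye(4)     # positive definite Hessian
mu = np.linalg.eigvalsh(A).min()  # strong-convexity constant
b = rng.standard_normal(4)

f = lambda x: 0.5 * x @ A @ x + b @ x
x_star = np.linalg.solve(A, -b)   # unconstrained minimizer (grad = 0)

for _ in range(100):
    x = rng.standard_normal(4)
    # f(x) >= f(x*) + (mu/2)||x - x*||^2 ; the gradient term vanishes at x*
    assert f(x) >= f(x_star) + mu / 2 * np.sum((x - x_star) ** 2) - 1e-10
```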

Lemma 1.7 Let \(r(\cdot)\) be \(\mu\)-strongly convex w.r.t. \(\|\cdot\|_2\), and for any \(\mathbf{z}\) define \[\begin{align} \operatorname{prox}_{\eta r}(\mathbf{z}):=\arg\min_{\mathbf{w}}\left\{r(\mathbf{w})+\frac{1}{2\eta}\|\mathbf{w}-\mathbf{z}\|_2^2\right\}. \end{align}\] Then, \(\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2\le\frac{1}{1+\mu\eta}\|\mathbf{z}_1-\mathbf{z}_2\|_2\) holds for all \(\mathbf{z}_1,\mathbf{z}_2\).

Proof.
First, when \(r=0\) (i.e., \(\mu=0\)), \(\operatorname{prox}_{\eta r}\) is the identity map and the conclusion trivially holds. Next, we prove it for general \(r\).

By the optimality of \(\operatorname{prox}_{\eta r}(\mathbf{z}_1)\) and \(\operatorname{prox}_{\eta r}(\mathbf{z}_2)\), we have \[\begin{align*} \mathbf{u}:=\frac{\mathbf{z}_1-\operatorname{prox}_{\eta r}(\mathbf{z}_1)}{\eta}\in\partial r(\operatorname{prox}_{\eta r}(\mathbf{z}_1)),\\ \mathbf{v}:=\frac{\mathbf{z}_2-\operatorname{prox}_{\eta r}(\mathbf{z}_2)}{\eta}\in\partial r(\operatorname{prox}_{\eta r}(\mathbf{z}_2)). \end{align*}\] Since \(r(\mathbf{x})\) is \(\mu\)-strongly convex, we have \[\begin{align*} r(\operatorname{prox}_{\eta r}(\mathbf{z}_1)) &\ge r(\operatorname{prox}_{\eta r}(\mathbf{z}_2))+\mathbf{v}^{\top}(\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)) +\frac{\mu}{2}\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2^2,\\ r(\operatorname{prox}_{\eta r}(\mathbf{z}_2)) &\ge r(\operatorname{prox}_{\eta r}(\mathbf{z}_1))+\mathbf{u}^{\top}(\operatorname{prox}_{\eta r}(\mathbf{z}_2)-\operatorname{prox}_{\eta r}(\mathbf{z}_1)) +\frac{\mu}{2}\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2^2. 
\end{align*}\] Adding them together, we have \[\begin{align*} \mu\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2^2 &\le(\mathbf{u}-\mathbf{v})^{\top}(\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2))\\ &=\frac{1}{\eta}\left(\mathbf{z}_1-\mathbf{z}_2+\operatorname{prox}_{\eta r}(\mathbf{z}_2)-\operatorname{prox}_{\eta r}(\mathbf{z}_1)\right)^{\top} \left(\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\right), \end{align*}\] which implies \[\begin{align*} \left(\mu+\frac{1}{\eta}\right)\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2^2 &\le\frac{1}{\eta}(\mathbf{z}_1-\mathbf{z}_2)^{\top}(\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2))\\ &\le\frac{1}{\eta}\|\mathbf{z}_1-\mathbf{z}_2\|_2\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2. \end{align*}\] Thus \(\|\operatorname{prox}_{\eta r}(\mathbf{z}_1)-\operatorname{prox}_{\eta r}(\mathbf{z}_2)\|_2\le\frac{1}{\mu\eta+1}\|\mathbf{z}_1-\mathbf{z}_2\|_2\).
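Lemma 1.7's contraction factor can be checked numerically whenever the prox has a closed form. The sketch below (assuming NumPy) uses the hypothetical regularizer \(r(\mathbf{w})=\frac{\mu}{2}\|\mathbf{w}\|_2^2+\lambda\|\mathbf{w}\|_1\), which is \(\mu\)-strongly convex and whose prox is soft-thresholding followed by shrinkage by \(1/(1+\mu\eta)\).

```python
import numpy as np

MU, LAM, ETA = 0.5, 0.3, 0.8

def prox(z, eta=ETA, mu=MU, lam=LAM):
    # Closed-form prox of r(w) = (mu/2)||w||_2^2 + lam ||w||_1:
    # componentwise soft-threshold at lam*eta, then divide by (1 + mu*eta).
    return np.sign(z) * np.maximum(np.abs(z) - lam * eta, 0.0) / (1.0 + mu * eta)

rng = np.random.default_rng(2)
for _ in range(100):
    z1, z2 = rng.standard_normal(6), rng.standard_normal(6)
    lhs = np.linalg.norm(prox(z1) - prox(z2))
    rhs = np.linalg.norm(z1 - z2) / (1.0 + MU * ETA)
    assert lhs <= rhs + 1e-12   # prox is a (1/(1+mu*eta))-contraction
```

The check passes because soft-thresholding is 1-Lipschitz componentwise, so the entire contraction comes from the \(1/(1+\mu\eta)\) shrinkage, matching the lemma exactly.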

Lemma 1.8 For a proper closed convex function \(f\) with Fenchel conjugate \(f^*(\mathbf{y})=\sup_{\mathbf{x}}\{\mathbf{y}^{\top}\mathbf{x}-f(\mathbf{x})\}\), the following holds:

(i) if \(f\) is \(G\)-Lipschitz continuous w.r.t. \(\|\cdot\|_2\), then \(\text{dom}(f^*)\subseteq\{\mathbf{y}:\|\mathbf{y}\|_2\le G\}\);

(ii) if \(\mathbf{x}_*\) attains the supremum in the definition of \(f^*(\mathbf{y}_*)\), then \(\mathbf{x}_*\in\partial f^*(\mathbf{y}_*)\);

(iii) if \(f\) is differentiable and \(\nabla f\) is invertible, then \((\nabla f)^{-1}=\nabla f^*\).

Proof.
Let us prove (i). For any \(\mathbf{y}\) with \(\|\mathbf{y}\|_2>G\), let \(\mathbf{u}=\mathbf{y}/\|\mathbf{y}\|_2\) and take \(\mathbf{x}=t\mathbf{u}\) with \(t>0\). By the \(G\)-Lipschitz continuity of \(f\), \(f(t\mathbf{u})\le f(0)+Gt\), hence \[ \mathbf{y}^{\top}t\mathbf{u}-f(t\mathbf{u})\ge t(\|\mathbf{y}\|_2-G)-f(0)\to+\infty \quad\text{as }t\to\infty, \] so \(f^*(\mathbf{y})=+\infty\) and thus \(\mathbf{y}\notin\text{dom}(f^*)\).

Next, we prove (ii). Since \(\mathbf{x}_*\) attains the supremum in the definition of \(f^*(\mathbf{y}_*)\), we have \(\mathbf{y}_*\in\partial f(\mathbf{x}_*)\) according to the optimality condition and \(f^*(\mathbf{y}_*)=\mathbf{x}_*^{\top}\mathbf{y}_*-f(\mathbf{x}_*)\). Using \(f^{**}=f\), we obtain \[ f(\mathbf{x}_*)=\sup_{\mathbf{y}}\{\mathbf{y}^{\top}\mathbf{x}_*-f^*(\mathbf{y})\}=\mathbf{x}_*^{\top}\mathbf{y}_*-f^*(\mathbf{y}_*), \] and the above equality shows that \(\mathbf{y}_*\) attains the supremum. Hence, \(\mathbf{y}_*\in\arg\max_{\mathbf{y}}\{\mathbf{y}^{\top}\mathbf{x}_*-f^*(\mathbf{y})\}\), and \(\mathbf{x}_*\in\partial f^*(\mathbf{y}_*)\).

Lastly, we prove (iii). By definition, \(f^*(\mathbf{y}_*)\) is the supremum of the concave function \(F(\mathbf{x})=\mathbf{y}_*^{\top}\mathbf{x}-f(\mathbf{x})\). If this supremum is attained at \(\mathbf{x}_*\in\mathbb{R}^d\), then \(\nabla F(\mathbf{x}_*)=0\), which is to say \(\mathbf{y}_*=\nabla f(\mathbf{x}_*)\). Conversely, if \(\mathbf{y}_*=\nabla f(\mathbf{x}_*)\), then \(\mathbf{x}_*\) is a maximizer of \(F(\mathbf{x})\), and therefore \(f^*(\mathbf{y}_*)=\mathbf{y}_*^{\top}\mathbf{x}_*-f(\mathbf{x}_*)\). Applying this result to both \(f\) and \(f^*\), \[\begin{align*} \mathbf{y}=\nabla f(\mathbf{x})\quad&\text{if and only if}\quad f(\mathbf{x})+f^*(\mathbf{y})=\mathbf{x}^{\top}\mathbf{y},\\ \mathbf{x}=\nabla f^*(\mathbf{y})\quad&\text{if and only if}\quad f^*(\mathbf{y})+f^{**}(\mathbf{x})=\mathbf{x}^{\top}\mathbf{y}. \end{align*}\] Since \(f^{**}=f\), the two conditions coincide, so \(\mathbf{x}=(\nabla f)^{-1}(\mathbf{y})=\nabla f^*(\mathbf{y})\). Hence \((\nabla f)^{-1}=\nabla f^*\).
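Part (iii) can be illustrated concretely. This sketch (assuming NumPy) uses the hypothetical pair \(f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x}\) and \(f^*(\mathbf{y})=\frac{1}{2}\mathbf{y}^{\top}A^{-1}\mathbf{y}\) with positive definite \(A\), and checks both the Fenchel equality \(f(\mathbf{x})+f^*(\mathbf{y})=\mathbf{x}^{\top}\mathbf{y}\) at \(\mathbf{y}=\nabla f(\mathbf{x})\) and the identity \((\nabla f)^{-1}=\nabla f^*\).

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)           # positive definite, so grad f is invertible

f      = lambda x: 0.5 * x @ A @ x
f_conj = lambda y: 0.5 * y @ np.linalg.solve(A, y)  # f*(y) = y^T A^{-1} y / 2

x = rng.standard_normal(4)
y = A @ x                         # y = grad f(x)

# Fenchel equality holds exactly at y = grad f(x)
assert abs(f(x) + f_conj(y) - x @ y) < 1e-9
# grad f*(y) = A^{-1} y recovers x, i.e. (grad f)^{-1} = grad f*
assert np.allclose(np.linalg.solve(A, y), x)
```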

Lemma 1.9 If \(f\) is \(\mu\)-strongly convex w.r.t. \(\|\cdot\|_2\), then its Fenchel conjugate is \(1/\mu\)-smooth. Similarly, if \(f\) is \(L\)-smooth and convex w.r.t. \(\|\cdot\|_2\), then its Fenchel conjugate is \(1/L\)-strongly convex.

Proof.
Let \(f^*(\mathbf{y})=\max_{\mathbf{x}}\mathbf{x}^{\top}\mathbf{y}-f(\mathbf{x})\) be the Fenchel conjugate of \(f\).

Suppose \(f\) is \(\mu\)-strongly convex. Let \(\mathbf{x}(\mathbf{y})=\arg\max_{\mathbf{x}}\mathbf{x}^{\top}\mathbf{y}-f(\mathbf{x})\). Then \(\nabla f^*(\mathbf{y})=\mathbf{x}(\mathbf{y})\) by Danskin's theorem. Similarly to the previous lemma, we can prove that \[\begin{align*} \|\mathbf{x}(\mathbf{y}_1)-\mathbf{x}(\mathbf{y}_2)\|_2\le\frac{1}{\mu}\|\mathbf{y}_1-\mathbf{y}_2\|_2, \end{align*}\] which proves the Lipschitz continuity of \(\nabla f^*(\mathbf{y})\) and hence the smoothness of \(f^*\).

Suppose \(f\) is \(L\)-smooth and convex; let us prove \(f^*\) is \(1/L\)-strongly convex. Consider any \(\mathbf{y}_1,\mathbf{y}_2\). Let \(\mathbf{x}_1\in\arg\max_{\mathbf{x}}\mathbf{x}^{\top}\mathbf{y}_1-f(\mathbf{x})\) and \(\mathcal{X}_2=\arg\max_{\mathbf{x}}\mathbf{x}^{\top}\mathbf{y}_2-f(\mathbf{x})\). Then \(\nabla f(\mathbf{x}_1)=\mathbf{y}_1\), and for any \(\mathbf{x}_2\in\mathcal{X}_2\) we have \(\nabla f(\mathbf{x}_2)=\mathbf{y}_2\). Since \[ f^*(\mathbf{y}_1)=\mathbf{x}_1^{\top}\mathbf{y}_1-f(\mathbf{x}_1),\quad f^*(\mathbf{y}_2)=\mathbf{x}_2^{\top}\mathbf{y}_2-f(\mathbf{x}_2), \] we have \[\begin{align*} f^*(\mathbf{y}_1)-f^*(\mathbf{y}_2)-\mathbf{x}_2^{\top}(\mathbf{y}_1-\mathbf{y}_2) &=\mathbf{x}_1^{\top}\mathbf{y}_1-f(\mathbf{x}_1)-(\mathbf{x}_2^{\top}\mathbf{y}_2-f(\mathbf{x}_2))-\mathbf{x}_2^{\top}(\nabla f(\mathbf{x}_1)-\nabla f(\mathbf{x}_2))\\ &=f(\mathbf{x}_2)-f(\mathbf{x}_1)+\mathbf{x}_1^{\top}\nabla f(\mathbf{x}_1)-\mathbf{x}_2^{\top}\nabla f(\mathbf{x}_2)-\mathbf{x}_2^{\top}(\nabla f(\mathbf{x}_1)-\nabla f(\mathbf{x}_2))\\ &=f(\mathbf{x}_2)-f(\mathbf{x}_1)+(\mathbf{x}_1-\mathbf{x}_2)^{\top}\nabla f(\mathbf{x}_1) \ge\frac{1}{2L}\|\nabla f(\mathbf{x}_1)-\nabla f(\mathbf{x}_2)\|_2^2 =\frac{1}{2L}\|\mathbf{y}_1-\mathbf{y}_2\|_2^2, \end{align*}\] where the last inequality is due to part (b) of Lemma 1.5. Hence, we can conclude the proof by noting that \(\partial f^*(\mathbf{y}_2)=\text{conv}(\mathcal{X}_2)\) due to the generalized Danskin theorem.
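Both directions of Lemma 1.9 can be sanity-checked on a quadratic. For the hypothetical \(f(\mathbf{x})=\frac{1}{2}\mathbf{x}^{\top}A\mathbf{x}\) (assuming NumPy), \(f\) is \(\mu\)-strongly convex and \(L\)-smooth with \(\mu=\lambda_{\min}(A)\), \(L=\lambda_{\max}(A)\), and \(\nabla f^*(\mathbf{y})=A^{-1}\mathbf{y}\). The sketch checks that \(\nabla f^*\) is \((1/\mu)\)-Lipschitz, and that it expands distances by at least \(1/L\) (a consequence of the \(1/L\)-strong convexity of \(f^*\)).

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)
mu = np.linalg.eigvalsh(A).min()  # f(x) = x^T A x / 2 is mu-strongly convex
L = np.linalg.eigvalsh(A).max()   # ... and L-smooth

grad_conj = lambda y: np.linalg.solve(A, y)  # grad f*(y) = A^{-1} y

for _ in range(100):
    y1, y2 = rng.standard_normal(4), rng.standard_normal(4)
    d_grad = np.linalg.norm(grad_conj(y1) - grad_conj(y2))
    d_y = np.linalg.norm(y1 - y2)
    assert d_grad <= d_y / mu + 1e-10  # f* is (1/mu)-smooth
    assert d_grad >= d_y / L - 1e-10   # implied by (1/L)-strong convexity of f*
```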

Lemma 1.10
For \(\mathbf{p}\in\Delta_n\), the negative entropy function \(R(\mathbf{p})=\sum_{i=1}^n p_i\log p_i\) is \(1\)-strongly convex w.r.t. the \(\ell_1\) norm \(\|\cdot\|_1\).

Proof.
For any \(\mathbf{x},\mathbf{y}\in\Delta_n\), let \(f(t)=R(\mathbf{y}+t(\mathbf{x}-\mathbf{y}))\). By the second-order Taylor expansion, for some \(t\in(0,1)\), we have \[\begin{align*} R(\mathbf{x}) &=f(1)=f(0)+f'(0)+\frac{1}{2}f''(t)\\ &=R(\mathbf{y})+\nabla R(\mathbf{y})^{\top}(\mathbf{x}-\mathbf{y})+\frac{1}{2}(\mathbf{x}-\mathbf{y})^{\top}\nabla^2R(\mathbf{y}+t(\mathbf{x}-\mathbf{y}))(\mathbf{x}-\mathbf{y}). \end{align*}\] Hence it suffices to prove that \(\mathbf{v}^{\top}\nabla^2R(\mathbf{p})\mathbf{v}\ge\|\mathbf{v}\|_1^2\) for any \(\mathbf{p}\in\Delta_n\). This can be seen from the following: \[\begin{align*} \mathbf{v}^{\top}\nabla^2R(\mathbf{p})\mathbf{v} &=\sum_{i=1}^d v_i^2 p_i^{-1} =\left[\sum_i v_i^2 p_i^{-1}\right]\left[\sum_i p_i\right] \ge\left[\sum_i(p_i^{-1/2}|v_i|)p_i^{1/2}\right]^2\\ &=\left[\sum_i|v_i|\right]^2, \end{align*}\] where the inequality follows by Cauchy inequality.
