Section 1.4: Convex Optimization
A standard optimization problem is defined by:
\begin{equation}\label{eqn:opt} \begin{aligned} \min_{\mathbf{x}\in\mathbb{R}^d}&\quad f_0(\mathbf{x})\\ \text{s.t.}&\quad f_i(\mathbf{x})\leq 0, \quad i=1,\ldots, m\\ &\quad h_j(\mathbf{x})=0, \quad j=1,\ldots, n. \end{aligned} \end{equation}
Definition 1.8. A standard optimization problem is a convex optimization problem if \(f_i(\mathbf{x})\) is convex for \(i=0,\ldots, m\) and \(h_j(\mathbf{x}) = \mathbf{a}_j^{\top}\mathbf{x} + b_j\) is an affine function for \(j=1,\ldots, n\).
The problem (\(\ref{eqn:opt}\)) is feasible if there exists at least one point such that all constraints are satisfied, and infeasible otherwise. The set of all feasible points is called the feasible set, denoted by
\[ \mathcal{X} = \{\mathbf{x}: f_i(\mathbf{x}) \leq 0, i=1,\ldots, m, \; h_j(\mathbf{x})=0, j=1,\ldots, n\}. \]
The Optimal Value and Optimal Solutions
The optimal value of (\(\ref{eqn:opt}\)) is defined as
\[ f_* = \inf \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\}, \]
where \(\inf\) denotes the infimum, i.e., the greatest lower bound of the set of objective values over feasible points. The infimum need not be attained: for example, \(\inf_{x\in\mathbb{R}} e^{-x} = 0\), yet no \(x\) achieves this value. If the problem is infeasible, we set \(f_* = \infty\).
A solution \(\mathbf{x}_*\) is an optimal solution if it is feasible (i.e., it satisfies all constraints) and \(f_0(\mathbf{x}_*) = f_*\). In general, there may be a set of optimal solutions:
\[ \mathcal{X}_* := \arg\min \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\} = \{\mathbf{x} \in \mathcal{X} : f_0(\mathbf{x}) = f_*\}. \]
The optimal solution is unique if the objective is strongly convex.
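As a quick numerical illustration, the notions of optimal value and optimal solution can be checked with an off-the-shelf solver. The problem below is an assumed toy example (not from the text): minimize \((x-2)^2\) subject to \(x \leq 1\), whose optimal value is \(f_* = 1\), attained at the unique solution \(x_* = 1\).

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem (assumed example): minimize (x - 2)^2 subject to x <= 1.
# The feasible set is (-inf, 1]; f_* = 1 is attained at x_* = 1.
f0 = lambda x: (x[0] - 2.0) ** 2
# scipy encodes inequality constraints as c(x) >= 0, i.e. 1 - x >= 0 here.
cons = [{"type": "ineq", "fun": lambda x: 1.0 - x[0]}]

res = minimize(f0, x0=np.array([0.0]), constraints=cons)
print(res.x, res.fun)  # x_* ~= 1, f_* ~= 1
```

Note that the unconstrained minimizer \(x = 2\) is infeasible, so the optimum sits on the boundary of the feasible set.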
1.4.1 Local Minima and Global Minima
A solution \(\mathbf{x}\) is called a local minimum if there is an \(R > 0\) such that
\[ f_0(\mathbf{x}) = \inf \{f_0(\mathbf{y}) \mid \mathbf{y} \in \mathcal{X}, \; \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}. \]
Theorem 1.1.
For a convex optimization problem, a local minimum \(\mathbf{x}\) is also a global minimum.
Proof.
Suppose, for contradiction, that \(\mathbf{x}\) is not a global minimum. Then there exists a feasible \(\mathbf{z}\) such that \(f_0(\mathbf{z}) < f_0(\mathbf{x})\). We must have \(\|\mathbf{z} - \mathbf{x}\|_2 > R\), because \(\mathbf{x}\) minimizes \(f_0\) over the local region \(\Omega = \{\mathbf{y} \in \mathcal{X}: \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}\).
Let us derive a contradiction. Let \(\mathbf{y} = \mathbf{x} + \theta(\mathbf{z} - \mathbf{x})\), where \(\theta = \frac{R}{\|\mathbf{z} - \mathbf{x}\|_2} \in (0,1)\). Since \(\mathbf{y}\) is a convex combination of the feasible points \(\mathbf{x}\) and \(\mathbf{z}\), it is feasible, and
\[ \|\mathbf{y} - \mathbf{x}\|_2 = \theta \|\mathbf{z} - \mathbf{x}\|_2 = R. \]
Then, by convexity of \(f_0\),
\[ f_0(\mathbf{y}) \leq \theta f_0(\mathbf{z}) + (1 - \theta) f_0(\mathbf{x}) < f_0(\mathbf{x}), \]
which contradicts the fact that \(\mathbf{x}\) is an optimal solution in the region \(\Omega = \{\mathbf{y}: \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}\). Hence such a \(\mathbf{z}\) does not exist.■
1.4.2 Optimality Conditions
Let us consider a differentiable objective function \(f_0\).
Theorem 1.2.
For a convex optimization problem with non-empty \(\mathcal{X}_*\), a point \(\mathbf{x}\) is optimal if and only if \(\mathbf{x} \in \mathcal{X}\) and
\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]
For a non-differentiable function, the above condition is replaced by: \(\exists \mathbf{v} \in \partial f_0(\mathbf{x})\) such that \[ \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]
Proof.
To prove sufficiency, we use the first-order characterization of convexity of \(f_0\). For any \(\mathbf{y} \in \mathcal{X}\), we have:
\[ f_0(\mathbf{y}) \geq f_0(\mathbf{x}) + \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq f_0(\mathbf x). \]
Hence, \(\mathbf{x}\) is an optimal solution.
To prove necessity, suppose the condition fails for some \(\mathbf{y} \in \mathcal{X}\), i.e.,
\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]
let us consider \(\mathbf{z}(t) = t\mathbf{y} + (1 - t)\mathbf{x}\), which is feasible for \(t \in [0,1]\) since \(\mathcal{X}\) is convex. Then:
\[ \left.\frac{d}{dt} f_0(\mathbf{z}(t))\right|_{t=0} = \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]
which means there exists a small \(t > 0\) such that \(f_0(\mathbf{z}(t)) < f_0(\mathbf{x})\), contradicting the assumption that \(\mathbf{x}\) is optimal.■
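The variational inequality in Theorem 1.2 can be verified numerically. Below is a minimal sketch with an assumed toy problem (the box constraint and the point \(\mathbf{c}\) are arbitrary choices): minimizing \(\|\mathbf{x}-\mathbf{c}\|_2^2\) over the box \([0,1]^2\) yields the projection of \(\mathbf{c}\) onto the box, and the condition \(\nabla f_0(\mathbf{x}_*)^{\top}(\mathbf{y}-\mathbf{x}_*) \geq 0\) holds at every sampled feasible \(\mathbf{y}\).

```python
import numpy as np

# Assumed toy problem: minimize f0(x) = ||x - c||^2 over X = [0,1]^2
# with c = (2, -1).  The minimizer is the projection of c onto the box.
c = np.array([2.0, -1.0])
x_star = np.clip(c, 0.0, 1.0)          # projection onto [0,1]^2, x_* = (1, 0)
grad = 2.0 * (x_star - c)              # gradient of f0 at x_*

# Optimality condition: grad^T (y - x_*) >= 0 for all feasible y.
rng = np.random.default_rng(0)
ys = rng.uniform(0.0, 1.0, size=(1000, 2))     # random feasible points
print(np.min((ys - x_star) @ grad) >= -1e-12)  # True
```

Geometrically, the negative gradient at \(\mathbf{x}_*\) points out of the feasible set, so no feasible direction is a descent direction.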
When the problem is unconstrained such that \(\mathcal{X} = \mathbb{R}^d\), the optimality condition implies that \(\mathbf{x}\) is optimal if and only if
\[ \nabla f_0(\mathbf{x}) = 0. \]
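For instance (a minimal sketch with an assumed quadratic objective), solving \(\nabla f_0(\mathbf{x}) = 0\) for \(f_0(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\top}A\mathbf{x} - \mathbf{b}^{\top}\mathbf{x}\) with \(A\) positive definite reduces to a linear system:

```python
import numpy as np

# Assumed example: f0(x) = 0.5 x^T A x - b^T x with A positive definite.
# The gradient is A x - b, so the unique optimum is x_* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([1.0, 1.0])

x_star = np.linalg.solve(A, b)           # solves grad f0(x) = A x - b = 0
print(np.allclose(A @ x_star - b, 0.0))  # True: gradient vanishes at x_*
```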
Lemma 1.2.
For a convex optimization problem (\(\ref{eqn:opt}\)), if \(f_0\) is strongly convex, then the set \(\mathcal{X}_*\) contains only a single element if it is not empty.
Proof.
Assume, for contradiction, that \(\mathcal{X}_*\) contains two distinct solutions \(\mathbf{x}_1 \neq \mathbf{x}_2\), so that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2) = f_*\). Since \(f_0\) is strongly convex (with modulus \(1\)), for any subgradient \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) we have:
\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) + \frac{1}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2. \]
By the optimality condition (Theorem 1.2), there exists \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) such that \(\mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) \geq 0\), hence
\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \frac{1}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 > f_0(\mathbf{x}_2), \]
which contradicts the assumption that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2)\).■
1.4.3 Karush–Kuhn–Tucker (KKT) Conditions
Constrained optimization problems such as (\(\ref{eqn:opt}\)) are often challenging to analyze and solve directly. The Karush-Kuhn-Tucker (KKT) conditions, derived from Lagrangian duality theory, offer first-order necessary conditions for optimality. These conditions can simplify the original problem, sometimes enabling a transformation into a more tractable form or even leading to a closed-form solution.
The Lagrangian function and the Lagrangian dual function
For the constrained optimization (\(\ref{eqn:opt}\)), the Lagrangian function is defined as: \[\begin{align*} L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})=f_0(\mathbf{x})+\sum_{i=1}^m\lambda_i f_i(\mathbf{x})+\sum_{j=1}^n\nu_j h_j(\mathbf{x}), \end{align*}\] where \(\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_m)^{\top},\boldsymbol{\nu}=(\nu_1,\ldots,\nu_n)^{\top}\), which are called the Lagrangian multipliers.
The Lagrangian dual function is defined as: \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\] Based on this, we define the Lagrangian dual problem: \[ g_*=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \] Regarding the primal optimal value \(f_*\) and the dual optimal value \(g_*\), we have the following weak duality result.
Lemma 1.3
We always have \(g_*\le f_*\).
Proof.
Let \(\mathbf{x}_*\) be an optimal solution to (\(\ref{eqn:opt}\)). For any \(\boldsymbol{\lambda}\ge0\) and \(\boldsymbol{\nu}\), we have \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})&=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})\le L(\mathbf{x}_*,\boldsymbol{\lambda},\boldsymbol{\nu})\\ &=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_i f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_j h_j(\mathbf{x}_*)\le f_0(\mathbf{x}_*), \end{align*}\] where the last inequality uses the facts that \(h_j(\mathbf{x}_*)=0\), \(f_i(\mathbf{x}_*)\le0\), and \(\boldsymbol{\lambda}\ge0\). The conclusion follows.
■
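Weak duality can be seen concretely on a small example. The following sketch uses an assumed toy problem, \(\min x^2\) s.t. \(1 - x \leq 0\), for which \(f_* = 1\); minimizing the Lagrangian \(L(x,\lambda) = x^2 + \lambda(1-x)\) over \(x\) at fixed \(\lambda \geq 0\) gives \(x = \lambda/2\) and hence \(g(\lambda) = \lambda - \lambda^2/4\).

```python
import numpy as np

# Assumed toy problem: min x^2 s.t. 1 - x <= 0, so f_* = 1 at x_* = 1.
# Dual function: g(lam) = min_x [x^2 + lam*(1 - x)] = lam - lam^2 / 4.
f_star = 1.0
g = lambda lam: lam - lam ** 2 / 4.0

lams = np.linspace(0.0, 10.0, 1001)
print(np.all(g(lams) <= f_star + 1e-12))  # True: g(lam) <= f_* for all lam >= 0
```

Here the dual bound is tight: \(g(\lambda)\) is maximized at \(\lambda = 2\) with \(g(2) = 1 = f_*\), consistent with strong duality for this convex problem.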
KKT conditions
An interesting scenario is strong duality, where \(g_*=f_*\). In this case, we can derive two necessary conditions for optimality.
Lemma 1.4
Suppose that the primal and dual optimal values are attained and equal. Let \(\mathbf{x}_*\) be an optimal primal solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\) be optimal dual solutions. Assume that \(f_0,f_i,h_j\) are continuously differentiable. Then the following conditions hold: \[\begin{align}
&\nabla f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\label{eqn:KKT-1}\\
&\lambda_{*,i}f_i(\mathbf{x}_*)=0,\quad i=1,\ldots,m,
\end{align}\] where the second condition is called complementary slackness.
Proof.
First, we have \[\begin{align*} g_*&=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu})=g(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\\ &=\inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)\\ &\le f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)\\ &\le f_0(\mathbf{x}_*)=f_*. \end{align*}\] Since \(g_*=f_*\), both inequalities must hold with equality. The first equality is \[ \inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*), \] which implies that \(\mathbf{x}_*\) minimizes \(L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\) over \(\mathbf{x}\). Hence, by the first-order optimality condition, we have \(\nabla_{\mathbf{x}}L(\mathbf{x}_*,\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=0\), which is (\(\ref{eqn:KKT-1}\)). The second equality is \[ f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)=f_0(\mathbf{x}_*), \] which implies \(\lambda_{*,i}f_i(\mathbf{x}_*)=0\) for all \(i\): each term satisfies \(\lambda_{*,i}f_i(\mathbf{x}_*)\le0\), so if any term were strictly negative the equality would fail.
■
💡 KKT conditions
Assume that \(f_0,f_i,h_j\) are
continuously differentiable. Let \(\mathbf{x}_*\) be an optimal primal
solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\)
be optimal dual solutions. The KKT conditions are: \[
\begin{aligned}
&\text{(Stationarity)}\quad\nabla
f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla
f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\\
&\text{(Primal feasibility)}\quad f_i(\mathbf{x}_*)\le0,\quad
h_j(\mathbf{x}_*)=0,\forall i,j,\\
&\text{(Dual feasibility)}\quad \lambda_{*,i}\ge0,\forall i,\\
&\text{(Complementary slackness)}\quad
\lambda_{*,i}f_i(\mathbf{x}_*)=0,\forall i.
\end{aligned}
\]
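The four conditions can be checked by hand on a small example. The sketch below uses an assumed toy problem, \(\min x^2\) s.t. \(f_1(x) = 1 - x \leq 0\), whose dual function \(g(\lambda) = \lambda - \lambda^2/4\) is maximized at \(\lambda_* = 2\); stationarity \(2x_* - \lambda_* = 0\) then gives \(x_* = 1\).

```python
# Assumed toy problem: min x^2 s.t. f_1(x) = 1 - x <= 0.
# Dual: g(lam) = lam - lam^2/4, maximized at lam_* = 2; stationarity
# 2 x_* - lam_* = 0 gives x_* = 1.
x_star, lam_star = 1.0, 2.0

stationarity = 2.0 * x_star - lam_star   # d/dx [x^2 + lam*(1 - x)] at x_*
primal_feas = (1.0 - x_star) <= 0.0      # f_1(x_*) <= 0
dual_feas = lam_star >= 0.0              # lam_* >= 0
comp_slack = lam_star * (1.0 - x_star)   # lam_* * f_1(x_*)

print(stationarity == 0.0, primal_feas, dual_feas, comp_slack == 0.0)
# prints: True True True True
```

Note that the constraint is active (\(f_1(x_*) = 0\)), which is why \(\lambda_*\) can be strictly positive without violating complementary slackness.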
Slater’s condition
How can we ensure that strong duality holds? Constraint qualifications have been developed as sufficient conditions for strong duality. One simple constraint qualification is Slater’s condition for a convex optimization problem: there exists an \(\mathbf{x}\in\operatorname{relint}(D)\) (where \(\operatorname{relint}\) denotes the relative interior of the convex set \(D:=\cap_{i=0}^m\operatorname{dom}(f_i)\)) such that \[ f_i(\mathbf{x})<0,\forall i,\quad\text{and}\quad \mathbf{a}_j^{\top}\mathbf{x}+b_j=0,\forall j. \] An important theorem of Lagrangian duality is that strong duality holds when the primal problem is convex and Slater’s condition holds. This suggests a practical approach to compute \(\mathbf{x}_*\) or transform the original problem into a simpler one. First, we solve the dual problem to obtain an optimal dual solution \((\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\): \[\begin{align} (\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\arg\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align}\] Then we use the stationarity condition of the KKT conditions to derive a closed form for \(\mathbf{x}_*\). In addition, we have \[\begin{align*} \min_{\mathbf{x}}\{f_0(\mathbf{x}),~\text{s.t.}~\mathbf{x}\in\mathcal{X}\}=\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\]
Examples
Example 1.14: Dual of Distributionally Robust Optimization (DRO)
The following problem often arises in robust machine learning: \[ f(\ell_1,\ldots, \ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i), \] where \(\tau > 0\), \(\mathbf{q} \in \Delta_n\), and \(\phi: \mathbb{R}^+ \rightarrow \mathbb{R}\) is a proper closed convex function with minimum value \(0\) attained at \(t=1\). Let us derive its dual problem. We write the above problem as a standard convex optimization problem: \[ \begin{aligned} & \min_{\mathbf{p}} -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \\ & \text{s.t.} \quad \sum_{i=1}^n p_i = 1. \end{aligned} \] The constraint \(p_i \geq 0\) is implicitly enforced by the domain of \(\phi\).
We define the Lagrangian function: \[ L(\mathbf{p}, \nu) = -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) + \nu \left(\sum_{i=1}^n p_i - 1\right). \] Let us define \[ \phi^*(s) = \max_{t \geq 0} \left\{ ts - \phi(t) \right\}. \] By minimizing over \(\mathbf{p} \geq 0\), we have \[ \begin{aligned} g(\nu) &= \min_{\mathbf{p} \geq 0} \left( -\sum_{i=1}^n p_i (\ell_i - \nu) + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right) - \nu \\ &= -\left\{ \max_{\mathbf{p} \geq 0} \sum_{i=1}^n p_i (\ell_i - \nu) - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} - \nu. \end{aligned} \] With the change of variables \(\tilde{p}_i = p_i / q_i\), we have \[ \begin{aligned} g(\nu) &= -\max_{\tilde{\mathbf{p}} \geq 0} \sum_{i=1}^n q_i \left( \tilde{p}_i (\ell_i - \nu) - \tau \phi(\tilde{p}_i) \right) - \nu \\ &= -\sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) - \nu. \end{aligned} \] Since Slater’s condition holds (e.g., \(p_i = 1/n\) for all \(i\) is strictly feasible), we have \[ \begin{aligned} & \min_{\mathbf{p} \in \Delta_n} -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \\ &= \max_{\nu} g(\nu) = -\min_{\nu} \left\{ \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu \right\}. \end{aligned} \] Hence, \[ \max_{\mathbf{p} \in \Delta_n} \left\{ \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} = \min_{\nu} \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu. \]
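This duality can be verified numerically. The sketch below uses assumed data (\(\tau\), \(\mathbf{q}\), \(\boldsymbol{\ell}\) are arbitrary choices) and the chi-square divergence \(\phi(t) = (t-1)^2\), whose conjugate on \(t \geq 0\) is \(\phi^*(y) = y^2/4 + y\) for \(y \geq -2\) and \(-1\) otherwise; the primal maximum over the simplex should equal the dual minimum over \(\nu\).

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Assumed data for the check; phi(t) = (t - 1)^2 (chi-square divergence).
tau = 0.5
ell = np.array([0.1, 0.5, 0.2, 0.9])
q = np.full(4, 0.25)

# Primal: maximize sum p_i ell_i - tau * sum q_i phi(p_i/q_i) over the simplex.
neg_obj = lambda p: -(p @ ell - tau * np.sum(q * (p / q - 1.0) ** 2))
res = minimize(neg_obj, x0=q, bounds=[(0.0, None)] * 4,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
primal = -res.fun

# Dual: minimize sum tau * q_i * phi*((ell_i - nu)/tau) + nu over nu.
phi_star = lambda y: np.where(y >= -2.0, y ** 2 / 4.0 + y, -1.0)
dual_fn = lambda nu: np.sum(tau * q * phi_star((ell - nu) / tau)) + nu
dual = minimize_scalar(dual_fn).fun

print(abs(primal - dual) < 1e-3)  # True: primal and dual values coincide
```

The dual is a one-dimensional minimization regardless of \(n\), which is precisely why this reformulation is useful in practice.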
Example 1.15: Conjugate of \(\phi\) functions
We can derive \(\phi^*\) for three cases (exercise):
\(\phi(t) = (t - 1)^2\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - (t - 1)^2 \right) = \begin{cases} \frac{1}{4} y^2 + y & \text{if } y \geq -2 \\ - 1 & \text{otherwise} \end{cases} \]
\(\phi(t) = t \log t - t + 1\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - (t \log t - t + 1) \right) = \exp(y) - 1 \]
\(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1/\alpha)\) for \(\alpha \in (0,1]\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - \mathbb{I}_{0-\infty}(t \leq 1/\alpha) \right) = \frac{[y]_+}{\alpha} \]
Example 1.16: KKT Conditions of DRO with KL Divergence
Let us consider a special case of Example 1.14 with \(\phi(t) = t \log t - t + 1\): \[ f(\ell_1,\ldots,\ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n p_i \log \left( \frac{p_i}{q_i} \right) \]
We can derive the following KKT conditions:
\[ \begin{aligned} & (\ell_i - \nu_*) - \tau \left( \log \frac{p_i^*}{q_i} + 1 \right) = 0 \Rightarrow p_i^* = q_i \exp\left( \frac{\ell_i - \nu_* - \tau}{\tau} \right), \\ & \sum_{i=1}^n p_i^* = 1. \end{aligned} \]
As a result, we can derive
\[\begin{align}\label{eqn:kldro-ell-equi} &p^*_i =\frac{q_i \exp(\frac{\ell_i}{\tau})}{\sum_{i=1}^nq_i \exp(\frac{\ell_i}{\tau})}\\ &f(\ell_1,\ldots, \ell_n)=\tau \log\left(\sum_{i=1}^nq_i \exp\left(\frac{\ell_i}{\tau}\right)\right). \end{align} \]
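The closed form above can be checked numerically: \(\mathbf{p}^*\) is a softmax reweighting of \(\mathbf{q}\), and plugging \(\mathbf{p}^*\) back into the objective recovers \(\tau \log\big(\sum_i q_i \exp(\ell_i/\tau)\big)\). The sketch below uses assumed values for \(\tau\), \(\boldsymbol{\ell}\), and \(\mathbf{q}\).

```python
import numpy as np

# Assumed data for the check.
tau = 0.3
ell = np.array([0.2, 1.0, 0.5, 0.8])
q = np.array([0.1, 0.4, 0.3, 0.2])

w = q * np.exp(ell / tau)
p_star = w / w.sum()              # closed-form optimal p* (softmax reweighting)
f_closed = tau * np.log(w.sum())  # tau * log sum_i q_i exp(ell_i / tau)

# Direct evaluation of the objective at p*: sum p_i ell_i - tau * KL(p* || q).
f_direct = p_star @ ell - tau * np.sum(p_star * np.log(p_star / q))

print(np.isclose(p_star.sum(), 1.0), np.isclose(f_closed, f_direct))  # True True
```

As \(\tau \to \infty\), \(\mathbf{p}^* \to \mathbf{q}\) and the objective approaches the \(\mathbf{q}\)-weighted average of the losses; as \(\tau \to 0\), it approaches \(\max_i \ell_i\).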