
Section 1.4: Convex Optimization

A standard optimization problem is defined by:

\[ \begin{equation}\label{eqn:opt} \begin{aligned} \min_{\mathbf{x}\in\mathbb{R}^d}&\quad f_0(\mathbf{x})\\ \text{s.t.}&\quad f_i(\mathbf{x})\leq 0, \quad i=1,\ldots, m\\ &\quad h_j(\mathbf{x})=0, \quad j=1,\ldots, n. \end{aligned} \end{equation} \]

Definition 1.8. A standard optimization problem is a convex optimization problem if \(f_i(\mathbf{x})\) is convex for \(i=0,\ldots, m\) and \(h_j(\mathbf{x}) = \mathbf{a}_j^{\top}\mathbf{x} + b_j\) is an affine function for \(j=1,\ldots, n\).

The problem (\(\ref{eqn:opt}\)) is feasible if there exists at least one point such that all constraints are satisfied, and infeasible otherwise. The set of all feasible points is called the feasible set, denoted by

\[ \mathcal{X} = \{\mathbf{x}: f_i(\mathbf{x}) \leq 0, i=1,\ldots, m, \; h_j(\mathbf{x})=0, j=1,\ldots, n\}. \]
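To make the formulation concrete, here is a minimal sketch in Python of how such a problem can be posed and solved numerically, assuming the cvxpy package is available; the particular objective and constraints are illustrative choices, not part of the text.

```python
import numpy as np
import cvxpy as cp

# An illustrative instance of the standard form above: a convex quadratic
# objective, one convex inequality constraint, and one affine equality.
d = 3
x = cp.Variable(d)

f0 = cp.sum_squares(x - np.array([1.0, 2.0, 3.0]))  # convex objective f_0
f1 = cp.sum(x) - 2.0                                # f_1(x) <= 0
h1 = x[0] - x[1]                                    # h_1(x) = 0 (affine)

prob = cp.Problem(cp.Minimize(f0), [f1 <= 0, h1 == 0])
prob.solve()

print("optimal value f_*:", prob.value)   # infimum of f_0 over the feasible set
print("optimal solution x_*:", x.value)
```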

The Optimal Value and Optimal Solutions

The optimal value of (\(\ref{eqn:opt}\)) is defined as

\[ f_* = \inf \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\}, \]

where \(\inf\) denotes the infimum: the greatest value that is less than or equal to all objective values at feasible points. Unlike a minimum, the infimum need not be attained; for example, \(\inf_{x \in \mathbb{R}} e^{-x} = 0\), yet no \(x\) achieves this value. If the problem is infeasible, we set \(f_* = +\infty\).

A point \(\mathbf{x}_*\) is an optimal solution if it is feasible, i.e., it satisfies all constraints, and \(f_0(\mathbf{x}_*) = f_*\). Hence, we may have a set of optimal solutions:

\[ \mathcal{X}_* := \arg\min \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\} = \{\mathbf{x} \in \mathcal{X} : f_0(\mathbf{x}) = f_*\}. \]

The optimal solution is unique if the objective is strongly convex (see Lemma 1.2 below).

1.4.1 Local Minima and Global Minima

A feasible point \(\mathbf{x} \in \mathcal{X}\) is called a local minimum if there is an \(R > 0\) such that

\[ f_0(\mathbf{x}) = \inf \{f_0(\mathbf{y}) \mid \mathbf{y} \in \mathcal{X}, \; \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}. \]

Theorem 1.1.

For a convex optimization problem, a local minimum \(\mathbf{x}\) is also a global minimum.

Proof.

Suppose, for contradiction, that \(\mathbf{x}\) is not a global minimum. Then there exists a feasible \(\mathbf{z}\) such that \(f_0(\mathbf{z}) < f_0(\mathbf{x})\). We must have \(\|\mathbf{z} - \mathbf{x}\|_2 > R\), because \(\mathbf{x}\) is an optimal solution in the local region \(\Omega = \{\mathbf{y} \in \mathcal{X}: \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}\).

Let us derive a contradiction. Let \(\mathbf{y} = \mathbf{x} + \theta(\mathbf{z} - \mathbf{x})\), where \(\theta = \frac{R}{\|\mathbf{z} - \mathbf{x}\|_2} \in (0,1)\), so that \(\mathbf{y}\) is a convex combination of the feasible points \(\mathbf{x}\) and \(\mathbf{z}\), and hence \(\mathbf{y} \in \mathcal{X}\) by convexity of the feasible set, with

\[ \|\mathbf{y} - \mathbf{x}\|_2 = \theta \|\mathbf{z} - \mathbf{x}\|_2 = R. \]

Then, by convexity of \(f_0\) and \(f_0(\mathbf{z}) < f_0(\mathbf{x})\),

\[ f_0(\mathbf{y}) \leq \theta f_0(\mathbf{z}) + (1 - \theta) f_0(\mathbf{x}) < f_0(\mathbf{x}), \]

which contradicts the fact that \(\mathbf{x}\) is an optimal solution in the region \(\Omega\). Hence no such \(\mathbf{z}\) exists.

1.4.2 Optimality Conditions

Let us consider a differentiable objective function \(f_0\).

Theorem 1.2.

For a convex optimization problem with non-empty \(\mathcal{X}_*\), a point \(\mathbf{x}\) is optimal if and only if \(\mathbf{x} \in \mathcal{X}\) and

\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]

For a non-differentiable function, the above condition is replaced by: \(\exists \mathbf{v} \in \partial f_0(\mathbf{x})\) such that \[ \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]

Proof.

To prove sufficiency, we use the first-order characterization of convexity of \(f_0\). For any \(\mathbf{y} \in \mathcal{X}\), we have:

\[ f_0(\mathbf{y}) \geq f_0(\mathbf{x}) + \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq f_0(\mathbf x). \]

Hence, \(\mathbf{x}\) is an optimal solution.

To prove necessity, suppose the condition does not hold for some \(\mathbf{y} \in \mathcal{X}\), i.e.,

\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]

let us consider \(\mathbf{z}(t) = t\mathbf{y} + (1 - t)\mathbf{x}\), which is feasible for \(t \in [0, 1]\) since \(\mathcal{X}\) is convex. Then:

\[ \left.\frac{d}{dt} f_0(\mathbf{z}(t))\right|_{t=0} = \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]

which means there exists a small \(t > 0\) such that \(f_0(\mathbf{z}(t)) < f_0(\mathbf{x})\), contradicting the assumption that \(\mathbf{x}\) is optimal.

When the problem is unconstrained, i.e., \(\mathcal{X} = \mathbb{R}^d\), the optimality condition reduces to: \(\mathbf{x}\) is optimal if and only if

\[ \nabla f_0(\mathbf{x}) = 0. \]
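As a quick numerical illustration (a sketch using numpy; the specific objectives are our own choices), we can check both forms of the optimality condition: the variational inequality on a constrained problem, and \(\nabla f_0(\mathbf{x}) = 0\) on an unconstrained one.

```python
import numpy as np

# Constrained case: minimize f0(x) = (x - 2)^2 over X = [0, 1].
# The optimum is x* = 1 with f0'(x*) = -2, and the condition
# f0'(x*)(y - x*) >= 0 holds for all y in [0, 1] since y - 1 <= 0.
x_star = 1.0
grad = 2 * (x_star - 2)
ys = np.linspace(0.0, 1.0, 101)
assert np.all(grad * (ys - x_star) >= 0)

# Unconstrained case: minimize f0(x) = 0.5 x^T A x - b^T x over R^d.
# Setting grad f0(x*) = A x* - b = 0 gives x* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)
print("gradient at x*:", A @ x_star - b)  # numerically zero
```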

Lemma 1.2.

For a convex optimization problem (\(\ref{eqn:opt}\)), if \(f_0\) is strongly convex, then the set \(\mathcal{X}_*\) contains only a single element if it is not empty.

Proof.

Assume, for contradiction, that \(\mathcal{X}_*\) contains two distinct solutions \(\mathbf{x}_1 \neq \mathbf{x}_2\), so that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2) = f_*\). Since \(f_0\) is strongly convex with modulus \(\mu > 0\), for any \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) we have:

\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) + \frac{\mu}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2. \]

By the optimality condition (Theorem 1.2), there exists \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) such that \(\mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) \geq 0\), hence

\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \frac{\mu}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 > f_0(\mathbf{x}_2), \]

which contradicts the assumption that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2)\).

1.4.3 Karush–Kuhn–Tucker (KKT) Conditions

Constrained optimization problems such as (\(\ref{eqn:opt}\)) are often challenging to analyze and solve directly. The Karush–Kuhn–Tucker (KKT) conditions, derived from Lagrangian duality theory, offer first-order necessary conditions for optimality. These conditions can simplify the original problem, sometimes enabling a transformation into a more tractable form or even leading to a closed-form solution.

The Lagrangian function and the Lagrangian dual function

For the constrained optimization (\(\ref{eqn:opt}\)), the Lagrangian function is defined as: \[\begin{align*} L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})=f_0(\mathbf{x})+\sum_{i=1}^m\lambda_i f_i(\mathbf{x})+\sum_{j=1}^n\nu_j h_j(\mathbf{x}), \end{align*}\] where \(\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_m)^{\top},\boldsymbol{\nu}=(\nu_1,\ldots,\nu_n)^{\top}\), which are called the Lagrange multipliers.

The Lagrangian dual function is defined as: \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\] Based on this, we define the Lagrangian dual problem: \[ g_*=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \] Regarding the original optimal value \(f_*\) and the dual optimal value \(g_*\), we have the following weak duality result.

Lemma 1.3
We always have \(g_*\le f_*\).

Proof.
Let \(\mathbf{x}_*\) be an optimal solution to (\(\ref{eqn:opt}\)). For any \(\boldsymbol{\lambda}\ge0\) and \(\boldsymbol{\nu}\), we have \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})&=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})\le L(\mathbf{x}_*,\boldsymbol{\lambda},\boldsymbol{\nu})\\ &=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_i f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_j h_j(\mathbf{x}_*)\le f_0(\mathbf{x}_*), \end{align*}\] where the last inequality uses the facts that \(h_j(\mathbf{x}_*)=0\), \(f_i(\mathbf{x}_*)\le0\), and \(\boldsymbol{\lambda}\ge0\). Taking the supremum over \(\boldsymbol{\lambda}\ge0\) and \(\boldsymbol{\nu}\) yields \(g_*\le f_*\).
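For intuition, weak duality can be seen numerically on a toy problem of our own choosing: for \(\min_x x^2\) s.t. \(1 - x \le 0\), we have \(f_* = 1\), and minimizing \(L(x,\lambda) = x^2 + \lambda(1-x)\) over \(x\) gives \(g(\lambda) = \lambda - \lambda^2/4\), which never exceeds \(f_*\). A sketch:

```python
import numpy as np

# min_x x^2  s.t.  1 - x <= 0, with f_* = 1 attained at x = 1.
# L(x, lam) = x^2 + lam*(1 - x) is minimized at x = lam/2,
# giving the dual function g(lam) = lam - lam^2 / 4.
f_star = 1.0
lams = np.linspace(0.0, 10.0, 1001)
g = lams - lams**2 / 4

assert np.all(g <= f_star + 1e-12)  # weak duality: g(lam) <= f_* for all lam >= 0
print("g_* =", g.max(), "at lam =", lams[g.argmax()])
# Here g_* = f_* = 1, i.e., strong duality also holds for this problem.
```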

KKT conditions

An interesting scenario is strong duality, where \(g_*=f_*\). In this case, we can derive two conditions.

Lemma 1.4
Suppose that the primal and dual optimal values are attained and equal. Let \(\mathbf{x}_*\) be an optimal primal solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\) be optimal dual solutions. Assume that \(f_0,f_i,h_j\) are continuously differentiable. Then the following conditions hold: \[\begin{align} &\nabla f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\label{eqn:KKT-1}\\ &\lambda_{*,i}f_i(\mathbf{x}_*)=0,\quad i=1,\ldots,m, \end{align}\] where the second condition is called the complementary slackness.

Proof.
First, we have \[\begin{align*} g_*&=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu})=g(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\\ &=\inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)\\ &\le f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)\\ &\le f_0(\mathbf{x}_*)=f_*. \end{align*}\] Since \(g_*=f_*\), both inequalities must hold with equality. The first equality reads \[ \inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*), \] which implies that \(\mathbf{x}_*\) minimizes \(L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\) over \(\mathbf{x}\). Hence, by the first-order optimality condition, we have \(\nabla_{\mathbf{x}}L(\mathbf{x}_*,\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=0\), which is (\(\ref{eqn:KKT-1}\)). The second equality reads \[ f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)=f_0(\mathbf{x}_*), \] which implies \(\lambda_{*,i}f_i(\mathbf{x}_*)=0\) for all \(i\), because each term satisfies \(\lambda_{*,i}f_i(\mathbf{x}_*)\le0\) (as \(\lambda_{*,i}\ge0\), \(f_i(\mathbf{x}_*)\le0\), and \(h_j(\mathbf{x}_*)=0\)); if any term were strictly negative, the equality would fail.

💡 KKT conditions
Assume that \(f_0,f_i,h_j\) are continuously differentiable. Let \(\mathbf{x}_*\) be an optimal primal solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\) be optimal dual solutions. The KKT conditions are: \[ \begin{aligned} &\text{(Stationarity)}\quad\nabla f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\\ &\text{(Primal feasibility)}\quad f_i(\mathbf{x}_*)\le0,\quad h_j(\mathbf{x}_*)=0,\forall i,j,\\ &\text{(Dual feasibility)}\quad \lambda_{*,i}\ge0,\forall i,\\ &\text{(Complementary slackness)}\quad \lambda_{*,i}f_i(\mathbf{x}_*)=0,\forall i. \end{aligned} \]
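As a sanity check, the following sketch verifies all four conditions numerically on a small problem of our own choosing, \(\min_{\mathbf{x}} (x_1-1)^2 + (x_2-2)^2\) s.t. \(x_1 + x_2 \le 1\), whose solution \(\mathbf{x}_* = (0, 1)\) with \(\lambda_* = 2\) can be derived by hand.

```python
import numpy as np

# Problem: min (x1 - 1)^2 + (x2 - 2)^2  s.t.  x1 + x2 - 1 <= 0.
# Hand-derived solution: x_* = (0, 1), lambda_* = 2.
x_star = np.array([0.0, 1.0])
lam_star = 2.0

grad_f0 = 2 * (x_star - np.array([1.0, 2.0]))  # gradient of the objective
grad_f1 = np.array([1.0, 1.0])                 # gradient of the constraint
f1 = x_star.sum() - 1.0                        # constraint value at x_*

print("stationarity:", grad_f0 + lam_star * grad_f1)   # = [0, 0]
print("primal feasibility:", f1 <= 0)                  # True (f1 = 0, active)
print("dual feasibility:", lam_star >= 0)              # True
print("complementary slackness:", lam_star * f1 == 0)  # True
```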

Slater’s condition

How can we ensure that strong duality holds? Constraint qualifications have been developed as sufficient conditions for strong duality. One simple constraint qualification is Slater’s condition for a convex optimization problem: there exists an \(\mathbf{x}\in\operatorname{relint}(D)\), where relint denotes the relative interior of the convex set \(D:=\cap_{i=0}^m\operatorname{dom}(f_i)\), such that \[ f_i(\mathbf{x})<0,\forall i,\quad\text{and}\quad \mathbf{a}_j^{\top}\mathbf{x}+b_j=0,\forall j. \] An important theorem of Lagrangian duality states that strong duality holds when the primal problem is convex and Slater’s condition holds. This suggests a practical approach to compute \(\mathbf{x}_*\) or to transform the original problem into a simpler one. First, we solve the dual problem to obtain an optimal dual solution \((\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\): \[\begin{align} (\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\arg\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align}\] Then we use the stationarity condition of the KKT conditions to derive a closed form for \(\mathbf{x}_*\). In addition, we have \[\begin{align*} \min_{\mathbf{x}}\{f_0(\mathbf{x}),~\text{s.t.}~\mathbf{x}\in\mathcal{X}\}=\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\]
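To see this recipe in action on a problem where every step is explicit, consider the least-norm problem \(\min_{\mathbf{x}} \frac{1}{2}\|\mathbf{x}\|_2^2\) s.t. \(\mathbf{a}^{\top}\mathbf{x} = b\) with \(\mathbf{a} \neq 0\) (a standard worked example, not from the text):

\[ \begin{aligned} L(\mathbf{x},\nu) &= \tfrac{1}{2}\|\mathbf{x}\|_2^2 + \nu(\mathbf{a}^{\top}\mathbf{x} - b), \qquad \nabla_{\mathbf{x}} L = \mathbf{x} + \nu\mathbf{a} = 0 \;\Rightarrow\; \mathbf{x} = -\nu\mathbf{a},\\ g(\nu) &= -\tfrac{1}{2}\nu^2\|\mathbf{a}\|_2^2 - \nu b, \qquad \nu_* = \arg\max_{\nu} g(\nu) = -\frac{b}{\|\mathbf{a}\|_2^2}, \end{aligned} \]

so the stationarity condition recovers \(\mathbf{x}_* = -\nu_*\mathbf{a} = \frac{b}{\|\mathbf{a}\|_2^2}\,\mathbf{a}\), which is feasible and attains \(f_* = g_* = \frac{b^2}{2\|\mathbf{a}\|_2^2}\).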

Examples

Example 1.14: Dual of Distributionally Robust Optimization (DRO)

The following problem often arises in robust machine learning: \[ f(\ell_1,\ldots, \ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i), \] where \(\tau \geq 0\), \(\mathbf{q} \in \Delta_n\), and \(\phi(t): \mathbb{R}^+ \rightarrow \mathbb{R}\) is a proper closed convex function with minimum value \(0\) attained at \(t=1\). Let us derive its dual problem. We write the above problem as a standard convex optimization problem: \[ \begin{aligned} & \min_{\mathbf{p}} -\sum_{i=1}^n p_i \ell_i + \tau \sum_i q_i \phi(p_i / q_i) \\ & \text{s.t.} \quad \sum_{i=1}^n p_i = 1. \end{aligned} \] The constraint \(p_i \geq 0\) is enforced by the domain of \(\phi(t)\).

We define the Lagrangian function: \[ L(\mathbf{p}, \nu) = -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) + \nu \left(\sum_{i=1}^n p_i - 1\right). \] Let us define \[ \phi^*(s) = \max_{t \geq 0} \left\{ ts - \phi(t) \right\}. \] By minimizing over \(\mathbf{p} \geq 0\), we have \[ \begin{aligned} g(\nu) &= \min_{\mathbf{p} \geq 0} \left( -\sum_{i=1}^n p_i (\ell_i - \nu) + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right) - \nu \\ &= -\left\{ \max_{\mathbf{p} \geq 0} \sum_{i=1}^n p_i (\ell_i - \nu) - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} - \nu. \end{aligned} \] With the variable change \(\tilde{p}_i = p_i / q_i\), we have \[ \begin{aligned} g(\nu) &= -\max_{\tilde{\mathbf{p}} \geq 0} \sum_{i=1}^n q_i \left( \tilde{p}_i (\ell_i - \nu) - \tau \phi(\tilde{p}_i) \right) - \nu \\ &= -\sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) - \nu. \end{aligned} \] Since Slater’s condition holds (e.g., \(p_i = 1/n\) for all \(i\) is feasible), we have \[ \begin{aligned} & \min_{\mathbf{p} \in \Delta_n} -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \\ &= \max_{\nu} g(\nu) = -\min_{\nu} \left\{ \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu \right\}. \end{aligned} \] Hence, \[ \max_{\mathbf{p} \in \Delta_n} \left\{ \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} = \min_{\nu} \left\{ \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu \right\}. \]


Example 1.15: Conjugate of \(\phi\) functions

We can derive \(\phi^*\) in closed form for three common choices of \(\phi\); this is left as an exercise.
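For instance, for the KL case \(\phi(t) = t\log t - t + 1\) used in Example 1.16 below, the conjugate follows by direct calculation:

\[ \phi^*(s) = \max_{t \geq 0}\,\{ts - (t\log t - t + 1)\}; \qquad \frac{d}{dt}\left[ts - t\log t + t - 1\right] = s - \log t = 0 \;\Rightarrow\; t = e^s, \]

so \(\phi^*(s) = se^s - (se^s - e^s + 1) = e^s - 1\).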


Example 1.16: KKT Conditions of DRO with KL Divergence

Let us consider a special case of Example 1.14 with \(\phi(t) = t \log t - t + 1\) (note that for \(\mathbf{p}, \mathbf{q} \in \Delta_n\), \(\sum_{i=1}^n q_i \phi(p_i/q_i) = \sum_{i=1}^n p_i \log(p_i/q_i)\), the KL divergence): \[ f(\ell_1,\ldots,\ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n p_i \log \left( \frac{p_i}{q_i} \right). \]

We can derive the following KKT conditions:

\[ \begin{aligned} & (\ell_i - \nu_*) - \tau \left( \log \frac{p_i^*}{q_i} + 1 \right) = 0 \Rightarrow p_i^* = q_i \exp\left( \frac{\ell_i - \nu_* - \tau}{\tau} \right), \\ & \sum_{i=1}^n p_i^* = 1. \end{aligned} \]

Solving \(\sum_{i=1}^n p_i^* = 1\) for \(\nu_*\) and substituting back, we can derive

\[\begin{align}\label{eqn:kldro-ell-equi} &p^*_i =\frac{q_i \exp(\frac{\ell_i}{\tau})}{\sum_{j=1}^n q_j \exp(\frac{\ell_j}{\tau})},\\ &f(\ell_1,\ldots, \ell_n)=\tau \log\left(\sum_{i=1}^n q_i \exp\left(\frac{\ell_i}{\tau}\right)\right). \end{align} \]
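These closed forms can be checked numerically. The sketch below (using numpy/scipy with randomly generated losses) confirms that \(p^*\) sums to one, that \(f = \tau\log\sum_i q_i e^{\ell_i/\tau}\), and that the dual form from Example 1.14 with \(\phi^*(s) = e^s - 1\) attains the same value.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

rng = np.random.default_rng(0)
n, tau = 5, 0.5
ell = rng.normal(size=n)    # losses ell_i
q = np.full(n, 1.0 / n)     # reference distribution q (uniform here)

# Primal closed form from the KKT conditions: a softmax of ell / tau under q.
log_Z = logsumexp(ell / tau, b=q)       # log of sum_i q_i exp(ell_i / tau)
p_star = q * np.exp(ell / tau - log_Z)
f_primal = tau * log_Z

# Dual from Example 1.14 with phi*(s) = e^s - 1:
# min_nu  sum_i tau * q_i * (exp((ell_i - nu)/tau) - 1) + nu.
dual = lambda nu: np.sum(tau * q * (np.exp((ell - nu) / tau) - 1)) + nu
f_dual = minimize_scalar(dual).fun

print("sum of p*:", p_star.sum())            # 1.0
print("primal:", f_primal, "dual:", f_dual)  # equal up to solver tolerance
```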
