Section 1.4: Convex Optimization
A standard optimization problem is defined by:
\begin{equation}\label{eqn:opt} \begin{aligned} \min_{\mathbf{x}\in\mathbb{R}^d}&\quad f_0(\mathbf{x})\\ \text{s.t.}&\quad f_i(\mathbf{x})\leq 0, \quad i=1,\ldots, m\\ &\quad h_j(\mathbf{x})=0, \quad j=1,\ldots, n. \end{aligned} \end{equation}
Definition 1.8. A standard optimization problem is a convex optimization problem if \(f_i(\mathbf{x})\) is convex for \(i=0,\ldots, m\) and \(h_j(\mathbf{x}) = \mathbf{a}_j^{\top}\mathbf{x} + b_j\) is an affine function for \(j=1,\ldots, n\).
The problem (\(\ref{eqn:opt}\)) is feasible if there exists at least one point such that all constraints are satisfied, and infeasible otherwise. The set of all feasible points is called the feasible set, denoted by
\[ \mathcal{X} = \{\mathbf{x}: f_i(\mathbf{x}) \leq 0, i=1,\ldots, m, \; h_j(\mathbf{x})=0, j=1,\ldots, n\}. \]
The Optimal Value and Optimal Solutions
The optimal value of (\(\ref{eqn:opt}\)) is defined as
\[ f_* = \inf \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\}, \]
where \(\inf\) denotes the infimum, i.e., the greatest lower bound of the set of objective values over feasible points. The infimum need not be attained: for example, \(\inf_{x\in\mathbb{R}} e^{-x} = 0\), yet no \(x\) achieves this value. If the problem is infeasible, we set \(f_* = \infty\).
A solution \(\mathbf{x}_*\) is an optimal solution if it is feasible (i.e., it satisfies all constraints) and \(f_0(\mathbf{x}_*) = f_*\). In general, there may be a set of optimal solutions:
\[ \mathcal{X}_* := \arg\min \{f_0(\mathbf{x}) \mid \mathbf{x} \in \mathcal{X}\} = \{\mathbf{x} \in \mathcal{X} : f_0(\mathbf{x}) = f_*\}. \]
The optimal solution is unique if the objective is strongly convex.
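As a quick numerical illustration, the notions of optimal value and optimal solution can be checked with an off-the-shelf solver. The problem below is an assumed toy example (not from the text): minimize \((x-2)^2\) subject to \(x \leq 1\), whose optimal value is \(f_* = 1\), attained at the unique solution \(x_* = 1\).

```python
import numpy as np
from scipy.optimize import minimize

# Toy problem (assumed example): minimize (x - 2)^2 subject to x <= 1.
# The feasible set is (-inf, 1]; f_* = 1 is attained at x_* = 1.
f0 = lambda x: (x[0] - 2.0) ** 2
# scipy encodes inequality constraints as c(x) >= 0, i.e. 1 - x >= 0 here.
cons = [{"type": "ineq", "fun": lambda x: 1.0 - x[0]}]

res = minimize(f0, x0=np.array([0.0]), constraints=cons)
print(res.x, res.fun)  # x_* ~= 1, f_* ~= 1
```

Note that the unconstrained minimizer \(x = 2\) is infeasible, so the optimum sits on the boundary of the feasible set.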
1.4.1 Local Minima and Global Minima
A solution \(\mathbf{x}\) is called a local minimum if there is an \(R > 0\) such that
\[ f_0(\mathbf{x}) = \inf \{f_0(\mathbf{y}) \mid \mathbf{y} \in \mathcal{X}, \; \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}. \]
Theorem 1.1.
For a convex optimization problem, a local minimum \(\mathbf{x}\) is also a global minimum.
Proof.
Suppose, for contradiction, that \(\mathbf{x}\) is not a global minimum. Then there exists a feasible \(\mathbf{z}\) such that \(f_0(\mathbf{z}) < f_0(\mathbf{x})\). We must have \(\|\mathbf{z} - \mathbf{x}\|_2 > R\), because \(\mathbf{x}\) minimizes \(f_0\) over the local region \(\Omega = \{\mathbf{y} \in \mathcal{X}: \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}\).
Let us derive a contradiction. Let \(\mathbf{y} = \mathbf{x} + \theta(\mathbf{z} - \mathbf{x})\), where \(\theta = \frac{R}{\|\mathbf{z} - \mathbf{x}\|_2} \in (0,1)\). Since \(\mathbf{y}\) is a convex combination of the feasible points \(\mathbf{x}\) and \(\mathbf{z}\), it is feasible, and
\[ \|\mathbf{y} - \mathbf{x}\|_2 = \theta \|\mathbf{z} - \mathbf{x}\|_2 = R. \]
Then, by convexity of \(f_0\),
\[ f_0(\mathbf{y}) \leq \theta f_0(\mathbf{z}) + (1 - \theta) f_0(\mathbf{x}) < f_0(\mathbf{x}), \]
which contradicts the fact that \(\mathbf{x}\) is an optimal solution in the region \(\Omega = \{\mathbf{y}: \|\mathbf{y} - \mathbf{x}\|_2 \leq R\}\). Hence such a \(\mathbf{z}\) does not exist.■
1.4.2 Optimality Conditions
Let us consider a differentiable objective function \(f_0\).
Theorem 1.2.
For a convex optimization problem with non-empty \(\mathcal{X}_*\), a point \(\mathbf{x}\) is optimal if and only if \(\mathbf{x} \in \mathcal{X}\) and
\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]
For a non-differentiable function, the above condition is replaced by: \(\exists \mathbf{v} \in \partial f_0(\mathbf{x})\) such that \[ \mathbf{v}^{\top}(\mathbf{y} - \mathbf{x}) \geq 0, \quad \forall \mathbf{y} \in \mathcal{X}. \]
Proof.
To prove sufficiency, we use the first-order characterization of convexity of \(f_0\). For any \(\mathbf{y} \in \mathcal{X}\), we have:
\[ f_0(\mathbf{y}) \geq f_0(\mathbf{x}) + \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) \geq f_0(\mathbf x). \]
Hence, \(\mathbf{x}\) is an optimal solution.
To prove necessity, suppose the condition fails for some \(\mathbf{y} \in \mathcal{X}\), i.e.,
\[ \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]
let us consider \(\mathbf{z}(t) = t\mathbf{y} + (1 - t)\mathbf{x}\), which is feasible for \(t \in [0,1]\) since \(\mathcal{X}\) is convex. Then:
\[ \left.\frac{d}{dt} f_0(\mathbf{z}(t))\right|_{t=0} = \nabla f_0(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) < 0, \]
which means there exists a small \(t > 0\) such that \(f_0(\mathbf{z}(t)) < f_0(\mathbf{x})\), contradicting the assumption that \(\mathbf{x}\) is optimal.■
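The variational inequality in Theorem 1.2 can be verified numerically. Below is a minimal sketch with an assumed toy problem (the box constraint and the point \(\mathbf{c}\) are arbitrary choices): minimizing \(\|\mathbf{x}-\mathbf{c}\|_2^2\) over the box \([0,1]^2\) yields the projection of \(\mathbf{c}\) onto the box, and the condition \(\nabla f_0(\mathbf{x}_*)^{\top}(\mathbf{y}-\mathbf{x}_*) \geq 0\) holds at every sampled feasible \(\mathbf{y}\).

```python
import numpy as np

# Assumed toy problem: minimize f0(x) = ||x - c||^2 over X = [0,1]^2
# with c = (2, -1).  The minimizer is the projection of c onto the box.
c = np.array([2.0, -1.0])
x_star = np.clip(c, 0.0, 1.0)          # projection onto [0,1]^2, x_* = (1, 0)
grad = 2.0 * (x_star - c)              # gradient of f0 at x_*

# Optimality condition: grad^T (y - x_*) >= 0 for all feasible y.
rng = np.random.default_rng(0)
ys = rng.uniform(0.0, 1.0, size=(1000, 2))     # random feasible points
print(np.min((ys - x_star) @ grad) >= -1e-12)  # True
```

Geometrically, the negative gradient at \(\mathbf{x}_*\) points out of the feasible set, so no feasible direction is a descent direction.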
When the problem is unconstrained such that \(\mathcal{X} = \mathbb{R}^d\), the optimality condition implies that \(\mathbf{x}\) is optimal if and only if
\[ \nabla f_0(\mathbf{x}) = 0. \]
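For instance (a minimal sketch with an assumed quadratic objective), solving \(\nabla f_0(\mathbf{x}) = 0\) for \(f_0(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\top}A\mathbf{x} - \mathbf{b}^{\top}\mathbf{x}\) with \(A\) positive definite reduces to a linear system:

```python
import numpy as np

# Assumed example: f0(x) = 0.5 x^T A x - b^T x with A positive definite.
# The gradient is A x - b, so the unique optimum is x_* = A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])  # positive definite
b = np.array([1.0, 1.0])

x_star = np.linalg.solve(A, b)           # solves grad f0(x) = A x - b = 0
print(np.allclose(A @ x_star - b, 0.0))  # True: gradient vanishes at x_*
```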
Lemma 1.2.
For a convex optimization problem (\(\ref{eqn:opt}\)), if \(f_0\) is strongly convex, then the set \(\mathcal{X}_*\) contains only a single element if it is not empty.
Proof.
Assume, for contradiction, that \(\mathcal{X}_*\) contains two distinct solutions \(\mathbf{x}_1 \neq \mathbf{x}_2\), so that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2) = f_*\). Since \(f_0\) is strongly convex (with modulus \(1\)), for any subgradient \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) we have:
\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) + \frac{1}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2. \]
By the optimality condition (Theorem 1.2), there exists \(\mathbf{v} \in \partial f_0(\mathbf{x}_2)\) such that \(\mathbf{v}^{\top}(\mathbf{x}_1 - \mathbf{x}_2) \geq 0\), hence
\[ f_0(\mathbf{x}_1) \geq f_0(\mathbf{x}_2) + \frac{1}{2}\|\mathbf{x}_1 - \mathbf{x}_2\|_2^2 > f_0(\mathbf{x}_2), \]
which contradicts the assumption that \(f_0(\mathbf{x}_1) = f_0(\mathbf{x}_2)\).■
1.4.3 Karush–Kuhn–Tucker (KKT) Conditions
Constrained optimization problems such as (\(\ref{eqn:opt}\)) are often challenging to analyze and solve directly. The Karush-Kuhn-Tucker (KKT) conditions, derived from Lagrangian duality theory, offer first-order necessary conditions for optimality. These conditions can simplify the original problem, sometimes enabling a transformation into a more tractable form or even leading to a closed-form solution.
The Lagrangian function and the Lagrangian dual function
For the constrained optimization (\(\ref{eqn:opt}\)), the Lagrangian function is defined as: \[\begin{align*} L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})=f_0(\mathbf{x})+\sum_{i=1}^m\lambda_i f_i(\mathbf{x})+\sum_{j=1}^n\nu_j h_j(\mathbf{x}), \end{align*}\] where \(\boldsymbol{\lambda}=(\lambda_1,\ldots,\lambda_m)^{\top},\boldsymbol{\nu}=(\nu_1,\ldots,\nu_n)^{\top}\), which are called the Lagrangian multipliers.
The Lagrangian dual function is defined as: \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\] Based on this, we define the Lagrangian dual problem: \[ g_*=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \] Regarding the primal optimal value \(f_*\) and the dual optimal value \(g_*\), we have the following weak duality result.
Lemma 1.3
We always have \(g_*\le f_*\).
Proof.
Let \(\mathbf{x}_*\) be an optimal solution to (\(\ref{eqn:opt}\)). For any \(\boldsymbol{\lambda}\ge0\) and \(\boldsymbol{\nu}\), we have \[\begin{align*} g(\boldsymbol{\lambda},\boldsymbol{\nu})&=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda},\boldsymbol{\nu})\le L(\mathbf{x}_*,\boldsymbol{\lambda},\boldsymbol{\nu})\\ &=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_i f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_j h_j(\mathbf{x}_*)\le f_0(\mathbf{x}_*), \end{align*}\] where the last inequality uses the facts that \(h_j(\mathbf{x}_*)=0\), \(f_i(\mathbf{x}_*)\le0\), and \(\boldsymbol{\lambda}\ge0\). The conclusion follows.
■
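Weak duality can be seen concretely on a small example. The following sketch uses an assumed toy problem, \(\min x^2\) s.t. \(1 - x \leq 0\), for which \(f_* = 1\); minimizing the Lagrangian \(L(x,\lambda) = x^2 + \lambda(1-x)\) over \(x\) at fixed \(\lambda \geq 0\) gives \(x = \lambda/2\) and hence \(g(\lambda) = \lambda - \lambda^2/4\).

```python
import numpy as np

# Assumed toy problem: min x^2 s.t. 1 - x <= 0, so f_* = 1 at x_* = 1.
# Dual function: g(lam) = min_x [x^2 + lam*(1 - x)] = lam - lam^2 / 4.
f_star = 1.0
g = lambda lam: lam - lam ** 2 / 4.0

lams = np.linspace(0.0, 10.0, 1001)
print(np.all(g(lams) <= f_star + 1e-12))  # True: g(lam) <= f_* for all lam >= 0
```

Here the dual bound is tight: \(g(\lambda)\) is maximized at \(\lambda = 2\) with \(g(2) = 1 = f_*\), consistent with strong duality for this convex problem.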
KKT conditions
An interesting scenario is strong duality, where \(g_*=f_*\). In this case, we can derive two necessary conditions for optimality.
Lemma 1.4
Suppose that the primal and dual optimal values are attained and equal. Let \(\mathbf{x}_*\) be an optimal primal solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\) be optimal dual solutions. Assume that \(f_0,f_i,h_j\) are continuously differentiable. Then the following conditions hold: \[\begin{align}
&\nabla f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\label{eqn:KKT-1}\\
&\lambda_{*,i}f_i(\mathbf{x}_*)=0,\quad i=1,\ldots,m,
\end{align}\] where the second condition is called complementary slackness.
Proof.
First, we have \[\begin{align*} g_*&=\sup_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu})=g(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\inf_{\mathbf{x}}L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\\ &=\inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)\\ &\le f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)\\ &\le f_0(\mathbf{x}_*)=f_*. \end{align*}\] Since \(g_*=f_*\), both inequalities must hold with equality. The first equality is \[ \inf_{\mathbf{x}}\left(f_0(\mathbf{x})+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x})+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x})\right)=f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*), \] which implies that \(\mathbf{x}_*\) minimizes \(L(\mathbf{x},\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\) over \(\mathbf{x}\). Hence, by the first-order optimality condition, we have \(\nabla_{\mathbf{x}}L(\mathbf{x}_*,\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=0\), which is (\(\ref{eqn:KKT-1}\)). The second equality is \[ f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}h_j(\mathbf{x}_*)=f_0(\mathbf{x}_*), \] which implies \(\lambda_{*,i}f_i(\mathbf{x}_*)=0\) for all \(i\): each term satisfies \(\lambda_{*,i}f_i(\mathbf{x}_*)\le0\), so if any term were strictly negative the equality would fail.
■
💡 KKT conditions
Assume that \(f_0,f_i,h_j\) are
continuously differentiable. Let \(\mathbf{x}_*\) be an optimal primal
solution and \(\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*\)
be optimal dual solutions. The KKT conditions are: \[
\begin{aligned}
&\text{(Stationarity)}\quad\nabla
f_0(\mathbf{x}_*)+\sum_{i=1}^m\lambda_{*,i}\nabla
f_i(\mathbf{x}_*)+\sum_{j=1}^n\nu_{*,j}\nabla h_j(\mathbf{x}_*)=0,\\
&\text{(Primal feasibility)}\quad f_i(\mathbf{x}_*)\le0,\quad
h_j(\mathbf{x}_*)=0,\forall i,j,\\
&\text{(Dual feasibility)}\quad \lambda_{*,i}\ge0,\forall i,\\
&\text{(Complementary slackness)}\quad
\lambda_{*,i}f_i(\mathbf{x}_*)=0,\forall i.
\end{aligned}
\]
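The four conditions can be checked by hand on a small example. The sketch below uses an assumed toy problem, \(\min x^2\) s.t. \(f_1(x) = 1 - x \leq 0\), whose dual function \(g(\lambda) = \lambda - \lambda^2/4\) is maximized at \(\lambda_* = 2\); stationarity \(2x_* - \lambda_* = 0\) then gives \(x_* = 1\).

```python
# Assumed toy problem: min x^2 s.t. f_1(x) = 1 - x <= 0.
# Dual: g(lam) = lam - lam^2/4, maximized at lam_* = 2; stationarity
# 2 x_* - lam_* = 0 gives x_* = 1.
x_star, lam_star = 1.0, 2.0

stationarity = 2.0 * x_star - lam_star   # d/dx [x^2 + lam*(1 - x)] at x_*
primal_feas = (1.0 - x_star) <= 0.0      # f_1(x_*) <= 0
dual_feas = lam_star >= 0.0              # lam_* >= 0
comp_slack = lam_star * (1.0 - x_star)   # lam_* * f_1(x_*)

print(stationarity == 0.0, primal_feas, dual_feas, comp_slack == 0.0)
# prints: True True True True
```

Note that the constraint is active (\(f_1(x_*) = 0\)), which is why \(\lambda_*\) can be strictly positive without violating complementary slackness.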
Slater’s condition
How can we ensure that strong duality holds? Constraint qualifications have been developed as sufficient conditions for strong duality. One simple constraint qualification is Slater’s condition for a convex optimization problem: there exists an \(\mathbf{x}\in\operatorname{relint}(D)\) (where \(\operatorname{relint}\) denotes the relative interior of the convex set \(D:=\cap_{i=0}^m\operatorname{dom}(f_i)\)) such that \[ f_i(\mathbf{x})<0,\forall i,\quad\text{and}\quad \mathbf{a}_j^{\top}\mathbf{x}+b_j=0,\forall j. \] An important theorem of Lagrangian duality is that strong duality holds when the primal problem is convex and Slater’s condition holds. This suggests a practical approach to compute \(\mathbf{x}_*\) or transform the original problem into a simpler one. First, we solve the dual problem to obtain an optimal dual solution \((\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)\): \[\begin{align} (\boldsymbol{\lambda}_*,\boldsymbol{\nu}_*)=\arg\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align}\] Then we use the stationarity condition of the KKT conditions to derive a closed form for \(\mathbf{x}_*\). In addition, we have \[\begin{align*} \min_{\mathbf{x}}\{f_0(\mathbf{x}),~\text{s.t.}~\mathbf{x}\in\mathcal{X}\}=\max_{\boldsymbol{\lambda}\ge0,\boldsymbol{\nu}}g(\boldsymbol{\lambda},\boldsymbol{\nu}). \end{align*}\]
Examples
Example 1.14: Dual of Distributionally Robust Optimization (DRO)
The following problem often arises in robust machine learning: \[ f(\ell_1,\ldots, \ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i), \] where \(\tau > 0\), \(\mathbf{q} \in \Delta_n\), and \(\phi: \mathbb{R}^+ \rightarrow \mathbb{R}\) is a proper closed convex function with minimum value \(0\) attained at \(t=1\). Let us derive its dual problem. We write the above problem as a standard convex optimization problem: \[ \begin{aligned} & \min_{\mathbf{p}} -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \\ & \text{s.t.} \quad \sum_{i=1}^n p_i = 1. \end{aligned} \] The constraint \(p_i \geq 0\) is implicitly enforced by the domain of \(\phi\).
We define the Lagrangian function: \[ L(\mathbf{p}, \nu) = -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) + \nu \left(\sum_{i=1}^n p_i - 1\right). \] Let us define \[ \phi^*(s) = \max_{t \geq 0} \left\{ ts - \phi(t) \right\}. \] By minimizing over \(\mathbf{p} \geq 0\), we have \[ \begin{aligned} g(\nu) &= \min_{\mathbf{p} \geq 0} \left( -\sum_{i=1}^n p_i (\ell_i - \nu) + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right) - \nu \\ &= -\left\{ \max_{\mathbf{p} \geq 0} \sum_{i=1}^n p_i (\ell_i - \nu) - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} - \nu. \end{aligned} \] With the change of variables \(\tilde{p}_i = p_i / q_i\), we have \[ \begin{aligned} g(\nu) &= -\max_{\tilde{\mathbf{p}} \geq 0} \sum_{i=1}^n q_i \left( \tilde{p}_i (\ell_i - \nu) - \tau \phi(\tilde{p}_i) \right) - \nu \\ &= -\sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) - \nu. \end{aligned} \] Since Slater’s condition holds (e.g., \(p_i = 1/n\) for all \(i\) is strictly feasible), we have \[ \begin{aligned} & \min_{\mathbf{p} \in \Delta_n} -\sum_{i=1}^n p_i \ell_i + \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \\ &= \max_{\nu} g(\nu) = -\min_{\nu} \left\{ \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu \right\}. \end{aligned} \] Hence, \[ \max_{\mathbf{p} \in \Delta_n} \left\{ \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n q_i \phi(p_i / q_i) \right\} = \min_{\nu} \sum_{i=1}^n \tau q_i \phi^*\left( \frac{\ell_i - \nu}{\tau} \right) + \nu. \]
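This duality can be verified numerically. The sketch below uses assumed data (\(\tau\), \(\mathbf{q}\), \(\boldsymbol{\ell}\) are arbitrary choices) and the chi-square divergence \(\phi(t) = (t-1)^2\), whose conjugate on \(t \geq 0\) is \(\phi^*(y) = y^2/4 + y\) for \(y \geq -2\) and \(-1\) otherwise; the primal maximum over the simplex should equal the dual minimum over \(\nu\).

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

# Assumed data for the check; phi(t) = (t - 1)^2 (chi-square divergence).
tau = 0.5
ell = np.array([0.1, 0.5, 0.2, 0.9])
q = np.full(4, 0.25)

# Primal: maximize sum p_i ell_i - tau * sum q_i phi(p_i/q_i) over the simplex.
neg_obj = lambda p: -(p @ ell - tau * np.sum(q * (p / q - 1.0) ** 2))
res = minimize(neg_obj, x0=q, bounds=[(0.0, None)] * 4,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
primal = -res.fun

# Dual: minimize sum tau * q_i * phi*((ell_i - nu)/tau) + nu over nu.
phi_star = lambda y: np.where(y >= -2.0, y ** 2 / 4.0 + y, -1.0)
dual_fn = lambda nu: np.sum(tau * q * phi_star((ell - nu) / tau)) + nu
dual = minimize_scalar(dual_fn).fun

print(abs(primal - dual) < 1e-3)  # True: primal and dual values coincide
```

The dual is a one-dimensional minimization regardless of \(n\), which is precisely why this reformulation is useful in practice.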
Example 1.15: Conjugate of \(\phi\) functions
We can derive \(\phi^*\) for three cases (exercise):
\(\phi(t) = (t - 1)^2\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - (t - 1)^2 \right) = \begin{cases} \frac{1}{4} y^2 + y & \text{if } y \geq -2 \\ - 1 & \text{otherwise} \end{cases} \]
\(\phi(t) = t \log t - t + 1\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - (t \log t - t + 1) \right) = \exp(y) - 1 \]
\(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1/\alpha)\) for \(\alpha \in (0,1]\)
\[ \phi^*(y) = \max_{t \geq 0} \left( y t - \mathbb{I}_{0-\infty}(t \leq 1/\alpha) \right) = \frac{[y]_+}{\alpha} \]
Example 1.16: KKT Conditions of DRO with KL Divergence
Let us consider a special case of Example 1.14 with \(\phi(t) = t \log t - t + 1\): \[ f(\ell_1,\ldots,\ell_n) = \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \ell_i - \tau \sum_{i=1}^n p_i \log \left( \frac{p_i}{q_i} \right) \]
We can derive the following KKT conditions:
\[ \begin{aligned} & (\ell_i - \nu_*) - \tau \left( \log \frac{p_i^*}{q_i} + 1 \right) = 0 \Rightarrow p_i^* = q_i \exp\left( \frac{\ell_i - \nu_* - \tau}{\tau} \right), \\ & \sum_{i=1}^n p_i^* = 1. \end{aligned} \]
As a result, we can derive
\[\begin{align}\label{eqn:kldro-ell-equi} &p^*_i =\frac{q_i \exp(\frac{\ell_i}{\tau})}{\sum_{i=1}^nq_i \exp(\frac{\ell_i}{\tau})}\\ &f(\ell_1,\ldots, \ell_n)=\tau \log\left(\sum_{i=1}^nq_i \exp\left(\frac{\ell_i}{\tau}\right)\right). \end{align} \]
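The closed form above can be checked numerically: \(\mathbf{p}^*\) is a softmax reweighting of \(\mathbf{q}\), and plugging \(\mathbf{p}^*\) back into the objective recovers \(\tau \log\big(\sum_i q_i \exp(\ell_i/\tau)\big)\). The sketch below uses assumed values for \(\tau\), \(\boldsymbol{\ell}\), and \(\mathbf{q}\).

```python
import numpy as np

# Assumed data for the check.
tau = 0.3
ell = np.array([0.2, 1.0, 0.5, 0.8])
q = np.array([0.1, 0.4, 0.3, 0.2])

w = q * np.exp(ell / tau)
p_star = w / w.sum()              # closed-form optimal p* (softmax reweighting)
f_closed = tau * np.log(w.sum())  # tau * log sum_i q_i exp(ell_i / tau)

# Direct evaluation of the objective at p*: sum p_i ell_i - tau * KL(p* || q).
f_direct = p_star @ ell - tau * np.sum(p_star * np.log(p_star / q))

print(np.isclose(p_star.sum(), 1.0), np.isclose(f_closed, f_direct))  # True True
```

As \(\tau \to \infty\), \(\mathbf{p}^* \to \mathbf{q}\) and the objective approaches the \(\mathbf{q}\)-weighted average of the losses; as \(\tau \to 0\), it approaches \(\max_i \ell_i\).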