Section 2.2 Robust Optimization
In this section, we introduce advanced machine learning methods based on the principle of robust optimization. Robust optimization is a framework designed to address uncertainty in data: it ensures that solutions perform well even under the worst-case scenario within a specified uncertainty set.
2.2.1 Distributionally Robust Optimization
Minimizing the average empirical risk often fails to yield a robust model in practice. For instance, the resulting model may perform poorly on minority data (e.g., patients with rare diseases) because the optimization predominantly focuses on majority class data.
Critical: Empirical data may not fully represent the underlying data distribution, leading to generalization issues.
To address these challenges, distributionally robust optimization (DRO) has been extensively studied in machine learning as a means to improve robustness and generalization.
The core idea of DRO is to minimize a robust objective defined over the worst-case distribution of data, perturbed from the empirical distribution. Let us define a set of distributional weights, \(\mathbf{p} = (p_1, \ldots, p_n) \in \Delta_n\), where \(\Delta_n = \{\mathbf{p} \in \mathbb{R}^n : \sum_{i=1}^n p_i = 1, \, p_i \geq 0\}\), with each element \(p_i\) associated with a training sample \(\mathbf{x}_i\).
Definition 2.1 (\(\phi\)-divergence).
Let \(\phi(t)\) be a proper closed convex function that attains its minimum value \(0\) at \(t=1\). The \(\phi\)-divergence is defined as: \[\begin{align} D_{\phi}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^n q_i \, \phi\left(\frac{p_i}{q_i}\right). \end{align}\]
The \(\phi\)-divergence measures the discrepancy between two distributions \(\mathbf{p}\) and \(\mathbf{q}\) using the function \(\phi\).
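As a quick sanity check of Definition 2.1, the generic formula can be evaluated directly. The snippet below is an illustrative sketch with arbitrary toy distributions; it confirms that plugging \(\phi(t) = t \log t - t + 1\) into \(D_{\phi}\) recovers the KL divergence.

```python
import math

def phi_divergence(p, q, phi):
    """D_phi(p, q) = sum_i q_i * phi(p_i / q_i), for strictly positive q."""
    return sum(qi * phi(pi / qi) for pi, qi in zip(p, q))

# phi for the KL divergence: phi(t) = t*log(t) - t + 1.
def phi_kl(t):
    return t * math.log(t) - t + 1

p = [0.5, 0.3, 0.2]        # a toy distribution
q = [1 / 3, 1 / 3, 1 / 3]  # the uniform distribution

d = phi_divergence(p, q, phi_kl)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # direct KL formula
```

Since \(\sum_i p_i = \sum_i q_i = 1\), the \(-t + 1\) part of \(\phi\) cancels and the two computations agree exactly.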
We present two common formulations of DRO based on the \(\phi\)-divergence: regularized DRO and constrained DRO. They differ in how they define the uncertainty set of \(\mathbf{p}\).
Definition 2.2 (Regularized DRO).
\[\begin{align}\label{eqn:rdro}
\min_{\mathbf{w}} \hat{\mathcal{R}}_{\mathcal{S}}(\mathbf{w}) :=
\max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \, \ell(h(\mathbf{w};
\mathbf{x}_i), y_i) - \tau \, D_{\phi}\left(\mathbf{p},
\frac{\mathbf{1}}{n}\right).
\end{align}\]
Definition 2.3 (Constrained DRO).
\[\begin{align}\label{eqn:cdro}
\min_{\mathbf{w}} \hat{\mathcal{R}}_{\mathcal{S}}(\mathbf{w}) :=
&\max_{\mathbf{p} \in \Omega} \ \sum_{i=1}^n p_i \,
\ell(h(\mathbf{w}; \mathbf{x}_i), y_i) \\
\text{where} \quad &\Omega = \left\{\mathbf{p} \ \middle| \
\mathbf{p} \in \Delta_n, \ D_{\phi}\left(\mathbf{p},
\frac{\mathbf{1}}{n}\right) \leq \rho \right\}. \notag
\end{align}\]
The regularized DRO uses a regularization on \(\mathbf{p}\) to implicitly define the uncertainty set, while the constrained DRO uses a constraint on \(\mathbf{p}\) to explicitly define the uncertainty set.
The maximization over \(\mathbf{p}\) in the DRO formulations simulates a worst-case scenario, thereby enhancing the model’s robustness. The DRO objective interpolates between the maximal loss and the average loss:
- Without the \(\phi\)-divergence
regularization or constraint (i.e., \(\tau =
0\) or \(\rho = \infty\)), the
objective simplifies to the maximal loss among all samples, which is
particularly beneficial for handling imbalanced data.
- Conversely, when \(\rho = 0\) or \(\tau = \infty\), the DRO objective reduces to the standard empirical risk.
Adding a tunable \(\phi\)-divergence regularization or constraint (via \(\tau\) or \(\rho\)) thus allows the model's robustness to be controlled between these two extremes.
A list of \(\phi\)-divergences is presented in the following table; two commonly used ones in machine learning are discussed in more detail below.

| Divergence | \(\phi(t)\) | \(\phi^*(s)\) | \(D_{\phi}(\mathbf{p}, \mathbf{q})\) |
|---|---|---|---|
| KL | \(t\log(t) - t + 1\) | \(\exp(s)-1\) | \(\sum_{i=1}^n p_i \log\frac{p_i}{q_i}\) |
| Burg entropy | \(-\log t + t - 1\) | \(-\log(1-s), \ s<1\) | \(\sum_{i=1}^n q_i \log\frac{q_i}{p_i}\) |
| \(\chi^2\) | \((t-1)^2\) | \(\begin{cases}\frac{1}{4}s^2 + s & \text{if } s \geq -2 \\ -1 & \text{o.w.}\end{cases}\) | \(\sum_{i=1}^n q_i (p_i / q_i - 1)^2\) |
| Hellinger distance | \((\sqrt{t} - 1)^2\) | \(\frac{s}{1-s}, \ s<1\) | \(\sum_i (\sqrt{p_i} - \sqrt{q_i})^2\) |
| Variation distance | \(|t-1|\) | \(\begin{cases}s & \text{if } s\in[-1,1] \\ -1 & \text{if } s < -1\end{cases}\) | \(\sum_i |p_i - q_i|\) |
| CVaR | \(\mathbb{I}_{0-\infty}(t \leq 1 / \alpha)\) | \(\frac{[s]_+}{\alpha}\) | \(\begin{cases}0 & \text{if } p_i \leq q_i / \alpha,\ \forall i \\ \infty & \text{o.w.}\end{cases}\) |
KL divergence:
With \(\phi(t) = t \log t - t + 1\), the \(\phi\)-divergence becomes the KL divergence: \[ \mathrm{KL}(\mathbf{p}, \mathbf{q}) = D_{\phi}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^n p_i \log\left(\frac{p_i}{q_i}\right). \]
Conditional Value-at-Risk (CVaR):
With \(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1 / \alpha)\), where \(\alpha \in (0, 1]\) and \(\mathbb{I}_{0-\infty}\) is the \(0-\infty\) indicator function, the divergence becomes \(D_{\phi}(\mathbf{p}, \mathbf{q}) = 0\) if \(p_i \leq q_i / \alpha\) for all \(i\), otherwise \(D_{\phi}(\mathbf{p}, \mathbf{q}) = \infty\). The resulting DRO formulation is also known as the empirical CVaR-\(\alpha\).
The Dual Form of Regularized DRO
Solving the above DRO formulations requires handling a high-dimensional variable \(\mathbf{p}\) over a simplex, which incurs additional overhead compared with solving ERM when the number of training samples is large: one must perform a projection onto the simplex \(\Delta_n\) or the constrained simplex \(\Omega = \{\mathbf{p} \in \Delta_n : D_{\phi}(\mathbf{p}, \frac{\mathbf{1}}{n}) \leq \rho\}\). To reduce this overhead, one approach is to convert the problem into an unconstrained one using Lagrangian duality, based on the convex conjugate of the \(\phi\) function.
Proposition 2.1 (Dual form of Regularized DRO)
Let \(\phi^*(s) = \max_{t \geq 0} ts - \phi(t)\). Then we have \[\begin{align}\label{eqn:dual-rdro} \max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \, \ell(h(\mathbf{w}; \mathbf{x}_i), y_i) - \tau D_{\phi}\left(\mathbf{p}, \frac{\mathbf{1}}{n}\right) = \min_{\mu} \frac{\tau}{n} \sum_{i=1}^n \phi^*\left(\frac{\ell(h(\mathbf{w}; \mathbf{x}_i), y_i) - \mu}{\tau}\right) + \mu. \end{align}\]
The proof can be found in Example 1.14.
Examples of Regularized DRO
KL-divergence Regularized DRO
For the special case of the KL divergence, we can further simplify the
above objective. Since \(\phi(t) = t \log t - t + 1\), we have
\(\phi^*(s) = \exp(s) - 1\) (see Example 1.15), and solving for
\(\mu\) yields
\[
\mu = \tau \log\left(\frac{1}{n} \sum_{i=1}^n
\exp\left(\frac{\ell(h(\mathbf{w}; \mathbf{x}_i),
y_i)}{\tau}\right)\right).
\] Plugging it back into the objective, we can obtain a
simplified form
\[
\max_{\mathbf{p} \in \Delta_n} \sum_{i=1}^n p_i \, \ell_i - \tau \,
\mathrm{KL}\left(\mathbf{p}, \frac{\mathbf{1}}{n}\right) = \tau
\log\left(\frac{1}{n} \sum_{i=1}^n \exp\left(\frac{\ell(h(\mathbf{w};
\mathbf{x}_i), y_i)}{\tau}\right)\right).
\] As a result, with \(\phi(t) = t \log
t - t + 1\), the KL-divergence regularized DRO (\(\ref{eqn:rdro}\)) is equivalent to
\[\begin{align}\label{eqn:klrdro}
\min_{\mathbf{w}} \tau \log\left(\frac{1}{n} \sum_{i=1}^n
\exp\left(\frac{\ell(h(\mathbf{w}; \mathbf{x}_i),
y_i)}{\tau}\right)\right).
\end{align}\]
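The equivalence above is easy to verify numerically. The sketch below (with arbitrary toy losses) evaluates the primal side at the optimal worst-case weights \(p_i \propto \exp(\ell_i/\tau)\) and checks that it matches the log-sum-exp form; it also illustrates the interpolation between the maximal loss (small \(\tau\)) and the average loss (large \(\tau\)).

```python
import math

def kl_dro_objective(losses, tau):
    """tau * log( (1/n) * sum_i exp(loss_i / tau) ), computed stably
    by shifting the exponents by the maximal loss."""
    n = len(losses)
    m = max(losses)
    return m + tau * math.log(sum(math.exp((l - m) / tau) for l in losses) / n)

losses = [0.1, 0.5, 2.0, 3.0]  # toy per-sample losses
tau = 1.0

# Primal side: optimal worst-case weights p_i ∝ exp(loss_i / tau).
z = [math.exp(l / tau) for l in losses]
p = [zi / sum(z) for zi in z]
kl = sum(pi * math.log(len(losses) * pi) for pi in p)  # KL(p, 1/n)
primal = sum(pi * l for pi, l in zip(p, losses)) - tau * kl

# Dual (log-sum-exp) side gives the same value.
dual = kl_dro_objective(losses, tau)

# Interpolation: tau -> 0 approaches the max loss, tau -> infinity the mean.
near_max = kl_dro_objective(losses, 1e-3)
near_mean = kl_dro_objective(losses, 1e3)
```

The max-shift inside `kl_dro_objective` is the standard trick to avoid overflow of \(\exp(\ell_i/\tau)\) for small \(\tau\).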
Empirical CVaR-\(\alpha\)
As another example, we derive the dual form of the empirical CVaR-\(\alpha\). With simple algebra, we can
derive that \(\phi^*(s) =
\frac{[s]_+}{\alpha}\) (see Example 1.15) for \(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1 /
\alpha)\).
As a result, with \(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1 / \alpha)\), the empirical CVaR-\(\alpha\) \(\eqref{eqn:rdro}\) is equivalent to \[\begin{align}\label{eqn:cvar} \min_{\mathbf{w}, \mu} \frac{1}{n \alpha} \sum_{i=1}^n [\ell(h(\mathbf{w}; \mathbf{x}_i), y_i) - \mu]_+ + \mu. \end{align}\] When \(k = n \alpha \in [1, n]\) is an integer, the above objective reduces to the average of top-\(k\) loss values when sorting them in descending order, as shown in the following lemma.
Lemma 2.2
Let \(\ell_{[i]}\) denote the \(i\)-th largest loss among \(\{\ell(h(\mathbf{w}; \mathbf{x}_i), y_i), i=1, \ldots, n\}\) ranked in descending order. If \(\alpha = k/n\), we have \[\begin{align}\label{eqn:cvar-top-k} \min_{\mu} \frac{1}{n \alpha} \sum_{i=1}^n [\ell(h(\mathbf{w}; \mathbf{x}_i), y_i) - \mu]_+ + \mu = \frac{1}{k} \sum_{i=1}^k \ell_{[i]}. \end{align}\]
Proof.
First, we have \[
\min_{\mu} \frac{1}{n \alpha} \sum_{i=1}^n [\ell(h(\mathbf{w};
\mathbf{x}_i), y_i) - \mu]_+ + \mu = \min_{\mu} \frac{1}{n \alpha}
\sum_{i=1}^n [\ell_{[i]} - \mu]_+ + \mu.
\] Let \(\mu_*\) be an optimal
solution given \(\mathbf{w}\). Due to
the first-order optimality condition, we have \[
0 \in \frac{1}{k} \sum_{i=1}^n \partial_{\mu} [\ell_{[i]} - \mu_* ]_+ +
1.
\] Hence, \[\begin{align}\label{eqn:opt-mu}
-k \in \sum_{i=1}^n \partial_{\mu} [\ell_{[i]} - \mu_*]_+.
\end{align}\] If \(\ell_{[k+1]} < \ell_{[k]}\), then any \(\mu_* \in (\ell_{[k+1]}, \ell_{[k]}]\) satisfies \(\eqref{eqn:opt-mu}\). If \(\ell_{[k+1]} = \ell_{[k]}\), then \(\mu_* = \ell_{[k]}\) can still satisfy \(\eqref{eqn:opt-mu}\) by subgradient properties. In either case, taking \(\mu_* = \ell_{[k]}\) and plugging it into the objective gives \[
\frac{1}{k} \sum_{i=1}^n [\ell_{[i]} - \ell_{[k]}]_+ + \ell_{[k]} = \frac{1}{k} \sum_{i=1}^k (\ell_{[i]} - \ell_{[k]}) + \ell_{[k]} = \frac{1}{k} \sum_{i=1}^k \ell_{[i]}.
\] ∎
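Lemma 2.2 can also be checked numerically. Because the objective is piecewise-linear and convex in \(\mu\), its minimum is attained at a kink point \(\mu = \ell_i\), so searching over the observed loss values is exact. The snippet below uses arbitrary toy losses.

```python
def cvar_objective(losses, alpha, mu):
    """Empirical CVaR objective: (1/(n*alpha)) * sum_i [loss_i - mu]_+ + mu."""
    n = len(losses)
    return sum(max(l - mu, 0.0) for l in losses) / (n * alpha) + mu

losses = [0.3, 1.2, 0.7, 2.5, 0.9, 1.8]  # toy per-sample losses
k = 2
alpha = k / len(losses)

# Minimum over mu is attained at a kink point mu = loss_i.
best = min(cvar_objective(losses, alpha, mu) for mu in losses)

# Average of the top-k losses (the right-hand side of the lemma).
top_k_avg = sum(sorted(losses, reverse=True)[:k]) / k
```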
The Dual Form of Constrained DRO
For transforming the constrained DRO, we can use the following proposition based on the Lagrangian duality theory.
Proposition 2.2 (Dual form of Constrained
DRO).
Let \(\phi^*(s) = \max_{t \geq 0} ts -
\phi(t)\). Then we have \[\begin{align}\label{eqn:dual-cdro}
\max_{\mathbf{p} \in \Delta_n, D_{\phi}(\mathbf{p}, \frac{\mathbf{1}}{n})
\leq \rho} \sum_{i=1}^n p_i \, \ell(h(\mathbf{w}; \mathbf{x}_i), y_i) =
\min_{\tau \geq 0, \mu} \frac{\tau}{n} \sum_{i=1}^n
\phi^*\left(\frac{\ell(h(\mathbf{w}; \mathbf{x}_i), y_i) -
\mu}{\tau}\right) + \mu + \tau \rho.
\end{align}\]
The proof is similar to that of Proposition 2.1.
Examples of Constrained DRO
Example (KL Constrained DRO)
With \(\phi(t) = t \log t - t + 1\),
the KL-divergence constrained DRO \(\eqref{eqn:cdro}\) is equivalent to: \[\begin{align}\label{eqn:klcdro}
\min_{\mathbf{w}, \tau \geq 0} \tau \log\left(\frac{1}{n} \sum_{i=1}^n
\exp\left(\frac{\ell(h(\mathbf{w}; \mathbf{x}_i),
y_i)}{\tau}\right)\right) + \tau \rho.
\end{align}\]
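The dual form reveals a useful optimality structure: at an interior minimizer \(\tau_*\), the implied worst-case weights \(p_i \propto \exp(\ell_i/\tau_*)\) satisfy the KL constraint with equality. The sketch below (toy losses; the grid range and tolerance are hypothetical choices) minimizes the dual objective by grid search over \(\tau\) and checks that \(\mathrm{KL}(\mathbf{p}, \frac{\mathbf{1}}{n}) \approx \rho\).

```python
import math

def dual_objective(losses, tau, rho):
    """tau * log( (1/n) * sum_i exp(loss_i/tau) ) + tau * rho, computed stably."""
    n = len(losses)
    m = max(losses)
    return m + tau * math.log(sum(math.exp((l - m) / tau) for l in losses) / n) + tau * rho

losses = [0.1, 0.5, 2.0, 3.0]  # toy per-sample losses
rho = 0.1

# Grid search over tau > 0 on a log-spaced grid (hypothetical range/resolution).
taus = [0.05 * (100 / 0.05) ** (i / 3999) for i in range(4000)]
tau_star = min(taus, key=lambda t: dual_objective(losses, t, rho))

# Implied worst-case weights p_i ∝ exp(loss_i / tau_star).
m = max(losses)
z = [math.exp((l - m) / tau_star) for l in losses]
p = [zi / sum(z) for zi in z]

# At an interior minimizer the KL constraint is tight: KL(p, 1/n) ≈ rho.
kl = sum(pi * math.log(len(losses) * pi) for pi in p)
```

The grid search stands in for the one-dimensional convex minimization over \(\tau\); in practice one would use bisection or solve for \(\tau\) jointly with \(\mathbf{w}\).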
KL-regularized DRO and KL-constrained DRO play important roles in many modern artificial intelligence applications. The LDR loss in Section 2.1 can be interpreted as a form of KL-regularized DRO, except that the uncertainty is placed on the distribution of class labels for each individual data point. Additional applications will be presented in Section 2.4.
2.2.2 Optimized Certainty Equivalent
How should we understand the generalization of DRO? One way is to still consider bounding the expected risk \(\mathcal{R}(\mathbf{w})\) of the learned model. However, the expected risk may not be a good measure when the data distribution is skewed.
For simplicity, let us consider a binary classification problem with \(\mathbb{P}(\mathbf{x}, y = 1) = \pi_+ \, \mathbb{P}_+(\mathbf{x})\) and \(\mathbb{P}(\mathbf{x}, y = -1) = \pi_- \, \mathbb{P}_-(\mathbf{x})\), where \(\pi_+ = \Pr(y = 1)\), \(\pi_- = \Pr(y = -1)\), \(\mathbb{P}_+(\mathbf{x}) = \Pr(\mathbf{x} \mid y = 1)\), and \(\mathbb{P}_-(\mathbf{x}) = \Pr(\mathbf{x} \mid y = -1)\).
If \(\pi_+ \ll \pi_-\), then by the Law of Total Expectation we have \[\begin{align} \mathcal{R}(\mathbf{w}) = \mathbb{E}_{\mathbf{x}, y} \, \ell(h(\mathbf{w}; \mathbf{x}), y) = \pi_+ \, \mathbb{E}_{\mathbf{x} \sim \mathbb{P}_+}[\ell(h(\mathbf{w}; \mathbf{x}), 1)] + \pi_- \, \mathbb{E}_{\mathbf{x} \sim \mathbb{P}_-}[\ell(h(\mathbf{w}; \mathbf{x}), -1)]. \end{align}\]
Since \(\pi_- \gg \pi_+\), the expected risk would be dominated by the expected loss of data from the negative class. As a result, a small \(\mathcal{R}(\mathbf{w})\) does not necessarily indicate a small \(\mathbb{E}_{\mathbf{x} \sim \mathbb{P}_+}[\ell(h(\mathbf{w}; \mathbf{x}), 1)]\).
Instead, we consider the population risk of DRO as the target measure. A formal definition of the population risk for the regularized DRO \(\eqref{eqn:rdro}\) is given below.
Definition 2.4 (Population risk of DRO).
For any \(\tau > 0\), we define the
population risk of regularized DRO \(\eqref{eqn:rdro}\) as: \[\begin{align}
\hat{\mathcal{R}}_{\tau}(\mathbf{w}) = \min_{\mu} \, \tau \,
\mathbb{E}_{\mathbf{x}, y} \, \phi^*\left( \frac{\ell(h(\mathbf{w};
\mathbf{x}), y) - \mu}{\tau} \right) + \mu,
\end{align}\] where \(\phi^*(s) =
\max_{t \geq 0} ts - \phi(t)\).
This risk measure originates from the optimized certainty equivalent (OCE). Two special cases are discussed below:
When \(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1/\alpha)\), the OCE becomes the CVaR-\(\alpha\), i.e., \[ \hat{\mathcal{R}}_\tau(\mathbf{w}) = \mathbb{E}_{\mathbf{x}, y} \left[ \ell(h(\mathbf{w}; \mathbf{x}), y) \ \middle|\ \ell(h(\mathbf{w}; \mathbf{x}), y) \geq \mathrm{VAR}_\alpha(\ell(h(\mathbf{w}; \mathbf{x}), y)) \right], \] where \(\mathrm{VAR}_\alpha(\ell(h(\mathbf{w}; \mathbf{x}), y)) = \sup\left\{ s : \Pr\left(\ell(h(\mathbf{w}; \mathbf{x}), y) \geq s\right) \geq \alpha \right\}\) is the \(\alpha\)-quantile or value-at-risk of the random loss values.
When \(\phi(t) = t \log t - t + 1\), OCE becomes: \[ \hat{\mathcal{R}}_{\tau}(\mathbf{w}) = \tau \log \left( \mathbb{E}_{\mathbf{x}, y} \exp\left( \frac{\ell(h(\mathbf{w}; \mathbf{x}), y)}{\tau} \right) \right). \]
Lemma 2.3.
Let \(\partial \phi^*(t) = \{ s : (\phi^*)'_-(t) \leq s \leq (\phi^*)'_+(t) \}\) denote the subdifferential of \(\phi^*\), where \((\phi^*)'_-\) and \((\phi^*)'_+\) are the left and right derivatives. If \(a < b\), then \[ 0 \leq (\phi^*)'_+(a) \leq (\phi^*)'_-(b). \]
Proof.
By the definition \(\phi^*(s) = \max_{t \geq 0} ts - \phi(t)\), every subgradient of \(\phi^*\) at \(s\) is a maximizing \(t \geq 0\), so \(\partial \phi^*(s) \subseteq [0, \infty)\) and \(\phi^*\) is non-decreasing. Since \(\phi^*\) is also convex, its one-sided derivatives are non-decreasing in \(s\), and the conclusion follows. ∎
Lemma 2.4.
For any \(\tau > 0\), \(\mathbf{w} \in \mathbb{R}^d\), it holds that \(\hat{\mathcal{R}}_\tau(\mathbf{w}) \geq \mathcal{R}(\mathbf{w})\).
Proof.
Since \(\phi(1) = 0\), then \(\phi^*(s) = \max_{t \geq 0} ts - \phi(t) \geq s -
\phi(1) = s\). Hence, \[\begin{align*}
\hat{\mathcal{R}}_{\tau}(\mathbf{w}) &= \min_{\mu} \, \tau \,
\mathbb{E}_{\mathbf{x}, y} \, \phi^*\left( \frac{\ell(h(\mathbf{w};
\mathbf{x}), y) - \mu}{\tau} \right) + \mu \\
&\geq \min_{\mu} \, \tau \, \mathbb{E}_{\mathbf{x}, y} \left(
\frac{\ell(h(\mathbf{w}; \mathbf{x}), y) - \mu}{\tau} \right) + \mu \\
&= \mathcal{R}(\mathbf{w}).
\end{align*}\] ∎
💡 Why it matters:
Lemma 2.3 implies that a data point with a larger loss \(\ell(h(\mathbf{w}; \mathbf{x}), y)\) receives a higher weight in the gradient with respect to \(\mathbf{w}\).
Lemma 2.4 indicates that OCE is a stronger measure than the expected risk: a small OCE implies a small expected risk, while the reverse is not necessarily true.

Based on OCE, we can define the excess risk \(\hat{\mathcal{R}}_{\tau}(\mathbf{w}) - \min_{\mathbf{w} \in \mathcal{W}} \hat{\mathcal{R}}_{\tau}(\mathbf{w})\) and decompose it into an optimization error and a generalization error.
2.2.3 Group Distributionally Robust Optimization
Group DRO (GDRO) extends DRO by aggregating data into groups and applying DRO at the group level to formulate a robust risk function. By using prior knowledge to group the data, it helps promote equity of the learned model and mitigate the impact of spurious correlations between the label and certain features.
Let us consider an illustrative example of classifying waterbird images from landbird images. The training data may have the same number of waterbird images and landbird images. However, most waterbird images may have water in the background and most landbird images may have land in the background. Standard empirical risk minimization may learn spurious correlation between the class labels (e.g., waterbird) and the specific value of some attribute (e.g., the water background). As a consequence, the model may perform poorly on waterbird images with land background.
Critical: Data may exhibit imbalance not in the marginal distribution of the class label but in some joint distribution of the class label and some attributes, which causes the spurious correlation.
GDRO can be used to mitigate this issue by leveraging prior knowledge of spurious correlations to define groups over the training data. Let the training data be divided into multiple groups \(\mathcal{G}_1, \ldots, \mathcal{G}_K\), where \(\mathcal{G}_j = \{ (\mathbf{x}^j_1, y^j_1), \ldots, (\mathbf{x}^j_{n_j}, y^j_{n_j}) \}\) contains the \(n_j\) examples of the \(j\)-th group. We define an averaged loss over the examples of each group: \[ L_j(\mathbf{w}) = \frac{1}{n_j} \sum_{i=1}^{n_j} \ell(h(\mathbf{w}; \mathbf{x}^j_i), y^j_i). \]
Then a regularized group DRO can be defined as \[\begin{align}\label{eqn:rgdro} \min_{\mathbf{w}} \max_{\mathbf{p} \in \Delta_K} \sum_{j=1}^K p_j L_j(\mathbf{w}) - \tau D_{\phi}\left(\mathbf{p}, \frac{\mathbf{1}}{K}\right), \end{align}\] and a constrained group DRO is given by \[\begin{equation}\label{eqn:cgdro} \begin{aligned} \min_{\mathbf{w}} \max_{\mathbf{p} \in \Delta_K, \ D_{\phi}(\mathbf{p}, \frac{\mathbf{1}}{K}) \leq \rho} \sum_{j=1}^K p_j L_j(\mathbf{w}). \end{aligned} \end{equation}\]
By doing so, the learning process is less likely to be dominated by the majority group associated with the spurious correlation between the label and a particular feature. If the model only captures the spurious correlation, the loss for the minority group will be large, which in turn drives the learning process to reduce this loss and thereby mitigate the spurious correlation.
Examples and Reformulations
As before, we can convert the min–max problem into a minimization problem to reduce the overhead of dealing with a large number of groups. We give two examples: one using a KL-divergence constraint on \(\mathbf{p}\) and one using CVaR-\(\alpha\).
With \(\phi(t) = t \log t - t + 1\), the KL-divergence constrained group DRO (\(\ref{eqn:cgdro}\)) is equivalent to \[\begin{align}\label{eqn:klgdro} \min_{\mathbf{w}, \ \tau \geq 0} \ \tau \log\left( \frac{1}{K} \sum_{j=1}^K \exp\left( \frac{L_j(\mathbf{w})}{\tau} \right) \right) + \tau \rho. \end{align}\]
With \(\phi(t) = \mathbb{I}_{0-\infty}(t \leq 1 / \alpha)\), the CVaR-\(\alpha\) group DRO (\(\ref{eqn:rgdro}\)) is equivalent to \[\begin{align}\label{eqn:cvargdro} \min_{\mathbf{w}, \mu} \ \frac{1}{K \alpha} \sum_{j=1}^K [ L_j(\mathbf{w}) - \mu ]_+ + \mu. \end{align}\]
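By Lemma 2.2 applied at the group level, the CVaR-\(\alpha\) group DRO objective with \(\alpha = k/K\) equals the average of the \(k\) largest group losses. The sketch below uses toy groups loosely inspired by the waterbird example (all numbers are made up) to compute the group-averaged losses and verify this equivalence.

```python
# Per-sample losses grouped by a known attribute-label pair
# (toy numbers loosely inspired by the waterbird example).
groups = {
    "waterbird_water": [0.2, 0.3, 0.1],
    "waterbird_land":  [1.5, 2.1],   # minority group with large losses
    "landbird_land":   [0.2, 0.4],
    "landbird_water":  [1.0, 1.4, 1.2],
}

# Group-level averaged losses L_j.
L = {g: sum(ls) / len(ls) for g, ls in groups.items()}

K = len(L)
k = 2
alpha = k / K

def gdro_objective(mu):
    """CVaR-alpha group DRO objective: (1/(K*alpha)) * sum_j [L_j - mu]_+ + mu."""
    return sum(max(Lj - mu, 0.0) for Lj in L.values()) / (K * alpha) + mu

# Piecewise-linear and convex in mu: minimize over the kink points mu = L_j.
best = min(gdro_objective(mu) for mu in L.values())
top_k_avg = sum(sorted(L.values(), reverse=True)[:k]) / k
```

Note that the two groups with spurious backgrounds dominate the objective, which is exactly the mechanism that drives the model to fit the minority groups.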
The Optimization Challenge
These new optimization problems \(\eqref{eqn:klgdro}\) and \(\eqref{eqn:cvargdro}\) cannot be solved simply by existing stochastic algorithms for ERM, since each \(L_j(\mathbf{w})\) depends on many data points and appears inside a non-linear function. In particular, problem \(\eqref{eqn:cvargdro}\) is an instance of finite-sum coupled compositional optimization (FCCO), which will be explored in depth in Chapter 4.