
Section 2.1: Empirical Risk Minimization


What is Machine Learning (ML)?
In 1959, Arthur Samuel, a pioneer in the field of ML, defined Machine Learning as the “field of study that gives computers the ability to learn without being explicitly programmed”.

Nowadays, machine learning has become the foundation of AI. The essence of machine learning is to learn a model by optimizing an objective function on training data, with the goal of achieving strong generalization to unseen data. This relationship is captured by the formula: \[\begin{align*} \text{\bf Machine Learning}=\text{\bf Objective}\quad+\quad\text{\bf Algorithm}\quad+\quad\text{\bf Generalization}. \end{align*}\] Optimization plays a fundamental role in machine learning, as it underpins (1) the formulation of objective functions, (2) the development of optimization algorithms, and (3) the analysis of generalization error of learned models. Below, we will use the traditional label prediction problem to illustrate the three components.

Discriminative Label Prediction

In supervised learning, the primary objective is often to learn a predictive model from a given set of supervised training data. Let us consider a classical label prediction problem. Denote by \((\mathbf{x},y)\) a data-label pair, where \(\mathbf{x}\in\mathcal{X}\subset\mathbb{R}^{d_0}\) denotes the input feature vector, and \(y\in\mathcal{Y}=\{1,\ldots,K\}\) is the corresponding label. The goal is to learn a predictive model parameterized by \(\mathbf{w}\in\mathcal{W}\subseteq\mathbb{R}^d\) (e.g., a deep neural network), which induces a scoring function \(h(\mathbf{w};\cdot):\mathcal{X}\to\mathbb{R}^K\). Conceptually, the model can be expressed as \(h(\mathbf{w};\mathbf{x})=Wh_0(\mathbf{w};\mathbf{x})\), where \(h_0(\mathbf{w};\cdot):\mathcal{X}\to\mathbb{R}^{d_1}\) is the feature extraction component, and \(W\in\mathbb{R}^{K\times d_1}\) is the classification head corresponding to the \(K\) classes.

A classical framework for learning such a model is the well-known empirical risk minimization (ERM), which minimizes the empirical risk over the training dataset. To this end, a pointwise loss function \(\ell(h(\mathbf{w};\mathbf{x}),y)\) is defined to measure the discrepancy between the model’s prediction \(h(\mathbf{w};\mathbf{x})\) and the true label \(y\). Given a training dataset \(\mathcal{S}=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)\}\), the ERM problem is formulated as:

\[\begin{align}\label{eqn:erm} \min_{\mathbf{w}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{w}):=\frac{1}{n}\sum_{i=1}^n\ell(h(\mathbf{w};\mathbf{x}_i),y_i). \end{align}\]

Discriminative Loss Functions

A major element of ERM is the design of the loss function. A common strategy for designing a loss function for label prediction is the discriminative approach. Below, we introduce several discriminative loss functions.

Logistic Loss

A parameterized probabilistic model is defined to represent the probability of a class label for a given data point as \[\begin{align}\label{eqn:dpm-class} \Pr(y|\mathbf{x};\mathbf{w})=\frac{\exp([h(\mathbf{w};\mathbf{x})]_y)}{\sum_{l=1}^K\exp([h(\mathbf{w};\mathbf{x})]_l)}, \end{align}\] where \([\cdot]_k\) denotes the \(k\)-th element of a vector. The associated loss is derived from the negative log-likelihood, resulting in the multi-class logistic loss, also known as the cross-entropy (CE) loss: \[\begin{align}\label{eqn:lsr} \ell(h(\mathbf{w};\mathbf{x}),y)=-\log\frac{\exp([h(\mathbf{w};\mathbf{x})]_y)}{\sum_{l=1}^K\exp([h(\mathbf{w};\mathbf{x})]_l)}. \end{align}\] The resulting method by ERM is commonly referred to as multi-class logistic regression. For binary classification, this loss becomes the binary logistic loss \(\ell(h(\mathbf{w};\mathbf{x}),y)=\log(1+\exp(-yh(\mathbf{w};\mathbf{x})))\), where \(h(\mathbf{w};\cdot)\in\mathbb{R}\) and \(y\in\{1,-1\}\).
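As a concrete illustration, the CE loss can be evaluated with a numerically stable log-sum-exp. The following NumPy sketch (the function name and example values are ours, not part of the formal development) assumes classes are indexed from \(0\):

```python
import numpy as np

def cross_entropy_loss(scores, y):
    """Multi-class logistic (CE) loss for one example.

    scores : (K,) array of prediction scores h(w; x)
    y      : integer true label in {0, ..., K-1}
    """
    m = scores.max()                                    # subtract max for stability
    log_sum_exp = m + np.log(np.exp(scores - m).sum())  # log sum_l exp(h_l)
    return log_sum_exp - scores[y]                      # -log softmax_y(scores)
```

With equal scores over \(K\) classes the loss equals \(\log K\), the value of maximal uncertainty.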

Max-Margin Loss

The max-margin loss, introduced by Crammer and Singer and commonly referred to as the Crammer-Singer (CS) loss, is defined as: \[\begin{align} \ell(h(\mathbf{w};\mathbf{x}),y)=\max\left(0,\max_{k\ne y}\left(c_{k,y}+[h(\mathbf{w};\mathbf{x})]_k-[h(\mathbf{w};\mathbf{x})]_y\right)\right), \end{align}\] where \(c_{k,y}>0\) is a margin parameter. This loss seeks to ensure that the prediction score for the ground-truth label, \([h(\mathbf{w};\mathbf{x})]_y\), exceeds the scores of the other class labels, \([h(\mathbf{w};\mathbf{x})]_k\) for \(k\ne y\), by at least the margin \(c_{k,y}\). This method is also known as the multi-class support vector machine. For binary classification, it reduces to the standard hinge loss \(\ell(h(\mathbf{w};\mathbf{x}),y)=\max(0,1-yh(\mathbf{w};\mathbf{x}))\) for \(h(\mathbf{w};\cdot)\in\mathbb{R}\) and \(y\in\{1,-1\}\) with a margin of \(1\).
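A direct NumPy sketch of the CS loss, using a uniform margin \(c_{k,y}=1\) by default (an illustrative simplification; the definition allows class-dependent margins):

```python
import numpy as np

def cs_loss(scores, y, margin=1.0):
    """Crammer-Singer max-margin loss for one example.

    scores : (K,) array of prediction scores h(w; x)
    y      : integer true label
    margin : uniform margin c_{k,y} applied to all k != y
    """
    diffs = margin + scores - scores[y]   # c_{k,y} + h_k - h_y for every k
    diffs[y] = 0.0                        # drop k = y and fold in the outer max(0, .)
    return diffs.max()
```

The loss is zero exactly when the true-class score beats every other score by at least the margin.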

Label Distributionally Robust (LDR) Loss

Both the CS loss and the CE loss have their strengths and limitations. For example, the margin parameters make the CS loss more flexible in controlling the discrimination between classes, but the loss is neither consistent nor smooth in the prediction scores. The CE loss is smooth and consistent but lacks robustness to noise in the class labels.

Consistency of a surrogate loss function
Consistency measures whether minimizing a surrogate loss with an infinite amount of data also minimizes the Bayes error. More formally, a surrogate loss \(\ell(h(\mathbf{x}),y)\) is said to be consistent if for any sequence of measurable functions \(h^{(n)}\) it holds that \[ \mathcal{R}(h^{(n)})\rightarrow\inf_{h\in\mathcal{H}}\mathcal{R}(h)\Rightarrow\mathcal{R}_{0-1}(h^{(n)})\rightarrow\inf_{h\in\mathcal{H}}\mathcal{R}_{0-1}(h), \] where \(\mathcal{R}(h)=\mathbb{E}_{\mathbf{x},y}[\ell(h(\mathbf{x}),y)]\) is the expected risk, \(\mathcal{R}_{0-1}(h)=\mathbb{E}_{\mathbf{x},y}\!\left[\mathbb{I}\!\left(y\ne h(\mathbf{x})\right)\right]\) is the Bayes error, and \(\mathcal{H}\) is the set of all measurable functions.

In fact, the strengths and limitations of both the CE and CS losses can be better understood within a broader family known as the label-distributionally robust (LDR) loss: \[\begin{align}\label{eqn:ldr} \ell_\tau(h(\mathbf{w};\mathbf{x}),y)=\max_{\mathbf{p}\in\Delta_K}\sum_{k=1}^K p_k\big([h(\mathbf{w};\mathbf{x})]_k-[h(\mathbf{w};\mathbf{x})]_y+c_{k,y}\big)-\tau\sum_{k=1}^K p_k\log(p_k K), \end{align}\] where \(\tau>0\) is a hyperparameter, \(c_{y,y}=0\), \(\mathbf{p}\in\mathbb{R}^K\) is referred to as the label distributional weight vector, and \(\Delta_K=\{\mathbf{p}\in\mathbb{R}^K:p_k\ge0,\sum_{k=1}^K p_k=1\}\) is a simplex.

It is clear that the LDR loss is defined by solving an optimization problem. Indeed, this optimization problem follows the distributionally robust optimization (DRO) principle, which is widely applied at the level of data as discussed in Section 2.2. By treating the label as a kind of data, we can unify the LDR loss with other losses discussed later in Section 2.4.

A closed-form solution for \(\mathbf{p}\) can be derived using the KKT conditions (cf. Example 1.16), making the LDR loss equivalent to: \[\begin{align}\label{eqn:ldr-kl} \ell_\tau(h(\mathbf{w};\mathbf{x}),y)=\tau\log\bigg(\frac{1}{K}\sum_{k=1}^K\exp\left(\frac{[h(\mathbf{w};\mathbf{x})]_k+c_{k,y}-[h(\mathbf{w};\mathbf{x})]_y}{\tau}\right)\bigg). \end{align}\]
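The closed form lends itself to a direct implementation. In the sketch below (function and argument names are ours), a stabilized log-sum-exp evaluates the loss; with \(\tau=1\) and zero margins it equals the CE loss up to the additive constant \(-\log K\), and for small \(\tau\) it approaches the CS loss:

```python
import numpy as np

def ldr_loss(scores, y, tau, margins=None):
    """Closed-form LDR loss (KL-regularized DRO over the label simplex).

    scores  : (K,) array of prediction scores h(w; x)
    y       : integer true label
    tau     : temperature parameter, tau > 0
    margins : optional (K,) array of c_{k,y} with margins[y] = 0
    """
    K = scores.shape[0]
    c = np.zeros(K) if margins is None else margins
    a = (scores + c - scores[y]) / tau        # exponents inside the log
    m = a.max()                               # log-sum-exp stabilization
    return tau * (m + np.log(np.mean(np.exp(a - m))))
```

The mean over \(K\) terms implements the \(1/K\) factor inside the logarithm.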

From the perspective of DRO, we can define a more general family of LDR losses using different regularization functions on \(\mathbf{p}\) and constrained domains \(\Omega\): \[\begin{align}\label{eqn:ldr-g} \bar\ell_\tau(h(\mathbf{w};\mathbf{x}),y)=\max_{\mathbf{p}\in\Omega}\sum_{k=1}^K p_k\big([h(\mathbf{w};\mathbf{x})]_k-[h(\mathbf{w};\mathbf{x})]_y+c_{k,y}\big)-\tau R(\mathbf{p}), \end{align}\] where \(\Omega\subseteq\Delta_K\) and \(R(\mathbf{p})\) is a strongly convex regularizer.

💡 Why it matters
- The LDR loss (\(\ref{eqn:ldr-kl}\)) unifies both the CS and CE losses as special cases. Specifically, the CE loss corresponds to the LDR loss when \(\tau=1\) and \(c_{k,y}=0\) for all \(k\), while the CS loss corresponds to the case \(\tau=0\).

Moreover, the LDR loss encompasses the Label-Distribution-Aware Margin (LDAM) loss (Cao et al. 2019) when \(\tau=1\) and \(c_{k,y}=c_y\propto1/n_y^{1/4}\) for \(k\ne y\), where \(n_y\) denotes the number of samples in class \(y\): \[ \begin{align*} \ell_{\text{LDAM}}(h(\mathbf{w};\mathbf{x}),y) &=-\log\left(\frac{\exp\big([h(\mathbf{w};\mathbf{x})]_y-\frac{C}{n_y^{1/4}}\big)}{\exp\big([h(\mathbf{w};\mathbf{x})]_y-\frac{C}{n_y^{1/4}}\big)+\sum_{l\ne y}\exp([h(\mathbf{w};\mathbf{x})]_l)}\right), \end{align*}\] where \(C\) is a constant. For imbalanced datasets, this assigns larger margins \(c_y\) to minority classes, making it more suitable for handling class imbalance.
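Under the stated choice \(c_y\propto 1/n_y^{1/4}\), the LDAM loss is a CE loss with the true-class score shifted down by a class-dependent margin. A minimal sketch (function name ours):

```python
import numpy as np

def ldam_loss(scores, y, n_per_class, C=1.0):
    """LDAM loss: CE with the true-class score reduced by C / n_y^{1/4}.

    scores      : (K,) array of prediction scores h(w; x)
    y           : integer true label
    n_per_class : (K,) array of per-class sample counts n_y
    """
    z = scores.astype(float)
    z[y] -= C / n_per_class[y] ** 0.25     # larger margin for rarer classes
    m = z.max()                            # stable -log softmax_y(z)
    return (m + np.log(np.exp(z - m).sum())) - z[y]
```

With identical scores, a minority class incurs a larger loss than a majority class, which is exactly the pressure toward larger minority margins.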

In conclusion, the LDR loss offers flexibility in achieving three desirable properties: max-margin, consistency, and symmetry. In practice, when tuning \(\tau\in(0,\infty)\), it may be beneficial to normalize the prediction scores \(h(\mathbf{w};\mathbf{x})\).

Critical:
It is worth noting that all the discussed losses are discriminative in nature, aiming to increase the score \([h(\mathbf{w};\mathbf{x})]_y\) of the true label while decreasing the scores \([h(\mathbf{w};\mathbf{x})]_k\) of the negative labels (\(k\ne y\)).
The LDR loss and its special cases by varying \(\tau\).

Need of Optimization Algorithms

To address the ERM problem in the context of large-scale data (i.e., a substantial number of training examples), first-order stochastic algorithms are commonly employed. These include stochastic gradient descent (SGD), stochastic momentum methods, and adaptive gradient methods. For instance, the update rule of classical SGD for solving (\(\ref{eqn:erm}\)) with \(\mathcal{W}=\mathbb{R}^d\) is given by: \[\begin{align} \mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t\frac{1}{|\mathcal{B}_t|}\sum_{(\mathbf{x}_i,y_i)\in\mathcal{B}_t}\nabla\ell(h(\mathbf{w}_t;\mathbf{x}_i),y_i),\quad t=1,\ldots,T, \end{align}\] where \(\eta_t\ge0\) is the learning rate (or step size), and \(\mathcal{B}_t\) denotes a random mini-batch sampled from the full dataset. The central concern in designing an optimization algorithm is how fast it converges to a (near-)optimal solution. We will discuss the design and analysis of classical stochastic optimization algorithms in Chapter 3.
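To make the update concrete, the following sketch runs mini-batch SGD on the CE loss for a linear model; the synthetic data, constant learning rate, and all sizes are illustrative assumptions, not prescriptions from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic K-class data: labels are the argmax of a random linear model
n, d, K = 500, 10, 3
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=(d, K))).argmax(axis=1)

def ce_batch_grad(W, Xb, yb):
    """Mini-batch gradient of the CE loss for the linear model h(w; x) = W x."""
    S = Xb @ W.T                                  # (B, K) scores
    S -= S.max(axis=1, keepdims=True)             # softmax stability
    P = np.exp(S)
    P /= P.sum(axis=1, keepdims=True)             # softmax probabilities
    P[np.arange(len(yb)), yb] -= 1.0              # d loss / d scores = p - e_y
    return P.T @ Xb / len(yb)                     # average over the batch

W = np.zeros((K, d))
eta, batch = 0.5, 32                              # constant step size eta_t
for t in range(300):
    idx = rng.integers(0, n, size=batch)          # sample mini-batch B_t
    W -= eta * ce_batch_grad(W, X[idx], y[idx])   # SGD update

train_acc = ((X @ W.T).argmax(axis=1) == y).mean()
```

Since the labels here are generated by a linear model, a few hundred SGD steps suffice to fit the training set well.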

Critical:
A critical assumption in conventional stochastic optimization algorithms such as SGD is that the gradient \(\nabla\ell(h(\mathbf{w};\mathbf{x}_i),y_i)\) of each individual loss can be easily computed. This assumption fails for the logistic loss when the number of classes \(K\) is gigantic, e.g., in the millions or even billions. This challenge will be addressed in this book.

Generalization Analysis

To study the generalization of a model learned by solving ERM, we usually consider the expected risk defined as \[\begin{align}\label{eqn:pr} \mathcal{R}(\mathbf{w})=\mathbb{E}_{\mathbf{x},y\sim\mathbb{P}}[\ell(h(\mathbf{w};\mathbf{x}),y)]. \end{align}\] Let \(\mathbf{w}=\mathcal{A}(\mathcal{S};\zeta)\) denote a model learned by a randomized algorithm \(\mathcal{A}\) for solving ERM, where \(\zeta\) denotes the random variables of the algorithm. A standard measure of generalization is the excess risk, defined as \(\mathcal{R}(\mathbf{w})-\mathcal{R}(\mathbf{w}_*)\), where \(\mathbf{w}_*\in\arg\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}(\mathbf{u})\). The following lemma decomposes the excess risk into the optimization error and the generalization error.

Lemma 2.1
For a learned model \(\mathbf{w}=\mathcal{A}(\mathcal{S};\zeta)\in\mathcal{W}\), we have \[\begin{align*} \mathcal{R}(\mathbf{w})-\mathcal{R}(\mathbf{w}_*) \le2\underbrace{\sup_{\mathbf{w}\in\mathcal{W}}|\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})|}_{\text{generalization error}} +\underbrace{\mathcal{R}_{\mathcal{S}}(\mathbf{w})-\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u})}_{\text{optimization error}}, \end{align*}\] and \[\begin{align*} \mathbb{E}_{\mathcal{S},\zeta}[\mathcal{R}(\mathbf{w})-\mathcal{R}(\mathbf{w}_*)] \le\mathbb{E}_{\mathcal{S},\zeta}[\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})] +\mathbb{E}_{\mathcal{S},\zeta}\left[\mathcal{R}_{\mathcal{S}}(\mathbf{w})-\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u})\right]. \end{align*}\]

Proof. \[\begin{align*} \mathcal{R}(\mathbf{w})-\mathcal{R}(\mathbf{w}_*) &=\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w}) +\mathcal{R}_{\mathcal{S}}(\mathbf{w})-\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u}) +\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u})-\mathcal{R}(\mathbf{w}_*)\\ &\le\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w}) +\mathcal{R}_{\mathcal{S}}(\mathbf{w})-\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u}) +\mathcal{R}_{\mathcal{S}}(\mathbf{w}_*)-\mathcal{R}(\mathbf{w}_*), \end{align*}\] where the inequality uses \(\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u})\le\mathcal{R}_{\mathcal{S}}(\mathbf{w}_*)\). Bounding both \(\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})\) and \(\mathcal{R}_{\mathcal{S}}(\mathbf{w}_*)-\mathcal{R}(\mathbf{w}_*)\) by \(\sup_{\mathbf{w}\in\mathcal{W}}|\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})|\) proves the first inequality. Taking the expectation over \(\mathcal{S},\zeta\) and noting that \(\mathbb{E}_{\mathcal{S}}[\mathcal{R}_{\mathcal{S}}(\mathbf{w}_*)-\mathcal{R}(\mathbf{w}_*)]=0\) yields the second inequality.

💡 Why it matters
The excess risk can be decomposed into two components: the optimization error, given by \(\mathcal{R}_{\mathcal{S}}(\mathbf{w})-\min_{\mathbf{u}\in\mathcal{W}}\mathcal{R}_{\mathcal{S}}(\mathbf{u})\), and the generalization error which captures the difference between the expected risk and the empirical risk. Bounding the (expected) optimization error is a central focus of this book, approached through the analysis of stochastic optimization algorithms. The generalization error \(\sup_{\mathbf{w}\in\mathcal{W}}|\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})|\) decreases as the training data size \(|\mathcal{S}|\) increases. A brief discussion of the literature on generalization error analysis will be provided at the end of this chapter.
