Section 2.5 History and Discussion
A pioneering analysis of the infinite-sample consistency of multi-class surrogate loss functions was given by Zhang (2004). That work proves the consistency of several losses, including the cross-entropy loss, and shows that the consistency of the Crammer–Singer and hinge losses can fail unless the maximum conditional probability of a class label given the input exceeds 0.5.
The Label-Distribution-Aware Margin (LDAM) Loss was proposed and studied by Cao et al. (2019), inspired by margin-based generalization error bounds tailored to each class. The label distributionally robust (LDR) losses were proposed by Zhu et al. (2023), who also established their consistency.
Variants of standard loss functions have been developed to minimize the top-\(k\) error for \(k > 1\), such as the top-\(k\) SVM loss and the top-\(k\) cross-entropy loss (Lapin et al., 2018; Yang and Koyejo, 2020). The top-\(k\) SVM loss is recovered as a special case of the general LDR loss by setting \(R(p)=0\) and \(\Omega = \{p \in \Delta_K : p_j \le 1/k \text{ for all } j\}\). Although this formulation is generally inconsistent, adding a small strongly convex regularizer \(R(p)\) to the LDR loss restores consistency.
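The reduction rests on a simple linear-programming fact: maximizing \(\langle p, a\rangle\) over the capped simplex \(\{p \in \Delta_K : p_j \le 1/k\}\) returns the average of the \(k\) largest entries of \(a\), i.e., the top-\(k\) aggregation of the margin violations. A minimal numerical check of this identity (the function names below are illustrative, not taken from the references):

```python
import itertools

import numpy as np

def topk_average(a, k):
    """Average of the k largest entries of a (top-k SVM aggregation)."""
    return float(np.mean(np.sort(a)[-k:]))

def capped_simplex_max(a, k):
    """Maximize <p, a> over Omega = {p in Delta_K : p_j <= 1/k}.

    A linear objective attains its maximum at a vertex of the feasible
    polytope; here the vertices place mass 1/k on each of k coordinates,
    so brute-forcing all k-subsets finds the exact optimum.
    """
    K = len(a)
    return max(sum(a[j] for j in S) / k
               for S in itertools.combinations(range(K), k))

# The DRO maximum over the capped simplex coincides with the
# top-k average of the scores, for any score vector.
rng = np.random.default_rng(0)
a = rng.normal(size=6)
assert np.isclose(capped_simplex_max(a, 3), topk_average(a, 3))
```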
A loss function is called noise-tolerant if the minimizer of the expected risk under the noisy label distribution coincides with the minimizer under the true label distribution, provided the noise level is not excessively high. A sufficient condition for noise tolerance is the symmetry property, introduced by Ghosh et al. (2017).
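Concretely, the symmetry condition of Ghosh et al. (2017) requires the losses over all class labels to sum to a constant independent of the prediction:
\[
\sum_{y=1}^{K} \ell(f(x), y) = C \quad \text{for all } x \text{ and all } f.
\]
As an illustration, with softmax outputs \(p = (p_1, \dots, p_K)\), the mean absolute error \(\ell(f(x), y) = \|e_y - p\|_1 = 2(1 - p_y)\) satisfies the condition with \(C = 2(K-1)\), whereas the cross-entropy loss \(-\log p_y\) does not.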
Generalization error analysis is a central topic in several classical machine learning texts (Shalev-Shwartz and Ben-David, 2014; Mohri et al., 2018) and in the statistical learning theory literature (Koltchinskii, 2011). Typically, uniform convergence bounds on \(\sup_{w \in W} |R(w) - R_S(w)|\) are derived using concentration inequalities, with dependencies on both the number of training samples \(n\) and the complexity of the hypothesis class. More recently, there has been growing interest in directly analyzing the generalization performance of models returned by stochastic optimization algorithms using stability-based techniques (Hardt et al., 2016; Lei and Ying, 2019).
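One representative form of such a bound (see, e.g., Mohri et al., 2018) is the following: for a loss class \(\mathcal{F}\) induced by \(W\) with range \([0,1]\), with probability at least \(1-\delta\) over the draw of the sample \(S\) of size \(n\),
\[
\sup_{w \in W} \big( R(w) - R_S(w) \big) \;\le\; 2\,\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}},
\]
where \(\mathfrak{R}_n(\mathcal{F})\) denotes the Rademacher complexity of \(\mathcal{F}\), a standard measure of the complexity of the hypothesis class.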
Robust optimization dates back to Scarf (1958), who studied an inventory problem in which the goal is to determine the purchase quantity that maximizes profit when future demand is a random variable whose underlying probability distribution is assumed to belong to a set of plausible distributions. The problem is reformulated as a worst-case analysis over all distributions in this set with known mean and variance. Later, Dupačová (1966) investigated the min–max robust formulation of stochastic linear programming. Since then, robust optimization has been extensively studied in management science, operations research, and mathematical programming (Kouvelis and Yu, 1997; Shapiro and Kleywegt, 2002; Rustem and Howe, 2002; Ben-Tal et al., 2009). The term distributionally robust optimization was introduced by Delage and Ye (2010).
The \(\phi\)-divergence (also known as the \(f\)-divergence, with \(f\) and \(\phi\) both denoting the convex function that generates the divergence) was introduced by Csiszár (1967). The use of \(\phi\)-divergence to define the uncertainty set in robust optimization was first studied by Ben-Tal et al. (2013), while earlier works had considered using the KL divergence to define an uncertainty set of probabilities (Calafiore, 2007). A special case of DRO, namely the maximal loss, was shown to be beneficial for imbalanced classification by Shalev-Shwartz and Wexler (2016). The popularity of DRO in machine learning is largely attributed to Namkoong and Duchi (2017), who established a variance-based generalization error bound for DRO with the \(\chi^2\) divergence, building on their preceding work (Duchi et al., 2022). The optimized certainty equivalent (OCE) was proposed by Ben-Tal and Teboulle (1986), and its connection to DRO was later established by Ben-Tal and Teboulle (2007). Group DRO was first proposed by Hu et al. (2018) and became widely recognized due to Sagawa et al. (2019).
The receiver operating characteristic (ROC) curve was originally developed in the 1940s by electrical and radar engineers during World War II to detect enemy objects on the battlefield, which gave rise to its name (Marcum, 1947). It was later formalized within the framework of signal detection theory (Green and Swets, 1966). The probabilistic interpretation of AUC and its equivalence to the Mann–Whitney U-statistic (or Wilcoxon statistic) were established by Hanley and McNeil (1982). The concept was subsequently introduced into machine learning as a standard metric for evaluating learning algorithms (Spackman, 1989). The first study of the one-way partial AUC (pAUC) was presented by Dodd and Pepe (2003), and the notion of two-way partial AUC was later introduced by Yang et al. (2019).
The study of AUC maximization dates back to Verrelst et al. (1998) and has since been extensively explored in machine learning. Yan et al. (2003) were the first to apply the gradient descent method to optimize a hinge-based pairwise surrogate loss for AUC, while Cortes and Mohri (2003) employed the RankBoost algorithm (Freund et al., 2003) to optimize AUC. The compositional objective for AUC maximization was first proposed by Ying et al. (2016) in a min–max form and was later generalized by Yuan et al. (2021) and Zhu et al. (2022b). For a comprehensive overview of related work, see the survey by Yang and Ying (2022). The first work on maximizing average precision was conducted by Morgan et al. (2004). The use of DRO for formulating partial AUC losses was proposed by Zhu et al. (2022a). NDCG was introduced by Järvelin and Kekäläinen (2000), and the listwise cross-entropy loss for learning to rank was proposed by Cao et al. (2007). The concept of empirical X-risk minimization for unifying a family of non-decomposable losses was developed by the author of this book in Yang (2022), which also presents additional examples of X-risks.
Representation learning in traditional machine learning is related to principal component analysis and distance metric learning (Yang and Jin, 2006). Conventional contrastive losses are defined either on pairs \((x, y)\) with a binary label indicating whether the pair is positive or negative (Hadsell et al., 2006), or on triplets \((x, y_+, y_-)\) (Weinberger and Saul, 2009). The contrastive loss defined on a list of negative data for a positive pair was first introduced by Sohn (2016).
The term foundation model was introduced by Bommasani et al. (2021). The use of DRO to formulate the contrastive loss was first proposed by Qiu et al. (2023), providing a principled approach for optimizing individualized temperature parameters. The discriminative probabilistic modeling approach for self-supervised representation learning was first explored by Wang et al. (2025).
There are excellent textbooks on machine learning (Shalev-Shwartz and Ben-David, 2014; Mohri et al., 2018; Bishop, 2006) and on robust optimization (Ben-Tal et al., 2009). However, to the best of our knowledge, this book is the first to provide a comprehensive and unified treatment of diverse loss functions and objectives, ranging from the traditional cross-entropy loss to the contrastive loss used in self-supervised representation learning, through the lens of robust optimization and discriminative learning.
References
Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225–1251.
Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 32, pp. 1567–1578).
Zhu, D., Ying, Y., & Yang, T. (2023). Label Distributionally Robust Losses for Multi-class Classification: Consistency, Robustness and Adaptivity. In International Conference on Machine Learning (ICML) (PMLR 202, pp. 43289–43325).
Lapin, M., Hein, M., & Schiele, B. (2018). Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7), 1533–1554. https://doi.org/10.1109/TPAMI.2017.2751607
Yang, F., & Koyejo, S. (2020). On the consistency of top-k surrogate losses. In International Conference on Machine Learning (ICML) (pp. 10727–10735). PMLR.
Ghosh, A., Kumar, H., & Sastry, P. S. (2017). Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 31).
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press.
Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer.
Hardt, M., Recht, B., & Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In ICML (pp. 1225–1234). PMLR.
Lei, Y., & Ying, Y. (2019). Fine-grained analysis of stability and generalization for stochastic gradient descent. In ICML (pp. 5809–5819). PMLR.
Scarf, H. (1958). A Min-Max Solution of an Inventory Problem. In Studies in the Mathematical Theory of Inventory and Production (pp. 201–209). Stanford University Press.
Dupačová, J. (1966). On minimax solutions of stochastic linear programming problems. Časopis pro pěstování matematiky, 91(4), 423–430.
Kouvelis, P., & Yu, G. (1997). Robust Discrete Optimization and Its Applications. Springer.
Shapiro, A., & Kleywegt, A. J. (2002). Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3), 523–542.
Rustem, B., & Howe, M. (2002). Algorithms for Worst-Case Design and Applications to Risk Management. Princeton University Press.
Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust Optimization. Princeton University Press.
Delage, E., & Ye, Y. (2010). Distributionally Robust Optimization under Moment Uncertainty with Application to Data-Driven Problems. Operations Research, 58(3), 595–612.
Csiszár, I. (1967). Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.
Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., & Rennen, G. (2013). Robust Solutions of Optimization Problems Affected by Uncertain Probabilities. Management Science, 59(2), 341–357. https://doi.org/10.1287/mnsc.1120.1641
Calafiore, G. C. (2007). Ambiguous Risk Measures and Optimal Robust Portfolios. SIAM Journal on Optimization, 18(3), 853–877. https://doi.org/10.1137/050639379
Shalev-Shwartz, S., & Wexler, Y. (2016). Minimizing the Maximal Loss: How and Why? CoRR, abs/1602.01690.
Namkoong, H., & Duchi, J. C. (2017). Variance-based regularization with convex objectives. In NeurIPS (pp. 2975–2984).
Duchi, J. C., Glynn, P. W., & Namkoong, H. (2022). Statistics of Robust Optimization: A Generalized Empirical Likelihood Approach. Mathematics of Operations Research, 47(2), 882–910. https://doi.org/10.1287/moor.2020.1085
Ben-Tal, A., & Teboulle, M. (1986). Expected Utility, Penalty Functions, and Duality in Stochastic Nonlinear Programming. Management Science, 32(11), 1445–1466.
Ben-Tal, A., & Teboulle, M. (2007). An old-new concept of convex risk measures: the optimized certainty equivalent. Mathematical Finance, 17(3), 449–476.
Hu, W., Niu, G., Sato, I., & Sugiyama, M. (2018). Does Distributionally Robust Supervised Learning Give Robust Classifiers? In ICML (pp. 2029–2037). PMLR.
Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2019). Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. CoRR, abs/1911.08731.
Marcum, J. I. (1947). A Statistical Theory of Target Detection by Pulsed Radar. RAND Corporation, RM-754.
Green, D. M., & Swets, J. A. (1966). Signal Detection Theory and Psychophysics. John Wiley and Sons Inc.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Spackman, K. A. (1989). Signal detection theory: valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning (pp. 160–163). Morgan Kaufmann.
Dodd, L. E., & Pepe, M. S. (2003). Partial AUC Estimation and Regression. Biometrics, 59(3), 614–623. https://doi.org/10.1111/1541-0420.00071
Yang, H., Lu, K., Lyu, X., & Hu, F. (2019). Two-Way Partial AUC and Its Properties. Statistical Methods in Medical Research, 28(1), 184–195.
Verrelst, H., Moreau, Y., Vandewalle, J., & Timmerman, D. (1998). Use of a multi-layer perceptron to predict malignancy in ovarian tumors. In NIPS ’97 (pp. 978–984). MIT Press.
Yan, L., Dodier, R., Mozer, M. C., & Wolniewicz, R. (2003). Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic. In ICML (pp. 848–855). AAAI Press.
Cortes, C., & Mohri, M. (2003). AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (Vol. 16).
Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
Ying, Y., Wen, L., & Lyu, S. (2016). Stochastic Online AUC Maximization. In Advances in Neural Information Processing Systems (Vol. 29, pp. 451–459).
Yuan, Z., Yan, Y., Sonka, M., & Yang, T. (2021). Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification. In ICCV 2021 (pp. 3020–3029). IEEE.
Zhu, D., Wu, X., & Yang, T. (2022b). Benchmarking Deep AUROC Optimization: Loss Functions and Algorithmic Choices. CoRR, abs/2203.14177.
Yang, T., & Ying, Y. (2022). AUC Maximization in the Era of Big Data and AI: A Survey. ACM Computing Surveys, 55(8), Article 172. https://doi.org/10.1145/3554729
Morgan, W., Greiff, W., & Henderson, J. (2004). Direct maximization of average precision by hill-climbing, with a comparison to a maximum entropy approach. In HLT-NAACL 2004: Short Papers (pp. 93–96). ACL.
Zhu, D., Li, G., Wang, B., Wu, X., & Yang, T. (2022a). When AUC meets DRO: Optimizing Partial AUC for Deep Learning with Non-Convex Convergence Guarantee. In ICML 2022 (PMLR 162, pp. 27548–27573).
Järvelin, K., & Kekäläinen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In SIGIR ’00 (pp. 41–48). ACM.
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning (pp. 129–136).
Yang, T. (2022). Algorithmic Foundation of Empirical X-risk Minimization. arXiv preprint arXiv:2206.00439.
Yang, L., & Jin, R. (2006). Distance Metric Learning: A Comprehensive Survey. Technical Report, Michigan State University.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality Reduction by Learning an Invariant Mapping. In CVPR 2006 (Vol. 2, pp. 1735–1742). IEEE.
Weinberger, K. Q., & Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10, 207–244.
Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss objective. In NeurIPS 2016 (pp. 1857–1865).
Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the Opportunities and Risks of Foundation Models. CoRR, abs/2108.07258.
Qiu, Z.-H., Hu, Q., Yuan, Z., Zhou, D., Zhang, L., & Yang, T. (2023). Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization. In ICML 2023 (PMLR 202, pp. 28389–28421).
Wang, B., Lei, Y., Ying, Y., & Yang, T. (2025). On Discriminative Probabilistic Modeling for Self-Supervised Representation Learning. In ICLR 2025. OpenReview.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.