
Section 2.5 History and Notes

Loss functions

A pioneering work analyzing the infinite-sample consistency of various multi-class surrogate loss functions is provided by Zhang, 2004. This work proves the consistency of several losses, including the cross-entropy loss. It also shows that the consistency of the Crammer-Singer and hinge losses can fail unless the maximum conditional probability of a class label given the input exceeds \(0.5\).

The Label-Distribution-Aware Margin (LDAM) loss was proposed and studied by Cao et al., 2019, inspired by margin-based generalization error bounds tailored for each class. The label distributionally robust (LDR) losses and their consistency were proposed and studied by Zhu et al., 2023.

Variants of standard loss functions have been developed to minimize the top-\(k\) error for \(k>1\), such as the top-\(k\) SVM loss and the top-\(k\) cross-entropy loss (Lapin et al., 2018); (Yang and Koyejo, 2020). The top-\(k\) SVM loss can be recovered as a special case of the general LDR loss by setting \(R(\mathbf{p})=0\) and \(\Omega=\{\mathbf{p}\in\Delta_K:p_k\le 1/k\}\). Although this formulation is generally inconsistent, adding a small strongly convex regularizer \(R(\mathbf{p})\) to the LDR loss can restore consistency.
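As a sanity check on this reduction (a sketch with synthetic losses, not code from the cited works), one can solve the linear program \(\max_{\mathbf{p}\in\Omega}\langle\mathbf{p},\boldsymbol{\ell}\rangle\) numerically and compare it with the average of the \(k\) largest losses:

```python
import numpy as np
from scipy.optimize import linprog

def topk_surrogate(losses, k):
    # Maximize <p, losses> over {p in the simplex : p_i <= 1/k},
    # solved as a linear program (linprog minimizes, hence the sign flip).
    K = len(losses)
    res = linprog(c=-losses,
                  A_eq=np.ones((1, K)), b_eq=[1.0],
                  bounds=[(0.0, 1.0 / k)] * K)
    return -res.fun

rng = np.random.default_rng(0)
ell = rng.normal(size=8)                 # synthetic per-class losses
lp_val = topk_surrogate(ell, k=3)
closed_form = np.sort(ell)[-3:].mean()   # average of the 3 largest losses
```

The LP optimum places mass \(1/k\) on each of the \(k\) largest losses, which is exactly the top-\(k\) average.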

A sufficient condition for a loss function to be noise-tolerant is the symmetry property, as introduced by Ghosh et al., 2017. A loss function is considered noise-tolerant if the minimizer of the expected risk under the true label distribution remains the same under the noisy label distribution, provided the noise level is not excessively high.
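The symmetry condition requires \(\sum_{y}L(f(\mathbf{x}),y)=C\) for some constant \(C\). As a minimal numerical illustration (random scores, not an example from the cited work): the mean absolute error against a one-hot label satisfies the condition, with the sum equal to \(2(K-1)\), while the cross-entropy sum varies with the prediction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mae(p, y):
    # Mean absolute error between predicted probabilities and a one-hot
    # label; algebraically equals 2 * (1 - p[y]).
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return np.abs(p - onehot).sum()

def ce(p, y):
    return -np.log(p[y])

K = 4
p = softmax(np.random.randn(K))
mae_sum = sum(mae(p, y) for y in range(K))  # constant: 2 * (K - 1)
ce_sum = sum(ce(p, y) for y in range(K))    # depends on p, so not symmetric
```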

Robust optimization

Robust optimization dates back to Scarf, 1958, who studied an inventory problem: choose the purchase quantity that maximizes profit when future demand is a random variable whose distribution is known only to belong to a set of plausible distributions. The problem is reformulated as a worst-case analysis over all distributions in this set with a given mean and variance. Later, Dupacova, 1966 investigated the min-max robust formulation of stochastic linear programming. Since then, robust optimization has been extensively studied in management science, operations research, and mathematical programming (Kouvelis and Yu, 1997); (Shapiro and Kleywegt, 2002); (Rustem and Howe, 2002); (Ben-Tal et al., 2009). The term distributionally robust optimization was introduced by Delage and Ye, 2010.

The \(\phi\)-divergence (also known as the \(f\)-divergence) was introduced by Csiszar, 1967. The use of the \(\phi\)-divergence to define the uncertainty set in robust optimization was first studied by Ben-Tal et al., 2013, while earlier works had used the KL divergence to define an uncertainty set of probabilities (Calafiore, 2007). A special case of DRO, namely the maximal loss, was shown to be beneficial for imbalanced classification by Shalev-Shwartz and Wexler, 2016. The popularity of DRO in machine learning is largely attributed to Namkoong and Duchi, 2017, who established a variance-based generalization error bound for DRO with the chi-square divergence, building on their companion work (Duchi et al., 2022). The optimized certainty equivalent (OCE) was proposed by Ben-Tal and Teboulle, 1986, and its connection to DRO was later established in (Ben-Tal and Teboulle, 2007). Group DRO was first proposed by Hu et al., 2018 and became widely recognized due to Sagawa et al., 2019.
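The OCE connection can be made concrete for CVaR, which admits the variational form \(\mathrm{CVaR}_{\alpha}(\ell)=\min_{\lambda}\{\lambda+\frac{1}{\alpha}\mathbb{E}[(\ell-\lambda)_+]\}\) and equals the mean of the worst \(\alpha\)-fraction of losses. A small numerical sketch with synthetic losses (function names are hypothetical):

```python
import numpy as np

def cvar_tail(losses, alpha):
    # CVaR as the average of the worst alpha-fraction of losses.
    k = int(np.ceil(alpha * len(losses)))
    return np.sort(losses)[-k:].mean()

def cvar_oce(losses, alpha):
    # CVaR via the OCE variational form:
    #   min_lambda  lambda + E[(loss - lambda)_+] / alpha.
    # The minimizer is an order statistic, so a grid over the sorted
    # losses suffices.
    vals = [lam + np.maximum(losses - lam, 0.0).mean() / alpha
            for lam in np.sort(losses)]
    return min(vals)

rng = np.random.default_rng(1)
ell = rng.normal(size=1000)       # synthetic losses
tail_val = cvar_tail(ell, 0.1)
oce_val = cvar_oce(ell, 0.1)
```

When \(\alpha n\) is an integer, the two quantities agree exactly; the optimal \(\lambda\) is the \((1-\alpha)\)-quantile of the losses.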

AUC and NDCG

The receiver operating characteristic (ROC) curve was originally developed in the 1940s by electrical and radar engineers during World War II to detect enemy objects on the battlefield, hence the name (Marcum, 1947). It was subsequently formalized within the framework of signal detection theory (Green and Swets, 1966). The probabilistic interpretation of AUC and its equivalence to the Mann-Whitney U-statistic (or Wilcoxon statistic) were later established by Hanley and McNeil, 1982. The metric was subsequently introduced into machine learning as a standard tool for evaluating learning algorithms (Spackman, 1989). The first study of the one-way partial AUC (pAUC) was presented by Dodd and Pepe, 2003, and the notion of two-way partial AUC was later introduced by Yang et al., 2019.
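The Hanley-McNeil equivalence can be verified directly: AUC equals both the fraction of positive-negative pairs in which the positive example outscores the negative one, and the normalized Mann-Whitney U-statistic computed from pooled ranks. A sketch with synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=200)   # scores of positive examples
neg = rng.normal(0.0, 1.0, size=300)   # scores of negative examples

# AUC as the probability that a random positive outscores a random negative.
pairwise_auc = (pos[:, None] > neg[None, :]).mean()

# The same quantity via the Mann-Whitney U-statistic from pooled ranks
# (continuous scores, so ties occur with probability zero).
scores = np.concatenate([pos, neg])
ranks = np.empty(len(scores))
ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
U = ranks[:len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
rank_auc = U / (len(pos) * len(neg))
```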

The study of AUC maximization dates back to Verrelst et al., 1998 and has since been extensively explored in machine learning. Yan et al., 2003 were the first to apply the gradient descent method to optimize a hinge-based pairwise surrogate loss for AUC, while Cortes and Mohri, 2003 employed the RankBoost algorithm (Freund et al., 2003) to optimize AUC. The compositional objective for AUC maximization was first proposed by Ying et al., 2016 in a min-max form and was later generalized in (Yuan et al., 2021); (Zhu et al., 2022). For a comprehensive overview of related work, see the survey by Yang and Ying, 2023. The first work on maximizing average precision was conducted by Morgan et al., 2004. The use of DRO for formulating partial AUC losses was proposed by Zhu et al., 2022.

NDCG was introduced by Jarvelin and Kekalainen, 2000, and the listwise cross-entropy loss for learning to rank was proposed by Cao et al., 2007. The concept of empirical X-risk minimization for unifying a family of non-decomposable losses was developed by the author of this book in (Yang, 2022), which also presents additional examples of X-risks.

Foundation Models

Representation learning in traditional machine learning is related to principal component analysis and distance metric learning (Yang and Jin, 2006). Conventional contrastive losses are defined on pairs \((\mathbf{x},\mathbf{y})\) with a binary label indicating whether the pair is positive or negative (Hadsell et al., 2006), or on triplets \((\mathbf{x},\mathbf{y}_+,\mathbf{y}_-)\) (Weinberger and Saul, 2009). The contrastive loss defined on a list of negative data for a positive pair was first introduced by Sohn, 2016.
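The list-of-negatives contrastive loss takes the softmax form \(-\log\frac{\exp(s_+/\tau)}{\exp(s_+/\tau)+\sum_{j}\exp(s_j^-/\tau)}\), where \(s_+\) and \(s_j^-\) are similarities to the positive and negative examples. A minimal sketch with random embeddings and a hypothetical temperature \(\tau\) (not the exact formulation of the cited work):

```python
import numpy as np

def multi_negative_contrastive_loss(anchor, positive, negatives, tau=0.1):
    # Softmax-style contrastive loss over one positive and a list of
    # negatives. Embeddings are L2-normalized so similarities are cosines;
    # tau is a temperature (a hypothetical value here).
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, N = unit(anchor), unit(positive), unit(negatives)
    logits = np.concatenate([[a @ p], N @ a]) / tau  # positive logit first
    logits -= logits.max()                           # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
a, p = rng.normal(size=8), rng.normal(size=8)
negs = rng.normal(size=(16, 8))                      # list of negatives
loss = multi_negative_contrastive_loss(a, p, negs)
```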

The term foundation model was introduced by Bommasani et al., 2021. The use of DRO to formulate the contrastive loss was first proposed by Qiu et al., 2023, providing a principled approach for optimizing individualized temperature parameters. The discriminative probabilistic modeling approach for self-supervised representation learning was first explored by Wang et al., 2025.

Generalization Error

Generalization error analysis is a central topic in several classical machine learning texts (Shalev-Shwartz and Ben-David, 2014); (Mohri et al., 2018) and in the statistical learning theory literature (Koltchinskii, 2011). Typically, uniform convergence bounds of the form \(\sup_{\mathbf{w}\in\mathcal{W}}|\mathcal{R}(\mathbf{w})-\mathcal{R}_{\mathcal{S}}(\mathbf{w})|\) are derived using concentration inequalities, with dependencies on both the number of training samples \(n\) and the complexity of the hypothesis class. More recently, there has been growing interest in directly analyzing the generalization performance of models returned by stochastic optimization algorithms using stability-based techniques (Hardt et al., 2016); (Lei and Ying, 2019).

Generalization error analyses for DRO and OCE objectives have been extensively developed in the literature: Brown, 2007 established theoretical bounds for CVaR, Namkoong and Duchi, 2017 developed bounds for chi-square-constrained DRO, and Lee et al., 2020 explored generalization bounds for general OCE risks. However, the generalization error analysis for compositional OCE remains underdeveloped.

Machine Learning texts

There are excellent textbooks on machine learning (Shalev-Shwartz and Ben-David, 2014); (Mohri et al., 2018); (Bishop, 2006) and on robust optimization (Ben-Tal et al., 2009). However, to the best of our knowledge, this book is the first to provide a comprehensive and unified treatment of diverse loss functions and objectives, ranging from the traditional cross-entropy loss to the contrastive loss used in self-supervised representation learning, through the lens of robust optimization and discriminative learning.

References

  1. Zhang, T. (2004). Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5, 1225–1251.
  2. Cao, K., Wei, C., Gaidon, A., Arechiga, N., & Ma, T. (2019). Learning imbalanced datasets with label-distribution-aware margin loss. In Advances in Neural Information Processing Systems (NeurIPS), 32, 1567–1578.
  3. Zhu, D., Ying, Y., & Yang, T. (2023). Label distributionally robust losses for multi-class classification: Consistency, robustness and adaptivity. In International Conference on Machine Learning (ICML 2023), PMLR 202, 43289–43325.
  4. Lapin, M., Hein, M., & Schiele, B. (2018). Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7), 1533–1554.
  5. Yang, F., & Koyejo, S. (2020). On the consistency of top-k surrogate losses. In International Conference on Machine Learning (ICML), 10727–10735.
  6. Ghosh, A., Kumar, H., & Sastry, P. S. (2017). Robust loss functions under label noise for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 31.
  7. Scarf, H. (1958). A min-max solution of an inventory problem. In Studies in the Mathematical Theory of Inventory and Production, 201–209. Stanford University Press.
  8. Dupacova, J. (1966). On minimax solutions of stochastic linear programming problems. Casopis pro pestovani matematiky, 91(4), 423–430.
  9. Kouvelis, P., & Yu, G. (1997). Robust Discrete Optimization and Its Applications. Springer.
  10. Shapiro, A., & Kleywegt, A. J. (2002). Minimax analysis of stochastic problems. Optimization Methods and Software, 17(3), 523–542.
  11. Rustem, B., & Howe, M. (2002). Algorithms for Worst-Case Design and Applications to Risk Management. Princeton University Press.
  12. Ben-Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust Optimization. Princeton University Press.
  13. Delage, E., & Ye, Y. (2010). Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3), 595–612.
  14. Csiszar, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299–318.
  15. Ben-Tal, A., den Hertog, D., De Waegenaere, A., Melenberg, B., & Rennen, G. (2013). Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2), 341–357.
  16. Calafiore, G. C. (2007). Ambiguous risk measures and optimal robust portfolios. SIAM Journal on Optimization, 18(3), 853–877.
  17. Shalev-Shwartz, S., & Wexler, Y. (2016). Minimizing the maximal loss: How and why? CoRR, abs/1602.01690.
  18. Namkoong, H., & Duchi, J. C. (2017). Variance-based regularization with convex objectives. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), 2975–2984.
  19. Duchi, J. C., Glynn, P. W., & Namkoong, H. (2022). Statistics of robust optimization: A generalized empirical likelihood approach. Mathematics of Operations Research, 47(2), 882–910.
  20. Ben-Tal, A., & Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Management Science, 32(11), 1445–1466.
  21. Ben-Tal, A., & Teboulle, M. (2007). An old-new concept of convex risk measures: the optimized certainty equivalent. Mathematical Finance, 17(3), 449–476.
  22. Hu, W., Niu, G., Sato, I., & Sugiyama, M. (2018). Does distributionally robust supervised learning give robust classifiers? In International Conference on Machine Learning (ICML), PMLR 80, 2029–2037.
  23. Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2019). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. CoRR, abs/1911.08731.
  24. Marcum, J. I. (1947). A Statistical Theory of Target Detection by Pulsed Radar (RM-754). RAND Corporation.
  25. Green, D. M., & Swets, J. A. (1966). Signal Detection Theory and Psychophysics. John Wiley and Sons.
  26. Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
  27. Spackman, K. A. (1989). Signal detection theory: valuable tools for evaluating inductive learning. In Proceedings of the Sixth International Workshop on Machine Learning, 160–163.
  28. Dodd, L. E., & Pepe, M. S. (2003). Partial AUC estimation and regression. Biometrics, 59(3), 614–623.
  29. Yang, H., Lu, K., Lyu, X., & Hu, F. (2019). Two-way partial AUC and its properties. Statistical Methods in Medical Research, 28(1), 184–195.
  30. Verrelst, H., Moreau, Y., Vandewalle, J., & Timmerman, D. (1998). Use of a multi-layer perceptron to predict malignancy in ovarian tumors. In Advances in Neural Information Processing Systems 10 (NIPS’97), 978–984.
  31. Yan, L., Dodier, R., Mozer, M. C., & Wolniewicz, R. (2003). Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In Proceedings of the 20th International Conference on Machine Learning (ICML), 848–855.
  32. Cortes, C., & Mohri, M. (2003). AUC optimization vs. error rate minimization. In Advances in Neural Information Processing Systems, 16.
  33. Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
  34. Ying, Y., Wen, L., & Lyu, S. (2016). Stochastic online AUC maximization. In Advances in Neural Information Processing Systems, 29, 451–459.
  35. Yuan, Z., Yan, Y., Sonka, M., & Yang, T. (2021). Large-scale robust deep AUC maximization: A new surrogate loss and empirical studies on medical image classification. In International Conference on Computer Vision (ICCV 2021), 3020–3029.
  36. Zhu, D., Wu, X., & Yang, T. (2022). Benchmarking deep AUROC optimization: Loss functions and algorithmic choices. CoRR, abs/2203.14177.
  37. Yang, T., & Ying, Y. (2023). AUC maximization in the era of big data and AI: A survey. ACM Computing Surveys, 55(8), Article 172, 1–37.
  38. Morgan, W., Greiff, W., & Henderson, J. (2004). Direct maximization of average precision by hill-climbing, with a comparison to a maximum entropy approach. In Proceedings of HLT-NAACL 2004: Short Papers, 93–96.
  39. Zhu, D., Li, G., Wang, B., Wu, X., & Yang, T. (2022). When AUC meets DRO: Optimizing partial AUC for deep learning with non-convex convergence guarantee. In International Conference on Machine Learning (ICML 2022), PMLR 162, 27548–27573.
  40. Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’00), 41–48.
  41. Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, 129–136.
  42. Yang, T. (2022). Algorithmic foundation of empirical X-risk minimization. arXiv preprint arXiv:2206.00439.
  43. Yang, L., & Jin, R. (2006). Distance metric learning: A comprehensive survey. Technical report, Department of Computer Science and Engineering, Michigan State University.
  44. Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR’06), 2, 1735–1742.
  45. Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(9), 207–244.
  46. Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss objective. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), 1857–1865.
  47. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R. B., Arora, S., von Arx, S., et al. (2021). On the opportunities and risks of foundation models. CoRR, abs/2108.07258.
  48. Qiu, Z.-H., Hu, Q., Yuan, Z., Zhou, D., Zhang, L., & Yang, T. (2023). Not all semantics are created equal: Contrastive self-supervised learning with automatic temperature individualization. In International Conference on Machine Learning (ICML 2023), PMLR 202, 28389–28421.
  49. Wang, B., Lei, Y., Ying, Y., & Yang, T. (2025). On discriminative probabilistic modeling for self-supervised representation learning. In International Conference on Learning Representations (ICLR 2025).
  50. Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  51. Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press.
  52. Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer.
  53. Hardt, M., Recht, B., & Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (ICML), 1225–1234.
  54. Lei, Y.-X., & Ying, Y. (2019). Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning (ICML), 5809–5819.
  55. Brown, D. B. (2007). Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 35(6), 722–730.
  56. Lee, J., Park, S., & Shin, J. (2020). Learning bounds for risk-sensitive learning. arXiv preprint arXiv:2006.08138.
  57. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag.
  58. Rockafellar, R. T. (1970). Convex Analysis. Princeton Mathematical Series. Princeton University Press.