
Section 6.10 History and Notes

DRO and GDRO

We first formulated KL-regularized Distributionally Robust Optimization (DRO) as a stochastic compositional optimization (SCO) problem in (Qi et al., 2021) and optimized it using STORM-based estimators. This line of research was further developed in (Qi et al., 2023), which introduced an attentional-biased stochastic momentum method for KL-regularized DRO with applications to imbalanced data classification. Subsequently, we extended both the algorithmic framework and the theoretical analysis to address KL-constrained DRO (Qi et al., 2023). Collectively, these works demonstrate the advantages of compositional optimization techniques over traditional primal-dual methods for solving DRO problems.
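
To make the compositional structure explicit, recall that the inner maximization of KL-regularized DRO has a closed form. A brief sketch, in notation chosen here for illustration, with losses \(\ell_i(\mathbf{w})\) and regularization parameter \(\lambda>0\):

\[
\min_{\mathbf{w}} \max_{\mathbf{p}\in\Delta_n}\ \sum_{i=1}^n p_i\,\ell_i(\mathbf{w}) - \lambda\,\mathrm{KL}\Big(\mathbf{p}\,\Big\|\,\tfrac{\mathbf{1}}{n}\Big)
\;=\; \min_{\mathbf{w}}\ \lambda\log\Big(\frac{1}{n}\sum_{i=1}^n \exp\big(\ell_i(\mathbf{w})/\lambda\big)\Big).
\]

The right-hand side is a composition \(f(g(\mathbf{w}))\) with outer function \(f(\cdot)=\lambda\log(\cdot)\) and inner function \(g(\mathbf{w})=\frac{1}{n}\sum_i \exp(\ell_i(\mathbf{w})/\lambda)\), an expectation that must be estimated from samples; this is exactly the setting in which STORM-based inner estimators apply.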

The FCCO formulation of group DRO (GDRO) was first identified in (Hu et al., 2024). Building on this, Wang and Yang (2023) applied the ALEXR algorithm to convex GDRO, demonstrating significant improvements over traditional stochastic primal-dual methods. Most recently, Chen et al. (2025) investigated the application of SONEX to non-convex GDRO in the context of deep learning.

Stochastic AUC and NDCG Optimization

Stochastic AUC maximization has a long history in machine learning, as detailed in our survey (Yang and Ying, 2023). The formulation of AUC maximization with a square surrogate loss as a minimax optimization problem was first introduced by Ying et al. (2016). Building on this foundation, we developed the first convergence analysis of stochastic non-convex minimax optimization in the context of deep AUC maximization (Liu et al., 2020). While that work was inspired by our earlier work on weakly-convex-concave minimax optimization (Rafique et al., 2022), it established a superior complexity bound by leveraging the PL condition. These theoretical results were subsequently strengthened in (Guo et al., 2023).
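
For reference, a sketch of that reformulation: with scoring function \(h_{\mathbf{w}}\) and \(p=\Pr(y=1)\), the square-loss AUC objective equals, up to an additive constant and a positive scaling,

\[
\min_{\mathbf{w},a,b}\ \max_{\alpha\in\mathbb{R}}\ \mathbb{E}_{(x,y)}\Big[(1-p)\big(h_{\mathbf{w}}(x)-a\big)^2\,\mathbb{I}(y=1) + p\big(h_{\mathbf{w}}(x)-b\big)^2\,\mathbb{I}(y=-1)
+ 2(1+\alpha)\big(p\,h_{\mathbf{w}}(x)\,\mathbb{I}(y=-1) - (1-p)\,h_{\mathbf{w}}(x)\,\mathbb{I}(y=1)\big) - p(1-p)\,\alpha^2\Big],
\]

where at optimality \(a\) and \(b\) track the mean scores of the positive and negative classes and \(\alpha\) tracks their gap. The objective is non-convex in \(\mathbf{w}\) for deep models yet strongly concave in \(\alpha\), which is the structure exploited by the analyses above.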

This line of research eventually facilitated our winning entry in the CheXpert competition for X-ray image classification (Yuan et al., 2021), which also introduced the AUC-margin minimax objective. Notably, all of these methods employ a double-loop algorithmic structure. The single-loop PDMA and PDAdam methods for deep AUC maximization were first proposed and analyzed in our work (Guo et al., 2025). The compositional training method for deep AUC maximization, which unifies feature learning and classifier learning in a single framework, was proposed in our work (Yuan et al., 2022).

The SOAP algorithm for stochastic optimization of areas under precision-recall curves was the first of its kind to offer a convergence guarantee that does not rely on large batch sizes, a requirement that had challenged the computer vision and machine learning communities for many years (see references in (Qi et al., 2021)). The SOPA and SOPAs algorithms for one-way partial AUC maximization and the SOTAs algorithm for two-way partial AUC maximization were developed and analyzed in (Zhu et al., 2022). The STACO algorithm for two-way partial AUC maximization was proposed in (Zhou et al., 2025). These studies addressed long-standing open problems in efficient partial AUC maximization with convergence guarantees (Kar et al., 2014; Narasimhan and Agarwal, 2013).

The formulation of stochastic NDCG optimization as FCCO was proposed in our work (Qiu et al., 2022), which also developed a multi-block bilevel optimization formulation and algorithm for optimizing the top-\(K\) NDCG. The complexity of multi-block bilevel optimization was improved in (Hu et al., 2023) by using MSVR estimators.
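
A sketch of the FCCO structure, in our notation here: for a query \(q\) with item list \(S_q\), the (surrogate) rank of an item \(x\) is an average of pairwise losses over the list, and it enters the NDCG discount through a non-linear function:

\[
r_q(\mathbf{w};x) \;=\; \sum_{x'\in S_q} \ell\big(h_{\mathbf{w}}(x';q) - h_{\mathbf{w}}(x;q)\big), \qquad
\text{surrogate NDCG} \;\propto\; \sum_{x\in S_q^+} \frac{2^{y_x}-1}{\log_2\big(1+r_q(\mathbf{w};x)\big)},
\]

where \(\ell\) is a surrogate of the 0/1 ranking indicator and \(S_q^+\) denotes the relevant items. Each summand is thus \(f(g(\mathbf{w}))\) with an inner expectation \(g=r_q\) and a non-linear outer function \(f(g)=(2^{y_x}-1)/\log_2(1+g)\), matching the FCCO template.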

The design and benchmarking of the LibAUC library were presented in (Yuan et al., 2023).

Discriminative Learning of Foundation Models

The SogCLR algorithm was inspired by the SOX framework for FCCO; its advantages over SimCLR, particularly its efficiency with small batch sizes in unimodal contrastive learning, were demonstrated in (Yuan et al., 2022) (see the sketch below). Building on this, we introduced iSogCLR in (Qiu et al., 2023) to optimize individualized temperatures. This advancement was also informed by our earlier research on KL-constrained DRO (Qi et al., 2023).
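
The connection to FCCO can be sketched as follows, in our notation here: with similarity \(s_{\mathbf{w}}\), temperature \(\tau\), anchor \(x_i\), positive \(x_i^+\), and negative set \(S_i^-\), the global contrastive objective takes the form

\[
\min_{\mathbf{w}}\ \frac{1}{n}\sum_{i=1}^n \tau\log\Big(\frac{1}{|S_i^-|}\sum_{z\in S_i^-} \exp\big(\big(s_{\mathbf{w}}(x_i,z)-s_{\mathbf{w}}(x_i,x_i^+)\big)/\tau\big)\Big),
\]

where the inner average over all negatives is what a mini-batch can only estimate. Tracking this inner quantity per anchor with a moving-average estimator, as in SOX, is what removes the large-batch requirement of SimCLR-style training.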

Subsequently, we proposed TempNet (Qiu et al., 2024), which has been successfully applied to CLIP training and the pretraining of large language models (LLMs). Furthermore, a comprehensive evaluation of FCCO-based techniques for distributed CLIP training was recently provided in (Wei et al., 2024).

The discriminative fine-tuning approach for LLMs was proposed in our work (Guo et al., 2025). The DisCO method for fine-tuning large reasoning models was developed in our work (Li et al., 2025).

FCCO for Constrained Learning

The application of compositional optimization techniques to penalty methods for constrained learning dates back to (Ermoliev and Wets, 1988). The first non-asymptotic analysis of the penalty method with a squared-hinge penalty function for non-convex inequality-constrained optimization based on FCCO was conducted in our work (Li et al., 2024). That work investigated the problem in the context of continual learning under zero-forgetting constraints and established a complexity of \(O(1/\epsilon^7)\) for finding an \(\epsilon\)-KKT solution. Additionally, we developed a theoretical framework characterizing the benefits of network expansion in facilitating constrained learning with zero-forgetting constraints. The ROC fairness constraint was first considered in (Vogel et al., 2020).
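
The FCCO structure of the penalty can be sketched as follows, in our notation here: for stochastic constraints \(c_i(\mathbf{w})=\mathbb{E}_z[\phi_i(\mathbf{w};z)]\le 0\), \(i=1,\dots,m\), the squared-hinge penalty method solves

\[
\min_{\mathbf{w}}\ F(\mathbf{w}) + \frac{\beta}{m}\sum_{i=1}^m \big[\max\big(0,\,c_i(\mathbf{w})\big)\big]^2,
\]

where each penalty term is a composition \(f(g_i(\mathbf{w}))\) with smooth outer function \(f(g)=[\max(0,g)]^2\) and inner expectation \(g_i=c_i\), so the FCCO machinery applies directly; the hinge penalty \(f(g)=\max(0,g)\) used by SONX below is its non-smooth counterpart.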

Subsequent advances have further improved the complexity of FCCO-based penalty methods. By employing SONX for the hinge penalty, the complexity was reduced to \(O(1/\epsilon^6)\) (Yang et al., 2025). More recently, the introduction of SONEX and a double-loop ALEXR method for the squared-hinge penalty achieved a complexity of \(O(1/\epsilon^5)\) (Chen et al., 2025), which is currently the state-of-the-art complexity for penalty methods in non-convex constrained optimization.

Learning with Data Compositional Networks

Graph convolutional networks (GCNs) were proposed by Kipf and Welling (2017), and GraphSAGE was developed in (Hamilton et al., 2017). The use of compositional optimization techniques, specifically feature momentum, for large-scale graph neural network (GNN) learning was introduced in our previous work (Yu et al., 2022) (see the sketch below). Furthermore, the application of compositional optimization to multi-instance learning with compositional pooling operations was first proposed in (Zhu et al., 2023). Attention-based pooling for multi-instance learning was proposed by Ilse et al. (2018).
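
A brief sketch, in our notation here, of why GNN training is compositional: a mean-aggregation layer computes, schematically,

\[
\mathbf{h}_v^{(k)} \;=\; \sigma\Big(\mathbf{W}^{(k)}\,\mathbb{E}_{u\sim\mathcal{N}(v)}\big[\mathbf{h}_u^{(k-1)}\big]\Big),
\]

so a \(K\)-layer network nests \(K\) expectations over neighborhoods. Sampling a few neighbors yields a biased estimate of each \(\sigma(\mathbf{W}\,\mathbb{E}[\cdot])\), and feature momentum controls this bias by tracking the inner expectations with moving-average estimators; pooling over the instances in a bag plays an analogous role in multi-instance learning.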

DRRHO Risk Minimization

The DRRHO risk minimization framework and its application to CLIP training were introduced in our work (Wei et al., 2025). The theoretical analysis of this method builds largely on the foundations of DRO (Namkoong and Duchi, 2017), while the idea of using the RHO loss for data selection within a mini-batch was originally proposed in (Mindermann et al., 2022).


References

  1. Qi, Q., Guo, Z., Xu, Y., Jin, R., and Yang, T. (2021). An Online Method for A Class of Distributionally Robust Optimization with Non-convex Objectives. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 10067–10080.

  2. Qi, Q., Xu, Y., Jin, R., Yin, W., and Yang, T. (2023). Attentional-Biased Stochastic Gradient Descent. Trans. Mach. Learn. Res., 2023.

  3. Qi, Q., Lyu, J., Chan, K., Bai, E., and Yang, T. (2023). Stochastic Constrained DRO with a Complexity Independent of Sample Size. Trans. Mach. Learn. Res., 2023.

  4. Hu, Q., Qi, Q., Lu, Z., and Yang, T. (2024). Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10-15, 2024.

  5. Wang, B. and Yang, T. (2023). A Near-Optimal Single-Loop Stochastic Algorithm for Convex Finite-Sum Coupled Compositional Optimization. In International Conference on Machine Learning.

  6. Chen, X., Wang, B., Yang, M., Lin, Q., and Yang, T. (2025). Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  7. Yang, T. and Ying, Y. (2023). AUC Maximization in the Era of Big Data and AI: A Survey. ACM Comput. Surv., 55(8), 172:1–172:37.

  8. Ying, Y., Wen, L., and Lyu, S. (2016). Stochastic online AUC maximization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 451–459.

  9. Liu, M., Yuan, Z., Ying, Y., and Yang, T. (2020). Stochastic AUC Maximization with Deep Neural Networks. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

  10. Rafique, H., Liu, M., Lin, Q., and Yang, T. (2022). Weakly-convex-concave min-max optimization: provable algorithms and applications in machine learning. Optim. Methods Softw., 37(3), 1087–1121.

  11. Guo, Z., Yan, Y., Yuan, Z., and Yang, T. (2023). Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition. J. Mach. Learn. Res., 24, 148:1–148:63.

  12. Yuan, Z., Yan, Y., Sonka, M., and Yang, T. (2021). Large-scale Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pp. 3020–3029.

  13. Guo, Z., Xu, Y., Yin, W., Jin, R., and Yang, T. (2025). Unified convergence analysis for adaptive optimization with moving average estimator. Mach. Learn., 114(4).

  14. Yuan, Z., Guo, Z., Chawla, N. V., and Yang, T. (2022). Compositional Training for End-to-End Deep AUC Maximization. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022.

  15. Qi, Q., Luo, Y., Xu, Z., Ji, S., and Yang, T. (2021). Stochastic optimization of areas under precision-recall curves with provable convergence. In Proceedings of the 35th International Conference on Neural Information Processing Systems.

  16. Zhu, D., Li, G., Wang, B., Wu, X., and Yang, T. (2022). When AUC meets DRO: Optimizing Partial AUC for Deep Learning with Non-Convex Convergence Guarantee. In Proceedings of International Conference on Machine Learning, 2022.

  17. Zhou, L., Wang, B., Thai, M. T., and Yang, T. (2025). Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization. Transactions on Machine Learning Research.

  18. Kar, P., Narasimhan, H., and Jain, P. (2014). Online and stochastic gradient methods for non-decomposable loss functions. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pp. 694–702.

  19. Narasimhan, H. and Agarwal, S. (2013). A Structural SVM Based Approach for Optimizing Partial AUC. In Proceedings of the 30th International Conference on Machine Learning, pp. 516–524.

  20. Qiu, Z., Hu, Q., Zhong, Y., Zhang, L., and Yang, T. (2022). Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pp. 18122–18152.

  21. Hu, Q., Qiu, Z., Guo, Z., Zhang, L., and Yang, T. (2023). Blockwise Stochastic Variance-Reduced Methods with Parallel Speedup for Multi-Block Bilevel Optimization. CoRR, abs/2305.18730.

  22. Yuan, Z., Zhu, D., Qiu, Z., Li, G., Wang, X., and Yang, T. (2023). LibAUC: A Deep Learning Library for X-Risk Optimization. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pp. 5487–5499.

  23. Yuan, Z., Wu, Y., Qiu, Z., Du, X., Zhang, L., Zhou, D., and Yang, T. (2022). Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pp. 25760–25782.

  24. Qiu, Z., Hu, Q., Yuan, Z., Zhou, D., Zhang, L., and Yang, T. (2023). Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pp. 28389–28421.

  25. Qiu, Z., Guo, S., Xu, M., Zhao, T., Zhang, L., and Yang, T. (2024). To Cool or not to Cool? Temperature Network Meets Large Foundation Models via DRO. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024.

  26. Wei, X., Ye, F., Yonay, O., Chen, X., Sun, B., Tao, D., and Yang, T. (2024). FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources. CoRR, abs/2407.01445.

  27. Guo, S., Hong, I., Balmaseda, V., Yu, C., Qiu, L., Liu, X., Jiang, H., Zhao, T., and Yang, T. (2025). Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025.

  28. Li, G., Lin, M., Galanti, T., Tu, Z., and Yang, T. (2025). DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  29. Ermoliev, Y. and Wets, R. J. (Eds.) (1988). Numerical Techniques for Stochastic Optimization. Springer Series in Computational Mathematics, vol. 10. Springer-Verlag.

  30. Li, G., Yu, W., Yao, Y., Tong, W., Liang, Y., Lin, Q., and Yang, T. (2024). Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models. CoRR, abs/2410.03955.

  31. Vogel, R., Bellet, A., and Clémençon, S. (2020). Learning Fair Scoring Functions: Bipartite Ranking under ROC-based Fairness Constraints. In International Conference on Artificial Intelligence and Statistics.

  32. Yang, M., Li, G., Hu, Q., Lin, Q., and Yang, T. (2025). Single-loop Algorithms for Stochastic Non-convex Optimization with Weakly-Convex Constraints. CoRR, abs/2504.15243.

  33. Kipf, T. N. and Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.

  34. Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 1025–1035.

  35. Yu, H., Wang, L., Wang, B., Liu, M., Yang, T., and Ji, S. (2022). GraphFM: Improving Large-Scale GNN Training via Feature Momentum. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, pp. 25684–25701.

  36. Zhu, D., Wang, B., Chen, Z., Wang, Y., Sonka, M., Wu, X., and Yang, T. (2023). Provable Multi-instance Deep AUC Maximization with Stochastic Pooling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pp. 43205–43227.

  37. Ilse, M., Tomczak, J., and Welling, M. (2018). Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 2127–2136.

  38. Wei, X., Lin, M. C., Ye, F., Song, F., Cao, L., Thai, M. T., and Yang, T. (2025). Model Steering: Learning with a Reference Model Improves Generalization Bounds and Scaling Laws. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025.

  39. Namkoong, H. and Duchi, J. C. (2017). Variance-based regularization with convex objectives. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 2975–2984.

  40. Mindermann, S., Brauner, J. M., Razzak, M. T., Sharma, M., Kirsch, A., Xu, W., Höltgen, B., Gomez, A. N., Morisot, A., Farquhar, S., and Gal, Y. (2022). Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt. In Proceedings of the 39th International Conference on Machine Learning, pp. 15630–15649.

  41. Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.

  42. Nemirovski, A. S. and Yudin, D. B. (1977). Information complexity of strongly convex optimization. Ekonomika i Matematicheskie Metody, 13(3), 550–559.

  43. Nesterov, Y. (1983). A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Proceedings of the USSR Academy of Sciences, 269, 543–547.

  44. Lan, G. (2012). An optimal method for stochastic composite optimization. Math. Program., 133(1–2), 365–397.

  45. Yang, T., Lin, Q., and Li, Z. (2016). Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv:1604.03257.

  46. Tieleman, T. and Hinton, G. (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31.

  47. Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR (Poster).

  48. Jordan, K., Jin, Y., Boza, V., Jiacheng, Y., Cesista, F., Newhouse, L., and Bernstein, J. (2024). Muon: An optimizer for hidden layers in neural networks.

  49. Wei, X., Zhou, L., Wang, B., Lin, C.-J., and Yang, T. (2026). A geometry-aware efficient algorithm for compositional entropic risk minimization. In Proceedings of International Conference on Machine Learning.

  50. An, X., Zhu, X., Gao, Y., Xiao, Y., Zhao, Y., Feng, Z., Wu, L., Qin, B., Zhang, M., Zhang, D., and Fu, Y. (2021). Partial FC: Training 10 million identities on a single machine. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pp. 1445–1449.

  51. Zaheer, M., Kottur, S., Ravanbakhsh, S., Póczos, B., Salakhutdinov, R., and Smola, A. J. (2017). Deep Sets. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 3394–3404.

  52. Qi, C., Su, H., Mo, K., and Guibas, L. J. (2017). PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 77–85.
