
Section 5.6 History and Notes

Finite-sum coupled compositional optimization (FCCO) was first formalized in our work (Qi et al., 2021) for optimizing average precision, an empirical estimator of the area under the precision–recall curve. We proposed the SOAP algorithm for AP maximization and established the first complexity bound of \(O\left(\frac{n}{\epsilon^5}\right)\) for finding an \(\epsilon\)-stationary solution. The SOAP algorithm is closely related to SOX, but differs in that it does not employ a moving-average gradient estimator. The framework was demonstrated on applications including image classification and molecular property prediction for drug discovery. The analysis of SOAP draws inspiration from the original SCGD analysis (Wang et al., 2017), while significantly improving upon its \(O(1/\epsilon^8)\) complexity through a better hyper-parameter setting, leading to Theorem 4.1.

To accelerate convergence, we subsequently adopted the moving-average gradient estimator for FCCO (Wang et al., 2022). While this approach achieves a complexity of \(O\left(\frac{n}{B\epsilon^4}\right)\), it does not benefit from the variance reduction gained by using mini-batches to estimate inner function values. The limitation arises because we treat all inner functions as a single vector variable and compute a sparse unbiased stochastic estimator for this vector; consequently, the estimator does not enjoy the advantages of inner mini-batching. The improved rate and analysis were inspired by the stochastic compositional momentum method (Ghadimi et al., 2020).

Subsequently, we proposed the SOX algorithm, a significant advancement for solving FCCO (Wang & Yang, 2022), encompassing new design, theoretical analysis, and practical applications. In that work, we established a complexity of \(O\left(\frac{n\sigma_0^2}{B\epsilon^4}\right)\) for SOX to find an \(\epsilon\)-stationary solution of non-convex smooth FCCO problems. The analysis integrates the stochastic block-coordinate update of the \(\mathbf{u}\) sequence with that of the stochastic compositional momentum method.
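
As a concrete illustration of the two moving averages that SOX combines, the following NumPy sketch runs a SOX-style update on a toy FCCO instance. The quadratic outer function, linear inner functions, noise model, and all hyper-parameter values are our own assumptions for illustration, not the algorithm or setting analyzed in (Wang & Yang, 2022).

```python
import numpy as np

# Minimal SOX-style sketch for FCCO: min_w (1/n) sum_i f(g_i(w)), on a
# hypothetical toy instance with outer function f(u) = u^2 and inner
# functions g_i(w) = x_i^T w - y_i observed with additive noise.

rng = np.random.default_rng(0)
n, d = 50, 10                       # number of inner blocks, dimension
X = rng.normal(size=(n, d))         # toy data defining the inner functions
y = rng.normal(size=n)

def g_val(i, w, noise=0.1):
    """Stochastic estimate of the inner value g_i(w)."""
    return X[i] @ w - y[i] + noise * rng.normal()

def g_grad(i, w, noise=0.1):
    """Stochastic estimate of the gradient of g_i."""
    return X[i] + noise * rng.normal(size=d)

def f_grad(u):
    """Gradient of the outer function f(u) = u^2."""
    return 2.0 * u

w = np.zeros(d)
u = np.zeros(n)   # per-block moving-average estimates of inner values
v = np.zeros(d)   # moving-average (momentum) estimate of the gradient
gamma, beta, eta, B = 0.5, 0.1, 0.05, 8

for t in range(2000):
    blocks = rng.choice(n, size=B, replace=False)
    grad = np.zeros(d)
    for i in blocks:
        # block-coordinate moving average: only sampled u_i's are updated
        u[i] = (1 - gamma) * u[i] + gamma * g_val(i, w)
        grad += f_grad(u[i]) * g_grad(i, w)
    v = (1 - beta) * v + beta * grad / B   # compositional momentum step
    w -= eta * v
```

The key point is that only the sampled coordinates of \(\mathbf{u}\) move at each step, while \(\mathbf{v}\) smooths the resulting biased gradient estimates over time.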

Building on this, we developed a double-loop restarted algorithm that utilizes SOX in the inner loop to address non-convex problems under the \(\mu\)-PL (Polyak-Łojasiewicz) condition, i.e., \(\|\nabla F(\mathbf{w})\|_2^2\geq \mu (F(\mathbf{w}) - \min_\mathbf{w} F(\mathbf{w}))\). This approach yields an improved complexity of \(O\left(\frac{n\sigma_0^2}{\mu^2 B\epsilon}\right)\) for finding an \(\epsilon\)-optimal solution. This result further implies a complexity of \(O\left(\frac{n\sigma_0^2}{\mu^2 B\epsilon}\right)\) for strongly convex FCCO problems and \(O\left(\frac{n\sigma_0^2}{B\epsilon^3}\right)\) for convex FCCO problems, requiring no assumptions on the individual convexity of the inner and outer functions beyond the overall convexity of the objective. The improved convergence analysis under the PL condition for the double-loop restarted algorithm was inspired by our prior work on stochastic compositional optimization for distributionally robust learning (Qi et al., 2021b). A comparable complexity bound of \(O(\frac{1}{\mu^2\epsilon})\) for a single-loop algorithm in the context of stochastic compositional optimization (SCO) under the PL condition was subsequently established in (Jiang et al., 2023), which considers the application of SCO to training energy-based models.
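
For intuition on why the PL condition enables the restart scheme, note that it converts a gradient-norm guarantee into an optimality-gap guarantee:

\[F(\mathbf{w}) - \min_{\mathbf{w}} F(\mathbf{w}) \leq \frac{1}{\mu}\|\nabla F(\mathbf{w})\|_2^2,\]

so any point with \(\|\nabla F(\mathbf{w})\|_2^2 \leq \mu\epsilon\) is \(\epsilon\)-optimal, and each restarted stage can target a constant-factor reduction of the gap.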

Furthermore, for convex FCCO instances where the outer function is both convex and monotonically non-decreasing and the inner functions are convex, (Wang & Yang, 2022) reformulated the problem as a convex-concave min-max optimization problem and established a complexity of \(O\left(\frac{n\sigma_0^2}{B\epsilon^2}\right)\) under a weak duality convergence measure. Finally, when a \(\mu\)-strongly convex regularizer is present, the complexity is further refined to \(O\left(\frac{n\sigma_0^2}{\mu^2 B\epsilon}\right)\) for finding an \(\epsilon\)-optimal solution in terms of Euclidean distance to the optimum. This analysis was mostly inspired by (Zhang & Lan, 2024), which is the first work that establishes the optimal complexity for solving convex SCO where the outer function is both convex and monotonically non-decreasing and the inner function is convex.

Later, Jiang et al. (2022) proposed the Multi-Block-Single-Probe Variance Reduction (MSVR) algorithm for FCCO, establishing improved complexity bounds over SOX by leveraging the mean squared smoothness of the inner functions. For non-convex smooth FCCO problems, MSVR improves the complexity to \(O\left(\frac{n\sigma_0}{B\epsilon^3}\right)\) for identifying an \(\epsilon\)-stationary solution.

For objectives satisfying the \(\mu\)-PL condition, a double-loop restarted MSVR algorithm achieves an improved complexity of \(O\left(\frac{n\sigma_0}{\mu B\epsilon}\right)\) to find an \(\epsilon\)-optimal solution. Consequently, this approach yields a complexity of \(O\left(\frac{n\sigma_0}{\mu B\epsilon}\right)\) for strongly convex FCCO problems and \(O\left(\frac{n\sigma_0}{B\epsilon^2}\right)\) for convex FCCO problems.

Non-smooth weakly convex FCCO and the SONX (v2) algorithm were studied in our work (Hu et al., 2023), which established a complexity of \(O\left(\frac{n\sigma_0}{B\epsilon^6}\right)\) for finding a nearly \(\epsilon\)-stationary solution when both the inner and outer functions are weakly convex. A similar analysis for a special case of weakly convex SCO was conducted in (Zhu et al., 2023). When the outer function is smooth, the complexity is improved in this book to \(O\left(\frac{n\sigma_0}{B\epsilon^4}\right)\). The SONEX algorithm for solving weakly convex FCCO with non-smooth outer functions was proposed in our work (Chen et al., 2025).

The ALEXR algorithm and its analysis for convex FCCO instances, where the outer function is both convex and monotonically non-decreasing and the inner functions are convex, appeared in our work (Wang & Yang, 2023). For the first time, we established a complexity of \(O\left(\frac{n\sigma_0^2}{B\epsilon^2}\right)\) for finding an \(\epsilon\)-optimal solution of convex FCCO. Our analysis of the stochastic block-coordinate update for the dual variables is primarily informed by the framework in Alacaoglu et al. (2025), which addresses convex-concave minimax problems with bilinear structures. The extrapolation for the gradient of the dual variable is inspired by (Zhang et al., 2021). It is worth mentioning that for strongly convex FCCO with smooth outer functions, we only established the convergence of ALEXR in terms of the Euclidean distance to the optimum. However, it is possible to establish convergence for the objective gap and even the duality gap following our work on strongly-convex strongly-concave min-max optimization (Yan et al., 2020).

In (Wang & Yang, 2023), we also established lower bounds for convex FCCO and strongly convex FCCO, which match the upper bounds. Our derivation of the lower bound for convex FCCO with non-smooth outer functions builds upon the construction presented in (Zhang & Lan, 2024) for SCO.

The double-loop ALEXR was developed in (Chen et al., 2025), which was mostly inspired by a line of work on weakly-convex concave min-max problems (Rafique et al., 2022; Yan et al., 2020; Zhang et al., 2022). Rafique et al. (2022) is the first work that proves convergence for weakly-convex (strongly)-concave problems. Yan et al. (2020) simplified the algorithm for weakly-convex strongly-concave problems with \(\mu\)-strong concavity on the dual variable and established a complexity of \(O(\frac{1}{\mu^2\epsilon^4})\) for finding a nearly \(\epsilon\)-stationary point. The later work (Zhang et al., 2022) improved the complexity to \(O(\frac{1}{\mu\epsilon^4})\) with a simple change to the number of inner-loop iterations.

The non-convex analysis of ASGD for compositional CVaR minimization first appeared in (Zhu et al., 2022) for one-way partial AUC optimization. The geometry-aware algorithm SCENT for CERM and its analysis were developed in (Wei et al., 2026). It remains an interesting problem to conduct a fine-grained analysis of SCENT for non-convex problems.

A more general framework than FCCO is the so-called conditional stochastic optimization (CSO), defined as:

\[\min_{\mathbf{w}} \mathbb{E}_{\xi} \left[ f_\xi \left( \mathbb{E}_{\zeta|\xi} [g(\mathbf{w}; \zeta, \xi)] \right) \right].\]
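
FCCO is recovered as the special case in which \(\xi\) is uniformly distributed over \(\{1, \dots, n\}\), so that the objective becomes

\[\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n f_i\left(\mathbb{E}_{\zeta|i}\left[g(\mathbf{w}; \zeta, i)\right]\right).\]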

This paradigm was formally introduced by Hu et al. (2020), who analyzed a biased SGD (BSGD) algorithm employing a large inner mini-batch and a constant outer mini-batch. For non-convex smooth problems, using an inner batch size of \(O(\epsilon^{-2})\) results in an iteration complexity of \(O(\epsilon^{-4})\), which translates to a total sample complexity of \(O(\epsilon^{-6})\). This performance is inferior to that of SOX when \(n/B < \epsilon^{-2}\). For convex and \(\mu\)-strongly convex CSO problems, an inner batch size of \(O(\epsilon^{-1})\) yields iteration complexities of \(O(\epsilon^{-2})\) and \(O(\mu^{-2}\epsilon^{-1})\), respectively. Notably, the latter complexity is likewise worse than that of restarted SOX when \(n/B < \epsilon^{-1}\).
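
The non-convex sample-complexity comparison is a simple product: BSGD draws an inner mini-batch of size \(O(\epsilon^{-2})\) at each of its \(O(\epsilon^{-4})\) iterations, giving

\[O(\epsilon^{-4}) \cdot O(\epsilon^{-2}) = O(\epsilon^{-6})\]

samples in total; SOX's \(O\left(\frac{n}{B\epsilon^4}\right)\) is smaller precisely when \(\frac{n}{B\epsilon^4} < \epsilon^{-6}\), i.e., \(n/B < \epsilon^{-2}\).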


References

  1. Qi, Q., Luo, Y., Xu, Z., Ji, S., & Yang, T. (2021). Stochastic optimization of areas under precision-recall curves with provable convergence. Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS 2021).

  2. Wang, M., Fang, E. X., & Liu, H. (2017). Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1–2), 419–449.

  3. Wang, G., Yang, M., Zhang, L., & Yang, T. (2022). Momentum Accelerates the Convergence of Stochastic AUPRC Maximization. International Conference on Artificial Intelligence and Statistics (AISTATS 2022), 151, 3753–3771.

  4. Zhang, J., Xiao, P., Sun, R., & Luo, Z. (2019). A Stochastic Composite Gradient Method with Incremental Variance Reduction. SIAM Journal on Optimization, 29(4), 2750–2782.

  5. Wang, B., & Yang, T. (2022). Finite-Sum Coupled Compositional Stochastic Optimization: Theory and Applications. International Conference on Machine Learning (ICML 2022), 162, 23292–23317.

  6. Qi, Q., Guo, Z., Xu, Y., Jin, R., & Yang, T. (2021b). An Online Method for A Class of Distributionally Robust Optimization with Non-convex Objectives. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 10067–10080.

  7. Jiang, W., Qin, J., Wu, L., Chen, C., Yang, T., & Zhang, L. (2023). Learning Unnormalized Statistical Models via Compositional Optimization. International Conference on Machine Learning (ICML 2023), 202, 15105–15124.

  8. Zhang, Z., & Lan, G. (2024). Optimal methods for convex nested stochastic composite optimization. Mathematical Programming, 212(1), 1–48.

  9. Jiang, W., Li, G., Wang, Y., Zhang, L., & Yang, T. (2022). Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization. Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

  10. Hu, Q., Zhu, D., & Yang, T. (2023). Non-Smooth Weakly-Convex Finite-sum Coupled Compositional Optimization. Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA.

  11. Zhu, L., Gürbüzbalaban, M., & Ruszczynski, A. (2023). Distributionally Robust Learning with Weakly Convex Losses: Convergence Rates and Finite-Sample Guarantees. arXiv:2301.06619.

  12. Chen, X., Wang, B., Yang, M., Lin, Q., & Yang, T. (2025). Stochastic Momentum Methods for Non-smooth Non-Convex Finite-Sum Coupled Compositional Optimization. The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).

  13. Wang, B., & Yang, T. (2023). A Near-Optimal Single-Loop Stochastic Algorithm for Convex Finite-Sum Coupled Compositional Optimization. International Conference on Machine Learning (ICML 2023).

  14. Alacaoglu, A., Cevher, V., & Wright, S. J. (2025). On the complexity of a simple primal-dual coordinate method. Mathematical Programming.

  15. Zhang, X., Aybat, N. S., & Gürbüzbalaban, M. (2021). Robust Accelerated Primal-Dual Methods for Computing Saddle Points. SIAM Journal on Optimization, 34, 1097–1130.

  16. Yan, Y., Xu, Y., Lin, Q., Liu, W., & Yang, T. (2020). Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

  17. Rafique, H., Liu, M., Lin, Q., & Yang, T. (2022). Weakly-convex-concave min-max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 37(3), 1087–1121.

  18. Zhang, X., Aybat, N. S., & Gürbüzbalaban, M. (2022). SAPD+: An Accelerated Stochastic Method for Nonconvex-Concave Minimax Problems. Advances in Neural Information Processing Systems 35 (NeurIPS 2022).

  19. Zhu, D., et al. (2022). When AUC Maximization Meets Weakly Convex Losses. International Conference on Machine Learning (ICML 2022).

  20. Wei, X., Zhou, L., Wang, B., Lin, C., & Yang, T. (2026). SCENT: Geometry-aware Algorithm for Compositional Entropic Risk Minimization.

  21. Hu, Y., Zhang, S., Chen, X., & He, N. (2020). Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems 33 (NeurIPS 2020).

  22. Ghadimi, S., Ruszczynski, A., & Wang, M. (2020). A Single Timescale Stochastic Approximation Method for Nested Stochastic Optimization. SIAM Journal on Optimization, 30(1), 960–979.

  23. Zhou, L., Wang, B., Thai, M. T., & Yang, T. (2025). Stochastic Primal-Dual Double Block-Coordinate for Two-way Partial AUC Maximization. Transactions on Machine Learning Research.