Section 4.6 History and Notes
The optimization techniques presented in this chapter for stochastic compositional optimization are rooted in the pioneering work of Yuri Ermoliev (Ermoliev, 1976), (Ermoliev and Wets, 1988). The monograph (Ermoliev, 1976), written in Ukrainian, laid the early foundations. Chapter 6 of the edited volume (Ermoliev and Wets, 1988) introduces an early form of the Stochastic Compositional Gradient Descent (SCGD) method, employing a sequence of moving-average estimators \(\mathbf u_t\) to track the inner function values at each iteration, then referred to simply as ``averaging''. The convergence analysis in these early works is largely limited to asymptotic results, if provided at all. Notably, these works considered a broader class of problems with functional constraints, which will be discussed further in Chapter 6.
The study of non-smooth compositional optimization, where a non-smooth convex function is composed with a smooth function, was initiated in the works of Fletcher and Watson (1980) and Fletcher (1982). Their proposed method, known as the prox-linear method, has since been extensively studied and developed in subsequent research (Lewis and Wright, 2009), (Duchi and Ruan, 2018), (Drusvyatskiy et al., 2021), (Duchi and Ruan, 2017), (Drusvyatskiy and Paquette, 2019). We will consider non-smooth compositional optimization in the next chapter.
The modern convergence analysis with non-asymptotic rates for stochastic compositional optimization was pioneered by Wang et al. (2017a). Their initial analysis established an \(O(1/\epsilon^8)\) complexity for finding an \(\epsilon\)-stationary solution to a smooth compositional problem, primarily due to suboptimal choices of learning rates. Subsequent works have aimed to improve this convergence rate (Ghadimi et al., 2020), (Wang et al., 2017b), (Chen et al., 2021a). The improved complexity of \(O(1/\epsilon^5)\) for SCGD is derived by following the parameter settings introduced in Qi et al. (2021a). A further refined complexity of \(O(1/\epsilon^4)\), under the assumption that the inner function is smooth, was achieved in Chen et al. (2021b). The use of a moving-average gradient estimator to attain the \(O(1/\epsilon^4)\) complexity in stochastic compositional optimization is credited to Ghadimi et al. (2020).
The modern variance-reduction technique for estimating the gradient of a smooth function originates from (Johnson and Zhang, 2013), (Mahdavi and Jin, 2013), (Zhang et al., 2013), and was inspired by earlier work (Schmidt et al., 2017) that established linear convergence for finite-sum problems with smooth and strongly convex objectives. This technique is now widely known as SVRG. For the objective function \(f({\mathbf w})=\frac{1}{n}\sum_{i=1}^n f_i({\mathbf w})\), the SVRG gradient estimator takes the form \(\nabla f_i({\mathbf w}_t) - \nabla f_i(\bar {\mathbf w}) + \nabla f(\bar {\mathbf w})\), where \(\bar {\mathbf w}\) is a reference point whose full gradient \(\nabla f(\bar {\mathbf w})\) is computed periodically.
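To make the estimator concrete, the following is a minimal Python sketch of one SVRG epoch in the finite-sum setting above; the component-gradient oracle grad_fi(i, w), the component count n, and the step-size choices are hypothetical placeholders rather than prescriptions from the text.

```python
import numpy as np

# A minimal sketch of the SVRG estimator for f(w) = (1/n) * sum_i f_i(w).
# The oracle grad_fi(i, w) returns the gradient of the i-th component at w;
# grad_fi, n, and the step sizes are assumptions supplied by the user.

def svrg_epoch(w, grad_fi, n, lr=0.1, num_inner=100, seed=0):
    rng = np.random.default_rng(seed)
    w_ref = w.copy()
    # Full gradient at the reference point, recomputed once per epoch.
    full_grad_ref = sum(grad_fi(i, w_ref) for i in range(n)) / n
    for _ in range(num_inner):
        i = rng.integers(n)
        # SVRG estimator: unbiased, with small variance when w is close to w_ref.
        v = grad_fi(i, w) - grad_fi(i, w_ref) + full_grad_ref
        w = w - lr * v
    return w
```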
For non-convex optimization, the variance reduction technique named SPIDER was initiated by Fang et al. (2018), which proposes a gradient estimator \({\mathbf v}_t = {\mathbf v}_{t-1} + \nabla f_i({\mathbf w}_t) - \nabla f_i({\mathbf w}_{t-1})\), with \({\mathbf v}_t\) being periodically re-initialized using either a full gradient or a large-batch gradient. This approach was proposed earlier under the name SARAH for convex optimization in (Nguyen et al., 2017). The technique later evolved into the STORM estimator (Cutkosky and Orabona, 2019), defined as \({\mathbf v}_t = (1-\beta){\mathbf v}_{t-1} + \beta\nabla f({\mathbf w}_t;\xi_t) + (1-\beta)\left[\nabla f({\mathbf w}_t;\xi_t) - \nabla f({\mathbf w}_{t-1};\xi_t)\right]\), which eliminates the need for periodic re-initialization.
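For comparison, here is a minimal Python sketch of a single STORM update implementing the recursion above; the stochastic gradient oracle stochastic_grad(w, xi), the sample xi, and the parameter values are hypothetical placeholders.

```python
# A minimal sketch of one STORM update. stochastic_grad(w, xi) is an assumed
# oracle returning an unbiased gradient estimate at w for the sample xi.

def storm_step(w, w_prev, v_prev, stochastic_grad, xi, beta=0.1, lr=0.01):
    g_t = stochastic_grad(w, xi)          # gradient at the current iterate
    g_prev = stochastic_grad(w_prev, xi)  # gradient at the previous iterate, same sample
    # Momentum term plus an error-correction term; no periodic full-gradient restart.
    v = (1 - beta) * v_prev + beta * g_t + (1 - beta) * (g_t - g_prev)
    return w - lr * v, v
```

Note that both gradients in the correction term use the same sample \(\xi_t\), which is what distinguishes STORM from a plain moving average of stochastic gradients.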
Huo et al. (2018) applied the SVRG technique to finite-sum compositional optimization, where both the inner and outer expectations are averages over finite sets. Hu et al. (2019) and Zhang and Xiao (2019) concurrently applied SARAH/SPIDER to compositional optimization in both the expectation form and the finite-sum form, deriving a complexity of \(O(1/\epsilon^3)\) for the expectation form and \(O(\sqrt{n}/\epsilon^2)\) for a finite-sum structure with \(n\) components. Qi et al. (2021b) applied the STORM estimator to SCO with a complexity of \(O(1/\epsilon^3)\), and Chen et al. (2021b) applied the STORM estimator only to the inner function estimation for SCO with a complexity of \(O(1/\epsilon^4)\).
The capped \(\ell_1\) norm for sparse regularization was introduced by Zhang (2013). The minimax concave penalty (MCP) was proposed by Zhang (2010), while the smoothly clipped absolute deviation (SCAD) regularizer was introduced by Fan and Li (2001). The proximal mappings for these non-convex regularizers were studied in (Gong et al., 2013). The non-convex piecewise affine regularization method for quantization was proposed by Ma and Xiao (2025). The theoretical analysis presented in Section 4.4 on non-convex optimization with non-convex regularizers follows the framework established by Xu et al. (2019), whose results were applied by Deleu and Bengio (2021) to train sparse deep neural networks.
Stochastic weakly-convex–concave min–max optimization with a complexity of \(O(1/\epsilon^6)\) was first studied by Rafique et al. (2018). When the problem is weakly-convex and strongly-concave, the complexity can be improved to \(O(1/\epsilon^4)\) using double-loop algorithms (Rafique et al., 2018), (Yan et al., 2020). The analysis of SGDA for smooth non-convex min–max optimization was first established by Lin et al. (2020), who derived a complexity of \(O(1/\epsilon^4)\) when using a large batch size on the order of \(O(1/\epsilon^2)\) for problems that are strongly concave in the dual variable. Without employing a large batch size, the complexity degrades to \(O(1/\epsilon^8)\), which also applies to problems lacking strong concavity. The analysis of the single-loop SMDA algorithm was provided by Guo et al. (2021a), which also established the convergence guarantees for stochastic bilevel optimization using the first approach introduced in Section 4.5.3. A similar convergence result was achieved in Qiu et al. (2020), which employed moving-average gradient estimators for both the primal and dual variables. Chen et al. (2021a) obtained a complexity of \(O(1/\epsilon^4)\) for smooth non-convex strongly-concave problems without relying on moving-average gradient estimators, under the stronger assumption that the Hessian/Jacobian matrix is Lipschitz continuous. An improved rate of \(O(1/\epsilon^3)\) for smooth non-convex strongly-concave problems was established by Huang et al. (2022) through the use of STORM estimators.
Bilevel optimization has a long and rich history (Bracken and McGill, 1973). The complexity analysis of bilevel optimization was initiated by Ghadimi and Wang (2018), who employed the Neumann series to approximate the inverse of the Hessian. Their proposed double-loop stochastic algorithm achieves a sample complexity of \(O(1/\epsilon^6)\) for solving the lower-level problem and \(O(1/\epsilon^4)\) for the upper-level problem. Subsequent research has led to improved complexity bounds: \(O(1/\epsilon^5)\) in (Hong et al., 2020), \(O(1/\epsilon^4)\) in (Ji et al., 2020), (Guo et al., 2021b), (Chen et al., 2021a), and further down to \(O(1/\epsilon^3)\) in (Yang et al., 2021), (Khanduri et al., 2021), (Guo et al., 2021b) under mean-square smoothness conditions. The analysis corresponding to Approach 1 in Section 4.5.3 can be found in (Qiu et al., 2022), while that of Approach 2 is provided in (Guo et al., 2021b).
Penalty-based methods for bilevel optimization date back to (Ye et al., 1997), with recent developments appearing in (Liu et al., 2021), (Liu et al., 2022), (Shen and Chen, 2023). Lemma 4.26 is due to Kwon et al. (2023), who established a sample complexity of \(O(1/\epsilon^7)\), comparable to Theorem 4.6, for a different double-loop algorithm. They also derived a complexity of \(O(1/\epsilon^6)\) for a similar algorithm in which the gradient estimators for both the lower- and upper-level functions are replaced with STORM estimators, under stronger mean-square smoothness assumptions. The double-loop large-batch method described at the end of Section 4.5.3 and its complexity of \(O(1/\epsilon^6)\) were developed by Chen et al. (2025).
The complexity of \(O(1/\epsilon^4)\) for stochastic compositional optimization is known to be optimal, as it matches the lower bound established for standard stochastic optimization (Arjevani et al., 2022). Moreover, under mean-square smoothness assumptions, a reduced complexity of \(O(1/\epsilon^3)\) is also proven to be optimal (Arjevani et al., 2022).
References
- Ermoliev, Y. M. (1976). Methods of Stochastic Programming. Moscow: Nauka.
- Ermoliev, Y., & Wets, R. J.-B. (Eds.). (1988). Numerical Techniques for Stochastic Optimization. Springer Series in Computational Mathematics, 10. Springer-Verlag.
- Fletcher, R., & Watson, G. A. (1980). First and second order conditions for a class of nondifferentiable optimization problems. Mathematical Programming, 18, 291–307.
- Fletcher, R. (1982). A model algorithm for composite nondifferentiable optimization problems. Mathematical Programming Study, 17, 67–76.
- Lewis, A., & Wright, S. (2009). A proximal method for composite minimization. Mathematical Programming, 158.
- Duchi, J. C., & Ruan, F. (2018). Stochastic Methods for Composite and Weakly Convex Optimization Problems. SIAM Journal on Optimization, 28(4), 3229–3259.
- Drusvyatskiy, D., Ioffe, A. D., & Lewis, A. S. (2021). Nonsmooth optimization using Taylor-like models: error bounds, convergence, and termination criteria. Mathematical Programming, 185, 357–383.
- Duchi, J. C., & Ruan, F. (2017). Stochastic methods for composite and weakly convex optimization problems. arXiv preprint arXiv:1703.08570.
- Drusvyatskiy, D., & Paquette, C. (2019). Efficiency of minimizing compositions of convex functions and smooth maps. Mathematical Programming, 178, 503–558.
- Wang, M., Fang, E. X., & Liu, H. (2017a). Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1–2), 419–449.
- Ghadimi, S., Ruszczynski, A., & Wang, M. (2020). A Single Timescale Stochastic Approximation Method for Nested Stochastic Optimization. SIAM Journal on Optimization, 30(1), 960–979.
- Wang, M., Liu, J., & Fang, E. X. (2017b). Accelerating Stochastic Composition Optimization. Journal of Machine Learning Research, 18(105), 1–23.
- Chen, T., Sun, Y., & Yin, W. (2021a). Closing the Gap: Tighter Analysis of Alternating Stochastic Gradient Methods for Bilevel Problems. In Advances in Neural Information Processing Systems.
- Qi, Q., Luo, Y., Xu, Z., Ji, S., & Yang, T. (2021a). Stochastic optimization of areas under precision-recall curves with provable convergence. In Proceedings of the 35th International Conference on Neural Information Processing Systems.
- Johnson, R., & Zhang, T. (2013). Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In Advances in Neural Information Processing Systems, 26.
- Mahdavi, M., & Jin, R. (2013). MixedGrad: An \(O(1/T)\) Convergence Rate Algorithm for Stochastic Smooth Optimization. In Advances in Neural Information Processing Systems, 26.
- Zhang, L., Mahdavi, M., & Jin, R. (2013). Linear Convergence with Condition Number Independent Access of Full Gradients. In Advances in Neural Information Processing Systems, 26.
- Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2), 83–112.
- Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). SPIDER: Near-Optimal Non-Convex Optimization via Stochastic Path-Integrated Differential Estimator. In Advances in Neural Information Processing Systems, 31.
- Nguyen, L. M., Liu, J., Scheinberg, K., & Takac, M. (2017). SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, 2613–2621.
- Cutkosky, A., & Orabona, F. (2019). Momentum-based variance reduction in non-convex SGD. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.
- Huo, Z., Gu, B., Liu, J., & Huang, H. (2018). Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence.
- Hu, W., Li, C. J., Lian, X., Liu, J., & Yuan, H. (2019). Efficient Smooth Non-Convex Stochastic Compositional Optimization via Stochastic Recursive Gradient Descent. In Advances in Neural Information Processing Systems, 32.
- Zhang, J., & Xiao, L. (2019). A Stochastic Composite Gradient Method with Incremental Variance Reduction. In Advances in Neural Information Processing Systems, 32.
- Qi, Q., Guo, Z., Xu, Y., Jin, R., & Yang, T. (2021b). An online method for a class of distributionally robust optimization with non-convex objectives. In Proceedings of the 35th International Conference on Neural Information Processing Systems.
- Chen, T., Sun, Y., & Yin, W. (2021b). Solving Stochastic Compositional Optimization is Nearly as Easy as Solving Stochastic Optimization. IEEE Transactions on Signal Processing, 69, 4937–4948.
- Zhang, T. (2013). Multi-stage convex relaxation for feature selection. Bernoulli, 19(5B), 2277–2293.
- Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2), 894–942.
- Fan, J., & Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456), 1348–1360.
- Gong, P., Zhang, C., Lu, Z., Huang, J., & Ye, J. (2013). A General Iterative Shrinkage and Thresholding Algorithm for Non-convex Regularized Optimization Problems. In Proceedings of the 30th International Conference on Machine Learning, 37–45.
- Ma, J., & Xiao, L. (2025). Quantization through Piecewise-Affine Regularization: Optimization and Statistical Guarantees. arXiv preprint arXiv:2508.11112.
- Xu, Y., Jin, R., & Yang, T. (2019). Non-asymptotic Analysis of Stochastic Methods for Non-Smooth Non-Convex Regularized Problems. In Advances in Neural Information Processing Systems, 32.
- Deleu, T., & Bengio, Y. (2021). Structured Sparsity Inducing Adaptive Optimizers for Deep Learning. arXiv preprint arXiv:2102.03869.
- Rafique, H., Liu, M., Lin, Q., & Yang, T. (2018). Weakly-convex–concave min–max optimization: provable algorithms and applications in machine learning. Optimization Methods and Software, 37, 1087–1121.
- Yan, Y., Xu, Y., Lin, Q., Liu, W., & Yang, T. (2020). Optimal Epoch Stochastic Gradient Descent Ascent Methods for Min-Max Optimization. arXiv preprint.
- Lin, T., Jin, C., & Jordan, M. (2020). On Gradient Descent Ascent for Nonconvex-Concave Minimax Problems. In Proceedings of the 37th International Conference on Machine Learning, 6083–6093.
- Guo, Z., Xu, Y., Yin, W., Jin, R., & Yang, T. (2021a). Unified convergence analysis for adaptive optimization with moving average estimator. Machine Learning, 114(4).
- Qiu, S., Yang, Z., Wei, X., Ye, J., & Wang, Z. (2020). Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning. arXiv preprint arXiv:2008.10103.
- Huang, F., Gao, S., Pei, J., & Huang, H. (2022). Accelerated zeroth-order and first-order momentum methods from mini to minimax optimization. Journal of Machine Learning Research, 23(1), Article 36.
- Bracken, J., & McGill, J. T. (1973). Mathematical programs with optimization problems in the constraints. Operations Research, 21, 37–44.
- Ghadimi, S., & Wang, M. (2018). Approximation Methods for Bilevel Programming. arXiv preprint arXiv:1802.02246.
- Hong, M., Wai, H.-T., Wang, Z., & Yang, Z. (2020). A Two-Timescale Stochastic Algorithm Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic. SIAM Journal on Optimization, 33(1), 147–180.
- Ji, K., Yang, J., & Liang, Y. (2020). Bilevel Optimization: Convergence Analysis and Enhanced Design. In International Conference on Machine Learning.
- Yang, J., Ji, K., & Liang, Y. (2021). Provably faster algorithms for bilevel optimization. In Proceedings of the 35th International Conference on Neural Information Processing Systems.
- Khanduri, P., Zeng, S., Hong, M., Wai, H.-T., Wang, Z., & Yang, Z. (2021). A near-optimal algorithm for stochastic bilevel optimization via double-momentum. In Proceedings of the 35th International Conference on Neural Information Processing Systems.
- Guo, Z., Hu, Q., Zhang, L., & Yang, T. (2021b). Randomized Stochastic Variance-Reduced Methods for Stochastic Bilevel Optimization. CoRR, abs/2105.02266.
- Qiu, Z.-H., Hu, Q., Zhong, Y., Zhang, L., & Yang, T. (2022). Large-scale Stochastic Optimization of NDCG Surrogates for Deep Learning with Provable Convergence. In International Conference on Machine Learning, 18122–18152.
- Ye, J., Zhu, D., & Zhu, Q. (1997). Exact penalization and necessary optimality conditions for generalized bilevel programming problems. SIAM Journal on Optimization, 7(2), 481–507.
- Liu, R., Liu, X., Zeng, S., Zhang, J., & Zhang, Y. (2021). Value-Function-based Sequential Minimization for Bi-level Optimization. CoRR, abs/2110.04974.
- Liu, B., Ye, M., Wright, S., Stone, P., & Liu, Q. (2022). BOME! bilevel optimization made easy: a simple first-order approach. In Proceedings of the 36th International Conference on Neural Information Processing Systems.
- Shen, H., & Chen, T. (2023). On Penalty-based Bilevel Gradient Descent Method. In Proceedings of the 40th International Conference on Machine Learning, 30992–31015.
- Kwon, J., Kwon, D., Wright, S., & Nowak, R. (2023). A fully first-order method for stochastic bilevel optimization. In Proceedings of the 40th International Conference on Machine Learning.
- Chen, L., Ma, Y., & Zhang, J. (2025). Near-Optimal Nonconvex-Strongly-Convex Bilevel Optimization with Fully First-Order Oracles. Journal of Machine Learning Research, 26(109), 1–56.
- Arjevani, Y., Carmon, Y., Duchi, J. C., Foster, D. J., Srebro, N., & Woodworth, B. (2022). Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1–2), 165–214.
- Hu, Q., Qi, Q., Lu, Z., & Yang, T. (2024). Single-Loop Stochastic Algorithms for Difference of Max-Structured Weakly Convex Functions. In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Lan, G. (2020). First-Order and Stochastic Optimization Methods for Machine Learning. Springer International Publishing.