Section 3.8 History and Notes
Stochastic Approximation and Mathematical Optimization
Stochastic approximation has a long history, dating back to Robbins and Monro, 1951, who solved a root-finding problem \(f(x)=\alpha\) using the iterative method \(x_{t+1}=x_t-a_t(y_t-\alpha)\), where \(y_t\) is a random variable such that \(\mathbb{E}[y_t]=f(x_t)\). They studied asymptotic convergence, establishing \(\lim_{t\to\infty}\mathbb{E}[(x_t-\theta)^2]=0\) under some conditions, where \(\theta\) is the solution to the root-finding problem. It is notable that Herbert Robbins is regarded as one of the most influential mathematicians of the latter half of the 20th century, renowned for his seminal contributions to probability, statistics, algebra, and graph theory.
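As a concrete illustration, the following is a minimal sketch of the Robbins-Monro iteration in Python. The particular target function \(f(x)=2x+1\), the level \(\alpha=5\), the Gaussian noise model, and the step sizes \(a_t=1/t\) are all illustrative assumptions, not part of the original paper.

```python
import random

def robbins_monro(query, alpha, x0, steps, seed=0):
    """Iterate x_{t+1} = x_t - a_t * (y_t - alpha) with a_t = 1/t,
    where y_t = query(x_t) is a noisy observation with E[y_t] = f(x_t)."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        y = query(x, rng)              # stochastic observation of f(x_t)
        x -= (1.0 / t) * (y - alpha)   # Robbins-Monro update
    return x

# illustrative example: f(x) = 2x + 1, solve f(x) = 5, so theta = 2
f = lambda x: 2.0 * x + 1.0
query = lambda x, rng: f(x) + rng.gauss(0.0, 0.5)  # noisy evaluation of f
theta_hat = robbins_monro(query, alpha=5.0, x0=0.0, steps=10000)
```

With a decaying step size \(a_t=1/t\), the iterate settles near the root \(\theta=2\) despite the noise in each observation.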
Inspired by Robbins and Monro, 1951, Kiefer and Wolfowitz, 1952 considered stochastic maximization of a regression function using a stochastic finite-difference estimator of the gradient. Later, Chung, 1954 established convergence bounds for the Robbins-Monro method under some conditions. Since then, the convergence of SGD has been extensively studied. Polyak and Juditsky, 1992 analyzed the convergence of SGD with simple averaging for stochastic optimization, which is sometimes referred to as Polyak-Juditsky averaging or Polyak averaging. This work assumed smoothness and strong convexity of the objective function and established a convergence rate of \(O(1/T)\).
(Nemirovski and Yudin, 1978) is probably the first work that analyzes the non-asymptotic convergence of SGDA for general convex-concave min-max optimization without smoothness and strong convexity assumptions. Their paper introduced weighted averaging (weighted by the step size at each iteration) and established a convergence rate of \(O(1/\sqrt{T})\). The optimal rate \(O(1/T)\) for strongly-convex strongly-concave min-max problems was recently proved by Yan et al., 2020.
The mirror descent method originates from (Nemirovsky and Yudin, 1983), which is also the work often cited for the \(\Omega(1/\sqrt{T})\) lower bound for general convex problems. A more general form of SMD and its extension to convex-concave min-max problems using a Bregman divergence was later considered in (Nemirovski et al., 2009).
The non-asymptotic analysis of SGD for non-convex optimization was initiated by Ghadimi and Lan, 2013. The non-asymptotic analysis of SGD for weakly convex optimization was developed by Davis and Drusvyatskiy, 2019.
The proximal method dates back to the proximal point method proposed by Martinet, 1972 and further developed in (Rockafellar, 1976). Lions and Mercier, 1979 proposed a splitting method for finding a zero of the sum of two maximal monotone operators. Forward-backward splitting was first proposed by Passty, 1979 in the same context of finding a zero of the sum of monotone operators. Its special instance for minimization problems, known as the projected gradient method, was first studied by Goldstein, 1964.
Coordinate descent has a long history in optimization, with its earliest roots traceable to the Gauss-Seidel iterations for solving linear systems in the 19th century. The method was later formalized and discussed in early optimization literature, including (Warga, 1963), (Ortega and Rheinboldt, 1970), (Luenberger, 1973). Rigorous analysis of convergence properties was developed in a sequence of influential works by Paul Tseng and others, including (Luo and Tseng, 1992), (Tseng, 1990), (Tseng and Bertsekas, 1987), (Tseng, 2001). Recent developments of block coordinate descent include accelerated coordinate descent (Nesterov, 2012) and stochastic block coordinate descent (Dang and Lan, 2015).
The extragradient method was first proposed by Korpelevich, 1976. The mirror prox method and its \(O(1/T)\) convergence rate were proposed and established by Nemirovski, 2004. The stochastic mirror prox method was analyzed in (Juditsky et al., 2011).
Optimization in machine learning
Frank Rosenblatt’s pioneering work in the late 1950s introduced a learning rule for updating the Perceptron model (a single-layer neural network for binary classification) (Rosenblatt, 1962), a method that shares a conceptual foundation with modern stochastic gradient descent (SGD). The earliest instance of SGD for machine learning is perhaps the Widrow-Hoff algorithm (Widrow and Hoff, 1960) (also known as the least mean square algorithm), which was used to train ADALINE, a single-layer neural network that produces a continuous output. Amari, 1967 is the first work that applied SGD to optimize a neural network for binary and multi-class classification.
Starting in the late 1980s, the online prediction problem attracted increasing attention in machine learning, and its developments have many parallels to stochastic optimization. Littlestone, 1988 proposed the Winnow algorithm for learning Boolean functions. It was shown to improve on the earlier Perceptron learning algorithm in the sense that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. Later, it was generalized to the weighted majority algorithm for learning with expert advice (Littlestone and Warmuth, 1994) and to the exponentiated gradient method (Kivinen and Warmuth, 1997) for online learning with a decision variable from a simplex, which is a special case of SMD using the KL-divergence. This line of work also influenced the development of AdaBoost (Freund and Schapire, 1997).
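To make the connection to SMD with the KL-divergence concrete, here is a minimal sketch of one exponentiated gradient update; the gradient vector and step size below are made-up values for illustration. Each coordinate is multiplied by \(\exp(-\eta g_i)\) and the result is renormalized, which keeps the iterate on the probability simplex.

```python
import math

def eg_update(w, g, eta):
    # exponentiated gradient: multiplicative update followed by
    # normalization back onto the probability simplex
    u = [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]
    s = sum(u)
    return [ui / s for ui in u]

w = [0.25, 0.25, 0.25, 0.25]                 # uniform starting point
w = eg_update(w, g=[1.0, 0.0, 0.0, 0.0], eta=1.0)
# the coordinate with the larger gradient loses weight, and the
# updated vector still sums to one
```

Because the update is multiplicative, coordinates that start at zero stay at zero, and no explicit projection step is needed.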
During the first decade of the 21st century, online convex optimization emerged as a central topic in machine learning. A key focus was regret bound analysis, which can be translated into convergence guarantees for stochastic optimization via the online-to-batch conversion technique (Dekel and Singer, 2005). Regret bounds for online gradient descent were established for both convex loss functions (Zinkevich, 2003) and strongly convex loss functions (Hazan et al., 2007). The multi-epoch scheme for achieving an optimal rate of \(O(1/T)\) for stochastic strongly convex optimization was established independently and concurrently in (Iouditski and Nesterov, 2010), (Hazan and Kale, 2011), (Ghadimi and Lan, 2012). Later, SGD was shown to achieve the optimal rate for stochastic non-smooth strongly convex optimization using tail averaging (Rakhlin et al., 2012) or weighted averaging with increasing weights (Lacoste-Julien et al., 2012). The last-iterate convergence of SGD for non-smooth convex optimization was established in (Shamir and Zhang, 2013).
The use of the \(l_1\) norm as a regularizer in the Lasso method was pioneered by Tibshirani, 1996. The elastic net regularizer was later proposed in (Zou and Hastie, 2005), while the group lasso regularizer was introduced in (Yuan and Lin, 2006). More recently, piecewise-affine regularized quantization (PARQ) was proposed in (Jin et al., 2025). Nuclear norm minimization for promoting a low-rank matrix was first considered in (Fazel et al., 2001).
Pioneering applications of SGD to regularized empirical risk minimization in machine learning, including support vector machines, appear in (Zhang, 2004a), (Shalev-Shwartz et al., 2007). The application of the proximal gradient method to the \(l_1\)-norm regularized problem was initiated by Daubechies et al., 2004, yielding an algorithm known as iterative soft-thresholding. The application of SPGD to machine learning with various regularization terms was studied in (Duchi and Singer, 2009). The application of SGD to optimizing truncated losses and its theory was studied in (Xu et al., 2019).
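As a concrete sketch of iterative soft-thresholding, consider the separable toy problem \(\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}-\mathbf{b}\|_2^2+\lambda\|\mathbf{w}\|_1\); the objective, the vector \(\mathbf{b}\), and all constants below are illustrative assumptions, not tied to any specific paper above.

```python
def soft_threshold(x, tau):
    # proximal operator of tau * |x| applied to a scalar
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def ista(grad, w, eta, lam, steps):
    # proximal gradient: a gradient step followed by the l1 prox
    for _ in range(steps):
        w = [soft_threshold(wi - eta * gi, eta * lam)
             for wi, gi in zip(w, grad(w))]
    return w

# toy problem: min_w 0.5*||w - b||^2 + lam*||w||_1, whose solution is
# soft_threshold applied coordinate-wise to b
b = [2.0, 0.3, -1.0]
w = ista(lambda w: [wi - bi for wi, bi in zip(w, b)],
         w=[0.0, 0.0, 0.0], eta=1.0, lam=0.5, steps=50)
```

Small coordinates of \(\mathbf{b}\) (here the middle one) are thresholded exactly to zero, which is why this family of methods is used to promote sparsity.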
The most famous application of coordinate descent methods in machine learning is as a solver for support vector machines (Chang et al., 2008), (Hsieh et al., 2008).
AdaGrad, proposed by (Duchi et al., 2011), was a breakthrough in stochastic optimization for machine learning. It later inspired several popular stochastic algorithms for deep learning, including RMSprop (Hinton, 2018) and Adam (Kingma and Ba, 2015), which will be discussed in Chapter 6.
The first variant of the stochastic optimistic mirror prox method appeared in the author’s award-winning work (Chiang et al., 2012), inspired by Nemirovski’s mirror prox method. It was introduced to address a long-standing challenge in online convex optimization: achieving variational regret bounds. This line of research later inspired the work of (Rakhlin and Sridharan, 2013), which formally coined the term optimistic mirror descent. More recently, the stochastic optimistic mirror prox method has been adopted for solving min-max problems in machine learning, including applications such as training generative adversarial networks (Daskalakis et al., 2018).
Discussion.
The most important factor affecting the practical performance of SGD and other stochastic algorithms is the learning rate scheme \(\eta_t\). In this chapter, we focus on a fixed learning rate \(\eta_t=\eta\). However, it is usually not the best choice in practice. We can also develop theoretical analyses of these algorithms using decreasing learning rates, e.g., \(\eta_t\propto 1/\sqrt{t}\) or \(\eta_t\propto 1/t\). However, these theoretical learning rate schemes are usually also not the best in practice. A practical alternative is the step decay strategy in Theorem 3.7, which gives a convergence bound with only logarithmic dependence on the initial distance \(\|\mathbf{w}_1-\mathbf{w}_*\|_2\). This strategy also works for general stochastic convex optimization under generic error bound conditions of the form \(\|\mathbf{w}-\mathbf{w}_*\|_2\le c(g(\mathbf{w})-g(\mathbf{w}_*))^{\theta}\) with \(\theta\in(0,1]\) (Xu et al., 2017). Another issue with theoretical learning rates is that the values optimizing the convergence bound may depend on unknown problem parameters, e.g., \(\mathbf{w}_*\), the smoothness constant, or the strong convexity modulus. This issue has triggered a line of research known as parameter-free algorithms (Orabona, 2019), (Lan et al., 2023).
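The schedules above can be sketched as simple functions of the iteration counter. The step decay shown here is only a schematic version of the strategy in Theorem 3.7, and all constants (initial rate, decay factor, stage length) are illustrative choices.

```python
import math

def constant(eta0):
    # fixed learning rate eta_t = eta0
    return lambda t: eta0

def inv_sqrt(eta0):
    # decreasing learning rate eta_t proportional to 1/sqrt(t)
    return lambda t: eta0 / math.sqrt(t + 1)

def step_decay(eta0, drop=0.5, stage=100):
    # halve the learning rate after every fixed-length stage
    return lambda t: eta0 * drop ** (t // stage)

eta = step_decay(0.5)
rates = [eta(0), eta(99), eta(100), eta(250)]  # [0.5, 0.5, 0.25, 0.125]
```

In practice the stage length and decay factor are tuned, whereas in the theory the stage lengths are chosen so that each stage roughly halves the optimality gap.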
While this chapter focuses on classical stochastic methods that not only have important applications in machine learning but also significantly influence the approaches presented in later chapters, it does not cover several important algorithms, most notably accelerated gradient methods and their stochastic variants. These methods achieve optimal convergence rates for smooth convex objectives when the variance of stochastic gradients vanishes (Lan, 2012). For a comprehensive treatment of accelerated gradient methods, we refer to the textbook by Nesterov, 2004, and for stochastic accelerated algorithms, we recommend Lan, 2020. Variants of these methods will be introduced in Chapter 6.
Finally, I recommend the textbook (Recht and Wright, 2025), which provides a comprehensive treatment of convex optimization algorithms tailored for data analysis.
References
- Robbins, H., and Monro, S. (1951). A stochastic
approximation method. Annals of Mathematical Statistics, 22,
400–407.
- Kiefer, J., and Wolfowitz, J. (1952). Stochastic
estimation of the maximum of a regression function. The Annals of
Mathematical Statistics, 23(3), 462–466.
- Chung, K. L. (1954). On a stochastic approximation
method. The Annals of Mathematical Statistics, 25(3),
463–483.
- Polyak, B. T., and Juditsky, A. B. (1992).
Acceleration of stochastic approximation by averaging. SIAM Journal
on Control and Optimization, 30(4), 838–855.
- Nemirovski, A., and Yudin, D. (1978). On Cezari’s
convergence of the steepest descent method for approximating saddle
point of convex-concave functions. Soviet Mathematics Doklady,
19, 341–362.
- Yan, Y., Xu, Y., Lin, Q., Liu, W., and Yang, T.
(2020). Optimal epoch stochastic gradient descent ascent methods for
min-max optimization. In Advances in Neural Information Processing
Systems (NeurIPS 2020).
- Nemirovsky, A. S., and Yudin, D. B. (1983).
Problem Complexity and Method Efficiency in Optimization. John
Wiley and Sons.
- Nemirovski, A., Juditsky, A., Lan, G., and Shapiro,
A. (2009). Robust stochastic approximation approach to stochastic
programming. SIAM Journal on Optimization, 19(4),
1574–1609.
- Ghadimi, S., and Lan, G. (2013). Stochastic first-
and zeroth-order methods for nonconvex stochastic programming. SIAM
Journal on Optimization, 23(4), 2341–2368.
- Davis, D., and Drusvyatskiy, D. (2019). Stochastic
model-based minimization of weakly convex functions. SIAM Journal on
Optimization, 29(1), 207–239.
- Martinet, B. (1972). Determination approchee d’un
point fixe d’une application pseudo-contractante. Cas de l’application
prox. Comptes Rendus de l’Academie des Sciences, Paris, Serie
A, 274, 163–165.
- Rockafellar, R. T. (1976). Monotone operators and
the proximal point algorithm. SIAM Journal on Control and
Optimization, 14(5), 877–898.
- Lions, P. L., and Mercier, B. (1979). Splitting
algorithms for the sum of two nonlinear operators. SIAM Journal on
Numerical Analysis, 16(6), 964–979.
- Passty, G. B. (1979). Ergodic convergence to a zero
of the sum of monotone operators in Hilbert space. Journal of
Mathematical Analysis and Applications, 72, 383–390.
- Goldstein, A. A. (1964). Convex programming in
Hilbert space. Bulletin of the American Mathematical Society,
70(5), 709–710.
- Warga, J. (1963). Minimizing certain convex
functions. Journal of the Society for Industrial and Applied
Mathematics, 11(3), 588–593.
- Ortega, J. M., and Rheinboldt, W. C. (1970).
Iterative Solution of Nonlinear Equations in Several Variables.
Academic Press.
- Luenberger, D. G. (1973). Introduction to
Linear and Nonlinear Programming. Addison-Wesley Publishing
Company.
- Luo, Z. Q., and Tseng, P. (1992). On the
convergence of the coordinate descent method for convex differentiable
minimization. Journal of Optimization Theory and Applications,
72(1), 7–35.
- Tseng, P. (1990). Dual ascent methods for problems
with strictly convex costs and linear constraints: A unified approach.
SIAM Journal on Control and Optimization, 28(1), 214–242.
- Tseng, P., and Bertsekas, D. P. (1987). Relaxation
methods for problems with strictly convex separable costs and linear
constraints. Mathematical Programming, 38(3), 303–321.
- Tseng, P. (2001). Convergence of a block
coordinate descent method for nondifferentiable minimization.
Journal of Optimization Theory and Applications, 109(3),
475–494.
- Nesterov, Y. (2012). Efficiency of coordinate
descent methods on huge-scale optimization problems. SIAM Journal on
Optimization, 22(2), 341–362.
- Dang, C. D., and Lan, G. (2015). Stochastic block
mirror descent methods for nonsmooth and stochastic optimization.
SIAM Journal on Optimization, 25(2), 856–882.
- Korpelevich, G. M. (1976). The extragradient
method for finding saddle points and other problems. Matecon, 12,
747–756.
- Nemirovski, A. (2004). Prox-method with rate of
convergence O(1/t) for variational inequalities with Lipschitz
continuous monotone operators and smooth convex-concave saddle point
problems. SIAM Journal on Optimization, 15(1), 229–251.
- Juditsky, A., Nemirovski, A., and Tauvel, C.
(2011). Solving variational inequalities with stochastic mirror-prox
algorithm. Stochastic Systems, 1(1), 17–58.
- Rosenblatt, F. (1962). Principles of
Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
Spartan Books.
- Widrow, B., and Hoff, M. E. (1960). Adaptive
switching circuits. IRE WESCON Convention Record, 4,
96–104.
- Amari, S. (1967). A theory of adaptive pattern
classifiers. IEEE Transactions on Electronic Computers,
EC-16(3), 299–307.
- Littlestone, N. (1988). Learning quickly when
irrelevant attributes abound: A new linear-threshold algorithm.
Machine Learning, 2(4), 285–318.
- Littlestone, N., and Warmuth, M. K. (1994). The
weighted majority algorithm. Information and Computation,
108(2), 212–261.
- Kivinen, J., and Warmuth, M. K. (1997).
Exponentiated gradient versus gradient descent for linear predictors.
Information and Computation, 132(1), 1–63.
- Freund, Y., and Schapire, R. E. (1997). A
decision-theoretic generalization of on-line learning and an application
to boosting. Journal of Computer and System Sciences, 55(1),
119–139.
- Dekel, O., and Singer, Y. (2005). Data-driven
online to batch conversions. In Advances in Neural Information
Processing Systems, 18.
- Zinkevich, M. (2003). Online convex programming
and generalized infinitesimal gradient ascent. In Proceedings of the
Twentieth International Conference on Machine Learning (ICML 2003),
928–935.
- Hazan, E., Agarwal, A., and Kale, S. (2007).
Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69(2-3), 169–192.
- Iouditski, A., and Nesterov, Y. (2010).
Primal-dual subgradient methods for minimizing uniformly convex
functions. arXiv preprint.
- Hazan, E., and Kale, S. (2011). Beyond the regret
minimization barrier: an optimal algorithm for stochastic
strongly-convex optimization. In Proceedings of the 24th Annual
Conference on Learning Theory (COLT 2011), 421–436.
- Ghadimi, S., and Lan, G. (2012). Optimal
stochastic approximation algorithms for strongly convex stochastic
composite optimization I: A generic algorithmic framework. SIAM
Journal on Optimization, 22(4), 1469–1492.
- Rakhlin, A., Shamir, O., and Sridharan, K. (2012).
Making gradient descent optimal for strongly convex stochastic
optimization. In Proceedings of the 29th International Conference on
Machine Learning (ICML 2012), 1571–1578.
- Lacoste-Julien, S., Schmidt, M., and Bach, F. R.
(2012). A simpler approach to obtaining an O(1/t) convergence rate for
the projected stochastic subgradient method. CoRR,
abs/1212.2002.
- Shamir, O., and Zhang, T. (2013). Stochastic
gradient descent for non-smooth optimization: convergence results and
optimal averaging schemes. In Proceedings of the 30th International
Conference on Machine Learning (ICML 2013), 71–79.
- Tibshirani, R. (1996). Regression shrinkage and
selection via the lasso. Journal of the Royal Statistical Society
(Series B), 58, 267–288.
- Zou, H., and Hastie, T. (2005). Regularization and
variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(2),
301–320.
- Yuan, M., and Lin, Y. (2006). Model selection and
estimation in regression with grouped variables. Journal of the
Royal Statistical Society: Series B (Statistical Methodology),
68(1), 49–67.
- Jin, L., Ma, J., Liu, Z., Gromov, A., Defazio, A.,
and Xiao, L. (2025). PARQ: Piecewise-affine regularized quantization. In
Forty-second International Conference on Machine
Learning.
- Fazel, M., Hindi, H., and Boyd, S. P. (2001). A
rank minimization heuristic with application to minimum order system
approximation. In 2001 IEEE International Conference on Control
Applications, 1347–1352.
- Zhang, T. (2004a). Solving large scale linear
prediction problems using stochastic gradient descent algorithms. In
Proceedings of the Twenty-First International Conference on Machine
Learning (ICML 2004), 919–926.
- Shalev-Shwartz, S., Singer, Y., and Srebro, N.
(2007). Pegasos: Primal estimated sub-gradient solver for SVM. In
Proceedings of the 24th International Conference on Machine
Learning, 807–814.
- Daubechies, I., Defrise, M., and De Mol, C.
(2004). An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Communications on Pure and Applied
Mathematics, 57(11), 1413–1457.
- Duchi, J., and Singer, Y. (2009). Efficient online
and batch learning using forward backward splitting. Journal of
Machine Learning Research, 10, 2899–2934.
- Xu, Y., Zhu, S., Yang, S., Zhang, C., Jin, R., and
Yang, T. (2019). Learning with non-convex truncated losses by SGD. In
Proceedings of UAI 2019, PMLR 115, 701–711.
- Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. (2008).
Coordinate descent method for large-scale L2-loss linear support vector
machines. Journal of Machine Learning Research, 9(45),
1369–1398.
- Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi,
S. S., and Sundararajan, S. (2008). A dual coordinate descent method for
large-scale linear SVM. In Proceedings of the 25th International
Conference on Machine Learning (ICML 2008), 408–415.
- Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 12,
2121–2159.
- Hinton, G. (2018). Neural networks for machine
learning, lecture 6. Coursera online course.
- Kingma, D. P., and Ba, J. (2015). Adam: A method
for stochastic optimization. In ICLR 2015.
- Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M.,
Lu, C.-J., Jin, R., and Zhu, S. (2012). Online optimization with gradual
variations. In Proceedings of COLT 2012, PMLR 23,
6.1–6.20.
- Rakhlin, A., and Sridharan, K. (2013).
Optimization, learning, and games with predictable sequences. In
Advances in Neural Information Processing Systems (NeurIPS
2013), 3066–3074.
- Daskalakis, C., Ilyas, A., Syrgkanis, V., and
Zeng, H. (2018). Training GANs with optimism. In International
Conference on Learning Representations (ICLR 2018).
- Xu, Y., Lin, Q., and Yang, T. (2017). Stochastic
convex optimization: Faster local growth implies faster global
convergence. In Proceedings of the 34th International Conference on
Machine Learning (ICML 2017), PMLR 70, 3821–3830.
- Orabona, F. (2019). A modern introduction to
online learning. CoRR, abs/1912.13213.
- Lan, G., Ouyang, Y., and Zhang, Z. (2023). Optimal
and parameter-free gradient minimization methods for convex and
nonconvex optimization. arXiv preprint.
- Lan, G. (2012). An optimal method for stochastic
composite optimization. Mathematical Programming, 133(1-2),
365–397.
- Nesterov, Y. (2004). Introductory Lectures on
Convex Optimization: A Basic Course. Springer.
- Lan, G. (2020). First-order and Stochastic
Optimization Methods for Machine Learning. Springer International
Publishing.
- Recht, B., and Wright, S. J. (2025). Optimization for Modern Data Analysis. Cambridge University Press.