Section 3.8 History and Notes
Stochastic Approximation and Mathematical Optimization
Stochastic approximation has a long history, dating back to Robbins and Monro, 1951, who solved a root-finding problem \(f(x)=\alpha\) using the iterative method \(x_{t+1}=x_t-a_t(y_t-\alpha)\), where \(y_t\) is a random variable such that \(\mathbb{E}[y_t]=f(x_t)\). They studied asymptotic convergence, establishing \(\lim_{t\to\infty}\mathbb{E}[(x_t-\theta)^2]=0\) under some conditions, where \(\theta\) is the solution to the root-finding problem. It is notable that Herbert Robbins is regarded as one of the most influential mathematicians of the latter half of the 20th century, renowned for his seminal contributions to probability, statistics, algebra, and graph theory.
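As a concrete illustration, the following is a minimal sketch of the Robbins-Monro iteration in Python. The particular target function \(f(x)=2x+1\), the level \(\alpha=5\), the Gaussian noise model, and the step sizes \(a_t=1/t\) are all illustrative assumptions, not part of the original paper.

```python
import random

def robbins_monro(query, alpha, x0, steps, seed=0):
    """Iterate x_{t+1} = x_t - a_t * (y_t - alpha) with a_t = 1/t,
    where y_t = query(x_t) is a noisy observation with E[y_t] = f(x_t)."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        y = query(x, rng)              # stochastic observation of f(x_t)
        x -= (1.0 / t) * (y - alpha)   # Robbins-Monro update
    return x

# illustrative example: f(x) = 2x + 1, solve f(x) = 5, so theta = 2
f = lambda x: 2.0 * x + 1.0
query = lambda x, rng: f(x) + rng.gauss(0.0, 0.5)  # noisy evaluation of f
theta_hat = robbins_monro(query, alpha=5.0, x0=0.0, steps=10000)
```

With a decaying step size \(a_t=1/t\), the iterate settles near the root \(\theta=2\) despite the noise in each observation.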
Inspired by Robbins and Monro, 1951, Kiefer and Wolfowitz, 1952 considered stochastic maximization of a regression function using a stochastic finite-difference estimator of the gradient. Later, Chung, 1954 established convergence bounds for the Robbins-Monro method under some conditions. Since then, the convergence of SGD has been extensively studied. Polyak and Juditsky, 1992 analyzed the convergence of SGD with simple averaging for stochastic optimization, which is sometimes referred to as Polyak-Juditsky averaging or Polyak averaging. This work assumed smoothness and strong convexity of the objective function and established a convergence rate of \(O(1/T)\).
(Nemirovski and Yudin, 1978) is probably the first work that analyzes the non-asymptotic convergence of SGDA for general convex-concave min-max optimization without smoothness and strong convexity assumptions. Their paper introduced weighted averaging (weighted by the step size at each iteration) and established a convergence rate of \(O(1/\sqrt{T})\). The optimal rate \(O(1/T)\) for strongly-convex strongly-concave min-max problems was recently proved by Yan et al., 2020.
The mirror descent method originates from (Nemirovsky and Yudin, 1983), which is also the work often cited for the \(\Omega(1/\sqrt{T})\) lower bound for general convex problems. A more general form of SMD and its extension to convex-concave min-max problems using a Bregman divergence was later considered in (Nemirovski et al., 2009).
The non-asymptotic analysis of SGD for non-convex optimization was initiated by Ghadimi and Lan, 2013. The non-asymptotic analysis of SGD for weakly convex optimization was developed by Davis and Drusvyatskiy, 2019.
The proximal method dates back to the proximal point method proposed by Martinet, 1972 and further developed in (Rockafellar, 1976). Lions and Mercier, 1979 proposed a splitting method for finding a zero of the sum of two maximal monotone operators. Forward-backward splitting was first proposed by Passty, 1979 in the same context of finding a zero of the sum of monotone operators. Its special instance for minimization problems, known as the projected gradient method, was first studied by Goldstein, 1964.
Coordinate descent has a long history in optimization, with its earliest roots traceable to the Gauss-Seidel iterations for solving linear systems in the 19th century. The method was later formalized and discussed in early optimization literature, including (Warga, 1963), (Ortega and Rheinboldt, 1970), (Luenberger, 1973). Rigorous analysis of convergence properties was developed in a sequence of influential works by Paul Tseng and others, including (Luo and Tseng, 1992), (Tseng, 1990), (Tseng and Bertsekas, 1987), (Tseng, 2001). Recent developments of block coordinate descent include accelerated coordinate descent (Nesterov, 2012) and stochastic block coordinate descent (Dang and Lan, 2015).
The extragradient method was first proposed by Korpelevich, 1976. The mirror prox method and its \(O(1/T)\) convergence rate were proposed and established by Nemirovski, 2004. The stochastic mirror prox method was analyzed in (Juditsky et al., 2011).
Optimization in machine learning
Frank Rosenblatt’s pioneering work in the late 1950s introduced a learning rule for updating the Perceptron model (a single-layer neural network for binary classification) (Rosenblatt, 1962), a method that shares a conceptual foundation with modern stochastic gradient descent (SGD). The earliest instance of SGD for machine learning is perhaps the Widrow-Hoff algorithm (Widrow and Hoff, 1960) (also known as the least mean square algorithm), which was used to train ADALINE, a single-layer neural network that produces a continuous output. Amari, 1967 is the first work that applied SGD to optimize a neural network for binary and multi-class classification.
Starting in the late 1980s, the online prediction problem attracted increasing attention in machine learning, and its developments have many parallels to stochastic optimization. Littlestone, 1988 proposed the Winnow algorithm for learning Boolean functions. It was shown to improve on the earlier Perceptron learning algorithm in the sense that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. Later, it was generalized to the weighted majority algorithm for learning with expert advice (Littlestone and Warmuth, 1994) and to the exponentiated gradient method (Kivinen and Warmuth, 1997) for online learning with a decision variable from a simplex, which is a special case of SMD using the KL-divergence. This line of work also influenced the development of AdaBoost (Freund and Schapire, 1997).
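To make the connection to SMD with the KL-divergence concrete, here is a minimal sketch of one exponentiated gradient update; the gradient vector and step size below are made-up values for illustration. Each coordinate is multiplied by \(\exp(-\eta g_i)\) and the result is renormalized, which keeps the iterate on the probability simplex.

```python
import math

def eg_update(w, g, eta):
    # exponentiated gradient: multiplicative update followed by
    # normalization back onto the probability simplex
    u = [wi * math.exp(-eta * gi) for wi, gi in zip(w, g)]
    s = sum(u)
    return [ui / s for ui in u]

w = [0.25, 0.25, 0.25, 0.25]                 # uniform starting point
w = eg_update(w, g=[1.0, 0.0, 0.0, 0.0], eta=1.0)
# the coordinate with the larger gradient loses weight, and the
# updated vector still sums to one
```

Because the update is multiplicative, coordinates that start at zero stay at zero, and no explicit projection step is needed.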
During the first decade of the 21st century, online convex optimization emerged as a central topic in machine learning. A key focus was regret bound analysis, which can be translated into convergence guarantees for stochastic optimization via the online-to-batch conversion technique (Dekel and Singer, 2005). Regret bounds for online gradient descent were established for both convex loss functions (Zinkevich, 2003) and strongly convex loss functions (Hazan et al., 2007). The multi-epoch scheme for achieving an optimal rate of \(O(1/T)\) for stochastic strongly convex optimization was established independently and concurrently in (Iouditski and Nesterov, 2010), (Hazan and Kale, 2011), (Ghadimi and Lan, 2012). Later, SGD was shown to achieve the optimal rate for stochastic non-smooth strongly convex optimization using tail averaging (Rakhlin et al., 2012) or weighted averaging with increasing weights (Lacoste-Julien et al., 2012). The last-iterate convergence of SGD for non-smooth convex optimization was established in (Shamir and Zhang, 2013).
The use of the \(l_1\) norm as a regularizer in the Lasso method was pioneered by Tibshirani, 1996. The elastic net regularizer was later proposed in (Zou and Hastie, 2005), while the group lasso regularizer was introduced in (Yuan and Lin, 2006). More recently, piecewise-affine regularized quantization (PARQ) was proposed in (Jin et al., 2025). Nuclear norm minimization for promoting a low-rank matrix was first considered in (Fazel et al., 2001).
Pioneering applications of SGD to regularized empirical risk minimization in machine learning, including support vector machines, appear in (Zhang, 2004a), (Shalev-Shwartz et al., 2007). The application of the proximal gradient method to the \(l_1\)-norm regularized problem was initiated by Daubechies et al., 2004, yielding an algorithm known as iterative soft-thresholding. The application of SPGD to machine learning with various regularization terms was studied in (Duchi and Singer, 2009). The application of SGD to optimizing truncated losses and its theory was studied in (Xu et al., 2019).
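As a concrete sketch of iterative soft-thresholding, consider the separable toy problem \(\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}-\mathbf{b}\|_2^2+\lambda\|\mathbf{w}\|_1\); the objective, the vector \(\mathbf{b}\), and all constants below are illustrative assumptions, not tied to any specific paper above.

```python
def soft_threshold(x, tau):
    # proximal operator of tau * |x| applied to a scalar
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def ista(grad, w, eta, lam, steps):
    # proximal gradient: a gradient step followed by the l1 prox
    for _ in range(steps):
        w = [soft_threshold(wi - eta * gi, eta * lam)
             for wi, gi in zip(w, grad(w))]
    return w

# toy problem: min_w 0.5*||w - b||^2 + lam*||w||_1, whose solution is
# soft_threshold applied coordinate-wise to b
b = [2.0, 0.3, -1.0]
w = ista(lambda w: [wi - bi for wi, bi in zip(w, b)],
         w=[0.0, 0.0, 0.0], eta=1.0, lam=0.5, steps=50)
```

Small coordinates of \(\mathbf{b}\) (here the middle one) are thresholded exactly to zero, which is why this family of methods is used to promote sparsity.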
The most famous application of coordinate descent methods in machine learning is as a solver for support vector machines (Chang et al., 2008), (Hsieh et al., 2008).
AdaGrad, proposed by (Duchi et al., 2011), was a breakthrough in stochastic optimization for machine learning. It later inspired several popular stochastic algorithms for deep learning, including RMSprop (Hinton, 2018) and Adam (Kingma and Ba, 2015), which will be discussed in Chapter 6.
The first variant of the stochastic optimistic mirror prox method appeared in the author’s award-winning work (Chiang et al., 2012), inspired by Nemirovski’s mirror prox method. It was introduced to address a long-standing challenge in online convex optimization: achieving variational regret bounds. This line of research later inspired the work of (Rakhlin and Sridharan, 2013), which formally coined the term optimistic mirror descent. More recently, the stochastic optimistic mirror prox method has been adopted for solving min-max problems in machine learning, including applications such as training generative adversarial networks (Daskalakis et al., 2018).
Discussion.
The most important factor affecting the practical performance of SGD and other stochastic algorithms is the learning rate scheme \(\eta_t\). In this chapter, we focus on a fixed learning rate \(\eta_t=\eta\). However, it is usually not the best choice in practice. We can also develop theoretical analyses of these algorithms using decreasing learning rates, e.g., \(\eta_t\propto 1/\sqrt{t}\) or \(\eta_t\propto 1/t\). However, these theoretical learning rate schemes are usually also not the best in practice. A practical alternative is the step decay strategy in Theorem 3.7, which gives a convergence bound with only logarithmic dependence on the initial distance \(\|\mathbf{w}_1-\mathbf{w}_*\|_2\). This strategy also works for general stochastic convex optimization under generic error bound conditions of the form \(\|\mathbf{w}-\mathbf{w}_*\|_2\le c(g(\mathbf{w})-g(\mathbf{w}_*))^{\theta}\) with \(\theta\in(0,1]\) (Xu et al., 2017). Another issue with theoretical learning rates is that the values optimizing the convergence bound may depend on unknown problem parameters, e.g., \(\mathbf{w}_*\), the smoothness constant, or the strong convexity modulus. This issue has triggered a line of research known as parameter-free algorithms (Orabona, 2019), (Lan et al., 2023).
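The schedules above can be sketched as simple functions of the iteration counter. The step decay shown here is only a schematic version of the strategy in Theorem 3.7, and all constants (initial rate, decay factor, stage length) are illustrative choices.

```python
import math

def constant(eta0):
    # fixed learning rate eta_t = eta0
    return lambda t: eta0

def inv_sqrt(eta0):
    # decreasing learning rate eta_t proportional to 1/sqrt(t)
    return lambda t: eta0 / math.sqrt(t + 1)

def step_decay(eta0, drop=0.5, stage=100):
    # halve the learning rate after every fixed-length stage
    return lambda t: eta0 * drop ** (t // stage)

eta = step_decay(0.5)
rates = [eta(0), eta(99), eta(100), eta(250)]  # [0.5, 0.5, 0.25, 0.125]
```

In practice the stage length and decay factor are tuned, whereas in the theory the stage lengths are chosen so that each stage roughly halves the optimality gap.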
While this chapter focuses on classical stochastic methods that not only have important applications in machine learning but also significantly influence the approaches presented in later chapters, it does not cover several important algorithms, most notably accelerated gradient methods and their stochastic variants. These methods achieve optimal convergence rates for smooth convex objectives when the variance of stochastic gradients vanishes (Lan, 2012). For a comprehensive treatment of accelerated gradient methods, we refer to the textbook by Nesterov, 2004, and for stochastic accelerated algorithms, we recommend Lan, 2020. Variants of these methods will be introduced in Chapter 6.
Finally, I recommend the textbook (Recht and Wright, 2025), which provides a comprehensive treatment of convex optimization algorithms tailored for data analysis.
References
- Robbins, H., and Monro, S. (1951). A stochastic
approximation method. Annals of Mathematical Statistics, 22,
400–407.
- Kiefer, J., and Wolfowitz, J. (1952). Stochastic
estimation of the maximum of a regression function. The Annals of
Mathematical Statistics, 23(3), 462–466.
- Chung, K. L. (1954). On a stochastic approximation
method. The Annals of Mathematical Statistics, 25(3),
463–483.
- Polyak, B. T., and Juditsky, A. B. (1992).
Acceleration of stochastic approximation by averaging. SIAM Journal
on Control and Optimization, 30(4), 838–855.
- Nemirovski, A., and Yudin, D. (1978). On Cezari’s
convergence of the steepest descent method for approximating saddle
point of convex-concave functions. Soviet Mathematics Doklady,
19, 341–362.
- Yan, Y., Xu, Y., Lin, Q., Liu, W., and Yang, T.
(2020). Optimal epoch stochastic gradient descent ascent methods for
min-max optimization. In Advances in Neural Information Processing
Systems (NeurIPS 2020).
- Nemirovsky, A. S., and Yudin, D. B. (1983).
Problem Complexity and Method Efficiency in Optimization. John
Wiley and Sons.
- Nemirovski, A., Juditsky, A., Lan, G., and Shapiro,
A. (2009). Robust stochastic approximation approach to stochastic
programming. SIAM Journal on Optimization, 19(4),
1574–1609.
- Ghadimi, S., and Lan, G. (2013). Stochastic first-
and zeroth-order methods for nonconvex stochastic programming. SIAM
Journal on Optimization, 23(4), 2341–2368.
- Davis, D., and Drusvyatskiy, D. (2019). Stochastic
model-based minimization of weakly convex functions. SIAM Journal on
Optimization, 29(1), 207–239.
- Martinet, B. (1972). Determination approchee d’un
point fixe d’une application pseudo-contractante. Cas de l’application
prox. Comptes Rendus de l’Academie des Sciences, Paris, Serie
A, 274, 163–165.
- Rockafellar, R. T. (1976). Monotone operators and
the proximal point algorithm. SIAM Journal on Control and
Optimization, 14(5), 877–898.
- Lions, P. L., and Mercier, B. (1979). Splitting
algorithms for the sum of two nonlinear operators. SIAM Journal on
Numerical Analysis, 16(6), 964–979.
- Passty, G. B. (1979). Ergodic convergence to a zero
of the sum of monotone operators in Hilbert space. Journal of
Mathematical Analysis and Applications, 72, 383–390.
- Goldstein, A. A. (1964). Convex programming in
Hilbert space. Bulletin of the American Mathematical Society,
70(5), 709–710.
- Warga, J. (1963). Minimizing certain convex
functions. Journal of the Society for Industrial and Applied
Mathematics, 11(3), 588–593.
- Ortega, J. M., and Rheinboldt, W. C. (1970).
Iterative Solution of Nonlinear Equations in Several Variables.
Academic Press.
- Luenberger, D. G. (1973). Introduction to
Linear and Nonlinear Programming. Addison-Wesley Publishing
Company.
- Luo, Z. Q., and Tseng, P. (1992). On the
convergence of the coordinate descent method for convex differentiable
minimization. Journal of Optimization Theory and Applications,
72(1), 7–35.
- Tseng, P. (1990). Dual ascent methods for problems
with strictly convex costs and linear constraints: A unified approach.
SIAM Journal on Control and Optimization, 28(1), 214–242.
- Tseng, P., and Bertsekas, D. P. (1987). Relaxation
methods for problems with strictly convex separable costs and linear
constraints. Mathematical Programming, 38(3), 303–321.
- Tseng, P. (2001). Convergence of a block
coordinate descent method for nondifferentiable minimization.
Journal of Optimization Theory and Applications, 109(3),
475–494.
- Nesterov, Y. (2012). Efficiency of coordinate
descent methods on huge-scale optimization problems. SIAM Journal on
Optimization, 22(2), 341–362.
- Dang, C. D., and Lan, G. (2015). Stochastic block
mirror descent methods for nonsmooth and stochastic optimization.
SIAM Journal on Optimization, 25(2), 856–882.
- Korpelevich, G. M. (1976). The extragradient
method for finding saddle points and other problems. Matecon, 12,
747–756.
- Nemirovski, A. (2004). Prox-method with rate of
convergence O(1/t) for variational inequalities with Lipschitz
continuous monotone operators and smooth convex-concave saddle point
problems. SIAM Journal on Optimization, 15(1), 229–251.
- Juditsky, A., Nemirovski, A., and Tauvel, C.
(2011). Solving variational inequalities with stochastic mirror-prox
algorithm. Stochastic Systems, 1(1), 17–58.
- Rosenblatt, F. (1962). Principles of
Neurodynamics: Perceptrons and the Theory of Brain Mechanisms.
Spartan Books.
- Widrow, B., and Hoff, M. E. (1960). Adaptive
switching circuits. IRE WESCON Convention Record, 4,
96–104.
- Amari, S. (1967). A theory of adaptive pattern
classifiers. IEEE Transactions on Electronic Computers,
EC-16(3), 299–307.
- Littlestone, N. (1988). Learning quickly when
irrelevant attributes abound: A new linear-threshold algorithm.
Machine Learning, 2(4), 285–318.
- Littlestone, N., and Warmuth, M. K. (1994). The
weighted majority algorithm. Information and Computation,
108(2), 212–261.
- Kivinen, J., and Warmuth, M. K. (1997).
Exponentiated gradient versus gradient descent for linear predictors.
Information and Computation, 132(1), 1–63.
- Freund, Y., and Schapire, R. E. (1997). A
decision-theoretic generalization of on-line learning and an application
to boosting. Journal of Computer and System Sciences, 55(1),
119–139.
- Dekel, O., and Singer, Y. (2005). Data-driven
online to batch conversions. In Advances in Neural Information
Processing Systems, 18.
- Zinkevich, M. (2003). Online convex programming
and generalized infinitesimal gradient ascent. In Proceedings of the
Twentieth International Conference on Machine Learning (ICML 2003),
928–935.
- Hazan, E., Agarwal, A., and Kale, S. (2007).
Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69(2-3), 169–192.
- Iouditski, A., and Nesterov, Y. (2010).
Primal-dual subgradient methods for minimizing uniformly convex
functions. arXiv preprint.
- Hazan, E., and Kale, S. (2011). Beyond the regret
minimization barrier: an optimal algorithm for stochastic
strongly-convex optimization. In Proceedings of the 24th Annual
Conference on Learning Theory (COLT 2011), 421–436.
- Ghadimi, S., and Lan, G. (2012). Optimal
stochastic approximation algorithms for strongly convex stochastic
composite optimization I: A generic algorithmic framework. SIAM
Journal on Optimization, 22(4), 1469–1492.
- Rakhlin, A., Shamir, O., and Sridharan, K. (2012).
Making gradient descent optimal for strongly convex stochastic
optimization. In Proceedings of the 29th International Conference on
Machine Learning (ICML 2012), 1571–1578.
- Lacoste-Julien, S., Schmidt, M., and Bach, F. R.
(2012). A simpler approach to obtaining an O(1/t) convergence rate for
the projected stochastic subgradient method. CoRR,
abs/1212.2002.
- Shamir, O., and Zhang, T. (2013). Stochastic
gradient descent for non-smooth optimization: convergence results and
optimal averaging schemes. In Proceedings of the 30th International
Conference on Machine Learning (ICML 2013), 71–79.
- Tibshirani, R. (1996). Regression shrinkage and
selection via the lasso. Journal of the Royal Statistical Society
(Series B), 58, 267–288.
- Zou, H., and Hastie, T. (2005). Regularization and
variable selection via the elastic net. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 67(2),
301–320.
- Yuan, M., and Lin, Y. (2006). Model selection and
estimation in regression with grouped variables. Journal of the
Royal Statistical Society: Series B (Statistical Methodology),
68(1), 49–67.
- Jin, L., Ma, J., Liu, Z., Gromov, A., Defazio, A.,
and Xiao, L. (2025). PARQ: Piecewise-affine regularized quantization. In
Forty-second International Conference on Machine
Learning.
- Fazel, M., Hindi, H., and Boyd, S. P. (2001). A
rank minimization heuristic with application to minimum order system
approximation. In 2001 IEEE International Conference on Control
Applications, 1347–1352.
- Zhang, T. (2004a). Solving large scale linear
prediction problems using stochastic gradient descent algorithms. In
Proceedings of the Twenty-First International Conference on Machine
Learning (ICML 2004), 919–926.
- Shalev-Shwartz, S., Singer, Y., and Srebro, N.
(2007). Pegasos: Primal estimated sub-gradient solver for SVM. In
Proceedings of the 24th International Conference on Machine
Learning, 807–814.
- Daubechies, I., Defrise, M., and De Mol, C.
(2004). An iterative thresholding algorithm for linear inverse problems
with a sparsity constraint. Communications on Pure and Applied
Mathematics, 57(11), 1413–1457.
- Duchi, J., and Singer, Y. (2009). Efficient online
and batch learning using forward backward splitting. Journal of
Machine Learning Research, 10, 2899–2934.
- Xu, Y., Zhu, S., Yang, S., Zhang, C., Jin, R., and
Yang, T. (2019). Learning with non-convex truncated losses by SGD. In
Proceedings of UAI 2019, PMLR 115, 701–711.
- Chang, K.-W., Hsieh, C.-J., and Lin, C.-J. (2008).
Coordinate descent method for large-scale L2-loss linear support vector
machines. Journal of Machine Learning Research, 9(45),
1369–1398.
- Hsieh, C.-J., Chang, K.-W., Lin, C.-J., Keerthi,
S. S., and Sundararajan, S. (2008). A dual coordinate descent method for
large-scale linear SVM. In Proceedings of the 25th International
Conference on Machine Learning (ICML 2008), 408–415.
- Duchi, J., Hazan, E., and Singer, Y. (2011).
Adaptive subgradient methods for online learning and stochastic
optimization. Journal of Machine Learning Research, 12,
2121–2159.
- Hinton, G. (2018). Neural networks for machine
learning, lecture 6. Coursera online course.
- Kingma, D. P., and Ba, J. (2015). Adam: A method
for stochastic optimization. In ICLR 2015.
- Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M.,
Lu, C.-J., Jin, R., and Zhu, S. (2012). Online optimization with gradual
variations. In Proceedings of COLT 2012, PMLR 23,
6.1–6.20.
- Rakhlin, A., and Sridharan, K. (2013).
Optimization, learning, and games with predictable sequences. In
Advances in Neural Information Processing Systems (NeurIPS
2013), 3066–3074.
- Daskalakis, C., Ilyas, A., Syrgkanis, V., and
Zeng, H. (2018). Training GANs with optimism. In International
Conference on Learning Representations (ICLR 2018).
- Xu, Y., Lin, Q., and Yang, T. (2017). Stochastic
convex optimization: Faster local growth implies faster global
convergence. In Proceedings of the 34th International Conference on
Machine Learning (ICML 2017), PMLR 70, 3821–3830.
- Orabona, F. (2019). A modern introduction to
online learning. CoRR, abs/1912.13213.
- Lan, G., Ouyang, Y., and Zhang, Z. (2023). Optimal
and parameter-free gradient minimization methods for convex and
nonconvex optimization. arXiv preprint.
- Lan, G. (2012). An optimal method for stochastic
composite optimization. Mathematical Programming, 133(1-2),
365–397.
- Nesterov, Y. (2004). Introductory Lectures on
Convex Optimization: A Basic Course. Springer.
- Lan, G. (2020). First-order and Stochastic
Optimization Methods for Machine Learning. Springer International
Publishing.
- Recht, B., and Wright, S. J. (2025). Optimization for Modern Data Analysis. Cambridge University Press.