Section 3.5: Adaptive Gradient Method (AdaGrad)
The stochastic algorithms discussed so far are fairly general and were originally developed to address a wide range of problems, extending beyond those encountered specifically in machine learning. Nevertheless, the ERM problem of machine learning may exhibit unique properties that depend on the data. How can we leverage them to develop a stochastic algorithm that is potentially faster in practice?
Below, we introduce the Adaptive Gradient Method (AdaGrad), which employs an adaptive step size that incorporates knowledge of the geometry of the data observed in earlier iterations to perform more informed gradient-based learning.
While AdaGrad was considered an important breakthrough in machine learning, it in fact evolves from SMD. We use the same language as SMD to present AdaGrad and its analysis. Let us consider the smooth problem and recall the update of SMD: \[\begin{align*} \mathbf{w}_{t+1} = \arg\min_{\mathbf{w}\in\mathbb{R}^d} \nabla g(\mathbf{w}_t; \zeta_t)^{\top}\mathbf{w} + \frac{1}{\eta}D_\varphi(\mathbf{w}, \mathbf{w}_t). \end{align*}\] The key idea of AdaGrad is to use a time-varying proximal function \(\varphi_t\) that changes across iterations. A specific way to construct \(\varphi_t\) is the following.
Let \(H_t=\text{diag}(s_{t,1},\ldots, s_{t, d})\) be a positive definite diagonal matrix. Define \(\varphi_t(\mathbf{w})= \frac{1}{2}\mathbf{w}^{\top}H_t\mathbf{w}\) and, for a positive definite matrix \(H\), the norm \(\|\mathbf{w}\|_H = \sqrt{\mathbf{w}^{\top}H\mathbf{w}}\). Then the Bregman divergence induced by \(\varphi_t\) becomes: \[ D_{\varphi_t}(\mathbf{w}, \mathbf{w}') = \frac{1}{2}(\mathbf{w}- \mathbf{w}')^{\top}H_t (\mathbf{w} - \mathbf{w}') = \frac{1}{2}\sum_{i=1}^ds_{t,i}(w_i - w'_i)^2, \] which is \(1\)-strongly convex w.r.t. \(\|\cdot\|_{H_t}\). The weights \(\mathbf{s}_t=(s_{t,1},\ldots, s_{t,d})\) are updated according to the following:
\[\begin{align} s_{t,i}= \sqrt{\sum_{\tau=1}^t [\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}, \forall i, \end{align}\]
which essentially measures the growth of the stochastic gradients accumulated over the first \(t\) iterations.
Let \(\mathbf{z}_t =\nabla g(\mathbf{w}_t; \zeta_t)\), let \(\mathbf{m}_{1:t} = [\mathbf{z}_1, \ldots, \mathbf{z}_t]\), and let \(\mathbf{m}_{1:t, i}\) denote its \(i\)-th row vector. Then \(s_{t,i}=\|\mathbf{m}_{1:t,i}\|_2\). As a result, the updating step becomes \[\begin{align} \mathbf{w}_{t+1} =\mathbf{w}_t - \eta H_t^{-1}\nabla g(\mathbf{w}_t; \zeta_t) = \mathbf{w}_t - \frac{\eta}{\mathbf{s}_t}\circ\nabla g(\mathbf{w}_t; \zeta_t), \end{align}\] where \(\circ\) denotes the element-wise product. The full steps of AdaGrad are summarized in Algorithm 6.
Compared with SGD, there are two differences: (i) the effective step size \(\frac{\eta}{\mathbf{s}_t}\) is adaptive: it depends on the history of updates, and hence on the sampled data \(\zeta_1, \ldots, \zeta_t\). This is why it is called an adaptive step size; (ii) each coordinate of \(\mathbf{w}\) receives a different step size. This feature makes it useful for training deep neural networks, as the parameters at different layers usually have gradients of different magnitudes.
Algorithm 6: AdaGrad
- Input: learning rate parameter \(\eta\), starting point \(\mathbf{w}_1\)
- For \(t=1,\dotsc,T\)
- Compute an unbiased gradient estimator \(\mathbf{z}_t = \nabla g(\mathbf{w}_t; \zeta_t)\)
- Update \(s_{t,i}= \sqrt{\sum_{\tau=1}^t [\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}, \forall i\).
- Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{\mathbf{s}_t}\circ\mathbf{z}_t\)
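The steps above can be sketched in Python. This is a minimal illustration only: the toy quadratic objective, the noise level, and the small \(\epsilon\) guard in the denominator are assumptions added for demonstration, not part of Algorithm 6.

```python
import numpy as np

def adagrad(grad_fn, w1, eta, T, eps=1e-12):
    """Sketch of Algorithm 6: per-coordinate step sizes eta / s_t."""
    w = np.array(w1, dtype=float)
    sum_sq = np.zeros_like(w)           # running sum of squared gradients
    for t in range(1, T + 1):
        z = grad_fn(w, t)               # unbiased gradient estimator z_t
        sum_sq += z ** 2
        s = np.sqrt(sum_sq)             # s_{t,i} = ||m_{1:t,i}||_2
        w = w - (eta / (s + eps)) * z   # element-wise; eps guards against s_i = 0
    return w

# Toy example (assumption): minimize E[0.5 ||w - c||^2] from noisy gradients,
# whose minimizer is w_* = c
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 3.0])
grad = lambda w, t: (w - c) + 0.1 * rng.standard_normal(3)
w_hat = adagrad(grad, np.zeros(3), eta=1.0, T=2000)
```

Note how each coordinate accumulates its own gradient history in `sum_sq`, so coordinates with small gradients keep a larger effective step size.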
Convergence Analysis
The dual norm of \(\|\cdot\|_{H}\) is given by \(\|\mathbf{u}\|_{H^{-1}} = \sqrt{\mathbf{u}^{\top}H^{-1} \mathbf{u}}\). Recall that \(\varphi_t\) is \(1\)-strongly convex in terms of \(\|\cdot\|_{H_t}\).
Lemma 3.12
We have \[\begin{align*} &\sum_{t=1}^T\{D_{\varphi_t}(\mathbf{w}_*, \mathbf{w}_t)- D_{\varphi_t}(\mathbf{w}_*, \mathbf{w}_{t+1})\}\leq \frac{1}{2}\max_{t\leq T}\|\mathbf{w}_* - \mathbf{w}_{t}\|^2_\infty \sum_{i=1}^ds_{T,i}. \end{align*}\]
■
Lemma 3.13
We have \[\begin{align*} \sum_{t=1}^T\|\nabla g(\mathbf{w}_t; \zeta_t)\|_{H_t^{-1}}^2\leq 2\sum_{i=1}^d s_{T,i}. \end{align*}\]
Proof. Let us first prove a general result: for a real-valued sequence \(\{a_t\}\), we have
\[\begin{align}\label{eqn:userec} \sum_{t=1}^T \frac{a_t^2}{\|a_{1:t}\|_2}\leq 2\|a_{1:T}\|_2, \end{align}\]
where \(a_{1:t}=(a_1,\ldots, a_t)\). We prove this by induction. First, it holds trivially for \(T=1\). Now, assume it holds for \(T-1\); we prove it holds for \(T\): \[\begin{align*} \sum_{t=1}^T \frac{a_t^2}{\sqrt{\sum_{\tau=1}^t a_\tau^2}} = \sum_{t=1}^{T-1} \frac{a_t^2}{\sqrt{\sum_{\tau=1}^t a_\tau^2}} + \frac{a_T^2}{\|a_{1:T}\|_2}\leq 2\|a_{1:T-1}\|_2 +\frac{a_T^2}{\|a_{1:T}\|_2}, \end{align*}\] where the inequality uses the induction hypothesis. Let \(b_T=\sqrt{\sum_{t=1}^Ta_t^2}\); then we have \[\begin{align*} 2\|a_{1:T-1}\|_2 +\frac{a_T^2}{\|a_{1:T}\|_2} = 2\sqrt{b_T^2 - a_T^2} + \frac{a_T^2}{\sqrt{b_T^2}}. \end{align*}\] Since \(\sqrt{\cdot}\) is a concave function, applying \(\sqrt{x+\delta}\leq \sqrt{x} + \frac{\delta}{2\sqrt{x}}\) with \(x=b_T^2\) and \(\delta=-a_T^2\), we have \[\begin{align*} \sqrt{b_T^2 - a_T^2}\leq \sqrt{b_T^2} - \frac{a_T^2}{2\sqrt{b_T^2}}. \end{align*}\] Hence, \(2\sqrt{b_T^2 - a_T^2} + \frac{a_T^2}{\sqrt{b_T^2}}\leq 2\sqrt{b_T^2} = 2\|a_{1:T}\|_2\). Thus, we prove (\(\ref{eqn:userec}\)) for \(T\).
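As a quick numerical sanity check of (\(\ref{eqn:userec}\)), one can verify the inequality on a random sequence; the sequence below is an arbitrary illustration, not part of the proof.

```python
import numpy as np

# Verify sum_t a_t^2 / ||a_{1:t}||_2 <= 2 ||a_{1:T}||_2 on a random sequence
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)              # arbitrary real-valued sequence a_1..a_T
prefix_norms = np.sqrt(np.cumsum(a ** 2))  # ||a_{1:t}||_2 for t = 1, ..., T
lhs = float(np.sum(a ** 2 / prefix_norms))
rhs = float(2 * prefix_norms[-1])
assert lhs <= rhs
```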
Next, we apply this result to the following: \[\begin{align*} &\sum_{t=1}^T\|\nabla g(\mathbf{w}_t; \zeta_t)\|_{H_t^{-1}}^2 = \sum_{t=1}^T\nabla g(\mathbf{w}_t; \zeta_t)^{\top}\text{diag}(\mathbf{s}_t)^{-1}\nabla g(\mathbf{w}_t; \zeta_t) \\ & = \sum_{i=1}^d\sum_{t=1}^T\frac{[\nabla g(\mathbf{w}_t; \zeta_t)]_i^2}{\sqrt{\sum_{\tau=1}^t[\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}}\leq \sum_{i=1}^d2\sqrt{\sum_{\tau=1}^T[\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}. \end{align*}\]■
Theorem 3.12
Let \(\bar{\mathbf{w}}_T=\frac{1}{T}\sum_{t=1}^T\mathbf{w}_t\). AdaGrad guarantees that \[\begin{align*} \mathbb{E}[g(\bar{\mathbf{w}}_T)- g(\mathbf{w}_*)]\leq &\frac{\mathbb{E}\left[\max_{t\leq T}\|\mathbf{w}_*-\mathbf{w}_t\|_\infty^2\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{2\eta T}\\ &+ \frac{\eta\mathbb{E}\left[\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{T}. \end{align*}\] If \(\max_t\|\mathbf{w}_* - \mathbf{w}_t\|_\infty\leq D_\infty\) and \(\eta = D_{\infty}/\sqrt{2}\), we have \[\begin{align*} \mathbb{E}[g(\bar{\mathbf{w}}_T)- g(\mathbf{w}_*)]\leq \frac{\sqrt{2}D_\infty\mathbb{E}\left[\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{T}. \end{align*}\]
💡 Why it matters
The above result shows the convergence rate depends on the growth rate
of the cumulative stochastic gradient \(\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\). In
the worst case, it grows at a rate of \(O(\sqrt{T})\), inducing a convergence rate
of \(O(1/\sqrt{T})\), similar to SGD.
However, when the cumulative stochastic gradient grows slower than \(O(\sqrt{T})\), AdaGrad will enjoy a
convergence rate of \(o(1/\sqrt{T})\).
Let us consider the following linear model with sparse random data, where \(g(\mathbf{w}_t; \zeta_t)=[1-\mathbf{w}_t^{\top}\zeta_t]_+\) and the data vectors \(\zeta_t \in \{-1,0,1\}^d\). Assume that in each round \(t\), feature \(i\) appears (i.e., \(\zeta_{t,i}\neq 0\)) with probability \(p_i = \min\{1, c i^{-\alpha}\}\) for some \(\alpha \in (1, \infty)\) and a dimension-independent constant \(c\). Since the nonzero entries of \(\mathbf{z}_t\) lie in \(\{-1,1\}\) and \(\mathbf{z}_{t,i}\neq 0\) only if \(\zeta_{t,i}\neq 0\), by Jensen's inequality we have \[\begin{align*} \mathbb{E}\left[\sum_{i=1}^d \|\mathbf{m}_{1:T,i}\|_2\right] \le \mathbb{E}\left[\sum_{i=1}^d \sqrt{\left|\{t\leq T: \zeta_{t,i} \neq 0\}\right|}\right] \le \sum_{i=1}^d \sqrt{\mathbb{E}\left[\left|\{t\leq T: \zeta_{t,i} \neq 0\}\right|\right]} = \sum_{i=1}^d \sqrt{p_i T}. \end{align*}\] For the rightmost sum, we have \(\sum_{i=1}^d\sqrt{p_i}\leq \sqrt{c}\sum_{i=1}^d i^{-\alpha/2} = O(\log d)\) for \(\alpha \ge 2\), and \(\sqrt{c}\sum_{i=1}^d i^{-\alpha/2} = O(d^{1-\alpha/2})\) for \(\alpha \in (1, 2)\). If \(\mathbf{w}_t\) is restricted to the domain \(\mathcal W= \{\mathbf{w}: \|\mathbf{w}\|_\infty \le 1\}\), then \(D_\infty = 2\), and the convergence rate of AdaGrad becomes \(O(\max\{\log d, d^{1-\alpha/2}\}/\sqrt{T})\). By contrast, the convergence rate of SGD in Theorem 3.2 is \(O(\sqrt{d/T})\). So we see that in this sparse yet heavy-tailed feature setting, AdaGrad’s convergence bound can be exponentially smaller in the dimension \(d\) than the non-adaptive bound of SGD.
■
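The estimate \(\sum_{i=1}^d\sqrt{p_i T}\) in the sparse-data example above can be checked by simulation. The particular values of \(c\), \(\alpha\), \(d\), and \(T\) below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, c, alpha = 1000, 500, 1.0, 1.5
i = np.arange(1, d + 1)
p = np.minimum(1.0, c * i ** (-alpha))    # p_i = min{1, c i^{-alpha}}

# Simulate which features appear in each round; nonzero gradient entries are
# in {-1, 1}, so ||m_{1:T,i}||_2 is at most sqrt(#rounds feature i appears in)
appears = rng.random((T, d)) < p
counts = appears.sum(axis=0)              # |{t <= T : zeta_{t,i} != 0}|
empirical = float(np.sqrt(counts).sum())  # sum_i sqrt(counts_i)
bound = float(np.sqrt(p * T).sum())       # sum_i sqrt(p_i T), the Jensen bound
dense = float(np.sqrt(d * T))             # scale of the non-adaptive sqrt(dT) rate
```

With these heavy-tailed feature frequencies, both the simulated cumulative gradient norm and the Jensen bound fall well below the dense \(\sqrt{dT}\) scale, matching the discussion above.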