Section 3.5: Adaptive Gradient Method (AdaGrad)
The stochastic algorithms discussed so far are fairly general and were originally developed to address a wide range of problems, extending beyond those encountered specifically in machine learning. Nevertheless, the ERM problem of machine learning may exhibit unique properties that depend on the data. How can we leverage them to develop a stochastic algorithm that is potentially faster in practice?
Below, we introduce the Adaptive Gradient Method (AdaGrad), which employs an adaptive step size that incorporates knowledge of the geometry of the data observed in earlier iterations to perform more informed gradient-based learning.
While AdaGrad was considered an important breakthrough in machine learning, it in fact evolves from SMD. We use the same language as SMD to present AdaGrad and its analysis. Let us consider the smooth problem and recall the update of SMD: \[\begin{align*} \mathbf{w}_{t+1} = \arg\min_{\mathbf{w}\in\mathbb{R}^d} \nabla g(\mathbf{w}_t; \zeta_t)^{\top}\mathbf{w} + \frac{1}{\eta}D_\varphi(\mathbf{w}, \mathbf{w}_t). \end{align*}\] The key idea of AdaGrad is to use a time-varying proximal function \(\varphi_t\) that changes across iterations. A specific way to construct \(\varphi_t\) is the following.
Let \(H_t=\text{diag}(s_{t,1},\ldots, s_{t, d})\) be a positive definite diagonal matrix. Define \(\varphi_t(\mathbf{w})= \frac{1}{2}\mathbf{w}^{\top}H_t\mathbf{w}\) and, for a positive definite matrix \(H\), the norm \(\|\mathbf{w}\|_H = \sqrt{\mathbf{w}^{\top}H\mathbf{w}}\). Then the Bregman divergence induced by \(\varphi_t\) becomes: \[ D_{\varphi_t}(\mathbf{w}, \mathbf{w}') = \frac{1}{2}(\mathbf{w}- \mathbf{w}')^{\top}H_t (\mathbf{w} - \mathbf{w}') = \frac{1}{2}\sum_{i=1}^ds_{t,i}(w_i - w'_i)^2, \] which is \(1\)-strongly convex w.r.t. \(\|\cdot\|_{H_t}\). The weights \(\mathbf{s}_t=(s_{t,1},\ldots, s_{t,d})\) are updated according to the following:
\[\begin{align} s_{t,i}= \sqrt{\sum_{\tau=1}^t [\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}, \forall i, \end{align}\]
which essentially measures the growth of the stochastic gradients accumulated over the first \(t\) iterations.
Let \(\mathbf{z}_t =\nabla g(\mathbf{w}_t; \zeta_t)\), let \(\mathbf{m}_{1:t} = [\mathbf{z}_1, \ldots, \mathbf{z}_t]\), and let \(\mathbf{m}_{1:t, i}\) denote its \(i\)-th row vector. Then \(s_{t,i}=\|\mathbf{m}_{1:t,i}\|_2\). As a result, the updating step becomes \[\begin{align} \mathbf{w}_{t+1} =\mathbf{w}_t - \eta H_t^{-1}\nabla g(\mathbf{w}_t; \zeta_t) = \mathbf{w}_t - \frac{\eta}{\mathbf{s}_t}\circ\nabla g(\mathbf{w}_t; \zeta_t), \end{align}\] where \(\circ\) denotes the element-wise product. The full steps of AdaGrad are summarized in Algorithm 6.
Compared with SGD, there are two differences: (i) the effective step size \(\frac{\eta}{\mathbf{s}_t}\) is adaptive: it depends on the history of updates, and hence on the sampled data \(\zeta_1, \ldots, \zeta_t\). This is why it is called an adaptive step size; (ii) each coordinate of \(\mathbf{w}\) receives a different step size. This feature makes it useful for training deep neural networks, as the parameters at different layers usually have gradients of different magnitudes.
Algorithm 6: AdaGrad
- Input: learning rate parameter \(\eta\), starting point \(\mathbf{w}_1\)
- For \(t=1,\dotsc,T\)
- Compute an unbiased gradient estimator \(\mathbf{z}_t = \nabla g(\mathbf{w}_t; \zeta_t)\)
- Update \(s_{t,i}= \sqrt{\sum_{\tau=1}^t [\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}, \forall i\).
- Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{\mathbf{s}_t}\circ\mathbf{z}_t\)
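The steps above can be sketched in Python. This is a minimal illustration only: the toy quadratic objective, the noise level, and the small \(\epsilon\) guard in the denominator are assumptions added for demonstration, not part of Algorithm 6.

```python
import numpy as np

def adagrad(grad_fn, w1, eta, T, eps=1e-12):
    """Sketch of Algorithm 6: per-coordinate step sizes eta / s_t."""
    w = np.array(w1, dtype=float)
    sum_sq = np.zeros_like(w)           # running sum of squared gradients
    for t in range(1, T + 1):
        z = grad_fn(w, t)               # unbiased gradient estimator z_t
        sum_sq += z ** 2
        s = np.sqrt(sum_sq)             # s_{t,i} = ||m_{1:t,i}||_2
        w = w - (eta / (s + eps)) * z   # element-wise; eps guards against s_i = 0
    return w

# Toy example (assumption): minimize E[0.5 ||w - c||^2] from noisy gradients,
# whose minimizer is w_* = c
rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 3.0])
grad = lambda w, t: (w - c) + 0.1 * rng.standard_normal(3)
w_hat = adagrad(grad, np.zeros(3), eta=1.0, T=2000)
```

Note how each coordinate accumulates its own gradient history in `sum_sq`, so coordinates with small gradients keep a larger effective step size.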
Convergence Analysis
The dual norm of \(\|\cdot\|_{H}\) is given by \(\|\mathbf{u}\|_{H^{-1}} = \sqrt{\mathbf{u}^{\top}H^{-1} \mathbf{u}}\). Recall that \(\varphi_t\) is \(1\)-strongly convex in terms of \(\|\cdot\|_{H_t}\).
Lemma 3.12
We have \[\begin{align*} &\sum_{t=1}^T\{D_{\varphi_t}(\mathbf{w}_*, \mathbf{w}_t)- D_{\varphi_t}(\mathbf{w}_*, \mathbf{w}_{t+1})\}\leq \frac{1}{2}\max_{t\leq T}\|\mathbf{w}_* - \mathbf{w}_{t}\|^2_\infty \sum_{i=1}^ds_{T,i}. \end{align*}\]
■
Lemma 3.13
We have \[\begin{align*} \sum_{t=1}^T\|\nabla g(\mathbf{w}_t; \zeta_t)\|_{H_t^{-1}}^2\leq 2\sum_{i=1}^d s_{T,i}. \end{align*}\]
Proof. Let us first prove a general result: for a real-valued sequence \(\{a_t\}\), we have
\[\begin{align}\label{eqn:userec} \sum_{t=1}^T \frac{a_t^2}{\|a_{1:t}\|_2}\leq 2\|a_{1:T}\|_2, \end{align}\]
where \(a_{1:t}=(a_1,\ldots, a_t)\). We prove this by induction. First, it holds trivially for \(T=1\). Now, assume it holds for \(T-1\); we prove it holds for \(T\): \[\begin{align*} \sum_{t=1}^T \frac{a_t^2}{\sqrt{\sum_{\tau=1}^t a_\tau^2}} = \sum_{t=1}^{T-1} \frac{a_t^2}{\sqrt{\sum_{\tau=1}^t a_\tau^2}} + \frac{a_T^2}{\|a_{1:T}\|_2}\leq 2\|a_{1:T-1}\|_2 +\frac{a_T^2}{\|a_{1:T}\|_2}, \end{align*}\] where the inequality uses the induction hypothesis. Let \(b_T=\sqrt{\sum_{t=1}^Ta_t^2}\); then we have \[\begin{align*} 2\|a_{1:T-1}\|_2 +\frac{a_T^2}{\|a_{1:T}\|_2} = 2\sqrt{b_T^2 - a_T^2} + \frac{a_T^2}{\sqrt{b_T^2}}. \end{align*}\] Since \(\sqrt{\cdot}\) is a concave function, applying \(\sqrt{x+\delta}\leq \sqrt{x} + \frac{\delta}{2\sqrt{x}}\) with \(x=b_T^2\) and \(\delta=-a_T^2\), we have \[\begin{align*} \sqrt{b_T^2 - a_T^2}\leq \sqrt{b_T^2} - \frac{a_T^2}{2\sqrt{b_T^2}}. \end{align*}\] Hence, \(2\sqrt{b_T^2 - a_T^2} + \frac{a_T^2}{\sqrt{b_T^2}}\leq 2\sqrt{b_T^2} = 2\|a_{1:T}\|_2\). Thus, we prove (\(\ref{eqn:userec}\)) for \(T\).
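As a quick numerical sanity check of (\(\ref{eqn:userec}\)), one can verify the inequality on a random sequence; the sequence below is an arbitrary illustration, not part of the proof.

```python
import numpy as np

# Verify sum_t a_t^2 / ||a_{1:t}||_2 <= 2 ||a_{1:T}||_2 on a random sequence
rng = np.random.default_rng(0)
a = rng.standard_normal(1000)              # arbitrary real-valued sequence a_1..a_T
prefix_norms = np.sqrt(np.cumsum(a ** 2))  # ||a_{1:t}||_2 for t = 1, ..., T
lhs = float(np.sum(a ** 2 / prefix_norms))
rhs = float(2 * prefix_norms[-1])
assert lhs <= rhs
```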
Next, we apply this result to the following: \[\begin{align*} &\sum_{t=1}^T\|\nabla g(\mathbf{w}_t; \zeta_t)\|_{H_t^{-1}}^2 = \sum_{t=1}^T\nabla g(\mathbf{w}_t; \zeta_t)^{\top}\text{diag}(\mathbf{s}_t)^{-1}\nabla g(\mathbf{w}_t; \zeta_t) \\ & = \sum_{i=1}^d\sum_{t=1}^T\frac{[\nabla g(\mathbf{w}_t; \zeta_t)]_i^2}{\sqrt{\sum_{\tau=1}^t[\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}}\leq \sum_{i=1}^d2\sqrt{\sum_{\tau=1}^T[\nabla g(\mathbf{w}_\tau; \zeta_\tau)]_i^2}. \end{align*}\]■
Theorem 3.12
Let \(\bar{\mathbf{w}}_T=\frac{1}{T}\sum_{t=1}^T\mathbf{w}_t\). AdaGrad guarantees that \[\begin{align*} \mathbb{E}[g(\bar{\mathbf{w}}_T)- g(\mathbf{w}_*)]\leq &\frac{\mathbb{E}\left[\max_{t\leq T}\|\mathbf{w}_*-\mathbf{w}_t\|_\infty^2\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{2\eta T}\\ &+ \frac{\eta\mathbb{E}\left[\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{T}. \end{align*}\] If \(\max_t\|\mathbf{w}_* - \mathbf{w}_t\|_\infty\leq D_\infty\) and \(\eta = D_{\infty}/\sqrt{2}\), we have \[\begin{align*} \mathbb{E}[g(\bar{\mathbf{w}}_T)- g(\mathbf{w}_*)]\leq \frac{\sqrt{2}D_\infty\mathbb{E}\left[\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\right]}{T}. \end{align*}\]
💡 Why it matters
The above result shows the convergence rate depends on the growth rate
of the cumulative stochastic gradient \(\sum_{i=1}^d\|\mathbf{m}_{1:T,i}\|_2\). In
the worst case, it grows at a rate of \(O(\sqrt{T})\), inducing a convergence rate
of \(O(1/\sqrt{T})\), similar to SGD.
However, when the cumulative stochastic gradient grows slower than \(O(\sqrt{T})\), AdaGrad will enjoy a
convergence rate of \(o(1/\sqrt{T})\).
Let us consider the following linear model with sparse random data, where \(g(\mathbf{w}_t; \zeta_t)=[1-\mathbf{w}_t^{\top}\zeta_t]_+\) and the data vectors \(\zeta_t \in \{-1,0,1\}^d\). Assume that in each round \(t\), feature \(i\) appears (i.e., \(\zeta_{t,i}\neq 0\)) with probability \(p_i = \min\{1, c i^{-\alpha}\}\) for some \(\alpha \in (1, \infty)\) and a dimension-independent constant \(c\). Since the nonzero entries of \(\mathbf{z}_t\) lie in \(\{-1,1\}\) and \(\mathbf{z}_{t,i}\neq 0\) only if \(\zeta_{t,i}\neq 0\), by Jensen's inequality we have \[\begin{align*} \mathbb{E}\left[\sum_{i=1}^d \|\mathbf{m}_{1:T,i}\|_2\right] \le \mathbb{E}\left[\sum_{i=1}^d \sqrt{\left|\{t\leq T: \zeta_{t,i} \neq 0\}\right|}\right] \le \sum_{i=1}^d \sqrt{\mathbb{E}\left[\left|\{t\leq T: \zeta_{t,i} \neq 0\}\right|\right]} = \sum_{i=1}^d \sqrt{p_i T}. \end{align*}\] For the rightmost sum, we have \(\sum_{i=1}^d\sqrt{p_i}\leq \sqrt{c}\sum_{i=1}^d i^{-\alpha/2} = O(\log d)\) for \(\alpha \ge 2\), and \(\sqrt{c}\sum_{i=1}^d i^{-\alpha/2} = O(d^{1-\alpha/2})\) for \(\alpha \in (1, 2)\). If \(\mathbf{w}_t\) is restricted to the domain \(\mathcal W= \{\mathbf{w}: \|\mathbf{w}\|_\infty \le 1\}\), then \(D_\infty = 2\), and the convergence rate of AdaGrad becomes \(O(\max\{\log d, d^{1-\alpha/2}\}/\sqrt{T})\). By contrast, the convergence rate of SGD in Theorem 3.2 is \(O(\sqrt{d/T})\). So we see that in this sparse yet heavy-tailed feature setting, AdaGrad’s convergence bound can be exponentially smaller in the dimension \(d\) than the non-adaptive bound of SGD.
■
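The estimate \(\sum_{i=1}^d\sqrt{p_i T}\) in the sparse-data example above can be checked by simulation. The particular values of \(c\), \(\alpha\), \(d\), and \(T\) below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, c, alpha = 1000, 500, 1.0, 1.5
i = np.arange(1, d + 1)
p = np.minimum(1.0, c * i ** (-alpha))    # p_i = min{1, c i^{-alpha}}

# Simulate which features appear in each round; nonzero gradient entries are
# in {-1, 1}, so ||m_{1:T,i}||_2 is at most sqrt(#rounds feature i appears in)
appears = rng.random((T, d)) < p
counts = appears.sum(axis=0)              # |{t <= T : zeta_{t,i} != 0}|
empirical = float(np.sqrt(counts).sum())  # sum_i sqrt(counts_i)
bound = float(np.sqrt(p * T).sum())       # sum_i sqrt(p_i T), the Jensen bound
dense = float(np.sqrt(d * T))             # scale of the non-adaptive sqrt(dT) rate
```

With these heavy-tailed feature frequencies, both the simulated cumulative gradient norm and the Jensen bound fall well below the dense \(\sqrt{dT}\) scale, matching the discussion above.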