
Section 3.1: Stochastic Gradient Descent

Let us consider the following standard stochastic optimization problem:

\[\begin{align}\label{eqn:so} \min_{\mathbf{w}}g(\mathbf{w}) := \mathbb{E}_{\zeta}[g(\mathbf{w}; \zeta)]. \end{align}\]

If \(g\) is differentiable, the stochastic gradient descent (SGD) method takes the following update:

\[\begin{align}\label{eqn:sgd} \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla g(\mathbf{w}_t; \zeta_t), \end{align}\]

where \(\zeta_t\) is a random sample. If \(g\) is non-differentiable, we use a stochastic subgradient \(\mathcal{G}(\mathbf{w}; \zeta)\) to update the model parameter:

\[\begin{align}\label{eqn:sgd-ns} \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathcal{G}(\mathbf{w}_t; \zeta_t). \end{align}\]

The key assumption regarding the stochastic gradient or subgradient is the following.

Assumption 3.1
For any \(\mathbf{w}\), we have \(\mathbb{E}_{\zeta}[\nabla g(\mathbf{w}; \zeta)] = \nabla g(\mathbf{w})\) or \(\mathbb{E}_{\zeta}[\mathcal{G}(\mathbf{w}; \zeta)]\in\partial g(\mathbf{w})\).

Explanation of SGD update

The update (\(\ref{eqn:sgd}\)) is equivalent to:

\[\begin{align}\label{eqn:sgd-e} \mathbf{w}_{t+1} & = \arg\min_{\mathbf{w}} g(\mathbf{w}_t; \zeta_t)+ \nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w} - \mathbf{w}_t) + \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2. \end{align}\]

The stochastic linear approximation \(\tilde g(\mathbf{w}; \zeta_t) = g(\mathbf{w}_t; \zeta_t) + \nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w} - \mathbf{w}_t)\) serves as a stochastic surrogate for \(g(\mathbf{w})\). Since it is only accurate near \(\mathbf{w}_t\), we do not minimize it globally; the proximal term \(\frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2\) keeps the minimizer close to \(\mathbf{w}_t\).

When SGD is applied to solving ERM, \(\zeta_t\) may represent a single data point whose index is sampled uniformly at random from \(\{1,\ldots,n\}\), or a mini-batch of randomly sampled data.
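To make Assumption 3.1 concrete in the ERM setting, the following sketch checks numerically that a uniformly sampled mini-batch yields an unbiased gradient estimator. The toy dataset and squared loss are illustrative assumptions, not from the text.

```python
import random

random.seed(1)
n = 100
# toy ERM dataset with squared loss 0.5 * (x_i * w - y_i)^2 (scalar w)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
w = 0.7

def grad_i(w, i):
    x, y = data[i]
    return (x * w - y) * x  # gradient of the i-th loss at w

# full gradient of the empirical risk
full_grad = sum(grad_i(w, i) for i in range(n)) / n

# Monte Carlo estimate of E_zeta[grad g(w; zeta)] with mini-batches of size 8
B, trials = 8, 20000
acc = 0.0
for _ in range(trials):
    batch = [random.randrange(n) for _ in range(B)]
    acc += sum(grad_i(w, i) for i in batch) / B
mc_mean = acc / trials  # should be close to full_grad (unbiasedness)
```

The Monte Carlo average over many mini-batches should match the full gradient up to sampling error, which is the content of \(\mathbb{E}_{\zeta}[\nabla g(\mathbf{w}; \zeta)] = \nabla g(\mathbf{w})\).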


Algorithm 1: SGD

  1. Input: learning rate schedule \(\{\eta_t\}_{t=1}^{T}\), starting point \(\mathbf{w}_1\)
  2. For \(t=1,\dotsc,T\)
  3. Compute an unbiased gradient estimator \(\mathbf{z}_t=\nabla g(\mathbf{w}_t;\zeta_t)\)
  4. Update the model by \(\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t\mathbf{z}_t\)
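Algorithm 1 can be sketched in a few lines. The least-squares data, the schedule \(\eta_t = 0.5/\sqrt{t}\), and all constants below are illustrative assumptions, not prescriptions from the text.

```python
import random

# Minimal sketch of Algorithm 1 on a toy least-squares ERM problem,
#   g(w) = (1/n) * sum_i 0.5 * (x_i * w - y_i)^2  (scalar w).
random.seed(0)
n, w_true = 200, 3.0
data = []
for _ in range(n):
    x = random.uniform(-1.0, 1.0)
    data.append((x, w_true * x + random.gauss(0.0, 0.1)))

def stochastic_grad(w, sample):
    """Unbiased estimator z_t = grad g(w; zeta_t) from one sample."""
    x, y = sample
    return (x * w - y) * x

w = 0.0                                       # starting point w_1
T = 5000
for t in range(1, T + 1):
    eta_t = 0.5 / t ** 0.5                    # decaying learning-rate schedule
    zeta_t = random.choice(data)              # draw a random sample zeta_t
    w -= eta_t * stochastic_grad(w, zeta_t)   # w_{t+1} = w_t - eta_t * z_t
```

After \(T\) iterations, the iterate should be close to the minimizer of the empirical risk, up to a noise floor controlled by the step size.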

Below, we present the convergence analysis for smooth and non-smooth, convex and non-convex objective functions.

3.1.1 Smooth Convex Functions

For a point \(\mathbf{w}\), convergence is typically measured by the objective gap: \[ g(\mathbf{w}) - \min_{\mathbf{w}} g(\mathbf{w}) = g(\mathbf{w}) - g(\mathbf{w}_*), \] where \(\mathbf{w}_*\) denotes a global optimal solution. A convergence analysis aims to show that after \(T\) iterations of updates, we can obtain a solution \(\hat{\mathbf{w}}_T\) such that the expected objective gap is bounded by \[ \mathbb{E}\left[g(\hat{\mathbf{w}}_T) - g(\mathbf{w}_*)\right] \leq O\left(\frac{1}{T^{\alpha}}\right), \] for some \(\alpha > 0\). The term \(1/T^\alpha\) is referred to as the convergence rate. Accordingly, to guarantee a small objective gap \(\mathbb{E}[g(\hat{\mathbf{w}}_T) - g(\mathbf{w}_*)] \leq \epsilon\) for some \(\epsilon \ll 1\), the bound implies that \(T = O\left(\frac{1}{\epsilon^{1/\alpha}}\right)\), which is known as the iteration complexity.
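For instance (numbers chosen purely for illustration), compare the rate \(\alpha=1/2\) obtained for SGD below with a noiseless rate \(\alpha=1\):

```latex
\alpha = \tfrac{1}{2}:\quad T = O\big(\epsilon^{-2}\big),
\qquad
\alpha = 1:\quad T = O\big(\epsilon^{-1}\big).
```

Guaranteeing \(\epsilon = 10^{-3}\) then takes on the order of \(10^{6}\) iterations in the first case, but only \(10^{3}\) in the second.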

Let us first assume that \(g\) is smooth and its stochastic gradient \(\nabla g(\mathbf{w}; \zeta)\) satisfies the following assumption.

Assumption 3.2
(i) \(g(\mathbf{w})\) is \(L\)-smooth and convex; (ii) For any \(\mathbf{w}\), we have \[\mathbb{E}\big[\|\nabla g(\mathbf{w}; \zeta)- \nabla g(\mathbf{w})\|_2^2\big]\leq \sigma^2\] for some \(\sigma\geq 0\).

The following lemma is useful for convergence analysis.

Lemma 3.1
Consider the update (\(\ref{eqn:sgd}\)). For any \(\mathbf{w}\) we have \[\begin{align*} \nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w})\leq & \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_{t+1}\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2. \end{align*}\]

Proof.
Since the objective in (\(\ref{eqn:sgd-e}\)) is \(1/\eta_t\)-strongly convex and is minimized at \(\mathbf{w}_{t+1}\), following Eq. 1 in Section 1.5, for any \(\mathbf{w}\) we have \[\begin{align*} &\nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w} - \mathbf{w}_t) + \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2\\ &\geq \nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_t) + \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2+ \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_{t+1}\|_2^2. \end{align*}\] Re-arranging the inequality, we have \[\begin{align*} \nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w})\leq & \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_{t+1}\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2. \end{align*}\]

The following lemma establishes a one-step bound on the objective gap.

Lemma 3.2
Suppose Assumption 3.1 and Assumption 3.2 hold. For one step of the SGD update \(\mathbf{w}_{t+1} =\mathbf{w}_t -\eta_t\nabla g(\mathbf{w}_t; \zeta_t)\) with \(\eta_t\leq 1/L\), we have \[\begin{align*} \mathbb{E}[g(\mathbf{w}_{t+1}) -g(\mathbf{w}_*)]\leq & \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2\right] +\eta_t\sigma^2. \end{align*}\]

Proof.
From Lemma 3.1, we have

\[\begin{equation}\label{eqb:sgd-sm-1} \begin{aligned} \nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w})\leq & \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_{t+1}\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\\ & +(\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t))^{\top}(\mathbf{w}_{t+1} - \mathbf{w}). \end{aligned} \end{equation}\]

By the smoothness and convexity of \(g\), we have

\[\begin{equation}\label{eqn:scg-cross} \begin{aligned} g(\mathbf{w}_{t+1})&\leq g(\mathbf{w}_t) +\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_t) + \frac{L}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\\ &\leq g(\mathbf{w}) + \nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_ t- \mathbf{w})+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_t) + \frac{L}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\\ &\leq g(\mathbf{w})+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}) + \frac{L}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2. \end{aligned} \end{equation}\]

Combining this with (\(\ref{eqb:sgd-sm-1}\)), we have

\[\begin{equation}\label{eqn:sgd-sm-1} \begin{aligned} g(\mathbf{w}_{t+1}) -g(\mathbf{w})\leq & \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_{t+1}\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2 \\ & + \frac{L}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2 +(\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t))^{\top}(\mathbf{w}_{t+1} - \mathbf{w}). \end{aligned} \end{equation}\]

If \(\eta_t\leq 1/L\), then \(\frac{L}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\leq 0\), so plugging \(\mathbf{w}=\mathbf{w}_*\) into the above yields \[\begin{align*} g(\mathbf{w}_{t+1}) -g(\mathbf{w}_*)\leq & \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2 \\ &+(\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t))^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_*). \end{align*}\]

To bound the last term, we introduce \[ \hat{\mathbf{w}}_{t+1} = \arg\min_{\mathbf{w}} \nabla g(\mathbf{w}_t)^{\top}(\mathbf{w} - \mathbf{w}_t) + \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2. \] Note that \(\hat{\mathbf{w}}_{t+1}\) is independent of \(\zeta_t\). Then \(\mathbb{E}_{\zeta_t}[(\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t))^{\top}(\hat{\mathbf{w}}_{t+1} - \mathbf{w}_*)]=0\). Thus, we have \[\begin{align*} \mathbb{E}[g(\mathbf{w}_{t+1}) -g(\mathbf{w}_*)]\leq & \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2\right] \\ &+\mathbb{E}[(\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t))^{\top}(\mathbf{w}_{t+1} - \hat{\mathbf{w}}_{t+1})]. \end{align*}\]

Due to Lemma 1.7, we have \(\|\mathbf{w}_{t+1} - \hat{\mathbf{w}}_{t+1}\|_2\leq \eta_t\|\nabla g(\mathbf{w}_t)- \nabla g(\mathbf{w}_t; \zeta_t)\|_2\), thus \[\begin{align*} \mathbb{E}[g(\mathbf{w}_{t+1}) -g(\mathbf{w}_*)]\leq & \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2\right] +\eta_t\sigma^2. \end{align*}\]

Theorem 3.1
Suppose Assumption 3.1 and Assumption 3.2 hold. Let the learning rate \(\{\eta_t\}\) be \(\eta_t=\eta\leq 1/L\) and \(\bar{\mathbf{w}}_{T} = \frac{1}{T}\sum_{t=1}^T\mathbf{w}_{t+1}\). Then after \(T\) iterations of SGD update we have

\[\begin{align}\label{eqn:sgd-upb} \mathbb{E}\left[g(\bar{\mathbf{w}}_{T}) - g(\mathbf{w}_*)\right]\leq \frac{\|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{2\eta T} +\eta \sigma^2. \end{align}\]

If \(\eta= \min(\frac{1}{L}, \frac{\|\mathbf{w}_1 - \mathbf{w}_*\|_2}{\sqrt{2T}\sigma})\), then \[\begin{align*} \mathbb{E}\left[g(\bar{\mathbf{w}}_{T}) - g(\mathbf{w}_*)\right]\leq \frac{\sqrt{2}\sigma\|\mathbf{w}_1 - \mathbf{w}_*\|_2}{\sqrt{T}}+ \frac{L\|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{T}. \end{align*}\]

💡 Why it matters
In the convergence upper bound (\(\ref{eqn:sgd-upb}\)), the first term captures the optimization error due to the finite time horizon, while the second term represents the error induced by stochastic gradient noise.

If \(\sigma = 0\) (no noise), SGD reduces to gradient descent; a constant step size \(\eta=1/L\) can then be used, and the convergence rate becomes \(O\left(\tfrac{L\|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{T}\right)\). If \(\sigma^2>0\) (the stochastic gradient is noisy), to guarantee convergence we must let \(\eta_t\rightarrow 0\), or use a small constant value to guarantee a certain level of accuracy.

For a fixed number of iterations \(T\), a smaller variance \(\sigma\) allows for faster convergence with a larger learning rate \(\eta\) (up to a certain limit).

The iteration complexity required to achieve \(\mathbb{E}[g(\bar{\mathbf{w}}_T) - g(\mathbf{w}_*)] \leq \epsilon\) is \[ T = O\left(\max\Big(\tfrac{\sigma^2 \|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{\epsilon^2}, \tfrac{L\|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{\epsilon}\Big)\right). \] If a mini-batch of size \(B\) is used to compute the stochastic gradient at each iteration, the variance of the stochastic gradient decreases by a factor of \(B\). This implies that increasing the batch size, up to a certain point, can reduce the number of iterations needed.
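The claimed variance reduction can be checked empirically: for sampling with replacement, a mini-batch of size \(B\) has exactly \(1/B\) of the single-sample variance. The toy dataset below is an illustrative assumption.

```python
import random

# Empirical check that a mini-batch of size B (sampled with replacement)
# reduces the variance of the stochastic gradient by a factor of B.
random.seed(2)
n = 500
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
w = 0.3

def grad(sample, w):
    x, y = sample
    return (x * w - y) * x          # gradient of 0.5 * (x*w - y)^2

full = sum(grad(s, w) for s in data) / n   # full empirical-risk gradient

def empirical_var(B, trials=20000):
    """Monte Carlo variance of the size-B mini-batch gradient."""
    acc = 0.0
    for _ in range(trials):
        batch = [random.choice(data) for _ in range(B)]
        gb = sum(grad(s, w) for s in batch) / B
        acc += (gb - full) ** 2
    return acc / trials

ratio = empirical_var(1) / empirical_var(16)   # should be close to 16
```

The measured ratio of variances between \(B=1\) and \(B=16\) should concentrate near \(16\), matching the factor-of-\(B\) reduction used in the complexity discussion above.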

Finally, the result also highlights that the initial learning rate \(\eta\) cannot be too large; in practice, an excessively large initial learning rate may cause the algorithm to diverge.

Proof.

If \(\eta_t=\eta\), summing the inequalities in Lemma 3.2 over \(t=1,\ldots,T\), we have \[ \begin{align*} \mathbb{E}\bigg[\sum_{t=1}^T(g(\mathbf{w}_{t+1})-g(\mathbf{w}_*))\bigg] \leq &\ \mathbb{E}\left[\sum_{t=1}^T\frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_t\|_2^2-\frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_{t+1}\|_2^2\right] \\ &+T\eta\sigma^2. \end{align*} \]

The first term in \([\cdot]\) is a telescoping series, \[ \begin{align*} &\sum_{t=1}^T\frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_t\|_2^2-\frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_{t+1}\|_2^2 \leq \frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_1\|_2^2-\frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_{T+1}\|_2^2 \\ &\leq \frac{1}{2\eta}\|\mathbf{w}_*-\mathbf{w}_1\|_2^2. \end{align*} \]

As a result, \[ \begin{align*} \mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T(g(\mathbf{w}_{t+1})-g(\mathbf{w}_*))\bigg] \leq &\ \frac{1}{2\eta T}\|\mathbf{w}_*-\mathbf{w}_1\|_2^2+\eta\sigma^2. \end{align*} \] By the convexity of \(g\) and Jensen's inequality, \(g(\bar{\mathbf{w}}_T)\leq \frac{1}{T}\sum_{t=1}^T g(\mathbf{w}_{t+1})\), which concludes the proof of the first part of the theorem.

For the second part, optimizing the upper bound over \(\eta\) gives \(\eta_*=\frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2}{\sqrt{2T}\sigma}\). If \(\eta_*\leq 1/L\), i.e., \(T\geq\frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2^2L^2}{2\sigma^2}\), we set \(\eta=\eta_*\), then \[ \mathbb{E}\left[g(\bar{\mathbf{w}}_{T})-g(\mathbf{w}_*)\right] \leq \frac{2\sigma\|\mathbf{w}_1-\mathbf{w}_*\|_2}{\sqrt{2T}}. \]

If \(\eta_*>1/L\), i.e., \(\sigma^2\leq\frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2^2L^2}{2T}\), we set \(\eta=1/L\), then \[ \mathbb{E}\left[g(\bar{\mathbf{w}}_{T})-g(\mathbf{w}_*)\right] \leq \frac{L\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{2T}+\frac{L\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{2T} =\frac{L\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{T}. \]

3.1.2 Non-smooth Convex Functions

Now, let us consider the SGD update (\(\ref{eqn:sgd-ns}\)) for non-smooth convex functions under the following assumption.

Assumption 3.3
(i) \(g(\mathbf{w})\) is convex; (ii) For any \(\mathbf{w}\), we have \(\mathbb{E}[\|\mathcal{G}(\mathbf{w}; \zeta)\|_2^2]\leq G^2\).

Lemma 3.3
Suppose Assumption 3.1 and Assumption 3.3 hold. For one step SGD update \(\mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t\mathcal{G}(\mathbf{w}_t;\zeta_t)\), we have \[\begin{align*} \mathbb{E}[g(\mathbf{w}_t)-g(\mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_*-\mathbf{w}_t\|_2^2-\frac{1}{2\eta_t}\|\mathbf{w}_*-\mathbf{w}_{t+1}\|_2^2\right]+\frac{\eta_t}{2}G^2. \end{align*}\]

Proof.
From Lemma 3.1, we have

\[\begin{equation}\label{eqb:sgd-nsm-1} \begin{aligned} \mathcal{G}(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)&\leq \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\\ & +\mathcal{G}(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_{t+1})\\ & \leq \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2 +\frac{\eta_t}{2}\|\mathcal{G}(\mathbf{w}_t; \zeta_t)\|_2^2, \end{aligned} \end{equation}\]

where the last inequality uses Young’s inequality. Taking expectation on both sides, we have

\[\begin{equation}\label{eqb:sgd-nsm-2} \begin{aligned} \mathbb{E}[\mathcal{G}(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2\right] +\frac{\eta_t}{2}G^2. \end{aligned} \end{equation}\]

Since \(\mathbf{w}_t\) is independent of \(\zeta_t\), we have \(\mathbb{E}[\mathcal{G}(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]=\mathbb{E}[\mathbf{v}_t^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\) for some \(\mathbf{v}_t\in\partial g(\mathbf{w}_t)\). By convexity, \[\begin{equation}\label{eqb:sgd-nsm-3} \begin{aligned} \mathbb{E}[g(\mathbf{w}_t) - g(\mathbf{w}_*)]\leq \mathbb{E}[\mathbf{v}_t^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)] \leq \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_t\|_2^2 - \frac{1}{2\eta_t}\|\mathbf{w}_* - \mathbf{w}_{t+1}\|_2^2\right] +\frac{\eta_t}{2}G^2. \end{aligned} \end{equation}\]

Theorem 3.2
Suppose Assumption 3.1 and Assumption 3.3 hold. Let the learning rate \(\{\eta_t\}\) be \(\eta_t=\eta\) and \(\bar{\mathbf{w}}_{T} = \frac{1}{T}\sum_{t=1}^T\mathbf{w}_{t}\). Then after \(T\) iterations of SGD update (\(\ref{eqn:sgd-ns}\)) we have \[\begin{align*} \mathbb{E}\left[g(\bar{\mathbf{w}}_{T}) - g(\mathbf{w}_*)\right]\leq \frac{\|\mathbf{w}_1 - \mathbf{w}_*\|_2^2}{2\eta T} +\frac{\eta G^2}{2}. \end{align*}\] If \(\eta= \frac{\|\mathbf{w}_1 - \mathbf{w}_*\|_2}{\sqrt{T}G}\), then \[\begin{align*} \mathbb{E}\left[g(\bar{\mathbf{w}}_{T}) - g(\mathbf{w}_*)\right]\leq \frac{G\|\mathbf{w}_1 - \mathbf{w}_*\|_2}{\sqrt{T}}. \end{align*}\]
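Theorem 3.2's recipe (a constant step \(\eta = c/\sqrt{T}\) and the averaged iterate) can be sketched on a toy non-smooth problem \(g(w)=\mathbb{E}_\zeta|w-\zeta|\); the distribution of \(\zeta\) and the constants below are illustrative assumptions.

```python
import math
import random

# Stochastic subgradient descent on g(w) = E_zeta[|w - zeta|],
# with zeta uniform on {0.9, 1.1}; any w in [0.9, 1.1] is a minimizer.
random.seed(3)

def subgrad(w, zeta):
    """A subgradient of |w - zeta| with respect to w."""
    if w > zeta:
        return 1.0
    if w < zeta:
        return -1.0
    return 0.0

T = 20000
eta = 1.0 / math.sqrt(T)          # constant step of order 1/sqrt(T)
w, avg = 0.0, 0.0
for t in range(T):
    avg += w / T                  # accumulates (1/T) * sum_t w_t
    zeta = random.choice([0.9, 1.1])
    w -= eta * subgrad(w, zeta)
# the averaged iterate avg should land near the minimizing interval
```

Note that the last iterate keeps oscillating because the subgradient does not vanish near the minimizer; the averaging is what stabilizes the output, consistent with the \(O(1/\sqrt{T})\) analysis.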

💡 Why it matters
The theorem above highlights the key difference between the convergence of SGD for smooth and for non-smooth convex functions: even with zero variance in the stochastic subgradient, the convergence rate is still \(O(1/\sqrt{T})\). The reason is that for smooth convex functions, \(g(\mathbf{w})\rightarrow g(\mathbf{w}_*)\) implies \(\nabla g(\mathbf{w})\rightarrow 0\), which is not true for non-smooth convex functions.

Proof.
The proof mirrors the smooth case: summing the inequalities in Lemma 3.3 over \(t=1,\ldots,T\), telescoping the distance terms, dividing by \(T\), and applying Jensen's inequality \(g(\bar{\mathbf{w}}_T)\leq \frac{1}{T}\sum_{t=1}^T g(\mathbf{w}_t)\) yields the first bound; plugging in the stated \(\eta\) gives the second.

3.1.3 Smooth Non-Convex Functions

For a non-convex function, it is generally NP-hard to find a global optimal solution. Hence, our goal here is to establish the complexity of SGD for finding an \(\epsilon\)-stationary solution with \(\epsilon\ll 1\), as defined below.

Definition 3.1 (\(\epsilon\)-stationary solution).
\(\mathbf{w}\) is an \(\epsilon\)-stationary solution to \(\min_{\mathbf{w}}g(\mathbf{w})\), if \(\|\nabla g(\mathbf{w})\|_2\leq \epsilon\).

Assumption 3.4
(i) \(g(\mathbf{w})\) is \(L\)-smooth and non-convex; (ii) For any \(\mathbf{w}\), we have \[\mathbb{E}\big[\|\nabla g(\mathbf{w}; \zeta)- \nabla g(\mathbf{w})\|_2^2\big]\leq \sigma^2\] for some \(\sigma\geq 0\).

Theorem 3.3
Suppose Assumption 3.1 and Assumption 3.4 hold. Let the learning rate \(\{\eta_t\}\) be \(\eta_t = \min\{\frac{1}{L}, \frac{D}{\sigma \sqrt{T}}\}\) for some constant \(D>0\). Let \(\tau\in\{1, \ldots, T\}\) be a random sample following the distribution \(\Pr(\tau=t) = \frac{1}{T}\). Then we have \[\begin{align*} \mathbb{E}[\|\nabla g(\mathbf{w}_\tau)\|_2^2]\leq \frac{2L(g(\mathbf{w}_1) - g(\mathbf{w}_*))}{T } + \left(\frac{2(g(\mathbf{w}_1) - g(\mathbf{w}_*))}{D} + D L\right)\frac{\sigma}{\sqrt{T}}. \end{align*}\]

Proof.

For brevity of notation, we let \(\nabla g_t(\mathbf{w}_t)=\nabla g(\mathbf{w}_t;\zeta_t)\). Due to the \(L\)-smoothness of \(g\), we have \[ \begin{align*} &g(\mathbf{w}_{t+1})\leq g(\mathbf{w}_t)+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1}-\mathbf{w}_t)+\frac{L}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2\\ &=g(\mathbf{w}_t)-\eta_t\nabla g(\mathbf{w}_t)^{\top}\nabla g_t(\mathbf{w}_t)+\frac{\eta_t^2L}{2}\|\nabla g_t(\mathbf{w}_t)\|_2^2\\ &=g(\mathbf{w}_t)-\eta_t\|\nabla g(\mathbf{w}_t)\|_2^2+\eta_t\nabla g(\mathbf{w}_t)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g_t(\mathbf{w}_t))+\frac{\eta_t^2L}{2}\|\nabla g_t(\mathbf{w}_t)\|_2^2\\ &=g(\mathbf{w}_t)-\eta_t\|\nabla g(\mathbf{w}_t)\|_2^2+\eta_t\nabla g(\mathbf{w}_t)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g_t(\mathbf{w}_t))\\ &\quad+\frac{\eta_t^2L}{2}\|\nabla g_t(\mathbf{w}_t)-\nabla g(\mathbf{w}_t)+\nabla g(\mathbf{w}_t)\|_2^2\\ &=g(\mathbf{w}_t)-\left(\eta_t-\frac{\eta_t^2L}{2}\right)\|\nabla g(\mathbf{w}_t)\|_2^2+(\eta_t-\eta_t^2L)\nabla g(\mathbf{w}_t)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g_t(\mathbf{w}_t))\\ &\quad+\frac{\eta_t^2L}{2}\|\nabla g_t(\mathbf{w}_t)-\nabla g(\mathbf{w}_t)\|_2^2. \end{align*} \]

Taking expectation over \(\zeta_t\) given \(\mathbf{w}_t\) on both sides, we have

\[ \begin{align}\label{eq:ncvx_base} \mathbb E_{\zeta_t}\left[g(\mathbf{w}_{t+1})\right] \leq g(\mathbf{w}_t)-\left(\eta_t-\frac{\eta_t^2L}{2}\right)\|\nabla g(\mathbf{w}_t)\|_2^2+\frac{\eta_t^2L}{2}\sigma^2. \end{align} \]

Telescoping this from \(t=1\) to \(T\) gives \[ \begin{align*} \mathbb E\left[\sum_{t=1}^T\left(\eta_t-\frac{\eta_t^2L}{2}\right)\|\nabla g(\mathbf{w}_t)\|_2^2\right] \leq g(\mathbf{w}_1)-g(\mathbf{w}_*)+\sum_{t=1}^T\frac{\eta_t^2L}{2}\sigma^2. \end{align*} \]

As a result, \[ \begin{align*} \mathbb E\left[\|\nabla g(\mathbf{w}_{\tau})\|_2^2\right] \leq \frac{g(\mathbf{w}_1)-g(\mathbf{w}_*)}{\sum_{t=1}^T\left(\eta_t-\frac{\eta_t^2L}{2}\right)} +\frac{\sum_{t=1}^T\eta_t^2L}{2\sum_{t=1}^T\left(\eta_t-\frac{\eta_t^2L}{2}\right)}\sigma^2. \end{align*} \]

Plugging the value \(\eta_t=\min\left(\frac{1}{L},\frac{D}{\sigma\sqrt{T}}\right)\), we have \[ \begin{align*} &\mathbb E\left[\|\nabla g(\mathbf{w}_{\tau})\|_2^2\right] \leq \frac{g(\mathbf{w}_1)-g(\mathbf{w}_*)}{T\left(\eta_1-\frac{\eta_1^2L}{2}\right)} +\frac{T\eta_1^2L}{2T\left(\eta_1-\frac{\eta_1^2L}{2}\right)}\sigma^2\\ &\leq \frac{2(g(\mathbf{w}_1)-g(\mathbf{w}_*))}{T\eta_1}+\eta_1L\sigma^2\\ &\leq \max\left(\frac{2L(g(\mathbf{w}_1)-g(\mathbf{w}_*))}{T}, \frac{2(g(\mathbf{w}_1)-g(\mathbf{w}_*))\sigma}{D\sqrt{T}}\right) +\frac{D\sigma L}{\sqrt{T}}\\ &\leq \frac{2L(g(\mathbf{w}_1)-g(\mathbf{w}_*))}{T} +\left(\frac{2(g(\mathbf{w}_1)-g(\mathbf{w}_*))}{D}+DL\right)\frac{\sigma}{\sqrt{T}}. \end{align*} \]

If we set \(\eta_t=\min\left(\frac{1}{L},\frac{D}{\sigma\sqrt{t}}\right)\), then \(\sum_{t=1}^T\eta_t\geq\Omega(\sqrt{T})\) and \(\sum_{t=1}^T\eta_t^2\leq O(\log T)\), hence \(\mathbb E\left[\|\nabla g(\mathbf{w}_{\tau})\|_2^2\right]\leq O(\log T/T)\).
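Theorem 3.3's output rule (return a uniformly sampled iterate \(\mathbf{w}_\tau\)) can be sketched on a smooth non-convex toy objective; the objective \(g(w) = w^4/4 - w^2/2\), the noise level, and all constants are illustrative assumptions.

```python
import math
import random

# SGD on the smooth non-convex g(w) = w^4/4 - w^2/2 with a noisy but
# unbiased gradient estimator, returning a uniformly sampled iterate.
random.seed(4)

def grad(w):
    return w ** 3 - w        # gradient; stationary points at -1, 0, 1

sigma, D, T = 0.1, 1.0, 5000
L = 2.0                      # smoothness scale near the minima (assumed)
eta = min(1.0 / L, D / (sigma * math.sqrt(T)))   # Theorem 3.3's step size

w, iterates = 0.5, []
for t in range(T):
    iterates.append(w)
    w -= eta * (grad(w) + random.gauss(0.0, sigma))  # unbiased estimator

w_tau = random.choice(iterates)   # tau uniform on {1, ..., T}
# the squared gradient at w_tau should typically be small
```

Sampling \(\tau\) uniformly is how the analysis converts the averaged bound on \(\sum_t \|\nabla g(\mathbf{w}_t)\|_2^2\) into a guarantee for a single returned point.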

3.1.4 Non-smooth Weakly Convex Functions

Next, let us extend the analysis to non-smooth non-convex functions. Consider a function \(g: \mathbb{R}^d\mapsto \mathbb{R}\) and a point \(\mathbf{w}\in\mathbb{R}^d\) with \(g(\mathbf{w})\) finite. The Fréchet subdifferential of \(g\) at \(\mathbf{w}\), denoted \(\partial g(\mathbf{w})\), consists of all vectors \(\mathbf{v}\) satisfying \[ g(\mathbf{w}') \geq g(\mathbf{w}) + \mathbf{v}^{\top}(\mathbf{w}' -\mathbf{w}) + o(\|\mathbf{w}' - \mathbf{w}\|_2) \text{ as } \mathbf{w}' \rightarrow \mathbf{w}. \] We consider a family of non-convex functions, namely weakly convex functions. A lower semi-continuous function \(g\) is called \(\rho\)-weakly convex if there exists \(\rho>0\) such that \[\begin{align*} g(\mathbf{w}) \geq g(\mathbf{w}') + \mathbf{v}^{\top}(\mathbf{w} - \mathbf{w}') - \frac{\rho}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2,\quad \forall \mathbf{w}, \mathbf{w}', \mathbf{v}\in\partial g(\mathbf{w}'). \end{align*}\] It is easy to show that if \(g\) is \(\rho\)-weakly convex, then \(g(\mathbf{w}) + \frac{\rho}{2}\|\mathbf{w}\|_2^2\) is a convex function of \(\mathbf{w}\). A smooth function is weakly convex, but the reverse is not necessarily true.

Examples of weakly convex functions

Example 1: Compositional functions

Let \(F(\mathbf{x}) = f(g(\mathbf{x}))\). If \(f\) is convex and \(G_1\)-Lipschitz continuous and \(g(\mathbf{x})\) is \(L_2\)-smooth, then \(F\) is \(\rho\)-weakly convex for some \(\rho>0\). We will prove this in Section 5.3. The OCE risk is a special case when \(\phi^*\) is non-smooth and the loss function \(\ell(\mathbf{w}; \mathbf{z})\) is smooth non-convex.

Example 2: Compositional functions

Let \(F(\mathbf{x}) = f(g(\mathbf{x}))\). If \(f\) is \(L_1\)-smooth and monotonically non-decreasing and \(g(\mathbf{x})\) is non-smooth convex and \(G_2\)-Lipschitz continuous, then \(F\) is \(\rho\)-weakly convex for some \(\rho>0\).

Let us prove it. Since \(f\) is \(L_1\)-smooth, for any \(\mathbf{w},\mathbf{v}\in\mathbb{R}^d\), \[ f(g(\mathbf{w})) \geq f(g(\mathbf{v})) + f'(g(\mathbf{v}))(g(\mathbf{w})-g(\mathbf{v})) - \frac{L_1}{2}|g(\mathbf{w})-g(\mathbf{v})|^2. \] Since \(g\) is convex, for any \(\mathbf{v}_g\in\partial g(\mathbf{v})\) we have \(g(\mathbf{w}) \geq g(\mathbf{v}) + \mathbf{v}_g^\top (\mathbf{w} - \mathbf{v})\); then \[\begin{align*} f(g(\mathbf{w})) - f(g(\mathbf{v})) \geq & f'(g(\mathbf{v}))\mathbf{v}_g^\top (\mathbf{w} - \mathbf{v}) - \frac{L_1}{2}|g(\mathbf{w})-g(\mathbf{v})|^2\\ \geq & f'(g(\mathbf{v}))\mathbf{v}_g^\top (\mathbf{w} - \mathbf{v}) - \frac{G_2^2L_1}{2}\|\mathbf{w}-\mathbf{v}\|_2^2, \end{align*}\] where the first inequality uses \(f'(g(\mathbf{v}))\geq 0\) (monotonicity of \(f\)), and the second uses the Lipschitz continuity of \(g\), i.e., \(|g(\mathbf{w})-g(\mathbf{v})| \leq G_2\|\mathbf{w}-\mathbf{v}\|_2\). That is, \(f(g(\mathbf{w}))\) is \(G_2^2L_1\)-weakly convex.

An important application of this function in machine learning is optimizing a truncation of a non-negative convex loss \(g(\mathbf{w})=\ell(\mathbf{w}; \mathbf{z})\geq 0\) with a smooth truncation function \(f(\ell(\mathbf{w}; \mathbf{z})) =\alpha \log(1+ \frac{\ell(\mathbf{w}; \mathbf{z})}{\alpha})\) for some \(\alpha>0\), which is useful for tackling heavy-tailed data distributions.
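A minimal sketch of this truncation, assuming a squared loss \(\ell(w;\mathbf{z})=\frac{1}{2}(w-y)^2\) and \(\alpha=1\) (both illustrative choices): the chain rule gives the gradient \(f'(\ell)\nabla\ell\), which stays bounded even on extreme outliers.

```python
import math

# Smooth truncation f(l) = alpha * log(1 + l/alpha) applied to a
# non-negative convex squared loss; gradients of outliers are damped.
alpha = 1.0

def loss(w, y):
    return 0.5 * (w - y) ** 2

def truncated(w, y):
    return alpha * math.log(1.0 + loss(w, y) / alpha)

def truncated_grad(w, y):
    # chain rule: f'(l) * l'(w), with f'(l) = 1 / (1 + l / alpha)
    return (w - y) / (1.0 + loss(w, y) / alpha)

g_raw = 100.0 - 0.0                    # raw-loss gradient at w=100, y=0
g_trunc = truncated_grad(100.0, 0.0)   # truncated gradient stays tiny
```

This damping of large per-sample gradients is why the truncation helps under heavy-tailed data, at the price of losing convexity (the composition is only weakly convex).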


Nearly \(\epsilon\)-stationary solution

When \(g(\cdot)\) is non-smooth, finding an \(\epsilon\)-stationary solution such that \(\|\nabla g(\mathbf{w})\|_2\leq \epsilon\) is difficult even for a convex function. Consider \(\min_{w}|w|\): the only stationary point is \(w_*=0\), and any \(w\neq 0\) is not an \(\epsilon\)-stationary solution (\(\epsilon<1\)) no matter how close \(w\) is to \(0\). To address this issue, we introduce a weak notion of \(\epsilon\)-stationary solution, termed nearly \(\epsilon\)-stationary solution.

Definition 3.2 (Nearly \(\epsilon\)-stationary solution).
\(\mathbf{w}\) is a nearly \(\epsilon\)-stationary solution to \(\min_{\mathbf{w}}g(\mathbf{w})\), if there exists \(\hat{\mathbf{w}}\) such that \(\|\mathbf{w} - \hat{\mathbf{w}}\|\leq O(\epsilon)\) and \(\mathrm{dist}(0, \partial g(\hat{\mathbf{w}}))\leq \epsilon\).

A useful tool for deriving a nearly \(\epsilon\)-stationary solution is the Moreau envelope of \(g\):

\[\begin{align}\label{eqn:moreau} g_{\lambda}(\mathbf{w}) &:= \min_{\mathbf{u}} g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2. \end{align}\]

Define \[\begin{align} \mathrm{prox}_{\lambda g}(\mathbf{w}) := \arg\min_{\mathbf{u}} g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2. \end{align}\]

An example of a weakly convex function and its Moreau envelope is illustrated in Figure 3.1.

Fig. 3.1: Moreau envelope of \(g(x)=|x^2-1|\) with \(\lambda=0.2\).
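The envelope in Fig. 3.1 can be reproduced numerically by discretizing the inner minimization in (\(\ref{eqn:moreau}\)); the grid resolution below is an implementation assumption.

```python
# Numerical sketch of the Moreau envelope of g(x) = |x^2 - 1| with
# lambda = 0.2 (the setting of Fig. 3.1), approximating the inner
# minimization over u by a grid search.
lam = 0.2

def g(x):
    return abs(x * x - 1.0)

grid = [i / 1000.0 for i in range(-3000, 3001)]   # u in [-3, 3]

def g_moreau(w):
    return min(g(u) + (u - w) ** 2 / (2.0 * lam) for u in grid)

# The envelope lower-bounds g and attains the same minimum value
# (zero at x = +/- 1), consistent with Proposition 3.2 below.
pairs = [(g_moreau(x), g(x)) for x in (-1.5, -1.0, 0.0, 0.7, 1.0, 2.0)]
```

Evaluating both functions on a few points shows \(g_\lambda(x)\leq g(x)\) everywhere while the minima coincide, which is exactly the property that makes \(\|\nabla g_\lambda\|_2\) a usable convergence measure.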

Proposition 3.1
Consider a \(\rho\)-weakly convex function \(g(\cdot)\). Then for any \(\lambda\in(0, \rho^{-1})\), the Moreau envelope \(g_{\lambda}(\cdot)\) is \(\frac{2-\lambda\rho}{\lambda(1-\lambda\rho)}\)-smooth, with gradient given by \[\begin{align*} \nabla g_{\lambda}(\mathbf{w}) = \frac{1}{\lambda}(\mathbf{w} -\mathrm{prox}_{\lambda g}(\mathbf{w})). \end{align*}\]

Proof.
First, when \(\lambda<\rho^{-1}\), the function \(g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2\) is \((\frac{1}{\lambda} - \rho)\)-strongly convex in \(\mathbf{u}\). Hence the solution \(\operatorname{prox}_{\lambda g}(\mathbf{w})\) is unique for a given \(\mathbf{w}\). Up to an additive constant independent of \(\mathbf{u}\), we can also write \(\operatorname{prox}_{\lambda g}(\mathbf{w})\) as \[\begin{align*} \operatorname{prox}_{\lambda g}(\mathbf{w})& = \arg\min_{\mathbf{u}} g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2 \\ &= \arg\min_{\mathbf{u}} \underbrace{g(\mathbf{u}) + \frac{\rho}{2}\|\mathbf{u}\|_2^2}_{r(\mathbf{u})} + \frac{1}{2}\bigg(\frac{1}{\lambda} - \rho\bigg)\bigg\|\mathbf{u} - \frac{1}{1-\lambda\rho}\mathbf{w}\bigg\|_2^2. \end{align*}\] Due to Lemma 1.7, we have \(\|\operatorname{prox}_{\lambda g}(\mathbf{w}) - \operatorname{prox}_{\lambda g}(\mathbf{w}')\|_2\leq \frac{1}{1-\lambda\rho}\|\mathbf{w} - \mathbf{w}'\|_2\). Then \[\begin{align*} &\|\nabla g_{\lambda}(\mathbf{w}) -\nabla g_{\lambda}(\mathbf{w}')\|_2 = \frac{1}{\lambda}\|(\mathbf{w} -\operatorname{prox}_{\lambda g}(\mathbf{w})) - (\mathbf{w}' - \operatorname{prox}_{\lambda g}(\mathbf{w}'))\|_2\\ &\leq \frac{1}{\lambda}\left(\|\mathbf{w} - \mathbf{w}'\|_2 + \frac{1}{1 -\lambda\rho}\|\mathbf{w} - \mathbf{w}'\|_2\right) = \frac{2-\lambda\rho}{\lambda(1-\lambda\rho)}\|\mathbf{w} - \mathbf{w}'\|_2. \end{align*}\]

With the Moreau envelope, we can use the norm of its gradient to measure the convergence for optimizing the original function.

Proposition 3.2
If \(\lambda<\rho^{-1}\), we have

\[\begin{align} g_{\lambda}(\mathbf{w})\leq g(\mathbf{w}), \quad \min_{\mathbf{w}}g_\lambda(\mathbf{w}) = \min_{\mathbf{w}} g(\mathbf{w}). \end{align}\]

If \(\|\nabla g_{\lambda}(\mathbf{w})\|_2\leq \epsilon\), then \(\hat{\mathbf{w}} = \operatorname{prox}_{\lambda g}(\mathbf{w})\) is a nearly \(\epsilon\)-stationary solution. In particular,

\[\begin{equation}\label{eqn:nearly-stationary} \begin{aligned} &\|\hat{\mathbf{w}} - \mathbf{w}\|_2 =\lambda\|\nabla g_{\lambda}(\mathbf{w})\|_2\leq \lambda\epsilon,\\ &\operatorname{dist}(0, \partial g(\hat{\mathbf{w}})) \leq \|\nabla g_{\lambda}(\mathbf{w})\|_2\leq \epsilon. \end{aligned} \end{equation}\]

Proof.
The inequality \(g_{\lambda}(\mathbf{w})\leq g(\mathbf{w})\) follows from the definition of \(g_\lambda(\mathbf{w})\) by taking \(\mathbf{u}=\mathbf{w}\); in particular, \(g_{\lambda}(\mathbf{w}_*)\leq g(\mathbf{w}_*)\). To prove that the minima are equal, note that for any \(\mathbf{w}\), \[\begin{align*} g_{\lambda}(\mathbf{w}) = \min_{\mathbf{u}} g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2\geq \min_{\mathbf{u}}g(\mathbf{u}) = g(\mathbf{w}_*). \end{align*}\] Since \(\nabla g_{\lambda}(\mathbf{w}) = \frac{1}{\lambda}(\mathbf{w} - \hat{\mathbf{w}})\), the first inequality in (\(\ref{eqn:nearly-stationary}\)) follows. For the second, the first-order optimality condition of \(\min_{\mathbf{u}} g(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2\) at \(\hat{\mathbf{w}}\) gives \(0\in\partial g(\hat{\mathbf{w}}) + \frac{1}{\lambda}(\hat{\mathbf{w}} - \mathbf{w})\), i.e., \(\nabla g_{\lambda}(\mathbf{w}) = \frac{1}{\lambda}(\mathbf{w} - \hat{\mathbf{w}})\in \partial g(\hat{\mathbf{w}})\).

💡 Why it matters
Proposition 3.2 shows that if we can make \(\|\nabla g_{\lambda}(\mathbf{w})\|_2\) small, then \(\mathbf{w}\) is close to an \(\epsilon\)-stationary solution \(\hat{\mathbf{w}}\) of the original function \(g(\mathbf{w})\). The smaller \(\lambda\) is, the closer \(\mathbf{w}\) is to \(\hat{\mathbf{w}}\).

Convergence Analysis

Assumption 3.5
(i) \(g(\mathbf{w})\) is \(\rho\)-weakly convex; (ii) For any \(\mathbf{w}\), \(\mathbb{E}_\zeta[\|\mathcal{G}(\mathbf{w}, \zeta)\|_2^2]\leq G^2\) for some \(G\geq 0\).

Lemma 3.4
Let us consider an update \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\mathbf{z}_t\). If \(\mathbb{E}_t[\mathbf{z}_t]=\mathcal{M}_t\) and \(\mathbb{E}_t[\|\mathbf{z}_t\|_2^2]\leq G^2\), then we have \[\begin{align*} \mathbb{E}_t[g_{\lambda}(\mathbf{w}_{t+1})]\leq g_{\lambda}(\mathbf{w}_t) + \frac{\eta_t}{\lambda}(\hat{\mathbf{w}}_t - \mathbf{w}_t)^{\top}\mathcal{M}_t + \frac{\eta_t^2G^2}{2\lambda}, \end{align*}\] where \(\hat{\mathbf{w}}_t = \operatorname{prox}_{\lambda g}(\mathbf{w}_t)\).

Proof.
We have \[\begin{align*} g_{\lambda}(\mathbf{w}_{t+1})& = g(\hat{\mathbf{w}}_{t+1}) + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_{t+1} - \mathbf{w}_{t+1}\|_2^2 \leq g(\hat{\mathbf{w}}_{t}) + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_{t} - \mathbf{w}_{t+1}\|_2^2\\ & = g(\hat{\mathbf{w}}_{t}) + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_{t} - \mathbf{w}_{t}\|_2^2- \frac{1}{2\lambda}\|\hat{\mathbf{w}}_{t} - \mathbf{w}_{t}\|_2^2 + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_{t} - \mathbf{w}_{t+1}\|_2^2. \end{align*}\] Merging the first two terms we get \(g_\lambda(\mathbf{w}_{t})\), and using the three-point equality \(2(a-b)^{\top}(b-c) = \|a-c\|_2^2 - \|a-b\|_2^2 - \|b - c\|_2^2\) to merge the last two terms we get \[\begin{align*} g_{\lambda}(\mathbf{w}_{t+1})& \leq g_{\lambda}(\mathbf{w}_t) + \frac{1}{\lambda}(\hat{\mathbf{w}}_t - \mathbf{w}_t)^{\top}(\mathbf{w}_t - \mathbf{w}_{t+1}) + \frac{1}{2\lambda}\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2\\ & = g_{\lambda}(\mathbf{w}_t) + \frac{1}{\lambda}(\hat{\mathbf{w}}_t - \mathbf{w}_t)^{\top}\eta_t\mathbf{z}_t + \frac{\eta_t^2}{2\lambda}\|\mathbf{z}_t \|_2^2. \end{align*}\]

Taking expectation over \(\zeta_t\) given \(\mathbf{w}_t\) on both sides, we have \[\begin{align*} \mathbb{E}_t[g_{\lambda}(\mathbf{w}_{t+1})]\leq g_{\lambda}(\mathbf{w}_t) + \frac{1}{\lambda}(\hat{\mathbf{w}}_t - \mathbf{w}_t)^{\top}\eta_t\mathcal{M}_t + \frac{\eta_t^2G^2}{2\lambda}. \end{align*}\]

Lemma 3.5
Suppose Assumption 3.1 and Assumption 3.5 hold. Under the setting of Lemma 3.4 with \(\mathbf{z}_t = \mathcal{G}(\mathbf{w}_t; \zeta_t)\) and \(\mathcal{M}_t\in\partial g(\mathbf{w}_t)\), we have \[\begin{align*} \eta_t(1 - \lambda\rho)\|\nabla g_\lambda(\mathbf{w}_t)\|_2^2\leq g_{\lambda}(\mathbf{w}_t) - \mathbb{E}_t[g_{\lambda}(\mathbf{w}_{t+1})] + \frac{\eta_t^2G^2}{2\lambda}. \end{align*}\]

Proof.
Due to the weak convexity of \(g\), for any \(\mathcal{M}_t\in \partial g(\mathbf{w}_t)\), we have \[\begin{align*} &\mathcal{M}_t^{\top}(\mathbf{w}_{t} - \hat{\mathbf{w}}_t)\geq g(\mathbf{w}_t) - g(\hat{\mathbf{w}}_t)- \frac{\rho}{2}\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2\\ &=(g(\mathbf{w}_t) + \frac{1}{2\lambda}\|\mathbf{w}_t - \mathbf{w}_t\|_2^2) - (g(\hat{\mathbf{w}}_t) + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2) + (\frac{1}{2\lambda} - \frac{\rho}{2})\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2. \end{align*}\] Since \(h(\mathbf{w})=g(\mathbf{w}) + \frac{1}{2\lambda}\|\mathbf{w} - \mathbf{w}_t\|_2^2\) is \((1/\lambda - \rho)\)-strongly convex and \(\hat{\mathbf{w}}_t=\arg\min h(\mathbf{w})\), then applying Lemma 1.6 (a), we get \[ (g(\mathbf{w}_t) + \frac{1}{2\lambda}\|\mathbf{w}_t - \mathbf{w}_t\|_2^2) - (g(\hat{\mathbf{w}}_t) + \frac{1}{2\lambda}\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2)\geq (\frac{1}{2\lambda} - \frac{\rho}{2})\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2. \] Combining the above two inequalities we have \[\begin{align*} \mathcal{M}_t^{\top}(\mathbf{w}_{t} - \hat{\mathbf{w}}_t) \geq (\lambda - \lambda^2\rho)\|\nabla g_\lambda(\mathbf{w}_t)\|_2^2. \end{align*}\] Plugging this into the inequality in Lemma 3.4, we have \[\begin{align*} \eta_t(1 - \lambda\rho)\|\nabla g_\lambda(\mathbf{w}_t)\|_2^2\leq g_{\lambda}(\mathbf{w}_t) - \mathbb{E}_t[g_{\lambda}(\mathbf{w}_{t+1})] + \frac{\eta_t^2G^2}{2\lambda}. \end{align*}\]

Theorem 3.4
Suppose Assumption 3.1 and Assumption 3.5 hold, and the learning rate \(\{\eta_t\}\) is set to \(\eta_t = \frac{C}{\sqrt{T}}\) for some constant \(C>0\). Let \(\tau\in\{1, \ldots, T\}\) be a random sample following a distribution \(\Pr(\tau=t) = \frac{1}{T}\). Then for any \(\lambda\in(0, \rho^{-1})\), we have \[\begin{align*} \mathbb{E}[\|\nabla g_\lambda(\mathbf{w}_\tau)\|_2^2]\leq \frac{g(\mathbf{w}_1) - g(\mathbf{w}_*)}{(1-\lambda\rho)C\sqrt{T}} + \frac{C G^2}{2\lambda(1-\lambda\rho)\sqrt{T}}. \end{align*}\]

Proof.
Summing up the inequalities in Lemma 3.5 over \(t=1,\ldots, T\) and taking expectation over all randomness, we have \[\begin{align*} \mathbb{E}\left[\sum_{t=1}^T\eta_t(1 - \lambda\rho)\|\nabla g_\lambda(\mathbf{w}_t)\|_2^2\right]\leq g(\mathbf{w}_1) - g(\mathbf{w}_*) + \sum_{t=1}^T\frac{\eta_t^2G^2}{2\lambda}, \end{align*}\] where we have used \(g_\lambda(\mathbf{w}_1)\leq g(\mathbf{w}_1)\) and \(\mathbb{E}[g_{\lambda}(\mathbf{w}_{T+1})]\geq \min_{\mathbf{w}} g_{\lambda}(\mathbf{w})= g(\mathbf{w}_*)\). Plugging in \(\eta_t=\frac{C}{\sqrt{T}}\) and dividing both sides by \(T\eta_t(1-\lambda\rho)=C\sqrt{T}(1-\lambda\rho)\), we obtain \[\begin{align*} \mathbb{E}[\|\nabla g_\lambda(\mathbf{w}_\tau)\|_2^2]\leq \frac{g(\mathbf{w}_1) - g(\mathbf{w}_*)}{(1-\lambda\rho)C\sqrt{T}} + \frac{C G^2}{2\lambda(1-\lambda\rho)\sqrt{T}}. \end{align*}\]
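As a sanity-check sketch of Theorem 3.4, assume the toy weakly convex function \(g(x)=|x^2-1|\) (\(\rho=2\), \(\lambda=0.2<\rho^{-1}\)), noisy subgradients, and a grid-based proximal computation; all of these choices are illustrative.

```python
import math
import random

# Stochastic subgradient descent on the weakly convex g(x) = |x^2 - 1|,
# then the Moreau-envelope gradient at a uniformly sampled iterate,
# with the prox approximated by a grid search.
random.seed(5)
lam = 0.2

def g(x):
    return abs(x * x - 1.0)

def subgrad(x):
    return (1.0 if x * x > 1.0 else -1.0) * 2.0 * x

T, C = 4000, 0.5
eta = C / math.sqrt(T)            # eta_t = C / sqrt(T), as in Theorem 3.4
x, iterates = 2.0, []
for t in range(T):
    iterates.append(x)
    x -= eta * (subgrad(x) + random.gauss(0.0, 0.1))

x_tau = random.choice(iterates)   # tau uniform on {1, ..., T}
grid = [i / 2000.0 for i in range(-6000, 6001)]   # u in [-3, 3]
prox = min(grid, key=lambda u: g(u) + (u - x_tau) ** 2 / (2.0 * lam))
moreau_grad = (x_tau - prox) / lam   # grad g_lambda(x_tau)
```

The returned iterate typically sits near a stationary point (\(x=\pm 1\)), so the Moreau-envelope gradient is small, matching the \(O(1/\sqrt{T})\) guarantee on \(\mathbb{E}[\|\nabla g_\lambda(\mathbf{w}_\tau)\|_2^2]\).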
