Section 4.2 Stochastic Compositional Gradient Descent
Next, we introduce stochastic compositional gradient descent (SCGD) as a solution method for SCO. The key to the design is to track the sequence \(\{g(\mathbf{w}_t), t=1, \ldots, T\}\) by a sequence of estimators \(\{\mathbf{u}_t, t=1, \ldots, T\}\). Let us consider the following problem:
\[\begin{align}\label{eqn:u_tp} \min_{\mathbf u}\frac{1}{2}\|\mathbf{u} - g(\mathbf{w}_t)\|_2^2. \end{align}\]
We compute \(\mathbf{u}_t\) by using the SGD update:
\[\begin{align}\label{eqn:u-sgd} \mathbf{u}_{t} = \mathbf{u}_{t-1} - \gamma_t (\mathbf{u}_{t-1} - g(\mathbf{w}_t;\zeta_t)) = (1-\gamma_t) \mathbf{u}_{t-1} + \gamma_t g(\mathbf{w}_t; \zeta_t), t\in[T], \end{align}\]
where \(g(\mathbf{w}; \zeta)\) is a stochastic estimator of \(g(\mathbf{w})\) such that \(\mathbb{E}_\zeta[g(\mathbf{w}; \zeta)]=g(\mathbf{w})\). The update is also known as the moving average sequence of \(\{g(\mathbf{w}_t)\}\).
The intuition behind this is that when \(\mathbf{w}_t\) converges (i.e., \(\mathbf{w}_{t} - \mathbf{w}_{t-1}\rightarrow 0\)), \(\mathbf{u}_{t}\) is a better estimator of \(g(\mathbf{w}_t)\) than \(g(\mathbf{w}_t; \zeta_t)\). With \(\mathbf{u}_{t}\), the gradient estimator can be computed by
\[\begin{align}\label{eqn:scgd-z} \mathbf{z}_t = \nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t), \end{align}\]
where \(\zeta'_t\) is another independent random variable. Then, we can use it for updating \(\mathbf{w}_t\): \[ \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{z}_t. \]
The detailed steps are presented in Algorithm 9.
Using \(\zeta'_t\) instead of \(\zeta_t\) in computing \(\nabla g(\mathbf{w}_t; \zeta_t')\) is for simplicity of analysis, which decouples the dependence between \(\mathbf{u}_{t}\) and \(\zeta'_t\) since \(\mathbf{u}_{t}\) depends on \(\zeta_t\). However, this will increase the number of random samples per iteration. For practical implementation, one may just use \(\zeta'_t=\zeta_t\).
Algorithm 9: SCGD
- Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_1\), \(\mathbf{u}_0\)
- For \(t=1,\dotsc,T\)
- Sample \(\zeta_t, \zeta_t'\) and \(\xi_t\)
- Compute the inner function value estimator \(\mathbf{u}_{t} = (1-\gamma_t)\mathbf{u}_{t-1} + \gamma_t g(\mathbf{w}_t; \zeta_t)\)
- Compute the vanilla gradient estimator \(\mathbf{z}_{t} = \nabla g(\mathbf{w}_t; \zeta'_t) \nabla f(\mathbf{u}_{t};\xi_t)\)
- Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{z}_{t}\)
4.2.1 Convergence Analysis
We make the following assumptions regarding the SCO problem.
Assumption 4.1
There exist \(L_1, G_1>0\) such that
- \(f\) is \(L_1\)-smooth, i.e., \(\|\nabla f(g) - \nabla f(g')\|_2\leq L_1\|g - g'\|_2, \forall g, g'\);
- \(\mathbb{E}[\|\nabla f(g; \xi)\|_2^2] \leq G_1^2,\forall g\).
Assumption 4.2
There exists \(G_2>0\) such that \(\mathbb{E}[\|\nabla g(\mathbf{w}; \zeta)\|_2^2] \leq G_2^2, \forall\mathbf{w}\).
Due to Jensen’s inequality, \(\mathbb{E}[\|\nabla f(\cdot; \xi)\|_2^2] \leq G_1^2\) and \(\mathbb{E}[\|\nabla g(\mathbf{w}; \zeta)\|_2^2] \leq G_2^2\) indicate the \(G_1\)-Lipschitz condition of \(f\) and \(G_2\)-Lipschitz condition of \(g\), respectively.
Assumption 4.3
There exist \(\sigma_0, \sigma_1, \sigma_2>0\) such that
- \(\mathbb{E}[\|g(\mathbf{w}; \zeta) - g(\mathbf{w})\|_2^2]\leq \sigma_0^2, \forall \mathbf{w}\);
- \(\mathbb{E}[\|\nabla f(g; \xi) - \nabla f(g)\|_2^2]\leq \sigma_1^2,\quad\mathbb{E}[\|\nabla g(\mathbf{w}; \zeta) - \nabla g(\mathbf{w})\|_2^2]\leq \sigma_2^2, \forall\mathbf{w}, g\);
Assumption 4.4
\(F\) is \(L_F\)-smooth and lower bounded, i.e., there exist \(L_F>0\) such that \(\nabla F(\cdot)\) is \(L_F\)-Lipschitz continuous, and \(F_*=\min_{\mathbf w}F(\mathbf w)>-\infty\).
It is notable that the smoothness of \(F\) does not necessarily imply that \(g\) is smooth. One example is that if \(g(\mathbf{w})=\|\mathbf{w}\|_2\) and \(f(g)=g^2\), the overall function \(F(\mathbf{w})=\|\mathbf{w}\|_2^2\) is smooth but the inner function \(g\) is non-smooth.
Lemma 4.1
Under Assumption 4.2 and Assumption 4.3(i), the \(\{\mathbf{u}_t\}_{t\geq 1}\) sequence (\(\ref{eqn:u-sgd}\)) satisfies that
\[\begin{align}\label{eq:scgd_fval} \mathbb{E}_{\zeta_t}\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2\right] & \leq (1-\gamma_t) \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + \gamma_t^2\sigma_0^2 + \frac{G_2^2}{\gamma_t} \|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2. \end{align}\]
where \(\mathbb{E}_{\zeta_t}\) denotes the expectation over \(\zeta_t\) given all previous randomness.
💡 Why it matters
The lemma admits an intuitive interpretation. The first term shows that
\(\|\mathbf{u}_{t} -
g(\mathbf{w}_t)\|_2^2\) is bounded by a contracting sequence. The
second term is due to the noise in \(g(\mathbf{w}_t; \zeta_t)\) and the third
term is caused by the drifting from \(\mathbf{w}_{t-1}\) to \(\mathbf{w}_t\), both of which decay to zero
under the conditions \(\gamma_t^2 \rightarrow
0\) and \(\frac{\mathbb{E}[\|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2
]}{\gamma_t}=O\left(\frac{\eta_{t-1}^2}{\gamma_t}\right) \rightarrow
0\), respectively.
Proof
In the following proof, we abuse the notation \(\mathbb{E}_t\) to denote \(\mathbb{E}_{\zeta_t}\). According to the update formula \(\mathbf{u}_{t} = (1-\gamma_t)\mathbf{u}_{t-1} + \gamma_t g(\mathbf{w}_t; \zeta_t)\) we have \[\begin{align*} \mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right] & = \mathbb{E}_t\left[\|(1-\gamma_t)\mathbf{u}_{t-1} + \gamma_t g(\mathbf{w}_t; \zeta_t) - g(\mathbf{w}_t)\|_2^2 \right]\\ & = \mathbb{E}_t\left[\|(1-\gamma_t)(\mathbf{u}_{t-1} - g(\mathbf{w}_{t}))+ \gamma_t (g(\mathbf{w}_t; \zeta_t) - g(\mathbf{w}_t))\|_2^2 \right]. \end{align*}\] Note that \(\mathbb{E}_t\left[(\mathbf{u}_{t-1}- g(\mathbf{w}_t))^{\top}(g(\mathbf{w}_t; \zeta_t) - g(\mathbf{w}_t))\right] = 0\). Thus,
\[\begin{align}\label{eqn:useq-i} \mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right] & \leq (1-\gamma_t)^2 \|\mathbf{u}_{t-1} - g(\mathbf{w}_t)\|_2^2 + \gamma_t^2 \sigma_0^2. \end{align}\]
This inequality is same as Lemma 3.8 when we consider \(\mathbf{u}_{t}\) as the SGD update for (\(\ref{eqn:u_tp}\)).
Due to Young’s inequality of inner product, we have \(\|\mathbf{u}_{t-1} - g(\mathbf{w}_t)\|_2^2 \leq (1+\alpha)\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + (1+1/\alpha)\|g(\mathbf{w}_t) - g(\mathbf{w}_{t-1})\|_2^2\) for any \(\alpha>0\). Whence, \[\begin{align*} \mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right] \leq& (1-\gamma_t)^2(1+\gamma_t) \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 \\ & + (1-\gamma_t)^2(1+1/\gamma_t)G_2^2 \|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2 + \gamma_t^2 \sigma_0^2. \end{align*}\] The proof is completed by noticing \((1-\gamma_t)^2(1+\gamma_t)\leq 1-\gamma_t\) and \((1-\gamma_t)^2(1+1/\gamma_t) \leq \frac{1}{\gamma_t}\).■
Lemma 4.2
Under Assumption 4.1, Assumption 4.2, Assumption 4.3 and Assumption 4.4, SCGD satisfies
\[\begin{align} \mathbb{E}_{\zeta_t, \xi_t, \zeta'_t}\left[F(\mathbf{w}_{t+1})\right] \leq& F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 + \frac{\eta_t G_2^2 L_1^2}{2}\mathbb{E}_{\zeta_t}\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right] \notag\\ & + \frac{\eta_t^2 L_FG_1^2G_2^2}{2}.\label{eq:starter_scgd_same} \end{align}\]
Proof
In the following proof, we abuse the notation \(\mathbb{E}_t\) to denote \(\mathbb{E}_{\zeta_t, \xi_t, \zeta'_t}\). According to \(L_F\)-smoothness of \(F\), we have \[\begin{align*} &F(\mathbf{w}_{t+1}) \leq F(\mathbf{w}_t) + \nabla F(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_t) + \frac{L_F}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2 \\ &= F(\mathbf{w}_t) - \eta_t \nabla F(\mathbf{w}_t)^{\top}\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t) + \frac{\eta_t^2 L_F}{2}\|\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t)\|_2^2 . \end{align*}\] Then, we have
\[\begin{align} \mathbb{E}_t\left[F(\mathbf{w}_{t+1})\right] \leq & F(\mathbf{w}_t)- \eta_t \|\nabla F(\mathbf{w}_t)\|_2^2 \notag\\ & + \eta_t\boxed{\mathbb{E}_{t}\left[\nabla F(\mathbf{w}_t)^{\top}(\nabla g(\mathbf{w}_t; {\zeta'_t})\nabla f(g(\mathbf{w}_t)) - \nabla g(\mathbf{w}_t; {\zeta'_t})\nabla f(\mathbf{u}_{t}))\right]}\notag\\ &+ \frac{\eta_t^2 L_F}{2}\mathbb{E}_t\left[\|\nabla g(\mathbf{w}_t; {\zeta'_t})\nabla f(\mathbf{u}_{t}; {\xi_t})\|_2^2 \right],\label{eq:scgd_starter} \end{align}\] where we use the fact \[\begin{align*} &\mathbb{E}_{\zeta'_t}\left[\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(g(\mathbf{w}_t))\right] = \nabla F(\mathbf{w}_t)\\ &\mathbb{E}_{\zeta_t, \zeta'_t, \xi_t}\left[\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t)\right] = \mathbb{E}_{\zeta_t, \zeta'_t}\left[\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t})\right]. \end{align*}\] Due to the Cauchy-Schwarz inequality and Young’s inequality of inner product, we have
\[\begin{align} &\mathbb{E}_{t}[\nabla F(\mathbf{w}_t)^{\top}(\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(g(\mathbf{w}_t)) - \nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}))]\notag\\ & \leq \mathbb{E}_{t}\left[\frac{\|\nabla F(\mathbf{w}_t)\|_2^2 \|\nabla g(\mathbf{w}_t; \zeta'_t)\|_2^2 }{2G_2^2} \right]+ \mathbb{E}_{\zeta_t}\left[\frac{G_2^2}{2}\|\nabla f(g(\mathbf{w}_t)) - \nabla f(\mathbf{u}_{t})\|_2^2 \right]\notag\\ & \leq \frac{\|\nabla F(\mathbf{w}_t)\|_2^2 }{2} + \frac{G_2^2 L_1^2}{2}\mathbb{E}_{\zeta_t}\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 .\label{eqn:bound-indpen-scgd} \end{align}\] For bounding the last term in (\(\ref{eq:scgd_starter}\)), we proceed as follows:
\[\begin{align}\label{eqn:varg} \boxed{\mathbb{E}_t\left[\|\nabla g(\mathbf{w}_t, {\zeta'_t})\nabla f(\mathbf{u}_{t}, {\xi_t})\|_2^2 \right]}&\leq \mathbb{E}_{\zeta_t, \zeta'_t}\left[\|\nabla g(\mathbf{w}_t; {\zeta'_t})\|_2^2 \mathbb{E}_{\xi_t|\zeta_t, \zeta'_t}\|\nabla f(\mathbf{u}_{t}; {\xi_t})\|_2^2 \right]\notag\\ &\leq G_1^2G_2^2. \end{align}\]
We finish the proof by plugging the last two inequalities into (\(\ref{eq:scgd_starter}\)).■
We comment on the modifications required in the analysis when the same sample \(\zeta_t\) is used to compute \(\nabla g(\mathbf{w}_t;\zeta_t)\). In the original proof, there are two places highlighted in boxes, where we explicitly rely on the independence between \(\mathbf{u}_t\) and \(\zeta'_t\). If instead we use the coupled estimator \(\nabla g(\mathbf{w}_t;\zeta_t)\nabla f(\mathbf{u}_t;\xi_t)\), then the first boxed term must be modified and bounded as follows: \[\begin{align*} &\mathbb{E}_{t}\!\left[ \nabla F(\mathbf{w}_t)^{\top} \bigl( \nabla g(\mathbf{w}_t;{\zeta_t})\nabla f(g(\mathbf{w}_t);\xi_t) - \nabla g(\mathbf{w}_t;{\zeta_t})\nabla f(\mathbf{u}_t;\xi_t) \bigr) \right] \\ &\;\;\le \mathbb{E}_{t}\!\left[ \frac{\|\nabla F(\mathbf{w}_t)\|_2^2 \, \|\nabla g(\mathbf{w}_t;\zeta_t)\|_2^2 }{2G_2^2} \right] + \mathbb{E}_{t}\!\left[ \frac{G_2^2}{2} \|\nabla f(g(\mathbf{w}_t);\xi_t)-\nabla f(\mathbf{u}_t;\xi_t)\|_2^2 \right]. \end{align*}\] To recover the same bound as in (\(\ref{eqn:bound-indpen-scgd}\)), we must impose a stronger regularity condition on \(f\), namely, \[ \mathbb{E}_{\xi}\bigl[\|\nabla f(g;\xi)-\nabla f(g';\xi)\|_2^2 \bigr] \;\le\; L_1 \|g-g'\|_2^2. \]
For the second boxed term, the corresponding expression becomes \(\mathbb{E}_t\!\left[\|\nabla g(\mathbf{w}_t;\zeta_t)\nabla f(\mathbf{u}_t;\xi_t)\|_2^2 \right]\), which in turn requires assuming that this quantity is uniformly bounded by a constant.
Combining Lemma 4.1 and Lemma 4.2, we can prove the following theorem of convergence for SCGD for a non-convex function.
Theorem 4.1
Suppose Assumption 4.1, Assumption 4.2, Assumption 4.3 and Assumption 4.4 hold. After \(T\) iterations of SCGD updates with parameters \(\eta_t = \frac{\eta_1}{T^{3/5}}, \gamma_t = \frac{\gamma_1}{T^{2/5}}\), we have \[\begin{align*} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\|\nabla F(\mathbf{w}_t)\|_2^2 \right]& \leq \frac{2C_\Upsilon}{\eta_1T^{2/5}} + \frac{L_1^2G_1^2G_2^6\eta_1^2}{\gamma_1^2 T^{2/5}} + \frac{ L_1^2G_2^2 \sigma_0^2\gamma_1 }{T^{2/5}} + \frac{L_FG_1^2G_2^2\eta_1}{2T^{3/5}}, \end{align*}\] where \(C_\Upsilon= F(\mathbf{w}_1) - F_* + \frac{L_1^2G_2^2\sigma_0^2}{2}\frac{\eta_1}{\gamma_1}\). If \(\eta_t = \eta_1/t^{3/5}, \gamma_t = \gamma_1/t^{2/5}\), then the convergence rate becomes \(O(\log T/T^{2/5})\).
Proof
Adding \(\frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right]\) on (\(\ref{eq:starter_scgd_same}\)), we have \[\begin{align*} & \mathbb{E}_t\left[F(\mathbf{w}_{t+1})\right] + \frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right]\\ &\leq F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 + (1+\gamma_t)\frac{\eta_t L_1^2G_2^2}{2\gamma_t}\mathbb{E}_t\|\mathbf{u}_{t}-g(\mathbf{w}_t)\|_2^2 + \frac{\eta_t^2L_FG_1^2G_2^2}{2}. \end{align*}\] Applying Lemma 4.1 to bound the right hand side, we have \[\begin{align*} & \mathbb{E}_t\left[F(\mathbf{w}_{t+1})\right] + \frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right]\\ & \leq F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 +(1-\gamma_t)(1+\gamma_t)\frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\|\mathbf{u}_{t-1}-g(\mathbf{w}_{t-1})\|_2^2 \\ &+ \frac{(1+\gamma_t)L_1^2 G_2^2G_2^2\eta_t}{2\gamma_t^2}\|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2 + \gamma_t \eta_t(1+\gamma_t)\frac{L_1^2G_2^2\sigma_0^2}{2} + \frac{\eta_t^2 L_FG_1^2G_2^2}{2} \\ & \stackrel{\gamma_t\leq 1}{\leq} F(\mathbf{w}_t) + \frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + \frac{\eta_t L_1^2G_2^4}{\gamma_t^2}\|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2 \\ &+ \gamma_t\eta_t L_1^2G_2^2\sigma_0^2+ \frac{\eta_t^2L_F G_1^2G_2^2}{2} - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 . \end{align*}\] We define the potential function \(\Upsilon_t := F(\mathbf{w}_t) + \frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\|\mathbf{u}_{t-1}-g(\mathbf{w}_{t-1})\|_2^2\). By the setting, we have \(\frac{\eta_{t+1}}{\gamma_{t+1}}\leq \frac{\eta_t}{\gamma_t}\), then \[ \Upsilon_{t+1} = F(\mathbf{w}_{t+1}) + \frac{L_1^2G_2^2}{2}\frac{\eta_{t+1}}{\gamma_{t+1}}\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \leq F(\mathbf{w}_{t+1}) + \frac{L_1^2G_2^2}{2}\frac{\eta_t}{\gamma_t}\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 . \] Then, \[\begin{align*} \mathbb{E}_t\left[\Upsilon_{t+1}\right] & \leq \Upsilon_t + \frac{\eta_t L_1^2G_2^4}{\gamma_t^2}\|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2 + \gamma_t\eta_t L_1^2G_2^2\sigma_0^2 + \frac{\eta_t^2L_F G_1^2 G_2^2}{2} \\ &- \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 . \end{align*}\] Telescoping the above over \(t=1\) to \(T\) and use the tower property of conditional expectation, \[\begin{align*} \mathbb{E}\left[\sum_{t=1}^T \eta_t \|\nabla F(\mathbf{w}_t)\|_2^2 \right]& \leq 2\mathbb{E}\left[\Upsilon_1- \Upsilon_{T+1}\right] + 2L_1^2G_2^4\sum_{t=1}^T \gamma_t^{-2}\eta_t \eta_{t-1}^2G_1^2G_2^2\\ &+ L_1^2G_2^2 \sigma_0^2 \sum_{t=1}^T \gamma_t \eta_t + \frac{L_FG_1^2G_2^2}{2}\sum_{t=1}^T \eta_t^2, \end{align*}\] where we use the fact \(\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2 ] = \mathbb{E}[\eta_{t-1}^2\|\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t)\|_2^2 ]\leq \eta_{t-1}^2G_1^2G_2^2\). Let \(\mathbf{w}_0 = \mathbf{w}_1\) and \(\mathbf{u}_0 = g(\mathbf{w}_0; \zeta_1)\). Then, we have \[\begin{align*} \mathbb{E}\left[\Upsilon_1- \Upsilon_{T+1}\right] &\leq \mathbb{E}\left[F(\mathbf{w}_1) + \frac{L_1^2G_2^2}{2}\frac{\eta_1}{\gamma_1}\|\mathbf{u}_0-g(\mathbf{w}_0)\|_2^2 \right]-F_* \\ &\leq F(\mathbf{w}_1) - F_*+ \frac{L_1^2G_2^2\sigma_0^2}{2}\frac{\eta_1}{\gamma_1}. \end{align*}\] We define \(C_\Upsilon= F(\mathbf{w}_1) - F_*+ \frac{L_1^2G_2^2\sigma_0^2}{2}\frac{\eta_1}{\gamma_1}\). Then we have \[\begin{align*} \mathbb{E}\left[\|\nabla F(\mathbf{w}_\tau)\|_2^2 \right]& \leq \frac{2C_{\Upsilon}}{\sum_{t=1}^T\eta_t} + 2L_1^2G_2^6G_1^2\frac{\sum_{t=1}^T \gamma_t^{-2}\eta_t \eta_{t-1}^2}{\sum_{t=1}^T\eta_t}\\ &+ L_1^2G_2^2 \sigma_0^2 \frac{\sum_{t=1}^T \gamma_t \eta_t}{\sum_{t=1}^T\eta_t} + \frac{L_FG_1^2G_2^2}{2}\frac{\sum_{t=1}^T \eta_t^2}{\sum_{t=1}^T\eta_t}. \end{align*}\] Plugging the constant values of \(\eta_t= \frac{\eta_1}{T^{3/5}}\) and \(\gamma_t = \frac{\gamma_1}{T^{2/5}}\), we have \[\begin{align*} \mathbb{E}\left[\|\nabla F(\mathbf{w}_\tau)\|_2^2 \right]& \leq \frac{2C_{\Upsilon}}{\eta_1T^{2/5}} + \frac{2L_1^2G_1^2G_2^6\eta_1^2}{\gamma_1^2 T^{2/5}} + \frac{ L_1^2G_2^2 \sigma_0^2\gamma_1 }{T^{2/5}} + \frac{L_FG_1^2G_2^2\eta_1}{2T^{3/5}}. \end{align*}\] If \(\eta_t = O(1/t^{3/5})\), \(\gamma_t = O(1/t^{2/5})\), \(\frac{\eta_{t+1}}{\gamma_{t+1}}\leq \frac{\eta_t}{\gamma_t}\) is satisfied. Besides, we have \(\sum_{t=1}^T \eta_t = O(T^{2/5})\), \(\sum_{t=1}^T \eta_t^2 = O(1)\), \(\sum_{t=1}^T \gamma_t \eta_t = O(\log T)\), \(\sum_{t=1}^T \gamma_t^{-2}\eta_t \eta_{t-1}^2 = O(\log T)\). Then, we have \(\mathbb{E}\left[\|\nabla F(\mathbf{w}_{\tau})\|_2^2 \right] \leq \tilde{\mathcal{O}}(1/T^{2/5})\).■
4.2.2 An Improved Complexity with Smooth Inner Function
If we replace the smoothness assumption of \(F\) by the smoothness of \(g\), we can establish a better complexity of SCGD.
Assumption 4.5 \(g\) is \(L_2\)-smooth, i.e., there exists \(L_2>0\) such that \(\nabla g(\cdot)\) is \(L_2\)-Lipschitz continuous.
Assumption 4.1 and Assumption 4.5 ensure that \(F\) is smooth.
Lemma 4.3 Under Assumption 4.1 and Assumption 4.5, we have \(F\) is \(L_F\)-smooth, where \(L_F = G_1 L_2 + G_2^2 L_1\).
Proof
Since \(\nabla F(\mathbf{w}) = \nabla g(\mathbf{w})\nabla f(g(\mathbf{w}))\), we have \[\begin{align*} &\|\nabla g(\mathbf{w}_1)\nabla f(g(\mathbf{w}_1)) -\nabla g(\mathbf{w}_2)\nabla f(g(\mathbf{w}_2)) \|_2\\ &=\|\nabla g(\mathbf{w}_1)\nabla f(g(\mathbf{w}_1)) -\nabla g(\mathbf{w}_1)\nabla f(g(\mathbf{w}_2)) +\nabla g(\mathbf{w}_1)\nabla f(g(\mathbf{w}_2)) - \nabla g(\mathbf{w}_2)\nabla f(g(\mathbf{w}_2)) \|_2\\ & \leq G_2^2 L_1\|\mathbf{w}_1 -\mathbf{w}_2\|_2 + G_1L_2\|\mathbf{w}_1 - \mathbf{w}_2\|_2. \end{align*}\]■
Lemma 4.4 Let \(\mathbf{z}_t =\nabla g(\mathbf{w}_t;\zeta'_t)\nabla f(\mathbf{u}_t;\xi_t)\), \(\mathbf{M}_t = \mathbb{E}_t[\mathbf{z}_t]\). Then \[\begin{align*} & \mathbb{E}_t[\|\mathbf{z}_{t}-\mathbf{M}_{t}\|_2^2]\leq G_1^2\sigma_2^2 + G_2^2\sigma_1^2,\\ &\mathbb{E}_t[\|\mathbf{w}_{t+1} - \mathbf{w}_{t}\|_2^2]\leq \eta_{t}^2G_1^2G_2^2,\\ &\mathbb{E}_t[\|\mathbf{w}_{t+1} - \mathbf{w}_{t}\|_2^2] \leq \eta_{t}^2\|\mathbf{M}_{t}\|_2^2 + \eta_{t}^2(G_1^2\sigma_2^2 + G_2^2\sigma_1^2). \end{align*}\] where \(\mathbb{E}_t\) denotes \(\mathbb{E}_{\zeta'_t, \xi_t}\) conditioned on \(\mathbf{w}_t, \mathbf{u}_t\).
Proof
First, we have \[\begin{equation*} \begin{aligned} &\mathbb{E}_t[\|\mathbf{z}_{t} - \mathbf{M}_{t}\|_2^2 ] = \mathbb{E}_t[\|\nabla g(\mathbf{w}_{t};\zeta'_{t})\nabla f(\mathbf{u}_{t}; \xi_{t}) - \nabla g(\mathbf{w}_{t})\nabla f(\mathbf{u}_{t})\|_2^2 ]\\ &= \mathbb{E}_t[\left\|\nabla g(\mathbf{w}_{t})\nabla f(\mathbf{u}_{t}) -\nabla g(\mathbf{w}_{t})\nabla f(\mathbf{u}_{t}; \xi_{t}) + \nabla g(\mathbf{w}_{t})\nabla f(\mathbf{u}_{t}; \xi_{t}) - \nabla g(\mathbf{w}_{t};\zeta'_{t})\nabla f(\mathbf{u}_{t}; \xi_{t})\right\|_2^2]\\ & \leq G_2^2\sigma_1^2 + G_1^2 \sigma_2^2. \end{aligned} \end{equation*}\]
Next, due to Assumption 4.1 and Assumption 4.2 we have \[\begin{align*} \mathbb{E}_t[\|\mathbf{w}_{t+1} - \mathbf{w}_{t}\|_2^2]&=\mathbb{E}_t[\eta_{t}^2\|\nabla g(\mathbf{w}_{t};\zeta'_{t})\nabla f(\mathbf{u}_{t};\xi_{t})\|_2^2]\leq \eta_t^2G_1^2G_2^2. \end{align*}\]
Second, we have \[\begin{align*} \mathbb{E}_t[\|\mathbf{w}_{t+1} - \mathbf{w}_{t}\|_2^2] &= \mathbb{E}_t[\eta_{t}^2\|\mathbf{z}_{t}-\mathbf{M}_{t}+\mathbf{M}_{t}\|_2^2]= \mathbb{E}_t[\eta_{t}^2\|\mathbf{z}_{t}-\mathbf{M}_{t}\|_2^2]+\eta_{t}^2\|\mathbf{M}_{t}\|_2^2. \end{align*}\]
Plugging the first result into the above, we finish the proof.■
Next, we develop two lemmas similar to Lemma 4.1 and Lemma 4.2.
Lemma 4.5 Under Assumption 4.2, Assumption 4.3, and Assumption 4.5, if \(\eta_{t-1}^2\leq \frac{\gamma_t}{L_2^2G_1^2}\) then the \(\{\mathbf{u}_t\}_{t\geq 1}\) sequence (\(\ref{eqn:u-sgd}\)) satisfies that \[\begin{align}\label{eqn:scgd_fval} \mathbb{E}\left[\|\mathbf{u}_t - g(\mathbf{w}_t)\|_2^2\right] & \leq (1-\gamma_t)\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2]+\frac{2\eta_{t-1}^2G_2^2}{\gamma_t}\mathbb{E}[\|\mathbf{M}_{t-1}\|_2^2]\\ &+ \gamma_t^2 \sigma_0^2 + \frac{3\eta_{t-1}^2G^2_2}{2}(G_1^2\sigma_2^2+G_2^2\sigma_1^2).\notag \end{align}\]
Proof
Similar to the proof of Lemma 4.1, we have \[\begin{align}\label{eqn:utt} \mathbb{E}_t\left[\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2 \right] & \leq (1-\gamma_t)^2 \|\mathbf{u}_{t-1} - g(\mathbf{w}_t)\|_2^2 + \gamma_t^2 \sigma_0^2. \end{align}\]
Next, we will handle \(\|\mathbf{u}_{t-1} - g(\mathbf{w}_t)\|_2^2\) differently by using the smoothness of \(g\). \[\begin{align*} \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t})\|_2^2 &= \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}) + g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t)\|_2^2 \\ & =\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + \|g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t)\|_2^2 \\ & \quad+ 2(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t))\\ & \leq \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + G_2^2\|\mathbf{w}_{t-1} - \mathbf{w}_t\|_2^2 \\ & \quad + 2(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t)). \end{align*}\]
Taking expectation on both sides and applying Lemma 4.4, we have \[\begin{align*} &\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t})\|_2^2 ] \leq \mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 ] + \eta_{t-1}^2G_2^2\mathbb{E}[\|\mathbf{M}_{t-1}\|_2^2] \\ &+ \eta_{t-1}^2G^2_2(G_2^2\sigma_1^2+G_1^2\sigma_2^2) + \mathbb{E}[2(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t))]. \end{align*}\]
Instead of using Young’s inequality of inner product to bound the last term, we proceed as follows: \[\begin{align*} &\mathbb{E}[(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t))]\\ &= \underbrace{\mathbb{E}[(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}\nabla g(\mathbf{w}_{t-1})^{\top}( \mathbf{w}_{t-1} - \mathbf{w}_t)]}\limits_A\\ &+\underbrace{\mathbb{E}[(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t) + \nabla g(\mathbf{w}_{t-1})^{\top}(\mathbf{w}_t - \mathbf{w}_{t-1}))]}\limits_{B}. \end{align*}\]
To bound \(A\), we have \[\begin{align*} A &= \mathbb{E}[(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}\nabla g(\mathbf{w}_{t-1})^{\top}\eta_{t-1}\mathbf{M}_{t-1}]\\ & \leq \mathbb{E}[\alpha_t\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + \frac{\eta_{t-1}^2}{4\alpha_t}\|\nabla g(\mathbf{w}_{t-1})^{\top}\mathbf{M}_{t-1}\|_2^2]\\ &\leq \mathbb{E}[\alpha_t\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 + \frac{\eta_{t-1}^2G_2^2}{4\alpha_t}\|\mathbf{M}_{t-1}\|_2^2]. \end{align*}\]
To bound \(B\), we have \[\begin{align*} B&\leq \mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2\|g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t) + \nabla g(\mathbf{w}_{t-1})^{\top}(\mathbf{w}_t - \mathbf{w}_{t-1})\|_2]\\ &\leq \mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2\frac{L_2}{2}\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2]\\ &\leq \frac{L^2_2}{4G_2^2}\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2]+ \frac{G_2^2}{4}\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2], \end{align*}\] where the first inequality uses the smoothness of \(g\) and the last inequality uses Young’s inequality.
To proceed, we utilize the first bound of \(\mathbb{E}_{t-1}[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2]\) in Lemma 4.4 to bound the first term, and utilize its second bound in Lemma 4.4 to bound the second \(\mathbb{E}[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2]\). Thus, we have \[\begin{align*} B \leq &\frac{\eta_{t-1}^2L^2_2G_1^2}{4}\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2]+ \frac{\eta_{t-1}^2G^2_2}{4}\mathbb{E}[\|\mathbf{M}_{t-1}\|_2^2] +\frac{\eta_{t-1}^2G^2_2}{4}(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
Combining the bounds for \(A\) and \(B\), we have \[\begin{align*} &\mathbb{E}[(\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1}))^{\top}(g(\mathbf{w}_{t-1}) - g(\mathbf{w}_t))]\\ &= \bigg(\alpha_t+ \frac{\eta_{t-1}^2L^2_2G_1^2}{4}\bigg)\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2] + \bigg(\frac{\eta_{t-1}^2G_2^2}{4\alpha_t}+\frac{\eta_{t-1}^2G_2^2}{4}\bigg)\mathbb{E}[\|\mathbf{M}_{t-1}\|_2^2]\\ & +\frac{\eta_{t-1}^2G^2_2}{4}(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
As a result, \[\begin{align*} &\mathbb{E}[\|\mathbf{u}_{t-1} - g(\mathbf{w}_{t})\|_2^2 ]\leq \bigg(1+ 2\alpha_t+ \frac{\eta_{t-1}^2L^2_2G_1^2}{2}\bigg) \|\mathbf{u}_{t-1} - g(\mathbf{w}_{t-1})\|_2^2 \\ & + \bigg(\eta_{t-1}^2G_2^2+\frac{\eta_{t-1}^2G_2^2}{2\alpha_t}+\frac{\eta_{t-1}^2G_2^2}{2}\bigg)\mathbb{E}[\|\mathbf{M}_{t-1}\|_2^2]\\ & +\bigg(\eta_{t-1}^2G_2^2+\frac{\eta_{t-1}^2G^2_2}{2}\bigg)(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
We let \(\alpha_t = \frac{\gamma_t}{4}<1\), \(\frac{\eta_{t-1}^2L^2_2G_1^2}{2}\leq \frac{\gamma_t}{2}\). Combining the above inequality with (\(\ref{eqn:utt}\)), we can finish the proof.■
Lemma 4.6 Under Assumption 4.1, Assumption
4.2, Assumption 4.3, and Assumption 4.5, if \(\eta_tL_F\leq 1/2\) then SCGD
satisfies \[\begin{align}\label{eq:starter_scgd_same_2}
\mathbb{E}[F(\mathbf{w}_{t+1})] \leq&
\mathbb{E}\bigg[F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla
F(\mathbf{w}_t)\|_2^2 - \frac{\eta_t}{4}\|\mathbf{M}_t\|_2^2 \bigg]\\
&+ \frac{\eta_t G_2^2L_1^2}{2}\mathbb{E}[\|g(\mathbf{w}_t) -
\mathbf{u}_t\|_2^2]+ \frac{\eta_t^2L_F}{2}(G_2^2\sigma_1^2 + G_1^2
\sigma_2^2).\notag
\end{align}\]
Proof
According to Lemma 4.3 (the \(L_F\)-smoothness of \(F\)), we have \[\begin{align*} &F(\mathbf{w}_{t+1}) \leq F(\mathbf{w}_t) + \nabla F(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_t) + \frac{L_F}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2 \\ &= F(\mathbf{w}_t) - \eta_t \nabla F(\mathbf{w}_t)^{\top}\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t) + \frac{\eta_t^2 L_F}{2}\|\nabla g(\mathbf{w}_t; \zeta'_t)\nabla f(\mathbf{u}_{t}; \xi_t)\|_2^2 . \end{align*}\]
Taking expectation on both sides, we have \[\begin{align*} &\mathbb{E}[F(\mathbf{w}_{t+1})] \leq \mathbb{E}[F(\mathbf{w}_t)] - \eta_t \mathbb{E}[\nabla F(\mathbf{w}_t)^{\top}\mathbf{M}_t ]+ \frac{\eta_t^2 L_F}{2}\mathbb{E}[\|\mathbf{z}_t - \mathbf{M}_t + \mathbf{M}_t\|_2^2 ]\\ & =\mathbb{E}[F(\mathbf{w}_t)] - \eta_t \mathbb{E}[\nabla F(\mathbf{w}_t)^{\top}\mathbf{M}_t ]+ \frac{\eta_t^2L_F}{2}\mathbb{E}[\|\mathbf{z}_t - \mathbf{M}_t\|_2^2 ] + \frac{\eta_t^2L_F}{2}\mathbb{E}[\|\mathbf{M}_t\|_2^2 ]. \end{align*}\]
Using \(-2\mathbf{a}^{\top}\mathbf{b} = \|\mathbf{a} - \mathbf{b}\|_2^2 - \|\mathbf{a}\|_2^2 - \|\mathbf{b}\|_2^2\), we have \[\begin{align*} \mathbb{E}[F(\mathbf{w}_{t+1})]& \leq \mathbb{E}[F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 - \frac{\eta_t}{2}\|\mathbf{M}_t\|_2^2 ]\\ &+ \frac{\eta_t}{2}\mathbb{E}[\|\nabla F(\mathbf{w}_t)-\mathbf{M}_t\|_2^2 ]+\frac{\eta_t^2L_F}{2}\mathbb{E}[\|\mathbf{z}_t - \mathbf{M}_t\|_2^2 ] + \frac{\eta_t^2L_F}{2}\mathbb{E}[\|\mathbf{M}_t\|_2^2 ]. \end{align*}\]
Next, we bound \(\mathbb{E}[\|\nabla F(\mathbf{w}_t)-\mathbf{M}_t\|_2^2 ]\): \[\begin{align*} &\mathbb{E}[\|\nabla F(\mathbf{w}_t)-\mathbf{M}_t\|_2^2 ] = \mathbb{E}[\|\nabla g(\mathbf{w}_t)\nabla f(g(\mathbf{w}_t))-\nabla g(\mathbf{w}_t)\nabla f(\mathbf{u}_t)\|_2^2 ]\\ & \leq G_2^2L_1^2\mathbb{E}[\|g(\mathbf{w}_t) - \mathbf{u}_t\|_2^2]. \end{align*}\]
Combining the above inequalities, we have \[\begin{align*} &\mathbb{E}[F(\mathbf{w}_{t+1})] \leq \mathbb{E}[F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 - \frac{\eta_t}{2}\|\mathbf{M}_t\|_2^2 ]\\ &+ \frac{\eta_t G_2^2L_1^2}{2}\mathbb{E}[\|g(\mathbf{w}_t) - \mathbf{u}_t\|_2^2]+ \frac{\eta_t^2L_F}{2}(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2) + \frac{\eta_t^2L_F}{2}\mathbb{E}[\|\mathbf{M}_t\|_2^2 ]. \end{align*}\]
If \(\eta_tL_F\leq 1/2\), we have \(- \frac{\eta_t}{2}\|\mathbf{M}_t\|_2^2 + \frac{\eta_t^2L_F}{2}\|\mathbf{M}_t\|_2^2 \leq \frac{\eta_t}{4}\|\mathbf{M}_t\|_2^2\), which concludes the proof.■
Finally, we establish the following convergence of SCGD under the smoothness condition of \(g\).
Theorem 4.2 Suppose Assumption 4.1, Assumption 4.5, and Assumption 4.3 hold. Run SCGD with \(T\) iterations with parameters \(\eta_t = \frac{\eta_1}{\sqrt{T}}\), \(\gamma_t = \frac{\gamma_1}{\sqrt{T}}\), where \(\eta_1\leq \min(\frac{\gamma_1}{2G_2^2L_1}, \frac{\sqrt{\gamma_1}}{L_2G_1}, \frac{1}{2L_F})\). Then we have \[\begin{align*} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\|\nabla F(\mathbf{w}_t)\|_2^2 \right]& \leq O\left(\frac{C_\Upsilon}{\eta_1\sqrt{T}} + \frac{L_1\gamma_1^2 \sigma_0^2}{\eta_1\sqrt{T}}+ \frac{\eta_1(L_F+L_1G_2^2)(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2)}{\sqrt{T}}\right), \end{align*}\] where \(C_\Upsilon= F(\mathbf{w}_1) - F_* + \frac{L_1}{4}\|\mathbf{u}_1 - g(\mathbf{w}_1)\|_2^2\).
💡 Why it matters
From Theorem 4.2, we can derive that in order
to find an \(\epsilon\)-level
stationary solution of a smooth non-convex compositional function, whose
gradient norm is less than \(\epsilon\), SCGD needs a sample complexity
of \(O(\frac{1}{\epsilon^{4}})\). The
order in terms of \(\epsilon\) is the
same order as that of SGD for solving non-convex ERM.
Proof
By Lemma 4.5 and Lemma 4.6, we have \[\begin{align*} &\mathbb{E}[F(\mathbf{w}_{t+1})] \leq \mathbb{E}[F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 - \frac{\eta_t}{4}\|\mathbf{M}_t\|_2^2 ]\\ &+ \frac{\eta_t G_2^2L_1^2}{2}\mathbb{E}[\|\mathbf{u}_t - g(\mathbf{w}_t)\|_2^2]+ \frac{\eta_t^2 L_F}{2}(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2),\\ &\mathbb{E}\left[\|\mathbf{u}_{t+1} - g(\mathbf{w}_{t+1})\|_2^2 \right] \leq (1-\gamma_{t+1})\|\mathbf{u}_{t} - g(\mathbf{w}_{t})\|_2^2 +\frac{2\eta_t^2G_2^2}{\gamma_{t+1}}\mathbb{E}[\|\mathbf{M}_t\|_2^2]\\ &+ \gamma_{t+1}^2 \sigma_0^2 + \frac{3\eta_t^2G^2_2}{2}(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
Multiplying the second inequality by \(G_2^2L_1^2\eta_t/(2\gamma_{t+1})\) and adding it to the first inequality, we have \[\begin{align*} &\mathbb{E}\bigg[F(\mathbf{w}_{t+1})+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\|\mathbf{u}_{t+1} - g(\mathbf{w}_{t+1})\|_2^2 \bigg] \leq \mathbb{E}\bigg[F(\mathbf{w}_t) - \frac{\eta_t}{2}\|\nabla F(\mathbf{w}_t)\|_2^2 - \frac{\eta_t}{4}\|\mathbf{M}_t\|_2^2 \bigg]\\ &+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\mathbb{E}[\|\mathbf{u}_t - g(\mathbf{w}_t)\|_2^2]+\frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\frac{2\eta_t^2G_2^2}{\gamma_{t+1}}\mathbb{E}[\|\mathbf{M}_t\|_2^2]\\ &+ \frac{\eta_t^2 L_F}{2}(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2)+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\gamma_{t+1}^2 \sigma_0^2 + \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\frac{3\eta_t^2G^2_2}{2}(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
Since \(\frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\frac{2\eta_t^2G_2^2}{\gamma_{t+1}}\leq \frac{\eta_t}{4}\) due to \(\eta_t\leq \frac{\gamma_{t+1}}{2G_2^2L_1}\), the term involving \(\|\mathbf{M}_t\|_2^2\) will be less than zero. If \(\frac{\eta_t}{\gamma_{t+1}}\leq \frac{\eta_{t-1}}{\gamma_t}\), we obtain \[\begin{align*} &\mathbb{E}\bigg[F(\mathbf{w}_{t+1})+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\|\mathbf{u}_{t+1} - g(\mathbf{w}_{t+1})\|_2^2 \bigg]\\ &\leq \mathbb{E}\bigg[F(\mathbf{w}_t) + \frac{\eta_{t-1}G_2^2L_1^2}{2\gamma_{t}}\mathbb{E}[\|\mathbf{u}_t - g(\mathbf{w}_t) \|_2^2\bigg] - \frac{\eta_t}{2}\mathbb{E}[\|\nabla F(\mathbf{w}_t)\|_2^2] + \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\gamma_{t+1}^2 \sigma_0^2\\ & + \frac{\eta_t^2 L_F}{2}(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2)+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\frac{3\eta_t^2G^2_2}{2}(G_1^2\sigma_2^2+G_2^2\sigma_1^2). \end{align*}\]
Applying \(\eta_t\leq \frac{\gamma_{t+1}}{2G_2^2L_1}\) to the right-hand side, we have \[\begin{align*} &\mathbb{E}\bigg[F(\mathbf{w}_{t+1})+ \frac{\eta_tG_2^2L_1^2}{2\gamma_{t+1}}\|\mathbf{u}_{t+1} - g(\mathbf{w}_{t+1})\|_2^2 \bigg] \\ &\leq \mathbb{E}\bigg[F(\mathbf{w}_t) + \frac{\eta_{t-1}G_2^2L_1^2}{2\gamma_{t}}\|\mathbf{u}_t - g(\mathbf{w}_t) \|_2^2\bigg] - \frac{\eta_t}{2}\mathbb{E}[\|\nabla F(\mathbf{w}_t)\|_2^2] \\ &+ \frac{L_1}{4}\gamma_{t+1}^2 \sigma_0^2 + \eta_t^2\bigg(\frac{L_F}{2}+\frac{3L_1G_2^2}{8}\bigg)(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2). \end{align*}\]
Define \(\Upsilon_t = F(\mathbf{w}_t) + \frac{\eta_{t-1}G_2^2L_1^2}{2\gamma_{t}}\mathbb{E}[\|\mathbf{u}_t - g(\mathbf{w}_t) \|_2^2]\). Then we have \(\sum_{t=1}^T(\Upsilon_t -\Upsilon_{t+1})\leq C_\Upsilon:=\Upsilon_1 - F_*\) and \[\begin{align*} \mathbb{E}\left[\sum_{t=1}^T\frac{\eta_t}{\sum_{t=1}^T\eta_t}\|\nabla F(\mathbf{w}_t)\|_2^2\right]& \leq \frac{2C_\Upsilon}{\sum_{t=1}^T\eta_t } + \frac{\sum_{t=1}^TL_1\gamma_{t+1}^2 \sigma_0^2}{2\sum_{t=1}^T\eta_t}\\ &+ \frac{\sum_{t=1}^T\eta_t^2(L_F+L_1G_2^2)(G_2^2\sigma_1^2 + G_1^2 \sigma_2^2)}{\sum_{t=1}^T\eta_t}. \end{align*}\]
Plugging the values of \(\eta_t, \gamma_t\) will finish the proof.■
Let us compare the complexities of SCGD in Theorem 4.1 and Theorem 4.2 with the sample complexity of the large-batch approach in Eqn. 4. Without assuming the smoothness of \(g\), the complexity of SCGD in Theorem 4.1 is worse by one order of magnitude. In contrast, the complexity of SCGD in Theorem 4.2 matches that of the large-batch approach; however, this result relies on the additional assumption that \(g\) is smooth.