Section 5.2 Smooth Functions

In this section, we consider a non-convex but smooth objective function \(F(\mathbf{w})\) with smooth outer functions. In addition, we assume the inner stochastic functions satisfy the following conditions throughout this section.

Assumption 5.1

We assume that

1. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|g_i(\mathbf{w}; \zeta) - g_i(\mathbf{w})\right\|_2^2]\leq \sigma_0^2\).
1. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|\nabla g_i(\mathbf{w}; \zeta) - \nabla g_i(\mathbf{w})\right\|_2^2]\leq \sigma_2^2\).
1. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|\nabla g_i(\mathbf{w}; \zeta)\right\|_2^2] \leq G_2^2\).

5.2.1 The SOX Algorithm

The first algorithm for solving FCCO is called SOX, named by Stochastic Optimization of X-risks. Owing to its ease of implementation and favorable practical performance, this algorithm is commonly adopted for addressing FCCO. Below, we outline the assumptions necessary for its analysis.

Assumption 5.2

There exist \(G_1, L_1, L_F>0\) such that

1. \(\quad f_i:\mathbb{R}^{d'}\mapsto \mathbb{R}\) is \(G_1\)-Lipschitz continuous and \(L_1\)-smooth;
1. \(\quad F: \mathbb{R}^{d}\mapsto \mathbb{R}\) is \(L_F\)-smooth;
1. \(\quad F_* =\min_{\mathbf{w}}F(\mathbf{w})>-\infty\).

Similar to that for SCO, we also need to track and estimate the inner functions. However, the difference is that we need to maintain and update \(n\) estimators for the \(n\) inner functions \(g_i(\mathbf{w}), i\in[n]\).

To this end, we maintain \(n\) sequence of estimators \(\{\mathbf{u}_{i,t}, t\in[T]\}_{i=1}^n\). At the \(t\)-th iteration, we draw a set of \(B\) random indices \(\mathcal{B}_{t}\subset[n]\) with \(|\mathcal{B}_t|=B\). We update \(\mathbf{u}_{i,t}, i\in[n]\) by the following:

\[ \begin{align} \mathbf{u}_{i, t} = \left\{\begin{array}{lc} (1-\gamma_t) \mathbf{u}_{i,t-1} + \gamma_t g_{i}(\mathbf{w}_t; \zeta_{i,t}), & i\in\mathcal{B}_{t}\\ \mathbf{u}_{i,t-1},&\text{o.w.} \end{array}\right., t=1, \ldots, T, \label{eqn:uc} \end{align} \]

where \(\zeta_{i,t}\sim\mathbb{P}_i\) is a random variable. We refer to the above estimator as coordinate moving average estimator. Then, similar to SCMA, a moving average estimator of the gradient is computed by:

\[ \begin{align*} &\mathbf{v}_{t} = (1-\beta_t) \mathbf{v}_{t-1} + \beta_t \mathbf{z}_t,\\ &\text{where}\quad\mathbf{z}_t = \frac{1}{|\mathcal{B}_t|}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_t; \zeta'_{i,t})\nabla f_{i}(\mathbf{u}_{i,t}). \end{align*} \]

Then, the model parameters are updated by:

\[ \begin{align*} \mathbf{w}_{t+1} & = \mathbf{w}_t - \eta_t \mathbf{v}_{t}. \end{align*} \]

The detailed steps are presented in Algorithm SOX.

Algorithm 14: SOX

Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\), \(\{\beta_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_0\), \(\mathbf{u}_{0},\mathbf{v}_0\)
Let \(\mathbf{w}_1 = \mathbf{w}_0 - \eta_0 \mathbf{v}_0\)
For \(t=1,\dotsc, T\)
Draw a batch of samples \(\mathcal{B}_t\subset[n]\)
For \(i\in \mathcal{B}_t\)
Draw two samples \(\zeta_{i,t}, \zeta'_{i,t}\sim \mathbb{P}_i\)
Update the inner function value estimators \[\mathbf{u}_{i, t} = (1-\gamma_t) \mathbf{u}_{i,t-1} + \gamma_t g_{i}(\mathbf{w}_t; \zeta_{i,t})\]
end for
Set \(\mathbf{u}_{i,t} = \mathbf{u}_{i,t-1}, i\not\in\mathcal{B}_t\)
Compute the vanilla gradient estimator \(\mathbf{z}_t = \frac{1}{|\mathcal{B}_t|}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_t; \zeta'_{i,t})\nabla f_{i}(\mathbf{u}_{i,t})\)
Update the MA gradient estimator \(\mathbf{v}_{t} = (1-\beta_t) \mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\)
Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{v}_{t}\)

Convergence Analysis

Let us first define two notations: \[ \begin{align} \Delta_t &= \left\|\mathbf{v}_{t} - \nabla F(\mathbf{w}_t)\right\|_2^2,\\ \delta_{t} &= \frac{1}{n}\sum_{i=1}^n\left\|\mathbf{u}_{i,t} -g_i(\mathbf{w}_t)\right\|_2^2. \end{align} \] The descent lemma (Lemma 4.9) remains valid. Next, we analyze the recursion of \(\Delta_t\) and \(\delta_t\). One point of deviation is that only some randomly selected coordinates of \(\mathbf{u}\) are updated and used for computing the gradient estimator \(\mathbf{z}_t\). To facilitate the proof, we introduce a virtual sequence: \[ \begin{align} \bar{\mathbf{u}}_{i,t} = (1-\gamma_t)\mathbf{u}_{i,t-1}+\gamma_t g_i(\mathbf{w}_t; \zeta_{i,t}),\forall i=1,\ldots, n. \end{align} \] This is similar to that is done in the analysis of stochastic coordinate descent method in Section 3.3. Then, we have \[ \begin{align*} \mathcal{M}_t := \mathbb{E}_{\mathcal{B}_t, \zeta'_{t}}[\mathbf{z}_t] = \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}). \end{align*} \]

Critical:

Since \(\mathbf{u}_t\) is a random variable that depends on \(\mathcal{B}_t\), hence \[ \mathbb{E}_{\mathcal{B}_t, \zeta'_{t}}[\mathbf{z}_t]\neq \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\mathbf{u}_{i,t}). \]

We first bound the error recursion of \(\delta_{t}\).

Lemma 5.1

Consider the \(\mathbf{u}_{t}\) updates in Algorithm 14. Under Assumption 5.1, if \(\gamma_t\leq 1\), then \[ \begin{equation*} \begin{aligned} & \mathbb{E}\left[\delta_t\right]\leq \left(1-\frac{B\gamma_t}{2n}\right)\mathbb{E}\left[\delta_{t-1}\right] + \frac{2n G_2^2}{B\gamma_t}\mathbb{E}\left[\|\mathbf{w}_{t-1}-\mathbf{w}_t\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}. \end{aligned} \end{equation*} \]

Proof. Since \(\bar{\mathbf{u}}_{i,t}\) is updated using MA, then similar to Eqn. 5, for all \(i\in[n]\) we have \[ \begin{equation*} \begin{aligned} &\mathbb{E}_{\zeta_{i,t}}[\|\bar{\mathbf{u}}_{i,t}-g_i(\mathbf{w}_t)\|_2^2] \leq (1-\gamma_t)^2\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2+\gamma_t^2\sigma_0^2. \end{aligned} \end{equation*} \]

Given \(i\in[n]\), with a probability of \(B/n\) that \(i\in\mathcal{B}_t\), we have \(\mathbf{u}_{i,t} = \bar{\mathbf{u}}_{i,t}\); otherwise, \(\mathbf{u}_{i,t} = \mathbf{u}_{i,t-1}\). Hence, \[ \begin{equation*} \begin{aligned} &\mathbb{E}_{\zeta_{i,t}}\mathbb{E}_{\mathcal{B}_t}[\|\mathbf{u}_{i,t}-g_i(\mathbf{w}_t)\|_2^2]\\ & = \frac{B}{n}\mathbb{E}_{\zeta_{i,t}}[\|\bar{\mathbf{u}}_{i,t}-g_i(\mathbf{w}_t)\|_2^2]+(1-\frac{B}{n})\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2\\ &\leq \frac{B}{n}(1-\gamma_t)^2\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2+\frac{B\gamma_t^2\sigma_0^2}{n}+(1-\frac{B}{n})\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2\\ &\leq (1-\frac{B\gamma_t}{2n})^2\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2+\frac{B\gamma_t^2\sigma_0^2}{n}, \end{aligned} \end{equation*} \] where we use the fact \(\frac{B}{n}(1-\gamma_t)^2+(1-\frac{B}{n}) \leq (1-\frac{\gamma_t B}{2n})^2\).

Then, taking expectation over all randomness on both sides yields \[ \begin{equation*} \begin{aligned} & \mathbb{E}\left[\|\mathbf{u}_{i,t}-g_i(\mathbf{w}_t)\|_2^2\right]\leq (1-\frac{B\gamma_t}{2n})^2\mathbb{E}\left[\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}. \end{aligned} \end{equation*} \]

Then using the Young’s inequality similar to the proof of Lemma 4.1, we have \[ \begin{equation*} \begin{aligned} & \mathbb{E}\left[\|\mathbf{u}_{i,t}-g_i(\mathbf{w}_t)\|_2^2\right]\leq (1+\frac{B\gamma_t}{2n})(1-\frac{B\gamma_t}{2n})^2\mathbb{E}\left[\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_{t-1})\|_2^2\right] \\ &\quad + (1+\frac{2n}{B\gamma_t})(1-\frac{B\gamma_t}{2n})^2\mathbb{E}\left[\|g_i(\mathbf{w}_{t-1})-g_i(\mathbf{w}_t)\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}\\ &\leq (1-\frac{B\gamma_t}{2n})\mathbb{E}\left[\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_{t-1})\|_2^2\right] + \frac{2n G_2^2}{B\gamma_t}\mathbb{E}\left[\|\mathbf{w}_{t-1}-\mathbf{w}_t\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}, \end{aligned} \end{equation*} \] where we use \(\gamma_t\leq 1 < \frac{2n}{B}\). The desired result follows by taking average over \(i=1,\dots,n\) on both sides.

■

Lemma 5.2 (Variance of \(\mathbf{z}_t\))

Let \(\sigma^2= \frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2}{B}\frac{n-B}{n-1}\). We have \[ \mathbb{E}_t\left[\left\|\mathbf{z}_t - \mathcal{M}_t\right\|_2^2\right]\leq \sigma^2. \]

Proof. First, using the variance bound of the average of \(B\) independent zero-mean random variables gives \[ A_1=\mathbb{E}_t\left[\left\|\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_{t};\zeta'_{i,t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}) - \frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}) \right\|_2^2\right]\leq \frac{G_1^2\sigma_2^2}{B}, \] and using the variance bound of the average of \(B\) random variables without replacement yields \[ A_2=\mathbb{E}_t\left[\left\|\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}) - \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}) \right\|_2^2\right]\leq \frac{G_1^2G_2^2}{B}\frac{n-B}{n-1}. \] As a result, \[ \begin{align*} & \mathbb{E}_t\left[\left\|\mathbf{z}_t - \mathcal{M}_t\right\|_2^2\right] \\ & = \mathbb{E}_t\left[\left\|\frac{1}{B}\sum_{i\in\mathcal{B}_{t}}\nabla g_i(\mathbf{w}_{t};\zeta'_{i,t})\nabla f_i(\bar{\mathbf{u}}_{i,t}) - \frac{1}{n}\sum_{i=1}^n \nabla g_i(\mathbf{w}_{t})\nabla f_i(\bar{\mathbf{u}}_{i,t})\right\|_2^2\right]\\ & =A_1+A_2\leq \frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2}{B}\frac{n-B}{n-1} :=\sigma^2. \end{align*} \]

■

Lemma 5.3

Under Assumption 5.1 and Assumption 5.2, if \(\beta_t\leq 1\), the gradient estimation error \(\Delta_t\) can be bounded as \[ \begin{equation*} \begin{aligned} \mathbb{E}[\Delta_t] &\leq (1-\beta_t)\mathbb{E}\left[\Delta_{t-1}\right]+ \frac{2L_F^2 + 8\beta_t^2G_2^4L_1^2}{\beta_t}\mathbb{E}\left[\left\|\mathbf{w}_t - \mathbf{w}_{t-1}\right\|_2^2\right]+ 8\beta_t L_1^2G_2^2 \mathbb{E}\left[\delta_{t-1}\right] \\ &\quad +\beta_t^2\sigma^2 + 4G_2^2L_1^2\beta_t\gamma_t^2\sigma_0^2, \end{aligned} \end{equation*} \] where \(\sigma^2= \frac{G_1^2\sigma_2^2}{B}+\frac{G_1^2G_2^2}{B}\frac{n-B}{n-1}\).

Proof. Since \(\mathbf{v}_t\) is updated using MA, we apply Lemma 4.7 in light of Lemma 5.2, yielding \[ \begin{align}\label{eqn:sox-mv} &\mathbb{E}\left[\left\|\mathbf{v}_{t} - \nabla F(\mathbf{w}_t)\right\|_2^2\right] \leq (1-\beta_t)\mathbb{E}\left[\left\|\mathbf{v}_{t-1} - \nabla F(\mathbf{w}_{t-1})\right\|_2^2\right]\\ & \quad + \frac{2L_F^2}{\beta_t}\mathbb{E}\left[\left\|\mathbf{w}_{t-1}-\mathbf{w}_t\right\|_2^2\right] + 4\beta_t\mathbb{E}\left[\left\|\mathcal{M}_{t} - \nabla F(\mathbf{w}_t)\right\|_2^2\right] + \beta_t^2\sigma^2. \end{align} \]

Next, we bound \(\mathbb{E}\left[\left\|\mathcal{M}_{t} - \nabla F(\mathbf{w}_t)\right\|_2^2\right]\): \[ \begin{align*} \|\mathcal{M}_t - \nabla F(\mathbf{w}_t)\|_2^2 & = \left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f(\bar{\mathbf{u}}_{i,t}) - \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f(g_i(\mathbf{w}_{t}))\right\|_2^2\\ & \leq G_2^2L_1^2\frac{1}{n}\sum_{i=1}^n \|\bar{\mathbf{u}}_{i,t} - g_i(\mathbf{w}_t)\|_2^2. \end{align*} \]

From Lemma 5.1, we have \[ \begin{align*} &\mathbb{E}_{\zeta_{i,t}}[\|\bar{\mathbf{u}}_{i,t}-g_i(\mathbf{w}_t)\|_2^2]\leq (1-\gamma_t)^2\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2+\gamma_t^2\sigma_0^2, \quad \forall i. \end{align*} \]

Hence \[ \begin{align*} &\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\|\bar{\mathbf{u}}_{i,t}-g_i(\mathbf{w}_t)\|_2^2\right]\\ &\leq (1-\gamma_t)^2\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_t)\|_2^2\right]+\gamma_t^2\sigma_0^2\\ & \leq (1-\gamma_t)^2\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\left(2\|g_{i}(\mathbf{w}_{t})-g_{i}(\mathbf{w}_{t-1})\|_2^2 + 2\|\mathbf{u}_{i,t-1}-g_i(\mathbf{w}_{t-1})\|_2^2\right)\right]+\gamma_t^2\sigma_0^2\\ & \leq 2\mathbb{E}[\delta_{t-1}]+ \gamma_t^2\sigma_0^2 + \mathbb{E}\left[2G_2^2\|\mathbf{w}_{t-1}-\mathbf{w}_{t}\|_2^2\right]. \end{align*} \]

As a result, \[ \begin{align*} \mathbb{E}[\|\mathcal{M}_t - \nabla F(\mathbf{w}_t)\|_2^2] \leq 2G_2^2L_1^2\mathbb{E}[\delta_{t-1}]+ G_2^2L_1^2\gamma_t^2\sigma_0^2 + \mathbb{E}\left[2G_2^4L_1^2\|\mathbf{w}_{t-1}-\mathbf{w}_{t}\|_2^2\right]. \end{align*} \]

Plugging the above results into (\(\ref{eqn:sox-mv}\)) we finish the proof.

■

For combining the descent lemma and the above lemmas, we present a result similar to Lemma 4.10, with differences highlighted in boxes

Lemma 5.4

If \(\eta_t\leq 1/L\), assume that there exist non-negative sequences \(A_t, B_t, \Gamma_t, \Delta_t, \delta_t, t\geq 0\) satisfying: \[ \begin{align*} (*)&A_{t+1}\leq A_t + \eta_t \Delta_t - \eta_t B_t - \eta_t \Gamma_t\\ (\sharp)&\Delta_{t+1} \leq (1-\beta_{t+1})\Delta_{t} + C_1\beta_{t+1}\delta_{t} + \frac{C_2\eta_{t}^2}{\beta_{t+1}}\Gamma_t + \beta_{t+1}^2\sigma^2 + \beta_{t+1}\gamma_{t+1}^2\sigma''^2,\\ (\diamond)&\delta_{t+1}\leq (1-\gamma_{t+1}) \delta_{t} + \frac{C_3\eta_{t}^2}{\gamma_{t+1}} \Gamma_t + \gamma_{t+1}^2{\sigma'}^2. \end{align*} \]

If \[ \beta =\frac{\epsilon^2}{4\sigma^2}, \quad \gamma = \min\left(\frac{\epsilon^2}{8C_1{\sigma'}^2}, \frac{\epsilon}{2\sigma''}\right), \quad \eta = \min\left(\frac{1}{L}, \frac{\beta}{\sqrt{4C_2}}, \frac{\gamma}{\sqrt{8C_1C_3}}\right), \] then in order to guarantee \[ \sum_{t=0}^{T-1}\frac{1}{T}(B_t + \frac{1}{2}\Gamma_t)\leq \epsilon^2, \] the iteration complexity is in the order of \[ T=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon\sqrt{C_1C_3}\sigma''}{\epsilon^3}, \frac{C_\Upsilon\sigma^2\sqrt{C_2}}{\epsilon^4}, \frac{C_\Upsilon\sqrt{C_1C_3}C_1{\sigma'}^2}{\epsilon^4}\right\}\right). \]

where \[ C_\Upsilon \leq A_0 - A_* + \frac{1}{2\sqrt{C_2}}\Delta_{0} + \sqrt{\frac{C_1}{2C_3}}\delta_{0}, \] and \(A_*\) is a constant lower bound of \(A_t,\forall t\).

Proof. Following similar analysis to Lemma 4.10, we have \[ \begin{align*} &A_{t+1} + \frac{\eta_t}{\beta_{t+1}}\Delta_{t+1} +(C_1\eta_t\frac{1+\gamma_{t+1}}{\gamma_{t+1}}- C_1\eta_t) \delta_{t+1}\\ &\leq A_t - \eta_t B_t - \eta_t \Gamma_t \\ & +\left(\eta_t+\frac{\eta_t}{\beta_{t+1}}(1-\beta_{t+1})\right)\Delta_{t} + \frac{C_2\eta_{t}^3}{\beta_{t+1}^2}\Gamma_t + \eta_t(\beta_{t+1}\sigma^2+\boxed{\gamma^2_{t+1}\sigma''^2}) + \boxed{C_1\eta_t(\delta_t - \delta_{t+1})}\\ &+C_1\eta_t\frac{1+\gamma_{t+1}}{\gamma_{t+1}}(1-\gamma_{t+1})\delta_{t} + \frac{C_3C_1\eta_{t}^3(1+\gamma_{t+1})}{\gamma_{t+1}^2} \Gamma_t +C_1\eta_t(1+\gamma_{t+1})\gamma_{t+1}{\sigma'}^2, \end{align*} \] where the terms in the box highlight the difference due to the slight difference in the recursion of \(\Delta_t\). Under similar conditions of \(\beta_{t+1}, \gamma_{t+1}, \eta_t\), we get \[ \begin{align*} &A_{t+1} + \frac{\eta_t}{\beta_{t+1}}\Delta_{t+1} + \frac{C_1\eta_t}{\gamma_{t+1}} \delta_{t+1}\\ &\leq A_t +\frac{\eta_{t-1}}{\beta_{t}}\Delta_{t}+\frac{C_1\eta_{t-1}}{\gamma_{t}}\delta_{t} + \boxed{C_1\eta_t(\delta_t - \delta_{t+1})}\\ &- \eta_t B_t - \frac{1}{2}\eta_t \Gamma_t+ \eta_t(\beta_{t+1}\sigma^2+\boxed{\gamma^2_{t+1}\sigma''^2}) + 2C_1\eta_t\gamma_{t+1}{\sigma'}^2. \end{align*} \]

Since \(\eta_{t+1}\leq \eta_t\), we have \[ \begin{align*} &A_{t+1} + \frac{\eta_t}{\beta_{t+1}}\Delta_{t+1} + \frac{C_1\eta_t}{\gamma_{t+1}} \delta_{t+1}+\boxed{C_1\eta_{t+1}\delta_{t+1}}\\ &\leq A_t +\frac{\eta_{t-1}}{\beta_{t}}\Delta_{t}+\frac{C_1\eta_{t-1}}{\gamma_{t}}\delta_{t} + \boxed{C_1\eta_t\delta_t}\\ &- \eta_t B_t - \frac{1}{2}\eta_t \Gamma_t+ \eta_t(\beta_{t+1}\sigma^2+\boxed{\gamma^2_{t+1}\sigma''^2}) + 2C_1\eta_t\gamma_{t+1}{\sigma'}^2. \end{align*} \]

Since \(\Upsilon_{t+1}=A_{t+1} + \frac{\eta_t}{\beta_{t+1}}\Delta_{t+1} + \frac{C_1\eta_t}{\gamma_{t+1}} \delta_{t+1} + \boxed{C_1\eta_{t+1}\delta_{t+1}}\), we have \[ \begin{align*} &\eta_t B_t + \frac{1}{2}\eta_t \Gamma_t\leq \Upsilon_{t} - \Upsilon_{t+1}+ \eta_t(\beta_{t+1}\sigma^2+\boxed{\gamma^2_{t+1}\sigma''^2}) + 2C_1\eta_t\gamma_{t+1}{\sigma'}^2. \end{align*} \]

Hence \[ \begin{align*} &\sum_{t=0}^{T-1}(\eta_t B_t + \frac{1}{2}\eta_t \Gamma_t)\leq \Upsilon_{0} + \sum_{t=0}^{T-1}\left(\eta_t\beta_{t+1}\sigma^2 + 2C_1\eta_t\gamma_{t+1}{\sigma'}^2+\eta_t\gamma^2_{t+1}\sigma''^2\right). \end{align*} \]

Next, let us consider \(\eta_t=\eta, \beta_t=\beta, \gamma_t = \gamma\). Then we have \[ \begin{align*} &\sum_{t=0}^{T-1}\frac{1}{T}(B_t + \frac{1}{2}\Gamma_t)\leq \frac{C_\Upsilon}{T} + \left(\beta\sigma^2 + 2\gamma C_1{\sigma'}^2+\gamma^2\sigma''^2\right). \end{align*} \]

To ensure the RHS is less than \(\epsilon^2\), it suffices to have \[ \beta =\frac{\epsilon^2}{4\sigma^2},\quad \gamma = \min\left(\frac{\epsilon^2}{8C_1{\sigma'}^2}, \frac{\epsilon}{2\sigma''}\right), \quad T\geq \frac{C_\Upsilon}{4\epsilon^2\eta}. \]

Since \[ \eta =\min\left(\frac{1}{L}, \frac{\beta}{\sqrt{4C_2}}, \frac{\gamma}{\sqrt{8C_1C_3}}\right), \] the order of \(T\) becomes \[ \begin{align*} T&= O\left(\max\left\{\frac{C_\Upsilon L}{\epsilon^2}, \frac{C_\Upsilon\sqrt{C_2}}{\epsilon^2\beta}, \frac{C_\Upsilon\sqrt{C_1C_3}}{\gamma\epsilon^2}\right\}\right)\\ &=O\left(\max\left\{\frac{C_\Upsilon L}{\epsilon^2}, \frac{C_\Upsilon\sqrt{C_1C_3}\sigma''}{\epsilon^3}, \frac{C_\Upsilon\sigma^2\sqrt{C_2}}{\epsilon^4}, \frac{C_\Upsilon\sqrt{C_1C_3}C_1{\sigma'}^2}{\epsilon^4}\right\}\right). \end{align*} \]

where \[ \begin{align*} C_\Upsilon&=A_0 + \frac{\eta}{\beta}\Delta_{0} + \frac{C_1\eta}{\gamma} \delta_{0}+C_1\eta\delta_0\leq A_0 + \frac{1}{2\sqrt{C_2}}\Delta_{0} + 2\frac{\sqrt{C_1}}{\sqrt{8C_3}} \delta_{0}. \end{align*} \]

■

Finally, we state the convergence of SOX.

Theorem 5.1

Under Assumption 5.1 and Assumption 5.2, SOX with \[ \beta =\frac{\epsilon^2}{4\sigma^2}<\frac{1}{4L_1G_2}, \quad \gamma= \min\left(\frac{\epsilon^2}{128G_2^2L_1^2{\sigma_0}^2}, \frac{n}{2BG_1L_1\sigma_0}\right), \] \[ \eta= \min\left(\frac{1}{2L_F}, \frac{\beta}{2\sqrt{C_2}}, \frac{B\gamma}{n\sqrt{32C_1C_3}}\right), \] can find \(\mathbf{w}_\tau\) with \(\tau\) randomly sampled from \(\{1,\dots, T\}\) so that \[ \mathbb{E}\left[\|\mathbf{v}_{\tau}\|_2^2+\|\nabla F(\mathbf{w}_\tau)\|_2^2\right]\leq \epsilon^2 \] with an iteration complexity of \[ T=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon L_1^2\sigma_0}{\epsilon^3}, \frac{C_\Upsilon L_F \sigma^2 }{\epsilon^4}, \frac{C_\Upsilon L_1^3n\sigma_0^2}{\epsilon^4B}\right\}\right), \] where \(C_1=8G_2^2L_1\), \(C_2=4L_F^2+2\), \(C_3=2G_2^2\), \(\sigma^2= \frac{G_1^2\sigma_2^2}{B}+\frac{G_1^2G_2^2}{B}\frac{n-B}{n-1},\) and \(C_\Upsilon = O\left(F(\mathbf{w}_0) - F_* + \frac{1}{L_F}\|\mathbf{v}_0 - \nabla F(\mathbf{w}_0)\|_2^2 + L_1\frac{1}{n}\|\mathbf{u}_0 - g(\mathbf{w}_0)\|_2^2\right).\)

💡 Why it matters
Theorem 5.1 shows that SOX achieves a complexity dominated by \(O\left(\frac{C_\Upsilon L_1^3n\sigma_0^2}{\epsilon^4 B}\right)\), which is comparable to that of SCMA for finding an \(\epsilon\)-stationary solution. The key difference is that the complexity of SOX is scaled by a factor of \(n/B\), since it must track and estimate \(n\) inner functions.

Proof. Assume that \(\epsilon\) is sufficiently small such that \(8\beta^2G_2^2L_1^2\leq 1\). We have established the following three inequalities: \[ \begin{aligned} &(*)\,\mathbb{E}\left[F(\mathbf{w}_{t+1})\right] \leq \mathbb{E}\left[F(\mathbf{w}_t)\right] + \frac{\eta}{2}\mathbb{E}[\Delta_t]- \frac{\eta}{2} \mathbb{E}\left[\|\nabla F(\mathbf{w}_t)\|_2^2\right] - \frac{\eta}{4} \mathbb{E}\left[\|\mathbf{v}_{t}\|_2^2\right],\\ &(\sharp)\,\mathbb{E}[\Delta_{t+1}] \leq (1-\beta)\mathbb{E}\left[\Delta_{t}\right]+ \frac{2L_F^2+1}{\beta}\eta^2\mathbb{E}\left[\|\mathbf{v}_{t}\|_2^2\right]+ 8\beta L_1^2G_2^2 \mathbb{E}\left[\delta_{t}\right]\\ &\quad +\beta^2\sigma^2 + 4G_2^2L_1^2\beta\gamma^2\sigma_0^2,\\ &(\diamond)\,\mathbb{E}\left[\delta_{t+1}\right] \leq \left(1-\frac{B\gamma}{2n}\right)\mathbb{E}\left[\delta_{t}\right] + \frac{2n G_2^2\eta^2}{B\gamma}\mathbb{E}\left[\|\mathbf{v}_{t}\|_2^2\right]+\frac{B\gamma^2\sigma_0^2}{n}. \end{aligned} \]

Let us define \(\bar\gamma= \frac{B\gamma}{2n}\), the last inequality becomes \[ (\diamond)\,\mathbb{E}\left[\delta_{t+1}\right] \leq \left(1-\bar\gamma\right)\mathbb{E}\left[\delta_{t}\right] + \frac{G_2^2\eta^2}{\bar\gamma}\mathbb{E}\left[\|\mathbf{v}_{t}\|_2^2\right]+\frac{4n{\bar\gamma}^2\sigma_0^2}{B}. \]

Define \[ A_t= 2(F(\mathbf{w}_t) - F(\mathbf{w}_*)), \quad B_t = \|\nabla F(\mathbf{w}_t)\|_2^2, \quad \Gamma_t = \|\mathbf{v}_{t}\|_2^2/2, \] \[ \Delta_t= \|\nabla F(\mathbf{w}_t) - \mathbf{v}_{t}\|_2^2, \quad \delta_{t}=\frac{1}{n}\|\mathbf{u}_{t} - g(\mathbf{w}_t)\|_2^2, \] and \[ \Upsilon_{t}=A_{t} + \frac{\eta_{t-1}}{\beta_{t}}\Delta_{t} + \frac{C_1\eta_{t-1}}{\bar\gamma_{t}} \delta_{t}. \]

Then the three inequalities satisfy those in Lemma 5.4 with \(C_1=8G_2^2L_1^2, \quad C_2=2(2L_F^2+1), \quad C_3=2G_2^2,\) \(\sigma^2 = \frac{G_1^2\sigma_2^2}{B}+\frac{G_1^2G_2^2}{B}\frac{n-B}{n-1}, \quad {\sigma'}^2=\frac{4n\sigma_0^2}{B}, \quad \sigma''^2=4G_2^2L_1^2\sigma_0^2.\) Then \(\eta, \beta, \bar\gamma\) satisfy \[ \begin{aligned} &\beta =\frac{\epsilon^2}{4\sigma^2},\\ &\bar\gamma= \min\left(\frac{\epsilon^2}{8C_1{\sigma'}^2}, \frac{\epsilon}{2\sigma''}\right) = \min\left( \frac{\epsilon^2B}{256G_2^2L_1^2n{\sigma_0}^2}, \frac{\epsilon}{4G_2L_1\sigma_0}\right),\\ &\eta= \min\left(\frac{1}{2L_F}, \frac{\beta}{\sqrt{4C_2}}, \frac{\bar\gamma}{\sqrt{8C_1C_3}}\right). \end{aligned} \]

Thus the order of \(T\) becomes \[ \begin{aligned} T&=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon\sqrt{C_1C_3}\sigma''}{\epsilon^3}, \frac{C_\Upsilon \sigma^2\sqrt{C_2}}{\epsilon^4}, \frac{C_\Upsilon \sqrt{C_1C_3}C_1{\sigma'}^2}{\epsilon^4}\right\}\right)\\ &=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon L_1^2\sigma_0}{\epsilon^3}, \frac{C_\Upsilon L_F \sigma^2 }{\epsilon^4}, \frac{C_\Upsilon L_1^3n\sigma_0^2}{\epsilon^4B}\right\}\right). \end{aligned} \]

where \[ \begin{aligned} C_\Upsilon &\leq2(F(\mathbf{w}_0) - F(\mathbf{w}_*)) + \frac{1}{2\sqrt{C_2}}\|\mathbf{v}_0 - \nabla F(\mathbf{w}_0)\|_2^2 + \frac{\sqrt{C_1}}{\sqrt{2C_3}}\frac{1}{n}\|\mathbf{u}_0 - g(\mathbf{w}_0)\|_2^2\\ &= 2(F(\mathbf{w}_0) - F(\mathbf{w}_*)) + O\left(\frac{1}{L_F}\right)\|\mathbf{v}_0 - \nabla F(\mathbf{w}_0)\|_2^2 + O(L_1)\frac{1}{n}\|\mathbf{u}_0 - g(\mathbf{w}_0)\|_2^2. \end{aligned} \]

5.2.2 Multi-block Single-Probe Variance Reduction

In this subsection, we present a second algorithm for solving FCCO with an improved complexity than that of SOX under a stronger condition on \(g_i\). We replace Assumption 5.2 by the following:

Assumption 5.3

There exist \(G_1, L_1, L_2>0\) such that

1. \(f_i:\mathbb{R}^{d'}\mapsto \mathbb{R}\) is \(G_1\)-Lipschitz continuous and \(L_1\)-smooth;
1. \(\nabla g_i(\cdot, \zeta): \mathbb{R}^{d}\mapsto \mathbb{R}^{d'}\) is mean-squared Lipschitz continuous, i.e., \[ \mathbb{E}_{\zeta}[\|\nabla g_i(\mathbf{w}, \zeta) - \nabla g_i(\mathbf{w}', \zeta)\|_2^2]\leq L_2^2\|\mathbf{w}- \mathbf{w}'\|_2^2, \forall \mathbf{w}, \mathbf{w}'; \]
1. \(F_* =\min_{\mathbf{w}}F(\mathbf{w})>-\infty\).

The idea is to leverage advanced variance reduction for tracking both the inner functions and the gradient. A straightforward approach is to change the update of \(\mathbf{u}_{i,t-1}\) by using the STORM estimator and do similarly for the gradient estimator. In particular, one may change the update for \(\mathbf{u}_{i,t}\) according to STORM:

\[\begin{equation}\label{eqn:soxe2} \begin{split} \mathbf{u}_{i,t}= \left\{\begin{array}{ll} (1-\gamma_t)\mathbf{u}_{i,t-1}+\gamma_t g_i(\mathbf{w}_{t}; \zeta_{i,t}) +\underbrace{ (1-\gamma_t)(g_i(\mathbf{w}_{t}; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \zeta_{i,t}))}\limits_{\text{error correction}} & i \in \mathcal{B}_{t}\\ \mathbf{u}_{i,t-1} & i \notin \mathcal{B}_{t} \end{array}\right. \end{split} \end{equation}\]

However, this naive approach does not work as the standard error correction term only accounts for the randomness in \(g_i(\mathbf{w}_t; \zeta_{i,t})\) but not the randomness caused by sampling \(i\in\mathcal B_t\).

To address this, we introduce the multi-block single-probe variance reduction (MSVR) estimator:

\[\begin{equation}\label{MSVR} \begin{split} \mathbf{u}_{i,t}=\left\{\begin{array}{ll} (1-\gamma_{t})\mathbf{u}_{i, t-1}+\gamma_{t} g_i(\mathbf{w}_{t}; \zeta_{i,t}) + \gamma'_{t} \left( g_i(\mathbf{w}_{t}; \zeta_{i,t})-g_i(\mathbf{w}_{t-1}; \zeta_{i,t}) \right) & i \in \mathcal{B}_{t}\\ \mathbf{u}_{i,t-1} & i \notin \mathcal{B}_{t} \end{array}\right. \end{split} \end{equation}\]

The difference from (\(\ref{eqn:soxe2}\)) lies at the value of \(\gamma'_{t}\), which is set as \(\frac{n-B}{B(1-\gamma_{t})}+(1-\gamma_{t})\) with \(B=|\mathcal{B}_t|\). The MSVR estimator can track multiple functional mappings \((g_1, g_2, \cdots, g_n)\), simultaneously, while the number of sampled blocks \(B\) for probing can be as small as one. It is notable that when \(B=n\), i.e., all blocks are probed at each iteration, \(\gamma'_{t}=1-\gamma_{t}\) and MSVR reduces to STORM applied to \(\mathbf{g}(\mathbf{w})\). The additional factor in \(\gamma'_{t}\), i.e., \(\alpha_t =\frac{n-B}{B(1-\gamma_{t})}\) is to account for the randomness in the sampled blocks and noise in those blocks that are not updated.

With \(\mathbf{u}_t\), we compute a vanilla gradient estimator by

\[\mathbf{z}_{t} = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t}),\]

where \(\mathcal{B}'_t\subset[n]\) is a mini-batch of \(B\) indices independent of \(\mathcal{B}_t\).

Similar to SCST, we apply another STORM estimator to estimate

\[\mathcal{M}_{t}: = \mathbb{E}_{\mathcal{B}'_t, \zeta'_t}[\mathbf{z}_{t}] = \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\mathbf{u}_{i,t}),\]

with an extra vanilla gradient estimator at previous iteration:

\[\tilde{\mathbf{z}}_{t-1} = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \nabla g_{i}(\mathbf{w}_{t-1};\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1}).\]

This is computed by the following sequence:

\[\begin{equation} \begin{split} \label{eqn:z} \mathbf{v}_{t} &= (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t + (1-\beta_t)(\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1}). \end{split} \end{equation}\]

Then we use \(\mathbf{v}_{t}\) to update the model parameter. The full steps are presented in Algorithm 15.

💡 Critical

We use an independent batch \(\mathcal{B}'_t\) because \(\mathbf{z}_t\) depends on \(\mathbf{u}_t\), which depends on \(\mathcal{B}_t\). If we use the same batch \(\mathcal{B}_t\) to compute \(\mathbf{z}_t\), then

\[\begin{align*} \mathcal{M}_t&=\mathbb{E}_{\mathcal{B}_t, \zeta'_t}\bigg[\frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t})\bigg]\\ & = \mathbb{E}_{\mathcal{B}_t, \zeta'_t}\bigg[\frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\bar{\mathbf{u}}_{i,t})\bigg]=\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}_{t}) \nabla f_i(\bar{\mathbf{u}}_{i,t}). \end{align*}\]

where \(\bar{\mathbf{u}}_t\) independent of \(\mathcal{B}_t\) is defined in (\(\ref{MSVR}\)). However, we cannot construct an unbiased estimator of \(\mathcal{M}_{t-1}\) since \(\bar{\mathbf{u}}_{t-1}\) is not available in the algorithm.

An alternative approach is that we use \(\mathbf{u}_{t-1}\) and \(\mathbf{u}_{t-2}\) to compute \(\mathbf{z}_t\) and \(\tilde{\mathbf{z}}_{t-1}\), respectively, with \(\mathcal{B}_t\), i.e.,

\[\begin{align*} &\mathbf{z}_{t} = \frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1}),\\ &\tilde{\mathbf{z}}_{t-1} = \frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \nabla g_{i}(\mathbf{w}_{t-1};\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-2}), \end{align*}\] and compute \(\mathbf{v}_t\) by

\[\mathbf{v}_{t} = (1-\beta_t)\mathbf{v}_{t} + \beta_t \mathbf{z}_t + (1-\beta_t)(\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1}).\]

The convergence analysis can be performed similarly with slight modifications.

Algorithm 15: MSVR

Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\), \(\{\beta_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_0\), \(\mathbf{u}_{0}\), \(\mathbf{v}_0\)
Let \(\mathbf{w}_1 = \mathbf{w}_0 - \eta_0 \mathbf{v}_0\)
for \(t=1,\dotsc, T\) do
Draw two batches of samples \(\mathcal{B}_t, \mathcal{B}'_t\subset[n]\)
for \(i\in \mathcal{B}_t\) do
Draw two samples \(\zeta_{i,t}, \zeta'_{i,t}\sim \mathbb{P}_i\)
Update the inner function value estimators \[\mathbf{u}_{i,t}=(1-\gamma_{t})\mathbf{u}_{i, t-1}+\gamma_{t} g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) + \gamma'_{t} \left( g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right)\]
end for
Set \(\mathbf{u}_{i,t} = \mathbf{u}_{i, t-1}\), \(i\notin\mathcal{B}_t\)
Compute the vanilla gradient estimator \(\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t})\)
Compute the extra vanilla gradient estimator \(\tilde{\mathbf{z}}_{t-1} = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \nabla g_{i}(\mathbf{w}_{t-1};\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1})\)
Update the STORM gradient estimator \(\mathbf{v}_{t} = (1-\beta_t) \mathbf{v}_{t-1} + \beta_t \mathbf{z}_t + (1-\beta_t) (\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1})\)
Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{v}_{t}\)
end for

Convergence Analysis

We first analyze the error recursion of

\[\delta_t: =\frac{1}{n} \|\mathbf{u}_{t} - \mathbf{g}(\mathbf{w}_t)\|_2^2=\frac{1}{n}\sum_{i=1}^n\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\|_2^2.\]

Similar to the analysis of SOX, we introduce virtual sequences \(\bar{\mathbf{u}}_{i,t},\forall i\):

\[\begin{equation}\label{eqn:virtu-storm} \bar{\mathbf{u}}_{i,t}=(1-\gamma_{t})\mathbf{u}_{i, t-1}+\gamma_{t} g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) + \gamma'_{t} \left( g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right),\forall i. \end{equation}\]

Lemma 5.5

Consider the \(\mathbf{u}_t\) updates in Algorithm 15. Under Assumption 5.1 and Assumption 5.3 (ii), by setting \(\gamma'_{t} = \frac{n-B}{B(1-\gamma_{t})}+(1-\gamma_{t})\), for \(\gamma_{t} \leq \frac{1}{2}\), we have:

\[\mathbb{E}\left[\delta_t\right] \leq \left(1-\frac{B\gamma_{t}}{n}\right)\mathbb{E}\left[\delta_{t-1}\right]+ \frac{2B}{n}\gamma_{t}^{2} \sigma_0^{2} + \frac{12 n G_2^2}{B}\mathbb{E}\left[\left\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\right\|^{2}\right].\]

Proof.

Let us consider a fixed \(i\in[n]\). With a probability \(B/n\) that \(i\in\mathcal{B}_t\), we have \(\mathbf{u}_{i,t} = \bar{\mathbf{u}}_{i,t}\); otherwise \(\mathbf{u}_{i,t} = \mathbf{u}_{i,t-1}\). Hence,

\[\begin{align*} \mathbb{E}\left[ \|\mathbf{u}_{i,t} - g_{i}(\mathbf{w}_t)\|_2^2\right]= \frac{B}{n}\underbrace{\mathbb{E}\left[ \|\bar{\mathbf{u}}_{i,t} - g_{i}(\mathbf{w}_t)\|_2^2\right]}_{A_1} + \left(1-\frac{B}{n}\right)\underbrace{\mathbb{E}\left[ \|\mathbf{u}_{i,t-1} - g_{i}(\mathbf{w}_t)\|_2^2\right]}_{A_2}. \end{align*}\]

Note that the first term \(A_1\) in the R.H.B. can be bounded similarly as in Lemma 4.12 for using the STORM estimator by building a recursion with \(\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2\). However, there exists the second term due to the randomness of \(\mathcal{B}_t\), which can be decomposed as

\[\begin{align*} &A_2 = \mathbb{E}[\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}) + g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t)\|_2^2] \\ & = \underbrace{\mathbb{E}[\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2]}_{A_{21}} + \underbrace{\mathbb{E}[\|g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t)\|_2^2]}_{A_{22}}\\ &+ \underbrace{\mathbb{E}[2(\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}))^{\top}(g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t))]}_{A_{23}}. \end{align*}\]

The first two terms in RHS (\(A_{21}\) and \(A_{22}\)) can be easily handled. The difficulty comes from the third term, which cannot be simply bounded by using Young’s inequality. If doing so, it will end up with a non-diminishing error of \(\mathbf{u}_{i,t}\). To combat this difficulty, we use the additional factor brought by \(\gamma'_{t} (g_i\left(\mathbf{w}_{t}; \xi_t^{i}\right)-g_i\left(\mathbf{w}_{t-1}; \xi_t^{i}\right))\) in \(A_1\) to cancel \(A_{23}\). This is more clear by the following decomposition of \(A_1\).

\[\begin{align*} A_1 =\; &\mathbb{E}[\|\underbrace{(1-\gamma_{t})(\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}))}_{A_{11}} + \underbrace{\alpha_t (g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1}))}_{A_{12}} \\ &+ \underbrace{\gamma_{t} (g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_t))}_{A_{13}} \\ &+\underbrace{\gamma'_{t}(g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \xi_{i,t}) - g_i(\mathbf{w}_t) + g_i(\mathbf{w}_{t-1}))}_{A_{14}} \|_2^2], \end{align*}\]

where \(\alpha_t = \gamma'_t+\gamma_t-1\). Since \(\mathbb{E}_t[A_{13}]=0, \mathbb{E}_t[A_{14}]=0\), then we have

\[\begin{align*} A_1 \leq &\mathbb{E}[\|A_{11}+A_{12}\|_2^2 ] + \mathbb{E}\left[\|A_{13}+A_{14}\|_2^2\right]. \end{align*}\]

In light of the above decomposition, we can bound \(\mathbb{E}[\|A_{11}+A_{12}\|_2^2]\leq \mathbb{E}[\|A_{11}\|_2^2 + \|A_{12}\|_2^2 + 2A_{11}^{\top}A_{12}]\) and \(\mathbb{E}[\|A_{13}+A_{14}\|_2^2]\leq 2\mathbb{E}[\|A_{13}\|_2^2] + 2\mathbb{E}[\|A_{14}\|_2^2]\). The resulting term \(\mathbb{E}[2A_{11}^{\top}A_{12}]\) has a negative sign as \(A_{23}\). Hence, by carefully choosing \(\gamma'_{t}\), we can cancel both terms. Specifically, we have

\[\begin{align*} \frac{B}{n} A_1 &\leq \frac{B}{n}\left(\mathbb{E}[\|A_{11}\|_2^2 + \|A_{12}\|_2^2 + 2A_{11}^{\top}A_{12}] + 2\mathbb{E}[\|A_{13}\|_2^2] + 2\mathbb{E}[\|A_{14}\|_2^2]\right)\\ & = \mathbb{E}\bigg[\frac{B}{n}(1-\gamma_t)^2 \|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2\bigg] + \mathbb{E}\bigg[\frac{B}{n}\alpha_t^2 \left\|g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1})\right\|_2^2\bigg]\\ & + \mathbb{E}\bigg[\frac{B}{n}2\alpha_t(1-\gamma_t) (g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1}))^{\top}(\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}))\bigg]\\ & + \mathbb{E}\bigg[\frac{B}{n}2\gamma_t^2\left\|g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_t)\right\|_2^2\bigg] \\ & + \mathbb{E}\bigg[\frac{B}{n}2{\gamma'_t}^2 \left\|(g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \zeta_{i,t}) - g_i(\mathbf{w}_t) + g_i(\mathbf{w}_{t-1}))\right\|_2^2\bigg]. \end{align*}\]

Combining the upper bounds of \(A_1\) and \(A_2\), we have

\[\begin{align*} &\frac{B}{n} A_1 + \frac{n-B}{n}A_2\\ & \leq \mathbb{E}\left[\frac{B}{n}(1-\gamma_t)^2 \|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2 + \frac{B}{n}\alpha_t^2 \left\|g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1})\right\|_2^2\right]\\ & + \mathbb{E}\bigg[\frac{B}{n}2\alpha_t(1-\gamma_t) (g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1}))^{\top}(\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}))\bigg]\\ & + \mathbb{E}\bigg[\frac{B}{n}2\gamma_t^2 \left\|g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_t)\right\|_2^2\bigg] \\ & + \mathbb{E}\bigg[\frac{B}{n}2(\gamma'_t)^2 \left\|(g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \zeta_{i,t}) - g_i(\mathbf{w}_t) + g_i(\mathbf{w}_{t-1}))\right\|_2^2\bigg]\\ & + \mathbb{E}\bigg[\frac{n-B}{n}\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2\bigg] + \mathbb{E}\bigg[\frac{n-B}{n}\|g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t)\|_2^2\bigg]\\ &+ \mathbb{E}\bigg[\frac{n-B}{n}2(\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1}))^{\top}(g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t))\bigg]. \end{align*}\]

Since \(\frac{B}{n}2\alpha_t(1-\gamma_t)= 2\frac{B}{n}\frac{(n-B)}{B(1-\gamma_t)}(1-\gamma_t)= 2\frac{n-B}{n}\), then cross terms will cancel out. The remaining terms can be merged and handled separately. First,

\[\begin{align*} & \mathbb{E}\bigg[\frac{B}{n}(1-\gamma_t)^2 \|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2 + \frac{n-B}{n}\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2\bigg]\\ &\leq \left(1- \frac{B}{n}\gamma_t\right) \mathbb{E}[\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2], \end{align*}\]

where we use \(\frac{B}{n}(1-\gamma_t)^2+ \frac{n-B}{n}\leq 1 - \frac{2B}{n}\gamma_t + \frac{B}{n}\gamma_t^2 \leq 1 - \frac{B}{n}\gamma_t\) due to \(\gamma_t< 1\). Second,

\[\begin{align*} &\frac{B}{n}\alpha_t^2 \left\|g_i(\mathbf{w}_t) - g_i(\mathbf{w}_{t-1})\right\|_2^2 + \frac{n-B}{n}\|g_i(\mathbf{w}_{t-1}) - g_i(\mathbf{w}_t)\|_2^2\\ &\leq \bigg(\frac{B}{n}\frac{(n-B)^2}{B^2(1-\gamma_t)^2} + \frac{n-B}{n}\bigg)G_2^2\|\mathbf{w}_{t} - \mathbf{w}_{t-1}\|_2^2 \leq \frac{4n-4B}{B}G_2^2\|\mathbf{w}_{t} - \mathbf{w}_{t-1}\|_2^2, \end{align*}\]

where we use \(\frac{B}{n}\frac{(n-B)^2}{B^2(1-\gamma_t)^2} + \frac{n-B}{n}\leq \frac{n-B}{n}\left(\frac{(n-B)}{B(1-\gamma_t)^2} + 1\right)\leq \frac{n-B}{n}\left(\frac{4(n-B)}{B} + 4\right) = \frac{4n-4B}{B}\) due to \(\gamma_t\leq 1/2\). Third,

\[\begin{align*} &\mathbb{E}\bigg[\frac{B}{n}2{\gamma'_t}^2 \left\|(g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \zeta_{i,t}) - g_i(\mathbf{w}_t) + g_i(\mathbf{w}_{t-1}))\right\|_2^2\bigg]\\ & \leq \frac{B}{n}2{\gamma'_t}^2\mathbb{E}\bigg[\left\|(g_i(\mathbf{w}_t; \zeta_{i,t}) - g_i(\mathbf{w}_{t-1}; \zeta_{i,t}))\right\|_2^2\bigg]\\ & \leq \frac{B}{n}2\bigg(\frac{n-B}{B(1-\gamma_t)} + 1-\gamma_t\bigg)^2G_2^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2 \leq \frac{8n-4B}{B}G_2^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2, \end{align*}\]

where we use \(\frac{B}{n}2(\frac{n-B}{B(1-\gamma_t)} + 1-\gamma_t)^2\leq \frac{B}{n}2(\frac{2(n-B)}{B} + 1)^2\leq \frac{B}{n}2(\frac{2n-B}{B})^2 = \frac{2(2n-B)(2n-B)}{nB}\leq \frac{8n-4B}{B}\).

Combining the above results, we have

\[\begin{align*} & \mathbb{E}\left[ \|\mathbf{u}_{i,t} - g_{i}(\mathbf{w}_t)\|_2^2\right] \leq \left(1- \frac{B}{n}\gamma_t\right) \mathbb{E}[\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\|_2^2]\\ & + \frac{12n-8B}{B}G_2^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2 + \frac{B}{n}2\gamma_t^2\sigma_0^2. \end{align*}\]

Averaging over \(i=1, \ldots, n\) concludes the proof.

■

Lemma 5.6

Consider the \(\mathbf{u}_t\) updates in Algorithm 15. Suppose that Assumption 5.1 and Assumption 5.3 hold. With \(\gamma_t\leq \frac{1}{2}\) and \(\gamma'_{t} = \frac{n-B}{B(1-\gamma_{t})}+(1-\gamma_{t})\), we have

\[\mathbb{E}\left[\|\mathbf{u}_{t} - \mathbf{u}_{t-1}\|_2^2\right]\leq 6 B\gamma_t^2\sigma_0^2 + 6 B\gamma_t^2\mathbb{E}[\delta_{t-1}]+ \frac{10 n^2 G_2^2}{B}\mathbb{E}\left[\|\mathbf{w}_t-\mathbf{w}_{t-1}\|_2^2\right].\]

Proof.

Since \(\|\mathbf{u}_{t} - \mathbf{u}_{t-1}\|_2^2=\sum_{i=1}^n\|\mathbf{u}_{i,t} - \mathbf{u}_{i,t-1}\|_2^2\), with a probability \(B/n\) we have \(\mathbf{u}_{i,t} = \bar{\mathbf{u}}_{i,t}\) and a probability \(1-B/n\) we have \(\mathbf{u}_{i,t}=\mathbf{u}_{i,t-1}\), then

\[\begin{equation*} \begin{aligned} &\mathbb{E}\left[\|\mathbf{u}_{t} - \mathbf{u}_{t-1}\|_2^2\right]\\ & = \frac{B}{n} \sum_{i=1}^n \mathbb{E}\Big[ \big\| -\gamma_{t}\mathbf{u}_{i, t-1}+\gamma_{t} g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) + \gamma'_{t} \left( g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right)\big\|_2^2 \Big]\\ &\leq \frac{B}{n} \sum_{i=1}^n \mathbb{E}\left[ 2 \gamma_t^2 \left\|g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) - \mathbf{u}_{i,t-1}\right\|_2^2 + 2({\gamma}'_{t})^2 \left\| g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right\|_2^2 \right]\\ &\leq \frac{B}{n} \sum_{i=1}^n \mathbb{E}\left[ 2 \gamma_t^2 \left\|g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) - \mathbf{u}_{i,t-1}\right\|_2^2 \right] + 2B({\gamma}'_{t})^2G_2^2 \left\| \mathbf{w}_t - \mathbf{w}_{t-1} \right\|_2^2. \end{aligned} \end{equation*}\]

To bound the first term on the RHS, we use the Young’s inequality and Lipschitz continuity of \(g_i\):

\[\begin{align*} &\mathbb{E}\left[\left\|g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) - \mathbf{u}_{i,t-1}\right\|_2^2\right]\leq 3\mathbb{E}\left[\left\|g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) - g_i\left(\mathbf{w}_{t}\right)\right\|_2^2\right]\\ &+3\mathbb{E}\left[\left\|g_i\left(\mathbf{w}_{t}\right) - g_i\left(\mathbf{w}_{t-1}\right)\right\|_2^2+3\left\|g_i\left(\mathbf{w}_{t-1}\right) - \mathbf{u}_{i,t-1}\right\|_2^2\right]\\ & \leq 3\sigma_0^2 + 3G_2^2\mathbb{E}\left[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2+3\left\|g_i\left(\mathbf{w}_{t-1}\right) - \mathbf{u}_{i,t-1}\right\|_2^2\right]. \end{align*}\]

Combining the above results, we have

\[\begin{equation*} \begin{aligned} &\mathbb{E}\left[\|\mathbf{u}_{t} - \mathbf{u}_{t-1}\|_2^2\right]\\ &\leq 6 B\gamma_t^2\sigma_0^2 + 6 B\gamma_t^2\mathbb{E}\left[\delta_{t-1}\right]+ 2 BG_2^2(3\gamma_t^2+({\gamma}'_t)^2 )\mathbb{E}\left[\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2\right]. \end{aligned} \end{equation*}\]

With \(\gamma_t\leq \frac{1}{2}\), we have \(\gamma'_t\leq \frac{2n}{B}\), which yields \((3\gamma_t^2+({\gamma}'_t)^2 )\leq \frac{5n^2}{B^2}\).

■

Next, we analyze error recursion of \(\Delta_t:=\left\|\mathbf{v}_{t} - \mathcal{M}_t\right\|_2^2\).

Lemma 5.7

Consider the \(\mathbf{v}_{t}\) updates in Algorithm 15 and suppose that Assumption 5.1 and Assumption 5.3 hold. Then we have

\[\begin{equation*} \begin{aligned} \mathbb{E}[\Delta_t] &\leq (1-\beta_t)\mathbb{E}[\Delta_{t-1}] + \frac{24 G_2^2L_1^2B\gamma_t^2}{n}\mathbb{E}[\delta_{t-1}]\\ &+ \left(4L_2^2G_1^2+ \frac{40G_2^4L_1^2n}{B}\right)\mathbb{E}\left[\|\mathbf{w}_{t-1} - \mathbf{w}_t\|_2^2\right] + 2\beta_t^2 \sigma^2 + \frac{24 G_2^2L_1^2B}{n}\gamma_t^2\sigma_0^2, \end{aligned} \end{equation*}\]

where \(\sigma^2=\frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2}{B}\frac{n-B}{n-1}\).

Proof.

Similar to Lemma 5.2, we have \(\mathbb{E}_t[\|\mathbf{z}_t - \mathcal{M}_t\|_2^2]\leq \sigma^2\). Since \(\mathbf{v}_t = (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t + (1-\beta_t)(\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1})\), applying Lemma 4.11, we have

\[\begin{align*} \mathbb{E}_t\left[\left\|\mathbf{v}_{t} - \mathcal{M}_t\right\|_2^2\right] & \leq (1-\beta_t)\left\|\mathbf{v}_{t-1} - \mathcal{M}_{t-1}\right\|_2^2 + \mathbb{E}_t[2\|\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1}\|_2^2] + 2\beta_t^2\sigma^2. \end{align*}\]

To bound \(\mathbb{E}_t[\|\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1}\|_2^2]\), we have

\[\begin{equation*} \begin{aligned} &\mathbb{E}_t[\|\mathbf{z}_t - \tilde{\mathbf{z}}_{t-1}\|_2^2]\\ &\leq 2 \mathbb{E}_t\left[\frac{1}{B}\sum_{i\in\mathcal{B}'_t}\left\|\nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t}) - \nabla g_{i}(\mathbf{w}_{t};\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1})\right\|_2^2\right]\\ & + 2 \mathbb{E}_t\left[\frac{1}{B}\sum_{i\in\mathcal{B}'_t}\left\|\nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1}) - \nabla g_{i}(\mathbf{w}_{t-1};\zeta'_{i,t}) \nabla f_{i}(\mathbf{u}_{i,t-1})\right\|_2^2\right]\\ &\leq 2G_2^2L_1^2\mathbb{E}_t\left[\frac{1}{B}\sum_{i\in\mathcal{B}'_t}\|\mathbf{u}_{i,t}- \mathbf{u}_{i,t-1}\|_2^2\right] + 2L_2^2 G_1^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2\\ &=2G_2^2L_1^2\mathbb{E}_t\left[\frac{1}{n}\sum_{i=1}^n\|\mathbf{u}_{i,t}- \mathbf{u}_{i,t-1}\|_2^2\right] + 2L_2^2 G_1^2\|\mathbf{w}_t - \mathbf{w}_{t-1}\|_2^2, \end{aligned} \end{equation*}\]

where the last inequality follows Assumption 5.3.

As a result, we have

\[\begin{equation*} \begin{aligned} \mathbb{E}[\Delta_t]\leq & (1-\beta_t)\mathbb{E}[\Delta_{t-1}] + \frac{4G_2^2L_1^2}{n}\mathbb{E}\left[\|\mathbf{u}_{t} - \mathbf{u}_{t-1}\|_2^2\right] + 4L_2^2G_1^2\mathbb{E}\left[\|\mathbf{w}_{t-1} - \mathbf{w}_t\|_2^2\right]\\ & + 2\beta_t^2\sigma^2. \end{aligned} \end{equation*}\]

Combining with the result in Lemma 5.6, i.e.,

we have

\[\begin{equation*} \begin{aligned} \mathbb{E}[\Delta_t] &\leq (1-\beta_t)\mathbb{E}[\Delta_{t-1}] + \frac{24 B G_2^2L_1^2}{n}\gamma_t^2\mathbb{E}[\delta_{t-1}]\\ &+ \left(4L_2^2G_1^2+ \frac{40G_2^4L_1^2n}{B}\right)\mathbb{E}\left[\|\mathbf{w}_{t-1} - \mathbf{w}_t\|_2^2\right] + 2\beta_t^2 \sigma^2 + \frac{24 B G_2^2L_1^2\gamma_t^2\sigma_0^2}{n}, \end{aligned} \end{equation*}\]

which completes the proof.

■

Lemma 5.8

For the update \(\mathbf{w}_{t+1} = \mathbf{w}_t -\eta_t \mathbf{v}_{t}, t\geq 0\), if \(\eta_t\leq 1/(2L_F)\) we have

\[\begin{align}\label{eq:csstorm_starter} F(\mathbf{w}_{t+1})& \leq F(\mathbf{w}_t) + G_2^2L_1^2\eta_t \delta_t + \eta_t\Delta_t- \frac{\eta_t}{2}\left\|\nabla F(\mathbf{w}_t)\right\|_2^2- \frac{1}{4\eta_t} \left\|\mathbf{w}_{t+1} - \mathbf{w}_t\right\|_2^2. \end{align}\]

Proof. It follows directly from Lemma 4.9 by noting that

\[\begin{equation*} \begin{aligned} &\|\mathbf{v}_{t} - \nabla F(\mathbf{w}_t)\|_2^2 =\|\mathbf{v}_{t} -\mathcal{M}_t + \mathcal{M}_t- \nabla F(\mathbf{w}_t)\|_2^2 \\ & \leq 2\Delta_t +2\Bigg\| \frac{1}{n}\sum_{i=1}^n \nabla g_i(\mathbf{w}_t) \nabla f_i(\mathbf{u}_{i,t}) - \frac{1}{n}\sum_{i=1}^n \nabla g_i(\mathbf{w}_t) \nabla f_i( g_i(\mathbf{w}_t))\Bigg\|_2^2\\ & \leq 2\Delta_t +\frac{2G_2^2L_1^2 }{n}\sum_{i=1}^n\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\|_2^2. \end{aligned} \end{equation*}\]

Taking expectation over all randomness on both sides yields the desired result.

■

Now we state the convergence theorem for MSVR.

Theorem 5.2

Suppose that Assumption 5.1 and Assumption 5.3 hold. Let \(\beta =O(\frac{\epsilon\eta L_1\sqrt{n}}{\sigma \sqrt{B}}), \gamma =\min\bigg(\frac{\epsilon\eta n}{{\sigma_0}B}, 1\bigg), \eta=\min\bigg(\frac{1}{2L_F}, O(\frac{\epsilon \sqrt{B}}{L_1\sigma\sqrt{n}}), O(\frac{\epsilon B}{L_1^2{\sigma_0}n}), O(\frac{B}{nL_1})\bigg)\). Then MSVR can find \(\mathbf{w}_{\tau}\) that is sampled randomly from \(\{0,\dotsc, T-1\}\) satisfying

\[\begin{align*} \mathbb{E}\left[\left\|\mathbf{v}_{\tau}\right\|_2^2+\left\|\nabla F(\mathbf{w}_\tau)\right\|_2^2\right]\leq O(\epsilon). \end{align*}\]

with an iteration complexity of

\[T = O\bigg(\max\bigg\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon L_1n}{\epsilon^2B}, \frac{C_\Upsilon L_1\sigma \sqrt{n}}{\epsilon^3\sqrt{B}}, \frac{C_\Upsilon L_1^2{\sigma_0}n}{\epsilon^3B}\bigg\}\bigg),\]

where \(\sigma^2= \frac{G_1^2\sigma_2^2}{B}+ \frac{G_1^2G_2^2(n-B)}{B(n-1)}\), \(C_\Upsilon = O(F(\mathbf{w}_0) - F_* + \frac{B}{nL_1^2\eta}\Delta_0 + \frac{B}{n\eta}\delta_0)\).

💡 Why it matters

Theorem 5.2 indicates that when the initial estimators \(\mathbf{u}_0\) and \(\mathbf{v}_0\) have an estimation error in the order of \(O(\epsilon)\) such that \(C_\Upsilon\) is \(O(1)\), MSVR attains a better complexity than SOX for finding an \(\epsilon\)-stationary solution under stronger assumptions of the mean-Lipschitz continuity of \(g\) and \(\nabla g\). Its complexity is comparable to that of SCST in Theorem 4.4, up to a factor of \(n/B\).

Proof.

We have established the following:

\[\begin{equation*} \begin{aligned} &(*)\, F(\mathbf{w}_{t+1}) \leq F(\mathbf{w}_t) + G_2^2L_1^2\eta_t \delta_t + \eta_t\Delta_t- \frac{\eta_t}{2}\left\|\nabla F(\mathbf{w}_t)\right\|_2^2- \frac{1}{4\eta_t} \left\|\mathbf{w}_{t+1} - \mathbf{w}_t\right\|_2^2\\ &(\sharp) \mathbb{E}[\Delta_t] \leq (1-\beta_t)\mathbb{E}[\Delta_{t-1}] + \frac{24 B G_2^2L_1^2}{n}\gamma_t^2\mathbb{E}[\delta_{t-1}]\\ &+ \left(4L_2^2G_1^2+ \frac{40G_2^4L_1^2n}{B}\right)\mathbb{E}\left[\|\mathbf{w}_{t-1} - \mathbf{w}_t\|_2^2\right] + 2\beta_t^2 \sigma^2 + \frac{24B G_2^2L_1^2\sigma_0^2}{n}\gamma_t^2, \\ &(\diamond)\mathbb{E}\left[\delta_t\right] \leq \left(1-\frac{B\gamma_{t}}{n}\right)\mathbb{E}\left[\delta_{t-1}\right]+ \frac{2B}{n}\gamma_{t}^{2} \sigma_0^{2}+\frac{12 n G_2^2}{B}\mathbb{E}\left[\left\|\mathbf{w}_{t}-\mathbf{w}_{t-1}\right\|^{2}\right]. \end{aligned} \end{equation*}\]

In order to apply Lemma 4.15, we let \(A_t = F(\mathbf{w}_t) -F_*\), \(B_t=\|\nabla F(\mathbf{w}_t)\|_2^2/2\), \(\Gamma_t= \|\mathbf{v}_{t}\|_2^2/4\), \(\bar\delta_t= L_1^2G_2^2\delta_t\), \(\bar\gamma_{t}= \frac{B\gamma_{t}}{n}\). Then the following three inequalities

\[\begin{align*} (*)&\mathbb{E}[A_{t+1}]\leq \mathbb{E}[A_t + \eta_t \Delta_t + \eta_t \bar\delta_t - \eta_t B_t - \eta_t \Gamma_t]\\ (\sharp)&\mathbb{E}\left[\Delta_{t+1}\right] \leq\mathbb{E}[(1-\beta_{t+1})\Delta_{t} + C_1\bar\gamma_{t+1}^2\bar\delta_{t} + C_2\eta_{t}^2\Gamma_t + \beta_{t+1}^2\sigma^2 + \bar\gamma_{t+1}^2\sigma'^2],\\ (\diamond)& \mathbb{E}\left[\bar\delta_{t+1}\right] \leq \mathbb{E}[(1-\bar\gamma_{t+1}) \bar\delta_{t} + C_3\eta_{t}^2 \Gamma_t + \bar\gamma_{t+1}^2{\sigma''}^2], \end{align*}\] hold with \(C_1=O(n/B), C_2=O(L_1^2n/B+L_2^2), C_3=O(L_1^2n/B), \sigma^2=\frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2(n-B)}{B(n-1)}, \sigma'^2 = O(L_1^2\sigma_0^2n/B), \sigma''^2=O(L_1^2\sigma_0^2n/B)\). Following the settings in Lemma 4.15, we can finish the proof with

\[\begin{align*} \eta&= \min\bigg(\frac{1}{L}, \frac{\epsilon}{4\sqrt{C_2}\sigma}, \frac{\epsilon \sqrt{C_2}}{8C_3{\sigma'}}, \frac{\epsilon }{8\sqrt{C_3}{\sigma''}}, \frac{\sqrt{C_2}}{4C_3\sqrt{C_1}}\bigg)\\ &= \min\bigg(\frac{1}{2L_F}, O\bigg(\frac{\epsilon}{L_1\sigma}\sqrt{\frac{B}{n}}\bigg), O\bigg(\frac{\epsilon B}{L_1^2{\sigma_0}n}\bigg), O\bigg(\frac{B}{n L_1}\bigg)\bigg),\\ \beta& =\frac{\epsilon\eta\sqrt{C_2}}{\sigma}=O\left(\frac{\epsilon\eta L_1}{\sigma}\sqrt{\frac{n}{B}}\right),\\ \bar\gamma & = \min\bigg(\frac{\epsilon\eta\sqrt{C_2}}{{\sigma'}},\frac{\epsilon\eta\sqrt{C_3}}{{\sigma''}}, \frac{C_2}{2C_3C_1}\bigg) = \min\bigg(O\bigg(\frac{\epsilon\eta}{{\sigma_0}}\bigg), O\bigg(\frac{B}{n}\bigg)\bigg),\\ T & = O\bigg(\max\bigg\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon L_1n}{\epsilon^2B}, \frac{C_\Upsilon L_1\sigma \sqrt{n}}{\epsilon^3\sqrt{B}}, \frac{C_\Upsilon L_1^2{\sigma_0}n}{\epsilon^3B}\bigg\}\bigg), \end{align*}\]

where \(C_\Upsilon=F(\mathbf{w}_0)-F_*+ \frac{1}{4C_2\eta}\Delta_{0} + \frac{L_1^2G_2^2}{4C_3\eta} \delta_{0}\).

■

← Go Back