

5.5.1 A Basic Algorithm

To optimize the general COCE minimization problem, we present a basic stochastic algorithm in Algorithm 20. It alternates a stochastic block-coordinate update for \(\boldsymbol{\nu}\) with an SGD update for \(\mathbf{w}\), which is essentially SGD applied to \(F(\mathbf{w},\boldsymbol{\nu})\). Below, we present its convergence analysis for both convex and non-convex loss functions.

5.5.1.1 Convex loss

For notational simplicity, we set \(\tau = 1\) throughout the analysis.

Assumption 5.13

\(s_i(\cdot, \zeta)\) is a convex function.


Algorithm 20: The SGD algorithm for solving COCE

  1. Input: Initialize \(\mathbf{w}_0,\boldsymbol{\nu}_0\), step sizes \(\eta_t\) and \(\gamma_t\)
  2. for \(t=0,1\dotsc,T-1\) do
  3.   Sample \(\mathcal{B}_t\subset \{1,\dotsc, n\}\) with \(|\mathcal{B}_t| = B\)
  4.   for each \(i\in\mathcal{B}_t\) do
  5.     Update \(\nu_{i,t+1} = \nu_{i,t} - \gamma_t \partial_2\Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})\)
  6.   end for (keep \(\nu_{j,t+1} = \nu_{j,t}\) for \(j\notin\mathcal{B}_t\))
  7.   Compute \(\mathbf{z}_t = \frac{1}{B}\sum_{i\in\mathcal{B}_t}\partial_1 \Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})\)
  8.   Update \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\mathbf{z}_t\)
  9. end for
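To make the updates concrete, the following NumPy sketch runs Algorithm 20 on a synthetic instance. The squared loss \(s_i(\mathbf{w};\zeta)=(\mathbf{x}^\top\mathbf{w}-y)^2\), the CVaR-type choice \(\phi^*(x)=x_+/(1-\alpha)\), and all constants below are illustrative assumptions for the demonstration, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic instance (illustrative assumptions): n distributions with 50
# samples each, squared loss s_i(w; zeta) = (x^T w - y)^2, and the
# CVaR-type conjugate phi*(x) = max(x, 0) / (1 - alpha).
n, d, B, alpha = 20, 5, 4, 0.75
X = rng.normal(size=(n, 50, d))
Y = rng.normal(size=(n, 50))

def loss_and_grad(w, i, j):
    """Return s_i(w; zeta_j) and its gradient in w."""
    x, y = X[i, j], Y[i, j]
    r = x @ w - y
    return r * r, 2.0 * r * x

def phi_star_prime(x):
    """A subgradient of phi*(x) = max(x, 0) / (1 - alpha)."""
    return float(x > 0) / (1.0 - alpha)

def asgd_coce(T=300, eta=0.002, gamma=0.01):
    w, nu = np.zeros(d), np.zeros(n)
    for _ in range(T):
        batch = rng.choice(n, size=B, replace=False)      # B_t, |B_t| = B
        z = np.zeros(d)
        for i in batch:
            s, gs = loss_and_grad(w, i, rng.integers(50))  # draw zeta_{i,t}
            omega = phi_star_prime(s - nu[i])
            z += omega * gs                  # partial_1 Phi_i = omega * grad s_i
            nu[i] -= gamma * (1.0 - omega)   # partial_2 Phi_i = 1 - omega
        w -= eta * z / B
    return w, nu

w, nu = asgd_coce()
```

Note that each \(\nu_i\) is updated only when \(i\) enters the sampled minibatch, exactly as in lines 4–6 of the pseudocode.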

Lemma 5.20

\(F(\mathbf{w},\boldsymbol{\nu})\) is jointly convex in terms of \((\mathbf{w}^\top,\boldsymbol{\nu}^{\top})^\top\) if \(s_i(\cdot;\zeta)\) is convex.

Proof.

We prove that \(\Phi_i(\mathbf{w}, \nu_i; \zeta)\) is jointly convex in \((\mathbf{w}^\top,\nu_i)^\top\); the convexity of \(F(\mathbf{w}, \boldsymbol{\nu})\), being an average (in expectation over \(\zeta\)) of such functions, then follows. Let \(\mathbf{u} =(\mathbf{w}^\top,\nu)^\top\) and consider \(\mathbf{u}_j = (\mathbf{w}_j^\top, \nu_j)^\top\) for \(j=1,2\), \(\alpha \in [0,1]\), and \(\bar{\mathbf{u}} = \alpha\mathbf{u}_1 + (1-\alpha)\mathbf{u}_2 = (\bar{\mathbf{w}}^\top, \bar\nu)^\top\). Then

\[\Phi_i(\bar{\mathbf{u}};\zeta) = \phi^*( s_i(\bar{\mathbf{w}};\zeta) - \bar{\nu} ) + \bar{\nu}.\]

If \(s_i(\cdot;\zeta)\) is convex, we have \(s_i(\bar{\mathbf{w}};\zeta) \leq \alpha s_i(\mathbf{w}_1;\zeta) + (1-\alpha) s_i(\mathbf{w}_2;\zeta)\). Since \(\phi^*(\cdot)\) is non-decreasing (cf. Lemma 2.3), we have

\[\phi^*(s_i(\bar{\mathbf{w}};\zeta) - \bar{\nu}) \leq \phi^*(\alpha(s_i(\mathbf{w}_1;\zeta) -\nu_{1} ) + (1-\alpha) (s_i(\mathbf{w}_2;\zeta) - \nu_{2} )).\]

Since \(\phi^*(\cdot)\) is convex, we further have

\[\phi^*(\alpha(s_i(\mathbf{w}_1;\zeta) -\nu_{1} ) + (1-\alpha) (s_i(\mathbf{w}_2;\zeta) -\nu_{2} )) \leq \alpha \phi^*(s_i(\mathbf{w}_1;\zeta) - \nu_{1}) + (1-\alpha) \phi^*(s_i(\mathbf{w}_2;\zeta) - \nu_{2}).\]

As a result,

\[\Phi_i(\bar{\mathbf{u}};\zeta) \leq \alpha \Phi_i(\mathbf{u}_1;\zeta) + (1-\alpha) \Phi_i(\mathbf{u}_2;\zeta),\]

which proves the convexity of \(\Phi_i(\mathbf{u};\zeta)\).
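Lemma 5.20 can also be checked numerically. The sketch below uses the illustrative choices \(s(\mathbf{w})=\|\mathbf{w}\|_2^2\) (convex) and \(\phi^*(x)=\max(x,0)^2\) (convex and non-decreasing); these are assumptions for the demonstration, not choices made in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Joint-convexity check for Phi(u) = phi*(s(w) - nu) + nu with u = (w, nu).
# Illustrative choices: s(w) = ||w||^2 (convex) and phi*(x) = max(x, 0)^2
# (convex and non-decreasing), so Lemma 5.20 predicts Phi is jointly convex.
def Phi(u):
    w, nu = u[:-1], u[-1]
    s = float(w @ w)
    return max(s - nu, 0.0) ** 2 + nu

# Midpoint-style convexity test on random pairs and random mixing weights.
for _ in range(1000):
    u1, u2 = rng.normal(size=4), rng.normal(size=4)
    a = rng.uniform()
    assert Phi(a * u1 + (1 - a) * u2) <= a * Phi(u1) + (1 - a) * Phi(u2) + 1e-9
```

The same check fails if \(\phi^*\) is convex but decreasing somewhere, which is why the monotonicity from Lemma 2.3 is essential in the proof.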

Assumption 5.14

Assume that either of the following conditions holds:

  1. \(F(\mathbf{w}, \boldsymbol{\nu})\) is smooth and satisfies

\[\|\nabla_1 F(\mathbf{w},\boldsymbol{\nu})\|_2^2 +\|\nabla_2 F(\mathbf{w},\boldsymbol{\nu})\|_2^2 \leq 2L_F(F(\mathbf{w},\boldsymbol{\nu}) - F(\mathbf{w}_*,\boldsymbol{\nu}_*)),\]

  2. \(F(\mathbf{w}, \boldsymbol{\nu})\) is non-smooth such that for any \(\mathbf{v}_1\in \partial_1 F(\mathbf{w},\boldsymbol{\nu})\), \(\mathbf{v}_{2,i}\in \partial_2 F_i(\mathbf{w},\nu_i)\), it holds that

\[\|\mathbf{v}_1\|_2^2\leq G_1^2, \quad |\mathbf{v}_{2,i}|^2 \leq G_2^2,\]

where \((\mathbf{w}_*,\boldsymbol{\nu}_*)\) denotes an optimal solution to COCE, and \(\nabla_1F(\mathbf{w},\boldsymbol{\nu})\) (\(\partial_1 F(\mathbf{w},\boldsymbol{\nu})\)) and \(\nabla_2 F(\mathbf{w}, \boldsymbol{\nu})\) (\(\partial_2 F(\mathbf{w},\boldsymbol{\nu})\)) denote the partial (sub)gradients with respect to \(\mathbf{w}\) and \(\boldsymbol{\nu}\), respectively.

💡 Critical

For CERM, the smoothness assumption is satisfied when \(s_i(\mathbf{w}; \zeta)\) is bounded, Lipschitz, and smooth. For CCVaR, the non-smoothness assumption is satisfied when \(s_i(\mathbf{w}; \zeta)\) is bounded and Lipschitz.

Assumption 5.15 (Bounded Variance)

In the smooth case, there exist \(\sigma_1^2,\sigma_2^2,\delta^2\) such that

\[\begin{align*} & \mathbb{E}_{\zeta}\|\nabla_1\Phi_i(\mathbf{w},\nu_{i};\zeta) -\nabla_1F_i(\mathbf{w},\nu_{i})\|_2^2 \leq \sigma_1^2, \quad \forall i\in[n],\\ & \mathbb{E}_{\zeta}\|\nabla_2\Phi_i(\mathbf{w},\nu_{i};\zeta) -\nabla_2F_i(\mathbf{w},\nu_{i})\|_2^2 \leq \sigma_2^2, \quad \forall i\in[n],\\ & \frac{1}{n}\sum_{i=1}^n \|\nabla_1F_i(\mathbf{w},\nu_{i}) - \nabla_1 F(\mathbf{w},\boldsymbol{\nu})\|_2^2 \leq \delta^2. \end{align*}\]

In the non-smooth case, the gradients above are replaced by subgradients. The subsequent analysis proceeds analogously.

Lemma 5.21

Let \(D_{\mathbf{w},0}^2 \coloneqq \mathbb{E}\|\mathbf{w}_0-\mathbf{w}_*\|_2^2\) and set \(\eta_t=\eta\). Then we have

\[\frac{1}{T}\sum_{t=0}^{T-1} \left(2\mathbb{E} [\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\mathbf{w}_t - \mathbf{w}_*)] - \eta \mathbb{E} \|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2\right) \leq \frac{D_{\mathbf{w},0}^2}{\eta T} +\eta\sigma^2,\]

where \(\boldsymbol{\nu}_t = (\nu_{1,t},\ldots, \nu_{n,t})^{\top}\) and \(\sigma^2=\frac{\sigma_1^2}{B} + \frac{\delta^2(n-B)}{B(n-1)}\).

Proof.

Let \(\mathbb{E}_t\) denote the expectation over the random samples in the \(t\)-th iteration. First, we note that \(\mathbb{E}_t[\mathbf{z}_t] = \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\). Similar to Lemma 5.2, we have

\[\begin{align*} & \mathbb{E}_t\|\mathbf{z}_t - \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 \\ & = \mathbb{E}_t\bigg[\bigg\|\mathbf{z}_t - \frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_1 F_i(\mathbf{w}_t, \nu_{i,t}) + \frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_1 F_i(\mathbf{w}_t, \nu_{i,t})- \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\bigg\|_2^2\bigg]\\ & =\mathbb{E}_t\bigg[\bigg\|\mathbf{z}_t - \frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_1 F_i(\mathbf{w}_t, \nu_{i,t})\bigg\|_2^2\bigg] + \mathbb{E}_t\bigg[\bigg\|\frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_1 F_i(\mathbf{w}_t, \nu_{i,t})- \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\bigg\|_2^2\bigg]\\ & \leq \frac{\sigma_1^2}{B} + \frac{\delta^2(n-B)}{B(n-1)}=:\sigma^2. \end{align*}\]

Due to the update of \(\mathbf{w}\), we have

\[\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 = \|\mathbf{w}_t - \mathbf{w}_*\|_2^2 - 2\eta \mathbf{z}_t^{\top}(\mathbf{w}_t - \mathbf{w}_*)+\eta^2 \|\mathbf{z}_t\|_2^2.\]

Taking expectations and using \(\mathbb{E}_t[\mathbf{z}_t] = \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\) and \(\mathbb{E}_t\|\mathbf{z}_t\|_2^2 \leq \|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 + \sigma^2\), we obtain

\[\begin{align}\label{eq:x_single} &\mathbb{E} \|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 \leq \mathbb{E} \|\mathbf{w}_t - \mathbf{w}_*\|_2^2 - 2\eta \mathbb{E} [\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\mathbf{w}_t - \mathbf{w}_*)]\\ &+ \eta^2 \mathbb{E} \|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2+ \eta^2 \sigma^2.\notag \end{align}\]

Summing over \(t=0,\ldots, T-1\) and rearranging it finishes the proof.
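The without-replacement term \(\frac{\delta^2(n-B)}{B(n-1)}\) in the variance bound is, in fact, an exact identity when \(\delta^2\) is the population variance of the \(\nabla_1 F_i\). The brute-force enumeration below, with synthetic vectors \(g_i\) standing in for the gradients, is a sanity check of that identity, not part of the proof.

```python
import itertools
import numpy as np

# Exact identity behind the second variance term in Lemma 5.21: for a
# uniformly random size-B subset of n vectors g_1, ..., g_n,
#   E || mean_batch - mean_all ||^2 = delta^2 * (n - B) / (B * (n - 1)),
# where delta^2 = (1/n) * sum_i || g_i - mean_all ||^2.
rng = np.random.default_rng(2)
n, B = 6, 3
g = rng.normal(size=(n, 4))
g_bar = g.mean(axis=0)
delta2 = np.mean(np.sum((g - g_bar) ** 2, axis=1))

# Average the squared deviation over all C(n, B) equally likely minibatches.
subsets = list(itertools.combinations(range(n), B))
lhs = np.mean([np.sum((g[list(S)].mean(axis=0) - g_bar) ** 2) for S in subsets])
rhs = delta2 * (n - B) / (B * (n - 1))
assert abs(lhs - rhs) < 1e-10
```

Note the factor \((n-B)/(n-1)\) vanishes at \(B=n\) (full batch, zero sampling error) and approaches \(1\) as \(n\to\infty\), recovering the with-replacement rate \(\delta^2/B\).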

Lemma 5.22

Let \(D_{\nu,0}^2 \coloneqq \mathbb{E}\|\boldsymbol{\nu}_0-\boldsymbol{\nu}_*\|_2^2\) and set \(\gamma_t=\gamma\). Then we have

\[\frac{1}{T}\sum_{t=0}^{T-1} \left(2\mathbb{E} [\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\boldsymbol{\nu}_t - \boldsymbol{\nu}_*)] - \gamma n \mathbb{E} \|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2\right) \leq \frac{D_{\nu,0}^2}{\gamma B T} + \gamma \sigma_2^2.\]

Proof.

Let \(\mathbb{E}_t\) denote the expectation over the random samples in the \(t\)-th iteration. Note that \(\mathbb{E}_t[\nabla_2\Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})] = \nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})\) and \(\mathbb{E}_t\|\nabla_2\Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t}) - \nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})\|_2^2 \leq \sigma_2^2\) for each \(i\in[n]\) (for \(i\notin\mathcal{B}_t\), \(\nabla_2\Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})\) is never explicitly computed, but these relations still hold). For each \(i\in[n]\), we have

\[\begin{align*} &\mathbb{E}\|\nu_{i,t+1} - \nu_{i,*}\|_2^2 \\ & = \left(1-\frac{B}{n}\right)\mathbb{E} \|\nu_{i,t} - \nu_{i,*}\|_2^2 + \frac{B}{n}\mathbb{E}\|\nu_{i,t} - \gamma\nabla_2\Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t}) - \nu_{i,*}\|_2^2\\ & \leq \mathbb{E} \|\nu_{i,t} - \nu_{i,*}\|_2^2 - \frac{2\gamma B}{n} \mathbb{E} [\nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})^{\top}(\nu_{i,t} - \nu_{i,*})] + \frac{\gamma^2 B}{n} \mathbb{E} \|\nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})\|_2^2 + \frac{\gamma^2 \sigma_2^2 B}{n}. \end{align*}\]

Summing over \(i\in[n]\) leads to

\[\begin{align} &\mathbb{E}\|\boldsymbol{\nu}_{t+1} - \boldsymbol{\nu}_*\|_2^2 \leq \mathbb{E} \|\boldsymbol{\nu}_t - \boldsymbol{\nu}_*\|_2^2 - \frac{2\gamma B}{n} \mathbb{E} \bigg[\sum_{i=1}^n\nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})^{\top}(\nu_{i,t} - \nu_{i,*})\bigg]\notag\\ &+\frac{\gamma^2 B}{n}\mathbb{E} \bigg[\sum_{i=1}^n\|\nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})\|_2^2\bigg] + \gamma^2 \sigma_2^2 B.\label{eq:y_single} \end{align}\]

Since

\[\begin{align*} &\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\boldsymbol{\nu}_t - \boldsymbol{\nu}_*) = \frac{1}{n}\sum_{i=1}^n \nabla_2 F_i(\mathbf{w}_t,\nu_{i,t})(\nu_{i,t} - \nu_{i,*}),\\ &\|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 = \frac{1}{n^2}\sum_{i=1}^n \|\nabla_2F_i(\mathbf{w}_t,\nu_{i,t})\|_2^2, \end{align*}\]

plugging these into (\(\ref{eq:y_single}\)) we have

\[\begin{align*} &\mathbb{E}\|\boldsymbol{\nu}_{t+1} - \boldsymbol{\nu}_*\|_2^2 \leq \mathbb{E} \|\boldsymbol{\nu}_t - \boldsymbol{\nu}_*\|_2^2 - 2\gamma B \mathbb{E} \bigg[\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_{t})^{\top}(\boldsymbol{\nu}_{t} - \boldsymbol{\nu}_{*})\bigg]\\ &+\gamma^2 n B\mathbb{E} \bigg[\|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_{t})\|_2^2\bigg] + \gamma^2 \sigma_2^2 B. \end{align*}\]

Summing over \(t=0,\ldots, T-1\) and rearranging it finishes the proof.
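The first step of the proof above conditions on whether \(i\in\mathcal{B}_t\), which happens with probability exactly \(B/n\) for a uniformly random size-\(B\) subset. A tiny enumeration confirms this; the specific values of \(n\), \(B\), and \(i\) below are arbitrary.

```python
import itertools
from math import comb

# Each fixed index i lies in a uniformly random size-B subset of {0,...,n-1}
# with probability C(n-1, B-1) / C(n, B) = B / n -- the fact behind the
# first step in the proof of Lemma 5.22.
n, B, i = 5, 2, 3
subsets = list(itertools.combinations(range(n), B))
p = sum(i in S for S in subsets) / len(subsets)
assert p == B / n == comb(n - 1, B - 1) / comb(n, B)
```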

Theorem 5.11 (Smooth case)

Suppose Assumption 5.13, Assumption 5.14(i) and Assumption 5.15 hold. If we set \(\gamma = \min\{\frac{1}{2n L_F},\frac{\epsilon}{4\sigma_2^2}\}\), \(\eta=\min\{\frac{1}{2L_F}, \frac{\epsilon}{4\sigma^2}\}\) and \(T= \max(\frac{4D_{\mathbf{w},0}^2}{\eta\epsilon}, \frac{4D_{\nu,0}^2}{\gamma B\epsilon})\), then Algorithm 20 guarantees that

\[\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}(F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*))\right] \leq \epsilon.\]

The iteration complexity is

\[T = O\left(\max\left\{\frac{D_{\mathbf{w},0}^2L_F}{\epsilon}, \frac{n D_{\nu,0}^2 L_F}{B\epsilon}, \frac{D_{\mathbf{w},0}^2 \sigma^2}{\epsilon^2}, \frac{D_{\nu,0}^2\sigma_2^2}{B\epsilon^2}\right\}\right),\]

where \(\sigma^2=\frac{\sigma_1^2}{B} + \frac{\delta^2(n-B)}{B(n-1)}\).

Proof.

From Lemma 5.21 and Lemma 5.22, we have

\[\begin{align*} & \frac{1}{T}\sum_{t=0}^{T-1} \left(2\mathbb{E} [\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\mathbf{w}_t - \mathbf{w}_*)] - \eta \mathbb{E} \|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2\right) \leq \frac{D_{\mathbf{w},0}^2}{\eta T} + \eta\sigma^2,\\ & \frac{1}{T}\sum_{t=0}^{T-1} \left(2\mathbb{E} [\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\boldsymbol{\nu}_t - \boldsymbol{\nu}_*)] - \gamma n \mathbb{E} \|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2\right) \leq \frac{D_{\nu,0}^2}{\gamma B T} + \gamma \sigma_2^2. \end{align*}\]

Since \(F\) is smooth, \(\eta \leq \frac{1}{2L_F}\), and \(\gamma n \leq \frac{1}{2L_F}\), we have

\[\begin{align*} \eta\|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 +\gamma n\|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 &\leq \frac{1}{2L_F}\left(\|\nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2 +\|\nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)\|_2^2\right)\\ &\leq F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*), \end{align*}\]

where the last inequality uses Assumption 5.14(i) (cf. Lemma 1.5 (b)).

On the other hand, the joint convexity of \(F\) implies

\[F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*) \leq \nabla_1 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\mathbf{w}_t - \mathbf{w}_*) + \nabla_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\boldsymbol{\nu}_t - \boldsymbol{\nu}_*).\]

Then combining the above inequalities, we have

\[\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1} [F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*)]\right]\leq \frac{D_{\mathbf{w},0}^2}{\eta T} +\eta\sigma^2+ \frac{D_{\nu,0}^2}{\gamma B T} +\gamma \sigma_2^2.\]

To make the RHS above at most \(\epsilon\), we set \(\gamma = \min\{\frac{1}{2n L_F},\frac{\epsilon}{4\sigma_2^2}\}\), \(\eta=\min\{\frac{1}{2L_F}, \frac{\epsilon}{4\sigma^2}\}\), and \(T\geq \max(\frac{4D_{\mathbf{w},0}^2}{\eta\epsilon}, \frac{4D_{\nu,0}^2}{\gamma B\epsilon})\). As a result, the complexity is on the order of

\[T = O\left(\max\left\{\frac{D_{\mathbf{w},0}^2L_F}{\epsilon}, \frac{n D_{\nu,0}^2 L_F}{B\epsilon}, \frac{D_{\mathbf{w},0}^2 \sigma^2}{\epsilon^2}, \frac{D_{\nu,0}^2\sigma_2^2}{B\epsilon^2}\right\}\right).\]

Theorem 5.12 (Non-smooth case)

Suppose Assumption 5.13, Assumption 5.14(ii) and Assumption 5.15 hold. If we set \(\gamma = \frac{\epsilon}{2(G_2^2+\sigma_2^2)}\), \(\eta= \frac{\epsilon}{2(G_1^2+\sigma^2)}\) and \(T= \max(\frac{2D_{\mathbf{w},0}^2}{\eta\epsilon}, \frac{2D_{\nu,0}^2}{\gamma B\epsilon})\), then ASGD guarantees that

\[\mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}(F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*))\right] \leq \epsilon.\]

The iteration complexity is

\[T = O\left(\max\left\{ \frac{D_{\mathbf{w},0}^2 (G_1^2+\sigma^2)}{\epsilon^2}, \frac{D_{\nu,0}^2(G_2^2+\sigma_2^2)}{B\epsilon^2}\right\}\right).\]

We leave the proof as an exercise for the reader.

💡 Why it matters

Since \(F(\mathbf{w}, \boldsymbol{\nu})\) is jointly convex in \((\mathbf{w}, \boldsymbol{\nu})\), the above two theorems imply convergence of the objective with respect to the primary variable \(\mathbf{w}\), i.e., \(F_{\text{COCE}}(\mathbf{w}) = \min_{\boldsymbol{\nu}} F(\mathbf{w}, \boldsymbol{\nu})\). In particular, if we define the averaged iterate \(\bar{\mathbf{w}}_T = \frac{1}{T}\sum_{t=0}^{T-1}\mathbf{w}_t\), we have

\[\begin{align*} \mathbb{E}[F_{\text{COCE}}(\bar{\mathbf{w}}_T) - F_{\text{COCE}}(\mathbf{w}_*)]& \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}(F_{\text{COCE}}(\mathbf{w}_t) - F_{\text{COCE}}(\mathbf{w}_*))\right] \\ & \leq \mathbb{E}\left[\frac{1}{T}\sum_{t=0}^{T-1}(F(\mathbf{w}_t,\boldsymbol{\nu}_t) - F(\mathbf{w}_*,\boldsymbol{\nu}_*))\right] \leq \epsilon. \end{align*}\]

5.5.1.2 Non-convex loss

If \(s_i(\mathbf{w}; \zeta)\) is non-convex, we consider two cases: (1) the smooth case and (2) the non-smooth weakly convex case. If \(F(\mathbf{w}, \boldsymbol{\nu})\) is smooth in \((\mathbf{w},\boldsymbol{\nu})\) and strongly convex in \(\boldsymbol{\nu}\) (e.g., compositional entropic risk, or COCE with the \(\chi^2\) divergence for \(\phi(\cdot)\)), we can follow the analysis in Section 4.5 to design an algorithm and prove convergence to an \(\epsilon\)-stationary point of \(F_{\text{COCE}}(\mathbf{w})=\min_{\boldsymbol{\nu}}F(\mathbf{w}, \boldsymbol{\nu})\). We leave this as an exercise for the reader.

Below, we analyze the convergence of ASGD for non-smooth weakly convex losses. We also allow \(\phi^*\) to be non-smooth so that the analysis covers CCVaR minimization.

Assumption 5.16

Suppose the following conditions hold:

  1. \(s_i(\cdot; \zeta)\) is \(\rho_0\)-weakly convex and \(G_\ell\)-Lipschitz continuous for every \(\zeta\);
  2. \(\phi^*(\cdot)\) is convex, non-decreasing, and \(G_0\)-Lipschitz continuous, i.e., \(\partial\phi^*(\cdot)\subseteq[0, G_0]\).

Lemma 5.23

\(F(\mathbf{w}, \boldsymbol{\nu})\) is \(\rho\)-weakly convex with respect to \((\mathbf{w}, \boldsymbol{\nu})\), where \(\rho=\rho_0G_0\).

Proof.

We first prove that \(\phi^*(s_i(\mathbf{w}; \zeta) - \nu_i)\) is weakly convex in \((\mathbf{w}, \nu_i)\), i.e., there exists \(\rho>0\) such that \(\phi^*(s_i(\mathbf{w}; \zeta) - \nu_i)+\frac{\rho}{2}\|\mathbf{w}\|_2^2 + \frac{\rho}{2}\nu_i^2\) is jointly convex in \((\mathbf{w}, \nu_i)\).

Since \(s_i(\mathbf{w}; \zeta)\) is \(\rho_0\)-weakly convex, the function \(q(\mathbf{w}, \nu_i; \zeta)=s_i(\mathbf{w}; \zeta) -\nu_i\), which is affine in \(\nu_i\), is \(\rho_0\)-weakly convex in \(\mathbf{v}_i = (\mathbf{w}^\top, \nu_i)^\top\):

\[q(\mathbf{v}_i,\zeta)\geq q(\mathbf{v}'_i,\zeta) + \partial q(\mathbf{v}'_i,\zeta)^{\top}(\mathbf{v}_i - \mathbf{v}'_i) - \frac{\rho_0}{2}\|\mathbf{v}_i' - \mathbf{v}_i\|_2^2,\forall\mathbf{v}_i, \mathbf{v}'_i.\]

For any \(\zeta\), we abbreviate \(q(\mathbf{v}_i;\zeta)\) as \(q(\mathbf{v}_i)\). Since \(\phi^*\) is convex and monotonically non-decreasing, for any \(\omega\in \partial\phi^*(q(\mathbf{v}'_i))\subseteq[0, G_0]\) we have

\[\begin{align*} \phi^*(q(\mathbf{v}_i))&\geq \phi^*(q(\mathbf{v}'_i)) + \omega(q(\mathbf{v}_i) - q(\mathbf{v}'_i))\\ &\geq \phi^*(q(\mathbf{v}'_i)) + \omega \Big(\partial q(\mathbf{v}'_i)^{\top}(\mathbf{v}_i-\mathbf{v}_i') - \frac{\rho_0}{2}\|\mathbf{v}_i-\mathbf{v}'_i\|_2^2\Big)\\ &\geq \phi^*(q(\mathbf{v}'_i)) + \omega\, \partial q(\mathbf{v}'_i)^{\top}(\mathbf{v}_i-\mathbf{v}_i') - \frac{G_0\rho_0}{2}\|\mathbf{v}_i-\mathbf{v}'_i\|_2^2,\end{align*}\]

where the second inequality uses \(\omega\geq 0\) and the third uses \(\omega\leq G_0\); note that \(\omega\,\partial q(\mathbf{v}'_i)\in \partial\big(\phi^*(q(\cdot))\big)(\mathbf{v}'_i)\) by the chain rule.

The above inequality implies that \(\phi^*(s_i(\mathbf{w}; \zeta) - \nu_i)\) is \(\rho=G_0\rho_0\)-weakly convex in terms of \((\mathbf{w}, \nu_i)\), i.e., \(\mathbb{E}_{\zeta}\phi^*(s_i(\mathbf{w};\zeta) - \nu_i) + \frac{\rho}{2}(\|\mathbf{w}\|_2^2+ |\nu_i|^2)\) is convex. As a result \(F(\mathbf{w}, \boldsymbol{\nu})+ \frac{\rho}{2}\|\mathbf{w}\|_2^2+\frac{\rho}{2}\|\boldsymbol{\nu}\|_2^2\) is jointly convex in terms of \((\mathbf{w}, \boldsymbol{\nu})\).

Similar to the SGD for weakly convex objectives in Section 3.1.4, we use the Moreau envelope of \(F(\mathbf{w}, \boldsymbol{\nu})\). In particular, let \(\mathbf{v}=(\mathbf{w}^{\top}, \boldsymbol{\nu}^{\top})^{\top}\) and, for some \(\bar\rho>\rho\), define:

\[\begin{align}\label{eqn:moreau} F_{1/\bar\rho}(\mathbf{v}) &:=\min_{\mathbf{u}}F(\mathbf{u}) + \frac{\bar\rho}{2}\|\mathbf{u} - \mathbf{v}\|_2^2,\\ \operatorname{prox}_{F/\bar\rho}(\mathbf{v})& := \arg\min_{\mathbf{u}} F(\mathbf{u}) + \frac{\bar\rho}{2}\|\mathbf{u} - \mathbf{v}\|_2^2.\notag \end{align}\]
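For intuition, the sketch below computes the Moreau envelope of the simple convex function \(F(u)=|u|\) (an illustrative stand-in, not an objective from the text): the prox is soft-thresholding, and a finite-difference check confirms the standard identity \(\nabla F_{1/\bar\rho}(\mathbf{v}) = \bar\rho(\mathbf{v} - \operatorname{prox}_{F/\bar\rho}(\mathbf{v}))\).

```python
import numpy as np

# Moreau envelope of F(u) = |u| with parameter rho_bar: the prox is
# soft-thresholding at 1/rho_bar, and the envelope gradient satisfies
#   grad F_{1/rho_bar}(v) = rho_bar * (v - prox(v)).
rho_bar = 2.0

def prox(v):
    # argmin_u |u| + (rho_bar / 2) * (u - v)^2  (soft-thresholding)
    return float(np.sign(v)) * max(abs(v) - 1.0 / rho_bar, 0.0)

def envelope(v):
    # Evaluate F_{1/rho_bar}(v) by plugging the minimizer back in.
    u = prox(v)
    return abs(u) + 0.5 * rho_bar * (u - v) ** 2

# Central finite differences agree with rho_bar * (v - prox(v)).
for v in [-2.0, -0.3, 0.1, 1.5]:
    fd = (envelope(v + 1e-6) - envelope(v - 1e-6)) / 2e-6
    assert abs(fd - rho_bar * (v - prox(v))) < 1e-4
```

The same identity, \(\nabla F_{1/\bar\rho}(\mathbf{v}_t) = \bar\rho(\mathbf{v}_t - \bar{\mathbf{v}}_t)\), is what converts distance-to-prox bounds into gradient bounds in the proof of Theorem 5.13.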

Lemma 5.24

Under Assumption 5.16, we have

\[\mathbb{E}_t[\|\mathbf{z}_t\|^2_2]\leq G_1^2,\quad |\partial_2 F_i(\mathbf{w}, \nu_i)|^2\leq G_2^2,\]

where \(G_1^2=G_0^2G_\ell^2\), and \(G_2^2=(1+G_0)^2\).

Proof.

For the first part,

\[\mathbb{E}_t[\|\mathbf{z}_t\|^2_2] = \mathbb{E}_t\bigg[\bigg\|\frac{1}{B}\sum_{i\in\mathcal{B}_t}\partial_1 \Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})\bigg\|_2^2\bigg]\leq \frac{1}{B}\sum_{i\in\mathcal{B}_t}\mathbb{E}_t\big\|\partial_1 \Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t})\big\|_2^2\leq G_0^2G_\ell^2,\]

where the first inequality is Jensen's, and the second uses \(\partial_1 \Phi_i(\mathbf{w}_t, \nu_{i,t};\zeta_{i,t}) = \omega\,\partial_1 s_i(\mathbf{w}_t;\zeta_{i,t})\) for some \(\omega\in[0,G_0]\) together with \(\|\partial_1 s_i\|_2\leq G_\ell\).

For the second part,

\[|\partial_2 F_i(\mathbf{w}, \nu_i)|^2 = \bigg|\mathbb{E}_\zeta\bigg[-\frac{\partial\phi^*(q(\mathbf{w},\nu_i; \zeta))}{\partial q}+1\bigg]\bigg|^2\leq (1+G_0)^2.\]

Lemma 5.25

Under Assumption 5.16 and with \(\mathbf{v}_{t}=(\mathbf{w}_t^{\top}, \boldsymbol{\nu}_t^{\top})^{\top}\), one iteration of ASGD satisfies

\[\begin{align*} \mathbb{E}_t[F_{1/\bar\rho}(\mathbf{v}_{t+1})] &\leq F_{1/\bar\rho}(\mathbf{v}_{t}) + \bar\rho\eta_t (F(\bar{\mathbf{v}}_{t}) - F(\mathbf{v}_{t}) + \frac{\rho}{2}\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2)\\ & + \frac{\bar\rho \eta_t^2(G_1^2+G_2^2/B)}{2}, \end{align*}\]

where \(\bar{\mathbf{v}}_{t} =\operatorname{prox}_{F/\bar\rho}(\mathbf{v}_{t})\).

Proof.

Let \(\mathbb{E}_t\) denote the expectation over the random samples at the \(t\)-th iteration, conditioned on the randomness of all previous iterations.

\[\begin{align*} &\mathbb{E}_t[F_{1/\bar\rho}(\mathbf{v}_{t+1})] \leq \mathbb{E}_t\bigg[F(\bar{\mathbf{v}}_{t}) +\frac{\bar\rho}{2}\|\mathbf{v}_{t+1} - \bar{\mathbf{v}}_{t}\|_2^2\bigg]\\ &= F(\bar{\mathbf{v}}_{t}) + \frac{\bar\rho}{2}\mathbb{E}_t[\|\mathbf{w}_t - \eta_t\mathbf{z}_t - \bar{\mathbf{w}}_t\|_2^2] + \frac{\bar\rho}{2}\mathbb{E}_t[\|\boldsymbol{\nu}_{t+1}-\bar{\boldsymbol{\nu}}_t\|_2^2] \\ & \leq F(\bar{\mathbf{v}}_{t}) + \frac{\bar\rho}{2}\|\mathbf{w}_t- \bar{\mathbf{w}}_t\|_2^2 + \bar\rho\eta_t \mathbb{E}_t[(\bar{\mathbf{w}}_t - \mathbf{w}_t)^{\top}\partial_{1} F(\mathbf{w}_t, \boldsymbol{\nu}_t)] + \frac{\bar\rho \eta_t^2G_1^2}{2} + \frac{\bar\rho}{2}\mathbb{E}_t[\|\boldsymbol{\nu}_{t+1}-\bar{\boldsymbol{\nu}}_t\|_2^2], \end{align*}\]

where the last step uses \(\mathbb{E}_t[\mathbf{z}_t] =\partial_{1} F(\mathbf{w}_t, \boldsymbol{\nu}_t)\) and \(\mathbb{E}_t[\|\mathbf{z}_t\|_2^2]\leq G_1^2\).

Similar to (\(\ref{eq:y_single}\)), we can prove that

\[\mathbb{E}_t\|\boldsymbol{\nu}_{t+1} - \bar{\boldsymbol{\nu}}_t\|_2^2 \leq \|\boldsymbol{\nu}_t - \bar{\boldsymbol{\nu}}_t\|_2^2 - 2\gamma_t B\partial_2 F(\mathbf{w}_t,\boldsymbol{\nu}_t)^{\top}(\boldsymbol{\nu}_t - \bar{\boldsymbol{\nu}}_t) + \gamma_t^2 G_2^2 B.\]

Setting \(\gamma_t = \eta_t/B\) and combining the above, we have

\[\begin{align*} &\mathbb{E}_t[F_{1/\bar\rho}(\mathbf{v}_{t+1})] \\ & \leq F(\bar{\mathbf{v}}_{t}) + \frac{\bar\rho}{2}\|\mathbf{v}_{t}- \bar{\mathbf{v}}_{t}\|_2^2 + \bar\rho\eta_t\mathbb{E}_t[(\bar{\mathbf{v}}_{t} - \mathbf{v}_{t})^{\top}\partial F(\mathbf{v}_{t})] + \frac{\bar\rho \eta_t^2(G_1^2+G_2^2/B)}{2}\\ &\leq F_{1/\bar\rho}(\mathbf{v}_{t}) + \bar\rho\eta_t (F(\bar{\mathbf{v}}_{t}) - F(\mathbf{v}_{t}) + \frac{\rho}{2}\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2)+ \frac{\bar\rho \eta_t^2(G_1^2+G_2^2/B)}{2}, \end{align*}\]

where the last step uses the definition of \(F_{1/\bar\rho}(\mathbf{v}_{t})\) and the \(\rho\)-weak convexity of \(F\). Rearranging this inequality finishes the proof.

Theorem 5.13

Suppose Assumption 5.16 holds and \(F_*= \inf F(\mathbf{w}, \boldsymbol{\nu}) > -\infty\). By setting \(\bar\rho=2\rho\), \(\eta = \epsilon^2/(2\bar\rho (G_1^2+G_2^2/B))\), \(\gamma=\eta/B\) and \(T\geq \frac{4(F(\mathbf{w}_0, \boldsymbol{\nu}_0) - F_*)}{\epsilon^2\eta}\), ASGD guarantees that

\[\mathbb{E}\bigg[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F_{1/\bar\rho}(\mathbf{v}_{t})\|_2^2\bigg]\leq \epsilon^2\]

with a complexity of \(T=O\left(\frac{\rho(G_1^2+G_2^2/B)}{\epsilon^4}\right)\).

Proof.

Since \(F(\mathbf{v}) + \frac{\bar\rho}{2}\|\mathbf{v} - \mathbf{v}_{t}\|_2^2\) is \((\bar\rho-\rho)\)-strongly convex and attains its minimum at \(\bar{\mathbf{v}}_t\), we have

\[\begin{align*} &F(\mathbf{v}_{t}) - F(\bar{\mathbf{v}}_{t}) - \frac{\rho}{2}\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2 \\ &= (F(\mathbf{v}_{t}) + \frac{\bar\rho}{2}\|\mathbf{v}_{t} - \mathbf{v}_{t}\|_2^2) -(F(\bar{\mathbf{v}}_{t}) + \frac{\bar\rho}{2}\|\bar{\mathbf{v}}_{t} - \mathbf{v}_{t}\|_2^2) + (\frac{\bar\rho}{2} - \frac{\rho}{2})\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2 \\ &\geq \frac{(\bar\rho - \rho)}{2}\|\mathbf{v}_{t}-\bar{\mathbf{v}}_{t}\|_2^2 + \frac{(\bar\rho - \rho)}{2}\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2 = (\bar\rho - \rho)\|\mathbf{v}_{t} - \bar{\mathbf{v}}_{t}\|_2^2 \\ &= \frac{\bar\rho - \rho}{\bar\rho^2}\|\nabla F_{1/\bar\rho}(\mathbf{v}_{t})\|_2^2. \end{align*}\]

Combining this result with that in Lemma 5.25 and noting that \(\bar\rho=2\rho, \eta_t=\eta\), we have

\[\begin{align*} \mathbb{E}\bigg[\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F_{1/\bar\rho}(\mathbf{v}_{t})\|_2^2\bigg] &\leq \frac{2(F_{1/\bar\rho}(\mathbf{v}_0) - F_*)}{\eta T} + \bar\rho \eta (G_1^2+G_2^2/B) \\ &\leq \frac{2(F(\mathbf{v}_0) - F_*)}{\eta T} + \bar\rho \eta (G_1^2+G_2^2/B). \end{align*}\]

By setting \(\eta = \epsilon^2/(2\bar\rho (G_1^2+G_2^2/B))\) and \(T\geq \frac{4(F(\mathbf{v}_0) - F_*)}{\epsilon^2\eta}\), we have \(\mathbb{E}[\|\nabla F_{1/\bar\rho}(\mathbf{v}_\tau)\|_2^2]\leq \epsilon^2\) for a randomly selected \(\tau\in\{0,\ldots,T-1\}\).
