Section 5.3 Non-Smooth Weakly Convex Functions

In this section, we consider non-smooth weakly convex functions, where either the outer function or the inner function is non-smooth. The group DRO objective falls into this category. Another instance is the two-way partial AUC maximization problem discussed in Section 6.4.

Assumption 5.4

We assume that

  1. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|g_i(\mathbf{w}; \zeta) - g_i(\mathbf{w})\right\|_2^2]\leq \sigma_0^2\).

  2. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\|\mathcal{G}_i(\mathbf{w}; \zeta)\|_2^2] \leq G_2^2\) for any \(\mathcal{G}_i(\mathbf{w}; \zeta)\in \partial g_i(\mathbf{w}; \zeta)\).

The second condition above implies that \(g_i\) is \(G_2\)-Lipschitz continuous.

Assumption 5.5

We assume either of the following conditions holds:

  1. \(f_i\) is \(\rho_1\)-weakly convex, \(G_1\)-Lipschitz continuous, and \(\partial f_i(g)\geq 0\) \(\forall g\); \(g_i\) is \(\rho_2\)-weakly convex.

  2. \(f_i\) is \(\rho_1\)-weakly convex, \(G_1\)-Lipschitz continuous, and \(\partial f_i(g)\geq 0\) or \(\partial f_i(g)\leq 0\) \(\forall g\); and \(g_i\) is \(L_2\)-smooth.

We first characterize the conditions on \(f_i\) and \(g_i\) to induce weak convexity of \(F\).

Lemma 5.9

Under Assumption 5.4 and Assumption 5.5, the objective function \(F\) is \(\rho\)-weakly convex for some \(\rho>0\). If Assumption 5.5(i) holds, then \(\rho = G_1\rho_2\sqrt{d'} + \rho_1 G_2^2\) and if Assumption 5.5(ii) holds, then \(\rho = G_1L_2\sqrt{d'} + \rho_1 G_2^2\).

Proof.

The weak convexity of \(f_i\) implies that for any \(\mathbf{v}_i\in \partial f_i(g_i(\mathbf{w}))\):

\[\begin{equation*} \begin{aligned} f_i(g_i(\mathbf{w}'))&\geq f_i(g_i(\mathbf{w}))+ \mathbf{v}_i^\top (g_i(\mathbf{w}')-g_i(\mathbf{w}))-\frac{\rho_1}{2}\|g_i(\mathbf{w}')-g_i(\mathbf{w})\|_2^2\\ &\geq f_i(g_i(\mathbf{w}))+\mathbf{v}_i^\top (g_i(\mathbf{w}')-g_i(\mathbf{w}))-\frac{\rho_1 G_2^2}{2}\|\mathbf{w}- \mathbf{w}'\|_2^2. \end{aligned} \end{equation*}\]

Let us first prove the weak convexity under Assumption 5.5 (i). Since \(g_i\) is \(\rho_2\)-weakly convex, we have for any \(U_i\in\partial g_i(\mathbf{w})\)

\[\begin{equation}\label{ineq:37} \begin{aligned} &g_i(\mathbf{w}')-g_i(\mathbf{w})\geq U_i^\top (\mathbf{w}'-\mathbf{w}) - \frac{\rho_2}{2}\|\mathbf{w}'-\mathbf{w}\|_2^2\mathbf{1}, \end{aligned} \end{equation}\]

where \(\mathbf{1}\in\mathbb R^{d'}\) denotes the vector of all ones. Since \(\mathbf{v}_i\geq 0\), we have

\[\begin{equation*} \begin{aligned} f_i(g_i(\mathbf{w}')) &\geq f_i(g_i(\mathbf{w}))+ \mathbf{v}_i^\top (U_i^\top (\mathbf{w}'-\mathbf{w}) - \frac{\rho_2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2\mathbf{1})-\frac{\rho_1 G_2^2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2\\ &\geq f_i(g_i(\mathbf{w}))+ (U_i\mathbf{v}_i)^\top(\mathbf{w}'-\mathbf{w}) - \frac{G_1\sqrt{d'}\rho_2 + \rho_1 G_2^2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2. \end{aligned} \end{equation*}\]

Since \(U_i\mathbf{v}_i\in\partial g_i(\mathbf{w}) \partial f_i(g_i(\mathbf{w}))\), the above inequality indicates that \(f_i(g_i(\mathbf{w}))\) is \(\rho\)-weakly convex, where \(\rho=G_1\sqrt{d'}\rho_2 + \rho_1 G_2^2\). As a result, \(F(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n f_i(g_i(\mathbf{w}))\) is \(\rho\)-weakly convex.

Next, we prove the weak convexity of \(F\) under Assumption 5.5 (ii). Due to the smoothness of \(g_i(\cdot)\), we have

\[\begin{equation}\label{ineq:37b} \begin{aligned} &g_i(\mathbf{w})-g_i(\mathbf{w}')\leq \nabla g_i(\mathbf{w}')^\top (\mathbf{w}-\mathbf{w}') + \frac{L_2}{2}\|\mathbf{w}-\mathbf{w}'\|_2^2\mathbf{1},\\ &g_i(\mathbf{w})-g_i(\mathbf{w}')\geq \nabla g_i(\mathbf{w}')^\top (\mathbf{w}-\mathbf{w}') - \frac{L_2}{2}\|\mathbf{w}-\mathbf{w}'\|_2^2\mathbf{1}. \end{aligned} \end{equation}\]

If \(\partial f_i(g_i(\mathbf{w}))\geq 0\), we use the second inequality above and follow the same steps as before to prove the \(\rho\)-weak convexity of \(F\) with \(\rho=G_1\sqrt{d'}L_2 + \rho_1 G_2^2\). If \(\partial f_i(g_i(\mathbf{w}))\leq 0\), we use the first inequality above to get:

\[\begin{equation*} \begin{aligned} &f_i(g_i(\mathbf{w}'))\geq f_i(g_i(\mathbf{w}))+ \mathbf{v}_i^\top (g_i(\mathbf{w}')-g_i(\mathbf{w}))-\frac{\rho_1}{2}\|g_i(\mathbf{w}')-g_i(\mathbf{w})\|_2^2\\ &\geq f_i(g_i(\mathbf{w}))+ \mathbf{v}_i^\top (\nabla g_i(\mathbf{w})^\top (\mathbf{w}'-\mathbf{w}) + \frac{L_2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2\mathbf{1})-\frac{\rho_1 G_2^2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2\\ &\geq f_i(g_i(\mathbf{w}))+ (\nabla g_i(\mathbf{w})\mathbf{v}_i)^\top(\mathbf{w}'-\mathbf{w}) - \frac{G_1\sqrt{d'}L_2 + \rho_1 G_2^2}{2}\|\mathbf{w} - \mathbf{w}'\|_2^2. \end{aligned} \end{equation*}\]

This proves the \(\rho\)-weak convexity of \(F\) with \(\rho=G_1\sqrt{d'}L_2 + \rho_1 G_2^2\), concluding the proof.

Table 5.1 Conditions of \(f_i\) and \(g_i\) to make \(F(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^nf_i(g_i(\mathbf{w}))\) weakly convex, where \(g_i:\mathbb{R}^d\rightarrow\mathbb{R}^{d'}\) and \(f_i: \mathbb{R}^{d'}\rightarrow\mathbb{R}\).

  (i) \(f_i\): \(G_1\)-Lipschitz continuous, \(\rho_1\)-weakly convex, monotone with \(\partial f_i\geq 0\); \(g_i\): \(G_2\)-Lipschitz continuous, \(\rho_2\)-weakly convex. Weak convexity of \(F\): \(\rho = G_1\rho_2\sqrt{d'} + \rho_1 G_2^2\).

  (ii) \(f_i\): \(G_1\)-Lipschitz continuous, \(\rho_1\)-weakly convex, monotone with \(\partial f_i\geq 0\) or \(\partial f_i\leq 0\); \(g_i\): \(G_2\)-Lipschitz continuous, \(L_2\)-smooth. Weak convexity of \(F\): \(\rho = G_1L_2\sqrt{d'} + \rho_1 G_2^2\).

5.3.1 SONX for Non-smooth Inner Functions


Algorithm 16: SONX

  1. Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\), \(\{\gamma'_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_0\), \(\mathbf{u}_{0}\)
  2. \(\mathbf{w}_1=\mathbf{w}_0\)
  3. for \(t=1,\dotsc, T\) do
  4.   Draw a batch of samples \(\mathcal{B}_t\subset[n]\)
  5.   for \(i\in \mathcal{B}_t\) do
  6.     Draw two samples \(\zeta_{i,t}, \zeta'_{i,t}\sim \mathbb{P}_i\)
  7.     Update the inner function value estimators by

   v1: \(\quad \mathbf{u}_{i, t} = (1-\gamma_t) \mathbf{u}_{i,t-1} + \gamma_t g_{i}(\mathbf{w}_t; \zeta_{i,t})\)

   v2: \(\quad \mathbf{u}_{i,t}=(1-\gamma_{t})\mathbf{u}_{i, t-1}+\gamma_{t} g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) + \gamma'_{t} \left( g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right)\)

  8.   end for
  9.   Set \(\mathbf{u}_{i,t} = \mathbf{u}_{i,t-1}\) for \(i\notin\mathcal{B}_t\)
  10.   Compute \(\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \partial g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \partial f_{i}(\mathbf{u}_{i,t})\) \(\diamond\) see the text for discussion
  11.   Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{z}_{t}\)
  12. end for

Since we do not assume smoothness of the overall objective function, the key difference from the previous two sections is that the descent lemma (Lemma 4.9) no longer holds, so we cannot leverage the MA or STORM gradient estimators. Consequently, we employ the vanilla gradient estimator \(\mathbf{z}_t\) to update the model parameter \(\mathbf{w}_{t+1}\). The update steps are summarized in Algorithm 16, referred to as SONX. The two options correspond to different strategies for updating the inner function value estimators: v1 uses a coordinate-wise MA estimator, while v2 adopts the MSVR estimator.

For ease of presentation, we compute the vanilla gradient estimator \(\mathbf{z}_t\) using a batch \(\mathcal{B}'_t\) independent from \(\mathcal{B}_t\):

\[\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}'_{t}} \partial g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \partial f_{i}(\mathbf{u}_{i,t}).\]

However, for SONX-v1 with the MA estimator, we can in fact use the same vanilla gradient estimator \(\mathbf{z}_t\) as in SOX, computed on \(\mathcal{B}_t\) itself:

\[\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \partial g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \partial f_{i}(\mathbf{u}_{i,t}).\]

An alternative that applies to both options is to compute \(\mathbf{z}_t\) using the estimators from the previous iteration:

\[\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \partial g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \partial f_{i}(\mathbf{u}_{i,t-1}).\]
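To make the updates concrete, the following is a minimal Python sketch of SONX-v1 with the vanilla gradient estimator computed on an independent batch, as in Algorithm 16. The oracle functions `g_stoch`, `sub_g`, and `sub_f` and all parameter names are illustrative placeholders that a caller would supply; they are not part of the algorithm's specification.

```python
import random

def sonx_v1(sub_g, sub_f, g_stoch, n, w0, u0, T, eta, gamma, B):
    """One-dimensional sketch of SONX-v1 (Algorithm 16).

    sub_g(i, w)   -> a stochastic subgradient of g_i at w
    sub_f(i, u)   -> a subgradient of f_i at u
    g_stoch(i, w) -> a noisy evaluation g_i(w; zeta)
    All three oracles are caller-supplied placeholders.
    """
    w = w0
    u = [u0] * n                     # inner function value estimators u_{i,t}
    for _ in range(T):
        batch = random.sample(range(n), B)
        for i in batch:              # coordinate-wise MA update (v1)
            u[i] = (1 - gamma) * u[i] + gamma * g_stoch(i, w)
        batch2 = random.sample(range(n), B)   # independent batch B'_t
        z = sum(sub_g(i, w) * sub_f(i, u[i]) for i in batch2) / B
        w = w - eta * z              # vanilla (biased) SGD-style update
    return w, u
```

For instance, with hinge-type outer functions \(f_i(u)=[u]_+\) and linear inner functions \(g_i(w)=w-c_i\), this sketch drives \(w\) toward the region where all \(g_i(w)\leq 0\).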

Convergence Analysis

Similar to Section 3.1.4, we state the convergence using the Moreau envelope of \(F\):

\[\begin{align*} F_{\lambda}(\mathbf{w}) &:= \min_{\mathbf{u}} F(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2. \end{align*}\]

Recall the definition:

\[\begin{align*} &\operatorname{prox}_{\lambda F}(\mathbf{w}) = \arg\min_{\mathbf{u}} F(\mathbf{u}) + \frac{1}{2\lambda}\|\mathbf{u} - \mathbf{w}\|_2^2. \end{align*}\]
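As a one-dimensional sanity check (illustrative only), take \(F(w)=|w|\), whose proximal mapping is soft-thresholding; the identity \(\nabla F_{\lambda}(\mathbf{w}) = \frac{1}{\lambda}(\mathbf{w}-\operatorname{prox}_{\lambda F}(\mathbf{w}))\), which is used later in the analysis, can then be verified numerically:

```python
def prox_abs(w, lam):
    """Proximal mapping of F(u) = |u|: soft-thresholding."""
    return max(abs(w) - lam, 0.0) * (1 if w >= 0 else -1)

def moreau_abs(w, lam):
    """Moreau envelope F_lambda(w) = min_u |u| + ||u - w||^2 / (2 lam)."""
    u = prox_abs(w, lam)
    return abs(u) + (u - w) ** 2 / (2 * lam)

lam = 0.5
for w in [-2.0, -0.3, 0.1, 1.7]:
    # gradient identity: dF_lam/dw = (w - prox_{lam F}(w)) / lam
    grad = (w - prox_abs(w, lam)) / lam
    # compare against a central finite difference of the envelope
    h = 1e-6
    fd = (moreau_abs(w + h, lam) - moreau_abs(w - h, lam)) / (2 * h)
    assert abs(grad - fd) < 1e-4
```

Here the envelope is the Huber function, which is smooth even though \(F\) itself is not.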

We first present a result, similar to Lemma 3.5 for standard SGD, that accounts for the bias of \(\mathbf{z}_t\).

Lemma 5.10

Suppose Assumption 5.4 and Assumption 5.5 hold. Let \(\bar\rho =\rho+\rho_2 G_1+2\rho_1G_2^2\). Considering the update step of SONX, we have

\[\begin{equation*} \begin{aligned} &\mathbb{E}_{\zeta'_t, \mathcal{B}'_t}[F_{1/\bar{\rho}}(\mathbf{w}_{t+1})] \leq F_{1/\bar{\rho}}(\mathbf{w}_t) +\frac{\eta_t^2\bar{\rho}G^2}{2} -\frac{\eta_t}{2}\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2 \\ &+\frac{\bar{\rho}\eta_t}{n}\sum_{i=1}^n \bigg[2G_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2+\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\bigg]. \end{aligned} \end{equation*}\]

If \(f_i\) is further \(L_1\)-smooth, then

\[\begin{equation*} \begin{aligned} &\mathbb{E}_{\zeta'_t, \mathcal{B}'_t}[F_{1/\bar{\rho}}(\mathbf{w}_{t+1})] \leq F_{1/\bar{\rho}}(\mathbf{w}_t) +\frac{\eta_t^2\bar{\rho}G^2}{2} -\frac{\eta_t}{2}\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2 \\ &+\frac{\bar{\rho}\eta_t}{n}\sum_{i=1}^n \bigg[\frac{L_1}{2}\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|^2_2+\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\bigg], \end{aligned} \end{equation*}\]

where \(G^2=G_1^2G_2^2\).

If \(\mathbf{u}_{i,t}=g_i(\mathbf{w}_t)\), i.e., there is no bias in \(\mathbf{z}_t\), then the terms in the square brackets vanish and the above lemma reduces to Lemma 3.4.

Proof.

Define \(\hat{\mathbf{w}}_t:=\operatorname{prox}_{F/\bar{\rho}}(\mathbf{w}_t)\) and \(\mathbb{E}_t[\cdot]=\mathbb{E}_{\zeta'_t, \mathcal{B}'_t}[\cdot]\). First,

\[\begin{align*} \mathbb{E}_t[\|\mathbf{z}_t\|_2^2]&\leq \mathbb{E}_t\bigg[\frac{1}{B}\sum_{i\in\mathcal{B}'_t}\|\partial g_i(\mathbf{w}_t; \zeta'_{i,t})\partial f_i(\mathbf{u}_{i,t})\|_2^2\bigg]\\ &\leq \mathbb{E}_t\bigg[\frac{1}{B}\sum_{i\in\mathcal{B}'_t}\|\partial g_i(\mathbf{w}_t; \zeta'_{i,t})\|_2^2G_1^2\bigg]\leq G_2^2G_1^2=G^2. \end{align*}\]

Following Lemma 3.4, we have

\[\begin{align}\label{ineq:44} \mathbb{E}_t[F_{1/\bar{\rho}}(\mathbf{w}_{t+1})] &\leq F_{1/\bar{\rho}}(\mathbf{w}_t) +\bar{\rho}\eta_t (\mathbb{E}_t[\mathbf{z}_t])^{\top}(\hat{\mathbf{w}}_t-\mathbf{w}_t) +\frac{\eta_t^2\bar{\rho}G^2}{2}. \end{align}\]

Next we bound the term \(\mathbb{E}_t[\mathbf{z}_t]^{\top}(\hat{\mathbf{w}}_t-\mathbf{w}_t)\) on the RHS of (\(\ref{ineq:44}\)). Note that \(\mathbb{E}_t[\mathbf{z}_t] = \frac{1}{n}\sum_{i=1}^n\partial g_i(\mathbf{w}_t)\partial f_i(\mathbf{u}_{i,t})\). For a given \(i\in [n]\), we have

\[\begin{equation*} \begin{aligned} &f_i(g_i(\hat{\mathbf{w}}_t)) - f_i(\mathbf{u}_{i,t})\\ &\stackrel{(a)}{\geq} \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\hat{\mathbf{w}}_t)-\mathbf{u}_{i,t}) -\frac{\rho_1}{2}\|g_i(\hat{\mathbf{w}}_t)-\mathbf{u}_{i,t}\|_2^2\\ &\geq \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\hat{\mathbf{w}}_t)-\mathbf{u}_{i,t}) -\rho_1\|g_i(\hat{\mathbf{w}}_t)-g_i(\mathbf{w}_t)\|_2^2-\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\\ &\geq \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\hat{\mathbf{w}}_t)-\mathbf{u}_{i,t}) -\rho_1G_2^2\|\hat{\mathbf{w}}_t-\mathbf{w}_t\|_2^2-\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\\ &\stackrel{(b)}{\geq} \partial f_i(\mathbf{u}_{i,t})^\top \bigg[g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}+ \partial g_i(\mathbf{w}_t)^\top(\hat{\mathbf{w}}_t-\mathbf{w}_t)-\frac{\rho_2}{2}\|\hat{\mathbf{w}}_t-\mathbf{w}_t\|_2^2\bigg] \\ &\quad -\rho_1G_2^2\|\hat{\mathbf{w}}_t-\mathbf{w}_t\|_2^2-\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\\ &\stackrel{(c)}{\geq} \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\mathbf{w}_t)-\mathbf{u}_{i,t})+ \partial f_i(\mathbf{u}_{i,t})^\top\partial g_i(\mathbf{w}_t)^\top(\hat{\mathbf{w}}_t-\mathbf{w}_t)\\ &\quad -(\frac{\rho_2G_1}{2}+\rho_1G_2^2)\|\hat{\mathbf{w}}_t-\mathbf{w}_t\|_2^2 -\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2, \end{aligned} \end{equation*}\]

where (a) follows from the \(\rho_1\)-weak convexity of \(f_i\), (b) follows from \(\partial f_i(\cdot)\geq 0\) and the weak convexity of \(g_i\), and (c) is due to \(\|\partial f_i(\mathbf{u}_{i,t})\|_2\leq G_1\). When \(\partial f_i(\cdot)\leq 0\) and \(g_i\) is smooth, we can derive a similar bound with \(\rho_2\) in the last inequality replaced by \(L_2\).

Then rearranging the above inequality and averaging over \(i\) yields \[\begin{equation}\label{ineq:43} \begin{aligned} &\mathbb{E}_t[\mathbf{z}_t]^\top(\hat{\mathbf{w}}_t-\mathbf{w}_t)=\frac{1}{n}\sum_{i=1}^n \partial f_i(\mathbf{u}_{i,t})^\top\partial g_i(\mathbf{w}_t)^\top(\hat{\mathbf{w}}_t-\mathbf{w}_t)\\ &\leq \frac{1}{n}\sum_{i=1}^n \bigg[ f_i(g_i(\hat{\mathbf{w}}_t)) - f_i(g_i(\mathbf{w}_t)) + f_i(g_i(\mathbf{w}_t))- f_i(\mathbf{u}_{i,t})\\ &\quad- \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}) + (\frac{\rho_2G_1}{2}+\rho_1G_2^2)\|\hat{\mathbf{w}}_t-\mathbf{w}_t\|_2^2 +\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\bigg]. \end{aligned} \end{equation}\]

Due to the \(\rho\)-weak convexity of \(F(\mathbf{w})\), the function \(F(\mathbf{w})+\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\mathbf{w}\|_2^2\) is \((\bar{\rho}-\rho)\)-strongly convex in \(\mathbf{w}\). Since \(\hat{\mathbf{w}}_t\) is its minimizer, the quadratic growth property of strong convexity gives \(\bigg[F(\mathbf{w}_t)+\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\mathbf{w}_t\|_2^2\bigg] - \bigg[F(\hat{\mathbf{w}}_t)+\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\hat{\mathbf{w}}_t\|_2^2\bigg]\geq \frac{\bar\rho-\rho}{2}\|\hat{\mathbf{w}}_t - \mathbf{w}_t\|_2^2\). It follows that:

\[\begin{equation}\label{ineq:46} \begin{aligned} &\frac{1}{n}\sum_{i=1}^n \bigg[ f_i(g_i(\hat{\mathbf{w}}_t)) - f_i(g_i(\mathbf{w}_t))\bigg]=F(\hat{\mathbf{w}}_t)-F(\mathbf{w}_t)\\ & = \bigg[F(\hat{\mathbf{w}}_t)+\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\hat{\mathbf{w}}_t\|_2^2\bigg] -\bigg[F(\mathbf{w}_t)+\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\mathbf{w}_t\|_2^2\bigg]-\frac{\bar{\rho}}{2}\|\mathbf{w}_t-\hat{\mathbf{w}}_t\|_2^2\\ &\leq (\frac{\rho}{2}-\bar{\rho})\|\mathbf{w}_t-\hat{\mathbf{w}}_t\|_2^2. \end{aligned} \end{equation}\]

Combining inequalities (\(\ref{ineq:43}\)), (\(\ref{ineq:44}\)) and (\(\ref{ineq:46}\)) and noting that \(\frac{\rho}{2} + \frac{\rho_2G_1}{2}+\rho_1G_2^2 = \frac{\bar\rho}{2}\) yields

\[\begin{equation*} \begin{aligned} &\mathbb{E}_t[F_{1/\bar{\rho}}(\mathbf{w}_{t+1})]\leq F_{1/\bar{\rho}}(\mathbf{w}_t) +\frac{\eta_t^2\bar{\rho}G^2}{2}-\frac{\bar{\rho}^2\eta_t}{2}\|\mathbf{w}_t-\hat{\mathbf{w}}_t\|_2^2\\ & +\frac{\bar{\rho}\eta_t}{n}\sum_{i=1}^n \bigg[f_i(g_i(\mathbf{w}_t)) - f_i(\mathbf{u}_{i,t}) - \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\mathbf{w}_t)-\mathbf{u}_{i,t})+\rho_1\|g_i(\mathbf{w}_t)-\mathbf{u}_{i,t}\|_2^2\bigg]. \end{aligned} \end{equation*}\]

We finish the proof by noting that \(\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2 = \bar\rho \|\mathbf{w}_t -\hat{\mathbf{w}}_t\|_2\), using

\[f_i(g_i(\mathbf{w}_t)) - f_i(\mathbf{u}_{i,t}) - \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\mathbf{w}_t)-\mathbf{u}_{i,t})\leq 2G_1\|g_i(\mathbf{w}_t) - \mathbf{u}_{i,t}\|_2,\]

if \(f_i\) is \(G_1\)-Lipschitz continuous, or using

\[f_i(g_i(\mathbf{w}_t)) - f_i(\mathbf{u}_{i,t}) - \partial f_i(\mathbf{u}_{i,t})^\top (g_i(\mathbf{w}_t)-\mathbf{u}_{i,t})\leq \frac{L_1}{2}\|g_i(\mathbf{w}_t) - \mathbf{u}_{i,t}\|^2_2,\]

if \(f_i\) is \(L_1\)-smooth.

Convergence of SONX-v1

Recall the definition: \[\delta_t = \frac{1}{n}\sum_{i=1}^n\|\mathbf{u}_{i,t} -g_i(\mathbf{w}_t)\|_2^2.\] Let us also define: \[\delta'_t = \frac{1}{n}\sum_{i=1}^n\|\mathbf{u}_{i,t} -g_i(\mathbf{w}_t)\|_2.\] From Lemma 5.10, the key is to bound \(\delta_t\) and \(\delta'_t\).

Lemma 5.11

Consider the update of SONX-v1, under Assumption 5.4 and Assumption 5.5, with constant parameters \(\gamma_t=\gamma\leq 1\) and \(\eta_t = \eta\), we have

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\delta_t\right] &\leq \left(1-\frac{B\gamma}{4n}\right)^{2t}\delta_0 +\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2,\\ \mathbb{E}\left[\delta'_t\right] &\leq \left(1-\frac{B\gamma}{4n}\right)^{t}\delta'_0 +\frac{4n G_2^2G_1\eta}{B\gamma }+2\sqrt{\gamma}\sigma_0. \end{aligned} \end{equation*}\]

Proof.

From the proof of Lemma 5.1, we have

\[\begin{equation*} \begin{aligned} & \mathbb{E}\left[\left\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\right\|_2^2\right]\\ &\leq \left(1-\frac{B\gamma_t}{2n}\right)\mathbb{E}\left[\left\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\right\|_2^2\right] + \frac{2n G_2^2}{B\gamma_t}\mathbb{E}\left[\|\mathbf{w}_{t-1}-\mathbf{w}_t\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}\\ &\leq \left(1-\frac{B\gamma_t}{2n}\right)\mathbb{E}\left[\left\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\right\|_2^2\right] + \frac{2n G_2^2\eta_{t-1}^2}{B\gamma_t}\mathbb{E}\left[\|\mathbf{z}_{t-1}\|_2^2\right]+\frac{B\gamma_t^2\sigma_0^2}{n}\\ &\leq \left(1-\frac{B\gamma_t}{4n}\right)^2\mathbb{E}\left[\left\|\mathbf{u}_{i,t-1} - g_i(\mathbf{w}_{t-1})\right\|_2^2\right] + \frac{2n G_2^4G_1^2\eta_{t-1}^2}{B\gamma_t}+\frac{B\gamma_t^2\sigma_0^2}{n}. \end{aligned} \end{equation*}\]

Applying the above inequality recursively for \(\gamma_t = \gamma\) and \(\eta_t = \eta\), we obtain

\[\begin{equation*} \begin{aligned} & \mathbb{E}\left[\left\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\right\|_2^2\right]\\ &\leq \left(1-\frac{B\gamma}{4n}\right)^{2t}\left\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_{0})\right\|_2^2 +\sum_{j=0}^{t-1}\left(1-\frac{B\gamma}{4n}\right)^{2j} \left(\frac{2n G_2^4G_1^2\eta^2}{B\gamma}+\frac{B\gamma^2\sigma_0^2}{n}\right)\\ &\leq \left(1-\frac{B\gamma}{4n}\right)^{2t}\left\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_{0})\right\|_2^2 +\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2, \end{aligned} \end{equation*}\]

where we use

\[\sum_{j=0}^{t-1}\left(1-\alpha\right)^{2j}\leq\sum_{j=0}^{\infty}\left(1-\alpha\right)^{2j}=\frac{1}{1-(1-\alpha)^2}=\frac{1}{\alpha(2-\alpha)} \leq \frac{1}{\alpha}, \forall \alpha\in(0,1).\]

Averaging the above inequality over \(i\), we prove the first result in the lemma.

It follows that

\[\begin{equation*} \begin{aligned} & \mathbb{E}\left[\left\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\right\|_2\right]\leq \sqrt{\mathbb{E}\left[\left\|\mathbf{u}_{i,t} - g_i(\mathbf{w}_t)\right\|_2^2\right]}\\ & \leq \sqrt{\left(1-\frac{B\gamma}{4n}\right)^{2t}\left\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_{0})\right\|_2^2 +\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2}\\ &\leq \left(1-\frac{B\gamma}{4n}\right)^{t}\left\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_{0})\right\|_2 +\frac{4n G_2^2G_1\eta}{B\gamma}+2\gamma^{1/2}\sigma_0. \end{aligned} \end{equation*}\]

Averaging the above result, we prove the second result.
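The geometric-series bound \(\sum_{j=0}^{t-1}(1-\alpha)^{2j}\leq \frac{1}{\alpha(2-\alpha)}\leq\frac{1}{\alpha}\) invoked in this proof is elementary; a quick numerical check (purely illustrative) is:

```python
# Check: sum_{j=0}^{t-1} (1-a)^{2j} <= 1/(a(2-a)) <= 1/a for a in (0, 1).
for a in [0.01, 0.1, 0.5, 0.9]:
    s = sum((1 - a) ** (2 * j) for j in range(10000))
    assert s <= 1 / (a * (2 - a)) + 1e-9   # partial sum below the limit
    assert 1 / (a * (2 - a)) <= 1 / a      # since 2 - a >= 1
```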

Theorem 5.3 (Convergence of SONX-v1 with Lipschitz \(f_i\))

Consider SONX-v1, and suppose Assumption 5.4 and Assumption 5.5 hold and \(f_i\) is \(G_1\)-Lipschitz continuous. Let \(\eta_t=\eta=O(\frac{B\epsilon^6}{n\sigma_0^2})\), \(\gamma_t =\gamma=O(\frac{\epsilon^4}{\sigma_0^2})\). Then after \(T=O(\frac{n\sigma_0^2}{B\epsilon^8})\) iterations, we have \(\mathbb{E}[\|\nabla F_{1/\bar\rho}(\mathbf{w}_t)\|_2^2]\leq O(\epsilon^2)\).

Proof.

From Lemma 5.10, we have

\[\begin{align*} &\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2\bigg] \leq \mathbb{E}\bigg[\frac{2\sum_{t=1}^T(F_{1/\bar{\rho}}(\mathbf{w}_t) - F_{1/\bar{\rho}}(\mathbf{w}_{t+1}))}{\eta T} \bigg]+\eta\bar{\rho}G^2\\ & +4\bar{\rho}G_1\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\delta'_t\bigg]+2\bar{\rho}\rho_1\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\delta_t\bigg]. \end{align*}\]

Next, we bound the last two terms. From Lemma 5.11, we have

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\delta_t\right] &\leq \frac{1}{T}\sum_{t=1}^T\left(1-\frac{B\gamma}{4n}\right)^{2t}\delta_0 +\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2,\\ \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\delta'_t\right] &\leq \frac{1}{T}\sum_{t=1}^T\left(1-\frac{B\gamma}{4n}\right)^{t}\delta'_0 +\frac{4n G_2^2G_1\eta}{B\gamma }+2\sqrt{\gamma}\sigma_0. \end{aligned} \end{equation*}\]

Since \(\sum_{t=1}^{T}(1-\mu)^t\leq \frac{1}{\mu}\) for \(\mu\in(0,1)\), we have

\[\begin{align*} &\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\delta_t\right] \leq \frac{4n\delta_0 }{B\gamma T}+\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2,\\ &\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\delta'_t\right] \leq \frac{4n\delta'_0}{B\gamma T} +\frac{4n G_2^2G_1\eta}{B\gamma }+2\sqrt{\gamma}\sigma_0. \end{align*}\]

From Proposition 3.2, we have

\[\sum_{t=1}^T(F_{1/\bar{\rho}}(\mathbf{w}_t) - F_{1/\bar{\rho}}(\mathbf{w}_{t+1}))= F_{1/\bar{\rho}}(\mathbf{w}_1) - F_{1/\bar{\rho}}(\mathbf{w}_{T+1})\leq F(\mathbf{w}_1) - F(\mathbf{w}_*).\]

Combining the above results, we have

\[\begin{align*} &\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2\bigg] \leq \mathbb{E}\bigg[\frac{2(F(\mathbf{w}_1) - F_*)}{\eta T} \bigg]+\eta\bar{\rho}G^2\\ & +4\bar{\rho}G_1\bigg(\frac{4n\delta'_0}{B\gamma T} +\frac{4n G_2^2G_1\eta}{B\gamma }+2\sqrt{\gamma}\sigma_0\bigg)+2\bar{\rho}\rho_1\bigg(\frac{4n\delta_0 }{B\gamma T}+\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2\bigg). \end{align*}\]

Plugging in the orders of \(\eta\) and \(\gamma\), we finish the proof.

Theorem 5.4 (Convergence of SONX-v1 with smooth \(f_i\))

Consider SONX-v1, and suppose Assumption 5.4 and Assumption 5.5 hold and \(f_i\) is \(L_1\)-smooth. Let \(\eta_t=\eta=O(\frac{B\epsilon^3}{n\sigma_0^2})\), \(\gamma_t =\gamma=O(\frac{\epsilon^2}{\sigma_0^2})\), then after \(T=O(\frac{n\sigma_0^2}{B\epsilon^5})\) iterations, we have \(\mathbb{E}[\|\nabla F_{1/\bar\rho}(\mathbf{w}_t)\|_2^2]\leq O(\epsilon^2)\).

Proof.

By using the result for smooth \(f_i\) in Lemma 5.10, we have

\[\begin{align*} &\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2\bigg] \leq \mathbb{E}\bigg[\frac{2\sum_{t=1}^T(F_{1/\bar{\rho}}(\mathbf{w}_t) - F_{1/\bar{\rho}}(\mathbf{w}_{t+1}))}{\eta T} \bigg]+\eta\bar{\rho}G^2\\ & + \bar{\rho}(L_1+2\rho_1)\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\delta_t\bigg]. \end{align*}\]

Plugging in the bounds for the first and last terms on the RHS, we have

\[\begin{align*} &\mathbb{E}\bigg[\frac{1}{T}\sum_{t=1}^T\|\nabla F_{1/\bar{\rho}}(\mathbf{w}_t)\|_2^2\bigg] \leq \mathbb{E}\bigg[\frac{2(F(\mathbf{w}_1) - F_*)}{\eta T} \bigg]+\eta\bar{\rho}G^2\\ & +\bar{\rho}(L_1+2\rho_1)\bigg(\frac{4n\delta_0 }{B\gamma T}+\frac{8n^2 G_2^4G_1^2\eta^2}{B^2\gamma^2}+4\gamma\sigma_0^2\bigg). \end{align*}\]

Plugging in the orders of \(\eta\) and \(\gamma\), we finish the proof.

Convergence of SONX-v2

As with the first option, we first need to bound \(\delta_t\) and \(\delta'_t\).

Lemma 5.12

Under Assumption 5.4 and Assumption 5.5, by setting \(\gamma_t = \gamma\leq \frac{1}{2}\), \(\eta_t = \eta\), \(\gamma'_{t} = \frac{n-B}{B(1-\gamma)}+(1-\gamma)\), we have:

\[\begin{equation*} \begin{aligned} \mathbb{E}\left[\delta_t\right] &\leq \left(1-\frac{B\gamma}{2n}\right)^{2t}\delta_0+ 4\gamma \sigma_0^{2}+\frac{24 n^{2} G_2^4G_1^2 \eta^2}{B^2\gamma },\\ \mathbb{E}\left[\delta'_t\right] &\leq \left(1-\frac{B\gamma}{2n}\right)^{t}\delta'_0+ 2\gamma^{1/2} \sigma_0 +\frac{5 n G_2^2G_1 \eta}{B\gamma^{1/2} }. \end{aligned} \end{equation*}\]

The proof is omitted, as it is similar to that of Lemma 5.11 but based on Lemma 5.5.

Theorem 5.5 (Convergence of SONX-v2)

Consider SONX-v2, and suppose Assumption 5.4 and Assumption 5.5 hold.

The proof follows similarly to that of Theorem 5.3 and Theorem 5.4, and is left as an exercise for interested readers.

5.3.2 SONEX for Non-smooth Outer Functions

When \(f_i\) is Lipschitz continuous and non-smooth, the best complexity derived in the last subsection is \(O(n/(B\epsilon^6))\). Can we further improve the complexity when the inner functions are smooth? We present such a method and its analysis in this subsection.

Let us make the following assumptions.

Assumption 5.6

We assume that

  1. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|g_i(\mathbf{w}; \zeta) - g_i(\mathbf{w})\right\|_2^2]\leq \sigma_0^2\)

  2. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|\nabla g_i(\mathbf{w}; \zeta) - \nabla g_i(\mathbf{w})\right\|_2^2]\leq \sigma_2^2\)

  3. \(\quad\mathbb{E}_{\zeta\sim\mathbb{P}_i}[\left\|\nabla g_i(\mathbf{w}; \zeta)\right\|_2^2] \leq G_2^2\)

Assumption 5.7

The following conditions hold:

  1. \(f_i\) is \(\rho_1\)-weakly convex, \(G_1\)-Lipschitz continuous, and its proximal mapping \(\text{prox}_{f_i/\bar\rho_1}(g) = \arg\min_{\mathbf u\in\mathbb R^{d'}} \frac{\bar\rho_1}{2}\|\mathbf u - g\|_2^2 + f_i(\mathbf u)\) can be easily computed for \(\bar\rho_1>\rho_1\).

  2. \(g_i\) is \(L_2\)-smooth and \(G_2\)-Lipschitz continuous.

Moreau Envelope Smoothing of the outer function

A classical approach to improving convergence for non-smooth functions in convex optimization is smoothing, i.e., first smoothing the function and then applying an optimizer to the resulting smoothed problem. We define the Moreau envelope smoothing of \(f_i\) as follows:

\[\begin{align} \bar{f}_{i}(g) = \min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - g\|_2^2 + f_i(\mathbf{u}), \end{align}\]

where \(\bar\rho_1>\rho_1\). We present a lemma below regarding \(\bar{f}_i\).

Lemma 5.13

If \(f_i\) is \(G_1\)-Lipschitz continuous and \(\rho_1\)-weakly convex, then \(\bar{f}_i\) is \(\bar{L}_1\)-smooth and \(G_1\)-Lipschitz continuous, where \(\bar{L}_1=\frac{\bar\rho_1(2\bar\rho_1-\rho_1)}{(\bar\rho_1-\rho_1)}\). In addition, \(\bar f_i(g)>-\infty, \forall g\).

Proof.

Define \(\operatorname{prox}_{f_i/\bar\rho_1}(g) = \arg\min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - g\|_2^2 + f_i(\mathbf{u})\). We have

\[\begin{align*} \nabla\bar{f}_{i}(g) = \bar\rho_1 (g - \operatorname{prox}_{f_i/\bar\rho_1}(g)). \end{align*}\]

Due to the optimality condition of \(\operatorname{prox}_{f_i/\bar\rho_1}(g)\), we have

\[\begin{align*} \bar\rho_1 (g - \operatorname{prox}_{f_i/\bar\rho_1}(g)) \in \partial f_i(\operatorname{prox}_{f_i/\bar\rho_1}(g)). \end{align*}\]

Hence, \(\nabla\bar{f}_{i}(g)\in \partial f_i(\operatorname{prox}_{f_i/\bar\rho_1}(g))\), which implies \(\|\nabla\bar{f}_{i}(g)\|\leq G_1\). The smoothness of \(\bar{f}_i\) follows from Proposition 3.1. The lower boundedness of \(\bar f_i(g)\) follows from the coercivity of the strongly convex function \(\frac{\bar\rho_1}{2}\|\mathbf u - g\|_2^2 + f_i(\mathbf u)\) when \(\bar\rho_1>\rho_1\).

Relationship with Nesterov Smoothing

When \(f_i\) is a convex function, its Moreau envelope smoothing is equivalent to the well-known Nesterov smoothing. To see this, let \(f_i^*\) denote the convex conjugate of \(f_i\), i.e., \(f_i^*(\mathbf{u}) = \max_{g\in\mathbb{R}^{d'}}\mathbf{u}^{\top}g - f_i({g})\). Since \(f_i\) is convex, we have \(f_i(g) = \max_{\mathbf{u}\in\mathcal{U}}\mathbf{u}^{\top}g - f_i^*(\mathbf{u})\), where \(\mathcal{U}=\text{dom}(f_i^*)\) is bounded since \(\|\partial f_i(g)\|_2\leq G_1\) (cf. Lemma 1.8). As a result,

\[\begin{align*} &\bar{f}_{i}(g)=\min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - g\|_2^2 + f_i(\mathbf{u}) = \min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - g\|_2^2 +\max_{\mathbf{u}'\in\mathcal{U}}\mathbf{u}'^{\top}\mathbf{u} - f_i^*(\mathbf{u}'). \end{align*}\]

By Sion’s minimax theorem, we can switch the min and max. Hence,

\[\begin{align*} &\bar{f}_{i}(g)=\max_{\mathbf{u}'\in\mathcal{U}} \min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - g\|_2^2 +\mathbf{u}'^{\top}\mathbf{u} - f_i^*(\mathbf{u}'). \end{align*}\]

By solving the minimization over \(\mathbf{u}\) and plugging the optimal solution into the expression, we get

\[\begin{align}\label{eqn:ness} \bar{f}_{i}(g)= \max_{\mathbf{u}'\in\mathcal{U}}g^{\top}\mathbf{u}' - f_i^*(\mathbf{u}') - \frac{1}{2\bar\rho_1}\|\mathbf{u}'\|_2^2. \end{align}\]

This is known as the Nesterov smoothing of the function \(f_i(g)\). When \(\bar\rho_1\) is sufficiently large, \(\bar{f}_i\) is close to \(f_i\): since \(\|\mathbf{u}'\|_2\leq G_1\) for all \(\mathbf{u}'\in\mathcal{U}\), comparing (\(\ref{eqn:ness}\)) with \(f_i(g)=\max_{\mathbf{u}'\in\mathcal{U}}g^{\top}\mathbf{u}' - f_i^*(\mathbf{u}')\) gives \(0\leq f_i(g)-\bar{f}_i(g)\leq \frac{G_1^2}{2\bar\rho_1}\).

Example 5.1: Smoothed hinge function

Let us consider the Nesterov smoothing of the hinge function \(f(x)=[x]_+\). Let \(\bar\rho_1=1/\varepsilon\) for some small \(\varepsilon\ll1\). Then, the Nesterov smoothing of the hinge function is

\[\bar{f}(x) = \max_{u\in[0,1]} ux - \frac{\varepsilon}{2} u^2 = \left\{\begin{array}{ll}x - \frac{\varepsilon}{2} & \text{if } x\geq \varepsilon\\ \frac{x^2}{2\varepsilon}&\text{ if } 0<x<\varepsilon\\ 0 & \text{o.w.}\end{array}\right. .\]

This is also known as the smoothed hinge function.
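This closed form can be checked against a brute-force maximization of the variational form over a grid of \(u\in[0,1]\); the following small Python check is purely illustrative:

```python
def smoothed_hinge(x, eps):
    """Nesterov/Moreau smoothing of the hinge [x]_+ with bar_rho_1 = 1/eps."""
    if x >= eps:
        return x - eps / 2
    if x > 0:
        return x * x / (2 * eps)
    return 0.0

eps = 0.1
for x in [-1.0, 0.03, 0.07, 0.5, 2.0]:
    # brute-force the variational form max_{u in [0,1]} u*x - (eps/2)*u^2
    brute = max(u / 1000 * x - eps / 2 * (u / 1000) ** 2 for u in range(1001))
    assert abs(smoothed_hinge(x, eps) - brute) < 1e-4
    # uniform smoothing error: 0 <= [x]_+ - f_bar(x) <= eps/2
    assert 0.0 <= max(x, 0.0) - smoothed_hinge(x, eps) <= eps / 2 + 1e-12
```

The last assertion illustrates the general bound \(0\leq f_i(g)-\bar f_i(g)\leq G_1^2/(2\bar\rho_1)\) with \(G_1=1\) and \(\bar\rho_1=1/\varepsilon\).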

Solving the smoothed problem

With a smoothed outer function \(\bar{f}_i\), we consider optimizing the following problem with some proper value of \(\bar\rho_1\):

\[\begin{align}\label{eqn:smoothned} \min_{\mathbf{w}}\bar{F}(\mathbf{w}): = \frac{1}{n}\sum_{i=1}^n \bar{f}_{i}(g_i(\mathbf{w})). \end{align}\]

Following Lemma 4.3, \(\bar{F}(\cdot)\) is smooth with a smoothness parameter \(\bar{L}_F= G_1 L_2 + G_2^2 \bar{L}_1\).

The key concern is how the convergence of solving the above problem translates to the convergence of solving the original FCCO problem. To address this question, we introduce a new convergence measure, named approximate \(\epsilon\)-stationarity.

Definition (Approximate \(\epsilon\)-stationary solution). A point \(\mathbf{w}\) is an approximate \(\epsilon\)-stationary solution to the original FCCO problem if there exist \((\mathbf{u}_1,\ldots, \mathbf{u}_n)\) and \(\lambda_i\in\partial f_i(\mathbf{u}_i), \forall i\) such that

\[\begin{align} & \left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \lambda_i\right\|_2\leq \epsilon,\\ &\|\mathbf{u}_i-g_i(\mathbf{w})\|_2\leq O(\epsilon), \forall i. \end{align}\]

We note that this condition is closely related to the KKT condition of the following equivalent constrained formulation of the original FCCO problem:

\[\begin{align} &\min_{\mathbf{w}, \mathbf{u}}\quad \frac{1}{n}\sum_{i=1}^n f_i(\mathbf{u}_i)\\ &\text{s.t.}\quad g_i(\mathbf{w}) = \mathbf{u}_i, \forall i. \end{align}\]

The Lagrangian function of this constrained formulation is given by \(F(\mathbf{w}, \mathbf{u}, \lambda) = \frac{1}{n}\sum_{i=1}^n f_i(\mathbf{u}_i) + \sum_{i=1}^n\lambda_i^{\top} (g_i(\mathbf{w}) - \mathbf{u}_i).\) A solution \((\mathbf{w}, \mathbf{u}, \lambda)\) satisfies the KKT conditions if

\[\begin{align*} &\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w})\lambda_i =0, \quad \lambda_i\in\partial f_i(\mathbf{u}_i)\\ &\mathbf{u}_i = g_i(\mathbf{w}). \end{align*}\]

Hence, an approximate \(\epsilon\)-stationary solution satisfies the KKT condition approximately when \(\epsilon\ll 1\).

If \(f_i\) is \(L_1\)-smooth, an approximate \(\epsilon\)-stationary solution is also a standard \(O(\epsilon)\)-stationary solution. To see this, we have

\[\begin{align*} &\left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \nabla f_i(g_i(\mathbf{w}))\right\|_2 \\ & = \left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \nabla f_i(g_i(\mathbf{w})) - \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \nabla f_i(\mathbf{u}_i) + \frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \nabla f_i(\mathbf{u}_i)\right\|_2\\ & \leq \frac{1}{n}\sum_{i=1}^nG_2L_1\|\mathbf{u}_i - g_i(\mathbf{w})\|_2 + \epsilon \leq O(\epsilon). \end{align*}\]


Algorithm 17: SONEX

  1. Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\), \(\{\beta_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_0, \mathbf{v}_0\), \(\mathbf{u}_{0}\)
  2. \(\mathbf{w}_1 = \mathbf{w}_0 - \eta_0\mathbf{v}_0\)
  3. for \(t=1,\dotsc, T\) do
  4.   Draw a batch of samples \(\mathcal{B}_t\subset[n]\)
  5. for \(i\in \mathcal{B}_t\) do
  6.    Draw two samples \(\zeta_{i,t}\sim \mathbb{P}_i\)
  7.    Update the inner function value estimators by

   v1: \(\quad \mathbf{u}_{i, t} = (1-\gamma_t) \mathbf{u}_{i,t-1} + \gamma_t g_{i}(\mathbf{w}_t; \zeta_{i,t})\)

   v2: \(\quad \mathbf{u}_{i,t}=(1-\gamma_{t})\mathbf{u}_{i, t-1}+\gamma_{t} g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right) + \gamma'_{t} \left( g_i\left(\mathbf{w}_{t}; \zeta_{i,t}\right)-g_i\left(\mathbf{w}_{t-1}; \zeta_{i,t}\right) \right)\)

  8. end for
  9.   Set \(\mathbf{u}_{i,t} = \mathbf{u}_{i,t-1}\) for \(i\notin\mathcal{B}_t\)
  10.   Compute the vanilla gradient estimator \(\mathbf{z}_t = \frac{1}{B}\sum_{i \in \mathcal{B}_{t}} \nabla g_{i}(\mathbf{w}_t;\zeta'_{i,t}) \nabla\bar{f}_{i}(\mathbf{u}_{i,t})\)
  11.   Update the MA gradient estimator \(\mathbf{v}_{t} = (1-\beta_t) \mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\)
  12.   Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{v}_{t}\)
  13. end for
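The main loop above can be sketched in a few lines. This is a minimal illustration of the v1 variant with \(d'=1\) (scalar inner function values), not a production implementation; the callables `g_stoch`, `grad_g_stoch`, and `grad_f_bar` are assumed to be supplied by the user:

```python
import random

def sonex_v1(w, u, v, T, eta, gamma, beta, B, n,
             g_stoch, grad_g_stoch, grad_f_bar):
    """Sketch of SONEX-v1 (Algorithm 17 with the v1 inner update).

    g_stoch(i, w)      -- stochastic estimate of g_i(w)            [assumed]
    grad_g_stoch(i, w) -- stochastic gradient of g_i, a length-d list [assumed]
    grad_f_bar(i, u_i) -- gradient of the Moreau envelope f_bar_i  [assumed]
    """
    for t in range(1, T + 1):
        batch = random.sample(range(n), B)
        for i in batch:
            # v1 moving-average update of the inner-value estimator u_i
            u[i] = (1 - gamma) * u[i] + gamma * g_stoch(i, w)
        # vanilla gradient estimator z_t averaged over the batch
        z = [0.0] * len(w)
        for i in batch:
            ggrad = grad_g_stoch(i, w)      # length d (d' = 1 here)
            fgrad = grad_f_bar(i, u[i])     # scalar since d' = 1
            for k in range(len(w)):
                z[k] += ggrad[k] * fgrad / B
        # moving-average gradient estimator and model update
        v = [(1 - beta) * vk + beta * zk for vk, zk in zip(v, z)]
        w = [wk - eta * vk for wk, vk in zip(w, v)]
    return w, u, v
```

For instance, with noiseless \(g_i(w) = w + c_i\) and smooth \(f_i(u)=u^2/2\) (so \(\nabla\bar f_i \approx \nabla f_i\)), full batches and \(\gamma=\beta=1\) reduce the loop to plain gradient descent on \(\frac{1}{n}\sum_i (w+c_i)^2/2\).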

The following proposition states that an \(\epsilon\)-stationary solution to the smoothed problem (\(\ref{eqn:smoothned}\)) is an approximate \(\epsilon\)-stationary solution to the original problem when \(\bar\rho_1\) is sufficiently large.

Proposition 5.1

Let \(\mathbf{w}\) be an \(\epsilon\)-stationary solution to (\(\ref{eqn:smoothned}\)) with \(\bar\rho_1 = 1/\epsilon\). Then \(\mathbf{w}\) is also an approximate \(\epsilon\)-stationary solution to the original FCCO problem.

Proof.

Given that \(\mathbf{w}\) is an \(\epsilon\)-stationary solution to (\(\ref{eqn:smoothned}\)), we have

\[\begin{align*} \left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \nabla\bar{f}_{i}(g_i(\mathbf{w}))\right\|_2\leq \epsilon. \end{align*}\]

We define \(\mathbf{u}_i = \operatorname{prox}_{f_i/\bar\rho_1}(g_i(\mathbf{w})) = \arg\min_{\mathbf{u}}f_i(\mathbf{u}) + \frac{\bar\rho_1}{2}\|\mathbf{u} - g_i(\mathbf{w})\|_2^2\) and \(\lambda_i = \nabla\bar{f}_{i}(g_i(\mathbf{w}))\). Since \(\nabla\bar{f}_{i}(g_i(\mathbf{w}))\in \partial f_i(\operatorname{prox}_{f_i/\bar\rho_1}(g_i(\mathbf{w})))= \partial f_i(\mathbf{u}_i)\), we have \(\lambda_i\in\partial f_i(\mathbf{u}_i)\) and \(\left\|\frac{1}{n}\sum_{i=1}^n\nabla g_i(\mathbf{w}) \lambda_i\right\|_2\leq \epsilon.\)

Due to the optimality condition of \(\mathbf{u}_i\), we have \(g_i(\mathbf{w}) - \mathbf{u}_i \in \partial f_i(\mathbf{u}_i)/\bar\rho_1\). Since \(f_i\) is \(G_1\)-Lipschitz continuous, every subgradient of \(f_i\) is bounded in norm by \(G_1\); combined with \(\bar\rho_1\geq 1/\epsilon\), this gives \(\|\mathbf{u}_i - g_i(\mathbf{w})\|_2\leq G_1/\bar\rho_1\leq G_1\epsilon = O(\epsilon)\).

Next, we discuss algorithms and complexities for solving the smoothed problem when \(\bar\rho_1 = 1/\epsilon\). Since both inner and outer functions of the smoothed problem are smooth, we can leverage the moving average gradient estimators. We present detailed steps for solving the smoothed problem in Algorithm 17, which is referred to as SONEX.

A key step in implementing SONEX for solving the smoothed problem (\(\ref{eqn:smoothned}\)) is the calculation of \(\nabla \bar{f}_{i}(\mathbf{u}_{i,t})\), which amounts to solving the proximal mapping of \(f_i\), i.e.,

\[\operatorname{prox}_{f_i/\bar\rho_1}(\mathbf{u}_{i,t}) = \arg\min_{\mathbf{u}\in\mathbb{R}^{d'}} \frac{\bar\rho_1}{2}\|\mathbf{u} - \mathbf{u}_{i,t}\|_2^2 + f_i(\mathbf{u}).\]

In fact, \(\nabla \bar{f}_{i}(\mathbf{u}_{i,t}) = \bar\rho_1 (\mathbf{u}_{i,t} - \operatorname{prox}_{f_i/\bar\rho_1}(\mathbf{u}_{i,t}) )\).
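For instance, with \(f_i(u) = |u|\) (which is \(1\)-Lipschitz but non-smooth) the proximal mapping is the scalar soft-thresholding operator, so the envelope gradient is available in closed form. A minimal sketch:

```python
def prox_abs(u, rho_bar):
    """prox_{|.|/rho_bar}(u): soft-thresholding with threshold 1/rho_bar."""
    thr = 1.0 / rho_bar
    if u > thr:
        return u - thr
    if u < -thr:
        return u + thr
    return 0.0

def grad_f_bar(u, rho_bar):
    """Gradient of the Moreau envelope: rho_bar * (u - prox_{f/rho_bar}(u))."""
    return rho_bar * (u - prox_abs(u, rho_bar))

# Inside the threshold |u| <= 1/rho_bar the prox maps to 0, so the envelope
# gradient is rho_bar * u; outside it saturates at sign(u), i.e., the
# subgradient range [-1, 1] of |u|.
assert abs(grad_f_bar(0.002, 100.0) - 0.2) < 1e-9
assert abs(grad_f_bar(5.0, 100.0) - 1.0) < 1e-9
# As in the proof of Proposition 5.1: ||u - prox|| <= G_1 / rho_bar, G_1 = 1.
assert abs(5.0 - prox_abs(5.0, 100.0)) <= 1.0 / 100.0 + 1e-12
```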

Convergence of SONEX-v1

Finally, we present the complexity of SONEX-v1 for finding an \(\epsilon\)-stationary solution to the smoothed problem when \(\bar\rho_1=1/\epsilon\).

Corollary 5.1 (Convergence of SONEX-v1)

Under Assumption 5.6 and Assumption 5.7, if we set \(\mathbf{u}_0\) such that \(\frac{1}{n}\mathbb{E}[\sum_{i=1}^n\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_0)\|_2^2]\leq O(\epsilon)\), \(\beta =O(\frac{\epsilon^2}{\sigma^2}), \gamma= O(\frac{\epsilon^4}{\sigma_0^2}),\eta= \min(O(\epsilon), O(\beta\epsilon), O(\frac{B\epsilon\gamma}{n})), \bar\rho_1=1/\epsilon>\rho_1\), then SONEX-v1 finds an approximate \(O(\epsilon)\)-stationary solution to the original FCCO problem with a complexity of \(O(\frac{n\sigma_0^2}{B\epsilon^7})\).

Proof.

The proof can be completed by using the convergence result of SOX (cf. Theorem 5.1) with noting the order of \(\bar{L}_1=O(\bar\rho_1) = O(1/\epsilon)\) and \(L_F = O(\bar{L}_1) = O(1/\epsilon)\).

Convergence of SONEX-v2

SONEX-v2 is a combination of SOX and MSVR, i.e., with the \(\mathbf{u}_{t}\) sequence from MSVR and the \(\mathbf{v}_{t}\) sequence from SOX.

Theorem 5.6 (Convergence of SONEX-v2)

Under Assumption 5.6 and Assumption 5.7, if we set \(\mathbf{u}_0\) such that \(\frac{1}{n}\mathbb{E}[\sum_{i=1}^n\|\mathbf{u}_{i,0} - g_i(\mathbf{w}_0)\|_2^2]\leq O(\epsilon^3/\sigma_0)\), \(\beta=O(\frac{\epsilon^2}{\sigma^2}), \gamma=O(\frac{\epsilon^2}{\sigma_0^2}), \eta =\min(O(\epsilon), O(\beta\epsilon), O(\frac{B\sqrt{\gamma}\epsilon}{n}))\) and \(\bar\rho_1=\frac{1}{\epsilon}>\rho_1\), then SONEX-v2 finds an approximate \(\epsilon\)-stationary solution to the original FCCO problem with a complexity of

\[T=O\left(\max\left\{\frac{1}{\epsilon^3}, \frac{ \sigma^2}{\epsilon^5}, \frac{ n\sigma_0}{B\epsilon^5}\right\}\right),\]

where \(\sigma^2 = \frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2(n-B)}{B(n-1)}\).

Proof.

The proof is similar to that of Theorem 4.3 except that the \(\diamond\) inequality in Lemma 4.10 is replaced by the following for using MSVR estimators (see Lemma 5.5):

\[(\diamond)\quad \mathbb{E}\left[\delta_{t+1}\right] \leq \mathbb{E}[(1-\bar\gamma) \delta_{t} + C_3\eta^2 \Gamma_t + \bar\gamma^2{\sigma'}^2],\]

where \(\bar\gamma = \frac{B\gamma}{n}, {\sigma'}^2=\frac{2n\sigma_0^2}{B}, C_3=O(n/B)\).
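Treating the \((\diamond)\) recursion as an equality gives a quick numeric picture of the role of \(\bar\gamma\): \(\delta_t\) contracts geometrically to roughly \((C_3\eta^2\Gamma + \bar\gamma^2{\sigma'}^2)/\bar\gamma\), which is why \(\bar\gamma = O(\epsilon^2/{\sigma'}^2)\) keeps the estimation error at the \(O(\epsilon^2)\) level. A minimal sketch, with the constant drift term \(\Gamma\) standing in for \(\Gamma_t\):

```python
def delta_recursion(delta0, gamma_bar, sigma2p, C3, eta, Gamma, T):
    """Iterate d <- (1 - gamma_bar) d + C3 * eta^2 * Gamma + gamma_bar^2 * sigma'^2.

    The fixed point is (C3 * eta^2 * Gamma + gamma_bar^2 * sigma2p) / gamma_bar,
    i.e., a noise floor of order gamma_bar * sigma2p when eta is small.
    """
    d = delta0
    for _ in range(T):
        d = (1 - gamma_bar) * d + C3 * eta**2 * Gamma + gamma_bar**2 * sigma2p
    return d

# With eta = 0 the iterates settle at the gamma_bar * sigma'^2 noise floor.
assert abs(delta_recursion(1.0, 0.1, 1.0, 1.0, 0.0, 0.0, 500) - 0.1) < 1e-6
```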

We only highlight the changes below and leave details as an exercise. First, the condition on \(\eta\) in Lemma 4.10 is changed to

\[\eta\leq O\left(\frac{1}{L}, \frac{\beta}{\sqrt{C_2}}, \sqrt{\frac{\bar\gamma}{C_1C_3}}\right).\]

The settings on \(\beta, \bar\gamma\) remain the same as \(\beta =O(\frac{\epsilon^2}{\sigma^2}), \bar\gamma = O(\frac{\epsilon^2}{C_1\sigma'^2})\). The iteration complexity becomes:

\[\begin{align*} T&= O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon \sqrt{C_2}}{\epsilon^2\beta}, \frac{C_\Upsilon \sqrt{C_1C_3}}{\sqrt{\bar\gamma}\epsilon^2}\right\}\right)\\ &=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon \sigma^2\sqrt{C_2}}{\epsilon^4}, \frac{C_\Upsilon \sqrt{C_3}C_1{\sigma'}}{\epsilon^3}\right\}\right), \end{align*}\]

and \(C_\Upsilon\) is changed to

\[\begin{align*} C_\Upsilon&=A_0 + \frac{\eta}{\beta}\Delta_{0} + \frac{C_1\eta}{\bar\gamma} \delta_{0} \leq A_0 + O\bigg(\frac{1}{\sqrt{C_2}}\bigg)\Delta_{0} + O\bigg(\frac{\sqrt{C_1}}{\sqrt{C_3\bar\gamma}}\bigg)\delta_{0} \\ & =A_0 + O\bigg(\frac{1}{\sqrt{C_2}}\bigg)\Delta_{0} + O\bigg(\frac{C_1\sigma'}{\sqrt{8C_3}\epsilon}\bigg) \delta_{0}. \end{align*}\]

Then, as in the proof of Theorem 5.1, we substitute \(C_1 = O(\bar{L}_1^2)\), \(C_2 = O(\bar{L}_F^2)\), \(C_3 = O(n/B)\), \(\sigma^2 = \frac{G_1^2\sigma_2^2}{B} + \frac{G_1^2G_2^2(n-B)}{B(n-1)}\), and \({\sigma'}^2 = O(n\sigma_0^2/B)\) into the above complexity expression and \(C_\Upsilon\), and obtain

\[\begin{align*} T & = O\left(\max\left\{\frac{C_\Upsilon \bar{L}_F}{\epsilon^2}, \frac{C_\Upsilon \bar{L}_F\sigma^2}{\epsilon^4}, \frac{C_\Upsilon n\bar{L}_1^2\sigma_0}{B\epsilon^3}\right\}\right),\\ C_\Upsilon&\leq O(\bar F(\mathbf{w}_0)-\bar{F}_*) + O\bigg(\frac{1}{\bar{L}_F}\bigg)\Delta_{0} + O\bigg(\frac{\bar{L}_1^2\sigma_0}{\epsilon}\bigg) \delta_{0}. \end{align*}\]

We finish the proof by noting that \(\bar{L}_1=O(1/\epsilon)\) and \(\bar{L}_F=O(1/\epsilon)\) and \(C_\Upsilon=O(1)\) if \(\delta_{0}\leq O(\epsilon^3/\sigma_0)\).

← Go Back