Section 4.5.2 Non-convex Min-Min Optimization
We can extend SMDA to solving a non-convex strongly-convex min-min problem: \[\begin{align}\label{eqn:minmin} \min_{\mathbf{w}\in\mathbb{R}^d}\min_{\mathbf{u}\in\mathcal U} f(\mathbf{w}, \mathbf{u}): = \mathbb{E}_{\xi}[f(\mathbf{w}, \mathbf{u}; \xi)], \end{align}\] where \(f(\mathbf{w},\mathbf{u})\) is smooth, non-convex in terms of \(\mathbf{w}\) and strongly convex in terms of \(\mathbf{u}\) and \(\mathcal U\) is a closed convex set. If the \(\mathbf{u}^*(\mathbf{w})=\arg\min_{\mathbf{u}\in\mathcal U} f(\mathbf{w}, \mathbf{u})\) exists and unique, then we have \(\nabla F(\mathbf{w}) = \nabla_1 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w}))\). Hence, its gradient also exhibits a compositional structure, where the inner function \(\mathbf{u}^*(\mathbf{w})\) is a solution to a strongly convex problem.
SMDA can be modified by replacing the \(\mathbf{u}\) update with \[\mathbf{u}_{t+1} = \Pi_{\mathcal U}[\mathbf{u}_{t} - \gamma_t \nabla_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)].\] Then, the same convergence result in the last subsection can be established for min-min problem, which is omitted here.
Algorithm 13: A novel method for weakly convex minimization
- Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}\); starting points \(\mathbf{w}_1\), \(\mathbf{u}_1, \mathbf{v}_1\)
- For \(t=1,\dotsc, T\)
- Sample \(\zeta_t\) and compute \(\mathcal{G}(\mathbf{u}_t; \zeta_t)=\partial g(\mathbf{u}_t; \zeta_t)\)
- Update \(\mathbf{u}_{t+1} = \mathbf{u}_t - \gamma_t(\mathcal{G}(\mathbf{u}_t; \zeta_t) + \rho (\mathbf{u}_t - \mathbf{w}_t))\)
- Update \(\mathbf{w}_{t+1} = (1-2\eta_t\rho)\mathbf{w}_t + 2\eta_t\rho \mathbf{u}_{t}\)
4.5.2.1 Application to weakly convex minimization
Next, we present an application to solving weakly convex minimization problems: \[\begin{align} \min_{\mathbf{w}} F(\mathbf{w}) := \mathbb{E}[g(\mathbf{w}; \zeta)], \end{align}\] where \(F(\mathbf w) > -\infty\) is \(\rho\)-weakly convex, as discussed in Chapter 3.
As argued in Section 3.1.4, an \(\epsilon\)-stationary solution of the Moreau envelope of \(F(\mathbf{w})\) corresponds to a nearly \(\epsilon\)-stationary solution of the original problem. Hence, we consider optimizing the Moreau envelope directly: \[\begin{align}\label{eqn:minmin-weak} \min_{\mathbf{w}} F_{\rho}(\mathbf{w}) := \min_{\mathbf{u}} \mathbb{E}[g(\mathbf{u}; \zeta)] + \rho \|\mathbf{u} - \mathbf{w}\|_2^2. \end{align}\] Define \(f(\mathbf{w}, \mathbf{u}) = \mathbb{E}[g(\mathbf{u}; \zeta)] + \rho \|\mathbf{u} - \mathbf{w}\|_2^2\). Then \(f(\mathbf{w}, \mathbf{u})\) is \(\rho\)-strongly convex with respect to \(\mathbf{u}\) due to the \(\rho\)-weak convexity of \(F\).
For updating \(\mathbf{u}\), we use the standard SGD: \[\begin{align} \mathbf{u}_{t+1} = \mathbf{u}_t - \gamma_t(\mathcal{G}(\mathbf{u}_t; \zeta_t) + 2\rho (\mathbf{u}_t - \mathbf{w}_t)). \end{align}\] where \(\mathcal{G}(\mathbf{u}_t; \zeta_t)\in\partial g(\mathbf{u}_t; \zeta_t)\). For updating \(\mathbf{w}\), then we just apply GD with its gradient given by \(\nabla_1 f(\mathbf{w}_t,\mathbf{u}_t) = 2\rho(\mathbf{w}_t - \mathbf{u}_t)\): \[\begin{align} \mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t 2\rho (\mathbf{w}_t - \mathbf{u}_t) = (1-2\eta_t\rho)\mathbf{w}_t + 2\eta_t\rho\mathbf{u}_t. \end{align}\] We present the updates in Algorithm 13. An interesting observation about this algorithm is that the \(\mathbf{u}\) update is similar to the Momentum update except that the momentum term \(\mathbf{u}_t - \mathbf{u}_{t-1}\) is replaced by \(\mathbf{u}_t - \mathbf{w}_t\), where \(\mathbf{w}_t\) is a MA weight vector.
Convergence Analysis
Let us first prove the following lemma.
Lemma 4.21
We have (i) \(F_\rho\) is \(L_F\)-smooth with \(L_F=\frac{6}{\rho}\); (ii) \(\nabla_1 f(\mathbf{w}, \mathbf{u})\) is Lipschitz continuous with \(L_1=2\rho\), and (iii) \(\mathbf{u}^*(\mathbf{w})\) is \(1\)-Lipschitz continuous.
Proof
The smoothness of \(F_\rho\) has been proved in Proposition 3.1 with \(\lambda=\rho/2\). The Lipschitz continuity of \(\nabla_1 f(\mathbf{w}, \mathbf{u})= 2\rho(\mathbf{w} - \mathbf{u})\) is obvious. Next, let us prove the Lipschitz continuity of \(\mathbf{u}^*(\mathbf{w})\). The proof is similar to that of Lemma 4.17.
Let us consider \(\mathbf{w}_1, \mathbf{w}_2\). By the optimality condition of \(\mathbf{u}^*(\mathbf{w}_1)\) and \(\mathbf{u}^*(\mathbf{w}_2)\) for a concave function, there exists \(\mathbf{v}(\mathbf{w}_1)\in\partial_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_1)), \mathbf{v}(\mathbf{w}_2)\in\partial_2 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2))\) \[\begin{align*} \mathbf{v}(\mathbf{w}_1)^{\top}(\mathbf{u} - \mathbf{u}^*(\mathbf{w}_1))\leq 0,\quad \forall \mathbf{u}\\ \mathbf{v}(\mathbf{w}_2)^{\top}(\mathbf{u} - \mathbf{u}^*(\mathbf{w}_2))\leq 0,\quad\forall \mathbf{u} \end{align*}\] Let \(\mathbf{u}=\mathbf{u}^*(\mathbf{w}_2)\) in the first inequality and \(\mathbf{u}=\mathbf{u}^*(\mathbf{w}_1)\) in the second equality and add them together we have \[\begin{align*} (\mathbf{v}(\mathbf{w}_1)-\mathbf{v}(\mathbf{w}_2))^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\leq 0. \end{align*}\] Since \(-f(\mathbf{w}_1,\cdot)\) is \(\rho\)-strongly convex, similar to Lemma 1.6, we have for any \(\mathbf{v}\in\partial_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_2))\), \[\begin{align*} (\mathbf{v}(\mathbf{w}_1)&-\mathbf{v})^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\geq \rho\|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2^2. \end{align*}\] Combining these two inequalities we have \[\begin{align*} \rho\|\mathbf{u}^*(\mathbf{w}_2) &- \mathbf{u}^*(\mathbf{w}_1)\|_2^2\leq (\mathbf{v}(\mathbf{w}_2)-\mathbf{v})^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\\ &\leq \|\mathbf{v}(\mathbf{w}_2)-\mathbf{v}\|_2\|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2. \end{align*}\] Since there exists \(\mathbf{v}'\in\partial g(\mathbf{u}^*(\mathbf{w}_2))\) such that \(\mathbf{v}(\mathbf{w}_2) =\mathbf{v}'+ \rho (\mathbf{u}^*(\mathbf{w}_2) - \mathbf{w}_2)\), we let \(\mathbf{v} = \mathbf{v}' + \rho (\mathbf{u}^*(\mathbf{w}_2) - \mathbf{w}_1)\), then \[\begin{align*} \|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2\leq \|\mathbf{w}_2 - \mathbf{w}_1\|_2. \end{align*}\]■
Since \(\partial_2f(\mathbf{w},\mathbf{u})\) is not Lipschitz continuous with respect to \(\mathbf{u}\), Lemma 4.20 is not directly applicable. We develop a similar one below.
Lemma 4.22
Consider the following update: \[\begin{align*} \mathbf{u}_{t+1} = \mathbf{u}_t - \gamma_t(\mathcal{G}(\mathbf{u}_t; \zeta_t) + 2\rho (\mathbf{u}_t - \mathbf{w}_t)). \end{align*}\] If \(\mathbb{E}_\zeta[\|\mathcal{G}(\mathbf{u};\zeta)\|^2_2]\leq G^2\) and \(\gamma_t\rho<1/8\), then we have \[\begin{align*} & \mathbb{E}_t\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_{t+1})\|^2_2\\ & \leq \left(1-\frac{\gamma_t \rho}{2}\right) \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + 8\gamma_t^2 G^2 + \frac{12}{\gamma_t \rho}\mathbb{E}_t\|\mathbf{w}_{t+1} -\mathbf{w}_t\|^2_2. \end{align*}\]
Proof
Since \(\mathbf{u}_{t+1}\) is one-step SGD update of \(f(\mathbf{w}_t, \mathbf{u})\), the proof is similar to Lemma 3.8 for the non-smooth case. \[\begin{align}\label{eqn:u-rec} & \|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 = \|\mathbf{u}_t -\gamma_t \left(\mathcal{G}(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t- \mathbf{w}_t)\right)- \mathbf{u}^*(\mathbf{w}_t)\|^2_2\\ & = \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 + \gamma_t^2\|\mathcal{G}(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t- \mathbf{w}_t)\|^2_2\notag\\ &- 2\gamma_t(\mathcal{G}(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t-\mathbf{w}_t))^{\top}(\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)).\notag \end{align}\] Note that \(0 \in \partial g(\mathbf{u}^*(\mathbf{w}_t)) + 2\rho(\mathbf{u}^*(\mathbf{w}_t) - \mathbf{w}_t)\). Thus, \(\mathbf{v}_{t-1}=2\rho(\mathbf{w}_t - \mathbf{u}^*(\mathbf{w}_t))\in \partial g(\mathbf{u}^*(\mathbf{w}_t))\), \[\begin{align*} \mathbb{E}_t\|\mathcal{G}(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t-\mathbf{w}_t)\|^2_2 & = \mathbb{E}_t\|\mathcal{G}(\mathbf{u}_t;\zeta_t) - \mathbf{v}_{t-1} +2\rho(\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t))\|^2_2 \\ & \leq 2\mathbb{E}_t\|\mathcal{G}(\mathbf{u}_t;\zeta_t)-\mathbf{v}_{t-1}\|_2^2 + 8 \rho^2\|\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)\|_2^2\\ & \leq 8G^2 + 8 \rho^2\|\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)\|_2^2, \end{align*}\] where the last inequality uses \(\|\mathbf{v}_{t-1}\|_2\leq G\). For the last term in (\(\ref{eqn:u-rec}\)), let \(\tilde{\mathbf{v}}_{t} = \mathbb{E}[\mathcal{G}(\mathbf{u}_t;\zeta_t)]+2\rho(\mathbf{u}_t- \mathbf{w}_t)\in\partial_2 f(\mathbf{w}_t, \mathbf{u}_t)\), then we have \[\begin{align*} & \mathbb{E}_t(\mathcal{G}(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t- \mathbf{w}_t))^{\top}(\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)) = \tilde{\mathbf{v}}_{t}^{\top}(\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t))\\ & = (\tilde{\mathbf{v}}_{t} - \mathbf{v}(\mathbf{w}_t))^{\top}(\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)) \geq \rho \|\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)\|_2^2. \end{align*}\] where \(\mathbf{v}(\mathbf{w}_t):=0\in\partial_2 f(\mathbf{w}_t, \mathbf{u}^*(\mathbf{w}_t))\) and the last inequality is due to the strong convexity of \(f\) in terms of \(\mathbf{u}\). Combining the above inequalities we have \[\begin{align*} & \|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 = \|\mathbf{u}_t -\gamma_t \left(\partial g(\mathbf{u}_t;\zeta_t)+2\rho(\mathbf{u}_t- \mathbf{w}_t)\right)- \mathbf{u}^*(\mathbf{w}_t)\|^2_2\\ & \leq \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 + \gamma_t^2(8G^2 + 8\rho^2 \|\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)\|_2^2) - 2\gamma_t\rho \|\mathbf{u}_t-\mathbf{u}^*(\mathbf{w}_t)\|_2^2\\ &=(1-2\gamma_t\rho + 8\gamma_t^2\rho^2)\|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + 8\gamma_t^2G^2\\ &\leq (1-\gamma_t\rho )\|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + 8\gamma_t^2G^2 \end{align*}\] where the last inequality uses \(\gamma_t\leq \frac{1}{8\rho}\). Since \(\mathbf{u}^*(\mathbf{w})\) is \(1\)-Lipschitz continuous, we have \[\begin{align*} & \mathbb{E}_t\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_{t+1})\|^2_2\\ & \leq \left(1 + \frac{\gamma_t\rho}{2}\right) \mathbb{E}_t\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 + \left(1+\frac{2}{\gamma_t\rho}\right) \|\mathbf{u}^*(\mathbf{w}_{t+1}) - \mathbf{u}^*(\mathbf{w}_t)\|^2_2 \\ & \leq \left(1-\frac{\gamma_t \rho}{2}\right) \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + 16\gamma_t^2 G^2 + \frac{3}{\gamma_t \rho}\mathbb{E}_t\|\mathbf{w}_{t+1} -\mathbf{w}_t\|^2_2. \end{align*}\]■
Lemma 4.23
Let \(\mathbf{z}_t = 2\rho(\mathbf{w}_t - \mathbf{u}_t)\). For the update \(\mathbf{w}_{t+1} = \mathbf{w}_t -\eta_t \mathbf{z}_t\), if \(\eta_t\leq 1/(2L_F)\), we have \[\begin{align}\label{eqn:nasa_starter} F_\rho(\mathbf{w}_{t+1})& \leq F_\rho(\mathbf{w}_t) + \frac{\eta_t}{2} \|\nabla F_\rho(\mathbf{w}_t) - \mathbf{z}_{t}\|_2^2- \frac{\eta_t}{2}\|\nabla F_\rho(\mathbf{w}_t)\|_2^2 - \frac{1}{4\eta_t}\|\mathbf{w}_{t+1} - \mathbf{w}_t\|_2^2\notag, \end{align}\] where \(L_F\) is the smoothness parameter of \(F_\rho(\cdot)\).
Since \(\nabla F_\rho(\mathbf{w}_t) = 2\rho(\mathbf{w}_t - \mathbf{u}^*(\mathbf{w}_t))\), hence \(\|\nabla F_\rho(\mathbf{w}_t) -\mathbf{z}_t\|_2^2 = 4\rho^2\|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2\), whose recursion has been established in Lemma 4.22. We can combine these two lemmas and establish a complexity of \(O(1/\epsilon^4)\) for Algorithm 13 in order to find an \(\epsilon\)-stationary solution to \(F_\rho(\cdot)\).
4.5.2.2 Application to weakly-convex strongly-concave min-max problems
The same technique can be applied to solving weakly-convex strongly-concave min-max problems \(\min_{\mathbf{w}}\max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u})\) with a single loop algorithm. In subsection 4.5.1, we assume the partial gradient \(\nabla_1 f(\mathbf{w}, \mathbf{u})\) is Lipschitz continuous. We replace this assumption by an assumption that \(f(\mathbf{w}, \mathbf{u})\) is \(\rho\)-weakly convex in terms of \(\mathbf{w}\) for any \(\mathbf{u}\in\mathcal U\).
In this case, \(F(\mathbf{w})=\max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u})\) is not smooth but weakly convex. Let us consider its Moreau envelope: \[\begin{align*} \min_{\mathbf{w}} F_\rho(\mathbf{w}): = \min_{\mathbf{u}_1} F(\mathbf{u}_1) + \rho \|\mathbf{u}_1 - \mathbf{w}\|_2^2. \end{align*}\] This problem is equivalent to \[\begin{align*} \min_{\mathbf{w}, \mathbf{u}_1} \max_{\mathbf{u}_2\in\mathcal U} f(\mathbf{u}_1, \mathbf{u}_2) + \rho \|\mathbf{u}_1 - \mathbf{w}\|_2^2, \end{align*}\] which is strongly convex in terms of \(\mathbf{u}_1\) and strongly concave in terms of \(\mathbf{u}_2\).
Compared to (\(\ref{eqn:minmin-weak}\)), this problem just adds another layer of inner maximization. However, it can be still mapped to the general framework as discussed at the beginning. The gradient of \(F_\rho(\mathbf{w})\) is given by \(\mathcal{M}(\mathbf{w}, \mathbf{u}_1^*(\mathbf{w}))=\rho (\mathbf{w} - \mathbf{u}_1^*(\mathbf{w}))\). If we track \(\mathbf{u}_1^*(\mathbf{w}_t)\) by \(\mathbf{u}_{1,t}\) and its update relies on the gradient \(\partial_1 f(\mathbf{u}_{1,t}, \mathbf{u}_{2}^*(\mathbf{u}_{1,t}))\). Hence, we just need another variable \(\mathbf{u}_{2,t}\) to track \(\mathbf{u}_2^*(\mathbf{u}_{1,t})\).
We can develop a similar algorithm. First, let us update \(\mathbf{u}_1,\mathbf{u}_2\). Given \(\mathbf{w}_t, \mathbf{u}_{1,t}, \mathbf{u}_{2,t}\), we update \(\mathbf{u}_{1,t+1}, \mathbf{u}_{2,t+1}\) with SGD update by \[\begin{align} \mathbf{u}_{2,t+1} & = \Pi_{\mathcal U}[\mathbf{u}_{2,t} + \gamma_2 \partial_2 f(\mathbf{u}_{1,t}, \mathbf{u}_{2,t}; \zeta_t)]\\ \mathbf{u}_{1,t+1} &= \mathbf{u}_{1,t} - \gamma_1(\partial_1f(\mathbf{u}_{1,t}, \mathbf{u}_{2,t}; \zeta_t)+2\rho (\mathbf{u}_{1,t} - \mathbf{w}_t)). \end{align}\] Then we update \(\mathbf{w}_{t+1}\) with GD update by \[\begin{align} \mathbf{w}_{t+1} = \mathbf{w}_t - \eta 2\rho(\mathbf{w}_t - \mathbf{u}_{1,t}) = (1-2\eta\rho) \mathbf{w}_t + 2\eta\rho\mathbf{u}_{1,t}. \end{align}\] This algorithm also enjoys a complexity of \(O(1/\epsilon^4)\) for finding a nearly \(\epsilon\)-stationary solution of \(F(\mathbf{w})\). We refer the readers to Hu et al. (2024) for a convergence analysis of this algorithm.
4.5.2.3 Application to Compositional Optimization
We can apply a similar strategy to a compositional function \(F(\mathbf{w})=f_0(g(\mathbf{w}))\), where \(f_0\) is smooth convex and \(g\) is weakly convex. With the conjugate of \(f_0\), we can write \[\min_{\mathbf{w}}f_0(g(\mathbf{w})) = \min_{\mathbf{w}}\max_{\mathbf{u}_2\in\mathcal U}f(\mathbf{w},\mathbf{u}_2) := \mathbf{u}_2^{\top} g(\mathbf{w}) - f_0^*(\mathbf{u}_2).\] Since \(f_0\) is smooth, then \(f_0^*\) is strongly convex. Then if \(g\) is weakly convex and \(\mathcal U\) is bounded (i.e., \(f_0\) is Lipschitz), then \(f(\mathbf{w},\mathbf{u})\) is weakly convex and strongly concave. Optimizing the Moreau envelope of \(f_0(g(\mathbf{w}))\) yields: \[\min_{\mathbf{w}, \mathbf{u}_1}\max_{\mathbf{u}_2\in\mathcal U}\mathbf{u}_2^{\top} g(\mathbf{u}_1) - f_0^*(\mathbf{u}_2) + \rho\|\mathbf{u}_1 - \mathbf{w}\|_2^2,\] which is strongly convex in terms of \(\mathbf{u}_1\) and strongly concave in terms of \(\mathbf{u}_2\). We give an update below: \[\begin{align*} \mathbf{u}_{2,t+1} & = \Pi_{\mathcal U}[\mathbf{u}_{2,t} + \gamma_2 g(\mathbf{u}_{1,t}; \zeta_t)]\\ \mathbf{u}_{1,t+1} &= \mathbf{u}_{1,t} - \gamma_1(\partial_1 g(\mathbf{u}_{1,t}; \zeta_t)\mathbf{u}_{2,t}+2\rho (\mathbf{u}_{1,t} - \mathbf{w}_t))\\ \mathbf{w}_{t+1} &= \mathbf{w}_t - \eta 2\rho(\mathbf{w}_t - \mathbf{u}_{1,t}) = (1-2\eta\rho) \mathbf{w}_t + 2\eta\rho\mathbf{u}_{1,t}. \end{align*}\] Then similar convergence analysis can be developed with a complexity of \(O(1/\epsilon^4)\) for finding a nearly \(\epsilon\)-stationary solution to \(F\).