Section 4.5.1 Non-convex Min-Max Optimization
We consider a non-convex min-max optimization problem:
\[\begin{align} \label{eqn:mm} \min_{\mathbf{w}\in\mathbb{R}^d}\max_{\mathbf{u}\in\mathcal U} f(\mathbf{w}, \mathbf{u}) := \mathbb{E}_{\xi}[f(\mathbf{w}, \mathbf{u}; \xi)], \end{align}\]
where \(f(\mathbf{w},\mathbf{u})\) is continuous and differentiable and \(\mathcal U\) is a closed convex set. Let \(F(\mathbf{w}) = \max_{\mathbf{u}\in\mathcal U} f(\mathbf{w}, \mathbf{u})\). Denote by \(\nabla_1f(\cdot,\cdot)\) and \(\nabla_2 f(\cdot, \cdot)\) the partial gradients of the first and second variable, respectively.
Assumption 4.8
Regarding the problem (\(\ref{eqn:mm}\)), the following conditions hold:
\(f(\mathbf{w}, \mathbf{u})\) is \(\mu\)-strongly concave in terms of \(\mathbf{u}\), and \(\mathbf{u}^*(\mathbf{w})=\arg\max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u})\) exists for any \(\mathbf{w}\).
\(\nabla_1 f(\mathbf{w}, \mathbf{u})\) is \(L_1\)-Lipschitz continuous such that
\[\begin{align} \|\nabla_1 f(\mathbf{w}, \mathbf{u}) - \nabla_1 f(\mathbf{w}', \mathbf{u}')\|_2 \leq L_1(\|\mathbf{w} - \mathbf{w}'\|_2 + \|\mathbf{u} - \mathbf{u}'\|_2). \end{align}\]
- \(\nabla_2 f(\mathbf{w}, \mathbf{u})\) is \(L_{21}\)-Lipschitz continuous with respect to the first variable and is \(L_{22}\)-Lipschitz continuous with respect to the second variable
\[\begin{align} &\|\nabla_2 f(\mathbf{w}, \mathbf{u}) - \nabla_2 f(\mathbf{w}', \mathbf{u}')\|_2\leq L_{21} \|\mathbf{w} - \mathbf{w}'\|_2 +L_{22} \|\mathbf{u} - \mathbf{u}'\|_2. \end{align}\]
- there exist \(\sigma_1,\sigma_2\) such that
\[\begin{align} &\mathbb{E}[\|\nabla_1f(\mathbf{w}, \mathbf{u}; \xi)- \nabla_1 f(\mathbf{w}, \mathbf{u})\|_2^2]\leq \sigma_1^2,\\ &\mathbb{E}[\|\nabla_2f(\mathbf{w}, \mathbf{u}; \xi)- \nabla_2 f(\mathbf{w}, \mathbf{u})\|_2^2]\leq \sigma_2^2. \end{align}\]
- \(F_*=\min\limits_{\mathbf{w}} F(\mathbf{w})\geq -\infty\).
4.5.1.1 A Double-loop Large mini-batch method
Let us first consider a straightforward approach that updates \(\mathbf{w}_t\) using a large-batch gradient estimator
\[ \mathbf{v}_t = \frac{1}{B} \sum_{i=1}^B \nabla_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_{i,t}), \]
and computes \(\mathbf{u}_t\) via an inner-loop SGD with \(K\) updates. It suffices to have \(K = O(L_1^2\sigma_2^2 / (\mu^2 \epsilon^2))\) (by Corollary 3.9) such that
\[\begin{align*} &\mathbb{E}[\|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2] \leq \frac{\epsilon^2}{L_1^2}. \end{align*}\]
If \(B = O(\sigma_1^2/\epsilon^2)\), following Lemma 4.18 below we have
\[\begin{align*} &\mathbb{E}[\|{\mathbf{v}}_t - \nabla F(\mathbf{w}_t)\|^2_2]\leq \mathbb{E}\bigg[\bigg\|\frac{1}{B} \sum_{i=1}^B \nabla_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_{i,t}) - \nabla_1 f(\mathbf{w}_t, \mathbf{u}^*(\mathbf{w}_t))\bigg\|_2^2\bigg] \\ &\leq O\left(\frac{\sigma_1^2}{B} + L_1^2\|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2\right)\leq \epsilon^2. \end{align*}\]
Combining this with Lemma 4.9, we can set the step size \(\eta_t = O(1/L_F)\) and the number of iterations \(T = O(L_F/\epsilon^2)\), yielding an overall sample complexity of
\[BT + KT = O\left(\frac{L_F\sigma_1^2}{\epsilon^4} + \frac{L_FL_1^2\sigma_2^2}{\mu^2\epsilon^4}\right).\]
4.5.1.2 A Stochastic Momentum Method
We present a solution method in Algorithm 12, referred to as SMDA (Stochastic Momentum Descent-Ascent). The method begins by updating the dual variable using stochastic gradient ascent (Step 4), then computes the moving average gradient estimator \(\mathbf{v}_{t}\) for the primal variable (Step 6), and finally updates the primal variable using this estimator (Step 7). When \(\beta_t = 1\), the method reduces to Algorithm 7. However, setting \(\beta_t < 1\) is crucial for achieving improved complexity. Conceptually, the method shares similarities with SCMA.
Algorithm 12: SMDA
- Input: learning rate schedules \(\{\eta_t\}_{t=1}^{T}\), \(\{\gamma_t\}_{t=1}^{T}, \{\beta_t\}_{t=1}^T\); starting points \(\mathbf{w}_0\), \(\mathbf{u}_1, \mathbf{v}_0\)
- \(\mathbf{w}_1 = \mathbf{w}_0 - \eta_0\mathbf{v}_0\)
- For \(t=1,\dotsc,T\)
- Sample \(\zeta_t\)
- Update \(\mathbf{u}_{t+1} = \Pi_{\mathcal U}[\mathbf{u}_{t} + \gamma_t \nabla_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)]\)
- Compute the vanilla gradient estimator \(\mathbf{z}_t = \nabla_1 f(\mathbf{w}_t, \mathbf{u}_{t}; \zeta_t)\)
- Update the MA gradient estimator \(\mathbf{v}_{t} = (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\)
- Update the model by \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{v}_{t}\)
Convergence Analysis
We will prove the convergence of the gradient norm of \(F(\mathbf{w})\). We first prove the following lemmas.
Lemma 4.17
Let \(\mathbf{u}^*(\mathbf{w})=\arg\max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u})\). Under Assumption 4.8(i), (iii), \(\mathbf{u}^*(\cdot)\) is \(\kappa\)-Lipschitz continuous with \(\kappa=\frac{L_{21}}{\mu}\).
Proof
Let us consider \(\mathbf{w}_1, \mathbf{w}_2\). By the optimality condition of \(\mathbf{u}^*(\mathbf{w}_1)\) and \(\mathbf{u}^*(\mathbf{w}_2)\) for a concave function, we have \[\begin{align*} \nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_1))^{\top}(\mathbf{u} - \mathbf{u}^*(\mathbf{w}_1))\leq 0,\quad \forall \mathbf{u}\in\mathcal U\\ \nabla_2 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2))^{\top}(\mathbf{u} - \mathbf{u}^*(\mathbf{w}_2))\leq 0,\quad\forall \mathbf{u}\in\mathcal U. \end{align*}\] Let \(\mathbf{u}=\mathbf{u}^*(\mathbf{w}_2)\) in the first inequality and \(\mathbf{u}=\mathbf{u}^*(\mathbf{w}_1)\) in the second equality and add them together we have \[\begin{align*} (\nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_1))-\nabla_2 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2)))^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\leq 0. \end{align*}\] Since \(-f(\mathbf{w}_1,\cdot)\) is \(\mu\)-strongly convex, due to Lemma 1.6, we have \[\begin{align*} (\nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_1))&-\nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_2)))^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\\ &\geq \mu\|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2^2. \end{align*}\] Combining these two inequalities we have \[\begin{align*} \mu\|\mathbf{u}^*(\mathbf{w}_2) &- \mathbf{u}^*(\mathbf{w}_1)\|_2^2\leq (\nabla_2 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2))-\nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_2)))^{\top}(\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1))\\ &\leq \|\nabla_2 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2))-\nabla_2 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_2))\|_2\|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2\\ &\leq L_{21}\|\mathbf{w}_2 - \mathbf{w}_1\|_2\|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2. \end{align*}\] Thus, \[\begin{align*} \|\mathbf{u}^*(\mathbf{w}_2) - \mathbf{u}^*(\mathbf{w}_1)\|_2\leq \frac{L_{21}}{\mu}\|\mathbf{w}_2 - \mathbf{w}_1\|_2. \end{align*}\]■
Lemma 4.18
Under Assumption 4.8(i) and (ii), \(\nabla F(\mathbf{w})=\nabla_1 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w}))\), and it is \(L_F\)-Lipschitz continuous with \(L_F=L_1(1+\kappa)\).
Proof
If \(\mathcal U\) is bounded, the Danskin’s theorem implies that \(\nabla F(\mathbf{w}) = \nabla_1 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w}))\). If \(\mathcal U\) is unbounded, we have \[\begin{align} \nabla F(\mathbf{w}) = \nabla_1 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w})) + \frac{\partial \mathbf{u}^*(\mathbf{w})}{\partial \mathbf{w}}^{\top}\nabla_2 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w})) = \nabla_1 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w})), \end{align}\] where the last equality follows from \(\nabla_2 f(\mathbf{w}, \mathbf{u}^*(\mathbf{w}))=0\). To establish the Lipschitz continuity of \(\nabla F(\mathbf{w})\), let us consider \(\mathbf{w}_1\) and \(\mathbf{w}_2\). We have \[\begin{align*} &\|\nabla F(\mathbf{w}_1)-\nabla F(\mathbf{w}_2)\|_2 =\|\nabla_1 f(\mathbf{w}_1, \mathbf{u}^*(\mathbf{w}_1)) -\nabla_1 f(\mathbf{w}_2, \mathbf{u}^*(\mathbf{w}_2))\|_2 \\ &\leq L_1(\|\mathbf{w}_1 - \mathbf{w}_2\|_2 + \|\mathbf{u}^*(\mathbf{w}_1) - \mathbf{u}^*(\mathbf{w}_2)\|_2)\leq L_1(1+\kappa)\|\mathbf{w}_1- \mathbf{w}_2\|_2. \end{align*}\]■
Next, we prove two lemmas similar to Lemma 4.8 and Lemma 4.1, regarding the recursion of gradient estimation error and the estimation error of \(\mathbf{u}\), respectively. The descent lemma (Lemma 4.9) still holds.
Lemma 4.19
It holds that \[\begin{align*} &\mathbb{E}_{\xi_t}\left[\|\mathbf{v}_{t} - \nabla F(\mathbf{w}_t)\|^2_2\right] \leq (1-\beta_t)\|\mathbf{v}_{t-1} - \nabla F(\mathbf{w}_{t-1})\|_2^2 + \frac{2L_F^2}{\beta_t}\|\mathbf{w}_{t-1}-\mathbf{w}_t\|_2^2 \\ & + 4L_1^2\beta_t\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + \beta_t^2\sigma_1^2. \end{align*}\]
Proof
Let \(\mathbf{z}_t = \nabla_1 f(\mathbf{w}_t, \mathbf{u}_{t};\xi_t)\) and \(\mathcal{M}_t = \mathbb{E}_t[\mathbf{z}_t]= \nabla_1 f(\mathbf{w}_t, \mathbf{u}_{t})\). Then \(\mathbf{v}_{t} = (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\). Noting that \(\mathbb{E}_t[\|\mathbf{z}_t - \mathcal{M}_t\|_2^2]\leq \sigma_1^2\) and \(\|\mathcal{M}_t - \nabla F(\mathbf{w}_t)\|_2^2 \leq L_1^2 \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w_t})\|_2^2\). Plugging these into Lemma 4.7 finishes the proof.■
Lemma 4.20
Suppose Assumption 4.8(i), (iii), (iv) hold. Consider the update \(\mathbf{u}_{t} = \Pi_{\mathcal U} [\mathbf{u}_t + \gamma_t \nabla_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)]\). If \(\gamma_t<1/L_{22}<1/\mu\), we have \[\begin{equation*} \begin{split} \mathbb{E}_t[\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_{t+1})\|_2^2] & \leq (1-\frac{\gamma_t \mu}{2}) \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + \frac{3\kappa^2}{\gamma_t \mu} \mathbb{E}_t[\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2]\\ & + 2\gamma_t^2 \sigma_2^2. \end{split} \end{equation*}\]
Proof
By See Lemma 3.8, if \(\gamma<1/L_{22}\) we have \[\begin{equation} \begin{split} \mathbb{E}_t[\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_t)\|_2^2]\leq (1-\gamma_t\mu)\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + \gamma_t^2\sigma_2^2. \end{split} \end{equation}\] Then, \[\begin{equation*} \begin{split} &\mathbb{E}_t[\|\mathbf{u}_{t+1} - \mathbf{u}^*(\mathbf{w}_{t+1})\|_2^2] \leq (1+\frac{\gamma_t \mu}{2})\mathbb{E}_t[\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_t)\|_2^2] \\ &+ (1+\frac{2}{\gamma_t\mu})\mathbb{E}_t[\|\mathbf{u}^*(\mathbf{w}_t) - \mathbf{u}^*(\mathbf{w}_{t+1})\|_2^2] \\ & \leq (1+\frac{\gamma_t \mu}{2}) (1-\gamma_t \mu) \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + (1+\frac{\gamma_t \mu}{2}) \gamma_t^2 \sigma_2^2 \\ & + \frac{2+\gamma_t\mu}{\gamma_t\mu} \kappa^2 \mathbb{E}_t[\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2] \\ & \leq (1-\frac{\gamma_t \mu}{2}) \|\mathbf{u}_t - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + 2\gamma_t^2 \sigma_2^2 + \frac{3\kappa^2}{\gamma_t \mu} \mathbb{E}_t[\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2], \end{split} \end{equation*}\] where the first inequality uses the Young’s inequality, and the last inequality uses \(\gamma\mu<1\).■
Finally, we can prove the following theorem regarding the convergence of SMDA.
Theorem 4.5
Suppose Assumption 4.8 holds. By setting \(\beta_t=\beta= \epsilon^2/(3\sigma_1^2)\), \(\gamma_t=\gamma=\mu\epsilon^2/(96L_1^2\sigma_2^2)\) and \(\eta_t=\eta= \min(\frac{\beta}{\sqrt{8}L_F},\frac{\gamma\mu}{16\sqrt{3}L_1\kappa},\frac{1}{2L_F})\) in SMDA, then the following holds
\[\begin{align} \label{eqn:smda-c} \mathbb E\left[\frac{1}{T}\sum_{t=0}^{T-1}\left\{\frac{1}{4}\|\mathbf{v}_{t}\|_2^2 + \|\nabla F(\mathbf{w}_t)\|_2^2\right\}\right]\leq \epsilon^2, \end{align}\]
with an iteration complexity of
\[\begin{align} \label{eqn:smda-t} T&=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon\sigma_1^2L_F}{\epsilon^4}, \frac{C_\Upsilon L_1^3\kappa\sigma_2^2}{\epsilon^4\mu^2}\right\}\right), \end{align}\]
where \(C_\Upsilon= 2(F(\mathbf{w}_0) - F_*) + \frac{1}{\sqrt{8}L_F}\|\mathbf{v}_0 - \nabla F(\mathbf{w}_0)\|_2^2 + \frac{L_1}{\sqrt{3}\kappa}\|\mathbf{u}_0 - \mathbf{u}^*(\mathbf{w}_0)\|_2^2\).
💡 Why it matters
The MA gradient estimator in SMDA is critical to obtaining a complexity
of \(O(1/\epsilon^4)\). If we simply
update the primal variable by SGD, the algorithm becomes SGDA. The
convergence analysis of SGDA for non-convex minimax problems will suffer
from a large batch size issue or slow convergence. In particular, SGDA
with a batch size of \(O(1/\epsilon^2)\) can find an \(\epsilon\)-stationary solution in \(O(1/\epsilon^2)\) iterations when the
problem is smooth in terms of primal and dual variables and
strongly-concave in terms of dual variable, yielding a sample complexity
of \(O(1/\epsilon^4)\). If using a
constant batch size \(O(1)\), SGDA may
need \(O(1/\epsilon^8)\) iterations for
finding an \(\epsilon\)-stationary
solution (Lin et al. 2020).
Proof
The proof is similar to Theorem 4.3. Let us see the three inequalities in Lemma 4.9, Lemma 4.19, and Lemma 4.20 that we have proved so far: \[\begin{align*} (*)\;&F(\mathbf{w}_{t+1}) \leq F(\mathbf{w}_t) + \frac{\eta}{2} \|\nabla F(\mathbf{w}_t) - \mathbf{v}_{t}\|_2^2- \frac{\eta}{2}\|\nabla F(\mathbf{w}_t)\|_2^2-\frac{\eta}{4}\|\mathbf{v}_{t}\|_2^2,\\ (\sharp)\;&\mathbb{E}\left[\|\mathbf{v}_{t} - \nabla F(\mathbf{w}_t)\|_2^2\right] \leq \mathbb{E}\left[(1-\beta)\|\mathbf{v}_{t-1} - \nabla F(\mathbf{w}_{t-1})\|_2^2 + \frac{2L_F^2\eta^2}{\beta}\|\mathbf{v}_{t-1}\|_2^2\right] \\ & + \mathbb{E}\left[4L_1^2\beta\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_t)\|_2^2 + \beta^2\sigma_1^2\right],\\ (\diamond)\;&\mathbb{E}\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_{t})\|_2^2 \leq \mathbb{E}\left[(1-\frac{\gamma \mu}{2})\|\mathbf{u}_{t-1} - \mathbf{u}^*(\mathbf{w}_{t-1})\|_2^2 + 2\gamma^2 \sigma_2^2 + \frac{3\kappa^2\eta^2}{\gamma \mu} \|\mathbf{v}_{t-1}\|_2^2\right]. \end{align*}\] Let \(\bar\gamma=\gamma\mu/2\), the last inequality becomes: \[\begin{align*} (\diamond)\;&\mathbb{E}\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_{t})\|_2^2 \leq \mathbb{E}\left[(1-\bar\gamma) \|\mathbf{u}_{t-1} - \mathbf{u}^*(\mathbf{w}_{t-1})\|_2^2 + 8\bar\gamma^2 \frac{\sigma_2^2}{\mu^2} + \frac{3\kappa^2\eta^2}{2\bar\gamma} \|\mathbf{v}_{t-1}\|_2^2\right]. \end{align*}\] Let us define \(A_t= 2(F(\mathbf{w}_t) - F_*)\) and \(B_t = \|\nabla F(\mathbf{w}_t)\|_2^2\), \(\Gamma_t = \|\mathbf{v}_{t}\|_2^2/2\), \(\Delta_t= \|\nabla F(\mathbf{w}_t) - \mathbf{v}_{t}\|_2^2\), \(\delta_{t}=\|\mathbf{u}_{t} - \mathbf{u}^*(\mathbf{w}_{t})\|_2^2\). Then the three inequalities \((*),(\sharp),(\diamond)\) satisfy that in Lemma 4.10 with \(C_1=4L_1^2, C_2=4L_F^2, C_3=3\kappa^2, \sigma^2=\sigma_1^2, {\sigma'}^2=8\sigma_2^2/\mu^2\). If \(\eta, \beta, \bar\gamma\) satisfy \[\begin{align*} &\beta =\frac{\epsilon^2}{3\sigma^2} =\frac{\epsilon^2}{3\sigma_1^2},\quad \bar\gamma =\frac{\epsilon^2}{6C_1{\sigma'}^2}=\frac{\epsilon^2\mu^2}{192L_1^2\sigma_2^2}, \\ & \eta=\min(\frac{1}{2L_F},\frac{\beta}{\sqrt{4C_2}}, \frac{\bar\gamma}{\sqrt{8C_1C_3}}) = \min(\frac{1}{2L_F},\frac{\beta}{4L_F}, \frac{\bar\gamma}{\sqrt{96}L_1\kappa}), \end{align*}\] then (\(\ref{eqn:smda-c}\)) holds, and the iteration complexity becomes \[\begin{align*} T&=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon \sigma^2\sqrt{C_2}}{\epsilon^4}, \frac{C_\Upsilon\sqrt{C_1C_3}C_1{\sigma'}^2}{\epsilon^4}\right\}\right)\\ &=O\left(\max\left\{\frac{C_\Upsilon L_F}{\epsilon^2}, \frac{C_\Upsilon\sigma_1^2L_F}{\epsilon^4}, \frac{C_\Upsilon L_1^3\kappa\sigma_2^2}{\epsilon^4\mu^2}\right\}\right). \end{align*}\]■
It is worth mentioning that an improved complexity of \(O(1/\epsilon^3)\) can be achieved by employing the STORM gradient estimator for both the primal and dual variables under the mean-square smooth condition of the objective.