Section 3.6: Stochastic Gradient Descent Ascent
In this section, we consider stochastic convex–concave min–max
optimization problems: \[\begin{align*}
\min_{\mathbf{w} \in \mathcal{W}} \max_{\mathbf{u} \in \mathcal{U}}
f(\mathbf{w}, \mathbf{u})
:= \mathbb{E}_{\zeta}\big[f(\mathbf{w}, \mathbf{u}; \zeta)\big].
\end{align*}\] This class of problems has two important
applications in machine learning:
(1) it serves as a foundation for directly formulating learning tasks (e.g., the DRO problem); and (2) it provides a tool for reformulating standard minimization problems to enable more efficient optimization (e.g., the min-max reformulation of AUC maximization discussed in Section 6.4).
A solution of interest is the so-called saddle point \((\mathbf{w}_*, \mathbf{u}_*)\in\mathcal W\times \mathcal U\) satisfying \[\begin{align*} f(\mathbf{w}_*, \mathbf{u})\leq f(\mathbf{w}_*, \mathbf{u}_*)\leq f(\mathbf{w}, \mathbf{u}_*), \quad\forall \mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U. \end{align*}\]

In many machine learning applications, we may only be interested in finding a global optimal solution to the objective \(F(\mathbf{w}) =\max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u})\). If \((\mathbf{w}_*, \mathbf{u}_*)\) is a saddle point, then \(\mathbf{w}_*\) is a global optimal solution to \(F(\mathbf{w})\). This can be seen from \[\begin{align*} \max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}_*, \mathbf{u})\leq f(\mathbf{w}_*, \mathbf{u}_*)\leq f(\mathbf{w}, \mathbf{u}_*)\leq \max_{\mathbf{u}\in\mathcal U}f(\mathbf{w}, \mathbf{u}), \quad\forall \mathbf{w}\in\mathcal W. \end{align*}\]

For a point \((\mathbf{w}, \mathbf{u})\in\mathcal W\times\mathcal U\), a convergence measure is defined by the duality gap: \[\begin{align*} \Delta(\mathbf{w}, \mathbf{u}) = \max_{\mathbf{u}'\in\mathcal U}f(\mathbf{w}, \mathbf{u}') - \min_{\mathbf{w}'\in\mathcal W}f(\mathbf{w}', \mathbf{u}). \end{align*}\]
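As a concrete illustration (a toy example we add here), consider the bilinear problem \(f(w, u) = wu\) with \(\mathcal W = \mathcal U = [-1, 1]\). Its unique saddle point is \((w_*, u_*) = (0, 0)\), and the duality gap evaluates in closed form to \[\begin{align*} \Delta(w, u) = \max_{u'\in[-1,1]} wu' - \min_{w'\in[-1,1]} w'u = |w| + |u|, \end{align*}\] which is nonnegative everywhere and vanishes exactly at the saddle point.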
A simple method for solving the convex-concave min-max problem is the stochastic gradient descent ascent (SGDA) algorithm, which is an extension of SGD. It employs two key updates:
\[\begin{equation}\label{eqn:sgda} \begin{aligned} \mathbf{w}_{t+1} & = \arg\min_{\mathbf{w}\in\mathcal W} \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)^{\top}(\mathbf{w} - \mathbf{w}_t) + \frac{1}{2\eta_{1}}\|\mathbf{w} - \mathbf{w}_t\|_2^2 \\ \mathbf{u}_{t+1} & = \arg\min_{\mathbf{u}\in\mathcal U} -\partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)^{\top}(\mathbf{u} - \mathbf{u}_t) + \frac{1}{2\eta_{2}}\|\mathbf{u} - \mathbf{u}_t\|_2^2, \end{aligned} \end{equation}\] where \(\partial_1 f(\mathbf{w}, \mathbf{u}; \zeta)\) and \(\partial_2 f(\mathbf{w}, \mathbf{u}; \zeta)\) denote the stochastic partial subgradients such that \(\mathbb{E}_\zeta[\partial_1 f(\mathbf{w}, \mathbf{u}; \zeta)]\in \partial_1 f(\mathbf{w}, \mathbf{u})\) and \(\mathbb{E}_{\zeta}[\partial_2 f(\mathbf{w}, \mathbf{u}; \zeta)]\in\partial_2 f(\mathbf{w}, \mathbf{u})\).
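Completing the square in each subproblem shows that the two updates in (\(\ref{eqn:sgda}\)) are simply projected stochastic gradient steps: \[\begin{align*} \mathbf{w}_{t+1} = \Pi_{\mathcal W}\big(\mathbf{w}_t - \eta_1 \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\big), \qquad \mathbf{u}_{t+1} = \Pi_{\mathcal U}\big(\mathbf{u}_t + \eta_2 \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\big), \end{align*}\] where \(\Pi_{\mathcal W}\) and \(\Pi_{\mathcal U}\) denote the Euclidean projections onto \(\mathcal W\) and \(\mathcal U\): a descent step in the primal variable and an ascent step in the dual variable.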
Algorithm 7: SGDA
- Input: learning rates \(\{\eta_{1}, \eta_{2}\}\), starting points \(\mathbf{w}_1, \mathbf{u}_1\)
- For \(t=1,\dotsc,T\)
- Compute unbiased gradient estimators \(\mathbf{z}_{1,t} = \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\) and \(\mathbf{z}_{2,t} = \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\)
- Update the primal variable \(\mathbf{w}\) by \(\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}\in\mathcal W} \mathbf{z}_{1,t}^{\top}(\mathbf{w} - \mathbf{w}_t) + \frac{1}{2\eta_{1}}\|\mathbf{w} - \mathbf{w}_t\|_2^2.\)
- Update the dual variable \(\mathbf{u}\) by \(\mathbf{u}_{t+1} = \arg\min_{\mathbf{u}\in\mathcal U} -\mathbf{z}_{2,t}^{\top}(\mathbf{u} - \mathbf{u}_t) + \frac{1}{2\eta_{2}}\|\mathbf{u} - \mathbf{u}_t\|_2^2.\)
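To make the two updates concrete, below is a minimal Python sketch of Algorithm 7 for the special case where \(\mathcal W\) and \(\mathcal U\) are Euclidean balls; the helper names (`project_ball`, `grad_w`, `grad_u`) and the Gaussian noise model are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def sgda(grad_w, grad_u, w, u, eta1, eta2, T, rng):
    """SGDA (Algorithm 7): grad_w(w, u, zeta) and grad_u(w, u, zeta)
    return stochastic partial (sub)gradients at (w, u)."""
    w_bar = np.zeros_like(w)
    u_bar = np.zeros_like(u)
    for _ in range(T):
        # Accumulate the averaged iterates analyzed in Theorem 3.13.
        w_bar += w / T
        u_bar += u / T
        zeta = rng.standard_normal()  # placeholder stochastic sample
        z1 = grad_w(w, u, zeta)
        z2 = grad_u(w, u, zeta)
        # Descent in the primal variable, ascent in the dual variable;
        # each argmin in the update reduces to a projected gradient step.
        w = project_ball(w - eta1 * z1)
        u = project_ball(u + eta2 * z2)
    return w_bar, u_bar
```

For instance, for the bilinear game \(f(\mathbf{w},\mathbf{u};\zeta) = (1+\zeta)\,\mathbf{w}^{\top}\mathbf{u}\) one would pass `grad_w = lambda w, u, z: (1 + z) * u` and `grad_u = lambda w, u, z: (1 + z) * w`.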
Convergence Analysis
Below, we analyze the convergence rate of SGDA under the following assumptions.
Assumption 3.10
Suppose the following conditions hold:
- \(f(\mathbf{w}, \mathbf{u})\) is convex w.r.t. \(\mathbf{w}\) and concave w.r.t. \(\mathbf{u}\).
- There exist \(G_1, G_2>0\) such that \[\begin{align}\label{eqn:sgda-g1} &\mathbb{E}_{\zeta}[\|\partial_1 f(\mathbf{w}, \mathbf{u}; \zeta)\|_2^2]\leq G_1^2, \forall \mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U,\\ &\mathbb{E}_{\zeta}[\|\partial_2 f(\mathbf{w}, \mathbf{u}; \zeta)\|_2^2]\leq G_2^2, \forall \mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U. \end{align}\]
- \(\max_{\mathbf{w}\in\mathcal W,\mathbf{w}'\in\mathcal W}\|\mathbf{w}-\mathbf{w}'\|_2\leq D_1\) and \(\max_{\mathbf{u}\in\mathcal U,\mathbf{u}'\in\mathcal U}\|\mathbf{u}-\mathbf{u}'\|_2\leq D_2\).
Lemma 3.14
Let \(\{\delta_t\}_{t\geq 1}\) be a martingale difference sequence, and let the sequence \(\{y_t\}_{t\geq 1}\) be generated by \[\begin{align*} y_{t+1} = \arg\min_{v\in\mathcal V} \{-\delta_t^{\top}v + \alpha D_\psi(v, y_t)\}. \end{align*}\] Suppose \(\psi\) is \(\mu_\psi\)-strongly convex w.r.t. \(\|\cdot\|\) for some \(\mu_\psi>0\). Then for any \(v\) (that possibly depends on \(\{\delta_t\}\)) we have \[\begin{align*} \mathbb{E}\left[\delta_t^{\top}v\right] \leq \mathbb{E}\left[\alpha D_\psi(v, y_t) - \alpha D_\psi(v, y_{t+1}) + \frac{1}{2\alpha\mu_\psi}\|\delta_t\|_*^2\right]. \end{align*}\]
💡 Why it matters
In standard minimization problems, the convergence measure is usually
defined with respect to the optimal solution \(\mathbf{w}_*\), which is fixed and
independent of the randomness introduced by the algorithm. In contrast,
in stochastic min–max optimization we are concerned with the duality gap
\(\Delta(\mathbf{w}, \mathbf{u}) =
\max_{\mathbf{u}' \in \mathcal{U}} f(\mathbf{w}, \mathbf{u}') -
\min_{\mathbf{w}' \in \mathcal{W}} f(\mathbf{w}',
\mathbf{u})\) of a random solution \((\mathbf w, \mathbf u)\), where the optimizing
\(\mathbf{w}'\) and \(\mathbf{u}'\) are themselves random. This
dependence introduces additional subtleties into the analysis.
The preceding lemma applies to any random variable \(v\) that may depend on the entire randomness of the algorithm, and will be useful for our analysis. Recall that a sequence \(\{X_t\}\) is a martingale difference sequence if the conditional expectation of each variable given the past is zero, i.e., \(\mathbb{E}[X_t \mid X_1, \ldots, X_{t-1}] = 0.\)
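In the analysis below we only need the Euclidean instance of Lemma 3.14: taking \(\psi(v) = \frac{1}{2}\|v\|_2^2\) gives \(D_\psi(v, y) = \frac{1}{2}\|v - y\|_2^2\), \(\mu_\psi = 1\), and \(\|\cdot\|_* = \|\cdot\|_2\), so that with \(\alpha = 1/\eta\) the conclusion reads \[\begin{align*} \mathbb{E}\left[\delta_t^{\top}v\right] \leq \mathbb{E}\left[\frac{1}{2\eta}\|v - y_t\|_2^2 - \frac{1}{2\eta}\|v - y_{t+1}\|_2^2 + \frac{\eta}{2}\|\delta_t\|_2^2\right]. \end{align*}\]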
Proof
Applying Lemma 3.11 to the update of \(y_{t+1}\), we have \[\begin{align*} \mathbb{E}\left[-\delta_{t}^{\top}({y}_{t+1} - v)\right] \leq \mathbb{E}\left[\alpha D_{\psi}(v, {y}_{t}) - \alpha D_{\psi}(v, {y}_{t+1}) - \alpha D_{\psi}({y}_{t+1}, y_{t})\right]. \end{align*}\] Hence, \[\begin{align*} \mathbb{E}&\left[\delta_{t}^{\top}(v- y_{t})\right] \leq \mathbb{E}\left[\alpha D_{\psi}(v, {y}_{t}) - \alpha D_{\psi}(v, {y}_{t+1}) - \alpha D_{\psi}({y}_{t+1}, y_{t})\right] + \mathbb{E}[\delta_{t}^{\top}(y_{t+1}- y_{t})]\\ &\leq \mathbb{E}\left[\alpha D_{\psi}(v, {y}_{t}) - \alpha D_{\psi}(v, {y}_{t+1})\right] -\mathbb{E}\left[ \frac{\alpha\mu_{\psi}}{2}\|{y}_{t+1}- y_{t}\|^2\right] + \mathbb{E}\left[\frac{\mu_\psi\alpha}{2}\|y_{t+1}- y_{t}\|^2+\frac{1}{2 \mu_\psi \alpha}\|\delta_{t}\|_*^2\right]\\ &= \mathbb{E}\left[\alpha D_{\psi}(v, {y}_{t}) - \alpha D_{\psi}(v, {y}_{t+1})\right] + \frac{1}{2 \mu_\psi \alpha}\mathbb{E}\left[\|\delta_{t}\|_*^2\right], \end{align*}\] where the second inequality uses the strong convexity of \(\psi\), so that \(D_\psi(y_{t+1}, y_t)\geq \frac{\mu_\psi}{2}\|y_{t+1}- y_t\|^2\), together with Young's inequality \(\delta_t^{\top}(y_{t+1}- y_t)\leq \frac{1}{2\mu_\psi\alpha}\|\delta_t\|_*^2 + \frac{\mu_\psi\alpha}{2}\|y_{t+1}- y_t\|^2\). Since \(\mathbb{E}[\delta_{t}]=0\) and \(y_{t}\) is determined by \(\delta_1,\ldots,\delta_{t-1}\), the tower property gives \(\mathbb{E}[\delta_t^{\top}y_t]=0\). As a result, \[\begin{align*} \mathbb{E}[\delta_t^{\top}v] \leq \mathbb{E}\left[\alpha D_{\psi}(v, {y}_{t}) - \alpha D_{\psi}(v, {y}_{t+1})\right] + \frac{1}{2\mu_\psi \alpha} \mathbb{E}\left[ \|\delta_t\|_*^2\right]. \end{align*}\]■
Theorem 3.13
Suppose Assumption 3.10 holds. Let \(\bar{\mathbf w}_T = \frac{1}{T}\sum_{t=1}^T\mathbf{w}_t, \bar{\mathbf u}_T = \frac{1}{T}\sum_{t=1}^T\mathbf{u}_t\). After \(T\) iterations, SGDA (\(\ref{eqn:sgda}\)) guarantees that \[ \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq \frac{D_1^2}{\eta_1 T} + \frac{D_2^2}{\eta_2 T} + \frac{5\eta_1G_1^2}{2} + \frac{5\eta_2G_2^2}{2}. \] If we set \(\eta_1 = O(\frac{D_1}{G_1\sqrt{T}})\) and \(\eta_2 = O(\frac{D_2}{G_2\sqrt{T}})\), we have \[ \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq O\left(\frac{D_1G_1}{\sqrt{T}} + \frac{D_2G_2}{\sqrt{T}}\right). \]
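To see the second bound concretely, substituting the exact choices \(\eta_1 = \frac{D_1}{G_1\sqrt{T}}\) and \(\eta_2 = \frac{D_2}{G_2\sqrt{T}}\) into the first bound gives \[\begin{align*} \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq \frac{D_1G_1}{\sqrt{T}} + \frac{D_2G_2}{\sqrt{T}} + \frac{5D_1G_1}{2\sqrt{T}} + \frac{5D_2G_2}{2\sqrt{T}} = \frac{7(D_1G_1 + D_2G_2)}{2\sqrt{T}}, \end{align*}\] which is exactly the stated \(O\big((D_1G_1 + D_2G_2)/\sqrt{T}\big)\) rate.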
Proof
Similar to Eqn. 10 in the proof of Lemma 3.3, the primal and dual updates give, for any \(\mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U\), \[\begin{align*} \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t)^{\top}&(\mathbf{w}_t - \mathbf{w}) \leq \\ & \frac{1}{2\eta_1}\|\mathbf{w}_t - \mathbf{w}\|_2^2 -\frac{1}{2\eta_1}\|\mathbf{w}_{t+1} - \mathbf{w}\|_2^2+ \frac{1}{2}\eta_{1}\| \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2\\ -\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t)^{\top}&(\mathbf{u}_t - \mathbf{u}) \leq \\ & \frac{1}{2\eta_2}\|\mathbf{u}_t - \mathbf{u}\|_2^2 -\frac{1}{2\eta_2}\|\mathbf{u}_{t+1} - \mathbf{u}\|_2^2+ \frac{1}{2}\eta_{2}\| \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2. \end{align*}\] The difference from the SGD analysis is that we cannot simply fix \(\mathbf{w} = \mathbf{w}_*\) and \(\mathbf{u} = \mathbf{u}_*\), as this would not yield the duality gap measure. Indeed, at the end we need to take a max over \(\mathbf{u}\in\mathcal U\) and a min over \(\mathbf{w}\in\mathcal W\) to obtain the duality gap, which makes the comparators \(\mathbf{w}, \mathbf{u}\) dependent on the randomness of the algorithm.
To proceed, we have \[\begin{align*} &\partial_1 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{w}_t - \mathbf{w}) \leq \frac{1}{2\eta_1}\|\mathbf{w}_t - \mathbf{w}\|_2^2 -\frac{1}{2\eta_1}\|\mathbf{w}_{t+1} - \mathbf{w}\|_2^2 \\ &+ \frac{1}{2}\eta_{1}\| \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2 + (\partial_1 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}(\mathbf{w}_t - \mathbf{w})\\ &-\partial_2 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{u}_t - \mathbf{u}) \leq \frac{1}{2\eta_2}\|\mathbf{u}_t - \mathbf{u}\|_2^2 -\frac{1}{2\eta_2}\|\mathbf{u}_{t+1} - \mathbf{u}\|_2^2\\ &+\frac{1}{2}\eta_{2}\| \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2 + (\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}(\mathbf{u}_t - \mathbf{u}). \end{align*}\] Adding these inequalities, we have \[\begin{align*} &\partial_1 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{w}_t - \mathbf{w}) -\partial_2 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{u}_t - \mathbf{u}) \\ &\leq \frac{1}{2\eta_1}\left(\|\mathbf{w}_t - \mathbf{w}\|_2^2-\|\mathbf{w}_{t+1} - \mathbf{w}\|_2^2\right)+ \frac{1}{2\eta_2}\left(\|\mathbf{u}_t - \mathbf{u}\|_2^2 -\|\mathbf{u}_{t+1} - \mathbf{u}\|_2^2\right)\\ &+ \frac{1}{2}\eta_{1}\| \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2 +\frac{1}{2}\eta_{2}\| \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2\\ &+ (\partial_1 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}(\mathbf{w}_t - \mathbf{w}) \\ & + (\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}(\mathbf{u}_t - \mathbf{u}). \end{align*}\] Due to the convexity and concavity of \(f(\mathbf{w},\mathbf{u})\) in terms of \(\mathbf{w}, \mathbf{u}\), respectively, we have \[\begin{align*} f(\mathbf{w}_t, \mathbf{u}_t) -f(\mathbf{w}, \mathbf{u}_t)\leq \partial_1 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{w}_t - \mathbf{w}),\\ f(\mathbf{w}_t, \mathbf{u}) -f(\mathbf{w}_t, \mathbf{u}_t)\leq -\partial_2 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{u}_t - \mathbf{u}). \end{align*}\] Adding these two inequalities, we have \[\begin{align*} f(\mathbf{w}_t, \mathbf{u}) -f(\mathbf{w}, \mathbf{u}_t)\leq \partial_1 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{w}_t - \mathbf{w}) -\partial_2 f(\mathbf{w}_t,\mathbf{u}_t)^{\top}(\mathbf{u}_t - \mathbf{u}). \end{align*}\] As a result, we have \[\begin{align*} &f(\mathbf{w}_t, \mathbf{u}) -f(\mathbf{w}, \mathbf{u}_t)\\ &\leq \frac{1}{2\eta_1}\left(\|\mathbf{w}_t - \mathbf{w}\|_2^2-\|\mathbf{w}_{t+1} - \mathbf{w}\|_2^2\right)+ \frac{1}{2\eta_2}\left(\|\mathbf{u}_t - \mathbf{u}\|_2^2 -\|\mathbf{u}_{t+1} - \mathbf{u}\|_2^2\right)\\ &+ \frac{1}{2}\eta_{1}\| \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2 +\frac{1}{2}\eta_{2}\| \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2\\ &+ (\partial_1 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}(\mathbf{w}_t - \mathbf{w}) \\ & + (\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}(\mathbf{u}_t - \mathbf{u}). 
\end{align*}\] Taking the average over \(t=1,\ldots, T\) and using the convexity–concavity of \(f\) (Jensen's inequality) for the first step, we have \[\begin{align*} &f(\bar{\mathbf w}_T, \mathbf{u}) -f(\mathbf{w}, \bar{\mathbf u}_T)\leq \frac{1}{T}\sum_{t=1}^T\left(f(\mathbf{w}_t, \mathbf{u}) -f(\mathbf{w}, \mathbf{u}_t)\right)\\ &\leq \frac{1}{2\eta_1 T}\|\mathbf{w}_1 - \mathbf{w}\|_2^2+ \frac{1}{2\eta_2 T}\|\mathbf{u}_1 - \mathbf{u}\|_2^2\\ &+ \frac{\eta_1}{2T}\sum_{t=1}^T\| \partial_1 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2 +\frac{\eta_{2}}{2T}\sum_{t=1}^T\| \partial_2 f(\mathbf{w}_t, \mathbf{u}_t; \zeta_t)\|_2^2\\ &+ \frac{1}{T}\sum_{t=1}^T(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}(\mathbf{w}_t - \mathbf{w}) \\ & + \frac{1}{T}\sum_{t=1}^T(\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}(\mathbf{u}_t - \mathbf{u}). \end{align*}\] Let \(\mathbf{w}, \mathbf{u}\) be the solution to \(\max_{\mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U}\{f(\bar{\mathbf w}_T, \mathbf{u}) -f(\mathbf{w}, \bar{\mathbf u}_T)\}\), which are random variables depending on the whole trajectory. Taking expectations on both sides and applying the second-moment bounds in Assumption 3.10, we obtain
\[\begin{equation}\label{eqn:sgda-1} \begin{aligned} \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq& \frac{1}{2\eta_1 T}\mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}\|_2^2]+ \frac{1}{2\eta_2 T}\mathbb{E}[\|\mathbf{u}_1 - \mathbf{u}\|_2^2]+ \frac{\eta_1G_1^2}{2}+\frac{\eta_{2}G_2^2}{2}\\ &+ \frac{1}{T}\mathbb{E}\left[\sum_{t=1}^T(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}\mathbf{w} \right]\\ & + \frac{1}{T}\mathbb{E}\left[\sum_{t=1}^T(\partial_2 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top} \mathbf{u}\right], \end{aligned} \end{equation}\] where the terms involving \(\mathbf{w}_t\) and \(\mathbf{u}_t\) vanish because \((\mathbf{w}_t, \mathbf{u}_t)\) is determined by \(\zeta_1,\ldots,\zeta_{t-1}\), so that, e.g., \(\mathbb{E}[(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}\mathbf{w}_t]=0\). Next, we apply Lemma 3.14 to bound the last two terms. To this end, we introduce two virtual sequences with \(\hat{\mathbf w}_1 = \mathbf{w}_1, \hat{\mathbf u}_1=\mathbf{u}_1\): \[\begin{align*} &\hat{\mathbf w}_{t+1} = \arg\min_{\mathbf{w}\in\mathcal W}-(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}\mathbf{w} + \frac{1}{2\eta_{1}}\|\mathbf{w} - \hat{\mathbf w}_t\|_2^2 \\ &\hat{\mathbf u}_{t+1} = \arg\min_{\mathbf{u}\in\mathcal U}(\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}\mathbf{u} + \frac{1}{2\eta_{2}}\|\mathbf{u} - \hat{\mathbf u}_t\|_2^2. \end{align*}\] Applying Lemma 3.14 in its Euclidean form (with \(\alpha = 1/\eta_1\) and \(\alpha = 1/\eta_2\), respectively), we have \[\begin{align*} &\mathbb{E}\left[(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}\mathbf{w} \right]\leq \mathbb{E}\left[\frac{1}{2\eta_1}\left(\|\hat{\mathbf w}_t - \mathbf{w}\|_2^2-\|\hat{\mathbf w}_{t+1} - \mathbf{w}\|_2^2\right)\right]\\ &+ \frac{\eta_1}{2}\mathbb{E}[\|\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t)\|_2^2],\\ &\mathbb{E}\left[(\partial_2 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}\mathbf{u} \right]\leq \mathbb{E}\left[\frac{1}{2\eta_2}\left(\|\hat{\mathbf u}_t - \mathbf{u}\|_2^2-\|\hat{\mathbf u}_{t+1} - \mathbf{u}\|_2^2\right)\right]\\ &+ \frac{\eta_2}{2}\mathbb{E}[\|\partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t)\|_2^2]. \end{align*}\] Summing over \(t\) and telescoping, we obtain
\[\begin{equation}\label{eqn:sgda-2} \begin{aligned} &\mathbb{E}\left[\sum_{t=1}^T(\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t))^{\top}\mathbf{w} \right]\\ &+ \mathbb{E}\left[\sum_{t=1}^T(\partial_2 f(\mathbf{w}_t,\mathbf{u}_t) - \partial_2 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t))^{\top}\mathbf{u} \right]\\ &\leq \frac{1}{2\eta_1}\mathbb{E}[\|\hat{\mathbf w}_1 - \mathbf{w}\|_2^2]+ \frac{1}{2\eta_2}\mathbb{E}[\|\hat{\mathbf u}_1 - \mathbf{u}\|_2^2]+ 2\eta_1G_1^2 T+ 2\eta_2G_2^2 T, \end{aligned} \end{equation}\] where we used \(\mathbb{E}[\|\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t) - \partial_1 f(\mathbf{w}_t,\mathbf{u}_t)\|_2^2]\leq 2\mathbb{E}[\|\partial_1 f(\mathbf{w}_t,\mathbf{u}_t; \zeta_t)\|_2^2] + 2\|\partial_1 f(\mathbf{w}_t,\mathbf{u}_t)\|_2^2\leq 4G_1^2\) (the last step by Jensen's inequality) and the analogous bound \(4G_2^2\) for the dual term. Combining (\(\ref{eqn:sgda-1}\)) and (\(\ref{eqn:sgda-2}\)) with \(\hat{\mathbf w}_1 = \mathbf{w}_1\) and \(\hat{\mathbf u}_1 = \mathbf{u}_1\), we have \[\begin{equation*} \begin{aligned} &\mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq \frac{1}{\eta_1 T}\mathbb{E}[\|\mathbf{w}_1 - \mathbf{w}\|_2^2]+ \frac{1}{\eta_2 T}\mathbb{E}[\|\mathbf{u}_1 - \mathbf{u}\|_2^2]+ \frac{5\eta_1G_1^2}{2}+\frac{5\eta_{2}G_2^2}{2}. \end{aligned} \end{equation*}\] Bounding \(\|\mathbf{w}_1 - \mathbf{w}\|_2\leq D_1\) and \(\|\mathbf{u}_1 - \mathbf{u}\|_2\leq D_2\) via Assumption 3.10 yields the first claim, and choosing \(\eta_1, \eta_2\) as stated yields the second.■
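As a quick numerical sanity check on the \(O(1/\sqrt T)\) rate (our own illustration, not part of the analysis), the following self-contained snippet runs SGDA on the scalar bilinear game \(f(w, u; \zeta) = (1+\zeta)\,wu\) with \(\mathcal W = \mathcal U = [-1, 1]\), for which the duality gap of the averaged pair is exactly \(|\bar w_T| + |\bar u_T|\).

```python
import numpy as np

rng = np.random.default_rng(0)

def sgda_gap(T, eta):
    """Duality gap of averaged SGDA iterates on f(w,u;zeta)=(1+zeta)*w*u."""
    w, u = 1.0, 1.0              # deliberately far from the saddle (0, 0)
    w_bar, u_bar = 0.0, 0.0
    for _ in range(T):
        w_bar += w / T           # average the query iterates w_t, u_t
        u_bar += u / T
        zeta = rng.standard_normal()
        g_w, g_u = (1 + zeta) * u, (1 + zeta) * w   # stochastic partials
        w = float(np.clip(w - eta * g_w, -1.0, 1.0))  # descent + projection
        u = float(np.clip(u + eta * g_u, -1.0, 1.0))  # ascent + projection
    return abs(w_bar) + abs(u_bar)   # Delta(w_bar, u_bar) = |w_bar| + |u_bar|

for T in (100, 10_000, 1_000_000):
    print(f"T={T:>9}:  gap={sgda_gap(T, eta=1/np.sqrt(T)):.4f}  "
          f"1/sqrt(T)={1/np.sqrt(T):.4f}")
```

The gap should be observed to shrink roughly in proportion to \(1/\sqrt{T}\), consistent with Theorem 3.13.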