
Section 3.7: Stochastic Optimistic Mirror Prox

While simple in design, SGDA cannot enjoy faster convergence even when the function is smooth and the stochastic gradients have zero variance. A classical method that addresses this limitation is the extragradient method. Let \[ \mathbf{v} = \left[\begin{array}{c}\mathbf{w}\\ \mathbf{u}\end{array}\right],\quad \mathcal{M}(\mathbf{v}) = \left[\begin{array}{c}\nabla_1f(\mathbf{w}, \mathbf{u})\\ -\nabla_2 f(\mathbf{w}, \mathbf{u})\end{array}\right], \quad \mathcal V=\mathcal W\times \mathcal U. \] The extragradient method takes the following update with an initialization of \(\mathbf{x}_1\in\mathcal V\): \[\begin{equation} \begin{aligned} \mathbf{y}_{t} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{x}_t)^{\top}\mathbf{v} + \frac{1}{2\eta}\|\mathbf{v} - \mathbf{x}_t\|_2^2\\ \mathbf{x}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t})^{\top}\mathbf{v} + \frac{1}{2\eta}\|\mathbf{v} - \mathbf{x}_t\|_2^2. \end{aligned} \end{equation}\] The name “extragradient” comes from the fact that the method uses two gradients, \(\mathcal{M}(\mathbf{x}_t)\) and \(\mathcal{M}(\mathbf{y}_t)\), at each iteration.
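To see why the extra step matters, here is a minimal numerical sketch (an illustrative toy, not part of the text) of the Euclidean update above on the bilinear objective \(f(w,u)=wu\), compared against plain simultaneous gradient descent ascent; the step size and iteration count are arbitrary choices.

```python
import numpy as np

# Toy illustration: extragradient vs. plain GDA on the bilinear saddle
# problem f(w, u) = w * u, whose gradient mapping is M(v) = (u, -w).

def M(v):
    w, u = v
    return np.array([u, -w])  # (grad_w f, -grad_u f)

eta = 0.1
x_eg = np.array([1.0, 1.0])   # extragradient iterate
x_gda = np.array([1.0, 1.0])  # simultaneous GDA iterate, for contrast

for _ in range(2000):
    y = x_eg - eta * M(x_eg)        # exploration step using M(x_t)
    x_eg = x_eg - eta * M(y)        # update step using M(y_t)
    x_gda = x_gda - eta * M(x_gda)  # single-gradient GDA step

print(np.linalg.norm(x_eg))   # ~0: converges to the saddle point (0, 0)
print(np.linalg.norm(x_gda))  # large: GDA spirals away on this problem
```

On this bilinear problem one can check that extragradient contracts the iterate norm by a factor \(\sqrt{(1-\eta^2)^2+\eta^2}<1\) per step, while GDA expands it by \(\sqrt{1+\eta^2}>1\).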

The extragradient method can be generalized using the mirror descent steps with a Bregman divergence \(D_{\varphi}(\cdot, \cdot)\) defined by a strongly-convex function \(\varphi:\mathcal V\rightarrow\mathbb{R}\): \[\begin{equation} \begin{aligned} \mathbf{y}_{t} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{x}_t)^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v} , \mathbf{x}_t)\\ \mathbf{x}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t})^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \mathbf{x}_t). \end{aligned} \end{equation}\] This method is called mirror prox.
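As a concrete instance, with the negative-entropy choice \(\varphi(\mathbf{v})=\sum_i v_i\log v_i\) on the probability simplex, each mirror prox step has a closed-form multiplicative-weights solution. The sketch below is an illustrative assumption (the rock-paper-scissors payoff matrix, step size, and horizon are arbitrary choices), applying mirror prox to the matrix game \(\min_{\mathbf{w}\in\Delta}\max_{\mathbf{u}\in\Delta}\mathbf{w}^{\top}A\mathbf{u}\):

```python
import numpy as np

# Mirror prox with the entropy Bregman divergence on the simplex.
# With phi(v) = sum_i v_i log v_i, each Bregman prox step becomes a
# multiplicative-weights update followed by normalization.

A = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])  # rock-paper-scissors payoffs

def M(w, u):
    return A @ u, -A.T @ w  # (grad_w, -grad_u) of f(w, u) = w^T A u

def mirror_step(x, g, eta):
    v = x * np.exp(-eta * g)  # closed-form entropy-prox step
    return v / v.sum()

eta, T = 0.2, 1000
w = u = np.ones(3) / 3
w_avg, u_avg = np.zeros(3), np.zeros(3)
for _ in range(T):
    gw, gu = M(w, u)
    yw, yu = mirror_step(w, gw, eta), mirror_step(u, gu, eta)  # extrapolate
    gw, gu = M(yw, yu)
    w, u = mirror_step(w, gw, eta), mirror_step(u, gu, eta)    # update
    w_avg += yw / T
    u_avg += yu / T

# Duality gap of the averaged iterates; this game's value is 0.
gap = np.max(w_avg @ A) - np.min(A @ u_avg)
print(gap)  # small: the averaged iterates approach the equilibrium
```

The averaged iterates \((\bar{\mathbf w}_T,\bar{\mathbf u}_T)\) approach the uniform equilibrium, consistent with the deterministic \(O(1/T)\) gap guarantee of mirror prox.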

Both methods can be extended to their stochastic versions. For example, the stochastic mirror prox method (SMP) uses the following update:

\[\begin{equation}\label{eqn:smp} \begin{aligned} \mathbf{y}_{t} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{x}_t;\zeta'_t)^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \mathbf{x}_t)\\ \mathbf{x}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t}; \zeta_t)^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \mathbf{x}_t), \end{aligned} \end{equation}\] where \(\mathbb{E}_{\zeta}[\mathcal{M}(\mathbf{x}; \zeta)]=\mathcal{M}(\mathbf{x})\).


Algorithm 8: Stochastic Optimistic Mirror Prox (SOMP)

  1. Input: learning rate \(\eta\), starting points \(\mathbf{x}_1=\mathbf{y}_0 = (\mathbf{w}_1, \mathbf{u}_1)\)
  2. Compute \(\mathbf{y}_1 = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{0}; \zeta_{0})^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v} , \mathbf{x}_{1})\).
  3. For \(t=1,\dotsc,T\)
  4. Compute unbiased gradient mapping \(\mathcal{M}(\mathbf{y}_{t}; \zeta_{t})\)
  5. Update \(\mathbf{x}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t}; \zeta_t)^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \mathbf{x}_t)\).
  6. Update \(\mathbf{y}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t}; \zeta_{t})^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v} , \mathbf{x}_{t+1})\).
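A minimal sketch of Algorithm 8 in the Euclidean case \(D_{\varphi}(\mathbf{v},\mathbf{x})=\frac12\|\mathbf{v}-\mathbf{x}\|_2^2\); the quadratic toy objective, noise level, step size, and horizon below are illustrative assumptions:

```python
import numpy as np

# Sketch of Algorithm 8 (SOMP) with D_phi(v, x) = ||v - x||^2 / 2, on the
# strongly convex-concave toy f(w, u) = w^2/2 + w*u - u^2/2 (saddle at
# (0, 0)) with additive Gaussian gradient noise.

rng = np.random.default_rng(0)
sigma, eta, T = 0.1, 0.05, 5000

def M_noisy(v):
    w, u = v
    grad = np.array([w + u, u - w])  # (grad_w f, -grad_u f)
    return grad + sigma * rng.standard_normal(2)

x = np.array([1.0, 1.0])    # x_1 = y_0
g = M_noisy(x)              # M(y_0; zeta_0)
y = x - eta * g             # y_1
avg = np.zeros(2)
for _ in range(T):
    g = M_noisy(y)          # one stochastic gradient per round
    x = x - eta * g         # x_{t+1}
    avg += y / T            # average the y_t iterates
    y = x - eta * g         # y_{t+1} reuses the same gradient

print(np.linalg.norm(avg))  # averaged iterate lands near the saddle (0, 0)
```

Note how lines 5 and 6 of the algorithm share one stochastic gradient \(\mathcal{M}(\mathbf{y}_t;\zeta_t)\), the defining feature of SOMP.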

Stochastic Optimistic Mirror Prox: A Variant with a Single Gradient Sequence

The updates of SMP (\(\ref{eqn:smp}\)) need to compute two stochastic gradient sequences \(\{\mathcal{M}(\mathbf{x}_t; \zeta'_{t})\}\) and \(\{\mathcal{M}(\mathbf{y}_t; \zeta_t)\}\), which doubles the cost of SGDA. A simple remedy is to use \(\mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1})\) in the first update of \(\mathbf{y}_t\), yielding \[\begin{equation} \begin{aligned} \mathbf{y}_{t} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1})^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v} , \mathbf{x}_t)\\ \mathbf{x}_{t+1} = \arg\min_{\mathbf{v}\in\mathcal V} \mathcal{M}(\mathbf{y}_{t}; \zeta_t)^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \mathbf{x}_t). \end{aligned} \end{equation}\] As a result, we only need to compute one sequence of stochastic gradients \(\{\mathcal{M}(\mathbf{y}_t; \zeta_t)\}\). This method is known as stochastic optimistic mirror prox (SOMP).

Let us consider the special case where \(\mathcal V=\mathbb{R}^{d}\times \mathbb{R}^{d'}\) and \(D_{\varphi}(\mathbf{x}, \mathbf{y})=\frac{1}{2}\|\mathbf{x}- \mathbf{y}\|_2^2\). The above update reduces to \[\begin{equation} \begin{aligned} \mathbf{y}_{t} = \mathbf{x}_t - \eta \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1})\\ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta\mathcal{M}(\mathbf{y}_{t}; \zeta_t). \end{aligned} \end{equation}\] This update can be re-written using the single sequence \(\{\mathbf{y}_t\}\). By subtracting the second equation from the first one, we have \[\begin{equation} \begin{aligned} \mathbf{y}_{t} - \mathbf{x}_{t+1} = \eta\mathcal{M}(\mathbf{y}_{t}; \zeta_t) - \eta \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1}). \end{aligned} \end{equation}\] As a result, \[\begin{align*} \mathbf{y}_{t} & = \mathbf{x}_{t+1} + \eta\mathcal{M}(\mathbf{y}_{t}; \zeta_t) - \eta \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1})\\ & = \mathbf{y}_{t+1} + 2\eta \mathcal{M}(\mathbf{y}_{t}; \zeta_{t}) - \eta \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1}), \end{align*}\] where the second equality uses \(\mathbf{x}_{t+1} = \mathbf{y}_{t+1} + \eta \mathcal{M}(\mathbf{y}_{t}; \zeta_{t})\), obtained from the first update shifted by one step. From this, we derive that \[\begin{align} \mathbf{y}_{t+1} = \mathbf{y}_{t} - \eta (2\mathcal{M}(\mathbf{y}_{t}; \zeta_{t}) - \mathcal{M}(\mathbf{y}_{t-1}; \zeta_{t-1})). \end{align}\] Applied to the min-max problem, this method is known as stochastic optimistic gradient descent ascent (SOGDA), yielding the following primal and dual updates: \[\begin{align} \mathbf{w}_{t+1} = \mathbf{w}_{t} - \eta(2\nabla_1 f(\mathbf{w}_{t}, \mathbf{u}_t; \zeta_{t}) - \nabla_1 f(\mathbf{w}_{t-1}, \mathbf{u}_{t-1}; \zeta_{t-1}))\\ \mathbf{u}_{t+1} = \mathbf{u}_{t} + \eta (2\nabla_2 f(\mathbf{w}_{t}, \mathbf{u}_t; \zeta_{t}) - \nabla_2 f(\mathbf{w}_{t-1}, \mathbf{u}_{t-1}; \zeta_{t-1})). \end{align}\]
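The equivalence between the two-sequence form and the single-sequence optimistic recursion can be checked numerically. The sketch below (all problem details are illustrative assumptions) runs both forms on a bilinear toy problem with a shared, pre-drawn noise sequence and confirms the iterates coincide:

```python
import numpy as np

# Check that the single-sequence recursion
#   y_{t+1} = y_t - eta * (2*M(y_t; zeta_t) - M(y_{t-1}; zeta_{t-1}))
# reproduces the two-sequence (x_t, y_t) updates under the same noise.

rng = np.random.default_rng(1)
eta, T = 0.05, 200
noise = 0.1 * rng.standard_normal((T + 1, 2))

def M(v, t):
    w, u = v
    return np.array([u, -w]) + noise[t]  # M(v; zeta_t) for f(w,u) = w*u

# Two-sequence form: y_t = x_t - eta*M(y_{t-1}; zeta_{t-1}),
#                    x_{t+1} = x_t - eta*M(y_t; zeta_t).
x = np.array([1.0, 1.0])      # x_1 = y_0
g = M(x, 0)                   # M(y_0; zeta_0)
y = x - eta * g               # y_1
ys_two = [y]
for t in range(1, T):
    g = M(y, t)
    x = x - eta * g           # x_{t+1}
    y = x - eta * g           # y_{t+1} = x_{t+1} - eta*M(y_t; zeta_t)
    ys_two.append(y)

# Single-sequence (optimistic) form with the same noise sequence.
z = np.array([1.0, 1.0])      # y_0
g_old = M(z, 0)
z = z - eta * g_old           # y_1
ys_one = [z]
for t in range(1, T):
    g_new = M(z, t)
    z = z - eta * (2.0 * g_new - g_old)
    g_old = g_new
    ys_one.append(z)

diff = np.max(np.abs(np.array(ys_two) - np.array(ys_one)))
print(diff)  # ~0: the two forms generate identical iterates
```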

Convergence Analysis

We analyze the stochastic optimistic mirror prox method in Algorithm 8. We make the following assumption.

Assumption 3.11

Suppose the following conditions hold: (i) \(f(\mathbf{w}, \mathbf{u})\) is convex in \(\mathbf{w}\) and concave in \(\mathbf{u}\); (ii) the gradient mapping is \(L\)-Lipschitz continuous, i.e., \(\|\mathcal{M}(\mathbf{v}) - \mathcal{M}(\mathbf{v}')\|_*\leq L\|\mathbf{v} - \mathbf{v}'\|\) for all \(\mathbf{v}, \mathbf{v}'\in\mathcal V\); (iii) the stochastic gradients are unbiased with bounded variance, i.e., \(\mathbb{E}_{\zeta}[\mathcal{M}(\mathbf{v}; \zeta)]=\mathcal{M}(\mathbf{v})\) and \(\mathbb{E}_{\zeta}[\|\mathcal{M}(\mathbf{v}; \zeta) - \mathcal{M}(\mathbf{v})\|_*^2]\leq\sigma^2\); (iv) \(\varphi\) is \(\alpha\)-strongly convex with respect to \(\|\cdot\|\); (v) \(D_{\varphi}(\mathbf{v}, \mathbf{x}_1)\leq D^2\) for all \(\mathbf{v}\in\mathcal V\).

Lemma 3.15

Given \(\mathbf{x}\), consider the updates:

\[\begin{equation}\label{eqn:project} \begin{aligned} \mathbf{y} &= \arg\min_{\mathbf{v}\in\mathcal V} \gamma\mathcal{M}(\xi)^{\top}\mathbf{v} + D_{\varphi}(\mathbf{v}, \mathbf{x}),\\ \mathbf{x}_+&=\arg\min_{\mathbf{v}\in \mathcal V} \gamma\mathcal{M}(\zeta)^{\top}\mathbf{v}+ D_{\varphi}(\mathbf{v},\mathbf{x}). \end{aligned} \end{equation}\] For any \(\mathbf{v}\in\mathcal V\), we have

\[\begin{equation}\label{eqn:ineq} \begin{aligned} \gamma\mathcal{M}(\zeta)^{\top}(\mathbf{y}-\mathbf{v})\leq & D_{\varphi}(\mathbf{v},\mathbf{x}) - D_{\varphi}(\mathbf{v}, \mathbf{x}_+) + \frac{\gamma^2}{\alpha}\|\mathcal{M}(\xi)-\mathcal{M}(\zeta)\|_*^2\\ & - \frac{\alpha}{2}[\|\mathbf{y}-\mathbf{x}\|^2 + \|\mathbf{y}-\mathbf{x}_+\|^2]. \end{aligned} \end{equation}\]

Proof

First, by Lemma 3.9, we have

\[\begin{equation}\label{eqn:nemir-lemma-1} \|\mathbf{y} - \mathbf{x}_+\|\leq \frac{\gamma}{\alpha}\|\mathcal{M}(\zeta) - \mathcal{M}(\xi)\|_*. \end{equation}\] Let \(\phi(\mathbf{v}) = \gamma \mathcal{M}(\zeta)^{\top}(\mathbf{y} - \mathbf{v}) - D_{\varphi}(\mathbf{v},\mathbf{x}) + D_\varphi(\mathbf{v}, \mathbf{x}_+)\), which is concave in \(\mathbf{v}\) since the \(\varphi(\mathbf{v})\) terms of the two Bregman divergences cancel, leaving an affine function. From the optimality condition of \(\mathbf{x}_+\), one can verify that \(\mathbf{x}_+\) also satisfies the optimality condition of \(\max_{\mathbf{v}\in\mathcal V}\phi(\mathbf{v})\). As a result, \(\phi(\mathbf{v})\leq \phi(\mathbf{x}_+),\forall \mathbf{v}\in\mathcal V\), i.e.,

\[\begin{equation}\label{eqn:nemir-lemma} \begin{aligned} &\gamma \mathcal{M}(\zeta)^{\top}(\mathbf{y} - \mathbf{v}) - D_{\varphi}(\mathbf{v},\mathbf{x}) + D_\varphi(\mathbf{v}, \mathbf{x}_+)\\ &\leq \gamma \mathcal{M}(\zeta)^{\top}(\mathbf{y} - \mathbf{x}_+) - D_{\varphi}(\mathbf{x}_+,\mathbf{x})\\ &= \gamma \mathcal{M}(\zeta)^{\top}(\mathbf{y} - \mathbf{x}_+) + \varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x})^{\top}(\mathbf{x}_+ - \mathbf{x})- \varphi(\mathbf{x}_+) \\ &= \gamma (\mathcal{M}(\zeta)-\mathcal{M}(\xi))^{\top}(\mathbf{y} - \mathbf{x}_+) + \gamma\mathcal{M}(\xi)^{\top} (\mathbf{y} - \mathbf{x}_+)\\ &+ \varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x})^{\top}(\mathbf{x}_+ - \mathbf{x})- \varphi(\mathbf{x}_+). \end{aligned} \end{equation}\] By the optimality condition of \(\mathbf{y}\), for any \(\mathbf{v}\in\mathcal V\) we have \[ (\gamma\mathcal{M}(\xi) + \nabla\varphi(\mathbf{y}) - \nabla\varphi(\mathbf{x}))^{\top}(\mathbf{y}-\mathbf{v})\leq 0. \] Plugging \(\mathbf{v}=\mathbf{x}_+\) into the above inequality, we have \[ (\gamma\mathcal{M}(\xi) + \nabla\varphi(\mathbf{y}) - \nabla\varphi(\mathbf{x}))^{\top}(\mathbf{y}-\mathbf{x}_+)\leq 0, \] which implies that \[ \gamma\mathcal{M}(\xi)^{\top}(\mathbf{y}-\mathbf{x}_+)\leq (\nabla\varphi(\mathbf{y}) - \nabla\varphi(\mathbf{x}))^{\top}(\mathbf{x}_+-\mathbf{y}). 
\] Combining this with (\(\ref{eqn:nemir-lemma}\)), we have \[\begin{equation*} \begin{aligned} &\gamma \mathcal{M}(\zeta)^{\top}(\mathbf{y} - \mathbf{v}) - D_{\varphi}(\mathbf{v},\mathbf{x}) + D_\varphi(\mathbf{v}, \mathbf{x}_+)\leq \gamma (\mathcal{M}(\zeta)-\mathcal{M}(\xi))^{\top}(\mathbf{y} - \mathbf{x}_+) \\ &+(\nabla\varphi(\mathbf{y}) - \nabla\varphi(\mathbf{x}))^{\top}(\mathbf{x}_+-\mathbf{y})+ \varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x})^{\top}(\mathbf{x}_+ - \mathbf{x})- \varphi(\mathbf{x}_+) \\ &=\gamma (\mathcal{M}(\zeta)-\mathcal{M}(\xi))^{\top}(\mathbf{y} - \mathbf{x}_+) \\ & + \varphi(\mathbf{x})+ \nabla\varphi(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) - \varphi(\mathbf{x}_+)+(\nabla\varphi(\mathbf{y}))^{\top}(\mathbf{x}_+-\mathbf{y})\\ &=\gamma (\mathcal{M}(\zeta)-\mathcal{M}(\xi))^{\top}(\mathbf{y} - \mathbf{x}_+) \\ & + \varphi(\mathbf{x}) + \nabla\varphi(\mathbf{x})^{\top}(\mathbf{y} - \mathbf{x}) - \varphi(\mathbf{y}) + \varphi(\mathbf{y}) +(\nabla\varphi(\mathbf{y}))^{\top}(\mathbf{x}_+-\mathbf{y}) - \varphi(\mathbf{x}_+)\\ &=\gamma (\mathcal{M}(\zeta)-\mathcal{M}(\xi))^{\top}(\mathbf{y} - \mathbf{x}_+) - D_\varphi(\mathbf{y}, \mathbf{x}) - D_\varphi(\mathbf{x}_+, \mathbf{y}) \\ &\leq \frac{\gamma^2}{\alpha}\|\mathcal{M}(\zeta)-\mathcal{M}(\xi)\|_*^2 - \frac{\alpha}{2}\|\mathbf{y} - \mathbf{x}\|^2 - \frac{\alpha}{2}\|\mathbf{x}_+ - \mathbf{y}\|^2, \end{aligned} \end{equation*}\] where the last inequality uses (\(\ref{eqn:nemir-lemma-1}\)) and the \(\alpha\)-strong convexity of \(\varphi\).
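A quick numerical sanity check of Lemma 3.15: in the unconstrained Euclidean case \(\mathcal V=\mathbb{R}^d\), \(\varphi=\frac12\|\cdot\|_2^2\) (so \(\alpha=1\)), both prox steps have closed forms, and a direct calculation shows the bound (\(\ref{eqn:ineq}\)) holds with equality. The snippet below verifies this on random data; the dimension and \(\gamma\) are arbitrary choices.

```python
import numpy as np

# Lemma 3.15 in the unconstrained Euclidean case: with alpha = 1 and
# D_phi(v, x) = ||v - x||^2 / 2, the prox steps reduce to
# y = x - gamma*a and x_+ = x - gamma*b, where a = M(xi), b = M(zeta).

rng = np.random.default_rng(0)
d, gamma = 5, 0.3
x, v = rng.standard_normal(d), rng.standard_normal(d)
a, b = rng.standard_normal(d), rng.standard_normal(d)  # M(xi), M(zeta)

y = x - gamma * a       # first prox step
x_plus = x - gamma * b  # second prox step

lhs = gamma * b @ (y - v)
rhs = (0.5 * np.sum((v - x) ** 2) - 0.5 * np.sum((v - x_plus) ** 2)
       + gamma ** 2 * np.sum((a - b) ** 2)
       - 0.5 * (np.sum((y - x) ** 2) + np.sum((y - x_plus) ** 2)))
print(abs(lhs - rhs))  # ~0: the bound is tight in this case
```

Expanding both sides by hand, the cross terms cancel and LHS equals RHS exactly, so the lemma's inequality is attained with equality when there is no projection and \(\varphi\) is the squared Euclidean norm.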

Theorem 3.14

Suppose Assumption 3.11 holds. Let \(\bar{\mathbf w}_T = \frac{1}{T}\sum_{t=1}^T\mathbf{w}_t\) and \(\bar{\mathbf u}_T = \frac{1}{T}\sum_{t=1}^T\mathbf{u}_t\). After \(T\) iterations, SOMP guarantees that \[ \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq\frac{2D^2}{T\eta} + \frac{8\sigma^2\eta}{\alpha}. \] If we set \(\eta =\min\left( \frac{D\sqrt{\alpha}}{2\sqrt{T}\sigma}, \frac{\alpha}{\sqrt{12}L}\right)\), we have \[ \mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq O\left(\frac{LD^2}{T\alpha} + \frac{\sigma D}{\sqrt{T\alpha}} \right). \]

💡 Why it matters
This result is consistent with the convergence of SGD for smooth convex minimization in Theorem 3.1. In particular, when \(\sigma = 0\) (i.e., using the deterministic gradient), the convergence rate simplifies to \(O(1/T)\).

Proof

Since the updates of \(\mathbf{y}_t, \mathbf{x}_{t+1}\) follow those in (\(\ref{eqn:project}\)), by applying Lemma 3.15, we have \[\begin{align*} \eta &\mathcal{M}(\mathbf{y}_t; \zeta_t)^{\top}(\mathbf{y}_t-\mathbf{v})\leq D_{\varphi}(\mathbf{v},\mathbf{x}_t) - D_{\varphi}(\mathbf{v}, \mathbf{x}_{t+1})\\ &+ \frac{\eta^2}{\alpha}\|\mathcal{M}(\mathbf{y}_t; \zeta_t)-\mathcal{M}(\mathbf{y}_{t-1};\zeta_{t-1})\|_*^2 - \frac{\alpha}{2}[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2]\\ &= D_{\varphi}(\mathbf{v},\mathbf{x}_t) - D_{\varphi}(\mathbf{v}, \mathbf{x}_{t+1})\\ &+ \frac{\eta^2}{\alpha}\|\mathcal{M}(\mathbf{y}_t; \zeta_t)-\mathcal{M}(\mathbf{y}_{t-1};\zeta_{t-1}) - \mathcal{M}(\mathbf{y}_t)+\mathcal{M}(\mathbf{y}_{t-1}) + (\mathcal{M}(\mathbf{y}_t) -\mathcal{M}(\mathbf{y}_{t-1}))\|_*^2 \\ & - \frac{\alpha}{2}[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2]. \end{align*}\] Let \(\sigma_t^2:= \|\mathcal{M}(\mathbf{y}_t; \zeta_t) - \mathcal{M}(\mathbf{y}_t)\|_*^2\). Then we have \[\begin{align*} &\|\mathcal{M}(\mathbf{y}_t; \zeta_t)-\mathcal{M}(\mathbf{y}_{t-1};\zeta_{t-1}) - \mathcal{M}(\mathbf{y}_t)+\mathcal{M}(\mathbf{y}_{t-1}) + (\mathcal{M}(\mathbf{y}_t) -\mathcal{M}(\mathbf{y}_{t-1}))\|_*^2\\ &\leq 3\|\mathcal{M}(\mathbf{y}_t; \zeta_t) - \mathcal{M}(\mathbf{y}_t)\|_*^2 + 3\|\mathcal{M}(\mathbf{y}_{t-1};\zeta_{t-1}) - \mathcal{M}(\mathbf{y}_{t-1})\|_*^2\\ &+ 3\|\mathcal{M}(\mathbf{y}_t) -\mathcal{M}(\mathbf{y}_{t-1})\|_*^2\\ & \leq 3\sigma_t^2 + 3\sigma_{t-1}^2 + 3L^2\|\mathbf{y}_t - \mathbf{y}_{t-1}\|^2. 
\end{align*}\] Combining the above two inequalities, we have \[\begin{align*} \eta &\mathcal{M}(\mathbf{y}_t; \zeta_t)^{\top}(\mathbf{y}_t-\mathbf{v})\leq D_{\varphi}(\mathbf{v},\mathbf{x}_t) - D_{\varphi}(\mathbf{v}, \mathbf{x}_{t+1})\\ &+ \frac{\eta^2}{\alpha}(3\sigma_t^2 + 3\sigma_{t-1}^2 + 3L^2\|\mathbf{y}_t - \mathbf{y}_{t-1}\|^2) - \frac{\alpha}{2}[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2]. \end{align*}\] Taking the average over \(t=1,\ldots, T\), we have \[\begin{align*} \frac{1}{T} &\sum_{t=1}^T\mathcal{M}(\mathbf{y}_t)^{\top}(\mathbf{y}_t-\mathbf{v})\leq \frac{1}{T\eta}D_{\varphi}(\mathbf{v},\mathbf{x}_1) \\ &+ \frac{\eta}{\alpha T}\sum_{t=1}^T(3\sigma_t^2 + 3\sigma_{t-1}^2 + 3L^2\|\mathbf{y}_t - \mathbf{y}_{t-1}\|^2) - \frac{\alpha}{2\eta T}\sum_{t=1}^T[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2]\\ &+\frac{1}{T} \sum_{t=1}^T(\mathcal{M}(\mathbf{y}_t)-\mathcal{M}(\mathbf{y}_t; \zeta_t))^{\top}(\mathbf{y}_t-\mathbf{v}). \end{align*}\] Let \(\mathbf{y}_t=(\mathbf{w}_t, \mathbf{u}_t)\) and \(\mathbf{v}=(\mathbf{w},\mathbf{u})= \arg\max_{\mathbf{w}\in\mathcal W, \mathbf{u}\in\mathcal U} f(\bar{\mathbf w}_T, \mathbf{u}) - f(\mathbf{w}, \bar{\mathbf u}_T)\). We have \[\begin{align*} &\frac{1}{T} \sum_{t=1}^T\mathcal{M}(\mathbf{y}_t)^{\top}(\mathbf{y}_t-\mathbf{v}) = \frac{1}{T} \sum_{t=1}^T(\nabla_1f(\mathbf{w}_t, \mathbf{u}_t)^{\top}(\mathbf{w}_t - \mathbf{w}) - \nabla_2f(\mathbf{w}_t, \mathbf{u}_t)^{\top}(\mathbf{u}_t - \mathbf{u}))\\ &\geq \frac{1}{T} \sum_{t=1}^T(f(\mathbf{w}_t, \mathbf{u}_t) - f(\mathbf{w}, \mathbf{u}_t) + f(\mathbf{w}_t, \mathbf{u}) - f(\mathbf{w}_t, \mathbf{u}_t))\\ & = \frac{1}{T} \sum_{t=1}^T(f(\mathbf{w}_t, \mathbf{u}) - f(\mathbf{w}, \mathbf{u}_t))\geq f(\bar{\mathbf w}_T, \mathbf{u}) - f(\mathbf{w}, \bar{\mathbf u}_T). 
\end{align*}\] As a result, \[\begin{align*} &\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)\leq \frac{1}{T} \sum_{t=1}^T\mathcal{M}(\mathbf{y}_t)^{\top}(\mathbf{y}_t-\mathbf{v})\leq \frac{1}{T\eta}D_{\varphi}(\mathbf{v},\mathbf{x}_1) \\ &+ \frac{\eta}{\alpha T}\sum_{t=1}^T(3\sigma_t^2 + 3\sigma_{t-1}^2 + 3L^2\|\mathbf{y}_t - \mathbf{y}_{t-1}\|^2) - \frac{\alpha}{2\eta T}\sum_{t=1}^T[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2]\\ &+\frac{1}{T} \sum_{t=1}^T(\mathcal{M}(\mathbf{y}_t)-\mathcal{M}(\mathbf{y}_t; \zeta_t))^{\top}(\mathbf{y}_t-\mathbf{v}). \end{align*}\] The last term can be bounded by using Lemma 3.14. Define the virtual sequence with \(\hat{\mathbf y}_1 = \mathbf{x}_1\): \[ \hat{\mathbf y}_{t+1}=\arg\min_{\mathbf{v}\in\mathcal V}(\mathcal{M}(\mathbf{y}_t) - \mathcal{M}(\mathbf{y}_t; \zeta_t))^{\top}\mathbf{v} + \frac{1}{\eta}D_{\varphi}(\mathbf{v}, \hat{\mathbf y}_t). \] Then Lemma 3.14 implies that \[\begin{align*} &\mathbb{E} \left[\frac{1}{T}\sum_{t=1}^T(\mathcal{M}(\mathbf{y}_t; \zeta_t)-\mathcal{M}(\mathbf{y}_t))^{\top}\mathbf{v}\right] \leq \mathbb{E}\left[\frac{1}{\eta T} D_\varphi(\mathbf{v}, \hat{\mathbf y}_1)\right]\\ &+ \mathbb{E}\left[\frac{\eta}{2\alpha T}\sum_{t=1}^T\|\mathcal{M}(\mathbf{y}_t) - \mathcal{M}(\mathbf{y}_t; \zeta_t)\|_*^2\right]. 
\end{align*}\] Combining the above results, we have \[\begin{align*} &\mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq \frac{2}{T\eta}D_{\varphi}(\mathbf{v},\mathbf{x}_1) + \frac{8\sigma^2\eta}{\alpha}\\ &+ \mathbb{E}\left[\frac{3L^2\eta}{\alpha T}\sum_{t=1}^T\|\mathbf{y}_t - \mathbf{y}_{t-1}\|^2 - \frac{\alpha}{2\eta T}\sum_{t=1}^T\left[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2\right]\right]\\ & \leq\frac{2}{T\eta}D_{\varphi}(\mathbf{v},\mathbf{x}_1) + \frac{8\sigma^2\eta}{\alpha}+\mathbb{E}\left[\frac{3L^2\eta}{\alpha T}\sum_{t=1}^T[2\|\mathbf{y}_t -\mathbf{x}_t\|^2 + 2\|\mathbf{x}_t- \mathbf{y}_{t-1}\|^2] \right]\\ &- \mathbb{E}\left[ \frac{\alpha}{2\eta T}\sum_{t=1}^T\left[\|\mathbf{y}_t-\mathbf{x}_t\|^2 + \|\mathbf{y}_t-\mathbf{x}_{t+1}\|^2\right]\right]. \end{align*}\] If \(6L^2\frac{\eta}{\alpha}\leq \frac{\alpha}{2\eta}\), i.e., \(\eta\leq \frac{\alpha}{\sqrt{12}L}\), the sum of the last two terms will be less than zero due to \(\mathbf{x}_1 = \mathbf{y}_0\). Then, we have \[\begin{align*} &\mathbb{E}[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)]\leq\frac{2}{T\eta}D_{\varphi}(\mathbf{v},\mathbf{x}_1) + \frac{8\sigma^2\eta}{\alpha}\leq\frac{2D^2}{T\eta} + \frac{8\sigma^2\eta}{\alpha}. \end{align*}\]

For the second part, optimizing the upper bound over \(\eta\) gives \(\eta_*= \frac{D\sqrt{\alpha}}{2\sqrt{T}\sigma}\). If \(\eta_*\leq \frac{\alpha}{\sqrt{12}L}\), i.e., \(T\geq \frac{3D^2L^2}{\sigma^2\alpha}\), we set \(\eta= \eta_*\), then \[ \mathbb{E}\left[\Delta(\bar{\mathbf w}_{T}, \bar{\mathbf u}_T)\right]\leq \frac{8\sigma D}{\sqrt{T\alpha}}. \] If \(\eta_*>\frac{\alpha}{\sqrt{12}L}\), i.e., \(\sigma^2\leq \frac{3D^2L^2}{\alpha T}\), we set \(\eta=\frac{\alpha}{\sqrt{12}L}\), then \[ \mathbb{E}\left[\Delta(\bar{\mathbf w}_T, \bar{\mathbf u}_T)\right]\leq \frac{2\sqrt{12}LD^2}{T\alpha} + \frac{12LD^2}{\sqrt{3}T\alpha}. \]
