
Section 3.2: Stochastic Proximal Gradient Descent

Let us consider the following stochastic composite optimization problem:

\[\begin{align}\label{eqn:cso} \min_{\mathbf{w}\in\mathbb{R}^d}F(\mathbf{w}) := \mathbb{E}_{\zeta}[g(\mathbf{w}; \zeta)] + r(\mathbf{w}), \end{align}\]

where \(g(\mathbf{w})=\mathbb{E}_{\zeta}[g(\mathbf{w}; \zeta)]\) is a smooth function and \(r(\mathbf{w})\) is a possibly non-smooth function. In machine learning, \(r\) usually corresponds to some regularizer on the model parameter. We make the following assumption.

Assumption 3.6

Suppose the following conditions hold: (i) \(g\) is \(L\)-smooth, i.e., \(\|\nabla g(\mathbf{w})-\nabla g(\mathbf{w}')\|_2\leq L\|\mathbf{w}-\mathbf{w}'\|_2\) for all \(\mathbf{w},\mathbf{w}'\); (ii) the stochastic gradient is unbiased with bounded variance, i.e., \(\mathbb{E}_{\zeta}[\nabla g(\mathbf{w};\zeta)]=\nabla g(\mathbf{w})\) and \(\mathbb{E}_{\zeta}\|\nabla g(\mathbf{w};\zeta)-\nabla g(\mathbf{w})\|_2^2\leq\sigma^2\); (iii) \(g\) and \(r\) are convex, and \(F\) attains its minimum at some \(\mathbf{w}_*\).

If the regularizer \(r\) is non-smooth, the overall objective function is also non-smooth. Consequently, applying SGD directly cannot exploit the smoothness of \(g\), which would otherwise enable faster convergence and a bound that scales with the gradient variance \(\sigma^2\) rather than the Lipschitz constant of the objective.


Algorithm 2: SPGD

  1. Input: learning rate schedule \(\{\eta_t\}_{t=1}^{T}\), starting point \(\mathbf{w}_1\)
  2. For \(t=1,\dotsc,T\)
  3. Compute an unbiased gradient estimator \(\mathbf{z}_t=\nabla g(\mathbf{w}_t;\zeta_t)\)
  4. Update the model \(\mathbf{w}\) by \(\mathbf{w}_{t+1}=\arg\min_{\mathbf{w}\in\mathbb{R}^d}\mathbf{z}_t^{\top}(\mathbf{w}-\mathbf{w}_t)+\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_t\|_2^2+r(\mathbf{w}).\)

To address this challenge, we can employ the stochastic proximal gradient descent (SPGD) method:

\[\begin{align}\label{eqn:spgd} \mathbf{w}_{t+1} &= \arg\min_{\mathbf{w}\in\mathbb{R}^d}\nabla g(\mathbf{w}_t; \zeta_t)^{\top}(\mathbf{w}-\mathbf{w}_t)+r(\mathbf{w})+\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_t\|_2^2\\ &=\arg\min_{\mathbf{w}\in\mathbb{R}^d}r(\mathbf{w})+\frac{1}{2\eta_t}\|\mathbf{w}-(\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t; \zeta_t))\|_2^2.\notag \end{align}\]

This is also known as forward-backward splitting, where \(\widetilde{\mathbf{w}}_{t+1}=\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t; \zeta_t)\) is the forward step and the proximal mapping of \(r\) is the backward step:

\[ \mathbf{w}_{t+1}=\operatorname{prox}_{\eta_t r}(\widetilde{\mathbf{w}}_{t+1}) =\arg\min_{\mathbf{w}} r(\mathbf{w})+\frac{1}{2\eta_t}\|\mathbf{w}-\widetilde{\mathbf{w}}_{t+1}\|_2^2. \]
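To make the forward-backward update concrete, the following is a minimal NumPy sketch (an illustrative example, not code from the text) that applies it to a small Lasso problem: \(g\) is a least-squares loss sampled one row at a time and \(r(\mathbf{w})=\lambda\|\mathbf{w}\|_1\), whose proximal mapping is the soft-thresholding operator from Table 3.1. All names and problem sizes are arbitrary choices.

```python
import numpy as np

def prox_l1(v, t):
    # proximal mapping of t*||.||_1: componentwise soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
n, d, lam = 200, 10, 0.1
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.standard_normal(n)

def F(w):
    # full objective F(w) = g(w) + r(w), with g(w) = (1/2n) ||Xw - y||^2
    return 0.5 * np.mean((X @ w - y) ** 2) + lam * np.abs(w).sum()

w, eta = np.zeros(d), 0.01
for _ in range(5000):
    i = rng.integers(n)                    # draw zeta_t: a random row
    z = (X[i] @ w - y[i]) * X[i]           # unbiased estimate of grad g(w)
    w = prox_l1(w - eta * z, eta * lam)    # forward step, then backward (prox) step
```

A small constant step size is used for simplicity; Theorem 3.5 below suggests \(\eta\le 1/L\) together with iterate averaging for the formal guarantee.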

When \(r\) is absent, the above update reduces to the SGD update. If \(r(\mathbf{w})\) encodes a domain constraint \(\mathbf{w}\in\mathcal{W}\), i.e., \(r(\mathbf{w})=0\) if \(\mathbf{w}\in\mathcal{W}\) and \(r(\mathbf{w})=\infty\) otherwise, the above update becomes

\[\begin{align} \mathbf{w}_{t+1}=\Pi_{\mathcal{W}}[\widetilde{\mathbf{w}}_{t+1}]:=\arg\min_{\mathbf{w}\in\mathcal{W}}\|\mathbf{w}-\widetilde{\mathbf{w}}_{t+1}\|_2^2, \end{align}\]

which is the projection of \(\widetilde{\mathbf{w}}_{t+1}=\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t;\zeta_t)\) onto the constrained domain \(\mathcal{W}\). This is known as the projected SGD method.
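For this constrained special case, the backward step is an explicit projection; a minimal sketch follows, where the ball-shaped \(\mathcal{W}\) and the numbers are illustrative choices.

```python
import numpy as np

def project_l2_ball(v, radius):
    # prox of the 0/infinity indicator of W = {w : ||w||_2 <= radius},
    # i.e., the Euclidean projection onto the l2-ball
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

# one projected-SGD step: forward gradient step (a stand-in value here),
# then the backward projection step
w_tilde = np.array([3.0, 4.0])            # plays the role of w_t - eta * grad
w_next = project_l2_ball(w_tilde, 1.0)    # lands on the unit ball
```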

Explanation of SPGD update

The update (\(\ref{eqn:spgd}\)) is equivalent to:

\[\begin{align*} \mathbf{w}_{t+1}=\arg\min_{\mathbf{w}} g(\mathbf{w}_t;\zeta_t)+\nabla g(\mathbf{w}_t;\zeta_t)^{\top}(\mathbf{w}-\mathbf{w}_t)+r(\mathbf{w})+\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_t\|_2^2. \end{align*}\]

Unlike SGD, SPGD linearizes only the smooth part, using \(g(\mathbf{w}_t;\zeta_t)+\nabla g(\mathbf{w}_t;\zeta_t)^{\top}(\mathbf{w}-\mathbf{w}_t)+r(\mathbf{w})\) as a stochastic surrogate for \(g(\mathbf{w})+r(\mathbf{w})\) while keeping \(r\) exact; the proximal term \(\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_t\|_2^2\) keeps \(\mathbf{w}_{t+1}\) close to \(\mathbf{w}_t\), where the linear approximation is accurate.

Using the first-order optimality condition of (\(\ref{eqn:spgd}\)), \(\mathbf{w}_{t+1}\) satisfies

\[\begin{align} \mathbf{w}_{t+1}=\mathbf{w}_t-\eta_t(\nabla g(\mathbf{w}_t;\zeta_t)+\partial r(\mathbf{w}_{t+1})). \end{align}\]

It resembles SGD but differs in that it uses a stochastic gradient of \(g\) evaluated at \(\mathbf{w}_t\) and a subgradient of \(r\) evaluated at \(\mathbf{w}_{t+1}\).
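This characterization can be checked numerically for the \(\ell_1\) regularizer, for which the subgradient at any nonzero coordinate of \(\mathbf{w}_{t+1}\) equals \(\lambda\operatorname{sign}(w_{t+1,i})\). The values below are arbitrary illustrative numbers.

```python
import numpy as np

lam, eta = 0.4, 0.25
w_t = np.array([1.0, -0.2, 0.05, -2.0])
z_t = np.array([0.3, -0.1, 0.2, 0.5])    # plays the role of a stochastic gradient of g

# SPGD update via the prox of lam*||.||_1 (soft-thresholding)
v = w_t - eta * z_t
w_next = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

# on nonzero coordinates, w_next = w_t - eta*(z_t + lam*sign(w_next)),
# matching the implicit subgradient form of the update
nz = w_next != 0
```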

Table 3.1: Examples of regularization functions \(r(\cdot)\) and their proximal mappings, where \(\sigma_i\) denotes the \(i\)-th singular value of a matrix.
Regularization \(r(\cdot)\) \(\operatorname{prox}_{\eta r}(\bar{\mathbf{w}})\) or \(\operatorname{prox}_{\eta r}(\bar W)\)
Euclidean norm square \(\frac{\lambda}{2}\|\mathbf{w}\|_2^2\) \(\frac{\bar{\mathbf{w}}}{1+\lambda\eta}\)
Euclidean norm \(\lambda\|\mathbf{w}\|_2\) \(\big(1-\frac{\lambda\eta}{\|\bar{\mathbf{w}}\|_2}\big)_+\bar{\mathbf{w}}\)
Lasso \(\lambda\|\mathbf{w}\|_1\) \(\operatorname{sign}(\bar{\mathbf{w}})\odot\max\{|\bar{\mathbf{w}}|-\lambda\eta,0\}\)
Group Lasso \(\lambda\sum_g\|\mathbf{w}_g\|_2\) \(\big(1-\frac{\lambda\eta}{\|\bar{\mathbf{w}}_g\|_2}\big)_+\bar{\mathbf{w}}_g\) (for each group \(g\))
Elastic Net \(\alpha\|\mathbf{w}\|_1+\tfrac{\beta}{2}\|\mathbf{w}\|_2^2\) \(\frac{1}{1+\eta\beta}\Big(\operatorname{sign}(\bar{\mathbf{w}})\odot\max\{|\bar{\mathbf{w}}|-\eta\alpha,0\}\Big)\)
Trace norm (nuclear) \(\lambda\|W\|_*=\lambda\sum_i\sigma_i(W)\) \(U\,\operatorname{diag}((\sigma_i-\lambda\eta)_+)\,V^\top\) (where \(\bar W=U\operatorname{diag}(\sigma_i)V^\top\))

In order to make the update efficient, the proximal mapping of \(r\) should be easy to compute. Table 3.1 presents several examples of regularizers \(r\) together with closed-form solutions of their proximal mappings, with further explanations below. We leave the detailed derivations of these proximal mappings to the reader as exercises.
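As a quick numerical sanity check of Table 3.1 (using arbitrary values), the first-order optimality condition \(\frac{1}{\eta}(\mathbf{p}-\bar{\mathbf{w}})+\partial r(\mathbf{p})\ni\mathbf{0}\) of the proximal problem can be verified for two of the rows:

```python
import numpy as np

rng = np.random.default_rng(0)
w_bar = rng.standard_normal(5)   # an arbitrary input point
eta, lam = 0.3, 0.7

# Row "Euclidean norm square": prox of (lam/2)*||w||_2^2 is w_bar / (1 + lam*eta)
p_sq = w_bar / (1 + lam * eta)

# Row "Lasso": prox of lam*||w||_1 is componentwise soft-thresholding
p_l1 = np.sign(w_bar) * np.maximum(np.abs(w_bar) - lam * eta, 0.0)
nz = p_l1 != 0   # coordinates that survive the threshold

# Optimality conditions these must satisfy:
#   (p - w_bar)/eta + lam*p = 0                        (squared norm)
#   (p - w_bar)/eta + lam*sign(p) = 0 on nz coords,
#   and |w_bar_i| <= lam*eta on zeroed coords          (l1 norm)
```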

Examples of Proximal Mappings

Example 3.3: Euclidean norm square

This is the most commonly used regularizer. Its proximal mapping shrinks the magnitude of the input vector \(\bar{\mathbf{w}}\), effectively performing weight decay.

Example 3.4: \(\ell_1\) norm

The \(\ell_1\) norm regularizer \(\lambda\|\mathbf{w}\|_1\) is used in the well-known Lasso method for linear regression. Its proximal mapping promotes sparsity by setting to zero every entry whose corresponding component of \(\bar{\mathbf{w}}\) is at most \(\lambda\eta\) in magnitude.

Example 3.5: Group Lasso

This is an extension of Lasso that groups features together and enforces group-wise sparsity. Specifically, the proximal mapping acts on each group as a whole: a group \(g\) is set entirely to zero whenever \(\|\bar{\mathbf{w}}_g\|_2\leq\lambda\eta\), so the weights within a group become zero simultaneously.

Example 3.6: Trace norm

The trace norm regularizer for a matrix is analogous to the \(\ell_1\) norm for a vector, as it promotes low-rank structure. Its proximal mapping induces a low-rank solution by setting the singular values of the input matrix to zero whenever they are smaller than \(\eta\lambda\).
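For the trace-norm row, the proximal mapping amounts to soft-thresholding the singular values; a short NumPy sketch follows (the matrix size and threshold are arbitrary choices).

```python
import numpy as np

def prox_trace_norm(W_bar, t):
    # singular-value soft-thresholding: prox of t*||.||_* at W_bar
    U, s, Vt = np.linalg.svd(W_bar, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt

rng = np.random.default_rng(1)
W_bar = rng.standard_normal((6, 4))
s = np.linalg.svd(W_bar, compute_uv=False)
# thresholding at the 2nd singular value zeroes all but the top one,
# so the result has rank at most 1
P = prox_trace_norm(W_bar, t=s[1])
```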

3.2.1 Convex Functions

We first present two lemmas, analogous to Lemma 3.1 and Lemma 3.2, under the assumptions that \(r\) is \(\mu_r\)-strongly convex with \(\mu_r\geq 0\) and \(g\) is \(\mu\)-strongly convex with \(\mu\geq 0\). These lemmas allow us to establish convergence results in both the convex and strongly convex settings.

Lemma 3.6

Consider the update

\[\begin{align}\label{eqn:spgd-g} \mathbf{w}_{t+1}=\arg\min_{\mathbf{w}\in\mathbb{R}^d}\mathbf{z}_t^{\top}(\mathbf{w}-\mathbf{w}_t)+\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_t\|_2^2+r(\mathbf{w}). \end{align}\]

If \(r\) is \(\mu_r\)-strongly convex with \(\mu_r\geq 0\), for any \(\mathbf{w}\) we have

\[\begin{align*} \mathbf{z}_t^{\top}(\mathbf{w}_{t+1}-\mathbf{w})+r(\mathbf{w}_{t+1})-r(\mathbf{w}) \leq&\frac{1}{2\eta_t}\|\mathbf{w}_t-\mathbf{w}\|_2^2-\left(\frac{1}{2\eta_t}+\frac{\mu_r}{2}\right)\|\mathbf{w}_{t+1}-\mathbf{w}\|_2^2\\ &-\frac{1}{2\eta_t}\|\mathbf{w}_t-\mathbf{w}_{t+1}\|_2^2. \end{align*}\]

Proof.
By the first-order optimality condition of (\(\ref{eqn:spgd-g}\)), for any \(\mathbf{w}\) we have

\[\begin{align}\label{eqn:1st} \big(\mathbf{z}_t+\partial r(\mathbf{w}_{t+1})+\frac{1}{\eta_t}(\mathbf{w}_{t+1}-\mathbf{w}_t)\big)^{\top}(\mathbf{w}-\mathbf{w}_{t+1})\geq 0. \end{align}\]

By the strong convexity of \(r\), we have

\[\begin{align*} r(\mathbf{w}_{t+1}) \leq r(\mathbf{w})+\partial r(\mathbf{w}_{t+1})^{\top}(\mathbf{w}_{t+1}-\mathbf{w})-\frac{\mu_r}{2}\|\mathbf{w}-\mathbf{w}_{t+1}\|_2^2. \end{align*}\]

Adding the above two inequalities, we have

\[\begin{align*} &\mathbf{z}_t^{\top}(\mathbf{w}_{t+1}-\mathbf{w})+r(\mathbf{w}_{t+1})-r(\mathbf{w}) \leq \frac{1}{\eta_t}(\mathbf{w}_t-\mathbf{w}_{t+1})^{\top}(\mathbf{w}_{t+1}-\mathbf{w})-\frac{\mu_r}{2}\|\mathbf{w}-\mathbf{w}_{t+1}\|_2^2\\ &=\frac{1}{2\eta_t}\big(\|\mathbf{w}_t-\mathbf{w}\|_2^2-\|\mathbf{w}_{t+1}-\mathbf{w}\|_2^2-\|\mathbf{w}_t-\mathbf{w}_{t+1}\|_2^2\big)-\frac{\mu_r}{2}\|\mathbf{w}-\mathbf{w}_{t+1}\|_2^2, \end{align*}\]

where the last equality uses the fact that \(2(a-b)^{\top}(b-c)=\|a-c\|_2^2-\|a-b\|_2^2-\|b-c\|_2^2\).

Lemma 3.7

Following the same setting as in Lemma 3.6, if \(g\) is \(\mu\)-strongly convex with \(\mu\geq 0\) and \(\eta_t\leq 1/L\), then we have

\[\begin{equation}\label{eqn:Fr} \begin{aligned} \mathbb{E}[F(\mathbf{w}_{t+1})-F(\mathbf{w}_*)] \leq&\mathbb{E}\left[\left(\frac{1}{2\eta_t}-\frac{\mu}{2}\right)\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\left(\frac{1}{2\eta_t}+\frac{\mu_r}{2}\right)\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2\right]\\ &+\eta_t\sigma^2. \end{aligned} \end{equation}\]

Proof.
By Lemma 3.6, we have

\[\begin{align*} &\nabla g(\mathbf{w}_t,\zeta_t)^{\top}(\mathbf{w}_{t+1}-\mathbf{w})+r(\mathbf{w}_{t+1}) \leq r(\mathbf{w})+\frac{1}{2\eta_t}\big(\|\mathbf{w}_t-\mathbf{w}\|_2^2-\|\mathbf{w}_{t+1}-\mathbf{w}\|_2^2\big)\\ &-\frac{\mu_r}{2}\|\mathbf{w}-\mathbf{w}_{t+1}\|_2^2-\frac{1}{2\eta_t}\|\mathbf{w}_t-\mathbf{w}_{t+1}\|_2^2. \end{align*}\]

By the smoothness of \(g\), we have

\[\begin{align*} g(\mathbf{w}_{t+1}) \leq g(\mathbf{w}_t)+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1}-\mathbf{w}_t)+\frac{L}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2. \end{align*}\]

By the strong convexity of \(g\), we have

\[\begin{align*} g(\mathbf{w}_t)\leq g(\mathbf{w})+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_t-\mathbf{w})-\frac{\mu}{2}\|\mathbf{w}_t-\mathbf{w}\|_2^2. \end{align*}\]

Adding the above two inequalities, we have

\[\begin{align*} g(\mathbf{w}_{t+1}) \leq g(\mathbf{w})+\nabla g(\mathbf{w}_t)^{\top}(\mathbf{w}_{t+1}-\mathbf{w}) -\frac{\mu}{2}\|\mathbf{w}_t-\mathbf{w}\|_2^2+\frac{L}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2. \end{align*}\]

Combining this with the first inequality for \(\mathbf{w}=\mathbf{w}_*\), we have

\[\begin{align*} F(\mathbf{w}_{t+1})-F(\mathbf{w}_*) \leq&\frac{1}{2\eta_t}\big(\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2-\|\mathbf{w}_t-\mathbf{w}_{t+1}\|_2^2\big)\\ &-\frac{\mu}{2}\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\frac{\mu_r}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2+\frac{L}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_t\|_2^2\\ &+(\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_t,\zeta_t))^{\top}(\mathbf{w}_{t+1}-\mathbf{w}_*). \end{align*}\]

This is similar to Eqn. 7 in Section 3.1 except for the two negative terms \(-\frac{\mu}{2}\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\frac{\mu_r}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2\), which are due to the \(\mu_r\)-strong convexity of \(r\) and \(\mu\)-strong convexity of \(g\). The remaining proof is similar to that in Lemma 3.2.

Theorem 3.5

Suppose Assumption 3.6 holds. Let \(\eta_t=\eta\leq 1/L\) and \(\bar{\mathbf{w}}_T=\frac{1}{T}\sum_{t=1}^T\mathbf{w}_{t+1}\). Then after \(T\) iterations of the SPGD update (\(\ref{eqn:spgd}\)), we have \[\begin{align*} \mathbb{E}[F(\bar{\mathbf{w}}_T)-F(\mathbf{w}_*)]\leq \frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{2\eta T}+\eta\sigma^2. \end{align*}\] If \(\eta=\min\left(\frac{1}{L},\frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2}{\sqrt{2T}\sigma}\right)\), then \[\begin{align*} \mathbb{E}[F(\bar{\mathbf{w}}_T)-F(\mathbf{w}_*)]\leq \frac{\sqrt{2}\sigma\|\mathbf{w}_1-\mathbf{w}_*\|_2}{\sqrt{T}}+\frac{L\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{T}. \end{align*}\]

💡 Why it matters
Insight 1: The theorem indicates that even if the objective has a non-smooth regularizer \(r\), the convergence rate of SPGD still depends on the variance bound \(\sigma^2\) instead of the Lipschitz constant of the objective function as in the analysis of SGD for non-smooth convex functions.

Insight 2: Employing the proximal mapping of \(r\) renders the convergence independent of the smoothness of \(r\). Consequently, this approach is advantageous even when \(r\) is smooth, particularly if it possesses a large smoothness constant.

Proof.
The proof is the same as that of Theorem 3.1, following Lemma 3.7 with \(\mu_r=\mu=0\).

Strongly Convex Functions

We can prove a faster convergence when the loss function or the regularizer is strongly convex.

Theorem 3.6

Suppose Assumption 3.6 holds and \(g\) is \(\mu\)-strongly convex and \(r\) is \(\mu_r\)-strongly convex. Let \(\eta_t=1/((\mu+\mu_r)t+L)\) and \(\bar{\mathbf{w}}_T=\frac{1}{T}\sum_{t=1}^T\mathbf{w}_{t+1}\). Then after \(T\) iterations of SPGD update (\(\ref{eqn:spgd}\)), we have

\[\begin{align*} \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\big(F(\mathbf{w}_{t+1})-F(\mathbf{w}_*)\big)\right] \leq \frac{(L+\mu_r)\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{T}+\frac{(1+\log T)\sigma^2}{T(\mu+\mu_r)}. \end{align*}\]

Proof.
Following Lemma 3.7, if \(\eta_t\leq \frac{1}{L}\) we have

\[\begin{align*} &\mathbb{E}\big[F(\mathbf{w}_{t+1})-F(\mathbf{w}_*)\big]\\ &\leq \mathbb{E}\left[\frac{1}{2\eta_t}\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\frac{1}{2\eta_t}\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2 -\frac{\mu}{2}\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-\frac{\mu_r}{2}\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2\right]+\eta_t\sigma^2. \end{align*}\]

Summing over \(t=1,\ldots,T\), we have

\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\big(F(\mathbf{w}_{t+1})-F(\mathbf{w}_*)\big)\right]\\ &\leq \mathbb{E}\left[\sum_{t=1}^T\left(\frac{1}{2\eta_t}-\frac{1}{2\eta_{t-1}}-\frac{\mu+\mu_r}{2}\right)\|\mathbf{w}_t-\mathbf{w}_*\|_2^2 +\frac{1}{2\eta_0}\|\mathbf{w}_1-\mathbf{w}_*\|_2^2+\frac{\mu_r}{2}\|\mathbf{w}_1-\mathbf{w}_*\|_2^2\right]\\ &\quad+\sum_{t=1}^T\eta_t\sigma^2. \end{align*}\]

Let \(\eta_t=\frac{1}{(\mu+\mu_r)t+L}\). Then \(\frac{1}{2\eta_t}-\frac{1}{2\eta_{t-1}}-\frac{\mu+\mu_r}{2}=0\) and we have

\[\begin{align*} &\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T\big(F(\mathbf{w}_{t+1})-F(\mathbf{w}_*)\big)\right]\\ &\leq \frac{L+\mu_r}{2T}\|\mathbf{w}_1-\mathbf{w}_*\|_2^2+\frac{1}{T}\sum_{t=1}^T\frac{\sigma^2}{(\mu+\mu_r)t} \leq \frac{L+\mu_r}{2T}\|\mathbf{w}_1-\mathbf{w}_*\|_2^2+\frac{(1+\log T)\sigma^2}{T(\mu+\mu_r)}. \end{align*}\]


Algorithm 3: Restarted SPGD

  1. Input: a learning schedule \(\{\eta_k,T_k\}_{k=1}^{K}\), starting point \(\mathbf{w}_1\)
  2. For \(k=1,\dotsc,K\)
  3. run SPGD with a learning rate \(\eta_k\) for \(T_k\) iterations starting from \(\mathbf{w}_k\)
  4. return an averaged solution \(\mathbf{w}_{k+1}\)

A Restarted Approach

The \(\log T\) factor in the convergence bound can be removed using a restarting scheme, which runs in multiple stages. At stage \(k\), it starts from \(\mathbf{w}_k\) with a step size \(\eta_k\), runs SPGD for \(T_k\) iterations, and returns an averaged solution \(\mathbf{w}_{k+1}\). By choosing \(\eta_k,T_k\) appropriately, after a logarithmic number \(K\) of stages, we obtain a solution \(\mathbf{w}_{K+1}\) satisfying \(\mathbb{E}[F(\mathbf{w}_{K+1})-F(\mathbf{w}_*)]\leq \epsilon\).

The key motivation comes from the one-stage convergence bound in Theorem 3.5:

\[\begin{align}\label{eqn:one-epoch-SGD} \mathbb{E}[F(\bar{\mathbf{w}}_T)-F(\mathbf{w}_*)]\leq \frac{\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{2\eta T}+\eta\sigma^2. \end{align}\]

Since the \(\mu\)-strong convexity of \(F\) implies \(\|\mathbf{w}_1-\mathbf{w}_*\|_2^2\leq \frac{2}{\mu}(F(\mathbf{w}_1)-F(\mathbf{w}_*))\), we can establish a stage-wise recursion for the objective gap, from which the gap decreases geometrically as long as \(\eta_k\) decreases geometrically and \(T_k\) increases accordingly. This is formally stated in the following theorem.

Theorem 3.7

Suppose Assumption 3.6 holds, \(F\) is \(\mu\)-strongly convex and there exists \(\epsilon_0\) such that \(F(\mathbf{w}_1)-F(\mathbf{w}_*)\leq \epsilon_0/2\). Let \(\eta_k=\min(\frac{1}{L},\frac{\epsilon_0}{2^{k+2}\sigma^2})\) and \(T_k=\frac{4}{\mu\eta_k}\). Then after \(K=\lfloor\log_2(\epsilon_0/\epsilon)\rfloor\) stages of Restarted SPGD (Alg. 3), we have

\[\begin{align*} \mathbb{E}\left[F(\mathbf{w}_{K+1})-F(\mathbf{w}_*)\right]\leq \epsilon. \end{align*}\]

The iteration complexity is \(\sum_{k=1}^K T_k = O\left(\frac{\sigma^2}{\mu\epsilon}+\frac{L}{\mu}\log\left(\frac{\epsilon_0}{\epsilon}\right)\right)\).

💡 Why it matters
The learning-rate schedule in this theorem follows the widely used step-decay scheme. The dominant term in the iteration complexity is \(O\left(\frac{\sigma^2}{\mu\epsilon}\right)\), which improves upon the bound in Theorem 3.6. Moreover, in the deterministic case \(\sigma=0\), it yields a sharper complexity of \(O\left(\frac{L}{\mu}\log\left(\frac{\epsilon_0}{\epsilon}\right)\right)\), compared with \(O\left(\frac{(L+\mu_r)\|\mathbf w_1-\mathbf w_*\|_2^2}{\epsilon}\right)\) as indicated by Theorem 3.6.

Proof.
Let \(\epsilon_k=\epsilon_0/2^k, \forall k\geq 1\). Then \(\epsilon_{K+1} = \epsilon_0/2^{K+1}\leq \epsilon\) and \(\epsilon_K\geq \epsilon\).

Applying the one-stage analysis of SPGD, we have

\[\begin{align*} \mathbb{E}[F({\mathbf{w}}_{k+1})-F(\mathbf{w}_*)] &\leq \frac{\mathbb{E}[\|\mathbf{w}_k-\mathbf{w}_*\|_2^2]}{2\eta_k T_k}+\eta_k\sigma^2 \leq \frac{\mathbb{E}[F(\mathbf{w}_k)-F(\mathbf{w}_*)]}{\mu\eta_k T_k}+\eta_k\sigma^2. \end{align*}\]

Then we prove by induction. Assuming \(\mathbb{E}[F(\mathbf{w}_k)-F(\mathbf{w}_*)]\leq \epsilon_k\), we prove that \(\mathbb{E}[F(\mathbf{w}_{k+1})-F(\mathbf{w}_*)]\leq \epsilon_{k+1}\) as follows.

\[\begin{align*} \mathbb{E}[F({\mathbf{w}}_{k+1})-F(\mathbf{w}_*)] &\leq \frac{\mathbb{E}[F(\mathbf{w}_k)-F(\mathbf{w}_*)]}{\mu\eta_k T_k}+\eta_k\sigma^2\\ &\leq \frac{\epsilon_k}{\mu\eta_k T_k}+\eta_k\sigma^2 \leq \frac{\epsilon_k}{\mu\eta_k T_k}+\frac{\epsilon_{k+1}}{2} \leq \frac{\epsilon_k}{4}+\frac{\epsilon_{k+1}}{2} =\epsilon_{k+1}. \end{align*}\]

Thus, \(\mathbb{E}[F(\mathbf{w}_{K+1})-F(\mathbf{w}_*)]\leq \epsilon_{K+1}\leq \epsilon\). The total number of iterations is

\[\begin{align*} \sum_{k=1}^K T_k &=\sum_{k=1}^K \frac{4}{\mu\eta_k} =\sum_{k=1}^K \max\left(\frac{4\cdot 2^{k+1}\sigma^2}{\mu\epsilon_1},\frac{4L}{\mu}\right)\\ &\leq \sum_{k=1}^K \max\left(\frac{8\sigma^2}{\mu\epsilon 2^{K-k}},\frac{4L}{\mu}\right) =O\left(\frac{\sigma^2}{\mu\epsilon}+\frac{L}{\mu}\log\left(\frac{\epsilon_0}{\epsilon}\right)\right). \end{align*}\]
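The stage-wise schedule analyzed above can be sketched on a toy one-dimensional strongly convex quadratic with \(r=0\) (so each stage reduces to plain SGD with averaging); the constants \(\mu,L,\sigma^2\) and the stage routine are illustrative choices mirroring \(\eta_k=\min(\frac{1}{L},\frac{\epsilon_0}{2^{k+2}\sigma^2})\) and \(T_k=\frac{4}{\mu\eta_k}\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, L, sigma2 = 1.0, 1.0, 0.5     # g(w) = (mu/2) w^2, gradient-noise variance sigma2

def spgd_stage(w1, eta, T):
    # run SPGD for T iterations with constant step eta; r = 0 here,
    # so the prox is the identity; return the averaged iterate
    w, avg = w1, 0.0
    for _ in range(T):
        z = mu * w + rng.normal(scale=np.sqrt(sigma2))   # unbiased gradient
        w = w - eta * z
        avg += w / T
    return avg

def restarted_spgd(w1, eps0, K):
    w = w1
    for k in range(1, K + 1):
        eta_k = min(1.0 / L, eps0 / (2 ** (k + 2) * sigma2))  # geometrically decayed step
        T_k = int(np.ceil(4.0 / (mu * eta_k)))                # correspondingly longer stage
        w = spgd_stage(w, eta_k, T_k)
    return w

w_out = restarted_spgd(w1=2.0, eps0=4.0, K=6)   # F(w1) - F* = 2 <= eps0 / 2
```

Each stage halves the target gap, matching the step-decay picture in the "Why it matters" note above.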

Last-iterate Convergence

Furthermore, if \(g(\cdot)\) and/or \(r\) is strongly convex, we can also prove \(\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2\) converges to zero.

Lemma 3.8
If \(g\) is \(L\)-smooth and \(\mu\)-strongly convex and \(r\) is \(\mu_r\)-strongly convex, for the update (\(\ref{eqn:spgd}\)) with \(\eta_t\le2/L\) we have

\[\begin{align} \mathbb{E}_{\zeta_t}[\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2]\le\frac{(1-(2\eta_t-\eta_t^2L)\mu)\|\mathbf{w}_t-\mathbf{w}_*\|_2^2+\eta_t^2\sigma^2}{1+\eta_t\mu_r}. \end{align}\]

If \(g\) is \(\mu\)-strongly convex and \(\|\partial g(\mathbf{w})\|_2\le G\) for \(\mathbf{w}\in\text{dom}(r)\), for the update (\(\ref{eqn:spgd}\)) we have

\[\begin{align} \mathbb{E}_{\zeta_t}[\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2]\le\frac{(1-2\eta_t\mu)\|\mathbf{w}_t-\mathbf{w}_*\|_2^2+\eta_t^2(\sigma^2+4G^2)}{1+\eta_t\mu_r}. \end{align}\]

Proof.
Let \(\mathbb{E}_t=\mathbb{E}_{\zeta_t}\). Let us consider the smooth case first. Due to the optimality condition of \(\mathbf{w}_*\), we have \[\begin{align*} \mathbf{w}_* &= \arg\min_{\mathbf{w}\in\mathbb{R}^d}\nabla g(\mathbf{w}_*)^{\top}(\mathbf{w}-\mathbf{w}_*)+\frac{1}{2\eta_t}\|\mathbf{w}-\mathbf{w}_*\|_2^2+r(\mathbf{w})\\ &=\operatorname{prox}_{\eta_t r}(\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)). \end{align*}\]

Due to the Lipschitz continuity of the prox operator (see Lemma 1.7), we have

\[\begin{equation}\label{eqn:recl} \begin{split} & \mathbb{E}_t\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2 \le \frac{1}{1+\eta_t\mu_r}\mathbb{E}_t\|\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t;\zeta_t)-[\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)]\|_2^2. \end{split} \end{equation}\]

Next, we bound \[\begin{equation*} \begin{split} &\mathbb{E}_t\|\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t;\zeta_t)-[\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)]\|_2^2\\ &=\mathbb{E}_t\|[\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t)]-[\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)]+\eta_t\nabla g(\mathbf{w}_t)-\eta_t\nabla g(\mathbf{w}_t;\zeta_t)\|_2^2\\ &\le\mathbb{E}_t\|[\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t)]-[\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)]\|_2^2+\eta_t^2\sigma^2, \end{split} \end{equation*}\] where the last inequality follows from expanding the square and using \(\mathbb{E}_t[\nabla g(\mathbf{w}_t;\zeta_t)-\nabla g(\mathbf{w}_t)]=0\) together with the variance bound \(\mathbb{E}_t\|\nabla g(\mathbf{w}_t;\zeta_t)-\nabla g(\mathbf{w}_t)\|_2^2\le\sigma^2\). Let us bound the first term below. \[\begin{equation*} \begin{split} &\mathbb{E}_t\|[\mathbf{w}_t-\eta_t\nabla g(\mathbf{w}_t)]-[\mathbf{w}_*-\eta_t\nabla g(\mathbf{w}_*)]\|_2^2\\ &=\mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2+\eta_t^2\mathbb{E}_t\|\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_*)\|_2^2-2\eta_t\mathbb{E}_t(\mathbf{w}_t-\mathbf{w}_*)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_*))\\ &\le \mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2+\eta_t^2L\mathbb{E}_t(\mathbf{w}_t-\mathbf{w}_*)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_*))\\ &\quad-2\eta_t\mathbb{E}_t(\mathbf{w}_t-\mathbf{w}_*)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_*))\\ &=\mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-(2\eta_t-\eta_t^2L)\mathbb{E}_t(\mathbf{w}_t-\mathbf{w}_*)^{\top}(\nabla g(\mathbf{w}_t)-\nabla g(\mathbf{w}_*))\\ &\le \mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2-(2\eta_t-\eta_t^2L)\mu\mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2\\ &= (1-(2\eta_t-\eta_t^2L)\mu)\mathbb{E}_t\|\mathbf{w}_t-\mathbf{w}_*\|_2^2, \end{split} \end{equation*}\] where the first inequality uses Lemma 1.5 (c) and the second inequality follows from Lemma 1.6 (c).

If \(g\) is non-smooth, we instead bound \(\mathbb{E}_t\|\partial g(\mathbf{w}_t)-\partial g(\mathbf{w}_*)\|_2^2\le 4G^2\) by the assumption \(\|\partial g(\mathbf{w})\|_2\le G\), and use only the strong convexity of \(g\) to obtain the factor \(1-2\eta_t\mu\). Combining this with (\(\ref{eqn:recl}\)) concludes the proof.

Theorem 3.8
Suppose Assumption 3.6 holds and \(g\) is \(\mu\)-strongly convex and \(r\) is \(\mu_r\)-strongly convex. Let \(\eta_t=\eta\le\min(1/L,1/\mu_r)\). Then after \(T\) iterations of SPGD (\(\ref{eqn:spgd}\)) update, we have

\[\begin{align} \mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\le e^{-\frac{\eta(\mu+\mu_r)T}{2}}\mathbb{E}[\|\mathbf{w}_1-\mathbf{w}_*\|_2^2]+\frac{2\eta\sigma^2}{\mu+\mu_r}. \end{align}\]

💡 Why it matters
This theorem indicates that if we set \(\eta= O((\mu+\mu_r)\epsilon/\sigma^2)\), then with \(T=\tilde O\left(\frac{\sigma^2}{(\mu+\mu_r)^2\epsilon}\right)\) iterations, the algorithm finds a solution \(\mathbf{w}_{T+1}\) that is \(\epsilon\)-close to the optimal solution \(\mathbf{w}_*\) measured by \(\mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\), where \(\tilde O(\cdot)\) hides a logarithmic factor of \(\log(1/\epsilon)\).

Building on this result, one can similarly establish the convergence for Restarted SPGD with option II and a step-decay learning-rate schedule in Theorem 3.7; we leave the details as an exercise for interested readers.

Proof.
If \(\eta\le1/L\), then \(2\eta-\eta^2L\ge\eta\), so Lemma 3.8 implies that \[\begin{align*} \mathbb{E}[\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2] &\le \frac{(1-\eta\mu)\mathbb{E}[\|\mathbf{w}_t-\mathbf{w}_*\|_2^2]+\eta^2\sigma^2}{1+\eta\mu_r}\\ &\le \bigg(1-\frac{\eta\mu_r}{2}\bigg)\{(1-\eta\mu)\mathbb{E}[\|\mathbf{w}_t-\mathbf{w}_*\|_2^2]+\eta^2\sigma^2\}\\ &\le \bigg(1-\frac{\eta\mu_r}{2}-\eta\mu+\frac{\eta^2\mu\mu_r}{2}\bigg)\mathbb{E}[\|\mathbf{w}_t-\mathbf{w}_*\|_2^2]+\eta^2\sigma^2, \end{align*}\] where the second inequality is due to \(\frac{1}{1+\eta\mu_r}\le1-\frac{\eta\mu_r}{2}\), which holds because \(1\le(1+\eta\mu_r)(1-\frac{\eta\mu_r}{2})=1+\frac{\eta\mu_r}{2}-\frac{\eta^2\mu_r^2}{2}\) as \(\eta\mu_r\le1\). Since \(\eta\mu_r\le1\) also implies \(\frac{\eta^2\mu\mu_r}{2}\le\frac{\eta\mu}{2}\), we obtain \[\begin{align*} \mathbb{E}[\|\mathbf{w}_{t+1}-\mathbf{w}_*\|_2^2]\le\Big(1-\frac{\eta\mu_r}{2}-\frac{\eta\mu}{2}\Big)\mathbb{E}[\|\mathbf{w}_t-\mathbf{w}_*\|_2^2]+\eta^2\sigma^2. \end{align*}\] Applying this inequality \(T\) times gives \[\begin{align*} &\mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\\ &\le\bigg(1-\frac{\eta(\mu+\mu_r)}{2}\bigg)^T\mathbb{E}\|\mathbf{w}_1-\mathbf{w}_*\|_2^2+\sum_{t=0}^{T-1}\bigg(1-\frac{\eta(\mu+\mu_r)}{2}\bigg)^t\eta^2\sigma^2. \end{align*}\] Since \((1-\alpha)^T\le e^{-\alpha T}\) for \(\alpha\in(0,1)\) and \(\sum_{t=0}^{T-1}\alpha^t<\frac{1}{1-\alpha}\), we have \[\begin{align*} &\mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\\ &\le e^{-\frac{\eta(\mu+\mu_r)T}{2}}\mathbb{E}[\|\mathbf{w}_1-\mathbf{w}_*\|_2^2]+\eta^2\sigma^2\frac{2}{\eta(\mu+\mu_r)}\\ &= e^{-\frac{\eta(\mu+\mu_r)T}{2}}\mathbb{E}[\|\mathbf{w}_1-\mathbf{w}_*\|_2^2]+\frac{2\eta\sigma^2}{\mu+\mu_r}. \end{align*}\]

Corollary 3.9
Under the setting of Theorem 3.8, if \(\frac{1}{\eta_t}=\frac{\bar\mu}{2}+\sqrt{(\frac{\bar\mu}{2})^2+\frac{1}{\eta_{t-1}^2}}\) with \(\eta_0\le\min(1/L,1/\mu_r)\) and \(\bar\mu=(\mu+\mu_r)/2\), then we have

\[ \mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\le\frac{4\|\mathbf{w}_1-\mathbf{w}_*\|_2^2}{\eta_0^2\bar\mu^2T^2}+\frac{4\sigma^2}{\bar\mu^2T}. \]

💡 Why it matters
This corollary shows that a decreasing learning rate schedule can be used without requiring prior knowledge of \(\epsilon\), in order to obtain a solution \(\mathbf{w}_{T+1}\) that is \(\epsilon\)-close to the optimum \(\mathbf{w}_*\), measured by \(\mathbb{E}[\|\mathbf{w}_{T+1}-\mathbf{w}_*\|_2^2]\). The iteration complexity is

\[ T=\mathcal{O}\left(\max\left\{\frac{1}{\bar\mu\eta_0\sqrt{\epsilon}},\frac{\sigma^2}{\bar\mu^2\epsilon}\right\}\right). \]

Proof.
Let \(\delta_t=\|\mathbf{w}_t-\mathbf{w}_*\|_2^2\). Due to the update of \(\eta_t\), we have \(1-\bar\mu\eta_t=\frac{\eta_t^2}{\eta_{t-1}^2}\). Hence, we have \[\begin{align*} \mathbb{E}[\delta_{T+1}] &\le \mathbb{E}[(1-\bar\mu\eta_T)\delta_T]+\sigma^2\eta_T^2 \le \mathbb{E}\bigg[\frac{\eta_T^2}{\eta_{T-1}^2}\delta_T\bigg]+\sigma^2\eta_T^2. \end{align*}\] Unrolling this inequality down to \(t=1\), we have \[\begin{align*} \mathbb{E}[\delta_{T+1}] &\le \mathbb{E}\bigg[\frac{\eta_T^2}{\eta_{T-2}^2}\delta_{T-1}\bigg]+2\sigma^2\eta_T^2 \le\cdots\le \frac{\eta_T^2}{\eta_0^2}\delta_1+T\sigma^2\eta_T^2. \end{align*}\] Since \(\frac{1}{\eta_t}=\frac{\bar\mu}{2}+\sqrt{(\frac{\bar\mu}{2})^2+\frac{1}{\eta_{t-1}^2}}\), we have \(\frac{1}{\eta_t}\ge\frac{\bar\mu}{2}+\frac{1}{\eta_{t-1}}\). As a result, \(\frac{1}{\eta_T}\ge\frac{\bar\mu T}{2}+\frac{1}{\eta_0}\ge\max(L,\mu_r)\), where \(\eta_0\le\min(\frac{1}{L},\frac{1}{\mu_r})\). Hence, \(\eta_T\le\frac{2}{\bar\mu T}\), and \[\begin{align*} \mathbb{E}[\delta_{T+1}]\le\frac{4\delta_1}{\eta_0^2\bar\mu^2T^2}+\frac{4\sigma^2}{\bar\mu^2T}. \end{align*}\]
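The two identities that drive this proof, \(1-\bar\mu\eta_t=\eta_t^2/\eta_{t-1}^2\) and \(\eta_T\le\frac{2}{\bar\mu T}\), can be confirmed numerically; the constants below are arbitrary illustrative choices.

```python
import numpy as np

mu_bar, eta0, T = 0.5, 0.1, 200
etas = [eta0]
for _ in range(T):
    # 1/eta_t = mu_bar/2 + sqrt((mu_bar/2)^2 + 1/eta_{t-1}^2)
    inv = mu_bar / 2 + np.sqrt((mu_bar / 2) ** 2 + 1.0 / etas[-1] ** 2)
    etas.append(1.0 / inv)

# by construction: 1 - mu_bar*eta_t == eta_t^2 / eta_{t-1}^2 for every t,
# and 1/eta_T >= mu_bar*T/2 + 1/eta_0 implies eta_T <= 2/(mu_bar*T)
```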
