Section 3.3: Stochastic Coordinate Descent
In this section, we present stochastic coordinate descent (SCD) for solving the stochastic optimization problem
\[\begin{align} \label{eqn:scd-so} \min_{\boldsymbol{\alpha}\in\Omega\subseteq\mathbb{R}^n} f(\boldsymbol{\alpha})=\mathbb{E}[f(\boldsymbol{\alpha},\xi)], \end{align}\]
where \(\Omega=\Omega_1\times\Omega_2\times\cdots\times\Omega_n\).
The key motivation is that when the dimensionality \(n\) of \(\boldsymbol{\alpha}\) is very large, computing the full gradient \(\nabla f(\boldsymbol{\alpha},\xi)\) at each iteration can be expensive. However, if the function has a decomposable structure over the coordinates of \(\boldsymbol{\alpha}\), we can sample a random coordinate and update only that coordinate. To this end, we assume that \([\nabla f(\boldsymbol{\alpha},\xi)]_i\) is easy to compute for every \(i\in[n]\). In machine learning applications, this holds, for example, when \(f(\boldsymbol{\alpha},\xi)=\boldsymbol{\alpha}^{\top}\boldsymbol{g}(\xi)\) and computing a single coordinate of \(\boldsymbol{g}(\xi)\) is much cheaper than computing the full vector \(\boldsymbol{g}(\xi)\). An example is the COCE problems, which will be discussed in Section 5.5.
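As a toy illustration of this linear structure (the particular \(g\) below is hypothetical), a single coordinate of the stochastic gradient can be read off in \(O(1)\) time, without ever forming the full vector \(\boldsymbol{g}(\xi)\):

```python
import numpy as np

# Hypothetical linear structure f(alpha, xi) = alpha^T g(xi), where each
# coordinate of g can be evaluated on its own.
def g_i(xi, i):
    """i-th coordinate of g(xi); O(1) cost, vs. O(n) for the full vector."""
    return np.cos(xi * (i + 1))  # illustrative closed form

# Since grad_alpha f(alpha, xi) = g(xi), the i-th coordinate of the
# stochastic gradient is just g_i(xi), independent of alpha.
xi, i = 0.7, 3
coord_grad = g_i(xi, i)
```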
Let us consider a simple version of SCD. At each iteration \(t\), a coordinate denoted by \(i_t\) is randomly sampled from \(\{1,\ldots,n\}\) with uniform probabilities. Then we compute \(\nabla_{i_t} f(\boldsymbol{\alpha}_t,\xi_t)=[\nabla f(\boldsymbol{\alpha}_t,\xi_t)]_{i_t}\) and update \(\boldsymbol{\alpha}\) by
\[\begin{align*} \alpha_{t+1,i}= \begin{cases} \Pi_{\Omega_i}[\alpha_{t,i}-\eta \nabla_i f(\boldsymbol{\alpha}_t,\xi_t)] & \text{if } i=i_t,\\ \alpha_{t,i} & \text{o.w.} \end{cases} \end{align*}\]
Convergence Analysis
We make the following assumption.
Assumption 3.7
The following conditions hold:
\(f(\boldsymbol{\alpha})\) is convex.
For any \(\boldsymbol{\alpha}\), we have
\(\mathbb{E}[\|\nabla_i f(\boldsymbol{\alpha},\xi)-\nabla_i f(\boldsymbol{\alpha})\|_2^2]\leq \sigma_i^2\)
for some \(\sigma_i\geq 0\).
\(\nabla f\) is \(L_i\)-Lipschitz continuous with respect to the \(i\)-th coordinate, i.e., \[ \|\nabla f(\boldsymbol{\alpha})-\nabla f(\boldsymbol{\alpha}+\mathbf{e}_i\delta)\|_2\leq L_i|\delta|. \]
Theorem 3.9
Let \(\bar{\boldsymbol{\alpha}}_T=\frac{1}{T}\sum_{t=1}^T\boldsymbol{\alpha}_{t+1}\) and \(\bar{L}=\max_i L_i\). If \(\eta_t=\eta\leq\frac{1}{\bar{L}}\), then after \(T\) iterations of the SCD update we have
\[\begin{align*} \mathbb{E}\!\left[f(\bar{\boldsymbol{\alpha}}_T)-f(\boldsymbol{\alpha}_*)\right] &\leq \frac{(n-1)\big(f(\boldsymbol{\alpha}_1)-f(\boldsymbol{\alpha}_*)\big)}{T} +\frac{n}{2\eta T}\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_*\|_2^2 +\sum_{i=1}^n \eta\sigma_i^2 . \end{align*}\]
If \(\|\boldsymbol{\alpha}_1-\boldsymbol{\alpha}_*\|_2^2\leq D^2\) and \(\sum_{i=1}^n\sigma_i^2\leq\sigma^2\), with \(\eta=O\!\left(\min\!\left(\frac{\sqrt{n}D}{\sqrt{2T}\sigma},\frac{1}{\bar{L}}\right)\right)\), we have
\[\begin{align*} \mathbb{E}\!\left[f(\bar{\boldsymbol{\alpha}}_T)-f(\boldsymbol{\alpha}_*)\right] &\leq \frac{(n-1)\big(f(\boldsymbol{\alpha}_1)-f(\boldsymbol{\alpha}_*)\big)}{T} +\frac{\sqrt{2n}D\sigma}{\sqrt{T}} +\frac{\bar{L}nD^2}{T}. \end{align*}\]
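The step size in this corollary comes from balancing the two \(\eta\)-dependent terms of the theorem; as a short sanity check (not part of the original statement), minimizing \(\frac{nD^2}{2\eta T}+\eta\sigma^2\) over \(\eta>0\) gives
\[ \eta=\frac{\sqrt{n}\,D}{\sqrt{2T}\,\sigma}, \qquad \frac{nD^2}{2\eta T}+\eta\sigma^2=\frac{\sqrt{2n}\,D\sigma}{\sqrt{T}}, \]
which recovers the middle term of the bound; when instead the cap \(\eta\le\frac{1}{\bar L}\) binds, the term \(\frac{nD^2}{2\eta T}\) becomes \(\frac{\bar L nD^2}{2T}\), matching the last term up to a constant factor.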
💡 Why it matters
According to the theorem, SCD’s iteration complexity is \(O\!\left(\frac{nD^2\sigma^2}{\epsilon^2}\right)\).
Although this is \(n\) times higher than that of SGD, it is offset by the fact that each individual iteration of SCD can be \(n\) times cheaper to compute.
Algorithm: SCD
- Input: learning rate schedule \(\{\eta_t\}_{t=1}^T\), starting point \(\boldsymbol{\alpha}_1\)
- For \(t=1,\dotsc,T\)
- Sample a coordinate \(i_t\) uniformly
- Compute an unbiased coordinate gradient estimator \(\nabla_{i_t}f(\boldsymbol{\alpha}_t,\xi_t)\)
- Update
- \[\begin{align*} \alpha_{t+1,i}= \begin{cases} \Pi_{\Omega_i}[\alpha_{t,i}-\eta_t\nabla_i f(\boldsymbol{\alpha}_t,\xi_t)] & \text{if } i=i_t \\ \alpha_{t,i} & \text{o.w.} \end{cases} \end{align*}\]
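The loop above can be sketched in NumPy as follows; the quadratic objective, box constraint \(\Omega_i=[0,1]\), and noise model in the usage example are illustrative assumptions, not part of the algorithm:

```python
import numpy as np

def scd(grad_i, project_i, alpha0, eta, T, seed=0):
    """Stochastic coordinate descent with uniform coordinate sampling.

    grad_i(alpha, i, xi) -> unbiased estimate of [grad f(alpha)]_i,
    project_i(v, i)      -> projection of scalar v onto Omega_i.
    Returns the averaged iterate bar-alpha_T.
    """
    rng = np.random.default_rng(seed)
    alpha = np.array(alpha0, dtype=float)
    n = alpha.size
    avg = np.zeros(n)
    for _ in range(T):
        i = rng.integers(n)          # sample coordinate i_t uniformly
        xi = rng.standard_normal()   # stochastic sample xi_t (toy noise source)
        alpha[i] = project_i(alpha[i] - eta * grad_i(alpha, i, xi), i)
        avg += alpha                 # accumulate for the averaged iterate
    return avg / T

# Toy problem: f(alpha) = 0.5 * ||alpha - b||^2 over the box Omega_i = [0, 1],
# with noisy coordinate gradients (all names here are illustrative).
b = np.array([0.3, 0.9, -0.2])
grad_i = lambda a, i, xi: (a[i] - b[i]) + 0.01 * xi
project_i = lambda v, i: min(max(v, 0.0), 1.0)
alpha_bar = scd(grad_i, project_i, np.zeros(3), eta=0.5, T=5000)
# alpha_bar lands close to the constrained minimizer (0.3, 0.9, 0.0)
```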
Proof.
To facilitate the analysis, we consider a virtual sequence \(\{\tilde{\boldsymbol{\alpha}}_t\}\) defined by
\[ \tilde{\boldsymbol{\alpha}}_{t+1}=\Pi_{\Omega}[\boldsymbol{\alpha}_t-\eta_t\nabla f(\boldsymbol{\alpha}_t,\xi_t)] . \]
Since \(\Omega=\Omega_1\times\cdots\times\Omega_n\) is a product set, the projection decomposes coordinate-wise:
\[ \tilde{\alpha}_{t+1,i}=\Pi_{\Omega_i}[\alpha_{t,i}-\eta_t\nabla_i f(\boldsymbol{\alpha}_t,\xi_t)],\ \forall i. \]
Applying Lemma 3.6 to each coordinate of \(\tilde{\boldsymbol{\alpha}}_{t+1}\) with \(r(\alpha_i)=\mathbb{I}_{0-\infty}(\alpha_i\in\Omega_i)\) (the indicator that equals \(0\) if \(\alpha_i\in\Omega_i\) and \(+\infty\) otherwise), we have
\[\begin{align*} \mathbb{E}[\nabla_i f(\boldsymbol{\alpha}_t,\xi_t)^{\top}(\tilde{\alpha}_{t+1,i}-\alpha_{*,i})] &\leq \frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\alpha_{*,i}\|_2^2] -\frac{1}{2\eta_t}\mathbb{E}[\|\tilde{\alpha}_{t+1,i}-\alpha_{*,i}\|_2^2] \\ &-\frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\tilde{\alpha}_{t+1,i}\|_2^2]. \end{align*}\]
Adding and subtracting \(\nabla_i f(\boldsymbol{\alpha}_t)\) on the left-hand side, we obtain
\[\begin{align*} \mathbb{E}[\nabla_i f(\boldsymbol{\alpha}_t)^{\top}(\tilde{\alpha}_{t+1,i}-\alpha_{*,i})] &\leq \frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\alpha_{*,i}\|_2^2] -\frac{1}{2\eta_t}\mathbb{E}[\|\tilde{\alpha}_{t+1,i}-\alpha_{*,i}\|_2^2] \\ &-\frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\tilde{\alpha}_{t+1,i}\|_2^2] \\ &+\mathbb{E}[(\nabla_i f(\boldsymbol{\alpha}_t)-\nabla_i f(\boldsymbol{\alpha}_t,\xi_t))^{\top} (\tilde{\alpha}_{t+1,i}-\alpha_{*,i})]. \end{align*}\]
Similar to Lemma 3.2, the last term on the RHS can be bounded by \[\mathbb E[(\nabla_i f(\boldsymbol{\alpha}_t) -\nabla_i f(\boldsymbol{\alpha}_t, \xi_t) )^{\top}(\tilde\alpha_{t+1,i} - \alpha_{*,i})]\leq \eta_t\,\mathbb E\big[(\nabla_i f(\boldsymbol{\alpha}_t) -\nabla_i f(\boldsymbol{\alpha}_t, \xi_t))^2\big]\leq \eta_t\sigma_i^2.\]
Substituting this bound into the previous inequality, we obtain
\[\begin{align*} \mathbb{E}[\nabla_i f(\boldsymbol{\alpha}_t)^{\top}(\tilde{\alpha}_{t+1,i}-\alpha_{*,i})] &\leq \frac{1}{2\eta_t}\mathbb{E}\!\left[\|\alpha_{t,i}-\alpha_{*,i}\|_2^2 -\|\tilde{\alpha}_{t+1,i}-\alpha_{*,i}\|_2^2\right] \\ &-\frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\tilde{\alpha}_{t+1,i}\|_2^2] +\eta_t\sigma_i^2 . \end{align*}\]
Due to the randomness of \(i_t\), we have \[\begin{align*} &\mathbb{E}[(\alpha_{t+1,i}-\alpha_{*,i})^2] = \frac{1}{n}\mathbb{E}[(\tilde{\alpha}_{t+1,i}-\alpha_{*,i})^2] +\left(1-\frac{1}{n}\right)\mathbb{E}[(\alpha_{t,i}-\alpha_{*,i})^2],\\ &\mathbb{E}[\nabla_i f(\alpha_t)^{\top}(\alpha_{t+1,i}-\alpha_{*,i})] = \frac{1}{n}\mathbb{E}[\nabla_i f(\alpha_t)^{\top}(\tilde{\alpha}_{t+1,i}-\alpha_{*,i})] \\ &\qquad +\left(1-\frac{1}{n}\right)\mathbb{E}[\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i})],\\ &\mathbb{E}[\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2] = \frac{1}{n}\mathbb{E}[\|\alpha_{t,i}-\tilde{\alpha}_{t+1,i}\|_2^2]. \end{align*}\]
Combining the above, we have \[\begin{align*} &\mathbb{E}\Big[n\nabla_i f(\alpha_t)^{\top}(\alpha_{t+1,i}-\alpha_{*,i}) -(n-1)\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i})\Big] \\ &\le \frac{1}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\alpha_{*,i}\|_2^2] -\frac{1}{2\eta_t}\mathbb{E}\big[n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2 -(n-1)\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\big] \\ &\qquad -\frac{n}{2\eta_t}\mathbb{E}[\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2] +\eta_t\sigma_i^2 . \end{align*}\]
Adding this over \(i=1,\ldots,n\), we obtain \[\begin{align*} &\mathbb{E}\Big[ n\sum_{i=1}^n \nabla_i f(\alpha_t)^{\top}(\alpha_{t+1,i}-\alpha_{*,i}) -\sum_{i=1}^n (n-1)\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \Big] \\ &\le \frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2\right] \\ &\qquad -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2\right] +\sum_{i=1}^n \eta_t\sigma_i^2 . \end{align*}\]
For the LHS, we have \[\begin{align*} &n\sum_{i=1}^n\nabla_i f(\alpha_t)^{\top}(\alpha_{t+1,i}-\alpha_{*,i}) -\sum_{i=1}^n (n-1)\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \\ &= n\nabla_{i_t} f(\alpha_t)^{\top}(\alpha_{t+1,i_t}-\alpha_{*,i_t}) +n\sum_{i\neq i_t}\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \\ &\qquad -\sum_{i=1}^n (n-1)\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \\ &= n\nabla_{i_t} f(\alpha_t)^{\top}(\alpha_{t+1,i_t}-\alpha_{*,i_t}) -n\nabla_{i_t} f(\alpha_t)^{\top}(\alpha_{t,i_t}-\alpha_{*,i_t}) \\ &\qquad +\sum_{i=1}^n \nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \\ &= n\nabla_{i_t} f(\alpha_t)^{\top}(\alpha_{t+1,i_t}-\alpha_{t,i_t}) +\nabla f(\alpha_t)^{\top}(\alpha_t-\alpha_*). \end{align*}\]
By the assumption, we have \[\begin{align*} &\nabla_{i_t} f(\alpha_t)^{\top}(\alpha_{t+1,i_t}-\alpha_{t,i_t}) = \nabla f(\alpha_t)^{\top}\mathbf e_{i_t}(\alpha_{t+1,i_t}-\alpha_{t,i_t}) \\ &\qquad \ge f(\alpha_{t+1})-f(\alpha_t) -\frac{L_{i_t}}{2}\|\alpha_{t+1,i_t}-\alpha_{t,i_t}\|_2^2, \\ &\nabla f(\alpha_t)^{\top}(\alpha_t-\alpha_*) \ge f(\alpha_t)-f(\alpha_*). \end{align*}\]
Combining the above, we obtain \[\begin{align*} &n\sum_{i=1}^n\nabla_i f(\alpha_t)^{\top}(\alpha_{t+1,i}-\alpha_{*,i}) -\sum_{i=1}^n(n-1)\nabla_i f(\alpha_t)^{\top}(\alpha_{t,i}-\alpha_{*,i}) \\ &\ge n\Big(f(\alpha_{t+1})-f(\alpha_t) -\frac{L_{i_t}}{2}\|\alpha_{t+1,i_t}-\alpha_{t,i_t}\|_2^2\Big) +f(\alpha_t)-f(\alpha_*). \end{align*}\]
Thus, \[\begin{align*} &\mathbb{E}\Big[ n\Big(f(\alpha_{t+1})-f(\alpha_t) -\frac{L_{i_t}}{2}\|\alpha_{t+1,i_t}-\alpha_{t,i_t}\|_2^2\Big) +f(\alpha_t)-f(\alpha_*) \Big] \\ &\le \frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2\right] \\ &\qquad -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2\right] +\sum_{i=1}^n \eta_t\sigma_i^2 . \end{align*}\]
Rearranging gives \[\begin{align*} &\mathbb{E}\Big[ n\big(f(\alpha_{t+1})-f(\alpha_*)\big) -(n-1)\big(f(\alpha_t)-f(\alpha_*)\big) \Big] \\ &\le \frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2\right] \\ &\qquad +\sum_{i=1}^n \eta_t\sigma_i^2 +\mathbb{E}\!\left[\frac{nL_{i_t}}{2}\|\alpha_{t+1,i_t}-\alpha_{t,i_t}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2\right]. \end{align*}\]
This equals \[\begin{align*} &\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2\right] +\sum_{i=1}^n\eta_t\sigma_i^2 \\ &\quad +\mathbb{E}\!\left[\sum_{i=1}^n\frac{nL_i}{2}\|\alpha_{t+1,i}-\alpha_{t,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{t+1,i}\|_2^2\right]. \end{align*}\]
If \(\eta_t\le \frac{1}{\bar L}\), the sum of the last two terms is non-positive. Hence, \[\begin{align*} &\mathbb{E}[f(\alpha_{t+1})-f(\alpha_*)] \\ &\le \mathbb{E}\Big[ (n-1)(f(\alpha_t)-f(\alpha_*)) -(n-1)(f(\alpha_{t+1})-f(\alpha_*)) \Big] \\ &\quad +\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t,i}-\alpha_{*,i}\|_2^2\right] -\frac{n}{2\eta_t}\mathbb{E}\!\left[\sum_{i=1}^n\|\alpha_{t+1,i}-\alpha_{*,i}\|_2^2\right] +\sum_{i=1}^n\eta_t\sigma_i^2 . \end{align*}\]
Averaging over \(t=1,\ldots,T\) with \(\eta_t=\eta\) and telescoping gives \[\begin{align*} \mathbb{E}\!\left[ \frac{1}{T}\sum_{t=1}^T f(\alpha_{t+1})-f(\alpha_*) \right] &\le \frac{(n-1)(f(\alpha_1)-f(\alpha_*))}{T} +\frac{n}{2\eta T}\|\alpha_1-\alpha_*\|_2^2 \\ &\quad +\sum_{i=1}^n \eta\sigma_i^2 . \end{align*}\] The claim then follows from Jensen's inequality, since convexity of \(f\) gives \(f(\bar{\boldsymbol{\alpha}}_T)\le \frac{1}{T}\sum_{t=1}^T f(\boldsymbol{\alpha}_{t+1})\).
■