5.5.2 A Geometry-aware Algorithm for Entropic Risk
Although the previous section presents a general algorithm for solving COCE risk minimization, that algorithm may exhibit numerical instability and slow convergence when applied to compositional entropic risk minimization:
\[\begin{equation*} \begin{aligned} \min_{\mathbf{w}} & \min_{\nu}\left[F(\mathbf{w},\nu) = \frac{1}{n}\sum_{i=1}^n\{\mathbb{E}_{\zeta}\exp(s_i(\mathbf{w};\zeta) - \nu_{i}) - 1 + \nu_{i}\}\right]\\ =&\min_{\mathbf{w}}\frac{1}{n}\sum_{i=1}^n\log\left(\mathbb{E}_{\zeta}\exp(s_i(\mathbf{w}; \zeta))\right). \end{aligned} \end{equation*}\]
The numerical instability is caused by dealing with exponential functions, e.g., \(\exp(s_i(\mathbf{w};\zeta) - \nu_{i})\), in the calculation of stochastic gradients with respect to \(\nu_i\). The slow convergence arises because the standard SGD update for \(\nu_i\) fails to exploit the geometric structure of the problem.
5.5.2.1 Stochastic Optimization of Log-E-Exp
We first consider a simplified problem where there is only one component \(n=1\), i.e.,
\[\begin{align}\label{eqn:log-e-exp} \min_{\mathbf{w}}F_{\text{ENT}}(\mathbf{w}) := \log\left(\mathbb{E}_{\zeta}\exp(s(\mathbf{w}; \zeta))\right). \end{align}\]
The KL-regularized DRO problem is a special case. The objective is also known as log-E-Exp, a more general form of the log-sum-exp function, where the middle “E” denotes an expectation and highlights the associated computational challenges.
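To make the instability concrete, the following minimal NumPy sketch (all values illustrative) evaluates the log-E-Exp objective on a sample of large losses: a naive evaluation of \(\log(\mathbb{E}\exp(s))\) overflows double precision, while a shifted, log-sum-exp style evaluation remains finite.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(loc=750.0, scale=1.0, size=1000)  # large losses: exp(s) overflows float64

# Naive evaluation: exp(750) overflows to inf, so the objective is unusable.
naive = np.log(np.mean(np.exp(s)))               # -> inf (with an overflow warning)

# Stable evaluation: subtract max(s) before exponentiating (log-sum-exp trick).
s_max = s.max()
stable = s_max + np.log(np.mean(np.exp(s - s_max)))

print(naive, stable)  # inf vs. a finite value close to 750
```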
Application of SCGD
As at the beginning of Section 4.1, we can treat this problem as a special case of stochastic compositional optimization (SCO), where the outer function is \(f(\cdot)=\log(\cdot)\) and the inner function is \(g(\mathbf{w})=\mathbb{E}_\zeta[\exp(s(\mathbf{w}; \zeta))]\). Let us first apply the SCGD algorithm. The key updates are presented below:
\[\begin{equation}\label{eqn:scgd-dro} \begin{aligned} &u_{t} = (1-\gamma_t) u_{t-1} + \gamma_t \exp(s(\mathbf{w}_t; \zeta_t)),\\ &\mathbf{z}_t = \frac{1}{u_t}\exp(s(\mathbf{w}_t; \zeta'_t))\nabla s(\mathbf{w}_t; \zeta'_t),\\ &\mathbf{w}_{t+1}= \mathbf{w}_t - \eta_t \mathbf{z}_t, \end{aligned} \end{equation}\]
where \(u_t\) is an estimator of the inner function value \(g(\mathbf{w}_t)\) and \(\mathbf{z}_t=\nabla f(u_t)\nabla g(\mathbf{w}_t; \zeta'_t)\) is a gradient estimator at \(\mathbf{w}_t\).
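For concreteness, here is a minimal NumPy sketch of the updates (\(\ref{eqn:scgd-dro}\)); the scaled quadratic loss \(s(\mathbf{w};\zeta)=\|\mathbf{w}-\zeta\|_2^2/4\), the Gaussian data, and the step-size schedules are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def s(w, zeta):       # illustrative loss; kept small so E[exp(s)] is finite
    return 0.25 * np.sum((w - zeta) ** 2)

def grad_s(w, zeta):  # its gradient in w
    return 0.5 * (w - zeta)

w = np.zeros(2)
u = 1.0               # estimator of the inner value g(w) = E[exp(s(w; zeta))]
for t in range(1, 1001):
    gamma, eta = t ** -0.5, 0.1 * t ** -0.5           # assumed step-size schedules
    zeta, zeta2 = rng.normal(size=2), rng.normal(size=2)
    u = (1 - gamma) * u + gamma * np.exp(s(w, zeta))  # moving-average inner estimate
    z = np.exp(s(w, zeta2)) * grad_s(w, zeta2) / u    # gradient estimator z_t
    w = w - eta * z                                   # SGD-style update of w
```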
From a practitioner’s perspective, the algorithm can be readily implemented and applied to real applications. However, from a theoretical perspective, several open problems remain. In particular: (1) Can we establish an \(O(1/\epsilon^2)\) convergence rate for this algorithm to find an \(\epsilon\)-optimal solution when \(s(\mathbf{w};\zeta)\) is convex? (2) If yes, what are the practical advantages of this algorithm compared with the SGD method presented in the previous section?
Wait! Shouldn’t we have established the convergence rate of SCGD in Chapter 4? It is true that we presented a convergence analysis of the above algorithm for non-convex problems under proper conditions; however, it remains an open problem to establish the complexity of \(O(1/\epsilon^2)\) for finding an \(\epsilon\)-optimal solution under the convexity of \(s(\mathbf{w}; \zeta)\). A naive analysis of SCGD for convex problems yields a complexity of \(O(1/\epsilon^4)\) (see Wang et al. (2017)).
A Novel Algorithm
To address these open questions, we present a novel algorithm based on the min-min reformulation of log-E-Exp, i.e.,
\[\begin{equation}\label{eqn:log-e-exp-mm} \begin{aligned} &\min_{\mathbf{w}} \min_{\nu}F(\mathbf{w},\nu) :=\mathbb{E}_{\zeta}\exp(s(\mathbf{w};\zeta) - \nu) + \nu, \end{aligned} \end{equation}\]
where we ignored the constant \(-1\) in the objective. As proved in Lemma 5.20, \(F(\mathbf{w}, \nu)\) is jointly convex in \((\mathbf{w}, \nu)\) when \(s(\mathbf{w}; \zeta)\) is convex.
Motivation
The key novelty of our design is a geometry-aware algorithm for solving the equivalent min-min optimization (\(\ref{eqn:log-e-exp-mm}\)). Let us first discuss the motivation. One challenge in solving the min-min optimization problem is that the objective function \(F(\mathbf{w}, \nu)\) could have an exponentially large smoothness constant with respect to \(\nu\). We will formally analyze this phenomenon in Section 5.5.2.3. Hence, a vanilla gradient method that uses the first-order approximation of \(F\) will inevitably be impacted by the large smoothness parameter.
To mitigate the adverse effects of a large smoothness parameter with respect to \(\nu\), we resort to the classical approach of employing a proximal mapping. Proximal mappings have been widely used to handle the non-smooth part of composite objectives consisting of a smooth loss and a non-smooth regularizer. This approach enables optimization algorithms to retain the favorable convergence properties of smooth optimization and often leads to faster convergence despite the presence of non-smooth terms. Analogously, even when a function is smooth but characterized by a very large smoothness parameter, applying its proximal mapping can effectively alleviate the negative impact of this large smoothness constant.
However, there is an important distinction from classical proximal methods, which typically rely on direct access to the function of interest for computing the proximal mapping. In our setting, we cannot directly apply the proximal mapping of \(F(\mathbf{w},\nu)\). Instead, we only have access to a stochastic estimator \(\Phi(\mathbf{w},\nu;\zeta) = e^{s(\mathbf{w};\zeta)-\nu} + \nu\), defined for a random sample \(\zeta\). As a result, it becomes necessary to explicitly account for the noise introduced by this stochastic approximation.
Algorithm
To account for the stochastic noise, we introduce a Bregman divergence \(D_\varphi(\cdot,\cdot)\) and update \(\nu_t\) according to the following scheme:
\[\begin{align}\label{eqn:SCENT-nu} \nu_{t} = \arg\min_\nu \Phi(\mathbf{w}_t, \nu; \zeta_t) + \frac{1}{\alpha_t}D_{\varphi}(\nu, \nu_{t-1}), \end{align}\]
where \(\zeta_t\sim\mathbb{P}\) is a random sample and \(\alpha_t>0\) is a step size parameter. We refer to this step as stochastic proximal mirror descent (SPMD) update. To respect the geometry of the stochastic objective \(\Phi(\mathbf{w}_t,\nu;\zeta_t)\), we construct a tailored Bregman divergence induced by the function \(\varphi(\nu)=e^{-\nu}\), namely,
\[\begin{align}\label{eqn:breg} D_{\varphi}(\nu, \nu_{t-1}) = e^{-\nu} - e^{-\nu_{t-1}} + e^{-\nu_{t-1}}(\nu - \nu_{t-1}). \end{align}\]
Once we have \(\nu_t\), we compute a vanilla gradient estimator by
\[\mathbf{z}_t = \exp(s(\mathbf{w}_t;\zeta'_t) - \nu_t)\nabla s(\mathbf{w}_t;\zeta'_t).\]
If the problem is non-convex, we compute a moving-average estimator \(\mathbf{v}_t = (1-\beta_t)\mathbf{v}_{t-1}+\beta_t\mathbf{z}_t\) and then update the model parameter \(\mathbf{w}_{t+1}\). We present the full steps in Algorithm 21, which is referred to as SCENT.
Algorithm 21: SCENT for solving Log-E-Exp (\(\ref{eqn:log-e-exp}\))
- Input: Initialize \(\mathbf{w}_1,\nu_0\), step sizes \(\eta_t\) and \(\alpha_t\), \(\varphi(\nu)=e^{-\nu}\).
- for \(t=1,\dotsc,T-1\) do
- Sample \(\zeta_t, \zeta'_t\)
- Update \(\nu_t = \arg\min_{\nu} \exp(s(\mathbf{w}_t;\zeta_t) - \nu) + \nu + \frac{1}{\alpha_t}D_\varphi(\nu, \nu_{t-1})\)
- Compute \(\mathbf{z}_t =\exp(s(\mathbf{w}_t;\zeta'_t) - \nu_t)\nabla s(\mathbf{w}_t;\zeta'_t)\)
- Compute \(\mathbf{v}_t = (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\)
- Update \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\mathbf{v}_t\)
- end for
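Below is a minimal NumPy sketch of Algorithm 21. The \(\nu\)-update \(\arg\min\) is computed in closed form via \(y_t = (y_{t-1}+\alpha_t)/(1+\alpha_t e^{s})\), which is derived in Lemma 5.26 below; the loss, the data distribution, and the step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def s(w, zeta):       # illustrative loss; kept small so E[exp(s)] is finite
    return 0.25 * np.sum((w - zeta) ** 2)

def grad_s(w, zeta):
    return 0.5 * (w - zeta)

w, nu = np.zeros(2), 0.0
v = np.zeros_like(w)  # moving-average gradient estimator v_t
for t in range(1, 1001):
    alpha, eta, beta = t ** -0.5, 0.1 * t ** -0.5, 0.9  # assumed schedules
    zeta, zeta2 = rng.normal(size=2), rng.normal(size=2)
    # SPMD update of nu in closed form (Lemma 5.26), with y = e^{-nu}
    y = (np.exp(-nu) + alpha) / (1.0 + alpha * np.exp(s(w, zeta)))
    nu = -np.log(y)
    z = np.exp(s(w, zeta2) - nu) * grad_s(w, zeta2)     # gradient estimator z_t
    v = (1 - beta) * v + beta * z
    w = w - eta * v
```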
SCGD is just a special case of SCENT
To see the connection with SCGD, we present the following lemma.
Lemma 5.26
The update of \(\nu_t\) defined by (\(\ref{eqn:SCENT-nu}\)) can be computed by
\[\begin{align}\label{eqn:enu} e^{\nu_t} = \frac{1}{1+\alpha_te^{\nu_{t-1}}}e^{\nu_{t-1}} + \frac{\alpha_te^{\nu_{t-1}}} {1+\alpha_te^{\nu_{t-1}}} \exp(s(\mathbf{w}_t; \zeta_t)). \end{align}\]
If \(y_t = e^{-\nu_t}\), we have
\[y_t = \frac{y_{t-1}+\alpha_t}{1 + \alpha_te^{s(\mathbf{w}_t; \zeta_t)}}.\]
Proof.
We compute the gradient of the problem (\(\ref{eqn:SCENT-nu}\)) and set it to zero for computing \(\nu_t\), i.e.,
\[- \exp(s(\mathbf{w}_t;\zeta_t) - \nu_t) + 1 + \frac{1}{\alpha_t}( - e^{-\nu_t} + e^{-\nu_{t-1}}) = 0.\]
Solving this equation finishes the proof.
■
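As a quick numerical sanity check of Lemma 5.26, the sketch below (with arbitrary test values for \(s\), \(\nu_{t-1}\), and \(\alpha_t\)) compares the closed form against a brute-force minimization of the one-dimensional objective in (\(\ref{eqn:SCENT-nu}\)).

```python
import numpy as np
from scipy.optimize import minimize_scalar

s_val, v_prev, alpha = 1.3, 0.2, 0.5   # arbitrary test values

def objective(v):                      # Phi(w, v; zeta) + D_phi(v, v_prev) / alpha
    breg = np.exp(-v) - np.exp(-v_prev) + np.exp(-v_prev) * (v - v_prev)
    return np.exp(s_val - v) + v + breg / alpha

numeric = minimize_scalar(objective, bounds=(-10, 10), method="bounded").x
closed = -np.log((np.exp(-v_prev) + alpha) / (1 + alpha * np.exp(s_val)))
print(numeric, closed)                 # the two agree to several digits
```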
If we define \(u_t = e^{\nu_t}\) and \(\gamma'_{t} = \frac{\alpha_te^{\nu_{t-1}}} {1+\alpha_te^{\nu_{t-1}}}\), then the updates of SCENT (\(\beta_t=1\)) are equivalent to
\[\begin{equation}\label{eqn:SCENT-equi} \begin{aligned} &u_t = (1-\gamma'_t)u_{t-1}+ \gamma'_t \exp(s(\mathbf{w}_t; \zeta_t)),\\ &\mathbf{z}_t = \frac{1}{u_t}\exp(s(\mathbf{w}_t; \zeta'_t))\nabla s(\mathbf{w}_t; \zeta'_t),\\ &\mathbf{w}_{t+1}= \mathbf{w}_t - \eta_t \mathbf{z}_t. \end{aligned} \end{equation}\]
Comparing this update with that of SCGD (\(\ref{eqn:scgd-dro}\)), the key difference lies in the choice of the moving-average parameter: SCENT adopts an adaptive parameter \(\gamma'_t = \frac{\alpha_t e^{\nu_{t-1}}}{1+\alpha_t e^{\nu_{t-1}}},\) whereas SCGD uses a non-adaptive \(\gamma_t\). If we set \(\alpha_t = \frac{\gamma_t}{1-\gamma_t}e^{-\nu_{t-1}}\), then the updates of SCENT reduce to those of SCGD.
Convergence analysis for convex problems
Since \(\mathbf{z}_t = \nabla_{\mathbf{w}}\exp(s(\mathbf{w}_t; \zeta'_t) - \nu_t)\), we have
\[\mathbb{E}_{\zeta'_t}[\mathbf{z}_t]= \nabla_{\mathbf{w}}\mathbb{E}_{\zeta'_t}[\exp(s(\mathbf{w}_t; \zeta'_t) - \nu_t)]=\nabla_1F(\mathbf{w}_t, \nu_t).\]
Let \(\mathbf{w}_*, \nu_*\) be the optimal solution:
\[(\mathbf{w}_*, \nu_*)=\arg\min_{\mathbf{w}, \nu}F(\mathbf{w}, \nu).\]
It is straightforward to derive \(\nu_* = \log [\mathbb{E}\exp(s(\mathbf{w}_*; \zeta))]\).
Assumption 5.17
Assume that the following conditions hold:
(i) \(s(\mathbf{w};\zeta)\) is convex;
(ii) the loss function is bounded such that \(s(\mathbf{w}; \zeta)\in[c_0, c_1], \forall \mathbf{w},\zeta\);
(iii) there exists \(G\) such that \(\mathbb{E}_{\zeta}\|\nabla s(\mathbf{w}_t; \zeta)\|_2^2\leq G^2, \forall t\).
💡 Critical
To relax the second assumption, we can assume that \(\mathbf{w}\) is restricted to a bounded domain \(\mathcal{W}\) and \(s(\mathbf{w}; \zeta)\) is regular. In practice, we always enforce the boundedness of \(\mathbf{w}_t\) through either projection onto \(\mathcal{W}\) or using a regularizer \(r(\mathbf{w})\). The update of \(\mathbf{w}_{t+1}\) can be modified as the SPGD update:
\[\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \mathbf{z}_t^{\top}\mathbf{w} + r(\mathbf{w}) + \frac{1}{2\eta_t}\|\mathbf{w} - \mathbf{w}_t\|_2^2.\]
The analysis can be performed similarly.
Lemma 5.27
Under Assumption 5.17(ii), \(\nu_*\in[c_0, c_1]\) and if \(\nu_0\in[c_0, c_1]\) then \(\nu_t\in[c_0, c_1], \forall t\).
Proof.
\(\nu_*\in[c_0, c_1]\) can be seen from \(\nu_* = \log [\mathbb{E}\exp(s(\mathbf{w}_*; \zeta))]\). The second result can be easily seen from the update of \(e^{\nu_t}\) as in (\(\ref{eqn:enu}\)) by induction.
■
For ease of analysis, we define two quantities that capture the variance caused by using stochastic estimators.
\[\begin{align*} & \sigma_t^2:=\mathbb{E}_{\zeta'_t}\|\exp(s(\mathbf{w}_t; \zeta'_t) - \nu_t)\nabla s(\mathbf{w}_t; \zeta'_t)\|_2^2, \\ &\delta_t^2:=\mathbb{E}_{\zeta_t}[e^{-\nu_{t-1}}|e^{s(\mathbf{w}_t;\zeta_t)} - \mathbb{E}_{\zeta}[e^{s(\mathbf{w}_t;\zeta)}]|^2]. \end{align*}\]
Under Assumption 5.17 (ii) and (iii), \(\sigma_t, \delta_t\) are bounded because \(e^{\nu_t}, e^{\nu_{t-1}}\) and \(e^{s(\mathbf{w}_t; \zeta_t)}\) are upper and lower bounded.
💡 Critical
These two quantities are related to the variance of the stochastic estimators in terms of \(\mathbf{w}_{t}\) and \(\nu_t\), respectively. Both quantities involve a normalization factor, \(e^{-\nu_t}\) or \(e^{-\nu_{t-1}}\).
Lemma 5.28
Under Assumption 5.17 and \(\beta_t=1\), we have
\[\begin{align*} &\mathbb{E}[\eta_t\nabla_1 F(\mathbf{w}_t, \nu_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\right] + \frac{\eta_t^2\sigma_t^2}{2}. \end{align*}\]
Proof.
The proof is a simple application of Lemma 3.3.
■
If the SPGD update is used, we can instead use Lemma 3.6, which gives
\[\begin{align*} & \mathbf{z}_t^{\top}(\mathbf{w}_{t+1} - \mathbf{w}_*)+ r(\mathbf{w}_{t+1}) - r(\mathbf{w}_*) \leq \frac{1}{2\eta_t}(\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2)\\ & - \frac{1}{2\eta_t}\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2. \end{align*}\]
Then,
\[\begin{align*} & \mathbf{z}_t^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)+ r(\mathbf{w}_{t}) - r(\mathbf{w}_*) \leq \frac{1}{2\eta_t}(\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2)\\ &+\mathbf{z}_t^{\top}(\mathbf{w}_{t} - \mathbf{w}_{t+1}) - \frac{1}{2\eta_t}\|\mathbf{w}_t - \mathbf{w}_{t+1}\|_2^2 + r(\mathbf{w}_t) - r(\mathbf{w}_{t+1})\\ & \leq \frac{1}{2\eta_t}(\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2) + \frac{\eta_t}{2}\|\mathbf{z}_t\|_2^2 + r(\mathbf{w}_t) - r(\mathbf{w}_{t+1}). \end{align*}\]
Taking expectation on both sides, we have
\[\begin{align*} &\mathbb{E}[\eta_t\nabla_1 F(\mathbf{w}_t, \nu_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)] + \eta_t(r(\mathbf{w}_t) - r(\mathbf{w}_*))\\ &\leq \mathbb{E}\left[\bigg(\eta_t r(\mathbf{w}_t)+\frac{1}{2}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2\bigg) - \bigg(\eta_t r(\mathbf{w}_{t+1})+\frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\bigg)\right] + \frac{\eta_t^2\sigma_t^2}{2}. \end{align*}\]
If \(\eta_{t+1}\leq \eta_t\) and \(r(\mathbf{w})\geq 0\), then \(\eta_t r(\mathbf{w}_{t+1})\geq \eta_{t+1} r(\mathbf{w}_{t+1})\), so the terms in the square brackets form a telescoping sum over \(t=1,\ldots, T\). As a result, the following analysis proceeds similarly.
Lemma 5.29
Under Assumption 5.17(ii), we have
\[\begin{align*} & \alpha_t \nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t)^{\top}(\nu_{t} - \nu_*)\leq D_\varphi(\nu_*, \nu_{t-1}) - D_\varphi(\nu_*,\nu_{t}) - D_{\varphi}(\nu_{t}, \nu_{t-1}). \end{align*}\]
Proof.
Recall the definition
\[\begin{align*} &\Phi(\mathbf{w}_t, \nu; \zeta_t)=\exp(s(\mathbf{w}_t;\zeta_t) - \nu) + \nu,\\ &\varphi(\nu) = e^{-\nu}, \quad D_{\varphi}(a, b) = \varphi(a) - \varphi(b) - \langle \nabla \varphi(b), a - b \rangle, \end{align*}\]
and the update of \(\nu_t\):
\[\nu_{t} = \arg \min_{\nu} \alpha_t \Phi(\mathbf{w}_t, \nu; \zeta_t) + D_{\varphi}(\nu, \nu_{t-1}).\]
The first-order optimality gives
\[\alpha_t\nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t) + \nabla \varphi(\nu_{t}) - \nabla \varphi(\nu_{t-1}) = 0.\]
Taking inner product with \((\nu_{t} - \nu_*)\) and rearranging gives
\[\begin{align*} \alpha_t \langle \nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t), \nu_{t} - \nu_* \rangle &= \langle \nabla \varphi(\nu_{t-1}) - \nabla \varphi(\nu_{t}), \nu_{t} - \nu_*\rangle\\ &= D_{\varphi}(\nu_*, \nu_{t-1}) - D_{\varphi}(\nu_*, \nu_{t}) - D_{\varphi}(\nu_{t}, \nu_{t-1}), \end{align*}\]
where the last equality holds by the three-point identity as in Lemma 3.10.
■
💡 Critical
To proceed with the analysis, we need to bound \(\mathbb{E}[\alpha_t \nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*)]\). In light of the above lemma, we will bound the following difference in expectation:
\[\mathbb{E}[\nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*) - \nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t)^{\top}(\nu_{t} - \nu_*)].\]
The challenge is that \(\nu_t\) depends on \(\zeta_t\), so the above expectation is not equal to zero.
Lemma 5.30
Assume \(\alpha_t\leq \rho e^{-\nu_{t-1}}\) for some constant \(\rho>0\); then we have
\[|\mathbb{E}[\nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*) - \nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t)^{\top}(\nu_{t} - \nu_*)]|\leq \alpha_t \delta_t^2 C,\]
where \(C=(1+\rho)(1+c_1 - c_0)\).
Proof.
In the following proof, we let \(\mathcal{F}_{t-1}\) denote the filtration (the “information available”) up to time \(t-1\).
Let us define \(z_t = e^{s(\mathbf{w}_t;\zeta_t)}\), \(m_t = \mathbb{E}_{\zeta}[e^{s(\mathbf{w}_t;\zeta)}|\mathcal{F}_{t-1}]\), and \(y_t = e^{-\nu_t}\). Let \(z\) and \(z'\) be two independent random variables such that \(\mathbb{E}[z|\mathcal{F}_{t-1}]=\mathbb{E}[z'|\mathcal{F}_{t-1}] =m_t\). Since \(\nu_t\) depends on \(z_t\), let us define random functions:
\[\begin{align*} &y_t(z)=\frac{y_{t-1}+\alpha_t}{\alpha_t z+1},\quad \nu_t(z) = -\log y_t(z),\\ &h_t(z)= e^{-\nu_{t}(z)}\big(\nu_{t}(z)- \nu_*\big) = y_t(z)\big(\nu_t(z)-\nu_*\big). \end{align*}\]
According to the update of \(\nu_t\), we have \(y_t= y_t(z_t), \nu_t = \nu_t(z_t)\). For the target, we have
\[\begin{equation}\label{eqn:target-v} \begin{aligned} &\mathbb{E}[(\nabla_2 \Phi(\mathbf{w}_t, \nu_t; \zeta_t)-\nabla_2 F(\mathbf{w}_t, \nu_t))^{\top}(\nu_{t} - \nu_*) \mid \mathcal{F}_{t-1}]\\ & =\mathbb{E}[(\mathbb{E}_{\zeta}[e^{s(\mathbf{w}_t;\zeta)}]-e^{s(\mathbf{w}_t;\zeta_t)})e^{-\nu_{t}}\big(\nu_{t}-\nu_*\big)\mid \mathcal{F}_{t-1}]\\ & =\mathbb{E}[(m_t- z_t)h_t(z_t)\mid \mathcal{F}_{t-1}] = \mathbb{E}_z[(m_t- z)h_t(z)|\mathcal{F}_{t-1}]. \end{aligned} \end{equation}\]
Since \(z'\) is an i.i.d. copy of \(z\) and independent of \(z\) given \(\mathcal{F}_{t-1}\),
\[m_t=\mathbb{E}[z\mid\mathcal{F}_{t-1}]=\mathbb{E}[z'\mid\mathcal{F}_{t-1}].\]
Using the conditional independence,
\[\mathbb{E}\big[(m_t-z)h_t(z)\mid \mathcal{F}_{t-1}\big]=\mathbb{E}\big[(z'-z)h_t(z)\mid \mathcal{F}_{t-1}\big].\]
By exchangeability of \((z,z')\) conditional on \(\mathcal{F}_{t-1}\),
\[\mathbb{E}\big[(z'-z)h_t(z')\mid \mathcal{F}_{t-1}\big]=-\,\mathbb{E}\big[(z'-z)h_t(z)\mid \mathcal{F}_{t-1}\big].\]
Averaging the last two displays gives the standard symmetrization:
\[\begin{align}\label{eqn:varz} \mathbb{E}\big[(m_t-z)h_t(z)\mid \mathcal{F}_{t-1}\big]=\frac{1}{2}\,\mathbb{E}\big[(z'-z)\big(h_t(z)-h_t(z')\big)\mid \mathcal{F}_{t-1}\big]. \end{align}\]
Next, we show that \(h_t(z)\) is Lipschitz continuous. By definition,
\[y_t(z)=\frac{y_{t-1}+\alpha_t}{\alpha_t z+1}, \quad h_t(z)= y_t(z)\big(\nu_t(z)-\nu_*\big).\]
Differentiate with respect to \(z\):
\[\frac{dy_t(z)}{dz}= (y_{t-1}+\alpha_t)\,\frac{d}{dz}\bigl((\alpha_t z+1)^{-1}\bigr)= -\frac{\alpha_t(y_{t-1}+\alpha_t)}{(\alpha_t z+1)^2}.\]
Using \(y_t(z)(\alpha_t z+1)=y_{t-1}+\alpha_t\), we can rewrite this as
\[\frac{dy_t(z)}{dz}=-\,\frac{\alpha_t y_t(z)}{\alpha_t z+1}.\]
Since \(\nu_t(z)=-\log y_t(z)\), we have
\[\frac{d\nu_t(z)}{dz}=-\frac{1}{y_t(z)}\frac{dy_t(z)}{dz}=\frac{\alpha_t}{\alpha_t z+1}.\]
As a result,
\[\frac{dh_t(z)}{dz}=\frac{dy_t(z)}{dz}\big(\nu_t(z)-\nu_*\big)+y_t(z)\frac{d\nu_t(z)}{dz}=\frac{\alpha_t y_t(z)}{\alpha_t z+1}\,\bigl(1-(\nu_t(z)-\nu_*)\bigr).\]
Since \(\nu_t(z), \nu_*\in[c_0, c_1]\), we have
\[\bigl|1-(\nu_t(z)-\nu_*)\bigr|\le 1+c_1-c_0,\]
and since \(y_t(z)=\frac{y_{t-1}+\alpha_t}{\alpha_t z+1}\leq y_{t-1} + \alpha_t\leq (1+\rho)y_{t-1}\), we have
\[\left|\frac{dh_t}{dz}\right|\le \alpha_t y_{t-1}(1+\rho)(1+c_1 -c_0),\]
which means \(h_t\) is \(L_t\)-Lipschitz with \(L_t\leq \alpha_t y_{t-1}C\). Then, it holds
\[\big|(z'-z)\big(h_t(z)-h_t(z')\big)\big|\le L_t\,(z'-z)^2 \leq C\alpha_t y_{t-1}(z'-z)^2.\]
Thus,
\[\begin{align*} &\mathbb{E}\big[\big|(z'-z)\big(h_t(z)-h_t(z')\big)\big|\mid \mathcal{F}_{t-1}\big]\leq C\alpha_t \mathbb{E}[y_{t-1}(z'-z)^2\mid \mathcal{F}_{t-1}]\\ &= C\alpha_t \cdot2\mathbb{E}[y_{t-1}(z - m_t)^2 \mid \mathcal{F}_{t-1}]\leq 2C\alpha_t \delta_t^2, \end{align*}\]
where the last step uses the definition of \(\delta_t^2\). Applying this result to (\(\ref{eqn:varz}\)), we have
\[\Bigl|\mathbb{E}\big[(m_t-z)h_t(z)\mid \mathcal{F}_{t-1}\big]\Bigr|\le \frac{1}{2}\mathbb{E}\big[\big|(z'-z)\big(h_t(z)-h_t(z')\big)\big|\mid \mathcal{F}_{t-1}\big] \leq C\alpha_t \delta_t^2.\]
By noting (\(\ref{eqn:target-v}\)), we finish the proof.
■
Combining Lemma 5.29 and Lemma 5.30, we have the following lemma for one-step analysis of the \(\nu\)-update.
Lemma 5.31
Under Assumption 5.17(ii) and \(\alpha_t\leq \rho e^{-\nu_{t-1}}\), we have
\[\begin{align}\label{eqn:nu-step} & \mathbb{E}[\alpha_t \nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*)]\leq \mathbb{E}[D_\varphi(\nu_*, \nu_{t-1}) - D_\varphi(\nu_*,\nu_{t}) + C\alpha_t^2\delta_t^2]. \end{align}\]
Finally, we state the convergence result of SCENT in the following theorem.
Theorem 5.14
Suppose Assumption 5.17 holds. Let \(\beta_t=1\), \(\eta_t=\eta \alpha_t\), and \(\alpha_t\leq \rho e^{-\nu_{t-1}}\) for some constant \(\rho>0\); then SCENT guarantees that
\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\alpha_t(F(\mathbf{w}_t,\nu_t) - F(\mathbf{w}_*, \nu_*))\right]\\ & \leq \frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + D_\varphi(\nu_*, \nu_{0}) +\mathbb{E}\left[\sum_{t=1}^T\frac{\eta\alpha_t^2\sigma_t^2}{2} + \sum_{t=1}^TC\alpha_t^2\delta_t^2\right]. \end{align*}\]
Proof.
Since \(\eta_t = \eta \alpha_t\), from Lemma 5.28, we obtain
\[\begin{align*} &\mathbb{E}[\alpha_t\nabla_1 F(\mathbf{w}_t, \nu_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2\eta}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 + \frac{\eta\alpha_t^2\sigma_t^2}{2}\right]. \end{align*}\]
Combining this with Lemma 5.31, we have
\[\begin{align*} &\mathbb{E}[\alpha_t(\nabla_1 F(\mathbf{w}_t, \nu_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)+\nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*))]\\ & \leq \mathbb{E}\left[\frac{1}{2\eta}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2+ D_\varphi(\nu_*, \nu_{t-1}) - D_\varphi(\nu_*,\nu_{t}) \right]\\ &+\mathbb{E}\bigg[\frac{\eta\alpha_t^2\sigma_t^2}{2} + C\alpha_t^2\delta_t^2\bigg]. \end{align*}\]
By the joint convexity of \(F(\mathbf{w}, \nu)\), we have
\[\alpha_t(F(\mathbf{w}_t,\nu_t) - F(\mathbf{w}_*, \nu_*))\leq \alpha_t(\nabla_1 F(\mathbf{w}_t, \nu_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)+\nabla_2 F(\mathbf{w}_t, \nu_t)^{\top}(\nu_{t} - \nu_*)).\]
Combining the last two inequalities and summing over \(t=1,\ldots, T\), we have
\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\alpha_t(F(\mathbf{w}_t,\nu_t) - F(\mathbf{w}_*, \nu_*))\right]\\ & \leq \frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + D_\varphi(\nu_*, \nu_{0}) +\mathbb{E}\left[\sum_{t=1}^T\frac{\eta\alpha_t^2\sigma_t^2}{2} + \sum_{t=1}^TC\alpha_t^2\delta_t^2\right]. \end{align*}\]
■
We present two corollaries of the above theorem.
Corollary 5.2
Suppose Assumption 5.17 holds. Let \(\beta_t=1, \eta_t=\eta \alpha_t\), \(\alpha_t= \frac{\alpha}{\sqrt{T}}<\rho e^{-\nu_{t-1}}\) for some constant \(\rho>0\), then SCENT guarantees that
\[\begin{align*} &\mathbb{E}\left[(F_{\text{ENT}}(\bar{\mathbf{w}}_T) - F_{\text{ENT}}(\mathbf{w}_*))\right] \leq \frac{D_0}{\alpha \sqrt{T}}+\frac{\alpha V}{\sqrt{T}}, \end{align*}\]
where \(\bar{\mathbf{w}}_T = \frac{\sum_{t=1}^T\mathbf{w}_t}{T}\), \(D_0=\frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + D_\varphi(\nu_*, \nu_{0})\) and
\[V=\mathbb{E}\left[\frac{\eta \sum_{t=1}^T\sigma_t^2}{2T} + \frac{\sum_{t=1}^TC\delta_t^2}{T}\right].\]
Proof.
Plugging \(\alpha_t=\alpha/\sqrt{T}\) into Theorem 5.14, we have
\[\begin{align*} &\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^T(F(\mathbf{w}_t, \nu_t) - F(\mathbf{w}_*, \nu_*))\right] \leq \frac{D_0}{\alpha \sqrt{T}}+\frac{\alpha V}{\sqrt{T}}. \end{align*}\]
Using \(F_{\text{ENT}}(\mathbf{w})=\min_\nu F(\mathbf{w}, \nu)\), \(F_{\text{ENT}}(\mathbf{w}_*)=F(\mathbf{w}_*, \nu_*)\), and Jensen’s inequality, we can finish the proof.
■
💡 Why it matters
Since \(\delta_t, \sigma_t\) are finite, the above result implies a convergence rate of \(O(1/\sqrt{T})\) for SCENT.
Corollary 5.3
Suppose Assumption 5.17 holds. Let \(\beta_t=1, \eta_t=\eta \alpha_t\), \(\alpha_t=\frac{\alpha e^{-\nu_{t-1}}}{\sqrt{T}}\), if \(\frac{1}{T}\sum_{t=1}^Te^{-\nu_{t-1}}\geq S\) almost surely, then SCENT guarantees that
\[\begin{align*} &\mathbb{E}\left[F_{\text{ENT}}(\hat{\mathbf{w}}_T) - F_{\text{ENT}}(\mathbf{w}_*)\right] \leq \frac{D_0}{\alpha \sqrt{T}S}+\frac{\alpha\bar{V}}{\sqrt{T}S}, \end{align*}\]
where \(\hat{\mathbf{w}}_T=\frac{\sum_t\alpha_t\mathbf{w}_t}{\sum_{t=1}^T\alpha_t}\) and
\[\bar{V}=\mathbb{E}\left[\frac{\eta \sum_{t=1}^Te^{-2\nu_{t-1}}\sigma_t^2}{2T} + \frac{\sum_{t=1}^TCe^{-2\nu_{t-1}}\delta_t^2}{T}\right].\]
Proof.
Let \(\hat\alpha_t = \frac{\alpha_t}{\sum_{t=1}^T\alpha_t}\). From Theorem 5.14, we have
\[\begin{align*} &\mathbb{E}\left[\bigg(\sum_{t=1}^T\alpha_t\bigg)\sum_{t=1}^T\hat\alpha_t(F(\mathbf{w}_t,\nu_t) - F(\mathbf{w}_*, \nu_*))\right]\\ & \leq \frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + D_\varphi(\nu_*, \nu_{0}) +\mathbb{E}\left[\sum_{t=1}^T\frac{\eta\alpha_t^2\sigma_t^2}{2} + \sum_{t=1}^TC\alpha_t^2\delta_t^2\right]. \end{align*}\]
Since \(\sum_{t=1}^T\alpha_t=\sum_{t=1}^T\frac{\alpha e^{-\nu_{t-1}}}{\sqrt{T}}\geq \alpha \sqrt{T} S\), we have
\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\hat\alpha_t(F(\mathbf{w}_t,\nu_t) - F(\mathbf{w}_*, \nu_*))\right] \leq \frac{\frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + D_\varphi(\nu_*, \nu_{0}) }{\alpha \sqrt{T}S}+\frac{\alpha\bar{V}}{\sqrt{T}S}. \end{align*}\]
Applying the joint convexity of \(F(\mathbf{w}, \nu)\) and \(F_{\text{ENT}}(\mathbf{w})=\min_\nu F(\mathbf{w}, \nu)\), we can finish the proof.
■
💡 Why it matters
Under the stated setting, SCENT reduces to SCGD with \(\gamma_t = \frac{\alpha}{\sqrt{T}+\alpha}\). Since \(S\) can be lower bounded by a constant, the above corollary implies an \(O(1/\sqrt{T})\) convergence rate for SCGD to minimize log-E-Exp.
Analysis of the Variance Terms
Since the final convergence bound depends on the variance terms \(\sigma_t^2, \delta_t^2\), we would like to provide further analysis on them.
Let us introduce some notation:
\[\begin{align} &z(\mathbf{w}; \zeta)=e^{s(\mathbf{w}; \zeta)}, \; \mu(\mathbf{w}) = \log \mathbb{E}_{\zeta}e^{s(\mathbf{w}; \zeta)},\\ &m_t = \mathbb{E}_{\zeta}e^{s(\mathbf{w}_t; \zeta)}, \; \mu_t = \mu(\mathbf{w}_t) = \log m_t. \end{align}\]
For the analysis, we make two reasonable assumptions.
Assumption 5.18
Assume there exist constants \(\kappa\) and \(\sigma'^2\) such that
(i) \(\frac{\mathbb{E}[z(\mathbf{w}; \zeta)^2]}{(\mathbb{E}[z(\mathbf{w}; \zeta)])^2}\le \kappa\) for all \(\mathbf{w}\);
(ii) \(\mathbb{E}\|e^{s(\mathbf{w}_t;\zeta')-\mu_t}\nabla s(\mathbf{w}_t;\zeta')\|^2\le \sigma'^2\) for all \(t\).
💡 Critical
These assumptions are necessary. In Section 5.5.2.4, we show that the dependence on \(\kappa\) is unavoidable. The second assumption is the standard bounded stochastic gradient assumption for optimizing \(F_1(\mathbf{w})\).
Lemma 5.32 (Dual Variance Term)
Under Assumption 5.18, we have
\[\begin{equation}\label{eq:delta-noZmax} \delta_t^2\leq 2(\kappa-1) m_{t}\Big(F(\mathbf{w}_t,\nu_{t-1})-F(\mathbf{w}_*,\nu_*)+1\Big). \end{equation}\]
💡 Why it matters
When \(F(\mathbf{w}_t,\nu_{t-1})-F(\mathbf{w}_*,\nu_*)\rightarrow 0\), the variance term in the convergence bound caused by the stochastic update of \(\nu_t\) will be dominated by \(2(\kappa-1)m_t\). Large \(m_t\) can be mitigated by choosing small \(\alpha_t\).
Proof.
Recall that
\[\delta_t^2=\mathbb{E}_{\zeta_t}\!\left[e^{-\nu_{t-1}}\bigl(z(\mathbf{w}_t;\zeta_t)-m_t\bigr)^2\right].\]
By Assumption 5.18(i),
\[\operatorname{Var}(z(\mathbf{w}_t;\zeta))=\mathbb{E}[z(\mathbf{w}_t;\zeta)^2]-m_t^2\le (\kappa-1)m_t^2.\]
Hence
\[\delta_t^2=e^{-\nu_{t-1}}\operatorname{Var}(z(\mathbf{w}_t;\zeta))\le (\kappa-1)e^{-\nu_{t-1}}m_t^2= (\kappa-1)m_t\cdot (m_t e^{-\nu_{t-1}}).\]
Let \(\tilde{r}_{t-1}\coloneqq m_t e^{-\nu_{t-1}}\). By definition,
\[F(\mathbf{w}_t,\nu_{t-1})=\mathbb{E} e^{s(\mathbf{w}_t; \zeta)-\nu_{t-1}} + \nu_{t-1} = \tilde{r}_{t-1} + \nu_{t-1}.\]
Since \(\tilde{r}_{t-1}=e^{\log m_t - \nu_{t-1}}\), we have
\[F(\mathbf{w}_t,\nu_{t-1})-(1+\mu_t)= \tilde{r}_{t-1} + \nu_{t-1} - (1+\log m_t)=\tilde{r}_{t-1}-\log \tilde{r}_{t-1}-1.\]
Using \(r\le 2(r-\log r)\) for all \(r>0\) yields
\[\tilde{r}_{t-1}\le 2\big(F(\mathbf{w}_t,\nu_{t-1})-(1+\mu_t)+1\big).\]
Since \(\mathbf{w}_*\) minimizes \(\mu(\mathbf{w})\), we have \(\mu_t=\mu(\mathbf{w}_t)\ge \mu(\mathbf{w}_*)\) and thus \((1+\mu_t)\ge (1+\mu(\mathbf{w}_*))=F(\mathbf{w}_*,\nu_*)\), implying
\[F(\mathbf{w}_t,\nu_{t-1})-(1+\mu_t)\le F(\mathbf{w}_t,\nu_{t-1})-F(\mathbf{w}_*,\nu_*).\]
As a result, we have
\[\begin{align}\label{eqn:b-tilde-r} \tilde{r}_{t-1}\le 2\big(F(\mathbf{w}_t,\nu_{t-1})-F(\mathbf{w}_*,\nu_*) + 1\big). \end{align}\]
Combining this with the bound of \(\delta_t^2\), we complete the proof.
■
Lemma 5.33 (Primal Variance Term)
Under Assumption 5.18, we have
\[\begin{align*} \sigma_t^2 & \leq 4\sigma'^2\big(F(\mathbf{w}_t,\nu_{t})- F(\mathbf{w}_*, \nu_*)+1\big)^2. \end{align*}\]
💡 Why it matters
When \(F(\mathbf{w}_t,\nu_{t})-F(\mathbf{w}_*,\nu_*)\rightarrow 0\), the variance term in the convergence bound caused by the stochastic update of \(\mathbf{w}_t\) will be dominated by \(O(\sigma'^2)\).
Proof.
\[\begin{align*} \sigma_t^2 & =\mathbb{E}_{\zeta'_t}\|\exp(s(\mathbf{w}_t; \zeta'_t) - \nu_t)\nabla s(\mathbf{w}_t; \zeta'_t)\|_2^2 \\ & = \mathbb{E}_{\zeta'_t}[e^{2(\mu_t - \nu_t)}\|\exp(s(\mathbf{w}_t; \zeta'_t) - \mu_t)\nabla s(\mathbf{w}_t; \zeta'_t)\|_2^2] \leq r_t^2\sigma'^2, \end{align*}\]
where \(r_t = e^{\mu_t - \nu_t}\). By the same argument as for (\(\ref{eqn:b-tilde-r}\)), with \(\nu_t\) in place of \(\nu_{t-1}\), we have
\[r_t\leq 2\big(F(\mathbf{w}_t,\nu_{t})- F(\mathbf{w}_*, \nu_*)+1\big).\]
Hence,
\[\begin{align*} \sigma_t^2 & \leq 4\sigma'^2\big(F(\mathbf{w}_t,\nu_{t})- F(\mathbf{w}_*, \nu_*)+1\big)^2. \end{align*}\]
■
5.5.2.2 Compositional Entropic Risk Minimization
In this section, we extend the results to solving compositional entropic risk minimization (CERM):
\[\min_{\mathbf{w}}F_{\text{CERM}}(\mathbf{w}):=\frac{1}{n}\sum_{i=1}^n\log\left(\mathbb{E}_{\zeta\sim\mathbb{P}_i}\exp(s_i(\mathbf{w}; \zeta))\right)\]
via its equivalent min-min formulation:
\[\begin{equation*} \begin{aligned} \min_{\mathbf{w}} & \min_{\boldsymbol{\nu}}F(\mathbf{w},\boldsymbol{\nu}) := \frac{1}{n}\sum_{i=1}^n\{\mathbb{E}_{\zeta\sim\mathbb{P}_i}\exp(s_i(\mathbf{w};\zeta) - \nu_{i}) + \nu_{i}\}. \end{aligned} \end{equation*}\]
The difference from log-E-Exp is that there are multiple \(\nu_i, i=1,\ldots, n\), which need to be updated using a stochastic block-coordinate method. This technique has been used in algorithms presented in previous sections of this chapter.
We present an extension of SCENT to solving CERM in Algorithm 22. The major change lies in the stochastic block-coordinate update of \(\boldsymbol{\nu}\) in Step 5. This extension is analogous to SOX for FCCO, employing stochastic block-coordinate updates for the inner estimators. Indeed, SOX applied to CERM can be recovered as a special case of SCENT by choosing the coordinate-wise step size \(\alpha_{t,i} = \frac{\gamma_t}{1-\gamma_t} e^{-\nu_{i,t-1}}\), using an argument similar to (\(\ref{eqn:SCENT-equi}\)).
Algorithm 22: SCENT for solving CERM
- Input: Initialize \(\mathbf{w}_1,\boldsymbol{\nu}_0\), step sizes \(\eta_t\) and \(\alpha_t\), \(\varphi(\nu)=e^{-\nu}\).
- for \(t=1,\dotsc,T-1\) do
- Sample \(\mathcal{B}_t\subset \{1,\dotsc, n\}\) with \(|\mathcal{B}_t| = B\)
- for each \(i\in\mathcal{B}_t\) do
- Sample \(\zeta_{i,t}, \zeta'_{i,t}\sim\mathbb{P}_i\)
- Update \(\nu_{i,t} = \arg\min_{\nu} \exp(s_i(\mathbf{w}_t;\zeta_{i,t}) - \nu) + \nu + \frac{1}{\alpha_t}D_\varphi(\nu, \nu_{i,t-1})\)
- end for
- Compute \(\mathbf{z}_t =\frac{1}{B}\sum_{i\in\mathcal{B}_t}\exp(s_i(\mathbf{w}_t;\zeta'_{i,t}) - \nu_{i,t})\nabla s_i(\mathbf{w}_t;\zeta'_{i,t})\)
- Compute \(\mathbf{v}_t = (1-\beta_t)\mathbf{v}_{t-1} + \beta_t \mathbf{z}_t\)
- Update \(\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t\mathbf{v}_t\)
- end for
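A minimal NumPy sketch of Algorithm 22 follows; the per-group quadratic losses, the group distributions \(\mathbb{P}_i\), the batch size, and the step sizes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, d = 10, 3, 2
centers = rng.normal(size=(n, d))       # one assumed distribution P_i per group

def s_i(i, w, zeta):                    # illustrative loss for group i
    return 0.5 * np.sum((w - centers[i] - zeta) ** 2)

def grad_s_i(i, w, zeta):
    return w - centers[i] - zeta

w, nu, v = np.zeros(d), np.zeros(n), np.zeros(d)
for t in range(1, 1001):
    alpha, eta, beta = t ** -0.5, 0.1 * t ** -0.5, 0.9  # assumed schedules
    batch = rng.choice(n, size=B, replace=False)
    z = np.zeros(d)
    for i in batch:
        zeta, zeta2 = 0.1 * rng.normal(size=d), 0.1 * rng.normal(size=d)
        # block-coordinate SPMD update of nu_i (closed form of Lemma 5.26)
        y = (np.exp(-nu[i]) + alpha) / (1 + alpha * np.exp(s_i(i, w, zeta)))
        nu[i] = -np.log(y)
        z += np.exp(s_i(i, w, zeta2) - nu[i]) * grad_s_i(i, w, zeta2) / B
    v = (1 - beta) * v + beta * z
    w = w - eta * v
```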
Convergence analysis for convex problems
Let us define some notation:
\[\begin{align*} &\Phi_i(\mathbf{w}_t, \nu_i; \zeta)=\exp(s_i(\mathbf{w}_t;\zeta) - \nu_i) + \nu_i,\\ & F_i(\mathbf{w}_t, \nu_i)= \mathbb{E}_{\zeta\sim\mathbb{P}_i}[\Phi_i(\mathbf{w}_t, \nu_i; \zeta)],\\ &(\mathbf{w}_*, \boldsymbol{\nu}_*)=\arg\min_{\mathbf{w}, \boldsymbol{\nu}}F(\mathbf{w}, \boldsymbol{\nu}). \end{align*}\]
As before, \(\nu_{i,*} = \log [\mathbb{E}_{\zeta\sim \mathbb{P}_i}\exp(s_i(\mathbf{w}_*; \zeta))]\). Since we deal with a stochastic block-coordinate update, we introduce a virtual sequence \(\bar{\boldsymbol{\nu}}_t\), where
\[\begin{align*} \bar\nu_{i,t} = \arg\min_{\nu} \exp(s_i(\mathbf{w}_t;\zeta_{i,t}) - \nu) + \nu + \frac{1}{\alpha_t}D_\varphi(\nu, \nu_{i,t-1}), \forall i. \end{align*}\]
Following Lemma 5.26, we have
\[\begin{align*} e^{\bar\nu_{i,t}} = \frac{1}{1+\alpha_te^{\nu_{i,t-1}}}e^{\nu_{i,t-1}} + \frac{\alpha_te^{\nu_{i,t-1}}} {1+\alpha_te^{\nu_{i,t-1}}} \exp(s_i(\mathbf{w}_t; \zeta_{i,t})), \forall i. \end{align*}\]
Assumption 5.19
Assume that the following conditions hold:
(i) \(s_i(\mathbf{w};\zeta)\) is convex;
(ii) the loss function is bounded such that \(s_i(\mathbf{w}; \zeta)\in[c_0, c_1], \forall \mathbf{w},\zeta, i\);
(iii) there exists \(G\) such that \(\mathbb{E}_{\zeta}\|\nabla s_i(\mathbf{w}_t; \zeta)\|_2^2\leq G^2, \forall t, i\).
Define \(\sigma_{i,t}, \delta_{i,t}\) as
\[\begin{align*} &\sigma_{i,t}^2:=\mathbb{E}_{\zeta'_{i,t}\sim\mathbb{P}_i}\|\exp(s_i(\mathbf{w}_t; \zeta'_{i,t}) - \nu_{i,t})\nabla s_i(\mathbf{w}_t; \zeta'_{i,t})\|_2^2, \forall i, t,\\ &\delta_{i,t}^2: = \mathbb{E}_{\zeta_{i,t}\sim\mathbb{P}_i}[e^{-\nu_{i,t-1}}|e^{s_i(\mathbf{w}_t;\zeta_{i,t})} - \mathbb{E}_{\zeta_i\sim\mathbb{P}_i}[e^{s_i(\mathbf{w}_t;\zeta_i)}]|^2], \forall i, t. \end{align*}\]
Similar to Lemma 5.27, the following lemma can be proved.
Lemma 5.34
Under Assumption 5.19, if \(\boldsymbol{\nu}_0\in[c_0, c_1]\) then \(\boldsymbol{\nu}_t\in[c_0, c_1], \forall t\).
Similar to Lemma 5.28, we have the following lemma regarding the one-step update of \(\mathbf{w}_{t}\).
Lemma 5.35
Under Assumption 5.19 and \(\beta_t=1\), we have
\[\begin{align*} &\mathbb{E}[\eta_t\nabla_1 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2\right] + \frac{\eta_t^2\sigma_t^2}{2}, \end{align*}\]
where \(\sigma_t^2 = \frac{1}{n}\sum_{i=1}^n\sigma_{i,t}^2\).
Proof.
We first bound \(\mathbb{E}_{t}[\|\mathbf{z}_t\|_2^2]\), where \(\mathbb{E}_t\) denotes the expectation over the randomness in the \(t\)-th iteration given \(\mathbf{w}_t, \boldsymbol{\nu}_{t-1}\).
\[\begin{align*} &\mathbb{E}_t[\|\mathbf{z}_t\|_2^2] = \mathbb{E}_t\left[\bigg\|\frac{1}{B}\sum_{i\in\mathcal{B}_t}\exp(s_i(\mathbf{w}_t; \zeta'_{i,t}) - \nu_{i,t})\nabla s_i(\mathbf{w}_t; \zeta'_{i,t})\bigg\|_2^2\right]\\ & =\mathbb{E}_{\mathcal{B}_t, \zeta_t}\mathbb{E}_{\zeta'_t|\mathcal{B}_t, \zeta_t}\left[\bigg\|\frac{1}{B}\sum_{i\in\mathcal{B}_t}\exp(s_i(\mathbf{w}_t; \zeta'_{i,t}) - \nu_{i,t})\nabla s_i(\mathbf{w}_t; \zeta'_{i,t})\bigg\|_2^2\right]\\ & \leq \mathbb{E}_{\mathcal{B}_t, \zeta_t}\bigg[\frac{1}{B}\sum_{i\in\mathcal{B}_t}\sigma_{i,t}^2\bigg] = \frac{1}{n}\sum_{i=1}^n\sigma_{i,t}^2. \end{align*}\]
Since \(\bar\nu_{i,t}=\nu_{i,t}, \forall i\in\mathcal{B}_t\), we have
\[\mathbb{E}_t[\mathbf{z}_t] = \mathbb{E}_{\zeta'_t, \zeta_t, \mathcal{B}_t}\bigg[\frac{1}{B}\sum_{i\in\mathcal{B}_t}\nabla_1\Phi_i(\mathbf{w}_t, \bar\nu_{i,t}; \zeta'_{i,t})\bigg] = \nabla_1 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t).\]
Then following Lemma 3.3, we can finish the proof.
■
Next, we analyze the update of \(\bar{\boldsymbol{\nu}}_t\).
Lemma 5.36
Under Assumption 5.19(ii) and \(\alpha_t\leq \min_i\rho e^{-\nu_{i,t-1}}\), we have
\[\begin{align*} \mathbb{E}[\alpha_t \nabla_2 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\bar{\boldsymbol{\nu}}_{t} - \boldsymbol{\nu}_{*})]&\leq \frac{1}{B}\mathbb{E}\left[D_\varphi(\boldsymbol{\nu}_{*}, \boldsymbol{\nu}_{t-1}) - D_\varphi(\boldsymbol{\nu}_{*},\boldsymbol{\nu}_{t})\right] + C\alpha^2_t \delta_{t}^2, \end{align*}\]
where \(D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_t)=\sum_{i=1}^nD_\varphi(\nu_{i,*}, \nu_{i,t})\) and \(\delta_t^2 = \frac{1}{n}\sum_{i=1}^n\delta_{i,t}^2\).
Proof.
By applying Lemma 5.30 and Lemma 5.29 for each coordinate of \(\bar\nu_{i,t}\), we have
\[\begin{align*} & \mathbb{E}[\alpha_t \nabla_2 F_i(\mathbf{w}_t, \bar\nu_{i,t})^{\top}(\bar\nu_{i,t} - \nu_{i,*})]\leq D_\varphi(\nu_{i,*}, \nu_{i,t-1}) - D_\varphi(\nu_{i,*},\bar\nu_{i,t}) + C\alpha^2_t \delta_{i,t}^2, \forall i. \end{align*}\]
Averaging the above inequality over \(i=1,\ldots, n\), we have
\[\begin{align}\label{eqn:bnu-one} & \mathbb{E}[\alpha_t \nabla_2 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\bar{\boldsymbol{\nu}}_{t} - \boldsymbol{\nu}_{*})]\leq \frac{1}{n}\sum_{i=1}^n\left(D_\varphi(\nu_{i,*}, \nu_{i,t-1}) - D_\varphi(\nu_{i,*},\bar\nu_{i,t})\right) + C\alpha^2_t \delta_{t}^2. \end{align}\]
Due to the randomness of \(\mathcal{B}_t\), we have
\[\begin{align*} \mathbb{E}[D_{\varphi}(\nu_{i,*}, \nu_{i,t})] = \mathbb{E}\bigg[\left(1-\frac{B}{n}\right) D_{\varphi}(\nu_{i,*}, \nu_{i,t-1}) + \frac{B}{n} D_{\varphi}(\nu_{i,*}, \bar\nu_{i,t})\bigg], \forall i. \end{align*}\]
Hence
\[\begin{align*} & \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\left(D_\varphi(\nu_{i,*}, \nu_{i,t-1}) - D_\varphi(\nu_{i,*},\bar\nu_{i,t})\right)\right] \\ &= \mathbb{E}\left[\frac{1}{n}\sum_{i=1}^n\left(D_\varphi(\nu_{i,*}, \nu_{i,t-1}) - \frac{n}{B}D_\varphi(\nu_{i,*},\nu_{i,t}) + \left(\frac{n}{B}-1\right) D_{\varphi}(\nu_{i,*}, \nu_{i,t-1})\right)\right] \\ & = \frac{1}{B}\mathbb{E}\left[\sum_{i=1}^n (D_{\varphi}(\nu_{i,*}, \nu_{i,t-1})- D_{\varphi}(\nu_{i,*}, \nu_{i,t}))\right]. \end{align*}\]
Combining this with (\(\ref{eqn:bnu-one}\)), we finish the proof.
■
Finally, we state the convergence result of SCENT in the following theorem.
Theorem 5.15
Suppose Assumption 5.19 holds. Let \(\beta_t=1, \eta_t=\eta \alpha_t\), and \(\alpha_t=\frac{\alpha}{\sqrt{T}}<\rho \min_i e^{-\nu_{i,t-1}}\), then SCENT guarantees that
\[\begin{align*} &\mathbb{E}\left[(F_{\text{CERM}}(\bar{\mathbf{w}}_T) - F_{\text{CERM}}(\mathbf{w}_*))\right] \leq \frac{1}{2\eta \alpha\sqrt{T}}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + \frac{D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0})}{\alpha B\sqrt{T}} +\frac{\alpha V}{\sqrt{T}}, \end{align*}\]
where \(\bar{\mathbf{w}}_T = \frac{\sum_{t=1}^T\mathbf{w}_t}{T}\), and \(V=\mathbb{E}\left[\frac{\eta \sum_{t=1}^T\sigma_t^2}{2T} + \frac{\sum_{t=1}^TC\delta_t^2}{T}\right]\).
💡 Why it matters
In order to achieve an \(\epsilon\)-optimal solution, the above convergence bound implies the following complexity:
\[T = O\left(\frac{\|\mathbf{w}_1- \mathbf{w}_*\|_2^4}{\eta^2\alpha^2\epsilon^2} + \frac{D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0})^2}{\alpha^2 B^2\epsilon^2} + \frac{\alpha^2 V^2}{\epsilon^2} \right).\]
For simplicity of discussion, let us consider a setting of \(\eta\) such that the first term matches the second term. As a result, the complexity becomes:
\[T = O\left(\frac{D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0})^2}{\alpha^2 B^2\epsilon^2} + \frac{\alpha^2 V^2}{\epsilon^2} \right).\]
Insight 1: Since \(\sigma_t, \delta_t\) are finite, and \(D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0})=O(n)\), if \(\alpha \propto\sqrt{n/B}\), the above result implies an iteration complexity of \(O(\frac{n}{B\epsilon^2})\) for SCENT.
Insight 2: When the losses are large, e.g., \(s_i(\mathbf{w}_t; \zeta) \ge 0\), the term \(e^{-\nu_{i,t-1}}\) becomes very small, suggesting that the step size parameter \(\alpha\) should be chosen small so as to mitigate the large variance term \(\delta_t\). In contrast, when the losses are small, e.g., \(s_i(\mathbf{w}_t; \zeta) < 0\), the term \(e^{-\nu_{i,t-1}}\) can become large, allowing \(\alpha\) to be set relatively larger, which helps offset the large distance measure \(D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0})\).
Proof.
Since \(\eta_t = \eta \alpha_t\), from Lemma 5.35, we obtain
\[\begin{align*} &\mathbb{E}[\alpha_t\nabla_1 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)]\leq \mathbb{E}\left[\frac{1}{2\eta}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2 + \frac{\eta\alpha_t^2\sigma_t^2}{2}\right]. \end{align*}\]
Adding this to the inequality in Lemma 5.36, we have
\[\begin{align*} &\mathbb{E}[\alpha_t(\nabla_1 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)+\nabla_2 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\bar{\boldsymbol{\nu}}_{t} - \boldsymbol{\nu}_*))]\\ & \leq \mathbb{E}\left[\frac{1}{2\eta}\|\mathbf{w}_t- \mathbf{w}_*\|_2^2 - \frac{1}{2\eta}\|\mathbf{w}_{t+1} - \mathbf{w}_*\|_2^2+ \frac{1}{B}D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{t-1}) - \frac{1}{B}D_\varphi(\boldsymbol{\nu}_*,\boldsymbol{\nu}_{t}) \right]\\ &+\mathbb{E}\bigg[\frac{\eta\alpha_t^2\sigma_t^2}{2} + C\alpha_t^2\delta_t^2\bigg]. \end{align*}\]
By the joint convexity of \(F(\mathbf{w}, \nu)\), we have
\[\alpha_t(F(\mathbf{w}_t,\bar{\boldsymbol{\nu}}_t) - F(\mathbf{w}_*, \boldsymbol{\nu}_*))\leq \alpha_t(\nabla_1 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\mathbf{w}_{t} - \mathbf{w}_*)+\nabla_2 F(\mathbf{w}_t, \bar{\boldsymbol{\nu}}_t)^{\top}(\bar{\boldsymbol{\nu}}_{t} - \boldsymbol{\nu}_*)).\]
Combining the last two inequalities and summing over \(t=1,\ldots, T\), we have
\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\alpha_t(F(\mathbf{w}_t,\bar{\boldsymbol{\nu}}_t) - F(\mathbf{w}_*, \boldsymbol{\nu}_*))\right]\\ & \leq \frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + \frac{1}{B}D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0}) +\mathbb{E}\left[\sum_{t=1}^T\frac{\eta\alpha_t^2\sigma_t^2}{2} + \sum_{t=1}^TC\alpha_t^2\delta_t^2\right]. \end{align*}\]
Since \(F_{\text{CERM}}(\mathbf{w}_*) = F(\mathbf{w}_*, \boldsymbol{\nu}_*)\), and \(F_{\text{CERM}}(\mathbf{w}_t)\leq F(\mathbf{w}_t,\bar{\boldsymbol{\nu}}_t)\), we have
\[\begin{align*} &\mathbb{E}\left[\sum_{t=1}^T\alpha_t(F_{\text{CERM}}(\mathbf{w}_t) - F_{\text{CERM}}(\mathbf{w}_*))\right]\\ & \leq \frac{1}{2\eta}\|\mathbf{w}_1- \mathbf{w}_*\|_2^2 + \frac{1}{B}D_\varphi(\boldsymbol{\nu}_*, \boldsymbol{\nu}_{0}) +\mathbb{E}\left[\sum_{t=1}^T\frac{\eta\alpha_t^2\sigma_t^2}{2} + \sum_{t=1}^TC\alpha_t^2\delta_t^2\right]. \end{align*}\]
Plugging in the value of \(\alpha_t\), we finish the proof.
■
5.5.2.3 Why is SCENT Better than SGD?
In this section, we provide theoretical insight into why SCENT outperforms SGD for entropic risk minimization. The key distinction between the two methods lies in their updates of the dual variable \(\nu\): SCENT employs a stochastic proximal mirror descent (SPMD) update, whereas SGD relies on a standard SGD update. Accordingly, our analysis focuses exclusively on the \(\nu\)-update while keeping \(\mathbf{w}\) fixed. In particular, we consider the following problem:
\[\begin{align} \min_{\nu}F(\nu):=\mathbb{E}_{\zeta}e^{s(\zeta) - \nu} + \nu, \end{align}\]
where we omit \(\mathbf{w}\) in \(s(\zeta)\).
Recall the definitions \(z\coloneqq e^{s(\zeta)}\), \(m\coloneqq \mathbb{E}[z]\), \(r(\nu)\coloneqq m e^{-\nu} = e^{\nu_*-\nu}\) as used previously, and the facts \(\nu_*=\arg\min_\nu F(\nu)=\log m\), \(F(\nu_*)=m e^{-\nu_*}+\nu_* = 1+\nu_*.\) Recall the SPMD update:
\[\begin{align*} e^{\nu_t} = \frac{1}{1+\alpha_te^{\nu_{t-1}}}e^{\nu_{t-1}} + \frac{\alpha_te^{\nu_{t-1}}} {1+\alpha_te^{\nu_{t-1}}} e^{s(\zeta_t)}. \end{align*}\]
Let us define an important quantity to characterize the difficulty of the problem:
\[\kappa = \frac{\mathbb{E}[z^2]}{(\mathbb{E}[z])^2},\]
which is known as the second-order moment ratio. Larger \(\kappa\) indicates heavier tails or higher variability relative to the mean.
A Clean Bound of SPMD
The optimality gap can be written as
\[\begin{equation}\label{eq:gap-r-form} F(\nu)-F(\nu_*)=m e^{-\nu}+\nu-(1+\nu_*)=r(\nu)-\log r(\nu)-1. \end{equation}\]
We assume \(s(\zeta)\in[c_0, c_1]\), and without loss of generality we assume \(c_1\leq 0\). If not, we can define \(s'(\zeta)=s(\zeta)-c_1\), \(z'=e^{s'(\zeta)}\), and \(F'(\nu')=\mathbb{E}[z' e^{-\nu'}]+\nu'\). Then \(F(\nu) - F(\nu_*) = F'(\nu') - \min_{\nu'} F'(\nu')\) with \(\nu' = \nu - c_1\).
Lemma 5.37 (Self-bounding inequality)
For all \(r>0\),
\[\begin{equation}\label{eq:r-self-bound} r \;\le\; 2\,(r-\log r). \end{equation}\]
Equivalently, for all \(\nu\in\mathbb{R}\),
\[\begin{equation}\label{eq:r-gap-bound} r(\nu)\le 2\bigl(F(\nu)-F(\nu_*)+1\bigr). \end{equation}\]
Proof.
If \(0<r\le 2\), then \(r\le 2 \le 2(r-\log r)\) since \(r-\log r\ge 1\) for all \(r>0\). If \(r\ge 2\), then \(\log r \le r/2\), hence \(r-\log r\ge r/2\), i.e. \(r\le 2(r-\log r)\). Substituting \(r=r(\nu)\) and using (\(\ref{eq:gap-r-form}\)) yields (\(\ref{eq:r-gap-bound}\)).
■
Theorem 5.16
Suppose \(s(\zeta)\in[c_0, c_1]\) with \(c_1\leq 0\). By setting \(\alpha_t = \alpha = \sqrt{\frac{D_\varphi(\nu_*,\nu_0)m}{2C T\mathrm{Var}(z)}}\leq \min(\frac{m}{4C\mathrm{Var}(z)}, \rho)\) for sufficiently large \(T\), SPMD guarantees that
\[\begin{equation}\label{eq:pmd-final-rate} \frac{1}{T}\sum_{t=1}^T \mathbb{E}\!\left[F(\nu_{t})-F(\nu_*)\right] \le 4\sqrt{2}\, \sqrt{\frac{C\,(\kappa -1)\,\bigl(1-r_0+r_0\log r_0\bigr)}{T}} + \frac{F(\nu_0) - F(\nu_*)}{T}, \end{equation}\]
where \(C=(1+\rho)(1+c_1 -c_0)\), and \(r_0=r(\nu_0)=e^{\nu_*-\nu_0}\).
💡 Why it matters
When \(\nu_0\gg \nu_*\) (over-estimation), then \(1-r_0+r_0\log r_0= O(1)\), and the dominating term becomes \(O(\sqrt{\frac{\kappa}{T}})\). This upper bound characterizes the intrinsic complexity of SPMD, which depends on the second-order moment ratio \(\kappa\). If \(s(\zeta)\sim \mathcal{N}(\mu_s,\sigma_s^2)\), then \(\kappa=e^{\sigma_s^2}\), so the bound depends on \(e^{\sigma_s^2}\) rather than on the exponential of the mean \(\mu_s\).
Proof.
From Lemma 5.31, we obtain the SPMD averaged bound
\[\begin{equation}\label{eq:pmd-template-new} \bar{G}_T \;\coloneqq\; \frac{1}{T}\sum_{t=1}^T \mathbb{E}\!\left[F(\nu_{t})-F(\nu_*)\right] \;\le\; \frac{D_\varphi(\nu_*,\nu_0)}{\alpha T} \;+\; C\,\alpha\,V, \end{equation}\]
where
\[V\coloneqq \frac{1}{T}\sum_{t=1}^T \mathbb{E}[\delta_t^2],\qquad\delta_t^2=\mathbb{E}\!\left[e^{-\nu_{t-1}}(z_t-m)^2\right]=e^{-\nu_{t-1}}\mathrm{Var}(z).\]
Since \(e^{-\nu_{t-1}}=r(\nu_{t-1})/m\), we can rewrite
\[\begin{align}\label{eqn:V-exp} V= \frac{\mathrm{Var}(z)}{m}\cdot \frac{1}{T}\sum_{t=1}^T \mathbb{E}[r(\nu_{t-1})]. \end{align}\]
By Lemma 5.37,
\[\begin{align} &\frac{1}{T}\sum_{t=1}^T \mathbb{E}[r(\nu_{t-1})] \le \frac{2}{T}\sum_{t=1}^T \mathbb{E}\!\left[F(\nu_{t-1})-F(\nu_*)+1\right]\notag\\ &= 2\left(1+\frac{1}{T}\sum_{t=1}^T \mathbb{E}[F(\nu_{t-1})-F(\nu_*)]\right).\label{eqn:ravgb} \end{align}\]
Next, observe the index shift:
\[\begin{align*} &\sum_{t=1}^T \mathbb{E}[F(\nu_{t-1})-F(\nu_*)] = \mathbb{E}[F(\nu_0)-F(\nu_*)] + \sum_{t=1}^{T-1}\mathbb{E}[F(\nu_t)-F(\nu_*)]\\ &\le \mathbb{E}[F(\nu_0)-F(\nu_*)] + \sum_{t=1}^{T}\mathbb{E}[F(\nu_t)-F(\nu_*)]. \end{align*}\]
Dividing by \(T\) yields
\[\begin{equation}\label{eq:shift-bound-new} \frac{1}{T}\sum_{t=1}^T \mathbb{E}[F(\nu_{t-1})-F(\nu_*)] \le \frac{\mathbb{E}[F(\nu_0)-F(\nu_*)]}{T}+\bar{G}_T. \end{equation}\]
Combining this with (\(\ref{eqn:V-exp}\)) and (\(\ref{eqn:ravgb}\)) we have
\[\begin{equation}\label{eq:V-closed-new} V \le \frac{2\,\mathrm{Var}(z)}{m} \left(1+\bar{G}_T+\frac{\mathbb{E}[F(\nu_0)-F(\nu_*)]}{T}\right). \end{equation}\]
Plugging (\(\ref{eq:V-closed-new}\)) into (\(\ref{eq:pmd-template-new}\)) yields
\[\bar{G}_T \le \frac{D_\varphi(\nu_*,\nu_0)}{\alpha T} + \frac{2C\alpha\,\mathrm{Var}(z)}{m} \left(1+\bar{G}_T+\frac{\mathbb{E}[F(\nu_0)-F(\nu_*)]}{T}\right).\]
If \(\alpha\le \frac{m}{4C\,\mathrm{Var}(z)}\), then \(\frac{2C\alpha\,\mathrm{Var}(z)}{m}\le \frac{1}{2}\), and therefore
\[\begin{align*} \bar{G}_T&\le \frac{2D_\varphi(\nu_*,\nu_0)}{\alpha T} +\frac{4C\alpha\,\mathrm{Var}(z)}{m} \left(1+\frac{\mathbb{E}[F(\nu_0)-F(\nu_*)]}{T}\right)\\ & \le \frac{2D_\varphi(\nu_*,\nu_0)}{\alpha T} +\frac{4C\alpha\,\mathrm{Var}(z)}{m} +\frac{F(\nu_0)-F(\nu_*)}{T}. \end{align*}\]
Optimizing the right-hand side over \(\alpha\) (assuming \(T\) is large enough) gives:
\[\frac{1}{T}\sum_{t=1}^T \mathbb{E}\!\left[F(\nu_{t})-F(\nu_*)\right] \le 4\sqrt{2}\, \sqrt{\frac{C\,D_\varphi(\nu_*,\nu_0)\mathrm{Var}(z)}{mT}} + \frac{F(\nu_0) - F(\nu_*)}{T}.\]
With \(r_0\coloneqq r(\nu_0)=e^{\nu_*-\nu_0}\),
\[D_\varphi(\nu_*,\nu_0)=e^{-\nu_*}-e^{-\nu_0}+e^{-\nu_0}(\nu_*-\nu_0)=\frac{1}{m}\bigl(1-r_0+r_0\log r_0\bigr).\]
Since \(\mathrm{Var}(z)/m^2 = \kappa-1\), the convergence upper bound becomes
\[4\sqrt{2}\,\sqrt{\frac{C\,(\kappa -1)\bigl(1-r_0+r_0\log r_0\bigr)}{T}} + \frac{F(\nu_0) - F(\nu_*)}{T}.\]
■
Comparison with SGD.
Benefit under the noisy setting
In order to control the variance, we consider projected SGD. Let \(\Pi_{[c_0,c_1]}\) denote projection onto \([c_0,c_1]\). The projected SGD update is
\[\begin{equation}\label{eq:sgd-proj} \nu_{t+1}=\Pi_{[c_0,c_1]}\bigl(\nu_t-\alpha'\,g_t\bigr), \qquad g_t \coloneqq 1-z_t e^{-\nu_t}, \end{equation}\]
where \(\{z_t\}_{t\ge 0}\) are i.i.d. copies of \(z\) and \(\alpha'>0\) is a constant step size. Note that \(\mathbb{E}[g_t\mid \nu_t]=\nabla F(\nu_t)=1-m e^{-\nu_t}\).
We present a corollary of Theorem 3.5 for SGD to minimize \(F(\nu)\) below.
Corollary 5.4
Suppose \(s(\zeta)\in[c_0, c_1]\) holds and \(F(\cdot)\) is \(L\)-smooth on \([c_0, c_1]\). Let \(\{\nu_t\}\) follow (\(\ref{eq:sgd-proj}\)). If \(\alpha'\leq \frac{1}{L}\), then
\[\bar{G}_T^{\mathrm{SGD}}\;\coloneqq\;\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\!\left[F(\nu_t)-F(\nu_*)\right]\le\frac{(\nu_0-\nu_*)^2}{2\alpha' T}+\alpha'V',\]
where
\[V'= \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\mathrm{Var}(g_t\mid \nu_t)]= \frac{\mathrm{Var}(z)}{T}\sum_{t=0}^{T-1}\mathbb{E}[e^{-2\nu_t}].\]
We quantify the smoothness on the bounded domain of the objective, which introduces an exponential constant.
Lemma 5.38
On \([c_0,c_1]\), the function \(F(\nu)=m e^{-\nu}+\nu\) is \(L\)-smooth with
\[L=\sup_{\nu\in[c_0,c_1]}F''(\nu)=\sup_{\nu\in[c_0,c_1]} m e^{-\nu}=m e^{-c_0}=e^{\nu_*-c_0}.\]
Proof.
We have \(F''(\nu)=m e^{-\nu}\), which is decreasing in \(\nu\), so the maximum over \([c_0,c_1]\) is attained at \(c_0\).
■
Theorem 5.17
By choosing the optimal \(\alpha' = \frac{|\nu_0- \nu_*|e^{c_0}}{\sqrt{2T\mathrm{Var}(z)}}\le \frac{1}{L}=\frac{e^{c_0}}{m}\), SGD’s upper bound becomes
\[\begin{equation}\label{eq:sgd-exp-rate} \bar{G}_T^{\mathrm{SGD}} \le \sqrt{2}|\nu_0-\nu_*|\,e^{\nu_*-c_0}\sqrt{\frac{\kappa -1}{T}}, \end{equation}\]
where \(\kappa = \mathbb{E}[z^2]/(\mathbb{E}[z])^2\).
Proof.
The proof follows Corollary 5.4 by noting that \(V'\leq \mathrm{Var}(z)e^{-2c_0}=m^2(\kappa-1)e^{-2c_0}=e^{2(\nu_*-c_0)}(\kappa-1)\).
■
💡 Why it matters
By comparing the convergence bound of SPMD with that of SGD, the resulting ratio, up to constant factors, is:
\[\frac{1}{|\nu_0 - \nu_*| e^{\nu_* - c_0}}.\]
Notably, this ratio becomes exponentially small in regimes where \(\nu_* \gg c_0\), highlighting the superior efficiency of SPMD.
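The gap can also be observed empirically. The sketch below (with an assumed uniform loss distribution, the SGD step size set to \(1/L\) per Lemma 5.38, and an arbitrary SPMD step size) runs both updates on the scalar \(\nu\)-problem from the same over-estimated start: SPMD settles near \(\nu_*\) quickly, while projected SGD barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
c0, c1 = -8.0, 0.0
m = (np.exp(c1) - np.exp(c0)) / (c1 - c0)   # E[e^s] for s ~ Uniform[c0, c1]
nu_star = np.log(m)                          # approx -2.08

T, alpha = 200, 0.5                          # SPMD step size (arbitrary)
lr = np.exp(c0) / m                          # SGD step size alpha' = 1/L (Lemma 5.38)
nu_spmd = nu_sgd = c1                        # over-estimated start nu_0 = c1 > nu_*
for t in range(T):
    z = np.exp(rng.uniform(c0, c1))
    # SPMD update in closed form (Lemma 5.26)
    nu_spmd = -np.log((np.exp(-nu_spmd) + alpha) / (1 + alpha * z))
    # projected SGD update (eq. sgd-proj)
    nu_sgd = np.clip(nu_sgd - lr * (1 - z * np.exp(-nu_sgd)), c0, c1)

print(nu_star, nu_spmd, nu_sgd)  # SPMD hovers near nu_*; SGD is still far away
```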
Benefit under the noiseless setting
We further show that, even in the noiseless setting, the dependence of the GD update on \(|\nu_0-\nu_*|\) is unavoidable, whereas the PMD update does not exhibit such dependence when \(\nu_0\gg \nu_*\).
In the noiseless setting, where \(m=\mathbb{E}[e^{s(\zeta)}]\) is known, the gradient descent (GD) iteration becomes:
\[\begin{equation}\label{eq:gd-update} \nu_{t+1} \;=\; \nu_t - \alpha'\nabla F(\nu_t) \;=\; \nu_t - \alpha'\bigl(1- me^{-\nu_t}\bigr), \qquad t\ge 0, \end{equation}\]
where \(\alpha'>0\) is a step size. For deterministic PMD, its update is equivalent to (cf. Lemma 5.26):
\[\begin{equation}\label{eq:s-recursion} y_{t+1} \;=\; \frac{y_t+\alpha}{1+\alpha m}, \end{equation}\]
where \(y_t = e^{-\nu_t}\).
Lemma 5.39 (GD vs PMD)
Assume \(\nu_0\gg \nu_*\). Let \(\{\nu_t\}_{t\ge 0}\) follow (\(\ref{eq:gd-update}\)) with \(\alpha'\leq 1\). Then in order to have \(|\nabla F(\nu_t)|\le \epsilon\), we need at least
\[\begin{equation}\label{eq:gd-hit-time} t \;\ge\; \frac{\nu_0-\nu_*-\log\!\bigl(\tfrac{1}{1-\epsilon}\bigr)}{\alpha'}. \end{equation}\]
In contrast, for deterministic PMD update (\(\ref{eq:s-recursion}\)), in order to ensure \(|\nabla F(\nu_t)|\le \epsilon\), it suffices that
\[\begin{equation}\label{eq:pmd-iter-complexity} t = \left\lceil\frac{\log\!\bigl(|1-r_0|/\epsilon\bigr)}{\log(1+\alpha m)}\right\rceil. \end{equation}\]
Proof.
Recall the definition \(r(\nu)\;\coloneqq\; m e^{-\nu} \;=\; e^{\nu_*-\nu}\). We have \(|\nabla F(\nu)| =|1-r(\nu)|.\) From (\(\ref{eq:gd-update}\)),
\[\nu_{t+1}=\nu_t-\alpha'\bigl(1-e^{\nu_*-\nu_t}\bigr).\]
If \(\nu_t\ge\nu_*\), then \(\nu_{t+1}-\nu_*=\nu_t-\nu_* - \alpha'\bigl(1-e^{\nu_*-\nu_t}\bigr)\geq 0\) provided \(\alpha'\leq 1\). Let \(r_t = e^{\nu_* - \nu_t}>0\). Then from the GD update we have
\[r_{t+1} = r_t e^{\alpha' (1-r_t)}\leq r_t e^{\alpha'}\leq r_0 e^{\alpha' (t+1)}.\]
In order to have \(|\nabla F(\nu_t)|\leq \epsilon\), it is necessary to have \(r_t\geq 1-\epsilon\). Hence, we need at least \(t\geq \frac{\log\frac{1-\epsilon}{r_0}}{\alpha'} = \frac{\nu_0 - \nu_*-\log\!\bigl(\tfrac{1}{1-\epsilon}\bigr)}{\alpha'}\).
For deterministic PMD update (\(\ref{eq:s-recursion}\)), since \(r_t = my_t\) we have
\[r_{t+1}-1=\frac{r_t-1}{1+\alpha m}.\]
Taking absolute value yields
\[|\nabla F(\nu_{t+1})|=\frac{|\nabla F(\nu_t)|}{(1+\alpha m)}.\]
Solving \(|\nabla F(\nu_t)|\le |\nabla F(\nu_0)|/(1+\alpha m)^{t}\le \epsilon\) yields (\(\ref{eq:pmd-iter-complexity}\)).
■
💡 Why it matters
Deterministic GD needs at least \(\Omega((\nu_0-\nu_*)/\alpha')\) steps to enter a constant-accuracy region, whereas PMD reduces \(|\nabla F(\nu_t)|\) geometrically with rate \((1+\alpha m)^{-1}\), yielding a complexity of order \(O\!\left(\frac{1}{\log(1+\alpha m)}\log\frac{1}{\epsilon}\right)\), which does not scale with \(\nu_0\) due to \(|1-r_0|=|1-e^{\nu_*-\nu_0}| \leq 1\).
Indeed, in the noiseless setting for PMD, taking the formal limit \(\alpha\rightarrow\infty\) yields \(y_1\rightarrow 1/m\) thus \(\nu_1\rightarrow\nu_*\). This highlights that the PMD update is an implicit, geometry-matched step.
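The contrast is easy to reproduce numerically. In the sketch below (with an assumed value of \(m\) and \(\alpha'=\alpha=1\)), both methods start from the same large \(\nu_0\): after 20 iterations, GD has only moved by roughly \(20\alpha'\), while PMD is already near \(\nu_*\).

```python
import numpy as np

m = 0.125                        # assumed known mean E[z]; nu_* = log(m) ~ -2.08
nu_gd = 100.0                    # large over-estimation nu_0 >> nu_*
y_pmd = np.exp(-nu_gd)           # PMD state y = e^{-nu}, same starting point
alpha_gd = alpha_pmd = 1.0       # GD requires alpha' <= 1 (Lemma 5.39)
for t in range(20):
    nu_gd -= alpha_gd * (1 - m * np.exp(-nu_gd))         # GD update (eq. gd-update)
    y_pmd = (y_pmd + alpha_pmd) / (1 + alpha_pmd * m)    # PMD update (eq. s-recursion)
print(np.log(m), nu_gd, -np.log(y_pmd))  # GD is still near 80; PMD is near nu_*
```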
5.5.2.4 An Optimal Bound for SPMD
In fact, we can improve the convergence rate of SPMD to \(O\left(\frac{\kappa -1}{ T}\right)\), which matches a lower bound to be established. The key is to use a specially designed learning rate scheme for \(\alpha_t\). Recall the SPMD update:
\[\begin{equation}\label{eq:s-update} y_{t} \;=\; \frac{y_{t-1}+\alpha_t}{1+\alpha_t z_t}, \quad\forall t\geq 1, \end{equation}\]
where \(y_{t-1} = e^{-\nu_{t-1}}, z_t = e^{s(\zeta_t)}\).
Lemma 5.40
Let \(S_t\coloneqq \sum_{i=1}^t z_i\) and \(\bar{z}_t\coloneqq S_t/t\). Initialize \(y_1=1/z_1\) (or equivalently \(\alpha_1=\infty\)) and for \(t\ge 2\) choose
\[\begin{equation}\label{eq:alpha-erm} \alpha_t \;\coloneqq\; \frac{y_{t-1}}{t-1}. \end{equation}\]
Then for all \(t\ge 1\),
\[\begin{equation}\label{eq:st-erm} y_t \;=\; \frac{t}{S_t}, \qquad \nu_t \;=\; -\log y_t \;=\; \log\Bigl(\frac{S_t}{t}\Bigr) \;=\; \log \bar{z}_t. \end{equation}\]
In particular, \(\nu_t\) is the exact minimizer of the empirical objective
\[\widehat{F}_t(\nu)\;\coloneqq\;\bar{z}_t e^{-\nu}+\nu\quad\text{since}\quad\arg\min_\nu \widehat{F}_t(\nu)=\log \bar{z}_t.\]
Proof.
We prove (\(\ref{eq:st-erm}\)) by induction. For \(t=1\), \(y_1=1/z_1=1/S_1\) holds by initialization. Assume \(y_{t-1}=(t-1)/S_{t-1}\). Then (\(\ref{eq:alpha-erm}\)) gives \(\alpha_t=1/S_{t-1}\), and the recursion (\(\ref{eq:s-update}\)) yields
\[y_t=\frac{\frac{t-1}{S_{t-1}}+\frac{1}{S_{t-1}}}{1+\frac{z_t}{S_{t-1}}}=\frac{\frac{t}{S_{t-1}}}{\frac{S_{t-1}+z_t}{S_{t-1}}}=\frac{t}{S_{t-1}+z_t}=\frac{t}{S_t}.\]
Thus \(y_t=t/S_t\) and \(\nu_t=-\log y_t=\log(S_t/t)=\log\bar{z}_t\).
■
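A short sketch (with an arbitrary loss distribution) confirms Lemma 5.40 numerically: with \(y_1=1/z_1\) and \(\alpha_t=y_{t-1}/(t-1)\), the final iterate \(\nu_T\) coincides with the empirical log-mean \(\log\bar{z}_T\) up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
z = np.exp(rng.normal(size=1000))   # z_t = e^{s(zeta_t)}, arbitrary distribution

y = 1.0 / z[0]                      # initialization y_1 = 1/z_1 (alpha_1 = infinity)
for t in range(2, len(z) + 1):
    alpha = y / (t - 1)             # the learning-rate scheme of Lemma 5.40
    y = (y + alpha) / (1 + alpha * z[t - 1])

print(-np.log(y), np.log(z.mean()))  # identical up to floating-point error
```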
Assumption 5.20
Assume \(s(\zeta)\) is \(\sigma^2\)-subgaussian, i.e.,
\[\mathbb{E}\big[e^{\lambda(s(\zeta)-\mathbb{E}[s(\zeta)])}\big]\le e^{\lambda^2\sigma^2/2}\quad\forall \lambda\in\mathbb{R}.\]
This includes bounded \(s(\zeta)\). Indeed, if \(s(\zeta)\in[c_0,c_1]\) a.s., then \(s(\zeta)-\mathbb E[s(\zeta)]\) is \((c_1-c_0)^2/4\)-subgaussian by Hoeffding’s lemma.
Since \(\frac{\mathrm{Var}(z)}{(\mathbb{E}[z])^2} = \kappa - 1\), we have
\[\mathrm{Var}(\bar{z}_T)=\frac{\mathrm{Var}(z)}{T}=\frac{(\kappa-1)m^2}{T}.\]
Since Lemma 5.40 gives \(\nu_T=\log\bar{z}_T\), in light of (\(\ref{eq:gap-r-form}\)) we can write
\[\begin{equation}\label{eq:gap-in-Q} F(\nu_T)-F(\nu_*)=\frac{m}{\bar{z}_T}-1+\log\Bigl(\frac{\bar{z}_T}{m}\Bigr)=\frac{1}{Q_T}+\log Q_T-1,\qquad Q_T\coloneqq \frac{\bar{z}_T}{m}. \end{equation}\]
Note that \(\mathbb{E}[Q_T]=1\) and \(\mathrm{Var}(Q_T)=(\kappa-1)/T\).
Let \(U_T\coloneqq Q_T-1=(\bar{z}_T-m)/m\). Then \(\mathbb{E}[U_T]=0\) and \(\mathbb{E}[U_T^2]=(\kappa-1)/T\). Define
\[g(u)\;\coloneqq\;\frac{1}{1+u}+\log(1+u)-1, \forall u>-1,\]
so that by (\(\ref{eq:gap-in-Q}\)) we have \(F(\nu_T)-F(\nu_*)=g(U_T)\).
Lemma 5.41
For all \(u\ge -\tfrac{1}{2}\),
\[g(u)\le 2u^2.\]
Proof.
Define \(h(u)\coloneqq 2u^2-g(u)\) for \(u>-1\). Since \(g'(u)=\frac{u}{(1+u)^2}\), we have
\[h'(u)=4u-\frac{u}{(1+u)^2}=u\Big(4-\frac{1}{(1+u)^2}\Big).\]
For \(u\ge -\tfrac{1}{2}\), \((1+u)^2\ge \tfrac{1}{4}\), hence \(\frac{1}{(1+u)^2}\le 4\). Therefore \(h'(u)\le 0\) for \(u\in[-\tfrac{1}{2},0]\) and \(h'(u)\ge 0\) for \(u\ge 0\). Thus \(h\) attains its minimum over \([-\tfrac{1}{2},\infty)\) at \(u=0\), where \(h(0)=0\). Hence \(h(u)\ge 0\) on \([-\tfrac{1}{2},\infty)\), i.e., \(g(u)\le 2u^2\).
■
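As a quick sanity check (not needed for the proof), the bound of Lemma 5.41 can be verified numerically on a fine grid:

```python
import numpy as np

# g(u) = 1/(1+u) + log(1+u) - 1 versus the quadratic bound 2u^2 on [-1/2, 50]
u = np.linspace(-0.5, 50.0, 1_000_001)
g = 1.0 / (1.0 + u) + np.log1p(u) - 1.0
assert np.all(g <= 2.0 * u**2 + 1e-12)   # holds everywhere on the grid
```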
Lemma 5.42
Let \(z_i\ge 0\) be i.i.d. with mean \(m=\mathbb{E}[z]>0\) and finite \(\kappa=\mathbb{E}[z^2]/m^2\). Then
\[\mathbb{P}(Q_T\le 1/2)=\mathbb{P}(\bar{z}_T\le m/2)\le \exp\!\Big(-\frac{T}{8\kappa}\Big).\]
Proof.
For any \(\lambda>0\), by the Chernoff bound,
\[\mathbb{P}\Big(\sum_{i=1}^T z_i\le \tfrac{Tm}{2}\Big)=\mathbb{P}\Big(e^{-\lambda\sum_{i=1}^T z_i}\ge e^{-\lambda Tm/2}\Big)\le e^{\lambda Tm/2}\Big(\mathbb{E}[e^{-\lambda z}]\Big)^T.\]
Using \(e^{-x}\le 1-x+x^2/2\) for \(x\ge 0\) and then \(1+y\le e^{y}\),
\[\mathbb{E}[e^{-\lambda z}]\le 1-\lambda m+\frac{\lambda^2}{2}\mathbb{E}[z^2]\le\exp\!\Big(-\lambda m+\frac{\lambda^2}{2}\mathbb{E}[z^2]\Big).\]
Therefore
\[\mathbb{P}(\bar{z}_T\le m/2)\le\exp\!\Big(T\Big(\lambda m/2-\lambda m+\frac{\lambda^2}{2}\mathbb{E}[z^2]\Big)\Big)=\exp\!\Big(-T\Big(\frac{\lambda m}{2}-\frac{\lambda^2}{2}\mathbb{E}[z^2]\Big)\Big).\]
Choose \(\lambda=m/(2\mathbb{E}[z^2])\) to get the exponent \(-Tm^2/(8\mathbb{E}[z^2])=-T/(8\kappa)\).
■
Lemma 5.43
If \(s\) is \(\sigma^2\)-subgaussian, then
\[m^2\,\mathbb{E}[z^{-2}] \;=\; (\mathbb{E}[e^{s}])^2\,\mathbb{E}[e^{-2s}]\;\le\; e^{3\sigma^2}.\]
Proof.
Let \(\mu=\mathbb{E}[s]\) and \(X=s-\mu\). Then \(\mathbb{E}[X]=0\) and \(z=e^{s}=e^{\mu}e^{X}\). Thus
\[m^2\mathbb{E}[z^{-2}]=\big(e^{\mu}\mathbb{E}[e^{X}]\big)^2\cdot \big(e^{-2\mu}\mathbb{E}[e^{-2X}]\big)=\big(\mathbb{E}[e^{X}]\big)^2\,\mathbb{E}[e^{-2X}].\]
By subgaussianity,
\[\mathbb{E}[e^{X}]\le e^{\sigma^2/2},\qquad\mathbb{E}[e^{-2X}]\le e^{(2^2)\sigma^2/2}=e^{2\sigma^2}.\]
Hence \(m^2\mathbb{E}[z^{-2}]\le e^{\sigma^2}e^{2\sigma^2}=e^{3\sigma^2}\).
■
Theorem 5.18
Under Assumption 5.20, the SPMD iterate \(\nu_T\) produced with the step sizes \(\alpha_t=y_{t-1}/(t-1)\) of Lemma 5.40 satisfies
\[\begin{equation}\label{eq:pmd-kappa-over-T} \mathbb{E}\big[F(\nu_T)-F(\nu_*)\big]\;\le\;\frac{2(\kappa-1)}{T}\;+\;e^{\frac{3}{2}\sigma^2}\exp\!\Big(-\frac{T}{16\kappa}\Big). \end{equation}\]
In particular, since the second term is exponentially small in \(T/\kappa\),
\[\mathbb{E}\big[F(\nu_T)-F(\nu_*)\big]=O(\kappa/T),\]
for every \(\sigma^2\)-subgaussian \(s(\zeta)\).
Proof.
Since \(F(\nu_T)-F(\nu_*) = g(U_T)\), we split the expectation on the events \(\{U_T\ge -1/2\}\) and \(\{U_T<-1/2\}\):
\[\mathbb{E}[g(U_T)]=\mathbb{E}[g(U_T)\mathbf{1}\{U_T\ge -1/2\}]+\mathbb{E}[g(U_T)\mathbf{1}\{U_T<-1/2\}].\]
On \(\{U_T\ge -1/2\}\), Lemma 5.41 yields
\[\mathbb{E}[g(U_T)\mathbf{1}\{U_T\ge -1/2\}]\le 2\,\mathbb{E}[U_T^2]=\frac{2(\kappa-1)}{T}.\]
On \(\{U_T<-1/2\}\) we have \(Q_T\le 1/2\), and since \(\log Q_T-1\le 0\),
\[g(U_T)=\frac{1}{Q_T}+\log Q_T-1 \le \frac{1}{Q_T}.\]
Hence, by the Cauchy–Schwarz inequality,
\[\mathbb{E}[g(U_T)\mathbf{1}\{U_T<-1/2\}]\le\mathbb{E}[Q_T^{-1}\mathbf{1}\{Q_T\le 1/2\}]\le\big(\mathbb{E}[Q_T^{-2}]\big)^{1/2}\,\mathbb{P}(Q_T\le 1/2)^{1/2}.\]
By Jensen’s inequality applied to the convex map \(x\mapsto x^{-2}\) (so that \(\bar{z}_T^{-2}\le \frac{1}{T}\sum_{i=1}^T z_i^{-2}\)) and Lemma 5.43,
\[\mathbb{E}[Q_T^{-2}]=m^2\,\mathbb{E}[\bar{z}_T^{-2}]\le m^2\,\mathbb{E}[z^{-2}]\le e^{3\sigma^2}.\]
By Lemma 5.42, \(\mathbb{P}(Q_T\le 1/2)\le \exp(-T/(8\kappa))\). Therefore,
\[\mathbb{E}[g(U_T)\mathbf{1}\{U_T<-1/2\}]\le e^{\frac{3}{2}\sigma^2}\exp\!\Big(-\frac{T}{16\kappa}\Big).\]
Combining the two pieces proves (\(\ref{eq:pmd-kappa-over-T}\)).
■
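To illustrate Theorem 5.18, the following Monte Carlo sketch uses an assumption made purely for illustration: Gaussian scores \(s\sim\mathcal{N}(0,\sigma^2)\), for which \(z=e^s\) is log-normal with \(m=e^{\sigma^2/2}\) and \(\kappa=e^{\sigma^2}\). It estimates \(\mathbb{E}[F(\nu_T)-F(\nu_*)]\) for \(\nu_T=\log\bar{z}_T\) and compares it with the leading term \(2(\kappa-1)/T\):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
m = np.exp(sigma**2 / 2)       # E[z] for the log-normal instance
kappa = np.exp(sigma**2)       # E[z^2] / E[z]^2

def gap(nu):
    """F(nu) - F(nu_*) with F(nu) = m e^{-nu} + nu and nu_* = log m."""
    return m * np.exp(-nu) + nu - (1.0 + np.log(m))

for T in (100, 1000, 10000):
    reps = 1000
    zbar = np.exp(rng.normal(0.0, sigma, size=(reps, T))).mean(axis=1)
    print(T, gap(np.log(zbar)).mean(), 2 * (kappa - 1) / T)
```

The empirical gap should decay at the \(1/T\) rate and stay below the \(2(\kappa-1)/T\) term, consistent with the theorem.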
A Distribution-Free Lower Bound
Indeed, we can show that \(O\left(\frac{\kappa -1}{T}\right)\) is optimal by establishing a matching lower bound in a black-box oracle model: the underlying distribution of \(z(\zeta)=e^{s(\zeta)}\) is unknown, and for any query \(\nu\) the oracle returns
\[\Phi(\nu;\zeta)=z(\zeta) e^{-\nu}+\nu,\qquad g(\nu;\zeta)=\nabla_\nu \Phi(\nu;\zeta)=1-z(\zeta)e^{-\nu}.\]
Since
\[z(\zeta)=e^\nu(\Phi(\nu;\zeta)-\nu)=e^\nu(1-g(\nu;\zeta)),\]
any \(T\)-query algorithm can reconstruct \(T\) i.i.d. samples \(z_1,\dots,z_T\) from \(P\). Thus, it suffices to prove the lower bound in the standard i.i.d. sampling model for \(z\).
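The reconstruction is a one-line computation. A tiny sketch (with illustrative values) recovers the hidden sample from either oracle answer:

```python
import numpy as np

rng = np.random.default_rng(2)
z_true = np.exp(rng.normal())              # the hidden sample z(zeta)
nu = 0.37                                  # an arbitrary query point
Phi = z_true * np.exp(-nu) + nu            # value-oracle answer
grad = 1.0 - z_true * np.exp(-nu)          # gradient-oracle answer
assert np.isclose(np.exp(nu) * (Phi - nu), z_true)
assert np.isclose(np.exp(nu) * (1.0 - grad), z_true)
```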
Let us define a distribution class. For \(\kappa\ge 2\), define
\[\mathcal{P}_\kappa\coloneqq\left\{ P:\ z\ge 0,\ 0<\mathbb{E}_P[z]<\infty,\ \frac{\mathbb{E}_P[z^2]}{(\mathbb{E}_P[z])^2}\le \kappa \right\}.\]
Equivalently, \(\mathrm{Var}_P(z)/(\mathbb{E}_P[z])^2 \le \kappa-1\). For \(P\in\mathcal{P}_\kappa\) let \(m(P)=\mathbb{E}_P[z]\) and \(\nu_*(P)=\log m(P)\).
Lemma 5.44
Let \(\phi(u)\coloneqq e^{-u}+u-1\). Then \(\phi(0)=\phi'(0)=0\) and \(\phi''(u)=e^{-u}\). In particular, for all \(|u|\le 1\),
\[\begin{equation}\label{eq:phi-quad} \phi(u)\ \ge\ \frac{e^{-1}}{2}\,u^2. \end{equation}\]
Proof.
On the interval \([-1,1]\), \(\phi''(u)=e^{-u}\ge e^{-1}\), so \(\phi\) is \(e^{-1}\)-strongly convex on \([-1,1]\). Since \(\phi(0)=\phi'(0)=0\), strong convexity implies \(\phi(u)\ge \frac{e^{-1}}{2}u^2\) for all \(|u|\le 1\).
■
Lemma 5.45
Let \(\phi(u)=e^{-u}+u-1\). Fix \(\nu_0<\nu_1\) and let \(\Delta\coloneqq \nu_1-\nu_0\). Define
\[H(\nu)\coloneqq \phi(\nu-\nu_0)+\phi(\nu-\nu_1).\]
Then \(H\) is strictly convex and its unique minimizer \(\nu^\dagger\) lies in \((\nu_0,\nu_1)\). Moreover, if \(\Delta\le 1\), then
\[\begin{equation}\label{eq:inf-sum-loss} \inf_{\nu\in\mathbb{R}} H(\nu)\ \ge\ \frac{e^{-1}}{4}\,\Delta^2. \end{equation}\]
Proof.
We have \(\phi'(u)=1-e^{-u}\) and \(\phi''(u)=e^{-u}>0\), hence \(H\) is strictly convex with
\[H'(\nu)=\phi'(\nu-\nu_0)+\phi'(\nu-\nu_1)=2-e^{-(\nu-\nu_0)}-e^{-(\nu-\nu_1)}.\]
At the endpoints,
\[H'(\nu_0)=2-1-e^{-(\nu_0-\nu_1)}=1-e^{\Delta}<0,\qquad H'(\nu_1)=2-e^{-(\nu_1-\nu_0)}-1=1-e^{-\Delta}>0.\]
Since \(H'\) is strictly increasing (because \(H''>0\)), there is a unique root \(\nu^\dagger\in(\nu_0,\nu_1)\) and thus \(\inf_{\nu\in\mathbb{R}}H(\nu)=\inf_{\nu\in[\nu_0,\nu_1]}H(\nu)\).
Assume \(\Delta\le 1\). Then for all \(\nu\in[\nu_0,\nu_1]\) we have \(|\nu-\nu_0|\le \Delta\le 1\) and \(|\nu-\nu_1|\le \Delta\le 1\). By Lemma 5.44, \(\phi(u)\ge \frac{e^{-1}}{2}u^2\) for all \(|u|\le 1\). Therefore, for all \(\nu\in[\nu_0,\nu_1]\),
\[H(\nu)\ge \frac{e^{-1}}{2}\bigl((\nu-\nu_0)^2+(\nu-\nu_1)^2\bigr).\]
Minimizing the RHS over \(\nu\) yields \(\inf_\nu \bigl((\nu-\nu_0)^2+(\nu-\nu_1)^2\bigr)=\Delta^2/2\), hence \(\inf_{\nu\in\mathbb{R}}H(\nu)\ge \frac{e^{-1}}{4}\Delta^2\).
■
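A grid-based check of (\(\ref{eq:inf-sum-loss}\)) for one concrete \(\Delta\le 1\) (values are illustrative):

```python
import numpy as np

phi = lambda u: np.exp(-u) + u - 1.0
nu0, nu1 = 0.0, 0.8                   # Delta = 0.8 <= 1
nu = np.linspace(nu0, nu1, 100_001)   # the minimizer lies in [nu0, nu1]
H = phi(nu - nu0) + phi(nu - nu1)
print(H.min(), np.exp(-1) / 4 * (nu1 - nu0)**2)   # min exceeds the bound
```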
Lemma 5.46 (Le Cam’s Two-point Method)
Let \(P_0,P_1\) be two distributions and let \(L_0(\cdot),L_1(\cdot)\) be nonnegative loss functions. For any estimator \(\widehat{a}\) measurable w.r.t. the data,
\[\begin{equation}\label{eq:lecam-general} \max\{\mathbb{E}_{P_0}[L_0(\widehat{a})],\ \mathbb{E}_{P_1}[L_1(\widehat{a})]\}\ \ge\ \frac{1-\mathrm{TV}(P_0,P_1)}{2}\ \inf_{a}\bigl(L_0(a)+L_1(a)\bigr). \end{equation}\]
Proof.
Let \(M\coloneqq (P_0+P_1)/2\) and write \(dP_0=(1+f)\,dM\), \(dP_1=(1-f)\,dM\) where \(|f|\le 1\) and \(\int |f|\,dM=\mathrm{TV}(P_0,P_1)\). Then for any (possibly random) decision \(A\),
\[\begin{align*} \mathbb{E}_{P_0}[L_0(A)]+\mathbb{E}_{P_1}[L_1(A)] &=\int \Big(L_0(A)(1+f)+L_1(A)(1-f)\Big)\,dM\\ &=\int \Big( (L_0(A)+L_1(A)) + f(L_0(A)-L_1(A))\Big)\,dM\\ &\ge \int \Big( (L_0(A)+L_1(A)) - |f|\,(L_0(A)+L_1(A))\Big)\,dM\\ &=\int (L_0(A)+L_1(A))(1-|f|)\,dM\\ &\ge \inf_a(L_0(a)+L_1(a))\int (1-|f|)\,dM\\ &=(1-\mathrm{TV}(P_0,P_1))\inf_a(L_0(a)+L_1(a)). \end{align*}\]
Dividing by two and using \(\max\{x,y\}\ge (x+y)/2\) yields (\(\ref{eq:lecam-general}\)).
■
The final distribution-free suboptimality lower bound is stated in the following theorem.
Theorem 5.19
Let \(z=e^{s(\zeta)}\ge 0\) with \(m(P)=\mathbb{E}_P[z]\) and \(\nu_*(P)=\log m(P)\). For \(\kappa\ge 2\), define
\[\mathcal{P}_\kappa\coloneqq\left\{ P:\ z\ge 0,\ 0<\mathbb{E}_P[z]<\infty,\ \frac{\mathbb{E}_P[z^2]}{\mathbb{E}_P[z]^2}\le \kappa \right\}.\]
Let \(F_P(\nu)\coloneqq m(P)e^{-\nu}+\nu\), so that \(\nu_*(P)=\arg\min_\nu F_P(\nu)\). Then there exists an absolute constant \(c>0\) such that for all \(T\ge \kappa\), any (possibly adaptive) algorithm using \(T\) value/gradient oracle calls and outputting \(\widehat\nu\) satisfies
\[\begin{equation}\label{eq:subopt-lb} \sup_{P\in\mathcal{P}_\kappa}\ \mathbb{E}_P\!\left[F_P(\widehat\nu)-F_P(\nu_*(P))\right]\ \ge\ c\,\frac{\kappa-1}{T}. \end{equation}\]
Proof.
We construct two hard instances in \(\mathcal{P}_\kappa\) supported on strictly positive values. Fix \(\varepsilon\in(0,1]\) and define two distributions supported on \(\{\varepsilon,\kappa\}\):
\[P_i^\varepsilon:\quad\mathbb{P}(z=\kappa)=p_i,\qquad\mathbb{P}(z=\varepsilon)=1-p_i,\qquad i\in\{0,1\},\]
where
\[p_0\coloneqq \frac{1}{\kappa},\qquad p_1\coloneqq p_0+h,\qquad h\coloneqq \frac{1}{8\sqrt{\kappa T}}.\]
Since \(T\ge \kappa\), we have \(h\le \frac{1}{8\kappa}\) so \(p_1\in(0,1)\).
Next we show that \(P_0^\varepsilon,P_1^\varepsilon\in\mathcal{P}_\kappa\). For a generic \(p\in(0,1)\) and support \(\{\varepsilon,\kappa\}\), define
\[R_\varepsilon(p)\coloneqq \frac{\mathbb{E}[z^2]}{\mathbb{E}[z]^2}=\frac{p\kappa^2+(1-p)\varepsilon^2}{\bigl(p\kappa+(1-p)\varepsilon\bigr)^2}.\]
Let \(u\coloneqq \varepsilon/\kappa\in(0,1/\kappa]\subset(0,1]\). Then
\[R_\varepsilon(p)=\frac{p+(1-p)u^2}{\bigl(p+(1-p)u\bigr)^2}.\]
We claim \(R_\varepsilon(p)\le \frac{1}{p}\) for all \(u\in[0,1]\). Indeed,
\[\begin{align*} &\bigl(p+(1-p)u\bigr)^2 - p\bigl(p+(1-p)u^2\bigr)\\ &=p^2+2p(1-p)u+(1-p)^2u^2 - p^2 - p(1-p)u^2\\ &=(1-p)u\Bigl(2p+(1-2p)u\Bigr)\ \ge\ 0, \end{align*}\]
because \(u\in[0,1]\) and \(2p+(1-2p)u\ge \min\{2p,1\}\ge 0\). Thus \(R_\varepsilon(p)\le 1/p\). Since \(p_0=1/\kappa\) and \(p_1\ge p_0\), we have \(1/p_i\le \kappa\), hence \(R_\varepsilon(p_i)\le \kappa\) and therefore \(P_0^\varepsilon,P_1^\varepsilon\in\mathcal{P}_\kappa\).
Next, we compute the separation \(\Delta\) between the optima \(\nu_0^\varepsilon\) and \(\nu_1^\varepsilon\). Let \(m_i^\varepsilon=\mathbb{E}_{P_i^\varepsilon}[z]=\varepsilon+p_i(\kappa-\varepsilon)\) and \(\nu_i^\varepsilon=\log m_i^\varepsilon\). Then
\[m_1^\varepsilon-m_0^\varepsilon = h(\kappa-\varepsilon)\ge h(\kappa-1),\qquad m_0^\varepsilon = \varepsilon+p_0(\kappa-\varepsilon)=1+\Bigl(1-\frac{1}{\kappa}\Bigr)\varepsilon\in[1,2].\]
Hence
\[\Delta \coloneqq |\nu_1^\varepsilon-\nu_0^\varepsilon|=\log\!\left(1+\frac{m_1^\varepsilon-m_0^\varepsilon}{m_0^\varepsilon}\right)\ge \frac{1}{2}\cdot \frac{h(\kappa-1)}{2}=\frac{\kappa-1}{32\sqrt{\kappa T}},\]
where we used \(\log(1+x)\ge x/2\) for \(x\in[0,1/2]\), the bound \(m_0^\varepsilon\le 2\), and the fact that \(\frac{h(\kappa-\varepsilon)}{m_0^\varepsilon}\le h\kappa \le 1/8\). In particular, \(\Delta\le h\kappa \le 1/8<1\).
Next, we lower bound \(\inf_{\nu}\Big( (F_0(\nu)-F_0(\nu_0^\varepsilon))+(F_1(\nu)-F_1(\nu_1^\varepsilon))\Big)\). Under \(P_i^\varepsilon\) the objective is \(F_i(\nu)=m_i^\varepsilon e^{-\nu}+\nu\) and the optimal value is \(F_i(\nu_i^\varepsilon)=1+\nu_i^\varepsilon\). Thus the suboptimality can be written as
\[F_i(\nu)-F_i(\nu_i^\varepsilon)=e^{\nu_i^\varepsilon-\nu}+(\nu-\nu_i^\varepsilon)-1=\phi(\nu-\nu_i^\varepsilon),\qquad \phi(u)=e^{-u}+u-1.\]
Note that \(\nu_0^\varepsilon<\nu_1^\varepsilon\), and set \(u=\nu-\nu_0^\varepsilon\). Then
\[\phi(\nu-\nu_0^\varepsilon)+\phi(\nu-\nu_1^\varepsilon)=\phi(u)+\phi(u-\Delta).\]
The function \(u\mapsto \phi(u)+\phi(u-\Delta)\) is convex and its minimizer lies in \([0,\Delta]\). Since \(\Delta\le 1\), applying Lemma 5.45 gives
\[\phi(u)+\phi(u-\Delta)\ \ge\ \frac{e^{-1}}{4}\Delta^2.\]
Therefore,
\[\inf_{\nu}\Big( (F_0(\nu)-F_0(\nu_0^\varepsilon))+(F_1(\nu)-F_1(\nu_1^\varepsilon))\Big)\ \ge\ \frac{e^{-1}}{4}\Delta^2.\]
Next, we bound the total variation between the two instances. Because \(P_0^\varepsilon\) and \(P_1^\varepsilon\) differ only in the Bernoulli parameter,
\[\mathrm{KL}(P_0^\varepsilon, P_1^\varepsilon)=p_0\log\frac{p_0}{p_1}+(1-p_0)\log\frac{1-p_0}{1-p_1}.\]
Using the bound \(\mathrm{KL}(P, Q)\le \chi^2(P, Q)\) and the fact that for Bernoulli measures \(\chi^2(P_0^\varepsilon, P_1^\varepsilon)=\frac{h^2}{p_1(1-p_1)}\), we get
\[\mathrm{KL}(P_0^\varepsilon, P_1^\varepsilon)\le \frac{h^2}{p_1(1-p_1)}.\]
Since \(h\le \frac{1}{8\kappa}\le \frac{1}{2\kappa}\), we have \(p_1\le p_0+h \le \frac{3}{2\kappa}\le \frac{3}{4}\), hence \(1-p_1\ge 1/4\), and also \(p_1\ge p_0=1/\kappa\). Therefore \(p_1(1-p_1)\ge \frac{1}{4\kappa}\) and
\[\mathrm{KL}(P_0^\varepsilon, P_1^\varepsilon)\le 4\kappa h^2.\]
For \(T\) i.i.d. samples, this gives
\[\mathrm{KL}\big((P_0^\varepsilon)^{\otimes T}, (P_1^\varepsilon)^{\otimes T}\big)=T\,\mathrm{KL}(P_0^\varepsilon, P_1^\varepsilon)\le 4\kappa T h^2=\frac{1}{16}.\]
By Pinsker’s inequality,
\[\mathrm{TV}\big((P_0^\varepsilon)^{\otimes T},(P_1^\varepsilon)^{\otimes T}\big)\le\sqrt{\frac{1}{2}\mathrm{KL}\big((P_0^\varepsilon)^{\otimes T}, (P_1^\varepsilon)^{\otimes T}\big)}\le\sqrt{\frac{1}{32}}\le \frac{1}{4}.\]
Finally, we apply Lemma 5.46 to \(P_0=(P_0^\varepsilon)^{\otimes T}\), \(P_1=(P_1^\varepsilon)^{\otimes T}\) and losses \(L_i(\nu)\coloneqq F_i(\nu)-F_i(\nu_i^\varepsilon)\ge 0\). Using (\(\ref{eq:inf-sum-loss}\)) and \(\mathrm{TV}\le 1/4\) yields for any estimator \(\widehat\nu\),
\[\max_{i\in\{0,1\}}\mathbb{E}_{P_i^\varepsilon}\!\left[F_i(\widehat\nu)-F_i(\nu_i^\varepsilon)\right]\ge\frac{1-\mathrm{TV}}{2}\cdot \frac{e^{-1}}{4}\Delta^2\ge\frac{3}{8}\cdot \frac{e^{-1}}{4}\Delta^2=\frac{3e^{-1}}{32}\Delta^2.\]
Substituting \(\Delta^2\ge \frac{(\kappa-1)^2}{1024\,\kappa\,T}\ge \frac{\kappa-1}{2048\,T}\) (since \(\kappa\ge 2\)) gives
\[\max_{i\in\{0,1\}}\mathbb{E}_{P_i^\varepsilon}\!\left[F_i(\widehat\nu)-F_i(\nu_i^\varepsilon)\right]\ \ge\ \frac{3}{65536\,e}\cdot \frac{\kappa-1}{T}.\]
Since \(P_0^\varepsilon,P_1^\varepsilon\in\mathcal{P}_\kappa\), this implies (\(\ref{eq:subopt-lb}\)) with \(c=\frac{3}{65536\,e}\).
■
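To make the construction concrete, the following sketch instantiates the two-point family for given \((\kappa, T)\) with \(T\ge\kappa\) (the particular values are illustrative) and checks the separation and information bounds used in the proof:

```python
import numpy as np

kappa, T, eps = 8.0, 1000, 1.0
h = 1.0 / (8.0 * np.sqrt(kappa * T))
p0, p1 = 1.0 / kappa, 1.0 / kappa + h
m0 = eps + p0 * (kappa - eps)              # mean under P_0
m1 = eps + p1 * (kappa - eps)              # mean under P_1
Delta = np.log(m1) - np.log(m0)            # separation of the optima
kl_T = T * h**2 / (p1 * (1.0 - p1))        # bound on T * KL(P_0, P_1)
print(Delta >= (kappa - 1) / (32 * np.sqrt(kappa * T)))   # True
print(kl_T <= 1 / 16)                                     # True
```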