
Section 5.1 Finite-sum Coupled Compositional Optimization

Specifically, we focus on the following optimization problem: \[\begin{align}\label{eqn:fcco} \min_{\mathbf{w} \in \mathbb{R}^d} F(\mathbf{w}) := \frac{1}{n} \sum_{i=1}^n f_i\big(\mathbb{E}_{\zeta \sim \mathbb{P}_i} g_i(\mathbf{w}; \zeta)\big), \end{align}\]

where \(g_i(\cdot; \zeta):\mathbb{R}^d\to\mathbb{R}^{d'}\) is a stochastic mapping, \(f_i(\cdot):\mathbb{R}^{d'}\to\mathbb{R}\) is a deterministic function, and \(\mathbb{P}_i\) denotes the distribution of the random variable \(\zeta\).

We refer to this problem as finite-sum coupled compositional optimization (FCCO). Interpreting \(i\) as an outer random variable, the distinctive feature that sets FCCO apart from standard stochastic compositional optimization (SCO) is that each inner stochastic function \(g_i(\mathbf{w};\zeta)\) depends on both the inner random variable \(\zeta\) and the outer index \(i\), which gives rise to the term coupled. While the problem can be cast as a special case of SCO by defining \(f(\mathbf{g})=\frac{1}{n}\sum_{i=1}^n f_i(g_i)\) and \(\mathbf{g}(\mathbf{w})=[g_1(\mathbf{w}),\ldots,g_n(\mathbf{w})]\), the high dimensionality of \(\mathbf{g}\) when \(n\) is large, together with the stochasticity of its components, significantly complicates the construction of unbiased estimators and the theoretical analysis. FCCO therefore warrants the development of specialized optimization methods.
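
To see concretely why unbiased estimation is difficult, the following minimal numpy sketch uses a toy FCCO instance (a quadratic outer function \(f_i(u)=u^2\) and Gaussian inner samples, both chosen purely for illustration): plugging a mini-batch estimate of the inner expectation into the nonlinear \(f_i\) produces a biased estimate of \(F(\mathbf{w})\), and the bias shrinks only as the inner batch size grows, no matter how many such estimates are averaged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FCCO instance (assumed for illustration): n outer components,
# f_i(u) = u**2 (nonlinear outer function), g_i(w; zeta) = w + zeta with
# zeta ~ N(mu_i, 1), so E[g_i(w; zeta)] = w + mu_i and
# F(w) = (1/n) * sum_i (w + mu_i)**2.
n, w = 5, 0.3
mu = rng.normal(size=n)

def F_exact(w):
    return np.mean((w + mu) ** 2)

def F_plugin_estimate(w, batch_size):
    """Plug a mini-batch mean of g_i into f_i -- a biased estimate of F(w)."""
    vals = []
    for i in range(n):
        zeta = rng.normal(loc=mu[i], scale=1.0, size=batch_size)
        g_hat = np.mean(w + zeta)          # unbiased for E[g_i(w; zeta)]
        vals.append(g_hat ** 2)            # f_i(g_hat): biased for f_i(E[g_i])
    return np.mean(vals)

# Averaging many plug-in estimates does not remove the bias: here
# E[f_i(g_hat)] = f_i(E[g_i]) + Var(g_hat) = f_i(E[g_i]) + 1/batch_size.
for B in (1, 10, 100):
    est = np.mean([F_plugin_estimate(w, B) for _ in range(2000)])
    print(f"batch={B:3d}  plug-in estimate = {est:.3f}   exact F(w) = {F_exact(w):.3f}")
```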

Below, we revisit several applications of FCCO in ML and discuss the properties of \(f_i\) and \(g_i\).

Group DRO

In Section 2.2.3, we have formulated the CVaR divergence regularized group DRO as \[\begin{align}\label{eqn:cvargdro2} \min_{\mathbf{w},\nu}\frac{1}{K\alpha}\sum_{i=1}^K[L_i(\mathbf{w})-\nu]_+ + \nu, \end{align}\]

where \(\alpha\in(0,1)\) and \(L_i(\mathbf{w})=\frac{1}{n_i}\sum_{j=1}^{n_i}\ell(\mathbf{w};\mathbf{x}^i_j,y^i_j)\) denotes the average loss over the data from the \(i\)-th group. The first term above is an instance of the FCCO objective, where the outer function \(f(\mathbf{g})=\frac{1}{\alpha}([\mathbf{g}]_1-[\mathbf{g}]_2)_+\) is convex but non-smooth, and each inner mapping \(g_i(\mathbf{w},\nu)=[L_i(\mathbf{w}),\nu]^{\top}\) could be convex or non-convex, smooth or non-smooth, depending on the application.
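
As a concrete illustration, the following sketch (a minimal numpy example) evaluates the objective above under this FCCO interpretation, assuming the per-group average losses \(L_i(\mathbf{w})\) have already been computed; the group losses and parameter values in the usage example are hypothetical.

```python
import numpy as np

def cvar_group_dro_term(L, nu, alpha):
    """CVaR group-DRO objective written in the FCCO form.
    Here L[i] = L_i(w) is the (precomputed) average loss of group i, the
    outer function is f(g) = (1/alpha) * max([g]_1 - [g]_2, 0), and the
    inner mapping is g_i(w, nu) = [L_i(w), nu]."""
    f_vals = np.maximum(L - nu, 0.0) / alpha   # f applied to each g_i
    return np.mean(f_vals) + nu                # (1/K) * sum_i f(g_i) + nu

# Toy usage with K = 4 hypothetical group losses.
L = np.array([0.8, 0.3, 1.2, 0.5])
print(cvar_group_dro_term(L, nu=0.4, alpha=0.25))
```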

AP Maximization

In Section 2.3.2, AP maximization was formulated as the following problem: \[\begin{align} \min_{\mathbf{w}} \frac{1}{n_{+}} \sum_{\mathbf{x}_i\in\mathcal{S}_+} f(g_i(\mathbf{w})), \end{align}\]

where \(\mathcal{S}_+\) is the set of \(n_+\) positive examples, \(g_i(\mathbf{w})=[g_1(\mathbf{w};\mathbf{x}_i,\mathcal{S}),g_2(\mathbf{w};\mathbf{x}_i,\mathcal{S})]^{\top}\) is a vector mapping with two components: \[\begin{align*} &g_1(\mathbf{w};\mathbf{x}_i,\mathcal{S})=\frac{1}{|\mathcal{S}|}\sum_{\mathbf{x}_j\in\mathcal{S}}\mathbb{I}(y_j=1)\ell(h(\mathbf{w};\mathbf{x}_j)-h(\mathbf{w};\mathbf{x}_i))\\ &g_2(\mathbf{w};\mathbf{x}_i,\mathcal{S})=\frac{1}{|\mathcal{S}|}\sum_{\mathbf{x}_j\in\mathcal{S}}\ell(h(\mathbf{w};\mathbf{x}_j)-h(\mathbf{w};\mathbf{x}_i)), \end{align*}\]

and \(f(\mathbf{g})=-\frac{[\mathbf{g}]_1}{[\mathbf{g}]_2}\) is a simple function. Note that \(f\) is non-convex, and it is smooth on the relevant domain provided the surrogate loss \(\ell\) is upper bounded and \(\ell(0)>0\), which keeps \([\mathbf{g}]_2\) bounded away from zero. The inner mapping \(g_i(\mathbf{w})\) could be convex (e.g., for a linear model) or non-convex (e.g., for a deep model), smooth or non-smooth, depending on the application.
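
The sketch below evaluates this AP surrogate in its FCCO form from precomputed prediction scores \(h(\mathbf{w};\mathbf{x}_j)\); the squared-hinge surrogate \(\ell(s)=\max(s+m,0)^2\) with margin \(m\) and the toy scores and labels are assumptions made only for illustration.

```python
import numpy as np

def ap_surrogate_objective(scores, labels, margin=1.0):
    """AP surrogate in the FCCO form: for each positive example x_i,
       g_1(w; x_i, S) = (1/|S|) * sum_j I(y_j = 1) * ell(h_j - h_i),
       g_2(w; x_i, S) = (1/|S|) * sum_j            ell(h_j - h_i),
       f(g) = -[g]_1 / [g]_2,
    averaged over the n_+ positives.  scores[j] = h(w; x_j); the squared
    hinge ell(s) = max(s + margin, 0)**2 is an assumed surrogate choice."""
    ell = lambda s: np.maximum(s + margin, 0.0) ** 2
    pos_mask = labels == 1
    vals = []
    for i in np.where(pos_mask)[0]:
        diffs = ell(scores - scores[i])       # ell(h_j - h_i) for every x_j in S
        g1 = np.mean(diffs * pos_mask)        # positives scored above x_i
        g2 = np.mean(diffs)                   # all examples scored above x_i
        vals.append(-g1 / g2)                 # f(g_i(w)); the j = i term keeps g2 > 0 since ell(0) > 0
    return np.mean(vals)

# Toy usage: 6 examples with hypothetical scores and binary labels.
scores = np.array([2.0, 0.5, 1.5, -0.3, 0.9, -1.0])
labels = np.array([1, 0, 1, 0, 1, 0])
print(ap_surrogate_objective(scores, labels))
```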

Contrastive Representation Learning

The contrastive objective of self-supervised representation learning presented here is the following: \[\begin{equation*} \begin{aligned} \min_{\mathbf{w}} & \frac{1}{n}\sum_{i=1}^n\tau\log\left(\varepsilon+\frac{1}{|\mathcal{S}_i^-|}\sum_{\mathbf{y}\in\mathcal{S}_i^-}\exp((s(\mathbf{w};\mathbf{x}_i,\mathbf{y})-s(\mathbf{w};\mathbf{x}_i,\mathbf{x}^+_i))/\tau)\right). \end{aligned} \end{equation*}\]

The outer function \(f(g)=\tau\log(\varepsilon+g)\) is non-convex (indeed concave) in \(g\), and it is smooth on \(g\ge 0\) provided \(\varepsilon>0\), which keeps its argument bounded away from zero. Each inner function \(g_i\) is, in general, a non-convex function of \(\mathbf{w}\).
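
The sketch below evaluates this contrastive objective in its FCCO form from precomputed similarity scores; the interface (arrays of positive-pair and negative-pair similarities) and the toy values are assumptions made only for illustration.

```python
import numpy as np

def global_contrastive_objective(sim_pos, sim_neg, tau=0.1, eps=1e-8):
    """Contrastive objective in the FCCO form: for each anchor x_i,
       g_i(w) = (1/|S_i^-|) * sum_{y in S_i^-} exp((s(w; x_i, y) - s(w; x_i, x_i^+)) / tau),
       f(g)   = tau * log(eps + g),
    averaged over the n anchors.  sim_pos[i] = s(w; x_i, x_i^+) and
    sim_neg[i, k] is the similarity of anchor i to its k-th negative
    (precomputed scores, an assumed interface)."""
    g = np.mean(np.exp((sim_neg - sim_pos[:, None]) / tau), axis=1)  # inner mean over negatives
    return np.mean(tau * np.log(eps + g))                            # outer f, averaged over anchors

# Toy usage: 4 anchors, 8 negatives each, with random similarity scores.
rng = np.random.default_rng(0)
sim_pos = rng.uniform(0.5, 1.0, size=4)
sim_neg = rng.uniform(-1.0, 0.5, size=(4, 8))
print(global_contrastive_objective(sim_pos, sim_neg))
```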
