Factor-augmented group effect selection with application to gene set analysis

Yihe Yang

doi:10.48130/stati-0026-0007

This paper addresses the problem of group variable selection when both the number of groups and the number of variables within each group are large. We propose the factor-augmented group effect selection (FAGES) method, which simultaneously identifies important groups, provides comparable estimates of group effect sizes, and determines the directions of these effects. The key idea is to assume that a low-dimensional latent factor captures the major information within each group and to apply a variable selection penalty to these factors in order to select relevant groups and estimate their effects. We establish the consistency of both parameter estimation and model selection under moderate conditions. Simulation studies demonstrate that FAGES can reliably recover significant groups and estimate their effects, even when the working model is misspecified. In practice, FAGES can be applied to gene set analysis to identify important biological pathways and quantify their direct effects. Using head and neck squamous cell carcinoma data, we illustrate the practical utility of FAGES by detecting multiple pathways implicated in disease development.

Materials and Methods

Notation

For a vector a = (a_j)_{p × 1}, let $ ||{\boldsymbol{a}}||_q=(\sum_{j=1}^p|a_j|^q)^{1/q} $ with $ q\in[0,\infty] $. For a symmetric matrix A = (A_ij)_{p × p}, σ_j(A) denotes the $ j $-th largest eigenvalue of A, $ ||{\bf{A}}||_F^2=\sum_{i}\sum_{j}A_{ij}^2 $, and $ ||{\bf{A}}||_q=\max\{||{\bf{A}}{\boldsymbol{a}}||_q: ||{\boldsymbol{a}}||_q=1\} $. In addition, a_n $ \asymp $ b_n if there are positive constants c and C such that c ≤ a_n/b_n ≤ C. The notations {a_j} and {A_j} represent a set of vectors {a₁, …, a_p} and a set of matrices {A₁, …, A_p}, respectively. Furthermore, diag(a) denotes the diagonal matrix formed from the vector a, and diag(A₁, …, A_p) denotes the block-diagonal matrix formed from the matrices {A_j}. Finally, we define σ_max(A) = σ₁(A) and σ_min(A) = σ_p(A) as the largest and smallest eigenvalues of a symmetric matrix A $ \in\mathbb{R}^{p\times p} $, respectively.

Background

Approximate factor model

A multivariate variable X_i = (X_i1, …, X_ip)^T is said to follow a factor model if

$ {\bf{X}}={\bf{F}}{\boldsymbol{\Lambda}}^\top +{\bf{e}}, $

(1)

where X = (X₁, …, X_n)^T is the sample matrix of X_i, F = (F₁, …, F_n)^T is the matrix of latent factors F_i = (f_i1, …, f_iK)^T, $ {\boldsymbol{\Lambda}} $ = (λ₁, …, λ_p)^T is the loading matrix with λ_j = (λ_j1, …, λ_jK)^T, and e = (e₁, …, e_n)^T is the matrix of idiosyncratic components e_i = (e_i1, …, e_ip)^T. The factor model is known as an approximate factor model (AFM) if the idiosyncratic components are allowed to be cross-sectionally correlated; the AFM has been shown to be effective in explaining the correlation structure of variables in econometrics^[18].

Principal components analysis (PCA) and AFM are closely related^[18]. Specifically, F and $ {\boldsymbol{\Lambda}} $ can be estimated by the following restricted minimization:

$ \begin{aligned} \hat{{\bf{F}}},\hat{{\bf{\Lambda}}}&=\arg\min_{{\bf{F}},{\bf{\Lambda}}}\ ||{\bf{X}}-{\bf{F}}{\bf{\Lambda}}^\top ||_\text{F}^2,\\ &\text{subject to } {\bf{F}}^\top {\bf{F}}/n={\bf{I}}_K \text{ and } {\bf{\Lambda}}^\top {\bf{\Lambda}} \text{ is diagonal}. \end{aligned} $

(2)

The minimizers of the above restricted minimization are explicit and unique up to column signs. Let the singular value decomposition of X be X = $ \sum_{s=1}^p $d_sU_sV_s^T, where U_s is the s-th left singular vector, V_s is the s-th right singular vector, and d_s is the s-th singular value of X. Then $ \hat{{\bf{F}}}=\sqrt n({\boldsymbol{U}}_1,\cdots,{\boldsymbol{U}}_K) $ and $ \hat{{\boldsymbol{\Lambda}}}={\bf{X}}^\top \hat{{\bf{F}}}/n $. That is, $ \hat{{\bf{F}}} $ consists of the first K principal components (PCs) of X, rescaled so that $ \hat{{\bf{F}}}^\top\hat{{\bf{F}}}/n={\bf{I}}_K $.
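The PCA solution of Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration: the function name `pca_factors` and the toy data are assumptions, and X is assumed to be centred already. The loadings are computed as $ {\bf{X}}^\top\hat{{\bf{F}}}/n $, the least-squares solution implied by the constraint $ {\bf{F}}^\top{\bf{F}}/n={\bf{I}}_K $.

```python
import numpy as np

def pca_factors(X, K):
    """Estimate K factors and loadings from an n x p matrix X via the SVD."""
    n = X.shape[0]
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    F_hat = np.sqrt(n) * U[:, :K]      # satisfies F_hat^T F_hat / n = I_K
    Lambda_hat = X.T @ F_hat / n       # p x K least-squares loadings
    return F_hat, Lambda_hat

# Toy data from a 2-factor model (sizes and noise level are arbitrary choices).
rng = np.random.default_rng(0)
n, p, K = 200, 30, 2
F = rng.standard_normal((n, K))
Lam = rng.standard_normal((p, K))
X = F @ Lam.T + 0.1 * rng.standard_normal((n, p))

F_hat, Lambda_hat = pca_factors(X, K)
rel_error = np.linalg.norm(X - F_hat @ Lambda_hat.T) / np.linalg.norm(X)
```

Both identification restrictions hold by construction: `F_hat.T @ F_hat / n` is the identity and `Lambda_hat.T @ Lambda_hat` is diagonal.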

Note that AFM is further connected to functional PCA, which is used to characterize the main pattern of the individualized trajectories around an overall mean trend function in functional data analysis^[19]. In the simulation example on functional regression, we introduce how to use AFM to handle the functional data through the projection-PCA^[16].

Group variable selection approaches

A multivariate variable X_i is said to have a group structure if there is a collection of index sets $ \{{\cal{M}}_1,\cdots,{\cal{M}}_J\} $ such that X_i = $ ({\boldsymbol{X}}_{i{\cal{M}}_1}^\top,\dots,{\boldsymbol{X}}_{i{\cal{M}}_J}^\top)^\top $. A multivariate GLM with grouped variables refers to

$ g({\boldsymbol{\mu}})={\bf{1}}\beta_0+{\bf{X}}_{{\cal{M}}_1}{{\boldsymbol{\beta}}}_{{\cal{M}}_1}+\cdots+{\bf{X}}_{{\cal{M}}_J}{{\boldsymbol{\beta}}}_{{\cal{M}}_J}, $

(3)

where $ {\boldsymbol{\mu}} $ = E(y) and y = (y₁, …, y_n)^T $\in\mathbb{R}^{n} $ is the response, which follows an exponential family distribution^[20]. Here, 1 $ \in\mathbb{R}^{n} $ is the all-one vector, $ {\bf{X}}_{{\cal{M}}_j}=({\boldsymbol{X}}_{1,{\cal{M}}_j},\dots,{\boldsymbol{X}}_{n,{\cal{M}}_j})^\top\in\mathbb{R}^{n\times p_j} $ is the design matrix of variables in group $ {\cal{M}}_j $, and $ {{\boldsymbol{\beta}}}_{{\cal{M}}_j}\in\mathbb{R}^{p_j} $ is the corresponding regression coefficient vector, where $ p_j=|{\cal{M}}_j| $ and $ s\in{\cal{M}}_j $ indexes the variables in the group. Moreover, X = $ ({\bf{1}},{\bf{X}}_{{\cal{M}}_1},\dots, {\bf{X}}_{{\cal{M}}_J})\in\mathbb{R}^{n\times(1+\sum_{j=1}^J p_j)} $ and β = $ (\beta_0,{{\boldsymbol{\beta}}}_{{\cal{M}}_1}^\top,\dots,{{\boldsymbol{\beta}}}_{{\cal{M}}_J}^\top)^\top\in\mathbb{R}^{1+\sum_{j=1}^J p_j} $; g(·) is the canonical link function. In addition, we assume that these sets are known in advance from prior information. For example, the KEGG^[10] database is organized according to the functional pathways of genes, while the gene ontology (GO)^[21] database is built on the ontology annotations of genes.

The regression coefficient β can be estimated through the penalized likelihood^[22] shown below:

$ \hat{{\boldsymbol{\beta}}}=\arg\min\limits_{{{\boldsymbol{\beta}}}}\bigg{\{}{\cal{L}}({\bf{X}}{{\boldsymbol{\beta}}})+n\sum\limits_{j=1}^J\rho_\lambda(||{{\boldsymbol{\beta}}}_{{\cal{M}}_j}||)\bigg{\}}, $

(4)

where $ {\cal{L}}({\boldsymbol{\eta}})=-({\bf{y}}^\top{\boldsymbol{\eta}}-{\bf{1}}^\top b({\boldsymbol{\eta}})) $ is the negative log-likelihood function; η is the linear predictor, which is equal to Xβ in this case; b(·) is a known function whose derivative satisfies b'(x) = g⁻¹(x); and ρ_λ(·) is a group variable selection penalty with penalty parameter λ. There are two categories of group variable selection penalties: group-level selection penalties, such as group LASSO^[4], group SCAD, and group MCP^[5], and bi-level selection penalties, including composite MCP (cMCP)^[6], the group bridge approach^[7], and group exponential LASSO (GEL)^[8]. For example, the group MCP is defined as:

$ \rho^\text{mcp}_\lambda(||{{\boldsymbol{\beta}}}||_2)=\lambda\int_0^{||{{\boldsymbol{\beta}}}||_2}\bigg{(}1-\frac{t}{a\lambda}\bigg{)}_+\text{d}t, $

(5)

where a > 2 is a tuning parameter. Similarly, the GEL penalty is given by

$ \rho^\text{gel}_\lambda(||{{\boldsymbol{\beta}}}||_1)=\frac{\lambda^2}{a}\bigg{\{}1-\exp\bigg{(}-\frac{a||{{\boldsymbol{\beta}}}||_1}{\lambda}\bigg{)}\bigg{\}}, $

(6)

where a > 0 is another tuning parameter. In addition, a weight w_β is generally applied to adjust λ to w_βλ in order to offset the influence of group size. For group-level penalties, w_β is usually set to $ \sqrt{{{\rm{dim}}}({{\boldsymbol{\beta}}})} $; for bi-level penalties, w_β is often set to dim(β).

Note that whether ρ_λ(||β||) is called a group-level or a bi-level penalty depends on the type of norm ||·|| rather than on the expression of ρ_λ. The penalty is called a group-level penalty if it is applied to the $ \ell_2 $-norm of β, and a bi-level penalty if it is applied to the $ \ell_1 $-norm of β. In addition, a group-level penalty can only select important groups, whereas a bi-level penalty can select important variables and groups simultaneously. For more details, refer to Huang et al.^[5].
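The two penalties in Eqs. (5) and (6) and the group-size weights can be evaluated directly. The sketch below is illustrative only: the function names and the default values of `a` are assumptions, and the group MCP uses the closed form of the integral, namely $ \lambda t-t^2/(2a) $ for $ t\le a\lambda $ and $ a\lambda^2/2 $ otherwise.

```python
import numpy as np

def group_mcp(beta, lam, a=3.0):
    """Group MCP of Eq. (5), evaluated at the l2-norm of beta."""
    t = np.linalg.norm(beta, 2)
    # Closed form of lam * int_0^t (1 - x / (a * lam))_+ dx
    if t <= a * lam:
        return lam * t - t**2 / (2.0 * a)
    return a * lam**2 / 2.0

def gel(beta, lam, a=1.0):
    """Group exponential LASSO of Eq. (6), evaluated at the l1-norm of beta."""
    t = np.linalg.norm(beta, 1)
    return lam**2 / a * (1.0 - np.exp(-a * t / lam))

def group_weight(beta, bilevel=False):
    """Group-size weight: sqrt(dim) for group-level, dim for bi-level penalties."""
    d = np.size(beta)
    return float(d) if bilevel else float(np.sqrt(d))

beta = np.array([0.6, 0.8])
lam = 1.0
# The weight rescales lambda before the penalty is applied.
weighted_mcp = group_mcp(beta, group_weight(beta) * lam)
weighted_gel = gel(beta, group_weight(beta, bilevel=True) * lam)
```

The group MCP is flat beyond $ a\lambda $, so large groups are not shrunk further, while GEL is bounded above by $ \lambda^2/a $.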

Statistical methodology of FAGES

Representation

FAGES is a novel group variable selection method that combines the AFM and the group variable selection approaches reviewed in the previous section. Consider a multivariate variable with a group structure, i.e., X_i = $(1,{\boldsymbol{X}}_{i,{\cal{M}}_1}^\top,\dots,{\boldsymbol{X}}_{i,{\cal{M}}_J}^\top)^\top $, where the variables $ {\boldsymbol{X}}_{i,{\cal{M}}_j} $ within the group $ {\cal{M}}_j $ are typically highly correlated. FAGES assumes that the variables in each group follow an AFM:

$ {\bf{X}}_{{\cal{M}}_j}={\bf{F}}_{{\cal{M}}_j}{\boldsymbol{\Lambda}}_{{\cal{M}}_j}^\top+{\bf{e}}_{{\cal{M}}_j}, $

(7)

where $ {\bf{X}}_{{\cal{M}}_j}=({\boldsymbol{X}}_{1,{\cal{M}}_j},\dots,{\boldsymbol{X}}_{n,{\cal{M}}_j})^\top $, $ {\bf{F}}_{{\cal{M}}_j}=({\boldsymbol{F}}_{1,{\cal{M}}_j},\dots,{\boldsymbol{F}}_{n,{\cal{M}}_j})^\top $, $ {\boldsymbol{\Lambda}}_{{\cal{M}}_j}= ({\boldsymbol{\lambda}}_{1,{\cal{M}}_j},\dots,{\boldsymbol{\lambda}}_{p_j,{\cal{M}}_j})^\top $, and $ {\bf{e}}_{{\cal{M}}_j}=({\boldsymbol{e}}_{1,{\cal{M}}_j},\dots,{\boldsymbol{e}}_{n,{\cal{M}}_j})^\top $.

Since the major information of $ {\bf{X}}_{{\cal{M}}_j} $ is represented by $ {\bf{F}}_{{\cal{M}}_j} $, an important group $ {\cal{M}}_j $ can be detected by examining the significance of the corresponding latent factor $ {\bf{F}}_{{\cal{M}}_j} $. Motivated by this principle, FAGES considers a new group-wise factor-augmented GLM (GF-GLM), which regresses a response y on the latent factors of all groups:

$ g({\boldsymbol{\mu}})={\bf{1}}\theta_0+\sum\limits_{j=1}^J{\bf{F}}_{{\cal{M}}_j}{{\boldsymbol{\theta}}}_{{\cal{M}}_j}. $

(8)

Compared with the high-dimensional GLM (3), the grouped factor-augmented GLM discards the redundant parts $ \{e_{i{\cal{M}}_j}\} $ and reduces the dimension of the model from $ \sum_{j=1}^J p_j $ to $ \sum_{j=1}^J K_j $. Compared with the factor-augmented regression^[23], the grouped factor-augmented GLM allows the latent factors of different groups to appear in the same regression, so that the groups can be prioritized correctly and valid inference can be made. In addition, if only a few groups have latent factors that are significantly associated with the phenotype, the group LASSO and its modifications can be applied to select the non-zero $ {{\boldsymbol{\theta}}}_{{\cal{M}}_j} $.
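To make the dimension reduction concrete, the following sketch builds the design matrix of the grouped factor-augmented GLM by stacking an intercept and the first K_j PCs of each group. The function name, the input format for the groups, and the toy sizes are assumptions made for illustration, and each group of columns is assumed to be centred beforehand.

```python
import numpy as np

def groupwise_factors(X, groups, ks):
    """Stack an intercept and the first K_j PCs of each group of columns of X.

    groups: list of column-index lists, one per group (hypothetical format).
    ks:     number of factors K_j retained for each group.
    Returns the stacked factor matrix and the column indices of each group's factors.
    """
    n = X.shape[0]
    blocks = [np.ones((n, 1))]
    factor_groups = []
    start = 1                                   # column 0 is the intercept
    for idx, k in zip(groups, ks):
        U, _, _ = np.linalg.svd(X[:, idx], full_matrices=False)
        blocks.append(np.sqrt(n) * U[:, :k])    # F_hat for this group
        factor_groups.append(list(range(start, start + k)))
        start += k
    return np.hstack(blocks), factor_groups

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 12))
groups = [list(range(0, 5)), list(range(5, 12))]   # p_1 = 5, p_2 = 7
F_hat, factor_groups = groupwise_factors(X, groups, ks=[2, 1])
```

In this toy case the regression has 1 + 3 columns instead of 1 + 12, and `factor_groups` records which columns belong to which group so that a group penalty can be applied to them.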

When the grouped factor-augmented GLM is misspecified, FAGES is still able to select the important groups with high precision. Specifically, consider the following model:

$ \text{E}(y_i|{\boldsymbol{X_i}})=g^{-1}(\eta_i^F+\eta_i^e), $

(9)

Here, $ \eta_i^F=\theta_0+\sum_j{\boldsymbol{F}}_{i{\cal{M}}_j}^\top{\boldsymbol{\theta}}_{{\cal{M}}_j} $ and $ \eta_i^e=\sum_j{\boldsymbol{e}}_{i{\cal{M}}_j}^\top{\boldsymbol{\beta}}_{{\cal{M}}_j} $, which means that both the latent factors $ \{{\boldsymbol{F}}_{i{\cal{M}}_j}\} $ and the idiosyncratic components $ \{{\boldsymbol{e}}_{i{\cal{M}}_j}\} $ have effects on $ y_i $. Using the Taylor expansion, this model reduces to

$ \text{E}(y_i|{\boldsymbol{X_i}})=g^{-1}(\eta_i^F)+O(|\eta_i^{e}|^2), $

(10)

indicating that the latent factors $ \{{\boldsymbol{F}}_{{\cal{M}}_j}\} $ can describe the main variation of y_i as long as they carry most of the information in $ \{{\boldsymbol{X}}_{{\cal{M}}_j}\} $. This is why FAGES can find the important groups and accurately estimate the group effects even if the grouped factor-augmented GLM is misspecified. In the literature, such an approximation has been used by Hall et al.^[24] to analyze discrete functional data. They also found that the top few functional PCs (FPCs) can sufficiently describe the main variation of discrete functional data, which offers empirical support for the robustness of our grouped factor-augmented GLM.

In practice, we adopt a data-adaptive rule to select the number of factors. Specifically, for each group $ {\cal{M}}_j $, let $ \{\sigma_{jk}\}_{k\ge1} $ denote the squared eigenvalues of $ n^{-1}{\bf{X}}_{{\cal{M}}_j}^\top{\bf{X}}_{{\cal{M}}_j} $, truncated at $ K_{\max} $ for stability. We consider three complementary criteria. The first is a gap-ratio statistic^[25]:

$ z_{jk}=\frac{\sigma_{j,k-1}-\sigma_{jk}}{\sigma_{jk}-\sigma_{j,k+1}},\qquad k=2,\dots,K_{\max}, $

(11)

and we set $ K_j^{{\rm{DR}}}=\arg\max_{2\le k\le K_{\max}} z_{jk} $. The second is an eigenvalue-ratio statistic^[26]:

$ r_{jk}=\frac{\sigma_{jk}}{\sigma_{j,k+1}},\qquad k=1,\dots,K_{\max}, $

(12)

yielding $ K_j^{{\rm{ER}}}=\arg\max_{1\le k\le K_{\max}} r_{jk} $. The third criterion applies a hard threshold on the spectrum of the standardized group variables^[27]: letting $ \tilde \sigma_{jk} $ be the eigenvalues of the sample correlation matrix of $ {\bf{X}}_{{\cal{M}}_j} $, we define

$ K_j^{{\rm{ACT}}}=\sum\limits_{k=1}^{K_{\max}}\text{I} \left(\tilde \sigma_{jk} \gt \,1+\sqrt{\frac{p_j}{n-1}}\right). $

(13)

Finally, we take the conservative aggregation

$ K_j=\min \left\{K_j^{{\rm{DR}}},\,K_j^{{\rm{ER}}},\,K_j^{{\rm{ACT}}}\right\}, $

(14)

and estimate the group factors $ {\bf{F}}_{{\cal{M}}_j} $ accordingly, which empirically stabilizes inference by avoiding overestimation of K_j across heterogeneous group sizes and within-group correlation structures. We adopt this conservative strategy because Fan et al.^[27] pointed out that the primary practical risk in factor-augmented models is not underestimating the number of factors but rather including too many spurious factors, which can adversely affect subsequent inference.
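The aggregation rule in Eqs. (11) to (14) can be written out as follows. This is a sketch under stated assumptions: the function name is hypothetical, the two spectra are passed in already sorted in decreasing order so that the code does not commit to how they are computed, and the statistics are implemented exactly as indexed in the equations above, with no handling of ties or zero gaps.

```python
import numpy as np

def select_num_factors(sigma, sigma_corr, n, p, k_max):
    """Combine the DR, ER and ACT criteria for one group.

    sigma:      decreasing spectrum used in Eqs. (11)-(12), length >= k_max + 1.
    sigma_corr: decreasing eigenvalues of the sample correlation matrix, Eq. (13).
    n, p:       sample size and number of variables in the group.
    """
    sigma = np.asarray(sigma, dtype=float)
    sigma_corr = np.asarray(sigma_corr, dtype=float)
    if k_max < 2 or sigma.size < k_max + 1:
        raise ValueError("need k_max >= 2 and at least k_max + 1 values in sigma")

    # Eq. (11): z_k = (sigma_{k-1} - sigma_k) / (sigma_k - sigma_{k+1}), k = 2..k_max
    k = np.arange(2, k_max + 1)
    z = (sigma[k - 2] - sigma[k - 1]) / (sigma[k - 1] - sigma[k])
    k_dr = int(k[np.argmax(z)])

    # Eq. (12): r_k = sigma_k / sigma_{k+1}, k = 1..k_max
    k1 = np.arange(1, k_max + 1)
    r = sigma[k1 - 1] / sigma[k1]
    k_er = int(k1[np.argmax(r)])

    # Eq. (13): count correlation eigenvalues above 1 + sqrt(p / (n - 1))
    k_act = int(np.sum(sigma_corr[:k_max] > 1.0 + np.sqrt(p / (n - 1))))

    # Eq. (14): conservative aggregation
    return {"DR": k_dr, "ER": k_er, "ACT": k_act, "K": min(k_dr, k_er, k_act)}

result = select_num_factors(
    sigma=[10.0, 9.0, 1.0, 0.9, 0.8, 0.7],
    sigma_corr=[3.0, 2.0, 0.5, 0.3, 0.2],
    n=101, p=5, k_max=4,
)
```

Note that the minimum can be zero when no correlation eigenvalue exceeds the threshold; how that case is handled is not stated here, so the sketch simply returns it.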

Estimation and inference

In the implementation, FAGES first estimates the latent factors of each group through PCA and then estimates the corresponding coefficients using the penalized likelihood below:

$ \hat{{\boldsymbol{\theta}}}=\arg\min\limits_{{{\boldsymbol{\theta}}}}\bigg{\{}{\cal{L}}(\hat{{\bf{F}}}{{\boldsymbol{\theta}}})+n\sum\limits_{j=1}^J\rho_{w_j\lambda}(||{{\boldsymbol{\theta}}}_{{\cal{M}}_j}||)\bigg{\}}, $

(15)

where $ {\cal{L}}({\boldsymbol{\eta}}) $ is the negative log-likelihood function; $ {\boldsymbol{\eta}}=\hat{{\bf{F}}}{{\boldsymbol{\theta}}} $ is the linear predictor; $ \hat{{\bf{F}}}=({\bf{1}},\hat{{\bf{F}}}_{{\cal{M}}_1},\dots,\hat{{\bf{F}}}_{{\cal{M}}_J}) $, where $ \hat{{\bf{F}}}_{{\cal{M}}_j} $ consists of the first K_j PCs of $ {\bf{X}}_{{\cal{M}}_j} $; ||·|| is either the $ \ell_2 $- or the $ \ell_1 $-norm, corresponding to group-level and bi-level selection, respectively; and w_j is a given weight for group $ {\cal{M}}_j $. This penalized likelihood reduces to the traditional penalized likelihood if the latent factors are observed or their consistent estimators are given. The block descent algorithm^[28] can be used to solve (Eq. [15]) in a computationally efficient manner.
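As a rough illustration of Eq. (15), consider the Gaussian case with the group LASSO penalty $ \rho_\lambda(t)=\lambda t $, for which the objective divided by n becomes $ (2n)^{-1}||{\bf{y}}-\hat{{\bf{F}}}{{\boldsymbol{\theta}}}||_2^2+\lambda\sum_j w_j||{{\boldsymbol{\theta}}}_{{\cal{M}}_j}||_2 $ up to a constant. The sketch below solves it by plain proximal gradient descent. This swaps in a simpler solver and a convex penalty for the block descent algorithm and the concave penalties discussed above, so it is not the authors' implementation; the function name, the weights $ w_j=\sqrt{K_j} $, and the simulated factors are all assumptions.

```python
import numpy as np

def fit_group_lasso(F, y, groups, lam, n_iter=2000):
    """Proximal gradient for (1/2n)||y - theta0 - F theta||^2 + lam * sum_j w_j ||theta_j||_2."""
    n, d = F.shape
    theta0, theta = 0.0, np.zeros(d)
    # Step size 1/L, where L is the Lipschitz constant of the gradient of the smooth part.
    step = n / np.linalg.norm(np.column_stack([np.ones(n), F]), 2) ** 2
    for _ in range(n_iter):
        resid = y - theta0 - F @ theta
        theta0 += step * resid.mean()               # unpenalized intercept
        theta = theta + step * (F.T @ resid) / n    # gradient step on the factors
        for idx in groups:                          # group-wise soft-thresholding
            norm = np.linalg.norm(theta[idx])
            w = np.sqrt(len(idx))                   # assumed weight w_j = sqrt(K_j)
            scale = max(0.0, 1.0 - step * lam * w / norm) if norm > 0 else 0.0
            theta[idx] = scale * theta[idx]
    return theta0, theta

# Simulated stand-in for the estimated factors: three groups with K_j = 2 each,
# of which only the first group affects the response.
rng = np.random.default_rng(1)
n = 200
F = rng.standard_normal((n, 6))
groups = [[0, 1], [2, 3], [4, 5]]
y = 1.0 + 2.0 * F[:, 0] - 1.5 * F[:, 1] + 0.1 * rng.standard_normal(n)

theta0, theta = fit_group_lasso(F, y, groups, lam=0.3)
selected = [j for j, idx in enumerate(groups) if np.linalg.norm(theta[idx]) > 0]
```

The soft-thresholding step sets whole blocks of coefficients exactly to zero, which is what makes group selection possible; the group LASSO also shrinks the retained coefficients, one motivation for the concave penalties in Eq. (5).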

After obtaining the minimizer $ \hat{{\boldsymbol{\theta}}} $, we propose to quantify the strength of group effect of $ {\cal{M}}_j $ by using the following averaged group effect estimate:

$ \hat z_{{\cal{M}}_j}= \frac1{\sqrt n}||\hat{{\bf{F}}}_{{\cal{M}}_j}\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j}||_2. $

(16)

This averaged group effect estimate is based on the fact that multiple latent factors may exist within a group and their directions are not identifiable under the AFM/PCA representation, rendering individual coefficient signs uninformative. By aggregating the fitted group-specific signal across samples, the averaged group effect provides a meaningful and comparable summary of the overall group contribution regardless of factor orientation.
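The invariance to factor orientation can be checked numerically. In the sketch below (function name and toy values are assumptions), rotating the factors by an orthogonal matrix Q and the coefficients by its transpose changes every individual coefficient but leaves the averaged group effect of Eq. (16) unchanged.

```python
import numpy as np

def averaged_group_effect(F_j, theta_j):
    """z_hat = ||F_j theta_j||_2 / sqrt(n), as in Eq. (16)."""
    n = F_j.shape[0]
    return np.linalg.norm(F_j @ theta_j) / np.sqrt(n)

rng = np.random.default_rng(2)
F_j = rng.standard_normal((100, 3))
theta_j = np.array([0.5, -1.0, 2.0])

# A random orthogonal matrix, standing in for the unidentified factor orientation.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

z_original = averaged_group_effect(F_j, theta_j)
z_rotated = averaged_group_effect(F_j @ Q, Q.T @ theta_j)
z_flipped = averaged_group_effect(-F_j, -theta_j)
```

All three values agree because the fitted signal $ \hat{{\bf{F}}}_{{\cal{M}}_j}\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j} $ itself is unchanged by the rotation.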

With the same motivation, we propose to determine the related effect direction by the sign of certain well-defined correlation statistics $ \widehat{{\rm{cor}}}(\hat{{\bf{F}}}_{{\cal{M}}_j}\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j},{\bf{y}}) $. For example, the rank correlation

$ \hat\omega_{{\cal{M}}_j}=\frac{1}{n(n-1)}\sum\limits_{i=1}^n\sum\limits_{s\neq i}^n{\bf{1}}(\hat{{\boldsymbol{F}}}_{i,{\cal{M}}_j}^\top\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j} \lt \hat{{\boldsymbol{F}}}_{s,{\cal{M}}_j}^\top\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j}){\bf{1}}(y_i \lt y_s)-\frac14 $

(17)

has been demonstrated to be robust for modelling the dependency between two variables^[29]. Other correlations, such as Kendall's $ \tau $, are also appropriate for defining the sign of the group effect. Although the factors of a group $ \hat{{\bf{F}}}_{{\cal{M}}_j} $ themselves lack a straightforward statistical interpretation, their biological significance can be elucidated by examining the signs of certain marker genes in the loading matrix $ \hat{{\boldsymbol{\Lambda}}}_{{\cal{M}}_j} $. This approach facilitates a deeper understanding of the underlying biological implications of the factors. With this specification of the averaged group effect, FAGES is able to rank the importance of significant groups and determine the directions of their effects on the phenotype.
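The rank correlation in Eq. (17) counts ordered pairs on which the fitted group signal and the response move in the same direction. A direct sketch follows; the function name is an assumption, and the pairwise comparison uses O(n²) memory, which is fine for illustration but not for very large n.

```python
import numpy as np

def rank_correlation(score, y):
    """omega_hat of Eq. (17) for a fitted group signal `score` and a response y."""
    score = np.asarray(score)
    y = np.asarray(y)
    n = y.size
    # Entry (i, s) is True when score_i < score_s and y_i < y_s; the diagonal is always False.
    concordant = (score[:, None] < score[None, :]) & (y[:, None] < y[None, :])
    return concordant.sum() / (n * (n - 1)) - 0.25

y = np.arange(5)
omega_increasing = rank_correlation(np.arange(5), y)        # signal rises with y
omega_decreasing = rank_correlation(np.arange(5)[::-1], y)  # signal falls with y
```

Without ties the statistic ranges from -1/4 (perfectly discordant) to 1/4 (perfectly concordant), so its sign gives the effect direction.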

Large sample property

Denote $ \hat{{\bf{F}}}_{{\cal{M}}_j} $ as the matrix consisting of the first K_j PCs of the matrix $ {\bf{X}}_{{\cal{M}}_j} $, $ \hat{{\boldsymbol{\Lambda}}}_{{\cal{M}}_j}={\bf{X}}_{{\cal{M}}_j}^\top\hat{{\bf{F}}}_{{\cal{M}}_j}/n $ as the corresponding loading estimate, $ \hat{{\bf{V}}}_{{\cal{M}}_j} $ as the (K_j × K_j) diagonal matrix consisting of the first K_j eigenvalues of the matrix $ {\bf{X}}_{{\cal{M}}_j}^\top{\bf{X}}_{{\cal{M}}_j}/(np_j) $, and $ {\bf{H}}_{{\cal{M}}_j}^\top=\hat{{\bf{V}}}_{{\cal{M}}_j}^{-1}(\hat{{\bf{F}}}_{{\cal{M}}_j}^\top{\bf{F}}_{{\cal{M}}_j}/n)({\boldsymbol{\Lambda}}_{{\cal{M}}_j}^\top{\boldsymbol{\Lambda}}_{{\cal{M}}_j}/p_j) $. Denote $ {\bf{H}}_{{\cal{M}}}={{\rm{diag}}}({\bf{H}}_{{\cal{M}}_1},\dots,{\bf{H}}_{{\cal{M}}_{J_0}}) $, $ {\bf{H}}_{{\cal{M}}^c}={{\rm{diag}}}({\bf{H}}_{{\cal{M}}_{J_0+1}},\dots,{\bf{H}}_{{\cal{M}}_{J}}) $, and H = $ {{\rm{diag}}}({\bf{H}}_{{\cal{M}}},{\bf{H}}_{{\cal{M}}^c}) $, where J_0 is the number of groups with nonzero effects, which are ordered first without loss of generality. Let $ {{\boldsymbol{\theta}}}^\star=(\theta_0^\star,({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_1})^\top,\dots,({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_J})^\top)^\top $ be the true regression coefficient vector. Denote $ {\cal{M}}=\{{\cal{M}}_j,||{{\boldsymbol{\theta}}}_{{\cal{M}}_j}^\star||_2\neq 0\} $, $ {\cal{M}}^c=\{{\cal{M}}_j, ||{{\boldsymbol{\theta}}}_{{\cal{M}}_j}^\star||_2=0\} $, $ {{\boldsymbol{\theta}}}^\star_{{\cal{M}}}=(\theta_0^\star,({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_1})^\top,\dots,({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_{J_0}})^\top)^\top $, and $ {{\boldsymbol{\theta}}}^\star_{{\cal{M}}^c}=(({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_{J_0+1}})^\top,\dots,({{\boldsymbol{\theta}}}^\star_{{\cal{M}}_{J}})^\top)^\top $ = 0. Next, $ {{\rm{E}}}(y_i)=\mu_i=b'({\boldsymbol{F}}_i^\top{{\boldsymbol{\theta}}}^\star) $, $ {{\rm{var}}}(y_i)=\phi_0b''({\boldsymbol{F}}_i^\top{{\boldsymbol{\theta}}}^\star) $, and W₀ = ϕ₀diag$ (b''({\boldsymbol{F}}_1^\top{{\boldsymbol{\theta}}}^\star),\dots,b''({\boldsymbol{F}}_n^\top{{\boldsymbol{\theta}}}^\star)) $. We consider the standard exponential family distribution so that the dispersion parameter ϕ₀ = 1. Let $ {\boldsymbol{\epsilon}}={\bf{y}}-b'({\bf{F}}{{\boldsymbol{\theta}}}^\star) $ be the residual vector, $ {\boldsymbol{\varepsilon}}={\bf{W}}_0^{-1/2}({\bf{y}}-b'({\bf{F}}{{\boldsymbol{\theta}}}^\star)) $ be the scaled residual vector, and $ {\boldsymbol{\delta}}=(\hat{{\bf{F}}}_{{\cal{M}}}-{\bf{F}}_{{\cal{M}}}{\bf{H}}_{{\cal{M}}}){\bf{H}}_{{\cal{M}}}^{-1}\theta^\star $ be the estimation error of the significant latent factors. In addition, denote $ {\bf{C}}_{{\cal{M}}_j{\cal{M}}}=\lim_{n\to\infty}{\bf{H}}^\top_{{\cal{M}}_j}{\bf{F}}^\top_{{\cal{M}}_j}{\bf{W}}_0{\bf{F}}_{{\cal{M}}}{\bf{H}}_{{\cal{M}}} /n $ and $ {\bf{C}}_{{\cal{M}}{\cal{M}}}=\lim_{n\to\infty}{\bf{H}}^\top_{{\cal{M}}}{\bf{F}}^\top_{{\cal{M}}}{\bf{W}}_0{\bf{F}}_{{\cal{M}}}{\bf{H}}_{{\cal{M}}}/n $.

The following conditions facilitate the proofs of the theorems.

(A1) For all j, the group of variables $ {\boldsymbol{X}}_{i,{\cal{M}}_j} $, the associated common factor $ {\boldsymbol{F}}_{i,{\cal{M}}_j} $, the factor loading matrix $ {\bf{\Lambda}}_{{\cal{M}}_j} $, and the idiosyncratic components $ {\boldsymbol{e}}_{i,{\cal{M}}_j} $ satisfy the standard approximate factor condition given in the supplementary materials.

(A2) The scaled residual ε = (ε₁, …, ε_n)^T is a vector of IID variables satisfying, for all i, E(ε_i) = 0, var(ε_i) = 1, and E(exp(tε_i)) ≤ exp(τ₀t²/2) for all t $ \in\mathbb{R} $, where τ₀ is a scale parameter of the tail. Moreover, {ε_i}, $ \{F_{i,{\cal{M}}_j}\}_{1\leq j\leq J} $, and $ \{e_{i,{\cal{M}}_j}\}_{1\leq j\leq J} $ are mutually independent. Furthermore, max_iE(|ε_i|³) = O(1) and $ n^{-\frac32}\sum_{i=1}^n||{\boldsymbol{F}}_{i,{\cal{M}}}^\top{\bf{H}}_{{\cal{M}}}{\bf{C}}_{{\cal{MM}}}^{-2}{\bf{H}}_{{\cal{M}}}^\top{\boldsymbol{F}}_{i,{\cal{M}}}||_2^3\to $ 0.

(A3) There is a positive constant c₀ such that −c₀ < min_iη_i ≤ max_iη_i < c₀, where $ \eta_i={\boldsymbol{F}}_i^\top{{\boldsymbol{\theta}}}^\star $ for all $ i\in\{1,\dots,n\} $. There is a constant $ c_1 $ such that $ |b'(\eta_i)-b'(\eta_j)|\leq c_1|\eta_i-\eta_j| $ and $ |b''(\eta_i)-b''(\eta_j)|\leq c_1|\eta_i-\eta_j| $ for all $ i,j\in\{1,\dots,n\} $. Furthermore, there is a positive constant $ \sigma_0 $ such that $ \sigma_0<\sigma_{\min }({\bf{C}}_{{\cal{M}}_j{\cal{M}}}^\top{\bf{C}}_{{\cal{M}}_j{\cal{M}}})\leq\sigma_{\max }({\bf{C}}_{{\cal{M}}_j{\cal{M}}}^\top{\bf{C}}_{{\cal{M}}_j{\cal{M}}})<\sigma_0^{-1} $ and $ \sigma_0<\sigma_{\min }({\bf{C}}_{{\cal{M}}{\cal{M}}})\leq\sigma_{\max }({\bf{C}}_{{\cal{M}}{\cal{M}}})<\sigma_0^{-1} $ for all $ j\in\{1,\dots,J\} $.

(A4) The concave penalty $ \rho_\lambda(\cdot) $ with concavity parameter $ a $ satisfies the condition that $ \rho_\lambda(||x||) $ is increasing and concave in $ ||x||\in[0,+\infty) $ with $ \rho_\lambda(0)=0 $, and that $ \rho_\lambda(||x||) $ is differentiable in $ ||x||\in(0,+\infty) $ with $ \rho_\lambda'(0):=\rho_\lambda'(0+). $ In addition, $ \rho_\lambda'(||x||)\geq a_1\lambda $ for all $ ||x||\in[0,a_2\lambda] $, and $ \rho_\lambda'(||x||)=o(n^{-1/2}) $ for all $ ||x||\in[a\lambda,+\infty) $, for any $ a>a_2 $.

(A5) The dimensions of the latent factors $ \{K_j\} $ are bounded, the dimensions of the variables of groups $ \{p_j\} $ satisfy $ p_j\asymp n $ for all $ j $, and the weights $ w_1,\dots,w_J $ are bounded. In addition, $ J_0^2n^{-1}\to0 $ and $ \lambda^{-1}\alpha_n\to0 $, where $ \alpha_n=\max[(J_0/n)^{1/2},\{\log(J)/n\}^{1/2}] $. Furthermore, for both the $ \ell_2 $- and $ \ell_\infty $-norms, there is a positive constant $ c_0 $ such that $ \max_{j\in\{1,\dots,J_0\}}||{{\boldsymbol{\theta}}}_j||<c_0<\infty $ and $ \min_{j\in\{1,\dots,J_0\}}\min_{s\in{\cal{M}}_j}|{{\boldsymbol{\theta}}}_{s,{\cal{M}}_j}|/\lambda\to\infty $.

Condition (A1) collects the standard conditions on the factor structure given by Fan et al.^[30]. Condition (A2) states that we restrict attention to exponential family distributions whose noise term $ \varepsilon_i $ is sub-Gaussian^[31]. In particular, the condition "$ \max_i {\rm{E}}(|\varepsilon_i|^3)=O(1) $ and $ n^{-\frac32}\sum_{i=1}^n||{\boldsymbol{F}}_{i,{\cal{M}}}^\top{\bf{H}}_{{\cal{M}}}{\bf{C}}_{{\cal{MM}}}^{-2}{\bf{H}}_{{\cal{M}}}^\top{\boldsymbol{F}}_{i,{\cal{M}}}||_2^3\to0 $" is imposed to ensure the validity of the Lyapunov condition for asymptotic normality, which is standard in high-dimensional statistical inference; see, for example, Condition 6 in Fan & Lv^[22]. Moreover, (A3) summarizes the standard conditions for the response and the Fisher information matrix of the penalized likelihood function (Eq. [15]). Condition (A4) refers to the standard conditions on group concave penalties given by Fan et al.^[32]. Condition (A5) is crucial for proving the estimation consistency and selection consistency of FAGES. Additionally, in the AFM (Eq. [1]), $ {\bf{F}} $ and $ {\boldsymbol{\Lambda}} $ are not separately identifiable without the restrictions that $ {\bf{F}}^\top{\bf{F}}/n={\bf{I}}_K $ and that $ {\boldsymbol{\Lambda}}^\top{\boldsymbol{\Lambda}} $ is a diagonal matrix, since $ {\bf{F}}{\boldsymbol{\Lambda}}^\top ={\bf{F}}{\bf{Q}}^{-1}{\bf{Q}}{\boldsymbol{\Lambda}}^\top $ for any invertible matrix $ {\bf{Q}} $. To address this problem, Bai^[18] defined an identification matrix $ {\bf{H}}^\top=(\hat{{\boldsymbol{\Lambda}}}^\top\hat{{\boldsymbol{\Lambda}}}/p)^{-1}(\hat{{\bf{F}}}^\top{\bf{F}}/n)({\boldsymbol{\Lambda}}^\top{\boldsymbol{\Lambda}}/p) $, which plays a central role in the study of the asymptotic properties of the AFM. Here, we employ the group-wise identification matrices $ \{{\bf{H}}_{{\cal{M}}_j}\} $ to study the asymptotic properties of FAGES.
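The identification problem can be seen in a small numerical check: for any invertible Q, the pair $ ({\bf{F}}{\bf{Q}}^{-1},{\boldsymbol{\Lambda}}{\bf{Q}}^\top) $ produces exactly the same common component $ {\bf{F}}{\boldsymbol{\Lambda}}^\top $ as $ ({\bf{F}},{\boldsymbol{\Lambda}}) $. The matrices below are arbitrary toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, K = 50, 8, 2
F = rng.standard_normal((n, K))
Lam = rng.standard_normal((p, K))

# An arbitrary invertible K x K matrix (upper triangular with nonzero diagonal).
Q = np.array([[2.0, 1.0],
              [0.0, 0.5]])

F_rot = F @ np.linalg.inv(Q)   # transformed factors F Q^{-1}
Lam_rot = Lam @ Q.T            # transformed loadings, so that Lam_rot^T = Q Lam^T

same_common_part = np.allclose(F @ Lam.T, F_rot @ Lam_rot.T)
factors_differ = not np.allclose(F, F_rot)
```

The data therefore cannot distinguish the two parameterizations, which is why the asymptotic results are stated for the estimator relative to a rotated target rather than the raw coefficients.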

Theorem 1 (model selection consistency) Suppose that conditions (A1)-(A5) are satisfied. Let $ {\cal{O}}_1(\hat{{\boldsymbol{\theta}}}) $ be the event that there exists a strict local minimizer $ \hat{{\boldsymbol{\theta}}} $ in Eq. (15) such that $ ||\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j}||_2>0 $ for all $ j\in\{1,\dots,J_0\} $ and $ {\cal{O}}_2(\hat{{\boldsymbol{\theta}}}) $ be the event that $ ||\hat{{\boldsymbol{\theta}}}_{{\cal{M}}_j}||_2=0 $ for all $ j\in\{J_0+1,\dots,J\} $. Then as $ n\to\infty $, $ \Pr({\cal{O}}_1(\hat{{\boldsymbol{\theta}}})\cap{\cal{O}}_2(\hat{{\boldsymbol{\theta}}}))\to1. $

Theorem 2 (parameter estimation consistency) Suppose that conditions (A1)-(A5) are satisfied. Then, for the local minimizer $ \hat{{\boldsymbol{\theta}}} $ estimated from Eq. (15), $ ||\hat{{\boldsymbol{\theta}}}_{{\cal{M}}}-{\bf{H}}_{{\cal{M}}}^{-1}{{\boldsymbol{\theta}}}_{{\cal{M}}}^\star||_2=O_P(\sqrt{J_0/n}) $. In addition, $ \sqrt n(\hat{{\boldsymbol{\theta}}}_{{\cal{M}}}-{\bf{H}}_{{\cal{M}}}^{-1}{{\boldsymbol{\theta}}}_{{\cal{M}}}^\star)\stackrel{D}{\longrightarrow}{\cal{N}}({\bf{b}}_{{\cal{M}}},{\bf{C}}_{{\cal{M}}{\cal{M}}}^{-1}) $, where the bias term is $ {\bf{b}}_{{\cal{M}}}=\lim_{n\to\infty}\frac1{\sqrt{n}}{\bf{C}}_{{\cal{M}}{\cal{M}}}^{-1}{\bf{H}}_{{\cal{M}}}^\top{\bf{F}}_{{\cal{M}}}^\top{\bf{W}}_0{\boldsymbol{\delta}} $.

Theorem 1 indicates that FAGES achieves model selection consistency. Theorem 2 gives the convergence rate and asymptotic normal distribution of $ \hat{{\boldsymbol{\theta}}}_{{\cal{M}}} $, which are the same as in the case where the sparsity pattern of $ {{\boldsymbol{\theta}}}^\star $ is known. In addition, we give the asymptotic bias $ {\bf{b}}_{{\cal{M}}} $ in the asymptotic normal distribution of $ \hat{{\boldsymbol{\theta}}}_{{\cal{M}}}-{\bf{H}}_{{\cal{M}}}^{-1}{{\boldsymbol{\theta}}}_{{\cal{M}}}^\star $. In theory, this bias term does not vanish as $ n\to\infty $ because its scale is of the same order of magnitude as the variance of $ \hat{{\boldsymbol{\theta}}}_{{\cal{M}}} $, i.e., $ ||{\bf{b}}_{{\cal{M}}}||^2_2=O_p(J_0) $. This asymptotic bias is not estimable in theory, but our empirical analysis shows that its influence is small in practice. We therefore suggest ignoring this asymptotic bias and making inference using the asymptotic covariance matrix $ {\bf{C}}_{{\cal{M}}{\cal{M}}}^{-1} $.

Supplementary Fig. S1 Results of the linear model with respect to Scenario 1.
Supplementary Fig. S2 Results of the linear model with respect to Scenario 2.
Supplementary Fig. S3 Results of the logistic model with respect to Scenario 1.
Supplementary Fig. S4 Results of the logistic model with respect to Scenario 2.

[1]	Fan J, Li R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348−1360 doi: 10.1198/016214501753382273 CrossRef Google Scholar
[2]	Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1):267−288 doi: 10.1111/j.2517-6161.1996.tb02080.x CrossRef Google Scholar
[3]	Zhang CH. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2):894−942 doi: 10.1214/09-aos729 CrossRef Google Scholar
[4]	Yuan M, Lin Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 68(1):49−67 doi: 10.1111/j.1467-9868.2005.00532.x CrossRef Google Scholar
[5]	Huang J, Breheny P, Ma S. 2012. A selective review of group selection in high-dimensional models. Statistical Science 27(4):481−499 doi: 10.1214/12-sts392 CrossRef Google Scholar
[6]	Breheny P, Huang J. 2009. Penalized methods for bi-level variable selection. Statistics and its Interface 2(3):369−380 doi: 10.4310/sii.2009.v2.n3.a10 CrossRef Google Scholar
[7]	Huang J, Ma S, Xie H, Zhang CH. 2009. A group bridge approach for variable selection. Biometrika 96(2):339−355 doi: 10.1093/biomet/asp020 CrossRef Google Scholar
[8]	Breheny P. 2015. The group exponential lasso for bi-level variable selection. Biometrics 71(3):731−740 doi: 10.1111/biom.12300 CrossRef Google Scholar
[9]	Hänzelmann S, Castelo R, Guinney J. 2013. GSVA gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics 14(1):7 doi: 10.1186/1471-2105-14-7 CrossRef Google Scholar
[10]	Kanehisa M. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28(1):27−30 doi: 10.1093/nar/28.1.27 CrossRef Google Scholar
[11]	Chandler G, Polonik W. 2021. Multiscale geometric feature extraction for high-dimensional and non-euclidean data with applications. The Annals of Statistics 49(2):988−1010 doi: 10.1214/20-aos1988 CrossRef Google Scholar
[12]	Bach FR. 2008. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research 9(6):1179−1225 Google Scholar
[13]	Zhao P, Yu B. 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7:2541−2563 Google Scholar
[14]	Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology 67(2):301−320 doi: 10.1111/j.1467-9868.2005.00503.x CrossRef Google Scholar
[15]	Fan J, Ke Y, Wang K. 2018. Factor-adjusted regularized model selection. SSRN Electronic Journal 216(1):71–85 doi: 10.2139/ssrn.3248047 CrossRef Google Scholar
[16]	Fan J, Liao Y, Wang W. 2016. Projected principal component analysis in factor models. The Annals of Statistics 44(1):219–254 doi: 10.1214/15-aos1364 CrossRef Google Scholar
[17]	Yao F, Müller HG, Wang JL. 2005. Functional data analysis for sparse longitudinal data. Journal of the American Statistical Association 100(470):577−590 doi: 10.1198/016214504000001745 CrossRef Google Scholar
[18]	Bai J. 2003. Inferential theory for factor models of large dimensions. Econometrica 71(1):135−171 doi: 10.1111/1468-0262.00392 CrossRef Google Scholar
[19]	Ramsay JO, Silverman BW. 2005. Functional data analysis. New York, US: Springer Science & Business Media. doi: 10.1007/b98888
[20]	Nelder JA, Wedderburn RWM. 1972. Generalized linear models. Journal of the Royal Statistical Society Series A (General) 135(3):370–384 doi: 10.2307/2344614
[21]	Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. 2000. Gene Ontology: tool for the unification of biology. Nature Genetics 25(1):25–29 doi: 10.1038/75556
[22]	Fan J, Lv J. 2011. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory 57(8):5467–5484 doi: 10.1109/tit.2011.2158486
[23]	Bai J, Ng S. 2006. Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74(4):1133–1150 doi: 10.1111/j.1468-0262.2006.00696.x
[24]	Hall P, Müller HG, Yao F. 2008. Modelling sparse generalized longitudinal observations with latent Gaussian processes. Journal of the Royal Statistical Society Series B: Statistical Methodology 70(4):703–723 doi: 10.1111/j.1467-9868.2008.00656.x
[25]	Onatski A. 2009. Testing hypotheses about the number of factors in large factor models. Econometrica 77(5):1447–1479 doi: 10.3982/ECTA6964
[26]	Lam C, Yao Q. 2012. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40(2):694–726 doi: 10.1214/12-AOS970
[27]	Fan J, Guo J, Zheng S. 2022. Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association 117(538):852–861 doi: 10.1080/01621459.2020.1825448
[28]	Breheny P, Zeng Y, Kurth R, Breheny MP. 2024. Package grpreg. https://cran.r-project.org/web/packages/grpreg/index.html
[29]	Li G, Peng H, Zhang J, Zhu L. 2012. Robust rank correlation based screening. The Annals of Statistics 40(3):1846–1877 doi: 10.1214/12-aos1024
[30]	Fan J, Liao Y, Mincheva M. 2013. Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society Series B: Statistical Methodology 75(4):603–680 doi: 10.1111/rssb.12016
[31]	Wainwright MJ. 2019. High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48. UK: Cambridge University Press. doi: 10.1017/9781108627771
[32]	Fan J, Xue L, Zou H. 2014. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics 42(3):819–849 doi: 10.1214/13-aos1198
[33]	Schwarz G. 1978. Estimating the dimension of a model. The Annals of Statistics 6(2):461–464 doi: 10.1214/aos/1176344136
[34]	Wood SN. 2017. Generalized additive models: an introduction with R. New York, US: CRC Press. doi: 10.1201/9781315370279
[35]	Cuylen S, Blaukopf C, Politi AZ, Müller-Reichert T, Neumann B, et al. 2016. Ki-67 acts as a biological surfactant to disperse mitotic chromosomes. Nature 535(7611):308–312 doi: 10.1038/nature18610
[36]	Scholzen T, Gerdes J. 2000. The Ki-67 protein: from the known and the unknown. Journal of Cellular Physiology 182(3):311–322 doi: 10.1002/(SICI)1097-4652(200003)182:3<311::AID-JCP1>3.0.CO;2-9
[37]	Kanehisa M, Furumichi M, Tanabe M, Sato Y, Morishima K. 2017. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45(D1):D353–D361 doi: 10.1093/nar/gkw1092
[38]	Evan GI, Vousden KH. 2001. Proliferation, cell cycle and apoptosis in cancer. Nature 411(6835):342–348 doi: 10.1038/35077213
[39]	Huang SS, Huang JS. 2005. TGF-β control of cell proliferation. Journal of Cellular Biochemistry 96(3):447–462 doi: 10.1002/jcb.20558
[40]	Lorincz-Comi N, Yang Y, Li G, Zhu X. 2024. MRBEE: a bias-corrected multivariable Mendelian randomization method. Human Genetics and Genomics Advances 5(3):100290 doi: 10.1016/j.xhgg.2024.100290
[41]	Sanderson E, Glymour MM, Holmes MV, Kang H, Morrison J, et al. 2022. Mendelian randomization. Nature Reviews Methods Primers 2:6 doi: 10.1038/s43586-021-00092-5
[42]	Wu MC, Lee S, Cai T, Li Y, Boehnke M, et al. 2011. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89(1):82–93 doi: 10.1016/j.ajhg.2011.05.029
[43]	Dewey FE, Gusarova V, Dunbar RL, O'Dushlaine C, Schurmann C, et al. 2017. Genetic and pharmacologic inactivation of ANGPTL3 and cardiovascular disease. The New England Journal of Medicine 377(3):211–221 doi: 10.1056/NEJMoa1612790
[44]	Landfors F, Henneman P, Chorell E, Nilsson SK, Kersten S. 2024. Drug-target Mendelian randomization analysis supports lowering plasma ANGPTL3, ANGPTL4, and APOC3 levels as strategies for reducing cardiovascular disease risk. European Heart Journal Open 4(3):oeae035 doi: 10.1093/ehjopen/oeae035
[45]	Zhang AR, Cai TT, Wu Y. 2022. Heteroskedastic PCA: algorithm, optimality, and applications. The Annals of Statistics 50(1):53–80 doi: 10.1214/21-aos2074
[46]	Aibar S, González-Blas CB, Moerman T, Huynh-Thu VA, Imrichova H, et al. 2017. SCENIC: single-cell regulatory network inference and clustering. Nature Methods 14(11):1083–1086 doi: 10.1038/nmeth.4463
[47]	Fan J, Salathia N, Liu R, Kaeser GE, Yung YC, et al. 2016. Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis. Nature Methods 13(3):241–244 doi: 10.1038/nmeth.3734