-
Change-point analysis, the detection of structural changes in a data sequence, has been an active area of research and has attracted increasing attention with the growing availability of temporal data. It has applications across a wide range of fields, including environmental sciences, econometrics, biology, geosciences, and linguistics. In this context, the accurate and efficient detection of multiple change-points (MCP) is undoubtedly one of the most crucial issues. For example, Liu et al. [1] introduced a general framework for high-dimensional change-point detection by constructing a U-statistic-based cumulative sum matrix $ \mathcal{C} $ and aggregating it based on the adjusted $ L_{p} $-norm, while Liu et al. [2] focused on high-dimensional linear models, providing asymptotic validity and an extension to multiple change-points via binary segmentation. For more comprehensive reviews of existing approaches to MCP inference, see Aue & Horváth [3] and Niu et al. [4]. However, obtaining consistent estimators for the number and locations of MCP typically requires stringent conditions on the magnitude of changes, as extensively documented in prior studies [5–13]. Unfortunately, such requirements are often unrealistic: small change magnitudes tend to cause underfitting, so an overfitted selection set is often produced by conservative algorithms. Furthermore, the empirical performance of certain detection methods is intrinsically linked to the choice of tuning parameters, which requires access to unavailable population-level information. These two issues, referred to as the unreliability of assumptions and the unreliability of algorithms, can introduce false discoveries, potentially contributing to the reproducibility crisis if an excessive number of false detections occur.
To tackle this problem, a natural solution is to detect the active set while controlling the false discovery rate (FDR) at a pre-specified level. A widely adopted strategy is to treat change-point detection as a multiple hypothesis testing problem, utilizing classical p-value-based methods [14–18] to control the FDR; notable works include Hao et al. [19], Li et al. [20], and Cheng et al. [21]. These methods work properly for the univariate mean change problem, but extending them to a multi-dimensional setting is challenging, since the model's complexity renders the derivation of p-values intractable. Leveraging the knockoff framework [22–26], Liu et al. [27] proposed a generalized knockoff procedure (GKnockoff) to control the FDR when detecting structural changes in the coefficients of a linear regression model. More recently, Du et al. [28] and Dai et al. [29] proposed a data-splitting-based FDR control framework, which outperforms knockoff methods in power under moderate to strong dependence and is more robust than asymptotic p-value-based methods. Chen et al. [30] adopted this data-splitting philosophy and proposed a data-driven selection procedure for MCP detection with error rate control. This approach is quite general and can handle complex MCP scenarios. However, on the one hand, the symmetry of the proposed statistic depends on the sample size: if the sample size is too small, the comparison statistic may become asymmetric, distorting the FDR control. On the other hand, data-splitting inevitably reduces the power of change-point detection, as only half of the information is utilized.
The limitations of data-splitting motivate the development of a novel framework for controlling error rates in high-dimensional MCP detection. This framework, termed the Synthetic Data Filter (SD filter), is designed to enhance the accuracy of multiple change-point identification. The SD filter procedure consists of three steps. First, the dataset is divided according to the parity of the temporal order, as inspired by Chen et al. [30]; change-point detection is performed on the odd part, and can be carried out using the adaptive $ \ell_q $-aggregated CUSUM-type statistic introduced by Liu et al. [1]. Next, information from the odd part is leveraged to generate a synthetic dataset. In the final step, the synthetic dataset is merged with the reserved even-part dataset to construct symmetric statistics and control the false discovery rate (FDR). Figure 1 presents a flow chart summarizing the proposed SD filter. The SD filter asymptotically controls the FDR at the desired level while achieving a power that approaches one under mild conditions. Moreover, it handles high-dimensional MCP problems and represents, to the best of current knowledge, the first attempt at controlling the error rate in this context. The synthetic data framework clearly departs from the aforementioned error rate control procedures: unlike conventional methods that quantify the distribution of the test statistic, a Gaussian multiplier bootstrap [31,32] is employed to construct a mirror statistic for FDR control. This framework is highly general, easily extendable to various model settings, and applicable to any scenario requiring data splitting for simultaneous hypothesis formulation and testing. By integrating the synthetic random sample with the second dataset in the testing phase, the original sample size is maintained, thereby boosting statistical power.
-
Suppose that a sequence of independent data $ \mathcal{Z} = \{ \mathbf{z}_1, \ldots, \mathbf{z}_{2n}\} $ has been observed, collected from
$ \mathbf{z}_i \sim F(\cdot | \boldsymbol{\beta}_k), \quad \tau_k \lt i \leqslant \tau_{k+1},\; k = 0, \ldots, K_n;\; i = 1, \ldots, 2n $ (1)
where $ K_n $ is the number of change-points, which may diverge with the sample size $ n $, and the $ \tau_{k} $'s are the change locations, with the convention that $ \tau_0 = 0 $ and $ \tau_{K_n+1} = 2n $. $ F(\cdot | \boldsymbol{\beta}_k) $ represents the model structure of segment $ k $, where $ \boldsymbol{\beta}_k $ is a $ d $-dimensional parameter vector of interest satisfying $ \boldsymbol{\beta}_{k} \ne \boldsymbol{\beta}_{k+1} $. The MCP setting in Eq. (1) is quite general and encompasses many classical models, such as the multivariate mean change model and the regression model with structural breaks [33].

The objective of this study is to detect the active set $ \mathcal{S} = \{\tau_k, k = 1, \ldots, K_n\} $ while controlling the FDR. However, the definition of FDR in this setting is context-specific, as the optimal rate of change-point estimation is characterized by $ O_p(1) $ [34]. Therefore, it is essential to first establish the corresponding concepts in this context. Given a candidate change-point set $ \hat{\mathcal{S}} = \{ \hat{\tau}_k, k = 1, \ldots, \hat{K}_n\} $, a false discovery in MCP detection is defined as follows:

Definition 2.1 (False discovery). The candidate change-point $ \hat{\tau}_{k} \in \hat{\mathcal{S}} $ is a false discovery if no true change-point falls in
$ G_k:=\left[\lceil(\hat{\tau}_{k-1}+\hat{\tau}_k)/2\rceil,\lceil(\hat{\tau}_k+\hat{\tau}_{k+1})/2\rceil\right) $ (2)
where $ \hat{\tau}_0 = 0 $ and $ \hat{\tau}_{\hat{K}_n+1} = 2n $.

Further, let the set $ \mathcal{I}_0 $ encompass all the false discoveries; then the set $ \mathcal{I}_1 = \hat{\mathcal{S}}\cap \mathcal{I}_0^c $ contains all the true discoveries in the selection set $ \hat{\mathcal{S}} $. Throughout this paper, $ \mathcal{I}_1 $ and $ \mathcal{I}_0 $ are referred to as the informative and uninformative sets, respectively. This definition is well-posed, as each candidate change-point is unambiguously classified as either a true or a false discovery, with no overlap between the two categories. Based on this, the corresponding FDR is defined as
$ \text{FDR}(\mathcal{T})=\text{E}\left[\dfrac{|\mathcal{T}\cap\mathcal{I}_0|}{|\mathcal{T}|}\right] $ (3)
where $ \mathcal{T} $ represents a subset of $ \hat{\mathcal{S}} $ yielded by the selection procedure, and $ |A| $ denotes the cardinality of set $ A $. The FDR is the expected value of the false discovery proportion (FDP), the ratio of false discoveries to the total number of discoveries.

Synthetic data generating process
-
This subsection first introduces the synthetic data generating procedure. Following the order-preserving splitting procedure in Zou et al. [33], the data $ \mathcal{Z} $ are partitioned into odd and even parts:
$ \mathcal{Z}^{O} := \{ \mathbf{z}_{2i-1}, i = 1, \ldots, n\} \text{ and } \mathcal{Z}^{E} := \{ \mathbf{z}_{2i}, i = 1, \ldots, n\} $
On the subset $ \mathcal{Z}^{O} $, a candidate set of change-points $ \hat{\mathcal{S}} = \{ \hat{\tau}_1, \ldots, \hat{\tau}_{\hat{K}_n}\} $ is estimated using a suitable detection algorithm, such as the aggregated CUSUM method. This training phase allows for the possibility that $ \hat{\mathcal{S}} $ may overestimate the true set of change-points $ \mathcal{S} $. Based on the candidate set $ \hat{\mathcal{S}} $, the sets $ G_k $ are defined according to Definition 2.1. Then the odd sample $ \mathcal{Z}^O $ and the even sample $ \mathcal{Z}^E $ are partitioned into segments $ \mathcal{Z}_{G_k}^O := \{ \mathbf{z}_{2i-1}: i \in G_k\} $ and $ \mathcal{Z}_{G_k}^E := \{ \mathbf{z}_{2i}: i \in G_k\} $, respectively. In the validation stage, $ \mathcal{Z}^{O} $ is treated as given to avoid dealing with intractable post-selection inference.

Let $ \ell(\boldsymbol{\beta}; \mathbf{z}_i) $ be a suitable loss function evaluated at data point $ \mathbf{z}_i $, with its derivative denoted as $ \mathbf{s}_{\boldsymbol{\beta}}(\mathbf{z}_i) = {\partial \ell(\boldsymbol{\beta}; \mathbf{z}_i)}/{\partial \boldsymbol{\beta}} $. Ideally, for a given $ d $-dimensional reference vector $ \boldsymbol{\gamma} $, $ \text{E}\{ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)\} \ne \text{E}\{ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_{i^\prime})\} $ is expected when there is a change between $ i $ and $ i^\prime $, since the score function remains constant in regions without change. This motivates the decomposition of the score as
$ \mathbf{s}_{\boldsymbol{\gamma}}( \mathbf{z}_i) = \boldsymbol{\mu}_i + \boldsymbol{\zeta}_i, \quad i =1,\ldots, 2n $ (4)
where $ \boldsymbol{\mu}_i = \text{E}[\mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)] $ is the expected score at $ \mathbf{z}_i $, and $ \boldsymbol{\zeta}_i = \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i) - \text{E}[\mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)] $ represents the residual. It is further assumed that $ \text{Cov}(\boldsymbol{\zeta}_i) = \boldsymbol{\Sigma}^{(k)} $ for all $ i $ such that $ \tau_{k}+1 \leqslant i \leqslant \tau_{k+1} $. This score-type framework was first introduced by Zou et al. [33] for selecting the number of change-points. The procedure is often invariant to the choice of $ \boldsymbol{\gamma} $, which can therefore be set as $ \boldsymbol{\gamma} := \arg\min_{ \boldsymbol{\beta}} \sum_{ \mathbf{z}_i \in \mathcal{Z}^{O}} \ell(\boldsymbol{\beta}; \mathbf{z}_i) $ when no prior information is available. Given a specific $ \boldsymbol{\gamma} $, the score function $ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i) $ is denoted as $ \mathbf{s}_i $ for simplicity.

To monitor the change magnitude in the dataset $ \mathcal{Z}_{G_k}^E $, the CUSUM statistic is employed for the scores $ \mathbf{s}_i^E $ in $ \mathcal{Z}_{G_k}^E $:
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \mathbf{s}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \mathbf{s}^E_{i}\right) $ (5)
where $ n_k = |G_k| $ and $ s $ varies in $ G_k $, excluding points that are too close to either end. Substituting $ \boldsymbol{\mu}_i^E + \boldsymbol{\zeta}_i^E $ for $ \mathbf{s}^E_i $ yields
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right) + \Delta_k(s) $ (6)
where $ \Delta_k(s) = \sqrt{{s(n_k-s)}/{n_k}}\left({s^{-1}} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\mu}_{i}^E-(n_k-s)^{-1} \sum_{i \gt s, i \in G_k} \boldsymbol{\mu}_{i}^E\right) $. If $ \hat{\tau}_k $, located at the $ s $-th position of $ G_k $, is a false discovery, then $ \Delta_k(s) = 0 $. Hence the CUSUM statistic $ \mathbf{c}_k $ is just a combination of the errors:
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right), \text{ if } \hat{\tau}_k \in \mathcal{I}_0 $ (7)
Observe that for a false discovery $ \hat{\tau}_k $ lying at the $ s $-th position of $ G_k $, $ \mathbf{c}_{k}(s) $ consists of two parts: $ s^{-1} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ and $ {(n_k - s)^{-1}}\sum_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i} $. By the central limit theorem, the two scaled summations $ s^{-1/2} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ and $ (n_k - s)^{-1/2} \sum_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ converge to a $ d $-dimensional normal distribution with mean zero and covariance matrix $ \boldsymbol{\Sigma}^{(k)} $, provided that both $ s $ and $ n_k-s $ are sufficiently large.

Based on this intuition, i.i.d. random variables $ \xi_1, \ldots, \xi_{n_k} $ are introduced, each following the standard normal distribution $ N(0,1) $ and independent of the original dataset $ \mathcal{Z} $. Define two partial means based on $ \mathcal{Z}^O $ as $ \bar{ \mathbf{s}}_k^{O,-}(s) = s^{-1}\sum_{i \leqslant s, i \in G_k} \mathbf{s}^O_i $ and $ \bar{ \mathbf{s}}_k^{O,+}(s) = (n_k-s)^{-1} \sum_{i \gt s, i \in G_k} \mathbf{s}^O_{i} $. Then a synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ is generated, consisting of the following two parts.

Definition 2.2. Define the synthetic data based on the training sample as:
$ \left\{\xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right), i \leqslant s, i \in G_k \right\} \text{ and } \left\{\xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right), i \gt s, i \in G_k \right\} $ (8)

Using the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $, a synthetic CUSUM statistic can further be constructed:
$ \begin{split} \tilde{ \mathbf{c}}_{k}(s)=\;&\sqrt{\dfrac{s(n_k-s)}{n_k}}\Bigg(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)-\\& \dfrac{1}{n_k - s}\displaystyle\sum\limits_{i \gt s, i \in G_k} \xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \Bigg) \end{split} $ (9)
Given the dataset $ \mathcal{Z}_{G_k}^O $, the two summations in $ \tilde{ \mathbf{c}}_{k}(s) $ are also normally distributed, that is,
$ \begin{split} \dfrac{1}{\sqrt{s}} \sum\limits_{i \leqslant s, i \in G_k} \xi_{i}\left( \mathbf{s}_{i}^O-\bar{ \mathbf{s}}_k^{O,-}(s) \right) \sim N\left(0, \widehat{\boldsymbol{\Sigma}}^{(k)-}(s)\right) \\\text{ and } \dfrac{1}{\sqrt{n_k - s}}\sum\limits_{i \gt s, i \in G_k} \xi_{i}\left( \mathbf{s}_{i}^O-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \sim N\left(0, \widehat{\boldsymbol{\Sigma}}^{(k)+}(s)\right) \end{split} $
where
$ \begin{split} \widehat{\boldsymbol{\Sigma}}^{(k)-}(s) &= \dfrac{1}{s}\displaystyle\sum\limits_{i \leqslant s, i \in G_k} \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right) \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)^\top \\ \widehat{\boldsymbol{\Sigma}}^{(k)+}(s) &= \dfrac{1}{n_k - s}\displaystyle\sum\limits_{i \gt s, i \in G_k} \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right)^\top \end{split} $
are two plausible estimates of $ \boldsymbol{\Sigma}^{(k)} $. Hence, the synthetic CUSUM statistic $ \tilde{\mathbf{c}}_k(s) $ mimics the distributional behavior of the original CUSUM statistic $ \mathbf{c}_k(s) $ when there are no change-points, and it is independent of $ \mathbf{c}_k(s) $ given the odd samples $ \mathcal{Z}^O $.
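To make the construction concrete, the statistics in Eq. (5) and Eq. (9) for a single candidate interval can be sketched in NumPy (an illustrative sketch, not the paper's implementation; the array and function names are assumptions):

```python
import numpy as np

def cusum(scores, s):
    """CUSUM contrast c_k(s) of Eq. (5) for an (n_k, d) array of scores."""
    n = scores.shape[0]
    left = scores[:s].mean(axis=0)
    right = scores[s:].mean(axis=0)
    return np.sqrt(s * (n - s) / n) * (left - right)

def synthetic_cusum(scores_odd, s, rng):
    """Synthetic CUSUM statistic of Eq. (9): i.i.d. N(0,1) multipliers applied
    to the odd-part scores centered within each side of the split point s."""
    n = scores_odd.shape[0]
    xi = rng.standard_normal(n)  # Gaussian multipliers, independent of the data
    left = scores_odd[:s] - scores_odd[:s].mean(axis=0)    # s_i^O - s̄_k^{O,-}(s)
    right = scores_odd[s:] - scores_odd[s:].mean(axis=0)   # s_i^O - s̄_k^{O,+}(s)
    syn_left = (xi[:s, None] * left).mean(axis=0)
    syn_right = (xi[s:, None] * right).mean(axis=0)
    return np.sqrt(s * (n - s) / n) * (syn_left - syn_right)
```

Conditional on the odd scores, `synthetic_cusum` has mean zero with side covariances given by the sample covariances $ \widehat{\boldsymbol{\Sigma}}^{(k)\pm}(s) $, which is what makes it a stand-in for $ \mathbf{c}_k(s) $ under the null.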
, if it is a false discovery, the distributions of$ \hat{\tau}_k $ and$ \mathbf{c}_k(s) $ are highly similar. In contrast, if$ \tilde{ \mathbf{c}}_{k}(s) $ corresponds to a true change point, the two distributions differ significantly. This distinction is visually demonstrated in Fig. 2, where the distributions align closely in the absence of a change point but diverge otherwise. The synthetic data thus serves as a diagnostic tool for detecting false discoveries, thereby enhancing selection power in the validation step.$ \hat{\tau}_k $
Figure 2.
Densities of the synthetic-data-based statistic and the original statistic under stationary (left) and non-stationary (right) settings. Detailed settings are described in the simulation study.
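The contrast shown in Fig. 2 can be reproduced in miniature with a quick Monte Carlo check (an illustrative design, not the simulation settings of this paper; `max_cusum` and `max_synthetic_cusum` are hypothetical helpers computing the maxima of $ \|\mathbf{c}_k(s)\|_\infty $ and $ \|\tilde{\mathbf{c}}_k(s)\|_\infty $ over interior split points):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trim = 60, 4, 10  # segment length, dimension, endpoint truncation

def max_cusum(x):
    """max_s ||c(s)||_inf over interior split points of one segment."""
    m = len(x)
    vals = []
    for s in range(trim, m - trim):
        w = np.sqrt(s * (m - s) / m)
        vals.append(np.max(np.abs(w * (x[:s].mean(0) - x[s:].mean(0)))))
    return max(vals)

def max_synthetic_cusum(x):
    """max_s ||c̃(s)||_inf with one shared draw of Gaussian multipliers."""
    m = len(x)
    xi = rng.standard_normal(m)
    vals = []
    for s in range(trim, m - trim):
        w = np.sqrt(s * (m - s) / m)
        left = (xi[:s, None] * (x[:s] - x[:s].mean(0))).mean(0)
        right = (xi[s:, None] * (x[s:] - x[s:].mean(0))).mean(0)
        vals.append(np.max(np.abs(w * (left - right))))
    return max(vals)

reps = 200
null_T = [max_cusum(rng.standard_normal((n, d))) for _ in range(reps)]
null_Ts = [max_synthetic_cusum(rng.standard_normal((n, d))) for _ in range(reps)]
shift = np.r_[np.zeros((n // 2, d)), np.ones((n // 2, d))]  # unit mean shift
alt_T = [max_cusum(rng.standard_normal((n, d)) + shift) for _ in range(reps)]
```

Under the null the two empirical distributions agree closely, while the shifted segments produce markedly larger original statistics — the asymmetry the mirror statistic exploits below.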
Moreover, it is worth noting that solely comparing the difference at the fixed point $ \hat{\tau}_k $ may not yield the optimal result, because $ \hat{\tau}_k $ maximizes the difference in the odd-part dataset $ \mathcal{Z}^O $ rather than the even-part dataset $ \mathcal{Z}^E $. When the candidate change-points in $ \mathcal{Z}^E $ are validated, it is more reasonable to iterate through all possible locations in the interval $ G_k $ and select the point with the largest absolute difference. Furthermore, the CUSUM statistics are aggregated via an $ \ell_{q} $ norm, where $ q $ takes values in $ \{1, 2, \ldots, \infty\} $, to adapt to the change structure under the alternatives. Hence, the comparison statistics for the original dataset $ \mathcal{Z}^E_{G_k} $ and the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ are defined as
$ T_k^q = \max\limits_{ s \in G_k^*} \| \mathbf{c}_{k}(s) \|_{q} \text{ and } \tilde{T}_k^q = \max\limits_{ s \in G_k^*} \|\tilde{ \mathbf{c}}_{k}(s) \|_{q}, \text{ for } q \in \{1,2,\ldots, \infty\} $ (10)
where, for a vector $ \mathbf{x} \in \mathbb{R}^d $, $ \|\mathbf{x}\|_q := \big(\sum_{j=1}^d |x_j|^q\big)^{1/q} $ for $ 1 \leqslant q \lt \infty $, $ \|\mathbf{x}\|_{\infty} := \max_{1 \leqslant j \leqslant d}|x_j| $, and $ G_k^* $ is a truncated version of $ G_k $, obtained by removing the first $ \underline{s} $ and last $ \underline{s} $ indices. This truncation is necessary since $ \widehat{\boldsymbol{\Sigma}}^{(k)-}(s) $ and $ \widehat{\boldsymbol{\Sigma}}^{(k)+}(s) $ may not be reliable estimates of $ \boldsymbol{\Sigma}^{(k)} $ when the sample size within the left or right segment is too small. Therefore, if $ \hat{\tau}_k $ is a false discovery, the distribution of $ \tilde{T}_k^q $ is expected to closely approximate that of $ T_k^q $, given the similarity between the synthetic and validation data.

However, if the candidate change-point $ \hat{\tau}_k $ belongs to the informative set, i.e., $ \hat{\tau}_k \in \mathcal{I}_1 $, then $ \Delta_k(s) $ in Eq. (6) should be the leading term of $ \mathbf{c}_{k}(s) $. Denote
$ \mathbf{r}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right) $
then, in this case,
$ T_k^q = \max\limits_{s \in G_k^*} \| \mathbf{c}_k(s)\|_q \geqslant \max\limits_{s \in G_k^*} \| \Delta_{k}(s)\|_q -\max\limits_{s \in G_k^*}\|\mathbf{r}_{k}(s) \|_q $
where $ \mathbf{r}_{k}(s), s \in G_k^* $, are all zero-mean vectors. Furthermore, the CUSUM statistics defined on the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ also have zero mean, as they are transformations of zero-mean random vectors. Hence, the original comparison statistic $ T_k^q $ should be sufficiently larger than the synthetic comparison statistic $ \tilde{T}_k^q $. That is,
$ T_k^q - \tilde{T}_k^q \geqslant \max\limits_{s \in G_k^*} \| \Delta_{k}(s)\|_q -\max\limits_{s \in G_k^*}\| \mathbf{r}_{k}(s) \|_q - \max\limits_{s \in G_k^*}\| \tilde{\mathbf{c}}_k(s) \|_q \gg 0 $
provided that the change magnitudes within $ G_k $ are sufficiently large and well-separated.

FDR control via SD filter
-
Since the synthetic data in Eq. (8) mimic the distributional behavior under the null, the distributions of $ T_k^{q} $ and $ \tilde{T}_k^{q} $ are close when $ \hat{\tau}_k \in \mathcal{I}_0 $, as discussed previously. Therefore, a comparison statistic is defined for each candidate change-point $ \hat{\tau}_k $ as follows:
$ W_k^q = T_k^q- \tilde{T}_k^q, \text{ for } q \in \{1, 2, \ldots, \infty\} $ (11)
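Given the vectors of original and synthetic statistics over all candidates, the mirror statistics and a knockoff+-style data-dependent threshold (the rule formalized in Eq. (16) below) can be sketched as follows (an illustrative sketch; here the absolute values of the nonzero $ W_k $ serve as the candidate thresholds, an implementation choice):

```python
import numpy as np

def sd_filter_select(T, T_syn, alpha=0.2):
    """Keep candidates whose mirror statistic W_k = T_k - T̃_k exceeds the
    smallest threshold t at which the estimated FDP,
    (1 + #{W_k <= -t}) / max(#{W_k >= t}, 1), falls below alpha."""
    W = np.asarray(T, dtype=float) - np.asarray(T_syn, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):  # scan candidate thresholds upward
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return np.flatnonzero(W >= t), t
    return np.array([], dtype=int), np.inf  # nothing passes: select no candidate
```

For a vector with a few strongly positive entries (true changes) and several near-symmetric entries around zero (false discoveries), only the strong ones survive the scan.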
Moreover, the odd dataset $ \mathcal{Z}^O $, used to detect the candidate change-points, provides valuable side information. For each $ \hat{\tau}_k $, the CUSUM statistic can also be computed using $ \mathcal{Z}_{G_k}^O $, denoted by $ T_k^q(\mathcal{Z}_{G_k}^O) $. If the candidate set $ \hat{\mathcal{S}} $ is a plausible estimate of the true set $ \mathcal{S} $, it is typically observed that $ T_k^q(\mathcal{Z}_{G_k}^O) \gt T_{k^\prime}^q(\mathcal{Z}_{G_{k^\prime}}^O) $ for $ \hat{\tau}_k \in \mathcal{I}_1 $ and $ \hat{\tau}_{k^\prime} \in \mathcal{I}_0 $. By incorporating this information, a new statistic $ W_k^{side,q} $ is introduced, which blends the original comparison statistic with the side information:
$ W_k^{side,q} = (T_k^q- \tilde{T}_k^q)\, T_k^q( \mathcal{Z}_{G_k}^O) $ (12)
After incorporating the odd dataset $ \mathcal{Z}^O $, the statistic still exhibits symmetry around zero in the case of a false discovery. The advantage of $ W_k^{side,q} $ lies in its ability to enhance the separation between the comparison statistics corresponding to the informative and uninformative sets. Specifically, when $ \hat{\tau}_k \in \mathcal{I}_1 $ and $ \hat{\tau}_{k^\prime} \in \mathcal{I}_0 $, the ratio satisfies
$ \dfrac{W_k^{side,q}}{W_{k^\prime}^{side,q}} = \left(\dfrac{W_k^q}{W^q_{k^\prime}}\right) \cdot \left(\dfrac{T_k^q( \mathcal{Z}_{G_k}^O)}{T_{k^\prime}^q( \mathcal{Z}_{G_{k^\prime}}^O)}\right) \geqslant \dfrac{W^q_k}{W^q_{k^\prime}} $ (13)
since the second factor is typically greater than or equal to one.
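The side-information statistic of Eq. (12) is a one-line extension (again an illustrative sketch):

```python
import numpy as np

def side_info_stat(T, T_syn, T_odd):
    """W_k^{side} of Eq. (12): the mirror statistic scaled by the odd-part
    CUSUM maximum, which stretches the gap between informative candidates
    (large T_odd) and uninformative ones (small T_odd)."""
    T, T_syn, T_odd = (np.asarray(a, dtype=float) for a in (T, T_syn, T_odd))
    return (T - T_syn) * T_odd
```

Because a false discovery has $ W_k^q $ symmetric around zero and $ T_k^q(\mathcal{Z}_{G_k}^O) \gt 0 $, the multiplication preserves the sign symmetry while amplifying true signals, so the same thresholding rule applies unchanged.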
Building on the definitions of $ W_k^q $ and $ W_k^{side,q} $, the candidate set is refined by selecting positions with large values of the comparison statistics. Specifically, define
$ \mathcal{T}(t) = \{ \hat{\tau}_k \in \hat{\mathcal{S}}: W_k^q \geqslant t\} \text{ or } \mathcal{T}^{side}(t) = \{ \hat{\tau}_k \in \hat{\mathcal{S}}: W_k^{side,q} \geqslant t\} $ (14)
where $ \mathcal{T}(t) $ (or $ \mathcal{T}^{side}(t) $) is referred to as the selection set. Given such a set, the number of false discoveries can be estimated by exploiting the symmetry of $ W_k^q $ or $ W_k^{side,q} $ around zero when $ \hat{\tau}_k \in \mathcal{I}_0 $. This leads to the following relationship:
$ \#\{ \hat{\tau}_k \in \mathcal{I}_0: W_k^{side,q} \geqslant t\} \approx \#\{ \hat{\tau}_k \in \mathcal{I}_0: W_k^{side,q} \leqslant -t\} \leqslant \#\{ \hat{\tau}_k: W_k^{side,q} \leqslant -t\} $ (15)
where $ W_k^{side,q} $ can be replaced by $ W_k^q $. Based on this property, the FDR can be approximated by
$ \text{FDR}(t) \approx \dfrac{| \mathcal{T}(t)\cap \mathcal{I}_0|}{| \mathcal{T}(t)|} \leqslant \dfrac{ \#\{k: W_k \leqslant -t\}}{ \#\{k: W_k \geqslant t\}} $
To control the FDR at a target level $ \alpha $, the knockoff+ procedure [22] is followed and a data-dependent threshold $ T(\alpha) $ is computed as follows:
$ T(\alpha) = \min_{t} \left\{t \in \mathcal{W}: \dfrac{1+ \#\{k : W_k \leqslant -t\}}{\#\{k : W_k \geqslant t\} \vee 1} \leqslant \alpha\right\} $ (16)
where $ \mathcal{W} = \{ W_1, \ldots, W_{ \hat{K}_n}\} \backslash \{0\} $, and the extra term 1 in the numerator makes the choice of $ T(\alpha) $ slightly conservative.

In summary, compared with pure data-splitting methods such as MOPS and M-MOPS, the core methodological distinction of this approach lies in how the test statistics are constructed. Instead of performing inference on only half of the data, the method generates a synthetic dataset via a Gaussian multiplier bootstrap, which integrates information from both the odd part and the reserved even part. This design increases the effective sample size and thus improves statistical power. Moreover, although the construction of $ W_{k}^{q} $ in Eq. (11) differs from that in MOPS and M-MOPS, it serves a conceptually similar role as the comparison statistic used in data-splitting-based procedures. As shown in Eq. (13), the formulation effectively amplifies the contrast between informative and uninformative subsets, which in turn facilitates better control of the false discovery rate.

-
The error rate control results rely on the symmetry of the comparison statistics $ W_k^q $ and $ W_k^{side,q} $ when $ \hat{\tau}_k $ is a false discovery. To lay the groundwork for FDR control, it is essential to examine this symmetry property systematically. Before presenting the main theorem, some regularity conditions are imposed.

Condition 3.1 (Moments and tails). Let $ \underline{b} $ and $ \overline{b} $ be positive constants. For any vector $ \boldsymbol{\vartheta} \in \mathbb{S}^d $, all $ \ell = 1, 2 $, $ i = 1, \ldots, 2n $, and $ j = 1, \ldots, d $: (1) $ \text{E}[(\boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i})^2] \geqslant \underline{b} $; (2) $ \text{E}[(\boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i})^{\ell+2}] \leqslant \overline{b}^{\ell} $; (3) $ \| \boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i}\|_{\psi_1} \leqslant \overline{b} $, where for $ \beta \in [1, \infty) $, $ \| \cdot \|_{\psi_\beta} $ represents an Orlicz norm.

Condition 3.2 (Detection ability). Assume $ \hat{K}_n \geqslant K_n $. There exist $ \hat{\tau}_{j_1} \lt \ldots \lt \hat{\tau}_{j_{K_n}} $ belonging to $ \hat{\mathcal{S}} $ such that $ \max_{1 \leqslant k \leqslant K_n} | \hat{\tau}_{j_k} - \tau_k| \leqslant \delta_n $ holds with probability approaching one as $ n \to \infty $, where $ \delta_n $ is some positive sequence.

Condition 3.3 (Minimum distance). Assume that $ \mathcal{T} \subseteq \mathcal{T}(\omega_n) = \{ \mathcal{T}: \min_{j}(\tau_{j+1} - \tau_j) \geqslant \underline{\lambda}_n\} $, where $ \underline{\lambda}_n $ is a positive sequence such that $ \underline{\lambda}_n \geqslant n^{\eta} $ for some $ 0 \lt \eta \lt 1 $ and $ \underline{\lambda}_n \geqslant 2\delta_n $.

A sufficient condition for Condition 3.1 (1) is that the minimum eigenvalue of $ \boldsymbol{\Sigma}^{(k)} $ is uniformly bounded below for all $ k = 1, \ldots, K_n $. This is a common assumption when the dimension $ d $ is fixed. Condition 3.1 (2) is a mild moment condition on linear transformations of $ \boldsymbol{\zeta}_i $. Condition 3.1 (3) assumes a sub-exponential tail for such transformations of the residuals. Note that $ \| \boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_i\|_{\psi_1} \leqslant \sum_{j =1}^d \|\vartheta_j \zeta_{ij}\|_{\psi_1} \leqslant \sum_{j =1}^d \| \zeta_{ij} \|_{\psi_1} $. Therefore, if $ \sum_{j =1}^d \| \zeta_{ij} \|_{\psi_1} \leqslant \bar{b} $, then Condition 3.1 (3) is satisfied. Conditions similar to Condition 3.1 can be found in Liu et al. [1] and Yu & Chen [13] for change-point inference. Condition 3.2 requires that the set of candidate change-points $ \hat{\mathcal{S}} $ is sufficiently accurate. Condition 3.3 imposes sparsity on the true change-points. Both Conditions 3.2 and 3.3 are also considered in Chen et al. [30].

Let $ \overline{n} = \max_{k = 1,\ldots, K_n} n_k $ and $ \underline{n} = \min_{k = 1,\ldots, K_n} n_k $ denote the maximum and minimum sample sizes of the intervals $ G_k $, respectively. The following lemma can now be established.

Lemma 3.1. Assume Conditions 3.1, 3.2, and 3.3 hold. Then
$ \text{Pr}\left\{\max\limits_{k\in\mathcal{I}_0}\rho(T_k^q,\tilde{T}_k^q)\leqslant c\left(\dfrac{\log^7(\overline{n})}{\underline{s}}\right)^{1/6}\mid\mathcal{Z}^O\right\}\geqslant1-C/(\underline{n})^{\kappa} $ (17)
where $ \rho(T_1, T_2) = \sup_{t \in (0, \infty)} | \text{Pr}(T_1 \leqslant t) - \text{Pr}(T_2 \leqslant t) | $ denotes the Kolmogorov distance between $ T_1 $ and $ T_2 $. Here, $ \underline{s} $ is defined as $ \min_{k = 1,\ldots, K_n} (\tau_{k+1} - \tau_{k}) $, and $ \kappa $, $ C $, and $ c $ are positive constants.

Lemma 3.1 demonstrates that the distribution of the synthetic statistic $ \tilde{T}_k^q $ approximates that of its original CUSUM counterpart $ T_k^q $ under mild conditions. Based on Lemma 3.1, the main results on FDR control are now presented.

Theorem 3.1. Under Conditions 3.1, 3.2, and 3.3, and $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $, the SD filter satisfies
$ \limsup\limits_{n \to \infty} \text{E} \left[ \dfrac{ | \mathcal{T} \cap \mathcal{I}_0|}{| \mathcal{T}|} \Big | \mathcal{Z}^O \right] \leqslant \alpha $
for any $ \alpha \in (0,1) $.

Theorem 3.1 establishes the asymptotic FDR control property of the SD filter. The condition $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $ implies that the truncation parameter $ \underline{s} $ cannot be too small, i.e., $ s $ should not approach the endpoints of $ G_k $ too closely. This requirement is reasonable, since accurate covariance estimation requires sufficiently large samples on both sides of $ s $. Unlike Theorem 1 in Chen et al. [30], Theorem 3.1 does not require $ \delta_n/\underline{\lambda}_n \to 0 $, indicating that the SD filter does not rely on highly accurate candidate localization.

The FDR control for a general $ \ell_q $-norm cannot be extended to high-dimensional MCP settings, due to the constraints of the high-dimensional central limit theorem for simple and sparse convex sets, as discussed in Chernozhukov et al. [32]. To address the high-dimensional MCP challenge, a practical approach is to set $ q = \infty $. In this case, the high-dimensional central limit theorem for hyperrectangles can be employed to justify Lemma 3.1 even in high-dimensional MCP scenarios.

Theorem 3.2 (FDR control for high-dimensional MCP). Let $ d \to \infty $ and $ q = \infty $. Under Conditions 3.1, 3.2, and 3.3, and $ \log^7(\hat{K}_n \bar{n} d^2)/\underline{s} \to 0 $, the SD filter satisfies
$ \limsup\limits_{n \to \infty} \text{E} \left[ \dfrac{ | \mathcal{T} \cap \mathcal{I}_0|}{| \mathcal{T}|} \Big | \mathcal{Z}^O \right] \leqslant \alpha $
for any $ \alpha \in (0,1) $.

Power analysis
-
Next, the power of the SD filter is analyzed under the following signal condition.
Condition 3.4 (Minimum signal). Let $ \boldsymbol{\delta}^{(k)} = \boldsymbol{\mu}_{k+1} - \boldsymbol{\mu}_k $ be the change magnitude, which satisfies
$ \min\limits_{k \in \mathcal{I}_1}\| \boldsymbol{\delta}^{(k)}\|_{q} \gg C\bar{\sigma}^2\sqrt{\dfrac{\log(\alpha_n \hat{K}_n\bar{n}d) }{t_k(1-t_k) \underline{n}}} $
where $ t_k = \tau_k/n_k $, $ \alpha_n $ is a sequence that diverges to infinity at a slow rate, and $ C $ is a positive constant.

Condition 3.4 imposes a minimum signal separation between any two true change-points, ensuring their asymptotic identifiability. Similar conditions can be found in Harchaoui & Lévy-Leduc [5], Fryzlewicz [6], and Yu & Chen [13].
Theorem 3.3. Under Conditions 3.1, 3.2, 3.3, and 3.4, and $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $, the power of the SD filter satisfies
$ \lim\limits_{n \to \infty} \text{E}\left[\dfrac{| \mathcal{T} \cap \mathcal{I}_1|}{| \mathcal{I}_1|} \Big| \mathcal{Z}^O\right] = 1 $
Theorem 3.3 states that the power of the SD filter approaches 1 asymptotically. Furthermore, a selection consistency property can be established.
Corollary 3.1. Under the conditions of Theorem 3.3,
$ \lim\limits_{n\to\infty}\text{Pr}\left\{\mathcal{S}=\mathcal{T}\mid\mathcal{Z}^O\right\}=1 $
Compared with the condition for selection consistency in Chen et al. [30], which requires $ \min_{k \in \mathcal{I}_1}\| \boldsymbol{\delta}^{(k)}\|_2 \gg \sqrt{\log n/\underline{\lambda}_n} $, Condition 3.4 is of approximately the same order, noting that $ \underline{\lambda}_n \approx \underline{n} $.

-
In this section, a series of change-point detection experiments is conducted to evaluate the empirical performance of the SD filter. Before presenting the results, the competing methods from Chen et al.[30] are briefly summarized: the Mirror with Order-Preserved Splitting (MOPS) method and its variant, the Modified-MOPS (M-MOPS).
The M-MOPS method controls the FDR via a mirror statistic
$ W_k^{\text{M-MOPS}}=\dfrac{n_kn_{k+1}}{n_k+n_{k+1}}\left(\overline{\mathbf{S}}_k^{O,-}-\overline{\mathbf{S}}_k^{O,+}\right)^{\top}\Omega_n\left(\overline{\mathbf{S}}_k^{E,-}-\overline{\mathbf{S}}_k^{E,+}\right),\quad k=1,\ldots,\hat{K}_n $
where $ \overline{\mathbf{S}}_k^{O,-} $, $ \overline{\mathbf{S}}_k^{O,+} $, $ \overline{\mathbf{S}}_k^{E,-} $, $ \overline{\mathbf{S}}_k^{E,+} $ are previously defined, and $ \Omega_n $ is a positive definite matrix. Since the performance of M-MOPS is not sensitive to $ \Omega_n $, the study sets $ \Omega_n = \mathbf{I}_d $, the $ d $-dimensional identity matrix, in the simulations. Compared with M-MOPS, the MOPS statistic differs only in that it uses a larger sample to compute the sample means of the score functions:
$ W_k^{\text{MOPS}}=\dfrac{n_kn_{k+1}}{n_k+n_{k+1}}\left(\tilde{\mathbf{S}}_k^{O,-}-\tilde{\mathbf{S}}_k^{O,+}\right)^{\top}\Omega_n\left(\tilde{\mathbf{S}}_k^{E,-}-\tilde{\mathbf{S}}_k^{E,+}\right),\quad k=1,\ldots,\hat{K}_n $
where $ \tilde{\mathbf{S}}_k^{O,-} $ and $ \tilde{\mathbf{S}}_k^{O,+} $ denote the sample means of $ \{ \mathbf{s}_i^O, \hat{\tau}_{k-1} \lt i \leqslant \hat{\tau}_k\} $ and $ \{ \mathbf{s}_i^O, \hat{\tau}_{k} \lt i \leqslant \hat{\tau}_{k+1}\} $, respectively; $ \tilde{\mathbf{S}}_k^{E,-} $ and $ \tilde{\mathbf{S}}_k^{E,+} $ are defined analogously using the even subsample.

Next, the computational complexity of the methods is compared. Treating basic arithmetic operations as O(1), computing $ W_{k}^{\text{M-MOPS}} $ requires O(dn) operations, where d is the data dimension. Similarly, the quantities $ \mathbf{c}_{k}(s) $ and $ \tilde{\mathbf{c}}_{k}(s) $ also involve O(dn) operations, implying that both $ T_{k}^{q} $ and $ \tilde{T}_{k}^{q} $, and therefore $ W_{k}^{side,q} $, have the same complexity. The SD filter generates only a single set of multiplier bootstrap samples and therefore adds no extra computational overhead. In practice, one may run $ B $ bootstrap repetitions, yielding a cost of O(Bdn); however, the procedure can be parallelized with ease, so the runtime can approach that of a single iteration. Overall, the SD filter does not incur high computational cost.

Beyond computational considerations, an important issue is statistical reliability. In particular, MOPS may fail to control the FDR when the discrepancy between the candidate and true change-point sets is large. To assess the performance of all methods, 200 simulation replications were conducted and each method was evaluated using the empirical FDR and power:
$ \widehat{\text{FDR}}=\dfrac{1}{200} \sum\limits_{i=1}^{200} \dfrac{\left| \mathcal{T}_i\cap \mathcal{I}_0\right|}{| \mathcal{T}_i|} \quad \text{and} \quad \widehat{\text{Power}}=\dfrac{1}{200} \sum\limits_{i=1}^{200} \dfrac{\left| \mathcal{T}_i \cap \mathcal{I}_1\right|}{|\mathcal{I}_1|} $
where $ \mathcal{T}_i $ denotes the estimated selection set in the $ i $th replication. The detailed pseudocode of this approach is given in Algorithm 1.
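To make the selection rule concrete, the following pure-Python sketch implements one-dimensional analogues of the CUSUM contrast, the Gaussian-multiplier synthetic CUSUM, and the knockoff-style thresholding of Algorithm 1; the function names (`cusum`, `synthetic_cusum`, `select`) are illustrative, and the multivariate $\|\cdot\|_q$ aggregation over a score vector is deliberately omitted for clarity.

```python
import math
import random

def cusum(scores, s):
    """CUSUM contrast for a univariate score sequence at split point s
    (a one-dimensional analogue of c_k(s) in Algorithm 1)."""
    n = len(scores)
    left = sum(scores[:s]) / s
    right = sum(scores[s:]) / (n - s)
    return math.sqrt(s * (n - s) / n) * (left - right)

def synthetic_cusum(scores, s, rng):
    """Gaussian-multiplier synthetic CUSUM: centre each half of the scores,
    multiply by independent N(0,1) weights, and recompute the contrast."""
    n = len(scores)
    mean_l = sum(scores[:s]) / s
    mean_r = sum(scores[s:]) / (n - s)
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
    left = sum(xi[i] * (scores[i] - mean_l) for i in range(s)) / s
    right = sum(xi[i] * (scores[i] - mean_r) for i in range(s, n)) / (n - s)
    return math.sqrt(s * (n - s) / n) * (left - right)

def select(W, alpha):
    """Knockoff-style threshold T(alpha) and final selection: keep the
    indices whose mirror statistic exceeds the smallest threshold at which
    the estimated false discovery proportion drops below alpha."""
    grid = sorted(abs(w) for w in W if w != 0)
    for t in grid:
        neg = sum(w <= -t for w in W)
        pos = max(sum(w >= t for w in W), 1)
        if (1 + neg) / pos <= alpha:
            return [k for k, w in enumerate(W) if w >= t]
    return []  # no threshold achieves the target level
```

For example, `select([5.0, 4.0, 3.0, -1.0], 0.5)` returns `[0, 1, 2]`: the three candidates with large positive mirror statistics survive, while the negative statistic serves only to estimate the false discovery proportion.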
Table 1. Synthetic data filter (SD filter) for MCP detection.

${\bf Input:}$ Observed data sequence $\mathcal Z = \{{\bf{z}}_1, \dots, {\bf{z}}_{2n}\}$, target FDR level $\alpha$, suitable candidate change-point detection algorithm $\mathcal{A}(\cdot)$

${\bf Output:}$ Selected change-point set $\mathcal T$

1: Split the data into odd and even parts: $ \mathcal{Z}^O = \{ \mathbf{z}_1, \mathbf{z}_3, \dots, \mathbf{z}_{2n-1}\} $, $ \mathcal{Z}^E = \{ \mathbf{z}_2, \mathbf{z}_4, \dots, \mathbf{z}_{2n}\} $

2: Detect candidate change-points $ \hat{ \mathcal{S}}=\{ \hat{\tau}_1, \dots, \hat{\tau}_{\hat{K}_n}\} $ through $ \mathcal{A}(\mathcal{Z}^O) $

3: for $ k = 1,\cdots,\hat{K}_n $ do

4: Define $ G_k := \left[\lceil(\hat{\tau}_{k-1}+ \hat{\tau}_k)/2\rceil, \lceil(\hat{\tau}_{k} + \hat{\tau}_{k+1})/2\rceil\right) $, where $ \hat{\tau}_0 = 0 $ and $ \hat{\tau}_{\hat{K}_n+1} = 2n $, and set $ \mathcal{Z}_{G_k}^O := \{ \mathbf{z}_{2i-1}: i \in G_k\} $, $ \mathcal{Z}_{G_k}^E := \{ \mathbf{z}_{2i}: i \in G_k\} $

5: Compute the score functions $ \mathbf{s}_i^E $ and the CUSUM statistic $ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum_{i \leqslant s, i \in G_k} \mathbf{s}^E_{i}-\dfrac{1}{n_k-s} \sum_{i \gt s, i \in G_k} \mathbf{s}^E_{i}\right) $

6: Generate synthetic data $ \tilde{ \mathcal{Z}}_{G_k} $ through $ \left\{\xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right), i \leqslant s, i \in G_k \right\} $ and $ \left\{\xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right), i \gt s, i \in G_k \right\} $, where $ \xi_i \sim N(0,1) $

7: Compute the synthetic CUSUM $ \tilde{ \mathbf{c}}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum_{i \leqslant s, i \in G_k} \xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)- \dfrac{1}{n_k - s}\sum_{i \gt s, i \in G_k} \xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \right) $

8: Calculate $ T_k^q = \max_{s \in G_k^*} \| \mathbf{c}_k(s)\|_q $, $ \tilde{T}_k^q = \max_{s \in G_k^*} \|\tilde{ \mathbf{c}}_k(s)\|_q $, and $ W_k^{side,q} = (T_k^q - \tilde{T}_k^q) \cdot T_k^q(\mathcal{Z}_{G_k}^O) $

9: end for

10: Obtain the threshold $ T(\alpha) = \min \left\{ t : \dfrac{1 + \#\{k: W_k^{side,q} \leqslant -t\}}{\#\{k: W_k^{side,q} \geqslant t\} \vee 1} \leqslant \alpha \right\} $

11: Select the final change-point set $ \mathcal{T}^{side} = \{\hat{\tau}_k \in \hat{ \mathcal{S}} : W_k^{side,q} \geqslant T(\alpha)\} $

12: return $ \mathcal{T}^{side} $

Simulation for multiple mean changes model
-
Consider a sequence of $ d $-dimensional mean vectors $ \boldsymbol{\mu}_{i}, i = 1,\ldots, n $, where the means are piecewise constant with change-points at positions $ \{\tau_k, k = 1, \ldots, K\} $. Specifically,
$ \boldsymbol{\mu}_i = \boldsymbol{\mu}_{\tau_k}, \,\, \text{for} \,\, \tau_{k}+1 \leqslant i \leqslant \tau_{k+1} $

The sequence is initialized with $ \boldsymbol{\mu}_{\tau_1} = (A/2) \mathbf{1}_d $, where $ \mathbf{1}_d $ denotes the $ d $-dimensional vector of ones. To define $ \boldsymbol{\mu}_{\tau_2} $, randomly select $ r $ positions in $ \boldsymbol{\mu}_{\tau_1} $ and flip the sign of those entries. Subsequent vectors $ \boldsymbol{\mu}_{\tau_k} $ are constructed recursively from their predecessors using the same procedure. Under this setup,
$ \| \boldsymbol{\mu}_{\tau_k} - \boldsymbol{\mu}_{\tau_{k+1}}\|_{\infty} = A, \quad k = 1, \ldots, K $

The data points are generated as $ \mathbf{z}_i = \boldsymbol{\mu}_i + \boldsymbol{\epsilon}_i, i = 1, \ldots, n $, and the signal strength is controlled by the value of $ A $. Throughout this subsection, the dimensionality is set to $ d = 50 $ and the sample size to $ n=4,000 $. The true change-points are set to $ \tau_k=200\,k,\ k=1,\ldots,19 $ in Example 1, and to $ \tau_k=400\,k,\ k=1,\ldots,9 $ in Example 2. To mitigate the undesired bias from detection algorithms, a sequence of candidate change-points is manually constructed as $ \mathcal{T} = \{150 k+ (-1)^{\rm{B_k}} {\rm{P_k}} \mid k = 1, \ldots, 26\} $, where $ \text{B}_{k} $, $ \text{P}_{k} $ are independently drawn from $ \text{Bernoulli}(1/2) $ and $ \text{Poisson}(5) $, respectively. Theoretically, $ q = \infty $ is required in high-dimensional settings, while any $ q \geqslant 1 $ is admissible in low-dimensional cases. Since the optimal choice of $ q $ depends on unknown signal conditions, $ q = \infty $ is adopted as a practical default in both simulations and empirical analysis. For the truncation size, $ s \in [10, 30] $ is chosen to ensure that both $ s $ and $ n_k - s $ are sufficiently large, thereby mitigating boundary effects and improving the stability of the algorithm. In addition, the study sets $ r = 1 $ and FDR level $ \alpha = 0.15 $.

Example 1: Normal distribution
Consider the error term $ \boldsymbol{\epsilon}_i $ to be drawn from a multivariate normal distribution with covariance matrix $ \boldsymbol{\Sigma} = \{\rho^{|i-j|}\}_{(i,j)} $. The change magnitude $ A $ and the correlation coefficient $ \rho $ are chosen as follows:

● Fix $ \rho $ = 0, and let A vary in {1.5, 1.7, 1.9, 2.1, 2.3, 2.5}.

● Fix A = 1.5, and let $ \rho $ vary in {0, 0.2, 0.4, 0.6, 0.8}.

Example 2: Beyond normal distribution
Consider the error term $ \boldsymbol{\epsilon}_i $ to be drawn from either a multivariate t distribution or a multivariate chi-square distribution, each having covariance matrix $ \boldsymbol{\Sigma} = \mathbf{I}_d $. The change magnitude A and the degrees of freedom df are chosen as follows:

● Fix A = 2, and let df vary in {8, 9, 10, 11, 12}.

● Fix A = 3, and let df vary in {3, 4, 5, 6, 7}.
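As a concrete illustration of the data-generating design above, the following pure-Python sketch builds the piecewise-constant mean vectors by recursive sign flips and assigns them to observations; the function names `make_means` and `expand_to_sequence` are hypothetical, not from the paper.

```python
import random

def make_means(K, d, A, r, rng):
    """Build the K+1 segment means: start from (A/2)*1_d and, at each
    change-point, flip the sign of r randomly chosen coordinates.  Every
    entry stays +-A/2, so consecutive means differ by exactly A in sup-norm."""
    means = [[A / 2.0] * d]
    for _ in range(K):
        nxt = means[-1][:]
        for j in rng.sample(range(d), r):
            nxt[j] = -nxt[j]
        means.append(nxt)
    return means

def expand_to_sequence(means, change_points, n):
    """Assign segment mean k to observations tau_k < i <= tau_{k+1} (1-based),
    mirroring the piecewise-constant model mu_i = mu_{tau_k}."""
    mu, seg = [], 0
    for i in range(1, n + 1):
        if seg < len(change_points) and i > change_points[seg]:
            seg += 1
        mu.append(means[seg])
    return mu
```

With d = 50, A = 1.5, and r = 1 as in Example 1, every pair of consecutive segment means differs by exactly 1.5 in the ∞-norm; adding the error terms $\boldsymbol{\epsilon}_i$ of either example then yields the observed sequence.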
The simulation results summarized in Figs 3 and 4 demonstrate the strong performance of the SD filter across various signal strengths and dependence structures. Both the SD filter and M-MOPS maintain FDR control at the predetermined level. In contrast, MOPS struggles to control the FDR because of the deliberately reduced quality of the candidate change-point set: when the candidate positions are far from the actual change-points, MOPS fails to perform effectively. In terms of empirical power, the SD filter consistently outperforms the M-MOPS method across all settings.
Figure 3.
FDR and power trends with respect to A and $ \rho $ for SD filter, MOPS, and M-MOPS under the mean change model, where n = 4,000, d = 50, $ \alpha $ = 0.15.
Figure 4.
FDR and power trends with respect to df of the t-distribution and chi-square distribution for SD filter, MOPS, and M-MOPS under the mean change model, where n = 4,000, d = 50, $ \alpha $ = 0.15.
Structural breaks in linear regression model
-
Consider a linear regression model with structural breaks, defined as $ \mathbf{y}_i = \mathbf{x}_i^\top \boldsymbol{\beta}_{\tau_k} +\epsilon_i $ for $ \tau_{k-1} \lt i \leqslant \tau_k $. Initially, let $ \boldsymbol{\beta}_{\tau_1} = (A/2) \mathbf{1}_d $. Then, define $ \boldsymbol{\beta}_{\tau_2} $ by randomly selecting $ s $ positions in $ \boldsymbol{\beta}_{\tau_1} $ and flipping the signs of those entries. Each subsequent vector $ \boldsymbol{\beta}_{\tau_k} $ is generated from $ \boldsymbol{\beta}_{\tau_{k-1}} $ using the same procedure. Under this setup,
$ \| \boldsymbol{\beta}_{\tau_k} - \boldsymbol{\beta}_{\tau_{k+1}}\|_{\infty} = A, \quad k = 1, \ldots, K. $

The covariates $ \mathbf{x}_i $ are drawn from a multivariate normal distribution $ N(\mathbf{0}_{d}, \boldsymbol{\Sigma}) $, where $ \mathbf{0}_{d} $ is the d-dimensional vector of zeros and the covariance matrix $ \boldsymbol{\Sigma} = \{\rho^{|i-j|}\}_{(i,j)} $. The error terms $ \epsilon_i $ are independently drawn from $ N(0,1) $. In this model, the sample size is set as n = 8,000 and the number of covariates d = 10. The true change-point set is defined as $ \mathcal{S} = \{1000 k, k = 1, \ldots, 7\} $, and the candidate change-point set is given by $ \hat{\mathcal{S}} = \{ 450 k + (-1)^{\text{B}_k} \text{P}_k \mid k = 1, \ldots, 16\} $, where $ \text{B}_{k} \sim \text{Bernoulli}(1/2) $ and $ \text{P}_k\sim\text{Poisson}(5) $. Except that the FDR level is set to $ \alpha = 0.2 $, all other parameters remain unchanged. The change magnitude $ A $ and correlation coefficient $ \rho $ were chosen as follows:

● Fix $ \rho $ = 0, and let A vary in {0.20, 0.22, 0.24, 0.26, 0.28, 0.30}.

● Fix A = 0.25, and let $ \rho $ vary in {0, 0.1, 0.2, 0.3, 0.4, 0.5}.

The simulation results presented in Figs 5 and 6 display the estimated FDR and empirical power for the linear regression model with structural breaks under various scenarios. As before, both the SD filter and M-MOPS successfully control the FDR at the pre-specified level, whereas MOPS again fails to do so due to the deliberately reduced quality of the candidate change-point set. In terms of empirical power, the SD filter still consistently outperforms M-MOPS in this setting.
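The structural-break regression design of this subsection can be sketched as follows. The AR(1) recursion across coordinates is a standard shortcut for sampling a Gaussian vector with covariance $\{\rho^{|i-j|}\}_{(i,j)}$; the function name `gen_regression_data` and the list-based representation are illustrative assumptions, not the paper's implementation.

```python
import bisect
import math
import random

def gen_regression_data(n, d, breaks, betas, rho, rng):
    """Simulate y_i = x_i' beta_k + eps_i, where the coefficient vector is
    piecewise constant between the true break positions in `breaks` and
    `betas` holds one coefficient vector per segment (len(breaks) + 1)."""
    data = []
    for i in range(1, n + 1):
        # AR(1) construction across coordinates gives Cov(x_a, x_b) = rho^|a-b|
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(d - 1):
            x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
        k = bisect.bisect_left(breaks, i)  # segment index: tau_{k-1} < i <= tau_k
        y = sum(a * b for a, b in zip(x, betas[k])) + rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data
```

In the simulation above one would call this with n = 8,000, d = 10, `breaks = [1000 * k for k in range(1, 8)]`, and eight coefficient vectors built by the sign-flip recursion.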
-
In this section, the proposed SD filter is applied to analyze the bladder tumor micro-array dataset sourced from Bleakley & Vert[35], which is conveniently available in the ecp R package. The dataset consists of log–intensity-ratio measurements for 2,215 genetic loci obtained from 43 individuals diagnosed with bladder tumors. The primary objective is to identify change-points within the genetic loci, enabling the study to pinpoint potential influential genes related to bladder tumors. This dataset has been widely used as a benchmark in several prior studies on change-point detection[35–37], making it a representative and well-established dataset for evaluating the empirical performance of new methods. The analysis was conducted on the full dataset. However, for visualization purposes and to provide a clearer and more interpretable presentation of the results, the findings for the first ten individuals were reported, specifically individuals 3, 4, 5, 6, 7, 8, 9, 10, 14, and 15.
Firstly, the inspect method[9] is applied to narrow the scope and obtain a candidate set. To ensure optimal performance, a minimum difference of $ 50 $ was established between two change-points. The set of identified change-points is as follows:
$\begin{split} \hat{\mathcal{S}} =\;& \{73,263,428,669,811,960, \\& 1050, 1378, 1436, 1559,1724, 1831, 1906, 2084\}\end{split} $

Subsequently, the SD filter, MOPS and M-MOPS are applied to further refine the results, controlling the FDR at a level of 0.1. The final sets of detected change-points from MOPS and M-MOPS are identical to $ \hat{\mathcal{S}} $, whereas the set of change-points detected by the SD filter is as follows:
$ \mathcal{T}_{SD} = \{73,263,428,669,811,960, 1050, 1378, 1436, 1559, 1724, 1906, 2084\} $

This result of the SD filter excludes position 1,831. Figure 7 visually demonstrates the change-points identified by the SD filter. In each plot, individual data points represent log-intensity ratios at specific genetic loci, with each plot corresponding to a different test subject. Vertical lines indicate the locations of detected change-points; those identified by the SD filter are shown as dashed lines. Notably, the only solid line, positioned at 1,831, does not correspond to any apparent change across the ten individuals, highlighting the greater precision and accuracy of the SD filter in identifying true change-points.
-
To overcome the limitations of existing FDR control methods for multiple change-point detection, particularly the drawbacks associated with data-splitting approaches, the study proposes a synthetic data filter (SD filter) for change-point detection and FDR control. After identifying potential change-points, a Gaussian multiplier bootstrap is applied to generate synthetic data based on information from the detection dataset. These synthetic data are then used to construct a mirror statistic that enables control of the FDR, offering the flexibility to leverage information from the entire dataset and improve statistical power under a variety of alternatives and dimensions. The symmetry property of the mirror statistic is then established, and its ability to rigorously control the FDR asymptotically is proven. The detection power is also demonstrated under mild conditions. Simulation studies empirically verify the outstanding performance of the SD filter in terms of FDR control and power. The study also applies the proposed method to analyze a microarray dataset that describes the change loci of bladder tumor patients. As mentioned above, the framework of the SD filter is quite general, and it would be interesting to extend it to a broader range of cases where data splitting is required for formulating and testing hypotheses within the same dataset.
This work was supported by National Natural Science Foundation of China (Grant Nos 12271456 and 71988101), the Ministry of Education Research in the Humanities and Social Sciences (Grant No. 22YJA910002). The authors sincerely thank the editor and the referees for their constructive comments and helpful suggestions.
-
This study uses publicly available and anonymized data from the ecp R package. As the data are de-identified and distributed for open research purposes, no ethical approval was required from the authors' institution. All analyses were conducted in accordance with standard ethical guidelines for statistical research and data use.
-
The authors confirm contribution to the paper as follows: study conception and design: Sun A, Liu J; data collection: Sun A; analysis and interpretation of results: Sun A, Bi J, Liu J; draft manuscript preparation: Sun A, Bi J, Liu J. All authors reviewed the results and approved the final version of the manuscript.
-
The data that support the findings of this study are available through the ecp R package.
-
The authors declare that they have no conflict of interest.
- Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
-
Sun A, Bi J, Liu JY. 2026. A synthetic data approach for FDR control in change-point detection. Statistics Innovation 3: e002 doi: 10.48130/stati-0026-0002
- Received: 22 May 2025
- Revised: 15 December 2025
- Accepted: 29 December 2025
- Published online: 28 February 2026
Abstract: In multiple change-point analysis, the resulting detection sets are typically conservative, often identifying more change points than actually exist, due to the issues of 'unreliability of assumptions' and 'unreliability of algorithms'. Therefore, controlling the false discovery rate is of vital importance to multiple change-point detection. Data-splitting-based methods have gained widespread attention for false discovery rate control. However, relying solely on a part of the dataset during the validation stage typically suffers from power loss. Instead, the study introduces a novel synthetic data framework and proposes the Synthetic Data Filter to control the false discovery rate in multiple change-point detection. Here, the study demonstrates that the proposed method effectively controls the false discovery rate and achieves asymptotic power approaching one under mild conditions. Numerical comparisons with existing methods provide evidence for the superiority of the approach in terms of both false discovery rate control and statistical power. The proposed method is further applied to a bladder tumor microarray dataset, and potential loci are identified with structural changes.