-
Change-point analysis, the detection of structural changes in a data sequence, has been an active area of research and has attracted increasing attention with the growing availability of temporal data. It has applications across a wide range of fields, including environmental sciences, econometrics, biology, geosciences, and linguistics. In this context, the accurate and efficient detection of multiple change-points (MCP) is undoubtedly one of the most crucial issues. For example, Liu et al. [1] introduced a general framework for high-dimensional change-point detection by constructing a U-statistic-based cumulative sum matrix $ \mathcal{C} $ and aggregating it based on the adjusted $ L_{p} $-norm, while Liu et al. [2] focused on high-dimensional linear models, providing asymptotic validity and an extension to multiple change-points via binary segmentation. For more comprehensive reviews of existing approaches to MCP inference, see Aue & Horváth [3] and Niu et al. [4]. However, obtaining consistent estimators for the number and locations of MCP typically requires stringent conditions on the magnitude of changes, as extensively documented in prior studies [5–13]. Unfortunately, such requirements are often unrealistic: small change magnitudes tend to cause underfitting, so an overfitted selection set is often produced by conservative algorithms. Furthermore, the empirical performance of certain detection methods is intrinsically linked to the choice of tuning parameters, which requires access to unavailable population-level information. These two issues, referred to as the unreliability of assumptions and the unreliability of algorithms, can introduce false discoveries, potentially contributing to the reproducibility crisis if an excessive number of false detections occur.
To tackle this problem, a natural solution is to detect the active set while controlling the false discovery rate (FDR) at a pre-specified level. A widely adopted strategy is to treat change-point detection as a multiple hypothesis testing problem, utilizing classical p-value-based methods [14–18] to control the FDR; notable works include Hao et al. [19], Li et al. [20], and Cheng et al. [21]. These methods work properly for the univariate mean change problem, but extending them to a multi-dimensional setting is challenging, since the model's complexity renders the derivation of p-values intractable. Leveraging the knockoff framework [22–26], Liu et al. [27] proposed a generalized knockoff procedure (GKnockoff) to control the FDR when detecting structural changes in the coefficients of a linear regression model. More recently, Du et al. [28] and Dai et al. [29] proposed a data-splitting-based FDR control framework, which outperforms knockoff methods in power under moderate to strong dependence and is more robust than asymptotic p-value-based methods. Chen et al. [30] adopted this data-splitting philosophy and proposed a data-driven selection procedure for MCP detection with error rate control. This approach is quite general and can handle complex MCP scenarios. However, on the one hand, the symmetry of the proposed statistic depends on the sample size: if the sample size is too small, the comparison statistic may become asymmetric, distorting the FDR control. On the other hand, data-splitting inevitably reduces the power of change-point detection, as only half of the information is utilized.
The limitations of data-splitting motivate the development of a novel framework for controlling error rates in high-dimensional MCP detection. This framework, termed the Synthetic Data Filter (SD filter), is designed to enhance the accuracy of multiple change-point identification. The SD filter procedure consists of three steps. First, the dataset is divided according to the parity of the temporal order, as inspired by Chen et al. [30]; change-point detection is performed on the odd part, and can be carried out using the adaptive $ \ell_q $-aggregated CUSUM-type statistic introduced by Liu et al. [1]. Next, information from the odd part is leveraged to generate a synthetic dataset. In the final step, the synthetic dataset is merged with the reserved even-part dataset to construct symmetric statistics and control the false discovery rate (FDR). Figure 1 presents a flow chart summarizing the proposed SD filter. The SD filter asymptotically controls the FDR at the desired level while achieving a power that approaches one under mild conditions. Moreover, it handles high-dimensional MCP problems and represents, to the best of current knowledge, the first attempt at controlling the error rate in this context. The synthetic data framework clearly departs from the aforementioned error rate control procedures: unlike conventional methods that quantify the distribution of the test statistic, a Gaussian multiplier bootstrap [31,32] is employed to construct a mirror statistic for FDR control. This framework is highly general, easily extendable to various model settings, and applicable to any scenario requiring data splitting for simultaneous hypothesis formulation and testing. By integrating the synthetic random sample with the second dataset in the testing phase, the original sample size is maintained, thereby boosting statistical power.
-
Suppose that a sequence of independent data $ \mathcal{Z} = \{ \mathbf{z}_1, \ldots, \mathbf{z}_{2n}\} $ has been observed, collected from
$ \mathbf{z}_i \sim F(\cdot | \boldsymbol{\beta}_k), \quad \tau_k \lt i \leqslant \tau_{k+1},\; k = 0, \ldots, K_n;\; i = 1, \ldots, 2n $ (1)
where $ K_n $ is the number of change-points, which may diverge with the sample size $ n $, and the $ \tau_{k} $'s are the change locations, with the convention that $ \tau_0 = 0 $ and $ \tau_{K_n+1} = 2n $. $ F(\cdot | \boldsymbol{\beta}_k) $ represents the model structure of segment $ k $, where $ \boldsymbol{\beta}_k $ is a $ d $-dimensional parameter vector of interest satisfying $ \boldsymbol{\beta}_{k} \ne \boldsymbol{\beta}_{k+1} $. The MCP setting in Eq. (1) is quite general and encompasses many classical models, such as the multivariate mean change model and the regression model with structural breaks [33].

The objective of this study is to detect the active set $ \mathcal{S} = \{\tau_k, k = 1, \ldots, K_n\} $ while controlling the FDR. However, the definition of FDR in this setting is context-specific, as the optimal rate of change-point estimation is characterized by $ O_p(1) $ [34]. Therefore, it is essential to first establish the corresponding concepts in this context. Given a candidate change-point set $ \hat{\mathcal{S}} = \{ \hat{\tau}_k, k = 1, \ldots, \hat{K}_n\} $, a false discovery in MCP detection is defined as follows:

Definition 2.1 (False discovery). The candidate change-point $ \hat{\tau}_{k} \in \hat{\mathcal{S}} $ is a false discovery if no true change-point falls in
$ G_k:=\left[\lceil(\hat{\tau}_{k-1}+\hat{\tau}_k)/2\rceil,\lceil(\hat{\tau}_k+\hat{\tau}_{k+1})/2\rceil\right) $ (2)
where $ \hat{\tau}_0 = 0 $ and $ \hat{\tau}_{\hat{K}_n+1} = 2n $.

Further, let the set $ \mathcal{I}_0 $ encompass all the false discoveries; then the set $ \mathcal{I}_1 = \hat{\mathcal{S}}\cap \mathcal{I}_0^c $ contains all the true discoveries in the selection set $ \hat{\mathcal{S}} $. Throughout this paper, $ \mathcal{I}_1 $ and $ \mathcal{I}_0 $ are referred to as the informative and uninformative sets, respectively. This definition is well-posed, as each candidate change-point is unambiguously classified as either a true or a false discovery, with no overlap between the two categories. Based on this, the corresponding FDR is defined as
$ \text{FDR}(\mathcal{T})=\text{E}\left[\dfrac{|\mathcal{T}\cap\mathcal{I}_0|}{|\mathcal{T}|}\right] $ (3)
where $ \mathcal{T} $ represents a subset of $ \hat{\mathcal{S}} $ yielded by the selection procedure, and $ |A| $ denotes the cardinality of set $ A $. The FDR is the expected value of the false discovery proportion (FDP), the ratio of false discoveries to the total number of discoveries.

Synthetic data generating process
-
This subsection first introduces the synthetic data generating procedure. Following the order-preserving splitting procedure in Zou et al. [33], the data $ \mathcal{Z} $ are partitioned into odd and even parts:
$ \mathcal{Z}^{O} := \{ \mathbf{z}_{2i-1}, i = 1, \ldots, n\} \text{ and } \mathcal{Z}^{E} := \{ \mathbf{z}_{2i}, i = 1, \ldots, n\} $
On the subset $ \mathcal{Z}^{O} $, a candidate set of change-points $ \hat{\mathcal{S}} = \{ \hat{\tau}_1, \ldots, \hat{\tau}_{\hat{K}_n}\} $ is estimated using a suitable detection algorithm, such as the aggregated CUSUM method. This training phase allows for the possibility that $ \hat{\mathcal{S}} $ may overestimate the true set of change-points $ \mathcal{S} $. Based on the candidate set $ \hat{\mathcal{S}} $, the sets $ G_k $ are defined according to Definition 2.1. Then the odd sample $ \mathcal{Z}^O $ and the even sample $ \mathcal{Z}^E $ are partitioned into segments $ \mathcal{Z}_{G_k}^O := \{ \mathbf{z}_{2i-1}: i \in G_k\} $ and $ \mathcal{Z}_{G_k}^E := \{ \mathbf{z}_{2i}: i \in G_k\} $, respectively. In the validation stage, $ \mathcal{Z}^{O} $ is treated as given to avoid dealing with intractable post-selection inference.

Let $ \ell(\boldsymbol{\beta}; \mathbf{z}_i) $ be a suitable loss function evaluated at data point $ \mathbf{z}_i $, with its derivative denoted as $ \mathbf{s}_{\boldsymbol{\beta}}(\mathbf{z}_i) = {\partial \ell(\boldsymbol{\beta}; \mathbf{z}_i)}/{\partial \boldsymbol{\beta}} $. Ideally, for a given $ d $-dimensional reference vector $ \boldsymbol{\gamma} $, $ \text{E}\{ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)\} \ne \text{E}\{ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_{i^\prime})\} $ is expected when there is a change between $ i $ and $ i^\prime $, since the score function remains constant in regions without change. This motivates the decomposition of the score as
$ \mathbf{s}_{\boldsymbol{\gamma}}( \mathbf{z}_i) = \boldsymbol{\mu}_i + \boldsymbol{\zeta}_i, \quad i =1,\ldots, 2n $ (4)
where $ \boldsymbol{\mu}_i = \text{E}[\mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)] $ is the expected score at $ \mathbf{z}_i $, and $ \boldsymbol{\zeta}_i = \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i) - \text{E}[\mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i)] $ represents the residual. It is further assumed that $ \text{Cov}(\boldsymbol{\zeta}_i) = \boldsymbol{\Sigma}^{(k)} $ for all $ i $ such that $ \tau_{k}+1 \leqslant i \leqslant \tau_{k+1} $. This score-type framework was first introduced by Zou et al. [33] for selecting the number of change-points. The procedure is often invariant to the choice of $ \boldsymbol{\gamma} $, which can therefore be set as $ \boldsymbol{\gamma} := \arg\min_{ \boldsymbol{\beta}} \sum_{ \mathbf{z}_i \in \mathcal{Z}^{O}} \ell(\boldsymbol{\beta}; \mathbf{z}_i) $ when no prior information is available. Given a specific $ \boldsymbol{\gamma} $, the score function $ \mathbf{s}_{\boldsymbol{\gamma}}(\mathbf{z}_i) $ is denoted as $ \mathbf{s}_i $ for simplicity.

To monitor the change magnitude in the dataset $ \mathcal{Z}_{G_k}^E $, the CUSUM statistic is employed for the scores $ \mathbf{s}_i^E $ in $ \mathcal{Z}_{G_k}^E $:
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \mathbf{s}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \mathbf{s}^E_{i}\right) $ (5)
where $ n_k = |G_k| $ and $ s $ varies in $ G_k $, excluding points that are too close to either end. Substituting $ \boldsymbol{\mu}_i^E + \boldsymbol{\zeta}_i^E $ for $ \mathbf{s}^E_i $ yields
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right) + \Delta_k(s) $ (6)
where $ \Delta_k(s) = \sqrt{{s(n_k-s)}/{n_k}}\left({s^{-1}} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\mu}_{i}^E-(n_k-s)^{-1} \sum_{i \gt s, i \in G_k} \boldsymbol{\mu}_{i}^E\right) $. If $ \hat{\tau}_k $, located at the $ s $-th position of $ G_k $, is a false discovery, then $ \Delta_k(s) = 0 $. Hence the CUSUM statistic $ \mathbf{c}_k $ is just a combination of the errors:
$ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \displaystyle\sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right), \text{ if } \hat{\tau}_k \in \mathcal{I}_0 $ (7)
Observe that for a false discovery $ \hat{\tau}_k $ lying at the $ s $-th position of $ G_k $, $ \mathbf{c}_{k}(s) $ consists of two parts: $ s^{-1} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ and $ {(n_k - s)^{-1}}\sum_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i} $. By the central limit theorem, the two scaled summations $ s^{-1/2} \sum_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ and $ (n_k - s)^{-1/2} \sum_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i} $ converge to a $ d $-dimensional normal distribution with mean zero and covariance matrix $ \boldsymbol{\Sigma}^{(k)} $, provided that both $ s $ and $ n_k-s $ are sufficiently large.

Based on this intuition, i.i.d. random variables $ \xi_1, \ldots, \xi_{n_k} $ are introduced, each following the standard normal distribution $ N(0,1) $ and independent of the original dataset $ \mathcal{Z} $. Define two partial means based on $ \mathcal{Z}^O $ as $ \bar{ \mathbf{s}}_k^{O,-}(s) = s^{-1}\sum_{i \leqslant s, i \in G_k} \mathbf{s}^O_i $ and $ \bar{ \mathbf{s}}_k^{O,+}(s) = (n_k-s)^{-1} \sum_{i \gt s, i \in G_k} \mathbf{s}^O_{i} $. Then a synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ is generated, consisting of the following two parts.

Definition 2.2. Define the synthetic data based on the training sample as:
$ \left\{\xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right), i \leqslant s, i \in G_k \right\} \text{ and } \left\{\xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right), i \gt s, i \in G_k \right\} $ (8)

Using the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $, a synthetic CUSUM statistic can further be constructed:
$ \begin{split} \tilde{ \mathbf{c}}_{k}(s)=\;&\sqrt{\dfrac{s(n_k-s)}{n_k}}\Bigg(\dfrac{1}{s} \displaystyle\sum\limits_{i \leqslant s, i \in G_k} \xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)-\\& \dfrac{1}{n_k - s}\displaystyle\sum\limits_{i \gt s, i \in G_k} \xi_{i}\left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \Bigg) \end{split} $ (9)
Given the dataset $ \mathcal{Z}_{G_k}^O $, the two summations in $ \tilde{ \mathbf{c}}_{k}(s) $ are also normally distributed, that is,
$ \begin{split} \dfrac{1}{\sqrt{s}} \sum\limits_{i \leqslant s, i \in G_k} \xi_{i}\left( \mathbf{s}_{i}^O-\bar{ \mathbf{s}}_k^{O,-}(s) \right) \sim N\left(0, \widehat{\boldsymbol{\Sigma}}^{(k)-}(s)\right) \\\text{ and } \dfrac{1}{\sqrt{n_k - s}}\sum\limits_{i \gt s, i \in G_k} \xi_{i}\left( \mathbf{s}_{i}^O-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \sim N\left(0, \widehat{\boldsymbol{\Sigma}}^{(k)+}(s)\right) \end{split} $
where
$ \begin{split} \widehat{\boldsymbol{\Sigma}}^{(k)-}(s) &= \dfrac{1}{s}\displaystyle\sum\limits_{i \leqslant s, i \in G_k} \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right) \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)^\top \\ \widehat{\boldsymbol{\Sigma}}^{(k)+}(s) &= \dfrac{1}{n_k - s}\displaystyle\sum\limits_{i \gt s, i \in G_k} \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \left( \mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right)^\top \end{split} $
are two plausible estimates of $ \boldsymbol{\Sigma}^{(k)} $. Hence, the synthetic CUSUM statistic $ \tilde{\mathbf{c}}_k(s) $ mimics the distributional behavior of the original CUSUM statistic $ \mathbf{c}_k(s) $ when there are no change-points, and it is independent of $ \mathbf{c}_k(s) $ given the odd samples $ \mathcal{Z}^O $.
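To make the construction concrete, the statistics in Eq. (5) and Eq. (9) for a single candidate interval can be sketched in NumPy (an illustrative sketch, not the paper's implementation; the array and function names are assumptions):

```python
import numpy as np

def cusum(scores, s):
    """CUSUM contrast c_k(s) of Eq. (5) for an (n_k, d) array of scores."""
    n = scores.shape[0]
    left = scores[:s].mean(axis=0)
    right = scores[s:].mean(axis=0)
    return np.sqrt(s * (n - s) / n) * (left - right)

def synthetic_cusum(scores_odd, s, rng):
    """Synthetic CUSUM statistic of Eq. (9): i.i.d. N(0,1) multipliers applied
    to the odd-part scores centered within each side of the split point s."""
    n = scores_odd.shape[0]
    xi = rng.standard_normal(n)  # Gaussian multipliers, independent of the data
    left = scores_odd[:s] - scores_odd[:s].mean(axis=0)    # s_i^O - s̄_k^{O,-}(s)
    right = scores_odd[s:] - scores_odd[s:].mean(axis=0)   # s_i^O - s̄_k^{O,+}(s)
    syn_left = (xi[:s, None] * left).mean(axis=0)
    syn_right = (xi[s:, None] * right).mean(axis=0)
    return np.sqrt(s * (n - s) / n) * (syn_left - syn_right)
```

Conditional on the odd scores, `synthetic_cusum` has mean zero with side covariances given by the sample covariances $ \widehat{\boldsymbol{\Sigma}}^{(k)\pm}(s) $, which is what makes it a stand-in for $ \mathbf{c}_k(s) $ under the null.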
, if it is a false discovery, the distributions of$ \hat{\tau}_k $ and$ \mathbf{c}_k(s) $ are highly similar. In contrast, if$ \tilde{ \mathbf{c}}_{k}(s) $ corresponds to a true change point, the two distributions differ significantly. This distinction is visually demonstrated in Fig. 2, where the distributions align closely in the absence of a change point but diverge otherwise. The synthetic data thus serves as a diagnostic tool for detecting false discoveries, thereby enhancing selection power in the validation step.$ \hat{\tau}_k $
Figure 2.
Densities of the synthetic-data-based statistic and the original statistic under stationary (left) and non-stationary (right) settings. Detailed settings are described in the simulation study.
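The contrast shown in Fig. 2 can be reproduced in miniature with a quick Monte Carlo check (an illustrative design, not the simulation settings of this paper; `max_cusum` and `max_synthetic_cusum` are hypothetical helpers computing the maxima of $ \|\mathbf{c}_k(s)\|_\infty $ and $ \|\tilde{\mathbf{c}}_k(s)\|_\infty $ over interior split points):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, trim = 60, 4, 10  # segment length, dimension, endpoint truncation

def max_cusum(x):
    """max_s ||c(s)||_inf over interior split points of one segment."""
    m = len(x)
    vals = []
    for s in range(trim, m - trim):
        w = np.sqrt(s * (m - s) / m)
        vals.append(np.max(np.abs(w * (x[:s].mean(0) - x[s:].mean(0)))))
    return max(vals)

def max_synthetic_cusum(x):
    """max_s ||c̃(s)||_inf with one shared draw of Gaussian multipliers."""
    m = len(x)
    xi = rng.standard_normal(m)
    vals = []
    for s in range(trim, m - trim):
        w = np.sqrt(s * (m - s) / m)
        left = (xi[:s, None] * (x[:s] - x[:s].mean(0))).mean(0)
        right = (xi[s:, None] * (x[s:] - x[s:].mean(0))).mean(0)
        vals.append(np.max(np.abs(w * (left - right))))
    return max(vals)

reps = 200
null_T = [max_cusum(rng.standard_normal((n, d))) for _ in range(reps)]
null_Ts = [max_synthetic_cusum(rng.standard_normal((n, d))) for _ in range(reps)]
shift = np.r_[np.zeros((n // 2, d)), np.ones((n // 2, d))]  # unit mean shift
alt_T = [max_cusum(rng.standard_normal((n, d)) + shift) for _ in range(reps)]
```

Under the null the two empirical distributions agree closely, while the shifted segments produce markedly larger original statistics — the asymmetry the mirror statistic exploits below.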
Moreover, it is worth noting that solely comparing the difference at the fixed point $ \hat{\tau}_k $ may not yield the optimal result, because $ \hat{\tau}_k $ maximizes the difference in the odd-part dataset $ \mathcal{Z}^O $ rather than the even-part dataset $ \mathcal{Z}^E $. When the candidate change-points in $ \mathcal{Z}^E $ are validated, it is more reasonable to iterate through all possible locations in the interval $ G_k $ and select the point with the largest absolute difference. Furthermore, the CUSUM statistics are aggregated via an $ \ell_{q} $ norm, where $ q $ takes values in $ \{1, 2, \ldots, \infty\} $, to adapt to the change structure under the alternatives. Hence, the comparison statistics for the original dataset $ \mathcal{Z}^E_{G_k} $ and the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ are defined as
$ T_k^q = \max\limits_{ s \in G_k^*} \| \mathbf{c}_{k}(s) \|_{q} \text{ and } \tilde{T}_k^q = \max\limits_{ s \in G_k^*} \|\tilde{ \mathbf{c}}_{k}(s) \|_{q}, \text{ for } q \in \{1,2,\ldots, \infty\} $ (10)
where, for a vector $ \mathbf{x} \in \mathbb{R}^d $, $ \|\mathbf{x}\|_q := \big(\sum_{j=1}^d |x_j|^q\big)^{1/q} $ for $ 1 \leqslant q \lt \infty $, $ \|\mathbf{x}\|_{\infty} := \max_{1 \leqslant j \leqslant d}|x_j| $, and $ G_k^* $ is a truncated version of $ G_k $, obtained by removing the first $ \underline{s} $ and last $ \underline{s} $ indices. This truncation is necessary since $ \widehat{\boldsymbol{\Sigma}}^{(k)-}(s) $ and $ \widehat{\boldsymbol{\Sigma}}^{(k)+}(s) $ may not be reliable estimates of $ \boldsymbol{\Sigma}^{(k)} $ when the sample size within the left or right segment is too small. Therefore, if $ \hat{\tau}_k $ is a false discovery, the distribution of $ \tilde{T}_k^q $ is expected to closely approximate that of $ T_k^q $, given the similarity between the synthetic and validation data.

However, if the candidate change-point $ \hat{\tau}_k $ belongs to the informative set, i.e., $ \hat{\tau}_k \in \mathcal{I}_1 $, then $ \Delta_k(s) $ in Eq. (6) should be the leading term of $ \mathbf{c}_{k}(s) $. Denote
$ \mathbf{r}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum\limits_{i \leqslant s, i \in G_k} \boldsymbol{\zeta}^E_{i}-\dfrac{1}{n_k-s} \sum\limits_{i \gt s, i \in G_k} \boldsymbol{\zeta}^E_{i}\right) $
then, in this case,
$ T_k^q = \max\limits_{s \in G_k^*} \| \mathbf{c}_k(s)\|_q \geqslant \max\limits_{s \in G_k^*} \| \Delta_{k}(s)\|_q -\max\limits_{s \in G_k^*}\|\mathbf{r}_{k}(s) \|_q $
where $ \mathbf{r}_{k}(s), s \in G_k^* $, are all zero-mean vectors. Furthermore, the CUSUM statistics defined on the synthetic dataset $ \tilde{ \mathcal{Z}}_{G_k} $ also have zero mean, as they are transformations of zero-mean random vectors. Hence, the original comparison statistic $ T_k^q $ should be sufficiently larger than the synthetic comparison statistic $ \tilde{T}_k^q $. That is,
$ T_k^q - \tilde{T}_k^q \geqslant \max\limits_{s \in G_k^*} \| \Delta_{k}(s)\|_q -\max\limits_{s \in G_k^*}\| \mathbf{r}_{k}(s) \|_q - \max\limits_{s \in G_k^*}\| \tilde{\mathbf{c}}_k(s) \|_q \gg 0 $
provided that the change magnitudes within $ G_k $ are sufficiently large and well-separated.

FDR control via SD filter
-
Since the synthetic data in Eq. (8) mimic the distributional behavior under the null, the distributions of $ T_k^{q} $ and $ \tilde{T}_k^{q} $ are close when $ \hat{\tau}_k \in \mathcal{I}_0 $, as discussed previously. Therefore, a comparison statistic is defined for each candidate change-point $ \hat{\tau}_k $ as follows:
$ W_k^q = T_k^q- \tilde{T}_k^q, \text{ for } q \in \{1, 2, \ldots, \infty\} $ (11)
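Given the vectors of original and synthetic statistics over all candidates, the mirror statistics and a knockoff+-style data-dependent threshold (the rule formalized in Eq. (16) below) can be sketched as follows (an illustrative sketch; here the absolute values of the nonzero $ W_k $ serve as the candidate thresholds, an implementation choice):

```python
import numpy as np

def sd_filter_select(T, T_syn, alpha=0.2):
    """Keep candidates whose mirror statistic W_k = T_k - T̃_k exceeds the
    smallest threshold t at which the estimated FDP,
    (1 + #{W_k <= -t}) / max(#{W_k >= t}, 1), falls below alpha."""
    W = np.asarray(T, dtype=float) - np.asarray(T_syn, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):  # scan candidate thresholds upward
        fdp_hat = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= alpha:
            return np.flatnonzero(W >= t), t
    return np.array([], dtype=int), np.inf  # nothing passes: select no candidate
```

For a vector with a few strongly positive entries (true changes) and several near-symmetric entries around zero (false discoveries), only the strong ones survive the scan.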
Moreover, the odd dataset $ \mathcal{Z}^O $, used to detect the candidate change-points, provides valuable side information. For each $ \hat{\tau}_k $, the CUSUM statistic can also be computed using $ \mathcal{Z}_{G_k}^O $, denoted by $ T_k^q(\mathcal{Z}_{G_k}^O) $. If the candidate set $ \hat{\mathcal{S}} $ is a plausible estimate of the true set $ \mathcal{S} $, it is typically observed that $ T_k^q(\mathcal{Z}_{G_k}^O) \gt T_{k^\prime}^q(\mathcal{Z}_{G_{k^\prime}}^O) $ for $ \hat{\tau}_k \in \mathcal{I}_1 $ and $ \hat{\tau}_{k^\prime} \in \mathcal{I}_0 $. By incorporating this information, a new statistic $ W_k^{side,q} $ is introduced, which blends the original comparison statistic with the side information:
$ W_k^{side,q} = (T_k^q- \tilde{T}_k^q)\, T_k^q( \mathcal{Z}_{G_k}^O) $ (12)
After incorporating the odd dataset $ \mathcal{Z}^O $, the statistic still exhibits symmetry around zero in the case of a false discovery. The advantage of $ W_k^{side,q} $ lies in its ability to enhance the separation between the comparison statistics corresponding to the informative and uninformative sets. Specifically, when $ \hat{\tau}_k \in \mathcal{I}_1 $ and $ \hat{\tau}_{k^\prime} \in \mathcal{I}_0 $, the ratio satisfies
$ \dfrac{W_k^{side,q}}{W_{k^\prime}^{side,q}} = \left(\dfrac{W_k^q}{W^q_{k^\prime}}\right) \cdot \left(\dfrac{T_k^q( \mathcal{Z}_{G_k}^O)}{T_{k^\prime}^q( \mathcal{Z}_{G_{k^\prime}}^O)}\right) \geqslant \dfrac{W^q_k}{W^q_{k^\prime}} $ (13)
since the second factor is typically greater than or equal to one.
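The side-information statistic of Eq. (12) is a one-line extension (again an illustrative sketch):

```python
import numpy as np

def side_info_stat(T, T_syn, T_odd):
    """W_k^{side} of Eq. (12): the mirror statistic scaled by the odd-part
    CUSUM maximum, which stretches the gap between informative candidates
    (large T_odd) and uninformative ones (small T_odd)."""
    T, T_syn, T_odd = (np.asarray(a, dtype=float) for a in (T, T_syn, T_odd))
    return (T - T_syn) * T_odd
```

Because a false discovery has $ W_k^q $ symmetric around zero and $ T_k^q(\mathcal{Z}_{G_k}^O) \gt 0 $, the multiplication preserves the sign symmetry while amplifying true signals, so the same thresholding rule applies unchanged.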
Building on the definitions of $ W_k^q $ and $ W_k^{side,q} $, the candidate set is refined by selecting positions with large values of the comparison statistics. Specifically, define
$ \mathcal{T}(t) = \{ \hat{\tau}_k \in \hat{\mathcal{S}}: W_k^q \geqslant t\} \text{ or } \mathcal{T}^{side}(t) = \{ \hat{\tau}_k \in \hat{\mathcal{S}}: W_k^{side,q} \geqslant t\} $ (14)
where $ \mathcal{T}(t) $ (or $ \mathcal{T}^{side}(t) $) is referred to as the selection set. Given such a set, the number of false discoveries can be estimated by exploiting the symmetry of $ W_k^q $ or $ W_k^{side,q} $ around zero when $ \hat{\tau}_k \in \mathcal{I}_0 $. This leads to the following relationship:
$ \#\{ \hat{\tau}_k \in \mathcal{I}_0: W_k^{side,q} \geqslant t\} \approx \#\{ \hat{\tau}_k \in \mathcal{I}_0: W_k^{side,q} \leqslant -t\} \leqslant \#\{ \hat{\tau}_k: W_k^{side,q} \leqslant -t\} $ (15)
where $ W_k^{side,q} $ can be replaced by $ W_k^q $. Based on this property, the FDR can be approximated by
$ \text{FDR}(t) \approx \dfrac{| \mathcal{T}(t)\cap \mathcal{I}_0|}{| \mathcal{T}(t)|} \leqslant \dfrac{ \#\{k: W_k \leqslant -t\}}{ \#\{k: W_k \geqslant t\}} $
To control the FDR at a target level $ \alpha $, the knockoff+ procedure [22] is followed and a data-dependent threshold $ T(\alpha) $ is computed as follows:
$ T(\alpha) = \min_{t} \left\{t \in \mathcal{W}: \dfrac{1+ \#\{k : W_k \leqslant -t\}}{\#\{k : W_k \geqslant t\} \vee 1} \leqslant \alpha\right\} $ (16)
where $ \mathcal{W} = \{ W_1, \ldots, W_{ \hat{K}_n}\} \backslash \{0\} $, and the extra term 1 in the numerator makes the choice of $ T(\alpha) $ slightly conservative.

In summary, compared with pure data-splitting methods such as MOPS and M-MOPS, the core methodological distinction of this approach lies in how the test statistics are constructed. Instead of performing inference on only half of the data, the method generates a synthetic dataset via a Gaussian multiplier bootstrap, which integrates information from both the odd part and the reserved even part. This design increases the effective sample size and thus improves statistical power. Moreover, although the construction of $ W_{k}^{q} $ in Eq. (11) differs from that in MOPS and M-MOPS, it serves a conceptually similar role as the comparison statistic used in data-splitting-based procedures. As shown in Eq. (13), the formulation effectively amplifies the contrast between informative and uninformative subsets, which in turn facilitates better control of the false discovery rate.

-
The error rate control results rely on the symmetry of the comparison statistics $ W_k^q $ and $ W_k^{side,q} $ when $ \hat{\tau}_k $ is a false discovery. To lay the groundwork for FDR control, it is essential to examine this symmetry property systematically. Before presenting the main theorem, some regularity conditions are imposed.

Condition 3.1 (Moments and tails). Let $ \underline{b} $ and $ \overline{b} $ be positive constants. For any vector $ \boldsymbol{\vartheta} \in \mathbb{S}^d $, all $ \ell = 1, 2 $, $ i = 1, \ldots, 2n $, and $ j = 1, \ldots, d $: (1) $ \text{E}[(\boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i})^2] \geqslant \underline{b} $; (2) $ \text{E}[(\boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i})^{\ell+2}] \leqslant \overline{b}^{\ell} $; (3) $ \| \boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_{i}\|_{\psi_1} \leqslant \overline{b} $, where for $ \beta \in [1, \infty) $, $ \| \cdot \|_{\psi_\beta} $ represents an Orlicz norm.

Condition 3.2 (Detection ability). Assume $ \hat{K}_n \geqslant K_n $. There exist $ \hat{\tau}_{j_1} \lt \ldots \lt \hat{\tau}_{j_{K_n}} $ belonging to $ \hat{\mathcal{S}} $ such that $ \max_{1 \leqslant k \leqslant K_n} | \hat{\tau}_{j_k} - \tau_k| \leqslant \delta_n $ holds with probability approaching one as $ n \to \infty $, where $ \delta_n $ is some positive sequence.

Condition 3.3 (Minimum distance). Assume that $ \mathcal{T} \subseteq \mathcal{T}(\omega_n) = \{ \mathcal{T}: \min_{j}(\tau_{j+1} - \tau_j) \geqslant \underline{\lambda}_n\} $, where $ \underline{\lambda}_n $ is a positive sequence such that $ \underline{\lambda}_n \geqslant n^{\eta} $ for some $ 0 \lt \eta \lt 1 $ and $ \underline{\lambda}_n \geqslant 2\delta_n $.

A sufficient condition for Condition 3.1 (1) is that the minimum eigenvalue of $ \boldsymbol{\Sigma}^{(k)} $ is uniformly bounded below for all $ k = 1, \ldots, K_n $. This is a common assumption when the dimension $ d $ is fixed. Condition 3.1 (2) is a mild moment condition on linear transformations of $ \boldsymbol{\zeta}_i $. Condition 3.1 (3) assumes a sub-exponential tail for such transformations of the residuals. Note that $ \| \boldsymbol{\vartheta}^\top \boldsymbol{\zeta}_i\|_{\psi_1} \leqslant \sum_{j =1}^d \|\vartheta_j \zeta_{ij}\|_{\psi_1} \leqslant \sum_{j =1}^d \| \zeta_{ij} \|_{\psi_1} $. Therefore, if $ \sum_{j =1}^d \| \zeta_{ij} \|_{\psi_1} \leqslant \bar{b} $, then Condition 3.1 (3) is satisfied. Conditions similar to Condition 3.1 can be found in Liu et al. [1] and Yu & Chen [13] for change-point inference. Condition 3.2 requires that the set of candidate change-points $ \hat{\mathcal{S}} $ is sufficiently accurate. Condition 3.3 imposes sparsity on the true change-points. Both Conditions 3.2 and 3.3 are also considered in Chen et al. [30].

Let $ \overline{n} = \max_{k = 1,\ldots, K_n} n_k $ and $ \underline{n} = \min_{k = 1,\ldots, K_n} n_k $ denote the maximum and minimum sample sizes of the intervals $ G_k $, respectively. The following lemma can now be established.

Lemma 3.1. Assume Conditions 3.1, 3.2, and 3.3 hold. Then
$ \text{Pr}\left\{\max\limits_{k\in\mathcal{I}_0}\rho(T_k^q,\tilde{T}_k^q)\leqslant c\left(\dfrac{\log^7(\overline{n})}{\underline{s}}\right)^{1/6}\mid\mathcal{Z}^O\right\}\geqslant1-C/(\underline{n})^{\kappa} $ (17)
where $ \rho(T_1, T_2) = \sup_{t \in (0, \infty)} | \text{Pr}(T_1 \leqslant t) - \text{Pr}(T_2 \leqslant t) | $ denotes the Kolmogorov distance between $ T_1 $ and $ T_2 $. Here, $ \underline{s} $ is defined as $ \min_{k = 1,\ldots, K_n} (\tau_{k+1} - \tau_{k}) $, and $ \kappa $, $ C $, and $ c $ are positive constants.

Lemma 3.1 demonstrates that the distribution of the synthetic statistic $ \tilde{T}_k^q $ approximates that of its original CUSUM counterpart $ T_k^q $ under mild conditions. Based on Lemma 3.1, the main results on FDR control are now presented.

Theorem 3.1. Under Conditions 3.1, 3.2, and 3.3, and $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $, the SD filter satisfies
$ \limsup\limits_{n \to \infty} \text{E} \left[ \dfrac{ | \mathcal{T} \cap \mathcal{I}_0|}{| \mathcal{T}|} \Big | \mathcal{Z}^O \right] \leqslant \alpha $
for any $ \alpha \in (0,1) $.

Theorem 3.1 establishes the asymptotic FDR control property of the SD filter. The condition $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $ implies that the truncation parameter $ \underline{s} $ cannot be too small, i.e., $ s $ should not approach the endpoints of $ G_k $ too closely. This requirement is reasonable, since accurate covariance estimation requires sufficiently large samples on both sides of $ s $. Unlike Theorem 1 in Chen et al. [30], Theorem 3.1 does not require $ \delta_n/\underline{\lambda}_n \to 0 $, indicating that the SD filter does not rely on highly accurate candidate localization.

The FDR control for a general $ \ell_q $-norm cannot be extended to high-dimensional MCP settings, due to the constraints of the high-dimensional central limit theorem for simple and sparse convex sets, as discussed in Chernozhukov et al. [32]. To address the high-dimensional MCP challenge, a practical approach is to set $ q = \infty $. In this case, the high-dimensional central limit theorem for hyperrectangles can be employed to justify Lemma 3.1 even in high-dimensional MCP scenarios.

Theorem 3.2 (FDR control for high-dimensional MCP). Let $ d \to \infty $ and $ q = \infty $. Under Conditions 3.1, 3.2, and 3.3, and $ \log^7(\hat{K}_n \bar{n} d^2)/\underline{s} \to 0 $, the SD filter satisfies
$ \limsup\limits_{n \to \infty} \text{E} \left[ \dfrac{ | \mathcal{T} \cap \mathcal{I}_0|}{| \mathcal{T}|} \Big | \mathcal{Z}^O \right] \leqslant \alpha $
for any $ \alpha \in (0,1) $.

Power analysis
-
Next, the power of the SD filter is analyzed under the following signal condition.
Condition 3.4 (Minimum signal). Let $ \boldsymbol{\delta}^{(k)} = \boldsymbol{\mu}_{k+1} - \boldsymbol{\mu}_k $ be the change magnitude, which satisfies
$ \min\limits_{k \in \mathcal{I}_1}\| \boldsymbol{\delta}^{(k)}\|_{q} \gg C\bar{\sigma}^2\sqrt{\dfrac{\log(\alpha_n \hat{K}_n\bar{n}d) }{t_k(1-t_k) \underline{n}}} $
where $ t_k = \tau_k/n_k $, $ \alpha_n $ is a sequence that diverges to infinity at a slow rate, and $ C $ is a positive constant.

Condition 3.4 imposes a minimum signal separation between any two true change-points, ensuring their asymptotic identifiability. Similar conditions can be found in Harchaoui & Lévy-Leduc [5], Fryzlewicz [6], and Yu & Chen [13].
Theorem 3.3. Under Conditions 3.1, 3.2, 3.3, and 3.4, and $ \log^7(\hat{K}_n \bar{n})/\underline{s} \to 0 $, the power of the SD filter satisfies
$ \lim\limits_{n \to \infty} \text{E}\left[\dfrac{| \mathcal{T} \cap \mathcal{I}_1|}{| \mathcal{I}_1|} \Big| \mathcal{Z}^O\right] = 1 $
Theorem 3.3 states that the power of the SD filter approaches 1 asymptotically. Furthermore, a selection consistency property can be established.
Corollary 3.1. Under the conditions of Theorem 3.3,
$ \lim\limits_{n\to\infty}\text{Pr}\left\{\mathcal{S}=\mathcal{T}\mid\mathcal{Z}^O\right\}=1 $
Compared with the condition for selection consistency in Chen et al. [30], which requires $ \min_{k \in \mathcal{I}_1}\| \boldsymbol{\delta}^{(k)}\|_2 \gg \sqrt{\log n/\underline{\lambda}_n} $, Condition 3.4 is of approximately the same order, noting that $ \underline{\lambda}_n \approx \underline{n} $.

-
In this section, a series of change-point detection experiments is conducted to evaluate the empirical performance of the SD filter. Before presenting the results, the competing methods from Chen et al.[30] are briefly summarized: the Mirror with Order-Preserved Splitting (MOPS) method and its variant, the Modified-MOPS (M-MOPS).
The M-MOPS method controls the FDR via a mirror statistic
$ W_k^{\text{M-MOPS}}=\dfrac{n_kn_{k+1}}{n_k+n_{k+1}}\left(\overline{\mathbf{S}}_k^{O,-}-\overline{\mathbf{S}}_k^{O,+}\right)^{\top}\Omega_n\left(\overline{\mathbf{S}}_k^{E,-}-\overline{\mathbf{S}}_k^{E,+}\right),\quad k=1,\ldots,\hat{K}_n $
where $ \overline{\mathbf{S}}_k^{O,-} $, $ \overline{\mathbf{S}}_k^{O,+} $, $ \overline{\mathbf{S}}_k^{E,-} $, $ \overline{\mathbf{S}}_k^{E,+} $ are previously defined, and $ \Omega_n $ is a positive definite matrix. Since the performance of M-MOPS is not sensitive to $ \Omega_n $, the study sets $ \Omega_n = \mathbf{I}_d $, the $ d $-dimensional identity matrix, in the simulations. Compared with M-MOPS, the MOPS statistic differs only in that it uses a larger sample to compute the sample means of the score functions:
$ W_k^{\text{MOPS}}=\dfrac{n_kn_{k+1}}{n_k+n_{k+1}}\left(\tilde{\mathbf{S}}_k^{O,-}-\tilde{\mathbf{S}}_k^{O,+}\right)^{\top}\Omega_n\left(\tilde{\mathbf{S}}_k^{E,-}-\tilde{\mathbf{S}}_k^{E,+}\right),\quad k=1,\ldots,\hat{K}_n $
where $ \tilde{\mathbf{S}}_k^{O,-} $ and $ \tilde{\mathbf{S}}_k^{O,+} $ denote the sample means of $ \{ \mathbf{s}_i^O, \hat{\tau}_{k-1} \lt i \leqslant \hat{\tau}_k\} $ and $ \{ \mathbf{s}_i^O, \hat{\tau}_{k} \lt i \leqslant \hat{\tau}_{k+1}\} $, respectively; $ \tilde{\mathbf{S}}_k^{E,-} $ and $ \tilde{\mathbf{S}}_k^{E,+} $ are defined analogously using the even subsample.

Next, the computational complexity of the methods is compared. Treating basic arithmetic operations as O(1), computing $ W_{k}^{\text{M-MOPS}} $ requires O(dn) operations, where d is the data dimension. Similarly, the quantities $ \mathbf{c}_{k}(s) $ and $ \tilde{\mathbf{c}}_{k}(s) $ also involve O(dn) operations, implying that both $ T_{k}^{q} $ and $ \tilde{T}_{k}^{q} $, and therefore $ W_{k}^{side,q} $, have the same complexity. The SD filter generates only a single set of multiplier bootstrap samples and therefore adds no extra computational overhead. In practice, one may run $ B $ bootstrap repetitions, yielding a cost of O(Bdn); however, the procedure can be parallelized with ease, so the runtime can approach that of a single iteration. Overall, the SD filter does not incur high computational cost.

Beyond computational considerations, an important issue is statistical reliability. In particular, MOPS may fail to control the FDR when the discrepancy between the candidate and true change-point sets is large. To assess the performance of all methods, 200 simulation replications were conducted and each method was evaluated using the empirical FDR and power:
$ \widehat{\text{FDR}}=\dfrac{1}{200} \sum\limits_{i=1}^{200} \dfrac{\left| \mathcal{T}_i\cap \mathcal{I}_0\right|}{| \mathcal{T}_i|} \quad \text{and} \quad \widehat{\text{Power}}=\dfrac{1}{200} \sum\limits_{i=1}^{200} \dfrac{\left| \mathcal{T}_i \cap \mathcal{I}_1\right|}{|\mathcal{I}_1|} $
where $ \mathcal{T}_i $ denotes the estimated selection set in the $ i $th replication. The detailed pseudocode of this approach is given in Algorithm 1.
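To make the selection rule concrete, the following pure-Python sketch implements one-dimensional analogues of the CUSUM contrast, the Gaussian-multiplier synthetic CUSUM, and the knockoff-style thresholding of Algorithm 1; the function names (`cusum`, `synthetic_cusum`, `select`) are illustrative, and the multivariate $\|\cdot\|_q$ aggregation over a score vector is deliberately omitted for clarity.

```python
import math
import random

def cusum(scores, s):
    """CUSUM contrast for a univariate score sequence at split point s
    (a one-dimensional analogue of c_k(s) in Algorithm 1)."""
    n = len(scores)
    left = sum(scores[:s]) / s
    right = sum(scores[s:]) / (n - s)
    return math.sqrt(s * (n - s) / n) * (left - right)

def synthetic_cusum(scores, s, rng):
    """Gaussian-multiplier synthetic CUSUM: centre each half of the scores,
    multiply by independent N(0,1) weights, and recompute the contrast."""
    n = len(scores)
    mean_l = sum(scores[:s]) / s
    mean_r = sum(scores[s:]) / (n - s)
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
    left = sum(xi[i] * (scores[i] - mean_l) for i in range(s)) / s
    right = sum(xi[i] * (scores[i] - mean_r) for i in range(s, n)) / (n - s)
    return math.sqrt(s * (n - s) / n) * (left - right)

def select(W, alpha):
    """Knockoff-style threshold T(alpha) and final selection: keep the
    indices whose mirror statistic exceeds the smallest threshold at which
    the estimated false discovery proportion drops below alpha."""
    grid = sorted(abs(w) for w in W if w != 0)
    for t in grid:
        neg = sum(w <= -t for w in W)
        pos = max(sum(w >= t for w in W), 1)
        if (1 + neg) / pos <= alpha:
            return [k for k, w in enumerate(W) if w >= t]
    return []  # no threshold achieves the target level
```

For example, `select([5.0, 4.0, 3.0, -1.0], 0.5)` returns `[0, 1, 2]`: the three candidates with large positive mirror statistics survive, while the negative statistic serves only to estimate the false discovery proportion.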
Table 1. Synthetic data filter (SD filter) for MCP detection.

${\bf Input:}$ Observed data sequence $\mathcal Z = \{{\bf{z}}_1, \dots, {\bf{z}}_{2n}\}$, target FDR level $\alpha$, suitable candidate change-point detection algorithm $\mathcal{A}(\cdot)$

${\bf Output:}$ Selected change-point set $\mathcal T$

1: Split the data into odd and even parts: $ \mathcal{Z}^O = \{ \mathbf{z}_1, \mathbf{z}_3, \dots, \mathbf{z}_{2n-1}\} $, $ \mathcal{Z}^E = \{ \mathbf{z}_2, \mathbf{z}_4, \dots, \mathbf{z}_{2n}\} $

2: Detect candidate change-points $ \hat{ \mathcal{S}}=\{ \hat{\tau}_1, \dots, \hat{\tau}_{\hat{K}_n}\} $ through $ \mathcal{A}(\mathcal{Z}^O) $

3: for $ k = 1,\cdots,\hat{K}_n $ do

4: Define $ G_k := \left[\lceil(\hat{\tau}_{k-1}+ \hat{\tau}_k)/2\rceil, \lceil(\hat{\tau}_{k} + \hat{\tau}_{k+1})/2\rceil\right) $, where $ \hat{\tau}_0 = 0 $ and $ \hat{\tau}_{\hat{K}_n+1} = 2n $, and set $ \mathcal{Z}_{G_k}^O := \{ \mathbf{z}_{2i-1}: i \in G_k\} $, $ \mathcal{Z}_{G_k}^E := \{ \mathbf{z}_{2i}: i \in G_k\} $

5: Compute the score functions $ \mathbf{s}_i^E $ and the CUSUM statistic $ \mathbf{c}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum_{i \leqslant s, i \in G_k} \mathbf{s}^E_{i}-\dfrac{1}{n_k-s} \sum_{i \gt s, i \in G_k} \mathbf{s}^E_{i}\right) $

6: Generate synthetic data $ \tilde{ \mathcal{Z}}_{G_k} $ through $ \left\{\xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right), i \leqslant s, i \in G_k \right\} $ and $ \left\{\xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right), i \gt s, i \in G_k \right\} $, where $ \xi_i \sim N(0,1) $

7: Compute the synthetic CUSUM $ \tilde{ \mathbf{c}}_{k}(s)=\sqrt{\dfrac{s(n_k-s)}{n_k}}\left(\dfrac{1}{s} \sum_{i \leqslant s, i \in G_k} \xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,-}(s) \right)- \dfrac{1}{n_k - s}\sum_{i \gt s, i \in G_k} \xi_{i}\left(\mathbf{s}^O_{i}-\bar{ \mathbf{s}}_k^{O,+}(s) \right) \right) $

8: Calculate $ T_k^q = \max_{s \in G_k^*} \| \mathbf{c}_k(s)\|_q $, $ \tilde{T}_k^q = \max_{s \in G_k^*} \|\tilde{ \mathbf{c}}_k(s)\|_q $, and $ W_k^{side,q} = (T_k^q - \tilde{T}_k^q) \cdot T_k^q(\mathcal{Z}_{G_k}^O) $

9: end for

10: Obtain the threshold $ T(\alpha) = \min \left\{ t : \dfrac{1 + \#\{k: W_k^{side,q} \leqslant -t\}}{\#\{k: W_k^{side,q} \geqslant t\} \vee 1} \leqslant \alpha \right\} $

11: Select the final change-point set $ \mathcal{T}^{side} = \{\hat{\tau}_k \in \hat{ \mathcal{S}} : W_k^{side,q} \geqslant T(\alpha)\} $

12: return $ \mathcal{T}^{side} $

Simulation for multiple mean changes model
-
Consider a sequence of $ d $-dimensional mean vectors $ \boldsymbol{\mu}_{i}, i = 1,\ldots, n $, where the means are piecewise constant with change-points at positions $ \{\tau_k, k = 1, \ldots, K\} $. Specifically,
$ \boldsymbol{\mu}_i = \boldsymbol{\mu}_{\tau_k}, \,\, \text{for} \,\, \tau_{k}+1 \leqslant i \leqslant \tau_{k+1} $

The sequence is initialized with $ \boldsymbol{\mu}_{\tau_1} = (A/2) \mathbf{1}_d $, where $ \mathbf{1}_d $ denotes the $ d $-dimensional vector of ones. To define $ \boldsymbol{\mu}_{\tau_2} $, randomly select $ r $ positions in $ \boldsymbol{\mu}_{\tau_1} $ and flip the sign of those entries. Subsequent vectors $ \boldsymbol{\mu}_{\tau_k} $ are constructed recursively from their predecessors using the same procedure. Under this setup,
$ \| \boldsymbol{\mu}_{\tau_k} - \boldsymbol{\mu}_{\tau_{k+1}}\|_{\infty} = A, \quad k = 1, \ldots, K $

The data points are generated as $ \mathbf{z}_i = \boldsymbol{\mu}_i + \boldsymbol{\epsilon}_i, i = 1, \ldots, n $, and the signal strength is controlled by the value of $ A $. Throughout this subsection, the dimensionality is set to $ d = 50 $ and the sample size to $ n=4,000 $. The true change-points are set to $ \tau_k=200\,k,\ k=1,\ldots,19 $ in Example 1, and to $ \tau_k=400\,k,\ k=1,\ldots,9 $ in Example 2. To mitigate the undesired bias from detection algorithms, a sequence of candidate change-points is manually constructed as $ \mathcal{T} = \{150 k+ (-1)^{\rm{B_k}} {\rm{P_k}} \mid k = 1, \ldots, 26\} $, where $ \text{B}_{k} $, $ \text{P}_{k} $ are independently drawn from $ \text{Bernoulli}(1/2) $ and $ \text{Poisson}(5) $, respectively. Theoretically, $ q = \infty $ is required in high-dimensional settings, while any $ q \geqslant 1 $ is admissible in low-dimensional cases. Since the optimal choice of $ q $ depends on unknown signal conditions, $ q = \infty $ is adopted as a practical default in both simulations and empirical analysis. For the truncation size, $ s \in [10, 30] $ is chosen to ensure that both $ s $ and $ n_k - s $ are sufficiently large, thereby mitigating boundary effects and improving the stability of the algorithm. In addition, the study sets $ r = 1 $ and FDR level $ \alpha = 0.15 $.

Example 1: Normal distribution
Consider the error term $ \boldsymbol{\epsilon}_i $ to be drawn from a multivariate normal distribution with covariance matrix $ \boldsymbol{\Sigma} = \{\rho^{|i-j|}\}_{(i,j)} $. The change magnitude $ A $ and the correlation coefficient $ \rho $ are chosen as follows:

● Fix $ \rho $ = 0, and let A vary in {1.5, 1.7, 1.9, 2.1, 2.3, 2.5}.

● Fix A = 1.5, and let $ \rho $ vary in {0, 0.2, 0.4, 0.6, 0.8}.

Example 2: Beyond normal distribution
Consider the error term $ \boldsymbol{\epsilon}_i $ to be drawn from either a multivariate t distribution or a multivariate chi-square distribution, each having covariance matrix $ \boldsymbol{\Sigma} = \mathbf{I}_d $. The change magnitude A and the degrees of freedom df are chosen as follows:

● Fix A = 2, and let df vary in {8, 9, 10, 11, 12}.

● Fix A = 3, and let df vary in {3, 4, 5, 6, 7}.
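As a concrete illustration of the data-generating design above, the following pure-Python sketch builds the piecewise-constant mean vectors by recursive sign flips and assigns them to observations; the function names `make_means` and `expand_to_sequence` are hypothetical, not from the paper.

```python
import random

def make_means(K, d, A, r, rng):
    """Build the K+1 segment means: start from (A/2)*1_d and, at each
    change-point, flip the sign of r randomly chosen coordinates.  Every
    entry stays +-A/2, so consecutive means differ by exactly A in sup-norm."""
    means = [[A / 2.0] * d]
    for _ in range(K):
        nxt = means[-1][:]
        for j in rng.sample(range(d), r):
            nxt[j] = -nxt[j]
        means.append(nxt)
    return means

def expand_to_sequence(means, change_points, n):
    """Assign segment mean k to observations tau_k < i <= tau_{k+1} (1-based),
    mirroring the piecewise-constant model mu_i = mu_{tau_k}."""
    mu, seg = [], 0
    for i in range(1, n + 1):
        if seg < len(change_points) and i > change_points[seg]:
            seg += 1
        mu.append(means[seg])
    return mu
```

With d = 50, A = 1.5, and r = 1 as in Example 1, every pair of consecutive segment means differs by exactly 1.5 in the ∞-norm; adding the error terms $\boldsymbol{\epsilon}_i$ of either example then yields the observed sequence.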
The simulation results summarized in Figs 3 and 4 demonstrate the strong performance of the SD filter across various signal strengths and dependence structures. Both the SD filter and M-MOPS maintain FDR control at the predetermined level. In contrast, MOPS struggles to control the FDR because of the deliberately reduced quality of the candidate change-point set: when the candidate positions are far from the actual change-points, MOPS fails to perform effectively. In terms of empirical power, the SD filter consistently outperforms the M-MOPS method across all settings.
Figure 3.
FDR and power trends with respect to A and $ \rho $ for SD filter, MOPS, and M-MOPS under the mean change model, where n = 4,000, d = 50, $ \alpha $ = 0.15.
Figure 4.
FDR and power trends with respect to df of the t-distribution and chi-square distribution for SD filter, MOPS, and M-MOPS under the mean change model, where n = 4,000, d = 50, $ \alpha $ = 0.15.
Structural breaks in linear regression model
-
Consider a linear regression model with structural breaks, defined as $ \mathbf{y}_i = \mathbf{x}_i^\top \boldsymbol{\beta}_{\tau_k} +\epsilon_i $ for $ \tau_{k-1} \lt i \leqslant \tau_k $. Initially, let $ \boldsymbol{\beta}_{\tau_1} = (A/2) \mathbf{1}_d $. Then, define $ \boldsymbol{\beta}_{\tau_2} $ by randomly selecting $ s $ positions in $ \boldsymbol{\beta}_{\tau_1} $ and flipping the signs of those entries. Each subsequent vector $ \boldsymbol{\beta}_{\tau_k} $ is generated from $ \boldsymbol{\beta}_{\tau_{k-1}} $ using the same procedure. Under this setup,
$ \| \boldsymbol{\beta}_{\tau_k} - \boldsymbol{\beta}_{\tau_{k+1}}\|_{\infty} = A, \quad k = 1, \ldots, K. $

The covariates $ \mathbf{x}_i $ are drawn from a multivariate normal distribution $ N(\mathbf{0}_{d}, \boldsymbol{\Sigma}) $, where $ \mathbf{0}_{d} $ is the d-dimensional vector of zeros and the covariance matrix $ \boldsymbol{\Sigma} = \{\rho^{|i-j|}\}_{(i,j)} $. The error terms $ \epsilon_i $ are independently drawn from $ N(0,1) $. In this model, the sample size is set as n = 8,000 and the number of covariates d = 10. The true change-point set is defined as $ \mathcal{S} = \{1000 k, k = 1, \ldots, 7\} $, and the candidate change-point set is given by $ \hat{\mathcal{S}} = \{ 450 k + (-1)^{\text{B}_k} \text{P}_k \mid k = 1, \ldots, 16\} $, where $ \text{B}_{k} \sim \text{Bernoulli}(1/2) $ and $ \text{P}_k\sim\text{Poisson}(5) $. Except that the FDR level is set to $ \alpha = 0.2 $, all other parameters remain unchanged. The change magnitude $ A $ and correlation coefficient $ \rho $ were chosen as follows:

● Fix $ \rho $ = 0, and let A vary in {0.20, 0.22, 0.24, 0.26, 0.28, 0.30}.

● Fix A = 0.25, and let $ \rho $ vary in {0, 0.1, 0.2, 0.3, 0.4, 0.5}.

The simulation results presented in Figs 5 and 6 display the estimated FDR and empirical power for the linear regression model with structural breaks under various scenarios. As before, both the SD filter and M-MOPS successfully control the FDR at the pre-specified level, whereas MOPS again fails to do so due to the deliberately reduced quality of the candidate change-point set. In terms of empirical power, the SD filter still consistently outperforms M-MOPS in this setting.
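The structural-break regression design of this subsection can be sketched as follows. The AR(1) recursion across coordinates is a standard shortcut for sampling a Gaussian vector with covariance $\{\rho^{|i-j|}\}_{(i,j)}$; the function name `gen_regression_data` and the list-based representation are illustrative assumptions, not the paper's implementation.

```python
import bisect
import math
import random

def gen_regression_data(n, d, breaks, betas, rho, rng):
    """Simulate y_i = x_i' beta_k + eps_i, where the coefficient vector is
    piecewise constant between the true break positions in `breaks` and
    `betas` holds one coefficient vector per segment (len(breaks) + 1)."""
    data = []
    for i in range(1, n + 1):
        # AR(1) construction across coordinates gives Cov(x_a, x_b) = rho^|a-b|
        x = [rng.gauss(0.0, 1.0)]
        for _ in range(d - 1):
            x.append(rho * x[-1] + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0))
        k = bisect.bisect_left(breaks, i)  # segment index: tau_{k-1} < i <= tau_k
        y = sum(a * b for a, b in zip(x, betas[k])) + rng.gauss(0.0, 1.0)
        data.append((x, y))
    return data
```

In the simulation above one would call this with n = 8,000, d = 10, `breaks = [1000 * k for k in range(1, 8)]`, and eight coefficient vectors built by the sign-flip recursion.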
-
In this section, the proposed SD filter is applied to analyze the bladder tumor micro-array dataset sourced from Bleakley & Vert[35], which is conveniently available in the ecp R package. The dataset consists of log–intensity-ratio measurements for 2,215 genetic loci obtained from 43 individuals diagnosed with bladder tumors. The primary objective is to identify change-points within the genetic loci, enabling the study to pinpoint potential influential genes related to bladder tumors. This dataset has been widely used as a benchmark in several prior studies on change-point detection[35–37], making it a representative and well-established dataset for evaluating the empirical performance of new methods. The analysis was conducted on the full dataset. However, for visualization purposes and to provide a clearer and more interpretable presentation of the results, the findings for the first ten individuals were reported, specifically individuals 3, 4, 5, 6, 7, 8, 9, 10, 14, and 15.
Firstly, the inspect method[9] is applied to narrow the scope and obtain a candidate set. To ensure optimal performance, a minimum difference of $ 50 $ was established between two change-points. The set of identified change-points is as follows:
$\begin{split} \hat{\mathcal{S}} =\;& \{73,263,428,669,811,960, \\& 1050, 1378, 1436, 1559,1724, 1831, 1906, 2084\}\end{split} $

Subsequently, the SD filter, MOPS and M-MOPS are applied to further refine the results, controlling the FDR at a level of 0.1. The final sets of detected change-points from MOPS and M-MOPS are identical to $ \hat{\mathcal{S}} $, whereas the set of change-points detected by the SD filter is as follows:
$ \mathcal{T}_{SD} = \{73,263,428,669,811,960, 1050, 1378, 1436, 1559, 1724, 1906, 2084\} $

This result of the SD filter excludes position 1,831. Figure 7 visually demonstrates the change-points identified by the SD filter. In each plot, individual data points represent log-intensity ratios at specific genetic loci, with each plot corresponding to a different test subject. Vertical lines indicate the locations of detected change-points; those identified by the SD filter are shown as dashed lines. Notably, the only solid line, positioned at 1,831, does not correspond to any apparent change across the ten individuals, highlighting the greater precision and accuracy of the SD filter in identifying true change-points.
-
To overcome the limitations of existing FDR control methods for multiple change-point detection, particularly the drawbacks associated with data-splitting approaches, the study proposes a synthetic data filter (SD filter) for change-point detection and FDR control. After identifying potential change-points, a Gaussian multiplier bootstrap is applied to generate synthetic data based on information from the detection dataset. These synthetic data are then used to construct a mirror statistic that enables control of the FDR, offering the flexibility to leverage information from the entire dataset and improve statistical power under a variety of alternatives and dimensions. The symmetry property of the mirror statistic is then established, and its ability to rigorously control the FDR asymptotically is proven. The detection power is also demonstrated under mild conditions. Simulation studies empirically verify the outstanding performance of the SD filter in terms of FDR control and power. The study also applies the proposed method to analyze a microarray dataset that describes the change loci of bladder tumor patients. As mentioned above, the framework of the SD filter is quite general, and it would be interesting to extend it to a broader range of cases where data splitting is required for formulating and testing hypotheses within the same dataset.
This work was supported by National Natural Science Foundation of China (Grant Nos 12271456 and 71988101), the Ministry of Education Research in the Humanities and Social Sciences (Grant No. 22YJA910002). The authors sincerely thank the editor and the referees for their constructive comments and helpful suggestions.
-
This study uses publicly available and anonymized data from the ecp R package. As the data are de-identified and distributed for open research purposes, no ethical approval was required from the authors' institution. All analyses were conducted in accordance with standard ethical guidelines for statistical research and data use.
-
The authors confirm contribution to the paper as follows: study conception and design: Sun A, Liu J; data collection: Sun A; analysis and interpretation of results: Sun A, Bi J, Liu J; draft manuscript preparation: Sun A, Bi J, Liu J. All authors reviewed the results and approved the final version of the manuscript.
-
The data that support the findings of this study are available through the ecp R package.
-
The authors declare that they have no conflict of interest.
- Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
-
Sun A, Bi J, Liu JY. 2026. A synthetic data approach for FDR control in change-point detection. Statistics Innovation 3: e002 doi: 10.48130/stati-0026-0002
- Received: 22 May 2025
- Revised: 15 December 2025
- Accepted: 29 December 2025
- Published online: 28 February 2026
Abstract: In multiple change-point analysis, the resulting detection sets are typically conservative, often identifying more change points than actually exist, due to the issues of 'unreliability of assumptions' and 'unreliability of algorithms'. Therefore, controlling the false discovery rate is of vital importance to multiple change-point detection. Data-splitting-based methods have gained widespread attention for false discovery rate control. However, relying solely on a part of the dataset during the validation stage typically suffers from power loss. Instead, the study introduces a novel synthetic data framework and proposes the Synthetic Data Filter to control the false discovery rate in multiple change-point detection. Here, the study demonstrates that the proposed method effectively controls the false discovery rate and achieves asymptotic power approaching one under mild conditions. Numerical comparisons with existing methods provide evidence for the superiority of the approach in terms of both false discovery rate control and statistical power. The proposed method is further applied to a bladder tumor microarray dataset, and potential loci are identified with structural changes.