An improved method of Wu's functional clustering

Wenqi Pan; Jincan Che; Shuang Wu; Wenqi Pan; Jincan Che; Shuang Wu

doi:10.48130/stati-0026-0006

Functional clustering is a statistical framework designed to classify high-dimensional dynamic data on the basis of their underlying trajectories rather than static data points. Although this approach has been widely applied to omics and single-cell research, it is frequently challenged by data heterogeneity. To generalize its applications, we propose an improved framework of functional clustering by integrating (i) an allometric scaling-based mean-feature structure to characterize heteroscedasticity; (ii) a structured covariance model tailored to complex biological signal patterns; and (iii) a hybrid Expetation-Maximization (EM) algorithm incorporating softmax-based adaptive smoothing to effectively mitigate cluster collapse. Through extensive simulation studies, we demonstrate that the proposed method outperforms existing approaches in terms of clustering accuracy, stability, and robustness, particularly under complex covariance structures or high dimensionality. We use the new approach to analyze transcriptome data of Prunus mume 'Meiren', identifying 72 biologically meaningful gene modules related to DNA replication and repair, protein metabolism, and tissue-specific pathways, such as phenylalanine biosynthesis underlying anthocyanin production and the purple ornamental phenotype of this cultivar. The improved framework leverages functional clustering to strengthen optimization stability, biological interpretability, robustness, and flexibility for pattern discovery in high-dimensional data.

HTML

Introduction

The concept of functional clustering was first systematically introduced by Rongling Wu and his team^[1] to classify periodically expressed genes into different patterns through Fourier series expansions. This approach, particularly named Wu's functional clustering (or Wu's FunClu), was built on a mixture-based likelihood function containing weighted mixture components. Each of these components is assumed to follow a multivariate (longitudinal) normal distribution with a mean vector modeled by biologically meaningful mathematical equations, such as Fourier's periodic equations, and a (co)variance matrix modeled by an autoregressive process. Building on this foundation, Wu and his team^[2] further expanded functional clustering to multiple dimensions of data, showing the important application and uniqueness of this approach to dissect and resolve complex systems.

Although there is a rich body of literature on clustering algorithms^[3−5], Wu's FunClu has proved to be more powerful, versatile, and flexible for data clustering. Unlike conventional approaches that treat observations as static vectors, it explicitly considers the dynamic and temporal structure of the data. Its applications are especially important in analyzing omics datasets, including transcriptomics, proteomics, metabolomics, and single-cell data^[2,6,7], where clustering can reveal hidden biological modules and functional relationships. A recent study by Wu et al.^[2] applied functional clustering to large-scale metabolomics data. Metabolites in healthy and diseased groups were grouped into coherent modules, providing new insights into metabolic reprogramming in inflammatory bowel disease^[2]. These examples highlight the broad potential of functional clustering as a versatile method that links advanced statistical modeling with key challenges in biomedicine and other scientific fields.

Modern transcriptome and multi-omics studies often involve tens of thousands to over hundreds of thousands of features, and the increasing sample size presents a significant challenge for data processing^[8,9]. Traditional functional clustering methods struggle with large datasets because of the high computational demands, high memory requirements, and the risk of misclassification, leading to less effective clustering results. Additionally, time-series data are often limited or unavailable in biological research, which hinders the use of temporal clustering methods. Moreover, multi-omics datasets frequently vary in scale, noise characteristics, and correlation structures, requiring flexible and robust clustering techniques^[10−12]. These factors highlight the need for improved functional clustering approaches that can efficiently handle high-dimensional, nontemporal, and heterogeneous biological data while providing meaningful biological insights.

In this study, we propose an improved framework of Wu's FunClu to leverage its capacity for clustering large-scale, multi-omics data. The framework combines the allometric scaling law, advanced optimization techniques, and enhanced covariance modeling approaches to produce accurate, robust, and biologically meaningful clustering results. This framework is designed to be scalable and adaptable, offering a versatile tool for analyzing complex biological datasets.

The rest of the paper is organized as follows. Section "Methods" details the proposed methodology. Section "Silulation" shows the simulation results, and section "Data analysis" demonstrates an application to real biological data. The last section concludes the paper with a discussion and future directions.

Methods

Basic framework of functional clustering

We assume that the dataset consists of $ n $ observations, each represented by an $ m $-dimensional profile $ {y}_{i}=({y}_{i1},\cdots ,{y}_{im}{)}^{T} $. The structure of Wu's FunClu is modeled using a finite Gaussian mixture distribution with $ K $ components as follows:

$ p\left({y}_{i}|\Theta \right)=\sum \limits_{k=1}^{K}\,{\alpha }_{k}\mathcal{N}\left({y}_{i}|{\mu }_{ki},{\Sigma }_{k}\right) $

(1)

where, $ {\alpha }_{k} $ is the mixing proportion satisfying $ \sum \nolimits_{k=1}^{K}{\alpha }_{k}=1 $, and $ \mathcal{N}(\cdot ) $ denotes the multivariate Gaussian density with a mean vector $ {\mu }_{ki} $and a covariance matrix ${\Sigma }_{k}$.

Traditional approaches are often constructed under the assumption that sufficient temporal or dynamic data are available. Such data enable the direct modeling of time-dependent trajectories. However, in many practical applications, longitudinal measurements are either infeasible to obtain or require excessive experimental and financial resources. This limitation motivates the need for alternative approaches that can extract dynamic-like information from static high-dimensional data.

The study of complex biological systems at the cellular and molecular levels has increasingly drawn parallels to community ecology^[13−15]. Just as ecology examines the interactions between species and their environment, systems biology aims to model how the information flow between molecular components gives rise to the emergent behavior of the system^[16].

A central pillar of this "crosspollination" is the concept of the ecological niche, which characterizes the response of organisms to their environment across time and space^[17]. Within this niche space, a species' survival and growth are fundamentally governed by resource availability^[18,19]. In natural ecosystems, resource availability is the net result of the interactions between all resident organisms and the environmental supply. Consequently, the "net quality" or productivity of such an ecosystem can be summarized by the ecosystem index, a surrogate measure derived from the total abundance of all organisms that reflects the environmental factors essential for the system's survival^[20].

Analogously, we treat a sample as a gene ecosystem. In this framework, individual genes are equivalent to organisms, and their expression levels reflect their utilization of the "resources" provided by the cellular environment. Following the notion of the ecosystem index used in ecology and agriculture^[21,22], we define the habitat index $ {E}_{i} $ for each sample $ i $ as

$ {E}_{i}=\sum \limits_{j=1}^{m}\,{y}_{ij} $

(2)

where, $ {y}_{ij} $ represents the expression level of gene $ j $ in sample $ i $.

In complex living systems, the variation of a specific component in response to the state of the whole system is rarely linear. Instead, these relationships are often observed to obey the allometric scaling law, a fundamental biological principle typically described by a power equation^[23−25].

The term "allometry" was first coined by Huxley and Teissier^[26] to investigate the phenomenon of relative growth, specifically how the proportions of an organism's parts change in relation to its total size. Since then, the scope of allometry has been broadly expanded to refer to almost any covarying biological measurements where the part's behavior is dictated by the scale of the whole.

As mentioned above, by conceptualizing each sample as a complex ecosystem of interconnected entities, we use the habitat index—defined as the cumulative abundance of all entities within a specific spatiotemporal context—as a robust proxy for assessing an ecosystem. The observed spectrum of habitat indices captures the systemic variability of gene ecosystems, likely driven by intricate internal interactions and external environmental influences. This framework effectively formalizes the part–whole relationship between individual entities' abundance and the collective habitat index across diverse samples. This methodological approach has already gained significant traction in the biological sciences. Its applications range from monitoring the functional stability of terrestrial soil microbiomes to deciphering the progressive decoupling of gene interactions during biological aging^[27,28].

Translating this principle into the clustering model, the mean function of the $ k $-th cluster is expressed as a scaling function of the habitat index

$ {\mu }_{ki}={a}_{k}E_{i}^{{b}_{k}},i=1,\cdots ,n, $

(3)

where, $ {a}_{k} $ and $ {b}_{k} $ are cluster-specific parameters. This representation allows the model to reveal scaling patterns between local observations and the global structure of the dataset, thus providing a biologically meaningful yet statistically flexible formulation for functional clustering.

The specification of the covariance matrix ${\Sigma }_{k}$ is critical for clustering performance. High-dimensional biological data often exhibit complex dependencies, and using an unrestricted covariance matrix may result in excessive free parameters, high computational cost, and numerical instability. To address this, we use a structured covariance model based on an autoregressive formulation.

Let $ {E}_{i} $ denote the expression index of sample $ i $. The variance of sample $ i $ and the covariance between samples $ i $ and $ j $ are given^[29−31] by

$ {\sigma }^{2}\left({E}_{i}\right)=\dfrac{1-{\phi }^{2i}}{1-{\phi }^{2}}{\nu }^{2},{\sigma }^{2}\left({E}_{i},{E}_{j}\right)={\phi }^{\mid i-j\mid }\dfrac{1-{\phi }^{2i}}{1-{\phi }^{2}}{\nu }^{2} $

(4)

where, $ {\nu }^{2} $ represents the measurement error variance and $ \phi \in (-1{,}1) $ is a correlation parameter controlling the strength of dependence. With $ d $ samples, the covariance matrix $\Sigma$ is of dimension $ d\times d $ and admits the form

$ \Sigma =\left[\begin{matrix}\dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2} & \phi \dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2} & \cdots & {\phi }^{d-1}\dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2}\\ \phi \dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2} & \dfrac{1-{\phi }^{4}}{1-{\phi }^{2}}{\nu }^{2} & \cdots & {\phi }^{d-2}\dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2}\\ \vdots & \vdots & \ddots & \vdots \\ {\phi }^{d-1}\dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2} & {\phi }^{d-2}\dfrac{1-{\phi }^{2}}{1-{\phi }^{2}}{\nu }^{2} & \cdots & \dfrac{1-{\phi }^{2d}}{1-{\phi }^{2}}{\nu }^{2}\\ \end{matrix}\right] $

(5)

Given a dataset $ X=\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\} $, the log-likelihood function can be expressed as follows:

$ L\left(X|\Theta \right)=\sum \limits_{i=1}^{n}\,\ln \left(\sum \limits_{k=1}^{K}\,{\alpha }_{k}\mathcal{N}\left({x}_{i}\mid {\mu }_{k}\left({E}_{i}\right),{\Sigma }_{k}\right)\right) $

(6)

Introducing latent variables that indicate cluster membership, the Expetation-Maximization (EM) auxiliary function can be written as follows:

$ \begin{split}Q\left(\Theta |{\Theta }^{\left(t\right)}\right)=&\sum \limits_{i=1}^{n}\,\sum \limits_{k=1}^{K}\,\omega _{i,k}^{\left(t\right)}\left[\ln {\alpha }_{k}-\dfrac{1}{2}\ln \det \left({\Sigma }_{k}\right)\right.\\ &\left. -\dfrac{1}{2}{\left({x}_{i}-{\mu }_{k}\left({E}_{i}\right)\right)}^{\mathrm{\top }}\Sigma _{k}^{-1}\left({x}_{i}-{\mu }_{k}\left({E}_{i}\right)\right)\right] \end{split}$

(7)

with the posterior responsibility

$ \omega _{i,k}^{\left(t\right)}=\dfrac{\alpha _{k}^{\left(t\right)}\mathcal{N}\left({x}_{i}|\mu _{k}^{\left(t\right)}\left({E}_{i}\right),\Sigma _{k}^{\left(t\right)}\right)}{\displaystyle\sum \limits_{\mathscr{l}=1}^{K}\,\alpha _{\mathscr{l}}^{\left(t\right)}\mathcal{N}\left({x}_{i}|\mu _{\mathscr{l}}^{\left(t\right)}\left({E}_{i}\right),\Sigma _{\mathscr{l}}^{\left(t\right)}\right)} $

(8)

To estimate the model parameters, we use a hybrid EM algorithm characterized by a dual-optimization strategy in the M-step. Specifically, for parameters with explicit gradients (e.g., mixing proportions), we derive closed-form analytic solutions to ensure computational efficiency. For parameters lacking such analytic properties, we incorporate numerical optimization routines to iteratively maximize the auxiliary function. This hybrid approach balances analytical precision with numerical flexibility, ensuring robust convergence in complex parameter spaces.

The mixing proportions admit the following analytic solution:

$ \alpha _{k}^{\left(t+1\right)}=\dfrac{1}{n}\sum \limits_{i=1}^{n}\,\omega _{i,k}^{\left(t\right)} $

(9)

However, in practice, it is possible that some clusters receive nearly zero assignments, leading to degeneracy (i.e., "empty clusters"). To mitigate this, we apply a smoothing transformation using the softmax function as follows:

$ \tilde{\alpha }_{k}^{\left(t+1\right)}=\dfrac{\exp \left(\gamma \alpha _{k}^{\left(t+1\right)}\right)}{\displaystyle\sum \limits_{\mathscr{l}=1}^{K}\,\exp \left(\gamma \alpha _{\mathscr{l}}^{\left(t+1\right)}\right)} $

(10)

where, $ \gamma $ is a temperature parameter controlling the degree of smoothing. This adjustment ensures that all components retain nonzero mixing weights, thereby preventing the collapse of the mixture model and improving stability.

For the mean function parameters $ ({a}_{k},{b}_{k}) $ and covariance parameters $ ({\phi }_{k},{v}_{k}) $, closed-form solutions are not available. We therefore adopt a gradient-based numerical optimization strategy. Specifically, the Adam algorithm is used to maximize $ Q $ with respect to these parameters^[32].

This mixed update strategy preserves analytic efficiency for simple parameters while leveraging modern optimization for complex components, thereby improving the convergence stability and scalability in high-dimensional applications.

Initialization of parameters is critical for the stability and convergence of the EM algorithm. In our framework, the mean function parameters are initialized using the results of a k-means clustering applied to the raw data matrix. Specifically, the empirical cluster centroids from k-means are taken as the initial estimates for the functional means, which are then projected onto the parametric form $ {\mu }_{k}(E)={a}_{k}{E}^{{{b}_{k}}} $.

The covariance parameters are initialized using identity-based structures with small perturbations to avoid singularity, whereas the mixing proportions are initialized uniformly, i.e., $ \alpha _{k}^{(0)}=1/K $.

This initialization strategy provides a reasonable approximation to the underlying structure of the data, effectively reducing the risk of poor local optima during subsequent EM iterations.

The optimal number of clusters $ K $ is determined by minimizing the Bayesian information criterion (BIC)^[33]. Specifically, for a given model with the likelihood $ L(X\mid {\hat{\Theta }}_{K}) $ and the parameter dimension $ {d}_{K} $, the BIC is defined as

$ \mathrm{BIC}\left(K\right)=-2\ln L\left(X|{\hat{\Theta }}_{K}\right)+{d}_{K}\ln n $

(11)

where, $ n $ is the sample size.

We evaluate the BIC across a grid of candidate cluster numbers and select the value of $ K $ that minimizes the criterion. This approach balances the model's fit and complexity, thereby preventing overfitting and ensuring biologically interpretable results.

Multivariate extension of functional clustering

In many real-world applications, data are collected across multiple conditions or tissues, which requires extending the functional clustering model to a multivariate setting. Suppose the dataset consists of paired vectors $ {\mathbf{x}}_{i}=({x}_{i1},\cdots ,{x}_{ip}) $ and $ {\mathbf{y}}_{i}=({y}_{i1},\cdots ,{y}_{iq}) $, representing gene expression trajectories measured under two (or more) conditions. For cluster $ k $, the mean functions are denoted as $ \mu _{k}^{x}(t)=f(\theta _{k}^{x};t) $ and $ \mu _{k}^{y}(t)=f(\theta _{k}^{y};t) $, where $ f(\cdot ) $ is the allometric growth function and $ \theta _{k}^{x},\theta _{k}^{y} $ are cluster-specific parameters.

Assuming a shared covariance structure $ \Sigma $, the joint distribution of observations is given by

$ p\left({\mathbf{x}}_{i},{\mathbf{y}}_{i}|\Theta \right)=\sum \limits_{k=1}^{K}{\alpha }_{k}\,\mathcal{N}\left({\mathbf{x}}_{i}|\mu _{k}^{x},{\Sigma }^{x}\right)\,\mathcal{N}\left({\mathbf{y}}_{i}|\mu _{k}^{y},{\Sigma }^{y}\right), $

(12)

where, $ {\alpha }_{k} $ represents the mixing proportions. The corresponding log-likelihood is

$ L\left(D|\Theta \right)=\sum \limits_{i=1}^{n}\ln \left(\sum \limits_{k=1}^{K}{\alpha }_{k} \mathcal{N}\left({\mathbf{x}}_{i}|\mu _{k}^{x},{\Sigma }^{x}\right) \mathcal{N}\left({\mathbf{y}}_{i}|\mu _{k}^{y},{\Sigma }^{y}\right)\right). $

(13)

Parameter estimation proceeds via the EM algorithm described in section "Methods", with the possibility of smoothing the mixing proportions to avoid degenerate (empty) clusters. This formulation naturally generalizes to more than two tissues or conditions.

Simulation

Data generation

To evaluate the performance of the proposed EM algorithm, we conducted a series of simulation studies based on synthetic datasets. Each dataset consisted of $ n=500 $ samples, each with $ p=50 $ features, partitioned into $ k=20 $ latent clusters.

For each cluster $ c\in \{1,\cdots ,k\} $, the mean vector was specified according to a log-linear function as follows:

$ {\mu }_{c}\left(t\right)={a}_{c}+{b}_{c}\log \left(t\right),t=1,\cdots ,p, $

(14)

where, the intercepts $ {a}_{c} $ were drawn independently from the uniform distribution $ U(0.5{,}5) $, and the slopes $ {b}_{c} $ were sampled from $ U(0.1{,}3) $. This formulation ensures heterogeneous mean structures across clusters while maintaining a consistent log-linear growth pattern.

Conditional on the cluster assignment $ {y}_{i} $, the feature vector for sample $ i $ was

$ {X}_{i}\sim \mathcal{N}\left({\mu }_{{{y}_{i}}},\Sigma \left(\phi ,v\right)\right) $

(15)

The covariance structure is assumed to be common across all components and follows the type form given in Eq. (4).

We considered two correlation settings, $ \phi =0.6 $ and $ \phi =0.9 $, and for each setting, we varied the noise variance across five levels: $ v\in \{0.2{,}\;0.4{,}\;0.6{,}\;0.8{,}\;1.0\} $. For each combination of parameters, we generated 20 independent replicates to reduce Monte Carlo variability.

Methods compared

We compare four algorithms:

FunClu: The full functional clustering pipeline described in section "Methods". In the variant FunClu, we do not apply softmax smoothing to the updated mixing proportions $ {\alpha }_{k} $.

FunClu smoothing: The same as FunClu except that after the analytic update of $ {\alpha }_{k} $ in the M-step, we apply a softmax smoothing given in Eq. (10).

This renormalization avoids zero weights and shrinks extreme small weights toward a nonzero baseline; in practice, it markedly reduces the occurrence of empty clustersand improves numeric stability when some components receive few or zero responsibilities.

We report the results for both variants to demonstrate the concrete effect of smoothing.

GMM: The standard Gaussian mixture model with full covariance, fitted by the EM. We use the same K (true $ k $) for a fair comparison.

KMeans: The standard k-means on the raw $ X $ matrix, with $ k $ set to the true number of clusters.

For all methods, we evaluate clustering recovery relative to the true labels.

Evaluation metrics
We assess clustering performance using the following metrics.

(1) Clustering accuracy (CA): The proportion of correctly classified samples under optimal label matching.

(2) Adjusted Rand index (ARI): This accounts for chance agreement in clustering.

(3) Normalized mutual information (NMI): This measures information-theoretic similarity.

Each experiment was replicated 50 times to reduce variability.

Simulation results

Tables 1 and 2 report the averaged performance (mean ± standard deviation [SD] over the replicates) of four methods (FunClu, FunClu Smoothing, GMM, and KMeans) under the noise levels $ v\in \{0.2{,}\;0.4{,}\;0.6{,}\;0.8{,}\;1.0\} $ for two correlation settings $ \phi =0.6 $ and $ \phi =0.9 $. The main findings are as follows.

Table 1. Simulation results ($\nu $ = 0.9).

$ {\nu }^{2} $	Method	CA	ARI	NMI
0.2	FunClu	0.8732 ± 0.0642	0.8357 ± 0.0508	0.9271 ± 0.0217
0.2	FunClu_ Smoothing	0.8646 ± 0.0474	0.8318 ± 0.0501	0.9236 ± 0.0227
0.2	GMM	0.7233 ± 0.0578	0.6590 ± 0.0682	0.8356 ± 0.0399
0.2	KMeans	0.7525 ± 0.0558	0.6820 ± 0.0564	0.8381 ± 0.0307
0.4	FunClu	0.6642 ± 0.0757	0.5805 ± 0.0783	0.7876 ± 0.0472
0.4	FunClu_ Smoothing	0.6859 ± 0.0678	0.5912 ± 0.0668	0.7795 ± 0.0432
0.4	GMM	0.5025 ± 0.0510	0.3895 ± 0.0516	0.6515 ± 0.0389
0.4	KMeans	0.5174 ± 0.0442	0.3930 ± 0.0510	0.6520 ± 0.0397
0.6	FunClu	0.4841 ± 0.0506	0.3847 ± 0.0524	0.6600 ± 0.0399
0.6	FunClu_ Smoothing	0.5115 ± 0.0421	0.3879 ± 0.0451	0.6456 ± 0.0357
0.6	GMM	0.3586 ± 0.0355	0.2439 ± 0.0371	0.5144 ± 0.0380
0.6	KMeans	0.3639 ± 0.0331	0.2347 ± 0.0335	0.5144 ± 0.0367
0.8	FunClu	0.3916 ± 0.0365	0.2776 ± 0.0334	0.5578 ± 0.0370
0.8	FunClu_ Smoothing	0.4152 ± 0.0353	0.2762 ± 0.0328	0.5458 ± 0.0329
0.8	GMM	0.2996 ± 0.0262	0.1723 ± 0.0262	0.4388 ± 0.0393
0.8	KMeans	0.3005 ± 0.0279	0.1727 ± 0.0260	0.4396 ± 0.0387
1.0	FunClu	0.3323 ± 0.0247	0.2142 ± 0.0235	0.4929 ± 0.0285
1.0	FunClu_ Smoothing	0.3407 ± 0.0228	0.2140 ± 0.0217	0.4861 ± 0.0275
1.0	GMM	0.2693 ± 0.0199	0.1412 ± 0.0193	0.3976 ± 0.0263
1.0	KMeans	0.2674 ± 0.0185	0.1490 ± 0.0154	0.3965 ± 0.0257

Table 2. Simulation results ($\nu $ = 0.6).

$ {\nu }^{2} $	Method	CA	ARI	NMI
0.2	FunClu	0.8818 ± 0.0629	0.8530 ± 0.0765	0.9565 ± 0.0225
0.2	FunClu_ Smoothing	0.9067 ± 0.0465	0.8918 ± 0.0488	0.9588 ± 0.0215
0.2	GMM	0.8728 ± 0.0440	0.8537 ± 0.0540	0.9448 ± 0.0215
0.2	KMeans	0.9060 ± 0.0460	0.8864 ± 0.0521	0.9548 ± 0.0217
0.4	FunClu	0.8323 ± 0.0487	0.7872 ± 0.0551	0.9066 ± 0.0242
0.4	FunClu_ Smoothing	0.8215 ± 0.0435	0.7752 ± 0.0534	0.8971 ± 0.0255
0.4	GMM	0.7431 ± 0.0597	0.6964 ± 0.0623	0.8633 ± 0.0304
0.4	KMeans	0.7845 ± 0.0573	0.7261 ± 0.0632	0.8716 ± 0.0287
0.6	FunClu	0.7183 ± 0.0392	0.6315 ± 0.0375	0.8791 ± 0.0263
0.6	FunClu_ Smoothing	0.7156 ± 0.0303	0.6255 ± 0.0415	0.8123 ± 0.0289
0.6	GMM	0.6310 ± 0.0622	0.5392 ± 0.0647	0.7689 ± 0.0400
0.6	KMeans	0.6664 ± 0.0549	0.5621 ± 0.0653	0.7748 ± 0.0380
0.8	FunClu	0.6060 ± 0.0534	0.4860 ± 0.0605	0.7237 ± 0.0371
0.8	FunClu_ Smoothing	0.6020 ± 0.0405	0.4771 ± 0.0523	0.7196 ± 0.0373
0.8	GMM	0.5095 ± 0.0364	0.3993 ± 0.0517	0.6732 ± 0.0377
0.8	KMeans	0.5311 ± 0.0421	0.4072 ± 0.0510	0.6729 ± 0.0429
1.0	FunClu	0.5380 ± 0.0382	0.4240 ± 0.0376	0.6891 ± 0.0288
1.0	FunClu_ Smoothing	0.5360 ± 0.0393	0.4215 ± 0.0371	0.6916 ± 0.0373
1.0	GMM	0.4673 ± 0.0296	0.3490 ± 0.0293	0.6133 ± 0.0296
1.0	KMeans	0.4870 ± 0.0356	0.3557 ± 0.0327	0.6348 ± 0.0292

First, FunClu generally outperforms the baseline methods in terms of CA, ARI, and NMI. This improvement is especially pronounced at low to moderate noise levels ($ \gamma \leq 0.6 $), highlighting the effectiveness of modeling the mean structure via the allometric scaling law in capturing the underlying log-linear relationships.

Second, the smoothing strategy (FunClu_Smoothing) adds stability. Although FunClu alone often achieves the highest numerical scores, with several parameter settings, it suffers from empty clusters, which undermines the interpretability of the clustering solution. FunClu_Smoothing, by contrast, prevents class loss by regularizing the mixing proportions through a softmax transformation, ensuring that all clusters are retained. Although the smoothing variant sometimes yields slightly lower CA, ARI, or NMI values, its solutions remain more reliable and biologically interpretable.

Third, as expected, performance decreases as the noise variance increases. All methods exhibit declining accuracy and agreement metrics when $ \gamma $ grows from 0.2 to 1.0. However, the decline is less severe for FunClu and FunClu_Smoothing compared with GMM and KMeans, confirming the robustness of our proposed functional clustering framework.

Finally, comparing across correlation strengths ($ v=0.9 $ vs $ \mathrm{v}=0.6 $), we observe that stronger correlation structures generally improve performance across all methods, with a higher CA and ARI at $ \mathrm{v}=0.6 $. Nevertheless, the relative advantage of FunClu_Smoothing remains consistent, underscoring its robustness to different covariance conditions.

Summary

In summary, the simulation results consistently demonstrate the advantages of the proposed FunClu framework. By jointly modeling the cluster-specific mean structure through an allometric scaling law and incorporating SAD1 covariance, FunClu achieves more accurate clustering performance than standard approaches in most parameter settings. However, in several high-noise configurations, FunClu occasionally suffers from empty clusters, which, although accompanied by higher numerical indices, may reduce the interpretability of the clustering solution. The smoothing variant effectively mitigates this issue by regularizing the mixing proportions through a softmax transformation, thereby preventing class loss and ensuring that all identified clusters remain biologically interpretable. Collectively, these findings suggest that FunClu, particularly with smoothing, provides a principled and flexible tool for functional clustering with high-dimensional dependent data, laying a solid foundation for subsequent applications to real biological datasets.

The simulation study is intended to evaluate the proposed method under its target modeling assumptions. Testing robustness to severe deviations from the allometric mean structure is beyond the scope of the present work and will be investigated in future studies.

[1]	Kim BR, Zhang L, Berg A, Fan J, Wu R. 2008. A computational approach to the functional clustering of periodic gene-expression profiles. Genetics 180:821−834 doi: 10.1534/genetics.108.093690 CrossRef Google Scholar
[2]	Wu S, Liu X, Dong A, Gragnoli C, Griffin C, et al. 2023. The metabolomic physics of complex diseases. Proceedings of the National Academy of Sciences of the United States of America 120:e2308496120 doi: 10.1073/pnas.2308496120 CrossRef Google Scholar
[3]	Xu R, Wunsch D. 2005. Survey of clustering algorithms. IEEE Transactions on Neural Networks 16:645−78 doi: 10.1109/TNN.2005.845141 CrossRef Google Scholar
[4]	Rokach L. 2010. A survey of clustering algorithms. In Data mining and knowledge discovery handbook. Boston, MA: Springer. pp. 269−298 doi: 10.1007/978-0-387-09823-4
[5]	Xu D, Tian Y. 2015. A comprehensive survey of clustering algorithms. Annals of Data Science 2:165−193 doi: 10.1007/s40745-015-0040-1 CrossRef Google Scholar
[6]	Wang Q, Dong A, Jiang L, Griffin C, Wu R. 2022. A single-cell omics network model of cell crosstalk during the formation of primordial follicles. Cells 11:332 doi: 10.3390/cells11030332 CrossRef Google Scholar
[7]	Dong A, Meng Y, Yau SST, Yau ST, Wu R. 2026. A Statistical Mechanics model to decode tissue crosstalk during graft formation. Advanced Science e23373 doi: 10.1002/advs.202523373 CrossRef Google Scholar
[8]	Misra BB, Langefeld C, Olivier M, Cox LA. 2019. Integrated omics: tools, advances and future approaches. Journal of Molecular Endocrinology 62:R21−R45 doi: 10.1530/JME-18-0055 CrossRef Google Scholar
[9]	Nardini C, Dent J, Tieri P. 2015. Multi-omic data integration. Frontiers in Cell and Developmental Biology 3:46 doi: 10.3389/978-2-88919-648-7 CrossRef Google Scholar
[10]	Wang D, Gu J. 2016. Integrative clustering methods of multi-omics data for molecule-based cancer classifications. Quantitative Biology 4:58−67 doi: 10.1007/s40484-016-0063-4 CrossRef Google Scholar
[11]	Tini G, Marchetti L, Priami C, Scott-Boyer MP. 2019. Multi-omics integration—a comparison of unsupervised clustering methodologies. Briefings in bioinformatics 20:1269−79 doi: 10.1093/bib/bbx167 CrossRef Google Scholar
[12]	Chauvel C, Novoloaca A, Veyre P, Reynier F, Becker J. 2020. Evaluation of integrative clustering methods for the analysis of multi-omics data. Briefings in bioinformatics 21:541−52 doi: 10.1093/bib/bbz015 CrossRef Google Scholar
[13]	Webb CO, Ackerly DD, McPeek MA, Donoghue MJ. 2002. Phylogenies and community ecology. Annual Review of Ecology and Systematics 33:475−505 doi: 10.1146/annurev.ecolsys.33.010802.150448 CrossRef Google Scholar
[14]	Johnson MTJ, Stinchcombe JR. 2007. An emerging synthesis between community ecology and evolutionary biology. Trends in Ecology & Evolution 22:250−257 doi: 10.1016/j.tree.2007.01.014 CrossRef Google Scholar
[15]	Cavender‐Bares J, Kozak KH, Fine PVA, Kembel SW. 2009. The merging of community ecology and phylogenetic biology. Ecology Letters 12:693−715 doi: 10.1111/j.1461-0248.2009.01314.x CrossRef Google Scholar
[16]	Raes J, Bork P. 2008. Molecular eco-systems biology: towards an understanding of community function. Nature Reviews Microbiology 6:693−99 doi: 10.1038/nrmicro1935 CrossRef Google Scholar
[17]	Shea K, Chesson P. 2002. Community ecology theory as a framework for biological invasions. Trends in Ecology & Evolution 17:170−176 doi: 10.1016/s0169-5347(02)02495-3 CrossRef Google Scholar
[18]	Chesson P. 2000. Mechanisms of maintenance of species diversity. Annual review of Ecology and Systematics 31:343−366 doi: 10.1146/annurev.ecolsys.31.1.343 CrossRef Google Scholar
[19]	Tilman D. 1982. Resource competition and community structure. Princeton, NJ, USA: Princeton University Press.
[20]	Peterson AT, Soberón J, Pearson RG, Anderson RP, Martínez-Meyer E, et al. 2011. Ecological niches and geographic distributions (MPB-49). Online Edition. Princeton University Press. doi: 10.1515/9781400840670
[21]	Finlay KW, Wilkinson GN. 1963. The analysis of adaptation in a plant-breeding programme. Australian Journal of Agricultural Research 14:742−754 doi: 10.1071/ar9630742 CrossRef Google Scholar
[22]	Lobell DB, Roberts MJ, Schlenker W, Braun N, Little BB, et al. 2014. Greater sensitivity to drought accompanies maize yield increase in the US Midwest. Science 344:516−519 doi: 10.1126/science.1251423 CrossRef Google Scholar
[23]	West GB, Brown JH, Enquist BJ. 1997. A general model for the origin of allometric scaling laws in biology. Science 276:122−126 doi: 10.1126/science.276.5309.122 CrossRef Google Scholar
[24]	West GB, Brown JH, Enquist BJ. 1999. The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science 284:1677−1679 doi: 10.1126/science.284.5420.1677 CrossRef Google Scholar
[25]	Shingleton A. 2010. Allometry: the study of biological. Nature Education Knowledge 3(10):2 Google Scholar
[26]	Huxley JS, Teissier G. 1936. Terminology of relative growth. Nature 137:780−781 doi: 10.1038/137780b0 CrossRef Google Scholar
[27]	Feng L, Yang D, Wu S, Xue C, Sang M, et al. 2025. Network modeling and topology of aging. Physics Reports 1101:1−65 doi: 10.1016/j.physrep.2024.10.006 CrossRef Google Scholar
[28]	Gong H, Wang Y, Wang H, Sun X, Zhang S, et al. 2025. Towards a better understanding of structural-functional relationships in the forest soil microbiota. Reply to comments on "Topological change of soil microbiota networks for forest resilience under global warming". Physics of Life Reviews 54:152−154 doi: 10.1016/j.plrev.2025.07.017 CrossRef Google Scholar
[29]	Zhao W, Chen YQ, Casella G, Cheverud JM, Wu R. 2005. A non-stationary model for functional mapping of complex traits. Bioinformatics 21:2469−77 doi: 10.1093/bioinformatics/bti382 CrossRef Google Scholar
[30]	Zhao W, Hou W, Littell RC, Wu R. 2005. Structured antedependence models for functional mapping of multiple longitudinal traits. Statistical Applications in Genetics and Molecular Biology 4:1136 doi: 10.2202/1544-6115.1136 CrossRef Google Scholar
[31]	Zhao W, Wu R. 2008. Wavelet-based nonparametric functional mapping of longitudinal curves. Journal of the American Statistical Association 103:714−725 doi: 10.1198/016214508000000373 CrossRef Google Scholar
[32]	Kingma DP. 2014. Adam: a method for stochastic optimization. arXiv 1412.6980 doi: 10.48550/arXiv.1412.6980 CrossRef Google Scholar
[33]	Schwarz G. 1978. Estimating the dimension of a model. The Annals of Statistics 6(2):461−464 Google Scholar
[34]	Meng J, Li Z, Wang H, Miao R, Liu X, et al. 2025. Haplotype‐resolved genome assembly provides new insights into the genomic origin of purple colour in Prunus mume. Plant Biotechnology Journal 23:1416−1436 doi: 10.1111/pbi.14595 CrossRef Google Scholar
[35]	Dong A, Wu S, Che J, Wang Y, Wu R. 2023. idopNetwork: a network tool to dissect spatial community ecology. Methods in Ecology and Evolution 14:2272−83 doi: 10.1111/2041-210X.14172 CrossRef Google Scholar
[36]	Mu H, Chen J, Huang W, Huang G, Deng M, et al. 2024. OmicShare tools: a zero-code interactive online platform for biological data analysis and visualization. iMeta 3:e228 doi: 10.1002/imt2.228 CrossRef Google Scholar
[37]	Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. 2000. Gene ontology: tool for the unification of biology. Nature genetics 25:25−29 doi: 10.1038/75556 CrossRef Google Scholar
[38]	Kanehisa M, Goto S. 2000. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research 28:27−30 Google Scholar

{{lists.name}}

An improved method of Wu's functional clustering

Abstract

Rights and permissions

References

About this article

Cite this article

Article Metrics

Access History

Other Articles By Authors

An improved method of Wu's functional clustering