A systematic evaluation of network subsampling methods for T-cell receptor repertoire network

Hailiang Wu; Tran Le; Hai Yang; Phi Le; David Oh; Li Zhang; Hailiang Wu; Tran Le; Hai Yang; Phi Le; David Oh; Li Zhang

doi:10.48130/stati-0026-0005

The T-cell receptor (TCR) is vital in adaptive immune responses. To investigate the structure and function of the TCR repertoire, researchers constructed TCR networks based on TCR sequences. However, the complexity and scale of TCR networks frequently exceed computational capabilities. We aimed to identify effective subsampling approaches to reduce computational burden while preserving essential network properties. We systematically evaluated network subsampling algorithms, including Random Node Sampling (RNS), Page Rank (PR), Metropolis-Hastings (MH), and Simple Random Walk Sampling with Fly Back (SRWFB) and three induced methods: Induced PR (InPR), Induced MH (InMH), and Induced SRWFB (InSRWFB) across varying subsampling percentages (5% to 30%) under direct and combined strategies. We applied these approaches to TCR sequence data from 11 ibrutinib-treated pancreatic cancer patients. For performance evaluation, portrait divergence (PDiv) and several network properties were utilized. We also assessed how well subnetworks preserved the direction and magnitude of changes by Cohen's d effect size. InSRWFB consistently outperformed, yielding the lowest PDiv and best preserving network properties. It also maintained stable effect sizes, becoming more pronounced in higher-abundance repertoires. As the first systematic evaluation of subsampling techniques in the context of immune repertoires, our work provides important insights and a reference for future immunological network analysis.

HTML

Materials and methods

Pancreatic adenocarcinoma cohort

This study included 11 metastatic pancreatic adenocarcinoma patients enrolled in Phase 1b clinical trial NCT02562898, who received ibrutinib at the University of California, San Francisco, and Oregon Health and Science University in Portland, USA. Baseline blood samples were collected prior to treatment, and post-timepoint samples were obtained after at least one administration of the drug. Cryopreserved peripheral blood mononuclear cells (PBMCs) were isolated from blood samples, and genomic DNA was extracted for sequencing of the TCR beta CDR3 region using the immunoSEQ platform (Adaptive Biotechnologies). Detailed clinical and demographic information for the cohort is available in the original publication^[10].

TCR repertoires network

TCR networks were constructed for each repertoire based on amino acid sequence similarity using the Network Analysis of Immune Repertoire (NAIR) package^[3]. We set the Hamming distance less than or equal to 1 to allow at most one residue at the same location between two unique TCR clones. Two types of TCR networks were analyzed in this study (Fig. 1b): (1) clone-based networks, where each unique TCR amino acid sequence was represented as a single node regardless of its abundance; and (2) abundance-based networks, where nodes were replicated according to clonal frequency, allowing high-abundance TCRs to appear multiple times. The distribution of TCR clone abundance is typically left-skewed, with the top 1% of expanded clones accounting for up to 20% of total reads, reflecting immune response specificity^[9,13]. We observed a similar abundance distribution in our samples, with most clones showing fewer than 500 reads. Based on these patterns, we classified the samples into high-, medium-, and low-abundance groups according to the proportion of highly expanded clones (HECs, abundance > 500 reads) > 2%, 1%–2%, and < 1%, respectively (Table 1). In our datasets, the abundance distribution is extremely heavy-tailed, and a handful of extreme counts can dominate weight magnitudes and lead to prohibitive memory/runtime during network construction; therefore, nodes with an abundance greater than 500 were truncated to 500 before subsampling. This threshold was specifically selected to prevent extremely expanded clones from disproportionately dominating the network topology as super-hubs^[14]. By mitigating the influence of these outliers, we ensure that subsampling algorithms can effectively capture the underlying structural diversity and connectivity patterns of the broader repertoire while maintaining computational stability during systematic evaluation. Such strategies are consistent with established practices in immune repertoire analysis for controlling clonal dominance and stabilizing diversity- and network-based metrics^[15,16].

Table 1. Distribution of TCR node abundance.

Patient	Proportion of nodes			Abundance level
Patient	≥ 100	≥ 200	≥ 500	Abundance level
P1	6.50%	2.90%	1.40%	Medium
P2	17.70%	10.50%	6.20%	High
P3	3.40%	1.10%	0.50%	Low
P4	8.20%	4.10%	1.80%	Medium
P5	0.90%	0.30%	0.00%	Low
P6	8.60%	6.10%	3.00%	High*
P7	4.70%	2.80%	1.50%	Medium
P8	5.60%	3.30%	1.60%	Medium*
P9	2.80%	0.90%	0.50%	Low
P10	6.70%	5.10%	2.40%	High
P11	0.20%	0.10%	0.00%	Low *
* Indicates representative patients selected from each group for evaluation of sampling methods.

Review of network sampling algorithms

The following will review five commonly used network sampling algorithms (Table 2). RandomNode Sampling (RNS) is a node-based method that selects nodes uniformly at random, independent of edge connectivity^[17]. It is especially effective for sampling isolated or sparsely connected nodes. Metropolis-Hastings (MH) is an edge-informed sampler based on Markov Chain Monte Carlo (MCMC), where acceptance of a candidate node depends on the properties of the current node, including its degree^[18]. Both RNS and MH have no tunable hyperparameters in our implementation. SnowBall (SB) is a hybrid approach that begins with a randomly selected set of seed nodes of size k and iteratively expands by adding neighboring nodes through connected edges^[19]. The parameter k represents the maximum number of neighbors that can be added in each sampling cycle. In the context of TCR networks with many small, disjoint clusters, k balances local expansion within a similarity neighborhood against the need to reach the target sample size without over-sampling a few large components. Smaller values of k favor highly local neighborhood preservation but may require more iterations to reach the target sample size, whereas larger values of k accelerate sampling at the risk of over-representing highly connected components. Page Rank (PR)^[20] is an edge-based technique that follows an iterative process, where nodes are subsampled based on their PageRank scores. The PageRank score measures the relative importance of a node by recursively considering both the number and significance of incoming edges, while incorporating a damping factor^[20]. The damping factor (α) represents the probability of the nodes being randomly selected instead of walking along the edge continuously. For disjoint TCR landscapes, α is critical for balancing edge-based importance with random jumping to ensure navigation across isolated components. Lower values α emphasize connectivity-driven importance within components, while higher values increase random jumps, improving coverage across disjoint clusters but reducing sensitivity to local topology. Simple Random Walk Sampling with Fly Back (SRWFB) starts with a random node and performs a random walk along its edges, choosing adjacent nodes sequentially with a predefined Fly Back probability (p)^[20]. The Fly Back probability (p) is the probability of returning to the starting point to explore other neighbors. Higher fly-back probabilities promote broader exploration across components, whereas lower values favor deeper traversal within individual clusters. This prevents the walker from staying within a single component, facilitating exploration across the many disjoint clusters in TCR networks. An iteration time parameter determines the total number of steps in the walk, ensuring that both densely and sparsely connected regions of the graph are covered.

Table 2. Summary of sampling algorithms.

Algorithm type	Description	Key parameters
Random Node Sampling (RNS)	Selects nodes uniformly at random from the network.	−
SnowBall (SB)	Starts from a set of seed nodes and expands by connecting edges.	k − Max number of neighbors added per cycle
Page Rank (PR)	Nodes are sampled based on their PageRank score in an iterative process.	α (damping factor) − 0.85
Metropolis-Hastings (MH)	Relies on edge connections and follows a Markov Chain Monte Carlo (MCMC) process.	Acceptance depends on node's degree
Simple Random Walk with Fly Back (SRWFB)	Starts with a random node and performs a random walk with a predefined probability of returning to the starting node.	p (fly-back probability) − 0.15, iteration time − 100
Induced-Page Rank (InPR)	Retains all original edges between selected nodes.	Inherits from PR
Induced-Metropolis-Hastings (InMH)	Retains all original edges between selected nodes.	Inherits from MH
Induced-Simple Random Walk with Fly Back (InSRWFB)	Retains all original edges between selected nodes.	Inherits from SRWFB

In addition, we also evaluated induced alternatives of some algorithms, such as Induced Metropolis-Hastings (InMH), Induced Page Rank (InPR), and Induced Simple Random Walk Sampling with Fly Back (InSRWFB). After obtaining a sampled node set, we construct the final network as the node-induced subgraph by recovering all edges from the original graph whose endpoints are both sampled^[17]. Those induced algorithms retain all original edges among the selected nodes from the full network, ensuring that local connectivity and topological measures are evaluated on a structurally faithful subgraph rather than being distorted by edge loss during sampling (Fig. 1c)^[21].

Implementation strategies

Two subsampling strategies were employed in this study: the direct method and the combined method (Fig. 1d). In the direct method, each subsampling algorithm described above was applied directly to the entire TCR network without distinguishing between isolated and connected nodes. However, clone-based networks and some abundance-based networks exhibited high sparsity, with over 50% of isolated nodes (Supplementary Fig. S1). In such graphs, directly applying topology-aware samplers to the full network may under-sample isolated nodes (which have no incident edges), thereby altering the proportion of isolated nodes and artificially increasing apparent connectivity in the sampled subgraph. To address this limitation, we designed a combined subsampling strategy motivated by the observation that isolated and connected nodes represent fundamentally different structural entities in sparse TCR networks. Because topology-aware samplers rely on edge traversal, they are inherently biased against isolated nodes, motivating a hybrid design that treats these two node types separately when sparsity is substantial and downstream analysis is sensitive to component-level structure. In the combined method, nodes are first partitioned into isolated (N_iso) and connected (N_con) sets, and each set is subsampled at a fixed rate r (Fig. 1d). This proportional scaling preserves network sparsity by construction, and prevents topology-aware algorithms from artificially inflating connectivity through systematic undersampling of isolated nodes. The combined strategy is therefore most advantageous for highly sparse TCR networks with a large fraction of isolated clonotypes, particularly when downstream analyses are sensitive to sparsity, component structure, or the relative prevalence of expanded vs unexpanded clonotypes.

By defining the sample sizes as $ {n}_{\text{iso}}={N}_{\text{iso}}\times r $ and $ {n}_{\text{con}}={N}_{\text{con}}\times r $, the subnetwork's isolation rate mathematically matches the original:

$ \dfrac{{{n}}_{\text{iso}}}{{{n}}_{\text{iso}}+{{n}}_{\text{con}}}=\dfrac{{{N}}_{\text{iso}}\times {r}}{\left({{N}}_{\text{iso}}+{{N}}_{\text{con}}\right)\times {r}}=\dfrac{{{N}}_{\text{iso}}}{{N}} $

This proportional scaling preserves network sparsity by construction and prevents topology-aware algorithms from artificially inflating connectivity through systematic under-sampling of isolated nodes. The combined strategy is therefore most advantageous for highly sparse TCR networks with a large fraction of isolated clonotypes, particularly when downstream analyses are sensitive to sparsity, component structure, or the relative prevalence of expanded vs unexpanded clonotypes.

Aligning each algorithm with the part of the graph where its preference was beneficial, together with separating isolates from the connected part, provided the methodological justification for the combined strategy and explained the lower distortion observed for PDiv and other component-sensitive metrics. A range of subsampling percentages (i.e., 5%, 10%, 15%, 20%, 25%, and 30%) was used to evaluate the performance under different combinations of algorithms and strategies. For each setting, twenty independent replicates (with distinct random seeds) were generated. Results were summarized as the mean across replicates, with standard error (SE) indicated.

Performance evaluation metrics

To evaluate the effectiveness of different sampling methods and strategies in preserving the original characteristics of TCR networks, we used a topological metric, PDiv. PDiv first calculates each network's 'portrait', a matrix representation that captures the network's structure, and then computes the Jensen-Shannon divergence between these portrait matrices to quantify the structural similarity^[22]. Compared to traditional comparison metrics, such as Graph Edit Distance (GED), which is computationally NP-hard, or DeltaCon, which requires strict node correspondence, PDiv is a permutation-invariant and parameter-free measure that is ideal for subsampling evaluation where node identities are not preserved^[22,23]. Furthermore, unlike individual network properties that capture local or fragmented features, PDiv provides a holistic, multi-scale comparison of the system-level topology^[23]. Taken together, these properties make PDiv particularly well-suited for evaluating network subsampling in TCR repertoire analyses, where networks are large, node identities are not preserved, and preservation of global topological structure is essential. PDiv values range from 0 to 1, with values close to 0 indicating higher structural similarity and minimal distortion.

Network properties have been widely used to quantify the networks at the cluster and node levels, which have been shown to associate with clinical outcomes and to provide causal inference in simulation studies^[24,25]. We employed four cluster-level network properties, i.e., degree, assortativity, transitivity, and density (Table 3), to evaluate the robustness of the subsampling methods on a real-world dataset. Degree and transitivity capture local clustering and centrality^[26,27], while assortativity and transitivity are global metrics. Density measures the proportion of possible edges present and indicates the overall level of network connectivity^[14]. The maximum degree is more sensitive than mean or median values for capturing rare but critical topological features relevant to immunological interpretation in left-skewed TCR network structures^[25]. In addition, we quantified the assortativity coefficient to capture how similarly connected TCR clones tend to link in repertoire networks using standard definitions^[28]. This metric is informative in repertoires as a positive value indicates that similarly connected clones preferentially link (assortative mixing), whereas a negative value indicates disassortative mixing (high-degree clones link to low-degree ones)^[5]. Besides direct comparisons of network metrics, we further examined the preservation of dynamic network behavior by calculating Cohen's d effect sizes to quantify changes in these metrics between the baseline and post-treatment periods^[29]. Shifts in assortativity reflect centralization of sequence similarity under antigen-driven selection and clonal expansion, and temporally specific changes provide a concise readout of repertoire centralization vs dispersion.

Table 3. Evaluation metrics.

Metric	Description
Network Portrait Divergence (PDiv)	Assesses similarity of 2 networks by analyzing 'Network Portrait'. Values range from 0 to 1.
Network properties
Max degree	The maximum number of edges connected to a single node.
Density	The ratio of the number of actual edges to the possible number of edges.
Assortativity	Measures how strongly nodes with similar properties preferentially connect.
Transitivity	Measures the tendency of similar nodes to connect to each other.

Discussion

We conducted a systematic benchmarking of established subsampling algorithms and strategies to maintain network structure similarity and support downstream analyses. Our workflow is, in principle, applicable to Adaptive Immune Receptor Repertoire (AIRR) networks beyond TCR, including BCR, but receptor-specific preprocessing (e.g., somatic hypermutation and chain-aware similarity measures) and independent validation are required. Unlike denser biological networks, such as PPI networks or neural networks^[23], TCR networks, especially clone-based ones, are extremely sparse. The high sparsity of the TCR network is evidenced by a proportion of isolated nodes (degree = 0) exceeding 50% (Supplementary Fig. S1a) and edge density far below 0.003 (Supplementary Fig. S12). In such extremely sparse TCR networks, structural information is concentrated in a few large, densely connected clusters. Therefore, edge-based sampling approaches that explore and retain these informative regions are better suited for preserving global topology.

Among all the methods tested, InSRWFB consistently yielded the lowest PDiv for both clone-based and abundance-based TCR networks. Although this performance is accompanied by a linear increase in memory demand in abundance-based settings (Fig. 4d), it suggests that InSRWFB's traversal-based strategy effectively prioritizes the preservation of key topological features over mere memory minimization. This makes it particularly suitable for scalable TCR network analysis where structural integrity is paramount. By traversing along edges and incorporating the fly-back mechanism, InSRWFB ensures a thorough exploration of local clusters, preserving the key topological features more effectively than other methods.

InSRWFB performs better in networks with higher abundance (Fig. 2). This may be explained by evidence that antigen-driven clonal expansion often involves multiple TCRs with similar sequences^[6,7,30], resulting in higher edge density and more complex clustering. As a result, InSRWFB's traversal-based strategy is better able to capture these structures. Interestingly, the variance of PDiv increased with abundance, suggesting greater structural heterogeneity associated with immune activation.

Given the ability of InSRWFB to capture local structural features, it is expected that it shows superior performance in retaining local (max degree and transitivity) and global (assortativity, density) (Supplementary Figs S3−S6, S7−S10) metrics. This observation suggests that preserving local features contributes to maintaining global structural features indirectly, particularly in denser connected networks. In the context of the TCR repertoire, network density can serve as an indicator of immune activation, where low density suggests limited clonal expansion, implying that the T cell population remains inactive or only weakly stimulated. Therefore, the ability of InSRWFB to effectively preserve network density makes it a promising approach for accurate interpretation of immune response. RNS appeared to perform comparably to InSRWFB, which might be misleading, as uniform node selection in sparse networks can inadvertently mimic structural patterns without capturing biologically relevant features.

Beyond the absolute values of network properties, the direction and magnitude of the change, quantified by Cohen's d effect size, are crucial for interpreting immune dynamics. Cohen's d effect size provides a quantitative summary of whether and to what extent immune activity has changed between time points or conditions, and has been widely used in network studies, including neuroscience^[5,31]. InSRWFB consistently preserves the direction and the scale of change across subsampling percentages, aligning closely with the negative value of original networks, ensuring appropriate clinical interpretation. While higher subsampling percentages naturally retain more information, InSRWFB uniquely maintains meaningful effect size estimates even at low subsampling levels, making it a practical choice for large-scale or incomplete datasets. However, we did not perform full downstream validation (e.g., treatment-response prediction) to confirm that subsampled networks reproduce identical clinical conclusions, which requires labeled cohorts and is planned for future work.

Although the combined strategy generally underperformed the direct strategy across most subsampling algorithms, including InSRWFB, SB under the combined strategy demonstrated a compensatory effect in PDiv, assortativity, and effect size. Because the combined strategy separates isolated nodes from connected nodes, SB performs particularly well in capturing the characteristics of connected structures, improving its overall effectiveness in this context. While our study focuses on sparse TCR networks, these findings suggest that SB may demonstrate competitive or superior performance in denser or fully connected biological networks, which requires further investigation in dense networks.

While time and memory consumption increase proportionally with subsampling percentage (Fig. 4), the improvement in portrait divergence for InSRWFB-sampled subnetworks under direct strategy exhibits a marginal effect pattern, beginning to plateau around 15% (Fig. 2a−f). Similarly, the differences in most network properties compared to the original network start to stabilize near this proportion (Supplementary Figs S3−S10). Notably, the effect sizes of key metrics tend to converge toward the original network values at approximately 15% (Fig. 3a−d, g, h), suggesting a marginal gain beyond this point.

Our decision to cap clone abundances > 500 prior to constructing abundance-weighted networks was a computational/numerical-stability safeguard to stabilize edge weights, not an attempt to downplay clonal expansion. The cap does not alter node identity or connectivity; it only bounds the influence of extreme counts on weight scaling. Analyses that rely on the magnitude of massive expansions should be performed without a cap or using a monotone transform (e.g., log1p). Our primary claims focus on the comparative behavior of subsampling methods and topology-preserving metrics (density, transitivity, assortativity, PDiv), which depend mainly on connectivity patterns rather than exact weight magnitudes and are qualitatively stable under bounded weight scaling. We document the capping step in the released code and allow users to deactivate it or set a higher threshold as resources permit, enabling re-analysis without the cap when emphasis is on highly expanded clones.

Several limitations of this study should be acknowledged. First, the analysis is based on a relatively small sample size (n = 11), limiting the generalizability of the findings across disease settings, treatment contexts, and cohort characteristics. Although each sample contains a high-complexity TCR repertoire, validation in larger and more diverse clinical cohorts will be necessary to assess the robustness of these conclusions. Second, the number of subsampling replicates was fixed at 20 to balance computational feasibility with empirical stability of network metrics. While this choice was sufficient for the scenarios examined here, different network sizes or evaluation metrics may require alternative replication strategies. Finally, this study evaluates subsampling methods primarily in terms of their ability to preserve network structural fidelity. Although structural preservation is a necessary condition for valid downstream biological and clinical interpretation, it does not guarantee inferential equivalence. The impact of subsampling-induced structural differences on specific downstream tasks, such as differential repertoire analysis or clinical outcome association, was not assessed and remains an important direction for future work.

Supplementary Table S1 The uumber of nodes in original TCR networks.
Supplementary Fig. S1 Proportion of isolated nodes of original TCR network.
Supplementary Fig. S2 Sensitivity analysis: impact of algorithm parameters on portrait divergence across subsampling percentages.
Supplementary Fig. S3 Difference in maximum degree between subsampled and original clone-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S4 Difference in Assortativity between subsampled and original clone-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S5 Difference in transitivity between subsampled and original clone-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S6 Relative difference in density between subsampled and original clone-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S7 Difference in maximum degree between subsampled and original abundance-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S8 Difference in assortativity between subsampled and original abundance-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S9 Difference in transitivity between subsampled and original abundance-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S10 Relative difference in density between subsampled and original abundance-based networks across sampling algorithms and abundance levels.
Supplementary Fig. S11 Cohen's d effect size of original network and subnetworks using Snow Ball (SB) at different subsampling percentages (5% to 30%).
Supplementary Fig. S12 Edge density across samples in clone-based and abundance-based TCR networks.
Supplementary Fig. S13 Summary of best subsampling in different scenarios.

[1]	Lu H, Zhou Q, He J, Jiang Z, Peng C, et al. 2020. Recent advances in the development of protein–protein interactions modulators: mechanisms and clinical trials. Signal Transduction and Targeted Therapy 5(1):213 doi: 10.1038/s41392-020-00315-3 CrossRef Google Scholar
[2]	Hopfield JJ. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America 79(8):2554−2558 doi: 10.1201/9780429500459-2 CrossRef Google Scholar
[3]	Yang H, Cham J, Neal BP, Fan Z, He T, et al. 2023. NAIR: Network Analysis of Immune Repertoire. Frontiers in Immunology 14:1181825 doi: 10.3389/fimmu.2023.1181825 CrossRef Google Scholar
[4]	Yang, A and Poholek, AC. 2024. Systems immunology approaches to study T cells in health and disease. npj Systems Biology and Applications 10:117 doi: 10.1038/s41540-024-00446-1 CrossRef Google Scholar
[5]	Becker M, Nassar H, Espinosa C, Stelzer IA, Feyaerts D, et al. 2023. Large-scale correlation network construction for unraveling the coordination of complex biological systems. Nature Computational Science 3(4):346−359 doi: 10.1038/s43588-023-00429-y CrossRef Google Scholar
[6]	Dash P, Fiore-Gartland AJ, Hertz T, Wang GC, Sharma S, et al. 2017. Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547:89−93 doi: 10.1038/nature22383 CrossRef Google Scholar
[7]	Glanville, J, Huang, H, Nau, A, Hatton, O, Wagar, LE, et al. 2017. Identifying specificity groups in the T cell receptor repertoire. Nature 547:94−98 doi: 10.1038/nature22976 CrossRef Google Scholar
[8]	Wang X, Zhang B, Yang Y, Zhu J, Cheng S, et al. 2019. Characterization of distinct T cell receptor repertoires in tumor and distant non-tumor tissues from lung cancer patients. Genomics, proteomics & bioinformatics 17(3):287−296 doi: 10.1016/j.gpb.2018.10.005 CrossRef Google Scholar
[9]	Gong Q, Wang C, Zhang W, Iqbal J, Hu Y, et al. 2017. Assessment of T-cell receptor repertoire and clonal expansion in peripheral T-cell lymphoma using RNA-seq data. Scientific Reports 7:11301 doi: 10.1038/s41598-017-11310-0 CrossRef Google Scholar
[10]	Sinha M, Betts C, Zhang L, Griffith MJ, Solman I, et al. 2023. Modulation of myeloid and T cells in vivo by Bruton's tyrosine kinase inhibitor ibrutinib in patients with metastatic pancreatic ductal adenocarcinoma. Journal for Immunotherapy of Cancer 11(1):e005425 doi: 10.1136/jitc-2022-005425 CrossRef Google Scholar
[11]	Kwek SS, Yang H, Li T, Ilano A, Chow ED, et al. 2025. Identification and regulation of circulating tumor-TCR-matched cytotoxic CD4⁺ lymphocytes by KLRG1 in bladder cancer. JCI Insight 10(11):e177373 doi: 10.1172/jci.insight.177373 CrossRef Google Scholar
[12]	Naidus E, Bouquet J, Oh DY, Looney TJ, Yang H, et al. 2021. Early changes in the circulating T cells are associated with clinical outcomes after PD-L1 blockade by durvalumab in advanced NSCLC patients. Cancer immunology, immunotherapy  70(7):2095−102 doi: 10.1007/s00262-020-02833-z CrossRef Google Scholar
[13]	Joshi K, de Massy MR, Ismail M, Reading JL, Uddin I, et al. 2019. Spatial heterogeneity of the T cell receptor repertoire reflects the mutational landscape in lung cancer. Nature Medicine 25(10):1549−59 doi: 10.1038/s41591-019-0592-2 CrossRef Google Scholar
[14]	Newman, MEJ. 2003. The Structure and Function of Complex Networks. SIAM Review 45(2):167−256 doi: 10.1137/S003614450342480 CrossRef Google Scholar
[15]	Shugay M, Bagaev DV, Turchaninova MA, Bolotin DA, Britanova OV, et al. 2015. VDJtools: unifying post-analysis of T cell receptor repertoires. PLoS Computational Biology 11(11):e1004503 doi: 10.1371/journal.pcbi.1004503 CrossRef Google Scholar
[16]	Greiff V, Bhat P, Cook SC, Menzel U, Kang W, et al. 2015. A bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status. Genome medicine 7(1):49 doi: 10.1186/s13073-015-0169-8 CrossRef Google Scholar
[17]	Leskovec, J and Faloutsos, C. 2006. Sampling from large graphs. Proceedings of the 12^th ACM SIGKDD international conference on Knowledge discovery and data mining, 20−23 August, 2006, Philadelphia, PA, USA. USA: Association for Computing Machinery. pp. 631−636 doi: 10.1145/1150402.1150479
[18]	Hübler C, Kriegel HP, Borgwardt K, Ghahramani Z. 2008. Metropolis Algorithms for Representative Subgraph Sampling. 2008 Eighth IEEE International Conference on Data Mining, 15−19 December 2008, Pisa, Italy. USA: IEEE. pp. 283−92 DOI: 10.1109/ICDM.2008.124
[19]	Goodman, LA. Snowball sampling. The annals of mathematical statistics 1961: 148–70
[20]	Brin S, Page L. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks 30:107−117 doi: 10.1016/j.comnet.2012.10.007 CrossRef Google Scholar
[21]	Aggarwal A. 2018. Graph_Sampling. San Francisco, CA, USA: GitHub. https://github.com/Ashish7129/Graph_Sampling
[22]	Bagrow, JP and Bollt, EM. 2019. An information-theoretic, all-scales approach to comparing networks. Applied Network Science 4(1):45 doi: 10.1007/s41109-019-0156-x CrossRef Google Scholar
[23]	Winding M, Pedigo BD, Barnes CL, Patsolic HG, Park Y, et al. 2023. The connectome of an insect brain. Science 379:eadd9330 doi: 10.1126/science.add9330 CrossRef Google Scholar
[24]	Sade-Feldman M, Yizhak K, Bjorgaard SL, Ray JP, Boer CGde, et al. 2018. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175(4):998−1013.e1–e20 doi: 10.1016/j.cell.2018.12.034 CrossRef Google Scholar
[25]	Banerjee S, Le P, Yang H, Zhang L, He T. 2024. TCR-NP: a novel approach to prioritize T-cell receptor repertoire network properties. Statistics Innovation 1:e003 doi: 10.48130/stati-0024-0003 CrossRef Google Scholar
[26]	Watts, DJ, Strogatz SH. 1998. Collective dynamics of 'small-world' networks. Nature 393(6684):440−442 doi: 10.1038/30918 CrossRef Google Scholar
[27]	Newman M. 2018. Networks. Oxford, UK: Oxford University Press. doi: 10.1093/oso/9780198805090.001.0001
[28]	Newman MEJ. 2003. Mixing patterns in networks. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics 67:026126 doi: 10.1103/PhysRevE.67.026126 CrossRef Google Scholar
[29]	Cohen J. 1988. Statistical power analysis for the behavioral sciences. Hillsdale, NJ, USA: Lawrence Erlbaum Associates. doi: 10.4324/9780203771587
[30]	Mayer-Blackwell K, Schattgen S, Cohen-Lavi L, Crawford JC, Souquette A, et al. 2021. TCR meta-clonotypes for biomarker discovery with tcrdist3 enabled identification of public, HLA-restricted clusters of SARS-CoV-2 TCRs. eLife 10:e68605 doi: 10.7554/elife.68605 CrossRef Google Scholar
[31]	Mandino F, Vrooman RM, Foo HE, Yeow LY, Bolton TAW, et al. 2022. A triple-network organization for the mouse brain. Molecular Psychiatry 27(2):865−872 doi: 10.1038/s41380-021-01298-5 CrossRef Google Scholar

Subsampling percentage	Relative time: Median (Min, Max)		Relative memory: Median (Min, Max)
Subsampling percentage	Clone-based network	Abundance-based network	Clone-based network	Abundance-based network
5	0.96% (0.6%, 3.1%)	2.4% (1.0%, 7.2%)	3.1% (0.1%, 7.2%)	10.8% (0.8%, 29.9%)
10	2.01% (1.2%, 4.6%)	4.5% (2.5%, 14.6%)	2.8% (1.6%, 10.6%)	13.3% (1.1%, 38.4%)
15	3.62% (2.1%, 8.5%)	7.3% (4.1%, 15.2%)	4.4% (1.5%, 19.2%)	17.2% (1.1%, 37.1%)
20	5.10% (3.4%, 9.8%)	9.2% (5.7%, 18.1%)	6.8% (1.2%, 19.7%)	16.8% (1.2%, 30.7%)
25	7.43% (4.8%, 12.9%)	11.8% (7.9%, 20.9%)	10.9% (1.6%, 38.7%)	20.3% (3.9%, 31.3%)
30	10.17% (6.9%, 16.3%)	13.8% (10.5%, 23.2%)	13.7% (1.2%, 48.0%)	20.7% (4.4%, 27.7%)

{{lists.name}}

A systematic evaluation of network subsampling methods for T-cell receptor repertoire network

Abstract

Supplementary information

Rights and permissions

References

About this article

Cite this article

Article Metrics

Access History

Other Articles By Authors

A systematic evaluation of network subsampling methods for T-cell receptor repertoire network