Introduction
-
T-cells are one of the key components of the adaptive immune system[1]. T-cell Receptors
(TCRs)[1,2] are a group of protein complexes on the surface of T-cells. TCRs recognize and bind to specific antigen peptides[3] found on abnormal cells or potentially harmful pathogens. Once the TCRs bind, the T-cells attack these cells and help the body fight infection, cancer, or other diseases. TCR repertoires, which are continually shaped throughout an individual's lifetime in response to pathogenic exposure, can serve as a fingerprint of an individual's current immunological profile. The protein structures of TCRs determine the binding between TCRs and antigen peptides[4]. Thus, the similarity among TCR sequences directly influences the breadth of antigen peptide recognition. Network analysis, in which TCR clones are represented by vertices and connected when their sequences are similar under a sequence similarity measure (e.g., a Hamming or Levenshtein distance[5] below a chosen threshold), has been used as a novel perspective to study TCR clusters and their binding to antigen peptides. As the binding patterns ultimately shape high-level immune responses, the aim is to use network structure as an additional layer of information and investigate its potential connection to clinical outcome or disease status, as evidenced in the existing literature[6]. For example, it was observed that lung cancer patients with focused TCR repertoires and complex network connections attained significantly longer overall survival (OS) than those with smaller clusters[7]. Therefore, quantitative analysis of TCR repertoire network properties has the potential to provide a better understanding of the immune landscape involving T-cell responses. However, network properties are highly heterogeneous: they can be measured at the node and cluster levels, and networks differ in the number of nodes and clusters. It is therefore challenging to perform statistical inference or machine learning directly on TCR network properties to study their relationship with clinical outcomes.
A flexible and efficient approach is proposed to prioritize TCR Network Properties (TCR-NP) by leveraging features extracted from the heterogeneous network properties to assess their relationship with the outcome of interest, while incorporating the group structure implied by the nature of the features. As an initial investigation, we propose extracting simple summary statistics (e.g., min, Q1, mean, median, Q3, max) from the network property values, since they are easily calculated and carry the key signatures of a distribution. Since the extracted features are naturally grouped by network property, TCR-NP is built on the Group Lasso model[8], a classical statistical method that offers several advantages in feature selection and prioritization. Specifically, it promotes group-level selection and addresses multicollinearity by selecting or excluding entire groups (network properties), which is particularly beneficial when dealing with highly correlated variables (often found within the same network property). Moreover, it can handle high-dimensional data (small samples with a large number of features, as commonly seen in TCR network data) efficiently via L1 regularization[9], yielding a sparse solution that facilitates subsequent interpretation. Additionally, it can be applied to different response types (e.g., categorical, quantitative, and time-to-event), fulfilling the needs of different application scenarios.
Instead of the commonly used cross-validation (CV)[10] technique, pseudo-variables are utilized to assist selection in the Group Lasso model, inspired by Yang et al.[11]. Traditional CV tuning typically minimizes prediction errors, which are indirect measures of selection performance. In contrast, pseudo-variables, generated through permutation as artificial unrelated features, serve as 'known negatives'. The goal is to select features with stronger association signals than the pseudo-variables, which is a more direct approach to improving selection performance. Such a strategy enhances precision by minimizing false positives, resulting in a condensed set of strongly associated features. Moreover, acting as a filter, pseudo-variables contribute to a more robust model, capable of handling variations and maintaining performance in noisy datasets.
In this paper, extensive simulation studies under different scenarios were conducted to demonstrate the efficacy of TCR-NP. Performance measures, including F-1 score, False Discovery Rate (FDR), sensitivity, and stability, were calculated for each of the four following methods: permutation-assisted Group Lasso (P-Group Lasso), cross-validation tuned Lasso (CV-Lasso), cross-validation tuned Group Lasso (CV-Group Lasso), and permutation-assisted Lasso (P-Lasso). The proposed methods were also applied to a lung cancer TCR dataset for illustration.
-
A matrix of pairwise distances between amino acid sequences was calculated for each sample based on the Levenshtein distance[5]. A TCR network was then generated by connecting the amino acid sequences (nodes) with a distance less than or equal to 1 (allowing at most one amino acid difference between sequences). A cluster of a network represents a group of clones that are similar in sequence, and only clusters with at least two clones (nodes) are considered here. Based on the generated network, several quantitative properties, such as the number of clusters, diameter, assortativity, etc., were calculated (Table 1). Network analysis was performed using the R package NAIR[12]. As mentioned earlier, within the TCR repertoire of each sample there are different numbers of clusters, each with its own property values, so the dimension of each property varies among samples. To tackle this issue, descriptive summary statistics of all TCR repertoire network properties were derived and treated as network property features (Fig. 1, top) for each patient. These summary statistics comprise the minimum, 1st quartile (Q1), median, mean, 3rd quartile (Q3), and maximum values. This approach yields TCR network property features for each patient (Table 2) that are independent of the number of collected TCRs per patient and the number of clusters per network, making the input structure the same across patients and suitable for statistical inference at the patient level.
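For concreteness, the sketch below illustrates how such cluster-level properties and summary features could be derived with base R and igraph (the functions listed in Table 1). It is a simplified, hypothetical illustration rather than the NAIR pipeline used here: only a subset of the eleven properties is shown, and the function and object names are ours.

```r
library(igraph)

# Hypothetical helper: from one patient's amino-acid sequences to a fixed-length
# feature vector of per-cluster network property summaries.
tcr_network_features <- function(seqs) {
  d   <- adist(seqs)                              # pairwise Levenshtein distances (base R)
  adj <- (d <= 1) * 1L
  diag(adj) <- 0L                                 # connect clones differing by <= 1 amino acid
  net <- graph_from_adjacency_matrix(adj, mode = "undirected")
  cl  <- components(net)
  keep <- which(cl$csize >= 2)                    # keep clusters with at least two clones
  props <- t(sapply(keep, function(k) {           # one row of properties per cluster
    sub <- induced_subgraph(net, which(cl$membership == k))
    c(cluster_size = vcount(sub),
      diameter     = diameter(sub, directed = FALSE),
      transitivity = transitivity(sub, type = "global"),
      density      = edge_density(sub, loops = FALSE),
      degree_centr = centr_degree(sub, normalized = TRUE)$centralization)
  }))
  # collapse the varying number of clusters into fixed-length summary features
  feats <- apply(props, 2, function(v)
    c(min     = min(v, na.rm = TRUE),
      q1      = quantile(v, 0.25, na.rm = TRUE, names = FALSE),
      median  = median(v, na.rm = TRUE),
      mean    = mean(v, na.rm = TRUE),
      q3      = quantile(v, 0.75, na.rm = TRUE, names = FALSE),
      max     = max(v, na.rm = TRUE),
      prob_NA = mean(is.na(v))))
  as.vector(feats)                                # same layout for every patient
}
```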
Table 1. TCR network properties, descriptions, and functions used for their calculation.
Network properties | Description | Function
Count at baseline | Summation of the TCR counts of a given membership at baseline. | sum(input_data[input_data$membership == membership_id, Count_baseline])
Count post treatment | Summation of the TCR counts of a given membership post treatment. | sum(input_data[input_data$membership == membership_id, Count_post_treatment])
Cluster size | The number of nodes in a given membership. | table(input_data$membership)
Diameter length | The length of the longest geodesic. | get_diameter(net, directed = F)
Assortativity | The assortativity coefficient, which is positive if similar vertices (based on some external property) tend to connect to each other, and negative otherwise. | assortativity_degree(net, directed = F)
Transitivity | The probability that the adjacent vertices of a vertex are connected. | transitivity(net, type = "global")
Density | The ratio of the number of edges to the number of possible edges. | edge_density(net, loops = F)
Degree centrality | Graph centralization based on the degrees of vertices. | centr_degree(net, mode = "in", normalized = T)$centralization
Closeness centrality | Graph centralization based on the closeness of vertices. | centr_clo(net, mode = "all", normalized = T)$centralization
Eigenvector centrality | Graph centralization based on the eigenvector centralities of vertices. | eigen_centrality(net, directed = F, weights = NA)$value
Central Eigenvector | Graph centralization based on the eigenvector centrality of vertices. | centr_eigen(net, directed = F, normalized = T)$centralization
Figure 1.
Proposed pipeline. Top: Derivation of the network properties to network property features; Bottom left: Cross-Validation (CV) tuning and permutation-assisted tuning; Bottom right: Summary of the feature selection models (Lasso and Group Lasso).
Table 2. TCR network properties and derived network property features.
Network properties | Network property features
Node count | Min, Q1, Median, Mean, Q3, Max
*Count pre infusion | Min, Q1, Median, Mean, Q3, Max
Count dose 2 | Min, Q1, Median, Mean, Q3, Max
*Diameter length | Min, Q1, Median, Mean, Q3, Max
Assortativity | prob(NA), Min, Q1, Median, Mean, Q3, Max
Transitivity | prob(NA), Min, Q1, Median, Mean, Q3, Max
Density | Min, Q1, Median, Mean, Q3, Max
Degree centrality | Min, Q1, Median, Mean, Q3, Max
Closeness centrality | prob(NA), Min, Q1, Median, Mean, Q3, Max
*Eigenvector centrality | Min, Q1, Median, Mean, Q3, Max
*Central Eigen | prob(NA), Min, Q1, Median, Mean, Q3, Max
* Properties/property features marked with an asterisk are considered the causal properties/features in the simulation studies.
Group Lasso model
-
Assume a total of G network properties are considered, where the g-th property generates vg network property features via summary statistics of that property, g = 1, ..., G. Define $ {\mathit{x}}_{i}=({x}_{i,1}^{T},\cdots ,{x}_{i,G}^{T})^{T} $ as the network property features generated from the i-th patient's TCR repertoire, where $ {x}_{i,g}\in {R}^{{v}_{g}} $ represents the features from the g-th property, with $ \sum_{g=1}^{G}{v}_{g}=P $, i = 1, ..., n. Let y = (y1, ..., yn) be a binary response of interest, i.e., $ {y}_{i}\in \{0,1\} $, i = 1, ..., n. The binary response could be disease status, response to treatment, prolonged survival, etc. We assume the relationship between the response variable and the network property features follows a logistic regression model:

$ \mathrm{Pr}\left({y}_{i}=1|{\mathit{x}}_{i}\right)=\dfrac{\mathrm{exp}({\eta }_{\beta }({\mathit{x}}_{i}))}{1+\mathrm{exp}({\eta }_{\beta }({\mathit{x}}_{i}))} $

where

$ {\eta }_{\beta }\left({\mathit{x}}_{i}\right)={\beta }_{0}+{\mathit{\beta }}_{1}^{T}{\mathit{x}}_{i,1}+\cdots +{\mathit{\beta }}_{G}^{T}{\mathit{x}}_{i,G} $

is a linear combination of the network property features. Here, $ \mathit{\beta }=({\beta }_{0},{\mathit{\beta }}_{1}^{T},\cdots ,{\mathit{\beta }}_{G}^{T})^{T} $ is the vector of logistic regression coefficients, and $ {\mathit{\beta }}_{g}\in {R}^{{v}_{g}} $ is the coefficient vector for the vg network features generated by the g-th property, g = 1, ..., G. The goal is to identify the network properties that are associated with the response variable, i.e., to identify the property feature groups with βg ≠ 0. The Group Lasso method (Fig. 1, bottom right) is well suited to this problem because of the group structure among the variables and the need for shrinkage. It can efficiently shrink the coefficients of less important groups to exactly zero in high-dimensional data, so that the groups with nonzero coefficients stand out and are considered the most important properties associated with the response variable. The solution of the logistic Group Lasso model corresponds to minimizing the objective function:

$ {L}_{\lambda }\left(\mathit{\beta }\right)=-\sum _{i=1}^{n}\left[{y}_{i}{\eta }_{\beta }\left({\mathit{x}}_{i}\right)-\mathrm{log}\left(1+\mathrm{exp}\left({\eta }_{\beta }\left({\mathit{x}}_{i}\right)\right)\right)\right]+\lambda \sum _{g=1}^{G}s\left({v}_{g}\right){\|{\mathit{\beta }}_{g}\|}_{2} $

where s(vg) is the penalty weight for the g-th group, set by default to $ \sqrt{{v}_{g}} $ in the Group Lasso model (i.e., larger groups receive a larger penalty), $ {\|{\mathit{\beta }}_{g}\|}_{2} $ denotes the L2 norm of the vector βg, and λ ≥ 0 is a tuning parameter controlling the amount of shrinkage. A larger λ induces heavier shrinkage, i.e., more coefficient vectors βg are shrunk to zero. In the extreme case λ = 0, the solution coincides with the unpenalized logistic regression estimate, whereas λ = ∞ gives $ ({\mathit{\beta }}_{1}^{T},\cdots ,{\mathit{\beta }}_{G}^{T})=0 $, i.e., all coefficient vectors are shrunk to zero. For a given λ, an estimate $ \hat{\mathit{\beta }}\left(\lambda \right) $ can be obtained by solving this optimization problem.
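In practice, a logistic Group Lasso of this form can be fitted with an off-the-shelf solver. The sketch below uses the grpreg package, which applies the default $ \sqrt{{v}_{g}} $ group multiplier; the object names (X, y, v_g) are placeholders, and this is a minimal sketch rather than necessarily the implementation used by the authors.

```r
library(grpreg)

# X: n x P matrix of standardized network property features (columns ordered by property)
# y: binary outcome; v_g: vector of features per property (e.g., from Table 2)
grp <- rep(seq_along(v_g), times = v_g)        # group label (property index) of each column

fit <- grpreg(X, y, group = grp,
              penalty = "grLasso", family = "binomial")
# grpreg applies the sqrt(group-size) multiplier s(v_g) by default, so the
# coefficients of a whole property enter or leave the model together.
plot(fit)                                      # group-wise coefficient paths over lambda
coef(fit, lambda = 0.05)                       # estimates at one chosen lambda value
```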
A novel approach for group feature selection
-
Selecting the right tuning parameter λ is crucial for improving the performance and robustness of a model. Shrinkage techniques like Lasso and Group Lasso typically use K-fold cross-validation to identify the optimal value of the tuning parameter λ from a range of candidate λ values. In this method, the dataset is divided into K equal folds. For each candidate λ value, the model is trained on K−1 folds and validated on the remaining fold. This process is repeated for each λ value, and the optimal λ is chosen by minimizing the average loss, such as the mean squared error for a Gaussian response or the deviance for a binary response, across all validation folds. However, the average loss that guides this selection is not a direct measure of selection performance. In the past decade, pseudo-variables have been used to improve the performance of variable selection[13,14]. Inspired by Yang et al.[11], where pseudo-variables were utilized to assist variable selection in the Lasso model and applied to genome-wide association studies, a pseudo-variable-assisted tuning procedure was developed for the proposed Group Lasso model (Fig. 1, bottom left) to identify the important network properties associated with the clinical outcome.
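As a point of reference for the permutation-assisted procedure described next, the conventional CV-tuned counterpart (the CV-Group Lasso baseline compared later) can be obtained in a few lines, again assuming the grpreg solver and reusing the placeholder objects X, y, and grp from the sketch above.

```r
library(grpreg)

# K-fold CV picks the lambda minimizing the average binomial deviance across folds.
cvfit     <- cv.grpreg(X, y, group = grp, penalty = "grLasso",
                       family = "binomial", nfolds = 10)
lambda_cv <- cvfit$lambda.min                                        # smallest CV error
selected  <- which(coef(cvfit$fit, lambda = lambda_cv)[-1] != 0)     # nonzero (selected) features
```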
First, G groups of pseudo-features $ {\mathit{x}}_{i}^{\pi }=({x}_{i,(G+1)}^{T},\cdots ,{x}_{i,2G}^{T})^{T} $ are introduced, generated by a permutation π (i.e., randomly shuffling the rows of the original feature matrix). The augmented features $ {\mathit{x}}_{i}^{A} $ then include both the original grouped variables $ {\mathit{x}}_{i} $ and the pseudo-grouped variables $ {\mathit{x}}_{i}^{\pi } $:

$ {\mathit{x}}_{i}^{A}=({x}_{i,1}^{T},\cdots ,{x}_{i,G}^{T},{x}_{i,(G+1)}^{T},\cdots ,{x}_{i,2G}^{T})^{T}. $

The updated logistic regression model becomes:

$ \mathrm{Pr}\left({y}_{i}=1|{\mathit{x}}_{i}^{A}\right)=\dfrac{\mathrm{exp}({\eta }_{{\beta }^{A}}({\mathit{x}}_{i}^{A}))}{1+\mathrm{exp}({\eta }_{{\beta }^{A}}({\mathit{x}}_{i}^{A}))} $

where $ {\eta }_{{\beta }^{A}}\left({\mathit{x}}_{i}^{A}\right)={\beta }_{0}+{\mathit{\beta }}_{1}^{T}{\mathit{x}}_{i,1}+\cdots +{\mathit{\beta }}_{2G}^{T}{\mathit{x}}_{i,2G} $ represents a linear combination of the augmented features. The logistic Group Lasso estimator $ \hat{{\mathit{\beta }}_{\lambda }^{A}} $ for this augmented design matrix is obtained by minimizing the objective function:

$ {L}_{\lambda }\left({\mathit{\beta }}^{A}\right)=-\sum _{i=1}^{n}\left[{y}_{i}{\eta }_{{\beta }^{A}}\left({\mathit{x}}_{i}^{A}\right)-\mathrm{log}\left(1+\mathrm{exp}\left({\eta }_{{\beta }^{A}}\left({\mathit{x}}_{i}^{A}\right)\right)\right)\right]+\lambda \sum _{g=1}^{2G}s\left({v}_{g}\right){\|{\mathit{\beta }}_{g}\|}_{2} $

Since the pseudo-variables are generated by permutation, their group sizes are the same as those of the original groups, i.e., vG+g = vg, g = 1, ..., G. In the above equation, the tuning parameter λ ≥ 0 controls the amount of penalization (i.e., how many groups have nonzero coefficient vectors). More explicitly, all 2G groups have nonzero coefficient vectors when λ = 0. As λ increases, more groups are excluded from the model (i.e., their coefficient vectors are shrunk to zero) until every group is excluded when λ is large enough, with more important groups staying in the model longer as λ increases. Therefore, the magnitude of λ reflects the importance of a variable group: if a group still has a nonzero coefficient when λ is relatively large, it is considered more important than the groups already shrunk to zero. Following this idea, we define an importance metric for the g-th variable group:

$ {W}_{g}=\mathrm{sup}\left\{\lambda :\hat{{\mathit{\beta }}_{g}^{A}}\left(\lambda \right)\ne 0\right\};\; g=1,\cdots ,2G $

The group-variable selection procedure given below assumes that true active grouped variables are more likely to stay in the model than the pseudo-grouped variables (known noise) as the penalty λ increases. Define $ {C}_{\pi }={\mathrm{max}}_{(G+1)\leqslant g\leqslant 2G}({W}_{g}) $ for the permutation copy π, i.e., the largest importance score among the pseudo groups. This serves as a benchmark for separating the true active group variables from the pseudo group variables: we want to select the true groups that are more important than the strongest signal among the pseudo groups. Specifically, the set of groups selected under a particular permutation π is defined as:

$ \hat{{S}_{\pi }}=\left\{g:{W}_{g} > {C}_{\pi },\; g=1,\cdots ,G\right\}. $

The selection process involves iteratively creating K permutation copies (e.g., K = 50) and evaluating the selection frequency of each of the G groups across these K permutations. A group is selected if its selection frequency out of the K permutations exceeds a threshold τ.
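A minimal sketch of this permutation-assisted selection procedure is given below. It again assumes the grpreg solver and approximates Wg on the solver's λ grid; the function name, defaults, and implementation details are illustrative rather than the authors' exact code.

```r
library(grpreg)

# Hypothetical implementation: X (n x P, columns ordered by property), y (0/1 outcome),
# grp (group labels 1..G), K permutation copies, selection-frequency threshold tau.
perm_group_select <- function(X, y, grp, K = 50, tau = 0.5) {
  G <- length(unique(grp))
  votes <- matrix(0L, nrow = K, ncol = G)
  for (k in seq_len(K)) {
    Xpi <- X[sample(nrow(X)), , drop = FALSE]             # pseudo-groups: row-permuted copy of X
    fit <- grpreg(cbind(X, Xpi), y, group = c(grp, grp + G),
                  penalty = "grLasso", family = "binomial")
    beta <- fit$beta[-1, , drop = FALSE]                   # drop intercept; one column per lambda
    W <- sapply(seq_len(2 * G), function(g) {              # entry lambda W_g of each group
      active <- colSums(beta[c(grp, grp + G) == g, , drop = FALSE] != 0) > 0
      if (any(active)) max(fit$lambda[active]) else 0
    })
    C_pi <- max(W[(G + 1):(2 * G)])                        # strongest pseudo-group signal
    votes[k, ] <- as.integer(W[seq_len(G)] > C_pi)         # true groups beating the benchmark
  }
  which(colMeans(votes) > tau)                             # selected in more than tau of the copies
}
```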
Lung cancer data
-
The TCR repertoire sequencing data of 65 patients enrolled in the Phase I trial of durvalumab (NCT01693562, 14 September 2012) were included in this analysis. Patients with OS ≥ 20.3 months were categorized into the longer overall survival group, and patients with OS < 20.3 months into the shorter overall survival group, where 20.3 months was the median overall survival. Bulk TCR beta chain sequencing was performed for each blood sample (two samples per patient, including baseline and post-treatment) using the Invitrogen Qubit dsDNA HS assay (Thermo Fisher Scientific). The median number of unique clonotypes was 4,994 (range 403 to 17,876). The clinical characteristics of the patients and sequencing information are as reported in the study by Naidus et al.[7].
Simulation strategy
-
To demonstrate the performance of the proposed method on TCR data, an efficient simulation approach was developed to generate TCR network properties based on real data (Supplementary Fig. S1). First, the values of the network properties were computed from the observed data, and the correlation structure of the properties was also estimated from the observed data. Second, the empirical distributions of cluster size were approximated (using a log-normal distribution with estimated parameters). Third, based on the estimated distributions and correlation structure, artificial data were simulated to mimic the real data. This process was repeated to generate network properties for a sample of n patients. Finally, summary statistics were extracted for each of the 11 properties and aggregated to produce the 70 network features listed in Table 2. Besides the 70 network features, additional variables were simulated from a Uniform(0,1) distribution to mimic applicable variables from other sources, resulting in a total of P features.
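The sketch below illustrates one way such correlated, log-normally distributed property values could be simulated and summarized. The parameter objects (mu_hat, sd_hat, Sigma_hat) are placeholders for quantities estimated from the observed data, and the exact simulation code used in the paper may differ.

```r
library(MASS)

# Hypothetical names: mu_hat / sd_hat are per-property log-scale means and SDs,
# Sigma_hat the estimated correlation matrix, n_clusters the simulated cluster count.
simulate_patient_props <- function(n_clusters, mu_hat, sd_hat, Sigma_hat) {
  Z <- mvrnorm(n_clusters, mu = rep(0, length(mu_hat)), Sigma = Sigma_hat)
  props <- exp(sweep(sweep(Z, 2, sd_hat, `*`), 2, mu_hat, `+`))   # log-normal marginals
  colnames(props) <- names(mu_hat)
  props                                    # one row of correlated property values per cluster
}

# Per-patient feature vector: the same six summary statistics as for the real data,
# plus Uniform(0,1) noise columns mimicking unrelated variables from other sources.
simulate_patient_features <- function(n_clusters, mu_hat, sd_hat, Sigma_hat, n_noise = 200) {
  props <- simulate_patient_props(n_clusters, mu_hat, sd_hat, Sigma_hat)
  feats <- as.vector(apply(props, 2, function(v)
    c(min(v), quantile(v, c(0.25, 0.5), names = FALSE), mean(v),
      quantile(v, 0.75, names = FALSE), max(v))))
  c(feats, runif(n_noise))
}
```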
To simulate the response variable, it is assumed that there are four non-observed causal variables {Z1, Z2, Z3, Z4}, corresponding to four different network properties, where each causal variable is a linear combination of percentiles from the distribution of the corresponding property (Supplementary Fig. S2). The four causal properties generate 25 observed network property features (marked with an asterisk in Table 2), which are considered as (indirect) causal variables in the simulation studies. The remaining P−25 network features are considered non-causal variables. The aim is to evaluate how well the proposed method can identify those 25 causal variables. The response variable is generated using the logistic regression model:
$ \mathrm{Pr}\left({y}_{i}=1\right)=\dfrac{\mathrm{exp}\{{\eta }_{\alpha }\left({\mathit{Z}}_{i}\right)\}}{1+\mathrm{exp}\{{\eta }_{\alpha }\left({\mathit{Z}}_{i}\right)\}} $

where $ {\eta }_{\alpha }\left({\mathit{Z}}_{i}\right) $ is a function of the four causal variables, which can be either linear or nonlinear. Then yi is generated as a random draw from a Bernoulli distribution with success probability Pr(yi = 1), for i = 1, ..., n. The simulation is repeated N = 100 times under each of 12 different scenarios (Table 3), varying the sample size n, the dimension parameter P, the balance of the response, and whether $ {\eta }_{\alpha }\left({\mathit{Z}}_{i}\right) $ is linear or nonlinear.
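For illustration, the outcome-generation step could look like the sketch below; the coefficients and the interaction term are hypothetical choices, not the values used in the study (those are given in Supplementary Fig. S2).

```r
# Z: n x 4 matrix of the latent causal variables Z1..Z4;
# the coefficients and the interaction term below are illustrative only.
eta_alpha <- function(Z, nonlinear = FALSE) {
  lin <- 1.0 * Z[, 1] - 0.8 * Z[, 2] + 0.6 * Z[, 3] + 0.5 * Z[, 4]
  if (nonlinear) lin + 0.7 * Z[, 1] * Z[, 2] else lin    # optional interaction term
}

p_i <- 1 / (1 + exp(-eta_alpha(Z)))        # Pr(y_i = 1) under the logistic model
y   <- rbinom(nrow(Z), size = 1, prob = p_i)
```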
Table 3. Simulation scenarios.
Scenario # | *n | **P (Causal / Non-causal / Additional) | ***Balanced data | ****Linear/non-linear
n200_p270_baln_lin | 200 | 25 / 45 / 200 | Balanced | Linear
n200_p270_baln_nonlin | 200 | 25 / 45 / 200 | Balanced | Non-linear
n200_p70_baln_lin | 200 | 25 / 45 / NA | Balanced | Linear
n200_p70_baln_nonlin | 200 | 25 / 45 / NA | Balanced | Non-linear
n300_p270_baln_lin | 300 | 25 / 45 / 200 | Balanced | Linear
n300_p270_baln_nonlin | 300 | 25 / 45 / 200 | Balanced | Non-linear
n300_p270_unbaln_lin | 300 | 25 / 45 / 200 | Imbalanced | Linear
n300_p270_unbaln_nonlin | 300 | 25 / 45 / 200 | Imbalanced | Non-linear
n300_p70_baln_lin | 300 | 25 / 45 / NA | Balanced | Linear
n300_p70_baln_nonlin | 300 | 25 / 45 / NA | Balanced | Non-linear
n300_p70_unbaln_lin | 300 | 25 / 45 / NA | Imbalanced | Linear
n300_p70_unbaln_nonlin | 300 | 25 / 45 / NA | Imbalanced | Non-linear
* n: number of simulated patients. ** P: number of simulated signals (causal TCR network property features, non-causal TCR network property features, additional correlated multivariate noise signals). *** Balanced data: the simulated dataset has a balanced proportion of shorter and longer survival groups. **** Linear: the simulated response variable Y is generated from a linear combination of the causal variables. Non-linear: the simulated response variable Y is generated from linear and interaction terms of the causal variables.
Performance evaluation criteria
-
Performance measures, including sensitivity, False Discovery Rate (FDR), F-1 score, and stability, are used to evaluate the feature selection models. Sensitivity is defined as the proportion of the 25 causal variables correctly identified in a single iteration; higher sensitivity is preferred. FDR is defined as the proportion of false-positive findings among all selected variables; a lower value is preferred. The F-1 score is the harmonic mean of sensitivity and precision (1 − FDR), i.e., 2 × (Precision × Sensitivity)/(Precision + Sensitivity); it balances sensitivity and precision, and a higher value is preferred. The average sensitivity, FDR, and F-1 over the N = 100 simulation replicates were calculated and reported. To estimate the stability of a variable selection model, all pairwise combinations of the N = 100 selected variable lists from all iterations are considered. For each pair, the stability of the two lists of selected variables is measured by Jaccard's index,

$ J\left({A}_{i},{A}_{j}\right)=\dfrac{\left|{A}_{i}\cap {A}_{j}\right|}{\left|{A}_{i}\cup {A}_{j}\right|} $

where Ai and Aj (i ≠ j; i, j $ \in $ {1, 2, ..., N}) are the lists of variables selected in the i-th and j-th iterations, respectively, and $ \left|\cdot \right| $ denotes the cardinality of a set. Jaccard's index takes values between 0 and 1, where 0 indicates that the two lists do not overlap and 1 means the two lists contain exactly the same variables (i.e., very stable). The average over all pairs is used as the stability value for the method.
-
TCR repertoire network analysis was conducted for each of the 65 lung cancer patients[7]. Figure 2a and b illustrate the networks for two representative patients. The number of TCR clusters per patient ranged from 15 to 883, with a median of 271. Eleven network properties were evaluated for each TCR cluster (Table 1), and 70 network property features were derived (Supplementary Table S2) by obtaining summary statistics of each network property for each patient (Fig. 1 & Table 2). The summary statistics consist of descriptive information such as the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values, plus the proportion of NA values where they exist. NA values arise when a property cannot be evaluated for a particular cluster structure; for example, assortativity, transitivity, central Eigen, and closeness centrality are all NA when a cluster contains only two nodes (Supplementary Fig. S1). All extracted features were then standardized, following common practice. The CV-Lasso, P-Lasso, CV-Group Lasso, and P-Group Lasso models were then applied with the corresponding parameters listed in Supplementary Table S1. The significant network property features identified by the P-Lasso model were a subset of those from the CV-Lasso model, aligning with the known tendency of permutation-assisted tuning to reduce false positives (Table 4). The consistency between the P-Group Lasso and CV-Group Lasso results, including identical prediction outcomes (AUCs), strengthens the findings (Fig. 2c). Both models selected all features from the most significant network properties, resulting in a higher AUC (0.87) than CV-Lasso and P-Lasso (Fig. 2c). Furthermore, composite scores were calculated as the linear combination of the model coefficients and the corresponding selected features in the logistic regression model. The weighted composite scores were compared between the longer and shorter survival groups (overall survival above or below the median, respectively) using a two-sample t-test. The differences in scores between the two groups were more significant for CV-Group Lasso and P-Group Lasso (p-value < 0.0001) than for CV-Lasso and P-Lasso (Fig. 2d).
Figure 2.
The results for the lung cancer dataset. (a) and (b) Networks for two representative patients. Within each network figure, each node represents a TCR, and nodes are connected if their distance is less than or equal to 1. (c) ROC curves for each of the approaches. (d) Boxplots of the composite scores.
Table 4. Results from real data analysis. The table lists the network properties and corresponding network property features selected by each approach.
Network properties | CV-Lasso* | P-Lasso | CV-Group Lasso** | P-Group Lasso**
Count pre infusion | Max | Max | All | All
Count dose2 | − | − | All | All
Node count | − | − | All | −
Diameter length | Max | Max | − | −
Assortativity | − | − | − | −
Transitivity | − | − | − | −
Density | − | − | − | −
Degree centrality | − | − | − | −
Closeness centrality | − | − | − | −
Eigenvector centrality | Max | Max | All | All
Central Eigen | Max | − | − | −
* The network property features extracted using the CV-Lasso model are used as the causal variables in the simulation study. ** The value 'All' represents the entire set of descriptive summary statistics derived from the TCR network property.
Simulation study results
-
An extensive simulation study was conducted to assess the performance of the four models, using the parameters outlined in Supplementary Table S1. The simulation scenarios are detailed in Table 3, with additional parameters provided in Supplementary Fig. S2. The observations indicate that Group Lasso models consistently exhibit higher sensitivity in identifying causal variables than Lasso models, with CV slightly outperforming permutation-assisted parameter tuning regardless of the Lasso or Group Lasso approach (Fig. 3a). Notably, permutation-assisted parameter tuning demonstrates superior FDR results for both Lasso and Group Lasso models (Fig. 3b). Specifically, the P-Group Lasso model shows improved performance in F-1 scores (Fig. 3c) and stability (Fig. 3d) in all scenarios compared with CV-Group Lasso, with a notably lower FDR. Conversely, the P-Lasso and CV-Lasso models exhibit poorer performance across all metrics compared with P-Group Lasso and CV-Group Lasso, except for FDR, where results vary by scenario. Interestingly, the P-Lasso and P-Group Lasso models can extract causal features without any false positives in certain scenarios, consistent with the lower false positive rates associated with permutation-assisted tuning. Furthermore, increased model stability was observed with P-Lasso and P-Group Lasso, a critical feature in biomedical settings. While the CV-Lasso and P-Lasso models extract top network property features regardless of the underlying grouping structure, the CV-Group Lasso and P-Group Lasso models consistently identify top network properties across all grouped variables. Overall, the models exhibit robustness across the various simulation scenarios, including sample size, number of features, balance of the outcome of interest, and linear vs nonlinear relationships among causal variables.
Figure 3.
Performance evaluation based on simulation. (a) Sensitivity. (b) FDR. (c) F-1. (d) Stability. In each panel, the x-axis shows the different simulation scenarios listed in Table 3, and colors denote the different approaches.
-
This paper introduces a novel approach to prioritize the heterogeneous TCR network properties associated with a binary response of interest, identifying TCR network properties as prognostic features or predictive markers in high-throughput TCR sequencing data from clinical samples. The heterogeneous network properties are first aggregated into homogeneous network features. The present method utilizes a Group Lasso model, integrating a group structure to facilitate efficient model fitting and generalization to various response types such as time-to-event, multi-class categorical, and quantitative responses. Additionally, pseudo-variables are introduced as known negatives to further enhance selection performance by reducing the false discovery rate and increasing the stability of selection. When comparing the proposed P-Group Lasso results to the two-sample comparison results (Supplementary Table S2), some consistency in the identified properties is observed (e.g., Diameter Length, Eigen Centrality, Central Eigen). Moreover, the proposed method selected fewer properties than the t-test (4 vs 6), which might indicate its advantage in reducing false positives, as was observed in the simulation studies. This approach has the potential to develop markers from network topological structures to predict the responses.
While the proposed method was specifically applied to TCR network analysis, its versatility extends to a wide range of genetic and medical research data, including genomic, transcriptomic, epigenomic, and proteomic data, with or without a natural group structure. If the features come with a natural group structure (e.g., pathways or multi-class categorical features), the present method can help prioritize the groups associated with the response. If the features do not have a natural group structure, one can be defined by letting highly correlated variables form a group. By prioritizing relevant groups associated with the response, the present method enhances the interpretability, computational efficiency, and reliability of downstream analyses. It can filter out irrelevant noise variables, prevent overfitting, and facilitate the discovery of meaningful biological insights. Moreover, the present approach can be generalized to various outcome types, including continuous and time-to-event outcomes, beyond the binary setting assumed in this paper.
However, there are two major limitations. First, within each repertoire for each patient there are numerous clones and hundreds of clusters, each with its own node- or cluster-level properties, so there are thousands of values per property per patient. Currently, this complexity is addressed by using summary statistics (such as the mean, median, or maximum), which may not adequately represent the variation in the data. Other distribution features (e.g., percentiles) could also be derived and fed into the proposed method in the same way. Second, Lasso and Group Lasso are both based on linear models. Although regularization and permutation-assisted tuning were introduced, performance might be compromised when the true relationship deviates from linearity. Future work could involve feature engineering on network properties and extending the linear regression model to a nonlinear one to overcome this limitation.
Le P, Yang H, Zhang L, and He T are partially supported by the National Cancer Institute, National Institutes of Health, through Grant No. R21CA264381. Le P, Yang H, Zhang L, and He T are partially supported by the National Library of Medicine, National Institutes of Health, through Grant No. R01LM013763-01A1. He T and Banerjee S are partially supported by the National Science Foundation through Grant No. DMS-2137983.
-
The authors confirm contribution to the paper as follows: study conception and design: He T, Zhang L; data collection: Yang H, Zhang L; data analysis and simulation studies, draft manuscript preparation: Banerjee S, Le P, Zhang L, He T. All authors participated in the interpretation of study results, and approval of the final version of the manuscript.
-
R codes are available on GitHub (https://github.com/ShilpikaB/Prioritizing-Network-Properties-of-T-Cell-Receptors/blob/main/README.md).
-
The authors declare that they have no conflict of interest.
- Supplementary Table S1 Parameters for Data Analysis.
- Supplementary Table S2 Summary of the network property features for the lung cancer data.
- Supplementary Fig. S1 Simulation strategy step 1: Network property feature simulation.
- Supplementary Fig. S2 Simulation strategy step 2: Outcome simulation.
- © 2024 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
-
Cite this article
Banerjee S, Le P, Yang H, Zhang L, He T. 2024. TCR-NP: a novel approach to prioritize T-cell Receptor repertoire network properties. Statistics Innovation 1: e003 doi: 10.48130/stati-0024-0003
- Received: 02 September 2024
- Revised: 10 December 2024
- Accepted: 16 December 2024
- Published online: 30 December 2024
Abstract: T-cell Receptors (TCRs) play a pivotal role in antigen recognition and binding, and their sequence similarity significantly impacts the breadth of antigen recognition. Network analysis is employed to explore TCR sequence similarity and investigate the architecture of the TCR repertoire. Network properties hence could be utilized to quantify the structure of the TCR network. However, the heterogeneous nature of TCR network properties poses challenges in performing statistical learning across subjects directly, particularly when assessing their relationship with disease states, clinical outcomes, or patient characteristics. To overcome this challenge, a powerful method is developed, TCR-NP (TCR Network properties Prioritization), that aggregates the raw heterogeneous network properties and conducts grouped feature selection using a pseudo-variables-assisted penalized group Lasso model. Unlike the traditional parameter-tuning using cross-validation, a novel tuning strategy is introduced by incorporating permutation and pseudo-variables to improve the selection performance. The effectiveness of the proposed method is demonstrated through comprehensive evaluation, including simulation studies and real data analysis. By comparing the performance of the different approaches, the advantages of the proposed methodology in capturing the underlying relationships between TCR network properties and clinical outcomes or patient characteristics are highlighted.