ARTICLE   Open Access

Predictive modeling and inference using deep transfer learning in genetic data analysis

Statistics Innovation 2, Article number: e003 (2025)

# Authors contributed equally: Shan Zhang, Yuan Zhou

Abstract: Transfer learning has been widely applied in text and image classification, demonstrating its effectiveness in numerous applications. In this paper, we propose a transfer learning procedure for both prediction and association testing between genotypes and phenotypes in a smaller primary data set when a larger source data set is available. Specifically, we train a deep neural network model on the source data, transfer a subset of the trained weight parameters to the model for the primary data, and complete the training process on the primary data with the remaining free parameters. Furthermore, we develop a permutation-based association test using the trained transfer learning model to identify significant genes in the primary data set. We apply the proposed procedure to two case studies investigating nicotine dependence. These case studies show that transfer learning can improve not only prediction accuracy but also the power to detect candidate genes, compared to the results without transfer learning.


• In recent years, deep learning methods have gained traction due to their ability to learn complex patterns and representations from large amounts of data, enabling breakthroughs in challenging tasks in image recognition, natural language processing, and speech processing. However, deep learning methods face two main constraints: dependency on extensive labeled data and training costs[1]. In genetics, the problems of interest often have only limited labeled phenotypes together with high-dimensional genetic data. In these cases, transfer learning has the potential to achieve higher prediction accuracy and testing power by sharing parameters from a deep learning model well trained on a massive source problem. For example, with the vast amounts of genetic data collected from biobank projects, an interesting scientific question is whether these resources can be used to enhance genetic analysis in small-scale studies. A common assumption made by most existing approaches is that the two studies should be similar (e.g., drawn from the same population). However, this assumption could fail in reality: the study design and study population may differ between the two studies (e.g., Caucasian vs African American).

While transfer learning attempts to improve the performance of target learners on target domains by transferring model parameters obtained in different but related source domains, it does not require that the data from the two studies be drawn from the same feature space or the same distribution. It learns potentially useful features from source studies and applies the learned features to the problem of interest. Therefore, it holds great promise for using the enriched resources from large-scale studies to uncover novel genetic variants in small-scale studies[2].

In recent years, transfer learning has been investigated in biological and medical fields. For example, a data-driven procedure for transfer learning[3], called Trans-Lasso, was proposed for high-dimensional linear regression and applied to understand gene regulation using Genotype-Tissue Expression data. A transfer learning approach under high-dimensional generalized linear models[4] aimed to improve the fit on target data by borrowing information from useful source data. Although several survey articles[5−7] have reviewed recent developments in transfer learning for machine learning methods, including network-based deep transfer learning, which reuses part of a network pre-trained on the source data, applications of deep transfer learning in genetic data analysis remain relatively sparse. Transfer learning in convolutional neural networks[8] was developed to predict the progression-free interval of lung cancer with gene expression data. A pipeline of transfer learning for genotype−phenotype prediction[9] was studied using deep learning models with a small number of genotypes.

Given that transfer methods depend on the models or algorithms used to learn the tasks, we propose to integrate the idea of transfer learning into deep neural networks for prediction and association analyses with high-dimensional genetic data. For example, a deep neural network can be trained on the large-scale UK Biobank dataset for nicotine dependence, and its weight parameters can then be transferred to facilitate genetic analysis in small-scale studies. By integrating transfer learning into deep neural networks, we are able to transfer model parameters that capture complex genotype−phenotype relationships (e.g., gene−gene interactions) between two studies.

Besides the proposed predictive modeling using transfer learning in deep neural networks, we further develop a permutation-based association test to detect significant genes in the targeted problem based on the proposed transfer learning. The resulting p-values can be interpreted as a measure of feature importance; they help determine the significance of variables and therefore improve model interpretability[10−12]. The permutation-based association test using transfer learning shows higher statistical power than the test without transfer learning in two case studies. Moreover, the permuted data are obtained by randomly shuffling the indices of subjects while maintaining their intrinsic genetic structures, hence no model re-fitting is needed, which greatly reduces the computation cost.

• Transfer learning applies model parameters learned in one research problem to a different but related problem, and it has been widely used in text and image recognition. The basic idea of transfer learning is illustrated in Fig. 1.

Figure 1. An illustration of transfer learning. The goal is to build predictive models and conduct inference in the primary data for the focused problem. The modeling parameters learned from the source data of a related problem are transferred to the modeling process in the primary problem.

In this paper, we illustrate transfer learning with the deep neural network (DNN) model. In this section, we briefly introduce the DNN model and then integrate the idea of transfer learning into DNN (TL-DNN) between two datasets. To distinguish the different data types in the following sections, we use lowercase letters, bold lowercase letters, and uppercase letters to denote scalars, vectors, and matrices, respectively.

• Suppose our research interest is to find a predictive function f that relates the continuous response variable y to the predictor variables x = (x1, ..., xq), where q is the dimension of the input. The true model is written as:

      $ y=f(\boldsymbol{x})+\epsilon $

where $\epsilon$ is the noise in this model. We first describe the fully connected DNN algorithm for model training:

      $ \begin{array}{l}\boldsymbol{h}_1=\sigma(W_1\boldsymbol{x}+\boldsymbol{b}_1) \\ \boldsymbol{h}_d=\sigma(W_d\boldsymbol{h}_{d-1}+\boldsymbol{b}_d),\quad d=2,\dots,l-1 \\ \ \ \hat{y}=\hat{f}(\boldsymbol{x})=\boldsymbol{w}_l\boldsymbol{h}_{l-1}\end{array} $

where hd is the hidden layer learned in the primary study, with dimension md, $\sigma$ is a nonlinear activation function such as the sigmoid function $\sigma(x) = \dfrac{1}{1+e^{-x}}$, l is the number of layers of the DNN model, and Wd and bd are the weights and biases, respectively. We denote all the parameters by S = {Wd, bd, wl | d = 1, ..., l−1}.

In DNN, we first normalize the response variable y. Suppose that $\bar{y}$ and $\bar{v}$ are the estimated mean and standard deviation of y; the final model is then:

      $ \hat{y}=\hat{f}(\boldsymbol{x})=\overline{v}\boldsymbol{w}_l\boldsymbol{h}_{l-1}+\overline{y} $

For simplicity, we here focus on the regression problem and use the L2 loss and mean square error (MSE) to estimate parameters and evaluate model performance. The method can be easily extended to the classification problem, where other loss functions (e.g., cross-entropy loss) and measurements (e.g., misclassification error) can be used.
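To make the architecture concrete, below is a minimal sketch of this fully connected DNN. PyTorch is an assumption here (the paper does not name its implementation framework), and the hidden sizes 16 and 4 follow the configuration stated later in this section.

```python
import torch
import torch.nn as nn

class DNN(nn.Module):
    """Fully connected DNN with l = 3 layers: two hidden layers plus a linear output."""
    def __init__(self, q, hidden=(16, 4), y_mean=0.0, y_std=1.0):
        super().__init__()
        self.h1 = nn.Linear(q, hidden[0])               # W_1 x + b_1
        self.h2 = nn.Linear(hidden[0], hidden[1])       # W_2 h_1 + b_2
        self.out = nn.Linear(hidden[1], 1, bias=False)  # w_l h_{l-1}
        self.y_mean, self.y_std = y_mean, y_std         # \bar{y} and \bar{v} of y

    def forward(self, x):
        h1 = torch.sigmoid(self.h1(x))
        h2 = torch.sigmoid(self.h2(h1))
        # de-normalized prediction: \hat{y} = \bar{v} w_l h_{l-1} + \bar{y}
        return self.y_std * self.out(h2).squeeze(-1) + self.y_mean
```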

In genetic studies, genetic effects are usually smaller than the noise, and therefore heavy penalties are required to avoid overfitting. Various regularization methods, such as penalty regularization, dropout, and early stopping, can be used. In this paper, we impose the two parameter regularization forms defined below: the square regularization p1(S;λ) and the group Lasso regularization[13] p2(S;λ), where λ is a hyperparameter controlling the solution space.

$ \begin{array}{l} p_1({\boldsymbol{S}}; \lambda) = \lambda \Big(\sum\limits_{d = 1}^{l-1} \|W_d\|_2^2 + \|{\boldsymbol{w}}_l\|_2^2\Big) \\ p_2({\boldsymbol{S}}; \lambda) = \lambda \Big(\sum\limits_{d = 1}^{l-1}Q(W_d)+Q({\boldsymbol{w}}_l)\Big) \\ Q({\boldsymbol{z}}) = \left\{ \begin{array}{ll} \Vert{{\boldsymbol{z}}}\Vert, & \text{for } \Vert{{\boldsymbol{z}}}\Vert \geq \alpha \\ \dfrac{\Vert{{\boldsymbol{z}}}\Vert^2}{2\alpha} + \dfrac{\alpha}{2}, & \text{for } \Vert{{\boldsymbol{z}}}\Vert < \alpha \end{array}\right., \quad \alpha = 10^{-3} \end{array}$ (1)

In the following, we apply p1(S;λ) in the prediction step, while p2(S;λ) is used in the association test. The solution of the DNN framework is then defined as:

$ \hat{\boldsymbol{S}}=\arg\min\limits_{\boldsymbol{S}}\left(\dfrac{1}{n}\sum\limits_{i=1}^n(y_i-f(\boldsymbol{x}_i))^2+p(\boldsymbol{S};\lambda)\right) $

The backpropagation method is typically used to obtain the solution. The selection of the hyperparameter λ is discussed in the following section.
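As an illustration, the two penalties in Eqn (1) could be computed as follows. This is a sketch under the same PyTorch assumption, reusing the DNN class above; following the formula, Q is applied to each weight matrix as a whole group.

```python
def q_smooth(z, alpha=1e-3):
    # Q(z) in Eqn (1): the L2 norm, quadratically smoothed near zero so the
    # group-Lasso penalty stays differentiable for backpropagation.
    norm = torch.linalg.norm(z)
    return norm if norm >= alpha else norm ** 2 / (2 * alpha) + alpha / 2

def penalty(model, lam, kind="square"):
    # p1 (square regularization) or p2 (smooth group Lasso) over all weights.
    weights = [model.h1.weight, model.h2.weight, model.out.weight]
    if kind == "square":
        return lam * sum((w ** 2).sum() for w in weights)
    return lam * sum(q_smooth(w) for w in weights)
```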

• Suppose we have a source dataset (e.g., UK Biobank) in which we model the response variable y' with the predictor variables ${\boldsymbol{x}}' = (x_1', \dots, x_q')$, where x and x' share a similar data structure. Usually, x and x' measure the same characteristics on different subjects. We have a similar model:

      $ {\boldsymbol{h}}_1' = \sigma(W_1' {\boldsymbol{x'}} + {\boldsymbol{b}}_1'), $ (2)
      $ {\boldsymbol{h}}_d' = \sigma(W_d' {\boldsymbol{h}}_{d-1}' + {\boldsymbol{b}}_{d}'), \quad d = 2,\dots,l-1, $ (3)
      $ \hat{y}' = \hat{f}'({\boldsymbol{x}}') = \bar{v}' {\boldsymbol{w}}_l' {\boldsymbol{h}}_{l-1}' + \bar{y}', $ (4)

where $\bar{y}'$ and $\bar{v}'$ are the estimated mean and standard deviation of y'. The parameter space of this model is denoted by S', and the set of solutions is denoted by $\hat{\boldsymbol{S'}}$.

To prepare for the transfer learning, we divide the solution $\hat{\boldsymbol{S'}}$ into two parts: $\hat{{\boldsymbol{S}}}_1' = \{\hat{W}_d', \hat{{\boldsymbol{b}}}_{d}' | d = 1,\dots,l-1\}$ in Eqns (2) & (3) and $\hat{{\boldsymbol{S}}}_2' = \{\hat{{\boldsymbol{w}}}_{l}'\}$ in Eqn (4). Given the large scale of the source study for the related problem, it is reasonable to assume that the final hidden layer ${\boldsymbol{h}}_{l-1}'$ is a good representation of the features in x'. The relation between x' and ${\boldsymbol{h}}_{l-1}'$ is realized through $\hat{{\boldsymbol{S}}}_1'$, while the final-layer weight parameters $\hat{{\boldsymbol{S}}}_2'$ connect the last hidden layer ${\boldsymbol{h}}_{l-1}'$ with the output. Furthermore, we use $f_1'$ and $f_2'$ to denote these relationships, i.e.:

      $ \boldsymbol{h}_{l-1}'=f_1'(\boldsymbol{x'}|\hat{\boldsymbol{S}}_1'),\qquad\hat{y}'=f_2'(\boldsymbol{h}_{l-1}'|\hat{\boldsymbol{S}}_2') $

Similarly, the parameter set $\hat{{\boldsymbol{S}}}$ from the focused problem can also be divided into $\hat{{\boldsymbol{S}}}_1$ and $\hat{{\boldsymbol{S}}}_2$, with the corresponding relationships denoted by f1 and f2. The penalty term p(S;λ) can also be written as $p({\boldsymbol{S}}_1, {\boldsymbol{S}}_2; \lambda)$.

After training the deep neural network on the related problem, we apply transfer learning by using $\hat{{\boldsymbol{S}}}_1'$ as the solution for the hidden layer weights in the focused problem. In other words, we adopt the representation of hidden layer weights from the well-trained model of the related problem instead of training it on the limited data of the focused problem. Specifically, similar to the parameter-sharing strategy[14], we transfer $\hat{{\boldsymbol{S}}}_1'$ to the focused problem, freeze the transferred hidden layer weights, and only update the weights in the output layer. Finally, we derive the solution of S2 from the data set of the focused problem. Hence, our final solution in the deep transfer learning procedure is:

$ \begin{array}{l} \widetilde{{\boldsymbol{S}}}_1 \equiv \hat{{\boldsymbol{S}}}_1' = \arg\min\limits_{{\boldsymbol{S}}_1, {\boldsymbol{S}}_2} \Bigg(\dfrac{1}{n'}\displaystyle\sum\limits_{i = 1}^{n'}(y'_i - f'({\boldsymbol{x}}_i'))^2 + p({\boldsymbol{S}}_1, {\boldsymbol{S}}_2; \lambda) \Bigg) \\ \widetilde{{\boldsymbol{S}}}_2 = \arg\min\limits_{{\boldsymbol{S}}_2} \Bigg(\dfrac{1}{n}\displaystyle\sum\limits_{i = 1}^{n}(y_i - f({\boldsymbol{x}}_i|{\boldsymbol{S}}_1 = \widetilde{{\boldsymbol{S}}}_1))^2 + p(\widetilde{{\boldsymbol{S}}}_1, {\boldsymbol{S}}_2; \lambda)\Bigg) \end{array} $ (5)

where n' is the sample size of the source data.

      The transfer learning method based on the deep neural networks is illustrated in Fig. 2 in comparison to the direct application of deep neural networks in the focused problem.

Figure 2. Transfer learning in deep neural networks. The left panel shows the direct application of neural networks to the focused problem, while the right panel displays the proposed transfer learning method based on training the neural networks on the related problem. Here $ \hat{{\boldsymbol{S}}}_1' $ are the transferred weight parameters in the hidden layers. The output layer weights $ \tilde{\boldsymbol{S}}_2 $ are updated during training while $ \hat{{\boldsymbol{S}}}_1' $ are frozen.

      We apply the Adam algorithm[15] for optimization in Eqn (5). The backpropagation procedure is implemented as:

      $ \begin{array}{l}c(\boldsymbol{S})=\dfrac{1}{n}\displaystyle\sum\limits_{i=1}^n(y_i-f(\boldsymbol{x}_i))^2+p(\boldsymbol{S}_1,\boldsymbol{S}_2;\lambda) \\ \ \ \ \ s\gets s-r_1\dfrac{\partial c(\boldsymbol{S})}{\partial s},\quad s\in\boldsymbol{S}_1 \\ \ \ \ \ s\gets s-r_2\dfrac{\partial c(\boldsymbol{S})}{\partial s},\quad s\in\boldsymbol{S}_2\end{array} $

where r1 and r2 are the learning rates for S1 and S2, respectively.

The Adam algorithm is an adaptive optimization method in which the learning rate is determined element-wise. While the default setting of the Adam algorithm works well in most cases, it is not suitable for genetic studies: due to the heavy penalty on the last layer, the solution of S can be too small relative to the default learning rate. Therefore, we set the learning rate based on λ while keeping the other parameters at their default values. The iteration process stops when the MSE has not decreased for 3,000 epochs. In the next section, we set l = 3, with 16 and 4 hidden units in the two hidden layers, respectively.
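The second line of Eqn (5), freezing the transferred hidden-layer weights and updating only the output layer with Adam, might look as follows. This is a sketch reusing the DNN and penalty sketches above; the stopping rule follows the 3,000-epoch criterion just described.

```python
import copy
import torch

def transfer_fit(source_model, x, y, lam, lr=1e-3, patience=3000):
    # Copy the source solution, freeze the hidden-layer weights \hat{S}_1',
    # and update only the output-layer weights \tilde{S}_2 on the primary data.
    model = copy.deepcopy(source_model)
    for p in list(model.h1.parameters()) + list(model.h2.parameters()):
        p.requires_grad = False
    opt = torch.optim.Adam(model.out.parameters(), lr=lr)
    best, stall = float("inf"), 0
    while stall < patience:  # stop when the MSE has not decreased for 3,000 epochs
        opt.zero_grad()
        loss = ((y - model(x)) ** 2).mean() + penalty(model, lam)
        loss.backward()
        opt.step()
        mse = ((y - model(x)) ** 2).mean().item()
        best, stall = (mse, 0) if mse < best else (best, stall + 1)
    return model
```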

For modeling performance comparison, we add a baseline model, which simply predicts the mean value of the response variable $ \bar{y} $. For a predictive model f, we propose the following relative efficiency criterion, based on the predictive mean square error (MSE) relative to the baseline model:

      $ R^2_{pseudo} = 1- \dfrac{MSE(f)}{MSE(\bar y)} = 1 - \dfrac{\sum\nolimits_{i = 1}^{n}(y_i - f({\boldsymbol{x}}_i))^2}{\sum\nolimits_{i = 1}^{n}(y_i - \bar{y})^2} $ (6)

The larger the value of $ R^2_{pseudo} $, the better the fitted model compared to the baseline model. The values shown in the figures are percentages, with a maximum possible value of 100.
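A one-line sketch of Eqn (6) in NumPy, reported as a percentage as in the figures:

```python
import numpy as np

def r2_pseudo(y, y_hat):
    # 100 * (1 - MSE(f) / MSE(baseline)), where the baseline predicts mean(y)
    return 100 * (1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))
```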

• Based on the proposed deep transfer learning approach, we further develop a permutation-based association test, called PT-TL-DNN, to identify significant genes for different phenotypes. We adapt the feature selection method of Liu et al.[12], and the test procedure is defined as follows. Assume that x is the candidate gene/region to be tested for association with disease Y. We denote by $ f({\boldsymbol{x}}|\widetilde{{\boldsymbol{S}}}_1, \widetilde{\boldsymbol{S}}_2) $ the deep transfer learning predictive model based on the train set D1 of the focused problem, and by L(Y,f(x)) a chosen loss function between the observed Y and the predicted f(x) in the test set D2 of the focused problem. To reduce chance findings due to random splitting, the test statistic is calculated based on K-fold cross-validation as follows:

      $ \Delta = \dfrac{1}{K}\sum\limits_{k = 1}^K \Delta_{k} = \dfrac{1}{K}\sum\limits_{k = 1}^K\sum\limits_{i\in D_{k,2}} \Big\{L(y_i,f_{k}({\boldsymbol{x}}_{i}|\widetilde{{\boldsymbol{S}}}_1, \widetilde{\boldsymbol{S}}_2)) - E\Big[L(y_i,f_{k}({\boldsymbol{x}}_i'|\widetilde{{\boldsymbol{S}}}_1, \widetilde{\boldsymbol{S}}_2))\Big]\Big\} $

where $ {\boldsymbol{x}}_{i}' $ represents the permuted data, obtained by randomly shuffling the indices of subjects while maintaining their intrinsic structures (e.g., linkage disequilibrium), and $ E\big[L(y_i,f_{k}({\boldsymbol{x}}_{i}'|\widetilde{{\boldsymbol{S}}}_1, \widetilde{\boldsymbol{S}}_2))\big] $ is the expected loss for the permuted data. If the gene is not associated with the disease, the observed loss will be close to the expected loss under permutation, hence $ \Delta $ is close to zero; if the gene is important for the prediction of the disease, the observed loss will be much smaller than the loss under permutation, resulting in $ \Delta < 0 $. Hence the association test is equivalent to the following hypothesis test:

$ H_0: \Delta \ge 0 \qquad \text{vs.} \qquad H_a: \Delta < 0 $

Under H0, it has been shown that $ \Delta \sim N(0,\sigma^2) $, assuming the empirical losses come from the same distribution[12], and $ \sigma^2 $ can be estimated empirically. The p-value is then computed to assess the significance of the gene.

During the TL-DNN model training before the permutation step, we apply the smooth group Lasso regularization p2 defined in Eqn (1). Compared to the square regularization, the smooth group Lasso regularization provides structured sparsity and allows entire groups of features to be removed, which may benefit the permutation test when all variants of a gene are tested together. The regularization parameter is denoted λSGL to distinguish it from the previous square penalty parameter, and it is selected from the set {0, 0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. The detailed algorithm of PT-TL-DNN is described in Table 1.

Table 1.  Algorithm for the permutation-based association test using transfer learning.

Permutation-based test using transfer learning with K-fold cross-validation
Input: Genetic variants of a gene x, phenotype y, and a set of candidate smooth group Lasso regularization parameters λSGL.
Output: Empirical p-value of the gene.
Step 1: Construct a TL-DNN model f(x) with 2 hidden layers.
Step 2: For k $ \leftarrow $ 1, ..., K do
1: Split the data into (xtrain, ytrain) and (xtest, ytest).
For each λi in λSGL do
a: Train f(xtrain, ytrain; λi) on (xtrain, ytrain) with smooth group Lasso regularization parameter λi; output $ \hat{f} $.
b: Evaluate the mean square error on (xtest, ytest), $ MSE(y_{test},\hat{f}(x_{test};\lambda_i)) $.
end.
2: Choose λopt with the lowest MSE; output $ \hat y_{test} = \hat{f}(x_{test};\lambda_{opt}) $ and calculate $ MSE(y_{test}, \hat{f}(x_{test};\lambda_{opt})) $.
3: Permute $ x_{test} $ by row, denoted $ x_{test}' $; calculate $ \hat y_{test}' = \hat{f}(x_{test}';\lambda_{opt}) $, $ MSE(y_{test}, \hat y'_{test}) $, and $ l = MSE(y_{test}, \hat y_{test})-MSE(y_{test}, \hat y_{test}') $.
4: Repeat step 3 B times to obtain $ l_1,...,l_B $. Calculate $ \Delta_k = \frac{1}{B}\sum_{b = 1}^{B}l_b $ and $ \hat{\sigma}_k^2 = var(l_1,...,l_B) $.
end.
Step 3: Calculate the statistic $ \Delta = \frac{1}{K}\sum\Delta_k $ with limiting distribution $ N(0, \sigma^2 = \frac{1}{K} \sum_{k=1}^{K} \sigma_k^2) $; calculate and output the p-value.
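For illustration, a NumPy/SciPy sketch of steps 3−4 and the final p-value computation follows. Here predict stands in for the trained fold model $ \hat{f} $, and B = 100 is an arbitrary illustrative choice, since the number of permutations is not fixed in the text.

```python
import numpy as np
from scipy.stats import norm

def fold_statistic(predict, x_test, y_test, B=100, seed=0):
    # Steps 3-4 for one fold: permute the rows of x_test (subjects are shuffled,
    # so the within-gene LD structure is preserved) and compare losses.
    rng = np.random.default_rng(seed)
    mse_obs = np.mean((y_test - predict(x_test)) ** 2)
    l = np.array([
        mse_obs - np.mean((y_test - predict(x_test[rng.permutation(len(x_test))])) ** 2)
        for _ in range(B)
    ])
    return l.mean(), l.var()  # Delta_k and sigma_k^2

def permutation_p_value(deltas, sigma2s):
    # Step 3: Delta = mean of Delta_k, sigma^2 = mean of sigma_k^2 (per Table 1);
    # one-sided p-value for H_a: Delta < 0.
    delta = np.mean(deltas)
    sigma = np.sqrt(np.mean(sigma2s))
    return norm.cdf(delta / sigma)
```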

For comparison, we also perform the permutation-based association test using the direct application of deep neural networks (PT-DNN) to the focused problem. In other words, we replace the transfer learning predictive model $ f_{k}({\boldsymbol{x}}_{i}|\widetilde{{\boldsymbol{S}}}_1, \widetilde{{\boldsymbol{S}}}_2) $ with $ f_{k}({\boldsymbol{x}}_{i}|\widehat{{\boldsymbol{S}}}_1, \widehat{{\boldsymbol{S}}}_2) $, the deep neural network model trained solely on the focused problem.

The advantage of the proposed test lies in the predictive ability of the transfer learning method for the candidate gene. On the one hand, when transfer learning achieves higher prediction accuracy, it further reduces the first loss term in $ \Delta $, resulting in a smaller p-value. On the other hand, when transfer learning is robust, with a small prediction standard error, it yields a small $ \sigma^2 $ and hence a smaller p-value.

• In this section, we apply the proposed transfer learning method and association test to investigate the relationships between nicotine addiction and candidate genes based on two relevant projects. Specifically, we conduct two transfer learning case studies: a cross-project case study and a cross-ethnicity case study. We show that transfer learning with large-scale data sets helps improve predictive accuracy and gene selection.

    • Cigarette smoking is one of the leading causes of preventable disease, contributing to 5 million deaths worldwide each year[16]. During the last decade, a great deal of progress has been made in identifying genetic variants associated with smoking. Among those findings, the nAChRs subunit genes (e.g., CHRNA5) have been identified and confirmed in several large-scale studies[17]. In this application, we apply the proposed transfer learning to study the complex relationships between the nAChRs subunit genes and nicotine dependence.

The datasets to be analyzed are the large-scale UK Biobank (UKB) dataset and the relatively small-scale dataset from the Study of Addiction: Genetics and Environment (SAGE). UKB is a population-based prospective cohort of nearly 500,000 individuals recruited in the United Kingdom at ages 40−69 years, containing a wealth of detailed clinical information, genome-wide genotype data, and whole-exome sequencing data[18]. SAGE is one of the most comprehensive studies conducted to date for discovering genetic contributions to substance use disorders; it included about 4,000 participants to study genetic associations with multiple phenotypes, including alcohol, nicotine, and other drug dependence. For our gene-based analysis, we used cigarettes per day (CPD), available in both SAGE and UKB, as the phenotype and considered genes from two clusters, CHRNA5-CHRNA3-CHRNB4[19] and CHRNB3-CHRNA6[20], that are associated with nicotine dependence. Before the analysis, we re-assessed the quality of the data (e.g., checking for successful genotype calls, missing rates, deviation from Hardy-Weinberg equilibrium, and unexpected relationships).

Two transfer learning studies are investigated in this section: a cross-project study and a cross-ethnicity study. In the cross-project study, we transfer the model trained on UKB to SAGE. In the cross-ethnicity study, we transfer the model trained on the white British population to the black and the white Irish populations. Together, these studies demonstrate the performance of the transfer learning approach.

To illustrate the implementation and evaluation of the transfer learning procedure, we split the dataset into a train set (80%) and a test set (20%). We train the original DNN model on the source data, transfer the trained parameters to the model fitted on the train set, and compare performance on the test set. To avoid chance findings due to data splitting, we repeat the random splitting 100 times, train the model on the train set of each split, and average the evaluation metrics over the test sets to assess model performance more reliably. A validation process is used to determine the value of λ: we split the train set into subtrain and validation sets with a ratio of 4:1, evaluate a range of candidate values of λ on the subtrain set, and select the optimal λ based on the validation set. The candidate values of λ are {10−1, 10−0.5, 1, 100.5, 101}. The learning rates r1 and r2 are both set to 10−3. The initial parameter values are generated from a normal distribution with mean 0 and standard deviation 10−3.
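A sketch of this validation step is given below; train_fn is a hypothetical helper that fits a model for a given λ and returns a predictor.

```python
import numpy as np

def select_lambda(train_fn, x, y, grid=(0.1, 10 ** -0.5, 1.0, 10 ** 0.5, 10.0), seed=0):
    # Split the train set 4:1 into subtrain and validation sets, fit one model
    # per candidate lambda on the subtrain set, and keep the best validation MSE.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    sub, val = idx[:cut], idx[cut:]
    mses = []
    for lam in grid:
        model = train_fn(x[sub], y[sub], lam)
        mses.append(np.mean((y[val] - model(x[val])) ** 2))
    return grid[int(np.argmin(mses))]
```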

• In this section, we transfer the model parameters from the large-scale UKB data to the relatively small-scale SAGE data. For this analysis, we focus on the Caucasian population in both datasets; the sample sizes are 288,039 in UKB and 2,517 in SAGE.

      For model prediction performance, we compute the relative efficiency criterion $ R^2_{pseudo} $ defined in Eqn (6) in both the train and test sets. The performance of the proposed TL-DNN and the direct application of DNN without transfer learning is shown in Fig. 3.

Figure 3. Prediction comparison regarding relative efficiency in the SAGE data set between the transfer learning (TL-DNN) from UK Biobank and the direct application of DNN without transfer learning.

As seen in Fig. 3, the TL-DNN model outperforms the DNN, with higher prediction efficiency in the test set for all five candidate genes. The TL-DNN is also robust to overfitting, while the DNN suffers from overfitting for most genes.

The permutation-based association test is used to evaluate the association of the five candidate genes with nicotine dependence. Table 2 summarizes the results of PT-DNN applied to the UKB Caucasian data, indicating that all five candidate genes are associated with nicotine dependence at the 0.05 significance level.

      Table 2.  PT-DNN results from the association of the five candidate genes in the UKB Caucasian sample.

Gene | $ \Delta $ | $ \hat\sigma $ | p-value
CHRNA3 | −1.33e−3 | 1.10e−4 | 0
CHRNA5 | −1.13e−3 | 1.01e−4 | 0
CHRNA6 | −8.24e−5 | 3.19e−5 | 4.88e−3
CHRNB3 | −1.20e−4 | 4.13e−5 | 2.19e−3
CHRNB4 | −1.40e−3 | 1.11e−4 | 0

When testing the genetic associations in the smaller SAGE data set, as shown in Table 3, PT-DNN without the transfer learning step identified four candidate genes but failed to identify CHRNB4 (p-value = 0.636). With the transfer learning step, PT-TL-DNN identified all five genes as significantly associated with nicotine dependence, with CHRNB4 having a p-value of 0.0256. Moreover, PT-TL-DNN provided smaller p-values than PT-DNN for all genes except CHRNB3, and it yielded smaller estimated standard deviations $ \hat\sigma $ for most genes.

      Table 3.  Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the SAGE data set.

Gene | PT-DNN ($ \Delta $, $ \hat{\sigma} $, p-value) | PT-TL-DNN ($ \Delta $, $ \hat{\sigma} $, p-value)
CHRNA3 | −0.0112, 3.81e−3, 1.66e−3 | −8.28e−3, 2.78e−3, 1.48e−3
CHRNA5 | −8.64e−3, 3.63e−3, 8.58e−3 | −7.79e−3, 3.26e−3, 8.41e−3
CHRNA6 | −9.16e−3, 3.18e−3, 1.97e−3 | −6.54e−3, 2.26e−3, 1.91e−3
CHRNB3 | −0.0139, 3.20e−3, 7.35e−6 | −7.75e−3, 2.53e−3, 1.09e−3
CHRNB4 | 4.85e−8, 1.39e−7, 0.636 | −5.15e−3, 2.64e−3, 0.0256
• The vast amount of genetic data collected from the Caucasian population provides us with a great resource for genetic research in other populations, especially minority populations. In the UKB dataset, there are 271,240 white British, 7,349 white Irish, and 6,219 black participants. In this case study, our target populations are the white Irish population and the black population, while the white British population is used as the source population. We transfer the model parameters trained on the white British population to either the white Irish or the black population.

As above, the prediction performance of the proposed TL-DNN and the direct application of DNN without transfer learning is shown in Fig. 4 for the white Irish population and in Fig. 5 for the black population.

Figure 4. Prediction comparison regarding relative efficiency in the UKB white Irish population between the transfer learning (TL-DNN) from the white British population and the direct application of DNN without transfer learning.

Figure 5. Prediction comparison regarding relative efficiency in the UKB black population between the transfer learning (TL-DNN) from the white British population and the direct application of DNN without transfer learning.

Both Figs 4 and 5 show that transfer learning achieves higher relative efficiency for CHRNA3, CHRNA5, and CHRNB4, while it shows similar or slightly worse performance for CHRNA6 and CHRNB3. These results may be related to genetic heterogeneity in CHRNA6 and CHRNB3, heterogeneity that is not significant in the other three genes, consistent with our previous finding[21].

For the association analysis, Table 4 summarizes the results of PT-DNN in the UKB white British sample, indicating that all five candidate genes are significantly associated with nicotine dependence.

      Table 4.  PT-DNN results from the association analysis of five candidate genes in the UKB white British sample.

Gene | $ \Delta $ | $ \hat\sigma $ | p-value
CHRNA3 | −6.60e−4 | 9.38e−5 | 1.02e−12
CHRNA5 | −6.20e−4 | 1.07e−4 | 3.83e−9
CHRNA6 | −6.09e−5 | 2.39e−5 | 5.474e−3
CHRNB3 | −1.00e−4 | 3.88e−5 | 4.63e−3
CHRNB4 | −1.02e−3 | 1.27e−4 | 5.00e−16

However, when applying PT-DNN to the UKB white Irish and black samples, whose sample sizes are limited, it yielded less significant results. Nevertheless, the proposed PT-TL-DNN had improved power for testing all the candidate genes by using the model parameters learned from the white British sample. Tables 5 and 6 present the testing results from PT-DNN and PT-TL-DNN for the UKB white Irish and black samples, respectively. As seen in Table 5, PT-TL-DNN identified all five genes as significantly associated with nicotine dependence in the white Irish sample, while PT-DNN failed to detect the association of three genes (CHRNA5, CHRNA6, and CHRNB4). Similarly, for the association test in the UKB black sample, PT-DNN identified only one gene (CHRNA3), whereas the proposed PT-TL-DNN identified four genes (CHRNA3, CHRNA5, CHRNA6, and CHRNB4) associated with nicotine dependence.

      Table 5.  Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the UKB white Irish sample.

Gene | PT-DNN ($ \Delta $, $ \hat{\sigma} $, p-value) | PT-TL-DNN ($ \Delta $, $ \hat{\sigma} $, p-value)
CHRNA3 | −1.15e−3, 6.48e−4, 0.0378 | −6.95e−4, 3.67e−4, 0.0291
CHRNA5 | −8.40e−4, 5.22e−4, 0.0529 | −1.81e−3, 7.90e−4, 0.0110
CHRNA6 | −4.20e−4, 5.00e−4, 0.201 | −1.07e−3, 3.97e−4, 3.58e−3
CHRNB3 | −1.58e−3, 7.14e−4, 0.0132 | −2.59e−3, 9.68e−4, 3.75e−3
CHRNB4 | 8.60e−4, 6.09e−4, 0.0789 | −1.02e−3, 4.68e−4, 0.0145

      Table 6.  Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the UKB black sample.

Gene | PT-DNN ($ \Delta $, $ \hat{\sigma} $, p-value) | PT-TL-DNN ($ \Delta $, $ \hat{\sigma} $, p-value)
CHRNA3 | −3.40e−3, 1.73e−3, 0.0247 | −2.90e−3, 9.20e−4, 9.10e−4
CHRNA5 | −2.50e−4, 4.94e−4, 0.305 | −5.00e−3, 1.80e−3, 2.69e−3
CHRNA6 | −1.09e−5, 2.34e−5, 0.679 | −3.60e−3, 1.12e−3, 5.90e−4
CHRNB3 | 1.88e−11, 4.35e−11, 0.667 | −1.20e−3, 1.28e−3, 0.183
CHRNB4 | −1.62e−3, 1.07e−3, 0.0651 | −2.80e−3, 1.53e−3, 0.0336
    • Our capacity to detect novel genes in small-scale studies or minority populations (e.g., African Americans) is often limited by the small sample size. The vast amount of genetic data collected from biobank projects and the Caucasian population provides us with an additional resource for genetic research in small-scale studies or minority populations. In this paper, we have proposed a transfer learning procedure in deep neural networks for genetic risk prediction and genetic association analyses in small-scale studies or different populations. Through a cross-project study and a cross-ethnicity study, we demonstrate the advantages of transfer learning in terms of prediction accuracy and testing power.

To further address the potential genetic differences among ethnic groups, we investigated the individual SNP p-values and effect sizes for all five genes in the white British, white Irish, and black populations using the software PLINK[22] and the R package CMplot[23]. From Fig. 6, we observe that the three populations show different patterns in the genes CHRNA6 and CHRNB3. Moreover, in the black population, these two genes contain more significant SNPs than the other three genes. These findings help explain why the DNN without transfer learning performs more effectively than transfer learning for these two genes in the black sample, whereas for the three genes with more significant SNPs in the white British population, transfer learning attains better performance than DNN alone. These observations may partially explain the differences in model prediction performance in the cross-ethnicity case study.

Figure 6. Illustration of genetic heterogeneity among the UKB ethnic population groups regarding individual SNP p-values and effect sizes for the five genes in the white British, white Irish, and black populations.

Transfer learning is expected to become one of the key drivers of machine learning success. It does not require that the data from two studies be drawn from the same feature space and the same distribution[5]. In most cases, as long as part of the model parameters are shared between the two studies, transfer learning can help improve performance. In our studies, we transferred model parameters across different studies and populations, and in most scenarios transfer learning improved model accuracy and enhanced testing power. Nevertheless, when there is a large discrepancy between the source data and the primary data, we expect poor performance of transfer learning. Under such circumstances, both datasets should be carefully examined, and additional procedures (e.g., testing the heterogeneity of the data resources) need to be incorporated into the transfer learning approach[14]. For the theoretical basis of transfer learning, we refer readers to previous studies[24−26].

Besides its application to small-scale studies and minority populations, transfer learning can be used for other purposes, such as transferring between different species. While human studies play a significant role in genetic research, human research can be restricted by study design and high cost. In contrast, animal research is more flexible and can adopt different designs. Given experimental results generated under well-controlled conditions, it would be valuable to transfer these results to humans.

Limitations of this study include the limited sample sizes and the generality of the results in the two case studies. The proposed transfer learning techniques can be further explored and applied to a wider range of applications, such as different types of omics data, other complex traits, and diverse populations. As with most applications of transfer learning, how to measure transferability across domains and avoid negative transfer remains an important issue. Finally, despite advances in the theoretical basis of transfer learning, theoretical studies of the proposed deep transfer learning can be conducted in the future to better understand its properties.

References

[1] Iman M, Arabnia HR, Rasheed K. 2023. A review of deep transfer learning and recent advancements. Technologies 11(2):40. doi: 10.3390/technologies11020040
[2] Torrey L, Shavlik J. 2010. Transfer learning. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, eds. Soria E, Martín-Guerrero JD, Martinez M, Magdalena R, Serrano AJ. USA: IGI Global Scientific Publishing. pp. 242−64. doi: 10.4018/978-1-60566-766-9.ch011
[3] Li S, Cai TT, Li H. 2022. Transfer learning for high-dimensional linear regression: prediction, estimation and minimax optimality. Journal of the Royal Statistical Society Series B: Statistical Methodology 84(1):149−73. doi: 10.1111/rssb.12479
[4] Tian Y, Feng Y. 2023. Transfer learning under high-dimensional generalized linear models. Journal of the American Statistical Association 118(544):2684−97. doi: 10.1080/01621459.2022.2071278
[5] Pan SJ, Yang Q. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345−59. doi: 10.1109/TKDE.2009.191
[6] Ribani R, Marengoni M. 2019. A survey of transfer learning for convolutional neural networks. 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T), Rio de Janeiro, Brazil. USA: IEEE. pp. 47−57. doi: 10.1109/SIBGRAPI-T.2019.00010
[7] Tan C, Sun F, Kong T, Zhang W, Yang C, et al. 2018. A survey on deep transfer learning. Artificial Neural Networks and Machine Learning − ICANN 2018: 27th International Conference on Artificial Neural Networks, Proceedings. Cham: Springer. pp. 270−79. doi: 10.1007/978-3-030-01424-7_27
[8] López-García G, Jerez JM, Franco L, Veredas FJ. 2020. Transfer learning with convolutional neural networks for cancer survival prediction using gene-expression data. PLoS One 15(3):e0230536. doi: 10.1371/journal.pone.0230536
[9] Muneeb M, Feng S, Henschel A. 2022. Transfer learning for genotype−phenotype prediction using deep learning models. BMC Bioinformatics 23(1):511. doi: 10.1186/s12859-022-05036-8
[10] Altmann A, Toloşi L, Sander O, Lengauer T. 2010. Permutation importance: a corrected feature importance measure. Bioinformatics 26(10):1340−47. doi: 10.1093/bioinformatics/btq134
[11] Mi X, Zou B, Zou F, Hu J. 2021. Permutation-based identification of important biomarkers for complex diseases via machine learning models. Nature Communications 12(1):3008. doi: 10.1038/s41467-021-22756-2
[12] Liu L, Meng Q, Weng C, Lu Q, Wang T, et al. 2022. Explainable deep transfer learning model for disease risk prediction using high-dimensional genomic data. PLoS Computational Biology 18(7):e1010328. doi: 10.1371/journal.pcbi.1010328
[13] Wang J, Zhang H, Wang J, Pu Y, Pal NR. 2021. Feature selection using a neural network with group lasso regularization and controlled redundancy. IEEE Transactions on Neural Networks and Learning Systems 32(3):1110−23. doi: 10.1109/TNNLS.2020.2980383
[14] Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, et al. 2020. A comprehensive survey on transfer learning. Proceedings of the IEEE 109(1):43−76. doi: 10.1109/JPROC.2020.3004555
[15] Kingma DP, Ba J. 2014. Adam: a method for stochastic optimization. arXiv preprint. doi: 10.48550/arXiv.1412.6980
[16] Mathers CD, Loncar D. 2006. Projections of global mortality and burden of disease from 2002 to 2030. PLoS Medicine 3(11):e442. doi: 10.1371/journal.pmed.0030442
[17] Liu JZ, Tozzi F, Waterworth DM, Pillai SG, Muglia P, et al. 2010. Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nature Genetics 42(5):436−40. doi: 10.1038/ng.572
[18] Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, et al. 2018. The UK Biobank resource with deep phenotyping and genomic data. Nature 562:203−9. doi: 10.1038/s41586-018-0579-z
[19] Weiss RB, Baker TB, Cannon DS, von Niederhausern A, Dunn DM, et al. 2008. A candidate gene approach identifies the CHRNA5-A3-B4 region as a risk factor for age-dependent nicotine addiction. PLoS Genetics 4(7):e1000125. doi: 10.1371/journal.pgen.1000125
[20] Zeiger JS, Haberstick BC, Schlaepfer I, Collins AC, Corley RP, et al. 2008. The neuronal nicotinic receptor subunit genes (CHRNA6 and CHRNB3) are associated with subjective responses to tobacco. Human Molecular Genetics 17(5):724−34. doi: 10.1093/hmg/ddm344
[21] Zhang X, Lan T, Wang T, Xue W, Tong X, et al. 2019. Considering genetic heterogeneity in the association analysis finds genes associated with nicotine dependence. Frontiers in Genetics 10:448. doi: 10.3389/fgene.2019.00448
[22] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, et al. 2007. PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81(3):559−75. doi: 10.1086/519795
[23] Yin L, Zhang H, Tang Z, Xu J, Yin D, et al. 2021. rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. Genomics, Proteomics & Bioinformatics 19(4):619−28. doi: 10.1016/j.gpb.2020.10.007
[24] Cody T, Beling PA. 2023. A systems theory of transfer learning. IEEE Systems Journal 17(1):26−37. doi: 10.1109/JSYST.2022.3224650
[25] Tripuraneni N, Jordan M, Jin C. 2020. On the theory of transfer learning: the importance of task diversity. Advances in Neural Information Processing Systems: Annual Conference on Neural Information Processing Systems (NeurIPS 2020). pp. 7852−62
[26] Yang L, Hanneke S, Carbonell J. 2013. A theory of transfer learning with applications to active learning. Machine Learning 90:161−89. doi: 10.1007/s10994-012-5310-y

• The authors confirm their contributions to the paper as follows: study conception and design: Lu Q, Geng P; analysis: Zhang S, Zhou Y; interpretation of results: Lu Q, Geng P, Dong K, Liu J; manuscript preparation: Geng P, Zhang S, Zhou Y. All authors reviewed the results and approved the final version of the manuscript.

• The authors wish to thank the two reviewers for their comments, which greatly improved the manuscript.

      • The authors declare that they have no conflict of interest.


• Copyright: © 2025 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0); see https://creativecommons.org/licenses/by/4.0/.
• About this article
Cite this article
Zhang S, Zhou Y, Dong K, Liu J, Geng P, et al. 2025. Predictive modeling and inference using deep transfer learning in genetic data analysis. Statistics Innovation 2: e003. doi: 10.48130/stati-0025-0003
