Predictive modeling and inference using deep transfer learning in genetic data analysis

Shan Zhang; Yuan Zhou; Kejin Dong; Jinling Liu; Pei Geng; Qing Lu

doi:10.48130/stati-0025-0003

Figure 1.
An illustration of transfer learning. The goal is to build predictive models and conduct inference in the primary data for the focused problem. The modeling parameters learned from the source data of a related problem are transferred to the modeling process in the primary problem.
Figure 2.
Transfer learning in deep neural networks. The left panel shows the application of neural networks in the focused problem while the right panel displays the proposed transfer learning method based on training the neural networks in the related problem. Here $ \hat{{\boldsymbol{S}}}_1' $ are the transferred weight parameters in hidden layers. The output layer weights $ \tilde{\boldsymbol{S}}_2 $ are updated during the training while $ \hat{{\boldsymbol{S}}}_1' $ are frozen.
Figure 3.
Prediction comparison regarding relative efficiency in the SAGE data set between the transfer learning (TL-DNN) from UK Biobank and the direct application of DNN without transfer learning.
Figure 4.
Prediction comparison regarding relative efficiency in the SAGE data set between the transfer learning (TL-DNN) from UK Biobank and the direct application of DNN without transfer learning.
Figure 5.
Prediction comparison regarding relative efficiency in the UKB black population between the transfer learning (TL-DNN) from the white British population and the direct application of DNN without transfer learning.
Figure 6.
Illustration of genetic heterogeneity among the UKB ethnic population groups in terms of individual SNP p-values and effect sizes of the five genes in the white British, white Irish, and black populations.
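
The freezing scheme described in the Figure 2 caption (hidden-layer weights $ \hat{{\boldsymbol{S}}}_1' $ transferred and frozen, output-layer weights $ \tilde{\boldsymbol{S}}_2 $ updated on the primary data) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, tanh activations are assumed, the "transferred" weights are random placeholders rather than weights trained on source data, and the output layer is fitted by closed-form ridge least squares as a stand-in for the iterative training a DNN would normally use.

```python
import numpy as np

def hidden_features(x, frozen_hidden):
    """Forward pass through the transferred hidden layers (tanh assumed)."""
    h = x
    for w, b in frozen_hidden:
        h = np.tanh(h @ w + b)
    return h

def _with_intercept(h):
    """Append a column of ones so the output layer has an intercept."""
    return np.column_stack([h, np.ones(len(h))])

def fit_output_layer(x, y, frozen_hidden, ridge=1e-3):
    """Fit only the output-layer weights; the hidden weights are read, never modified.

    Closed-form ridge least squares is a stand-in for gradient-based training.
    """
    h1 = _with_intercept(hidden_features(x, frozen_hidden))
    gram = h1.T @ h1 + ridge * np.eye(h1.shape[1])
    return np.linalg.solve(gram, h1.T @ y)

def predict(x, frozen_hidden, out_w):
    """Predict with frozen hidden layers and the fitted output layer."""
    return _with_intercept(hidden_features(x, frozen_hidden)) @ out_w

# Illustrative run: 10 variants, two hidden layers of width 8 and 4.
rng = np.random.default_rng(0)
# Placeholder for weights learned on the source data (random here).
frozen_hidden = [
    (rng.normal(size=(10, 8)), np.zeros(8)),
    (rng.normal(size=(8, 4)), np.zeros(4)),
]
x_primary = rng.integers(0, 3, size=(50, 10)).astype(float)  # genotypes coded 0/1/2
y_primary = rng.normal(size=50)
out_w = fit_output_layer(x_primary, y_primary, frozen_hidden)
y_hat = predict(x_primary, frozen_hidden, out_w)
```

Because only `out_w` is estimated, the number of parameters fitted on the primary data equals the width of the last hidden layer plus one, which is the practical appeal of freezing when the primary sample is small.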

Permutation-based test using transfer learning with K-fold cross-validation
Input: Genetic variants of a gene x, phenotype y, a set of candidate smooth Group Lasso regularization parameters λ_SGL.
Output: Empirical p-value of the gene.
Step 1: Construct a TL-DNN model f(x) with 2 hidden layers.
Step 2: For k $ \leftarrow $ 1, ..., K do
1: Split the data into (x_train, y_train) and (x_test, y_test).
For each λ_i in λ_SGL do
a: Train f on (x_train, y_train) with smooth Group Lasso regularization parameter λ_i and output $ \hat{f}(\cdot;\lambda_i) $.
b: Evaluate the mean squared error on (x_test, y_test), $ MSE(y_{test},\hat{f}(x_{test};\lambda_i)) $.
end.
2: Choose λ_opt with the lowest MSE, output $ \hat y_{test} = \hat{f}(x_{test};\lambda_{opt}) $ and calculate $ MSE(y_{test}, \hat y_{test}) $.
3: Permute the rows of $ x_{test} $ to obtain $ x_{test}' $, then calculate $ \hat y_{test}' = \hat{f}(x_{test}';\lambda_{opt}) $, $ MSE(y_{test}, \hat y_{test}') $ and $ l = MSE(y_{test}, \hat y_{test})-MSE(y_{test}, \hat y_{test}') $.
4: Repeat 3 B times to obtain $ l_1,...,l_B $. Calculate $ \Delta_k = \frac{1}{B}\sum_{b = 1}^{B}l_b $ and $ \hat{\sigma}_k^2 = var(l_1,...,l_B) $.
end.
Step 3: Calculate the statistic $ \Delta = \frac{1}{K}\sum_{k = 1}^{K}\Delta_k $ with limiting distribution $ N(0, \sigma^2) $, where $ \sigma^2 = \frac{1}{K} \sum_{k = 1}^{K} \sigma_k^2 $ and each $ \sigma_k^2 $ is estimated by $ \hat{\sigma}_k^2 $; calculate and output the p-value.
end.

Table 1.

Algorithm for the permutation-based association test using transfer learning.
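
The steps of the algorithm above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the learner `fit_ridge` is a plain ridge regression standing in for the TL-DNN with smooth Group Lasso penalty, all function names are hypothetical, and the defaults K = 5 and B = 100 are arbitrary. Step 3 does not state how the p-value is derived from the limiting normal distribution; the sketch assumes a one-sided lower-tail p-value $ \Phi(\Delta/\hat\sigma) $, which reproduces, for example, a p-value of about 0.636 for $ \Delta $ = 4.85 × 10⁻⁸ and $ \hat\sigma $ = 1.39 × 10⁻⁷.

```python
import math
import numpy as np

def fit_ridge(x, y, lam):
    """Stand-in learner (NOT the TL-DNN): ridge regression with penalty lam."""
    xc = np.column_stack([x, np.ones(len(x))])
    w = np.linalg.solve(xc.T @ xc + lam * np.eye(xc.shape[1]), xc.T @ y)
    return lambda x_new: np.column_stack([x_new, np.ones(len(x_new))]) @ w

def mse(y, y_hat):
    """Mean squared error between observed and predicted phenotypes."""
    return float(np.mean((y - y_hat) ** 2))

def lower_tail_p(delta, sigma):
    """Assumed p-value: standard normal CDF evaluated at delta / sigma."""
    return 0.5 * math.erfc(-delta / (sigma * math.sqrt(2.0)))

def permutation_test(x, y, lambdas, fit=fit_ridge, K=5, B=100, seed=0):
    """Return (Delta, sigma_hat, p) following Steps 2 and 3 of the algorithm."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    deltas, variances = [], []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        # Steps 2.1a-b: fit one model per lambda and score it on the test fold.
        models = [fit(x[train], y[train], lam) for lam in lambdas]
        errors = [mse(y[test], m(x[test])) for m in models]
        # Step 2.2: keep the model with the lowest test MSE.
        best = models[int(np.argmin(errors))]
        base = min(errors)
        # Steps 2.3-2.4: permute the rows of x_test B times and record l_b.
        l = np.array([
            base - mse(y[test], best(x[test][rng.permutation(len(test))]))
            for _ in range(B)
        ])
        deltas.append(l.mean())
        variances.append(l.var(ddof=1))
    # Step 3: average over folds; sigma^2 is the mean of the fold variances.
    delta = float(np.mean(deltas))
    sigma = math.sqrt(float(np.mean(variances)))
    return delta, sigma, lower_tail_p(delta, sigma)

# Illustrative run on simulated genotypes (coded 0/1/2) with a true signal.
rng = np.random.default_rng(42)
x_sim = rng.integers(0, 3, size=(300, 5)).astype(float)
y_sim = x_sim @ np.array([0.8, -0.5, 0.0, 0.3, 0.0]) + rng.normal(scale=0.5, size=300)
delta_sim, sigma_sim, p_sim = permutation_test(x_sim, y_sim, lambdas=[0.01, 0.1, 1.0])
```

A strongly negative $ \Delta $ means that breaking the link between genotype rows and phenotypes makes prediction worse, which is the evidence of association the test looks for. Any learner exposing the same `fit(x, y, lam)` interface could replace `fit_ridge`.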

Gene	$ \Delta $	$ \hat\sigma $	p-value
CHRNA3	−1.33 × 10⁻³	1.10 × 10⁻⁴	0
CHRNA5	−1.13 × 10⁻³	1.01 × 10⁻⁴	0
CHRNA6	−8.24 × 10⁻⁵	3.19 × 10⁻⁵	4.88 × 10⁻³
CHRNB3	−1.20 × 10⁻⁴	4.13 × 10⁻⁵	2.19 × 10⁻³
CHRNB4	−1.4 × 10⁻³	1.11 × 10⁻⁴	0

Table 2.

PT-DNN results from the association analysis of the five candidate genes in the UKB Caucasian sample.

Gene	PT-DNN			PT-TL-DNN
Gene	$ \Delta $	$ \hat{\sigma} $	p-value	$ \Delta $	$ \hat{\sigma} $	p-value
CHRNA3	−0.0112	3.81 × 10⁻³	1.66 × 10⁻³	−8.28 × 10⁻³	2.78 × 10⁻³	1.48 × 10⁻³
CHRNA5	−8.64 × 10⁻³	3.63 × 10⁻³	8.58 × 10⁻³	−7.79 × 10⁻³	3.26 × 10⁻³	8.41 × 10⁻³
CHRNA6	−9.16 × 10⁻³	3.18 × 10⁻³	1.97 × 10⁻³	−6.54 × 10⁻³	2.26 × 10⁻³	1.91 × 10⁻³
CHRNB3	−0.0139	3.20 × 10⁻³	7.35 × 10⁻⁶	−7.75 × 10⁻³	2.53 × 10⁻³	1.09 × 10⁻³
CHRNB4	4.85 × 10⁻⁸	1.39 × 10⁻⁷	0.636	−5.15 × 10⁻³	2.64 × 10⁻³	0.0256

Table 3.

Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the SAGE data set.

Gene	$ \Delta $	$ \hat\sigma $	p-value
CHRNA3	−6.6 × 10⁻⁴	9.38 × 10⁻⁵	1.02 × 10⁻¹²
CHRNA5	−6.20 × 10⁻⁴	1.07 × 10⁻⁴	3.83 × 10⁻⁹
CHRNA6	−6.09 × 10⁻⁵	2.39 × 10⁻⁵	5.474 × 10⁻³
CHRNB3	−1.00 × 10⁻⁴	3.88 × 10⁻⁵	4.63 × 10⁻³
CHRNB4	−1.02 × 10⁻³	1.27 × 10⁻⁴	5.00 × 10⁻¹⁶

Table 4.

PT-DNN results from the association analysis of the five candidate genes in the UKB white British sample.

Gene	PT-DNN			PT-TL-DNN
Gene	$ \Delta $	$ \hat{\sigma} $	p-value	$ \Delta $	$ \hat{\sigma} $	p-value
CHRNA3	−1.15 × 10⁻³	6.48 × 10⁻⁴	0.0378	−6.95 × 10⁻⁴	3.67 × 10⁻⁴	0.0291
CHRNA5	−8.40 × 10⁻⁴	5.22 × 10⁻⁴	0.0529	−1.81 × 10⁻³	7.9 × 10⁻⁴	0.0110
CHRNA6	−4.20 × 10⁻⁴	5.00 × 10⁻⁴	0.201	−1.07 × 10⁻³	3.97 × 10⁻⁴	3.58 × 10⁻³
CHRNB3	−1.58 × 10⁻³	7.14 × 10⁻⁴	0.0132	−2.59 × 10⁻³	9.68 × 10⁻⁴	3.75 × 10⁻³
CHRNB4	8.60 × 10⁻⁴	6.09 × 10⁻⁴	0.0789	−1.02 × 10⁻³	4.68 × 10⁻⁴	0.0145

Table 5.

Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the UKB white Irish sample.

Gene	PT-DNN			PT-TL-DNN
Gene	$ \Delta $	$ \hat{\sigma} $	p-value	$ \Delta $	$ \hat{\sigma} $	p-value
CHRNA3	−3.40 × 10⁻³	1.73 × 10⁻³	0.0247	−2.9 × 10⁻³	9.2 × 10⁻⁴	9.1 × 10⁻⁴
CHRNA5	−2.50 × 10⁻⁴	4.94 × 10⁻⁴	0.305	−5.00 × 10⁻³	1.80 × 10⁻³	2.69 × 10⁻³
CHRNA6	−1.09 × 10⁻⁵	2.34 × 10⁻⁵	0.679	−3.60 × 10⁻³	1.12 × 10⁻³	5.90 × 10⁻⁴
CHRNB3	1.88 × 10⁻¹¹	4.35 × 10⁻¹¹	0.667	−1.20 × 10⁻³	1.28 × 10⁻³	0.183
CHRNB4	−1.62 × 10⁻³	1.07 × 10⁻³	0.0651	−2.80 × 10⁻³	1.53 × 10⁻³	0.0336

Table 6.

Comparison between the permutation-based test without transfer learning (PT-DNN) and with transfer learning (PT-TL-DNN) in the UKB black sample.