Predicting adult phenotypes from seedling transcriptional data using deep learning: a case study in chrysanthemum

Hua Zhang; Chenyuan Mao; Ziting Guo; Manman Sun; Jingshan Lu; Zhiyong Guan; Jiafu Jiang; Sumei Chen; Fadi Chen; Likai Wang

doi:10.48130/gcomm-0026-0011

2026 Volume 3

ARTICLE Open Access

1. State Key Laboratory of Crop Genetics & Germplasm Enhancement and Utilization, Key Laboratory of Flower Biology and Germplasm Innovation, Ministry of Agriculture and Rural Affairs, College of Horticulture, Nanjing Agricultural University, Nanjing 210095, China
2. Zhongshan Biological Breeding Laboratory, Nanjing 210014, China
^# Authors contributed equally: Hua Zhang, Chenyuan Mao, Ziting Guo

Corresponding author: wlk@njau.edu.cn (Wang L)

Received: 06 April 2026
Revised: 05 May 2026
Accepted: 20 May 2026
Published online: 17 June 2026
Genomics Communications 3, Article number: e013 (2026) | Cite this article

Abstract

Genotype-to-phenotype prediction remains a fundamental challenge in genetic research. In recent years, it has become possible to construct predictive models based on genomic data. However, in many horticultural crops, genomic variations are difficult to verify accurately because of the complexity of their genomes, which makes these genome-based methods challenging to apply. Gene expression reflects both genetic regulatory mechanisms and environmental stimuli, offering potential for predicting phenotypes in plants with complex genomes. Thus, in this paper, we tested the possibility of predicting adult plant phenotypes using gene expression data from seedlings. We applied transcriptional-based deep learning methods to cut chrysanthemums (Chrysanthemum spp.), which exhibit a complex genetic background characterized by high repetitiveness, high heterozygosity, and a large genome size and are recognized as segmental allopolyploids. On the basis of gene expression data, the method was robust and accurate for predicting continuous variables such as leaf vase life, as well as categorical variables such as flower type. Moreover, the power and performance of transcriptional-based deep learning methods for prediction were validated in rice (Oryza sativa). Our research shows the good performance of phenotype prediction based on gene expression, with potential applications in future gene chip-based breeding practices.
- Genotype-to-phenotype prediction
- Machine learning
- Transcriptional data

Supplementary information

Supplementary Fig. S1 The geographic distance between HuShu (HS) and BaGuaZhou (BGZ).
Supplementary Fig. S2 Feature selection using LASSO algorithm.
Supplementary Fig. S3 Leaf vase life prediction using simple machine learning models.
Supplementary Fig. S4 Flower prediction using LASSO regression algorithm.

Rights and permissions
Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1]	Li X, Guo T, Mu Q, Li X, Yu J. 2018. Genomic and environmental determinants and their interplay underlying phenotypic plasticity. Proceedings of the National Academy of Sciences of the United States of America 115:6679−6684 doi: 10.1073/pnas.1718326115
[2]	Zhou D, Liu X, Wang X, Yan F, Wang P, et al. 2021. A prognostic nomogram based on LASSO Cox regression in patients with alpha-fetoprotein-negative hepatocellular carcinoma following non-surgical therapy. BMC Cancer 21:246 doi: 10.1186/s12885-021-07916-3
[3]	Liu Z, Liu L, Weng S, Guo C, Dang Q, et al. 2022. Machine learning-based integration develops an immune-derived lncRNA signature for improving outcomes in colorectal cancer. Nature Communications 13:816 doi: 10.1038/s41467-022-28421-6
[4]	Sage AT, Donahoe LL, Shamandy AA, Mousavi SH, Chao BT, et al. 2023. A machine-learning approach to human ex vivo lung perfusion predicts transplantation outcomes and promotes organ utilization. Nature Communications 14:4810 doi: 10.1038/s41467-023-40468-7
[5]	Huang Y, Li J, Li M, Aparasu RR. 2023. Application of machine learning in predicting survival outcomes involving real-world data: a scoping review. BMC Medical Research Methodology 23:268 doi: 10.1186/s12874-023-02078-1
[6]	Richens JG, Lee CM, Johri S. 2020. Improving the accuracy of medical diagnosis with causal machine learning. Nature Communications 11:3923 doi: 10.1038/s41467-020-17419-7
[7]	Schreurs M, Piampongsant S, Roncoroni M, Cool L, Herrera-Malaver B, et al. 2024. Predicting and improving complex beer flavor through machine learning. Nature Communications 15:2368 doi: 10.1038/s41467-024-46346-0
[8]	Tuwani R, Wadhwa S, Bagler G. 2019. BitterSweet: building machine learning models for predicting the bitter and sweet taste of small molecules. Scientific Reports 9:7155 doi: 10.1038/s41598-019-43664-y
[9]	Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM. 2018. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 Genes|Genomes|Genetics 8:3813−3828 doi: 10.1534/g3.118.200740
[10]	Montesinos-López OA, Montesinos-López A, Crossa J, Gianola D, Hernández-Suárez CM, et al. 2018. Multi-trait, multi-environment deep learning modeling for genomic-enabled prediction of plant traits. G3 Genes|Genomes|Genetics 8:3829−3840 doi: 10.1534/g3.118.200728
[11]	Grinberg NF, Orhobor OI, King RD. 2020. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Machine Learning 109:251−277 doi: 10.1007/s10994-019-05848-5
[12]	Yan J, Wang X. 2023. Machine learning bridges omics sciences and plant breeding. Trends in Plant Science 28:199−210 doi: 10.1016/j.tplants.2022.08.018
[13]	Parmley KA, Higgins RH, Ganapathysubramanian B, Sarkar S, Singh AK. 2019. Machine learning approach for prescriptive plant breeding. Scientific Reports 9:17132 doi: 10.1038/s41598-019-53451-4
[14]	Vourlaki IT, Ramos-Onsins SE, Pérez-Enciso M, Castanera R. 2024. Evaluation of deep learning for predicting rice traits using structural and single-nucleotide genomic variants. Plant Methods 20:121 doi: 10.1186/s13007-024-01250-y
[15]	Amarasinghe A, Sangarasekara I, De Silva N, Ariyaratne M, Amarasinghe R, et al. 2024. Advancing food sustainability: a case study on improving rice yield prediction in Sri Lanka using weather-based, feature-engineered machine learning models. Discover Applied Sciences 6:603 doi: 10.1007/s42452-024-06300-7
[16]	Yang HE, Kim NW, Lee HG, Kim MJ, Sang WG, et al. 2024. Prediction of protein content in paddy rice (Oryza sativa L.) combining near-infrared spectroscopy and deep-learning algorithm. Frontiers in Plant Science 15:1398762 doi: 10.3389/fpls.2024.1398762
[17]	Guo T, Li X. 2023. Machine learning for predicting phenotype from genotype and environment. Current Opinion in Biotechnology 79:102853 doi: 10.1016/j.copbio.2022.102853
[18]	Cheng CY, Li Y, Varala K, Bubert J, Huang J, et al. 2021. Evolutionarily informed machine learning enhances the power of predictive gene-to-phenotype relationships. Nature Communications 12:5627 doi: 10.1038/s41467-021-25893-w
[19]	Song A, Su J, Wang H, Zhang Z, Zhang X, et al. 2023. Analyses of a chromosome-scale genome assembly reveal the origin and evolution of cultivated chrysanthemum. Nature Communications 14:2021 doi: 10.1038/s41467-023-37730-3
[20]	Clevenger J, Chavarro C, Pearl SA, Ozias-Akins P, Jackson SA. 2015. Single nucleotide polymorphism identification in polyploids: a review, example, and recommendations. Molecular Plant 8:831−846 doi: 10.1016/j.molp.2015.02.002
[21]	Bray NL, Pimentel H, Melsted P, Pachter L. 2016. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34:525−527 doi: 10.1038/nbt.3519
[22]	Alhamzawi R, Ali HTM. 2018. The Bayesian adaptive lasso regression. Mathematical Biosciences 303:75−82 doi: 10.1016/j.mbs.2018.06.004
[23]	Hornik K, Buchta C, Zeileis A. 2009. Open-source machine learning: R meets Weka. Computational Statistics 24:225−232 doi: 10.1007/s00180-008-0119-7
[24]	Wang L, Xi Y, Sung S, Qiao H. 2018. RNA-seq assistant: machine learning based methods to identify more transcriptional regulated genes. BMC Genomics 19:546 doi: 10.1186/s12864-018-4932-2
[25]	Yang Y, Qian Z, Li Y, Liu R, Qiao H, et al. 2025. Competition binding of CmbHLH1L and CmNLP6/7L to CmbHLH63 modulates leaf senescence in chrysanthemum. Plant Biotechnology Journal 23:5456−5473 doi: 10.1111/pbi.70308
[26]	Song J, Zhang Y, Hu X, Shao Z, Song A, et al. 2026. Spongy mesophyll cell death is induced by jasmonic acid during leaf senescence in chrysanthemum. The Plant Cell 38:koag095 doi: 10.1093/plcell/koag095
[27]	Chong X, Zhang F, Wu Y, Yang X, Zhao N, et al. 2016. A SNP-enabled assessment of genetic diversity, evolutionary relationships and the identification of candidate genes in chrysanthemum. Genome Biology and Evolution 8:3661−3671 doi: 10.1093/gbe/evw270
[28]	Wilkerson MD, Hayes DN. 2010. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics 26:1572−1573 doi: 10.1093/bioinformatics/btq170
[29]	Zhou W, Zhang X. 2021. Molecular mechanism of Verticillium dahliae-induced leaf senescence. Molecular Plant 14:1785−1786 doi: 10.1016/j.molp.2021.08.020
[30]	Larkin RM, Stefano G, Ruckle ME, Stavoe AK, Sinkler CA, et al. 2016. REDUCED CHLOROPLAST COVERAGE genes from Arabidopsis thaliana help to establish the size of the chloroplast compartment. Proceedings of the National Academy of Sciences of the United States of America 113:E1116−E1125 doi: 10.1073/pnas.1515741113
[31]	Lim PO, Lee IC, Kim J, Kim HJ, Ryu JS, et al. 2010. Auxin response factor 2 (ARF2) plays a major role in regulating auxin-mediated leaf longevity. Journal of Experimental Botany 61:1419−1430 doi: 10.1093/jxb/erq010
[32]	Groen SC, Ćalić I, Joly-Lopez Z, Platts AE, Choi JY, et al. 2020. The strength and pattern of natural selection on gene expression in rice. Nature 578:572−576 doi: 10.1038/s41586-020-1997-2
[33]	Huang L, Zhang F, Zhang F, Wang W, Zhou Y, et al. 2014. Comparative transcriptome sequencing of tolerant rice introgression line and its parents in response to drought stress. BMC Genomics 15:1026 doi: 10.1186/1471-2164-15-1026
[34]	Gao S, Martinez C, Skinner DJ, Krivanek AF, Crouch JH, et al. 2008. Development of a seed DNA-based genotyping system for marker-assisted selection in maize. Molecular Breeding 22:477−494 doi: 10.1007/s11032-008-9192-4
[35]	Li M, Poonam AD, Cui Q, Hsieh T, Jagadeesan S, et al. 2025. Non-destructive seed genotyping via microneedle-based DNA extraction. Plant Biotechnology Journal 23:2317−2329 doi: 10.1111/pbi.70055
[36]	Parmar S, Deshmukh DB, Kumar R, Manohar SS, Joshi P, et al. 2021. Single seed-based high-throughput genotyping and rapid generation advancement for accelerated groundnut genetics and breeding research. Agronomy 11:1226 doi: 10.3390/agronomy11061226
[37]	Azodi CB, Pardo J, VanBuren R, de los Campos G, Shiu SH. 2020. Transcriptome-based prediction of complex traits in maize. The Plant Cell 32:139−151 doi: 10.1105/tpc.19.00332
[38]	Ma W, Qiu Z, Song J, Li J, Cheng Q, et al. 2018. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248:1307−1318 doi: 10.1007/s00425-018-2976-9
[39]	Wang K, Ali Abid M, Rasheed A, Crossa J, Hearne S, et al. 2023. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Molecular Plant 16:279−293 doi: 10.1016/j.molp.2022.11.004
[40]	Yang Y, Huang Y, Wang T, Li S, Jiang J, et al. 2025. mRNA m⁶A regulates gene expression via H3K4me3 shift in 5' UTR. Genome Biology 26:54 doi: 10.1186/s13059-025-03515-8
[41]	Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods 12:931−934 doi: 10.1038/nmeth.3547
[42]	Jumper J, Evans R, Pritzel A, Green T, Figurnov M, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596:583−589 doi: 10.1038/s41586-021-03819-2
[43]	Abramson J, Adler J, Dunger J, Evans R, Green T, et al. 2024. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630:493−500 doi: 10.1038/s41586-024-07487-w
[44]	Gütter J, Kruspe A, Zhu XX, Niebling J. 2022. Impact of training set size on the ability of deep neural networks to deal with omission noise. Frontiers in Remote Sensing 3:932431 doi: 10.3389/frsen.2022.932431
[45]	Nikolados EM, Wongprommoon A, Mac Aodha O, Cambray G, Oyarzún DA. 2022. Accuracy and data efficiency in deep learning models of protein expression. Nature Communications 13:7755 doi: 10.1038/s41467-022-34902-5
[46]	Fanourakis D, Pieruschka R, Savvides A, MacNish AJ, Sarlikioti V, et al. 2013. Sources of vase life variation in cut roses: a review. Postharvest Biology and Technology 78:1−15 doi: 10.1016/j.postharvbio.2012.12.001
[47]	Chopy M, Cavallini-Speisser Q, Chambrier P, Morel P, Just J, et al. 2024. Cell layer–specific expression of the homeotic MADS-box transcription factor PhDEF contributes to modular petal morphogenesis in petunia. The Plant Cell 36:324−345 doi: 10.1093/plcell/koad258
[48]	Xing Y, Zhang Q. 2010. Genetic and molecular bases of rice yield. Annual Review of Plant Biology 61:421−442 doi: 10.1146/annurev-arplant-042809-112209

About this article

Cite this article

Zhang H, Mao C, Guo Z, Sun M, Lu J, et al. 2026. Predicting adult phenotypes from seedling transcriptional data using deep learning: a case study in chrysanthemum. Genomics Communications 3: e013 doi: 10.48130/gcomm-0026-0011

Introduction

The development of an organism is intricately controlled by genetic and environmental signals. By incorporating environmental variables such as day length and temperature, along with the interplay of genetic influences at the individual gene level across various environments, plant phenotypes such as flowering time can be accurately predicted^[1]. To date, genotype-to-phenotype prediction has emerged as a central approach to assist genomic selection-based breeding in the seed industry.

Machine learning techniques have been widely applied in predictive analytics. In the medical field, despite the underutilization of data in clinical settings, machine learning methods have been explored to aid in disease diagnosis and survival prognosis based on dozens of selected features^[2−6]. For instance, an immune-related long noncoding RNA (lncRNA) signature built from 43 identified prognostic lncRNAs has led to more accurate machine learning-based survival predictions than traditional clinical variables, molecular features, and other published signatures^[3]. Within the field of food perception and evaluation, machine learning models have been leveraged to predict the flavors of specific compounds or compound combinations^[7,8]. Notably, models based on the gradient boosting algorithm can accurately predict complex food features and consumer preferences, surpassing predictions based on conventional statistics^[7]. In agricultural science, the use of complex machine learning models to link plant genotypes with phenotypes is gaining popularity. An increasing number of publications predict a diverse array of agronomic traits, such as yield, days to heading, and 1,000-kernel weight^[9−12]. Traditional methods, including regression-based techniques, often face limitations in analyzing high-dimensional data and capturing intricate multivariate relationships between predictor and response variables^[13]. In contrast, machine learning methods can effectively handle complex and nonlinear relationships within the data. Machine learning shows promise in translating biological knowledge and omics data into precision-designed plant breeding strategies, thereby accelerating crop breeding into the realm of artificial intelligence. Although artificial intelligence and machine learning have significantly influenced plant breeding, particularly for major crops, their potential applications in ornamental plant production and breeding have not been extensively explored.

Machine learning methodologies are increasingly advocated for and utilized in the realms of genomic selection and trait prediction. DNA variations, including single-nucleotide polymorphisms (SNPs), insertions and deletions (indels), and structural variations (SVs), serve as primary explanatory variables for predictive modeling^[14]. Supplementary data sources, such as environmental information^[15] and spectral information^[16], have been harnessed to enhance predictive accuracy. Given that DNA markers and final phenotypes signify the endpoints of information flow, the incorporation of intermediary layers like transcriptomic and proteomic data into machine learning frameworks may better capture information dynamics, thereby enhancing the accuracy of predictions. Additionally, strategies aimed at reducing noise-introducing factors, such as restricting analyses to functionally annotated variants or evolutionarily conserved genes, have been proposed to optimize machine learning's performance^[17,18]. Despite these advancements, the performance of machine learning-based predictions utilizing gene expression data remains largely unexplored.

Chrysanthemum species exhibit a complex genetic background characterized by high repetitiveness, high heterozygosity, and a large genome size, and they are recognized as segmental allopolyploids^[19]. In polyploid species, identifying SNPs, indels, and SVs is particularly challenging because it requires one to distinguish homologous variants from allelic ones, especially given the high genetic similarity among subgenomes^[20]. These difficulties in SNP/indel/SV mining hinder the accurate implementation of DNA variation-based machine learning predictions. Gene expression reflects both genetic regulatory mechanisms and environmental stimuli, offering potential as a robust alternative for phenotype prediction in plants harboring complex genomes. In this study, we propose using gene expression data from seedlings to predict adult plant phenotypes, and we illustrate the approach by predicting leaf vase life and flower types in chrysanthemums. Furthermore, we evaluate the performance of this approach for predicting yield traits in rice (Oryza sativa). Collectively, our findings indicate the high sensitivity of phenotype prediction afforded by transcriptional-based deep learning methodologies.

Materials and methods

Plant growth conditions

Cut chrysanthemum flowers were obtained from the Chrysanthemum Germplasm Resource Conservation Center in Nanjing Agricultural University (Nanjing, China). In total, 101 cultivars were grown in HuShu (HS), with an additional 20 duplicated cultivars grown in HS and BaGuaZhou (BGZ) for validation purposes (Supplementary Fig. S1).

Chrysanthemum cuttings were transplanted into a greenhouse maintained at a day/night temperature of 25/18 °C, a photoperiod of 16 h, a light intensity of 120 µmol m⁻² s⁻¹, and a relative humidity of 70%.

Vase test of cut chrysanthemum

To assess the vase life of each cut chrysanthemum flower, a vase evaluation was conducted. Following harvest, the cut flowers were stored at 5 °C for 7 d to mimic storage and transportation conditions. Subsequently, they were recut and rehydrated in deionized water for 24 h before being arranged in vases for the vase life test. Each test group consisted of four cut chrysanthemums placed in a single vase, constituting one replication, with three replications conducted for each cultivar. The vase life of both leaves and flowers in the vases was monitored daily.

RNA extraction and RNA-seq analysis

From seedlings with approximately eight leaves, we collected the fourth leaf from each plant and combined these leaves into a single sample for each cultivar. Typically, two biological replicates of leaf samples were collected per cultivar. Total RNA was extracted using the Quick RNA Isolation Kit (Waryong, Beijing, China) and sequenced on the BGI DNBSEQ-T7 platform. Adaptor sequences and low-quality reads were removed, and the clean reads were quantified against the chrysanthemum transcriptome reference^[19] using Kallisto (version 0.46.0; parameter: -b 100). Kallisto quantifies transcript abundance from RNA-seq data by pseudoalignment, without fully aligning the reads to a reference genome, and achieves accuracy similar to that of existing quantification tools^[21]. In this study, a gene was defined as expressed if its transcripts per million (TPM) value exceeded 10 in at least one sample within a population; only these expressed genes were used for subsequent analysis.
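The expression filter described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (TPM above 10 in at least one sample); the gene names, TPM values, and the function name `expressed_genes` are hypothetical and are not taken from the study's code.

```python
# Hypothetical TPM table: gene ID -> TPM value in each sample of one population.
tpm = {
    "geneA": [0.0, 3.2, 12.5],    # exceeds 10 in one sample, so it is kept
    "geneB": [9.9, 10.0, 4.1],    # never strictly above 10, so it is dropped
    "geneC": [55.0, 61.3, 48.7],  # above 10 in every sample, so it is kept
}


def expressed_genes(tpm_by_gene, threshold=10.0):
    """Return genes whose TPM exceeds the threshold in at least one sample."""
    return [gene for gene, values in tpm_by_gene.items() if max(values) > threshold]


kept = expressed_genes(tpm)
print(kept)
```

Taking the maximum across samples is equivalent to asking whether any single sample passes the threshold, which keeps the rule to one comparison per gene.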

For the rice population, RNA-seq data were downloaded as specified in the main text and mapped to the Nipponbare reference genome IRGSP-1.0. Only samples grown under normal conditions were included in the analysis. A gene was considered expressed in a sample if it was detected in at least two of the three biological replicates.

Construction and evaluation of machine learning models

Basic data processing and calculations were carried out in Microsoft Excel. Heat maps, dot plots, receiver operating characteristic (ROC) curves, and Venn diagrams were created using R (version 4.3.3).

Flower types were converted into categorical variables to enhance the objectivity and simplicity of the model. Specifically, a disbudded flower was scored 0, a single-spray flower 1, and a multiple-spray flower 2. Important features were identified through feature selection using least absolute shrinkage and selection operator (LASSO) regression analysis, which was performed in R (version 4.3.3) using all 101 cultivars, because an independent validation population of 20 cultivars grown at two separate locations was available. LASSO regression improves a model's accuracy and interpretability while addressing issues of multicollinearity among independent variables^[22]. Multivariate regression analysis was conducted to derive regression coefficients for the independent gene features. The phenotype was initially predicted using the multivariate regression coefficients, according to the following formula:

Predicted value = Gene1_exp × c1 + Gene2_exp × c2 + Gene3_exp × c3 + … + GeneN_exp × cN, where 'exp' denotes the expression of each gene and 'c' represents its regression coefficient.
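To make the selection and prediction steps concrete, the following minimal Python sketch runs LASSO by cyclic coordinate descent on toy data and then applies the weighted-sum formula above. It is an assumption-laden illustration, not the study's implementation: the analysis was performed in R, the data and penalty value here are invented, and the sketch reuses the LASSO coefficients directly instead of refitting them by multivariate regression.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero; anything within the penalty becomes exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual excluding j
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n)
    return w


def predict(expression, coefficients):
    """Predicted value = sum over genes of expression * coefficient."""
    return sum(e * c for e, c in zip(expression, coefficients))


# Toy, centred expression values for two hypothetical genes in four samples.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.1, -1.9, 1.9, -2.1]  # driven mostly by the first gene

weights = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print(weights, selected, predict([2.0, 5.0], weights))
```

The soft-threshold step is what sets weakly associated genes to exactly zero, which is why LASSO can act as a feature selector and not only as a regularized regression.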

For deep learning, the Keras framework with a TensorFlow backend was used to train the models. For each population, the dataset was randomly divided into training (70%) and testing (30%) sets. We manually tested various parameters, including the activation function, optimizer, loss function, dropout rate, number of units, regularizers, number of epochs, and layer configuration. The models' performance was assessed by monitoring the learning curves for both the training and testing data. The accuracy of the models was evaluated by calculating Pearson's correlation coefficient (r) between the predicted and actual values, and the best-performing model was reported.
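The split and the evaluation metric can be illustrated with the standard library alone. The sketch below assumes a simple shuffled 70/30 split with a fixed seed and a textbook Pearson's r; the function names and the seed are hypothetical, and the study's own code may split and score the data differently.

```python
import math
import random


def split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


def pearson_r(predicted, actual):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


# With 101 cultivars, a 70/30 split gives 70 training and 31 testing samples.
train_idx, test_idx = split_indices(101)
print(len(train_idx), len(test_idx))
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```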

For leaf vase life, a fully connected deep neural network (DNN) model was developed for the regression analysis. The model architecture comprised nine layers in total, including dense fully connected layers and independent activation layers to introduce nonlinearity. The first dense layer contained 40 units followed by an exponential linear unit (ELU) activation layer; the subsequent hidden layers were set to have 40, 40, 80, 160, 320, and 640 units, activated by linear, elu, tanh, mish, relu6, and leaky_relu functions, respectively. All hidden dense layers were regularized using L2 kernel regularization with a coefficient of 0.01 to mitigate overfitting. The final output layer contained 1 unit with a linear transformation, followed by an additional ELU activation function to meet the regression's requirements. For model training, the network was compiled with the Adam optimizer (default learning rate = 0.001), using the mean squared error (MSE) as the loss function.
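As a rough illustration of this layer stack, the sketch below writes out the named activation functions in plain Python and runs one forward pass through dense layers of the stated widths. Several details are assumptions and are not confirmed by the text: an input of 166 genes (the number mentioned for leaf vase life in the Discussion), a leaky ReLU slope of 0.2, small random weights, and the treatment of each activation as part of its dense layer. Training with Adam, the MSE loss, and L2 regularization are not shown.

```python
import math
import random


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))  # x * tanh(softplus(x))


def relu6(x):
    return min(max(x, 0.0), 6.0)


def leaky_relu(x, slope=0.2):  # the slope is an assumed value
    return x if x > 0 else slope * x


def linear(x):
    return x


# (units, activation) for each dense layer, in the order given in the text.
LAYERS = [
    (40, elu), (40, linear), (40, elu), (80, math.tanh),
    (160, mish), (320, relu6), (640, leaky_relu), (1, elu),
]


def init_weights(n_inputs, layers, seed=0):
    """Small random weights and zero biases for every dense layer."""
    rng = random.Random(seed)
    params, fan_in = [], n_inputs
    for units, _ in layers:
        W = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)] for _ in range(units)]
        params.append((W, [0.0] * units))
        fan_in = units
    return params


def forward(x, layers, params):
    """Apply each dense layer followed by its activation."""
    for (_, act), (W, b) in zip(layers, params):
        x = [act(sum(w * v for w, v in zip(row, x)) + bias) for row, bias in zip(W, b)]
    return x[0]


N_GENES = 166  # assumed input size
params = init_weights(N_GENES, LAYERS)
n_weights = sum(len(W) * len(W[0]) + len(b) for W, b in params)
rng = random.Random(1)
output = forward([rng.random() for _ in range(N_GENES)], LAYERS, params)
print(n_weights, output)
```

Under these assumptions the stack holds 283,801 trainable parameters, most of them in the 320-to-640 layer.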

For flower types, the first dense layer contained 300 units with the Gaussian error linear unit (GELU) activation function; two subsequent dropout layers were applied with a fixed dropout rate of 0.2 to reduce overfitting. Two hidden dense layers were sequentially arranged, each containing 20 units and adopting the GELU activation function. Both hidden layers were regularized by L2 kernel regularization (0.01) and L1 activity regularization (0.01) to constrain the model's complexity. The output layer was designed with three units and applied the Softmax activation function for multiclass classification. For compiling each model, the RMSprop optimization algorithm was used with a default initial learning rate of 0.001, and categorical cross-entropy was selected as the loss function. The model was trained for 100 epochs with a default batch size of 32.
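The output stage of the classifier can be illustrated without any framework. The sketch below shows the 0/1/2 coding of flower types, a three-way softmax, and the categorical cross-entropy loss for one sample; the logits are invented for illustration and the helper names are hypothetical.

```python
import math

# Class coding described in the Methods.
FLOWER_CLASSES = {"disbudded": 0, "single-spray": 1, "multiple-spray": 2}


def one_hot(label, n_classes=3):
    """Encode an integer class label as a one-hot target vector."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def softmax(logits):
    """Convert raw output-layer values into class probabilities."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(target, probs, eps=1e-12):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


# An uninformative prediction gives a loss of ln(3) for any true class.
uniform = softmax([0.0, 0.0, 0.0])
loss = categorical_cross_entropy(one_hot(FLOWER_CLASSES["single-spray"]), uniform)

# The predicted class is the index with the highest probability.
probs = softmax([0.2, 2.5, -1.0])  # hypothetical logits
predicted_class = probs.index(max(probs))
print(loss, predicted_class)
```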

The source code is publicly available at https://github.com/lkwwang-ui/Deep-model-for-predicting-adult-traits-using-seedling-data-study.git

We used Weka 3.9.7 data mining software^[23] and performed machine learning analysis as described in our previously published paper^[24]. In brief, all 101 samples were used for training and testing with 10-fold cross-validation, and the 20 samples from BGZ were used for model validation.
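For readers unfamiliar with the procedure, 10-fold cross-validation on 101 samples can be sketched as follows. This is a plain, unstratified partition written for illustration only; it is not Weka's implementation, and the function name and seed are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices round-robin into k folds
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold


splits = list(k_fold_indices(101))
fold_sizes = [len(test) for _, test in splits]
print(fold_sizes)
```

With 101 samples, one fold holds 11 samples and the other nine hold 10 each, so every model is trained on 90 or 91 cultivars.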

Discussion

In traditional marker-assisted selection, field-based leaf collection, labeling, and subsequent tracing back to the source plants after genotyping represent significant bottlenecks in leaf DNA-based genotyping. The entire workflow is labor-intensive, time-consuming, and costly. Consequently, to streamline breeders' workflows and minimize expenses, an optimized genotyping approach using endosperm DNA extracted from individual seeds has been developed. This method offers a viable alternative to leaf DNA-based genotyping and has proved beneficial for both genetic investigations and breeding applications^[34−36]. In our current research, we sought to predict the phenotypes of adult plants using gene expression data derived from seedlings, facilitating the development of a single seed-based genotyping strategy. Through feature selection, only a limited number of genes were selected for the deep learning analyses, and these genes could potentially be used in gene chip-based breeding practices in the future.

Previous reports demonstrated that maize (Zea mays) seedlings' transcriptomes can predict complex adult traits such as flowering time and grain yield^[37]. Our study extends this concept to polyploid horticultural crops, where the genome's complexity poses unique challenges for DNA-based prediction. Unlike maize, chrysanthemum possesses a highly heterozygous segmental allopolyploid genome that complicates accurate variant identification, making transcriptional-based prediction particularly advantageous. Furthermore, we demonstrate that the same framework can be applied to both continuous traits (leaf vase life) and categorical traits (flower type), highlighting its broader applicability. We also provide multilocation validation populations, evaluating the model's robustness under real-world environmental variation.

For genetically fixed traits such as flower type, the seedling transcriptome functions as a proxy for the underlying genotype. For species with relatively simple and well-characterized genomes, DNA-based genotyping likely provides a more stable and cost-effective approach for predicting such structural traits, as these traits are ultimately determined by DNA sequence variants that are fixed at the seed stage^[38,39]. However, for species with complex genomes, such as chrysanthemum (a segmental allopolyploid with high heterozygosity), accurate identification of DNA variations poses significant challenges. Distinguishing homologous variants from allelic ones is particularly difficult when the subgenomes share high sequence similarity, and these difficulties in SNP/indel/SV mining can compromise the accuracy of DNA variation-based machine learning predictions^[20]. In this context, gene expression data, which integrate both genetic regulatory mechanisms and environmental stimuli^[40], offer a complementary or alternative avenue for predicting phenotypes in plants with complex genomes. The transcriptional-based approach thus holds particular promise for polyploid horticultural crops where DNA-based variant calling remains technically challenging.

Although deep learning models exhibit remarkable accuracy in predicting phenotypes, they need vast amounts of data for training, typically spanning tens to hundreds of thousands of input features^[7,41−43]. For instance, in segmentation tasks on aerial image datasets, the size of the training set plays a pivotal role in protecting the model against label noise, with larger training sets making the model more robust to omission noise^[44]. In our study, the number of samples available to the deep learning algorithm was limited, so expanding the sample size is a likely route to improving the predictive performance of the model.

Beyond enlarging the training dataset, increasing the diversity of the training data is pivotal for optimizing the predictive capabilities of deep learning models. In some cases, deep learning frameworks can achieve commendable predictive accuracy even with relatively compact datasets. For instance, manipulating sequence diversity in controlled settings significantly improves data efficiency and, in turn, the prediction accuracy of deep learning models^[45]. Leaf senescence is a complex trait influenced by genotype × environment (G × E) interactions; the validation populations were grown in two locations (BGZ and HS) 42 km apart, with differing soil and microclimatic conditions that may affect gene expression and phenotype independently. The postharvest life of cut flowers is also controlled by preharvest factors such as air humidity, disease infection, light conditions, and nutrition^[46]. In our case, integrating environmental data such as temperature, light exposure, water availability, and nutrient intake is another promising approach to further improve the model's predictive performance.

In this study, we evaluated the performance of transcriptional-based deep learning methods for predicting adult phenotypes from transcriptional data from seedling leaves of horticultural plants. However, many adult phenotypes may be influenced by genes that are expressed in a tissue-specific manner or at specific developmental stages beyond the seedling phase^[26,47]. Such regulatory patterns would not be fully captured by transcriptional analyses restricted to seedling leaves, which may partially account for the unexplained variance in our model's predictions. Future studies incorporating transcriptional data from multiple tissues and developmental time points may help address this limitation.

In our cross-species validation using rice, we used drought-induced fecundity changes as the predicted trait^[32] and compared them with observed grain yield changes^[33]. Although fecundity is a primary determinant of grain yield and the two traits are generally positively correlated, we acknowledge that grain yield is a more complex trait influenced by additional factors such as grain size, grain weight, panicle architecture, tiller number, and spikelet fertility^[48]. These yield components may be affected by environmental conditions during later developmental stages that are not fully reflected in seedling leaves' transcriptomes, introducing biological noise into the validation. Furthermore, the predicted fecundity values and the observed grain yield values were not on the same numerical scale (Fig. 5b), which likely reflects both the inherent differences between these two metrics and the contribution of unmodeled environmental variances. We therefore interpret the rice-based validation results cautiously as a preliminary indication of cross-species applicability. Future work based on matched transcription–phenotype datasets for the exact trait of interest across larger populations would enable more rigorous cross-species validation of this approach.

For breeding practice, we wish to clarify that our proposed strategy does not rely on RNA-seq for routine breeding genotyping. Instead, once the most predictive genes have been identified through initial RNA-seq screening (e.g., 166 genes for leaf vase life, as demonstrated in this study), a customized gene expression microarray targeting this focused gene set could be developed for downstream breeding applications. It has been noted that an RNA expression microarray can be cost-effective for studies focused on well-annotated genomes when the gene set is predefined (www.labmanager.com/microarray-vs-rna-sequencing-which-gene-expression-analysis-technique-is-more-effective-33683). Moreover, predictive gene sets for multiple traits can be consolidated onto a single microarray, enabling simultaneous screening of several traits in one assay and thereby further reducing the per-trait cost. This targeted approach avoids the high expense of full transcriptome sequencing while retaining the predictive power concentrated in the selected gene features.
