REVIEW   Open Access    

Progress and perspectives on genomic selection models for crop breeding

  • # Authors contributed equally: Dongfeng Zhang, Feng Yang, Jinlong Li

  • [1] Wallace JG, Rodgers-Melnick E, Buckler ES. 2018. On the road to Breeding 4.0: unraveling the good, the bad, and the boring of crop quantitative genomics. Annual Review of Genetics 52:421−44 doi: 10.1146/annurev-genet-120116-024846

    [2] Jonas E, de Koning DJ. 2013. Does genomic selection have a future in plant breeding? Trends in Biotechnology 31:497−504 doi: 10.1016/j.tibtech.2013.06.003

    [3] Lande R, Thompson R. 1990. Efficiency of marker-assisted selection in the improvement of quantitative traits. Genetics 124:743−56 doi: 10.1093/genetics/124.3.743

    [4] Ribaut JM, Hoisington D. 1998. Marker-assisted selection: new tools and strategies. Trends in Plant Science 3:236−39 doi: 10.1016/S1360-1385(98)01240-0

    [5] Meuwissen THE, Hayes BJ, Goddard ME. 2001. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157:1819−29 doi: 10.1093/genetics/157.4.1819

    [6] Ertiro BT, Zeleke H, Friesen D, Blummel M, Twumasi-Afriyie S. 2013. Relationship between the performance of parental inbred lines and hybrids for food-feed traits in maize (Zea mays L.) in Ethiopia. Field Crops Research 153:86−93 doi: 10.1016/j.fcr.2013.02.008

    [7] Meuwissen THE, Goddard ME. 1996. The use of marker haplotypes in animal breeding schemes. Genetics Selection Evolution 28:161 doi: 10.1186/1297-9686-28-2-161

    [8] Whittaker JC, Thompson R, Denham MC. 2000. Marker-assisted selection using ridge regression. Genetics Research 75:249−52 doi: 10.1017/S0016672399004462

    [9] Heffner EL, Sorrells ME, Jannink JL. 2009. Genomic selection for crop improvement. Crop Science 49:1−12 doi: 10.2135/cropsci2008.08.0512

    [10] Eathington SR, Crosbie TM, Edwards MD, Reiter RS, Bull JK. 2007. Molecular markers in a commercial breeding program. Crop Science 47:S-154−S-163 doi: 10.2135/cropsci2007.04.0015IPBS

    [11] Li J, Cheng D, Guo S, Chen C, Wang Y, et al. 2023. Genome-wide association and genomic prediction for resistance to southern corn rust in DH and testcross populations. Frontiers in Plant Science 14:1109116 doi: 10.3389/fpls.2023.1109116

    [12] Hayes BJ, Daetwyler HD, Bowman P, Moser G, Tier B, et al. 2009. Accuracy of genomic selection: comparing theory and results. Association for the Advancement of Animal Breeding and Genetics 18:34−37

    [13] VanRaden PM. 2007. Genomic measures of relationship and inbreeding. INTERBULL Bulletin 37:33−36

    [14] VanRaden PM. 2008. Efficient methods to compute genomic predictions. Journal of Dairy Science 91:4414−23 doi: 10.3168/jds.2007-0980

    [15] Astle W, Balding DJ. 2009. Population structure and cryptic relatedness in genetic association studies. Statistical Science 24:451−71 doi: 10.1214/09-STS307

    [16] Aguilar I, Misztal I, Legarra A, Tsuruta S. 2011. Efficient computation of the genomic relationship matrix and other matrices used in single-step evaluation. Journal of Animal breeding and Genetics 128:422−28 doi: 10.1111/j.1439-0388.2010.00912.x

    [17] Legarra A, Christensen OF, Aguilar I, Misztal I. 2014. Single Step, a general approach for genomic selection. Livestock Science 166:54−65 doi: 10.1016/j.livsci.2014.04.029

    [18] Zhang Z, Liu J, Ding X, Bijma P, de Koning DJ, Zhang Q. 2010. Best linear unbiased prediction of genomic breeding values using a trait-specific marker-derived relationship matrix. PLoS One 5:e12648 doi: 10.1371/journal.pone.0012648

    [19] Wang J, Zhou Z, Zhang Z, Li H, Liu D, et al. 2018. Expanding the BLUP alphabet for genomic prediction adaptable to the genetic architectures of complex traits. Heredity 121:648−62 doi: 10.1038/s41437-018-0075-0

    [20] Wang Q, Tian F, Pan Y, Buckler ES, Zhang Z. 2014. A SUPER powerful method for genome wide association study. PLoS One 9:e107684 doi: 10.1371/journal.pone.0107684

    [21] Zhang Z, Ersoz E, Lai CQ, Todhunter RJ, Tiwari HK, et al. 2010. Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42:355−60 doi: 10.1038/ng.546

    [22] Endelman JB. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. The Plant Genome 4:250−55 doi: 10.3835/plantgenome2011.08.0024

    [23] Lorenzana RE, Bernardo R. 2009. Accuracy of genotypic value predictions for marker-based selection in biparental plant populations. Theoretical and Applied Genetics 120:151−61 doi: 10.1007/s00122-009-1166-3

    [24] De Los Campos G, Hickey JM, Pong-Wong R, Daetwyler HD, Calus MP. 2013. Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics 193:327−45 doi: 10.1534/genetics.112.143313

    [25] Pérez P, de los Campos G. 2014. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198:483−95 doi: 10.1534/genetics.114.164442

    [26] Park T, Casella G. 2008. The Bayesian Lasso. Journal of the American Statistical Association 103:681−86 doi: 10.1198/016214508000000337

    [27] De Los Campos G, Naya H, Gianola D, Crossa J, Legarra A, et al. 2009. Predicting quantitative traits with regression models for dense molecular markers and pedigree. Genetics 182:375−85 doi: 10.1534/genetics.109.101501

    [28] Mutshinda CM, Sillanpää MJ. 2010. Extended Bayesian LASSO for multiple quantitative trait loci mapping and unobserved phenotype prediction. Genetics 186:1067−75 doi: 10.1534/genetics.110.119586

    [29] Legarra A, Robert-Granié C, Croiseau P, Guillaume F, Fritz S. 2011. Improved Lasso for genomic selection. Genetics Research 93:77−87 doi: 10.1017/S0016672310000534

    [30] Habier D, Fernando RL, Kizilkaya K, Garrick DJ. 2011. Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12:186 doi: 10.1186/1471-2105-12-186

    [31] Pong-Wong R, Woolliams JA. 2014. Bayes U: a genomic prediction method based on the horseshoe prior. Proc. 10th World Congress of Genetics Applied to Livestock Production, Vancouver, BC, Canada. 3 pp
    [32] Shi S, Li X, Fang L, Liu A, Su G, et al. 2021. Genomic prediction using Bayesian regression models with global–local prior. Frontiers in Genetics 12:628205 doi: 10.3389/fgene.2021.628205

    [33] Wang T, Chen YPP, Bowman PJ, Goddard ME, Hayes BJ. 2016. A hybrid expectation maximisation and MCMC sampling algorithm to implement Bayesian mixture model based genomic prediction and QTL mapping. BMC Genomics 17:744 doi: 10.1186/s12864-016-3082-7

    [34] Cheng H, Qu L, Garrick DJ, Fernando RL. 2015. A fast and efficient Gibbs sampler for BayesB in whole-genome analyses. Genetics Selection Evolution 47:80 doi: 10.1186/s12711-015-0157-x

    [35] Azevedo CF, de Resende MDV, Fonseca e Silva F, Viana JMS, Valente MSF, et al. 2015. Ridge, Lasso and Bayesian additive-dominance genomic models. BMC Genetics 16:105 doi: 10.1186/s12863-015-0264-2

    [36] Vieira IC, Dos Santos JPR, Pires LPM, Lima BM, Gonçalves FMA, et al. 2017. Assessing non-additive effects in GBLUP model. Genetics and Molecular Research 16:gmr16029632 doi: 10.4238/gmr1602963

    [37] Piepho HP, Möhring J, Melchinger AE, Büchse A. 2008. BLUP for phenotypic selection in plant breeding and variety testing. Euphytica 161:209−28 doi: 10.1007/s10681-007-9449-8

    [38] Zuk O, Hechter E, Sunyaev SR, Lander ES. 2012. The mystery of missing heritability: genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences of the United States of America 109:1193−98 doi: 10.1073/pnas.1119675109

    [39] Ma C, Xin M, Feldmann KA, Wang X. 2014. Machine learning–based differential network analysis: a study of stress-responsive transcriptomes in Arabidopsis. The Plant Cell 26:520−37 doi: 10.1105/tpc.113.121913

    [40] Abdollahi-Arpanahi R, Gianola D, Peñagaricano F. 2020. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genetics Selection Evolution 52:12 doi: 10.1186/s12711-020-00531-z

    [41] Gianola D, Fernando RL, Stella A. 2006. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173:1761−76 doi: 10.1534/genetics.105.049510

    [42] Gianola D, Van Kaam JBCHM. 2008. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178:2289−303 doi: 10.1534/genetics.107.084285

    [43] De Los Campos G, Gianola D, Rosa GJM. 2009. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. Journal of Animal Science 87:1883−87 doi: 10.2527/jas.2008-1259

    [44] De los Campos G, Gianola D, Rosa GJM, Weigel KA, Crossa J. 2010. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genetics Research 92:295−308 doi: 10.1017/S0016672310000285

    [45] Long N, Gianola D, Rosa GJM, Weigel KA, Kranis A, González-Recio O. 2010. Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Research 92:209−25 doi: 10.1017/S0016672310000157

    [46] Cortes C, Vapnik V. 1995. Support-vector networks. Machine Learning 20:273−97 doi: 10.1007/BF00994018

    [47] Chang CC, Lin CJ. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2:27 doi: 10.1145/1961189.1961199

    [48] Hearst MA, Dumais ST, Osuna E, Platt J, Scholkopf B. 1998. Support vector machines. IEEE Intelligent Systems and their Applications 13:18−28 doi: 10.1109/5254.708428

    [49] Zhao W, Lai X, Liu D, Zhang Z, Ma P, et al. 2020. Applications of support vector machine in genomic prediction in pig and maize populations. Frontiers in Genetics 11:598318 doi: 10.3389/fgene.2020.598318

    [50] Targhi MVA, Jafarabadi GA, Aminafshar M, Kashan NEJ. 2019. Comparison of non-parametric methods in genomic evaluation of discrete traits. Gene Reports 15:100379 doi: 10.1016/j.genrep.2019.100379

    [51] Breiman L. 2001. Random forests. Machine Learning 45:5−32 doi: 10.1023/A:1010933404324

    [52] Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984. Classification and regression trees. New York: Chapman and Hall/CRC. 368 pp. doi: 10.1201/9781315139470
    [53] Naderi S, Yin T, König S. 2016. Random forest estimation of genomic breeding values for disease susceptibility over different disease incidences and genomic architectures in simulated cow calibration groups. Journal of Dairy Science 99:7261−73 doi: 10.3168/jds.2016-10887

    [54] Sarkar RK, Rao AR, Meher PK, Nepolean T, Mohapatra T. 2015. Evaluation of random forest regression for prediction of breeding value from genomewide SNPs. Journal of Genetics 94:187−92 doi: 10.1007/s12041-015-0501-5

    [55] Waldmann P. 2016. Genome-wide prediction using Bayesian additive regression trees. Genetics Selection Evolution 48:42 doi: 10.1186/s12711-016-0219-8

    [56] Friedman JH. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics 29:1189−232 doi: 10.1214/aos/1013203451

    [57] Ke G, Meng Q, Finley T, Wang T, Chen W, et al. 2017. LightGBM: a highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems 30:3149−57

    [58] Chen T, Guestrin C. 2016. XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA: Association for Computing Machinery. pp. 785–94. doi: 10.1145/2939672.2939785
    [59] Dorogush AV, Ershov V, Gulin A. 2018. CatBoost: gradient boosting with categorical features support. arXiv:1810.11363 doi: 10.48550/arXiv.1810.11363

    [60] Yan J, Xu Y, Cheng Q, Jiang S, Wang Q, et al. 2021. LightGBM: accelerated genomically designed crop breeding through ensemble learning. Genome Biology 22:271 doi: 10.1186/s13059-021-02492-y

    [61] Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, et al. 2019. A primer on deep learning in genomics. Nature Genetics 51:12−18 doi: 10.1038/s41588-018-0295-5

    [62] Bellot P, De Los Campos G, Pérez-Enciso M. 2018. Can deep learning improve genomic prediction of complex human traits? Genetics 210:809−19 doi: 10.1534/genetics.118.301298

    [63] Montesinos-López A, Montesinos-López OA, Gianola D, Crossa J, Hernández-Suárez CM. 2018. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 8:3813−28 doi: 10.1534/g3.118.200740

    [64] Montesinos-López OA, Martín-Vallejo J, Crossa J, Gianola D, Hernández-Suárez CM, et al. 2019. A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 9:601−18 doi: 10.1534/g3.118.200998

    [65] Pérez-Enciso M, Zingaretti LM. 2019. A guide on deep learning for complex trait genomic prediction. Genes 10:553 doi: 10.3390/genes10070553

    [66] Fukushima K. 1980. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36:193−202 doi: 10.1007/BF00344251

    [67] Lecun Y, Bottou L, Bengio Y, Haffner P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86:2278−324 doi: 10.1109/5.726791

    [68] Ma W, Qiu Z, Song J, Li J, Cheng Q, et al. 2018. A deep convolutional neural network approach for predicting phenotypes from genotypes. Planta 248:1307−18 doi: 10.1007/s00425-018-2976-9

    [69] Liu Y, Wang D, He F, Wang J, Joshi T, et al. 2019. Phenotype prediction and genome-wide association study using deep convolutional neural network of soybean. Frontiers in Genetics 10:1091 doi: 10.3389/fgene.2019.01091

    [70] Wang K, Abid MA, Rasheed A, Crossa J, Hearne S, et al. 2023. DNNGP, a deep neural network-based method for genomic prediction using multi-omics data in plants. Molecular Plant 16:279−93 doi: 10.1016/j.molp.2022.11.004

    [71] Wang N, Wang H, Zhang A, Liu Y, Yu D, et al. 2020. Genomic prediction across years in a maize doubled haploid breeding program to accelerate early-stage testcross testing. Theoretical and Applied Genetics 133:2869−79 doi: 10.1007/s00122-020-03638-5

    [72] Juliana P, Singh RP, Braun HJ, Huerta-Espino J, Crespo-Herrera L, et al. 2020. Genomic selection for grain yield in the CIMMYT wheat breeding program—status and perspectives. Frontiers in Plant Science 11:564183 doi: 10.3389/fpls.2020.564183

    [73] Tessema BB, Liu H, Sørensen AC, Andersen JR, Jensen J. 2020. Strategies using genomic selection to increase genetic gain in breeding programs for wheat. Frontiers in Genetics 11:578123 doi: 10.3389/fgene.2020.578123

    [74] Chung PY, Liao CT. 2020. Identification of superior parental lines for biparental crossing via genomic prediction. PLoS One 15:e0243159 doi: 10.1371/journal.pone.0243159

    [75] Chung PY, Liao CT. 2022. Selection of parental lines for plant breeding via genomic prediction. Frontiers in Plant Science 13:934767 doi: 10.3389/fpls.2022.934767

    [76] Sun X, Qu L, Garrick DJ, Dekkers JCM, Fernando RL. 2012. A fast EM algorithm for BayesA-like prediction of genomic breeding values. PLoS One 7:e49157 doi: 10.1371/journal.pone.0049157

    [77] Jiang S, Cheng Q, Yan J, Fu R, Wang X. 2020. Genome optimization for improvement of maize breeding. Theoretical and Applied Genetics 133:1491−502 doi: 10.1007/s00122-019-03493-z

    [78] Hayes BJ, Visscher PM, Goddard ME. 2009. Increased accuracy of artificial selection by using the realized relationship matrix. Genetics Research 91:47−60 doi: 10.1017/S0016672308009981

    [79] Joshi R, Skaarud A, Alvarez AT, Moen T, Ødegård J. 2021. Bayesian genomic models boost prediction accuracy for survival to Streptococcus agalactiae infection in Nile tilapia (Oreochromis niloticus). Genetics Selection Evolution 53:37 doi: 10.1186/s12711-021-00629-y

    [80] Meher PK, Rustgi S, Kumar A. 2022. Performance of Bayesian and BLUP alphabets for genomic prediction: analysis, comparison and results. Heredity 128:519−30 doi: 10.1038/s41437-022-00539-9

    [81] Jarquín D, Kocak K, Posadas L, Hyma K, Jedlicka J, et al. 2014. Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15:740 doi: 10.1186/1471-2164-15-740

    [82] Jia Z. 2017. Controlling the overfitting of heritability in genomic selection through cross validation. Scientific Reports 7:13678 doi: 10.1038/s41598-017-14070-z

    [83] Jubair S, Tucker JR, Henderson N, Hiebert CW, Badea A, et al. 2021. GPTransformer: a transformer-based deep learning method for predicting Fusarium related traits in barley. Frontiers in Plant Science 12:761402 doi: 10.3389/fpls.2021.761402

    [84] Zhang H, Wang X, Pan Q, Li P, Liu Y, et al. 2019. QTG-Seq accelerates QTL fine mapping through QTL partitioning and whole-genome sequencing of bulked segregant samples. Molecular Plant 12:426−37 doi: 10.1016/j.molp.2018.12.018

    [85] Crossa J, Fritsche-Neto R, Montesinos-Lopez OA, Costa-Neto G, Dreisigacker S, et al. 2021. The modern plant breeding triangle: optimizing the use of genomics, phenomics, and enviromics data. Frontiers in Genetics 12:651480 doi: 10.3389/fpls.2021.651480

    [86] Weyen J. 2021. Applications of doubled haploids in plant breeding and applied research. In Doubled Haploid Technology, ed. Segui-Simarro JM. New York, NY: Humana. Volume 2287. pp. 23–39. doi: 10.1007/978-1-0716-1315-3_2
    [87] Wang N, Yuan Y, Wang H, Yu D, Liu Y, et al. 2020. Applications of genotyping-by-sequencing (GBS) in maize genetics and breeding. Scientific Reports 10:16308 doi: 10.1038/s41598-020-73321-8

    [88] Rich-Griffin C, Stechemesser A, Finch J, Lucas E, Ott S, et al. 2020. Single-cell transcriptomics: a high-resolution avenue for plant functional genomics. Trends in Plant Science 25:186−97 doi: 10.1016/j.tplants.2019.10.008

    [89] Liu Y, Lu S, Liu K, Wang S, Huang L, et al. 2019. Proteomics: a powerful tool to study plant responses to biotic stress. Plant Methods 15:135 doi: 10.1186/s13007-019-0515-8

    [90] Jamil IN, Remali J, Azizan KA, Nor Muhammad NA, Arita M, et al. 2020. Systematic Multi-Omics Integration (MOI) approach in plant systems biology. Frontiers in Plant Science 11:944 doi: 10.3389/fpls.2020.00944

    [91] Khaki S, Khalilzadeh Z, Wang L. 2020. Predicting yield performance of parents in plant breeding: a neural collaborative filtering approach. PLoS One 15:e0233382 doi: 10.1371/journal.pone.0233382

    [92] Harfouche AL, Jacobson DA, Kainer D, Romero JC, Harfouche AH, et al. 2019. Accelerating climate resilient plant breeding by applying next-generation artificial intelligence. Trends in Biotechnology 37:1217−35 doi: 10.1016/j.tibtech.2019.05.007

    [93] Liang M, An B, Chang T, Deng T, Du L, et al. 2022. Incorporating kernelized multi-omics data improves the accuracy of genomic prediction. Journal of Animal Science and Biotechnology 13:103 doi: 10.1186/s40104-022-00756-6

    [94] Devlin J, Chang MW, Lee K, Toutanova K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 doi: 10.48550/arXiv.1810.04805

    [95] Ma X, Wang H, Wu S, Han B, Cui D, et al. 2024. DeepCCR: large-scale genomics-based deep learning method for improving rice breeding. Plant Biotechnology Journal 22:2691−93 doi: 10.1111/pbi.14384

    [96] Gao P, Zhao H, Luo Z, Lin Y, Feng W, et al. 2023. SoyDNGP: a web-accessible deep learning framework for genomic prediction in soybean breeding. Briefings in Bioinformatics 24:bbad349 doi: 10.1093/bib/bbad349

    [97] Wu C, Zhang Y, Ying Z, Li L, Wang J, et al. 2023. A transformer-based genomic prediction method fused with knowledge-guided module. Briefings in Bioinformatics 25:bbad438 doi: 10.1093/bib/bbad438

    [98] Ren Y, Wu C, Zhou H, Hu X, Miao Z. 2024. Dual-extraction modeling: a multi-modal deep-learning architecture for phenotypic prediction and functional gene mining of complex traits. Plant Communications 5:101002 doi: 10.1016/j.xplc.2024.101002

    [99] Yan J, Wang X. 2023. Machine learning bridges omics sciences and plant breeding. Trends in Plant Science 28:199−210 doi: 10.1016/j.tplants.2022.08.018

    [100] Khaki S, Wang L. 2019. Crop yield prediction using deep neural networks. Frontiers in Plant Science 10:621 doi: 10.3389/fpls.2019.00621

  • Cite this article

    Zhang D, Yang F, Li J, Liu Z, Han Y, et al. 2025. Progress and perspectives on genomic selection models for crop breeding. Technology in Agronomy 5: e006 doi: 10.48130/tia-0025-0002



Technology in Agronomy 5, Article number: e006 (2025)

Abstract: Genomic selection, a molecular breeding technique, is playing an increasingly important role in improving the efficiency of artificial selection and genetic gain in modern crop breeding programs. A series of algorithms have been proposed to improve the prediction accuracy of genomic selection. In this review, we describe emerging genomic selection techniques and summarize methods for best linear unbiased prediction and Bayesian estimation of the traditional statistics used for prediction during genomic selection. Moreover, with the rapid development of artificial intelligence, several machine learning algorithms are increasingly being employed to capture the effects of more genes to further improve prediction accuracy, which we describe in this review. We also describe the advantages and disadvantages of traditional models and machine learning models and discuss several crucial factors that could affect prediction accuracy. We propose that additional artificial intelligence techniques will be required for big data management, feature processing, and model innovation to generate a comprehensive model to optimize the prediction accuracy of genomic selection. We believe that improvements in artificial intelligence could accelerate the arrival of Breeding 4.0, in which combining any known alleles into optimal combinations in crops will be fully customizable.

    • The development of new crop varieties through breeding is an efficient way to enhance crop productivity. Breeders aim to improve traits of interest such as yield, disease resistance, stress tolerance, and nutritional value. Modern crop breeding is both an art and a science, and it differs vastly from early crop selection and domestication thanks to the rapid development of scientific disciplines such as biostatistics, genetic engineering, and genomics, and to advances in modern breeding techniques, including transgenesis, genome editing, speed breeding, and doubled haploid (DH) technology[1]. As a result, modern breeding is entering a novel stage, Breeding 4.0, based on the proposed stages of breeding from 1.0 to 4.0[1].

      Two major strategies are employed in molecular breeding: marker-assisted selection (MAS) and genomic selection (GS)[2]. In MAS, individual plants are selected based on linked or functional markers that confirm the trait of interest[3]. Owing to the discovery of increasing numbers of major quantitative trait loci (QTLs) and functional genes through genetic research, MAS has been extensively applied in breeding programs to improve traits of interest since its initial use in the 1990s[3]. MAS is an effective way to increase selection efficiency for qualitative traits or trait categories regulated by only a few genes[4]. However, breeding targets regulated by one or a few major QTLs/genes are extremely rare; most traits, such as yield and plant height, are quantitative traits regulated by multiple loci. Because it is challenging to conduct artificial selection using only a few markers to capture the genes underlying such traits, a second strategy known as GS (also known as whole-genome prediction) was developed[5].

    • Even before the emergence of GS, much effort in breeding programs was devoted to predicting the breeding value of specific materials. One approach that breeders explored was predicting the performance of the progeny of single crosses. Such predictions can be complex, however, because of genetic networks involving dominance and epistatic effects, especially for complex quantitative traits[6]. With the emergence of molecular biology techniques, DNA markers were developed and used in breeding programs, and researchers began incorporating them into regression models to estimate the breeding values of breeding materials. The use of DNA markers has proven effective for increasing genetic gain, with improvements ranging from 8% to 38% across studies[3,7]. Simultaneously, advances in marker technology made it feasible to produce markers rapidly at lower cost, opening new avenues for incorporating genomic information into breeding programs[8].

      GS involves predicting the breeding value of a material using information from a large number of genetic markers distributed across the genome. This technique was formally proposed by Meuwissen et al. in 2001, who demonstrated that breeding values could be predicted using Bayesian and BLUP (best linear unbiased prediction) statistical models[5]. GS involves three major steps: construction of the training dataset, optimization of the model and its parameters, and prediction for the evaluation dataset. These steps collectively form the GS workflow (Fig. 1a). However, each step faced various challenges before yielding promising results. For instance, GS was not incorporated into breeding programs in the early 2000s because few molecular markers were available and marker testing was expensive[9]. The emergence of high-throughput genotyping platforms, such as single-nucleotide polymorphism (SNP) arrays, DNA microarrays, and second- and third-generation sequencing technologies, has enabled breeders to genotype numerous individuals efficiently at low cost, facilitating the widespread adoption of GS in breeding programs[10]. Subsequently, GS has become an increasing focus of research and development in breeding.

      Figure 1. 

      An overview of genomic selection. (a) Genomic selection comprises three parts: a training dataset, models, and an evaluation dataset. The training dataset consists of phenotypic data collected in field trials and genotypic data generated in the marker lab. Models are trained using two strategies: statistical methods and machine learning. For the evaluation dataset, phenotypes are predicted from genotypes; materials are then selected according to the predicted phenotypes and advanced to field experiments. (b) Summary of models used in GS.
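      The three-step workflow described above can be sketched as a minimal genomic prediction pipeline. The example below is an illustrative sketch, not any published implementation: it simulates a training dataset, estimates all marker effects jointly with ridge-regression shrinkage (the idea behind rrBLUP), and predicts genomic estimated breeding values (GEBVs) for an unphenotyped evaluation dataset; the population sizes, marker counts, and penalty value are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: training dataset -- simulated genotypes (coded 0/1/2) plus
# phenotypes built from an additive genetic signal and random noise.
n_train, n_eval, n_markers = 200, 50, 1000
X_train = rng.integers(0, 3, size=(n_train, n_markers)).astype(float)
true_effects = rng.normal(0.0, 0.1, n_markers)
y_train = X_train @ true_effects + rng.normal(0.0, 1.0, n_train)

# Step 2: model fitting -- ridge regression shrinks all marker effects
# toward zero; the penalty lam is arbitrary here and would normally be
# tuned by cross-validation or derived from variance components.
mu = X_train.mean(axis=0)
Xc = X_train - mu                      # center the marker codes
lam = 100.0
A = Xc.T @ Xc + lam * np.eye(n_markers)
beta_hat = np.linalg.solve(A, Xc.T @ (y_train - y_train.mean()))

# Step 3: prediction for the evaluation dataset -- genotyped but
# unphenotyped candidates are ranked by their GEBVs.
X_eval = rng.integers(0, 3, size=(n_eval, n_markers)).astype(float)
gebv = y_train.mean() + (X_eval - mu) @ beta_hat
top10 = np.argsort(gebv)[::-1][:10]    # candidates to advance to the field
```

      In practice the selected candidates would return to field trials, and their realized phenotypes could be fed back into the training dataset to retrain the model.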

      With the accumulation of genotypic and phenotypic data, GS has made significant contributions to molecular breeding within major plant breeding companies, especially for screening DH lines and predicting hybrid performance[11]. Many models continue to be proposed and refined to enhance prediction accuracy. Below, we describe various statistical and machine learning (ML) models and review them in detail (Fig. 1b).

    • Before genomic relationships were used for GS, pedigrees were used to predict the breeding values of individual crop plants via BLUP mixed-model equations. It is straightforward to determine phenotypic similarity when the progenies share the same pedigree information, which enhances the efficiency of artificial selection[12]. However, this method fails when pedigree information is lacking, and it cannot distinguish among progenies derived from the same parents (i.e., within a full-sib family), whose expected kinship is identical. Therefore, with the advent of DNA molecular markers, scientists explored ways to assess the kinship of individuals more accurately based on their genomic relationships.

      Genomic BLUP (GBLUP) replaces pedigree-based kinship with marker-based genomic relationships when predicting breeding values via the BLUP mixed-model equations. BLUP estimation is derived from the linear mixed model and the following mixed-model equation:

      $ \left[\begin{array}{cc}{X}^{T}{R}^{-1}X& {X}^{T}{R}^{-1}Z\\ {Z}^{T}{R}^{-1}X& {Z}^{T}{R}^{-1}Z+\lambda {G}^{-1}\end{array}\right]\left[\begin{array}{c}\beta \\ u\end{array}\right]=\left[\begin{array}{c}{X}^{T}{R}^{-1}y\\ {Z}^{T}{R}^{-1}y\end{array}\right] $ (1)

      where $ {\sigma }_{g}^{2} $ is the total genetic variance, $ {\sigma }^{2} $ is the residual error variance, $ \lambda =\dfrac{{\sigma }^{2}}{{\sigma }_{g}^{2}} $, $ R $ is the residual variance–covariance coefficient matrix (here the identity matrix $ I $), and $ G $ is the genetic variance–covariance coefficient matrix, which is derived from the estimated genomic relationships[13,14] as follows:

      $ G=\dfrac{Z{Z}^{T}}{2\sum {p}_{i}\left(1-{p}_{i}\right)} $ (2)

      where $ Z $ is the centered marker matrix, $ {p}_{i} $ is the allele frequency of marker $ i $, and all other parameters are as defined above. The quality of the estimated genomic relationships is critical for improving the prediction accuracy of GBLUP. Therefore, various methods have been introduced for estimating genomic relationships[14,15], including single-step GBLUP (ssGBLUP), which combines pedigree-kinship information with genomic relationship information[16,17]. Subsequently, additional GBLUP methods have emerged.
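As an illustration, Eq. (2) can be computed directly from a genotype matrix. Below is a minimal NumPy sketch, assuming genotypes coded as 0/1/2 counts of the alternative allele; the `vanraden_g` helper name and the toy data are ours, not from the cited works:

```python
import numpy as np

def vanraden_g(M):
    """Genomic relationship matrix of Eq. (2) (VanRaden-style estimate).

    M : (n_individuals, n_markers) genotype matrix coded 0/1/2
        (counts of the alternative allele).
    """
    p = M.mean(axis=0) / 2.0            # allele frequency p_i of each marker
    Z = M - 2.0 * p                     # center genotypes by 2 * p_i
    denom = 2.0 * np.sum(p * (1.0 - p))
    return Z @ Z.T / denom

# toy genotypes: 4 individuals x 5 markers
M = np.array([[0, 1, 2, 1, 0],
              [1, 1, 2, 0, 0],
              [2, 0, 1, 1, 1],
              [0, 2, 1, 2, 1]], dtype=float)
G = vanraden_g(M)
```

The resulting $ n\times n $ matrix is symmetric and can be plugged into the mixed-model equation (1) in place of the pedigree-based kinship.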

      In TABLUP (trait-specific marker-derived relationship matrix BLUP), another GBLUP method, a trait-specific relationship matrix is built from the identity by descent (IBD) between two individuals at each locus, weighted by the genetic variance that the locus contributes to the trait[18]. The equation is as follows:

      $ T{A}_{ij}=\mathop\sum\nolimits _{k\,=\,1}^{n}2{P}_{IBD,ijk}{\sigma }_{g,k}^{2} $ (3)

      where $ k $ indexes the loci, $ i $ and $ j $ are individuals, $ {P}_{IBD,ijk} $ is the probability that $ i $ and $ j $ are IBD at locus $ k $, $ {\sigma }_{g,k}^{2} $ is the genetic variance contributed by locus $ k $, and the kinship value is the sum over all loci. Subsequently, Wang et al. proposed two more methods derived from GWAS (genome-wide association study) theory: SUPER BLUP (sBLUP) and compressed BLUP (cBLUP)[19]. In sBLUP, the kinship matrix is built from significant bins, which are grouped based on the markers associated with traits of interest via GWAS[20]. In cBLUP, all individuals are clustered into several groups according to their genomic relationships, and the kinship matrix entries are the average genomic relationships between groups[21]. rrBLUP is a specific model that can implement both kinship-based and SNP-based BLUP: the genomic relationship is calculated as $ {K}_{RR}=G{G}^{T} $, which is equivalent to the SNP-based rrBLUP methods[22,23].
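The SNP-based rrBLUP estimate has a convenient closed form: the ridge solution $ (Z^{T}Z+\lambda I)^{-1}Z^{T}y $ can equivalently be computed through the $ n\times n $ kernel $ ZZ^{T} $, which is cheaper when markers greatly outnumber individuals. A minimal sketch on simulated data follows; the `rrblup_effects` helper is hypothetical, and λ is assumed known here (in practice it is estimated, e.g., by REML):

```python
import numpy as np

def rrblup_effects(Z, y, lam):
    """Closed-form rrBLUP (ridge) marker effects.

    Uses the identity (Z'Z + lam*I)^{-1} Z'y = Z'(ZZ' + lam*I)^{-1} y,
    so only an n x n system is solved when markers outnumber individuals.
    lam = sigma_e^2 / sigma_beta^2 is assumed known for this sketch.
    """
    n = Z.shape[0]
    alpha = np.linalg.solve(Z @ Z.T + lam * np.eye(n), y)
    return Z.T @ alpha

rng = np.random.default_rng(0)
Z = rng.integers(0, 3, size=(50, 200)).astype(float)
Z -= Z.mean(axis=0)                       # center markers
true_beta = np.zeros(200)
true_beta[:5] = 1.0                       # five causal markers
y = Z @ true_beta + rng.normal(scale=0.5, size=50)
beta_hat = rrblup_effects(Z, y, lam=1.0)
```

Genomic estimated breeding values are then simply `Z @ beta_hat`, which illustrates the equivalence between marker-effect and kinship-based formulations noted above.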

    • In BRR (Bayesian ridge regression), the effects of all SNP markers are assigned an identical and independent Gaussian prior, with the coefficients following a normal density: $ \beta \sim \mathrm{N}(0,{\sigma }_{\beta }^{2}) $, where the prior distribution of the variance is a scaled inverse chi-square distribution, $ {\sigma }_{\beta }^{2}\sim{\chi }^{-2}\left({\upsilon }_{\beta },{S}_{\beta }\right) $; the hyperparameters $ {\upsilon }_{\beta } $ and $ {S}_{\beta } $ are updated during model fitting[24,25]. Under this hypothesis, all markers are linked with QTLs. In practice, however, not all markers have an effect on the phenotype, so other prior distributions have been proposed to address this problem.

    • The Bayesian form of the LASSO assigns marker effects a double-exponential (Laplace) prior density[26,27]. Here, the coefficients follow conditionally normal densities, $ \beta \sim\mathrm{N}(0,{\tau }_{p}^{2}{\sigma }^{2}) $. The residual variance follows a scaled inverse chi-square density with degrees of freedom $ \upsilon $ and scale $ S $, $ {\sigma }^{2}\sim{\chi }^{-2}(\upsilon ,S) $. $ {\tau }_{j}^{2}\sim\mathrm{E}xp\left(\lambda \right) $ is an exponential density with hyperparameter $ \lambda $, and $ {\lambda }^{2}\sim Gamma({\alpha }_{1},{\alpha }_{2}) $ is a gamma distribution with shape parameter $ {\alpha }_{1} $ and rate parameter $ {\alpha }_{2} $. Two other prior distributions for $ \lambda $ have also been used: $ {\lambda }^{2}\sim beta(p,\pi ) $ and a flat distribution[25,28]. Moreover, Legarra et al. proposed BL2Var, with coefficient $ {\tau }_{j}^{2}\sim\mathrm{I}\mathrm{G}\left(\sqrt{\dfrac{{\tilde{\lambda }}^{2}}{{\tilde{a}}_{i}^{2}}},{\lambda }^{2}\right) $, and BL1Var, with coefficient $ {\tau }_{j}^{2} \sim\mathrm{I}\mathrm{G}\left(\sqrt{\dfrac{{\tilde{\lambda }}^{2}{\sigma }^{2}}{{\tilde{a}}_{i}^{2}}},{\lambda }^{2}\right) $[29].

    • In BayesA, each marker effect has its own variance, with a scaled inverse chi-square distribution: $ {\sigma }_{{\beta }_{j}}^{2}\sim{\chi }^{-2}({\upsilon }_{\beta },{S}_{ \beta }) $; $ {\upsilon }_{\beta }=4.012 $ and $ {S}_{\beta }=0.002 $ are the suggested fixed values for the degrees of freedom and scale parameter, respectively. The marginal distribution of marker effects is therefore a scaled-t density[5]. However, the shrinkage of SNP effects is strongly affected by $ {S}_{ \beta } $, which should instead be treated as an unknown parameter with its own prior, $ {S}_{ \beta }\sim Gamma\left(r,s\right) $, where $ r $ and $ s $ are the rate and shape parameters, respectively[30].

    • In BayesC, the variances of all marker effects are identical and independent, with a prior scaled inverse chi-square distribution: $ {\sigma }_{\beta }^{2}\sim {\chi }^{-2}({\upsilon }_{\beta },{S}_{ \beta }) $, where $ {\upsilon }_{\beta } $ and $ {S}_{ \beta } $ are the degrees of freedom and scale parameter, respectively, and $ {S}_{ \beta } $ has the prior density $ {S}_{ \beta }\sim Gamma\left(r,s\right) $, with rate $ r $ and shape $ s $. The coefficients follow a normal density, $ {\beta }_{j}\sim N(0,{\sigma }_{\beta }^{2}) $, with probability $ \pi $, and $ {\beta }_{j}=0 $ with probability $ 1-\pi $. When $ \pi $ is treated as an unknown parameter with prior density $ beta(a,b) $, the model is referred to as BayesCπ[30].

    • BayesB combines the hypotheses of BayesA and BayesC. The coefficients follow a normal density, $ {\beta }_{j}\sim N(0,{\sigma }_{{\beta }_{j}}^{2}) $, with probability $ \pi $, and $ {\beta }_{j}=0 $ with probability $ 1-\pi $. When $ \pi $ is treated as an unknown parameter with prior density $ beta(a,b) $, the model is referred to as BayesDπ. As in BayesA, each marker effect has its own variance, with a scaled inverse chi-square distribution: $ {\sigma }_{{\beta }_{j}}^{2}\sim {\chi }^{-2}({\upsilon }_{\beta },{S}_{ \beta }) $ and $ {S}_{ \beta }\sim Gamma\left(r,s\right) $, where $ r $ and $ s $ are the rate and shape parameters, respectively[5,25].

      Several other Bayesian estimations use different prior distributions on marker effects, including BayesU[31], BayesHP, BayesHE[32], BayesR, and emBayesR[33]. As the above descriptions show, Bayesian theory provides a family of GS models defined by different prior hypotheses, and mixing prior distributions generates many further model variants. The means and variances of all parameters can be estimated by Markov chain Monte Carlo methods such as Metropolis–Hastings or Gibbs sampling[34], and the prediction accuracy of these models is comparable to that of other models[35].
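To make the Gibbs-sampling machinery concrete, the sketch below implements a deliberately minimal sampler for the simplest case, BRR. The `brr_gibbs` name, the fixed hyperparameters, and the simulated data are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def brr_gibbs(Z, y, n_iter=300, burn=100, seed=0):
    """Minimal Gibbs sampler for BRR.

    Priors: beta_j ~ N(0, sigma_b^2); both variances receive scaled
    inverse chi-square priors with small fixed hyperparameters
    (nu and S below are illustrative choices).
    """
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    beta = np.zeros(m)
    sigma_e2 = sigma_b2 = 1.0
    nu, S = 4.0, 1.0
    zz = np.sum(Z ** 2, axis=0)
    e = y - Z @ beta
    draws = []
    for it in range(n_iter):
        for j in range(m):
            e += Z[:, j] * beta[j]                 # remove marker j from residual
            c = zz[j] / sigma_e2 + 1.0 / sigma_b2  # conditional precision
            mu = (Z[:, j] @ e) / (sigma_e2 * c)    # conditional mean
            beta[j] = rng.normal(mu, np.sqrt(1.0 / c))
            e -= Z[:, j] * beta[j]                 # restore residual
        # scaled inverse chi-square updates of the two variances
        sigma_b2 = (beta @ beta + nu * S) / rng.chisquare(m + nu)
        sigma_e2 = (e @ e + nu * S) / rng.chisquare(n + nu)
        if it >= burn:
            draws.append(beta.copy())
    return np.mean(draws, axis=0)                  # posterior-mean effects

rng = np.random.default_rng(7)
Z = rng.integers(0, 3, size=(60, 20)).astype(float)
Z -= Z.mean(axis=0)
true_beta = np.zeros(20)
true_beta[:3] = 1.0
y = Z @ true_beta + rng.normal(scale=0.5, size=60)
beta_hat = brr_gibbs(Z, y)
```

The Bayesian alphabet variants above differ only in the prior placed on each $ {\sigma }_{{\beta }_{j}}^{2} $ and in the point mass at zero, so they modify the per-marker sampling step rather than the overall structure of this loop.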

    • Although the accuracy of phenotypic prediction during GS has improved, it remains challenging to analyze highly complex agronomic traits regulated by numerous genetic loci with minor effects[36]. This makes it difficult to capture genetic interactions with models based on classical BLUP or Bayesian estimation[37]. Furthermore, epistasis and imprinting are common and widespread in biological processes[38]. Consequently, ML methods have been proposed to address the problems arising from genetic interactions using non-linear approaches[39,40], and much effort has been devoted to developing ML methods that improve the accuracy of GS.

    • A kernel function (used via the 'kernel trick') computes, within the original low-dimensional space, the inner product of vectors mapped into a high- or even infinite-dimensional feature space. This allows ML methods to perform non-linear classification by implicitly mapping features from a low-dimensional space into a higher-dimensional feature space. Kernel functions are computationally efficient precisely because all calculations are carried out in the low-dimensional space:

      $ K({x}_{i},{x}_{j})=\varphi \left({x}_{i}{)}^{T}\varphi \right({x}_{j}) $ (4)

      where $ {x}_{i} $ and $ {x}_{j} $ are features in the low-dimensional space, and $ \varphi $ is the function mapping $ {x}_{i} $ and $ {x}_{j} $ into a higher-dimensional feature space. Several kernel functions can be used to compute $ K $, including the linear, polynomial, sigmoidal, and Gaussian kernels. These methods effectively capture the non-additive genetic effects among individuals for phenotypic prediction. In quantitative genetics, non-additive genetic effects play a critical role in elaborating the basic theory of genetic networks, yet high-dimensional genetic effects are rarely computed explicitly because of limited storage and computing capacity. With the kernel function, the results of these models are expressed as an $ n\;\times\; n $ matrix, and only a few hyperparameters need to be trained.
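For example, the Gaussian (RBF) kernel $ K({x}_{i},{x}_{j})=\exp(-\gamma {\left\| {x}_{i}-{x}_{j}\right\|}^{2}) $ can be evaluated for all pairs of individuals at once. A small NumPy sketch, where the `gaussian_kernel` helper and the γ value are illustrative:

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2),
    computed directly in the low-dimensional space (the 'kernel trick')."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))      # clamp tiny negatives

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0]])
K = gaussian_kernel(X, gamma=0.5)
```

The result is the $ n\times n $ matrix mentioned above: its diagonal is 1 and each off-diagonal entry decays with the distance between individuals.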

      Theoretical aspects of the reproducing kernel Hilbert space (RKHS) mixed model were introduced in 2006 to study the non-linear relationships of genetic interactions through kernel functions in Hilbert spaces[41]. In 2008, Gianola & van Kaam extended this technique to an RKHS mixed model that partitions epistatic variance among the effects of many genetic loci within a linear mixed model[42]. The linear model of the dual formulation of RKHS is:

      $ y=X\beta +Zu+{K}_{h}\alpha +e $ (5)

      where $ \beta $ is the fixed variable; $ u $, $ \alpha $, and $ e $ are random variables with independent distributions; and $ u \sim N(0,{\sigma }_{g}^{2}G) $, $ \alpha |h \sim N(0,{\sigma }_{\alpha }^{2}{K}_{h}^{-1}), $ and $ e \sim N(0,{\sigma }^{2}I) $, respectively. Similarly, the regression estimating equation is:

      $ \left[\begin{array}{ccc}{X}^{T}{R}^{-1}X & {X}^{T}{R}^{-1}Z & {X}^{T}{R}^{-1}{K}_{h}\\ {Z}^{T}{R}^{-1}X & {Z}^{T}{R}^{-1}Z+\left(\dfrac{{\sigma }^{2}}{{\sigma }_{g}^{2}}\right){G}^{-1} & {Z}^{T}{R}^{-1}{K}_{h}\\ {K}_{h}^{T}{R}^{-1}X & {K}_{h}^{T}{R}^{-1}Z & {K}_{h}^{T}{R}^{-1}{K}_{h}+\left(\dfrac{{\sigma }^{2}}{{\sigma }_{\alpha }^{2}}\right){K}_{h}\end{array}\right]\left[\begin{array}{c} \beta \\ u \\ \alpha \end{array}\right]=\left[\begin{array}{c} {X}^{T}{R}^{-1}y \\ {Z}^{T}{R}^{-1}y \\ {K}_{h}^{T}{R}^{-1}y \end{array}\right] $ (6)

      where, $ R=I $; $ {K}_{h} $ is the correlation matrix from the results of the kernel function of features.

      Subsequently, a general framework was proposed for the genetic evaluation of RKHS regression for pedigree- or marker-based regressions under any genetic model[43]. Additionally, a Bayesian approach was formulated to solve for the unknown parameters of the RKHS mixed model[42,44]. RKHS regression outperformed linear models in predicting the total genetic value for body weight in chickens[45].
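Setting aside the fixed and pedigree effects in Eq. (5), the core RKHS computation reduces to a kernel ridge solve, $ \alpha =(K+\lambda I)^{-1}y $. A simplified single-kernel sketch on simulated non-linear data follows; the function names and the λ and γ values are illustrative assumptions:

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and B."""
    d2 = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def rkhs_fit_predict(X, y, X_new, lam, gamma):
    """Kernel ridge solve alpha = (K + lam*I)^{-1} y, then predict K_new @ alpha."""
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return rbf(X_new, X, gamma) @ alpha

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)   # non-linear signal
yhat = rkhs_fit_predict(X, y, X, lam=0.5, gamma=0.1)
```

This corresponds to the $ {K}_{h}\alpha $ term of Eq. (5) with the fixed-effect and polygenic terms omitted; the full mixed-model equation (6) adds those back.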

      Support vector machines (SVMs) are another set of efficient methods for non-linear classification through the kernel function[46]. The model is:

      $ y=X\beta +b $ (7)

      where $ y $ is the response variable of the sample, $ \beta $ and $ b $ are the coefficients to be solved, and $ X $ is the feature variable. The key idea of the SVM is a hyperplane that separates the high-dimensional space into two parts. To guarantee a unique solution, the hyperplane is chosen so that the closest points lie at the maximum possible distance from it, i.e., it is the maximum-margin hyperplane. The transformed formulation of the optimization is:

      $ min\dfrac{1}{2}{\left\| \beta \right\|}^{2}+C\mathop\sum\nolimits _{i\,=\,1}^{n}{\xi }_{i} $ (8)
      $ sub ject\;to\;{y}_{i}\left({\beta }^{T}{x}_{i}+b\right)\ge 1-{\xi }_{i} $ (9)
      $ {\xi }_{i}\ge 0 $ (10)

      where $ {\xi }_{i} $ are slack variables and $ C $ is a regularization parameter. To achieve variables of the low-dimensional space to map to the high-dimensional space, a primal-dual method was used to solve optimization problems[47,48]. The dual formulation is:

      $ \mathrm{max}\mathop\sum\nolimits _{i\,=\,1}^{n}{\alpha }_{i}-\dfrac{1}{2}\mathop\sum\nolimits _{i,j}^{n}{\alpha }_{i}{\alpha }_{j}{y}_{i}{y}_{j}\varphi ({x}_{i}{)}^{T}\varphi ({x}_{j}) $ (11)
      $ sub ject\;to\;0\le {\alpha }_{i}\le C $ (12)
      $ \mathop\sum\nolimits _{i\,=\,1}^{n}{\alpha }_{i}{y}_{i}=0 $ (13)

      where $ {\alpha }_{i} $ is the variable parameter of dual formulation, $ \varphi ({x}_{i}{)}^{T}\varphi ({x}_{j})=K({x}_{i},{x}_{j}) $, which would be computed by the kernel function. The kernel function might be a polynomial kernel, sigmoidal kernel, Gaussian kernel, or other kernel type. The parameter $ \beta $ is expressed according to the following primal-dual formulation:

      $ \beta =\mathop\sum\nolimits _{i\,=\,1}^{n}{y}_{i}{\alpha }_{i}\varphi \left({x}_{i}\right) $ (14)
      $ b=\dfrac{1-{y}_{i}{\beta }^{T}\varphi \left({x}_{i}\right)}{{y}_{i}} $ (15)

      Finally, the decision function is:

      $ {y}^{*}=\mathop\sum\nolimits_{i\,=\,1}^{n}{y}_{i}{\alpha }_{i}K({x}_{i},{x}_{j})+b $ (16)

      Applying SVM to GS for trait prediction is straightforward, and SVM is indeed a competitive and promising strategy for GS in plant breeding[49,50]. On the whole, the decision function of the SVM depends on only a few support vectors, which is an advantage when classifying a small number of samples. Moreover, the computational complexity remains low, which helps avoid the 'curse of dimensionality'.
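As a usage sketch, support vector regression with a Gaussian kernel can be applied to simulated SNP data. This assumes scikit-learn is available; the data, the `C` value, and the kernel choice are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVR

# simulated GS setup: SNP features coded 0/1/2 and an additive phenotype
rng = np.random.default_rng(2)
X = rng.integers(0, 3, size=(120, 50)).astype(float)
effects = np.zeros(50)
effects[:4] = 1.0                   # four causal markers
y = X @ effects + rng.normal(scale=0.5, size=120)

model = SVR(kernel="rbf", C=10.0)   # Gaussian kernel, illustrative C
model.fit(X, y)
pred = model.predict(X)             # in-sample fit as a sanity check
r = np.corrcoef(pred, y)[0, 1]
```

In a real GS application, `C`, `gamma`, and `epsilon` would be tuned by cross-validation, and accuracy would be reported on held-out individuals rather than the training set.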

    • Ensemble learning is a technical framework that is not a single ML algorithm per se. In ensemble learning, a learning task is accomplished by building and combining multiple basic machine models that decide the ultimate results. Ensemble learning can be used for classification, regression problems, and feature selection during data mining. This technique could be applied to GS to predict the phenotypic values of new individuals. Three common ensemble learning frameworks are currently used: bagging (also known as bootstrap aggregation), boosting, and stacking.

      Random Forest (RF) is a typical bagging ML method (Fig. 2)[51]. In each round, M training samples are drawn with replacement from the original set of N samples (the bootstrap method); some samples may be drawn multiple times, while others may not be drawn at all. K rounds of sampling yield K mutually independent training sets, each of which is used to build a decision tree for classification or regression. For example, each weak learner could be a classification and regression tree (CART)[52] grown using the Gini index: $ \mathit{Gini}\left(p\right)=1-\sum_{k=1}^Kp_k^2 $.

      Figure 2. 

      An overview of the random forest[51]. The random forest combines bootstrap sampling with decision-tree weak learners grown using the Gini index.

      Theoretically, other ML algorithms could also generate weak learners from the K training sets. Compared to a single decision tree, RF may be less accurate with few trees, but its performance improves as more trees are added. RF is therefore an outstanding ML method and is starting to be used to analyze genetic networks, generating promising results[53]. As a robust learner, RF reduces noise and overfitting during GS model training owing to its nonparametric nature, but a large amount of data is needed to realize the efficiency of the GS model[54]. Additionally, to capture additive and epistatic effects and thereby improve the accuracy of GS, RF has been combined with a linear mixed model and with a Bayesian model to construct a mixed RF model and Bayesian additive regression trees (BART), respectively[55].
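A brief usage sketch with scikit-learn's `RandomForestRegressor` on simulated genotypes; the data and settings are illustrative, and the simulated phenotype includes one epistatic term to mimic non-additive effects:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(150, 40)).astype(float)
effects = np.zeros(40)
effects[:3] = 1.0
# additive signal plus one epistatic (marker x marker) interaction term
y = X @ effects + 0.5 * X[:, 0] * X[:, 1] + rng.normal(scale=0.3, size=150)

rf = RandomForestRegressor(n_estimators=200, random_state=0)  # 200 bootstrap trees
rf.fit(X, y)
importance = rf.feature_importances_   # causal markers should rank highest
```

The per-feature importances are a practical by-product of RF: in this toy setting the three causal markers dominate the ranking, which is one reason RF is also used for marker screening.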

      Light gradient boosting machine (LightGBM): There are two major boosting methods: adaptive boosting (AdaBoost)[56] and gradient boosting (GB)[56]. The gradient boosting decision tree (GBDT) was developed from GB using CART decision trees as weak classifiers[56]. GBDT iteratively trains and optimizes weak classifiers, achieving good prediction accuracy while remaining relatively resistant to overfitting. LightGBM[57], extreme gradient boosting (XGBoost)[58], and categorical boosting (CatBoost)[59] are regarded as the three best implementations of GBDT, as they outperform the basic GBDT framework: XGBoost is widely applied in industry, LightGBM offers superior computational efficiency, and CatBoost is thought to achieve better accuracy.

      LightGBM employs several novel optimizations to accelerate calculations and reduce memory usage, improving on XGBoost in these respects (Fig. 3)[57]. It replaces the pre-sorted algorithm with a histogram-based algorithm to preprocess dataset features and reduce complexity. A novel sampling algorithm, gradient-based one-side sampling (GOSS), was developed to decrease the size of the training data without decreasing the accuracy of the model.

      Figure 3. 

      An overview of LightGBM[57]. LightGBM includes the GOSS, EFB, histogram-based feature selection, and leaf-wise tree growth of the decision tree.

      In GOSS, training instances are sorted in descending order of their gradients. The top fraction of instances is retained as subset A, and the remaining instances are randomly sampled to yield subset B; the new training dataset consists of subsets A and B (Fig. 3). An exclusive feature bundling (EFB) algorithm was also proposed to decrease the number of features by bundling mutually exclusive features into single features, yielding a sparse feature matrix. In addition, unlike other boosting models, LightGBM grows decision trees leaf-wise when splitting nodes, which achieves lower loss than the level-wise algorithm. Finally, multi-threaded optimization and control of dataset complexity are easier with the LightGBM model. Yan et al. applied LightGBM to F1 prediction in GS[60] and demonstrated that it was more accurate than XGBoost, CatBoost, and rrBLUP when the training dataset excluded parental information; when parental information was included, rrBLUP was more accurate. The accuracy of the LightGBM model can also be affected by the population structure and kinship of the training population. Additionally, LightGBM requires less time and memory than XGBoost, CatBoost, and rrBLUP.
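The GOSS step itself is easy to sketch in NumPy: keep the instances with the largest absolute gradients, subsample the rest, and re-weight the subsample so that gain estimates remain approximately unbiased. The `goss_sample` helper and the fractions `a` and `b` are illustrative:

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, rng=None):
    """Gradient-based one-side sampling (GOSS) sketch.

    Keeps the top a-fraction of instances by |gradient| and a random
    b-fraction of the rest; the small-gradient subset is re-weighted by
    (1 - a) / b so the estimated information gain stays roughly unbiased.
    """
    rng = rng or np.random.default_rng(0)
    n = len(gradients)
    top_n, rand_n = int(a * n), int(b * n)
    order = np.argsort(-np.abs(gradients))     # sort by |gradient|, descending
    top = order[:top_n]                        # subset A: large gradients
    rest = rng.choice(order[top_n:], size=rand_n, replace=False)  # subset B
    idx = np.concatenate([top, rest])
    weights = np.ones(len(idx))
    weights[top_n:] = (1.0 - a) / b            # amplify the small-gradient subset
    return idx, weights

g = np.array([5.0, -4.0, 3.0, 0.5, -0.2, 0.1, 0.05, -0.3, 0.2, 0.15])
idx, w = goss_sample(g, a=0.2, b=0.3)
```

The returned indices and weights would feed the next boosting iteration, so each tree is trained on the reduced, re-weighted dataset rather than all instances.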

    • Deep learning (DL), a popular research direction in artificial intelligence (AI), has developed rapidly in recent years. A series of excellent models for computer vision have been produced using DL, including multilayer perceptron (MLP), convolutional neural networks (CNN), and recurrent neural networks (RNN). Therefore, some attempts have been made to apply DL to biology to enhance GS and other techniques[61]. Here, we focus on MLP and CNN algorithms.

      MLP is a foundational deep neural network architecture. The simplest MLP is a three-layered network containing an input layer, a hidden layer, and an output layer (Fig. 4)[62], and multiple hidden layers can be added to the network. The mathematical expressions are as follows:

      Figure 4. 

      An architecture of the multilayer perceptron[62]. The multilayer perceptron includes an input layer (a0) and an output layer (a5); the hidden portion can consist of many layers (a1 to a4).

      $ {H}^{1}=\sigma (X{W}^{1}+{b}^{1}) $ (17)
      $ O=\sigma ({H}^{1}{W}^{2}+{b}^{2}) $ (18)
      $ Loss\;function:min\dfrac{1}{2}(y-O{)}^{2} $ (19)

      Here, $ X $ is the $ n\ \times\ o $ feature matrix, $ {W}^{1} $ is the $ o\ \times\ p $ matrix of trainable parameters, $ {H}^{1} $ is the $ n\ \times\ p $ matrix of the first hidden layer, $ {b}^{1} $ is the $ 1\ \times\ p $ bias vector, $ {W}^{2} $ is the $ p\ \times\ m $ matrix of trainable parameters, $ O $ is the $ n\ \times\ m $ matrix of output variables, $ {b}^{2} $ is the $ 1\ \times\ m $ bias vector, and $ \sigma $ is the activation function. All parameters can be optimized via forward propagation and backpropagation according to the loss function. Bellot et al. used MLP to predict complex human traits and determined that one hidden layer performed better than multiple hidden layers[62]. Nevertheless, other reports indicate that the optimal number of hidden layers depends on the dataset[63,64]. In general, no more than three hidden layers are normally required in GS projects[65].
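Eqs. (17)–(19) correspond to a short forward pass. A NumPy sketch follows, where a ReLU activation is assumed for the hidden layer and the output layer is left linear for regression (a common simplification of Eq. 18); the dimensions match the notation above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mlp_forward(X, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP (cf. Eqs. 17-18)."""
    H1 = relu(X @ W1 + b1)       # hidden layer, n x p
    O = H1 @ W2 + b2             # linear output for regression, n x m
    return O

def loss(y, O):
    """Mean squared-error loss (cf. Eq. 19)."""
    return 0.5 * np.mean((y - O) ** 2)

rng = np.random.default_rng(4)
n, o, p, m = 8, 5, 3, 1          # samples, features, hidden units, outputs
X = rng.normal(size=(n, o))
W1, b1 = rng.normal(size=(o, p)), np.zeros((1, p))
W2, b2 = rng.normal(size=(p, m)), np.zeros((1, m))
O = mlp_forward(X, W1, b1, W2, b2)
```

Training then consists of repeating this forward pass and updating `W1`, `b1`, `W2`, `b2` along the gradient of the loss via backpropagation.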

      CNN plays a crucial role in DL. Its predecessors date back to 1980[66]. Nevertheless, the first formal CNN model, proposed by LeCun et al. in 1998, contains three types of layers: convolutional layers, pooling layers, and a fully connected layer (Fig. 5)[67]. The authors also described a complete backpropagation optimization algorithm. The convolutional layers, which focus only on local features, are the critical component of a CNN: the size of the convolution kernel determines the receptive field, and the convolution of the kernel with the corresponding area is computed as an inner product, following the continuous convolution $ \left(f * g\right)\left(t\right)={\int }_{-\infty }^{+\infty }f\left(\tau \right)g\left(t-\tau \right)d\tau $. Pooling layers are periodically inserted between successive convolutional layers; they gradually reduce the spatial size and the number of features to save computing resources by downsampling, without materially affecting the results. Max pooling is usually used to implement the downsampling.

      Figure 5. 

      An architecture of convolutional neural networks[67]. (a) General CNN algorithm, including convolutional layer, pooling layer, and fully connected layer. (b) Explanation of the convolutional computation. (c) Max pooling method.

      The kernel can be set to different dimensions as circumstances require: 1D, 2D, or 3D, all expressed as a weight-parameter matrix to be trained in the CNN. A 1D convolutional kernel can be applied to SNP features, as the genotype of each individual is expressed as a 1D vector; previous studies taking this approach include DeepGS[68], DLGWAS[69], and DNNGP[70]. While all these models aim to enhance genomic prediction, they differ significantly in architecture and methodology. DeepGS and DLGWAS both employ CNN structures; however, DLGWAS introduces a dual-stream design that significantly enhances the flexibility of feature processing. In contrast, DNNGP adopts a more conventional approach focused on efficient feature extraction, which may limit its capacity to capture complex data relationships due to its relatively simpler architecture. Meanwhile, 2D representations of genotype data are a newer focus that remains to be tested. In these networks, the convolutional layer performs dimensionality reduction, and an activation function then introduces non-linearity. Notably, all these models are relatively shallow MLPs or CNNs, and previous results have not shown convolutional DL models to be more accurate than traditional models[64].
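A 1D convolutional layer over a SNP vector, followed by max pooling, can be sketched in a few lines of NumPy; the kernel weights, the toy genotype vector, and the pooling size are illustrative:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1D 'valid' convolution (cross-correlation, as used in CNN layers)
    of a SNP vector x with a trainable weight kernel."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def max_pool1d(x, size=2):
    """Non-overlapping max pooling for downsampling."""
    trimmed = x[: len(x) // size * size]
    return trimmed.reshape(-1, size).max(axis=1)

snp = np.array([0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 1.0, 2.0])  # one individual's genotypes
feat = conv1d_valid(snp, np.array([0.5, -1.0, 0.5]))       # local feature map
pooled = max_pool1d(feat, size=2)                          # downsampled features
```

In a full 1D-CNN such as those cited above, many such kernels are learned in parallel, an activation function is applied to each feature map, and the pooled outputs feed a fully connected layer that produces the phenotype prediction.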

    • ML algorithms have been extensively applied to large, high-dimensional datasets. These algorithms not only accommodate the numerous markers produced by high-throughput sequencing but also suit different types of omics data, such as gene expression data and functional annotations of proteins. ML models also provide importance measures, facilitating the identification of the markers with the most significant effects on the trait of interest[60]. ML is therefore highly flexible for feature processing. The biological path from gene to phenotype is highly complex, involving transcription, translation, and gene networks, and traditional linear models cannot accurately capture all of these relationships. In contrast, many ML methods can recognize non-linear relationships between genetic markers and phenotypes, offering improved modeling capabilities[40]. However, ML algorithms, such as DL models, require significant computational resources and large training datasets because of the many parameters to be estimated[70]. Additionally, ML models, particularly DL architectures, are often black boxes, making it challenging to interpret marker effects or to understand the biological mechanisms underlying the predictions. It is essential to select the appropriate ML method based on the characteristics of the dataset and the objectives of GS. In general, ML model tuning should be performed to evaluate the performance and suitability of each method for a particular trait and dataset (Table 1).

      Table 1.  Comparison of ML-based GS algorithms and statistical algorithms.

      Comparison item | ML-based GS algorithms | Statistical algorithms
      Data handling capacity | Process high-dimensional datasets and handle omics data | Limited to traditional markers
      Non-linear relationships | Capture non-linear relationships, enhancing model performance | Struggle with non-linear relationships
      Computational resources | Require significant computational resources | Require fewer resources
      Interpretability | Act as black boxes; difficult to interpret | Provide transparent models
      Applicability | Offer flexible processing; require tuning | Suit linear relationships
    • Maize DH lines can be produced in a high-throughput manner in the laboratory, greenhouse, or winter nursery under year-round management. However, screening and developing DHs is challenging: it is not feasible to plant all DHs in the field because of the excessive workload and low-efficiency selection process. Consequently, GS is a critical step in DH-based breeding[71]. Similarly, numerous F2/F3 or BC1/BC2 materials of maize, rice, and wheat are generated on a large scale during breeding. Although the genotypes of these materials are not yet stable and some loci are still segregating, phenotypic prediction of these materials nevertheless increases the efficiency of the breeding process[72,73].

    • With hybrid materials of maize or rice in place, breeders can use trained GS models to predict the performance of different hybrid combinations[60]. The genomic information from the selected parental lines is used to estimate the expected performance of the hybrid progeny for the target traits. These predictions help breeders make informed decisions about which hybrid combinations are likely to display superior performance and should be advanced in the breeding program.

    • During inbred line selection, two parental lines of maize, rice, or wheat must be selected to perform a single cross. As a general rule, the two parents are expected to generate lines with improved versions of their phenotypic traits, and parents with complementary genetic backgrounds tend to produce superior progeny. It is therefore difficult to decide which combination is best when many parental lines are available. In modern breeding, each elite line can be genotyped with high-density markers, and a breeding population of 300–400 individuals can be simulated by a genetic algorithm based on the markers of the parental lines. Each phenotypic trait of these simulated individuals can then be predicted by the trained models[74,75], yielding a predicted phenotypic value for each possible combination of two parental lines. All combinations can then be compared to support decision-making based on the predicted breeding values.
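This combination screening can be sketched as follows, assuming additive marker effects β from any previously trained GS model and approximating the expected F1 genotype by the mid-parent allele dosage; all data here are simulated and the scheme is a simplified illustration (no population simulation step):

```python
import numpy as np

rng = np.random.default_rng(5)
n_parents, n_markers = 6, 30
parents = rng.integers(0, 3, size=(n_parents, n_markers)).astype(float)
beta = rng.normal(size=n_markers)     # stand-in for pre-trained marker effects

best, best_pair = -np.inf, None
for i in range(n_parents):
    for j in range(i + 1, n_parents):
        f1 = (parents[i] + parents[j]) / 2.0   # expected F1 allele dosage
        gebv = float(f1 @ beta)                # predicted breeding value
        if gebv > best:
            best, best_pair = gebv, (i, j)
```

Real pipelines would replace the mid-parent approximation with simulated segregating progeny and would rank crosses on several traits at once, but the exhaustive pairwise comparison shown here is the core of the decision step.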

    • The prediction accuracy for various traits varies considerably across BLUP estimation models, Bayesian estimation models, and ML[76,77]. The genomic estimated breeding value from GBLUP with the realized (additive) relationship matrix is equivalent to that from the marker-based rrBLUP strategy[22,78]. Additionally, pedigree-based BLUP (PBLUP) is less accurate than GBLUP in most situations, as GBLUP can capture the effects of both within- and between-family genetic variation[79]. There is a growing number of prior distributions and hypotheses for Bayesian regression models, with varying levels of accuracy: accuracy is higher when the genetic architecture of the trait complies with the prior distribution[80].

      The relative performance of GS models also depends on the QTL architecture of the trait. For example, BayesB is more accurate when traits are controlled by a few genes or major QTLs[5]. Similarly, Meher et al. reported that the Bayesian alphabet models perform better when traits are controlled by a few genes/QTLs with relatively large effects. To assess this issue in more detail, we used a dataset of 487 wheat individuals with 30,548 markers. The dataset included various types of wheat, such as cultivars, landraces, and synthetic hexaploids, and we analyzed several models, including rrBLUP, BRR, BL, BayesA, BayesB, MLP, and CNN (Fig. 6). For some traits determined by multiple small-effect QTLs and a few large-effect QTLs, BayesA and BayesB gave more accurate predictions than the other models[80] (Fig. 6a). However, for complex traits such as yield, BayesA and BayesB were not always better (Fig. 6b).

      Figure 6. 

      Comparison of factors affecting the prediction ability of GS models. (a), (b) Comparison of nine GS algorithms on wheat based on the Pearson correlation coefficient of model prediction ability. (a) Plant height, with two major QTLs and heritability of 75.7% in 2014 and 76.5% in 2015. (b) Yield, with five major QTLs and heritability of 70.1% in 2014 and 85.6% in 2015. 'MLP_1' and 'MLP_5' denote multilayer perceptrons with one and five hidden layers, respectively; 'CNN' denotes a network with one convolutional layer, one pooling layer, and one fully connected layer; 'Alexnet' is based on the AlexNet architecture. (c), (d) Impact of marker number on prediction accuracy. (c) Thirteen marker subsets randomly sampled from the 30,548 markers were used to validate prediction accuracy with the BayesA model. (d) Thirteen marker subsets randomly sampled from the 30,548 markers were used to validate prediction accuracy with the MLP_1 model (multilayer perceptron with one hidden layer). (e), (f) Impact of feature processing on prediction accuracy. (e) Six feature-processing approaches were used to validate prediction accuracy with the MLP_1 model. (f) Six feature-processing approaches were used to validate prediction accuracy with the CNN model. 'Null' is all markers; 'minmax' is min-max scaling; 'standard' is z-score normalization; 'robust' is robust scaling; 'PCA' is principal component analysis; 'variance' is variance scaling. All models were validated by five-fold cross-validation repeated 30 times. The least significant difference (LSD) test with a threshold of 0.05 was used for significance testing.

      Additionally, during the construction of models for use in GS, data preprocessing has a critical effect on prediction accuracy. An appropriate preprocessing pipeline helps ensure the integrity and usefulness of the data, resulting in improved accuracy for GS. Such preprocessing can include screening and imputation of genotyping data, as well as filtering out missing phenotypic data and outliers[81].
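As an illustration, a minimal genotype-preprocessing pipeline (missing-rate filtering, per-marker mean imputation, and removal of monomorphic markers) might look like the following; the thresholds and the toy data are our own assumptions, not a prescribed protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy genotype matrix (0/1/2) with missing calls encoded as np.nan.
G = rng.integers(0, 3, size=(10, 8)).astype(float)
G[rng.random(G.shape) < 0.1] = np.nan

# 1) Drop markers with too many missing calls (here > 20% missing).
miss_rate = np.isnan(G).mean(axis=0)
G = G[:, miss_rate <= 0.2]

# 2) Impute remaining missing calls with the per-marker mean.
col_mean = np.nanmean(G, axis=0)
rows, cols = np.where(np.isnan(G))
G[rows, cols] = col_mean[cols]

# 3) Drop monomorphic markers: zero variance carries no signal.
G = G[:, G.std(axis=0) > 0]

print(G.shape, bool(np.isnan(G).any()))
```

Real pipelines typically also filter on minor allele frequency and use model-based imputation, but the structure is the same: filter, impute, then drop uninformative features.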

      The heritability of traits is also critical for prediction accuracy, and there is a close relationship between predictability and heritability. Bayesian regression models are more accurate than BLUP regression for traits with high heritability (Fig. 6a & b), whereas traits with lower heritability are better predicted with BLUP regression models[82]. However, when heritability is low, prediction accuracy is easily overestimated. Consequently, GS for traits with low heritability remains challenging.
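The ceiling that heritability places on accuracy can be seen in a small simulation: even a predictor that recovers the true genetic values exactly cannot correlate with the phenotype much beyond √h², because the remaining variance is environmental. The data below are entirely synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Standardized true genetic values for 400 synthetic individuals.
n, p = 400, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)
g = (X - X.mean(axis=0)) @ rng.normal(size=p)
g = (g - g.mean()) / g.std()

# Build phenotypes with a chosen heritability h2 and correlate them
# with the (perfectly known) genetic values.
accuracy = {}
for h2 in (0.8, 0.3):
    e = rng.normal(size=n)
    e = (e - e.mean()) / e.std()                 # standardized noise
    y = np.sqrt(h2) * g + np.sqrt(1 - h2) * e    # trait with heritability h2
    accuracy[h2] = float(np.corrcoef(g, y)[0, 1])

# Even this ideal predictor is capped near sqrt(h2) (about 0.89 and 0.55).
print({k: round(v, 2) for k, v in accuracy.items()})
```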

      The size of the training population is a crucial factor for model accuracy. In general, the larger the population, the greater the prediction accuracy, but this association weakens after a certain threshold[60,70]. Therefore, the cost of the experiment and prediction accuracy must be balanced. A representative training population of sufficient size could lead to better generalization of a model. Higher accuracy is obtained when there is a very close relationship between the training dataset and the testing dataset, such as when both are part of the same full-sib family. Insufficient or biased training data can result in poor accuracy for GS.
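The repeated five-fold cross-validation used to evaluate the models in Fig. 6 can be sketched as follows; the ridge model, the number of repeats, and the synthetic data are illustrative choices of ours, not the exact protocol of any cited study.

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Fit centered ridge regression; return effects and centering terms."""
    mx, my = X.mean(axis=0), y.mean()
    Xc = X - mx
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - my))
    return b, mx, my

def ridge_predict(model, X):
    b, mx, my = model
    return (X - mx) @ b + my

def five_fold_cv(X, y, repeats=3, seed=0):
    """Mean Pearson r over repeated five-fold cross-validation."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        for test_idx in np.array_split(order, 5):
            train_idx = np.setdiff1d(order, test_idx)
            model = ridge_fit(X[train_idx], y[train_idx])
            y_hat = ridge_predict(model, X[test_idx])
            scores.append(np.corrcoef(y_hat, y[test_idx])[0, 1])
    return float(np.mean(scores))

rng = np.random.default_rng(2)
n, p = 150, 30
X = rng.normal(size=(n, p))                       # toy standardized markers
y = X[:, :5] @ rng.normal(size=5) + rng.normal(0, 0.5, size=n)
r = five_fold_cv(X, y)
print(round(r, 2))
```

Because every fold's test set is disjoint from its training set, the averaged r estimates out-of-sample accuracy rather than training fit; the repeats smooth out the randomness of fold assignment.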

      Genotyping provides genetic information in the form of markers or variants, which are linked to genes of interest and can be used as features in GS models. Theoretically, prediction accuracy depends on the number of markers used (Fig. 6c & d). Additionally, incorporating processed genetic features, derived for example by principal component analysis (PCA) or min-max normalization, into the modeling process could potentially improve prediction accuracy for GS[83] (Fig. 6e & f). However, feature-processing strategies must be chosen carefully, as their effects on prediction ability are not always as expected (Fig. 6e & f). In practice, markers cannot explain all genetic variation; a haplotype-block feature strategy could be a suitable way to explain a portion of the additional genetic variation[84]. Consequently, combining different genetic features could improve model accuracy.
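The feature-processing strategies named in Fig. 6e & f (min-max scaling, z-score normalization, and PCA) each reduce to a few lines of linear algebra; the sketch below applies them to a toy marker matrix of our own making.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(0, 3, size=(50, 20)).astype(float)   # toy marker matrix

# Min-max scaling maps each marker to [0, 1].
span = X.max(axis=0) - X.min(axis=0)
X_minmax = (X - X.min(axis=0)) / np.where(span == 0, 1, span)

# Z-score normalization centers each marker and scales to unit variance.
sd = X.std(axis=0)
X_std = (X - X.mean(axis=0)) / np.where(sd == 0, 1, sd)

# PCA via SVD: project onto the top-k principal components,
# turning 20 correlated markers into k orthogonal features.
k = 5
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:k].T                                  # n x k score matrix

print(X_minmax.min(), X_minmax.max(), X_pca.shape)
```

Note the asymmetry the review points out: scaling preserves all markers while PCA discards variance outside the top components, which is why their effects on prediction ability can differ.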

    • In previous sections, we summarized GS using models based on statistical analysis or ML. Much effort has focused on improving the prediction accuracy of all these models; however, breeding traits, especially yield, are highly complex due to the control exerted by gene regulatory networks, and innovative new models must be developed. What is the future direction of GS algorithms?

      With the continuous improvement of commercial breeding systems, the types of breeding populations have become increasingly diverse, including many DHs from bi-parental populations and sets of breeding populations or lines derived from different stages of self-crosses in breeding programs and hybrid trials across multiple locations and over multiple years[85,86]. Multi-population joint prediction models are urgently needed to address this issue. With the development of high-throughput sequencing techniques and advances in biotechnology, the cost of processing single samples has decreased, making the production of multi-omics data, including genomics[87], transcriptomics[88], and proteomics[89] data easier and more convenient. The construction and preprocessing of large-scale multi-omics datasets, including screening, filtering, and integration, are challenging using traditional methods[90,91]. With the continued collection, accumulation, and analysis of envirotypic data, enviromics is increasingly being used to explore genotype-by-environment interactions based on spatial and temporal variability at multidimensional scales[85,92]. This analysis provides insights into the environmental drivers of the distribution of elite germplasm, facilitates the screening of breeding materials, and enhances the precise evaluation of plant varieties, ultimately leading to advanced breeding processes.

      Complex quantitative traits, such as yield, are regulated by multiple genes and their interactions. These traits pose significant challenges for current predictive models due to their genetic complexity: each gene contributes a small effect to the overall phenotype, and these effects accumulate to determine the final trait expression. Consequently, more novel and innovative models must be explored to capture interactions between genetic factors and specific environments. Three CNN-based models for GS have been reported: DeepGS[68], DLGWAS[69], and DNNGP[70]. These methods are competitive with, and in some respects outperform, traditional methods due to their ability to handle high-dimensional data. We believe that AI holds tremendous potential for GS applications. Some researchers have begun to use RNA-seq data to enhance selection efficiency by integrating gene expression data into GS models via kernel-based methods, which have been used to capture complex genetic interactions and non-linear relationships between genetic variants and phenotypic traits in animals[93].
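As a minimal illustration of how a neural model can pick up non-additive signal that a purely additive model misses, the sketch below trains a one-hidden-layer perceptron (vastly smaller than DeepGS or DNNGP) on synthetic genotypes whose trait includes an epistatic interaction term; the architecture and hyperparameters are our own toy choices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic genotypes with an additive plus epistatic (interaction) signal.
n, p = 300, 40
X = rng.integers(0, 3, size=(n, p)).astype(float)
y = X[:, 0] - X[:, 1] + 0.5 * X[:, 2] * X[:, 3] + rng.normal(0, 0.3, size=n)
sd = X.std(axis=0)
X = (X - X.mean(axis=0)) / np.where(sd == 0, 1, sd)
y = (y - y.mean()) / y.std()

# One hidden layer with tanh activation, trained by plain gradient descent
# on the mean-squared error.
h, lr = 32, 0.05
W1 = rng.normal(0, 0.1, size=(p, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, size=h);      b2 = 0.0
for _ in range(1000):
    Z = np.tanh(X @ W1 + b1)                # hidden activations
    y_hat = Z @ W2 + b2                     # predicted trait values
    err = y_hat - y                         # residuals of the MSE loss
    gW2 = Z.T @ err / n;  gb2 = err.mean()
    dZ = np.outer(err, W2) * (1 - Z ** 2)   # backprop through tanh
    gW1 = X.T @ dZ / n;   gb1 = dZ.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

r = float(np.corrcoef(y_hat, y)[0, 1])      # training-set accuracy
print(round(r, 2))
```

The nonlinearity is what lets the hidden layer represent the X[:, 2] * X[:, 3] interaction; a linear model of the same size cannot, whatever its training time.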

      However, applying these models directly to GS might not be straightforward, since they were primarily designed for computer vision (CV), speech recognition, and natural language processing (NLP); this is especially true of transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT)[94] and the Generative Pre-trained Transformer (GPT). Genomic data differ from text data, making it necessary to preprocess and represent genomic sequences in a format suitable for these models. This may involve encoding variants, genomic regions, or other relevant genetic features appropriately[70,83]. Directly applied ML models do not necessarily outperform Bayesian models and rrBLUP; AlexNet, for example, is not well suited to GS applications[63] (Fig. 6a & b). It is important to handle the sequential nature of genetic data and to ensure that the representations capture the relevant information.

      Other technologies for applying these models to GS could also be considered: the underlying transformer architecture and the transfer-learning principles they employ could be adapted for GS with appropriate modifications. DeepCCR[95] improves contextual comprehension by integrating BiLSTM layers, which are particularly beneficial for interpreting sequential data. In comparison, SoyDNGP[96] employs a three-dimensional input matrix, enabling it to capture intricate genotypic variation and provide richer feature density than the one-dimensional vectors used in other models. GPformer[97] utilizes innovative attention mechanisms to enhance the representation of SNP relationships, thereby improving predictive accuracy. Meanwhile, Dual-Extraction Modeling (DEM)[98] stands out with its dual-extraction mechanism, effectively integrating multi-omics data and improving performance through noise reduction and enhanced feature separability. Collectively, the advancements exemplified by GPformer, SoyDNGP, and DEM reflect a trend toward more complex and integrated processing architectures that address the multifaceted challenges of genomic prediction. Pre-training models on large-scale genomic data or related tasks could help them capture meaningful patterns and assist in downstream GS tasks. After pre-training, it is essential to fine-tune the models for specific GS tasks; fine-tuning adapts the models to those tasks, and the relationships between genetic variants and performance can be deduced by incorporating domain-specific knowledge into the models. This could be achieved by incorporating prior knowledge about the biology and genetic pathways of the organism, or by using specialized loss functions that account for the specific requirements of GS[92].

      Breeding 4.0 reflects the future ability to combine any known alleles into optimal combinations, potentially revolutionizing the field. This stage of agriculture will rely on the development of highly advanced genetic manipulation technologies that can be used to obtain ideal genetic combinations for specific traits. In future breeding pipelines, large-scale multimodal data will be created and developed, such as DNA/protein sequences; text annotation of multi-omics data; images, audio files, and video files from phenomic analysis; sensor readings; and telemetry data[99]. Additionally, numerous public databases, data-sharing platforms, and research papers will be channeled into the pipelines. We believe that GS can serve as a bridge from AI to Breeding 4.0 with innovations in ML, NLP, CV, and other novel technologies. Such enhancements will extend the use of AI in breeding programs.

      In summary, the above factors could enhance the prediction accuracy of GS models. These advanced analytical methods can handle big genomic data, identify complex patterns, and enhance prediction accuracy. As computational tools continue to improve, they could enhance the efficiency and effectiveness of genomic prediction models. By considering a wider range of molecular information, researchers can gain a deeper understanding of the biological mechanisms underlying traits and develop more accurate prediction models for GS.

    • Many innovative models based on ML have been applied to GS to improve breeding efficiency. However, not all ML methods significantly enhance prediction accuracy compared with traditional methods such as BLUP and Bayesian estimation. All of these methods are useful in the appropriate scenarios, and there is currently no unified strategy for GS. Consequently, models for enhancing the accuracy of GS should be chosen case by case, guided by previous studies (Table 2, Fig. 6a & b). Compared with more mature ML application areas, the development of ML-based GS models is still in its early stages, but GS will serve as a tool for enhancing breeding programs. AI technologies can analyze vast repositories of breeding data, scientific literature, and expert knowledge to discover new patterns, relationships, and insights. These findings can be shared with breeders, scientists, and stakeholders, fostering collaboration, innovation, and accelerated progress in the breeding community. AI will play a crucial role in accelerating the transition to Breeding 4.0 by facilitating the creation of large-scale multimodal datasets, complex predictive modeling, and precise decision-making in crop breeding programs.

      Table 2.  Summary of performance comparisons between ML and traditional methods.

      Crop | Population size | Marker no. | Performance | Ref.
      Wheat | 2,374 | 39,758 | GBLUP ≥ MLP | [63]
      Wheat | 250 | 12,083 | GBLUP ≥ MLP | [63]
      Wheat | 693, 670, 807 | 15,744 | GBLUP ≥ MLP | [63]
      Maize | 309 | 158,281 | GBLUP ≥ MLP | [63]
      Wheat | 767, 775, 964, 980, 945, 1,145 | 2,038 | GBLUP ≥ MLP ≈ SVM | [64]
      Maize | 2,267 | 19,465 | MLP > Lasso | [100]
      Maize | 4,328 | 564,692 | GBLUP ≈ BayesR ≈ SVM | [49]
      Barley | 400 | 50,000 | Transformer ≈ BLUP | [83]
      Maize | 8,652 | 32,559 | LightGBM > rrBLUP | [60]
      Wheat | 2,000 | 33,709 | LightGBM ≈ DNNGP > GBLUP | [70]
      Maize | 1,404 | 6,730,418 | SVR ≈ DNNGP > GBLUP | [70]
      Wheat | 599 | 1,447 | SVR ≈ DNNGP > GBLUP | [70]
      • The work was funded by the Key Research and Development Program of Jiangsu Province, China (BE2022337).

      • The authors confirm contribution to the paper as follows: literature review and manuscript writing: Zhang D; manuscript editing: Yang F, Li J, Wang K; manuscript review: Yang F, Li J, Liu Z, Han Y, Zhang Q, Pan S, Zhao X; project supervision: Wang K. All authors reviewed the results and approved the final version of the manuscript.

      • Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

      • The authors declare that they have no conflict of interest.

      • # Authors contributed equally: Dongfeng Zhang, Feng Yang, Jinlong Li

      • Copyright: © 2025 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.
    Cite this article
    Zhang D, Yang F, Li J, Liu Z, Han Y, et al. 2025. Progress and perspectives on genomic selection models for crop breeding. Technology in Agronomy 5: e006 doi: 10.48130/tia-0025-0002