Gene regulatory network prediction using machine learning, deep learning, and hybrid approaches

Sai Teja Mummadi; Md Khairul Islam; Victor Busov; Hairong Wei; Sai Teja Mummadi; Md Khairul Islam; Victor Busov; Hairong Wei

doi:10.48130/forres-0025-0014

Construction of gene regulatory networks (GRNs) is essential for elucidating the regulatory mechanisms underlying metabolic pathways, biological processes, and complex traits. In this study, we developed and evaluated machine learning, deep learning, and hybrid approaches for constructing GRNs by integrating prior knowledge and large-scale transcriptomic data from Arabidopsis thaliana, poplar, and maize. Among these, hybrid models that combined convolutional neural networks and machine learning consistently outperformed traditional machine learning and statistical methods, achieving over 95% accuracy on the holdout test datasets. These models not only identified a greater number of known transcription factors regulating the lignin biosynthesis pathway but also demonstrated higher precision in ranking key master regulators such as MYB46 and MYB83, as well as many upstream regulators, including members of the VND, NST, and SND families, at the top of candidate lists. To address the challenge of limited training data in non-model species, we implemented transfer learning, enabling cross-species GRN inference by applying models trained on well-characterized and data-rich species to another species with limited data. This strategy enhanced model performance and demonstrated the feasibility of knowledge transfer across species. Overall, our findings underscore the effectiveness of hybrid and transfer learning approaches in GRN prediction, offering a scalable framework for elucidating regulatory mechanisms in both model and non-model plant systems.

HTML

Introduction

A gene regulatory network (GRN) visually represents the intricate regulatory interactions between regulators and their target genes, which collectively control metabolic pathways and biological processes essential for plant growth and development, as well as adaptation to various environmental cues and stresses^[1]. Constructing GRNs is therefore critical for elucidating the molecular mechanisms underlying plant physiology and stress responses. With the explosion of publicly available omics data, a handful of highly efficient computational methods have been developed to infer transcription factor (TF)-target gene (TF-target) relationships. However, supervised learning approaches remain underutilized, despite their potential to leverage known regulatory interactions to accurately predict novel TF-target pairs at scale.

Although GRNs can be constructed through experimental means, for example, yeast one hybrid assay (Y1H)^[2], DNA electrophoretic mobility shift assay (EMSA)^[3], chromatin immunoprecipitation and sequencing (ChIP-seq)^[4], and DNA affinity purification and sequencing (DAP-seq)^[5], these approaches are labor intensive and low-throughput, limiting their application only to small gene sets. In contrast, in silico approaches based on omics data offer a scalable alternative for revealing regulatory relationships^[6−9]. Currently, transcriptomic data sets serve as the most widely used high-throughput resource for GRN construction. The analytical strategy for transcriptomic data depends on its structure. Most publicly available datasets from plants and animals are static, often pooled from multiple non-time-course experiments such as treatment-versus-control comparisons. For such static data, suitable GRN inference methods include TIGRESS^[10], mutual information-based algorithms such as ARACNE^[11] and CLR^[12], and random forests-based methods such as GENIE3^[13], all of which infer regulatory relationships without requiring and using temporal information.

In recent years, some algorithms have been developed to construct hierarchical GRNs, such as the BWERF algorithm^[9], the Top-down GGM algorithm^[8], and the Bottom-up GGM algorithm^[7]. Additionally, several methods are capable of constructing multiple GRNs jointly using data from multiple tissues or conditions, such as JRmGRN^[14] and joint graphical lasso using ADMM^[15]. Recently, our team developed a novel method, TGPred^[16], which infers the target genes of each TF by integrating statistics, machine learning (ML), and optimization.

While experimental approaches and computational inference algorithms exist for GRN construction, ML, deep learning (DL), and hybrid approaches have emerged as powerful alternatives for reconstructing GRNs at scale^[17]. Compared to experimental techniques such as yeast one-hybrid (Y1H) assays, EMSA, ChIP-seq, and DNA affinity purification sequencing (DAP-seq) − which are accurate but labor-intensive and low-throughput − ML and DL methods offer several practical advantages for genome-wide GRN prediction across diverse conditions. ML and DL models are highly scalable, enabling the analysis of large datasets where traditional experimental methods often fall short^[18]. ML and DL approaches can capture nonlinear, hierarchical, and context-dependent regulatory relationships−features often difficult to capture with traditional statistical or rule-based methods. In particular, DL architectures such as Convolutional Neural Networks (CNNs)^[19] and Recurrent Neural Networks (RNNs)^[20] excel at learning high-order dependencies and hidden patterns in gene expression data. Tools like DeepBind^[21], DeeperBind^[22], and DeepSEA^[23] apply CNN-based models to predict regulatory relationships from sequence-based features. Moreover, ML and DL frameworks can integrate heterogeneous data types—including gene expression profiles, sequence motifs, and epigenetic information—to improve predictive power^[24]. However, these benefits come with challenges. DL models typically require large, high-quality labeled datasets for effective training, which are often unavailable for many plant species. Traditional ML methods such as multiple linear regression^[25], Support Vector Machine (SVM)^[26], and Decision Trees^[27] can struggle with high-dimensional, noisy omics data and may fail to capture nonlinear or hierarchical relationships. Additionally, overfitting and limited interpretability can be concerns, particularly when applied to small or unbalanced datasets.

To address these limitations, various hybrid approaches that combine the feature learning capabilities of DL with the classification strength and interpretability of ML have gained traction. For example, temporal attention mechanisms integrated with LSTM architectures have been used to predict crop yield by combining genomic and environmental features^[28]. Optimized hybrid deep learning frameworks, such as CNN-stacked LSTM architectures have been used to improve time-series prediction accuracy for environmental variables^[29]. Similarly, MusiteDeep^[30] combines CNNs with attention modules to enhance kinase-specific phosphorylation site prediction. Recent work has also demonstrated that combining deep feature extraction with machine learning ensembles can improve classification performance on imbalanced datasets^[31]. Together, these hybrid frameworks offer a flexible and robust solutions for inferring integrated regulatory networks−a broadly defined GRNs that incorporate multi-omics layers such as protein–protein interactions and metabolic pathways−especially when dealing with limited or heterogeneous datasets. By leveraging the strengths of ML, DL, and hybrid strategies, researchers gain a versatile computational toolkit for uncovering complex regulatory mechanisms, enabling large-scale, cross-context GRN construction that complements traditional experimental methods.

Transfer learning is a ML strategy that leverages knowledge acquired from one domain with large-scale datasets to improve performance in a related but less well-characterized domain with limited data^[32]. In plant genomics and bioinformatics, transfer learning can facilitate the inference of gene regulatory relationships in a target species with limited training data by reusing models trained on a well-annotated, data-rich species. For example, Moore et al. used annotated gene expression data from Arabidopsis thaliana to classify specialized and general metabolism in tomato^[33], demonstrating the potential of transfer learning for cross-species analysis. To maximize its effectiveness, it is essential to select a source species with extensive and well-curated datasets, such as Arabidopsis, to support robust representation learning. Considering evolutionary relationships and the conservation of genes, especially transcription factor families, between source and target species should be considered to enhance the transferability of regulatory features. Beyond using orthologous gene expression levels and patterns, recent studies have integrated metabolic network models into transfer learning frameworks to further constrain and guide GRN reconstruction. This integration of biochemical constraints alongside transcriptomic data can significantly improve prediction accuracy by capturing the underlying biological context more effectively^[34].

In this study, we investigated the potential of ML, DL, and hybrid approaches for constructing GRNs using transcriptomic data from three plant species: Arabidopsis thaliana, poplar (Populus trichocarpa), and maize (Zea mays). Our results demonstrate that the hybrid approaches integrating ML and DL significantly outperformed traditional methods. A key challenge in GRN inference is the limited availability of experimentally validated regulatory pairs, particularly in less characterized species like poplar and maize. To address this limitation, we employed transfer learning strategies, leveraging training data of Arabidopsis to predict regulatory relationships in the other two species. Our results demonstrate that transfer learning, particularly when integrated with CNN-based models, significantly improves prediction performance across species. These findings underscore the promise of cross-species learning and provide a foundation for advancing knowledge transfer approaches in regulatory network inference for data-scarce plant systems.

Discussion

Gene regulatory networks (GRNs) play a central role in elucidating the regulatory mechanisms that govern metabolic pathways, biological processes, and complex traits in plants. However, GRN construction remains a major challenge due to the intricate nature of regulatory interactions and the limitations of experimental methods, which are often labor-intensive, time-consuming, and difficult to scale. In this study, we addressed these challenges using supervised learning approaches that integrate prior biological knowledge with gene expression data. Specifically, we employed a combination of machine learning (ML), deep learning (DL), and hybrid methods to predict regulatory relationships from large-scale transcriptomic datasets. To further improve performance and generalizability, we implemented transfer learning strategies that leverage regulatory knowledge from well-characterized species and apply it to less-studied plant systems. This integrated computational framework enables scalable, cross-species GRN inference and offers a practical foundation for identifying and prioritizing candidate regulators for downstream experimental validation.

Our research demonstrates the effectiveness of ML, DL, and hybrid techniques for predicting GRNs by leveraging prior knowledge and transcriptomic data from multiple plant species, including Arabidopsis thaliana, poplar, and maize. We systematically evaluated several ML and DL architectures − including FCNs and CNNs − across a range of loss functions. Among the FCN models tested with eight different loss functions, the model using BCE achieved the highest average accuracy (90.58%). Similarly, among CNN models tested with ten loss functions, the CNN with BCE loss achieved the best performance, reaching an average accuracy of 95.32%, followed by models using mean absolute error (MAE), hinge loss, and Huber loss (Table 2). Building on these results, we developed hybrid models that integrate CNN-based feature representations with traditional ML classifiers, using BCE loss for consistency. The hybrid Random Forest, hybrid Extremely Randomized Trees, and hybrid AdaBoost models showed substantial improvements over their non-hybrid counterparts, consistently outperforming them on both holdout and independent test datasets. These findings highlight the value of combining DL-driven feature extraction with ML-based classification for accurate, scalable, and cross-species GRN inference.

Using Arabidopsis Transcriptomic Test Data Set 1, and Poplar and Maize Transcriptomic Test Data Sets, we evaluated the performance of three hybrid Random Forest, hybrid Extremely Randomized Trees, and hybrid AdaBoost—against their corresponding plain versions in identifying known positive TFs regulating LBP. Spearman's rank correlation was also included as a baseline statistical method for comparison. We revealed the following facts: (1) The hybrid models not only identified a greater number of TFs but also consistently ranked them among the top candidates, highlighting their superior ability to prioritize biologically relevant regulators. (2) Notably, the hybrid models consistently identified the master regulators of LBP, MYB46 and MYB83, which were rarely detected by their plain counterparts and Spearman's rank correlation. (3) In addition, the hybrid models demonstrated greater competency in identifying upstream regulators of LBP—such as VND6/7/1, NST1/3, SND1/2, E2FC, and MYB26—compared to the plain models, while Spearman's rank method identified these regulators only sporadically.

Our analysis using Arabidopsis Transcriptomic Test Data Set 2 demonstrates that hybrid ML and DL models outperformed both plain ML models and traditional inference methods, including GENIE3, CLR, and Spearman's rank correlation, in predicting experimentally validated regulatory relationships. Among the evaluated models, tree-based hybrid approaches achieved the highest AUC scores (Fig. 6, Table 4), indicating superior capability in distinguishing true regulatory interactions. Although these models exhibited slightly lower accuracy and precision at fixed classification thresholds, their stronger overall discriminatory power highlights the advantages of hybrid architectures in GRN inference. Furthermore, motif enrichment analysis confirmed the biological relevance of the predicted regulatory pairs: a substantial proportion of the predicted target genes contained TF binding sites in their promoter regions. Collectively, these results underscore the improved predictive performance and biological interpretability offered by hybrid ML/DL frameworks for reconstructing gene regulatory networks.

A key limitation of our study is the limited availability of high-quality training datasets for all three species—Arabidopsis, poplar, and maize. While the Arabidopsis training set was constructed using experimentally validated regulatory pairs from the AGRIS database, equivalent datasets are largely unavailable for poplar and maize. Although some gene regulatory relationships have been reported in the literature for poplar and maize, they often lack solid experimental evidence. For example, upregulation or downregulation of a gene of interest by a regulator does not necessarily imply there is a direct regulation between them, and high-throughput methods such as Y1H, ChIP-seq, and DAP-seq are known to yield substantial noise and false positives and need to be substantiated by other experimental means. Careful curation is essential to collect, classify, and validate these data before they can be reliably used for ML or DL model training. Consequently, we treated poplar and maize as non-model species with no reliable prior knowledge of gene regulatory interactions − reflecting a realistic, data-scarce scenario. To enable supervised learning under these constraints, we generated training data by applying a homologous gene mapping strategy, using validated regulatory pairs from Arabidopsis. This cross-species gene mapping approach allowed us to leverage existing knowledge where direct evidence was lacking. However, it relies on the assumption that certain regulatory relationships and modules are conserved across species—an assumption that may not always hold over large evolutionary distances. Despite this limitation, the strategy enabled preliminary modeling and prioritization of candidate regulatory interactions in non-model species. Importantly, our results suggest that this approach is still practical: we were able to identify key pathway regulators in poplar and maize, albeit in smaller numbers and with lower predictive efficiency compared to Arabidopsis. These findings highlight both the promise and the limitations of knowledge transfer in cross-species GRN inference.

Constructing GRNs using ML and DL approaches requires robust and diverse training data. However, the amount of data available for each species in public repositories remains insufficient to support analyses specific to tissues, conditions, or developmental stages. As a result, we compiled gene expression matrices from a variety of tissues, conditions, and developmental stages to expand the available training data. While this inclusive strategy facilitates the development of generalizable models, it may obscure regulatory relationships specific to some tissues, conditions, or developmental stages. We focused on the LBP, which is active across multiple tissue types and therefore suitable for generalized modeling. However, to enable the future construction of tissue-, condition-, or stage-specific networks, it is essential to expand public databases and curate existing gene regulatory knowledge. Concurrently, transfer learning provides an alternative practical approach for extending regulatory knowledge from Arabidopsis—a well-characterized model species—to less-studied species such as poplar and maize. By pretraining a convolutional encoder on Arabidopsis data and subsequently fine-tuning it with transcriptomic data from poplar and maize, we achieved significant improvements in prediction accuracy, reduced training time, and enhanced cross-species applicability. These findings underscore the complementary roles of intra-species data integration and cross-species transfer learning in improving GRN inference under real-world data limitations.

Although limited by data availability, our study demonstrates the strong potential of machine learning (ML), deep learning (DL), and hybrid models for gene regulatory network (GRN) inference. Future research should refine hybrid approaches by, for example, incorporating time-series data and applying temporal models—such as recurrent neural networks (RNNs) or one-dimensional convolutional neural networks (1D CNNs)—to better capture dynamic regulatory processes that static models often miss. In addition, using lightweight CNNs, more advanced ML frameworks, and interpretability tools such as SHAP^[71] could further enhance efficiency, model performance, and transparency. Moreover, our recent findings support the exploration of attention-based architectures to construct context-dependent GRNs and identify key regulators, as recently demonstrated in Islam et al.^[72]. Furthermore, integrating multi-omics data, including chromatin accessibility and epigenomic profiles, may further improve prediction accuracy. Given the complexity of ML/DL-based GRN models, the development of explainable artificial intelligence (AI) frameworks will be essential to improve interpretability and facilitate biological insights. Finally, transfer learning strategies should be further refined to enhance cross-species prediction. In addition to relying on orthologous gene mappings, emerging species-agnostic approaches can learn shared patterns in data structures—for example, aligning patterns of expression or co-expression modules across species, even when direct gene orthology is unavailable^[73]. To support these advancements, we emphasize the importance of broader data sharing and the public deposition of high-resolution, tissue-specific datasets and experimentally validated regulatory interactions.

Supplementary Table S1 The hyperparameters learned through training and the description for different models using Arabidopsis Training Data.
Supplementary Table S2 Comparison of the Top 50 transcription factors (TFs) predicted to regulate the lignin biosynthesis pathway by Hybrid Extremely Randomized Trees and Plain Hybrid Extremely Randomized Trees on Poplar Transcriptomic Test Data Set. The frequency of each TF within the top 2,000 predicted regulatory relationships was calculated to represent how many pathway genes it might have inferred. Red font indicates TFs that are known regulators of the lignin biosynthesis pathway, based on published literature. Red font with yellow highlight indicates TFs that are recognized as master regulators of lignin biosynthesis (e.g., MYB83, MYB46). Blue font denotes TFs that act further upstream in the regulatory hierarchy, influencing the expression of MYB83 and MYB46.
Supplementary Table S3 Comparison of the top 50 transcription factors (TFs) predicted to regulate the lignin biosynthesis pathway by hybrid Random Forest and plain Random Forest Models as well as a baseline method, Spearman's rank correlation on Arabidopsis Transcriptomic Test Data Set 1. The top 1,000 predicted regulatory relationships was used to calculate the frequency of how many BLP genes each TF regulated. TFs highlighted in red represent known true regulators according to current literature, with the corresponding references provided.
Supplementary Fig. S1 Box plots showing the distribution of read counts for Arabidopsis thaliana samples before and after TMM normalization. The x-axis represents individual samples, and the y-axis shows log-transformed read counts.
Supplementary Fig. S2 Box plots showing the distribution of read counts for Poplar (Populus trichocarpa) samples before and after TMM normalization. The x-axis represents individual samples, and the y-axis shows log-transformed read counts.
Supplementary Fig. S3 Box plots showing the distribution of read counts for Zea mays (maize) samples before and after TMM normalization. The x-axis represents individual samples, and the y-axis shows log-transformed read counts.
Supplementary Fig. S4 Training and validation accuracy curves of the convolutional neural networks shown in the step 1 of the Hybrid architecture.
Supplementary Fig. S5 Training and validation loss curves of the convolutional neural networks shown in the step 1 of the Hybrid architecture.

[1]	Zhao Q, Dixon RA. 2011. Transcriptional networks for lignin biosynthesis: more complex than we thought? Trends in Plant Science 16:227−33 doi: 10.1016/j.tplants.2010.12.005 CrossRef Google Scholar
[2]	Wilson TE, Fahrner TJ, Johnston M, Milbrandt J. 1991. Identification of the DNA binding site for NGFI-B by genetic selection in yeast. Science 252(5010):1296−300 doi: 10.1126/science.1925541 CrossRef Google Scholar
[3]	Hellman LM, Fried MG. 2007. Electrophoretic mobility shift assay (EMSA) for detecting protein–nucleic acid interactions. Nature Protocols 2:1849−61 doi: 10.1038/nprot.2007.249 CrossRef Google Scholar
[4]	Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, et al. 2007. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods 4:651−57 doi: 10.1038/nmeth1068 CrossRef Google Scholar
[5]	Bartlett A, O'Malley RC, Huang SC, Galli M, Nery JR, et al. 2017. Mapping genome-wide transcription-factor binding sites using DAP-seq. Nature Protocols 12:1659−72 doi: 10.1038/nprot.2017.055 CrossRef Google Scholar
[6]	Nie J, Stewart R, Zhang H, Thomson JA, Ruan F, et al. 2011. TF-Cluster: a pipeline for identifying functionally coordinated transcription factors via network decomposition of the shared coexpression connectivity matrix (SCCM). BMC Systems Biology 5:53 doi: 10.1186/1752-0509-5-53 CrossRef Google Scholar
[7]	Kumari S, Deng W, Gunasekara C, Chiang V, Chen HS, et al. 2016. Bottom-up GGM algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways or processes. BMC Bioinformatics 17:132 doi: 10.1186/s12859-016-0981-1 CrossRef Google Scholar
[8]	Wei H. 2019. Construction of a hierarchical gene regulatory network centered around a transcription factor. Briefings in Bioinformatics 20:1021−31 doi: 10.1093/bib/bbx152 CrossRef Google Scholar
[9]	Deng W, Zhang K, Busov V, Wei H. 2017. Recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways. PLoS One 12:e0171532 doi: 10.1371/journal.pone.0171532 CrossRef Google Scholar
[10]	Haury AC, Mordelet F, Vera-Licona P, Vert JP. 2012. TIGRESS trustful inference of gene REgulation using stability selection. BMC Systems Biology 6:145 doi: 10.1186/1752-0509-6-145 CrossRef Google Scholar
[11]	Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. 2006. ARACNE an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7:S7 doi: 10.1186/1471-2105-7-S1-S7 CrossRef Google Scholar
[12]	Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, et al. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biology 5:e8 doi: 10.1371/journal.pbio.0050008 CrossRef Google Scholar
[13]	Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. 2010. Inferring regulatory networks from expression data using tree-based methods. PLoS One 5:e12776 doi: 10.1371/journal.pone.0012776 CrossRef Google Scholar
[14]	Deng W, Zhang K, Liu S, Zhao PX, Xu S, et al. 2018. JRmGRN: joint reconstruction of multiple gene regulatory networks with common hub genes using data from multiple tissues or conditions. Bioinformatics 34:3470−78 doi: 10.1093/bioinformatics/bty354 CrossRef Google Scholar
[15]	Danaher P, Wang P, Witten DM. 2014. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society Series B, Statistical Methodology 76:373−97 doi: 10.1111/rssb.12033 CrossRef Google Scholar
[16]	Cao X, Zhang L, Islam MK, Zhao M, He C, et al. 2023. TGPred: efficient methods for predicting target genes of a transcription factor by integrating statistics, machine learning and optimization. NAR Genomics and Bioinformatics 5:lqad083 doi: 10.1093/nargab/lqad083 CrossRef Google Scholar
[17]	Tabares-Soto R, Orozco-Arias S, Romero-Cano V, Segovia Bucheli V, Rodríguez-Sotelo JL, et al. 2020. A comparative study of machine learning and deep learning algorithms to classify cancer types based on microarray gene expression data. PeerJ Computer Science 6:e270 doi: 10.7717/peerj-cs.270 CrossRef Google Scholar
[18]	Angermueller C, Pärnamaa T, Parts L, Stegle O. 2016. Deep learning for computational biology. Molecular Systems Biology 12:878 doi: 10.15252/msb.20156651 CrossRef Google Scholar
[19]	Rukhsar L, Bangyal WH, Ali Khan MS, Ag Ibrahim AA, Nisar K, et al. 2022. Analyzing RNA-seq gene expression data using deep learning approaches for cancer classification. Applied Sciences 12:1850 doi: 10.3390/app12041850 CrossRef Google Scholar
[20]	Eetemadi A, Tagkopoulos I. 2019. Genetic Neural Networks: an artificial neural network architecture for capturing gene expression relationships. Bioinformatics 35:2226−34 doi: 10.1093/bioinformatics/bty945 CrossRef Google Scholar
[21]	Alipanahi B, Delong A, Weirauch MT, Frey BJ. 2015. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33:831−38 doi: 10.1038/nbt.3300 CrossRef Google Scholar
[22]	Hassanzadeh HR, Wang MD. 2016. DeeperBind: enhancing prediction of sequence specificities of DNA binding proteins. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Shenzhen, China, 2016. US: IEEE. pp. 178−83. doi: 10.1109/BIBM.2016.7822515
[23]	Zhou J, Troyanskaya OG. 2015. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods 12:931−34 doi: 10.1038/nmeth.3547 CrossRef Google Scholar
[24]	Min S, Lee B, Yoon S. 2017. Deep learning in bioinformatics. Briefings in Bioinformatics 18:851−69 doi: 10.1093/bib/bbw068 CrossRef Google Scholar
[25]	Salleh FHM, Zainudin S, Arif SM. 2017. Multiple linear regression for reconstruction of gene regulatory networks in solving cascade error problems. Advances in Bioinformatics 2017:4827171 doi: 10.1155/2017/4827171 CrossRef Google Scholar
[26]	Gillani Z, Akash MSH, Matiur Rahaman MD, Chen M. 2014. CompareSVM: supervised, Support Vector Machine (SVM) inference of gene regularity networks. BMC Bioinformatics 15:395 doi: 10.1186/s12859-014-0395-x CrossRef Google Scholar
[27]	Huynh-Thu VA, Geurts P. 2019. Unsupervised gene network inference with decision trees and random forests. Methods in Molecular Biology 1883:195−215 doi: 10.1007/978-1-4939-8882-2_8 CrossRef Google Scholar
[28]	Shook J, Gangopadhyay T, Wu L, Ganapathysubramanian B, Sarkar S, et al. 2021. Crop yield prediction integrating genotype and weather variables using deep learning. PLoS One 16:e0252402 doi: 10.1371/journal.pone.0252402 CrossRef Google Scholar
[29]	Mariappan Y, Ramasamy K, Velusamy D. 2025. An optimized deep learning based hybrid model for prediction of daily average global solar irradiance using CNN SLSTM architecture. Scientific Reports 15:10761 doi: 10.1038/s41598-025-95118-3 CrossRef Google Scholar
[30]	Wang D, Liu D, Yuchi J, He F, Jiang Y, et al. 2020. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Research 48:W140−W146 doi: 10.1093/nar/gkaa275 CrossRef Google Scholar
[31]	Verma N, Ranvijay, Yadav DK. 2024. Hybrid of deep feature extraction and machine learning ensembles for imbalanced skin cancer datasets. Experimental Dermatology 33:e70020 doi: 10.1111/exd.70020 CrossRef Google Scholar
[32]	Theodoris CV, Xiao L, Chopra A, Chaffin MD, Al Sayed ZR, et al. 2023. Transfer learning enables predictions in network biology. Nature 618:616−24 doi: 10.1038/s41586-023-06139-9 CrossRef Google Scholar
[33]	Moore BM, Wang P, Fan P, Lee A, Leong B, et al. 2020. Within- and cross-species predictions of plant specialized metabolism genes using transfer learning. In Silico Plants 2:diaa005 doi: 10.1093/insilicoplants/diaa005 CrossRef Google Scholar
[34]	Pio G, Mignone P, Magazzù G, Zampieri G, Ceci M, et al. 2022. Integrating genome-scale metabolic modelling and transfer learning for human gene regulatory network reconstruction. Bioinformatics 38:487−93 doi: 10.1093/bioinformatics/btab647 CrossRef Google Scholar
[35]	Leinonen R, Sugawara H, Shumway M. 2011. The sequence read archive. Nucleic Acids Research 39:D19−D21 doi: 10.1093/nar/gkq1019 CrossRef Google Scholar
[36]	Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114−20 doi: 10.1093/bioinformatics/btu170 CrossRef Google Scholar
[37]	Andrews S. 2010. FastQC: a quality control tool for high throughput sequence data. www.bioinformatics.babraham.ac.uk/projects/fastqc
[38]	Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, et al. 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29:15−21 doi: 10.1093/bioinformatics/bts635 CrossRef Google Scholar
[39]	Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841−42 doi: 10.1093/bioinformatics/btq033 CrossRef Google Scholar
[40]	Robinson MD, McCarthy DJ, Smyth GK. 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139−40 doi: 10.1093/bioinformatics/btp616 CrossRef Google Scholar
[41]	Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L, et al. 2011. AGRIS the Arabidopsis gene regulatory information server, an update. Nucleic Acids Research 39:D1118−D1122 doi: 10.1093/nar/gkq1120 CrossRef Google Scholar
[42]	Boerjan W, Ralph J, Baucher M. 2003. Lignin biosynthesis. Annual Review of Plant Biology 54:519−46 doi: 10.1146/annurev.arplant.54.031902.134938 CrossRef Google Scholar
[43]	Taylor-Teeples M, Lin L, de Lucas M, Turco G, Toal TW, et al. 2015. An Arabidopsis gene regulatory network for secondary cell wall synthesis. Nature 517:571−75 doi: 10.1038/nature14099 CrossRef Google Scholar
[44]	Jochen Supper HF, Spieth C, Dräger A, Zell A. 2007. Inferring gene regulatory networks by machine learning methods. Series on Advances in Bioinformatics and Computational Biology Proceedings of the 5th Asia-Pacific Bioinformatics Conference. pp. 247−56. doi: 10.1142/9781860947995_0027
[45]	Parry RM, Jones W, Stokes TH, Phan JH, Moffitt RA, et al. 2010. K-Nearest neighbor models for microarray gene expression analysis and clinical outcome prediction. The Pharmacogenomics Journal 10:292−309 doi: 10.1038/tpj.2010.56 CrossRef Google Scholar
[46]	Choi SH, Labadorf AT, Myers RH, Lunetta KL, Dupuis J, et al. 2017. Evaluation of logistic regression models and effect of covariates for case-control study in RNA-seq analysis. BMC Bioinformatics 18:91 doi: 10.1186/s12859-017-1498-y CrossRef Google Scholar
[47]	Peignier S, Sorin B, Calevro F. 2023. Ensemble learning based gene regulatory network inference. International Journal on Artificial Intelligence Tools 32:2360005 doi: 10.1142/S0218213023600059 CrossRef Google Scholar
[48]	Breiman L. 1996. Bagging predictors. Machine Learning 24:123−40 doi: 10.1023/A:1018054314350 CrossRef Google Scholar
[49]	Karshenas A, Röschinger T, Garcia HG. 2024. Predictive modeling of gene expression and localization of DNA binding site using deep convolutional neural networks. bioRxiv 2024.12.17.629042 doi: 10.1101/2024.12.17.629042 CrossRef Google Scholar
[50]	Ji R, Geng Y, Quan X. 2024. Inferring gene regulatory networks with graph convolutional network based on causal feature reconstruction. Scientific Reports 14:21342 doi: 10.1038/s41598-024-71864-8 CrossRef Google Scholar
[51]	Abadi M, Barham P, Chen J, Chen Z, Davis A, et al. 2016. TensorFlow: a system for large-scale machine learning. arXiv 1605.08695 doi: 10.48550/arXiv.1605.08695 CrossRef Google Scholar
[52]	Chen Y, Li Y, Narayan R, Subramanian A, Xie X. 2016. Gene expression inference with deep learning. Bioinformatics 32:1832−39 doi: 10.1093/bioinformatics/btw074 CrossRef Google Scholar
[53]	Kong Y, Yu T. 2018. A deep neural network model using random forest to extract feature representation for gene expression data classification. Scientific Reports 8:16477 doi: 10.1038/s41598-018-34833-6 CrossRef Google Scholar
[54]	He K, Zhang X, Ren S, Sun J. 2016. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016. US: IEEE. pp. 770−78. doi: 10.1109/CVPR.2016.90
[55]	Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, et al. 2017. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv 1704.04861 doi: 10.48550/arXiv.1704.04861 CrossRef Google Scholar
[56]	Wang H, Zhao Q, Chen F, Wang M, Dixon RA. 2011. NAC domain function and transcriptional control of a secondary cell wall master switch. The Plant Journal 68:1104−14 doi: 10.1111/j.1365-313X.2011.04764.x CrossRef Google Scholar
[57]	Wang H, Avci U, Nakashima J, Hahn MG, Chen F, et al. 2010. Mutation of WRKY transcription factors initiates pith secondary wall formation and increases stem biomass in dicotyledonous plants. Proceedings of the National Academy of Sciences of the United States of America 107:22338−43 doi: 10.1073/pnas.1016436107 CrossRef Google Scholar
[58]	Zhong R, Ye ZH. 2012. MYB46 and MYB83 bind to the SMRE sites and directly activate a suite of transcription factors and secondary wall biosynthetic genes. Plant and Cell Physiology 53:368−80 doi: 10.1093/pcp/pcr185 CrossRef Google Scholar
[59]	Mitsuda N, Iwase A, Yamamoto H, Yoshida M, Seki M, et al. 2007. NAC transcription factors, NST1 and NST3, are key regulators of the formation of secondary walls in woody tissues of Arabidopsis. The Plant Cell 19(1):270−80 doi: 10.1105/tpc.106.047043 CrossRef Google Scholar
[60]	Zhou J, Zhong R, Ye ZH. 2014. Arabidopsis NAC domain proteins, VND1 to VND5, are transcriptional regulators of secondary wall biosynthesis in vessels. PLoS One 9:e105726 doi: 10.1371/journal.pone.0105726 CrossRef Google Scholar
[61]	Ohashi-Ito K, Oda Y, Fukuda H. 2010. Arabidopsis VASCULAR-RELATED NAC-DOMAIN6 directly regulates the genes that govern programmed cell death and secondary wall formation during xylem differentiation. The Plant Cell 22:3461−73 doi: 10.1105/tpc.110.075036 CrossRef Google Scholar
[62]	Yamaguchi M, Mitsuda N, Ohtani M, Ohme-Takagi M, Kato K, et al. 2011. VASCULAR-RELATED NAC-DOMAIN 7 directly regulates the expression of a broad range of genes for xylem vessel formation. The Plant Journal 66:579−90 doi: 10.1111/j.1365-313X.2011.04514.x CrossRef Google Scholar
[63]	Han X, Zhao Y, Chen Y, Xu J, Jiang C, et al. 2022. Lignin biosynthesis and accumulation in response to abiotic stresses in woody plants. Forestry Research 2:9 doi: 10.48130/FR-2022-0009 CrossRef Google Scholar
[64]	Yang C, Xu Z, Song J, Conner K, Vizcay Barrena G, et al. 2007. Arabidopsis MYB26/MALE STERILE35 regulates secondary thickening in the Endothecium and is essential for anther dehiscence. The Plant Cell 19:534−48 doi: 10.1105/tpc.106.046391 CrossRef Google Scholar
[65]	Kumari S, Nie J, Chen HS, Ma H, Stewart R, et al. 2012. Evaluation of gene association methods for coexpression network construction and biological knowledge discovery. PLoS One 7:e50411 doi: 10.1371/journal.pone.0050411 CrossRef Google Scholar
[66]	Wei Z, Wei H. 2024. Deciphering the intricate hierarchical gene regulatory network: unraveling multi-level regulation and modifications driving secondary cell wall formation. Horticulture Research 11(2):uhad281 doi: 10.1093/hr/uhad281 CrossRef Google Scholar
[67]	Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, et al. 2001. A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 17:1113−22 doi: 10.1093/bioinformatics/17.12.1113 CrossRef Google Scholar
[68]	Chai G, Kong Y, Zhu M, Yu L, Qi G, et al. 2015. Arabidopsis C3H14 and C3H15 have overlapping roles in the regulation of secondary wall thickening and anther development. Journal of Experimental Botany 66:2595−609 doi: 10.1093/jxb/erv060 CrossRef Google Scholar
[69]	Qin W, Yin Q, Chen J, Zhao X, Yue F, et al. 2020. The class II KNOX transcription factors KNAT3 and KNAT7 synergistically regulate monolignol biosynthesis in Arabidopsis. Journal of Experimental Botany 71:5469−83 doi: 10.1093/jxb/eraa266 CrossRef Google Scholar
[70]	Li E, Bhargava A, Qiang W, Friedmann MC, Forneris N, et al. 2012. The Class II KNOX gene KNAT7 negatively regulates secondary wall formation in Arabidopsis and is functionally conserved in Populus. New Phytologist 194:102−15 doi: 10.1111/j.1469-8137.2011.04016.x CrossRef Google Scholar
[71]	Nahiduzzaman M, Abdulrazak LF, Kibria HB, Khandakar A, Ayari MA, et al. 2025. A hybrid explainable model based on advanced machine learning and deep learning models for classifying brain tumors using MRI images. Scientific Reports 15:1649 doi: 10.1038/s41598-025-85874-7 CrossRef Google Scholar
[72]	Islam MK, Wagh H, Wei H. 2025. Dynamic gene attention focus (DyGAF): enhancing biomarker identification through dual-model attention networks. Bioinformatics and Biology Insights 19:11779322251325390 doi: 10.1177/11779322251325390 CrossRef Google Scholar
[73]	Park Y, Muttray NP, Hauschild AC. 2024. Species-agnostic transfer learning for cross-species transcriptomics data integration without gene orthology. Briefings in Bioinformatics 25:bbae004 doi: 10.1093/bib/bbae004 CrossRef Google Scholar

(a) Training data
Species	Number of genes	Expression samples	Training data
Species	Number of genes	Expression samples	Total	Positive pairs	Negative pairs
Arabidopsis thaliana	22,093 (Compendium Data Set 1)	1,253	2,462	1,231	1,231
Populus trichocarpa	34,699 (Compendium Data Set 2)	743	4,214	2,107	2,107
Zea mays (B73)	39,756 (Compendium Data Set 3)	1,626	16,900	8,450	8,450
(b) Test data
Species		TFs	Targets	Expression samples	Total pairs
Arabidopsis Transcriptomic Test Data Set 1		1,415	20	1,253	28,300
Arabidopsis Transcriptomic Test Data Set 2		199	35	1,253	1,164
Poplar Transcriptomic Test Data Set		1,717	25	743	42,925
Maize Transcriptomic Test Data Set		2,555	38	1,626	97,090

(a) Fully connected layer
Species	FCN BCE	FCN HINGE	FCN MSE	FCN MSLE	FCN MAE	FCN POISSON	FCN HUBER	FCN LOGCOSH
Arabidopsis	87.42	84.58	87.02	85.4	89.25	87.62	87.42	87.42
Poplar	95.28	91.85	91.11	92.44	92.92	92.8	92.25	91.37
Maize	89.05	88.82	89.79	89.5	89.05	90.8	90.95	90.71
Average scores	90.58	88.42	89.31	89.11	90.41	90.41	90.21	89.83
(b) Convolutional neural network
Species	CNN BCE	CNN HINGE	CNN MSLE	CNN MSE	CNN MAE	CNN POISSON	CNN HUBER	CNN LOGCOSH	ResNet 50	Mobile Net
Arabidopsis	93.5	91.48	92.29	91.48	91.88	92.69	91.47	92.08	81.93	74.03
Poplar	97.59	98.1	97.59	97.21	98.35	97.85	97.72	96.32	88.67	85.47
Maize	94.86	95.34	90.48	94.08	95.4	88.51	95.65	94.9	85.89	83.07
Average scores	95.32	94.97	93.45	94.26	95.21	93.02	94.95	94.43	85.5	80.86

No.	Model	Accuracy	Precision	Recall	Specificity	F1-score	AUC score
1	Random Forest Classifier Hybrid	83.26	83.33	83.26	85.59	83.25	93.00
2	Random Forest Classifier Plain	84.55	86.19	84.54	95.20	84.37	89.80
3	Extra Trees Classifier Hybrid	85.15	85.38	85.15	89.19	85.12	93.31
4	Extra Trees Classifier Plain	84.03	85.60	84.03	94.51	83.85	88.05
5	AdaBoost Classifier Hybrid	81.20	81.33	81.20	84.39	81.18	91.84
6	AdaBoost Classifier Plain	84.98	85.23	84.97	89.19	84.95	89.38

{{lists.name}}

Gene regulatory network prediction using machine learning, deep learning, and hybrid approaches

Abstract

Supplementary information

Rights and permissions

References

About this article

Cite this article

Article Metrics

Access History

Other Articles By Authors