HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression

Wenping Deng; Kui Zhang; Cheng He; Sanzhen Liu; Hairong Wei; Wenping Deng; Kui Zhang; Cheng He; Sanzhen Liu; Hairong Wei

doi:10.48130/FR-2021-0006

2021 Volume 1

Article Contents

Next Previous

ARTICLE Open Access

HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression

1.
College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, Michigan 49931, United States of America
2.
Department of Mathematical Science, Michigan Technological University, Houghton, Michigan 49931, United States of America
3.
Department of Plant Pathology, Kansas State University, Manhattan, Kansas 66506, United States of America

More Information

Corresponding author: hairong@mtu.edu

Received Date: 11 January 2021
Accepted Date: 11 March 2021
Published Online: 30 March 2021
Forestry Research 1, Article number: 6 (2021) | Cite this article

Abstract

Gene expression data features high dimensionality, multicollinearity, and non-Gaussian distribution noise, posing hurdles for identification of true regulatory genes controlling a biological process or pathway. In this study, we integrated the Huber loss function and the Berhu penalty (HB) into partial least squares (PLS) framework to deal with the high dimension and multicollinearity property of gene expression data, and developed a new method called HB-PLS regression to model the relationships between regulatory genes and pathway genes. To solve the Huber-Berhu optimization problem, an accelerated proximal gradient descent algorithm with at least 10 times faster than the general convex optimization solver (CVX), was developed. Application of HB-PLS to recognize pathway regulators of lignin biosynthesis and photosynthesis in Arabidopsis thaliana led to the identification of many known positive pathway regulators that had previously been experimentally validated. As compared to sparse partial least squares (SPLS) regression, an efficient method for variable selection and dimension reduction in handling multicollinearity, HB-PLS has higher efficacy in identifying more positive known regulators, a much higher but slightly less sensitivity/(1-specificity) in ranking the true positive known regulators to the top of the output regulatory gene lists for the two aforementioned pathways. In addition, each method could identify some unique regulators that cannot be identified by the other methods. Our results showed that the overall performance of HB-PLS slightly exceeds that of SPLS but both methods are instrumental for identifying real pathway regulators from high-throughput gene expression data, suggesting that integration of statistics, machine leaning and convex optimization can result in a method with high efficacy and is worth further exploration.
- Huber loss,
- Berhu penalty,
- partial least squares regression,
- Huber-Berhu partial least squares,
- pathway regulators

Supplementary Information

Supplementary Fig. S1 The performance of Huber-Berhu-Partial Least Squares (HB-PLS) was compared with the conventional partial least squares (PLS) and the sparse partial least squares (SPLS) method.

Rights and permissions
Copyright: © 2023 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.

References

[1]	Chen T, He HL, Church GM. 1999. Modeling gene expression with differential equations. In Proceeding of the Pacific Symposium on Biocomputing 1999, 4:611. USA: World Scientific. pp.29−40 https://doi.org/10.1142/3925
[2]	Kauffman S. 1969. Homeostasis and differentiation in random genetic control networks. Nature 224:177−8 doi: 10.1038/224177a0 CrossRef Google Scholar
[3]	Chen BS, Chang CH, Wang YC, Wu CH, Lee HC. 2011. Robust model matching design methodology for a stochastic synthetic gene network. Math. Biosci. 230:23−36 doi: 10.1016/j.mbs.2010.12.007 CrossRef Google Scholar
[4]	Friedman N, Nachman I, Pe'er D. 1999. Learning bayesian network structure from massive datasets: the "sparse candidate" algorithm. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI1999). pp. 206−15. Stockholm: Morgan Kaufmann Publishers Inc.
[5]	Friedman N, Linial M, Nachman I, Pe'er D. 2000. Using Bayesian networks to analyze expression data. Journal of Computational Biology 7:601−20 doi: 10.1089/106652700750050961 CrossRef Google Scholar
[6]	Chai LE, Law CK, Mohamad MS, Chong CK, Choon YW, et al. 2014. Investigating the effects of imputation methods for modelling gene networks using a dynamic bayesian network from gene expression data. Malays J Med Sci 21:20−7 https://pubmed.ncbi.nlm.nih.gov/24876803/
[7]	Exarchos TP, Rigas G, Goletsis Y, Stefanou K, Jacobs S, et al. 2014. A dynamic Bayesian network approach for time-specific survival probability prediction in patients after ventricular assist device implantation. 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Chicago, IL, USA, 2014, pp. 3172−5. USA: IEEE https://doi.org/doi:10.1109/EMBC.2014.6944296
[8]	Cao J, Qi X, Zhao H. 2012. Modeling gene regulation networks using ordinary differential equations. In Next Generation Microarray Bioinformatics. Methods in Molecular Biology (Methods and Protocols), eds. Wang J, Tan AC, Tian T, vol 802. USA: Humana Press. pp: 185−97 https://doi.org/10.1007/978-1-61779-400-1_12
[9]	Sima C, Hua J, Jung S. 2009. Inference of Gene Regulatory Networks Using Time-Series Data: A Survey. Current genomics 10:416−29 doi: 10.2174/138920209789177610 CrossRef Google Scholar
[10]	de la Fuente A, Bing N, Hoeschele I, Mendes P. 2004. Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics 20:3565−74 doi: 10.1093/bioinformatics/bth445 CrossRef Google Scholar
[11]	Schäfer J, Strimmer K. 2005. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 21:754−64 doi: 10.1093/bioinformatics/bti062 CrossRef Google Scholar
[12]	Butte A, Kohane I. 2000. Mutual information relevance networks: Functional genomic clustering using pairwise entropy measurements. In Proceedings of Pacific Symposium on Biocomputing 2000, 5:704. USA: World Scientific. pp.415−26 https://doi.org/10.1142/4316
[13]	Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, et al. 2006. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl 1):S7 doi: 10.1186/1471-2105-7-S1-S7 CrossRef Google Scholar
[14]	Altay G, Emmert-Streib F. 2010. Inferring the conservative causal core of gene regulatory networks. BMC Systems Biology 4:132 doi: 10.1186/1752-0509-4-132 CrossRef Google Scholar
[15]	Meyer PE, Lafitte F, Bontempi G. 2008. minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics 9:461 doi: 10.1186/1471-2105-9-461 CrossRef Google Scholar
[16]	Huynh-Thu VA, Geurts P. 2019. Unsupervised Gene Network Inference with Decision Trees and Random Forests. In Gene Regulatory Networks. Methods in Molecular Biology, eds. Sanguinetti G, Huynh-Thu VA, vol 1883. New York: Humana Press. pp. 195−215 https://doi.org/10.1007/978-1-4939-8882-2_8
[17]	Deng W, Zhang K, Busov V, Wei H. 2017. Recursive random forest algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways. PLoS One 12:e0171532 doi: 10.1371/journal.pone.0171532 CrossRef Google Scholar
[18]	Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. 2000. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. U. S. A. 97:12182−6 doi: 10.1073/pnas.220392197 CrossRef Google Scholar
[19]	Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, et al. 2007. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5:e8 doi: 10.1371/journal.pbio.0050008 CrossRef Google Scholar
[20]	Meyer PE, Kontos K, Lafitte F, Bontempi G. 2007. Information-theoretic inference of large transcriptional regulatory networks. EURASIP journal on bioinformatics and systems biology 2007:79879 https://rdcu.be/chDK7
[21]	Gunasekara C, Zhang K, Deng W, Brown L, Wei H. 2018. TGMI: an efficient algorithm for identifying pathway regulators through evaluation of triple-gene mutual interaction. Nucleic Acids Res. 46:e67 doi: 10.1093/nar/gky210 CrossRef Google Scholar
[22]	Zhang X, Zhao X, He K, Lu L, Cao Y, et al. 2012. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics 28:98−104 doi: 10.1093/bioinformatics/btr626 CrossRef Google Scholar
[23]	Meinshausen N, Bühlmann P. 2006. High-dimensional graphs and variable selection with the Lasso. Annals of statistics 34:1436−62 doi: 10.1214/009053606000000281 CrossRef Google Scholar
[24]	Zhang X, Liu K, Liu Z, Duval B, Richer JM, et al. 2013. NARROMI: a noise and redundancy reduction technique improves accuracy of gene regulatory network inference. Bioinformatics 29:106−13 doi: 10.1093/bioinformatics/bts619 CrossRef Google Scholar
[25]	Hayes AF, Cai L. 2007. Using heteroskedasticity-consistent standard error estimators in OLS regression: an introduction and software implementation. Behav. Res. Methods 39:709−22 doi: 10.3758/BF03192961 CrossRef Google Scholar
[26]	Hoerl AE, Kennard RW. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12:55−67 doi: 10.1080/00401706.1970.10488634 CrossRef Google Scholar
[27]	Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58:267−88 doi: 10.1111/j.2517-6161.1996.tb02080.x CrossRef Google Scholar
[28]	Zou H. 2006. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101:1418−29 doi: 10.1198/016214506000000735 CrossRef Google Scholar
[29]	Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67:301−20 doi: 10.1111/j.1467-9868.2005.00503.x CrossRef Google Scholar
[30]	Wang H, Li G, Jiang G. 2007. Robust regression shrinkage and consistent variable selection through the LAD-Lasso. Journal of Business & Economic Statistics 25:347−55 doi: 10.1198/073500106000000251 CrossRef Google Scholar
[31]	Yu C, Yao W. 2017. Robust linear regression: A review and comparison. Communications in Statistics - Simulation and Computation 46:6261−82 doi: 10.1080/03610918.2016.1202271 CrossRef Google Scholar
[32]	Lambert-Lacroix S, Zwald L. 2011. Robust regression through the Huber’s criterion and adaptive lasso penalty. Electronic Journal of Statistics 5:1015−53 doi: 10.1214/11-EJS635 CrossRef Google Scholar
[33]	Owen AB. 2007. A robust hybrid of lasso and ridge regression. Proceedings of the AMS-IMS-SIAM Joint Summer Research Conference on Machine and Statistical Learning: Prediction and Discovery, Snowbird, UT, 2006, Contemporary Mathematics 443:59−72. Providence, RI: American Mathematical Society http://www.ams.org/books/conm/443/
[34]	Zwald L, Lambert-Lacroix S. 2012. The BerHu penalty and the grouped effect. arXiv preprint arXiv: 1207.6868
[35]	Grant M, Boyd S, Ye Y. 2008. CVX: Matlab software for disciplined convex programming. http://cvxr.com/cvx/
[36]	Sæbø S, Almøy T, Aarøe J, Aastveit AH. 2007. ST-PLS: a multi-directional nearest shrunken centroid type classifier via PLS. Chemometrics 22:54−62 doi: 10.1002/cem.1101 CrossRef Google Scholar
[37]	Chun H, Keleş S. 2010. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72:3−25 doi: 10.1111/j.1467-9868.2009.00723.x CrossRef Google Scholar
[38]	Lê Cao K-A, Rossouw D, Robert-Granié C, Besse P. 2008. A sparse PLS for variable selection when integrating omics data. Statistical applications in genetics and molecular biology 7:Ariticl 35 doi: 10.2202/1544-6115.1390 CrossRef Google Scholar
[39]	Chaffey N, Cholewa E, Regan S, Sundberg B. 2002. Secondary xylem development in Arabidopsis: a model for wood formation. Physiologia plantarum 114:594−600 doi: 10.1034/j.1399-3054.2002.1140413.x CrossRef Google Scholar
[40]	Kumari S, Deng W, Gunasekara C, Chiang V, Chen HS, et al. 2016. Bottom-up GGM algorithm for constructing multilayered hierarchical gene regulatory networks that govern biological pathways or processes. BMC Bioinformatics 17:132 doi: 10.1186/s12859-016-0981-1 CrossRef Google Scholar
[41]	Zheng J, He C, Qin Y, Lin G, Park WD, et al. 2019. Co-expression analysis aids in the identification of genes in the cuticular wax pathway in maize. Plant J. 97:530−42 doi: 10.1111/tpj.14140 CrossRef Google Scholar
[42]	Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, et al. 2010. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28:511−5 doi: 10.1038/nbt.1621 CrossRef Google Scholar
[43]	Huber PJ. 2011. Robust statistics. In International Encyclopedia of Statistical Science, ed. Lovric M. Berlin, Heidelberg: Springer. pp. 1248−51 https://doi.org/10.1007/978-3-642-04898-2_594
[44]	Yi C, Huang J. 2017. Semismooth newton coordinate descent algorithm for elastic-net penalized huber loss regression and quantile regression. Journal of Computational and Graphical Statistics 26:547−57 doi: 10.1080/10618600.2016.1256816 CrossRef Google Scholar
[45]	Parikh N, Boyd S. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1:127−239 doi: 10.1561/2400000003 CrossRef Google Scholar
[46]	Yu YL. 2013. On decomposing the proximal map. NIPS'13: Proceedings of the 26th International Conference on Neural Information Processing Systems, Lake Tahoe, Nevada, 2013, vol 1:91−9. New York: Curran Associates Inc. https://proceedings.neurips.cc/paper/2013/file/98dce83da57b0395e163467c9dae521b-Paper.pdf
[47]	Beck, A. and M. Teboulle. 2009. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences 2(1):183−202 doi: 10.1137/080716542 CrossRef Google Scholar
[48]	Vinzi VE, Russolillo G. 2013. Partial least squares algorithms and methods. WIREs Computational Statistics 5:1−19 doi: 10.1002/wics.1239 CrossRef Google Scholar
[49]	Shen H, Huang JZ. 2008. Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis 99:1015−34 doi: 10.1016/j.jmva.2007.06.007 CrossRef Google Scholar
[50]	Tenenhaus A, Guillemot V, Gidrol X, Frouin V. 2010. Gene association networks from microarray data using a regularized estimation of partial correlation based on PLS regression. IEEE/ACM Trans Comput Biol Bioinform 7:251−62 doi: 10.1109/TCBB.2008.87 CrossRef Google Scholar
[51]	Simca P. 2002. SIMCA-P+ 10 Manual. Umetrics AB
[52]	Deng W. 2018. Algorithms for reconstruction of gene regulatory networks from high -throughput gene expression data. PhD. Open Access Dissertation. Michigan Technological University. pp. 101 https://digitalcommons.mtu.edu/etdr/722/
[53]	Zhou J, Lee C, Zhong R, Ye ZH. 2009. MYB58 and MYB63 are transcriptional activators of the lignin biosynthetic pathway during secondary cell wall formation in Arabidopsis. Plant Cell 21:248−66 doi: 10.1105/tpc.108.063321 CrossRef Google Scholar
[54]	Shuai B, Reynaga-Peña CG, Springer PS. 2002. The lateral organ boundaries gene defines a novel, plant-specific gene family. Plant Physiol. 129:747−61 doi: 10.1104/pp.010926 CrossRef Google Scholar
[55]	Nishitani K, Demura T. 2015. Editorial: An Emerging View of Plant Cell Walls as an Apoplastic Intelligent System. Plant and Cell Physiology 56:177−9 doi: 10.1093/pcp/pcv001 CrossRef Google Scholar
[56]	Wang P, Hendron RW, Kelly S. 2017. Transcriptional control of photosynthetic capacity: conservation and divergence from Arabidopsis to rice. New Phytol. 216:32−45 doi: 10.1111/nph.14682 CrossRef Google Scholar
[57]	Cluis CP, Mouchel CF, Hardtke CS. 2004. The Arabidopsis transcription factor HY5 integrates light and hormone signaling pathways. Plant J. 38:332−47 doi: 10.1111/j.1365-313X.2004.02052.x CrossRef Google Scholar
[58]	Andronis C, Barak S, Knowles SM, Sugano S, Tobin EM. 2008. The clock protein CCA1 and the bZIP transcription factor HY5 physically interact to regulate gene expression in Arabidopsis. Mol. Plant 1:58−67 doi: 10.1093/mp/ssm005 CrossRef Google Scholar
[59]	Job N, Yadukrishnan P, Bursch K, Datta S, Johansson H. 2018. Two B-Box Proteins Regulate Photomorphogenesis by Oppositely Modulating HY5 through their Diverse C-Terminal Domains. Plant Physiol. 176:2963−76 doi: 10.1104/pp.17.00856 CrossRef Google Scholar
[60]	Jiang Y, Yang C, Huang S, Xie F, Xu Y, et al. 2019. The ELF3-PIF7 Interaction Mediates the Circadian Gating of the Shade Response in Arabidopsis. iScience 22:288−98 doi: 10.1016/j.isci.2019.11.029 CrossRef Google Scholar
[61]	Kim K, Jeong J, Kim J, Lee N, Kim ME, et al. 2016. PIF1 Regulates Plastid Development by Repressing Photosynthetic Genes in the Endodermis. Molecular plant 9:1415−27 doi: 10.1016/j.molp.2016.08.007 CrossRef Google Scholar
[62]	Shin J, Kim K, Kang H, Zulfugarov IS, Bae G, et al. 2009. Phytochromes promote seedling light responses by inhibiting four negatively-acting phytochrome-interacting factors. Proc. Natl. Acad. Sci. U. S. A. 106:7660−5 doi: 10.1073/pnas.0812219106 CrossRef Google Scholar
[63]	Leivar P, Monte E, Al-Sady B, Carle C, Storer A, et al. 2008. The Arabidopsis phytochrome-interacting factor PIF7, together with PIF3 and PIF4, regulates responses to prolonged red light by modulating phyB levels. Plant Cell 20:337−52 doi: 10.1105/tpc.107.052142 CrossRef Google Scholar
[64]	Waters MT, Wang P, Korkaric M, Capper RG, Saunders NJ, Langdale JA. 2009. GLK transcription factors coordinate expression of the photosynthetic apparatus in Arabidopsis. The Plant cell 21:1109−28 doi: 10.1105/tpc.108.065250 CrossRef Google Scholar
[65]	Zubo YO, Blakley IC, Franco-Zorrilla JM, Yamburenko MV, Solano R, et al. 2018. Coordination of Chloroplast Development through the Action of the GNC and GLK Transcription Factor Families. Plant physiology 178:130−47 doi: 10.1104/pp.18.00414 CrossRef Google Scholar
[66]	Privat I, Hakimi MA, Buhot L, Favory JJ, Mache-Lerbs S. 2003. Characterization of Arabidopsisplastid sigma-like transcription factors SIG1, SIG2 and SIG3. Plant Mol. Biol. 51:385−99 doi: 10.1023/A:1022095017355 CrossRef Google Scholar
[67]	Litthauer S, Battle MW, Lawson T, Jones MA. 2015. Phototropins maintain robust circadian oscillation of PSⅡ operating efficiency under blue light. Plant J. 83:1034−45 doi: 10.1111/tpj.12947 CrossRef Google Scholar
[68]	Manfield IW, Devlin PF, Jen CH, Westhead DR, Gilmartin PM. 2007. Conservation, convergence, and divergence of light-responsive, circadian-regulated, and tissue-specific expression patterns during evolution of the Arabidopsis GATA gene family. Plant Physiol 143:941−58 doi: 10.1104/pp.106.090761 CrossRef Google Scholar
[69]	Zhang Z, Ren C, Zou L, Wang Y, Li S, et al. 2018. Characterization of the GATA gene family in Vitis vinifera: genome-wide analysis, expression profiles, and involvement in light and phytohormone response. Genome 61:713−23 doi: 10.1139/gen-2018-0042 CrossRef Google Scholar
[70]	Jeong MJ, Shih MC. 2003. Interaction of a GATA factor with cis-acting elements involved in light regulation of nuclear genes encoding chloroplast glyceraldehyde-3-phosphate dehydrogenase in Arabidopsis. Biochem. Biophys. Res. Commun. 300:555−62 doi: 10.1016/S0006-291X(02)02892-9 CrossRef Google Scholar
[71]	Kawoosa T, Gahlan P, Devi AS, Kumar S. 2014. The GATA and SORLIP motifs in the 3-hydroxy-3-methylglutaryl-CoA reductase promoter of Picrorhiza kurrooa for the control of light-mediated expression. Funct. Integr. Genomics 14:191−203 doi: 10.1007/s10142-013-0350-3 CrossRef Google Scholar
[72]	Liu Y, Patra B, Pattanaik S, Wang Y, Yuan L. 2019. GATA and Phytochrome Interacting Factor Transcription Factors Regulate Light-Induced Vindoline Biosynthesis in Catharanthus roseus. Plant Physiol. 180:1336−50 doi: 10.1104/pp.19.00489 CrossRef Google Scholar
[73]	Gargouri M, Park JJ, Holguin FO, Kim MJ, Wang H, et al. 2015. Identification of regulatory network hubs that control lipid metabolism in Chlamydomonas reinhardtii. J. Exp. Bot. 66:4551−66 doi: 10.1093/jxb/erv217 CrossRef Google Scholar
[74]	Waters MT, Langdale JA. 2009. The making of a chloroplast. EMBO J. 28:2861−73 doi: 10.1038/emboj.2009.264 CrossRef Google Scholar
[75]	Yoshida T, Ohama N, Nakajima J, Kidokoro S, Mizoi J, et al. 2011. Arabidopsis HsfA1 transcription factors function as the main positive regulators in heat shock-responsive gene expression. Mol. Genet. Genomics. 286:321−32 doi: 10.1007/s00438-011-0647-7 CrossRef Google Scholar
[76]	Oh E, Zhu JY, Wang ZY. 2012. Interaction between BZR1 and PIF4 integrates brassinosteroid and environmental responses. Nature cell biology 14:802−9 doi: 10.1038/ncb2545 CrossRef Google Scholar
[77]	Zhong R, Lee C, Zhou J, McCarthy RL, Ye ZH. 2008. A battery of transcription factors involved in the regulation of secondary cell wall biosynthesis in Arabidopsis. Plant Cell 20:2763−82 doi: 10.1105/tpc.108.061325 CrossRef Google Scholar
[78]	Chun H, Keleş S. 2010. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J. R. Stat. Soc. Series B Stat. Methodol 72:3−25 doi: 10.1111/j.1467-9868.2009.00723.x CrossRef Google Scholar
[79]	Efron B, Hastie T, Johnstone I, Tibshirani R. 2004. Least angle regression. Annals of Statistics 32:407−99 doi: 10.1214/009053604000000067 CrossRef Google Scholar
[80]	Xie Y, Liu Y, Valdar W. 2016. Joint estimation of multiple dependent Gaussian graphical models with applications to mouse genomics. Biometrika 103:493−511 doi: 10.1093/biomet/asw035 CrossRef Google Scholar

About this article

Cite this article

Deng W, Zhang K, He C, Liu S, Wei H. 2021. HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression. Forestry Research 1:6 doi: 10.48130/FR-2021-0006

Figures(7) / Tables(2)

Download PDF

Article Metrics

Article views(4820) PDF downloads(847)

Other Articles By Authors

on this site
on Google Scholar

HTML

INTRODUCTION

In a gene regulatory network (GRN), a node corresponds to a gene and an edge represents a directional regulatory relationship between a transcription factor (TF) and a target gene. Understanding the regulatory relationships among genes in GRNs can help elucidate the various biological processes and underlying mechanisms in a variety of organisms. Although experiments can be conducted to acquire evidence of gene regulatory interactions, these are labor-intensive and time-consuming. In the past two decades, the advent of high-throughput technologies including microarray and RNA-Seq, have generated an enormous wealth of transcriptomic data. As the data in public repositories grows exponentially, computational algorithms and tools utilizing gene expression data offer a more time- and cost-effective way to reconstruct GRNs. To this end, efficient mathematical and statistical methods are needed to infer qualitative and quantitative relationships between genes.

Many methods have been developed to reconstruct GRNs, each employing different theories and principles. The earliest methods include differential equations^[1], Boolean networks^[2], stochastic networks^[3], Bayesian^[4,5] or dynamic Bayesian networks (BN)^[6,7], and ordinary differential equations (ODE)^[8]. Some of these methods require time series datasets with short time intervals, such as those generated from easily manipulated single cell organisms (e.g. bacteria, yeast etc.) or mammalian cell lines^[9]. For this reason, most of these methods are not suitable for gene expression data, especially time series data involving time intervals on the scale of days, from multicellular organisms like plants and mammals (except cell lines).

In general, the methods that are useful for building gene networks with non-time series data generated from higher plants and mammals include ParCorA^[10], graphical Gaussian models (GGM)^[11], and mutual information-based methods such as Relevance Network (RN)^[12], Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE)^[13], C3NET^[14], maximum relevance/minimum redundancy Network (MRNET)^[15], and random forests^[16,17]. Most of these methods are based on the information-theoretic framework. For instance, Relevance Network (RN)^[18], one of the earliest methods developed, infers a network in which a pair of genes are linked by an edge if the mutual information is larger than a given threshold. The context likelihood relatedness (CLR) algorithm^[19], an extension of RN, derives a score from the empirical distribution of the mutual information for each pair of genes and eliminates edges with scores that are not statistically significant. ARACNE is similar to RN; however, ARACNE makes use of the data processing inequality (DPI) to eliminate the least significant edge of a triplet of genes, which decreases the false positive rate of the inferred network. MRNET^[20] employs the maximum relevance and minimum redundancy feature selection method to infer GRNs. Finally, triple-gene mutual interaction (TGMI) uses condition mutual information to evaluate triple gene blocks to infer GRNs^[21]. Information theory-based methods are used extensively for constructing GRNs and for building large networks because they have a low computational complexity and are able to capture nonlinear dependencies. However, there are also disadvantages in using mutual information, including high false-positive rates^[22] and the inability to differentiate positive (activating), negative (inhibiting), and indirect regulatory relationships. Reconstruction of the transcriptional regulatory network can be implemented by the neighborhood selection method. Neighborhood selection^[23] is a sub-problem of covariance selection. Assume $ \Gamma $ is a set containing all of the variables (genes), the neighborhood $ {ne}_{a} $ of a variable $ a\in \Gamma $ is the smallest subset of $ \Gamma \backslash \left\{a\right\} $ such that, given all variables in $ {ne}_{a}\, $ , variable $ a $ is conditionally independent of all remaining variables. Given $ n $ i.i.d. observations of $ \Gamma $, neighborhood selection aims to estimate the neighborhood of each variable in $ \Gamma $ individually. The neighborhood selection problem can be cast as a multiple linear regression problem and solved by regularized methods.

Following the differential equation in^[24], the expression levels of a target gene $ y $ and the expression levels of the TF genes $ x $ form a linear relationship:

$ {y}_{i}={\beta }_{0}+{x}_{i}^{T}\beta +{\varepsilon }_{i}\;\;\; i=\mathrm{1,2},\dots ,n $

(1)

where $ n $ is the number of samples, $ {x}_{i}={({x}_{i1},\dots ,{x}_{ip})}^{T} $ is the expression level of $ p $ TF genes, and $ {y}_{i} $ is the expression level of the target gene in sample $ i $. $ {\beta }_{0} $ is the intercept and ${\boldsymbol{\beta }} ={({\beta }_{1},\cdots ,{\beta }_{p})}^{T}$ are the associated regression coefficients; if any $ {\beta }_{j}\ne 0 $ $ (j=1,\cdots ,p) $, then TF gene $ j $ regulates target gene $ i $. $ {\{\varepsilon }_{i}\} $ are independent and identically distributed random errors with mean 0 and variance $ {\sigma }^{2} $. The method to get an estimate of $\boldsymbol{\beta }$ and $ {\beta }_{0} $ is to transform this statistical problem to a convex optimization problem:

$ {\boldsymbol{\beta }}={argmin}_{\boldsymbol{\beta }} f\left({\boldsymbol{\beta }}\right)={argmin}_{\boldsymbol{\beta }}{\sum }_{i=1}^{n}L\left({y}_{i}-{{{\beta}} }_{0}-{x}_{i}^{T}{\boldsymbol{\beta }}\right) +\lambda P\left({\boldsymbol{\beta }}\right) $

(2)

where $ L(\cdot ) $ is a loss function, $ P(\cdot ) $ is a penalization function, and $ \lambda >0 $ is a tuning parameter which determines the importance of penalization. Different loss functions, penalization functions, and methods for determining $ \lambda $ have been proposed in the literature. Ordinary least squares (OLS) is the simplest method with a square loss function $L({y}_{i}-{\beta }_{0}-{x}_{i}^{T}{\boldsymbol{\beta }})={({y}_{i}-{\beta }_{0}-{x}_{i}^{T}{\boldsymbol{\beta }})}^{2}$ and no penalization function. The OLS estimator is unbiased^[25]. However, since it is common for the number of genes, $p$, to be much larger than the number of samples, $ n $, (i.e. $ p\gg n) $ in any given gene expression data set, there is no unique solution for OLS. Even when $n > p$, OLS estimation features high variance. To tackle these problems, ridge regression^[26] adds a $\ell _{2}$ penalty, $P\left({\boldsymbol{\beta }}\right)={\sum }_{j=1}^{p}{\beta }_{j}^{2}$, on the coefficients which introduces a bias but reduces the variance of the estimated, $ \hat{\beta } $. In ridge regression, there is a unique solution even for the $ p>n $ case. Least absolute shrinkage and selection operator (LASSO)^[27] is similar to ridge regression, except the $\ell _{2}$ penalty in ridge regression is replaced by the $\ell_{1}$ penalty, $P\left({\boldsymbol{\beta }}\right)={\sum }_{j=1}^{p}\left|{\beta }_{j}\right|$.

The main benefit of least absolute shrinkage and selection operator (LASSO) is that it performs variable selection and regularization simultaneously thereby generating a sparse solution, a desirable property for constructing GRNs. When LASSO is used for selecting regulatory TFs for a target gene, there are two potential limitations. First, if several TF genes are correlated and have large effects on the target gene, LASSO has a tendency to choose only one TF gene while zeroing out the other TF genes. Second, some studies^[28] state that LASSO does not have oracle properties; that is, it does not have the capability to identify the correct subset of true variables or to have an optimal estimation rate. It is claimed that there are cases where a given $ \lambda $ that leads to optimal estimation rate ends up with an inconsistent selection of variables. For the first limitation, Zou and Hastie^[29] proposed elastic net, in which the penalty is a mixture of LASSO and ridge regressions: $P\left({\boldsymbol{\beta }}\right)=\alpha {\sum }_{j=1}^{p}\left|{\beta }_{j}\right|+\frac{1-\alpha }{2}{\sum }_{j=1}^{p}{\beta }_{j}^{2}\;{\myriadfont\text{,}}$ where $ \alpha (0<\alpha <1) $ is called the elastic net mixing parameter. When $ \alpha =1\,{\myriadfont\text{,}} $ the elastic net penalty becomes the LASSO penalty; when $ \alpha =0 $, the elastic net penalty becomes the ridge penalty. For the second limitation, adaptive LASSO^[28] was proposed as a regularization method, which enjoys the oracle properties. The penalty function for adaptive LASSO is: $P\left({\boldsymbol{\beta }}\right)={\sum }_{j=1}^{p}{\hat{w}}_{j}\left|{\beta }_{j}\right| \;{\myriadfont\text{,}}$ where adaptive weight $ {\hat{w}}_{j}=\frac{1}{{\left|{\hat{\beta }}_{ini}\right|}^{\gamma }}\;{\myriadfont\text{,}} $ and $ \left|{\hat{\beta }}_{ini}\right| $ is an initial estimate of the coefficients obtained through ridge regression or LASSO; $ \gamma $ is a positive constant, and is usually set to 1. It is evident that adaptive LASSO penalizes more those coefficients with lower initial estimates.

It is well known that the square loss function is sensitive to heavy-tailed errors or outliers. Therefore, adaptive LASSO may fail to produce reliable estimates for datasets with heavy-tailed errors or outliers, which commonly appear in gene expression datasets. One possible remedy is to remove influential observations from the data before fitting a model, but it is difficult to differentiate true outliers from normal data. The other method is to use robust regression. Wang et al.^[30] combined the least absolute deviation (LAD) and weighted LASSO penalty to produce the LAD-LASSO method. The objective function is:

$ {\sum }_{i=1}^{n}\left|{y}_{i}-{\beta }_{0}-{x}_{i}^{T}{\boldsymbol{\beta }}\right|+\lambda {\sum }_{j=1}^{p}{\hat{w}}_{j}\left|{\beta }_{j}\right| $

(3)

With this LAD loss, LAD-LASSO is more robust than OLS to unusual $ y $ values, but it is sensitive to high leverage outliers. Moreover, LAD estimation degrades the efficiency of the resulting estimation if the error distribution is not heavy tailed^[31]. To achieve both robustness and efficiency, Lambert-Lacroix and Zwald 2011^[32], proposed Huber-LASSO, which combined the Huber loss function and a weighted LASSO penalty. The Huber function (see Materials and Methods) is a hybrid of squared error for relatively small errors and absolute error for relatively large ones. Owen 2007^[33] proposed the use of the Huber function as a loss function and the use of a reversed version of Huber’s criterion, called Berhu, as a penalty function. For the Berhu penalty (see Materials and Methods), relatively small coefficients contribute their $\ell _{1}$ norm to the penalty while larger ones cause it to grow quadratically. This Berhu penalty sets some coefficients to 0, like LASSO, while shrinking larger coefficients in the same way as ridge regression. In^[34], the authors showed that the combination of the Huber loss function and an adaptive Berhu penalty enjoys oracle properties, and they also demonstrated that this procedure encourages a grouping effect. In previous research, the authors solved a Huber-Berhu optimization problem using CVX software^[33−35], a Matlab-based modeling system for convex optimization. CVX turns Matlab into a modeling language, allowing constraints and objectives to be specified using standard Matlab expression syntax. However, since CVX is slow for large datasets, a proximal gradient descent algorithm was developed for the Huber-Berhu regression in this study, which runs much faster than CVX.

Reconstruction of GRNs often involves ill-posed problems due to high dimensionality and multicollinearity. Partial least squares (PLS) regression has been an alternative to ordinary regression for handling multicollinearity in several areas of scientific research. PLS couples a dimension reduction technique and a regression model. Although PLS has been shown to have good predictive performance in dealing with ill-posed problems, it is not particularly tailored for variable selection. Sæbø et al. 2007^[36] first proposed the soft-threshold-PLS (ST-PLS), in which the $\ell _{1}$ penalty is used for PLS loading weights of multiple latent components. Such a method is especially applicable for classification and variable selection when the number of variables is greater than the number of samples. Chun and Keleş 2010^[37] proposed a similar sparse PLS regression for simultaneous dimension reduction and variable selection. Both the methods from Sæbø et al. 2007 and Chun and Keleş 2010 used the same $\ell _{1}$ penalty for PLS loading weights. Lê Cao et al. 2008^[38] also proposed a sparse PLS method for variable selection when integrating omics data. They added sparsity into PLS with a LASSO penalization combined with singular value decomposition (SVD) computation. In this study, the Huber loss function and the Berhu penalty function were embedded into a PLS framework. Real gene data was used to demonstrate that this approach is applicable for the reconstruction of GRNs.

Algorithm 1: Accelerated proximal gradient descent method to minimize $ f\left({\boldsymbol{\beta }}\right) $ in equation (7) respected to $ {\beta }_{0} $ and ${\boldsymbol{\beta}}$
Input: predictor matrix ($X $), dependent vector ($y $), and penalty constant ($ {\boldsymbol{\lambda}}$)
Output: regression coefficient ($ {\boldsymbol{\beta }} $)
1	Initiate $ {\boldsymbol{\beta }}={\bf{0}} $, $\boldsymbol{t}$ = 1, $ {{\boldsymbol{\beta }}}_{\boldsymbol{p}\boldsymbol{r}\boldsymbol{e}\boldsymbol{v}}={\bf{0}} $
2	For $k $ in 1… MAX_ITER
3	$v={\boldsymbol{\beta }}+\left(k/\left( {k + 3} \right)\right)\boldsymbol{*}\left({\boldsymbol{\beta }}-{{\boldsymbol{\beta }}}_{\boldsymbol{p}\boldsymbol{r}\boldsymbol{e}\boldsymbol{v}}\right)$
4	compute the gradient of Huber loss at $ v $ using (5), denoted as $ {\boldsymbol{G}}_{\boldsymbol{v}} $
5	while TRUE
6	compute $ {\boldsymbol{p}}_{1}={\boldsymbol{P}\boldsymbol{r}\boldsymbol{o}\boldsymbol{x}}_{\boldsymbol{t},\boldsymbol{\lambda }\left\|\cdot \right\|}\left(\boldsymbol{v}\right) $ using (10)
7	compute ${\boldsymbol{p}}_{2}={\boldsymbol{P}\boldsymbol{r}\boldsymbol{o}\boldsymbol{x}}_{\boldsymbol{t},\boldsymbol{\lambda }\boldsymbol{u}}\left(\boldsymbol{p}_1\right)$ using (9)
8	if ${\bf\sum }_{i=1}^{n}{\boldsymbol{H}}_{\boldsymbol{M}}\left({\boldsymbol{y}}_{\boldsymbol{i}} -{\boldsymbol{\beta}}_{\boldsymbol{0}}- {\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}{\boldsymbol{p}}_{2}\right)\le {\sum }_{i=1}^{n}{\boldsymbol{H}}_{\boldsymbol{M}}\left({\boldsymbol{y}}_{\boldsymbol{i}} -{\boldsymbol{\beta}}_{\boldsymbol{0}}- {\boldsymbol{x}}_{\boldsymbol{i}}^{\boldsymbol{T}}\boldsymbol{v}\right)+$${\boldsymbol{G}}_{\boldsymbol{v}}'({\boldsymbol{p}}_{\bf 2} -\boldsymbol{v})+ \frac{\bf 1}{\bf 2\boldsymbol{t}}{\left\|\right\|{\boldsymbol{p}}_{\bf 2}-\boldsymbol{v}\left\|\right\|}_{\bf 2}^{\bf 2}$
9	break
10	else $ t=t*0.5 $
11	$ {{\boldsymbol{\beta }}}_{\boldsymbol{p}\boldsymbol{r}\boldsymbol{e}\boldsymbol{v}}={\boldsymbol{\beta }} $, $ {\boldsymbol{\beta }}={\boldsymbol{p}}_{2} $
12	if converged
13	break

Algorithm 2: Finding the solution of the Huber-Berhu PLS regression
Input: TF matrix ($ X $), pathway matrix ($ Y $), penalty constant (${\boldsymbol{\lambda}}$), and number of components ($ K $)
Output: regression coefficient matrix ($ A $)
1	$ {\boldsymbol{X}}_{0}=\boldsymbol{X},{\boldsymbol{X}}_{0}={\boldsymbol{Y}} $, $ {\boldsymbol{c}\boldsymbol{F}}={\boldsymbol{I}} $, $ {\boldsymbol{A}}={\bf{0}} $
2	For $ k $ in 1,...,$K $
3	set $ {\boldsymbol{M}}_{\boldsymbol{k}-1}={\boldsymbol{X}}_{\boldsymbol{k}-1}'{\boldsymbol{Y}}_{\boldsymbol{k}-1} $
4	Initialize $ {\boldsymbol{u} }$ to be the first left singular vector and initialize $ {\boldsymbol{v}} $ to be the product of first right singular vectors and first singular value.
5	until convergence of $ {\boldsymbol{u}} $ and $ {\boldsymbol{v}} $
6	update ${ \boldsymbol{u}} $ using (16)
7	update $ {\boldsymbol{v}} $ using (17)
8	extract component $ { {{ξ}}} ={\boldsymbol{X}\boldsymbol{u}} $
9	compute regression coefficients in (8) ${\boldsymbol c}={\boldsymbol X}'{ {{ξ}}}/({ {{ξ}}}'{ {{ξ}}}), \;{\boldsymbol d}={\boldsymbol Y}'{ {{ξ}}}/$$({ {{ξ}}}'{ {{ξ}}}) $
10	update $\boldsymbol{A}=\boldsymbol{A}+\boldsymbol{c}\boldsymbol{F}\cdot \boldsymbol{u}\cdot \boldsymbol{d}'$
11	update ${\boldsymbol{c}\boldsymbol{F}}={\boldsymbol{c}\boldsymbol{F}}\cdot ({\bf{I}}-{\boldsymbol{u}}\cdot {\boldsymbol{c}}'$)
12	compute residuals for $X $ and $ Y$, ${\boldsymbol{X}}={\boldsymbol{X}}- { {{ξ}}}{\boldsymbol c}'$, $ {\boldsymbol{Y}}= {\boldsymbol{Y}}- { {{ξ}}}{\boldsymbol d}$

{{lists.name}}

HB-PLS: A statistical method for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression