Feature selection with annealing for shallow neural networks using the multistage stochastic algorithm

Xinglei Li; Chenlu Zhang; Lizhe Sun; Zhufeng Meng; He Zhang

doi:10.48130/stati-0026-0008

Feature selection is a fundamental challenge in statistics and machine learning, playing a critical role in improving prediction accuracy and enhancing model interpretability. It is widely applied across various scientific and practical domains. However, most existing variable selection methods are developed within the framework of linear and generalized linear models, which limits their applicability to more complex data structures. In this article, we introduce two novel nonlinear variable selection approaches that integrate an iterative hard-thresholding operator into the stochastic training of shallow neural networks. The proposed methods simultaneously identify important features and build a predictive model on the selected subset using shallow neural networks. Furthermore, we propose a new Bayesian information criterion that facilitates effective model selection in the high-dimensional setting. Extensive numerical experiments on both simulated and real-world data sets demonstrate that the proposed methods show remarkable performance compared with the classical approaches in the literature.

Introduction

Feature selection is a fundamental topic of research in statistics and machine learning, as it improves predictive accuracy, enhances interpretability, and reduces computational costs by eliminating irrelevant or redundant variables. Consequently, it has been widely used in real-world data analysis across various application domains.

Several methods for variable selection have been proposed for both parametric and nonparametric models. The classical techniques include the $ \ell_0 $-penalized approach^[1,2], the $ \ell_1 $-penalized method^[3−5], as well as the SCAD^[6] and MCP^[7] methods. In addition, group-penalized methods, such as group Lasso^[8] and adaptive group Lasso^[9], have been proposed for variable selection. Although these methods are widely applied in practice, they have primarily been developed for high-dimensional linear and generalized linear models. Extensions to high-dimensional varying-coefficient models^[10] and sparse additive models^[11] have also been explored, but these are less applicable to neural networks.

Being a fundamental component of deep learning, neural networks have undergone significant theoretical and empirical advances, and can now be used to approximate complex data structures and tackle intricate modeling tasks^[12−16]. However, neural networks are often over-parameterized, posing challenges for training, prediction, and interpretation. Several methods have been developed to reduce the number of model parameters and compress network architectures in order to enhance the predictive performance. A widely adopted solution is dropout^[17], which randomly drops nodes during training to prevent overfitting. Subsequently, sparse neural networks have been proposed to reduce the number of parameters and compress model architectures. The existing research on sparse neural networks falls into two categories: Bayesian and frequentist. Within the Bayesian framework, various sparsity-inducing priors, such as the hierarchical prior^[18,19], the mixture Gaussian prior^[20], and the spike-and-slab prior^[21], have been proposed for learning sparse Bayesian neural networks. In the frequentist framework, the group $ \ell_1 $-regularized methods^[22−25] have been proposed to learn sparse neural networks. In addition, node pruning techniques^[26,27] have been introduced to compress neural network architectures.

Although the existing methods encourage sparsity in the weight matrix and compress the network architectures, they are not specifically designed for variable selection, and few of them directly address the sparse-input problem. To identify the important covariates in neural network models, several nonlinear variable selection techniques have been proposed. First, the group $ \ell_1 $ penalty has been applied to induce sparsity in the input weight matrix^[28]. Next, an additional selection layer has been introduced, with both $ \ell_0 $^[29] and $ \ell_1 $^[30] penalties used for variable selection. More recently, a group $ \ell_0 $-regularized method has been proposed for variable selection^[31], using the iterative hard-thresholding (IHT) algorithm to estimate the parameters^[32]. Furthermore, the concept of the group $ \ell_0 $ penalty has been extended to nonparametric expectile regression within the framework of deep neural networks^[33].
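
To make the group structure concrete, the penalties discussed above can be written down in a notation that is assumed here purely for illustration and is not taken from the article: let $ W^{(1)} \in \mathbb{R}^{p \times h} $ be the input weight matrix of a network with $ p $ inputs and $ h $ first-layer hidden nodes, and let $ W^{(1)}_{j\cdot} $ be the row collecting all weights that leave input $ x_j $.

```latex
% Group l1 penalty: shrinks the norm of every row, with tuning parameter \lambda
P_{\ell_1}\bigl(W^{(1)}\bigr) = \lambda \sum_{j=1}^{p} \bigl\| W^{(1)}_{j\cdot} \bigr\|_2
% Group l0 constraint: at most k inputs keep a nonzero row
\sum_{j=1}^{p} \mathbb{1}\bigl\{ \bigl\| W^{(1)}_{j\cdot} \bigr\|_2 \neq 0 \bigr\} \le k
```

In either form, input $ x_j $ is removed from the model exactly when its entire row is zero, which is why sparsity is imposed on rows rather than on individual weights.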

However, the current neural network-based variable selection methods face challenges in both variable selection accuracy and predictive performance. First, the $ \ell_1 $-penalized neural network approaches rely heavily on tuning parameters, leading to instability in both variable selection and prediction. Moreover, using the $ \ell_1 $ penalty may incur unwanted bias and lead to inconsistency in variable selection^[5,34]. Therefore, the $ \ell_0 $ penalty may be an ideal choice, as it does not incur any bias^[2].

Just as the $ \ell_1 $ penalty is associated with the soft-thresholding operator, the $ \ell_0 $ penalty corresponds to the hard-thresholding operator^[35]. Accordingly, the IHT algorithm^[32], which combines gradient descent with a hard-thresholding operator, has been proposed to minimize $ \ell_0 $-penalized loss functions. Although this gradient-based approach is computationally efficient and easy to implement, it may converge to local optima when applied to neural networks. Moreover, $ \ell_0 $- and $ \ell_1 $-regularized methods require users to prespecify the number of selected variables $ k $ and the regularization parameter $ \lambda $, respectively.
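
The contrast between the two thresholding operators, and the way IHT alternates a gradient step with hard thresholding, can be illustrated with a minimal NumPy sketch. It is written for sparse linear least squares rather than for neural networks, and the function names, the step-size choice, and the toy vector are assumptions made for illustration, not details from the article.

```python
import numpy as np

def soft_threshold(v, lam):
    """Operator associated with the l1 penalty: shrink every entry toward zero by lam."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def hard_threshold(v, k):
    """Operator associated with the l0 constraint: keep the k largest-magnitude
    entries unchanged and set the rest to zero."""
    out = np.zeros_like(v)
    if k > 0:
        keep = np.argsort(np.abs(v))[-k:]
        out[keep] = v[keep]
    return out

def iht_least_squares(X, y, k, n_iter=200):
    """Illustrative IHT for min ||y - X b||^2 / (2n) subject to ||b||_0 <= k."""
    n, p = X.shape
    # Assumed step size: inverse of the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n            # full-batch (deterministic) gradient
        b = hard_threshold(b - step * grad, k)  # gradient step, then keep k entries
    return b

v = np.array([3.0, -0.5, 1.2, -2.0])
print(soft_threshold(v, 1.0))  # surviving entries are shrunk by 1.0
print(hard_threshold(v, 2))    # surviving entries are left untouched
```

The soft-thresholded vector shows the shrinkage that underlies the bias of the $ \ell_1 $ penalty, whereas hard thresholding leaves the retained entries at their original values.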

In contrast to the IHT algorithm, which is based on deterministic learning^[31], we propose the multistage stochastic iterative hard-thresholding (StoIHT) algorithm. The proposed methods, which are based on stochastic learning algorithms, select important variables by estimating the weights connecting the input and the first hidden layer. We then develop a novel variable selection method with an annealing strategy within the framework of a multistage stochastic algorithm^[36]. This approach can substantially enhance both variable selection and prediction accuracy. To simplify the discussion and provide a clear description of our methodology, we focus on shallow neural networks in this article. Furthermore, we investigate the criteria for variable selection and develop a novel Bayesian information criterion (BIC) tailored to the high-dimensional setting (the large-p, small-n scenario), along with the conventional BIC for the low-dimensional setting (the n > p scenario)^[37]. The main contributions of this paper are as follows:

(1) This article proposes two novel variable selection methods that use neural networks as surrogate models with a group $ \ell_0 $ penalty. To enhance the selection accuracy, we introduce an annealing strategy and develop a multistage stochastic optimization framework to identify key variables. The performance and effectiveness of the proposed methods are demonstrated through numerical experiments and real-data applications.

(2) This paper develops a novel BIC for model selection in the high-dimensional setting. Numerical experiments demonstrate that the proposed criterion substantially improves variable selection accuracy relative to the conventional BIC.
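
The ingredients named in contribution (1), namely a group hard-thresholding step on the input weights, minibatch (stochastic) updates, and an annealing schedule that gradually reduces the number of retained inputs, can be combined in a rough NumPy sketch. This is not the authors' algorithm: the linear schedule, the once-per-epoch thresholding, the squared-error loss, and all names and default values are hypothetical choices made only to show how the pieces fit together.

```python
import numpy as np

def group_hard_threshold(W1, m):
    """Zero every row of W1 (one row per input feature) except the m rows
    with the largest Euclidean norm."""
    out = np.zeros_like(W1)
    if m > 0:
        keep = np.argsort(np.linalg.norm(W1, axis=1))[-m:]
        out[keep] = W1[keep]
    return out

def annealing_schedule(p, k, n_epochs):
    """Hypothetical schedule: the number of retained features falls linearly
    from p to k over the first half of training, then stays at k."""
    half = max(n_epochs // 2, 1)
    return [max(k, int(round(p - (p - k) * min(e / half, 1.0))))
            for e in range(1, n_epochs + 1)]

def fit_sparse_shallow_net(X, y, k, hidden=8, n_epochs=40, batch=32, lr=0.01, seed=0):
    """One-hidden-layer ReLU network trained by minibatch SGD on squared error,
    with the input weight matrix group-thresholded after every epoch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W1 = rng.standard_normal((p, hidden)) * np.sqrt(2.0 / p)
    b1 = np.zeros(hidden)
    w2 = rng.standard_normal(hidden) * np.sqrt(2.0 / hidden)
    b2 = 0.0
    for m in annealing_schedule(p, k, n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            Xb, yb = X[idx], y[idx]
            H = np.maximum(Xb @ W1 + b1, 0.0)    # ReLU hidden layer
            resid = H @ w2 + b2 - yb             # gradient of 0.5 * squared error
            gH = np.outer(resid, w2) * (H > 0)   # back-propagate through ReLU
            W1 -= lr * Xb.T @ gH / len(idx)
            b1 -= lr * gH.mean(axis=0)
            w2 -= lr * H.T @ resid / len(idx)
            b2 -= lr * resid.mean()
        W1 = group_hard_threshold(W1, m)         # keep only m input features
    selected = np.flatnonzero(np.linalg.norm(W1, axis=1) > 0)
    return selected, (W1, b1, w2, b2)

# Toy usage on synthetic data in which only the first two inputs matter.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((300, 20))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1]
selected_demo, _ = fit_sparse_shallow_net(X_demo, y_demo, k=2)
print(selected_demo)  # indices of the two retained inputs
```

In this sketch a pruned input can re-enter the model, because its weights receive gradient updates again in the next epoch. Whether removed features may return is a design choice that differs between thresholding-based methods, and the sketch makes no claim about which variant the article adopts.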

The remainder of this paper is organized as follows. The next section introduces the notation and outlines the model framework. The section "Methodology" presents the proposed variable selection approaches based on a stochastic optimization strategy. The section "Simulation" assesses the performance of the proposed methods through numerical experiments and real data applications. The last section provides the conclusions and discusses the potential directions for future research.

[1]	Barbu A, She Y, Ding L, Gramajo G. 2017. Feature selection with annealing for computer vision and big data learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(2):272−286 doi: 10.1109/TPAMI.2016.2544315
[2]	She Y, Shen J, Barbu A. 2023. Slow kill for big data learning. IEEE Transactions on Information Theory 69(9):5936−5955 doi: 10.1109/TIT.2023.3273179
[3]	Tibshirani R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Methodological 58(1):267−288 doi: 10.1111/j.2517-6161.1996.tb02080.x
[4]	Zou H, Hastie T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology 67(2):301−320 doi: 10.1111/j.1467-9868.2005.00503.x
[5]	Zou H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476):1418−1429 doi: 10.1198/016214506000000735
[6]	Fan J, Li R. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96(456):1348−1360 doi: 10.1198/016214501753382273
[7]	Zhang CH. 2010. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38(2):894−942 doi: 10.1214/09-AOS729
[8]	Yuan M, Lin Y. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B: Statistical Methodology 68(1):49−67 doi: 10.1111/j.1467-9868.2005.00532.x
[9]	Wang M, Tian GL. 2019. Adaptive group lasso for high-dimensional generalized linear models. Statistical Papers 60(5):1469−1486 doi: 10.1007/s00362-017-0882-z
[10]	Wei F, Huang J, Li H. 2011. Variable selection and estimation in high-dimensional varying-coefficient models. Statistica Sinica 21(4):1515−1540 doi: 10.5705/ss.2009.316
[11]	Ravikumar P, Lafferty J, Liu H, Wasserman L. 2009. Sparse additive models. Journal of the Royal Statistical Society Series B: Statistical Methodology 71(5):1009−1030 doi: 10.1111/j.1467-9868.2009.00718.x
[12]	Schmidt-Hieber J. 2020. Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics 48(4):1875−1897 doi: 10.1214/19-aos1875
[13]	Nakada R, Imaizumi M. 2020. Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research 21(174):1−38
[14]	Kohler M, Langer S. 2021. On the rate of convergence of fully connected deep neural network regression estimates. The Annals of Statistics 49(4):2231−2249 doi: 10.1214/20-aos2034
[15]	Jiao Y, Shen G, Lin Y, Huang J. 2023. Deep nonparametric regression on approximate manifolds: nonasymptotic error bounds with polynomial prefactors. The Annals of Statistics 51(2):691−716 doi: 10.1214/23-aos2266
[16]	Siegel JW. 2023. Optimal approximation rates for deep ReLU neural networks on Sobolev and Besov spaces. Journal of Machine Learning Research 24(357):1−52
[17]	Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929−1958
[18]	Liang F, Li Q, Zhou L. 2018. Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association 113(523):955−972 doi: 10.1080/01621459.2017.1409122
[19]	Ghosh S, Yao J, Doshi-Velez F. 2019. Model selection in Bayesian neural networks via horseshoe priors. Journal of Machine Learning Research 20(182):1−46
[20]	Sun Y, Song Q, Liang F. 2022. Consistent sparse deep learning: theory and computation. Journal of the American Statistical Association 117(540):1981−1995 doi: 10.1080/01621459.2021.1895175
[21]	Sun Y, Song Q, Liang F. 2022. Learning sparse deep neural networks with a spike-and-slab prior. Statistics & Probability Letters 180:109246 doi: 10.1016/j.spl.2021.109246
[22]	Wen W, Wu C, Wang Y, Chen Y, Li H. 2016. Learning structured sparsity in deep neural networks. NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. Vol. 29. Red Hook, NY, USA: Curran Associates, Inc. pp. 2082–2090 https://proceedings.neurips.cc/paper_files/paper/2016/file/41bfd20a38bb1b0bec75acf0845530a7-Paper.pdf (Accessed March 20, 2026)
[23]	Scardapane S, Comminiello D, Hussain A, Uncini A. 2017. Group sparse regularization for deep neural networks. Neurocomputing 241:81−89 doi: 10.1016/j.neucom.2017.02.029
[24]	Bungert L, Roith T, Tenbrinck D, Burger M. 2022. A Bregman learning framework for sparse neural networks. Journal of Machine Learning Research 23(192):1−43
[25]	Li G, Wang G, Ding J. 2023. Provable identifiability of two-layer ReLU neural networks via LASSO regularization. IEEE Transactions on Information Theory 69(9):5921−5935 doi: 10.1109/tit.2023.3274152
[26]	Guo Y, She Y, Barbu A. 2021. Network pruning via annealing and direct sparsity control. 2021 International Joint Conference on Neural Networks (IJCNN). Shenzhen, China, 18−22 July 2021. New Jersey: IEEE. pp. 1−8 doi: 10.1109/ijcnn52387.2021.9533741
[27]	Jantre S, Bhattacharya S, Maiti T. 2025. Spike-and-slab shrinkage priors for structurally sparse Bayesian neural networks. IEEE Transactions on Neural Networks and Learning Systems 36(6):11176−11188 doi: 10.1109/tnnls.2024.3485529
[28]	Dinh VC, Ho LS. 2020. Consistent feature selection for analytic deep neural networks. NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020, eds. Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H. Vol. 33. Red Hook, NY, USA: Curran Associates, Inc. pp. 2420−2431 https://proceedings.neurips.cc/paper_files/paper/2020/file/1959eb9d5a0f7ebc58ebde81d5df400d-Paper.pdf (Accessed March 20, 2026)
[29]	Chen Y, Gao Q, Liang F, Wang X. 2021. Nonlinear variable selection via deep neural networks. Journal of Computational and Graphical Statistics 30(2):484−492 doi: 10.1080/10618600.2020.1814305
[30]	Lemhadri I, Ruan F, Abraham L, Tibshirani R. 2021. LassoNet: a neural network with feature sparsity. Journal of Machine Learning Research 22(127):1−29
[31]	Yang Z, Zheng S, Tang N. 2026. Supervised predictive modeling of high-dimensional data with group ℓ⁰-norm constrained neural networks. Journal of Computational and Graphical Statistics 00:1−14 doi: 10.1080/10618600.2025.2581774
[32]	Yuan XT, Li P, Zhang T. 2018. Gradient hard thresholding pursuit. Journal of Machine Learning Research 18(166):1−43
[33]	Yang R, Song Y. 2024. Nonparametric expectile regression meets deep neural networks: a robust nonlinear variable selection method. Statistical Analysis and Data Mining: The ASA Data Science Journal 17(6):e70002 doi: 10.1002/sam.70002
[34]	Zhao P, Yu B. 2006. On model selection consistency of lasso. Journal of Machine Learning Research 7(90):2541−2563
[35]	She Y. 2009. Thresholding-based iterative selection procedures for model selection and shrinkage. Electronic Journal of Statistics 3:384−415 doi: 10.1214/08-ejs348
[36]	Agarwal A, Negahban SN, Wainwright MJ. 2012. Stochastic optimization and sparse statistical recovery: optimal algorithms for high dimensions. Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA. Red Hook, NY, USA: Curran Associates, Inc. pp. 1547–1555 https://proceedings.neurips.cc/paper_files/paper/2012/file/5751ec3e9a4feab575962e78e006250d-Paper.pdf (Accessed March 20, 2026)
[37]	McInerney A, Burke K. 2025. A statistical modelling approach to feedforward neural network model selection. Statistical Modelling 25(4):323−342 doi: 10.1177/1471082x241258261
[38]	Nguyen N, Needell D, Woolf T. 2017. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory 63(11):6869−6895 doi: 10.1109/tit.2017.2749330
[39]	Sun L, Barbu A. 2025. Stochastic feature selection with annealing and its applications to streaming data. Journal of Nonparametric Statistics 37(3):580−597 doi: 10.1080/10485252.2025.2456767
[40]	Zou H, Hastie T, Tibshirani R. 2007. On the "degrees of freedom" of the lasso. The Annals of Statistics 35(5):2173−2192 doi: 10.1214/009053607000000127
[41]	Du J, Li Z, Gu Z, Feng L. 2025. A nonparametric statistics approach to feature selection in deep neural networks with theoretical guarantees. arXiv 2512.13565 doi: 10.48550/arXiv.2512.13565
[42]	Guo Y, Wu YN, Barbu A. 2021. A study of local optima for learning feature interactions using neural networks. 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18−22 July 2021. New Jersey: IEEE. pp. 1−8 doi: 10.1109/ijcnn52387.2021.9533833
[43]	He K, Zhang X, Ren S, Sun J. 2015. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7−13 December 2015. New Jersey: IEEE. pp. 1026−1034 doi: 10.1109/iccv.2015.123
[44]	Liang F, Xue J, Jia B. 2022. Markov neighborhood regression for high-dimensional inference. Journal of the American Statistical Association 117(539):1200−1214 doi: 10.1080/01621459.2020.1841646
[45]	Sun L, Liang F. 2022. Markov neighborhood regression for statistical inference of high-dimensional generalized linear models. Statistics in Medicine 41(20):4057−4078 doi: 10.1002/sim.9493
[46]	Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, et al. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12(85):2825−2830
[47]	Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, et al. 2012. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483:603−607 doi: 10.1038/nature11003
[48]	Hadley KE, Hendricks DT. 2014. Use of NQO1 status as a selective biomarker for oesophageal squamous cell carcinomas with greater sensitivity to 17-AAG. BMC Cancer 14:334 doi: 10.1186/1471-2407-14-334
[49]	Guyon I, Gunn S, Ben-Hur A, Dror G. 2004. Result analysis of the NIPS 2003 feature selection challenge. Advances in Neural Information Processing Systems 17 (NIPS 2004), eds. Saul L, Weiss Y, Bottou L. Cambridge, MA: MIT Press. pp. 545–552 https://proceedings.neurips.cc/paper_files/paper/2004/file/5e751896e527c862bf67251a474b3819-Paper.pdf (Accessed March 20, 2026)

Sample size	Parameter	Multistage FSA			Multistage StoIHT			IHT
		−	HD-BIC	BIC	−	HD-BIC	BIC	−	HD-BIC	BIC
n = 1,000	Model size^†	10 (0)	11.67 (1.556)	6 (0)	10 (0)	11.99 (1.972)	6 (0)	10 (0)	10.91 (1.167)	6 (0)
	FSR^†	0.091 (0.071)	0.207 (0.113)	0 (0)	0.185 (0.074)	0.284 (0.135)	0 (0)	0.075 (0.099)	0.141 (0.120)	0 (0)
	NSR^†	0.091 (0.071)	0.089 (0.073)	0.4 (0)	0.185 (0.074)	0.165 (0.077)	0.4 (0)	0.075 (0.099)	0.072 (0.094)	0.4 (0)
	RMSE^†	1.745 (0.510)	1.743 (0.533)	3.333 (0.142)	2.353 (0.461)	2.234 (0.499)	3.222 (0.101)	1.520 (0.617)	1.492 (0.593)	3.197 (0.095)
	Time^‡	2.972	34.27	32.63	11.58	128.1	130.0	2.017	22.01	23.17
n = 1,500	Model size^†	10 (0)	10.43 (0.711)	7.160 (1.815)	10 (0)	10.41 (0.861)	6.040 (0.398)	10 (0)	10.43 (0.778)	6 (0)
	FSR^†	0.022 (0.044)	0.061 (0.075)	0.006 (0.024)	0.013 (0.034)	0.041 (0.068)	0 (0)	0.073 (0.094)	0.090 (0.106)	0 (0)
	NSR^†	0.022 (0.044)	0.025 (0.048)	0.290 (0.173)	0.013 (0.034)	0.007 (0.026)	0.396 (0.040)	0.073 (0.094)	0.056 (0.089)	0.4 (0)
	RMSE^†	1.192 (0.343)	1.228 (0.400)	2.802 (1.054)	1.119 (0.270)	1.075 (0.205)	3.206 (0.236)	1.467 (0.563)	1.368 (0.538)	3.179 (0.097)
	Time^‡	4.551	51.80	52.55	17.22	192.0	192.8	3.183	34.48	36.30
n = 3,000	Model size^†	10 (0)	10.02 (0.140)	10 (0)	10 (0)	10.35 (0.921)	10 (0)	10 (0)	10.51 (0.877)	10.34 (0.681)
	FSR^†	0 (0)	0.002 (0.013)	0 (0)	0 (0)	0.028 (0.067)	0 (0)	0.070 (0.094)	0.089 (0.100)	0.081 (0.098)
	NSR^†	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0 (0)	0.070 (0.094)	0.048 (0.083)	0.053 (0.085)
	RMSE^†	1.192 (0.343)	1.228 (0.400)	2.802 (1.054)	1.119 (0.270)	1.075 (0.205)	3.206 (0.236)	1.467 (0.563)	1.313 (0.506)	1.347 (0.525)
	Time^‡	4.551	51.80	52.55	17.22	192.0	192.8	3.183	56.24	59.04
All experiments are conducted on 100 independent data sets. ^† Average FSR, NSR, and RMSE of test data are presented; ^‡ average computational time (s).

Sample size	Parameter	Multistage FSA			Multistage StoIHT			IHT
		−	HD-BIC	BIC	−	HD-BIC	BIC	−	HD-BIC	BIC
n = 10⁴	Model size^†	6 (0)	8.270 (1.399)	5.850 (0.766)	6 (0)	7.810 (1.270)	6.120 (0.325)	6 (0)	7.640 (1.797)	5.120 (1.283)
	FSR^†	0.100 (0.082)	0.309 (0.142)	0.070 (0.090)	0 (0)	0.212 (0.126)	0.017 (0.046)	0.275 (0.128)	0.401 (0.164)	0.172 (0.175)
	NSR^†	0.100 (0.082)	0.077 (0.083)	0.102 (0.081)	0 (0)	0 (0)	0 (0)	0.275 (0.128)	0.278 (0.106)	0.323 (0.099)
	RMSE^†	2.518 (0.176)	2.453 (0.168)	2.543 (0.222)	1.506 (0.084)	1.531 (0.082)	1.519 (0.082)	2.888 (0.207)	2.943 (0.183)	3.020 (0.198)
	Time^‡	23.78	195.6	192.8	27.75	223.5	219.7	10.80	83.04	89.33
n = 2 × 10⁴	Model size^†	6 (0)	8.150 (1.352)	6.350 (0.684)	6 (0)	8.180 (1.403)	6.090 (0.286)	6 (0)	7.740 (1.659)	6.510 (1.895)
	FSR^†	0.015 (0.048)	0.257 (0.125)	0.059 (0.094)	0 (0)	0.243 (0.139)	0.013 (0.041)	0.282 (0.122)	0.406 (0.164)	0.291 (0.205)
	NSR^†	0.015 (0.048)	0.017 (0.050)	0.013 (0.045)	0 (0)	0 (0)	0 (0)	0.282 (0.122)	0.267 (0.139)	0.283 (0.130)
	RMSE^†	2.110 (0.150)	2.087 (0.159)	2.076 (0.159)	1.408 (0.075)	1.393 (0.083)	1.390 (0.083)	2.906 (0.210)	2.885 (0.280)	2.943 (0.308)
	Time^‡	47.67	379.9	386.5	55.51	457.8	446.5	18.11	156.6	162.2
n = 3 × 10⁴	Model size^†	6 (0)	7.790 (1.409)	6.610 (0.937)	6 (0)	7.760 (1.556)	6.020 (0.140)	6 (0)	7.660 (1.589)	6.590 (1.650)
	FSR^†	0.007 (0.033)	0.209 (0.148)	0.080 (0.113)	0 (0)	0.195 (0.159)	0.003 (0.020)	0.252 (0.124)	0.388 (0.159)	0.296 (0.180)
	NSR^†	0.007 (0.033)	0.007 (0.033)	0.003 (0.023)	0 (0)	0 (0)	0 (0)	0.252 (0.124)	0.252 (0.119)	0.267 (0.120)
	RMSE^†	1.823 (0.149)	1.816 (0.143)	1.800 (0.130)	1.357 (0.080)	1.358 (0.069)	1.354 (0.067)	2.859 (0.242)	2.852 (0.246)	2.876 (0.241)
	Time^‡	72.70	587.9	586.0	82.18	682.6	670.2	33.98	264.8	268.5
All experiments are conducted on 100 independent data sets. ^† Average FSR, NSR, and RMSE of the test data are presented; ^‡ average computational time (s).

Sample size	Parameter	Multistage FSA			Multistage StoIHT			IHT
		−	HD-BIC	BIC	−	HD-BIC	BIC	−	HD-BIC	BIC
n = 10⁴	Model size^†	6 (0)	6.530 (0.830)	4.680 (0.760)	6 (0)	5.450 (1.982)	4.020 (0.140)	6 (0)	7.500 (1.825)	4.040 (0.196)
	FSR^†	0.072 (0.118)	0.135 (0.161)	0.021 (0.069)	0.650 (0.065)	0.580 (0.131)	0.497 (0.088)	0.330 (0.023)	0.411 (0.146)	0.248 (0.010)
	NSR^†	0.072 (0.118)	0.073 (0.134)	0.238 (0.125)	0.650 (0.065)	0.652 (0.075)	0.663 (0.058)	0.330 (0.023)	0.305 (0.071)	0.493 (0.033)
	AUC^†	0.960 (0.008)	0.959 (0.011)	0.942 (0.021)	0.903 (0.015)	0.903 (0.016)	0.902 (0.015)	0.933 (0.009)	0.936 (0.013)	0.904 (0.013)
	Time^‡	24.22	201.6	196.3	44.29	365.7	375.2	12.23	105.1	101.6
n = 2 × 10⁴	Model size^†	6 (0)	7.460 (1.315)	5.890 (0.467)	6 (0)	6.880 (1.856)	4.310 (0.523)	6 (0)	7.670 (1.750)	5.190 (0.674)
	FSR^†	0.013 (0.056)	0.175 (0.138)	0.005 (0.027)	0.512 (0.114)	0.505 (0.159)	0.317 (0.157)	0.330 (0.023)	0.426 (0.133)	0.219 (0.065)
	NSR^†	0.013 (0.056)	0.003 (0.023)	0.023 (0.078)	0.512 (0.114)	0.468 (0.112)	0.515 (0.100)	0.330 (0.023)	0.302 (0.073)	0.332 (0.029)
	AUC^†	0.967 (0.007)	0.967 (0.006)	0.956 (0.049)	0.915 (0.016)	0.918 (0.018)	0.915 (0.017)	0.934 (0.009)	0.938 (0.013)	0.934 (0.009)
	Time^‡	50.43	398.4	397.6	88.29	711.3	750.4	22.64	198.6	189.5
n = 3 × 10⁴	Model size^†	6 (0)	8.810 (1.102)	5.970 (0.386)	6 (0)	4.710 (1.458)	4.050 (0.218)	6 (0)	7.640 (1.616)	5.450 (1.117)
	FSR^†	0.008 (0.036)	0.309 (0.100)	0.008 (0.033)	0.350 (0.055)	0.150 (0.173)	0.075 (0.112)	0.332 (0.017)	0.431 (0.119)	0.244 (0.097)
	NSR^†	0.008 (0.036)	0.003 (0.023)	0.013 (0.061)	0.350 (0.055)	0.367 (0.067)	0.377 (0.073)	0.332 (0.017)	0.303 (0.080)	0.330 (0.053)
	AUC^†	0.967 (0.007)	0.968 (0.006)	0.962 (0.034)	0.936 (0.008)	0.935 (0.008)	0.934 (0.009)	0.933 (0.008)	0.938 (0.014)	0.935 (0.012)
	Time^‡	73.86	592.2	592.6	124.3	1069	1138	33.84	256.8	250.2
All experiments are conducted on 100 independent data sets. ^† Average FSR, NSR, and AUC for test data are presented; ^‡ average computational time (s).

	Parameter	Multistage FSA			Multistage StoIHT			IHT
		−	HD-BIC	BIC	−	HD-BIC	BIC	−	HD-BIC	BIC
k* = 3	Model size^†	3 (0)	3.580 (0.982)	1 (0)	3 (0)	6.940 (0.276)	1 (0)	3 (0)	1.160 (0.913)	1 (0)
	FSR^†	0.277 (0.430)	0.333 (0.405)	0.250 (0.433)	0.870 (0.199)	0.834 (0.117)	0.880 (0.325)	0.997 (0.033)	0.887 (0.313)	0.890 (0.313)
	NSR^†	0.277 (0.430)	0.253 (0.419)	0.750 (0.144)	0.870 (0.199)	0.613 (0.274)	0.960 (0.108)	0.997 (0.033)	0.957 (0.112)	0.963 (0.104)
	AUC^†	0.848 (0.227)	0.856 (0.225)	0.501 (0.020)	0.505 (0.053)	0.522 (0.110)	0.500 (0.019)	0.500 (0.009)	0.500 (0.000)	0.500 (0.000)
	Time^‡	2.378	18.58	19.81	2.833	22.64	23.22	4.458	17.71	17.59
k* = 4	Model size^†	4 (0)	6.130 (1.101)	1 (0)	4 (0)	7 (0)	1 (0)	4 (0)	1 (0)	1 (0)
	FSR^†	0.290 (0.396)	0.473 (0.260)	0.320 (0.466)	0.840 (0.175)	0.786 (0.132)	0.880 (0.325)	1 (0)	0.850 (0.357)	0.850 (0.357)
	NSR^†	0.290 (0.396)	0.223 (0.337)	0.830 (0.117)	0.840 (0.175)	0.625 (0.230)	0.970 (0.081)	1 (0)	0.963 (0.089)	0.963 (0.089)
	AUC^†	0.812 (0.231)	0.823 (0.221)	0.497 (0.018)	0.498 (0.019)	0.506 (0.050)	0.497 (0.020)	0.498 (0.010)	0.500 (0.001)	0.500 (0.001)
	Time^‡	2.615	20.51	20.47	2.788	24.52	24.50	4.611	17.46	17.43
k* = 5	Model size^†	5 (0)	6.590 (0.750)	1 (0)	5 (0)	7 (0)	1 (0)	5 (0)	1 (0)	1 (0)
	FSR^†	0.636 (0.313)	0.624 (0.264)	0.610 (0.488)	0.792 (0.160)	0.734 (0.146)	0.830 (0.376)	0.998 (0.020)	0.820 (0.384)	0.820 (0.384)
	NSR^†	0.636 (0.313)	0.512 (0.325)	0.922 (0.098)	0.792 (0.160)	0.628 (0.204)	0.966 (0.075)	0.998 (0.020)	0.964 (0.077)	0.964 (0.077)
	AUC^†	0.573 (0.166)	0.598 (0.185)	0.500 (0.019)	0.503 (0.020)	0.508 (0.048)	0.500 (0.017)	0.498 (0.011)	0.500 (0.000)	0.500 (0.000)
	Time^‡	2.616	20.88	20.97	2.828	25.30	23.83	4.618	17.37	17.57
All experiments are conducted on 100 independent data sets. ^† Average FSR, NSR, and AUC for test data are presented; ^‡ average computational time (s).
