SAS MACRO for forward model selection in a logistic regression model with missing values

Mingyue Han; Xueqing Zhang; ChangXing Ma; Mingyue Han; Xueqing Zhang; ChangXing Ma

doi:10.48130/stati-0024-0002

Missing data often presents significant challenges in statistical modeling. Previous practices have demonstrated that the SAS LOGISTIC procedure for forward, backward, or stepwise selections may produce unreliable results when a substantial portion of data is missing in logistic regression models. To address this issue, a new Macro-based Forward Model Selection method (MFMS) was developed to overcome the limitations of the LOGISTIC procedure. The MFMS approach preserves information from incomplete observations by Macro, automated controlled addition of one significant effect at a time while retaining incomplete observations for other variables in the analysis. To evaluate the robustness of MFMS, simulations were conducted using independent multivariate normal data with a binary response. The complete data was randomly deleted at rates of 5%, 10%, 15%, and 20% for each predictor variable, generating datasets under three mechanisms. The performance of MFMS, PROC LOGISTIC, and MI procedures were compared for logistic modeling across different levels of missingness and mechanisms. The results demonstrated that MFMS consistently outperformed the other methods, proving to be the most reliable model selection method. The Macro implementation of MFMS is available for download as a supplementary file to this paper.

HTML

Introduction

Missing data have long been a serious impediment in statistical analysis and modeling in research and business. The logistic regression model is a standard statistical method used in analysis when the response variables are dichotomous or binary^[1]. The present work aims to demonstrate the advantages or disadvantages of logistic regression modeling methods among SAS PROC LOGISTIC forward selection, SAS MI procedure, and the developed macro-based forward selection method, using the artificial binary response data with the presence of random missing values.

Missing data mechanisms
According to the generating mechanisms, missing data can be classified into three board types: missing completely at random (MCAR) if missing data are completely independent of any other factors; missing at random (MAR) describes the missingness based on other complete variables; not missing at random (NMAR) if the probability of missing data systematically depends on incomplete variables. According to missing data mechanisms, distribution assumptions are made in advance indicating the modeling methods. To specify, missing data does not represent the entire data. Thus, the modeling assumption could be MCAR, MAR, or NMAR depending on the variables of interest^[2−5]. MCAR is possible to test in practice empirically.

Deletion and single imputation
Traditionally, missing data are deleted or replaced with single imputations^[5]. Maximum likelihood and multiple imputation are advantageous^[6]. The deletion method includes two approaches: listwise deletion and pairwise deletion. In SAS, the default effect selection of the LOGISTIC procedure listwise deletes any observation with missing values in independent and response variables. The analysis is restricted to the remaining complete data set using standard procedures^[7]. Listwise deletion of missing values is under the assumption of MCAR and results in bias estimates since MCAR is rarely satisfied^[5]. In addition, deleting missing values severely reduces the analyzable sample size, particularly for data with a high proportion of missing values or missing in many variables. This results in the loss of power of the significance test^[8].

Single imputation is another traditional approach to processing missing data. The most widely used single imputation methods include mean imputation, regression imputation, and stochastic regression^[9]. Mean imputation replaces the missing values with arithmetic mean based on the available data of the corresponding variables. The added mean values are equivalent to a group of uncorrelated data, which results in an underestimation of the overall correlations. On the other hand, regression imputation will strengthen the variable correlation. It predicts a regression equation from available values and replaces the missing values with predicted regression outcomes^[6]. Mean imputation and regression imputation dramatically reduce the variability of data and result in bias. Stochastic regression imputation adds residual terms to regression imputation, which effectively restores the variability represented by the lost values^[3].

Multiple imputation
Multiple imputation (MI) is the procedure that replaces the missing value with more than one imputation method. Each missing value is replaced by m (m > 1) plausible imputed values given the observed values. The imputation creates m complete data sets which are analyzed separately with the same procedure, resulting in m sets of parameter estimates and standard errors. The multiple sets of results are subsequently merged into one set of results. These are the three steps of multiple imputation: imputation phase, analysis phase, and pooling phase^[10−13].

Imputation algorithms - Markov Chain Monte Carlo
Multiple imputation is regarded as a model-based imputation since the imputation model is constructed according to the distributional relationship between the missing values and the observed values^[14]. Data augmentation is the most often used algorithm for normally distributed data which belongs to the family of Markov Chain Monte Carlo (MCMC) procedure. MCMC draws pseudo-random samples with the Monte Carlo method using Markov chains^[10,11]. MCMC is a composite of two steps: The imputation step (I-step) and the Posterior step (P-step). Initial I-step is principally identical to stochastic imputation where model-based imputed values replace the missing values and the error terms are added to the estimated parameters to represent the uncertainty of observed data. In P-step, the I-step data perform as building blocks to generate new estimates of means and covariances under the Bayesian estimation principles. The new parameters randomly differ from those used in the previous I-step and are successively used to create imputed values in the preceding I-step. Repeating this two-step procedure for specific times yields multiple copies of complete data sets, each of which contains unique estimate parameters of the missing values^[15,16]. Multiple imputation shares the advantage of a single imputation and restores the variability of imputed complete data.

SAS MI procedure
To apply the MI procedure, the type and distribution of the data need to satisfy certain assumptions. MI assumes that the data are from a continuous multivariate distribution and contain missing values that may occur on any of the variables. It also assumes that the missing data are under the MAR mechanism^[17]. The next step is to analyze the existing missing data patterns, arbitrary or other specific patterns such as monotone. The PROC MI default MCMC method assumes that the multivariate normal data can be used. MCMC is an appropriate method for continuous arbitrary missing data^[18].

Methods

Macro-based Forward Model Selection method

SAS LOGISTIC procedure forward selection provides an automated model selection for binary response. It computes and ranks the p-value of the Chi-square statistic for each effect not in the model. If the minimum p-value is smaller than α, this corresponding effect is added to the model and never removed. The above procedures are repeated until none of the remaining effects is significant at level α^[17]. However, forward selection is not a good method to process missing data, since the observations with missing values are entirely excluded from analysis. The huge loss of information represented by incomplete observations impairs the accuracy of effect selection and power of significance test.

To keep the maximum information in incomplete observations, the Macro-based Forward Model Selection (MFMS) method was developed, in which the macro controls the automated effect selection using the forward selection algorithms. The biggest advantage of MFMS is to process one variable each time while keeping the incomplete observations in other variables undeleted. MFMS can be decomposed into steps: p-values of the Chi-square score for all one-variable models are calculated and ranked using the SAS LOGISTIC procedure. The smallest p-value (P-min) is considered as the candidate for the first cycle. If P-min is smaller than α, it indicates that the corresponding variable is significant at level α and eligible to enter the model. Only if the addition of the first effect happens, we consider all the two-variable combinations consisting of the first enrolled variable and one of the remained variables not in the model. Again, the smallest p-value is compared with α. If significant, the effect not in the previous model will enter as the second effect. Otherwise, the program ends up with the previous model. Macro controls the sequential entry of significant effects one by one until the termination point and returns the final effects list. The final model will contain the best significant effect added in each cycle.

Fan^[19] reported the application of MFMS algorithms to analyze missing data in linear regression and concluded that MFMS is superior to traditional REG and MI procedures.

Simulation and model assumptions

Logistic regression is used to investigate the binary response, ordinal response, and nominal response. For binary response models, the response variable Y can take on one of two possible values, denoted for convenience by 1 and 2. Suppose vector X is the explanatory variable, and the response probability is:

$ {\text π} =\mathrm{P}\mathrm{r}(Y=1|\mathit{x}) $

The logic response model can be a linear form as:

$ logit\left({\text π}\right)=\mathrm{ln}\left(\dfrac{{\text π}}{1-{\text π} }\right)={\mathrm{\beta }}_{0}+{\mathrm{\beta }}_{i}{\mathrm{X}}_{i} $

A logistic regression model with 200 observations was simulated, each of which consists of a binary response Y and 20 independent variables X₁, X₂, …, X₂₀, as Table 1. In most cases, actual data usually have a small sample size and a complex set of variables. Additionally, the smaller the sample size, the larger the influence that the missingness could cause. So, the complete data was simulated with 200 samples and 20 X variables.

Table 1. The complete data matrix.

Observation	Y	X₁	X₂	…	X₂₀
1	Y₁	X₁₁	X₁₂	…	X_1,20
2	Y₂	X₂₁	X₂₂	…	X_2,20
3	Y₃	X₃₁	X₃₂	…	X_3,20
…	…	…	…	…	…
200	Y₂₀₀	X_200,1	X_200,2	…	X_200,20

The model relies on several assumptions. Each X variable is independently drawn from a standard normal distribution as our complete explanatory variable matrix X. Y is the binary logistic function of X. Thus, our complete data set is a binary response with 20 multivariate normal-distributed explanatory variables. The assumption of our models, along with MCAR and MAR mechanisms, satisfy the requirements for the SA LOGISTIC and MI procedures.

The complete and missing data were simulated based on the following logistic function with descending coefficients:

$ \mathit{l}\mathit{o}\mathit{g}\mathit{i}\mathit{t}\left({\text π}\right)=\mathbf{ln}\left(\dfrac{{{\text π}}}{1-{{\text π}}}\right)={1-\mathit{x}}_{1}+0.8{\mathit{x}}_{2}+0.6{\mathit{x}}_{3} $

The descending order of coefficients was designed to examine how the coefficients can affect the model selection.

First, the 100 simulated complete data sets were processed by traditional logistic procedures with forward selection and each of the final effects list was recorded. By counting the effect selected times in the 100 procedures, the numbers of each X selected out of 100 times were acquired. Similarly, the 5%, 10%, 15%, and 20% randomly deleted datasets were processed by the proposed MFMS method, SAS traditional logistic procedure, and MI procedure respectively. Merging the results from the first and the second steps by SAS SQL procedure, an overall list of each X variable selected numbers were achieved for the complete data and at each missing percentage as shown in Tables 2−7.

Table 2. Model selection at 5% missing for 200 sample size, MCAR.

V	Complete data			SAS macro			SAS forward			MI
V	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE
x1	1.161	100	0.250	1.162	100	0.345	1.152	99	0.322	1.092	100	0.244
x2	−0.880	100	0.252	−0.938	99	0.334	−0.947	89	0.287	−0.837	100	0.266
x3	−0.679	96	0.186	−0.712	87	0.239	−0.750	61	0.233	−0.626	98	0.208
x4	−0.130	8	0.474	0.049	8	0.621	−0.013	15	0.452	−0.009	62	0.267
x5	−0.083	5	0.393	0.060	7	0.563	−0.015	15	0.445	0.007	58	0.240
x6	0.118	22	0.428	0.168	10	0.472	0.107	13	0.525	0.021	65	0.286
x7	0.229	10	0.337	−0.066	10	0.608	0.254	15	0.314	0.016	65	0.255
x8	0.446	9	0.057	0.012	15	0.543	−0.014	15	0.336	0.027	60	0.262
x9	0.187	8	0.406	0.110	16	0.478	−0.037	17	0.360	0.065	57	0.251
x10	0.095	14	0.403	−0.060	17	0.554	0.045	12	0.362	0.040	61	0.269
x11	−0.041	7	0.407	−0.059	9	0.567	0.079	7	0.348	0.007	57	0.232
x12	0.018	9	0.474	0.000	13	0.596	−0.091	8	0.376	0.009	58	0.243
x13	−0.085	13	0.443	0.101	20	0.482	−0.161	14	0.417	0.036	59	0.265
x14	0.071	6	0.578	0.212	11	0.544	0.010	17	0.334	−0.022	51	0.268
x15	−0.135	8	0.426	−0.022	12	0.553	−0.086	13	0.352	−0.017	63	0.253
x16	0.040	6	0.480	−0.090	11	0.455	−0.122	18	0.415	−0.078	59	0.240
x17	0.034	11	0.417	0.042	12	0.547	0.017	13	0.351	−0.034	57	0.248
x18	0.038	8	0.435	0.016	12	0.511	0.174	15	0.401	0.041	59	0.236
x19	−0.431	4	0.085	0.002	15	0.510	−0.228	10	0.311	0.005	55	0.239
x20	−0.029	7	0.482	−0.246	6	0.524	0.092	15	0.298	0.001	59	0.253

Table 3. Model selection at 10% missing for 200 sample size, MCAR.

V	Complete data			SAS macro			SAS forward			MI
V	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE
x1	1.161	100	0.250	1.197	100	0.576	1.045	54	0.312	0.985	100	0.221
x2	−0.880	100	0.252	−1.012	97	0.418	−0.867	35	0.346	−0.748	100	0.224
x3	−0.679	96	0.186	−0.867	79	0.365	−0.677	25	0.266	−0.571	98	0.193
x4	−0.130	8	0.474	0.132	16	0.884	0.020	13	0.343	−0.019	54	0.265
x5	−0.083	5	0.393	−0.233	13	0.746	0.011	14	0.246	−0.001	60	0.223
x6	0.118	22	0.428	−0.018	19	0.710	0.029	9	0.229	0.018	63	0.276
x7	0.229	10	0.337	−0.028	14	1.206	0.012	9	0.221	0.009	59	0.224
x8	0.446	9	0.057	0.335	9	0.746	0.077	14	0.210	0.049	66	0.228
x9	0.187	8	0.406	0.374	9	0.707	0.121	15	0.281	0.055	64	0.240
x10	0.095	14	0.403	0.14	17	0.725	0.091	15	0.224	0.025	62	0.256
x11	−0.041	7	0.407	−0.397	17	0.586	−0.083	9	0.281	−0.026	51	0.229
x12	0.018	9	0.474	−0.207	16	0.739	0.037	12	0.328	−0.010	52	0.238
x13	−0.085	13	0.443	0.018	17	0.727	0.028	15	0.297	0.015	55	0.261
x14	0.071	6	0.578	0.014	12	0.828	−0.012	15	0.241	0.004	60	0.238
x15	−0.135	8	0.426	−0.133	13	0.750	−0.060	14	0.308	0.007	60	0.240
x16	0.040	6	0.480	0.051	10	0.748	−0.079	14	0.224	−0.076	46	0.246
x17	0.034	11	0.417	−0.93	10	0.898	0.027	21	0.322	−0.028	64	0.229
x18	0.038	8	0.435	−0.102	10	1.034	0.124	16	0.298	0.018	58	0.254
x19	−0.431	4	0.085	−0.026	8	0.866	−0.024	15	0.258	−0.038	56	0.216
x20	−0.029	7	0.482	0.071	10	0.860	−0.021	14	0.184	0.009	59	0.234

Table 4. Model selection at 15% missing for 200 sample size, MCAR.

V	Complete data			SAS macro			SAS forward			MI
V	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE
x1	1.161	100	0.250	1.635	99	1.790	0.959	14	0.212	0.879	100	0.203
x2	−0.880	100	0.252	−1.608	97	3.742	−0.721	12	0.292	−0.685	100	0.201
x3	−0.679	96	0.186	−1.364	75	2.802	−0.656	4	0.223	−0.525	97	0.156
x4	−0.130	8	0.474	0.731	13	2.760	0.087	5	0.250	−0.015	59	0.235
x5	−0.083	5	0.393	0.847	9	1.421	0.023	4	0.129	0.002	52	0.239
x6	0.118	22	0.428	−0.065	15	1.023	−0.015	9	0.145	0.026	60	0.262
x7	0.229	10	0.337	1.715	17	6.451	−0.076	6	0.323	0.010	55	0.218
x8	0.446	9	0.057	−1.392	16	5.247	0.092	2	0.024	0.059	53	0.242
x9	0.187	8	0.406	0.252	12	1.769	−0.030	3	0.093	0.070	59	0.234
x10	0.095	14	0.403	−0.736	11	2.755	0.060	9	0.087	0.040	58	0.236
x11	−0.041	7	0.407	0.146	15	1.496	0.048	6	0.204	0.011	56	0.221
x12	0.018	9	0.474	0.127	10	0.969	0.094	4	0.214	0.027	59	0.218
x13	−0.085	13	0.443	0.493	15	1.020	0.009	7	0.144	0.033	56	0.259
x14	0.071	6	0.578	0.089	16	1.213	0.177	6	0.190	−0.021	58	0.223
x15	−0.135	8	0.426	−0.107	12	0.810	0.026	3	0.187	−0.017	60	0.228
x16	0.040	6	0.480	−0.318	18	0.769	−0.087	5	0.268	−0.064	54	0.203
x17	0.034	11	0.417	−0.126	15	1.107	−0.002	10	0.122	−0.047	65	0.219
x18	0.038	8	0.435	−0.342	13	1.420	0.021	5	0.285	0.000	58	0.231
x19	−0.431	4	0.085	1.089	14	4.293	−0.035	5	0.159	−0.032	56	0.223
x20	−0.029	7	0.482	2.323	19	9.146	−0.056	6	0.068	0.051	61	0.230

Table 5. Model selection at 20% missing for 200 sample size, MCAR.

V	Complete data			SAS macro			SAS forward			MI
V	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE
x1	1.161	100	0.250	3.586	99	8.270	0.916	10	0.147	0.795	100	0.191
x2	−0.880	100	0.252	−3.931	95	12.843	−0.882	6	0.204	−0.599	100	0.205
x3	−0.679	96	0.186	−3.412	65	9.724	.	.	.	−0.446	98	0.149
x4	−0.130	8	0.474	−3.784	8	12.557	0.084	3	0.101	0.040	53	0.207
x5	−0.083	5	0.393	4.551	14	12.84	0.013	6	0.145	0.041	58	0.226
x6	0.118	22	0.428	6.864	9	13.589	−0.065	6	0.115	0.025	56	0.236
x7	0.229	10	0.337	−1.193	12	5.979	−0.067	4	0.092	−0.020	57	0.229
x8	0.446	9	0.057	0.864	15	6.120	−0.075	2	0.370	0.053	59	0.224
x9	0.187	8	0.406	−0.884	13	20.753	.	.	.	0.063	54	0.225
x10	0.095	14	0.403	0.387	10	2.697	0.025	21	0.196	0.047	57	0.217
x11	−0.041	7	0.407	1.571	20	8.790	−0.014	1	.	−0.002	60	0.215
x12	0.018	9	0.474	−0.349	10	1.559	−0.069	8	0.111	0.000	60	0.188
x13	−0.085	13	0.443	0.512	16	2.378	0.013	6	0.240	0.054	61	0.221
x14	0.071	6	0.578	0.622	16	2.122	.	.	.	−0.011	55	0.236
x15	−0.135	8	0.426	2.194	18	15.942	0.040	4	0.234	−0.033	58	0.236
x16	0.040	6	0.480	−2.824	18	11.498	.	.	.	−0.028	60	0.195
x17	0.034	11	0.417	−2.208	13	7.217	−0.010	8	0.175	−0.031	53	0.205
x18	0.038	8	0.435	−0.337	9	1.457	0.115	3	0.113	−0.007	51	0.196
x19	−0.431	4	0.085	−1.937	11	11.302	.	.	.	−0.040	56	0.199
x20	−0.029	7	0.482	−1.990	13	5.940	−0.014	22	0.160	−0.026	55	0.212

Table 6. Model selection at 5% missing for 200 sample size, MAR.

V	Complete data			SAS macro			SAS forward			MI
V	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE	Mean	N	MSE
x1	1.122	100	0.226	1.090	100	0.307	1.171	92	0.349	1.051	100	0.234
x2	−0.884	98	0.223	−0.892	100	0.248	−0.967	82	0.322	−0.822	100	0.244
x3	−0.644	95	0.193	−0.733	91	0.226	−0.726	67	0.265	−0.583	100	0.210
x4	−0.029	13	0.448	0.147	7	0.482	0.186	11	0.468	0.024	65	0.271
x5	0.007	15	0.460	−0.057	13	0.468	−0.079	22	0.480	−0.001	60	0.297
x6	−0.132	12	0.386	0.156	11	0.487	−0.262	12	0.418	−0.028	60	0.268
x7	−0.209	11	0.383	0.110	12	0.551	0.015	13	0.481	−0.024	62	0.264
x8	0.114	15	0.415	0.034	14	0.561	0.095	17	0.488	0.026	59	0.295
x9	−0.001	12	0.415	0.072	8	0.495	−0.123	16	0.418	0.025	63	0.256
x10	−0.068	7	0.468	−0.091	11	0.475	−0.207	15	0.277	−0.049	54	0.259
x11	−0.186	6	0.403	−0.147	10	0.504	−0.054	8	0.422	−0.034	61	0.233
x12	−0.099	13	0.420	−0.056	16	0.555	−0.057	17	0.531	−0.035	64	0.254
x13	0.143	16	0.360	−0.198	7	0.464	0.002	16	0.369	0.012	65	0.249
x14	0.147	14	0.400	0.157	7	0.582	−0.081	15	0.534	0.048	70	0.259
x15	−0.080	12	0.465	0.187	14	0.483	−0.071	12	0.529	−0.006	58	0.274
x16	−0.177	11	0.435	−0.363	4	0.554	0.129	10	0.349	−0.050	51	0.254
x17	0.268	10	0.358	−0.211	17	0.466	0.014	15	0.416	0.053	51	0.256
x18	0.033	10	0.428	0.172	12	0.456	0.231	8	0.240	0.063	66	0.246
x19	0.242	9	0.386	0.160	10	0.514	0.049	11	0.472	0.018	62	0.265
x20	0.039	8	0.472	0.084	10	0.470	−0.132	10	0.333	0.031	56	0.246

Table 7. Model selection at 5% missing for 200 sample size, NMAR.

V	Complete data			SAS macro
V	Mean	N	MSE	Mean	N	MSE
x1	1.097	100	0.258	1.145	100	0.316
x2	−0.859	100	0.206	−0.949	100	0.271
x3	−0.656	95	0.205	−0.718	87	0.268
x4	0.009	9	0.486	0.098	19	0.535
x5	0.027	10	0.451	0.003	13	0.656
x6	0.106	9	0.463	0.185	12	0.432
x7	0.134	16	0.442	0.004	10	0.559
x8	0.001	16	0.409	0.012	5	0.481
x9	−0.171	8	0.463	−0.013	14	0.521
x10	0.057	11	0.419	0.200	15	0.410
x11	−0.147	11	0.448	0.000	15	0.562
x12	−0.154	10	0.419	0.049	10	0.583
x13	0.248	10	0.342	−0.042	18	0.558
x14	−0.032	12	0.409	0.197	16	0.559
x15	0.053	9	0.409	0.009	13	0.607
x16	−0.091	12	0.378	0.048	12	0.641
x17	0.193	12	0.402	0.140	13	0.552
x18	0.197	8	0.410	0.043	10	0.561
x19	0.416	7	0.371	−0.125	14	0.660
x20	−0.001	11	0.486	0.085	10	0.668

Results

The development of the MFMS method aims to circumvent the shortcomings of LOGISTIC forward selection when processing missing data of logistic regression. MI procedure MCMC method is the most popular logistic modeling procedure. The MFMS, PROC LOGISTIC, and PROC MI methods were compared by testing the selection frequency of each X variable. The significant level of entry for all methods was 0.1. Detailed methodologies are described in the Methods section. Several rules for model selection accuracy were established: (1) the true model variables could be selected 50 times or more; (2) the true model variables could be enrolled into models more frequently than other variables; (3) the results of model selection and parameter estimates should be close to those of true models or complete data. (4) less false positive effects. The methods that satisfy those criteria are believed to be reliable for processing missing data. Tables 2−5 show the selecting frequency of X variables at different missing levels among all methods. Tables 6 & 7 present the selecting frequency of X variables under different missing mechanisms.

Columns 2−4 in each table show the results of complete data model selection by PROC LOGISTIC at significance level 0.1 where X1, X2, and X3 are in the true model and X4−X20 are not in the model. Tables 2−5 show that with the current model setting, almost all three model variables, except X3 for 96 times, were selected to enter into the model 100 times. The frequencies of other non-model effects were far behind. In addition, the coefficient estimates of X1−X3 matched the true model with minor differences. The above indicated the accuracy of the LOGISTIC procedure in model selection for complete data.

To demonstrate the model selection efficiency and accuracy, the performance of the three modeling methods were analyzed at each missing level under MCAR.

At a 5% missing level (Table 2), the model selection pattern and the coefficient estimates obtained from the proposed MFMS method were highly consistent with the complete data. X1, X2, and X3 were found for 100, 99, and 87, respectively. This suggests the validity of the MFMS method at a 5% missing level. SAS PROC LOGISTIC did not perform as well as MFMS, which had a lower selection frequency and a larger difference compared to MFMS. However, it was still acceptable in the current setting. The MI also identified X1 and X2 100 times, while X3 was found 98 times. The coefficient estimates were also close to those of the complete data. However, after carefully examining the MI results, it was found that all the non-model variables were included in the model more than 50 times. This indicated that the MI procedure not only selected the true model variables but also incorrectly included the non-model variables at a high frequency. This increased the likelihood of achieving a biased conclusion based on the insignificant model. The issues inherent in the forward selection and the MI procedure worsened with increasing missingness.

To evaluate the performance of the three model selection methods under MAR data scenarios, datasets with a 5% missing level for MAR were established. The results, shown in Table 6, demonstrate consistent conclusions with those observed under MCAR. Since NMAR does not meet the assumptions required by the LOGISTIC and MI procedures, only the performance of the MFMS method under the NMAR mechanism were evaluated. The results, presented in Table 7, indicate that the MFMS method also performs well under NMAR conditions.

At a 10% missing level shown in Table 3, MFMS enrolled X1−X3 100, 97, and 79 times, respectively. The parameter estimates were close to the complete data with slight changes. This approved that MFMS was reliable at the 10% missing level. SAS LOGISTIC could not maintain its accuracy. It found X1 54 times, X2 35 times, and X3 25 times. The selection frequency dramatically decreased for true model variables. Only X1 entered the model more than 50 times, while X2 and X3 could not stand out from other non-model variables. It indicated that PROC LOGISTIC was no longer able to build up a correct logit model. This was because missingness at this level caused a huge loss of usable data. Although it was only 10% missing in each X variable. When considering 20 X variables, the listwise deletion algorithms resulted in a much higher percentage of overall missing. The complete subset for SAS LOGISTIC was much smaller than the 200 sample size. As seen with a 10% missing level, MI enrolled too many non-significant variables and was not a reliable method.

The results are shown in Table 4 for 15% missingness. MFMS sturdily selected the true model variables with a similar frequency of 10%. However, the estimates of coefficients moderately deviated from the true model. SAS LOGISTIC failed to select any significant variable. All three true model variables were found under 50 times. The coefficient estimates by MI started to deviate from the true model while the model selection was almost the same as previous missing levels.

As seen in Table 5, the 20% missing level where MFMS still found the true model was examined, but the parameter estimates largely deviated from the complete data. SAS LOGISTIC completely lost its accuracy of model selection. No significant variable was found. Parts of the X variables, including X3, X9, X14, X16, and X19, were not found even once. No parameter was estimated for those variables. Those demonstrated that SAS LOGISTIC was no longer able to analyze the data at a high missing level. The frequency of variable selection by MI seldom changed with increasing of missing level. The parameter estimates deviated further from the true model.

In summary, the MFMS satisfied the criteria to be a reliable logistic modeling method at 5% and 10% missing levels. MFMS exhibited a high consistency of model selection pattern compared to the complete data set. The three true model variables were successively selected at all missing levels. Additionally, the happening of false positive selection was efficiently forbidden, since MFMS only added the best significant effect with the smallest p-value each cycle. MFMS also extensively reduced the exclusion of incomplete observations. Only the missing values in the current analyzed variable were excluded and the other variable was untouched. It maintained the maximum information and uncertainty of the missing data. As proof, the parameter estimates at a 5% missing level were very close to those of complete sets. With the increase of missingness to 15%, the parameter estimates slightly deviated from complete data. However, compared to the SAS LOGISTIC and MI procedures, MFMS was still the best one. At a 20% missing level, MFMS was able to select the right model, but with incorrect parameter estimates due to the low events per variable.

SAS LOGISTIC excluded any observation that contained missing values. When missingness happens in multiple variables, the listwise deletion causes severe loss of information. The selecting frequency of true model variables dramatically decreased at 10% and higher missing levels. Finally, it completely failed to select the accurate model and estimate the parameters at the 15% and 20% missing levels.

The MI procedure could choose the true model variables at all missing levels 100 times. However, the selection frequency of non-significant variables was also around 50 times. MI automatically sampled the missing data with model-based imputation, which led to an overestimation of the logistic association and a narrower confidence interval. Those resulted in insignificant model selection and biased parameter estimates.

In summary, it was concluded that MFMS was the best reliable selection method for binary logistic models with independently normal-distributed multivariate for low-moderate level missing under MCAR, MAR, or NMAR assumption.

Real example

The World Health Organization estimates 12 million annual deaths from heart diseases globally, with half of all deaths in the United States and other developed countries attributed to cardiovascular diseases. Early detection enables high-risk individuals to adopt preventive lifestyle changes. A publicly available cardiovascular dataset from a study in Framingham, Massachusetts (USA), with over 4,000 records and 15 attributes, was analyzed to identify key risk factors for heart disease. The dataset is publicly available on the Kaggle website. Given its missing values, this dataset was used to compare SAS forward selection with the Macro-based selection method. Results are shown in Tables 8 & 9.

Table 8. SAS forward selection results.

Parameter	Estimate	Standard error	Pr > ChiSq
Intercept	9.130	0.476	< 0.0001
Male	−0.561	0.107	< 0.0001
Age	−0.066	0.006	< 0.0001
cigsPerDay	−0.019	0.004	< 0.0001
totChol	−0.002	0.001	0.043
sysBP	−0.018	0.002	< 0.0001
Glucose	−0.007	0.002	< 0.0001
Male: gender, male or female (Nominal); Age: Age of the patient (Continuous); cigsPerDay: the number of cigarettes that the person smoked on average in one day (continuous); totChol: total cholesterol level (Continuous); sysBP: systolic blood pressure (Continuous); Glucose: glucose level (Continuous).

Table 9. SAS macro-based selection results.

Parameter	Estimate	Standard error	Pr > ChiSq
Intercept	9.125	0.463	< 0.0001
Age	−0.066	0.006	< 0.0001
sysBP	−0.017	0.002	< 0.0001
cigsPerDay	−0.020	0.004	< 0.0001
Male	−0.549	0.104	< 0.0001
Glucose	−0.008	0.002	< 0.0001
PrevalentStroke	−1.064	0.442	0.016
totChol	−0.003	0.001	0.018
Age: Age of the patient (Continuous); sysBP: systolic blood pressure (Continuous); cigsPerDay: the number of cigarettes that the person smoked on average in one day (continuous); Male: gender, male or female (Nominal); Glucose: glucose level (Continuous); PrevalentStroke: whether or not the patient had previously had a stroke (Nominal); totChol: total cholesterol level (Continuous).

The results indicate that both methods identify key variables — age, systolic blood pressure, cigarettes smoked per day, gender, glucose level, and total cholesterol level — as significant predictors of cardiovascular disease risk. However, the macro-based selection method also includes an additional variable, the history of stroke, with a significant p-value (0.016), suggesting it may have identified an additional important predictor overlooked by the forward selection method.

[1]	Hosmer D, Lemeshow S. 2000. Applied logistic regression. 2^nd Edition. Hoboken, NJ: John Wiley & Sons. doi: 10.1002/0471722146
[2]	Rubin DB. 1976. Inference and missing data. Biometrika 63:581−92 doi: 10.1093/biomet/63.3.581 CrossRef Google Scholar
[3]	Little RJA, Rubin DB. 2002. Statistical analysis with missing data. 2^nd Edition. Hoboken, NJ: John Wiley & Sons. doi: 10.1002/9781119013563
[4]	Ibrahim JG, Zhu H, Tang N. 2008. Model selection criteria for missing data problems using the EM algorithm. Journal of the American Statistical Association 103(484):1648−58 doi: 10.1198/016214508000001057 CrossRef Google Scholar
[5]	Peugh JL, Enders CK. 2004. Missing data in educational research: A review of reporting practices and suggestions for improvement. Review of Educational Research 74:525−56 doi: 10.3102/00346543074004525 CrossRef Google Scholar
[6]	Baraldi AN, Enders CK. 2010. An introduction to modern missing data analyses. Journal of School Psychology 48:5−37 doi: 10.1016/j.jsp.2009.10.001 CrossRef Google Scholar
[7]	SAS Institute Inc. 1999. SAS/STAT User’s Guide. Version 8. Cary, NC: SAS Institute Inc. www.sfu.ca/sasdoc/sashtml/hrddoc/indfiles/57388.htm
[8]	Allison PD. 2001. Missing data. Series: Quantitative Applications in the Social Sciences. Thousand Oaks, CA: SAGE Publications.
[9]	Pigott TD. 2001. A review of methods for missing data. Educational Research and Evaluation 7(4):353−83 doi: 10.1076/edre.7.4.353.8937 CrossRef Google Scholar
[10]	Rubin DB. 1987. Multiple imputation for nonresponse in surveys. Hoboken, NJ: John Wiley & Sons. doi: 10.1002/9780470316696
[11]	Allison PD. 2002. Missing data. Thousand Oaks, CA: SAGE Publications. doi: 10.4135/9781412985079
[12]	Enders CK. 2010. Applied missing data analysis. New York: Guilford Press
[13]	Schafer JL, Olsen MK. 1998. Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioral Research 33:545−71 doi: 10.1207/s15327906mbr3304_5 CrossRef Google Scholar
[14]	Liu M, Taylor JMG, Belin TR. 2000. Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 56(4):1157−63 doi: 10.1111/j.0006-341X.2000.01157.x CrossRef Google Scholar
[15]	White IR, Royston P, Wood AM. 2011. Multiple imputation using chained equations: issues and guidance for practice. Statistics in Medicine 30(4):377−99 doi: 10.1002/sim.4067 CrossRef Google Scholar
[16]	Young R, Johnson DR. 2011. Imputing the missing Y's: implications for survey producers and survey users. Proceedings of the AAPOR conference abstracts pp. 6242−48
[17]	SAS Institute Inc. 2010. What's New in SAS/STAT 9.22. In SAS/STAT® 9.22 User's Guide. Cary, NC: SAS Institute Inc. https://support.sas.com/documentation/cdl/en/statugwhatsnew/63792/PDF/default/statugwhatsnew.pdf
[18]	Berglund PA. 2010. An introduction to multiple imputation of complex sample data using SAS® v9.2. Cary, NC: SAS Institute Inc. https://support.sas.com/resources/papers/proceedings10/265-2010.pdf
[19]	Fan Y. 2014. Comparison of model selection methods in multiple linear regression model with missing value. Master's Thesis. University of Buffalo, USA.
[20]	Bergtold JS, Yeager EA, Featherstone AM. 2011. Sample size and robustness of inferences from logistic regression in the presence of nonlinearity and multicollinearity. 2011 Annual Meeting, 24−26 July 2011, Pittsburgh, Pennsylvania. Pennsylvania: Agricultural and Applied Economics Association. pp. 227−41. doi: 10.22004/AG.ECON.103771
[21]	Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. 1996. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology 49(12):1373−79 doi: 10.1016/S0895-4356(96)00236-3 CrossRef Google Scholar
[22]	Allison PD. 2012. Handling missing data by maximum likelihood. SAS Global Forum 2012. Stat Horizons. Haverford, PA: Statistical Horizons. https://statisticalhorizons.com/wp-content/uploads/MissingDataByML.pdf

{{lists.name}}

SAS MACRO for forward model selection in a logistic regression model with missing values

Abstract

Supplementary information

Rights and permissions

References

About this article

Cite this article

Article Metrics

Access History

Other Articles By Authors