# Repeated Data Splitting Approach for Variable Selection


Logistic regression is a popular technique for modeling a categorical response variable in fields such as business, medicine, epidemiology and, most recently, genetics (Agresti 2002).

Researchers frequently face a large number of candidate variables and must decide which of them to include in the final model. Methodologists suggest including as many variables as possible in order to control confounding, but this approach inflates the error variance and consequently the standard errors of the coefficients. On the other hand, excluding important variables from the model leads to misestimated parameters (Murtaugh 1998).

Selecting only the important variables increases the efficiency of regression models and helps in understanding the underlying system (Liu and Motoda 1998).

To select important variables, Miller (1984) describes automated variable selection methods: forward selection, backward elimination and stepwise selection.

- Forward selection: variables are only added, according to a predefined criterion; a variable that has been added is never removed.
- Backward elimination: variables are only removed from the full model, according to a predefined criterion; a variable that has been removed is never added back.
- Stepwise selection: variables can be both added and removed.
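
As a rough illustration of the forward method, the sketch below greedily adds the candidate that most improves a user-supplied criterion and stops when no candidate improves it by more than a threshold. The names `forward_select`, `score` and `min_gain`, and the toy criterion, are hypothetical and not taken from the cited sources; in practice `score` might be, for example, minus the AIC of a fitted logistic model.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Greedy forward selection: variables are added and never removed.

    `score` maps a set of variable names to a number where larger is
    better (a hypothetical interface chosen for this sketch).
    """
    selected: List[str] = []
    current = score(frozenset())
    remaining = list(candidates)
    while remaining:
        # Gain obtained by adding each remaining variable to the model.
        gains = [(score(frozenset(selected + [v])) - current, v) for v in remaining]
        best_gain, best_var = max(gains)
        if best_gain <= min_gain:
            break  # the predefined criterion is not met, so stop adding
        selected.append(best_var)
        remaining.remove(best_var)
        current += best_gain
    return selected


# Toy criterion (an assumption for illustration): each variable has a fixed
# benefit and every included variable costs a penalty of 1.
benefit = {"x1": 5.0, "x2": 3.0, "x3": 0.5, "x4": 0.2}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(benefit[v] - 1.0 for v in subset)


print(forward_select(list(benefit), toy_score))
```

Backward elimination follows the same pattern in reverse: it starts from the full set and repeatedly drops the variable whose removal hurts the criterion least.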

As the number of candidate variables increases, the probability of correctly identifying the important variables decreases (Murtaugh 1998).

Automated variable selection methods do not produce stable models: if two researchers apply two different selection methods to the same dataset, they can obtain different results. Extra care should therefore be taken when using automated variable selection (Austin and Tu 2004). Here automated variable selection refers to the forward, backward and stepwise methods.

Automated variable selection depends on an initial criterion, usually a p-value, which may fail to retain potential confounders. To select important variables together with potential confounders, the purposeful selection (PS) algorithm was proposed (Bursac, Gauss et al. 2008).

Backward elimination in conjunction with the bootstrap has also been studied. It was shown that the variables selected in at least 60% of the bootstrap samples give a better predictive model and more stable parameter estimates (Austin and Tu 2004). However, the authors did not address confounder variables.

No study has yet assessed the stability of the purposeful selection method. In addition, its authors recommended the approach only for risk factor modeling, not for predictive modeling. In this study we will assess the stability of the purposeful selection method using repeated data splitting and determine a cut-off for the percentage of times a variable must be selected in order to be kept in the final model, so that the resulting model can be used for both risk factor modeling and predictive modeling.

In some cases logistic regression is used to estimate relative risk from the predicted probabilities of the fitted model (Santos, Fiaccone et al. 2008).

With a logistic regression model we can estimate the probability of disease for an individual who is exposed to a particular risk factor, and for the same individual the probability of the same disease if not exposed. Comparing the two estimated probabilities allows us to draw inferences about the causal effect of the risk factor (Ahern, Hubbard et al. 2009).
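
The calculation can be sketched as follows for a model with one binary exposure and one covariate. The coefficient values, the covariate and the function name `predicted_probability` are made up for illustration; they are not estimates from any of the cited studies.

```python
import math


def predicted_probability(intercept, coefs, x):
    """Inverse logit of the linear predictor intercept + sum(coef * x)."""
    logit = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical fitted coefficients for [exposure, age in decades].
intercept, coefs = -3.0, [0.8, 0.3]

age = 5.0  # the same individual is evaluated under both exposure scenarios
p_exposed = predicted_probability(intercept, coefs, [1, age])
p_unexposed = predicted_probability(intercept, coefs, [0, age])

risk_difference = p_exposed - p_unexposed
relative_risk = p_exposed / p_unexposed
print(round(p_exposed, 3), round(p_unexposed, 3), round(relative_risk, 3))
```

Reading the difference or ratio as a causal effect relies on the usual assumptions, in particular that the model includes the relevant confounders.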

To assess the stability of the purposeful selection method we will conduct a simulation study in conjunction with repeated data splitting.

## Methodology

The purposeful selection method starts with a univariate analysis of each variable; variables that are significant at the 0.2 or 0.25 level are kept as candidates for the multivariate logistic regression model. The multivariate model is then fitted and the significance of each variable is examined. Any variable that is not significant at the 0.1 level is removed from the multivariate model. The reduced model is then fitted and the change in the estimated coefficients of the remaining variables is calculated. If the change is more than 15% or 20%, the removed variable is put back into the model as a confounder; otherwise it stays out. This produces the initial main-effects model. The variables that were dropped at the univariate stage are then added to this model one at a time and checked for significance and for possible confounding. A variable that is significant at the 0.1 or 0.15 level, or that acts as a confounder, is retained for the final model. At the end of this step the list of final variables is complete.
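
The confounding check in this procedure compares the coefficients of the remaining variables with and without the removed variable. A minimal sketch, assuming the change is measured relative to the coefficient in the larger model; the function names and the example coefficients are hypothetical.

```python
def relative_change(coef_full: float, coef_reduced: float) -> float:
    """Absolute change in a coefficient after dropping a variable,
    relative to its value in the full model (assumed to be nonzero)."""
    return abs(coef_reduced - coef_full) / abs(coef_full)


def is_confounder(full: dict, reduced: dict, threshold: float = 0.20) -> bool:
    """True if removing the variable moved any remaining coefficient by
    more than `threshold` (0.15 or 0.20 in the description above)."""
    return any(
        relative_change(full[name], reduced[name]) > threshold
        for name in reduced
    )


# Hypothetical estimates before and after dropping x2 from the model.
full_model = {"x1": 0.60, "x3": 0.40, "x2": 0.05}
reduced_model = {"x1": 0.45, "x3": 0.41}

# The x1 coefficient moves by about 25%, so x2 is flagged at the 20% level
# but not at a 30% level.
print(is_confounder(full_model, reduced_model))
print(is_confounder(full_model, reduced_model, threshold=0.30))
```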

Because this method is applied only once to the whole dataset, the variable selection may be biased and may not be reproducible owing to random sampling fluctuation. To overcome selection bias and non-reproducibility we will apply the following procedure:

1. Generate data.
2. Run the purposeful selection method and store the names of the selected variables.
3. Re-run the purposeful selection method 1000 times, regenerating the dataset for each run, and record the following:
   - the number of unique models selected by the method;
   - which parameter estimates changed sign, and how many times;
   - a table showing the percentage of runs in which each variable was selected.
4. Store the variable names that meet each of the following conditions:
   - selected in at least 90% of the runs;
   - selected in at least 70% of the runs;
   - selected in at least 50% of the runs.
5. Generate another dataset with the same parameterization and split it into a training set with 75% of the observations and a test set with the remaining 25%.
6. Construct a set of candidate models:
   - Model-1: all of the variables in the dataset
   - Model-2: all of the variables selected by the purposeful selection method
   - Model-3: the variables that were selected at least 90% of the time
   - Model-4: the variables that were selected at least 70% of the time
   - Model-5: the variables that were selected at least 50% of the time
   - Model-6: the variables chosen by forward selection
   - Model-7: the variables chosen by backward elimination
   - Model-8: the variables chosen by stepwise selection
7. Calculate the training error and the test error of each of the eight models and compare them.
8. Calculate the AIC of each of the eight candidate models and compare them.
9. Calculate the sensitivity and specificity of each model and compare them.
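
The bookkeeping for the repeated runs (number of unique models, selection percentages, and the 90%, 70% and 50% cut-offs) can be sketched as below. The function name `summarise_selections` and the ten made-up runs are assumptions for illustration; the study itself uses 1000 runs, and the selection step that produces each run is not shown.

```python
from collections import Counter


def summarise_selections(runs, thresholds=(90, 70, 50)):
    """Summarise the variables chosen across repeated selection runs.

    `runs` has one entry per simulated dataset; each entry is the
    collection of variable names selected in that run.
    `thresholds` are cut-offs in percent.
    """
    n_runs = len(runs)
    unique_models = {frozenset(r) for r in runs}
    counts = Counter(v for r in runs for v in set(r))
    percent = {v: 100.0 * c / n_runs for v, c in counts.items()}
    # Integer comparison avoids floating-point trouble exactly at a cut-off.
    kept = {
        t: sorted(v for v, c in counts.items() if 100 * c >= t * n_runs)
        for t in thresholds
    }
    return len(unique_models), percent, kept


# Made-up selections from 10 runs.
runs = (
    [["x1", "x2", "x3"]] * 5
    + [["x1", "x3"]] * 2
    + [["x1", "x2"]] * 2
    + [["x1", "x4"]]
)
n_unique, percent, kept = summarise_selections(runs)
print(n_unique)
print(percent)
print(kept)
```

The sets in `kept` correspond to the variable lists used for Model-3, Model-4 and Model-5.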

## Simulation

We will conduct two simulation studies in the presence of a confounding variable. In the first setting we will generate data with three significant and three non-significant variables to assess the stability of the PS method and to find a cut-off for the percentage of times a variable must be selected in order to be kept in the final model, so that the model can be used for both risk factor modeling and predictive modeling. Here significance is defined in terms of the true parameter values: a true value of zero corresponds to a non-significant variable. The simulation steps for this setting are as follows:

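1. Choose the values of the parameters of the population model. For our setting the three parameters of the significant variables are given nonzero values and the remaining parameters are set to zero.
2. Generate $x_1 \sim \text{Binomial}(1, 0.5)$ and the confounder variable $x_2 \sim U(-6, 3)$ if $x_1 = 1$ and $x_2 \sim U(-3, 6)$ if $x_1 = 0$.
3. Generate $x_3, \dots, x_6 \sim U(-6, 6)$.
4. Obtain the true logit as $\text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_6 x_6$.
5. Draw the outcome variable from a binomial distribution whose probability is computed from the true logit through the relationship $p = \dfrac{\exp(\text{logit}(p))}{1 + \exp(\text{logit}(p))}$.

The final dataset will contain six candidate variables, three of which are significant.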

Using the same settings we will generate another dataset with 10 candidate variables, three of which are significant.
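
The data-generating steps of the first setting can be sketched with the standard library as follows. The parameter values passed to `simulate_setting_one` are placeholders chosen only so that three slopes are nonzero and three are zero; they are not the values used in the study, and the function name and the zero intercept are likewise assumptions.

```python
import math
import random


def simulate_setting_one(n, betas, intercept=0.0, seed=None):
    """Generate n observations (x1..x6, y) for the first simulation setting.

    `betas` holds the six slope parameters of the population model.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = 1 if rng.random() < 0.5 else 0
        # Confounder: the distribution of x2 depends on x1.
        x2 = rng.uniform(-6, 3) if x1 == 1 else rng.uniform(-3, 6)
        rest = [rng.uniform(-6, 6) for _ in range(4)]  # x3..x6
        x = [x1, x2] + rest
        logit = intercept + sum(b * xi for b, xi in zip(betas, x))
        p = 1.0 / (1.0 + math.exp(-logit))  # same as exp(logit) / (1 + exp(logit))
        y = 1 if rng.random() < p else 0
        rows.append((x, y))
    return rows


# Placeholder parameters: three nonzero slopes, three zero slopes.
data = simulate_setting_one(n=240, betas=[0.5, 0.5, 0.5, 0.0, 0.0, 0.0], seed=1)
print(len(data), sum(y for _, y in data))
```

Extending `betas` with more zeros, and drawing the corresponding extra uniform variables, would give the 10-candidate version of the dataset.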

In the second setting we will generate data from two known populations to check the predictive ability of the model selected in the first simulation setting. The dataset is generated in the following steps:

1. Generate data on 3 variables from a multivariate normal distribution with a given mean and covariance. Create a categorical variable to serve as the dependent (outcome) variable and set it to "Yes" (1) for all observations.
2. Generate data on the same number of variables from a multivariate normal distribution with the same covariance but a different mean, and set the dependent variable to "No" (0) for all observations generated in this step.
3. Combine the step-1 and step-2 datasets.
4. Generate 3 additional variables from different distributions (continuous and discrete). The number of observations should equal the sum of the step-1 and step-2 observations.
5. Add these three additional variables to the combined dataset.

The final dataset will contain six candidate variables, three of which are significant. Here a significant variable is one that actually contributes to generating the outcome variable; in our setting the first three variables are significant.

Using the same settings we will generate another dataset with 10 candidate variables, three of which are significant.
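
A possible sketch of the second setting, using only the standard library. The means, the covariance matrix, the three distributions chosen for the additional variables and all function names are assumptions made for illustration; the text does not specify them. Multivariate normal draws are obtained from a Cholesky factor of the covariance matrix.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_sample(rng, mean, low):
    """One draw from N(mean, L L^T), computed as mean + L z."""
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + sum(low[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(mean)]


def simulate_setting_two(n_yes, n_no, mean_yes, mean_no, cov, seed=None):
    """Two normal populations labelled 1 ("Yes") and 0 ("No"), plus three
    extra variables that play no role in generating the label."""
    rng = random.Random(seed)
    low = cholesky(cov)
    rows = []
    for label, n, mean in ((1, n_yes, mean_yes), (0, n_no, mean_no)):
        for _ in range(n):
            x = mvn_sample(rng, mean, low)
            # Assumed distributions: two continuous, one discrete.
            extra = [rng.uniform(0, 1), rng.expovariate(1.0), rng.randint(0, 3)]
            rows.append((x + extra, label))
    return rows


# Hypothetical population parameters.
cov = [[1.0, 0.3, 0.3], [0.3, 1.0, 0.3], [0.3, 0.3, 1.0]]
rows = simulate_setting_two(60, 60, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], cov, seed=7)
print(len(rows), sum(label for _, label in rows))
```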

In both settings we will use sample sizes of 60, 120, 240, 480 and 600 and will repeat the simulation 1000 times for each sample size. The two settings differ as follows:

- In the first setting we know the true probability, the true logit and the true parameter values, but not the exact value of the outcome variable.
- In the second setting we know the true value of the outcome variable, but not the true parameter values, the true probability or the true logit.

From the first setting we will select the best of the eight candidate models, which will be used for risk factor modeling. In the second simulation setting we will assess the predictive accuracy of the selected model.

## Results

Number of unique models selected by the PS method for each sample size: xxx

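## Table-1: Number of times out of 1000 a variable changes its sign

| Variable Name | Positive, n=60 | Positive, n=120 | Positive, n=240 | Positive, n=480 | Positive, n=600 | Negative, n=60 | Negative, n=120 | Negative, n=240 | Negative, n=480 | Negative, n=600 |
|---|---|---|---|---|---|---|---|---|---|---|
| Var1 | 615 | 816 | 974 | 1000 | 1000 | 1 | 0 | 0 | 0 | 0 |
| Var2 | 184 | 358 | 606 | 877 | 927 | 50 | 20 | 3 | 0 | 0 |
| Var3 | 517 | 721 | 937 | 997 | 999 | 2 | 0 | 0 | 0 | 0 |
| Var4 | 68 | 47 | 65 | 45 | 50 | 57 | 67 | 59 | 47 | 44 |
| Var5 | 70 | 58 | 52 | 43 | 64 | 68 | 63 | 41 | 44 | 39 |
| Var6 | 64 | 55 | 55 | 50 | 48 | 92 | 48 | 58 | 51 | 55 |

"Positive" and "Negative" give the frequency of a positive or negative sign of the estimated coefficient at each sample size n.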

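## Table-2: Percentage of selecting a variable for final model

| Variable Name | Percent of selection, n=60 | Percent of selection, n=120 | Percent of selection, n=240 | Percent of selection, n=480 | Percent of selection, n=600 |
|---|---|---|---|---|---|
| Var1 | | | | | |
| Var2 | | | | | |
| Var3 | | | | | |
| Var4 | | | | | |
| Var5 | | | | | |
| Var6 | | | | | |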

## Table-3: AIC, training error, test error, sensitivity and specificity for different models

| Model | AIC | Training Error | Test Error | Sensitivity | Specificity |
|---|---|---|---|---|---|
| Model-1 | xxx.x | x.xxx | x.xxx | x.xxx | x.xxx |
| Model-2 | … | … | … | … | … |
| Model-3 | | | | | |
| Model-4 | | | | | |
| Model-5 | | | | | |
| Model-6 | | | | | |
| Model-7 | | | | | |
| Model-8 | … | … | … | … | … |

The above results will be reported for each sample size used in this study.
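
The quantities in Table-3 can be computed from a model's predicted probabilities roughly as sketched below. This assumes that "error" means the misclassification rate at a 0.5 cut-off, which the text does not state; the function name and the example data are hypothetical. The AIC is $2k - 2\log L$, where $k$ is the number of estimated parameters and $\log L$ is the binomial log-likelihood, and it is meaningful only on the data used to fit the model.

```python
import math


def classification_metrics(y_true, p_hat, n_params, cutoff=0.5):
    """Error rate, sensitivity, specificity and AIC from predicted probabilities."""
    tp = fp = tn = fn = 0
    log_lik = 0.0
    for y, p in zip(y_true, p_hat):
        pred = 1 if p >= cutoff else 0
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
        # Bernoulli log-likelihood contribution of this observation.
        log_lik += math.log(p) if y == 1 else math.log(1.0 - p)
    n = tp + fp + tn + fn
    return {
        "error": (fp + fn) / n,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "aic": 2 * n_params - 2 * log_lik,
    }


# Made-up outcomes and predicted probabilities for eight observations.
y_obs = [1, 1, 1, 0, 0, 0, 1, 0]
p_pred = [0.9, 0.8, 0.4, 0.2, 0.6, 0.1, 0.7, 0.3]
metrics = classification_metrics(y_obs, p_pred, n_params=3)
print(metrics)
```

Calling the function once on the training set and once on the test set gives the training error and test error for a model.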

## Discussion

## References

Agresti, A. (2002). Categorical Data Analysis, Wiley.

Ahern, J., A. Hubbard, et al. (2009). Estimating the effects of potential public health interventions on population disease burden: a step-by-step illustration of causal inference methods, Oxford Univ Press.

Austin, P. C. and J. V. Tu (2004). "Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality." Journal of clinical epidemiology 57(11): 1138-1146.

Austin, P. C. and J. V. Tu (2004). "Bootstrap Methods for Developing Predictive Models." The American Statistician 58(2): 131-138.

Bursac, Z., C. H. Gauss, et al. (2008). "Purposeful selection of variables in logistic regression." Source Code for Biology and Medicine 3: 17.

Liu, H. and H. Motoda (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic, Boston.

Miller, A. J. (1984). "Selection of subsets of regression variables." Journal of the Royal Statistical Society. Series A (General): 389-425.

Murtaugh, P. A. (1998). "Methods of variable selection in regression modeling." Communications in statistics. Simulation and computation 27(3): 711-734.

Santos, C., R. L. Fiaccone, et al. (2008). Estimating adjusted prevalence ratio in clustered cross-sectional epidemiological data, BioMed Central Ltd. 8: 80.