A first stage in the development of a good predictive model or a good classification rule is the identification of potentially useful predictor variables based on domain knowledge. The general type of model to be developed also needs to be defined. Depending on the circumstances the type of model to be considered could, for instance, be a linear regression model, or a logistic regression model, or a regression tree or a neural network.
In exploratory model building the selection of appropriate variables for inclusion in a final model is often done algorithmically. For instance, algorithms such as backward elimination, forward selection or best subsets are routinely employed to develop regression models (see ). The motivation behind the development of the cited algorithms is to have a procedure that will identify a good subset of predictor variables. In this sense the ideas of variable selection and subset selection become synonymous. The use of these algorithms in regression problems is widespread even though their use is known to be problematic. The extent of the use of the algorithmic approach in model building is aptly summarised by George , who writes "The problem of variable selection is one of the most pervasive model selection problems in statistical applications. The use of variable selection procedures will only increase as the information revolution brings us larger data sets with more and more variables. The demand for variable selection will be strong and it will continue to be a basic strategy for data analysis".
The problems that arise from backward elimination, forward selection, best subset regression and other automated model building techniques are well documented in the context of multiple linear regression. In the main, investigations have been through simulation work in which the underpinning theoretical model assumptions are satisfied, so that any deviation between simulation results and anticipated theoretical results is attributable to the variable selection technique. For instance, the simulation work of Derksen and Keselman gives the broad conclusion that automated selection techniques overly capitalize on false associations between potential predictors and the criterion variable, with too many purely random (noise) variables being wrongly classified as authentic (true) predictors. The inclusion of noise variables in a final model necessarily implies a model misspecification, from which incorrect inferences are drawn.
Derksen and Keselman  concluded that the inclusion of noise variables in a model can result in the failure to classify genuine (authentic) variables as being genuine predictors of the criterion. Thus, well established automated techniques can paradoxically inflate the probability of type I errors and in some cases result in a loss of power. Moreover, the conclusions drawn by Derksen and Keselman  indicate that "the degree of correlation between predictor variables affected the frequency with which authentic variables found their way into the model". Accordingly the rate at which type I errors occur is quite problem dependent and there is no simple mechanism for controlling this error rate.
The over-capitalization on false associations, known as overfitting, gives rise to overly optimistic within sample estimates of goodness-of-fit and to an overly optimistic predictive ability which is not replicated on new data from the same population. Best subset regression solutions are based on the overall within sample maximisation of the goodness-of-fit statistic, and these "best subset" solutions necessarily show the greatest upward bias in the estimation of the population coefficient. This problem is compounded when the number of potential predictor variables J increases relative to the number of cases I.
We consider an alternative technique to correctly quantify the type I error rate in assessing overall model significance for best subset regression solutions. In Section II we outline the traditional approach for assessing overall significance of a best subsets regression. In Section III we describe an alternative procedure based on randomization. In Section IV we describe a series of models that will be used to compare the performance of the proposed algorithm against the traditional approach. Section V summarizes the results of the simulations and shows that the traditional method for assessing significance is flawed, whereas the proposed algorithm correctly controls the type I error rate in a null model and retains power in a non-null situation. Section VI demonstrates that the extent of the problem depends on the number of predictor variables and that the correction under the proposed method is non-trivial.
II. Best Subsets Regression
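Consider the classic linear regression model
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_J X_J + \varepsilon$$
where $Y$ is the dependent variable, with predictors $X_1, \ldots, X_J$, and where $\varepsilon$ denotes a normally distributed random variable with mean zero and variance $\sigma^2$. Let $(y_i, x_{i1}, \ldots, x_{iJ})$, $i = 1, \ldots, I$, denote independent cases generated from the above model.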
In best subsets regression, the best subset of size $j$ is that subset of $j$ predictor variables that maximizes the within sample prediction of the dependent variable $Y$ in a linear least squares regression. The percentage of variation in $Y$ that is accounted for by a regression equation is the usual $R^2$ statistic, known as the coefficient of determination. In the following $R_j^2$ will be used to denote the $R^2$ statistic for the best subset of size $j$. Traditionally the overall significance of the best subset of size $j$ is judged using the standard $F$ statistic, $F = MSR/MSE$, where $MSR$ is the mean square due to regression, $MSE$ is the mean square error and reference is made to the $F$ distribution with $(j, I - j - 1)$ degrees of freedom. See  for a more detailed explanation.
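As a reminder of a standard least squares identity (stated here under the assumption that every fitted model includes an intercept), the overall $F$ statistic for a subset of size $j$ fitted to $I$ cases can be written directly in terms of the coefficient of determination $R_j^2$ of that subset:

$$F = \frac{MSR}{MSE} = \frac{R_j^2 / j}{(1 - R_j^2) / (I - j - 1)}$$

Since $F$ is an increasing function of $R_j^2$ for fixed $j$ and $I$, choosing the subset with the largest coefficient of determination is the same as choosing the subset with the largest $F$ statistic and hence the smallest traditional p-value.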
If the potential predictor variables $X_1, \ldots, X_J$ are noise variables, i.e. unrelated to $Y$ in as much as $\beta_1 = \cdots = \beta_J = 0$, then the p-values for judging overall model significance, for any subset of size $j$, should be uniformly distributed on (0, 1). That is to say, if a researcher works at the $\alpha$ significance level, and if none of the potential predictor variables are related to $Y$, then a type I error in assessing the significance of the overall best subset model should only be made $100\alpha\%$ of the time for any value of $\alpha$. We propose an alternative procedure for assessing the overall significance of any best subset of size $j$. This alternative procedure, the fake variable method, does not make reference to the $F$ distribution.
III. Fake Variable Method
Reconsider the sample data $(y_i, x_{i1}, \ldots, x_{iJ})$, $i = 1, \ldots, I$, and let $R_j^2$ denote the coefficient of determination for the best subset of size $j$. Now consider a data set in which the order of cases for the predictor variables is randomly permuted but with the response held fixed, i.e. $(y_i, x_{\pi(i)1}, \ldots, x_{\pi(i)J})$, $i = 1, \ldots, I$, where $\pi$ denotes a random permutation of the case indices. Note that this random permutation of predictor records ensures that the sample correlation structure between the predictors in the real data set is precisely preserved in the newly created randomized, or fake, data set. The random permutation also ensures that the predictor variables in the fake data set are stochastically independent of the response, $Y$, but may be correlated with $Y$ in any sample through chance.
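The claim that permuting whole predictor records leaves the correlation structure among the predictors intact can be checked numerically. The following is a minimal sketch and not part of the original procedure; the helper names `pearson` and `permute_predictor_rows`, and the toy data, are illustrative assumptions.

```python
import random


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


def permute_predictor_rows(X, rng):
    """Return the predictor records (rows of X) in a random order.

    Each row is moved as a unit, so values of different predictors that
    belonged to the same case stay together; the response is not touched.
    """
    rows = list(X)
    rng.shuffle(rows)
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data (assumed for illustration): two correlated predictors, 30 cases.
    x1 = [rng.gauss(0, 1) for _ in range(30)]
    x2 = [0.8 * a + 0.6 * rng.gauss(0, 1) for a in x1]
    X = [[a, b] for a, b in zip(x1, x2)]
    fake = permute_predictor_rows(X, rng)
    print("real:", pearson([r[0] for r in X], [r[1] for r in X]))
    print("fake:", pearson([r[0] for r in fake], [r[1] for r in fake]))
```

The two printed correlations agree up to floating point rounding, because the fake data set contains exactly the same predictor records in a different order.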
Best subsets regression can be performed on the newly created fake data set. Let $\tilde{R}_j^2$ denote the coefficient of determination for the best subset of size $j$ for the fake data set. If $\tilde{R}_j^2 > R_j^2$ for subset $j$, then the fake "chance" solution may be viewed as having better within sample predictability than the observed data.
Naturally, for any given data set many instances of a fake data set may be generated simply by taking another random permutation. In what follows the proportion of instances in which the fake-data coefficient of determination exceeds the observed one, $\tilde{R}_j^2 > R_j^2$, is estimated through simulation. This estimate is taken to be an estimate of the p-value for determining the statistical significance of $R_j^2$ for any subset of size $j$.
The above procedure may be summarized as follows: For given data $(y_i, x_{i1}, \ldots, x_{iJ})$, $i = 1, \ldots, I$, and for a subset of size $j$
1. Determine the best subset of predictors of size $j$ and record the coefficient of determination $R_j^2$
2. Set KOUNT = 0
3. DO n = 1 TO N
a. Randomly permute the predictor records $(x_{i1}, \ldots, x_{iJ})$ independently of $y_i$
b. For the newly created fake data set determine the best subset of size $j$ and record the coefficient of determination $\tilde{R}_j^2$
c. If $\tilde{R}_j^2 > R_j^2$ Then KOUNT = KOUNT + 1
END DO
4. P-Value = KOUNT/N
IV. Simulation Design
For a specific application consider the model
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$$
To illustrate the properties of the proposed technique, four specific parameter settings (referred to in the following as Model A, Model B, Model C, and Model D) with two different correlation structures have been considered.
Model A is a genuine null model with and with i.e. all proposed predictors are in fact noise variables and are unrelated to the outcome. For Model B we consider , , (i.e. one authentic variable and three noise variables). For Model C we consider , , , and . For Model D we consider , , , , and . In the following simulations each model is considered with potential predictor variables being (1) stochastically independent in which their correlation matrix is the identity matrix, and (2) strongly correlated with elements of the correlation matrix being where denotes Pearson's correlation coefficient between and . In all instances the error terms are independent identically distributed realizations from the standard normal distribution so that the underpinning assumptions for the linear regression models are satisfied. In what follows simulations are reported based on cases per simulation instance.
V. Simulation Results
Fig. 1 is a percentile-percentile plot of the p-values obtained from implementing the aforementioned algorithm for step j = 1 in best subsets regression for Model A with potential predictor variables being stochastically independent. The vertical axis denotes the theoretical percentiles of the uniform distribution on (0, 1) and the horizontal axis represents the empirically derived percentiles based on 500 simulations. Note that the p-values based on the traditional method are systematically smaller than required, indicating that the true type I error rate for overall model significance is greater than any pre-chosen nominal significance level $\alpha$. In contrast, the estimated p-values based on the fake variable data set have an empirical distribution that is entirely consistent with the uniform distribution on (0, 1).
Under Model A, qualitatively similar results are obtained for j = 1, 2, 3, both for potential predictors being independent, case 1, or correlated, case 2. For j = 4 there is no subset selection under the simulations and in these cases both the traditional method and the fake variable method have p-values uniformly distributed on (0, 1).
Simulations under Models B, C, and D with independent predictors, case 1, or with correlated predictors, case 2, show that the proposed method retains power at any level of $\alpha$; the power is marginally lower than the power under the traditional method, but this is expected due to the liberal nature of the traditional method as evidenced in Fig. 2.
Fig. 1. Percentile - Percentile plot for p-values for best subset of size 1 from 4 independent predictors, Model A.
Fig. 2. Percentile - Percentile plot for p-values for best subset of size 1 from 4 independent predictors, Model B.
VI. Effect Of The Number Of Predictors
Simulations under a true null model (i.e. with all potential predictors being noise variables), for J = 4, 8, 16, 32, 64, keeping the number of cases fixed, I = 30, have been performed. In all of these cases the simulations show that the p-value for subset significance using the fake variable method is uniformly distributed on (0, 1). In each and every simulation instance the estimated p-value in the fake variable method is not less than the p-value under the traditional method. The distribution of the differences in p-values for j = 1 and J = 4, 8, 16, 32, 64 is summarized in Fig. 3. Note that the discrepancy tends to increase with increasing values of J and that this discrepancy is a substantive non-trivial difference.
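A rough back-of-envelope calculation gives some intuition for why the discrepancy grows with J when j = 1. It rests on a simplifying assumption that is not made in the simulations above: that the J single-predictor tests behave as if they were independent. Under that assumption, picking the best of J noise predictors and then testing it at level $\alpha$ as though it had been chosen in advance gives a type I error rate of $1 - (1 - \alpha)^J$. The function name `naive_type1_rate` is hypothetical.

```python
def naive_type1_rate(alpha, n_predictors):
    """Chance that at least one of n_predictors independent level-alpha tests
    rejects when every predictor is noise (independence is an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_predictors


if __name__ == "__main__":
    for J in (4, 8, 16, 32, 64):
        print(J, round(naive_type1_rate(0.05, J), 3))
```

At a nominal level of 0.05 this approximation rises from about 0.19 for J = 4 to about 0.96 for J = 64, which is in line with the qualitative pattern of an increasing discrepancy, although the actual magnitudes reported here come from simulation and not from this formula.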
Fig. 3. Discrepancy between fake and traditional p-values for best subset of size 1.
A computer-based heuristic that allows the type I error for a best subsets regression to be controlled at any pre-determined nominal significance level has been described. The given procedure corrects the bias in the overall p-value for best subsets regression. The correction is non-trivial and applies even in those particularly problematic situations when the number of predictors exceeds the number of cases.
Significance tests in classical least squares regression are based on the assumption that the underpinning error terms are independent identically distributed normal random variables. Even when these assumptions are satisfied the p-values when estimated under best subsets regression are still biased, leading to wrong inferences. In practice the underpinning normality assumptions are likely to be violated to some extent leading to further bias in the p-values in best subsets regression. In contrast the fake variable approach is based on the sample data and the estimation of the p-value does not explicitly rely upon distributional assumptions. In principle the same procedure could be adapted for use for other best subsets regression techniques (e.g. logistic regression models).
Stoppiglia et al. and Austin and Tu have considered the use of a single fake variable (also known as a probe variable) to help determine the reliability of any final model. Stoppiglia et al. consider the problem of building a model many times over to determine the ranking of an independent random fake variable in relation to the other variables in the model. For instance, in best subsets a single fake variable would be added to the data set and a record would be made of the proportion of times the random fake variable is included in the best subset solution. The rationale is to retain those variables that consistently rank higher than the fake variable that "probes" the solution. Austin and Tu do something similar: they include a single randomly generated fake variable in each of their bootstrap samples and then determine the proportion of times the fake variable is included in any final bootstrap model, for comparison with the proportion of inclusion of the other variables. Note, however, that neither Stoppiglia et al. nor Austin and Tu give explicit decision rules for the use of a single fake variable. Moreover, the work contained in this paper casts doubt on whether it is correct to use a single fake variable, as multiple fake variables are needed for a valid best subsets p-value.
More generally the method given in this paper is strongly suggestive of ways in which computer scientists can generate other fake variable algorithms to be used with other heuristics (e.g. backward elimination) and for use with other generic models (e.g. regression trees) and in doing so validly control error rates.