Regression analysis is a powerful forecasting tool used in many areas such as engineering, sociology and psychology. It is a statistical and mathematical method for determining the relationship between one variable and one or more other variables. Mendenhall and Sincich (1992) state that models that relate a dependent variable "y" to a series of independent variables are known as regression models.
There are various regression models used in scientific studies. Important regression models and their areas of use are tabulated below.
Important Regression Models
Simple Linear Regression Model
It is the simplest regression model. Only one dependent and one independent variable are used in a linear equation. The resulting function has the form $y = a + bx$.
Multi-Linear Regression Model
As in the simple linear regression model, a linear equation is used in this model. The main difference is that more than one independent variable is included. The resulting function is $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.
Non-Linear Regression Model
Nonlinear regression is characterized by the fact that the prediction equation depends nonlinearly on one or more unknown parameters. It usually arises when there are physical reasons for believing that the relationship between the response and the predictors follows a particular functional form (Smyth, 2002).
Logistic Regression Model
It is used for predicting the probability of occurrence of an event by fitting data to a logistic curve (logit function).
Table 3.1 List of Important Regression Models
In this research, multi-linear regression will be studied in detail. The reasons for choosing the multi-linear regression model, and the performance of other regression types and forecasting tools, can be found in "Chapter 5 Analyzing Data".
Fig 3.1 shows hypothetical scatter data distributed along the x and y axes together with a straight line. One can see that there is a linear relationship between the x and y data, although it is not perfectly linear. The fitted line represents the best-fit linear equation, which minimizes the sum of the squared vertical deviations between the scatter points and the line. The method of least squares is used to obtain the best-fit line.
Fig. 3.1 Hypothetical scatter data
3.2.1 Method of Least Squares:
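The method of least squares can be used with both simple regression models and multiple regression models. The equations and brief explanation here follow R. Sureshkumar (1998). For each pair of observations $(x_i, y_i)$, $i = 1, \dots, n$, an error term can be defined relative to the line $y = a + bx$:
$$e_i = y_i - (a + b x_i)$$
a and b should be computed in such a way that the sum of the squared errors over all the observations is minimized, i.e. the quantity to be minimized is:
$$S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$$
To minimize S, its partial derivatives with respect to a and b are set to zero, that is $\partial S / \partial a = 0$ and $\partial S / \partial b = 0$, which yields:
$$b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad a = \bar{y} - b \bar{x}$$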
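As a minimal sketch of the least squares computation, the closed-form solution for a straight line can be coded directly. The sketch assumes the line is written as y = a + b*x, with a the intercept and b the slope; the function name `fit_line` and the sample data are hypothetical and are not taken from the cited notes.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    # Slope and intercept from setting both partial derivatives to zero.
    b = (n * sum_xy - sum_x * sum_y) / denom
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```

The guard on `denom` covers the case where every x value is the same, in which no unique slope exists.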
3.2.2 Assumptions for Regression Model:
Although a multi-linear regression model can be applied to any kind of data set, the data should have certain characteristics for the model to be successful.
1- ) For all settings of the independent variables, the variance of the random error ε is constant (homoscedasticity).
2- ) The probability distribution of points about the line of means is normal.
3- ) Outliers do not exist in the dataset.
4- ) The random errors are independent or not serially correlated.
(Statistics for Engineering and the Sciences, 1992)
Deborah R. Abrams of Princeton University, Data and Statistical Services (2007) gathered similar assumptions:
1- ) Number of Cases: When doing regression, the ratio of cases to independent variables (IVs) should ideally be 20 cases for every IV in the model. The ratio should be no lower than 5 to 1.
2- ) Accuracy of Data: If the data were entered by hand rather than taken from an established data set, it is a good idea to check the accuracy of the data entry. For example, a variable that is measured using a 1 to 5 scale should not have a value of 8.
3- ) Missing Data: If specific variables have a lot of missing values, one may decide not to include those variables in the analysis. If only a few cases have any missing values, then those cases might be deleted. If there are missing values for several cases on different variables, then deleting cases may not be suitable. In such a case, the dataset can be separated into two groups: those cases missing values for a certain variable, and those not missing a value for that variable. Using t-tests, it can be determined whether the two groups differ on other variables included in the sample.
After examining the data, missing values can be replaced with some other value. The easiest replacement value to use is the mean of the variable. Alternatively, a group mean can be substituted.
4- ) Outliers: Data should be checked for outliers (i.e., an extreme value on a particular item). An outlier is often operationally defined as a value that is at least 3 standard deviations above or below the mean. If the cases that produced the outliers are not part of the same "population" as the other cases, then those cases might be deleted. Alternatively, those extreme values can be counted as "missing" while the case is retained for other variables.
5- ) Normality: To check the normality of the data, one can construct histograms and "look" at the data to see its distribution. Another way is to look at the plot of the "residuals". Residuals are the differences between obtained and predicted dependent variable scores. If the data are normally distributed, then the residuals should be normally distributed around each predicted dependent variable score. In addition to graphic examination of the data, one can also examine the data's normality statistically. Statistical programs such as SPSS will calculate the skewness and kurtosis for each variable; an extreme value for either one indicates that the data are not normally distributed. "Skewness" is a measure of how symmetrical the data are, and "kurtosis" shows how peaked the distribution is, either too peaked or too flat. Extreme values for skewness and kurtosis are values greater than +3 or less than -3. Checking for outliers will also help with the normality problem.
6- ) Linearity: Regression analysis also has an assumption of linearity. Linearity means that there is a straight-line relationship between the independent variables and the dependent variable. This assumption is important because regression analysis only tests for a linear relationship between the independent variables and the dependent variable. Linearity between an independent variable and the dependent variable can be tested by looking at a bivariate scatterplot (i.e., a graph with the independent variable on one axis and the dependent variable on the other). If the two variables are linearly related, the scatterplot will be oval.
7- ) Homoscedasticity: The assumption of homoscedasticity is that the spread of the residuals is approximately equal for all predicted dependent variable scores. Homoscedasticity can be checked by looking at the same residuals plot discussed in the linearity and normality sections. Data are homoscedastic if the residuals plot is the same width for all values of the predicted dependent variable.
8- ) Multicollinearity and Singularity: Multicollinearity is a condition in which the independent variables are very highly correlated (.90 or greater), and singularity is when the independent variables are perfectly correlated and one independent variable is a combination of one or more of the other independent variables. Calculation of the regression coefficients is done through matrix inversion; if singularity exists the inversion is impossible, and if multicollinearity exists the inversion is unstable. In such cases the independent variables are redundant with one another. As such, having multicollinearity or singularity can weaken the analysis. In general, two independent variables that correlate with one another at 0.70 or greater are considered correlated.
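Several of the screening checks listed above (the 3 standard deviation outlier rule, skewness and kurtosis, and the 0.70 correlation threshold) can be sketched in a few lines. This is only an illustration under stated assumptions: all function names and data are hypothetical, the skewness and kurtosis are the plain moment-based versions (statistical packages such as SPSS apply small-sample corrections, so their values will differ slightly), and the sample standard deviation is used for the outlier rule.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def find_outliers(values, threshold=3.0):
    """Values lying at least `threshold` standard deviations from the mean."""
    m, s = mean(values), std_dev(values)
    return [v for v in values if abs(v - m) >= threshold * s]


def skewness(values):
    """Moment-based skewness (0 for perfectly symmetric data)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m3 = mean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    m = mean(values)
    m2 = mean([(v - m) ** 2 for v in values])
    m4 = mean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.70):
    """Pairs of independent variables with |r| at or above the threshold."""
    names = list(columns)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(columns[first], columns[second])
            if abs(r) >= threshold:
                flagged.append((first, second, r))
    return flagged


# Hypothetical data for illustration only.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 10, 12, 11, 60]
suspect = find_outliers(scores)

predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "x3": [2.0, 5.0, 1.0, 4.0, 3.0],
}
flagged = correlated_pairs(predictors)
print("possible outliers:", suspect)
print("highly correlated pairs:", [(f, s) for f, s, _ in flagged])
```

Note that with very small samples a single extreme value inflates the standard deviation so much that it can never reach 3 standard deviations, so the rule is only meaningful with a reasonable number of cases.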
3.2.3 Significance and Validity for Regression Model:
Various tests and control methods are widely used for checking the significance and validity of multi-linear regression models. The important ones are described below:
$SS_E$, also known as the error sum of squares, can be represented as $SS_E = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Similarly, $SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is known as the regression sum of squares. So according to Eqn. 3.5, the total corrected sum of squares, $SS_T$, can be shown as
$$SS_T = \sum_{i=1}^{n} (y_i - \bar{y})^2 = SS_R + SS_E$$
As previously discussed in this chapter, the regression equation can be written as $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$. The "k" value in the equation represents the degrees of freedom of the regression model. Similarly, "p" stands for the number of parameters in the regression equation (p = k + 1) and "n" stands for the number of observations. Under these assumptions the following equation can be written:
$$F_0 = \frac{SS_R / k}{SS_E / (n - p)} = \frac{MS_R}{MS_E}$$
(Applied Statistics and Probability for Engineers, 2003)
$MS_R$ stands for the mean square from regression and likewise $MS_E$ refers to the mean square of errors (or residuals).
The idea behind the F value is that if two data sets are similar, their variances should be similar as well. The F value is a non-negative ratio of two variances (here, the mean square from regression divided by the mean square of errors). The null hypothesis (null hypothesis : a statistical hypothesis to be tested and accepted or rejected in favor of an alternative; specifically : the hypothesis that an observed difference (as between the means of two samples) is due to chance alone and not due to a systematic cause (Webster's dictionary)) is rejected if the F value gets too high. Alongside the F value, Student's t-test is used for determining the significance of individual coefficients.
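The sums of squares, mean squares and F value described in this section can be sketched for the simplest case of one independent variable (k = 1, p = 2). The function names and data are hypothetical, and the sketch assumes the fitted values come from a least squares line with an intercept, which is what makes the total corrected sum of squares split into a regression part and an error part.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


def anova_table(xs, ys):
    """Sums of squares, mean squares and F value for a one-variable fit."""
    n = len(ys)
    k = 1        # number of independent variables
    p = k + 1    # number of parameters (intercept and slope)
    a, b = fit_line(xs, ys)
    fitted = [a + b * x for x in xs]
    y_bar = sum(ys) / n
    ss_e = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # error
    ss_r = sum((f - y_bar) ** 2 for f in fitted)           # regression
    ss_t = sum((y - y_bar) ** 2 for y in ys)               # total corrected
    ms_r = ss_r / k
    ms_e = ss_e / (n - p)
    return {"SS_R": ss_r, "SS_E": ss_e, "SS_T": ss_t,
            "MS_R": ms_r, "MS_E": ms_e, "F": ms_r / ms_e}


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
table = anova_table(xs, ys)
for name, value in table.items():
    print(f"{name:5s} {value:12.4f}")
```

A large F value, as produced by this nearly linear data, indicates that the regression explains far more variation than is left in the residuals.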
Student's t-test and Significance:
The t statistic is the coefficient divided by its standard error. The standard error is an estimate of the standard deviation of the coefficient, the amount it varies across cases. Most statistical software compares the t statistic of each regression variable with values in the Student's t distribution to determine the p-value. (Princeton University, Data and Statistical Services, 2007)
The generalized formula of the t-test is given below:
The t and p indices refer to the target and predicted samples, whereas s denotes the variance of the samples.
Smaller p-values indicate greater significance of the variables in the regression equation. Although several confidence levels may be considered acceptable, most scientists regard results at the 95% confidence level as statistically significant.
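The statement that the t statistic is the coefficient divided by its standard error can be illustrated for the slope of a simple linear regression. This sketch assumes one independent variable, for which the standard error of the slope is the square root of the error mean square divided by the sum of squared deviations of x; the function name and data are hypothetical. Turning t into a p-value requires the Student's t distribution, which is left to a statistics package here.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b, se_b, t) for the slope of the line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # Error mean square with n - 2 degrees of freedom (two parameters).
    ss_e = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ms_e = ss_e / (n - 2)
    se_b = math.sqrt(ms_e / s_xx)   # standard error of the slope
    return b, se_b, b / se_b


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
b, se_b, t = slope_t_statistic(xs, ys)
print(f"b = {b:.3f}, standard error = {se_b:.4f}, t = {t:.2f}")
```

With a single independent variable, the square of this t value equals the F value of the regression, so the two tests agree.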
R-square, Goodness of Fit:
$R^2$, also known as the coefficient of determination, is used for determining how well a line fits a dataset. It is obtained by dividing the regression sum of squares ($SS_R$) by the total corrected sum of squares ($SS_T$), and can be formulated as follows:
$$R^2 = \frac{SS_R}{SS_T} = 1 - \frac{SS_E}{SS_T}$$
$R^2$ can be expressed as a percentage, and higher values indicate a better-fitting line. A perfect fit would have an $R^2$ value of 1 (which means $SS_E$, the error sum of squares, equals 0).
Adjusted R-square ($R^2_{adj}$) is a similar measure to $R^2$. Since it takes the degrees of freedom into account, it is more useful for determining whether a newly added regression coefficient decreases the error mean square.
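A short sketch of both measures follows. It assumes the common definition of the adjusted value, 1 - [SS_E / (n - p)] / [SS_T / (n - 1)], where n is the number of observations and p the number of parameters; the source does not spell this formula out, so treat it as an assumption. The function name, data and fitted values are hypothetical (the fitted values correspond to a least squares line for this data).

```python
def r_squared(ys, fitted, p):
    """Return (R^2, adjusted R^2) for observations ys and fitted values.

    p is the number of parameters in the regression equation,
    including the intercept.
    """
    n = len(ys)
    y_bar = sum(ys) / n
    ss_e = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error
    ss_t = sum((y - y_bar) ** 2 for y in ys)              # total corrected
    r2 = 1.0 - ss_e / ss_t
    r2_adj = 1.0 - (ss_e / (n - p)) / (ss_t / (n - 1))
    return r2, r2_adj


# Hypothetical data; fitted values come from the line y = 1.09 + 1.97*x
# evaluated at x = 1, 2, 3, 4, 5.
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
fitted = [3.06, 5.03, 7.0, 8.97, 10.94]
r2, r2_adj = r_squared(ys, fitted, p=2)
print(f"R-square = {r2:.4f}, adjusted R-square = {r2_adj:.4f}")
```

Because the adjusted value penalizes every extra parameter, it never exceeds the plain value and only rises when a new variable reduces the error mean square.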
3.2.4 Software Output of a Regression Analysis:
Most scientists and researchers use statistical software for analyzing their data. These packages have a relatively long history in the scientific world and have proven to be correct, fast and reliable. In the following parts of this study, SPSS by IBM will be used for analysis. Other statistical software such as Minitab, Statistica and SAS gives regression output similar to that of SPSS. The following output tables are shown here to illustrate the regression definitions and will be discussed in detail in the following chapters.
Table 3.2 Sample One-Way ANOVA output from SPSS
Sum of Squares Column: $SS_R$ (regression), $SS_E$ (error) and $SS_T$ (total), previously discussed in Section 3.2.3
df Column: shows the degrees of freedom, determined by n and p
Mean Square Column: $MS_R$ and $MS_E$, previously discussed in Section 3.2.3
F value: previously discussed in Section 3.2.3
Sig.: the significance (p-value) corresponding to the F value.
a. Dependent Variable: a5_Passenger_Number
Table 3.3 Sample Coefficient Output Table from SPSS Output
Model Column: Independent variables (the dependent variable is mentioned at the bottom of the table)
Unstandardized Coefficients Column: These coefficients are not standardized. (In the later parts of this thesis, unstandardized values will be used, since they give more reliable results.)
B Column: Coefficients of the variables (recall the regression equation $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$)
Std. Error Column: Standard errors of the coefficients
t Column: t values of the coefficients (from the Student's t-test section)
Sig. Column: Significance values are listed here. These values are also known as p-values. Variables with smaller p-values make a more significant contribution to the overall regression model.