Example SPSS Essay
SPSS Unemployment Illness
Introduction
The aim of this report is to find patterns in the data and to model the following:

Unemployment levels across the different districts in England.

Long term illness in relation to other key variables.

An index which measures affluence for the different English districts.
The analysis will be carried out using two key sets of data. Both datasets are large, so small sample sizes are not an issue, and both can be taken as a good representation of the population under review, in this case England.
Analysis
The first part of the project will focus on developing a model which fits unemployment levels across the different districts. Multiple regression analysis will be used to develop this model. Before taking this step, some basic exploratory analysis is useful as it gives an idea of the main characteristics of the data.
As can be observed from the above chart and table the majority of the districts have unemployment rates ranging between 2% and 6%. Isles of Scilly have the lowest unemployment rate at 1.34% whilst Hackney has the highest unemployment rate at 10.59%. The average unemployment rate per district is in the region of about 4%.
The Pearson correlation coefficients (Appendix 1) show that the level of unemployment is significantly correlated with all the variables involved in this study. The key positive relationships are with % of lone parent families (r=0.907, p<0.001), % of persons without a car (r=0.886, p<0.001) and % of people living in rented LA accommodation (r=0.739, p<0.001). The key negative relationships are with % married (r=-0.815, p<0.001), % of homeowners (r=-0.678, p<0.001) and % of the population aged 45-59 (r=-0.613, p<0.001).
Interpreting these results, districts with higher unemployment rates tend to have more people without a car, more people in accommodation rented from the LA and more lone parent families. Districts with low unemployment rates, on the other hand, tend to have more people who are married, middle aged and homeowners.
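As a rough illustration of how the Pearson coefficients quoted above are obtained, the sketch below computes r by hand for a handful of districts. The percentages are hypothetical and are not taken from the dataset; the function name is also just an illustrative choice.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical percentages for five districts (NOT values from the dataset).
unemployed = [2.1, 3.4, 4.0, 6.5, 9.8]
lone_parent = [1.0, 1.6, 2.1, 3.0, 4.9]
married = [62.0, 58.5, 57.0, 49.0, 40.5]

print(round(pearson_r(unemployed, lone_parent), 3))  # strongly positive
print(round(pearson_r(unemployed, married), 3))      # strongly negative
```

A positive coefficient means the two percentages rise together across districts; a negative one means that one falls as the other rises.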
Correlations are a good starting point for identifying which predictors could be useful for the regression model. In this instance there are 18 variables, all of which are significantly correlated with % unemployed. In an ideal world the 'best' model, the one accounting for the most variation, would include all 18 variables. From a practical perspective this is unworkable, so stepwise regression will be used to obtain a model with fewer predictors that is still statistically robust.
For the purpose of this project no more than six predictors can be used to model unemployment rates. Stepwise regression produced a model based on the following six predictors: % of lone parent families (r=0.907, p<0.001), % with limiting long term illness (llti) in the district (r=0.573, p<0.001), % of persons aged 60 and over (r=0.225, p<0.001), % of households (hh) with no earners (r=0.551, p<0.001), % of persons in detached, semi-detached or terraced housing (r=0.525, p<0.001) and % of females in each district (r=0.283, p<0.001).
Various multiple regression models were tested. The three-predictor model provided by the stepwise procedure had a lower R-squared statistic and was discarded. Another model with six independent variables (those with the highest correlations with the dependent variable) was also tested but, again, its R-squared statistic was lower and this option too was discarded. In the light of these results, the original six variables proposed by the stepwise model will be used.

The above model looks appropriate, as about 94% of the variability in % unemployed is explained by these six predictors. The adjusted R-squared is very similar to the R-squared, suggesting that the fit has not been inflated by redundant predictors and that the model should generalise to the wider population.
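To show why the adjusted R-squared stays so close to the R-squared here, the sketch below applies the standard adjustment formula. The sample size of 366 districts is an assumption inferred from the district ranks quoted later in the report; R-squared is taken as the rounded 0.94.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n cases."""
    if n - p - 1 <= 0:
        raise ValueError("need more cases than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_districts = 366   # ASSUMPTION: inferred from ranks running up to 366th
n_predictors = 6    # six predictors, as stated in the text
r_squared = 0.94    # "about 94%" as reported

adj = adjusted_r_squared(r_squared, n_districts, n_predictors)
print(round(adj, 3))  # about 0.939, i.e. barely below R-squared
```

With several hundred cases and only six predictors the penalty term (n - 1)/(n - p - 1) is close to 1, so the two statistics are bound to be similar.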

Autocorrelations are not a problem for this model as the DW statistic is very close to 2.
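For reference, the Durbin-Watson (DW) statistic can be computed directly from the ordered residuals. The sketch below is a minimal version of the standard formula; the residuals are hypothetical, since the report does not list them.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. Values near 2 suggest no
    first-order autocorrelation; values near 0 or 4 suggest positive or
    negative autocorrelation respectively."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals (NOT taken from the SPSS output).
example_residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2, 0.4, -0.6, 0.1]
print(round(durbin_watson(example_residuals), 2))
```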

With F = 997.4 and p<0.001, the regression model as a whole is statistically significant.
This means that the model is:
% persons unemployed = 4.81 + 0.241 * (% of lone parent families) + 0.219 * (% of persons with llti in district) - 0.069 * (% persons aged 60 and over) + 0.094 * (% persons in hh with over 1.5 persons per room) - 0.005 * (% persons in detached/semi-detached or terraced housing) - 0.087 * (% female in each district).
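The fitted equation can be applied to a district as in the sketch below. It assumes that the coefficients for persons aged 60 and over, housing type and % female are negative, which is how the equation reads once the minus signs apparently lost from the text are restored. The district values in the example are hypothetical.

```python
# Coefficients as read from the equation in the text.
# ASSUMPTION: the aged 60+, housing and female terms carry minus signs.
COEFFICIENTS = {
    "intercept": 4.81,
    "lone_parent": 0.241,
    "llti": 0.219,
    "aged_60_plus": -0.069,
    "overcrowded": 0.094,   # % persons in hh with over 1.5 persons per room
    "house": -0.005,        # % in detached/semi-detached/terraced housing
    "female": -0.087,
}

def predict_unemployment(district):
    """Predicted % unemployed for a dict of district percentages."""
    total = COEFFICIENTS["intercept"]
    for name, coefficient in COEFFICIENTS.items():
        if name != "intercept":
            total += coefficient * district[name]
    return total

# Hypothetical district profile (NOT a real district from the dataset).
example = {
    "lone_parent": 3.0, "llti": 12.0, "aged_60_plus": 20.0,
    "overcrowded": 2.0, "house": 80.0, "female": 51.0,
}
print(round(predict_unemployment(example), 2))  # about 2.13
```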

Large t-values are linked with small p-values and indicate that a particular independent variable is appropriate for the model. In this model all six predictor variables contribute. Furthermore, none of the 95% confidence intervals for the coefficients includes zero.

Multicollinearity is not a problem for this model as VIF values are very low.
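As a reminder of what the VIF measures, the sketch below converts the R-squared obtained by regressing one predictor on all the others into a variance inflation factor. The auxiliary R-squared values and the cut-off of 10 are illustrative assumptions (10 is a common rule of thumb, not a figure from the report).

```python
def variance_inflation_factor(aux_r_squared):
    """VIF for a predictor, given the R-squared from regressing that
    predictor on all the other predictors in the model."""
    if not 0.0 <= aux_r_squared < 1.0:
        raise ValueError("auxiliary R-squared must be in [0, 1)")
    return 1.0 / (1.0 - aux_r_squared)

# Hypothetical auxiliary R-squared values (NOT from the SPSS output).
aux_r2 = {"lone_parent": 0.60, "llti": 0.45, "female": 0.20}
RULE_OF_THUMB = 10.0  # ASSUMPTION: a commonly quoted cut-off

for name, r2 in aux_r2.items():
    vif = variance_inflation_factor(r2)
    flag = "check" if vif >= RULE_OF_THUMB else "ok"
    print(name, round(vif, 2), flag)
```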

The residuals appear to be normally and randomly distributed.

Broadly speaking this is a decent model, as all the diagnostics and criteria indicate. The tables below show the cases (districts) with extreme values, statistically termed outliers.
District    Observed value    Predicted value
207         4.90              3.54
258         9.21              7.81
357         8.65              7.35

District    Observed value    Predicted value
366         1.34              3.05
241         3.84              5.53
280         4.34              5.97
The above tables show those districts (cases) whose observed unemployment rates deviate considerably from those fitted by the regression model.
The regression model predicted considerably lower unemployment rates than observed for the Gravesham, Haringey and Newham districts. On the other hand, it predicted considerably higher unemployment rates than observed for the Isles of Scilly, Hyndburn and Burnley.
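The size of each deviation can be read off as a residual (observed minus predicted). The sketch below uses the six case numbers and values from the tables above; only the variable names are invented for illustration.

```python
# (observed %, predicted %) for the six outlying cases listed in the tables.
cases = {
    207: (4.90, 3.54),
    258: (9.21, 7.81),
    357: (8.65, 7.35),
    366: (1.34, 3.05),
    241: (3.84, 5.53),
    280: (4.34, 5.97),
}

# Residual = observed - predicted; positive means the model under-predicted.
residuals = {
    case: round(observed - predicted, 2)
    for case, (observed, predicted) in cases.items()
}

under_predicted = [case for case, r in residuals.items() if r > 0]
over_predicted = [case for case, r in residuals.items() if r < 0]

for case, r in residuals.items():
    print(case, r)
print("model too low: ", under_predicted)
print("model too high:", over_predicted)
```

In each case the model is out by roughly 1.3 to 1.7 percentage points, which is large relative to an average unemployment rate of about 4%.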
The next phase of the analysis is to model long term illness. The earlier analysis established that long term illness is positively correlated with unemployment levels. Logistic regression will be used because long term illness is a dichotomous variable (a variable with only two outcomes, i.e. yes and no). Before performing logistic regression, some basic analysis needs to be carried out to find out which variables are statistically significant and show a real and meaningful relationship. Care is needed with this dataset: it is very large, so even the smallest difference can produce a statistically significant result when there may not be a meaningful relationship.
The majority of the variables are categorical, so the chi-square statistic alongside the Spearman rank correlation (Appendix 2) will be used to find out whether there are any meaningful bivariate relationships. The bar charts (please refer to the SPSS output files) will also be used to verify whether the statistically significant relationships are actually meaningful. Including all the available variables is not practical, as some of them have multiple categories and would produce a very long model made up of many dummy variables. The four variables with the strongest statistically significant relationships have been chosen for the model. These are age, economic position, marital status and number of cars.
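The chi-square statistic for a cross-tabulation can be sketched as below. The counts are hypothetical. Cramér's V is not used in the report; it is added here only as one common effect-size measure, because with a very large sample a tiny difference can still give a significant chi-square.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

def cramers_v(table):
    """Cramér's V effect size (0 = no association, 1 = perfect)."""
    statistic, _ = chi_square(table)
    total = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(statistic / (total * k))

# Hypothetical counts: rows = no car / one car / two or more cars,
# columns = long term illness yes / no. NOT taken from the dataset.
cars_by_illness = [[300, 1700], [250, 2750], [100, 1900]]
stat, dof = chi_square(cars_by_illness)
print(round(stat, 1), dof, round(cramers_v(cars_by_illness), 3))
```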
The next step is to provide some descriptive statistics on these four variables.
Economic Position

As can be observed from the above table, those who are permanently sick, unemployed and retired are more likely to be long term sick.

On the other hand those in employment are less likely to be long term sick.
Marital Status

From the above table it can be clearly seen that the 'healthiest' people are those who are single.

Those who remarried, divorced or were widowed were more likely to suffer from long term illness.
Number of Cars

Those without a car were more likely to suffer from long term illness.
Age

It can be seen from the above two plots that the older the respondents the less likely that they have long term illness.
This is a decent model, as the inclusion of age, economic position, marital status and number of cars has considerably reduced the -2 log likelihood. This statistic measures how poorly the model predicts the outcome: the smaller the statistic, the better the model.
In fact, the -2 log likelihood has dropped substantially with four independent variables (from about 3000 to just under 2000) compared with the model with no variables, indicating that the expanded model does a better job of predicting the outcome than one with fewer predictors.
The Cox & Snell R-squared can be interpreted in a broadly similar way to R-squared in multiple regression, and in this context it is acceptable.
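The two statistics discussed above can be sketched as follows. The -2 log likelihood values of 3000 and 2000 are the rounded figures from the text; the sample size of 5000 is purely hypothetical, as the report does not state it, so the resulting Cox & Snell value is illustrative only.

```python
import math

def minus_two_log_likelihood(outcomes, probabilities):
    """-2 log likelihood for binary outcomes (0/1) given the predicted
    probability of a 1 for each case."""
    log_likelihood = 0.0
    for y, p in zip(outcomes, probabilities):
        log_likelihood += math.log(p) if y == 1 else math.log(1.0 - p)
    return -2.0 * log_likelihood

def cox_snell_r_squared(m2ll_null, m2ll_model, n):
    """Cox & Snell R-squared from the -2LL of the null and fitted models."""
    return 1.0 - math.exp((m2ll_model - m2ll_null) / n)

# Rounded -2LL values from the text; n is a HYPOTHETICAL sample size.
print(round(cox_snell_r_squared(3000.0, 2000.0, 5000), 3))  # about 0.181

# Tiny made-up example: better probabilities give a smaller -2LL.
y = [1, 0, 1, 0]
print(minus_two_log_likelihood(y, [0.5, 0.5, 0.5, 0.5]) >
      minus_two_log_likelihood(y, [0.9, 0.1, 0.8, 0.2]))  # True
```

Unlike the R-squared of a linear regression, the Cox & Snell statistic cannot reach 1, which is one reason to read it only as a rough guide.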
The Hosmer-Lemeshow test is used to test the null hypothesis that there is a linear relationship between the independent variables and the log odds of the criterion variable.
A requirement for this model to be valid is that the Hosmer-Lemeshow test should not be statistically significant. Here p = 0.722, which is greater than the 0.05 cut-off point and indicates that the model fits the data adequately.
Overall, the model classifies about 93% of cases correctly, which supports its validity.
There are a number of categories where the confidence interval for exp(B) includes 1. This means that these categories are not adding any meaningful information to the model.
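The check on exp(B) described above can be sketched as follows. The coefficients and standard errors are hypothetical, as are the category labels; the interval uses the usual normal approximation with z = 1.96.

```python
import math

def odds_ratio_with_ci(b, se, z=1.96):
    """Return exp(B) and its approximate 95% confidence interval."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)

def adds_information(lower, upper):
    """A category is informative only if the interval excludes 1."""
    return not (lower <= 1.0 <= upper)

# HYPOTHETICAL (B, standard error) pairs, NOT values from the SPSS output.
categories = {
    "category A": (0.80, 0.20),
    "category B": (0.20, 0.20),
    "category C": (-1.10, 0.30),
}

for name, (b, se) in categories.items():
    odds_ratio, lower, upper = odds_ratio_with_ci(b, se)
    print(name, round(odds_ratio, 2),
          (round(lower, 2), round(upper, 2)),
          adds_information(lower, upper))
```

An exp(B) below 1 can be inverted to express the effect in the other direction: an odds ratio of about 0.33, for example, corresponds to odds about 3 times higher for the reference category.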
It can be seen from the above table that:

Respondents working part time are about 1.2 times more likely not to have long term illness when compared with full time workers

Respondents working full time are about 3 times more likely not to have long term illness when compared with retired respondents.

Respondents working full time are about 1.5 times more likely not to have long term illness when compared with students.

Respondents working full time are about 3.6 times more likely not to have long term illness when compared with inactive respondents.

Respondents who are permanently sick are certain to have long term illness.

Respondents working full time are about 2.4 times more likely not to have long term illness when compared with unemployed respondents.

Respondents who are married are almost twice as likely to have long term illness when compared with single respondents.

Respondents who remarry are about 2.2 times more likely to have long term illness when compared with single respondents.

Respondents who are widowed or divorced are also more likely than single respondents to be suffering from long term illness.

Respondents who have a car are about 1.8 times more likely to have long term illness when compared with those who do not have a car.

Respondents who have two cars are about 2.2 times more likely to have long term illness when compared with those who do not have a car.
From the above results it is interesting to note that the exploratory data analysis regarding car ownership contrasts with the findings of this logistic model. It is also worth noting that being employed in some form reduces the likelihood of being long term ill; equally, those who are long term ill are very unlikely to be employed. A respondent who is out of work is more than twice as likely to be long term sick as a respondent working full time. This may be because being unemployed is not a pleasant situation, and people in it are perhaps more likely to be depressed or anxious and to fall ill.
It would be ideal to try and compare the two datasets used to do the analysis, however this is not possible as the two datasets are not quite identical. There are a number of variables that can be found in one dataset but not found in the other.
As found in question 1, long term illness is positively correlated with unemployment (r=0.573, p<0.001), meaning that where unemployment is high, long term illness is also high. This makes sense, as people who are long term sick often cannot work.
Furthermore, in question 2 it was found that respondents working full time are about 2.4 times more likely not to have long term illness when compared with unemployed respondents. This too makes sense, as being unemployed is not an easy situation and anxiety, depression and other forms of illness can set in.
The next phase of the analysis will focus on creating an affluence index. Affluence is a measure of the living standards of a particular person, family, neighbourhood etc. For a district to be affluent, low unemployment and high home ownership should be two good indicators. These two variables are therefore central to the index of affluence.
The next step is to find out which variables have the strongest statistically significant correlations with both unemployment and home ownership. Please refer to section A. Unemployment is significantly correlated with lone parent families, being married and living in rented LA accommodation.
Home ownership is significantly correlated with living in rented LA accommodation, lone parent families and being married. As can be observed, there are numerous variables which are correlated with both unemployment and home ownership.
Based on this information it has been decided that this affluence index will be built using the following variables: unemployment, home ownership, married, persons without a car, living in rented LA accommodation and aged between 45 and 59.
The way this index will work is as follows:
The lower the coefficient, the more affluent the district. Based on the above table and correlations, an affluent district has the following characteristics:

Low unemployment rates

High home ownership

Higher marriage rates

Low LA rented housing

Low proportion of people without a car

High proportion of people aged between 45 and 59

For this index to be meaningful, the home ownership, married and aged 45-59 variables need to be rescaled so that they do not skew the index: these three variables are high in affluent districts, whilst the remaining three are low. To overcome this, the home ownership, married and aged 45-59 variables were transformed so that their values follow the overall pattern that the lower the value, the more affluent the district.
The equation used to streamline these three variables was the following:
[1/(var)^0.9]*100
After this transformation, the ranges of the six components are as follows (VAR 1 to VAR 3 are shown on the transformed scale):
VAR 1 - Homeownership: 1.72 to 11.6
VAR 2 - Married: 2.67 to 4.55
VAR 3 - Persons aged 45-59: 5.32 to 9.94
VAR 4 - Renting from LA: 2.29 to 60.93
VAR 5 - Persons without a car: 6.08 to 57.22
VAR 6 - Unemployed: 1.34 to 10.59
The equation which created this index is:
[1/(VAR1)^0.9]*100 + [1/(VAR2)^0.9]*100 + [1/(VAR3)^0.9]*100 + VAR4 + VAR5 + VAR6=AFF_1
AFF_1 being the coefficient of affluence.
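The index can be sketched in code as below, assuming that VAR1 to VAR3 enter the equation as raw percentages and are transformed inside it, while VAR4 to VAR6 are added unchanged. The two district profiles are hypothetical and the function names are illustrative.

```python
def inverse_transform(percentage):
    """[1/(var)^0.9]*100: turns a 'high is affluent' percentage into a
    'low is affluent' score."""
    if percentage <= 0:
        raise ValueError("percentage must be positive")
    return 100.0 / percentage ** 0.9

def affluence_index(home_owner, married, aged_45_59,
                    la_rented, no_car, unemployed):
    """AFF_1: the lower the value, the more affluent the district."""
    return (inverse_transform(home_owner)
            + inverse_transform(married)
            + inverse_transform(aged_45_59)
            + la_rented + no_car + unemployed)

# HYPOTHETICAL district profiles (percentages), NOT rows from the dataset.
affluent = affluence_index(home_owner=85.0, married=50.0, aged_45_59=20.0,
                           la_rented=5.0, no_car=10.0, unemployed=2.0)
deprived = affluence_index(home_owner=25.0, married=30.0, aged_45_59=13.0,
                           la_rented=50.0, no_car=55.0, unemployed=10.0)
print(round(affluent, 2), round(deprived, 2))
```

Because the three transformed terms span only a few points each while the LA renting and no-car percentages span about 50 points, the untransformed variables dominate the index. This is one way to see why a district with many car-free households can score poorly despite very low unemployment.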
It can be seen from the above correlation table that the affluence coefficient (AFF_1) is strongly correlated with all six variables which contributed to developing it. There is a strong positive correlation between AFF_1 and % persons unemployed (r=0.899, p<0.001), between AFF_1 and % persons in housing rented from the local authority (r=0.911, p<0.001) and between AFF_1 and % persons without a car (r=0.947, p<0.001). These positive correlations make sense, as the unemployment, LA accommodation and no-car variables need to be low for the AFF_1 coefficient to be low.
On the other hand there is a negative correlation between AFF_1 and % married (r=-0.799, p<0.001), between AFF_1 and % of persons aged between 45 and 59 (r=-0.604, p<0.001) and between AFF_1 and persons in owner occupied households (r=-0.848, p<0.001). These negative correlations make sense, as being married, being a homeowner and being aged between 45 and 59 need to be high for the AFF_1 index to be low.
With reference to unemployment as mentioned earlier AFF_1 is strongly related to this variable.
The above scatter plot clearly shows this strong positive association (r=0.899). The next step is to find out whether this affluence index is a true reflection of the unemployment figures, meaning that districts with lower unemployment rates should have a lower affluence index and districts with higher unemployment rates should have a higher affluence index.
district           county        unemp    AFF_1
Isles of Scilly    Cornwall a    1.34     76.47
South Lakeland     Cumbria       1.81     40.65
Ribble Valley      Lancashire    1.88     32.08
With reference to the three areas with the lowest unemployment levels, the index does not appear to be entirely accurate: the Isles of Scilly are ranked 321st, South Lakeland 89th and Ribble Valley 13th. The Isles of Scilly have this 'erratic' index value because about 45% of the respondents do not have a car and home ownership is relatively low.
district         county        unemp    AFF_1
Tower Hamlets    Inner Lond    9.32     147.42
Knowsley         Merseyside    9.50     107.15
Hackney          Inner Lond    10.59    129.90
With reference to the three areas with the highest unemployment levels, Tower Hamlets, Knowsley and Hackney are ranked 366th, 359th and 365th respectively, so here the index is more or less a true reflection of the unemployment rates.
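The comparison between unemployment and AFF_1 can be made concrete by ranking the six districts tabulated above on both measures. Note that the ranks below are within these six districts only, not the ranks out of all districts quoted in the text, and the Spearman coefficient for such a small, hand-picked set is purely illustrative.

```python
# (district, % unemployed, AFF_1) from the two tables above.
districts = [
    ("Isles of Scilly", 1.34, 76.47),
    ("South Lakeland", 1.81, 40.65),
    ("Ribble Valley", 1.88, 32.08),
    ("Tower Hamlets", 9.32, 147.42),
    ("Knowsley", 9.50, 107.15),
    ("Hackney", 10.59, 129.90),
]

def ranks(values):
    """Rank 1 = lowest value (no ties are expected in this data)."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

unemp_rank = ranks([d[1] for d in districts])
aff_rank = ranks([d[2] for d in districts])

# Spearman rank correlation via the sum of squared rank differences.
n = len(districts)
d_squared = sum((u - a) ** 2 for u, a in zip(unemp_rank, aff_rank))
spearman = 1 - 6 * d_squared / (n * (n ** 2 - 1))

for (name, _, _), u, a in zip(districts, unemp_rank, aff_rank):
    print(f"{name:16s} unemp rank {u}  AFF_1 rank {a}")
print(round(spearman, 2))  # 0.6 for these six districts
```

The Isles of Scilly show the largest mismatch at the low-unemployment end, which is consistent with the observation that the index works less well for the districts with the lowest unemployment.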
Conclusion
Both datasets are large enough to support a reasonably sound analysis. The drawback is that, with such large datasets, statistical significance is easily achieved even for small differences: large datasets detect even the slightest changes and are sensitive enough to trigger statistical significance. Another issue with these two datasets is that they are similar, but not similar enough to allow data and results to be compared directly.
In the first question, multiple regression was used to model unemployment. A model based on six predictors was generated. It is a reasonably good model which satisfied all the required diagnostics and statistics. In an ideal world all the possible predictors would be included, but from a practical point of view it is very difficult to interpret a model made up of more than ten predictors. Within that constraint it is difficult to come up with a better or more accurate model.
With respect to the logistic model used to model long term illness, perhaps a model including more independent variables could have been a sensible approach. Another issue with that dataset is the lack of continuous variables. Currently there is only one, i.e. age.
The affluence model was created using a mathematical equation which appeared to be reasonably good but certainly not perfect. In fact, the model did not do very well for the districts with the lowest unemployment. On the other hand, it performed much better for the districts with the highest unemployment.
It would have been interesting for example to compare unemployment and long term illness not just at a district level but perhaps also at a county level. It would have been more interesting to compare all the different districts throughout the UK rather than parts of it. Further analysis could perhaps involve comparing unemployment rates between males and females.
This can be done through a t-test or the non-parametric equivalent. ANOVA could be used to compare unemployment rates between the four nations making up the UK. Multivariate analysis could also have been used to group or cluster districts which have common underlying characteristics. With respect to the affluence model, islands would perhaps need a separate index, as the characteristics of islands may well differ from those of mainland Britain.