# Keys to statistics success

a) Real Estate Prices in Florida

b) The histogram above shows the number of houses sold (the variates) in bins with a class interval of 50,000, for a sample of 25 values drawn from a uniform distribution between 100,000 and 400,000. Bin 1 therefore gives the number of houses in the range \$100,000-150,000, bin 2 the number in the range \$150,001-200,000, and so on.
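The binning described above can be sketched in a few lines of Python. The prices below are randomly generated stand-ins, not the actual Florida sample, and the convention that a price exactly on a boundary falls into the higher bin is an assumption made for this illustration.

```python
import random
from collections import Counter

BIN_WIDTH = 50_000
LOW, HIGH = 100_000, 400_000
N_BINS = (HIGH - LOW) // BIN_WIDTH  # six classes of width 50,000


def bin_index(price):
    """Return the 1-based bin number; the top edge (400,000) goes in the last bin."""
    return min((price - LOW) // BIN_WIDTH, N_BINS - 1) + 1


random.seed(0)  # fixed seed so the illustration is repeatable
prices = [random.randint(LOW, HIGH) for _ in range(25)]  # hypothetical sample
counts = Counter(bin_index(p) for p in prices)

# Text histogram: one '#' per house in each class
for b in range(1, N_BINS + 1):
    lower = LOW + (b - 1) * BIN_WIDTH
    print(f"bin {b}: {lower:>7,} to {lower + BIN_WIDTH:>7,}  {'#' * counts[b]}")
```

With a uniform distribution and only 25 values, the bar heights will usually be uneven; a larger sample would flatten them out.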

A histogram summarizes a large data set graphically. It allows measurements to be compared against specifications, communicates information visually, and supports investigating problems and making decisions. (Balanced Scorecard Institute, 2009, p. 2)

### Q2

a) Symmetrical: A histogram is considered symmetrical if it has a single peak and looks about the same to the left and right of the peak, e.g. the distribution of a sample of electricity usage. Such a distribution is fairly symmetric and follows a bell-shaped "normal" distribution. (Albright, Winston, & Zappe, 2009)

b) Positively skewed: A histogram is considered positively skewed if it has a single peak and a long tail extending to the right on the X-axis, with little or no tail to the left of the peak. An example is a histogram of the time between customer arrivals at a supermarket on a given day. (Albright et al., 2009)

c) Negatively skewed: A histogram is considered negatively skewed if it has a single peak and a long tail extending to the left on the X-axis. One example is the following histogram of scores on a test.

d) Bimodal: A histogram is considered bimodal if it has precisely two peaks. This is often due to the use of data from two distinct populations; a bimodal distribution is evidence that two separate population samples were combined. (Albright et al., 2009)

### Q3

The following scatterplot shows the relationship between two variables, in this case age in years and average annual income in USD. As the scatterplot shows, the relationship between the two variables is not linear.

Income is first generated at about 14 years of age, when minors become eligible to work part time. During school and university, annual income increases as better work opportunities become available. By their late twenties most people have completed their education and gained qualifications for higher-profile jobs with a higher salary. Salary continues to increase, probably in line with work experience, until it peaks at around 65 years of age. After that, average annual income decreases slightly, which can be explained by the fact that most employees retire at that age and become eligible for a pension.

### Q4

Time series graphs are mainly used to forecast future values of a variable from historical data. To generate a time series graph, repeated measurements have to be taken at regular time intervals. In a two-dimensional graph, the horizontal x-axis always represents time. The data points are plotted at regular intervals and then joined, generally with straight lines. Time series graphs can illustrate trends or patterns, and they frequently include peaks and troughs. (Te Tari Awhina, 2009)

According to Render et al., two aspects are of particular interest when analysing a time series graph: any observable trend and any seasonal pattern. (Render et al., 2009)

One example of a time series graph is the revenue of the chocolate industry in \$000 for every quarter from 1998 until 2001, as illustrated below.

### Observable Trend

In the graph of chocolate industry revenues for each quarter from 1998 until 2001, there appears to be an observable long-term trend towards higher sales, as both the peaks and the troughs seem to increase each year. Long-term trends are more obvious after the time series is "smoothed". (Te Tari Awhina, 2009)
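One common way to smooth a quarterly series is a four-quarter moving average, which averages out the seasonal swings so that the underlying trend stands out. The sketch below assumes that technique (the source does not name a smoothing method) and uses invented revenue figures, not the chocolate industry data from the graph.

```python
def moving_average(values, window=4):
    """Trailing moving average: one value for every full window of observations."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical quarterly revenues (Q1..Q4 for four years), high in winter quarters
revenues = [120, 90, 80, 150,
            130, 100, 90, 165,
            140, 110, 100, 180,
            150, 120, 110, 195]

smoothed = moving_average(revenues)
for value in smoothed:
    print(f"{value:7.2f}")
```

The raw figures rise and fall within each year, while the smoothed values climb steadily, which is what a long-term upward trend looks like once the seasonal pattern is averaged away.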

### Seasonal Pattern

As one may expect, chocolate sales are higher in winter and lower in summer because more chocolate is sold around Christmas. This effect is called seasonal variation and is visible in the above graph. Other examples of time series with a seasonal pattern are ice cream revenues and electricity usage. (Te Tari Awhina, 2009)

However, a time series graph may also show a recurring pattern that is not caused by seasonal factors. An example is an increase in wages on a four-year cycle, one year before elections are held. By allowing wage increases, the incumbent government improves its chances of being re-elected. (Te Tari Awhina, 2009)

### Q5

Three types of average are commonly used to analyse how an MBA class has performed in a specific exam: the mean, the median and the mode.

Mean: The mean is the balance point of a data set. It is calculated by dividing the sum of the values by the number of values. If each value were replaced by the mean, the total sum of the values would not change.

Median: The median is the middle value when the values are ranked from low to high. If the data set consists of an even number of scores, the median is halfway between the two middle scores.

Mode: The mode is the value, or values, that occur most frequently. If all values are different, no mode exists. If the values are arranged in a grouped frequency table, the group with the highest frequency is the modal group or class.
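The three definitions can be tried out with Python's standard `statistics` module. The scores below are hypothetical and are not the MBA class data; the helper `modes_or_none` is an illustrative name introduced here to apply the convention above that a data set with all-different values has no mode.

```python
from statistics import mean, median, multimode

scores = [50, 60, 60, 70, 80, 100]  # hypothetical exam scores

avg = mean(scores)    # sum of values divided by number of values
mid = median(scores)  # even count, so halfway between the two middle scores


def modes_or_none(values):
    """Most frequent value(s); an empty list when every value is different."""
    if len(set(values)) == len(values):
        return []
    return multimode(values)


print(avg, mid, modes_or_none(scores))

# Balance-point property: replacing every score by the mean keeps the total
print(avg * len(scores) == sum(scores))
```

Note that `multimode` on its own returns every value when all are distinct, which is why the helper checks for that case first.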

i) Similarity of Measures

All three types of average can coincide; e.g. if all students score 50, the mean, median and mode are all 50.

ii) Difference of Measures

Mean

Because all scores are used to calculate the mean, extremely low or high scores (outliers) have a significant influence on it. Even a single extreme outlier would have an effect; in the example given there are four extremely low scores, so the mean of 65.28 is pulled down significantly by the four scores of only 20. Unlike the median, however, the mean uses every score, so no data is left out.

In contrast to the mean, the median is not influenced by outliers; however, it is more laborious to find when the data set is large. In the example given, the median is 72.5, midway between the scores 71 and 74 as highlighted in the table.

The mode is a helpful tool for finding the most common value; however, it may not be very representative of a set of scores. The mode of 20 calculated in the above example is not near the middle.
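The different sensitivity of the mean and the median to outliers can be shown with a small experiment. The scores are invented for illustration (the class table is not reproduced here): a handful of typical scores is compared with the same scores plus two very low ones.

```python
from statistics import mean, median

typical = [68, 70, 71, 74, 75, 78]  # hypothetical scores without outliers
with_outliers = [20, 20] + typical  # the same scores plus two extreme low scores

mean_shift = mean(typical) - mean(with_outliers)
median_shift = median(typical) - median(with_outliers)

print(f"mean:   {mean(typical):.2f} -> {mean(with_outliers):.2f}")
print(f"median: {median(typical):.2f} -> {median(with_outliers):.2f}")
print(f"shift in mean {mean_shift:.2f}, shift in median {median_shift:.2f}")
```

Two low scores move the mean by many points but the median by only a couple, which is why the median is often preferred as a summary when outliers are present.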

### Q6

The standard deviation measures the spread of a set of values about the mean. Broadly, it indicates the typical amount by which each value in a data set varies from the mean. The standard deviation is the square root of the variance. (Albright et al., 2009)
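As a small worked illustration, the sketch below computes the variance and standard deviation of a made-up data set step by step and compares the result with the standard library. It shows both the population version (dividing by n) and the sample version (dividing by n - 1); which one a given textbook table reports is not stated here, so both are given.

```python
import math
from statistics import pstdev, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
n = len(values)
centre = sum(values) / n

# Sum of squared deviations from the mean
squared_devs = sum((v - centre) ** 2 for v in values)

population_variance = squared_devs / n    # divide by n
sample_variance = squared_devs / (n - 1)  # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_sd, pstdev(values))  # the two should agree
print(sample_sd, stdev(values))       # the two should agree
```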

In a normal distribution, about 68% of the values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations.
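These percentages can be reproduced from the normal distribution itself. For a normal variable, the share of values within k standard deviations of the mean equals erf(k / sqrt(2)), where erf is the error function available in Python's `math` module. The function name `share_within` is simply a label chosen for this sketch.

```python
import math


def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.1%}")
```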

A manager may use the standard deviation to understand how much sales figures (e.g. sales revenues) at a specific time vary from the average. Using historical data, a sales manager can identify changes in sales performance, whether positive or negative, minor or significant, and assess the degree of dispersion of the values around their mean. This is especially important in sales, as unusual decreases have to be monitored closely in order to adjust, for example, marketing tools. Moreover, if the standard deviation is large, the sales department must be prepared to satisfy customer needs in times of high demand. The standard deviation also helps the sales manager assess the error to which a sample mean is subject when it is used to estimate the mean of future sales.

### Q7

"Researchers find that the correlation between yearly beer consumption and yearly deaths from heart disease is -0.53. Thus, it is reasonable to conclude that increased consumption of beer causes fewer deaths from heart disease in industrialized countries."

The association between two variables, in this case yearly beer consumption in litres per person and yearly deaths from heart disease per 100,000 people, is known as correlation. It indicates the strength of the linear relationship between two numerical variables. If the points in a two-dimensional graph cluster closely around a straight line, a strong relationship between the two variables is evident. If the line rises from left to right, the relationship is positive; if it descends from left to right, the relationship is negative.

In the example given, the correlation between the two variables is -0.53, so the relationship between yearly beer consumption and yearly deaths from heart disease is negative. As the reporter states, one might assume that the higher the beer consumption, the fewer people die of heart disease in industrialised countries. However, correlation does not mean causation: a correlation does not necessarily mean that a change in one variable, such as beer consumption, causes a change in another, such as declining death rates. Other factors may help to explain the lower number of deaths from heart disease, for example increased health consciousness or better medical care due to technological progress. (Albright et al., 2009)
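To make the idea of a correlation coefficient concrete, the sketch below computes Pearson's r from first principles. The beer and death figures are invented for illustration and are not the researchers' data; the point is only that a downward-sloping cloud of points gives a negative r, while r on its own says nothing about cause.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical country-level figures
beer_litres = [1, 2, 3, 4, 5]
heart_deaths = [10, 8, 9, 5, 6]

r = pearson_r(beer_litres, heart_deaths)
print(f"r = {r:.2f}")  # negative: deaths tend to be lower where consumption is higher
```

A value of r close to -1 or +1 means the points lie near a straight line; a value near 0 means there is little linear relationship.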

### Q8

Interpreting the descriptive statistics for the variable hours of sunshine per day: the mean indicates that the sun shines for an average of 7.138 hours per day. The standard error is reported as 2.119; roughly 68% of estimates are expected to fall within one standard error of the true value. Because this standard error is high relative to the mean, the estimate is relatively imprecise. The mode indicates that the most frequently observed value is 8 hours of sunshine per day. The standard deviation, which measures the spread of the data about the mean, is 1.659. The sample variance is the square of the standard deviation, in this case 2.752. The kurtosis indicates that the distribution is peaked, judged against the kurtosis of 3 that characterises a normal distribution. The skewness of -2.003 implies that the distribution is negatively skewed: its long tail extends to the left on the X-axis. The total number of observed values is 365, so it can be assumed that a whole year was observed. The sum of the observed values is 2602. The minimum observed is 2 hours of sunshine per day and the maximum is 13 hours, which gives the range of 11. (Albright et al., 2009)

Interpreting the descriptive statistics for the variable sales of suntan lotion per day: the mean indicates that average sales of suntan lotion are 1639.427 units per day. The standard error is reported as 186.814; roughly 68% of estimates are expected to fall within one standard error of the true value. Because this standard error is low relative to the mean, the estimate is relatively precise. The mode indicates that the most frequently observed value is 1512 units of suntan lotion per day. The standard deviation, which measures the spread of the data about the mean, is 1026.341. The sample variance is the square of the standard deviation, in this case 1053373.796. The kurtosis indicates that the distribution is peaked. The skewness of 0.054 implies that the distribution is very slightly positively skewed: its longer tail extends to the right on the X-axis. The total number of observed values is again 365. The sum of the observed values is 598123, which can be interpreted as the total sales of suntan lotion for one year. The minimum observed is 110 units per day and the maximum is 3769 units, which gives the range of 3659 units. (Albright et al., 2009)
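Two of the relationships stated above, that the variance is the square of the standard deviation and that the range is the maximum minus the minimum, can be checked directly against the reported figures. The sketch below does this for both variables; the dictionary layout and the function name `is_consistent` are choices made for this illustration, and a small tolerance is allowed because the reported values are rounded.

```python
import math

# Figures as reported in the descriptive statistics for each variable
sunshine = {"sd": 1.659, "variance": 2.752, "min": 2, "max": 13, "range": 11}
lotion = {"sd": 1026.341, "variance": 1053373.796, "min": 110, "max": 3769, "range": 3659}


def is_consistent(stats, rel_tol=1e-3):
    """True if variance matches sd squared (within rounding) and range is max minus min."""
    variance_ok = math.isclose(stats["sd"] ** 2, stats["variance"], rel_tol=rel_tol)
    range_ok = stats["max"] - stats["min"] == stats["range"]
    return variance_ok and range_ok


print("sunshine hours:", is_consistent(sunshine))
print("suntan lotion: ", is_consistent(lotion))
```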

### Q9

The scatterplot indicates the relationship between two variables. The type of behaviour observed in a scatterplot can be summarized by correlation.

The scatterplot indicates an upward trend: a line of best fit drawn through the points would slope upward. As sunshine increases, the data suggest that the demand for suntan lotion increases. The relationship between the two variables is therefore positive.

The correlation of 0.428 indicates a positive but not very strong linear relationship between the two variables, sunshine and suntan lotion sales.

### Q10

The R-square gives the percentage of variation in the dependent variable (suntan lotion demand) that is explained by the regression. The regression results indicate that the model explains 67.4% of the variation in suntan lotion demand through the independent variable (sunshine). At this percentage, the model has fairly strong explanatory power for suntan lotion sales.

The adjusted R-square penalises the R-square for the number of explanatory variables and is mainly used to judge whether additional explanatory variables belong in the equation. Here the adjusted R-square is 0.661, slightly lower than the R-square of 0.674. An extra variable that lowers the adjusted R-square is not pulling its weight and should be excluded.
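The usual formula for the adjusted R-square is 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of explanatory variables. The sketch below applies it to the reported R-square of 0.674. The number of observations behind the regression is not stated, so the value of n used here is purely an assumption for illustration, and the result is not expected to reproduce the reported 0.661 exactly.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-square: R-square penalised for the number of explanatory variables."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


R_SQUARED = 0.674   # reported in the regression output
ASSUMED_N = 30      # hypothetical sample size; not given in the source

one_variable = adjusted_r_squared(R_SQUARED, ASSUMED_N, 1)
two_variables = adjusted_r_squared(R_SQUARED, ASSUMED_N, 2)

print(f"adjusted R-square, 1 explanatory variable:  {one_variable:.3f}")
print(f"adjusted R-square, 2 explanatory variables: {two_variables:.3f}")
```

With the R-square held fixed, adding a second variable lowers the adjusted figure, so a new variable only earns its place if it raises the R-square by more than the penalty.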

Yet it may be helpful to include other variables in order to identify their effects on suntan lotion sales. Together with sunshine, another variable might explain a higher percentage of the variation.

The standard error of estimate measures the typical size of the prediction errors, i.e. how far the observed values fall from the regression line. Here it is 202.854. Compared with the standard deviation of 1026.341 (as given in Q8), this is small, so the regression predicts suntan lotion sales considerably more accurately than the mean alone would.

The equation for the predicted suntan lotion sales is: predicted sales = 1043.565 + 20.523 × sunshine hours, which is graphically illustrated below.

The independent (explanatory) variable, sunshine, has a coefficient of 20.523, which means that for every extra hour of sunshine the predicted demand for suntan lotion increases by 20.523 units.
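The fitted equation can be turned into a small prediction function. The intercept and slope below are the reported coefficients; the function name and the sunshine values fed into it are chosen only for illustration.

```python
INTERCEPT = 1043.565  # predicted sales at zero hours of sunshine
SLOPE = 20.523        # extra units sold per additional hour of sunshine


def predicted_sales(sunshine_hours):
    """Predicted daily suntan lotion sales (units) for a given number of sunshine hours."""
    return INTERCEPT + SLOPE * sunshine_hours


for hours in (2, 8, 13):  # observed minimum, mode and maximum of sunshine hours
    print(f"{hours:>2} hours -> {predicted_sales(hours):.1f} units")

# The difference between consecutive hours equals the slope
print(round(predicted_sales(8) - predicted_sales(7), 3))
```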

The p-value is the probability of obtaining a result at least as extreme as the one observed if there were in fact no relationship, i.e. by chance alone. In the given case the p-value is 0.051, which provides moderate evidence that the result did not occur by chance. P-values between 0.05 and 0.1 are considered a "grey area" in which gathering more sample evidence is recommended. The significance of the given sample is therefore considered moderate. (Render et al., 2009, p. 503)

### References and Bibliography

Albright, S.C., Winston, W.L., & Zappe, C.J. (2009). Data Analysis & Decision Making with Microsoft Excel (Revised 3rd ed.). Mason, OH: South-Western.

Balanced Scorecard Institute, a Strategy Management Group company (2010). Basic Tools for process improvement. Retrieved January 13, 2010 from http://www.balancedscorecard.org/Portals/0/PDF/histgram.pdf

Render, B., Stair, R.M., & Hanna, M.E. (2009). Quantitative Analysis for Management (10th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Te Tari Awhina, the Learning Development Centre (2009). KEYS to STATISTICS SUCCESS. Auckland: Auckland University of Technology.

Wisniewski, M. (2006). Quantitative Methods for Decision Makers. Harlow: Pearson Education Limited.