Tests of Significance: Uses and Limitations
Abstract
Statistical tools are undoubtedly important in decision making. Their use in everyday problems has led to a number of discoveries, conclusions and enhancements of knowledge. Their application ranges from direct calculation with general statistical formulas to formulas built into statistical software to speed up decision making.
Significance tests, the statistical tools for testing hypotheses, are powerful, but only if they are used correctly and with a good understanding of their concepts and limitations. Some researchers have used these tests wrongly, leading to wrong conclusions.
This paper looks at the different significance tests (both parametric and nonparametric), their uses, when they should be used and their limitations. It also evaluates the use of statistical significance tests in Information Retrieval and then examines the significance tests used by researchers in the papers submitted to the Special Interest Group on Information Retrieval (SIGIR) in 2006, 2007 and 2008. For the combined period 2006-2008, including the years 2006 and 2008, of the papers submitted had statistical tests used and of these tests were used wrongly.
Key Words: Significance Test, Information Retrieval, Parametric Tests, Nonparametric Tests, Hypothesis Testing
Chapter One
1.0 Introduction
Statistical methods play a very important role in all aspects of research, from data collection and recording through analysis to drawing conclusions and inferences. The credibility of the research results and conclusions depends on every one of these steps; a fault in any of them can render worthless a study that took several years and millions of shillings to carry out.
Carrying out any test and quoting figures does not by itself show that statistics has been used properly in a given study; the researcher should be able to justify why he or she used that specific test or method.
Misuse of significance tests is not new in the world of science. According to Campbell (1974), there are different types of statistical misuse:
1.0.1 Discarding unfavorable portion of data
This occurs when the researcher keeps only the portion of the data that produces the results he/she requires and discards the rest. After a well-conducted study, the researcher might get values that are not consistent with what he/she was expecting, and might decide to ignore that section of the data during the analysis so as to get the “expected results”. This is the wrong approach, since the inconsistent data could give new insights into that particular field: if the irregularities are checked and the reasons they occurred are explained, more ideas about that area can be explored.
1.0.2 Overgeneralization
Sometimes the conclusions from a study hold only for that particular research problem, but the researcher blindly generalizes the results to other kinds of research, similar or dissimilar. Overgeneralization is a common mistake in current research. After successfully completing a study in a particular field, a researcher might be tempted to extend its conclusions to other fields of study without regard for the different characteristics of the populations involved and the assumptions made about them.
1.0.3 Non representative sample
This arises when the researcher selects a sample that produces results geared towards his/her liking. The sample selected for a particular study should be one that truly represents the entire population, and the procedure for selecting the sample units should be unbiased.
1.0.4 Consciously manipulating data
This occurs when a researcher consciously changes the collected data in order to reach a particular conclusion. It is mainly seen when the researcher knows exactly what the customer's aims are and changes part of the data so that those aims are strongly supported. For example, if a researcher carrying out a regression analysis draws a scatter plot and sees that there are many outliers, he/she might decide to change some values so that the points fall on a straight line or something very close to it. This leads to results that are appealing to the customer and to other users but that do not give a true picture of what is really happening in the population at large.
1.0.5 False correlation
This is observed when the researcher claims that one factor causes another while in reality both factors are caused by a hidden factor that was not identified during the study. Correlation studies are common in the social sciences, and when they are not approached adequately they yield unreliable results. In a correlation study checking whether variable X causes variable Y, there are in fact four possibilities. The first is that X causes Y; the second is that Y causes X; the third is that X and Y are both caused by another, unidentified variable, say Z; and the last is that the correlation between X and Y occurred purely by chance.
All these possibilities should be checked in this kind of study to avoid rushing to wrong conclusions. False causality can be eliminated by using two groups for the same experiment, that is, the “control group” (the one receiving a placebo) and the “treatment group” (the one receiving the treatment).
Even though this method is effective, implementing it raises many challenges. There are ethical issues, for instance when patients in one group are given a placebo (a drug with no effect) without their knowledge while the other group is given the real drug. One question comes to mind: is it ethical to do this to the first group? Carrying out the experiment in parallel for two different groups can also prove to be very expensive.
1.0.6 Overloaded questions
The questions used in a survey can strongly affect its outcome. The structure of the questions in a questionnaire and the way they are formulated and asked can influence how the respondent answers them. Long, wordy questions can bore a respondent, who might then fill in the questionnaire in a hurry just to finish it, without really caring about the answers provided. The framing of questions can also produce leading questions, which steer the respondent towards a particular answer, for example “The government is not offering security to its citizens, do you agree to this? (Yes or No)”
Statistical significance testing has been in use for more than 300 years (Huberty, 1993). Despite this long history, this field of decision making has been criticized from all directions, and many researchers have written about the problems of statistical significance testing. Harlow et al. (1997) discussed the controversy in significance testing in depth. Carver (1993) expressed dislike of significance tests and clearly urged researchers to stop using them.
In his book, How to Lie with Statistics, Huff (1954) outlined in depth the errors, both intentional and unintentional, and the misinterpretations made in statistical analyses. Some bodies, e.g. the American Psychological Association (APA), recommended minimal use of statistical significance tests by researchers submitting papers for publication (APA, 1996), though without revoking the use of the tests.
Despite the relentless criticism, other researchers have not given up on statistical significance testing but have encouraged users of the tests to understand them well before drawing conclusions from them. Mohr (1990) discussed the use of these tests and supported it, while warning researchers to know the limitations of each test and to apply the tests correctly so as to make correct inferences and conclusions. In his paper, Burr (1960) supported the use of statistical significance tests but asked researchers to make allowances for the existence of statistical errors in the data.
Amidst these controversies, statistical significance testing has been applied in many areas of research and remarkable achievements have been recorded. One such area is information retrieval (IR), where significance tests have been used to compare different algorithms.
1.1.0 Information retrieval
Information retrieval is defined as the science of searching databases, the World Wide Web and other document collections for information on a particular subject. To get information, the user enters keywords to be used for searching. A set of objects containing the keywords is usually returned, from which the user can single out the one that gives him or her the required information.
The user usually progressively refines the search by narrowing down and using specific words. Information retrieval has developed as a highly dynamic and empirical discipline, requiring careful and thorough evaluation to show the superior performance of different new techniques on representative document collections.
There are many algorithms for information retrieval. It is usually important to measure the performance of different information retrieval systems so as to know which one gives the required information faster. In order to measure information retrieval effectiveness, three test items are required:
 (i) A collection of documents on which the different retrieval methods will be run and compared.
 (ii) A test collection of information needs which are expressible in terms of queries.
 (iii) A collection of “relevance judgments” that distinguish whether the results returned are relevant or irrelevant to the person doing the search.
A question might arise as to which collection of objects should be used in testing different systems. There are several standard test collections used universally. These include:
(i) Text Retrieval Conference (TREC). This is a standard collection comprising 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and specified in detailed text passages. Individual test collections are defined over different subsets of this data.
(ii) GOV2. This was developed by the U.S. National Institute of Standards and Technology (NIST). It is a collection of 25 million web pages.
(iii) NII Test Collections for IR Systems (NTCIR). This is also a large test collection, focusing mainly on East Asian languages and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.
(iv) Cross Language Evaluation Forum (CLEF). This test collection is mainly focused on European languages and cross-language information retrieval.
(v) 20 Newsgroups. This text collection was collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.
(vi) The Cranfield collection. This is the oldest test collection allowing precise quantitative measures of information retrieval effectiveness, but it is nowadays too small for anything but the most elementary pilot experiments. It was collected in the United Kingdom starting in the late 1950s and it contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.
There exist several methods of measuring the performance of retrieval systems, namely Precision, Recall, Fall-out, E-measure and F-measure, to mention just a few, since researchers keep coming up with new methods.
A brief description of each method will shed some light.
1.1.1 Recall
Recall in information retrieval is defined as the number of relevant documents returned from a search divided by the total number of relevant documents in the database. Recall can also be seen as evaluating how well the retrieval method gets the required information.
Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then
\[ \text{Recall} = \frac{|A \cap B|}{|B|} \qquad (1.1) \]
As an example, suppose a database contains 500 documents, of which 100 contain relevant information required by a researcher; the complement, the number of documents not required, is 400.
If the researcher uses a system to search for the documents in this database and it returns 100 documents, all of which are relevant to the researcher, then the recall is given by:
\[ \text{Recall} = \frac{100}{100} = 1 \]
Suppose instead that out of 120 returned documents, 30 are irrelevant (so 90 are relevant); then the recall would be given by
\[ \text{Recall} = \frac{90}{100} = 0.9 \]
1.1.2 Precision
Precision is defined as the number of relevant documents retrieved by the system divided by the total number of documents retrieved in that search. It evaluates how well the retrieval method filters out unwanted information.
Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then
\[ \text{Precision} = \frac{|A \cap B|}{|A|} \qquad (1.2) \]
As an example, suppose a database contains 500 documents, of which 100 contain relevant information required by a researcher; the complement, the number of documents not required, is 400.
If the researcher uses a system to search for the documents in this database and it returns 100 documents, all of which are relevant to the researcher, then the precision is given by:
\[ \text{Precision} = \frac{100}{100} = 1 \]
Suppose instead that out of 120 returned documents, 30 are irrelevant (so 90 are relevant); then the precision would be given by
\[ \text{Precision} = \frac{90}{120} = 0.75 \]
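The recall and precision examples above can be checked with a short Python sketch. The function name `precision_recall` and the document identifiers are illustrative assumptions, not part of any retrieval system; only the counts (500 documents, 100 relevant, 100 or 120 returned) come from the text.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given two sets of document ids."""
    hits = len(retrieved & relevant)  # relevant documents that were actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: the database holds documents 0..499 and ids 0..99 are relevant.
relevant = set(range(100))

# Case 1: 100 documents returned, all of them relevant.
p1, r1 = precision_recall(set(range(100)), relevant)

# Case 2: 120 documents returned, 30 of them irrelevant (so 90 are relevant).
returned = set(range(90)) | set(range(100, 130))
p2, r2 = precision_recall(returned, relevant)

print(p1, r1)  # 1.0 1.0
print(p2, r2)  # 0.75 0.9
```

Representing the retrieved and relevant documents as sets keeps the code close to the set definitions of the two measures.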
Both precision and recall are based on one term: relevance. The Oxford dictionary defines relevance as “connected to the issue being discussed”.
Yolanda Jones (2004) identified three types of relevance, namely:
Subject relevance: the connection between the subject submitted via a query and the subject covered by the returned texts.
Situational relevance: the connection between the situation being considered and the texts returned by the database system.
Motivational relevance: the connection between the motivations of a researcher and the texts returned by the database system.
There are two measures of relevance;
 Novelty Ratio: This refers to the proportion of items returned from a search and acknowledged by the user as relevant that the user was previously unaware of.
 Coverage Ratio: This refers to the proportion of the relevant documents that the user was aware of before starting the search that are returned by the search.
Precision and recall affect each other, i.e. an increase in recall tends to decrease precision.
If a system is made to retrieve more documents, recall increases, but the system also retrieves more irrelevant documents, which reduces its precision. A tradeoff between the two measures is therefore required to ensure good search results.
Precision and recall measures make the following assumptions:
A system either returns a document or it does not.
A document is either relevant or not relevant, with nothing in between.
Researchers are introducing new methods that rank the degree of relevance of the documents.
1.1.3 Receiver Operating Characteristics (ROC) Curve
This is the plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity). Sensitivity is just another term for recall. The false positive rate is given by $fp/(fp + tn)$, where $fp$ is the number of false positives (irrelevant documents retrieved) and $tn$ is the number of true negatives (irrelevant documents not retrieved). An ROC curve always goes from the bottom left to the top right of the graph. For a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by $tn/(fp + tn)$, is not seen as a very useful idea: because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the value of the false positive rate would be almost 0).
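As a rough illustration of how an ROC curve is traced for a ranked result list, the sketch below walks down a ranking and records the (false positive rate, true positive rate) pair after each document. The function name `roc_points` and the six-document ranking are hypothetical, and the sketch assumes the ranking covers the whole collection and contains at least one relevant and one irrelevant document.

```python
def roc_points(ranked_relevance):
    """Return the (FPR, TPR) points of an ROC curve.

    ranked_relevance: list of booleans, best-ranked document first,
    True when the document at that rank is relevant.
    """
    positives = sum(ranked_relevance)              # all relevant documents
    negatives = len(ranked_relevance) - positives  # all irrelevant documents
    tp = fp = 0
    points = [(0.0, 0.0)]                          # nothing retrieved yet
    for is_relevant in ranked_relevance:
        if is_relevant:
            tp += 1                                # curve moves up
        else:
            fp += 1                                # curve moves right
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical ranking of a six-document collection.
ranking = [True, True, False, True, False, False]
points = roc_points(ranking)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The curve starts at the bottom left (0, 0) and ends at the top right (1, 1); a ranking that places relevant documents first climbs steeply on the left.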
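1.1.4 F-measure and E-measure
The F-measure is defined as the weighted harmonic mean of the recall and precision. Writing $P$ for precision and $R$ for recall, it is defined numerically as
\[ F_{\beta} = \frac{(1 + \beta^{2})\,P\,R}{\beta^{2}P + R} \qquad (1.3) \]
where $\beta$ is the weight.
If $\beta$ is assumed to be 1, then
\[ F_{1} = \frac{2PR}{P + R} \qquad (1.4) \]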
The Emeasure is given by(1.5)
The E-measure has a maximum value of 1.0, with 1.0 being the best.
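The weighted harmonic mean of precision and recall described in this section can be sketched in a few lines of Python. The function name `f_measure` is an assumption for illustration, the formula used is the common $(1 + \beta^2)PR / (\beta^2 P + R)$ form, and the input values 0.75 and 0.9 are the precision and recall from the 120-document example earlier in the chapter.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (the F-measure)."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:  # both measures are zero: define F as 0
        return 0.0
    return (1 + b2) * precision * recall / denominator


f1 = f_measure(0.75, 0.9)              # balanced F-measure, beta = 1
f2 = f_measure(0.75, 0.9, beta=2.0)    # beta > 1 gives recall more weight
print(round(f1, 4), round(f2, 4))
```

With beta equal to 1 the value lies between the precision and the recall and is pulled towards the smaller of the two, which is the usual reason for preferring a harmonic mean over an arithmetic one here.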
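1.1.5 Fall-out
This is defined as the proportion of irrelevant documents that are returned in a search out of all the possible irrelevant documents.
\[ \text{Fall-out} = \frac{|A \cap \bar{B}|}{|\bar{B}|} \qquad (1.6) \]
where $A$ is the set of all retrieved objects and $\bar{B}$ is the set of all irrelevant objects (the complement of the set $B$ of relevant objects).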
It can also be defined as the probability of a system retrieving an irrelevant document.
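A minimal sketch of the fall-out computation, reusing the 500-document example from the recall and precision sections (100 relevant documents, 120 returned of which 30 are irrelevant). The function name `fall_out` and the document identifiers are hypothetical.

```python
def fall_out(retrieved, relevant, collection):
    """Proportion of the irrelevant documents in the collection that were retrieved."""
    irrelevant = collection - relevant
    if not irrelevant:  # no irrelevant documents at all
        return 0.0
    return len(retrieved & irrelevant) / len(irrelevant)


collection = set(range(500))                      # hypothetical document ids
relevant = set(range(100))                        # 100 relevant documents
retrieved = set(range(90)) | set(range(100, 130)) # 90 relevant + 30 irrelevant

value = fall_out(retrieved, relevant, collection)
print(value)  # 0.075, i.e. 30 of the 400 irrelevant documents were returned
```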
These are just a few methods of measuring the performance of search systems. Having evaluated one system, there arises the problem of comparing two systems or algorithms, that is, is this system better than the other one?
To answer this question, scientists in information retrieval use statistical significance tests to establish whether the difference in the systems' performance could have arisen by chance. These tests are used to judge whether one system is genuinely better than another.
Statement of the problem
Statistical inference tools like statistical significance tests are important in decision making. Their use has been on the rise in different areas of research. With this rise, novice users apply these tools in questionable ways. Many researchers do not understand the basic concepts of statistics, which leads to misuse of the tools. Any conclusions reached from a study may be considered unreliable if the statistical tests used in it are poorly applied.
More light needs to be shed on this area of research to ensure correct use of these tests. Researchers in Information Retrieval also use these tests to compare systems and algorithms. Are the conclusions from these tests truly correct? Are there any other ways of comparison which minimize the use of statistical tests?
Objectives of the study
The objectives of this study are:
Investigate use and misuse of statistical significance tests in scientific papers submitted by researchers to SIGIR.
Shed light on different statistical significance tests, their use, assumptions and limitations.
Identify the most important statistical concepts that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.
Investigate the reality of the problems of statistical significance in scientific papers submitted by researchers to SIGIR.
Investigate the use of statistical significance tests by researchers in Information Retrieval.
Discover the availability of statistical concepts and methods that can provide solutions to the problems of statistical significance in scientific papers submitted by researchers to SIGIR.
Chapter Two
This section of the paper is divided into three major parts. The first, on sample selection and sample size, discusses methods of selecting a sample and the size of the sample to be used in a given study. The second deals with statistical analysis methods and procedures, mainly in significance testing. The third discusses other statistical methods that can be used in place of statistical significance tests.
2.0 Sample Selection and Sample Size
2.0.1 Sample selection
Sampling plays a major role in research. According to Cochran (1977), sampling is the process of selecting a portion of the population and using the information derived from this portion to make inferences about the entire population.
Sampling has several advantages, namely:
(i) Reduced cost
It is far more expensive to carry out a census than to collect information from a small portion of the population. Only a small number of measurements are made, so only a few people need to be hired, whereas a complete census requires a large labor force.
(ii) Greater speed (less time)
Since only a few items are measured, the time taken for measurement is reduced and the data can be summarized quickly, as opposed to when measurements are taken for the whole population.
(iii) Greater accuracy
Since only a few units are considered, the researchers can be very thorough. Covering the entire population may leave the researchers fatigued midway through the process, leading to careless data collection and shoddy analysis.
The choice of the sampling units in a given study may affect the credibility of the whole study. The researcher must make sure that the sample being used is not biased, that is, that it represents the whole population.
There are several methods of selecting samples to be used in a study. A researcher should always make sure that the sample drawn is large enough to be representative of the population as a whole and at the same time manageable. In this section the two major types of sampling, random and nonrandom, will be examined.
2.0.1.1 Random sampling
In random sampling, all the items or individuals in the population have equal chances of being selected into the sample. This procedure ensures that no bias is introduced during the selection of sample units, since an item's selection is determined only by chance and does not depend on the person assigned the duty of drawing the sample. There exist five major random sampling techniques, namely simple random sampling, multistage sampling, stratified sampling, cluster sampling and systematic sampling. The following section discusses each of these.
2.0.1.1.1 Simple random sampling
In simple random sampling, each item in the population has the same chance of being included in the sample. Usually each sampling unit is assigned a unique number, numbers are then generated using a random number generator, and a sampling unit is included in the sample if its number is generated.
One advantage attributed to simple random sampling is its simplicity and ease of application when dealing with small populations. However, every entity in the population has to be listed and given a unique number before the random numbers are read, which makes this method of sampling tedious and cumbersome, especially where large populations are involved.
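The numbering-and-drawing procedure described above can be sketched with Python's standard `random` module. The population of 500 numbered units, the sample size of 25 and the seed are illustrative assumptions; `random.Random.sample` draws without replacement, so no unit can be selected twice.

```python
import random

# Hypothetical population: sampling units numbered 1..500.
population = list(range(1, 501))

rng = random.Random(2024)             # fixed seed so the draw can be repeated
sample = rng.sample(population, 25)   # 25 distinct units, each equally likely

print(sorted(sample))
```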
2.0.1.1.2 Stratified sampling
In stratified random sampling, the entire population is first divided into N disjoint subpopulations. Each sampling unit belongs to one and only one subpopulation. These subpopulations are called strata. They may be of different sizes, the units within each stratum are homogeneous, and each stratum differs markedly from the others. It is from these strata that samples are drawn for a particular study. Examples of strata that are commonly used include states, provinces, age and sex, religion, academic ability and marital status.
Stratification is most useful when the stratifying variables are simple to work with, easy to observe and closely related to the topic of the survey (Sheskin, 1997).
Stratification can be used to select more units from one group than from another. This may be done if it is felt that the responses vary more in one group than in another. If the researcher knows that every entity in a group has much the same value, only a small sample is needed to get information for that group, whereas in another group the values may differ widely and a bigger sample is needed.
To combine group-level information into an answer for the whole population, the researcher has to take account of the proportion selected from each group. This method is mainly used when information is required for only a particular subdivision of the population, when administrative convenience is an issue, or when the sampling problems differ greatly in different portions of the population of study.
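A minimal sketch of stratified sampling, assuming proportional allocation (the same sampling fraction in every stratum) and two made-up strata named "urban" and "rural". The function name `stratified_sample` and all the numbers are hypothetical; a design that samples more heavily from a more variable stratum would pass a different fraction per stratum.

```python
import random


def stratified_sample(strata, fraction, seed=0):
    """Draw a simple random sample of the given fraction from every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        size = max(1, round(fraction * len(units)))  # at least one unit per stratum
        sample[name] = rng.sample(units, size)
    return sample


# Hypothetical population of 1000 units split into two disjoint strata.
strata = {
    "urban": list(range(0, 600)),
    "rural": list(range(600, 1000)),
}
sample = stratified_sample(strata, fraction=0.05)

for name, units in sample.items():
    print(name, len(units))
```

Because each stratum is sampled in proportion to its size here, the combined sample can be analysed without reweighting; with unequal fractions, each stratum's result would have to be weighted by its share of the population.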
2.0.1.1.3 Systematic sampling
Systematic sampling is quite different from the other methods of sampling. Suppose the population contains N units and a sample of n units is required. A random number, call it k, is generated using a random number generator; a unit (represented as a number) is then drawn from the population, and the researcher picks every kth unit thereafter. For example, if k is 20 and the first unit drawn is 5, the subsequent units will be 25, 45, 65, 85 and so on.
The implication of this method is that the whole sample is determined by the first item alone, since the rest are obtained sequentially. This type is called an every kth systematic sample. The technique can also be used when questioning people in a sample survey. A researcher might select every 15th person who enters a particular store, after selecting a person at random as a starting point, or interview the shopkeepers of every 3rd shop in a street, after selecting a starting shop at random.
A researcher may want to select a sample of fixed size. In this case, it is first necessary to know the size of the population from which the sample is being selected. The appropriate sampling interval, I, is then calculated by dividing the population size, N, by the required sample size, n. This method is advantageous since it is easy and can be more precise than simple random sampling.
It is also simpler in systematic sampling to select one random number and then every kth member on the list than to select as many random numbers as the sample size. The method also gives a good spread right across the population. A disadvantage is that the researcher needs a starting list in order to know the population size and calculate the sampling interval.
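The every-kth selection described in this section can be sketched as follows. The function name `systematic_sample` and the population sizes are illustrative assumptions; the first call reproduces the example in the text (interval 20, first unit 5), and the second computes the interval as I = N / n for a fixed sample size.

```python
import random


def systematic_sample(population_size, interval, start=None, seed=None):
    """Return the unit numbers start, start + interval, ... up to population_size.

    Units are assumed to be numbered 1..population_size. When no start is
    given, one is drawn at random from the first interval.
    """
    if start is None:
        start = random.Random(seed).randint(1, interval)
    return list(range(start, population_size + 1, interval))


# The example from the text: k = 20 and the first unit drawn is 5.
example = systematic_sample(100, 20, start=5)
print(example)  # [5, 25, 45, 65, 85]

# Fixed sample size: N = 1000 units, n = 50 required, so I = N / n = 20.
N, n = 1000, 50
interval = N // n
fixed = systematic_sample(N, interval, seed=1)
print(len(fixed))  # 50
```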
2.0.1.1.4 Cluster sampling
According to the Australian Bureau of Statistics, cluster sampling divides the population into groups, or clusters. A number of clusters are selected randomly to represent the population, and then all units within the selected clusters are included in the sample. No units from nonselected clusters are included in the sample; they are represented by those from the selected clusters. This differs from stratified sampling, where some units are selected from each group.
The units within each cluster are heterogeneous (that is, the sampling units inside a cluster vary widely from each other), while the clusters themselves resemble one another. Cluster sampling has several advantages, which include reduced costs, simplified field work and more convenient administration. Instead of having a sample scattered over the entire coverage region, the sample is concentrated in relatively few collection points (clusters).
Cluster sampling generally provides results that are less accurate than those of stratified random sampling.
2.0.1.1.5 Multistage sampling
Multistage sampling is like cluster sampling, but involves selecting a sample within each chosen cluster rather than including all units in the cluster. The Australian Bureau of Statistics describes multistage sampling as selecting a sample in at least two stages. In the first stage, large groups or clusters are selected. These clusters are designed to contain more population units than are required for the final sample.
In the second stage, population units are chosen from the selected clusters to derive the final sample. If more than two stages are used, the process of choosing population units within clusters continues until the final sample is achieved. If two stages are used it is called two-stage sampling, if three stages are used it is called three-stage sampling, and so on.
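The difference between cluster sampling (all units of the chosen clusters) and two-stage sampling (a sample within each chosen cluster) can be sketched as below. The function names, the eight "villages" of ten households each and the sample sizes are all hypothetical.

```python
import random


def cluster_sample(clusters, n_clusters, rng):
    """Select clusters at random and keep every unit in the selected clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]


def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Select clusters at random, then a random sample of units within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen
            for unit in rng.sample(clusters[name], n_per_cluster)]


# Hypothetical population: 8 villages (clusters) of 10 households each.
clusters = {
    f"village_{i}": [f"village_{i}/household_{j}" for j in range(10)]
    for i in range(8)
}

rng = random.Random(42)
one_stage = cluster_sample(clusters, n_clusters=3, rng=rng)
two_stage = two_stage_sample(clusters, n_clusters=3, n_per_cluster=4, rng=rng)

print(len(one_stage), len(two_stage))  # 30 12
```

In both designs the households of the five villages that are not selected never enter the sample.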
2.0.2 Determination of sample size to be used
2.1 Statistical Analysis
In this section, different statistical tests are discussed in detail in their general form, followed by a discussion of how each of those used in IR is applied to information retrieval. Only some of these tests are used to compare systems and/or algorithms.
In this paper we look at three sections of statistical analysis, namely:
(i) Summarizing data using a single value.
(ii) Summarizing variability.
(iii) Summarizing data using an interval (no specific value).
In the first case we have the mean, mode, median etc.; in the second case we look at variability in the data; and in the third case we look at confidence intervals and at parametric and nonparametric tests of hypotheses.
2.1.1 Summarizing data using a single value
In this case, the data being analyzed is represented by a single value. Examples of this are discussed below:
2.1.1.1 Mean
There are three different kinds of mean:
(i)Arithmetic mean
(ii)Geometric Mean
(iii)Harmonic mean
(i) Arithmetic mean
This is computed by summing all the observations and then dividing by the number of observations collected.
Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The arithmetic mean is defined as
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \]
When to use the arithmetic mean
The arithmetic mean is used when:
The collected data are numeric observations.
The data has only one mode (unimodal).
The data is not skewed, i.e. not concentrated at extreme values.
The data does not have many outliers (very extreme values).
The arithmetic mean is not used when:
The data is categorical.
The data is extremely skewed.
(ii) Geometric mean
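This is defined as the nth root of the product of the observations, that is, the product raised to the power 1/n, where n is the number of observations.
Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The geometric mean is defined as
\[ \text{Geometric mean} = \left(\prod_{i=1}^{n} x_i\right)^{1/n} \]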
The Geometric mean is used when:
The observations are numeric.
The item that we are interested in is the product of the observations.
(iii) Harmonic mean
This is defined as the number of observations divided by the sum of the reciprocals of the observations.
Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X. The harmonic mean is defined as
\[ \text{Harmonic mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
The Harmonic mean is used when:
The average can be justified for the reciprocal of the observations.
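The three means can be compared on a small data set with the sketch below. The observations 2, 4 and 8 are made up for illustration; each mean is computed directly from its definition, and Python's standard `statistics` module (version 3.8 or later for `geometric_mean`) offers the same quantities as `mean`, `geometric_mean` and `harmonic_mean`.

```python
import math

x = [2.0, 4.0, 8.0]   # hypothetical positive observations
n = len(x)

arithmetic = sum(x) / n                   # sum divided by the number of observations
geometric = math.prod(x) ** (1 / n)       # nth root of the product
harmonic = n / sum(1 / xi for xi in x)    # n over the sum of reciprocals

print(f"{arithmetic:.3f} {geometric:.3f} {harmonic:.3f}")  # 4.667 4.000 3.429
```

For positive observations that are not all equal, the arithmetic mean is the largest of the three and the harmonic mean the smallest, as the output shows.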
2.1.1.2 Median
This is defined as the middle value of the observations. The observations are first arranged in ascending or descending order, and the middle value is then taken as the median. When the number of observations is even, the median is taken as the average of the two middle values.
The median is used when:
The observations are skewed.
The observations have a single mode.
The observations are numerical.
The median is not used when:
We are interested in the total value.
2.1.1.3 Mode
This is defined as the value that has the highest frequency of occurrence in the given dataset.
The mode is used when:
The dataset is categorical.
The dataset is both numeric and multimodal.
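A short sketch of the median and the mode using Python's standard `statistics` module (`multimode` requires Python 3.8 or later). All the data values below are made up for illustration.

```python
import statistics

odd_median = statistics.median([7, 1, 5, 3, 9])    # middle of 1, 3, 5, 7, 9
even_median = statistics.median([7, 1, 5, 3])      # average of the middle pair 3 and 5

# The mode also works for categorical data.
formats = ["pdf", "html", "pdf", "txt"]
most_common = statistics.mode(formats)

# A multimodal numeric dataset has more than one mode.
modes = statistics.multimode([1, 2, 2, 3, 3, 4])

print(odd_median, even_median)  # 5 4.0
print(most_common, modes)       # pdf [2, 3]
```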
2.1.2 Summarizing variability
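Variability in data can be summarized using the following measures:
2.1.2.1 Sample variance
Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X with sample mean $\bar{x}$; then the sample variance, $s^2$, is given by
\[ s^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^{2} \]
The standard deviation, $s$, is the square root of the sample variance. It is used when:
The data is normally distributed.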
2.1.2.2 The Coefficient of Variation (C.O.V)
This is given by
\[ \text{C.O.V} = \frac{s}{\bar{x}} \]
where $s$ is the standard deviation (the square root of the sample variance) and $\bar{x}$ is the sample mean.
The C.O.V has the advantage that it does not depend on the units of measurement of the observations.
2.1.2.3 Range
This is the difference between the largest value and the smallest value in the data set.
Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X; then the range is given by:
\[ \text{Range} = \max_{i} x_i - \min_{i} x_i \]
The range is mainly used when the distribution is bounded.
2.1.2.4 Mean absolute deviation (M.A.D)
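Let $x_1, x_2, \ldots, x_n$ be n observations of a random variable X with sample mean $\bar{x}$. The M.A.D for the data is given by:
\[ \text{M.A.D} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right| \]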
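The four measures of variability in this section can be computed together, as in the sketch below. The eight observations are made up; `statistics.variance` and `statistics.stdev` use the n − 1 divisor of the sample variance, and the mean absolute deviation is taken about the sample mean.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical observations

mean = statistics.mean(x)
variance = statistics.variance(x)      # sample variance, divisor n - 1
std_dev = statistics.stdev(x)          # square root of the sample variance
cov = std_dev / mean                   # coefficient of variation (unit free)
value_range = max(x) - min(x)          # largest minus smallest observation
mad = sum(abs(xi - mean) for xi in x) / len(x)   # mean absolute deviation

print(f"mean={mean} variance={variance:.3f} sd={std_dev:.3f}")
print(f"C.O.V={cov:.3f} range={value_range} M.A.D={mad}")
```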
2.1.3 Confidence intervals
An interval is a range of values. In this case we do not talk of a single value that a statistic takes but of the possibility of it lying in a given interval, for example whether the mean of a given population lies within the interval [10, 15].
In research and experiments, the researcher usually starts by fixing the confidence level that he/she will be working with, and then computes the confidence interval from the data.
Suppose a researcher uses a 95% confidence level in calculating the confidence interval of the mean. This means that if the sampling were repeated many times and an interval computed in the same way each time, about 95% of those intervals would contain the true mean.
Confidence intervals are widely used in hypothesis testing.
2.1.4 Hypothesis Testing
In hypothesis testing, the researcher usually attempts to check whether 'A' and 'B' are the same or not. For example: is the speed of algorithm A the same as the speed of algorithm B? Is the old system better than the new system?
First the researcher comes up with a hypothesis, called the null hypothesis, and checks this hypothesis against the alternative hypothesis. The null hypothesis is evaluated at a predefined confidence level. The null hypothesis is usually stated positively, for example “Speed of algorithm A is the same as the speed of algorithm B.”
Before we proceed to look at the different statistical significance tests, we explore the different categories of observations.
There are two categories, namely:
Paired observations
Unpaired observations.
The different tests are applied differently for each kind of observation, for example there is a t test for paired data and a different t test for unpaired data.
2.1.4.1 Paired Observations
Suppose a researcher wants to compare two systems, say system X and system Y. If the researcher carries out n experiments, running experiment i on system X and the corresponding experiment i on system Y, then the observations are paired and each pair of experiments is treated as a single experiment having the value $d_i = x_i - y_i$, where $x_i$ and $y_i$ are the measurements on systems X and Y respectively.
2.1.4.2 Unpaired observation
If the measurements on the two systems are made separately (with no corresponding measurements), then the resulting observations are unpaired and it is not practical to compute pairwise differences. In this case each data set is treated separately and only the statistics of interest are compared, rather than the corresponding observations.
2.1.4.3 Null and Alternative Hypotheses
In the previous section we mentioned that in a test of hypothesis there is usually a null hypothesis which is tested against the alternative hypothesis. In this section, these two are discussed in greater detail.
A Null Hypothesis, denoted by $H_0$, represents an idea or notion which is believed to be true but has not been proved. For example, in information retrieval a null hypothesis may be put as follows:
$H_0$: “There is no difference in the speeds of two search algorithms”
An Alternative Hypothesis, denoted by $H_1$, is the statement of what the test wishes to check. It is usually the opposite of the Null Hypothesis.
After the test has been done, the results are usually presented in terms of the Null Hypothesis, either “Reject the Null hypothesis” or “Do not reject the null hypothesis.”
Failing to reject the null hypothesis does not necessarily mean that the null hypothesis is accepted. It only means that not enough evidence was found during the study to reject the null hypothesis.
A hypothesis can either be simple or complex (composite).
A simple hypothesis is one which completely specifies the distribution.
For example, consider a random variable from a normal distribution with mean µ and standard deviation 100; we may test the hypothesis $H_0: \mu = \mu_0$ for a stated value $\mu_0$.
A complex hypothesis does not completely specify the distribution. For example, consider a random variable from a normal distribution with mean µ and standard deviation 100; we may test the hypothesis $H_0: \mu > \mu_0$, which leaves the exact value of µ unspecified.
2.1.4.4 Type I and Type II errors
In a test of hypothesis, two kinds of errors may arise, namely Type I and Type II errors.
A Type I error arises when you reject the null hypothesis when it is in fact true. The probability of a Type I error is usually denoted by α, the level of significance.
A Type II error arises when you fail to reject the null hypothesis when it is in fact false. Its probability is usually denoted by β.
These two errors are related in such a way that, for a fixed sample size, reducing one increases the other.
The following table summarizes the Type I and Type II errors.
Decision | $H_0$ is true | $H_0$ is false
Reject $H_0$ | Type I error (probability α) | Correct decision (probability 1 - β)
Do not reject $H_0$ | Correct decision (probability 1 - α) | Type II error (probability β)
2.1.4.5 Test Statistic
This is a value computed from the collected data and it is used to decide whether or not to reject the null hypothesis. The test statistic to be used in a given hypothesis testing situation depends on the distribution from which the sample comes.
2.1.4.6 Critical Region or Rejection region
This refers to the set of values of the test statistic that lead to rejection of the null hypothesis. This region depends on the significance level, α.
Since the value of the test statistic determines whether or not the null hypothesis is rejected, the sample space of the test statistic is partitioned into two regions. One leads to rejection of the null hypothesis (the critical region) and the other leads to not rejecting the null hypothesis.
2.1.4.7 Significance level
This is defined as a fixed probability of wrongly rejecting the null hypothesis when it is in fact true. It was mentioned above to be denoted by α, and it is also the probability of a Type I error.
Many researchers prefer to use $\alpha = 0.05$, but there is nothing unique about this figure; it is only that it is commonly preferred by many scientists. One can use other values of α provided they are sufficiently low.
2.1.4.8 P-value
This is defined as the probability of seeing results at least as extreme as those observed, given that the null hypothesis is true. It equals the smallest significance level at which the null hypothesis would just be rejected. A result is said to be significant if the p-value is less than the significance level. For example, if one is doing a test of hypothesis with the level of significance being 0.05, then the null hypothesis will be rejected if $p < 0.05$, where p is the p-value.
2.1.4.9 Power of a test
This is used to measure the ability of the test being used to reject the null hypothesis when it is actually false. It is also defined as the probability of not making a Type II error.
Power of a test $= 1 - \beta$
The power of a test ranges from 0 to 1, one being the best.
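To make the relation between power, α and sample size concrete, the following sketch computes the power of a one-sided (upper-tail) z test for a mean. It assumes a normal population with known standard deviation, which is a simplification chosen for illustration. The function name and the numbers passed to it are hypothetical.

```python
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of an upper-tail z test when the true mean exceeds the
    hypothesized mean by `delta` (known standard deviation `sigma`)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # boundary of the critical region
    shift = delta * (n ** 0.5) / sigma       # standardized true difference
    # Probability that the test statistic falls in the critical region.
    return 1 - std_normal.cdf(z_crit - shift)

# With no true difference, the rejection probability equals alpha.
print(z_test_power(0.0, 1.0, 25))
# A larger sample gives more power for the same true difference.
print(z_test_power(0.5, 1.0, 9), z_test_power(0.5, 1.0, 25))
```

The first call shows that when the null hypothesis is true the rejection probability is just the Type I error rate, while the second shows power growing with n.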
A statistical test can either be one-tailed or two-tailed.
(i) One-tailed test
A test is said to be one-tailed if the values which lead to rejection of the null hypothesis are located wholly in one tail of the probability distribution.
For example, if a researcher claims that the average speed of a search algorithm is 0.1, then the test of hypothesis can be formulated as:
$H_0: \mu = 0.1$
Vs
$H_1: \mu > 0.1$
This is an example of a one-sided test, since the critical region will be on the right-hand side alone.
(ii) Two-tailed test
A test is said to be two-tailed if the values which lead to the rejection of the null hypothesis are located in both ends of the distribution.
For example, if a researcher claims that the average speed of a search algorithm is 0.1, then the test of hypothesis can be formulated as:
$H_0: \mu = 0.1$
Vs
$H_1: \mu \neq 0.1$
This is an example of a two-sided test because the critical region is in both the right and left ends.
There exist different tests in statistical analysis and significance testing. The tests can be categorized into two broad categories, namely:
Parametric tests
Nonparametric tests
2.1.5 Parametric tests
In parametric tests, it is assumed that the data used in the test came from a population whose distribution is known. More assumptions are made in parametric tests, so the accuracy of the results depends on whether those assumptions are indeed correct. If they are, the parametric methods give reliable conclusions; otherwise the conclusions may be misleading.
Parametric tests are mainly used where the normality assumption holds, that is, it is assumed that the data came from a population which is normally distributed. For large samples this is often justified by the Central Limit Theorem, which can be summarized as: if the sample size is large, the sampling distribution of the sample mean is approximately normal, whatever the distribution of the population (provided its variance is finite).
Next the different parametric tests are discussed in depth.
2.1.5.1 Comparing one group to a hypothetical value [One Sample t test]
In the one sample t test, the mean of the sample data is compared to a known value, i.e. it is checked whether the population from which the sample was collected has a mean equal to the known value.
Assumptions made
The population from which the data is collected is normally distributed
The population standard deviation, σ, is unknown and is estimated by the sample standard deviation, s.
The data are random samples of independent observations
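The null hypothesis for this test is given by:
$H_0: \mu = \mu_0$, where the value $\mu_0$ is known
The null hypothesis is tested against any one of the following alternative hypotheses
$H_1: \mu \neq \mu_0$, $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$
The t score is used in this test and it is calculated as follows
Let $x_1, x_2, \dots, x_n$ be the sample data
$t = \dfrac{\bar{x} - \mu_0}{s / \sqrt{n}}$
Where $\bar{x}$ is the sample mean,
$\mu_0$ is the hypothesized population mean,
$s / \sqrt{n}$ is the standard error of the mean, with s the sample standard deviation.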
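The t score can be sketched as follows. The response times and the hypothesized mean of 0.10 are made-up values for illustration, and `one_sample_t` is a hypothetical helper name. The resulting statistic would still have to be compared with a t distribution with n - 1 degrees of freedom to obtain a p-value.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # s / sqrt(n)
    t = (mean(sample) - mu0) / standard_error
    return t, n - 1

# Hypothetical search times (seconds) tested against a claimed mean of 0.10.
times = [0.12, 0.09, 0.11, 0.10, 0.13]
t_stat, df = one_sample_t(times, mu0=0.10)
print(t_stat, df)
```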
2.1.5.2 Comparing two unpaired groups [Unpaired t test]
The unpaired t test is used to test the null hypothesis that the means of two independent random samples from normal distributions are equal.
Assumptions made:
The population from which the data is collected is normally distributed.
The samples are independent.
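It has two different approaches: one when it is assumed that the variances of the two populations are equal, and another when the two variances are not equal.
(i) If the two variances are equal the test statistic is calculated as follows:
$t = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}}$, with $n_1 + n_2 - 2$ degrees of freedom
Where $\bar{x}_1$ is the sample mean of the first sample,
$\bar{x}_2$ is the sample mean of the second sample, $s_p^2 = \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$ is the pooled sample variance, and n_{1} and n_{2} are the sample sizes.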
(ii) If the two variances are unequal, an approximate form of the t test called Satterthwaite's test is usually used. The test statistic is calculated as follows:
$d = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
$df = \dfrac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}}$
Where $\bar{x}_1$ is the sample mean of the first sample,
$\bar{x}_2$ is the sample mean of the second sample,
n_{1} and n_{2} are the sample sizes of sample 1 and sample 2 respectively,
$s_1^2$ and $s_2^2$ are the sample variances of samples 1 and 2 respectively.
d is the Behrens-Welch test statistic, evaluated as a Student t quantile with df degrees of freedom using Satterthwaite's approximation.
Consider the example from Armitage and Berry (1994, pg. 111) where the gain in weight of 19 female rats is checked between 28 and 84 days after birth. 12 were fed on a high protein diet and 7 on a low protein diet.
Here the null hypothesis is that the means of high protein and that of low protein are equal.
High protein has a sample size n =12
Low protein has a sample size n =7
Mean of High Protein = 120
Mean of Low Protein = 101
Assuming equal variances
Combined standard error = 10.045276
The degrees of freedom (d.f) are given by (12 + 7 - 2)
d.f = 17
t = 1.891436
Two sided P = 0.0757
95% confidence interval for difference between means = -2.193679 to 40.193679
Since the p-value > 0.05 (the test is done at the 5% significance level, i.e. 95% confidence), we fail to reject the null hypothesis
Assuming unequal variances
Combined standard error = 9.943999
df = 13.081702
t(d) = 1.9107
Two sided P = 0.0782
95% confidence interval for difference between means = -1.980004 to 39.980004
Since the p-value > 0.05 (the test is done at the 5% significance level, i.e. 95% confidence), we fail to reject the null hypothesis
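The standard errors, t statistics and Satterthwaite degrees of freedom quoted in this example can be re-derived with a short script. The sketch below assumes the 19 individual weight gains as they are commonly reproduced from Armitage and Berry. This section lists only the summary values, so the two data lists are an assumption, although they do give the means of 120 and 101 reported above.

```python
import math
from statistics import mean, variance

# Assumed raw weight gains (grams); not listed in this section.
high = [134, 146, 104, 119, 124, 161, 107, 83, 113, 129, 97, 123]
low = [70, 118, 101, 85, 107, 132, 94]

n1, n2 = len(high), len(low)
diff = mean(high) - mean(low)
v1, v2 = variance(high), variance(low)      # sample variances (n - 1 divisor)

# (i) Equal variances: pooled variance and n1 + n2 - 2 degrees of freedom.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_pooled = diff / se_pooled
df_pooled = n1 + n2 - 2

# (ii) Unequal variances: Satterthwaite's approximation for the d.f.
a, b = v1 / n1, v2 / n2
se_welch = math.sqrt(a + b)
t_welch = diff / se_welch
df_welch = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))

print(se_pooled, t_pooled, df_pooled)
print(se_welch, t_welch, df_welch)
```

The p-values and confidence intervals additionally require quantiles of the t distribution, which the Python standard library does not provide, so they are left out of the sketch.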
2.1.5.3 Comparing two paired groups [paired t test]
This test is used to compare the mean of the same individual/item or related items at different times. Items are usually tested in a pre and post intervention (treatment) or when the individuals are paired such as in twins’ case. Since the observations are in pairs, the two samples will have equal sizes (sample sizes).
It usually tests the difference between two corresponding observations. Suppose you have observations $x_1, x_2, \dots, x_n$ and $y_1, y_2, \dots, y_n$.
Then the difference between corresponding observations is given by $d_i = x_i - y_i$, for $i = 1, 2, \dots, n$.
The test of hypothesis for this case is formulated as shown below
$H_0: \mu_d = 0$ [There is no difference between the observations]
Vs
$H_1: \mu_d \neq 0$
The test statistic is given by
$t = \dfrac{\bar{d} - \mu_d}{s_d / \sqrt{n}}$
Where
$\bar{d}$ is the mean of the differences,
$\mu_d$ is the hypothesized mean difference, which is usually set to zero,
$s_d$ is the standard deviation of the new variable d,
n is the sample size. The test statistic follows a t distribution with n - 1 degrees of freedom. Suppose the test is done at the 5% significance level (95% confidence); then reject the null hypothesis if the p-value associated with t is less than 0.05. In that case there would be evidence that there is a difference in means across the paired observations.
Assumptions made:
(i) The observations are independent of each other.
(ii) The dependent variable is measured on an interval scale.
(iii)The differences are normally distributed in the population.
Consider Anthony Green's (2000) example, in which the corresponding value of D for each pair is calculated in the last column.
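The paired t statistic described in this section can be sketched as below. The before and after scores are hypothetical and are not Green's data. `paired_t` is an illustrative helper name.

```python
import math
from statistics import mean, stdev

def paired_t(x, y, mu_d=0.0):
    """Paired t statistic for the differences d_i = x_i - y_i."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal sizes")
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    # Mean difference divided by its standard error s_d / sqrt(n).
    t = (mean(d) - mu_d) / (stdev(d) / math.sqrt(n))
    return t, n - 1

# Hypothetical scores for five items measured before and after a treatment.
after = [12, 14, 10, 13, 14]
before = [10, 12, 9, 11, 13]
t_stat, df = paired_t(after, before)
print(t_stat, df)
```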
2.1.5.4 Comparing more than two groups [ANOVA TEST]
The t test (Gossett, 1908) is used when the data is in two groups only and the researcher wishes to compare the means of the groups. When there are more than two groups, the comparison is approached in a different way, which is called the analysis of variance (ANOVA).
Although it is possible to compare many groups using the t test, this is achieved by comparing two groups at a time, which yields many t tests that must then be combined to make the comparison. The drawback of these many t tests is that complications might arise leading to confusion (Lindman, 1974); in particular, the chance of making at least one Type I error grows with the number of tests carried out.
Assumptions used in ANOVA.
The errors are normally distributed.
The expected values of the errors are zero.
The variances of all errors are equal to each other.
The errors are independent.
In ANOVA, the researcher generally has k groups with means $\mu_1, \mu_2, \dots, \mu_k$, and the groups need not have the same sizes (the n may vary). In ANOVA the researcher wishes to test the hypothesis:
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
Against
$H_1$: At least one of the means differs from the others.
Hinkelmann et al. (2008) discussed in detail two sources of errors in statistics, which are the assignable and chance causes.
Assignable causes are ones which can be identified, traced and eliminated or enhanced.
Chance causes are beyond the control of man.
ANOVA compares the groups by examining the ratio of the variability between the conditions to the variability within each condition, also known as the ‘between variability’ and the ‘within variability’. The amount of variation due to assignable causes (or variance between the samples) and the variation due to chance causes (or variance within the samples) are obtained separately and compared using an F-test.
So the total sum of squares is partitioned into the sum of squares due to treatment and the sum of squares due to errors, as shown below
$SS_{Total} = SS_{Treatment} + SS_{Error}$
Also the degrees of freedom (df) will be in a similar partition form:
$df_{Total} = df_{Treatment} + df_{Error}$, that is, $N - 1 = (n - 1) + (N - n)$
The F statistic is used to test the hypothesis in the ANOVA test.
Consider doing a one way ANOVA test. The F statistic
$F = \dfrac{SS_{Treatment} / (n - 1)}{SS_{Error} / (N - n)}$
where n is the number of treatments and N is the total number of cases, is compared to the F distribution with $n - 1$ and $N - n$ degrees of freedom, in that order, at the specified level of significance.
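A minimal sketch of the one way ANOVA computation is given below: the total variation is split into a treatment (between groups) part and an error (within groups) part, and the F statistic is the ratio of the two mean squares. The three groups are made-up numbers, `one_way_anova_f` is a hypothetical helper, and `k` in the code plays the role of the number of treatments.

```python
def one_way_anova_f(groups):
    """Return (F, df_treatment, df_error) for a list of groups."""
    all_values = [v for g in groups for v in g]
    total_cases, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / total_cases
    # Between-group (treatment) sum of squares.
    ss_treatment = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group (error) sum of squares.
    ss_error = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_treatment, df_error = k - 1, total_cases - k
    f_stat = (ss_treatment / df_treatment) / (ss_error / df_error)
    return f_stat, df_treatment, df_error

# Hypothetical measurements for three treatments.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df1, df2 = one_way_anova_f(groups)
print(f_stat, df1, df2)
```

The computed F would then be compared with the critical value of the F distribution with (df1, df2) degrees of freedom at the chosen significance level.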
One way ANOVA
This is when only one factor is applied in the experiment.
Two way ANOVA
This is when two factors are applied in the experiment.
The other types of ANOVA are the Factorial ANOVA and MANOVA.
Factorial ANOVA is used when the researcher wishes to check the effects of two or more factor variables. The most encountered type of factorial ANOVA is the 2×2 design, where there are two independent variables and each variable has two levels.
MANOVA is used when the research is multivariable that is when there is more than one dependent variable.
Hinkelmann et al. (2008) discussed these other types of ANOVA in detail.
2.1.5.5 Quantification of association between variables (Correlation)
This is measured using the Pearson correlation coefficient for the case of parametric test. It is used to determine the strength and direction of the relationship between any two variables.
Assumptions:
(i)Both variables should be normally distributed.
(ii) Both variables should be interval or ratio variables.
Pearson's correlation produces a correlation coefficient, r, which ranges from -1 to +1.
If r is negative then there is an inverse relationship between the dependent and independent variable, i.e. when one increases the other decreases and vice versa.
If r is positive then it means that both variables move in the same direction, i.e. as one is increased the other one also increases. The further that r is away from 0 the stronger the relationship between the two variables.
In his work, Sebastian (2003) summarized the properties and assumptions about correlation as follows;
r measures how closely the points in a scatter plot approximate a straight line. This property does not hold when the straight line is perfectly horizontal or perpendicular to the x axis.
r is not affected by linear transforms of data. In other words, if income is used as one of the variables and all incomes are divided by 100 to simplify computation, this will not change the obtained value of r.
r can be significantly affected by extreme values or outliers of x or y.
r cannot be used to establish causal relationships.
r is affected by range restrictions. This means that if the values used for x or y are limited to a particular set of values this is liable to decrease the value of r.
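Pearson's r between two variables is calculated using the formula:
$r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Where x is the independent variable and y is the dependent variable.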
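The following sketch computes Pearson's r from the sums of cross products and squares, and also illustrates the property noted above that r is not affected by a linear transformation such as dividing one variable by 100. The data are hypothetical and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
# Rescaling x (for example, dividing by 100) leaves r unchanged.
r_scaled = pearson_r([v / 100 for v in x], y)
print(r, r_scaled)
```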
2.1.6 Nonparametric tests.
In the previous section, the parametric tests were reviewed; in this section the corresponding nonparametric tests are discussed. Nonparametric tests do not make any assumptions about the underlying distribution. In nonparametric tests, the outcome variable is ranked from the smallest to the largest, then the ranks obtained are analyzed and conclusions made. The tests are discussed below:
2.1.6.1 Comparing one group to a hypothetical value [Wilcoxon test]
This is the nonparametric counterpart of the one-sample t test.
Assumptions
(i) It makes the assumption that the observations are symmetrically distributed about the mean.
The Wilcoxon test is used to test if the location (median) of the measurement is equal to a specified value.
This test is based on the sum of the (positive or negative) ranks of the differences between observed and expected center. The Test statistic corresponds to selecting each number from 1 to n with probability ½ and calculating the sum.
This test evaluates whether a sample of n observations is drawn from a population in which the median equals a specific (hypothesized) value.
The test requires one numeric data column. Dallal (2008) showed with an example how the ranks are obtained: data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties, e.g., 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. (If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.)
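The ranking rule described above, including the treatment of ties, can be sketched as a small helper. The function name `average_ranks` is hypothetical. The input is the example quoted from Dallal.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

ranks = average_ranks([117, 119, 119, 125, 128])
print(ranks)  # [1.0, 2.5, 2.5, 4.0, 5.0]
```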
2.1.6.2 Comparing two unpaired groups [Mann-Whitney test]
Its parametric counterpart is the unpaired t test. Suppose you have two groups, with sample sizes n_{1} and n_{2}.
The Mann-Whitney U test ranks all the cases from the lowest to the highest score. The Mean Rank is the mean of the ranks for each group and the Sum of Ranks is the sum of the ranks for each group. U_{1} is defined as the number of times that a score from the first group is lower in rank than a score from the second group. U_{2} is defined as the number of times that a score from the second group is lower in rank than a score from the first group. U is defined as the smaller of U_{1} and U_{2}.
The computational formulas for U_{1} and U_{2} are as follows:
$U_1 = n_1 n_2 + \dfrac{n_1 (n_1 + 1)}{2} - R_1$
$U_2 = n_1 n_2 + \dfrac{n_2 (n_2 + 1)}{2} - R_2$
Where
n_{1} = number of observations in group 1
n_{2} = number of observations in group 2
R_{1} = sum of ranks assigned to group 1
R_{2} = sum of ranks assigned to group 2
The Mann-Whitney U test checks the locations of one set of scores relative to the locations of the other set of scores. If U is not significant then the rankings of one set of scores are similar to the rankings of the other set of scores.
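A sketch of the U computation follows. It ranks the pooled scores (ties share the mean rank) and then applies the rank-sum formulas U1 = n1*n2 + n1(n1 + 1)/2 - R1 and U2 = n1*n2 + n2(n2 + 1)/2 - R2, which agree with the counting definitions of U_{1} and U_{2} given in this section. The helper names and the sample scores are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def mann_whitney_u(group1, group2):
    """Return (U, U1, U2) computed from the rank sums of the two groups."""
    n1, n2 = len(group1), len(group2)
    ranks = average_ranks(list(group1) + list(group2))
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])   # rank sums R1 and R2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2), u1, u2

# Hypothetical scores for two independent groups.
u, u1, u2 = mann_whitney_u([3, 5, 8, 9], [1, 4, 6])
print(u, u1, u2)
```

A useful sanity check is that U1 + U2 always equals n1 * n2.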
2.1.6.3 Compare two paired groups [Wilcoxon paired test]
This is a nonparametric test that compares two paired groups. It is the counterpart of the paired t test in parametric tests. Suppose the data which is in pairs is named as column X and column Y. First the difference ($d_i = x_i - y_i$) between each pair is found, then the absolute values of these differences ($|d_i|$) are ranked from the smallest to the largest. The researcher then sums the ranks of the differences where column X was higher (positive ranks) and sums the ranks where column Y was higher (negative ranks). If the two sums of ranks are very different, the P value will be small, leading to rejection of the null hypothesis.
Assumptions
The differences are symmetrically distributed.
The pairs are independent.
2.1.6.4 Compare three or more unmatched groups [Kruskal-Wallis test]
The Kruskal-Wallis test is a nonparametric test that compares three or more unpaired groups. It is the nonparametric equivalent of one-way ANOVA. First the values are ranked from the smallest to the largest without regard to which group each value belongs to. The deviations among the rank sums are combined to create a single value called the Kruskal-Wallis statistic. A large Kruskal-Wallis statistic corresponds to a large discrepancy among rank sums. This test has less power than one-way ANOVA when the assumptions of ANOVA hold.
When to use the Kruskal-Wallis test
When the errors are independent.
When the data are unpaired.
When the data was sampled from non-Gaussian populations.
2.1.6.5 Friedman test
This is a nonparametric test which is used to compare the means of three or more paired groups.
It is used when:
The items are independent.
The sample is collected from a population which is not normally distributed.
The matching of the pairs is effective.
2.1.6.6 Spearman correlation
This is a nonparametric measure of correlation between two variables. Here the data under each variable are ranked from the smallest to the largest, the smallest being given the value 1 and so on. Pearson's correlation is then calculated on these rankings. It is usually denoted by $\rho$ and, when there are no tied ranks, it is given by
$\rho = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$
Where $d_i$ is the difference between the two ranks of the i-th observation,
And n is the sample size, which is the same for the two variables.
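As an illustration, the sketch below ranks two variables and applies the shortcut formula rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the difference between the two ranks of an observation. It assumes there are no tied values within either variable, since the shortcut is exact only in that case. The data are hypothetical.

```python
def spearman_rho(x, y):
    """Spearman rank correlation via the shortcut formula.

    Assumes no ties within x and no ties within y.
    """
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x), start=1)}
    rank_y = {v: r for r, v in enumerate(sorted(y), start=1)}
    # Squared differences between the ranks of each paired observation.
    d_squared = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical paired observations.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)
print(rho)
```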
2.1.6.7 Chi square test (Pearson’s)
This test checks the null hypothesis that the frequency distribution of events observed in a sample is consistent with a particular theoretical distribution. The events being investigated must be mutually exclusive and have total probability 1. It checks the goodness of fit of a given sample to a particular distribution.
It is mainly applied to a contingency table.
2.1.7 Statistical significance tests used in information retrieval
In this section, statistical tests which are used in information retrieval are discussed.
2.1.7.1 McNemar’s test
Carl Staelin (2001), described McNemar’s Test as one which compares algorithms A and B by using one test set with n samples.
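Let $n_{00}$ be the number of items misclassified by both A and B
Let $n_{01}$ be the number of items misclassified by A alone
Let $n_{10}$ be the number of items misclassified by B alone
Let $n_{11}$ be the number of items classified correctly by both A and B
Then the test statistic, which is approximately Chi square distributed with one degree of freedom, is calculated (with a continuity correction) using:
$\chi^2 = \dfrac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$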
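A sketch of the computation is given below. It assumes the commonly used continuity-corrected form of the statistic, (|n01 - n10| - 1)^2 / (n01 + n10), where n01 is the number of items misclassified by A alone and n10 the number misclassified by B alone. The counts are hypothetical. The value 3.841 is the 5% critical value of the Chi square distribution with one degree of freedom.

```python
def mcnemar_statistic(n01, n10):
    """Continuity-corrected McNemar statistic from the discordant counts."""
    if n01 + n10 == 0:
        raise ValueError("no discordant items; the statistic is undefined")
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Hypothetical counts: A alone errs on 15 items, B alone errs on 5 items.
chi2 = mcnemar_statistic(n01=15, n10=5)
significant = chi2 > 3.841  # compare with the 5% critical value, 1 d.f.
print(chi2, significant)
```

Only the items on which the two algorithms disagree enter the statistic; items that both get right or both get wrong carry no information about which algorithm is better.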
2.1.7.2 Permutation test.
This test can be used to compare two algorithms. It is based on the fact that, even if two algorithms were equally accurate, some random difference would be expected in outcomes based on data splits. If the measured difference is random, then the average of many random permutations of the results would give about the same difference.
The procedure is as outlined below;
First get a set of k estimates of accuracy, say A = {a_{1}, a_{2}, ..., a_{k}} for M_{1} and B = {b_{1}, ..., b_{k}} for M_{2}
Calculate the average accuracies, μ_{A} = (a_{1} + ... + a_{k})/k and μ_{B} = (b_{1} + ... + b_{k})/k
Calculate d_{AB} = μ_{A} - μ_{B}
Let p = 0
Repeat n times:
  let S = {a_{1}, ..., a_{k}, b_{1}, ..., b_{k}}
  randomly partition S into two equal sized sets, R and T (statistically best if partitions are not repeated)
  calculate the average accuracies, μ_{R} and μ_{T}
  calculate d_{RT} = μ_{R} - μ_{T}
  if d_{RT} ≥ d_{AB} then p = p + 1
p-value = p/n (report the values of p, n, and the p-value)
A low p-value implies that the algorithms really are different
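The procedure above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: partitions are drawn by shuffling and may repeat, the random seed and the number of repetitions are arbitrary choices, and the accuracy values are hypothetical.

```python
import random

def permutation_test(a, b, n_permutations=2000, seed=0):
    """One-sided permutation p-value for mean(a) - mean(b).

    Assumes `a` and `b` hold the same number of accuracy estimates.
    """
    rng = random.Random(seed)
    k = len(a)
    d_ab = sum(a) / k - sum(b) / len(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # random partition of S
        r, t = pooled[:k], pooled[k:]
        d_rt = sum(r) / len(r) - sum(t) / len(t)
        if d_rt >= d_ab:                         # at least as extreme
            count += 1
    return count / n_permutations

# Hypothetical accuracy estimates for two algorithms.
acc_a = [0.91, 0.89, 0.93, 0.90, 0.92]
acc_b = [0.85, 0.84, 0.88, 0.86, 0.83]
p_value = permutation_test(acc_a, acc_b)
print(p_value)
```

Because every estimate in `acc_a` exceeds every estimate in `acc_b`, very few random partitions reproduce a difference as large as the observed one, so the p-value comes out small.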
2.1.7.3 Two Proportions Test
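This test is based on comparing the error rates of algorithms A and B. It uses the assumption that the number of misclassifications made by an algorithm on n test items is a Binomial random variable.
P_{A} = (number of items misclassified by A)/n and P_{B} = (number of items misclassified by B)/n
For algorithm A, the number of misclassifications has Mean = nP_{A} and Variance = nP_{A}(1 - P_{A})
When n is large, and assuming P_{A} and P_{B} are independent, (P_{A} - P_{B}) is approximately normal. We then use the test statistic
$z = \dfrac{P_A - P_B}{\sqrt{2P(1 - P)/n}}$, where $P = (P_A + P_B)/2$,
to compare the two.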
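A sketch of this comparison is shown below. It assumes the usual form of the statistic, z = (P_A - P_B) / sqrt(2P(1 - P)/n) with P the average of the two error rates, and a two-sided test at the 5% level (critical value 1.96). The error rates and the test set size are hypothetical.

```python
import math

def two_proportions_z(p_a, p_b, n):
    """z statistic for the difference between two error rates measured
    on test sets of size n, using the averaged error rate for the variance."""
    p = (p_a + p_b) / 2
    return (p_a - p_b) / math.sqrt(2 * p * (1 - p) / n)

# Hypothetical error rates of algorithms A and B on 100 test items.
z = two_proportions_z(0.20, 0.10, n=100)
reject = abs(z) > 1.96  # two-sided test at the 5% significance level
print(z, reject)
```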
Other tests include: paired t-test, k-fold cross-validated paired t-test, and 5x2cv paired t-test.
2.2 Other Measures Which Can Be Used In Place of Statistical Significance Tests.
In this section, other statistical measures that can be used in place of significance tests are discussed. These include:
2.2.1 Effect Size
This is defined as a measure of the strength of the relationship between any two variables. Statistical significance tests only check if there is a difference; they do not check how big or how small the difference is. Significance tests do not tell us if the difference is big enough or meaningful enough for the researcher to use it to make a decision. For example, suppose we are checking the effect of remedial classes on the performance of students, and that before the remedial classes the mean mark of the students was 35% and after the remedial classes the mean mark rose to 36%.
When testing this for statistical significance, depending on the sample size, the researcher might find that there is a difference in performance before and after the remedial classes. However, in a true sense a rise of 1% does not indicate a real change and it will not be meaningful to declare that the remedial classes had an effect on the performance of the students.
To know whether an observed difference is not only statistically significant but also has an important or meaningful interpretation, a researcher will need to calculate its effect size. Instead of giving the results of the difference in terms of the marks themselves, the effect size is standardized. Because standardized effect sizes are calculated on a common scale, the researcher or scientist can compare the effectiveness of different treatments based on the same outcome.
In practical situations, effect sizes are very useful for making decisions, since a highly significant relationship may not be of any importance if its effect size is small. Effect size can be a standardized measure of effect (such as odds ratio, Cohen's d, and r) or an unstandardized measure (e.g., the raw difference between group means and unstandardized regression coefficients). Reporting of effect size in scientific papers is critically important and usually boosts the reader's confidence in the findings of that particular research paper.
Effect size makes it possible to do meta-analysis.
There are very many effect size measures used by researchers and each of them has a specific situation when it is used. These include: Standardized Mean Difference, Correlation Coefficient, Odds Ratio, Standardized Gain Score, Proportion, Relative risk (RR) etc.
In this section different effect size measures are discussed in great detail.
2.2.1.1 Standard Mean difference
For two groups being studied in a research, the population effect size is usually based on the standardized difference between the means of the two groups. This is given by the formula:
$\theta = \dfrac{\mu_1 - \mu_2}{\sigma}$
Where $\mu_1$ is the mean of population 1 and $\mu_2$ is the mean of population 2,
$\sigma$ is the population standard deviation, which may be taken to be the one for the second population or the pooled standard deviation of the two populations.
If this is compared to the t statistic used in hypothesis testing, it is easy to see that they are almost similar; the only difference is that the t statistic has a function of the sample size in the denominator while this measure of effect does not involve the sample size. This implies that the effect size is not affected by the sample size used in the research.
2.2.1.2 Cohen's d
Cohen (1988) defined d as the difference between the means, M1 - M2, divided by the standard deviation, $\sigma$, of either group, where M1 is the mean of the first group and M2 is the mean of the second group in the study. Cohen clearly outlined in his work that the standard deviation of either group could be used when the variances of the two groups are homogeneous.
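As a small illustration of d, the sketch below divides the difference in group means by a standard deviation. It uses the pooled standard deviation of the two groups, which is one common choice and an assumption here. Under homogeneous variances it is close to the standard deviation of either group. The scores are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)

# Hypothetical outcome scores for a treatment group and a placebo group.
treatment = [6, 7, 8, 9, 10]
placebo = [4, 5, 6, 7, 8]
d = cohens_d(treatment, placebo)
print(d)
```

Unlike the t statistic, this value does not grow with the sample size, which is what makes it usable as a measure of how large the difference is.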
Other authors in their books and papers usually use the pooled standard deviation for the two groups, which is given by $s_{pooled} = \sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$, as the standard deviation, where n_{1} is the sample size for group 1, n_{2} is the sample size for group 2, and $s_1^2$ and $s_2^2$ are the corresponding sample variances. Usually in meta-analysis the two groups are considered to be the treatment group and the placebo group. By convention the subtraction, M1 - M2, is done so that the difference is positive if it is in the direction of improvement or in the predicted direction and negative if in the direction of deterioration or opposite to the p