### Abstract

Statistical tools are undoubtedly important in decision making. Their use in everyday problems has led to numerous discoveries, conclusions and advances in knowledge, ranging from direct calculations with general statistical formulas to formulas built into statistical software that speed up the process of decision making.

Significance tests are powerful statistical tools for testing hypotheses, but only if they are used correctly and with a good understanding of their concepts and limitations. Some researchers have misused these tests, leading to wrong conclusions.

This paper looks at the different significance tests (both parametric and non-parametric), their uses, when they should be used, and their limitations. It also evaluates the use of statistical significance tests in Information Retrieval and then examines the significance tests used by researchers in papers submitted to the Special Interest Group on Information Retrieval (SIGIR) in 2006, 2007 and 2008. For the combined period 2006-2008, a proportion of the submitted papers used statistical tests, and a number of these tests were used wrongly.

Key Words: Significance Test, Information Retrieval, Parametric Tests, Non-parametric Tests, Hypothesis Testing

### Chapter One

### 1.0 Introduction

Statistical methods play a very important role in all aspects of research, ranging from data collection, recording and analysis to making conclusions and inferences. The credibility of the research results and conclusions depends on each and every step mentioned above; a fault in any of these steps can render research carried out over several years, at a cost of millions of shillings, worthless.

This does not mean that carrying out any test and mincing figures shows that statistics has been used in a given piece of research; the researcher should be able to support why he or she used that specific test or method.

Misuse of significance tests is not new in the world of science. According to Campbell (1974), there are several types of statistical misuse:

### 1.0.1 Discarding unfavorable portions of data

This occurs when the researcher keeps only the portion of the data that produces the results he/she wants while discarding the rest. After well-conducted research, the researcher might obtain values that are not consistent with what he/she expected and might decide to ignore this section of the data during the analysis so as to get the "expected results". This is a mistake, since inconsistent data can suggest very new lines of thought in that particular field: if the irregularities are examined and their causes explained, more ideas about that area can be explored.

### 1.0.2 Overgeneralization

Sometimes the conclusions from a study apply only to that particular research problem, yet the researcher might blindly generalize the results to other kinds of research, similar or dissimilar. Overgeneralization is a common mistake in current research. After successfully completing a study in a particular field, a researcher might be tempted to extend the generalizations reached to other fields of study without regard for the different orientations of those populations and the assumptions made about them.

### 1.0.3 Non-representative sample

This arises when the researcher selects a sample which produces results geared towards his/her liking. The sample selected for a particular study should be one that truly represents the entire population, and the procedure for selecting the sample units should be carried out in an unbiased manner.

### 1.0.4 Consciously manipulating data

This occurs when a researcher consciously changes the collected data in order to reach a particular conclusion. It is mainly noticed when the researcher knows exactly what the customer's aims are and changes part of the data so that the aim of the research is supported strongly. For example, if a researcher carrying out a regression analysis produces a scatter plot and sees many outliers, he/she might decide to change some values so that the scatter plot appears as a straight line or something very close to it. This leads to results that are appealing to the customer and to other users but in reality give no clear indication of what is happening in the population at large.

### 1.0.5 False correlation

This is observed when the researcher claims that one factor causes the other while in reality both factors are caused by a hidden factor that was not identified during the study. Correlation studies are common in the social sciences and are sometimes inadequately approached, which leads to misleading results. In a correlation study, say to check whether variable X causes variable Y, there are in fact four possibilities: first, X causes Y; second, Y causes X; third, X and Y are both caused by another unidentified variable, say Z; and lastly, the correlation between X and Y occurred purely by chance.

All these possibilities should be checked in such studies to avoid rushing into wrong conclusions. False causality can be eliminated by using two groups for the same experiment: the "control group" (the one receiving a placebo) and the "treatment group" (the one receiving the treatment).

Even though this method is effective, implementing it raises many challenges. There are ethical issues, for example when one group of patients is given a placebo (an ineffective drug) without their consent while the other group is given the real drug. One question comes to mind: is it ethical to do this to the first group? Carrying out the experiment in parallel for two different groups can also prove very expensive.

### 1.0.6 Overloaded questions

The questions used in a survey can strongly affect its outcome. The structure of the questions in a questionnaire and the manner of formulating and asking them can influence how the respondent answers. Long, wordy questions can bore a respondent, who may then fill in the questionnaire in a hurry just to finish, without really caring about the answers provided. Poor framing can also yield leading questions that steer the respondent towards a particular answer, for example: "The government is not offering security to its citizens, do you agree with this? (Yes or No)"

Statistical significance testing has been with us for more than 300 years (Huberty, 1993). Despite this long history, the field is beset by criticism from all directions, which has led many researchers to write at length about the problems of statistical significance testing. Harlow et al. (1997) discussed the controversy in significance testing in depth. Carver (1993) expressed dislike of significance tests and clearly urged researchers to stop using them.

In his book How to Lie with Statistics, Huff (1954) outlined in depth the errors, both intentional and unintentional, and the misinterpretations made in statistical analyses. Some bodies, e.g. the American Psychological Association (APA), recommended minimal use of statistical significance tests by researchers submitting papers for publication (APA, 1996), though without revoking the use of the tests.

Despite the relentless criticism, other researchers have not given up on statistical significance testing but have clearly encouraged users of the tests to acquire good knowledge of them before drawing conclusions. Mohr (1990) discussed and supported the use of these tests, while warning researchers to know the limitations of each test and to apply the tests correctly so as to make correct inferences and conclusions. In his paper, Burr (1960) supported the use of statistical significance tests but asked researchers to make allowances for the existence of statistical errors in the data.

Amidst these controversies, statistical significance testing has been applied in many areas of research and remarkable achievements have been recorded. One such area is information retrieval (IR), where significance tests have been used to compare different retrieval algorithms.

### 1.1.0 Information retrieval

Information retrieval is defined as the science of searching databases, the World Wide Web and other document collections for information on a particular subject. To get information, the user enters keywords to be used for searching; a set of objects containing the keywords is usually returned, from which the user can single out the one that gives him or her the required information.

The user usually progressively refines the search by narrowing down and using specific words. Information retrieval has developed as a highly dynamic and empirical discipline, requiring careful and thorough evaluation to show the superior performance of different new techniques on representative document collections.

There are many algorithms for information retrieval, and it is usually important to measure the performance of different information retrieval systems so as to know which one gives the required information faster. In order to measure information retrieval effectiveness, three test items are required:

- (i) A collection of documents on which the different retrieval methods will be run and compared.
- (ii) A test collection of information needs, expressible in terms of queries.
- (iii) A collection of relevance judgments that distinguish whether the results returned are relevant or irrelevant to the person doing the search.

A question might arise as to which collection of objects should be used in testing different systems. There are several standard test collections used universally; these include:

(i) Text Retrieval Conference (TREC). This is a standard collection comprising 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles) and relevance judgments for 450 information needs, which are called topics and are specified in detailed text passages. Individual test collections are defined over different subsets of this data.

(ii) GOV2. This was developed by the U.S. National Institute of Standards and Technology (NIST). It is a collection of 25 million web pages.

(iii) NII Test Collections for IR Systems (NTCIR). This is also a large test collection, focusing mainly on East Asian language and cross-language information retrieval, where queries are made in one language over a document collection containing documents in one or more other languages.

(iv) Cross Language Evaluation Forum (CLEF). This test collection is mainly focused on European languages and cross-language information retrieval.

(v) 20 Newsgroups. This text collection was collected by Ken Lang. It consists of 1000 articles from each of 20 Usenet newsgroups (the newsgroup name being regarded as the category). After the removal of duplicate articles, as it is usually used, it contains 18941 articles.

(vi) The Cranfield collection. This is the oldest test collection allowing precise quantitative measures of information retrieval effectiveness, but it is nowadays too small for anything but the most elementary pilot experiments. It was collected in the United Kingdom starting in the late 1950s and contains 1398 abstracts of aerodynamics journal articles, a set of 225 queries, and exhaustive relevance judgments of all (query, document) pairs.

There exist several methods of measuring the performance of retrieval systems, namely Precision, Recall, Fall-out, E-measure and F-measure, to mention a few, since researchers keep coming up with new methods. A brief description of each method will shed some light.

### 1.1.1 Recall

Recall in information retrieval is defined as the number of relevant documents returned by a search divided by the total number of relevant documents in the database. Recall can also be viewed as evaluating how well the retrieval method being used finds the required information.

Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then

$$\text{Recall} = \frac{|A \cap B|}{|B|} \qquad (1.1)$$

As an example, suppose a database contains 500 documents, of which 100 contain relevant information required by a researcher; the complement, the number of documents not required, is 400.

If the researcher uses a system to search the database and it returns 100 documents, all of them relevant to the researcher, then the recall is given by:

$$\text{Recall} = \frac{100}{100} = 1$$

Suppose instead that out of 120 returned documents, 30 are irrelevant; then the recall would be given by

$$\text{Recall} = \frac{90}{100} = 0.9$$

### 1.1.2 Precision

Precision is defined as the number of relevant documents retrieved divided by the total number of documents retrieved in that search. It evaluates how well the retrieval method being used filters out unwanted information.

Let $A$ be the set of all retrieved objects and $B$ be the set of all relevant objects; then

$$\text{Precision} = \frac{|A \cap B|}{|A|} \qquad (1.2)$$

As an example, suppose again a database contains 500 documents, of which 100 contain relevant information required by a researcher; the complement, the number of documents not required, is 400.

If the researcher uses a system to search the database and it returns 100 documents, all of them relevant to the researcher, then the precision is given by:

$$\text{Precision} = \frac{100}{100} = 1$$

Suppose instead that out of 120 returned documents, 30 are irrelevant; then the precision would be given by

$$\text{Precision} = \frac{90}{120} = 0.75$$
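The two worked examples above can be checked with a short Python sketch of the set-based definitions of recall (1.1) and precision (1.2); the numeric document IDs below are hypothetical stand-ins for the 500-document database.

```python
# Precision and recall from sets of retrieved and relevant documents.
# A minimal sketch; document IDs are hypothetical stand-ins.

def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved: |A n B| / |B|."""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant: |A n B| / |A|."""
    return len(retrieved & relevant) / len(retrieved)

# The worked example: 100 relevant documents in a 500-document database.
relevant = set(range(100))

# Case 1: the system returns exactly the 100 relevant documents.
perfect = set(range(100))
print(recall(perfect, relevant), precision(perfect, relevant))  # 1.0 1.0

# Case 2: 120 returned, 30 of them irrelevant (so 90 relevant retrieved).
mixed = set(range(90)) | set(range(100, 130))
print(recall(mixed, relevant), precision(mixed, relevant))      # 0.9 0.75
```

Using sets makes the trade-off explicit: growing the retrieved set can only raise recall, while precision depends on how many of the extra documents are actually relevant.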

Both precision and recall are based on one term: relevance. The Oxford dictionary defines relevance as "connected to the issue being discussed".

Yolanda Jones (2004) identified three types of relevance, namely:

- Subject relevance: the connection between the subject submitted via a query and the subject covered by the returned texts.
- Situational relevance: the connection between the situation being considered and the texts returned by the database system.
- Motivational relevance: the connection between the motivations of a researcher and the texts returned by the database system.

There are two measures of relevance:

- Novelty Ratio: the proportion of items returned from a search and acknowledged by the user as relevant, of which the user was previously unaware.
- Coverage Ratio: the proportion of items returned from a search out of the total relevant documents that the user was aware of before starting the search.

Precision and recall affect each other: an increase in recall typically decreases precision. If one increases a system's ability to retrieve more documents (increasing recall), there is a drawback: the system will also retrieve more irrelevant documents, reducing its precision. A trade-off between the two measures is therefore required to ensure good search results.

Precision and recall measures make use of the following assumptions:

- Either a system returns a document or it does not.
- Either a document is relevant or it is not; there is nothing in between.

New methods that rank the degree of relevance of documents are being introduced by researchers.

### 1.1.3 Receiver Operating Characteristic (ROC) Curve

This is the plot of the true positive rate (sensitivity) against the false positive rate (1 − specificity). Sensitivity is just another term for recall. The false positive rate is given by $FP/(FP + TN)$. An ROC curve always goes from the bottom left to the top right of the graph; for a good system, the graph climbs steeply on the left side. For unranked result sets, specificity, given by $TN/(FP + TN)$, is not seen as a very useful idea: because the set of true negatives is always so large, its value would be almost 1 for all information needs (and, correspondingly, the false positive rate would be almost 0).
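The two coordinates of a point on an ROC curve can be sketched from confusion-matrix counts; the counts below are made up for illustration, with a deliberately huge true-negative set to show why the false positive rate collapses towards zero.

```python
# True positive rate (sensitivity = recall) and false positive rate,
# the y and x coordinates of one point on an ROC curve.
# A sketch with made-up counts, not data from this study.

def roc_point(tp, fp, tn, fn):
    tpr = tp / (tp + fn)   # sensitivity = recall
    fpr = fp / (fp + tn)   # 1 - specificity
    return fpr, tpr

# With a huge set of true negatives, the FPR is driven towards 0,
# which is why specificity is uninformative for unranked result sets.
fpr, tpr = roc_point(tp=90, fp=30, tn=100_000, fn=10)
print(fpr, tpr)
```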

### 1.1.4 F-measure and E-measure

The F-measure is defined as the weighted harmonic mean of recall and precision. Numerically, it is defined as

$$F = \frac{(\beta^2 + 1)PR}{\beta^2 P + R} \qquad (1.3)$$

where $\beta$ is the weight and $P$ and $R$ denote precision and recall.

If $\beta$ is assumed to be 1, then

$$F_1 = \frac{2PR}{P + R} \qquad (1.4)$$

The E-measure is given by

$$E = 1 - F \qquad (1.5)$$

The E-measure has a maximum value of 1.0, with lower values being better since $E = 1 - F$.
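A short sketch of the weighted F-measure and the E-measure, assuming the common definition $E = 1 - F$; the precision and recall values below are taken from the earlier worked example.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (eq. 1.3)."""
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

def e_measure(precision, recall, beta=1.0):
    """E = 1 - F (eq. 1.5): lower values are better."""
    return 1.0 - f_measure(precision, recall, beta)

# With beta = 1 this reduces to the familiar F1 = 2PR / (P + R).
print(f_measure(0.75, 0.9))   # about 0.818
print(e_measure(0.75, 0.9))   # about 0.182
```

Setting `beta` above 1 weights recall more heavily, below 1 weights precision more heavily.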

### 1.1.5 Fall-Out

This is defined as the proportion of irrelevant documents returned by a search out of all the irrelevant documents in the collection.

$$\text{Fall-out} = \frac{|A \cap \bar{B}|}{|\bar{B}|} \qquad (1.6)$$

where $A$ is the set of retrieved documents and $\bar{B}$ the set of irrelevant documents. It can also be interpreted as the probability of the system retrieving an irrelevant document.
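Fall-out can be sketched with the same set machinery as precision and recall; the collection below reuses the hypothetical 500-document example with 100 relevant documents.

```python
def fall_out(retrieved, relevant, collection):
    """Irrelevant documents retrieved / all irrelevant documents (eq. 1.6)."""
    irrelevant = collection - relevant
    return len(retrieved & irrelevant) / len(irrelevant)

collection = set(range(500))                        # 500-document database
relevant = set(range(100))                          # 100 relevant documents
retrieved = set(range(90)) | set(range(100, 130))   # 120 retrieved, 30 irrelevant
print(fall_out(retrieved, relevant, collection))    # 30/400 = 0.075
```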

These are just a few of the methods for measuring the performance of search systems. Having evaluated a single system, a problem then arises in comparing two systems or algorithms: is this system better than the other one?

To answer this question, scientists in information retrieval use statistical significance tests to establish whether differences in system performance are not due to chance. These tests are used to give confidence that one system is better than another.

### Statement of the problem

Statistical inference tools like statistical significance tests are important in decision making, and their use has been on the rise in different areas of research. With this rise, novice users apply these tools in questionable ways. Many researchers do not understand the basic concepts of statistics, which leads to misuse of the tools. Any conclusion reached in a piece of research may be dismissed if the statistical tests used in it are shoddy.

More light needs to be shed on this area of research to ensure correct use of these tests. Researchers in Information Retrieval also use these tests to compare systems and algorithms: are the conclusions from these tests truly correct? Are there other ways of comparison that minimize the use of statistical tests?

### Objectives of the study

The objectives of this study are to:

- Investigate the use and misuse of statistical significance tests in scientific papers submitted by researchers to SIGIR.
- Shed light on the different statistical significance tests, their uses, assumptions and limitations.
- Identify the most important statistical concepts that can provide solutions to the problems of statistical significance in scientific papers submitted to SIGIR.
- Investigate the reality of the problems of statistical significance in scientific papers submitted to SIGIR.
- Investigate the statistical significance tests used by researchers in Information Retrieval.
- Discover the availability of statistical concepts and methods that can provide solutions to the problems of statistical significance in scientific papers submitted to SIGIR.

### Chapter Two

This chapter is divided into three major parts: the first, on sample selection and sample size, discusses methods of selecting a sample and choosing its size for a given study; the second deals with statistical analysis methods and procedures, mainly significance testing; and the third discusses other statistical methods that can be used in place of statistical significance tests.

### 2.0 Sample Selection and Sample Size

### 2.0.1 Sample selection

Sampling plays a major role in research. According to Cochran (1977), sampling is the process of selecting a portion of the population and using the information derived from this portion to make inferences about the entire population.

Sampling has several advantages, namely;

(i) Reduced cost

It is far more expensive to carry out a full census than to collect information from a small portion of the population: only a small number of measurements are made, so only a few people need be hired, compared to a complete census, which requires a large labor force.

(ii) Greater speed (less time)

Since only a few items will be measured, the time for taking the measurements is reduced, and summarization of the data is quick, as opposed to taking measurements for the whole population.

(iii) Greater accuracy

Since only a few units are considered, the researchers can be very thorough; with an entire population, researchers tire in the middle of the process, leading to careless data collection and shoddy analysis.

The choice of the sampling units in a given study may affect the credibility of the whole research. The researcher must make sure that the sample being used is not biased, that is, that it represents the whole population.

There are several methods of selecting samples to be used in a study. A researcher should always make sure that the sample drawn is large enough to be a representative of the population as a whole and at the same time manageable. In this section the two major types of sampling, random and non-random, will be examined.

### 2.0.1.1 Random sampling

In random sampling, all items or individuals in the population have an equal chance of being selected into the sample. This ensures that no bias is introduced during selection, since an item's selection is purely by chance and does not depend on the person assigned the duty of drawing the sample. There are five major random sampling techniques, namely simple random sampling, multi-stage sampling, stratified sampling, cluster sampling and systematic sampling. The following sections discuss each of these.

### 2.0.1.1.1 Simple random sampling

In simple random sampling, each item in the population has the same chance of being included in the sample. Usually each sampling unit is assigned a unique number; numbers are then generated using a random number generator, and a sampling unit is included in the sample if its corresponding number comes up.

One advantage of simple random sampling is its simplicity and ease of application when dealing with small populations. However, every entity in the population has to be listed and given a unique number before the random numbers can be drawn, which makes this method very tedious and cumbersome for large populations.
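Simple random sampling without replacement is easy to sketch with Python's standard library; the population of 500 numbered units below is hypothetical.

```python
import random

# Simple random sampling: every unit has the same chance of inclusion.
population = list(range(1, 501))          # 500 numbered sampling units
random.seed(42)                           # fixed seed for reproducibility
sample = random.sample(population, k=30)  # draw 30 units without replacement

print(len(sample), len(set(sample)))      # 30 distinct units
```

`random.sample` draws without replacement, so no unit can appear in the sample twice.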

### 2.0.1.1.2 Stratified sampling

In stratified random sampling, the entire population is first divided into disjoint subpopulations, so that each sampling unit belongs to one and only one subpopulation. These subpopulations are called strata; they may be of different sizes, they are homogeneous within themselves, and each stratum differs from the others. It is from these strata that samples are drawn for a particular study. Examples of commonly used strata include state, province, age, sex, religion, academic ability and marital status.

Stratification is most useful when the stratifying variables are simple to work with, easy to observe and closely related to the topic of the survey (Sheskin, 1997).

Stratification can be used to select more of one group than another. This may be done if it is felt that the responses vary more in one group than in another. So, if the researcher knows that every entity in a group has much the same value, only a small sample is needed to get information for that group; whereas in another group the values may differ widely and a bigger sample is needed.

If you want to combine group-level information to get an answer for the whole population, you have to take account of the proportion selected from each group. This method is mainly used when information is required for only a particular subdivision of the population, when administrative convenience is an issue, or when the sampling problems differ greatly in different portions of the population under study.
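Proportional allocation across strata can be sketched as follows; the strata names and sizes are made up for illustration, with each stratum contributing to the sample in proportion to its share of the population.

```python
import random

# Proportional stratified sampling: sample from each stratum in proportion
# to its size, then combine. Strata and sizes here are hypothetical.
strata = {
    "urban": list(range(0, 600)),     # 600 units
    "rural": list(range(600, 1000)),  # 400 units
}
total = sum(len(units) for units in strata.values())
n = 50                                # overall sample size

random.seed(0)
sample = []
for name, units in strata.items():
    k = round(n * len(units) / total)          # proportional allocation
    sample.extend(random.sample(units, k))     # simple random sample per stratum

print(len(sample))    # 30 urban + 20 rural = 50
```

Disproportionate allocation (oversampling a highly variable stratum) would simply replace the `k` formula, at the cost of needing weights when combining results.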

### 2.0.1.1.3 Systematic sampling

Systematic sampling is quite different from the other methods. Suppose the population contains N units and a sample of n units is required: a sampling interval k is chosen, a random starting unit (represented as a number) is drawn, and the researcher then picks every kth unit thereafter. For example, if k is 20 and the first unit drawn is 5, the subsequent units will be 25, 45, 65, 85 and so on.

The implication of this method is that the selection of the whole sample is determined by the first item alone, since the rest are obtained sequentially. This is called an every-kth systematic sample. The technique can also be used when questioning people in a sample survey: a researcher might select every 15th person who enters a particular store, after choosing a starting person at random, or interview the shopkeeper of every 3rd shop in a street, after choosing a starting shop at random.

It may be that a researcher wants a fixed-size sample. In this case it is first necessary to know the size of the whole population from which the sample is drawn; the appropriate sampling interval, k, is then calculated by dividing the population size, N, by the required sample size, n. This method is advantageous since it is easy and more precise than simple random sampling.

It is also simpler in systematic sampling to select one random number and then every kth member on the list than to select as many random numbers as the sample size, and it gives a good spread right across the population. A disadvantage is that the researcher may need a complete list of the population in order to calculate the sampling interval.
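The every-kth scheme can be sketched in a few lines; the call below reproduces the worked example (k = 20, first unit 5).

```python
# Every-kth systematic sampling: pick a random start, then every kth unit.

def systematic_sample(N, k, start):
    """Units selected from a population numbered 1..N, stepping by k."""
    return list(range(start, N + 1, k))

print(systematic_sample(N=100, k=20, start=5))   # [5, 25, 45, 65, 85]
```

In practice `start` would be drawn at random from 1..k; it is fixed here only to match the example in the text.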

### 2.0.1.1.4 Cluster sampling

The Australian Bureau of Statistics notes that cluster sampling divides the population into groups, or clusters. A number of clusters are selected randomly to represent the population, and then all units within the selected clusters are included in the sample. No units from non-selected clusters are included; they are represented by those from the selected clusters. This differs from stratified sampling, where some units are selected from each group.

The clusters are heterogeneous within themselves (the sampling units inside a cluster vary from each other), and each cluster resembles the others. Cluster sampling has several advantages, including reduced costs, simplified field work and more convenient administration. Instead of the sample being scattered over the entire coverage region, it is concentrated in relatively few collection points (clusters).

Cluster sampling provides results that are less accurate compared to stratified random sampling.

### 2.0.1.1.5 Multi-stage sampling

Multi-stage sampling is like cluster sampling, but involves selecting a sample within each chosen cluster rather than including all units in the cluster. The Australian Bureau of Statistics notes that multi-stage sampling involves selecting a sample in at least two stages. In the first stage, large groups or clusters are selected; these clusters are designed to contain more population units than are required for the final sample.

In the second stage, population units are chosen from the selected clusters to derive the final sample. If more than two stages are used, the process of choosing population units within clusters continues until the final sample is reached. With two stages the method is called two-stage sampling, with three stages three-stage sampling, and so on.
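A two-stage selection can be sketched as below; the clusters and unit labels are entirely hypothetical.

```python
import random

# Two-stage sampling sketch: first select clusters at random (stage 1),
# then select units within each chosen cluster (stage 2).
random.seed(1)
clusters = {c: [f"unit_{c}_{i}" for i in range(40)] for c in range(10)}

chosen_clusters = random.sample(list(clusters), k=3)   # stage 1: 3 of 10 clusters
sample = []
for c in chosen_clusters:
    sample.extend(random.sample(clusters[c], k=5))     # stage 2: 5 units each

print(len(sample))    # 15 units in the final sample
```

Contrast with pure cluster sampling, where stage 2 would take all 40 units from each chosen cluster.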

### 2.0.2 Determination of sample size to be used

### 2.1 Statistical Analysis

In this section, different statistical tests are discussed in detail in their general form; we then move on to discuss how each of them (the ones used in IR) is applied to information retrieval. Only some of these tests are used to compare systems and/or algorithms.

In this paper we look at three sections of statistical analysis, namely:

(i) Summarizing data using a single value.

(ii) Summarizing variability.

(iii) Summarizing data using an interval (no specific value)

In the first case we have the mean, mode, median, etc.; in the second case we look at variability in the data; and in the third case we look at confidence intervals and parametric and non-parametric tests of hypothesis testing.

### 2.1.1 Summarizing data using a single value

In this case, the data being analyzed is represented by a single value; examples of this scenario are discussed below:

### 2.1.1.1 Mean

There are three different kinds of mean:

(i) Arithmetic mean

(ii) Geometric mean

(iii) Harmonic mean

(i) Arithmetic mean

This is computed by summing all the observations and then dividing by the number of observations collected.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$. The arithmetic mean is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

When to use the arithmetic mean

The arithmetic mean is used when:

- the collected data are numeric observations;
- the data have only one mode (uni-modal);
- the data are not skewed, i.e. not concentrated at extreme values;
- the data do not have many outliers (very extreme values).

The arithmetic mean is not used when:

- the data are categorical;
- the data are extremely skewed.

(ii) Geometric mean

This is defined as the product of the observations raised to the power of $1/n$.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$. The geometric mean is defined as

$$\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

The geometric mean is used when:

- the observations are numeric;
- the quantity of interest is the product of the observations.

(iii) Harmonic mean

This is defined as the number of observations divided by the sum of the reciprocals of the observations.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$. The harmonic mean is defined as

$$\bar{x}_h = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

The Harmonic mean is used when:

The average can be justified for the reciprocal of the observations.
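The three means can be computed directly from their definitions; the sample values below are made up for illustration.

```python
import math

# The three means for a positive numeric sample, following section 2.1.1.1.

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

xs = [2, 4, 8]
print(arithmetic_mean(xs))   # about 4.667
print(geometric_mean(xs))    # about 4.0 (cube root of 64)
print(harmonic_mean(xs))     # about 3.429 (24/7)
```

For positive data the three always order as harmonic <= geometric <= arithmetic, which the example illustrates.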

### 2.1.1.2 Median

This is defined as the middle value of the observations. The observations are first arranged in ascending or descending order then the middle value is taken as the median.

The median is used when:

- the observations are skewed;
- the observations have a single mode;
- the observations are numerical.

The median is not used when:

- we are interested in the total value.

### 2.1.1.3 Mode

This is defined as the value with the highest frequency of occurrence in the given dataset.

The mode is used when:

- the dataset is categorical;
- the dataset is numeric and multimodal.

### 2.1.2 Summarizing variability

Variability in data can be summarized using the following measures:

### 2.1.2.1 Sample variance

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$ with sample mean $\bar{x}$; then the sample variance is given by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The standard deviation is used when the data are approximately normally distributed.

### 2.1.2.2 The Coefficient of Variation (C.O.V)

This is given by

$$C.O.V = \frac{s}{\bar{x}}$$

where $s$ is the standard deviation (the square root of the sample variance) and $\bar{x}$ is the sample mean.

C.O.V is more advantageous since it does not depend on the units of measurement of the observations.

### 2.1.2.3 Range

This is the difference between the largest value and the smallest value in the data set.

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$; then the range is given by:

$$\text{Range} = \max_i x_i - \min_i x_i$$

The range is mainly used when the distribution is bounded.

### 2.1.2.4 Mean absolute deviation (M.A.D)

Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a random variable $X$ with sample mean $\bar{x}$. The M.A.D for the data is given by:

$$M.A.D = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
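The variability measures of this section can be computed directly from their definitions; the sample values below are made up.

```python
import math

# Sample variance, coefficient of variation, range, and mean absolute
# deviation, following the definitions in section 2.1.2.

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def coefficient_of_variation(xs):
    return math.sqrt(sample_variance(xs)) / (sum(xs) / len(xs))

def value_range(xs):
    return max(xs) - min(xs)

def mean_absolute_deviation(xs):
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

xs = [2, 4, 4, 4, 5, 5, 7, 9]           # mean = 5
print(sample_variance(xs))              # 32/7, about 4.571
print(value_range(xs))                  # 7
print(mean_absolute_deviation(xs))      # 1.5
```

Note the C.O.V is unitless, which is exactly why it is useful for comparing variability across differently scaled datasets.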

### 2.1.3 Confidence intervals

An interval is a group of values or a set. In this case we don’t talk of a single value that a statistic takes but a possibility of it lying in a given interval. For example what is the probability that the mean of a given data lies within the interval [10, 15].

In research and experiments, the researcher usually starts with the confidence level that he/she will be working with, then from the data compute the confidence interval.

Suppose a researcher uses a 95% confidence level in calculating the confidence interval of the mean; this means that if the procedure were repeated many times, about 95% of the intervals so constructed would contain the true mean.

Confidence intervals are highly used in hypothesis testing.
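As a brief illustration, a confidence interval for the mean can be computed directly from sample data. The sketch below uses Python with scipy; the dataset and the 95% level are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 20 measurements (illustrative values only).
data = np.array([12.1, 13.4, 11.8, 14.2, 12.9, 13.1, 12.5, 13.8,
                 12.2, 13.0, 12.7, 13.5, 11.9, 12.8, 13.2, 12.6,
                 13.3, 12.4, 13.6, 12.3])

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean, s / sqrt(n)

# 95% confidence interval for the mean; the t distribution is used
# because the population standard deviation is unknown.
low, high = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```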

### 2.1.4 Hypothesis Testing

In hypothesis testing, the researcher usually attempts to check whether 'A' and 'B' are the same or not. For example: is the speed of algorithm A the same as the speed of algorithm B? Is the old system better than the new system?

First the researcher will come up with a hypothesis, called the null hypothesis, and check this hypothesis against the alternative hypothesis. The null hypothesis is evaluated at a predefined significance level. The null hypothesis is usually stated positively, for example "The speed of algorithm A is the same as the speed of algorithm B."

Before we proceed to look at the different statistical significance tests, we explore the different categories of observations.

There are two categories, namely:

Paired observations

Unpaired observations.

The different tests are applied differently for each kind of observation, for example there is a t test for paired data and a different t test for unpaired data.

### 2.1.4.1 Paired Observations

Suppose a researcher wants to compare two systems, say system X and system Y. If the researcher carries out n experiments, that is, he/she carries out experiment i on system X and the corresponding experiment i on system Y, then the observations are paired and each pair of experiments is treated as a single experiment, having the value dᵢ = xᵢ − yᵢ.

### 2.1.4.2 Unpaired observation

If the measurements on the two systems are done separately (no corresponding measurements) then the resulting observations will be unpaired and it will not be practical to compute pairwise differences. In this case we deal with each dataset separately and only compare the statistics of interest rather than the corresponding observations.

### 2.1.4.3 Null and Alternative Hypotheses

In the previous section we mentioned that in a test of hypothesis there is usually a null hypothesis which is tested against the alternative hypothesis. In this section, these two are discussed in greater detail.

A Null Hypothesis, denoted by H₀, represents an idea or notion which is believed to be true but has not been proved. For example, in information retrieval a null hypothesis may be put as follows:

H₀: "There is no difference in the speeds of the two search algorithms"

An Alternative Hypothesis, denoted by H₁, is the statement of what the test wishes to check. It is usually the opposite of the Null Hypothesis.

After the test has been done, the results are usually presented in terms of the Null Hypothesis, either “Reject the Null hypothesis” or “Do not reject the null hypothesis.”

Failing to reject the null hypothesis does not necessarily mean that the null hypothesis is accepted. It means that not enough evidence was found during the study to reject it.

A hypothesis can either be simple or complex.

A simple hypothesis is one which completely specifies the distribution.

For example, consider a random variable from a normal distribution with mean µ and standard deviation 100; we may test the hypothesis H₀: µ = µ₀ for some specified value µ₀.

A complex hypothesis does not completely specify the distribution. For example, for the same random variable we may test the hypothesis H₀: µ > µ₀.

### 2.1.4.4 Type I and Type II errors

In test of hypothesis, two kinds of errors may arise, namely; Type I and Type II errors.

A Type I error arises when you reject the null hypothesis when in actual sense it is true. Its probability is usually denoted by α, the level of significance.

A Type II error arises when you fail to reject the null hypothesis when in real sense it is false. Its probability is usually denoted by β.

These two errors are related in such a way that if you reduce one, the other one increases, and vice versa.

The following table summarizes the Type I and Type II errors.

| | H₀ is true | H₀ is false |
| --- | --- | --- |
| Reject H₀ | Type I error (α) | Correct decision |
| Do not reject H₀ | Correct decision | Type II error (β) |

### 2.1.4.5 Test Statistic

This is a value computed from the collected data and it is used to decide whether to reject or not reject the null hypothesis. The test statistic to be used in a given hypothesis testing situation will depend on the distribution from which the sample comes from.

### 2.1.4.6 Critical Region or Rejection region

This refers to the values that if the test statistic takes, will lead to rejection of the null hypothesis. This region depends on the significance level, α.

Since the values of the test statistic determine whether the null hypothesis will be rejected or not, that is, some of its values will lead to rejection of the null hypothesis and others will not, the sample space of a test statistic is partitioned into two regions: one leads to rejection of the null hypothesis (the critical region) and the other leads to not rejecting the null hypothesis.

### 2.1.4.7 Significance level

This is defined as a fixed probability of wrongly rejecting the null hypothesis when in real sense it is true. It was mentioned above to be denoted by α, and it is also the probability of a Type I error.

Many researchers prefer to use α = 0.05, but there is nothing unique about this figure; it is simply the value commonly preferred by many scientists. One can use other values of α provided they are sufficiently low.

### 2.1.4.8 P-value

This is defined as the probability of seeing results as extreme as those observed, given that the null hypothesis is true. It equals the smallest significance level at which the null hypothesis would just be rejected. A result is said to be significant if the p-value is less than the significance level. For example, if one is doing a test of hypothesis with the level of significance being 0.05, then the null hypothesis will be rejected if p < 0.05, where p is the p-value.

### 2.1.4.9 Power of a test

This is used to measure the ability of the test being used to reject the null hypothesis when it is actually false. It is also defined as the probability of not making a Type II error.

Power of a test = 1 − β

Power of a test ranges from 0 to 1, one being the best.

A statistical test can either be one-tailed or two-tailed.

(i) One-tailed

A test is said to be one tailed if the values which lead to rejection of a null hypothesis are located wholly in one tail of the probability distribution.

For example, if a researcher claims that the average speed of a search algorithm is 0.1, then the test of hypothesis can be formulated as:

H₀: µ = 0.1

vs

H₁: µ > 0.1

This is an example of a one-sided test, since the critical region will be on the right-hand side alone.

(ii) Two-tailed test

A test is said to be two-tailed if the values which lead to the rejection of the null hypothesis are located in both ends of the distribution.

For example, if a researcher claims that the average speed of a search algorithm is 0.1, then the test of hypothesis can be formulated as:

H₀: µ = 0.1

vs

H₁: µ ≠ 0.1

This is an example of a two-sided test because the critical region is in both the right and left tails.

There exist different tests in statistical analysis and significance testing. The tests can be categorized into two broad categories, namely:

Parametric tests

Non-parametric tests

### 2.1.5 Parametric tests

In parametric tests, it is assumed that the data being used in the test came from a population whose distribution is known. More assumptions are made in parametric tests, and so the accuracy of the results will depend on whether the assumptions made are indeed correct. If the assumptions are correct then parametric methods give reliable conclusions; otherwise the conclusions are misleading.

Parametric tests are mainly used where the normality assumption holds, that is, it is assumed that the data came from a population which is normally distributed. This is supported by the Central Limit Theorem, which implies that if the sample size is large, the sampling distribution of the mean is approximately normal.

Next the different parametric tests are discussed in depth.

### 2.1.5.1 Comparing one group to a hypothetical value [One Sample t test]

In the one sample t test, the mean of the sample data is compared to a known value, i.e. it is checked whether the population from which the sample was collected has a mean equal to the known value.

Assumptions made

The population from which the data is collected is normally distributed

The population standard deviation, σ, is unknown and is estimated by the sample standard deviation.

The data are random samples of independent observations

The null hypothesis for this test is given by:

H₀: µ = µ₀, where µ₀ is known.

The null hypothesis is tested against any one of the following alternative hypotheses:

H₁: µ ≠ µ₀, H₁: µ > µ₀ or H₁: µ < µ₀

The t score is used in this test and it is calculated as follows.

Let x₁, x₂, …, xₙ be the sample data; then

t = (x̄ − µ₀) / (s/√n)

where x̄ is the sample mean,

µ₀ is the hypothesized population mean, and

s/√n is the standard error of the mean.
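A minimal sketch of this test in Python, using `scipy.stats.ttest_1samp`; the sample values and the hypothesized mean 0.1 are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; test H0: mu = 0.1 against H1: mu != 0.1.
sample = np.array([0.12, 0.09, 0.11, 0.10, 0.13, 0.08, 0.12, 0.11, 0.10, 0.14])
mu0 = 0.1

t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# The same t score computed directly: t = (xbar - mu0) / (s / sqrt(n)).
n = len(sample)
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
```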

### 2.1.5.2 Comparing two unpaired groups [Unpaired t test]

The unpaired t test is used to test the null hypothesis that the means of two independent random samples from normal distributions are equal.

Assumptions made:

The population from which the data is collected is normally distributed.

The samples are independent.

There are two different approaches: one when it is assumed that the variances of the two samples are equal, and another when the two variances are not equal.

(i) If the two variances are equal the test statistic is calculated as follows:

t = (x̄₁ − x̄₂) / (s_p √(1/n₁ + 1/n₂))

where x̄₁ is the sample mean of the first sample, x̄₂ is the sample mean of the second sample, s_p² is the pooled sample variance, and n₁ and n₂ are the sample sizes. The pooled sample variance is

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

(ii) If the two variances are unequal, an approximate form of the t test called Satterthwaite's test is usually used. It is as follows:

d = (x̄₁ − x̄₂) / √(s₁²/n₁ + s₂²/n₂)

where x̄₁ and x̄₂ are the sample means of the first and second samples,

n₁ and n₂ are the sample sizes of sample 1 and sample 2 respectively, and

s₁² and s₂² are the sample variances corresponding to samples 1 and 2 respectively.

d is the Behrens-Welch test statistic, evaluated as a Student t quantile with df degrees of freedom using Satterthwaite's approximation.
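Both variants can be sketched in Python via `scipy.stats.ttest_ind`, whose `equal_var` flag switches between the pooled test and the Welch-Satterthwaite approximation. The two samples below are hypothetical.

```python
import numpy as np
from scipy import stats

# Two hypothetical independent samples (illustrative values only).
x1 = np.array([120.0, 115.0, 130.0, 118.0, 125.0, 122.0, 119.0])
x2 = np.array([101.0, 98.0, 110.0, 105.0, 99.0, 103.0])

# (i) Equal variances: pooled t test, written out from the formula.
n1, n2 = len(x1), len(x2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_pooled = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# The same pooled test via scipy.
t_eq, p_eq = stats.ttest_ind(x1, x2, equal_var=True)

# (ii) Unequal variances: Welch (Satterthwaite) approximation.
t_welch, p_welch = stats.ttest_ind(x1, x2, equal_var=False)
```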

Consider the example from Armitage and Berry (1994, pg. 111), where the gain in weight of 19 female rats between 28 and 84 days after birth is examined. 12 were fed on a high protein diet and 7 on a low protein diet.

Here the null hypothesis is that the means of high protein and that of low protein are equal.

High protein has a sample size n =12

Low protein has a sample size n =7

Mean of High Protein = 120

Mean of Low Protein = 101

Assuming equal variances

Combined standard error = 10.045276

The degree of freedom (d.f) is given by (12+7-2)

d.f = 17

t = 1.891436

Two sided P = 0.0757

95% confidence interval for difference between means = -2.193679 to 40.193679

Since the p-value > 0.05 (testing at the 5% significance level), we fail to reject the null hypothesis.

Assuming unequal variances

Combined standard error = 9.943999

df = 13.081702

t(d) = 1.9107

Two sided P = 0.0782

95% confidence interval for difference between means = -1.980004 to 39.980004

Since the p-value > 0.05 (testing at the 5% significance level), we fail to reject the null hypothesis.

### 2.1.5.3 Comparing two paired groups [paired t test]

This test is used to compare the means of the same individuals/items, or of related items, at different times. Items are usually tested pre and post intervention (treatment), or the individuals are paired, such as in the case of twins. Since the observations are in pairs, the two samples will have equal sample sizes.

It usually tests the difference between two corresponding observations. Suppose you have observations x₁, x₂, …, xₙ and y₁, y₂, …, yₙ.

Then the difference between corresponding observations is given by dᵢ = xᵢ − yᵢ.

The test of hypothesis for this case is formulated as shown below:

H₀: µ_d = 0 [There is no difference between the observations]

vs

H₁: µ_d ≠ 0

The test statistic is given by

t = (d̄ − µ₀) / (s_d/√n)

where

µ₀ is usually set to zero,

s_d is the standard deviation of the differences dᵢ, and

n is the sample size. The test statistic is t with n − 1 degrees of freedom. Suppose the test is done at the 5% significance level; then reject the null hypothesis if the p-value associated with t is less than 0.05, in which case there would be evidence of a difference in means across the paired observations.

Assumptions made:

(i) The observations are independent of each other.

(ii) The dependent variable is measured on an interval scale.

(iii)The differences are normally distributed in the population.

Consider Anthony Green's (2000) example, where the corresponding value of D for each pair is calculated in the last column.
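A paired t test can be sketched as follows; the before/after scores are hypothetical, and `scipy.stats.ttest_rel` is checked against the direct computation from the differences.

```python
import numpy as np
from scipy import stats

# Hypothetical before/after marks for 8 paired subjects.
before = np.array([35.0, 40.0, 32.0, 38.0, 41.0, 36.0, 39.0, 34.0])
after = np.array([38.0, 42.0, 33.0, 41.0, 44.0, 35.0, 43.0, 36.0])

d = after - before  # per-pair differences d_i
n = len(d)
# t = (dbar - 0) / (s_d / sqrt(n)), with n - 1 degrees of freedom.
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(n))

t_stat, p_value = stats.ttest_rel(after, before)
```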

### 2.1.5.4 Comparing more than two groups [ANOVA test]

The t test is used when the data is in two groups only and the researcher wishes to compare the means of the groups (Gossett, 1908). When there are more than two groups, the comparison is approached in a different way, which is called ANOVA.

Although it is possible to compare many groups using the t test, by comparing two groups at a time, the drawback of these many t tests is that the chance of a spurious significant result accumulates, leading to wrong conclusions, Lindman (1974).

Assumptions used in ANOVA.

The errors are normally distributed.

The expected values of the errors are zero.

The variances of all errors are equal to each other.

The errors are independent.

In ANOVA, generally the researcher has k groups with means µ₁, µ₂, …, µ_k, and the groups need not have the same sizes (the nᵢ may vary). In ANOVA the researcher wishes to test the hypothesis:

H₀: µ₁ = µ₂ = … = µ_k

against

H₁: At least one of the means differs from the others.

Hinkelmann et al. (2008) discussed in detail two sources of errors in statistics, which are the assignable and chance causes.

Assignable causes are ones which can be identified, traced and eliminated or enhanced.

Chance causes are beyond the control of man.

ANOVA compares the groups by examining the ratio of the variability between conditions and the variability within each condition, also known as the 'between variability' and the 'within variability'. The amount of variation due to assignable causes (the variance between the samples) and the variation due to chance causes (the variance within the samples) are obtained separately and compared using an F-test.

So the total sum of squares is partitioned into the sum of squares due to treatment and the sum of squares due to error, as shown below:

SS_Total = SS_Treatment + SS_Error

Also the degrees of freedom (df) partition in a similar form:

N − 1 = (k − 1) + (N − k)

The F statistic is used to test the hypothesis in the ANOVA test.

For a one-way ANOVA test, the F statistic is

F = [SS_Treatment/(k − 1)] / [SS_Error/(N − k)]

where k is the number of treatments and N is the total number of cases. F is compared to the F distribution with (k − 1) and (N − k) degrees of freedom, in that order, at the specified level of significance.

One way ANOVA

This is when only one factor is applied in the experiment.

Two way ANOVA

This is when two factors are applied in the experiment.

The other types of ANOVA are the Factorial ANOVA and MANOVA.

Factorial ANOVA is used when the researcher wishes to check the effects of two or more factor variables. The most encountered type of factorial ANOVA is the 2×2 design, where there are two independent variables and each variable has two levels.

MANOVA is used when the research is multivariate, that is, when there is more than one dependent variable.

Hinkelmann et al. (2008) discussed these other types of ANOVA in detail.
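A minimal one-way ANOVA sketch in Python; the three groups are hypothetical. It uses `scipy.stats.f_oneway` and also verifies the sum-of-squares partition directly.

```python
import numpy as np
from scipy import stats

# Three hypothetical treatment groups (illustrative values only).
groups = [np.array([85.0, 86, 88, 75, 78, 94]),
          np.array([91.0, 92, 93, 85, 87, 84]),
          np.array([79.0, 78, 88, 94, 92, 85])]

f_stat, p_value = stats.f_oneway(*groups)

# Sum-of-squares partition: SS_Total = SS_Treatment + SS_Error.
all_data = np.concatenate(groups)
grand = all_data.mean()
ss_total = ((all_data - grand) ** 2).sum()
ss_treat = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)

k, N = len(groups), len(all_data)  # 3 treatments, 18 cases in total
f_manual = (ss_treat / (k - 1)) / (ss_error / (N - k))
```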

### 2.1.5.5 Quantification of association between variables (Correlation)

This is measured using the Pearson correlation coefficient for the case of parametric test. It is used to determine the strength and direction of the relationship between any two variables.

Assumptions:

(i)Both variables should be normally distributed.

(ii) Both variables should be interval or ratio variables.

Pearson’s correlation produces a correlation coefficient, r, which ranges from −1 to +1.

If r is negative then there is an inverse relationship between the dependent and independent variable, i.e. when one increases the other decreases and vice versa.

If r is positive then it means that both variables move in the same direction, i.e. as one is increased the other one also increases. The further that r is away from 0 the stronger the relationship between the two variables.

In his work, Sebastian (2003) summarized the properties and assumptions about correlation as follows;

r measures how close the points in a scatter plot approximate a straight line. This property does not hold when the line is perfectly horizontal or perpendicular to the x axis.

r is not affected by linear transforms of data. In other words, if income is used as one of the variables and all incomes are divided by 100 to simplify computation, this will not change the obtained value of r.

r can be significantly affected by extreme values or outliers of x or y.

r cannot be used to establish causal relationships.

r is affected by range restrictions. This means that if the values used for x or y are limited to a particular set of values this is liable to decrease the value of r.

Pearson’s r between two variables is calculated using the formula:

r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)²]

Where x is the independent variable and y is the dependent variable.
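A short sketch with hypothetical x and y values, comparing `scipy.stats.pearsonr` with the definition, and demonstrating the invariance under linear transforms noted above.

```python
import numpy as np
from scipy import stats

# Hypothetical, roughly linear data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1])

r, p_value = stats.pearsonr(x, y)

# The same r from the definition.
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum())

# r is unchanged by a linear transform of either variable (e.g. y / 100).
r_scaled, _ = stats.pearsonr(x, y / 100)
```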

### 2.1.6 Nonparametric tests.

In the previous section, the parametric tests were reviewed; in this section the corresponding nonparametric tests are discussed. Nonparametric tests do not make any assumptions about the underlying distribution. In nonparametric tests, the outcome variable is ranked from the smallest to the largest, then the ranks obtained are analyzed and conclusions made. The tests are discussed below:

### 2.1.6.1 Comparing one group to a hypothetical value [Wilcoxon test]

This is the nonparametric counterpart of the one-sample t test.

Assumptions

(i) It makes the assumption that the observations are symmetrically distributed about the mean.

The Wilcoxon test is used to test whether the location (median) of the measurement is equal to a specified value.

This test is based on the sum of the (positive or negative) ranks of the differences between the observed values and the expected center. The null distribution of the test statistic corresponds to selecting each rank from 1 to n with probability ½ and calculating the sum.

This test evaluates whether a sample of n observations is drawn from a population in which the median equals a specific (hypothesized) value.

The test requires one numeric data column. Dallal (2008) gave an example of how the ranks are obtained, as follows: data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties, e.g., 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. (If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.)
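A sketch in Python on hypothetical data: `scipy.stats.rankdata` reproduces the tie-handling illustrated above, and `scipy.stats.wilcoxon` applied to the differences from the hypothesized median (120 here, an arbitrary choice) carries out the test.

```python
from scipy import stats

# Tie handling: 117, 119, 119, 125, 128 -> ranks 1, 2.5, 2.5, 4, 5.
ranks = stats.rankdata([117, 119, 119, 125, 128])

# Hypothetical sample; test whether the median equals 120.
data = [117, 119, 119, 125, 128, 130, 114, 122, 126, 121]
diffs = [x - 120 for x in data]  # wilcoxon tests that these center on 0
stat, p_value = stats.wilcoxon(diffs)
```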

### 2.1.6.2 Comparing two unpaired groups [Mann-Whitney test]

Its parametric counterpart is the unpaired t test. Suppose you have two groups, with sample sizes n_{1} and n_{2} .

The Mann-Whitney U test ranks all the cases from the lowest to the highest score. The Mean Rank is the mean of the ranks for each group and the Sum of Ranks is the sum of the ranks for each group. U₁ is defined as the number of times that a score from the first group is lower in rank than a score from the second group. U₂ is defined as the number of times that a score from the second group is lower in rank than a score from the first group. U is defined as the smaller of U₁ and U₂.

The computational formulas for U₁ and U₂ are as follows:

U₁ = n₁n₂ + n₁(n₁ + 1)/2 − R₁

U₂ = n₁n₂ + n₂(n₂ + 1)/2 − R₂

Where

n_{1} = number of observations in group 1

n_{2} = number of observations in group 2

R_{1} = sum of ranks assigned to group 1

R_{2} = sum of ranks assigned to group 2

The Mann-Whitney U test compares the locations of one set of scores relative to the locations of the other set of scores. If U is not significant, then the rankings of one set of scores are similar to the rankings of the other set of scores.
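A sketch with hypothetical scores; `scipy.stats.mannwhitneyu` reports the U of one convention (conventions differ in which direction is counted), while U₁ + U₂ = n₁n₂ always holds.

```python
from scipy import stats

g1 = [12, 15, 14, 10, 18, 11]  # hypothetical scores, group 1
g2 = [16, 20, 19, 17, 21, 13]  # hypothetical scores, group 2
n1, n2 = len(g1), len(g2)

u_stat, p_value = stats.mannwhitneyu(g1, g2, alternative='two-sided')

# Computational formulas: rank all scores together, then
# U1 = n1*n2 + n1*(n1+1)/2 - R1 and U2 = n1*n2 - U1.
ranks = stats.rankdata(g1 + g2)
r1 = ranks[:n1].sum()
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 - u1
```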

### 2.1.6.3 Compare two paired groups [Wilcoxon paired test]

This is a nonparametric test that compares two paired groups. It is the counterpart of the paired t test in parametric tests. Suppose the data, which is in pairs, is named as column X and column Y. First the difference (dᵢ = xᵢ − yᵢ) between each pair is found, then the absolute values of these differences (|dᵢ|) are ranked from the smallest to the largest. The researcher then sums the ranks of the differences where column X was higher (positive ranks) and sums the ranks where column Y was higher (negative ranks). If the two sums of ranks are very different, the P value will be small, hence rejecting the null hypothesis.

### Assumptions

The differences are symmetrically distributed.

The pairs are independent.

### 2.1.6.4 Compare three or more unmatched groups [Kruskal-Wallis test]

The Kruskal-Wallis test is a nonparametric test that compares three or more unpaired groups. It is the nonparametric equivalent of one-way ANOVA. First the values are ranked from the smallest to the largest without regard to which value is in which group. The deviations among the rank sums are combined to create a single value called the Kruskal-Wallis statistic. A large Kruskal-Wallis statistic corresponds to a large discrepancy among rank sums. This test has less power than one-way ANOVA when the data really are normally distributed.

When to use Kruskal-Wallis test

When the errors are independent.

When the data are unpaired.

When the data was sampled from non-Gaussian populations.

### 2.1.6.5 Friedman test

This is a nonparametric test which is used to compare the means of three or more paired groups.

It is used when:

The items are independent.

The sample is collected from a population which is not normally distributed.

The matching of the pairs is effective.

### 2.1.6.6 Spearman correlation

This is a nonparametric measure of correlation between two variables. Here the data under each variable are ranked from the smallest to the largest, the smallest being given rank 1 and so on. The Pearson correlation is then calculated on these rankings. It is usually denoted by ρ, and it is given by

ρ = 1 − 6Σdᵢ² / [n(n² − 1)]

where

dᵢ is the difference between the ranks of the corresponding values of the two variables,

and n is the sample size, which is the same for the two variables.
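A sketch comparing `scipy.stats.spearmanr` with the rank formula; the data are hypothetical and contain no ties, so the two agree exactly.

```python
import numpy as np
from scipy import stats

# Hypothetical data, nearly monotone.
x = np.array([10, 20, 30, 40, 50, 60, 70])
y = np.array([1.1, 2.3, 2.9, 4.2, 3.9, 6.5, 7.1])

rho, p_value = stats.spearmanr(x, y)

# rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)), d_i = difference of ranks.
d = stats.rankdata(x) - stats.rankdata(y)
n = len(x)
rho_manual = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))
```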

### 2.1.6.7 Chi square test (Pearson’s)

This test checks the null hypothesis that the frequency distribution of events observed in a sample is consistent with a particular theoretical distribution. The events being investigated must be mutually exclusive and have total probability 1. It checks the goodness of fit of a given sample to a particular distribution.

It is mainly applied to contingency tables.

### 2.1.7 Statistical significance tests used in information retrieval

In this section, statistical tests which are used in information retrieval are discussed.

### 2.1.7.1 McNemar’s test

Carl Staelin (2001) described McNemar's test as one which compares algorithms A and B by using one test set with n samples.

Let n₀₀ be the number of items misclassified by both A and B,

let n₀₁ be the number of items misclassified by A alone,

let n₁₀ be the number of items misclassified by B alone, and

let n₁₁ be the number of items classified correctly by both A and B.

Then the test statistic, which is a chi square with one degree of freedom (with continuity correction), is calculated using:

χ² = (|n₀₁ − n₁₀| − 1)² / (n₀₁ + n₁₀)
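A minimal sketch of the statistic under the common continuity-corrected chi-square form (an assumption of this sketch); the disagreement counts are hypothetical.

```python
from scipy import stats

# Hypothetical disagreement counts between classifiers A and B.
n01 = 15  # misclassified by A alone
n10 = 5   # misclassified by B alone

# Only the discordant counts enter the McNemar statistic.
chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
p_value = stats.chi2.sf(chi2, df=1)  # chi square, 1 degree of freedom
```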

### 2.1.7.2 Permutation test.

This test can be used to compare two algorithms. It is based on the fact that even if two algorithms were equally accurate, some random difference in outcomes would be expected based on data splits. If the measured difference is random, then the average of many random permutations of the results would give about the same difference.

The procedure is as outlined below;

First get a set of k estimates of accuracy say, A = {a_{1},a_{2}, ..., a_{k}} for M_{1} and B = {b_{1}, ..., b_{k}} for M_{2}

Calculate the average accuracies, μ_A = (1/k)Σᵢ aᵢ and μ_B = (1/k)Σᵢ bᵢ

Calculate dAB = |μA - μB|

let p = 0

Repeat n times

let S={ a_{1}, ..., a_{k}, b_{1}, ..., b_{k}}

randomly partition S into two equal sized sets, R and T (statistically best if partitions not repeated)

Calculate the average accuracies, μR and μT

Calculate dRT = |μR - μT|

if dRT ≥ dAB then p = p+1

p-value = p/n (Give the values of p, n, and p-value)

A low p-value implies that the algorithms really are different
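The outlined procedure can be implemented directly. A sketch in Python with hypothetical accuracy estimates; the seed and iteration count are arbitrary choices.

```python
import random

def permutation_test(a, b, n_iter=2000, seed=0):
    """Approximate permutation test for the difference in mean accuracy,
    following the resampling procedure outlined above."""
    rng = random.Random(seed)
    k = len(a)
    d_ab = abs(sum(a) / k - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        r, t = pooled[:k], pooled[k:]
        d_rt = abs(sum(r) / len(r) - sum(t) / len(t))
        if d_rt >= d_ab:
            hits += 1
    return hits / n_iter  # p-value estimate

# Hypothetical accuracy estimates for two algorithms.
acc_a = [0.91, 0.93, 0.90, 0.92, 0.94]
acc_b = [0.85, 0.84, 0.86, 0.83, 0.87]
p = permutation_test(acc_a, acc_b)
```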

### 2.1.7.3 Two Proportions Test

This test is based on comparing the error rates of algorithms A and B. It uses the assumption that the number of misclassifications is a Binomial random variable.

Let P_A and P_B be the observed proportions of test samples misclassified by algorithms A and B respectively, each measured on n samples.

The mean of the number of misclassifications for A is nP_A and its variance is nP_A(1 − P_A).

When n is large, and assuming P_A and P_B are independent, (P_A − P_B) is approximately normal. Then the test statistic

z = (P_A − P_B) / √[2P̄(1 − P̄)/n], where P̄ = (P_A + P_B)/2,

is used to compare the two.
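A sketch of the two proportions test with a pooled variance estimate under H₀ (the pooled form of the z statistic is an assumption of this sketch; the error counts and n are hypothetical).

```python
import numpy as np
from scipy import stats

# Hypothetical error counts for algorithms A and B on n test samples each.
n = 500
errors_a, errors_b = 60, 40
p_a, p_b = errors_a / n, errors_b / n

# Pooled proportion under H0: P_A = P_B.
p_pool = (errors_a + errors_b) / (2 * n)
z = (p_a - p_b) / np.sqrt(p_pool * (1 - p_pool) * (2 / n))
p_value = 2 * stats.norm.sf(abs(z))  # two-sided normal p-value
```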

Other tests include; Paired t-test, k-fold Cross-validated Paired t-Test, 5x2cv Paired t-Test.

### 2.2 Other Measures Which Can Be Used In Place of Statistical Significance Tests.

In this section, other statistical measures that can be used in place of significance tests are discussed. These include:

### 2.2.1 Effect Size

This is defined as a measure of the strength of the relationship between any two variables. Statistical significance tests only check if there is a difference; they do not check how big or how small the difference is. Significance tests do not tell us if the difference is big enough or meaningful for the researcher to use it to make a decision. For example, if we are checking the effect of remedial classes on the performance of students, suppose that before the remedial class the mean mark of the students was 35% and after the remedial class the mean mark rose to 36%.

When testing this for statistical significance, depending on the sample size, the researcher might find that there is a significant difference in performance before and after the remedial classes. However, in a true sense a rise of 1% does not indicate a real change, and it would not be meaningful to declare that the remedial classes had an effect on the performance of the students.

To know if an observed difference is not only statistically significant but also has an important or meaningful interpretation, a researcher will need to calculate its effect size. Instead of giving the results of the difference in terms of the marks themselves, effect size is standardized. In fact, effect sizes are calculated on a common scale, which allows the researcher or scientist to compare the effectiveness of different treatments based on the same outcome.

In practical situations, effect sizes are very useful for making decisions, since a highly significant relationship may not be of any importance if its effect size is small. Effect size can be a standardized measure of effect (such as the odds ratio, Cohen's d, and r) or an unstandardized measure (e.g., the raw difference between group means, or unstandardized regression coefficients). Reporting of effect size in scientific papers is critically important and usually boosts the reader's confidence in the findings of that particular research paper.

Effect size makes it possible to do meta-analysis.

There are very many effect size measures used by researchers and each of them has a specific situation when it is used. This may include: Standardized Mean Difference, Correlation Coefficient, Odds-Ratio, Standardized Gain Score, Proportion, Relative risk (RR) etc.

In this section different effect size measures are discussed in great detail.

### 2.2.1.1 Standardized Mean Difference

For two groups being studied in a research, the population effect size in this case is usually based on the standardized difference between the means of the two groups. This is given by the formula:

d = (µ₁ − µ₂) / σ

where µ₁ is the mean of population 1, µ₂ is the mean of population 2, and

σ is the population standard deviation, which may be taken to be that of the second population or the pooled standard deviation of the two populations.

If this is compared to the t statistic used in hypothesis testing, it is easy to see that they are almost similar; the only difference is that the t statistic has a √n factor in its denominator while this measure of effect does not involve any function of the sample size. This implies that the effect size is not affected by the sample size used in the research.

### 2.2.1.2 Cohen's d

Cohen (1988) defined d as the difference between the means, M1 − M2, divided by the standard deviation, s, of either group, where M1 is the mean of the first group and M2 is the mean of the second group in the study. Cohen clearly outlined in his work that the standard deviation of either group could be used when the variances of the two groups are homogeneous.

Other authors in their books and papers usually use the pooled standard deviation of the two groups,

s_pooled = √[((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)],

where n₁ is the sample size for group 1 and n₂ is the sample size for group 2. Usually in meta-analysis the two groups are considered to be the treatment group and the placebo group. By convention the subtraction, M1 − M2, is done so that the difference is positive if it is in the direction of improvement (the predicted direction) and negative if it is in the direction of deterioration (opposite to the predicted direction). d is a descriptive measure.
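A small numeric sketch of Cohen's d with a pooled standard deviation; the two samples are hypothetical.

```python
import numpy as np

# Hypothetical treatment and control samples.
treat = np.array([78.0, 82, 85, 80, 79, 84])
ctrl = np.array([72.0, 75, 70, 74, 73, 71])

n1, n2 = len(treat), len(ctrl)
s1, s2 = treat.std(ddof=1), ctrl.std(ddof=1)
# Pooled standard deviation over the two groups.
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
d = (treat.mean() - ctrl.mean()) / s_pooled
```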

### 2.2.1.3 Hedges' g

In 1981, Larry Hedges suggested a measure g based on the standardized difference of the means of two study groups. It is normally computed by using the square root of the mean square error from the analysis of variance testing for differences between the two groups. The formula for g is given below:

g = (M1 − M2) / s*

where the standard deviation for this case is given by

s* = √[((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2)]

This formula for g is almost similar to Cohen's d, with a difference only in the formula for computing the standard deviation: Cohen's original d has n₁ + n₂ in the denominator of the pooled standard deviation while Hedges' g has n₁ + n₂ − 2.

### 2.2.1.4 Glass's Δ

Glass (1976) also came up with a formula for the measure of effect; his method uses the standard deviation of the second group. Glass's delta is defined as the mean difference between the experimental and control groups divided by the standard deviation of the control group. The formula is given by

Δ = (M1 − M2) / s₂

Where s_{2} is the standard deviation of the second group (control group)

The second group may be regarded as a control (placebo) group and the first one as the treatment group, Glass reasoned that if several treatments were compared to the control group it would be better to use just the standard deviation computed from the control group, so that effect sizes would not differ under equal means and different variances.

Under an assumption of equal population variances a pooled estimate for σ is more precise.

### 2.2.1.5 Cohen's f²

Cohen's f² measure of effect is used in the case of the F test for ANOVA and for multiple regression. This measure of effect in the case of multiple regression is defined as

f² = R² / (1 − R²)

where R² is the squared multiple correlation.

f² effect sizes of 0.02, 0.15, and 0.35 are said to be small, medium and large respectively.

### 2.2.1.6 Odds ratio

This is also a measure of effect and it is used when both variables of the study are binary. For example, consider an examination situation in which there are two classes: one which sat for remedial classes and one which did not. In the control group, four students pass the class for every one who fails, so the odds of passing are four to one. In the treatment group (sat remedial), eight students pass for every one who fails, so the odds of passing are eight to one. The odds of passing in the treatment group are two times higher than in the control group (because 8 divided by 4 is 2); therefore, the odds ratio is 2.

### 2.2.1.7 Relative risk

This is a measure of effect size defined as the ratio of the probability of an event occurring in one group (e.g. the treatment group) to the probability of it occurring in another (e.g. the control group). The difference between relative risk and the odds ratio is that relative risk compares probabilities while the odds ratio compares the odds of a particular event occurring.

The relative risk and odds ratio have different applications in epidemiology: the relative risk is used in randomized controlled trials and cohort studies, while the odds ratio is used in retrospective studies and case-control studies.
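To make the distinction concrete, the sketch below computes both measures from the same hypothetical 2x2 table; note that they differ even though they are built from the same counts:

```python
# Relative risk compares probabilities; the odds ratio compares odds.
def relative_risk(a, b, c, d):
    # a, b = events / non-events in the exposed group
    # c, d = events / non-events in the unexposed group
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# hypothetical counts: 20/80 events in exposed, 10/90 in unexposed
a, b, c, d = 20, 80, 10, 90
print(round(relative_risk(a, b, c, d), 2))  # 0.20 / 0.10 -> 2.0
print(round(odds_ratio(a, b, c, d), 2))     # 1800 / 800  -> 2.25
```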

### 2.2.1.8 Cramer’s V

This measure is well suited as a measure of association for the chi-square test.

While Cohen’s d only estimates the extent of the relationship between two variables, Cramer’s V may be used with variables having more than two levels. This measure can also be applied to 'goodness of fit' chi-square models.
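A commonly used form is V = sqrt(χ² / (n·(k − 1))), where n is the total count and k is the smaller of the number of rows and columns. A minimal sketch on a hypothetical contingency table:

```python
# Cramer's V from a contingency table, with the chi-square
# statistic computed directly from observed vs expected counts.
def cramers_v(table):
    rows, cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(rows)) for j in range(cols)]
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # k - 1 = min(rows, cols) - 1
    return (chi2 / (n * (min(rows, cols) - 1))) ** 0.5

# hypothetical 2x3 table (works for more than two levels per variable)
table = [[10, 20, 30],
         [30, 20, 10]]
print(round(cramers_v(table), 3))  # -> 0.408
```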

### Chapter 3

### Data Analysis

In this section, the data used was retrieved from SIGIR. The papers submitted to SIGIR by researchers from 2006 to 2008 were scrutinized to determine whether the tests were used correctly or not. To establish this, each paper was checked and, wherever a test was used, the usage was checked against the assumptions of that test and the conditions under which it should be applied. These assumptions, and when to apply each test, were discussed in depth in Chapter 2 of this paper.

The data collected for each year was summarized using proportions, then all the years were combined together and a single conclusion reached.

First, the random variable X is defined according to the usage of tests; it is coded as either 0 or 1, thus it is a Bernoulli variable:

X = 1 if a test was used,

X = 0 if no test was used.

After classification as either used or not used, each paper that used a test is further classified as either correctly used or wrongly used; this indicator is also Bernoulli. The summarized data was thus classified into used/not used and correctly used/wrongly used.
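The coding scheme above can be sketched in a few lines of Python. The counts below are hypothetical and are not the actual SIGIR data:

```python
# Each paper is a Bernoulli trial: 1 = used a statistical test, 0 = did not.
# Among the papers that used a test, a second Bernoulli indicator
# marks whether the test was used correctly.
used = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]    # 10 hypothetical papers
correct = [1, 0, 1, 1, 0, 1, 1]          # the 7 papers that used a test

p_used = sum(used) / len(used)           # proportion that used a test
p_correct = sum(correct) / len(correct)  # proportion used correctly
p_wrong = 1 - p_correct                  # proportion used wrongly

print(p_used, round(p_correct, 2), round(p_wrong, 2))  # 0.7 0.71 0.29
```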

Graphical views are given using pie charts, drawn with the R statistical package.

The table below shows the data for the year 2006.

Table for proportion that used tests, 2006.

Therefore the proportion is

This is

Table for proportions of correct usage of tests, 2006.

Proportion of tests used correctly is

This is

The proportion of tests used wrongly is

This is

R statistical software was used to draw the pie charts so as to give a graphical view of the differences.

Figure 1.0 Pie chart showing the proportion of documents that had statistical tests and those without tests (2D chart drawn in the R statistical package).

Figure 1.1 Pie chart showing the proportion of documents that had statistical tests and those without tests, for the year 2006.

Figure 1.2 Pie chart showing the proportion of cases in which statistical tests were used correctly and cases where they were used wrongly, for the year 2006.

The table below shows the data for the year 2007.

Table for proportion that used test, 2007.

Therefore the proportion is

This is

Table for proportions of correct usage of tests, 2007.

Proportion of tests used correctly is

This is

The proportion of tests used wrongly is

This is

Figure 1.3 Pie chart showing the proportion of documents that had statistical tests and those without tests, for the year 2007.

Figure 1.4 Pie chart showing the proportion of cases in which statistical tests were used correctly and cases where they were used wrongly, for the year 2007.

The table below shows the data for the year 2008.

Table for proportion that used test, 2008.

Therefore the proportion is

This is

Table for proportions of correct usage of tests, 2008.

Proportion of tests used correctly is

This is

The proportion of tests used wrongly is

This is

Figure 1.5 Pie chart showing the proportion of documents that had statistical tests and those without tests, for the year 2008.

Figure 1.6 Pie chart showing the proportion of cases in which statistical tests were used correctly and cases where they were used wrongly, for the year 2008.

The combined tables of proportions for the 2006, 2007 and 2008 data are as given below.

Combined Table for proportion that used test, 2006-2008.

Therefore the proportion is

This is

Combined Table for proportions of correct usage of tests, 2006-2008.

Proportion of tests used correctly is

This is

The proportion of tests used wrongly is

This is

Figure 1.7 Pie chart showing the proportion of documents that had statistical tests and those without tests, for the period 2006-2008.

Figure 1.8 Pie chart showing the proportion of cases in which statistical tests were used correctly and cases where they were used wrongly, for the period 2006-2008.

### Chapter Four

### 4.0 Discussion and Conclusion

This research has shed light on the area of statistical testing by discussing in detail the different statistical tests, when to use them, and the assumptions involved in each of them. It went on to discuss the different statistical tests used in Information Retrieval and in the comparison of search algorithms/systems.

The research also scrutinized the research papers submitted by researchers to SIGIR to check if the significance tests used in these papers were used correctly.

In the year 2006, of the papers submitted had statistical tests used and of these tests were used wrongly.

In the year 2007, of the papers submitted had statistical tests used and of these tests were used wrongly.

In the year 2008, of the papers submitted had statistical tests used and of these tests were used wrongly.

For the combined period 2006-2008 (inclusive), of the papers submitted had statistical tests used and of these tests were used wrongly.

Researchers need to understand the use of each statistical test and should apply a test only where it is applicable. If any test is used in a paper, the researcher should state clearly any assumptions made and why that particular test was chosen. The statistical significance test is a powerful tool for making inferences but can easily be misused; if a test is not necessary, do not use it. If the data can speak for themselves, do not drag in a significance test!

### 4.1 Limitations of the Study and Further Areas of Research

This study is limited to scientific papers submitted to SIGIR. It can be extended to other journals and fields of research, since statistical tests are applied in many areas, and their use in those areas can likewise be scrutinized.

### Works Cited

American Psychological Association. Task Force on Statistical Inference Report.

Washington, DC: American Psychological Association, 1996.

American Psychological Association. Publication Manual of the American Psychological

Association (5^{th} ed.). Washington, DC: American Psychological Association, 2001.


Bartlett, J. E., II, Kotrlik, J. W., & Higgins, C. Organizational research:

Determining appropriate sample size for survey research. Information Technology, Learning, and Performance Journal, 19(1), 2001, pp.43-50.

Cochran, William G. Sampling Techniques (Third ed.). Wiley publishers, 1977.

Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2^{nd} ed. Hillsdale, NJ: Lawrence Erlbaum, 1988.

Cohen, J. "A power primer". Psychological Bulletin. Vol.112: 1992, pp.155–159.

David C. Blair, Some thoughts on the reported results of TREC, 38, Information Processing and

Management, 445 (2002).

Daniel P. Dabney. Statistical Modeling of Relevance Judgments for Probabilistic Retrieval of

American Case Law. PhD thesis, University of California at Berkeley, Library and Information Studies, (1993).

David Freedman, Robert Pisani, Roger Purves, and Ani Adhikari. Statistics (2nd Ed.). p.

137, 1980.

Dar, R., Serlin, R. C., & Omer, H. (1994). Misuse of statistical tests in three decades of

psychotherapy research. Journal of Consulting and Clinical Psychology, 62(1), 75–82.

Hubbard, R. & Bayarri, M. J. Confusion over measures of evidence (p’s) versus

errors (a’s) in classical statistical testing (with comments), The American Statistician, vol.57, (August), pp.171-182, 2003.

Hunter, J. E. Needed: A ban on the significance test. Psychological Science,

vol.8, no.1, pp.1-20, 1997.

Jaffe, A.J. and H.F. Spirer. Misused Statistics. New York, NY: Marcel Dekker, Inc., 1987.

Kevin Gerson, Evaluating Legal Information Retrieval Systems: How Do the Ranked-Retrieval Methods of Westlaw and Lexis Measure Up?, 14 Legal Reference Services Q. 53, 54 (1999).

Kish, L. Survey Sampling. New York: Wiley publishers, 1965.

Kruskal, William S. 1968. "Tests of Statistical Significance." Pp. 238-250, in David Sills,

ed., International Encyclopedia of the Social Sciences, vol.14. New York: Macmillan.


Larry V. Hedges "Distribution theory for Glass's estimator of effect size and

related estimators". Journal of Educational Statistics, vol.6 (2): 1981, pp.107–128.

Rosenthal, R. and Rosnow, R. L. Essentials of behavioral research: Methods and data Analysis. (2nd edn.). New York: McGraw Hill, 1991.

Rosnow, R. L., and Rosenthal, R. Computing contrasts, effect sizes, and counternulls on

other people's published data: General procedures for research consumers. Psychological Methods, vol.1, 1996, pp.331-340.

Pedhazur, E., & Schmelkin, L. (1991). Measurement design and analysis: An integrated

approach. New York: Psychology Press.

Scott F. Burson, A Reconstruction of Thamus: Comments on the Evaluation of Legal

Information Retrieval Systems, 79 LAW LIBR. J. 133, 139 (1987).

Sarndal, Carl-Erik, and Swensson, Bengt, and Wretman, Jan (1992). Model Assisted

Survey Sampling. Springer-Verlag.

Smith, M. Is it the sample size of the sample as a fraction of the population that matters? Journal of Statistics Education, Vo.12:2, 2004.

Schmidt, Frank L. & Hunter, J. E. Eight common but false objections to the

discontinuation of significance testing in the analysis of research data, in Harlow, Lisa L., Mulaik, S. A. & Steiger, J. H. What if there were no Significance Tests? London: Lawrence Erlbaun, 1997.

Sheskin, David J. Handbook of Parametric and Nonparametric Statistical

Procedures. Boca Raton, Fl: CRC Press, 1997.

Siegel, S. Non-parametric statistics for the behavioral sciences. New York: McGraw-Hill, 1956

Wilcoxon, F. Individual comparisons by ranking methods. Biometrics, vol.1, 1945,

pp.80-83.

Zakzanis, K. K. Statistics to tell the truth, the whole truth, and nothing but the truth: Formulae, illustrative numerical examples, and heuristic interpretation of effect size analyses for neuropsychological researchers. Archives of Clinical Neuropsychology, vol.16, 2001, pp.653–667.