Sampling: an aspect of data collection




Sampling is that part of statistical practice concerned with the selection of an unbiased or random subset of individual observations within a population of individuals, intended to yield some knowledge about the population of concern, especially for the purposes of making predictions based on statistical inference. Sampling is an important aspect of data collection.

Researchers rarely survey the entire population, for two reasons: the cost is too high, and the population is dynamic in that the individuals making it up may change over time. The three main advantages of sampling are that the cost is lower, data collection is faster, and, since the data set is smaller, it is possible to ensure homogeneity and to improve the accuracy and quality of the data.

Each observation measures one or more properties (such as weight, location, color) of observable bodies distinguished as independent objects or individuals. In survey sampling, survey weights can be applied to the data to adjust for the sample design. Results from probability theory and statistical theory are employed to guide practice. In business and medical research, sampling is widely used for gathering information about a population.


Random sampling by using lots is an old idea, mentioned several times in the Bible. In 1786 Pierre Simon Laplace estimated the population of France by using a sample, along with a ratio estimator. He also computed probabilistic estimates of the error. These were not expressed as modern confidence intervals but as the sample size that would be needed to achieve a particular upper bound on the sampling error with probability 1000/1001. His estimates used Bayes' theorem with a uniform prior probability, and assumed his sample was random. The theory of small-sample statistics developed by William Sealy Gosset put the subject on a more rigorous basis in the 20th century. However, the importance of random sampling was not universally appreciated, and in the USA the 1936 Literary Digest prediction of a Republican win in the presidential election went badly awry due to severe bias. More than two million people responded to the study, their names obtained through magazine subscription lists and telephone directories. It was not appreciated that these lists were heavily biased towards Republicans, and the resulting sample, though very large, was deeply flawed.


The sampling process comprises several stages:

  • Defining the population of concern
  • Specifying a sampling frame, a set of items or events possible to measure
  • Specifying a sampling method for selecting items or events from the frame
  • Determining the sample size
  • Implementing the sampling plan
  • Sampling and data collecting
  • Reviewing the sampling process

Probability and non-probability sampling

A probability sampling scheme is one in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. The combination of these traits makes it possible to produce unbiased estimates of population totals, by weighting sampled units according to their probability of selection.
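The weighting idea can be sketched as follows. The population values and selection probabilities below are hypothetical, and the sample is drawn by a Poisson design (each unit included independently with its own known probability):

```python
import random

# Hypothetical population of unit values (e.g., household incomes).
population = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Unequal but known selection probabilities for each unit.
probs = [0.2, 0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.5, 0.5, 0.5]

def horvitz_thompson_total(sample_indices, values, probs):
    """Estimate the population total by weighting each sampled
    unit by the inverse of its selection probability."""
    return sum(values[i] / probs[i] for i in sample_indices)

# Draw one Poisson sample: include each unit independently
# with its own probability.
random.seed(1)
sample = [i for i in range(len(population)) if random.random() < probs[i]]
estimate = horvitz_thompson_total(sample, population, probs)
```

Averaged over many repeated draws, the weighted estimate matches the true population total, even though individual units had very different chances of selection.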

Probability sampling includes: Simple Random Sampling, Systematic Sampling, Stratified Sampling, Probability Proportional to Size Sampling, and Cluster or Multistage Sampling. These various ways of probability sampling have two things in common:

  1. Every element has a known nonzero probability of being sampled, and
  2. random selection is involved at some point.

Nonprobability sampling is any sampling method where some elements of the population have no chance of selection (these are sometimes referred to as 'out of coverage'/'undercovered'), or where the probability of selection can't be accurately determined. It involves the selection of elements based on assumptions regarding the population of interest, which form the criteria for selection. Hence, because the selection of elements is nonrandom, nonprobability sampling does not allow the estimation of sampling errors. These conditions place limits on how much information a sample can provide about the population. Information about the relationship between sample and population is limited, making it difficult to extrapolate from the sample to the population.

Example: We visit every household in a given street, and interview the first person to answer the door. In any household with more than one occupant, this is a nonprobability sample, because some people are more likely to answer the door (e.g. an unemployed person who spends most of their time at home is more likely to answer than an employed housemate who might be at work when the interviewer calls) and it's not practical to calculate these probabilities.

Nonprobability sampling includes: Accidental Sampling, Quota Sampling and Purposive Sampling. In addition, nonresponse effects may turn any probability design into a nonprobability design if the characteristics of nonresponse are not well understood, since nonresponse effectively modifies each element's probability of being sampled.

Sampling methods

Within any of the types of frame identified above, a variety of sampling methods can be employed, individually or in combination. Factors commonly influencing the choice between these designs include:

  • Nature and quality of the frame
  • Availability of auxiliary information about units on the frame
  • Accuracy requirements, and the need to measure accuracy
  • Whether detailed analysis of the sample is expected
  • Cost/operational concerns

Simple random sampling

In a simple random sample ('SRS') of a given size, all such subsets of the frame are given an equal probability. Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned. Furthermore, any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimises bias and simplifies analysis of results. In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.
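A minimal sketch of drawing an SRS, assuming a hypothetical frame of 1000 numbered units; Python's `random.sample` draws without replacement and gives every subset of the requested size the same probability, which is exactly the defining property above:

```python
import random

# Hypothetical sampling frame of 1000 numbered units.
frame = list(range(1, 1001))

random.seed(42)
# Draw a simple random sample of size 10: every 10-element
# subset of the frame is equally likely to be chosen.
srs = random.sample(frame, 10)
```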

However, SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population. For instance, a simple random sample of ten people from a given country will on average produce five men and five women, but any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and stratified techniques, discussed below, attempt to overcome this problem by using information about the population to choose a more representative sample.

SRS may also be cumbersome and tedious when sampling from an unusually large target population. In some cases, investigators are interested in research questions specific to subgroups of the population. For example, researchers might be interested in examining whether cognitive ability as a predictor of job performance is equally applicable across racial groups. SRS cannot accommodate the needs of researchers in this situation because it does not provide subsamples of the population. Stratified sampling, which is discussed below, addresses this weakness of SRS.
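A stratified design of the kind alluded to here can be sketched as follows; the stratum labels and per-stratum sample sizes are hypothetical:

```python
import random

# Hypothetical frame in which each unit carries a stratum label
# (e.g., a demographic group relevant to the research question).
frame = [("A", i) for i in range(600)] + [("B", i) for i in range(400)]

def stratified_sample(frame, per_stratum):
    """Draw an independent simple random sample of a fixed size
    from each stratum, guaranteeing a subsample for every group
    (something plain SRS cannot promise)."""
    strata = {}
    for label, unit in frame:
        strata.setdefault(label, []).append((label, unit))
    sample = []
    for label, units in strata.items():
        sample.extend(random.sample(units, per_stratum[label]))
    return sample

random.seed(0)
s = stratified_sample(frame, {"A": 30, "B": 20})
```

Unlike SRS, this design always yields exactly the requested number of units from each subgroup, so subgroup-specific comparisons remain possible.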

Simple random sampling is always an EPS (equal probability of selection) design, but not all EPS designs are simple random sampling.

Systematic sampling

Systematic sampling relies on arranging the target population according to some ordering scheme and then selecting elements at regular intervals through that ordered list. Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards. In this case, k = (population size / sample size). It is important that the starting point is not automatically the first in the list, but is instead randomly chosen from within the first to the kth element in the list. A simple example would be to select every 10th name from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
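The procedure just described can be sketched as follows, assuming a hypothetical frame of 1000 names:

```python
import random

def systematic_sample(frame, sample_size):
    """Random start in [0, k), then every kth element,
    where k = population size // sample size."""
    k = len(frame) // sample_size
    start = random.randrange(k)  # randomized start, not element 0
    return frame[start::k][:sample_size]

# Hypothetical directory of 1000 names.
frame = [f"name_{i}" for i in range(1000)]
random.seed(7)
sample = systematic_sample(frame, 100)
```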

As long as the starting point is randomized, systematic sampling is a type of probability sampling. It is easy to implement, and the stratification induced can make it efficient, if the variable by which the list is ordered is correlated with the variable of interest. 'Every 10th' sampling is especially useful for efficient sampling from databases.

Example: Suppose we wish to sample people from a long street that starts in a poor district (house #1) and ends in an expensive district (house #1000). A simple random selection of addresses from this street could easily end up with too many from the high end and too few from the low end (or vice versa), leading to an unrepresentative sample. Selecting (e.g.) every 10th street number along the street ensures that the sample is spread evenly along the length of the street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.)

However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling.

Example: Consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the sampling scheme given above, it is impossible to get a representative sample; either the houses sampled will all be from the odd-numbered, expensive side, or they will all be from the even-numbered, cheap side.

Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy. (In the two examples of systematic sampling that are given above, much of the potential sampling error is due to variation between neighbouring houses - but because this method never selects two neighbouring houses, the sample will not give us any information on that variation.)

As described above, systematic sampling is an EPS method, because all elements have the same probability of selection (in the example given, one in ten). It is not 'simple random sampling' because different subsets of the same size have different selection probabilities - e.g. the set {4,14,24,...,994} has a one-in-ten probability of selection, but the set {4,13,24,34,...} has zero probability of selection.

Systematic sampling can also be adapted to a non-EPS approach; for an example, see discussion of PPS samples below.

Quota sampling

In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.
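The quota-filling step can be sketched as follows; the respondent stream and quotas are hypothetical, and note that acceptance is simply first-come-first-served rather than random, which is exactly what makes the method non-probabilistic:

```python
# Hypothetical stream of readily available respondents, tagged by sex.
stream = [{"id": i, "sex": "F" if i % 2 else "M"} for i in range(1000)]

def quota_sample(stream, quotas):
    """Accept respondents in the order encountered (non-random)
    until each segment's quota is filled."""
    counts = {segment: 0 for segment in quotas}
    picked = []
    for person in stream:
        segment = person["sex"]
        if counts.get(segment, 0) < quotas.get(segment, 0):
            picked.append(person)
            counts[segment] += 1
        if counts == quotas:  # all quotas filled
            break
    return picked

sample = quota_sample(stream, {"F": 200, "M": 300})
```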

It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.

Convenience sampling

Convenience sampling (sometimes known as grab or opportunity sampling) is a type of nonprobability sampling which involves the sample being drawn from that part of the population which is close to hand. That is, a sample population selected because it is readily available and convenient. The researcher using such a sample cannot scientifically make generalizations about the total population from this sample because it would not be representative enough. For example, if the interviewer were to conduct such a survey at a shopping center early in the morning on a given day, the people that he/she could interview would be limited to those present there at that given time, and would not represent the views of other members of society that might be captured if the survey were conducted at different times of day and several times per week. This type of sampling is most useful for pilot testing. Several important considerations for researchers using convenience samples include:

  1. Are there controls within the research design or experiment which can serve to lessen the impact of a non-random, convenience sample, thereby ensuring the results will be more representative of the population?
  2. Is there good reason to believe that a particular convenience sample would or should respond or behave differently than a random sample from the same population?
  3. Is the question being asked by the research one that can adequately be answered using a convenience sample?

In social science research, snowball sampling is a similar technique, where existing study subjects are used to recruit more subjects into the sample.

Panel sampling

Panel samplingis the method of first selecting a group of participants through a random sampling method and then asking that group for the same information again several times over a period of time. Therefore, each participant is given the same survey or interview at two or more time points; each period of data collection is called a "wave". This sampling methodology is often chosen for large scale or nation-wide studies in order to gauge changes in the population with regard to any number of variables from chronic illness to job stress to weekly food expenditures. Panel sampling can also be used to inform researchers about within-person health changes due to age or help explain changes in continuous dependent variables such as spousal interaction. There have been several proposed methods of analyzing panel sample data, including MANOVA, growth curves, and structural equation modeling with lagged effects. For a more thorough look at analytical techniques for panel data, see Johnson (1995).

Sampling and data collection

Good data collection involves:

  • Following the defined sampling process
  • Keeping the data in time order
  • Noting comments and other contextual events
  • Recording non-responses

Most sampling books and papers written by non-statisticians focus only on the data collection aspect, which is just a small, though important, part of the sampling process.

Errors in research

There are always errors in research. In sampling, the total error can be classified into sampling errors and non-sampling errors.

Sampling error

Sampling errors are caused by the sampling design. They include:

  1. Selection error: Incorrect selection probabilities are used.
  2. Estimation error: Biased parameter estimates resulting from the particular elements drawn into the sample.

Non-sampling error

Non-sampling errors are caused by mistakes in data collection and processing. They include:

  1. Overcoverage: Inclusion of data from outside of the population.
  2. Undercoverage: Sampling frame does not include elements in the population.
  3. Measurement error: The respondent misunderstands the question.
  4. Processing error: Mistakes in data coding.
  5. Non-response: Failure to obtain data from some selected elements.

After sampling, a review should be held of the exact process followed in sampling, rather than that intended, in order to study any effects that any divergences might have on subsequent analysis. A particular problem is that of non-responses. In survey sampling, many of the individuals identified as part of the sample may be unwilling to participate or impossible to contact. In this case, there is a risk of differences between (say) the willing and unwilling, leading to biased estimates of population parameters. This is often addressed by follow-up studies which make a repeated attempt to contact the unresponsive and to characterize their similarities and differences with the rest of the frame. The effects can also be mitigated by weighting the data when population benchmarks are available, or by imputing data based on answers to other questions. Non-response is particularly a problem in internet sampling. One of the main reasons for this problem could be that people may hold multiple e-mail addresses which they no longer use or don't check regularly.
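The benchmark-weighting remedy mentioned above can be sketched as a simple post-stratification calculation; the respondent counts and population shares here are hypothetical:

```python
# Hypothetical respondents by group, and known population benchmarks.
respondents = {"young": 40, "old": 60}   # who actually answered
benchmarks = {"young": 0.5, "old": 0.5}  # known population shares

n = sum(respondents.values())
# Weight each group so the weighted sample matches the benchmarks:
# an underrepresented group gets a weight above 1, an
# overrepresented group a weight below 1.
weights = {g: benchmarks[g] * n / respondents[g] for g in respondents}
```

Here the "young" group, which responded less than its population share, receives a weight of 1.25, while the "old" group is weighted down to about 0.83.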

