Wednesday, 14 August 2013

Introduction to Sampling and the various Techniques of Sampling

Continuing from where we had left off in our previous class, the day began with a revision of concepts studied in the last class, i.e. the distinction between the chi-square and t-test methods and their applicability in different situations. Building on the previous day's learning, we were given different situations and, on the basis of the data at hand, decided on the appropriate method to use. Thereafter, we progressed to the concept of sampling.

SAMPLING
Sampling is a method widely used in business that allows researchers to infer information about a population, without having to investigate every individual. Reducing the number of individuals in a study reduces the cost and workload, and may make it easier to obtain high quality information, but this has to be balanced against having a large enough sample size with enough power to detect a true association.
We obtain a sample rather than a complete enumeration (a census) of the population for many reasons, a few of which are: a) economy, b) timeliness, c) the large size of many populations, d) inaccessibility of some of the population, e) destructiveness of the observation, and f) accuracy.

Sampling Frame:
Sampling frame is the actual set of units from which a sample has been drawn. In the ideal case, the sampling frame should coincide with the population of interest.

There are two major categories of sampling: 1) probability sampling and 2) non-probability sampling.

Probability Sampling
Under probability sampling, each element of a given population has a chance of being picked to be part of the sample. In other words, no element of the population has a zero chance of being picked.
The odds/probability of picking any element is known or can be calculated. This is possible when we know the total number of elements in the population, so that we can determine the odds of picking any one element.
Probability sampling involves randomly picking elements from a population, which is why no element has a zero chance of being picked to be part of a sample.

Methods of Probability Sampling
There are a number of different methods of probability sampling including:

Random Sampling
Random sampling is the method that most closely embodies probability sampling. Each element of the sample is picked at random from the given population, so the probability of picking any element can be calculated by simply dividing the frequency of that element by the total number of elements in the population. In this method, all elements with the same frequency are equally likely to be picked.
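
A minimal sketch of simple random sampling in Python (the population of 1,000 IDs and the sample size of 50 are invented for illustration); random.sample gives every element the same chance of being drawn:

import random

# Hypothetical population: 1,000 customer IDs
population = list(range(1, 1001))

# Draw a simple random sample of 50 IDs without replacement;
# each ID has the same 50/1000 chance of ending up in the sample.
sample = random.sample(population, 50)
print(sample[:10])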

Systematic Sampling
Systematic sampling involves arranging the population in a given order and then picking every nth element from the ordered list of all the elements in the population. The probability of picking any given element can be calculated, but it is not necessarily the same for all elements in the population, even for elements with the same frequency.
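
A rough sketch of systematic sampling under the same invented population of 1,000, with a sampling interval of 20: pick a random starting point within the first interval, then take every 20th element.

import random

population = list(range(1, 1001))   # hypothetical ordered population
k = 20                              # sampling interval (population size / desired sample size)

# Random start within the first interval, then every k-th element after it.
start = random.randint(0, k - 1)
sample = population[start::k]
print(len(sample), sample[:5])      # 50 elements, evenly spaced through the list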

Stratified Sampling
Stratified sampling involves dividing the population into groups (strata) and then sampling from those different groups according to a certain set of criteria.
For example, dividing the population of a certain class into boys and girls and then from those two different groups picking those who fall into the specific category that you intend to study with your sample.
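
A minimal sketch of the class example, with invented group sizes: the roster is split into strata and a fixed number of students is sampled from each stratum independently.

import random

# Hypothetical class roster split into two strata.
strata = {
    "boys":  ["boy_%d" % i for i in range(1, 61)],
    "girls": ["girl_%d" % i for i in range(1, 41)],
}

# Draw 10 students at random from each stratum.
sample = {group: random.sample(members, 10) for group, members in strata.items()}
print(sample["boys"][:3], sample["girls"][:3])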

Cluster Sampling
Cluster sampling involves dividing up the population into clusters and assigning each element to one and only one cluster; in other words, an element cannot appear in more than one cluster. A random selection of clusters is then drawn, and the elements in the chosen clusters make up the sample.
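
A sketch of one common form of cluster sampling (the clusters and their sizes are hypothetical): the population is partitioned into clusters, a few whole clusters are selected at random, and every element in the chosen clusters is observed.

import random

# Hypothetical population partitioned into 10 clusters (e.g. city blocks);
# each resident belongs to exactly one block.
clusters = {block: ["resident_%d_%d" % (block, i) for i in range(50)]
            for block in range(10)}

# Randomly pick 3 whole blocks and include everyone who lives in them.
chosen_blocks = random.sample(list(clusters), 3)
sample = [person for block in chosen_blocks for person in clusters[block]]
print(chosen_blocks, len(sample))   # 3 blocks, 150 residents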

Multistage Sampling
Multistage sampling involves the use of more than one probability sampling method and more than one stage of sampling: for example, using the stratified sampling method in the first stage and then the random sampling method in the second stage, and so on until you arrive at the sample that you want.
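
A sketch of one possible two-stage design (the schools and student counts are invented): a first random draw selects schools, and a second simple random sample selects students within each chosen school.

import random

# Hypothetical sampling frame: 5 schools, each with 200 students.
schools = {"school_%d" % s: ["s%d_student_%d" % (s, i) for i in range(200)]
           for s in range(1, 6)}

# Stage 1: randomly select 2 schools.
stage1 = random.sample(list(schools), 2)

# Stage 2: within each selected school, draw a simple random sample of 20 students.
sample = {school: random.sample(schools[school], 20) for school in stage1}
print({school: len(students) for school, students in sample.items()})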

Probability Proportional to Size Sampling
Under probability proportional to size sampling, the number of elements drawn from each part of the population is proportional to that part's share of the total. It is a form of multistage sampling: in stage one you divide the entire population into clusters, and in stage two you randomly select elements from the different clusters, with the number of elements selected from each cluster proportional to the size of that cluster's population.
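
A minimal sketch of the proportional allocation described above, using invented cluster sizes and an overall sample of 100: each cluster contributes a number of elements proportional to its share of the population.

import random

# Hypothetical clusters with unequal sizes.
cluster_sizes = {"A": 500, "B": 300, "C": 150, "D": 50}
total = sum(cluster_sizes.values())
overall_sample_size = 100

sample = {}
for name, size in cluster_sizes.items():
    # Allocate to each cluster in proportion to its share of the population...
    n = round(overall_sample_size * size / total)
    # ...then draw that many elements from the cluster at random.
    members = ["%s_%d" % (name, i) for i in range(size)]
    sample[name] = random.sample(members, n)

print({name: len(drawn) for name, drawn in sample.items()})   # {'A': 50, 'B': 30, 'C': 15, 'D': 5}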

Non-Probability Sampling
Unlike probability sampling, under non-probability sampling certain elements of the population might have a zero chance of being picked. This is because we cannot accurately determine the probability of picking a given element, so we do not know whether the odds of picking that element are zero or greater than zero. Non-probability sampling is not always a consequence of the sampler's ignorance of the total number of elements in the population; it may also result from the sampler's bias in the way he or she chooses the sample, excluding some elements.

Methods of Non-Probability Sampling
There are a number of different methods of non-probability sampling, which include:

Quota Sampling
Quota sampling is similar to stratified sampling, except that after the population is divided into groups, the elements are sampled from each group using the sampler's judgement. As a consequence, the method loses any aspect of randomness and can be extremely biased.
Accidental or Convenience Sampling
Accidental sampling is a method whereby the sampler picks the sample based on the fact that the elements he or she picks are conveniently close at hand at that moment. For example, if you walked down the street and sampled the first ten people you met, the fact that they happened to be there is convenient for you but accidental for them, which is how the method gets its name.
Purposive or Judgemental Sampling
Purposive or judgemental sampling is a method whereby the sampler picks the sample from the entire population solely on the basis of his or her judgement. The sampler controls, to a very large extent, which elements have a chance of being selected for the sample and which ones do not.
Voluntary Sampling
Voluntary sampling, as the name suggests, involves picking the sample from those elements of the population that volunteer to participate. This is the most common method used in research polls.
Snowball Sampling
Snowball sampling is a method of sampling that relies on referrals of previously selected elements to pick other elements that will participate in the sample.

Benford’s Law

Everyone knows that the first significant digit of a number is one of the digits 1 through 9, and it is tempting to assume that the odds of randomly obtaining any one of them as the first significant digit are 1/9. (First significant digit means we ignore leading zeros.) This works well for fake data generated with a random number generator, or the type of data an embezzler would create. With naturally occurring data it generally isn't true: the odds of obtaining a 1 as the first significant digit of a number are much higher than the odds of obtaining any other digit, as shown below:

Digit                                  1      2      3      4      5      6      7      8      9
Odds of obtaining as 1st digit (%)     30.1   17.6   12.5   9.7    7.9    6.7    5.8    5.1    4.6
This rather amazing fact was discovered in 1881 by the American astronomer Simon Newcomb. In 1938 the physicist Dr. Frank Benford made the same discovery. However, he studied a much larger amount of data than Newcomb, analyzing about 20,229 observations drawn from widely different sets of data, including the areas of rivers, baseball statistics, numbers in magazine articles and the street addresses of the first 342 people listed in the book "American Men of Science" (ref 1). Unlike Newcomb, Benford was recognized for his contributions, and the relationship he derived was eventually named Benford's law in his honor.

When the logarithms of the digits 1 through 9 are plotted they look like the number line shown below:


[Logarithmic scale: the interval from 1 to 2 covers 30.1% of the scale's length, 2 to 3 covers 17.6%, 3 to 4 covers 12.5%, 4 to 5 covers 9.7%, 5 to 6 covers 7.9%, 6 to 7 covers 6.7%, 7 to 8 covers 5.8%, 8 to 9 covers 5.1%, and 9 to 10 covers 4.6%.]


This means that all numbers starting with a "1" will occupy 30.1% of the total length of the scale. Numbers like 1.23784, 1.5, or 1.879 would fall in this region. 

An example:
"If we think of the Dow Jones stock average as 1,000, our first digit would be 1.
"To get to a Dow Jones average with a first digit of 2, the average must increase to 2,000, and getting from 1,000 to 2,000 is a 100 percent increase.
"Let's say that the Dow goes up at a rate of about 20 percent a year. That means that it would take five years to get from 1 to 2 as a first digit.
"But suppose we start with a first digit 5. It only requires a 20 percent increase to get from 5,000 to 6,000, and that is achieved in one year.
"When the Dow reaches 9,000, it takes only an 11 percent increase and just seven months to reach the 10,000 mark, which starts with the number 1. At that point you start over with the first digit a 1, once again. Once again, you must double the number -- 10,000 -- to 20,000 before reaching 2 as the first digit.
"As we can see, the number 1 predominates at every step of the progression, as it does in logarithmic sequences."

Note that these relative distances are independent of the power of ten a number is multiplied by. For example, the distance between .001 and .002 on a logarithmic scale is identical to the distance between 1000 and 2000. In other words, the distance between 1 x 10^-3 and 2 x 10^-3 is identical to the distance between 1 x 10^3 and 2 x 10^3. Again, the power of ten makes no difference on a logarithmic scale.
Zeros are also not considered as first significant digits in a decimal fraction because they are only used as place holders to indicate the location of the decimal point. For example, .001 would be written as 1 x 10^-3, and one would be considered the first significant digit.
Benford reasoned that the length of the distance from one number to the next divided by the length of the entire scale would give the probability of the digit being the first one in a given data value. Mathematically this is expressed as follows for base 10 numbers:
P(n) = [Log10(n+1) - Log10(n)] / [Log10(10) - Log10(1)] = Log10(n+1) - Log10(n) = Log10(1 + 1/n)
where: n = the first significant digit of a number
Notice that if a data entry (base 10) begins with a 1, the entry has to be at most doubled to have a first significant digit of 2. However, if a data entry begins with a 9, it only has to be increased by, at most, 11% to change the first significant digit into a 1. This once again illustrates that a first significant digit of 1 is more likely to occur than a 9.
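
As a quick check (nothing here beyond the formula itself), evaluating Log10(1 + 1/n) for n = 1 through 9 in Python reproduces the percentages in the table above:

import math

# Benford's probability that the first significant digit equals n.
for n in range(1, 10):
    print(n, round(100 * math.log10(1 + 1 / n), 1))
# 1 30.1, 2 17.6, 3 12.5, 4 9.7, 5 7.9, 6 6.7, 7 5.8, 8 5.1, 9 4.6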

Benford's law has been used as a method for spotting fraudulent accounting data by looking at the first significant digit of each data entry and comparing the actual frequency of occurrence with the predicted frequency. Most white collar criminals are unaware of Benford's law and will use each digit about 10% of the time for the first significant digit in a number.
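
A minimal sketch of such a screen, with a short, made-up list of ledger amounts standing in for real accounting data (an actual audit would use thousands of entries and a formal goodness-of-fit test such as chi-square rather than a visual comparison):

import math
from collections import Counter

# Hypothetical ledger amounts to screen; real screens use far more data.
amounts = [1243.50, 187.20, 9421.00, 1520.75, 310.40, 118.00, 2750.00, 1999.99]

def first_digit(x):
    """First significant digit of a positive amount."""
    return int(x / 10 ** math.floor(math.log10(x)))

observed = Counter(first_digit(a) for a in amounts)
total = len(amounts)

# Compare observed first-digit frequencies (%) with the Benford expectation (%).
for d in range(1, 10):
    expected = 100 * math.log10(1 + 1 / d)
    actual = 100 * observed[d] / total
    print(d, round(expected, 1), round(actual, 1))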

Benford's law doesn't work for numbers controlled to a specific value, nor does it work for truly random numbers such as those generated by a random number generator. Benford's law also doesn't work well for small sample sizes. However, it holds true in a surprising number of situations. Benford's law shows that natural processes can be remarkably resistant to complete randomness. 

Bias in sampling
There are five important potential sources of bias that should be considered when selecting a sample, by whatever method:
  1. Any changes from the pre-agreed sampling rules can introduce bias
  2. Bias is introduced if people in hard-to-reach groups are omitted
  3. Replacing selected individuals with others, for example if they are difficult to contact, also introduces bias
  4. It is important to try to maximise the response rate to a survey; low response rates can introduce bias
  5. If an out-of-date list is used as the sampling frame, it may also introduce bias, for example if it excludes people who have recently moved to an area.

Written by : Pawan Agarwal

Other Members : Pragya Singh
                           Priyanka Doshi
                           Nilay Kohale
                           Poulami Sarkar
