Wednesday 14 August 2013


Summary of 14th Aug 2013 lecture

In today’s class we first revised the concepts from the previous lectures, i.e. how the chi-square test and the null hypothesis are applied and where they are used.

After that we covered the following:

Sampling

In statistics, quality assurance, and survey methodology, sampling is concerned with the selection of a subset of individuals from within a statistical population to estimate characteristics of the whole population. Two advantages of sampling are that the cost is lower and data collection is faster than measuring the entire population.

Each observation measures one or more properties (such as weight, location, color) of observable bodies distinguished as independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly stratified sampling (blocking). Results from probability theory and statistical theory are employed to guide practice. In business and medical research, sampling is widely used for gathering information about a population.

The sampling process comprises several stages:

  • Defining the population of concern
  • Specifying a sampling frame, a set of items or events possible to measure
  • Specifying a sampling method for selecting items or events from the frame
  • Determining the sample size
  • Implementing the sampling plan
  • Sampling and data collecting
  • Data which can be selected

 

 

Probability Sampling

A probability sampling method is any method of sampling that utilizes some form of random selection. In order to have a random selection method, we must set up some process or procedure that assures that the different units in the population have equal probabilities of being chosen. We have long practiced various forms of random selection, such as picking a name out of a hat or choosing the short straw. These days, we tend to use computers as the mechanism for generating random numbers as the basis for random selection.

Some Definitions

Before we can explain the various probability sampling methods, we have to define some basic terms. These are:

  • N = the number of cases in the sampling frame
  • n = the number of cases in the sample
  • NCn = the number of combinations (subsets) of n from N
  • f = n/N = the sampling fraction

That's it. With those terms defined we can begin to define the different probability sampling methods.
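As a quick illustration of these terms, here is a minimal Python sketch. The function name sampling_terms and the example values N=10 and n=3 are made up for illustration; they are not from the lecture.

```python
from math import comb

def sampling_terms(N, n):
    """Return (NCn, f) for a sampling frame of N cases and a sample of n cases."""
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= N")
    n_subsets = comb(N, n)   # NCn: number of distinct samples of size n
    f = n / N                # sampling fraction
    return n_subsets, f

# Illustrative values only: a frame of 10 cases and a sample of 3.
n_subsets, f = sampling_terms(N=10, n=3)
print(n_subsets, f)  # 120 0.3
```

Note how quickly NCn grows: even for a small frame there are many possible samples, which is why the methods below select one of them by a random procedure rather than by listing them all.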

Simple Random Sampling

The simplest form of random sampling is called simple random sampling.

In statistics, a simple random sample is a subset of individuals (a sample) chosen from a larger set (a population). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process, and each subset of k individuals has the same probability of being chosen for the sample as any other subset of k individuals.[1] This process and technique is known as simple random sampling. A simple random sample is an unbiased surveying technique.

The principle of simple random sampling is that every object has the same probability of being chosen. For example, suppose N college students want to get a ticket for a basketball game, but there are only X < N tickets for them, so they decide to have a fair way to see who gets to go. Then, everybody is given a number in the range from 0 to N-1, and random numbers are generated, either electronically or from a table of random numbers. Numbers outside the range from 0 to N-1 are ignored, as are any numbers previously selected. The first X numbers would identify the lucky ticket winners.
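The ticket draw can be sketched in Python as follows. This is only one possible way to do it, not the procedure used in class; the function name draw_ticket_winners and the seed parameter are assumptions added for illustration.

```python
import random

def draw_ticket_winners(N, X, seed=None):
    """Pick X distinct winners from students numbered 0 to N-1."""
    if not 0 <= X <= N:
        raise ValueError("need 0 <= X <= N")
    rng = random.Random(seed)      # seed is optional, only for repeatable demos
    winners = []
    already_drawn = set()
    while len(winners) < X:
        number = rng.randrange(N)  # always falls in the range 0 to N-1
        if number in already_drawn:
            continue               # ignore numbers previously selected
        already_drawn.add(number)
        winners.append(number)
    return winners

# Illustrative values: 20 students, 5 tickets.
print(draw_ticket_winners(N=20, X=5, seed=1))
```

Python's standard library offers random.sample(range(N), X), which draws the same kind of sample in a single call; the loop above is written out only to mirror the steps in the example.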

Systematic Random Sampling

Systematic sampling is to be applied only if the given population is logically homogeneous, because systematic sample units are uniformly distributed over the population. The researcher must ensure that the chosen sampling interval does not hide a pattern. Any pattern would threaten randomness.

Example: Suppose a supermarket wants to study buying habits of their customers, then using systematic sampling they can choose every 10th or 15th customer entering the supermarket and conduct the study on this sample.

This is random sampling with a system. From the sampling frame, a starting point is chosen at random, and choices thereafter are at regular intervals. For example, suppose you want to sample 8 houses from a street of 120 houses. 120/8=15, so every 15th house is chosen after a random starting point between 1 and 15. If the random starting point is 11, then the houses selected are 11, 26, 41, 56, 71, 86, 101, and 116.
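The house example can be reproduced with a short Python sketch. The function name systematic_sample is hypothetical, and the sketch assumes that N is evenly divisible by n.

```python
def systematic_sample(N, n, start):
    """Select n units from 1..N at a fixed interval, beginning at `start`."""
    if N % n != 0:
        raise ValueError("this sketch assumes N is evenly divisible by n")
    k = N // n                       # sampling interval
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and the interval k")
    return [start + i * k for i in range(n)]

# 8 houses from a street of 120, random starting point 11.
print(systematic_sample(N=120, n=8, start=11))
# [11, 26, 41, 56, 71, 86, 101, 116]
```

In practice the starting point would itself be drawn at random, for example with random.randint(1, k).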

If, as is more often the case, the population is not evenly divisible (suppose you want to sample 8 houses out of 125, where 125/8=15.625), should you take every 15th house or every 16th house? If you take every 16th house, 8*16=128, so there is a risk that the last house chosen does not exist. On the other hand, if you take every 15th house, 8*15=120, so the last five houses will never be selected. The random starting point should instead be selected as a noninteger between 0 and 15.625 (inclusive on one endpoint only) to ensure that every house has an equal chance of being selected; the interval should now be nonintegral (15.625); and each noninteger selected should be rounded up to the next integer. If the random starting point is 3.6, then the positions are 3.6, 19.225, 34.85, 50.475, 66.1, 81.725, 97.35, and 112.975, so the houses selected are 4, 20, 35, 51, 67, 82, 98, and 113, where there are 3 cyclic intervals of 15 and 5 intervals of 16.
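A minimal Python sketch of the nonintegral-interval procedure, rounding each position up to the next integer. The function name fractional_systematic_sample is an assumption, and the starting point is taken to lie in the half-open range (0, N/n].

```python
import math

def fractional_systematic_sample(N, n, start):
    """Systematic sample of n units from 1..N using the nonintegral interval N/n."""
    k = N / n                        # interval, e.g. 125 / 8 = 15.625
    if not 0 < start <= k:
        raise ValueError("start must satisfy 0 < start <= N/n")
    # Step along the frame by k and round every position up.
    return [math.ceil(start + i * k) for i in range(n)]

# 8 houses out of 125, starting point 3.6.
print(fractional_systematic_sample(N=125, n=8, start=3.6))
# [4, 20, 35, 51, 67, 82, 98, 113]
```

Because the largest possible position is 8 * 15.625 = 125, the last house chosen always exists, and every house from 1 to 125 can be reached by some starting point.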

 

Other Probability Sampling Methods

Stratified Random Sampling

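Stratified Random Sampling, also sometimes called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup. In more formal terms: divide the population into non-overlapping groups (strata) of sizes N1, N2, ..., Nk such that N1 + N2 + ... + Nk = N, and then draw a simple random sample from each stratum.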

There are several major reasons why you might prefer stratified sampling over simple random sampling. First, it assures that you will be able to represent not only the overall population, but also key subgroups of the population, especially small minority groups. If you want to be able to talk about subgroups, this may be the only way to effectively assure you'll be able to. If the subgroup is extremely small, you can use different sampling fractions (f) within the different strata to randomly over-sample the small group (although you'll then have to weight the within-group estimates using the sampling fraction whenever you want overall population estimates). When we use the same sampling fraction within strata we are conducting proportionate stratified random sampling. When we use different sampling fractions in the strata, we call this disproportionate stratified random sampling. Second, stratified random sampling will generally have more statistical precision than simple random sampling. This will only be true if the strata or groups are homogeneous. If they are, we expect that the variability within-groups is lower than the variability for the population as a whole. Stratified sampling capitalizes on that fact.

For example, let's say that the population of clients for our agency can be divided into three groups: Caucasian, African-American and Hispanic-American. Furthermore, let's assume that both the African-Americans and Hispanic-Americans are relatively small minorities of the clientele (10% and 5% respectively). If we just did a simple random sample of n=100 with a sampling fraction of 10%, we would expect by chance alone that we would only get 10 and 5 persons from each of our two smaller groups. And, by chance, we could get fewer than that! If we stratify, we can do better. First, let's determine how many people we want to have in each group. Let's say we still want to take a sample of 100 from the population of 1000 clients over the past year. But we think that in order to say anything about subgroups we will need at least 25 cases in each group. So, let's sample 50 Caucasians, 25 African-Americans, and 25 Hispanic-Americans. We know that 10% of the population, or 100 clients, are African-American. If we randomly sample 25 of these, we have a within-stratum sampling fraction of 25/100 = 25%. Similarly, we know that 5% or 50 clients are Hispanic-American. So our within-stratum sampling fraction will be 25/50 = 50%. Finally, by subtraction we know that there are 850 Caucasian clients. Our within-stratum sampling fraction for them is 50/850 = about 5.88%. Because the groups are more homogeneous within-group than across the population as a whole, we can expect greater statistical precision (less variance). And, because we stratified, we know we will have enough cases from each group to make meaningful subgroup inferences.
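The agency example can be sketched in Python as below. The client IDs are invented placeholders, and the function name stratified_sample is an assumption. The weight computed for each stratum is the inverse of its sampling fraction, which is the usual way to weight a disproportionate sample back to the population.

```python
import random

def stratified_sample(strata, sample_sizes, seed=None):
    """Draw a simple random sample of the requested size inside every stratum."""
    rng = random.Random(seed)
    sample, fractions = {}, {}
    for name, members in strata.items():
        n_h = sample_sizes[name]
        sample[name] = rng.sample(members, n_h)    # SRS within the stratum
        fractions[name] = n_h / len(members)       # within-stratum sampling fraction
    return sample, fractions

# Placeholder client IDs: 850 + 100 + 50 = 1000 clients.
strata = {
    "Caucasian": list(range(0, 850)),
    "African-American": list(range(850, 950)),
    "Hispanic-American": list(range(950, 1000)),
}
sizes = {"Caucasian": 50, "African-American": 25, "Hispanic-American": 25}

sample, fractions = stratified_sample(strata, sizes, seed=0)
for name in strata:
    weight = len(strata[name]) / sizes[name]       # inverse of the sampling fraction
    print(name, len(sample[name]), round(fractions[name], 4), weight)
```

Each sampled Caucasian client stands for 17 clients in the population, each African-American client for 4, and each Hispanic-American client for 2, so overall estimates must weight the groups accordingly.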

 

 

Cluster (Area) Random Sampling

Cluster sampling is a sampling technique used when "natural" but relatively homogeneous groupings are evident in a statistical population. It is often used in marketing research. In this technique, the total population is divided into these groups (or clusters) and a simple random sample of the groups is selected. Then the required information is collected from the elements within each selected group, either from every element in these groups or from a subsample of elements selected within each of them. A common motivation for cluster sampling is to reduce the total number of interviews and costs given the desired accuracy. Assuming a fixed sample size, the technique gives more accurate results when most of the variation in the population is within the groups, not between them.

In cluster sampling, we follow these steps:

  • divide population into clusters (usually along geographic boundaries)
  • randomly sample clusters
  • measure all units within sampled clusters
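The three steps above can be sketched in Python. The district names and household labels are invented for illustration, and cluster_sample is a hypothetical helper, not a library function.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)          # step 2: sample clusters
    units = [unit for c in chosen for unit in clusters[c]]     # step 3: all units
    return chosen, units

# Step 1: the population is already divided into (made-up) geographic clusters.
clusters = {
    "North": ["n1", "n2", "n3"],
    "South": ["s1", "s2"],
    "East": ["e1", "e2", "e3", "e4"],
    "West": ["w1", "w2", "w3"],
}

chosen, units = cluster_sample(clusters, n_clusters=2, seed=3)
print(chosen, units)
```

The final sample size depends on which clusters happen to be drawn, since the clusters need not be equally large.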

Multi-Stage Sampling

Multistage sampling is a complex form of cluster sampling. Cluster sampling is a type of sampling which involves dividing the population into groups (or clusters). Then, one or more clusters are chosen at random and everyone within the chosen cluster is sampled.

Using all the sample elements in all the selected clusters may be prohibitively expensive or not necessary. Under these circumstances, multistage cluster sampling becomes useful. Instead of using all the elements contained in the selected clusters, the researcher randomly selects elements from each cluster. Constructing the clusters is the first stage. Deciding what elements within the cluster to use is the second stage. The technique is used frequently when a complete list of all members of the population does not exist or is impractical to compile.
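A two-stage version can be sketched in Python as follows. As before, the cluster names and unit labels are invented, and two_stage_sample is a hypothetical helper written for this summary.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: sample clusters. Stage 2: sample units inside each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for c in chosen:
        members = clusters[c]
        take = min(n_per_cluster, len(members))   # small clusters are taken whole
        result[c] = rng.sample(members, take)
    return result

# Made-up clusters of households.
clusters = {
    "North": ["n1", "n2", "n3"],
    "South": ["s1", "s2"],
    "East": ["e1", "e2", "e3", "e4"],
    "West": ["w1", "w2", "w3"],
}

print(two_stage_sample(clusters, n_clusters=2, n_per_cluster=2, seed=5))
```

Only the chosen clusters need a list of their members, which is the practical advantage when no complete list of the population is available.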

In some cases, several levels of cluster selection may be applied before the final sample elements are reached. For example, household surveys conducted by the Australian Bureau of Statistics begin by dividing metropolitan regions into 'collection districts', and selecting some of these collection districts (first stage). The selected collection districts are then divided into blocks, and blocks are chosen from within each selected collection district (second stage). Next, dwellings are listed within each selected block, and some of these dwellings are selected (third stage). This method means that it is not necessary to create a list of every dwelling in the region, only for selected blocks. In remote areas, an additional stage of clustering is used, in order to reduce travel requirements.[1]

Although cluster sampling and stratified sampling bear some superficial similarities, they are substantially different. In stratified sampling, a random sample is drawn from all the strata, whereas in cluster sampling only the selected clusters are studied, in either a single stage or multiple stages.

Advantages

  • lower cost and faster completion of the survey
  • convenience of finding the survey sample
  • normally more accurate than cluster sampling for the same sample size

Disadvantages

  • Is not as accurate as SRS if the sample is the same size
  • More testing is difficult to do
 
Benford's Law
Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 as the first digit less than 5% of the time. This distribution of first digits is the same as the widths of grid-lines on a logarithmic scale. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.
This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.
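The first-digit frequencies come from the standard Benford formula P(d) = log10(1 + 1/d), which the lecture summary does not write out. A small Python sketch follows; the function name benford_first_digit and its base parameter are assumptions for illustration.

```python
import math

def benford_first_digit(d, base=10):
    """Probability that the leading digit is d under Benford's Law."""
    if not 1 <= d < base:
        raise ValueError("leading digit must satisfy 1 <= d < base")
    if base == 10:
        return math.log10(1 + 1 / d)
    return math.log(1 + 1 / d, base)   # same formula in another base

for d in range(1, 10):
    print(d, round(benford_first_digit(d), 3))
```

The digit 1 comes out at about 0.301 and the digit 9 at about 0.046, matching the "about 30%" and "less than 5%" figures, and the nine probabilities add up to 1.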
There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization to second digits and later digits.
[Figure: Benford's Law for base 10, shown as a sequence of decreasing blue bars against a light gray grid background.]

 
Written by Priyanka Sudan

Group members

Pranshu Aggarwal
Pooja Shukla
Nishant R
Prateek Jain

 

 
