Summary of the 14th August 2013 Lecture
In today's class we first revised the concepts of the previous lectures, i.e. how the chi-square test and the null hypothesis are applied and where they are used. After that we covered the following:
Sampling
In statistics, quality
assurance, and survey methodology, sampling is concerned with the selection of a subset of
individuals from within a statistical population to estimate characteristics of the whole
population. Two advantages of sampling are that the cost is lower and data
collection is faster than measuring the entire population.
Each observation measures one or more properties
(such as weight, location, color) of observable bodies distinguished as
independent objects or individuals. In survey sampling, weights can be applied to the
data to adjust for the sample design,
particularly stratified sampling (blocking).
Results from probability theory and statistical theory are employed to guide practice. In business and medical research,
sampling is widely used for gathering information about a population.
The
sampling process comprises several stages:
- Defining the population of concern
- Specifying a sampling frame, a set of items or events possible to measure
- Specifying a sampling method for selecting items or events from the frame
- Determining the sample size
- Implementing the sampling plan
- Sampling and data collecting
- Reviewing the collected data
Probability Sampling
A probability sampling method is any method of sampling that utilizes some form of random selection. In order to have a random selection method, we must set up some process or procedure that assures that the different units in the population have equal probabilities of being chosen. We have long practiced various forms of random selection, such as picking a name out of a hat or choosing the short straw. These days, we tend to use computers as the mechanism for generating random numbers as the basis for random selection.
Some Definitions
Before I
can explain the various probability methods we have to define some basic terms.
These are:
- N = the number of cases in the sampling frame
- n = the number of cases in the sample
- NCn = the number of combinations (subsets) of n from N
- f = n/N = the sampling fraction
That's
it. With those terms defined we can begin to define the different probability
sampling methods.
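These four quantities are easy to compute directly. A minimal sketch, with N and n chosen purely for illustration (they do not come from the lecture):

```python
# Hypothetical frame of 1000 cases with a sample of 100.
from math import comb

N = 1000          # cases in the sampling frame
n = 100           # cases in the sample
NCn = comb(N, n)  # number of possible subsets of size n from N
f = n / N         # sampling fraction

print(f)          # 0.1, i.e. a 10% sample
```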
Simple Random Sampling
The
simplest form of random sampling is called simple random sampling.
In statistics, a simple random sample is a subset of individuals (a sample) chosen from a larger set (a population). Each individual is chosen randomly and entirely by chance, such
that each individual has the same probability of being chosen at any stage
during the sampling process, and each subset of k individuals has the
same probability of being chosen for the sample as any other subset of k
individuals.[1] This process and technique is known as simple random sampling. A simple random sample is an unbiased surveying technique.
The
principle of simple random sampling is that every object has the same
probability of being chosen. For example, suppose N college students
want to get a ticket for a basketball game, but there are only X < N
tickets for them, so they decide to have a fair way to see who gets to go.
Then, everybody is given a number in the range from 0 to N-1, and random
numbers are generated, either electronically or from a table of random numbers.
Numbers outside the range from 0 to N-1 are ignored, as are any numbers
previously selected. The first X numbers would identify the lucky ticket
winners.
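The ticket lottery above is exactly what `random.sample` does: it returns a subset in which every subset of the requested size is equally likely. A small sketch, with N and X set to illustrative values:

```python
import random

N = 30                      # students wanting tickets (illustrative)
X = 5                       # tickets available (illustrative)
students = list(range(N))   # everyone numbered 0 .. N-1

# Each subset of size X has the same probability of being chosen.
winners = random.sample(students, X)
print(sorted(winners))
```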
Systematic Random Sampling
Systematic
sampling is to be applied only if the given population is logically
homogeneous, because systematic sample units are uniformly distributed over the
population. The researcher must ensure that the chosen sampling interval does
not hide a pattern. Any pattern would threaten randomness.
Example:
Suppose a supermarket wants to study buying habits of their customers, then
using systematic sampling they can choose every 10th or 15th customer entering
the supermarket and conduct the study on this sample.
This is
random sampling with a system. From the sampling frame, a starting point is
chosen at random, and choices thereafter are at regular intervals. For example,
suppose you want to sample 8 houses from a street of 120 houses. 120/8=15, so
every 15th house is chosen after a random starting point between 1 and 15. If
the random starting point is 11, then the houses selected are 11, 26, 41, 56,
71, 86, 101, and 116.
If, as is more frequently the case, the population is not evenly divisible (suppose you want to sample 8 houses out of 125, where 125/8=15.625), should you take every 15th house or every 16th house? If you take every 16th house, 8*16=128, so there is a risk that the last house chosen does not exist. On the other hand, if you take every 15th house, 8*15=120, so the last five houses will never be selected. The random starting point should instead be selected as a noninteger between 0 and 15.625 (inclusive on one endpoint only) to ensure that every house has an equal chance of being selected; the interval should now be nonintegral (15.625); and each noninteger selected should be rounded up to the next integer. If the random starting point is 3.6, then the houses selected are 4, 20, 35, 51, 67, 82, 98, and 113, where there are 3 cyclic intervals of 15 and 5 intervals of 16.
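The fractional-interval procedure above can be sketched in a few lines: step through the frame at the (possibly noninteger) interval and round each selection point up to the next integer.

```python
import math

def systematic_sample(pop_size, sample_size, start):
    """Select indices at regular (possibly fractional) intervals,
    rounding each selection point up to the next integer."""
    interval = pop_size / sample_size          # 125 / 8 = 15.625
    return [math.ceil(start + i * interval) for i in range(sample_size)]

print(systematic_sample(125, 8, 3.6))  # [4, 20, 35, 51, 67, 82, 98, 113]
```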
Stratified Random Sampling
Stratified Random Sampling, also sometimes called proportional or quota random sampling, involves dividing your population into homogeneous subgroups and then taking a simple random sample in each subgroup. In more formal terms: divide the population into non-overlapping groups (strata) N1, N2, ..., Ni, such that N1 + N2 + ... + Ni = N, and then do a simple random sample of f = n/N in each stratum.
There are
several major reasons why you might prefer stratified sampling over simple
random sampling. First, it assures that you will be able to represent not only
the overall population, but also key subgroups of the population, especially
small minority groups. If you want to be able to talk about subgroups, this may
be the only way to effectively assure you'll be able to. If the subgroup is
extremely small, you can use different sampling fractions (f) within the
different strata to randomly over-sample the small group (although you'll then
have to weight the within-group estimates using the sampling fraction whenever
you want overall population estimates). When we use the same sampling fraction
within strata we are conducting proportionate stratified random
sampling. When we use different sampling fractions in the strata, we call this disproportionate
stratified random sampling. Second, stratified random sampling will generally
have more statistical precision than simple random sampling. This will only be
true if the strata or groups are homogeneous. If they are, we expect that the
variability within-groups is lower than the variability for the population as a
whole. Stratified sampling capitalizes on that fact.
For example, let's say that the population of clients for our agency can
be divided into three groups: Caucasian, African-American and
Hispanic-American. Furthermore, let's assume that both the African-Americans
and Hispanic-Americans are relatively small minorities of the clientele (10%
and 5% respectively). If we just did a simple random sample of n=100 with a
sampling fraction of 10%, we would expect by chance alone that we would only
get 10 and 5 persons from each of our two smaller groups. And, by chance, we
could get fewer than that! If we stratify, we can do better. First, let's
determine how many people we want to have in each group. Let's say we still
want to take a sample of 100 from the population of 1000 clients over the past
year. But we think that in order to say anything about subgroups we will need
at least 25 cases in each group. So, let's sample 50 Caucasians, 25
African-Americans, and 25 Hispanic-Americans. We know that 10% of the
population, or 100 clients, are African-American. If we randomly sample 25 of
these, we have a within-stratum sampling fraction of 25/100 = 25%. Similarly,
we know that 5% or 50 clients are Hispanic-American. So our within-stratum
sampling fraction will be 25/50 = 50%. Finally, by subtraction we know that
there are 850 Caucasian clients. Our within-stratum sampling fraction for them
is 50/850 = about 5.88%. Because the groups are more homogeneous within-group
than across the population as a whole, we can expect greater statistical
precision (less variance). And, because we stratified, we know we will have
enough cases from each group to make meaningful subgroup inferences.
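The arithmetic of the client example above can be sketched directly: compute each within-stratum sampling fraction and then draw a simple random sample inside each stratum (client IDs here are stand-ins, not real data).

```python
import random

# Stratum sizes and target sample sizes from the worked example.
population = {"Caucasian": 850, "African-American": 100, "Hispanic-American": 50}
wanted     = {"Caucasian": 50,  "African-American": 25,  "Hispanic-American": 25}

for group, size in population.items():
    f = wanted[group] / size                     # within-stratum sampling fraction
    members = list(range(size))                  # stand-in IDs for this stratum
    sample = random.sample(members, wanted[group])
    print(f"{group}: f = {f:.4f}, sampled {len(sample)} of {size}")
```

Because the fractions differ across strata (about 5.88%, 25%, and 50%), this is disproportionate stratified sampling, and overall population estimates would need to be weighted by these fractions.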
Cluster (Area) Random Sampling
Cluster sampling is a sampling technique used when
"natural" but relatively homogeneous groupings are evident in a statistical
population. It
is often used in marketing
research. In
this technique, the total population is divided into these groups (or clusters)
and a simple
random sample of
the groups is selected. Then the required information is collected from a
simple random sample of the elements within each selected group. This may be
done for every element in these groups or a subsample of elements may be
selected within each of these groups. A common motivation for cluster sampling
is to reduce the total number of interviews and costs given the desired
accuracy. Assuming a fixed sample size, the technique gives more accurate
results when most of the variation in the population is within the groups, not
between them.
In cluster sampling, we follow these steps:
- divide population into clusters (usually along geographic boundaries)
- randomly sample clusters
- measure all units within sampled clusters
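The three steps above can be sketched in miniature; the clusters here are hypothetical city blocks, each holding a few household IDs.

```python
import random

# Step 1: population divided into clusters (e.g. along geographic lines).
clusters = {
    "block-A": [1, 2, 3],
    "block-B": [4, 5],
    "block-C": [6, 7, 8, 9],
    "block-D": [10, 11],
}

# Step 2: randomly sample clusters.
chosen = random.sample(list(clusters), 2)

# Step 3: measure ALL units within the sampled clusters.
units = [u for c in chosen for u in clusters[c]]
print(chosen, units)
```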
Multi-Stage Sampling
Multistage sampling is a complex form of cluster sampling. Cluster sampling is a type of
sampling which involves dividing the population into groups (or clusters).
Then, one or more clusters are chosen at random and everyone within the chosen
cluster is sampled.
Using all
the sample elements in all the selected clusters may be prohibitively expensive
or not necessary. Under these circumstances, multistage cluster sampling
becomes useful. Instead of using all the elements contained in the selected
clusters, the researcher randomly selects elements from each cluster.
Constructing the clusters is the first stage. Deciding what elements within the cluster to use is the second stage. The technique is used frequently when a complete list of all members of the population does not exist or would be impractical to construct.
In some
cases, several levels of cluster selection may be applied before the final
sample elements are reached. For example, household surveys conducted by the Australian Bureau of Statistics begin by dividing metropolitan
regions into 'collection districts', and selecting some of these collection
districts (first stage). The selected collection districts are then divided
into blocks, and blocks are chosen from within each selected collection
district (second stage). Next, dwellings are listed within each selected block,
and some of these dwellings are selected (third stage). This method means that
it is not necessary to create a list of every dwelling in the region, only for
selected blocks. In remote areas, an additional stage of clustering is used, in
order to reduce travel requirements.[1]
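A two-stage sketch in the spirit of the household-survey example: first sample districts, then subsample dwellings within each selected district. District names and sizes are invented for illustration.

```python
import random

districts = {
    "district-1": list(range(10)),       # dwelling IDs in each district
    "district-2": list(range(10, 25)),
    "district-3": list(range(25, 33)),
}

# Stage 1: choose a random subset of districts.
stage1 = random.sample(list(districts), 2)

# Stage 2: subsample dwellings within each chosen district,
# instead of measuring every dwelling as plain cluster sampling would.
stage2 = {d: random.sample(districts[d], 3) for d in stage1}
print(stage2)
```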
Although
cluster sampling and stratified sampling bear some superficial similarities, they are substantially different.
In stratified sampling, a random sample is drawn from every stratum, whereas in cluster sampling only the selected clusters are studied, whether single-stage or multi-stage.
Advantages
- lower cost and faster completion of the survey
- convenience of finding the survey sample
- normally more accurate than cluster sampling for a sample of the same size
Disadvantages
- not as accurate as simple random sampling if the sample is the same size
- further testing on the sample is more difficult to do
Benford's Law
Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, the number 1 occurs as the leading digit about 30% of the time, while larger numbers occur in that position less frequently: 9 appears as the first digit less than 5% of the time. This distribution of first digits is the same as the widths of grid-lines on a logarithmic scale. Benford's Law also concerns the expected distribution for digits beyond the first, which approaches a uniform distribution. This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.
Benford's Law for base 10 is usually drawn as a bar chart of these first-digit frequencies. There is a generalization of the law to numbers expressed in other bases (for example, base 16), and also a generalization to second digits and later digits.
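For base 10, the first-digit frequencies follow P(d) = log10(1 + 1/d), which reproduces the roughly 30% for 1 and under 5% for 9 mentioned above:

```python
import math

# Benford first-digit probabilities: P(d) = log10(1 + 1/d) for d = 1..9.
probs = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in probs.items():
    print(d, f"{p:.3f}")
# 1 leads about 30.1% of the time; 9 only about 4.6%.
```

The nine probabilities telescope to log10(10) = 1, so they form a valid distribution.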
Group members
Pranshu Aggarwal
Pooja Shukla
Nishant R
Prateek Jain