Sunday 21 July 2013

An Overview of Sessions 9 and 10


Cross Tabulation and Chi-Square Analysis

A cross tabulation is a joint frequency distribution of cases based on two or more categorical variables. Displaying a distribution of cases by their values on two or more variables is known as contingency table analysis and is one of the more commonly used analytic methods in the social sciences. The joint frequency distribution can be analyzed with the chi-square statistic (χ²) to determine whether the variables are statistically independent or whether they are associated. If a dependency between variables does exist, then other indicators of association, such as Cramér's V, gamma, Somers' d, and so forth, can be used to describe the degree to which the values of one variable predict or vary with those of the other variable. More advanced techniques, such as log-linear models and multinomial regression, can be used to clarify the relationships contained in contingency tables.
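
As a quick illustration of cross tabulation and the chi-square test in practice, here is a minimal sketch in Python (assuming pandas and SciPy are available); the variable names gender and preference and the data are hypothetical, used only to show the mechanics.

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two categorical variables (names and values are illustrative only)
df = pd.DataFrame({
    "gender":     ["M", "F", "F", "M", "F", "M", "F", "M", "F", "F"],
    "preference": ["A", "B", "B", "A", "A", "B", "B", "A", "B", "A"],
})

# Cross tabulation: the joint frequency distribution of the two variables
table = pd.crosstab(df["gender"], df["preference"])
print(table)

# Chi-square test of independence on the contingency table
chi2_stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2_stat:.3f}, df = {dof}, p-value = {p_value:.3f}")

# Cramer's V: strength of the association (0 = none, 1 = perfect)
n = table.values.sum()
cramers_v = np.sqrt(chi2_stat / (n * (min(table.shape) - 1)))
print(f"Cramer's V = {cramers_v:.3f}")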

Type of variables. Are the variables of interest continuous or discrete (i.e., categorical)? Categorical variables contain integer values that indicate membership in one of several possible categories. The range of possible values for such variables is limited, and whenever the range of possible values is relatively circumscribed, the distribution is unlikely to approach a Gaussian distribution. Continuous variables, in contrast, have a much wider range, no limiting categories, and the potential to approximate a Gaussian distribution, provided their range is not artificially truncated. Whenever you encounter a categorical or nominal (discrete) variable, be aware that the assumption of normality is likely violated.

Shape of the distribution. Categorical variables often have such a small number of possible values that one cannot even pretend that the assumption of normality is approximated. Consider, for example, the possible values for sex, grade levels, and so forth. Statistical tests that require the assumption of normality cannot be used to analyze such data. (Of course, a statistical program such as SPSS will process the numbers without complaint and yield results that may appear to be interpretable, but only to those who ignore the necessity of examining the distribution of each variable first and who fail to check whether the assumptions were met.) Because the assumption of normality is a requirement for the t-test, analysis of variance, correlation and regression, these procedures cannot be used to analyze count data.

The chi-square test of statistical significance, first developed by Karl Pearson, assumes that
both variables are measured at the nominal level. To be sure, chi-square may also be used
with tables containing variables measured at a higher level; however, the statistic is calculated as if the variables were measured only at the nominal level. This means that any information regarding the order of, or distances between, categories is ignored.

The assumptions for chi-square include:
1. Random sampling is not required, provided the sample is not biased. However, the best
way to ensure the sample is not biased is random selection.
2. Independent observations. A critical assumption for chi-square is independence of observations. One person’s response should tell us nothing about another person’s response.
Observations are independent if the sampling of one observation does not affect the choice
of the second observation. (In contrast, consider an example in which the observations are
not independent. A researcher wishes to estimate to what extent students in a school engage
in cheating on tests and homework. The researcher randomly chooses one student to interview. At the completion of the interview the researcher asks the student for the name of a
friend so that the friend can be interviewed, too).
3. Mutually exclusive row and column variable categories that include all observations.
The chi-square test of association cannot be conducted when categories overlap or do not
include all of the observations.
4. Large expected frequencies. The chi-square test is based on an approximation that works
best when the expected frequencies are fairly large. No expected frequency should be less
than 1 and no more than 20% of the expected frequencies should be less than 5.
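
To make assumption 4 concrete, here is a minimal sketch (assuming SciPy is available) that computes the expected frequencies for an observed table and checks the rule of thumb above; the observed counts are made up for illustration.

import numpy as np
from scipy.stats.contingency import expected_freq

# Hypothetical observed contingency table (counts are illustrative only)
observed = np.array([[12,  5,  3],
                     [18, 10,  7]])

# Expected frequencies under the null hypothesis of independence
expected = expected_freq(observed)
print(expected)

# Rule of thumb: no expected frequency below 1, and at most 20% of cells below 5
print("any expected frequency < 1:", bool((expected < 1).any()))
print("share of cells with expected frequency < 5:", (expected < 5).mean())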

Hypothesis: The null hypothesis is that the k classifications are independent (i.e., there is no relationship between
the classifications). The alternative hypothesis is that the k classifications are dependent (i.e.,
that a relationship or dependency exists).

Example

Suppose we toss a coin 200 times and observe 108 heads and 92 tails. For 200 tosses of a fair coin, we would expect 100 heads and 100 tails.
The next step is to prepare a table as follows.
           Heads   Tails   Total
Observed     108      92     200
Expected     100     100     200
Total        208     192     400


The Observed values are those we gather ourselves. The expected values are the frequencies expected, based on our null hypothesis. We total the rows and columns as indicated. It's a good idea to make sure that the row totals equal the column totals (both total to 400 in this example).
Using probability theory, statisticians have devised a way to determine if a frequency distribution differs from the expected distribution. To use this chi-square test, we first have to calculate chi-squared.
Chi-squared = Σ (observed − expected)² / expected, summed over all classes.
We have two classes to consider in this example, heads and tails.
Chi-squared = (108 − 100)²/100 + (92 − 100)²/100 = (8)²/100 + (−8)²/100 = 0.64 + 0.64 = 1.28
Now we have to consult a table of critical values of the chi-squared distribution. Here is a portion of such a table.
df / prob.     0.99     0.95    0.90    0.80    0.70    0.50    0.30    0.20    0.10    0.05
1           0.00016   0.0039   0.016   0.064   0.15    0.46    1.07    1.64    2.71    3.84
2              0.02     0.10    0.21    0.45    0.71    1.39    2.41    3.22    4.60    5.99
3              0.12     0.35    0.58    1.00    1.42    2.37    3.66    4.64    6.25    7.82
4              0.30     0.71    1.06    1.65    2.20    3.36    4.88    5.99    7.78    9.49
5              0.55     1.14    1.61    2.34    3.00    4.35    6.06    7.29    9.24   11.07
The left-most column lists the degrees of freedom (df). We determine the degrees of freedom by subtracting one from the number of classes. In this example we have two classes (heads and tails), so we have 1 degree of freedom. Our chi-squared value is 1.28. Moving across the row for 1 df, we find the critical values that bound our value: 1.07 (corresponding to a probability of 0.30) and 1.64 (corresponding to a probability of 0.20). Interpolating for our value of 1.28 gives an estimated probability of about 0.27. In other words, if the coin is fair, the probability of a deviation at least as large as the one we observed (108 heads out of 200 tosses) is about 27%.

In biological applications, a probability of 5% is usually adopted as the cutoff; at that level, the chance of the observed deviation arising purely by chance when the null hypothesis is true is only 1 in 20. Because the probability we obtained in the coin example is greater than 0.05 (0.27, to be precise), we fail to reject the null hypothesis and conclude that there is no evidence the coin is biased.
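
The coin-toss calculation above can also be done with a goodness-of-fit routine; the sketch below (assuming SciPy is available) reproduces the chi-square value of 1.28 and gives the probability directly instead of interpolating from the table.

from scipy.stats import chisquare

observed = [108, 92]   # heads and tails observed in 200 tosses
expected = [100, 100]  # expected counts under the null hypothesis of a fair coin

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}")   # 1.28, as computed above
print(f"p-value    = {p:.3f}")      # about 0.26, close to the 0.27 interpolated from the table
print("reject H0" if p < 0.05 else "fail to reject H0")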

Degrees of Freedom

The term “degrees of freedom” is used to describe the number of values in the final calculation of a statistic that are free to vary. It is a function of both the number of variables and number of observations. In general, the degrees of freedom is equal to the number of independent observations minus the number of parameters estimated as intermediate steps in the estimation (based on the sample) of the parameter itself.

For χ², the degrees of freedom are equal to (r − 1)(c − 1), where r is the number of rows and c is
the number of columns of the contingency table. For example, in a table with r = 2 rows and c = 3 columns, df = (2 − 1)(3 − 1) = 2.

The risk of making an incorrect decision is an integral part of hypothesis testing. Simply following the steps prescribed for hypothesis testing does not guarantee that the correct decision will be made. We cannot know with certainty whether any one particular sample mirrors the true state of affairs that exists in the population or not. Thus, before a researcher tests the null hypothesis, the researcher must determine how much 
risk of making an incorrect decision is acceptable. “How much risk of an incorrect decision 
am I willing to accept? One chance out of a hundred? Five chances out of a hundred? Ten? 
Twenty?” 
The researcher decides, before testing, on the cutoff value. The convention, which the researcher is free to ignore, is 5 times out of a hundred. This value is known as the “significance level,” or “alpha” (α). After the researcher decides on the alpha level, the researcher consults the table of critical values. With alpha set at 0.05, the researcher knows which column of the table to use. If the researcher chooses to set alpha at 0.01, then a different column in the table is used.

Logic of Hypothesis Testing

The last step is to make a judgment about the null hypothesis. The χ² statistic is large when
some of the cells have large discrepancies between the observed and expected frequencies.
Thus we reject the null hypothesis when the statistic is large. In contrast, a small calculated value does not provide evidence for rejecting the null hypothesis. The question we are asking here is: is the calculated chi-square value sufficiently large, given the degrees of freedom and the chosen alpha, to provide the evidence we need to reject the null hypothesis? For example, with df = 2 and alpha = 0.05, the critical value from the table above is 5.99, so a calculated statistic of 6.14 would be large enough to reject the null hypothesis. Suppose that the statistic we calculated is so large that the probability of getting a statistic at least as large or extreme (i.e., somewhere out in the tail of the chi-square distribution) as our calculated statistic is very small, if the null hypothesis is true. If this is the case, the results from our sample are very unlikely under the null hypothesis, and so we reject it.

To test the null hypothesis we need to find the probability of obtaining a statistic at least as extreme as the calculated statistic from our sample, assuming that the null hypothesis is true. We use the critical values found in the table to find the approximate probability.
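
Instead of interpolating from a printed table, both the critical value and the probability can be read off the chi-square distribution directly; the sketch below (assuming SciPy is available) uses the df = 2, alpha = 0.05 case mentioned above, with 6.14 as an illustrative calculated statistic.

from scipy.stats import chi2

alpha = 0.05
df = 2
calculated = 6.14   # illustrative calculated chi-square statistic

# Critical value: reject the null hypothesis if the calculated statistic exceeds it
critical = chi2.ppf(1 - alpha, df)
print(f"critical value = {critical:.2f}")   # 5.99, matching the table above

# p-value: probability of a statistic at least this extreme if the null hypothesis is true
p_value = chi2.sf(calculated, df)
print(f"p-value = {p_value:.3f}")

print("reject H0" if calculated > critical else "fail to reject H0")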

Hypothesis Testing Error

Whenever we make a decision based on a hypothesis test, we can never know whether our decision is correct. There are two kinds of mistakes we can make:
1. we can reject the null hypothesis when it is in fact true (a Type I error), or
2. we can fail to reject the null hypothesis when it is in fact false (a Type II error).
The best we can do is to reduce the chance of making either of these errors. If the null hypothesis is true (i.e., it represents the true state of affairs in the population), the significance level (alpha) is the probability of making a Type I error. Because the researcher decides the significance level, we control the probability of making a Type I error. 

The primary method for controlling the probability of making a Type II error is to select an
appropriate sample size. The probability of a Type II error decreases as the sample size
increases. At first glance the best strategy might appear to be to obtain the largest sample
possible. However, time and money are always limitations, and we do not want a sample
size that is larger than the minimum necessary to keep the probability of a Type II error small.
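
One practical way to judge the required sample size is a small simulation: assume the smallest departure from the null hypothesis worth detecting, simulate many samples of a given size, and estimate how often the null hypothesis is (correctly) rejected; that proportion is the power, i.e., 1 minus the probability of a Type II error. The sketch below (assuming NumPy and SciPy) does this for the coin example with an assumed true heads probability of 0.55; the numbers are purely illustrative.

import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
alpha = 0.05
true_p_heads = 0.55   # assumed departure from fairness worth detecting (illustrative)

def estimated_power(n_tosses, n_sims=5000):
    """Fraction of simulated samples in which the fair-coin null hypothesis is rejected."""
    rejections = 0
    for _ in range(n_sims):
        heads = rng.binomial(n_tosses, true_p_heads)
        stat, p = chisquare([heads, n_tosses - heads], [n_tosses / 2, n_tosses / 2])
        rejections += p < alpha
    return rejections / n_sims

for n in (100, 200, 400, 800):
    # Power (1 - probability of a Type II error) increases with sample size
    print(n, estimated_power(n))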

Compiled By:

Raghav Bhatter (2013216)

Group Members:
Neha Gupta
Nitesh Beriwal
Parthajit Sar
Prachee Kasera
