Sunday, 21 July 2013

A face off with Chi-Square and Cross Tabulation



Statistics can feel dry when we work only with numbers, but as our professor pointed out, even statistics becomes interesting if we build a story around it. That was an absorbing thought, and on that note the class began with the introduction of a new concept: Cross Tabulation.

CROSS TABULATION

Cross-tabulation is one of the most useful analytical tools and a mainstay of the market research industry. Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data. A cross-tabulation is a two (or more) dimensional table that records the number (frequency) of respondents that have the specific characteristics described in the cells of the table. Cross-tabulation tables provide a wealth of information about the relationship between the variables. In simple terms, cross-tabulation presents data about categorical variables in tabular form to help identify relationships between them.
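As a rough sketch of what building such a table can look like in code (the column names and responses below are hypothetical, not the actual SPSS data sheet we used), pandas can produce a cross-tabulation of two categorical variables directly:

```python
import pandas as pd

# Hypothetical respondent-level data (not the class's SPSS sheet)
data = pd.DataFrame({
    "store": ["Store 1", "Store 1", "Store 2", "Store 2", "Store 3", "Store 4"],
    "service_satisfaction": ["Neutral", "Somewhat Positive", "Somewhat Negative",
                             "Neutral", "Strongly Positive", "Somewhat Negative"],
})

# Each cell records the number (frequency) of respondents with that combination
table = pd.crosstab(data["store"], data["service_satisfaction"])
print(table)
```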

Cross-tabulation analysis has its own unique language, using terms such as “banners”, “stubs”, “Chi-Square Statistic” and “Expected Values.” 

In order to gain a better understanding of cross-tabulation, we were provided with an SPSS data sheet describing a retail setup with 4 stores, along with variables such as gender, age category of shoppers, shopping frequency and service satisfaction. Our task was to evaluate the levels of service satisfaction across the stores.

The first thing we had to do was determine the direction of the relationship, i.e. which variable is the "independent" variable and which is the "dependent" variable (in other words, what influences what). In this case the store was treated as the independent variable and service satisfaction as the dependent variable.

Another observation was that the percentages in the cross-tabulation were more meaningful when read across the rows, since row percentages helped us identify which store had a problem with its level of service satisfaction. The results showed that Store 2 had the greatest percentage of dissatisfied shoppers, while Store 3 had the highest level of service satisfaction.
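Building on the hypothetical `data` frame from the earlier sketch, those row percentages can be produced directly:

```python
# Row percentages: each store's responses expressed as a share of that store's total,
# so stores with different numbers of respondents can be compared fairly.
row_pct = pd.crosstab(data["store"], data["service_satisfaction"],
                      normalize="index") * 100
print(row_pct.round(1))
```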

This led us to conclude that Store 2 was the problem area. However, it still had to be established whether we could trust these results: could they have happened by chance? This left scope for further analysis, and it was then that we were introduced to the Pearson Chi-Square statistic.

CROSS-TABULATION WITH CHI-SQUARE ANALYSIS

The Chi-square statistic is the primary statistic used for testing the statistical significance of the cross-tabulation table. Chi-square tests whether or not the two variables are independent. If the variables are independent (have no relationship), then the results of the statistical test will be “non-significant” and we “are not able to reject the null hypothesis”, meaning that we believe there is no relationship between the variables.

Null Hypothesis

In statistical inference, the null hypothesis refers to a general or default position: that there is no relationship between the two variables being tested. Rejecting or disproving the null hypothesis, and thereby concluding that there are grounds for believing a relationship exists between the two variables, is useful because it frames the research question as a precise claim that is capable of being proven false.


If the variables are related, then the results of the statistical test will be “statistically significant” and we “are able to reject the null hypothesis”, meaning that we can state that there is some relationship between the variables. The chi-square statistic, along with the associated probability of chance observation, may be computed for any table. If the observed relationships in the table would occur with very low probability under independence (say only 5%), then we say that the results are “statistically significant” at the “.05 or 5% level”; in other words, a pattern this strong would be unlikely to appear by chance if the variables were really independent.
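To make the mechanics concrete, here is a minimal sketch of the chi-square test of independence in Python with SciPy, using made-up counts rather than the class data:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up observed counts: 2 groups (rows) x 3 response categories (columns)
observed = np.array([[20, 30, 25],
                     [35, 15, 25]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")

# Decision at the 5% level
if p_value < 0.05:
    print("Statistically significant: we reject the null hypothesis of independence")
else:
    print("Not significant: we cannot reject the null hypothesis of independence")
```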

The probability values (.05 or .01) reflect the researcher’s willingness to accept a type I error, or the probability of rejecting a true null hypothesis (meaning that we thought there was a relationship between the variables when there really wasn’t). Furthermore, these error probabilities accumulate across tests: if 20 tables are each tested at the .05 level, the expected number of tables incorrectly found to show a relationship is 20 × .05 = 1, and the chance of at least one such false positive is roughly 64%. Depending on the cost of making mistakes, the researcher may apply more stringent criteria for declaring “significance”, such as .01 or .005.
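A quick back-of-the-envelope check of that multiple-testing point, using the 20-table example from the paragraph above:

```python
alpha, n_tests = 0.05, 20

# Chance of at least one type I error across 20 independent tests at the .05 level
p_any_false_positive = 1 - (1 - alpha) ** n_tests
print(f"P(at least one false positive) ≈ {p_any_false_positive:.2f}")  # about 0.64

# Expected number of tables incorrectly declared significant
print(f"Expected false positives = {alpha * n_tests:.1f}")             # 1.0

# A Bonferroni-style correction keeps the family-wise error rate near .05
print(f"Per-test threshold after Bonferroni correction = {alpha / n_tests:.4f}")
```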

Upon applying the chi-square analysis in our case, the result for the overall table was ambiguous, and it was hard to say whether the stores and the level of service satisfaction were related. Since this was inconclusive, we brought in another variable, contact with an employee, to see whether it had a bearing on the outcome. Splitting the table by this variable made the picture clearer: among shoppers who had contact with an employee the relationship was statistically significant, while among those who had no contact the null hypothesis of independence could not be rejected at the 5% level. The output is interpreted below.


Store * Service satisfaction * Contact with employee Crosstabulation

Contact with employee = No

                            Strongly  Somewhat           Somewhat  Strongly
Store                       Negative  Negative  Neutral  Positive  Positive   Total
Store 1  Count                  16        9        18        17        19        79
         % within Store       20.3%    11.4%     22.8%     21.5%     24.1%    100.0%
Store 2  Count                   2       15        16        13        12        58
         % within Store        3.4%    25.9%     27.6%     22.4%     20.7%    100.0%
Store 3  Count                   9       14        23        22        14        82
         % within Store       11.0%    17.1%     28.0%     26.8%     17.1%    100.0%
Store 4  Count                  17       14        19        10        10        70
         % within Store       24.3%    20.0%     27.1%     14.3%     14.3%    100.0%
Total    Count                  44       52        76        62        55       289
         % within Store       15.2%    18.0%     26.3%     21.5%     19.0%    100.0%

Contact with employee = Yes

                            Strongly  Somewhat           Somewhat  Strongly
Store                       Negative  Negative  Neutral  Positive  Positive   Total
Store 1  Count                   9       11        20        13        14        67
         % within Store       13.4%    16.4%     29.9%     19.4%     20.9%    100.0%
Store 2  Count                  24       15        18        14         7        78
         % within Store       30.8%    19.2%     23.1%     17.9%      9.0%    100.0%
Store 3  Count                   6        6        18        11        15        56
         % within Store       10.7%    10.7%     32.1%     19.6%     26.8%    100.0%
Store 4  Count                  10       21        25        12        24        92
         % within Store       10.9%    22.8%     27.2%     13.0%     26.1%    100.0%
Total    Count                  49       53        81        50        60       293
         % within Store       16.7%    18.1%     27.6%     17.1%     20.5%    100.0%

Chi-Square Tests

Contact with employee = No
  Statistic                        Value      df   Asymp. Sig. (2-sided)
  Pearson Chi-Square             20.898(a)    12   .052
  Likelihood Ratio               22.937       12   .028
  Linear-by-Linear Association    3.514        1   .061
  N of Valid Cases                  289

Contact with employee = Yes
  Statistic                        Value      df   Asymp. Sig. (2-sided)
  Pearson Chi-Square             25.726(b)    12   .012
  Likelihood Ratio               25.777       12   .012
  Linear-by-Linear Association    1.993        1   .158
  N of Valid Cases                  293

a  0 cells (.0%) have expected count less than 5. The minimum expected count is 8.83.
b  0 cells (.0%) have expected count less than 5. The minimum expected count is 9.37.
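For readers who want to check the output, the same layered test can be reproduced from the counts in the crosstabulation above, for example in Python with SciPy; the matrices below are copied straight from the table, and the results should match the Pearson Chi-Square lines of the SPSS output:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts from the crosstabulation above (rows: Store 1-4; columns: Strongly Negative,
# Somewhat Negative, Neutral, Somewhat Positive, Strongly Positive)
no_contact = np.array([[16,  9, 18, 17, 19],
                       [ 2, 15, 16, 13, 12],
                       [ 9, 14, 23, 22, 14],
                       [17, 14, 19, 10, 10]])

with_contact = np.array([[ 9, 11, 20, 13, 14],
                         [24, 15, 18, 14,  7],
                         [ 6,  6, 18, 11, 15],
                         [10, 21, 25, 12, 24]])

for label, counts in [("Contact = No", no_contact), ("Contact = Yes", with_contact)]:
    chi2, p, dof, _ = chi2_contingency(counts)
    print(f"{label}: chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")
```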



 
The second part of the day began with a study of correlation across different aspects of satisfaction (as illustrated below). However, before we look at the table, it would be appropriate to gain some insight into correlation.

CORRELATION

Correlation refers to any of a broad class of statistical relationships involving dependence. Formally, dependence refers to any situation in which a random variable does not satisfy a mathematical condition of probabilistic independence. Informally, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationships between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The commonest of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other).
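As an illustration (the scores below are made up, not the survey data), the Pearson coefficient for one pair of variables can be computed with SciPy, and pandas can produce a whole matrix at once, analogous to the SPSS Correlations table that follows:

```python
import pandas as pd
from scipy.stats import pearsonr

# Hypothetical satisfaction ratings for eight respondents
scores = pd.DataFrame({
    "price_satisfaction":   [3, 4, 2, 5, 4, 3, 1, 4],
    "service_satisfaction": [3, 5, 2, 4, 4, 3, 2, 5],
    "overall_satisfaction": [3, 4, 2, 5, 4, 2, 1, 5],
})

# Pearson r and two-tailed p-value for a single pair of variables
r, p = pearsonr(scores["price_satisfaction"], scores["overall_satisfaction"])
print(f"r = {r:.3f}, p = {p:.4f}")

# Correlation matrix for all pairs at once
print(scores.corr(method="pearson").round(3))
```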

Correlations (Pearson correlation coefficients)

                            Price    Variety  Organization  Service  Item quality  Overall
Price satisfaction          1        .694**   .306**        .585**   .505**        .585**
Variety satisfaction        .694**   1        .182**        .604**   .529**        .572**
Organization satisfaction   .306**   .182**   1             .279**   .210**        .233**
Service satisfaction        .585**   .604**   .279**        1        .424**        .602**
Item quality satisfaction   .505**   .529**   .210**        .424**   1             .457**
Overall satisfaction        .585**   .572**   .233**        .602**   .457**        1

Sig. (2-tailed) = .000 and N = 582 for every pair of variables.
** Correlation is significant at the 0.01 level (2-tailed).


 

Subsequently, we were walked through the actual process by which SPSS arrives at the chi-square statistic.

COMPUTATION OF THE CHI-SQUARE STATISTIC FOR CROSS-TABULATION TABLES

The chi-square statistic is computed by first calculating a chi-square value for each individual cell of the table and then summing these values to obtain the total chi-square value for the table. The chi-square value for a cell is:

(Observed Value − Expected Value)² / Expected Value

where the expected value of a cell is its row total multiplied by its column total, divided by the grand total of the table.
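A short sketch of that cell-by-cell computation, reusing the made-up counts from the earlier SciPy example so the total can be compared with chi2_contingency:

```python
import numpy as np

# Made-up observed counts: 2 rows x 3 columns
observed = np.array([[20.0, 30.0, 25.0],
                     [35.0, 15.0, 25.0]])

# Expected count for each cell under independence:
# row total x column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square contribution of each cell, then the total for the whole table
cell_chi2 = (observed - expected) ** 2 / expected
print("Cell contributions:\n", cell_chi2.round(3))
print("Total chi-square:", cell_chi2.sum().round(3))
```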




This concluded the day, leaving us with more to learn, reflect on and work upon as budding student managers.

Written by : Priyanka Doshi

Other members : Pragya Singh
                            Nilay Kohaley
                            Pawan Agarwal
                            Poulami Sarkar 
