Sunday 21 July 2013

A face-off with Chi-Square and Cross-Tabulation



Statistics can seem boring, since it requires us to work plainly with numbers, but, as our professor pointed out, even statistics becomes interesting if we build a story around it. This was quite an absorbing thought, and on that note the class began with the introduction of a new concept: Cross Tabulation.

CROSS TABULATION

Cross-tabulation is one of the most useful analytical tools and is a mainstay of the market research industry. Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data. A cross-tabulation is a two (or more) dimensional table that records the number (frequency) of respondents that have the specific characteristics described in the cells of the table. Cross-tabulation tables provide a wealth of information about the relationship between the variables. In simple terms, cross-tabulation is a presentation of data about categorical variables in tabular form, used to help identify relationships between those variables.
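
As a minimal sketch of the same idea outside SPSS, the snippet below builds a cross-tabulation with pandas, first as raw counts and then as row percentages. The column names and the handful of respondents are invented purely for illustration and are not the actual variable names from the class data sheet.

    import pandas as pd

    # A few invented respondents; "store" and "service_satisfaction" are
    # illustrative column names, not the real SPSS variable names.
    data = pd.DataFrame({
        "store": ["Store 1", "Store 1", "Store 2", "Store 2", "Store 3", "Store 3"],
        "service_satisfaction": ["Neutral", "Somewhat Positive", "Somewhat Negative",
                                 "Neutral", "Strongly Positive", "Neutral"],
    })

    # Frequency of respondents falling into each (store, satisfaction) cell
    counts = pd.crosstab(data["store"], data["service_satisfaction"], margins=True)
    print(counts)

    # Row percentages: how each store's respondents spread across satisfaction levels
    row_pct = pd.crosstab(data["store"], data["service_satisfaction"],
                          normalize="index") * 100
    print(row_pct.round(1))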

Cross-tabulation analysis has its own unique language, using terms such as “banners”, “stubs”, “Chi-Square Statistic” and “Expected Values.” 

In order to gain a better understanding of the use of cross-tabulation, we were provided with an SPSS data sheet containing details about a setup of 4 stores, along with other parameters such as gender, age category of shoppers, shopping frequency and service satisfaction. Our task was to evaluate the levels of service satisfaction across the stores.

The first thing we had to deal with was determining the direction of the relationship. To do this, we were told to decide which variable was the "dependent" variable and which was the "independent" variable (in other words, what influences what). In this case the store was the independent variable and service satisfaction was the dependent variable.

Another observation was that the row percentages of the cross-tabulation were the more relevant ones, since they helped us establish which particular store had a problem with its level of service satisfaction. The results showed that Store 2 had the greatest percentage of dissatisfied shoppers, while Store 3 had the highest level of service satisfaction.

This led us to the conclusion that Store 2 was the problem area. However, it still remained to be established whether we could trust these results: could they have occurred by chance? This left scope for further analysis, and it was here that we were introduced to the Pearson Chi-Square statistic.

CROSS-TABULATION WITH CHI-SQUARE ANALYSIS

The Chi-square statistic is the primary statistic used for testing the statistical significance of the cross-tabulation table. Chi-square tests whether or not the two variables are independent. If the variables are independent (have no relationship), then the results of the statistical test will be “non-significant” and we “are not able to reject the null hypothesis”, meaning that we believe there is no relationship between the variables.

Null Hypothesis

In statistical inference, the null hypothesis refers to a general or default position: that there is no relationship between the two variables being tested. The null hypothesis is framed as a precise claim that is capable of being proven false; rejecting or disproving it means concluding that there are grounds for believing that a relationship between the two variables does exist.


If the variables are related, then the results of the statistical test will be “statistically significant” and we “are able to reject the null hypothesis”, meaning that we can state that there is some relationship between the variables. The chi-square statistic, along with the associated probability of chance observation, may be computed for any table. If the variables are related (i.e. the observed table relationships would occur with very low probability, say only 5%) then we say that the results are “statistically significant” at the “.05 or 5% level”. This means that, if the variables really were independent, a relationship as strong as the one observed would be very unlikely to arise by chance.

The probability values (.05 or .01) reflect the researcher’s willingness to accept a type I error, or the probability of rejecting a true null hypothesis (meaning that we thought there was a relationship between the variables when there really wasn’t). Furthermore, these error risks accumulate across tests: if 20 tables are each tested at the .05 level, the chance that at least one of them is incorrectly found to have a relationship is roughly 64% (1 − 0.95^20). Depending on the cost of making mistakes, the researcher may apply more stringent criteria for declaring “significance”, such as .01 or .005.
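
As a hedged illustration of these ideas, the sketch below runs a chi-square test of independence on a small made-up contingency table using scipy, applies the 5% decision rule, and shows how the type I error risk accumulates over 20 tests. None of the numbers come from the class data.

    import numpy as np
    from scipy.stats import chi2_contingency

    # A made-up 2 x 3 contingency table of observed frequencies
    observed = np.array([[20, 30, 50],
                         [30, 30, 40]])

    chi2, p, df, expected = chi2_contingency(observed)
    print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p:.3f}")

    # Decision rule at the 5% level: reject the null hypothesis of independence
    # only if a table this extreme would arise by chance less than 5% of the time.
    alpha = 0.05
    print("reject null hypothesis" if p < alpha else "fail to reject null hypothesis")

    # Type I error accumulates over repeated tests: with 20 independent tables
    # each tested at alpha = .05, the chance of at least one false positive is
    print(f"{1 - (1 - alpha) ** 20:.2f}")   # roughly 0.64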

Upon applying the chi-square analysis in this case, the significance level suggested that we could not confidently reject the null hypothesis, so the apparent differences between the stores and the level of service satisfaction might simply have occurred by chance. Since this was inconclusive, we brought in another variable, contact with an employee, to see whether it had a bearing on the outcome. Once the data were split on this variable the picture became clearer: for shoppers who had contact with an employee the null hypothesis could be rejected (Sig. = .012), while for those who had no contact it could not (Sig. = .052). The output is interpreted below.


                     Store * Service satisfaction * Contact with employee Crosstabulation

Contact with employee: No
                                 Strongly   Somewhat              Somewhat   Strongly
                                 Negative   Negative   Neutral    Positive   Positive     Total
  Store 1   Count                    16         9         18         17         19          79
            % within Store         20.3%     11.4%      22.8%      21.5%      24.1%      100.0%
  Store 2   Count                     2        15         16         13         12          58
            % within Store          3.4%     25.9%      27.6%      22.4%      20.7%      100.0%
  Store 3   Count                     9        14         23         22         14          82
            % within Store         11.0%     17.1%      28.0%      26.8%      17.1%      100.0%
  Store 4   Count                    17        14         19         10         10          70
            % within Store         24.3%     20.0%      27.1%      14.3%      14.3%      100.0%
  Total     Count                    44        52         76         62         55         289
            % within Store         15.2%     18.0%      26.3%      21.5%      19.0%      100.0%

Contact with employee: Yes
                                 Strongly   Somewhat              Somewhat   Strongly
                                 Negative   Negative   Neutral    Positive   Positive     Total
  Store 1   Count                     9        11         20         13         14          67
            % within Store         13.4%     16.4%      29.9%      19.4%      20.9%      100.0%
  Store 2   Count                    24        15         18         14          7          78
            % within Store         30.8%     19.2%      23.1%      17.9%       9.0%      100.0%
  Store 3   Count                     6         6         18         11         15          56
            % within Store         10.7%     10.7%      32.1%      19.6%      26.8%      100.0%
  Store 4   Count                    10        21         25         12         24          92
            % within Store         10.9%     22.8%      27.2%      13.0%      26.1%      100.0%
  Total     Count                    49        53         81         50         60         293
            % within Store         16.7%     18.1%      27.6%      17.1%      20.5%      100.0%

                                             Chi-Square Tests

Contact with employee                           Value        df    Asymp. Sig. (2-sided)
  No    Pearson Chi-Square                    20.898(a)      12           .052
        Likelihood Ratio                      22.937         12           .028
        Linear-by-Linear Association           3.514          1           .061
        N of Valid Cases                         289
  Yes   Pearson Chi-Square                    25.726(b)      12           .012
        Likelihood Ratio                      25.777         12           .012
        Linear-by-Linear Association           1.993          1           .158
        N of Valid Cases                         293

a  0 cells (.0%) have expected count less than 5. The minimum expected count is 8.83.
b  0 cells (.0%) have expected count less than 5. The minimum expected count is 9.37.
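
For anyone who wants to check these figures outside SPSS, the sketch below feeds the observed counts from the crosstabulation above into scipy's chi-square test of independence, once for each level of contact with employee. The resulting Pearson chi-square values and minimum expected counts should come out close to those reported in the tables (20.898 with Sig. = .052 for "No", 25.726 with Sig. = .012 for "Yes").

    from scipy.stats import chi2_contingency

    # Observed counts taken from the crosstabulation above
    # (rows: Store 1-4; columns: Strongly Negative ... Strongly Positive)
    counts = {
        "No":  [[16,  9, 18, 17, 19],
                [ 2, 15, 16, 13, 12],
                [ 9, 14, 23, 22, 14],
                [17, 14, 19, 10, 10]],
        "Yes": [[ 9, 11, 20, 13, 14],
                [24, 15, 18, 14,  7],
                [ 6,  6, 18, 11, 15],
                [10, 21, 25, 12, 24]],
    }

    for contact, table in counts.items():
        chi2, p, df, expected = chi2_contingency(table)
        print(f"Contact = {contact}: chi-square = {chi2:.3f}, df = {df}, "
              f"p = {p:.3f}, min expected count = {expected.min():.2f}")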



 
The second part of the day began with a study of the correlation across different aspects of satisfaction (as illustrated below). However, before we look at the table, it would be appropriate to gain some insight into correlation.

CORRELATION

Correlation refers to any of a broad class of statistical relationships involving dependence. Formally, dependence refers to any situation in which a random variable does not satisfy a mathematical condition of probabilistic independence. Informally, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationships between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The commonest of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other).
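
As a quick hedged sketch, the Pearson correlation coefficient (and its two-tailed significance) for a single pair of variables can be computed as below. The two satisfaction score lists are invented for illustration; for a full matrix like the one that follows, pandas' DataFrame.corr(method="pearson") produces all the coefficients in one call.

    from scipy.stats import pearsonr

    # Invented 1-5 satisfaction ratings for ten respondents
    price_satisfaction   = [3, 4, 2, 5, 4, 3, 1, 5, 4, 2]
    overall_satisfaction = [3, 5, 2, 4, 4, 3, 2, 5, 3, 2]

    # r is sensitive only to a linear relationship between the two variables
    r, p = pearsonr(price_satisfaction, overall_satisfaction)
    print(f"Pearson r = {r:.3f}, two-tailed p = {p:.3f}")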

                                               Correlations

Pearson correlation coefficients (N = 582 for every pair; ** marks Sig. (2-tailed) = .000):

                               Price     Variety   Organization   Service   Item quality   Overall
  Price satisfaction           1         .694**    .306**         .585**    .505**         .585**
  Variety satisfaction         .694**    1         .182**         .604**    .529**         .572**
  Organization satisfaction    .306**    .182**    1              .279**    .210**         .233**
  Service satisfaction         .585**    .604**    .279**         1         .424**         .602**
  Item quality satisfaction    .505**    .529**    .210**         .424**    1              .457**
  Overall satisfaction         .585**    .572**    .233**         .602**    .457**         1

 

Subsequently, we were walked through the actual process by which the chi-square statistic is derived in SPSS.

COMPUTATION OF THE CHI-SQUARE STATISTIC FOR CROSS-TABULATION TABLES

The chi-square statistic is computed by first computing a chi-square value for each individual cell of the table and then summing these up to form a total chi-square value for the table. The expected value for a cell is its row total multiplied by its column total, divided by the grand total, and the chi-square value for the cell is computed as:

(Observed Value − Expected Value)² / Expected Value
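
To make the arithmetic concrete, here is a small sketch that applies this cell-by-cell formula to the "No contact" counts from the crosstabulation earlier. The summed value should land close to the Pearson chi-square of 20.898 reported above, and the minimum expected count near the 8.83 quoted in footnote (a).

    import numpy as np

    # Observed "No contact" counts (rows: Store 1-4; columns: satisfaction levels)
    observed = np.array([[16,  9, 18, 17, 19],
                         [ 2, 15, 16, 13, 12],
                         [ 9, 14, 23, 22, 14],
                         [17, 14, 19, 10, 10]])

    row_totals = observed.sum(axis=1, keepdims=True)   # shape (4, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 5)
    grand_total = observed.sum()

    # Expected count per cell = row total * column total / grand total
    expected = row_totals @ col_totals / grand_total

    # Chi-square contribution per cell, then summed over the whole table
    cell_chi2 = (observed - expected) ** 2 / expected
    print(f"total chi-square = {cell_chi2.sum():.3f}")
    print(f"minimum expected count = {expected.min():.2f}")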




This concluded the day with something more to learn, reflect and work upon in the future as a budding student manager.

Written by : Priyanka Doshi

Other members : Pragya Singh
                            Nilay Kohaley
                            Pawan Agarwal
                            Poulami Sarkar 
