Sunday, 21 July 2013


 Sessions 9 and 10: Cross Tabulation in SBD

The 9th and 10th sessions of SBD began with a discussion of the types of variables.

Variables can be studied under the following main headings.

Continuous Variables: If a variable can take on any value between its minimum value and its maximum value, it is called a continuous variable. Such variables can have fractional values, for example salary, interest, age, etc.

Categorical Variables: Variables which record a response as a set of categories are termed categorical. Such variables fall into three classifications: Nominal, Ordinal, and Interval. Nominal variables have categories that have no natural order to them. Examples could be different crops (wheat, barley, and peas) or different irrigation methods (flood, furrow, and dry land). Ordinal variables, on the other hand, do have a natural order. Examples of these could be pesticide levels (high, medium, and low) or an injury scale (0, 1, 2, 3, 4, and 5).

Then we discussed a case study. To find out whether customers are satisfied with a particular store, we did a cross-tabular analysis.

Cross-tabulation is one of the most useful analytical tools and is a mainstay of the market research industry. One estimate is that single variable frequency analysis and cross-tabulation analysis account for more than 90% of all research analyses. Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data. A cross-tabulation is a two (or more) dimensional table that records the number (frequency) of respondents that have the specific characteristics described in the cells of the table. Cross-tabulation tables provide a wealth of information about the relationship between the variables. Cross-tabulation analysis has its own unique language, using terms such as “banners”, “stubs”, “Chi-Square Statistic” and “Expected Values.” Then a hypothesis is made and checked using the chi-square test.

We did cross tab for Store and Service satisfaction variables.
This is how we do crosstab: Analyse -> Descriptive Statistics -> Crosstabs
Here we put store in rows, as we are comparing stores. For store 1, as we can see from the table, 17.1% of the people who visited store 1 are strongly negative, and 26.9% of all the people who are strongly negative visited store 1.

Next we learned about the null hypothesis and chi-square. The null hypothesis in a cross tab says that there is no relationship between the two variables we are testing.

Significance value - < 0.05 reject
                     > 0.05 accept
When the significance value is more than 0.05 we accept the null hypothesis (strictly speaking, we fail to reject it), which means we have found no evidence of a relation between the two variables.

A chi-squared test, also referred to as chi-square test or χ² test, is any statistical hypothesis test in which the sampling distribution of the test statistic is a chi-squared distribution when the null hypothesis is true. Also considered a chi-squared test is a test in which this is asymptotically true, meaning that the sampling distribution (if the null hypothesis is true) can be made to approximate a chi-squared distribution as closely as desired by making the sample size large enough. Some examples of chi-squared tests where the chi-squared distribution is only approximately valid:
  • Pearson's chi-squared test, also known as the chi-squared goodness-of-fit test or chi-squared test for independence. When the chi-squared test is mentioned without any modifiers or without other precluding context, this test is usually meant (for an exact test used in place of χ², see Fisher's exact test).
  • Yates's correction for continuity, also known as Yates' chi-squared test.
  • Cochran–Mantel–Haenszel chi-squared test.
  • McNemar's test, used in certain 2 × 2 tables with pairing.
  • The portmanteau test in time-series analysis, testing for the presence of autocorrelation.
  • Likelihood-ratio tests in general statistical modelling, for testing whether there is evidence of the need to move from a simple model to a more complicated one (where the simple model is nested within the complicated one).
One case where the distribution of the test statistic is an exact chi-squared distribution is the test that the variance of a normally distributed population has a given value based on a sample variance. Such a test is uncommon in practice because values of variances to test against are seldom known exactly.   
Then we did the correlation of various satisfactions.

If the correlation value is high (and positive), then when one variable is high the other variable also tends to be high.
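A small sketch of what a high correlation looks like. It assumes the Pearson coefficient (the notes do not say which coefficient was used in class), and the two lists of 1 to 5 satisfaction scores are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical satisfaction scores from six respondents.
price_satisfaction = [1, 2, 2, 3, 4, 5]
service_satisfaction = [1, 1, 3, 3, 4, 5]

r = pearson_r(price_satisfaction, service_satisfaction)
print(round(r, 2))  # close to 1: high scores on one go with high scores on the other
```

A value near 0 would mean that knowing one score tells us little about the other.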

When the counts in the cells get low, stop the analysis. Any deep analysis needs a large sample.
This was the concluding note of the session.

Written By: Pooja Shukla
Roll No. 2013199

Group Members: 
Nishant Renjith
Pranshu Agarwal
Prateek Jain

Priyanka Sudan


An Overview Of Session 9 and 10


Cross Tabulation and Chi-Square Analysis

A cross tabulation is a joint frequency distribution of cases based on two or more categorical variables. Displaying a distribution of cases by their values on two or more variables is known as contingency table analysis and is one of the more commonly used analytic methods in the social sciences. The joint frequency distribution can be analyzed with the chi-square statistic (χ²) to determine whether the variables are statistically independent or if they are associated. If a dependency between variables does exist, then other indicators of association, such as Cramér's V, gamma, Somers' d, and so forth, can be used to describe the degree to which the values of one variable predict or vary with those of the other variable. More advanced techniques such as log-linear models and multinomial regression can be used to clarify the relationships contained in contingency tables.

Type of variables. Are the variables of interest continuous or discrete (e.g., categorical)?
Categorical variables contain integer values that indicate membership in one of several possible categories. The range of possible values for such variables is limited, and whenever the
range of possible values is relatively circumscribed, the distribution is unlikely to approach
that of the Gaussian distribution. Continuous variables, in contrast, have a much wider
range, no limiting categories, and have the potential to approximate the Gaussian distribution, provided their range is not artificially truncated. Whenever you encounter a categorical
or a nominal, discrete variable, be aware that the assumption of normality is likely violated.
Shape of the distribution. Categorical variables often have such a small number of possible
values that one cannot even pretend that the assumption of normality is approximated.
Consider for example, the possible values for sex, grade levels, and so forth. Statistical tests
that require the assumption of normality cannot be used to analyze such data. (Of course, a
statistical program such as SPSS will process the numbers without complaint and yield
results that may appear to be interpretable — but only to those who ignore the necessity of
examining the distributions of each variable first, and who fail to check whether the
assumptions were met). Because the assumption of normality is a requirement for the t-test,
analysis of variance, correlation and regression, these procedures cannot be used to analyze
count data.

The chi-square test of statistical significance, first developed by Karl Pearson, assumes that
both variables are measured at the nominal level. To be sure, chi-square may also be used
with tables containing variables measured at a higher level; however, the statistic is calculated as if the variables were measured only at the nominal level. This means that any information regarding the order of, or distances between, categories is ignored.

The assumptions for chi-square include:
1. Random sampling is not required, provided the sample is not biased. However, the best
way to ensure the sample is not biased is random selection.
2. Independent observations. A critical assumption for chi-square is independence of observations. One person’s response should tell us nothing about another person’s response.
Observations are independent if the sampling of one observation does not affect the choice
of the second observation. (In contrast, consider an example in which the observations are
not independent. A researcher wishes to estimate to what extent students in a school engage
in cheating on tests and homework. The researcher randomly chooses one student to interview. At the completion of the interview the researcher asks the student for the name of a
friend so that the friend can be interviewed, too).
3. Mutually exclusive row and column variable categories that include all observations.
The chi-square test of association cannot be conducted when categories overlap or do not
include all of the observations.
4. Large expected frequencies. The chi-square test is based on an approximation that works
best when the expected frequencies are fairly large. No expected frequency should be less
than 1 and no more than 20% of the expected frequencies should be less than 5.
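
The fourth assumption can be checked before running the test. The sketch below computes the expected frequencies under independence (row total × column total / grand total) and applies the rule of thumb above. The 2 × 3 table of counts and the function names are hypothetical, chosen only for illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def expected_counts_large_enough(expected):
    """Rule of thumb from the text: none below 1, at most 20% below 5."""
    cells = [value for row in expected for value in row]
    share_below_5 = sum(value < 5 for value in cells) / len(cells)
    return min(cells) >= 1 and share_below_5 <= 0.20

# Hypothetical counts: rows are two stores, columns are three satisfaction levels.
observed = [[20, 30, 50],
            [30, 30, 40]]
expected = expected_frequencies(observed)
print(expected)                                # [[25.0, 30.0, 45.0], [25.0, 30.0, 45.0]]
print(expected_counts_large_enough(expected))  # True
```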

Hypothesis: The null hypothesis is that the k classifications are independent (i.e., no relationship between
classifications). The alternative hypothesis is that the k classifications are dependent (i.e.,
that a relationship or dependency exists).

Example

Suppose we toss a coin 200 times and observe 108 heads and 92 tails. If the coin is fair, for 200 tosses we would expect 100 heads and 100 tails.
The next step is to prepare a table as follows.
            Heads   Tails   Total
Observed    108     92      200
Expected    100     100     200
Total       208     192     400

                           

The Observed values are those we gather ourselves. The expected values are the frequencies expected, based on our null hypothesis. We total the rows and columns as indicated. It's a good idea to make sure that the row totals equal the column totals (both total to 400 in this example).
Using probability theory, statisticians have devised a way to determine if a frequency distribution differs from the expected distribution. To use this chi-square test, we first have to calculate chi-squared.
Chi-squared = sum over all classes of (observed - expected)² / expected
We have two classes to consider in this example, heads and tails.
Chi-squared = (108 - 100)²/100 + (92 - 100)²/100 = (8)²/100 + (-8)²/100 = 0.64 + 0.64 = 1.28
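
The same arithmetic as a short sketch, using the observed and expected counts from the coin example. The function name is our own.

```python
def chi_squared(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [108, 92]   # heads, tails
expected = [100, 100]  # what a fair coin would give in 200 tosses

statistic = chi_squared(observed, expected)
print(round(statistic, 2))  # 1.28
```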
Now we have to consult a table of critical values of the chi-squared distribution. Here is a portion of such a table.
df/prob.  0.99     0.95    0.90   0.80   0.70  0.50  0.30  0.20  0.10  0.05
1         0.00016  0.0039  0.016  0.064  0.15  0.46  1.07  1.64  2.71  3.84
2         0.02     0.10    0.21   0.45   0.71  1.39  2.41  3.22  4.60  5.99
3         0.12     0.35    0.58   1.00   1.42  2.37  3.66  4.64  6.25  7.82
4         0.30     0.71    1.06   1.65   2.20  3.36  4.88  5.99  7.78  9.49
5         0.55     1.14    1.61   2.34   3.00  4.35  6.06  7.29  9.24  11.07
The left-most column lists the degrees of freedom (df). We determine the degrees of freedom by subtracting one from the number of classes. In this example, we have two classes (heads and tails), so our degrees of freedom is 1. Our chi-squared value is 1.28. Move across the row for 1 df until we find critical numbers that bound our value. In this case, they are 1.07 (corresponding to a probability of 0.30) and 1.64 (corresponding to a probability of 0.20). We can interpolate our value of 1.28 to estimate a probability of about 0.26. This does not mean that there is a 74% chance that our coin is biased. It means that, with a fair coin, a deviation from 100 heads at least as large as the one we observed (8 or more in either direction) would occur in about 26% of experiments of 200 tosses. In biological applications, a probability of 5% is usually adopted as the standard. This value means that the chance of so large a deviation arising by chance alone is only 1 in 20. Because the probability we obtained in the coin example (about 0.26) is greater than 0.05, we do not reject the null hypothesis: the data are consistent with a fair coin.
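
Instead of interpolating in the table, the probability for 1 degree of freedom can be computed directly. This sketch relies on the fact that a chi-squared variable with 1 df is the square of a standard normal variable, so its upper-tail probability is erfc(sqrt(x / 2)). The function name is our own, and the formula holds only for df = 1.

```python
import math

def p_value_df1(chi_squared):
    """Upper-tail probability of a chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(chi_squared / 2))

p = p_value_df1(1.28)
print(round(p, 3))  # about 0.258, close to the value read off the table
```

Passing 3.84, the critical value in the 0.05 column, gives a probability of about 0.05, which is a quick check that the formula and the table agree.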

Degrees of Freedom

The term “degrees of freedom” is used to describe the number of values in the final calculation of a statistic that are free to vary. It is a function of both the number of variables and number of observations. In general, the degrees of freedom is equal to the number of independent observations minus the number of parameters estimated as intermediate steps in the estimation (based on the sample) of the parameter itself.

For χ², the degrees of freedom are equal to (r-1)(c-1), where r is the number of rows and c is 
the number of columns. In the field trip example, r = 2 and c = 3, so df = (2-1)(3-1) = 2.
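
The rule as a one-line helper (the function name is our own):

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom of an r x c contingency table."""
    return (rows - 1) * (columns - 1)

print(degrees_of_freedom(2, 3))  # 2
```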

The risk of making an incorrect decision is an integral part of hypothesis testing. Simply following the steps prescribed for hypothesis testing does not guarantee that the correct decision will be made. We cannot know with certainty whether any one particular sample mirrors the true state of affairs that exists in the population or not. Thus, before a researcher tests the null hypothesis, the researcher must determine how much 
risk of making an incorrect decision is acceptable. “How much risk of an incorrect decision 
am I willing to accept? One chance out of a hundred? Five chances out of a hundred? Ten? 
Twenty?” 
The researcher decides, before testing, on the cutoff value. The convention, which the researcher is free to ignore, is 5 times out of a hundred. This value is known as the “significance level,” or “alpha” (α). After the researcher decides on the alpha level, the researcher looks at the table of critical values. With alpha set at 0.05, the researcher knows which column of the table to use. If the researcher chooses to set alpha at 0.01, then a different column in the table is used.

Logic of Hypothesis Testing

The last step is to make a judgment about the null hypothesis. The statistic is large when 
some of the cells have large discrepancies between the observed and expected frequencies. 
Thus we reject the null hypothesis when the statistic is large. In contrast, a small calculated value does not provide evidence for rejecting the null hypothesis. The question we are asking here: Is the calculated chi-square value of 6.14 sufficiently large (with df = 2 and alpha = 0.05) to provide the evidence we need to reject the null hypothesis? Suppose that the statistic we calculated is so large that the probability of getting a statistic at least as large or extreme (i.e., somewhere out in the tail of the chi-square distribution) as our calculated statistic is very small, if the null hypothesis is true. If this is the case, the results from our sample are very unlikely if the null hypothesis is true, and so we reject the null hypothesis. 
To test the null hypothesis we need to find the probability of obtaining a statistic at least as extreme as the calculated statistic from our sample, assuming that the null hypothesis is true. We use the critical values found in the table to find the approximate probability.
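
A sketch of this decision for the value quoted above (chi-square = 6.14, df = 2, alpha = 0.05). The critical value 5.99 is the one tabulated for df = 2 in the 0.05 column. For exactly 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), which the sketch uses; it does not apply to other df. The names are our own.

```python
import math

CRITICAL_VALUE_DF2_ALPHA_05 = 5.99  # from the table of critical values

def p_value_df2(chi_squared):
    """Upper-tail probability for 2 df, where it equals exp(-x / 2)."""
    return math.exp(-chi_squared / 2)

statistic = 6.14
reject_null = statistic > CRITICAL_VALUE_DF2_ALPHA_05
p = p_value_df2(statistic)
print(reject_null, round(p, 3))  # True 0.046
```

Both routes agree: the statistic exceeds the critical value, and the probability is just under 0.05.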

Hypothesis Testing Error

Whenever we make a decision based on a hypothesis test, we can never know whether our decision is correct. There are two kinds of mistakes we can make:
1 we can reject the null hypothesis when it is indeed true (Type I error), or
2 we can fail to reject the null hypothesis when it is indeed false (Type II error)
The best we can do is to reduce the chance of making either of these errors. If the null hypothesis is true (i.e., it represents the true state of affairs in the population), the significance level (alpha) is the probability of making a Type I error. Because the researcher decides the significance level, we control the probability of making a Type I error. 

The primary method for controlling the probability of making a Type II error is to select an 
appropriate sample size. The probability of a Type II error decreases as the sample size 
increases. At first glance the best strategy might appear to be to obtain the largest sample 
that is possible. However, time and money are always limitations. We do not want a sample 
size that is larger than the minimum necessary for a small probability of a Type II error.
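
Both error rates can be computed exactly for the coin example. The sketch below sums binomial probabilities over every outcome in which the two-class chi-squared statistic exceeds 3.84 (the 0.05 critical value for 1 df). The biased coin with a heads probability of 0.6 and the sample sizes 50 and 200 are assumptions chosen for illustration.

```python
import math

def rejection_probability(n, p_heads, critical_value=3.84):
    """Exact probability that the test rejects 'the coin is fair'
    when n tosses are made with a true heads probability of p_heads."""
    total = 0.0
    for heads in range(n + 1):
        # Two-class chi-squared statistic with expected counts n/2 and n/2.
        statistic = 4 * (heads - n / 2) ** 2 / n
        if statistic > critical_value:
            total += math.comb(n, heads) * p_heads**heads * (1 - p_heads)**(n - heads)
    return total

type_1_rate = rejection_probability(200, 0.5)        # fair coin, wrongly rejected
type_2_small = 1 - rejection_probability(50, 0.6)    # biased coin missed, n = 50
type_2_large = 1 - rejection_probability(200, 0.6)   # biased coin missed, n = 200
print(type_1_rate, type_2_small, type_2_large)
```

The Type I rate stays near the chosen alpha, while the Type II rate falls as the sample grows from 50 to 200 tosses.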

Compiled By:

Raghav Bhatter(2013216)

Group Members:
Neha Gupta
Nitesh Beriwal
Parthajit Sar
Prachee Kasera

SBD lecture no. 9 & 10 (20 July 2013)

In today's lecture we learnt how to work on retail store data. We learnt how to process the data and how to examine the relations between the variables.


variable view of the data.


The classification of variables is as follows-
Nominal - It is the lowest level of data measurement. The numbers have no numerical meaning; they can only be used to classify or categorize.
Ordinal - It is the second level of data measurement. The numbers can be used to rank or order objects.

Next we learnt the concept of crosstabs.
Crosstabs - In statistics, a "crosstab" is another name for a contingency table, which is a type of table created by crosstabulation. In survey research (e.g., polling, market research), a "crosstab" is any table showing summary statistics. Commonly, crosstabs in survey research are concatenations of multiple different tables.


This is how we do crosstab.
Analyse -> Descriptive Statistics -> Crosstabs




After examining the distribution of each of the variables, the researcher’s next task is to look
for relationships among two or more of the variables. Some of the tools that may be used
include correlation and regression, or derivatives such as the t-test, analysis of variance, and
contingency table (crosstabulation) analysis. The type of analysis chosen depends on the
research design, characteristics of the variables, shape of the distributions, level of measurement,
and whether the assumptions required for a particular statistical test are met.
A crosstabulation is a joint frequency distribution of cases based on two or more categorical
variables. Displaying a distribution of cases by their values on two or more variables is
known as contingency table analysis and is one of the more commonly used analytic methods
in the social sciences.


Assumptions: The assumptions for chi-square include:
1. Random sampling is not required, provided the sample is not biased. However, the best
way to ensure the sample is not biased is random selection.
2. Independent observations. A critical assumption for chi-square is independence of observations.
One person’s response should tell us nothing about another person’s response.
Observations are independent if the sampling of one observation does not affect the choice
of the second observation. (In contrast, consider an example in which the observations are
not independent. A researcher wishes to estimate to what extent students in a school engage
in cheating on tests and homework. The researcher randomly chooses one student to interview.
At the completion of the interview the researcher asks the student for the name of a
friend so that the friend can be interviewed, too).
3. Mutually exclusive row and column variable categories that include all observations.
The chi-square test of association cannot be conducted when categories overlap or do not
include all of the observations.
4. Large expected frequencies. The chi-square test is based on an approximation that works
best when the expected frequencies are fairly large. No expected frequency should be less
than 1 and no more than 20% of the expected frequencies should be less than 5.
Hypotheses: The null hypothesis is that the classifications are independent (i.e., no relationship between
classifications). The alternative hypothesis is that the classifications are dependent (i.e.,
that a relationship or dependency exists).


Next we learned about Null Hypothesis and CHI-Square

The null hypothesis in cross tab says that there is no relationship between the two variables we are testing.

Significance value - < 0.05 reject
                              > 0.05 accept


When the significance value is more than 0.05 we accept the null hypothesis (strictly speaking, we fail to reject it), which means we have found no evidence of a relation between the two variables.


Then we did the correlation of various satisfactions.


If the correlation value is high (and positive), then when one variable is high the other variable also tends to be high.

When the counts in the cells get low, stop the analysis.
Any deep analysis needs a large sample.

Submitted by-
Pranav Sharma (2013206)

Group Members-
Payal Singh
Nupur Mandhyan
Omkar 
Radhika Agarwall

                           

A Survey on Retail Sales - 20th July, 2013


In today's lecture we worked on retail store data. We learned how to dig deep into the data and find which variables are related to each other and which are not.

So here is the variable view of the data.


First we tried to find out which variables are nominal and which are ordinal. As we already know:
Nominal - It is the lowest level of data measurement. The numbers have no numerical meaning; they can only be used to classify or categorize.
Ordinal - It is the second level of data measurement. The numbers can be used to rank or order objects.


Then we analysed the frequencies of the age category variable to see the percentage of people in each age group.


Then we learned a new concept of crosstab.

Crosstab - In statistics, a "crosstab" is another name for a contingency table, which is a type of table created by crosstabulation. In survey research (e.g., polling, market research), a "crosstab" is any table showing summary statistics. Commonly, crosstabs in survey research are concatenations of multiple different tables.

We did cross tab for Store and Service satisfaction variables.




This is how we do crosstab.
Analyse -> Descriptive Statistics -> Crosstabs


Here we put store in rows as we are comparing for stores.

For store 1, as we can see from the table, 17.1% of the people who visited store 1 are strongly negative, and 26.9% of all the people who are strongly negative visited store 1.
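
The two percentages read from a crosstab answer different questions: a row percentage divides a cell by its row (store) total, a column percentage divides it by its column (response) total. A minimal sketch with made-up counts, not the class data:

```python
# Hypothetical counts: rows are stores, columns are satisfaction levels.
counts = {
    "Store 1": {"Strongly negative": 10, "Neutral": 30, "Strongly positive": 10},
    "Store 2": {"Strongly negative": 30, "Neutral": 20, "Strongly positive": 50},
}

def row_percent(counts, row, column):
    """Share of a row's respondents that fall in the given column."""
    return 100 * counts[row][column] / sum(counts[row].values())

def column_percent(counts, row, column):
    """Share of a column's respondents that fall in the given row."""
    column_total = sum(cells[column] for cells in counts.values())
    return 100 * counts[row][column] / column_total

# Of the people who visited Store 1, what share is strongly negative?
print(row_percent(counts, "Store 1", "Strongly negative"))     # 20.0
# Of all strongly negative people, what share visited Store 1?
print(column_percent(counts, "Store 1", "Strongly negative"))  # 25.0
```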

Next we learned about Null Hypothesis and CHI-Square

The null hypothesis in cross tab says that there is no relationship between the two variables we are testing.

Significance value - < 0.05 reject
                              > 0.05 accept

When the significance value is more than 0.05 we accept the null hypothesis (strictly speaking, we fail to reject it), which means we have found no evidence of a relation between the two variables.
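
The rule can be written as a small helper. The function name and the wording of the messages are our own; the 0.05 cutoff is the one used in class.

```python
def decide(significance_value, alpha=0.05):
    """Apply the class rule to the significance value reported for chi-square."""
    if significance_value < alpha:
        return "reject the null hypothesis: the variables appear to be related"
    return "do not reject the null hypothesis: no evidence of a relationship"

print(decide(0.01))
print(decide(0.27))
```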


Then we did the correlation of various satisfactions.


If the correlation value is high (and positive), then when one variable is high the other variable also tends to be high.

When the counts in the cells get low, stop the analysis.
Any deep analysis needs a large sample.

With this we ended the lecture.

Blog written by :-
 Piyush Mittal - 2013197

Group Members
   Prerna Bansal
   Priya Jain
   Neeraj Garg
   Piyush Mittal
   Pallavi Gupta
SBD – SESSION 9 & 10




Different responses from customers
        1) Strongly negative
        2) Somewhat negative
        3) Neutral
        4) Somewhat positive
        5) Strongly positive

CROSS TABULATION

Cross-tabulation is one of the most useful analytical tools and is a main-stay of the market research industry. Cross-tabulation analysis, also known as contingency table analysis, is most often used to analyze categorical (nominal measurement scale) data. A cross-tabulation is a two (or more) dimensional table that records the number (frequency) of respondents that have the specific characteristics described in the cells of the table. Cross-tabulation tables provide a wealth of information about the relationship between the variables. In simple terms cross tabulation is a presentation of data about categorical variable in a tabular form to aid in identifying a relationship between the variables.

After examining the distribution of each of the variables, the researcher’s next task is to look
for relationships among two or more of the variables. Some of the tools that may be used
include correlation and regression, or derivatives such as the t-test, analysis of variance, and
contingency table (crosstabulation) analysis. The type of analysis chosen depends on the
research design, characteristics of the variables, shape of the distributions, level of measurement,
and whether the assumptions required for a particular statistical test are met.
A crosstabulation is a joint frequency distribution of cases based on two or more categorical
variables. Displaying a distribution of cases by their values on two or more variables is
known as contingency table analysis and is one of the more commonly used analytic methods
in the social sciences. The joint frequency distribution can be analyzed with the chi-square
statistic (χ²) to determine whether the variables are statistically independent or if
they are associated. If a dependency between variables does exist, then other indicators of
association, such as Cramér's V, gamma, Somers' d, and so forth, can be used to describe
the degree to which the values of one variable predict or vary with those of the other variable.
More advanced techniques such as log-linear models and multinomial regression can be
used to clarify the relationships contained in contingency tables.
Considerations: Type of variables. Are the variables of interest continuous or discrete (e.g., categorical)?
Categorical variables contain integer values that indicate membership in one of several possible
categories. The range of possible values for such variables is limited, and whenever the
range of possible values is relatively circumscribed, the distribution is unlikely to approach
that of the Gaussian distribution. Continuous variables, in contrast, have a much wider
range, no limiting categories, and have the potential to approximate the Gaussian distribution,
provided their range is not artificially truncated. Whenever you encounter a categorical
or a nominal, discrete variable, be aware that the assumption of normality is likely violated.
Shape of the distribution. Categorical variables often have such a small number of possible
values that one cannot even pretend that the assumption of normality is approximated.
Consider for example, the possible values for sex, grade levels, and so forth. Statistical tests
that require the assumption of normality cannot be used to analyze such data. (Of course, a
statistical program such as SPSS will process the numbers without complaint and yield
results that may appear to be interpretable — but only to those who ignore the necessity of
examining the distributions of each variable first, and who fail to check whether the
assumptions were met). Because the assumption of normality is a requirement for the t-test,
analysis of variance, correlation and regression, these procedures cannot be used to analyze
count data.

Assumptions: The assumptions for chi-square include:
1. Random sampling is not required, provided the sample is not biased. However, the best
way to ensure the sample is not biased is random selection.
2. Independent observations. A critical assumption for chi-square is independence of observations.
One person’s response should tell us nothing about another person’s response.
Observations are independent if the sampling of one observation does not affect the choice
of the second observation. (In contrast, consider an example in which the observations are
not independent. A researcher wishes to estimate to what extent students in a school engage
in cheating on tests and homework. The researcher randomly chooses one student to interview.
At the completion of the interview the researcher asks the student for the name of a
friend so that the friend can be interviewed, too).
3. Mutually exclusive row and column variable categories that include all observations.
The chi-square test of association cannot be conducted when categories overlap or do not
include all of the observations.
4. Large expected frequencies. The chi-square test is based on an approximation that works
best when the expected frequencies are fairly large. No expected frequency should be less
than 1 and no more than 20% of the expected frequencies should be less than 5.
Hypotheses: The null hypothesis is that the k classifications are independent (i.e., no relationship between
classifications). The alternative hypothesis is that the k classifications are dependent (i.e.,
that a relationship or dependency exists).

Submitted by: Nihal Moidu (2013170)
Group Members:
Nikita Agarwal
Nimisha Agarwal
Parth Mehta
Priyesh Bhadauriya


SUMMARIZE AND INTERPRET THE INFORMATION




Today, we learned about the following things.
1> Into which categories can we classify data about a retail store?
2> How to use crosstab
       a> when there are two variables
       b> when there are three variables
       c> where to use the Cells option
       d> how to use chi-square
3> How to interpret the table
       a> obtained from crosstab (two variables)
       b> obtained from crosstab (three variables)
       c> obtained using the Cells option
       d> obtained using chi-square

4> On what basis can we reject data?

INTO WHICH CATEGORIES CAN WE CLASSIFY DATA ABOUT A RETAIL STORE?
The survey of the retail store collected information about customer satisfaction, divided into the following categories.
        1) Strongly negative
        2) Somewhat negative
        3) Neutral
        4) Somewhat positive
        5) Strongly positive
These are the categories into which the researcher divides customer satisfaction. Data divided into categories is categorical data. Because these categories have a natural order, from strongly negative to strongly positive, the satisfaction data is ordinal rather than nominal. It is also discrete, not continuous.

HOW TO USE CROSSTAB?
   Here we use the software. We loaded the data, then went to Analyze, then Descriptive Statistics, then Crosstabs. We first ran a crosstab on two variables and then on three variables, so crosstab can produce a table of two or three variables.

By using the Cells option in crosstab, we can also analyze rows or columns separately (as row or column percentages).
By using chi-square, we can test whether the variables in the table are related.

HOW TO INTERPRET DATA?
We took store as the first variable and service satisfaction as the second. From this table, we interpreted that the dissatisfaction level of customers is high in the second store.
To find the reason, we took store as the first variable and contact with employees as the second. But when we used chi-square, we found that there is no relationship between these two variables.
Then we took three variables.
1> store
2> service satisfaction
3> contact with employee
We found that when there is no contact with an employee, there is a high possibility that the service dissatisfaction level will be high, but when there is contact with an employee, we cannot conclude anything.
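
A three-variable crosstab is simply one two-variable table per level of the third (layer) variable. The sketch below builds such layered tables from raw records; the records and the function name are hypothetical, not the class data.

```python
from collections import Counter

# Hypothetical survey records: (store, contact with employee, service satisfaction).
records = [
    ("Store 1", "No contact", "Negative"),
    ("Store 1", "No contact", "Negative"),
    ("Store 1", "Contact", "Positive"),
    ("Store 2", "No contact", "Negative"),
    ("Store 2", "Contact", "Negative"),
    ("Store 2", "Contact", "Positive"),
]

def layered_crosstab(records):
    """One store-by-satisfaction table of counts per level of the layer variable."""
    layers = {}
    for store, contact, satisfaction in records:
        layers.setdefault(contact, Counter())[(store, satisfaction)] += 1
    return layers

tables = layered_crosstab(records)
for contact, table in tables.items():
    print(contact, dict(table))
```

Each layer can then be read, and tested with chi-square, on its own, which is how a relationship hidden in the combined table can show up in one layer only.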

ON WHAT BASIS CAN WE REJECT DATA?
1> When there is a cancelling effect.

2> In statistical inference, the null hypothesis refers to a general or default position: that there is no relationship between the two variables tested.

written by:                shyam pandule

Group Members :-      Praloy Pankaj
                                 Ruplani Saha
                                 Navdeep Singh 
                                 Navneet Singh