Saturday, 31 August 2013

In today's session, we applied the statistical knowledge gained in all the previous sessions to prepare a PowerPoint presentation on our project topic, "MONTH WISE ARRIVALS OF FOREIGN TOURISTS IN INDIA". We started by collecting data from www.indiastat.com. We then applied suitable statistical measures to interpret different facts in the data. The measures applied include the mean, t-test, correlation and regression.

 MEAN

In probability and statistics, mean and expected value are used synonymously to refer to one measure of the central tendency either of a probability distribution or of the random variable characterized by that distribution. In the case of a discrete probability distribution of a random variable X, the mean is equal to the sum over every possible value weighted by the probability of that value; that is, it is computed by taking the product of each possible value x of X and its probability P(x), and then adding all these products together, giving
µ = Σ x P(x)
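The weighted sum above can be sketched in a few lines of Python. This is only an illustration: the function name `discrete_mean` and the fair-die distribution are assumptions made for the example, not part of our project data.

```python
from fractions import Fraction


def discrete_mean(distribution):
    """Return the sum of x * P(x) over a {value: probability} mapping."""
    total_probability = sum(distribution.values())
    if total_probability != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in distribution.items())


# Hypothetical example: a fair six-sided die, each face with probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
die_mean = discrete_mean(fair_die)
print(die_mean)  # 7/2
```

Exact fractions are used so that the check on the probabilities is not upset by floating-point rounding.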

An analogous formula applies to the case of a continuous probability distribution. Not every probability distribution has a defined mean; see the Cauchy distribution for an example. Moreover, for some distributions the mean is infinite: for example, when the probability of the value 2^n is 1/2^n for n = 1, 2, 3, ....

For a data set, the terms arithmetic mean, mathematical expectation, and sometimes average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. The arithmetic mean of a set of numbers x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar". If the data set were based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is termed the sample mean to distinguish it from the population mean.

For a finite population, the population mean of a property is equal to the arithmetic mean of the given property while considering every member of the population. For example, the population mean height is equal to the sum of the heights of every individual divided by the total number of individuals. The sample mean may differ from the population mean, especially for small samples. The law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean. 

       DIAGRAM SHOWING COMPARISON OF MEAN , MEDIAN , MODE 
                   
     
    
                         
t-test
A  t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.

Unpaired and paired two-sample t-test

Two-sample t-tests for a difference in mean involve independent samples, paired samples and overlapping samples. Paired t-tests are a form of blocking, and have greater power than unpaired tests when the paired units are similar with respect to "noise factors" that are independent of membership in the two groups being compared. In a different context, paired t-tests can be used to reduce the effects of confounding factors in an observational study.

(a) Independent samples 

The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test. The randomization is not essential here—if we contacted 100 people by phone and obtained each person's age and gender, and then used a two-sample t-test to see whether the mean ages differ by gender, this would also be an independent samples t-test, even though the data are observational.

(b) Paired samples 

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures" t-test).
             A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure lowering medication. By comparing the same patient's numbers before and after treatment, we are effectively using each patient as their own control. That way the correct rejection of the null hypothesis (here: of no difference made by the treatment) can become much more likely, with statistical power increasing simply because the random between-patient variation has now been eliminated. Note however that an increase of statistical power comes at a price: more tests are required, each subject having to be tested twice. Because half of the sample now depends on the other half, the paired version of Student's t-test has only 'n/2 - 1' degrees of freedom (with 'n' being the total number of observations). Pairs become individual test units, and the sample has to be doubled to achieve the same number of degrees of freedom.
                 A paired samples t-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest. The matching is carried out by identifying pairs of values consisting of one observation from each of the two samples, where the pair is similar in terms of other measured variables. This approach is sometimes used in observational studies to reduce or eliminate the effects of confounding factors.
Paired samples t-tests are often referred to as "dependent samples t-tests" (as are t-tests on overlapping samples).
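A minimal sketch of the repeated-measures calculation is given below. The blood-pressure readings are invented for illustration and `paired_t` is a hypothetical helper name; the point is only that the test works on the per-patient differences and has one degree of freedom fewer than the number of pairs.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    if len(before) != len(after):
        raise ValueError("each subject needs a before and an after value")
    # Work on the within-subject differences, so each patient is their own control.
    diffs = [a - b for b, a in zip(before, after)]
    n_pairs = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n_pairs))
    return t_stat, n_pairs - 1


# Invented readings for five patients (ten observations in total).
before = [150, 160, 145, 155, 170]
after = [140, 152, 140, 150, 158]
t_stat, df = paired_t(before, after)
print(round(t_stat, 2), df)  # about -5.8 with 4 degrees of freedom
```

With ten observations in total, the degrees of freedom come out as 10/2 - 1 = 4, in line with the 'n/2 - 1' rule described above.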

(c) Overlapping samples 

An overlapping samples t-test is used when there are paired samples with data missing in one or the other samples (e.g., due to selection of "Don't know" options in questionnaires or because respondents are randomly assigned to a subset question). These tests are widely used in commercial survey research (e.g., by polling companies) and are available in many standard crosstab software packages.

REGRESSION ANALYSIS

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is
on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
                   Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. 
                                    
                       DIAGRAM SHOWING REGRESSION ANALYSIS





X TABULATION:

There are several different types of correlation, and we’ll talk about them later, but in this lesson we’re going to spend most of the time on the most commonly used type of correlation: the Pearson Product Moment Correlation. This correlation, signified by the symbol r, ranges from –1.00 to +1.00. A correlation of 1.00, whether it’s positive or negative, is a perfect correlation. It means that as scores on one of the two variables increase or decrease, the scores on the other variable increase or decrease by the same magnitude—something you’ll probably never see in the real world. A correlation of 0 means there’s no relationship between the two variables, i.e., when scores on one of the variables go up, scores on the other variable may go up, down, or whatever. You’ll see a lot of those.
Thus, a correlation of .8 or .9 is regarded as a high correlation, i.e., there is a very close relationship between scores on one of the variables with the scores on the other. And correlations of .2 or .3 are regarded as low correlations, i.e., there is some relationship between the two variables, but it’s a weak one. Knowing people’s score on one variable wouldn’t allow you to predict their score on the other variable very well.
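As a rough sketch of how r is obtained, the snippet below applies the Pearson formula directly. The function name `pearson_r` and the three small score lists are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(squares_x * squares_y)


scores = [1, 2, 3, 4, 5]
r_perfect_positive = pearson_r(scores, [2, 4, 6, 8, 10])   # 1.0
r_perfect_negative = pearson_r(scores, [10, 8, 6, 4, 2])   # -1.0
r_high = pearson_r(scores, [2, 1, 4, 3, 5])                # 0.8
print(r_perfect_positive, r_perfect_negative, r_high)
```

The third pair of lists moves together most of the time but not always, which is why it lands in the "high correlation" range rather than at a perfect 1.00.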

CORRELATION AND DEPENDENCE

        In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).
Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The commonest of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
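To illustrate why the Pearson coefficient only picks up linear relationships, the sketch below uses an invented data set in which y is completely determined by x (y = x squared), yet r comes out as zero. The helper `pearson_r` is written out here only so the example is self-contained.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(squares_x * squares_y)


# y depends on x exactly, but the relationship is a symmetric curve, not a line.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_nonlinear = pearson_r(xs, ys)
print(r_nonlinear)  # 0.0
```

So a zero Pearson correlation does not by itself show that two variables are independent.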
                                          
  DIAGRAM SHOWING FOUR SETS OF DATA SHOWING SAME CORRELATION                                     
                                                                                
                             
                         
                                                SUBMITTED BY : Pallavi Bizoara (2013186)
                                                          
                                                  GROUP No : 7
                                                            Nidhi Sharma (2013169)                          
                                                            Nitesh Singh Patel(2013178)
                                                            Nitin Boratwar(2013179)
                                                            Palak Jain(2013185)                
                                                                                     
                             

T-test and Crosstabs

We were required to carry out a small project based on statistical analysis of data. Our topic was ‘Month wise production of cement in India’. Data was collected from www.indiastat.com. Following is a snapshot of the data.
On the required data, we were asked to perform the following functions.

1.       Mean
Mean is what most people commonly refer to as an average. The mean refers to the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. Mean is also referred to more correctly as arithmetic mean.
mean= sum of elements in set/ number of elements in set
Example.
To find the mean of the set of numbers below 
3, 4, -1, 22, 14, 0, 9, 18, 7, 0, 1
The first step is to count how many numbers there are in the set, which we shall call n,
n=11
The next step is to add up all the numbers in the set
sum= 77
The last step is to find the actual mean by dividing the sum by n,
mean=77/11=7
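The three steps can be checked with a short Python sketch (the variable names are only illustrative). Counting the listed values gives 11 numbers with a sum of 77, so the mean is 7.

```python
values = [3, 4, -1, 22, 14, 0, 9, 18, 7, 0, 1]

n = len(values)     # step 1: count the numbers -> 11
total = sum(values) # step 2: add them up -> 77
mean = total / n    # step 3: divide the sum by n -> 7.0

print(n, total, mean)
```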

2.       Median
The median is defined as the number in the middle of a given set of numbers arranged in order of increasing magnitude. When given a set of numbers, the median is the number positioned in the exact middle of the list when you arrange the numbers from the lowest to the highest. If the set has an even number of elements, the median is taken as the mean of the two middle numbers. The median is also a measure of average (central tendency), not of dispersion. It is useful because, unlike the mean, it is hardly affected by a few extreme values in the set.
 Example.
To find the median in the set of numbers given below
15, 16, 15, 7, 21, 18, 19, 20, 11
From the definition of median, we should be able to tell that the first step is to rearrange the given set of numbers in order of increasing magnitude, i.e. from the lowest to the highest
7, 11, 15, 15, 16, 18, 19, 20, 21
Then we inspect the set to find that number which lies in the exact middle.
median=16
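The same procedure can be sketched in Python. The list below holds the nine values of the sorted list above in arbitrary order; `statistics.median` from the standard library is used only as a cross-check of the manual steps.

```python
import statistics

values = [15, 16, 15, 7, 21, 18, 19, 20, 11]

ordered = sorted(values)               # step 1: lowest to highest
middle_index = len(ordered) // 2       # step 2: position of the middle element (odd count)
median_by_hand = ordered[middle_index]

print(ordered)         # [7, 11, 15, 15, 16, 18, 19, 20, 21]
print(median_by_hand)  # 16
```

For a list with an even count there is no single middle element, and `statistics.median` returns the mean of the two middle values instead.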

3.       Mode
The mode is defined as the element that appears most frequently in a given set of elements. Taking the frequency of an element to be the number of times it appears, the mode can also be defined as the element with the largest frequency in a given data set. For a given data set, there can be more than one mode. As long as those elements all have the same frequency and that frequency is the highest, they are all the modal elements of the data set.
Example.
To find the Mode of the following data set.
3, 12, 15, 3, 15, 8, 20, 19, 3, 15, 12, 19, 9
Mode = 3 and 15
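A small sketch of the frequency count behind this answer is given below, using `collections.Counter` from the Python standard library. The variable names are illustrative.

```python
from collections import Counter

data = [3, 12, 15, 3, 15, 8, 20, 19, 3, 15, 12, 19, 9]

frequencies = Counter(data)                 # element -> number of occurrences
highest = max(frequencies.values())         # the largest frequency in the set
modes = sorted(value for value, count in frequencies.items() if count == highest)

print(highest)  # 3
print(modes)    # [3, 15]
```

Both 3 and 15 occur three times, which is more than any other element, so the data set has two modes.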

4.       T-test
We use this test for comparing the means of two samples (or treatments), even if they have different numbers of replicates. In simple terms, the t-test compares the actual difference between two means in relation to the variation in the data (expressed as the standard deviation of the difference between the means).
The formula given below is used to compute the T Test
T = (x1 - x2) / √((S1²/N1) + (S2²/N2))
Where,
x1 is the mean of the first data set, x2 is the mean of the second data set
S1² is the variance (the square of the standard deviation) of the first data set, S2² is the variance of the second data set
N1 is the number of elements in the first data set, N2 is the number of elements in the second data set

Example.
Calculate the T test value whose inputs are 10, 20, 30, 40, 50 and 1, 29, 46, 78, 99.
First Calculate Standard Deviation & mean of the given data set, 

For 10, 20, 30, 40, 50 
Total Inputs(N)=5
Means(xm)= 30 
SD =15.8114

For 1, 29,46,78,99 
Total Inputs(N) = 5
Means(xm) = 50.6 
SD=38.8626 

To Perform T Test 
From above we know that, 
x1 = 30, x2 = 50.6, S1² = 250, S2² = 1510.3, N1 = 5, N2 = 5
Substitute these values in the above formula, 
  T = (30 - 50.6)/√((250/5) + (1510.3/5)) 
= -1.0979
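The worked example can be reproduced with the Python standard library. This is a sketch of the calculation shown above (difference of means divided by the square root of the summed variance-over-size terms); the function name `two_sample_t` is just a label chosen for the example.

```python
import math
import statistics


def two_sample_t(first, second):
    """T value from the means, sample variances and sizes of two data sets."""
    mean1, mean2 = statistics.mean(first), statistics.mean(second)
    # statistics.variance is the sample variance (divides by N - 1).
    var1, var2 = statistics.variance(first), statistics.variance(second)
    return (mean1 - mean2) / math.sqrt(var1 / len(first) + var2 / len(second))


first = [10, 20, 30, 40, 50]
second = [1, 29, 46, 78, 99]

print(statistics.variance(first))             # 250
print(round(statistics.variance(second), 1))  # 1510.3
t_value = two_sample_t(first, second)
print(round(t_value, 4))                      # -1.0979
```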

5.       Cross Tabulation
Cross tabulation is a statistical process that summarises categorical data to create a contingency table. Contingency tables are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
Example.

We examine the 3 rows for the unit J1. This unit needs both Adult size t-shirts and Child sizes.

A total of 8 adult (A) shirts (Total Of ID): 
   2 medium (M), 3 small (S), 3 extra large (X)
A total of 8 child (C) t-shirts (Total Of ID):
   5 large (L), 2 medium (M), 1 extra large (X)

By Pallavi Gupta (2013187)
Group Members:
Piyush (2013197)
Prerna Bansal (2013209)
Priya Jain (2013210)
Neeraj Garg (2013318)


Lecture 17th & 18th

In the 17th and 18th sessions we covered important statistical tools for our group project, such as the t-test, regression and the measures of central tendency (mean, median, mode).
These are explained further below:

T-Test:
A statistical examination of two population means. A two-sample t-test examines whether the means of two samples differ and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. For example, a t-test could be used to compare the average floor routine score of the U.S. women's Olympic gymnastics team to the average floor routine score of China's women's team.
The test statistic in the t-test is known as the t-statistic. The t-test uses the t-statistic, the t-distribution and the degrees of freedom to determine a p-value (probability) that can be used to judge whether the population means differ. The t-test is one of a number of hypothesis tests. To compare the means of three or more groups, statisticians use an analysis of variance (ANOVA). If the sample size is large, they use a z-test. Other hypothesis tests include the chi-square test and the F-test.


Regression:
A statistical measure that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables).
The two basic types of regression are linear regression and multiple regression.
Linear regression: uses one independent variable to explain and/or predict the outcome of Y.
Multiple regression: uses two or more independent variables to predict the outcome.
The general form of each type of regression is:
Linear regression: Y = a + bX + u
Multiple regression: Y = a + b1X1 + b2X2 + ... + bnXn + u
Where:
Y= the variable that we are trying to predict
X= the variable that we are using to predict Y
a= the intercept
b= the slope
u= the regression residual.
In multiple regression the separate variables are differentiated by using subscripted numbers.
Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data points. Regression is often used to determine how much specific factors such as the price of a commodity, interest rates, particular industries or sectors influence the price movement of an asset.
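As a minimal sketch of fitting the straight line in the one-variable case, the snippet below estimates the intercept a and the slope b by ordinary least squares and then computes the residuals u. The five data points are invented, and the choice of least squares as the fitting method is an assumption made for illustration.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# Invented data lying close to, but not exactly on, a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # the u term for each point
print(round(a, 2), round(b, 2))  # 0.05 1.99
```

The residuals of a least-squares line with an intercept always sum to zero (up to rounding), which is a quick sanity check on the fit.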


Central Tendencies:
An average value of any distribution of data that best represents the middle; also called centrality. We need to choose the most appropriate measure of central tendency for the data, viz. MEAN, MEDIAN or MODE.

MEAN:
The mean is the average of the numbers: a calculated "central" value of a set of numbers. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.
** MEAN is appropriate when the data have neither heavy repetition nor extreme values, because in those cases it does not give a good picture of the central value of the data set.

MEDIAN:
The median is the "middle number": the middle number in a sorted list of numbers. To determine the median value in a sequence of numbers, the numbers must first be arranged in value order from lowest to highest. If there is an odd count of numbers, the median value is the number that is in the middle, with the same count of numbers below and above. If there is an even count of numbers in the list, the middle pair must be determined, added together and divided by two to find the median value. The median can be used to determine an approximate average.
** MEDIAN is appropriate when the data set has outliers, i.e. extreme values, because the other measures of central tendency would not then give an appropriate central value for the data set.
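The effect of an outlier can be seen in a short sketch. The numbers are invented: five ordinary values, then the same values with one extreme value added.

```python
import statistics

typical = [30, 32, 35, 38, 40]
with_outlier = typical + [400]  # one extreme value added

mean_typical = statistics.mean(typical)            # 35
median_typical = statistics.median(typical)        # 35
mean_outlier = statistics.mean(with_outlier)       # about 95.8
median_outlier = statistics.median(with_outlier)   # 36.5

print(mean_typical, median_typical)
print(round(mean_outlier, 1), median_outlier)
```

A single extreme value moves the mean from 35 to about 95.8, while the median only moves from 35 to 36.5.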

MODE:
The mode is the most frequent score in our data set.
It refers to the most frequently occurring number found in a set of numbers. The mode is found by collecting and organising the data in order to count the frequency of each result. The result with the highest occurrences is the mode of the set.
**MODE is appropriate to be used when there is repetition of value in the data set.



Submitted     by :   Parth Mehta ( 2013193)

Group Members:
                               Nikita Agarwal  (2013171)
                               Nimisha Agarwal (2013173)
                               Nihal Moidu  (2013170)
                               Priyesh Bhadauriya (2013214)

Use of Statistical Tools for Project on Rural Urban Distribution of Sex Ratio

The 17th & 18th sessions began with us incorporating some of the very important tools of statistics in our projects, such as Measures of Central Tendency (Mean, Median & Mode), T-Test, Regression, Correlation, Xtab, etc.


Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location. They are also classed as summary statistics. The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others.

·        Mean - The mean (or average) is the most popular and well known measure of central tendency. It can be used with both discrete and continuous data, although its use is most often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set.

·        Median - The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, we first need to rearrange that data into order of magnitude (smallest first) and then our median mark is the middle mark.

·        Mode - The mode is the most frequent score in our data set. On a histogram or bar chart it is represented by the highest bar. Therefore, the mode is sometimes described as the most popular option.

                       


Summary of when to use the mean, median and mode

Type of Variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median



T-Test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.




Among the most frequently used t-tests are:
  • A one-sample location test of whether the mean of a population has a value specified in a null hypothesis.
  • A two-sample location test of the null hypothesis that the means of two populations are equal.
  • A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero.
  • A test of whether the slope of a regression line differs significantly from 0.

We learned a lot of new things in the 3 hours, which further enriched our knowledge of statistics. Implementing these formulas in the project gave us an idea of how these tools are used in a corporate setting and how to analyse and interpret the data and the various graphs. Overall, it was a very informative and interesting session.





Submitted By:- Priyanka Doshi - 2013212

Group members:-
Nilay Kohaley – 2013172
Pawan Agarwal  – 2013195
Poulami Sarkar  – 2013201
Pragya Singh – 2013203

STATISTICS SESSION 17 AND 18

Mean: In mathematics and statistics, the arithmetic mean, or simply the mean or average when the context is clear, is the sum of a collection of numbers divided by the number of numbers in the collection. The collection is often a set of results of an experiment, or a set of results from a survey. The term "arithmetic mean" is preferred in some contexts in mathematics and statistics because it helps distinguish it from other means such as the geometric mean and the harmonic mean. For example, per capita income is the arithmetic average income of a nation's population.
                             
Suppose we have a data set containing the values $a_1, \ldots, a_n$. The arithmetic mean $A$ is defined by the formula
$$A = \frac{1}{n}\sum_{i=1}^{n} a_i$$

MEDIAN:In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one (e.g., the median of {3, 5, 9} is 5). If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values, which corresponds to interpreting the median as the fully trimmed mid-range. The median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data is contaminated, the median will not give an arbitrarily large result.
A median is only defined on ordered one-dimensional data, and is independent of any distance metric. A geometric median, on the other hand, is defined in any number of dimensions.
For any probability distribution on the real line R with cumulative distribution function F, regardless of whether it is any kind of continuous probability distribution, in particular an absolutely continuous distribution (which has a probability density function), or a discrete probability distribution, a median is by definition any real number m that satisfies the inequalities
$$\operatorname{P}(X \leq m) \geq \frac{1}{2} \text{ and } \operatorname{P}(X \geq m) \geq \frac{1}{2}$$
MODE:
The mode is the value that appears most often in a set of data. The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. The mode of a continuous probability distribution is the value x at which its probability density function has its maximum value, so, informally speaking, the mode is at the peak.
Like the statistical mean and median, the mode is a way of expressing, in a single number, important information about a random variable or a population. The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions.
The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The most extreme case occurs in uniform distributions, where all values occur equally frequently.


T-Test:
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known.

Independent samples

The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test. The randomization is not essential here—if we contacted 100 people by phone and obtained each person's age and gender, and then used a two-sample t-test to see whether the mean ages differ by gender, this would also be an independent samples t-test, even though the data are observational.

Paired samples

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures" t-test).
A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure lowering medication. By comparing the same patient's numbers before and after treatment, we are effectively using each patient as their own control. That way the correct rejection of the null hypothesis (here: of no difference made by the treatment) can become much more likely, with statistical power increasing simply because the random between-patient variation has now been eliminated. Note however that an increase of statistical power comes at a price: more tests are required, each subject having to be tested twice. Because half of the sample now depends on the other half, the paired version of Student's t-test has only 'n/2 - 1' degrees of freedom (with 'n' being the total number of observations). Pairs become individual test units, and the sample has to be doubled to achieve the same number of degrees of freedom.


REGRESSION:
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
Regression models involve the following variables:
  • The unknown parameters, denoted as β, which may represent a scalar or a vector.
  • The independent variables, X.
  • The dependent variable, Y.
In various fields of application, different terminologies are used in place of dependent and independent variables.
A regression model relates Y to a function of X and β.
$$Y \approx f(\mathbf{X}, \boldsymbol{\beta})$$


CORRELATION:
In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).
Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The commonest of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
The population correlation coefficient $\rho_{X,Y}$ between two random variables X and Y with expected values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is defined as:
$$\rho_{X,Y} = \mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

CROSS TABULATION:
Cross tabulation (or crosstabs for short) is a statistical process that summarises categorical data to create a contingency table. Contingency tables are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. For example:
Sample #   Gender   Handedness
1          Female   Right-handed
2          Male     Left-handed
3          Female   Right-handed
4          Male     Right-handed
5          Male     Left-handed
6          Male     Right-handed
7          Female   Right-handed
8          Female   Left-handed
9          Male     Right-handed
10         Female   Right-handed
cross tabulation leads to,
           Left-handed   Right-handed   Total
Males      2             3              5
Females    1             4              5
Total      3             7              10
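The counting behind this contingency table can be sketched in Python with `collections.Counter`. The ten (gender, handedness) pairs are the samples listed above; the variable names are illustrative.

```python
from collections import Counter

# The ten samples from the list above, in order.
samples = [
    ("Female", "Right-handed"),
    ("Male", "Left-handed"),
    ("Female", "Right-handed"),
    ("Male", "Right-handed"),
    ("Male", "Left-handed"),
    ("Male", "Right-handed"),
    ("Female", "Right-handed"),
    ("Female", "Left-handed"),
    ("Male", "Right-handed"),
    ("Female", "Right-handed"),
]

cell_counts = Counter(samples)                              # one count per table cell
row_totals = Counter(gender for gender, _ in samples)       # totals per gender
column_totals = Counter(hand for _, hand in samples)        # totals per handedness

for gender in ("Male", "Female"):
    left = cell_counts[(gender, "Left-handed")]
    right = cell_counts[(gender, "Right-handed")]
    print(gender, left, right, row_totals[gender])
print("Total", column_totals["Left-handed"], column_totals["Right-handed"], len(samples))
```

Each printed row gives the left-handed count, the right-handed count and the row total, in the same order as the table.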

SUBMITTED BY:
P.PRIYATHAM KIREETI(2013183)
GROUP NUMBER:10

GROUP MEMBERS:
P.KALYANI(2013184)
P.S.V.P.S.G.KARTHEEKI(2013198)
NISHIDH LAD(2013176)
PRIYADARSHI TANDON(2013211)