Thursday 5 September 2013

Correlation & Regression

CORRELATION
In statistics, dependence refers to any statistical relationship between two random variables or two sets of data. Correlation refers to any of a broad class of statistical relationships involving dependence.
Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a product and its price. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling; however, statistical dependence is not sufficient to demonstrate the presence of such a causal relationship (i.e., correlation does not imply causation).
Formally, dependence refers to any situation in which random variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to any departure of two or more random variables from independence, but technically it refers to any of several more specialized types of relationship between mean values. There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The commonest of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may exist even if one is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. The Pearson correlation is defined only if both standard deviations are finite and nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
The Pearson correlation is +1 in the case of a perfect positive (increasing) linear relationship (correlation), −1 in the case of a perfect decreasing (negative) linear relationship (anticorrelation), and some value between −1 and 1 in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
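As a quick illustration, Pearson's r can be computed directly from its definition above (covariance divided by the product of the standard deviations); the data values below are made up for illustration:

```python
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Covariance (numerator) over the product of the standard deviations
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 3))            # strength of the linear relationship
print(pearson_r(y, x) == r)   # symmetric: corr(X,Y) = corr(Y,X)
```

The result always lies between −1 and 1, and swapping the two variables leaves it unchanged, as the text notes.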

Techniques in Determining Correlation
There are several different correlation techniques. The Survey System's optional Statistics Module includes the most common type, called the Pearson or product-moment correlation. The module also includes a variation on this type called partial correlation. The latter is useful when you want to look at the relationship between two variables while removing the effect of one or two other variables.
Like all statistical techniques, correlation is only appropriate for certain kinds of data. Correlation works for quantifiable data in which numbers are meaningful, usually quantities of some sort. It cannot be used for purely categorical data, such as gender, brands purchased, or favorite color.

REGRESSION
In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or spurious relationships, so caution is advisable; for example, correlation does not imply causation.
A large body of techniques for carrying out regression analysis has been developed. Familiar methods such as linear regression and ordinary least squares regression are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional. In statistics, linear regression is an approach to modeling the relationship between a scalar dependent variable y and one or more explanatory variables, denoted X. The case of one explanatory variable is called simple linear regression.

Steps in such an investigation
  1. Plot the data. In many cases the plot can tell us visually whether there seems to be a relationship: if there is some correlation, do the variables increase or decrease together, or does one decrease when the other increases? Also, is a straight line a suitable model to describe the relationship between the two variables? If we want to go beyond this qualitative level of analysis, simple linear regression is often a useful tool. This involves fitting a straight line through our data and investigating the properties of the fitted line. It is conventional to plot the response variable Y on the vertical axis and the independent variable X on the horizontal axis.
  2. Plot the line of best fit. If the plot suggests a linear relationship, we proceed to quantify the relationship between the two variables by fitting a regression line through the data points.
Using regression we can also fit many other types of models, including those where we have more than one independent variable.
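The line-fitting in step 2 can be sketched in a few lines of code; the least-squares slope and intercept are computed from their textbook definitions, and the (x, y) pairs below are illustrative only:

```python
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope b = sum((x - mx)(y - my)) / sum((x - mx)^2)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx   # intercept, so the line passes through the means
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")
```

The fitted line can then be drawn over the scatter plot from step 1 to judge how well it describes the data.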


Submitted By:  Pragya Singh (2013203)

Group Members: Priyanka Doshi (2013212)
                           Poulami Sarkar (2013201)
                           Nilay Kohaley (2013172)
                           Pawan Agarwal (2013195)





Monday 2 September 2013

T-Test and Z-Test

INTRODUCTION

T-TEST
A statistical examination of two population means. A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size. For example, a t-test could be used to compare the average score obtained by Class A in Maths to the average score obtained by Class B in the same subject.

Z-TEST 
A statistical test used to determine whether two population means are different when the variances are known and the sample size is large. The test statistic is assumed to have a normal distribution and nuisance parameters such as standard deviation should be known in order for an accurate z-test to be performed.

A single sample experiment compares one sample to a population. There are two types of statistics you can use to compare a single sample to a population:

A. Single Sample z-test
If your sample size is above 1000 then this is the appropriate statistic to use.
The single sample z-test formula is shown below:

z = (M − μ) / SE

The formula reads: the z-test equals the sample mean minus the population mean, divided by the standard error.
As mentioned in an earlier lesson, if we do not know the population standard deviation we can use the sample standard deviation to estimate the standard error:

SE = s / √n

The formula reads: the standard error equals the sample standard deviation divided by the square root of the sample size.
When we use an alpha level of 0.05, any z score that results in a probability of less than 0.05 allows us to reject the null hypothesis and accept the research hypothesis.  All you need to know is the minimal z score necessary for significance.  Rather than constantly going to the Z table you can just memorize the one-tailed and two-tailed z scores that equate to a 0.05 level of significance. If you go back to the end of Lesson 11 you will see that a two-tailed hypothesis needs a z score of 1.96 to be significant while a one-tailed test needs a z score of 1.64 to be significant.
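A minimal sketch of the single-sample z-test just described, with hypothetical numbers (the sample mean of 103, population mean of 100, standard deviation of 15, and n = 1600 are made up for illustration):

```python
import math

def z_test(sample_mean, pop_mean, sd, n):
    se = sd / math.sqrt(n)               # standard error = sd / sqrt(n)
    return (sample_mean - pop_mean) / se

z = z_test(sample_mean=103, pop_mean=100, sd=15, n=1600)
print(round(z, 2))       # 8.0
# Two-tailed test at alpha = 0.05: significant if |z| > 1.96
print(abs(z) > 1.96)     # True
```

Here the z score far exceeds the two-tailed cutoff of 1.96, so the null hypothesis would be rejected.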
B. Single Sample t-test
If your sample size is below 1000 then this is the appropriate statistic to use.
The single sample t-test formula is shown below:

t = (M − μ) / SE

The formula reads: the t-test equals the sample mean minus the population mean, divided by the standard error.
As with the z-test above, if we do not know the population standard deviation we can use the sample standard deviation to estimate the standard error:

SE = s / √n

The formula reads: the standard error equals the sample standard deviation divided by the square root of the sample size.
The t distribution is similar to the z distribution in that both are symmetrical, bell-shaped sampling distributions. The overall shape of the t distribution is influenced by the sample size used to generate it. Therefore, when the sample is large (n > 1000) you should use the z-test, and when the sample is small you should use the t-test. Because of this we need to use degrees of freedom to determine our significance threshold. For a single sample t-test the degrees of freedom calculation is as follows:
df = n - 1

Now we can go to the T Table to see if our statistic is significant.


df   One-Tail:  .4     .25    .1     .05     .025    .01     .005    .0025   .001    .0005
     Two-Tail:  .8     .5     .2     .1      .05     .02     .01     .005    .002    .001
 1              0.325  1.000  3.078  6.314   12.706  31.821  63.657  127.32  318.31  636.62
 2              0.289  0.816  1.886  2.920   4.303   6.965   9.925   14.089  22.327  31.598
 3              0.277  0.765  1.638  2.353   3.182   4.541   5.841   7.453   10.214  12.924
 4              0.271  0.741  1.533  2.132   2.776   3.747   4.604   5.598   7.173   8.610
 5              0.267  0.727  1.476  2.015   2.571   3.365   4.032   4.773   5.893   6.869
 6              0.265  0.718  1.440  1.943   2.447   3.143   3.707   4.317   5.208   5.959
 7              0.263  0.711  1.415  1.895   2.365   2.998   3.499   4.029   4.785   5.408
 8              0.262  0.706  1.397  1.860   2.306   2.896   3.355   3.833   4.501   5.041
 9              0.261  0.703  1.383  1.833   2.262   2.821   3.250   3.690   4.297   4.781

The T Table continues for higher degrees of freedom.
The T Table is similar to the R Table we used in lesson 7. The degrees of freedom are in the far left column and the levels of significance for each type of tailed test are in the above column headings. As with the R Table critical R values, the T Table gives you the critical T values. Your calculated T value must surpass the critical T value for your statistic to be considered significant.
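Putting the pieces together, here is a sketch of a single-sample t-test on hypothetical data (a made-up sample of 10 scores against a made-up population mean of 48, so df = 9), compared against the critical two-tailed value of 2.262 at the 0.05 level:

```python
import math

sample = [52, 48, 55, 51, 49, 53, 50, 54, 47, 51]  # hypothetical scores
pop_mean = 48

n = len(sample)
mean = sum(sample) / n
# Sample standard deviation, then standard error = s / sqrt(n)
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
se = sd / math.sqrt(n)
t = (mean - pop_mean) / se
df = n - 1                     # degrees of freedom = n - 1

print(f"t = {t:.3f} with df = {df}")
# Critical two-tailed T value at alpha = 0.05 for df = 9 is 2.262
print(abs(t) > 2.262)          # True
```

The calculated t value surpasses the critical value for df = 9, so the result would be considered significant.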

III. Two Sample Experimental Statistics
For these experiments we are comparing two samples. This is the very common control group vs. experimental group research design. There are two ways to conduct the analysis based on your sample groups.
A. t-test for Independent Groups
If your two sample groups are independent of each other then you can conduct a t-test for independent groups. The formula for this specific type of t-test is as follows:

t = (M1 − M2) / SEdiff

The formula reads: t (for independent groups) equals sample mean number 1 minus sample mean number 2, divided by the standard error of the difference.
The standard error of the difference is similar to the standard error calculated earlier. It simply is a better estimate for two independent samples. The standard error of the difference between independent sample means can be calculated with the formula below:

SEdiff = √(SE1² + SE2²)

The formula reads: the standard error of the difference equals the square root of the standard error of sample one squared plus the standard error of sample two squared.
The calculation for the degrees of freedom is as follows:

df independent groups = (n1 - 1) + (n2 - 1)

Once your calculations are complete you go to the T Table to see if your statistic is significant as above.
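The independent-groups calculation can be sketched with two hypothetical groups (the control and treated scores below are made up for illustration):

```python
import math

def mean_and_se(xs):
    """Return the sample mean, standard error, and size of a sample."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return m, sd / math.sqrt(n), n

control = [10, 12, 9, 11, 13, 10]
treated = [14, 15, 13, 16, 14, 12]

m1, se1, n1 = mean_and_se(control)
m2, se2, n2 = mean_and_se(treated)
# Standard error of the difference: sqrt(SE1^2 + SE2^2)
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
t = (m1 - m2) / se_diff
df = (n1 - 1) + (n2 - 1)       # df for independent groups

print(f"t = {t:.3f} with df = {df}")
```

The resulting t value would then be compared against the critical value for df = (n1 − 1) + (n2 − 1) in the T Table.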
B. t-test for Correlated Groups
If the two samples are not independent of each other but instead are positively correlated to each other, we conduct a t-test for correlated groups. There are two ways of calculating this statistic. One uses the correlation coefficient (r) of the two samples and one does not.

 1. t-test for Correlated Groups: using the r value
The t-test formula is the same as was used for independent groups:

t = (M1 − M2) / SEdiff

The formula reads: t (for correlated groups) equals sample mean number 1 minus sample mean number 2, divided by the standard error of the difference.
The new standard error formula is as follows:

SEdiff = √(SE1² + SE2² − 2 · r · SE1 · SE2)

The formula reads: the standard error of the difference equals the square root of the following: the sum of the squared standard error of the first sample mean and the squared standard error of the second sample mean, minus the product of 2 times the r value times the standard error from the first sample times the standard error from the second sample.
The calculation for the degrees of freedom is as follows:

df correlated groups = number of pairs - 1

Once your calculations are complete you go to the T Table to see if your statistic is significant as above.
 2. t-test for Correlated Groups: using raw data
The t-test for correlated groups using the raw data is as follows:

t = D̄ / SED

The formula reads: t (for correlated groups) equals D bar divided by the standard error of the difference.

D bar is the mean of all the difference scores. Difference scores are calculated by subtracting each Y value from its X pair value. You then sum these difference scores and divide by the number of pairs to get D bar. An example is shown in the table below:

 X     Y     D
15     5    10
 7     1     6
12     8     4
18    12     6
 8     9    -1

sum(D) = 25
n = 5
D bar = 25/5 = 5
The new standard error formula is as follows:

SED = √((ΣD²/n − D̄²) / (n − 1))

The formula reads: the standard error of the difference equals the square root of the following: the sum of D squared divided by n, minus D bar squared, with this entire quantity divided by the number of pairs minus one.

In order to get the sum of D squared you need to generate a new column of data as is shown below:

 X     Y     D     D²
15     5    10    100
 7     1     6     36
12     8     4     16
18    12     6     36
 8     9    -1      1

sum(D) = 25    sum(D²) = 189
n = 5
D bar = 25/5 = 5

The calculation for the degrees of freedom is the same:

df correlated groups = number of pairs - 1

Once your calculations are complete you go to the T Table to see if your statistic is significant as above.
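The worked example above can be checked in code, using the five (X, Y) pairs from the table:

```python
import math

X = [15, 7, 12, 18, 8]
Y = [5, 1, 8, 12, 9]

D = [x - y for x, y in zip(X, Y)]    # difference scores
n = len(D)
d_bar = sum(D) / n                   # 25 / 5 = 5
sum_d2 = sum(d ** 2 for d in D)      # 189

# Standard error: sqrt((sum(D^2)/n - Dbar^2) / (n - 1))
se = math.sqrt((sum_d2 / n - d_bar ** 2) / (n - 1))
t = d_bar / se
df = n - 1                           # number of pairs - 1

print(f"t = {t:.3f} with df = {df}")
# Critical two-tailed T value at alpha = 0.05 for df = 4 is 2.776
print(abs(t) > 2.776)                # True
```

The calculated t value of about 2.795 just surpasses the critical value of 2.776 for df = 4, so this example would be significant at the 0.05 level on a two-tailed test.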



 Name : Nilay Kohaley (2013172)

Members : Pawan Agarwal
                    Priyanka Doshi
                    Pragya Singh
                    Poulami Sarkar

References: Investopedia, Wikipedia