Friday, 30 August 2013

Statistics - Session 17 & Session 18

In today's class we discussed mean, median, mode, the t-test, correlation, regression and cross tabulation. Here is a brief summary of each topic.

Mean: The mean is often confused with the median, mode or range. The mean is the arithmetic average of a set of values, or distribution; however, for skewed distributions, the mean is not necessarily the same as the middle value (median) or the most likely value (mode). For example, mean income is skewed upwards by a small number of people with very large incomes, so that the majority have an income lower than the mean. By contrast, the median income is the level at which half the population is below and half is above. The mode income is the most likely income, and favors the larger number of people with lower incomes. The median or mode is often a more intuitive measure of such data.
For example, the arithmetic mean of five values: 4, 36, 45, 50, 75 is
\frac{4 + 36 + 45 + 50 + 75}{5} = \frac{210}{5} = 42.
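The same calculation can be checked in a few lines of Python. This is a minimal sketch using the standard-library `statistics` module; the income figures are made-up numbers (not real data), chosen only to illustrate how a single large value pulls the mean above the median.

```python
import statistics

# The five values from the example above.
values = [4, 36, 45, 50, 75]
mean_value = statistics.mean(values)
print(mean_value)  # 42

# Hypothetical incomes (illustrative only): one very large income
# drags the mean far above what most people in the list earn.
incomes = [20_000, 22_000, 25_000, 27_000, 30_000, 1_000_000]
income_mean = statistics.mean(incomes)
income_median = statistics.median(incomes)
print(income_mean > income_median)  # True
```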

Median: The median is the numerical value separating the higher half of a data sample, a population, or a probability distribution from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one (e.g., the median of {3, 5, 9} is 5). If there is an even number of observations, there is no single middle value; the median is then usually defined to be the mean of the two middle values, which corresponds to interpreting the median as the fully trimmed mid-range. The median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data is contaminated, the median will not give an arbitrarily large result.
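A short Python sketch of the odd case, the even case, and the robustness point, using the standard-library `statistics` module. The lists other than {3, 5, 9} are illustrative values assumed for this example.

```python
import statistics

# Odd number of observations: the middle value after sorting.
median_odd = statistics.median([9, 3, 5])
print(median_odd)  # 5

# Even number of observations: the mean of the two middle values (5 and 9).
median_even = statistics.median([3, 5, 9, 11])
print(median_even)  # 7.0

# Robustness: replacing one value with an extreme outlier moves the mean
# a great deal but leaves the median where it was.
clean = [3, 5, 7, 9, 11]
contaminated = [3, 5, 7, 9, 11_000]
median_clean = statistics.median(clean)
median_contaminated = statistics.median(contaminated)
mean_clean = statistics.mean(clean)
mean_contaminated = statistics.mean(contaminated)
print(median_clean == median_contaminated)  # True
```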

Mode: The mode of a set of data values is the value (or values) that occurs most often. The mode has applications in printing. For example, it is important to print more of the most popular books, because printing different books in equal numbers would cause a shortage of some books and an oversupply of others. Likewise, the mode has applications in manufacturing. For example, it is important to manufacture more of the most popular shoes, because manufacturing different shoes in equal numbers would cause a shortage of some shoes and an oversupply of others.
For example:
      48     44     48     45     42     49     48
The mode is 48, as it appears most often (three times).
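The count behind this example can be reproduced with Python's standard library. This is a small sketch; the second list is an assumed example added to show that a data set can have more than one mode.

```python
from collections import Counter
import statistics

data = [48, 44, 48, 45, 42, 49, 48]

# Count how often each value occurs and take the most frequent one.
counts = Counter(data)
mode_value, frequency = counts.most_common(1)[0]
print(mode_value, frequency)  # 48 3

# A data set can have several modes; multimode returns all of them.
modes = statistics.multimode([1, 1, 2, 2, 3])
print(modes)  # [1, 2]
```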
T test: A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution when the null hypothesis is true. It can be used to determine whether two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.
Uses:
  • A two-sample location test of the null hypothesis that the means of two populations are equal. All such tests are usually called Student's t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch's t-test. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.
  • A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient's tumor before and after a treatment. If the treatment is effective, we expect the tumor size for many of the patients to be smaller following the treatment. This is often referred to as the "paired" or "repeated measures" t-test.
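The two uses above differ mainly in how the test statistic is built. The following is a minimal sketch, not a full test: it computes only the t statistic for the unpaired (Welch) case and the paired case, using the standard library. The function names `welch_t` and `paired_t` and all measurements are hypothetical, invented for illustration. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.ttest_rel`) would normally be used for that.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for two independent samples, not assuming equal variances."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

def paired_t(before, after):
    """t statistic for the mean of the per-unit differences (after - before)."""
    differences = [a - b for a, b in zip(after, before)]
    mean_diff = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return mean_diff / standard_error

# Hypothetical tumor sizes for the same four patients (made-up numbers).
tumor_before = [5.0, 6.0, 7.0, 8.0]
tumor_after = [4.0, 5.5, 6.0, 6.5]
paired_statistic = paired_t(tumor_before, tumor_after)

# The same numbers treated as two unrelated groups of patients.
welch_statistic = welch_t(tumor_before, tumor_after)

print(paired_statistic, welch_statistic)
```

With these made-up numbers the paired statistic is much larger in magnitude than the unpaired one, because pairing removes the patient-to-patient variation from the comparison.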

Correlation

              
When two sets of data are strongly linked together, we say they have a high correlation.
  • Correlation is Positive when the values increase together, and
  • Correlation is Negative when one value decreases as the other increases
The correlation coefficient can take any value from -1 to 1:
  • 1 is a perfect positive correlation
  • 0 is no correlation (the values don't seem linked at all)
  • -1 is a perfect negative correlation
The value shows how good the correlation is (not how steep the line is), and whether it is positive or negative.
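The point that the value measures the strength of the link rather than the steepness of the line can be checked with a hand-written Pearson correlation coefficient. This is a sketch; the helper name `pearson_r` and the data are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

xs = [1, 2, 3, 4, 5]
steep = [10, 20, 30, 40, 50]         # rises quickly with x
shallow = [1.1, 1.2, 1.3, 1.4, 1.5]  # rises slowly with x
falling = [5, 4, 3, 2, 1]            # decreases as x increases

r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)
r_falling = pearson_r(xs, falling)

# Both rising lines are perfectly linear, so both give (approximately) 1
# despite their different slopes; the falling line gives (approximately) -1.
print(round(r_steep, 6), round(r_shallow, 6), round(r_falling, 6))
```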
Regression
Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
Regression analysis is widely used for prediction and forecasting.
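As an illustration of the simplest case, one dependent variable and one independent variable, the sketch below fits a straight line by ordinary least squares and uses it for a prediction. The helper name `fit_line` and the data points are hypothetical; regression in general covers many more techniques than this one.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations that lie close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Estimated average value of y when x is 6 (a prediction from the fitted line).
predicted = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```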
Cross Tab
Cross tabulation (or crosstabs for short) is a statistical process that summarises categorical data to create a contingency table. Crosstabs are heavily used in survey research, business intelligence, engineering and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
Some entries may be weighted; unweighted tables are commonly known as pivot tables.

Example:

Sample #    Gender    Handedness
1           Female    Right-handed
2           Male      Left-handed
3           Female    Right-handed
4           Male      Right-handed
5           Male      Left-handed
6           Male      Right-handed
7           Female    Right-handed
8           Female    Left-handed
9           Male      Right-handed
10          Female    Right-handed
Cross-tabulation leads to the following contingency table:
            Left-handed    Right-handed    Total
Males       2              3               5
Females     1              4               5
Total       3              7               10
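The counting behind this contingency table can be sketched in Python with `collections.Counter`. The data are the ten samples listed in the example; the variable names are chosen for this illustration.

```python
from collections import Counter

# The ten (gender, handedness) samples from the example.
samples = [
    ("Female", "Right-handed"),
    ("Male", "Left-handed"),
    ("Female", "Right-handed"),
    ("Male", "Right-handed"),
    ("Male", "Left-handed"),
    ("Male", "Right-handed"),
    ("Female", "Right-handed"),
    ("Female", "Left-handed"),
    ("Male", "Right-handed"),
    ("Female", "Right-handed"),
]

# Count each (gender, handedness) combination.
counts = Counter(samples)

genders = ["Male", "Female"]
hands = ["Left-handed", "Right-handed"]

# Arrange the counts as rows (gender) and columns (handedness).
table = {g: {h: counts[(g, h)] for h in hands} for g in genders}
row_totals = {g: sum(table[g].values()) for g in genders}
column_totals = {h: sum(table[g][h] for g in genders) for h in hands}

for g in genders:
    print(g, table[g], row_totals[g])
print("Total", column_totals, len(samples))
```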
We had a clear understanding of all these concepts in today's class.
Written by: Neeraj Ramadoss (2013167)
Group Members
Nishanth Agarwal
Nitin Kumar Shukla
Prakar Swami
Prerana Arora
Praveen Iyer
Neeraj Ramadoss
