Applied Business Statistics

Sunday, 1 September 2013

SUMMARY OF 19th & 20th SESSIONS

MEAN :

In probability and statistics, mean and expected value are used synonymously to refer to one measure of the central tendency either of a probability distribution or of the random variable characterized by that distribution. In the case of a discrete probability distribution of a random variable X, the mean is equal to the sum over every possible value weighted by the probability of that value; that is, it is computed by taking the product of each possible value x of X and its probability P(x), and then adding all these products together, giving $\mu = \sum x P(x)$ . An analogous formula applies to the case of a continuous probability distribution. Not every probability distribution has a defined mean; see the Cauchy distribution for an example. Moreover, for some distributions the mean is infinite: for example, when the probability of the value $2^n$ is $\tfrac{1}{2^n}$ for n = 1, 2, 3, ....

For a data set, the terms arithmetic mean, mathematical expectation and sometimes average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. The arithmetic mean of a set of numbers x₁, x₂, ..., x_n is typically denoted by $\bar{x}$ , pronounced "x bar". If the data set is based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is termed the sample mean (denoted $\bar{x}$ ) to distinguish it from the population mean (denoted $\mu$ or $\mu_x$ ).

For a finite population, the population mean of a property is equal to the arithmetic mean of the given property while considering every member of the population. For example, the population mean height is equal to the sum of the heights of every individual divided by the total number of individuals. The sample mean may differ from the population mean, especially for small samples. The law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean.
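The two definitions above can be sketched in a few lines of Python. This is only an illustration: the fair die and the six sample rolls are made-up values, and `expected_value` is a hypothetical helper name, not something from the session.

```python
from fractions import Fraction
from statistics import mean

def expected_value(distribution):
    """Return the sum of x * P(x) for a {value: probability} mapping."""
    return sum(x * p for x, p in distribution.items())

# Population mean of a fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu = expected_value(die)

# Sample mean of a small, made-up sample of rolls from that die.
sample = [2, 6, 3, 3, 5, 1]
x_bar = mean(sample)

# Infinite-mean example: the value 2**n has probability 1/2**n, so every
# term of the sum contributes exactly 1 and the partial sums grow forever.
partial_sum = sum(Fraction(2 ** n) * Fraction(1, 2 ** n) for n in range(1, 11))

print(mu, x_bar, partial_sum)
```

The sample mean differs from the population mean here because the sample is tiny, which is the point made about small samples.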

MEDIAN :

In statistics and probability theory, the median is the numerical value separating the higher half of a data sample, a population, or a probability distribution, from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one (e.g., the median of {3, 5, 9} is 5). If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values, which corresponds to interpreting the median as the fully trimmed mid-range. The median is of central importance in robust statistics, as it is the most resistant statistic, having a breakdown point of 50%: so long as no more than half the data is contaminated, the median will not give an arbitrarily large result.

A median is only defined on ordered one-dimensional data, and is independent of any distance metric. A geometric median, on the other hand, is defined in any number of dimensions.

In a sample of data, or a finite population, there may be no member of the sample whose value is identical to the median (in the case of an even sample size); if there is such a member, there may be more than one so that the median may not uniquely identify a sample member. Nonetheless, the value of the median is uniquely determined with the usual definition. A related concept, in which the outcome is forced to correspond to a member of the sample, is the medoid. At most, half the population have values strictly less than the median, and, at most, half have values strictly greater than the median. If each group contains less than half the population, then some of the population is exactly equal to the median. For example, if a < b < c, then the median of the list {a, b, c} is b, and, if a < b < c < d, then the median of the list {a, b, c, d} is the mean of b and c; i.e., it is (b + c)/2.

The median can be used as a measure of location when a distribution is skewed, when end-values are not known, or when one requires reduced importance to be attached to outliers, e.g., because they may be measurement errors.

In terms of notation, some authors represent the median of a variable x either as $\tilde{x}$ or as $\mu_{1/2}$ , sometimes also M. There is no widely accepted standard notation for the median, so the use of these or other symbols for the median needs to be explicitly defined when they are introduced.
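The sort-and-pick-the-middle rule can be sketched as follows. The helper name `median_by_sorting` and the sample lists are illustrative assumptions; the last list adds one extreme value to show how little the median moves compared with the mean.

```python
from statistics import mean, median

def median_by_sorting(values):
    """Sort the values and return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

odd = [9, 3, 5]
even = [1, 4, 7, 10]
with_outlier = [3, 5, 9, 1_000_000]

print(median_by_sorting(odd), median_by_sorting(even))
print(median_by_sorting(with_outlier), mean(with_outlier))
```

The standard library's `statistics.median` applies the same rule, so it can be used directly in practice.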

MODE :

The mode is the value that appears most often in a set of data. The mode of a discrete probability distribution is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled. The mode of a continuous probability distribution is the value x at which its probability density function has its maximum value, so, informally speaking, the mode is at the peak.

Like the statistical mean and median, the mode is a way of expressing, in a single number, important information about a random variable or a population. The numerical value of the mode is the same as that of the mean and median in a normal distribution, and it may be very different in highly skewed distributions.

The mode is not necessarily unique, since the same maximum frequency may be attained at different values. The most extreme case occurs in uniform distributions, where all values occur equally frequently.
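A short sketch of finding every mode of a data set, including the non-unique and uniform cases just described. The function name `modes` and the three lists are made up for illustration.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the maximum frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

single = modes([1, 2, 2, 3, 3, 3, 4])   # one value is most frequent
tied = modes([1, 1, 2, 2, 3])           # two values share the top frequency
uniform = modes([1, 2, 3])              # every value occurs equally often

print(single, tied, uniform)
```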

REGRESSION :

In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Regression models involve the following variables:

The unknown parameters, denoted as β, which may represent a scalar or a vector.
The independent variables, X.
The dependent variable, Y.

LINEAR REGRESSION :

In linear regression, the model specification is that the dependent variable, $y_i$ , is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling $n$ data points there is one independent variable, $x_i$ , and two parameters, $\beta_0$ and $\beta_1$ :

straight line: $y_i=\beta_0 +\beta_1 x_i +\varepsilon_i,\quad i=1,\dots,n.\!$

In multiple linear regression, there are several independent variables or functions of independent variables.

Adding a term in x_i² to the preceding regression gives:

parabola: $y_i=\beta_0 +\beta_1 x_i +\beta_2 x_i^2+\varepsilon_i,\ i=1,\dots,n.\!$
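For the straight-line model, the parameters can be estimated by ordinary least squares, which has a simple closed form. The sketch below assumes that estimation method (the summary above does not name one), and both the data points and the helper name `fit_line` are made up for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares estimates for y = beta0 + beta1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = y_bar - beta1 * x_bar     # intercept
    return beta0, beta1

# Made-up data lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_line(xs, ys)
print(f"y = {beta0:.2f} + {beta1:.2f} x")
```

Fitting the parabola works the same way in principle, with $x_i^2$ treated as a second independent variable, but it needs a small linear system to be solved rather than a one-line formula.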

Submitted by
Polisetti kartheeki

GROUP MEMBERS:
Nishid lad(2013176)
Priyatam(2013183)
kalyani(2013184)
Polisetti kartheeki(2013198)
Priyadarshi(2013211)

Regression

Given the data of MONTH WISE CEMENT PRODUCTION IN INDIA for the last few years, we used the techniques of CORRELATION, CENTRAL TENDENCIES, Regression Analysis etc. As the first two have been explained in earlier blogs of our group, here is the last one.

Regression analysis

Regression analysis allows you to model, examine, and explore spatial relationships, and can help explain the factors behind observed spatial patterns. Regression analysis is also used for prediction. You may want to understand why people are persistently dying young in certain regions, for example, or may want to predict rainfall where there are no rain gauges.

When used properly, regression methods such as OLS and GWR are powerful and reliable statistics for examining/estimating linear relationships. Linear relationships are either positive or negative. The graphic below depicts both positive and negative relationships, as well as the case where there is no relationship between two variables:

Correlation analyses and their associated graphics depicted above, test the strength of the relationship between two variables. Regression analyses, on the other hand, make a stronger claim; they attempt to demonstrate the degree to which one or more variables potentially promote positive or negative change in another variable.

Using Regression Analysis

Regression analysis can be used for a large variety of applications:

Modeling fire frequency to determine high risk areas and to understand the factors that contribute to high risk areas.
Modeling property loss from fire as a function of variables such as degree of fire department involvement, response time, property value, etc. If you find that response time is the key factor, you may need to build more fire stations. If you find that involvement is the key factor, you may need to increase equipment/officers dispatched.
Modeling traffic accidents as a function of speed, road conditions, weather, etc. in order to inform policy aimed at decreasing accidents.

There are three primary reasons we use regression analysis:

1. To model some phenomena in order to better understand it and possibly use that understanding to affect policy or to make decisions about appropriate actions to take. Basic objective: to measure the extent that changes in one or more variables jointly affect changes in another. Example: Understand the key characteristics of the habitat for some particular endangered species of bird (perhaps precipitation, food sources, vegetation, predators… ) to assist in designing legislation aimed at protecting that species.

2. To model some phenomenon in order to predict values for that phenomenon at other places or other times. Basic objective: to build a prediction model that is consistent and accurate. Example: where are real estate values likely to go up next year? Or: there are rain gauges at particular places and a set of variables that explain the observed precipitation values… how much rain falls in places where there are no gauges? (Regression may be used in cases where interpolation is not effective because of insufficient sampling: there are no gauges on peaks or in valleys, for example).

3. You can also use regression analysis to test hypotheses. Suppose you are modeling residential crime in order to better understand it, and hopefully implement policy to prevent it. As you begin your analysis you probably have questions or hypotheses you want to test:

"Broken Window Theory" indicates that defacement of public property (graffiti, damaged structures, etc.) invite other crimes. Will there be a positive relationship between vandalism incidents and residential burglary?
Is there a relationship between illegal drug use and burglary (might drug addicts steal to support their habits)?
Are burglars predatory? Might there be more incidents in residential neighborhoods with higher proportions of elderly or female headed households?
Is a person at greater risk for burglary if they live in a rich or a poor neighborhood?

You can use regression analysis to test these relationships and answer your questions.

Regression Analysis components

It is impossible to discuss regression analysis without first becoming familiar with a few terms and basic concepts specific to regression statistics:

Regression equation: this is the mathematical formula applied to the explanatory variables in order to best predict the dependent variable you are trying to model. Unfortunately for those in the Geosciences who think of X and Y as coordinates, the notation in regression equations for the dependent variable is always "y" and for independent or explanatory variables is always "X". Each independent variable is associated with a regression coefficient describing the strength and the sign of that variable's relationship to the dependent variable. A regression equation might look like this (y is the dependent variable, the X's are the explanatory variables, and the β's are regression coefficients; each of these components of the regression equation are explained further below):

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$

Dependent variable (y): this is the variable representing the process you are trying to predict or understand (e.g., residential burglary, foreclosure, rainfall). In the regression equation, it appears on the left side of the equal sign. While you can use regression to predict the dependent variable, you always start with a set of known y values and use these to build (or to calibrate) the regression model. The known y values are often referred to as observed values.
Independent/Explanatory variables (X): these are the variables used to model or to predict the dependent variable values. In the regression equation, they appear on the right side of the equal sign and are often referred to as explanatory variables. We say that the dependent variable is a function of the explanatory variables. If you are interested in predicting annual purchases for a proposed store, you might include in your model explanatory variables representing the number of potential customers, distance to competition, store visibility, and local spending patterns, for example.
Regression coefficients (β): coefficients are computed by the regression tool. They are values, one for each explanatory variable, that represent the strength and type of relationship the explanatory variable has to the dependent variable. Suppose you are modeling fire frequency as a function of solar radiation, vegetation, precipitation and aspect. You might expect a positive relationship between fire frequency and solar radiation (the more sun, the more frequent the fire incidents). When the relationship is positive, the sign for the associated coefficient is also positive. You might expect a negative relationship between fire frequency and precipitation (places with more rain have fewer fires). Coefficients for negative relationships have negative signs. When the relationship is a strong one, the coefficient is large. Weak relationships are associated with coefficients near zero.

β₀ is the regression intercept. It represents the expected value for the dependent variable if all of the independent variables are zero.

P-Values: most regression methods perform a statistical test to compute a probability, called a p-value, for the coefficients associated with each independent variable. The null hypothesis for this statistical test states that a coefficient is not significantly different from zero (in other words, for all intents and purposes, the coefficient is zero and the associated explanatory variable is not helping your model). Small p-values reflect small probabilities, and suggest that the coefficient is, indeed, important to your model with a value that is significantly different from zero (the coefficient is NOT zero). You would say that a coefficient with a p value of 0.01, for example, is statistically significant at the 99% confidence level; the associated variable is an effective predictor. Variables with coefficients near zero do not help predict or model the dependent variable; they are almost always removed from the regression equation, unless there are strong theoretical reasons to keep them.

R²/R-Squared: Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression equation to quantify model performance. The value of R-squared ranges from 0 to 1 (that is, from 0 to 100 percent). If your model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you, no doubt, have made an error… perhaps you've used a form of y to predict y). More likely, you will see R-squared values like 0.49, for example, which you can interpret by saying: this model explains 49% of the variation in the dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed Y values sorted by the estimated values. Notice how much overlap there is. This graphic provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value because it reflects model complexity (the number of variables) as it relates to the data.

Residuals: these are the unexplained portion of the dependent variable, represented in the regression equation as the random error term, ε. Known values for the dependent variable are used to build and to calibrate the regression model. Using known values for the dependent variable (y) and known values for all of the explanatory variables (the Xs), the regression tool constructs an equation that will predict those known y values as well as possible. The predicted values will rarely match the observed values exactly. The differences between the observed y values and the predicted y values are called the residuals. The magnitude of the residuals from a regression equation is one measure of model fit. Large residuals indicate poor model fit.
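As a rough sketch of how residuals, R-squared and Adjusted R-squared relate to one another, the snippet below uses made-up observations and hypothetical fitted coefficients (`beta0`, `beta1`) for a model with a single explanatory variable. The adjusted value uses the usual formula 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of explanatory variables.

```python
from statistics import mean

# Made-up data and hypothetical coefficients from an earlier fit.
xs = [1, 2, 3, 4, 5]
observed = [2.1, 3.9, 6.2, 7.8, 10.1]
beta0, beta1 = 0.05, 1.99

predicted = [beta0 + beta1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]

# R-squared: share of the variation in y that the model explains.
ss_res = sum(e ** 2 for e in residuals)
y_bar = mean(observed)
ss_tot = sum((y - y_bar) ** 2 for y in observed)
r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared penalises the number of explanatory variables k.
n, k = len(observed), 1
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(residuals)
print(round(r_squared, 4), round(adj_r_squared, 4))
```

Small residuals relative to the spread of the observed values give an R-squared close to 1, and the adjusted figure comes out slightly lower.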

Building a regression model is an iterative process that involves finding effective independent variables to explain the process you are trying to model/understand, then running the regression tool to determine which variables are effective predictors… then removing/adding variables until you find the best model possible.

Blogged By : Neeraj Garg (2013166)

Group No. 1 Members:

Piyush (2013197)

Pallavi Gupta (2013187)

Prerna Bansal (2013209)

Priya Jain (2013210)

T-test & Correlation

T-test

A t-test is any statistical hypothesis test in which the test statistic follows a Student's t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student's t distribution.

Unpaired and paired two-sample t-test

Two-sample t-tests for a difference in mean involve independent samples, paired samples and overlapping samples. Paired t-tests are a form of blocking, and have greater power than unpaired tests when the paired units are similar with respect to "noise factors" that are independent of membership in the two groups being compared. In a different context, paired t-tests can be used to reduce the effects of confounding factors in an observational study.

(a) Independent samples

The independent samples t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomize 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test. The randomization is not essential here—if we contacted 100 people by phone and obtained each person's age and gender, and then used a two-sample t-test to see whether the mean ages differ by gender, this would also be an independent samples t-test, even though the data are observational.

(b) Paired samples

Paired samples t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures" t-test).
A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure lowering medication. By comparing the same patient's numbers before and after treatment, we are effectively using each patient as their own control. That way the correct rejection of the null hypothesis (here: of no difference made by the treatment) can become much more likely, with statistical power increasing simply because the random between-patient variation has now been eliminated. Note however that an increase of statistical power comes at a price: more tests are required, each subject having to be tested twice. Because half of the sample now depends on the other half, the paired version of Student's t-test has only 'n/2 - 1' degrees of freedom (with 'n' being the total number of observations). Pairs become individual test units, and the sample has to be doubled to achieve the same number of degrees of freedom.
A paired samples t-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest. The matching is carried out by identifying pairs of values consisting of one observation from each of the two samples, where the pair is similar in terms of other measured variables. This approach is sometimes used in observational studies to reduce or eliminate the effects of confounding factors.
Paired samples t-tests are often referred to as "dependent samples t-tests" (as are t-tests on overlapping samples).
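A minimal sketch of the repeated-measures calculation: the test works on the per-subject differences, so five patients measured twice give ten observations but only 10/2 - 1 = 4 degrees of freedom. The blood-pressure readings below are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical systolic blood pressure for the same five patients.
before = [150, 142, 160, 155, 148]
after = [140, 138, 150, 149, 141]

# Each patient acts as their own control: analyse the differences only.
diffs = [a - b for a, b in zip(after, before)]
n_pairs = len(diffs)

mean_diff = mean(diffs)
se_diff = stdev(diffs) / sqrt(n_pairs)   # standard error of the mean difference
t_stat = mean_diff / se_diff
df = n_pairs - 1                         # equals n/2 - 1 for n total observations

print(round(t_stat, 3), df)
```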

(c) Overlapping samples

An overlapping samples t-test is used when there are paired samples with data missing in one or the other samples (e.g., due to selection of "Don't know" options in questionnaires or because respondents are randomly assigned to a subset question). These tests are widely used in commercial survey research (e.g., by polling companies) and are available in many standard crosstab software packages.

Year	t value	P value	Null hypothesis
2003-04	-10.4625	0.00	reject
2004-05	-10.1878	0.00	reject
2005-06	-8.47942	0.00	reject
2006-07	-6.92902	0.00	reject
2007-08	-1.71972	0.06	accept
2008-09	0.716896	0.24	accept
2009-10	-8.8544	0.00	reject
2010-11	-4.72108	0.00	reject
2011-12	-3.09829	0.01	reject

CORRELATION

Correlation is a measure of the degree of relatedness of variables. It can help a business researcher determine, for example, whether the stocks of two airlines rise and fall in any related manner. For a sample of pairs of data, correlation analysis can yield a numerical value that represents the degree of relatedness of the two stock prices over time.

Correlation is determined using the sample coefficient of correlation, r, where r is a measure of the linear correlation of two variables.

The correlation between two variables can be computed using the Pearson Product-Moment Correlation Coefficient, which is given by

$r = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$
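As a sketch, the Pearson coefficient divides the sum of cross-products of deviations by the square root of the product of the two sums of squared deviations. The two price series below are invented stand-ins for the airline-stock example, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up daily prices of two airline stocks.
airline_a = [10, 12, 11, 14, 15]
airline_b = [20, 23, 22, 27, 30]

r = pearson_r(airline_a, airline_b)
print(round(r, 4))
```

A value of r near +1 means the two series rise and fall together, near -1 means they move in opposite directions, and near 0 means there is little linear relationship.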


SUBMITTED BY :  Palak Jain(2013185)
GROUP No : 7
Nidhi Sharma (2013169)
Nitesh Singh Patel(2013178)
Nitin Boratwar(2013179)
Palak Jain(2013185)
  Pallavi Bizoara (2013186)

t-test

A t-test is a statistic that checks whether two means (averages) are reliably different from each other. Why not just look at the means? Looking at the means may show a difference, but we cannot be sure that it is a reliable difference. For example, if we each toss a coin 100 times and you get 52 heads whereas I get 49, that difference is due to chance.

This leads to the difference between inferential and descriptive statistics. A descriptive statistic describes the sample we have but cannot be generalized beyond it; it does not tell us what will happen in future. An inferential statistic, such as the t-test, allows us to make inferences about the population beyond our data. How does a t-test work? It compares the difference between the groups with the difference within the groups:

$t = \dfrac{\text{variance between groups}}{\text{variance within groups}}$

A big t-value = different groups

A small t-value = similar groups

ASSUMPTIONS:

Most t-test statistics have the form t = Z/s, where Z and s are functions of the data. Typically, Z is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger when the alternative hypothesis is true), whereas s is a scaling parameter that allows the distribution of t to be determined.

As an example, in the one-sample t-test Z = $\bar{X}/(\sigma/\sqrt{n})$ , where $\bar{X}$ is the sample mean of the data, $n$ is the sample size, and $\sigma$ is the population standard deviation of the data; s in the one-sample t-test is $\hat{\sigma}/\sigma$ , where $\hat{\sigma}$ is the sample standard deviation. The unknown $\sigma$ cancels in the ratio, so that t = Z/s = $\bar{X}/(\hat{\sigma}/\sqrt{n})$ .
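For the one-sample case, the ratio t = Z/s works out to the sample mean divided by its estimated standard error, $\hat{\sigma}/\sqrt{n}$. A small sketch, assuming a null hypothesis that the population mean is zero and using invented measurements:

```python
from math import sqrt
from statistics import mean, stdev

# Made-up measurements; the null hypothesis is that their population mean is 0.
data = [0.5, -0.2, 1.1, 0.8, 0.3]

n = len(data)
x_bar = mean(data)
sigma_hat = stdev(data)              # sample standard deviation (n - 1 denominator)
t_stat = x_bar / (sigma_hat / sqrt(n))
df = n - 1

print(round(t_stat, 3), df)
```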

The assumptions underlying a t-test are that

Z follows a standard normal distribution under the null hypothesis
p·s² follows a χ² distribution with p degrees of freedom under the null hypothesis, where p is a positive constant
Z and s are independent.

In a specific type of t-test, these conditions are consequences of the population being studied, and of the way in which the data are sampled. For example, in the t-test comparing the means of two independent samples, the following assumptions should be met:

Each of the two populations being compared should follow a normal distribution. This can be tested using a normality test, such as the Shapiro-Wilk or Kolmogorov–Smirnov test, or it can be assessed graphically using a normal quantile plot.

If using Student's original definition of the t-test, the two populations being compared should have the same variance (testable using the F test, Levene's test, Bartlett's test, or the Brown–Forsythe test; or assessable graphically using a Q-Q plot). If the sample sizes in the two groups being compared are equal, Student's original t-test is highly robust to the presence of unequal variances. Welch's t-test is insensitive to equality of the variances regardless of whether the sample sizes are similar.

The data used to carry out the test should be sampled independently from the two populations being compared. This is in general not testable from the data, but if the data are known to be dependently sampled (i.e. if they were sampled in clusters), then the classical t-tests discussed here may give misleading results.

FORMULA

Example

A researcher wishes to learn whether the pH of soil affects seed germination of a particular herb found in forests near her home. She filled 10 flower pots with acid soil (pH 5.5) and ten flower pots with neutral soil (pH 7.0) and planted 100 seeds in each pot. The mean number of seeds that germinated in each type of soil is below.

	Acid Soil pH 5.5	Neutral Soil pH 7.0
	42	43
	45	51
	40	56
	37	40
	41	32
	41	54
	48	51
	50	55
	45	50
	46	48
Mean =	43.5	48

The researcher is testing whether soil pH affects germination of the herb.

Her hypothesis is: The mean germination at pH 5.5 is different than the mean germination at pH 7.0.
A t-test can be used to test the probability that the two means do not differ. The alternative is that the means differ; one of them is greater than the other.
This is a two-tailed test because the researcher is interested in if soil acidity changes germination percentage. She does not specify if it increases or decreases germination. Notice that a 2 is entered for the number of tails below.

The t-test shows that the mean germination of the two groups does not differ significantly because p > 0.05. The researcher concludes that pH does not affect germination of the herb.
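The conclusion can be checked with a short calculation on the germination counts from the table. This sketch assumes the equal-variance (pooled) form of the two-sample t-test, since the post does not say which form was used, and compares the result with 2.101, the standard two-tailed 5% critical value for 18 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

acid = [42, 45, 40, 37, 41, 41, 48, 50, 45, 46]
neutral = [43, 51, 56, 40, 32, 54, 51, 55, 50, 48]

n1, n2 = len(acid), len(neutral)
df = n1 + n2 - 2

# Pooled variance: weighted average of the two sample variances (assumed equal-variance test).
pooled_var = ((n1 - 1) * variance(acid) + (n2 - 1) * variance(neutral)) / df
se = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(acid) - mean(neutral)) / se

critical_two_tailed = 2.101   # t table value for df = 18, alpha = 0.05, two tails
significant = abs(t_stat) > critical_two_tailed

print(round(t_stat, 3), df, significant)
```

Because the absolute t value falls below the critical value, the difference between 43.5 and 48 is not significant at the 5% level, in line with p > 0.05.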

Example

Suppose that a researcher wished to learn if a particular chemical is toxic to a certain species of beetle. She believes that the chemical might interfere with the beetle’s reproduction. She obtained beetles and divided them into two groups. She then fed one group of beetles with the chemical and used the second group as a control. After 2 weeks, she counted the number of eggs produced by each beetle in each group. The mean egg count for each group of beetles is below.

	Group 1 fed chemical	Group 2 not fed chemical (control)
	33	35
	31	42
	34	43
	38	41
	32	
	28	
Mean =	32.7	40.3

The researcher believes that the chemical interferes with beetle reproduction. She suspects that the chemical reduces egg production. Her hypothesis is: The mean number of eggs in group 1 is less than the mean number of group 2.
A t-test can be used to test the probability that the two means do not differ. The alternative is that the mean of group 1 is less than the mean of group 2.
This is a 1-tailed test because her hypothesis proposes that group 2 will have greater reproduction than group 1. If she had proposed that the two groups would have different reproduction but was not sure which group would be greater, then it would be a 2-tailed test. Notice that a 1 is entered for the number of tails below.
The results of her t-test are copied below.

The researcher concludes that the mean of group 1 is significantly less than the mean of group 2 because the value of P < 0.05. She accepts her hypothesis that the chemical reduces egg production because group 1 had significantly fewer eggs than the control.
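The same check for the beetle data, again as a sketch that assumes the equal-variance (pooled) two-sample t-test. Because the hypothesis is directional, the t value is compared with -1.860, the standard one-tailed 5% critical value for 8 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

fed_chemical = [33, 31, 34, 38, 32, 28]   # group 1
control = [35, 42, 43, 41]                # group 2

n1, n2 = len(fed_chemical), len(control)
df = n1 + n2 - 2

# Pooled variance for groups of unequal size (assumed equal-variance test).
pooled_var = ((n1 - 1) * variance(fed_chemical) + (n2 - 1) * variance(control)) / df
se = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(fed_chemical) - mean(control)) / se

critical_one_tailed = 1.860   # t table value for df = 8, alpha = 0.05, one tail
significant = t_stat < -critical_one_tailed

print(round(t_stat, 3), df, significant)
```

The t value lies well beyond the critical value in the predicted direction, which agrees with P < 0.05.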

written by : PRERNA ARORA
group no.8
Group member praveen iyer
neeraj ramadas
prakhar swami
nishant aggarwal
