Sunday 1 September 2013

Regression

Given data on month-wise cement production in India for the last few years, we applied the techniques of correlation, central tendencies, regression analysis, etc. As the first two have been explained in earlier blogs by our group, here is the last one.


Regression analysis 

Regression analysis allows you to model, examine, and explore spatial relationships, and can help explain the factors behind observed spatial patterns. Regression analysis is also used for prediction. You may want to understand why people are persistently dying young in certain regions, for example, or may want to predict rainfall where there are no rain gauges.
When used properly, regression methods such as Ordinary Least Squares (OLS) and Geographically Weighted Regression (GWR) are powerful and reliable statistical techniques for examining and estimating linear relationships. Linear relationships are either positive or negative. The graphic below depicts both positive and negative relationships, as well as the case where there is no relationship between two variables:
[Figure: scatter plots illustrating a positive relationship, a negative relationship, and no relationship between two variables]
Correlation analyses, and their associated graphics depicted above, test the strength of the relationship between two variables. Regression analyses, on the other hand, make a stronger claim: they attempt to demonstrate the degree to which one or more variables potentially promote positive or negative change in another variable.
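The correlation/regression distinction above can be made concrete with a short sketch. The language and data here are assumptions: Python with NumPy, and made-up monthly rainfall and production figures, not the blog's actual cement dataset.

```python
import numpy as np

# Hypothetical monthly data: rainfall (mm) and cement production (index).
# These numbers are illustrative only.
rainfall = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])
production = np.array([120.0, 115.0, 108.0, 100.0, 95.0, 88.0])

# Correlation: strength and direction of the linear relationship (-1 to +1).
r = np.corrcoef(rainfall, production)[0, 1]

# Regression: how much production changes per unit change in rainfall,
# plus a baseline level (the intercept).
slope, intercept = np.polyfit(rainfall, production, 1)

print(round(r, 3))      # near -1: a strong negative relationship
print(round(slope, 3))  # negative slope: more rain, less production
```

Correlation only reports that the two variables move together; the fitted slope makes the stronger, directional claim that each additional unit of rainfall is associated with a specific change in production.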
Using Regression Analysis
Regression analysis can be used for a large variety of applications:
  • Modeling fire frequency to determine high risk areas and to understand the factors that contribute to high risk areas.
  • Modeling property loss from fire as a function of variables such as degree of fire department involvement, response time, property value, etc. If you find that response time is the key factor, you may need to build more fire stations. If you find that involvement is the key factor, you may need to increase equipment/officers dispatched.
  • Modeling traffic accidents as a function of speed, road conditions, weather, etc. in order to inform policy aimed at decreasing accidents.
There are three primary reasons we use regression analysis:
1.    To model some phenomenon in order to better understand it, and possibly to use that understanding to affect policy or to make decisions about appropriate actions to take. Basic objective: to measure the extent to which changes in one or more variables jointly affect changes in another. Example: understand the key characteristics of the habitat of some particular endangered species of bird (perhaps precipitation, food sources, vegetation, predators…) to assist in designing legislation aimed at protecting that species.
2.    To model some phenomenon in order to predict values for that phenomenon at other places or other times. Basic objective: to build a prediction model that is consistent and accurate. Example: where are real estate values likely to go up next year? Or: there are rain gauges at particular places and a set of variables that explain the observed precipitation values… how much rain falls in places where there are no gauges? (Regression may be used in cases where interpolation is not effective because of insufficient sampling: there are no gauges on peaks or in valleys, for example.)
3.    To test hypotheses. Suppose you are modeling residential crime in order to better understand it and, hopefully, to implement policy that prevents it. As you begin your analysis, you probably have questions or hypotheses you want to test:
    • "Broken Window Theory" indicates that defacement of public property (graffiti, damaged structures, etc.) invites other crimes. Will there be a positive relationship between vandalism incidents and residential burglary?
    • Is there a relationship between illegal drug use and burglary (might drug addicts steal to support their habits)?
    • Are burglars predatory? Might there be more incidents in residential neighborhoods with higher proportions of elderly or female-headed households?
    • Is a person at greater risk for burglary if they live in a rich or a poor neighborhood?
You can use regression analysis to test these relationships and answer your questions.
Regression Analysis components
It is impossible to discuss regression analysis without first becoming familiar with a few terms and basic concepts specific to regression statistics:
Regression equation: this is the mathematical formula applied to the explanatory variables in order to best predict the dependent variable you are trying to model. Unfortunately for those in the Geosciences who think of X and Y as coordinates, the notation in regression equations for the dependent variable is always "y" and for independent or explanatory variables is always "X". Each independent variable is associated with a regression coefficient describing the strength and the sign of that variable's relationship to the dependent variable. A regression equation might look like this (y is the dependent variable, the X's are the explanatory variables, and the β's are regression coefficients; each of these components of the regression equation is explained further below):

y = β0 + β1X1 + β2X2 + … + βnXn + ε
  • Dependent variable (y): this is the variable representing the process you are trying to predict or understand (e.g., residential burglary, foreclosure, rainfall). In the regression equation, it appears on the left side of the equal sign. While you can use regression to predict the dependent variable, you always start with a set of known y values and use these to build (or to calibrate) the regression model. The known y values are often referred to as observed values.
  • Independent/Explanatory variables (X): these are the variables used to model or to predict the dependent variable values. In the regression equation, they appear on the right side of the equal sign and are often referred to as explanatory variables. We say that the dependent variable is a function of the explanatory variables. If you are interested in predicting annual purchases for a proposed store, you might include in your model explanatory variables representing the number of potential customers, distance to competition, store visibility, and local spending patterns, for example.
  • Regression coefficients (β): coefficients are computed by the regression tool. They are values, one for each explanatory variable, that represent the strength and type of relationship the explanatory variable has to the dependent variable. Suppose you are modeling fire frequency as a function of solar radiation, vegetation, precipitation and aspect. You might expect a positive relationship between fire frequency and solar radiation (the more sun, the more frequent the fire incidents). When the relationship is positive, the sign for the associated coefficient is also positive. You might expect a negative relationship between fire frequency and precipitation (places with more rain have fewer fires). Coefficients for negative relationships have negative signs. When the relationship is a strong one, the coefficient is large. Weak relationships are associated with coefficients near zero.
β0 is the regression intercept. It represents the expected value for the dependent variable if all of the independent variables are zero.
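As a sketch of how the intercept β0 and the coefficients β are estimated in practice, the following fits an ordinary least squares model with NumPy's least-squares solver. The variable names, the "true" coefficients, and the simulated data are all assumptions for illustration, not values from the blog's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Two made-up explanatory variables (think: solar radiation, precipitation).
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)

# Simulate a dependent variable with known "true" parameters:
# y = 2.0 + 1.5*x1 - 0.8*x2 + noise
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.5, n)

# Design matrix: the leading column of ones produces the intercept beta0.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: find beta minimizing ||y - X @ beta||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

beta0, beta1, beta2 = beta
print(beta0, beta1, beta2)  # estimates close to 2.0, 1.5, -0.8
```

Note how the signs of the recovered coefficients match the sign of each variable's relationship to y, exactly as described above: a positive β for a positive relationship, a negative β for a negative one.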

P-Values: most regression methods perform a statistical test to compute a probability, called a p-value, for the coefficient associated with each independent variable. The null hypothesis for this statistical test states that a coefficient is not significantly different from zero (in other words, for all intents and purposes, the coefficient is zero and the associated explanatory variable is not helping your model). Small p-values reflect small probabilities, and suggest that the coefficient is, indeed, important to your model, with a value that is significantly different from zero (the coefficient is NOT zero). You would say that a coefficient with a p-value of 0.01, for example, is statistically significant at the 99% confidence level; the associated variable is an effective predictor. Variables with coefficients near zero do not help predict or model the dependent variable; they are almost always removed from the regression equation, unless there are strong theoretical reasons to keep them.
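Regression software normally computes these p-values analytically from a t-test on each coefficient. As a self-contained illustration of the underlying idea, here is a permutation test: a hedged, NumPy-only approximation (the data and the choice of test are my own, not the method any particular tool uses).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 30

# Made-up data with a genuine positive relationship.
x = rng.uniform(0, 10, n)
y = 3.0 * x + rng.normal(0, 2.0, n)

def slope(x, y):
    """OLS slope of y regressed on x."""
    return np.polyfit(x, y, 1)[0]

observed = slope(x, y)

# Null hypothesis: the true slope is zero. Shuffling y destroys any
# x-y relationship, so shuffled slopes show what "zero" looks like.
n_perm = 2000
extreme = 0
for _ in range(n_perm):
    if abs(slope(x, rng.permutation(y))) >= abs(observed):
        extreme += 1

p_value = extreme / n_perm
print(p_value)  # very small: the slope is significantly different from zero
```

A small p-value here means almost no shuffled dataset produced a slope as extreme as the observed one, which is exactly the sense in which the coefficient is "significantly different from zero".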
R2/R-Squared: Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression equation to quantify model performance. The value of R-squared ranges from 0 to 1 (0 to 100 percent). If your model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you, no doubt, have made an error… perhaps you've used a form of y to predict y). More likely, you will see R-squared values like 0.49, for example, which you can interpret by saying: this model explains 49% of the variation in the dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed y values sorted by the estimated values. Notice how much overlap there is. This graphic provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value because it reflects model complexity (the number of variables) as it relates to the data.
Residuals: these are the unexplained portion of the dependent variable, represented in the regression equation as the random error term, ε. Known values for the dependent variable are used to build and to calibrate the regression model. Using known values for the dependent variable (y) and known values for all of the explanatory variables (the Xs), the regression tool constructs an equation that will predict those known y values as well as possible. The predicted values will rarely match the observed values exactly. The differences between the observed y values and the predicted y values are called the residuals. The magnitude of the residuals from a regression equation is one measure of model fit. Large residuals indicate poor model fit.
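Residuals and R-squared are two views of the same quantity, which a short sketch can show: R-squared is one minus the residual sum of squares divided by the total sum of squares. The data below is simulated for illustration (an assumption, not the blog's dataset).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 40)
y = 5.0 + 2.0 * x + rng.normal(0, 1.5, 40)

# Fit a simple linear model and compute the predicted values.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# Residuals: observed minus predicted.
residuals = y - y_hat

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(round(r_squared, 3))  # close to 1: most variation is explained
```

Because the model includes an intercept, the residuals sum to (essentially) zero; what matters for fit is their spread, which is exactly what ss_res measures.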
Building a regression model is an iterative process that involves finding effective independent variables to explain the process you are trying to model/understand, then running the regression tool to determine which variables are effective predictors… then removing/adding variables until you find the best model possible.
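The iterative add/remove loop above can be sketched by comparing candidate models with Adjusted R-Squared, which penalizes extra variables. Everything here is illustrative: the data is simulated, and only x1 and x2 actually drive y, so the irrelevant x3 should not improve the adjusted score.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60

# Made-up candidate explanatory variables; only x1 and x2 truly matter.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
x3 = rng.uniform(0, 10, n)  # irrelevant variable
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 1.0, n)

def adjusted_r2(y, columns):
    """Adjusted R-squared of an OLS fit on the given explanatory columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    n_obs, k = X.shape  # k counts the intercept plus the predictors
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - k)

# Compare candidate models, adding one variable at a time.
print(adjusted_r2(y, [x1]))
print(adjusted_r2(y, [x1, x2]))
print(adjusted_r2(y, [x1, x2, x3]))
```

Adding x2 raises the adjusted score substantially, while adding the irrelevant x3 leaves it roughly unchanged; that is the signal for keeping the second model and stopping.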



Blogged By : Neeraj Garg (2013166)
Group No. 1 Members:
Piyush (2013197)
Pallavi Gupta (2013187)
Prerna Bansal (2013209)
Priya Jain (2013210)

