Given the data of MONTH WISE CEMENT PRODUCTION IN INDIA for the last few years, we used the techniques of CORRELATION, CENTRAL TENDENCIES, and REGRESSION ANALYSIS. As the first two have been explained in earlier blogs by our group, here is the last one.
Regression analysis
Regression analysis allows you to model, examine, and explore
spatial relationships, and can help explain the factors behind observed spatial
patterns. Regression analysis is also used for prediction. You may want to
understand why people are persistently dying young in certain regions, for
example, or may want to predict rainfall where there are no rain gauges.
When used properly, regression methods such as OLS and GWR are powerful and reliable statistics
for examining and estimating linear relationships. A linear relationship is either positive (both variables increase or decrease together) or negative (one variable increases as the other decreases); it is also possible for two variables to have no relationship at all.
Correlation analyses, and the scatter plots typically associated with them,
test the strength of the relationship between two variables. Regression
analyses, on the other hand, make a stronger claim: they attempt to demonstrate
the degree to which one or more variables potentially promote positive or
negative change in another variable.
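To make the distinction concrete, here is a minimal Python sketch. The monthly production figures and the "spending" explanatory variable below are invented for illustration; it first measures correlation, then fits a regression line that can also be used for prediction:

```python
import numpy as np

# Hypothetical data: monthly cement production (million tonnes) and
# an invented explanatory variable, e.g. construction spending.
spending = np.array([2.1, 2.4, 2.9, 3.0, 3.6, 3.8, 4.1, 4.5])
production = np.array([14.2, 15.1, 16.0, 15.8, 17.3, 17.9, 18.4, 19.2])

# Correlation: only the strength and sign of the linear association.
r = np.corrcoef(spending, production)[0, 1]
print(f"Pearson correlation r = {r:.3f}")

# Regression: a fitted line production = b0 + b1 * spending, which can
# also predict production for new spending values.
b1, b0 = np.polyfit(spending, production, 1)
print(f"Fitted line: production = {b0:.2f} + {b1:.2f} * spending")
print(f"Predicted production at spending = 5.0: {b0 + b1 * 5.0:.2f}")
```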
Using Regression Analysis
Regression analysis can be used for a large variety of
applications:
- Modeling fire frequency to
determine high-risk areas and to understand the factors that contribute to
them.
- Modeling property loss from fire as a
function of variables such as degree of fire department involvement,
response time, property value, etc. If you find that response time is the
key factor, you may need to build more fire stations. If you find that
involvement is the key factor, you may need to increase equipment/officers
dispatched.
- Modeling traffic accidents as a
function of speed, road conditions, weather, etc. in order to inform
policy aimed at decreasing accidents.
There are three primary reasons we use regression
analysis:
1. To model some phenomenon in order to better understand it, and possibly to use that understanding to affect policy or to make decisions about appropriate actions to take. Basic objective: to measure the extent to which changes in one or more variables jointly affect changes in another. Example: understand the key characteristics of the habitat of some particular endangered bird species (perhaps precipitation, food sources, vegetation, predators…) to assist in designing legislation aimed at protecting that species.
2. To model some phenomenon in order to predict values for that phenomenon at other places or other times. Basic objective: to build a prediction model that is consistent and accurate. Example: where are real estate values likely to go up next year? Or: there are rain gauges at particular places and a set of variables that explain the observed precipitation values… how much rain falls in places where there are no gauges? (Regression may be used in cases where interpolation is not effective because of insufficient sampling: there are no gauges on peaks or in valleys, for example; see the sketch after this list.)
3. To test hypotheses. Suppose you are modeling residential crime in order to better understand it, and hopefully to implement policy to prevent it. As you begin your analysis, you probably have questions or hypotheses you want to test:
- "Broken
Window Theory" indicates that defacement of public property
(graffiti, damaged structures, etc.) invite other crimes. Will there be a
positive relationship between vandalism incidents and residential burglary?
- Is there a
relationship between illegal drug use and burglary (might drug addicts
steal to support their habits)?
- Are burglars predatory? Might there be more incidents in residential neighborhoods with higher proportions of elderly or female-headed households?
- Is a person at greater risk of burglary if they live in a rich or a poor neighborhood?
You can use regression analysis
to test these relationships and answer your questions.
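As a concrete illustration of the prediction use case (reason 2 above, the rain-gauge example), here is a minimal least-squares sketch in Python; the locations, explanatory variables, and rainfall values are all invented:

```python
import numpy as np

# Invented training data: for locations WITH rain gauges we know the
# explanatory variables (elevation in m, distance to coast in km) and
# the observed annual rainfall (the dependent variable, in mm).
X_gauged = np.array([
    [120.0,  5.0],
    [300.0, 12.0],
    [ 80.0,  3.0],
    [450.0, 20.0],
    [210.0,  9.0],
])
rainfall = np.array([1100.0, 950.0, 1200.0, 800.0, 1000.0])

# Fit rainfall = b0 + b1*elevation + b2*distance by least squares.
A = np.column_stack([np.ones(len(X_gauged)), X_gauged])
coef, *_ = np.linalg.lstsq(A, rainfall, rcond=None)

# Predict rainfall at a location with no gauge.
x_new = np.array([1.0, 260.0, 10.0])  # intercept term, elevation, distance
print(f"Predicted rainfall: {x_new @ coef:.0f} mm")
```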
Regression Analysis Components
It is impossible to discuss regression analysis without first
becoming familiar with a few terms and basic concepts specific to regression
statistics:
Regression equation: this is the
mathematical formula applied to the explanatory variables in order to best
predict the dependent variable you are trying to model. Unfortunately for those
in the geosciences who think of X and Y as coordinates, the notation in
regression equations for the dependent variable is always "y" and for
independent or explanatory variables is always "X". Each independent
variable is associated with a regression coefficient describing the strength
and the sign of that variable's relationship to the dependent variable. A
regression equation might look like this (y is the dependent variable, the Xs
are the explanatory variables, and the βs are regression coefficients; each of
these components of the regression equation is explained further below):

y = β0 + β1X1 + β2X2 + … + βnXn + ε
- Dependent variable (y): this is the
variable representing the process you are trying to predict or understand
(e.g., residential burglary, foreclosure, rainfall). In the regression
equation, it appears on the left side of the equal sign. While you can use
regression to predict the dependent variable, you always start with a set
of known y values and use these to
build (or to calibrate) the regression model. The known y values are often referred
to as observed values.
- Independent/Explanatory
variables (X): these
are the variables used to model or to predict the dependent variable
values. In the regression equation, they appear on the right side of the
equal sign and are often referred to as explanatory variables. We
say that the dependent variable is a function of the explanatory
variables. If you are interested in predicting annual purchases for a
proposed store, you might include in your model explanatory variables
representing the number of potential customers, distance to competition,
store visibility, and local spending patterns, for example.
- Regression
coefficients (β):
coefficients are computed by the regression tool. They are values, one for
each explanatory variable, that represent the strength and type of
relationship the explanatory variable has to the dependent variable.
Suppose you are modeling fire frequency as a function of solar radiation,
vegetation, precipitation and aspect. You might expect a positive
relationship between fire frequency and solar radiation (the more sun, the
more frequent the fire incidents). When the relationship is positive, the
sign for the associated coefficient is also positive. You might expect a
negative relationship between fire frequency and precipitation (places
with more rain have fewer fires). Coefficients for negative relationships
have negative signs. When the relationship is a strong one, the
coefficient is large. Weak relationships are associated with coefficients
near zero.
β0 is the regression intercept. It represents the expected value for the
dependent variable if all of the independent variables are zero.
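A short sketch of how the coefficients and the intercept of the fire-frequency example above might be read off a fitted model, using the statsmodels library; all numbers are invented for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Invented data: fire incidents/year vs. solar radiation and precipitation.
solar  = np.array([5.1, 6.3, 4.8, 7.0, 5.9, 6.5])               # kWh/m^2/day
precip = np.array([900.0, 600.0, 1100.0, 400.0, 750.0, 500.0])  # mm/year
fires  = np.array([3.0, 7.0, 2.0, 10.0, 5.0, 8.0])

# Add a column of ones so the model estimates the intercept b0.
X = sm.add_constant(np.column_stack([solar, precip]))
fit = sm.OLS(fires, X).fit()

# fit.params holds [b0, b_solar, b_precip]; expect b_solar > 0 and
# b_precip < 0, matching the signs discussed above.
b0, b_solar, b_precip = fit.params
print(f"intercept b0 = {b0:.2f}")
print(f"solar coefficient = {b_solar:.3f} (positive relationship)")
print(f"precip coefficient = {b_precip:.4f} (negative relationship)")
```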
P-Values: most regression
methods perform a statistical test to compute a probability, called a p-value,
for the coefficients associated with each independent variable. The null
hypothesis for this statistical test states that a coefficient is not
significantly different from zero (in other words, for all intents and
purposes, the coefficient is zero and the associated explanatory variable is
not helping your model). Small p-values reflect small probabilities, and
suggest that the coefficient is, indeed, important to your model with a value
that is significantly different from zero (the coefficient is NOT zero). You
would say that a coefficient with a p-value of 0.01, for example, is
statistically significant at the 99% confidence level; the associated variable
is an effective predictor. Variables with coefficients near zero do not help
predict or model the dependent variable; they are almost always removed from
the regression equation, unless there are strong theoretical reasons to keep
them.
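A minimal sketch of how p-values separate useful predictors from noise, again using statsmodels, here on synthetic data where x1 truly drives y and x2 is unrelated noise:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)                # genuinely related to y
x2 = rng.normal(size=n)                # pure noise, unrelated to y
y = 2.0 + 1.5 * x1 + rng.normal(scale=0.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Expect a tiny p-value for x1 (a real predictor) and a large one for x2.
for name, p in zip(["const", "x1", "x2"], model.pvalues):
    print(f"{name}: p = {p:.4f}")
```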
R2/R-Squared: Multiple R-Squared and Adjusted R-Squared are both statistics derived from the regression equation to quantify model performance. The value of R-squared ranges from 0 to 1 (that is, from 0 to 100 percent). If your model fits the observed dependent variable values perfectly, R-squared is 1.0 (and you, no doubt, have made an error… perhaps you've used a form of y to predict y). More likely, you will see R-squared values like 0.49, which you can interpret by saying: this model explains 49% of the variation in the dependent variable. To understand what the R-squared value is getting at, create a bar graph showing both the estimated and observed y values, sorted by the estimated values, and notice how much overlap there is. Such a graph provides a visual representation of how well the model's predicted values explain the variation in the observed dependent variable values. The Adjusted R-Squared value is always a bit lower than the Multiple R-Squared value because it penalizes model complexity (the number of variables) relative to the amount of data.
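Both statistics can be computed directly from the observed and predicted values; a minimal sketch with invented numbers (k is the assumed number of explanatory variables):

```python
import numpy as np

# Invented observed values and the predictions of some fitted model.
y_obs  = np.array([10.0, 12.5, 9.0, 15.0, 13.0, 11.5, 14.0, 10.5])
y_pred = np.array([10.4, 12.0, 9.6, 14.2, 13.5, 11.1, 13.6, 10.9])
n, k = len(y_obs), 2            # k explanatory variables (assumed)

ss_res = np.sum((y_obs - y_pred) ** 2)        # unexplained variation
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)  # total variation in y
r2 = 1.0 - ss_res / ss_tot
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(f"R-squared = {r2:.3f} -> explains {100 * r2:.0f}% of the variation")
print(f"Adjusted R-squared = {adj_r2:.3f} (penalized for {k} variables)")
```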
Residuals: these are the unexplained portion of the dependent variable, represented in the regression equation as the random error term, ε. Known values for the dependent variable are used to build and calibrate the regression model. Using known values for the dependent variable (y) and known values for all of the explanatory variables (the Xs), the regression tool constructs an equation that predicts those known y values as well as possible. The predicted values will rarely match the observed values exactly. The differences between the observed y values and the predicted y values are called the residuals. The magnitude of the residuals from a regression equation is one measure of model fit. Large residuals indicate poor model fit.
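A minimal sketch of computing residuals from a fitted line, on invented data:

```python
import numpy as np

# Fit a line to invented data, then inspect the residuals.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.7])

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted        # the ε term for each observation

print("residuals:", np.round(residuals, 3))
# Large |residuals| relative to the scale of y indicate poor model fit.
print(f"residual std = {residuals.std(ddof=2):.3f}")
```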
Building a regression model is an iterative process that involves
finding effective independent variables to explain the process you are trying
to model/understand, then running the regression tool to determine which
variables are effective predictors… then removing/adding variables until you
find the best model possible.
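One common way to carry out this iteration (though by no means the only one) is backward elimination: fit the model with all candidate variables, drop the least significant one, and repeat until everything that remains is significant. A minimal sketch on synthetic data, using the hypothetical variable names x1–x4:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))      # 4 candidate explanatory variables
# Only x1 and x3 actually drive y; x2 and x4 are noise.
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.8, size=n)

names = ["x1", "x2", "x3", "x4"]
keep = list(range(4))
while True:
    fit = sm.OLS(y, sm.add_constant(X[:, keep])).fit()
    pvals = fit.pvalues[1:]      # skip the intercept's p-value
    worst = int(np.argmax(pvals))
    if pvals[worst] < 0.05:      # all remaining variables are significant
        break
    print(f"dropping {names[keep[worst]]} (p = {pvals[worst]:.3f})")
    keep.pop(worst)

print("kept variables:", [names[i] for i in keep])
```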
Blogged By : Neeraj Garg (2013166)
Group No. 1 Members:
Piyush (2013197)
Pallavi Gupta (2013187)
Prerna Bansal (2013209)
Priya Jain (2013210)