Regression Analysis
In statistics, regression analysis is a technique for estimating the relationships among variables. It includes many methods for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables, called the regression function. It is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.
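In symbols, the most common case can be sketched as follows. The notation is an assumption made here for illustration and is not fixed by the text: $Y$ is the dependent variable, $X$ the independent variables, $f$ the regression function, and $\varepsilon$ the deviation of $Y$ from $f(X)$.

```latex
f(x) = \operatorname{E}[\, Y \mid X = x \,], \qquad
Y = f(X) + \varepsilon, \qquad
\operatorname{E}[\, \varepsilon \mid X \,] = 0
```

The probability distribution of $\varepsilon$ is what describes the variation of the dependent variable around the regression function.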
Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or false relationships, so caution is advisable: for example, correlation does not imply causation.
Making Predictions Using Regression Inference
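Regression models predict a value of the dependent variable given known values of the independent variables. Prediction within the range of the data used to fit the model is known informally as interpolation; prediction outside this range is known as extrapolation.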
It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.
However, a prediction interval does not cover the full set of modelling errors that may be made, in particular the assumption of a particular form for the relation between the dependent and independent variables.
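The widening of prediction intervals under extrapolation can be sketched for the simple one-predictor case. This is a minimal illustration, not a definitive implementation: the data are hypothetical, the helper names `fit_line` and `prediction_interval` are invented here, and the critical value 2.447 is the two-sided 95% value of the t distribution with 6 degrees of freedom, taken from a table rather than computed.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope, x_bar, sxx

def prediction_interval(xs, ys, x0, t_crit):
    """Prediction interval for a new response at x0 (simple linear regression).

    t_crit is the t critical value for n - 2 degrees of freedom,
    supplied by the caller (an assumption of this sketch).
    """
    n = len(xs)
    intercept, slope, x_bar, sxx = fit_line(xs, ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # residual standard deviation
    # The (x0 - x_bar)^2 term makes the interval grow away from the data.
    half_width = t_crit * s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)
    y0 = intercept + slope * x0
    return y0 - half_width, y0 + half_width

# Hypothetical data, observed only for x between 1 and 8.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
T_CRIT = 2.447  # 95% two-sided, 6 degrees of freedom (from a t table)

for x0 in (4.5, 8, 20):
    lo, hi = prediction_interval(xs, ys, x0, T_CRIT)
    print(f"x0 = {x0:>4}: [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```

The interval is narrowest at the mean of the observed $x$ values and widens as `x0` moves away from it. Even so, it is only valid if the straight-line form itself holds outside the observed range.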
Conditions for Regression Inference
A scatterplot shows a linear relationship between a quantitative explanatory variable $x$ and a quantitative response variable $y$.
- Repeated responses $y$ are independent of each other.
- The mean response $\mu_y$ has a straight-line (i.e., "linear") relationship with $x$: $\mu_y = \alpha + \beta x$; the slope $\beta$ and intercept $\alpha$ are unknown parameters.
- The standard deviation of $y$ (call it $\sigma$) is the same for all values of $x$. The value of $\sigma$ is unknown.
- For any fixed value of $x$, the response $y$ varies according to a normal distribution.
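Taken together, these conditions describe a data-generating model: $y = \alpha + \beta x + \varepsilon$, with independent normal errors $\varepsilon$ of constant standard deviation $\sigma$. The following sketch simulates data from that model and recovers the parameters by least squares. The values $\alpha = 1$, $\beta = 2$, $\sigma = 0.5$, the sample size, and the helper names `simulate` and `fit_line` are all assumptions chosen for illustration.

```python
import math
import random

def simulate(alpha, beta, sigma, xs, rng):
    """Draw independent responses y = alpha + beta*x + Normal(0, sigma) noise."""
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]

def fit_line(xs, ys):
    """Least-squares estimates of intercept, slope, and residual standard deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # estimate of sigma
    return intercept, slope, s

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(2000)]
ys = simulate(alpha=1.0, beta=2.0, sigma=0.5, xs=xs, rng=rng)

a_hat, b_hat, s_hat = fit_line(xs, ys)
print(f"intercept ~ {a_hat:.3f}, slope ~ {b_hat:.3f}, sigma ~ {s_hat:.3f}")
```

When the conditions hold, the estimates land close to the values used to generate the data. Changing `simulate` so that the noise depends on `x`, or so that the mean is curved, is a quick way to see what a violated condition looks like.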
The importance of data distribution in linear regression inference
A good rule of thumb when using the linear regression method is to look at the scatter plot of the data. This graph is a visual example of why it is important that the data have a linear relationship. Each of these four data sets has the same linear regression line and the same correlation, 0.816. This number may at first seem like a strong correlation, but in reality the four data distributions are very different: the same predictions that might be true for the first data set would likely not be true for the second, even though the regression method would lead you to believe that they were more or less the same. Looking at panels 2, 3, and 4, you can see that a straight line is probably not the best way to represent these three data sets.
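The point can be checked numerically. The graph is not reproduced here, but four data sets sharing a correlation of 0.816 match Anscombe's quartet, so this sketch assumes those are the data shown. The helper name `summarize` is invented for illustration.

```python
import math

# Anscombe's quartet (assumed to be the four data sets in the graph).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(xs, ys):
    """Return (intercept, slope, correlation) for the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

for panel, (xs, ys) in ANSCOMBE.items():
    intercept, slope, r = summarize(xs, ys)
    print(f"panel {panel}: y = {intercept:.2f} + {slope:.2f} x, r = {r:.3f}")
```

The summary statistics agree almost exactly across the four panels, so only a plot reveals the curvature in panel 2, the outlier in panel 3, and the single influential point in panel 4.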