Predictions and Probabilistic Models

Regression models are often used to predict a response variable $y$ from an explanatory variable $x$.

Learning Objective

Explain how to estimate the relationship among variables using regression analysis

Key Points

Regression models predict a value of the $Y$ variable, given known values of the $X$ variables. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation.
Prediction outside this range of the data is known as extrapolation. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.
There are certain necessary conditions for regression inference: observations must be independent, the mean response has a straight-line relationship with $x$, the standard deviation of $y$ is the same for all values of $x$, and the response $y$ varies according to a normal distribution.

Terms

interpolation
the process of estimating the value of a function at a point from its values at nearby points
extrapolation
a calculation of an estimate of the value of some function outside the range of known values

Full Text

Regression Analysis

In statistics, regression analysis is a technique for estimating the relationships among variables. It includes many methods for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables, called the regression function. It is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression analysis is widely used for prediction and forecasting. It is also used to understand which of the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to illusory or false relationships, so caution is advisable; for example, correlation does not imply causation.

Making Predictions Using Regression Inference

Regression models predict a value of the $Y$ variable, given known values of the $X$ variables. Prediction within the range of values in the data set used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation. Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.
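As a minimal sketch of this distinction, the following Python snippet fits a least-squares line to a small data set and labels a prediction as interpolation or extrapolation, depending on whether the new $x$ lies inside the observed range. The data values and the function names `fit_line` and `predict` are illustrative assumptions, not part of the original text.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(xs, ys, x_new):
    """Predict y at x_new and report whether x_new is inside the fitted range."""
    intercept, slope = fit_line(xs, ys)
    kind = "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"
    return intercept + slope * x_new, kind


# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

inside = predict(xs, ys, 3.5)    # x within the observed range [1, 5]
outside = predict(xs, ys, 12.0)  # x far beyond the observed range
print(inside)
print(outside)
```

The arithmetic is identical in both cases; only the trust one can place in the result differs, because nothing in the data checks the fitted line beyond $x = 5$.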

It is generally advised that when performing extrapolation, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data.
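The widening of the interval can be sketched for simple linear regression with one explanatory variable. The snippet below uses the textbook formula $\hat{y} \pm t^{*} s \sqrt{1 + 1/n + (x_0 - \bar{x})^2 / S_{xx}}$, which assumes the usual conditions for regression inference hold. The data are made up, the function name `prediction_interval` is hypothetical, and the critical value $t^{*}$ is passed in by the caller (here 3.182, the 97.5th percentile of the $t$ distribution with 3 degrees of freedom from a standard table).

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, upper) prediction interval for a new response at x_new.

    t_crit is the t critical value with n - 2 degrees of freedom,
    supplied by the caller (for example, from a t table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # Standard error of a single new observation at x_new.
    se_pred = s * math.sqrt(1 + 1 / n + (x_new - x_bar) ** 2 / sxx)
    y_hat = intercept + slope * x_new
    return y_hat - t_crit * se_pred, y_hat + t_crit * se_pred


# Hypothetical data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 3.182  # 97.5th percentile of t with n - 2 = 3 degrees of freedom

widths = {}
for x_new in (3.0, 5.0, 12.0):  # mean of x, edge of the data, far outside
    lower, upper = prediction_interval(xs, ys, x_new, T_CRIT)
    widths[x_new] = upper - lower
    print(x_new, round(lower, 2), round(upper, 2))
```

The interval is narrowest at the mean of the observed $x$ values and grows with the squared distance $(x_0 - \bar{x})^2$, so the width at $x = 12$ is much larger than anywhere inside the data.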

However, a prediction interval does not cover the full set of modelling errors that may have been made, in particular the assumption of a particular form for the relation between $Y$ and $X$. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available. This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship. Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model. If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be used in selecting the model, even if the observed data set has no values particularly near such bounds. The choice of an appropriate functional form for the regression can have large implications when extrapolation is considered. At a minimum, it can ensure that any extrapolation arising from a fitted model is "realistic" (or in accord with what is known).
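To illustrate how a known bound can inform the functional form, the sketch below compares two fits to a response that is a proportion and so must lie between 0 and 1. One fit is a straight line in $x$; the other fits a straight line to the logit of the response and transforms back, which is only one of several ways to respect such a bound. The data are made up, and the names `fit_line`, `logit`, and `inv_logit` are illustrative assumptions.

```python
import math


def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def logit(p):
    """Map a proportion in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Map a real number back to a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-z))


# Hypothetical proportions, made up for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ps = [0.10, 0.18, 0.30, 0.45, 0.60, 0.74]

a_lin, b_lin = fit_line(xs, ps)                      # straight line in p
a_log, b_log = fit_line(xs, [logit(p) for p in ps])  # straight line in logit(p)

x_far = 10.0  # well outside the observed range [0, 5]
linear_pred = a_lin + b_lin * x_far
bounded_pred = inv_logit(a_log + b_log * x_far)
print(linear_pred, bounded_pred)
```

Within the observed range the two fits give similar values, but at $x = 10$ the straight line predicts a proportion above 1, which is impossible, while the logit-based fit stays inside the known bounds.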

Conditions for Regression Inference

Suppose a scatterplot shows a linear relationship between a quantitative explanatory variable $x$ and a quantitative response variable $y$, and that we have $n$ observations on $x$ and $y$. Our goal is to study or predict the behavior of $y$ for given values of $x$. The regression model requires the following conditions:

Repeated responses $y$ are independent of each other.
The mean response $\mu_y$ has a straight-line (i.e., "linear") relationship with $x$: $\mu_y = \alpha + \beta x$; the slope $\beta$ and intercept $\alpha$ are unknown parameters.
The standard deviation of $y$ (call it $\sigma$) is the same for all values of $x$. The value of $\sigma$ is unknown.
For any fixed value of $x$, the response $y$ varies according to a normal distribution.
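The four conditions above can be read as a recipe for generating data. The sketch below simulates independent responses from $y = \alpha + \beta x + \varepsilon$ with normally distributed $\varepsilon$ of constant standard deviation $\sigma$, then recovers estimates of the three unknown parameters by least squares. The parameter values, the sample size, and the function names `simulate` and `fit_and_sigma` are arbitrary choices made for illustration.

```python
import math
import random


def simulate(n, alpha, beta, sigma, rng):
    """Draw n independent (x, y) pairs satisfying the regression conditions."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    # Mean response is linear in x; errors are normal with the same sigma for every x.
    ys = [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_and_sigma(xs, ys):
    """Return least-squares (intercept, slope) and the residual standard error."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # estimate of sigma
    return intercept, slope, s


rng = random.Random(0)  # fixed seed so the run is repeatable
xs, ys = simulate(2000, alpha=1.0, beta=2.0, sigma=1.0, rng=rng)
a_hat, b_hat, s_hat = fit_and_sigma(xs, ys)
print(a_hat, b_hat, s_hat)
```

With data generated this way, the estimates should land close to the values used in the simulation. With real data the conditions are not guaranteed, which is why they need to be checked, for example by plotting the residuals.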

The importance of data distribution in linear regression inference

A good rule of thumb when using linear regression is to look at the scatter plot of the data first. The four data sets in the figure illustrate why it matters that the data actually have a linear relationship. All four have the same fitted regression line and the same correlation, 0.816. This number may at first seem like a strong correlation, but in reality the four data distributions are very different: predictions that might be reasonable for the first data set would likely not be reasonable for the second, even though the regression output would lead you to believe that the data sets were more or less the same. Looking at panels 2, 3, and 4, you can see that a straight line is probably not the best way to represent these three data sets.
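The description matches the well-known Anscombe's quartet. Assuming that is the figure meant here, the sketch below recomputes the summary statistics from the published quartet values to show that all four data sets give nearly the same correlation and fitted line. The helper name `summarize` is hypothetical.

```python
import math


def summarize(xs, ys):
    """Return (intercept, slope, correlation) for one data set."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


# Anscombe's quartet (assumed to be the four data sets in the figure).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]

results = [summarize(xs, ys) for xs, ys in quartet]
for panel, (a, b, r) in enumerate(results, start=1):
    print(panel, round(a, 2), round(b, 2), round(r, 3))
```

The summary numbers alone cannot tell the four panels apart; only a plot of the points reveals the curvature and the outlying points that make the straight-line model questionable.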
