Predict values from regression analysis using the regression equation

When we do regression analysis we estimate how independent variables are related to a dependent variable. In this guide we will cover how to calculate expected values, guesses, for which value on the dependent variable a specific observation should have.

This is useful both for illustrating the results of the regression analysis, or for comparison with reality.

This guide will show:

  1. What the regression equation is
  2. How to get the values we need for the equation
  3. How to calculate guesses (predicted values) with the equation
  4. What the error term (residual) is
  5. How to calculate guesses for each observation of the dataset
  6. And finally how to see that the guesses calculated by hand are the same as the ones we calculate automatically.

The regression equation

In order to make a guess we need to have a regression equation. The equation shows how to calculate an expected value for the dependent variable, based on the observation's values on the independent variables. The table Stata produces after the regression analysis shows the different parameters of the equation.

Normally the regression equation is shown in the following way (here with two independent variables):

$y_i = \alpha+ \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i$

The different part denotes the following things:

Symbol Meaning
$y_i$ Value of the dependent variable for person $i$
$\alpha$ Intercept/constant (shown in the regression table)
$\beta_{1}$ Coefficient for the first independent variable (shown in the regression table)
$X_{1_i}$ Value of the first independent variable for person $i$
$\beta_{2}$ Coefficient for the second independent variable (shown in the regression table)
$X_{2_i}$ Value of the second independent variable for person $i$
$\epsilon_i$ The error term for person $i$, that is, the difference between the actual value of the dependent variable ($y_i$) and our guess.

This is the general form. We will, for each observation, make a guess as to what the value of dependent value should be, by multiplying the values on the independent variables with their respective coefficients.

But in each analysis we do we can switch the greek letters for actual numbers. Let's say we want to analyze the incomes of people. We hypothesize that income depend on education and gender. We then run an analysis with two independent variables: the level of education attained, and whether one is a man or woman. The equation would then look like this:

$income_i = \alpha+ \beta_1education_i + \beta_2sex_i + \epsilon_i$

If we know a person's level of education and gender we can make an informed guess of their income. But then we need to know which numbers to put in the place of $\alpha$, $\beta_1$ and $\beta_2$, that is, the intercept, and the coefficients for education and gender.

When we guess we don't include an error term in the equation. It is only there to make the equation match the true income. When we make a guess our equation instead looks like this:

$guessedincome_i = \alpha+ \beta_1education_i + \beta_2sex_i$

Get the values to the equation using regression analysis

We get the numbers when we do a regression analysis. In the american General Social Survey there questions about all variables we are interested in: income ("realrinc"), level of education ("degree", a five-degree scale that runs from 0 no education to 4 master degree) and gender ("sex", where man is 1 och woman 2). Below we run a regression analysis with income as dependent variable and education and gender as independents.

In [1]:
use "/Users/anderssundell/Dropbox/Jupyter/stathelp/data/GSS2016.dta", clear
reg realrinc degree sex

      Source |       SS           df       MS      Number of obs   =     1,631
-------------+----------------------------------   F(2, 1628)      =    135.42
       Model |  2.0112e+11         2  1.0056e+11   Prob > F        =    0.0000
    Residual |  1.2089e+12     1,628   742596017   R-squared       =    0.1426
-------------+----------------------------------   Adj R-squared   =    0.1416
       Total |  1.4101e+12     1,630   865071746   Root MSE        =     27251

    realrinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      degree |   8138.599   547.8172    14.86   0.000     7064.099      9213.1
         sex |  -10649.57   1351.835    -7.88   0.000    -13301.09   -7998.052
       _cons |   24874.19   2328.309    10.68   0.000     20307.39    29440.99

Here we have everything we need to compelte our equation. The coefficient for education ($\beta_1$) is 8139, which means that if we take a step up on the education scale (for instance from "junior college" to "bachelor") our income increases with 8139 dollars per year.

The coefficient for the variable "sex" ($\beta_2$) is -10650, which means that if we increase the variable with one step (go from man (1) to woman (2)) our income will decrease with 10650 dollars per year. In other words, women earn 10650 dollars less each year, on average.

And finally we can see that the intercept ($\alpha$) is 24874. Is is the expected income for everyone who has the value 0 on all the independent variables. That also includes having the value 0 on the variable sex, which no person in the dataset has. But it doesn't matter - the equation works just as well anyway.

Now we can enter these numbers into the regression equation:

$income_i = 24874 + 8139\times education_i -10650\times sex_i + \epsilon_i$

or, if we are making a guess on someone's income:

$guessedincome_i = 24874 + 8139\times education_i -10650\times sex_i$

Calculate predicted values manually with the regression equation

Now that we know the values of the coefficients we can enter different values for the two independent variables and get different guesses for income. We thus use the equation to guess the income of various categories of people.

Case 1: Man (sex = 1) with a master's degree (education = 4):
$24874 + 8139\times 4 -10650\times1 = 46780$

Case 2: Man (sex = 1) with degree from junior college (education = 2):
$24874 + 8139\times 2 -10650\times1 = 30502$

Case 3: Man (sex = 1) with no degree (education = 0):
$24874 + 8139\times 0 -10650\times1 = 14224$

Case 4: Woman (sex = 2) with a master's degree (education = 4):
$24874 + 8139\times 4 -10650\times2 = 36130$

Case 5: Woman (sex = 2) with no degree (education = 0):
$24874 + 8139\times 0 -10650\times2 = 3574$

In all cases some things remain the same: the intercept (24874), the coefficient for education (8139) and the coefficient for sex (-10650). The things that change are the values of the different variables, producing different guesses. We thus believe that women with no degree only make 3574 dollars per year, while men with a master's degree make 46780 dollars per year!

The error term - the residual

Of course, not all men with a master's degree earn exactly 46780, and every woman with no degree does not earn exactly 3574. This is where the error term, which can also be called the residual (what is left over) enters the picture. Say we have a woman with no education, that in reality earns 7000 dollars per year. We guessed that she would earn 3574.

To make the regression equation add up we here need to enter an error term that is the difference between our guess for her (3574) and the actual value (7000). The difference is 3426. For this particular woman, the regression equation looks like this:

$7000 = 24874 + 8139\times 0 -10650\times 2 + 3426$

It is simple to add it up - the expression to the right of the equal sign is 7000. The error term is thus different for each observation. But when we calculate our guess we do not need to worry about it. It might however be interesting in other regards, for instance when we do regression diagnostics. But we will not go into that here.

Predict values for all observations, automatically

Individual observations are seldom of interest when we work with survey data. But if we are dealing with known observations, such as the countries of the world, it can sometimes be interesting to know the guesses for individual countries.

With the command predict we can easily do this in Stata. We then run a regression analysis, and immediately afterward type the command predict, followed by the name of the new variable that will save our guesses, one for each observation. Let's call the new variable "guessedincome". Below we do this, with the command quietly before the regression command. This prevents the regression table from being printed. We also run predict once more afterwards, using the option , r to also get the residuals for each observation.

In [2]:
quietly reg realrinc degree sex
predict guessedincome
predict errorterm, r

(option xb assumed; fitted values)
(8 missing values generated)

(1,236 missing values generated)

We have now created two new variables, "guessedincome" and "errorterm". Let us now check how this looks like for the first five persons in the dataset, using the command list.

In [3]:
list degree sex guessedincome realrinc errorterm in 1/5
     |   degree      sex   guesse~e      realrinc   errorterm |
  1. | bachelor     male   38640.42   164382.0376    125741.6 |
  2. | high sch     male   22363.22         25740    3376.782 |
  3. | bachelor     male   38640.42           IAP           . |
  4. | high sch   female   11713.65          5265   -6448.647 |
  5. | graduate   female   36129.45           936   -35193.45 |

The first person is a man with a bachelor's degree. He is expected to earn a bit - we guessed at 38640 dollars per year. But in reality, he earns a whopping 164382 dollars! our guess was way to low. The error term, the difference between our guess and the correct number, is therefore 125742. The real value is 125743 dollar higher than our guess.

The second person is a man with a high school diploma. We guessed that he would make 22363 dollars per year, but in reality it was 25740. Pretty close - the real value is 3374 higher.

Person 3 did not have a value on the income variable.

Person 4 is a woman with a high school diploma. We guessed that she wouldn't make that much - 11714 dollar per year. In reality, she earns even less, 5265 dollar. The error term is therefore -6449. The real value is 6449 dollars lower than our guess.

Our guesses are therefore wrong most of the time - by a little or a lot. All persons with the same characteristics (on these two independent variables) will get the same guess. If we want to make more advanced guesses we need to bring in more variables in the model.

Comparison between the values calculated by hand and automatically predicted values

Regardless of our method for arriving at these values, the result should of course be the same. We saw that men with a master's degree were expected to earn 46780 dollars per year in our manual calculation, and that a woman without a degree were expected to earn 3574. We can now compare these values with those that we got automatically. We calculate the average for our guess, for all combinations of the two variables, using the command table.

In [4]:
table degree sex, contents(mean guessedincome)
r's highest    |  respondents sex  
degree         |     male    female
lt high school | 14224.62  3575.047
   high school | 22363.22  11713.65
junior college | 30501.82  19852.25
      bachelor | 38640.42  27990.85
      graduate | 46779.02  36129.45
            NA |                   

Men with the highest level of education are expected to make 46799 dollars per year - a dollar less than what we calculated by hand. The reason it is not exactly the same is that we in our manual calculation did not include all decimals. We guess that women with the lowest level of education will earn 3575 dollars, one dollar more than in our manual calculation, again because of the decimals.


Predicted values are often interesting in order to make a more concrete connection between our model and substantial issues. It might, for example, be more pedagogial to showh how highly educated people are expected to differ from people with no school, rather than to say "the coefficient for educatinon is 8139". Sometimes it is also interesting to compare our guess with reality to say which observations that are (so to say) over-performers and under-performers.

The equation can be expanded as needed, dependning on the number of variables we have. In this example we only had two variables, education and sex. If we for instance would add age the equation would be slightly longer:

$iincome_i = \alpha+ \beta_1education_i + \beta_2sex_i + \beta_3age_i + \epsilon_i$

The aim of a regression analysis is, to summarize, to try to estimate what the best values of $\alpha$, $\beta_1$, $\beta_2$ and $\beta_3$ might be, and this is (among other things) whas is showed in the regression table after a regression analysis.

Finally, the equation can also be used to make guesses where we don't have the correct answers. For instance, it is easy to make a model that shows how results in presidential elections are affected by economic growth, the popularity of the president, and more. We then of course use historical data when we construct the model. But then we can add current data into the equation, and get an informed guess about what will happen after the next election. The procedure is just the same.