Introduction to regression analysis in Stata

Regression analysis is one of the most common forms of statistical analysis, and one of the most flexible. It can be done with two or more variables, and be used to investigate a range of relationships, with or without controls for alternative explanations.

In general, it is about fitting a line to a group of points. The formula for "normal regression analysis" - Ordinary Least Squares (OLS) - does this by choosing the line that lies as close as possible to the points, measured by squared vertical distance.

In this guide we will cover the simplest form of regression analysis, where we only have two variables, and the intuition behind the analysis. It can then be built upon and extended in a multitude of ways, but we will not cover that in this post.

We start by opening a data set. In this example we'll use a practice dataset that comes preinstalled in Stata, the "auto" dataset. It holds information about different cars - their gasoline consumption, the model, their weight and length, and so on. Each row in the dataset is one car.

To open this practice dataset you only have to enter into your do-file:

In [2]:
sysuse auto, clear
(1978 Automobile Data)

There are 12 variables in the dataset. We'll look at the possible relationship between the number of miles the car can run on one gallon of gasoline and the weight of the car, in pounds. A reasonable hypothesis is that heavier cars require more gas per mile, resulting in a lower average miles per gallon (mpg). We call the weight the independent variable, and miles per gallon the dependent variable. We believe that mpg is affected by the weight, but not that the weight is affected by the mpg.
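Before plotting anything, it can help to look at the two variables we are going to use. The short sketch below uses only built-in commands: "describe" shows the variable labels and storage types, and "summarize" shows the number of observations, the mean, and the range.

```stata
* Load the practice data and inspect the two variables used in this guide
sysuse auto, clear
describe mpg weight
summarize mpg weight
```

The "Min" and "Max" columns from "summarize" give a first idea of which weights and mileages actually occur in the data.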

Let's start by visualizing the relationship in what is called a scatterplot. We do that by writing, in the do-file:

In [12]:
twoway (scatter mpg weight)
[Graph: scatterplot with Mileage (mpg) on the vertical axis and Weight (lbs.) on the horizontal axis]

Each dot here represents one car, placed on the horizontal axis according to its weight and on the vertical axis according to its miles per gallon. There is some spread, but we can discern a pattern: dots further to the right in the graph tend to be placed lower. The heaviest cars can only run about 12 miles per gallon. In contrast, the car that can run the farthest on one gallon (and that is accordingly placed at the top of the graph) is comparatively light, weighing about 2000 lbs.
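If you want to check what you read off the graph against the actual rows, one way is to sort the data and list the extremes. This is a small sketch; showing three cars is an arbitrary choice.

```stata
sysuse auto, clear
* The three heaviest cars and their mileage
gsort -weight
list make weight mpg in 1/3
* The car with the highest mileage and its weight
gsort -mpg
list make weight mpg in 1
```

Note that "gsort" reorders the rows of the dataset. That does not matter for the graphs or the regression in this guide.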

We can visualize this relationship - that heavier cars go fewer miles per gallon - by fitting a line to the points. We do not draw it by hand; instead we let Stata fit the line as well as possible. Specifically, Stata tries to minimize the vertical distance between the line and all the points. Even more specifically, Stata minimizes the total squared vertical distance, which is where the name "Ordinary Least Squares" comes from. On this page you can find good visualizations that explain how the formula works. But the gist of it is that we want the line to be as close as possible to the points.
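For the curious: with only two variables, the least-squares line has a simple closed form. The slope is the covariance between the two variables divided by the variance of the independent variable, and the line passes through the point of means. The sketch below computes both by hand. The scalar names (b_slope, b_cons, xbar, ybar) are made up for this illustration.

```stata
sysuse auto, clear
* Covariance of mpg and weight, and the variance of weight
quietly correlate mpg weight, covariance
scalar b_slope = r(cov_12) / r(Var_2)
* The least-squares line passes through the point of means
quietly summarize mpg
scalar ybar = r(mean)
quietly summarize weight
scalar xbar = r(mean)
scalar b_cons = ybar - b_slope * xbar
display "slope     = " b_slope
display "intercept = " b_cons
```

These two numbers should match what Stata's regression command reports for the same pair of variables.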

To actually draw the line in our graph we add another layer. In the first layer we show the dots. In the second we draw our line, with the command "lfit". Each layer has its own set of parentheses in the graph command. This is how it looks:

In [3]:
twoway (scatter mpg weight) (lfit mpg weight)
[Graph: the same scatterplot of Mileage (mpg) against Weight (lbs.), with the line of fitted values added]

This line is the best possible one in the least-squares sense. If we were to change the angle or the vertical position of the line, the total squared distance between the line and all the points would increase. This is the so-called regression line.
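We can check this claim ourselves. The sketch below adds up the squared vertical distances for the fitted line, and then for a hypothetical line whose slope we have made slightly steeper. The shift of 0.001 is an arbitrary choice for illustration, and the variable and scalar names are made up.

```stata
sysuse auto, clear
quietly regress mpg weight
* Squared vertical distance from each car to the fitted line
generate double sq_fit = (mpg - (_b[_cons] + _b[weight] * weight))^2
* The same for a hypothetical line with a slightly steeper slope
generate double sq_alt = (mpg - (_b[_cons] + (_b[weight] - 0.001) * weight))^2
quietly summarize sq_fit
scalar ssr_fit = r(sum)
quietly summarize sq_alt
scalar ssr_alt = r(sum)
display "Total squared distance, fitted line:  " ssr_fit
display "Total squared distance, altered line: " ssr_alt
```

The total for the altered line is larger. The same happens if you make the slope flatter instead, or move the line up or down.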

Since it slopes down to the right it is a negative relationship. More of one variable (weight) is associated with less of the other (miles per gallon). More miles per gallon is associated with less weight.

Regression analysis puts numbers on this line. In a regression analysis we primarily obtain the slope coefficient, that is, how much the line is tilted up or down. We also get the intercept or the constant, which shows where the line ends up on the vertical axis when we are at zero on the horizontal axis. Expressed differently: the expected value of the dependent variable when all the independent variables in the model are zero.

To do a regression analysis on these two variables we write:

In [16]:
reg mpg weight
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =    134.62
       Model |   1591.9902         1   1591.9902   Prob > F        =    0.0000
    Residual |  851.469256        72  11.8259619   R-squared       =    0.6515
-------------+----------------------------------   Adj R-squared   =    0.6467
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4389

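------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0060087   .0005179   -11.60   0.000    -.0070411   -.0049763
       _cons |   39.44028   1.614003    24.44   0.000     36.22283    42.65774
------------------------------------------------------------------------------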

The principle is that you write the command for regression, "reg" (short for "regress"), followed by the dependent variable, "mpg", and then a list of all the independent variables (in this case, just "weight").

The table may look a bit confusing, but not all numbers are equally important. The most important ones are towards the bottom, in the row "weight". In the column "Coef." we see the most important number, the slope coefficient. It shows the tilt of the line, expressed as how much the dependent variable changes on average when the independent variable increases by one unit.

In this case it is -.006. It means that a car that weighs one pound more can, on average, travel .006 miles less on each gallon of gas. Equivalently, a car that weighs 1000 pounds more is expected to travel about 6 miles less per gallon. This is also evident from the graph with the line above.
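After a regression, Stata keeps the coefficients in memory as _b[weight] and _b[_cons], so we can do this arithmetic directly. The sketch below also reruns the model with weight measured in thousands of pounds. The 3,000-pound car and the variable name weight_1000 are made up for this illustration.

```stata
sysuse auto, clear
quietly regress mpg weight
* Expected difference in mpg for a 1,000-pound difference in weight
display _b[weight] * 1000
* Predicted mpg for a hypothetical 3,000-pound car: intercept + slope * weight
display _b[_cons] + _b[weight] * 3000
* Same regression, with weight measured in thousands of pounds
generate weight_1000 = weight / 1000
regress mpg weight_1000
```

The first number is about -6, and the predicted mileage for the 3,000-pound car is about 21.4 mpg. In the rescaled regression the slope is also about -6: changing the unit of the independent variable changes the size of the coefficient, but not the line itself.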

In the row "_cons" we see the so-called intercept or constant. It is the expected value of the dependent variable when all the independent variables are zero. That is, how many miles per gallon a car that weighs 0 pounds can drive: 39.44. This is obviously not how it works in the real world; no cars weigh zero pounds. But that is not a problem for the analysis. It is just the value we get when we extend the regression line that far to the left, far beyond the possible weights of a car. We therefore don't need to interpret this intercept in any way.

Finally, we can look at the number to the right of "R-squared", in the top right part of the output. This value is 0.6515. R-squared is a measure that ranges from 0 to 1, and shows how well our line fits the dots. The more spread out the dots are around the line, the lower the R-squared value. The closer they are to the line, the higher the R-squared value.

You can also interpret this value as the share of the variance in the dependent variable that is explained by the independent variable(s). In this case, 65.15% of the variation in mpg can be explained by the variable weight. Weight is thus an important explanation for why different cars can go different numbers of miles per gallon, even if other factors also matter (as is evident from the fact that R-squared is not one).
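The R-squared value can be rebuilt from the top left part of the output: it is the "Model" sum of squares divided by the "Total" sum of squares (1591.9902 / 2443.45946). Stata stores these numbers after the regression, so a minimal check could look like the sketch below. With only one independent variable, R-squared also equals the squared correlation between the two variables.

```stata
sysuse auto, clear
quietly regress mpg weight
* Explained (model) sum of squares divided by the total sum of squares
display e(mss) / (e(mss) + e(rss))
* The value Stata reports as R-squared
display e(r2)
* With a single independent variable this equals the squared correlation
quietly correlate mpg weight
display r(rho)^2
```

All three lines should show the same value, about 0.65.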


Regression analysis is not harder than that, in its simplest form. It is about fitting a line to a swarm of dots, and we get information about this line: its slope, where it starts, and how well it fits the dots.

In practical terms, we use this analysis to investigate a relationship between two variables, and to find out whether the relationship is weak or strong. In this example, we saw that there was a negative relationship: heavier cars get fewer miles per gallon. Cars that get more miles per gallon weigh less.

We can also use regression analysis to control for other variables, ruling out other explanations. For instance, one could imagine that some car manufacturers for some reason make their cars heavier, and that they, for some other reason, also consume more gas per mile. If that is the case, maybe there is no causal effect of weight on mpg, but rather a brand effect (even though this sounds unlikely). By adding control variables we can investigate whether this is the case or not. We can also look at p-values and confidence intervals to find out how generalizable these results are to a wider population. But that is for other posts.
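As a small preview, adding a control variable only means listing one more variable after the dependent variable. The sketch below uses "foreign", a variable in the auto dataset that indicates whether the car is foreign or domestic. Treating it as a stand-in for a brand effect is an assumption made purely for illustration; it is not a measure of brand.

```stata
sysuse auto, clear
* The two-variable model from this guide
regress mpg weight
* The same model with one control variable added
* (assumption: foreign is used here only as a rough stand-in for a brand-type explanation)
regress mpg weight foreign
```

Comparing the coefficient for weight in the two tables shows whether the relationship changes once the control is included.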