Regression analysis with dummy variables

By Anders Sundell

Regression analysis requires that the independent variables are interval scales, that is, that the distances between the steps of the scale are equally long. Since the analysis basically consists of drawing a straight line over the different values of the independent variable, the interpretation will be very strange if there is no consistency over the steps. Even more so if there is no ordering of the values.

Variables that (untransformed) are not suitable as independent variables are for instance party affiliation, place of birth or personality type. These are categories, with no clear ordering.

But there is a solution: dummy variables. A variable that only has two values has equal distance between all the steps of the scale (since there is only one distance), and it can therefore be used in regression analysis. By splitting a categorical variable into several such dummy variables, where each dummy represents one value of the original variable, we can include the categorical variable in a regression analysis.

In practice this is like comparing mean values across the groups. The advantage of doing it in a regression analysis is that we can also control for other variables. In this guide we will walk through what dummy variables are, how to create them in Stata, and how to include them in a regression analysis.

What dummy variables are

Dummy variables are variables that divide a categorical variable into all its values, minus one. One value is always left out in a regression analysis, as a reference category. B-coefficients for the new variables will then show the expected differences in relation to the reference category. If we start with a variable that has five values, we will have to create four dummy variables. If we start with a variable with three values, we will have to create two dummy variables.

The dummy variables have the value 0 or 1: If the unit of analysis has the value in question it gets a 1, otherwise a 0.

Let's say that we want to investigate the possible effect of a country's electoral system on the number of parties represented in parliament. Classical theory in political science says that countries with majoritarian electoral systems (where the party that gets the most votes in a district takes all the seats from that district for instance) will have fewer parties, whereas countries with more proportional electoral systems (where seats are distributed proportionally to the share of votes) will have more.

In the QoG dataset there is a variable for electoral system, gol_est. It has three values: Majoritarian, Proportional and Mixed. You could perhaps argue that the mixed category is a sort of middle value between majoritarian and proportional, but we cannot really be certain. A safer strategy is to treat the variable as having three distinct categories. We then need to create two dummy variables to be able to analyze the variable properly. They are created according to the following principle:

Original value   dum_proportional   dum_mixed
Majoritarian     0                  0
Proportional     1                  0
Mixed            0                  1

The majoritarian category is the reference here, and gets a zero on both of the new variables. Proportional systems get a 1 on the variable dum_proportional and mixed systems get a 1 on dum_mixed. The coefficients for dum_proportional and dum_mixed will thus show the difference between each of these two categories and the majoritarian category. If we want a coefficient that shows the difference between proportional and mixed, one of those two has to be the reference category instead.
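
Written as an equation, the coding above corresponds to the model sketched below, with the number of effective parties as the outcome. The symbols $\beta_0$, $\beta_1$, $\beta_2$ and $\varepsilon_i$ are notation introduced for this illustration and do not come from the original text.

$$
\begin{aligned}
\text{parties}_i &= \beta_0 + \beta_1\,\text{dum\_proportional}_i + \beta_2\,\text{dum\_mixed}_i + \varepsilon_i \\
E[\text{parties} \mid \text{Majoritarian}] &= \beta_0 \\
E[\text{parties} \mid \text{Proportional}] &= \beta_0 + \beta_1 \\
E[\text{parties} \mid \text{Mixed}] &= \beta_0 + \beta_2
\end{aligned}
$$

The difference between proportional and mixed systems is $\beta_1 - \beta_2$, which is why it does not show up as a coefficient of its own when majoritarian is the reference.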

Create dummy variables

We will now create the dummy variables. We use the QoG basic dataset, which can be loaded directly in Stata from the address below. The independent variable is called gol_est and has three values. Below, we load the dataset and make a frequency table.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
tab gol_est
(Quality of Government Basic dataset 2018 - Cross-Section)


   Electoral |
      System |
      Type-3 |
     classes |      Freq.     Percent        Cum.
-------------+-----------------------------------
Majoritarian |         49       37.98       37.98
Proportional |         59       45.74       83.72
       Mixed |         21       16.28      100.00
-------------+-----------------------------------
       Total |        129      100.00

Now it is time to construct the variables. There are multiple ways (as is often the case). A somewhat more cumbersome way is to create the variables yourself with generate and then give them the right values with replace. A smarter way is to use tab with the option generate(), which creates the dummy variables for us.
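
The more cumbersome route with generate and replace could look roughly like the sketch below. To keep it runnable on its own it uses a small made-up variable, system, which is assumed to be coded 1 = Majoritarian, 2 = Proportional and 3 = Mixed, with one missing value. The variable name and the data are hypothetical and are not part of the QoG dataset.

```stata
* Toy data (hypothetical): 1 = Majoritarian, 2 = Proportional, 3 = Mixed
clear
input system
1
1
2
2
2
3
.
end

* Start each dummy at 0 for all non-missing cases, then switch on the category
generate dum_proportional = 0 if !missing(system)
replace dum_proportional = 1 if system == 2

generate dum_mixed = 0 if !missing(system)
replace dum_mixed = 1 if system == 3

list
```

The condition if !missing(system) is a design choice: without it, units with no information on the electoral system would be coded 0 and silently end up in the reference category.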

To use tab, we write the name of the variable we want to turn into dummy variables, add a comma and the option generate(), and write the stem name for the new dummy variables in the parentheses. Let's call them dum_electoralsystem to indicate that they are dummy variables, and that they relate to the electoral system.

In [2]:
tab gol_est, generate(dum_electoralsystem)
   Electoral |
      System |
      Type-3 |
     classes |      Freq.     Percent        Cum.
-------------+-----------------------------------
Majoritarian |         49       37.98       37.98
Proportional |         59       45.74       83.72
       Mixed |         21       16.28      100.00
-------------+-----------------------------------
       Total |        129      100.00

If we now check the variable list, we will find three new variables: dum_electoralsystem1, dum_electoralsystem2 and dum_electoralsystem3. If we look at how the original gol_est relates to dum_electoralsystem1 we see the following:

In [3]:
tab gol_est dum_electoralsystem1
   Electoral |
      System |
      Type-3 | gol_est==Majoritarian
     classes |         0          1 |     Total
-------------+----------------------+----------
Majoritarian |         0         49 |        49 
Proportional |        59          0 |        59 
       Mixed |        21          0 |        21 
-------------+----------------------+----------
       Total |        80         49 |       129 

All the 49 countries that had the value "Majoritarian" have received a 1 on the new variable, and those that had either "Proportional" or "Mixed" got the value 0, just as intended. Now we can use these variables in the analysis. However, keep in mind that although the command created dummy variables for all three categories, we will only use two of them in the analysis. We need to leave one out, as a reference category.
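
A quick way to check all the generated dummies at once, instead of cross-tabulating them one by one, is sketched below. It again uses a small hypothetical variable called system (1 = Majoritarian, 2 = Proportional, 3 = Mixed) so that the block runs on its own. The names dum_system and n_ones are made up for the illustration.

```stata
* Toy data (hypothetical), including one unit with a missing category
clear
input system
1
1
2
2
2
3
.
end

* Same command as in the text, applied to the toy variable
tabulate system, generate(dum_system)

* Each non-missing observation should have exactly one dummy equal to 1
egen n_ones = rowtotal(dum_system1 dum_system2 dum_system3)
assert n_ones == 1 if !missing(system)

* Observations with a missing category are missing on every dummy
assert missing(dum_system1) if missing(system)
```

If either assert fails, Stata stops with an error, which makes this a convenient safeguard in a do-file.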

Regression analysis with dummy variables

The dependent variable is called gol_enep and is a measure of the number of "effective parties", that is, the number of parties in parliament with a real possibility of affecting policy. To illustrate what the regression analysis will do we can first look closer at the mean values of the dependent variable, over the different categories of the original independent variable.

In [4]:
table gol_est, contents(mean gol_enep)
-----------------------------
Electoral    |
System       |
Type-3       |
classes      | mean(gol_enep)
-------------+---------------
Majoritarian |      3.5251438
Proportional |      4.7289557
       Mixed |      3.7276943
-----------------------------

Countries with majoritarian electoral systems on average have 3.5 (effective) parties, proportional electoral systems on average have 4.7, and in mixed electoral systems, there are on average 3.7.

The regression analysis will not show anything other than these differences in means. The advantage is that we can see if the differences are statistically significant, and we can also control for other variables (or use this as a control variable, of course).

Let's now run the regression analysis, with gol_enep as the dependent variable, and dum_electoralsystem2 and dum_electoralsystem3 as independent variables. We leave out dum_electoralsystem1 - it is our reference category.

In [5]:
reg gol_enep dum_electoralsystem2 dum_electoralsystem3
      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(2, 117)       =      3.97
       Model |  39.1875707         2  19.5937854   Prob > F        =    0.0214
    Residual |  577.131481       117   4.9327477   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0476
       Total |  616.319052       119  5.17915169   Root MSE        =     2.221

--------------------------------------------------------------------------------------
            gol_enep |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
dum_electoralsystem2 |   1.203812   .4531648     2.66   0.009     .3063428    2.101281
dum_electoralsystem3 |   .2025504   .5959897     0.34   0.735    -.9777758    1.382877
               _cons |   3.525144   .3468586    10.16   0.000     2.838208    4.212079
--------------------------------------------------------------------------------------

_cons shows the expected value on the dependent variable when all the independent variables have the value zero. That is, the country has a zero on dum_electoralsystem2 and dum_electoralsystem3, which means that the country has a majoritarian electoral system. The value is 3.5, which we above saw is the mean value for majoritarian electoral systems.

The coefficient for dum_electoralsystem2 is 1.2, and shows the difference between the proportional systems and the majoritarian. If we add 1.2 to 3.5 we get 4.7, which is the mean for the proportional systems. We can also see that this difference is statistically significant (p=0.009).

The difference between the mixed systems (dum_electoralsystem3) and the majoritarian systems is smaller (0.2) and it is not significant (p=0.735).
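
The link between the coefficients and the group means can be checked on a tiny example. The sketch below uses made-up data with two hypothetical variables, system (1 = Majoritarian, 2 = Proportional, 3 = Mixed) and parties, chosen so that the group means are easy to compute by hand. None of the numbers come from the QoG dataset.

```stata
* Toy data (hypothetical): two countries per electoral system
clear
input system parties
1 2.0
1 3.0
2 4.0
2 5.0
3 3.0
3 4.0
end
tabulate system, generate(dum_system)

* Group means by hand: 2.5 (majoritarian), 4.5 (proportional), 3.5 (mixed)
tabstat parties, by(system) statistics(mean)

* Majoritarian (dum_system1) is left out as the reference category
regress parties dum_system2 dum_system3

* Intercept = reference mean; intercept + coefficient = the other group means
display _b[_cons]
display _b[_cons] + _b[dum_system2]
display _b[_cons] + _b[dum_system3]
```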

But is the difference between the proportional and the mixed systems significant? We cannot see that here. To find out, we can run a new analysis where we leave out dum_electoralsystem2, and let the proportional systems be the reference category.

In [6]:
reg gol_enep dum_electoralsystem1 dum_electoralsystem3
      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(2, 117)       =      3.97
       Model |  39.1875707         2  19.5937854   Prob > F        =    0.0214
    Residual |  577.131481       117   4.9327477   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0476
       Total |  616.319052       119  5.17915169   Root MSE        =     2.221

--------------------------------------------------------------------------------------
            gol_enep |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
dum_electoralsystem1 |  -1.203812   .4531648    -2.66   0.009    -2.101281   -.3063428
dum_electoralsystem3 |  -1.001261   .5656325    -1.77   0.079    -2.121467    .1189441
               _cons |   4.728956   .2916288    16.22   0.000       4.1514    5.306511
--------------------------------------------------------------------------------------

The coefficient for dum_electoralsystem1 is -1.2, which has exactly the same size as the one for dum_electoralsystem2 in the previous analysis, but with the opposite sign. The difference in the data between the two groups is of course the same, regardless of which one we set as reference. Majoritarian systems have 1.2 fewer parties than the proportional; proportional systems have 1.2 more parties than the majoritarian.

But regardless of which category we choose as reference, the model is mathematically the same. We can see that for instance in the $R^2$ value, which is identical in the two analyses.
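
Because the model is the same, the comparison between two non-reference categories can also be tested without estimating a second model, using Stata's lincom or test after the regression. The sketch below shows the idea on made-up data. The variables system and parties are hypothetical stand-ins for gol_est and gol_enep.

```stata
* Toy data (hypothetical): two countries per electoral system
clear
input system parties
1 2.0
1 3.0
2 4.0
2 5.0
3 3.0
3 4.0
end
tabulate system, generate(dum_system)

* Majoritarian is the reference category
regress parties dum_system2 dum_system3

* Mixed minus proportional, with standard error, t-test and confidence interval
lincom dum_system3 - dum_system2

* Equivalent F-test of equal coefficients
test dum_system2 = dum_system3
```

On the QoG data, the analogous command lincom dum_electoralsystem3 - dum_electoralsystem2, run after the regression with majoritarian as reference, should reproduce the dum_electoralsystem3 row from the regression with proportional as reference.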

Regression analysis with automatically coded dummy variables

Finally, we can let Stata create temporary dummy variables for us within the regression command itself. We can thus skip the step of creating dum_electoralsystem1 and so on. Instead, we include the original variable gol_est as independent, but write b1. before the variable name, that is, b1.gol_est. Stata will then create temporary dummy variables that are not saved in the dataset after the analysis. The value 1 (majoritarian) will be used as reference category.

In [7]:
reg gol_enep b1.gol_est
      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(2, 117)       =      3.97
       Model |  39.1875707         2  19.5937854   Prob > F        =    0.0214
    Residual |  577.131481       117   4.9327477   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0476
       Total |  616.319052       119  5.17915169   Root MSE        =     2.221

-------------------------------------------------------------------------------
     gol_enep |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      gol_est |
Proportional  |   1.203812   .4531648     2.66   0.009     .3063428    2.101281
       Mixed  |   .2025504   .5959897     0.34   0.735    -.9777758    1.382877
              |
        _cons |   3.525144   .3468586    10.16   0.000     2.838208    4.212079
-------------------------------------------------------------------------------

The numbers are identical to the table above where majoritarian systems were left out as a reference category. If we want another value as reference, we can write another number after the b. Here we have an analysis with the proportional systems (value 2) as reference.

In [8]:
reg gol_enep b2.gol_est
      Source |       SS           df       MS      Number of obs   =       120
-------------+----------------------------------   F(2, 117)       =      3.97
       Model |  39.1875707         2  19.5937854   Prob > F        =    0.0214
    Residual |  577.131481       117   4.9327477   R-squared       =    0.0636
-------------+----------------------------------   Adj R-squared   =    0.0476
       Total |  616.319052       119  5.17915169   Root MSE        =     2.221

-------------------------------------------------------------------------------
     gol_enep |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      gol_est |
Majoritarian  |  -1.203812   .4531648    -2.66   0.009    -2.101281   -.3063428
       Mixed  |  -1.001261   .5656325    -1.77   0.079    -2.121467    .1189441
              |
        _cons |   4.728956   .2916288    16.22   0.000       4.1514    5.306511
-------------------------------------------------------------------------------
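
With the b#. notation, Stata also knows that the coefficients belong to one categorical variable, which makes some follow-up commands available. The sketch below uses made-up data again (the variables system and parties and the label systemlbl are hypothetical). It shows margins, which reports the predicted mean for each category, and pwcompare, which lists all pairwise differences in one table.

```stata
* Toy data (hypothetical): two countries per electoral system
clear
input system parties
1 2.0
1 3.0
2 4.0
2 5.0
3 3.0
3 4.0
end
label define systemlbl 1 "Majoritarian" 2 "Proportional" 3 "Mixed"
label values system systemlbl

* Proportional (value 2) as the reference category
regress parties b2.system

* Predicted mean for each category, whatever the reference
margins system

* All pairwise differences with tests, in one table
pwcompare system, effects
```

In a model with no other independent variables, the margins are simply the group means. With control variables added, they would instead be adjusted predictions.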

Conclusion

Dummy variables are necessary to be able to use categorical variables (especially ones that lack rank ordering) in a regression analysis. It is not necessary to give each value a dummy variable of its own: we can also choose a specific set of values that are compared to the rest. It is however important to remember what the comparison we make actually consists of. All values that don't have their own variable constitute the reference category, and the dummy variables show the differences compared to the reference category.
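
As a minimal sketch of comparing one value to the rest, the example below creates a single dummy for proportional systems, so that majoritarian and mixed systems together form the reference category. The data and the variable names system, parties and proportional are made up for the illustration.

```stata
* Toy data (hypothetical): two countries per electoral system
clear
input system parties
1 2.0
1 3.0
2 4.0
2 5.0
3 3.0
3 4.0
end

* One dummy: proportional (1) versus majoritarian and mixed combined (0)
generate proportional = (system == 2) if !missing(system)

* _cons is now the mean of the pooled reference group (here 3.0),
* and the coefficient is the proportional mean minus that (4.5 - 3.0)
regress parties proportional
```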