*By Anders Sundell*

Regression analysis requires that the independent variables are interval scales, that is, that the distances between the steps of the scale are equally long. Since the analysis basically consists of drawing a straight line over the different values of the independent variable, the interpretation will be very strange if there is no consistency over the steps. Even more so if there is no ordering of the values.

Variables that (untransformed) are not suitable as independent variables are for instance party affiliation, place of birth or personality type. These are categories, with no clear ordering.

But there is a solution: *dummy variables*. A variable that only has two values has equal distance between all the steps of the scale (since there is only one distance), and it can therefore be used in regression analysis. By dividing up categorical variables in multiple such dummy variables, where each variable represents one variable on the original variable, we can analyze the categorical variable in a regression analysis.

In practice this is like comparing mean values across the groups. The advantage is that we in a regression analysis also can control for other variables. In this guide we will walk through what dummy variables are, how to create them in Stata, and how to include them in a regression analysis.

Dummy variables are variables that divide a categorical variable into all its values, minus one. One value is always left out in a regression analysis, as a reference category. B-coefficients for the new variables will then show the expected differences in relation to the reference category. If we start with a variable that has five values, we will have to create four dummy variables. If we start with a variable with three values, we will have to create two dummy variables.

The dummy variables have the value 0 or 1: If the unit of analysis has the value in question it gets a 1, otherwise a 0.

Let's say that we want to investigate the possible effect of a country's electoral system on the number of parties represented in parliament. Classical theory in political science says that countries with majoritarian electoral systems (where the party that gets the most votes in a district takes all the seats from that district for instance) will have fewer parties, whereas countries with more proportional electoral systems (where seats are distributed proportionally to the share of votes) will have more.

In the QoG dataset there is a variable for electoral system, `gol_est`

. It has three values: `Majoritarian`

, `Proportional`

and `Mixed`

. You could perhaps argue that the mixed category is a sort of middle value between majoritarian and proportional, but we cannot really be certain. A safer strategy is to treat the variable as having three distinct categories. We then need to create two dummy variables to be able to analyze the variable in a good way. They are created according to the followin principle:

Original value | dum_proportional | dum_mixed |
---|---|---|

Majoritarian | 0 | 0 |

Proportional | 1 | 0 |

Mixed | 0 | 1 |

The majoritarian category is here the reference, and gets a zero on both the two new variables. Proportional systems get a 1 on the variable `dum_proportional`

and mixed gets a 1 on `dum_mixed`

. The coefficients for `dum_proportional`

and `dum_mixed`

will thus show the difference between the two categories and the majoritarian category. If we want to see the difference between proportional and mixed in the coefficients, we would have to leave one of them out, as the reference category.

We will now do the dummy variables. We use the QoG basic dataset, that can be loaded directly in Stata with the adress below. The independent variable is called `gol_est`

and has three values. Below, we load the dataset and make a frequency table.

In [1]:

```
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
tab gol_est
```

Now it is time to construct the variables. There are multiple ways (as is often the case). A little more cumbersome way is to create your own variables with `generate`

and then give them the right values with `replace`

. A smarter way is to use `tab`

with an option that creates dummy variables `, generate`

.

What we do is write the name of the variable we want to turn into dummy variables, add the option `, generate()`

and in the parentheses we write the name of our new dummy variables. Let's call them `dum_electoralsystem`

to indicate that they are dummy variables, and that they relate to the electoral system.

Vi skriver då namnet på variabeln vi vill göra dummyvariabler av, lägger till option `, generate()`

och skriver inom parenteserna vad vi vill att dummyvariablerna ska heta. Vi kallar dem dum_valsystem för att indikera att det är dummyvariabler och att det handlar om valsystemet.

In [2]:

```
tab gol_est, generate(dum_electoralsystem)
```

If we now check the variable list, we will find three new variables: `dum_electoralsystem1`

, `dum_electoralsystem2`

and `dum_electoralsystem3`

. If we look at how the original `gol_est`

relate to `dum_electoralsystem1`

we see the following:

In [3]:

```
tab gol_est dum_electoralsystem1
```

All the 49 countries that had the value "Majoritarian" have received a 1 on the new variable, and those that had either "Proportional" or "Mixed" got the value 0, just as intended. Now we can use these variables in the analysis. However, keep in mind that although the command created dummy variables for all three categories, we will only use two of them in the analysis. We need to leave one out, as a reference category.

The dependent variable is called `gol_enep`

and is a measure of the number of "effective parties", that is, the number of parties in parliament with a real possibility of affecting policy. To illustrate what the regression analysis will do we can first look closer at the mean values of the dependent variable, over the different categories of the original independent variable.

In [4]:

```
table gol_est, contents(mean gol_enep)
```

Countries with majoritarian electoral systems on average have 3.5 (effective) parties, proportional electoral systems on average have 4.7, and in mixed electoral systems, there are on average 3.7.

The regression analysis will not show anything other than these differences in means. The advantage is that we can see if the differences are statistically significant, and we can also control for other variables (or use this as a control variable, of course).

Let's now run the regression analysis, with `gol_enep`

as the dependent variable, and `dum_electoralsystem2`

and `dum_electoralsystem3`

as independent variables. We leave out `dum_electoralsystem1`

- it is our reference category.

In [5]:

```
reg gol_enep dum_electoralsystem2 dum_electoralsystem3
```

`_cons`

shows the expected value on the dependent variable when all the independent variables have the value zero. That is, the country has a zero on `dum_electoralsystem2`

and `dum_electoralsystem3`

, which means that the country has a majoritarian electoral system. The value is 3.5, which we above saw is the mean value for majoritarian electoral systems.

The coefficient for `dum_electoralsystem2`

is 1.2, and shows the difference between the proportional systems and the majoritarian. If we add 1.2 to 3.5 we get 4.7, which is the mean for the proportional systems. We can also see that this difference is statistically significant (p=0.009).

The difference between the mixed systems `dum_electoralsystem3`

and the majoritarian is smaller (0.2) and it is not significant (p=0.735).

But is the difference between the proportional and the mixed systems signifcant? We cannot see that here. To find out, we can run a new analysis where we leave out `dum_electoralsystem2`

, and let the proportional systems be the reference category.

In [6]:

```
reg gol_enep dum_electoralsystem1 dum_electoralsystem3
```

The coefficient for `dum_electoralsystem1`

is -1.2, which is exactly the same as the one for `dum_electoralsystem2`

in the previous analysis, but now it is negative. The difference in the data between the two groups is of course the same, regardless of which one we set as reference. Majoritarian systems have 1.2 parties *fewer* than the proportional; proportional systems have 1.2 parties *more* than the majoritarian.

But regardless of which category we chose as reference, the model is basically the same, mathematically. We can for instance see that on the $R^2$-value, which is the same in both of the analyses.

Finally, we can while doing a regression analysis, let Stata create temporary dummy variables for us, within the regression command. We can thus skip the step of creating `dum_electoralsystem1`

and so on. Instead, we include the original variable `gol_est`

as independent, but write b1. before the variable name. That is `b1.gol_est`

. Stata will then create temporary dummy variables, that are not saved in the dataset after the analysis. The value 1 will be used as reference category.

In [7]:

```
reg gol_enep b1.gol_est
```

The numbers are identical to the table above where majoritarian systems were left out as a reference category. If we want another value as reference, we can write another number after the b. Here we have an analysis with the proportional systems (value 2) as reference.

In [8]:

```
reg gol_enep b2.gol_est
```

Dummy variables are necessary to be able to use categorical variables (especially ones that lack rank ordering) in a regression analysis. It is not necessary to give each value a dummy variable of its own - we can also choose a specific set of values, that are compared to the rest. It is however important to remember what the comparison we make actually consists of. All values that don't have their own variable conssitute the reference category, and the dummy variables show the differences compared to the reference category.