Recode variables

Svensk version | Front page

Often when we do data analysis the dataset was created by someone else. It is then likely that the variables are not perfectly suited to our aims. We might be interested in a certain number of categories, want to make certain comparisons, or use other categorizations than those in the already existing variables. We then need to recode the variables. It generally means that we create new variables, that build on the old ones.

There are several commands in Stata that help us accomplish this. In this guide we will look at three of the most useful: recode, generate and replace.

We will use the QoG basic dataset. Here I have entered the web adress to the data directly in the browser, but you can of course download it to your computer and open it from there instead - it is generally better.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
(Quality of Government Basic dataset 2018 - Cross-Section)

Recode

The recode command replaces different values on one variable to some other variable. We can replace certain specific values, or ranges, like 5-19 or 28-752.

Let's for example look at Freedom House's categorization of the countries of the world as "Free", "Partly free" and "Not free". We do this with the command tab that gives us a frequency table, showing how many observations that have each value.

In [2]:
tab fh_status
    Freedom |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
       Free |         89       45.88       45.88
Partly Free |         54       27.84       73.71
   Not Free |         51       26.29      100.00
------------+-----------------------------------
      Total |        194      100.00

89 countries have the value "Free", 54 "Partly free" and 51 "Not free". It is however important to note that in the dataset, the values are not stored as words. Instead, the variable consists of numbers, and each number has a label (created by the persons who did the dataset). To look at the actual values, we can add an option to the tab command, nolabel:

In [3]:
tab fh_status, nolabel
    Freedom |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         89       45.88       45.88
          2 |         54       27.84       73.71
          3 |         51       26.29      100.00
------------+-----------------------------------
      Total |        194      100.00

These are the values Stata actually uses. Free has the value 1, Partly free has the value 2, and Not free the value 3. These are the values we need to use if we want to recode the variable, not the words "Free" and so on.

Let's now say that we want to separate the democracies, the countries that has the value 1, "Free", from the rest. We want to make a new variable that has the value 1 for democracies, and 0 for non-democracies.The recode we want to make is thus:

Old value New value
1 1
2 0
3 0

We can use the recode command to do this. The principle is that we write recode, the variable to transform, how to recode it, and then an option that tells Stata what we want to call our new variable. We don't want to change the original variable.

In [4]:
recode fh_status (1 = 1) (2 3 = 0), generate(democracy)
(105 differences between fh_status and democracy)

Each set of parantheses holds the old value to the left, and the new value to the right. 1 is changed to 1, 2 and 3 are changed to 0. And the new variable is called democracy. After it is a good idea to double check that the recoding turned out right:

In [5]:
tab democracy
  RECODE of |
  fh_status |
   (Freedom |
    Status) |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        105       54.12       54.12
          1 |         89       45.88      100.00
------------+-----------------------------------
      Total |        194      100.00

We still have 194 countries; 89 countries have the value 1; and 105 countries have the value 0. This looks good, since 54+51=105 countries had the values 2 and 3 on the old variable.

If we don't want to list all separate values in the recode command we can use / to express an interval. If we for instance want to recode the variable fh_rol, that shows the country's degree of "Rule of law". The variable has the values 0-16, and we might want those that have a value in the interval 10-16 to get a 1 on the new variable, and those that had 0-9 on the old variable to get a 0.

In [6]:
recode fh_rol (0/9 = 0) (10/16 = 1), generate(highruleoflaw)
(183 differences between fh_rol and highruleoflaw)

Generate

We can also construct variables "from scratch". The generate command creates a new variable in the dataset. Together with if-statements - conditions that only apply the command to certain observations - and replace, which we will discuss shortly, it is a very flexible tool.

The principle is that we write generate, the name of the new variable we want to create, and then what the variable's value should be. We can for instance create a test variable that has the value 0 for all observations. I here use the abbreviation gen, which you can write instead of generate:

In [7]:
gen test = 0
In [8]:
tab test
       test |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        194      100.00      100.00
------------+-----------------------------------
      Total |        194      100.00

194 countries now has the value 0 on the new variable test. Not so useful, yet. But we can also use generate to create new version of already existing variables. The dataset for instance holds a variable for the countries' GDP per capite, gle_rgdpc. It is expressed as the number of dollar's GDP per capita. With the command summarize(can be shortened to sumamong other things see the mean, min and max value of the variable:

In [9]:
sum gle_rgdpc
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |        192     12596.3     15803.7     285.95   95696.97

The mean is 12596.3 dollars, and the max value is 95696,97. Let´s say we want to express the information in 1000's of dollars. We then create a new variable, which is the old one, divided by 1000. We can easily do this with generate. After the equal sign, we can enter any mathematical operation:

In [13]:
gen gdpcap_1000 = gle_rgdpc/1000
(2 missing values generated)
In [14]:
sum gle_rgdpc gdpcap_1000
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |        192     12596.3     15803.7     285.95   95696.97
 gdpcap_1000 |        192     12.5963     15.8037     .28595   95.69697

When we compare the old and new variable we see that we exactly the same number of countries, and the numbers are the same, but the decimal point has jumped three steps to the left. The mean is now 12.5963 instead of 12596.3. As long as we remember that the new variable shows GDP per capita in 1000's of dollars it has no effect on the analyses, other than making it easier to present in tables.

Replace

replace is similar to the generate command, but here we are changing already existing variables, instead of creating new ones. We can create a variable with generate, and then change it based on some condititons with replace. Here we have to use if statements.

Let's say we want to create a variable that has the value 1 if the country is really poor, and has a GDP per capita that is less than 1000 dollars. All other countries get the value 0. We start by creating a variable where all countries have the value 0:

In [15]:
gen poor = 0

Then we change the variable so poor countries get the value 1. We do this with the replace command. We write replace, the variable we want to change, and then the new value, and then any if statements. The if statements can use infromation from other variables.

In [16]:
replace poor = 1 if gle_rgdpc < 1000
(19 real changes made)

The output shows that 19 changes were made in the variable - 19 countries got the value 1. All those have a value of gle_rgdpc that is less than 1000.

But here we have also created another problem. If we were to look at the new variable we would notice that it has data for 194 countries. But if we look at the original variable gle_rgdpc we would only have data for 192 countries. We lack data on two countries; they are "missing." In Stata they have the value "." on the variable, a dot. Observations that have this "missing" value are not included in analyses we do, which is good.

We didn't have any conditions when we created our new poor variable, and these two countries also got the value 0. When we later changed the variable with replace Stata checked whether they had a value below 1000 on gle_rgdpc, and they had not, since they had the value ".". Therefore, they remain as zeroes. Also good to know is that Stata (for some reason) considers this dot as the largest value there is. So if the condition would have been that the country should have a value on gle_rgdpc that is larger than 1000, the two countries with missing data would have received the value! Not intuitive, but crucial to know.

Anyway, we don't want these two countries (which we don't know anything about) to be included in our variable at all. We will therefore use replace again, to give them the value . on the new variable poor.

In [17]:
replace poor = . if gle_rgdpc == .
(2 real changes made, 2 to missing)

An interesting thing to not is that I in the if statement used double equality signs. It is also not perfectly intuitive, but it is the standard for how to write conditions of this kind. The operators we can choose from are:

Meaning Operator
equal to: ==
larger than or equal to: >=
less than or equal to: <=
not equal to: !=
larger than: >
less than: <
and: &
or: |

The two last ones can be used to combine different statements. For instance, if we want to single out countries that are both poor and non-democracies:

In [18]:
gen poordictatorship = 0
replace poordictatorship = 1 if gle_rgdpc < 1000 & fh_status==3

(5 real changes made)

Conclusion

These are some alternatives to recode and create new variables. You can get really far using these, but there are others as well. These are also not mutually exclusive. What you can do with recode you often can do with replace, and the other way around. You use what you feel most comfortable with, and what suits you best for the recoding you are looking for. But remember to double check that the recode turned out the way you wanted afterwards!