Combine datasets with merge

Svensk version | Front page

Many statistical analyses requires the that we comvbine several datasets. We might get information about the independent variable from one source, and want to analyze the effects of it on a dependent variable from another source. We thus have the same observations in both datasets, but want to combine the variables. We do this with the command merge.

Another scenario is when we have different datasets with the same variables, but with different units of analysis. In those cases, we use the command append, which really is the same as pasting more observations to the dataset. But since merge is a bit trickier we will focus on that command in this guide.

Different ways of merging datasets

With the merge command we can do three (or really, two) different types of merges. Easiest is the 1:1 merge. It means that we have the same observations in both datasets. For instance the same individuals, or the same countries; we just want to add more information about them. We then need a matching variable, a key, that shows the country's or the person's identity, so we can match the information in the two datasets.

But we can also match m:1. It means that we have many observations in the dataset that is active in Stata, and want to add on data from a smaller number of observations, at a higher level. For instance, we might have data on persons in Europe, and want to add country-level information. Imagine that we have conducted a survey, and now want to see whether the answers are affected by whether the country of residence is an EU member or not. All persons in Sweden will have the same value - Sweden is a part of the EU. All persons in Norway will also have the same value - Norway is not a member of the EU. In our individual level dataset we can have thousands of observations, and still combine it with about 30 country-level observations (one for each country).

The third type of merge is 1:m. It is the exact same thing, but instead of starting with the individual level dataset, we start with the country level dataset, and add the individual level data.

Combine two datasets with the same type of observations - merge 1:1

We start with the simplest type: when we have the same type of observations in both datasets, and want to add mode variables. Let's say we want to combine information about the level of democracy in a country with the level of corrption in the country. We can get information about democracy from Freedom House. The data is in Excel format, but we can just cut the relevant parts and paste into a Stata dataset. Then we get information about corruption from Transparency International, and treat it the same way.

We now have two datasets, fh2017.dta and cpi2018.dta. Below we load fh2017.dta and look at the first five rows in the dataset, with the help of the command list.

In [2]:
cd "/Users/xsunde/Dropbox/Jupyter/stathelp/data"
use "fh2017.dta", clear
list in 1/5
/Users/xsunde/Dropbox/Jupyter/stathelp/data



     +----------------------------------------+
     |     country   fh_pr   fh_cl   fh_sta~s |
     |----------------------------------------|
  1. | Afghanistan       5       6         NF |
  2. |     Albania       3       3         PF |
  3. |     Algeria       6       5         NF |
  4. |     Andorra       1       1          F |
  5. |      Angola       6       6         NF |
     +----------------------------------------+

We have four variables: the country name country, the level of political rights on a scale from 1 to 7 fh_pr, the level of civil liberties fh_cl and a summarizing classification fh_status. Now let's look at the corruption data.

In [3]:
use "cpi2018.dta", clear
list in 1/5


     +-----------------------+
     |     country   cpi2018 |
     |-----------------------|
  1. | Afghanistan        16 |
  2. |     Albania        36 |
  3. |     Algeria        35 |
  4. |      Angola        19 |
  5. |   Argentina        40 |
     +-----------------------+

Here we only have two variables: the country name country and the level of corruption cpi2018 (where low values indicate more corruption). We will now merge the two datasets.

Step by step

  1. Make sure that there is an ID variable in both datasets that we can match on. In this case it is the country name country. We want to match the level of democracy in Afghanistan with the level of corruption in Afghanistan. If we don't have a ID variable of this type it is impossible to do the merge. Generally, it is preferable if the ID is a number, and not a string (text) variable. It reduces the risk that the merge is hindered by spelling errors and so on. We can also match on serveral variables. For instance, when working with this type of data we often want to match on both country and year. Sweden 2008 should be matched to Sweden 2008, and so on.
  2. Make sure that the ID variable has the same name in both datasets.
  3. Open one of the datasets. In this case we have loaded the corruption data.
  4. Enter the code to merge the datasets, matched on the ID variable.
  5. Check the results and deal with errors.

We use the command merge. The principle is that we write merge, then what type of merge we want to do, then what the ID variable is, and then which dataset we want to match the active data with. In this case, we want a 1:1 merge, with the democracy data. It looks like this:

In [4]:
merge 1:1 country using "fh2017.dta"
    Result                           # of obs.
    -----------------------------------------
    not matched                            50
        from master                        12  (_merge==1)
        from using                         38  (_merge==2)

    matched                               168  (_merge==3)
    -----------------------------------------

The merge is now complete. But we have errors to correct. The summary shows that 168 observations were successfully matched. But we can also see that there were 50 observations that could not be matched: 12 from the "master" dataset, that is, the one we had active (the corruption data). We also have 38 observations in the "using" dataset, the democracy data, that could not find a match.

This does not necessarily have to be a problem, for instance if we match data on EU countries with data from all of the world. We will then of course not be able to match African countries with the EU data.

But here we might have some problems, since there are countries in both datasets that could not find a match. Often this is caused by errors in the ID variable.

Identifying mis-matched observations

Luckily, Stata allows us to easily identify the problematic observations. The merge command creates a new variable called _merge. It shows the result of the matching. Observations with the code 1 were only found in the original data. Observations with the code 2 were only found in the new data. Observations with the code 3 were successfully matched. We can list all observations that were not matches, that is, all observations that did not have the value 3 on the _merge variable:

In [5]:
list country _merge if _merge!=3
     +----------------------------------------------------+
     |                          country            _merge |
     |----------------------------------------------------|
 22. |                Brunei Darussalam   master only (1) |
 26. |                       Cabo Verde   master only (1) |
 36. |                            Congo   master only (1) |
 43. | Democratic Republic of the Congo   master only (1) |
 58. |                           Gambia   master only (1) |
     |----------------------------------------------------|
 66. |                    Guinea Bissau   master only (1) |
 70. |                        Hong Kong   master only (1) |
 85. |                     Korea, North   master only (1) |
 86. |                     Korea, South   master only (1) |
135. |                      Saint Lucia   master only (1) |
     |----------------------------------------------------|
136. | Saint Vincent and the Grenadines   master only (1) |
172. |         United States of America   master only (1) |
181. |                          Andorra    using only (2) |
182. |              Antigua and Barbuda    using only (2) |
183. |                           Belize    using only (2) |
     |----------------------------------------------------|
184. |                           Brunei    using only (2) |
185. |                       Cape Verde    using only (2) |
186. |              Congo (Brazzaville)    using only (2) |
187. |                 Congo (Kinshasa)    using only (2) |
188. |                   Czechoslovakia    using only (2) |
     |----------------------------------------------------|
189. |                             Fiji    using only (2) |
190. |                     Germany, E.     using only (2) |
191. |                     Germany, W.     using only (2) |
192. |                    Guinea-Bissau    using only (2) |
193. |                         Kiribati    using only (2) |
     |----------------------------------------------------|
194. |                    Liechtenstein    using only (2) |
195. |                 Marshall Islands    using only (2) |
196. |                       Micronesia    using only (2) |
197. |                           Monaco    using only (2) |
198. |                            Nauru    using only (2) |
     |----------------------------------------------------|
199. |                      North Korea    using only (2) |
200. |                  Northern Cyprus    using only (2) |
201. |                            Palau    using only (2) |
202. |                            Samoa    using only (2) |
203. |                       San Marino    using only (2) |
     |----------------------------------------------------|
204. |                      South Korea    using only (2) |
205. |              St. Kitts and Nevis    using only (2) |
206. |                        St. Lucia    using only (2) |
207. |   St. Vincent and the Grenadines    using only (2) |
208. |                       The Gambia    using only (2) |
     |----------------------------------------------------|
209. |                            Tonga    using only (2) |
210. |                           Tuvalu    using only (2) |
211. |                             USSR    using only (2) |
212. |                    United States    using only (2) |
213. |                      Vietnam, N.    using only (2) |
     |----------------------------------------------------|
214. |                      Vietnam, S.    using only (2) |
215. |                        Yemen, N.    using only (2) |
216. |                        Yemen, S.    using only (2) |
217. |                       Yugoslavia    using only (2) |
218. | Yugoslavia (Serbia & Montenegro)    using only (2) |
     +----------------------------------------------------+

The next step is to see if there are any countries that appear twice, in both the "master only" and "using only" categories. One such country is the United States, called "United States of America" in one dataset and "United States" in the other. We can also see that South Korea is called "Korea, South" in one dataset and "Sotuh Korea" in the other. Stata does not understand that these are the same countries. It would also be unable to match if there were differences in capitalization.

We must therefore manually correct the data so that the ID variables match. We need to go back to the data and change the variable in one of the datasets. We can for instance change "United States" to "United States of America" and "Korea, South" to "South Korea" in the democracy data. If we then rerun our do-file with the merge command, the number of unmatched observations should have decreased.

The _merge variable

When we are happy with the results of our merging we can remove the data we don't need. For instance the unmatched data, or observations that were not included in our original data. If we want to remove observations that only were included in the democracy data (the "using" dataset) we can write:

In [6]:
drop if _merge==2
(38 observations deleted)

THen we can drop the _merge variable itself. This is necessary if we want to add more data - Stata will not go through with the merge if there already exists a _merge variable in the dataset. We therefore write:

In [7]:
drop _merge

Combine datasets with different types of observations - merge m:1

Now let's say that we have a dataset of individuals, where the data shows each persons gender and his or her satisfaction with democracy. I made this data up. Let's load it and look at the 10 first persons (there are only 15 in the entire dataset):

In [8]:
use "satisfactiontest.dta", clear
list in 1/10


     +-------------------------------+
     |     country   sat_dem   woman |
     |-------------------------------|
  1. | Afghanistan         1       1 |
  2. | Afghanistan         3       0 |
  3. | Afghanistan         2       1 |
  4. | Afghanistan         4       0 |
  5. | Afghanistan         2       0 |
     |-------------------------------|
  6. |     Albania         5       1 |
  7. |     Albania         7       1 |
  8. |     Albania         5       0 |
  9. |     Albania         8       1 |
 10. |     Albania         4       0 |
     +-------------------------------+

We have here persons from Afghanistan and Albania, who have different levels of satisfaction with democracy. Some of them are women, and some are not. Now let's merge this dataset with the democracy data, with the help of merge m:1. The ID variable that we match on is still country.

In [9]:
merge m:1 country using "fh2017.dta"
    Result                           # of obs.
    -----------------------------------------
    not matched                           203
        from master                         0  (_merge==1)
        from using                        203  (_merge==2)

    matched                                15  (_merge==3)
    -----------------------------------------

15 observations were matched, as planned. 203 observations were not matched, but they were all from the democracy data. We only had persons from three countries, so all other countries are unmatched.

Now let's look at the first 10 persons again:

In [10]:
list in 1/10
     +------------------------------------------------------------------------+
     |     country   sat_dem   woman   fh_pr   fh_cl   fh_sta~s        _merge |
     |------------------------------------------------------------------------|
  1. | Afghanistan         1       1       5       6         NF   matched (3) |
  2. | Afghanistan         3       0       5       6         NF   matched (3) |
  3. | Afghanistan         2       1       5       6         NF   matched (3) |
  4. | Afghanistan         4       0       5       6         NF   matched (3) |
  5. | Afghanistan         2       0       5       6         NF   matched (3) |
     |------------------------------------------------------------------------|
  6. |     Albania         5       1       3       3         PF   matched (3) |
  7. |     Albania         7       1       3       3         PF   matched (3) |
  8. |     Albania         5       0       3       3         PF   matched (3) |
  9. |     Albania         8       1       3       3         PF   matched (3) |
 10. |     Albania         4       0       3       3         PF   matched (3) |
     +------------------------------------------------------------------------+

Note that all the variables from the democracy data - fh_pr, fh_cl and fh_status have the same value for all persons from each country. All persons from Afghanistan have the same value on the variable fh_pr, since the level of political rights is the same for everybody in Afghanistan. The variable only varies between countries, not within.

If we instead had started with the country level data loaded in Stata, we would have had to write merge 1:m using "satisfactiontest.dta" instead, but the end result would have been the same.

Remember

  1. There must be a matching variable, an ID variable, that has the same name in both datasets.
  2. The matching variables need to be of the same type. If you for instance have pasted data from Excel, make sure to check that both variables are numeric (if you have number codes), or that both variables are string variables (if they are names). You can see that by opening the data editor and looking at the colors of the numbers. If they are black they are numeric; if they are red they are text.
  3. If we want to add variables, the observations need to be the same in both datasets. We can for instance not combine survey data from different surveys, with different respondents.
  4. Remove unnecessary observations. If you have a main dataset and add a variable, remove any unnecessary observations that come with the new variable. They will only complicate future merges.
  5. Remember to drop the _merge variable when you are done matching, so that you can easily add more data later on.
  6. Be extra careful and double-check each step! When merges are unsuccessful, it is often because of some petty error, like wrong filenames or wrong address to the files and so on.

Conclusion

The merge command is suitable when we want to add more variables. If we want to add observations, with information on the same variables, we need to use the append command. One such case might be when we want to combine surveys with identical questions, but carried out in 2018 and 2019. append then lets us combine the survey responses in one dataset.