STATISTICS HELP | SV/EN

Transform panel data between long and wide with reshape

Panel data can, as described in another post be structured in a "long" or "wide" way, depending on the question asked. Below are examples of what the structures look like.

  • In wide data each variable is included several times, once for every unit of time, but each panel unit (country in this case) only has one observation.
  • In long data each variable is only included once, but each panel unit has many observations. Time is also shown in a special variable.

In this guide we will cover how to transform the data between the two formats, with the command reshape.

Wide data:

country population2000 population2001 population2002
Sweden 8872284 8888675 8911899
Norway 4491572 4514907 4537240

Long data:

country year population
Sweden 2000 8872284
Sweden 2001 8888675
Sweden 2002 8911899
Norway 2000 4491572
Norway 2001 4514907
Norway 2002 4537240

Reshape from long to wide

First we load the data to work with: The QoG institute time-series cross-section data. It holds data on the countries of the world, over time. To make it a little bit easier to grasp we will throw away most of the variables. We will only keep the variables "cname", "year", "fh_status" (democracy status) and "unna_pop" (population), and only the years 2010 and 2011. We do this with the command keep and an if-statement.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_ts_jan18.dta", clear
keep cname year fh_status unna_pop
keep if year==2010 | year==2011
(Quality of Government Basic dataset 2018 - Time-Series)


(14,770 observations deleted)

If we look at the first six rows of the dataset (using list) we can see that each country is included twice, once for 2010 and once for 2011. Each variable is only included once. The data is in "long" format.

In [2]:
list in 1/6
     +---------------------------------------------+
     |       cname   year     fh_status   unna_pop |
     |---------------------------------------------|
  1. | Afghanistan   2010      Not Free   27962207 |
  2. | Afghanistan   2011      Not Free   28809167 |
  3. |     Albania   2010   Partly Free    2901883 |
  4. |     Albania   2011   Partly Free    2886010 |
  5. |     Algeria   2010      Not Free   36036159 |
     |---------------------------------------------|
  6. |     Algeria   2011      Not Free   36717132 |
     +---------------------------------------------+

The reshape command from long to wide

Now we shall run the reshape command. We need to specify a few things:

  1. If the end result should be long or wide
  2. Which variables to transform (fh_status and unna_pop, in this case)
  3. Which variable that shows the panel unit (cname, in this case)
  4. Which variable that shows the time (year, in this case)

The first two pieces of information are included before the comma sign, and the other two in the options i() (the panel unit) and j() (time). It looks like this:

In [3]:
reshape wide fh_status-unna_pop, i(cname) j(year)
(note: j = 2010 2011)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      422   ->     211
Number of variables                   4   ->       5
j variable (2 values)              year   ->   (dropped)
xij variables:
                              fh_status   ->   fh_status2010 fh_status2011
                               unna_pop   ->   unna_pop2010 unna_pop2011
-----------------------------------------------------------------------------

Stata shows us a little status report that shows the result: The number of observations has been cut in half, just as expected. Previously, each country was included twice, now it is only included once. The number of variables has however increase. The variable "year" is gone (it is now reflected in the other variable names): "fh_status" has turned into "fh_status2010" and "fh_status2011", while "unna_pop" has turned into "unna_pop2010" and "unna_pop2011". We can again look at the first six rows to make sure that it worked:

In [4]:
list in 1/6
     +-----------------------------------------------------------------------+
     |               cname   fh_sta~2010   unn~2010   fh_sta~2011   unn~2011 |
     |-----------------------------------------------------------------------|
  1. |         Afghanistan      Not Free   27962207      Not Free   28809167 |
  2. |             Albania   Partly Free    2901883   Partly Free    2886010 |
  3. |             Algeria      Not Free   36036159      Not Free   36717132 |
  4. |             Andorra          Free      84419          Free      82326 |
  5. |              Angola      Not Free   21219954      Not Free   21942296 |
     |-----------------------------------------------------------------------|
  6. | Antigua and Barbuda          Free      87233          Free      88152 |
     +-----------------------------------------------------------------------+

A possible problem: Duplicate values ("values of variable not unique")

If the command doesn't work a problem might be that there are duplicate values in the panel and time variables. We then get the error message "values of variable year not unique within cname". Use the command duplicates list to find these duplicate values: duplicates list cname year and figure out what went wrong.

Reshape from wide to long

We have now created a "wide" dataset. If we want to turn it back to long format we can do so with reshape. This might actually be the more common case when collecting data from various official sources.

In order for it to work the variable must be named in a specific way. All variables that show the same information needs to have the same name, but with different numbers at the end. They are supporsed to be named (for instance) "unna_pop2010", "unna_pop2011", "unna_pop2012". They cannot be called "unna2010pop" and "unna2011pop". The numbers have to be at the end. Siffrorna måste vara på slutet för att Stata ska fatta.

The reshape command from wide to long

The command is very similar to what we did previously, but not entirely. First we write reshape long to indicate that we want to create a long dataset. Then we write the name of the variables we want to reshape, but only the "stem" - not the time. Stata will then look for this variable name "stems" in the variable list. If we have variables that are called fh_status2010 and fh_status2011 we only write "fh_status".

In the option i() we state the name of the panel unit variable, and in option j() we name the time variable. But this time, there is no existing time variable! We set the name of it in this command. We will call the vairable that shows the year for "year". The full command is thus:

In [5]:
reshape long fh_status unna_pop, i(cname) j(year)
(note: j = 2010 2011)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      211   ->     422
Number of variables                   5   ->       4
j variable (2 values)                     ->   year
xij variables:
            fh_status2010 fh_status2011   ->   fh_status
              unna_pop2010 unna_pop2011   ->   unna_pop
-----------------------------------------------------------------------------

The number of observations doubles, from 211 to 422, because there were two years. Previously, the years were spread out over columns - now they are spread out over more observations instead. There are now only four variables, down from five, and Stata has also created a new variable: "year". The information in "fh_status2010" and "fh_status2011" are now contained in the variable "fh_status", and the information in the variables "unna_pop2010" and "unna_pop2011" are now in the variable "unna_pop". Let's look at the first six rows:

In [6]:
list in 1/6
     +---------------------------------------------+
     |       cname   year     fh_status   unna_pop |
     |---------------------------------------------|
  1. | Afghanistan   2010      Not Free   27962207 |
  2. | Afghanistan   2011      Not Free   28809167 |
  3. |     Albania   2010   Partly Free    2901883 |
  4. |     Albania   2011   Partly Free    2886010 |
  5. |     Algeria   2010      Not Free   36036159 |
     |---------------------------------------------|
  6. |     Algeria   2011      Not Free   36717132 |
     +---------------------------------------------+

Conclusion

reshape can be used to transform data to and from long and wide format.

  • Going from long to wide:
    reshape long variabelnamn, i(panelvariabel) j(tidsvariabel)

  • Going from wide to long:
    reshape wide variabelnamn, i(panelvariabel) j(ny tidsvariabel)

For most (but not all) statistical analysis long format is better, since we then can set the panel structure with xtset and do analyses that take time into account.