Panel data (time-series cross-section)

Cross sectional data means that we have data from many units, at one point in time.
Time series data means that we have data from one unit, over many points in time.
Panel data (or time series cross section) means that we have data from many units, over many points in time.

Panel data lets us perform more interesting analyses than either cross-sectional or time series data alone. It also gives us a better opportunity to rule out alternative explanations, which makes it easier to talk about cause and effect.

In Stata we can use time series commands (see the separate guide for them!) on panel data to create lagged and leading variables. We can also use special regression commands that are suited for panel data, such as xtreg.
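To give a rough idea of where this is heading, here is a minimal sketch of a panel regression with xtreg. It assumes the example dataset nlswork from Stata's documentation (fetched with webuse, which needs an internet connection); the dataset and the choice of variables are only for illustration and are not part of the data used in the rest of this guide.

```stata
* Illustration only: nlswork is an example dataset from Stata's documentation
webuse nlswork, clear

* Declare the panel structure: idcode identifies the person, year is the time variable
xtset idcode year

* Fixed effects regression of log wage on age and job tenure
xtreg ln_wage age tenure, fe
```

Note that xtset comes before xtreg: the regression command relies on the panel structure having been declared.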

But first we need to make sure that the data is set up for panel analysis. This guide is about that.

The panel data structure: long or wide

Panel data can be structured in two ways: "long" or "wide". To take an example, let's say we have data on countries, over time.

Wide data

With wide data each row in the dataset stands for one country, and each column stands for a variable at one point in time, for instance the population size of a country in a certain year. Like this:

country   population2000   population2001   population2002
Sweden           8872284          8888675          8911899
Norway           4491572          4514907          4537240

It might seem intuitive at first glance, and it makes it easy to compare certain years to each other. But it is harder to do more advanced analyses. With many different variables (population, GDP, unemployment, and so on) we will also need a lot of columns, one per variable and year.

Long data

In general it is more convenient to have the data in long form. In long data each row represents one country in one year, and each column represents one variable. But we also need a variable that shows which year the row represents. The table above would look like this in long form:

country   year   population
Sweden    2000      8872284
Sweden    2001      8888675
Sweden    2002      8911899
Norway    2000      4491572
Norway    2001      4514907
Norway    2002      4537240

This is the same data in another format. Here we instead have few columns and a lot of rows, which is the layout Stata's panel commands expect. To change format from wide to long, or from long to wide, use the command reshape. There will be another guide about that. The rest of this guide presumes that the data is in long form.
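As a minimal sketch of what such a conversion can look like, the example below types in the small wide table from above by hand and reshapes it. The hand-entered dataset is only for illustration; the variable names follow the tables in this guide.

```stata
* Enter the wide example table by hand (illustration only)
clear
input str10 country long population2000 long population2001 long population2002
"Sweden" 8872284 8888675 8911899
"Norway" 4491572 4514907 4537240
end

* Wide to long: "population" is the common stub of the wide variable names,
* i() is the unit variable, j() is the new variable that receives the year suffix
reshape long population, i(country) j(year)
list, sepby(country)

* Long back to wide, using the same stub and identifiers
reshape wide population, i(country) j(year)
```

After reshape long there is one row per country and year, with the variables country, year and population.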

Set the panel data structure with xtset

We need to specify two variables to Stata: a panel (unit) variable and a time variable. The panel variable is country in this case - all observations for Sweden are connected, all observations for Norway are connected, and so on. The time variable is year.

The command to specify these variables is xtset. We simply type xtset country year - the panel variable first, and then the time variable. Let us try, with the QoG institute's time series cross section dataset, which contains information about countries, over time. The data is in long format.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_ts_jan18.dta", clear
(Quality of Government Basic dataset 2018 - Time-Series)

A common problem: The panel variable is text (a string)

The variable "cname" shows the name of each country in the data, and the variable "year" shows the year that each row refers to. But if we try to use these variables with xtset we get the following error message:

In [2]:
xtset cname year
string variables not allowed in varlist;
cname is a string variable
r(109);

Stata objects because the panel variable "cname" is a string variable; xtset requires the panel variable to be a numeric code. The QoG data already contains such variables, for instance "ccode". But in other cases, for instance when we collect the data ourselves, we might not be so lucky. In those cases we can easily construct a country code ourselves, with the command egen, combined with group():

In [3]:
egen countryid = group(cname)

Stata then creates a new variable called "countryid", which gives each unique value of the variable "cname" its own numeric code, starting from one. We can now use this variable as our panel variable.
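An alternative, not used in the rest of this guide, is the command encode. It also creates a numeric variable, but attaches the original country names as value labels, so that output still shows names instead of bare numbers. Below is a small sketch with a made-up dataset; the variable name country_num is only an illustration.

```stata
* Made-up mini dataset with a string panel variable (illustration only)
clear
input str10 cname year
"Sweden" 2000
"Sweden" 2001
"Norway" 2000
"Norway" 2001
end

* Create a numeric version of cname; the names are kept as value labels
encode cname, generate(country_num)

* The labeled numeric variable is accepted as panel variable
xtset country_num year
```

Either approach works with xtset; which one to prefer is mostly a question of whether we want the names to show up as labels.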

Set the panel data the right way

In [4]:
xtset countryid year
       panel variable:  countryid (strongly balanced)
        time variable:  year, 1946 to 2017
                delta:  1 unit

This is the message we get when the command has worked. We can see that our new variable "countryid" is the panel variable, and that the time variable is "year". "Strongly balanced" means that every country has a row for the same set of years.

A common problem: Duplicates (repeated time values within panel)

Another common error message is "repeated time values within panel". It means that we have duplicate observations for at least one country-year. The two variables we specify with xtset must give a unique combination for every observation. Stata does not know what to do with a country-year that appears more than once, for instance several rows for Sweden in the year 2000, and therefore shows us the error message. It looks like this:

In [11]:
xtset countryid year
repeated time values within panel
r(451);

Unfortunately Stata does not tell us which observations caused the error. But we can use the command duplicates to find them. We write duplicates list followed by both of the variables in question (countryid year). If we only write one of them, for instance duplicates list countryid, we will get a very long list of observations, as each country is included many times (once for each year). But if we instead write duplicates list countryid year we only get the observations that have identical values on both variables:

In [12]:
duplicates list countryid year
Duplicates in terms of countryid year

  +-------------------------+
  |  obs:   countr~d   year |
  |-------------------------|
  | 15193        483   2000 |
  | 15194        483   2000 |
  | 15195        483   2000 |
  | 15196        483   2000 |
  | 15197        483   2000 |
  |-------------------------|
  | 15198        483   2000 |
  | 15199        483   2000 |
  | 15200        483   2000 |
  +-------------------------+

Here we can see that 8 observations are causing the problem. They all have the value 483 on the variable "countryid", and the value 2000 on the variable "year".
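When the list of duplicates is long, it can help to summarize or flag them instead of listing every row. The sketch below uses a made-up mini dataset with one repeated country-year; the variable name dup is only an illustration.

```stata
* Made-up mini dataset: country 1 appears twice in 2001 (illustration only)
clear
input countryid year pop
1 2000 100
1 2001 110
1 2001 110
2 2000 50
2 2001 55
end

* Summary table: how many observations appear once, twice, and so on
duplicates report countryid year

* Flag the duplicates: dup is 0 for unique country-years and
* counts the number of extra copies otherwise
duplicates tag countryid year, generate(dup)
list if dup > 0
```

The flag variable makes it easy to inspect all variables for the offending rows before deciding what to do with them.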

Now that we know which observations are the culprits, we need to think about why they were duplicated in the first place. In this case it was because I created them, to demonstrate the error message, and we can safely delete them. But in general we don't know which of the duplicates are the problematic ones - there might be one good observation of Sweden in 2000, and a bad one (caused by some error in data entry, for instance). In those cases it is necessary to take a close look at the data, to determine what went wrong and which observation should be deleted.

If we have decided to remove them we can use the command drop in combination with an if-statement. Below we instruct Stata to remove all observations with the value 483 on "countryid" and 2000 on "year".

In [13]:
drop if countryid==483 & year==2000
(8 observations deleted)

After doing so, we should be able to use xtset as intended.

In [14]:
xtset countryid year
       panel variable:  countryid (strongly balanced)
        time variable:  year, 1946 to 2017
                delta:  1 unit
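If the duplicated rows are truly interchangeable copies and we want to keep one of them rather than delete the whole country-year, duplicates drop is a possible alternative. The sketch below again uses a made-up mini dataset. Note the assumption: with the force option Stata keeps the first row it encounters for each country-year and drops the rest, even if the rows differ on other variables, so this is only safe after inspecting the data.

```stata
* Made-up mini dataset: country 1 appears twice in 2001 (illustration only)
clear
input countryid year pop
1 2000 100
1 2001 110
1 2001 110
2 2000 50
2 2001 55
end

* Keep one row per country-year; force is required when a variable list is given
duplicates drop countryid year, force

* With unique country-years, xtset no longer complains
xtset countryid year
```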

Create lagged variables

Now we are ready to start using the data. We can for instance create a lagged variable that shows the population in the previous year. Here we use the normal time series operators, in this case l. for a lag of one period.

In [15]:
gen lag_pop = l.unna_pop
(7,121 missing values generated)

If we now look at a segment of the data we can see that it worked:

In [16]:
list cname year unna_pop lag_pop if cname=="Sweden" & year>2010
       +------------------------------------+
       |  cname   year   unna_pop   lag_pop |
       |------------------------------------|
12738. | Sweden   2011    9462352   9382297 |
12739. | Sweden   2012    9543457   9462352 |
12740. | Sweden   2013    9624247   9543457 |
12741. | Sweden   2014    9703247   9624247 |
12742. | Sweden   2015    9779426   9703247 |
       |------------------------------------|
12743. | Sweden   2016          .   9779426 |
12744. | Sweden   2017          .         . |
       +------------------------------------+

For Sweden in 2011, the variable "unna_pop" has the value 9462352 persons. In the following year, 2012, the variable "lag_pop" has that same value, 9462352. As planned! The good thing is that Stata has not simply shifted all observations down one row, but takes into account which observation belongs to which country. One country's last year is not assigned to the next country's first.
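The same logic applies to the other time series operators. As a small sketch, the example below uses a made-up two-unit panel (the names id and pop are only for illustration) to create a lag, a lead, a first difference and a two-period lag.

```stata
* Made-up two-unit panel (illustration only)
clear
input id year pop
1 2000 100
1 2001 110
1 2002 125
2 2000 50
2 2001 55
2 2002 53
end
xtset id year

generate lag_pop  = L.pop    // value in the previous year
generate lead_pop = F.pop    // value in the following year
generate diff_pop = D.pop    // change since the previous year
generate lag2_pop = L2.pop   // value two years earlier

list, sepby(id)
```

Each operator respects the panel boundaries: the first year of a unit gets a missing lag, and the last year gets a missing lead, rather than borrowing a value from the neighboring unit.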

Finally

It is sometimes a bit tricky to set up panel data the right way, so that Stata understands how to deal with the data. It is important that the data is in long form, so that each row is a country-year, and that we have a separate variable that shows which year the row corresponds to. With the command reshape we can transform the data from long to wide, or from wide to long. There will be a separate guide for this command.