Mean values in different groups

Svensk version | Front page

A very intuitive and simple way to show a relationship between a categorical and a continuous variable is to calculate mean values (averages) of the continous variable for each value on the categorical variable.

For instace: What is the average income among men and women? What is the mean life expectancy in different countries? What is the level of trust in strangers among immigrants and natives? And so on.

More advanced statistial analysis is often necessary to rule out alternative explanations and strengthen our causal claims, but the presentation is almost always improved by first showing the main features of the relationship in a simple way - for instance with mean values or crosstabs.

We will in the examples work with the QoG basic dataset, which has information about the world's countries. Below I load the data directly from the web page, but we can also download the data and load it from the computer instead (generally advisable).

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
(Quality of Government Basic dataset 2018 - Cross-Section)

Using sum and bysort

In this example we will take a closer look at the relationship between a country's GDP per capita and its level of democracy. We will measure GDP with the variable "gle_rgdpc" and democracy with "fh_status", as categorized byt he organization Freedom House.

With the command summarize, which can also be abbreviated to sum we quickly get the mean value of GDP per capita in the entire dataset. We just type sum followed by the variable:

In [2]:
sum gle_rgdpc
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |        192     12596.3     15803.7     285.95   95696.97

On average, the 192 countries in the dataset have 12596.3 dollars in GDP per capita. But this is the mean value for the entire dataset, with democracies and dictatorships mixed.

To get at that difference we an use the bysort command. It is a sort of "pre-command", where we type bysort variable:, and then the main command. We then get the main command repeated once for each group of the variable we specified after bysort. Like this:

In [3]:
bysort fh_status: sum gle_rgdpc
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> fh_status = Free

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |         89    18777.71    17191.11     293.11   95696.97

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> fh_status = Partly Free

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |         54    5902.896    9931.133     430.73   56760.12

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-> fh_status = Not Free

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |         49    8745.244    14377.83     285.95   91787.54

We now get three analyses, where the sum command on the GDP variable first was run only on the 89 countries that are categorized as "Free", then on the 54 that are counted as "Partly free" and finally on the 49 countries that are categorized as "Not free".

The tables might look a bit messy, but we at least get the most important numbers. We can see that GDP per capita is highest in the free countries (18777.71), second highest in the not free countries (8745.244) and lowest in the partly free countries (5902.896). There are quite a few research papers written about this finding, but one thing to keep in mind is that there are some oil-rich countries among the worst dictatorships (oil incomes might even underpin the regimes).

We also got some bonus information (like the min, max and standard deviation), but we generally want to use commands that give us the information we are looking for, and not much more. In this case we just wanted the mean, and then there is a better method.

Using table

The table command (not to be confused with the tab command) is a flexible command that give us information about variables in different groups.

The structures is that we first specifiy for which variable we want to compare groups, and in the options of the command we then specify which information we want about the grups. If we want the mean of GDP per capita, for the groups of "fh_status" we type:

In [4]:
table fh_status, contents(mean gle_rgdpc)
----------------------------
Freedom     |
Status      | mean(gle_rg~c)
------------+---------------
       Free |       18777.71
Partly Free |       5902.896
   Not Free |       8745.244
----------------------------

table fh_status meant that we wanted the observations divided according to the different values of "fh_status". With the option contents() we specify what that information should be - in this case the mean of "gle_rgdpc". But the mean is not the only info we can get. Type help table for the full list, but some of the more useful are sd, median, max, min and count. The last one shows how many observations we have in each group.

We can specify up to five different informations, like this:

In [6]:
table fh_status, contents(mean gle_rgdpc sd gle_rgdpc min gle_rgdpc max gle_rgdpc count gle_rgdpc)
----------------------------------------------------------------------------------
Freedom     |
Status      | mean(gle_~c)  sd(gle_rg~c)  min(gle_r~c)  max(gle_r~c)  N(gle_rgdpc)
------------+---------------------------------------------------------------------
       Free |     18777.71      17191.11        293.11      95696.97            89
Partly Free |     5902.896      9931.133        430.73      56760.12            54
   Not Free |     8745.244      14377.83        285.95      91787.54            49
----------------------------------------------------------------------------------

This table is much more neat than what we got with the sum command, and contains the same information. In our reports or theses we should however never simply paste tables from Stata - they can alway be made more readable. But it is of course more convenient to start with something that looks more like the final product.

Means grouped according to two variables using table

The table command also lets us divide according to two or three variables, and show information in the smaller groups. We just add the variables after each other, before the comma sign that denotes that options. Below we divided the observations in the dataset not only according to the level of democracy, but also according to whether the country uses proportional representation or not in their electoral system "dpi_pr".

In [8]:
table fh_status dpi_pr, contents(mean gle_rgdpc count gle_rgdpc)
--------------------------------
            |    Proportional   
Freedom     |   Representation  
Status      |        0         1
------------+-------------------
       Free | 15483.75   19800.8
            |       17        54
            | 
Partly Free |  7853.29  4703.716
            |       21        30
            | 
   Not Free | 6351.004  5930.194
            |       22        16
--------------------------------

This way we can get more fine-grained information. But when we divide into more groups the number of observations in each group goes down, which we need to be observant of. It might therefore be good to use the count option. We can for instnace see that there are only 16 countries that are categorized as not free, while also having proportional representation. Our conclusions for that group will therefore be less certain than for the groups with 54 countries.

Conclusion

Our presentation of statistical results should always be as pedagogical and easy to understand as possible. To display the means in different groups is often a good idea, and gives easy numbers that the reader can use as a point of reference while thinking about more advanced analyses. The table command is a good way to quickly access the relevant information.