Crosstabs¶

A statistical analysis should of course be correct, but it should also ideally be explainable in an easy way. If we can use a simple analysis, without too big loss of analytical power, it is generally preferable.

One of the simplest methods for investigating relationships are crosstabs. The upside is that they are easy to do and fairly easy to understand. The downside is that is cumbersome (but not impossible) to account for more than two variables. If we want to do that we are better off using regression analysis. But crosstabs might be useful for a quick overview of a relationship.

We will in this example use the QoG basic dataset. I have here entered the web path to the dataset directly, but you can of course download it to your computer and load it from there.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_std_cs_jan18.dta", clear

(Quality of Government Standard dataset 2018 - Cross-Section)


A simple crosstab¶

We will look closer at the relationship between a country's level of democracy and its electoral system - more specifically whether the country has proportional representation (PR) or not. PR means that seats in the parliament are distributed in proportion to the vote shares of the parties, like in Sweden. In contrast, representatives to the American congress are elected in single-member districts, where the person that gets the most votes wins. Among other things, this means that a party that gets somewhat more votes still can get a big majority in parliament. Some research also suggests that PR would be better for democracy.

A crosstab cannot prove cause or effect, but we can at least see whether countries that have PR also are more democratic, or not.

As our indicator of democracy we will use fh_status, that shows how free a country is, according to the american organization Freedom House. To measure PR we use the vairable dpi_pr from Database of Political Institutions.

The command we use is called tabulate, usually abbreviated tab. We first write the command, then the variable we want in the rows, and then the variable we want in the columns. I generally prefer to have the independent variable in the rows.

In [4]:
tab dpi_pr fh_status

Proportion |
al |
Representa |          Freedom Status
tion |      Free  Partly Fr   Not Free |     Total
-----------+---------------------------------+----------
0 |        17         21         22 |        60
1 |        54         30         17 |       101
-----------+---------------------------------+----------
Total |        71         51         39 |       161


Here we get a spread of the 161 countries for which we have data, divided according to the six different combinations of values. 17 countries have a zero on dpi_pr (meaning that they don't have PR) and are also classified as "Free" according to Freedom House. 21 of the countries without PR are "Partyle free" and 22 "Not free".

54 countries have a one on dpi_pr and are also "Free", while only 17 have PR and are "Not free".

On the edges we see the total numbers. We can there see that many more countries have PR than that don't, 101 compared to 60. This makes comparisons in the table difficult. If we want to make the claim that PR is related to better democracy, we cannot just say that the number of countries that have PR and are Free is bigger than the number of countries that don't have PR and are free. In the same way, it would be strange to say that Chinese people are richer than Swedes just because there are more billionaires in China - they are many more to begin with. The shares are much more interesting.

Percentages¶

To get the shares we need to calculate percentages, which we can do in three different ways. As is evident from the table above we have three different totals:

The total number of countries in the table: 161.
The total number of countries that have and don't have PR: 101 and 60.
The total number of countries with different levels of freedom: 71, 51 and 39.

Each cell's share of the total, 161, is the cell percentage.
Each cell's share of the row total, 60 and 101, is the row percentage.
Each cell's share of the column total, 71, 51 and 39, is the column percentage.

It is important to remember the meaning of calculating the percentages in different directions. In the table above, if we take the row percentage we get the share of PR countries that are free, and the share of non-PR countries that are Free. If we instead calculate the column percentage we get the share of Free countries that have PR, and the share of non-PR countries that have PR, and so on. This is not the same thing! To draw conclusions about the possible effect of PR, we need to calculate row percentages in this case.

My principle is that I always put the independent variable - the variable I think is the cause - in the rows, and then I always calculate the row percentage, to get the most sensible results. But you can of course do as you find easiest to undersetand.

We get the percentages by adding an option and writing row, col or cell.

In [9]:
tab dpi_pr fh_status, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

Proportion |
al |
Representa |          Freedom Status
tion |      Free  Partly Fr   Not Free |     Total
-----------+---------------------------------+----------
0 |        17         21         22 |        60
|     28.33      35.00      36.67 |    100.00
-----------+---------------------------------+----------
1 |        54         30         17 |       101
|     53.47      29.70      16.83 |    100.00
-----------+---------------------------------+----------
Total |        71         51         39 |       161
|     44.10      31.68      24.22 |    100.00


The row percentages sum to 100 to the far right. The three categories Free, Partly free and Not free add up to 100 percent, in each row.

This means that we can compare the different rows with each other. 28.33% of the countries without PR are classified as Free, while 53.47% of the countries with PR are classified as Free. That is a big difference. We get the reverse relationship if we look at the category Not free. A smaller share of the PR countries are Not free.

As mentioned earlier - we cannot draw conclusions about causality from this. It might be the case the freer countries are more likely to adopt proportional representation. To get closer to drawing conclusions about cause and effect we need to apply other analyses. But this crosstab makes it more interesting to proceed, now that we have established that there is a relationship to begin with.

We will also look at the column percentages, for comparison:

In [10]:
tab dpi_pr fh_status, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

Proportion |
al |
Representa |          Freedom Status
tion |      Free  Partly Fr   Not Free |     Total
-----------+---------------------------------+----------
0 |        17         21         22 |        60
|     23.94      41.18      56.41 |     37.27
-----------+---------------------------------+----------
1 |        54         30         17 |       101
|     76.06      58.82      43.59 |     62.73
-----------+---------------------------------+----------
Total |        71         51         39 |       161
|    100.00     100.00     100.00 |    100.00


The table now sums to 100 percent in each column. The number of countries is of course the same, and the percentages reflect that, but we cannot make the same comparisons that we did above. In the column Free we now see that we have 76% in the PR row, and only 24% in the non-PR row. But this does not prove that it is three times as common for countries to be free if they have PR. It only shows that many more countries have PR, overall. If we want to say something about the likelihood of being free, depending on whether the country has PR or not, we need to look at the row percentage.

What we can learn from this table is whether it is more common to have PR among countries that are Free or Not free. And we can indeed se that PR is more common in the category Free, and is less common among Not free countries.

This question is similar, but not identical to the question of whether it is more likely to be free depending on whether you have PR or not. And to not lose track it is a good idea to have an hypothesis or idea about what causes what. This analysis cannot prove causality, but it helps to structure the thinking and the presentation.

Conclusion¶

Even if crosstabs don't say anything about causality they still serve a purpose, when we want to introduce a relationship before we proceed with more advanced analysis. It is however almost always necessary to calculate percentages to make the relationship clearer. And when we calculate percentages we need to make sure that we are calculating them "in the right direction."

Crosstabs are only useful for variables with a limited number of values, for instance categorical variables. Continuous scales - like a country's GDP or a person's age - leads to tables with hundreds of cells, impossible to overlook. In those cases it is better to display the results graphically, or to calculate averages in different groups.