Simple descriptive statistics

The first thing one should do when working with a dataset is to get acquainted with the relevant variables. How many observations do we have information about? What is the distribution of the values? What are the measures of central tendency, such as the mean or the median? What are the minimum and maximum values?

We need this information both to improve and to interpret our analyses, and it also makes it easier to put the results in context when writing about them. The empirical sections of theses or scientific papers often (at least in the social sciences) begin with some simple descriptive statistics.

There are many commands in Stata for producing such statistics. Here we will discuss some of the simplest and most useful.

In the examples we will work with the QoG basic dataset, which has information about the world's countries. Below I load the data directly from the web page, but we can also download the data and load it from the computer instead (generally advisable).

use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear

(Quality of Government Basic dataset 2018 - Cross-Section)
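
To work from a local copy instead, one option is to download the file once and then load it from disk. The following is a minimal sketch, assuming the working directory is writable; the local file name is simply chosen here for illustration.

```stata
* Download the file once into the current working directory.
* The local file name is an assumption made for this example.
copy "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta" "qog_bas_cs_jan18.dta", replace

* Load the local copy; clear drops any data currently in memory.
use "qog_bas_cs_jan18.dta", clear

* Quick overview of the number of observations and variables.
describe, short
```

Loading from disk means the analysis does not depend on the web address staying available.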

Codebook

The command codebook gives an overview of a variable: how many observations have valid values, what the labels of the values are (if there are any), and more. Here we will try it out on the variable "fh_status", which shows how free a country is, according to the American organization Freedom House.

codebook fh_status

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fh_status                                                                                                                                                                                                                                        Freedom Status
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  lblfhstatus

                 range:  [1,3]                        units:  1
         unique values:  3                        missing .:  0/194

            tabulation:  Freq.   Numeric  Label
                            89         1  Free
                            54         2  Partly Free
                            51         3  Not Free

After "type" we can see that it is a numeric variable, meaning that the values are numbers, not text strings. However, the numbers represent qualitative assessments, such as "Free" or "Partly Free".

The range is "[1,3]", which means that the variable has values that range from 1 to 3. There are also three unique (different) values in the dataset.

0 out of 194 observations have a "missing" value, which is good. It means that we have information on all the countries in the dataset.

Finally, we also get a frequency table, which shows how many observations have each value (in the column "Freq.") and the label of that value (for instance "Free").

If we use the command codebook on a variable that has many more unique values, the output looks a little different. We can for instance try it on a variable that shows GDP per capita, "gle_rgdpc".

codebook gle_rgdpc

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
gle_rgdpc                                                                                                                                                                                                                            Real GDP per Capita (2005)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [285.95,95696.97]            units:  .01
         unique values:  192                      missing .:  2/194

                  mean:   12596.3
              std. dev:   15803.7

           percentiles:        10%       25%       50%       75%       90%
                           1131.48   2297.41   6955.53   17127.8   32266.6

Much is the same, but here we also get the mean, the standard deviation, and different percentiles. The 50th percentile (6955.53) is also the median: half of the countries have a GDP per capita below 6955.53, and half have one above it.
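
When we want an overview of several variables at once, or want to know which observations are missing, a sketch like the following may help. It assumes the QoG data loaded above is still in memory, and that cname is the country-name variable in this dataset (an assumption, since it does not appear in the output above).

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* One compact line per variable: obs, unique values, mean, min, max, label.
codebook fh_status gle_rgdpc, compact

* Count the observations that lack a value for GDP per capita.
count if missing(gle_rgdpc)

* List which countries they are. cname is assumed to be the
* country-name variable in this dataset.
list cname if missing(gle_rgdpc)
```

The count should match the "missing" figure reported by codebook for the same variable.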

Summarize (sum)

The summarize command, which can also be abbreviated to sum, gives a little less information, and is best used on continuous variables where the mean is of interest, like GDP per capita. The nice thing about the command is that we can enter several variables at once, and get a compact table out of it. An example:

sum gle_rgdpc gle_pop wdi_poprul wdi_popurb

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   gle_rgdpc |        192     12596.3     15803.7     285.95   95696.97
     gle_pop |        192    35888.69    135162.5         10    1324353
  wdi_poprul |        193    43.20123    23.53091          0      91.45
  wdi_popurb |        193    56.79877    23.53091       8.55        100

In regression analysis we usually employ variables of this type, which makes the sum command useful. In theses or scientific papers the tables that show descriptive statistics often include these figures: number of observations, mean, standard deviation, min, and max.
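
The default summarize table does not show the median, which can matter for skewed variables: for GDP per capita, the codebook output above reported a mean (12596.3) far above the median (6955.53). As a sketch, assuming the QoG data is still in memory, the detail option and the tabstat command are two ways to get more statistics.

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* detail adds percentiles (including the median), skewness and kurtosis.
summarize gle_rgdpc, detail

* tabstat lets us pick the statistics; columns(statistics) puts
* one variable per row, similar to a descriptive table in a paper.
tabstat gle_rgdpc gle_pop wdi_poprul wdi_popurb, statistics(n mean sd min p50 max) columns(statistics)
```

Which statistics to request is a judgment call; the list given to statistics() here is only one reasonable choice.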

Tabulate (tab)

For categorical variables, that is, variables with discrete steps (without decimals), the mean is generally not very informative. The variable "fh_status", which we looked at previously, is one of them. For these variables it is generally better to use a frequency table, which lists the different values and the percentage of observations that fall into each category.

We do that with the command tabulate, abbreviated tab.

tab fh_status

    Freedom |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
       Free |         89       45.88       45.88
Partly Free |         54       27.84       73.71
   Not Free |         51       26.29      100.00
------------+-----------------------------------
      Total |        194      100.00

We could also see the number of observations with each value using codebook, but we are generally more interested in the percentage than in the absolute number. Now we can see that almost 46% of the world's countries are categorized as "Free", while 26% are categorized as "Not Free". The final column shows the cumulative percentage, which we get by adding the percentages for the different categories together, from top to bottom. For instance, we can see that 73.71% of the countries are either "Free" or "Partly Free".
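
Two options to tabulate are often handy, and the cumulative percentage can be checked by hand from the frequencies. A minimal sketch, assuming the QoG data is still in memory:

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* missing adds a row for missing values, if there are any.
tabulate fh_status, missing

* nolabel shows the underlying numeric codes (1, 2, 3) instead of labels.
tabulate fh_status, nolabel

* The cumulative percentage for "Partly Free" by hand:
* (89 Free + 54 Partly Free) out of 194 countries.
display %5.2f 100 * (89 + 54) / 194
```

The last line displays 73.71. Adding the rounded percentages from the table (45.88 + 27.84) gives 73.72 instead; the small difference is due to rounding, since Stata computes the cumulative figure from the frequencies.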

Conclusion

Which descriptive statistics we need to present to the reader depends, as does everything else, on what the research question is. All information that is necessary to interpret the results properly should be included; everything unnecessary should be left out.

It is often also useful and pedagogical to show the descriptive statistics in graphs, for instance using histograms, which show the distribution of a variable. See separate guides for graphs.
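
As a small taste of what such a graph looks like in code, here is a sketch of two histograms, assuming the QoG data is still in memory. The choice of options is only one possibility.

```stata
* Assumes the QoG dataset loaded earlier is still in memory.
* Histogram of GDP per capita; frequency puts counts on the y-axis.
histogram gle_rgdpc, frequency

* Histogram for a categorical variable; discrete gives
* each value its own bar.
histogram fh_status, discrete frequency
```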

Also remember that the commands shown here have many more features. Just type help followed by the name of the command, for instance help summarize, to access the full documentation.