If qualifiers and conditions

By Anders Sundell

Almost all commands in Stata can be combined with so-called if qualifiers. These are conditions that tell Stata which observations should be included when the command runs. We might, for instance, want to recode only a subset of observations, or run an analysis on a small part of the dataset.

The conditions are built from a set of "logical operators," building blocks from which we can construct both simple and advanced conditions. The same operators are used in many other software packages. The operators are:

Operator   Meaning
==         Equal to
!=         Not equal to
>          Larger than
<          Less than
>=         Larger than or equal to
<=         Less than or equal to
&          And
|          Or

The last two, "and" and "or", can be used to link several conditions. If we work with data on persons, for instance, we can create a condition that requires the person to be 25 years old AND unemployed. Or we could use a condition to select people who are under 22 years old OR have never voted in a parliamentary election.

The if qualifier is entered in the command after the list of variables and before the comma that introduces any options.
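The two person examples and the placement rule can be tried on a small made-up dataset. This is only a sketch: the variables age, unemployed and voted and their five rows are invented for illustration and are not part of the QoG data used below.

```stata
* Hypothetical toy data: five persons, invented for illustration only
clear
input age unemployed voted
25 1 1
25 0 1
19 0 0
40 1 0
21 1 1
end

* AND: 25 years old and unemployed
count if age == 25 & unemployed == 1

* OR: under 22 years old, or has never voted
count if age < 22 | voted == 0

* The qualifier goes after the variable list and before the options
list age voted if age < 22, clean
```

With these five rows, the first count matches one person and the second matches three.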

With the aid of the QoG Basic dataset we will look at some examples of how these conditions can be used in a range of applications.

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
(Quality of Government Basic dataset 2018 - Cross-Section)

In descriptive statistics

If qualifiers help us select specific groups of observations. Let's say we want to look at the level of corruption in the world. To see the mean we can write:

In [2]:
sum ti_cpi
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      ti_cpi |        181    42.82476     19.5057          8         92

The mean is 42.8 (on a 0-100 scale, where 100 means the least corruption). But now say we want to do this for a smaller group of countries, such as the ones that are categorized as free according to Freedom House. These have the value 1 on the variable fh_status. We then add an if qualifier to the command:

In [3]:
sum ti_cpi if fh_status==1
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      ti_cpi |         77    57.25041    17.40723         30         92

The number of observations is now lower, 77 instead of 181. The mean is also higher: 57.3 instead of 42.8. Corruption is in general not as widespread in democratic countries.

We can also use if qualifiers to look at observations that have a value over or under some threshold, such as countries whose population (unna_pop) exceeds 50 million:

In [4]:
sum ti_cpi if unna_pop > 50000000
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      ti_cpi |         28    42.71429    17.87168         21         79
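One thing to keep in mind with > and >= is that Stata treats a missing numeric value as larger than any number, so a condition such as unna_pop > 50000000 is also true for countries where the population is missing. Whether that matters here depends on how many missing values unna_pop has, which the output above does not show. A cautious sketch, assuming the QoG data is still in memory, is to exclude missing values explicitly with the missing() function:

```stata
* Assumes the QoG data loaded above is still in memory

* Counts countries above 50 million AND countries with missing population
count if unna_pop > 50000000

* Counts only countries with a known population above 50 million
count if unna_pop > 50000000 & !missing(unna_pop)

* The same safeguard applied to the summary statistics
sum ti_cpi if unna_pop > 50000000 & !missing(unna_pop)
```

If the two counts are equal, no missing values slipped in and the shorter condition gives the same result.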

The if qualifiers work on virtually all commands, for instance the command list. We now want a list of all countries in Eastern Europe and the former Soviet Union, and whether Freedom House categorizes them as free, partly free or not free. We start with the variable ht_region, where these countries have the value 1. We will create a list of two variables, the country's name cname and its categorization fh_status. We also add the option clean, which removes the lines in the output. Options are added after the if qualifier.

In [5]:
list cname fh_status if ht_region==1, clean
                        cname     fh_status  
  2.                  Albania   Partly Free  
  7.               Azerbaijan      Not Free  
 14.                  Armenia   Partly Free  
 19.   Bosnia and Herzegovina   Partly Free  
 25.                 Bulgaria          Free  
 28.                  Belarus      Not Free  
 44.                  Croatia          Free  
 47.           Czech Republic          Free  
 57.                  Estonia          Free  
 63.                  Georgia   Partly Free  
 75.                  Hungary          Free  
 87.               Kazakhstan      Not Free  
 93.               Kyrgyzstan   Partly Free  
 97.                   Latvia          Free  
101.                Lithuania          Free  
114.                  Moldova   Partly Free  
115.               Montenegro          Free  
138.                   Poland          Free  
143.                  Romania          Free  
144.                   Russia      Not Free  
153.                   Serbia          Free  
157.                 Slovakia          Free  
159.                 Slovenia          Free  
171.               Tajikistan      Not Free  
179.             Turkmenistan      Not Free  
182.                  Ukraine   Partly Free  
183.                Macedonia   Partly Free  
190.               Uzbekistan      Not Free  

If we instead want to condense this list, we can display the information as a table of frequencies with the command tab, which shows how many countries are placed in each category.

In [6]:
tab fh_status if ht_region==1
    Freedom |
     Status |      Freq.     Percent        Cum.
------------+-----------------------------------
       Free |         13       46.43       46.43
Partly Free |          8       28.57       75.00
   Not Free |          7       25.00      100.00
------------+-----------------------------------
      Total |         28      100.00

In graphs

If qualifiers are very useful when we make graphs, especially with the command twoway, since this command makes it easy to add different layers of graphs on top of each other. Each layer can have its own set of conditions for which observations are included in the layer, but we can also create if qualifiers that are applied to the graph as a whole. In the graph below we present the relationship between corruption ti_cpi and ethnic fragmentation al_ethnic. We add an if qualifier for the entire graph, which limits the sample to either Western Europe and Northern America ht_region==5 OR Sub-Saharan Africa ht_region==4.

In [7]:
twoway (scatter ti_cpi al_ethnic) if ht_region==5 | ht_region==4
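When & and | appear in the same condition, Stata evaluates & before |, so parentheses are needed whenever the OR part should be evaluated first. The sketch below assumes the QoG data is still in memory; the threshold of 50 on ti_cpi is an arbitrary value chosen for illustration.

```stata
* Assumes the QoG data loaded above is still in memory

* Without parentheses, & is evaluated first ...
count if ti_cpi < 50 & ht_region==5 | ht_region==4

* ... so the line above means the same as this one:
* all of Sub-Saharan Africa, plus Western countries with ti_cpi under 50
count if (ti_cpi < 50 & ht_region==5) | ht_region==4

* Parentheses around the OR part give a different selection:
* countries in either region that have ti_cpi under 50
count if ti_cpi < 50 & (ht_region==5 | ht_region==4)
```

The same parenthesized condition can be placed after twoway (scatter ti_cpi al_ethnic) to restrict the graph as a whole.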

But we can also use if qualifiers within each layer. We will now create one layer where only Western Europe and Northern America are included, and one layer where only Sub-Saharan Africa is included. The benefit of doing so is that we can then modify the look of each layer. We set the color of the Western European and Northern American dots to blue, and those of Sub-Saharan Africa to red. Note that the if qualifiers here are within the parentheses, and thus only affect what is in that set of parentheses. But even here they are located before the options, as always.

In [8]:
twoway  (scatter ti_cpi al_ethnic if ht_region==5, mcolor(blue)) ///
        (scatter ti_cpi al_ethnic if ht_region==4, mcolor(red))

A very useful feature of this type of scatterplot is the ability to add a new layer with only a few select observations, which are then marked in the graph. For instance, we might want to show the location of Sweden and the United States. We then create a new layer, with a condition that the country name cname should be "Sweden" OR "United States", and in this layer we also say that the country names should be used as marker labels. We will also add an option to the entire graph, legend(off), which removes the legend at the bottom of the graph. Otherwise we would get one legend entry for each layer, which looks bad.

In [9]:
twoway  (scatter ti_cpi al_ethnic if ht_region==5, mcolor(blue)) ///
        (scatter ti_cpi al_ethnic if ht_region==4, mcolor(red)) ///
        (scatter ti_cpi al_ethnic if cname=="Sweden" | cname=="United States", mlabel(cname)) ///
        , legend(off)

In regression analysis

If qualifiers can be used in regression analysis (or other types of analyses) to run the analysis in a specific subgroup, or to eliminate certain outliers. First we can try to run a regression analysis on the relationship between corruption and ethnic fragmentation in Sub-Saharan Africa. Please note that only 46 observations are included in the analysis.

In [10]:
reg ti_cpi al_ethnic if ht_region==4
      Source |       SS           df       MS      Number of obs   =        46
-------------+----------------------------------   F(1, 44)        =      7.85
       Model |  967.759581         1  967.759581   Prob > F        =    0.0075
    Residual |  5421.89259        44  123.224832   R-squared       =    0.1515
-------------+----------------------------------   Adj R-squared   =    0.1322
       Total |  6389.65217        45  141.992271   Root MSE        =    11.101

------------------------------------------------------------------------------
      ti_cpi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   al_ethnic |  -19.91737    7.10718    -2.80   0.008    -34.24095   -5.593787
       _cons |   46.17676   4.949334     9.33   0.000     36.20203    56.15149
------------------------------------------------------------------------------

We can also eliminate specific outliers. Singapore, for instance, is a very special case when it comes to corruption. We can try to leave it out of the analysis, to make sure that the results are not affected too much by that one country. We use the "not equal to" operator, !=, to remove Singapore specifically.

In [11]:
reg ti_cpi al_ethnic if cname!="Singapore"
      Source |       SS           df       MS      Number of obs   =       172
-------------+----------------------------------   F(1, 170)       =     29.32
       Model |  9434.28622         1  9434.28622   Prob > F        =    0.0000
    Residual |   54703.246       170    321.7838   R-squared       =    0.1471
-------------+----------------------------------   Adj R-squared   =    0.1421
       Total |  64137.5323       171  375.073288   Root MSE        =    17.938

------------------------------------------------------------------------------
      ti_cpi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   al_ethnic |  -28.63565   5.288525    -5.41   0.000    -39.07528   -18.19601
       _cons |   55.72485   2.684605    20.76   0.000      50.4254    61.02431
------------------------------------------------------------------------------
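A quick way to confirm that the qualifier removed what we intended is to compare the number of observations with and without it. The sketch below assumes the QoG data is still in memory and relies on e(N), the stored result in which reg saves the number of observations used.

```stata
* Assumes the QoG data loaded above is still in memory

* Full sample
quietly reg ti_cpi al_ethnic
display "Observations, all countries:      " e(N)

* Same model without Singapore
quietly reg ti_cpi al_ethnic if cname != "Singapore"
display "Observations, Singapore excluded: " e(N)
```

The two numbers should differ by exactly one if Singapore has values on both variables. If they are equal, Singapore was already left out because of missing data, or the name is spelled differently in cname.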

Conclusions

If qualifiers can be used in a lot of ways. It is however important to remember that each part of the condition must work on its own even when you string several conditions together. If we for instance want to create a condition that chooses Sweden or the United States, we CANNOT write:

if cname == ("Sweden" | "United States")

Instead, we must write it as two conditions:

if cname == "Sweden" | cname == "United States"

The variable name must be included in both parts of the if qualifier. It is also important to bear in mind the difference between AND & and OR |. You could for instance be forgiven for thinking that we in the example above could write if cname == "Sweden" & cname == "United States" because both country names are acceptable. But Stata would then look for countries where the country name is both "Sweden" and "United States". A name cannot be both at the same time, so no observations would be chosen.
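The difference is easy to verify with count. The sketch below assumes the QoG data is still in memory. It also shows inlist(), a built-in Stata function that is not used elsewhere in this guide and is offered here only as an optional shorthand for a chain of OR conditions on the same variable.

```stata
* Assumes the QoG data loaded above is still in memory

* AND: no country name can equal both strings, so nothing matches
count if cname == "Sweden" & cname == "United States"

* OR, written out in full with the variable name in both parts
count if cname == "Sweden" | cname == "United States"

* The same selection with inlist(), which is true when the first
* argument equals any of the arguments that follow it
count if inlist(cname, "Sweden", "United States")
```

The first count is always zero, while the second and third select the same observations.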