Regression analysis with logarithmic variables

In another guide we discussed how to create logarithmic variables, and what they mean. Here we will instead focus on how to use them in regression analysis, and what to keep in mind when interpreting the coefficients.

We will use the same data as in the other guide, that is, the QoG Basic dataset (version 2018). The code snippet below loads the data and creates a new variable, ln_gdpc, which is the natural logarithm of GDP per capita (gle_rgdpc).

In [1]:
use "https://www.qogdata.pol.gu.se/dataarchive/qog_bas_cs_jan18.dta", clear
gen ln_gdpc = ln(gle_rgdpc)
(Quality of Government Basic dataset 2018 - Cross-Section)

(2 missing values generated)

Logarithmic variable as independent

We will use this new variable as our independent variable, and life expectancy as the dependent. The idea is that higher GDP per capita is associated with longer life expectancy - for instance because higher national incomes can be used to improve infrastructure and health care.

In the code below we run two analyses in separate models: one with actual GDP per capita and one with the log-transformed variable. The raw output is suppressed by adding quietly in front of the regression command. Thereafter we save the results with estimates store, and finally present both models together in a table made with the esttab command (see separate guide), to make the results easier to compare.

In [2]:
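* Model 1: GDP per capita in dollars
quietly reg wdi_lifexp gle_rgdpc
estimates store m1

* Model 2: natural logarithm of GDP per capita
quietly reg wdi_lifexp ln_gdpc
estimates store m2

* Show both models side by side, including R-squared
esttab m1 m2, r2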





--------------------------------------------
                      (1)             (2)   
               wdi_lifexp      wdi_lifexp   
--------------------------------------------
gle_rgdpc        0.000346***                
                  (11.00)                   

ln_gdpc                             5.082***
                                  (18.37)   

_cons               67.17***        27.17***
                 (111.73)         (11.17)   
--------------------------------------------
N                     183             183   
R-sq                0.401           0.651   
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

The main impression from both analyses is of course the same (it is the same underlying data): there is a positive and significant relationship. People live longer in richer countries. But we can see that the $R^2$ value is higher in the second model, with the logarithmic variable (0.651 compared to 0.401), which indicates that it fits the data better.
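One way to see why the logarithmic model might fit better is to plot both versions of the relationship. The sketch below assumes that the data and the ln_gdpc variable from above are still in memory; the graph names g_raw and g_log are only illustrative labels.

```stata
* Life expectancy against GDP per capita in dollars, with a fitted line
twoway (scatter wdi_lifexp gle_rgdpc) (lfit wdi_lifexp gle_rgdpc), name(g_raw, replace)

* The same plot with the log-transformed variable
twoway (scatter wdi_lifexp ln_gdpc) (lfit wdi_lifexp ln_gdpc), name(g_log, replace)
```

If the first plot shows a curve that flattens out at high incomes, while the second looks closer to a straight line, that would be consistent with the higher $R^2$ in model 2.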

But how should we interpret the coefficients? In model 1, with the regular variable, it is easy. The coefficient shows how much the dependent variable is expected to change when the independent variable increases by one unit. If GDP per capita increases by one dollar, life expectancy increases by 0.000346 years. Not a lot, but neither is one dollar. An increase of 1,000 dollars corresponds to about 0.35 years.

In the other model, we need a different interpretation. Technically, it is the same thing: the coefficient shows that if we increase the logarithmic variable by one unit, life expectancy is expected to increase by 5.082 years. But what does this mean? To reinterpret it in more concrete terms, we can divide the coefficient by 100, which gives 0.05082. This is approximately the increase in life expectancy, in years, if we increase GDP per capita by one percent compared to what it was previously.

Why divide the coefficient by 100? Can we not just say that life expectancy will go up by 5.082 years if we increase GDP per capita by 100 percent?

The answer is that the rate of change only holds at a specific point. The larger the step we take, the less precise the approximation becomes. To increase the logarithm of GDP per capita by a whole unit, we need much more than a 100 percent increase.
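To make this concrete, here is a short worked calculation based on the coefficient from model 2. The symbols are introduced only for this illustration: $b$ is the coefficient, $x$ is GDP per capita, $y$ is life expectancy and $p$ is the percentage increase in GDP per capita. Because the model is linear in the logarithm, the predicted change in life expectancy is:

$$
\Delta y = b \cdot \left[\ln\left(x \cdot (1 + p/100)\right) - \ln(x)\right] = b \cdot \ln(1 + p/100)
$$

With $b = 5.082$ this gives:

$$
\begin{aligned}
p = 1: &\quad 5.082 \cdot \ln(1.01) \approx 5.082 \cdot 0.00995 \approx 0.0506 \\
p = 100: &\quad 5.082 \cdot \ln(2) \approx 5.082 \cdot 0.693 \approx 3.52
\end{aligned}
$$

For a one percent increase, the exact answer (0.0506) is very close to the rule of thumb (0.0508). A doubling of GDP per capita, however, is associated with roughly 3.5 more years, not 5.082.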

An analogy

The reason is compound interest, that is, interest on interest. Imagine that you have 100 dollars that you want to deposit in the bank and keep there for 100 days. The bank (which is a very good bank) lets you choose among the following interest plans:

Alternative 1: 100 percent interest every 100th day
Alternative 2: 10 percent interest every 10th day
Alternative 3: 1 percent interest every day

Which is best? The alternatives might seem equivalent. But that is not the case. With alternative 1, you would have 200 dollars after the hundred days. First, nothing happens for 99 days, and then you get 100 dollars.

With alternative 2 you would have 110 dollars after 10 days, and the next time you get interest, it is also calculated on the 10 additional dollars you received the last time. If you follow this plan, you will have 259 dollars after 100 days!

And with alternative 3 you would have even more opportunities to get interest on your interest payments, which means that after 100 days you would have 270 dollars. If you got an interest rate of 0.1%, paid 10 times each day, you would improve your earnings even more (but only to just under 272 dollars).
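The amounts in the analogy can be checked in Stata with display, which works as a calculator. This is only a sketch of the arithmetic; the %9.2f format rounds the result to two decimals.

```stata
* Alternative 1: 100 percent, paid once
display %9.2f 100 * (1 + 1)^1

* Alternative 2: 10 percent, paid 10 times
display %9.2f 100 * (1 + 0.10)^10

* Alternative 3: 1 percent, paid 100 times
display %9.2f 100 * (1 + 0.01)^100

* 0.1 percent, paid 1000 times
display %9.2f 100 * (1 + 0.001)^1000
```

The results should come out at roughly 200, 259.37, 270.48 and 271.69 dollars. Each refinement adds less than the previous one.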

To increase life expectancy by 5.082 years thus requires that we increase GDP per capita by 1 percent, a hundred times. Or, if we want to be even more exact, by one tenth of a percent, 1000 times. In practice, this means that we need to increase it so that it is 2.71828 times as large as it was before. Do you recognize the number? It is the number $e$, which is the base of the natural logarithm we used to construct the variable.
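In formula form, and as a sketch rather than a full proof, the compounding argument is the standard limit definition of $e$, where $n$ (a symbol introduced here) is the number of steps:

$$
\lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^{n} = e \approx 2.71828
$$

And since $\ln(e \cdot x) = \ln(e) + \ln(x) = 1 + \ln(x)$, multiplying GDP per capita by $e$ raises its natural logarithm by exactly one unit, which is the change that the coefficient of 5.082 refers to.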

Logarithmic variable as dependent

What if we have a logarithmic dependent variable? Then we need to think a bit differently. Imagine, for instance, that we want to investigate how the logarithm of GDP per capita is associated with the level of corruption, measured with the variable ti_cpi, where higher values indicate less corruption. We then run the following regression (with raw output suppressed, and then presented with esttab).

In [3]:
quietly reg ln_gdpc ti_cpi
esttab


----------------------------
                      (1)   
                  ln_gdpc   
----------------------------
ti_cpi             0.0463***
                  (13.09)   

_cons               6.695***
                  (40.04)   
----------------------------
N                     179   
----------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

The coefficient for the corruption variable is 0.0463. In this case we can multiply the coefficient by 100 to get the approximate expected change in percent in the dependent variable if we increase the independent variable by one unit. For each step up on ti_cpi, we get an increase in GDP per capita of about 4.63% compared to what GDP per capita was before.
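Multiplying by 100 is an approximation that works well for small coefficients. The exact percentage change is $100 \cdot (e^{b} - 1)$, where $b$ is the coefficient, and it can be computed with display. The sketch below types in the coefficient from the table by hand, so it is rounded to the digits shown there.

```stata
* Approximate percentage change: 100 * b
display %9.2f 100 * 0.0463

* Exact percentage change: 100 * (exp(b) - 1)
display %9.2f 100 * (exp(0.0463) - 1)
```

This should give about 4.63 and 4.74 percent. The gap is small here, but it grows quickly with larger coefficients.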

Logarithmic variables as both dependent and independent

The simplest case is when both the dependent and the independent variable are on logarithmic scales. Then we can interpret the coefficient as the expected change in percent in the dependent variable when the independent variable is increased by one percent. As an example, we look at the relationship between the logarithm of population and the logarithm of GDP (not per capita).

In [7]:
gen ln_gdp = ln(gle_gdp)
gen ln_pop = ln(gle_pop)

quietly reg ln_gdp ln_pop
esttab, r2
(2 missing values generated)

(2 missing values generated)



----------------------------
                      (1)   
                   ln_gdp   
----------------------------
ln_pop              0.942***
                  (22.41)   

_cons               2.321***
                   (6.23)   
----------------------------
N                     192   
R-sq                0.726   
----------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

An increase in the population of one percent is associated with an increase of GDP by 0.942%.
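As in the other cases, the one-percent reading is an approximation that is very good for small changes. Here is a short worked calculation, with $b$ for the coefficient and $p$ for the percentage increase in population (symbols introduced here for illustration). Since the logarithm of GDP changes by $b \cdot \ln(1 + p/100)$, GDP itself is multiplied by $(1 + p/100)^{b}$:

$$
\begin{aligned}
p = 1: &\quad 1.01^{0.942} \approx 1.0094 \quad (+0.94\%) \\
p = 100: &\quad 2^{0.942} \approx 1.92 \quad (+92\%)
\end{aligned}
$$

So a doubling of the population is associated with roughly 92 percent higher GDP, slightly less than a doubling.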

Conclusion

It is often a good idea to use logarithmic variables in regression analyses when the data is continuous but skewed. But it is important to interpret the coefficients in the right way. Here is a table that shows the correct interpretation for four different scenarios:

| Dependent | Independent | Interpretation of the b-coefficient |
| --- | --- | --- |
| Regular | Regular | How many scale units the dependent changes when we increase the independent by one unit. |
| Regular | Logarithmic | Divide the coefficient by 100: How many scale units the dependent changes when we increase the independent by one percent. |
| Logarithmic | Regular | Multiply the coefficient by 100: How many percent the dependent changes when we increase the independent by one unit. |
| Logarithmic | Logarithmic | How many percent the dependent changes when we increase the independent by one percent. |