Svensk version | Front page

A common test for investigation how one group differs from another, or from a reference value, is the t-test. We can use the to find out if the difference is statistically significant, that is, if such a difference is very unlikely to obtain, had the null hypothesis been true.

A typical case is when we want to determine whether two samples have different values on some variable of interest. For instance two groups in a survey sample. Is it likely that the differences we observe in our sample also are present in the population from which we drew the sample?

We can also use the test to compare treatment and control groups in an experiment. To determine whether the treatment had an effect, we check whether the difference is so large that it is improbable that it was chance that gave rise to the difference, using the t-test.

In this example we will use data from the american General Social Survey, a survey with ordinary people, with questions about all sorts of things. The respondents have been sampled at random from the american people. We will use the version from 2016. Download it to follow along in the example. The code below loads the dataset from my computer.

In [1]:
use "/Users/anderssundell/Dropbox/Jupyter/stathelp/data/GSS2016.dta", clear

Research question: Do men work more than women?

We will take a closer look at the variable hrs1, which shows how many hours the respondent worked last week. The average is 40.9 hours. But how does it look, broken down on women and men? We can use the variable sex, where men have the value 1 and women the value 2 to find out.

Before we proceed to the actual t-test we look at the mean values. It is not strictly necessary to do this step, but I think it is easier to focus on one thing at a time - first the mean difference, then the significance test.

We use the table command to get the mean values:

In [2]:
table sex, contents(mean hrs1)
responden |
ts sex    | mean(hrs1)
     male |    44.1604
   female |     37.904

The results are clear: American men on average work about 6.2 hours more than women each week. Quite expected given the typical division of labor in the family.

This analyses is however only based on 792 men and 854 women. And in both groups we find individuals that do not work at all, and those that work more than 89 hours each week (which is the highest response option). Can we really be sure that the difference in working hours we see reflects a real difference in American society, or is it just that we happened to get a few male workaholics in our sample, just by chance? To answer this question we use the t-test.

t-test of two groups

Now let us do a thought experiment. We assume that there are no differences in working hours between men and women, out in the american population. This is our null hypothesis. If that is the case, and we do a random sample of 1646 individuals, how often would we get a difference among men and women this big, just due to chance?

If that happens a lot we would be unwise to draw far-reaching conclusions from the difference in our sample. But if it turns out that such differences are very unlikely to arise by chance, for instance one time out of 20, then we can be reasonably sure that the gender difference we now observe (men work six hours more) are actually indicative of a real difference in the population.

The measure of how often a difference of this size is caused by random sampling is the p-value, the significance value. In general our decision rule is that differences that only happen one time out of 20 are statistically significant, which means that we use the 95% confidence level. In practice, it means that we will look for p-values that are smaller than 0.05.

We will now run the test in Stata. The command is called ttest, and the principle is that we write the command, the variable we want to test "hrs1" and then specify the variable that determines the groups, with the option by(sex).

In [3]:
ttest hrs1, by(sex)
Two-sample t test with equal variances
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    male |     792    44.16035    .5228357     14.7139    43.13404    45.18666
  female |     854    37.90398    .4598889    13.43946    37.00133    38.80663
combined |   1,646    40.91434    .3550878    14.40624    40.21787    41.61081
    diff |            6.256372    .6939482                4.895257    7.617488
    diff = mean(male) - mean(female)                              t =   9.0156
Ho: diff = 0                                     degrees of freedom =     1644

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

Here we get a lot of numbers at once, but we recognize the mean values in the column "Mean". The column "Obs" shows the number of persons in each group. "Std. Err." shows the standard error around the mean, that is, the uncertainty in our estimate. "Std. Dev." shows the standard deviation around the mean, a measure of the spread in the sample. "95% Conf. Interval" whos the confidence interval around the mean. Our confidence will cover the true mean in the population 95 percent of the times we redo the sampling, if we do it a lot of times. We generally only take one sample, but we thus have good reason to suspect that the true mean is somewhere in the confidence interval.

In each row we find the values first for men, then for women, then for the combined group, and then finally for the difference between the groups. The mean difference is 6.256.

The t-value is specified below the table, to the right. We can see that t = 9.0156.

But we were primarily interested in the tests of significance. We find those at the bottom, and the most important is in the middle It tests the hypothesis that there is a difference (positive or negative) between men and women. The p-value is here 0.0000. It means that if our thought experiment was true, that there was no difference between women and men in the population, we would get a difference this size in our sample less than one time out of 10,000. A more reasonable conclusion is therefore that there actually is a difference between women and men, also in the population!

Had the p-value been for instance 0.2534, it would have meant that we see a difference of this size 25 times out of 100, just due to randomness in the sampling, even if there is no difference in the population. In situations like that we cannot reject the null hypothesis - it might very well be the case that there is no difference.

One-sided tests

What we have done so far is called a two-sided t-test, and is normally what we want to use. But in some special cases it might be justified to use a one-sided t-test. They suit hypotheses about there being a difference in a certain direction. It might be relevant when we deal with processess where we can be sure that there can be no increase, or no decrease. This is generally not the case.

But if we for some reason want to use a one-sided test anyway, we find them to the left and right at the bottom. We start to the left. Here is the test to use if the hypothesis is that the difference is less than zero (in this case, that women work more than men). We will, as before, do a thought experiment and assume that the null hypothesis is true. If that is the case, how often will chance give us a sample in which the difference is lower than the one we observed (6.26)? The p-value is 1.0000. It means that we always get a value that is less than 6.26. The information is not very useful in this case. We cannot use it to reject any hypothesis.

To the right we instead find the test to use if we have an hypothesis that the difference is larger than zero (that man work more than women). If we assume that the null hypothesis is true, how often would chance give us a value that is higher than the one we saw (6.26)? The p-value is 0.0000, which means that it happens less than one time out of 10000. Not very probably. A better conclusions is that men work more than women, in the population.

The p-values will (if the result aligns with the hypothesis) to be half the size for the one-sided tests compared to the two-sided test, since we only need to take one kind of outcome into account. But we cannot decide whether to use a one-sided or two-sided test after looking at the results. We need to do it before, on theoretical grounds. And those are generally hard to find. The simple recommendation is therefore to use two-sided tests.

t-test against a fixed reference value

So far, we have compared two groups. But sometimes we want to compare the mean value in a group againt a fixed reference value. It might for instance be relevant to see whether a political party's support in an opinion survey is significantly higher than the threshold for parliament. In our current case, we can test whether the average American works more or less than 40 hours a week.

We then write the command, the variable we want to test, equal to the reference value. Like this:

In [4]:
ttest hrs1 == 40
One-sample t test
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    hrs1 |   1,646    40.91434    .3550878    14.40624    40.21787    41.61081
    mean = mean(hrs1)                                             t =   2.5750
Ho: mean = 40                                    degrees of freedom =     1645

    Ha: mean < 40               Ha: mean != 40                 Ha: mean > 40
 Pr(T < t) = 0.9949         Pr(|T| > |t|) = 0.0101          Pr(T > t) = 0.0051

Average working hours for the complete sample is thus 40.9. A little more, but quite close to 40 hours. But is it enough for us to say that Americans actually work more than 40 hours a week, or might it be due to randomness in the sampling?

In the middle t-test we see that the p-value is 0.0101. It means that if the null hypothesis - Americans work 40 hours a week - is correct, we would get a mean value of 40.9 in the sample about once every hundred samples. Not very often. We therefore instead draw the conclusions that Americans actually wokr more than 40 hourse each week. If we instead test anoteher reference value, for instance 41, the p-values change:

In [5]:
ttest hrs1 == 41
One-sample t test
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    hrs1 |   1,646    40.91434    .3550878    14.40624    40.21787    41.61081
    mean = mean(hrs1)                                             t =  -0.2412
Ho: mean = 41                                    degrees of freedom =     1645

    Ha: mean < 41               Ha: mean != 41                 Ha: mean > 41
 Pr(T < t) = 0.4047         Pr(|T| > |t|) = 0.8094          Pr(T > t) = 0.5953

The p-value is now 0.8094. We can therefore not say that the data in our sample supports the hypothesis that people work less than 41 hours every week. Our data is compatible with that null hypothesis.

Comparison with regression analysis

The command for the t-test is pretty easy to understand, and in certain fields of stydy it is cnoventional to use t-tests. In other fields regression analysis is more common. The two analyses however show the same thing, to a large extent. When we do a regression analysis the statistics software usually provides us with a t-test as well. The regression analysis shows what happens with the dependent variable when we increase the independent variable one step.

In this case, what happens with working hours if we increase the variable sex one step? That is, if we go from man (which has the value 1) to woman (which has the variable 2)? Let us try.

In [6]:
reg hrs1 sex
      Source |       SS           df       MS      Number of obs   =     1,646
-------------+----------------------------------   F(1, 1644)      =     81.28
       Model |  16084.1601         1  16084.1601   Prob > F        =    0.0000
    Residual |  325318.762     1,644  197.882458   R-squared       =    0.0471
-------------+----------------------------------   Adj R-squared   =    0.0465
       Total |  341402.922     1,645   207.53977   Root MSE        =    14.067

        hrs1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
         sex |  -6.256372   .6939482    -9.02   0.000    -7.617488   -4.895257
       _cons |   50.41673   1.109558    45.44   0.000     48.24043    52.59302

The interesting number is the coefficient in the row "sex", which is -6.26. If we go from man to woman, working hourse are changed by -6.26 hours. The same difference that we saw earlier (but not it is negative, since here we get the men's value minus the women's value. The t-value is -9.02, the same that we saw in the t-test, but no it is negative (because the coefficient was negative). The p-value is 0.000.

The row "cons" shows the intercept, that is, the value on the dependent variable when all independent variables are zero. If the variable sex has the value 0, which it cannot have. To find the mean value for men we need to calculate 50.4 - 6.25 = 44.15. The same value we saw earlier.

What we can do with the command ttest we can thus usually do with regas well. But it might be easier to read the table from ttet- the mean values are more clearly spelled out there.


The t-test is a common and useful test, which we can use when we want to see how probable or improbable it is that observed differences are caused by randomness in the sampling. Therefore, they are strictly speaking only useful when we work with random samples, or with random assignment in an experiment.

In practice, however, the tests are often used on other sorts of data, like observational data on the countries of the world. The test is then more an indication of whether the differences are large or not.