STATISTICS HELP | SV/EN

Create a dataset and import data

Sometimes we can use a prepared dataset when we do our analyses. Other times we collect our own data, or use someone else's data, but it isn't in Stata format.

In these two last cases we need to construct our own dataset, or import data into Stata.

Create a dataset from scratch

To create a dataset in Stata is easy. We just click the button "Data editor", the symbol with the pen. Then we can just start to enter our data in the cells, or paste it, from Excel for instance.

The first value we enter determines the type of variable. If it is a number the variable becomes a numeric, and if it is a piece of text the variable turns into a "string variable", a text variable. We cannot mix numbers and text. If we do, the variable must be a string variable, and the numbers will be treated like any other text, and cannot be analyzed statistically.

Each observation - what we're comparing - should have its own row in the dataset. If it's countries, each country should have its own row. If it's persons, each person should have its own row, and so on.

Each variable gets its own column. In the QoG dataset, which I use in a lot of examples, the observations are countries, and the variables are country characteristics.

New variables are automatically named "var1", "var2" and so on. If we want to change the name, for instance to "population", we can rename the variable with the rename command, usually abbreviated ren. We then write the command, the old name of the variable, and then the new name:

In [8]:
ren var1 population

There are some naming rules: Don't use special letters, don't start the variable name with a number, and don't use spaces.

It is not harder than that to create a dataset! We just have to be a bit patient if we're entering the data manually.

Import data to Stata

A common situation when working with data is that we have data somewhere, but not in Stata format. Often it is in Excel or .csv format.

If it is in Excel format I usually create an empty dataset (either by restarting Stata or by writing the command clear) and paste all the data into the data editor. We can usually get the variable names as well, if there are any in the Excel document. Sometimes, however, there are some problems with the names, and Stata turns all variables into text. If that happens, select everything but the names in Excel, and paste all the values into the data editor. Then rename the variables one by one with the help of the rename command.

However, when the dataset is large the copy and paste function might cause some problems. Then it might be better to save the Excel data in a .csv format. Csv stands for "comma separated values" and are text files where the data matrix is described. The variable values are sepratade by commas (hence the name) or semicolons.

Many statistical services provide data in CSV format. We can for instance get a lot of interesting data from Statistics Sweden.

To import CSV data we use the command insheet. We will now try to import data on the population in the counties of Sweden.

We first type the command name, followed by "using" and then the path to the file to be imported.

In [11]:
insheet using "/Users/anderssundell/Dropbox/Jupyter/Stathelp/data/countypop.csv", clear names
(2 vars, 21 obs)

There are two options to the command. The option clear means that we should throw out any data that was currently loaded in Stata. names means that we have the variable names on the first row in the CSV file, and that Stata should turn those into variable names.

If the separator in the file had been something else than commas we could have accounted for that with some options. If it is tab separated data we just add the option tab. If the separator is a semicolon we add the option delimiter(";").

We can now look at the firt five rows in the data by typing:

In [12]:
list * in 1/5
     +----------------------------------+
     |                 region        v2 |
     |----------------------------------|
  1. |    01 Stockholm county   2344124 |
  2. |      03 Uppsala county    376354 |
  3. | 04 S�dermanland county    294695 |
  4. | 05 �sterg�tland county    461583 |
  5. |    06 J�nk�ping county    360825 |
     +----------------------------------+

There are two variables: region and v2. The second variable was actually called "2018" in the original file, but Stata does not allow variable names to start with a number, and therefore changed the variable name automatically. We can change the name to something more pedagogical, like "population2018" with the rename command:

In [13]:
ren v2 population2018
list * in 1/5


     +-----------------------------------+
     |                 region   pop~2018 |
     |-----------------------------------|
  1. |    01 Stockholm county    2344124 |
  2. |      03 Uppsala county     376354 |
  3. | 04 S�dermanland county     294695 |
  4. | 05 �sterg�tland county     461583 |
  5. |    06 J�nk�ping county     360825 |
     +-----------------------------------+

One problem here is that the letters å, ä and ö have become errors in the import. That often happens with special letters, and it is a reason why I often assign variable names in english, even when working with Swedish data, and generally try to avoid special letters.

Conclusion

Creating a dataset in Stata is not hard. Many times we can just paste the data into the data editor. But when that fails, importing CSV is a robust way. Just make sure you have the right separator.

In [ ]: