Dictionary with short explanations

Here is a small dictionary of common terms in statistical analysis, with short explanations.

WORD LIST
GENERAL TERMS
Unit of analysis	The units we compare. For instance countries, individuals, years, stocks, municipalities, and so on.
Variable	Characteristic of the units of analysis, that varies. For instance height, age or party affiliation among individuals, or population, degree of democracy or climate among countries.
Variable value	The actual value a unit of analysis has on a specific variable.
Causality	Cause and effect. In a causal relationship one variable affects the other.
Measure of central tendency	A single value that summarizes the typical or central value in a dataset.
Distribution	How the units of analysis are distributed over the possible values. Can take any shape.
Normal distribution	One specific shape of distribution, where most values are found in the middle, with fewer out on the edges.
Spread	How the values are distributed around the central tendency.
Correlation	Degree of relatedness between two variables.
Covariance	Degree of relatedness between two variables, but expressed in the scale that the variables are measured in, which means that it is harder to compare than correlations.
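The difference between covariance and correlation can be illustrated with a small sketch. The data below are made up, and the helper functions `covariance` and `correlation` are written out by hand for illustration only. The same heights are expressed first in centimeters and then in meters: the covariance changes with the scale, while the correlation does not.

```python
from math import sqrt

def covariance(x, y):
    """Sample covariance: summed products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Made-up data: height and weight for five people.
height_cm = [160, 165, 170, 175, 180]
weight_kg = [55, 60, 63, 70, 72]

# The same heights, expressed in meters instead of centimeters.
height_m = [h / 100 for h in height_cm]

cov_cm = covariance(height_cm, weight_kg)
cov_m = covariance(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)
r_m = correlation(height_m, weight_kg)

print(cov_cm, cov_m)  # the covariance shrinks by a factor of 100
print(r_cm, r_m)      # the correlation is the same on both scales
```

Because the correlation is free of the measurement scale, it can be compared across pairs of variables in a way the covariance cannot.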
SAMPLING AND INFERENCE
Population	The group we want to draw conclusions about, for instance voters in Sweden.
Sample	The part of the population (which can be chosen in different ways) that we actually study.
Random sampling	When we draw units of analysis from the population at random. In normal circumstances the sampling method that gives us the best opportunities for inference.
Quota sampling	When we compose our sample to fill certain quotas, that match the characteristics of the population.
Sampling through self-selection	When the units of analysis are free to sign up for being part of the sample. Gives very poor opportunities for inference, since those who sign up are in general unrepresentative of the population.
Inference	To generalize conclusions about the sample to the larger population.
Central limit theorem	Theorem that states that when we draw repeated random samples from a population, the means of the samples will be approximately normally distributed around the true mean in the population, provided the samples are reasonably large. Forms the basis of significance calculations.
Confidence interval	Interval of values that expresses the degree of uncertainty. For instance about what the true mean is in a population, or what the true relationship between two variables is.
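The central limit theorem and confidence intervals can be explored with a small simulation. This is only a sketch under stated assumptions: the population is hypothetical (100,000 values drawn from a skewed distribution), the sample size of 50 and the 2,000 repetitions are arbitrary choices, and the interval uses the common approximation of the mean plus or minus 1.96 standard errors.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the simulation is repeatable

# Hypothetical population: 100,000 values from a skewed (exponential) distribution.
population = [random.expovariate(1.0) for _ in range(100_000)]
true_mean = mean(population)

sample_size = 50
n_samples = 2000

sample_means = []
covered = 0  # number of intervals that contain the true mean
for _ in range(n_samples):
    sample = random.sample(population, sample_size)  # one random sample
    m = mean(sample)
    se = stdev(sample) / sqrt(sample_size)  # standard error of the mean
    lower, upper = m - 1.96 * se, m + 1.96 * se  # approximate 95% interval
    sample_means.append(m)
    if lower <= true_mean <= upper:
        covered += 1

mean_of_means = mean(sample_means)     # should lie close to true_mean
spread_of_means = stdev(sample_means)  # how much the sample means vary
coverage = covered / n_samples         # share of intervals covering true_mean

print(true_mean, mean_of_means)
print(spread_of_means)
print(coverage)
```

Even though the population itself is skewed, the sample means cluster around the true mean, and most of the intervals contain it. With a skewed population and a modest sample size, the share of intervals that cover the true mean can fall somewhat short of 95%.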
VARIABLE TYPES
Nominal scale	Categorization without ordering. The variable "fruit" can have the values "orange", "pear" or "apple", but cannot be ordered.
Ordinal scale	Categorization + ordering, but without equidistance. The distances between the steps of the scale are not necessarily equal. Survey questions are usually of this type.
Interval scale	Categorization + ordering + equidistance. For instance height, measured in centimeters. Each step up on the scale is equally long.
MEASURES OF CENTRAL TENDENCY
Mode	The most common value. Appropriate for nominal scales.
Median	The value you get if you arrange all the values in order, and take the value in the middle (or between the two in the middle). Appropriate for ordinal scales, but can also be used for interval scales.
Mean	The value you get if you sum all values, and divide by the number of units. Can only be (properly) used on interval scales. Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
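As a brief illustration, the three measures can be computed with Python's standard `statistics` module. The data and variable names below are made up, with one example for each scale type.

```python
from statistics import mean, median, mode

# Nominal scale: categories without order, so only the mode makes sense.
fruit = ["apple", "pear", "apple", "orange", "apple", "pear"]
most_common_fruit = mode(fruit)

# Ordinal scale: answers to a hypothetical 1-5 survey question.
agreement = [1, 2, 2, 3, 4, 5, 5]
middle_answer = median(agreement)

# Interval scale: height in centimeters.
height = [158, 164, 170, 171, 182]
average_height = mean(height)

# The mean written out as in the formula: sum all values, divide by n.
average_by_hand = sum(height) / len(height)

print(most_common_fruit)  # apple
print(middle_answer)      # 3
print(average_height)     # 169
```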
MEASURES OF DEVIATION
Variance	A measure of the deviation from the mean, that is used in many other calculations. Formula: $V = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$ The formula means that you take each observation's deviation from the mean, square it, sum all the squared deviations, and divide by n-1.
Standard deviation	A measure of the typical deviation from the mean. Calculated as the square root of the variance. Formula: $s = \sqrt{V}$
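The two formulas can be followed step by step in a short sketch. The values are made up, and the result is checked against the `variance` and `stdev` functions in Python's standard `statistics` module, which also divide by n-1.

```python
from math import sqrt
from statistics import stdev, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
n = len(values)

# Step 1: the mean.
x_bar = sum(values) / n

# Step 2: each observation's deviation from the mean, squared.
squared_deviations = [(x - x_bar) ** 2 for x in values]

# Step 3: sum the squared deviations and divide by n - 1 to get the variance.
V = sum(squared_deviations) / (n - 1)

# Step 4: the standard deviation is the square root of the variance.
s = sqrt(V)

print(V, variance(values))  # both give the same variance
print(s, stdev(values))     # both give the same standard deviation
```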
THEORETICAL NAMES FOR VARIABLES
Dependent variable	The variable we want to explain. Sometimes called outcome variable.
Independent variable	The variable we think explains the dependent variable. Sometimes called determinant or predictor.
Control variable	Additional variable that is brought into the analysis, generally to rule out spuriosity.
Antecedent variable	Variable that comes before the independent variable in the causal chain.
Intervening variable / mediating variable	Variable that is located after the independent variable but before the dependent variable in the causal chain. Should not be included as a control if the aim is to rule out spuriosity.
Interaction variable	Variable that moderates the effect of the independent variable. Sometimes called moderating variable.
Dummy variable	Variable that shows whether a unit of analysis has a specific characteristic (1) or not (0).
TYPES OF RELATIONSHIPS
Null relationship	Lack of correlation between two variables.
Positive relationship	Relationship where more of one variable is associated with more of the other, and where less of one variable is associated with less of the other.
Negative relationship	Relationship where more of one variable is associated with less of the other, and where less of one variable is associated with more of the other.
Causal relationship	Relationship where one variable has caused the other.
Spurious relationship	Relationship between two variables where a third variable has caused both the independent and dependent variable.
Suppressed relationship	Actual relationship between two variables, but one that is hidden by a third variable that is associated with lower values on one variable and higher values on the other, or vice versa.
Nonlinear relationship	Relationship whose strength depends on where we are on the scale. For instance the relationship between ice cream eating and happiness - eating one ice cream has a positive effect on happiness; eating another increases happiness a little bit; eating a third or fourth probably decreases happiness.
TYPES OF ANALYSES
Univariate analysis	Analysis of one variable. For instance calculation of the mean.
Bivariate analysis	Analysis of relationships between two variables.
Correlation analysis	Analysis that shows the strength and direction of a relationship between two variables. There are many different measures, but the most common is Pearson's r.
Regression analysis	Analysis where a line is fit to a number of points. Gives numbers on the slope of the line, the strength of the relationship, uncertainty about the estimation of the line, and more.
Multiple regression analysis	Regression analysis with more than one independent variable.
Logistic regression analysis	Type of regression analysis tailored to cases where the dependent variable only has the value 0 or 1.
Factor analysis	Analysis where one tries to find a common denominator between several different variables. That is, condense the data to see if there is some latent, unobserved variable that can explain variation in multiple variables.
TYPES OF DESIGNS
Experimental design	When we are in control over the independent variable, and randomize units of analysis to a treatment group and a control group. Gives us good opportunity to draw conclusions about causal effects.
Pseudo-experimental design	When we exploit natural variation to find cases where random assignment to some "treatment" has arisen.
Regression discontinuity design	A type of pseudo-experimental design when there is a sharp threshold that governs assignment to some "treatment", and when we can assume that units of analysis just above and below this threshold are similar in all other respects.
Matching	When units of analysis are matched with "statistical twins", that is, units with the same values on all relevant variables except the independent.
Cross-sectional design	When we compare many units of analysis at one point in time.
Longitudinal design	When we compare one unit with itself, over time.
Panel design	When we compare many units with each other, and with themselves, over time. Also called Time-Series Cross-Section.
TERMS IN REGRESSION ANALYSIS
Regression line	The line that best fits the points.
b-coefficient	The slope of the regression line. How much the dependent variable is expected to change when we increase the independent variable by one step.
Constant/intercept	The expected value on the dependent variable when all the independent variables in the model have the value 0.
$R^2$	A measure of how much of the variation in the dependent variable can be explained by the independent variables. Runs between 0 and 1. 0.5 can be interpreted as "50% of the variation in the dependent variable can be explained by the independent variables."
Standard error	A measure of uncertainty in the estimation of b-coefficients.
t-value	The b-coefficient divided by the standard error. Used to calculate the p-value.
p-value	The significance value. Shows how probable it is that we would get a relationship at least as strong as the one observed between the independent and dependent variable purely because of randomness in the sampling procedure, if we assume that there is no relationship in the population. A common convention is that p-values below 0.05 are counted as "statistically significant."
n	The number of units of analysis included in the analysis.
Unstandardized regression coefficient	The regular b-coefficient, which is expressed in units of the dependent variable. If for instance the dependent variable is GDP per capita, the coefficient will show how much GDP per capita is expected to increase or decrease if the independent variable is increased by one.
Standardized regression coefficient	b-coefficient, but standardized so that the variance of both dependent and independent variable is 1. Allows for comparison between coefficients measured with different scales, which is hard with unstandardized coefficients.
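Several of the terms above can be tied together in a minimal sketch of a regression with one independent variable, computed by hand with ordinary least squares. The data and all variable names are made up for illustration. The p-value is left out, since it has to be looked up from the t-value in a t-distribution (here with n-2 degrees of freedom), which Python's standard library does not provide.

```python
from math import sqrt

# Made-up data: x is the independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and cross-products of the deviations from the means.
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# b-coefficient (slope) and constant/intercept of the regression line.
b = sxy / sxx
intercept = mean_y - b * mean_x

# Residuals: distance between each observed y and the regression line.
residuals = [yi - (intercept + b * xi) for xi, yi in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)

# R^2: share of the variation in y explained by x.
r_squared = 1 - ss_res / syy

# Standard error of b, and the t-value (b divided by its standard error).
se_b = sqrt(ss_res / (n - 2) / sxx)
t_value = b / se_b

# Standardized coefficient: b rescaled by the standard deviations of x and y.
beta = b * sqrt(sxx / syy)

print(b, intercept)
print(r_squared)
print(se_b, t_value)
print(beta)
```

With a single independent variable, the standardized coefficient equals Pearson's r, so its square equals $R^2$. This shortcut does not carry over to multiple regression.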
DIAGRAMS
Histogram	Shows the distribution of a variable.
Scatterplot	Shows the relationship between two variables by drawing the units of analysis as dots, placed on an X and Y axis.