Dictionary with short explanations

Here is a small dictionary of common terms in statistical analysis, with short explanations.

Unit of analysis The units we compare. For instance countries, individuals, years, stocks, municipalities, and so on.
Variable Characteristic of the units of analysis that varies. For instance height, age or party affiliation among individuals, or population, degree of democracy or climate among countries.
Variable value The actual value a unit of analysis has on a specific variable.
Causality Cause and effect. In a causal relationship one variable affects the other.
Measure of central tendency A single value that summarizes the typical or central value in a dataset, for instance the mode, median or mean.
Distribution How the units of analysis are distributed over the possible values. Can take any shape.
Normal distribution One specific shape of distribution, where most values are found in the middle, with fewer out on the edges.
Spread How the values are distributed around the central tendency.
Correlation Degree of relatedness between two variables.
Covariance Degree of relatedness between two variables, but expressed in the scales that the variables are measured in, which means that covariances are harder to compare than correlations.
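The difference between covariance and correlation can be illustrated with a minimal sketch. The data and the function names (`covariance`, `correlation`) are made up for illustration; the point is only that changing the unit of measurement changes the covariance but not the correlation.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance: summed products of deviations, divided by n - 1."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    """Pearson's r: the covariance divided by both standard deviations."""
    return covariance(x, y) / sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical data: height and weight for five individuals.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 85.0]
height_cm = [h * 100 for h in height_m]  # same heights, different unit

print(covariance(height_m, weight_kg))    # depends on the unit of height
print(covariance(height_cm, weight_kg))   # 100 times larger
print(correlation(height_m, weight_kg))   # unit-free, between -1 and 1
print(correlation(height_cm, weight_kg))  # same as the line above
```

Because the correlation is unaffected by the choice of unit, it can be compared across variable pairs in a way that the covariance cannot.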
Population The group we want to draw conclusions about, for instance voters in Sweden.
Sample  The part of the population (which can be chosen in different ways) that we actually study.
Random sampling When we draw units of analysis from the population at random. In normal circumstances the sampling method that gives us the best opportunities for inference.
Quota sampling When we compose our sample to fill certain quotas, that match the characteristics of the population.
Sampling through self-selection When the units of analysis are free to sign up for being part of the sample. Gives very bad opportunities for inference, as the ones who do so in general are unrepresentative.
Inference To generalize conclusions from the sample to the larger population.
Central limit theorem Theorem that states that when we draw repeated random samples from a population, the means of the samples will form an approximately normal distribution centered on the true mean in the population, provided the samples are reasonably large. Forms the basis of significance calculations.
Confidence interval  Interval of values that expresses the degree of uncertainty. For instance about what the true mean is in a population, or what the true relationship between two variables is.
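The central limit theorem and the confidence interval can be sketched with a small simulation. Everything here is an assumption made for illustration: the population is an invented, strongly skewed one, the sample size of 100 and the seed are arbitrary, and the interval uses the common approximation of the mean plus or minus 1.96 standard errors for a 95% interval.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly skewed population (exponential, mean around 10).
population = [rng.expovariate(1 / 10) for _ in range(100_000)]
true_mean = mean(population)

# Central limit theorem: draw many random samples, store each sample mean.
sample_means = [mean(rng.sample(population, 100)) for _ in range(2_000)]
print(true_mean, mean(sample_means))  # the two should be close
print(stdev(sample_means))            # close to stdev(population) / sqrt(100)

# Confidence interval from one sample: mean +/- 1.96 standard errors.
sample = rng.sample(population, 100)
standard_error = stdev(sample) / sqrt(len(sample))
lower = mean(sample) - 1.96 * standard_error
upper = mean(sample) + 1.96 * standard_error
print(lower, upper)
```

Even though the population itself is far from normal, the sample means cluster symmetrically around the true mean. About 95 out of 100 intervals constructed this way would be expected to cover the true mean, so a single interval may still miss it.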
Nominal scale Categorization without ordering. The variable "fruit" can have the values "orange", "pear" or "apple", but the values cannot be ordered.
Ordinal scale Categorization + ordering, but without equidistance. The distances between the steps of the scale are not necessarily equal. Survey questions are usually of this type.
Interval scale Categorization + ordering + equidistance. For instance height, measured in centimeters. Each step up on the scale is equally long.
Mode The most common value. Appropriate for nominal scales.
Median The value you get if you arrange all the values in order, and take the value in the middle (or between the two in the middle). Appropriate for ordinal scales, but can also be used for interval scales.
Mean The value you get if you sum all values, and divide by the number of units. Can only be (properly) used on interval scales.
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Variance A measure of the deviation from the mean, that is used in many other calculations.
Formula: $V = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}$
The formula means that you take each observation's deviation from the mean, square it, sum all the squared deviations, and divide by n-1.
Standard deviation A measure of the typical deviation from the mean. Calculated as the square root of the variance.
Formula: $s = \sqrt{V}$
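As a minimal sketch, the measures above can be computed with Python's standard `statistics` module, and the variance and standard deviation can be checked by hand against the formulas. The ages are made-up data.

```python
import statistics

# Hypothetical data: the ages of seven individuals.
ages = [21, 23, 23, 25, 30, 34, 61]

print(statistics.mode(ages))    # most common value
print(statistics.median(ages))  # middle value after sorting
print(statistics.mean(ages))    # sum divided by the number of units

# Variance by hand: squared deviations from the mean, summed and
# divided by n - 1. The standard deviation is its square root.
x_bar = sum(ages) / len(ages)
variance = sum((x - x_bar) ** 2 for x in ages) / (len(ages) - 1)
standard_deviation = variance ** 0.5

# The standard library uses the same n - 1 definitions.
print(variance, statistics.variance(ages))
print(standard_deviation, statistics.stdev(ages))
```

In this example the single high value (61) pulls the mean above the median, which is one reason to report both.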
Dependent variable The variable we want to explain. Sometimes called outcome variable.
Independent variable The variable we think explains the dependent variable. Sometimes called determinant or predictor.
Control variable Additional variable that is brought into the analysis, generally to rule out spuriosity.
Antecedent variable Variable that comes before the independent variable in the causal chain.
Intervening variable / mediating variable Variable that is located after the independent variable but before the dependent variable in the causal chain. Should not be included as a control if the aim is to rule out spuriosity.
Interaction variable Variable that moderates the effect of the independent variable. Sometimes called moderating variable.
Dummy variable Variable that shows whether a unit of analysis has a specific characteristic (1) or not (0).
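A dummy variable can be constructed from a nominal variable as in the sketch below. The variable `party` and its values are invented for illustration.

```python
# Hypothetical data: party affiliation (a nominal variable) for five voters.
party = ["green", "red", "blue", "red", "green"]

# One dummy variable: 1 if the unit has the characteristic, otherwise 0.
is_red = [1 if p == "red" else 0 for p in party]
print(is_red)  # [0, 1, 0, 1, 0]

# One dummy per category. In a regression, one category is usually left
# out and serves as the reference category.
categories = sorted(set(party))
dummies = {c: [int(p == c) for p in party] for c in categories}
print(dummies)

# The mean of a dummy variable is the share of units with the characteristic.
print(sum(is_red) / len(is_red))  # 0.4
```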
Null relationship Lack of correlation between two variables.
Positive relationship Relationship where more of one variable is associated with more of the other, and where less of one variable is associated with less of the other.
Negative relationship Relationship where more of one variable is associated with less of the other, and where less of one variable is associated with more of the other.
Causal relationship Relationship where one variable has caused the other.
Spurious relationship Relationship between two variables where a third variable has caused both the independent and dependent variable.
Suppressed relationship Actual relationship between two variables, but one that is hidden by a third variable that is associated with lower values on one variable and higher values on the other, or vice versa.
Nonlinear relationship Relationship whose strength or direction depends on where we are on the scale. For instance the relationship between ice cream eating and happiness: eating one ice cream has a positive effect on happiness; eating another increases happiness a little bit; eating a third or fourth probably decreases happiness.
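A spurious relationship, and how a control variable can reveal it, can be sketched with simulated data. The setup is an assumption made for illustration: a third variable `z` causes both `x` and `y`, while `x` and `y` have no effect on each other. Controlling for `z` is done here by correlating what is left of `x` and `y` after a straight-line effect of `z` has been removed, which is one simple way among several.

```python
import random
from math import sqrt

def correlation(x, y):
    """Pearson's r computed from deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def residuals(y, z):
    """What is left of y after removing a straight-line effect of z."""
    mz, my = sum(z) / len(z), sum(y) / len(y)
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 5_000
z = [rng.gauss(0, 1) for _ in range(n)]  # third variable
x = [zi + rng.gauss(0, 1) for zi in z]   # caused by z, not by y
y = [zi + rng.gauss(0, 1) for zi in z]   # caused by z, not by x

print(correlation(x, y))                              # clearly positive
print(correlation(residuals(x, z), residuals(y, z)))  # close to zero
```

The first correlation is positive even though neither variable affects the other. Once `z` is controlled for, the relationship disappears.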
Univariate analysis Analysis of one variable. For instance calculation of the mean.
Bivariate analysis Analysis of relationships between two variables.
Correlation analysis Analysis that shows the strength and direction of a relationship between two variables. There are many different measures, but the most common is Pearson's r.
Regression analysis Analysis where a line is fit to a number of points. Gives numbers on the slope of the line, the strength of the relationship, uncertainty about the estimation of the line, and more.
Multiple regression analysis Regression analysis with more than one independent variable.
Logistic regression analysis Type of regression analysis tailored to cases where the dependent variable only has the value 0 or 1.
Factor analysis Analysis where one tries to find a common denominator between several different variables. That is, condense the data to see if there is some latent, unobserved variable that can explain variation in multiple variables.
Experimental design When we are in control over the independent variable, and randomize units of analysis to a treatment group and a control group. Gives us good opportunity to draw conclusions about causal effects.
Pseudo-experimental design When we exploit natural variation to find cases where random assignment to some "treatment" has arisen.
Regression discontinuity design A type of pseudo-experimental design used when there is a sharp threshold that governs assignment to some "treatment", and when we can assume that units of analysis just above and below this threshold are similar in all other respects.
Matching When units of analysis are matched with "statistical twins", that is, units with the same values on all relevant variables except the independent.
Cross-sectional design When we compare many units of analysis at one point in time.
Longitudinal design When we compare one unit with itself, over time.
Panel design When we compare many units with each other, and with themselves, over time. Also called Time-Series Cross-Section.
Regression line The line that best fits the points.
b-coefficient The slope of the regression line. How much the dependent variable is expected to change when we increase the independent variable by one step.
Constant/intercept The expected value on the dependent variable when all the independent variables in the model have the value 0.
$R^2$ A measure of how much of the variation in the dependent variable can be explained by the independent variables. Runs between 0 and 1. A value of 0.5 can be interpreted as "50% of the variation in the dependent variable can be explained by the independent variables."
Standard error A measure of uncertainty in the estimation of b-coefficients.
t-value The b-coefficient divided by the standard error. Used to calculate the p-value.
p-value The significance value. Shows how probable it is that we would get a relationship between the independent and dependent variable at least as strong as the one observed, just because of randomness in the sampling procedure, if we assume that there is no relationship in the population. A common convention is that p-values below 0.05 are counted as "statistically significant."
n The number of units of analysis included in the analysis.
Unstandardized regression coefficient The regular b-coefficient, which is expressed in units of the dependent variable. If for instance the dependent variable is GDP per capita, the coefficient will show how much GDP per capita is expected to increase or decrease if the independent variable is increased by one.
Standardized regression coefficient b-coefficient, but standardized so that the variance of both dependent and independent variable is 1. Allows for comparison between coefficients measured with different scales, which is hard with unstandardized coefficients.
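The regression terms above can be tied together in a minimal sketch of a regression with one independent variable, computed by hand with ordinary least squares. The data are made up, and the names `a` (constant) and `b` (slope) are chosen for illustration. The p-value is left out, since calculating it from the t-value requires the t-distribution, which is not part of Python's standard library.

```python
from math import sqrt

# Hypothetical data: x is the independent variable, y the dependent.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = sxy / sxx          # b-coefficient: slope of the regression line
a = y_bar - b * x_bar  # constant/intercept: expected y when x is 0

# R^2: share of the variation in y that the line accounts for.
residual_ss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - residual_ss / syy

# Standard error of b, and the t-value as b divided by its standard error.
standard_error = sqrt(residual_ss / (n - 2) / sxx)
t_value = b / standard_error

# Standardized coefficient: b rescaled by the two standard deviations.
# With one independent variable it equals Pearson's r.
beta = b * sqrt(sxx / syy)

print(b, a, r_squared, standard_error, t_value, beta)
```

With a single independent variable, the squared standardized coefficient equals $R^2$. That shortcut does not carry over to multiple regression.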
Histogram Shows the distribution of a variable.
Scatterplot Shows the relationship between two variables by drawing the units of analysis as dots, placed on an X and Y axis.