Correlation and Causation

What are they good for?

Jeremy Springman

University of Pennsylvania

Global Development: Intermediate Topics in Politics, Policy, and Data

PSCI 3200 - Spring 2025



  • Did everyone find the readings and slides for today?
  • For next week:
    • I’ll scan the chapter and upload tonight
    • Remember you have a quasi-assignment


  • Correlation
    • What is it?
    • What is it composed of?
    • What is it good for?
  • Causation
    • What is it good for?
    • Why is it hard?
    • Potential outcomes and counterfactuals


Which of the following statements describe a correlation?

  1. Most professional data analysts took a statistics course in college.
  2. The longer a person runs, the more calories they burn.
  3. People who live to be 100 years old typically take vitamins.
  4. Older people vote more than younger people.

Correlations: Quantitative Comparison

  • Lots of bad analysis implies a comparison without actually making one
    • Ex. 10 things that extremely successful people do to be productive
    • Ex. 60% of Americans now live paycheck-to-paycheck
    • Ex. 70% of participants reported an improvement
  • Avoid ‘selecting on the dependent variable’
    • Applies to qualitative comparisons as well
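
A minimal sketch of why selecting on the dependent variable is a problem, using simulated data. The variable names (`wakes_early`, `successful`) and the rates are invented for illustration. The habit is unrelated to success by construction, yet looking only at successful people still shows most of them have it.

```r
set.seed(3200)
n <- 10000

# Hypothetical habit and outcome, generated independently of each other
wakes_early <- rbinom(n, size = 1, prob = 0.6)
successful  <- rbinom(n, size = 1, prob = 0.1)

# "Selecting on the dependent variable": look only at successful people
share_among_successful <- mean(wakes_early[successful == 1])

# The comparison we actually need: the same share among everyone else
share_among_others <- mean(wakes_early[successful == 0])

cat("Early risers among the successful:", round(share_among_successful, 2), "\n")
cat("Early risers among everyone else: ", round(share_among_others, 2), "\n")
```

Both shares should land near 0.6, so the habit tells us nothing about success. That only becomes visible once the comparison group is included.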

Correlations: Necessary Components

What do we need to calculate correlations?

  • Measures of central tendency
    • Mean
  • Measures of spread
    • Variance
    • Standard deviation

Central Tendency: Mean

\[ \mu_X = \frac{1}{N} \sum_{i=1}^{N} X_i \]

my_vector <- rnorm(10, mean = 10, sd = 5)
# Step 1: Sum the values
sum_values <- sum(my_vector)
# Step 2: Count the number of elements
count_elements <- length(my_vector)
# Step 3: Calculate the mean
mean_value <- sum_values / count_elements
# Print the manual result and compare it with the built-in mean()
mean_value
mean(my_vector)

[1] 9.365005
[1] 9.365005

Spread: Variance

\[ \sigma^2_X = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)^2 \]

  • What does the square in \(\sigma^2\) accomplish?
  • What are the implications for interpretation?
    • Units
    • Distribution
  • Even with these basic measures, we’re already thinking about the distribution!
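
A rough sketch of the variance formula above, written out by hand. One point worth flagging: the formula divides by \(N\), while R's built-in `var()` divides by \(N - 1\), so the two differ slightly in small samples. The vector `x` is made up for illustration.

```r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data
N  <- length(x)
mu <- mean(x)                     # 40 / 8 = 5

# Population variance: the average squared deviation from the mean
pop_var <- sum((x - mu)^2) / N    # 32 / 8 = 4

# R's var() uses the sample formula, dividing by N - 1
samp_var <- var(x)                # 32 / 7, about 4.57

# Squared units: rescaling the data by 100 multiplies the variance by 100^2
pop_var_rescaled <- sum((100 * x - 100 * mu)^2) / N   # 40000
```

The last line is the "units" point in practice: variance is expressed in squared units, so it is hard to read against the original scale of the data.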

Spread: Variance

Create a vector
## Create vector, sort by size, and store var
dat = rnorm(10, mean = 10, sd = 5)
dat = sort(dat)
o_var = var(dat)
## Print the sorted vector
dat
 [1]  3.674694  6.565736  7.197622  7.771690  8.849113 10.352542 10.646439
 [8] 12.304581 17.793542 18.575325
Add a constant to a big number
## Copy the vector for the large-value addition and store vector length
b_dat = dat
ind = length(b_dat)

## Add four to the largest number in the vector and calculate size of var increase
b_dat[ind] = b_dat[ind] + 4
b_var = var(b_dat)
val = b_var - o_var
cat("Variance increases by", val )
Variance increases by 8.890842
Add a constant to a smaller number
## Copy the vector for the smaller-value addition
s_dat = dat

## Add four to a smaller number (the third-largest value, closer to the mean) and calculate size of var increase
s_dat[ind-2] = s_dat[ind-2] + 4
s_var = var(s_dat)
val = s_var - o_var
cat("Variance increases by", val )
Variance increases by 3.316847

Spread: Standard Deviation

\[ \sigma_X = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)^2} \]

  • What does the \(\sqrt{}\) accomplish?
  • What are the implications for interpretation?
    • Expressed in the same units as the observations
    • How far we expect each observation to be from the mean, on average
  • This means we can report effect sizes as SDs
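
To make the last bullet concrete, here is a hedged sketch of reporting a difference in SD units. The test scores are invented, and dividing by the control group's SD is one common convention rather than the only option.

```r
# Invented test scores for illustration
control   <- c(48, 52, 50, 46, 54)
treatment <- c(53, 55, 51, 57, 54)

# Raw difference in means, in points
raw_diff <- mean(treatment) - mean(control)   # 54 - 50 = 4

# Spread of the control group, also in points
sd_control <- sd(control)                     # sqrt(10), about 3.16

# The same difference expressed in control-group standard deviations
effect_in_sds <- raw_diff / sd_control        # about 1.26
```

Because the result is unit-free, it can be compared with effects measured on other scales.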

Measures of Correlation

  • Covariance \(\text{Cov}(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)\)
    • Product of the deviations
    • Range: unbounded
  • Correlation coefficient \(\text{Cor}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}\)
    • Covariance normalized by product of SDs
    • Range: -1 to 1
  • Slope \(\beta_X = \frac{\text{Cov}(X, Y)}{\sigma^2_X}\)
    • Covariance normalized by variance
    • Expected change in \(Y\) with 1-unit change in \(X\)
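
The three measures above can be sketched by hand and checked against R's built-ins. The data are simulated, and the true slope of 0.5 is an arbitrary choice. This sketch divides by \(N - 1\) to match `cov()`, `sd()`, and `var()`; the choice of divisor cancels out in the correlation coefficient and the slope.

```r
set.seed(1)
x <- rnorm(200, mean = 10, sd = 5)
y <- 2 + 0.5 * x + rnorm(200, sd = 3)
n <- length(x)

# Covariance: average product of the deviations (sample version, N - 1)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Correlation coefficient: covariance divided by the product of the SDs
cor_xy <- cov_xy / (sd(x) * sd(y))

# Slope: covariance divided by the variance of x
slope_xy <- cov_xy / var(x)

# Each pair should match: hand calculation vs. built-in
c(cov_xy, cov(x, y))
c(cor_xy, cor(x, y))
c(slope_xy, unname(coef(lm(y ~ x))["x"]))
```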

Measures of Correlation

  • What does the correlation coefficient tell you that slope doesn’t?
    • Consistency of the relationship on bounded scale (-1 to 1)
  • What does slope tell you that the correlation coefficient doesn’t?
    • Substantive importance (magnitude)
  • Give an example of when you’d prefer each
    • Correlation: When comparing relationships on different scales
    • Slope: When thinking about ROI
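
A small sketch of the scale point, assuming simulated income and spending data (the names and numbers are hypothetical). Changing the units of \(X\) changes the slope but leaves the correlation coefficient untouched.

```r
set.seed(2)
income_usd <- rnorm(100, mean = 30000, sd = 8000)   # hypothetical incomes
spending   <- 500 + 0.1 * income_usd + rnorm(100, sd = 400)

# Same variable in a different unit: thousands of dollars
income_k <- income_usd / 1000

# Slopes depend on the unit of X
slope_usd <- unname(coef(lm(spending ~ income_usd))[2])
slope_k   <- unname(coef(lm(spending ~ income_k))[2])

# Correlations do not
cor_usd <- cor(spending, income_usd)
cor_k   <- cor(spending, income_k)

cat("Slope per dollar:", slope_usd, " Slope per $1,000:", slope_k, "\n")
cat("Correlation (dollars):", cor_usd, " Correlation ($1,000s):", cor_k, "\n")
```

The slope per $1,000 is 1,000 times the slope per dollar, which is the kind of quantity an ROI calculation needs. The correlation is the same in both cases, which is what makes it comparable across scales.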


What can we do with them?

  • Description: quantitative comparisons
  • Sample matters a lot
  • Sample matters less
  • Forecasting: sample population \(\rightarrow\) out-of-sample
  • Causal inference: correlation + research design

Simple, but powerful

  • Non-linearities, interactions, machine learning
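
One reading of this slide is that the same simple machinery extends to non-linear relationships once variables are transformed. The sketch below uses made-up data in which \(Y\) is fully determined by \(X\), yet the linear correlation is zero until \(X\) is squared.

```r
x <- seq(-3, 3, by = 0.5)
y <- x^2   # y depends entirely on x, but not linearly

# x is symmetric around zero, so the linear correlation is (numerically) zero
cor_linear <- cor(x, y)

# Correlating y with the transformed variable recovers the relationship
cor_squared <- cor(x^2, y)

cat("cor(x, y):  ", round(cor_linear, 3), "\n")
cat("cor(x^2, y):", round(cor_squared, 3), "\n")
```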


Schools of Thought

  • Potential outcomes and counterfactuals (Econ)
  • DAGs and do-calculus (CS)
  • Manipulability (Philosophy)

“We think of a cause as something that makes a difference, and the difference it makes must be a difference from what would have happened without it.” (Lewis, 1973)

Causality: Why bother?

  • Understanding cause and effect is how we change things in the real world
  • Causal inference separates good evaluations from bad
    • Policy change
    • Development intervention
  • Causal identification is not binary
    • It’s harder for some policies and interventions than others
    • Variety of tools that can help us rule out different threats to inference

Causality: Why bother?

library(ggplot2)

# Control group only, observed in Years 1 and 2 (Years 0 and 3 are missing)
Year = c(0,1,2,3)
Outcome = c(NA, 1.2, 1.4,NA)
Treatment = c("Control", "Control","Control","Control")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  scale_linetype_manual(values=c("solid")) +
  xlim(0,3) +
  scale_y_continuous(limits = c(1,1.85), breaks = seq(1, 1.85, by = .1)) +
  theme(legend.position = "none", text = element_text(size=20))

Causality: Why bother?

# Control and treatment groups, observed in all four years
Year = c(0,1,2,3)
Outcome = c(1, 1.2, 1.4, 1.6,
            0.9, 1.3, 1.7, 2.1)
Treatment = c("Control", "Control","Control","Control",
              "Treatment", "Treatment", "Treatment", "Treatment")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  xlim(0,3) + 
  scale_y_continuous(breaks = seq(1, 1.85, by = .1)) + 
  scale_linetype_manual(values=c("solid", "solid")) +
  coord_cartesian(ylim = c(1, 1.85), clip = "on") +
  theme(legend.position = "none", text = element_text(size=20)) 

Causality: Why bother?

# Control group only, observed in all four years
Year = c(0,1,2,3)
Outcome = c(1, 1.2, 1.4,1.6)
Treatment = c("Control", "Control","Control","Control")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  scale_linetype_manual(values=c("solid")) +
  xlim(0,3) + 
  scale_y_continuous(limits = c(1,1.85), breaks = seq(1, 1.85, by = .1)) + 
  theme(legend.position = "none", text = element_text(size=20))

Causality: Why bother?

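# Control, treatment, and a third series, observed in Years 1 and 2 only
Year = c(0,1,2,3)
Outcome = c(NA, 1.2, 1.4, NA,
            NA, 1.3, 1.7, NA,
            NA, 1.3, 1.5, NA)
# NOTE: the original Treatment vector is truncated after the second group.
# The third label ("Counterfactual") is an assumption, chosen because that
# series starts at the treatment group's value and follows the control trend.
Treatment = c("Control", "Control","Control","Control",
              "Treatment", "Treatment", "Treatment", "Treatment",
              "Counterfactual", "Counterfactual", "Counterfactual", "Counterfactual")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  xlim(0,3) +
  scale_y_continuous(limits = c(1,1.85), breaks = seq(1, 1.85, by = .1)) +
  # Named values so that the assumed third series is the dotted one
  scale_linetype_manual(values=c("Control" = "solid", "Treatment" = "solid", "Counterfactual" = "dotted")) +
  theme(legend.position = "none", text = element_text(size=20))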

Causality: Why bother?

# Control and treatment groups, observed in Years 1 and 2 only
Year = c(0,1,2,3)
Outcome = c(NA, 1.2, 1.4, NA,
            NA, 1.3, 1.7, NA)
Treatment = c("Control", "Control","Control","Control",
              "Treatment", "Treatment", "Treatment", "Treatment")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  xlim(0,3) + 
  scale_y_continuous(limits = c(1,1.85), breaks = seq(1, 1.85, by = .1)) + 
  scale_linetype_manual(values=c("solid", "solid")) +
  theme(legend.position = "none", text = element_text(size=20))

Causality: Why bother?

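# Control, treatment, and a third series, observed in all four years
Year = c(0,1,2,3)
Outcome = c(1, 1.2, 1.4,1.6,
            1.1, 1.3, 1.7, 1.9,
            1.1, 1.3, 1.5, 1.7)
# NOTE: the original Treatment vector is truncated after the second group.
# The third label ("Counterfactual") is an assumption, chosen because that
# series matches the treatment group through Year 1 and then follows the control trend.
Treatment = c("Control", "Control","Control","Control",
              "Treatment", "Treatment", "Treatment", "Treatment",
              "Counterfactual", "Counterfactual", "Counterfactual", "Counterfactual")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  scale_y_continuous(breaks = seq(1, 1.9, by = .1)) +
  # Named values so that the assumed third series is the dotted one
  scale_linetype_manual(values=c("Control" = "solid", "Treatment" = "solid", "Counterfactual" = "dotted")) +
  coord_cartesian(ylim = c(1, 1.85), clip = "on") +
  theme(legend.position = "none", text = element_text(size=20))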

Causality: Why bother?

# Control and treatment groups, observed in all four years
Year = c(0,1,2,3)
Outcome = c(1, 1.2, 1.4,1.6,
            1.1, 1.3, 1.7, 1.9)
Treatment = c("Control", "Control","Control","Control",
              "Treatment", "Treatment", "Treatment", "Treatment")

dat = data.frame(Year, Outcome, Treatment)

ggplot(data = dat, aes(x = Year, y = Outcome, group = Treatment)) +
  geom_line(aes(linetype=Treatment),linewidth=2) +
  geom_point(size = 6) +
  scale_y_continuous(breaks = seq(1, 1.9, by = .1)) + 
  scale_linetype_manual(values=c("solid", "solid")) +
  coord_cartesian(ylim = c(1, 1.85), clip = "on") +
  theme(legend.position = "none", text = element_text(size=20))

Causality: What makes it hard?

Fundamental Problem of Causal Inference

\[ Y_i = \begin{cases} Y_i(1) & \text{if } D_i = 1 \text{ (treatment group)} \\ Y_i(0) & \text{if } D_i = 0 \text{ (control group)} \end{cases} \]

  • We only observe any given unit in one treatment status at any one time, so we can never directly observe the causal effect of a treatment on that unit.

Potential Outcomes and Counterfactuals

Treatment Effect for individual \(i\)

  • \(TE_i = Y_i(1) - Y_i(0)\)

Average Treatment Effect (ATE)

  • \(ATE = \frac{1}{N} \sum_{i=1}^{N} TE_i\)
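
A simulation sketch of these definitions. Everything here is hypothetical: in a simulation we can generate both potential outcomes for every unit, which real data never allow. The constant effect of 1.5 and the 50/50 random assignment are assumptions made for illustration.

```r
set.seed(42)
N <- 1000

# Hypothetical potential outcomes for every unit
y0 <- rnorm(N, mean = 10, sd = 2)   # Y_i(0): outcome without treatment
y1 <- y0 + 1.5                      # Y_i(1): outcome with treatment (assumed constant effect)

# Individual treatment effects and the ATE, computable only because we simulated both columns
te  <- y1 - y0
ate <- mean(te)

# Randomly assign treatment, then reveal only one potential outcome per unit
d     <- rbinom(N, size = 1, prob = 0.5)
y_obs <- ifelse(d == 1, y1, y0)

# What an analyst can compute from observed data: the difference in group means
diff_means <- mean(y_obs[d == 1]) - mean(y_obs[d == 0])

cat("True ATE:", ate, "\n")
cat("Difference in observed means:", round(diff_means, 2), "\n")
```

With random assignment the difference in observed means should be close to the true ATE, even though no individual \(TE_i\) is ever observed.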

Many Different Tools

  • Randomized experiments
    • Gold standard
    • Field and survey
  • Observational data
    • Natural experiments
    • Difference-in-Differences
    • Matching, Synthetic Control
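
As a worked illustration of difference-in-differences, the sketch below reuses the Year 1 and Year 2 outcomes from the trend plots earlier in these slides. Treating Year 1 as "before" and Year 2 as "after" treatment is an assumption about how to read those plots, and the estimate relies on the parallel-trends assumption.

```r
# Observed outcomes at Year 1 (before) and Year 2 (after), taken from the plots
control_before   <- 1.2
control_after    <- 1.4
treatment_before <- 1.3
treatment_after  <- 1.7

# Naive comparison: treated vs. control after treatment (about 0.3)
naive_diff <- treatment_after - control_after

# Difference-in-differences: change in treated minus change in control (about 0.2)
did <- (treatment_after - treatment_before) - (control_after - control_before)

# Implied counterfactual for the treated group under parallel trends (about 1.5)
counterfactual_after <- treatment_before + (control_after - control_before)

cat("Naive difference:", naive_diff, "\n")
cat("Difference-in-differences:", did, "\n")
cat("Counterfactual Year 2 outcome for treated group:", counterfactual_after, "\n")
```

The naive comparison overstates the effect because the treatment group already started 0.1 higher. Subtracting the control group's trend removes that baseline gap.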

DAGs and Confounding

Next Meeting

  • Randomized experiments
    • Review of FPCI
    • How randomization addresses confounding
  • The role of social science researchers?
    • What should people with social science training be doing with their skills?
  • Final Project
    • What are your options?
    • Possible data sources