R, RStudio, and Quarto
RStudio Layout
Before using R to illustrate basic programming concepts and data analysis tools, we will get familiar with the RStudio layout.
RStudio has four primary panels that will help you interact with your data. We will use the default layout of these panels.
- Source panel: Top left
  - Edit files to create ‘scripts’ of code
- Console panel: Bottom left
  - Accepts code as input
  - Displays output when we run code
- Environment panel: Top right
  - Everything that R is holding in memory
  - Objects that you create in the console or source panels will appear here
  - You can clear the environment with the broom icon
- Viewer panel: Bottom right
  - View graphics that you generate
  - Navigate files
Illustration
Let’s use these panels to create and interact with data.
Console:
- Perform a calculation: type 2 + 2 into the console panel and hit ENTER
- Create and store an object: type sum = 2 + 2 into the console panel and hit ENTER
Source:
- Start an R script: Open a new .R file (button in the top left, below “File”)
- Create and store an object: type sum = 2 + 3 into the source panel and hit CTRL+ENTER
Environment:
- Confirm that the object sum is stored in our environment
- Use rm(sum) to clear the object from the environment
- Clear the entire environment with the broom icon
Viewer:
- Navigate through your computer’s files
- Create a plot in the source panel
Quarto
Formatting
- Quarto is an extension of markdown
- You can write text in italics and bold, and add links
- Create sections and subsections with #, ##, ###
- Declare title, file type, and references in the yaml header (top section); see the sketch below
- Comment out text
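A minimal sketch of a yaml header, followed by an HTML-style comment (the title and bibliography file here are placeholders):

---
title: "My Document"
format: html
bibliography: references.bib
---

<!-- This text is commented out and will not appear in the rendered document -->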
Write Formulas
- Use math mode to render formulas
$$ \widehat{ATE} = \overline{Y}_{\text{treatment group}} - \overline{Y}_{\text{control group}} $$
Integrate External Images
You can also integrate external images that are saved in your project folder.
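In markdown syntax, embedding an image looks something like this (the path and caption are placeholders):

![A descriptive caption](images/my_figure.png)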
Automate References
Cite cool work by outstanding scholars (Springman 2022) and automatically generate a references list at the end of your document.
- Find the paper on scholar.google.com
- Select “Cite” → “BibTeX”
- Create a references.bib file
- Store the BibTeX reference in that file
- Add references.bib to your yaml header (as sketched below)
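A hypothetical BibTeX entry (the citation key and fields are invented for illustration):

@article{springman2022,
  author  = {Springman, Jeremy},
  title   = {An Example Article Title},
  journal = {An Example Journal},
  year    = {2022}
}

Writing [@springman2022] in the Quarto source then renders as (Springman 2022) and adds the full entry to the references list, provided bibliography: references.bib appears in your yaml header.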
Add Flow Charts
There is also built-in functionality to do cool stuff like create flow charts using mermaid.
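For example, a mermaid chunk containing the following produces a simple flow chart:

flowchart LR
  A[Hard edge] --> B(Round edge)
  B --> C{Decision}
  C --> D[Result one]
  C --> E[Result two]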
Add Footnotes
Write a sentence and then add a footnote.¹
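In the Quarto source, an inline footnote looks like this:

Write a sentence and then add a footnote.^[Inline notes are easier to write, since you don’t have to pick an identifier and move down to type the note.]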
Tables By Hand
You can create tables by hand using markdown syntax.
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|    12 | 12   | 12      |   12   |
|   123 | 123  | 123     |  123   |
|     1 | 1    | 1       |   1    |
Working with Code
To integrate code into our document, we can use code chunks. This allows us to create self-contained, reproducible documents.
Code Chunk Options
Code chunks come with specific options to control behavior. Let’s break down each option:
#| echo: true # The code will be shown in the rendered output.
#| warning: false # Any warnings produced by the code will be suppressed.
#| label: fig-fake-data # This assigns an internal label to the chunk, useful for referencing.
#| fig-cap: "Fake data figure" # This sets the caption for any figures generated by the chunk.
#| fig-width: 5 # This sets the width (in inches) of figures generated by the chunk.
Figures
Check out Figure 1
Descriptive Tables
Descriptive Table of Random Data

Summary statistics: Mean and Standard Deviation

| Variable | Mean      | SD        |
|----------|-----------|-----------|
| x        | 0.5200910 | 0.2942882 |
| y        | 0.4770270 | 0.2766775 |
| size     | 5.6385087 | 2.6461582 |
| color    | 0.5130599 | 0.2326477 |
Regression Results
|             | Model 1        | Model 2           |
|-------------|----------------|-------------------|
| (Intercept) | 0.003 (0.006)  | −0.001 (0.003)    |
| x1          | −0.003 (0.011) | −0.787*** (0.013) |
| x2          |                | 0.394*** (0.006)  |
| Num.Obs.    | 1000           | 1000              |
| R2          | 0.000          | 0.808             |
| R2 Adj.     | −0.001         | 0.807             |
| AIC         | −1768.6        | −3415.7           |
| BIC         | −1753.9        | −3396.1           |
| Log.Lik.    | 887.295        | 1711.866          |
| F           | 0.076          | 2095.146          |
Inline Code
Use code within the text to describe your data (n = 50) for a more reproducible workflow. If your data changes, the writing automatically updates.
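In the Quarto source, this is done with an inline code chunk; a minimal sketch (my_data is a placeholder object):

Use code within the text to describe your data (n = `r nrow(my_data)`).

When the document renders, `r nrow(my_data)` is replaced by the number of rows in my_data.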
Review of Basic Programming Concepts
Objects: where values are saved in R
“Object” is a generic term for anything that R stores in the environment. This can include anything from an individual number or word, to lists of values, to entire datasets.
Importantly, objects belong to different “classes” depending on the type of values that they store.
- Numerics are numbers
- Characters are text or strings like "hello world" and "welcome to R"
- Factors are groups of characters/strings with a fixed number of unique values
- Logicals are either TRUE or FALSE
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)

[1] "numeric"

# Create a character object
my_character = "welcome to R"
# Check the class
class(my_character)

[1] "character"

# Create a logical object
my_logical = TRUE
# Check the class
class(my_logical)

[1] "logical"
R can perform operations on objects.
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)

[1] "numeric"

# Perform a calculation
my_number = my_number + 5
The class of an object determines the type of operations you can perform on it. Some operations can only be run on numeric objects (numbers).
# Create a character object
my_number = "5.6"
# Check the class
class(my_number)

# Perform a calculation (these lines throw errors, because my_number is a character)
my_number + 5
round(my_number)
R contains functions that can convert some objects to different classes.
# Convert character to numeric
my_number = as.numeric("5")
class(my_number)

[1] "numeric"

my_number <- 5

# But R is only so smart
my_number = as.numeric("five")
print(my_number)

[1] NA
Data Structures
The simplest objects are single values, but most data analysis involves more complicated data structures.
Lists
Lists are a type of data structure that store multiple values together. Lists are created using c() and allow you to perform operations on a series of values. (Strictly speaking, c() creates a vector; R also has a separate list() class, but this workshop uses the terms interchangeably.)
# Create a numeric list (also called a "vector")
numeric_vector = c(6, 11, 13, 31)
# Print the vector
print(numeric_vector)

[1]  6 11 13 31

# Check the class
class(numeric_vector)

[1] "numeric"

# Calculate the mean
mean(numeric_vector)

[1] 15.25
An important part of working with more complex data structures is called “indexing.” Indexing allows you to extract specific values from a data structure.
# Extract the 2nd element from the list
numeric_vector[2]

[1] 11

# Extract elements 2-4
numeric_vector[2:4]

[1] 11 13 31

# Extract elements 1-2
numeric_vector[c(TRUE, TRUE, FALSE, FALSE)]

[1]  6 11
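Indexing with negative numbers drops elements instead of extracting them; a quick sketch:

# Drop the 1st element
numeric_vector[-1]

[1] 11 13 31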
Dataframes
Data frames are the most common type of data structure used in research. Data frames combine multiple lists of values into a single object.
# Create a dataframe
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)
class(my_data)

[1] "data.frame"
Anything that comes in a spreadsheet (for example, an Excel file) can be loaded into an R environment as a dataframe. R works most easily when spreadsheets are saved as a .csv file.
# Use `read.csv()` to load data from a website
dat = read.csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv")

# Use `read.csv()` to load data from your computer's Downloads folder
# dat = read.csv("/home/jeremy/Downloads/clean_endline_did.csv")
In most data frames, rows correspond to observations and columns correspond to variables that describe the observations. Here, we are looking at survey data from an RCT involving university students in Addis Ababa. Each row corresponds to a different survey respondent, and each column represents their answers to a different question from the survey.
Loading Packages
Packages are an extremely important part of data analysis with R.
- R gives you access to thousands of “packages” that are created by users
- Packages contain bundles of code called “functions” that can execute specific tasks
- Use install.packages() to install a package and library() to load a package
In the next section, we’ll use the package dplyr to perform some data cleaning. dplyr is part of a universe of packages called the tidyverse. Since this is one of the most important packages in the R ecosystem, let’s install and load it.
# install.packages("tidyverse")
library(tidyverse)
When you are searching online or asking ChatGPT how to perform a specific task in R, it often helps to specify that you are looking for a solution in dplyr.
Cleaning Data
In the real world, data never comes ready to be analyzed. Data cleaning is the process of manipulating data so that it can be analyzed. This is usually the most difficult and time-consuming part of any data analysis project. Let’s walk through some examples.
Question: In your work, what data cleaning have you had to do? How have you cleaned your data?
Creating Variables
Imagine we want to analyze the relationship between whether a respondent moved to Addis Ababa to attend university and their level of political participation. However, there are two problems:
- We don’t have a specific variable that measures whether or not respondents moved
- We have many measures of participation
How can we create a variable measuring whether the respondent moved to Addis Ababa? We have a multiple-choice question asking students what region they come from.
Let’s start by investigating this variable.
# The name of the variable in our dataframe is `q8_baseline`
table(dat$q8_baseline)
Addis Ababa                                           283
Afar Region                                             1
Amhara Region                                         174
Benishangul-Gumuz Region                                5
Dire Dawa (city)                                        5
Gambela Region                                          1
Harari Region                                           4
Oromia Region                                         205
Other                                                  12
Prefer not to say                                      19
Sidama Region                                          35
Somali Region                                           5
South West Ethiopia Peoples Region                      6
Southern Nations, Nationalities, and Peoples Region    65
Tigray Region                                           5
dat = dat %>% # this is called a "pipe"
  # give our variable a better name
  rename(home_region = q8_baseline)

dat = dat %>%
  # drop respondents who report "Prefer not to say"
  filter(!home_region == "Prefer not to say")

dat = dat %>%
  # clean home region variable using `mutate()`
  mutate(
    # Shorten a long name to an abbreviation
    home_region = ifelse(home_region == "Southern Nations, Nationalities, and Peoples Region", "SNNPR", home_region),
    # remove the word "Region" (and " (city)") from every observation of this column
    home_region = str_remove(home_region, " Region"),
    home_region = str_remove(home_region, " \\(city\\)")
  )
# Check if it worked
table(dat$home_region)
                Addis Ababa                        Afar 
                        283                           1 
                     Amhara           Benishangul-Gumuz 
                        174                           5 
                  Dire Dawa                     Gambela 
                          5                           1 
                     Harari                      Oromia 
                          4                         205 
                      Other                      Sidama 
                         12                          35 
                      SNNPR                      Somali 
                         65                           5 
South West Ethiopia Peoples                      Tigray 
                          6                           5 
# Chain these all together for more concise code
dat = read_csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv") %>%
  # give our variable a better name
  rename(home_region = q8_baseline) %>%
  # drop respondents who report "Prefer not to say"
  filter(!home_region == "Prefer not to say") %>%
  # clean home region variable
  mutate(
    # Shorten long names to abbreviations
    home_region = case_when(
      home_region == "Southern Nations, Nationalities, and Peoples Region" ~ "SNNPR",
      home_region == "South West Ethiopia Peoples Region" ~ "SWEPR",
      TRUE ~ home_region
    ),
    # remove the word "Region" (or " (city)") from every observation of this column
    home_region = str_remove(home_region, " Region| \\(city\\)")
  )
# Check if it worked
table(dat$home_region)
      Addis Ababa               Afar             Amhara  Benishangul-Gumuz 
              283                  1                174                  5 
        Dire Dawa            Gambela             Harari             Oromia 
                5                  1                  4                205 
            Other             Sidama              SNNPR             Somali 
               12                 35                 65                  5 
            SWEPR             Tigray 
                6                  5 
Now that we’ve cleaned up the names, we want to create a variable that tells us whether or not each respondent is originally from Addis Ababa. This will let us measure whether or not they moved to Addis in order to attend college.
# Create a measure of whether a respondent moved to Addis Ababa in order to attend university (our independent variable)
dat = dat %>%
  mutate(
    moved = case_when(
      home_region == "Addis Ababa" ~ 0,
      TRUE ~ 1
    )
  )
table(dat$moved, dat$home_region)

    Addis Ababa Afar Amhara Benishangul-Gumuz Dire Dawa Gambela Harari Oromia
  0         283    0      0                 0         0       0      0      0
  1           0    1    174                 5         5       1      4    205

    Other Sidama SNNPR Somali SWEPR Tigray
  0     0      0     0      0     0      0
  1    12     35    65      5     6      5
Now, we need to create our second variable measuring levels of political participation. Remember, the challenge is that we have multiple measures of participation. Let’s start with two measures:
- Number of times you’ve contacted a gov’t official: q13_4_1
- Number of times you’ve signed a petition: q13_5_1
# Check out the distribution of our variables
dat %>% select(q13_4_1, q13_5_1) %>% head(5)
# A tibble: 5 × 2
q13_4_1 q13_5_1
<dbl> <dbl>
1 0 0
2 3 0
3 0 0
4 0 0
5 0 2
mean(dat$q13_4_1)
[1] NA
hist(dat$q13_5_1)
# Create a single measure
dat = dat %>%
  mutate(
    add_participation = q13_4_1 + q13_5_1
  )

hist(dat$add_participation)

# How do we investigate NA values?
One thing we need to be careful with is NA values. We need to think carefully about why NA values are in our data and how to handle them appropriately.
Question: Thinking about the data that you have worked with, what are the most common sources of NA values?
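Two common checks, using the participation measure from above (a sketch, not run here):

# Count the NA values in the variable
sum(is.na(dat$q13_4_1))
# Recompute the mean while ignoring NA values
mean(dat$q13_4_1, na.rm = TRUE)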
## Find our two participation measures
add_ecols = grep("q13_4_1$|q13_5_1$", names(dat), value = T)

dat = dat %>%
  mutate(add_participation = rowSums(across(all_of(add_ecols))))
  # Add `na.rm = T` to ignore NA values when summing:
  # mutate(add_participation_end = rowSums(across(all_of(add_ecols)), na.rm = T))

# Alternatively, recode "Prefer not to say" as NA instead of dropping those rows:
# dat = dat %>%
#   mutate(home_region = na_if(home_region, "Prefer not to say")) %>%
Merging Datasets
Often, data analysis projects will require you to use variables from more than one dataset. This will require you to combine separate datasets into a single dataset in R, which is called merging. A merge uses an identifier (administrative unit names, respondent IDs, etc.) that is present in both datasets to combine them into one.
Question: Thinking about the data that you have worked with, have you ever had to merge multiple datasets? What was the data? How did you do it?
## Here we are selecting the unit identifier (response_id) and the two variables we have created and saving them to a separate dataframe
new_vars = dat %>%
  select(response_id, moved, add_participation)

## After you do some cleaning or create new variables, you may want to save that dataframe for future use
# write.csv(new_vars, here::here("workshops/aau_survey/new_vars.csv"))

## Drop the new variables from our dataframe so that we can merge them back in
dat = dat %>%
  select(-moved, -add_participation)

## Check to make sure the merge will work as expected
# nrow(dat)
# nrow(new_vars)
# dat$response_id[! dat$response_id %in% new_vars$response_id]

## Use dplyr's `full_join` function to merge datasets on the shared identifier
dat = dat %>%
  full_join(., new_vars, by = "response_id")
Data Visualization
One of the major benefits of using R is the ability to create beautiful tables and figures that can be incorporated into deliverables and other professional documents. Importantly, you can create a script that allows you to automatically recreate the same visualizations when you want to make slight changes or incorporate new data.
Question: What methods do you currently use to create data visualizations for deliverables and other research products?
Tables
# install.packages("gt")
library(gt)

descriptives = dat %>%
  group_by(moved) %>%
  summarise(observations = n(),
            share = observations / nrow(dat),
            mean = mean(add_participation, na.rm = T)) %>%
  mutate(moved = case_when(moved == 0 ~ "No",
                           moved == 1 ~ "Yes"),
         share = share * 100)

gt(descriptives) %>%
  tab_header(
    title = "Moving to University and Political Participation",
  ) %>%
  fmt_number(
    columns = 3:4,
    decimals = 2,
    use_seps = FALSE
  ) %>%
  cols_label(
    moved = md("Moved to<br>Addis"),
    observations = md("Respondents<br>(count)"),
    share = md("Respondents<br>(%)"),
    mean = md("Participation<br>(mean score)")
  ) #%>%
  # gtsave(filename = "tab_1.png")
Figures
There are many ways to create figures in R, but ggplot is the most popular.
ggplot(dat, aes(add_participation)) +
  geom_histogram(binwidth = 1)

dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(add_participation)) +
  geom_histogram(binwidth = 1)

dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(x = add_participation, fill = q3_baseline)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity")

dat %>%
  filter(add_participation < 15) %>%
  mutate(moved = fct_rev(as.factor(moved))) %>%
  ggplot(aes(x = add_participation, fill = moved)) +
  geom_histogram(aes(y = (..count..) / sum(..count..) * 100),
                 binwidth = 1, alpha = 0.5, position = "identity") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(y = "Percentage") +
  theme_bw() +
  theme(legend.title = element_blank(), legend.position = c(.8, .8))
You can do tremendous amounts of customization with ggplot to create extremely informative and professional plots.
## demographics
demos = dat %>%
  drop_na(q3_baseline) %>%
  select(`Respondent Gender` = q3_baseline, `Work as a student?` = q4_baseline,
         `Rural or urban?` = q5_baseline, `Financial support from family?` = q6_baseline,
         `Home region` = home_region,
         `Student Year` = class_year) %>%
  pivot_longer(everything()) %>%
  group_by(name, value) %>%
  tally() %>%
  mutate(pct = n/sum(n)) %>%
  top_n(n = 5, wt = pct)

demos$name = factor(demos$name, levels = c('Home region', 'Rural or urban?', 'Student Year', 'Respondent Gender', 'Work as a student?', 'Financial support from family?'))

demos <- demos %>%
  group_by(name) %>%
  mutate(total_n = sum(n)) %>%
  ungroup()

ggplot(demos, aes(y = value, x = pct)) +
  geom_col(fill = "grey") +
  facet_wrap(~name, scales = "free") +
  scale_y_discrete(labels = scales::label_wrap(30)) +
  hrbrthemes::scale_x_percent(limits = c(0, 1)) +
  labs(x = "Percent of respondents to the survey", y = NULL,
       title = "Demographic characteristics of baseline respondents",
       subtitle = glue::glue("Number of respondents = {scales::comma(nrow(dat))}"),
       caption = "Note: home region only displays top five categories by size.") +
  geom_text(
    aes(
      label = glue::glue("n = {total_n}")
    ), x = 0.92, y = -Inf,
    vjust = -1, hjust = 1,
    inherit.aes = FALSE
  )

# ggsave("/home/jeremy/Downloads/demographics.png")
Regression
Now let’s use simple linear regression to estimate the relationship between these two variables that we created.
Question: Have you used regression in your research? If so, what software did you use to run it?
est = lm(add_participation ~ moved, dat)

summary(est)
Call:
lm(formula = add_participation ~ moved, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.788 -1.788 -0.788 0.609 33.609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3911 0.1624 8.567 <2e-16 ***
moved 0.3967 0.2010 1.973 0.0488 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.673 on 778 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.00498, Adjusted R-squared: 0.003701
F-statistic: 3.894 on 1 and 778 DF, p-value: 0.04881
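If you need the estimates programmatically rather than as printed output, base R provides extractor functions; a brief sketch:

# Extract coefficients and confidence intervals from the fitted model
coef(est)
confint(est)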
What if we want to control for the influence of the number of years someone has been at university? Here, we can create a new measure of year and use multiple regression.
# install.packages("modelsummary")
library(modelsummary)

table(dat$class_year)

  Year I  Year II Year III 
     320      270      216 

dat = dat %>% mutate(
  year_bin = as.numeric(case_when(class_year == "Year I" ~ 0,
                                  class_year == "Year II" ~ NA,
                                  class_year == "Year III" ~ 1))
)

table(dat$year_bin)

  0   1 
320 216 
models <- list()
models[['Bivariate']] = lm(add_participation ~ moved, dat)
models[['Multivariate']] = lm(add_participation ~ moved + year_bin, dat)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|             | Bivariate        | Multivariate     |
|-------------|------------------|------------------|
| (Intercept) | 1.391*** (0.162) | 1.323*** (0.201) |
| moved       | 0.397* (0.201)   | 0.708** (0.219)  |
| year_bin    |                  | −0.226 (0.218)   |
| Num.Obs.    | 780              | 519              |
| R2 Adj.     | 0.004            | 0.021            |
Let’s compare two different ways of defining the year measure.
dat = dat %>% mutate(year_cont = as.numeric(case_when(class_year == "Year I" ~ 0,
                                                      class_year == "Year II" ~ 1,
                                                      class_year == "Year III" ~ 2)))

table(dat$year_cont)

  0   1   2 
320 270 216 
models <- list()
models[['Bivariate']] = lm(add_participation ~ moved + year_bin, dat)
models[['Multivariate']] = lm(add_participation ~ moved + year_cont, dat)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|             | Bivariate        | Multivariate     |
|-------------|------------------|------------------|
| (Intercept) | 1.323*** (0.201) | 1.537*** (0.202) |
| moved       | 0.708** (0.219)  | 0.366+ (0.203)   |
| year_bin    | −0.226 (0.218)   |                  |
| year_cont   |                  | −0.146 (0.120)   |
| Num.Obs.    | 519              | 780              |
| R2 Adj.     | 0.021            | 0.004            |
Appendix
Averaged Z-Scores
## Find participation measures that are based on Likert scales
# baseline
bcols = grep("^q13_.*_baseline$", names(dat), value = T)
dat[, paste0(bcols, "_st")] = dat[, bcols]
bcols = paste0(bcols, "_st")

# endline
ecols = grep("^q13_[1-7]_\\d$", names(dat), value = T)
dat[, paste0(ecols, "_st")] = dat[, ecols]
ecols = paste0(ecols, "_st")

# Create treatment variable
dat = dat %>% mutate(moved = case_when(home_region == "Addis Ababa" ~ 0, TRUE ~ 1))

# Clean the q13_ variables by converting them to factors with ordered levels
levels = c("Never", "Once or Twice", "More than twice", "More than 5 times",
           "More than 10 times")
dat = dat %>%
  mutate(across(all_of(bcols),
                .fns = ~ factor(.x, levels = levels)))
# Create z-score function from Kling, Liebman, and Katz (2007)
z_score = function(x, y){
  # calculate mean and sd of control group
  c_mean = mean(as.numeric(unlist(x[, y])), na.rm = T)
  c_sd = sd(as.numeric(unlist(x[, y])), na.rm = T)
  # subtract control group mean; divide by control group SD
  (as.numeric(x[, y, drop = TRUE]) - c_mean) / c_sd
}

# calculate z-scores
for (i in c(bcols, ecols)) {
  dat[, i] = z_score(dat, i)
}
dat = dat %>%
  rowwise() %>%
  # average the endline z-scores and the baseline z-scores separately
  mutate(z_participation_end = mean(c_across(all_of(ecols)), na.rm = TRUE)) %>%
  mutate(z_participation_base = mean(c_across(all_of(bcols)), na.rm = TRUE)) %>%
  ungroup()

regd = dat %>% select(z_participation_end, z_participation_base, moved, response_id) %>%
  pivot_longer(cols = c(z_participation_end, z_participation_base),
               names_to = "time",
               values_to = "z_participation") %>%
  mutate(time = case_when(time == "z_participation_end" ~ 1,
                          TRUE ~ 0))
models <- list()
models[['Bivariate']] = lm(z_participation ~ moved, regd)
models[['Multivariate']] = lm(z_participation ~ moved + time, regd)
models[['Interaction']] = lm(z_participation ~ moved + time + moved*time, regd)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|              | Bivariate         | Multivariate      | Interaction       |
|--------------|-------------------|-------------------|-------------------|
| (Intercept)  | −0.159*** (0.026) | −0.156*** (0.030) | −0.156*** (0.037) |
| moved        | 0.247*** (0.032)  | 0.247*** (0.032)  | 0.247*** (0.046)  |
| time         |                   | −0.004 (0.031)    | −0.005 (0.052)    |
| moved × time |                   |                   | 0.001 (0.065)     |
| Num.Obs.     | 1597              | 1597              | 1597              |
| R2 Adj.      | 0.035             | 0.034             | 0.034             |
References
Footnotes
1. Inline notes are easier to write, since you don’t have to pick an identifier and move down to type the note.