R, RStudio, and Quarto
RStudio Layout
Before using R to illustrate basic programming concepts and data analysis tools, we will get familiar with the RStudio layout.
RStudio has four primary panels that will help you interact with your data. We will use the default layout of these panels.
- Source panel: Top left
  - Edit files to create ‘scripts’ of code
- Console panel: Bottom left
  - Accepts code as input
  - Displays output when we run code
- Environment panel: Top right
  - Everything that R is holding in memory
  - Objects that you create in the console or source panels will appear here
  - You can clear the environment with the broom icon
- Viewer panel: Bottom right
  - View graphics that you generate
  - Navigate files
Illustration
Let’s use these panels to create and interact with data.
Console:
- Perform a calculation: type 2 + 2 into the console panel and hit ENTER
- Create and store an object: type sum = 2 + 2 into the console panel and hit ENTER
Source:
- Start an R script: Open a new .R file (button in the top left, below “File”)
- Create and store an object: type sum = 2 + 3 into the source panel and hit CTRL+ENTER
Environment:
- Confirm that the object sum is stored in our environment
- Use rm(sum) to clear the object from the environment
- Clear the entire environment with the broom icon
Viewer:
- Navigate through your computer’s files
- Create a plot in the source panel
Quarto
Formatting
- Quarto is an extension of markdown
- You can write text in italics and bold, and add links
- Create sections and subsections with #, ##, ###
- Declare title, file type, and references in the yaml header (top section); see the sketch below
- Comment out text
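A minimal sketch of a yaml header, followed by an HTML-style comment (the title and bibliography file here are placeholders):

---
title: "My Document"
format: html
bibliography: references.bib
---

<!-- This text is commented out and will not appear in the rendered document -->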
Write Formulas
- Use math mode to render formulas
$$ \widehat{ATE} = \overline{Y}_{\text{treatment group}} - \overline{Y}_{\text{control group}} $$
Integrate External Images
You can also integrate external images that are saved in your project folder.
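In markdown syntax, embedding an image looks something like this (the path and caption are placeholders):

![A descriptive caption](images/my_figure.png)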
Automate References
Cite cool work by outstanding scholars (Springman 2022) and automatically generate a references list at the end of your document.
- Find the paper on scholar.google.com
- Select “Cite” → “BibTeX”
- Create a references.bib file
- Store the BibTeX reference in that file
- Add references.bib to your yaml header (as sketched below)
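A hypothetical BibTeX entry (the citation key and fields are invented for illustration):

@article{springman2022,
  author  = {Springman, Jeremy},
  title   = {An Example Article Title},
  journal = {An Example Journal},
  year    = {2022}
}

Writing [@springman2022] in the Quarto source then renders as (Springman 2022) and adds the full entry to the references list, provided bibliography: references.bib appears in your yaml header.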
Add Flow Charts
There is also built-in functionality to do cool stuff like create flow charts using mermaid.
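For example, a mermaid chunk containing the following produces a simple flow chart:

flowchart LR
  A[Hard edge] --> B(Round edge)
  B --> C{Decision}
  C --> D[Result one]
  C --> E[Result two]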
Add Footnotes
Write a sentence and then add a footnote.¹
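In the Quarto source, an inline footnote looks like this:

Write a sentence and then add a footnote.^[Inline notes are easier to write, since you don’t have to pick an identifier and move down to type the note.]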
Tables By Hand
You can create tables by hand using markdown syntax.
| Right | Left | Default | Center |
|------:|:-----|---------|:------:|
|    12 | 12   | 12      |   12   |
|   123 | 123  | 123     |  123   |
|     1 | 1    | 1       |   1    |
Working with Code
To integrate code into our document, we can use code chunks. This allows us to create self-contained, reproducible documents.
Code Chunk Options
Code chunks come with specific options to control behavior. Let’s break down each option:
#| echo: true # The code will be shown in the rendered output.
#| warning: false # Any warnings produced by the code will be suppressed.
#| label: fig-fake-data # This assigns an internal label to the chunk, useful for referencing.
#| fig-cap: "Fake data figure" # This sets the caption for any figures generated by the chunk.
#| fig-width: 5 # This sets the width (in inches) of figures generated by the chunk.
Figures
Check out Figure 1
Descriptive Tables
Descriptive Table of Random Data

Summary statistics: Mean and Standard Deviation

| Variable | Mean      | SD        |
|----------|-----------|-----------|
| x        | 0.5200910 | 0.2942882 |
| y        | 0.4770270 | 0.2766775 |
| size     | 5.6385087 | 2.6461582 |
| color    | 0.5130599 | 0.2326477 |
Regression Results
|             | Model 1        | Model 2           |
|-------------|----------------|-------------------|
| (Intercept) | 0.003 (0.006)  | −0.001 (0.003)    |
| x1          | −0.003 (0.011) | −0.787*** (0.013) |
| x2          |                | 0.394*** (0.006)  |
| Num.Obs.    | 1000           | 1000              |
| R2          | 0.000          | 0.808             |
| R2 Adj.     | −0.001         | 0.807             |
| AIC         | −1768.6        | −3415.7           |
| BIC         | −1753.9        | −3396.1           |
| Log.Lik.    | 887.295        | 1711.866          |
| F           | 0.076          | 2095.146          |
Inline Code
Use code within the text to describe your data (n = 50) for a more reproducible workflow. If your data changes, the writing automatically updates.
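In the Quarto source, this is done with an inline code chunk; a minimal sketch (my_data is a placeholder object):

Use code within the text to describe your data (n = `r nrow(my_data)`).

When the document renders, `r nrow(my_data)` is replaced by the number of rows in my_data.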
Review of Basic Programming Concepts
Objects: where values are saved in R
“Object” is a generic term for anything that R stores in the environment. This can include anything from an individual number or word, to lists of values, to entire datasets.
Importantly, objects belong to different “classes” depending on the type of values that they store.
- Numerics are numbers
- Characters are text or strings like "hello world" and "welcome to R"
- Factors are groups of characters/strings with a fixed number of unique values
- Logicals are either TRUE or FALSE
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)

[1] "numeric"

# Create a character object
my_character = "welcome to R"
# Check the class
class(my_character)

[1] "character"

# Create a logical object
my_logical = TRUE
# Check the class
class(my_logical)

[1] "logical"
R can perform operations on objects.
# Create a numeric object
my_number = 5.6
# Check the class
class(my_number)

[1] "numeric"

# Perform a calculation
my_number = my_number + 5
The class of an object determines the type of operations you can perform on it. Some operations can only be run on numeric objects (numbers).
# Create a character object
my_number = "5.6"
# Check the class
class(my_number)

# Perform a calculation (these lines throw errors, because my_number is a character)
my_number + 5
round(my_number)
R contains functions that can convert some objects to different classes.
# Convert character to numeric
my_number = as.numeric("5")
class(my_number)

[1] "numeric"

my_number <- 5

# But R is only so smart
my_number = as.numeric("five")
print(my_number)

[1] NA
Data Structures
The simplest objects are single values, but most data analysis involves more complicated data structures.
Lists
Lists are a type of data structure that store multiple values together. Lists are created using c() and allow you to perform operations on a series of values. (Strictly speaking, c() creates a vector; R also has a separate list() class, but this workshop uses the terms interchangeably.)
# Create a numeric list (also called a "vector")
numeric_vector = c(6, 11, 13, 31)
# Print the vector
print(numeric_vector)

[1]  6 11 13 31

# Check the class
class(numeric_vector)

[1] "numeric"

# Calculate the mean
mean(numeric_vector)

[1] 15.25
An important part of working with more complex data structures is called “indexing.” Indexing allows you to extract specific values from a data structure.
# Extract the 2nd element from the list
numeric_vector[2]

[1] 11

# Extract elements 2-4
numeric_vector[2:4]

[1] 11 13 31

# Extract elements 1-2
numeric_vector[c(TRUE, TRUE, FALSE, FALSE)]

[1]  6 11
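Indexing with negative numbers drops elements instead of extracting them; a quick sketch:

# Drop the 1st element
numeric_vector[-1]

[1] 11 13 31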
Dataframes
Data frames are the most common type of data structure used in research. Data frames combine multiple lists of values into a single object.
# Create a dataframe
my_data = data.frame(
  x1 = rnorm(100, mean = 1, sd = 1),
  x2 = rnorm(100, mean = 1, sd = 1)
)
class(my_data)

[1] "data.frame"
Anything that comes in a spreadsheet (for example, an Excel file) can be loaded into an R environment as a dataframe. R works most easily when spreadsheets are saved as a .csv file.
# Use `read.csv()` to load data from a website
dat = read.csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv")

# Use `read.csv()` to load data from your computer's Downloads folder
# dat = read.csv("/home/jeremy/Downloads/clean_endline_did.csv")
In most data frames, rows correspond to observations and columns correspond to variables that describe the observations. Here, we are looking at survey data from an RCT involving university students in Addis Ababa. Each row corresponds to a different survey respondent, and each column represents their answers to a different question from the survey.
Loading Packages
Packages are an extremely important part of data analysis with R.
- R gives you access to thousands of “packages” that are created by users
- Packages contain bundles of code called “functions” that can execute specific tasks
- Use install.packages() to install a package and library() to load a package
In the next section, we’ll use the package dplyr to perform some data cleaning. dplyr is part of a universe of packages called the tidyverse. Since this is one of the most important packages in the R ecosystem, let’s install and load it.
# install.packages("tidyverse")
library(tidyverse)
When you are searching online or asking ChatGPT how to perform a specific task in R, it often helps to specify that you are looking for a solution in dplyr.
Cleaning Data
In the real world, data never comes ready to be analyzed. Data cleaning is the process of manipulating data so that it can be analyzed. This is usually the most difficult and time-consuming part of any data analysis project. Let’s walk through some examples.
Question: In your work, what data cleaning have you had to do? How have you cleaned your data?
Creating Variables
Imagine we want to analyze the relationship between whether a respondent moved to Addis Ababa to attend university and their level of political participation. However, there are two problems:
- We don’t have a specific variable that measures whether or not respondents moved
- We have many measures of participation
How can we create a variable measuring whether the respondent moved to Addis Ababa? We have a multiple-choice question asking students what region they come from.
Let’s start by investigating this variable.
# The name of the variable in our dataframe is `q8_baseline`
table(dat$q8_baseline)
Addis Ababa                                           283
Afar Region                                             1
Amhara Region                                         174
Benishangul-Gumuz Region                                5
Dire Dawa (city)                                        5
Gambela Region                                          1
Harari Region                                           4
Oromia Region                                         205
Other                                                  12
Prefer not to say                                      19
Sidama Region                                          35
Somali Region                                           5
South West Ethiopia Peoples Region                      6
Southern Nations, Nationalities, and Peoples Region    65
Tigray Region                                           5
dat = dat %>% # this is called a "pipe"
  # give our variable a better name
  rename(home_region = q8_baseline)

dat = dat %>%
  # drop respondents who report "Prefer not to say"
  filter(!home_region == "Prefer not to say")

dat = dat %>%
  # clean home region variable using `mutate()`
  mutate(
    # Shorten a long name to an abbreviation
    home_region = ifelse(home_region == "Southern Nations, Nationalities, and Peoples Region", "SNNPR", home_region),
    # remove the word "Region" (and " (city)") from every observation of this column
    home_region = str_remove(home_region, " Region"),
    home_region = str_remove(home_region, " \\(city\\)")
  )
# Check if it worked
table(dat$home_region)
                Addis Ababa                        Afar 
                        283                           1 
                     Amhara           Benishangul-Gumuz 
                        174                           5 
                  Dire Dawa                     Gambela 
                          5                           1 
                     Harari                      Oromia 
                          4                         205 
                      Other                      Sidama 
                         12                          35 
                      SNNPR                      Somali 
                         65                           5 
South West Ethiopia Peoples                      Tigray 
                          6                           5 
# Chain these all together for more concise code
dat = read_csv("https://raw.githubusercontent.com/jrspringman/psci3200-globaldev/main/workshops/aau_survey/clean_endline_did.csv") %>%
  # give our variable a better name
  rename(home_region = q8_baseline) %>%
  # drop respondents who report "Prefer not to say"
  filter(!home_region == "Prefer not to say") %>%
  # clean home region variable
  mutate(
    # Shorten long names to abbreviations
    home_region = case_when(
      home_region == "Southern Nations, Nationalities, and Peoples Region" ~ "SNNPR",
      home_region == "South West Ethiopia Peoples Region" ~ "SWEPR",
      TRUE ~ home_region
    ),
    # remove the word "Region" (or " (city)") from every observation of this column
    home_region = str_remove(home_region, " Region| \\(city\\)")
  )
# Check if it worked
table(dat$home_region)
      Addis Ababa               Afar             Amhara  Benishangul-Gumuz 
              283                  1                174                  5 
        Dire Dawa            Gambela             Harari             Oromia 
                5                  1                  4                205 
            Other             Sidama              SNNPR             Somali 
               12                 35                 65                  5 
            SWEPR             Tigray 
                6                  5 
Now that we’ve cleaned up the names, we want to create a variable that tells us whether or not each respondent is originally from Addis Ababa. This will let us measure whether or not they moved to Addis in order to attend college.
# Create a measure of whether a respondent moved to Addis Ababa in order to attend university (our independent variable)
dat = dat %>%
  mutate(
    moved = case_when(
      home_region == "Addis Ababa" ~ 0,
      TRUE ~ 1
    )
  )
table(dat$moved, dat$home_region)

    Addis Ababa Afar Amhara Benishangul-Gumuz Dire Dawa Gambela Harari Oromia
  0         283    0      0                 0         0       0      0      0
  1           0    1    174                 5         5       1      4    205

    Other Sidama SNNPR Somali SWEPR Tigray
  0     0      0     0      0     0      0
  1    12     35    65      5     6      5
Now, we need to create our second variable measuring levels of political participation. Remember, the challenge is that we have multiple measures of participation. Let’s start with two measures:
- Number of times you’ve contacted a gov’t official: q13_4_1
- Number of times you’ve signed a petition: q13_5_1
# Check out the distribution of our variables
dat %>% select(q13_4_1, q13_5_1) %>% head(5)
# A tibble: 5 × 2
q13_4_1 q13_5_1
<dbl> <dbl>
1 0 0
2 3 0
3 0 0
4 0 0
5 0 2
mean(dat$q13_4_1)
[1] NA
hist(dat$q13_5_1)
# Create a single measure
dat = dat %>%
  mutate(
    add_participation = q13_4_1 + q13_5_1
  )

hist(dat$add_participation)

# How do we investigate NA values?
One thing we need to be careful with is NA values. We need to think carefully about why NA values are in our data and how to handle them appropriately.
Question: Thinking about the data that you have worked with, what are the most common sources of NA values?
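Two common checks, using the participation measure from above (a sketch, not run here):

# Count the NA values in the variable
sum(is.na(dat$q13_4_1))
# Recompute the mean while ignoring NA values
mean(dat$q13_4_1, na.rm = TRUE)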
## Find our two participation measures
add_ecols = grep("q13_4_1$|q13_5_1$", names(dat), value = T)

dat = dat %>%
  mutate(add_participation = rowSums(across(all_of(add_ecols))))
  # Add `na.rm = T` to ignore NA values when summing:
  # mutate(add_participation_end = rowSums(across(all_of(add_ecols)), na.rm = T))

# Alternatively, recode "Prefer not to say" as NA instead of dropping those rows:
# dat = dat %>%
#   mutate(home_region = na_if(home_region, "Prefer not to say")) %>%
Merging Datasets
Often, data analysis projects will require you to use variables from more than one dataset. This will require you to combine separate datasets into a single dataset in R, which is called merging. A merge uses an identifier (administrative unit names, respondent IDs, etc.) that is present in both datasets to combine them into one.
Question: Thinking about the data that you have worked with, have you ever had to merge multiple datasets? What was the data? How did you do it?
## Here we are selecting the unit identifier (response_id) and the two variables we have created and saving them to a separate dataframe
new_vars = dat %>%
  select(response_id, moved, add_participation)

## After you do some cleaning or create new variables, you may want to save that dataframe for future use
# write.csv(new_vars, here::here("workshops/aau_survey/new_vars.csv"))

## Drop the new variables from our dataframe so that we can merge them back in
dat = dat %>%
  select(-moved, -add_participation)

## Check to make sure the merge will work as expected
# nrow(dat)
# nrow(new_vars)
# dat$response_id[! dat$response_id %in% new_vars$response_id]

## Use dplyr's `full_join` function to merge datasets on the shared identifier
dat = dat %>%
  full_join(., new_vars, by = "response_id")
Data Visualization
One of the major benefits of using R is the ability to create beautiful tables and figures that can be incorporated into deliverables and other professional documents. Importantly, you can create a script that allows you to automatically recreate the same visualizations when you want to make slight changes or incorporate new data.
Question: What methods do you currently use to create data visualizations for deliverables and other research products?
Tables
# install.packages("gt")
library(gt)

descriptives = dat %>%
  group_by(moved) %>%
  summarise(observations = n(),
            share = observations / nrow(dat),
            mean = mean(add_participation, na.rm = T)) %>%
  mutate(moved = case_when(moved == 0 ~ "No",
                           moved == 1 ~ "Yes"),
         share = share * 100)

gt(descriptives) %>%
  tab_header(
    title = "Moving to University and Political Participation",
  ) %>%
  fmt_number(
    columns = 3:4,
    decimals = 2,
    use_seps = FALSE
  ) %>%
  cols_label(
    moved = md("Moved to<br>Addis"),
    observations = md("Respondents<br>(count)"),
    share = md("Respondents<br>(%)"),
    mean = md("Participation<br>(mean score)")
  ) #%>%
  # gtsave(filename = "tab_1.png")
Figures
There are many ways to create figures in R, but ggplot is the most popular.
ggplot(dat, aes(add_participation)) +
  geom_histogram(binwidth = 1)

dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(add_participation)) +
  geom_histogram(binwidth = 1)

dat %>%
  filter(add_participation < 15) %>%
  ggplot(aes(x = add_participation, fill = q3_baseline)) +
  geom_histogram(binwidth = 1, alpha = 0.5, position = "identity")

dat %>%
  filter(add_participation < 15) %>%
  mutate(moved = fct_rev(as.factor(moved))) %>%
  ggplot(aes(x = add_participation, fill = moved)) +
  geom_histogram(aes(y = (..count..) / sum(..count..) * 100),
                 binwidth = 1, alpha = 0.5, position = "identity") +
  scale_y_continuous(labels = scales::percent_format(scale = 1)) +
  labs(y = "Percentage") +
  theme_bw() +
  theme(legend.title = element_blank(), legend.position = c(.8, .8))
You can do tremendous amounts of customization with ggplot to create extremely informative and professional plots.
## demographics
demos = dat %>%
  drop_na(q3_baseline) %>%
  select(`Respondent Gender` = q3_baseline, `Work as a student?` = q4_baseline,
         `Rural or urban?` = q5_baseline, `Financial support from family?` = q6_baseline,
         `Home region` = home_region,
         `Student Year` = class_year) %>%
  pivot_longer(everything()) %>%
  group_by(name, value) %>%
  tally() %>%
  mutate(pct = n/sum(n)) %>%
  top_n(n = 5, wt = pct)

demos$name = factor(demos$name, levels = c('Home region', 'Rural or urban?', 'Student Year', 'Respondent Gender', 'Work as a student?', 'Financial support from family?'))

demos <- demos %>%
  group_by(name) %>%
  mutate(total_n = sum(n)) %>%
  ungroup()

ggplot(demos, aes(y = value, x = pct)) +
  geom_col(fill = "grey") +
  facet_wrap(~name, scales = "free") +
  scale_y_discrete(labels = scales::label_wrap(30)) +
  hrbrthemes::scale_x_percent(limits = c(0, 1)) +
  labs(x = "Percent of respondents to the survey", y = NULL,
       title = "Demographic characteristics of baseline respondents",
       subtitle = glue::glue("Number of respondents = {scales::comma(nrow(dat))}"),
       caption = "Note: home region only displays top five categories by size.") +
  geom_text(
    aes(
      label = glue::glue("n = {total_n}")
    ), x = 0.92, y = -Inf,
    vjust = -1, hjust = 1,
    inherit.aes = FALSE
  )

# ggsave("/home/jeremy/Downloads/demographics.png")
Regression
Now let’s use simple linear regression to estimate the relationship between these two variables that we created.
Question: Have you used regression in your research? If so, what software did you use to run it?
est = lm(add_participation ~ moved, dat)

summary(est)
Call:
lm(formula = add_participation ~ moved, data = dat)
Residuals:
Min 1Q Median 3Q Max
-1.788 -1.788 -0.788 0.609 33.609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3911 0.1624 8.567 <2e-16 ***
moved 0.3967 0.2010 1.973 0.0488 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.673 on 778 degrees of freedom
(26 observations deleted due to missingness)
Multiple R-squared: 0.00498, Adjusted R-squared: 0.003701
F-statistic: 3.894 on 1 and 778 DF, p-value: 0.04881
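If you need the estimates programmatically rather than as printed output, base R provides extractor functions; a brief sketch:

# Extract coefficients and confidence intervals from the fitted model
coef(est)
confint(est)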
What if we want to control for the influence of the number of years someone has been at university? Here, we can create a new measure of year and use multiple regression.
# install.packages("modelsummary")
library(modelsummary)

table(dat$class_year)

  Year I  Year II Year III 
     320      270      216 

dat = dat %>% mutate(
  year_bin = as.numeric(case_when(class_year == "Year I" ~ 0,
                                  class_year == "Year II" ~ NA,
                                  class_year == "Year III" ~ 1))
)

table(dat$year_bin)

  0   1 
320 216 
models <- list()
models[['Bivariate']] = lm(add_participation ~ moved, dat)
models[['Multivariate']] = lm(add_participation ~ moved + year_bin, dat)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|             | Bivariate        | Multivariate     |
|-------------|------------------|------------------|
| (Intercept) | 1.391*** (0.162) | 1.323*** (0.201) |
| moved       | 0.397* (0.201)   | 0.708** (0.219)  |
| year_bin    |                  | −0.226 (0.218)   |
| Num.Obs.    | 780              | 519              |
| R2 Adj.     | 0.004            | 0.021            |
Let’s compare two different ways of defining the year measure.
dat = dat %>% mutate(year_cont = as.numeric(case_when(class_year == "Year I" ~ 0,
                                                      class_year == "Year II" ~ 1,
                                                      class_year == "Year III" ~ 2)))

table(dat$year_cont)

  0   1   2 
320 270 216 
models <- list()
models[['Bivariate']] = lm(add_participation ~ moved + year_bin, dat)
models[['Multivariate']] = lm(add_participation ~ moved + year_cont, dat)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|             | Bivariate        | Multivariate     |
|-------------|------------------|------------------|
| (Intercept) | 1.323*** (0.201) | 1.537*** (0.202) |
| moved       | 0.708** (0.219)  | 0.366+ (0.203)   |
| year_bin    | −0.226 (0.218)   |                  |
| year_cont   |                  | −0.146 (0.120)   |
| Num.Obs.    | 519              | 780              |
| R2 Adj.     | 0.021            | 0.004            |
Appendix
Averaged Z-Scores
## Find participation measures that are based on Likert scales
# baseline
bcols = grep("^q13_.*_baseline$", names(dat), value = T)
dat[, paste0(bcols, "_st")] = dat[, bcols]
bcols = paste0(bcols, "_st")

# endline
ecols = grep("^q13_[1-7]_\\d$", names(dat), value = T)
dat[, paste0(ecols, "_st")] = dat[, ecols]
ecols = paste0(ecols, "_st")

# Create treatment variable
dat = dat %>% mutate(moved = case_when(home_region == "Addis Ababa" ~ 0, TRUE ~ 1))

# Clean the q13_ variables by converting them to factors with ordered levels
levels = c("Never", "Once or Twice", "More than twice", "More than 5 times",
           "More than 10 times")
dat = dat %>%
  mutate(across(all_of(bcols),
                .fns = ~ factor(.x, levels = levels)))
# Create z-score function from Kling, Liebman, and Katz (2007)
z_score = function(x, y){
  # calculate mean and sd of control group
  c_mean = mean(as.numeric(unlist(x[, y])), na.rm = T)
  c_sd = sd(as.numeric(unlist(x[, y])), na.rm = T)
  # subtract control group mean; divide by control group SD
  (as.numeric(x[, y, drop = TRUE]) - c_mean) / c_sd
}

# calculate z-scores
for (i in c(bcols, ecols)) {
  dat[, i] = z_score(dat, i)
}
dat = dat %>%
  rowwise() %>%
  # average the endline z-scores and the baseline z-scores separately
  mutate(z_participation_end = mean(c_across(all_of(ecols)), na.rm = TRUE)) %>%
  mutate(z_participation_base = mean(c_across(all_of(bcols)), na.rm = TRUE)) %>%
  ungroup()

regd = dat %>% select(z_participation_end, z_participation_base, moved, response_id) %>%
  pivot_longer(cols = c(z_participation_end, z_participation_base),
               names_to = "time",
               values_to = "z_participation") %>%
  mutate(time = case_when(time == "z_participation_end" ~ 1,
                          TRUE ~ 0))
models <- list()
models[['Bivariate']] = lm(z_participation ~ moved, regd)
models[['Multivariate']] = lm(z_participation ~ moved + time, regd)
models[['Interaction']] = lm(z_participation ~ moved + time + moved*time, regd)

modelsummary(
  models,
  estimate = "{estimate}{stars} ({std.error})",
  statistic = NULL,
  gof_omit = 'IC|RMSE|Log|F|R2$|Std.')
|              | Bivariate         | Multivariate      | Interaction       |
|--------------|-------------------|-------------------|-------------------|
| (Intercept)  | −0.159*** (0.026) | −0.156*** (0.030) | −0.156*** (0.037) |
| moved        | 0.247*** (0.032)  | 0.247*** (0.032)  | 0.247*** (0.046)  |
| time         |                   | −0.004 (0.031)    | −0.005 (0.052)    |
| moved × time |                   |                   | 0.001 (0.065)     |
| Num.Obs.     | 1597              | 1597              | 1597              |
| R2 Adj.      | 0.035             | 0.034             | 0.034             |
References
Footnotes
1. Inline notes are easier to write, since you don’t have to pick an identifier and move down to type the note.