Week 2
Data Cleaning and Analysis in R

SOCI 269

Sakeef M. Karim
Amherst College

AN INTRODUCTION TO QUANTITATIVE SOCIOLOGY—CULTURE AND POWER

Data Wrangling I—
February 4th

Let’s Get Things Started

Make sure you’re … well, here on this slide.

In 1-2 weeks, we’ll all be able to produce plots like this.

But let’s take things one step at a time.

The Absolute Basics

Calculations

You can use your console as a calculator.

Feel free to play around with the numbers above.

What’s 18% of 905? Use the console on this slide to arrive at an answer.
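
The original console is not reproduced here, so here is a minimal sketch of console arithmetic. The numbers are arbitrary examples rather than the ones from the slide.

```r
# Basic arithmetic works as it would on a calculator
2 + 2
10 / 4
3^2

# A percentage is a proportion multiplied by a quantity:
# 25% of 80 is 0.25 * 80
0.25 * 80
```

The same pattern (proportion times quantity) applies to any "X% of Y" question.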

Note: Keep clicking, or press the space bar on your keyboard, to advance through the slide deck.

Combining Elements

Use the combine function—c()—to stitch together multiple elements of the same type, yielding an atomic vector.

Here’s a character vector.

And here’s a numeric vector.
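
Since the vectors from the slide are not shown, here is a small sketch with made-up elements. It also illustrates what "of the same type" means in practice: mixed inputs are coerced to a single type.

```r
# A character vector: every element is a string
c("winter", "spring", "summer", "fall")

# A numeric vector: every element is a number
c(18, 21, 34.5)

# Mixing types does not fail; R coerces every element to one type
class(c("a", 1, TRUE))
```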

The Assignment Operator

We can create new objects with the assignment operator <-.

Here’s how we can store the vector c(1, 2, 3) as an object named x.

Can you store the character vector embedded below as a new object named five_colleges?
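
A minimal sketch of assignment follows. The character vector from the slide is not shown, so the contents of five_colleges below are an assumption (the members of the Five College Consortium).

```r
# Store a numeric vector as an object named x
x <- c(1, 2, 3)

# Typing the object's name prints its contents
x

# Stored objects can be reused in later calculations
x * 2

# Assumed contents: the original character vector is not reproduced here
five_colleges <- c("Amherst", "Hampshire", "Mount Holyoke",
                   "Smith", "UMass Amherst")
length(five_colleges)
```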

Naming Conventions

Try to run the line below.

A Question

What went awry?

As Wickham, Çetinkaya-Rundel and Grolemund (2023) note, “object names must start with a letter and can only contain letters, numbers, _, and ..”
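
The line from the slide is not shown, so here is a hypothetical illustration of the naming rule. The invalid name is stored as text and checked with parse(), which lets the script keep running instead of stopping at the error.

```r
# Valid names: start with a letter; contain letters, numbers, _ and .
week_2 <- "fine"
week.2 <- "also fine"

# A hypothetical invalid name: it starts with a number
bad_line <- "2nd_week <- 'not fine'"

# parse() checks whether R can read the line; here it cannot
parses_ok <- tryCatch(
  {
    parse(text = bad_line)
    TRUE
  },
  error = function(e) FALSE
)

parses_ok
```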

Data Frames and Base Functions

R is particularly useful for working with data frames.

Change the name of this data frame to IRIS.

We can use the $ operator to isolate elements within list or data.frame objects. This comes in handy when we use basic functions in R.

Now, calculate the mean of the Petal.Length variable in iris.
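
Here is a short sketch of the ideas on this slide, using a different column from the one in the prompt. The object name flowers is an arbitrary choice.

```r
# iris ships with R; copy it under a new (arbitrary) name
flowers <- iris

# Inspect the first few rows and the dimensions (rows, columns)
head(flowers)
dim(flowers)

# $ pulls out a single column as a vector
sepal_lengths <- flowers$Sepal.Length

# Base functions such as mean() operate on that vector
mean(sepal_lengths)
```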

Annotations

We can annotate our code using the # sign.

Now, find the median value of iris$Petal.Width while weaving in some crude annotations.
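
As a rough illustration of annotated code, here is a sketch that uses a different iris column from the one in the prompt.

```r
# Everything after a # on a line is ignored by R

# Median sepal length across all flowers in iris
sepal_median <- median(iris$Sepal.Length)  # comments can also follow code

# Print the result
sepal_median
```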

For more elaborate exposition, consider using RMarkdown or Quarto to generate LaTeX, .docx, .html and .ipynb documents within R.

Exercise I

Four Simple Tasks

  1. Fire up RStudio and run the following code:

  2. Store penguins as a new object named penguins_week2.

  3. Calculate the mean of bill_length_mm in the dataset.

  4. How many male and female penguins are in penguins_week2? Use the table function to find out, and add some annotations, too.
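
The table function has not appeared yet in these slides, so here is a minimal sketch of how it counts categories. It uses iris rather than the penguins data, which leaves the exercise itself untouched.

```r
# Count how many flowers belong to each species in iris
species_counts <- table(iris$Species)
species_counts

# useNA = "ifany" adds a count of missing values when there are any
table(iris$Species, useNA = "ifany")
```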

Introduction to dplyr
Rows

A Reminder

The Pipe Operator

The pipe operator (|> in base R) will help us chain together a series of dplyr verbs as we manipulate our data frames.
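
To see what the pipe does, here is a small sketch comparing nested calls with piped calls. It assumes R 4.1 or later, where |> is available.

```r
# Without the pipe, nested calls are read from the inside out
nested <- round(mean(sqrt(c(1, 4, 9))), 1)

# With the pipe, the same steps are read from left to right:
# take the vector, then take square roots, then average, then round
piped <- c(1, 4, 9) |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

The pipe passes the object on its left as the first argument of the function on its right, which is why it pairs well with dplyr verbs.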

dplyr::filter()

The filter() function allows us to zoom in on our observations of interest based on column values.

Logical vectors are often used to streamline the “filtration” process.

We can use string patterns in character vectors to make life easier, too.
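
The code from this slide is not reproduced, so here is a sketch of the three ideas above. It assumes the dplyr and palmerpenguins packages are installed. The conditions are arbitrary, and base R's grepl() stands in for whichever string helper the original used.

```r
library(dplyr)
library(palmerpenguins)

# Keep only the rows where a condition is TRUE
adelie <- penguins |>
  filter(species == "Adelie")

# Logical operators combine conditions: & (and), | (or)
heavy_biscoe <- penguins |>
  filter(island == "Biscoe" & body_mass_g > 5000)

# %in% matches against several values at once
two_islands <- penguins |>
  filter(island %in% c("Dream", "Torgersen"))

# String patterns: keep species whose names start with "G"
starts_with_g <- penguins |>
  filter(grepl("^G", species))
```

Rows where a condition evaluates to NA are dropped along with rows where it is FALSE.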

dplyr::arrange()

The arrange() function allows us to modify—or have fine control over—the order of our observations based on column values.

We can arrange our rows using multiple variables.

We can organize rows in descending order, too.
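
A minimal sketch of the three uses above, assuming dplyr and palmerpenguins are installed. The sorting variables are arbitrary choices.

```r
library(dplyr)
library(palmerpenguins)

# Sort rows from the lightest to the heaviest penguin
by_mass <- penguins |>
  arrange(body_mass_g)

# Several variables: ties in the first are broken by the second
by_island_mass <- penguins |>
  arrange(island, body_mass_g)

# desc() reverses the order (heaviest first)
heaviest_first <- penguins |>
  arrange(desc(body_mass_g))
```

Missing values are placed at the end in both ascending and descending order.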

dplyr::distinct()

Calling the distinct() function returns all the unique rows or values in a dataset vis-à-vis specific variables or their combinations.

Here are the distinct values of the year variable in palmerpenguins::penguins:

We can isolate unique species and island pairs, too.

distinct() is helpful for removing “redundant” information or extraneous rows.
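
The calls described above might look roughly like this, assuming dplyr and palmerpenguins are installed. The .keep_all example is an addition for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Unique values of a single variable
years <- penguins |>
  distinct(year)

# Unique combinations of two variables
species_island <- penguins |>
  distinct(species, island)

# .keep_all = TRUE retains the other columns,
# taken from the first row of each unique combination
first_per_species <- penguins |>
  distinct(species, .keep_all = TRUE)
```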

Introduction to dplyr
Columns

dplyr::mutate()

The mutate() function lets us create new variables or modify existing columns. Generally, this is achieved by manipulating the variables in our data.

Create a new variable for population in 1000s.

Logical vectors are useful for generating new variables.
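
Here is a sketch of both ideas, assuming dplyr and gapminder are installed. The new column names and the threshold of 70 are arbitrary choices for illustration.

```r
library(dplyr)
library(gapminder)

gap_new <- gapminder |>
  mutate(
    # New variable: population in thousands
    pop_1000s = pop / 1000,
    # Logical vector: TRUE when life expectancy exceeds 70 years
    long_lived = lifeExp > 70,
    # A logical test can also drive a labelled variable
    life_group = if_else(lifeExp > 70, "Above 70", "70 or below")
  )
```

mutate() keeps every row and adds the new columns at the end of the data frame by default.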

dplyr::select()

The select() function lets us home in on our variables of interest.

We can provide (and modify) column names or positions.

We can use helper functions when variables are legion.

How can we remove the first three columns?
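
A rough sketch of these uses, assuming dplyr and palmerpenguins are installed. The last call shows one of several ways to drop the first three columns.

```r
library(dplyr)
library(palmerpenguins)

# Select columns by name
by_name <- penguins |>
  select(species, island, body_mass_g)

# Rename while selecting (new = old), or select by position
renamed_pos <- penguins |>
  select(penguin_species = species, 2, 8)

# Helper functions pick columns by pattern or by type
mm_cols <- penguins |>
  select(species, ends_with("_mm"))

numeric_cols <- penguins |>
  select(where(is.numeric))

# A minus sign drops columns; here, the first three
no_first_three <- penguins |>
  select(-(1:3))
```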

dplyr::rename()

The rename() function allows us to rename variables without adjusting the dimensions of our data frame.

We can rename columns by specifying variable names as follows: new = old.

We can also use column positions to point to the variables we’d like to rename:
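
Both approaches can be sketched as follows, assuming dplyr and palmerpenguins are installed. The new names are made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Rename by name: new = old
renamed <- penguins |>
  rename(mass_grams = body_mass_g,
         study_year = year)

# Rename by position: the first column becomes penguin_species
by_position <- penguins |>
  rename(penguin_species = 1)
```

Unlike select(), rename() keeps every column, so the dimensions of the data frame do not change.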

dplyr::relocate()

The relocate() function allows us to move our columns around.

Here’s how we can move a few columns to the “front.”

We can specify where we want columns relocated.

Helper functions can, well, be quite helpful.
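
The three cases above might look like this, assuming dplyr and palmerpenguins are installed. The column choices are arbitrary.

```r
library(dplyr)
library(palmerpenguins)

# By default, relocated columns move to the front
to_front <- penguins |>
  relocate(year, sex)

# .before and .after control the destination
after_island <- penguins |>
  relocate(year, .after = island)

# Helper functions work here as well: numeric columns first
numeric_first <- penguins |>
  relocate(where(is.numeric))
```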

Exercise II

The Varieties of Democracy

For the rest of today’s session, we’re going to be working with data from the Varieties of Democracy Project (Coppedge et al. 2024). You can access the codebook for V-Dem 14 below:

The Varieties of Democracy

Your Tasks

  1. Fire up RStudio and run the following code—
install.packages("devtools")

devtools::install_github("vdeminstitute/vdemdata")

library(tidyverse)

vdem <- vdemdata::vdem |> as_tibble()

Alternatively, you can access the data here.

  2. Use the dplyr verbs we discussed today to heavily modify the dataset (say, by isolating specific columns, years, or countries of interest before generating new variables).

Data Wrangling II—
February 11th

An Update

No Make-Up Date

Unfortunately, there is no assigned make-up date for last week’s cancelled session. Ergo, the syllabus will need to be adjusted as we get more information about how instructors can weave in optional sessions moving forward.

A Note on Office Hours

I expect each and every one of you to schedule a meeting by early March.

Another Question For All of You

What social, cultural, or political phenomenon would you be most interested in exploring during this class?

Introduction to dplyr
Groups

dplyr::group_by()

The group_by() function allows us to group our data before performing additional operations or transformations.

What’s weird about this code?

Crucially, we can apply group_by() to more than one column.
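
The code from this slide is not shown. As a sketch, assuming dplyr and palmerpenguins are installed, the following shows that grouping on its own changes only the metadata of a data frame, not its rows or columns.

```r
library(dplyr)
library(palmerpenguins)

# Group by one column
grouped <- penguins |>
  group_by(species)

# The data look the same; only the grouping metadata differs
group_vars(grouped)
n_groups(grouped)

# Group by more than one column
two_way <- penguins |>
  group_by(species, island)

n_groups(two_way)

# ungroup() removes the grouping again
ungrouped <- grouped |>
  ungroup()
```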

dplyr::summarise()

We can call the summarize() function—or summarise()—to easily compute summary statistics at the group level.

What do we need to change before the code can run smoothly?

We can generate summary statistics by group combinations (or group x group interactions).

As of dplyr 1.1.0, we can use the .by argument for “per-operation” grouping.
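
A minimal sketch of grouped summaries, assuming palmerpenguins and dplyr 1.1.0 or later are installed. The chosen statistics are arbitrary. Note the na.rm = TRUE argument: without it, a group containing missing values returns NA.

```r
library(dplyr)
library(palmerpenguins)

# One row per group
by_species <- penguins |>
  group_by(species) |>
  summarise(
    n = n(),
    mean_mass = mean(body_mass_g, na.rm = TRUE)
  )

# Group combinations; .groups = "drop" returns an ungrouped result
by_species_sex <- penguins |>
  group_by(species, sex) |>
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE),
            .groups = "drop")

# Per-operation grouping with .by (no group_by() needed)
with_by <- penguins |>
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE),
            .by = species)
```

One difference worth knowing: group_by() sorts the output by group, whereas .by keeps groups in their order of first appearance.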

dplyr::slice()

The slice() function allows us to extract observations of interest.

Here, we isolate the first (or last) two rows for each continent in gapminder.

slice_min() and slice_max() allow us to isolate the minimum or maximum values of column x.

The slice_sample() function can be used to randomly draw from our (grouped or ungrouped) data.
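
The slice family might be used roughly as follows, assuming dplyr and gapminder are installed. The column choices and the seed are arbitrary.

```r
library(dplyr)
library(gapminder)

# First two rows for each continent
first_two <- gapminder |>
  group_by(continent) |>
  slice_head(n = 2)

# Last two rows for each continent
last_two <- gapminder |>
  group_by(continent) |>
  slice_tail(n = 2)

# Row with the lowest life expectancy overall
# (with_ties = FALSE guarantees exactly one row)
lowest_life <- gapminder |>
  slice_min(lifeExp, n = 1, with_ties = FALSE)

# Row with the highest GDP per capita within each continent
richest <- gapminder |>
  group_by(continent) |>
  slice_max(gdpPercap, n = 1, with_ties = FALSE)

# Five random rows; set.seed() makes the draw reproducible
set.seed(269)
random_rows <- gapminder |>
  slice_sample(n = 5)
```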

An FYI

Note

The last two verbs—i.e., summarize() and slice()—can be applied to ungrouped data as well. What’s more, the other verbs we encountered last Tuesday—e.g., filter() or mutate()—can also be used to manipulate grouped data.
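
To make this note concrete, here is a sketch that assumes dplyr and palmerpenguins are installed. The new column name is made up for illustration.

```r
library(dplyr)
library(palmerpenguins)

# summarise() on ungrouped data collapses everything to one row
overall <- penguins |>
  summarise(mean_mass = mean(body_mass_g, na.rm = TRUE))

# mutate() on grouped data attaches each group's mean to its rows
with_species_avg <- penguins |>
  group_by(species) |>
  mutate(species_avg_mass = mean(body_mass_g, na.rm = TRUE)) |>
  ungroup()

# filter() on grouped data compares each row with its own group
above_avg <- penguins |>
  group_by(species) |>
  filter(body_mass_g > mean(body_mass_g, na.rm = TRUE)) |>
  ungroup()
```

A grouped mutate() keeps every row, which makes it handy for adding group-level averages next to individual values.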

Return to V-Dem

Today’s Exercise

Manipulate your V-Dem data frame so that it looks like this:

# A tibble: 72 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.84                 0.492   0.765         0.601
 2 Canada   2001     0.837                0.495   0.761         0.594
 3 Canada   2002     0.831                0.505   0.754         0.587
 4 Canada   2003     0.831                0.514   0.754         0.584
 5 Canada   2004     0.83                 0.514   0.758         0.58 
 6 Canada   2005     0.828                0.518   0.758         0.579
 7 Canada   2006     0.836                0.522   0.765         0.579
 8 Canada   2007     0.838                0.519   0.767         0.579
 9 Canada   2008     0.835                0.522   0.766         0.578
10 Canada   2009     0.835                0.524   0.765         0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>

You can access the can_mex_usa data frame via GitHub or by copying—and executing—the following line:

readRDS(url("https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%202/can_mex_usa.rds"))

Note: Need a hint? Skip to the next slide.

Today’s Exercise

A Hint & Some More Work

This character vector should be of interest:

high_level <- c("v2x_polyarchy", 
                "v2x_libdem",
                "v2x_partipdem", 
                "v2x_delibdem", 
                "v2x_egaldem")
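
One way a character vector like this can be put to work is with the all_of() helper inside select(). The sketch below uses a toy data frame with placeholder values (not real V-Dem scores). It assumes dplyr is installed and that the identifier columns are named country_name and year.

```r
library(dplyr)

high_level <- c("v2x_polyarchy",
                "v2x_libdem",
                "v2x_partipdem",
                "v2x_delibdem",
                "v2x_egaldem")

# Toy data with placeholder values; NOT real V-Dem scores
toy <- tibble(
  country_name  = c("A", "B"),
  year          = c(2000, 2000),
  v2x_polyarchy = c(0.1, 0.2),
  v2x_libdem    = c(0.1, 0.2),
  v2x_partipdem = c(0.1, 0.2),
  v2x_delibdem  = c(0.1, 0.2),
  v2x_egaldem   = c(0.1, 0.2),
  other_column  = c(1, 2)
)

# all_of() selects every column named in the character vector
selected <- toy |>
  select(country_name, year, all_of(high_level))

names(selected)
```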

Are you done? Congrats! Here’s some more data to explore (cf. Healy 2023):

readRDS(url("https://github.com/sakeefkarim/intro_quantitative_sociology/raw/refs/heads/main/data/week%202/gss_2010.rds"))

Explore the variables here:

See You Thursday

References

Coppedge, Michael, John Gerring, Carl Henrik Knutsen, Staffan I. Lindberg, Jan Teorell, et al. 2024. “V-Dem [Country-Year/Country-Date] Dataset V14.”
Healy, Kieran Joseph. 2023. gssr: General Social Survey Data for Use in R.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Sebastopol, CA: O’Reilly.