Week 2
Cleaning and Data
Analysis in R
SOCI 269
Make sure you’re … well, here on this slide.
In 1-2 weeks, we’ll all be able to produce plots like this.
But let’s take things one step at a time.
You can use your console as a calculator.
Feel free to play around with the numbers above.
What’s 18% of 905? Use the console on this slide to arrive at an answer.
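As a quick sketch of the calculator idea, one way to get 18% of 905 is to convert the percentage to a proportion and multiply. The other lines are only illustrative arithmetic, not taken from the slide.

```r
# Basic arithmetic works directly in the console
2 + 2
10 / 4

# 18% of 905: convert the percentage to a proportion, then multiply
0.18 * 905

# Equivalent: divide by 100 first
(18 / 100) * 905
```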
Note: Keep clicking, or press the space bar on your keyboard, to advance through the slide deck.
Use the combine function, c(), to stitch together multiple elements of the same type, yielding an atomic vector.
Here’s a character vector.
And here’s a numeric vector.
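A minimal sketch of both kinds of vector follows. The element values are made up for illustration. The last line shows what happens when types are mixed: R coerces everything to one type.

```r
# A character vector: every element is a string
char_vec <- c("sociology", "statistics", "R")
char_vec

# A numeric vector: every element is a number
num_vec <- c(2, 4, 6, 8)
num_vec

# Mixing types coerces all elements to a single type (here, character)
mixed_vec <- c("a", 1, TRUE)
mixed_vec
```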
We can create new objects with the assignment operator <-.
Here’s how we can store the vector c(1, 2, 3) as an object named x.
Can you store the character vector embedded below as a new object named five_colleges?
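Here is one way the assignment could look. The contents of five_colleges are an assumption (the names of the Five College Consortium members), since the slide's embedded vector is not reproduced in this text.

```r
# Store the numeric vector c(1, 2, 3) as an object named x
x <- c(1, 2, 3)
x

# Store a character vector as five_colleges
# (assumed contents; swap in the vector shown on the slide)
five_colleges <- c("Amherst", "Hampshire", "Mount Holyoke",
                   "Smith", "UMass Amherst")
five_colleges

# Objects can be reused in later calls
length(five_colleges)
```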
Try to run the line below.
As Wickham, Çetinkaya-Rundel and Grolemund (2023) note, “object names must start with a letter and can only contain letters, numbers, _, and .”
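To see the naming rule in action without crashing a script, we can ask R to parse a line of code and check whether it fails. This is only a sketch: the object names are hypothetical, and the slide's original line is not reproduced here.

```r
# Valid names: start with a letter; contain only letters, numbers, _ and .
my_object <- 1
my.object2 <- 2

# An invalid name is a syntax error, so we wrap parse() in try()
# to capture the failure instead of stopping
bad_name <- try(parse(text = "2nd_object <- 1"), silent = TRUE)
inherits(bad_name, "try-error")   # TRUE: names cannot start with a number

good_name <- try(parse(text = "second_object <- 1"), silent = TRUE)
inherits(good_name, "try-error")  # FALSE: this line is valid R
```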
R is particularly useful for working with data frames.
Change the name of this data frame to IRIS.
We can use the $ operator to isolate elements within list or data.frame objects. This comes in handy when we use basic functions in R.
Now, calculate the mean of the Petal.Length variable in iris.
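A short sketch of these steps, using the iris data frame that ships with R. Strictly speaking, "renaming" here means copying the data frame to a new object called IRIS.

```r
# Copy the built-in iris data frame under a new name
IRIS <- iris

# $ pulls a single column out of a data frame as a vector
head(IRIS$Petal.Length)

# Base functions such as mean() work on that vector
petal_length_mean <- mean(iris$Petal.Length)
petal_length_mean
```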
We can annotate our code using the # sign.
Now, find the median value of iris$Petal.Width while weaving in some crude annotations.
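One possible answer is sketched below. The intermediate object name petal_width is just an illustrative choice.

```r
# Anything after a # sign is ignored by R, so we can leave notes to ourselves

# Step 1: isolate the Petal.Width column of iris
petal_width <- iris$Petal.Width

# Step 2: compute the median (the middle value once the widths are sorted)
petal_width_median <- median(petal_width)
petal_width_median
```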
Store penguins as a new object named penguins_week2.
Calculate the mean of bill_length_mm in the dataset.
How many male and female penguins are in penguins_week2? Use the table function to find out, and add some annotations, too.
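A sketch of one solution, assuming the palmerpenguins package is installed and that penguins refers to palmerpenguins::penguins. Note the na.rm = TRUE argument: the bill measurements contain missing values, so mean() would otherwise return NA.

```r
# Store the penguins data as a new object
# (assumes the palmerpenguins package is installed)
penguins_week2 <- palmerpenguins::penguins

# Mean bill length; na.rm = TRUE drops missing values first
mean_bill <- mean(penguins_week2$bill_length_mm, na.rm = TRUE)
mean_bill

# Count penguins by sex; table() leaves out missing values by default
sex_counts <- table(penguins_week2$sex)
sex_counts
```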
dplyr
The pipe operator (|> in base R) will help us chain together a series of dplyr verbs as we manipulate our data frames.
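To illustrate why the pipe helps, the sketch below writes the same three steps twice, once as nested calls and once as a pipeline. The choice of iris and of these particular steps is an assumption made for illustration.

```r
library(dplyr)

# Without the pipe: nested calls have to be read inside-out
nested <- head(arrange(filter(iris, Species == "setosa"),
                       desc(Petal.Length)), 3)

# With the pipe: each step reads top to bottom
piped <- iris |>
  filter(Species == "setosa") |>
  arrange(desc(Petal.Length)) |>
  head(3)

piped
```

Both versions return the same three rows. The pipe simply passes the result on its left as the first argument of the function on its right.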
dplyr::filter()
The filter() function allows us to zoom in on our observations of interest based on column values.
Logical vectors are often used to streamline the “filtration” process.
We can use string patterns in character vectors to make life easier, too.
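The three ideas above might look as follows. This is a sketch that assumes the palmerpenguins package is installed. The specific conditions are illustrative, and base R's grepl() stands in for whichever string helper the slide used.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# Keep rows that satisfy a condition on a column
adelie <- penguins |>
  filter(species == "Adelie")

# Combine logical conditions with & (and), | (or), and ! (not)
heavy_2009 <- penguins |>
  filter(body_mass_g > 5000 & year == 2009)

# Match a string pattern: islands whose names start with "Bi" or "Dr"
biscoe_dream <- penguins |>
  filter(grepl("^(Bi|Dr)", island))
```

Rows where a condition evaluates to NA are dropped by filter(), so penguins with a missing body mass do not appear in heavy_2009.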
dplyr::arrange()
The arrange() function allows us to modify, or have fine control over, the order of our observations based on column values.
We can arrange our rows using multiple variables.
We can organize rows in descending order, too.
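A sketch of the three orderings, again assuming palmerpenguins is installed. The sorting columns are illustrative choices.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# Sort by one column (ascending by default)
by_bill <- penguins |>
  arrange(bill_length_mm)

# Sort by several columns: later columns break ties in earlier ones
by_species_island <- penguins |>
  arrange(species, island, body_mass_g)

# Sort in descending order with desc()
heaviest_first <- penguins |>
  arrange(desc(body_mass_g))
```

Missing values are always placed at the end, whether the order is ascending or descending.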
dplyr::distinct()
Calling the distinct() function returns all the unique rows or values in a dataset vis-à-vis specific variables or their combinations.
Here are the distinct values of the year variable in palmerpenguins::penguins:
We can isolate unique species and island pairs, too.
distinct() is helpful for removing “redundant” information or extraneous rows.
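The calls described above can be sketched like this, assuming palmerpenguins is installed. The object names are illustrative.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# Distinct values of a single variable
years <- penguins |>
  distinct(year)

# Distinct combinations of two variables
species_island <- penguins |>
  distinct(species, island)

# With no columns named, distinct() drops fully duplicated rows
no_duplicates <- penguins |>
  distinct()
```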
dplyr
dplyr::mutate()
The mutate() function lets us create new variables or modify existing columns. Generally, this is achieved by manipulating the variables in our data.
Create a new variable for population in 1000s.
Logical vectors are useful for generating new variables.
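A sketch of both uses, assuming the gapminder package is installed and that the population variable in question is gapminder's pop column. The new variable names are hypothetical.

```r
library(dplyr)
gapminder <- gapminder::gapminder  # assumes gapminder is installed

# New variable: population in thousands
gap_pop <- gapminder |>
  mutate(pop_1000s = pop / 1000)

# Logical conditions can define new variables directly (TRUE/FALSE)
# or feed if_else() to produce labels
gap_flags <- gapminder |>
  mutate(
    long_lived = lifeExp > 70,
    region = if_else(continent == "Europe", "Europe", "Elsewhere")
  )
```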
dplyr::select()
The select() function lets us home in on our variables of interest.
We can provide (and modify) column names or positions.
We can use helper functions when variables are legion.
How can we remove the first three columns?
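Here is a sketch of these selection styles, assuming palmerpenguins is installed. The last call is one possible answer to the question about dropping the first three columns.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# Select by name
basics <- penguins |>
  select(species, island, year)

# Select by position, renaming along the way (new = old)
renamed <- penguins |>
  select(sp = species, 2, year)

# Helper functions pick columns by pattern or type
bill_cols <- penguins |>
  select(starts_with("bill"))
numeric_cols <- penguins |>
  select(where(is.numeric))

# Remove the first three columns with a negative selection
no_first_three <- penguins |>
  select(-(1:3))
```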
dplyr::rename()
The rename() function allows us to rename variables without adjusting the dimensions of our data frame.
We can rename columns by specifying variable names as follows: new = old.
We can also use column positions to point to the variables we’d like to rename:
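Both approaches are sketched below, assuming palmerpenguins is installed. The new column names are hypothetical.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# Rename by name: new = old
by_name <- penguins |>
  rename(mass_g = body_mass_g)

# Rename by position: new = column number
by_position <- penguins |>
  rename(sp = 1, isl = 2)
```

Unlike select(), rename() keeps every column, so the dimensions of the data frame stay the same.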
dplyr::relocate()
The relocate() function allows us to move our columns around.
Here’s how we can move a few columns to the “front.”
We can specify where we want columns relocated.
Helper functions can, well, be quite helpful.
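A sketch of the three moves, assuming palmerpenguins is installed. Which columns get moved is an illustrative choice.

```r
library(dplyr)
penguins <- palmerpenguins::penguins  # assumes palmerpenguins is installed

# By default, relocated columns move to the front
to_front <- penguins |>
  relocate(year, sex)

# .before and .after say where the columns should land
after_island <- penguins |>
  relocate(year, .after = island)

# Helpers work here too: move every factor column to the end
factors_last <- penguins |>
  relocate(where(is.factor), .after = last_col())
```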
For the rest of today’s session, we’re going to be working with data from the Varieties of Democracy Project (Coppedge et al. 2024). You can access the codebook for V-Dem 14 below:
Use the dplyr verbs we discussed today to heavily modify the dataset: say, by isolating specific columns, years, or countries of interest before generating new variables.
No Make-Up Date
Unfortunately, there is no assigned make-up date for last week’s cancelled session. Ergo, the syllabus will need to be adjusted as we get more information about how instructors can weave in optional sessions moving forward.
I expect each and every one of you to schedule a meeting by early March.
dplyr
dplyr::group_by()
The group_by() function allows us to group our data before performing additional operations or transformations.
What’s weird about this code?
Crucially, we can apply group_by() to more than one column.
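Grouping on its own changes how later verbs behave, not what the data look like. The sketch below assumes the gapminder package is installed. The grouping columns and the mean_lifeExp variable are illustrative.

```r
library(dplyr)
gapminder <- gapminder::gapminder  # assumes gapminder is installed

# Group by one column; the rows are unchanged, but the grouping is recorded
by_continent <- gapminder |>
  group_by(continent)
n_groups(by_continent)

# Group by more than one column
by_continent_year <- gapminder |>
  group_by(continent, year)
group_vars(by_continent_year)

# Later verbs now operate within groups: each row gets its continent's mean
with_means <- by_continent |>
  mutate(mean_lifeExp = mean(lifeExp)) |>
  ungroup()
```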
dplyr::summarise()
We can call the summarize() function (or summarise()) to easily compute summary statistics at the group level.
What do we need to change before the code can run smoothly?
We can generate summary statistics by group combinations (or group x group interactions).
As of dplyr 1.1.0, we can use the .by argument for “per-operation” grouping.
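The three approaches can be sketched as follows, assuming the gapminder package is installed and dplyr is at version 1.1.0 or later (needed for .by). The summary statistics chosen are illustrative.

```r
library(dplyr)
gapminder <- gapminder::gapminder  # assumes gapminder is installed

# One row per group
by_continent <- gapminder |>
  group_by(continent) |>
  summarise(mean_lifeExp = mean(lifeExp), n_obs = n())

# One row per combination of groups; .groups = "drop" returns ungrouped data
by_continent_year <- gapminder |>
  group_by(continent, year) |>
  summarise(median_gdp = median(gdpPercap), .groups = "drop")

# Per-operation grouping with .by (requires dplyr >= 1.1.0)
with_by <- gapminder |>
  summarise(mean_lifeExp = mean(lifeExp), .by = continent)
```

The .by version returns the same means as the group_by() version, though the row order can differ: .by keeps groups in order of first appearance, whereas group_by() sorts them.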
dplyr::slice()
The slice() function allows us to extract observations of interest.
Here, we isolate the first (or last) two rows for each continent in gapminder.
slice_min() and slice_max() allow us to isolate the minimum or maximum values of column x.
The slice_sample() function can be used to randomly draw from our (grouped or ungrouped) data.
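A sketch of the slice family, assuming the gapminder package is installed. The columns, the seed, and the sample size are illustrative, and with_ties = FALSE is set so that each call returns exactly n rows.

```r
library(dplyr)
gapminder <- gapminder::gapminder  # assumes gapminder is installed

# First and last two rows within each continent
first_two <- gapminder |>
  group_by(continent) |>
  slice_head(n = 2)
last_two <- gapminder |>
  group_by(continent) |>
  slice_tail(n = 2)

# Rows with the smallest or largest values of a column
lowest_life <- gapminder |>
  slice_min(lifeExp, n = 3, with_ties = FALSE)
richest_per_continent <- gapminder |>
  group_by(continent) |>
  slice_max(gdpPercap, n = 1, with_ties = FALSE)

# Random draw of rows; set.seed() makes the draw reproducible
set.seed(269)
random_rows <- gapminder |>
  slice_sample(n = 5)
```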
Note
The last two verbs (i.e., summarize() and slice()) can be applied to ungrouped data as well. What’s more, the other verbs we encountered last Tuesday (e.g., filter() or mutate()) can also be used to manipulate grouped data.
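To make the note concrete, here is a brief sketch that assumes the gapminder package is installed: an ungrouped summarize() collapses the whole data frame to one row, while a grouped filter() evaluates its condition within each group.

```r
library(dplyr)
gapminder <- gapminder::gapminder  # assumes gapminder is installed

# Ungrouped summarize(): a single row for the whole data frame
overall <- gapminder |>
  summarize(mean_lifeExp = mean(lifeExp))

# Grouped filter(): max() is computed separately for each continent
longest_lived <- gapminder |>
  group_by(continent) |>
  filter(lifeExp == max(lifeExp)) |>
  ungroup()
```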
Manipulate your V-Dem data frame so that it looks like this:
# A tibble: 72 × 8
   country  year electoral electoral_global_avg liberal participatory
   <chr>   <dbl>     <dbl>                <dbl>   <dbl>         <dbl>
 1 Canada   2000     0.84                 0.492   0.765         0.601
 2 Canada   2001     0.837                0.495   0.761         0.594
 3 Canada   2002     0.831                0.505   0.754         0.587
 4 Canada   2003     0.831                0.514   0.754         0.584
 5 Canada   2004     0.83                 0.514   0.758         0.58
 6 Canada   2005     0.828                0.518   0.758         0.579
 7 Canada   2006     0.836                0.522   0.765         0.579
 8 Canada   2007     0.838                0.519   0.767         0.579
 9 Canada   2008     0.835                0.522   0.766         0.578
10 Canada   2009     0.835                0.524   0.765         0.578
# ℹ 62 more rows
# ℹ 2 more variables: deliberative <dbl>, egalitarian <dbl>
You can access the can_mex_usa data frame via GitHub or by copying, and executing, the following line:
Note: Need a hint? Skip to the next slide.
Are you done? Congrats! Here’s some more data to explore (cf. Healy 2023)—
Explore the variables here: