Learning Goals

The goal of this course is for you to gain confidence in carrying out the entire data science pipeline,

from research question formulation,
to data collection/scraping,
to wrangling,
to modeling (covered in STAT 155, STAT 253, STAT 4xx courses),
to visualization,
to presentation and communication.

Specific course topics and general skills are listed below.

General Skills

Data Communication

In written and oral formats:
- Explain and justify the data cleaning and analysis process, and the resulting conclusions, with clear, organized, logical, and compelling details that are adapted to the background, values, and motivations of the audience and to the context in which the communication occurs.

Collaborative Learning

Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
Develop a common purpose and agreement on goals.
Be able to contribute questions or concerns in a respectful way.
Share and contribute to the group’s learning in an equitable manner.

Course Topics

Specific learning objectives for our course topics are listed below. Use these to guide your synthesis of course material for specific topics. Note that the topics are listed in the order of the data science pipeline, not the order in which we will cover them in class.

Foundation

Working in RStudio

Download and install the necessary tools (R, RStudio)
Open a Quarto (.qmd) file from RStudio’s Files pane
Create a new Quarto document
Know when to enter commands in the Console vs a Quarto file
View and modify keyboard shortcuts
Open a dataset in RStudio’s spreadsheet viewer
Check if a package is installed using library() and by looking at the Packages pane
Install a package using install.packages()
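
As a rough sketch of the last two items: library() stops with an error when a package is missing, so one alternative way to check is requireNamespace(), which returns TRUE or FALSE instead. The package name "ggplot2" below is only an example, and the install step is left as a message so that nothing is installed by accident.

```r
# "ggplot2" is only an example package name.
pkg <- "ggplot2"

# requireNamespace() returns TRUE or FALSE rather than stopping with an error,
# which is what library(ggplot2) would do if the package were missing.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (is_installed) {
  message(pkg, " is installed; load it with library(", pkg, ").")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") in the Console.")
}
```

Installing is a one-time step, so install.packages() generally belongs in the Console rather than in a Quarto file that is rendered repeatedly.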

Working in a Quarto file

Insert a code chunk (also see keyboard shortcuts section below)
Run all the code in a code chunk (also see keyboard shortcuts section below)
Insert bold and italic text
Insert section headers
Insert an image

General data knowledge

Determine the unit of observation for a dataset from its description and its first few rows. (That is, what does each row represent?)
Identify a variable as quantitative or categorical
Describe the characteristics of tidy data

General coding

Use the assignment operator (<-) to store output into an object
Identify the different parts of a function: function name, arguments
Identify the class (type) of an object
Show the variable types of all columns in a dataset
Show the number of observations (also called cases) and variables in a dataset
Show the first and last few rows (observations) of a dataset
Show the variable (column) names of a dataset
Read in a CSV file of data and store the data in an object
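
The skills in this list can be tried out together in a few lines of base R. The sketch below is illustrative only: the file contents and the object name students are made up, and the CSV is written to a temporary file so that the example runs on its own.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,height_cm,major",
    "Ana,165,Biology",
    "Ben,180,Statistics",
    "Chen,172,Economics"),
  csv_path
)

# read.csv is the function name; csv_path is its argument.
# The assignment operator (<-) stores the output in the object `students`.
students <- read.csv(csv_path)

class(students)    # class (type) of the object
str(students)      # variable types of all columns
dim(students)      # number of observations (rows) and variables (columns)
head(students, 2)  # first few rows
tail(students, 2)  # last few rows
names(students)    # variable (column) names
```

In this made-up dataset each row represents one student, so the student is the unit of observation.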

Data Visualization

The learning goals may be adjusted before we start the material of this section.

General

Explain the importance of data visualization
Load the required package for plotting before running any plotting commands
Explain the “grammar of graphics” (the building up of plots in layers)
- Understand the structure of ggplot(___, aes(___)): what goes in the blanks, and what does this code do?
- What do the + signs do in ggplot code?
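
One way to explore the two questions above is to build a plot in stages. This is a minimal sketch using the mtcars dataset that ships with R; the object names base_plot and layered_plot are arbitrary.

```r
library(ggplot2)

# ggplot(data, aes(...)) names the dataset and maps variables to aesthetics.
# On its own it sets up the axes but draws no data, because no geom has been added.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# Each + adds another component (a geom layer, labels, a theme) to the plot.
layered_plot <- base_plot +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +
  theme_minimal()

layered_plot
```

Printing base_plot and then layered_plot side by side shows what each added component contributes.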

Univariate, bivariate, and multivariate viz

How can we build appropriate plots for the following types of variables? What geoms and aesthetics can be used?
- 1 categorical variable
- 1 quantitative variable
- 2 categorical variables
- 2 quantitative variables
- 1 categorical and 1 quantitative variable
- 3+ variables that have different combinations of categorical and quantitative variables (There are a lot of combinations, but what principles can be used to think through how to build these up?)
Explain the difference between the color and fill arguments
Add informative labels, captions, and alt text to a plot
Explain what theme_minimal() does
Interpret a plot to describe key takeaways
- 1 categorical variable: relative frequencies of different categories
- 1 quantitative variable: shape of the distribution, range, outliers, center of the distribution
- 2 quantitative variables: describe the characteristics of the relationship: direction (positive vs negative), shape/trend (line, curve, etc), strength (how close are the points to the trend?)
- 2 categorical variables: comparisons of frequencies (counts) and proportions of one variable across categories of the other
- 1 categorical and 1 quantitative variable: comparisons of the distribution shapes, centers, and ranges
Compare the pros and cons of different plots for answering a research question and facilitating comparisons of interest
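
As a small illustration of several items in this list, the sketch below plots one categorical variable from the built-in mtcars dataset. The specific colors and label text are arbitrary choices made for the example.

```r
library(ggplot2)

cyl_plot <- ggplot(mtcars, aes(x = factor(cyl))) +
  # color sets the outline of each bar; fill sets its interior.
  geom_bar(color = "black", fill = "steelblue") +
  labs(
    x = "Number of cylinders",
    y = "Number of cars",
    title = "Cars in mtcars by cylinder count",
    caption = "Data: mtcars, built into R",
    alt = "Bar chart of the number of cars with 4, 6, and 8 cylinders in the mtcars data."
  ) +
  # theme_minimal() swaps the default gray background for a plain one.
  theme_minimal()

cyl_plot
```

For points and lines there is usually no interior to fill, so color is typically the argument that matters for those geoms.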

Mapping

Explain the difference between point maps, contour maps, and choropleth maps
- Describe the general code structure for building up a leaflet plot in layers and compare to the building up of a ggplot in layers
Plot data points on top of a map using ggplot()
Create choropleth maps (geom_map())
Understand the basics of creating a map using leaflet, including adding points and choropleths to a base map.

Data Wrangling

Wrangling Verbs

Use the following verbs appropriately: select, mutate, filter, arrange, summarize, group_by
Predict what code will do without running it
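
A useful exercise is to read a pipeline like the following and predict its output before running it. This sketch uses the built-in mtcars dataset; the new column names (wt_lbs, n_cars, mean_mpg, mean_wt_lbs) are made up for the example.

```r
library(dplyr)

mpg_by_cyl <- mtcars |>
  select(mpg, cyl, wt) |>           # keep only three columns
  filter(wt < 4) |>                 # keep only rows for lighter cars
  mutate(wt_lbs = wt * 1000) |>     # add a new column (wt is in 1000s of lbs)
  group_by(cyl) |>                  # form one group per cylinder count
  summarize(
    n_cars = n(),                   # number of cars in each group
    mean_mpg = mean(mpg),
    mean_wt_lbs = mean(wt_lbs)
  ) |>
  arrange(desc(mean_mpg))           # sort groups from highest to lowest mean mpg

mpg_by_cyl
```

Before running it, try to say how many rows and columns the result will have and what each row represents.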

Dates

Use functions in the lubridate package to work with dates
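
A few commonly used lubridate functions are sketched below. The date itself is arbitrary and chosen only for illustration.

```r
library(lubridate)

# Parse text into a Date; the function name matches the order of the parts.
d <- ymd("2024-03-15")
same_d <- mdy("03/15/2024")

# Extract components of the date.
year(d)
month(d)
wday(d, label = TRUE)   # day of the week as a labeled factor

# Date arithmetic.
later <- d + days(30)
days_left <- as.numeric(ymd("2024-12-31") - d)
```

Once a column is stored as a Date rather than as text, the same functions can be used inside mutate() to create year, month, or weekday variables.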

Reshaping Data

Explain the difference between wide and long data format
Identify the case (unit of observation) for a dataset in a given format
Use pivot_wider and pivot_longer in the tidyr package
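
The sketch below moves a small, made-up table of quiz scores from wide to long format and back. All names and values are invented for the example.

```r
library(tidyr)

# Wide format: one row per student, one column per quiz.
wide <- data.frame(
  student = c("Ana", "Ben"),
  quiz1 = c(8, 6),
  quiz2 = c(9, 7)
)

# Long format: one row per student-quiz combination.
long <- pivot_longer(
  wide,
  cols = c(quiz1, quiz2),
  names_to = "quiz",
  values_to = "score"
)

# pivot_wider() reverses the reshaping.
wide_again <- pivot_wider(long, names_from = quiz, values_from = score)
```

Note how the case changes: in wide each row is a student, while in long each row is one student's score on one quiz.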

Joining Data

Explain the concept of keys (variables that uniquely identify rows or cases)
Explain the difference between left, inner, full, semi, and anti joins
Explain the difference between mutating and filtering joins
Use mutating joins (left_join, inner_join and full_join) and filtering joins (semi_join, anti_join) in the dplyr package to combine information from multiple datasets
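
To see how the join types differ, it can help to run each of them on the same pair of tiny tables. The two data frames below are made up for the example, with id acting as the key.

```r
library(dplyr)

students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Chen"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

# Mutating joins add columns from the second table.
left  <- left_join(students, grades, by = "id")   # all students; NA grade where no match
inner <- inner_join(students, grades, by = "id")  # only ids found in both tables
full  <- full_join(students, grades, by = "id")   # every id from either table

# Filtering joins keep or drop rows of the first table without adding columns.
semi <- semi_join(students, grades, by = "id")    # students that have a grade
anti <- anti_join(students, grades, by = "id")    # students that have no grade
```

Comparing the number of rows and columns in each result is a quick way to check which join was applied.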

Working with Character Data as Factors

Explain the difference between a variable stored as a character vs. a factor
Convert a character variable to a factor
Manipulate the order and values of a factor with the forcats package to improve summaries and visualizations
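
The following sketch shows one possible route from a character vector to a reordered and relabeled factor. The values are made up, and fct_relevel() and fct_recode() are just two of the forcats functions that could be used here.

```r
library(forcats)

sizes_chr <- c("medium", "small", "large", "small")

# Converting to a factor stores a fixed set of levels,
# which are ordered alphabetically by default.
sizes_fct <- factor(sizes_chr)
levels(sizes_fct)

# Change the order of the levels so summaries and plots follow a sensible order.
sizes_ordered <- fct_relevel(sizes_fct, "small", "medium", "large")

# Change the values (labels) of the levels: new name = "old name".
sizes_renamed <- fct_recode(sizes_ordered, S = "small", M = "medium", L = "large")
```

A character vector has no levels at all, which is why plots of character variables fall back to alphabetical order.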

Working with Character Data as Strings

Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package
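
Each of the tasks named above has a corresponding stringr function. In this sketch the strings are illustrative; "ABCD 101" is a made-up placeholder rather than a real course.

```r
library(stringr)

labels <- c("STAT 155", "STAT 253", "ABCD 101")

# Detect a pattern: does the string start with "STAT"?
is_stat <- str_detect(labels, "^STAT")

# Extract a pattern: the first run of digits.
numbers <- str_extract(labels, "\\d+")

# Locate a pattern: start and end positions of the digits.
positions <- str_locate(labels, "\\d+")

# Search and replace: swap the first space for a hyphen.
hyphenated <- str_replace(labels, " ", "-")

# Separate text: split each string into two pieces at the space.
pieces <- str_split_fixed(labels, " ", 2)
```

In R strings, the backslash in a regular expression such as \d has to be doubled, which is why the pattern is written "\\d+".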

Starting a Data Project

The learning goals may be adjusted before we start the material of this section.

Data Import

Find existing datasets
Save datasets locally
Load data into RStudio
Do some preliminary data checking and cleaning steps before further wrangling / visualization:
- Make sure variables are properly formatted
- Deal with missing values
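
The two checks above might look like the following in base R. The survey data frame is made up, and dropping incomplete rows is only one of several ways to handle missing values; whether it is appropriate depends on the dataset.

```r
# Made-up data: age was read in as text, and both columns have a missing value.
survey <- data.frame(
  age = c("21", "34", NA),
  income = c(50, NA, 72)
)

# Make sure variables are properly formatted: convert age from text to numeric.
survey$age <- as.numeric(survey$age)

# Count the missing values in each column.
missing_counts <- colSums(is.na(survey))

# Option 1: drop every row that has any missing value.
complete_rows <- na.omit(survey)

# Option 2: keep all rows and skip missing values inside a calculation.
mean_income <- mean(survey$income, na.rm = TRUE)
```

Counting missing values before choosing between these options shows how much data each choice would discard.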

EDA

Understand the first steps that should be taken when you encounter a new dataset
Develop comfort in knowing how to explore data to understand it
Develop comfort in formulating research questions

Errors

What can cause the following errors? How do we fix our code to deal with the error?

Error: could not find function "___"

Error: object '___' not found
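
Both errors can be reproduced safely with tryCatch(), which captures the error message instead of stopping the code. The names undefined_function_xyz and undefined_object_xyz are deliberately made up so that they do not exist, and the message text shown in the comments assumes an English-language R session.

```r
# Calling a function that R cannot find. Common causes: the package that
# provides it has not been loaded with library(), or the name is misspelled.
err_fun <- tryCatch(
  undefined_function_xyz(1),
  error = function(e) conditionMessage(e)
)
err_fun   # could not find function "undefined_function_xyz"

# Using an object that does not exist. Common causes: the line that creates
# it has not been run yet, or the name is misspelled (R is case-sensitive).
err_obj <- tryCatch(
  undefined_object_xyz + 1,
  error = function(e) conditionMessage(e)
)
err_obj   # object 'undefined_object_xyz' not found
```

In both cases, checking the spelling and then confirming that earlier code chunks have been run resolves most occurrences.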

Keyboard shortcuts

Insert assignment operator
- Check the “Modify Keyboard Shortcuts…” dialog box under the Tools menu. (Search for “assign”.)
Insert code chunk
- Ctrl + Alt + I (Windows and Linux)
- Command + Option + I (Mac)
Insert a pipe (%>% or |>)
- Ctrl + Shift + M (Windows and Linux)
- Command + Shift + M (Mac)
Run all code in a code chunk
- Ctrl + Shift + Enter (Windows and Linux)
- Command + Shift + Enter (Mac)
Run only the command on the line that the cursor is on
- Ctrl + Enter (Windows and Linux)
- Command + Enter (Mac)
Comment out/uncomment lines
- Ctrl + Shift + C (Windows and Linux)
- Command + Shift + C (Mac)