Learning Goals

The goal of this course is for you to gain confidence in carrying out the entire data science pipeline.

Specific course topics and general skills are listed below.

General Skills

Data Communication

  • In written and oral formats:

    • Explain and justify your data cleaning and analysis process, and the resulting conclusions, with clear, organized, logical, and compelling details adapted to the background, values, and motivations of your audience and to the context in which the communication occurs.

Collaborative Learning

  • Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
  • Develop a common purpose and agreement on goals.
  • Be able to contribute questions or concerns in a respectful way.
  • Share and contribute to the group’s learning in an equitable manner.

Course Topics

Specific learning objectives for our course topics are listed below. Use these to guide your synthesis of course material for specific topics. Note that the topics are listed in the order of the data science pipeline, not the order in which we will cover them in class.

Foundation

Working in RStudio

  • Download and install the necessary tools (R, RStudio)
  • Open a Quarto (.qmd) file from RStudio’s Files pane
  • Create a new Quarto document
  • Know when to enter commands in the Console vs a Quarto file
  • View and modify keyboard shortcuts
  • Open a dataset in RStudio’s spreadsheet viewer
  • Check if a package is installed using library() and by looking at the Packages pane
  • Install a package using install.packages()
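
As a rough sketch of the last two goals, the snippet below checks whether a package is installed before suggesting an install. The package name "ggplot2" is only an illustrative choice. requireNamespace() is used here instead of library() because it returns TRUE or FALSE rather than stopping with an error when the package is missing.

```r
# Illustrative package name; swap in whichever package you need
pkg <- "ggplot2"

# requireNamespace() returns TRUE if the package is installed and FALSE otherwise.
# Unlike library(), it does not attach the package or stop with an error.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Installation is a one-time step, so it belongs in the Console,
  # not in a Quarto file that gets re-rendered:
  # install.packages(pkg)
  message(pkg, " is not installed yet; run install.packages(\"", pkg, "\") in the Console.")
}
```

Calling library(ggplot2) directly works as a check too: if the package is missing, R stops with an error saying there is no package with that name.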

Working in a Quarto file

  • Insert a code chunk (also see keyboard shortcuts section below)
  • Run all the code in a code chunk (also see keyboard shortcuts section below)
  • Insert bold and italic text
  • Insert section headers
  • Insert an image

General data knowledge

  • Determine from a data description and looking at the first few rows what the unit of observation is for the dataset. (That is, what does each row represent?)
  • Identify a variable as quantitative or categorical
  • Describe the characteristics of tidy data

General coding

  • Use the assignment operator (<-) to store output into an object
  • Identify the different parts of a function: function name, arguments
  • Identify the class (type) of an object
  • Show the variable types of all columns in a dataset
  • Show the number of observations (also called cases) and variables in a dataset
  • Show the first and last few rows (observations) of a dataset
  • Show the variable (column) names of a dataset
  • Read in a CSV file of data and store the data in an object
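
The goals above can be sketched in a few lines of base R. This example first writes the built-in mtcars dataset to a temporary CSV file so that it runs anywhere; the object name cars and the temporary file are assumptions made for illustration. In class you may use read_csv() from the readr package instead of read.csv(), which is called in much the same way.

```r
# Write a built-in dataset to a temporary CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# read.csv is the function name, csv_path is its argument;
# <- stores the output in an object called cars
cars <- read.csv(csv_path)

class(cars)   # class (type) of the object
str(cars)     # variable types of all columns
dim(cars)     # number of observations (rows) and variables (columns)
head(cars)    # first few rows
tail(cars)    # last few rows
names(cars)   # variable (column) names
```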

Data Visualization

The learning goals may be adjusted before we start the material of this section.

General

  • Explain the importance of data visualization
  • Load the required package for plotting before running any plotting commands
  • Explain the “grammar of graphics” (the building up of plots in layers)
    • Understand the structure of ggplot(___, aes(___)): what goes in the blanks, and what does this code do?
    • What do the + signs do in ggplot code?
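
One way to see the layering idea is a small plot built from the mpg dataset that ships with ggplot2. The choice of dataset and variables here is only an example: the first blank in ggplot(___, aes(___)) takes the data, the second takes the mapping from variables to aesthetics, and each + adds another layer on top.

```r
library(ggplot2)  # load the plotting package before any plotting commands

# ggplot(<data>, aes(<variable-to-aesthetic mappings>)) sets up the plot frame
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +                # layer 1: one point per car
  geom_smooth(method = "lm")    # layer 2: a fitted straight line

p  # printing the object draws the plot
```

Running only the first line, ggplot(mpg, aes(x = displ, y = hwy)), draws empty axes. Nothing appears on them until a geom layer is added.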

Univariate, bivariate, and multivariate viz

  • How can we build appropriate plots for the following types of variables? What geoms and aesthetics can be used?
    • 1 categorical variable
    • 1 quantitative variable
    • 2 categorical variables
    • 2 quantitative variables
    • 1 categorical and 1 quantitative variable
    • 3+ variables that have different combinations of categorical and quantitative variables (There are a lot of combinations, but what principles can be used to think through how to build these up?)
  • Explain the difference between the color and fill arguments
  • Add informative labels, captions, and alt text to a plot
  • Explain what theme_minimal() does
  • Interpret a plot to describe key takeaways
    • 1 categorical variable: relative frequencies of different categories
    • 1 quantitative variable: shape of the distribution, range, outliers, center of the distribution
    • 2 quantitative variables: describe the characteristics of the relationship: direction (positive vs negative), shape/trend (line, curve, etc), strength (how close are the points to the trend?)
    • 2 categorical variables: comparisons of frequencies (counts) and proportions of one variable across categories of the other
    • 1 categorical and 1 quantitative variable: comparisons of the distribution shapes, centers, and ranges
  • Compare the pros and cons of different plots for answering a research question and facilitating comparisons of interest
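
As an illustration of several goals in this list, the sketch below plots two categorical variables from ggplot2's built-in mpg dataset. The variable choices and label text are assumptions made for the example. Here fill sets the interior color of the bars (mapped to a variable inside aes()), while color sets their outline.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = class, fill = drv)) +   # fill: interior of each bar, by drive type
  geom_bar(color = "black") +                    # color: bar outlines, fixed to black
  labs(
    title = "Vehicle classes by drive type",
    x = "Vehicle class",
    y = "Number of cars",
    fill = "Drive type",
    caption = "Data: ggplot2's built-in mpg dataset",
    alt = "Stacked bar chart of car counts per vehicle class, split by drive type"
  ) +
  theme_minimal()   # swaps the default gray background for a plain, light theme

p
```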

Mapping

  • Explain the difference between point maps, contour maps, and choropleth maps
  • Describe the general code structure for building up a leaflet plot in layers and compare it to the building up of a ggplot in layers
  • Plot data points on top of a map using ggplot()
  • Create choropleth maps (geom_map())
  • Understand the basics of creating a map using leaflet, including adding points and choropleths to a base map.
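
A minimal leaflet sketch, assuming the leaflet package is installed, shows how the layering compares with ggplot. The places data frame and its coordinates are made up for illustration. Layers are chained with a pipe rather than +, and variables are referenced with a ~ formula rather than inside aes().

```r
library(leaflet)

# Made-up locations, for illustration only
places <- data.frame(
  name = c("Site A", "Site B"),
  lon  = c(-93.17, -93.10),
  lat  = c(44.94, 44.95)
)

m <- leaflet(data = places) |>   # start the map and attach the data
  addTiles() |>                  # layer 1: default base map tiles
  addCircleMarkers(              # layer 2: one circle per row of the data
    lng = ~lon, lat = ~lat, label = ~name
  )

m  # printing the object shows the interactive map in the Viewer pane
```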

Data Wrangling

Wrangling Verbs

  • Use the following verbs appropriately: select, mutate, filter, arrange, summarize, group_by
  • Predict what code will do without running it
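
The six verbs can be chained in one pipeline. The sketch below uses the built-in mtcars dataset; the new column name wt_lbs and the 2500-pound cutoff are arbitrary choices for illustration. Try predicting the shape of result before running it.

```r
library(dplyr)

result <- mtcars |>
  select(mpg, cyl, wt) |>            # keep only three columns
  mutate(wt_lbs = wt * 1000) |>      # new column: wt is recorded in 1000s of pounds
  filter(wt_lbs > 2500) |>           # keep rows (cars) heavier than 2500 lbs
  group_by(cyl) |>                   # one group per number of cylinders
  summarize(
    mean_mpg = mean(mpg),            # one summary row per group
    n_cars = n()
  ) |>
  arrange(desc(mean_mpg))            # sort groups from highest to lowest mean mpg

result
```

The result has one row per cylinder count and three columns (cyl, mean_mpg, n_cars), because summarize() collapses each group to a single row.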

Dates

  • Use functions in the lubridate package to work with dates
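
A few commonly used lubridate functions are sketched below; the specific date is arbitrary. Parsing functions such as ymd() and mdy() are named after the order of the date components in the text being read.

```r
library(lubridate)

d <- ymd("2024-03-15")        # parse text written as year-month-day
same_day <- mdy("03/15/2024") # parse text written as month/day/year

year(d)                       # extract the year
month(d, label = TRUE)        # extract the month as a labeled factor
wday(d, label = TRUE)         # extract the day of the week

later <- d + days(30)         # date arithmetic: 30 days after d
later
```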

Reshaping Data

  • Explain the difference between wide and long data format
  • Identify the case (unit of observation) for a data set in a given format
  • Use pivot_wider and pivot_longer in the tidyr package
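
The difference between the two formats is easiest to see on a tiny table. The data below are made up for illustration. In the wide version each row is a student; in the long version each row is one student-quiz combination.

```r
library(tidyr)

# Wide format: one row per student, one column per quiz
wide <- data.frame(
  student = c("Ana", "Ben"),
  quiz1 = c(8, 6),
  quiz2 = c(9, 7)
)

# Long format: one row per student per quiz
long <- pivot_longer(
  wide,
  cols = c(quiz1, quiz2),
  names_to = "quiz",       # old column names go here
  values_to = "score"      # old cell values go here
)

# pivot_wider() reverses the reshaping
wide_again <- pivot_wider(long, names_from = quiz, values_from = score)

long
wide_again
```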

Joining Data

  • Explain the concept of keys (variables that uniquely identify rows or cases)
  • Explain the difference between left, inner, full, semi, and anti joins
  • Explain the difference between mutating and filtering joins
  • Use mutating joins (left_join, inner_join and full_join) and filtering joins (semi_join, anti_join) in the dplyr package to combine information from multiple datasets
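
The five joins can be compared on two small made-up tables that share the key id. Mutating joins add columns from the second table, while filtering joins only keep or drop rows of the first table.

```r
library(dplyr)

# Made-up tables; id is the key that links them
students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cal"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

# Mutating joins: the result gains the grade column
left  <- left_join(students, grades, by = "id")   # all students; grade is NA for id 1
inner <- inner_join(students, grades, by = "id")  # only ids in both tables (2, 3)
full  <- full_join(students, grades, by = "id")   # every id from either table (1 to 4)

# Filtering joins: the result keeps only the columns of students
semi <- semi_join(students, grades, by = "id")    # students that have a grade (2, 3)
anti <- anti_join(students, grades, by = "id")    # students without a grade (1)
```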

Working with Character Data as Factors

  • Explain the difference between a variable stored as a character vs. a factor
  • Convert a character variable to a factor
  • Manipulate the order and values of a factor with the forcats package to improve summaries and visualizations
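
A short sketch with a made-up character vector shows why the distinction matters: a factor stores a fixed set of levels in a particular order, and that order controls how categories appear in tables and plots. The forcats functions shown are a small sample of the package.

```r
library(forcats)

size_chr <- c("small", "large", "medium", "small", "large", "small")

# Converting to a factor puts the levels in alphabetical order by default
size_fct <- factor(size_chr)
levels(size_fct)

# Change the order of the levels
size_ordered <- fct_relevel(size_fct, "small", "medium", "large")  # manual order
size_by_freq <- fct_infreq(size_fct)                               # most common first

# Change the values (labels) of the levels, written as new = old
size_renamed <- fct_recode(size_fct, S = "small", M = "medium", L = "large")
```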

Working with Character Data as Strings

  • Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package
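
Each task named above has a matching stringr function. The sketch below applies them to a made-up vector. The pattern "\\d+" is a regular expression meaning "one or more digits".

```r
library(stringr)

x <- c("apple-12", "banana-7", "cherry")

has_digits <- str_detect(x, "\\d+")    # detect: does each string contain digits?
digits     <- str_extract(x, "\\d+")   # extract: the first run of digits, NA if none
replaced   <- str_replace(x, "-", " ") # search and replace: first "-" becomes a space
located    <- str_locate(x, "an")      # locate: start and end position of first "an"
pieces     <- str_split(x, "-")        # separate: split each string at "-"

has_digits
digits
```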

Starting a Data Project

The learning goals may be adjusted before we start the material of this section.

Data Import

  • Find existing data sets
  • Save data sets locally
  • Load data into RStudio
  • Do some preliminary data checking and cleaning steps before further wrangling / visualization:
    • Make sure variables are properly formatted
    • Deal with missing values
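
The preliminary checks might look like the following base R sketch. The data frame is made up for illustration: a numeric variable has been read in as text, and one value is missing. Dropping incomplete rows is only one possible way to deal with missing values, and whether it is appropriate depends on the data.

```r
# Made-up data: height was read in as text and has one missing value
df <- data.frame(
  id = c("1", "2", "3", "4"),
  height = c("170", "165", NA, "180")
)

str(df)                               # check how each variable is stored
df$height <- as.numeric(df$height)    # fix the format: text to number

n_missing <- colSums(is.na(df))       # count missing values in each column
n_missing

mean(df$height, na.rm = TRUE)         # option 1: skip missing values in a calculation
complete <- df[!is.na(df$height), ]   # option 2: keep only rows with a height
```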

EDA

  • Understand the first steps that should be taken when you encounter a new data set
  • Develop comfort in knowing how to explore data to understand it
  • Develop comfort in formulating research questions

Errors

What can cause the following errors? How do we fix our code to deal with the error?

Error: could not find function "___"

Error: object '___' not found
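
Both errors can be reproduced safely with tryCatch(), which captures the error message instead of stopping the script. The names some_undefined_function and some_undefined_object are hypothetical and deliberately never defined.

```r
# "could not find function": common causes are a package that has not been
# loaded with library(), or a misspelled function name
msg1 <- tryCatch(
  some_undefined_function(1),
  error = function(e) conditionMessage(e)
)

# "object not found": common causes are a misspelled object name, or code
# that creates the object (with <-) not having been run yet
msg2 <- tryCatch(
  print(some_undefined_object),
  error = function(e) conditionMessage(e)
)

msg1
msg2
```

Typical fixes are to load the relevant package, check spelling and capitalization, and run the earlier chunks that create the object.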

Keyboard shortcuts

  • Insert assignment operator
    • Check the “Modify Keyboard Shortcuts…” dialog box under the Tools menu. (Search for “assign”.)
  • Insert code chunk
    • Ctrl + Alt + I (Windows and Linux)
    • Command + Option + I (Mac)
  • Insert a pipe (%>% or |>)
    • Ctrl + Shift + M (Windows and Linux)
    • Command + Shift + M (Mac)
  • Run all code in a code chunk
    • Ctrl + Shift + Enter (Windows and Linux)
    • Command + Shift + Enter (Mac)
  • Run only the command on the line that the cursor is on
    • Ctrl + Enter (Windows and Linux)
    • Command + Enter (Mac)
  • Comment out/uncomment lines
    • Ctrl + Shift + C (Windows and Linux)
    • Command + Shift + C (Mac)