Learning Goals
The goal of this course is for you to gain confidence in carrying out the entire data science pipeline,
- from research question formulation,
- to data collection/scraping,
- to wrangling,
- to modeling (covered in STAT 155, STAT 253, STAT 4xx courses),
- to visualization,
- to presentation and communication
Specific course topics and general skills are listed below.
General Skills
Data Communication
In written and oral formats:
- Explain and justify the data cleaning and analysis process and the resulting conclusions with clear, organized, logical, and compelling details, adapted to the background, values, and motivations of the audience and to the context in which communication occurs.
Collaborative Learning
- Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for yourself and others).
- Develop a common purpose and agreement on goals.
- Be able to contribute questions or concerns in a respectful way.
- Share and contribute to the group’s learning in an equitable manner.
Course Topics
Specific learning objectives for our course topics are listed below. Use these to guide your synthesis of course material for specific topics. Note that the topics are listed in the order of the data science pipeline, not the order in which we will cover them in class.
Foundation
Working in RStudio
- Download and install the necessary tools (R, RStudio)
- Open a Quarto (`.qmd`) file from RStudio’s Files pane
- Create a new Quarto document
- Know when to enter commands in the Console vs a Quarto file
- View and modify keyboard shortcuts
- Open a dataset in RStudio’s spreadsheet viewer
- Check if a package is installed using `library()` and by looking at the Packages pane
- Install a package using `install.packages()`
Working in a Quarto file
- Insert a code chunk (also see keyboard shortcuts section below)
- Run all the code in a code chunk (also see keyboard shortcuts section below)
- Insert bold and italic text
- Insert section headers
- Insert an image
General data knowledge
- Determine the unit of observation for a dataset from its description and its first few rows. (That is, what does each row represent?)
- Identify a variable as quantitative or categorical
- Describe the characteristics of tidy data
General coding
- Use the assignment operator (`<-`) to store output into an object
- Identify the different parts of a function: function name, arguments
- Identify the class (type) of an object
- Show the variable types of all columns in a dataset
- Show the number of observations (also called cases) and variables in a dataset
- Show the first and last few rows (observations) of a dataset
- Show the variable (column) names of a dataset
- Read in a CSV file of data and store the data in an object
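The inspection tasks above can be sketched with base R functions. This is only an illustration under stated assumptions: the object name `car_data` and the temporary CSV file are made up for the example, and the course may use tidyverse equivalents (for example `read_csv()`) instead.

```r
# Write a built-in dataset to a temporary CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read the CSV and store the result in an object with the assignment operator
car_data <- read.csv(csv_path)

class(car_data)   # class (type) of the object
str(car_data)     # variable types of all columns
dim(car_data)     # number of observations (rows) and variables (columns)
head(car_data)    # first few rows
tail(car_data)    # last few rows
names(car_data)   # variable (column) names
```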
Data Visualization
The learning goals may be adjusted before we start the material of this section.
General
- Explain the importance of data visualization
- Load the required package for plotting before running any plotting commands
- Explain the “grammar of graphics” (the building up of plots in layers)
- Understand the structure of `ggplot(___, aes(___))`: what goes in the blanks, and what does this code do?
- What do the `+` signs do in `ggplot` code?
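As a rough illustration of the layered structure described above, the sketch below builds one plot from the built-in `mtcars` dataset. The choice of dataset, variables, and labels is an assumption made for the example, not part of the course material.

```r
library(ggplot2)

# First blank: the dataset. Inside aes(): mappings from variables to aesthetics.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +                                           # add a layer of points
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon") +  # informative axis labels
  theme_minimal()                                          # apply a simpler theme

p
```

Each `+` adds another component (a geom, labels, a theme) to the plot object created by `ggplot()`.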
Univariate, bivariate, and multivariate viz
- How can we build appropriate plots for the following types of variables? What geoms and aesthetics can be used?
  - 1 categorical variable
  - 1 quantitative variable
  - 2 categorical variables
  - 2 quantitative variables
  - 1 categorical and 1 quantitative variable
  - 3+ variables that have different combinations of categorical and quantitative variables (There are a lot of combinations, but what principles can be used to think through how to build these up?)
- Explain the difference between the `color` and `fill` arguments
- Add informative labels, captions, and alt text to a plot
- Explain what `theme_minimal()` does
- Interpret a plot to describe key takeaways
  - 1 categorical variable: relative frequencies of different categories
  - 1 quantitative variable: shape of the distribution, range, outliers, center of the distribution
  - 2 quantitative variables: describe the characteristics of the relationship: direction (positive vs negative), shape/trend (line, curve, etc), strength (how close are the points to the trend?)
  - 2 categorical variables: comparisons of frequencies (counts) and proportions of one variable across categories of the other
  - 1 categorical and 1 quantitative variable: comparisons of the distribution shapes, centers, and ranges
- Compare the pros and cons of different plots for answering a research question and facilitating comparisons of interest
Mapping
- Explain the difference between point maps, contour maps, and choropleth maps
- Describe the general code structure for building up a `leaflet` plot in layers and compare to the building up of a `ggplot` in layers
- Plot data points on top of a map (`ggplot()`)
- Create choropleth maps (`geom_map()`)
- Understand the basics of creating a map using `leaflet`, including adding points and choropleths to a base map.
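A minimal sketch of the layered `leaflet` structure mentioned above, for comparison with `ggplot` layers. The data frame `places` and its coordinates are hypothetical values invented for illustration.

```r
library(leaflet)

# Hypothetical points; the names and coordinates are made up for illustration
places <- data.frame(
  name = c("A", "B"),
  lng  = c(-93.17, -93.10),
  lat  = c(44.94, 44.95)
)

m <- leaflet(data = places) |>
  addTiles() |>                                             # base map layer
  addCircleMarkers(lng = ~lng, lat = ~lat, label = ~name)   # point layer

m
```

Where `ggplot` code chains layers with `+`, `leaflet` code chains them with a pipe.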
Data Wrangling
Wrangling Verbs
- Use the following verbs appropriately: `select`, `mutate`, `filter`, `arrange`, `summarize`, `group_by`
- Predict what code will do without running it
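One way the six verbs can fit together is sketched below on the built-in `mtcars` dataset. The particular pipeline and the names `wt_lbs`, `mean_mpg`, and `mpg_by_cyl` are assumptions chosen for the example.

```r
library(dplyr)

mpg_by_cyl <- mtcars |>
  select(mpg, cyl, wt) |>                # keep only these columns
  filter(cyl %in% c(4, 6)) |>            # keep rows for 4- and 6-cylinder cars
  mutate(wt_lbs = wt * 1000) |>          # add a new column
  group_by(cyl) |>                       # form one group per cylinder count
  summarize(mean_mpg = mean(mpg)) |>     # collapse each group to one row
  arrange(desc(mean_mpg))                # sort from highest to lowest mean

mpg_by_cyl
```

A useful exercise is to predict the number of rows and columns after each step before running it.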
Dates
- Use functions in the `lubridate` package to work with dates
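A few commonly used `lubridate` functions are sketched below. The date itself is an arbitrary value chosen for illustration.

```r
library(lubridate)

d <- ymd("2024-03-15")     # parse a year-month-day string into a Date
year(d)                    # extract the year
month(d, label = TRUE)     # extract the month as a labeled factor
wday(d, label = TRUE)      # extract the day of the week
d_later <- d + days(30)    # date arithmetic: 30 days later
d_later
```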
Reshaping Data
- Explain the difference between wide and long data format
- Identify the case (unit of observation) for a data set in a given format
- Use `pivot_wider` and `pivot_longer` in the `tidyr` package
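The wide and long formats can be illustrated with a small, hypothetical table of exam scores; the data and column names below are invented for the example.

```r
library(tidyr)

# Hypothetical wide data: one row per student, one column per exam
wide <- data.frame(
  student = c("a", "b"),
  exam1 = c(90, 80),
  exam2 = c(85, 95)
)

# Long format: one row per student-exam combination
long <- pivot_longer(wide, cols = c(exam1, exam2),
                     names_to = "exam", values_to = "score")

# Back to wide format: one row per student again
wide_again <- pivot_wider(long, names_from = exam, values_from = score)
```

Note how the case changes: a student in `wide`, a student-exam pair in `long`.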
Joining Data
- Explain the concept of keys (variables that uniquely identify rows or cases)
- Explain the difference between left, inner, full, semi, and anti joins
- Explain the difference between mutating and filtering joins
- Use mutating joins (`left_join`, `inner_join`, and `full_join`) and filtering joins (`semi_join`, `anti_join`) in the `dplyr` package to combine information from multiple datasets
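The differences among the joins can be seen on two small, hypothetical tables that share a key variable `id`. The tables and their values are invented for this sketch.

```r
library(dplyr)

# Hypothetical tables that share the key variable `id`
students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades   <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

# Mutating joins add columns from the second table
lj <- left_join(students, grades, by = "id")   # all students; grade is NA if no match
ij <- inner_join(students, grades, by = "id")  # only ids present in both tables
fj <- full_join(students, grades, by = "id")   # all ids from either table

# Filtering joins keep or drop rows of the first table without adding columns
sj <- semi_join(students, grades, by = "id")   # students with a match in grades
aj <- anti_join(students, grades, by = "id")   # students without a match in grades
```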
Working with Character Data as Factors
- Explain the difference between a variable stored as a `character` vs. a `factor`
- Convert a `character` variable to a `factor`
- Manipulate the order and values of a factor with the `forcats` package to improve summaries and visualizations
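A short sketch of converting a character vector to a factor and then adjusting its levels. The vector `sizes` is a made-up example, and `fct_relevel()` and `fct_recode()` are only two of several `forcats` functions that could be used here.

```r
library(forcats)

sizes <- c("small", "large", "medium", "small")
class(sizes)                       # "character"

sizes_fct <- factor(sizes)         # levels default to alphabetical order
levels(sizes_fct)

# Reorder the levels into a meaningful order
sizes_fct <- fct_relevel(sizes_fct, "small", "medium", "large")
levels(sizes_fct)

# Change the values (labels) of the levels
sizes_short <- fct_recode(sizes_fct, S = "small", M = "medium", L = "large")
sizes_short
```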
Working with Character Data as Strings
- Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the `stringr` package
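Each of the tasks above has a corresponding `stringr` function; one possible pairing is sketched below. The strings in `x` are illustrative ("COMP 123" is a hypothetical course code), and the patterns are simple assumptions rather than the course's own examples.

```r
library(stringr)

x <- c("STAT 155", "COMP 123", "STAT 253")

starts_stat <- str_detect(x, "^STAT")    # does each string start with "STAT"?
str_replace(x, "STAT", "Statistics")     # search and replace
str_locate(x, "[0-9]+")                  # start and end positions of the digits
numbers <- str_extract(x, "[0-9]+")      # pull out the course number
str_split(x, " ")                        # separate the text at the space
```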
Starting a Data Project
The learning goals may be adjusted before we start the material of this section.
Data Import
- Find existing data sets
- Save data sets locally
- Load data into RStudio
- Do some preliminary data checking and cleaning steps before further wrangling / visualization:
  - Make sure variables are properly formatted
  - Deal with missing values
EDA
- Understand the first steps that should be taken when you encounter a new data set
- Develop comfort in knowing how to explore data to understand it
- Develop comfort in formulating research questions
Errors
What can cause the following errors? How do we fix our code to deal with the error?
Error: could not find function "___"
Error: object '___' not found
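Both errors can be reproduced on purpose, which may help when diagnosing them. In this sketch the names `not_a_real_function` and `not_a_real_object` are deliberately undefined placeholders, and `tryCatch()` is used only so the example keeps running after each error.

```r
# Calling a function that has not been defined, is misspelled,
# or lives in a package that has not been loaded with library()
msg1 <- tryCatch(not_a_real_function(1),
                 error = function(e) conditionMessage(e))
msg1

# Referring to an object that has not been created yet or whose name is misspelled
msg2 <- tryCatch(print(not_a_real_object),
                 error = function(e) conditionMessage(e))
msg2
```

Typical fixes include loading the relevant package, checking spelling and capitalization, and running the earlier code that creates the object.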
Keyboard shortcuts
- Insert assignment operator
  - Check the “Modify Keyboard Shortcuts…” dialog box under the Tools menu. (Search for “assign”.)
- Insert code chunk
  - Ctrl + Alt + I (Windows and Linux)
  - Command + Option + I (Mac)
- Insert a pipe (`%>%` or `|>`)
  - Ctrl + Shift + M (Windows and Linux)
  - Command + Shift + M (Mac)
- Run all code in a code chunk
  - Ctrl + Shift + Enter (Windows and Linux)
  - Command + Shift + Enter (Mac)
- Run only the command on the line that the cursor is on
  - Ctrl + Enter (Windows and Linux)
  - Command + Enter (Mac)
- Comment out/uncomment lines
  - Ctrl + Shift + C (Windows and Linux)
  - Command + Shift + C (Mac)