Junior Visit Day

Welcome! 👋
I’m excited to give you a sense of what we learn in my Introduction to Data Science course.

A PDF of your handouts is available here.

A little about me

B.S. from Johns Hopkins University
- Majors: Biomedical Engineering, Applied Mathematics and Statistics
- Minor: Computer Science
Ph.D. in Biostatistics from Johns Hopkins University
I’ve been teaching across the statistics and data science curriculum at Mac since 2018:
- Introductory and intermediate courses in statistics and data science
- An upper-level capstone course called Causal Inference (how do we study the effects of policies and interventions?)
In addition to my teaching, I am an environmental activist.
- I will be starting part time as the Research and Policy Director at the MN Environmental Justice Table in June.
- This shows up in my involvement at Mac with our Sustainability Office.
- I am part of a coalition working to shut down a trash incinerator in Minneapolis.
  - Data skills have allowed me to acquire, analyze, and visualize data related to air pollution and its health risks.
- I am hiring two students to help me with a research project this summer that will examine the incinerator’s impacts in new ways.

What do we learn in Introduction to Data Science?

Principles and technical tools for working with data
Students gain a lot of experience with the R programming language for data management, exploration, and visualization.
We spend about two-thirds of the semester on content and the final third on process skills during an intensive project experience.

Data Visualization - Principles

One of the quickest ways to gain insight from data is to display it in visual form.

Compare groups
Observe trends
Notice outliers

However, designing an effective visualization is hard!

We’ll explore this through examples.

Everyone has a printout of one of the two visualizations below. Take a few minutes to think through these questions, and write down notes about what you observe:

What message is the visualization trying to convey?
- How clearly is this message conveyed? Is it hard or easy to find the information you’re interested in?
- Are aspects that you want to compare placed next to each other to facilitate comparison?
- Does the use of color help or hinder?
- Do labels help give context?

After you examine your visualization, share thoughts with others sitting near you who had a different visualization.

Example 1

IMAGE 1. Source: N. Yau, *Visualize This*, 2011, pp. 223-225.

Example 2

Data Visualization - Coding

One big emphasis in the course is writing code to create data visualizations.

Let’s get a taste of what this looks like with some examples.

The following visualizations use a dataset on 344 penguins.

Each row corresponds to one penguin and the columns give information about each penguin.
Six rows are shown below.
For each visualization, connect what you see in the visual to the corresponding parts of the code.

## # A tibble: 6 × 5
##   species   bill_length_mm flipper_length_mm body_mass_g sex   
##   <fct>              <dbl>             <int>       <int> <fct> 
## 1 Adelie              39.1               181        3750 male  
## 2 Adelie              39.5               186        3800 female
## 3 Chinstrap           46.5               192        3500 female
## 4 Chinstrap           50                 196        3900 male  
## 5 Gentoo              46.1               211        4500 female
## 6 Gentoo              50                 230        5700 male
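
A preview like the one above could be produced roughly as follows. This is a sketch, not necessarily how the handout was generated: it assumes the data come from the palmerpenguins package and that dplyr is installed, and `penguins_small` is a hypothetical name. The handout does not say how its six rows were picked, so `head()` here simply shows the first six rows.

```r
library(dplyr)
library(palmerpenguins)  # assumption: the package that provides the penguins data

# Keep only the five columns shown in the handout
penguins_small <- penguins %>%
    select(species, bill_length_mm, flipper_length_mm, body_mass_g, sex)

# Number of rows (penguins) and columns (pieces of information per penguin)
dim(penguins_small)

# Show the first six rows
head(penguins_small)
```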

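library(ggplot2)         # provides ggplot(), aes(), and the geom_*() functions
library(palmerpenguins)  # assumed source of the penguins data
ggplot(penguins, aes(x = species)) +
    geom_bar()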

ggplot(penguins, aes(x = species, fill = sex)) +
    geom_bar()
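
One way to connect the two bar charts to the data is to compute the counts that `geom_bar()` draws. The sketch below assumes dplyr is installed and that the data come from the palmerpenguins package; `species_counts` and `species_sex_counts` are hypothetical names.

```r
library(dplyr)
library(palmerpenguins)  # assumption: the package that provides the penguins data

# One row per species; n is the number of penguins (the bar heights in the first chart)
species_counts <- count(penguins, species)

# One row per species and sex combination (the colored segments in the second chart)
species_sex_counts <- count(penguins, species, sex)

species_counts
species_sex_counts
```

Adding `fill = sex` does not change the total height of each bar. It only splits each bar into segments, one per value of `sex`.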

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point() +
    geom_smooth()

ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point() +
    geom_smooth() +
    labs(x = "Flipper length (mm)", y = "Body mass (g)")
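
The last three snippets build on each other, and each `+` adds one more piece to the plot. A minimal sketch of that idea stores every step in its own variable; the variable names are hypothetical, and the palmerpenguins package is assumed as the data source.

```r
library(ggplot2)
library(palmerpenguins)  # assumption: the package that provides the penguins data

# Step 0: choose the data and map variables to the x and y axes (nothing is drawn yet)
base <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g))

# Step 1: draw one point per penguin
with_points <- base + geom_point()

# Step 2: overlay a smooth trend line
with_trend <- with_points + geom_smooth()

# Step 3: replace the default axis titles with readable labels
final <- with_trend + labs(x = "Flipper length (mm)", y = "Body mass (g)")

# Each geom_*() call contributes one layer; labs() changes labels only
length(base$layers)
length(final$layers)

# Printing a plot object draws it
final
```

Printing `base`, `with_points`, `with_trend`, and `final` one after another shows which part of the picture each line of code is responsible for.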