Junior Visit Day

Welcome! 👋
I’m excited to give you a sense of what we learn in my Introduction to Data Science course.

A PDF of your handouts is available here.





Welcome!

A little about me

  • B.S. from Johns Hopkins University
    • Majors: Biomedical Engineering, Applied Mathematics and Statistics
    • Minor: Computer Science
  • PhD in Biostatistics from Johns Hopkins
  • I’ve been teaching across the statistics and data science curriculum at Mac since 2018:
    • Introductory and intermediate courses in statistics and data science
    • An upper-level capstone course called Causal Inference (how do we study the effects of policies and interventions?)
  • In addition to my teaching, I am an environmental activist.
    • In June, I will start part-time as the Research and Policy Director at the MN Environmental Justice Table.
    • This work also shows up in my involvement with our Sustainability Office at Mac.
    • I am part of a coalition working to shut down a trash incinerator in Minneapolis.
      • Data skills have allowed me to acquire, analyze, and visualize data related to air pollution and its health risks
    • I am hiring two students to help me with a research project this summer that examines the incinerator’s impacts in new ways.





What do we learn in Introduction to Data Science?

  • Principles and technical tools for working with data
  • Students gain a lot of experience with the R programming language for data management, exploration, and visualization.
  • We spend about the first two-thirds of the semester on content and the final third on process skills during an intensive project experience.





Data Visualization - Principles

One of the quickest ways to gain insight from data is to display it in visual form.

  • Compare groups
  • Observe trends
  • Notice outliers

However, designing an effective visualization is hard!

We’ll explore this through examples.




Everyone has a printout of one of the following two visualizations. Take a few minutes to think through the following, and write down notes about what you observe:

  • What message is the visualization trying to convey?
    • How clearly is this message conveyed? Is it hard or easy to find the information you’re interested in?
    • Are aspects that you want to compare placed next to each other to facilitate comparison?
    • Does the use of color help or hinder?
    • Do labels help give context?

After you examine your visualization, share thoughts with others sitting near you who had a different visualization.



Example 1

IMAGE 1. Source: N. Yau, Visualize This, 2011, pp. 223-225.



Example 2

IMAGE 2. Source: NYTimes





Data Visualization - Coding

One big emphasis in the course is writing code to create data visualizations.

Let’s get a taste of what this looks like with some examples.

The following visualizations use a dataset on 344 penguins.

  • Each row corresponds to one penguin and the columns give information about each penguin.
  • 6 rows are shown below.
  • For each visualization, connect what you see in the plot to the parts of the code that produce it.
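
If you are curious how a six-row preview like this might be produced, here is a minimal sketch. It assumes the data come from the palmerpenguins R package (this handout does not name the package) and that the dplyr package is installed. The name `preview` is just an illustrative choice, and the rows it selects are not guaranteed to match the ones printed here.

```r
library(dplyr)           # data-wrangling verbs: select(), group_by(), slice_head()
library(palmerpenguins)  # ASSUMPTION: source of the 344-penguin dataset

# Keep five columns, then take the first two penguins of each species.
preview <- penguins %>%
    select(species, bill_length_mm, flipper_length_mm, body_mass_g, sex) %>%
    group_by(species) %>%
    slice_head(n = 2) %>%
    ungroup()

print(preview)
```
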
## # A tibble: 6 × 5
##   species   bill_length_mm flipper_length_mm body_mass_g sex   
##   <fct>              <dbl>             <int>       <int> <fct> 
## 1 Adelie              39.1               181        3750 male  
## 2 Adelie              39.5               186        3800 female
## 3 Chinstrap           46.5               192        3500 female
## 4 Chinstrap           50                 196        3900 male  
## 5 Gentoo              46.1               211        4500 female
## 6 Gentoo              50                 230        5700 male
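library(ggplot2)         # plotting functions: ggplot(), aes(), geom_*(), labs()
library(palmerpenguins)  # assumed source of the penguins data (not named in this handout)

# Bar chart of species
ggplot(penguins, aes(x = species)) +
    geom_bar()

# Same bar chart, with each bar split by sex
ggplot(penguins, aes(x = species, fill = sex)) +
    geom_bar()

# Scatterplot of body mass against flipper length
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point()

# Same scatterplot with a smoothed trend line added
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point() +
    geom_smooth()

# Same plot with readable axis labels
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
    geom_point() +
    geom_smooth() +
    labs(x = "Flipper length (mm)", y = "Body mass (g)")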
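
To see the pattern behind these examples, here is a small sketch you could run on your own. It types in only the six rows printed above as a stand-in for the full dataset, so the names `penguins_small`, `species_counts`, and `colored_plot` are illustrative and not part of the course materials. It shows two ideas: a bar chart is built from counts of rows in each category, and mapping one more column to an aesthetic (here, `color`) adds a third variable to a scatterplot.

```r
library(ggplot2)

# Stand-in data: the six example rows shown in this handout, typed by hand.
penguins_small <- data.frame(
    species = factor(c("Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo")),
    bill_length_mm = c(39.1, 39.5, 46.5, 50, 46.1, 50),
    flipper_length_mm = c(181L, 186L, 192L, 196L, 211L, 230L),
    body_mass_g = c(3750L, 3800L, 3500L, 3900L, 4500L, 5700L),
    sex = factor(c("male", "female", "female", "male", "female", "male"))
)

# geom_bar() draws one bar per category, with height equal to the number of rows.
# table() gives those same counts as numbers.
species_counts <- table(penguins_small$species)
print(species_counts)

# Mapping species to color inside aes() colors each point by its species.
colored_plot <- ggplot(penguins_small,
                       aes(x = flipper_length_mm, y = body_mass_g, color = species)) +
    geom_point() +
    labs(x = "Flipper length (mm)", y = "Body mass (g)", color = "Species")

print(colored_plot)
```

With only six rows the plot is sparse, but the code has the same shape as the examples above: a dataset, a set of aesthetic mappings inside `aes()`, and one or more layers added with `+`.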