We have learned

  • Fundamentals of R.
  • Functions, packages, environment, and namespace.
  • Reproducibility.
  • Getting help.
  • Git/GitHub
  • Indexing, subsetting, and replacing
  • *apply family

Today

  • Tidy data and the tidyverse
  • Reading and writing data, and paths

Tidy data

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Tidy data

# A tibble: 4 × 4
  name   assignment1 assignment2 quiz1
  <chr>  <chr>       <chr>       <chr>
1 Billy  <NA>        D           C    
2 Suzy   F           <NA>        <NA> 
3 Lionel B           C           B    
4 Jenny  A           A           B    

Is this table tidy?

Tidy data

# A tibble: 12 × 3
   name   assessment  grade
   <chr>  <chr>       <chr>
 1 Billy  assignment1 <NA> 
 2 Billy  assignment2 D    
 3 Billy  quiz1       C    
 4 Suzy   assignment1 F    
 5 Suzy   assignment2 <NA> 
 6 Suzy   quiz1       <NA> 
 7 Lionel assignment1 B    
 8 Lionel assignment2 C    
 9 Lionel quiz1       B    
10 Jenny  assignment1 A    
11 Jenny  assignment2 A    
12 Jenny  quiz1       B    
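
A minimal sketch of how the wide table could be reshaped into this long form, assuming pivot_longer() from tidyr. The object names classroom_wide and classroom_long are made up for illustration.

```r
library(tibble)
library(tidyr)

# Rebuild the wide table from the earlier slide (one column per assessment).
classroom_wide <- tribble(
  ~name,    ~assignment1,  ~assignment2,  ~quiz1,
  "Billy",  NA_character_, "D",           "C",
  "Suzy",   "F",           NA_character_, NA_character_,
  "Lionel", "B",           "C",           "B",
  "Jenny",  "A",           "A",           "B"
)

# Gather every column except name into assessment/grade pairs.
classroom_long <- pivot_longer(
  classroom_wide,
  cols      = -name,
  names_to  = "assessment",
  values_to = "grade"
)
classroom_long
```

Each assessment column becomes a value in the assessment column, so every row now holds one student, one assessment, and one grade.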

Tidy data

How about this one?

# A tibble: 3 × 5
  assessment  Billy Suzy  Lionel Jenny
  <chr>       <chr> <chr> <chr>  <chr>
1 assignment1 <NA>  F     B      A    
2 assignment2 D     <NA>  C      A    
3 quiz1       C     <NA>  B      B    

Tidy data

# A tibble: 12 × 3
   assessment  student grade
   <chr>       <chr>   <chr>
 1 assignment1 Billy   <NA> 
 2 assignment1 Suzy    F    
 3 assignment1 Lionel  B    
 4 assignment1 Jenny   A    
 5 assignment2 Billy   D    
 6 assignment2 Suzy    <NA> 
 7 assignment2 Lionel  C    
 8 assignment2 Jenny   A    
 9 quiz1       Billy   C    
10 quiz1       Suzy    <NA> 
11 quiz1       Lionel  B    
12 quiz1       Jenny   B    

Introduction to tidyverse

A collection of packages to create, process and manipulate tidy data.

  • dplyr, provides functions for data manipulation
  • tidyr, provides functions to get to tidy data
  • ggplot2, a system to create figures.
  • stringr, provides functions to deal with strings

Visit the tidyverse website together.

dplyr vs base R

  • filter (by condition) and slice (by index) to subset rows
  • select to subset columns
  • pull to extract a column as a vector (like double brackets [[ ]])
  • mutate to add a new column or modify an existing one
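
A side-by-side sketch of the verbs above and their rough base R counterparts. The small classroom tibble and the passed column are stand-ins made up for illustration.

```r
library(dplyr)

# Small stand-in table, used only for this comparison.
classroom <- tibble(
  student = c("Billy", "Suzy", "Lionel", "Jenny"),
  grade   = c(NA, "F", "B", "A")
)

# Rows by condition. filter() drops rows where the condition is NA,
# so the base R version wraps the condition in which() to match.
a_dplyr <- filter(classroom, grade == "A")
a_base  <- classroom[which(classroom$grade == "A"), ]

# Rows by position.
r_dplyr <- slice(classroom, 1:2)
r_base  <- classroom[1:2, ]

# Columns by name (the result is still a table).
c_dplyr <- select(classroom, student)
c_base  <- classroom["student"]

# One column as a plain vector.
v_dplyr <- pull(classroom, student)
v_base  <- classroom[["student"]]

# Add a new column.
m_dplyr <- mutate(classroom, passed = grade != "F")
m_base  <- classroom
m_base$passed <- m_base$grade != "F"
```

Without which(), classroom[classroom$grade == "A", ] would also return an all-NA row for Billy, because his grade is missing.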

Run an example

library(dplyr)
classroom
# A tibble: 12 × 3
   assessment  student grade
   <chr>       <chr>   <chr>
 1 assignment1 Billy   <NA> 
 2 assignment1 Suzy    F    
 3 assignment1 Lionel  B    
 4 assignment1 Jenny   A    
 5 assignment2 Billy   D    
 6 assignment2 Suzy    <NA> 
 7 assignment2 Lionel  C    
 8 assignment2 Jenny   A    
 9 quiz1       Billy   C    
10 quiz1       Suzy    <NA> 
11 quiz1       Lionel  B    
12 quiz1       Jenny   B    
# filter
good_stu <- filter(classroom, grade == "A")
good_stu
# A tibble: 2 × 3
  assessment  student grade
  <chr>       <chr>   <chr>
1 assignment1 Jenny   A    
2 assignment2 Jenny   A    
first_stus <- slice(classroom, 1:2)
first_stus
# A tibble: 2 × 3
  assessment  student grade
  <chr>       <chr>   <chr>
1 assignment1 Billy   <NA> 
2 assignment1 Suzy    F    

Run an example

# select
students <- select(classroom, student, grade)
students
# A tibble: 12 × 2
   student grade
   <chr>   <chr>
 1 Billy   <NA> 
 2 Suzy    F    
 3 Lionel  B    
 4 Jenny   A    
 5 Billy   D    
 6 Suzy    <NA> 
 7 Lionel  C    
 8 Jenny   A    
 9 Billy   C    
10 Suzy    <NA> 
11 Lionel  B    
12 Jenny   B    
# pull
stuts <- pull(classroom, student)
stuts
 [1] "Billy"  "Suzy"   "Lionel" "Jenny"  "Billy"  "Suzy"   "Lionel" "Jenny" 
 [9] "Billy"  "Suzy"   "Lionel" "Jenny" 

Run an example

# mutate
classroom <- mutate(classroom, good = ifelse(grade == "A", 1, 0))
classroom
# A tibble: 12 × 4
   assessment  student grade  good
   <chr>       <chr>   <chr> <dbl>
 1 assignment1 Billy   <NA>     NA
 2 assignment1 Suzy    F         0
 3 assignment1 Lionel  B         0
 4 assignment1 Jenny   A         1
 5 assignment2 Billy   D         0
 6 assignment2 Suzy    <NA>     NA
 7 assignment2 Lionel  C         0
 8 assignment2 Jenny   A         1
 9 quiz1       Billy   C         0
10 quiz1       Suzy    <NA>     NA
11 quiz1       Lionel  B         0
12 quiz1       Jenny   B         0

Revisit Task 1 in code cracker

  • How many unique crop types are in the label column?
  • What is the name of the crop that appears first alphabetically?
  • What is the name of the crop that appears last alphabetically?
crop_types <- unique(crops[["label"]])
n_crops <- length(crop_types)
first_crop <- sort(crop_types)[1]
last_crop <- sort(crop_types, decreasing = TRUE)[1]

Use dplyr syntax to redo this task. Share your solution in Chat.

Pipeline

  • For these two lines:
crop_types <- unique(crops[["label"]])
n_crops <- length(crop_types)


  • I can chain all functions into a single pipeline using %>%.
n_crops <- crops %>% pull(label) %>% unique() %>% length()

Pipeline

  • The full version is:
n_crops <- crops %>% pull(., label) %>% unique(.) %>% length(.)
  • Use . to refer to the result from the previous step. By default, the result is passed as the first argument.


  • Sometimes you may want to control where the result is passed in the next function call:
crops %>% pull(label) %>% unique() %>% length() %>% paste0("Crop number is: ", .)
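
A small runnable sketch of the . placeholder. The crops tibble here is a hypothetical stand-in with only a label column, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the crops data; only label matters here.
crops <- tibble(label = c("rice", "maize", "rice", "apple", "maize"))

# Default behaviour: each result becomes the first argument of the next call.
n_crops <- crops %>% pull(label) %>% unique() %>% length()

# With the dot, the result is placed where the dot appears.
msg_last <- n_crops %>% paste0("Crop number is: ", .)

# Without the dot, the result goes in front of the text instead.
msg_first <- n_crops %>% paste0("Crop number is: ")

msg_last   # "Crop number is: 3"
msg_first  # "3Crop number is: "
```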

Your try

Revisit Task 2 in code cracker:

  • What is the maximum N (Nitrogen) value for maize?

Your task:

  • Get the result in one dplyr pipeline
  • Share your solution in Chat.

Your try (10 mins)

  • Redo all 5 code cracker tasks using dplyr syntax one by one.
  • Challenge: try to finish every step in a single pipeline.
  • Ask questions if you get stuck.

Read and write data

  • read.csv to read a CSV file
  • write.csv to save a CSV file
# Base R
something <- read.csv("/path/to/the/file", stringsAsFactors = FALSE)
write.csv(something, "/path/for/the/new/file", row.names = FALSE)
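
A round-trip sketch that can be run as is. It writes to a temporary file, so no real path is needed. The grades data frame and the object names are made up for illustration.

```r
# A tiny data frame to save; the values are illustrative only.
grades <- data.frame(
  student = c("Billy", "Lionel"),
  grade   = c("D", "B"),
  stringsAsFactors = FALSE
)

# tempfile() returns a throwaway path, which keeps the example self-contained.
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers.
write.csv(grades, out_file, row.names = FALSE)

# Read the file back and compare with what was written.
grades_back <- read.csv(out_file, stringsAsFactors = FALSE)
all.equal(grades, grades_back)
```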

Paths

  • Interactive environment (e.g. console):

    • getwd() to get the working directory
    • setwd() to set the new working directory
    • relative paths are relative to this working directory.
  • R Markdown knit directory, which can be set to one of:

    • Document directory
    • Project directory
    • Current working directory
  • R script

    • Current working directory
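
A short sketch of how relative paths follow the working directory. It switches to a temporary directory and back; example.csv is a hypothetical file name.

```r
# Remember where we started so it can be restored afterwards.
old_wd <- getwd()

# Switch to a temporary directory for the demonstration.
setwd(tempdir())

# The relative path "example.csv" is resolved against the
# current working directory, i.e. the temporary directory.
writeLines("student,grade", "example.csv")
found <- file.exists("example.csv")

# The same file, addressed by its full path.
full_path <- file.path(getwd(), "example.csv")

# Restore the original working directory.
setwd(old_wd)
```

After the final setwd(), "example.csv" on its own no longer points at that file, while full_path still does.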

Strategies to work with paths

The here package is definitely worth trying.

  • The here() function from the here package builds the correct path to a file within a project:

    • here::here() returns the project root path
    • here::here("a/path/relative/to/your/project/root") returns the full path to a file within your project.
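
A minimal sketch of here in use, assuming the package is installed. The folder data and the file crops.csv are hypothetical names used only for illustration.

```r
library(here)

# Project root as detected by here. If no project marker is found,
# the current working directory is used instead.
root <- here()

# Path components are given relative to the project root,
# either as separate arguments or as one "data/crops.csv" string.
csv_path <- here("data", "crops.csv")

root
csv_path
```

Because the path is built from the project root, the same line works from the console, from a script, and when knitting an R Markdown document.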

Homework

  • Finish reading Unit1-Module4.
  • Finish rewriting all five code cracker tasks using dplyr if you did not complete them earlier in class.
  • Read this online section to learn more about tidy data.
  • Important: Finish Homework (“Tidyversing” the crop dataset section).