This blog post summarizes the book titled Data Manipulation in R, written by Stephanie Locke.

The author clearly states that this book is for beginners. Well I would say that even if you have good experience with R but have not worked with the recent tidyverse package, then this book is worth the read. In any case, this book is super short. An experienced R user can probably read it in a few hours.

Libraries used in the book

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
library(tidyverse)
library(ggplot2movies)
library(nycflights13)
library(odbc)
library(writexl)
library(openxlsx)
library(gapminder)
library(readr)
library(readxl)
library(lubridate)
library(stringr)

Getting Data

One can use readr , readxl and purr packages to efficiently read data from multiple formats and combine them together.

1
2
3
4
5
6
7
8
iris <- read_csv('./data/iris.csv')
iris <- iris%>%rename(Sepal.Length= sepal.length,
                   Sepal.Width = sepal.width,
                   Petal.Length = petal.length,
                   Petal.Width = petal.width,
                   Species = variety)
iris2 <- read_excel('./data/iris.xlsx')
iris3 <- map_df(list.files("./data/", pattern = "sample", full.names = TRUE), read_csv)

There are other packages being actively developed by R community that follow the tidyverse philosophy.

Data Pipelines

This chapter introduced piping, an operator that allows us to create series of connected transformations that goes from the source to the destination. Instead of nesting functions you hook them together like pieces of plumbing so that your data passes through them and changes as it goes.

The pipe operator was first designed and implemented in R by Stefan Milton Bache in the package magrittr. Created in 2014, it is now very heavily used and its widest option has been within the tidyverse.

1
2
3
4
letters%>%toupper()%>%length()
## [1] 26
iris%>%head()%>%nrow()
## [1] 6

An interesting tidbit that I have learnt from the chapter is the assignment. I have always used the assignment as follows

1
temp <- iris %>% head()

But the author seems to be making a nice point where since the result of all the operations are assigned to a variable, it is better to use RHS operator

1
iris %>% head() -> temp

Whenever you use the functions with in a tidyverse framework, it is always better to know the positional argument of data. By default, pipelines will put our object as the first input to a function. To work with jumbled functions, we need a way to tell our pipelines where to use our input instead of relying on defaults. The way we can do that is with a period.

Filtering Columns

This chapter introduces a bunch of different ways to select features from a tibble

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46

iris%>%select(Species)%>%head(1)
## # A tibble: 1 x 1
##   Species
##   <chr>  
## 1 Setosa
iris%>%select(-Species)%>%head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
iris%>%select(Sepal.Length:Petal.Length)%>%head(1)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
iris%>%select(-(Sepal.Length:Petal.Length))%>%head(1)
## # A tibble: 1 x 2
##   Petal.Width Species
##         <dbl> <chr>  
## 1         0.2 Setosa
iris%>%select(starts_with('s'))%>%head(1)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <chr>  
## 1          5.1         3.5 Setosa
iris%>%select(ends_with('s'))%>%head(1)
## # A tibble: 1 x 1
##   Species
##   <chr>  
## 1 Setosa
iris%>%select(contains('Length'))%>%head(1)
## # A tibble: 1 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
iris%>%select_if(is.numeric) %>%head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
iris%>%select_if(~is.numeric(.) & n_distinct(.)>30) %>%head(1)
## # A tibble: 1 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4

Filtering Rows

This chapter introduces basic syntax around using slice, filter_if , filter_all and filter_at functions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
iris%>%slice(1:2)
## # A tibble: 2 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa 
## 2          4.9         3            1.4         0.2 Setosa
iris%>%filter(Species=='Virginica') %>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%filter(Species == 'Virginica', Sepal.Length > mean(Sepal.Length))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%filter(Species == 'Virginica'|Sepal.Length > mean(Sepal.Length))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
## 1            7         3.2          4.7         1.4 Versicolor
iris%>%filter_all(any_vars(.>7.5))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%select_if(is.numeric)%>%filter_all(any_vars(abs(. - mean(.) ) > 2*sd(.))) %>% head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.8           4          1.2         0.2
iris%>%filter_if(~is.numeric(.), any_vars(.<mean(.)))%>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%filter_at(vars(ends_with("Length")), any_vars(.<mean(.)))%>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa

Working with names

This chapter introduces some of the functions relating to renaming rows and columns of a tibble

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
iris%>% select_all(str_to_lower)%>%head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width petal.length petal.width species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%rename_if(is.numeric, str_to_lower)%>% head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width petal.length petal.width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%rename_at(vars(starts_with('S')), str_to_lower) %>%head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width Petal.Length Petal.Width species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
mtcars%>%rownames_to_column("cars") %>%head(1)
##        cars mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 1 Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Re-arranging your data

The following snippets summarize the main syntax introduced in this chapter:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
iris%>%arrange(desc(Species))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%arrange(desc(Species), Sepal.Length)%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          4.9         2.5          4.5         1.7 Virginica
iris %>%
    arrange_all() %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          4.3           3          1.1         0.1 Setosa
iris %>%
    arrange_if(is.character, desc) %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris %>%
    arrange_at(vars(Species, starts_with('P')), desc) %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          7.7         2.6          6.9         2.3 Virginica
iris %>%
    select(sort(tidyselect::peek_vars())) %>%
    head(1)
## # A tibble: 1 x 5
##   Petal.Length Petal.Width Sepal.Length Sepal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          1.4         0.2          5.1         3.5 Setosa

Changing your Data

The following snippets summarize the main syntax introduced in this chapter:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
iris %>%
    mutate(x = Sepal.Width * Sepal.Length) %>%
    head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl>
## 1          5.1         3.5          1.4         0.2 Setosa   17.8

iris %>% mutate(Sepal.Width=NULL) %>% head(1) ## # A tibble: 1 x 4 ## Sepal.Length Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <chr> ## 1 5.1 1.4 0.2 Setosa

iris %>% mutate(id = row_number()) %>% head(1) ## # A tibble: 1 x 6 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species id ## <dbl> <dbl> <dbl> <dbl> <chr> <int> ## 1 5.1 3.5 1.4 0.2 Setosa 1

iris %>% mutate(size= case_when( Sepal.Length<mean(Sepal.Length) ~ "s", Sepal.Length>mean(Sepal.Length) ~ "l", TRUE ~ "m" )) %>% head(1) ## # A tibble: 1 x 6 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species size ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 5.1 3.5 1.4 0.2 Setosa s

iris %>% mutate(Species = case_when( Species == 'setosa' ~ toupper(Species), TRUE ~ as.character(Species) )) %>% head(1) ## # A tibble: 1 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 Setosa

iris %>% mutate(runagg = cumall(Sepal.Length < mean(Sepal.Length))) %>% head(1) ## # A tibble: 1 x 6 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species runagg ## <dbl> <dbl> <dbl> <dbl> <chr> <lgl> ## 1 5.1 3.5 1.4 0.2 Setosa TRUE

iris %>% transmute(Sepal.Width = floor(Sepal.Width), Species == case_when( Species == 'setosa' ~ toupper(Species), TRUE ~ as.character(Species))) %>% head(1) ## # A tibble: 1 x 2 ## Sepal.Width ==... ## <dbl> <lgl> ## 1 3 TRUE

iris %>% mutate_all(as.character) ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <chr> <chr> <chr> <chr> <chr> ## 1 5.1 3.5 1.4 0.2 Setosa ## 2 4.9 3 1.4 0.2 Setosa ## 3 4.7 3.2 1.3 0.2 Setosa ## 4 4.6 3.1 1.5 0.2 Setosa ## 5 5 3.6 1.4 0.2 Setosa ## 6 5.4 3.9 1.7 0.4 Setosa ## 7 4.6 3.4 1.4 0.3 Setosa ## 8 5 3.4 1.5 0.2 Setosa ## 9 4.4 2.9 1.4 0.2 Setosa ## 10 4.9 3.1 1.5 0.1 Setosa ## # … with 140 more rows

iris %>% mutate_if(is.numeric,~. +rnorm(.)) ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.69 2.02 1.89 -0.526 Setosa ## 2 3.35 2.56 1.37 -0.285 Setosa ## 3 5.79 2.36 0.739 -0.134 Setosa ## 4 3.82 1.41 1.80 -1.20 Setosa ## 5 5.53 2.16 1.45 0.262 Setosa ## 6 4.57 4.17 -0.261 0.918 Setosa ## 7 5.63 2.62 1.65 0.697 Setosa ## 8 4.20 5.27 -0.121 -0.808 Setosa ## 9 2.40 3.58 0.253 0.0210 Setosa ## 10 5.14 4.42 0.695 -1.64 Setosa ## # … with 140 more rows

iris %>% mutate_at(vars(Sepal.Width:Petal.Width), ~ . +rnorm(.)) ## # A tibble: 150 x 5 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 4.60 0.0330 -2.38 Setosa ## 2 4.9 2.82 0.946 2.72 Setosa ## 3 4.7 4.54 0.371 0.645 Setosa ## 4 4.6 4.00 1.16 1.39 Setosa ## 5 5 4.28 1.01 0.679 Setosa ## 6 5.4 1.55 1.70 -0.0684 Setosa ## 7 4.6 2.85 0.196 -0.0835 Setosa ## 8 5 3.59 0.379 -0.448 Setosa ## 9 4.4 1.64 1.65 0.848 Setosa ## 10 4.9 3.61 1.28 -0.817 Setosa ## # … with 140 more rows

Working with dates and times

The following snippets summarize the main syntax introduced in this chapter: relating to lubridate package

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
test_date <- ymd_hms("20210505140101")
yday(test_date);mday(test_date);wday(test_date);
## [1] 125
## [1] 5
## [1] 4
hour(test_date);minute(test_date);second(test_date)
## [1] 14
## [1] 1
## [1] 1
test_date + years(1)
## [1] "2022-05-05 14:01:01 UTC"
test_date + months(1)
## [1] "2021-06-05 14:01:01 UTC"
test_date + days(1)
## [1] "2021-05-06 14:01:01 UTC"
test_date + minutes(1)
## [1] "2021-05-05 14:02:01 UTC"
format(test_date, c("%Y","%F","%z"))
## [1] "2021"       "2021-05-05" "+0000"
floor_date(test_date, "month");ceiling_date(test_date, 'month')
## [1] "2021-05-01 UTC"
## [1] "2021-06-01 UTC"
test_date + months(1:4)
## [1] "2021-06-05 14:01:01 UTC" "2021-07-05 14:01:01 UTC"
## [3] "2021-08-05 14:01:01 UTC" "2021-09-05 14:01:01 UTC"
test_date  %m+% months(1:4)
## [1] "2021-06-05 14:01:01 UTC" "2021-07-05 14:01:01 UTC"
## [3] "2021-08-05 14:01:01 UTC" "2021-09-05 14:01:01 UTC"
test_date %within% interval(ymd('20190101'), ymd('20220101'))
## [1] TRUE

Working with text

The following snippets summarize the main syntax introduced in this chapter : relating to stringr , purrr and forcats package

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
simple <- "This IS HOrribly typed! "
numbers <- c('02', '11', '10', '1')
str_detect(simple, 'typed')
## [1] TRUE
str_extract(simple, 'typed.*')
## [1] "typed! "
str_replace(simple, 'typed', 'written')
## [1] "This IS HOrribly written! "
str_replace_all(simple,'r', 'b')
## [1] "This IS HObbibly typed! "
str_count(simple, "[iI]")
## [1] 3
patterns <- c("[aeiou]","[aeiou].*$", "[a-z]", "^[A-Z]", "[:punct:]$", "r{2}", "[aeiou]b")
for( pat in patterns){
    print(str_extract(simple,pat))
    print(str_extract_all(simple,pat))
    print(str_detect(simple,pat))
}
## [1] "i"
## [[1]]
## [1] "i" "i" "e"
## 
## [1] TRUE
## [1] "is IS HOrribly typed! "
## [[1]]
## [1] "is IS HOrribly typed! "
## 
## [1] TRUE
## [1] "h"
## [[1]]
##  [1] "h" "i" "s" "r" "r" "i" "b" "l" "y" "t" "y" "p" "e" "d"
## 
## [1] TRUE
## [1] "T"
## [[1]]
## [1] "T"
## 
## [1] TRUE
## [1] NA
## [[1]]
## character(0)
## 
## [1] FALSE
## [1] "rr"
## [[1]]
## [1] "rr"
## 
## [1] TRUE
## [1] "ib"
## [[1]]
## [1] "ib"
## 
## [1] TRUE

String Splitting

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
str_split(simple, "i")
## [[1]]
## [1] "Th"          "s IS HOrr"   "bly typed! "
str_split(simple, "[iI]")
## [[1]]
## [1] "Th"          "s "          "S HOrr"      "bly typed! "
str_to_lower(simple); str_to_upper(simple); str_to_title(simple)
## [1] "this is horribly typed! "
## [1] "THIS IS HORRIBLY TYPED! "
## [1] "This Is Horribly Typed! "
str_trim(simple)
## [1] "This IS HOrribly typed!"
str_order(numbers, numeric=TRUE)
## [1] 4 1 3 2
str_length(numbers)
## [1] 2 2 2 1

Advanced operations

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
strings <- c("A word" , "Two words", "Not three words")
strings %>%
    str_split(boundary("word")) %>%
    str_detect("w")
## [1] TRUE TRUE TRUE

strings %>% str_split(boundary("word")) %>% map(str_detect,"w") ## [[1]] ## [1] FALSE TRUE ## ## [[2]] ## [1] TRUE TRUE ## ## [[3]] ## [1] FALSE FALSE TRUE

Factor operations

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
my_string <- c("red", "blue", "yellow", NA, "red")
fct_count(my_string)
## # A tibble: 4 x 2
##   f          n
##   <fct>  <int>
## 1 blue       1
## 2 red        2
## 3 yellow     1
## 4 <NA>       1
fct_explicit_na(my_string)
## [1] red       blue      yellow    (Missing) red      
## Levels: blue red yellow (Missing)
fct_infreq(my_string)
## [1] red    blue   yellow <NA>   red   
## Levels: red blue yellow
fct_lump(my_string, n =1)
## [1] red   Other Other <NA>  red  
## Levels: red Other
fct_lump(my_string, prop=0.25, other_level = "OTHER")
## [1] red   OTHER OTHER <NA>  red  
## Levels: red OTHER
fct_lump(my_string, n = -1, other_level =  "other")
## [1] red    blue   yellow <NA>   red   
## Levels: blue red yellow

Summarising data

The following snippets summarize the main syntax introduced in this chapter :

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
iris %>%
    summarise(mean(Sepal.Width), mean(Sepal.Length))
## # A tibble: 1 x 2
##   `mean(Sepal.Width)` `mean(Sepal.Length)`
##                 <dbl>                <dbl>
## 1                3.06                 5.84
iris %>%
    summarize(x= n(), y = n_distinct(Species), z = first(Species), z1 = last(Species))
## # A tibble: 1 x 4
##       x     y z      z1       
##   <int> <int> <chr>  <chr>    
## 1   150     3 Setosa Virginica
iris %>%
    summarise_all(n_distinct)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <int>       <int>        <int>       <int>   <int>
## 1           35          23           43          22       3
iris %>%
    summarise_if(is.numeric, mean)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         5.84        3.06         3.76        1.20
iris %>%
    summarise_at(vars(Sepal.Length:Petal.Length), mean)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1         5.84        3.06         3.76

iris %>% group_by(Species) %>% summarise(Avg.Sepal.Length = mean(Sepal.Length)) ## # A tibble: 3 x 2 ## Species Avg.Sepal.Length ## * <chr> <dbl> ## 1 Setosa 5.01 ## 2 Versicolor 5.94 ## 3 Virginica 6.59

iris %>% group_by(Species) %>% slice(1) ## # A tibble: 3 x 5 ## # Groups: Species [3] ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 5.1 3.5 1.4 0.2 Setosa ## 2 7 3.2 4.7 1.4 Versicolor ## 3 6.3 3.3 6 2.5 Virginica

iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>% head(1) ## # A tibble: 1 x 5 ## # Groups: Species [1] ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## <dbl> <dbl> <dbl> <dbl> <chr> ## 1 7.9 3.8 6.4 2 Virginica

iris %>% group_by(Species ) %>% mutate(SpeciesN = n()) %>% group_by(Sepal.Length) %>% mutate(Sepal.Length = n()) %>% head(1) ## # A tibble: 1 x 6 ## # Groups: Sepal.Length [1] ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesN ## <int> <dbl> <dbl> <dbl> <chr> <int> ## 1 9 3.5 1.4 0.2 Setosa 50

iris %>% group_by(Species ) %>% mutate(SpeciesN = n()) %>% group_by(Sepal.Length) %>% mutate(Sepal.Length = n()) %>% ungroup() %>% head(1) ## # A tibble: 1 x 6 ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesN ## <int> <dbl> <dbl> <dbl> <chr> <int> ## 1 9 3.5 1.4 0.2 Setosa 50

iris %>% group_by_all() %>% summarise(nrows = n()) %>% head(1) ## # A tibble: 1 x 6 ## # Groups: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width [1] ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species nrows ## <dbl> <dbl> <dbl> <dbl> <chr> <int> ## 1 4.3 3 1.1 0.1 Setosa 1

iris %>% group_by_if(~n_distinct(.) <30) %>% summarise(nrows = n()) %>% head(1) ## # A tibble: 1 x 4 ## # Groups: Sepal.Width, Petal.Width [1] ## Sepal.Width Petal.Width Species nrows ## <dbl> <dbl> <chr> <int> ## 1 2 1 Versicolor 1

iris %>% group_by_at(vars(starts_with("Sp"))) %>% summarise(nrows = n()) %>% head(1) ## # A tibble: 1 x 2 ## Species nrows ## <chr> <int> ## 1 Setosa 50

Produce aggregate rows using the summarise() function, or add aggregates to rows using mutate() function.

Combining datasets

The following snippets summarize the main syntax introduced in this chapter:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
irisLk <- tibble(species = c("setosa", "virginica", "VERSICOLOR"), colour = c("Red", "Blue", "Green"))

iris %>% left_join(irisLk, c("Species" = "species")) %>% dim() ## [1] 150 6 iris %>% inner_join(irisLk, c("Species" = "species")) %>% dim() ## [1] 0 6 iris %>% full_join(irisLk, c("Species" = "species")) %>% dim() ## [1] 153 6 iris %>% anti_join(irisLk, c("Species" = "species")) %>% dim() ## [1] 150 5

iris %>% sample_n(50) %>% union(sample_n(iris, 75)) %>% dim() ## [1] 97 5

iris %>% sample_n(50) %>% union_all(sample_n(iris, 75)) %>% dim() ## [1] 125 5

iris %>% sample_n(10) %>% select(-Species) %>% bind_rows(sample(iris, 50, replace=TRUE))%>% dim() ## [1] 160 54

iris1 <- iris[1:50, 1:4] iris2 <- iris[51:100, 1:4] to_merge <- list(iris1, iris2) to_merge %>% bind_rows() %>% nrow() ## [1] 100

Reshaping data

The following snippets summarize the main syntax introduced in this chapter:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
mtcars %>%
    rownames_to_column("car") ->
    mtcars2
mtcars2 %>%
    gather(measure, value, -car) %>%
    head(4)
##              car measure value
## 1      Mazda RX4     mpg  21.0
## 2  Mazda RX4 Wag     mpg  21.0
## 3     Datsun 710     mpg  22.8
## 4 Hornet 4 Drive     mpg  21.4

mtcars2 %>% gather(measure, value, -car) %>% group_by(measure) %>% mutate(diff = value - mean(value)) %>% head(4) ## # A tibble: 4 x 4 ## # Groups: measure [1] ## car measure value diff ## <chr> <chr> <dbl> <dbl> ## 1 Mazda RX4 mpg 21 0.909 ## 2 Mazda RX4 Wag mpg 21 0.909 ## 3 Datsun 710 mpg 22.8 2.71 ## 4 Hornet 4 Drive mpg 21.4 1.31

mtcars2 %>% group_by(cyl, am) %>% summarise(avg = mean(mpg)) %>% spread(am, avg) ## # A tibble: 3 x 3 ## # Groups: cyl [3] ## cyl 0 1 ## <dbl> <dbl> <dbl> ## 1 4 22.9 28.1 ## 2 6 19.1 20.6 ## 3 8 15.0 15.4

mtcars2 %>% group_by(cyl, am) %>% summarise(avg = mean(mpg)) %>% spread(am, avg, sep=":") ## # A tibble: 3 x 3 ## # Groups: cyl [3] ## cyl am:0 am:1 ## <dbl> <dbl> <dbl> ## 1 4 22.9 28.1 ## 2 6 19.1 20.6 ## 3 8 15.0 15.4

data.frame(measure = c("net 2017", "gross 2017")) %>% separate(measure, c("type", "year")) ## type year ## 1 net 2017 ## 2 gross 2017

data.frame(measure = c("net 2017", "gross 2017")) %>% separate(measure, c("type", "year"), convert=TRUE, remove=FALSE) ## measure type year ## 1 net 2017 net 2017 ## 2 gross 2017 gross 2017

Unpivot a dataset using the gather() function. Pivot a dataset using spread() function

Getting data out of R

This chapter goes through some basic commands from the packages, readr, readxl , writexl. If you have used base R, then I am sure you would have come across some aspects that were pretty painful to deal with, such as writing stringsAsFactors=FALSE for importing csv files and other default behavior that was slighltly frustrating. All such issues are taken care of, by the above mentioned packages. One will never go back to using base R functions, once one starts using the functions in these packages.

Putting it all together

This chapter gives a schematic diagram that details the steps to be performed for any data munging exercise using tidyverse framework.

flights data

How many flights arrived late each month?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
flights%>%
    filter(arr_delay>5) %>%
    group_by(month)%>%
    summarise(n())
## # A tibble: 12 x 2
##    month `n()`
##  * <int> <int>
##  1     1  8988
##  2     2  8119
##  3     3  9033
##  4     4 10544
##  5     5  8490
##  6     6 10739
##  7     7 11518
##  8     8  9649
##  9     9  5347
## 10    10  7628
## 11    11  7485
## 12    12 12291

What percentage of traffic did each carrier represent, by month?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
flights%>%
    group_by(month, carrier)%>%
    summarise(total=n()) %>%
    ungroup()%>%
    group_by(month)%>%
    mutate(stotal = sum(total), pct  = 100*total/stotal)%>%
    ungroup()%>%
    select(month,carrier, pct)%>%
    spread(carrier,pct)
## # A tibble: 12 x 17
##    month  `9E`    AA    AS    B6    DL    EV    F9    FL     HA    MQ       OO
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
##  1     1  5.83 10.3  0.230  16.4  13.7  15.4 0.218 1.21  0.115   8.41  0.00370
##  2     2  5.85 10.1  0.224  16.4  13.8  15.3 0.196 1.19  0.112   8.19 NA      
##  3     3  5.64  9.67 0.215  16.5  14.5  16.4 0.198 1.10  0.108   7.82 NA      
##  4     4  5.33  9.61 0.212  15.9  14.4  16.1 0.201 1.10  0.106   7.80 NA      
##  5     5  5.08  9.73 0.215  15.9  14.2  16.7 0.201 1.13  0.108   7.93 NA      
##  6     6  5.09  9.76 0.212  16.4  14.6  15.8 0.195 0.892 0.106   7.71  0.00708
##  7     7  5.08  9.79 0.211  16.9  14.4  15.8 0.197 0.894 0.105   7.68 NA      
##  8     8  4.96  9.74 0.211  16.9  14.7  15.6 0.188 0.897 0.106   7.72  0.0136 
##  9     9  5.58  9.48 0.218  15.6  14.1  17.1 0.210 0.925 0.0907  8.00  0.0725 
## 10    10  5.79  9.40 0.215  15.1  14.2  17.0 0.197 0.817 0.0727  7.71 NA      
## 11    11  5.85  9.45 0.191  15.7  14.1  16.4 0.224 0.741 0.0917  7.54  0.0183 
## 12    12  5.80  9.61 0.192  16.9  14.5  15.3 0.217 0.757 0.0995  7.60 NA      
## # ... with 5 more variables: UA <dbl>, US <dbl>, VX <dbl>, WN <dbl>, YV <dbl>

What was the latest flight to depart each month?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
flights%>%
    group_by(month)%>%
    arrange(desc(dep_time))%>%
    slice_head(n=1)
## # A tibble: 12 x 19
## # Groups:   month [12]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     7     2359           2359         0      506            437
##  2  2013     2     7     2400           2359         1      432            436
##  3  2013     3    15     2400           2359         1      324            338
##  4  2013     4     2     2400           2359         1      339            343
##  5  2013     5    21     2400           2359         1      339            350
##  6  2013     6    17     2400           2145       135      102           2315
##  7  2013     7     7     2400           1950       250      107           2130
##  8  2013     8    10     2400           2245        75      110              1
##  9  2013     9     2     2400           2359         1      411            340
## 10  2013    10    30     2400           2359         1      327            337
## 11  2013    11    27     2400           2359         1      515            445
## 12  2013    12     5     2400           2359         1      427            440
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

gapminder data

Is there data for every key year for all countries?

1
2
3
gapminder%>%group_by(year, country)%>%summarise(ct = n())%>%ungroup() %>%
    spread(country,ct )%>%select(-year)%>%summarise_all(sum)%>%filter_all(all_vars(.!=12))%>%nrow()
## [1] 0

Provide an indexed GDP based on the earliest data###

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
index_vals <- gapminder %>% filter(year == min(year)) %>%
    select(year, base=gdpPercap)
gapminder%>%inner_join(index_vals)%>%mutate(indexedgdp = gdpPercap/base)
## # A tibble: 20,164 x 8
##    country     continent  year lifeExp     pop gdpPercap   base indexedgdp
##    <fct>       <fct>     <int>   <dbl>   <int>     <dbl>  <dbl>      <dbl>
##  1 Afghanistan Asia       1952    28.8 8425333      779.   779.     1     
##  2 Afghanistan Asia       1952    28.8 8425333      779.  1601.     0.487 
##  3 Afghanistan Asia       1952    28.8 8425333      779.  2449.     0.318 
##  4 Afghanistan Asia       1952    28.8 8425333      779.  3521.     0.221 
##  5 Afghanistan Asia       1952    28.8 8425333      779.  5911.     0.132 
##  6 Afghanistan Asia       1952    28.8 8425333      779. 10040.     0.0776
##  7 Afghanistan Asia       1952    28.8 8425333      779.  6137.     0.127 
##  8 Afghanistan Asia       1952    28.8 8425333      779.  9867.     0.0790
##  9 Afghanistan Asia       1952    28.8 8425333      779.   684.     1.14  
## 10 Afghanistan Asia       1952    28.8 8425333      779.  8343.     0.0934
## # ... with 20,154 more rows

Takeaway

This book provides a very quick introduction to tidyverse framework. Along the way, it highlight the main packages and functions that one tends to use in a data munging exercise. It might be a useful read for anyone wanting a crash course on tidyverse framework.