This blog post summarizes the book titled Data Manipulation in R, written by
Stephanie Locke.

The author clearly states that this book is for beginners. Well I would say that
even if you  have good experience with  R but have not worked  with the recent
tidyverse package, then this book is worth the read. In any case, this book is
super short. An experienced R user can probably read it in a few hours.
Libraries used in the book
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
 | library(tidyverse)
library(ggplot2movies)
library(nycflights13)
library(odbc)
library(writexl)
library(openxlsx)
library(gapminder)
library(readr)
library(readxl)
library(lubridate)
library(stringr)
 | 
 
Getting Data
One can use readr , readxl and purr packages to efficiently read data from
multiple formats and combine them together.
| 1
2
3
4
5
6
7
8
 | iris <- read_csv('./data/iris.csv')
iris <- iris%>%rename(Sepal.Length= sepal.length,
                   Sepal.Width = sepal.width,
                   Petal.Length = petal.length,
                   Petal.Width = petal.width,
                   Species = variety)
iris2 <- read_excel('./data/iris.xlsx')
iris3 <- map_df(list.files("./data/", pattern = "sample", full.names = TRUE), read_csv)
 | 
 
There are other packages being actively developed by R community that follow
the tidyverse philosophy.
Data Pipelines
This chapter introduced piping, an operator that allows us to create series of
connected transformations that goes from the source to the destination. Instead
of nesting functions you hook them together like pieces of plumbing so that your
data passes through them and changes as it goes.
The pipe operator was first designed and implemented in R by Stefan Milton
Bache in the package magrittr. Created in 2014, it is now very heavily used
and its widest option has been within the tidyverse.
| 1
2
3
4
 | letters%>%toupper()%>%length()
## [1] 26
iris%>%head()%>%nrow()
## [1] 6
 | 
 
An interesting tidbit that I have learnt from the chapter is the assignment. I
have always used the assignment as follows
| 1
 | temp <- iris %>% head()
 | 
 
But the author seems to be making a nice point where since the result of all the
operations are assigned to a variable, it is better to use RHS operator
| 1
 | iris %>% head() -> temp
 | 
 
Whenever you use the functions with in a tidyverse framework, it is always
better to know the positional argument of data. By default, pipelines will put
our object as the first input to a function. To work with jumbled functions, we
need a way to tell our pipelines where to use our input instead of relying on
defaults. The way we can do that is with a period.
Filtering Columns
This chapter introduces a bunch of different ways to select features from a tibble
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
 | 
iris%>%select(Species)%>%head(1)
## # A tibble: 1 x 1
##   Species
##   <chr>  
## 1 Setosa
iris%>%select(-Species)%>%head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
iris%>%select(Sepal.Length:Petal.Length)%>%head(1)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1          5.1         3.5          1.4
iris%>%select(-(Sepal.Length:Petal.Length))%>%head(1)
## # A tibble: 1 x 2
##   Petal.Width Species
##         <dbl> <chr>  
## 1         0.2 Setosa
iris%>%select(starts_with('s'))%>%head(1)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Species
##          <dbl>       <dbl> <chr>  
## 1          5.1         3.5 Setosa
iris%>%select(ends_with('s'))%>%head(1)
## # A tibble: 1 x 1
##   Species
##   <chr>  
## 1 Setosa
iris%>%select(contains('Length'))%>%head(1)
## # A tibble: 1 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
iris%>%select_if(is.numeric) %>%head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.1         3.5          1.4         0.2
iris%>%select_if(~is.numeric(.) & n_distinct(.)>30) %>%head(1)
## # A tibble: 1 x 2
##   Sepal.Length Petal.Length
##          <dbl>        <dbl>
## 1          5.1          1.4
 | 
 
Filtering Rows
This chapter introduces basic syntax around using slice, filter_if ,
filter_all and filter_at functions.
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
 | iris%>%slice(1:2)
## # A tibble: 2 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa 
## 2          4.9         3            1.4         0.2 Setosa
iris%>%filter(Species=='Virginica') %>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%filter(Species == 'Virginica', Sepal.Length > mean(Sepal.Length))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%filter(Species == 'Virginica'|Sepal.Length > mean(Sepal.Length))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
## 1            7         3.2          4.7         1.4 Versicolor
iris%>%filter_all(any_vars(.>7.5))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%select_if(is.numeric)%>%filter_all(any_vars(abs(. - mean(.) ) > 2*sd(.))) %>% head(1)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1          5.8           4          1.2         0.2
iris%>%filter_if(~is.numeric(.), any_vars(.<mean(.)))%>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%filter_at(vars(ends_with("Length")), any_vars(.<mean(.)))%>% head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
 | 
 
Working with names
This chapter introduces some of the functions relating to renaming rows and
columns of a tibble
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
 | iris%>% select_all(str_to_lower)%>%head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width petal.length petal.width species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%rename_if(is.numeric, str_to_lower)%>% head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width petal.length petal.width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
iris%>%rename_at(vars(starts_with('S')), str_to_lower) %>%head(1)
## # A tibble: 1 x 5
##   sepal.length sepal.width Petal.Length Petal.Width species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa
mtcars%>%rownames_to_column("cars") %>%head(1)
##        cars mpg cyl disp  hp drat   wt  qsec vs am gear carb
## 1 Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
 | 
 
Re-arranging your data
The following snippets summarize the main syntax introduced in this chapter:
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
 | iris%>%arrange(desc(Species))%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris%>%arrange(desc(Species), Sepal.Length)%>%head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          4.9         2.5          4.5         1.7 Virginica
iris %>%
    arrange_all() %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          4.3           3          1.1         0.1 Setosa
iris %>%
    arrange_if(is.character, desc) %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          6.3         3.3            6         2.5 Virginica
iris %>%
    arrange_at(vars(Species, starts_with('P')), desc) %>%
    head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          7.7         2.6          6.9         2.3 Virginica
iris %>%
    select(sort(tidyselect::peek_vars())) %>%
    head(1)
## # A tibble: 1 x 5
##   Petal.Length Petal.Width Sepal.Length Sepal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          1.4         0.2          5.1         3.5 Setosa
 | 
 
Changing your Data
The following snippets summarize the main syntax introduced in this chapter:
|   1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
 | iris %>%
    mutate(x = Sepal.Width * Sepal.Length) %>%
    head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species     x
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <dbl>
## 1          5.1         3.5          1.4         0.2 Setosa   17.8
iris %>%
mutate(Sepal.Width=NULL) %>%
head(1)
## # A tibble: 1 x 4
##   Sepal.Length Petal.Length Petal.Width Species
##          <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1          1.4         0.2 Setosa iris %>%
mutate(id = row_number()) %>%
head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species    id
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <int>
## 1          5.1         3.5          1.4         0.2 Setosa      1 iris %>%
mutate(size=  case_when(
Sepal.Length<mean(Sepal.Length) ~ "s",
Sepal.Length>mean(Sepal.Length) ~ "l",
TRUE ~ "m"
)) %>%
head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species size 
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <chr>
## 1          5.1         3.5          1.4         0.2 Setosa  s iris %>%
mutate(Species = case_when(
Species == 'setosa'  ~ toupper(Species),
TRUE ~ as.character(Species)
)) %>%
head(1)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 Setosa iris %>%
mutate(runagg = cumall(Sepal.Length < mean(Sepal.Length))) %>%
head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species runagg
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <lgl> 
## 1          5.1         3.5          1.4         0.2 Setosa  TRUE iris %>%
transmute(Sepal.Width = floor(Sepal.Width),
Species == case_when(
Species == 'setosa' ~ toupper(Species),
TRUE ~ as.character(Species))) %>%
head(1)
## # A tibble: 1 x 2
##   Sepal.Width ==...##         <dbl> <lgl>  
## 1           3 TRUE iris %>%
mutate_all(as.character)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##    <chr>        <chr>       <chr>        <chr>       <chr>  
##  1 5.1          3.5         1.4          0.2         Setosa 
##  2 4.9          3           1.4          0.2         Setosa 
##  3 4.7          3.2         1.3          0.2         Setosa 
##  4 4.6          3.1         1.5          0.2         Setosa 
##  5 5            3.6         1.4          0.2         Setosa 
##  6 5.4          3.9         1.7          0.4         Setosa 
##  7 4.6          3.4         1.4          0.3         Setosa 
##  8 5            3.4         1.5          0.2         Setosa 
##  9 4.4          2.9         1.4          0.2         Setosa 
## 10 4.9          3.1         1.5          0.1         Setosa 
## # … with 140 more rows iris %>%
mutate_if(is.numeric,~. +rnorm(.))
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1         5.69        2.02        1.89      -0.526  Setosa 
##  2         3.35        2.56        1.37      -0.285  Setosa 
##  3         5.79        2.36        0.739     -0.134  Setosa 
##  4         3.82        1.41        1.80      -1.20   Setosa 
##  5         5.53        2.16        1.45       0.262  Setosa 
##  6         4.57        4.17       -0.261      0.918  Setosa 
##  7         5.63        2.62        1.65       0.697  Setosa 
##  8         4.20        5.27       -0.121     -0.808  Setosa 
##  9         2.40        3.58        0.253      0.0210 Setosa 
## 10         5.14        4.42        0.695     -1.64   Setosa 
## # … with 140 more rows iris %>%
mutate_at(vars(Sepal.Width:Petal.Width), ~ . +rnorm(.))
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1        4.60       0.0330     -2.38   Setosa 
##  2          4.9        2.82       0.946       2.72   Setosa 
##  3          4.7        4.54       0.371       0.645  Setosa 
##  4          4.6        4.00       1.16        1.39   Setosa 
##  5          5          4.28       1.01        0.679  Setosa 
##  6          5.4        1.55       1.70       -0.0684 Setosa 
##  7          4.6        2.85       0.196      -0.0835 Setosa 
##  8          5          3.59       0.379      -0.448  Setosa 
##  9          4.4        1.64       1.65        0.848  Setosa 
## 10          4.9        3.61       1.28       -0.817  Setosa 
## # … with 140 more rows
 | 
 
Working with dates and times
The following snippets summarize the main syntax introduced in this chapter:
relating to lubridate package
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
 | test_date <- ymd_hms("20210505140101")
yday(test_date);mday(test_date);wday(test_date);
## [1] 125
## [1] 5
## [1] 4
hour(test_date);minute(test_date);second(test_date)
## [1] 14
## [1] 1
## [1] 1
test_date + years(1)
## [1] "2022-05-05 14:01:01 UTC"
test_date + months(1)
## [1] "2021-06-05 14:01:01 UTC"
test_date + days(1)
## [1] "2021-05-06 14:01:01 UTC"
test_date + minutes(1)
## [1] "2021-05-05 14:02:01 UTC"
format(test_date, c("%Y","%F","%z"))
## [1] "2021"       "2021-05-05" "+0000"
floor_date(test_date, "month");ceiling_date(test_date, 'month')
## [1] "2021-05-01 UTC"
## [1] "2021-06-01 UTC"
test_date + months(1:4)
## [1] "2021-06-05 14:01:01 UTC" "2021-07-05 14:01:01 UTC"
## [3] "2021-08-05 14:01:01 UTC" "2021-09-05 14:01:01 UTC"
test_date  %m+% months(1:4)
## [1] "2021-06-05 14:01:01 UTC" "2021-07-05 14:01:01 UTC"
## [3] "2021-08-05 14:01:01 UTC" "2021-09-05 14:01:01 UTC"
test_date %within% interval(ymd('20190101'), ymd('20220101'))
## [1] TRUE
 | 
 
Working with text
The following snippets summarize the main syntax introduced in this chapter :
relating to stringr , purrr and forcats  package
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
 | simple <- "This IS HOrribly typed! "
numbers <- c('02', '11', '10', '1')
str_detect(simple, 'typed')
## [1] TRUE
str_extract(simple, 'typed.*')
## [1] "typed! "
str_replace(simple, 'typed', 'written')
## [1] "This IS HOrribly written! "
str_replace_all(simple,'r', 'b')
## [1] "This IS HObbibly typed! "
str_count(simple, "[iI]")
## [1] 3
patterns <- c("[aeiou]","[aeiou].*$", "[a-z]", "^[A-Z]", "[:punct:]$", "r{2}", "[aeiou]b")
for( pat in patterns){
    print(str_extract(simple,pat))
    print(str_extract_all(simple,pat))
    print(str_detect(simple,pat))
}
## [1] "i"
## [[1]]
## [1] "i" "i" "e"
## 
## [1] TRUE
## [1] "is IS HOrribly typed! "
## [[1]]
## [1] "is IS HOrribly typed! "
## 
## [1] TRUE
## [1] "h"
## [[1]]
##  [1] "h" "i" "s" "r" "r" "i" "b" "l" "y" "t" "y" "p" "e" "d"
## 
## [1] TRUE
## [1] "T"
## [[1]]
## [1] "T"
## 
## [1] TRUE
## [1] NA
## [[1]]
## character(0)
## 
## [1] FALSE
## [1] "rr"
## [[1]]
## [1] "rr"
## 
## [1] TRUE
## [1] "ib"
## [[1]]
## [1] "ib"
## 
## [1] TRUE
 | 
 
String Splitting
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
 | str_split(simple, "i")
## [[1]]
## [1] "Th"          "s IS HOrr"   "bly typed! "
str_split(simple, "[iI]")
## [[1]]
## [1] "Th"          "s "          "S HOrr"      "bly typed! "
str_to_lower(simple); str_to_upper(simple); str_to_title(simple)
## [1] "this is horribly typed! "
## [1] "THIS IS HORRIBLY TYPED! "
## [1] "This Is Horribly Typed! "
str_trim(simple)
## [1] "This IS HOrribly typed!"
str_order(numbers, numeric=TRUE)
## [1] 4 1 3 2
str_length(numbers)
## [1] 2 2 2 1
 | 
 
Advanced operations
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
 | strings <- c("A word" , "Two words", "Not three words")
strings %>%
    str_split(boundary("word")) %>%
    str_detect("w")
## [1] TRUE TRUE TRUE
strings %>%
str_split(boundary("word")) %>%
map(str_detect,"w")
## [[1]]
## [1] FALSE  TRUE
## 
## [[2]]
## [1] TRUE TRUE
## 
## [[3]]
## [1] FALSE FALSE  TRUE
 | 
 
Factor operations
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
 | my_string <- c("red", "blue", "yellow", NA, "red")
fct_count(my_string)
## # A tibble: 4 x 2
##   f          n
##   <fct>  <int>
## 1 blue       1
## 2 red        2
## 3 yellow     1
## 4 <NA>       1
fct_explicit_na(my_string)
## [1] red       blue      yellow    (Missing) red      
## Levels: blue red yellow (Missing)
fct_infreq(my_string)
## [1] red    blue   yellow <NA>   red   
## Levels: red blue yellow
fct_lump(my_string, n =1)
## [1] red   Other Other <NA>  red  
## Levels: red Other
fct_lump(my_string, prop=0.25, other_level = "OTHER")
## [1] red   OTHER OTHER <NA>  red  
## Levels: red OTHER
fct_lump(my_string, n = -1, other_level =  "other")
## [1] red    blue   yellow <NA>   red   
## Levels: blue red yellow
 | 
 
Summarising data
The following snippets summarize the main syntax introduced in this chapter :
|   1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
 | iris %>%
    summarise(mean(Sepal.Width), mean(Sepal.Length))
## # A tibble: 1 x 2
##   `mean(Sepal.Width)` `mean(Sepal.Length)`
##                 <dbl>                <dbl>
## 1                3.06                 5.84
iris %>%
    summarize(x= n(), y = n_distinct(Species), z = first(Species), z1 = last(Species))
## # A tibble: 1 x 4
##       x     y z      z1       
##   <int> <int> <chr>  <chr>    
## 1   150     3 Setosa Virginica
iris %>%
    summarise_all(n_distinct)
## # A tibble: 1 x 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <int>       <int>        <int>       <int>   <int>
## 1           35          23           43          22       3
iris %>%
    summarise_if(is.numeric, mean)
## # A tibble: 1 x 4
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
##          <dbl>       <dbl>        <dbl>       <dbl>
## 1         5.84        3.06         3.76        1.20
iris %>%
    summarise_at(vars(Sepal.Length:Petal.Length), mean)
## # A tibble: 1 x 3
##   Sepal.Length Sepal.Width Petal.Length
##          <dbl>       <dbl>        <dbl>
## 1         5.84        3.06         3.76
iris %>%
group_by(Species) %>%
summarise(Avg.Sepal.Length = mean(Sepal.Length))
## # A tibble: 3 x 2
##   Species    Avg.Sepal.Length
## * <chr>                 <dbl>
## 1 Setosa                 5.01
## 2 Versicolor             5.94
## 3 Virginica              6.59 iris %>%
group_by(Species) %>%
slice(1)
## # A tibble: 3 x 5
## # Groups:   Species [3]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>     
## 1          5.1         3.5          1.4         0.2 Setosa    
## 2          7           3.2          4.7         1.4 Versicolor
## 3          6.3         3.3          6           2.5 Virginica iris %>%
group_by(Species) %>%
arrange(desc(Sepal.Length)) %>%
head(1)
## # A tibble: 1 x 5
## # Groups:   Species [1]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>    
## 1          7.9         3.8          6.4           2 Virginica iris %>%
group_by(Species ) %>%
mutate(SpeciesN = n()) %>%
group_by(Sepal.Length) %>%
mutate(Sepal.Length = n()) %>%
head(1)
## # A tibble: 1 x 6
## # Groups:   Sepal.Length [1]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesN
##          <int>       <dbl>        <dbl>       <dbl> <chr>      <int>
## 1            9         3.5          1.4         0.2 Setosa        50 iris %>%
group_by(Species ) %>%
mutate(SpeciesN = n()) %>%
group_by(Sepal.Length) %>%
mutate(Sepal.Length = n()) %>%
ungroup() %>%
head(1)
## # A tibble: 1 x 6
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SpeciesN
##          <int>       <dbl>        <dbl>       <dbl> <chr>      <int>
## 1            9         3.5          1.4         0.2 Setosa        50 iris %>%
group_by_all() %>%
summarise(nrows = n()) %>%
head(1)
## # A tibble: 1 x 6
## # Groups:   Sepal.Length, Sepal.Width, Petal.Length, Petal.Width [1]
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species nrows
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>   <int>
## 1          4.3           3          1.1         0.1 Setosa      1 iris %>%
group_by_if(~n_distinct(.) <30) %>%
summarise(nrows = n()) %>%
head(1)
## # A tibble: 1 x 4
## # Groups:   Sepal.Width, Petal.Width [1]
##   Sepal.Width Petal.Width Species    nrows
##         <dbl>       <dbl> <chr>      <int>
## 1           2           1 Versicolor     1 iris %>%
group_by_at(vars(starts_with("Sp"))) %>%
summarise(nrows = n()) %>%
head(1)
## # A tibble: 1 x 2
##   Species nrows
##   <chr>   <int>
## 1 Setosa     50
 | 
 
Produce aggregate rows using the summarise()  function, or add aggregates to
rows using mutate()  function.
Combining datasets
The following snippets summarize the main syntax introduced in this chapter:
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
 | irisLk <- tibble(species = c("setosa", "virginica", "VERSICOLOR"), colour = c("Red", "Blue", "Green"))
iris %>%
left_join(irisLk, c("Species" = "species")) %>%
dim()
## [1] 150   6
iris %>%
inner_join(irisLk, c("Species" = "species")) %>%
dim()
## [1] 0 6
iris %>%
full_join(irisLk, c("Species" = "species")) %>%
dim()
## [1] 153   6
iris %>%
anti_join(irisLk, c("Species" = "species")) %>%
dim()
## [1] 150   5 iris %>%
sample_n(50) %>%
union(sample_n(iris, 75)) %>%
dim()
## [1] 97  5 iris %>%
sample_n(50) %>%
union_all(sample_n(iris, 75)) %>%
dim()
## [1] 125   5 iris %>%
sample_n(10) %>%
select(-Species) %>%
bind_rows(sample(iris, 50, replace=TRUE))%>%
dim()
## [1] 160  54 iris1 <- iris[1:50, 1:4]
iris2 <- iris[51:100, 1:4]
to_merge <- list(iris1, iris2)
to_merge %>%
bind_rows() %>%
nrow()
## [1] 100
 | 
 
Reshaping data
The following snippets summarize the main syntax introduced in this chapter:
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
 | mtcars %>%
    rownames_to_column("car") ->
    mtcars2
mtcars2 %>%
    gather(measure, value, -car) %>%
    head(4)
##              car measure value
## 1      Mazda RX4     mpg  21.0
## 2  Mazda RX4 Wag     mpg  21.0
## 3     Datsun 710     mpg  22.8
## 4 Hornet 4 Drive     mpg  21.4
mtcars2 %>%
gather(measure, value, -car) %>%
group_by(measure) %>%
mutate(diff = value - mean(value)) %>%
head(4)
## # A tibble: 4 x 4
## # Groups:   measure [1]
##   car            measure value  diff
##   <chr>          <chr>   <dbl> <dbl>
## 1 Mazda RX4      mpg      21   0.909
## 2 Mazda RX4 Wag  mpg      21   0.909
## 3 Datsun 710     mpg      22.8 2.71 
## 4 Hornet 4 Drive mpg      21.4 1.31 mtcars2 %>%
group_by(cyl, am) %>%
summarise(avg = mean(mpg)) %>%
spread(am, avg)
## # A tibble: 3 x 3
## # Groups:   cyl [3]
##     cyl   01##   <dbl> <dbl> <dbl>
## 1     4  22.9  28.1
## 2     6  19.1  20.6
## 3     8  15.0  15.4 mtcars2 %>%
group_by(cyl, am) %>%
summarise(avg = mean(mpg)) %>%
spread(am, avg, sep=":")
## # A tibble: 3 x 3
## # Groups:   cyl [3]
##     cyl am:0am:1##   <dbl>  <dbl>  <dbl>
## 1     4   22.9   28.1
## 2     6   19.1   20.6
## 3     8   15.0   15.4 data.frame(measure = c("net 2017", "gross 2017")) %>%
separate(measure, c("type", "year"))
##    type year
## 1   net 2017
## 2 gross 2017 data.frame(measure = c("net 2017", "gross 2017")) %>%
separate(measure, c("type", "year"), convert=TRUE, remove=FALSE)
##      measure  type year
## 1   net 2017   net 2017
## 2 gross 2017 gross 2017
 | 
 
Unpivot a dataset using the gather()  function. Pivot a dataset using
spread()  function
Getting data out of R
This  chapter goes  through  some  basic commands  from  the packages,  readr,
readxl , writexl. If you have used base R, then I am sure you would have
come across some aspects that were pretty  painful to deal with, such as writing
stringsAsFactors=FALSE for  importing csv  files and other  default behavior
that was slighltly frustrating. All such issues  are taken care of, by the above
mentioned packages. One  will never go back to using  base R functions, once
one starts using the functions in these packages.
Putting it all together
This chapter gives a schematic diagram that details the steps to be performed
for any data munging exercise using tidyverse framework.

flights data
How many flights arrived late each month?
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
 | flights%>%
    filter(arr_delay>5) %>%
    group_by(month)%>%
    summarise(n())
## # A tibble: 12 x 2
##    month `n()`
##  * <int> <int>
##  1     1  8988
##  2     2  8119
##  3     3  9033
##  4     4 10544
##  5     5  8490
##  6     6 10739
##  7     7 11518
##  8     8  9649
##  9     9  5347
## 10    10  7628
## 11    11  7485
## 12    12 12291
 | 
 
What percentage of traffic did each carrier represent, by month?
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
 | flights%>%
    group_by(month, carrier)%>%
    summarise(total=n()) %>%
    ungroup()%>%
    group_by(month)%>%
    mutate(stotal = sum(total), pct  = 100*total/stotal)%>%
    ungroup()%>%
    select(month,carrier, pct)%>%
    spread(carrier,pct)
## # A tibble: 12 x 17
##    month  `9E`    AA    AS    B6    DL    EV    F9    FL     HA    MQ       OO
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>    <dbl>
##  1     1  5.83 10.3  0.230  16.4  13.7  15.4 0.218 1.21  0.115   8.41  0.00370
##  2     2  5.85 10.1  0.224  16.4  13.8  15.3 0.196 1.19  0.112   8.19 NA      
##  3     3  5.64  9.67 0.215  16.5  14.5  16.4 0.198 1.10  0.108   7.82 NA      
##  4     4  5.33  9.61 0.212  15.9  14.4  16.1 0.201 1.10  0.106   7.80 NA      
##  5     5  5.08  9.73 0.215  15.9  14.2  16.7 0.201 1.13  0.108   7.93 NA      
##  6     6  5.09  9.76 0.212  16.4  14.6  15.8 0.195 0.892 0.106   7.71  0.00708
##  7     7  5.08  9.79 0.211  16.9  14.4  15.8 0.197 0.894 0.105   7.68 NA      
##  8     8  4.96  9.74 0.211  16.9  14.7  15.6 0.188 0.897 0.106   7.72  0.0136 
##  9     9  5.58  9.48 0.218  15.6  14.1  17.1 0.210 0.925 0.0907  8.00  0.0725 
## 10    10  5.79  9.40 0.215  15.1  14.2  17.0 0.197 0.817 0.0727  7.71 NA      
## 11    11  5.85  9.45 0.191  15.7  14.1  16.4 0.224 0.741 0.0917  7.54  0.0183 
## 12    12  5.80  9.61 0.192  16.9  14.5  15.3 0.217 0.757 0.0995  7.60 NA      
## # ... with 5 more variables: UA <dbl>, US <dbl>, VX <dbl>, WN <dbl>, YV <dbl>
 | 
 
What was the latest flight to depart each month?
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
 | flights%>%
    group_by(month)%>%
    arrange(desc(dep_time))%>%
    slice_head(n=1)
## # A tibble: 12 x 19
## # Groups:   month [12]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     7     2359           2359         0      506            437
##  2  2013     2     7     2400           2359         1      432            436
##  3  2013     3    15     2400           2359         1      324            338
##  4  2013     4     2     2400           2359         1      339            343
##  5  2013     5    21     2400           2359         1      339            350
##  6  2013     6    17     2400           2145       135      102           2315
##  7  2013     7     7     2400           1950       250      107           2130
##  8  2013     8    10     2400           2245        75      110              1
##  9  2013     9     2     2400           2359         1      411            340
## 10  2013    10    30     2400           2359         1      327            337
## 11  2013    11    27     2400           2359         1      515            445
## 12  2013    12     5     2400           2359         1      427            440
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
 | 
 
gapminder data
Is there data for every key year for all countries?
| 1
2
3
 | gapminder%>%group_by(year, country)%>%summarise(ct = n())%>%ungroup() %>%
    spread(country,ct )%>%select(-year)%>%summarise_all(sum)%>%filter_all(all_vars(.!=12))%>%nrow()
## [1] 0
 | 
 
Provide an indexed GDP based on the earliest data###
|  1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
 | index_vals <- gapminder %>% filter(year == min(year)) %>%
    select(year, base=gdpPercap)
gapminder%>%inner_join(index_vals)%>%mutate(indexedgdp = gdpPercap/base)
## # A tibble: 20,164 x 8
##    country     continent  year lifeExp     pop gdpPercap   base indexedgdp
##    <fct>       <fct>     <int>   <dbl>   <int>     <dbl>  <dbl>      <dbl>
##  1 Afghanistan Asia       1952    28.8 8425333      779.   779.     1     
##  2 Afghanistan Asia       1952    28.8 8425333      779.  1601.     0.487 
##  3 Afghanistan Asia       1952    28.8 8425333      779.  2449.     0.318 
##  4 Afghanistan Asia       1952    28.8 8425333      779.  3521.     0.221 
##  5 Afghanistan Asia       1952    28.8 8425333      779.  5911.     0.132 
##  6 Afghanistan Asia       1952    28.8 8425333      779. 10040.     0.0776
##  7 Afghanistan Asia       1952    28.8 8425333      779.  6137.     0.127 
##  8 Afghanistan Asia       1952    28.8 8425333      779.  9867.     0.0790
##  9 Afghanistan Asia       1952    28.8 8425333      779.   684.     1.14  
## 10 Afghanistan Asia       1952    28.8 8425333      779.  8343.     0.0934
## # ... with 20,154 more rows
 | 
 
Takeaway
This book provides a very quick introduction to tidyverse framework. Along the
way, it highlight the main packages and functions that one tends to use in a
data munging exercise. It might be a useful read for anyone wanting a crash
course on tidyverse framework.