R Cookbook: Chapter 1 - Chapter 6

I am going over the R Cookbook mainly to review the syntax. It has been almost two months since I wrote any R program, so my hands are rusty, and that is the reason for going over the cookbook.

I will write down whatever I find new, along with good reminders of things I have forgotten about R.

R Cookbook - Data Transformations

I had forgotten about the split function, which breaks a vector into groups defined by a factor.

> library(MASS)
> x <- with(Cars93, split(MPG.city, Origin))
> sapply(x, median)
    USA non-USA
     20      22
> lapply(x, median)
$USA
[1] 20

$`non-USA`
[1] 22

The difference between sapply and lapply: the former simplifies its result to a vector (or a matrix) where it can, whereas the latter always returns a list.
If the called function returns a structured object, always use lapply.

> z <- list(x = runif(100), y = runif(100), z = runif(100))
> lapply(z, t.test)
$x

        One Sample t-test

data:  X[[1L]]
t = 16.8501, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4410959 0.5588463
sample estimates:
mean of x
0.4999711

$y

        One Sample t-test

data:  X[[2L]]
t = 17.0593, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4360076 0.5507845
sample estimates:
mean of x
 0.493396

$z

        One Sample t-test

data:  X[[3L]]
t = 19.8378, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4915784 0.6008450
sample estimates:
mean of x
0.5462117
> sapply(z, t.test)
            x                   y                   z
statistic   16.85007            17.05927            19.83777
parameter   99                  99                  99
p.value     7.697001e-31        3.092998e-31        2.876633e-36
conf.int    Numeric,2           Numeric,2           Numeric,2
estimate    0.4999711           0.493396            0.5462117
null.value  0                   0                   0
alternative "two.sided"         "two.sided"         "two.sided"
method      "One Sample t-test" "One Sample t-test" "One Sample t-test"
data.name   "X[[1L]]"           "X[[2L]]"           "X[[3L]]"
> batches = data.frame(f = as.factor(sample.int(10, 20, replace = T)),
     v1 = runif(20))
> sapply(batches, class)
        f        v1
 "factor" "numeric"
> lapply(batches, class)
$f
[1] "factor"

$v1
[1] "numeric"
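
To see why lapply is the safer choice for structured results such as the t.test output above, here is a small sketch of my own (not from the book). It keeps the full test objects in a list and then pulls out individual pieces with sapply. The set.seed call and the names tests, p_values and conf_ints are my assumptions for illustration.

```r
set.seed(1)  # assumption: fixed seed so the run is repeatable
z <- list(x = runif(100), y = runif(100), z = runif(100))

# lapply keeps every t.test result intact as an "htest" object
tests <- lapply(z, t.test)

# sapply is then handy for extracting one simple piece from each object
p_values  <- sapply(tests, function(tt) tt$p.value)   # named numeric vector
conf_ints <- sapply(tests, function(tt) tt$conf.int)  # 2 x 3 matrix, one column per test

print(p_values)
print(conf_ints)
```

Each element of tests still prints as a complete t-test report, which is lost when sapply flattens the results into a matrix of list cells.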

One very useful feature of sapply is that any extra arguments placed after the function are passed on to that function on every call.

> x <- data.frame(matrix(rnorm(1000), 200, 5))
> head(x)
          X1         X2         X3         X4         X5
1  0.2462347 -0.2475659  1.0894281 -0.9770150  0.2906779
2  1.7860306  0.2282453 -1.4134570 -0.8327972 -0.6294957
3 -0.1900189  2.0704738  1.2697864 -1.0238222  0.8206145
4 -0.9786056  0.2990741 -1.8738614 -0.8740837 -0.5114038
5 -0.7924509  0.1239347  0.1661355 -0.6518038 -1.2335057
6 -1.4474740 -1.9331664  0.2592431 -0.4734888  1.6203936
> colnames(x) <- letters[1:5]
> sapply(x, cor, y = x[, 5])
          a           b           c           d           e
 0.06175565 -0.03654796  0.02252372 -0.01408585  1.00000000
> lapply(x, cor, y = x[, 5])
$a
[1] 0.06175565

$b
[1] -0.03654796

$c
[1] 0.02252372

$d
[1] -0.01408585

$e
[1] 1

In the above code, I am passing a vector as y, and it is used as the second argument of cor each time cor is called on a column. sapply returns the correlations as a vector, whereas lapply returns them as a list.
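
As a quick check of how the extra argument is forwarded, this sketch (my own, with an assumed seed and hypothetical names r1, r2, r3) compares the short form with an explicit anonymous function, and passes a second extra argument as well.

```r
set.seed(42)  # assumption: fixed seed for a repeatable run
x <- data.frame(matrix(rnorm(1000), 200, 5))
colnames(x) <- letters[1:5]

# Extra arguments after the function are handed to cor on every call
r1 <- sapply(x, cor, y = x[, 5])

# The same computation written out with an anonymous function
r2 <- sapply(x, function(col) cor(col, x[, 5]))

# Several extra arguments can be forwarded at once
r3 <- sapply(x, cor, y = x[, 5], method = "spearman")

print(all.equal(r1, r2))
print(r3)
```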

tapply applies a function to groups of data. It takes a vector of data, a grouping vector that assigns each element of the data vector to a group, and the function.

> x <- data.frame(matrix(rnorm(1000), 200, 5))
> tapply(x[, 1], sample(letters[1:5], 200, T), summary)
$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.4010 -0.6451 -0.1532 -0.1656  0.6008  1.2530

$b
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.69900 -0.50800 -0.08974 -0.05318  0.70380  2.06200

$c
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.7180 -0.8657 -0.1659 -0.1219  0.5416  2.6900

$d
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-1.9450000 -0.4964000  0.0261200 -0.0003208  0.4923000  2.4350000

$e
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.8950 -0.9386 -0.2520 -0.2421  0.1317  3.1980
> by(x, sample(letters[1:5], 200, T), summary)
sample(letters[1:5], 200, T): a
       X1                  X2                 X3                 X4
 Min.   :-1.894702   Min.   :-2.12825   Min.   :-2.48184   Min.   :-1.81276
 1st Qu.:-0.315219   1st Qu.:-0.59759   1st Qu.:-0.51781   1st Qu.:-0.85265
 Median :-0.001239   Median :-0.16609   Median :-0.01515   Median :-0.17397
 Mean   : 0.012802   Mean   :-0.06084   Mean   :-0.02153   Mean   :-0.09181
 3rd Qu.: 0.497065   3rd Qu.: 0.55039   3rd Qu.: 0.79936   3rd Qu.: 0.76880
 Max.   : 1.479541   Max.   : 2.15728   Max.   : 2.06674   Max.   : 2.92034
       X5
 Min.   :-1.66168
 1st Qu.:-0.54460
 Median : 0.06998
 Mean   : 0.08078
 3rd Qu.: 0.75543
 Max.   : 1.77108

sample(letters[1:5], 200, T): b
       X1                 X2                  X3                X4
 Min.   :-2.6988   Min.   :-1.919464   Min.   :-2.22402   Min.   :-2.7828
 1st Qu.:-0.8711   1st Qu.:-0.383063   1st Qu.:-0.67027   1st Qu.:-0.4645
 Median :-0.2265   Median :-0.007706   Median :-0.19612   Median : 0.1912
 Mean   :-0.1334   Mean   : 0.005452   Mean   :-0.08216   Mean   : 0.1898
 3rd Qu.: 0.7205   3rd Qu.: 0.553677   3rd Qu.: 0.33841   3rd Qu.: 1.1395
 Max.   : 2.6904   Max.   : 1.851667   Max.   : 2.21023   Max.   : 2.8057
       X5
 Min.   :-2.41833
 1st Qu.:-0.52458
 Median : 0.31183
 Mean   : 0.09115
 3rd Qu.: 1.00719
 Max.   : 1.94639

sample(letters[1:5], 200, T): c
       X1                X2                X3                X4
 Min.   :-2.0565   Min.   :-2.3364   Min.   :-2.0210   Min.   :-2.0947
 1st Qu.:-0.7051   1st Qu.:-0.6890   1st Qu.:-0.2665   1st Qu.:-0.3118
 Median :-0.1054   Median :-0.2009   Median : 0.1013   Median : 0.1938
 Mean   :-0.1038   Mean   :-0.1345   Mean   : 0.1338   Mean   : 0.2960
 3rd Qu.: 0.6657   3rd Qu.: 0.4101   3rd Qu.: 0.6527   3rd Qu.: 1.2366
 Max.   : 2.0617   Max.   : 1.6190   Max.   : 1.4966   Max.   : 2.7061
       X5
 Min.   :-2.8578
 1st Qu.:-0.8699
 Median :-0.2601
 Mean   :-0.1859
 3rd Qu.: 0.5110
 Max.   : 2.1267

sample(letters[1:5], 200, T): d
       X1                X2                X3                 X4
 Min.   :-2.4011   Min.   :-2.3560   Min.   :-2.1363   Min.   :-2.54557
 1st Qu.:-1.0510   1st Qu.:-0.7428   1st Qu.:-0.4530   1st Qu.:-0.79773
 Median :-0.3280   Median :-0.2157   Median : 0.2762   Median : 0.06078
 Mean   :-0.2108   Mean   :-0.2219   Mean   : 0.1424   Mean   :-0.06605
 3rd Qu.: 0.4572   3rd Qu.: 0.3877   3rd Qu.: 0.8852   3rd Qu.: 0.86327
 Max.   : 2.4768   Max.   : 1.4466   Max.   : 1.6618   Max.   : 1.26067
       X5
 Min.   :-2.2495
 1st Qu.:-0.5001
 Median :-0.1382
 Mean   :-0.1613
 3rd Qu.: 0.3032
 Max.   : 1.4193

sample(letters[1:5], 200, T): e
       X1                X2                X3                X4
 Min.   :-1.9366   Min.   :-2.5626   Min.   :-1.5326   Min.   :-2.6420
 1st Qu.:-0.8047   1st Qu.:-0.9031   1st Qu.:-0.5513   1st Qu.:-0.2864
 Median :-0.3132   Median :-0.3650   Median : 0.3799   Median : 0.2160
 Mean   :-0.2036   Mean   :-0.2917   Mean   : 0.2980   Mean   : 0.1280
 3rd Qu.: 0.3988   3rd Qu.: 0.3309   3rd Qu.: 1.0764   3rd Qu.: 0.5996
 Max.   : 3.1978   Max.   : 2.0349   Max.   : 2.7986   Max.   : 2.0513
       X5
 Min.   :-1.63434
 1st Qu.:-0.51284
 Median :-0.04027
 Mean   : 0.05018
 3rd Qu.: 0.57312
 Max.   : 1.97665

Clearly, tapply and by serve different purposes. by splits a data frame into groups of rows and passes each subset, with all of its columns, to the function, whereas tapply always takes a single vector as input and the function runs only on that vector.
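
A tiny deterministic example makes the contrast easier to see. This is my own sketch; the data frame scores and its columns group, x and y are made up for illustration.

```r
# Hypothetical data: two groups, two numeric columns
scores <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  x = c(1, 3, 2, 4, 6),
  y = c(10, 30, 20, 40, 60)
)

# tapply: one vector in, one summary per group out
means_x <- tapply(scores$x, scores$group, mean)

# by: the function receives the whole sub data frame for each group,
# so it can use several columns at once
mean_diff <- by(scores, scores$group, function(d) mean(d$y - d$x))

print(means_x)
print(mean_diff)
```

The function given to by could not be written with a single tapply call, because it needs both x and y from the same rows.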

mapply applies a function to parallel vectors or lists, element by element. I never knew this until now. Fantastic learning. Let’s say you have written a function that only works on single values (the example below takes three arguments). You can quickly vectorize the function by passing it to mapply.

> test <- function(a, b, c) {
     if (b == "Test") {
         return(a + c)
     }
     else {
         return(a * c)
     }
 }
> test.vectorized <- function(a, b, c) {
     mapply(test, a, b, c)
 }
> test.df <- data.frame(runif(10), sample(c("Test", "NoTest"),
     10, T), runif(10))
> test.vectorized(test.df[, 1], test.df[, 2], test.df[, 3])
 [1] 1.79459526 0.32283599 0.02181394 1.15925957 0.68642825 1.18831506
 [7] 1.09246148 0.22514183 1.25755210 0.02422697

Until now I had never thought about vectorizing a simple function. mapply is a convenient way to do it.
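
Here is the same idea with fixed inputs so the result can be checked by hand. This is my own sketch, not from the book. The input vectors are made up, and the last lines use base R's Vectorize, which builds the mapply wrapper automatically; I am adding it only as a related shortcut.

```r
# Scalar-only function: `if` looks at a single value of b
test <- function(a, b, c) {
  if (b == "Test") a + c else a * c
}

a <- c(1, 2, 3)
b <- c("Test", "NoTest", "Test")
c_vals <- c(10, 20, 30)

# mapply calls test(a[i], b[i], c_vals[i]) for i = 1, 2, 3
by_mapply <- mapply(test, a, b, c_vals)

# Vectorize() returns a wrapper function that calls mapply internally
test_v <- Vectorize(test)
by_vectorize <- test_v(a, b, c_vals)

print(by_mapply)
print(by_vectorize)
```

The first and third elements are sums (1 + 10 and 3 + 30) and the second is a product (2 * 20).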

R Cookbook - Data Structures

You can turn a list into a matrix merely by giving it a dim attribute.
stack can be used to combine a list of vectors into a two-column data frame.

> x1 = runif(3)
> x2 = runif(3)
> x3 = runif(3)
> stack(list(x1 = x1, x2 = x2, x3 = x3))
     values ind
1 0.4405111  x1
2 0.7045491  x1
3 0.8090941  x1
4 0.4549867  x2
5 0.4280318  x2
6 0.6292135  x2
7 0.1691065  x3
8 0.1545277  x3
9 0.6390784  x3

I had never used stack before.
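
Going back to the first remark in this section, here is a short sketch of giving a list a dim attribute. It is my own example with made-up values. One thing to keep in mind is that the result is a matrix whose cells are list elements, so I also show unlist for the case where a plain numeric matrix is wanted.

```r
values <- list(1, 2, 3, 4, 5, 6)

# Assigning dim turns the list into a 2 x 3 matrix of list cells,
# filled column by column
m <- values
dim(m) <- c(2, 3)

print(is.matrix(m))   # it is a matrix ...
print(is.list(m))     # ... and still a list underneath
print(m[[1, 2]])      # row 1, column 2

# For an ordinary numeric matrix, flatten the list first
num <- matrix(unlist(values), nrow = 2, ncol = 3)
print(num)
```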

Use drop = FALSE when subsetting so that the result is still a data frame, even when only one column is selected.
To initialize a data frame from row data, use do.call(rbind, obs). I came across this in Hadley Wickham’s code and was totally clueless about what it meant. Now I understand it.
Using matrix notation to select columns from a data frame is not the best procedure. Use list operators ([[ ]], $ or single brackets with one index) instead.
Use subset to select or remove certain columns from a data frame.
If you attach a data frame and then assign to one of its variables, the change will not be reflected in the original data frame. Only a local copy will be changed.
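
The reminders above can be sketched in one short script. This is my own example, not the book's. The list obs, its columns id and label, and all variable names are made up for illustration.

```r
# Row data: a list of one-row data frames (hypothetical example data)
obs <- list(
  data.frame(id = 1, label = "a"),
  data.frame(id = 2, label = "b"),
  data.frame(id = 3, label = "c")
)

# do.call(rbind, obs) is the same as rbind(obs[[1]], obs[[2]], obs[[3]])
df <- do.call(rbind, obs)

# Single-column subsetting: a vector by default, a data frame with drop = FALSE
as_vector <- df[, "id"]
as_frame  <- df[, "id", drop = FALSE]

# List operators for column selection
col_values <- df[["id"]]   # the column itself
col_frame  <- df["id"]     # a one-column data frame

# subset() keeps or removes columns by name
only_id  <- subset(df, select = id)
no_label <- subset(df, select = -label)

# attach(): the assignment creates a separate variable, df is untouched
attach(df)
id <- id * 10
detach(df)

print(df$id)   # original column
print(id)      # modified copy
```

The do.call line is what finally made sense to me: it simply calls rbind with every element of obs as a separate argument.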