R cookbook

I am going over the R Cookbook mainly to review the syntax. It has been almost two months since I last wrote an R program, so I am rusty, and that is the reason for going over the cookbook.

I will write down whatever I find new, along with good reminders of things I have forgotten about R.


R cookbook - Data Transformations

  • I had forgotten about the split function, which breaks a vector into groups defined by a factor
> library(MASS)
> x <- with(Cars93, split(MPG.city, Origin))
> sapply(x, median)
    USA non-USA
     20      22
> lapply(x, median)
$USA
[1] 20
$`non-USA`
[1] 22
  • Difference between sapply and lapply: the former simplifies its result to a vector (or matrix) when it can, whereas the latter always returns a list
  • If the called function returns a structured object, always use lapply
> z <- list(x = runif(100), y = runif(100), z = runif(100))
> lapply(z, t.test)
$x
One Sample t-test
data:  X[[1L]]
t = 16.8501, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4410959 0.5588463
sample estimates:
mean of x
0.4999711
$y
One Sample t-test
data:  X[[2L]]
t = 17.0593, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4360076 0.5507845
sample estimates:
mean of x
 0.493396
$z
One Sample t-test
data:  X[[3L]]
t = 19.8378, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.4915784 0.6008450
sample estimates:
mean of x
0.5462117
> sapply(z, t.test)
            x                   y                   z
statistic   16.85007            17.05927            19.83777
parameter   99                  99                  99
p.value     7.697001e-31        3.092998e-31        2.876633e-36
conf.int    Numeric,2           Numeric,2           Numeric,2
estimate    0.4999711           0.493396            0.5462117
null.value  0                   0                   0
alternative "two.sided"         "two.sided"         "two.sided"
method      "One Sample t-test" "One Sample t-test" "One Sample t-test"
data.name   "X[[1L]]"           "X[[2L]]"           "X[[3L]]"
> batches = data.frame(f = as.factor(sample.int(10, 20, replace = T)), v1 = runif(20))
> sapply(batches, class)
        f        v1
 "factor" "numeric"
> lapply(batches, class)
$f
[1] "factor"
$v1
[1] "numeric"
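
One reason to prefer lapply for structured results is that each list element stays a complete object, so individual fields can be pulled out afterwards. A minimal sketch, assuming the same kind of input as z above; the seed and the names tests, p.values and conf.low are my own, chosen only for illustration:

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible
z <- list(x = runif(100), y = runif(100), z = runif(100))

# lapply keeps every t.test result intact as an "htest" object
tests <- lapply(z, t.test)

# With the full objects in a list, sapply is handy for extracting one
# scalar field per element; the result simplifies to a named vector
p.values <- sapply(tests, function(r) r$p.value)
conf.low <- sapply(tests, function(r) r$conf.int[1])
```

Here sapply is used only for the final step, where every call returns a single number and simplification to a vector is what I want.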
  • One very useful way of using sapply is to pass extra arguments after the function; sapply forwards them to every call of that function
> x <- data.frame(matrix(rnorm(1000), 200, 5))
> head(x)
          X1         X2         X3         X4         X5
1  0.2462347 -0.2475659  1.0894281 -0.9770150  0.2906779
2  1.7860306  0.2282453 -1.4134570 -0.8327972 -0.6294957
3 -0.1900189  2.0704738  1.2697864 -1.0238222  0.8206145
4 -0.9786056  0.2990741 -1.8738614 -0.8740837 -0.5114038
5 -0.7924509  0.1239347  0.1661355 -0.6518038 -1.2335057
6 -1.4474740 -1.9331664  0.2592431 -0.4734888  1.6203936
> colnames(x) <- letters[1:5]
> sapply(x, cor, y = x[, 5])
          a           b           c           d           e
 0.06175565 -0.03654796  0.02252372 -0.01408585  1.00000000
> lapply(x, cor, y = x[, 5])
$a
[1] 0.06175565
$b
[1] -0.03654796
$c
[1] 0.02252372
$d
[1] -0.01408585
$e
[1] 1

In the code above, I pass a vector as y, and it is forwarded to the correlation function, which is called once per column. sapply returns a vector, whereas lapply returns a list.

  • tapply applies a function to groups of data. It takes a data vector, a grouping vector that assigns each element of the data vector to a group, and the function to apply.
> x <- data.frame(matrix(rnorm(1000), 200, 5))
> tapply(x[, 1], sample(letters[1:5], 200, T), summary)
$a
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.4010 -0.6451 -0.1532 -0.1656  0.6008  1.2530
$b
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.69900 -0.50800 -0.08974 -0.05318  0.70380  2.06200
$c
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.7180 -0.8657 -0.1659 -0.1219  0.5416  2.6900
$d
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-1.9450000 -0.4964000  0.0261200 -0.0003208  0.4923000  2.4350000
$e
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.8950 -0.9386 -0.2520 -0.2421  0.1317  3.1980
> by(x, sample(letters[1:5], 200, T), summary)
sample(letters[1:5], 200, T): a
       X1                  X2                 X3                 X4
 Min.   :-1.894702   Min.   :-2.12825   Min.   :-2.48184   Min.   :-1.81276
 1st Qu.:-0.315219   1st Qu.:-0.59759   1st Qu.:-0.51781   1st Qu.:-0.85265
 Median :-0.001239   Median :-0.16609   Median :-0.01515   Median :-0.17397
 Mean   : 0.012802   Mean   :-0.06084   Mean   :-0.02153   Mean   :-0.09181
 3rd Qu.: 0.497065   3rd Qu.: 0.55039   3rd Qu.: 0.79936   3rd Qu.: 0.76880
 Max.   : 1.479541   Max.   : 2.15728   Max.   : 2.06674   Max.   : 2.92034
       X5
 Min.   :-1.66168
 1st Qu.:-0.54460
 Median : 0.06998
 Mean   : 0.08078
 3rd Qu.: 0.75543
 Max.   : 1.77108

sample(letters[1:5], 200, T): b
       X1                X2                  X3                 X4
 Min.   :-2.6988   Min.   :-1.919464   Min.   :-2.22402   Min.   :-2.7828
 1st Qu.:-0.8711   1st Qu.:-0.383063   1st Qu.:-0.67027   1st Qu.:-0.4645
 Median :-0.2265   Median :-0.007706   Median :-0.19612   Median : 0.1912
 Mean   :-0.1334   Mean   : 0.005452   Mean   :-0.08216   Mean   : 0.1898
 3rd Qu.: 0.7205   3rd Qu.: 0.553677   3rd Qu.: 0.33841   3rd Qu.: 1.1395
 Max.   : 2.6904   Max.   : 1.851667   Max.   : 2.21023   Max.   : 2.8057
       X5
 Min.   :-2.41833
 1st Qu.:-0.52458
 Median : 0.31183
 Mean   : 0.09115
 3rd Qu.: 1.00719
 Max.   : 1.94639

sample(letters[1:5], 200, T): c
       X1                X2                X3                X4
 Min.   :-2.0565   Min.   :-2.3364   Min.   :-2.0210   Min.   :-2.0947
 1st Qu.:-0.7051   1st Qu.:-0.6890   1st Qu.:-0.2665   1st Qu.:-0.3118
 Median :-0.1054   Median :-0.2009   Median : 0.1013   Median : 0.1938
 Mean   :-0.1038   Mean   :-0.1345   Mean   : 0.1338   Mean   : 0.2960
 3rd Qu.: 0.6657   3rd Qu.: 0.4101   3rd Qu.: 0.6527   3rd Qu.: 1.2366
 Max.   : 2.0617   Max.   : 1.6190   Max.   : 1.4966   Max.   : 2.7061
       X5
 Min.   :-2.8578
 1st Qu.:-0.8699
 Median :-0.2601
 Mean   :-0.1859
 3rd Qu.: 0.5110
 Max.   : 2.1267

sample(letters[1:5], 200, T): d
       X1                X2                X3                X4
 Min.   :-2.4011   Min.   :-2.3560   Min.   :-2.1363   Min.   :-2.54557
 1st Qu.:-1.0510   1st Qu.:-0.7428   1st Qu.:-0.4530   1st Qu.:-0.79773
 Median :-0.3280   Median :-0.2157   Median : 0.2762   Median : 0.06078
 Mean   :-0.2108   Mean   :-0.2219   Mean   : 0.1424   Mean   :-0.06605
 3rd Qu.: 0.4572   3rd Qu.: 0.3877   3rd Qu.: 0.8852   3rd Qu.: 0.86327
 Max.   : 2.4768   Max.   : 1.4466   Max.   : 1.6618   Max.   : 1.26067
       X5
 Min.   :-2.2495
 1st Qu.:-0.5001
 Median :-0.1382
 Mean   :-0.1613
 3rd Qu.: 0.3032
 Max.   : 1.4193

sample(letters[1:5], 200, T): e
       X1                X2                X3                X4
 Min.   :-1.9366   Min.   :-2.5626   Min.   :-1.5326   Min.   :-2.6420
 1st Qu.:-0.8047   1st Qu.:-0.9031   1st Qu.:-0.5513   1st Qu.:-0.2864
 Median :-0.3132   Median :-0.3650   Median : 0.3799   Median : 0.2160
 Mean   :-0.2036   Mean   :-0.2917   Mean   : 0.2980   Mean   : 0.1280
 3rd Qu.: 0.3988   3rd Qu.: 0.3309   3rd Qu.: 1.0764   3rd Qu.: 0.5996
 Max.   : 3.1978   Max.   : 2.0349   Max.   : 2.7986   Max.   : 2.0513
       X5
 Min.   :-1.63434
 1st Qu.:-0.51284
 Median :-0.04027
 Mean   : 0.05018
 3rd Qu.: 0.57312
 Max.   : 1.97665

Clearly, tapply and by serve different purposes. by splits a data frame into groups of rows according to the grouping criteria and passes each group, with all its columns, to the function. tapply always takes a single vector as input, and the function runs only on that vector.
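
The difference shows up most clearly when the function needs more than one column. A small sketch under my own assumptions (the seed, the data frame df with columns u and v, and the grouping vector g are all made up for illustration):

```r
set.seed(42)  # assumed seed for reproducibility
df <- data.frame(u = rnorm(200), v = rnorm(200))
g <- sample(letters[1:5], 200, replace = TRUE)

# tapply: one vector in, one summary value per group out
group.means <- tapply(df$u, g, mean)

# by: each group's rows arrive as a data frame, so the function
# can use several columns at once (here, a per-group correlation)
group.cors <- by(df, g, function(rows) cor(rows$u, rows$v))
```

A per-group correlation like this cannot be written directly with tapply, because tapply only ever hands a single vector to the function.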

  • mapply applies a function to parallel vectors or lists. I had never known this until now. Fantastic learning. Let’s say you have written a function that takes several scalar arguments. You can quickly vectorize it by passing it to mapply
> test <- function(a, b, c) {
     if (b == "Test") {
         return(a + c)
     }
     else {
         return(a * c)
     }
 }
> test.vectorized <- function(a, b, c) {
     mapply(test, a, b, c)
 }
> test.df <- data.frame(runif(10), sample(c("Test", "NoTest"),
     10, T), runif(10))
> test.vectorized(test.df[, 1], test.df[, 2], test.df[, 3])
 [1] 1.79459526 0.32283599 0.02181394 1.15925957 0.68642825 1.18831506
 [7] 1.09246148 0.22514183 1.25755210 0.02422697
  • Until now I had never thought about vectorizing a simple function. mapply is a convenient way to do it.
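
Because the run above uses random inputs, the result is hard to check by eye. Here is a deterministic sketch of the same idea; the input vectors a.vals, b.vals and c.vals are hypothetical values I picked so the arithmetic is easy to follow:

```r
# Same shape as test above: a scalar function that branches on b
test <- function(a, b, c) {
    if (b == "Test") a + c else a * c
}

a.vals <- c(1, 2, 3)
b.vals <- c("Test", "NoTest", "Test")
c.vals <- c(10, 20, 30)

# mapply walks the three vectors in parallel, calling
# test(1, "Test", 10), test(2, "NoTest", 20), test(3, "Test", 30)
res <- mapply(test, a.vals, b.vals, c.vals)
# 1 + 10 = 11, 2 * 20 = 40, 3 + 30 = 33
```

Calling test directly on the vectors would not work as intended, since if expects a single TRUE or FALSE condition; mapply sidesteps that by making one scalar call per position.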

R cookbook - Data Structures

  • You can turn a list into a matrix merely by giving it a dim attribute
  • stack can be used to combine a list into a two-column data frame
> x1 = runif(3)
> x2 = runif(3)
> x3 = runif(3)
> stack(list(x1 = x1, x2 = x2, x3 = x3))
     values ind
1 0.4405111  x1
2 0.7045491  x1
3 0.8090941  x1
4 0.4549867  x2
5 0.4280318  x2
6 0.6292135  x2
7 0.1691065  x3
8 0.1545277  x3
9 0.6390784  x3

I had never used the above function before.
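
The first note in this section, about giving a list a dim attribute, can be sketched as follows. The list lst is a hypothetical example with mixed element types, which is exactly the case where a list-matrix differs from an ordinary matrix:

```r
lst <- list(1, "a", TRUE, 2.5)  # hypothetical four-element list
dim(lst) <- c(2, 2)             # same elements, now arranged 2 x 2

is.matrix(lst)  # TRUE: it has a two-dimensional dim attribute
is.list(lst)    # TRUE: the elements keep their own types
lst[[1, 2]]     # TRUE: filling is column-wise, so this is the third element
```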

  • Use drop = FALSE when subsetting so that the result is again a data frame
  • To initialize a data frame from row data, use do.call(rbind, obs). I came across this in Hadley Wickham’s code and was totally clueless about what it meant. Now I understand it
  • Using matrix notation to select columns from a data frame is not the best procedure. Use list operators instead
  • Use subset to select or remove certain columns from a data frame
  • If you attach a data frame and then modify one of its variables, the change is not reflected in the original data frame. Only a local copy is changed
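
Several of the data frame notes above can be sketched in one place. This is a minimal illustration rather than code from the cookbook; obs, df and the column names id and score are hypothetical:

```r
# Hypothetical row data: one single-row data frame per observation
obs <- list(
    data.frame(id = 1, score = 0.5),
    data.frame(id = 2, score = 0.7),
    data.frame(id = 3, score = 0.9)
)

# do.call(rbind, obs) is the same as rbind(obs[[1]], obs[[2]], obs[[3]])
df <- do.call(rbind, obs)

# Without drop = FALSE, a single column collapses to a plain vector
score.vec <- df[, "score"]                # numeric vector
score.df  <- df[, "score", drop = FALSE]  # still a data frame

# List operators: [[ ]] returns the column itself, [ ] a data frame
ids    <- df[["id"]]
ids.df <- df["id"]

# subset() can keep or drop columns by name
kept    <- subset(df, select = score)
dropped <- subset(df, select = -score)
```

The do.call form is what makes the row-binding work for a list of any length, since the list elements become the arguments to rbind.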