Reproducible Research with R and RStudio

The book starts by explaining an example project that one download from the author’s github account. The project files serve as a minimalist introduction to reproducible research. I guess it might make sense to download this project, try to follow the suggestions and create the relevant files. This kind of gives a sense of what one can accomplished by reading through the book.

Introducing Reproducible Research

The highlight of a RR document is that data, analysis and results are all in one document. There is no separation between announcing the results and doing number crunching. The author gives a list of benefits that accrue to any researcher generating RR documents. They are

better work habits
better team work
changes are easier
high research impact

The author uses knitr or rmarkdown in the book to discuss RR. The primary difference between the two is that the former needs that documents must be written using the markup language associated with the desired output. The latter is more straightforward in the sense that it can be used to produce a variety of outputs.

Getting Started with Reproducible Research

The thing to keep in mind is that reproducibility is not an after thought - it is something you build into the project from the beginning. Some general aspects of RR are discussed by the author. If you do not believe in the benefits of RR, then you might have to carefully read this chapter to understand the benefits as it gives some RR tips to a newbie. This chapter also gives a road map to the reader as to what he/she can expect from the book. In any research project, there is data gathering stage, data analysis stage and presentation stage. The book contains a set of chapters addressing each stage of the project. More importantly, the book contains ways to tie each of the stages so as to produce a single compendium for your entire project.

Getting started with R, RStudio and knitr/rmarkdown

This chapter gives a basic introduction to R and subsequently dives in to knitr and rmarkdown commands. It shows how one can create a .Rnw or .Rtex document and convert in to a pdf either through RStudio or the command line. rmarkdown documents on the other hand are more convenient for reproducing simple projects where there are not many interdependencies between various tasks. Obviously the content in this chapter gives only a general idea. One has to dig through the documentation to make things work. One learning for me from this chapter was the option of creating .Rtex documents in which the syntax can be less baroque.

Getting started with File Management

This chapter gives the basic directory structure that one can follow for organizing the project files. One can use the structure as a guideline for one’s own projects. The example project uses gnu make file for data munging. It gives a crash course of some basic commands that one can via shell.

Storing, Collaborating, Accessing Files, and Versioning

The four activities mentioned in the chapter title can be done in many ways. The chapter focuses on Dropbox and Github. It is fairly easy to follow the limited functionality one gets from Dropbox. On the otherhand, Github demands some learning from a newbie. One needs to get to know the basic terminology of git. The author does a commendable job of highlighting the main aspects of git version control and its tight integration with RStudio.

Gathering Data with R This chapter talks about the way in which one can use GNU make utility to create a systematic way of gathering data. The use of make file makes it easy for other to reproduce the data preparation stage of a project. If you have written a make file in C++ or in some other context, it is pretty easy to follow the basic steps mentioned in the chapter. Else it might involve some learning curve. My guess is once you start writing make files for specific tasks, you will realize their tremendous value in any data analysis project. A nice starting point for learning make file is robjhyndman’s site.

Preparing Data for Analysis This gives a whirlwind tour of data munging operations and data analysis.

Statistical Modeling and knitr

The chapter gives a brief description of chunk options that are frequently used. Out of all the options, cache.extra and dependson are the options that I have never used in the past and is a learning for me.One of the reasons I like knitr is its ability to cache objects. In the Sweave era, I had to load separate packages, do all sorts of things to run a time intensive RR document. It was very painful to say the least. Thanks to knitr it is extremely easy now. Even though cache option is described at the end, I think it is one of the most useful features of the package. Another good thing is that you can combine various languages in RR document. Currently knitr supports the following language engines :

Awk
Bash shell
CoffeeScript
Gawk
Haskell
Highlight
Python
R (default)
Ruby
SAS
Bourne shell

Showing results with tables

In whatever analysis you do using R , there are always situations where your output is in the form of a data.frame or matrix or some sort of list structure that is format-ed two display on the console as a table. One can use kable to show data.frame and matrix structures. It is simple, effective but limited in scope. xtable package on the other hand is extremely powerful. One can use various statistical model fitted objects and pass it on to xtable function to obtain a table and tabular environment encoded for the results. The chapter also mentions texreg that is far more powerful than the previous mentioned packages. With texreg , you can show the output of more than one statistical models as a table in your RR document.There are times when the output classes are not supported by xtable. In such cases, one has to manually hunt down the relevant table, create a data frame or matrix of the relevant results and then use xtable function.

Showing results with figures

It is often better to know basic LaTeX syntax for embedding graphics before using knitr. One problem I have always faced with knitr embedded graphics is that all the chunk options should be mentioned in one single line. You cannot have two lines for chunk options. Learnt a nice hack from this chapter where some of the environment level code can be used as markup rather than as chunk options .This chapter touches upon the main chunk options relating to graphics and does it well, without overwhelming the reader.

Presentation with knitr/LaTeX

The author says that much of the LaTeX in the book has been written using Sublime Text editor. I think this is the case with most of the people who intend to create an RR. Even though RStudio has a good environment to create a LaTeX file, I usually go back to my old editor to write LaTeX markup. How to cite bibliography in your document? and How to cite R packages in your document? are questions that every researcher has to think about in producing RR documents. The author does a good job of highlighting the main aspects of this thought process. The chapter ends with a brief discussion on Beamer. It gives a 10,000 ft of beamer. I stumbled on to a nice link in this chapter that gives the reason for using fragile in beamer.

Large knitr/LaTeX Documents: Theses,Books, and Batch Reports

This chapter is mightily useful for long RR documents. In fact if your RR document is not large, it makes sense to logically subdivide in to separate child documents. For knitr, there are chunk options to specify parent and child relationships. These options are useful in knitting child documents independently of the other documents embedded in the parent document. You do not have to specify the preamble code again in each of the child documents as it inherits the code from the parent document. The author’s also shows a way to use Pandoc to change the content from a different markup language to tex, which can then be included in the RR document.

The penultimate chapter is on rmarkdown. The concluding chapter of the book discusses some general issues of reproducible research.

Takeaway Write some decent takeaway so that it appeals to everyone