The most popular R packages are, mostly, written in R
I’ve been starting to think about some R related research. Building on my previous post, wherein I data mine the RStudio logs to find the most popular R packages, I’ve dug a little deeper in to the package contents.
One question I have is, “How much of R is written C/C++ or Fortran?” I expected the performance critical parts, especially linear algebra, to be written in a “faster” language. However, looking at just the top 25 most popular packages, a lot of R is written in R.
Now, the most popular packages don’t represent everything people think of as being R. But the conflation of the R language, the R virutal machine, the base packages, and the contributed packages is really what most people think of as being “R.” This post only looks at a piece of the bigger R ecosystem, but I’ll get to the rest in due time. I should also note these are the most popular non-default packages. For the default packages, which are included with the R installer, I don’t know of a way of measuring their popularity.
Looking at the plot, most packages consist mainly of R code. Notable expections are Rcpp and rJava, which have a lot of C++ and Java code respectively. There’s also a smattering of C across some packages. Some of these are wrappers around C programs such as RCurl, but there could be some C for performance sake. A deeper dive is necessary to deterimine why an individual package uses C.
How did I get this data? In my previous post, I mined the RStudio logs to find the most popular packages. Then I downloaded the packages from CRAN, and ran cloc on them. Note for reproducibility: I had to get the latest-and-greatest cloc from SVN for R support. Luckily cloc has a csv output option, so getting the data into R is as simple as:
Now I can accumulate the cloc for the top N packages using Map/Reduce:
Then I stash the intermediate results in a file. This function opens the file and makes a nice stacked barplot showing the KLOC (thousands of lines of code) for the various languages in each package.
Making the chart was honestly the hardest part. I’ve used MATLAB and Python’s Matplotlib a lot so R’s plotting is forgein to me. The built-in R plotting seems horrendous, since I am used to plotting that “just works” in MATLAB. But maybe I learned MATLAB so long along that I’ve forgetten how difficult it was to learn. I’ll be trying to learn ggplot2 for future work.