I’ve been starting to think about some R-related research. Building on my previous post, wherein I data mine the RStudio logs to find the most popular R packages, I’ve dug a little deeper into the package contents.

One question I have is, “How much of R is written in C/C++ or Fortran?” I expected the performance-critical parts, especially linear algebra, to be written in a “faster” language. However, looking at just the top 25 most popular packages, a lot of R is written in R.

[Plot: language composition of the top 25 most popular R packages]

Now, the most popular packages don’t represent everything people think of as being R. But the conflation of the R language, the R virtual machine, the base packages, and the contributed packages is really what most people think of as being “R.” This post only looks at a piece of the bigger R ecosystem, but I’ll get to the rest in due time. I should also note these are the most popular non-default packages. For the default packages, which are included with the R installer, I don’t know of a way to measure their popularity.

Looking at the plot, most packages consist mainly of R code. Notable exceptions are Rcpp and rJava, which have a lot of C++ and Java code, respectively. There’s also a smattering of C across some packages. Some of these are wrappers around C libraries, such as RCurl, but there could be some C for performance’s sake. A deeper dive is necessary to determine why an individual package uses C.

How did I get this data? In my previous post, I mined the RStudio logs to find the most popular packages. Then I downloaded those packages from CRAN and ran cloc on them (a sketch of the download step follows the function below). Note for reproducibility: I had to get the latest-and-greatest cloc from SVN for R support. Luckily, cloc has a CSV output option, so getting the data into R is as simple as:

cloc <- function(file) {
    nHeader <- 4  # cloc prints a few informational lines before the CSV header
    cmd <- paste0("cloc --force-lang=R,r --csv ", file)
    output <- system(cmd, intern=TRUE)
    output <- output[-(1:nHeader)]  # drop the non-CSV preamble
    con <- textConnection(output)
    d <- read.csv(con)
    close(con)
    d <- d[-length(d)] # the last column is some weird string
    languages <- d[["language"]]
    df <- data.frame(t(as.numeric(d[["code"]])), basename(file))
    colnames(df) <- c(as.character(languages), "package")
    return(df)
}
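The download-and-extract step itself isn’t shown in this post. Here’s a minimal sketch of how it could look, assuming a character vector topPkgs of package names and the extraction directory EXDIR used below (both names are mine, not from the original code):

# Fetch the source tarballs for the top packages from CRAN and unpack them,
# so cloc() can be pointed at each extracted package directory.
dir.create(EXDIR, showWarnings=FALSE)
tarballs <- download.packages(topPkgs, destdir=EXDIR, type="source",
                              repos="https://cran.r-project.org")
for (tb in tarballs[, 2]) {  # column 2 holds the downloaded file paths
    untar(tb, exdir=EXDIR)
}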

Now I can accumulate the cloc counts for the top N packages using Map/Reduce:

clocTopN <- function(N) {
    # getTopN() and DLCSV (the download-log data) come from the previous post;
    # EXDIR is the directory the package sources were extracted into
    topN <- getTopN(DLCSV, N)
    files <- paste0(EXDIR, topN)
    clocs <- Map(cloc, files)
    merged <- Reduce(function(x, y) merge(x, y, all=TRUE), clocs)
    merged[is.na(merged)] <- 0  # languages a package doesn't use get 0 lines
    return(merged)
}
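With that in place, running the accumulation for the top 25 packages and stashing the result might look like this (the filename is a placeholder of mine, not from the post):

# Run the accumulation once and cache it, so the plotting step below
# doesn't have to shell out to cloc again.
counts <- clocTopN(25)
write.csv(counts, "cloc_top25.csv")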

I stash the intermediate results in a file, as sketched above. The next function reads that file back and makes a nice stacked bar plot showing the KLOC (thousands of lines of code) for the various languages in each package.

library(reshape2)  # for melt()
library(ggplot2)

stackedBarPlot <- function(filename) {
    x <- read.csv(filename, row.names=1)
    langs <- c("R", "C", "C..", "Java")  # read.csv mangles the "C++" column name to "C.."
    displangs <- c("R", "C", "C++", "Java")
    y <- x[c("package", langs)]  # get only the columns we want
    y[langs] <- y[langs] / 1000  # convert to KLOC
    df <- data.frame(y)
    colnames(df) <- c("package", displangs)
    m <- melt(df, id.vars="package")
    colnames(m) <- c("package", "Language", "KLOC")
    p <- ggplot(m, aes(x = package, y = KLOC, fill = Language)) +
      geom_bar(stat = "identity") +
      ylab("KLOC") +
      theme(axis.text.x = element_text(angle=-90, hjust=0))
    ggsave("pkg_comp.png", plot = p)
}
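And then generating the chart is just (again with my placeholder filename):

stackedBarPlot("cloc_top25.csv")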

Making the chart was honestly the hardest part. I’ve used MATLAB and Python’s Matplotlib a lot, so R’s plotting is foreign to me. The built-in R plotting seems horrendous, since I am used to plotting that “just works” in MATLAB. But maybe I learned MATLAB so long ago that I’ve forgotten how difficult it was to learn. I’ll be trying to learn ggplot2 for future work.