Writing sharable (R) code. Uh… and sharing it!

Intro

In this tutorial we will build an R package, write unit tests for it (steps 1-7), push it in a GitHub repository and integrate Travis CI with it (step 8). In step 9 we build a drat repository to host this and other R packages, using GitHub as a Web Server and set up Travis CI to automatically push updates of our package into drat. In step 10 we make some improvements in the package. We finish with providing sources for R package development and other relevant topics.

Prerequisites

(Optional) Rstudio; this is not necessary to produce R packages but it does make your life easier as the main package for R package development, devtools, is well integrated with RStudio.
(For Windows users only) Rtools which you can download here.
The following R packages:
- devtools, usethis, testthat and roxygen2 in order to build R packages
- tidyverse, rlang
- drat
(Optional) A github account. If you wish to replicate steps 8 onwards herein you need your own github account. You also need a Travis-CI account for which you can sign up with your github account.

Create an R package

Let’s jump right into it by building a minimal R package following Hilary Parker’s building steps. Afterwards, we can discuss when or why to build an R package along with some good practices. Most of the discussed topics below are taken from Hadley Wickham’s R packages.

Step 1: Create a package directory

Hilary is building a cat-themed package, but in order to not discourage dog persons, I will go with a different theme. Let’s make a package whose aim is to perform some basic statistics on blood metabolomics data. (In fact, other than the slightly grotesque namings, the package’s single demo utility has nothing specific to blood metabolomics.)

Let’s create the minimum amount of subdirectories your package needs.

# Navigate to the desired parent directory
setwd("parent_directory")
# Create the directory of your package with the minimum amount of subdirectories
# Its reasonable but not necessary to name the directory with the same name as 
# the package 
usethis::create_package("bloodstats")
# If you run the above in RStudio, you will likely get a new RStudio window open
# automatically, with the project name "bloodstats"

(Alternatively, from within RStudio, you may perform the step above by going to File -> New Project... -> New Directory -> R package, type in the package name and location and click Create Project. Notice, you may also add existing functions at this step.)

What the above function did was to create a directory called bloodstats/, inside the parent_directory/, and inside bloodstats/ two subdirectories,

R/, that will soon contain the package source code and
man/, that will soon contain the package’s documentation.

Finally, there are two files, DESCRIPTION and NAMESPACE. Go ahead and edit the DESCRIPTION file with a short description of the package, your name and contact information, etc. This is the file where you’ll also be keeping track of your package versioning.

Here is an example of how I would have my first version of the DESCRIPTION look like:

Package: bloodstats
Title: Utilities for Metabolomics Data
Version: 0.0.0.9999
Authors@R:
    person(
        "Maria", "Kalimeri", email = "maria.kalimeri@nightingalehealth.com",
        role = c("aut", "cre")
    )
Description: Functions and other utilities for basic statistics on blood 
    metabolomics data.
Depends:
    R (>= 3.5.0)
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1

The package’s namespace, as recorded in the NAMESPACE file, is something you should understand if you plan to share your packages. Namespace takes care of imports and exports such that your package will coexist in harmony with other packages. The file NAMESPACE is something you shouldn’t edit by hand, instead roxygen2 will take care of updating this file everytime you build your documentation.

An important note concerning the name of your package, especially if you plan to share it with others. It’s good to make sure that the name of your package is not already in use by another CRAN package. You can check this by loading https://cran.r-project.org/web/packages/bloodstats.

The License field can be either a standard abbreviation for an open source license, like GPL-2 or BSD, or a pointer to a file containing more information, file LICENSE. The license is really only important if you’re planning on releasing your package. If you don’t, you can ignore this section. I have added an MIT license here just for demonstrational purposes.

Step 2: Add functions

Below is an example of a function that fits the scope of our package. It takes a data frame as input (supposedly containining blood biomarkers) and returns the mean value of each column (variable) as long as this is numeric.

bloodmeans <- function(df) {
  df %>%
    dplyr::summarise_if(is.numeric, mean, na.rm = TRUE)
}

Save this function as bloodmeans.R inside the R subdirectory.

Good to know

Note the usage of the pipe %>% symbol above. The pipe is a way to write a series of operations on an R object, e.g. a data frame, in an easy-to-read way. As an example, the operation x %>% f(y) effectivly means f(x, y). You can read more on the pipe here.

Furthermore, summarize_if is a function of the dplyr package, that is part of the core tidyverse, an opinionated collection of R packages designed for data science. If you are not a tidyverse user, I strongly suggest to give it a try. A good place to start is H. Wickham’s book “R for Data Science”

Step 3: Add documentation

What you need to do is type each function’s description and other comments at the beginning of each function in the form of special comments and roxygen2 will take care of building the whole documentation. An example of how the special comments that constitute object documentation should be is shown below. For more information on the subject see here.

#' Extract Mean Values of Blood Biomarkers  
#' 
#' This function accepts a dataframe as input and extracts the mean value of 
#' each numeric variable.
#' 
#' @param df a \code{data.frame} with at least one numeric variable in order to 
#' get a non-empty result.
#' @return a data.frame with the mean values of each numeric 
#' @importFrom dplyr summarise_if
#' @author John Doe 
#' @export 
#' @examples 
#' library(magrittr)
#' data.frame(x1 = c(1,2,3), x2 = c(4,5,6)) %>% 
#'   bloodstats::bloodmeans()
bloodmeans <- function(df) {
  df %>%
    dplyr::summarise_if(is.numeric, mean, na.rm = TRUE)
}

It is not necessary to have each function in its own file - although it usually makes the code easier to read/access by others - but when you add more than one function in a file make sure to add the documentation for each function just before its definition.

Some notes on documentation. I find it always very usefull to have at least one example per function. Even for a “quick and dirty” package, working examples are sometimes saving the day. Especially if you need to demonstrate the structure of the function’s input parameter(s). Such a thing may need one or two built-in datasets. We will add a demo dataset, a couple of steps later.

Let’s take a moment and look at the package’s NAMESPACE now. In the above code notice two things, first the explicit call dplyr::summarise_if and second, at the documentation chunck, the entry:

#' @importFrom magrittr %>%

They both make sure that you package will have all needed imports for this function to work, i.e. packages dplyr and magrittr in this case. At this point we need to add these two dependencies in the file DESCRIPTION but let’s not do it yet. Simply for demo purposes, we will let devtools::check() pick up this error a bit later on.

Step 4: Process documentation

You can now use devtools::document() to build your documentation. From within the package directory, type the following:

# If you are using an RStudio project for your package development you are most 
# likely already in the package directory. If not, navigate into it
# > setwd("./bloodstats")
# and type:
devtools::document()

This function is a wrapper for roxygen2::roxygenize(); it adds .Rd to the man directory, one for each object in your package, assuming you have written comments as suggested in step 3. The function will also update the NAMESPACE file of the main directory with the corresponding imports and exports.

If you see the following warning: Warning: The existing 'NAMESPACE' file was not generated by roxygen2,`and will not be overwritten. go ahead and remove the file NAMESPACE from the root directory. After you do so, re-run the devtools::document() command above.

Step 5: Run checks

devtools::check() or R CMD check will check your code for common issues like documentation mismatches, missing imports etc, including pass or fail of unit tests if such exist.

You should probably run checks quite often. This will help to start curing problems and incosistencies as soon as they appear rather than having to deal with a huge amount of them at a much later stage.

So run the command below

devtools::check()

or use the RStudio build-in shortcuts if you prefer.

Unless you added the dplyr and magrittr dependencies in your DESCRIPTION above, the check command should now throw an error at the “checking package dependencies” stage. To fix this open DESCRIPTION and update it as shown below:

Package: bloodstats
Title: Utilities for Metabolomics Data
Version: 0.0.1
Authors@R:
    person(
        "Maria", "Kalimeri", email = "maria.kalimeri@nightingalehealth.com",
        role = c("aut", "cre")
    )
Description: Functions and other utilities for basic statistics on blood 
    metabolomics data.
Depends:
    R (>= 3.5.0)
Imports:
    dplyr,
    magrittr
License: MIT + file LICENSE
Encoding: UTF-8
LazyData: true
RoxygenNote: 6.1.1

After running checks again, there should still be a warning

❯ checking DESCRIPTION meta-information ... WARNING
  Invalid license file pointers: LICENSE

which happens as there is a pointer to a file LICENSE that just doesn’t exist. The MIT license is a ‘template’, so if you use it, you need License: MIT + file LICENSE, and a LICENSE file that looks like this:

YEAR: <Year or years when changes have been made>
COPYRIGHT HOLDER: <Name of the copyright holder>

You can add these lines in a file called LICENSE in the package root and run devtools::check() again.

Notice in the DESCRIPTION above that I now increased the version of our package from the development version 0.0.0.9999 to 0.0.1

Step 6: Write tests

This is an important part of package development. The main aims of writing formal tests is to make sure, that you will not break code that used to work, when you come back in the future to add features or improve existing code.

You can use usethis::use_testthat() to set up the package to use tests. This command will do all the necessary steps below:

Create a tests/testthat directory.
Adds testthat to the Suggests field in the DESCRIPTION. The Suggests lines indicate that while your package can take advantage of a package, this is not required to make it work.
Create a file tests/testthat.R that runs all your tests when R CMD check runs. (See more on automated checking here.)

The next step is to actually write the tests. We have only one function at the moment, bloodmeans(). Create an R file with the name test-bloodmeans.R, save it in subdir ./tests/testthat/ and type in the following contents.

context("bloodmeans")

library(magrittr)

res <-
  data.frame(var1 = c(1, 2, 3), var2 = c(4, 5, 6)) %>%
  bloodstats::bloodmeans()

test_that("bloodmeans returns output of expected class", {

  expect_true(
    class(res) == "data.frame"
  )
})

test_that("bloodmeans returns expected result given input", {

  expect_true(
    all(res == data.frame(var1 = 2, var2 = 5))
  )
})

You may now run the tests:

devtools::test()

Or you may use the RStudio build-in shortcuts.

Note that as soon as you add tests in your package, devtools::check() will also include them in the check step.

Refer to the related section in R-packages:tests for more info on proper unit testing.

Step 7: Install your package

From the root directory of the bloodstats folder, type the following.

devtools::install(".")

That will get your package installed in your machine. You can try viewing the documentation of your function by typing

?bloodmeans

Share your R package

Step 8: Make the package a GitHub repo

If you want to reproduce the steps from here onwards you will need a github account.

Just as Hilary in her post, we will not dive into git and GitHub here (let me also refer to Karl Broman’s Git/GitHub Guide for that). For the purposes of this tutorial, I will assume some basic knowledge of git. If you don’t have it, it’s ok, you may simply copy-paste the git commands here, and come back to it later for details.

Step 8a: Push initial commit

Let us follow the steps in this guide in order to create a GitHub repository for our existing R package. Do make the following addition though: between steps 4 and 5, add a file with the name .gitignore with at least the following contents which constitute example lines of .gitignore contents for an R package repository.

.RData
.Rhistory
.Rproj.user
.Ruserdata

If a macOS user you may also want to gitignore .DS_Store

.gitignore contains the names of files that you don’t want to include in your git repository; so instead of not staging them every time you commit, you permanently ignore them by adding them in this file.

Step 8b: Add a README

This is an important step so that people that land in your github repository page will have an overview of your project.

Create a file README.md in the root of your package directory with the following contents.

bloodstats
----------

A package with utilities for basic statistics on blood biomarkers.

Alternatively, as an R user, especially if you use RStudio, it’s handy to use an Rmarkdown document to write your description. Create a file called README.Rmd (remember to add this into a file called .Rbuildignore in the top level of the package directory so that devtools::check() wont give you additional notes about non-standard files in your package folder) with the contents suggested below.

When you copy-paste the README contents below remember to remove the backslash “\” before the chuck ``` definitions

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

# bloodstats

A dummy package for demo and testing. 

## Installation

\```{r, eval = FALSE}
# # Install devtools if you don't have it already
# install.packages("devtools")
devtools::install_github("mariakalimeri/bloodstats")
\```


## Examples

\```{r, eval = FALSE}
data.frame(var1 = c(1, 2, 3), var2 = c(4, 5, 6)) %>%
  bloodstats::bloodmeans()
\ ```

If you use the Rmd approach, remember to knitr::knit() your document in order to
have a README.md final output.

It’s good to enrich your README with quick, getting-started examples and other notes. I often try to browse around other GitHub repositories to explore good practices, see for example the README file for patchwork.

Add, commit and push your newly created README.Rmd and README.md files.

git status
git add README*
git commit -m "add README"
git push -u origin master

Step 8c: Enable Travis-CI for your package’s repository

This step requires that you have a Travis-CI account (enabled with your GitHub account).

From your global GitHub > Settings > Applications > Authorized OAuth Apps, make sure you have granted access to Travis-CI.

Locally, add a file called .travis.yml in the root directory of your package. Below are the suggested contents.

# R for travis: see documentation at https://docs.travis-ci.com/user/languages/r
  
language: R
cache: packages

Add, commit and push your newly added .travis.yml file.

git status
git add README*
git commit -m "add .travis.yml"
git push -u origin master

On the browser, in your Travis-CI account, go to Settings (upper right corner) got to Settings and enable Travis-CI for bloodstats. If you don’t see bloodstats in the list of repositories, then sync your GitHub account using the Sync account button on the upper left side and refresh the webpage.

You may now find bloodstats in your active repositories in your Travis-CI dashboard, where you can click on Trigger a build button.

Fingers crossed and your build will pass!

Finally, open your README.Rmd and add a travis CI status icon if you wish. You can find and copy the Status Image if you click on the build | unknown icon next to the Travis-CI project title on the travis-ci webpage. Choose an icon for the master branch (in the first drop down menu) and Markdown (in the second drop down menu). The two lines of code that you should add to the README file will will look like below.

[![Build Status](https://travis-ci.org/mariakalimeri/bloodstats.svg?branch=master)](https://travis-ci.org/mariakalimeri/bloodstats)

Step 9: Create your own R package drat repository

Drat is a recursive acronym: Drat R Archive Template. It’s essentially an R package that allows for creation and use of R Repositories via helper functions that insert packages into a repository, and add repository information to the current R session.

Having an R repository of your own makes sense when you build, distribute and maintain more than one or two R packages.

Step 9a: Create a `drat` git repository in GitHub

source: Drat for package authors

As a package author with a given GitHub account, in order to create your own R package drat repository, all that is needed is a GitHub repository named drat and inside it a the subdirectory src/contrib/.

In a terminal, let’s create the drat folder with its contents and initialize a git repository for it. Let’s place this drat directory in the same parent directory as bloodstats.

mkdir drat 
cd drat
mkdir src
cd src
mkdir contrib

Follow the same steps as in Step 8a above, in order to create a git repository for this drat and push its contents in GitHub.

The next step is to place a package into drat.

Let us built the source package for bloodstats to insert it into drat. Assuming your current working directory is bloodstats type the follwoing

devtools::build()

This will create the file bloodstats_0.0.1.tar.gz in the parent directory of bloodstats.

## insert the bloodstats bundle into the drat repo on local file system
## First change location to the parent directory
setwd("../")
drat::insertPackage("bloodstats_0.0.1.tar.gz", "drat")

Git add, commit and push the latest update of this repo to GitHub.

The next step you need to do is turn on GitHub pages for drat. You do this by going to the GitHub settings of drat and scroll down to the GitHub Pages section. Choose master as your source branch (unless you have a good reason to choose otherwise).

Technically, you now have a drat repository of your own, with its first package in it! Hurray!! You may very well keep this up-to-date manually as you did just now.

Your friend, the R user that wants to use your collection of packages, as well as have an easy way to update them, has already been told that your R package repository is hosted in a GitHub repository named drat (e.g. https://mariakalimeri.github.io/drat/`). He now needs to add your drat repo to his list of R repositories:

drat::addRepo("mariakalimeri")

and he is good to go. E.g. he may type

available.packages(repos = getOption("repos")["mariakalimeri"])

and hopefully see the available bloodstats packages sitting there happily with all its info. He can install and update packages the R-usual way, i.e.

install.packages("bloodstats")

Mind that drat::addRepo() will add the new R repository only in the current running R session. To update your R repository list more permanently, open your ~/.Rprofile and type
local({
  r <- getOption("repos")
  r["CRAN"] <- "https://cran.rstudio.com/"
  r["mariakalimeri"] <- "https://mariakalimeri.github.io/drat/"
  options(repos = r)
})
where above I am also selecting the RStudio CRAN mirror.

Step 9b: Combining Drat and Travis CI

(I didn’t have time to write down this section properly but the relevant link is provided below)

A next step is to allow Travis-CI to push automatic updates of bloodstats into drat. To achieve this, use the link below and follow the steps in the workflow section.

source: Use Travis CI for automatic pushes of succesfull builds into drat

Step 10: Improve package and update!

Step 10a: Improve existing function

Let us add an extra input parameter, var, in bloodmeans(). If var is specified, and it’s a valid df variable, the function will attempt to group the input data frame by it and return means for each group.

Update bloodmeans.R with the following and try running tests for your package. Do they go through?

#' Extract Mean Values of Blood Biomarkers
#'
#' This function accepts a dataframe as input and extracts the mean value of
#' each numeric variable.
#'
#' @param df a \code{data.frame} with at least one numeric variable in order to
#' get a non-empty result.
#' @param var NULL (default) or the unquoted name of the variable by which to
#' group the input df.
#' @return a data.frame with the mean values of each numeric
#' @importFrom magrittr %>%
#' @author John Doe
#' @export
#' @examples
#' library(magrittr)
#' data.frame(x1 = c(1,2,3), x2 = c(4,5,6)) %>%
#'   bloodstats::bloodmeans()
bloodmeans <- function(df, var = NULL) {

  # Quote input
  var <- rlang::enquo(var)

  if (!rlang::quo_is_null(var)) {
    df <-
      df %>%
      dplyr::group_by(!!var)
  }

  df %>%
    summarise_if(is.numeric, mean, na.rm = TRUE)
}

Try running devtools::test() again. You most likely got an error:

test-bloodmeans.R:5: error: (unknown)
could not find function "summarise_if"

To fix this substitute summarise_if above, with dplyr::summarise_if. If your tests pass now, run devtools::document() to update documentation and run also a devtools::check(). Are the checks clean (is rlang dependency in DESCRIPTION?). When all is ok, bump your package’s version number (what should it be? You now added a new parameter to bloodmeans so you updated the interface of the function a bit. I would update the version to 0.1.0 at this point). Install the new version locally devtools::install(), git add, commit and push your updates to GitHub.

git status
git add .
git commit -m "add grouping variable in bloodmeans and bump version number"
git push -u origin master

The last push to GitHub has hopefully triggered a new Travis-CI built. You can check this from your travis-ci account. If your checks were clear locally, the travis built will likely go through as well.

Step 10b: Add demo data to the package

Let’s create a data frame with random values as carry-on data for this package.

We will place this in a subdirectory called data as is the most common location for package data. Each file in this directory should be a .RData file created by save() containing a single object (with the same name as the file). The easiest way to adhere to these rules is to use devtools::use_data().

Let’s write a small script to first create the data. We will place the script in a subdirectory called data-raw. We don’t need this subdirectory to the in the bundled version of the package, so we also add it to .Rbuildignore. We do all the above with the following commant.

usethis::use_data_raw()

Create a script generate_pkg_example_data.R and save it inside data-raw. The contents of the script are shown below.

# Create a data frame with random values
#
library(tidyverse)

# Set random seed to make things reproducible
set.seed(34)

# Create data frame
df_example <-
  tibble(id = rep(paste0("id_", seq(1, 50)), 2),
         biofluid = rep(c("blood", "urine"), each = 50),
         males = sample(1:100, size = 100),
         females = sample(1000:1200, size = 100))

usethis::use_data(df_example, overwrite = TRUE)

The last line will create a folder data and add the example data frame inside it, in the proper format. (You may also add package usethis in the Suggests field of your DESCRIPTION file.)

Finally, you may also update the examples of bloodmeans() to demonstrate the new feature with parameter var

# Grouped bloodmeans
bloodstats::df_example %>%
    bloodstats::bloodmeans(var = group)

Next step is to add some documentation for the df_example. In the R subdirectory, create a script called data.R with the following contents.

#' Random Metabolic Data
#'
#' A dataframe with random data for blood and urine, males and females.
#'
#' @format A data frame (tibble) with 100 rows and 4 columns.
#' \describe{
#'   \item{id}{ID of individual.}
#'   \item{biofluid}{Biofluid type, blood or urine}
#'   \item{males}{Log odds for incident type 2 diabetes.}
#'   \item{females}{Standard error.}
#' }
#' @source Generated with a random \code{set.seed(34)}
"df_example"

Rerun devtools::document(), devtools::check(document = FALSE) and if all clear you can commit and push your changes to GitHub.

There is one thing we didn’t do and should have. That is, after updating the bloodmeans() function to return grouped means if a valid variable name is provided, we didn’t add a unit test for this case. I will leave this as a take-home exercise for you.

Further discussion on package development

When to build a package. Examples of practical use cases:
- Data reading and cleaning functions, specific to the type of data your are handling
- Data analysis workflow with a demo vignette
- A set of vusalization tools
Package scope. Should a package have an overall scope? Is a package with a bunch of relatively unrelated functions or a not-well-defined scope better than no package at all?

‘There is a lot to learn on package development, but don’t feel overwhelmed. Start with a minimal subset of useful features (e.g. just an R/ directory!) and build up over time. To paraphrase the Zen monk Shunryu Suzuki: “Each package is perfect the way it is — and it can use a little improvement”.’

Usefull links and references

Take a look at our ggforestplot, Nightingale’s first open source R package!! Website made with pkgdown
R packages by Hadley Wickham
Karl Broman’s Git/GitHub Guide
How to create a GitHub repo from an existing directory
How to create a GitHub page for your project
Tidy evaluation and most common actions
R code styler
Software semantic versioning
R for Data Science

Writing sharable (R) code. Uh… and sharing it!

Presented at an R-Ladies Helsinki meeting on 04/04/2019

Maria Kalimeri, Senior Data Scientist @ Nightingale Health

05 April 2019

Intro

Prerequisites

Create an R package

Step 1: Create a package directory

Step 2: Add functions

Step 3: Add documentation

Step 4: Process documentation

Step 5: Run checks

Step 6: Write tests

Step 7: Install your package

Further discussion on package development

Usefull links and references

Writing sharable (R) code. Uh… and sharing it!

Presented at an R-Ladies Helsinki meeting on 04/04/2019

Maria Kalimeri, Senior Data Scientist @ Nightingale Health

05 April 2019

Intro

Prerequisites

Create an R package

Step 1: Create a package directory

Step 2: Add functions

Step 3: Add documentation

Step 4: Process documentation

Step 5: Run checks

Step 6: Write tests

Step 7: Install your package

Share your R package

Step 8: Make the package a GitHub repo

Step 8a: Push initial commit

Step 8b: Add a README

Step 8c: Enable Travis-CI for your package’s repository

Step 9: Create your own R package drat repository

Step 9a: Create a drat git repository in GitHub

Step 9b: Combining Drat and Travis CI

Step 10: Improve package and update!

Step 10a: Improve existing function

Step 10b: Add demo data to the package

Further discussion on package development

Usefull links and references

Step 9a: Create a `drat` git repository in GitHub