Hands-on building an R package

Before proceeding…


Attribution: This content has been developed on the basis provided by Chapter 1: The Whole Game (R packages book by Hadley Wickham & Jenny Bryan, 2e) and the UBC course notes Reproducible and Trustworthy Workflows for Data Science by Tiffany Timbers, Joel Östblom, Florencia D’Andrea, and Rodolfo Lourenzutti

We assume you have followed the installation instructions we shared before the workshop, registered for a GitHub account, and installed Git (more information here)

Toy package: {eda}

  • Suppose our package’s purpose is to provide data wrangling and summary functions to conduct a proper exploratory data analysis (hence the name EDA)
  • Therefore, our toy package’s name will be {eda}
  • We will be switching back and forth between these slides and hands-on practice in RStudio

Installing auxiliary R packages

  • Before starting, we assume you have installed the necessary packages {devtools} and {usethis} (also covered in the installation instructions sent before the workshop)
  • {devtools} is a meta-package that encompasses more focused development-related R packages
  • {usethis} automates tasks related to project setup and development to build R packages

Installation and loading code

  • Installing {devtools} automatically installs {usethis}
  • The same applies when loading the packages: library(devtools) also attaches {usethis}
# Uncomment and run the line below in the R console if you have not yet installed the `devtools` package
# install.packages("devtools") 

library(devtools)
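
If we are unsure whether the installation step worked, a quick check like the one below can help. It is only a sketch: the variable names `required`, `installed`, and `not_installed` are ours, and it reports what is missing without installing anything.

```r
# Packages this walkthrough relies on
required <- c("devtools", "usethis")

# requireNamespace() returns TRUE/FALSE without attaching the package
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs install.packages()
not_installed <- required[!installed]
if (length(not_installed) > 0) {
  message("Still to install: ", paste(not_installed, collapse = ", "))
} else {
  message("All set: ", paste(required, collapse = ", "))
}
```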

create_package()

  • create_package() will initialize our new package in a directory of our choice
  • I will initialize the {eda} package in my Desktop folder for easier reference

Don’ts when choosing your home directory

  • Your package shouldn’t be hosted in another RStudio Project, R package, or Git repository
  • Your package shouldn’t be hosted in an R package library (i.e., where we usually install other packages from CRAN)
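
To make the first rule concrete, here is a minimal sketch of how we could check a candidate location ourselves. The helper `is_nested_project()` is hypothetical (it is not part of {usethis} or {devtools}) and it only looks for three common markers: a .git folder, a DESCRIPTION file, or an .Rproj file.

```r
# Hypothetical helper: TRUE if `path` or any parent directory already looks
# like a Git repository, an R package, or an RStudio Project
is_nested_project <- function(path) {
  dir <- normalizePath(path, mustWork = FALSE)
  repeat {
    markers <- c(
      file.exists(file.path(dir, ".git")),
      file.exists(file.path(dir, "DESCRIPTION")),
      length(list.files(dir, pattern = "\\.Rproj$")) > 0
    )
    if (any(markers)) return(TRUE)
    parent <- dirname(dir)
    if (identical(parent, dir)) return(FALSE) # reached the top, nothing found
    dir <- parent
  }
}

# Demo in a temporary folder: a directory that already holds a DESCRIPTION
# file is a poor home for a new package
existing_pkg <- file.path(tempdir(), "existing_pkg")
dir.create(file.path(existing_pkg, "subfolder"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(existing_pkg, "DESCRIPTION"))

nested <- is_nested_project(file.path(existing_pkg, "subfolder"))
nested
#> [1] TRUE
```

For our own package, we would pass the parent folder of the new package, e.g. is_nested_project("~/Desktop").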

Code

create_package("~/Desktop/eda")


Project layout description (ignore-type files)

  • .gitignore is used by Git and lists the files created by R and RStudio (many of them “hidden”) that shouldn’t be tracked in the repository
  • .Rbuildignore lists the files created via R and RStudio that should be left out when building our package (e.g., eda.Rproj)
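
Each line of .Rbuildignore is a Perl-like regular expression matched case-insensitively against file paths relative to the package root. The sketch below imitates that matching with grepl(); the patterns are of the kind create_package() writes for {eda}, and the `files` vector is a made-up listing for illustration.

```r
# Patterns of the kind create_package() writes to .Rbuildignore
patterns <- c("^eda\\.Rproj$", "^\\.Rproj\\.user$")

# Hypothetical top-level entries of the package folder
files <- c("DESCRIPTION", "NAMESPACE", "eda.Rproj", ".Rproj.user", "R")

# An entry is excluded from the build if any pattern matches its path
ignored <- vapply(
  files,
  function(f) any(vapply(patterns, grepl, logical(1), x = f, perl = TRUE, ignore.case = TRUE)),
  logical(1)
)

# Entries that would be left out: eda.Rproj and .Rproj.user
files[ignored]
```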

Project layout description (other components)

  • DESCRIPTION contains our package’s metadata and declares its dependencies
  • eda.Rproj is the RStudio project file
  • NAMESPACE contains the package’s functions to export along with imports from other packages
  • An R/ directory, which will contain all the package’s functions as .R scripts

use_git()

  • Besides creating the RStudio project file eda.Rproj, we will initialize a Git repository via use_git()
  • A Git repository will eventually allow us to publish and share our package on GitHub.com
library(devtools)

use_git()

What does this function specifically do?

  • It creates a hidden .git directory in the folder {eda}
  • Furthermore, it creates your initial commit

Ensuring we made our initial commit

  • Let’s relaunch our RStudio project eda.Rproj
  • On the Git tab, click on the clock icon to check your commit history (note your Git user name is shown in the Author column)

Write your first function!

  • Recall we aim to package our code (shown below), which counts the number of observations per class, so that it works for any data frame (not just mtcars) and we (and others) can reuse it more easily in other projects
library(tidyverse)

mtcars |>
  group_by(cyl) |>
  summarize(count = n()) |>
  rename("class" = cyl)
# A tibble: 3 × 2
  class count
  <dbl> <int>
1     4    11
2     6     7
3     8    14

This is our function count_classes()

  • It receives a data frame or data frame extension (e.g., a tibble) as data_frame, along with an unquoted column name, class_col, identifying the column that contains the class labels
library(dplyr)

count_classes <- function(data_frame, class_col) {
  if (!is.data.frame(data_frame)) {
    stop("`data_frame` should be a data frame or data frame extension (e.g. a tibble)")
  }

  data_frame |>
    group_by({{ class_col }}) |>
    summarize(count = n()) |>
    rename("class" = {{ class_col }})
}

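The use of {{ }}

  • Note we’re using double curly brackets in {{ class_col }}
  • An indirection with unquoted column names (such as class_col) needs extra support (via the double curly brackets): inside count_classes(), class_col is only an argument name, so {{ }} tells {dplyr} to use the column the caller supplied (e.g., cyl) and look it up in the data frame, rather than search for a column literally called class_col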
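
To build some intuition for what the double curly brackets do, here is a base-R analogy. It is not how {dplyr} implements {{ }}; the function `count_by()` is our own illustration of the same two steps: capture what the caller typed, then evaluate it inside the data frame.

```r
# Base-R analogy (not dplyr's implementation): capture the column name the
# caller typed, then evaluate it with the data frame as the lookup context
count_by <- function(data_frame, class_col) {
  col_expr <- substitute(class_col)                     # the unevaluated symbol, e.g. cyl
  classes <- eval(col_expr, data_frame, parent.frame()) # found inside the data frame
  as.data.frame(table(class = classes), responseName = "count")
}

# cyl does not exist in the global environment, yet the call works
out <- count_by(mtcars, cyl)
out
```

The counts should agree with the {dplyr} pipeline shown earlier (11, 7, and 14 cars with 4, 6, and 8 cylinders), although here the class column is a factor rather than a number.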

Syntax package::function()

  • count_classes() includes four {dplyr} functions:
    • group_by(), summarize(), n(), and rename()
  • Alternatively, we can use the syntax dplyr::group_by(), dplyr::summarize(), dplyr::n(), and dplyr::rename()
  • This syntax is recommended since it makes explicit which package each dependency is coming from within our package functions

Re-writing our function

count_classes <- function(data_frame, class_col) {
  if (!is.data.frame(data_frame)) {
    stop("`data_frame` should be a data frame or data frame extension (e.g. a tibble)")
  }

  data_frame |>
    dplyr::group_by({{ class_col }}) |>
    dplyr::summarize(count = dplyr::n()) |>
    dplyr::rename("class" = {{ class_col }})
}

use_r()


library(devtools)

use_r("count_classes")


  • This helper function from {usethis} allows us to create an .R script in the R/ subdirectory of {eda}

How does it look in RStudio?

  • use_r() creates the .R script count_classes.R
  • The Git tab keeps track of all our changes in the repository after our initial commit

Local commit of changes

  • Now, we need to commit our work in count_classes.R
  • We can do this via RStudio:
    1. In the Git tab, check the box in the Staged column
    2. Click on the Commit button

Then…

  1. Ensure the changes are checked in the Staged column
  2. Type a commit message: Add count_classes()
  3. Click on the Commit button

Commit confirmation

  • We will get the below message once we have locally committed our changes
  • This local committing process will be repeated every time we make significant changes to our package

load_all()

  • The next step in our package building process is to informally test our function count_classes()
  • The function load_all() from {devtools} makes count_classes() available for experimentation
load_all()


Then, we test our function

  • We use data frame mtcars and column cyl
  • The above variables correspond to function arguments data_frame and class_col, respectively
count_classes(data_frame = mtcars, class_col = cyl)
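
Informal testing is also a good moment to try an input that should fail. The sketch below repeats the function definition so the snippet runs on its own (after load_all() it is already available) and passes a character string instead of a data frame; the variable name `result` is ours.

```r
# Same definition as in R/count_classes.R, repeated to keep the snippet self-contained
count_classes <- function(data_frame, class_col) {
  if (!is.data.frame(data_frame)) {
    stop("`data_frame` should be a data frame or data frame extension (e.g. a tibble)")
  }

  data_frame |>
    dplyr::group_by({{ class_col }}) |>
    dplyr::summarize(count = dplyr::n()) |>
    dplyr::rename("class" = {{ class_col }})
}

# A character string is not a data frame, so the guard clause should fire
# before any {dplyr} code runs; tryCatch() captures the error message
result <- tryCatch(
  count_classes(data_frame = "not a data frame", class_col = cyl),
  error = function(e) conditionMessage(e)
)
result
```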

Setting up our remote GitHub repository

  • Via our GitHub account, we will create a remote {eda} public repository (we will add the README.md file later)

Pushing our local commits to the remote repository

  • In the Terminal tab of our RStudio session, we will paste the Git commands that GitHub.com shows in the section “...or push an existing repository from the command line”

This is what our remote repository should look like:


And these are our two initial commits in the remote repository:


check()

  • check() verifies that the sources of an R add-on package work correctly
  • To do so, check() runs R CMD check, the same tool we could run in the shell (i.e., terminal)
  • We use check() from {devtools} via the R Console
check()

Heads-up!

  • The output of this function is quite long!
  • That said, we essentially need to check the final summary (i.e., the previous screenshot)
  • It’s crucial to address any issue we might encounter (which will be shown in the check() output)
  • For our toy package {eda}, we will only encounter two warnings (package dependency and missing license, to be addressed later on)
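
Since the output is long, it can help to look at the counts only. check() returns (invisibly) an object whose errors, warnings, and notes elements are character vectors. The sketch below uses a hypothetical helper, `summarize_check()`, and a mock object with placeholder strings, so it runs without a package at hand.

```r
# Hypothetical helper: count the issues stored in a check result
summarize_check <- function(res) {
  c(
    errors = length(res$errors),
    warnings = length(res$warnings),
    notes = length(res$notes)
  )
}

# Mock object standing in for the value of check(); the strings are
# placeholders, not real R CMD check output
mock_result <- list(
  errors = character(0),
  warnings = c("placeholder: dependency warning", "placeholder: license warning"),
  notes = character(0)
)

counts <- summarize_check(mock_result)
counts

# With a real package we would instead run:
# res <- check()
# summarize_check(res)
```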

Edit DESCRIPTION

  • It’s time to edit our package’s metadata in the DESCRIPTION file
  • It currently looks like this:
Package: eda
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R: 
    person("First", "Last", , "first.last@example.com", role = c("aut", "cre"),
           comment = c(ORCID = "YOUR-ORCID-ID"))
Description: What the package does (one paragraph).
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1

Heads-up!

Let’s proceed with the edits!

  • We will update the fields Title, Authors@R, and Description
  • If you don’t have an ORCID, you can delete comment = c(ORCID = "YOUR-ORCID-ID")
Package: eda
Title: A Package for Data Wrangling
Version: 0.0.0.9000
Authors@R: 
    person("G. Alexi", "Rodriguez-Arelis", , "alexrod@stat.ubc.ca", role = c("aut", "cre"))
Description: Provide data wrangling and summary functions to conduct a proper 
    exploratory data analysis.
License: `use_mit_license()`, `use_gpl3_license()` or friends to pick a
    license
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
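
The Authors@R field holds R code: a call to person() from {utils}. As a quick sanity check before saving, we can run the same call in the R Console and inspect the result (the variable name `author` is ours).

```r
# Evaluating the Authors@R entry yields a person object;
# the empty third argument leaves the (unused) middle name unset
author <- utils::person(
  "G. Alexi", "Rodriguez-Arelis", , "alexrod@stat.ubc.ca",
  role = c("aut", "cre")
)

author$family # family name
author$email  # contact email
author$role   # "aut" = author, "cre" = creator (maintainer)
```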

Saving, committing, and pushing

  • Once we have edited the DESCRIPTION file, we need to save our changes
  • Then, as we did with the count_classes() function, let’s locally commit these changes (use the commit message Edit DESCRIPTION)
  • Moving forward, within the Git tab in RStudio, we will remotely push our edits to our public repository on GitHub by clicking on the Push button

Ensuring we have remotely pushed our edits

  • Firstly, within RStudio, you will get the below confirmation

  • Then, we will be able to see our third commit in our remote repository

use_mit_license()

  • To address one of the warnings obtained in the output of check(), we need to include a LICENSE.md
  • Hence, we can use function use_mit_license() from {usethis} via the R Console
use_mit_license()

What does LICENSE.md look like?

Note: More about license matters later on in this workshop

Let’s take a look at the DESCRIPTION file!

  • The License field gets updated as follows:
Package: eda
Title: A Package for Data Wrangling
Version: 0.0.0.9000
Authors@R: 
    person("G. Alexi", "Rodriguez-Arelis", , "alexrod@stat.ubc.ca", role = c("aut", "cre"))
Description: Provide data wrangling and summary functions to conduct a proper 
    exploratory data analysis.
License: MIT + file LICENSE
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.1
  • Let’s commit and push these changes to the remote repository (use the commit message Use MIT license)

document()

  • Let’s document our count_classes() function via package {roxygen2}
  • In RStudio, let’s open R/count_classes.R in the source editor

Using {roxygen2}

  • With the cursor inside the function, go to Code > Insert Roxygen Skeleton
  • We will get a documentation template we have to fill out

Filling out the template

  • Let’s copy and paste the below documentation over the previous template for count_classes()
#' Count class observations
#'
#' Creates a new data frame with two columns, 
#' listing the classes present in the input data frame,
#' and the number of observations for each class.
#'
#' @param data_frame A data frame or data frame extension (e.g. a tibble).
#' @param class_col Unquoted column name of column containing class labels.
#'
#' @return A data frame with two columns. 
#'   The first column (named class) lists the classes from the input data frame.
#'   The second column (named count) lists the number of observations for each class from the input data frame.
#'   It will have one row for each class present in input data frame.
#' @export
#'
#' @examples
#' count_classes(mtcars, cyl)

Then…

  • Save your changes in R/count_classes.R
  • Commit and push these changes to the remote repository (use the commit message Add roxygen header to document count_classes())

Using document() from {devtools}

  • Let’s run the document() function in the R Console
document()


  • This function creates man/count_classes.Rd in {eda}, which is the help page we get when typing ?count_classes in the R Console; because of the @export tag, it also adds export(count_classes) to NAMESPACE

Viewing the outputs from document()

  • Commit and push these changes to the remote repository (use the commit message Run document())

Using check() again

  • Since we already included LICENSE.md in {eda}, let’s use check() again in the R Console to ensure the license-related warning is gone

install()

  • It’s time to install our package {eda}
  • That said, instead of using install.packages() as with any package on CRAN, we will use install() from {devtools}
  • Note that install() installs the local package found in the current working directory into our R library, whereas install.packages() installs from a package repository
install()
library(eda)
count_classes(mtcars, cyl)

Let’s do it in the R console

use_package()

  • Note that count_classes() uses functions from package {dplyr}
  • Therefore, {dplyr} becomes a dependency

How to include a dependency in our package?

  • We can use function use_package() from {usethis}
use_package("dplyr")


  • This function will include {dplyr} in our DESCRIPTION, more specifically in Imports
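
To confirm the change, we can read the Imports field back with read.dcf(), since DESCRIPTION follows the Debian Control File format. The sketch below works on a temporary stand-in file with only a few fields, so the file name `desc_path` and its contents are assumptions for illustration.

```r
# Stand-in for the package's DESCRIPTION, written to a temporary file
desc_path <- tempfile("DESCRIPTION")
writeLines(c(
  "Package: eda",
  "Version: 0.0.0.9000",
  "Imports:",
  "    dplyr"
), desc_path)

# read.dcf() parses the fields into a character matrix with one row per record
desc <- read.dcf(desc_path, fields = c("Package", "Imports"))
trimws(desc[1, "Imports"])

# Inside the package project we would read the real file instead:
# read.dcf("DESCRIPTION", fields = "Imports")
```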

Let’s do it in the R console

  • Commit and push these changes to the remote repository (use the commit message Import dplyr)

use_readme_rmd()

  • Our remote repository still doesn’t have a README.md file describing the package, installation, and usage
  • We can automatically generate one via use_readme_rmd() from {usethis}
use_readme_rmd()


  • This function will generate an .Rmd template, which we have to fill out

The .Rmd file

  • Fill out the template, render it to README.md via build_readme(), commit, and push these changes to the remote repository (use the commit message Write README.Rmd and render)

Using check() and install()

  • We’re done with the basic steps to build our R package!
  • Again, we use check() (to ensure all warnings are gone!), and then reinstall the package via install()

Review

  • We can partially review the previous process (test() will be covered later on) via the below diagram from Chapter 1: The Whole Game (R packages book by Hadley Wickham & Jenny Bryan, 2e)