From user to developer - the why and how of packaging your code
Introduction
Whenever we perform computational analysis of biological data, we use software packages. Sometimes these are built and sold to us by big companies, typically the same ones that manufacture the machines used to generate the data. However, in many cases, the software packages we use are built by independent researchers, and distributed to be used by others.
In this lecture, we will discuss why it is a good idea to structure and package (and possibly even distribute) your own code, and provide some ideas on how you can start the path from being a user of tools to becoming a developer yourself. We will focus on the R language, and show in practice the steps one would go through to turn existing code into a package which is installable by others. Most of the concepts are directly transferable to other languages, such as Python; however, the packaging system in R is arguably more standardized and more tooling is available to help you with the packaging, so we use that as an example.
Functions
The first step towards structuring your code is often to start writing functions. Functions are, effectively, recipes that take a set of input arguments, apply a prescribed sequence of operations to these, and return the results. Let’s say, for example, that we have a transformation defined by the following rule:
\[x \mapsto \frac{\log(3x^2)}{\sqrt{x}}\]
and that we need to repeatedly apply this to numbers at various places in our analysis workflow. One option would be to write out the transformation explicitly each time we would like to apply it. A disadvantage of this approach appears if we, for some reason, need to modify the transformation - say, to replace the 3 with a 4 in the numerator above. In this case, we need to make sure that we find all the places in the code where the transformation is applied, and make the change everywhere.
The other option would be to define a function, which applies this transformation. In R, this would be done as follows:
transf <- function(x) {
  log(3 * x ^ 2) / sqrt(x)
}
After defining this function, we can simply call it at each place in the code where we want to apply our transformation.
transf(2)
[1] 1.757094
If we need to change the transformation, we only need to change it in one place (in the function definition), and this will propagate to wherever the function is called. Collecting code in functions is therefore an excellent way of following the DRY (don’t repeat yourself) practice in software development, and can be helpful regardless of whether one is developing software packages or writing code to analyze data.
In reality, most functions are larger than the one we showed above, and often perform several analysis steps (say, normalization, parameter estimation and differential expression analysis). When designing a function, it is worth thinking about the scope. Ideally, each function should do one, well-defined thing. This makes it easier to generate modular code (where building blocks can be combined in different ways), and also makes testing and bug finding much easier.
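To illustrate the idea of small, composable building blocks, here is a minimal sketch. The function names (normalize_to_fraction, log_transform) and the input values are made up for illustration and are not part of the example package built later in this lecture.

```r
# Hypothetical building blocks, each doing one well-defined thing

# Scale a numeric vector so that its values sum to 1
normalize_to_fraction <- function(x) {
  x / sum(x)
}

# Log-transform a numeric vector after adding a pseudocount
log_transform <- function(x, pseudocount = 1) {
  log2(x + pseudocount)
}

counts <- c(10, 20, 70)  # made-up input values

# The blocks can be used on their own ...
fractions <- normalize_to_fraction(counts)
log_counts <- log_transform(counts)

# ... or combined in different ways
log_fractions <- log_transform(normalize_to_fraction(counts), pseudocount = 0.01)
```

Because each function has a narrow scope, it can be tested in isolation and reused in a different order in another analysis.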
In practical data analysis, it may be tempting to think that a particular analysis is a “one-off”, and something that will never be done exactly that way again. In our experience, this is very rarely the case. There will be many situations where you will have to go back to code you wrote a long time ago, to modify an analysis, add new samples, or apply the same approach to another data set. Thus, writing clean, structured code and functions from the start can in fact save you lots of time, even if it takes a little bit longer in the moment.
Documentation
Whenever you have to share a function (and this includes with your future self!), documentation becomes essential. It should be clear from the documentation (without having to actually look into the code) what the function does, what type of inputs it expects, and what it returns. There should also be a note about what other software packages the function requires in order to work correctly. Different languages have different ways of encoding this type of documentation - for Python, you would write a docstring, while for R, the roxygen preamble is a common way to capture it.
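As a minimal sketch of what such documentation could look like in R, here is the transf function from above with a roxygen preamble. The wording of the title, description and parameter text is our own illustration; only the function body comes from the example above.

```r
#' Apply a log-based transformation
#'
#' Computes log(3 * x^2) / sqrt(x) for each element of x.
#' Uses base R only, so no additional packages are required.
#'
#' @param x A numeric vector of positive values.
#'
#' @return A numeric vector of the same length as x.
#'
#' @examples
#' transf(2)
transf <- function(x) {
  log(3 * x ^ 2) / sqrt(x)
}
```

The preamble answers the three questions raised above (what the function does, what it expects, and what it returns) without the reader having to inspect the code.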
Beyond single functions
As you start collecting multiple functions that each do a single, well-defined thing, the question arises of how to structure them. Do you put them all in a single script? In multiple scripts (if so, how do you make sure that you keep them all together so that you can use them in a future project)? How do you encode the dependencies between them, and make sure that any external dependencies are available?
This is where packaging comes into the picture. A package (or library, or module, depending on the language) is a collection of functions, which are distributed together as a bundle. The package also contains the documentation of all those functions, and a list of all the dependencies that need to be available for them to work. Packages can also contain other components, such as automatic tests that make sure that the functions work as expected, and information about who wrote the package and how you are allowed to use it (copyright and license information).
Next, we will see how to construct an R package from scratch, starting from a couple of very small functions.
Let’s build an R package!
Starting point
Let’s assume that we have written two functions, defined below. One of them (say_hello) takes a string representing a name as input, and prints out a greeting for that name. The other (get_age_in_days) takes a number representing the age of a person in years, and calculates the corresponding number of days by multiplying by 365.25 and rounding the result to the nearest integer.
say_hello <- function(name) {
  message("Hello ", name, "!")
}
get_age_in_days <- function(age_years) {
  round(age_years * 365.25)
}
say_hello("Charlotte")
Hello Charlotte!
get_age_in_days(23)
[1] 8401
Set up the package skeleton
Today, there are several ways of setting up the skeleton for a new R package in an automated fashion. We will do it by creating a new RStudio Project. For this, we open RStudio, click on the arrow next to Project (None) (or the name of the current project) in the top right corner, and choose New Project.
Then, we select New Directory.
We tell RStudio that we want the project to correspond to an R Package.
And finally we provide some details, including the name we want to give our package.
After clicking Create Project, RStudio will open a new session, with a package skeleton set up for us.
Before starting to add our own code, we remove some of the example code that we don’t want to include (we will replace this with our own files):
- The R/hello.R script. We will generate our own R scripts with functions.
- The man/hello.Rd file. This is the documentation corresponding to the removed example function.
- The NAMESPACE file. This is a list of functions that are exported from, and imported by, the package, which will be automatically regenerated later.
Adding package metadata
Next, let’s start editing the package files with our own information. We start with the DESCRIPTION file, which contains the high-level metadata about our package.
We then edit it to contain our specific information, leaving the License field for a moment.
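As a rough sketch, an edited DESCRIPTION file could look something like the following. The package name, version and author name are taken from elsewhere in this lecture; the title, description text and email address are placeholders invented for illustration, and the License line is deliberately left as a placeholder at this stage.

```text
Package: exampleRPackage
Type: Package
Title: Example Package with Two Small Functions
Version: 0.1.0
Authors@R: person("Charlotte", "Soneson", email = "name@example.org",
    role = c("aut", "cre"))
Description: Provides two small example functions, one that prints a
    greeting and one that converts an age in years to days.
License: (placeholder, filled in below)
Encoding: UTF-8
```

Each line is a field name followed by a colon and its value; continuation lines are indented.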
We set the initial package version to 0.1.0. It is good practice to bump the version number every time a significant change is made to the package, both in order to alert users of changes and for reproducibility reasons - it is expected that if the same version of a package is used, the same results will be obtained.
To decide on a license, consult e.g. this page. This is an important consideration if you are planning on making your package public, as it regulates how others may use the code. For now, we will choose the MIT license. The most straightforward way of adding all the required information is via the usethis package, which provides a large number of convenience functions that are helpful during package development. We add the license information as follows:
usethis::use_mit_license(copyright_holder = "Charlotte Soneson")
Note that we also specify the copyright holder. Depending on where you work, this may be you and your co-authors, or your employer.
usethis will print out helpful messages about what it’s doing in the R console.
We see that it has modified the License field in the DESCRIPTION file, and additionally added two files (LICENSE and LICENSE.md) containing the copyright information and the text of the chosen license.
Add R code
Now, let’s add our two functions to our package. To do this, we first create a new file R/say_hello.R, and paste our say_hello function in there. Next, we need to add documentation. As mentioned above, we will use roxygen2 to write the documentation. In RStudio, a roxygen skeleton can be generated by placing the cursor inside the function, and clicking Code -> Insert Roxygen Skeleton.
This generates a preamble to the function, where each row is prefixed by #'. We then fill in this skeleton with the specific information for our function.
For each argument, we indicate what type of value we expect (@param). The @export line indicates that the function will be exported (i.e., available to use for anyone who installs the package). It’s also useful to provide simple example code to help a new user get to know your function.
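A filled-in preamble for say_hello could look roughly like the sketch below. The title and the wording of the @param and @return text are our own illustration, not necessarily what the original file contained.

```r
#' Print a greeting
#'
#' @param name A character string giving the name of the person to greet.
#'
#' @return Called for its side effect (a message); returns NULL invisibly.
#'
#' @export
#'
#' @examples
#' say_hello("Charlotte")
say_hello <- function(name) {
  message("Hello ", name, "!")
}
```

The lines starting with #' are ordinary R comments, so the file can still be sourced as a regular script; roxygen2 only interprets them when the documentation is generated.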
To generate the actual documentation (which will be displayed in R when typing ?say_hello) we use another package that provides lots of helpful functionality for building and testing R packages, namely devtools. The devtools::document() function creates an .Rd file for each documented function in the man/ folder:
devtools::document()
It is also responsible for populating the NAMESPACE file, which lists all the functions that are exported from our package, as well as all functions from other packages that are being imported and used by our functions (so far, we don’t have any such functions).
We can go through the same procedure to generate and document our other function (get_age_in_days).
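Assuming that both functions carry an @export tag in their roxygen preamble, the NAMESPACE file regenerated by devtools::document() would be expected to look roughly like this:

```text
# Generated by roxygen2: do not edit by hand

export(get_age_in_days)
export(say_hello)
```

The first line is a marker written by roxygen2; the file should not be edited by hand, since it is overwritten each time the documentation is regenerated.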
Checking the package
In principle, this is all that is needed to generate a package that can be distributed and installed by you and others! RStudio provides helpful tooling for checking your package and making sure that it conforms to expected standards and contains all necessary files. To run a check within RStudio, we go to the Build tab and press Check.
This will call the devtools::check() function, which performs lots of formal checks on your package content. In the end, it will tell you whether there are ERRORS, WARNINGS, or NOTES that require your attention.
It is often helpful to run such checks regularly while developing a package, to catch potential issues early on in the process.
At this point, we can also load the package into our R session, to be able to use the functions. This can be done either by installing the package into the local R library (Build -> Install) followed by library(exampleRPackage), or by typing devtools::load_all(), which will simulate the previous action without actually installing the package (this is particularly helpful if you already have an older version of the package installed that you don’t want to overwrite with the version you are currently developing, or if you just want to quickly test newly added functionality).
devtools::load_all()
Adding a README
A README file in your package serves as the first landing spot for a new user, and can be helpful to provide context and installation advice. We can easily create a readme with usethis:
usethis::use_readme_md()
For now, let’s make it very rudimentary.
Adding argument checks
Next, let’s take things one step further and make our functions a bit more robust. In particular, the get_age_in_days() function will not work if a value that cannot be converted to a numeric is provided as input:
get_age_in_days("Charlotte")
Error in age_years * 365.25: non-numeric argument to binary operator
While R will return an automatic error message, these are not always easy to parse, and it may not be obvious to the user what the problem is. We can add a check to our function, which breaks (with a useful error message) if the input is not numeric:
get_age_in_days("Charlotte")
Error in get_age_in_days("Charlotte"): The `age_years` argument must be numeric.
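A check along these lines could produce the error message shown above. This is a minimal sketch using base R’s is.numeric() and stop(); the original implementation may differ in its details.

```r
get_age_in_days <- function(age_years) {
  # Fail early, with an informative message, if the input is not numeric
  if (!is.numeric(age_years)) {
    stop("The `age_years` argument must be numeric.")
  }
  round(age_years * 365.25)
}
```

Valid input is unaffected by the check, so get_age_in_days(23) still gives the same result as before.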
Adding unit tests
Before wrapping up the package, we will also add some unit tests. Unit tests are used to check that all functions in the package work as expected and return the right value for a range of different possible input values.
We will set up our unit tests using the testthat package. To generate all the required files, we again use usethis:
usethis::use_testthat()
We see that this generates a new tests/ folder, and adds information to the DESCRIPTION file.
Now we can create a test file for a given function, by opening the corresponding script (say, R/get_age_in_days.R) and typing
usethis::use_test()
This opens a new script (in the tests/testthat folder) where we can define our tests.
Here is an example of a set of unit tests for this function:
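As a minimal sketch, such a test file could contain something like the following. The specific test descriptions and input values are our own choice; the function definition is repeated only so that the snippet can be run on its own, and would not be part of the actual test file.

```r
library(testthat)

# In the package, this function lives in R/get_age_in_days.R and is made
# available automatically when the tests are run.
get_age_in_days <- function(age_years) {
  if (!is.numeric(age_years)) {
    stop("The `age_years` argument must be numeric.")
  }
  round(age_years * 365.25)
}

test_that("get_age_in_days returns the expected number of days", {
  expect_equal(get_age_in_days(0), 0)
  expect_equal(get_age_in_days(1), 365)
  expect_equal(get_age_in_days(23), 8401)
})

test_that("get_age_in_days fails for non-numeric input", {
  # The second argument is a regular expression matched against the error message
  expect_error(get_age_in_days("Charlotte"), "must be numeric")
})
```

Testing both valid input and input that should trigger an error helps ensure that the argument check keeps working if the function is modified later.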
We can run all the tests in this script by clicking on the Run Tests button in the top right corner of the script panel. We can also run all tests in the package via the Test button in the Build panel.
The covr package provides helpful tools to check the fraction of code in the package that is covered by any of our unit tests (we want this to be as high as possible). In particular, running covr::report() from the package directory opens an interactive report in the viewer panel, where we can see the overall coverage, as well as the coverage for each function. We can also click on a file to see which lines are covered and which are not.
Encoding dependencies
In many cases, our package will only work properly if other packages are installed. This is the case, for example, any time one of our functions calls a function provided in another package. We need to encode this dependency in our package, so that anyone who would like to use the package can prepare accordingly. In fact, in many cases dependencies will be automatically installed upon installation of our package, if they are properly encoded.
Let’s say, for example, that we want to use the str_c() function from the stringr package to concatenate “Hello” and the provided name before sending this to the message() function in say_hello(). To achieve this, we need to:
- add the stringr package as a dependency in the package metadata (the DESCRIPTION file),
- import either the entire package or (preferably) the function(s) we need in the NAMESPACE file.
We can achieve this in two steps. First, we call
usethis::use_package("stringr")
This adds stringr to the list of dependencies in the DESCRIPTION file.
Next, we add an @importFrom statement in the roxygen preamble of our say_hello() function, indicating that we want our package to import the str_c() function from the stringr package. We also modify the code of our function to use this new dependency.
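The updated function could look roughly like the sketch below. The documentation text is our own illustration, and the library(stringr) call is included only so that the snippet runs outside the package; inside the package, the @importFrom tag (via the NAMESPACE file) is what makes str_c() available.

```r
library(stringr)  # only needed when running this snippet on its own

#' Print a greeting
#'
#' @param name A character string giving the name of the person to greet.
#'
#' @return Called for its side effect (a message); returns NULL invisibly.
#'
#' @export
#' @importFrom stringr str_c
#'
#' @examples
#' say_hello("Charlotte")
say_hello <- function(name) {
  # Build the greeting with stringr before passing it to message()
  message(str_c("Hello ", name, "!"))
}
```

Importing only the function that is needed, rather than the whole package, keeps the set of imported names small and makes it clear where each function comes from.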
After using devtools to re-render the documentation (and the NAMESPACE file) and re-load the package, we are ready to use our updated function.
devtools::document()
devtools::load_all()
Pushing the package to GitHub
Now that our package is ready to be distributed, we can make it available to others to install. There are many ways of distributing software - here we will use GitHub. We already initialized a git repository when creating the package. Now, we add and commit all the files that we have modified. This can be done from the command line, or using the Git panel in RStudio.
We next create an empty repository on GitHub, and follow the instructions to push our local content there. For more information on how to work with git in RStudio, see e.g. Happy Git and GitHub for the useR.
Now, the package is available for anyone to install from your GitHub repository, using e.g. the remotes or pak package.
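An installation from GitHub could then look roughly as follows. The GitHub user name below is a placeholder and must be replaced by the account that actually hosts the repository; the installation itself is wrapped in a guard so that it only runs in an interactive session where remotes is available.

```r
# "your-github-username" is a placeholder, not a real account
repo <- paste("your-github-username", "exampleRPackage", sep = "/")

if (interactive() && requireNamespace("remotes", quietly = TRUE)) {
  # Download, build and install the package from the GitHub repository
  remotes::install_github(repo)
  # Alternatively, with the pak package: pak::pak(repo)
}
```

After installation, the package can be loaded with library(exampleRPackage) like any other installed package.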
Adding a documentation website
The code in the GitHub repository contains all the information needed to use our package (including the documentation of all the functions). However, it is not immediately accessible in a user-friendly way. A common approach for R packages is to use the pkgdown package to generate a documentation website, which will pull information from the various folders in your package. We can test it locally:
pkgdown::build_site()
This will build the site, and open it in a browser window. We can see that it pulled the information from the README as well as the DESCRIPTION files. Moreover, the Reference tab contains the documentation for the functions.
This website can also be generated automatically every time we push to the GitHub repository, and thereafter displayed for free using GitHub Pages. To set this up, we need to use GitHub Actions (a continuous integration/continuous delivery system built into GitHub), and instruct it to perform the necessary steps at each new push. We can again use usethis to set this up.
usethis::use_github_action("pkgdown")
Adding the new file (.github/workflows/pkgdown.yaml) and pushing to GitHub now activates GitHub Actions.
We can follow the progress under the Actions tab in the GitHub repository. Once the workflow has finished, there will be a new branch named gh-pages in your repository, which will contain the documentation webpage. To display this, we need to activate GitHub Pages in our repository (Settings -> Pages -> Deploy from a branch -> Select gh-pages -> Save). The settings page will then display the URL to the rendered documentation.
Click on this URL to view the rendered documentation. This will now be updated every time we push to the repository, pulling the latest information from the package files.
How to distribute packages?
There are many outlets for distributing packages and making them available for others to install and use. Many developers use git or other tools for version control during software development. Consequently, it is common to find the final software package distributed in a GitHub or GitLab repository. In fact, R packages can be installed directly from such repositories, using e.g. the pak, remotes, devtools or BiocManager packages. Similar functionality is available for other languages (e.g. Python).
In addition, there are several language-specific package outlets - for R, the two main ones are CRAN (which hosts more than 20,000 packages) and Bioconductor (which distributes ~2,300 packages, specifically for analysis of biological data). Submitting your package for distribution via such a repository can provide additional visibility, and acts as an additional quality assurance since the package undergoes a review process before being included.
Resources
- R Packages book
- CRAN - The Comprehensive R Archive Network
- Bioconductor
- PyPI - The Python Package Index
- GitHub Actions
- GitHub Pages
- devtools
- usethis
- pkgdown
- roxygen2
- testthat
- Ten simple rules for training scientists to make better software
- Blog post on how to develop good R packages
- Eleven quick tips for writing a Bioconductor package