Computational skills are essential in the modern research environment, which increasingly involves large datasets and high-performance computing, along with a growing demand that scientific studies be reproducible and open. To meet these demands, researchers across nearly all disciplines often need to create, use, and share software. Most introductions to programming or software engineering focus on developing commercial applications; researchers’ needs, however, are largely related to exploring problems and answering questions. This book covers the fundamentals of creating software as part of your research cycle, whether you work alone or as part of a team. We wrote this book with three audiences in mind:
- Learners in a formal class or workshop setting that is led by one or more instructors. Much of the book is written with this reader in mind.
- Independent learners who would be going through the material and learning on their own. An independent learner can for the most part follow the book as if they were a learner in a class setting, with some modifications that we highlight and describe where necessary.
- Instructors who would use this book as the basis for a course or workshop of their own. Throughout the book there are sections dedicated to this reader that give information or instructions on how to use or modify the material for the instructor’s specific context.
The primary aim of this book is to teach you the tools, skills, and knowledge necessary to make research software that is open, reproducible, and sustainable. For researchers and data scientists who can write functions to create programs that are hundreds of lines long and who want to be more productive and have more confidence in their results, this book provides a pragmatic, tools-based introduction to research software engineering. For researchers whose tasks include building software packages, this book will help train you to be a [research software engineer][rse]. Unlike material aimed at computer scientists and professional software developers, this book uses data analysis as the motivating example and assumes that the learner’s ultimate goal is to answer questions rather than develop commercial software products.
TODO: Add more about linking why we should even write packages in the first place.
The main goal of this book is for you to learn how to develop an R package. By the end of the book, you will have the skills and knowledge necessary to: develop well-built, well-documented, and re-usable R packages; work effectively as part of a team or as a contributor; and optionally prepare your package for submission to CRAN.
To support your learning experience, we have developed the following specific learning objectives that you will master from this book:
- Explain what an R package is, how it is structured, and how to set one up.
- List and describe some best (or good enough) practices in package development and how to implement them.
- Explain what a code style guide is, why it is important, and how to write code that adheres to the style guide.
- Describe why user-friendly and well-written documentation is crucial to building a usable package and some steps to achieve this type of documentation.
- Describe what CRAN is, what policies and quality-control steps it has for packages, and how to prepare a package for submission to CRAN so that it passes these checks and requirements.
Aside from the technical aspects of building an R package, the book will also cover topics that enhance and improve the process of package development. These learning objectives are to:
- Describe and apply effective workflows and project management tools that coordinate the development of software, particularly for software that is complex or when working in a team.
- Outline (and optionally practice) how to work productively in a small team that is inclusive and welcoming.
- Explain why it is beneficial to involve community contributors when designing software and documentation.
TODO: Add graphic for overview? Something like https://moderndive.com/preface.html#fig:pipeline-figure?
The following personas are examples of the types of people that are our target audience.
- Amira completed a master’s in library science five years ago and has since worked for a small aid organization. She did some statistics during her degree, and has learned some R and Python by doing data science courses online, but has no formal training in programming. Amira would like to tidy up the scripts, data sets, and reports she has created in order to share them with her colleagues. These lessons will show her how to do this and what “done” looks like.
- Completed an [Insight Data Science][insight] fellowship last year after doing a PhD in Geology and now works for a company that does forensic audits. He uses a variety of machine learning and visualization packages, and would now like to turn some of his own work into an open source project. This book will show him how such a project should be organized and how to encourage people to contribute to it.
- Became a competent programmer during a bachelor’s degree in applied math and was then hired by the university’s research computing center. The kinds of applications they are being asked to support have shifted from fluid dynamics to data analysis; this guide will teach them how to build and run re-usable data analysis pipelines so that they can pass those skills on to their users.
TODO: Add undergrad persona.
Learners are expected to be able to:
- Write functions to analyze tabular data using [R][r].
- Have a basic understanding of Git and GitHub, and have at least used them for basic version control tasks.
- Create reproducible reports using [R Markdown][r-markdown].
Learners will need:
This book and the tools, skills, and knowledge being taught come out of the growing awareness of and demand for [open science][open-science], [reproducible research][reproducibility], and [software sustainability][sustainable-software]. So what are these?
- [Open science][open-science] is a movement to make scientific data, methods, and results publicly and freely accessible by publishing them with [open licenses][open-license]. These licenses allow others to re-use and modify these outputs without worrying about copyright infringement, often only requiring attribution to the original source.
- [Reproducible research][reproducible-research] means ensuring that anyone with access to data and software can feasibly reproduce the same or similar results, both to check them and to build on them. It’s important to note that science can be open but not reproducible and reproducible but not open.
- [Sustainable software][sustainable-software] is software developed in a way that makes it easier for other people to extend and/or maintain it, rather than to replace it. Sustainability isn’t just about the software itself; it also depends on the skills and culture of its users. For instance, if a software package is being maintained by a couple of post-docs who are being paid a fraction of what they could earn in industry and have no realistic hope of promotion because their field doesn’t value tool building, it doesn’t matter whether the package is well tested and easy to install: sooner or later (probably sooner) it will become [abandonware][abandonware].
This book presents software development that emphasizes openness, reproducibility, and sustainability, because it supports both individual researchers and the entire research community.
Throughout the book, you will build at least two packages (maybe more depending on the instructor).
- The main example package developed throughout the book is the Zipf’s Law package, `zipfs`. This package is used for the code-alongs, the actual content, and most of the exercises. More details about Zipf’s Law will be introduced in Chapter ??.
- The example package for the first project assignment is the Weather in Kenya package, `kenyaweather`. This package is used as a reference when you work on your package for the first project assignment.
Specific details about the project assignments are found in Appendix ??. Two of the packages are completed and can be used as references.
TODO: Remove this at a later point, keep for now for an overview.
Course chapters and content need to be strongly aligned with learning outcome/goal. Here is a draft outline to guide development.
- Overview of course
- Course motivation and learning outcomes
- Intended learner (personas, assumptions and expectations etc)
- Folder and file structure (RStudio R Projects)
- File paths, basics of the shell (in RStudio)
- Working directory and how it is set with RStudio R Projects
- Making use of the fs package for filesystem management
- Setting up an R package
- Describing what a package is and when or why to make one
- Using devtools and usethis for development
- Developing functions
- Basic explanation of what a function is and its components
- Actual process of taking code and converting it into a function will be done in next chapter
- Making and using datasets
- Write a script to download the zipf data and save to usethis::use_data_raw() and process into
- Function development in a package environment
- Creating non-function code, checking that it works, then converting it into a function
- There are several workflows for this (create in vignette Rmd, make a dev/creating.R script as a development location, developing in the examples Roxygen section, “Untitled1.R”). Which to use?
- Mostly what a workflow actually looks like.
- Building functions up slowly, making small targeted functions that build up into bigger, more complex functions
- Process control (if-else, stop, return, switch)
- Dependency management
- Function documentation (with roxygen2), part 1
- `@examples` to help with creating function documentation
- Version control: Using Git and GitHub as a sole user (part 1)
- Using Git in RStudio (standard add-commit-history)
- Using Git in the Terminal of RStudio (moving in history with checkout, creating branches, adding and updating remotes)
- Setting up GitHub for R package (make use of usethis), pushing and pulling
- E.g. with pr_* functions from usethis
- Emphasize exercises to practice, not showing output of git in code chunks (they are a pain to edit afterwards)
- Checking correctness of code
- Build management and workflow
- With devtools
- Running local CRAN checks
- General workflow (load_all, test, check)
- Pre-push running test, build, and check
- Continuous integration using GitHub Actions
- Developing documentation and tutorials on usage
- Vignettes, README (with rmarkdown and usethis)
- Function documentation (with roxygen2), part 2
- Running spell checks and styling (with spelling and styler)
- Exposing your package to the world with a website:
- Setting up a website (with pkgdown)
- Customizing the themes
- Getting all resources and material organized
- Defining lifecycle of your package and individual functions (with lifecycle)
- Involving the community:
- Designing your package to be used by or contributed to from the community (refer to Mozilla Science Labs Open Leaders material)
- Contributing guidelines
- Being inclusive (code of conduct)
- NEWS file
- Version control: Workflows around using GitHub (part 2)
- Make use of R builtin tools like the usethis pull request helpers (and learning how to make use of this yourself)
- Using this workflow as a team or as a contributor
- Project management of your package development (either sole or in a team)
- Issue tracking
- Project boards
- Preparing for package release
- Check builds on other systems (with rhub)
- Etiquette for using (mostly free) servers and external build systems like GitHub Actions and rhub (e.g. don’t overuse them)
- Managing your versions, git tagging
- R package development in a team-based environment
- Making use of GitHub Organizations/Teams
- Branch protection, member roles
- Resolving conflict, personality differences
- Running meetings, code of conduct
- Decision making
- Issue assignment
- Relying on and using code of conduct to build culture and standards
- Setting up SSH, two-factor authentication, PATs?
- Appendices (ideas)
- Instructions to potential instructors
The instructional design of the learning material contained in this book revolves around three key concepts:
- Participatory live-coding lessons, where participants join with instructors to write and troubleshoot code step-by-step. We believe that this method encourages participants to actively engage with the material, to build muscle memory through typing, and to learn how to handle mistakes, rather than passively observing content (Brown and Wilson, 2018; Wilson, 2019a).
- In-class and out-of-class exercises that include independent reading of assigned material and hands-on, practical work. The hands-on exercises are interspersed throughout the live-coding sessions to complement and reinforce the content as well as give learners the opportunity to work through the code and problems at their own pace. Reading activities are used to build concept-heavy knowledge.
- Project- and problem-based assignments that include both independent and group-based work. For the project assignments, learners reinforce what they learned through a new problem and challenge. The independent work prepares them for the optional group-based work, where they can practice team-based skills like communication, project management, and running meetings.
More details for instructors can be found in Appendix TODO: Add ref label.
All of this material can be freely re-used under the terms of the Creative Commons Attribution 4.0 International License, and the book code is licensed under an MIT License. The material can therefore be used, re-used, and modified, as long as there is attribution to this source.
The source for the book can be found at the r-rse GitHub repository. Any corrections, additions, or contributions are very much welcomed.
Check out our contributing guidelines as well as our Code of Conduct for more information on how to contribute.
This book owes its existence to everyone we met through [the Carpentries][carpentries]. We are also grateful to [Insight Data Science][insight] for sponsoring the early stages of this work, to everyone who has contributed, and to the authors of (Noble, 2009; Haddock and Dunn, 2010; Wilson et al., 2014, 2017; Scopatz and Huff, 2015; Taschuk and Wilson, 2017; Brown and Wilson, 2018; Devenyi et al., 2018; Sholler et al., 2019; Wilson, 2019b).
TODO: Add more R relevant acknowledgements here.