Managing large codebases in R
HostAlexander Bertram
PanelistRyo Nakagawara
About the webinar
This webinar is a one-hour session designed for M&E specialists and Information Management Officers who are familiar with the R programming language and wish to advance their skills to manage large codebases more effectively.
In summary, we discuss:
- Adopting a coding style for your team
- Organizing code into functions
- Organizing functions into packages
- Documenting code
- Using version control
- Practical examples from the field
View the presentation slides of the Webinar.
View Ryo's presentation notes.
Is this Webinar for me?
- Are you responsible for information management in your organization and do you work with R?
- Do you work with large codebases and need to add structure in the way you manage them?
- Do you wish to advance your skills in R?
Then, watch our Webinar!
About the Presenters
Mr. Alexander Bertram, Technical Director of BeDataDriven and founder of ActivityInfo, is a graduate of American University's School of International Service. He started his career in international assistance fifteen years ago with IOM in Kunduz, Afghanistan, and later worked as an Information Management Officer with UNICEF in DR Congo. At UNICEF, frustrated with the time required to build data collection systems for each new programme, he worked on the team that developed ActivityInfo, a simplified platform for M&E data collection. In 2010, he left UNICEF to start BeDataDriven and develop ActivityInfo full time. Since then, he has worked with organizations in more than 50 countries to deploy ActivityInfo for monitoring & evaluation.
Mr. Ryo Nakagawara is an experienced R developer and data engineer/scientist with a background in international development and soccer analytics, currently residing in Japan. Ryo's strengths lie in building data pipelines by creating and maintaining R packages, scripts, reproducible reports, dashboards, and more. He also has experience managing large codebases and projects in open-source and enterprise environments on GitHub. Outside of work, Ryo regularly contributes to both fun and serious open-source projects and serves as an editor of the "R Weekly" newsletter. You can find Ryo on GitHub or LinkedIn.
Transcript
00:00:02
Introduction
Great, thank you so much, Faye. I'm excited about today, and I'm also excited about the crowd that we have here. Not only do we have Ryo, who's going to talk to us about his work at ACDI/VOCA, but I also see a couple of the key people from the response for Venezuela. They use R very avidly, and together with the team from OCHA and others they are among our power users. So I hope we'll have some time at the end to get some of your experience and maybe some of your feedback.
Today we're going to do a quick intro to what we mean by large codebases and why it's important, and then we're going to dive into five principles. I'm going to introduce some ideas about good ways to manage our codebases, and Ryo is going to illustrate and share with us how he's applied these at ACDI/VOCA. Hopefully it will be very interactive; we'll go back and forth, and with luck we'll have time for questions, feedback, or other tips from our audience.
First of all, for those not sure what ActivityInfo is, it is a user-friendly relational database for M&E, case management, and humanitarian coordination. From the beginning, we've always had a strong integration with R, and many of our users have been relying on R to automate tasks with ActivityInfo, to do advanced analytics, and manage data flows between systems. It's been a key element from the beginning.
00:02:12
What is a large codebase?
Firstly, what do we mean by a large codebase? It doesn't have to be that large, but I think anytime that you're working with more than one person on code in R, or even more than a few files, you start to run into these issues of organization. You want to make sure that you know where to find things and how to change things. One thing that's important is that you write code maybe once, but you're going to read it so many more times. Other people are going to read it, and your future self is going to read it. Code that I write now, I have to be able to come back to in six months after I've moved on and done something else. I have to be able to understand what I was thinking about when I wrote that code. That's some of the reasons that paying attention to the organization of code is really important.
Some examples of large codebases include OCHA Libya, where they use an R script to move data from ActivityInfo, aggregate it across different sectors, and pull it into an internal dashboard for review. It's essentially an extract, transform, and load flow with R and ActivityInfo. Francis and James at the response for Venezuela have used R to pull data from the databases of the 17 countries that participate in the response into a single regional database. They have developed a Shiny app that helps validate and aggregate data across those different countries into a single view of the data.
Another example is the QualMiner program in Ecuador. This is a project that we worked together with IOM and UNHCR on to use data collected and managed with ActivityInfo to visualize some of the qualitative narrative data that was captured, applying some of the advanced techniques in visualizing qualitative data using R. And of course, Ryo can talk about the codebase at ACDI/VOCA and how that interacts with ActivityInfo.
At ACDI/VOCA, we have lots of R packages that we build ourselves, and separately, GitHub repositories that are filled with scripts and other stuff that run the code from those packages. They are linked, and later on, I'll show an example of what our GitHub organization looks like and how we've structured and organized our code.
00:05:35
Adopting a common code style
The first principle is about adopting a common code style. A code style is like a style guide when you're writing; you can get by without it, but it makes reading code much easier. It is a set of rules that the whole team agrees on regarding how to name functions, variables, and data sets. Do you use capital letters? Do you use underscores, periods, dashes? How and where do you space? This makes it much easier when you're working with code within your team to navigate and know what to expect.
A simple example is a function like check_duplicates. Is it check_duplicates with an underscore, or checkDuplicates with a capital D, or dots? Having a mix of these naming conventions in your codebase can make it very difficult to write and read code. I'm linking here to Hadley Wickham's style guide for R. I would really recommend this, especially if you're using the tidyverse packages. He goes through and talks about spacing and making sure that you have spaces between your expressions to make it easier to read and write. I also wanted to point out the formatR package; if you've got an existing codebase that's a mix of all these things, this is a package that will help you automatically tidy it up. Just agreeing together as a team on a code style can go a long way.
Each individual team member is going to have different preferences because we all learned R in a different way. You have to decide as a team and then stick to that decision, or reading each other's code is going to be really frustrating. Common style differences include the three different ways of assigning variables in R: the equals sign, the left assignment, and the right assignment. For most R users, the left assignment is probably the most common. In the ActivityInfo R package, it is all left assignment.
Another point of contention can be snake_case versus camelCase, or American English versus British English. In certain places like ggplot2, you can input the color argument in both spellings, but for most of us in our own packages, we're going to have to pick one way or another. To analyze these stylistic issues, there's a really great package called lintr. It goes through and analyzes your entire code. The lintr default is to use the tidyverse style guide, which prefers the left assignment arrow.
There is also a package called styler which formats your code automatically depending on the style you want. There is an RStudio Add-in for it which is really nice. Google also has its own style guide, which is a bit different from the tidyverse style guide. It doesn't matter so much which one you pick, as long as you pick one. If you're using a lot of the tidyverse packages, then it's quite nice to choose the tidyverse style guide.
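To make the point concrete, here is a small sketch of what a consistent, tidyverse-style function might look like; the function name echoes the `check_duplicates` example from the talk, but the body is illustrative only:

```r
# Tidyverse-style conventions: snake_case names, left assignment (<-),
# spaces around operators and after commas, indented bodies.
check_duplicates <- function(records, id_column = "household_id") {
  ids <- records[[id_column]]
  ids[duplicated(ids)]
}

records <- data.frame(household_id = c(1, 2, 2, 3))
check_duplicates(records)

# The same function in a mix of styles is much harder to scan:
# checkDuplicates=function(records,idColumn="household_id"){...}
```

Running `lintr::lint()` or `styler::style_file()` on a file written in the second style would flag or fix most of these issues automatically.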
00:14:00
Organizing code into functions
The next area is organizing things into functions. If you're using R, you've probably seen functions before. A function takes inputs (arguments) and gives you an output (result). Even addition is a function. Functions are great ways to organize code into small bits. One distinction I want to introduce is the idea between pure and impure functions. A pure function is like a mathematical function: if you give it the same inputs, it gives you the same outputs. It doesn't have any side effects, like writing a file or contacting a website. This makes them very easy to test.
Impure, or imperative, functions depend on the outside world. They might read from a file, a server, or an API, or depend on the time of day. This means the same arguments might produce different outputs. The classic example of an impure function is "launch missiles"—it has side effects in the real world. Why use functions? They help you break code into smaller bits that are easier to read and understand. A good rule of thumb is about 20 lines; a function should fit inside your head. You should be able to understand the whole function at once.
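The distinction can be sketched in a few lines of R (both functions here are invented for illustration):

```r
# Pure: the output depends only on the inputs, and calling it has no
# side effects, so it is trivial to test.
score_eligibility <- function(income, household_size) {
  ifelse(income / household_size < 500, "eligible", "not eligible")
}

# Impure: depends on the outside world (the system clock), so the same
# call can return different results on different days.
report_date_label <- function() {
  format(Sys.Date(), "Report generated on %Y-%m-%d")
}

score_eligibility(400, 1)   # always "eligible", no matter when it runs
```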
There is also the "functional core, imperative shell" pattern. Pure functions are good, but all the interesting things happen in the outside world (side effects). The idea is to sandwich the pure functions between the imperative ones. For example, you might have a shell that reads from a CSV file (imperative), pipes that to a pure function which removes duplicates and scores eligibility, and then pipes that to a function that imports into ActivityInfo (imperative).
Let's look at a real-world example. We have a function from a project that checks duplicates in households. It's a well-thought-out script, but it is 892 lines long. This function does not fit into my head. One thing you can do is look for pure functions to pull out. For example, there is a section that checks if there are any duplicates and adds lines to a log. We can pull this out to a separate function called no_issues. This function does one thing at a time and is very simple to understand. By breaking this out into separate smaller functions, the ultimate check_duplicates function might look like a series of piped functions: check_for_codes then check_for_first_names. This makes it much easier to know what the function does without reading nearly 900 lines.
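A hypothetical sketch of what that decomposition could look like (the helper names come from the talk, but the bodies are invented, and the base pipe `|>` requires R 4.1 or later):

```r
# Pure core: each check does one thing and is easy to test in isolation.
check_for_codes <- function(households) {
  households[!duplicated(households$code), ]
}

check_for_first_names <- function(households) {
  households[!duplicated(households[, c("first_name", "last_name")]), ]
}

# The top-level function reads as a summary of the whole process.
check_duplicates <- function(households) {
  households |>
    check_for_codes() |>
    check_for_first_names()
}

# Imperative shell (sketch only, so not run here): read, apply the
# pure core, write.
# run_checks <- function(path) {
#   read.csv(path) |> check_duplicates() |> write.csv("clean.csv")
# }
```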
The main thing about creating functions is the mantra "Don't Repeat Yourself" or DRY. The way I tend to write functions is to write out the lines of code in an R script or R Markdown file for whatever task I want to do. Once I'm satisfied with the output from beginning to end, I slowly start wrapping the function skeleton around it. You have to think about what the interchangeable things in your function are—usually different data. Those become your primary arguments. You also want to think about the end result you want to return to the user.
It's usually never just about one individual function; you will have multiple functions in your script. You want to think about how each different function fits the larger picture. One function may output data needed as an argument for another function. If you are doing too much in a single function, you might want to break it down into logical components. When you're starting out, make each individual function do one specific thing perfectly with one output. This makes it easier to document, test, and debug.
Once you start creating functions, it's really powerful because you can scale it up by running the same operations for different things. For example, you can use functional programming tools like lapply in base R or the purrr family of functions to iterate over lists. Writing organized code is often iterative. Don't be afraid to quickly write out something that works, but sometimes it's worth taking an extra minute or two to refactor it. Refactoring is the process of restructuring your code without changing its behavior so that it is clearer.
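As a small illustration of that scaling, here is a made-up example that applies one summary function to several country data sets with base R's `lapply`:

```r
# One function, written once...
summarize_country <- function(records) {
  c(n = nrow(records), total = sum(records$amount))
}

# ...applied to every element of a list of data sets. The data here is
# invented for illustration.
countries <- list(
  ecuador  = data.frame(amount = c(10, 20)),
  colombia = data.frame(amount = c(5, 5, 5))
)

results <- lapply(countries, summarize_country)
# purrr::map(countries, summarize_country) would be the tidyverse
# equivalent.
```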
00:33:25
Organizing functions into packages
A package in R is a way of organizing a set of functions, dependencies, and documentation together in one shareable unit. If you put together a package, you can share that with somebody else and be reasonably confident that they're going to be able to use it right away. If you just share a script, it's not going to be clear which packages or libraries the code uses. A package puts all of this together so that it's ready to share and use by others. Even if you're working alone, it makes it easier to work with so that you don't have a nest of source files.
At BeDataDriven and ACDI/VOCA, package development leans on two packages in particular: usethis and devtools. They provide functions that create package structure and make package development much easier. For instance, to create a package, you can use usethis::create_package(). This sets up the package structure, including the DESCRIPTION file and the R folder. There are other functions to quickly set up connections to GitHub or create test infrastructure folders with use_testthat().
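One of the files usethis::create_package() generates is the DESCRIPTION file, which records the package's name, purpose, and dependencies. A minimal one looks roughly like this (the package name and all field values here are illustrative, not from the talk):

```
Package: mypackage
Title: Helpers for Cleaning M&E Data
Version: 0.1.0
Authors@R: person("Jane", "Doe", role = c("aut", "cre"))
Description: Functions to download, validate, and clean programme data.
License: MIT + file LICENSE
Imports:
    dplyr
Encoding: UTF-8
```

Because dependencies are declared in Imports, anyone installing the package gets them automatically, which is a big part of what makes a package more shareable than a loose script.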
You might have different individual components, like an ETL process with get_data and clean functions. You can have one large package function, like run_all, that runs all of this together, or you can run the individual components from a separate script or R Markdown file.
00:39:04
Documenting code
A lot of people will put comments in the code using the hash symbol. That's very useful, but when you start to break your code into smaller functions, you might find that everywhere you're putting a comment is actually a good place to put a function boundary. If you have a comment like "make Arabic names consistent," that's a sign that this should be a separate function. Once you extract that to a function called fix_arabic_names, you almost don't need the comment anymore because the function name explains it.
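A sketch of that idea, with a hypothetical body (the real fix_arabic_names from the talk is not shown, so this only illustrates how the function name replaces the comment):

```r
# Before: a comment marking a section of a long script.
#   # make Arabic names consistent
#   ...several lines of cleaning code...

# After: the comment has become a function name that documents itself.
fix_arabic_names <- function(names) {
  # Hypothetical cleaning steps: trim stray whitespace and collapse
  # repeated internal spaces.
  gsub(" +", " ", trimws(names))
}
```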
You can use a special syntax called roxygen2 to document the function itself. You can describe what the function does, the parameters it expects, and what it returns. This allows you, a team member, or your future self to have a clear indication of what to expect. It levels up your comment game.
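A roxygen2 block sits directly above the function as specially formatted comments (the function below is invented for illustration); running devtools::document() turns these into the package's help pages:

```r
#' Remove duplicate household records
#'
#' @param households A data frame of household records.
#' @param key Name of the column used to identify duplicates.
#' @return The data frame with duplicate rows removed.
#' @export
remove_duplicates <- function(households, key = "code") {
  households[!duplicated(households[[key]]), ]
}
```

Once documented this way, users can call `?remove_duplicates` to see what the function expects and returns.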
Another very useful way to document things is to add assertions to your functions. You can use stopifnot to ensure that the data frame has the columns you expect. If you use this kind of assertion, you're going to get a nice error message before the code fails deep inside the function. This is very helpful for providing feedback on the use of the function.
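For example, a function can verify its input up front with stopifnot, so a malformed data frame fails immediately with a readable message instead of deep inside the logic (column names and the scoring rule here are invented):

```r
score_households <- function(households) {
  # Fail fast, before any real work happens, if the input is not a
  # data frame with the expected columns.
  stopifnot(
    is.data.frame(households),
    c("income", "size") %in% names(households)
  )
  households$score <- households$income / households$size
  households
}

score_households(data.frame(income = 100, size = 2))
```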
For documentation, there is a package called sinew and an RStudio add-in for it. It allows you to automatically generate the roxygen skeleton based on what you've already written inside the function. It picks up dependencies and arguments, saving a lot of time and ensuring consistency. There is also a package called pkgdown which lets you create web pages from your package documentation. The README file becomes your front page, and the package documentation becomes individual web pages. You can even use HTML and CSS to spice up the appearance.
00:49:11
Using version control
Finally, source control. A lot of you are using Git or GitHub. When I started, there were several systems like SVN, but now nearly everybody uses Git. There are sites like GitHub, GitLab, and Bitbucket that offer generous free plans; I believe you can even have free private repositories within certain limits. As soon as you have any substantive amount of code, I strongly recommend a version control system to keep track of changes and collaborate. It also gives you the freedom to delete things, because you can always recover them from the history.
At ACDI/VOCA, we use GitHub Projects, which is a Kanban or Trello-style board that organizes tasks coming out as issues from individual repositories. You can collapse and organize issues from tons of different repositories into one view. We split things into two different types of repositories: package repositories and script repositories. They are often paired; for example, avdg-forex is the package repository containing the functions, and avdg-forex-scripts houses the actual scripts that run the code.
The package repositories are just for the package code itself, not the execution. The script repositories contain the shell scripts, log files, and data folders. We have a dashboard that shows the status of these scripts—whether they are running or erroring. If a script isn't working, I can click a button and go straight to the log file instead of digging through folders. We also have personal or shared workspaces for project staff, like AV-Philippines or AV-Laos, where we can manage permissions so staff only see what is relevant to them.
01:01:27
Q&A session
Colin: I have a few questions regarding keeping track of namespaces on long-running projects, how you approach code review, and when you decide that you need to refactor.
Ryo: Namespaces are hard. There is a debate between using lots of dependencies versus using only base R. There are so many useful packages out there that you don't have the time to rewrite yourself. Regarding code review, we use pull requests and have reviewers make sure the code changed in a separate branch is up to spec. We use automatic checks like lintr and GitHub Actions to ensure code conforms to our standards before merging.
Alex: For the ActivityInfo codebase, we have a clear transition where code must be peer-reviewed before it goes into production. If you are doing data analysis where your team is the only user, code review can be more about learning. You can ask a team member to look over your code to see if it makes sense. Regarding refactoring, if code is running and doesn't need to change, I generally don't touch it. The best moments to refactor are while you are writing the code (before committing) or when you are about to make changes to existing code for a new requirement. It helps you get back into the code and prepares it for the changes.