10  Creating portable and reproducible scientific workflows

At this point, you would have written your own R code and R scripts and saved these within an R file (.R file extension).

By now, you would have also appreciated how extensible R is through built-in packages and/or through functions that you have created yourself.

So far, in the examples that we have worked on, the operations and the problems have been quite straightforward. But from your own experience dealing with your own data, real world data is far from straightforward and far from simple. Complexity is almost always a given.

R’s scripting capability and R’s extensibility are its main characteristics that make R a good tool for creating robust scientific workflows particularly for complex data and research projects.

A typical scientific workflow would have the following steps:

flowchart TD
  data_sources(Data\nsources)
  raw_data(Raw\ndata)
  clean_data(Clean,\nusable\ndata)
  stats_outputs(Specialised\nstatistical\noutputs)
  descriptive(Descriptive\nstatistics)
  results(Results\ntables)
  model_outputs(Model\noutputs)
  graphics(Plots\nfigures\nmaps\ngraphs)
  report(Report)
  manuscript(Manuscript)
  outputs(Website\napplications)

  extract[/Extract/]
  transform[/Transform/]
  analyse[/Analyse/]
  describe[/Describe/]
  tabulate[/Tabulate/]
  model[/Model/]
  visualise[/Visualise/]
  contribute[/Contribute/]

  data_sources --- extract --> raw_data
  raw_data --- transform --> clean_data
  clean_data --- analyse --> stats_outputs
  clean_data --- describe --> descriptive
  clean_data --- tabulate --> results
  clean_data --- model --> model_outputs
  clean_data --- visualise --> graphics

  stats_outputs <--> descriptive
  descriptive <--> results
  results <--> model_outputs
  model_outputs <--> graphics

  stats_outputs --- contribute
  descriptive --- contribute
  results --- contribute
  model_outputs --- contribute
  graphics --- contribute

  contribute --> report
  contribute --> manuscript
  contribute --> outputs

A flowchart for an example scientific workflow

In general, an R script should reflect the different steps outlined above. Hence, an R script of a scientific workflow would tend to look like this:

## Load libraries


## Retrieve and read data


## Process data


## Analyse data

### Descriptive analysis

### Statistical tests

### Model specifications


## Outputs

### Tabulation of results

### Model outputs

### Plots


## Report

In this chapter, we will go through a step-by-step walkthrough of how to build a robust scientific workflow in R. A robust workflow is one that is portable i.e., not dependent on hardware and software and instead can be run on almost any machine with very minimal, if any, additional setup or configuration required, and one that is reproducible i.e., can be run over and over again without issues, providing the expected results with the same data or providing updated results with new and/or updated data.

10.1 Create a new RStudio project

The steps here are a summary of what is found in Section 9.1.

Tip
  • Open RStudio

  • Click on the File option in the RStudio menu. In the dropdown menu, select New Project

  • In the menu window, select New directory option.

  • In the next menu window, select New project option.

  • In the next menu window, enter the following details:

    • Name of the project - important to make the project name as short as possible but descriptive of the project you are creating; don’t use spaces, instead use dash (or underscore) and avoid using capital letters;

    • Select the directory in your computer in which you want to save the project in. Click on Browse to open your computers file manager and navigate to the directory you want to save your project in;

    • Tick the selection box to make this project a git repository (whilst this is not necessary, this is highly recommended especially if you are collaborating with others);

    • Tick the selection box to enable renv in this project (this is what mainly contribute to the portability of your project); and,

    • Click on Create project

10.2 Create an R file for package dependencies

It is best practice to create a standalone R file specific for invoking/calling on R package dependencies. I recommend calling this file packages.R and this file should be saved in the root directory of the project you just created.

These are steps on how you can create this file.

Tip

Steps for creating an R package dependencies file:

  • Click on the File option in the RStudio menu. In the dropdown menu, select New File and then in the next dropdown menu, select R script.

  • A new tab will open in your text editor pane of RStudio (upper left pane) with the name Untitled1. Save this file by clicking on the disk icon on the text editor menu or do a keyboard shortcut with CTRL + s. Give this empty R script the filename packages.R.

  • You should now see a file in the main directory/root directory of your project named packages.R

  • Add code in the packages.R file specifying the packages you will be using in this project. There will be standard packages that we will always use with this type of workflow. So a template/generic packages.R file will contain the following:

################################################################################
#
#'
#' General packages needed for a targets workflow
#'
#
################################################################################

library(targets)
library(tarchetypes)
library(here)
library(rmarkdown)
library(knitr)
library(kableExtra)
library(dplyr)
library(openxlsx)
library(ggplot2)


################################################################################
#
#'
#' Add other packages that will be used in the project below
#'
#
################################################################################

10.3 Create placeholder directories

Create placeholder directories for different components of the workflow.These placeholder directories will provide an organising structure to the project and remind you of where to save/store specific files and outputs.

Tip

Following are steps on how to create placeholder directories:

  • In the lower right pane of RStudio (the file manager pane), find the menu button labelled Folder.

  • Give this new folder the label of R. This folder will hold all bespoke functions that we will create to use for this project workflow;

  • Repeat these steps to create new folders with the following labels:

    • data - This folder will hold any data that we retrieve as part of this workflow.

    • outputs - This folder will hold all our workflow outputs such as plots/figures, tables (in Excel or CSV files), HTML and/or Word and/or PDF outputs

    • reports - This folder will hold all our RMarkdown report (.Rmd) files

    • docs - This folder will hold any of our deployed outputs such as HTML report, dashboard, etc.

These are placeholder directories which we will populate as we work through the workflow for this project.

10.4 Create the target script file

The next task is to create a {targets} script file (_targets.R) which is the file that will define the workflow that we will be creating.

Tip

The _targets.R script file can be created through these steps:

  • Clicking on File –> New File –> R Script in RStudio.

  • A new tab will show in your Source window on the top left quadrant of your RStudio screen. This tab will usually be called Untitled1.

  • Save this file first and change its name to _targets.R. Make sure to save it in the current project directory.

  • You know that you were successful in doing this once you see a file called _targets.R in the file system window in the lower right quadrant of your RStudio screen.

10.5 Edit the targets script file

Now, the next step is to edit your script file by adding sets of R code that does the following:

  • Loads the packages required
  • Loads custom functions (if any)
  • Defines individual targets using tar_targets function
  • Ends with a list of targets objects

A basic {targets} workflow will look like this:

## Load libraries --------------------------------------------------------------
library(targets)



## Load custom functions -------------------------------------------------------
for (f in list.files("R", full.names = TRUE)) source (f)
for (f in list.files(here::here("R"), full.names = TRUE)) source (f)


## Create targets and list targets objects -------------------------------------