Open Reproducible Science

Open and Reproducible Science in R

View the Project on GitHub OxfordIHTM/open-reproducible-science

Open and Reproducible Science in R: A University of Oxford International Health and Tropical Medicine Module


Part 1: All about R

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and non-linear modelling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
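As a small illustration of the point about plots and mathematical annotation, the sketch below uses only base R graphics. The data are simulated purely for this example and are not part of the course materials.

```r
# Simulated data for illustration only
set.seed(1)
x <- seq(0, 2 * pi, length.out = 50)
y <- sin(x) + rnorm(length(x), sd = 0.1)

# Draw to a null PDF device so the sketch also runs non-interactively;
# remove the pdf() and dev.off() lines to see the plot in RStudio.
pdf(NULL)
plot(x, y,
     xlab = expression(theta),
     ylab = expression(sin(theta) + epsilon),
     main = expression(paste("Noisy sine: ", y == sin(theta) + epsilon)))
curve(sin(x), from = 0, to = 2 * pi, add = TRUE, lty = 2)  # true curve, dashed
dev.off()
```

The `expression()` calls are what let axis labels and titles carry Greek letters and formulae; every default shown here (symbols, margins, line type) can be overridden by the user.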

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and macOS.

R is unique in that it is not general-purpose. It does not compromise by trying to do a lot of things. It does a few things very well, mainly statistical analysis and data visualization. While you can find data analysis and machine learning libraries for languages like Python, R has many statistical functionalities built into its core. No third-party libraries are needed for much of the core data analysis you can do with the language.
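To make the "built into its core" point concrete, here is a minimal sketch that fits a regression and runs a t-test using only what ships with R. The choice of the `mtcars` dataset and of these two analyses is an assumption made for illustration.

```r
# mtcars ships with R (datasets package), so nothing needs installing
fit <- lm(mpg ~ wt, data = mtcars)   # linear regression: fuel economy on weight
summary(fit)$coefficients            # estimates, standard errors, t and p values

# Welch two-sample t-test comparing mpg by transmission type (am is 0 or 1)
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value
```

Both `lm()` and `t.test()` come from the `stats` package, which is attached by default in every R session.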

But even with this specific use case, it is used in every industry you can think of because a modern business runs on data. Using past data, data scientists and data analysts can determine the health of a business and give business leaders actionable insights into the future of their company.

Just because R is specifically used for statistical analysis and data visualization doesn’t mean its use is limited. It’s actually quite popular, ranking 12th in the TIOBE index of the most popular programming languages.

Academics, scientists, and researchers use R to analyse the results of experiments. In addition, businesses of all sizes and in every industry use it to extract insights from the increasing amount of daily data they generate.

Session 1: Getting the right tools for the job

Using the right tools for the job is critical for success in any endeavour. In scientific research, and in scientific analysis in particular, R is widely used by researchers from diverse disciplines to estimate and display results, and by teachers of statistics and research methods. One of the most powerful characteristics of R is that it is open-source, meaning anyone can access the underlying code used to run the program and add their own code for free. Other tools used alongside R amplify its benefits.

In this session, we will introduce a basic set of tools that maximises the effectiveness of R for scientific research. Students will be given a basic introduction to R, RStudio, and to git and GitHub. Students will also be guided through installing and setting up each of these tools in the way that is considered best practice for scientific research purposes.

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 2: Learning the basics of R - Part 1: R through the user interface of RStudio

This session will open with a discussion of the current state of “data” in general and how this translates to global health data specifically. This discussion will then be linked to how the use of R can facilitate and support researchers in computing and statistical analysis. The final part of the session will focus on a data exercise using R, based on Exercise 1 of Practical R for Epidemiologists.

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 3: Learning the basics of R - Part 2: Creating and manipulating objects, and extending R using packages

In this session, the students will continue to work through using R to work with data and will be introduced to more built-in functions in R for performing data analysis. They will also be introduced to R packages and how they extend the functionalities of R specifically with regard to computing and statistical analysis. Students will be taught how to install and load R packages and be given a background and introduction to object-oriented programming.
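The kinds of steps this session describes might look like the sketch below. The object names and values are made up for illustration, and `ggplot2` is only a hypothetical example of a package one might install.

```r
# Create objects with the assignment operator
ages   <- c(23, 35, 41, 29)                # numeric vector
sex    <- factor(c("f", "m", "f", "m"))    # factor (categorical variable)
people <- data.frame(age = ages, sex = sex)

# Manipulate objects: add a derived column, then subset rows
people$over30 <- people$age > 30
older <- people[people$over30, ]

# Every object has a class; generic functions such as summary()
# behave differently depending on it (R's S3 object system)
class(people)
summary(ages)   # numeric summary
summary(sex)    # counts per level

# Extending R: install a package once, then load it in each session
# install.packages("ggplot2")   # hypothetical choice; needs internet access
# library(ggplot2)
library(stats)   # attaching a package that ships with R, for illustration
```

The two `summary()` calls give a first taste of object-oriented behaviour: the same function name dispatches to different methods for a numeric vector and for a factor.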

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 4: Learning the basics of R - Part 3: Creating your own functions

In this session, the students will be introduced to functions in R: what they are, how they extend the functionalities of R, and how to create their own functions for the specific computing and statistical analysis tasks that they require.
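A user-defined function for a specific analysis task could be sketched as follows. The function name `prop_ci` and its arguments are hypothetical; it computes a proportion with a normal-approximation (Wald) confidence interval, chosen here only as a simple example.

```r
# Hypothetical helper: proportion with a Wald confidence interval
prop_ci <- function(cases, n, level = 0.95) {
  # Validate inputs early so misuse fails with a clear error
  stopifnot(n > 0, cases >= 0, cases <= n)
  p  <- cases / n
  z  <- qnorm(1 - (1 - level) / 2)     # critical value for the chosen level
  se <- sqrt(p * (1 - p) / n)          # standard error of the proportion
  c(estimate = p,
    lower = max(0, p - z * se),        # clamp the interval to [0, 1]
    upper = min(1, p + z * se))
}

# Usage: 30 cases among 200 people
prop_ci(cases = 30, n = 200)
```

Wrapping the calculation in a function means it can be reused across datasets, and the default `level = 0.95` can be changed per call without editing the body.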

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 5: Creating basic R workflows and literate programming

In this session, the students will be taught and guided through creating basic R workflows with a focus on using a project-oriented approach in building scientific analysis workflows. The students will then be introduced to literate programming in R and the tools, packages, and techniques used to convert raw code workflows into those that combine content and data analysis code.

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Part 2: Open and Reproducible Science

Open Science is the practice of science in such a way that others can collaborate and contribute, where research data, lab notes and other research processes are freely available, under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods. Reproducible research means that research data and code are made available so that others are able to reach the same results as are claimed in scientific outputs. Closely related is the concept of replicability, the act of repeating a scientific methodology to reach similar conclusions. These concepts are core elements of empirical research.
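One small, concrete ingredient of reproducibility in R is fixing the random seed, so that code involving random numbers gives the same result each time it is run. The sketch below is an illustration under that narrow assumption; the seed value is arbitrary.

```r
# Without a fixed seed, two runs of a simulation would differ
set.seed(2024)
first_run <- mean(rnorm(100))

# Resetting the same seed reproduces the same random draws
set.seed(2024)
second_run <- mean(rnorm(100))

identical(first_run, second_run)
```

Sharing the code together with the seed (and the data) is what lets someone else arrive at the same numbers as those reported.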

This lecture series is designed to give Oxford IHTM students a foundational understanding and appreciation of the pillars of Open Science more broadly and within that the concepts, methods and tools for Reproducible Research more specifically. To further the students’ learning, practical examples and exercises will be discussed and walked through using the R language for statistical computing as a way to practically demonstrate these concepts.

Session 6: Open Science and Reproducible Research in R: An Overview

This session will give an overview of the what, why, and how of Open Science and Reproducible Research, to provide the learners with a foundational understanding and appreciation of these concepts and their applications.

Further Reading

Rick, Jessica & Alston, Jesse (2020). A beginner’s guide to conducting reproducible research in ecology, evolution, and conservation. doi:10.32942/osf.io/h5r6n.

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 7: Git and GitHub for use with R: tools for versioning and sharing research

This session will give an overview of git and GitHub along with their integration with R using RStudio. This will then be followed by a practical session to guide learners through setting up git and GitHub on their personal machines, culminating in the learners accessing their first assignment in GitHub Classroom.

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 8: Reproducible scientific workflows in R - Part 1: Introduction to the {targets} package

This session will cover best practices for reproducible scientific workflows and introduce the {targets} package for implementing these workflows. The final part of the session will be the first day of Hackathon 2024.
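For orientation, a minimal {targets} pipeline might look like the sketch below. It is the content of a `_targets.R` file placed in the project root; the helper functions and target names are hypothetical and the built-in `mtcars` data stands in for real project data.

```r
# _targets.R : a minimal pipeline sketch (names are illustrative)
library(targets)

# Hypothetical helper functions; in a real project these often live in R/
summarise_mpg <- function(data) {
  aggregate(mpg ~ cyl, data = data, FUN = mean)
}

fit_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

# The pipeline is a list of targets. Each target is a named step, and
# {targets} infers the dependency graph from the names each command uses.
list(
  tar_target(raw_data, mtcars),
  tar_target(mpg_summary, summarise_mpg(raw_data)),
  tar_target(model, fit_model(raw_data))
)
```

Running `targets::tar_make()` from the project root builds the pipeline, and `targets::tar_read(model)` retrieves a stored result. On later runs, targets whose code and upstream dependencies have not changed are skipped.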

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 9: Reproducible scientific workflows in R - Part 2: Creating targets-based scientific workflows

In this session, the students will create their own targets-based workflows as part of the second day of Hackathon 2024.

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

Session 10: Making your R-based research project portable

This session will discuss R code portability: the factors that affect whether the R code or scripts we develop can be used and run successfully by someone else, and the currently available solutions for ensuring portability.

The final part of the session will be the third day of Hackathon 2024.
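Some of the factors that affect portability can be handled with base R alone, as in the sketch below. The file name and the list of required packages are placeholders, and this is only one possible approach; dedicated tools such as the {renv} package go further by pinning package versions, though this page does not state which solutions the session covers.

```r
# 1. Build paths relative to the project with file.path(), rather than
#    hard-coding an absolute location that exists on one machine only
data_path <- file.path("data", "survey.csv")   # hypothetical file name

# 2. Record the R version and attached packages so others can match them
r_version <- R.version.string
session   <- sessionInfo()
print(r_version)
print(names(session$otherPkgs))   # attached non-base packages (NULL if none)

# 3. Declare dependencies explicitly and fail early with a clear message
required <- c("stats", "utils")   # placeholder: list what the project needs
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]
if (length(not_installed) > 0) {
  stop("Please install: ", paste(not_installed, collapse = ", "))
}
```

Checking dependencies at the top of a script turns a confusing mid-analysis failure into an immediate, actionable message for whoever runs the code next.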

Further Reading

Teaching Material

Slides can be viewed here.

PDF version of slides can be downloaded here.

R scripts for slides available here.

 

Series Lecturer

Ernest Guevarra

 

License

Unless otherwise specified, data used in the code and in the slide decks in this repository are licensed under a CC0 1.0 Universal license.

All code in this repository is licensed under the GNU General Public License version 3 (GPL-3).

All slide decks in this repository are licensed under a CC BY 4.0 license.

 

Community guidelines

Feedback, bug reports and feature requests are welcome; file issues here or seek support here. If you would like to contribute to these teaching materials, please see our contributing guidelines.

Please note that the Open and Reproducible Science project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.