Creating and styling boxplots

Using ggplot

We demonstrate how to create and style boxplots using ggplot.
Categories: tutorial, visualisation
Author: iHealth Team
Published: 29 November 2023
Modified: 19 October 2024

Boxplots (also called box-and-whisker plots) are a graphical tool used to summarise and display the distribution of a continuous variable. They are useful for several reasons:

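  1. Identifying Outliers: Boxplots clearly highlight outliers (values that lie far outside the range of most of the data, conventionally more than 1.5 times the interquartile range below the first quartile or above the third quartile). Outliers are shown as individual points beyond the “whiskers” of the plot.

  2. Visualizing the Spread and Central Tendency: The box itself shows the interquartile range (IQR), the distance between the first quartile (Q1) and the third quartile (Q3), which covers the middle 50% of the data. The line inside the box represents the median, providing a sense of central tendency.

  3. Displaying the Range of Data: The boxplot gives a quick overview of the minimum, Q1, median (Q2), Q3, and maximum values, helping to understand the range and overall distribution. When outliers are present, the whiskers end at the most extreme non-outlying values, and the minimum or maximum is the most extreme outlier point.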

Key Components of a Boxplot:

  • Box: Represents the interquartile range (IQR), spanning from Q1 to Q3.

  • Whiskers: Extend to the smallest data point that is no more than 1.5 times the IQR below Q1 and to the largest data point that is no more than 1.5 times the IQR above Q3.

  • Median: The line inside the box, representing the middle value of the dataset.

  • Outliers: Shown as individual points beyond the whiskers.

Boxplots provide a concise summary, making it easier to understand the distributional properties of a dataset at a glance.
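The components listed above can be worked out by hand in base R. The following is a minimal sketch using a small made-up vector of weights (the values and the object names are hypothetical and are not taken from the survey data). It uses the default quantile() method and the conventional 1.5 times IQR rule.

```r
## Hypothetical weights in kg (illustration only) ----
weights <- c(20, 22, 23, 25, 26, 27, 28, 30, 31, 55)

## Quartiles, median and interquartile range ----
q1  <- quantile(weights, 0.25, names = FALSE)
med <- median(weights)
q3  <- quantile(weights, 0.75, names = FALSE)
iqr <- q3 - q1

## Limits beyond which a value is treated as an outlier ----
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

## Outliers are the values outside those limits ----
outliers <- weights[weights < lower_limit | weights > upper_limit]

## Whiskers end at the most extreme values inside the limits ----
whisker_low  <- min(weights[weights >= lower_limit])
whisker_high <- max(weights[weights <= upper_limit])

c(q1 = q1, median = med, q3 = q3, iqr = iqr)
outliers
c(whisker_low = whisker_low, whisker_high = whisker_high)
```

In this made-up example the value 55 lies above the upper limit, so it would be drawn as a single point, and the upper whisker would stop at 31.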

In this tutorial, we demonstrate how to create boxplots using ggplot. We will use a dataset from a nutrition survey of school children 10 years and older from Pakistan. This dataset is available from the Oxford iHealth teaching datasets repository.

## Link to CSV from GitHub repository ----
csv_file_url <- "https://raw.githubusercontent.com/OxfordIHTM/teaching_datasets/refs/heads/main/school_nutrition.csv"

## Read CSV file ----
nut_data <- read.csv(file = csv_file_url)

The URL can be retrieved from GitHub by accessing the raw version of the GitHub link to the file. read.csv() then reads the CSV file directly from that URL.

On inspection of the dataset, we see:

## Show first 6 rows of data ----
head(nut_data)
  region school age_months sex weight height
1      1      1        121   2   20.6  124.6
2      1      1        121   1   27.9  130.7
3      1      1        129   2   25.7  131.4
4      1      1        133   1   27.0  135.7
5      1      1        145   2   28.5  130.5
6      1      1        148   2   35.1  142.1

We have a data.frame with 267 rows and 6 columns.
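A quick structural check can also be done with dim(), str() and summary(). The sketch below rebuilds only the six rows printed above into a small data.frame so that it runs without a network connection. The name nut_head is hypothetical. With the full dataset loaded, the same calls can be applied to nut_data instead.

```r
## Rebuild the six rows shown by head() (illustration only) ----
nut_head <- data.frame(
  region     = c(1, 1, 1, 1, 1, 1),
  school     = c(1, 1, 1, 1, 1, 1),
  age_months = c(121, 121, 129, 133, 145, 148),
  sex        = c(2, 1, 2, 1, 2, 2),
  weight     = c(20.6, 27.9, 25.7, 27.0, 28.5, 35.1),
  height     = c(124.6, 130.7, 131.4, 135.7, 130.5, 142.1)
)

## Number of rows and columns ----
dim(nut_head)

## Type of each column ----
str(nut_head)

## Minimum, quartiles, median, mean and maximum of weight ----
summary(nut_head$weight)
```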

For this tutorial, we will focus on the weight variable in the dataset to demonstrate how to create and style boxplots using ggplot.

Creating a boxplot

A boxplot of the weight variable for all children in the dataset can be created as follows:

## Load ggplot2 ----
library(ggplot2)

## Boxplot of weight of all children ----
ggplot(data = nut_data, mapping = aes(x = "", y = weight)) +
  geom_boxplot() +
  labs(
    title = "Summary of weight values for all children",
    subtitle = "School children 10 years and above in Pakistan",
    y = "Weight (kg)"
  ) +
  theme_minimal()

In this code:

  • library(ggplot2) loads {ggplot2}. If not yet installed, run install.packages("ggplot2").

  • ggplot() sets the aesthetic mappings. For a single boxplot, we only need to map the variable of interest to the y axis. We also map an empty character value to the x axis so that the x axis is treated as discrete and all observations fall into one unlabelled group. For more information, run ?ggplot.

  • geom_boxplot() draws the boxplot. For more information, run ?geom_boxplot.

  • labs() sets the labels of the plot: title sets the title, subtitle sets the subtitle, and y sets the y axis label. For more information, run ?labs.

  • theme_minimal() sets the plot theme. For more information, run ?theme_minimal.
Figure 1: Boxplot of weight of all children
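To see which values geom_boxplot() actually draws, the statistics computed for the layer can be extracted from a plot object with layer_data(). The sketch below uses a small made-up vector of weights rather than the survey data. The names toy, p and box_stats are hypothetical.

```r
library(ggplot2)

## Hypothetical weights in kg (not taken from the survey data) ----
toy <- data.frame(weight = c(20, 22, 23, 25, 26, 27, 28, 30, 31, 55))

## Same kind of plot as above, stored in an object ----
p <- ggplot(data = toy, mapping = aes(x = "", y = weight)) +
  geom_boxplot()

## Extract the statistics computed for the boxplot layer ----
box_stats <- layer_data(p, 1)

## Whisker ends (ymin, ymax), box edges (lower, upper) and median (middle) ----
box_stats[, c("ymin", "lower", "middle", "upper", "ymax")]

## Values drawn as individual outlier points ----
box_stats$outliers[[1]]
```

The same call on a plot built from nut_data should return the corresponding values for the weight variable.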

A boxplot for weight by sex is created as follows:

## Convert sex to factor ----
nut_data$sex <- factor(nut_data$sex, labels = c("Male", "Female"))

## Boxplot of weight by sex of children ----
ggplot(data = nut_data, mapping = aes(x = sex, y = weight)) +
  geom_boxplot() +
  labs(
    title = "Summary of weight values by sex",
    subtitle = "School children 10 years and above in Pakistan",
    x = "Sex",
    y = "Weight (kg)"
  ) +
  theme_minimal()

Here the x-axis variable is set to sex, so one box is drawn for each sex.
Figure 2: Boxplot of weight by sex
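The appearance of the boxes can be adjusted through arguments to geom_boxplot() and through a fill scale. The following is a minimal sketch of one possible styling, using simulated data (not the survey data) so that it runs on its own. The object names, colours and sizes are arbitrary choices for illustration.

```r
library(ggplot2)

## Simulated data for illustration only ----
set.seed(1)
toy <- data.frame(
  sex = factor(
    rep(c("Male", "Female"), each = 20),
    levels = c("Male", "Female")
  ),
  weight = c(rnorm(20, mean = 30, sd = 5), rnorm(20, mean = 29, sd = 5))
)

## Boxplot by sex with some styling applied ----
p <- ggplot(data = toy, mapping = aes(x = sex, y = weight, fill = sex)) +
  geom_boxplot(
    width = 0.5,             # narrower boxes
    colour = "grey30",       # colour of box outlines and whiskers
    outlier.colour = "red",  # colour of outlier points
    outlier.shape = 1        # open circles for outlier points
  ) +
  scale_fill_manual(values = c(Male = "#56B4E9", Female = "#E69F00")) +
  labs(x = "Sex", y = "Weight (kg)") +
  theme_minimal() +
  theme(legend.position = "none")  # legend repeats the x axis, so hide it

p
```

With the tutorial data loaded and sex converted to a factor as above, replacing toy with nut_data should apply the same styling to the survey boxplot.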