We demonstrate how to create and style boxplots using ggplot2.
tutorial
visualisation
Author: iHealth Team
Published: 29 November 2023
Modified: 19 October 2024
Boxplots (also called box-and-whisker plots) are a graphical tool used to summarise and display the distribution of a continuous variable. They are useful for several reasons:
Identifying Outliers: Boxplots clearly highlight outliers (values that fall significantly outside the range of most of the data). Outliers are shown as individual points beyond the “whiskers” of the plot.
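Visualising the Spread and Central Tendency: The box itself shows the interquartile range (IQR), the span from the first quartile (Q1) to the third quartile (Q3) that contains the middle 50% of the data. The line inside the box represents the median, providing a sense of central tendency.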
Displaying the Range of Data: The boxplot gives a quick overview of the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum values, helping to understand the range and overall distribution.
Key Components of a Boxplot:
Box: Spans the interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3).
Whiskers: Extend to the smallest and largest data points that lie within 1.5 times the IQR below Q1 and above Q3.
Median: The line inside the box, representing the middle value of the dataset.
Outliers: Shown as individual points beyond the whiskers.
Boxplots provide a concise summary, making it easier to understand the distributional properties of a dataset at a glance.
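To make these components concrete, the sketch below computes them by hand for a small made-up vector of weights. The values are illustrative only and do not come from the survey data. The quartiles use the default method of `quantile()`; other software may calculate quartiles slightly differently.

```r
# Toy weights in kg (made-up values, not from the survey data)
x <- c(20, 22, 23, 25, 26, 28, 30, 31, 33, 60)

# Quartiles and interquartile range
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences: 1.5 times the IQR below Q1 and above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Outliers lie beyond the fences; whiskers end at the most extreme
# values that are still inside them
is_outlier <- x < lower_fence | x > upper_fence
outliers <- x[is_outlier]
whiskers <- range(x[!is_outlier])

print(q)         # Q1, median, Q3
print(whiskers)  # ends of the lower and upper whiskers
print(outliers)  # points drawn individually
```

In this toy example only the value 60 falls beyond the upper fence, so it would be drawn as an individual point, and the whiskers would run from 20 to 33.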
In this tutorial, we demonstrate how to create boxplots using ggplot2. We will use a dataset from a nutrition survey of school children 10 years and older from Pakistan. This dataset is available from the Oxford iHealth teaching datasets repository.
## link to CSV from GitHub repository ----
csv_file_url <- "https://raw.githubusercontent.com/OxfordIHTM/teaching_datasets/refs/heads/main/school_nutrition.csv"  # (1)

## Read CSV file ----
nut_data <- read.csv(file = csv_file_url)
(1) This URL can be obtained on GitHub by opening the raw version of the file.
For this tutorial, we will focus on the weight variable in the dataset to demonstrate how to create and style boxplots with ggplot2.
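Before plotting, it can help to check that the variable is numeric and to see how many values are missing. The sketch below uses a small made-up data frame as a stand-in for `nut_data`. Only the column names `weight` and `sex` are taken from this tutorial; the values and the name `toy_data` are hypothetical.

```r
# Hypothetical stand-in for nut_data; only the column names `weight` and
# `sex` are taken from the tutorial, the values are made up
toy_data <- data.frame(
  sex = c(1, 2, 2, 1, 2, 1),
  weight = c(31.5, 28.0, NA, 40.2, 35.1, 29.8)
)

# Confirm the variable is numeric before plotting it
is_numeric_weight <- is.numeric(toy_data$weight)

# Count missing values; geom_boxplot() drops them with a warning
n_missing <- sum(is.na(toy_data$weight))

# Numeric summary of the values the boxplot will draw
weight_summary <- summary(toy_data$weight)

print(is_numeric_weight)
print(n_missing)
print(weight_summary)
```

The same three checks can be run on the real data by replacing `toy_data` with `nut_data`.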
Creating a boxplot
A boxplot of the weight variable for all children in the dataset can be created as follows:
## Load ggplot2 ----
library(ggplot2)                                                   # (1)

## Boxplot of weight of all children ----
ggplot(data = nut_data, mapping = aes(x = "", y = weight)) +       # (2)
  geom_boxplot() +                                                 # (3)
  labs(                                                            # (4)
    title = "Summary of weight values for all children",           # (5)
    subtitle = "School children 10 years and above in Pakistan",   # (6)
    y = "Weight (kgs)"                                             # (7)
  ) +
  theme_minimal()                                                  # (8)
(1) Load {ggplot2}. If not yet installed, run install.packages("ggplot2").
(2) Set the ggplot aesthetic mappings. A single boxplot only needs the y aesthetic; mapping x to an empty string gives the plot a discrete x axis with one unlabelled category. For more information, run ?ggplot.
(3) Draw the boxplot using geom_boxplot(). For more information, run ?geom_boxplot.
(4) Set the labels of the plot. For more information, run ?labs.
(5) Set the title of the plot.
(6) Set the subtitle of the plot.
(7) Set the y axis label of the plot.
(8) Set a plot theme. For more information, run ?theme_minimal.
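The appearance of the box, whiskers and outlier points can be adjusted through arguments to geom_boxplot(). The sketch below is one possible styling and not a recommended house style. It uses a small made-up data frame (`toy_data`, hypothetical) so that it runs without downloading the survey data, and the colours are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for nut_data with made-up weights in kg
toy_data <- data.frame(
  weight = c(24, 27, 29, 30, 31, 33, 34, 36, 38, 62)
)

# Styled boxplot: arguments to geom_boxplot() are fixed settings, not
# aesthetic mappings, so they apply to the whole layer
p <- ggplot(data = toy_data, mapping = aes(x = "", y = weight)) +
  geom_boxplot(
    fill = "lightblue",       # colour inside the box
    colour = "darkblue",      # colour of box outline, median and whiskers
    width = 0.4,              # narrower box
    outlier.colour = "red",   # colour of outlier points
    outlier.shape = 1         # open circles for outliers
  ) +
  labs(
    title = "Summary of weight values (toy data)",
    x = NULL,                 # drop the default "x" axis title
    y = "Weight (kgs)"
  ) +
  theme_minimal()

print(p)
```

Setting `x = NULL` in labs() removes the axis title that would otherwise read "x" when x is mapped to an empty string.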
A boxplot for weight by sex is created as follows:
## Convert sex to factor ----
nut_data$sex <- factor(nut_data$sex, labels = c("Male", "Female"))

## Boxplot of weight by sex of children ----
ggplot(data = nut_data, mapping = aes(x = sex, y = weight)) +
  geom_boxplot() +
  labs(
    title = "Summary of weight values by sex",
    subtitle = "School children 10 years and above in Pakistan",
    x = "Sex",
    y = "Weight (kgs)"
  ) +
  theme_minimal()
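When converting a coded variable to a factor, factor() assigns the labels to the sorted unique values in order, so it is worth confirming which code stands for which group. The sketch below names the levels explicitly and then computes the per-group medians that the grouped boxplot displays. The data frame `toy_data` is hypothetical, and the coding 1 = male, 2 = female is an assumption for illustration that should be checked against the dataset's documentation.

```r
# Hypothetical stand-in for nut_data; the coding 1 = male, 2 = female is an
# assumption that should be checked against the dataset's documentation
toy_data <- data.frame(
  sex = c(1, 2, 2, 1, 2, 1),
  weight = c(31.5, 28.0, 33.4, 40.2, 35.1, 29.8)
)

# Naming the levels explicitly ties each label to a specific code
toy_data$sex <- factor(
  toy_data$sex,
  levels = c(1, 2),
  labels = c("Male", "Female")
)

# Number of children per group
group_n <- table(toy_data$sex)

# Median weight per group, the line drawn inside each box
group_median <- tapply(toy_data$weight, toy_data$sex, median)

print(group_n)
print(group_median)
```

Comparing these medians with the lines inside the boxes is a quick way to confirm that the labels have been attached to the intended groups.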