Applied Bioinformatics

Introduction to ggplot


What is RStudio?

RStudio is an ‘Integrated Development Environment’ for R, a software application that allows the user to perform statistical analysis, plot graphs, write functions and manage datasets using the R statistical programming language.

How to install RStudio?

Before you can install RStudio, you need to install R, the programming language RStudio uses. You can download R from the official website: https://cloud.r-project.org

For Windows users, select Download R for Windows, followed by base then click on the download link.

For macOS users, select Download R for macOS, then click on the appropriate download link. Make sure to download the correct file, depending on whether your Mac uses Apple silicon (M1/M2) or an Intel processor.

Now you can download RStudio: https://www.rstudio.com/products/rstudio/download

We recommend you install R and RStudio on your own computer. However, there may be times when you don’t have access to a local version of RStudio, for example if an on-campus computer doesn’t have it installed, and need to use the cloud-based version instead.

To access RStudio Cloud, follow this link: https://rstudio.cloud/

RStudio Cloud is free to use, but you are limited to 25 hours per month. Once you have set up an account (or logged in through a Google or GitHub account), create a new project and it should load RStudio in the same way as the local version.

The RStudio interface

The basics

Functions

Unlike Excel, SPSS or other similar software, which operate using a ‘point and click’ system, R is an interpreted language. In order to do anything in RStudio, you need to tell it what to do using the R language, typically through functions.

A function in R usually consists of two parts…

  • The name of the function
  • Parentheses

Here are some examples:

getwd()
setwd()
dir()
dim()

The function getwd() tells you the current folder location RStudio is looking at. You can see the brackets are empty. This is because you don’t need to supply any additional information, also called ‘arguments’, to the function. Most functions however do require one or more arguments.

The function dir() lists all the files in your current working directory. However, you can also run dir(recursive = TRUE), which lists the files in all subdirectories as well. Note that R is case-sensitive: the logical value must be written TRUE, not true.

This might have filled your console with a list of files, so this is a good time to learn how to clear your console. Press CTRL + L on your keyboard, and this should wipe your console.

If you want to learn more about a function, you can either run the help command:

?name_of_function

or search for it on the internet or use the help menu within RStudio. Select the ‘Help’ tab in the bottom right pane, then search the name of the function.
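As a brief illustration of the points above, the sketch below calls a function with no arguments, the same kind of function with named arguments, and then shows two equivalent ways of opening a help page. It uses base R only. Pointing dir() at tempdir() (R's temporary folder) is just an assumption made here so that the example does not list a large folder; in practice you would usually leave the path out.

```r
# A function called with no arguments: returns the current working directory
current_folder <- getwd()
print(current_folder)

# The same idea with arguments: 'path' says where to look,
# 'recursive' says whether to include subdirectories
top_level_files <- dir(path = tempdir())
all_files <- dir(path = tempdir(), recursive = TRUE)

# args() prints the arguments a function accepts
args(dir)

# Two equivalent ways of opening the help page for dir()
?dir
help("dir")
```

If a function has an argument you do not recognise, its help page lists every argument and its default value.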

Data types

Vectors:

A series of values. These are created using the c() function, which stands for ‘combine’ or ‘concatenate’. For example, c(6, 11, 13, 31, 90, 92) creates a six-element vector of positive whole numbers.

Factors:

Categorical data are commonly represented in R as factors. Categorical data can also be represented as strings.

Data frames:

Rectangular spreadsheets. They are representations of datasets in R where the rows correspond to observations and the columns correspond to variables.
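To make the three data types more concrete, here is a minimal sketch that builds one of each using base R. The values and the object names (flipper_lengths, species_names, small_table) are made up for illustration and are not taken from a real dataset.

```r
# A numeric vector, created with c()
flipper_lengths <- c(181, 186, 195, 217)

# A character (string) vector holding categorical data
species_names <- c("Adelie", "Adelie", "Chinstrap", "Gentoo")

# The same categorical data converted to a factor
species_factor <- factor(species_names)
levels(species_factor)  # the distinct categories the factor can take

# A data frame: each row is an observation, each column is a variable
small_table <- data.frame(species = species_factor,
                          flipper_length_mm = flipper_lengths)
dim(small_table)  # number of rows, then number of columns
str(small_table)  # compact summary of each column and its type
```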

Objects

What if you need the output from one function in order to perform something else? R doesn’t save the output from anything you run in the console unless you assign it to an object.

Going back to the vector above, typing it in the console just prints it; it is not saved. We can instead assign this vector to an object using either = or <- (the arrow <- is the more common convention in R).

My_vector <- c(6, 11, 13, 31, 90, 92)

My_vector
## [1]  6 11 13 31 90 92

You can now just type My_vector and RStudio will print out your vector, or you could use it in another function.
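Here is a short sketch of what ‘using it in another function’ can look like. It reuses My_vector from above; the names vector_length, vector_mean and doubled are only illustrative.

```r
My_vector <- c(6, 11, 13, 31, 90, 92)

# Pass the stored object to other functions and keep their output too
vector_length <- length(My_vector)  # how many elements it holds
vector_mean <- mean(My_vector)      # the arithmetic mean of the elements

# Arithmetic is applied to every element of the vector at once
doubled <- My_vector * 2

vector_mean
## [1] 40.5
doubled
```

Because each result was assigned to an object, it appears in your environment pane and can be reused later without rerunning the calculation.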

Working directory and importing files

Your working directory (the current folder location RStudio is set to work in) is important in RStudio. This is the location you can import files from, and where RStudio will place any files you export. To see what the current working directory is set to, type getwd(). To set your working directory, you can use the function

setwd('C:/insert/folder/path/here')

There is an easier way to do this however. As shown in the image below, click on the three small dots then browse to the desired folder location and select it. Then select the ‘More’ option and click ‘Set as working directory’.

The function and arguments you use to import a file and assign it to an object depend on the type of file it is, e.g. .tsv or .csv. One option using base R (no packages needed) is the read.delim() function. You need to assign the output of the function to an object, and state whether the file has column headers and which separator the file uses (comma ',' for .csv files and tab '\t' for .tsv files).

My_tsv_file <- read.delim('name_of_file.tsv', header = TRUE, sep = '\t')

My_csv_file <- read.delim('name_of_file.csv', header = TRUE, sep = ',')
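If you don't have a file to hand, the following sketch writes a tiny made-up table to R's temporary folder and reads it back in, so you can see the import step working end to end. The table contents, the file names and the use of tempdir() are all assumptions for illustration; read.csv() is a base R shortcut for comma-separated files.

```r
# A small made-up table to export and re-import
example_data <- data.frame(sample_id = c("a", "b", "c"),
                           read_count = c(120, 98, 143))

# Build file paths inside R's temporary folder
tsv_path <- file.path(tempdir(), "example.tsv")
csv_path <- file.path(tempdir(), "example.csv")

# Write the table out as tab-separated and comma-separated files
write.table(example_data, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
write.csv(example_data, csv_path, row.names = FALSE)

# Read both files back in and assign each to an object
from_tsv <- read.delim(tsv_path, header = TRUE, sep = "\t")
from_csv <- read.csv(csv_path, header = TRUE)

head(from_tsv)  # first rows of the imported table
dim(from_csv)   # check the number of rows and columns
```

Checking dim() or head() straight after an import is a quick way to spot a wrong separator, which typically shows up as a table with a single column.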

Packages

Alongside all the functions that come with R built-in, you can also install packages containing lots of other functions.

The main package we will be working with later is called Tidyverse, and contains functions relating to graph plotting, managing datasets and much more. Tidyverse is actually a collection of multiple packages bundled together. You can find out more on their official website: https://www.tidyverse.org

Another package we will be using today is called palmerpenguins. This is a dataset designed to be used as a learning tool for data exploration and visualisation. It contains information about penguin species, their geographical distribution and their bill dimensions.

To install a package in R, use the following command:

install.packages('nameofpackage')

Although you might have installed a package, this does not mean you can use it straight away. You need to manually load each package you want to use. This may seem slow and inefficient, however some packages can be extremely large and would slow R down if it loaded every package you have installed each time you turned it on.

To load one of your installed packages, use the following command:

library(nameofpackage)
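Since installing only needs to happen once but loading happens every session, it can be handy to check what is already installed. The sketch below is one possible way to do that with the base R function requireNamespace(); the object names needed_packages, installed and missing_packages are illustrative. It only reports what is missing and does not install anything itself.

```r
# The packages this session will need
needed_packages <- c("tidyverse", "palmerpenguins")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
installed <- sapply(needed_packages, requireNamespace, quietly = TRUE)
installed

# Pick out the names of any packages that are not installed yet
missing_packages <- needed_packages[!installed]

if (length(missing_packages) > 0) {
  message("Still to install with install.packages(): ",
          paste(missing_packages, collapse = ", "))
} else {
  message("All packages are installed and can be loaded with library().")
}
```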

Exercise - plotting data with ggplot2

Loading libraries and preparing our dataset

The first thing we need to do is install the two packages we need: Tidyverse and palmerpenguins.

install.packages('tidyverse')
install.packages('palmerpenguins')

RStudio cannot access these until we load them (once loaded, they are referred to as libraries). Think of it like installing a program on your computer: you can’t use it until you open it, despite it being installed. To load a library you use the library() function.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)

We can now use all the functions from the Tidyverse package and we have our dataset loaded and ready to use. It would be useful, though, to actually view our dataset, so you can see what you are plotting. There are many ways to do this. One option is to use the head() function. Type head(penguins) and see what you get. This just shows the first few rows of the data in your console, but you can view the entire dataset in its own pane by typing

data(penguins)

This should add a ‘penguins’ entry under values in your environment pane. Then type penguins in the console and click on its data entry in the top right to view the table. As you can see this dataset contains information about three species of penguin, metrics associated with their bills, their sex and island location.
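Besides head(), a few other base R functions give a quick overview of a dataset before plotting. The sketch below is one possible set of checks, assuming palmerpenguins is installed. The last line counts missing values per column, which is worth knowing because ggplot2 will warn you when it drops rows with missing data.

```r
library(palmerpenguins)

head(penguins, n = 3)  # the first three rows only
dim(penguins)          # number of rows, then number of columns
names(penguins)        # the column (variable) names
str(penguins)          # the type of each column, with example values
summary(penguins)      # per-column summaries (ranges, counts per category)

# How many missing values (NA) does each column contain?
colSums(is.na(penguins))
```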

Recording our work - creating an R script file

The best way to keep track of all the code you are going to use is to note it down in an R script file. By doing this you can keep all the relevant code in the same place, and you can even run it directly from the script file as opposed to continuously typing it into the console. To create a new R script file click on File -> New File -> R Script. If you want to include non-code text, start the line with a #, otherwise R will try to read it as code.
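As a rough idea of what a script file might contain, here is a minimal sketch mixing comments and code. The contents are only an example, reusing the vector from earlier; the names largest_value and smallest_value are illustrative.

```r
# My first R script
# Lines starting with # are comments: R ignores them when the script is run

# Store a vector so it can be reused further down the script
My_vector <- c(6, 11, 13, 31, 90, 92)

# A comment can also follow code on the same line
largest_value <- max(My_vector)   # the largest element
smallest_value <- min(My_vector)  # the smallest element

largest_value
smallest_value
```

In RStudio, you can run the line your cursor is on (or a highlighted selection) with CTRL + Enter, which sends it to the console for you.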

How does plotting work with ggplot2

The function we are going to use is called ggplot() which is from the ggplot2 package within Tidyverse.

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = penguins) creates an empty graph, but it’s not very interesting to look at.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case penguins.

The basic template for a graph with ggplot2 looks like this:

ggplot(data = your_dataset, aes(x = variable, y = variable)) + geom_function()

This is quite a lot of information, so here’s a nice summary produced by the R for ecology team:

Plotting a single variable

The first thing we are going to do is plot a single variable from our penguin dataset, flipper length. We are first going to plot a histogram, from the function geom_histogram().

ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram()

We’ve made a histogram, although it’s very plain. Let’s make some changes.

ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(fill = species), colour = 'black') +
  theme_classic() +
  labs(x = 'Flipper length (mm)', y = 'Frequency', fill = 'Species')

  • aes(fill = species) separates the data by colour according to the species column.
  • colour = 'black' sets the colour of the bar outlines to black.
  • theme_classic() applies a preset overall appearance (a theme) to the graph.
  • labs() allows you to set labels.

Have a go at making the density plot below using the geom_density() function with the same data. Use the argument alpha = to set the transparency.

NOTE: If you assign your plot to an object, you will need to type the name of the object into the console to produce the graph.
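The note above can be seen in action with the short sketch below, which assumes ggplot2 and palmerpenguins are installed. The object name flipper_histogram and the binwidth = 5 setting are illustrative choices, not part of the exercise; na.rm = TRUE simply silences the warning about rows with missing values.

```r
library(ggplot2)
library(palmerpenguins)

# Assigning the plot to an object stores it but does not draw it
flipper_histogram <- ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram(binwidth = 5, na.rm = TRUE)

# Typing the object's name draws the plot
flipper_histogram

# A stored plot can have further layers or themes added to it later
flipper_histogram + theme_classic()
```

Storing plots this way is useful when you want to build a graph up in stages, as we do later with the regression example.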

Plotting two variables - flipper length vs body mass

This time we are going to plot the flipper length (x axis) against body mass (y axis). In addition, we are going to slowly improve the code. Your graph will start like the plot on the left and finish like the one on the right.

Version one

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

As you can see this plot is limited in how much information you can infer from it. Can you tell how the different species mix? This code is a starting point which you can build on to make a better plot.

Version two

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point(aes(color = species))

This time we have added some extra detail to the geom_point() function. This has now coloured the data points according to what species the penguin belongs to.

Version three

ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) + 
  geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3)

Now we have not only coloured the datapoints by species but given each species a unique shape as well. Using the alpha argument, we have set the opacity to 80% (so each point is 20% transparent), and we have also set the size of each datapoint using the size argument.
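Notice that in version three some settings sit inside aes() and others outside it. The sketch below tries to make that difference visible, assuming ggplot2 and palmerpenguins are installed. The object names and the colour 'steelblue' are illustrative only.

```r
library(ggplot2)
library(palmerpenguins)

# Shared starting point for both plots
base_plot <- ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g))

# Inside aes(): colour is mapped to a column, so it varies with the data
# and ggplot2 adds a legend automatically
mapped_colour <- base_plot +
  geom_point(aes(color = species), na.rm = TRUE)

# Outside aes(): colour, alpha and size are fixed values applied to every point,
# so there is nothing to show in a legend
fixed_colour <- base_plot +
  geom_point(color = "steelblue", alpha = 0.8, size = 3, na.rm = TRUE)

mapped_colour
fixed_colour
```

As a rule of thumb, anything that should change from row to row goes inside aes(), and anything that is the same for every point goes outside.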

Version four

This time we are going to add some of our own labelling; however, some of the code is missing. Replace the question marks with the appropriate code.

ggplot(data = ?, ?(x = ?, y = ?)) + 
  geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3) +
  labs(x = "Flipper length (mm)", y = ?, color = 'Penguin species',
    shape = 'Penguin species')

Version five

This time, using the theme() function, we are going to customise the legend position and the legend background.

ggplot(data = ?, aes(?)) +
  geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3) +
  labs(?) + theme(legend.position = c(0.9, 0.2),
    legend.background = element_rect(fill = "white", color = NA))

Version six

Finally, let’s overlay an appearance theme in addition to our previous code.

ggplot(?) + geom_point(?) + labs(?) + theme_classic() + theme(?)

Saving your graph

You can use the function ggsave() to export your plot as a pdf, png etc.

ggsave('my_scatterplot.pdf')

There are many more settings within ggsave(). You can look it up in the help menu to see the different arguments available to you. These include setting the height and width of your image, alongside its dpi quality. By default ggsave() will export your most recent plot.
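Here is a sketch of ggsave() with a few of those settings filled in, assuming ggplot2 and palmerpenguins are installed. The object name mass_plot, the 6 x 4 inch size and saving into tempdir() are illustrative assumptions; in your own work you would normally give a file name in your working directory.

```r
library(ggplot2)
library(palmerpenguins)

# Store a plot so we can name it explicitly when saving
mass_plot <- ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Build the output path (here in R's temporary folder)
output_file <- file.path(tempdir(), "my_scatterplot.pdf")

# 'plot' says which plot to save, instead of relying on the most recent one;
# 'width' and 'height' are interpreted in the given 'units'
ggsave(output_file, plot = mass_plot, width = 6, height = 4, units = "in")

# For image formats such as png, the dpi argument sets the resolution as well

file.exists(output_file)  # confirm the file was written
```

The file type is taken from the extension of the file name, so changing .pdf to .png is enough to switch format.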

Plotting two variables - adding a regression line and ANOVA

For this example we are going to plot the length of the penguin bills on the x axis against the depth of the bills on the y axis.

Generating the basic plot

bill_anatomy <- ggplot(data = ?, aes(x = ?, y = ?, group = species)) +
  geom_point(aes(color = species, shape = species), size = 3, alpha = 0.8)

Try to fill in the missing code. Your plot should look like this:

Adding regression lines

Regression lines can easily be added to a scatterplot by adding another geom function to it, geom_smooth(method = 'lm'). Here we are going to add one regression line per species, so three regression lines in total. method = 'lm' refers to using a linear model. Your plot should look like this:

You can remove the confidence intervals overlaid on your regression lines by including the argument se = FALSE within the geom_smooth() function.
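To see geom_smooth() and se = FALSE working without giving away the bill plot, the sketch below applies the same idea to a different pair of variables (flipper length and body mass). It assumes ggplot2 and palmerpenguins are installed; the object name mass_trend is illustrative, and formula = y ~ x just states the default straight-line fit explicitly.

```r
library(ggplot2)
library(palmerpenguins)

mass_trend <- ggplot(data = penguins,
                     aes(x = flipper_length_mm, y = body_mass_g, group = species)) +
  geom_point(aes(color = species), na.rm = TRUE) +
  # One straight line per species; se = FALSE hides the confidence bands
  geom_smooth(aes(color = species), method = "lm", formula = y ~ x,
              se = FALSE, na.rm = TRUE)

mass_trend
```

Try changing se = FALSE to se = TRUE to compare the two versions of the plot.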

Final tidying

Now we can add some final tidying and customisation.

bill_anatomy <- bill_anatomy +
  theme_classic() +
  labs(?) +
  theme(legend.position = c(0.85, 0.15),
    legend.background = element_rect(fill = 'white', colour = NA))

Your graph should look something like this:

Performing an ANOVA

Let’s examine the effect of bill length and penguin species and their interaction on bill depth using an ANOVA. Run the following code below:

model <- aov(bill_depth_mm ~ bill_length_mm * species, data = penguins)

The A ~ B * C notation means that the ANOVA models bill depth (A) using bill length (B), species (C) and the interaction between bill length and species (B:C). Now you can run the following code:

drop1(model, scope = ~., test = "F")
## Single term deletions
## 
## Model:
## bill_depth_mm ~ bill_length_mm * species
##                        Df Sum of Sq    RSS      AIC F value    Pr(>F)    
## <none>                              306.32 -25.6762                      
## bill_length_mm          1    34.030 340.36   8.3514 37.3271 2.764e-09 ***
## species                 2    11.618 317.94 -16.9448  6.3719  0.001923 ** 
## bill_length_mm:species  2     0.872 307.20 -28.7036  0.4785  0.620151    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above code removes each term from the model in turn to see how much the fit worsens, reported as the change in the residual sum of squares (RSS) and the AIC. The scope argument determines which terms are considered for removal; here ~. means all of them. The test = "F" argument carries out an F test for each deletion, an integral part of an ANOVA.
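The table printed by drop1() can also be stored and queried like a data frame, which saves reading p-values off the screen. The sketch below is one way to do this, assuming palmerpenguins is installed; the object names deletions, p_values and significant_terms, and the 0.05 cut-off, are illustrative choices.

```r
library(palmerpenguins)

# Fit the same model as above and store the single-term deletion table
model <- aov(bill_depth_mm ~ bill_length_mm * species, data = penguins)
deletions <- drop1(model, scope = ~., test = "F")

# The row names are the terms; columns can be pulled out by name
rownames(deletions)
p_values <- deletions[["Pr(>F)"]]

# Keep the terms whose p-value is below 0.05
# (which() skips the <none> row, whose p-value is NA)
significant_terms <- rownames(deletions)[which(p_values < 0.05)]
significant_terms
```

With the output shown above, this picks out bill_length_mm and species but not their interaction.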

Drawing boxplots

This time we are going to create a series of boxplots, plotting flipper length on the y axis separated by species on the x axis. For boxplots you use the geom_boxplot() function.

ggplot(data = ?, aes(?)) +
  geom_boxplot(aes(color = ?), width = 0.5, show.legend = FALSE) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
    position = position_jitter(width = 0.2, seed = 0)) +
  theme_classic() +
  labs(?)

Your graph should look like this:

You will notice the function geom_jitter() was also included. This function adds a small amount of random variation to the location of each datapoint and is a useful way of handling overplotting (where there is a lot of overlap of your datapoints).

Some other useful things to edit on graphs are the line thickness, the text size and the colours used. The code below shows you how to do that with these boxplots:

ggplot(data = penguins, aes(y = flipper_length_mm, x = species)) +
  geom_boxplot(aes(color = species), width = 0.5, show.legend = FALSE, size = 0.9) +
  geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
    position = position_jitter(width = 0.2, seed = 0)) +
  theme_classic() +
  labs(x = 'Species', y = 'Flipper length (mm)') +
  theme(text = element_text(size = 15)) +
  scale_colour_manual(values = c('darkorange', 'purple', 'cyan4'))
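In the code above the three colours are matched to the species in order. A slightly more explicit alternative, sketched below, is to give scale_colour_manual() a named vector so that each species is tied to its colour by name. This assumes ggplot2 and palmerpenguins are installed; the object names species_colours and coloured_boxplot are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Each name must match a value in the species column exactly
species_colours <- c(Adelie = "darkorange", Chinstrap = "purple", Gentoo = "cyan4")

coloured_boxplot <- ggplot(data = penguins, aes(x = species, y = flipper_length_mm)) +
  geom_boxplot(aes(colour = species), width = 0.5, show.legend = FALSE, na.rm = TRUE) +
  # Colours are looked up by species name rather than by position
  scale_colour_manual(values = species_colours) +
  theme_classic()

coloured_boxplot
```

Naming the colours also means the same vector can be reused across several plots, keeping each species the same colour throughout.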

Final task

Your final task is to prepare a scatter plot of body mass on the x axis against bill length on the y axis. Separate the three species by colour. Add a regression line to your plot. Add the function facet_wrap(~island) on the end of your code. The goal is for your plot to look like this:

Here is a guide for the code you need to use:

ggplot(?) +
  geom_point(?) +
  geom_smooth(?) +
  labs(?) +
  theme_bw() +
  theme(?) +
  facet_wrap(~island)

Extra resources

The R Graph Gallery: https://www.r-graph-gallery.com/index.html

This website teaches you how to plot over 39 different plots, starting off with basic code and building up to more complex graphs.

R for Data Science: https://r4ds.had.co.nz/

This is a very easy to use webpage-based book which teaches you the very basics of plotting and dataset management (filtering, sub-setting, ordering etc.) through to more complex functions.

Quick R: https://www.statmethods.net/

This is a great website for learning how to perform statistical analyses using R.

R Graphics Cookbook: https://r-graphics.org/

This is another very easy to use webpage-based book focused solely on generating high-quality plots, with more than 150 different sets of code for generating a variety of different graphs.

swirl: https://swirlstats.com/

This website shows you how to download the R package swirl. swirl contains numerous interactive courses, such as exploratory data analysis or just a simple introduction to R.

Acknowledgements

Palmerpenguins artwork by @allison_horst Palmerpenguins Developers: Allison Horst, Alison Hill, Kristen Gorman

Palmerpenguins: https://allisonhorst.github.io/palmerpenguins/index.html

Tidyverse: https://www.tidyverse.org/