What is RStudio?
RStudio is an ‘Integrated Development Environment’ for R, a software application that allows the user to perform statistical analysis, plot graphs, write functions and manage datasets using the R statistical programming language.
How to install RStudio?
Before you can install RStudio, you need to install R, the programming language RStudio uses. You can download R from the official website: https://cloud.r-project.org
For Windows users, select Download R for Windows, followed by base then click on the download link.
For macOS users, select Download R for macOS, then click on the appropriate download link. Make sure to download the correct file, depending on whether your Mac uses Apple silicon (M1/M2) or an Intel processor.
Now you can download RStudio: https://www.rstudio.com/products/rstudio/download
We recommend you install R and RStudio on your own computer. However, there may be times when you don’t have access to a local version of RStudio, for example if an on-campus computer doesn’t have it installed, and need to use the cloud-based version instead.
To access RStudio Cloud, follow this link: https://rstudio.cloud/
RStudio Cloud is free to use, but you are limited to 25 hours per month. Once you have set up an account (or you can log in through a Google or GitHub account), create a new project and it should load up RStudio the same as the local version.
The RStudio interface
The basics
Functions
Unlike Excel, SPSS or other similar software, which operate using a ‘point and click’ system, R is an interpreted language. In order to do anything in RStudio, you need to tell it what to do using the R language, typically through functions.
A function in R usually consists of two parts…
- The name of the function
- Parentheses
Here are some examples:
getwd()
setwd()
dir()
dim()
The function getwd() tells you the current folder location RStudio is looking at. You can see the brackets are empty. This is because you don’t need to supply any additional information, also called ‘arguments’, to the function. Most functions, however, do require one or more arguments.
The function dir() lists all the files in your current working directory. However, you can also run dir(recursive = TRUE), which lists the files in all subdirectories as well.
This might have filled your console with a list of files, so this is a good time to learn how to clear your console. Press CTRL + L on your keyboard, and this should wipe your console.
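As a minimal sketch of the calls described above, the following runs each function in turn. It assumes nothing beyond base R; the built-in mtcars dataset is used only as a convenient argument for dim() and is not part of this tutorial's data.

```r
# Print the folder R is currently working in (no arguments needed)
getwd()

# List the files in the working directory only
dir()

# List the files in the working directory and all of its subdirectories
dir(recursive = TRUE)

# dim() needs an argument: here the built-in mtcars data frame (illustrative only)
dim(mtcars)
```

The last call returns the number of rows followed by the number of columns.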
If you want to learn more about a function, you can either run the help command:
?name_of_function
or search for it on the internet or use the help menu within RStudio. Select the ‘Help’ tab in the bottom right pane, then search the name of the function.
Data types
Vectors:
A series of values. These are created using the c() function, which stands for ‘combine’ or ‘concatenate’. For example, c(6, 11, 13, 31, 90, 92) creates a six-element series of positive integer values.
Factors:
Categorical data are commonly represented in R as factors. Categorical data can also be represented as strings.
Data frames:
Rectangular spreadsheets. They are representations of datasets in R where the rows correspond to observations and the columns correspond to variables.
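The three data types can be sketched in a few lines of base R. The values and object names below are made up purely for illustration.

```r
# A numeric vector built with c()
flipper_lengths <- c(181, 186, 195)

# A factor: categorical data with a fixed set of levels
species <- factor(c("Adelie", "Gentoo", "Adelie"))
levels(species)

# A data frame: each row is an observation, each column is a variable
penguin_table <- data.frame(species = species,
                            flipper_length_mm = flipper_lengths)
str(penguin_table)
```

str() prints a compact summary of the data frame, including the type of each column.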
Objects
What if you need the output from one function in order to perform something else? R doesn’t save the output from anything you run in the console unless you assign it to an object.
If we go back to the vector above, typing it in the console just prints it but doesn’t save it. We can instead assign this vector to an object, using either = or <-.
## [1] 6 11 13 31 90 92
You can now just type My_vector and RStudio will print out your vector, or you could use it in another function.
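A minimal sketch of the assignment, using the My_vector name from the text; the follow-up calls to length() and sum() are just examples of passing the object to other functions.

```r
# Assign the vector to an object; nothing is printed when you assign
My_vector <- c(6, 11, 13, 31, 90, 92)

# Typing the object's name prints its contents
My_vector

# The object can now be passed to other functions
length(My_vector)  # number of elements
sum(My_vector)     # total of all elements
```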
Working directory and importing files
Your working directory (the current folder location RStudio is set to work in) is important in RStudio. This is the location you can import files from, and where RStudio will place any files you export. To see what the current working directory is set to, type getwd(). To set your working directory, you can use the function
setwd('C:/insert/folder/path/here')
There is an easier way to do this however. As shown in the image below, click on the three small dots then browse to the desired folder location and select it. Then select the ‘More’ option and click ‘Set as working directory’.
The function and arguments you use to import a file and assign it to an object depend on the type of file it is, e.g. .tsv or .csv. One option using base R (no packages needed) is the read.delim() function. You need to assign the output of the function to an object and state whether the file has column headers and which separator the file uses (comma ',' for .csv files and tab '\t' for .tsv files).
My_tsv_file <- read.delim('name_of_file.tsv', header = TRUE, sep = '\t')
My_csv_file <- read.delim('name_of_file.csv', header = TRUE, sep = ',')
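To try this without having a data file to hand, here is a small self-contained sketch. It writes a made-up table to a temporary .csv file and reads it back with read.delim(); the file path, column names and values are all hypothetical.

```r
# Write a small example data frame to a temporary .csv file (illustrative data only)
example_path <- tempfile(fileext = ".csv")
example_data <- data.frame(id = 1:3, mass_g = c(3750, 3800, 3250))
write.csv(example_data, example_path, row.names = FALSE)

# Read it back in, stating that the first row holds the column names
# and that values are separated by commas
My_csv_file <- read.delim(example_path, header = TRUE, sep = ",")

# Check the import worked: column names and dimensions
names(My_csv_file)
dim(My_csv_file)
```

Checking names() and dim() straight after an import is a quick way to spot a wrong separator, which usually shows up as a single wide column.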
Packages
Alongside all the functions that come with R built-in, you can also install packages containing lots of other functions.
The main package we will be working with later is called Tidyverse. It contains functions for plotting graphs, managing datasets and much more. Tidyverse is actually a collection of multiple packages collated together; you can find out more on the official website: https://www.tidyverse.org
Another package we will be using today is called palmerpenguins. This is a dataset designed to be used as a learning tool for data exploration and visualisation. It contains information about penguin species, their geographical distribution and their bill dimensions.
To install a package in R, use the following command:
install.packages('nameofpackage')
Although you might have installed a package, this does not mean you can use it straight away. You need to manually load each package you want to use. This may seem slow and inefficient. However, some packages are extremely large, and R would slow down if it loaded every installed package each time you started it.
To load one of your installed packages, use the following command:
library(nameofpackage)
Exercise - plotting data with ggplot2
Loading libraries and preparing our dataset
The first thing we need to do is install the two packages we need: Tidyverse and palmerpenguins.
install.packages('tidyverse')
install.packages('palmerpenguins')
RStudio cannot access these packages until we load them (once loaded, they are referred to as libraries). Think of it like installing a program on your computer: you can’t use it until you open it, despite it being installed. To load a library you use the library() function.
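The loading calls themselves are missing from this page. They were presumably along these lines (a reconstruction based on the package names above, not the original code):

```r
# Load the tidyverse collection (prints the attach/conflicts message)
library(tidyverse)

# Load palmerpenguins, which provides the penguins dataset
library(palmerpenguins)
```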
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We can now use all the functions from the Tidyverse package and we have our dataset loaded and ready to use. It would be useful, though, to actually view our dataset, so you can see what you are plotting. There are many ways to do this. One option is to use the head() function. Type head(penguins) and see what you get. This just shows the first few rows of the data in your console, but you can view the entire dataset in its own pane by typing
data(penguins)
This should add a ‘penguins’ entry under values in your environment pane. Then type penguins in the console and click on its data entry in the top right to view the table. As you can see this dataset contains information about three species of penguin, metrics associated with their bills, their sex and island location.
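A few other quick ways to inspect the dataset from the console are sketched below. This assumes palmerpenguins is installed; View() opens RStudio's table viewer and is guarded so it only runs in an interactive session.

```r
library(palmerpenguins)

# First six rows of the dataset
head(penguins)

# Number of rows and columns
dim(penguins)

# Column names
names(penguins)

# Open the full table in the viewer pane (interactive sessions only)
if (interactive()) View(penguins)
```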
Recording our work - creating an R script file
The best way to keep track of all the code you are going to use is to note it down in an R script file. By doing this you can keep all the relevant code in the same place, and you can even run it directly from the script file as opposed to continuously typing it into the console. To create a new R script file click on File -> New File -> R Script. If you want to include non-code text, start the line with a #; otherwise R will try to read it as code.
How does plotting work with ggplot2
The function we are going to use is called ggplot(), which is from the ggplot2 package within Tidyverse.
With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = penguins) creates an empty graph, but it’s not very interesting to look at.
You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case penguins.
The basic template for a graph with ggplot2 looks like this:
ggplot(data = your_dataset, aes(x = variable, y = variable)) + geom_function()
This is quite a lot of information, so here’s a nice summary produced by the R for ecology team:
Plotting a single variable
The first thing we are going to do is plot a single variable from our penguin dataset, flipper length. We are first going to plot a histogram, using the function geom_histogram().
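The code for this first histogram is not shown on this page. A minimal sketch following the ggplot2 template would be the following; the object name basic_histogram is hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Basic histogram of flipper length; ggplot2 uses 30 bins by default
# and warns about rows with missing values
basic_histogram <- ggplot(data = penguins, aes(x = flipper_length_mm)) +
  geom_histogram()

# Print the object to draw the plot
basic_histogram
```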
We’ve made a histogram, although it’s very plain. Let’s make some changes.
ggplot(data = penguins, aes(x = flipper_length_mm)) +
geom_histogram(aes(fill = species), colour = 'black') +
theme_classic() +
labs(x = 'Flipper length (mm)', y = 'Frequency', fill = 'Species')
- aes(fill = species) separates the data by colour according to the species column.
- colour = 'black' sets the colour of the bar outlines to black.
- theme_classic() changes the overall appearance of the graph.
- labs() allows you to set labels.
Have a go at making the density plot below using the geom_density() function with the same data. Use the argument alpha = to set the transparency.
NOTE: If you assign your plot to an object, you will need to type the name of the object into the console to produce the graph.
Plotting two variables - flipper length vs body mass
This time we are going to plot the flipper length (x axis) against body mass (y axis). In addition, we are going to slowly improve the code. Your graph will start like the plot on the left and finish like the one on the right.
Version one
As you can see this plot is limited in how much information you can infer from it. Can you tell how the different species mix? This code is a starting point which you can build on to make a better plot.
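The version one code is missing from this page. Given the axes described above, it was presumably something like the sketch below; the object name version_one is hypothetical.

```r
library(ggplot2)
library(palmerpenguins)

# Plain scatterplot: every point looks the same, so species cannot be told apart
version_one <- ggplot(data = penguins,
                      aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Print the object to draw the plot
version_one
```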
Version two
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species))
This time we have added some extra detail to the geom_point() function. This has now coloured the data points according to what species the penguin belongs to.
Version three
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3)
Now we have not only coloured the datapoints by species but given each species a unique shape as well. Using the alpha argument, we have set the opacity to 80% (so the points are slightly transparent) and also set the size of each datapoint using the size argument.
Version four
This time we are going to add some of our own labelling. However, some of the code is missing. Replace the question marks with the appropriate code.
ggplot(data = ?, ?(x = ?, y = ?)) +
geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3) +
labs(x = "Flipper length (mm)", y = ?, color = 'Penguin species',
shape = 'Penguin species')
Version five
This time, using the theme() function, we are going to customise the legend position and the legend background.
ggplot(data = ?, aes(?)) +
geom_point(aes(color = species, shape = species), alpha = 0.8, size = 3) +
labs(?) + theme(legend.position = c(0.9, 0.2),
legend.background = element_rect(fill = "white", color = NA))
Version six
Finally, let’s overlay an appearance theme in addition to our previous code.
ggplot(?) + geom_point(?) + labs(?) + theme_classic() + theme(?)
Saving your graph
You can use the function ggsave() to export your plot as a pdf, png etc.
ggsave('my_scatterplot.pdf')
There are many more settings within ggsave(). You can look it up in the help menu to see the different arguments available to you. These include setting the height and width of your image, alongside its dpi quality. By default ggsave() will export your most recent plot.
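As a sketch of those extra arguments, the example below saves a named plot as a PNG with an explicit size and resolution. The object name, output location and dimensions are illustrative choices, not recommendations from this tutorial.

```r
library(ggplot2)
library(palmerpenguins)

# Build a plot and keep it in an object so it can be passed to ggsave()
scatterplot <- ggplot(data = penguins,
                      aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))

# Save the named plot as a PNG, 16 x 10 cm at 300 dpi, in a temporary folder
output_file <- file.path(tempdir(), "my_scatterplot.png")
ggsave(output_file, plot = scatterplot,
       width = 16, height = 10, units = "cm", dpi = 300)
```

Passing the plot explicitly through the plot argument avoids accidentally saving whichever graph happened to be drawn last.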
Plotting two variables - adding a regression line and ANOVA
For this example we are going to plot the length of the penguin bills on the x axis against the depth of the bills on the y axis.
Generating the basic plot
bill_anatomy <- ggplot(data = ?, aes(x = ?, y = ?, group = species)) +
geom_point(aes(color = species, shape = species), size = 3, alpha = 0.8)
Try to fill in the missing code. Your plot should look like this:
Adding regression lines
Regression lines can easily be added to a scatterplot by adding another geom function to it, geom_smooth(method = 'lm').
Here we are going to add one regression line per species, so three regression lines in total. method = 'lm' refers to using a linear model. Your plot should look like this:
You can remove the confidence intervals overlaid on your regression lines by including the argument se = FALSE within the geom_smooth() function.
Final tidying
Now we can add some final tidying and customisation.
bill_anatomy <- bill_anatomy +
theme_classic() +
labs(?) +
theme(legend.position = c(0.85, 0.15),
legend.background = element_rect(fill = 'white', colour = NA))
Your graph should look something like this:
Performing an ANOVA
Let’s examine the effect of bill length and penguin species and their interaction on bill depth using an ANOVA. Run the following code below:
The A ~ B * C notation means that the ANOVA looks at bill length vs bill depth, species vs bill depth and the interaction of bill length and species on bill depth. Now you can run the following code:
## Single term deletions
##
## Model:
## bill_depth_mm ~ bill_length_mm * species
##                        Df Sum of Sq    RSS      AIC F value    Pr(>F)
## <none>                              306.32 -25.6762
## bill_length_mm          1    34.030 340.36   8.3514 37.3271 2.764e-09 ***
## species                 2    11.618 317.94 -16.9448  6.3719  0.001923 **
## bill_length_mm:species  2     0.872 307.20 -28.7036  0.4785  0.620151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above code iteratively removes each term from the ANOVA to explore its influence on the model fit, reported as the residual sum of squares (RSS) and the AIC. The scope argument determines what is to be removed; in this case all terms will be removed. The test = 'F' argument carries out an F test on this, an integral part of an ANOVA.
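The two code chunks referred to in this section are missing from this page. Below is a minimal sketch that produces output of the same form as the table above. It assumes the model was fitted with lm() (the original may have used aov() instead) and that ~. was passed as the scope; the object name bill_model is hypothetical.

```r
library(palmerpenguins)

# Fit a linear model of bill depth on bill length, species and their interaction
# (bill_model is a hypothetical name)
bill_model <- lm(bill_depth_mm ~ bill_length_mm * species, data = penguins)

# Standard ANOVA table for the fitted model
anova(bill_model)

# Remove each term in turn and F-test the change in fit;
# scope = ~. asks for every term in the model to be considered
drop1(bill_model, scope = ~., test = "F")
```

Rows with missing bill measurements are dropped automatically when the model is fitted.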
Drawing boxplots
This time we are going to create a series of boxplots, plotting flipper length on the y axis separated by species on the x axis. For boxplots you use the geom_boxplot() function.
ggplot(data = ?, aes(?)) +
geom_boxplot(aes(color = ?), width = 0.5, show.legend = FALSE) +
geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
position = position_jitter(width = 0.2, seed = 0)) +
theme_classic() +
labs(?)
Your graph should look like this:
You will notice the function geom_jitter() was also included. This function adds a small amount of random variation to the location of each datapoint and is a useful way of handling overplotting (where there is a lot of overlap of your datapoints).
Other useful things to edit on graphs are their line thickness, text size and colours. The code below shows you how to do that with these boxplots:
ggplot(data = penguins, aes(y = flipper_length_mm, x = species)) +
geom_boxplot(aes(color = species), width = 0.5, show.legend = FALSE, linewidth = 0.9) +
geom_jitter(aes(color = species), alpha = 0.5, show.legend = FALSE,
position = position_jitter(width = 0.2, seed = 0)) +
theme_classic() +
labs(x = 'Species', y = 'Flipper length (mm)') +
theme(text = element_text(size = 15)) +
scale_colour_manual(values = c('darkorange', 'purple', 'cyan4'))
Final task
Your final task is to prepare a scatter plot of body mass on the x axis against bill length on the y axis. Separate the three species by colour. Add a regression line to your plot. Add the function facet_wrap(~island) on the end of your code. The goal is for your plot to look like this:
Here is a guide for the code you need to use:
ggplot(?) +
geom_point(?) +
geom_smooth(?) +
labs(?) +
theme_bw() +
theme(?) +
facet_wrap(~island)
Extra resources
The R Graph Gallery: https://www.r-graph-gallery.com/index.html
This website shows you how to make more than 39 different types of plot, starting off with basic code and building up to more complex graphs.
R for Data Science: https://r4ds.had.co.nz/
This is a very easy-to-use webpage-based book which teaches you the very basics of plotting and dataset management (filtering, sub-setting, ordering etc.) through to more complex functions.
Quick R: https://www.statmethods.net/
This is a great website for learning how to perform statistical analyses using R.
R Graphics Cookbook: https://r-graphics.org/
This is another very easy to use webpage-based book focused solely on generating high-quality plots, with more than 150 different sets of code for generating a variety of different graphs.
swirl: https://swirlstats.com/
This website shows you how to download the R package swirl. swirl contains numerous interactive courses, such as exploratory data analysis or just a simple introduction to R.
Acknowledgements
Palmerpenguins artwork by @allison_horst Palmerpenguins Developers: Allison Horst, Alison Hill, Kristen Gorman
Palmerpenguins: https://allisonhorst.github.io/palmerpenguins/index.html
Tidyverse: https://www.tidyverse.org/