Before we can do today’s exercises we need to do something extra. We
need to get access to some extra functions that aren’t available in
“base” R. To do this we will load a package into R. In R packages are
called libraries and a library is a collection of functions that are
usually designed to do similar types of things (e.g., analyzing data in
a particular way). In this case we are going to load a really
“meta”-package (a package so big that it actually contains other
packages) and this is called the tidyverse
.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Now don’t worry about the “conflicts” message - this isn’t an error. (TL;DR: all this means is that some functions in the tidyverse are going to take precedence over functions with the same name in base R.). You can find out more about the tidyverse at their webpage.
Now you’ve got two tasks:
I will remind you how to do #s 1 and 2 in my video.
Now that we are in the right place, let’s read in the data, but this
time we are going to use a function from the tidyverse that works with
tab-separated values (files ending .tsv
):
<- read_tsv("data/phosphate-assay.tsv") phosphate
## Rows: 6 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (1): sample
## dbl (6): conc_uM, rep1, rep2, rep3, rep4, rep5
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
phosphate
I will tell you a little bit about that object (which is called a “tibble”) in the video and show you a few tricks to view the data in different ways.
Did you notice how each replicate is a separate column? Let’s start by drawing a graph with the concentration column (on the x axis) and the first replicate (on the y axis):
plot(x = phosphate$conc_uM, y = phosphate$rep1)
It looks like when the concentration goes up so does the value of the replicate (measured as optical density at 600nm).
Maybe we should put those units into the plot properly:
plot(x = phosphate$conc_uM, y = phosphate$rep1, xlab = "concentration (μM)", ylab = "OD (600nm)")
Let’s look at that relationship using a simple linear (regression)
model. The function for linear models is lm()
. In this case
we will use this function to construct a model and assign the result to
an object called model1
. We write the name of the model to
view its key properties:
<- lm(rep1 ~ conc_uM, data=phosphate)
model1 model1
##
## Call:
## lm(formula = rep1 ~ conc_uM, data = phosphate)
##
## Coefficients:
## (Intercept) conc_uM
## 0.1451905 0.0002288
So you can see that the slope or gradient is 0.0002288
while the y intercept is 0.1451905
. You can find out much
more about the model using summary(model1)
. This shows the
R^2 value and carries out a significance test. (We won’t go into this
now).
plot(x = phosphate$conc_uM, y = phosphate$rep1, xlab = "", ylab = "")
abline(model1, col="blue")
Cool right? We added the regression line to the plot. Did you notice how we added empty information to the plot labels? Try typing this new line of code:
title(xlab = "concentration (μM)", ylab = "OD (600nm)")
What happens? Like the abline()
function above, the
title()
function can be used to decorate a plot after you
have drawn it.
Now the above is all very well, but we’ve only plotted one replicate. Just for a second let’s remind ourselves of the column names in our data object:
names(phosphate)
## [1] "sample" "conc_uM" "rep1" "rep2" "rep3" "rep4" "rep5"
Let’s take the names from the 3rd to the last column and assign them to a new object:
<- names(phosphate)[3:7]
repcols repcols
## [1] "rep1" "rep2" "rep3" "rep4" "rep5"
Now you can see we need to plot all five columns in repcols.
Now this is a bit more complicated. Here you’ll see two more
functions that can be used to add elements to a plot
(points()
and legend()
) - see if you can
figure out what they do:
plot(phosphate$conc_uM, phosphate$rep1, pch = 1, ylim = c(0, 0.35), xlab = "", ylab = "")
points(phosphate$conc_uM, phosphate$rep2, pch = 2, xlab = "", ylab = "")
points(phosphate$conc_uM, phosphate$rep3, pch = 3, xlab = "", ylab = "")
points(phosphate$conc_uM, phosphate$rep4, pch = 4, xlab = "", ylab = "")
points(phosphate$conc_uM, phosphate$rep5, pch = 5, xlab = "", ylab = "")
title(xlab = "concentration (μM)", ylab = "OD at 600nm")
legend("topright", legend = repcols, pch = 1:5)
To understand the functions you can use the ?
notation for
example read the help file for ?points
and see if you can
find out what the pch
argument is doing…
Tidyverse is a bit of a weird name isn’t it? Well it’s actually a really good name because it’s all about how data is organised. Looking at the data in the phosphate object we can see that each row contains five y values (one for each replicate), but in “tidy” data we expect each observation to be on a separate row and each column to be a single variable. For more detail on this see this reference.
How can we get the phosphate
data into a tidy format? We
should have one column for optical density (the y value) and, because
every measurement should have its own row, we also need a column to tell
us what replicate each measurement is. There is a special helper
function in tidyverse
to help us do this and that is
pivot_longer()
:
<- pivot_longer(phosphate, rep1:rep5, names_to = "replicate", values_to = "OD600nm")
phosphate_long phosphate_long
Look at the “arguments” (separated by commas) inside
pivot_longer()
. First we pass it the object containing the
data, then we tell it which columns contain the data we want to
transform/pivot, then we use names_to
to create a new
column that contains these old column names. Finally we use
values_to
to create the single column containing
the data. Learn more at the above reference or using
?pivot_longer
. Take a good look at the
tidy data in our new phosphate_long
object.
Tidyverse contains an awesome function called ggplot()
.
It can be a little hard to learn at first (you may find this reference chapter
helpful), but here is how we can draw a plot with all the data in
it:
ggplot(phosphate_long, aes(x=conc_uM, y=OD600nm, shape=replicate)) + geom_point()
First we tell it the name of the data. Then the next argument uses the
aes()
function to tell ggplot()
the “aesthetic
mappings”, i.e., which variables are associated with which features of
the plot. That’s kind of simple for the x and y axes, but try replacing
shape
in the above code with colour
. What do
you see? What if you remove that argument altogether.
Then there is this slightly weird (and unique to
ggplot()
) feature where we add the geometric function that
is used to make the plot (in this case geom_point
). Keeping
the colour
change you made above, try replacing
geom_point()
with geom_line()
. What do you
see? Does that make sense? (It may be pretty but the answer is no! You
can’t interpolate in that way!)
In ggplot()
this is really each to do using the
stat_smooth()
function. We just need to add one more thing
to the plot (and you might notice that we put the functions on different
lines here which is fine as long as there is a plus on every line if we
expect more functions):
ggplot(phosphate_long, aes(x=conc_uM, y=OD600nm)) +
geom_point(aes(colour=replicate)) +
stat_smooth(method=lm)
## `geom_smooth()` using formula 'y ~ x'
The only thing you might notice here if you look really closely is
that we moved the colour attribute into the geom_points()
function. Try moving it back to the main ggplot()
call.
What happens?
This is actually very complex, so please don’t worry if you don’t
understand this last example which introduces an idea called the pipe
%>%
, but it shows you how you can summarize data and
send it into the next function (and ultimately into
ggplot()
):
%>%
phosphate_long group_by(sample, conc_uM) %>%
summarise(meanOD600nm = mean(OD600nm)) %>%
ggplot(aes(x=conc_uM, y=meanOD600nm)) + geom_point() + stat_smooth(method=lm)
## `summarise()` has grouped output by 'sample'. You can override using the
## `.groups` argument.
## `geom_smooth()` using formula 'y ~ x'
Generally you won’t need a complex example like this (instead see Figure \@ref{fig:best-ggplot} which has the advantage that we can see all the data)…
These references were linked above:
Chang, W., 2018. R graphics cookbook: practical recipes for visualizing data. O’Reilly Media. 2nd edition
Wickham, H. and Grolemund, G., 2016. R for data science: import, tidy, transform, visualize, and model data. ” O’Reilly Media, Inc.”.
But when attributing, we mustn’t forget about software! R actually contains a function that provides us with bibliographic details! For all of R:
citation()
##
## To cite R in publications use:
##
## R Core Team (2022). R: A language and environment for statistical
## computing. R Foundation for Statistical Computing, Vienna, Austria.
## URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
## title = {R: A Language and Environment for Statistical Computing},
## author = {{R Core Team}},
## organization = {R Foundation for Statistical Computing},
## address = {Vienna, Austria},
## year = {2022},
## url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
And for a particular package:
citation("tidyverse")
##
## To cite package 'tidyverse' in publications use:
##
## Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R,
## Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller
## E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
## Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to
## the tidyverse." _Journal of Open Source Software_, *4*(43), 1686.
## doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## title = {Welcome to the {tidyverse}},
## author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
## year = {2019},
## journal = {Journal of Open Source Software},
## volume = {4},
## number = {43},
## pages = {1686},
## doi = {10.21105/joss.01686},
## }
For reproducibility we sometimes print out information about our setup and computer:
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.3.6
## [9] tidyverse_1.3.2
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 lubridate_1.8.0 assertthat_0.2.1
## [4] digest_0.6.29 utf8_1.2.2 R6_2.5.1
## [7] cellranger_1.1.0 backports_1.4.1 reprex_2.0.2
## [10] evaluate_0.16 highr_0.9 httr_1.4.4
## [13] pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1
## [16] readxl_1.4.1 rstudioapi_0.14 jquerylib_0.1.4
## [19] Matrix_1.5-1 klippy_0.0.0.9500 rmarkdown_2.16
## [22] splines_4.2.1 labeling_0.4.2 googledrive_2.0.0
## [25] bit_4.0.4 munsell_0.5.0 broom_1.0.1
## [28] compiler_4.2.1 modelr_0.1.9 xfun_0.33
## [31] pkgconfig_2.0.3 mgcv_1.8-40 htmltools_0.5.3
## [34] tidyselect_1.1.2 fansi_1.0.3 crayon_1.5.2
## [37] tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
## [40] grid_4.2.1 nlme_3.1-159 jsonlite_1.8.0
## [43] gtable_0.3.1 lifecycle_1.0.2 DBI_1.1.3
## [46] magrittr_2.0.3 scales_1.2.1 cli_3.4.1
## [49] stringi_1.7.8 vroom_1.5.7 cachem_1.0.6
## [52] farver_2.1.1 fs_1.5.2 xml2_1.3.3
## [55] bslib_0.4.0 ellipsis_0.3.2 generics_0.1.3
## [58] vctrs_0.4.2 tools_4.2.1 bit64_4.0.5
## [61] glue_1.6.2 hms_1.1.2 parallel_4.2.1
## [64] fastmap_1.1.0 yaml_2.3.5 colorspace_2.0-3
## [67] gargle_1.2.1 rvest_1.0.3 knitr_1.40
## [70] haven_2.5.1 sass_0.4.2