Introduction to Data Visualization with R (Part 2) (2024)

Objectives

Review the grammar of graphics template.
Learn about the statistical transformations inherent to geoms.
Learn more about fine tuning figures with labels, legends, scales, and themes.
Learn how to save plots with ggsave().
Review general tips for creating publishable figures.

Our grammar of graphics template

Last lesson we discussed the three basic components of creating a ggplot2 plot: the data, one or more geoms, and aesthetic mappings.

ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

But, we also learned of other features that greatly improve our figures, and today we will be expanding our ggplot2 template even further to include:

one or more datasets,
one or more geometric objects that serve as the visual representations of the data (for instance, points, lines, rectangles, contours),
descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e.g., linear, logarithmic, rank),
a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,
one or more coordinate systems,
optional parameters that affect the layout and rendering, such as text size, font and alignment, legend positions,
statistical summarization rules.
---(Holmes and Huber, 2021)

ggplot(data = <DATA>) + <GEOM_FUNCTION>( mapping = aes(<MAPPINGS>), stat = <STAT> ) + <FACET_FUNCTION> + <COORDINATE SYSTEM> + <THEME>
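As a rough illustration of how the expanded template fills in, the sketch below uses the mpg dataset that ships with ggplot2 as stand-in data (it is not part of this lesson's data). The choice of geom, facet variable, coordinate system, and theme is arbitrary and only meant to show where each piece goes.

```r
library(ggplot2)

# mpg is bundled with ggplot2; it is used here only as stand-in data
p <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class), stat = "count") + # geom, mappings, stat
  facet_wrap(~ drv) +                                  # facet function
  coord_flip() +                                       # coordinate system
  theme_bw()                                           # theme
p
```

Each piece after the geom is optional; leaving one out simply falls back to the ggplot2 default.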

Loading the libraries

To begin plotting, let's load our tidyverse library.

#load libraries
library(tidyverse) # Tidyverse automatically loads ggplot2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing the data

We also need some data to plot, so if you haven't already, let's load the data we will need for this lesson.

#scaled_counts
#We used this in lesson 2 so you may not need to reload
scaled_counts<- read.delim("./data/filtlowabund_scaledcounts_airways.txt", as.is=TRUE)
dexp<-read.delim("./data/diffexp_results_edger_airways.txt", as.is=TRUE)
#let's get some data
#we are only interested in transcript counts greater than 100
#read in the data
sc<-read.csv("./data/sc.csv")

Statistical transformations

Many graphs, like scatterplots, plot the raw values of your dataset. Other graphs, like bar charts, calculate new values to plot:
bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.
smoothers fit a model to your data and then plot predictions from the model.
boxplots compute a robust summary of the distribution and then display a specially formatted box.
The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation.
--- R4DS

Let's plot a bar graph using the data (sc).

#returns an error message. What went wrong?
ggplot(data=sc) +
  geom_bar(aes(x=Num_transcripts, y = TotalCounts))

## Error in `geom_bar()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `setup_params()`:
## ! `stat_count()` must only have an x or y aesthetic.

What's the difference between stat identity and stat count?

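ggplot(data=sc) +
  geom_bar(aes(x=Num_transcripts, y = TotalCounts), stat="identity")

As we can see, stat="identity" plots the y values exactly as they appear in the data. The default for geom_bar(), stat="count", instead counts the number of rows at each x value, which is why it refuses a y aesthetic.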
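The difference can be checked on a small made-up data frame. The sketch below uses hypothetical toy data (the names toy, sample, and total are not from the lesson) and layer_data() to look at the values ggplot2 actually draws. It also shows geom_col(), which is ggplot2's shorthand for a bar geom with stat="identity".

```r
library(ggplot2)

# Hypothetical toy data: one row per sample, already summarized
toy <- data.frame(sample = c("A", "B", "C"), total = c(10, 25, 5))

# stat = "identity": bar height is the value stored in `total`
p_identity <- ggplot(toy) +
  geom_bar(aes(x = sample, y = total), stat = "identity")

# geom_col() is shorthand for the same thing
p_col <- ggplot(toy) +
  geom_col(aes(x = sample, y = total))

# default stat = "count": bar height is the number of rows per sample
p_count <- ggplot(toy) +
  geom_bar(aes(x = sample))

layer_data(p_identity)$y   # heights taken straight from the data
layer_data(p_count)$count  # heights computed by stat_count()
```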

Let's look at another example.

#Let's filter our data to only include 4 transcripts of interest
#We used this code in the tidyverse lesson
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-scaled_counts %>%
  filter(transcript %in% keep_t)
#the default here is `stat_count()`
ggplot(data = interesting_trnsc) +
  geom_bar(mapping = aes(x = transcript, y=counts_scaled))

## Error in `geom_bar()`:
## ! Problem while computing stat.
## ℹ Error occurred in the 1st layer.
## Caused by error in `setup_params()`:
## ! `stat_count()` must only have an x or y aesthetic.

#Let's take away the y aesthetic
ggplot(data = interesting_trnsc) +
  geom_bar(mapping = aes(x = transcript))

This is not a very useful figure, and probably not worth plotting. We could have gotten this info using str(). However, the point here is that there are default statistical transformations occurring with many geoms, and you can specify alternatives.

Let's change the stat parameter to "identity". This will plot the raw values of the normalized counts rather than how many rows are present for each transcript.

#defaulted to a stacked barplot
ggplot(data = interesting_trnsc) +
  geom_bar(mapping = aes(x = transcript, y=counts_scaled, fill=SampleName),
           stat="identity", color="black") +
  facet_wrap(~dex)

What if we wanted the columns side by side?

#introducing the position argument, position="dodge"
ggplot(data = interesting_trnsc) +
  geom_bar(mapping = aes(x = transcript, y=counts_scaled, fill=SampleName),
           stat="identity", color="black", position="dodge") +
  facet_wrap(~dex)

How do we know what the default stat is for geom_bar()? Well, we could read the documentation, ?geom_bar(). The same is true of other geoms: each has a default stat listed in its documentation. The statistical transformation can often be customized, so if the default is not what you need, check out the documentation to learn more about how to make modifications. For example, you could provide custom mapping for a box plot. To do this, see the examples section of the geom_boxplot() documentation.
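Besides reading the help page, the default stat can also be inspected from the console, since each geom function returns a layer object that records its stat. The sketch below shows this, plus one small customization: plotting proportions rather than counts with after_stat(). It assumes ggplot2 3.3.0 or later (for after_stat()) and again uses the bundled mpg data as a stand-in.

```r
library(ggplot2)

# Each geom function returns a layer that stores its default stat
class(geom_bar()$stat)[1]
class(geom_boxplot()$stat)[1]
class(geom_point()$stat)[1]

# stat_count() also computes a `prop` variable; map it to y to plot
# proportions. group = 1 makes the proportions sum to 1 across all bars.
p_prop <- ggplot(mpg) +
  geom_bar(aes(x = drv, y = after_stat(prop), group = 1))
p_prop
```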

Coordinate systems

ggplot2 uses a default coordinate system (the Cartesian coordinate system). This isn't super important until we want to do something like make a map (See coord_quickmap()) or create a pie chart (See coord_polar()).

When will we have to think about coordinate systems? In most cases we won't need to change the default (the exceptions are those above). The most common circumstance in which we will need to change the coordinate system is when we want to switch the x and y axes (?coord_flip()) or fix our aspect ratio (?coord_fixed()).

#let's return to our bar plot above
#get horizontal bars instead of vertical bars
ggplot(data = interesting_trnsc) +
  geom_bar(mapping = aes(x = transcript, y=counts_scaled, fill=SampleName),
           stat="identity", color="black", position="dodge") +
  facet_wrap(~dex) +
  coord_flip()
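The other case mentioned above, fixing the aspect ratio, can be sketched as follows. The example uses the bundled mpg data as a stand-in. It also shows that, assuming ggplot2 3.3.0 or later, horizontal bars can be obtained without coord_flip() by mapping the categorical variable to y.

```r
library(ggplot2)

# coord_fixed(): one unit on the x axis is drawn the same length
# as one unit on the y axis
p_fixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  coord_fixed(ratio = 1)

# Horizontal bars without coord_flip(): map the category to y
p_horizontal <- ggplot(mpg) +
  geom_bar(aes(y = class))

p_fixed
p_horizontal
```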

Labels, legends, scales, and themes

How do we ultimately get our figures to a publishable state? The bread and butter of pretty plots really falls to the additional non-data layers of our ggplot2 code. These layers will include code to label the axes, scale the axes, and customize the legends and theme.

The default axes and legend titles come from the ggplot2 code.

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"))

In the above plot, the y-axis label (TotalCounts) is the variable name mapped to the y aesthetic, while the x-axis label (Num_transcripts) is the variable name mapped to the x aesthetic. The fill aesthetic was set equal to "dex", and so this became the default title of the fill legend. We can change these labels using ylab(), xlab(), or labs(), and guides() for the legend.

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) + #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") #add y label

Let's change the legend title.

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) + #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") + #add y label
  guides(fill = guide_legend(title="Treatment"))

We can modify the axes scales of continuous variables using scale_x_continuous() and scale_y_continuous(). Discrete (categorical variable) axes can be modified using scale_x_discrete() and scale_y_discrete().

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) + #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") + #add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(breaks=seq(1.0e7, 3.5e7, by = 2e6),
                     limits=c(1.0e7,3.5e7)) #change breaks and limits
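As an alternative, labs() can set the axis labels, the legend title, and a plot title in one call. The sketch below is self-contained, so it uses a hypothetical toy data frame (toy, n_transcripts, total, and treatment are made-up names and values) rather than the lesson's sc data.

```r
library(ggplot2)

# Hypothetical toy data standing in for per-sample summaries
toy <- data.frame(
  n_transcripts = c(100, 150, 200, 250),
  total         = c(1.0e7, 1.5e7, 2.0e7, 2.5e7),
  treatment     = c("trt", "untrt", "trt", "untrt")
)

p_labs <- ggplot(toy) +
  geom_point(aes(x = n_transcripts, y = total, fill = treatment),
             shape = 21, size = 2) +
  labs(
    x     = "Recovered transcripts per sample",
    y     = "Total sequences per sample",
    fill  = "Treatment",                 # legend title for the fill aesthetic
    title = "Sequencing depth by sample"
  )
p_labs
```

Naming the aesthetic inside labs() (here fill) has the same effect on the legend title as the guides() call above.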
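The example above changes a continuous axis. For a discrete axis, a minimal sketch with scale_x_discrete() might look like the following. It uses the bundled mpg data as a stand-in, and the category labels are written out by hand for illustration.

```r
library(ggplot2)

p_discrete <- ggplot(mpg) +
  geom_bar(aes(x = drv)) +
  scale_x_discrete(
    limits = c("f", "r", "4"),            # order of categories on the axis
    labels = c("f" = "Front-wheel",       # text shown for each category
               "r" = "Rear-wheel",
               "4" = "Four-wheel")
  )
p_discrete
```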

Perhaps we want to represent these data on a logarithmic scale.

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) + #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") + #add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") #use the trans argument

Note

You could manually transform the data without transforming the scales. The figures would be the same, excluding the axes labels. When you use the transformed scale (e.g., scale_y_continuous(trans="log10") or scale_y_log10()), the axis labels remain in the original data space. When the data is transformed manually, the labels will also be transformed.
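The two approaches in this note can be compared on a small made-up data frame (toy, x, and y are hypothetical names). In the sketch below both plots place the points at the same positions; only the axis labels differ.

```r
library(ggplot2)

# Hypothetical toy data spanning several orders of magnitude
toy <- data.frame(x = 1:4, y = c(10, 100, 1000, 10000))

# Transform the scale: axis labels stay in the original units
p_scale <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_y_log10()

# Transform the data: axis labels show the log10 values
p_manual <- ggplot(toy, aes(x = x, y = log10(y))) +
  geom_point()

# The plotted positions are identical in both cases
layer_data(p_scale)$y
layer_data(p_manual)$y
```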

Finally, we can change the overall look of non-data elements of our plot (titles, labels, fonts, background, grid lines, and legends) by customizing ggplot2 themes. Check out ?ggplot2::theme() for a list of available parameters. ggplot2 provides 8 complete themes, with theme_gray() as the default theme.
You can also create your own custom theme and then apply it to multiple figures.

Create a custom theme to use with multiple figures.

#Setting a theme
my_theme <-
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.line = element_line(),
    panel.grid.major = element_line(size = 0.2),
    panel.grid.minor = element_line(size = 0.1),
    text = element_text(size = 12),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
  )

## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

ggplot(data=sc) +
  geom_point(aes(x=Num_transcripts, y = TotalCounts, fill=dex),
             shape=21, size=2) +
  scale_fill_manual(values=c("purple", "yellow"),
                    labels=c('treated','untreated')) + #can change labels of fill levels along with colors
  xlab("Recovered transcripts per sample") + #add x label
  ylab("Total sequences per sample") + #add y label
  guides(fill = guide_legend(title="Treatment")) + #label the legend
  scale_y_continuous(trans="log10") + #use the trans argument
  my_theme
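The deprecation warning shown for my_theme suggests using linewidth in element_line(). A possible version of the same theme written that way is sketched below, assuming ggplot2 3.4.0 or later. The name my_theme_lw is made up for this example. The sketch also uses theme_set(), which makes a theme the default for every plot drawn afterwards, so it does not need to be added to each plot.

```r
library(ggplot2)

# Same settings as my_theme, with linewidth instead of the deprecated size
my_theme_lw <-
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.line = element_line(),
    panel.grid.major = element_line(linewidth = 0.2),
    panel.grid.minor = element_line(linewidth = 0.1),
    text = element_text(size = 12),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1)
  )

# theme_set() returns the previous default so it can be restored later
old_theme <- theme_set(my_theme_lw)

p_themed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p_themed   # drawn with my_theme_lw without adding it explicitly

theme_set(old_theme)  # restore the previous default
```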

Saving plots (`ggsave()`)

Finally, we have a quality plot ready to publish. The next step is to save our plot to a file. The easiest way to do this with ggplot2 is ggsave(). This function will save the last plot that you displayed by default. Look at the function parameters using ?ggsave().

ggsave("Plot1.png",width=5.5,height=3.5,units="in",dpi=300)
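Relying on the last displayed plot can be fragile in longer scripts, so it may be safer to store the plot in an object and pass it through the plot argument. The sketch below does this with the bundled mpg data as a stand-in and writes to a temporary directory so that it does not touch your project files. The file name and the choice of PDF are only for illustration; ggsave() picks the output format from the file extension.

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Write to a temporary location; swap in your own path when saving for real
out_file <- file.path(tempdir(), "Plot1.pdf")

# plot = p_save saves this specific plot rather than the last one displayed
ggsave(out_file, plot = p_save, width = 5.5, height = 3.5, units = "in")

file.exists(out_file)
```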

Nice plot example

These steps can be used to create a publication-worthy figure. For example, let's create a volcano plot of our differential expression results.

A volcano plot is a type of scatterplot that shows statistical significance (P value) versus magnitude of change (fold change). It enables quick visual identification of genes with large fold changes that are also statistically significant. These may be the most biologically significant genes. --- Maria Doyle, 2021

#get the data
dexp_sigtrnsc<-dexp %>%
  mutate(Significant = FDR < 0.05 & abs(logFC) >= 2) %>%
  arrange(FDR)
topgenes<-dexp_sigtrnsc$transcript[1:6]

Plot

#install.packages("ggrepel")
library(ggrepel)
ggplot(data=dexp_sigtrnsc, aes(x = logFC, y = log10(FDR))) +
  geom_point(aes(color = Significant, size = Significant, alpha = Significant)) +
  geom_text_repel(data=dexp_sigtrnsc %>% filter(transcript %in% topgenes),
                  aes(label=transcript),
                  nudge_y=0.5, hjust=0.5, direction="y",
                  segment.color="gray") +
  scale_y_reverse(limits=c(0,-7)) +
  scale_color_manual(values = c("black", "#e11f28")) +
  scale_size_discrete(range = c(0, 2)) +
  guides(size = "none", alpha= "none") +
  my_theme

## Warning: Using size for a discrete variable is not advised.

## Warning: Using alpha for a discrete variable is not advised.
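A common variation, sketched below, is to plot -log10(FDR) directly instead of reversing the y axis, and to mark the cut-offs with dashed reference lines. To keep the example self-contained it runs on simulated values, not the lesson's differential expression results. The data frame sim and everything in it are made up for illustration. The thresholds (FDR < 0.05, |logFC| >= 2) match those used above.

```r
library(ggplot2)

# Simulated stand-in data; these are NOT real differential expression results
set.seed(1)
sim <- data.frame(
  transcript = paste0("gene", 1:200),
  logFC      = rnorm(200, mean = 0, sd = 2),
  FDR        = runif(200)
)
sim$Significant <- sim$FDR < 0.05 & abs(sim$logFC) >= 2

p_volcano <- ggplot(sim, aes(x = logFC, y = -log10(FDR))) +
  geom_point(aes(color = Significant)) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") + # FDR cut-off
  geom_vline(xintercept = c(-2, 2), linetype = "dashed") +     # fold-change cut-offs
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "#e11f28")) +
  theme_bw()
p_volcano
```

Using color alone for the Significant flag also avoids the warnings about mapping size and alpha to a discrete variable.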

Enhanced Volcano

There is a dedicated package for creating volcano plots available in Bioconductor, EnhancedVolcano. Plots created using this package can be customized using ggplot2 functions and syntax.

#The default cut-off for log2FC is >|2|
#the default cut-off for log10 p-value is 10e-6
library(EnhancedVolcano)
EnhancedVolcano(dexp_sigtrnsc,
                title = "Enhanced Volcano with Airways",
                lab = dexp_sigtrnsc$transcript,
                x = 'logFC',
                y = 'FDR')

Recommendations for creating publishable figures

(Inspired by Visualizing Data in the Tidyverse, a Coursera lesson)

Consider whether the plot type you have chosen is the best way to convey your message
Make your plot visually appealing
- Careful color selection - color blind friendly if possible (e.g., library(viridis))
- Eliminate unnecessary white space
- Carefully choose themes including font types
Label all axes with concise and informative labels
- These labels should be straightforward and adequately describe the data
Ask yourself "Does the data make sense?"
- Does the data plotted address the question you are answering?
Try not to mislead the audience
- Often this means starting the y-axis at 0
- Keep axes consistent when arranging facets or multiple plots
- Keep colors consistent across plots
Do not try to convey too much information in the same plot
- Keep plots fairly simple
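For the color recommendation above, one option that needs no extra package is the viridis family of scales built into ggplot2, for example scale_fill_viridis_d() for discrete fills. The sketch below uses the bundled mpg data as a stand-in.

```r
library(ggplot2)

p_viridis <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_viridis_d() +  # discrete viridis palette
  theme_bw()
p_viridis
```

For continuous variables the corresponding functions are scale_fill_viridis_c() and scale_color_viridis_c().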

Complementary packages

There are many complementary R packages related to creating publishable figures using ggplot2. Check out the packages cowplot and ggpubr. cowplot provides functions that facilitate arranging multiple plots in a grid panel. Publications usually restrict the number of figures allowed, so it is helpful to be able to group multiple figures into a single figure panel. ggpubr is well suited to beginners, providing easy code to make publication-worthy figures. It is especially useful for stats integration and for easily incorporating brackets and p-values for group comparisons.
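As a minimal sketch of arranging plots with cowplot, the example below combines two plots into a single labeled panel using plot_grid(). It assumes the cowplot package is installed and uses the bundled mpg data as a stand-in for real figures.

```r
library(ggplot2)
library(cowplot)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p2 <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()

# Arrange the two plots side by side and label them A and B
panel <- plot_grid(p1, p2, labels = c("A", "B"), ncol = 2)
panel
```

The combined panel can then be saved like any other plot, for example by passing it to the plot argument of ggsave().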

Acknowledgements

Material from this lesson was adapted from Chapter 3 of R for Data Science and from a 2021 workshop entitled Introduction to Tidy Transcriptomics by Maria Doyle and Stefano Mangiola.