Deepayan Sarkar
R has a reputation for being a good system for graphics
This is mainly based on its ability to produce good publication-quality statistical plots
This course is about one specific graphics system in R called lattice
R actually has two largely independent graphics subsystems
Traditional graphics
Grid graphics
Grid graphics is not usually used directly by the user
But it forms the basis of two high-level graphics systems:
lattice: based on Trellis graphics (Cleveland)
ggplot2: inspired by “Grammar of Graphics” (Wilkinson)
These represent two very different philosophical approaches to graphics
lattice is in many ways a natural successor to traditional graphics
ggplot2 represents a completely different declarative approach
I will try to illustrate this with an example
Anscombe (1973) introduced four artificial bivariate datasets to emphasize the importance of graphics
The datasets all have nearly identical means, standard deviations, and correlations
'data.frame': 11 obs. of 8 variables:
$ x1: num 10 8 13 9 11 14 6 4 12 7 ...
$ x2: num 10 8 13 9 11 14 6 4 12 7 ...
$ x3: num 10 8 13 9 11 14 6 4 12 7 ...
$ x4: num 8 8 8 8 8 8 8 19 8 8 ...
$ y1: num 8.04 6.95 7.58 8.81 8.33 ...
$ y2: num 9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
$ y3: num 7.46 6.77 12.74 7.11 7.81 ...
$ y4: num 6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
x1 x2 x3 x4 y1 y2 y3 y4
9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909
x1 x2 x3 x4 y1 y2 y3 y4
3.316625 3.316625 3.316625 3.316625 2.031568 2.031657 2.030424 2.030579
[1] 0.8164205 0.8162365 0.8162867 0.8165214
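The summaries above can be reproduced from the anscombe data frame that ships with base R. The slides do not show the calls, so the ones below are an assumption about one plausible way to get the same numbers; the names means, sds and cors are illustrative.

```r
## anscombe is a built-in dataset (package datasets), so no import is needed
str(anscombe)

means <- sapply(anscombe, mean)  # column means
sds   <- sapply(anscombe, sd)    # column standard deviations

## correlation within each (x, y) pair
cors <- with(anscombe, c(cor(x1, y1), cor(x2, y2), cor(x3, y3), cor(x4, y4)))

means
sds
cors
```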
Traditional graphics treats these as four separate datasets
The function to create scatter plots is plot()
Multiple plots can be put in the same figure using par(mfrow = ...)
There are several ways of referring to variables inside a dataset; here we use a formula together with a data argument:
par(mfrow = c(1, 4))
plot(y1 ~ x1, data = anscombe, pch = 16)
plot(y2 ~ x2, data = anscombe, pch = 16)
plot(y3 ~ x3, data = anscombe, pch = 16)
plot(y4 ~ x4, data = anscombe, pch = 16)
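As a sketch of the different ways to refer to variables inside a data frame, the three calls below all draw the same scatter plot of y1 against x1. Only the default axis labels differ, because they are taken from the expressions passed to plot().

```r
par(mfrow = c(1, 3))

## 1. formula interface: variables are looked up in the data argument
plot(y1 ~ x1, data = anscombe, pch = 16)

## 2. with(): evaluate an ordinary plot(x, y) call inside the data frame
with(anscombe, plot(x1, y1, pch = 16))

## 3. explicit extraction of columns using $
plot(anscombe$x1, anscombe$y1, pch = 16)
```

The formula form is the one that carries over most directly to lattice.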
xrng <- with(anscombe, range(x1, x2, x3, x4))
yrng <- with(anscombe, range(y1, y2, y3, y4)) ## common axis limits
par(mfrow = c(1, 4))
plot(y1 ~ x1, data = anscombe, pch = 16, xlim = xrng, ylim = yrng)
plot(y2 ~ x2, data = anscombe, pch = 16, xlim = xrng, ylim = yrng)
plot(y3 ~ x3, data = anscombe, pch = 16, xlim = xrng, ylim = yrng)
plot(y4 ~ x4, data = anscombe, pch = 16, xlim = xrng, ylim = yrng)
Both lattice and ggplot2 are capable of producing a single plot with all four datasets
But this requires the dataset to be in the “long format” (one row per data point)
anscombe.long <-
    with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                              y = c(y1, y2, y3, y4),
                              which = factor(rep(c("1", "2", "3", "4"), each = 11))))
str(anscombe.long)
'data.frame': 44 obs. of 3 variables:
$ x : num 10 8 13 9 11 14 6 4 12 7 ...
$ y : num 8.04 6.95 7.58 8.81 8.33 ...
$ which: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
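In the long format, the variable which identifies the dataset each row belongs to, so per-dataset computations become operations on subsets. A minimal sketch, assuming the same construction as above (the name cors.long is illustrative):

```r
## rebuild the long-format data so that this block is self-contained
anscombe.long <-
    with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                              y = c(y1, y2, y3, y4),
                              which = factor(rep(c("1", "2", "3", "4"), each = 11))))

## split the rows by dataset and compute the correlation within each subset
cors.long <- sapply(split(anscombe.long, anscombe.long$which),
                    function(d) cor(d$x, d$y))
cors.long
```

Conditioning on which in lattice or faceting on it in ggplot2 relies on the same idea of splitting rows by a categorical variable.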
library(package = "lattice")
xyplot(y ~ x | which, data = anscombe.long, layout = c(4, 1), pch = 16)
library(package = "ggplot2")
ggplot(data = anscombe.long) + geom_point(aes(x = x, y = y)) + facet_grid( ~ which)
The approaches share many common features
Both are capable of plotting subsets of data (indexed by categorical variables)
Both make efficient use of available space (common scales, common axes)
Different visual appearance, but that is superficial (different default themes)
However, the way in which we specify the display is very different
lattice uses an extension of the formula-data interface (with function xyplot() instead of plot())
ggplot2 specifies type of rendering (geom) and mapping of variables to coordinates (aesthetics)
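To make the contrast concrete, here is a sketch that builds the same display both ways and stores the results as objects. It assumes the lattice and ggplot2 packages are installed; the names p.lattice and p.ggplot are illustrative. In the ggplot2 call the aesthetic mapping is declared once in ggplot() and inherited by the geom, which is equivalent to passing it to geom_point() directly.

```r
library(lattice)
library(ggplot2)

## long-format data, rebuilt here so that the block is self-contained
anscombe.long <-
    with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                              y = c(y1, y2, y3, y4),
                              which = factor(rep(c("1", "2", "3", "4"), each = 11))))

## lattice: the formula y ~ x | which names the variables and the conditioning
p.lattice <- xyplot(y ~ x | which, data = anscombe.long, layout = c(4, 1), pch = 16)

## ggplot2: aes() maps variables to coordinates, geom_point() chooses the rendering
p.ggplot <- ggplot(anscombe.long, aes(x = x, y = y)) +
    geom_point() +
    facet_grid(~ which)

## both are ordinary R objects; printing them draws the plot
print(p.lattice)
print(p.ggplot)
```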
The differences become clearer if we try to customize the display further
A natural modification in this example is to add a linear regression line to each scatter plot
xrng <- with(anscombe, range(x1, x2, x3, x4))
yrng <- with(anscombe, range(y1, y2, y3, y4))
par(mfrow = c(1, 4))
plot(y1 ~ x1, anscombe, pch = 16, xlim = xrng, ylim = yrng)
abline(lm(y1 ~ x1, anscombe)) ## lm() fits regression line, abline() adds the line to current plot
plot(y2 ~ x2, anscombe, pch = 16, xlim = xrng, ylim = yrng)
abline(lm(y2 ~ x2, anscombe))
plot(y3 ~ x3, anscombe, pch = 16, xlim = xrng, ylim = yrng)
abline(lm(y3 ~ x3, anscombe))
plot(y4 ~ x4, anscombe, pch = 16, xlim = xrng, ylim = yrng)
abline(lm(y4 ~ x4, anscombe))
The traditional graphics approach is to add the line after the plot is drawn
In general, a plot is never finished: you can always add more points, lines, text, …
This is possible because there is only one plot!
lattice and ggplot2 need alternative solutions
The ggplot2 solution is to allow plots to have multiple layers
The lattice solution is to allow the user to fully specify the procedure used to display data
ggplot(data = anscombe.long) + facet_grid( ~ which) +
    geom_smooth(aes(x = x, y = y), method = "lm", se = FALSE)
ggplot(data = anscombe.long) + facet_grid( ~ which) + geom_point(aes(x = x, y = y)) +
    geom_smooth(aes(x = x, y = y), method = "lm", se = FALSE)
The lattice solution is actually very similar to the traditional graphics solution
Basically, we want to do the following for each data subset x, y:
Draw points at (x, y)
Draw the linear regression line through x, y
For lattice, we need to encapsulate this procedure into a function
displayFunction <- function(x, y) {
panel.grid(h = -1, v = -1) ## add a reference grid
panel.points(x, y, pch = 16) ## draw the points
panel.abline(lm(y ~ x), col = "grey50") ## draw linear regression line
}
This function is then passed to xyplot() as the panel argument
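The call that uses the panel function is not shown above. A minimal sketch of what it could look like, assuming the long-format data and displayFunction as defined earlier (both repeated here so that the block runs on its own; the name p is illustrative):

```r
library(lattice)

## long-format data, rebuilt here so that the block is self-contained
anscombe.long <-
    with(anscombe, data.frame(x = c(x1, x2, x3, x4),
                              y = c(y1, y2, y3, y4),
                              which = factor(rep(c("1", "2", "3", "4"), each = 11))))

## panel function: called once per panel with that panel's x and y values
displayFunction <- function(x, y) {
    panel.grid(h = -1, v = -1)               # reference grid aligned with the axis ticks
    panel.points(x, y, pch = 16)             # the data points
    panel.abline(lm(y ~ x), col = "grey50")  # least-squares regression line
}

## supply the function through the panel argument of xyplot()
p <- xyplot(y ~ x | which, data = anscombe.long, layout = c(4, 1),
            panel = displayFunction)
print(p)
```

Because the regression is fitted inside the panel function, each of the four panels gets its own line computed from its own subset of the data.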
library(package = "latticeExtra")
xyplot(y ~ x | which, data = anscombe.long, layout = c(4, 1), pch = 16) +
layer_(panel.grid(h = -1, v = -1)) + layer(panel.abline(lm(y ~ x)))
xyplot(y ~ x | which, data = anscombe.long, layout = c(4, 1), pch = 16,
grid = TRUE, type = c("p", "r"))
Day 1
Background and basic usage of lattice high-level functions
Themes and annotation: customizing graphical parameters, legends, axes, labels
Day 2
Customization of panel display using panel functions and layers
Exercises: Recreate some regression diagnostic plots
Discuss audience problems
Anscombe, Francis J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1). Taylor & Francis Group:17–21.