Deepayan Sarkar
Implementation of Trellis graphics for R
Powerful high-level data visualization system
Traditional user interface:
Collection of high-level functions: xyplot(), dotplot(), etc.
Interface based on formula and data source
As we know, R is a Free Software re-implementation / dialect of S
The original implementation, available commercially as S-PLUS, was developed at Bell Labs
Both traditional graphics and Trellis graphics were part of the original S
Remembering this helps in understanding the design philosophy of lattice
Two important influences on graphics in S:
John W. Tukey
William Cleveland
Among the most influential modern statisticians
Champion of “Exploratory Data Analysis”
Worked at Bell Labs (where the S language was created)
Did not write software, but influenced the spirit of S
Cleveland also worked at Bell Labs for a long time, and directly influenced the design of S graphics
Two important books:
The Elements of Graphing Data (1985)
Visualizing Data (1993)
Trellis graphics is essentially an implementation of ideas in the second book
There are various designs or types of graphs for displaying data
Each design usually has a name (scatter plot, histogram, box plot, bar chart)
S has a high-level function corresponding to each such design (to be directly invoked by user)
The display produced should have reasonable defaults
Some customization through optional arguments (esp. graphical parameters)
Further customization can be done by adding to or replacing the display procedurally
Implicit expectation: S users will eventually turn into programmers
John Chambers, Preface of “Programming with Data” (Chambers 1998):
“S encourages you to slide into programming, perhaps without noticing”
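The workflow described above (call a high-level function, adjust it through optional arguments, then add to the display procedurally) can be sketched with traditional R graphics. The built-in cars data and every graphical setting below are illustrative assumptions, not part of the original material.

```r
## Illustrative sketch: the 'cars' data and all settings are assumptions
pdf(NULL)  # off-screen device so the sketch also runs non-interactively

## 1. High-level function with its default display
plot(dist ~ speed, data = cars)

## 2. Customization through optional arguments (graphical parameters, labels)
plot(dist ~ speed, data = cars,
     pch = 16, col = "steelblue",
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = "Stopping distance vs speed")

## 3. Procedural additions to the display that is already on the device
fit <- lm(dist ~ speed, data = cars)
abline(fit, lty = 2)
legend("topleft", legend = "Least-squares line", lty = 2, bty = "n")

invisible(dev.off())
```

When working interactively, the pdf(NULL) and dev.off() calls can be dropped so that the plot appears on screen.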
Traditional graphics functions:
Function | Default Display |
---|---|
plot() | Scatter Plot, Time-series Plot (with type="l") |
boxplot() | Comparative Box-and-Whisker Plots |
barplot() | Bar Plot |
dotchart() | Cleveland Dot Plot |
hist() | Histogram |
plot(density()) | Kernel Density Plot |
qqnorm() | Normal Quantile-Quantile Plot |
qqplot() | Two-sample Quantile-Quantile Plot |
stripchart() | Stripchart (Comparative 1-D Scatter Plots) |
pairs() | Scatter-Plot Matrix |
Corresponding lattice functions:
Function | Default Display |
---|---|
xyplot() | Scatter Plot, Time-series Plot (with type="l") |
bwplot() | Comparative Box-and-Whisker Plots |
barchart() | Bar Plot |
dotplot() | Cleveland Dot Plot |
histogram() | Histogram |
densityplot() | Kernel Density Plot |
qqmath() | Normal Quantile-Quantile Plot |
qq() | Two-sample Quantile-Quantile Plot |
stripplot() | Stripchart (Comparative 1-D Scatter Plots) |
splom() | Scatter-Plot Matrix |
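As a rough illustration of this correspondence, the sketch below draws the same histogram with hist() and with histogram(). The built-in faithful data is an assumed example. One practical difference, which is a property of lattice rather than something stated above, is that lattice functions return an object of class "trellis" that is drawn when printed.

```r
library(lattice)

pdf(NULL)  # off-screen device so the sketch also runs non-interactively

## Traditional graphics: the display is drawn as a side effect of the call
hist(faithful$eruptions)

## lattice: the call returns an object; printing it produces the display
p <- histogram(~ eruptions, data = faithful)
class(p)
print(p)

invisible(dev.off())
```

At the interactive prompt the print() call is implicit, so typing the histogram() call on its own is enough to show the plot.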
Learning lattice essentially means
Learning about these functions (and a few others I didn’t mention)
Learning how to customize the default displays through optional arguments
Learning how to customize displays by writing alternative panel functions
Learning how to customize other parts (annotation, axis, themes)
We will now quickly go through some examples covering the first two points
Chem97 dataset
'data.frame': 31022 obs. of 8 variables:
$ lea : Factor w/ 131 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ school : Factor w/ 2410 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ student : Factor w/ 31022 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ score : num 4 10 10 10 8 10 6 8 4 10 ...
$ gender : Factor w/ 2 levels "M","F": 2 2 2 2 2 2 2 2 2 2 ...
$ age : num 3 -3 -4 -2 -1 4 1 4 3 0 ...
$ gcsescore: num 6.62 7.62 7.25 7.5 6.44 ...
$ gcsecnt : num 0.339 1.339 0.964 1.214 0.158 ...
We are only interested in
score: Point score on A-level Chemistry in 1997 (advanced level)
gender: Student’s gender
gcsescore: Average GCSE score of individual (secondary level)
score gender gcsescore
1 4 F 6.625
2 10 F 7.625
3 10 F 7.250
4 10 F 7.500
5 8 F 6.444
6 10 F 7.750
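A listing like the one above could be produced roughly as follows. This is a sketch that assumes the mlmRev package, which provides Chem97, is installed; the variable name chem_cols is a hypothetical helper introduced only for illustration.

```r
## Assumption: Chem97 is taken from the mlmRev package, if it is installed
if (requireNamespace("mlmRev", quietly = TRUE)) {
    data(Chem97, package = "mlmRev")
    chem_cols <- c("score", "gender", "gcsescore")  # hypothetical helper name
    print(head(Chem97[, chem_cols]))
}
```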
The most visible innovation in lattice over traditional graphics is multipanel conditioning
Common scales and shared axis labeling by default
Strips above each panel describing subset
Optimal use of space (e.g., no extra space left for main label unless present)
This is the origin of the names “Trellis graphics” and “lattice”
Makes use of the formula-data interface — similar to modeling functions like lm()
Most high-level lattice calls have a formula and a data= argument
densityplot(~ gcsescore | factor(score), data = Chem97, plot.points = FALSE,
groups = gender, auto.key = TRUE)
Display specified in terms of
Type of display (histogram, densityplot, etc.)
Variables with specific roles
Typical roles for variables
Primary variables: used for the main graphical display
Conditioning variables: used to divide into subgroups and juxtapose (multipanel conditioning)
Grouping variables: divide into subgroups and superpose
Primary interface: high-level functions
Each function corresponds to a display type
Specification of roles depends on display type
Usually specified through a formula and the groups argument
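The three roles can be seen in a single call. The sketch below uses the barley data shipped with lattice as an assumed example: variety and yield are the primary variables, site is the conditioning variable (one panel per site), and year is the grouping variable (superposed within each panel).

```r
library(lattice)

## Illustrative sketch using the 'barley' data from lattice (an assumption)
p <- dotplot(variety ~ yield | site,   # primary variables | conditioning variable
             data = barley,
             groups = year,            # grouping variable, superposed in each panel
             auto.key = TRUE)          # legend describing the groups

## The levels of the conditioning variable define the panels
p$condlevels
```

Printing p draws the display, with one panel for each level listed in p$condlevels.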
We have used histograms and density plots to understand distribution of gcsescore
We will next see some variations and some other displays with the same goal
Useful to keep in mind that good data graphics should enable comparison
histogram(~ gcsescore | factor(score), data = Chem97,
nint = 10, breaks = NULL, equal.widths = FALSE)
densityplot(~ gcsescore | factor(score), data = Chem97, plot.points = FALSE,
groups = gender, kernel = "triangular")
densityplot(~ gcsescore | factor(score), data = Chem97, plot.points = FALSE,
groups = gender, bw = "bcv")
qqmath(~ gcsescore | factor(score), data = Chem97, groups = gender, auto.key = TRUE,
grid = TRUE, alpha = 0.2)
qqmath(~ gcsescore | factor(score), data = Chem97, groups = gender, auto.key = TRUE, grid = TRUE,
f.value = ppoints(100), ## plot fewer quantiles
aspect = "xy") ## adjust aspect ratio to 'bank' to 45 degrees
qq(gender ~ gcsescore | factor(score), data = Chem97, grid = TRUE,
f.value = ppoints(100), aspect = "iso")
What optional arguments are available?
Where can we find more details about them?
To answer this, we need to learn some details about how lattice works
Summary:
Some optional arguments are common to all high-level lattice functions
Some are specific to the high-level function
Some are specific to the default display panel function
Documented in help(xyplot) (for the most part)
Main categories:
as.table, between, layout, skip: control panel layout; see Chapter 2 of the Lattice book
xlab, ylab, main, sub: labels
xlim, ylim: axis limits
scales: list controlling many details about scales
aspect: aspect ratio
key, auto.key: legend
par.settings: default graphical parameters (theme)
lattice.options: non-graphical settings
Will not discuss all, but will encounter some later (see documentation for details)
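A minimal sketch of several of these common arguments used together is given below. It assumes the singer data shipped with lattice, and all specific values (layout, labels, limits, colors) are arbitrary illustrative choices.

```r
library(lattice)

## Illustrative sketch: dataset and all argument values are assumptions
p <- histogram(~ height | voice.part, data = singer,
               layout = c(2, 4),                     # 2 columns, 4 rows of panels
               as.table = TRUE,                      # fill panels from the top-left
               between = list(x = 0.5, y = 0.5),     # space between panels
               xlab = "Height (inches)",             # labels
               main = "Heights of singers by voice part",
               xlim = c(58, 78),                     # axis limits
               scales = list(alternating = FALSE),   # details of axis annotation
               aspect = 0.5,                         # aspect ratio of each panel
               par.settings = list(plot.polygon = list(col = "grey80")))  # theme
print(p)
```

The same arguments are accepted by the other high-level functions, so the call pattern carries over to xyplot(), bwplot(), and so on.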
Most high-level functions will have some specific optional arguments
All of these have a default panel function to produce the default display
These have names of the form panel.<high-level-function>
For example, panel.histogram, panel.densityplot, panel.bwplot, etc.
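One way to see which arguments belong to a high-level function and which belong to its default panel function is to inspect their formal arguments. The sketch below is only an exploration aid; the variable names are hypothetical, and the help pages remain the authoritative reference.

```r
library(lattice)

## Arguments of the formula method of a high-level function
hist_args <- names(formals(getS3method("histogram", "formula")))
bw_args   <- names(formals(getS3method("bwplot", "formula")))

## Arguments of the corresponding default panel function
bw_panel_args <- names(formals(panel.bwplot))

print(hist_args)
print(bw_panel_args)

## Arguments that panel.bwplot() understands but bwplot() does not name itself;
## these reach the panel function through '...'
print(setdiff(bw_panel_args, bw_args))
```

Arguments such as coef appear only in the panel function's list, which is why they are documented in help(panel.bwplot) rather than in help(xyplot).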
histogram
## S3 method for class 'formula'
histogram(x, data, allow.multiple, outer,
auto.key = FALSE,
aspect = "fill",
panel = lattice.getOption("panel.histogram"),
prepanel, scales, strip, groups,
xlab, xlim, ylab, ylim,
type = c("percent", "count", "density"),
nint = if (is.factor(x)) nlevels(x) else round(log2(length(x)) + 1),
endpoints = extend.limits(range(as.numeric(x), finite = TRUE), prop = 0.04),
breaks,
equal.widths = TRUE,
drop.unused.levels = lattice.getOption("drop.unused.levels"),
...,
lattice.options = NULL,
default.scales = list(),
default.prepanel = lattice.getOption("prepanel.default.histogram"),
subscripts,
subset)
bwplot
## S3 method for class 'formula'
bwplot(x, data, allow.multiple, outer,
auto.key = FALSE,
aspect = "fill",
panel = lattice.getOption("panel.bwplot"),
prepanel = NULL,
scales = list(),
strip = TRUE,
groups = NULL,
xlab, xlim, ylab, ylim,
box.ratio = 1,
horizontal = NULL,
drop.unused.levels = lattice.getOption("drop.unused.levels"),
...,
lattice.options = NULL,
default.scales,
default.prepanel = lattice.getOption("prepanel.default.bwplot"),
subscripts, subset)
bwplot()
bwplot(factor(score) ~ gcsescore | gender, Chem97, layout = c(2, 1), coef = 0,
pch = "|", fill = hcl(h = 240, l = 85))
Graphical parameters are an important part of any graphical display
lattice allows common parameters to be specified as optional arguments to high-level functions
Again, this follows the standard practice in traditional graphics
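However, specifying graphical parameters this way is not a good idea in general, particularly if the plot includes a legend, because the legend does not automatically pick up parameters passed as arguments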
We will discuss customizing graphical parameters in the next presentation
Design goals:
Enable effective graphics by encouraging good graphical practice; e.g., see Cleveland (1985)
Remove the burden from the user as much as possible by building in good defaults into software
Some obvious examples:
Use as much of the available space as possible
Encourage direct comparison by superposition (grouping)
Enable comparison when juxtaposing (conditioning):
use common axes
add common reference objects (such as grids)
Inevitable departure from traditional R graphics paradigms
Any serious graphics system must also be flexible
lattice tries to balance flexibility and ease of use using the following model:
A display is made up of various elements
Coordinated defaults provide meaningful results, but
Each element can be controlled independently
The main elements are:
the primary (panel) display
axis annotation
strip annotation (describing the conditioning process)
legends (typically describing the grouping process)
We will discuss some of these elements in the rest of this course
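The idea that each element can be controlled independently can be sketched with update(), which modifies one aspect of an existing "trellis" object and leaves the rest alone. The mtcars data, the derived data frame cars_df, and the other object names are assumptions made for illustration.

```r
library(lattice)

## Illustrative data: built-in 'mtcars' with two variables turned into factors
cars_df <- transform(mtcars,
                     cyl = factor(cyl),
                     transmission = factor(am, labels = c("automatic", "manual")))

## Coordinated defaults: panels, axes, strips, and a legend all come for free
p <- xyplot(mpg ~ wt | cyl, data = cars_df,
            groups = transmission, auto.key = TRUE)

## Each element can then be changed without touching the others
p_nostrip <- update(p, strip = FALSE)                 # strip annotation
p_labels  <- update(p, xlab = "Weight (1000 lbs)",    # axis annotation
                       ylab = "Miles per gallon")
p_grid    <- update(p, type = c("p", "g"))            # panel display: points plus grid
```

Printing any of these objects draws the corresponding version of the display.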
Load the Cars93 dataset from the MASS package
We are interested in understanding features that explain the MPG.city of a car model
We start by comparing the distribution of MPG.city for the two levels of Man.trans.avail
Draw a strip plot (basically a one-dimensional scatter plot)
The relevant lattice high-level function is stripplot()
Put Man.trans.avail on the y-axis and MPG.city on the x-axis
Modify the plot by adding the optional argument jitter = TRUE
Which help page documents the jitter argument? What does it do?
Which version of the plot would you prefer? Why?
Draw a box-and-whisker plot of the same data with notches
Where is the meaning of notch = TRUE documented?
Can you conclude from this that manual transmission cars are more fuel-efficient?
Perform a two-sample \(t\)-test to compare the means of the two distributions
Is it reasonable to assume equal variance in the two subgroups?
Would your answer change if we take log(MPG.city) as the response?
Draw a scatter plot of MPG.city against Weight
Does fuel efficiency depend on weight?
In the scatter plot of MPG.city against Weight, add Man.trans.avail as a grouping variable
Does fuel efficiency depend on Man.trans.avail after accounting for weight?
Fit a linear model and perform a formal test for this question
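A possible starting point for these exercises is sketched below. It only loads the data and draws the first strip plot exactly as specified above; the remaining steps and all interpretation are left to the reader. The sketch assumes the MASS package is installed.

```r
library(lattice)

## Load the data (assumes the MASS package is available)
data(Cars93, package = "MASS")

## Quick look at the variables used in the exercises
str(Cars93[, c("MPG.city", "Man.trans.avail", "Weight")])

## First strip plot: Man.trans.avail on the y-axis, MPG.city on the x-axis
p <- stripplot(Man.trans.avail ~ MPG.city, data = Cars93)
print(p)
```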
Chambers, John M. 1998. Programming with Data: A Guide to the S Language. New York: Springer.
Cleveland, William S. 1985. The Elements of Graphing Data. Monterey, California: Wadsworth.