Lattice Graphics: Basic Usage

Deepayan Sarkar

What is the lattice package?

  • Implementation of Trellis graphics for R

  • Powerful high-level data visualization system

  • Traditional user interface:

    • Collection of high level functions: xyplot(), dotplot(), etc.

    • Interface based on formula and data source

Origins

  • As we know, R is a Free Software re-implementation / dialect of S

  • The original implementation, available commercially as S-PLUS, was developed at Bell Labs

  • Both traditional graphics and Trellis graphics were part of the original S

  • Remembering this helps in understanding the design philosophy of lattice

Good graphical principles

  • Two important influences on graphics in S:

    • John W. Tukey

    • William Cleveland

John Tukey

  • Among the most influential modern statisticians

  • Champion of “Exploratory Data Analysis”

  • Worked at Bell Labs (where the S language was created)

  • Did not write software, but influenced the spirit of S

William Cleveland

  • Also worked at Bell Labs for a long time, and directly influenced the design of S graphics

  • Two important books:

    • The Elements of Graphing Data (1985)

    • Visualizing Data (1993)

  • Trellis graphics is essentially an implementation of ideas in the second book

Philosophy of data graphics in S

  • There are various designs or types of graphs for displaying data

  • Each design usually has a name (scatter plot, histogram, box plot, bar chart)

  • S has a high-level function corresponding to each such design (to be directly invoked by user)

  • The display produced should have reasonable defaults

  • Some customization through optional arguments (esp. graphical parameters)

  • Further customization can be done by adding to or replacing the display procedurally

  • Implicit expectation: S users will eventually turn into programmers

  • John Chambers, Preface of “Programming with Data” (Chambers 1998):

“S encourages you to slide into programming, perhaps without noticing”

Examples of high-level traditional graphics functions

Function          Default Display
----------------  ----------------------------------------------
plot()            Scatter Plot, Time-series Plot (with type="l")
boxplot()         Comparative Box-and-Whisker Plots
barplot()         Bar Plot
dotchart()        Cleveland Dot Plot
hist()            Histogram
plot(density())   Kernel Density Plot
qqnorm()          Normal Quantile-Quantile Plot
qqplot()          Two-sample Quantile-Quantile Plot
stripchart()      Stripchart (Comparative 1-D Scatter Plots)
pairs()           Scatter-Plot Matrix

lattice defines analogous functions with different names

Function          Default Display
----------------  ----------------------------------------------
xyplot()          Scatter Plot, Time-series Plot (with type="l")
bwplot()          Comparative Box-and-Whisker Plots
barchart()        Bar Plot
dotplot()         Cleveland Dot Plot
histogram()       Histogram
densityplot()     Kernel Density Plot
qqmath()          Normal Quantile-Quantile Plot
qq()              Two-sample Quantile-Quantile Plot
stripplot()       Stripchart (Comparative 1-D Scatter Plots)
splom()           Scatter-Plot Matrix

lattice defines analogous functions with different names

  • Learning lattice essentially means

    • Learning about these functions (and a few others I didn’t mention)

    • Learning how to customize the default displays through optional arguments

    • Learning how to customize displays by writing alternative panel functions

    • Learning how to customize other parts (annotation, axis, themes)

  • We will now quickly go through some examples covering the first two points

Examples

Dataset for illustration: The Chem97 dataset

  • 1997 A-level Chemistry examination in Britain
'data.frame':   31022 obs. of  8 variables:
 $ lea      : Factor w/ 131 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ school   : Factor w/ 2410 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ student  : Factor w/ 31022 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ score    : num  4 10 10 10 8 10 6 8 4 10 ...
 $ gender   : Factor w/ 2 levels "M","F": 2 2 2 2 2 2 2 2 2 2 ...
 $ age      : num  3 -3 -4 -2 -1 4 1 4 3 0 ...
 $ gcsescore: num  6.62 7.62 7.25 7.5 6.44 ...
 $ gcsecnt  : num  0.339 1.339 0.964 1.214 0.158 ...
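
The listing above is the kind of output str() produces. A minimal sketch for reproducing it, assuming the Chem97 data is taken from the mlmRev package (an assumption; the slide does not say where the data comes from) and that the package is installed:

```r
# Assumption: Chem97 is provided by the mlmRev package
data(Chem97, package = "mlmRev")

# Compact summary of the data frame: one line per variable
str(Chem97)
```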

Dataset for illustration: The Chem97 dataset

  • We are only interested in

    • score : Point score on A-level Chemistry in 1997 (advanced level)

    • gender : Student’s gender

    • gcsescore : Average GCSE score of individual (secondary level)

  score gender gcsescore
1     4      F     6.625
2    10      F     7.625
3    10      F     7.250
4    10      F     7.500
5     8      F     6.444
6    10      F     7.750

A basic histogram

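The code behind this slide is not preserved. One plausible sketch passes the numeric vector directly to histogram(), much as one would call hist() in traditional graphics. It assumes Chem97 comes from the mlmRev package.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Histogram of a numeric vector, without the formula interface
p <- histogram(Chem97$gcsescore)
print(p)  # lattice functions return an object; printing draws it
```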

A basic histogram using the formula interface

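A sketch of the same display using the formula interface, again assuming Chem97 from mlmRev. The one-sided formula names the primary variable, and data= says where to look it up.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# One-sided formula: gcsescore is the primary variable
p <- histogram(~ gcsescore, data = Chem97)
print(p)
```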

Histograms with multipanel conditioning

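A sketch of multipanel conditioning, assuming Chem97 from mlmRev. The conditioning variable goes after the | in the formula. Here score is numeric, so it is converted to a factor to get one panel per A-level score; that choice is an assumption about what the lost figure showed.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# One panel for each distinct value of score
p <- histogram(~ gcsescore | factor(score), data = Chem97)
print(p)
```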

Innovations

  • The most visible innovation in lattice over traditional graphics is multipanel conditioning

    • Common scales and shared axis labeling by default

    • Strips above each panel describing subset

    • Optimal use of space (e.g., no extra space left for main label unless present)

  • This is the origin of the names “Trellis graphics” and “lattice”

  • Makes use of the formula-data interface — similar to modeling functions like lm()

  • High-level lattice calls usually have a formula and a data= argument

Density plots with multipanel conditioning

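A sketch of conditioned density plots, assuming Chem97 from mlmRev. The original slides show several versions of this display whose code is not preserved; the two variants below (default, then with the data points suppressed and a reference line added) are illustrative guesses using documented arguments of panel.densityplot().

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Default display: density estimate plus jittered data points
p1 <- densityplot(~ gcsescore | factor(score), data = Chem97)
print(p1)

# Variant: no points, with a horizontal reference line at zero
p2 <- densityplot(~ gcsescore | factor(score), data = Chem97,
                  plot.points = FALSE, ref = TRUE)
print(p2)
```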


Density plots with conditioning and within-panel grouping

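A sketch combining conditioning and grouping, assuming Chem97 from mlmRev. Which variable conditions and which groups is an assumption here: gender defines the panels, and the groups argument superposes one density curve per score within each panel. auto.key asks for a legend; its three-column layout is an arbitrary choice.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Panels by gender, superposed curves by score
p <- densityplot(~ gcsescore | gender, data = Chem97, groups = score,
                 plot.points = FALSE,
                 auto.key = list(columns = 3))
print(p)
```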

Trellis Philosophy: Part I

  • Display specified in terms of

    • Type of display (histogram, densityplot, etc.)

    • Variables with specific roles

  • Typical roles for variables

    • Primary variables: used for the main graphical display

    • Conditioning variables: used to divide into subgroups and juxtapose (multipanel conditioning)

    • Grouping variables: divide into subgroups and superpose

  • Primary interface: high-level functions

    • Each function corresponds to a display type

    • Specification of roles depends on display type

    • Usually specified through a formula and the groups argument

Plots to summarize univariate distributions

  • We have used histograms and density plots to understand distribution of gcsescore

  • We will next see some variations and some other displays with the same goal

  • Useful to keep in mind that good data graphics should enable comparison

Variations: density histograms with 50 bins

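A sketch of this variation, assuming Chem97 from mlmRev. type = "density" puts the histogram on the density scale, and nint sets the number of bins when breaks is not given.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Density-scale histograms with 50 bins
p <- histogram(~ gcsescore | factor(score), data = Chem97,
               type = "density", nint = 50)
print(p)
```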

Variations: histograms with unequal-width bins

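One way to get unequal-width bins, sketched under the assumption that Chem97 comes from mlmRev, is to pass explicit break points. Here they are quantiles of the pooled data, so each bin holds roughly the same number of observations overall; this is an illustrative choice and not necessarily what the original slide did. The density scale is used because counts are misleading when bin widths differ.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Break points at every 5th percentile of the pooled data;
# unique() guards against repeated quantiles caused by ties
brk <- unique(quantile(Chem97$gcsescore,
                       probs = seq(0, 1, length.out = 21),
                       na.rm = TRUE))

p <- histogram(~ gcsescore | factor(score), data = Chem97,
               breaks = brk, type = "density")
print(p)
```

The histogram() help page also describes breaks = NULL together with equal.widths = FALSE, which chooses approximately equal-area bins separately within each panel.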

Variations: density plots with triangular kernel (ASH)

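A sketch of a density plot with a triangular kernel, assuming Chem97 from mlmRev. densityplot() forwards kernel to density(), which accepts "triangular" among its kernel choices.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Kernel density estimates using a triangular kernel
p <- densityplot(~ gcsescore | factor(score), data = Chem97,
                 kernel = "triangular", plot.points = FALSE)
print(p)
```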

Variations: bandwidth chosen by biased cross-validation

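A sketch of bandwidth selection by biased cross-validation, assuming Chem97 from mlmRev. The bw argument is forwarded to density(), where the string "bcv" selects bw.bcv(). The bandwidth is chosen separately for each panel, and the selector may emit warnings for some subsets.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Bandwidth chosen per panel by biased cross-validation
p <- densityplot(~ gcsescore | factor(score), data = Chem97,
                 bw = "bcv", plot.points = FALSE)
print(p)
```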

Normal quantile-quantile plots

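A sketch of conditioned normal Q-Q plots, assuming Chem97 from mlmRev. With several thousand observations per panel, f.value = ppoints(100) plots 100 quantiles instead of every data point; that thinning is an optional choice made here.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Normal Q-Q plot of gcsescore within each score level
p <- qqmath(~ gcsescore | factor(score), data = Chem97,
            f.value = ppoints(100))
print(p)
```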

Normal quantile-quantile plots with banking

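A sketch of Q-Q plots with banking, assuming Chem97 from mlmRev. aspect = "xy" asks lattice to choose the aspect ratio by the 45-degree banking rule. The grouping by score, the legend placement, and the axis labels are assumptions made for illustration.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Panels by gender, one Q-Q curve per score, banked to 45 degrees
p <- qqmath(~ gcsescore | gender, data = Chem97, groups = score,
            aspect = "xy", f.value = ppoints(100),
            auto.key = list(space = "right"),
            xlab = "Standard normal quantiles",
            ylab = "Average GCSE score")
print(p)
```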

Two-sample quantile-quantile plots

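A sketch of two-sample Q-Q plots, assuming Chem97 from mlmRev. In qq(), the left-hand side of the formula is a factor with exactly two levels defining the two samples (here gender), and the right-hand side is the variable whose quantiles are compared.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Quantiles of gcsescore for one gender against the other,
# separately for each score level
p <- qq(gender ~ gcsescore | factor(score), data = Chem97,
        f.value = ppoints(100), aspect = 1)
print(p)
```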

Box and whisker plots for multi-sample comparisons

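A sketch of comparative box-and-whisker plots, assuming Chem97 from mlmRev. With a factor on the left of the formula and a numeric variable on the right, bwplot() draws horizontal boxes, one per factor level.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# One horizontal box per score level, panels by gender
p <- bwplot(factor(score) ~ gcsescore | gender, data = Chem97,
            xlab = "Average GCSE score")
print(p)
```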

Box and whisker plots with categorical variable on x-axis

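Swapping the two sides of the formula puts the categorical variable on the x-axis and gives vertical boxes. A sketch, assuming Chem97 from mlmRev; the choice of gender on the x-axis with panels by score is an assumption.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Vertical boxes: gender on the x-axis, one panel per score level
p <- bwplot(gcsescore ~ gender | factor(score), data = Chem97)
print(p)
```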

Box and whisker plots with explicit panel layout and gaps

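A sketch of explicit layout control, assuming Chem97 from mlmRev. layout = c(6, 1) requests six columns and one row of panels, and between adds space between adjacent panels. The specific values are illustrative, not recovered from the slide.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Six panels in a single row, with a small horizontal gap between them
p <- bwplot(gcsescore ~ gender | factor(score), data = Chem97,
            layout = c(6, 1),
            between = list(x = 0.5))
print(p)
```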

Box and whisker plots with notches and variable width

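A sketch using two arguments of panel.bwplot(), assuming Chem97 from mlmRev. notch = TRUE draws notches around the medians, and varwidth = TRUE makes box widths reflect group sizes. Both are given in the high-level call and passed on to the panel function.

```r
library(lattice)
data(Chem97, package = "mlmRev")  # assumption: mlmRev is installed

# Notched boxes whose widths vary with the number of observations
p <- bwplot(gcsescore ~ gender | factor(score), data = Chem97,
            layout = c(6, 1),
            notch = TRUE, varwidth = TRUE)
print(p)
```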

Optional arguments

  • What optional arguments are available?

  • Where can we find more details about them?

  • To answer this, we need to learn some details about how lattice works

  • Summary:

    • Some optional arguments are common to all high-level lattice functions

    • Some are specific to the high-level function

    • Some are specific to the default display panel function

Common optional arguments

  • Documented in help(xyplot) (for the most part)

  • Main categories:

    • as.table, between, layout, skip : control panel layout; see Chapter 2 of the Lattice book

    • xlab, ylab, main, sub : labels

    • xlim, ylim : axis limits

    • scales : list controlling many details about scales

    • aspect : aspect ratio

    • key, auto.key : legend

    • par.settings : default graphical parameters (theme)

    • lattice.options : non-graphical settings

  • Will not discuss all, but will encounter some later (see documentation for details)

Display-specific optional arguments

  • Most high-level functions will have some specific optional arguments

  • Every high-level function has a default panel function that produces the default display

  • Default panel functions have names of the form panel.<high-level-function>

  • For example, panel.histogram, panel.densityplot, panel.bwplot, etc.

  • Optional arguments of the panel function can also be specified in the high-level call
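
One way to see which display-specific arguments are available is to inspect the default panel function. A small sketch using only the lattice package:

```r
library(lattice)

# Formal arguments of two default panel functions; any of these can be
# given in the corresponding high-level call
names(formals(panel.bwplot))
names(formals(panel.densityplot))

# Their meaning is documented on the panel function's help page,
# e.g. help(panel.bwplot)
```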

Display-specific optional arguments: histogram

Display-specific optional arguments: bwplot

Example: Specifying optional parameters in bwplot()


Specifying graphical parameters

  • Graphical parameters are an important part of any graphical display

  • lattice allows common parameters to be specified as optional arguments to high-level functions

  • Again, this follows the standard practice in traditional graphics

  • However, this is not a good idea in general, particularly if the plot includes a legend: the legend is built from the theme settings and does not pick up parameters supplied this way

  • We will discuss customizing graphical parameters in the next presentation

Summary

Trellis Philosophy: Part I

  • Display specified in terms of

    • Type of display (histogram, densityplot, etc.)

    • Variables with specific roles

  • Typical roles for variables

    • Primary variables: used for the main graphical display

    • Conditioning variables: used to divide into subgroups and juxtapose (multipanel conditioning)

    • Grouping variables: divide into subgroups and superpose

  • Primary interface: high-level functions

    • Each function corresponds to a display type

    • Specification of roles depends on display type

    • Usually specified through a formula and the groups argument

Trellis Philosophy: Part II

  • Design goals:

    • Enable effective graphics by encouraging good graphical practice; e.g., see Cleveland (1985)

    • Remove the burden from the user as much as possible by building in good defaults into software

  • Some obvious examples:

    • Use as much of the available space as possible

    • Encourage direct comparison by superposition (grouping)

    • Enable comparison when juxtaposing (conditioning):

      • use common axes

      • add common reference objects (such as grids)

  • Inevitable departure from traditional R graphics paradigms

Trellis Philosophy: Part III

  • Any serious graphics system must also be flexible

  • lattice tries to balance flexibility and ease of use using the following model:

    • A display is made up of various elements

    • Coordinated defaults provide meaningful results, but

    • Each element can be controlled independently

    • The main elements are:

      • the primary (panel) display

      • axis annotation

      • strip annotation (describing the conditioning process)

      • legends (typically describing the grouping process)

  • We will discuss some of these elements in the rest of this course

Exercises

  • Load the Cars93 dataset from the MASS package

  • We are interested in understanding features that explain the MPG.city of a car model

  • We start by comparing the distribution of MPG.city for the two levels of Man.trans.avail

  • Draw a strip plot (basically a one-dimensional scatter plot)

    • The relevant lattice high-level function is stripplot()

    • Put Man.trans.avail on the y-axis and MPG.city on the x-axis

  • Modify the plot by adding the optional argument jitter = TRUE

    • Which help page documents the jitter argument? What does it do?

    • Which version of the plot would you prefer? Why?

Exercises

  • Draw a box-and-whisker plot of the same data with notches

    • Where is the meaning of notch = TRUE documented?

    • Can you conclude from this that manual transmission cars are more fuel-efficient?

  • Perform a two-sample t-test to compare the means of the two distributions

    • Is it reasonable to assume equal variance in the two subgroups?

    • Would your answer change if we take log(MPG.city) as the response?

Exercises

  • Draw a scatter plot of MPG.city against Weight

  • Does fuel efficiency depend on weight?

  • In the scatter plot of MPG.city against Weight, add Man.trans.avail as a grouping variable

  • Does fuel efficiency depend on Man.trans.avail after accounting for weight?

  • Fit a linear model and perform a formal test for this question

References

Chambers, John M. 1998. Programming with Data: A Guide to the S Language. New York: Springer.

Cleveland, William S. 1985. The Elements of Graphing Data. Monterey, California: Wadsworth.