Introduction to R Graphics

Deepayan Sarkar

R graphics

  • R has a reputation for being a good system for graphics

  • This is mainly based on its ability to produce good publication-quality statistical plots

  • R actually has two largely independent graphics subsystems

    • Traditional graphics

      • Available in R from the beginning
      • Rich collection of tools
      • Not very flexible
    • Grid graphics

      • Relatively recent (2000)
      • Low-level tool, highly flexible

Grid graphics, lattice and ggplot2

  • Grid graphics is not usually used directly by the user

  • But it forms the basis of two high-level graphics systems:

    • lattice: based on Trellis graphics (Cleveland)

    • ggplot2: inspired by “Grammar of Graphics” (Wilkinson)

  • These represent two very different philosophical approaches to graphics

  • lattice is in many ways a natural successor to traditional graphics

  • ggplot2 represents a completely different declarative approach

  • All of these are worth learning (but requires some effort)

  • I will briefly introduce all three, but mainly focus on lattice

An example: Anscombe’s quartet

  • Francis Anscombe introduced four artificial bivariate datasets to emphasize the importance of graphics

  • These datasets are available in R as the built-in dataset anscombe (see ?anscombe)

'data.frame':   11 obs. of  8 variables:
 $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
 $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
 $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
 $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
 $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
 $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...
  • The datasets all had the same means, standard deviations, and correlation
      x1       x2       x3       x4       y1       y2       y3       y4 
9.000000 9.000000 9.000000 9.000000 7.500909 7.500909 7.500000 7.500909 
      x1       x2       x3       x4       y1       y2       y3       y4 
3.316625 3.316625 3.316625 3.316625 2.031568 2.031657 2.030424 2.030579 
[1] 0.8164205 0.8162365 0.8162867 0.8165214


  • How can we plot all four datasets together?

Anscombe’s quartet using traditional graphics

  • Traditional graphics thinks of this as four different data sets

  • The function to create scatter plots is plot()

  • Multiple plots can be put in the same figure using par(mfrow = ...)

  • Several ways of specifying variable names inside dataset:

plot of chunk unnamed-chunk-3

plot of chunk unnamed-chunk-4

Anscombe’s quartet using lattice and ggplot2

  • Both lattice and ggplot2 are capable of producing a single plot with all four datasets

  • But this requires the dataset to be in the “long format” (or “transposed”), with one row per data point

'data.frame':   44 obs. of  3 variables:
 $ x    : num  10 8 13 9 11 14 6 4 12 7 ...
 $ y    : num  8.04 6.95 7.58 8.81 8.33 ...
 $ which: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...

Anscombe’s quartet using lattice

plot of chunk unnamed-chunk-6
plot of chunk unnamed-chunk-6

plot of chunk unnamed-chunk-6

Anscombe’s quartet using ggplot2

plot of chunk unnamed-chunk-7

Anscombe’s quartet using lattice and ggplot2

  • The approaches share many common features

  • Both Capable of plotting subsets of data (indexed by categorical variables)

    • This idea is known by several names: small multiples, conditioning, facetting
  • Both makes efficient use of available space (common scales, common axes)

  • Different visual appearance, but that is superficial (different default themes)

  • However, the way in which we specify the display is very different

    • lattice uses an extension of the formula-data interface (with function xyplot() instead of plot())

    • ggplot2 specifies type of rendering (geom) and mapping of variables to coordinates (aesthetics)

  • The differences become clearer if we try to customize the display further

  • A natural extension in this example is to add a linear regression line to each scatter plot

Anscombe’s quartet with regression lines

plot of chunk unnamed-chunk-8

  • The traditional graphics approach is to add the line after the plot is drawn

  • In general, a plot is never finished, you can always add more points, lines, text, …

  • This is possible because there is only one plot !

  • lattice and ggplot2 need alternative solutions

  • The ggplot2 solution is to allow plots to have multiple layers

  • The lattice solution is to allow user to fully specify the procedure used to display data

Regression lines: the ggplot2 solution

  • Plot with points only

plot of chunk unnamed-chunk-9

  • Plot with regression line only

plot of chunk unnamed-chunk-10

  • Plot with both points and regression lines

plot of chunk unnamed-chunk-11

Regression lines: the lattice solution

  • The lattice solution is actually very similar to the traditional graphics solution

  • Basically, we want to do the following for each data subset x, y:

    • Draw points at (x, y)

    • Draw the linear regression line through x, y

  • For lattice, we need to encapsulate this procedure into a function

  • This function is then supplied to the xyplot() function as the panel argument

  • Plot with points only

plot of chunk unnamed-chunk-13

  • Plot with grid, points, and regression line

plot of chunk unnamed-chunk-14

  • lattice also supports a layering mechanism similar to ggplot2

plot of chunk unnamed-chunk-15

  • Common customizations like these are also supported directly through optional arguments

plot of chunk unnamed-chunk-16

Philosophy of traditional graphics

  • The core of the traditional R graphics system is the graphics package

  • Various add-on packages provide further functionality

  • The full list of functions can be seen using

  • The listed functions can be roughly categorized into two groups:

    • High-level functions are intended to produce a complete plot by themselves

    • Low-level functions are intended to add elements to existing plots

Examples of high-level traditional graphics functions

Function Default Display
plot() Scatter Plot, Time-series Plot (with type="l")
boxplot() Comparative Box-and-Whisker Plots
barplot() Bar Plot
dotchart() Cleveland Dot Plot
hist() Histogram
plot(density()) Kernel Density Plot
qqnorm() Normal Quantile-Quantile Plot
qqplot() Two-sample Quantile-Quantile Plot
stripchart() Stripchart (Comparative 1-D Scatter Plots)
pairs() Scatter-Plot Matrix

The plot() function

  • The plot() function is actually a generic function

  • Can plot several types of R objects

  • The most common method is the default method plot.default(), which can be used to

    • plot paired numeric data

    • plot univariate numeric data against serial number

  • See ?plot and ?plot.default for details, including

    • logarithmic scales using the log argument

    • the type argument for controlling display

  • Another useful alternative for bivariate data that we have already seen is ?plot.formula

Philosophy of lattice graphics

  • Very similar in spirit to traditional graphics

  • There are various designs or types of graphs for displaying data

  • Each design usually has a name (scatter plot, histogram, box plot, bar chart)

  • lattice provides a high-level function corresponding to each such design (to be directly invoked by user)

  • The display produced should have reasonable defaults

  • Some customization through optional arguments (esp. graphical parameters)

  • Further customization can be done by adding to or replacing the display procedurally

Examples of high-level traditional graphics functions

Function Default Display
plot() Scatter Plot, Time-series Plot (with type="l")
boxplot() Comparative Box-and-Whisker Plots
barplot() Bar Plot
dotchart() Cleveland Dot Plot
hist() Histogram
plot(density()) Kernel Density Plot
qqnorm() Normal Quantile-Quantile Plot
qqplot() Two-sample Quantile-Quantile Plot
stripchart() Stripchart (Comparative 1-D Scatter Plots)
pairs() Scatter-Plot Matrix

lattice defines analogous functions with different names

Function Default Display
xyplot() Scatter Plot, Time-series Plot (with type="l")
bwplot() Comparative Box-and-Whisker Plots
barchart() Bar Plot
dotplot() Cleveland Dot Plot
histogram() Histogram
densityplot() Kernel Density Plot
qqmath() Normal Quantile-Quantile Plot
qq() Two-sample Quantile-Quantile Plot
stripplot() Stripchart (Comparative 1-D Scatter Plots)
splom() Scatter-Plot Matrix
  • Low-level graphics function also have similar lattice analogues
graphics lattice Purpose
text() panel.text() Add Text to a Plot
lines() panel.lines() Add Connected Line Segments to a Plot
points() panel.points() Add Points to a Plot
polygon() panel.polygon() Polygon Drawing
rect() panel.rect() Draw One or More Rectangles
segments() panel.segments() Add Line Segments to a Plot
abline() panel.abline() Add Straight Lines to a Plot
arrows() panel.arrows() Add Arrows to a Plot
grid() panel.grid() Add Grid to a Plot
  • Learning lattice essentially means

    • Learning about these functions (and a few others)

    • Learning how to customize the default displays through optional arguments

    • Learning how to customize displays by writing alternative panel functions

    • Learning how to customize other parts (annotation, axis, themes)

  • We don’t have time to learn everything in detail

  • Instead, we will go through some examples relevant for clinical trial reporting

  • Please ask if you want to know some specific aspect in more detail

Case studies

  • Laboratory Data Scatter Plot

  • Clinical Response Line Plot

  • Clinical Response Bar Chart

  • Box Plot

  • Odds Ratio Plot

  • Kaplan-Meier Estimates Plot

Example: Laboratory Data Scatter plot

  • Read in data: value gives hematocrit percentage, wish to compare to baseline
   subject_id lbtest value baseline trt
1         101    hct  35.0     31.0   a
2         102    hct  40.2     30.0   a
3         103    hct  42.0     42.4   b
4         104    hct  41.2     41.4   b
5         105    hct  35.0     33.3   a
6         106    hct  34.3     34.3   a
7         107    hct  30.3     44.0   b
8         108    hct  34.2     42.0   b
9         109    hct  40.0     41.1   b
10        110    hct  41.0     42.1   b
11        111    hct  33.3     33.8   a
12        112    hct  34.0     31.0   a
13        113    hct  34.0     41.0   b
14        114    hct  34.0     40.0   b
15        115    hct  37.2     35.2   a
16        116    hct  39.3     36.2   a
17        117    hct  36.3     38.3   b
18        118    hct  37.4     37.3   b
19        119    hct  44.2     34.3   a
20        120    hct  42.2     36.5   a
  • Sample SAS output:

SAS example

plot of chunk unnamed-chunk-18

plot of chunk unnamed-chunk-19

Example: Clinical Response Line plot

  • Read in data: trt is treatment, response is recorded for on days 0–10 (0 is baseline)
   trt visit response
1    1     0     9.40
2    2     0     9.35
3    1     1     9.35
4    2     1     9.40
5    1     2     8.22
6    2     2     8.78
7    1     3     6.33
8    2     3     8.23
9    1     4     4.00
10   2     4     7.77
11   1     5     2.22
12   2     5     4.46
13   1     6     1.44
14   2     6     2.00
15   1     7     1.13
16   2     7     1.86
17   1     8     0.55
18   2     8     1.44
19   1     9     0.67
20   2     9     1.33
21   1    10     0.45
22   2    10     1.01
  • Create categorical variables as appropriate
          trt    visit response
1  Super Drug Baseline     9.40
2    Old Drug Baseline     9.35
3  Super Drug    Day 1     9.35
4    Old Drug    Day 1     9.40
5  Super Drug    Day 2     8.22
6    Old Drug    Day 2     8.78
7  Super Drug    Day 3     6.33
8    Old Drug    Day 3     8.23
9  Super Drug    Day 4     4.00
10   Old Drug    Day 4     7.77
11 Super Drug    Day 5     2.22
12   Old Drug    Day 5     4.46
13 Super Drug    Day 6     1.44
14   Old Drug    Day 6     2.00
15 Super Drug    Day 7     1.13
16   Old Drug    Day 7     1.86
17 Super Drug    Day 8     0.55
18   Old Drug    Day 8     1.44
19 Super Drug    Day 9     0.67
20   Old Drug    Day 9     1.33
21 Super Drug   Day 10     0.45
22   Old Drug   Day 10     1.01
  • Sample SAS output:

SAS example

plot of chunk unnamed-chunk-22

plot of chunk unnamed-chunk-23

Example: Clinical response bar chart

  • Read in data: trt is treatment, pain indicates pain level
'data.frame':   150 obs. of  3 variables:
 $ subject: int  113 420 780 121 423 784 122 465 785 124 ...
 $ pain   : int  1 1 0 1 0 0 1 4 1 4 ...
 $ trt    : int  1 2 3 1 2 3 1 2 3 1 ...
  • Create suitable categorical variables
   subject pain      trt
1      113  1-2  Placebo
2      420  1-2 Old Drug
3      780    0 New Drug
4      121  1-2  Placebo
5      423    0 Old Drug
6      784    0 New Drug
7      122  1-2  Placebo
8      465  7-8 Old Drug
9      785  1-2 New Drug
10     124  7-8  Placebo
11     481  5-6 Old Drug
12     786  5-6 New Drug
13     164  7-8  Placebo
14     482    0 Old Drug
15     787    0 New Drug
16     177  7-8  Placebo
17     483    0 Old Drug
18     789    0 New Drug
19     178    0  Placebo
20     484    0 Old Drug
  • Sample SAS output:

SAS example

  • A bar chart essentially summarizes a cross-tabulation of pain levels by treatment

  • First we need to create the appropriate table

     trt
pain  Placebo Old Drug New Drug
  0         6       18       19
  1-2       7       19       15
  3-4       8        9        9
  5-6       9        3        5
  7-8      20        1        2
  • Grouped bar chart:

plot of chunk unnamed-chunk-27

  • Multi-panel bar chart:

plot of chunk unnamed-chunk-28

  • With percentage instead of count (as in SAS output)

plot of chunk unnamed-chunk-29

Alternative: Cleveland Dot Plot — direct superposition makes comparison easier

plot of chunk unnamed-chunk-30

Example: Box plots

  • Read in data: seizures per hour by treatment and visit
'data.frame':   98 obs. of  3 variables:
 $ trt     : int  1 2 2 2 2 2 1 2 1 1 ...
 $ visit   : int  2 1 2 1 2 3 1 3 1 2 ...
 $ seizures: num  1.5 3 1.8 2.6 2 2 2.8 2.6 3 2.2 ...
  • Make suitable categorical variables.
       trt    visit seizures
1   Active 6 Months      1.5
2  Placebo Baseline      3.0
3  Placebo 6 Months      1.8
4  Placebo Baseline      2.6
5  Placebo 6 Months      2.0
6  Placebo 9 Months      2.0
7   Active Baseline      2.8
8  Placebo 9 Months      2.6
9   Active Baseline      3.0
10  Active 6 Months      2.2
11  Active Baseline      2.4
12 Placebo Baseline      3.2
13 Placebo Baseline      3.2
14  Active 6 Months      1.4
15  Active Baseline      2.6
16 Placebo 6 Months      2.1
17  Active 9 Months      1.8
18  Active 6 Months      1.2
19  Active Baseline      2.6
20 Placebo Baseline      3.0
  • Sample SAS output:

SAS example

  • Sample SAS output:

SAS example

  • Sample SAS output:

SAS example

  • Obtaining a grouped box plot as in the book is somewhat complicated

  • It is also not necessarily the best thing to do

  • It is easy to obtain a multi-panel box plot as follows:

plot of chunk unnamed-chunk-33

  • Alternatively, define panels by treatment to visualize the time course for each treatment

plot of chunk unnamed-chunk-34

  • Recreating the grouped boxplot as produced by SAS will need more work

  • Non-standard customizations require an alternative display procedure through a “panel function”

  • The following panel function simply adds a line joining the medians

plot of chunk unnamed-chunk-36

  • The following function is similar, but

    • Indicates the mean instead of the median as a point

    • Still joins the medians by lines

plot of chunk unnamed-chunk-38

  • A proper grouped box plot will need a bit more work (esp. getting the colors right)

  • A reasonable approximation is given by the following

  • This is achieved using the panel.superpose() panel function

  • Converts a regular panel display function into a grouped display function

  • One additional trick is needed to shift the boxplots horizontally according to group

plot of chunk unnamed-chunk-40

Example: Odds Ratio Plot

  • Read in data: pain by treatment, gender, race
'data.frame':   65 obs. of  5 variables:
 $ success : int  1 1 1 1 0 0 1 0 1 1 ...
 $ trt     : int  0 1 0 1 0 0 1 0 1 0 ...
 $ male    : int  1 2 2 1 1 2 1 2 1 1 ...
 $ race    : int  3 1 2 1 3 1 3 1 2 1 ...
 $ basepain: int  20 22 20 10 24 20 22 20 25 11 ...
  • Sample SAS output:

SAS example

  • This plot was new to me

  • Essentially, this summarizes the results of a logistic regression model

  • The coefficients of such a model influence the “success probability” non-linearly

  • With the standard logisitic regression model, the coefficients are the log of the odds ratios

  • Logistic regression models are fit using glm() (for generalized linear models)

  • Uses essentially the same formula interface as lm(), along with a “family” defining model

  • As before, we need to first make suitable categorical variables

    • This does not actually matter for categorical variables that have two levels

    • Such variables contribute only one independent dummy variable

    • However, it does matter for race

  • Unfortunately, the SAS code neglects to treat race as categorical

  • Instead, it treats race as a continuous variable (with values 1, 2, 3)

  • This is an error, but we will retain it to match SAS results

  success trt male race basepain
1       1   0    1    3       20
2       1   1    2    1       22
3       1   0    2    2       20
4       1   1    1    1       10
5       0   0    1    3       24
6       0   0    2    1       20

Example: Odds Ratio Plot

  • Fit logistic regression model:

Call:
glm(formula = success ~ basepain + male + race + trt, family = binomial(), 
    data = pain)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0701  -1.0072   0.6187   0.8637   1.4417  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.217920   1.111842   1.095   0.2733  
basepain     0.009032   0.027082   0.334   0.7388  
male2       -0.768458   0.594465  -1.293   0.1961  
race        -0.616383   0.423524  -1.455   0.1456  
trt1         1.226517   0.588593   2.084   0.0372 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 81.792  on 64  degrees of freedom
Residual deviance: 72.688  on 60  degrees of freedom
AIC: 82.688

Number of Fisher Scoring iterations: 3
  • We need the relevant coefficient estimates (log odds) and their confidence intervals

  • This computation is available in the MASS package

Waiting for profiling to be done...
                  2.5 %     97.5 %
(Intercept) -0.92576323 3.49865470
basepain    -0.04126015 0.06790952
male2       -1.96474468 0.39138930
race        -1.48310317 0.20591317
trt1         0.10652176 2.44258253
  • We use this to construct a data frame that will provide data for our plot
Waiting for profiling to be done...
              covariate     estimate       2.5 %     97.5 %
(Intercept) (Intercept)  1.217919975 -0.92576323 3.49865470
basepain       basepain  0.009032014 -0.04126015 0.06790952
male2             male2 -0.768458365 -1.96474468 0.39138930
race               race -0.616382916 -1.48310317 0.20591317
trt1               trt1  1.226516853  0.10652176 2.44258253
  • An easy way to plot these is to plot the coeffient estimates and then add the confidence intervals.

plot of chunk unnamed-chunk-47

plot of chunk unnamed-chunk-48

  • lattice version: cheating a bit using with()
  • Log-scales are faked by specifying explicit tick mark locations and labels

  • This example uses another feature of lattice: high-level plot functions return objects

  • The actual display is created by the print() method (see next page)

plot of chunk unnamed-chunk-50

  • A better alternative is the segplot() function in latticeExtra

plot of chunk unnamed-chunk-51

Example: Kaplan-Meier Estimates Plot

  • Read in data:
'data.frame':   200 obs. of  3 variables:
 $ trt        : Factor w/ 3 levels "Placebo","Old Drug",..: 1 1 3 3 2 2 2 2 3 3 ...
 $ daystodeath: int  52 825 693 981 279 826 531 15 1057 793 ...
 $ deathcensor: int  1 0 0 0 1 0 0 0 0 0 ...
  • As seen earlier, we need to create a survival object and use it in the survfit() function
Call: survfit(formula = Surv(daystodeath, deathcensor) ~ trt, data = death)

              n events median 0.95LCL 0.95UCL
trt=Placebo  68     26    980     872      NA
trt=Old Drug 64     18     NA    1024      NA
trt=New Drug 68      7     NA      NA      NA
  • The fitted survival model can be plotted using the corresponding plot() method.

plot of chunk unnamed-chunk-54

  • Can also use xyplot, but lacks event marking, and the curves do not start from time 0

plot of chunk unnamed-chunk-55