# Introduction to Data Visualization in R

## Introductory Computer Programming

### Deepayan Sarkar

---

# Data Visualization

---

* Important component of data analysis

* Main purposes

    - Exploration

    - Presentation

---

* Learning objectives

    - What kind of visualization to use

    - How to create them

---

# Example datasets: `airquality` (size: small)

```r
str(airquality) # built-in dataset
```

```
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
```

???

The first dataset we will consider is a simple built-in dataset in R,
giving daily air quality measurements in New York City over five
months in 1973.

It has 153 observations, one for each day from May through September.
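
The row count and the per-month breakdown can be checked directly. This is a minimal sketch using base R only; the name `days_per_month` is introduced here for illustration.

```r
# Number of rows: one per day from May through September
nrow(airquality)

# Observations per month (months are coded as 5 to 9)
days_per_month <- table(airquality$Month)
days_per_month
```

The table should show one count per month, matching the number of days in each month.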

---

# Example datasets: `airquality` (size: small)

```r
head(airquality, 15)
```

```
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
```

???

There are some NA values, which indicate missing data.

Also notice that dates are specified separately in the `Month` and `Day`
columns, and `Month` is indicated by number rather than name.
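
Both points can be explored with a few lines of base R. The sketch below counts the missing values per column and builds a proper date from `Month` and `Day`. The variable names `na_counts`, `aq_date`, and `aq_month` are illustrative, and the year 1973 is taken from the dataset description rather than from the data frame itself.

```r
# Count missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Combine Month and Day into a Date (the year is not stored in the data)
aq_date <- as.Date(paste(1973, airquality$Month, airquality$Day, sep = "-"))
range(aq_date)

# Month names instead of numbers, using the built-in constant month.name
aq_month <- factor(month.name[airquality$Month], levels = month.name[5:9])
table(aq_month)
```

Storing the month as a factor with explicit levels keeps the months in calendar order rather than alphabetical order in tables and plots.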

---

# Example datasets: `gapminder` (size: moderate)

```r
gapminder <- read.table("data/gapminder.tsv", sep = "\t", header = TRUE)
str(gapminder)
```

```
'data.frame':	1698 obs. of  6 variables:
 $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ continent: chr  "Asia" "Asia" "Asia" "Asia" ...
 $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
 $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num  779 821 853 836 740 ...
```

---

# Example datasets: `gapminder` (size: moderate)

```r
subset(gapminder, country == "Australia")
```

```
     country continent year lifeExp      pop gdpPercap
61 Australia   Oceania 1952  69.120  8691212  10039.60
62 Australia   Oceania 1957  70.330  9712569  10949.65
63 Australia   Oceania 1962  70.930 10794968  12217.23
64 Australia   Oceania 1967  71.100 11872264  14526.12
65 Australia   Oceania 1972  71.930 13177000  16788.63
66 Australia   Oceania 1977  73.490 14074100  18334.20
67 Australia   Oceania 1982  74.740 15184200  19477.01
68 Australia   Oceania 1987  76.320 16257249  21888.89
69 Australia   Oceania 1992  77.560 17481977  23424.77
70 Australia   Oceania 1997  78.830 18565243  26997.94
71 Australia   Oceania 2002  80.370 19546792  30687.75
72 Australia   Oceania 2007  81.235 20434176  34435.37
```

???

The dataset has 1698 rows, but this size is mainly due to the fact
that it contains records for many countries.

If we restrict our attention to the subset for Australia, for example,
we see that there are only 12 observations.
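
Counting rows per country makes this structure explicit. Since the `gapminder` file may not be available, the sketch below uses a hypothetical excerpt called `toy`, with a few values copied from the output shown on these slides. With the full data, `toy` could be replaced by `gapminder`.

```r
# Hypothetical excerpt: three rows each for two countries
toy <- data.frame(
    country = rep(c("Afghanistan", "Australia"), each = 3),
    year    = rep(c(1952, 1957, 1962), times = 2),
    lifeExp = c(28.8, 30.3, 32.0, 69.12, 70.33, 70.93)
)

# Number of rows for each country
rows_per_country <- table(toy$country)
rows_per_country

# Number of rows for a single country
nrow(subset(toy, country == "Australia"))
```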

---

# Example datasets: `NHANES` (size: somewhat large)

```r
library(package = "NHANES")
str(NHANES)
```

```
tbl_df [10,000 × 76] (S3: tbl_df/tbl/data.frame)
 $ ID              : int [1:10000] 51624 51624 51624 51625 51630 51638 51646 51647 51647 51647 ...
 $ SurveyYr        : Factor w/ 2 levels "2009_10","2011_12": 1 1 1 1 1 1 1 1 1 1 ...
 $ Gender          : Factor w/ 2 levels "female","male": 2 2 2 2 1 2 2 1 1 1 ...
 $ Age             : int [1:10000] 34 34 34 4 49 9 8 45 45 45 ...
 $ AgeDecade       : Factor w/ 8 levels " 0-9"," 10-19",..: 4 4 4 1 5 1 1 5 5 5 ...
 $ AgeMonths       : int [1:10000] 409 409 409 49 596 115 101 541 541 541 ...
 $ Race1           : Factor w/ 5 levels "Black","Hispanic",..: 4 4 4 5 4 4 4 4 4 4 ...
 $ Race3           : Factor w/ 6 levels "Asian","Black",..: NA NA NA NA NA NA NA NA NA NA ...
 $ Education       : Factor w/ 5 levels "8th Grade","9 - 11th Grade",..: 3 3 3 NA 4 NA NA 5 5 5 ...
 $ MaritalStatus   : Factor w/ 6 levels "Divorced","LivePartner",..: 3 3 3 NA 2 NA NA 3 3 3 ...
 $ HHIncome        : Factor w/ 12 levels " 0-4999"," 5000-9999",..: 6 6 6 5 7 11 9 11 11 11 ...
 $ HHIncomeMid     : int [1:10000] 30000 30000 30000 22500 40000 87500 60000 87500 87500 87500 ...
 $ Poverty         : num [1:10000] 1.36 1.36 1.36 1.07 1.91 1.84 2.33 5 5 5 ...
 $ HomeRooms       : int [1:10000] 6 6 6 9 5 6 7 6 6 6 ...
 $ HomeOwn         : Factor w/ 3 levels "Own","Rent","Other": 1 1 1 1 2 2 1 1 1 1 ...
 $ Work            : Factor w/ 3 levels "Looking","NotWorking",..: 2 2 2 NA 2 NA NA 3 3 3 ...
 $ Weight          : num [1:10000] 87.4 87.4 87.4 17 86.7 29.8 35.2 75.7 75.7 75.7 ...
 $ Length          : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HeadCirc        : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ Height          : num [1:10000] 165 165 165 105 168 ...
 $ BMI             : num [1:10000] 32.2 32.2 32.2 15.3 30.6 ...
 $ BMICatUnder20yrs: Factor w/ 4 levels "UnderWeight",..: NA NA NA NA NA NA NA NA NA NA ...
 $ BMI_WHO         : Factor w/ 4 levels "12.0_18.5","18.5_to_24.9",..: 4 4 4 1 4 1 2 3 3 3 ...
 $ Pulse           : int [1:10000] 70 70 70 NA 86 82 72 62 62 62 ...
 $ BPSysAve        : int [1:10000] 113 113 113 NA 112 86 107 118 118 118 ...
 $ BPDiaAve        : int [1:10000] 85 85 85 NA 75 47 37 64 64 64 ...
 $ BPSys1          : int [1:10000] 114 114 114 NA 118 84 114 106 106 106 ...
 $ BPDia1          : int [1:10000] 88 88 88 NA 82 50 46 62 62 62 ...
 $ BPSys2          : int [1:10000] 114 114 114 NA 108 84 108 118 118 118 ...
 $ BPDia2          : int [1:10000] 88 88 88 NA 74 50 36 68 68 68 ...
 $ BPSys3          : int [1:10000] 112 112 112 NA 116 88 106 118 118 118 ...
 $ BPDia3          : int [1:10000] 82 82 82 NA 76 44 38 60 60 60 ...
 $ Testosterone    : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ DirectChol      : num [1:10000] 1.29 1.29 1.29 NA 1.16 1.34 1.55 2.12 2.12 2.12 ...
 $ TotChol         : num [1:10000] 3.49 3.49 3.49 NA 6.7 4.86 4.09 5.82 5.82 5.82 ...
 $ UrineVol1       : int [1:10000] 352 352 352 NA 77 123 238 106 106 106 ...
 $ UrineFlow1      : num [1:10000] NA NA NA NA 0.094 ...
 $ UrineVol2       : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ UrineFlow2      : num [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ Diabetes        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ DiabetesAge     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HealthGen       : Factor w/ 5 levels "Excellent","Vgood",..: 3 3 3 NA 3 NA NA 2 2 2 ...
 $ DaysPhysHlthBad : int [1:10000] 0 0 0 NA 0 NA NA 0 0 0 ...
 $ DaysMentHlthBad : int [1:10000] 15 15 15 NA 10 NA NA 3 3 3 ...
 $ LittleInterest  : Factor w/ 3 levels "None","Several",..: 3 3 3 NA 2 NA NA 1 1 1 ...
 $ Depressed       : Factor w/ 3 levels "None","Several",..: 2 2 2 NA 2 NA NA 1 1 1 ...
 $ nPregnancies    : int [1:10000] NA NA NA NA 2 NA NA 1 1 1 ...
 $ nBabies         : int [1:10000] NA NA NA NA 2 NA NA NA NA NA ...
 $ Age1stBaby      : int [1:10000] NA NA NA NA 27 NA NA NA NA NA ...
 $ SleepHrsNight   : int [1:10000] 4 4 4 NA 8 NA NA 8 8 8 ...
 $ SleepTrouble    : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ PhysActive      : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 2 2 2 ...
 $ PhysActiveDays  : int [1:10000] NA NA NA NA NA NA NA 5 5 5 ...
 $ TVHrsDay        : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
 $ CompHrsDay      : Factor w/ 7 levels "0_hrs","0_to_1_hr",..: NA NA NA NA NA NA NA NA NA NA ...
 $ TVHrsDayChild   : int [1:10000] NA NA NA 4 NA 5 1 NA NA NA ...
 $ CompHrsDayChild : int [1:10000] NA NA NA 1 NA 0 6 NA NA NA ...
 $ Alcohol12PlusYr : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ AlcoholDay      : int [1:10000] NA NA NA NA 2 NA NA 3 3 3 ...
 $ AlcoholYear     : int [1:10000] 0 0 0 NA 20 NA NA 52 52 52 ...
 $ SmokeNow        : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA NA NA NA ...
 $ Smoke100        : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ Smoke100n       : Factor w/ 2 levels "Non-Smoker","Smoker": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ SmokeAge        : int [1:10000] 18 18 18 NA 38 NA NA NA NA NA ...
 $ Marijuana       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ AgeFirstMarij   : int [1:10000] 17 17 17 NA 18 NA NA 13 13 13 ...
 $ RegularMarij    : Factor w/ 2 levels "No","Yes": 1 1 1 NA 1 NA NA 1 1 1 ...
 $ AgeRegMarij     : int [1:10000] NA NA NA NA NA NA NA NA NA NA ...
 $ HardDrugs       : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 1 1 1 ...
 $ SexEver         : Factor w/ 2 levels "No","Yes": 2 2 2 NA 2 NA NA 2 2 2 ...
 $ SexAge          : int [1:10000] 16 16 16 NA 12 NA NA 13 13 13 ...
 $ SexNumPartnLife : int [1:10000] 8 8 8 NA 10 NA NA 20 20 20 ...
 $ SexNumPartYear  : int [1:10000] 1 1 1 NA 1 NA NA 0 0 0 ...
 $ SameSex         : Factor w/ 2 levels "No","Yes": 1 1 1 NA 2 NA NA 2 2 2 ...
 $ SexOrientation  : Factor w/ 3 levels "Bisexual","Heterosexual",..: 2 2 2 NA 2 NA NA 1 1 1 ...
 $ PregnantNow     : Factor w/ 3 levels "Yes","No","Unknown": NA NA NA NA NA NA NA NA NA NA ...
```

???

Data originally come from a health and nutrition survey conducted regularly in the USA.

Each row in this dataset represents a respondent in the study.

The actual study uses a fairly complex survey design.

This is not the full dataset, but rather a carefully chosen subset that can be treated as a _random sample_ from the US population.

---

# The goal of data visualization

* Visualizations help us study relationships

* This is enabled by comparison

???

Visual comparisons require the data values being plotted to be
converted into something that _can_ be plotted.

The most common and obvious mapping is from a value to a _coordinate
position_ on the plot.

But the mapping can also be to _length_, _area_, or even
_color_.

We will see some of these mappings in the examples that follow.
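
As a rough preview, the sketch below maps several `airquality` variables at once: two to coordinate position, one to symbol area, and one to color. The choice of variables and the names `aq`, `month_col`, and `temp_cex` are assumptions made for illustration. A mapping to length, as in a bar chart, is not shown.

```r
# Keep only the days with an ozone measurement
aq <- subset(airquality, !is.na(Ozone))

# Color: one color per month (months 5 to 9 become indices 1 to 5)
month_col <- rainbow(5)[aq$Month - 4]

# Area: cex scales a symbol's linear size, so take a square root
# to make the symbol area roughly proportional to temperature
temp_cex <- 2 * sqrt(aq$Temp / max(aq$Temp))

# Position: wind speed on the x-axis, ozone on the y-axis
plot(aq$Wind, aq$Ozone, pch = 16, col = month_col, cex = temp_cex,
     xlab = "Wind", ylab = "Ozone")
```

Position is usually the easiest mapping to read accurately, which is why the two variables of main interest are placed on the axes.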

---

# What do we study using visualization?

- Univariate distributions

- Bivariate and trivariate (generally multivariate) relationships

- Special case: Relationship with time (time-series) or space (spatial)

???

Let us now dive into some visualization _examples_, keeping our earlier discussion in mind.

We will start with a simple univariate data vector, namely the vector of ozone concentrations in the `airquality` dataset.
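
Before plotting, it can help to look at a numeric summary of the vector. This is a minimal sketch; the name `ozone` is introduced here for convenience.

```r
# Extract the ozone measurements as a plain vector
ozone <- airquality$Ozone

# One value per day, including the missing ones
length(ozone)

# Smallest and largest observed values, ignoring NA
range(ozone, na.rm = TRUE)

# Five-number summary, mean, and NA count
summary(ozone)
```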

---

# The `plot()` function

```r
plot(airquality$Ozone)
```

![plot of chunk unnamed-chunk-6](figures/visintro-unnamed-chunk-6-1.svg)
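
When `plot()` is given a single numeric vector, it draws the values against their index, here the row number, which corresponds to the day. The sketch below spells out an equivalent call with the index made explicit; the axis labels are chosen for illustration.

```r
ozone <- airquality$Ozone

# Index plot: values against their position in the vector
plot(ozone)

# Equivalent call with the index supplied explicitly
plot(seq_along(ozone), ozone, xlab = "Index", ylab = "ozone")
```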

---

# Univariate distributions: strip charts or dot plots

```r
stripchart(airquality$Ozone)
```

![plot of chunk unnamed-chunk-7](figures/visintro-unnamed-chunk-7-1.svg)

---

# Univariate distributions: strip charts or dot plots

```r
stripchart(airquality$Ozone, method = "stack", pch = 16)
```

![plot of chunk unnamed-chunk-8](figures/visintro-unnamed-chunk-8-1.svg)
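
Stacking helps because ozone is recorded in whole numbers, so many values coincide and would otherwise be drawn on top of each other. An alternative is `method = "jitter"`, which adds a small amount of random noise instead. This is a sketch: the jitter amount of `0.2` and the seed are arbitrary choices.

```r
ozone <- airquality$Ozone

# How many observed values repeat an earlier value?
sum(duplicated(ozone[!is.na(ozone)]))

# Jittered strip chart: random displacement separates tied values
set.seed(1)  # make the random jitter reproducible
stripchart(ozone, method = "jitter", jitter = 0.2, pch = 16)
```

Unlike stacking, jittering gives a slightly different picture on each run unless the seed is fixed.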

---