class: center, middle

# Basic Statistical Modeling in R

## Introductory Computer Programming

### Deepayan Sarkar
---
layout: true

# Statistical Models and Tests

---

* Statistical modeling is one of the main uses of Statistics as a subject area

--

* What does this mean, exactly?

    * Model: describes a probability experiment that generates random data

    * Some details of the model remain unknown: the parameters of the model

--

* Application

    * Given real life data along with _questions_ of interest

    * Choose a suitable model and "estimate" its parameters

    * Ultimate goal is to come up with useful _answers_

---

* Statistical models can be quite complicated

* But they are often made using simple building blocks

* Two such important building blocks: ANOVA and linear regression

--

* These models have some properties in common

    * Response or dependent variable (the $Y$ variable)

    * Predictor or independent variable(s)

* Both are _conditional_ probability models

* The $Y$ variable is assumed to be continuous

    * Usually assumed to have a _Normal_ distribution

---
layout: true

# Typical questions

---

* Two fundamental questions are usually important

    * Testing

    * Prediction

* Both involve estimation of the unknown parameters

--

* Typical testing question

    * Does the conditional distribution of $Y$ depend on the predictor $X$?

    * Essentially a question about independence

--

* Typical prediction question

    * How will the response $Y$ behave given the predictor $X$?

    * A question about the conditional distribution of $Y \mid X$

---
layout: true

# Example dataset: ToothGrowth

---

* Built-in dataset available in R

* Each row represents data from one guinea pig

* Response: `len` (length of odontoblasts, cells responsible for tooth growth)

* Main predictor of interest: `dose` (daily dose of vitamin C in milligrams)

* Secondary predictor: `supp` (method by which the dose was administered)

---

.scrollable500[

```r
ToothGrowth
```

```
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   VC  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
11 16.5   VC  1.0
12 16.5   VC  1.0
13 15.2   VC  1.0
14 17.3   VC  1.0
15 22.5   VC  1.0
16 17.3   VC  1.0
17 13.6   VC  1.0
18 14.5   VC  1.0
19 18.8   VC  1.0
20 15.5   VC  1.0
21 23.6   VC  2.0
22 18.5   VC  2.0
23 33.9   VC  2.0
24 25.5   VC  2.0
25 26.4   VC  2.0
26 32.5   VC  2.0
27 26.7   VC  2.0
28 21.5   VC  2.0
29 23.3   VC  2.0
30 29.5   VC  2.0
31 15.2   OJ  0.5
32 21.5   OJ  0.5
33 17.6   OJ  0.5
34  9.7   OJ  0.5
35 14.5   OJ  0.5
36 10.0   OJ  0.5
37  8.2   OJ  0.5
38  9.4   OJ  0.5
39 16.5   OJ  0.5
40  9.7   OJ  0.5
41 19.7   OJ  1.0
42 23.3   OJ  1.0
43 23.6   OJ  1.0
44 26.4   OJ  1.0
45 20.0   OJ  1.0
46 25.2   OJ  1.0
47 25.8   OJ  1.0
48 21.2   OJ  1.0
49 14.5   OJ  1.0
50 27.3   OJ  1.0
51 25.5   OJ  2.0
52 26.4   OJ  2.0
53 22.4   OJ  2.0
54 24.5   OJ  2.0
55 24.8   OJ  2.0
56 30.9   OJ  2.0
57 26.4   OJ  2.0
58 27.3   OJ  2.0
59 29.4   OJ  2.0
60 23.0   OJ  2.0
```

]

---

* Does the dose of vitamin C affect length?

* Plots can often answer such questions very effectively

* Let's start with a scatter plot of length against dose

---

```r
library(ggplot2)
p <- ggplot(ToothGrowth) + facet_wrap(~ supp)
p + geom_point(aes(x = dose, y = len))
```



---

* Clearly length increases with dose (so not independent)

* But we need to make some decisions before formulating a model

    * Does length depend only on dose, or supplement type as well?

    * Is the relationship with dose linear?
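
--

* A quick numerical check (one of several possibilities, using `aggregate()`) can complement the plot by comparing group means

```r
## mean length for each supplement / dose combination
aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
```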
---
layout: true

# Model 1: `len ~ dose`

---

* Let's start with the simplest model (linear function of `dose` only)

```r
fm1 <- lm(len ~ dose, data = ToothGrowth)
fm1
```

```
Call:
lm(formula = len ~ dose, data = ToothGrowth)

Coefficients:
(Intercept)         dose  
      7.422        9.764  
```

---

* To answer the testing question formally: `summary()`

```r
summary(fm1)
```

```
Call:
lm(formula = len ~ dose, data = ToothGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4496 -2.7406 -0.7452  2.8344 10.1139 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.4225     1.2601    5.89 2.06e-07 ***
dose          9.7636     0.9525   10.25 1.23e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.601 on 58 degrees of freedom
Multiple R-squared:  0.6443,  Adjusted R-squared:  0.6382 
F-statistic: 105.1 on 1 and 58 DF,  p-value: 1.233e-14
```

---

* Next question of interest: _prediction_

* This is done using a generic function called `predict()`

* This works for many types of statistical models in R

* For models fit by `lm()`, we need

```r
?predict.lm
```

```r
str(getS3method("predict", "lm"))
```

```
function (object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
    interval = c("none", "confidence", "prediction"), level = 0.95,
    type = c("response", "terms"), terms = NULL, na.action = na.pass,
    pred.var = res.var/weights, weights = 1,
    rankdeficient = c("warnif", "simple", "non-estim", "NA", "NAwarn"),
    tol = 1e-06, verbose = FALSE, ...)
```

---

* Need to specify _which_ values of the predictors we want predictions for

* This is usually a data frame supplied as the `newdata` argument

--

* Here, there are only a few values for which prediction makes sense: the unique values of `dose`

* But we will also include `supp` in our subsequent models

* So let's consider predictions for all unique combinations of `dose` and `supp`

---

* We could make such a data frame easily using `expand.grid()`

```r
expand.grid(supp = c("VC", "OJ"), dose = c(0.5, 1, 2))
```

```
  supp dose
1   VC  0.5
2   OJ  0.5
3   VC  1.0
4   OJ  1.0
5   VC  2.0
6   OJ  2.0
```

---

* But this is not right: ignores that `supp` was originally a _factor_

* Safer way: get unique values from the data itself

```r
with(ToothGrowth, expand.grid(supp = unique(supp), dose = unique(dose)))
```

```
  supp dose
1   VC  0.5
2   OJ  0.5
3   VC  1.0
4   OJ  1.0
5   VC  2.0
6   OJ  2.0
```

---

* Actually `unique()` works nicely on data frames as well

```r
udf <- unique(ToothGrowth[-1])
udf
```

```
   supp dose
1    VC  0.5
11   VC  1.0
21   VC  2.0
31   OJ  0.5
41   OJ  1.0
51   OJ  2.0
```

---

* We can now use `predict()` on this new data frame

```r
predict(fm1, udf)
```

```
       1       11       21       31       41       51 
12.30429 17.18607 26.94964 12.30429 17.18607 26.94964 
```

---

* To do anything useful, we need to associate these predictions with the predictors

* We can do this using the `cbind()` function

```r
udf1 <- cbind(udf, fit = predict(fm1, udf))
udf1
```

```
   supp dose      fit
1    VC  0.5 12.30429
11   VC  1.0 17.18607
21   VC  2.0 26.94964
31   OJ  0.5 12.30429
41   OJ  1.0 17.18607
51   OJ  2.0 26.94964
```

---

* We can now add these predictions to our previous plot

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) # NOTE: different dataset for this layer
```



???

This uses the fact that we can provide a NEW dataset when adding a layer
to an existing ggplot2 plot.

---

* As predictions, these are not particularly good

* But ignoring that aspect, how can we _interpret_ these predictions?

--

* Two possible interpretations:
    * Predicting (or estimating) the average response at these predictor values

    * Predicting what a new _individual_ response might look like

--

* In the first case, we are predicting a parameter (or a parametric function)

* In the second case, we are predicting a random variable

---

* Even in the first case, there is uncertainty due to _sampling_ error

* We usually acknowledge this by providing a _confidence interval_

--

* The `predict()` method gives confidence intervals as follows

```r
udf1conf <- cbind(udf, predict(fm1, udf, interval = "confidence"))
udf1conf
```

```
   supp dose      fit      lwr      upr
1    VC  0.5 12.30429 10.56371 14.04486
11   VC  1.0 17.18607 15.95530 18.41684
21   VC  2.0 26.94964 24.96508 28.93420
31   OJ  0.5 12.30429 10.56371 14.04486
41   OJ  1.0 17.18607 15.95530 18.41684
51   OJ  2.0 26.94964 24.96508 28.93420
```

---

* The `errorbar` geom can be used to display these confidence intervals

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf1conf, width = 0.1, col = 2)
```



---

* We can also ask for a _prediction_ interval

* Accounts for sampling error + intrinsic variability in the model

```r
udf1pred <- cbind(udf, predict(fm1, udf, interval = "prediction"))
udf1pred
```

```
   supp dose      fit       lwr      upr
1    VC  0.5 12.30429  2.931015 21.67756
11   VC  1.0 17.18607  7.893956 26.47819
21   VC  2.0 26.94964 17.528014 36.37127
31   OJ  0.5 12.30429  2.931015 21.67756
41   OJ  1.0 17.18607  7.893956 26.47819
51   OJ  2.0 26.94964 17.528014 36.37127
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf1pred, width = 0.1, col = 2)
```



???

Prediction intervals are larger than the confidence intervals

---
layout: true

# Model 2: `len ~ factor(dose)`

---

* ANOVA models extend two-sample $t$-tests to more than two samples

* The model essentially says that the response has a different mean for each subgroup

* ANOVA models usually assume equal variances in all subgroups

---

* For this example, we have to look at the three subsets defined by `dose`

* The fitting process is exactly the same, except that we make `dose` a factor

```r
fm2 <- lm(len ~ factor(dose), data = ToothGrowth)
```

---

* This leads to slightly confusing output

```r
fm2
```

```
Call:
lm(formula = len ~ factor(dose), data = ToothGrowth)

Coefficients:
  (Intercept)  factor(dose)1  factor(dose)2  
        10.60           9.13          15.50  
```

* Not immediately clear what these coefficients mean

---

* Similarly for `summary()`

```r
summary(fm2)
```

```
Call:
lm(formula = len ~ factor(dose), data = ToothGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6000 -3.2350 -0.6025  3.3250 10.8950 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    10.6050     0.9486  11.180 5.39e-16 ***
factor(dose)1   9.1300     1.3415   6.806 6.70e-09 ***
factor(dose)2  15.4950     1.3415  11.551  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.242 on 57 degrees of freedom
Multiple R-squared:  0.7029,  Adjusted R-squared:  0.6924 
F-statistic: 67.42 on 2 and 57 DF,  p-value: 9.533e-16
```

* Not clear what is being tested

---

* It is better to call the `aov()` function when fitting ANOVA models

```r
fm2 <- aov(len ~ factor(dose), data = ToothGrowth)
fm2
```

```
Call:
   aov(formula = len ~ factor(dose), data = ToothGrowth)

Terms:
                factor(dose) Residuals
Sum of Squares      2426.434  1025.775
Deg. of Freedom            2        57

Residual standard error: 4.242175
Estimated effects may be unbalanced
```

---

* `aov()` internally calls `lm()`, but has a more appropriate summary method

```r
summary(fm2)
```

```
             Df Sum Sq Mean Sq F value   Pr(>F)    
factor(dose)  2   2426    1213   67.42 9.53e-16 ***
Residuals    57   1026      18                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* The `predict()` method works exactly as before

```r
udf2conf <- cbind(udf, predict(fm2, udf, interval = "confidence"))
udf2conf
```

```
   supp dose    fit       lwr     upr
1    VC  0.5 10.605  8.705503 12.5045
11   VC  1.0 19.735 17.835503 21.6345
21   VC  2.0 26.100 24.200503 27.9995
31   OJ  0.5 10.605  8.705503 12.5045
41   OJ  1.0 19.735 17.835503 21.6345
51   OJ  2.0 26.100 24.200503 27.9995
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf2conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf2conf, width = 0.1, col = 2)
```



---

* We can quantify the goodness of fit using the residual sum of squares

```r
sum(residuals(fm1)^2)
```

```
[1] 1227.905
```

```r
sum(residuals(fm2)^2)
```

```
[1] 1025.775
```

---

* A formal test comparing two models can be done using `anova()`

```r
anova(fm1, fm2)
```

```
Analysis of Variance Table

Model 1: len ~ dose
Model 2: len ~ factor(dose)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     58 1227.9                                
2     57 1025.8  1    202.13 11.232 0.001432 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

* Suggests the improvement with the ANOVA model is statistically significant

---
layout: true

# Model 3: `len ~ log(dose)`

---

* We could also have considered a slightly different linear regression model

* Notice that the doses increase in a geometric series, not an arithmetic series

* However, the increase in response from dose 0.5 to 1 is not very different from the increase from dose 1 to 2

---

* We can incorporate this in a model by making `log(dose)` the predictor

```r
fm3 <- lm(len ~ log(dose), data = ToothGrowth)
fm3
```

```
Call:
lm(formula = len ~ log(dose), data = ToothGrowth)

Coefficients:
(Intercept)    log(dose)  
      18.81        11.18  
```

---

* The ANOVA model is _not_ significantly better than this new regression model

```r
anova(fm3, fm2)
```

```
Analysis of Variance Table

Model 1: len ~ log(dose)
Model 2: len ~ factor(dose)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     58 1051.3                           
2     57 1025.8  1    25.484 1.4161  0.239
```

---

* `predict()` continues to work without any changes

* Note that the predictor is no longer a variable in the data, but a function of one

```r
udf3conf <- cbind(udf, predict(fm3, udf, interval = "confidence"))
udf3conf
```

```
   supp dose      fit       lwr      upr
1    VC  0.5 11.06583  9.326279 12.80539
11   VC  1.0 18.81333 17.713142 19.91352
21   VC  2.0 26.56083 24.821279 28.30039
31   OJ  0.5 11.06583  9.326279 12.80539
41   OJ  1.0 18.81333 17.713142 19.91352
51   OJ  2.0 26.56083 24.821279 28.30039
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf3conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf3conf, width = 0.1, col = 2)
```



---
layout: true

# Confidence Intervals Using lattice

---

* Similar confidence interval plots can be created using lattice

* The easiest way is to create the components and add them as layers

```r
library(latticeExtra)
p1 <- xyplot(len ~ dose | supp, data = ToothGrowth,
             jitter.x = TRUE, pch = 16, grid = TRUE)
p2 <- xyplot(fit ~ dose | supp, data = udf3conf, type = "l")
p3 <- segplot(dose ~ lwr + upr | supp, data = udf3conf,
              horizontal = FALSE, col = 2)
panel.errorbar <- function(...)
{
    panel.arrows(..., angle = 90, end = "both")
}
```

---

```r
p1 + p2 + update(p3, segments.fun = panel.errorbar, length = 0.1)
```



---
layout: true

# Combining Multiple Predictors

---

* From the plots, it is clear that `supp` also affects the response

* How can we include the `supp` variable in our model?

--

* There are _six_ unique combinations of `supp` and `dose`

* A simple option: consider the six corresponding subgroups in our model

* This will still be an ANOVA model, with six groups instead of three

---

* We need to create a new variable that combines `supp` and `dose`

* One way to create such a variable is simply by combining them as strings

```r
with(udf, paste(supp, dose, sep = "_"))
```

```
[1] "VC_0.5" "VC_1"   "VC_2"   "OJ_0.5" "OJ_1"   "OJ_2"  
```

---

* The formula in `lm()` can handle functions of predictors directly

* So we don't need to create the new variable in advance

```r
fm4 <- aov(len ~ paste(supp, dose, sep = "_"), data = ToothGrowth)
```

---

* The remaining steps work as before

```r
udf4conf <- cbind(udf, predict(fm4, udf, interval = "confidence"))
udf4conf
```

```
   supp dose   fit       lwr      upr
1    VC  0.5  7.98  5.677691 10.28231
11   VC  1.0 16.77 14.467691 19.07231
21   VC  2.0 26.14 23.837691 28.44231
31   OJ  0.5 13.23 10.927691 15.53231
41   OJ  1.0 22.70 20.397691 25.00231
51   OJ  2.0 26.06 23.757691 28.36231
```

* This is the first example where predictions depend on `supp`

---

* We can plot the confidence intervals as before and verify this

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf4conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf4conf, width = 0.1, col = 2)
```



---

* Is this a significant improvement over the previous ANOVA model?

```r
anova(fm2, fm4)
```

```
Analysis of Variance Table

Model 1: len ~ factor(dose)
Model 2: len ~ paste(supp, dose, sep = "_")
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     57 1025.78                                  
2     54  712.11  3    313.67 7.9287 0.0001794 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* This kind of model (all combinations of the factors) is known as an interaction model

* They are very common, and can be specified in a much simpler way

```r
fm5 <- aov(len ~ supp * factor(dose), data = ToothGrowth)
```

---

* We get the same result using this model in the previous `anova()` call

```r
anova(fm2, fm5)
```

```
Analysis of Variance Table

Model 1: len ~ factor(dose)
Model 2: len ~ supp * factor(dose)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     57 1025.78                                  
2     54  712.11  3    313.67 7.9287 0.0001794 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* Additive models: the _effects_ of `supp` and `dose` are additive

* Also known as an (additive) two-way ANOVA model

```r
fm6 <- aov(len ~ supp + factor(dose), data = ToothGrowth)
```

---

* Comparing with the interaction model:

```r
anova(fm6, fm5)
```

```
Analysis of Variance Table

Model 1: len ~ supp + factor(dose)
Model 2: len ~ supp * factor(dose)
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
1     56 820.43                             
2     54 712.11  2    108.32 4.107 0.02186 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* This test is also reported by calling `summary()` on the interaction model

```r
summary(fm5)
```

```
                  Df Sum Sq Mean Sq F value   Pr(>F)    
supp               1  205.4   205.4  15.572 0.000231 ***
factor(dose)       2 2426.4  1213.2  92.000  < 2e-16 ***
supp:factor(dose)  2  108.3    54.2   4.107 0.021860 *  
Residuals         54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
layout: false

# Demo

* Additive and Interaction Models for Regression using `NHANES` data
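
* As a rough preview, the steps mirror the ToothGrowth analysis; the sketch below is illustrative only (it assumes the `NHANES` package and its `NHANES` data frame with variables `Height`, `Age`, and `Gender`)

```r
## Illustrative sketch (not the actual demo code): an additive and an
## interaction regression model, compared with anova() as done above
library(NHANES)                                      # assumed to be installed
fm_add <- lm(Height ~ Age + Gender, data = NHANES)   # additive model
fm_int <- lm(Height ~ Age * Gender, data = NHANES)   # interaction model
anova(fm_add, fm_int)  # is the interaction a significant improvement?
```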