class: center, middle

# Basic Statistical Modeling in R

## Introductory Computer Programming

### Deepayan Sarkar
---
layout: true

# Statistical Models and Tests

---

* Statistical modeling is one of the main uses of Statistics as a subject area

--

* What does this mean, exactly?

    * Model: describes a probability experiment that generates random data

    * Some details of the model remain unknown: the parameters of the model

--

* Application

    * Given real life data along with _questions_ of interest

    * Choose a suitable model and "estimate" its parameters

    * Ultimate goal is to come up with useful _answers_

---

* Statistical models can be quite complicated

* But they are often made using simple building blocks

* Two such important building blocks: ANOVA and linear regression

--

* These models have some properties in common

    * Response or dependent variable (the $Y$ variable)

    * Predictor or independent variable(s)

* Both are _conditional_ probability models

* The $Y$ variable is assumed to be continuous

    * Usually assumed to have a _Normal_ distribution

---
layout: true

# Typical questions

---

* Two fundamental questions are usually important

    * Testing

    * Prediction

* Both involve estimation of the unknown parameters

--

* Typical testing question

    * Does the conditional distribution of $Y$ depend on the predictor $X$?

    * Essentially a question about independence

--

* Typical prediction question

    * How will the response $Y$ behave given the predictor $X$?

    * A question about the conditional distribution of $Y \mid X$

---
layout: true

# Example dataset: ToothGrowth

---

* Built-in dataset available in R

* Each row represents data from one guinea pig

* Response: `len` (length of odontoblasts, cells responsible for tooth growth)

* Main predictor of interest: `dose` (daily dose of vitamin C in milligrams)

* Secondary predictor: `supp` (method by which the dose was administered)

---

.scrollable500[

```r
ToothGrowth
```

```
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   VC  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
11 16.5   VC  1.0
12 16.5   VC  1.0
13 15.2   VC  1.0
14 17.3   VC  1.0
15 22.5   VC  1.0
16 17.3   VC  1.0
17 13.6   VC  1.0
18 14.5   VC  1.0
19 18.8   VC  1.0
20 15.5   VC  1.0
21 23.6   VC  2.0
22 18.5   VC  2.0
23 33.9   VC  2.0
24 25.5   VC  2.0
25 26.4   VC  2.0
26 32.5   VC  2.0
27 26.7   VC  2.0
28 21.5   VC  2.0
29 23.3   VC  2.0
30 29.5   VC  2.0
31 15.2   OJ  0.5
32 21.5   OJ  0.5
33 17.6   OJ  0.5
34  9.7   OJ  0.5
35 14.5   OJ  0.5
36 10.0   OJ  0.5
37  8.2   OJ  0.5
38  9.4   OJ  0.5
39 16.5   OJ  0.5
40  9.7   OJ  0.5
41 19.7   OJ  1.0
42 23.3   OJ  1.0
43 23.6   OJ  1.0
44 26.4   OJ  1.0
45 20.0   OJ  1.0
46 25.2   OJ  1.0
47 25.8   OJ  1.0
48 21.2   OJ  1.0
49 14.5   OJ  1.0
50 27.3   OJ  1.0
51 25.5   OJ  2.0
52 26.4   OJ  2.0
53 22.4   OJ  2.0
54 24.5   OJ  2.0
55 24.8   OJ  2.0
56 30.9   OJ  2.0
57 26.4   OJ  2.0
58 27.3   OJ  2.0
59 29.4   OJ  2.0
60 23.0   OJ  2.0
```

]

---

* Does the dose of vitamin C affect length?

* Plots can often answer such questions very effectively

* Let's start with a scatter plot of length against dose

---

```r
library(ggplot2)
p <- ggplot(ToothGrowth) + facet_wrap(~ supp)
p + geom_point(aes(x = dose, y = len))
```



---

* Clearly length increases with dose (so not independent)

* But we need to make some decisions before formulating a model

    * Does length depend only on dose, or supplement type as well?

    * Is the relationship with dose linear?
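
--

* A quick numerical check (one of several possibilities, using `aggregate()`) can complement the plot by comparing group means

```r
## mean length for each supplement / dose combination
aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
```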
---
layout: true

# Model 1: `len ~ dose`

---

* Let's start with the simplest model (linear function of `dose` only)

```r
fm1 <- lm(len ~ dose, data = ToothGrowth)
fm1
```

```
Call:
lm(formula = len ~ dose, data = ToothGrowth)

Coefficients:
(Intercept)         dose  
      7.422        9.764  
```

---

* To answer the testing question formally: `summary()`

```r
summary(fm1)
```

```
Call:
lm(formula = len ~ dose, data = ToothGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.4496 -2.7406 -0.7452  2.8344 10.1139 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.4225     1.2601    5.89 2.06e-07 ***
dose          9.7636     0.9525   10.25 1.23e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.601 on 58 degrees of freedom
Multiple R-squared:  0.6443,  Adjusted R-squared:  0.6382 
F-statistic: 105.1 on 1 and 58 DF,  p-value: 1.233e-14
```

---

* Next question of interest: _prediction_

* This is done using a generic function called `predict()`

* This works for many types of statistical models in R

* For models fit by `lm()`, we need

```r
?predict.lm
```

```r
str(getS3method("predict", "lm"))
```

```
function (object, newdata, se.fit = FALSE, scale = NULL, df = Inf,
    interval = c("none", "confidence", "prediction"), level = 0.95,
    type = c("response", "terms"), terms = NULL, na.action = na.pass,
    pred.var = res.var/weights, weights = 1,
    rankdeficient = c("warnif", "simple", "non-estim", "NA", "NAwarn"),
    tol = 1e-06, verbose = FALSE, ...)
```

---

* Need to specify _which_ values of the predictors we want predictions for

* This is usually a data frame supplied as the `newdata` argument

--

* Here, there are only a few values for which prediction makes sense: the unique values of `dose`

* But we will also include `supp` in our subsequent models

* So let's consider predictions for all unique combinations of `dose` and `supp`

---

* We could make such a data frame easily using `expand.grid()`

```r
expand.grid(supp = c("VC", "OJ"), dose = c(0.5, 1, 2))
```

```
  supp dose
1   VC  0.5
2   OJ  0.5
3   VC  1.0
4   OJ  1.0
5   VC  2.0
6   OJ  2.0
```

---

* But this is not right: ignores that `supp` was originally a _factor_

* Safer way: get unique values from the data itself

```r
with(ToothGrowth, expand.grid(supp = unique(supp), dose = unique(dose)))
```

```
  supp dose
1   VC  0.5
2   OJ  0.5
3   VC  1.0
4   OJ  1.0
5   VC  2.0
6   OJ  2.0
```

---

* Actually `unique()` works nicely on data frames as well

```r
udf <- unique(ToothGrowth[-1])
udf
```

```
   supp dose
1    VC  0.5
11   VC  1.0
21   VC  2.0
31   OJ  0.5
41   OJ  1.0
51   OJ  2.0
```

---

* We can now use `predict()` on this new data frame

```r
predict(fm1, udf)
```

```
       1       11       21       31       41       51 
12.30429 17.18607 26.94964 12.30429 17.18607 26.94964 
```

---

* To do anything useful, we need to associate these predictions with the predictors

* We can do this using the `cbind()` function

```r
udf1 <- cbind(udf, fit = predict(fm1, udf))
udf1
```

```
   supp dose      fit
1    VC  0.5 12.30429
11   VC  1.0 17.18607
21   VC  2.0 26.94964
31   OJ  0.5 12.30429
41   OJ  1.0 17.18607
51   OJ  2.0 26.94964
```

---

* We can now add these predictions to our previous plot

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) # NOTE: different dataset for this layer
```



???

This uses the fact that we can provide a NEW dataset when adding a layer
to an existing ggplot2 plot.

---

* As predictions, these are not particularly good

* But ignoring that aspect, how can we _interpret_ these predictions?

--

* Two possible interpretations:
    * Predicting (or estimating) the average response at these predictor values

    * Predicting what a new _individual_ response might look like

--

* In the first case, we are predicting a parameter (or a parametric function)

* In the second case, we are predicting a random variable

---

* Even in the first case, there is uncertainty due to _sampling_ error

* We usually acknowledge this by providing a _confidence interval_

--

* The `predict()` method gives confidence intervals as follows

```r
udf1conf <- cbind(udf, predict(fm1, udf, interval = "confidence"))
udf1conf
```

```
   supp dose      fit      lwr      upr
1    VC  0.5 12.30429 10.56371 14.04486
11   VC  1.0 17.18607 15.95530 18.41684
21   VC  2.0 26.94964 24.96508 28.93420
31   OJ  0.5 12.30429 10.56371 14.04486
41   OJ  1.0 17.18607 15.95530 18.41684
51   OJ  2.0 26.94964 24.96508 28.93420
```

---

* The `errorbar` geom can be used to display these confidence intervals

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf1conf, width = 0.1, col = 2)
```



---

* We can also ask for a _prediction_ interval

* Accounts for sampling error + intrinsic variability in the model

```r
udf1pred <- cbind(udf, predict(fm1, udf, interval = "prediction"))
udf1pred
```

```
   supp dose      fit       lwr      upr
1    VC  0.5 12.30429  2.931015 21.67756
11   VC  1.0 17.18607  7.893956 26.47819
21   VC  2.0 26.94964 17.528014 36.37127
31   OJ  0.5 12.30429  2.931015 21.67756
41   OJ  1.0 17.18607  7.893956 26.47819
51   OJ  2.0 26.94964 17.528014 36.37127
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf1) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf1pred, width = 0.1, col = 2)
```



???

Prediction intervals are larger than the confidence intervals

---
layout: true

# Model 2: `len ~ factor(dose)`

---

* ANOVA models extend two-sample $t$-tests to more than two samples

* The model essentially says that the response has a different mean for each subgroup

* ANOVA models usually assume equal variances in all subgroups

---

* For this example, we have to look at the three subsets defined by `dose`

* The fitting process is exactly the same, except that we make `dose` a factor

```r
fm2 <- lm(len ~ factor(dose), data = ToothGrowth)
```

---

* This leads to slightly confusing output

```r
fm2
```

```
Call:
lm(formula = len ~ factor(dose), data = ToothGrowth)

Coefficients:
  (Intercept)  factor(dose)1  factor(dose)2  
        10.60           9.13          15.50  
```

* Not immediately clear what these coefficients mean

---

* Similarly for `summary()`

```r
summary(fm2)
```

```
Call:
lm(formula = len ~ factor(dose), data = ToothGrowth)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6000 -3.2350 -0.6025  3.3250 10.8950 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    10.6050     0.9486  11.180 5.39e-16 ***
factor(dose)1   9.1300     1.3415   6.806 6.70e-09 ***
factor(dose)2  15.4950     1.3415  11.551  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.242 on 57 degrees of freedom
Multiple R-squared:  0.7029,  Adjusted R-squared:  0.6924 
F-statistic: 67.42 on 2 and 57 DF,  p-value: 9.533e-16
```

* Not clear what is being tested

---

* It is better to call the `aov()` function when fitting ANOVA models

```r
fm2 <- aov(len ~ factor(dose), data = ToothGrowth)
fm2
```

```
Call:
   aov(formula = len ~ factor(dose), data = ToothGrowth)

Terms:
                factor(dose) Residuals
Sum of Squares      2426.434  1025.775
Deg. of Freedom            2        57

Residual standard error: 4.242175
Estimated effects may be unbalanced
```

---

* `aov()` internally calls `lm()`, but has a more appropriate summary method

```r
summary(fm2)
```

```
             Df Sum Sq Mean Sq F value   Pr(>F)    
factor(dose)  2   2426    1213   67.42 9.53e-16 ***
Residuals    57   1026      18                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* The `predict()` method works exactly as before

```r
udf2conf <- cbind(udf, predict(fm2, udf, interval = "confidence"))
udf2conf
```

```
   supp dose    fit       lwr     upr
1    VC  0.5 10.605  8.705503 12.5045
11   VC  1.0 19.735 17.835503 21.6345
21   VC  2.0 26.100 24.200503 27.9995
31   OJ  0.5 10.605  8.705503 12.5045
41   OJ  1.0 19.735 17.835503 21.6345
51   OJ  2.0 26.100 24.200503 27.9995
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf2conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf2conf, width = 0.1, col = 2)
```



---

* We can quantify the goodness of fit using the residual sum of squares

```r
sum(residuals(fm1)^2)
```

```
[1] 1227.905
```

```r
sum(residuals(fm2)^2)
```

```
[1] 1025.775
```

---

* A formal test comparing two models can be done using `anova()`

```r
anova(fm1, fm2)
```

```
Analysis of Variance Table

Model 1: len ~ dose
Model 2: len ~ factor(dose)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     58 1227.9                                
2     57 1025.8  1    202.13 11.232 0.001432 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

* Suggests the improvement with the ANOVA model is statistically significant

---
layout: true

# Model 3: `len ~ log(dose)`

---

* We could also have considered a slightly different linear regression model

* Notice that the doses increase in a geometric series, not an arithmetic series

* However, the increase in response from dose 0.5 to 1 is not very different from the increase from dose 1 to 2

---

* We can incorporate this in a model by making `log(dose)` the predictor

```r
fm3 <- lm(len ~ log(dose), data = ToothGrowth)
fm3
```

```
Call:
lm(formula = len ~ log(dose), data = ToothGrowth)

Coefficients:
(Intercept)    log(dose)  
      18.81        11.18  
```

---

* The ANOVA model is _not_ significantly better than this new regression model

```r
anova(fm3, fm2)
```

```
Analysis of Variance Table

Model 1: len ~ log(dose)
Model 2: len ~ factor(dose)
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     58 1051.3                           
2     57 1025.8  1    25.484 1.4161  0.239
```

---

* `predict()` continues to work without any changes

* Note that the predictor is no longer a variable in the data, but a function of one

```r
udf3conf <- cbind(udf, predict(fm3, udf, interval = "confidence"))
udf3conf
```

```
   supp dose      fit       lwr      upr
1    VC  0.5 11.06583  9.326279 12.80539
11   VC  1.0 18.81333 17.713142 19.91352
21   VC  2.0 26.56083 24.821279 28.30039
31   OJ  0.5 11.06583  9.326279 12.80539
41   OJ  1.0 18.81333 17.713142 19.91352
51   OJ  2.0 26.56083 24.821279 28.30039
```

---

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf3conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf3conf, width = 0.1, col = 2)
```



---
layout: true

# Confidence Intervals Using lattice

---

* Similar confidence interval plots can be created using lattice

* The easiest way is to create the components and add them as layers

```r
library(latticeExtra)
p1 <- xyplot(len ~ dose | supp, data = ToothGrowth,
             jitter.x = TRUE, pch = 16, grid = TRUE)
p2 <- xyplot(fit ~ dose | supp, data = udf3conf, type = "l")
p3 <- segplot(dose ~ lwr + upr | supp, data = udf3conf,
              horizontal = FALSE, col = 2)
panel.errorbar <- function(...)
{
    panel.arrows(..., angle = 90, end = "both")
}
```

---

```r
p1 + p2 + update(p3, segments.fun = panel.errorbar, length = 0.1)
```



---
layout: true

# Combining Multiple Predictors

---

* From the plots, it is clear that `supp` also affects the response

* How can we include the `supp` variable in our model?

--

* There are _six_ unique combinations of `supp` and `dose`

* A simple option: consider the six corresponding subgroups in our model

* This will still be an ANOVA model, with six groups instead of three

---

* We need to create a new variable that combines `supp` and `dose`

* One way to create such a variable is simply by combining them as strings

```r
with(udf, paste(supp, dose, sep = "_"))
```

```
[1] "VC_0.5" "VC_1"   "VC_2"   "OJ_0.5" "OJ_1"   "OJ_2"  
```

---

* The formula in `lm()` can handle functions of predictors directly

* So we don't need to create the new variable in advance

```r
fm4 <- aov(len ~ paste(supp, dose, sep = "_"), data = ToothGrowth)
```

---

* The remaining steps work as before

```r
udf4conf <- cbind(udf, predict(fm4, udf, interval = "confidence"))
udf4conf
```

```
   supp dose   fit       lwr      upr
1    VC  0.5  7.98  5.677691 10.28231
11   VC  1.0 16.77 14.467691 19.07231
21   VC  2.0 26.14 23.837691 28.44231
31   OJ  0.5 13.23 10.927691 15.53231
41   OJ  1.0 22.70 20.397691 25.00231
51   OJ  2.0 26.06 23.757691 28.36231
```

* This is the first example where predictions depend on `supp`

---

* We can plot the confidence intervals as before and verify this

```r
p + geom_point(aes(x = dose, y = len)) +
    geom_line(aes(x = dose, y = fit), data = udf4conf) +
    geom_errorbar(aes(x = dose, ymin = lwr, ymax = upr),
                  data = udf4conf, width = 0.1, col = 2)
```



---

* Is this a significant improvement over the previous ANOVA model?

```r
anova(fm2, fm4)
```

```
Analysis of Variance Table

Model 1: len ~ factor(dose)
Model 2: len ~ paste(supp, dose, sep = "_")
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     57 1025.78                                  
2     54  712.11  3    313.67 7.9287 0.0001794 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* This kind of model (all combinations of the factors) is known as an interaction model

* They are very common, and can be specified in a much simpler way

```r
fm5 <- aov(len ~ supp * factor(dose), data = ToothGrowth)
```

---

* We get the same result using this model in the previous `anova()` call

```r
anova(fm2, fm5)
```

```
Analysis of Variance Table

Model 1: len ~ factor(dose)
Model 2: len ~ supp * factor(dose)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     57 1025.78                                  
2     54  712.11  3    313.67 7.9287 0.0001794 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* Additive models: the _effects_ of `supp` and `dose` are additive

* Also known as an (additive) two-way ANOVA model

```r
fm6 <- aov(len ~ supp + factor(dose), data = ToothGrowth)
```

---

* Comparing with the interaction model:

```r
anova(fm6, fm5)
```

```
Analysis of Variance Table

Model 1: len ~ supp + factor(dose)
Model 2: len ~ supp * factor(dose)
  Res.Df    RSS Df Sum of Sq     F  Pr(>F)  
1     56 820.43                             
2     54 712.11  2    108.32 4.107 0.02186 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

* This test is also reported by calling `summary()` on the interaction model

```r
summary(fm5)
```

```
                  Df Sum Sq Mean Sq F value   Pr(>F)    
supp               1  205.4   205.4  15.572 0.000231 ***
factor(dose)       2 2426.4  1213.2  92.000  < 2e-16 ***
supp:factor(dose)  2  108.3    54.2   4.107 0.021860 *  
Residuals         54  712.1    13.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
layout: false

# Demo

* Additive and Interaction Models for Regression using `NHANES` data
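
* As a rough preview, the steps mirror the ToothGrowth analysis; the sketch below is illustrative only (it assumes the `NHANES` package and its `NHANES` data frame with variables `Height`, `Age`, and `Gender`)

```r
## Illustrative sketch (not the actual demo code): an additive and an
## interaction regression model, compared with anova() as done above
library(NHANES)                                      # assumed to be installed
fm_add <- lm(Height ~ Age + Gender, data = NHANES)   # additive model
fm_int <- lm(Height ~ Age * Gender, data = NHANES)   # interaction model
anova(fm_add, fm_int)  # is the interaction a significant improvement?
```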