Workflow: multiple datasets and global environment
Variable scope and common “non-standard evaluation” approaches
Data manipulation
Common analysis methods:
Variables are symbols associated with some value (technically called a binding)
These symbols are stored in a look-up table
Each variable is stored in a container called an “environment”
Variables defined in an R session are stored in a special environment called the “global environment”
This is also known as the “global workspace”
Let us define some variables in the workspace
treat <- read.table("data/treat-1.txt", header = TRUE, stringsAsFactors = FALSE)
ae <- read.table("data/ae.txt", header = TRUE, stringsAsFactors = FALSE)
str(treat)
'data.frame': 70 obs. of 2 variables:
$ subjid: int 101 102 103 104 105 106 107 108 109 110 ...
$ trtcd : int 1 0 0 1 0 0 1 1 0 1 ...
'data.frame': 20 obs. of 5 variables:
$ subjid : int 101 101 102 102 103 103 103 115 115 116 ...
$ aerel : int 1 2 2 1 1 1 2 3 3 2 ...
$ aesev : int 1 1 2 1 1 2 2 2 1 1 ...
$ aebodsys: chr "Cardiac disorders" "Gastrointestinal disorders" "Cardiac disorders" "Psychiatric disorders" ...
$ aedecod : chr "Atrial flutter" "Constipation" "Cardiac failure" "Delirium" ...
Note that the two datasets are connected by a common variable subjid
Add one more dataset, using the sas7bdat
package
This is an experimental package to read files stored in the (undocumented) SAS7BDAT format
library(package = "sas7bdat")
asthma <- read.sas7bdat("sasdata/twosample.sas7bdat")
str(asthma, give.attr = FALSE)
'data.frame': 24 obs. of 4 variables:
$ PATNO: num 101 103 106 108 109 110 113 116 118 120 ...
$ FEV0 : num 1.35 3.22 2.78 2.45 1.84 2.81 1.9 3 2.25 2.86 ...
$ FEV6 : num NaN 3.55 3.15 2.3 2.37 3.2 2.65 3.96 2.97 2.28 ...
$ TRT : Factor w/ 2 levels "ABC-123","PLACEBO": 1 1 1 1 1 1 1 1 1 1 ...
ls()
[1] "ae" "asthma" "treat"
[1] 3.141593
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
[1] 3.142857
[1] "ae" "asthma" "pi" "treat"
R actually looks for variables in several environments sequentially (scope)
The list of these environments, known as the “search path”, is given by search()
[1] ".GlobalEnv" "package:sas7bdat" "package:lattice" "package:stats" "package:graphics"
[6] "package:grDevices" "package:utils" "package:datasets" "package:methods" "Autoloads"
[11] "package:base"
These environments usually correspond to the various R packages currently loaded
Packages mostly contain functions (which are also bindings) and datasets
A variable is matched to the first occurrence in this list
Possible bindings of a variable can be found using find()
[1] "package:stats"
[1] ".GlobalEnv" "package:base"
[1] 3.142857
[1] 3.141593
Datasets are also conceptually a collection of variables
It is an attractive idea to make them part of the search path
Error in unique(subjid): object 'subjid' not found
[1] ".GlobalEnv" "treat" "package:sas7bdat" "package:lattice" "package:stats"
[6] "package:graphics" "package:grDevices" "package:utils" "package:datasets" "package:methods"
[11] "Autoloads" "package:base"
[1] "treat"
[1] 70
The following object is masked from treat:
subjid
[1] ".GlobalEnv" "ae" "treat" "package:sas7bdat" "package:lattice"
[6] "package:stats" "package:graphics" "package:grDevices" "package:utils" "package:datasets"
[11] "package:methods" "Autoloads" "package:base"
[1] "ae" "treat"
[1] 14
[1] 70
But this can easily become tedious
Example: t-test comparing FEV1 change (0-6 weeks) by treatment
t.test(x = (asthma$FEV6 - asthma$FEV0)[asthma$TRT == "PLACEBO"],
y = (asthma$FEV6 - asthma$FEV0)[asthma$TRT == "ABC-123"],
var.equal = TRUE)
Two Sample t-test
data: (asthma$FEV6 - asthma$FEV0)[asthma$TRT == "PLACEBO"] and (asthma$FEV6 - asthma$FEV0)[asthma$TRT == "ABC-123"]
t = -0.97362, df = 19, p-value = 0.3425
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6889311 0.2514766
sample estimates:
mean of x mean of y
0.2840000 0.5027273
To avoid such tedious use, R offers something known as “non-standard” evaluation
This is an idea that appears in different forms in different functions
The essential idea is that functions are provided a dataset which is temporarily attached during evaluation
General purpose functions: with()
and within()
Data manipulation functions: transform()
, subset()
(see also the popular package dplyr)
Used in many contexts: Formula interface (graphics, tests, model fitting)
Basic working knowledge of all these is essential to use R well
with()
functionThe simplest of the NSE functions: with(data, expr)
Makes variables in data
temporarily available in scope
Evaluates the expression given by expr
Returns result of evaluation
with(asthma, t.test(x = (FEV6 - FEV0)[TRT == "PLACEBO"],
y = (FEV6 - FEV0)[TRT == "ABC-123"],
var.equal = TRUE))
Two Sample t-test
data: (FEV6 - FEV0)[TRT == "PLACEBO"] and (FEV6 - FEV0)[TRT == "ABC-123"]
t = -0.97362, df = 19, p-value = 0.3425
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6889311 0.2514766
sample estimates:
mean of x mean of y
0.2840000 0.5027273
Why is this “non-standard” evaluation?
The call is not equivalent to
e <- t.test(x = (FEV6 - FEV0)[TRT == "PLACEBO"], # gives error
y = (FEV6 - FEV0)[TRT == "ABC-123"],
var.equal = TRUE)
with(asthma, e)
with()
treats its expr
argument as an unevaluated expressiontransform()
functiontransform()
is used to modify or add new variables in a dataset
Suppose we want to add a variable CHG
recording change in FEV1 over the treatment period
We can do this directly as
transform()
is'data.frame': 24 obs. of 5 variables:
$ PATNO: num 101 103 106 108 109 110 113 116 118 120 ...
$ FEV0 : num 1.35 3.22 2.78 2.45 1.84 2.81 1.9 3 2.25 2.86 ...
$ FEV6 : num NaN 3.55 3.15 2.3 2.37 3.2 2.65 3.96 2.97 2.28 ...
$ TRT : Factor w/ 2 levels "ABC-123","PLACEBO": 1 1 1 1 1 1 1 1 1 1 ...
$ CHG : num NaN 0.33 0.37 -0.15 0.53 ...
within()
functionwithin()
is similar, but used for more complex instructions
The dataset returned contains all modified or newly created variables
within(asthma, # not assigned, so auto-printed ('asthma' remains unchanged)
{
FEV6[!is.finite(FEV6)] <- NA # Change NaN to NA
FEV0[!is.finite(FEV0)] <- NA # Note that CHG, already having NaN, not updated
PATNO <- as.character(PATNO) # change integer to character
STRT <- as.character(TRT) # new variable with treatment as character
})
PATNO FEV0 FEV6 TRT CHG STRT
1 101 1.35 NA ABC-123 NaN ABC-123
2 103 3.22 3.55 ABC-123 0.33 ABC-123
3 106 2.78 3.15 ABC-123 0.37 ABC-123
4 108 2.45 2.30 ABC-123 -0.15 ABC-123
5 109 1.84 2.37 ABC-123 0.53 ABC-123
6 110 2.81 3.20 ABC-123 0.39 ABC-123
7 113 1.90 2.65 ABC-123 0.75 ABC-123
8 116 3.00 3.96 ABC-123 0.96 ABC-123
9 118 2.25 2.97 ABC-123 0.72 ABC-123
10 120 2.86 2.28 ABC-123 -0.58 ABC-123
11 121 1.56 2.67 ABC-123 1.11 ABC-123
12 124 2.66 3.76 ABC-123 1.10 ABC-123
13 102 3.01 3.90 PLACEBO 0.89 PLACEBO
14 104 2.24 3.01 PLACEBO 0.77 PLACEBO
15 105 2.25 2.47 PLACEBO 0.22 PLACEBO
16 107 1.65 1.99 PLACEBO 0.34 PLACEBO
17 111 1.95 NA PLACEBO NaN PLACEBO
18 112 3.05 3.26 PLACEBO 0.21 PLACEBO
19 114 2.75 2.55 PLACEBO -0.20 PLACEBO
20 115 1.60 2.20 PLACEBO 0.60 PLACEBO
21 117 2.77 2.56 PLACEBO -0.21 PLACEBO
22 119 2.06 2.90 PLACEBO 0.84 PLACEBO
23 122 1.71 NA PLACEBO NaN PLACEBO
24 123 3.54 2.92 PLACEBO -0.62 PLACEBO
subset()
functionasthma.placebo <- subset(asthma, TRT == "PLACEBO" & is.finite(CHG))
asthma.abc123 <- subset(asthma, TRT == "ABC-123" & is.finite(CHG))
t.test(asthma.placebo$CHG, asthma.abc123$CHG, var.equal = TRUE)
Two Sample t-test
data: asthma.placebo$CHG and asthma.abc123$CHG
t = -0.97362, df = 19, p-value = 0.3425
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6889311 0.2514766
sample estimates:
mean of x mean of y
0.2840000 0.5027273
split()
functionDoes not use non-standard evaluation, but extremely useful for data manipulation
Splits one column (or full data frame) into parts by value of one or more categorical columns
Example: split CHG
variable by TRT
List of 2
$ ABC-123: num [1:12] NaN 0.33 0.37 -0.15 0.53 ...
$ PLACEBO: num [1:12] 0.89 0.77 0.22 0.34 NaN 0.21 -0.2 0.6 -0.21 0.84 ...
Welch Two Sample t-test
data: PLACEBO and ABC-123
t = -0.97477, df = 18.888, p-value = 0.342
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.6885676 0.2511130
sample estimates:
mean of x mean of y
0.2840000 0.5027273
A much cleaner alternative to all these is the formula-data interface in t.test()
Such calls usually have a “formula” (containing variable names) and a dataset as the first two arguments
The formula is interpreted in the context of the function
Welch Two Sample t-test
data: CHG by TRT
t = 0.97477, df = 18.888, p-value = 0.342
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.2511130 0.6885676
sample estimates:
mean in group ABC-123 mean in group PLACEBO
0.5027273 0.2840000
within()
to convert categorical variables to factors with descriptive labelsdemog <- within(demog,
{
trt = factor(trt, levels = c(1, 0), labels = c("Active", "Placebo"))
gender = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
race = factor(race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
})
str(demog)
'data.frame': 60 obs. of 5 variables:
$ subjid: int 101 102 103 104 105 106 201 202 203 204 ...
$ trt : Factor w/ 2 levels "Active","Placebo": 2 1 1 2 1 2 1 2 1 2 ...
$ gender: Factor w/ 2 levels "Male","Female": 1 2 1 2 1 2 1 2 1 2 ...
$ race : Factor w/ 3 levels "White","Black",..: 3 1 2 1 3 1 3 1 2 1 ...
$ age : int 37 65 32 23 44 49 35 50 49 60 ...
t.test()
will only work if the right-hand side takes exactly two values
Two Sample t-test
data: age by trt
t = 0.36624, df = 58, p-value = 0.7155
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-5.588158 8.090939
sample estimates:
mean in group Active mean in group Placebo
51.35484 50.10345
Error in t.test.formula(age ~ race, data = demog, var.equal = TRUE): grouping factor must have exactly 2 levels
lm()
Call:
lm(formula = age ~ trt, data = demog)
Coefficients:
(Intercept) trtPlacebo
51.355 -1.251
Call:
lm(formula = age ~ race, data = demog)
Coefficients:
(Intercept) raceBlack raceOther
52.722 -4.333 -6.722
anova()
Analysis of Variance Table
Response: age
Df Sum Sq Mean Sq F value Pr(>F)
trt 1 23.5 23.464 0.1341 0.7155
Residuals 58 10145.8 174.927
Analysis of Variance Table
Response: age
Df Sum Sq Mean Sq F value Pr(>F)
race 2 375.8 187.88 1.0935 0.342
Residuals 57 9793.5 171.82
library(package = "lattice")
xyplot(TRT ~ CHG, data = asthma) # can visually check equal variance assumption
lm()
is an important function that deserves detailed discussion
A linear model is any model that can be expressed in the form
y = Xβ + ε
Very flexible model incorporating ANOVA and linear regression (including some spline models)
The role of the formula in lm(formula, data)
is to specify y and X in terms of variables in data
Basic form: y ~ model
, where y
is the name of the response variable
model
consists of a series of terms separated by +
Each term is usually a predictor by itself (main effect), or an interaction
Terms can be functions of variables, possibly a matrix (e.g., log(x)
, poly(x, 2)
)
Interactions are defined using a :
(e.g., a:b
)
The intercept term is represented by 1
In R, each term is a symbolic representation, not the actual columns of X
Each term gets expanded into one or more columns in X (using model.matrix()
)
Some shortcuts:
y ~ a * b
is equivalent to y ~ a + b + a:b
y ~ a * b * c
is equivalent to y ~ a + b + c + a:b + b:c + c:a + a:b:c
y ~ (a + b + c)^2
is equivalent to y ~ a + b + c + a:b + b:c + c:a
y ~ a * b * c - a:b:c
is also equivalent to y ~ a + b + c + a:b + b:c + c:a
y ~ x
is equivalent to y ~ 1 + x
(intercept term is implied)
y ~ x - 1
and y ~ 0 + x
explicitly removes the intercept term
See help(formula)
for more details
R converts each model specification into a model matrix X
Numeric predictors are retained as they are
Categorical predictors are converted using a full rank re-parameterization (which can be customized)
By default, the first column of the dummy variable matrix is omitted
Remaining coefficients represent “effect compared to first (omitted) level”
(Intercept) age
1 1 37
2 1 65
3 1 32
4 1 23
5 1 44
6 1 49
7 1 35
8 1 50
9 1 49
10 1 60
11 1 39
12 1 67
13 1 70
14 1 55
15 1 65
16 1 45
17 1 36
18 1 46
19 1 44
20 1 77
21 1 45
22 1 59
23 1 49
24 1 33
25 1 33
26 1 44
27 1 64
28 1 56
29 1 73
30 1 46
31 1 44
32 1 53
33 1 45
34 1 65
35 1 43
36 1 39
37 1 50
38 1 30
39 1 33
40 1 65
41 1 57
42 1 56
43 1 67
44 1 46
45 1 72
46 1 29
47 1 65
48 1 46
49 1 60
50 1 28
51 1 44
52 1 66
53 1 46
54 1 75
55 1 46
56 1 55
57 1 57
58 1 63
59 1 61
60 1 49
attr(,"assign")
[1] 0 1
(Intercept) age genderFemale
1 1 37 0
2 1 65 1
3 1 32 0
4 1 23 1
5 1 44 0
6 1 49 1
7 1 35 0
8 1 50 1
9 1 49 0
10 1 60 1
11 1 39 0
12 1 67 1
13 1 70 0
14 1 55 0
15 1 65 0
16 1 45 0
17 1 36 0
18 1 46 0
19 1 44 1
20 1 77 1
21 1 45 0
22 1 59 0
23 1 49 1
24 1 33 0
25 1 33 0
26 1 44 1
27 1 64 0
28 1 56 0
29 1 73 0
30 1 46 0
31 1 44 0
32 1 53 1
33 1 45 0
34 1 65 0
35 1 43 1
36 1 39 0
37 1 50 0
38 1 30 1
39 1 33 1
40 1 65 0
41 1 57 1
42 1 56 0
43 1 67 0
44 1 46 1
45 1 72 1
46 1 29 0
47 1 65 1
48 1 46 0
49 1 60 0
50 1 28 0
51 1 44 0
52 1 66 1
53 1 46 0
54 1 75 0
55 1 46 0
56 1 55 1
57 1 57 1
58 1 63 0
59 1 61 0
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
(Intercept) age genderFemale age:genderFemale
1 1 37 0 0
2 1 65 1 65
3 1 32 0 0
4 1 23 1 23
5 1 44 0 0
6 1 49 1 49
7 1 35 0 0
8 1 50 1 50
9 1 49 0 0
10 1 60 1 60
11 1 39 0 0
12 1 67 1 67
13 1 70 0 0
14 1 55 0 0
15 1 65 0 0
16 1 45 0 0
17 1 36 0 0
18 1 46 0 0
19 1 44 1 44
20 1 77 1 77
21 1 45 0 0
22 1 59 0 0
23 1 49 1 49
24 1 33 0 0
25 1 33 0 0
26 1 44 1 44
27 1 64 0 0
28 1 56 0 0
29 1 73 0 0
30 1 46 0 0
31 1 44 0 0
32 1 53 1 53
33 1 45 0 0
34 1 65 0 0
35 1 43 1 43
36 1 39 0 0
37 1 50 0 0
38 1 30 1 30
39 1 33 1 33
40 1 65 0 0
41 1 57 1 57
42 1 56 0 0
43 1 67 0 0
44 1 46 1 46
45 1 72 1 72
46 1 29 0 0
47 1 65 1 65
48 1 46 0 0
49 1 60 0 0
50 1 28 0 0
51 1 44 0 0
52 1 66 1 66
53 1 46 0 0
54 1 75 0 0
55 1 46 0 0
56 1 55 1 55
57 1 57 1 57
58 1 63 0 0
59 1 61 0 0
attr(,"assign")
[1] 0 1 2 3
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
(Intercept) age genderFemale raceBlack raceOther age:genderFemale age:raceBlack age:raceOther
1 1 37 0 0 1 0 0 37
2 1 65 1 0 0 65 0 0
3 1 32 0 1 0 0 32 0
4 1 23 1 0 0 23 0 0
5 1 44 0 0 1 0 0 44
6 1 49 1 0 0 49 0 0
7 1 35 0 0 1 0 0 35
8 1 50 1 0 0 50 0 0
9 1 49 0 1 0 0 49 0
10 1 60 1 0 0 60 0 0
11 1 39 0 0 1 0 0 39
12 1 67 1 0 0 67 0 0
13 1 70 0 0 0 0 0 0
14 1 55 0 1 0 0 55 0
15 1 65 0 0 0 0 0 0
16 1 45 0 0 0 0 0 0
17 1 36 0 0 0 0 0 0
18 1 46 0 1 0 0 46 0
19 1 44 1 0 0 44 0 0
20 1 77 1 1 0 77 77 0
21 1 45 0 0 0 0 0 0
22 1 59 0 0 0 0 0 0
23 1 49 1 0 0 49 0 0
24 1 33 0 1 0 0 33 0
25 1 33 0 1 0 0 33 0
26 1 44 1 0 0 44 0 0
27 1 64 0 0 0 0 0 0
28 1 56 0 0 1 0 0 56
29 1 73 0 1 0 0 73 0
30 1 46 0 0 0 0 0 0
31 1 44 0 1 0 0 44 0
32 1 53 1 0 0 53 0 0
33 1 45 0 0 0 0 0 0
34 1 65 0 0 1 0 0 65
35 1 43 1 1 0 43 43 0
36 1 39 0 0 0 0 0 0
37 1 50 0 0 0 0 0 0
38 1 30 1 1 0 30 30 0
39 1 33 1 0 0 33 0 0
40 1 65 0 0 0 0 0 0
41 1 57 1 0 0 57 0 0
42 1 56 0 1 0 0 56 0
43 1 67 0 0 0 0 0 0
44 1 46 1 1 0 46 46 0
45 1 72 1 0 0 72 0 0
46 1 29 0 0 0 0 0 0
47 1 65 1 0 0 65 0 0
48 1 46 0 1 0 0 46 0
49 1 60 0 0 0 0 0 0
50 1 28 0 0 0 0 0 0
51 1 44 0 1 0 0 44 0
52 1 66 1 0 0 66 0 0
53 1 46 0 1 0 0 46 0
54 1 75 0 0 0 0 0 0
55 1 46 0 0 0 0 0 0
56 1 55 1 0 0 55 0 0
57 1 57 1 1 0 57 57 0
58 1 63 0 0 0 0 0 0
59 1 61 0 1 0 0 61 0
genderFemale:raceBlack genderFemale:raceOther
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 0 0
11 0 0
12 0 0
13 0 0
14 0 0
15 0 0
16 0 0
17 0 0
18 0 0
19 0 0
20 1 0
21 0 0
22 0 0
23 0 0
24 0 0
25 0 0
26 0 0
27 0 0
28 0 0
29 0 0
30 0 0
31 0 0
32 0 0
33 0 0
34 0 0
35 1 0
36 0 0
37 0 0
38 1 0
39 0 0
40 0 0
41 0 0
42 0 0
43 0 0
44 1 0
45 0 0
46 0 0
47 0 0
48 0 0
49 0 0
50 0 0
51 0 0
52 0 0
53 0 0
54 0 0
55 0 0
56 0 0
57 1 0
58 0 0
59 0 0
attr(,"assign")
[1] 0 1 2 3 3 4 5 5 6 6
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
attr(,"contrasts")$race
[1] "contr.treatment"
'data.frame': 93 obs. of 27 variables:
$ Manufacturer : Factor w/ 32 levels "Acura","Audi",..: 1 1 2 2 3 4 4 4 4 5 ...
$ Model : Factor w/ 93 levels "100","190E","240",..: 49 56 9 1 6 24 54 74 73 35 ...
$ Type : Factor w/ 6 levels "Compact","Large",..: 4 3 1 3 3 3 2 2 3 2 ...
$ Min.Price : num 12.9 29.2 25.9 30.8 23.7 14.2 19.9 22.6 26.3 33 ...
$ Price : num 15.9 33.9 29.1 37.7 30 15.7 20.8 23.7 26.3 34.7 ...
$ Max.Price : num 18.8 38.7 32.3 44.6 36.2 17.3 21.7 24.9 26.3 36.3 ...
$ MPG.city : int 25 18 20 19 22 22 19 16 19 16 ...
$ MPG.highway : int 31 25 26 26 30 31 28 25 27 25 ...
$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
$ DriveTrain : Factor w/ 3 levels "4WD","Front",..: 2 2 2 2 3 2 2 3 2 2 ...
$ Cylinders : Factor w/ 6 levels "3","4","5","6",..: 2 4 4 4 2 2 4 4 4 5 ...
$ EngineSize : num 1.8 3.2 2.8 2.8 3.5 2.2 3.8 5.7 3.8 4.9 ...
$ Horsepower : int 140 200 172 172 208 110 170 180 170 200 ...
$ RPM : int 6300 5500 5500 5500 5700 5200 4800 4000 4800 4100 ...
$ Rev.per.mile : int 2890 2335 2280 2535 2545 2565 1570 1320 1690 1510 ...
$ Man.trans.avail : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
$ Fuel.tank.capacity: num 13.2 18 16.9 21.1 21.1 16.4 18 23 18.8 18 ...
$ Passengers : int 5 5 5 6 4 6 6 6 5 6 ...
$ Length : int 177 195 180 193 186 189 200 216 198 206 ...
$ Wheelbase : int 102 115 102 106 109 105 111 116 108 114 ...
$ Width : int 68 71 67 70 69 69 74 78 73 73 ...
$ Turn.circle : int 37 38 37 37 39 41 42 45 41 43 ...
$ Rear.seat.room : num 26.5 30 28 31 27 28 30.5 30.5 26.5 35 ...
$ Luggage.room : int 11 15 14 17 13 16 17 21 14 18 ...
$ Weight : int 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
$ Origin : Factor w/ 2 levels "USA","non-USA": 2 2 2 2 2 1 1 1 1 1 ...
$ Make : Factor w/ 93 levels "Acura Integra",..: 1 2 4 3 5 6 7 9 8 10 ...
Linear models are fit using lm()
The result is usually stored in a variable for further processing
lm()
lm()
returns a list
The meaning of most components are obvious, but see help(lm)
for details
List of 13
$ coefficients : Named num [1:12] 41.3291 -0.0062 10.277 -4.0899 1.1735 ...
$ residuals : Named num [1:93] 0.788 0.733 0.461 0.237 4.858 ...
$ effects : Named num [1:93] -215.69 45.45 4.9 1.25 1.28 ...
$ rank : int 12
$ fitted.values: Named num [1:93] 24.2 17.3 19.5 18.8 17.1 ...
$ assign : int [1:12] 0 1 2 3 3 4 5 5 6 6 ...
$ qr :List of 5
..$ qr : num [1:93, 1:12] -9.644 0.104 0.104 0.104 0.104 ...
..$ qraux: num [1:12] 1.1 1.09 1.12 1.11 1.04 ...
..$ pivot: int [1:12] 1 2 3 4 5 6 7 8 9 10 ...
..$ tol : num 1e-07
..$ rank : int 12
$ df.residual : int 81
$ contrasts :List of 2
..$ Man.trans.avail: chr "contr.treatment"
..$ AirBags : chr "contr.treatment"
$ xlevels :List of 2
..$ Man.trans.avail: chr [1:2] "No" "Yes"
..$ AirBags : chr [1:3] "Driver & Passenger" "Driver only" "None"
$ call : language lm(formula = MPG.city ~ Weight * Man.trans.avail * AirBags, data = Cars93)
$ terms :Classes 'terms', 'formula' language MPG.city ~ Weight * Man.trans.avail * AirBags
$ model :'data.frame': 93 obs. of 4 variables:
..$ MPG.city : int [1:93] 25 18 20 19 22 22 19 16 19 16 ...
..$ Weight : int [1:93] 2705 3560 3375 3405 3640 2880 3470 4105 3495 3620 ...
..$ Man.trans.avail: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 1 1 1 1 ...
..$ AirBags : Factor w/ 3 levels "Driver & Passenger",..: 3 1 2 1 2 2 2 2 2 2 ...
summary()
Call:
lm(formula = MPG.city ~ Weight * Man.trans.avail * AirBags, data = Cars93)
Residuals:
Min 1Q Median 3Q Max
-6.8822 -1.3541 -0.0079 0.8039 13.1871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.3291301 13.6733416 3.023 0.00335 **
Weight -0.0062036 0.0037873 -1.638 0.10530
Man.trans.availYes 10.2770284 20.9825645 0.490 0.62561
AirBagsDriver only -4.0898583 15.4854474 -0.264 0.79237
AirBagsNone 1.1735115 17.1752446 0.068 0.94569
Weight:Man.trans.availYes -0.0034421 0.0061659 -0.558 0.57821
Weight:AirBagsDriver only 0.0010294 0.0043071 0.239 0.81172
Weight:AirBagsNone -0.0003477 0.0047409 -0.073 0.94171
Man.trans.availYes:AirBagsDriver only 2.5575320 22.5411909 0.113 0.90995
Man.trans.availYes:AirBagsNone 1.8911293 23.6413834 0.080 0.93644
Weight:Man.trans.availYes:AirBagsDriver only -0.0004309 0.0066246 -0.065 0.94830
Weight:Man.trans.availYes:AirBagsNone -0.0012667 0.0069125 -0.183 0.85506
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.005 on 81 degrees of freedom
Multiple R-squared: 0.7483, Adjusted R-squared: 0.7141
F-statistic: 21.89 on 11 and 81 DF, p-value: < 2.2e-16
anova()
Analysis of Variance Table
Response: MPG.city
Df Sum Sq Mean Sq F value Pr(>F)
Weight 1 2065.52 2065.52 228.7639 < 2e-16 ***
Man.trans.avail 1 24.01 24.01 2.6592 0.10684
AirBags 2 3.20 1.60 0.1772 0.83794
Weight:Man.trans.avail 1 55.14 55.14 6.1064 0.01557 *
Weight:AirBags 2 0.63 0.32 0.0350 0.96565
Man.trans.avail:AirBags 2 25.20 12.60 1.3954 0.25362
Weight:Man.trans.avail:AirBags 2 0.52 0.26 0.0290 0.97143
Residuals 81 731.35 9.03
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: MPG.city ~ Weight
Model 2: MPG.city ~ Weight * Man.trans.avail * AirBags
Res.Df RSS Df Sum of Sq F Pr(>F)
1 91 840.05
2 81 731.35 10 108.7 1.2039 0.3011
coefficients()
(Intercept) Weight
41.3291300743 -0.0062036131
Man.trans.availYes AirBagsDriver only
10.2770283603 -4.0898583382
AirBagsNone Weight:Man.trans.availYes
1.1735114521 -0.0034420884
Weight:AirBagsDriver only Weight:AirBagsNone
0.0010293517 -0.0003477101
Man.trans.availYes:AirBagsDriver only Man.trans.availYes:AirBagsNone
2.5575320500 1.8911292706
Weight:Man.trans.availYes:AirBagsDriver only Weight:Man.trans.availYes:AirBagsNone
-0.0004308614 -0.0012666999
Most modeling functions in R support the formula interface
Most also support some common methods:
coef()
to obtain coefficients
fitted()
to get fitted values
residuals()
to get residuals
predict()
to get predicted values for new predictor values
Contingency tables can be created using table()
or xtabs()
Essentially same, but xtabs()
uses a formula-data interface
trt
race Active Placebo
White 18 18
Black 10 8
Other 3 3
trt
gender Active Placebo
Male 22 16
Female 9 12
fisher.test()
tt.gender <- xtabs(~ gender + trt, data = demog)
tt.race <- xtabs(~ race + trt, data = demog)
fisher.test(tt.gender)
Fisher's Exact Test for Count Data
data: tt.gender
p-value = 0.2911
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.5486984 6.2107193
sample estimates:
odds ratio
1.814327
Fisher's Exact Test for Count Data
data: tt.race
p-value = 0.927
alternative hypothesis: two.sided
chisq.test()
Pearson's Chi-squared test with Yates' continuity correction
data: tt.gender
X-squared = 0.69763, df = 1, p-value = 0.4036
Pearson's Chi-squared test
data: tt.gender
X-squared = 1.2266, df = 1, p-value = 0.2681
Warning in chisq.test(tt.race): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: tt.race
X-squared = 0.15573, df = 2, p-value = 0.9251
death <- read.table("data/death.txt", header = TRUE, na.strings = ".", stringsAsFactors = FALSE)
death <- within(death,
{
trt <- factor(trt, levels = c("A", "B", "C"),
labels = c("Placebo", "Old Drug", "New Drug"))
monthstodeath <- daystodeath / (365.25 / 12) # survival time in months
})
str(death)
'data.frame': 200 obs. of 4 variables:
$ trt : Factor w/ 3 levels "Placebo","Old Drug",..: 1 1 3 3 2 2 2 2 3 3 ...
$ daystodeath : int 52 825 693 981 279 826 531 15 1057 793 ...
$ deathcensor : int 1 0 0 0 1 0 0 0 0 0 ...
$ monthstodeath: num 1.71 27.1 22.77 32.23 9.17 ...
deathcensor
is 1 if the patient died, and 0 if the survival time was censored
Kaplan-Meier survival curves are fit using survfit()
in the survival package
The model formula needs a survival object created using Surv()
A survival objects needs the survival times and the censoring indicator
[1] 1.70841889 27.10472279+ 22.76796715+ 32.22997947+ 9.16632444 27.13757700+ 17.44558522+ 0.49281314+
[9] 34.72689938+ 26.05338809+ 34.43121150+ 30.39014374+ 15.44147844+ 8.24640657 27.26899384+ 21.94661191
[17] 11.49897331+ 24.50924025+ 4.00821355 27.10472279+ 5.35523614 24.14784394+ 22.96509240+ 25.33059548
[25] 29.20739220+ 30.62012320+ 25.39630390 25.19917864+ 5.09240246+ 23.26078029+ 17.97125257+ 15.17864476
[33] 3.74537988 23.12936345+ 34.29979466+ 23.06365503 26.80903491+ 3.28542094 31.31006160+ 20.76386037+
[41] 31.50718686+ 22.17659138+ 31.54004107 1.67556468+ 1.08418891 21.19096509+ 1.83983573 32.19712526
[49] 4.92813142+ 20.96098563+ 29.73305955+ 11.20328542 22.53798768+ 20.96098563+ 28.64887064 44.25462012+
[57] 21.65092402+ 4.36960986 11.82751540+ 29.79876797 2.29979466+ 19.44969199+ 3.67967146+ 28.97741273
[65] 33.08418891+ 19.51540041+ 0.22997947+ 11.86036961+ 31.67145791+ 19.12114990+ 33.64271047 17.74127310
[73] 31.60574949+ 9.26488706+ 28.68172485+ 42.51334702+ 31.57289528+ 17.11704312+ 8.80492813 21.58521561+
[81] 32.85420945+ 0.29568789 22.27515400+ 32.49281314 29.89733060+ 36.36960986+ 35.18685832 31.90143737+
[89] 2.92402464+ 36.50102669+ 23.03080082+ 11.95893224 14.52156057 3.02258727 35.44969199+ 3.05544148+
[97] 17.47843943 34.89117043+ 29.66735113+ 26.02053388+ 4.46817248+ 5.05954825+ 27.76180698+ 1.70841889+
[105] 27.56468172+ 35.35112936+ 27.40041068 19.35112936+ 26.77618070+ 34.06981520+ 27.33470226+ 36.79671458+
[113] 26.38193018+ 0.52566735 20.69815195+ 17.93839836+ 0.91991786 32.98562628+ 33.51129363+ 2.46406571+
[121] 42.67761807+ 2.59548255+ 5.58521561+ 31.04722793+ 34.69404517+ 31.11293634+ 33.34702259+ 6.24229979
[129] 33.70841889+ 4.20533881 30.88295688+ 41.72484600+ 33.57700205 30.06160164+ 14.02874743 5.81519507
[137] 4.17248460+ 24.47638604 27.40041068+ 24.70636550+ 39.72073922+ 5.05954825+ 23.75359343+ 40.87063655+
[145] 0.16427105+ 27.36755647+ 23.16221766+ 1.60985626+ 31.34291581+ 1.97125257 23.16221766+ 17.34702259+
[153] 31.27720739+ 25.49486653+ 22.34086242+ 2.89117043+ 0.75564682+ 25.49486653+ 21.91375770+ 5.09240246+
[161] 31.08008214+ 24.70636550+ 35.35112936+ 12.48459959 31.04722793+ 23.72073922+ 20.69815195+ 2.00410678
[169] 30.58726899+ 0.06570842+ 19.15400411+ 9.26488706 3.38398357 34.03696099+ 19.67967146+ 0.55852156+
[177] 29.89733060+ 24.96919918+ 18.49691992+ 11.40041068 29.79876797+ 29.43737166+ 17.87268994+ 13.27310062
[185] 0.26283368 29.40451745+ 17.24845996+ 24.31211499+ 0.36139630+ 14.65297741 17.14989733+ 8.34496920+
[193] 28.51745380+ 25.42915811+ 16.42710472+ 0.88706366+ 27.66324435+ 8.80492813 16.59137577+ 16.59137577
Call: survfit(formula = Surv(monthstodeath, deathcensor) ~ trt, data = death)
n events median 0.95LCL 0.95UCL
trt=Placebo 68 26 32.2 28.6 NA
trt=Old Drug 64 18 NA 33.6 NA
trt=New Drug 68 7 NA NA NA
summary()
methodCall: survfit(formula = Surv(monthstodeath, deathcensor) ~ trt, data = death)
trt=Placebo
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 68 0 1.000 0.0000 1.000 1.000
3 61 4 0.940 0.0291 0.885 0.999
6 53 6 0.845 0.0450 0.762 0.938
12 48 5 0.766 0.0530 0.668 0.877
24 27 6 0.662 0.0608 0.553 0.793
36 2 5 0.395 0.1137 0.225 0.695
trt=Old Drug
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 64 0 1.000 0.0000 1.000 1.000
3 55 4 0.935 0.0316 0.875 0.999
6 52 2 0.900 0.0387 0.828 0.979
12 45 4 0.829 0.0493 0.738 0.932
24 29 4 0.751 0.0583 0.645 0.874
trt=New Drug
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 68 0 1.000 0.0000 1.000 1.000
3 59 1 0.985 0.0153 0.955 1.000
6 51 1 0.968 0.0225 0.924 1.000
12 49 0 0.968 0.0225 0.924 1.000
24 35 1 0.948 0.0295 0.892 1.000
36 7 4 0.713 0.1129 0.523 0.973
summary()
methodplot(sm, mark.time = TRUE, col = 1:3,
xlab = "Months from Randomization",
main = "Kaplan-Meier Survival Estimates for Death")
legend("bottomleft", names(sm$strata), inset = 0.01, lty = 1, col = 1:3)
survdiff()
Call:
survdiff(formula = Surv(monthstodeath, deathcensor) ~ trt, data = death,
rho = 0)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=Placebo 68 26 16.2 5.885 8.697
trt=Old Drug 64 18 15.9 0.288 0.421
trt=New Drug 68 7 18.9 7.502 12.166
Chisq= 13.9 on 2 degrees of freedom, p= 9e-04
Call:
survdiff(formula = Surv(monthstodeath, deathcensor) ~ trt, data = death,
rho = 1)
N Observed Expected (O-E)^2/E (O-E)^2/V
trt=Placebo 68 22.43 14.0 5.145 8.803
trt=Old Drug 64 15.40 13.6 0.245 0.413
trt=New Drug 68 5.45 15.7 6.731 12.512
Chisq= 14.3 on 2 degrees of freedom, p= 8e-04
In a number of previous Phase I and II studies of male, non-insulin-dependent diabetic (NIDDM) patients, the mean body mass index (BMI) was found to be 28.4. An investigator has 17 male NIDDM patients enrolled in a new study and wants to know if the BMI from this sample is consistent with previous findings. BMI is computed as the ratio of weight in kilograms to the square of the height in meters. What conclusion can the investigator make based on the following height and weight data from his 17 patients?
A new synthetic erythropoietin-type hormone, Rebligen, which is used to treat chemotherapy-induced anemia in cancer patients, was tested in a study of 48 adult cancer patients undergoing chemotherapeutic treatment. Half the patients received low-dose administration of Rebligen via intramuscular injection three times at 2-day intervals, and half the patients received a placebo in a similar fashion. Patients were stratified according to their type of cancer: cervical, prostate, or colorectal. Changes in hemoglobin (in mg/dl) from pre-first injection to one week after last injection were obtained for analysis. Does Rebligen have any effect on the hemoglobin (Hgb) levels?
A study was conducted to monitor the incidence of gastro-intestinal (GI) adverse drug reactions of a new antibiotic used in lower respiratory tract infections (LRTI). Two parallel groups were included in the study. One group consisted of 66 LRTI patients randomized to receive the new treatment, and a reference group of 52 LRTI patients were randomized to receive erythromycin. There were 22 patients in the test group and 28 patients in the control (erythromycin) group who reported one or more GI complaints during 7 days of treatment. Is there evidence of a difference in GI side effect rates between the two groups?
First, construct the contingency table by hand and perform the test
Also read in the dataset sasdata/chisqr.sas7bdat and construct the contingency table from it. Note that the data is already summarized; read help(xtabs)
and use an appropriate variable in the left hand side of the formula to construct the contingency table