R is more flexible than a regular calculator
In fact, R is a full programming language
Most standard data analysis methods are already implemented
Can be extended by writing add-on packages
Thousands of add-on packages are available
Variables (in the context of programming)
Data structures needed for data analysis
Functions (set of instructions for performing a procedure)
Variables are symbols that may be associated with different values
Computations involving variables are done using their current value
[1] 3.162278
Warning in sqrt(x): NaNs produced
[1] NaN
[1] 0+1i
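The outputs above are consistent with commands along the following lines. This is a minimal sketch: the values assigned to x (10, -1, and -1 + 0i) are assumptions inferred from the printed results, not taken from the original slides.

```r
x <- 10
sqrt(x)        # computed using the current value of x
x <- -1        # the same symbol is now associated with a different value
sqrt(x)        # NaN, with a warning
x <- -1 + 0i   # a complex number with zero imaginary part
sqrt(x)        # the square root is now taken as a complex number
```

The same expression sqrt(x) gives three different results because each evaluation uses whatever value x has at that moment.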
Vectors
Lists (general collection of objects)
Data frames (a spreadsheet-like data set)
Matrices
Indexed collection of homogeneous scalars, can be numeric, character, or logical (TRUE / FALSE)
Missing values are allowed, indicated as NA
Elements are indexed starting from 1
The i-th element of a vector x can be extracted using x[i]
There are also more sophisticated forms of (vector) indexing
[1] "January" "February" "March" "April" "May" "June" "July" "August"
[9] "September" "October" "November" "December"
[1] -0.01001108 0.11252941 0.75957249 0.13169291 -0.61549488 -0.38971443 0.09061209 1.90299126
[9] 0.71969075 0.28110559
num [1:10] -0.01 0.113 0.76 0.132 -0.615 ...
chr [1:12] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" ...
[1] 8 9 6 5 2 5 11 8 6 3 2 3 7 6 6 7 11 12 2 5 4 9 3 5 7 1 1 11 8 2
[1] August September June May February May November August June March
[11] February March July June June July November December February May
[21] April September March May July January January November August February
Levels: January February March April May June July August September October November December
int [1:30] 8 9 6 5 2 5 11 8 6 3 ...
Factor w/ 12 levels "January","February",..: 8 9 6 5 2 5 11 8 6 3 ...
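A factor like the one printed above could be built roughly as follows. This is a sketch under assumptions: the names m and mf, the seed, and the use of sample() are illustrative choices, so the sampled months will differ from the output shown.

```r
set.seed(1)                                        # arbitrary seed, for reproducibility only
m <- sample(1:12, 30, replace = TRUE)              # 30 random month numbers
mf <- factor(month.name[m], levels = month.name)   # month names stored as a factor
str(m)                                             # integer vector
str(mf)                                            # factor with 12 levels and integer codes
levels(mf)                                         # levels follow calendar order, not alphabetical
```

Supplying levels = month.name keeps the levels in calendar order; without it, factor() would sort them alphabetically.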
num [1:2] 0 0
num 0
num 0
numeric(0)
logical(0)
[1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
[1] 1 2 3 4 5 6 7 8 9 10
[1] -0.05877929 1.24178994 1.12618568 -0.28574712 0.33757875
The c() function combines smaller vectors
[1] 0 1 1 2 3 5 8 13 21 34
[1] 3.957564195 1.269926679 0.009469478 0.182340988 0.537683329 -0.316325890 -1.533174163 -0.822603301
[9] -3.883589422 -1.762983274
[1] "Hearts" "Spades" "Diamonds" "Clubs"
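The vectors printed above can be produced by calls such as the following. The exact calls are assumptions reconstructed from the output; the names fib and suits are illustrative.

```r
seq(0, 1, by = 0.05)          # regular sequence with a fixed step
rep(1:3, each = 5)            # repeat each element 5 times
1:10                          # shorthand for seq(1, 10)
rnorm(5)                      # 5 random draws from a standard normal
fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
suits <- c("Hearts", "Spades", "Diamonds", "Clubs")
c(fib[1:3], 100, fib[8:10])   # c() also concatenates existing vectors
numeric(2)                    # zero-filled numeric vector of length 2
numeric(0)                    # vectors can have length zero
```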
Indexing refers to extracting subsets of vectors (or other kinds of data)
R supports several kinds of indexing:
Indexing by a vector of positive integers
Indexing by a vector of negative integers
Indexing by a logical vector
Indexing by a vector of names
[1] "February"
[1] "February" "April" "June" "September" "November"
[1] "February" "February" "June" "April" "June" "November"
Indexes beyond the length of the vector give NA (missing)
[1] NA
[1] "January" "March" "May" "July" "September" "November" NA NA
[1] "January" "March" "April" "May" "June" "July" "August" "September"
[9] "October" "November" "December"
[1] "January" "March" "May" "July" "August" "October" "December"
Error in month.name[c(2, -3)]: only 0's may be mixed with negative subscripts
character(0)
character(0)
[1] "January" "February" "November" "December"
[1] "March" "April" "May" "June" "July" "August" "September" "October"
Indexing by a logical vector selects the TRUE elements
[1] "January" "April" "July" "October"
[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[1] "January" "June" "July"
[1] -0.90400018 0.64537142 -0.18131343 -1.57485208 1.28833752 -0.64120477 -0.11357468 -0.60709094
[9] 0.36187801 -0.02112882 -0.08529980 -0.15286958 -0.11278629 -0.03084374 0.21417339 0.33745118
[17] 0.51941696 -0.98108416 -0.69654672 -1.40327686
[1] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
[18] FALSE FALSE FALSE
[1] 0.6453714 1.2883375 0.3618780 0.2141734 0.3374512 0.5194170
[1] -0.2069622
[1] 0.5611047
which() gives the positions of the TRUE elements of a logical vector
[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[1] 1 6 7
[1] "January" "June" "July"
[1] "February" "March" "April" "May" "August" "September" "October" "November"
[9] "December"
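The kinds of indexing listed earlier can be tried on the built-in vector month.name. The calls below are a sketch consistent with the outputs shown; the comparison used to build the logical vector j (months starting with "J") is an assumption inferred from the result.

```r
month.name[2]                        # single positive index
month.name[c(2, 4, 6, 9, 11)]        # vector of positive indexes
month.name[c(2, 2, 6, 4, 6, 11)]     # repeats and reordering are allowed
month.name[13]                       # beyond the end of the vector: NA
month.name[-2]                       # negative index: drop element 2
month.name[-c(2, 4, 6, 9, 11)]       # drop several elements
month.name[c(TRUE, FALSE, FALSE)]    # logical index, recycled to length 12
j <- substring(month.name, 1, 1) == "J"   # logical vector from a comparison
month.name[j]                        # keep only the TRUE positions
which(j)                             # positions of the TRUE elements
month.name[-which(j)]                # everything except those positions
mixed <- try(month.name[c(2, -3)], silent = TRUE)  # mixing signs is an error
```

The last line is wrapped in try() only so that the rest of the script keeps running after the error.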
R vectors can have names (optional)
These names can be used to index, just like positive integers
Will see examples later
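As a small preview of indexing by names, here is a sketch using a hypothetical named vector days (days per month in a non-leap year), which is not part of the original material.

```r
days <- c(January = 31, February = 28, March = 31)  # names given at creation
names(days)                      # the names are stored as a character vector
days["February"]                 # index by a single name
days[c("March", "January")]      # index by a vector of names, in any order
```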
Arithmetic operations are usually “vectorized”. These include
Arithmetic operators such as +, -, *, /, ^
Mathematical functions such as sin(), cos(), log()
Logical comparisons <, >, <=, >=, ==, and logical operators &, |, !
Almost any other function where it makes sense
They operate element-wise on vectors, producing another vector
Remember that R has no “scalar” type. Scalars are just length-1 vectors
Operations with unequal sized vectors: Shorter vector is repeated / recycled to match longer vector
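A short sketch of vectorization and recycling, using a made-up vector x:

```r
x <- 1:6
x * 2                      # 2 is recycled to length 6
x + c(10, 20)              # c(10, 20) is recycled three times
x > 3                      # comparisons are element-wise too
x[x > 3 & x %% 2 == 0]     # combine logical vectors with &, then index
sqrt(x)                    # mathematical functions work element-wise
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.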
Example: Recreate height-weight simulation
ht <- rnorm(200, mean = 172, sd = 10) # height in cm
bmi <- rnorm(200, mean = 22, sd = 2.2) # bmi (independent of height)
wt <- bmi * (ht / 100)^2
The last step has several vectorized computations:
ht / 100 divides each element of height by 100 (100 is recycled)
(ht / 100)^2 squares each element of the result (2 is recycled)
bmi * (ht / 100)^2 multiplies the result with bmi element-wise
[1] 13095.85
[1] 65.47926
[1] 10.15318
[1] 0.7942958
[1] 145.1637 164.3472 172.3810 179.4889 198.0955
Min. 1st Qu. Median Mean 3rd Qu. Max.
145.2 164.4 172.4 172.1 179.5 198.1
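The numbers above look like the results of common summary functions applied to the simulated vectors. The calls below are an assumption reconstructed from the output; because the seed here is arbitrary, the values will differ from those shown.

```r
set.seed(1)                              # arbitrary seed, for reproducibility only
ht <- rnorm(200, mean = 172, sd = 10)    # height in cm
bmi <- rnorm(200, mean = 22, sd = 2.2)   # bmi (independent of height)
wt <- bmi * (ht / 100)^2                 # weight in kg
sum(wt)        # total
mean(wt)       # average
sd(wt)         # standard deviation
cor(ht, wt)    # correlation between height and weight
fivenum(ht)    # minimum, lower hinge, median, upper hinge, maximum
summary(ht)    # quartiles and mean, with labels
```

Unlike the arithmetic operators, these functions reduce a whole vector to one or a few numbers.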
Atomic vectors must be homogeneous (all elements of the same type)
But we often need to combine different types of data
Lists are vectors with arbitrary types of components
Like atomic vectors, they may or may not have names, but usually do
Usually constructed using the function list()
Example: Suppose we want to record units and a descriptive label along with a data vector
lht <- list(data = ht[1:10], unit = "cm", label = "Simulated height")
lbmi <- list(data = bmi[1:10], unit = "none", label = "Simulated BMI")
lwt <- list(data = wt[1:10], unit = "kg", label = "Weight calculated from height and BMI")
These are now vectors with different types of elements
Can be indexed in the usual way:
$data
[1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257
$unit
[1] "cm"
$label
[1] "Simulated BMI"
But often we want to extract a specific element of a list
This is done by indexing with double brackets [[ ... ]]
[1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257
[1] "Simulated BMI"
Named elements can also be extracted using $
[1] "Weight calculated from height and BMI"
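The difference between the three ways of indexing a list can be seen in a small sketch. The list below mirrors lht from the example, but with three made-up data values so that it runs on its own.

```r
lht <- list(data = c(160, 169, 196), unit = "cm", label = "Simulated height")
lht["unit"]               # single brackets: a list of length 1
lht[c("unit", "label")]   # single brackets can select several components
lht[["unit"]]             # double brackets: the component itself
lht$label                 # $ is shorthand for [[ ]] with a name
lht[[1]][2]               # extract a component, then index into it
```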
Lists can themselves contain lists recursively
List of 3
$ height:List of 3
..$ data : num [1:10] 160 169 196 168 175 ...
..$ unit : chr "cm"
..$ label: chr "Simulated height"
$ bmi :List of 3
..$ data : num [1:10] 23.9 23.6 22.6 24 20.5 ...
..$ unit : chr "none"
..$ label: chr "Simulated BMI"
$ weight:List of 3
..$ data : num [1:10] 61.2 67.2 87.3 67.6 62.8 ...
..$ unit : chr "kg"
..$ label: chr "Weight calculated from height and BMI"
[1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377
Lists are very flexible data structures that are widely used
Two very important uses:
Standard representation of data sets
To contain results of complex functions (model fitting, tests)
Data frames represent rectangular (spreadsheet-like) data
Essentially lists with some additional restrictions
Elements are viewed as columns in a data set
Each element / column is (usually) an atomic vector
Different columns can have different types
Every column must have a name
Can be created using the data.frame() function
List of 4
$ height: num [1:200] 160 169 196 168 175 ...
$ bmi : num [1:200] 23.9 23.6 22.6 24 20.5 ...
$ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
$ gender: chr [1:2] "M" "F"
mydf <- data.frame(height = ht, bmi = bmi, weight = wt, gender = c("M", "F"))
str(mydf) # compare with list
'data.frame': 200 obs. of 4 variables:
$ height: num 160 169 196 168 175 ...
$ bmi : num 23.9 23.6 22.6 24 20.5 ...
$ weight: num 61.2 67.2 87.3 67.6 62.8 ...
$ gender: Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
[1] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[53] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[105] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[157] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
Levels: F M
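A smaller sketch of the same idea, with only six made-up rows and an arbitrary seed. It also shows that a data frame can be indexed both like a list and like a matrix.

```r
set.seed(42)                                   # arbitrary seed
ht <- rnorm(6, mean = 172, sd = 10)
df <- data.frame(height = ht, gender = c("M", "F"))  # gender is recycled to 6 rows
str(df)
is.list(df)        # a data frame is still a list of columns
df$gender          # extract a column by name
df[["height"]]     # same idea with double brackets
df[1:2, ]          # first two rows, all columns
```

In R 4.0.0 and later, data.frame() keeps character columns as character by default (stringsAsFactors = FALSE); the Factor column in the output above reflects the older default.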
Lists can be recursive
$height
$height$data
[1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257
$height$unit
[1] "cm"
$height$label
[1] "Simulated height"
$bmi
$bmi$data
[1] 23.94287 23.56564 22.63233 24.00688 20.46678 23.94241 21.71165 21.95390 24.09306 24.94387
$bmi$unit
[1] "none"
$bmi$label
[1] "Simulated BMI"
$weight
$weight$data
[1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377
$weight$unit
[1] "kg"
$weight$label
[1] "Weight calculated from height and BMI"
But they are ‘flattened’ in data frames
height.data height.unit height.label bmi.data bmi.unit bmi.label weight.data weight.unit
1 159.8666 cm Simulated height 23.94287 none Simulated BMI 61.19160 kg
2 168.8484 cm Simulated height 23.56564 none Simulated BMI 67.18511 kg
3 196.3493 cm Simulated height 22.63233 none Simulated BMI 87.25451 kg
4 167.8481 cm Simulated height 24.00688 none Simulated BMI 67.63458 kg
5 175.1913 cm Simulated height 20.46678 none Simulated BMI 62.81660 kg
6 159.1338 cm Simulated height 23.94241 none Simulated BMI 60.63069 kg
7 177.4504 cm Simulated height 21.71165 none Simulated BMI 68.36702 kg
8 174.6472 cm Simulated height 21.95390 none Simulated BMI 66.96299 kg
9 163.5114 cm Simulated height 24.09306 none Simulated BMI 64.41511 kg
10 178.1257 cm Simulated height 24.94387 none Simulated BMI 79.14377 kg
weight.label
1 Weight calculated from height and BMI
2 Weight calculated from height and BMI
3 Weight calculated from height and BMI
4 Weight calculated from height and BMI
5 Weight calculated from height and BMI
6 Weight calculated from height and BMI
7 Weight calculated from height and BMI
8 Weight calculated from height and BMI
9 Weight calculated from height and BMI
10 Weight calculated from height and BMI
It is much more common to create data frames by importing data from a file
Typical approach: read data from spreadsheet file into data frame
Easiest route:
R itself cannot read Excel files directly
Save as CSV file from Excel
Read with read.csv() or read.table() (more flexible)
Alternative option:
Data can be exported using write.csv() or write.table()
data(Cars93, package = "MASS") # built-in dataset
write.csv(Cars93, file = "cars93.csv") # export
cars <- read.csv("cars93.csv") # import (path relative to working directory)
Import text dataset data/demog.txt containing demographic data
The contents of the file are:
subjid trt gender race age
101 0 1 3 37
102 1 2 1 65
103 1 1 2 32
104 0 2 1 23
105 1 1 3 44
106 0 2 1 49
201 1 1 3 35
202 0 2 1 50
203 1 1 2 49
204 0 2 1 60
205 1 1 3 39
206 1 2 1 67
301 0 1 1 70
302 0 1 2 55
303 1 1 1 65
304 0 1 1 45
305 1 1 1 36
306 0 1 2 46
401 1 2 1 44
402 0 2 2 77
403 1 1 1 45
404 1 1 1 59
405 0 2 1 49
406 1 1 2 33
501 0 1 2 33
502 1 2 1 44
503 1 1 1 64
504 0 1 3 56
505 1 1 2 73
506 0 1 1 46
507 1 1 2 44
508 0 2 1 53
509 0 1 1 45
510 0 1 3 65
511 1 2 2 43
512 1 1 1 39
601 0 1 1 50
602 0 2 2 30
603 1 2 1 33
604 0 1 1 65
605 1 2 1 57
606 0 1 2 56
607 1 1 1 67
608 0 2 2 46
609 1 2 1 72
610 0 1 1 29
611 1 2 1 65
612 1 1 2 46
701 1 1 1 60
702 0 1 1 28
703 1 1 2 44
704 0 2 1 66
705 1 1 2 46
706 1 1 1 75
707 1 1 1 46
708 0 2 1 55
709 0 2 2 57
710 0 1 1 63
711 1 1 2 61
712 0 . 1 49
Attempt 1:
'data.frame': 61 obs. of 5 variables:
$ V1: Factor w/ 61 levels "101","102","103",..: 61 1 2 3 4 5 6 7 8 9 ...
$ V2: Factor w/ 3 levels "0","1","trt": 3 1 2 2 1 2 1 2 1 2 ...
$ V3: Factor w/ 4 levels ".","1","2","gender": 4 2 3 2 3 2 3 2 3 2 ...
$ V4: Factor w/ 4 levels "1","2","3","race": 4 3 1 2 1 3 1 3 1 2 ...
$ V5: Factor w/ 34 levels "23","28","29",..: 34 9 26 5 1 12 15 7 16 15 ...
R doesn’t know that the first row gives column headers
All columns have been interpreted as characters and converted to factors
Attempt 2:
'data.frame': 60 obs. of 5 variables:
$ subjid: int 101 102 103 104 105 106 201 202 203 204 ...
$ trt : int 0 1 1 0 1 0 1 0 1 0 ...
$ gender: chr "1" "2" "1" "2" ...
$ race : int 3 1 2 1 3 1 3 1 2 1 ...
$ age : int 37 65 32 23 44 49 35 50 49 60 ...
The gender column is still interpreted as character data
This is because R doesn’t know that missing values are encoded as "."
Attempt 3:
demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE, na.strings = ".")
str(demog)
'data.frame': 60 obs. of 5 variables:
$ subjid: int 101 102 103 104 105 106 201 202 203 204 ...
$ trt : int 0 1 1 0 1 0 1 0 1 0 ...
$ gender: int 1 2 1 2 1 2 1 2 1 2 ...
$ race : int 3 1 2 1 3 1 3 1 2 1 ...
$ age : int 37 65 32 23 44 49 35 50 49 60 ...
[1] 1 2 1 2 1 2 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 1 1 2 1 1 2
[36] 1 1 2 2 1 2 1 1 2 2 1 2 1 1 1 1 2 1 1 1 2 2 1 1 NA
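The three attempts can be reproduced without the original file. The sketch below writes the header and three rows copied from the listing above to a temporary file; the variable names f, d1, d2, and d3 are illustrative, and the arguments used for the first two attempts are assumptions inferred from the output.

```r
lines <- c("subjid trt gender race age",
           "101 0 1 3 37",
           "102 1 2 1 65",
           "712 0 . 1 49")
f <- tempfile(fileext = ".txt")   # temporary stand-in for data/demog.txt
writeLines(lines, f)

d1 <- read.table(f)                                   # header row read as data
d2 <- read.table(f, header = TRUE, stringsAsFactors = FALSE)  # "." kept as text
d3 <- read.table(f, header = TRUE, stringsAsFactors = FALSE,
                 na.strings = ".")                    # "." treated as missing
str(d1)
str(d2)
str(d3)
```

Only in d3 is gender read as an integer column with a proper NA.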
The columns trt, gender, and race should actually be categorical
We need more information about the numeric encoding to do this
Will see examples later
We have earlier used the lm() function to fit an OLS linear regression model
Let’s see what the output returned by lm() actually looks like
List of 12
$ coefficients : Named num [1:2] -68.022 0.776
..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
$ residuals : Named num [1:200] 5.19 4.22 2.95 5.44 -5.07 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ effects : Named num [1:200] -926.02 113.77 3.26 5.01 -5.31 ...
..- attr(*, "names")= chr [1:200] "(Intercept)" "height" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:200] 56 63 84.3 62.2 67.9 ...
..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:200, 1:2] -14.1421 0.0707 0.0707 0.0707 0.0707 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:200] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "height"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.07 1.02
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 198
$ xlevels : Named list()
$ call : language lm(formula = weight ~ height, data = mydf)
$ terms :Classes 'terms', 'formula' language weight ~ height
.. ..- attr(*, "variables")= language list(weight, height)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "weight" "height"
.. .. .. ..$ : chr "height"
.. ..- attr(*, "term.labels")= chr "height"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(weight, height)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
$ model :'data.frame': 200 obs. of 2 variables:
..$ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
..$ height: num [1:200] 160 169 196 168 175 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language weight ~ height
.. .. ..- attr(*, "variables")= language list(weight, height)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "weight" "height"
.. .. .. .. ..$ : chr "height"
.. .. ..- attr(*, "term.labels")= chr "height"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(weight, height)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
- attr(*, "class")= chr "lm"
Using t.test() to perform a one-sample t-test
List of 10
$ statistic : Named num 0.167
..- attr(*, "names")= chr "t"
$ parameter : Named num 199
..- attr(*, "names")= chr "df"
$ p.value : num 0.867
$ conf.int : num [1:2] 21.7 22.3
..- attr(*, "conf.level")= num 0.95
$ estimate : Named num 22
..- attr(*, "names")= chr "mean of x"
$ null.value : Named num 22
..- attr(*, "names")= chr "mean"
$ stderr : num 0.145
$ alternative: chr "two.sided"
$ method : chr "One Sample t-test"
$ data.name : chr "mydf$bmi"
- attr(*, "class")= chr "htest"
These details are usually unimportant for regular use
Printing these results will show “user-friendly” output
One Sample t-test
data: mydf$bmi
t = 0.16719, df = 199, p-value = 0.8674
alternative hypothesis: true mean is not equal to 22
95 percent confidence interval:
21.73774 22.31084
sample estimates:
mean of x
22.02429
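Because the result is a list, individual pieces can be extracted with the usual list indexing. A minimal sketch, assuming a freshly simulated bmi vector (arbitrary seed, so the numbers will not match the output above) and an illustrative variable name tt:

```r
set.seed(123)                             # arbitrary seed
bmi <- rnorm(200, mean = 22, sd = 2.2)
tt <- t.test(bmi, mu = 22)                # one-sample t-test of H0: mean = 22
class(tt)                                 # "htest"
names(tt)                                 # components of the underlying list
tt$p.value                                # extract just the p-value
tt$conf.int                               # confidence interval (with an attribute)
tt[["estimate"]]                          # sample mean
tt                                        # auto-printing gives the friendly summary
```

Extracting components this way is convenient when results from many tests need to be collected programmatically.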
Matrices and arrays arise very naturally in statistics
Two common uses:
Model matrix for linear models
Contingency tables
Example: built-in dataset giving death rates (per 1000) for demographic subgroups
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
[1] 5 4
Unlike data frames, they are always homogeneous (all elements of same type)
Matrices have row and column indexes, and possibly names
Indexing works in the same way as vectors, but in two dimensions (separated by ,)
Rural Female Urban Male
50-54 8.7 15.4
55-59 11.7 24.3
Rural Male Rural Female
50-54 11.7 8.7
55-59 18.1 11.7
60-64 26.9 20.3
65-69 41.0 30.9
70-74 66.0 54.3
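The printed values match the VADeaths matrix from the datasets package (death rates in Virginia, 1940); that identification is inferred from the output rather than stated in the slides. Assuming it is correct, the subsets above can be obtained as follows.

```r
dim(VADeaths)                     # 5 rows, 4 columns
VADeaths[1:2, 2:3]                # rows 1-2, columns 2-3
VADeaths[, 1:2]                   # empty row index: all rows
VADeaths["60-64", "Urban Male"]   # indexing by row and column names
VADeaths[VADeaths[, "Rural Male"] > 20, ]   # logical index on rows
```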
There are many ways to create a matrix
Example: matrix() constructs a matrix by providing data and dimensions
[,1] [,2] [,3] [,4]
[1,] 1 4 7 10
[2,] 2 5 8 11
[3,] 3 6 9 12
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
[4,] 10 11 12
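The two matrices above are consistent with the following calls (a sketch; the names m1 and m2 are illustrative). By default matrix() fills column by column, and byrow = TRUE fills row by row instead.

```r
m1 <- matrix(1:12, nrow = 3, ncol = 4)                # filled column-wise
m2 <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)  # filled row-wise
m1
m2
```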
cbind() and rbind() construct matrices by combining columns or rows
int ht
[1,] 1 159.8666
[2,] 1 168.8484
[3,] 1 196.3493
[4,] 1 167.8481
[5,] 1 175.1913
[6,] 1 159.1338
[7,] 1 177.4504
[8,] 1 174.6472
[9,] 1 163.5114
[10,] 1 178.1257
[11,] 1 175.2149
[12,] 1 155.5292
[13,] 1 157.8308
[14,] 1 160.7194
[15,] 1 177.3041
X is the design matrix for linear regression on height (including intercept)
Standard matrix operations: transpose t() and matrix product %*%
Can be used to solve linear regression equation
int ht
int 200.00 34416.96
ht 34416.96 5944139.57
[,1]
int -68.0215728
ht 0.7757852
(Intercept) height
-68.0215728 0.7757852
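The least-squares coefficients solve the normal equations (X'X) b = X'y. A sketch of that computation, re-simulating the data with an arbitrary seed (so the numbers will differ from the output above) and comparing against lm():

```r
set.seed(2024)                           # arbitrary seed
ht <- rnorm(200, mean = 172, sd = 10)
bmi <- rnorm(200, mean = 22, sd = 2.2)
wt <- bmi * (ht / 100)^2

X <- cbind(int = 1, ht = ht)             # design matrix; 1 is recycled for the intercept
XtX <- t(X) %*% X                        # 2 x 2 cross-product matrix
beta <- solve(XtX, t(X) %*% wt)          # solve the normal equations
fit <- lm(wt ~ ht)
beta
coef(fit)                                # same estimates from lm()
```

lm() does not solve the normal equations directly; it uses a QR decomposition, which is numerically more stable, but the two answers agree closely here.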
[1] 1 2 3 4 5 6 7 8 9 10 11 12
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
, , 1
[,1] [,2]
[1,] 1 3
[2,] 2 4
, , 2
[,1] [,2]
[1,] 5 7
[2,] 6 8
, , 3
[,1] [,2]
[1,] 9 11
[2,] 10 12
Incidentally, assignments where the left-hand side looks like a function call are a special feature of R
These modify some aspect of an already existing variable
These are known as replacement functions
The underlying vector nature of a matrix is easy to verify
Rural Male Rural Female Urban Male Urban Female
50-54 11.7 8.7 15.4 8.4
55-59 18.1 11.7 24.3 13.6
60-64 26.9 20.3 37.0 19.3
65-69 41.0 30.9 54.6 35.1
70-74 66.0 54.3 71.1 50.0
[1] 41.0 66.0 8.7 11.7 20.3 30.9 54.3
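A matrix or array is a vector with a dim attribute, and that attribute can be set through a replacement function. The sketch below is an assumption about the kind of code behind the outputs above; the vector y and its names are made up for illustration.

```r
x <- 1:12
dim(x) <- c(4, 3)        # x is now a 4 x 3 matrix, filled column by column
x
dim(x) <- c(2, 2, 3)     # the same 12 values viewed as a 2 x 2 x 3 array
x
y <- c(3, 1, 2)
names(y) <- c("a", "b", "c")   # names<- is another replacement function
y
VADeaths[4:10]           # a single index treats the matrix as a plain vector
```

The last line picks elements 4 to 10 in column-major order, so it runs from the end of the first column into the second.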
This background is enough to start on typical data analysis tasks
But before that, we also need to learn about accessing documentation
This requires a brief discussion of the class system in R
Every R object must have a class
[1] "list"
[1] "data.frame"
[1] "numeric"
[1] "lm"
[1] "htest"
Generic functions are placeholder functions
They perform different tasks depending on type of argument passed to them
For example, summary() is a generic function
height bmi weight gender
Min. :145.2 Min. :16.43 Min. :44.19 F:100
1st Qu.:164.4 1st Qu.:20.67 1st Qu.:58.35 M:100
Median :172.4 Median :22.21 Median :64.91
Mean :172.1 Mean :22.02 Mean :65.48
3rd Qu.:179.5 3rd Qu.:23.45 3rd Qu.:71.78
Max. :198.1 Max. :28.41 Max. :90.05
Min. 1st Qu. Median Mean 3rd Qu. Max.
44.19 58.35 64.91 65.48 71.78 90.05
Call:
lm(formula = weight ~ height, data = mydf)
Residuals:
Min 1Q Median 3Q Max
-17.7074 -3.9083 0.4285 4.2131 17.5302
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -68.02157 7.26984 -9.357 <2e-16 ***
height 0.77579 0.04217 18.397 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.184 on 198 degrees of freedom
Multiple R-squared: 0.6309, Adjusted R-squared: 0.629
F-statistic: 338.4 on 1 and 198 DF, p-value: < 2.2e-16
[1] summary.aov summary.aovlist* summary.aspell*
[4] summary.check_packages_in_dir* summary.connection summary.data.frame
[7] summary.Date summary.default summary.ecdf*
[10] summary.factor summary.glm summary.infl*
[13] summary.lm summary.loess* summary.manova
[16] summary.matrix summary.mlm* summary.nls*
[19] summary.packageStatus* summary.POSIXct summary.POSIXlt
[22] summary.ppr* summary.prcomp* summary.princomp*
[25] summary.proc_time summary.srcfile summary.srcref
[28] summary.stepfun summary.stl* summary.table
[31] summary.tukeysmooth* summary.warnings
see '?methods' for accessing help and source code
The print() function
A special generic function is called print()
Whenever the result of an evaluation is not assigned to a variable, it is “auto-printed”
This is done using the print() generic function
For example:
Call:
lm(formula = weight ~ height, data = mydf)
Coefficients:
(Intercept) height
-68.0216 0.7758
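Method dispatch can be seen on a small scale. In the sketch below, the class "height" and its print method are hypothetical, invented only to show how auto-printing picks a method based on the class of the object.

```r
x <- c(2.5, 3.1, 4.7)
f <- factor(c("a", "b", "a"))
class(x)       # "numeric"
class(f)       # "factor"
summary(x)     # numeric vector: quartiles and mean
summary(f)     # factor: counts per level (same call, different method)

# Hypothetical class "height" with its own print() method
h <- structure(list(value = 172, unit = "cm"), class = "height")
print.height <- function(x, ...) {
  cat("<height> ", x$value, " ", x$unit, "\n", sep = "")
  invisible(x)
}
h              # auto-printing dispatches to print.height()
```

Without print.height(), the object h would be printed like an ordinary list, followed by its class attribute.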
Every dataset and function in R is documented in a help page
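The documentation for a function can be accessed by ? or help()
For a generic function, it is often more useful to look up the help page of a specific method, e.g., summary.lm(), directly instead of summary()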
Most useful things in R happen by calling functions
Functions have one or more arguments
All arguments have names
Arguments may be compulsory or optional
Optional arguments have “default” values
Functions normally also have a useful “return” value
These are all described in the help page
Arguments may or may not be named when calling a function
If not named, arguments are matched by position
Conventionally, optional arguments are named, compulsory arguments are often not named
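A sketch of positional versus named argument matching, using rnorm() and round() as examples (the seed and values are arbitrary):

```r
set.seed(1)
a <- rnorm(3, 10, 2)                   # matched by position: n, mean, sd
set.seed(1)
b <- rnorm(n = 3, sd = 2, mean = 10)   # matched by name, in any order
identical(a, b)                        # TRUE
args(rnorm)                            # argument names and default values
round(3.14159, digits = 2)             # compulsory argument unnamed, optional one named
```

Naming optional arguments makes a call easier to read and protects against mistakes in argument order.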
Load 02-rbasics.rmd in RStudio and work through the examples
Read the help page for mean() and median()
Here are two simple numeric vectors containing NA and Inf values
Find the mean and median of these vectors
How can you make R ignore the NA value?
How can you make R ignore the Inf value? Hint: see ?is.finite
Go through the help pages for write.table() and read.table() to understand how they work. Skip non-essential details.
Export the demog data frame as a CSV file named "demog.csv" using write.csv()
Import this newly created dataset again using read.csv(), saving it as a variable named d
Do you need to specify the header argument? Why?
Do you need to specify the na.strings argument? Why?
The goal of the next exercise is to convert d$race into a factor
Read the help page for factor() to learn how to create factors
d$race has three values: 1 = White, 2 = Black, 3 = Other
Create and add a new factor variable d$frace to the data frame d
d$frace should have “levels” 1, 2, 3, and corresponding “labels” White, Black, Other
The goal of the next exercise is to import a SAS format dataset and use it to perform a t-test
SAS can export data in a binary format with extension sas7bdat
This is not a documented export format, and is not meant to be imported into other software
However, it is common enough that there is a contributed add-on package that has attempted to reverse-engineer the format
To use this package, we first need to load it into R as follows
## install.packages("sas7bdat") # install package if not already installed (needed only once)
library(package = "sas7bdat")
Once the package is loaded, read the help page for read.sas7bdat()
Use it to import the file sasdata/twosample.sas7bdat
The imported data frame should have variables PATNO, TRT, FEV0 and FEV6
A new compound, ABC-123, is being developed for long-term treatment of patients with chronic asthma. Asthmatic patients were enrolled in a double blind study and randomized to receive daily oral doses of ABC-123 or a placebo for 6 weeks. The primary measurement of interest is the resting FEV1 (forced expiratory volume during the first second of expiration), which is measured before (as FEV0) and at the end (as FEV6) of the 6-week treatment period.
Add a variable called CHG to the dataset recording the change in FEV1
Read the help page for t.test()
Does administration of ABC-123 have any effect on FEV1? Answer this by performing a two-sample t-test
Note that there may be multiple approaches to arrive at the correct answer