Basic usage of R

Deepayan Sarkar

Basics of using R

R is more flexible than a regular calculator

In fact, R is a full programming language
Most standard data analysis methods are already implemented
Can be extended by writing add-on packages
Thousands of add-on packages are available

Major concepts

Variables (in the context of programming)
Data structures needed for data analysis
Functions (set of instructions for performing a procedure)

Variables

Variables are symbols that may be associated with different values
Computations involving variables are done using their current value

x <- 10 # assignment
sqrt(x)

[1] 3.162278

x <- -1
sqrt(x)

Warning in sqrt(x): NaNs produced

[1] NaN

x <- -1+0i
sqrt(x)

[1] 0+1i

Data structures for data analysis

Vectors
Lists (general collection of objects)
Data frames (a spreadsheet-like data set)
Matrices

Atomic vectors

Indexed collection of homogeneous scalars, can be
- Numeric / Integer
- Categorical (factor)
- Character
- Logical (TRUE / FALSE)
Missing values are allowed, indicated as NA
Elements are indexed starting from 1
i-th element of vector x can be extracted using x[i]
There are also more sophisticated forms of (vector) indexing

Atomic vectors: examples

month.name # built-in

 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"   
 [9] "September" "October"   "November"  "December"

x <- rnorm(10)
x

 [1] -0.01001108  0.11252941  0.75957249  0.13169291 -0.61549488 -0.38971443  0.09061209  1.90299126
 [9]  0.71969075  0.28110559

str(x) # useful function

 num [1:10] -0.01 0.113 0.76 0.132 -0.615 ...

str(month.name)

 chr [1:12] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" ...

m <- sample(1:12, 30, replace = TRUE)
m

 [1]  8  9  6  5  2  5 11  8  6  3  2  3  7  6  6  7 11 12  2  5  4  9  3  5  7  1  1 11  8  2

mf <- factor(m, levels = 1:12, labels = month.name)
mf

 [1] August    September June      May       February  May       November  August    June      March    
[11] February  March     July      June      June      July      November  December  February  May      
[21] April     September March     May       July      January   January   November  August    February 
Levels: January February March April May June July August September October November December

str(m)

 int [1:30] 8 9 6 5 2 5 11 8 6 3 ...

str(mf)

 Factor w/ 12 levels "January","February",..: 8 9 6 5 2 5 11 8 6 3 ...

Atomic vectors

“Scalars” are just vectors of length 1

str(numeric(2))

 num [1:2] 0 0

str(numeric(1))

 num 0

str(0)

 num 0

Vectors can have length zero

numeric(0)

numeric(0)

logical(0)

logical(0)

Creating vectors

Using functions that return vectors

seq(0, 1, by = 0.05) # regular sequence

 [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00

rep(1:3, each = 5)   # elements repeated in a regular pattern

 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3

1:10                 # shortcut for seq(1, 10, by = 1)

 [1]  1  2  3  4  5  6  7  8  9 10

rnorm(5)             # many random number generators available

[1] -0.05877929  1.24178994  1.12618568 -0.28574712  0.33757875

Using the c() function to combine smaller vectors

c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)

 [1]  0  1  1  2  3  5  8 13 21 34

c(rexp(5), -rexp(5))

 [1]  3.957564195  1.269926679  0.009469478  0.182340988  0.537683329 -0.316325890 -1.533174163 -0.822603301
 [9] -3.883589422 -1.762983274

c("Hearts", "Spades", "Diamonds", "Clubs")

[1] "Hearts"   "Spades"   "Diamonds" "Clubs"
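
Because atomic vectors are homogeneous, c() converts its arguments to a common type when they differ. A small sketch of this coercion (the variable names are illustrative only):

```r
# c() coerces mixed inputs to a single common type
v1 <- c(1L, 2.5)        # integer + double      -> double ("numeric")
v2 <- c(1, TRUE)        # number + logical      -> numeric (TRUE becomes 1)
v3 <- c(1, "a", TRUE)   # anything + character  -> character
str(v1)
str(v2)
str(v3)
```

Coercion happens silently, so str() is a quick way to check what type was actually produced.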

Types of indexing

Indexing refers to extracting subsets of vectors (or other kinds of data)
R supports several kinds of indexing:
- Indexing by a vector of positive integers
- Indexing by a vector of negative integers
- Indexing by a logical vector
- Indexing by a vector of names

Types of indexing: positive integers

The “standard” C-like indexing with a scalar (vector of length 1):

month.name[2] # the first index is 1, not 0

[1] "February"

The “index” can also be an integer vector

month.name[c(2, 4, 6, 9, 11)]

[1] "February"  "April"     "June"      "September" "November"

Elements can be repeated

month.name[c(2, 2, 6, 4, 6, 11)]

[1] "February" "February" "June"     "April"    "June"     "November"

“Out-of-bounds” indexing gives NA (missing)

month.name[13]

[1] NA

month.name[seq(1, by = 2, length.out = 8)]

[1] "January"   "March"     "May"       "July"      "September" "November"  NA          NA

Types of indexing: negative integers

Negative integers omit the specified entries

month.name[-2]

 [1] "January"   "March"     "April"     "May"       "June"      "July"      "August"    "September"
 [9] "October"   "November"  "December"

month.name[-c(2, 4, 6, 9, 11)]

[1] "January"  "March"    "May"      "July"     "August"   "October"  "December"

Cannot be mixed with positive integers

month.name[c(2, -3)]

Error in month.name[c(2, -3)]: only 0's may be mixed with negative subscripts

Types of indexing: zero

Zero has a special meaning - doesn’t select anything

month.name[0]

character(0)

month.name[integer(0)] ## a zero-length index also selects nothing

character(0)

month.name[c(1, 2, 0, 11, 12)]

[1] "January"  "February" "November" "December"

month.name[-c(1, 2, 0, 11, 12)]

[1] "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"

Types of indexing: logical vector

Indexing by logical vector: select TRUE elements

month.name[c(TRUE, FALSE, FALSE)] # index is recycled if shorter than data

[1] "January" "April"   "July"    "October"

Logical vectors are usually created by logical comparisons

i <- substring(month.name, 1, 1) == "J"
i

 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

month.name[i]

[1] "January" "June"    "July"

Common use: extract subset satisfying a certain condition (also called “filtering”)

(x <- rnorm(20)) # parentheses to print result

 [1] -0.90400018  0.64537142 -0.18131343 -1.57485208  1.28833752 -0.64120477 -0.11357468 -0.60709094
 [9]  0.36187801 -0.02112882 -0.08529980 -0.15286958 -0.11278629 -0.03084374  0.21417339  0.33745118
[17]  0.51941696 -0.98108416 -0.69654672 -1.40327686

x > 0     # element-wise comparison

 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[18] FALSE FALSE FALSE

x[x > 0]  # result of comparison used as logical index vector

[1] 0.6453714 1.2883375 0.3618780 0.2141734 0.3374512 0.5194170

mean(x)

[1] -0.2069622

mean(x[x > 0])

[1] 0.5611047

Types of indexing: logical to integer

Sometimes logical indexing can be replaced by integer indexing using which()

i

 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

which(i)

[1] 1 6 7

month.name[ which(i) ]

[1] "January" "June"    "July"

month.name[ -which(i) ] # same as month.name[ !i ]

[1] "February"  "March"     "April"     "May"       "August"    "September" "October"   "November" 
[9] "December"
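
The equivalence between -which(i) and !i holds only when at least one element of i is TRUE. A minimal sketch of the corner case, using a hypothetical index i_none in which nothing matches:

```r
i_none <- substring(month.name, 1, 1) == "Z"   # no month name starts with "Z"
which(i_none)                          # integer(0)
length(month.name[ !i_none ])          # 12: all months are kept
length(month.name[ -which(i_none) ])   # 0: -integer(0) is integer(0), which selects nothing
```

For this reason, the logical form !i is usually the safer choice when the condition might not match anything.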

Types of indexing: character vectors

R vectors can have names (optional)
These names can be used to index, just like positive integers
Will see examples later
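
As a brief preview, here is a minimal sketch of indexing by name. The vector and its values are made up for illustration:

```r
# a numeric vector with names
age <- c(alice = 31, bob = 45, carol = 27)
names(age)
age["bob"]                 # index by a single name
age[c("carol", "alice")]   # index by a character vector, in any order
```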

Vectorized arithmetic

Arithmetic operations are usually “vectorized”. These include
- Arithmetic operators such as +, -, *, /, ^
- Mathematical functions such as sin(), cos(), log()
- Logical comparisons <, >, <=, >=, == and operators &, |, !
- Almost any other function where it makes sense
They operate element-wise on vectors, producing another vector
Remember that R has no “scalar” type. Scalars are just length-1 vectors
Operations with unequal sized vectors: Shorter vector is repeated / recycled to match longer vector
Example: Recreate height-weight simulation

ht <- rnorm(200, mean = 172, sd = 10)    # height in cm
bmi <- rnorm(200, mean = 22, sd = 2.2)   # bmi (independent of height)
wt <- bmi * (ht / 100)^2

The last step has several vectorized computations:
- ht / 100 divides each element of height by 100 (100 is recycled)
- (ht / 100)^2 squares each element of the result (2 is recycled)
- bmi * (ht / 100)^2 multiplies result with bmi element-wise
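
The recycling rule is easiest to see with small vectors. A short sketch (u is an arbitrary example vector):

```r
u <- 1:6
u + c(10, 20)    # c(10, 20) is recycled three times: 11 22 13 24 15 26
u * 2            # the length-1 vector 2 is recycled six times
u > c(2, 4, 6)   # recycling also applies to comparisons
```

If the longer length is not a multiple of the shorter one, as in 1:5 + c(10, 20), R still recycles but issues a warning.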

Scalars from vectors

Many functions summarize a data vector by producing a scalar

sum(wt)

[1] 13095.85

mean(wt)

[1] 65.47926

sd(wt)

[1] 10.15318

cor(wt, ht)

[1] 0.7942958

Sometimes the summary output can be a vector as well

fivenum(ht) ## minimum, quartiles, and maximum

[1] 145.1637 164.3472 172.3810 179.4889 198.0955

summary(ht) ## similar summary with descriptive names

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  145.2   164.4   172.4   172.1   179.5   198.1

Lists

Atomic vectors must be homogeneous (all elements of the same type)
But we often need to combine different types of data
Lists are vectors with arbitrary types of components
Like atomic vectors, they may or may not have names, but usually do
Usually constructed using the function list()

Example: Suppose we want to record units and a descriptive label along with a data vector

lht <- list(data = ht[1:10], unit = "cm", label = "Simulated height")
lbmi <- list(data = bmi[1:10], unit = "none", label = "Simulated BMI")
lwt <- list(data = wt[1:10], unit = "kg", label = "Weight calculated from height and BMI")

These are now vectors with different types of elements
Can be indexed in the usual way:

lht[1:2]     # index with integers, result is a length-2 list

$data
 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257

$unit
[1] "cm"

lbmi["label"] # index with character, result is a length-1 list

$label
[1] "Simulated BMI"

But often we want to extract a specific element of a list
This is done by indexing with double brackets [[ ... ]]

lht[[ 1 ]]      # index with integer

 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257

lbmi[[ "label" ]] # index with character

[1] "Simulated BMI"

For lists with names, a common alternative is to use $

lwt$label # note the lack of quotes

[1] "Weight calculated from height and BMI"

Lists can themselves contain lists recursively

mydata <- list(height = lht, bmi = lbmi, weight = lwt)
str(mydata)

List of 3
 $ height:List of 3
  ..$ data : num [1:10] 160 169 196 168 175 ...
  ..$ unit : chr "cm"
  ..$ label: chr "Simulated height"
 $ bmi   :List of 3
  ..$ data : num [1:10] 23.9 23.6 22.6 24 20.5 ...
  ..$ unit : chr "none"
  ..$ label: chr "Simulated BMI"
 $ weight:List of 3
  ..$ data : num [1:10] 61.2 67.2 87.3 67.6 62.8 ...
  ..$ unit : chr "kg"
  ..$ label: chr "Weight calculated from height and BMI"

Elements can be extracted recursively

mydata$weight[[1]]

 [1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377

Uses of lists

Lists are very flexible data structures that are widely used
Two very important uses:
- Standard representation of data sets
- To contain results of complex functions (model fitting, tests)

Data frames

Data frames represent rectangular (spreadsheet-like) data
Essentially lists with some additional restrictions
- Elements are viewed as columns in a data set
- Each element / column is (usually) an atomic vector
- Different columns can have different types
- Every column must have a name
Can be created using the data.frame() function
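
That a data frame is essentially a list of columns can be checked directly. A minimal sketch with a toy data frame (the name toy and its contents are illustrative):

```r
toy <- data.frame(id = 1:3, name = c("a", "b", "c"))
is.list(toy)     # TRUE: a data frame is a list of columns
length(toy)      # 2: the number of columns, as for any list
names(toy)       # column names
toy$name         # list-style extraction with $
toy[["id"]]      # ... or with [[ ]]
```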

Data frame: Example

mydlist <- list(height = ht, bmi = bmi, weight = wt, gender = c("M", "F"))
str(mydlist)

List of 4
 $ height: num [1:200] 160 169 196 168 175 ...
 $ bmi   : num [1:200] 23.9 23.6 22.6 24 20.5 ...
 $ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
 $ gender: chr [1:2] "M" "F"

mydf <- data.frame(height = ht, bmi = bmi, weight = wt, gender = c("M", "F"))
str(mydf) # compare with list

'data.frame':   200 obs. of  4 variables:
 $ height: num  160 169 196 168 175 ...
 $ bmi   : num  23.9 23.6 22.6 24 20.5 ...
 $ weight: num  61.2 67.2 87.3 67.6 62.8 ...
 $ gender: Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...

mydf$gender

  [1] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
 [53] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[105] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[157] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
Levels: F M

Lists can be recursive

list(height = lht, bmi = lbmi, weight = lwt)

$height
$height$data
 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257

$height$unit
[1] "cm"

$height$label
[1] "Simulated height"


$bmi
$bmi$data
 [1] 23.94287 23.56564 22.63233 24.00688 20.46678 23.94241 21.71165 21.95390 24.09306 24.94387

$bmi$unit
[1] "none"

$bmi$label
[1] "Simulated BMI"


$weight
$weight$data
 [1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377

$weight$unit
[1] "kg"

$weight$label
[1] "Weight calculated from height and BMI"

But they are ‘flattened’ in data frames

data.frame(height = lht, bmi = lbmi, weight = lwt)

   height.data height.unit     height.label bmi.data bmi.unit     bmi.label weight.data weight.unit
1     159.8666          cm Simulated height 23.94287     none Simulated BMI    61.19160          kg
2     168.8484          cm Simulated height 23.56564     none Simulated BMI    67.18511          kg
3     196.3493          cm Simulated height 22.63233     none Simulated BMI    87.25451          kg
4     167.8481          cm Simulated height 24.00688     none Simulated BMI    67.63458          kg
5     175.1913          cm Simulated height 20.46678     none Simulated BMI    62.81660          kg
6     159.1338          cm Simulated height 23.94241     none Simulated BMI    60.63069          kg
7     177.4504          cm Simulated height 21.71165     none Simulated BMI    68.36702          kg
8     174.6472          cm Simulated height 21.95390     none Simulated BMI    66.96299          kg
9     163.5114          cm Simulated height 24.09306     none Simulated BMI    64.41511          kg
10    178.1257          cm Simulated height 24.94387     none Simulated BMI    79.14377          kg
                            weight.label
1  Weight calculated from height and BMI
2  Weight calculated from height and BMI
3  Weight calculated from height and BMI
4  Weight calculated from height and BMI
5  Weight calculated from height and BMI
6  Weight calculated from height and BMI
7  Weight calculated from height and BMI
8  Weight calculated from height and BMI
9  Weight calculated from height and BMI
10 Weight calculated from height and BMI

Data import

It is much more common to create data frames by importing data from a file
Typical approach: read data from spreadsheet file into data frame
Easiest route:
- R itself cannot read Excel files directly
- Save as CSV file from Excel
- Read with read.csv() or read.table() (more flexible)
Alternative option:
- Use “Import Dataset” menu item in RStudio (supports more formats using add-on packages)

Data export

Data frames can be exported as a spreadsheet file using write.csv() or write.table()

data(Cars93, package = "MASS") # built-in dataset
write.csv(Cars93, file = "cars93.csv") # export
cars <- read.csv("cars93.csv") # import (path relative to working directory)

Most statistical software can read CSV files
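
One detail worth knowing in such a round trip: write.csv() also writes row names by default, and they come back as an extra column on import. A sketch using the built-in mtcars dataset (substituted here so that no add-on package is needed) and a temporary file:

```r
csv_file <- tempfile(fileext = ".csv")     # temporary file, keeps the working directory clean
write.csv(mtcars, file = csv_file)         # row names are written as an unnamed first column
first_col <- names(read.csv(csv_file))[1]
first_col                                  # "X": the row names come back as a regular column

write.csv(mtcars, file = csv_file, row.names = FALSE)   # omit row names on export
cars2 <- read.csv(csv_file)
dim(cars2)                                 # same as dim(mtcars)
```

Alternatively, read.csv(csv_file, row.names = 1) applied to the first export would restore the first column as row names.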

Data import example

Import text dataset data/demog.txt containing demographic data
The contents of the file are:

subjid trt gender race age
101 0 1 3 37
102 1 2 1 65
103 1 1 2 32
104 0 2 1 23
105 1 1 3 44
106 0 2 1 49
201 1 1 3 35
202 0 2 1 50
203 1 1 2 49
204 0 2 1 60
205 1 1 3 39
206 1 2 1 67
301 0 1 1 70
302 0 1 2 55
303 1 1 1 65
304 0 1 1 45
305 1 1 1 36
306 0 1 2 46
401 1 2 1 44
402 0 2 2 77
403 1 1 1 45
404 1 1 1 59
405 0 2 1 49
406 1 1 2 33
501 0 1 2 33
502 1 2 1 44
503 1 1 1 64
504 0 1 3 56
505 1 1 2 73
506 0 1 1 46
507 1 1 2 44
508 0 2 1 53
509 0 1 1 45
510 0 1 3 65
511 1 2 2 43
512 1 1 1 39
601 0 1 1 50
602 0 2 2 30
603 1 2 1 33
604 0 1 1 65
605 1 2 1 57
606 0 1 2 56
607 1 1 1 67
608 0 2 2 46
609 1 2 1 72
610 0 1 1 29
611 1 2 1 65
612 1 1 2 46
701 1 1 1 60
702 0 1 1 28
703 1 1 2 44
704 0 2 1 66
705 1 1 2 46
706 1 1 1 75
707 1 1 1 46
708 0 2 1 55
709 0 2 2 57
710 0 1 1 63
711 1 1 2 61
712 0 . 1 49

Attempt 1:

demog <- read.table("data/demog.txt")
str(demog)

'data.frame':   61 obs. of  5 variables:
 $ V1: Factor w/ 61 levels "101","102","103",..: 61 1 2 3 4 5 6 7 8 9 ...
 $ V2: Factor w/ 3 levels "0","1","trt": 3 1 2 2 1 2 1 2 1 2 ...
 $ V3: Factor w/ 4 levels ".","1","2","gender": 4 2 3 2 3 2 3 2 3 2 ...
 $ V4: Factor w/ 4 levels "1","2","3","race": 4 3 1 2 1 3 1 3 1 2 ...
 $ V5: Factor w/ 34 levels "23","28","29",..: 34 9 26 5 1 12 15 7 16 15 ...

R doesn’t know that the first row gives column headers
All columns have been interpreted as characters and converted to factors (since R 4.0.0 the default is stringsAsFactors = FALSE, so they would remain character vectors)

Attempt 2:

demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE)
str(demog)

'data.frame':   60 obs. of  5 variables:
 $ subjid: int  101 102 103 104 105 106 201 202 203 204 ...
 $ trt   : int  0 1 1 0 1 0 1 0 1 0 ...
 $ gender: chr  "1" "2" "1" "2" ...
 $ race  : int  3 1 2 1 3 1 3 1 2 1 ...
 $ age   : int  37 65 32 23 44 49 35 50 49 60 ...

The gender column is still interpreted as character data
This is because R doesn’t know that missing values are encoded as "."

Attempt 3:

demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE, na.strings = ".")
str(demog)

'data.frame':   60 obs. of  5 variables:
 $ subjid: int  101 102 103 104 105 106 201 202 203 204 ...
 $ trt   : int  0 1 1 0 1 0 1 0 1 0 ...
 $ gender: int  1 2 1 2 1 2 1 2 1 2 ...
 $ race  : int  3 1 2 1 3 1 3 1 2 1 ...
 $ age   : int  37 65 32 23 44 49 35 50 49 60 ...

demog$gender

 [1]  1  2  1  2  1  2  1  2  1  2  1  2  1  1  1  1  1  1  2  2  1  1  2  1  1  2  1  1  1  1  1  2  1  1  2
[36]  1  1  2  2  1  2  1  1  2  2  1  2  1  1  1  1  2  1  1  1  2  2  1  1 NA

The columns trt, gender, and race should actually be categorical
We need more information about the numeric encoding to do this
Will see examples later

Lists as containers of complex results

We have earlier used the lm() function to fit an OLS linear regression model
Let’s see what the output returned by lm() actually looks like

fm <- lm(weight ~ height, data = mydf) # more on this later
str(fm)

List of 12
 $ coefficients : Named num [1:2] -68.022 0.776
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
 $ residuals    : Named num [1:200] 5.19 4.22 2.95 5.44 -5.07 ...
  ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
 $ effects      : Named num [1:200] -926.02 113.77 3.26 5.01 -5.31 ...
  ..- attr(*, "names")= chr [1:200] "(Intercept)" "height" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:200] 56 63 84.3 62.2 67.9 ...
  ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:200, 1:2] -14.1421 0.0707 0.0707 0.0707 0.0707 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:200] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "height"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.02
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 198
 $ xlevels      : Named list()
 $ call         : language lm(formula = weight ~ height, data = mydf)
 $ terms        :Classes 'terms', 'formula'  language weight ~ height
  .. ..- attr(*, "variables")= language list(weight, height)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "weight" "height"
  .. .. .. ..$ : chr "height"
  .. ..- attr(*, "term.labels")= chr "height"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(weight, height)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
 $ model        :'data.frame':  200 obs. of  2 variables:
  ..$ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
  ..$ height: num [1:200] 160 169 196 168 175 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language weight ~ height
  .. .. ..- attr(*, "variables")= language list(weight, height)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "weight" "height"
  .. .. .. .. ..$ : chr "height"
  .. .. ..- attr(*, "term.labels")= chr "height"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(weight, height)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
 - attr(*, "class")= chr "lm"

Another example: t.test() to perform one-sample t-test

tt <- t.test(mydf$bmi, mu = 22) # Is mean BMI equal to 22?
str(tt)

List of 10
 $ statistic  : Named num 0.167
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 199
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.867
 $ conf.int   : num [1:2] 21.7 22.3
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 22
  ..- attr(*, "names")= chr "mean of x"
 $ null.value : Named num 22
  ..- attr(*, "names")= chr "mean"
 $ stderr     : num 0.145
 $ alternative: chr "two.sided"
 $ method     : chr "One Sample t-test"
 $ data.name  : chr "mydf$bmi"
 - attr(*, "class")= chr "htest"

These details are usually unimportant for regular use
Printing these results will show “user-friendly” output

print(tt)


    One Sample t-test

data:  mydf$bmi
t = 0.16719, df = 199, p-value = 0.8674
alternative hypothesis: true mean is not equal to 22
95 percent confidence interval:
 21.73774 22.31084
sample estimates:
mean of x 
 22.02429

But to develop our own analysis tools, we will need some understanding of how this happens
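
Since these results are ordinary lists, individual components can be extracted with $ or [[ ]] for further computation. A minimal sketch on freshly simulated data (the seed and the names bmi_sim and tt_sim are arbitrary choices):

```r
set.seed(1)                                  # arbitrary seed, for reproducibility
bmi_sim <- rnorm(200, mean = 22, sd = 2.2)   # simulated data, similar to the earlier example
tt_sim <- t.test(bmi_sim, mu = 22)
names(tt_sim)          # components of the list
tt_sim$p.value         # a single number
tt_sim$conf.int        # confidence interval, with the confidence level as an attribute
tt_sim[["estimate"]]   # same as tt_sim$estimate
```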

Another important data structure: matrix / array

Matrices and arrays arise very naturally in statistics
Two common uses:
- Model matrix for linear models
- Contingency tables
Example: built-in dataset giving death rates (per 1000) for demographic subgroups

VADeaths

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0

dim(VADeaths) # dimension of the matrix

[1] 5 4

Unlike data frames, they are always homogeneous (all elements of same type)
Matrices have row and column indexes, and possibly names
Indexing works in the same way as vectors, but in two dimensions (separated by ,)

VADeaths[1:2, c(2, 3)]

      Rural Female Urban Male
50-54          8.7       15.4
55-59         11.7       24.3

Indexing by “empty” index selects all rows / columns

VADeaths[, c("Rural Male", "Rural Female")]

      Rural Male Rural Female
50-54       11.7          8.7
55-59       18.1         11.7
60-64       26.9         20.3
65-69       41.0         30.9
70-74       66.0         54.3

Such indexing also works for data frames
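
A short sketch of the same two-dimensional indexing applied to a data frame, using the built-in mtcars dataset:

```r
mtcars[1:3, c("mpg", "cyl")]   # first three rows, two columns selected by name
mtcars[mtcars$mpg > 30, ]      # logical row index, all columns
head(mtcars[, "mpg"])          # a single column is returned as a plain vector
```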

Creating a matrix

There are many ways to create a matrix
Example: matrix() constructs matrix by providing data and dimensions

matrix(1:12, nrow = 3, ncol = 4)               # fills up columns first by default

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE) # fills up rows first

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

cbind() and rbind() construct matrices by combining columns or rows

X <- cbind(int = 1, ht = mydf$height) # Note: 1 is recycled
head(X, 15)

      int       ht
 [1,]   1 159.8666
 [2,]   1 168.8484
 [3,]   1 196.3493
 [4,]   1 167.8481
 [5,]   1 175.1913
 [6,]   1 159.1338
 [7,]   1 177.4504
 [8,]   1 174.6472
 [9,]   1 163.5114
[10,]   1 178.1257
[11,]   1 175.2149
[12,]   1 155.5292
[13,]   1 157.8308
[14,]   1 160.7194
[15,]   1 177.3041

X is the design matrix for linear regression on height (including intercept)

Matrix operations

Standard matrix operations: transpose t() and matrix product %*%
Can be used to solve linear regression equation

y <- mydf$weight
XtX <- t(X) %*% X  # transpose and matrix multiplication
XtX

         int         ht
int   200.00   34416.96
ht  34416.96 5944139.57

Xty <- t(X) %*% y
solve(XtX, Xty) # solves normal equations X'X beta = X'y

           [,1]
int -68.0215728
ht    0.7757852

fm$coefficients # compare with earlier result from lm()

(Intercept)      height 
-68.0215728   0.7757852
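
The same calculation can also be written with crossprod(), where crossprod(A, B) computes t(A) %*% B. The sketch below is self-contained and uses newly simulated data; the seed, the coefficients, and all variable names are made up for illustration:

```r
set.seed(2)                                    # arbitrary seed
h <- rnorm(50, mean = 172, sd = 10)            # simulated heights
w <- -68 + 0.78 * h + rnorm(50, sd = 6)        # simulated weights, made-up coefficients
Xh <- cbind(int = 1, ht = h)                   # design matrix with intercept
b1 <- solve(t(Xh) %*% Xh, t(Xh) %*% w)         # normal equations, as above
b2 <- solve(crossprod(Xh), crossprod(Xh, w))   # same equations using crossprod()
cbind(b1, b2, lm = coef(lm(w ~ h)))            # three columns that should agree
```

Note that lm() itself works through a QR decomposition (the qr component seen earlier) rather than by forming the normal equations, so agreement is up to numerical error.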

Matrix representation

Internally, matrices and arrays are stored as vectors along with a dimension

x <- 1:12
x

 [1]  1  2  3  4  5  6  7  8  9 10 11 12

dim(x) <- c(4, 3)
x

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

General arrays can have more than two dimensions

dim(x) <- c(2, 2, 3) # three-dimensional array
x

, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12

Incidentally, assignments where the left-hand side looks like a function call are a special feature of R
These modify some aspect of an already existing variable
These are known as replacement functions
The underlying vector nature of a matrix is easy to verify

VADeaths

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0

VADeaths[4:10] # index as vector (data is stored columnwise)

[1] 41.0 66.0  8.7 11.7 20.3 30.9 54.3
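
To make the idea of replacement functions mentioned above more concrete: dim(x) <- value is carried out by a function named `dim<-`, and users can define their own replacement functions in the same way. A minimal sketch, where `second<-` is a hypothetical function written only for this illustration:

```r
v <- 1:6
dim(v) <- c(2, 3)               # usual form
v2 <- `dim<-`(1:6, c(2, 3))     # the same replacement function, called explicitly
identical(v, v2)

# a user-defined replacement function: the last argument must be named 'value'
`second<-` <- function(x, value) {
    x[2] <- value
    x
}
z <- c(10, 20, 30)
second(z) <- 99                 # replaces the second element of z
z
```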

Next steps

This background is enough to start on typical data analysis tasks
But before that, we also need to learn about accessing documentation
This requires a brief discussion of the class system in R

The class of R objects

Every R object must have a class

class(mydata)

[1] "list"

class(mydf)

[1] "data.frame"

class(mydf$weight)

[1] "numeric"

class(fm)

[1] "lm"

class(tt) # for 'hypothesis test'

[1] "htest"

Some functions are ‘generic’ functions

Generic functions are placeholder functions
They perform different tasks depending on the class of the argument passed to them
For example, summary() is a generic function

summary(mydf)

     height           bmi            weight      gender 
 Min.   :145.2   Min.   :16.43   Min.   :44.19   F:100  
 1st Qu.:164.4   1st Qu.:20.67   1st Qu.:58.35   M:100  
 Median :172.4   Median :22.21   Median :64.91          
 Mean   :172.1   Mean   :22.02   Mean   :65.48          
 3rd Qu.:179.5   3rd Qu.:23.45   3rd Qu.:71.78          
 Max.   :198.1   Max.   :28.41   Max.   :90.05

summary(mydf$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  44.19   58.35   64.91   65.48   71.78   90.05

summary(fm)


Call:
lm(formula = weight ~ height, data = mydf)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.7074  -3.9083   0.4285   4.2131  17.5302 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -68.02157    7.26984  -9.357   <2e-16 ***
height        0.77579    0.04217  18.397   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.184 on 198 degrees of freedom
Multiple R-squared:  0.6309,    Adjusted R-squared:  0.629 
F-statistic: 338.4 on 1 and 198 DF,  p-value: < 2.2e-16

Methods of generic functions

methods("summary")

 [1] summary.aov                    summary.aovlist*               summary.aspell*               
 [4] summary.check_packages_in_dir* summary.connection             summary.data.frame            
 [7] summary.Date                   summary.default                summary.ecdf*                 
[10] summary.factor                 summary.glm                    summary.infl*                 
[13] summary.lm                     summary.loess*                 summary.manova                
[16] summary.matrix                 summary.mlm*                   summary.nls*                  
[19] summary.packageStatus*         summary.POSIXct                summary.POSIXlt               
[22] summary.ppr*                   summary.prcomp*                summary.princomp*             
[25] summary.proc_time              summary.srcfile                summary.srcref                
[28] summary.stepfun                summary.stl*                   summary.table                 
[31] summary.tukeysmooth*           summary.warnings              
see '?methods' for accessing help and source code
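
The naming pattern in this list (generic.class) is how methods are found. The sketch below defines a toy generic to show the mechanism; describe() and its methods are hypothetical and not part of R:

```r
describe <- function(x, ...) UseMethod("describe")   # the generic only dispatches
describe.default <- function(x, ...) cat("an object of class", class(x)[1], "\n")
describe.numeric <- function(x, ...) cat("a numeric vector of length", length(x), "\n")
describe.data.frame <- function(x, ...) cat("a data frame with", ncol(x), "columns\n")

describe(c(1.5, 2.5))   # dispatches to describe.numeric
describe(mtcars)        # dispatches to describe.data.frame
describe("text")        # no character method, so describe.default is used
```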

The print() function

A special generic function is called print()
Whenever the result of an evaluation is not assigned to a variable, it is “auto-printed”
This is done using the print() generic function
For example:

print(fm) # same as just entering fm


Call:
lm(formula = weight ~ height, data = mydf)

Coefficients:
(Intercept)       height  
   -68.0216       0.7758
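
Auto-printing works the same way for user-defined classes. A minimal sketch, assuming a made-up class "temperature" with its own print method:

```r
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")
print.temperature <- function(x, ...) {
    cat(x$value, "degrees", x$unit, "\n")
    invisible(x)    # print methods conventionally return their argument invisibly
}
temp            # auto-printing calls print(), which dispatches to print.temperature()
print(temp)     # same output
unclass(temp)   # the underlying list, printed by the default method
```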

Documentation

Every dataset and function in R is documented in a help page
The documentation for a function can be accessed by ? or help()

?seq
help(cbind)
help(VADeaths)

Documentation of generic functions and methods

Generic functions have their own documentation page

help(summary)

The documentation for a specific method may be in a different page

help(summary.lm)

Note that you should always call the generic summary(), and never call summary.lm() directly

Understanding function documentation

Most useful things in R happen by calling functions
Functions have one or more arguments
- All arguments have names
- Arguments may be compulsory or optional
- Optional arguments have “default” values
Functions normally also have a useful “return” value
These are all described in the help page
Arguments may or may not be named when calling a function
If not named, arguments are matched by position
Conventionally, optional arguments are named, compulsory arguments are often not named
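
A short sketch of these matching rules using seq(), which appeared earlier (s1, s2, s3 are arbitrary names):

```r
s1 <- seq(0, 1, by = 0.25)               # from and to matched by position, by matched by name
s2 <- seq(from = 0, to = 1, by = 0.25)   # all arguments named
s3 <- seq(by = 0.25, to = 1, from = 0)   # named arguments can be given in any order
identical(s1, s2)
identical(s1, s3)
args(sd)   # x is compulsory; na.rm is optional, with default value FALSE
```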

Exercises

Load 02-rbasics.rmd in RStudio and work through the examples
Read the help page for mean() and median()
Here are two simple numeric vectors containing NA and Inf values

x <- c(1:5, NA, 7:10)
y <- c(1:5, Inf, 7:10)

Find the mean and median of these vectors
How can you make R ignore the NA value?
How can you make R ignore the Inf value? Hint: see ?is.finite

Go through the help pages for write.table() and read.table() to understand how they work. Skip non-essential details.
Export the demog data frame as a CSV file named "demog.csv" using write.csv()
Import this newly created dataset again using read.csv(), saving it as a variable named d
Do you need to specify the header argument? Why?
Do you need to specify the na.strings argument? Why?

The goal of the next exercise is to convert d$race into a factor
Read the help page for factor() to learn how to create factors
d$race has three values: 1 = White, 2 = Black, 3 = Other
Create and add a new factor variable d$frace to the data frame d
d$frace should have “levels” 1, 2, 3, and corresponding “labels” White, Black, Other

The goal of the next exercise is to import a SAS format dataset and use it to perform a t-test
SAS can export data in a binary format with extension sas7bdat
This is not a documented export format, and is not meant to be imported into other software
However, it is common enough that there is a contributed add-on package that has attempted to reverse-engineer the format
To use this package, we first need to load it into R as follows

## install.packages("sas7bdat") # install package if not already installed (needed only once)
library(package = "sas7bdat")

Once the package is loaded, read the help page for read.sas7bdat()
Use it to import the file sasdata/twosample.sas7bdat
The imported data frame should have variables PATNO, TRT, FEV0 and FEV6

Exercises

The story behind the dataset is as follows:

A new compound, ABC-123, is being developed for long-term treatment of patients with chronic asthma. Asthmatic patients were enrolled in a double blind study and randomized to receive daily oral doses of ABC-123 or a placebo for 6 weeks. The primary measurement of interest is the resting FEV1 (forced expiratory volume during the first second of expiration), which is measured before (as FEV0) and at the end (as FEV6) of the 6-week treatment period.

Add a variable called CHG to the dataset recording the change in FEV1
Read the help page for t.test()
Does administration of ABC-123 have any effect on FEV1? Answer this by performing a two-sample t-test
Note that there may be multiple approaches to arrive at the correct answer
