Basic usage of R

Deepayan Sarkar

Basics of using R

R is more flexible than a regular calculator

  • In fact, R is a full programming language

  • Most standard data analysis methods are already implemented

  • Can be extended by writing add-on packages

  • Thousands of add-on packages are available

Major concepts

  • Variables (in the context of programming)

  • Data structures needed for data analysis

  • Functions (set of instructions for performing a procedure)

Variables

  • Variables are symbols that may be associated with different values

  • Computations involving variables are done using their current value

[1] 3.162278
Warning in sqrt(x): NaNs produced
[1] NaN
[1] 0+1i
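
The commands that produced this output are not shown. A minimal sketch that gives the same three results, assuming a variable named x and these particular values, could be:

```r
x <- 10
r1 <- sqrt(x)          # uses the current value of x (about 3.162278)
x <- -1                # the same symbol now has a different value
r2 <- sqrt(x)          # NaN, with a warning, because x is negative
x <- as.complex(-1)    # a complex value instead of a real one
r3 <- sqrt(x)          # 0+1i
print(r1); print(r2); print(r3)
```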

Data structures for data analysis

  • Vectors

  • Lists (general collection of objects)

  • Data frames (a spreadsheet-like data set)

  • Matrices

Atomic vectors

  • Indexed collection of homogeneous scalars, can be

    • Numeric / Integer
    • Categorical (factor)
    • Character
    • Logical (TRUE / FALSE)
  • Missing values are allowed, indicated as NA

  • Elements are indexed starting from 1

  • i-th element of vector x can be extracted using x[i]

  • There are also more sophisticated forms of (vector) indexing

Atomic vectors: examples

 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"      "August"   
 [9] "September" "October"   "November"  "December" 
 [1] -0.01001108  0.11252941  0.75957249  0.13169291 -0.61549488 -0.38971443  0.09061209  1.90299126
 [9]  0.71969075  0.28110559
 num [1:10] -0.01 0.113 0.76 0.132 -0.615 ...
 chr [1:12] "January" "February" "March" "April" "May" "June" "July" "August" "September" "October" ...
 [1]  8  9  6  5  2  5 11  8  6  3  2  3  7  6  6  7 11 12  2  5  4  9  3  5  7  1  1 11  8  2
 [1] August    September June      May       February  May       November  August    June      March    
[11] February  March     July      June      June      July      November  December  February  May      
[21] April     September March     May       July      January   January   November  August    February 
Levels: January February March April May June July August September October November December
 int [1:30] 8 9 6 5 2 5 11 8 6 3 ...
 Factor w/ 12 levels "January","February",..: 8 9 6 5 2 5 11 8 6 3 ...
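
The output above is consistent with commands along the following lines. This is a sketch, not the original code: the variable names x, m and mf are assumptions, and the random values will differ from run to run.

```r
month.name                                        # built-in character vector
x <- rnorm(10)                                    # 10 random normal values
str(x)                                            # compact description: num [1:10] ...
str(month.name)                                   # chr [1:12] ...
m <- sample(1:12, 30, replace = TRUE)             # 30 random month numbers (integer)
mf <- factor(month.name[m], levels = month.name)  # the same data as a factor
str(m)                                            # int [1:30] ...
str(mf)                                           # Factor w/ 12 levels ...
```

Internally a factor stores integer codes, which is why str(mf) shows the same numbers as m.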

Atomic vectors

  • “Scalars” are just vectors of length 1
 num [1:2] 0 0
 num 0
 num 0
  • Vectors can have length zero
numeric(0)
logical(0)
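
A short sketch of calls that would print output like the above (the exact original calls are an assumption):

```r
str(numeric(2))       # num [1:2] 0 0
str(numeric(1))       # num 0
str(0)                # num 0: a "scalar" is a vector of length 1
length(0)             # 1
numeric(0)            # a numeric vector of length zero
logical(0)            # a logical vector of length zero
length(numeric(0))    # 0
```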

Creating vectors

  • Using functions that return vectors
 [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 0.90 0.95 1.00
 [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
 [1]  1  2  3  4  5  6  7  8  9 10
[1] -0.05877929  1.24178994  1.12618568 -0.28574712  0.33757875
  • Using the c() function to combine smaller vectors
 [1]  0  1  1  2  3  5  8 13 21 34
 [1]  3.957564195  1.269926679  0.009469478  0.182340988  0.537683329 -0.316325890 -1.533174163 -0.822603301
 [9] -3.883589422 -1.762983274
[1] "Hearts"   "Spades"   "Diamonds" "Clubs"   
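
The following sketch shows calls that produce vectors like those printed above. The arguments to rnorm() are assumptions, since the original commands are not shown, and the random values will differ.

```r
s <- seq(0, 1, by = 0.05)            # regular sequence from 0 to 1
r <- rep(1:3, each = 5)              # each of 1, 2, 3 repeated 5 times
i <- 1:10                            # consecutive integers
z <- rnorm(5)                        # 5 random normal values

fib <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)            # combine scalars
zz <- c(rnorm(5, sd = 3), rnorm(5))                  # combine two shorter vectors
suits <- c("Hearts", "Spades", "Diamonds", "Clubs")  # character vector
```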

Types of indexing

  • Indexing refers to extracting subsets of vectors (or other kinds of data)

  • R supports several kinds of indexing:

    • Indexing by a vector of positive integers

    • Indexing by a vector of negative integers

    • Indexing by a logical vector

    • Indexing by a vector of names

Types of indexing: positive integers

  • The “standard” C-like indexing with a scalar (vector of length 1):
[1] "February"


  • The “index” can also be an integer vector
[1] "February"  "April"     "June"      "September" "November" 


  • Elements can be repeated
[1] "February" "February" "June"     "April"    "June"     "November"
  • “Out-of-bounds” indexing gives NA (missing)
[1] NA
[1] "January"   "March"     "May"       "July"      "September" "November"  NA          NA         
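
Index expressions that give the results shown on this slide, using the built-in vector month.name (the particular index vectors are reconstructed from the output):

```r
month.name[2]                                # single element
month.name[c(2, 4, 6, 9, 11)]                # several elements
month.name[c(2, 2, 6, 4, 6, 11)]             # repeats are allowed
month.name[13]                               # out of bounds: NA
month.name[seq(1, by = 2, length.out = 8)]   # indices 13 and 15 give NA
```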


Types of indexing: negative integers

  • Negative integers omit the specified entries
 [1] "January"   "March"     "April"     "May"       "June"      "July"      "August"    "September"
 [9] "October"   "November"  "December" 
[1] "January"  "March"    "May"      "July"     "August"   "October"  "December"


  • Cannot be mixed with positive integers
Error in month.name[c(2, -3)]: only 0's may be mixed with negative subscripts
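
A sketch of the corresponding calls. The last one is wrapped in try() so that the sketch keeps running after the error.

```r
month.name[-2]                      # everything except February
month.name[-c(2, 4, 6, 9, 11)]      # omit several elements
res <- try(month.name[c(2, -3)], silent = TRUE)   # mixing signs is an error
inherits(res, "try-error")          # TRUE
```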

Types of indexing: zero

  • Zero has a special meaning: it does not select anything
character(0)
character(0)
[1] "January"  "February" "November" "December"
[1] "March"     "April"     "May"       "June"      "July"      "August"    "September" "October"  
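
Index expressions consistent with the four results above (reconstructed from the output):

```r
month.name[0]                      # zero selects nothing: character(0)
month.name[integer(0)]             # so does an empty index vector
month.name[c(1, 2, 0, 11, 12)]     # zeros are ignored among positive indices
month.name[-c(1, 2, 0, 11, 12)]    # and among negative indices
```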

Types of indexing: logical vector

  • Indexing by logical vector: select TRUE elements
[1] "January" "April"   "July"    "October"


  • Logical vectors are usually created by logical comparisons
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[1] "January" "June"    "July"   
  • Common use: extract subset satisfying a certain condition (also called “filtering”)
 [1] -0.90400018  0.64537142 -0.18131343 -1.57485208  1.28833752 -0.64120477 -0.11357468 -0.60709094
 [9]  0.36187801 -0.02112882 -0.08529980 -0.15286958 -0.11278629 -0.03084374  0.21417339  0.33745118
[17]  0.51941696 -0.98108416 -0.69654672 -1.40327686
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[18] FALSE FALSE FALSE
[1] 0.6453714 1.2883375 0.3618780 0.2141734 0.3374512 0.5194170
[1] -0.2069622
[1] 0.5611047
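
A sketch of logical indexing in the same spirit. The variable names j and x are assumptions, and the random values will differ from those printed above.

```r
month.name[c(TRUE, FALSE, FALSE)]          # the logical index is recycled
j <- substring(month.name, 1, 1) == "J"    # TRUE for months starting with J
month.name[j]

x <- rnorm(20)
x > 0                # logical vector of the same length as x
x[x > 0]             # "filtering": keep only the positive values
mean(x)              # mean of all values
mean(x[x > 0])       # mean of the positive values only
```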

Types of indexing: logical to integer

  • Sometimes logical indexing can be replaced by integer indexing using which()
 [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[1] 1 6 7
[1] "January" "June"    "July"   
[1] "February"  "March"     "April"     "May"       "August"    "September" "October"   "November" 
[9] "December" 
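
A sketch of the calls behind this output (the variable name j is an assumption):

```r
j <- substring(month.name, 1, 1) == "J"
which(j)                  # positions of the TRUE elements: 1 6 7
month.name[which(j)]      # same result as month.name[j]
month.name[-which(j)]     # same result as month.name[!j]
```

One caution about the last form: if no element of j were TRUE, which(j) would be integer(0) and the negative index would select nothing, whereas month.name[!j] would return every month.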

Types of indexing: character vectors

  • R vectors can have names (optional)

  • These names can be used to index, just like positive integers

  • Will see examples later
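
As a brief preview, a minimal sketch with a small made-up named vector (the names a, b, c are purely illustrative):

```r
x <- c(a = 1, b = 2, c = 3)   # names given when creating the vector
names(x)                      # "a" "b" "c"
x["b"]                        # index by a single name
x[c("c", "a")]                # index by a vector of names
```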

Vectorized arithmetic

  • Arithmetic operations are usually “vectorized”. These include

    • Arithmetic operators such as +, -, *, /, ^

    • Mathematical functions such as sin(), cos(), log()

    • Logical comparisons <, >, <=, >=, == and operators &, |, !

    • Almost any other function where it makes sense

  • They operate element-wise on vectors, producing another vector

  • Remember that R has no “scalar” type. Scalars are just length-1 vectors

  • Operations with unequal sized vectors: Shorter vector is repeated / recycled to match longer vector

  • Example: Recreate height-weight simulation

  • The last step has several vectorized computations:

    • ht / 100 divides each element of height by 100 (100 is recycled)

    • (ht / 100)^2 squares each element of the result (2 is recycled)

    • bmi * (ht / 100)^2 multiplies result with bmi element-wise
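
A sketch of such a simulation. The sample size of 200 matches the later output, but the seed and the means and standard deviations passed to rnorm() are assumptions, not the values used originally.

```r
set.seed(1)                              # arbitrary seed, for repeatability
ht  <- rnorm(200, mean = 170, sd = 10)   # assumed: heights in cm
bmi <- rnorm(200, mean = 22, sd = 2)     # assumed: BMI values
wt  <- bmi * (ht / 100)^2                # weight in kg, computed element-wise
str(wt)

# Recycling with vectors of unequal length
c(1, 2, 3, 4) + c(10, 20)                # 11 22 13 24
```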

Scalars from vectors

  • Many functions summarize a data vector by producing a scalar
[1] 13095.85
[1] 65.47926
[1] 10.15318
[1] 0.7942958
  • Sometimes the summary output can be a vector as well
[1] 145.1637 164.3472 172.3810 179.4889 198.0955
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  145.2   164.4   172.4   172.1   179.5   198.1 
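
The functions behind the numbers above are not shown. The sketch below uses plausible candidates (sum(), mean(), sd(), cor(), fivenum(), summary()) on simulated data whose parameters are assumptions.

```r
set.seed(1)
ht <- rnorm(200, mean = 170, sd = 10)            # assumed simulation settings
wt <- rnorm(200, mean = 22, sd = 2) * (ht / 100)^2

sum(wt)        # one number: the total
mean(wt)       # one number: the average
sd(wt)         # one number: the standard deviation
cor(ht, wt)    # one number: correlation between two vectors

fivenum(ht)    # vector of length 5 (Tukey's five-number summary)
summary(ht)    # named summary of length 6
```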

Lists

  • Atomic vectors must be homogeneous (all elements of the same type)

  • But we often need to combine different types of data

  • Lists are vectors with arbitrary types of components

  • Like atomic vectors, they may or may not have names, but usually do

  • Usually constructed using the function list()

Example: Suppose we want to record units and a descriptive label along with a data vector

  • These are now vectors with different types of elements

  • Can be indexed in the usual way:

$data
 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257

$unit
[1] "cm"
$label
[1] "Simulated BMI"
  • But often we want to extract a specific element of a list

  • This is done by indexing with double brackets [[ ... ]]

 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257
[1] "Simulated BMI"


  • For lists with names, a common alternative is to use $
[1] "Weight calculated from height and BMI"
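
A minimal sketch of constructing and indexing such a list. The list name L, the three heights and the label text are illustrative assumptions.

```r
ht <- c(159.9, 168.8, 196.3)       # a few illustrative heights
L <- list(data = ht, unit = "cm", label = "Simulated height")

L[1:2]         # single brackets: the result is still a list
L[["data"]]    # double brackets: the element itself (a numeric vector)
L[[3]]         # double brackets also work with positions
L$label        # $ is shorthand for [["label"]]
```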

Lists can themselves contain lists recursively

List of 3
 $ height:List of 3
  ..$ data : num [1:10] 160 169 196 168 175 ...
  ..$ unit : chr "cm"
  ..$ label: chr "Simulated height"
 $ bmi   :List of 3
  ..$ data : num [1:10] 23.9 23.6 22.6 24 20.5 ...
  ..$ unit : chr "none"
  ..$ label: chr "Simulated BMI"
 $ weight:List of 3
  ..$ data : num [1:10] 61.2 67.2 87.3 67.6 62.8 ...
  ..$ unit : chr "kg"
  ..$ label: chr "Weight calculated from height and BMI"


  • Elements can be extracted recursively
 [1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377
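
A small sketch of a nested list and recursive extraction, with made-up values and an assumed list name rec:

```r
rec <- list(
    height = list(data = c(160, 169), unit = "cm", label = "Simulated height"),
    weight = list(data = c(61.2, 67.2), unit = "kg", label = "Weight")
)
str(rec)                      # shows the nested structure
rec$weight$data               # extract step by step with $
rec[["weight"]][["data"]]     # equivalent, with double brackets
```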

Uses of lists

  • Lists are very flexible data structures that are widely used

  • Two very important uses:

    • Standard representation of data sets

    • To contain results of complex functions (model fitting, tests)

Data frames

  • Data frames represent rectangular (spreadsheet-like) data

  • Essentially lists with some additional restrictions

    • Elements are viewed as columns in a data set

    • Each element / column is (usually) an atomic vector

    • Different columns can have different types

    • Every column must have a name

  • Can be created using the data.frame() function

Data frame: Example

List of 4
 $ height: num [1:200] 160 169 196 168 175 ...
 $ bmi   : num [1:200] 23.9 23.6 22.6 24 20.5 ...
 $ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
 $ gender: chr [1:2] "M" "F"
'data.frame':   200 obs. of  4 variables:
 $ height: num  160 169 196 168 175 ...
 $ bmi   : num  23.9 23.6 22.6 24 20.5 ...
 $ weight: num  61.2 67.2 87.3 67.6 62.8 ...
 $ gender: Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 1 2 1 ...
  [1] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
 [53] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[105] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
[157] M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F M F
Levels: F M
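
A sketch of how a data frame like this could be built. The simulation settings are assumptions. The Factor column in the output suggests strings were converted to factors, which was the default before R 4.0.0; the sketch requests it explicitly.

```r
set.seed(1)
ht  <- rnorm(200, mean = 170, sd = 10)   # assumed simulation settings
bmi <- rnorm(200, mean = 22, sd = 2)
wt  <- bmi * (ht / 100)^2

# gender has length 2 and is recycled to match the 200 rows
mydf <- data.frame(height = ht, bmi = bmi, weight = wt,
                   gender = c("M", "F"), stringsAsFactors = TRUE)
str(mydf)
head(mydf$gender)    # a column is extracted just like a list element
```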

Lists can be recursive

$height
$height$data
 [1] 159.8666 168.8484 196.3493 167.8481 175.1913 159.1338 177.4504 174.6472 163.5114 178.1257

$height$unit
[1] "cm"

$height$label
[1] "Simulated height"


$bmi
$bmi$data
 [1] 23.94287 23.56564 22.63233 24.00688 20.46678 23.94241 21.71165 21.95390 24.09306 24.94387

$bmi$unit
[1] "none"

$bmi$label
[1] "Simulated BMI"


$weight
$weight$data
 [1] 61.19160 67.18511 87.25451 67.63458 62.81660 60.63069 68.36702 66.96299 64.41511 79.14377

$weight$unit
[1] "kg"

$weight$label
[1] "Weight calculated from height and BMI"

But they are ‘flattened’ in data frames

   height.data height.unit     height.label bmi.data bmi.unit     bmi.label weight.data weight.unit
1     159.8666          cm Simulated height 23.94287     none Simulated BMI    61.19160          kg
2     168.8484          cm Simulated height 23.56564     none Simulated BMI    67.18511          kg
3     196.3493          cm Simulated height 22.63233     none Simulated BMI    87.25451          kg
4     167.8481          cm Simulated height 24.00688     none Simulated BMI    67.63458          kg
5     175.1913          cm Simulated height 20.46678     none Simulated BMI    62.81660          kg
6     159.1338          cm Simulated height 23.94241     none Simulated BMI    60.63069          kg
7     177.4504          cm Simulated height 21.71165     none Simulated BMI    68.36702          kg
8     174.6472          cm Simulated height 21.95390     none Simulated BMI    66.96299          kg
9     163.5114          cm Simulated height 24.09306     none Simulated BMI    64.41511          kg
10    178.1257          cm Simulated height 24.94387     none Simulated BMI    79.14377          kg
                            weight.label
1  Weight calculated from height and BMI
2  Weight calculated from height and BMI
3  Weight calculated from height and BMI
4  Weight calculated from height and BMI
5  Weight calculated from height and BMI
6  Weight calculated from height and BMI
7  Weight calculated from height and BMI
8  Weight calculated from height and BMI
9  Weight calculated from height and BMI
10 Weight calculated from height and BMI
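
A small sketch of this flattening, using a made-up nested list named rec. Converting it with as.data.frame() is an assumption about how the table above was produced.

```r
rec <- list(
    height = list(data = c(159.9, 168.8, 196.3), unit = "cm",
                  label = "Simulated height"),
    bmi    = list(data = c(23.9, 23.6, 22.6), unit = "none",
                  label = "Simulated BMI")
)
flat <- as.data.frame(rec)
names(flat)    # column names combine the outer and inner names
flat           # length-1 elements are recycled down the rows
```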

Data import

  • It is much more common to create data frames by importing data from a file

  • Typical approach: read data from spreadsheet file into data frame

  • Easiest route:

    • R itself cannot read Excel files directly

    • Save as CSV file from Excel

    • Read with read.csv() or read.table() (more flexible)

  • Alternative option:

    • Use the “Import Dataset” menu item in RStudio (supports more formats using add-on packages)

Data export

  • Data frames can be exported as a spreadsheet file using write.csv() or write.table()
  • Most statistical software can read CSV files

Data import example

  • Import text dataset data/demog.txt containing demographic data

  • The contents of the file are:

subjid trt gender race age
101 0 1 3 37
102 1 2 1 65
103 1 1 2 32
104 0 2 1 23
105 1 1 3 44
106 0 2 1 49
201 1 1 3 35
202 0 2 1 50
203 1 1 2 49
204 0 2 1 60
205 1 1 3 39
206 1 2 1 67
301 0 1 1 70
302 0 1 2 55
303 1 1 1 65
304 0 1 1 45
305 1 1 1 36
306 0 1 2 46
401 1 2 1 44
402 0 2 2 77
403 1 1 1 45
404 1 1 1 59
405 0 2 1 49
406 1 1 2 33
501 0 1 2 33
502 1 2 1 44
503 1 1 1 64
504 0 1 3 56
505 1 1 2 73
506 0 1 1 46
507 1 1 2 44
508 0 2 1 53
509 0 1 1 45
510 0 1 3 65
511 1 2 2 43
512 1 1 1 39
601 0 1 1 50
602 0 2 2 30
603 1 2 1 33
604 0 1 1 65
605 1 2 1 57
606 0 1 2 56
607 1 1 1 67
608 0 2 2 46
609 1 2 1 72
610 0 1 1 29
611 1 2 1 65
612 1 1 2 46
701 1 1 1 60
702 0 1 1 28
703 1 1 2 44
704 0 2 1 66
705 1 1 2 46
706 1 1 1 75
707 1 1 1 46
708 0 2 1 55
709 0 2 2 57
710 0 1 1 63
711 1 1 2 61
712 0 . 1 49

Attempt 1:

'data.frame':   61 obs. of  5 variables:
 $ V1: Factor w/ 61 levels "101","102","103",..: 61 1 2 3 4 5 6 7 8 9 ...
 $ V2: Factor w/ 3 levels "0","1","trt": 3 1 2 2 1 2 1 2 1 2 ...
 $ V3: Factor w/ 4 levels ".","1","2","gender": 4 2 3 2 3 2 3 2 3 2 ...
 $ V4: Factor w/ 4 levels "1","2","3","race": 4 3 1 2 1 3 1 3 1 2 ...
 $ V5: Factor w/ 34 levels "23","28","29",..: 34 9 26 5 1 12 15 7 16 15 ...
  • R doesn’t know that the first row gives column headers

  • All columns have been interpreted as characters and converted to factors

Attempt 2:

'data.frame':   60 obs. of  5 variables:
 $ subjid: int  101 102 103 104 105 106 201 202 203 204 ...
 $ trt   : int  0 1 1 0 1 0 1 0 1 0 ...
 $ gender: chr  "1" "2" "1" "2" ...
 $ race  : int  3 1 2 1 3 1 3 1 2 1 ...
 $ age   : int  37 65 32 23 44 49 35 50 49 60 ...
  • The gender column is still interpreted as character data

  • This is because R doesn’t know that missing values are encoded as "."

Attempt 3:

'data.frame':   60 obs. of  5 variables:
 $ subjid: int  101 102 103 104 105 106 201 202 203 204 ...
 $ trt   : int  0 1 1 0 1 0 1 0 1 0 ...
 $ gender: int  1 2 1 2 1 2 1 2 1 2 ...
 $ race  : int  3 1 2 1 3 1 3 1 2 1 ...
 $ age   : int  37 65 32 23 44 49 35 50 49 60 ...
 [1]  1  2  1  2  1  2  1  2  1  2  1  2  1  1  1  1  1  1  2  2  1  1  2  1  1  2  1  1  1  1  1  2  1  1  2
[36]  1  1  2  2  1  2  1  1  2  2  1  2  1  1  1  1  2  1  1  1  2  2  1  1 NA
  • The columns trt, gender, and race should actually be categorical

  • We need more information about the numeric encoding to do this

  • Will see examples later
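
The three attempts can be sketched as follows. To keep the sketch self-contained it writes a few lines in the same layout to a temporary file instead of reading data/demog.txt; the variable names d1, d2 and d3 are assumptions. In R 4.0.0 and later, character columns stay as chr rather than becoming factors unless stringsAsFactors = TRUE is given.

```r
f <- tempfile(fileext = ".txt")
writeLines(c("subjid trt gender race age",
             "101 0 1 3 37",
             "102 1 2 1 65",
             "712 0 . 1 49"), f)

d1 <- read.table(f)                                    # header treated as data
d2 <- read.table(f, header = TRUE)                     # "." makes gender non-numeric
d3 <- read.table(f, header = TRUE, na.strings = ".")   # "." read as NA
str(d1); str(d2); str(d3)
d3$gender
```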

Lists as containers of complex results

  • We have earlier used the lm() function to fit an OLS linear regression model

  • Let’s see what the output returned by lm() actually looks like

List of 12
 $ coefficients : Named num [1:2] -68.022 0.776
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "height"
 $ residuals    : Named num [1:200] 5.19 4.22 2.95 5.44 -5.07 ...
  ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
 $ effects      : Named num [1:200] -926.02 113.77 3.26 5.01 -5.31 ...
  ..- attr(*, "names")= chr [1:200] "(Intercept)" "height" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:200] 56 63 84.3 62.2 67.9 ...
  ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:200, 1:2] -14.1421 0.0707 0.0707 0.0707 0.0707 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:200] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "height"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.02
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 198
 $ xlevels      : Named list()
 $ call         : language lm(formula = weight ~ height, data = mydf)
 $ terms        :Classes 'terms', 'formula'  language weight ~ height
  .. ..- attr(*, "variables")= language list(weight, height)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "weight" "height"
  .. .. .. ..$ : chr "height"
  .. ..- attr(*, "term.labels")= chr "height"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(weight, height)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
 $ model        :'data.frame':  200 obs. of  2 variables:
  ..$ weight: num [1:200] 61.2 67.2 87.3 67.6 62.8 ...
  ..$ height: num [1:200] 160 169 196 168 175 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language weight ~ height
  .. .. ..- attr(*, "variables")= language list(weight, height)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "weight" "height"
  .. .. .. .. ..$ : chr "height"
  .. .. ..- attr(*, "term.labels")= chr "height"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(weight, height)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "weight" "height"
 - attr(*, "class")= chr "lm"
  • Another example: t.test() to perform one-sample t-test
List of 10
 $ statistic  : Named num 0.167
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 199
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.867
 $ conf.int   : num [1:2] 21.7 22.3
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num 22
  ..- attr(*, "names")= chr "mean of x"
 $ null.value : Named num 22
  ..- attr(*, "names")= chr "mean"
 $ stderr     : num 0.145
 $ alternative: chr "two.sided"
 $ method     : chr "One Sample t-test"
 $ data.name  : chr "mydf$bmi"
 - attr(*, "class")= chr "htest"
  • These details are usually unimportant for regular use

  • Printing these results will show “user-friendly” output


    One Sample t-test

data:  mydf$bmi
t = 0.16719, df = 199, p-value = 0.8674
alternative hypothesis: true mean is not equal to 22
95 percent confidence interval:
 21.73774 22.31084
sample estimates:
mean of x 
 22.02429 


  • But to develop our own analysis tools, we will need some understanding of how this happens
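
A sketch of inspecting such results. The data frame mydf is re-simulated here with assumed settings, so the numbers will not match the output above; the calls to lm() and t.test() follow the call and data.name entries shown there, with mu = 22 taken from the stated null value.

```r
set.seed(1)
mydf <- data.frame(height = rnorm(200, mean = 170, sd = 10),
                   bmi = rnorm(200, mean = 22, sd = 2))
mydf$weight <- mydf$bmi * (mydf$height / 100)^2

fit <- lm(weight ~ height, data = mydf)
str(fit, max.level = 1)     # top-level structure only
fit$coefficients            # components are extracted like list elements

tt <- t.test(mydf$bmi, mu = 22)
names(tt)
tt$p.value
tt                          # printing gives the user-friendly display
```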

Another important data structure: matrix / array

  • Matrices and arrays arise very naturally in statistics

  • Two common uses:

    • Model matrix for linear models

    • Contingency tables

  • Example: built-in dataset giving death rates (per 1000) for demographic subgroups

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
[1] 5 4
  • Unlike data frames, they are always homogeneous (all elements of same type)

  • Matrices have row and column indexes, and possibly names

  • Indexing works in the same way as vectors, but in two dimensions (separated by ,)

      Rural Female Urban Male
50-54          8.7       15.4
55-59         11.7       24.3
  • Indexing by “empty” index selects all rows / columns
      Rural Male Rural Female
50-54       11.7          8.7
55-59       18.1         11.7
60-64       26.9         20.3
65-69       41.0         30.9
70-74       66.0         54.3
  • Such indexing also works for data frames
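
The values shown match the built-in VADeaths dataset. A sketch of the indexing, with the index ranges reconstructed from the output:

```r
VADeaths              # 5 x 4 numeric matrix with row and column names
dim(VADeaths)         # 5 4
VADeaths[1:2, 2:3]    # rows 1-2, columns 2-3
VADeaths[, 1:2]       # empty row index: all rows, columns 1-2
VADeaths["60-64", ]   # names can be used as well
head(cars)[1:3, ]     # the same syntax on a data frame (built-in cars)
```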

Creating a matrix

  • There are many ways to create a matrix

  • Example: matrix() constructs matrix by providing data and dimensions

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
  • cbind() and rbind() construct matrices by combining columns or rows
      int       ht
 [1,]   1 159.8666
 [2,]   1 168.8484
 [3,]   1 196.3493
 [4,]   1 167.8481
 [5,]   1 175.1913
 [6,]   1 159.1338
 [7,]   1 177.4504
 [8,]   1 174.6472
 [9,]   1 163.5114
[10,]   1 178.1257
[11,]   1 175.2149
[12,]   1 155.5292
[13,]   1 157.8308
[14,]   1 160.7194
[15,]   1 177.3041
  • X is the design matrix for linear regression on height (including intercept)
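
A sketch of calls that build matrices like these. The short ht vector is illustrative; the names X, int and ht follow the text and output above.

```r
m1 <- matrix(1:12, nrow = 3, ncol = 4)       # filled column by column
m2 <- matrix(1:12, nrow = 4, byrow = TRUE)   # filled row by row
m1; m2

ht <- c(159.9, 168.8, 196.3, 167.8, 175.2)   # a few illustrative heights
X <- cbind(int = 1, ht = ht)                 # the 1 is recycled down the column
X
```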

Matrix operations

  • Standard matrix operations: transpose t() and matrix product %*%

  • Can be used to solve linear regression equation

         int         ht
int   200.00   34416.96
ht  34416.96 5944139.57
           [,1]
int -68.0215728
ht    0.7757852
(Intercept)      height 
-68.0215728   0.7757852 
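
A sketch of solving the normal equations (X'X) b = X'y and comparing with lm(). The data are re-simulated with assumed settings, so the numbers differ from those above.

```r
set.seed(1)
ht <- rnorm(200, mean = 170, sd = 10)                # assumed simulation settings
wt <- rnorm(200, mean = 22, sd = 2) * (ht / 100)^2

X <- cbind(int = 1, ht = ht)         # design matrix with intercept column
XtX <- t(X) %*% X                    # 2 x 2 matrix X'X
beta <- solve(XtX, t(X) %*% wt)      # solves (X'X) b = X'y
XtX
beta
coef(lm(wt ~ ht))                    # same estimates from lm()
```

lm() itself uses a QR decomposition rather than forming X'X, which is numerically more stable; the normal equations are used here only to illustrate the matrix operations.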

Matrix representation

  • Internally, matrices and arrays are stored as vectors along with a dimension
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
  • General arrays can have more than two dimensions
, , 1

     [,1] [,2]
[1,]    1    3
[2,]    2    4

, , 2

     [,1] [,2]
[1,]    5    7
[2,]    6    8

, , 3

     [,1] [,2]
[1,]    9   11
[2,]   10   12
  • Incidentally, assignments where the left-hand side looks like a function call are a special feature of R

  • These modify some aspect of an already existing variable

  • These are known as replacement functions

  • The underlying vector nature of a matrix is easy to verify

      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
[1] 41.0 66.0  8.7 11.7 20.3 30.9 54.3
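
A sketch of these steps (the variable name x is an assumption; the index 4:10 is reconstructed from the last line of output):

```r
x <- 1:12
x                       # a plain vector
dim(x) <- c(4, 3)       # replacement function: x becomes a 4 x 3 matrix
x
dim(x) <- c(2, 2, 3)    # the same data viewed as a 2 x 2 x 3 array
x

VADeaths[4:10]          # vector indexing on a matrix: elements in column order
```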

Next steps

  • This background is enough to start on typical data analysis tasks

  • But before that, we also need to learn about accessing documentation

  • This requires a brief discussion of the class system in R

The class of R objects

Every R object must have a class

[1] "list"
[1] "data.frame"
[1] "numeric"
[1] "lm"
[1] "htest"
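
A sketch of class() applied to objects of the same kinds. The built-in cars dataset is used here as a stand-in for the simulated data.

```r
fit <- lm(dist ~ speed, data = cars)    # cars is a built-in data frame
tt <- t.test(cars$dist, mu = 40)        # mu = 40 is an arbitrary null value

class(list(1, "a"))     # "list"
class(cars)             # "data.frame"
class(c(1.5, 2))        # "numeric"
class(fit)              # "lm"
class(tt)               # "htest"
```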

Some functions are ‘generic’ functions

  • Generic functions are placeholder functions

  • They perform different tasks depending on type of argument passed to them

  • For example, summary() is a generic function

     height           bmi            weight      gender 
 Min.   :145.2   Min.   :16.43   Min.   :44.19   F:100  
 1st Qu.:164.4   1st Qu.:20.67   1st Qu.:58.35   M:100  
 Median :172.4   Median :22.21   Median :64.91          
 Mean   :172.1   Mean   :22.02   Mean   :65.48          
 3rd Qu.:179.5   3rd Qu.:23.45   3rd Qu.:71.78          
 Max.   :198.1   Max.   :28.41   Max.   :90.05          
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  44.19   58.35   64.91   65.48   71.78   90.05 

Call:
lm(formula = weight ~ height, data = mydf)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.7074  -3.9083   0.4285   4.2131  17.5302 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -68.02157    7.26984  -9.357   <2e-16 ***
height        0.77579    0.04217  18.397   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.184 on 198 degrees of freedom
Multiple R-squared:  0.6309,    Adjusted R-squared:  0.629 
F-statistic: 338.4 on 1 and 198 DF,  p-value: < 2.2e-16
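
A sketch showing summary() behaving differently for three classes of argument, again with the built-in cars dataset as a stand-in:

```r
fit <- lm(dist ~ speed, data = cars)

s1 <- summary(cars)        # data frame: one summary per column
s2 <- summary(cars$dist)   # numeric vector: six-number summary
s3 <- summary(fit)         # fitted model: coefficient table, R-squared, ...
s1; s2; s3

class(s3)                  # the result has its own class, "summary.lm"
```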

Methods of generic functions

 [1] summary.aov                    summary.aovlist*               summary.aspell*               
 [4] summary.check_packages_in_dir* summary.connection             summary.data.frame            
 [7] summary.Date                   summary.default                summary.ecdf*                 
[10] summary.factor                 summary.glm                    summary.infl*                 
[13] summary.lm                     summary.loess*                 summary.manova                
[16] summary.matrix                 summary.mlm*                   summary.nls*                  
[19] summary.packageStatus*         summary.POSIXct                summary.POSIXlt               
[22] summary.ppr*                   summary.prcomp*                summary.princomp*             
[25] summary.proc_time              summary.srcfile                summary.srcref                
[28] summary.stepfun                summary.stl*                   summary.table                 
[31] summary.tukeysmooth*           summary.warnings              
see '?methods' for accessing help and source code
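
A listing like this one comes from methods(). A short sketch (the exact set of methods depends on which packages are loaded):

```r
m <- methods(summary)              # methods of the generic summary()
m
"summary.lm" %in% as.vector(m)     # TRUE: this method handles "lm" objects

methods(class = "lm")              # all generics with a method for class "lm"
```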

The print() function

  • A special generic function is called print()

  • Whenever the result of an evaluation is not assigned to a variable, it is “auto-printed”

  • This is done using the print() generic function

  • For example:


Call:
lm(formula = weight ~ height, data = mydf)

Coefficients:
(Intercept)       height  
   -68.0216       0.7758  
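
A sketch of auto-printing, using the built-in cars dataset as a stand-in for the simulated data:

```r
fit <- lm(dist ~ speed, data = cars)   # assigned: nothing is printed
fit                                    # not assigned: auto-printed
print(fit)                             # explicit call, same display
out <- capture.output(print(fit))      # the printed lines as a character vector
```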

Documentation

  • Every dataset and function in R is documented in a help page

  • The documentation for a function can be accessed by ? or help()

Documentation of generic functions and methods

  • Generic functions have their own documentation page
  • The documentation for a specific method may be in a different page
  • Note that you should never call summary.lm() directly instead of summary()
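
A sketch of looking up these pages. At the interactive prompt one would normally just type ?summary or ?summary.lm; here the result of help() is stored so that the sketch can run non-interactively, and the fitted model on the built-in cars dataset is only a stand-in.

```r
h_generic <- help("summary")       # page for the generic function
h_method  <- help("summary.lm")    # page for the method for "lm" objects
class(h_generic)                   # printing the object displays the page

fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)                  # call the generic; R dispatches to summary.lm()
```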

Understanding function documentation

  • Most useful things in R happen by calling functions

  • Functions have one or more arguments

    • All arguments have names

    • Arguments may be compulsory or optional

    • Optional arguments have “default” values

  • Functions normally also have a useful “return” value

  • These are all described in the help page

  • Arguments may or may not be named when calling a function

  • If not named, arguments are matched by position

  • Conventionally, optional arguments are named, compulsory arguments are often not named
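
A short sketch of positional and named argument matching, using seq() and round() as examples:

```r
args(seq.default)                    # shows argument names and defaults

a <- seq(0, 1, 0.25)                 # all arguments matched by position
b <- seq(from = 0, to = 1, by = 0.25)    # all arguments named
d <- seq(0, 1, length.out = 5)       # positional, plus one named optional argument

r1 <- round(3.14159, 2)              # digits matched by position
r2 <- round(digits = 2, x = 3.14159) # named arguments can come in any order
```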

Exercises

  • Load 02-rbasics.rmd in RStudio and work through the examples

  • Read the help page for mean() and median()

  • Here are two simple numeric vectors containing NA and Inf values

  • Find the mean and median of these vectors

  • How can you make R ignore the NA value?

  • How can you make R ignore the Inf value? Hint: see ?is.finite


  • Go through the help pages for write.table() and read.table() to understand how they work. Skip non-essential details.

  • Export the demog data frame as a CSV file named "demog.csv" using write.csv()

  • Import this newly created dataset again using read.csv(), saving it as a variable named d

  • Do you need to specify the header argument? Why?

  • Do you need to specify the na.strings argument? Why?


  • The goal of the next exercise is to convert d$race into a factor

  • Read the help page for factor() to learn how to create factors

  • d$race has three values: 1 = White, 2 = Black, 3 = Other

  • Create and add a new factor variable d$frace to the data frame d

  • d$frace should have “levels” 1, 2, 3, and corresponding “labels” White, Black, Other


  • The goal of the next exercise is to import a SAS format dataset and use it to perform a t-test

  • SAS can export data in a binary format with extension sas7bdat

  • This is not a documented export format, and is not meant to be imported into other software

  • However, it is common enough that there is a contributed add-on package that has attempted to reverse-engineer the format

  • To use this package (named sas7bdat, like the format), we first need to load it into R with library(sas7bdat)

  • Once the package is loaded, read the help page for read.sas7bdat()

  • Use it to import the file sasdata/twosample.sas7bdat

  • The imported data frame should have variables PATNO, TRT, FEV0 and FEV6


Exercises

  • The story behind the dataset is as follows:

A new compound, ABC-123, is being developed for long-term treatment of patients with chronic asthma. Asthmatic patients were enrolled in a double blind study and randomized to receive daily oral doses of ABC-123 or a placebo for 6 weeks. The primary measurement of interest is the resting FEV1 (forced expiratory volume during the first second of expiration), which is measured before (as FEV0) and at the end (as FEV6) of the 6-week treatment period.

  • Add a variable called CHG to the dataset recording the change in FEV1

  • Read the help page for t.test()

  • Does administration of ABC-123 have any effect on FEV1? Answer this by performing a two-sample t-test

  • Note that there may be multiple approaches to arrive at the correct answer