---
title: Basic usage of R
author: Deepayan Sarkar
---

```{r opts, echo = FALSE, results = "hide", warning = FALSE, message = FALSE}
knitr::opts_chunk$set(cache = TRUE, cache.path='~/knitr-cache/ctw-rbasics/',
                      autodep = TRUE, comment = "", warning = TRUE, message = TRUE,
                      knitr.table.format = "html", dev.args = list(pointsize = 16),
                      fig.width = 15, fig.height = 7, dpi = 110,
                      fig.path='figures/rbasics-')
options(warnPartialMatchDollar = FALSE, width = 110)
```


## Basics of using R

R is more flexible than a regular calculator

* In fact, R is a full programming language

* Most standard data analysis methods are already implemented

* Can be extended by writing add-on packages

* Thousands of add-on packages are available


## Major concepts

* Variables (in the context of programming)

* Data structures needed for data analyis

* Functions (set of instructions for performing a procedure)

## Variables

* Variables are symbols that may be associated with different values

* Computations involving variables are done using their current value

```{r}
x <- 10 # assignment
sqrt(x)
x <- -1
sqrt(x)
x <- -1+0i
sqrt(x)
```

## Data structures for data analysis


* Vectors

* Lists (general collection of objects)

* Data frames (a spreadsheet-like data set)

* Matrices 


## Atomic vectors

* Indexed collection of homogeneous scalars, can be 

    * Numeric / Integer
    * Categorical (factor)
    * Character
    * Logical (`TRUE` / `FALSE`)
	
* Missing values are allowed, indicated as `NA`

* Elements are indexed starting from 1

* `i`-th element of vector `x` can be extracted using `x[i]`

* There are also more sophisticated forms of (vector) indexing


## Atomic vectors: examples

```{r}
month.name # built-in 
x <- rnorm(10)
x
```

. . .


```{r}
str(x) # useful function
str(month.name)
```

## Atomic vectors: examples

```{r}
m <- sample(1:12, 30, replace = TRUE)
m
mf <- factor(m, levels = 1:12, labels = month.name)
mf
str(m)
str(mf)
```


## Atomic vectors

* "Scalars" are just vectors of length 1

```{r}
str(numeric(2))
str(numeric(1))
str(0)
```

## Atomic vectors


* Vectors can have length zero 

```{r}
numeric(0)
logical(0)
```


## Creating vectors

* Using functions that return vectors

```{r}
seq(0, 1, by = 0.05) # regular sequence
rep(1:3, each = 5)   # elements repeated in a regular pattern
1:10                 # shortcut for seq(1, 10, by = 1)
rnorm(5)             # many random number generators available
```

## Creating vectors

* Using the `c()` function to combine smaller vectors

```{r}
c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
c(rexp(5), -rexp(5))
c("Hearts", "Spades", "Diamonds", "Clubs")
```


## Types of indexing

* Indexing refers to extracting subsets of vectors (or other kinds of data)

* R supports several kinds of indexing:

	* Indexing by a vector of positive integers

	* Indexing by a vector of negative integers
	
	* Indexing by a logical vector

	* Indexing by a vector of names


## Types of indexing: positive integers

* The "standard" C-like indexing with a scalar (vector of length 1):

```{r}
month.name[2] # the first index is 1, not 0
```
\


* The "index" can also be an integer vector

```{r}
month.name[c(2, 4, 6, 9, 11)]
```
\


* Elements can be repeated

```{r}
month.name[c(2, 2, 6, 4, 6, 11)]
```


## Types of indexing: positive integers

* "Out-of-bounds" indexing give `NA` (missing)

```{r}
month.name[13]
month.name[seq(1, by = 2, length.out = 8)]
```
\


## Types of indexing: negative integers

* Negative integers omit the specified entries

```{r}
month.name[-2]
month.name[-c(2, 4, 6, 9, 11)]
```
\

* Cannot be mixed with positive integers

```{r}
month.name[c(2, -3)]
```

## Types of indexing: zero

* Zero has a special meaning - doesn't select anything

```{r}
month.name[0]
month.name[integer(0)] ## same as empty index
month.name[c(1, 2, 0, 11, 12)]
month.name[-c(1, 2, 0, 11, 12)]
```

## Types of indexing: logical vector

* Indexing by logical vector: select `TRUE` elements

```{r}
month.name[c(TRUE, FALSE, FALSE)] # index is recycled if shorter than data
```
\


* Logical vectors are usually created by logical comparisons

```{r}
i <- substring(month.name, 1, 1) == "J"
i
month.name[i]
```


## Types of indexing: logical vector

* Common use: extract subset satisfying a certain condition (also called "filtering")

```{r}
(x <- rnorm(20)) # parentheses to print result
x > 0     # element-wise comparison
x[x > 0]  # result of comparison used as logical index vector
mean(x)
mean(x[x > 0])
```


## Types of indexing: logical to integer

* Sometimes logical indexing can be replaced by integer indexing using `which()`

```{r}
i
which(i)
month.name[ which(i) ]
month.name[ -which(i) ] # same as month.name[ !i ]
```


## Types of indexing: character vectors

* R vectors can have names (optional)

* These names can be used to index, just like positive integers

* Will see examples later


## Vectorized arithmetic

* Arithmetic operations are usually "vectorized". These include

	* Arithmetic operators such as `+, -, *, /, ^`

	* Mathematical functions such as `sin(), cos(), log()`
	
	* Logical comparisons `<, >, <=, >=, ==` and operators `&, |, !`
	
	* Almost any other function where it makes sense

* They operate element-wise on vectors, producing another vector

* Remember that R has no "scalar" type. Scalars are just length-1 vectors

* Operations with unequal sized vectors: Shorter vector is repeated /
  recycled to match longer vector


## Vectorized arithmetic

* Example: Recreate height-weight simulation

```{r}
ht <- rnorm(200, mean = 172, sd = 10)    # height in cm
bmi <- rnorm(200, mean = 22, sd = 2.2)   # bmi (independent of height)
wt <- bmi * (ht / 100)^2
```

* The last step has several vectorized computations:

	* `ht / 100` divides each element of height by `100` (`100` is recycled)

	* `(ht / 100)^2` squares each element of the result (`2` is recycled)

	* `bmi * (ht / 100)^2` multiplies result with `bmi` element-wise


## Scalars from vectors

* Many functions summarize a data vector by producing a scalar

```{r}
sum(wt)
mean(wt)
sd(wt)
cor(wt, ht)
```

## Scalars from vectors

* Sometimes the summary output can be a vector as well

```{r}
fivenum(ht) ## minimum, quartiles, and maximum
summary(ht) ## similar summary with descriptive names
```


## Lists

* Atomic vectors must be homogeneous (all elements of the same type)

* But we often need to combine different types of data

. . .

* Lists are vectors with arbitrary types of components

* Like atomic vectors, they may or may not have names, but usually do

* Usually constructed using the function `list()`


## Lists

Example: Suppose we want to record units and a descriptive label along with a data vector

```{r}
lht <- list(data = ht[1:10], unit = "cm", label = "Simulated height")
lbmi <- list(data = bmi[1:10], unit = "none", label = "Simulated BMI")
lwt <- list(data = wt[1:10], unit = "kg", label = "Weight calculated from height and BMI")
```

* These are now vectors with different types of elements

* Can be indexed in the usual way:

```{r}
lht[1:2]     # index with integers, result is a length-2 list
lbmi["label"] # index with character, result is a length-1 list
```

## Lists

* But often we want to extract a specific _element_ of a list

* This is done by indexing with double brackets `[[ ... ]]`

```{r}
lht[[ 1 ]]      # index with integer
lbmi[[ "label" ]] # index with character
```
\


* For lists with names, a common alternative is to use `$`

```{r}
lwt$label # note the lack of quotes
```


## Lists

Lists can themselves contain lists recursively

```{r}
mydata <- list(height = lht, bmi = lbmi, weight = lwt)
str(mydata)
```
\


* Elements can be extracted recursively

```{r}
mydata$weight[[1]]
```


## Uses of lists

* Lists are very flexible data structures that are widely used

* Two very important uses:

	* Standard representation of data sets
	
	* To contain results of complex functions (model fitting, tests)


## Data frames

* Data frames represent rectangular (spreadheet-like) data

* Essentially lists with some additional restrictions

	* Elements are viewed as columns in a data set

	* Each element / column is (usually) an atomic vector

	* Different columns can have different types

	* Every column must have a name

* Can be created using the `data.frame()` function


## Data frame: Example

```{r}
mydlist <- list(height = ht, bmi = bmi, weight = wt, gender = c("M", "F"))
str(mydlist)
mydf <- data.frame(height = ht, bmi = bmi, weight = wt, gender = c("M", "F"))
str(mydf) # compare with list
mydf$gender
```


## Data frame: Example

Lists can be recursive

```{r}
list(height = lht, bmi = lbmi, weight = lwt) 
```

## Data frame: Example

But they are 'flattened' in data frames

```{r}
data.frame(height = lht, bmi = lbmi, weight = lwt)
```


## Data import

* It is much more common to create data frames by importing data from a file

* Typical approach: read data from spreadsheet file into data frame

* Easiest route:

    * R itself cannot read Excel files directly

    * Save as CSV file from Excel

    * Read with `read.csv()` or `read.table()` (more flexible)

* Alternative option:

    * Use "Import Dataset" menu item in R Studio (supports more
      formats using add-on packages)


## Data export

* Data frames can be exported as a spreadsheet file using
  `write.csv()` or `write.table()`

```{r eval=FALSE}
data(Cars93, package = "MASS") # built-in dataset
write.csv(Cars93, file = "cars93.csv") # export
cars <- read.csv("cars93.csv") # import (path relative to working directory)
```

* Most statistical software are able to read CSV files


## Data import example

* Import text dataset [data/demog.txt](data/demog.txt) containing demographic data

* The contents of the file are:

```
subjid trt gender race age
101 0 1 3 37
102 1 2 1 65
103 1 1 2 32
104 0 2 1 23
105 1 1 3 44
106 0 2 1 49
201 1 1 3 35
202 0 2 1 50
203 1 1 2 49
204 0 2 1 60
205 1 1 3 39
206 1 2 1 67
301 0 1 1 70
302 0 1 2 55
303 1 1 1 65
304 0 1 1 45
305 1 1 1 36
306 0 1 2 46
401 1 2 1 44
402 0 2 2 77
403 1 1 1 45
404 1 1 1 59
405 0 2 1 49
406 1 1 2 33
501 0 1 2 33
502 1 2 1 44
503 1 1 1 64
504 0 1 3 56
505 1 1 2 73
506 0 1 1 46
507 1 1 2 44
508 0 2 1 53
509 0 1 1 45
510 0 1 3 65
511 1 2 2 43
512 1 1 1 39
601 0 1 1 50
602 0 2 2 30
603 1 2 1 33
604 0 1 1 65
605 1 2 1 57
606 0 1 2 56
607 1 1 1 67
608 0 2 2 46
609 1 2 1 72
610 0 1 1 29
611 1 2 1 65
612 1 1 2 46
701 1 1 1 60
702 0 1 1 28
703 1 1 2 44
704 0 2 1 66
705 1 1 2 46
706 1 1 1 75
707 1 1 1 46
708 0 2 1 55
709 0 2 2 57
710 0 1 1 63
711 1 1 2 61
712 0 . 1 49
```


## Data import example

Attempt 1:

```{r}
demog <- read.table("data/demog.txt")
str(demog)
```

* R doesn't know that the first row gives column headers

* All columns have been interpreted as characters and converted to factors


## Data import example

Attempt 2:

```{r}
demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE)
str(demog)
```

* The `gender` column is still interpreted as character data

* This is because R doesn't know that missing values are encoded as `"."`


## Data import example

Attempt 2:

```{r}
demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE, na.strings = ".")
str(demog)
demog$gender
```

* The columns `trt`, `gender`, and `race` should actually be categorical

* We need more information about the numeric encoding to do this

* Will see examples later


## Lists as containers of complex results

* We have earlier used the `lm()` function to fit an OLS linear regression model 

* Let's see what the output returned by `lm()` actually looks like

```{r}
fm <- lm(weight ~ height, data = mydf) # more on this later
str(fm)
```

## Lists as containers of complex results

* Another example: `t.test()` to perform one-sample t-test

```{r}
tt <- t.test(mydf$bmi, mu = 22) # Is mean BMI equal to 22?
str(tt)
```


## Lists as containers of complex results

* These details are usually unimportant for regular use

* Printing these results will show "user-friendly" output

```{r}
print(tt)
```
\

* But to develop our own analysis tools, we will need some
  understanding of how this happens


## Another important data structure: matrix / array

* Matrices and arrays arise very naturally in statistics

* Two common uses:

	* Model matrix for linear models
	
	* Contingency tables


. . .

* Example: built-in dataset giving death rates (per 1000) for demographic subgroups

```{r}
VADeaths
dim(VADeaths) # dimension of the matrix
```


## Another important data structure: matrix / array

* Unlike data frames, they are always homogeneous (all elements of same type)

* Matrices have row and column indexes, and possibly names

* Indexing works in the same way as vectors, but in two dimensions (separated by `,`)

```{r}
VADeaths[1:2, c(2, 3)]
```

* Indexing by "empty" index selects all rows / columns

```{r}
VADeaths[, c("Rural Male", "Rural Female")]
```

* Such indexing also works for data frames


## Creating a matrix

* There are many ways to create a matrix

* Example: `matrix()` constructs matrix by providing data and dimensions

```{r}
matrix(1:12, nrow = 3, ncol = 4)               # fills up columns first by default
matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE) # fills up rows first
```


## Creating a matrix

* `cbind()` and `rbind()` constructs matrices by combining columns or rows

```{r}
X <- cbind(int = 1, ht = mydf$height) # Note: 1 is recycled
head(X, 15)
```

* `X` is the design matrix for linear regression on height (including intercept)


## Matrix operations

* Standard matrix operations: transpose `t()` and matrix product `%*%`

* Can be used to solve linear regression equation

```{r}
y <- mydf$weight
XtX <- t(X) %*% X  # transpose and matrix multiplication
XtX
Xty <- t(X) %*% y
solve(XtX, Xty) # solves normal equations X'X beta = X'y
fm$coefficients # compare with earlier result from lm()
```


## Matrix representation

* Internally, matrices and arrays are stored as vectors along with a dimension

```{r}
x <- 1:12
x
dim(x) <- c(4, 3)
x
```


## Matrix representation

* General arrays can have more than two dimensions

```{r}
dim(x) <- c(2, 2, 3) # three-dimensional array
x
```

* Incidentally, assignments where the left-hand side looks like a
  function call are a special feature of R
  
* These modify some aspect of an _already existing_ variable
  
* These are known as _replacement_ functions


## Matrix representation

* The underlying vector nature of a matrix is easy to verify

```{r}
VADeaths
VADeaths[4:10] # index as vector (data is stored columnwise)
```


## Next steps

* This background is enough to start on typical data analysis tasks

* But before that, we also need to learn about accessing documentation

* This requires a brief discussion of the class system in R


## The class of R objects

Every R object must have a class

```{r}
class(mydata)
class(mydf)
class(mydf$weight)
class(fm)
class(tt) # for 'hypothesis test'
```


## Some functions are 'generic' functions

* Generic functions are placeholder functions

* They perform different tasks depending on type of argument passed to them

* For example, `summary()` is a generic function

```{r}
summary(mydf)
summary(mydf$weight)
```

## Some functions are 'generic' functions


```{r}
summary(fm)
```

## Methods of generic functions

```{r}
methods("summary")
```


## The `print()` function

* A special generic function is called `print()`

* Whenever the result of an evaluation is not assigned to a variable, it is "auto-printed"

* This is done using the `print()` generic function

* For example:

```{r}
print(fm) # same as just entering fm
```


## Documentation

* Every dataset and function in R is documented in a help page

* The documentation for a function can be accessed by `?` or `help()`

```r
?seq
help(cbind)
help(VADeaths)
```


## Documentation of generic functions and methods

* Generic functions have their own documentation page

```r
help(summary)
```

* The documentation for a specific method may be in a different page

```r
help(summary.lm)
```

* Note that you should never call `summary.lm()` directly instead of
  `summary()`


## Understanding function documentation

* Most useful things in R happen by calling functions

* Functions have one or more arguments

    * All arguments have names

	* Arguments may be compulsory or optional

	* Optional arguments have "default" values 

* Functions normally also have a useful "return" value	

* These are all described in the help page


## Understanding function documentation

* Arguments may or may not be named when calling a function

* If not named, arguments are matched by position

* Conventionally, optional arguments are named, compulsory arguments are often not named


## Exercises

* Load [02-rbasics.rmd](02-rbasics.rmd) in R Studio and work through the examples

* Read the help page for `mean()` and `median()`

* Here are two simple numeric vectors containing `NA` and `Inf` values

```{r}
x <- c(1:5, NA, 7:10)
y <- c(1:5, Inf, 7:10)
```

* Find the mean and median of these vectors

* How can you make R ignore the `NA` value?

* How can you make R ignore the `Inf` value? Hint: see `?is.finite`


* * *

## Exercises

* Go through the help pages for `write.table()` and `read.table()` to
  understand how they work. Skip non-essential details.

* Export the `demog` data frame as a CSV file named `"demog.csv"` using `write.csv()`

* Import this newly created dataset again using `read.csv()`, saving
  it as a variable named `d`
  
* Do you need to specify the `header` argument? Why?

* Do you need to specify the `na.strings` argument? Why?


* * *

## Exercises

* The goal of the next exercise is to convert `d$race` into a factor

* Read the help page for `factor()` to learn how to create factors

* `d$race` has three values: 1 = White, 2 = Black, 3 = Other

* Create and add a new factor variable `d$frace` to the data frame `d`

* `d$frace` should have "levels" 1, 2, 3, and correposponding "labels"
  White, Black, Other


* * *

## Exercises

* The goal of the next exercise is to import a SAS format dataset and
  use it to perform a t-test
  
* SAS can export data in a binary format with extension `sas7bdat`

* This is __not__ a documented export format, and is not meant to be
  imported into other software
  
* However, it is common enough that there is contributed add-on
  package that has attempted to reverse-engineer the format
  
* To use this package, we first need to load it into R as follows

```{r}
## install.packages("sas7bdat") # install package if not already installed (needed only once)
library(package = "sas7bdat")
```

* Once the package is loaded, read the help page for `read.sas7bdat()`

* Use it to import the file [sasdata/twosample.sas7bdat](sasdata/twosample.sas7bdat)

* The imported data frame should have variables `PATNO`, `TRT`, `FEV0` and `FEV6`


* * *

## Exercises

* The story behind the dataset is as follows:

> A new compound, ABC-123, is being developed for long-term treatment
> of patients with chronic asthma. Asthmatic patients were enrolled in
> a double blind study and randomized to receive daily oral doses of
> ABC-123 or a placebo for 6 weeks. The primary measurement of
> interest is the resting FEV1 (forced expiratory volume during the
> first second of expiration), which is measured before (as FEV0) and
> at the end (as FEV6) of the 6-week treatment period.

* Add a variable called `CHG` to the dataset recording the change in FEV1

* Read the help page for `t.test()` 

* Does administration of ABC-123 have any effect on FEV1? Answer this
  by performing a two-sample t-test

* Note that there may be multiple approaches to arrive at the correct answer