--- title: Basic usage of R author: Deepayan Sarkar --- ```{r opts, echo = FALSE, results = "hide", warning = FALSE, message = FALSE} knitr::opts_chunk$set(cache = TRUE, cache.path='~/knitr-cache/ctw-rbasics/', autodep = TRUE, comment = "", warning = TRUE, message = TRUE, knitr.table.format = "html", dev.args = list(pointsize = 16), fig.width = 15, fig.height = 7, dpi = 110, fig.path='figures/rbasics-') options(warnPartialMatchDollar = FALSE, width = 110) ``` ## Basics of using R R is more flexible than a regular calculator * In fact, R is a full programming language * Most standard data analysis methods are already implemented * Can be extended by writing add-on packages * Thousands of add-on packages are available ## Major concepts * Variables (in the context of programming) * Data structures needed for data analyis * Functions (set of instructions for performing a procedure) ## Variables * Variables are symbols that may be associated with different values * Computations involving variables are done using their current value ```{r} x <- 10 # assignment sqrt(x) x <- -1 sqrt(x) x <- -1+0i sqrt(x) ``` ## Data structures for data analysis * Vectors * Lists (general collection of objects) * Data frames (a spreadsheet-like data set) * Matrices ## Atomic vectors * Indexed collection of homogeneous scalars, can be * Numeric / Integer * Categorical (factor) * Character * Logical (`TRUE` / `FALSE`) * Missing values are allowed, indicated as `NA` * Elements are indexed starting from 1 * `i`-th element of vector `x` can be extracted using `x[i]` * There are also more sophisticated forms of (vector) indexing ## Atomic vectors: examples ```{r} month.name # built-in x <- rnorm(10) x ``` . . . ```{r} str(x) # useful function str(month.name) ``` ## Atomic vectors: examples ```{r} m <- sample(1:12, 30, replace = TRUE) m mf <- factor(m, levels = 1:12, labels = month.name) mf str(m) str(mf) ``` ## Atomic vectors * "Scalars" are just vectors of length 1 ```{r} str(numeric(2)) str(numeric(1)) str(0) ``` ## Atomic vectors * Vectors can have length zero ```{r} numeric(0) logical(0) ``` ## Creating vectors * Using functions that return vectors ```{r} seq(0, 1, by = 0.05) # regular sequence rep(1:3, each = 5) # elements repeated in a regular pattern 1:10 # shortcut for seq(1, 10, by = 1) rnorm(5) # many random number generators available ``` ## Creating vectors * Using the `c()` function to combine smaller vectors ```{r} c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34) c(rexp(5), -rexp(5)) c("Hearts", "Spades", "Diamonds", "Clubs") ``` ## Types of indexing * Indexing refers to extracting subsets of vectors (or other kinds of data) * R supports several kinds of indexing: * Indexing by a vector of positive integers * Indexing by a vector of negative integers * Indexing by a logical vector * Indexing by a vector of names ## Types of indexing: positive integers * The "standard" C-like indexing with a scalar (vector of length 1): ```{r} month.name[2] # the first index is 1, not 0 ``` \ * The "index" can also be an integer vector ```{r} month.name[c(2, 4, 6, 9, 11)] ``` \ * Elements can be repeated ```{r} month.name[c(2, 2, 6, 4, 6, 11)] ``` ## Types of indexing: positive integers * "Out-of-bounds" indexing give `NA` (missing) ```{r} month.name[13] month.name[seq(1, by = 2, length.out = 8)] ``` \ ## Types of indexing: negative integers * Negative integers omit the specified entries ```{r} month.name[-2] month.name[-c(2, 4, 6, 9, 11)] ``` \ * Cannot be mixed with positive integers ```{r} month.name[c(2, -3)] ``` ## Types of indexing: zero * Zero has a special meaning - doesn't select anything ```{r} month.name[0] month.name[integer(0)] ## same as empty index month.name[c(1, 2, 0, 11, 12)] month.name[-c(1, 2, 0, 11, 12)] ``` ## Types of indexing: logical vector * Indexing by logical vector: select `TRUE` elements ```{r} month.name[c(TRUE, FALSE, FALSE)] # index is recycled if shorter than data ``` \ * Logical vectors are usually created by logical comparisons ```{r} i <- substring(month.name, 1, 1) == "J" i month.name[i] ``` ## Types of indexing: logical vector * Common use: extract subset satisfying a certain condition (also called "filtering") ```{r} (x <- rnorm(20)) # parentheses to print result x > 0 # element-wise comparison x[x > 0] # result of comparison used as logical index vector mean(x) mean(x[x > 0]) ``` ## Types of indexing: logical to integer * Sometimes logical indexing can be replaced by integer indexing using `which()` ```{r} i which(i) month.name[ which(i) ] month.name[ -which(i) ] # same as month.name[ !i ] ``` ## Types of indexing: character vectors * R vectors can have names (optional) * These names can be used to index, just like positive integers * Will see examples later ## Vectorized arithmetic * Arithmetic operations are usually "vectorized". These include * Arithmetic operators such as `+, -, *, /, ^` * Mathematical functions such as `sin(), cos(), log()` * Logical comparisons `<, >, <=, >=, ==` and operators `&, |, !` * Almost any other function where it makes sense * They operate element-wise on vectors, producing another vector * Remember that R has no "scalar" type. Scalars are just length-1 vectors * Operations with unequal sized vectors: Shorter vector is repeated / recycled to match longer vector ## Vectorized arithmetic * Example: Recreate height-weight simulation ```{r} ht <- rnorm(200, mean = 172, sd = 10) # height in cm bmi <- rnorm(200, mean = 22, sd = 2.2) # bmi (independent of height) wt <- bmi * (ht / 100)^2 ``` * The last step has several vectorized computations: * `ht / 100` divides each element of height by `100` (`100` is recycled) * `(ht / 100)^2` squares each element of the result (`2` is recycled) * `bmi * (ht / 100)^2` multiplies result with `bmi` element-wise ## Scalars from vectors * Many functions summarize a data vector by producing a scalar ```{r} sum(wt) mean(wt) sd(wt) cor(wt, ht) ``` ## Scalars from vectors * Sometimes the summary output can be a vector as well ```{r} fivenum(ht) ## minimum, quartiles, and maximum summary(ht) ## similar summary with descriptive names ``` ## Lists * Atomic vectors must be homogeneous (all elements of the same type) * But we often need to combine different types of data . . . * Lists are vectors with arbitrary types of components * Like atomic vectors, they may or may not have names, but usually do * Usually constructed using the function `list()` ## Lists Example: Suppose we want to record units and a descriptive label along with a data vector ```{r} lht <- list(data = ht[1:10], unit = "cm", label = "Simulated height") lbmi <- list(data = bmi[1:10], unit = "none", label = "Simulated BMI") lwt <- list(data = wt[1:10], unit = "kg", label = "Weight calculated from height and BMI") ``` * These are now vectors with different types of elements * Can be indexed in the usual way: ```{r} lht[1:2] # index with integers, result is a length-2 list lbmi["label"] # index with character, result is a length-1 list ``` ## Lists * But often we want to extract a specific _element_ of a list * This is done by indexing with double brackets `[[ ... ]]` ```{r} lht[[ 1 ]] # index with integer lbmi[[ "label" ]] # index with character ``` \ * For lists with names, a common alternative is to use `$` ```{r} lwt$label # note the lack of quotes ``` ## Lists Lists can themselves contain lists recursively ```{r} mydata <- list(height = lht, bmi = lbmi, weight = lwt) str(mydata) ``` \ * Elements can be extracted recursively ```{r} mydata$weight[[1]] ``` ## Uses of lists * Lists are very flexible data structures that are widely used * Two very important uses: * Standard representation of data sets * To contain results of complex functions (model fitting, tests) ## Data frames * Data frames represent rectangular (spreadheet-like) data * Essentially lists with some additional restrictions * Elements are viewed as columns in a data set * Each element / column is (usually) an atomic vector * Different columns can have different types * Every column must have a name * Can be created using the `data.frame()` function ## Data frame: Example ```{r} mydlist <- list(height = ht, bmi = bmi, weight = wt, gender = c("M", "F")) str(mydlist) mydf <- data.frame(height = ht, bmi = bmi, weight = wt, gender = c("M", "F")) str(mydf) # compare with list mydf$gender ``` ## Data frame: Example Lists can be recursive ```{r} list(height = lht, bmi = lbmi, weight = lwt) ``` ## Data frame: Example But they are 'flattened' in data frames ```{r} data.frame(height = lht, bmi = lbmi, weight = lwt) ``` ## Data import * It is much more common to create data frames by importing data from a file * Typical approach: read data from spreadsheet file into data frame * Easiest route: * R itself cannot read Excel files directly * Save as CSV file from Excel * Read with `read.csv()` or `read.table()` (more flexible) * Alternative option: * Use "Import Dataset" menu item in R Studio (supports more formats using add-on packages) ## Data export * Data frames can be exported as a spreadsheet file using `write.csv()` or `write.table()` ```{r eval=FALSE} data(Cars93, package = "MASS") # built-in dataset write.csv(Cars93, file = "cars93.csv") # export cars <- read.csv("cars93.csv") # import (path relative to working directory) ``` * Most statistical software are able to read CSV files ## Data import example * Import text dataset [data/demog.txt](data/demog.txt) containing demographic data * The contents of the file are: ``` subjid trt gender race age 101 0 1 3 37 102 1 2 1 65 103 1 1 2 32 104 0 2 1 23 105 1 1 3 44 106 0 2 1 49 201 1 1 3 35 202 0 2 1 50 203 1 1 2 49 204 0 2 1 60 205 1 1 3 39 206 1 2 1 67 301 0 1 1 70 302 0 1 2 55 303 1 1 1 65 304 0 1 1 45 305 1 1 1 36 306 0 1 2 46 401 1 2 1 44 402 0 2 2 77 403 1 1 1 45 404 1 1 1 59 405 0 2 1 49 406 1 1 2 33 501 0 1 2 33 502 1 2 1 44 503 1 1 1 64 504 0 1 3 56 505 1 1 2 73 506 0 1 1 46 507 1 1 2 44 508 0 2 1 53 509 0 1 1 45 510 0 1 3 65 511 1 2 2 43 512 1 1 1 39 601 0 1 1 50 602 0 2 2 30 603 1 2 1 33 604 0 1 1 65 605 1 2 1 57 606 0 1 2 56 607 1 1 1 67 608 0 2 2 46 609 1 2 1 72 610 0 1 1 29 611 1 2 1 65 612 1 1 2 46 701 1 1 1 60 702 0 1 1 28 703 1 1 2 44 704 0 2 1 66 705 1 1 2 46 706 1 1 1 75 707 1 1 1 46 708 0 2 1 55 709 0 2 2 57 710 0 1 1 63 711 1 1 2 61 712 0 . 1 49 ``` ## Data import example Attempt 1: ```{r} demog <- read.table("data/demog.txt") str(demog) ``` * R doesn't know that the first row gives column headers * All columns have been interpreted as characters and converted to factors ## Data import example Attempt 2: ```{r} demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE) str(demog) ``` * The `gender` column is still interpreted as character data * This is because R doesn't know that missing values are encoded as `"."` ## Data import example Attempt 2: ```{r} demog <- read.table("data/demog.txt", header = TRUE, stringsAsFactors = FALSE, na.strings = ".") str(demog) demog$gender ``` * The columns `trt`, `gender`, and `race` should actually be categorical * We need more information about the numeric encoding to do this * Will see examples later ## Lists as containers of complex results * We have earlier used the `lm()` function to fit an OLS linear regression model * Let's see what the output returned by `lm()` actually looks like ```{r} fm <- lm(weight ~ height, data = mydf) # more on this later str(fm) ``` ## Lists as containers of complex results * Another example: `t.test()` to perform one-sample t-test ```{r} tt <- t.test(mydf$bmi, mu = 22) # Is mean BMI equal to 22? str(tt) ``` ## Lists as containers of complex results * These details are usually unimportant for regular use * Printing these results will show "user-friendly" output ```{r} print(tt) ``` \ * But to develop our own analysis tools, we will need some understanding of how this happens ## Another important data structure: matrix / array * Matrices and arrays arise very naturally in statistics * Two common uses: * Model matrix for linear models * Contingency tables . . . * Example: built-in dataset giving death rates (per 1000) for demographic subgroups ```{r} VADeaths dim(VADeaths) # dimension of the matrix ``` ## Another important data structure: matrix / array * Unlike data frames, they are always homogeneous (all elements of same type) * Matrices have row and column indexes, and possibly names * Indexing works in the same way as vectors, but in two dimensions (separated by `,`) ```{r} VADeaths[1:2, c(2, 3)] ``` * Indexing by "empty" index selects all rows / columns ```{r} VADeaths[, c("Rural Male", "Rural Female")] ``` * Such indexing also works for data frames ## Creating a matrix * There are many ways to create a matrix * Example: `matrix()` constructs matrix by providing data and dimensions ```{r} matrix(1:12, nrow = 3, ncol = 4) # fills up columns first by default matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE) # fills up rows first ``` ## Creating a matrix * `cbind()` and `rbind()` constructs matrices by combining columns or rows ```{r} X <- cbind(int = 1, ht = mydf$height) # Note: 1 is recycled head(X, 15) ``` * `X` is the design matrix for linear regression on height (including intercept) ## Matrix operations * Standard matrix operations: transpose `t()` and matrix product `%*%` * Can be used to solve linear regression equation ```{r} y <- mydf$weight XtX <- t(X) %*% X # transpose and matrix multiplication XtX Xty <- t(X) %*% y solve(XtX, Xty) # solves normal equations X'X beta = X'y fm$coefficients # compare with earlier result from lm() ``` ## Matrix representation * Internally, matrices and arrays are stored as vectors along with a dimension ```{r} x <- 1:12 x dim(x) <- c(4, 3) x ``` ## Matrix representation * General arrays can have more than two dimensions ```{r} dim(x) <- c(2, 2, 3) # three-dimensional array x ``` * Incidentally, assignments where the left-hand side looks like a function call are a special feature of R * These modify some aspect of an _already existing_ variable * These are known as _replacement_ functions ## Matrix representation * The underlying vector nature of a matrix is easy to verify ```{r} VADeaths VADeaths[4:10] # index as vector (data is stored columnwise) ``` ## Next steps * This background is enough to start on typical data analysis tasks * But before that, we also need to learn about accessing documentation * This requires a brief discussion of the class system in R ## The class of R objects Every R object must have a class ```{r} class(mydata) class(mydf) class(mydf$weight) class(fm) class(tt) # for 'hypothesis test' ``` ## Some functions are 'generic' functions * Generic functions are placeholder functions * They perform different tasks depending on type of argument passed to them * For example, `summary()` is a generic function ```{r} summary(mydf) summary(mydf$weight) ``` ## Some functions are 'generic' functions ```{r} summary(fm) ``` ## Methods of generic functions ```{r} methods("summary") ``` ## The `print()` function * A special generic function is called `print()` * Whenever the result of an evaluation is not assigned to a variable, it is "auto-printed" * This is done using the `print()` generic function * For example: ```{r} print(fm) # same as just entering fm ``` ## Documentation * Every dataset and function in R is documented in a help page * The documentation for a function can be accessed by `?` or `help()` ```r ?seq help(cbind) help(VADeaths) ``` ## Documentation of generic functions and methods * Generic functions have their own documentation page ```r help(summary) ``` * The documentation for a specific method may be in a different page ```r help(summary.lm) ``` * Note that you should never call `summary.lm()` directly instead of `summary()` ## Understanding function documentation * Most useful things in R happen by calling functions * Functions have one or more arguments * All arguments have names * Arguments may be compulsory or optional * Optional arguments have "default" values * Functions normally also have a useful "return" value * These are all described in the help page ## Understanding function documentation * Arguments may or may not be named when calling a function * If not named, arguments are matched by position * Conventionally, optional arguments are named, compulsory arguments are often not named ## Exercises * Load [02-rbasics.rmd](02-rbasics.rmd) in R Studio and work through the examples * Read the help page for `mean()` and `median()` * Here are two simple numeric vectors containing `NA` and `Inf` values ```{r} x <- c(1:5, NA, 7:10) y <- c(1:5, Inf, 7:10) ``` * Find the mean and median of these vectors * How can you make R ignore the `NA` value? * How can you make R ignore the `Inf` value? Hint: see `?is.finite` * * * ## Exercises * Go through the help pages for `write.table()` and `read.table()` to understand how they work. Skip non-essential details. * Export the `demog` data frame as a CSV file named `"demog.csv"` using `write.csv()` * Import this newly created dataset again using `read.csv()`, saving it as a variable named `d` * Do you need to specify the `header` argument? Why? * Do you need to specify the `na.strings` argument? Why? * * * ## Exercises * The goal of the next exercise is to convert `d$race` into a factor * Read the help page for `factor()` to learn how to create factors * `d$race` has three values: 1 = White, 2 = Black, 3 = Other * Create and add a new factor variable `d$frace` to the data frame `d` * `d$frace` should have "levels" 1, 2, 3, and correposponding "labels" White, Black, Other * * * ## Exercises * The goal of the next exercise is to import a SAS format dataset and use it to perform a t-test * SAS can export data in a binary format with extension `sas7bdat` * This is __not__ a documented export format, and is not meant to be imported into other software * However, it is common enough that there is contributed add-on package that has attempted to reverse-engineer the format * To use this package, we first need to load it into R as follows ```{r} ## install.packages("sas7bdat") # install package if not already installed (needed only once) library(package = "sas7bdat") ``` * Once the package is loaded, read the help page for `read.sas7bdat()` * Use it to import the file [sasdata/twosample.sas7bdat](sasdata/twosample.sas7bdat) * The imported data frame should have variables `PATNO`, `TRT`, `FEV0` and `FEV6` * * * ## Exercises * The story behind the dataset is as follows: > A new compound, ABC-123, is being developed for long-term treatment > of patients with chronic asthma. Asthmatic patients were enrolled in > a double blind study and randomized to receive daily oral doses of > ABC-123 or a placebo for 6 weeks. The primary measurement of > interest is the resting FEV1 (forced expiratory volume during the > first second of expiration), which is measured before (as FEV0) and > at the end (as FEV6) of the 6-week treatment period. * Add a variable called `CHG` to the dataset recording the change in FEV1 * Read the help page for `t.test()` * Does administration of ABC-123 have any effect on FEV1? Answer this by performing a two-sample t-test * Note that there may be multiple approaches to arrive at the correct answer