% NOTE -- ONLY EDIT THE .Rnw FILE!!!  The .tex file is
% likely to be overwritten.

\documentclass[compress]{beamer}
\usepackage{SweaveBeamer}
\input{commondefs}

\SweaveOpts{prefix.string=figs/introduction,eps=FALSE,pdf=TRUE,keep.source=TRUE}

\title{A Quick Introduction to \R{}}
\subtitle{Language Overview}

\begin{document}

\begin{frame}
  \titlepage
\end{frame}

<<>>=
options(width = 50)
@

\begin{frame}[fragile]
  \frametitle{Background}
  \begin{itemize}
  \item \R\ is often referred to as a \emph{dialect} of the \slan\ language
  \item \slan\ was developed at AT\&T Bell Laboratories by John Chambers and his colleagues doing research in statistical computing, beginning in the late 1970s
  \item The original \slan\ implementation is used in the commercially available software \emph{\splus}
  \item \R\ is an open source implementation developed independently, starting in the early 1990s
  \item The two are mostly similar, but there are differences as well
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Expressions and Objects}
  \R\ works by evaluating \emph{expressions} typed at the command prompt
  \begin{itemize}
  \item Expressions involve variable references, operators, function calls, etc.
  \item Most expressions, when evaluated, produce a value, which can either be assigned to a variable (e.g., \code{x <- 2 + 2}) or printed in the \R\ session
  \item Some expressions are useful for their \emph{side-effects} (e.g., \code{plot} produces graphics output)
  \end{itemize}
  \textit{Since evaluated expression values can be quite large, and often need to be re-used, it is good practice to assign them to variables rather than print them directly.}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Expressions and Objects}
  \emph{Objects} are anything that can be assigned to a variable.
  In the following example, \code{c(1, 2, 3, 4, 5)} is an \textbf{expression} that produces an \textbf{object}, whether or not the result is stored in a variable:
<<>>=
sum(c(1, 2, 3, 4, 5))
x <- c(1, 2, 3, 4, 5)
sum(x)
@
  \R\ has several important \emph{types} of objects that we will learn about; for example: \emph{functions, vectors} (numeric, character, logical), \emph{matrices, lists} and \emph{data frames}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Functions}
  Most useful things in \R\ are done by function calls. A function call looks like a name followed by some \emph{arguments} in parentheses.
<<eval=FALSE>>=
plot(height, weight)
@
  Apart from a special argument called \code{\ldots}, all arguments have a \emph{formal name}. When a function is evaluated, it needs to know what value has been assigned to each of its arguments.
\end{frame}

\begin{frame}[t,fragile]
  \frametitle{Functions}
  There are several ways to specify arguments:
  \begin{itemize}
  \item \emph{By position}: \\
    The first two arguments of the \code{plot} function are \code{x} and \code{y}. So,
<<eval=FALSE>>=
plot(height, weight)
@
    is equivalent to
<<eval=FALSE>>=
plot(x = height, y = weight)
@
  \item \emph{By name}: \\
    This is the safest way to match arguments, by specifying the argument names explicitly. This overrides positional matching, so an equivalent call is
<<eval=FALSE>>=
plot(y = weight, x = height)
@
    Formal argument names can be matched partially\\
    (we will see examples later).
  \end{itemize}
\end{frame}

\begin{frame}[t,fragile]
  \frametitle{Functions}
  There are several ways to specify arguments:
  \begin{itemize}
  \item \emph{With default values}: \\
    Arguments will often have default values. If they are not specified in the call, these default values will be used.
<<eval=FALSE>>=
plot(height)
@
  \end{itemize}
  Functions are just like other objects in \R:
  \begin{itemize}
  \item They can be assigned to variables
  \item They can be used as arguments in other function calls
  \item New function objects are defined using the construct \\
    \code{ function( arglist ) expr }
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Functions}
  A simple function:
<<>>=
myfun <- function(a = 1, b = 2, c) {
    return(list(a = a, b = b, c = c))
}
@
<<>>=
myfun(6, 7, 8)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Functions}
  A simple function:
<<>>=
myfun <- function(a = 1, b = 2, c) {
    return(list(a = a, b = b, c = c))
}
@
<<>>=
myfun(10, c = 'string')
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Function Arguments}
  The arguments that a particular function accepts (along with their default values) can be listed by the \code{args} function:
<<>>=
args(myfun)
args(plot.default)
@
  The triple-dot (\ldots) argument indicates that the function can accept any number of further arguments, named or unnamed. What happens to those arguments is determined by the function.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Vectors}
  The basic data types in \R\ are all vectors. The simplest varieties are \emph{numeric}, \emph{character} and \emph{logical} (\code{TRUE} or \code{FALSE}):
<<>>=
c(1, 2, 3, 4, 5)
c("Huey", "Dewey", "Louie")
c(T, T, F, T)
c(1, 2, 3, 4, 5) > 3
@
  \textit{ \code{T} and \code{F} are convenient abbreviations for \code{TRUE} and \code{FALSE} respectively.}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Vectors}
  The length of any vector can be determined by the \Rfunction{length} function:
<<>>=
gt.3 <- c(1, 2, 3, 4, 5) > 3
gt.3
length(gt.3)
sum(gt.3)
@
  The sum counts the \code{TRUE} values because of \emph{coercion} from logical to numeric.
\end{frame}

\begin{frame}
  \frametitle{Special values}
  \begin{itemize}
  \item \code{NA} Denotes a `missing value'
  \item \code{NaN} `Not a Number', e.g., $0 / 0$
  \item \code{Inf, -Inf} positive and negative infinities, e.g.
    $1 / 0$
  \item \code{NULL} Null object, mostly for programming convenience
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Functions that create vectors}
  \Rfunction{seq} creates a sequence of equidistant numbers (see \code{?seq})
<<>>=
seq(4, 10, 0.5)
seq(length = 10)
1:10
args(seq.default)
@
  \textit{\emph{Partial matching}: Note that the named argument \code{length} in the call to \Rfunction{seq} actually matches the argument \code{length.out}}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Functions that create vectors}
  \Rfunction{c} \emph{concatenates} one or more vectors:
<<>>=
c(1:5, seq(10, 20, length = 6))
@
  \Rfunction{rep} replicates a vector
<<>>=
rep(1:5, 2)
rep(1:5, length = 12)
rep(c('one', 'two'), c(6, 3))
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrices and Arrays}
  Matrices (and more generally arrays of any dimension) are stored in \R\ as a vector with dimensions:
<<>>=
x <- 1:12
dim(x) <- c(3, 4)
x
nrow(x)
ncol(x)
@
  The fact that the left hand side of an assignment can look like a function applied to an object (rather than a variable) is a very interesting and useful feature. Such functions are called \emph{replacement} functions.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrices and Arrays}
  The same vector can be used to create a 3-dimensional array
<<>>=
dim(x) <- c(2, 2, 3); x
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrices (contd)}
  Matrices can also be created conveniently by the \code{matrix} function. Their row and column names can be set.
<<>>=
x <- matrix(1:12, nrow = 3, byrow = TRUE)
rownames(x) <- LETTERS[1:3]
x
t(x)
@
  Matrices can be transposed by the \Rfunction{t} function. General array permutation is done by \Rfunction{aperm}.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrices (contd)}
  Matrices do not need to be numeric.
  There can be character or logical matrices as well:
<<>>=
matrix(month.name, nrow = 6)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrix multiplication}
  The multiplication operator (\code{*}) works element-wise, as with vectors. The matrix multiplication operator is \code{\%*\%}:
<<>>=
x
x * x
x %*% t(x)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Creating matrices from vectors}
  The \code{cbind} (\emph{column bind}) and \code{rbind} (\emph{row bind}) functions can create matrices from smaller matrices or vectors:
<<>>=
y <- cbind(A = 1:4, B = 5:8, C = 9:12)
y
rbind(y, 0)
@
  Note that the short vector ($0$) is replicated to fill the new row.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Factors}
  Factors are how \R\ handles \emph{categorical data} (e.g., eye color, gender, pain level). Such data are often available as numeric codes, but should be converted to factors for proper analysis.
<<>>=
pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3)
fpain
levels(fpain) <- c("none", "mild", "medium", "severe")
fpain
as.numeric(fpain)
@
  The last function extracts the internal representation of factors, as integer codes starting from 1.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Factors}
  Factors can also be created from character vectors.
<<>>=
text.pain <- c("none", "severe", "medium", "medium", "mild")
factor(text.pain)
@
  Note that the levels are sorted alphabetically by default, which may not be what you really want. It is usually a good idea to specify the levels explicitly when creating a factor.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Lists}
  \begin{itemize}
  \item \emph{Lists} are very flexible data structures used extensively in \R{}.
  \item A list is a vector, but the elements of a list do not need to be of the same type. Each element of a list can be \emph{any} \R{} object, including another list.
  \item Lists can be created using the \code{list} function.
  \item List elements are usually extracted by name \\
    (using the \code{\$} operator).
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Lists}
<<>>=
x <- list(fun = seq, len = 10)
x$fun
x$len
x$fun(length = x$len)
@
  \begin{itemize}
  \item Functions are \R\ objects too. In this case, the \code{fun} element of \code{x} is the already familiar \code{seq} function, and can be called like any other function.
  \item Lists give us the ability to create \emph{composite objects} that contain several related, simpler objects. Many useful \R\ functions return a list rather than a simple vector.
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Lists (contd)}
  A more natural example: energy intake (paired: before and after) ({\it \S 1.2.8, Dalgaard (2002)}):
<<>>=
intake.pre <- c(5260, 5470, 5640, 6180, 6390, 6515,
                6805, 7515, 7515, 8230, 8770)
intake.post <- c(3910, 4220, 3885, 5160, 5645, 4680,
                 5265, 5975, 6790, 6900, 7335)
mylist <- list(before = intake.pre, after = intake.post)
mylist
mylist$before
mylist[[2]]
@
  List elements can be extracted by name as well as by position.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Data Frames}
  \emph{Data frames} are \R\ objects that represent (rectangular) data sets, and are thus very important for statistical applications. They are essentially lists with some additional structure.
  \begin{itemize}
  \item Each element of a data frame has to be either a factor or a numeric, character or logical vector
  \item Each of these must have the same length
  \item They are similar to matrices because they have the same \emph{rectangular array} structure; the only difference is that different columns of a data frame can be of different data types.
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Data Frames}
  Data frames are created by the \code{data.frame} function:
<<>>=
d <- data.frame(intake.pre, intake.post)
d
d$intake.post
@
  The list-like \code{\$} operator can be used to extract columns.
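
  As a minimal sketch of this list-like behaviour, the chunk below builds a small data frame and inspects it as a list. The name \code{toy} and its columns \code{pre} and \code{post} are hypothetical, chosen only for illustration.
<<>>=
## 'toy' is a hypothetical data frame used only for illustration
toy <- data.frame(pre = c(5260, 5470, 5640),
                  post = c(3910, 4220, 3885))
is.list(toy)         # a data frame is a list of columns
length(toy)          # number of list elements (columns)
sapply(toy, length)  # length of each column
toy[["post"]]        # list-style extraction by name
toy$post             # the same column via the $ operator
@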
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing}
  Extracting one or more elements from a vector is done by \emph{indexing}. There are several kinds of indexing possible in \R{}, among them
  \begin{itemize}
  \item Indexing by a vector of positive integers
  \item Indexing by a vector of negative integers
  \item Indexing by a logical vector
  \item Indexing by a vector of names
  \end{itemize}
  In each case, the extraction is done by following the vector with a pair of brackets (\code{[...]}). The type of indexing depends on the object inside the brackets.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing by positive integers}
<<>>=
intake.pre
intake.pre[5]
intake.pre[c(3,5,7)]
ind <- c(3,5,7)
intake.pre[ind]
intake.pre[8:13]
intake.pre[c(1, 2, 1, 2)]
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing by positive integers}
  Works more or less as expected. Interesting features:
  \begin{itemize}
  \item Using an index bigger than the length of the vector produces \code{NA}s
  \item Indices can be repeated, resulting in the same element being chosen more than once. This feature is often very useful.
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing by negative integers}
  Using negative indices leaves out the specified elements.
<<>>=
intake.pre
intake.pre[-5]
ind <- -c(3,5,7)
ind
intake.pre[ind]
@
  Negative indices cannot be mixed with positive indices.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing by a logical vector}
  For this, the logical vector being used as the index should ideally be exactly as long as the vector being indexed. If it is shorter, it is replicated to be as long as necessary.
<<>>=
intake.pre
ind <- rep(c(TRUE, FALSE), length = length(intake.pre))
ind
intake.pre[ind]
intake.pre[c(T, F)]
@
  Only the elements that correspond to \code{TRUE} are retained.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Indexing by names}
  This works only for vectors that have names.
<<>>=
names(intake.pre) <- LETTERS[1:11]
intake.pre
intake.pre[c('A', 'B', 'C', 'K')]
names(intake.pre) <- NULL
@
  All these types of indexing work for matrices and arrays as well, as we shall see later.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Logical comparisons}
  All the usual logical comparisons are possible:
  \begin{center}
    \begin{tabular}{|c|c||c|c|}
      \hline
      less than & $<$ & less than or equal to & $<=$ \\
      \hline
      greater than & $>$ & greater than or equal to & $>=$ \\
      \hline
      equals & $==$ & does not equal & $!=$\\
      \hline
    \end{tabular}
  \end{center}
  Each of these operates on two vectors element-wise \\
  (the shorter one is replicated if necessary).
<<>>=
intake.pre
intake.pre > 7000
intake.pre > intake.post
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Logical operations}
  Element-wise boolean operations are also possible.
  \begin{center}
    \begin{tabular}{|c|c|}
      \hline
      AND & \code{\&} \\
      OR & \code{|} \\
      NOT & \code{!}\\
      \hline
    \end{tabular}
  \end{center}
<<>>=
intake.pre
intake.pre > 7000
intake.pre < 8000
intake.pre > 7000 & intake.pre < 8000
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Conditional Selection}
  Logical comparisons and indexing by logical vectors together allow subsetting a vector based on the properties of other (or perhaps the same) vectors.
<<>>=
intake.post
intake.post[intake.pre > 7000]
intake.post[intake.pre > 7000 & intake.pre < 8000]
month.name[month.name > "N"]
@
  For character vectors, comparison is determined by alphabetical order.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrix and Data frame indexing}
  Indexing for matrices and data frames is similar: brackets are used again, but with two indices. If one (or both) of the indices is unspecified, all the corresponding rows or columns are selected.
<<>>=
x <- matrix(1:12, 3, 4)
x
x[1:2, 1:2]
x[1:2, ]
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrix and Data frame indexing}
  If only one row or column is selected, the result is converted to a vector.
  This can be suppressed by adding the argument \code{drop = FALSE}:
<<>>=
x[1,]
x[1,,drop = FALSE]
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Matrix and Data frame indexing}
  Data frames behave similarly.
<<>>=
d[1:3,]
d[1:3, "intake.pre"]
d[d$intake.post < 5000, 1, drop = FALSE]
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Modifying objects}
  It is usually possible to modify \R\ objects by assigning a value to a subset or function of that object. For the most part, anything that makes sense works. This will become clearer with more experience.
<<>>=
x <- runif(10, min = -1, max = 1)
x
x < 0
x[x < 0] <- 0
x
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Adding columns to a data frame}
  New columns can be added to a data frame by assigning to a currently non-existent column name (this works for lists too):
<<>>=
d$decrease
d$decrease <- d$intake.pre - d$intake.post
d
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{The \Rfunction{subset} function}
  Working with data frames can become a bit cumbersome because we always need to prefix the name of the data frame to every column. \\
  There are several functions to make this easier. \\
  For example, \Rfunction{subset} can be used to select rows of a data frame.
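
  As a rough sketch of the difference, the chunk below selects the same rows twice, first with explicit prefixes and then with \Rfunction{subset}. The data frame \code{toy} and its columns are hypothetical and serve only as an illustration.
<<>>=
## 'toy' is a hypothetical data frame used only for illustration
toy <- data.frame(pre = c(5260, 6390, 7515, 8770),
                  post = c(3910, 5645, 5975, 7335))
## explicit prefix: the data frame name is repeated
toy[toy$pre > 6000 & toy$post < 7000, ]
## subset(): column names are looked up inside toy
subset(toy, pre > 6000 & post < 7000)
@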
\end{frame}

\begin{frame}[fragile]
  \frametitle{The \Rfunction{subset} function}
<<>>=
library(ISwR)
data(thuesen)
str(thuesen)
thue2 <- subset(thuesen, blood.glucose < 7)
thue2
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{The \Rfunction{transform} function}
  {\small Similarly, the \code{transform} function can be used to add new variables to a data frame using the old ones}
<<>>=
thue3 <- transform(thue2, log.gluc = log(blood.glucose))
thue3
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{The \Rfunction{with} function}
  Another similar and very useful function is \code{with}, which can be used to evaluate arbitrary expressions using variables in a data frame:
<<>>=
with(thuesen, log(blood.glucose))
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Grouped data}
  Grouped data have one or more numerical variables, and one or more categorical factors (a.k.a.\ groups) that indicate the category for each observation. The most natural way to store such data is as data frames with different columns for the numerical and categorical variables.
<<>>=
data(energy)
str(energy)
summary(energy)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Extracting information by group}
  It is easy to extract data by category:
<<>>=
exp.lean <- energy$expend[energy$stature == "lean"]
exp.obese <- with(energy, expend[stature == "obese"])
exp.lean
exp.obese
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Extracting information by group}
  A more sophisticated way to do this:
<<>>=
l <- with(energy, split(x = expend, f = stature))
l
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Extracting information by group}
  More generally, arbitrary functions can be applied to data frames split by a group using the \Rfunction{by} function:
<<>>=
by(data = energy, INDICES = energy$stature, FUN = summary)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Sorting}
  Vectors can be sorted by \Rfunction{sort}:
<<>>=
sort(intake.post)
@
  But it is usually more useful to work with the \emph{sort order}, using the \Rfunction{order} function, which returns an integer indexing vector that can be used to get the sorted vector. This is useful for re-ordering the rows of a data frame by one or more columns.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Sorting}
<<>>=
ord <- order(intake.post)
ord
intake.post[ord]
intake.pre[ord]
d[ord, ]
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Implicit loops}
  We often need to apply one particular function to all elements in a vector or a list. Generally, this would be done by looping through all those elements. \R\ has a few functions to do this elegantly:
  \begin{itemize}
  \item \Rfunction{lapply}: Returns the results as a list
  \item \Rfunction{sapply}: Tries to simplify the results into a vector
  \item See also \Rfunction{apply} and \Rfunction{tapply}
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Implicit loops}
<<>>=
lapply(thuesen, mean)
sapply(thuesen, mean)
sapply(thuesen, mean, na.rm = TRUE)
@
  Note the need for \code{na.rm = TRUE} because there is a missing observation in one of the rows.
  Unless otherwise specified, calculations involving an \code{NA} usually produce an \code{NA}.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Graphics}
  \R's graphics capabilities are one of its strongest features. They can also be fairly complicated, with many features that are rarely used. Instead of going into details here, we will learn about \R\ graphics by looking at some examples later. Meanwhile,
  \begin{itemize}
  \item Look at \code{help(plot.default)}
  \item Look at \code{help(par)}
  \end{itemize}
  These two help pages cover most of the options and features common to standard graphics functions. They contain a lot of information, and are mostly useful as references to look up when you need to do something special.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Programming constructs}
  \R\ has the standard programming constructs:
  \begin{itemize}
  \item \code{if}
  \item \code{else}
  \item \code{for}
  \item \code{while}
  \item etc.
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{for loops}
  \begin{itemize}
  \item Since most \R\ functions work on vectors, the \code{for} construct is rarely needed for simple use.
  \item The \code{for} keyword is always followed by an expression of the form \code{(variable in vector)}.
  \item The block of statements that follows this is executed once for every value in \code{vector}, with that value being stored in \code{variable}
<<>>=
for (i in 1:5) {
    print(i^2)
}
@
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{while and if statements}
<<>>=
fibonacci <- function(length.out) {
    if (length.out < 0) {
        warning("length.out cannot be negative")
        return(NULL)
    } else if (length.out < 2)
        x <- seq(length = length.out) - 1
    else {
        x <- c(0, 1)
        while (length(x) < length.out) {
            x <- c(x, sum(rev(x)[1:2]))
        }
    }
    x
}
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{while and if statements}
<<>>=
fibonacci(-1)
fibonacci(1)
fibonacci(10)
@
  Note that a function returns the value of the last expression it evaluates (in this case \code{x}), so an explicit \code{return()} is not necessary.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Session management}
  \R\ has the ability to save objects, to be loaded again later. By default, when exiting, \R\ offers to save all the objects currently in the workspace; when starting up the next time (in the same directory), it loads them again.
  \vspace{10mm}

  See \code{?save}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Classes and generic functions}
  \R\ implements a system of \emph{object-oriented} programming, based on the following concepts:
  \begin{itemize}
  \item \emph{Generic functions}: functions meant to do a particular task, but which do it differently depending on the object they operate on. Examples: \Rfunction{plot}, \Rfunction{summary}, \Rfunction{mean}
  \item \emph{Methods}: specific versions of the generic function
  \item \emph{Class}: attribute of an object that determines which method will be used, e.g.,
<<>>=
class(thuesen)
class(thuesen$blood.glucose)
class(seq)
@
  \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Methods}
  Methods for a particular generic can be listed using the \Rfunction{methods} function.
  \\ ~ \\
  There is usually a \emph{``default''} method, conventionally named \Rfunction{plot.default}, \Rfunction{summary.default}, etc.
  \\ ~ \\
<<>>=
methods(mean)
@
  % \item The basic OOP paradigm in \R\ is that \emph{conceptually similar
  %   jobs should be done by the same function, irrespective of the data
  %   type they work on}.
  % \item See \code{?class} for more information.
  % \end{itemize}
\end{frame}

\begin{frame}[fragile]
  \frametitle{Inspecting \R\ objects using \Rfunction{str}}
  The \Rfunction{str} function prints out information about the \emph{structure} of any \R\ object.
<<>>=
str(thuesen)
@
  This can be especially useful for large data frames.
\end{frame}

\begin{frame}[fragile]
  \frametitle{Type checking}
  It is often useful to know whether an object is of a certain type. There are several functions of the form \code{is.type} which do this.
  \\~\\
  Note that these are not methods of a generic function \code{is}, even though the naming convention is similar.
<<>>=
is.data.frame(thuesen)
is.list(thuesen)
is.numeric(thuesen)
is.function(thuesen)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Detecting special values}
  Some similarly-named functions are used for element-wise checking. The most important of these is \Rfunction{is.na}, which is needed to identify which elements of a vector are missing.
<<>>=
thuesen$short.velocity
thuesen$short.velocity == NA
is.na(thuesen$short.velocity)
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Detecting special values}
<<>>=
is.na(c(Inf, NaN, NA, 1))
is.nan(c(Inf, NaN, NA, 1))
is.finite(c(Inf, NaN, NA, 1))
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Coercion methods}
  There are several functions of the form \code{as.type} that are used to convert objects of one type to another.
<<>>=
as.numeric(c("1", "2", "2a", "b"))
as.numeric(c(TRUE, FALSE, NA))
as.character(c(TRUE, FALSE, NA))
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Coercion methods (contd)}
  There are some automatic coercion rules that often simplify things
<<>>=
thuesen$blood.glucose < 7
sum(thuesen$blood.glucose < 7)
@
  but can sometimes produce surprising results
<<>>=
1 == TRUE
1 == "1"
"1" == TRUE
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Coercion methods (contd)}
<<>>=
t(as.matrix(thuesen)) ## data.frame -> matrix
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Coercion methods (contd)}
<<>>=
as.list(thuesen) ## data.frame -> list
@
\end{frame}

\begin{frame}[fragile]
  \frametitle{Further resources}
  At this point, you should know enough about \R\ to do more exploration on your own. For now, you can use the datasets that come with \R\ (you can get a list using \code{data()}); we will learn how to import data soon.
  \\~\\
  The course page has links to more resources.
\end{frame}

\end{document}