---
title: An overview of the R programming environment
author: Deepayan Sarkar
---
```{r opts, echo = FALSE, results = "hide", warning = FALSE, message = FALSE}
opts_chunk$set(dev = "png", cache=TRUE, cache.path='~/knitr-cache/roverview/', autodep=TRUE,
## prompt = TRUE,
comment = "",
warning = TRUE, message = TRUE,
knitr.table.format = "html",
fig.width = 15, fig.height = 8, dpi = 96, fig.path='figures/roverview-')
library(lattice)
lattice.options(default.args = list(as.table = TRUE),
default.theme = list(fontsize = list(text = 20, points = 12)))
options(warnPartialMatchDollar = FALSE, width = 120, show.signif.stars = FALSE)
```
## Software for Statistics
- Computing software is essential for modern statistics
- Large datasets
- Visualization
- Simulation
- Iterative methods
- Many softwares are available
. . .
- We will learn about R
- Available as [Free](https://en.wikipedia.org/wiki/Free_software_movement) /
[Open Source](https://en.wikipedia.org/wiki/The_Open_Source_Definition) Software
- Very popular (both academia and industry)
- Easy to try out on your own
## Outline
- Installing R
- Some examples
- A little bit of history
- Some thoughts on why R has been successful
## Installing R
* R is most commonly used as a [REPL](https://en.wikipedia.org/wiki/Read-eval-print_loop) (Read-Eval-Print-Loop)
* This is essentially the model used by a calculator:
* Waits for user input
* Evaluates and prints result
* Waits for more input
. . .
* There are several different _interfaces_ to do this
* R itself works on many platforms (Windows, Mac, UNIX, Linux)
* Some interfaces are platform-specific, some work on most
. . .
* R and the interface may need to be installed separately
## Installing R
* Go to
\includegraphics[width=0.4\textwidth]{the_dress.png}
. . .
Color combination: Is it __white & gold__ or __blue & black__ ? Let's count!
```{r echo=FALSE}
NN <- 44
XX <- 26
```
## Question: What proportion of population sees white & gold?
- Statistics uses data to make inferences
- Model:
- Let $p$ be the probability of seeing white & gold
- Assume that individuals are independent
. . .
- Data:
- Suppose $X$ out of $N$ sampled individuals see white & gold;
e.g., $N = `r {NN}`$, $X = `r {XX}`$.
- According to model, $X \sim Bin(N, p)$
. . .
- "Obvious" estimate of $p = X / N = `r {XX}` / `r {NN}` = `r {round(XX/NN, 4)}`$
- But how is this estimate derived?
## Generally useful method: maximum likelihood
- Likelihood function: probability of observed data as function of $p$
$$
L(p) = P(X = `r {XX}`) = {`r {NN}` \choose `r {XX}`} p^{`r {XX}`} (1-p)^{(`r {NN}`-`r {XX}`)}, p \in (0, 1)
$$
- Intuition: $p$ that gives higher $L(p)$ is more "likely" to be correct
- Maximum likelihood estimate $\hat{p} = \arg \max L(p)$
. . .
- By differentiating
$$
\log L(p) = c + `r {XX}` \log p + `r {NN-XX}` \log (1-p)
$$
we get
$$
\frac{d}{dp} \log L(p) = \frac{`r {XX}`}{p} - \frac{`r {NN-XX}`}{1-p} = 0 \implies `r {XX}` (1-p) - `r {NN-XX}` p = 0 \implies p = \frac{`r {XX}`}{`r {NN}`}
$$
## How could we do this numerically?
- Pretend for the moment that we did not know how to do this.
- How could we arrive at the same solution numerically?
- Basic idea: Compute $L(p)$ for various values of $p$ and find minimum.
. . .
- To do this in R, the most important thing to understand is that __R
works like a calculator__:
- The user types in an expression, R calculates the answer
- The expression can involve numbers, variables, and functions
. . .
- For example:
```{r echo=FALSE}
N = NN
x = XX
fakeCode <- sprintf("```r\nN = %d\nx = %d\n```", NN, XX)
```
`r {fakeCode}`
```{r}
p = 0.5
choose(N, x) * p^x * (1-p)^(N-x)
```
## "Vectorized" computations
- One distinguishing feature of R is that it operates on "vectors"
```{r}
pvec = seq(0, 1, by = 0.01)
pvec
Lvec = choose(N, x) * pvec^x * (1-pvec)^(N-x)
Lvec
```
## Plotting is very easy
```{r}
plot(x = pvec, y = Lvec, type = "l")
```
\
## Functions
- Functions can be used to encapsulate repetitive computations
- Like mathematical functions, R function also take arguments as input
and "returns" an output
```{r}
L = function(p) choose(N, x) * p^x * (1-p)^(N-x)
L(0.5)
L(x/N)
```
## Functions can be plotted directly
```{r}
plot(L, from = 0, to = 1)
```
\
## ...and they can be numerically "optimized"
```{r}
optimize(L, interval = c(0, 1), maximum = TRUE)
```
\
- Compare with
```{r}
x / N
```
## A more complicated example
- Suppose $X_1, X_2, ..., X_n \sim Bin(N, p)$, and are independent
- Instead of observing each $X_i$, we only get to know $M = \max(X_1, X_2, ..., X_n)$
- What is the maximum likelihood estimate of $p$? ($N$ and $n$ are known, $M = m$ is observed)
## A more complicated example
To compute likelihood, we need p.m.f. of $M$ :
$$
P(M \leq m) = P(X_1 \leq m, ..., X_n \leq m) =
\left[ \sum_{x=0}^m {N \choose x} p^{x} (1-p)^{(N-x)} \right]^n
$$
and
$$
P(M = m) = P(M \leq m) - P(M \leq m-1)
$$
. . .
In R,
```{r}
n = 10
N = 50
M = 30
F <- function(p, m)
{
x = seq(0, m)
(sum(choose(N, x) * p^x * (1-p)^(N-x)))^n
}
L = function(p)
{
F(p, M) - F(p, M-1)
}
```
## Maximum Likelihood estimate
```{r echo=FALSE,fig.height = 6}
pp = seq(0.2,0.8,by=0.005)
plot(pp, sapply(pp, L), type = "l")
```
\
```{r}
optimize(L, interval = c(0, 1), maximum = TRUE)
```
## "The Dress" revisited
- What factors determine perceived color? (From [23andme.com](https://blog.23andme.com/23andme-research/genetics-and-that-striped-dress/))
\includegraphics[width=0.8\textwidth]{Age-Effect-Dress.jpg}
## Simulation: birthday problem
- R can be used to simulate random events
- Example: how likely is a common birthday in a group of 20 people?
```{r}
N = 20
days = sample(365, N, rep = TRUE)
days
length(unique(days))
```
## Law of Large Numbers
- With enough replications, sample proportion should converge to probability
```{r}
haveCommon = function()
{
days = sample(365, N, rep = TRUE)
length(unique(days)) < N
}
haveCommon()
haveCommon()
haveCommon()
haveCommon()
```
## Law of Large Numbers
- With enough replications, sample proportion should converge to probability
- Do this sytematically:
```{r}
replicate(100, haveCommon())
```
## Law of Large Numbers
- With enough replications, sample proportion should converge to probability
```{r}
plot(cumsum(replicate(1000, haveCommon())) / 1:1000, type = "l")
lines(cumsum(replicate(1000, haveCommon())) / 1:1000, col = "red")
lines(cumsum(replicate(1000, haveCommon())) / 1:1000, col = "blue")
```
\
## A more serious example: climate change {#climate}
```{r,echo=FALSE}
globalTemp <- read.csv("data/annual_temp.csv")
kable(globalTemp[1:5], "html", table.attr='id="dt_global_temp"')
```
## Change in temperature (global average deviation) since 1851
```{r}
library(lattice)
xyplot(Temp ~ Year, data = globalTemp, grid = TRUE)
```
\
## Change in atmospheric carbon dioxide
```{r}
xyplot(CO2 ~ Year, data = globalTemp, grid = TRUE)
```
\
## Does change in $CO_2$ explain temperature rise?
```{r}
xyplot(Temp ~ CO2, data = globalTemp, grid = TRUE, type = c("p", "r")) # include OLS regression line
```
\
## Fitting the regression model
```{r}
fm = lm(Temp ~ 1 + CO2, data = globalTemp)
coef(fm) # estimated regression coefficients
```
. . .
We can confirm using a general optimizer:
```{r}
SSE = function(beta)
{
with(globalTemp,
sum((Temp - beta[1] - beta[2] * CO2)^2))
}
optim(c(0, 0), fn = SSE)
```
## Fitting the regression model
* `lm()` gives exact solution and more statistically relevant details
```{r}
summary(fm)
```
## Changing the model-fitting criteria
- Suppose we wanted to minimize _sum of absolute errors_ instead of sum of squares
- No closed form solution any more, but general optimizer will still work:
```{r}
SAE = function(beta)
{
with(globalTemp,
sum(abs(Temp - beta[1] - beta[2] * CO2)))
}
opt = optim(c(0, 0), fn = SAE)
opt
```
## Changing the model-fitting criteria
- Compare with least squares line
```{r}
coef(fm) # least squared errors
opt$par # least absolute errors
```
\
* The two lines are virtually identical in this case
* This is not always true
## Another example: number of phone calls per year in Belgium
```{r}
data(phones, package = "MASS")
xyplot(calls ~ year, data = phones, grid = TRUE)
```
\
## Another example: number of phone calls per year in Belgium
```{r}
fm2 <- lm(calls ~ year, data = phones)
SAE = function(beta)
{
with(phones,
sum(abs(calls - beta[1] - beta[2] * year)))
}
opt = optim(c(0, 0), fn = SAE)
```
. . .
```{r}
coef(fm2) # least squared errors
opt$par # least absolute errors
```
* The two lines are quite different
* The second line is an example of _robust regression_
## Another example: number of phone calls per year in Belgium
```{r fig.height=7}
xyplot(calls ~ year, data = phones, grid = TRUE,
panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(fm2, col = "red") # least squared errors
panel.abline(opt$par, col = "blue") # least absolute errors
})
```
\
## Summary
* Conventional statistical learning focuses on problems that can be "solved" analytically
. . .
* Numerical solutions are also valid solutions... but potentially difficult to obtain
* R makes it _easy_ to obtain numerical solutions and compare with traditional solutions
* We will come back to this idea when we next discuss the origins of R
# A very brief history of R
## What is R?
From its own website:
> R is a free software environment for statistical computing and graphics.
. . .
> It is a GNU project which is similar to the S language and
> environment which was developed at Bell Laboratories (formerly AT&T,
> now Lucent Technologies) by John Chambers and colleagues. R can be
> considered as a different implementation of S.
## The origins of S
- Developed at Bell Labs (statistics research department) 1970s onwards
- Primary goals
- Interactivity: Exploratory Data Analysis vs batch mode
- Flexibility: Novel vs routine methodology
- Practical: For actual use, not (just) academic research
. . .
John Chambers received the prestigious _ACM Software System Award_ in 1998
> For The S system, which has forever altered how people analyze,
> visualize, and manipulate data.
## The origins of R
- Early 1990s: Started as teaching tool by Robert Gentleman & Ross
Ihaka at the University of Auckland
- 1995: Convinced by Martin Mächler to release as Free Software (GPL)
- 2000: Version 1.0 released
Has since far surpassed S in popularity
## Number of R packages on CRAN
```{r, echo=FALSE,fig.width=11,fig.height=9}
cranpkgs <- read.csv("cran-packages.csv", skip = 1, stringsAsFactors = FALSE)
cranpkgs$date <- as.Date(cranpkgs$date, "%Y-%m-%d")
xyplot(pkgs ~ date, cranpkgs, grid = TRUE, aspect = "xy")
```
\
## Why the success? The user's perspective {#user}
- R is designed for data analysis
- Basic data structures are vectors
- Large collection of statistical functions
- Advanced statistical graphics capabilities
- The vast majority of R users use it as a statistical toolbox
- R "base" comes with a large suite of statistical modeling and graphics functions
- If these are not enough, more than 10000 add-on packages are available
## The developer's perspective
- Easy dissemination of research (through add-on packages)
- Rapid prototyping
- Interfaces to external software
## Rapid prototyping
John Chambers, _Programming with Data_:
> S is a programming language and environment for all kinds of
> computing involving data. It has a simple goal: _To turn ideas into
> software, quickly and faithfully._
. . .
A silly example: generate Fibonacci sequence
```{r }
fibonacci <- function(n) {
if (n < 2)
x <- seq(length = n) - 1
else {
x <- c(0, 1)
while (length(x) < n) {
x <- c(x, sum(tail(x, 2)))
}
}
x
}
fib10 <- fibonacci(10)
fib10
```
## Also easy to call C for efficiency
```{r, echo=FALSE}
cprog <- "
#include