An Overview of the R Programming Environment

Deepayan Sarkar

Software for Statistics

  • Computing software is essential for modern statistics

    • Large datasets
    • Visualization
    • Simulation
    • Iterative methods
  • Many softwares are available, but we will focus on R

  • Why R?

    • Available as Free / Open Source Software

    • Very popular (both academia and industry)

    • Easy to try out on your own

  • Also because I know R better than other languages (Python and Julia are other good alternatives)

Outline

  • Installing R

  • Basics of using R

  • Some examples

  • Fitting linear models in R

Installing R

  • R is most commonly used as a REPL (Read-Eval-Print-Loop)

    • When it is started, R Waits for user input

    • Evaluates and prints result

    • Waits for more input

  • There are several different interfaces to do this

  • R itself works on many platforms (Windows, Mac, UNIX, Linux)

  • Some interfaces are platform-specific, some work on most

  • R and the interface may need to be installed separately

Installing R

  • This will install R, as well as a default graphical interface on Windows and Mac
  • I will recommend a different interface called R Studio that needs to be installed separately

  • I personally use yet another interface called ESS which works with a general purpose editor called Emacs (download link for Windows)

Running R

  • Once installed, you can start the appropriate interface (or R directly) to get something like this:
R Under development (unstable) (2018-05-05 r74699) -- "Unsuffered Consequences"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Loading required package: utils
> 
  • The > represents a prompt indicating that R is waiting for input.

  • The difficult part is to learn what to do next

Basic usage

The R REPL essentially works like a calculator

[1] 782
[1] 3.857143
[1] 7.389056
[1] 1024

R has standard mathematical functions

[1] 25
[1] 4.787492
[1] 3628800
[1] 15.10441

R has standard mathematical functions

[1] 3003
[1] 3003

R supports variables

[1] 1024
[1] 100
[1] 3628800
[1] 21.79106

R can compute on vectors

[1] 15
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
 [1]    1   15  105  455 1365 3003 5005 6435 6435 5005 3003 1365  455  105   15    1

R has functions for statistical calculations

 [1] 1.336346e-02 6.681731e-02 1.559070e-01 2.251991e-01 2.251991e-01 1.651460e-01 9.174777e-02 3.932047e-02
 [9] 1.310682e-02 3.398065e-03 6.796131e-04 1.029717e-04 1.144130e-05 8.800998e-07 4.190952e-08 9.313226e-10
 [1] 1.336346e-02 6.681731e-02 1.559070e-01 2.251991e-01 2.251991e-01 1.651460e-01 9.174777e-02 3.932047e-02
 [9] 1.310682e-02 3.398065e-03 6.796131e-04 1.029717e-04 1.144130e-05 8.800998e-07 4.190952e-08 9.313226e-10

R has functions that work on vectors

[1] 3.75
[1] 3.75

R can draw graphs

plot of chunk unnamed-chunk-9

R can simulate random variables

 [1] "H 1"  "D 1"  "C 1"  "S 1"  "H 2"  "D 2"  "C 2"  "S 2"  "H 3"  "D 3"  "C 3"  "S 3"  "H 4"  "D 4"  "C 4"  "S 4" 
[17] "H 5"  "D 5"  "C 5"  "S 5"  "H 6"  "D 6"  "C 6"  "S 6"  "H 7"  "D 7"  "C 7"  "S 7"  "H 8"  "D 8"  "C 8"  "S 8" 
[33] "H 9"  "D 9"  "C 9"  "S 9"  "H 10" "D 10" "C 10" "S 10" "H 11" "D 11" "C 11" "S 11" "H 12" "D 12" "C 12" "S 12"
[49] "H 13" "D 13" "C 13" "S 13"
 [1] "S 3"  "C 6"  "D 1"  "S 13" "S 8"  "S 5"  "H 12" "H 3"  "C 10" "C 11" "S 12" "D 3"  "D 12"
 [1] "H 4"  "D 2"  "D 13" "D 12" "C 2"  "D 11" "H 11" "C 4"  "C 10" "S 6"  "C 1"  "D 1"  "H 12"

R can simulate random variables

 [1]  2.3015978047  0.7705667884 -0.0892046760 -0.9865371964  0.1810126429  0.3879747647 -0.2050735532 -2.0976958604
 [9]  0.7322113434  0.7798364023  0.4075021768  1.1769618106 -0.7462886765  0.2604936430  1.0177229912 -1.4089085065
[17] -1.1018401259 -0.7095959598  0.0396401117  0.4131531330 -0.2439613859  0.5237437640  1.0509392597  1.1101180234
[25] -1.7204412437 -0.2713384125 -0.3736775529 -0.5267496485 -0.7690549718 -1.4408822619 -0.0008896339  0.5305939182
[33] -0.4447285252  0.6269810650  0.9813285849  0.6229496408  0.2847310574 -1.0710706779  0.9876113962 -0.5857974226
[41]  0.1273712433 -0.4372480986 -2.3501188661  1.6769334822  2.4683746362  0.1797619689 -1.3621373802  0.4193724498
[49]  1.5429262058  0.1843367986

Example: random walk

plot of chunk unnamed-chunk-13

R is in fact a full programming language

  • Variables

  • Functions

  • Control flow structures

    • For loops, while loops

    • If-then-else (branching)

  • Distinguishing features

    • Focus on vectors and vectorized operations

    • Treatment of functions at par with other object types

  • We will see a few examples to illustrate what I mean by this

Some examples

Before we start, an experiment!

the dress

Color combination: Is it white & gold or blue & black ? Let’s count!

Question: What proportion of population sees white & gold?

  • Statistics uses data to make inferences

  • Model:

    • Let \(p\) be the probability of seeing white & gold

    • Assume that individuals are independent

  • Data:

    • Suppose \(X\) out of \(N\) sampled individuals see white & gold; e.g., \(N = 22\), \(X = 4\).

    • According to model, \(X \sim Bin(N, p)\)

  • “Obvious” estimate of \(p = X / N = 4 / 22 = 0.1818\)

  • But how is this estimate derived?

Generally useful method: maximum likelihood

  • Likelihood function: probability of observed data as function of \(p\)

\[ L(p) = P(X = 4) = {22 \choose 4} p^{4} (1-p)^{(22-4)}, p \in (0, 1) \]

  • Intuition: \(p\) that gives higher \(L(p)\) is more “likely” to be correct

  • Maximum likelihood estimate \(\hat{p} = \arg \max L(p)\)

  • By differentiating \[ \log L(p) = c + 4 \log p + 18 \log (1-p) \] we get \[ \frac{d}{dp} \log L(p) = \frac{4}{p} - \frac{18}{1-p} = 0 \implies 4 (1-p) - 18 p = 0 \implies p = \frac{4}{22} \]

How could we do this numerically?

  • Pretend for the moment that we did not know how to do this.

  • How could we arrive at the same solution numerically?

  • Basic idea: Compute \(L(p)\) for various values of \(p\) and find minimum.

  • To do this in R, remember that R works like a calculator:

    • The user types in an expression, R calculates the answer

    • The expression can involve numbers, variables, and functions

  • For example:

[1] 0.001744032

“Vectorized” computations

  • One important distinguishing feature of R is that it operates on “vectors”
  [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.20 0.21 0.22
 [24] 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45
 [47] 0.46 0.47 0.48 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68
 [70] 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90 0.91
 [93] 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1.00
  [1] 0.000000e+00 6.104468e-05 8.135864e-04 3.424448e-03 8.981244e-03 1.816014e-02 3.112581e-02 4.756632e-02
  [9] 6.679673e-02 8.788797e-02 1.097942e-01 1.314635e-01 1.519244e-01 1.703469e-01 1.860795e-01 1.986644e-01
 [17] 2.078363e-01 2.135088e-01 2.157514e-01 2.147628e-01 2.108405e-01 2.043521e-01 1.957077e-01 1.853355e-01
 [25] 1.736617e-01 1.610932e-01 1.480058e-01 1.347351e-01 1.215714e-01 1.087568e-01 9.648595e-02 8.490724e-02
 [33] 7.412670e-02 6.421205e-02 5.519768e-02 4.708981e-02 3.987162e-02 3.350820e-02 2.795111e-02 2.314247e-02
 [41] 1.901852e-02 1.551265e-02 1.255785e-02 1.008871e-02 8.042842e-03 6.361987e-03 4.992675e-03 3.886614e-03
 [49] 3.000817e-03 2.297533e-03 1.744032e-03 1.312274e-03 9.785206e-04 7.229017e-04 5.289691e-04 3.832572e-04
 [57] 2.748619e-04 1.950500e-04 1.369032e-04 9.500179e-05 6.514771e-05 4.412621e-05 2.950433e-05 1.946290e-05
 [65] 1.265841e-05 8.111339e-06 5.116943e-06 3.175152e-06 1.936197e-06 1.159101e-06 6.804388e-07 3.912116e-07
 [73] 2.199829e-07 1.207963e-07 6.466384e-08 3.368058e-08 1.703221e-08 8.342155e-09 3.946399e-09 1.797505e-09
 [81] 7.854421e-10 3.278832e-10 1.301292e-10 4.882107e-11 1.719852e-11 5.643294e-12 1.708098e-12 4.712720e-13
 [89] 1.167905e-13 2.551777e-14 4.799371e-15 7.529135e-16 9.440292e-17 8.910680e-18 5.800271e-19 2.272840e-20
 [97] 4.269521e-22 2.508903e-24 1.768718e-27 7.026760e-33 0.000000e+00

Plotting is very easy

plot of chunk unnamed-chunk-17

Functions

  • Functions can be used to encapsulate repetitive computations

  • Like mathematical functions, they take arguments as input and “returns” an output

[1] 0.001744032
[1] 0.2158045

Functions can be plotted directly

plot of chunk unnamed-chunk-19

…and they can be numerically “optimized”

$maximum
[1] 0.1818189

$objective
[1] 0.2158045

A more complicated example

  • Suppose \(X_1, X_2, ..., X_n \sim Bin(N, p)\), and are independent

  • Instead of observing each \(X_i\), we only get to know \(M = \max(X_1, X_2, ..., X_n)\)

  • What is the maximum likelihood estimate of \(p\)? (\(N\) and \(n\) are known, \(M = m\) is observed)

A more complicated example

To compute likelihood, we need p.m.f. of \(M\) : \[ P(M \leq m) = P(X_1 \leq m, ..., X_n \leq m) = \left[ \sum_{x=0}^m {N \choose x} p^{x} (1-p)^{(N-x)} \right]^n \] and \[ P(M = m) = P(M \leq m) - P(M \leq m-1) \]

Maximum Likelihood estimate

plot of chunk unnamed-chunk-22

$maximum
[1] 0.4996703

$objective
[1] 0.1981222

Another example: Linear regression

'data.frame':   200 obs. of  5 variables:
 $ sex   : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...
 $ weight: int  77 58 53 68 59 76 76 69 71 65 ...
 $ height: int  182 161 161 177 157 170 167 186 178 171 ...
 $ repwt : int  77 51 54 70 59 76 77 73 71 64 ...
 $ repht : int  180 159 158 175 155 165 165 180 175 170 ...

Call:
lm(formula = weight ~ height, data = Davis)

Coefficients:
(Intercept)       height  
    25.2662       0.2384  


You should always plot the data first!

plot of chunk unnamed-chunk-25

The regression model is fit by minimizing least squares

(Intercept)      height 
 25.2662278   0.2384059 

We can confirm using a general optimizer:

$par
[1] 25.3053648  0.2381936

$value
[1] 43713.12

$counts
function gradient 
      99       NA 

$convergence
[1] 0

$message
NULL

Fitting the regression model

lm() gives exact solution and more statistically relevant details


Call:
lm(formula = weight ~ height, data = Davis)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.696  -9.506  -2.818   6.372 127.145 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 25.26623   14.95042   1.690  0.09260 . 
height       0.23841    0.08772   2.718  0.00715 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.86 on 198 degrees of freedom
Multiple R-squared:  0.03597,   Adjusted R-squared:  0.0311 
F-statistic: 7.387 on 1 and 198 DF,  p-value: 0.007152

Changing the model-fitting criteria

  • Suppose we wanted to minimize sum of absolute errors instead of sum of squares

  • No closed form solution any more, but general optimizer will still work:

$par
[1] -106.000787    1.000005

$value
[1] 1504

$counts
function gradient 
     169       NA 

$convergence
[1] 0

$message
NULL

This is an example of robust regression

plot of chunk unnamed-chunk-30

R gives access to an extensive toolset

  • Most standard data analysis methods are already implemented

  • Can be extended by writing add-on packages

  • Thousands of add-on packages are available

Drawbacks

  • Learning R needs some effort

  • Not point-and-click software

  • Command-line interface

Good practices

  • Use a good interface (R Studio is the most popular one)

  • Save your code in a script file (makes it easier to reproduce later)