---
title: "An Overview of the R Programming Environment"
author: "Deepayan Sarkar"
---
```{r opts, echo = FALSE, results = "hide", warning = FALSE, message = FALSE}
knitr::opts_chunk$set(cache = TRUE, cache.path='~/knitr-cache/ctw-roverview/',
autodep = TRUE, comment = "", warning = TRUE, message = TRUE,
knitr.table.format = "html", dev.args = list(pointsize = 16),
fig.width = 15, fig.height = 7, dpi = 110,
fig.path='figures/roverview-')
options(warnPartialMatchDollar = FALSE, width = 110)
```
## R for Clinical Trials
What is R?
- An environment for statistical computing and graphics (from [website](https://www.r-project.org))
- Available as [Free](https://en.wikipedia.org/wiki/Free_software_movement) /
[Open Source](https://en.wikipedia.org/wiki/The_Open_Source_Definition) Software
- Very popular (both academia and industry)
- Easy to try out on your own
. . .
Is it suitable for clinical trials research / analysis?
- Yes! See this [Task View](https://cran.r-project.org/web/views/ClinicalTrials.html) if you are looking for specific methods not covered in the workshop
- For regulatory issues, see the R [certification](https://www.r-project.org/certification.html) page
- Ask me if you still have any concerns after this workshop
## R for Clinical Trials
Is R better than SAS?
- I don't know SAS, so I don't have an opinion
- Both are tools; you should use what you are comfortable with
Main differences: (from looking at some SAS examples for this workshop)
- R is designed to be used _interactively_
- It is more similar to traditional programming languages like C / Java / Python
- Almost everything in R is done by _calling functions_ (similar to SAS PROCs?)
- Writing __new__ functions is much easier (and very common) in R
## Overall agenda for the workshop
- Informal overview of R
- More formal introduction to the language
- Statistical analysis (model fitting, visualization, tests)
- Case studies using clinical trial examples
## Outline of first part: Overview
- Installing R
- Basics of using R
- Example: Linear regression
- Working with "reproducible documents"
## Installing R
* R is most commonly used as a [REPL](https://en.wikipedia.org/wiki/Read-eval-print_loop) (Read-Eval-Print-Loop)
    * When it is started, R waits for user input
    * Evaluates and prints the result
    * Waits for more input
. . .
* There are several different _interfaces_ to do this
* R itself works on many platforms (Windows, Mac, UNIX, Linux)
* Some interfaces are platform-specific, some work on most
. . .
* R and the interface may need to be installed separately
## Installing R
* I assume you have already done this! Otherwise:
* Go to [CRAN](https://cran.r-project.org) (or choose a [mirror](https://cran.r-project.org/mirrors.html) first)
* Follow instructions depending on your platform (probably Windows)
. . .
* This will install R, as well as a default graphical interface on Windows and Mac
. . .
* I recommend a different interface called
[R Studio](https://www.rstudio.com/) that needs to be installed separately
* I hope that you have also done this (but it is not essential)
## Running R
* Once installed, you can start the appropriate interface (or R directly) to get something like this:
```
R Under development (unstable) (2019-12-29 r77627) -- "Unsuffered Consequences"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
Loading required package: utils
>
```
\
* The `>` represents a _prompt_ indicating that R is waiting for input.
* The difficult part is to learn what to do next
## The R REPL essentially works like a calculator
```{r}
34 * 23
27 / 7
exp(2)
2^10
```
\
. . .
* Formally, R _evaluates_ the _expression_ typed in at the prompt
* This may sometimes result in an error (or a `+` prompt requesting more input)
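
As a small illustration (not one of the original examples), an error can be captured with `try()` so that it does not stop the rest of the code. An incomplete expression such as `34 *` typed at the prompt would instead produce the `+` continuation prompt.

```{r}
# sqrt() of a character string is an error; try() captures it instead of stopping
res <- try(sqrt("two"), silent = TRUE)
inherits(res, "try-error")
```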
## R has standard mathematical functions
```{r}
sqrt(5 * 125)
log(120)
factorial(10)
log(factorial(10))
```
\
Most non-trivial tasks involve calling functions
## R has standard mathematical functions
```{r}
choose(15, 5)
factorial(15) / (factorial(10) * factorial(5))
```
. . .
```{r}
choose(1500, 2)
factorial(1500) / (factorial(1498) * factorial(2))
```
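
The factorial-based formula fails here because `factorial(1500)` is too large to be stored as a floating point number and overflows to `Inf`, so the ratio becomes `Inf / Inf`. One possible workaround, sketched here with the base R functions `lfactorial()` and `lchoose()`, is to do the arithmetic on the log scale and exponentiate at the end.

```{r}
factorial(1500)                      # too large: overflows to Inf
# Same formula on the log scale, exponentiated at the end
via.logs <- exp(lfactorial(1500) - lfactorial(1498) - lfactorial(2))
via.logs
exp(lchoose(1500, 2))                # log-scale version of choose()
```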
## R supports variables
```{r}
x <- 2
y <- 10
x^y
y^x
factorial(y)
log(factorial(y), base = 10)
```
\
Variable assignment is done using "\ `a <- b`\ ", but "\ `a = b`\ " also works
## R can compute on vectors
```{r}
N <- 15
x <- seq(0, N)
N
x
choose(N, x)
```
\
. . .
This is one of the __most important__ distinguishing features of R
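
To make the contrast concrete, here is a small sketch (the names `k`, `loop.result` and `vec.result` are only illustrative) that computes the same binomial coefficients with an explicit loop, as one might in C or Java, and with a single vectorized call.

```{r}
k <- 0:15
# Element-by-element loop
loop.result <- numeric(length(k))
for (i in seq_along(k)) {
    loop.result[i] <- choose(15, k[i])
}
# Vectorized call: choose() is applied to every element of k at once
vec.result <- choose(15, k)
all.equal(loop.result, vec.result)
```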
## R has built-in functions for probability calculations
```{r}
p <- 0.25
choose(N, x) * p^x * (1-p)^(N-x)
dbinom(x, size = N, prob = p)
```
\
* The result of any evaluation is usually printed, unless assigned to a variable
* When printing vectors, R prefixes each output line with the index of the first element
## R has functions that work on vectors
```{r}
p.x <- dbinom(x, size = N, prob = p)
sum(x * p.x) / sum(p.x)
N * p
```
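
The same idea extends to other moments. As a further check (not in the original slides), the variance computed directly from the probabilities should agree with the closed form `N * p * (1 - p)`. The names below are chosen only to avoid overwriting the variables used above.

```{r}
n.trials <- 15
prob <- 0.25
k <- 0:n.trials
p.k <- dbinom(k, size = n.trials, prob = prob)
mu <- sum(k * p.k)                   # mean, equal to N * p
var.direct <- sum((k - mu)^2 * p.k)  # variance from the definition
var.direct
n.trials * prob * (1 - prob)         # closed form
```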
## R can draw graphs
```{r}
plot(x, p.x, ylab = "Probability", pch = 16)
title(main = sprintf("Binomial(%g, %g)", N, p))
abline(h = 0, col = "grey")
```
\
## R can simulate random variables
```{r}
cards <- as.vector(outer(c("H", "D", "C", "S"), 1:13, paste))
cards
```
. . .
```{r}
sample(cards, 13)
sample(cards, 13)
```
## R can simulate random variables
```{r}
z <- rnorm(50, mean = 0, sd = 1)
z
mean(z)
sd(z)
median(z)
```
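
Each call to `sample()` or `rnorm()` gives a different result. When a simulation needs to be repeatable, the random number generator can be seeded using `set.seed()`; the seed value used here is arbitrary.

```{r}
set.seed(2020)
z1 <- rnorm(5)
set.seed(2020)
z2 <- rnorm(5)
identical(z1, z2)   # same seed, same draws
```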
## R is in fact a full programming language
* Variables
* Functions
* Control flow structures
    * For loops, while loops
    * If-then-else (branching)
. . .
* Distinguishing features
    * Focus on _vectors_ and _vectorized operations_
    * Treatment of _functions_ at par with other object types
* We will see a few examples to illustrate what I mean by this
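
As a first taste of what treating functions like other objects means, the sketch below (using a made-up helper called `summarize.with()`) passes functions as arguments to another function, including one created on the fly.

```{r}
# A function that takes another function as an argument
summarize.with <- function(v, f) f(v)

v <- c(2, 4, 4, 10)
summarize.with(v, mean)
summarize.with(v, median)
# Functions can be created on the fly, without being given a name
summarize.with(v, function(u) max(u) - min(u))
# sapply() applies a function to each element of a vector
sapply(1:5, function(n) sum(1:n))
```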
## Example: Linear regression
Let us simulate some fake height-weight data
```{r}
ht <- rnorm(200, mean = 172, sd = 10) # height in cm
bmi <- rnorm(200, mean = 22, sd = 2.2) # bmi (independent of height)
wt <- bmi * (ht / 100)^2 # weight in kg
```
. . .
A simple least squares regression model is fit using the `lm()` function
```{r}
fm <- lm(wt ~ ht) # OLS regression of weight on height
```
As usual, no output is printed if the result is assigned to a variable
## Examine fitted model
Useful output is produced by the functions `coef()` and `summary()`
```{r}
coef(fm)
summary(fm)
```
## You should always plot the data to assess model fit!
```{r}
plot(wt ~ ht)
abline(coef(fm))
```
\
## Let's introduce a mistake in the data
Switch height and weight for the first case
```{r}
tmp <- wt[1]
wt[1] <- ht[1]
ht[1] <- tmp
```
Fit the OLS regression line again
```{r}
fm2 <- lm(wt ~ ht)
summary(fm2)
```
## Plot the data and regression line again
```{r}
plot(wt ~ ht)
abline(coef(fm2))
```
\
## How can we "fix" this?
Two common approaches:
1. Delete "outlier" and re-fit model
. . .
2. Use robust regression
    - Least squares fit is sensitive to outliers
    - Instead, minimise sum of absolute errors (or something similar)
. . .
How?
- We can find and use a method someone has already implemented
- We can implement a solution of our own
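
Both approaches can be sketched with tools that ship with R. The example below regenerates similar data under separate names (`h`, `w`) so that it can be run on its own. Here `subset = -1` drops the first case, and `rlm()` from the recommended package MASS fits a robust regression. By default `rlm()` uses Huber's M-estimator, which is "something similar" to, but not the same as, minimising the sum of absolute errors.

```{r}
set.seed(123)                             # arbitrary seed, for repeatability
h <- rnorm(200, mean = 172, sd = 10)      # height in cm
w <- rnorm(200, mean = 22, sd = 2.2) * (h / 100)^2   # weight in kg
tmp <- w[1]; w[1] <- h[1]; h[1] <- tmp    # swap height and weight for case 1

fm.ols    <- lm(w ~ h)                    # affected by the outlier
fm.drop   <- lm(w ~ h, subset = -1)       # approach 1: delete the outlier
fm.robust <- MASS::rlm(w ~ h)             # approach 2: robust regression
rbind(ols = coef(fm.ols), drop = coef(fm.drop), robust = coef(fm.robust))
```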
## R package ecosystem
* R gives access to an extensive toolset
* Most standard data analysis methods are already implemented
* Can be extended by writing add-on packages
* Thousands of add-on packages are available from [CRAN](https://cran.r-project.org)
. . .
* Quality varies
* This may matter in regulated environments
## R package ecosystem
For this workshop, I will mostly use the "default" R packages
* These come by default along with a standard installation of R
* Consist of "base" packages
```{r}
rownames(installed.packages(priority = "base"))
```
\
* ... and "recommended" packages
```{r}
rownames(installed.packages(priority = "recommended"))
```
\
There are many other extremely powerful packages that you can explore later
## The other option: implement our own solution
- This is often surprisingly easy
- We will come back to the robust regression example later
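
As a rough preview, and only a sketch: the sum of absolute errors can be written as an R function and handed to the general-purpose optimizer `optim()`. The data are simulated again under separate names, height is centred to help the optimizer, and the default Nelder-Mead method is used for convenience rather than because it is the best algorithm for this problem.

```{r}
set.seed(123)                             # arbitrary seed, for repeatability
h <- rnorm(200, mean = 172, sd = 10)
w <- rnorm(200, mean = 22, sd = 2.2) * (h / 100)^2
tmp <- w[1]; w[1] <- h[1]; h[1] <- tmp    # swap height and weight for case 1
hc <- h - mean(h)                         # centred height

# Objective: sum of absolute errors for intercept beta[1] and slope beta[2]
sum.abs.err <- function(beta, x, y) sum(abs(y - beta[1] - beta[2] * x))

start <- coef(lm(w ~ hc))                 # least squares fit as starting value
lad <- optim(start, sum.abs.err, x = hc, y = w)
rbind(ols = start, lad = lad$par)
```

Because functions are ordinary objects, the objective function is simply passed to `optim()` as an argument.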
## Interactive vs "pipeline" analysis
R encourages an _interactive_ data analysis workflow
* Result of each step should dictate the next step to take
* Understanding this will help make sense of the design of R
. . .
* Sometimes it is useful to apply a standard workflow on a new dataset
* These are usually done using "R scripts"
* I would also strongly encourage using dynamic "notebooks"
## Notebooks in R Studio
* Notebooks allow you to mix text and code
* The notebook source contains only text and code; results are generated dynamically
* R Studio supports several variants

## Notebooks in R Studio
* These slides are written using "R Markdown"
* Open the file named [01-roverview.rmd](01-roverview.rmd) in R Studio
* These can also be used to generate HTML / PDF / Word reports (requires some more software)
* Benefit: reproducible results, no copy-paste errors
## Before we move on...
Please spend some time getting comfortable with R Studio:
* Load the R Markdown file in R Studio
* Run the "code chunks" (marked using ` ```{r} ... ``` `)
* Useful shortcut: Ctrl + Shift + Enter
* Try making small changes and re-run