Deepayan Sarkar
Compulsory non-credit course (pass marks: 35%)
Does not count towards composite score, but you need to pass
Syllabus
Basics in Programming: flow-charts, logic in programming
Common syntax
Handling input/output files
Sorting
Iterative algorithms
Simulations from statistical distributions
Programming for statistical data analyses: regression, estimation, parametric tests
Think of tasks that cannot be easily done without a computer
Could be both related and unrelated to what you are studying
Can be solved using scalar variables only:
Is a given natural number \(n \in \mathbb{N}\) prime?
Given integer \(k \geq 0\), compute its factorial \(k!\), and \(\log k!\)
Given integers \(n, k \geq 0\) such that \(k \leq n\), compute \(n \choose k\)
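As a rough illustration of the scalar-only problems above, here is a minimal Python sketch. The function names (factorial, log_factorial, choose) are chosen here for illustration and are not from the course material. Each function uses only a loop and a few scalar variables.

```python
import math

def factorial(k):
    """Return k! for an integer k >= 0 using a simple loop."""
    result = 1
    for i in range(2, k + 1):
        result = result * i
    return result

def log_factorial(k):
    """Return log(k!) as a sum of logs, without forming the huge value k! itself."""
    total = 0.0
    for i in range(2, k + 1):
        total = total + math.log(i)
    return total

def choose(n, k):
    """Return the binomial coefficient n choose k for integers 0 <= k <= n."""
    k = min(k, n - k)  # symmetry: C(n, k) == C(n, n - k)
    result = 1
    for i in range(1, k + 1):
        # After this step result equals C(n - k + i, i), so the division is exact.
        result = result * (n - k + i) // i
    return result
```

Computing \(\log k!\) as a sum of logarithms is one possible design choice: it stays within floating-point range even when \(k!\) itself would be astronomically large.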
Probably need vector objects to be solved:
Find all prime numbers less than a given number \(N\)
Sort a given collection of numbers
Produce a random permutation of a given set of numbers
Given set \(S\) and query object \(x\), determine whether \(x \in S\) (set membership)
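The random permutation problem above is a typical case where a vector (here a Python list) is needed. One common approach, used here as an assumption rather than as the method the course will adopt, is the Fisher-Yates shuffle. The function name random_permutation and the rng argument are hypothetical.

```python
import random

def random_permutation(values, rng=random):
    """Return a new list holding the elements of `values` in random order."""
    result = list(values)          # copy, so the input is left untouched
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)      # uniform on 0, 1, ..., i (both ends included)
        result[i], result[j] = result[j], result[i]
    return result
```

Passing rng=random.Random(123) makes a run reproducible, which is convenient when checking simulations.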
Simple random walk (+1 or -1 with probability \(p\) and \(1-p\)):
Toss a coin (with probability of head \(p\)) until you get \(k\) consecutive heads.
Given a game of snakes and ladders, how many throws of the dice does it take to reach the end?
Shuffle a deck of cards.
How can we probabilistically model a shuffle?
How many times do we need to shuffle to make the deck approximately random?
How can we “test” for randomness?
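The coin-tossing problem above can be simulated with a single loop. This is a minimal sketch; the function name tosses_until_run and the rng argument are made up for illustration.

```python
import random

def tosses_until_run(k, p, rng=random):
    """Toss a coin with P(head) = p until k consecutive heads appear.

    Returns the total number of tosses used.
    """
    tosses = 0
    run = 0                      # length of the current run of heads
    while run < k:
        tosses += 1
        if rng.random() < p:     # head
            run += 1
        else:                    # tail: the run starts over
            run = 0
    return tosses
```

Averaging the result over many repetitions gives an estimate of the expected number of tosses, which for \(p = 1/2\) and \(k = 3\) should be close to the known value 14.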
Given a function \(f\), solve for \(f(x) = 0\), e.g.,
solve non-linear equations like \(e^x + \sin x = 0\)
solve linear equations (e.g., as part of fitting linear models)
Optimization: given a function \(f\), find \(x\) where \(f(x)\) is minimized
Solution used usually depends on context
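For the non-linear equation \(e^x + \sin x = 0\) mentioned above, one simple approach is bisection. It is chosen here only as an example, under the assumption that \(f\) is continuous and changes sign on the starting interval. The function name bisect_root and the tolerance are illustrative.

```python
import math

def bisect_root(f, lo, hi, tol=1e-10):
    """Find x in [lo, hi] with f(x) close to 0.

    Assumes f is continuous and f(lo), f(hi) have opposite signs.
    """
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid                 # the sign change is in [lo, mid]
        else:
            lo, f_lo = mid, f_mid    # the sign change is in [mid, hi]
    return (lo + hi) / 2

def f(x):
    return math.exp(x) + math.sin(x)

# f(-1) < 0 < f(0), so there is a root between -1 and 0.
root = bisect_root(f, -1.0, 0.0)
```

Bisection is slow but very robust; faster methods typically need more assumptions, which is one reason the solution used depends on context.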
We will spend a lot of time discussing algorithms
An algorithm is essentially a set of instructions to solve a problem
Algorithms usually require some inputs
Instructions are executed sequentially, finally resulting in an output
You can think of an algorithm as a recipe (inputs: ingredients, output: food!)
Basic idea: see if \(n\) is divisible by any number between \(2\) and \(n-1\)
In fact, it is enough to check if \(n\) is divisible by any number between \(2\) and \(\sqrt{n}\): if \(n = ab\) with \(a \leq b\), then \(a \leq \sqrt{n}\)
Intuitively, the second approach is more “efficient”
is_prime(n)
The meaning of this algorithm / pseudo-code should be more or less obvious
Assumes availability of certain basic operators / functions (mod, sqrt)
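To make the efficiency comparison between the two approaches concrete, the following Python sketch counts how many trial divisions each one performs. The function names are hypothetical; math.isqrt gives the integer part of \(\sqrt{n}\).

```python
import math

def divisions_naive(n):
    """Count trial divisions when testing every d in 2, ..., n - 1."""
    count = 0
    for d in range(2, n):
        count += 1
        if n % d == 0:
            break                # a divisor was found, so n is not prime
    return count

def divisions_sqrt(n):
    """Count trial divisions when testing only d in 2, ..., floor(sqrt(n))."""
    count = 0
    for d in range(2, math.isqrt(n) + 1):
        count += 1
        if n % d == 0:
            break
    return count
```

For the prime 101, the first approach needs 99 divisions and the second only 9. For a number with a small factor, such as 100, both stop after the first division.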
We often employ some conventions and use some structures in pseudo-code
For example,
is_prime(n)
Is an algorithm correct? To be correct, an algorithm must
stop after a finite number of steps, and
produce the correct output for all possible inputs (i.e., all instances of the problem).
How efficient is the algorithm?
What resources does the algorithm need to run, typically in terms of time and storage?
How does it compare with other algorithms for the same problem?
There are actually many different approaches to programming
We will mostly consider structured programming
Characterized by use of various control flow constructs (if, then, while, for, etc.) and block structures
More specifically, we will focus on procedural programming
Characterized by use of modular procedures (usually called functions)
We are mainly interested in procedures that perform some computations
Most algorithms we will discuss directly correspond to procedures or functions when actually implemented
The main components of our programs are going to be functions.
Usually a programming language will have many built-in functions
Additional libraries or packages will provide more standard functions
Functions usually
have one or more input arguments,
perform some computations, possibly calling other functions, and
return one or more output values.
The main contribution of a function is the second step
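The three parts of a function listed above can be seen in a small Python sketch. The function mean_and_sd is a made-up example, not part of the course material.

```python
import math

def mean_and_sd(values):
    """Return the sample mean and sample standard deviation.

    Assumes `values` is a list with at least two numbers.
    """
    n = len(values)                              # input argument: values
    mean = sum(values) / n                       # calls the built-in function sum()
    ss = sum((x - mean) ** 2 for x in values)    # sum of squared deviations
    sd = math.sqrt(ss / (n - 1))                 # calls sqrt() from the math library
    return mean, sd                              # two output values, returned together
```

A call such as m, s = mean_and_sd([2, 4, 4, 4, 5, 5, 7, 9]) stores the two outputs in separate variables.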
The standard model for performing computations is sequential execution
In other words, a function executes a set of instructions in a specified sequence
Some control flow structures may be used to create branches or loops in the flow of execution
Briefly, the main ingredients used are:
Declaration of variables (implicit in some languages). The details of how variables store values, and who can access them (scope) are important, and will be discussed later.
Evaluation of expressions. Can involve variables provided they have been defined in an earlier step.
Assignment to variables (to store intermediate results for later use).
Logical tests (equal?, less than?, greater than?, is more input available?).
Logical operations (AND, OR, NOT, XOR).
Branching - take different paths based on result of a logical operation (if-then-else).
Loops - repeat sequence of steps, usually a fixed number of times, or while a condition holds (for / while).
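Most of these ingredients already appear in a very short function. As an illustration (the choice of example is ours), here is Euclid's algorithm for the greatest common divisor in Python, where variables are declared implicitly on first assignment.

```python
def gcd(a, b):
    """Greatest common divisor of integers a, b >= 0 (not both zero)."""
    # Branching on logical tests combined with a logical operation (OR).
    if a < 0 or b < 0:
        raise ValueError("a and b must be non-negative")
    # Loop: repeat the steps while a condition holds.
    while b != 0:
        remainder = a % b    # evaluate an expression, assign the result to a variable
        a = b                # assignment, storing intermediate results for the next step
        b = remainder
    return a
```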
+ (addition)
* (multiplication)
/ (division, possibly integer division)
^ (power)
% (the modulo operation)
& (AND)
| (OR)
! (NOT)
== (equality)
!= (\(\neq\))
<, > (strictly less than or greater than)
<=, >= (\(\leq\), \(\geq\))
round, floor, ceil, abs, sqrt, exp, log, sin, cos, ...
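The exact spelling of these operators varies between languages. As one example, the following sketch shows how they look in Python, where power is written **, integer division is //, and the logical operators are the words and, or, not. The variable names are arbitrary.

```python
import math

a, b = 7, 2

total = a + b            # addition: 9
product = a * b          # multiplication: 14
quotient = a / b         # division: 3.5
int_quotient = a // b    # integer division: 3
power = a ** b           # power: 49 (in Python, ^ is bitwise XOR, not power)
remainder = a % b        # modulo: 1

both = (a > 5) and (b > 5)      # AND: False
either = (a > 5) or (b > 5)     # OR: True
negated = not (a == b)          # NOT applied to an equality test: True
differ = a != b                 # not equal: True
at_most = b <= a                # less than or equal: True

root = math.sqrt(16)            # 4.0
rounded_down = math.floor(3.7)  # 3
rounded_up = math.ceil(3.2)     # 4
magnitude = abs(-2.5)           # 2.5
```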
The algorithms we discuss can be implemented in many programming languages
Some standard languages suitable for structured programming are
There are also many others with various relative strengths and weaknesses
In this course, we will mainly focus on
R because it already has an extensive collection of statistical software that we can use
C / C++ because it is easy to call C / C++ code from R (useful when R code is inefficient)
is_prime algorithm in various languages
Recall the is_prime algorithm to determine if a number is prime
With slight modification to use only integer arithmetic
is_prime(n)
/* Assumes n >= 2; for smaller n the loop body never runs and 1 is returned. */
int is_prime_c(int n)
{
    int i = 2;
    while (i * i <= n) {
        if (n % i == 0) {
            return 0;
        }
        i = i + 1;
    }
    return 1;
}
C is a compiled language, so actually running this code involves some additional work
Note that all variable types need to be explicitly declared
This includes the types of function arguments (inputs) and return value (output)
## Assumes n >= 2; for smaller n the loop body never runs and TRUE is returned.
is_prime_r <- function(n)
{
    i <- 2
    while (i * i <= n) {
        if (n %% i == 0) {
            return(FALSE)
        }
        i <- i + 1
    }
    return(TRUE)
}
The basic structure is very similar, but with some differences:
Assignment uses <- (= also works in R)
%% instead of % for the modulo operation
TRUE and FALSE instead of 1 and 0 for logical values
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
In Python, the main difference is that indentation defines code blocks
Changing indentation will change meaning of code, which does not happen in C or R
However, code in all languages should be indented properly for readability
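A Python version along the same lines might look like the sketch below. The name is_prime_py is chosen here by analogy with is_prime_c and is_prime_r, and this is not necessarily the code used in the course. Unlike the C and R versions shown earlier, the sketch also returns False for n < 2.

```python
def is_prime_py(n):
    # Numbers below 2 are not prime.
    if n < 2:
        return False
    i = 2
    # The body of the while loop is everything indented below it.
    while i * i <= n:
        if n % i == 0:
            return False
        i = i + 1
    return True

# int() converts True / False to 1 / 0.
for n in (4, 10, 100, 101):
    print(int(is_prime_py(n)))   # prints 0, 0, 0, 1 (one per line)
```

Removing the indentation of i = i + 1 by one level would move it out of the if block only visually in C or R, but in Python it changes which block the statement belongs to.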
0
0
0
1
#include <stdio.h>
#include <stdlib.h>

int is_prime_c(int n)
{
    int i = 2;
    while (i * i <= n) {
        if (n % i == 0) {
            return 0;
        }
        i = i + 1;
    }
    return 1;
}

int main(int argc, char *argv[])
{
    int i, n;
    if (argc > 1) { /* one or more arguments supplied */
        for (i = 1; i < argc; i++) {
            n = atoi(argv[i]); /* converts string to integer */
            printf("%d -> %d\n", n, is_prime_c(n));
        }
    }
    else printf("Usage: %s <n1> <n2> ...\n", argv[0]);
    return 0;
}
The code needs to be “compiled” before it is run
It also needs a main() function to be defined
main() is run first when the program is executed
Here is a complete file that can be compiled
How to compile & run depends on the operating system
Usage: ./is_prime <n1> <n2> ...
4 -> 0
10 -> 0
100 -> 0
101 -> 1
R, Python, etc., are “interpreted” languages that read and evaluate code interactively
Compiled code is usually (but not always) much faster than interpreted code
Most interpreters are themselves written in a compiled language
However, compiled languages have several disadvantages:
Ultimately, choice depends on the purpose of the program
We will mainly use R (to take advantage of its many useful features)
We will not write C programs designed to be run directly
However, we will sometimes call C / C++ code from R to take advantage of its speed
The easiest way to do this is using a package called Rcpp
Python code can similarly be called using the reticulate package
And Julia code can be called using the JuliaCall package
I will give an example of Rcpp to illustrate its usefulness
We will look at it in more detail after learning more about R and C
library(package = "Rcpp")
sourceCpp(code = "
#include <Rcpp.h>
// [[Rcpp::export]]
int is_prime_c(int n)
{
    int i = 2;
    while (i * i <= n) {
        if (n % i == 0) {
            return 0;
        }
        i = i + 1;
    }
    return 1;
}
")
[1] 0
[1] 0
[1] 0
[1] 1
We can call both versions on a sequence of integers as follows
The time required is recorded using system.time()
user system elapsed
11.950 0.008 11.958
user system elapsed
2.454 0.016 2.471
The C version is clearly faster
Would have been even faster if the loop was also in C
We can try this later after we discuss vectors / arrays
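The same kind of measurement can be sketched in Python, with time.perf_counter() playing a role similar to system.time(). This is only an analogy to the R experiment above, not a reproduction of it; the names is_prime_py and count_primes and the limit 10000 are assumptions made for illustration.

```python
import time

def is_prime_py(n):
    """Trial division using integer arithmetic only."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i = i + 1
    return True

def count_primes(limit):
    """Count the integers n in 2, ..., limit - 1 for which is_prime_py(n) is true."""
    count = 0
    for n in range(2, limit):
        if is_prime_py(n):
            count += 1
    return count

start = time.perf_counter()
n_primes = count_primes(10_000)
elapsed = time.perf_counter() - start    # elapsed wall-clock time in seconds
print(n_primes, "primes found in", round(elapsed, 3), "seconds")
```

Here both the loop over integers and the primality check run in the interpreter, so the timing says nothing about compiled code; it only shows how such a measurement is set up.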
[1] 78499
[1] 78499
[1] 999931 999953 999959 999961 999979 999983
[1] 999931 999953 999959 999961 999979 999983
[1] TRUE
Over the next few classes, we will learn R more formally
We will then come back to study algorithms in more detail