Introductory Computer Programming: 2016-17

Course Information

Instructor: Deepayan Sarkar <deepayan@isid.ac.in>
Course notes

Syllabus

Basics in Programming: flow-charts, logic in programming
Common syntax
Handling input/output files
Sorting
Iterative algorithms
Simulations from statistical distributions
Programming for statistical data analyses: regression, estimation, parametric tests

Assignment 2

Deadline: March 27, 2017 (by email)

The goal of this assignment is to combine data from different sources and perform a regression analysis.

The following datasets provide various measurements for each county (similar to districts in India) in the USA.

USCancerRatesMales.csv provides age-adjusted death rate due to cancer among the male population during the period 1999-2003.
USBlackPop.csv provides the percent of resident population in 2000 that were black.
USEducation.xls provides the percent of adults with a bachelor's degree or higher in 2000.

Note that not all files contain data for all counties, and the counties they have in common do not appear in the same order. Also, some files use full state names while others use only abbreviations. The correspondence between state names and state abbreviations is given by the built-in R variables state.name and state.abb.

Combine the three datasets to create a merged dataset that records all three measurements for the counties that are available in all the datasets. The R functions tolower(), intersect(), and match() could be useful.
Fit a regression model with death rate due to cancer among males as response, and state, percent of black population, and percent of adults with a bachelor's degree or higher as predictors. Interpret your findings.
Look at the usual regression diagnostics to identify systematic model violations and unusual observations. Fit an updated model if necessary. Summarize your findings.

Submit both R code and a narrative report in PDF format that explains the steps you have taken. The report should include (only) relevant results.

Assignment 1

Deadline: March 6, 2017 (by email)

The goal of this assignment is to design and implement an algorithm to compute a specific order statistic (i.e., quantile) so that the algorithm runs in linear time in the average case. This can be done by modifying the quicksort algorithm so that after the partition step, the recursion is called only on one subarray (you need to work out the details).
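The modification described above can be sketched as follows. This is a minimal standalone illustration, not the course template: the names quickselect and SelectResult, the random choice of pivot, the Lomuto-style partition, and the seed argument are all assumptions made for this sketch, and the Rcpp interface is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch (not part of the course template).
struct SelectResult {
    double value;      // the w-th smallest element of x
    long comparisons;  // number of element-vs-pivot comparisons performed
};

// Returns the w-th order statistic of x (1-based, i.e. sort(x)[w] in R terms).
// x is passed by value, so the caller's vector is not reordered.
SelectResult quickselect(std::vector<double> x, std::size_t w, unsigned seed = 1) {
    assert(w >= 1 && w <= x.size());
    std::mt19937 gen(seed);
    std::size_t lo = 0, hi = x.size() - 1, k = w - 1;  // k is the 0-based target index
    long ncomp = 0;
    while (lo < hi) {
        // Assumption: pick the pivot uniformly at random and move it to the end.
        std::uniform_int_distribution<std::size_t> pick(lo, hi);
        std::swap(x[pick(gen)], x[hi]);
        const double pivot = x[hi];
        // Partition step: elements smaller than the pivot go to the left.
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i) {
            ++ncomp;
            if (x[i] < pivot) {
                std::swap(x[i], x[store]);
                ++store;
            }
        }
        std::swap(x[store], x[hi]);  // pivot is now at its final sorted position
        if (k == store) break;       // the pivot is the answer
        // Unlike quicksort, continue in only ONE of the two subarrays.
        if (k < store) hi = store - 1;
        else lo = store + 1;
    }
    return {x[k], ncomp};
}
```

For example, quickselect({5.0, 3.0, 1.0, 4.0, 2.0}, 3).value gives the median of the five values. Because only one side of each partition is processed, the expected amount of work shrinks geometrically from one step to the next, which is where the linear average-case behaviour comes from.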

As a template, an implementation that solves the problem using the quicksort algorithm, along with an R interface, is available here. It contains two functions that can be called from R via Rcpp: order_statistic(x, w), which returns the w-th order statistic of the vector x, and ncomparisons(x, w), which returns the number of comparisons needed to do so.

Modify the given code to compute order statistics as described above. Remember, the call order_statistic(x, w) must return the w-th order statistic of x, i.e., sort(x)[w], but without actually sorting x.

Submit the modified C++ file with the name quantile.cpp. You can test whether it gives correct results using the sample code in quantile.R.
Design a simulation study in R to determine empirically through a hypothesis test whether the average run time is Θ(n) as opposed to Θ(n log n).

Using the simulation study, also empirically determine the order of the variance of the average case runtime.

Submit a file simulation.R containing your R code and a file simulation.pdf containing a narrative report.

Exercises

Find some interesting studies and summarize (summary).
Choose a study, formulate one or more interesting questions that the study may be able to answer, and obtain the necessary data.
Write programs in both R and C to simulate the following random variable: An urn contains n black and n white balls. Pick balls without replacement, and let X_k denote the number of black balls minus the number of white balls observed after the k-th draw. The random variable we are interested in is the proportion of time during which X_k was positive.
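One possible way to simulate a single realization of the urn variable is sketched below in C++ (the exercise itself asks for R and C). The function name proportion_positive is hypothetical, and the sketch assumes that "proportion of time" means the fraction of the 2n draws k for which X_k > 0.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Simulates one complete sequence of 2n draws without replacement and returns
// the fraction of draws k = 1, ..., 2n after which X_k > 0.
// Assumption: the denominator is the total number of draws, 2n.
double proportion_positive(int n, std::mt19937 &gen) {
    assert(n >= 1);
    // Encode black balls as +1 and white balls as -1.
    std::vector<int> urn(2 * n, -1);
    std::fill(urn.begin(), urn.begin() + n, +1);
    // A uniformly random permutation is equivalent to drawing without replacement.
    std::shuffle(urn.begin(), urn.end(), gen);
    int x = 0;         // running value of X_k
    int positive = 0;  // number of k with X_k > 0
    for (int step : urn) {
        x += step;
        if (x > 0) ++positive;
    }
    return static_cast<double>(positive) / (2.0 * n);
}
```

Calling the function repeatedly with the same generator gives independent realizations, which can then be summarized, for instance with a histogram.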

Code demos

Slides

R Graphics

Last updated: Thu Nov 3 2016