deepayan@isid.ac.in
The goal of this assignment is to combine data from different sources and perform a regression analysis.
The following datasets provide various measurements for each county (similar to districts in India) in the USA.
USCancerRatesMales.csv provides age-adjusted death rate due to cancer among the male population during the period 1999-2003.
USBlackPop.csv provides the percent of resident population in 2000 that were black.
USEducation.xls provides the percent of adults with a bachelor's degree or higher in 2000.
Note that not all files contain data for all counties, and the
counties that are common do not appear in the same order. Also, some
files have full state names while others only have abbreviations. The
correspondence between state names and state abbreviations is given
by the built-in R variables state.name and state.abb.
Combine the three datasets to create a merged dataset that
records all three measurements for the counties that are available
in all the datasets. The R functions tolower(), intersect(),
and match() could be useful.
Fit a regression model with death rate due to cancer among males as response, and state, percent of black population, and percent of adults with a bachelor's degree or higher as predictors. Interpret your findings.
Look at the usual regression diagnostics to identify systematic model violations and unusual observations. Fit an updated model if necessary. Summarize your findings.
Submit both R code and a narrative report in PDF format that explains the steps you have taken. The report should include (only) relevant results.
The goal of this assignment is to design and implement an algorithm to compute a specific order statistic (i.e., quantile) so that the algorithm runs in linear time in the average case. This can be done by modifying the quicksort algorithm so that after the partition step, the recursion is called only on one subarray (you need to work out the details).
As a template, an implementation that solves the problem using the
quicksort algorithm, along with an R interface, is available here.
It contains two functions that can be called from R via Rcpp:
order_statistic(x, w), which returns the w-th order statistic in the
vector x, and ncomparisons(x, w), which returns the number of
comparisons needed to do so.
Modify the given code to compute order statistics as
described above. Remember, the call order_statistic(x, w)
must return the w-th order statistic of x, i.e., sort(x)[w],
but without actually sorting x.
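The modification described above (continuing into only one side after each partition step) can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the course template: the function name quickselect, its signature, the middle-element pivot, and the comparison counter passed by reference are all hypothetical choices, and the Rcpp interface is left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the course template): returns the w-th
// smallest element of x, with w 1-based like sort(x)[w] in R, and adds the
// number of element comparisons performed to `comparisons`.
// x is taken by value, so the caller's vector is left unchanged.
double quickselect(std::vector<double> x, std::size_t w, long &comparisons)
{
    if (w < 1 || w > x.size())
        throw std::out_of_range("w must satisfy 1 <= w <= length of x");
    std::size_t lo = 0, hi = x.size() - 1, k = w - 1; // k: 0-based target
    while (lo < hi) {
        // Lomuto partition of x[lo..hi], using the middle element as pivot.
        std::swap(x[lo + (hi - lo) / 2], x[hi]);
        const double pivot = x[hi];
        std::size_t store = lo;
        for (std::size_t i = lo; i < hi; ++i) {
            ++comparisons;
            if (x[i] < pivot) std::swap(x[i], x[store++]);
        }
        std::swap(x[store], x[hi]); // pivot now sits at its sorted position
        if (k == store) return x[k];
        if (k < store) hi = store - 1; // target is in the left part only
        else lo = store + 1;           // target is in the right part only
    }
    return x[lo]; // range has shrunk to the single element at index k
}
```

A call such as quickselect(v, 3, count) would return the third smallest element of v and leave the comparison total in count. The loop replaces the recursion, since only one subarray is ever processed further. With this partition scheme, inputs containing many ties can take far more comparisons than the average case, which may be worth keeping in mind when designing test inputs.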
Submit the modified C++ file with the name quantile.cpp. You can
test whether it gives correct results using the sample code
in quantile.R.
Design a simulation study in R to determine empirically through a hypothesis test whether the average run time is Θ(n) as opposed to Θ(n log n).
Using the simulation study, also empirically determine the order of the variance of the average case runtime.
Submit a file simulation.R containing your R code and a file
simulation.pdf containing a narrative report.