Deepayan Sarkar
Most of you should be familiar with Linear Regression / Least Squares
Are there any other kinds of regression?
We will
Try to refine what we understand by the term “regression” (linear regression is only a special case)
Learn alternative approaches to solve the “regression” problem
Learn how to identify and address modeling errors
Most techniques we will learn require non-trivial programming
We will learn and use the R language for computation
Room 11 (ground floor) is a computer lab (usually locked, but security guards at the main gate will open it when you ask them)
There are two more (smaller) computer labs in teh ground floor of the Faculty Building
You can use your own laptops as well
Midterm examination: 30%
Final examination: 50%
Assignments / Projects: 20%
Consider bivariate data \((X, Y)\) with some distribution
Interested in “predicting” \(Y\) for a fixed value of \(X=x\)
In probability terms, want the conditional distribution of \[Y | X = x\]
The “Regression Problem”: when \(Y\) is numeric
The “Classification Problem”: when \(Y\) is categorical
A more modern approach is to view both as special cases of the “Learning Problem”
   sex weight height repwt repht
1    M     77    182    77   180
2    F     58    161    51   159
3    F     53    161    54   158
4    M     68    177    70   175
5    F     59    157    59   155
6    M     76    170    76   165
7    M     76    167    77   165
8    M     69    186    73   180
9    M     71    178    71   175
10   M     65    171    64   170
11   M     70    175    75   174
12   F    166     57    56   163
13   F     51    161    52   158
14   F     64    168    64   165
15   F     52    163    57   160
16   F     65    166    66   165
17   M     92    187   101   185
18   F     62    168    62   165
19   M     76    197    75   200
20   F     61    175    61   171Interested in predicting weight distribution as a function of height
How should we proceed?


'data.frame':   7425 obs. of  5 variables:
 $ wages    : num  10.6 11 NA 17.8 NA ...
 $ education: num  15 13.2 16 14 8 16 12 14.5 15 10 ...
 $ age      : int  40 19 49 46 71 50 70 42 31 56 ...
 $ sex      : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 2 1 ...
 $ language : Factor w/ 3 levels "English","French",..: 1 1 3 3 1 1 1 1 1 1 ...   wages education age    sex language
1  10.56      15.0  40   Male  English
2  11.00      13.2  19   Male  English
3     NA      16.0  49   Male    Other
4  17.76      14.0  46   Male    Other
5     NA       8.0  71   Male  English
6  14.00      16.0  50 Female  English
7     NA      12.0  70 Female  English
8     NA      14.5  42 Female  English
9   8.20      15.0  31   Male  English
10    NA      10.0  56 Female  English



                       education income women prestige census type
gov.administrators         13.11  12351 11.16     68.8   1113 prof
general.managers           12.26  25879  4.02     69.1   1130 prof
accountants                12.77   9271 15.70     63.4   1171 prof
purchasing.officers        11.42   8865  9.11     56.8   1175 prof
chemists                   14.62   8403 11.68     73.5   2111 prof
physicists                 15.64  11030  5.13     77.6   2113 prof
biologists                 15.09   8258 25.65     72.6   2133 prof
architects                 15.44  14163  2.69     78.1   2141 prof
civil.engineers            14.52  11377  1.03     73.1   2143 prof
mining.engineers           14.64  11023  0.94     68.8   2153 prof
surveyors                  12.39   5902  1.91     62.0   2161 prof
draughtsmen                12.30   7059  7.83     60.0   2163 prof
computer.programers        13.83   8425 15.33     53.8   2183 prof
economists                 14.44   8049 57.31     62.2   2311 prof
psychologists              14.36   7405 48.28     74.9   2315 prof
social.workers             14.21   6336 54.77     55.1   2331 prof
lawyers                    15.77  19263  5.13     82.3   2343 prof
librarians                 14.15   6112 77.10     58.1   2351 prof
vocational.counsellors     15.22   9593 34.89     58.3   2391 prof
ministers                  14.50   4686  4.14     72.8   2511 prof
                   region  group fertility   ppgdp lifeExpF pctUrban infantMortality
Afghanistan          Asia  other     5.968   499.0    49.49       23       124.53500
Albania            Europe  other     1.525  3677.2    80.40       53        16.56100
Algeria            Africa africa     2.142  4473.0    75.00       67        21.45800
American Samoa       <NA>   <NA>        NA      NA       NA       NA        11.29389
Angola             Africa africa     5.135  4321.9    53.17       59        96.19100
Anguilla        Caribbean  other     2.000 13750.1    81.10      100              NA
Argentina      Latin Amer  other     2.172  9162.1    79.89       93        12.33700
Armenia              Asia  other     1.735  3030.7    77.33       64        24.27200
Aruba           Caribbean  other     1.671 22851.5    77.75       47        14.68700
Australia         Oceania   oecd     1.949 57118.9    84.27       89         4.45500
Austria            Europe   oecd     1.346 45158.8    83.55       68         3.71300
Azerbaijan           Asia  other     2.148  5637.6    73.66       52        37.56600
Bahamas         Caribbean  other     1.877 22461.6    78.85       84        14.13500
Bahrain              Asia  other     2.430 18184.1    76.06       89         6.66300
Bangladesh           Asia  other     2.157   670.4    70.23       29        41.78600
Barbados        Caribbean  other     1.575 14497.3    80.26       45        12.28400
Belarus            Europe  other     1.479  5702.0    76.37       75         6.49400
Belgium            Europe   oecd     1.835 43814.8    82.81       97         3.73900
Belize         Latin Amer  other     2.679  4495.8    77.81       53        16.20000
Benin              Africa africa     5.078   741.1    58.66       42        76.67400pctUrban using ppgdp

Interested in “predicting” \(Y\) for a fixed value of \(X=x\)
In probability terms, want the conditional distribution of \[Y | X = x\]
Important special case: linear regression
Appropriate under certain model assumptions
Essential component in more general procedures
You will learn theory in Linear Model course
We will review basics
\[Y | X = x\]
In general, the conditional distribution can be anything
If \((X, Y)\) is jointly Normal, then \[Y | X = x \sim N(\alpha + \beta x, \sigma^2)\]
for some \(\alpha, \beta, \sigma^2\)
This is the motivation for Linear Regression
Suppose \(E(X) = \mu_X\) and \(E(Y) = \mu_Y\)
The covariance of \(X\) and \(Y\) is defined by
\[ Cov\left(X,Y\right) = E\bigl( (X-\mu_X ) ( Y-\mu_Y) \bigr) = E(XY) - \mu_X \, \mu_Y \]
The correlation coefficient \(\rho \left( X,Y \right)\) of \(X\) and \(Y\) is defined by
\[ \rho (X,Y)=\frac{Cov\left( X,Y\right) }{\sqrt{Var \left( X\right) Var\left( Y\right) }} \]
\(-1 \leq \rho \leq 1\)
\(\rho = -1\): perfect decreasing linear relation
\(\rho = 1\): perfect increasing linear relation
\(\rho = 0\): no linear relation
Sample analog of correlation
\[ r(X,Y)=\frac{\sum_i (x_i - \bar x) (y_i - \bar y)} {\sqrt{ \sum_i (x_i - \bar x)^2 \sum_i (y_i - \bar y)^2 }} \]
Closely related with regression
Correlation between height and weight (Davis data): 0.19
Correlation between reported height and weight (Davis data): 0.762
Correlations in labour dynamics data
| wages | education | age | |
|---|---|---|---|
| wages | 1.000 | 0.307 | 0.361 | 
| education | 0.307 | 1.000 | -0.298 | 
| age | 0.361 | -0.298 | 1.000 | 

In general form, the regression model assumes
\[ E(Y | X_1 = x_1, X_2 = x_2, \dotsc, X_p = x_p) = \beta_0 + \sum_{j=1}^p \beta_j x_j \] \[ Var(Y | X_1 = x_1, X_2 = x_2, \dotsc, X_p = x_p) = \sigma^2 \]
where
\(\beta_0, \beta_1, \dotsc, \beta_p, \sigma^2 > 0\) are unknown parameters
\(X_1, X_2, \dotsc, X_p\) are (conditionally) fixed covariates
\(X_1, X_2, \dotsc, X_p\) may be derived from a smaller set of predictors, e.g.,
In vector notation (incorporating intercept term in \(\mathbf{X}\))
\[ E(Y | \mathbf{X} = \mathbf{x}) = \mathbf{x}^T \mathbf{\beta} \] \[ Var(Y | \mathbf{X} = \mathbf{x}) = \sigma^2 \]
Alternatively
\[ Y = \mathbf{x}^T \mathbf{\beta} + \varepsilon \text{ where }\] \[ E(\varepsilon) = 0, Var(\varepsilon) = \sigma^2\]
We assume that
\[ Y_{i}=\mathbf{x}_{i}^T\mathbf{\beta} + \varepsilon_{i}, \]
where \(\varepsilon_i\) are independent and
\[ E(\varepsilon_{i}) = 0, Var(\varepsilon_i) = \sigma^2 \]
\[ \mathbf{Y}= \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon} \]
where \[ \mathbf{Y}=\left( \begin{array}{c} Y_{1} \\ \vdots \\ Y_{n} \end{array} \right),\, \mathbf{X=}\left( \begin{array}{c} \mathbf{x}_{1}^T \\ \vdots \\ \mathbf{x}_{n}^T \end{array} \right),\, \mathbf{\varepsilon} = \left( \begin{array}{c} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{array} \right) \]
\(\mathbf{Y}\) and \(\mathbf{\varepsilon}\) are \(n\times1\)
\(\mathbf{X}\) is \(n\times p\)
Columns of \(\mathbf{X}\) are assumed to be linearly independent (\(\text{rank}(\mathbf{X}) = p\))
\(\mathbf{\beta}_{p\times 1}\) and \(\sigma^2 > 0\) are unknown parameters
\[ \nabla q(\mathbf{\beta}) = 2 \mathbf{X}^T (\mathbf{Y-X\beta}) = 2 (\mathbf{X}^T \mathbf{Y-X}^T \mathbf{X \beta}) = \mathbf{0} \]
\[ \mathbf{X}^T \mathbf{X \beta} = \mathbf{X}^T \mathbf{Y} \]
\[ \widehat{\mathbf{\beta }} = ( \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \]
fm1 <- lm(weight ~ height, data = Davis)
xyplot(weight ~ height, data = Davis, grid = TRUE, type = c("p", "r"))
fm1 <- lm(repwt ~ repht, data = Davis)
xyplot(repwt ~ repht, data = Davis, grid = TRUE, type = c("p", "r"))


Less arbitrary, but needs model assumption: Multivariate Normality
To indicate that \(\mathbf{Y}\) follows \(n\)-dimensional Multivariate Normal Distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\), we write
\[ \mathbf{Y} \sim N_{n}\left(\mathbf{\mu}, \Sigma \right) \]
When \(\Sigma\) has full rank (positive definite), \(\mathbf{Y}\) has joint density function (pdf)
\[ f\left(\mathbf{y}\right) = \left( 2\pi \right) ^{-n/2}\left\vert {\Sigma }\right\vert ^{-1/2} \exp\left\lbrace -\frac{1}{2} (\mathbf{y-\mu})^T \Sigma^{-1} (\mathbf{y-\mu}) \right\rbrace \]
Note that this function has the two worst things in matrices, the determinant and the inverse of a matrix.
Fortunately, the situation is simpler for the regression model \[ \mathbf{Y} \sim N_n(\mathbf{X\beta}, \sigma^2 \mathbf{I}) \] with probability density function \[ f(\mathbf{y}) = (2\pi)^{-n/2} \left( \sigma ^{2}\right)^{-n/2} \exp\left\lbrace -\frac{1}{2\sigma^2}\left\Vert \mathbf{y-X\beta }\right\Vert ^{2} \right\rbrace \]
Therefore the likelihood for this model is \[ L\left( \mathbf{\beta}, \sigma^{2} \right) = \left( 2\pi \right) ^{-n/2}\left( \sigma ^{2}\right) ^{-n/2} \exp\left\lbrace -\frac{1}{2\sigma^2}\left\Vert \mathbf{Y-X\beta }\right\Vert^{2} \right\rbrace \]
Maximizing \(L\left( \mathbf{\beta}, \sigma^{2} \right)\) w.r.t. \(\mathbf{\beta}\) is equivalent to minimizing \(\left\Vert \mathbf{Y-X\beta }\right\Vert^{2}\)
To answer this, we need some more tools
Mean and Covariance of a random vector \(\mathbf{Y}\) \[ \mathbf{\mu} = \left( \begin{array}{c} \mu _{1} \\ \vdots \\ \mu _{n} \end{array} \right),\, \Sigma = \left( \begin{array}{ccc} \Sigma _{11} & \cdots & \Sigma _{1n} \\ \vdots & \ddots & \vdots \\ \Sigma _{n1} & \cdots & \Sigma _{nn} \end{array} \right) \] where \[ \mu_{i}=EY_{i}, \] and \[ \Sigma_{ij}=\begin{cases} Var( Y_{i} ) & \text{for } i=j\\ Cov( Y_{i},Y_{j} ) & \text{for } i\neq j \end{cases} \]
The covariance matrix is symmetric
For any \(n\times n\) matrix \(\mathbf{A}\) and \(n\times 1\) vector \(\mathbf{b}\)
\[ E\left( \mathbf{AY+b}\right) =\mathbf{A}E(\mathbf{Y}) + \mathbf{b},\text{ } \]
\[ Cov(\mathbf{AY+b}) = \mathbf{A} Cov(\mathbf{Y}) \mathbf{A}^T \]
Variance of a linear combination
\[ 0 \leq Var(\mathbf{a}^T \mathbf{Y}) = \mathbf{a}^T Cov(\mathbf{Y}) \mathbf{a} \]
which implies that the covariance matrix is non-negative definite.
Mean: \[ E\widehat{\mathbf{\beta}} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T E\mathbf{Y} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{X\beta} =\mathbf{\beta} \]
Covariance: \[ Cov(\widehat{\mathbf{\beta}}) = \left( \mathbf{X}^T \mathbf{X}\right) ^{-1}\mathbf{X}^T \sigma^2\mathbf{IX}\left( \mathbf{X}^T \mathbf{X}\right)^{-1}= \sigma^2\left( \mathbf{X}^T \mathbf{X}\right)^{-1}=\sigma^2 \mathbf{M} \]
Property of Multivariate Normal:
If \(\mathbf{Y} \sim N_n(\mathbf{\mu}, \Sigma)\), then \[ \mathbf{AY+b} \sim N( \mathbf{A\mu + b}, \mathbf{A} \Sigma \mathbf{A}^T) \]
Therefore \[ \widehat{\mathbf{\beta}}\sim N_{p}\left( \mathbf{\beta} ,\sigma^{2}\mathbf{M}\right) \]
More precisely, \(\mathbf{\ell}^T \widehat{\mathbf{\beta}}\) is the best unbiased estimator of \(\mathbf{\ell}^T \mathbf{\beta}\) for any linear combination \(\mathbf{\ell}^T \mathbf{\beta}\).
Even without assuming normality, the OLS estimator has smaller variance than any other linear unbiased estimator.
The OLS estimator is consistent (as long as \(\mathbf{X}\) grows reasonably), i.e., \(\widehat{\mathbf{\beta}} \to \beta\) as \(n \to \infty\).
We typically estimate \(\sigma ^{2}\) by \[ \widehat{\sigma }^{2}=\left\Vert \mathbf{Y-X}\widehat{\beta }\right\Vert^{2}/\left( n-p\right) \] which is called the unbiased estimator of \(\sigma ^{2}\)
Distribution of \(\widehat{\sigma }^{2}\): \[ \frac{\left( n-p\right) \widehat{\sigma }^{2}}{\sigma ^{2}}\sim \chi_{n-p}^{2} \] independently of \(\widehat{\beta}\)
For the normal model \(\widehat{\sigma }^{2}\) is the best unbiased estimator.
Even without normality, \(\widehat{\sigma}^{2}\) is unbiased.
\(\widehat{\sigma }^{2}\) is consistent
To find MLE of \(\sigma^{2}\), differentiate \(\log L(\widehat{\beta}, \sigma^{2})\) with respect to \(\sigma\)
Easy to show that this gives \[ \widehat{\sigma }_{MLE}^{2} = \frac{1}{n} \left\Vert \mathbf{Y-X}\widehat{\beta }\right\Vert^{2} \]
Note that the MLE is not unbiased, but is consistent.
In general, neither the OLS estimator nor the MLE of \(\sigma^{2}\) minimize mean squared error (MSE)
We are often interested in coefficients \(\beta_{j}\)
Note that by properties given above \[ \widehat{\beta}_{j} \sim N(\beta_j, \sigma^2 M_{jj}) \]
“Standard error” of \(\widehat{\beta}_{j}\) \[ \widehat{\sigma }_{\widehat{\beta }_{j}}=\widehat{\sigma }\sqrt{M_{jj}} \]
Testing the null hypothesis \(H_0 : \beta_{j} = c\) \[ t=\frac{\widehat{\beta}_{j} - c}{\widehat{\sigma}_{\widehat{\beta}_{j}}} \sim t_{n-p}. \]
Call:
lm(formula = weight ~ height, data = Davis)
Residuals:
    Min      1Q  Median      3Q     Max 
-23.696  -9.506  -2.818   6.372 127.145 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.26623   14.95042   1.690  0.09260
height       0.23841    0.08772   2.718  0.00715
Residual standard error: 14.86 on 198 degrees of freedom
Multiple R-squared:  0.03597,   Adjusted R-squared:  0.0311 
F-statistic: 7.387 on 1 and 198 DF,  p-value: 0.007152[1] 0.007150357
Loosely speaking,
Suppose we are interested in testing \(H_0\) vs \(H_1\), where \(H_0\) is a sub-model of \(H_1\)
\[ H_0 \subset H_1 \quad (H_m : \mathbf{Y} \sim N_n(\mathbf{X}_m \beta_m, \sigma^2 \mathbf{I}) ) \]
Let the sum of squared errors for the two models be \(S^2_0\) and \(S^2_1\)
\[ S^2_m = \left\Vert \mathbf{Y} - \mathbf{X}_m \widehat{\beta}_m \right\Vert^{2}, m = 0, 1 \]
Let the number of parameters (length of \(\beta\)) in the two models be \(p_0\) and \(p_1\)
Then the test statistic
\[ F = \frac{\frac{S_0^2 - S_1^2}{p_1-p_0}}{\frac{S_1^2}{n-p_1}} \text{ follows } F_{p_1-p_0, n-p_1} \text{ under } H_0 \]
(Cochran’s theorem, Linear Models course)
Call:
lm(formula = weight ~ height * sex, data = Davis)
Residuals:
    Min      1Q  Median      3Q     Max 
-23.091  -6.331  -0.995   6.207  41.230 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  160.49748   13.45954  11.924  < 2e-16
height        -0.62679    0.08199  -7.644 9.17e-13
sexM        -261.82753   32.72161  -8.002 1.05e-13
height:sexM    1.62239    0.18644   8.702 1.33e-15
Residual standard error: 10.06 on 196 degrees of freedom
Multiple R-squared:  0.5626,    Adjusted R-squared:  0.556 
F-statistic: 84.05 on 3 and 196 DF,  p-value: < 2.2e-16Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ height * sex
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)
1    198 43713                              
2    196 19831  2     23882 118.02 < 2.2e-16Consider residual sum of squared errors \[ T^{2}=\sum \left( Y_{i}-\bar{Y}\right)^{2} \] and \[ S^{2}=\sum \left( Y_{i} - \mathbf{x}_i^T \widehat{\mathbf{\beta}} \right)^{2}=\left\Vert\mathbf{Y-X}\widehat{\beta }\right\Vert ^{2} \] for intercept-only model and regression model
We can think of these as measuring the “unexplained variation” in \(\mathbf{Y}\) under these two models.
\(T^{2}-S^{2}\) is the amount of variation in the intercept-only model which has been explained by including the extra predictors of the regression model and
\(R^{2}\) is the proportion of the variation left in the intercept-only model which has been explained by including the additional predictors.
Link with correlation: It can be shown that for one predictor, \[ R^2 = r^2(X, Y) \]
Note that \[ R^{2}=\frac{\frac{T^{2}}{n}-\frac{S^{2}}{n}}{\frac{T^{2}}{n}} \]
Possible alternative: substitute unbiased estimators
Adjusted \(R^{2}\): \[ R_{a}^{2}=\frac{\frac{T^{2}}{n-1}-\frac{S^{2}}{n-p}}{\frac{T^{2}}{n-1}}=1-\frac{n-1}{n-p}( 1-R^{2}) \]
Delete the \(i\)th observation and compute \(\widehat{\mathbf{\beta}}_{(-i)}\) after excluding \(i\)th observation.
Also compute the sample mean excluding the \(i\)th observation \[ \bar{Y}_{(-i)} = \frac{1}{n-1} \sum_{j\neq i} Y_{j} \]
Do this for all \(i\).
The predictive \(R^{2}\) is defined as \[ R_{p}^{2}=\frac{T_{p}^{2}-S_{p}^{2}}{T_{p}^{2}} \]
This computes the fit to the \(i\)th observation without using that observation
Better measure of goodness of model fit than \(R^{2}\) or adjusted \(R^{2}\)
Identifying violations of linear model assumptions
Possible solutions
Many other online resources available