Deepayan Sarkar
Most of you should be familiar with Linear Regression / Least Squares
Are there any other kinds of regression?
We will
Try to refine what we understand by the term “regression” (linear regression is only a special case)
Learn alternative approaches to solve the “regression” problem
Learn how to identify and address modeling errors
Most techniques we will learn require non-trivial programming
We will learn and use the R language for computation
Room 11 (ground floor) is a computer lab (usually locked, but security guards at the main gate will open it when you ask them)
There are two more (smaller) computer labs in teh ground floor of the Faculty Building
You can use your own laptops as well
Midterm examination: 30%
Final examination: 50%
Assignments / Projects: 20%
Consider bivariate data \((X, Y)\) with some distribution
Interested in “predicting” \(Y\) for a fixed value of \(X=x\)
In probability terms, want the conditional distribution of \[Y | X = x\]
The “Regression Problem”: when \(Y\) is numeric
The “Classification Problem”: when \(Y\) is categorical
A more modern approach is to view both as special cases of the “Learning Problem”
sex weight height repwt repht
1 M 77 182 77 180
2 F 58 161 51 159
3 F 53 161 54 158
4 M 68 177 70 175
5 F 59 157 59 155
6 M 76 170 76 165
7 M 76 167 77 165
8 M 69 186 73 180
9 M 71 178 71 175
10 M 65 171 64 170
11 M 70 175 75 174
12 F 166 57 56 163
13 F 51 161 52 158
14 F 64 168 64 165
15 F 52 163 57 160
16 F 65 166 66 165
17 M 92 187 101 185
18 F 62 168 62 165
19 M 76 197 75 200
20 F 61 175 61 171
Interested in predicting weight distribution as a function of height
How should we proceed?
'data.frame': 7425 obs. of 5 variables:
$ wages : num 10.6 11 NA 17.8 NA ...
$ education: num 15 13.2 16 14 8 16 12 14.5 15 10 ...
$ age : int 40 19 49 46 71 50 70 42 31 56 ...
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 2 1 ...
$ language : Factor w/ 3 levels "English","French",..: 1 1 3 3 1 1 1 1 1 1 ...
wages education age sex language
1 10.56 15.0 40 Male English
2 11.00 13.2 19 Male English
3 NA 16.0 49 Male Other
4 17.76 14.0 46 Male Other
5 NA 8.0 71 Male English
6 14.00 16.0 50 Female English
7 NA 12.0 70 Female English
8 NA 14.5 42 Female English
9 8.20 15.0 31 Male English
10 NA 10.0 56 Female English
education income women prestige census type
gov.administrators 13.11 12351 11.16 68.8 1113 prof
general.managers 12.26 25879 4.02 69.1 1130 prof
accountants 12.77 9271 15.70 63.4 1171 prof
purchasing.officers 11.42 8865 9.11 56.8 1175 prof
chemists 14.62 8403 11.68 73.5 2111 prof
physicists 15.64 11030 5.13 77.6 2113 prof
biologists 15.09 8258 25.65 72.6 2133 prof
architects 15.44 14163 2.69 78.1 2141 prof
civil.engineers 14.52 11377 1.03 73.1 2143 prof
mining.engineers 14.64 11023 0.94 68.8 2153 prof
surveyors 12.39 5902 1.91 62.0 2161 prof
draughtsmen 12.30 7059 7.83 60.0 2163 prof
computer.programers 13.83 8425 15.33 53.8 2183 prof
economists 14.44 8049 57.31 62.2 2311 prof
psychologists 14.36 7405 48.28 74.9 2315 prof
social.workers 14.21 6336 54.77 55.1 2331 prof
lawyers 15.77 19263 5.13 82.3 2343 prof
librarians 14.15 6112 77.10 58.1 2351 prof
vocational.counsellors 15.22 9593 34.89 58.3 2391 prof
ministers 14.50 4686 4.14 72.8 2511 prof
region group fertility ppgdp lifeExpF pctUrban infantMortality
Afghanistan Asia other 5.968 499.0 49.49 23 124.53500
Albania Europe other 1.525 3677.2 80.40 53 16.56100
Algeria Africa africa 2.142 4473.0 75.00 67 21.45800
American Samoa <NA> <NA> NA NA NA NA 11.29389
Angola Africa africa 5.135 4321.9 53.17 59 96.19100
Anguilla Caribbean other 2.000 13750.1 81.10 100 NA
Argentina Latin Amer other 2.172 9162.1 79.89 93 12.33700
Armenia Asia other 1.735 3030.7 77.33 64 24.27200
Aruba Caribbean other 1.671 22851.5 77.75 47 14.68700
Australia Oceania oecd 1.949 57118.9 84.27 89 4.45500
Austria Europe oecd 1.346 45158.8 83.55 68 3.71300
Azerbaijan Asia other 2.148 5637.6 73.66 52 37.56600
Bahamas Caribbean other 1.877 22461.6 78.85 84 14.13500
Bahrain Asia other 2.430 18184.1 76.06 89 6.66300
Bangladesh Asia other 2.157 670.4 70.23 29 41.78600
Barbados Caribbean other 1.575 14497.3 80.26 45 12.28400
Belarus Europe other 1.479 5702.0 76.37 75 6.49400
Belgium Europe oecd 1.835 43814.8 82.81 97 3.73900
Belize Latin Amer other 2.679 4495.8 77.81 53 16.20000
Benin Africa africa 5.078 741.1 58.66 42 76.67400
pctUrban
using ppgdp
Interested in “predicting” \(Y\) for a fixed value of \(X=x\)
In probability terms, want the conditional distribution of \[Y | X = x\]
Important special case: linear regression
Appropriate under certain model assumptions
Essential component in more general procedures
You will learn theory in Linear Model course
We will review basics
\[Y | X = x\]
In general, the conditional distribution can be anything
If \((X, Y)\) is jointly Normal, then \[Y | X = x \sim N(\alpha + \beta x, \sigma^2)\]
for some \(\alpha, \beta, \sigma^2\)
This is the motivation for Linear Regression
Suppose \(E(X) = \mu_X\) and \(E(Y) = \mu_Y\)
The covariance of \(X\) and \(Y\) is defined by
\[ Cov\left(X,Y\right) = E\bigl( (X-\mu_X ) ( Y-\mu_Y) \bigr) = E(XY) - \mu_X \, \mu_Y \]
The correlation coefficient \(\rho \left( X,Y \right)\) of \(X\) and \(Y\) is defined by
\[ \rho (X,Y)=\frac{Cov\left( X,Y\right) }{\sqrt{Var \left( X\right) Var\left( Y\right) }} \]
\(-1 \leq \rho \leq 1\)
\(\rho = -1\): perfect decreasing linear relation
\(\rho = 1\): perfect increasing linear relation
\(\rho = 0\): no linear relation
Sample analog of correlation
\[ r(X,Y)=\frac{\sum_i (x_i - \bar x) (y_i - \bar y)} {\sqrt{ \sum_i (x_i - \bar x)^2 \sum_i (y_i - \bar y)^2 }} \]
Closely related with regression
Correlation between height and weight (Davis
data): 0.19
Correlation between reported height and weight (Davis
data): 0.762
Correlations in labour dynamics data
wages | education | age | |
---|---|---|---|
wages | 1.000 | 0.307 | 0.361 |
education | 0.307 | 1.000 | -0.298 |
age | 0.361 | -0.298 | 1.000 |
In general form, the regression model assumes
\[ E(Y | X_1 = x_1, X_2 = x_2, \dotsc, X_p = x_p) = \beta_0 + \sum_{j=1}^p \beta_j x_j \] \[ Var(Y | X_1 = x_1, X_2 = x_2, \dotsc, X_p = x_p) = \sigma^2 \]
where
\(\beta_0, \beta_1, \dotsc, \beta_p, \sigma^2 > 0\) are unknown parameters
\(X_1, X_2, \dotsc, X_p\) are (conditionally) fixed covariates
\(X_1, X_2, \dotsc, X_p\) may be derived from a smaller set of predictors, e.g.,
In vector notation (incorporating intercept term in \(\mathbf{X}\))
\[ E(Y | \mathbf{X} = \mathbf{x}) = \mathbf{x}^T \mathbf{\beta} \] \[ Var(Y | \mathbf{X} = \mathbf{x}) = \sigma^2 \]
Alternatively
\[ Y = \mathbf{x}^T \mathbf{\beta} + \varepsilon \text{ where }\] \[ E(\varepsilon) = 0, Var(\varepsilon) = \sigma^2\]
We assume that
\[ Y_{i}=\mathbf{x}_{i}^T\mathbf{\beta} + \varepsilon_{i}, \]
where \(\varepsilon_i\) are independent and
\[ E(\varepsilon_{i}) = 0, Var(\varepsilon_i) = \sigma^2 \]
\[ \mathbf{Y}= \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon} \]
where \[ \mathbf{Y}=\left( \begin{array}{c} Y_{1} \\ \vdots \\ Y_{n} \end{array} \right),\, \mathbf{X=}\left( \begin{array}{c} \mathbf{x}_{1}^T \\ \vdots \\ \mathbf{x}_{n}^T \end{array} \right),\, \mathbf{\varepsilon} = \left( \begin{array}{c} \varepsilon_{1} \\ \vdots \\ \varepsilon_{n} \end{array} \right) \]
\(\mathbf{Y}\) and \(\mathbf{\varepsilon}\) are \(n\times1\)
\(\mathbf{X}\) is \(n\times p\)
Columns of \(\mathbf{X}\) are assumed to be linearly independent (\(\text{rank}(\mathbf{X}) = p\))
\(\mathbf{\beta}_{p\times 1}\) and \(\sigma^2 > 0\) are unknown parameters
\[ \nabla q(\mathbf{\beta}) = 2 \mathbf{X}^T (\mathbf{Y-X\beta}) = 2 (\mathbf{X}^T \mathbf{Y-X}^T \mathbf{X \beta}) = \mathbf{0} \]
\[ \mathbf{X}^T \mathbf{X \beta} = \mathbf{X}^T \mathbf{Y} \]
\[ \widehat{\mathbf{\beta }} = ( \mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} \]
fm1 <- lm(weight ~ height, data = Davis)
xyplot(weight ~ height, data = Davis, grid = TRUE, type = c("p", "r"))
fm1 <- lm(repwt ~ repht, data = Davis)
xyplot(repwt ~ repht, data = Davis, grid = TRUE, type = c("p", "r"))
Less arbitrary, but needs model assumption: Multivariate Normality
To indicate that \(\mathbf{Y}\) follows \(n\)-dimensional Multivariate Normal Distribution with mean vector \(\mu\) and covariance matrix \(\Sigma\), we write
\[ \mathbf{Y} \sim N_{n}\left(\mathbf{\mu}, \Sigma \right) \]
When \(\Sigma\) has full rank (positive definite), \(\mathbf{Y}\) has joint density function (pdf)
\[ f\left(\mathbf{y}\right) = \left( 2\pi \right) ^{-n/2}\left\vert {\Sigma }\right\vert ^{-1/2} \exp\left\lbrace -\frac{1}{2} (\mathbf{y-\mu})^T \Sigma^{-1} (\mathbf{y-\mu}) \right\rbrace \]
Note that this function has the two worst things in matrices, the determinant and the inverse of a matrix.
Fortunately, the situation is simpler for the regression model \[ \mathbf{Y} \sim N_n(\mathbf{X\beta}, \sigma^2 \mathbf{I}) \] with probability density function \[ f(\mathbf{y}) = (2\pi)^{-n/2} \left( \sigma ^{2}\right)^{-n/2} \exp\left\lbrace -\frac{1}{2\sigma^2}\left\Vert \mathbf{y-X\beta }\right\Vert ^{2} \right\rbrace \]
Therefore the likelihood for this model is \[ L\left( \mathbf{\beta}, \sigma^{2} \right) = \left( 2\pi \right) ^{-n/2}\left( \sigma ^{2}\right) ^{-n/2} \exp\left\lbrace -\frac{1}{2\sigma^2}\left\Vert \mathbf{Y-X\beta }\right\Vert^{2} \right\rbrace \]
Maximizing \(L\left( \mathbf{\beta}, \sigma^{2} \right)\) w.r.t. \(\mathbf{\beta}\) is equivalent to minimizing \(\left\Vert \mathbf{Y-X\beta }\right\Vert^{2}\)
To answer this, we need some more tools
Mean and Covariance of a random vector \(\mathbf{Y}\) \[ \mathbf{\mu} = \left( \begin{array}{c} \mu _{1} \\ \vdots \\ \mu _{n} \end{array} \right),\, \Sigma = \left( \begin{array}{ccc} \Sigma _{11} & \cdots & \Sigma _{1n} \\ \vdots & \ddots & \vdots \\ \Sigma _{n1} & \cdots & \Sigma _{nn} \end{array} \right) \] where \[ \mu_{i}=EY_{i}, \] and \[ \Sigma_{ij}=\begin{cases} Var( Y_{i} ) & \text{for } i=j\\ Cov( Y_{i},Y_{j} ) & \text{for } i\neq j \end{cases} \]
The covariance matrix is symmetric
For any \(n\times n\) matrix \(\mathbf{A}\) and \(n\times 1\) vector \(\mathbf{b}\)
\[ E\left( \mathbf{AY+b}\right) =\mathbf{A}E(\mathbf{Y}) + \mathbf{b},\text{ } \]
\[ Cov(\mathbf{AY+b}) = \mathbf{A} Cov(\mathbf{Y}) \mathbf{A}^T \]
Variance of a linear combination
\[ 0 \leq Var(\mathbf{a}^T \mathbf{Y}) = \mathbf{a}^T Cov(\mathbf{Y}) \mathbf{a} \]
which implies that the covariance matrix is non-negative definite.
Mean: \[ E\widehat{\mathbf{\beta}} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T E\mathbf{Y} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{X\beta} =\mathbf{\beta} \]
Covariance: \[ Cov(\widehat{\mathbf{\beta}}) = \left( \mathbf{X}^T \mathbf{X}\right) ^{-1}\mathbf{X}^T \sigma^2\mathbf{IX}\left( \mathbf{X}^T \mathbf{X}\right)^{-1}= \sigma^2\left( \mathbf{X}^T \mathbf{X}\right)^{-1}=\sigma^2 \mathbf{M} \]
Property of Multivariate Normal:
If \(\mathbf{Y} \sim N_n(\mathbf{\mu}, \Sigma)\), then \[ \mathbf{AY+b} \sim N( \mathbf{A\mu + b}, \mathbf{A} \Sigma \mathbf{A}^T) \]
Therefore \[ \widehat{\mathbf{\beta}}\sim N_{p}\left( \mathbf{\beta} ,\sigma^{2}\mathbf{M}\right) \]
More precisely, \(\mathbf{\ell}^T \widehat{\mathbf{\beta}}\) is the best unbiased estimator of \(\mathbf{\ell}^T \mathbf{\beta}\) for any linear combination \(\mathbf{\ell}^T \mathbf{\beta}\).
Even without assuming normality, the OLS estimator has smaller variance than any other linear unbiased estimator.
The OLS estimator is consistent (as long as \(\mathbf{X}\) grows reasonably), i.e., \(\widehat{\mathbf{\beta}} \to \beta\) as \(n \to \infty\).
We typically estimate \(\sigma ^{2}\) by \[ \widehat{\sigma }^{2}=\left\Vert \mathbf{Y-X}\widehat{\beta }\right\Vert^{2}/\left( n-p\right) \] which is called the unbiased estimator of \(\sigma ^{2}\)
Distribution of \(\widehat{\sigma }^{2}\): \[ \frac{\left( n-p\right) \widehat{\sigma }^{2}}{\sigma ^{2}}\sim \chi_{n-p}^{2} \] independently of \(\widehat{\beta}\)
For the normal model \(\widehat{\sigma }^{2}\) is the best unbiased estimator.
Even without normality, \(\widehat{\sigma}^{2}\) is unbiased.
\(\widehat{\sigma }^{2}\) is consistent
To find MLE of \(\sigma^{2}\), differentiate \(\log L(\widehat{\beta}, \sigma^{2})\) with respect to \(\sigma\)
Easy to show that this gives \[ \widehat{\sigma }_{MLE}^{2} = \frac{1}{n} \left\Vert \mathbf{Y-X}\widehat{\beta }\right\Vert^{2} \]
Note that the MLE is not unbiased, but is consistent.
In general, neither the OLS estimator nor the MLE of \(\sigma^{2}\) minimize mean squared error (MSE)
We are often interested in coefficients \(\beta_{j}\)
Note that by properties given above \[ \widehat{\beta}_{j} \sim N(\beta_j, \sigma^2 M_{jj}) \]
“Standard error” of \(\widehat{\beta}_{j}\) \[ \widehat{\sigma }_{\widehat{\beta }_{j}}=\widehat{\sigma }\sqrt{M_{jj}} \]
Testing the null hypothesis \(H_0 : \beta_{j} = c\) \[ t=\frac{\widehat{\beta}_{j} - c}{\widehat{\sigma}_{\widehat{\beta}_{j}}} \sim t_{n-p}. \]
Call:
lm(formula = weight ~ height, data = Davis)
Residuals:
Min 1Q Median 3Q Max
-23.696 -9.506 -2.818 6.372 127.145
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.26623 14.95042 1.690 0.09260
height 0.23841 0.08772 2.718 0.00715
Residual standard error: 14.86 on 198 degrees of freedom
Multiple R-squared: 0.03597, Adjusted R-squared: 0.0311
F-statistic: 7.387 on 1 and 198 DF, p-value: 0.007152
[1] 0.007150357
Loosely speaking,
Suppose we are interested in testing \(H_0\) vs \(H_1\), where \(H_0\) is a sub-model of \(H_1\)
\[ H_0 \subset H_1 \quad (H_m : \mathbf{Y} \sim N_n(\mathbf{X}_m \beta_m, \sigma^2 \mathbf{I}) ) \]
Let the sum of squared errors for the two models be \(S^2_0\) and \(S^2_1\)
\[ S^2_m = \left\Vert \mathbf{Y} - \mathbf{X}_m \widehat{\beta}_m \right\Vert^{2}, m = 0, 1 \]
Let the number of parameters (length of \(\beta\)) in the two models be \(p_0\) and \(p_1\)
Then the test statistic
\[ F = \frac{\frac{S_0^2 - S_1^2}{p_1-p_0}}{\frac{S_1^2}{n-p_1}} \text{ follows } F_{p_1-p_0, n-p_1} \text{ under } H_0 \]
(Cochran’s theorem, Linear Models course)
Call:
lm(formula = weight ~ height * sex, data = Davis)
Residuals:
Min 1Q Median 3Q Max
-23.091 -6.331 -0.995 6.207 41.230
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 160.49748 13.45954 11.924 < 2e-16
height -0.62679 0.08199 -7.644 9.17e-13
sexM -261.82753 32.72161 -8.002 1.05e-13
height:sexM 1.62239 0.18644 8.702 1.33e-15
Residual standard error: 10.06 on 196 degrees of freedom
Multiple R-squared: 0.5626, Adjusted R-squared: 0.556
F-statistic: 84.05 on 3 and 196 DF, p-value: < 2.2e-16
Analysis of Variance Table
Model 1: weight ~ height
Model 2: weight ~ height * sex
Res.Df RSS Df Sum of Sq F Pr(>F)
1 198 43713
2 196 19831 2 23882 118.02 < 2.2e-16
Consider residual sum of squared errors \[ T^{2}=\sum \left( Y_{i}-\bar{Y}\right)^{2} \] and \[ S^{2}=\sum \left( Y_{i} - \mathbf{x}_i^T \widehat{\mathbf{\beta}} \right)^{2}=\left\Vert\mathbf{Y-X}\widehat{\beta }\right\Vert ^{2} \] for intercept-only model and regression model
We can think of these as measuring the “unexplained variation” in \(\mathbf{Y}\) under these two models.
\(T^{2}-S^{2}\) is the amount of variation in the intercept-only model which has been explained by including the extra predictors of the regression model and
\(R^{2}\) is the proportion of the variation left in the intercept-only model which has been explained by including the additional predictors.
Link with correlation: It can be shown that for one predictor, \[ R^2 = r^2(X, Y) \]
Note that \[ R^{2}=\frac{\frac{T^{2}}{n}-\frac{S^{2}}{n}}{\frac{T^{2}}{n}} \]
Possible alternative: substitute unbiased estimators
Adjusted \(R^{2}\): \[ R_{a}^{2}=\frac{\frac{T^{2}}{n-1}-\frac{S^{2}}{n-p}}{\frac{T^{2}}{n-1}}=1-\frac{n-1}{n-p}( 1-R^{2}) \]
Delete the \(i\)th observation and compute \(\widehat{\mathbf{\beta}}_{(-i)}\) after excluding \(i\)th observation.
Also compute the sample mean excluding the \(i\)th observation \[ \bar{Y}_{(-i)} = \frac{1}{n-1} \sum_{j\neq i} Y_{j} \]
Do this for all \(i\).
The predictive \(R^{2}\) is defined as \[ R_{p}^{2}=\frac{T_{p}^{2}-S_{p}^{2}}{T_{p}^{2}} \]
This computes the fit to the \(i\)th observation without using that observation
Better measure of goodness of model fit than \(R^{2}\) or adjusted \(R^{2}\)
Identifying violations of linear model assumptions
Possible solutions
Many other online resources available