next up previous
Next: About this document ...

Lecture 33: The General Linear Model

Suppose that $\vec{\epsilon}\in R^n$ is a random vector with mean zero components, that $B\in R^{n\times k}$ is a rank k matrix of real numbers, and $\vec{A}\in R^k$. We wish to consider the random vector $\vec{Y}\in R^n$defined by  
 \begin{displaymath}
\vec{Y} = B\vec{A} + \vec{\epsilon}.\end{displaymath} (1)
The model described by (1) is called the General Linear Model. We regard B as known, and we would like to both estimate $\vec{A}$ and test hypotheses about $\vec{A}$ in this model.

This model arises in many contexts, many of which are linked by the common thread that the matrix B is input to a system whose output under optimal conditions would be $B\vec{A}$. However, our measurements of the output, Yi, are contaminated by random errors, $\epsilon_i$.

For example, the time it takes for an object to fall d units is assumed to follow a law of the form

\begin{displaymath}
T = A\sqrt{d}.\end{displaymath}

Objects are dropped from various heights, $d_1,\dots,d_k$, and the time measurements are contamininated by random errors $\epsilon_i$, so that the $i^{\rm th}$ time measurement, Ti, follows the law

\begin{displaymath}
T_i = (\sqrt{d_i})A + \epsilon_i\end{displaymath}

Another example. The temperature at the location (x,y) in the plane is assumed to satisfy the quadratic relation

T = Ax2 + 2Bxy + Cy2 + Dx + Ey + F.

The measurement at (xi,yi) is contaminated by the random error $\epsilon_i$so we have

\begin{displaymath}
T_i = (x_i)^2A + 2x_iy_iB + (y_i)^2C + x_iD + y_iE + F + \epsilon_i.\end{displaymath}

We predicate our search for a good estimator on the requirement that the estimator should minimize the sum of squares

\begin{displaymath}
\sum_{i=1}^n (Y_i - (BA)_i)^2.\end{displaymath}

However, this just says that we should choose $\vec{A}$ so that $B\vec{A}$ is the projection of $\vec{Y}$ onto the span of the columns of B. Therefore, we want the columns of B to be perpendicular to $\vec{Y}-B\vec{A}$, that is

\begin{displaymath}
B^t(\vec{Y}-B\vec{A}) = \vec{0}.\end{displaymath}

This is equivalent to $B^t\vec{Y} = (B^tB)\vec{A}$. Since we assume that the rank of B is k, BtB is invertible, and we have  
 \begin{displaymath}
\hat{A} \equiv (B^tB)^{-1}B^t\vec{Y}\end{displaymath} (2)
would minimize the sum of squares. Let us now look at how the statistical properties of $\vec{\epsilon}$ determine the statistical properties of $\hat{A}$. First note that  
 \begin{displaymath}
\hat{A} = (B^tB)^{-1}B^t\vec{Y} = (B^tB)^{-1}B^tB\vec{A} +
(...
 ...{-1}B^t\vec{\epsilon} = \vec{A} + (B^tB)^{-1}B^t\vec{\epsilon}.\end{displaymath} (3)
Since we assume that the components of $\vec{\epsilon}$ are mean zero, we see

Proposition 584

$\hat{A} \equiv (B^tB)^{-1}B^t\vec{Y}$ is an unbiased estimate of $\vec{A}$in (1).

Next, suppose that the components of $\vec{\epsilon}$ have finite variances. Then we can define the covariance matrix of the errors, $\Sigma$, by

\begin{displaymath}[\Sigma]
_{i,j} = {\rm E}[\epsilon_i\epsilon_j]\end{displaymath}

and compute the covariance of $\hat{A}$ in terms of $\Sigma$ and B:

\begin{displaymath}
(\hat{A}-\vec{A})(\hat{A}-\vec{A})^t =
(B^tB)^{-1}B^t\vec{\epsilon}(\vec{\epsilon})^tB(B^tB)^{-1}\end{displaymath}

since BtB is symmetric! Therefore,

\begin{displaymath}
{\rm Cov}(\hat{A}) = (B^tB)^{-1}B^t\Sigma B(B^tB)^{-1}.\end{displaymath}

If, for example the errors are all independent with the same variance $\sigma^2$ then we achieve a further simplification:

Proposition 599

Suppose that the $\epsilon_i$ are uncorrelated with common variance $\sigma^2$. Then $\Sigma = \sigma^2 I$ and

\begin{displaymath}
{\rm Cov}(\hat{A}) = \sigma^2(B^tB)^{-1}.\end{displaymath}

A natural thing to do to improve an estimate is to take more observations. Since the observations we take are reflected in the rows of B, let us now assume that S is a $p\times k$ matrix of rank k and that S determines B in the sense that B is composed of n $p\times k$ blocks, all equal to S. Thus

\begin{displaymath}
B^t = \left[S^t,S^t,\dots,S^t\right],\end{displaymath}

and so

BtB = nStS.

Suppose we regard $\vec{\epsilon}$ also as having n blocks, each of length p, so that

\begin{displaymath}
\vec{\epsilon}^t = \left[\vec{v_1}^t,\dots,\vec{v_n}^t\right].\end{displaymath}

We have that StS is invertible and so  
 \begin{displaymath}
\hat{A} = \vec{A} + (S^tS)^{-1}\frac{1}{n}\sum_{j=1}^nS^t\vec{v_j}\end{displaymath} (4)
so that if the vectors $\vec{v_j}$ are independent and identically distributed then with probability 1,

\begin{displaymath}
\lim_{n\rightarrow\infty}
(S^tS)^{-1}\frac{1}{n}\sum_{j=1}^nS^t\vec{v_j} = \vec{0}\end{displaymath}

so with probability 1, $\hat{A}$ converges to $\vec{A}$, just like $\overline{X}$ converges to the mean with probability 1 for a random sample.

If we know that the components of $\vec{\epsilon}$ are uncorrelated and have common variance $\sigma^2$ then this block form of B reduces the formula for the covariance of $\hat{A}$ to

\begin{displaymath}
{\rm Cov}(\hat{A}) = \frac{\sigma^2}{n}(S^tS)^{-1}\end{displaymath}

and we see that

\begin{displaymath}
{\rm E}[((\hat{A})_1 - (\vec{A})_1)^2 + \cdots + ((\hat{A})_k - (\vec{A})_k)^2]
=
\frac{\sigma^2}{n}{\rm tr}((S^tS)^{-1}).\end{displaymath}

Of particular interest is the error in our fit, that is, the quantity

\begin{displaymath}
{\rm RSS} \equiv \vert\vert\vec{Y}-B\hat{A}\vert\vert^2\end{displaymath}

which is commonly called the Residual Sum of Squares. If we assume that the components of $\vec{\epsilon}$ are a random sample of size n from the standard normal distribution, we can compute the probability distribution of the residual sum of squares, RSS. First, observe that

\begin{displaymath}
\vec{Y} -B\hat{A} 
= 
(B\vec{A}+\vec{\epsilon}) - B(B^tB)^{-...
 ...vec{A}+\vec{\epsilon}) 
= 
(I - B(B^tB)^{-1}B^t)\vec{\epsilon}.\end{displaymath}

Despite its formidable appearance, the matrix $M\equiv I - B(B^tB)^{-1}B^t$ has a very simple structure.
1.
Since BtB is symmetric, so is M. Therefore this are matrices $\cal
O$ and D where D is diagonal, ${\cal O}^t = {\cal O}^{-1}$ and ${\cal
O}^tD{\cal O} = M$;
2.
M2 = M by direct calculution.
3.
MB = [0], so each column of B is an eigenvector of B for the eigenvalue 0. Since B has rank k, there are at least k 0's on the diagonal of D.
4.
There are n-k linearly indendendent vectors $\vec{v}_1,\dots,\vec{v}_{n-k}$ in Rn which are perpendicular to each of the columns of B. For each $\vec{v_j}$, $B\vec{v_j}
= \vec{0}$, so each of these n-k vectors is an eigenvector of for M for the eigenvalue 1. Hence D has at least (n-k) 1's on the diagonal. Since D is $n\times n$, this means it has n-k 1's and k 0's on the diagonal. We will assume that the 1's come first, that is [D]j,j = 1 if $j=1,\dots,n-k$and [D]j,j = 0 if $j= n-k+1,\dots,n$.
This tells us that

\begin{displaymath}
M\vec{\epsilon} = {\cal O}^tD{\cal O}\vec{\epsilon}.\end{displaymath}

Since $\cal
O$ is invertible, ${\cal O}\vec{\epsilon}$ has a multivariate normal distribtion. It is clear that its mean vector is $\vec{0}$. As for the covariance matrix,

\begin{displaymath}
{\rm Cov}({\cal O}\vec{\epsilon}) 
= 
{\rm E}[{\cal O}\vec{\epsilon}\vec{\epsilon}^t{\cal O}^t] 
=
{\cal O}{\cal O}^t = I,\end{displaymath}

so $\vec{\nu}\equiv{\cal O}\vec{\epsilon}$ is also a random sample of size n from the standard normal distribution. This shows us that

\begin{displaymath}
{\rm RSS} = \vec{\epsilon}^tM^tM\vec{\epsilon}
=
\vec{\epsil...
 ...vec{\epsilon}
= \vec{\nu}^tD\vec{\nu}
= \sum_{j=1}^{n-k}\nu_j^2\end{displaymath}

which shows us that RSS has a chi-square distribution with n-k degrees of freedom!



 
next up previous
Next: About this document ...
Eric S Key
4/3/1999