The first assumption about the error variable makes construction of confidence intervals for the mean value of Y, for fixed values of the independent variables, possible. It also allows us to conduct useful hypothesis tests. The constant variance part of the assumption states that the variation in the values of the dependent variable Y about the plane Y = β0 + β1X1 + β2X2 + … + βkXk is the same for all values of the independent variables observed.
The assumptions made about the distribution of the error variable can be checked by looking at the Plot of Residuals vs. Predicted Y and a histogram of the studentized residuals, as in simple regression. In addition, for time-series data the Plot of Residuals vs. Row Number (the row numbers represent the time periods in which the data was collected) should be free of any obvious patterns that would suggest that errors were correlated.
Note that all of the residual plots appear random, and exhibit constant spread about the model (represented by the horizontal line in the plots). This suggests that the model assumptions are validated by the data.
V The Analysis of Variance (ANOVA) Table
As in simple regression, an ANOVA table is prominently featured in the output of the Analysis Summary window. Below is a description of the contents of the columns in the table.
Sums of Squares
The definition and interpretation of the sums of squares in multiple regression is similar to that in simple regression.
The Total Sum of Squares, SST = Σ(yi – ȳ)², is a measure of the total variation of Y.

The Regression Sum of Squares, SSR = Σ(ŷi – ȳ)², measures the variation of Y explained by the model.

The Error Sum of Squares, SSE = Σ(yi – ŷi)², measures the variation of Y left unexplained by the model.

Remarkably, we find that the equality SST = SSR + SSE always holds. Thus, the total variation in Y can be “decomposed” into the explained variation plus the unexplained variation for the model.
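This decomposition can be checked numerically. Below is a minimal sketch (using made-up data fit by least squares with numpy, not the Securicorp example) verifying that SST = SSR + SSE for a model that includes an intercept:

```python
import numpy as np

# Illustrative data (not from the example): y and two predictors
rng = np.random.default_rng(0)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 + 0.8 * x2 + rng.normal(0, 1, n)

# Least-squares fit of y on an intercept, x1, and x2
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sst = np.sum((y - y.mean()) ** 2)      # total variation of Y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
sse = np.sum((y - y_hat) ** 2)         # variation left unexplained

print(np.isclose(sst, ssr + sse))      # the decomposition SST = SSR + SSE
```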
Degrees of Freedom
The degrees of freedom (df) equal the amount of independent information available for the corresponding sum of squares. Starting with n degrees of freedom (one for each observation in the sample), we lose one degree of freedom for each parameter in the model estimated by a sample statistic b.
dfTotal = n – 1, because the one parameter, the mean of Y, is estimated by the sample statistic ȳ.

dfError = n – k – 1 = n – (k + 1), because the k + 1 model parameters, β0, β1, …, βk, must be estimated by the sample statistics b0, b1, …, bk.

dfRegression = k
Paralleling the results from simple regression, observe that the total degrees of freedom, n – 1, “decomposes” into the explained degrees of freedom, k, plus the unexplained degrees of freedom, n – k – 1, for the model.
While a sum of squares measures the total variation, the ratio of a sum of squares to its degrees of freedom is a variance. The advantage of a variance is that it can be used to compare different data sets and different models (since it incorporates information about the size of the sample and the number of independent variables used in the model). These variances are called mean squares.
MST = SST/(n – 1) = the (sample) variance of the y-values observed (see the notes “Review of Basic Statistical Concepts”). You probably never knew (or cared, perhaps) that the sample variance you computed in your introductory statistics course was an example of a mean square!

MSE = SSE/(n – k – 1) = the (sample) variance of the dependent variable unexplained by the model. The Mean Square Error is the sample estimate of the variance of the error variable, σε².

MSR = SSR/k = the variance of the dependent variable explained by the model, the Mean Square for Regression.
The F – Ratio
An intuitive device for judging the effectiveness of the model in describing the relationship between the dependent variable Y and the independent variables in the model, taken together, is to compute the ratio of the MSR (the variance in Y explained by the model) to the MSE (the variance of Y about the model). The resulting ratio is called the F statistic:
F – Ratio = MSR/MSE

Properties of the F – Ratio:
F > 0
If F is “large” then the model explains much more of the variation in Y than it leaves unexplained, which is evidence that the model Y = β0 + β1X1 + β2X2 + … + βkXk is appropriate, i.e., a large F supports the linearity assumption of the model.
A “small” F indicates that the model is inappropriate.
If the error variable is normally distributed with constant variance, then the F – Ratio follows a probability distribution called the F distribution. The P-value in the ANOVA table is for a hypothesis test of the linearity assumption of the model, i.e., the assumption that the dependent variable is linearly related to the set of independent variables, taken together: Y = β0 + β1X1 + β2X2 + … + βkXk. See the next section (Section VI) for a discussion of the test. (The F distribution has two degrees of freedom, ν1 and ν2. In the test conducted for the model, ν1 = k and ν2 = n – k – 1.)
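As a sketch of the arithmetic (the sums of squares, n, and k below are made up for illustration, not taken from the example), the mean squares, F – Ratio, and P-value can be computed as follows, with scipy's F distribution supplying the tail probability:

```python
from scipy.stats import f

# Illustrative values (not from the example): k predictors, n observations
n, k = 30, 2
ssr, sse = 900.0, 300.0          # explained and unexplained sums of squares

msr = ssr / k                    # Mean Square for Regression
mse = sse / (n - k - 1)          # Mean Square Error
F = msr / mse                    # F - Ratio
p_value = f.sf(F, k, n - k - 1)  # upper-tail area of the F distribution

print(F, p_value)
```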
The ANOVA Table below summarizes the results in this section as they appear in Statgraphics.
Below is the ANOVA Table for the Securicorp example. (Note: The E6 in SSR and SST is scientific notation for ×10^6. Thus, SSR = 1,067,220, and SST = 1,249,880.) Remember, also, that the sales figures appearing in the spreadsheet are in units of $1,000, and that these units are squared in the computation of the sums of squares!
VI Testing the Assumption of Linearity
Is Y = β0 + β1X1 + β2X2 + … + βkXk an appropriate description of the relationship between the dependent and independent variables? To answer this question, we conduct a formal hypothesis test. For the test,
Example 1 (continued):
H0: β1 = β2 = … = βk = 0, i.e., none of the independent variables is linearly related to the dependent variable.

HA: At least one βi is not zero, i.e., at least one of the independent variables is linearly related to the dependent variable.
Test Statistic: F = MSR/MSE = the ratio of the explained to the unexplained variance in Y.
P-Value: If the error variable satisfies the assumption made in section III then F follows an F distribution. Using the F distribution, Statgraphics computes the P-value for the test statistic. Note, however, that substantial deviations from the error variable assumptions can make the P-value unreliable. Since larger F – Ratios correspond to more convincing evidence that at least one of the independent variables is correlated to Y, large values of F lead to small P-values and the rejection of H0.
Based upon the P-value of 0.0000 for the F – Statistic in the ANOVA Table below for Securicorp, we reject the hypothesis that neither Ad nor Bonus is linearly related to Sales.
VII Testing the Importance of Individual Variables to the Model
Having established, via the F-test, that the k independent variables, taken together, are correlated to Y, we next ask which individual independent variables belong in the model. This involves conducting a t – test of the slope for each of the k independent variables in the model.
Statgraphics determines the utility of an independent variable by considering how much the variable improves the model if it is the last one to enter. The test statistic and P-value for the test, presented in the Analysis Summary window in the same row as the estimated slope for the variable, can be used to determine the importance of the variable in explaining the variation in Y after accounting for the effects of the other variables in the model. Thus the t – test measures the marginal improvement the variable affords the model.
Because the results of the individual t – tests depend upon the presence of the other variables in the model, each time you add or remove a variable from the model all of the test statistics and P-values will change.
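To illustrate this dependence (a numpy/scipy sketch on simulated data, not Statgraphics output), the t statistic for each slope is the estimated coefficient divided by its standard error, and the P-values change once a variable is removed:

```python
import numpy as np
from scipy.stats import t as t_dist

def slope_tests(y, cols):
    """OLS fit; return t statistics and two-sided P-values for each coefficient."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    df = len(y) - X.shape[1]               # n - k - 1
    mse = resid @ resid / df               # Mean Square Error
    se = np.sqrt(mse * np.diag(np.linalg.inv(X.T @ X)))
    t_stat = b / se
    p = 2 * t_dist.sf(np.abs(t_stat), df)
    return t_stat, p

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)    # x2 is highly correlated with x1
y = 3 + 2 * x1 + rng.normal(size=n)

_, p_full = slope_tests(y, [x1, x2])       # t-tests with both variables present
_, p_reduced = slope_tests(y, [x1])        # t-tests after removing x2

# The P-value for x1 changes once x2 leaves the model
print(p_full[1], p_reduced[1])
```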
Example 1 (continued):
Based upon the P-values of 0.0000 for the variable Ad and 0.0170 for the variable Bonus in the ANOVA Table for Securicorp shown on page 4, both independent variables are linearly related to Sales in this model, and are therefore retained.
VIII Multicollinearity

A real estate agent believes that the selling price of a house can be predicted using the number of bedrooms, the size of the house, and the size of the lot upon which the house sits. A random sample of 100 houses was drawn and the data recorded in the file HOUSE PRICE for the variables below.
Price: Price, in dollars
Bedrooms: The number of bedrooms
H_Size: House size, in square feet
Lot_Size: Lot size, also in square feet
What is Multicollinearity?
Looking at the House Price data, we immediately note that if the variables H_Size, Lot_Size, and Bedrooms are all included in the model, their individual P-values are all high (see the output below). This might lead us to conclude that none of them are correlated to the price of a house, but the P-value for the model assures us that at least one of them is correlated to house price. (In fact, doing simple regressions of Price on the three independent variables, taken one at a time, leads to the conclusion that all of them are correlated to house price. You should verify these results.) These seemingly contradictory results are explained by the existence of Multicollinearity in the model.
In regression, we expect the independent variables to be correlated to the dependent variable. It often happens, however, that they are also correlated to each other. If these correlations are high then multicollinearity is said to exist.
Dependent variable: Price
Parameter    Estimate    Standard Error    T Statistic    P-Value
CONSTANT 37717.6 14176.7 2.66053 0.0091
Bedrooms 2306.08 6994.19 0.329714 0.7423
H_Size 74.2968 52.9786 1.40239 0.1640
Lot_Size -4.36378 17.024 -0.256331 0.7982
Analysis of Variance
Source Sum of Squares Df Mean Square F-Ratio P-Value
Model 7.65017E10 3 2.55006E10 40.73 0.0000
Residual 6.0109E10 96 6.26136E8
Total (Corr.) 1.36611E11 99
R-squared = 55.9998 percent
R-squared (adjusted for d.f.) = 54.6248 percent
Standard Error of Est. = 25022.7
There are several ways to diagnose the existence of multicollinearity in a model. The following are the simplest indicators:
The P-values of important explanatory (independent) variables are high for the model. For example, the individual P-values for Bedrooms, H_Size, and Lot_Size are all high although we know that they are all correlated with House Price.
The algebraic sign (+/-) of one or more of the slopes is incorrect. For example, the regression coefficient for Lot_Size is negative, suggesting that increasing the size of the lot will tend, on average, to decrease the price of the house. A simple regression of Price on Lot_Size, however, confirms our suspicion that the two are positively correlated!
Problems Stemming from Multicollinearity
While the existence of multicollinearity doesn’t violate any model assumptions, or make the model invalid, it does pose certain problems for the analyst:
Individual t-tests may prove unreliable, making it difficult to determine which variables in the model are correlated to Y.
Because the estimated slopes may vary wildly from sample to sample (and even change algebraic sign), it may not be possible to interpret the slope of an independent variable as the marginal effect of a unit change in the variable upon the average value of Y (as described in the next section).
The simplest way to remove multicollinearity is to remove one or more of the correlated variables. For example, for the House Price data removing the variables Bedrooms and Lot_Size produces a simpler (in fact, Simple Linear Regression) model without multicollinearity.
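A related diagnostic, not shown in these notes but easy to compute, is the variance inflation factor (VIF): regress each independent variable on the others and form 1/(1 – R²). Values far above 1 indicate that the variable is nearly determined by the rest. Below is a sketch using simulated stand-ins for the house variables (not the actual HOUSE PRICE data):

```python
import numpy as np

def vif(cols):
    """Variance inflation factor for each column: 1 / (1 - R^2) from
    regressing that column on all of the others (with an intercept)."""
    out = []
    for i in range(len(cols)):
        y = cols[i]
        others = [c for j, c in enumerate(cols) if j != i]
        X = np.column_stack([np.ones(len(y))] + others)
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return out

# Simulated stand-ins for H_Size, Lot_Size, and Bedrooms:
rng = np.random.default_rng(2)
h_size = rng.normal(2000, 400, 100)
lot_size = 4 * h_size + rng.normal(0, 300, 100)    # strongly tied to house size
bedrooms = np.round(h_size / 600) + rng.integers(0, 2, 100)

print(vif([h_size, lot_size, bedrooms]))   # values well above 1 signal trouble
```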
IX Interpreting the Regression Coefficients
The regression coefficients are interpreted essentially the same in multiple regression as they are in simple regression, with one caveat. The slope of an independent variable in multiple regression is usually interpreted as the marginal (or isolated) effect of a unit change in the variable upon the mean value of Y when “the values of all of the other independent variables are held constant”. Thus, as stated in the previous section, when multicollinearity is a problem it may not be possible to interpret all of the coefficients. This is because some of the independent variables are closely interrelated, making it impossible to change the value of one while holding the values of the others constant.
Graphically, the coefficient βi of the independent variable Xi is the slope of the plane Y = β0 + β1X1 + … + βkXk that we would experience if we decided to walk in the direction of increasing values of Xi, i.e., parallel to the Xi axis. Specifically, if we move one unit in this direction, Y will change, on average, by βi units. When multicollinearity is a problem, however, we may not be able to move around the plane in directions parallel to the axes of individual independent variables (we’re restricted to moving along certain “paths” on the plane). Thus, we are unable to “experience” the slope of Xi, and βi can no longer be interpreted as the marginal effect of a unit change in the value of Xi upon the mean value of Y.
Example 1 (continued):
For Securicorp, the regression coefficients are interpreted below. For convenience, the Statgraphics output found in the Analysis Summary window is shown again. Recall that Sales is in thousands of dollars, while Ad and Bonus are in hundreds of dollars.
b0: b0 estimates the expected annual sales for a territory if $0.00 is spent on advertising and bonuses. Because these values are outside the range of values for Ad and Bonus observed, and upon which the estimated regression equation is based, the value of b0 has no practical interpretation. Put more concisely, an interpretation of b0 is not supported by the data. This will often, but not always, be the case in multiple regression. You should try to come up with a scenario, not involving the Securicorp example, where interpretation of the estimated intercept b0 would be appropriate.
b1: Expected (mean) sales increases by about $2,472 for every $100 increase in the amount spent on advertising, holding the amount of bonuses paid constant.
b2: Sales increases by $1,853, on average, for every $100 increase in bonuses, for a given amount spent on advertising.
X The Standard Error of the Estimate

S = √MSE = the sample estimate of the standard deviation of the error variable, σε, which measures the spread of the actual values of Y about the true plane Y = β0 + β1X1 + … + βkXk. As such, the standard error should be reported in the units appropriate to the dependent variable as they appear in the spreadsheet. For example, in the regression of Sales on Ad and Bonus for Securicorp the standard error of the estimate is $91,121. As in simple regression, Statgraphics displays the standard error below the ANOVA Table.
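Since S is the square root of the Mean Square Error, the reported standard error can be checked against the ANOVA output. Using the House Price table shown earlier, where MSE = 6.26136E8 and the reported Standard Error of Est. is 25022.7:

```python
import math

mse = 6.26136e8     # Mean Square Error from the House Price ANOVA table
s = math.sqrt(mse)  # standard error of the estimate, S
print(round(s, 1))  # → 25022.7, matching the Statgraphics output
```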
XI Preferred Measures of Fit
Although there are many statistics which can be used to measure the fit of a model to the data, such as S, the most commonly used statistics for this purpose are R2 and R2-adjusted for degrees of freedom.
R2 is defined as in simple regression and continues to have the same interpretation. The drawback to R2 is that, because it can’t decrease when new variables are added to a model, it is inappropriate for comparing models with different numbers of independent variables. For this reason, a statistic that includes information about the number of independent variables (and that penalizes models with lots of useless or redundant variables) was created. This new statistic is called R2-adjusted for degrees of freedom, or simply R2-adjusted.
R2-adjusted = 1 – MSE/MST, where MST = SST/(n – 1) is the sample variance for Y (the “missing” Mean Square in the ANOVA Table). Because R2-adjusted includes information about the sample size and the number of independent variables, it is more appropriate than R2 for comparing models with different numbers of independent variables. R2-adjusted appears directly below R2 in Statgraphics’ output.
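Both fit statistics can be reproduced (to rounding) from the House Price ANOVA table shown earlier, where SST = 1.36611E11, SSE = 6.0109E10, n = 100, and k = 3:

```python
sst, sse = 1.36611e11, 6.0109e10   # from the House Price ANOVA table
n, k = 100, 3

r2 = 1 - sse / sst                 # R-squared
mse = sse / (n - k - 1)            # Mean Square Error
mst = sst / (n - 1)                # sample variance of Y (the "missing" mean square)
r2_adj = 1 - mse / mst             # R-squared adjusted for d.f.

# ≈ 56.00 and 54.62 percent, matching the Statgraphics output above
print(100 * r2, 100 * r2_adj)
```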
XII Dummy Variables
Regression is designed for quantitative variables, i.e., both the dependent and independent variables are quantitative. There are times, however, when we wish to include information about qualitative variables in the model. For example, qualitative factors such as the presence of a fireplace, pool, or attached garage may have an effect upon a house’s price.
The way to get around regression’s restriction to quantitative variables is through the creation of Dummy (also called binary or indicator) Variables.
A dummy variable for a characteristic indicates the presence or absence of the characteristic in the observation. For example, in the Eugene house data (see the notes for simple regression) the variables Attach and View indicate the presence or absence of an attached garage or a nice view, respectively, for each house observed. In each case, a value of 1 indicates that the house has the characteristic and a value of 0 that it does not.
From looking at the spreadsheet above, we can tell that the eighth house in the sample had an attached garage but no view, while the thirteenth house had both.
Qualitative Variables with More than Two Possible Outcomes
If a qualitative variable has more than two possible outcomes, for example a variable for the seasons may acquire the values Spring, Summer, Fall, or Winter, then a dummy is created for all but one of the outcomes. (It is important that you exclude one of the outcomes from the model. Statgraphics will get upset if you try to include them all!) Thus, we might create one dummy for Spring, another for Summer, and a third for Fall.
Creating Dummy Variables in Statgraphics
In the spreadsheet for Eugene houses the variables Attach and View were already represented as dummy (0 or 1) variables. Frequently, however, the values of a qualitative variable will appear as descriptions (“smoker” or “nonsmoker”) or numerical codes (“1” for red, “2” for blue, “3” for green, etc.). To create dummy variables for a qualitative variable with m possible outcomes, begin by copying the variable and pasting it into m – 1 columns in the spreadsheet.
Example 1 (continued):
Securicorp would also like to examine whether the marketing region to which a territory belongs affects expected sales, after taking into account the effect of advertising and bonuses. These regions have been coded in the data using 1 = West, 2 = Midwest, and 3 = South. (The column “Region Codes” provides the key to the codes.) Copying and pasting the Region variable twice (m = 3) we arrive at the view below.
To create a dummy variable for the Midwest, begin by selecting one of the pasted columns in the spreadsheet. Then use the right mouse button to access the column menu shown below and select Recode Data.
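For comparison, the same recoding can be sketched outside of Statgraphics in Python with pandas, whose get_dummies function creates the m – 1 dummy columns in one step (the region codes below are the ones from the example, but the data values are made up):

```python
import pandas as pd

# Hypothetical Region column using the codes from the example:
# 1 = West, 2 = Midwest, 3 = South
df = pd.DataFrame({"Region": [1, 2, 3, 2, 1, 3]})
names = df["Region"].map({1: "West", 2: "Midwest", 3: "South"})

# m = 3 outcomes -> m - 1 = 2 dummy columns; one outcome is dropped
dummies = pd.get_dummies(names, prefix="Region", drop_first=True)
print(list(dummies.columns))
```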