IV Checking the Error Variable Assumptions: Residual Analysis
V The Analysis of Variance (ANOVA) Table
A. Sums of Squares
B. Degrees of Freedom
C. Mean Squares
D. The F – Ratio
VI Testing the Assumption of Linearity
VII Testing the Importance of Individual Variables to the Model
VIII Multicollinearity
A. What is Multicollinearity?
B. Diagnosing Multicollinearity
C. Problems Stemming from Multicollinearity
IX Interpreting the Regression Coefficients
X The Standard Error of the Estimate
XI Preferred Measures of Fit
XII Dummy Variables
A. The Problem
B. The Solution
C. Qualitative Variables with More than Two Possible Outcomes
D. Creating Dummy Variables in Statgraphics
E. Interpreting Dummy Variables
In simple linear regression, the error variable $\varepsilon$ represented the combined effect upon the dependent variable of all the important variables not included in the model. It seems only natural, therefore, that we might wish to add variables we think significant to the model with the goal of reducing the (unexplained) random variation in the dependent variable, i.e., producing a better fitting model.
II The Model
The model for multiple linear regression is given by
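$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$, where
$k$ equals the number of independent variables in the model
$X_i$ is the $i$th independent variable (out of $k$)
$Y$ and $\varepsilon$ are random variables
$\beta_0, \beta_1, \dots, \beta_k$ are the parameters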
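To make the roles of the parameters concrete, here is a minimal Python sketch of the model. The function names and the parameter values are hypothetical, chosen only for illustration, and the error is drawn from a normal distribution with standard deviation `sigma`, which is an assumption of the sketch.

```python
import random

def mean_response(betas, xs):
    """Return beta_0 + beta_1*x_1 + ... + beta_k*x_k."""
    if len(betas) != len(xs) + 1:
        raise ValueError("need one intercept plus one slope per independent variable")
    intercept, slopes = betas[0], betas[1:]
    return intercept + sum(b * x for b, x in zip(slopes, xs))

def simulate_y(betas, xs, sigma, rng=random):
    """Draw one value of Y: the mean response plus a normally distributed error."""
    return mean_response(betas, xs) + rng.gauss(0.0, sigma)

# Hypothetical parameters for k = 2 independent variables.
betas = [10.0, 2.0, -1.5]
print(mean_response(betas, [3.0, 4.0]))      # the systematic part of the model
print(simulate_y(betas, [3.0, 4.0], 1.0))    # differs from run to run because of the error
```

Calling `simulate_y` repeatedly with the same values of the independent variables gives different values of Y, which is exactly the random variation that the error variable represents.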
The Multiple Linear Regression model, $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$, makes two different kinds of assumptions.
The first of these, mentioned previously, postulates that the dependent variable $Y$ is linearly related to the collection of independent variables taken together, i.e., $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$.
The Simple Linear Regression model $Y = \beta_0 + \beta_1 X$ hypothesizes that the relationship between $Y$ and $X$ can best be described by a line. Similarly, the Quadratic (Polynomial) model $Y = \beta_0 + \beta_1 X + \beta_2 X^2$ hypothesizes that the relationship follows a parabola. The Multiple Linear Regression model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ hypothesizes that the relationship between the dependent variable $Y$ and the independent variables $X_1$ and $X_2$ can be pictured as a plane. The picture below right shows such a plane. This model may be suggested by experience, theoretical considerations, or exploratory data analysis. (For comparison, the picture below left is of a simple regression model.)
The more general multiple regression model considered here, $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$, is also thought of as describing a plane, but for $k > 2$ we aren’t able to picture the plane described by the model.
The Error Variable
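The second set of assumptions involves the distribution of the error variable, $\varepsilon$. Specifically:
The error variable $\varepsilon$ is assumed to be normally distributed with mean zero and constant variance $\sigma^2$.
The errors are assumed to be independent of each other.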
The first assumption about the error variable makes construction of confidence intervals for the mean value of $Y$, for fixed values of the independent variables, possible. It also allows us to conduct useful hypothesis tests. The constant variance part of the assumption states that the variation in the values of the dependent variable $Y$ about the plane $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is the same for all values of the independent variables observed.
Recall that the assumptions made in Linear Regression may not be justified by the data. Using the results of a regression analysis when the assumptions are invalid may lead to serious errors! Prior to reporting the results of a regression analysis, therefore, you must demonstrate that the assumptions underlying the analysis appear reasonable given the data upon which the analysis is based.
Example 1: Securicorp markets security equipment to airports. Management wishes to evaluate the effectiveness of a new program that offers performance bonuses to salespeople, while also taking into account the effect of advertising. The company currently markets in the West, Midwest, and South. The regions are further divided into smaller sales territories. The file SECURICORP contains data from last year for each sales territory for the following variables:
Sales: Sales, in thousands of dollars
Ad: Advertising, in hundreds of dollars
Bonus: Bonuses, in hundreds of dollars
Region: The region to which the sales territory belongs
Using the Multiple Regression button on the main toolbar (or Relate>Multiple Regression), entering Sales as the dependent variable, and Ad and Bonus as the independent variables, produces the Analysis Window below.
IV Checking the Error Variable Assumptions: Residual Analysis
The assumptions made about the distribution of the error variable can be checked by looking at the Plot of Residuals vs. Predicted Y and a histogram of the studentized residuals, as in simple regression. In addition, for time-series data the Plot of Residuals vs. Row Number (the row numbers represent the time periods in which the data was collected) should be free of any obvious patterns that would suggest that errors were correlated.
For Multiple Regression, it is also advisable to look at the Plot of Residuals versus X for each independent variable in the model. As with other plots of residuals, we expect the residual plots for each independent variable to be random (free of obvious patterns) and to have a constant spread for all values of the independent variable.
Example 1 (continued): For Securicorp sales regressed on advertising and bonuses, the studentized residuals plotted against the predicted sales and the histogram of the studentized residuals appear below.
The Plots of Residuals versus X for advertising and bonus appear below. The plot for advertising is the default graph because the variable Ad appears as the first variable in our model. To view the plot for the variable Bonus, select Pane Options with the right mouse button and click on the variable Bonus.
Note that all of the residual plots appear random and exhibit constant spread about the model (represented by the horizontal line in the plots). This suggests that the model assumptions are reasonable for these data.
V The Analysis of Variance (ANOVA) Table
As in simple regression, an ANOVA table is prominently featured in the output of the Analysis Summary window. Below is a description of the contents of the columns in the table.
Sums of Squares
The definition and interpretation of the sums of squares in multiple regression are similar to those in simple regression.
Total Sum of Squares, $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$, is a measure of the total variation of $Y$
Regression Sum of Squares, $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$, measures the variation of $Y$ explained by the model
Error Sum of Squares, $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, measures the variation of $Y$ left unexplained by the model
Note: Remarkably, we find that the equality $SST = SSR + SSE$ always holds. Thus, the total variation in $Y$ can be “decomposed” into the explained variation plus the unexplained variation for the model.
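The decomposition can be checked numerically. The sketch below is not how Statgraphics computes its output; it is a small pure-Python least-squares fit (normal equations solved by Gaussian elimination) applied to made-up data, with hypothetical function names, included only to show that SST and SSR + SSE agree for a model with an intercept.

```python
def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(xs, ys):
    """Least-squares estimates b_0, ..., b_k from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(x) for x in xs]              # prepend the intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(xtx, xty)

def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares fit of ys on xs."""
    b = fit_ols(xs, ys)
    fitted = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for x in xs]
    y_bar = sum(ys) / len(ys)
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sst, ssr, sse

# Made-up data: six observations on two independent variables.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [5.1, 5.9, 11.2, 11.8, 17.1, 18.0]
sst, ssr, sse = sums_of_squares(xs, ys)
print(sst, ssr + sse)    # the two values agree up to rounding error
```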
Degrees of Freedom
The degrees of freedom (df) equal the amount of independent information available for the corresponding sum of squares. Starting with n degrees of freedom (one for each observation in the sample), we lose one degree of freedom for each parameter in the model estimated by a sample statistic b.
SST: df = $n - 1$, because the parameter $\mu_Y$ is estimated by the sample statistic $\bar{y}$.
SSE: df = $n - k - 1 = n - (k + 1)$, because the $k + 1$ model parameters, $\beta_0, \beta_1, \dots, \beta_k$, must be estimated by the sample statistics $b_0, b_1, \dots, b_k$.
SSR: df = $k$
Note: Paralleling the results from simple regression, observe that the total degrees of freedom, $n - 1$, “decompose” into the explained degrees of freedom $k$ plus the unexplained degrees of freedom $n - k - 1$ for the model.
Mean Squares
While a sum of squares measures the total variation, the ratio of a sum of squares to its degrees of freedom is a variance. The advantage of a variance is that it can be used to compare different data sets and different models (since it incorporates information about the size of the sample and the number of independent variables used in the model). These variances are called mean squares.
$S_Y^2 = \frac{SST}{n - 1}$ = the (sample) variance of the y-values observed (see the notes “Review of Basic Statistical Concepts”). You probably never knew (or cared, perhaps) that the sample variance you computed in your introductory statistics course was an example of a mean square!
$MSE = \frac{SSE}{n - k - 1} = \frac{\sum (y_i - \hat{y}_i)^2}{n - k - 1}$ = the (sample) variance of the dependent variable unexplained by the model. The Mean Square Error is the sample estimate of the variance of the error variable, $\sigma^2$.
$MSR = \frac{SSR}{k} = \frac{\sum (\hat{y}_i - \bar{y})^2}{k}$ = the variance of the dependent variable explained by the model, the Mean Square for Regression.
The F – Ratio
An intuitive device for judging the effectiveness of the model in describing the relationship between the dependent variable Y and the independent variables in the model, taken together, is to compute the ratio of the MSR (the variance in Y explained by the model) to the MSE (the variance of Y about the model). The resulting ratio is called the F statistic:
F – Ratio $= \frac{MSR}{MSE}$
Properties of the F statistic:
$F > 0$
If $F$ is “large” then the model explains much more of the variation in $Y$ than it leaves unexplained, which is evidence that the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ is appropriate, i.e., a large $F$ supports the linearity assumption of the model.
A “small” $F$ indicates that the model is inappropriate.
If the error variable is normally distributed with constant variance, then the F – Ratio follows a probability distribution called the F distribution. The P-value in the ANOVA table is for a hypothesis test of the linearity assumption of the model, i.e., the assumption that the dependent variable is linearly related to the set of independent variables, taken together, $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$. See the next section (Section VI) for a discussion of the test. (The F distribution has two degrees of freedom, $\nu_1$ and $\nu_2$. In the test conducted for the model, $\nu_1 = k$ and $\nu_2 = n - k - 1$.)
The ANOVA Table below summarizes the results in this section as they appear in Statgraphics.
Below is the ANOVA Table for the Securicorp example. (Note: The E6 in SSR and SST is scientific notation for $10^6$. Thus, SSR = 1,067,220, and SST = 1,249,880.) Remember, also, that the sales figures appearing in the spreadsheet are in units of $1,000, and that these units are squared in the computation of the sums of squares!
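The remaining entries of the table follow from SSR and SST by arithmetic, as the sketch below shows. The sample size is not stated in the text; n = 25 is an assumption, chosen because it reproduces the standard error of the estimate reported later in these notes. The variable names are ours, not Statgraphics'.

```python
# Values read from the Securicorp ANOVA table quoted in the notes.
ssr = 1_067_220.0      # regression (explained) sum of squares
sst = 1_249_880.0      # total sum of squares
k = 2                  # independent variables: Ad and Bonus
n = 25                 # ASSUMPTION: number of sales territories, not stated in the text

sse = sst - ssr        # unexplained variation, from SST = SSR + SSE

df_regression = k      # explained degrees of freedom
df_error = n - k - 1   # unexplained degrees of freedom
df_total = n - 1       # total degrees of freedom

msr = ssr / df_regression   # Mean Square for Regression
mse = sse / df_error        # Mean Square Error
f_ratio = msr / mse

print(f"SSE = {sse:,.0f}  (df = {df_error})")
print(f"MSR = {msr:,.1f}  MSE = {mse:,.1f}")
print(f"F   = {f_ratio:.2f}")
```

Because the sums of squares are in squared units of $1,000, so are MSR and MSE; the F – Ratio itself has no units.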
VI Testing the Assumption of Linearity
Is $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ an appropriate description of the relationship between the dependent and independent variables? To answer this question, we conduct a formal hypothesis test. For the test,
$H_0$: $\beta_1 = \beta_2 = \dots = \beta_k = 0$, i.e., none of the independent variables are linearly related to the dependent variable.
$H_A$: At least one $\beta_i$ is not zero, i.e., at least one of the independent variables is linearly related to the dependent variable.
Test Statistic: $F = \frac{MSR}{MSE}$ = the ratio of the explained and unexplained variance in $Y$.
P-Value: If the error variable satisfies the assumption made in section III then $F$ follows an F distribution. Using the F distribution, Statgraphics computes the P-value for the test statistic. Note, however, that substantial deviations from the error variable assumptions can make the P-value unreliable. Since larger F – Ratios correspond to more convincing evidence that at least one of the independent variables is correlated to $Y$, large values of $F$ lead to small P-values and the rejection of $H_0$.
Example 1 (continued): Based upon the P-value of 0.0000 for the F – Statistic in the ANOVA Table below for Securicorp, we reject the hypothesis that neither Ad nor Bonus is linearly related to Sales.
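For readers curious where such a P-value comes from, the sketch below uses a closed form for the upper tail of the F distribution that holds only when the numerator degrees of freedom equal 2 (i.e., k = 2, as in this example). It is not a general routine and not the method Statgraphics documents; the function name is hypothetical, and n = 25 is again an assumption because the sample size is not stated in the text.

```python
def f_test_p_value_k2(f_ratio, df_error):
    """P(F > f_ratio) when F has 2 numerator and df_error denominator degrees of freedom.

    This closed form is valid ONLY for 2 numerator degrees of freedom (k = 2).
    """
    if f_ratio < 0:
        raise ValueError("an F ratio cannot be negative")
    return (1.0 + 2.0 * f_ratio / df_error) ** (-df_error / 2.0)

ssr, sst = 1_067_220.0, 1_249_880.0   # from the Securicorp ANOVA table
k = 2                                  # Ad and Bonus
n = 25                                 # ASSUMPTION: sample size is not stated in the notes

df_error = n - k - 1
f_ratio = (ssr / k) / ((sst - ssr) / df_error)
p_value = f_test_p_value_k2(f_ratio, df_error)
print(f"F = {f_ratio:.2f}, P-value = {p_value:.4f}")
```

Under these assumptions the P-value is far smaller than 0.00005, which is why it is displayed as 0.0000 when rounded to four decimal places.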
VII Testing the Importance of Individual Variables to the Model
Having established, via the F-test, that the $k$ independent variables, taken together, are correlated to $Y$, we next ask which individual independent variables belong in the model. This involves conducting a t – test of the slope for each of the $k$ independent variables in the model.
Statgraphics determines the utility of an independent variable by considering how much the variable improves the model if it is the last one to enter. The test statistic and P-value for the test, presented in the Analysis Summary window in the same row as the estimated slope for the variable, can be used to determine the importance of the variable in explaining the variation in $Y$ after accounting for the effects of the other variables in the model. Thus the t – test measures the marginal improvement the variable affords the model.
Because results of the individual t - tests depend upon the presence of the other variables in the model, each time you add or remove a variable from the model all of the test statistics and P-values will change.
Example 1 (continued): Based upon the P-values of 0.0000 for the variable Ad and 0.0170 for the variable Bonus in the Analysis Summary for Securicorp shown earlier, both independent variables are linearly related to Sales in this model, and are therefore retained.
Example 2: A real estate agent believes that the selling price of a house can be predicted using the number of bedrooms, the size of the house, and the size of the lot upon which the house sits. A random sample of 100 houses was drawn and the data recorded in the file HOUSE PRICE for the variables below.
Looking at the House Price data, we immediately note that if the variables H_Size, Lot_Size, and Bedrooms are all included in the model, their individual P-values are all high (see the output below). This might lead us to conclude that none of them are correlated to the price of a house, but the P-value for the model assures us that at least one of them is correlated to house price. (In fact, doing simple regressions of Price on the three independent variables, taken one at a time, leads to the conclusion that all of them are correlated to house price. You should verify these results.) These seemingly contradictory results are explained by the existence of Multicollinearity in the model.
VIII Multicollinearity
What is Multicollinearity?
In regression, we expect the independent variables to be correlated to the dependent variable. It often happens, however, that they are also correlated to each other. If these correlations are high then multicollinearity is said to exist.
Diagnosing Multicollinearity
There are several ways to diagnose the existence of multicollinearity in a model. The following are the simplest indicators:
The P-values of important explanatory (independent) variables are high for the model. For example, the individual P-values for Bedrooms, H_Size, and Lot_Size are all high although we know that they are all correlated with House Price.
The algebraic sign (+/-) of one or more of the slopes is incorrect. For example, the regression coefficient for Lot_Size is negative, suggesting that increasing the size of the lot will tend, on average, to decrease the price of the house. A simple regression of Price on Lot_Size, however, confirms our suspicion that the two are positively correlated!
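Since multicollinearity is defined by high correlations among the independent variables, a direct check is to compute those pairwise correlations. The Python sketch below does this with a hand-written Pearson correlation. The numbers are made up for illustration and are not taken from the HOUSE PRICE file, and the notes do not say what size of correlation should count as "high".

```python
import math

def correlation(u, v):
    """Pearson sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Made-up values for illustration only; NOT the HOUSE PRICE data.
variables = {
    "H_Size": [1500, 1800, 2100, 2400, 2700, 3000],
    "Bedrooms": [2, 3, 3, 4, 4, 5],
    "Lot_Size": [6000, 7100, 8500, 9400, 11000, 12100],
}

# Print the correlation for every pair of independent variables.
names = list(variables)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = correlation(variables[first], variables[second])
        print(f"{first} vs {second}: r = {r:.3f}")
```

Correlations close to +1 or –1 between independent variables, as in this artificial example, are the situation in which the individual t – tests and the signs of the slopes become hard to trust.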
Problems Stemming from Multicollinearity
While the existence of multicollinearity doesn’t violate any model assumptions, or make the model invalid, it does pose certain problems for the analyst:
Individual t-tests may prove unreliable, making it difficult to determine which variables in the model are correlated to Y.
Because the estimated slopes may vary wildly from sample to sample (and even change algebraic sign), it may not be possible to interpret the slope of an independent variable as the marginal effect of a unit change in the variable upon the average value of Y (as described in the next section).
The simplest way to remove multicollinearity is to remove one or more of the correlated variables. For example, for the House Price data removing the variables Bedrooms and Lot_Size produces a simpler (in fact, Simple Linear Regression) model without multicollinearity.
IX Interpreting the Regression Coefficients
The regression coefficients are interpreted essentially the same in multiple regression as they are in simple regression, with one caveat. The slope of an independent variable in multiple regression is usually interpreted as the marginal (or isolated) effect of a unit change in the variable upon the mean value of Y when “the values of all of the other independent variables are held constant”. Thus, as stated in the previous section, when multicollinearity is a problem it may not be possible to interpret all of the coefficients. This is because some of the independent variables are closely interrelated, making it impossible to change the value of one while holding the values of the others constant.
Graphically, the coefficient of the independent variable $X_i$, $\beta_i$, is the slope of the plane $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$ that we would experience if we decided to walk in the direction of increasing values of $X_i$, i.e., parallel to the $X_i$ axis. Specifically, if we move one unit in this direction, $Y$ will change, on average, by $\beta_i$ units. When multicollinearity is a problem, however, we may not be able to move around the plane in directions parallel to the axes of individual independent variables (we’re restricted to moving along certain “paths” on the plane). Thus, we are unable to “experience” the slope of $X_i$, and $\beta_i$ can no longer be interpreted as the marginal effect of a unit change in the value of $X_i$ upon the mean value of $Y$.
Example 1 (continued): For Securicorp, the regression coefficients are interpreted below. For convenience, the Statgraphics output found in the Analysis Summary window is shown again. Recall that Sales is in thousands of dollars, while Ad and Bonus are in hundreds of dollars.
$b_0$: $b_0$ estimates the expected annual sales for a territory if $0.00 is spent on advertising and bonuses. Because these values are outside the range of values for Ad and Bonus observed, and upon which the estimated regression equation is based, the value of $b_0$ has no practical interpretation. Put more concisely, an interpretation of $b_0$ is not supported by the data. This will often, but not always, be the case in multiple regression. You should try to come up with a scenario, not involving the Securicorp example, where interpretation of the estimated intercept $b_0$ would be appropriate.
$b_1$: Expected (mean) sales increases by about $2,472 for every $100 increase in the amount spent on advertising, holding the amount of bonuses paid constant.
$b_2$: Sales increases by $1,853, on average, for every $100 increase in bonuses, for a given amount spent on advertising.
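The unit conversions behind these interpretations are easy to get wrong, so here is a short Python sketch of the arithmetic. The slope values 2.472 and 1.853 are the ones implied by the interpretations above (Sales in $1,000s, Ad and Bonus in $100s); the function name is hypothetical.

```python
# Slopes implied by the interpretations in the notes (Sales in $1,000s; Ad and Bonus in $100s).
B_AD = 2.472       # approx.: +$2,472 in expected sales per extra $100 of advertising
B_BONUS = 1.853    # approx.: +$1,853 in expected sales per extra $100 of bonuses

def expected_sales_change(delta_ad_dollars=0.0, delta_bonus_dollars=0.0):
    """Change in expected annual sales, in dollars, for given dollar changes in Ad and Bonus."""
    delta_ad = delta_ad_dollars / 100.0        # dollars -> spreadsheet units ($100s)
    delta_bonus = delta_bonus_dollars / 100.0
    change_in_thousands = B_AD * delta_ad + B_BONUS * delta_bonus
    return change_in_thousands * 1000.0        # $1,000s -> dollars

# Effect of spending an extra $500 on advertising, bonuses held constant.
print(round(expected_sales_change(delta_ad_dollars=500)))
```

As with any slope, the calculation should only be trusted for changes that keep Ad and Bonus within the range of values observed in the data.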
X The Standard Error of the Estimate
$S = \sqrt{MSE}$ = the sample estimate of the standard deviation of the error variable, $\sigma$, which measures the spread of the actual values of $Y$ about the true plane $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$. As such, the standard error should be reported in the units appropriate to the dependent variable as they appear in the spreadsheet. For example, in the regression of Sales on Ad and Bonus for Securicorp the standard error of the estimate is $91,121. As in simple regression, Statgraphics displays the standard error below the ANOVA Table.
XI Preferred Measures of Fit
Although there are many statistics which can be used to measure the fit of a model to the data, such as $S$, the most commonly used statistics for this purpose are $R^2$ and $R^2$-adjusted for degrees of freedom.
$R^2$ is defined as in simple regression and continues to have the same interpretation. The drawback to $R^2$ is that, because it can’t decrease when new variables are added to a model, it is inappropriate for comparing models with different numbers of independent variables. For this reason, a statistic that included information about the number of independent variables (and that penalized models with lots of useless or redundant variables) was created. This new statistic is called $R^2$-adjusted for degrees of freedom, or simply $R^2$-adjusted.
$R^2$-adjusted $= 1 - \frac{MSE}{S_Y^2}$, where $S_Y^2 = \frac{SST}{n - 1}$ is the sample variance for $Y$ (the “missing” Mean Square in the ANOVA Table). Because $R^2$-adjusted includes information about the sample size and the number of independent variables, it is more appropriate than $R^2$ for comparing models with different numbers of independent variables. $R^2$-adjusted appears directly below $R^2$ in the Statgraphics output.
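The three measures of fit can be computed from the ANOVA table, as in the Python sketch below for the Securicorp regression of Sales on Ad and Bonus. SSR and SST are the values quoted in Section V; the sample size n = 25 is an assumption, since it is not stated in the text, although it does reproduce the figures reported in these notes to within rounding.

```python
import math

ssr = 1_067_220.0    # regression sum of squares, from the ANOVA table
sst = 1_249_880.0    # total sum of squares, from the ANOVA table
k = 2                # independent variables: Ad and Bonus
n = 25               # ASSUMPTION: sample size, not stated in the notes

sse = sst - ssr
mse = sse / (n - k - 1)          # Mean Square Error
s = math.sqrt(mse)               # standard error of the estimate, in $1,000s
s2_y = sst / (n - 1)             # sample variance of Y, the "missing" mean square

r2 = ssr / sst                   # proportion of the variation in Y explained by the model
r2_adjusted = 1.0 - mse / s2_y   # penalized for the number of independent variables

print(f"S           = {s:.3f} (thousands of dollars)")
print(f"R2          = {100 * r2:.1f}%")
print(f"R2-adjusted = {100 * r2_adjusted:.1f}%")
```

Note that $R^2$-adjusted is smaller than $R^2$, and that the gap widens as more independent variables are added without a matching reduction in SSE.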
XII Dummy Variables
The Problem
Regression is designed for quantitative variables, i.e., both the dependent and independent variables are quantitative. There are times, however, when we wish to include information about qualitative variables in the model. For example, qualitative factors such as the presence of a fireplace, pool, or attached garage may have an effect upon a house’s price.
The Solution
The way to get around regression’s restriction to quantitative variables is through the creation of Dummy (also called binary or indicator) Variables.
A dummy variable for a characteristic indicates the presence or absence of the characteristic in the observation. For example, in the Eugene house data (see the notes for simple regression) the variables Attach and View indicate the presence or absence of an attached garage or a nice view, respectively, for each house observed. The variables are defined in the data as follows.
From looking at the spreadsheet above, we can tell that the eighth house in the sample had an attached garage but no view, while the thirteenth house had both.
Qualitative Variables with More than Two Possible Outcomes
If a qualitative variable has more than two possible outcomes, for example a variable for the seasons may acquire the values Spring, Summer, Fall, or Winter, then a dummy is created for all but one of the outcomes. (It is important that you exclude one of the outcomes from the model. Statgraphics will get upset if you try to include them all!) Thus, we might create one dummy for Spring, another for Summer, and a third for Fall.
Creating Dummy Variables in Statgraphics
In the spreadsheet for Eugene houses the variables Attach and View were already represented as dummy (0 or 1) variables. Frequently, however, the values of a qualitative variable will appear as descriptions (“smoker” or “nonsmoker”) or numerical codes (“1” for red, “2” for blue, “3” for green, etc.). To create dummy variables for a qualitative variable with m possible outcomes, begin by copying the variable and pasting it into m – 1 columns in the spreadsheet.
Example 1 (continued): Securicorp would also like to examine whether the marketing region to which a territory belongs affects expected sales, after taking into account the effect of advertising and bonuses. These regions have been coded in the data using 1 = West, 2 = Midwest, and 3 = South. (The column “Region Codes” provides the key to the codes.) Copying and pasting the Region variable twice (m = 3) we arrive at the view below.
To create a dummy variable for the Midwest, begin by selecting one of the pasted columns in the spreadsheet. Then use the right mouse button to access the column menu shown below and select Recode Data.
This leads to the dialog box seen to the right. Using Tab on the keyboard to move around, we’ve instructed Statgraphics to turn all “2”s (midwestern territories) in the column into “1”s, while any other number (region) is set to zero. At this point, Modify Column may be selected from the column menu and used to rename the column “Midwest.” After similarly recoding the other pasted column for the western region, the spreadsheet will look like the one below.
Note that the fifth territory in the data, which belongs to the southern marketing region, appears with zeros in the columns for the western and midwestern regions, i.e., it belongs to neither West nor Midwest. Thus, only two (m – 1) dummy variables are required for the three (m = 3) regions! In general, m – 1 dummy variables are created for a qualitative variable with m possible outcomes.
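The same recoding can be expressed in a few lines of Python, shown here only as a sketch of the logic and not as a replacement for the Statgraphics steps. The function name is hypothetical and the list of region codes is made up; it is not the Region column of the SECURICORP file.

```python
def make_dummies(codes, baseline):
    """Return one 0/1 column per outcome found in `codes`, skipping `baseline`."""
    outcomes = sorted(set(codes))
    if baseline not in outcomes:
        raise ValueError("baseline must be one of the observed outcomes")
    return {
        outcome: [1 if code == outcome else 0 for code in codes]
        for outcome in outcomes
        if outcome != baseline
    }

REGION_NAMES = {1: "West", 2: "Midwest", 3: "South"}   # coding given in the notes
region = [1, 3, 2, 2, 3, 1]                            # made-up codes for six territories

dummies = make_dummies(region, baseline=3)             # South is the omitted outcome
for code, column in dummies.items():
    print(REGION_NAMES[code], column)
```

A southern territory shows up as a zero in both columns, so the m = 3 regions are fully described by m – 1 = 2 dummy variables.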
We are now ready to include the region identification of a sales territory into our regression model for annual sales. The model now looks like $\text{Sales} = \beta_0 + \beta_1\,\text{Ad} + \beta_2\,\text{Bonus} + \beta_3\,\text{West} + \beta_4\,\text{Midwest} + \varepsilon$. The Input Dialog button (red button, far left of the multiple regression analysis toolbar) is used to rerun the regression analysis with the new set of independent variables. The Analysis Window for the new model appears below.
$R^2$-adjusted has increased from 84.1% to 93.6% with the addition of the dummy variables for region. (Why is $R^2$-adjusted preferred to $R^2$ for comparing the two models?) In addition, three of the four variables are significant at the 5% level of significance (including both dummies), while Bonus is significant at the 10% level. The Plot of Residuals vs. Predicted for Sales and the histogram of the studentized residuals for the new model are shown below. Note the outlier in row 11.
Interpreting Dummy Variables
For the Securicorp sales model with Ad, Bonus, West, and Midwest included, the estimated regression equation returned by Statgraphics is
Comparing these two equations, the model predicts that expected annual sales, for given expenditures on advertising and bonuses, will be $258,877 less for a western territory than for a southern territory. Thus the estimated regression coefficient –258.877 for West is interpreted as the expected difference in sales between western and southern territories, in thousands of dollars, given identical advertisement and bonus expenditures. Similarly, the coefficient –210.456 for Midwest is interpreted as the expected difference in mean annual sales between midwestern and southern territories. (How would we use the equation for Sales to compare the expected annual sales for western and midwestern territories?)
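One way to organize such comparisons is to treat the omitted region (South) as having a coefficient of zero, so that the expected difference between any two regions is the difference of their dummy coefficients. The Python sketch below uses the two coefficients quoted above; the dictionary and function names are hypothetical.

```python
# Dummy coefficients from the estimated equation, in thousands of dollars.
# South is the omitted (baseline) region, so its effect is zero by construction.
REGION_EFFECT = {"South": 0.0, "West": -258.877, "Midwest": -210.456}

def expected_sales_gap(region_a, region_b):
    """Expected annual sales in region_a minus region_b, in dollars,
    for identical advertising and bonus expenditures."""
    return (REGION_EFFECT[region_a] - REGION_EFFECT[region_b]) * 1000.0

print(round(expected_sales_gap("West", "South")))
print(round(expected_sales_gap("Midwest", "South")))
print(round(expected_sales_gap("West", "Midwest")))
```

The Ad and Bonus terms cancel in each difference because both territories are assumed to have the same expenditures, which is why only the dummy coefficients appear.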