# Statistic 302: Project 1. Predicting Is as Simple as Simple Regression

Jimmy Huang

## 1 Introduction

The Eastwestside Movers, an intracity moving company, has typically used a trained estimator to determine the number of labor hours needed for a move. This has proved useful in the past, but the company would like a more reliable and accurate way to predict labor hours. As a preliminary step, the company has collected data for 36 moves (refer to the provided data) in which the origin and destination were within the borough of Manhattan in New York City and travel time was an insignificant portion of the hours worked. With these data on hand, the company is asking a statistician to analyze them and develop a model that predicts labor hours from the number of cubic feet to be moved from the apartment of origin.

## 2 Preliminary Analysis

The statistical tool JMPIN is used to explore the data and examine the relationship between the number of cubic feet to be moved and the labor hours required.

Across the 36 moves, the total labor required is 1,042.5 hours and the total volume moved is 22,520 cubic feet. On average, therefore, moving one cubic foot requires approximately 0.046292 hours.
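The average rate above is a single division of the two totals. A minimal sketch of that arithmetic, using only the totals reported in the data summary (the variable names are illustrative, not part of the JMPIN output):

```python
# Totals reported for the 36 moves (see the data summary in Appendix A).
total_hours = 1042.5
total_feet = 22520
n_moves = 36

# Average labor per cubic foot, plus the per-move averages.
hours_per_cubic_foot = total_hours / total_feet
mean_hours = total_hours / n_moves
mean_feet = total_feet / n_moves

print(f"{hours_per_cubic_foot:.6f} hours per cubic foot")
print(f"{mean_hours:.4f} hours and {mean_feet:.2f} cubic feet per move on average")
```

This crude rate ignores any fixed effort per move, which is one reason a regression with an intercept is fitted afterwards.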

Figure 1 shows a scatter plot of all the paired data for cubic feet and labor hours. The labor hours appear to be roughly proportional to the cubic feet moved, so a linear model can be used to fit the points.

Figure 1: Scatter plot of the collected 36 pairs of data

## 3 Fitting Model

From the scatter diagram, it appears that there exists a linear association between the cubic feet moved and the labor hours required. Using JMPIN to conduct a simple regression fit of Hours by Feet (refer to Appendix A for details), we’ve obtained a linear model for the data as:

Hours = -2.36966 + 0.0500803 Feet

with correlation coefficient r = 0.942998 and r² = 0.889246. The fitted model therefore explains a large proportion of the total variation: approximately 88.9% of the variation in the labor hours is explained by the model.
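As a cross-check, r² can be recovered from the sums of squares in the Analysis of Variance table in Appendix A, since r² is the model sum of squares divided by the corrected total. The sketch below assumes those two table values and takes the sign of r from the slope; the variable names are illustrative.

```python
import math

# Sums of squares from the Analysis of Variance table (Appendix A).
ss_model = 6910.7189   # variation explained by the regression on Feet
ss_total = 7771.4375   # corrected total variation in Hours
slope = 0.0500803      # fitted slope; r takes its sign

r_squared = ss_model / ss_total
r = math.copysign(math.sqrt(r_squared), slope)

print(f"r^2 = {r_squared:.6f}, r = {r:.6f}")
```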

Next, we test the hypothesis of zero slope (β1 = 0) to verify that a straight-line model in cubic feet is better than a model that does not include cubic feet at all. For the full hypothesis testing procedure, refer to Appendix C. Since we reject the null hypothesis of zero slope, the choice of the linear model is reasonable. Moreover, the Analysis of Variance table in the JMPIN output (refer to Appendix A) also shows that the null hypothesis of zero slope should be rejected, as the p-value (< 0.0001) is below the 0.05 significance level.

Plotting the residuals of Hours and graphing the histogram of the residuals (refer to Appendix B) shows no evident violation of the model assumptions.
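The zero-slope test can be reproduced from a few summary quantities in Appendix A, using the standard-error formula from Appendix C, Sb1 = SY|X / (SX * sqrt(n - 1)). This is a sketch under stated assumptions: the critical value 2.030 is the one quoted in the report rather than computed here, and the variable names are illustrative.

```python
import math

n = 36
b1 = 0.0500803          # fitted slope
mse = 25.32             # error mean square from the ANOVA table
sd_feet = 280.582704    # sample standard deviation of Feet
t_crit = 2.030          # critical value quoted in the report for t(34, 0.975)

s_yx = math.sqrt(mse)                          # residual standard deviation
se_b1 = s_yx / (sd_feet * math.sqrt(n - 1))    # standard error of the slope
t_stat = (b1 - 0) / se_b1                      # test statistic under H0: beta1 = 0
reject_h0 = abs(t_stat) >= t_crit

print(f"SE(b1) = {se_b1:.6f}, T = {t_stat:.2f}, reject H0: {reject_h0}")
```

Without intermediate rounding, T comes out at about 16.52, in line with the t Ratio in the JMPIN parameter estimates; its square is close to the F Ratio in the ANOVA table, as expected for a single predictor.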

## 4 Application of the Fitted Model

By obtaining the relationship between the labor hours required and cubic feet to be moved with the linear model:
Hours = -2.36966 + 0.0500803 Feet

the company can now predict labor hours more reliably. For example, to estimate the labor hours needed to move X0 = 800 cubic feet, the company simply substitutes 800 into the model to get a point estimate:

Predicted Hours = -2.36966 + 0.0500803 * 800 ≈ 37.69
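The substitution can be wrapped in a small helper so that estimators can reuse it for any move size. This is a minimal sketch; the function name `predict_hours` is hypothetical, and the coefficients are the ones from the fitted model above.

```python
# Coefficients of the fitted model Hours = B0 + B1 * Feet.
B0 = -2.36966
B1 = 0.0500803

def predict_hours(cubic_feet: float) -> float:
    """Point estimate of labor hours for a move of the given size."""
    return B0 + B1 * cubic_feet

print(f"{predict_hours(800):.2f} hours for 800 cubic feet")
```

The estimate is only meaningful for move sizes within the range covered by the 36 collected moves; for very small moves the negative intercept would even yield negative hours.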

Then by doing the following calculation, the company can get a 95% prediction interval (PI):

$$
\hat{Y}_0 \pm t_{n-2,\,1-\alpha/2} \cdot S_{Y|X} \cdot \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{(n-1)\,S_X^2}}
$$

where $\hat{Y}_0$ is the Predicted Hours at X0 (equivalently, Mean Hours + b1 * (X0 - Mean Feet)) and $\bar{X}$ is Mean Feet. Substituting the values from Appendix A:

$$
37.69 \pm 2.030 \cdot 5.031427 \cdot \sqrt{1 + \frac{1}{36} + \frac{(800 - 625.555556)^2}{(36-1) \cdot 78726.654}} \approx 37.69 \pm 10.41
$$

≈ (27.28, 48.10)

Therefore, if the labor required to move 800 cubic feet falls between about 27 and 48 hours, the move is within the normal range. If it is below 27 hours, the move was more efficient than expected. If it is above 48 hours, the company should investigate what went wrong with the move, since other factors may have affected it.
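The interval calculation can be sketched as a short function. It uses the standard prediction interval for simple linear regression, centered on the point prediction b0 + b1 * X0 (which equals Mean Hours + b1 * (X0 - Mean Feet)). The inputs are the summary values from Appendix A; the critical value 2.030 is the one quoted in the report, and the function name is hypothetical.

```python
import math

n = 36
b0, b1 = -2.36966, 0.0500803
mean_feet = 625.555556
var_feet = 78726.654    # sample variance of Feet
rmse = 5.031427         # residual standard deviation
t_crit = 2.030          # critical value quoted in the report for t(34, 0.975)

def prediction_interval(x0: float) -> tuple[float, float]:
    """95% prediction interval for the labor hours of one new move of size x0."""
    center = b0 + b1 * x0
    spread = math.sqrt(1 + 1 / n + (x0 - mean_feet) ** 2 / ((n - 1) * var_feet))
    half_width = t_crit * rmse * spread
    return center - half_width, center + half_width

low, high = prediction_interval(800)
print(f"({low:.2f}, {high:.2f})")
```

The half-width grows as X0 moves away from Mean Feet, so predictions for unusually small or large moves are less precise.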

## Appendix A: Bivariate Fit of Hours By Feet

Figure 2: Mean and Regression Fit of Hours By Feet

Table 1: Fit Mean

| Statistic | Value |
|---|---|
| Mean | 28.9583 |
| Std Dev [RMSE] | 14.901 |
| Std Error | 2.48351 |
| SSE | 7771.44 |

Table 2: Summary of Fit

| Statistic | Value |
|---|---|
| RSquare | 0.889246 |
| RSquare Adj | 0.885988 |
| Root Mean Square Error | 5.03143 |
| Mean of Response | 28.9583 |
| Observations (or Sum Wgts) | 36 |

Table 3: Analysis of Variance (ANOVA Table)

| Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F |
|---|---|---|---|---|---|
| Model | 1 | 6910.7189 | 6910.72 | 272.9864 | <.0001 |
| Error | 34 | 860.7186 | 25.32 | | |
| C. Total | 35 | 7771.4375 | | | |
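The entries of the ANOVA table are tied together by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, the F Ratio is the ratio of the two mean squares, and the Root Mean Square Error is the square root of the error mean square. A small sketch that checks this against the table values (variable names are illustrative):

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_model, df_model = 6910.7189, 1
ss_error, df_error = 860.7186, 34

ms_model = ss_model / df_model
ms_error = ss_error / df_error    # about 25.32
f_ratio = ms_model / ms_error     # about 272.99
rmse = math.sqrt(ms_error)        # about 5.0314

print(f"MSE = {ms_error:.2f}, F = {f_ratio:.4f}, RMSE = {rmse:.5f}")
```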

Table 4: Parameter Estimates

| Term | Estimate | Std Error | t Ratio | Prob>\|t\| |
|---|---|---|---|---|
| Intercept | -2.36966 | 2.073261 | -1.14 | 0.2610 |
| Feet | 0.0500803 | 0.003031 | 16.52 | <.0001 |

Table 5: Data Summary

| Statistic | Value |
|---|---|
| N Rows | 36 |
| Sum(Hours) | 1042.5 |
| Sum(Feet) | 22520 |
| Mean(Hours) | 28.9583333 |
| Mean(Feet) | 625.555556 |
| Std Dev(Hours) | 14.9010426 |
| Std Dev(Feet) | 280.582704 |

## Appendix B: Residual Plot of Hours

Figure 3: Residual Plot

Figure 4: Distributions, Residuals Hours

## Appendix C: Hypothesis Testing for Zero Slope: β1 = 0

Testing Procedure:

1. Assumptions: The errors about the regression line are independent and normally distributed with constant variance, and the 36 moves are a random sample.

2. Hypotheses: H0: β1 = 0 versus HA: β1 ≠ 0

3. Significance level: α = 0.05 (95% confidence)

4. Test statistic: $T = (b_1 - \beta_1) / S_{b_1}$, where

   1. $S_{b_1} = S_{Y|X} / (S_X \sqrt{n-1})$

   2. $S_{Y|X}^2 = \frac{1}{n-2} \sum (Y_i - \hat{Y}_i)^2$

   3. $S_X^2 = \frac{1}{n-1} \sum (X_i - \bar{X})^2$

   4. Sample size n = 36

5. Rejection region: reject H0 if $|T| \ge t_{n-2,\,1-\alpha/2} = t_{34,\,0.975} \approx 2.030$; do not reject H0 otherwise.

6. Calculation of T: From the JMPIN output in Appendix A, we get

   $b_1 = 0.0500803$

   $S_{Y|X}^2 = 25.32 \Rightarrow S_{Y|X} \approx 5.032$

   $S_X^2 = 78726.654 \Rightarrow S_X \approx 280.583$

   $S_{b_1} = 5.032 / (280.583 \cdot \sqrt{36 - 1}) \approx 0.00303$

   $T = (0.0500803 - 0) / 0.00303 \approx 16.528$

7. Conclusion: Since $T \approx 16.528 > t_{34,\,0.975} \approx 2.030$, we reject H0 at significance level 0.05 and conclude that the cubic feet to be moved provides significant information for predicting the labor hours needed. That is, a straight-line model in cubic feet is better than a model that does not include cubic feet at all.

Copyright © 2002 Jimmy Huang, June 13, 2002