**Bio 793- Niche modeling**
**Classification and Regression Trees**
**Notes by Andrew Siefert**
Instead of doing a re-tread of Jason’s lecture notes for CART techniques, I did a review of topics covered in the class by fitting a variety of models to cover data for *Betula alleghaniensis*, yellow birch. The purpose of this exercise was to review and compare the models we’ve discussed so far (general linear models, GLMs, GAMs, and regression trees).
Starting with the dataset we’ve been using in class (named dat), I created a subset of just *Betula alleghaniensis* data, which I named “datb”:
>datb= subset(dat, dat$species=="Betula alleghaniensis")
Next, I used the attach function so I could just enter the names of variables without specifying the data frame:
>attach(datb)
For all the models I explored, I used cover as the response variable and elev, utme, utmn, and beers as the (arbitrarily chosen) predictor variables.
**General linear models**
First, I fit a **full linear model** that included all the predictor variables and their interactions:
>lm1= lm(cover~elev*utme*utmn*beers)
Residual standard error: 1.817 on 425 degrees of freedom
Multiple R-squared: 0.1197, Adjusted R-squared: 0.08864
F-statistic: 3.853 on 15 and 425 DF, p-value: 1.782e-06
AIC= 1795.9
(In these notes I’ll just show selected information from the output or summary for each model).
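(The AIC values quoted with each model were obtained separately, since summary() does not report AIC for lm fits; for example:
>AIC(lm1)
)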
This model was highly significant but explained only about 12% of the variation in yellow birch cover. Next, I used the step function to see if any terms could be removed:
>step(lm1)
The step function removed utmn and many of the interaction terms to give a “best” linear model:
>lmb= lm(cover~utme+elev+beers+utme:elev+elev:beers)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.616e+01 4.854e+00 -3.328 0.000948 ***
utme 7.593e-05 1.835e-05 4.139 4.20e-05 ***
elev 1.867e-02 4.022e-03 4.642 4.57e-06 ***
beers -6.086e-01 5.560e-01 -1.095 0.274249
utme:elev -6.885e-08 1.502e-08 -4.583 5.99e-06 ***
elev:beers 7.813e-04 4.174e-04 1.872 0.061917 .
Residual standard error: 1.802 on 435 degrees of freedom
Multiple R-squared: 0.1142, Adjusted R-squared: 0.104
F-statistic: 11.22 on 5 and 435 DF, p-value: 3.479e-10
AIC= 1778.694
This model explained roughly the same amount of variance as the full model, but because it included fewer variables, its AIC was lower by 17.2. The only variables with significant slopes were utme and elev (both positive but very small). The utme:elev interaction was negative and significant.
I did a Shapiro-Wilk test to check whether the residuals were normally distributed.
>shapiro.test(residuals(lmb))
The test returned a p-value of 0.1253 (> 0.05), indicating the residuals are approximately normally distributed. The linear model is therefore a reasonable fit.
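Another quick check (not part of the original output) is a normal Q-Q plot of the residuals, which should fall close to the reference line if the normality assumption holds:
>qqnorm(residuals(lmb))
>qqline(residuals(lmb))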
**GLMs**
Next, I tried some **GLMs**, using a Poisson error distribution and its associated (log) link function. Poisson distributions often work for variables whose mean and variance are similar, which was not the case for the yellow birch cover data (mean=4.9; var=3.63), but I tried it anyway for the sake of exploration.
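The mean and variance quoted above can be computed directly:
>mean(cover)
>var(cover)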
First, I fit a **full GLM** with all variables and their interactions.
>glm1= glm(cover~utme*utmn*elev*beers, family= poisson)
Null deviance: 357.60 on 440 degrees of freedom
Residual deviance: 317.57 on 425 degrees of freedom
AIC: 1835.9
The full Poisson model had a much higher AIC than the full linear model, indicating that the linear model is a much better fit. Next, I applied the step function to glm1 to get a “best” model.
>glmb= glm(cover~utme+elev+beers+utme:elev, family= poisson)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.168e+00 1.292e+00 -2.452 0.014187 *
utme 1.619e-05 4.809e-06 3.366 0.000764 ***
elev 4.162e-03 1.059e-03 3.928 8.57e-05 ***
beers 8.375e-02 3.451e-02 2.427 0.015226 *
utme:elev -1.460e-08 3.919e-09 -3.725 0.000195 ***
Null deviance: 357.60 on 440 degrees of freedom
Residual deviance: 321.48 on 436 degrees of freedom
AIC: 1817.8
As with the linear model, utmn was removed, along with all the interaction terms except utme:elev. Comparing the AICs of the “best” linear model and the “best” GLM clearly gives the upper hand to the linear model.
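The comparison can be made in a single call, which returns both AIC values side by side:
>AIC(lmb, glmb)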
**GAMs**
Next, I fit some GAMs. I used the default error distribution (Gaussian) and link function (identity).
First, I fit a model that included all the predictor variables with smooth terms but no interactions:
>gam1= gam(cover~s(elev) + s(utme) + s(utmn) + s(beers))
Approximate significance of smooth terms:
edf Ref.df F p-value
s(elev) 7.029 8.114 4.401 3.58e-05 ***
s(utme) 1.406 1.688 0.332 0.68084
s(utmn) 8.085 8.767 2.260 0.01874 *
s(beers) 1.000 1.000 12.021 0.00058 ***
R-sq.(adj) = 0.152 Deviance explained = 18.6%
GCV score = 3.2071 Scale est. = 3.0724 n = 441
AIC: 1766.634
This model explained much more deviance and had a lower AIC than any of the previous models, so smoothing definitely improved the fit. The smooth terms were significant for every variable except utme, so next I fit a model that included utme without smoothing, which made little difference.
>gam2= gam(cover~s(elev)+s(utmn)+s(beers)+utme)
R-sq.(adj) = 0.151 Deviance explained = 18.4%
AIC= 1766.850
Next, I created a “smoothed” version of the “best” linear model identified by the step function, using the tensor product smoothing function for interactions.
>gam.b= gam(cover~s(elev)+s(utme)+s(beers)+te(utme,elev)+te(elev,beers))
R-sq.(adj) = 0.176 Deviance explained = 22.4%
AIC: 1761.9
This was the best model so far in terms of explained deviance and AIC, indicating the interactions and smooth terms are important.
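The AICs of the GAMs fit so far can be compared in one call:
>AIC(gam1, gam2, gam.b)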
Next, I created a **full GAM**, which included all the variables, interactions between variables, and smooth terms for everything.
>gam.int= gam(cover~s(elev)+s(utme)+s(utmn)+s(beers)+te(elev,utme)+te(elev,utmn)+te(elev,beers)+te(utme,utmn)+te(utme,beers)+te(utmn,beers))
R-sq.(adj) = 0.308 Deviance explained = 40.2%
AIC= 1714.4
This model yielded a huge reduction in AIC and a huge increase in deviance explained (40%, compared to 11% for the best linear model). However, it is very hard to interpret, and many of the smooth terms were not significant. Figuring out which variables and interactions to keep was more difficult for GAMs than for linear models or GLMs. I tried the step function but couldn’t interpret the output, and trial and error didn’t really get me anywhere.
**Regression trees**
First, I created a regression tree using all the predictor variables and the default cutoff for minimum deviance explained by a split.
>rt1= tree(cover~elev+utme+utmn+beers)
Number of terminal nodes: 23
Residual mean deviance: 2.101 = 878 / 418
Pseudo-R^2 = 0.45
The tree this produced had 23 terminal nodes and explained more deviance than any model I’d tried so far (45%). All of the variables were used in making splits. Interestingly, the first split was based on utmn, which wasn’t included in the best linear model. Southern plots generally had high yellow birch cover, while northern plots had high or low cover depending on elev and other factors. This indicates utmn may be an important variable for explaining cover, but the relationship is not simple or linear.
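The splits are easiest to read from a plot of the tree, with the split rules written at each node:
>plot(rt1)
>text(rt1)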
I used the prune.tree and cv.tree functions to see if the tree was overfitted (too many nodes).
>plot(prune.tree(rt1))
>plot(cv.tree(rt1))
The prune.tree plot indicated that the tree was not overfitted (adding nodes kept decreasing residual deviance), and the cross-validation results were inconsistent. So next I decreased the deviance threshold (mindev) from its default of 0.01 to 0.0001 and created a new tree.
>rt2= tree(cover~elev+utme+beers, control=tree.control(441, mindev=0.0001))
Number of terminal nodes: 72
Residual mean deviance: 1.513 = 558.2 / 369
Pseudo-R^2 = 0.65
This monster tree with 72 nodes explained 65% of the null deviance, but the prune.tree plot showed steadily diminishing returns for added nodes. The large pseudo-R^2 compared to linear models, GLMs, and GAMs may mean that the predictor variables do explain a large amount of the variation in yellow birch cover, but the relationship is very complex.
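If a smaller tree were wanted, prune.tree can also cut a tree back to a chosen number of terminal nodes; for example, to prune rt2 back to the size of the first tree:
>rt2.pruned= prune.tree(rt2, best=23)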
Next, I used a **random forest** procedure to get an idea of which variables had the highest predictive power in regression trees.
>rf1= randomForest(cover~utme+elev+beers+utmn, importance=T, na.action=na.exclude)
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 2.473257
% Var explained: 31.57
%IncMSE IncNodePurity
utme 1.2841409 346.8779
elev 1.0605845 358.2866
beers 0.8506234 336.6292
utmn 1.6877773 374.0897
The results show that utmn is the most important variable (leaving it out would cause the greatest increase in MSE), followed by utme, elev, and beers. This is interesting because utmn was the one variable left out of the best linear model as determined by the step function.
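The importance measures can also be viewed graphically:
>varImpPlot(rf1)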
In conclusion, the models I fit to these data either did a poor job of explaining variation in yellow birch cover (linear models, GLMs) or were hard to interpret (GAMs, regression trees). No single predictor had a very strong relationship, linear or otherwise, with yellow birch cover. Regression trees did the best job of explaining deviance, and so would probably be most useful for prediction. I don’t think that any of these models gave me a good “feel” for what was actually going on with the data. Maybe when we get to hierarchical models...