# 10. Multiple linear regression models#

require(s20x)

Hide code cell output
Loading required package: s20x


## 10.2. Exploring relationships between the variables#

Let us first inspect the relationships between the numerical explanatory variables and the response variable.

The five variables are in columns 1,2,4,5 and 6 in the data frame Babies.df.

## Invoke the s20x library
library(s20x)
## Importing data into R
## Create the pairs plot of the five numeric variables
pairs20x(Babies.df[, c(1, 2, 4, 5, 6)]) plot(bwt ~ height,
data = Babies.df,
xlab = "Mother’s height (inch)", ylab = "Birth weight (oz)"
)
lines(lowess(Babies.df$height, Babies.df$bwt)) summary(lm(bwt ~ height, data = Babies.df))$r.squared cor(Babies.df$bwt, Babies.df$height)^2 # R 方是差异的平方，所以跟上边那个是一样的  0.0414953918045247 0.0414953918045247 Looking at the pairs plot again, we also see a somewhat weak relationship between bwt and mother’s weight. plot(bwt ~ weight, data = Babies.df, xlab = "Mother’s weight (lb)", ylab = "Birth weight (oz)" ) lines(lowess(Babies.df$weight, Babies.df$bwt), col = "red") 胎儿孕育时间与其出生体重之间存在更强的关系，这并不令人意外，因为孩子在母亲子宫内的时间越长，孩子就有更多的时间来获得营养和生长。但是，在某个特定的胎龄后，这种关系显然会变得平缓起来 - 有些人称其为“曲棍球杆形状的曲线”。 plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)") lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red") There does not seem to be any relationship between a mother’s age and her child’s bwt. plot(bwt ~ age, data = Babies.df, xlab = "Mother’s age", ylab = "Birth weight (oz)" ) lines(lowess(Babies.df$age, Babies.df$bwt), col = "red") Note: There seem to be some outlying data points(一些偏远的数据点) in these plots. There does not appear to be much of a relationship between the x variables, except between height and weight. pairs20x(Babies.df[, c(1, 3, 7)]) 让我们从理解“解释”的角度开始，因为它是最强大的关系之一。那些不典型的数据点已经用问号标记了。我们稍后会添加其他的解释变量。 plot(bwt ~ gestation, data = Babies.df, ylab = "Birth weight (oz)") lines(lowess(Babies.df$gestation, Babies.df$bwt), col = "red") text(c(152, 185), c(120, 115), "?", col = "red") abline(v = 294, lty = 3) They look extremely implausible as they have typical birth-weight but have a gestational age that is extremely low for these data. 他们看起来非常难以置信,因为他们典型的出生体重但有孕龄,对这些数据是极低的。 id <- (Babies.df$gestation < 200)
Babies.df[id, ]

A data.frame: 2 × 7
bwtgestationnot.first.bornageheightweightsmokes
<int><int><int><int><int><int><int>
239116148028661350
820110181027641330

Relationship between birth weight and gestational age...

For gestation $$\leq 294$$ days we’ll use the familiar simple linear regression model

$\text{E[bwt]}=\beta_0+\text{gestation}\times\beta_1$

We’d like to extend this model by adding an extra term so that the slope changes when gestation $$> 294$$. That is,

$\text{E[bwt]}=\beta_0+\text{gestation}\times\beta_1+\text{v}\times\beta_2$

where v is some suitable explanatory variable. What should v be?

• For gestation $$\leq 294$$ the extended model is just the simple linear regression model, so that means v = 0 when gestation ≤ 294.

• For gestation $$> 294$$ we need another slope effect for gestational age. In fact, we need $$v =$$ gestation $$- 294$$.

Let’s create the new explanatory v = 294 that is described gestation above. We’ll give it the name because it is the number of days ODdays that the baby is overdue.

Babies.df$ODdays <- ifelse( Babies.df$gestation < 294,
0,
Babies.df\$gestation - 294
)
head(Babies.df, 12) # Print first 12 lines of dataframe

A data.frame: 12 × 8
bwtgestationnot.first.bornageheightweightsmokesODdays
<int><int><int><int><int><int><int><dbl>
1120284027621000 0
2113282033641350 0
3128279028641151 0
4108282023671251 0
513628602562 930 0
6138244033621780 0
7132245023651400 0
8120289025621250 0
9143299030661361 5
1014035102768120057
11144282032641241 0
12141279023631281 0

## 10.3. Fitting the initial model#

bwt.fit <- lm(bwt ~ gestation + ODdays, data = Babies.df)
plot(bwt.fit, which = 1, add.smooth = FALSE)
normcheck(bwt.fit)
cooks20x(bwt.fit)   Let us refit with observation 239 removed.

bwt.fit2 <- lm(bwt ~ gestation + ODdays, data = Babies.df[-239, ])
cooks20x(bwt.fit2) We refit the model using the reduced data.

# This time we demonstrate using the subset argument to remove points
bwt.fit3 <- lm(bwt ~ gestation + ODdays,
data = Babies.df,
subset = -c(239, 820)
)
cooks20x(bwt.fit3)
plot(bwt.fit3, which = 1)  Let’s take a look at our fitted hockey stick model.

gestation.seq <- 201:360 # Explanatory values at which to get predictions
ODdays.seq <- ifelse(gestation.seq <= 294, 0, gestation.seq - 294)
fit.seq <- predict(bwt.fit3, new = data.frame(
gestation = gestation.seq,
ODdays = ODdays.seq
))
plot(bwt ~ gestation,
data = Babies.df[-c(239, 820), ],
ylab = "Birth weight (oz)"
)
lines(gestation.seq, fit.seq, col = "red")
abline(v = 294, lty = 2, col = "blue") summary(bwt.fit3)

Call:
lm(formula = bwt ~ gestation + ODdays, data = Babies.df, subset = -c(239,
820))

Residuals:
Min      1Q  Median      3Q     Max
-50.664 -10.993  -0.308   9.795  52.336

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -66.95336   10.42810   -6.42 1.97e-10 ***
gestation     0.67124    0.03757   17.87  < 2e-16 ***
ODdays       -0.90783    0.11745   -7.73 2.31e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 16.23 on 1169 degrees of freedom
Multiple R-squared:  0.2188,	Adjusted R-squared:  0.2174
F-statistic: 163.7 on 2 and 1169 DF,  p-value: < 2.2e-16


The fitted model is:

$\text{E[bwt]}=-66.95+0.67\times\text{gestation}-0.91\times\text{ODdays}$

## 10.4. Multiple linear regression model: Adding more terms to the model and the peril of multi-collinearity#

bwt.fit4 = lm(bwt ~ gestation + ODdays + height,
data = Babies.df,
subset = -c(239, 820)
)
plot(bwt.fit4, which = 1) All seems okay. Let us make sure that this makes sense in terms of output.

summary(bwt.fit4)

Call:
lm(formula = bwt ~ gestation + ODdays + height, data = Babies.df,
subset = -c(239, 820))

Residuals:
Min      1Q  Median      3Q     Max
-53.999 -10.393  -0.050   9.772  51.514

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -139.20571   15.05961  -9.244  < 2e-16 ***
gestation      0.65219    0.03703  17.613  < 2e-16 ***
ODdays        -0.89039    0.11543  -7.714 2.61e-14 ***
height         1.21083    0.18495   6.547 8.79e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.94 on 1168 degrees of freedom
Multiple R-squared:  0.2464,	Adjusted R-squared:  0.2445
F-statistic: 127.3 on 3 and 1168 DF,  p-value: < 2.2e-16


Let us add weight to the model. We’re going to save some typing and use the update function to update our model.

bwt.fit5 <- update(bwt.fit4, ~ . + weight)
summary(bwt.fit5)

Call:
lm(formula = bwt ~ gestation + ODdays + height + weight, data = Babies.df,
subset = -c(239, 820))

Residuals:
Min      1Q  Median      3Q     Max
-53.053 -10.540   0.121  10.076  47.746

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -131.68169   15.14974  -8.692  < 2e-16 ***
gestation      0.65624    0.03688  17.795  < 2e-16 ***
ODdays        -0.90868    0.11502  -7.900 6.41e-15 ***
height         0.90486    0.20453   4.424 1.06e-05 ***
weight         0.08535    0.02485   3.434 0.000615 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.87 on 1167 degrees of freedom
Multiple R-squared:  0.254,	Adjusted R-squared:  0.2514
F-statistic: 99.32 on 4 and 1167 DF,  p-value: < 2.2e-16
`