4. Fitting curves with the linear model#

本节需要的包:

require(s20x)
Hide code cell output
载入需要的程辑包:s20x

4.1. Identifying a curved relationship 初步探究曲线关系#

## Load the s20x library into our R session
library(s20x)
## Importing data into R
Stats20x.df <- read.table("../data/STATS20x.txt", header = T)
## Examine the data
plot(Exam ~ Assign, data = Stats20x.df)
_images/646abebe9d6b2d94dda1f6876f41ce86b402502e014a54791fe952dc58c08262.png

Hmmm, not quite a straight line – could be some curvature. Maybe will paint a clearer picture. 不是一条很直的线--可能是一些曲率。也许会描绘出一幅更清晰的图景。

trendscatter(Exam ~ Assign, data = Stats20x.df)
_images/00843bcb006a8a8f52b4cfc4c1f242f85c73f9d7253852579c8e8af9309e4a3b.png

Let’s fit a simple linear model to these data and see if it works out or not.

examassign.fit <- lm(Exam ~ Assign, data = Stats20x.df)
plot(examassign.fit, which = 1)
_images/f6c535a6bffdb3c5da65e0daad795e8ad7a15453f1d98765b43958b2c716d42d.png

The assumption of identical distribution with expected value of 0 looks to be questionable here. There tend to be more negative residuals in the middle, but more positive residuals at the extremes of the fitted values. Potential solution – add a quadratic (squared term) for.

假设相同的分布与预期值 0 看起来可疑的。会有更多负面的残差在中间,但更积极的残差的极端值。潜在的解决方案应该是:添加一个二次项(平方项)。

4.2. Fitting a quadratic model 拟合二次模型#

The standard notation for a quadratic curve is:

\[ y = a x^2 + b x + c \]

Here we will use different notation: \(\beta_0 = c\), \(\beta_1 = b\) and \(\beta_2 = a\) and use the quadratic curve to describe the expected value of our dependent variable \(y\). That is, we will use the following notation:

\[ E[Y|x] = \beta_0 + \beta_1 x + \beta_2 x^2 \]

If \(\beta_2 > 0\), then the quadratic has slope that increases with increasing x(斜率随着 x 增大而增大). If \(\beta_2 < 0\), then the quadratic has slope that decreases with increasing x. If \(\beta_2 = 0\), then the quadratic(该“二次曲线”) has a constant slope(倾斜直线的外观).

让我们回到之前的学生数据集。我们将使用一个新的变量 \(x^2\) 来拟合一个二次模型:

examassign.fit2 <- lm(Exam ~ Assign + I(Assign^2), data = Stats20x.df)
plot(examassign.fit2, which = 1)
_images/4cf6bb26ef98a9af247edd200fad47e957ae8449a9acf92a5b9fc4e253e2d03e.png

That is looking much better.

接下来我们会进行“三步走”中的后两步:

normcheck(examassign.fit2)
cooks20x(examassign.fit2)
_images/0aae3fb8b150fce3e474dcfc7655779bb47b06d00cba2d1e2dad2a38a083b9b5.png _images/5f9529e2319bfce109347216e153cee476c700eb2d43d56d508776d6b627b3c1.png

符合正态分布、方差齐性。我们可以尝试对照一下原来的模型和我们的新模型:

plot(Exam ~ Assign, data = Stats20x.df)
x <- 0:20 # Assignment values at which to predict exam mark
## Plot model 1
lines(x, predict(examassign.fit, data.frame(Assign = x)), col = "red")
## Plot model 2
lines(x, predict(examassign.fit2, data.frame(Assign = x)), col = "blue")
_images/07c97c984d6c26081c08eafa0ceb72ac180d847d83f09b7380e6abb9cc1dc421.png
summary(examassign.fit2)
Call:
lm(formula = Exam ~ Assign + I(Assign^2), data = Stats20x.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.541  -9.149   1.273   9.087  41.116 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.41396    5.99081   4.743 5.05e-06 ***
Assign      -0.68172    1.07242  -0.636 0.525999    
I(Assign^2)  0.16102    0.04545   3.542 0.000536 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 12.65 on 143 degrees of freedom
Multiple R-squared:  0.5477,	Adjusted R-squared:  0.5414 
F-statistic: 86.59 on 2 and 143 DF,  p-value: < 2.2e-16

Note that the coefficient \(β_2 > 0\) associated with the term \(I(Assign)^2\) indicates an increase that starts slowly and ‘accelerates’(加速) as Assign increases.