8. Linear models with both numeric and factor explanatory variables#

本节需要的包:

require(s20x)
Hide code cell output
Loading required package: s20x

8.1. Example: Using both test score and attendance to explain exam score#

示例:使用测试成绩和出勤率解释考试分数

## Invoke the s20x library
library(s20x)
## Importing data into R
Stats20x.df <- read.table("../data/STATS20x.txt", header = TRUE)
Stats20x.df$Attend <- as.factor(Stats20x.df$Attend)
## Plot blue "Y" for "Yes" (regular attenders), and red "N" for "No"
plot(Exam ~ Test,
    data = Stats20x.df,
    pch = substr(Attend, 1, 1), # "Y" or "N"
    col = ifelse(Attend == "Yes", "blue", "red")
)
_images/2b34c0e0745770014f4ba528fe3abefd276830a22f0c33ab000d2dae32fd7b9d.png

Here is the plot for the regular attenders and the non-attenders.

Attendees.df <- subset(Stats20x.df, Attend == "Yes")
plot(Exam ~ Test,
    data = Attendees.df,
    xlim = c(0, 20),
    ylim = c(0, 100),
    pch = "Y", cex = 0.7
)
Absentees.df <- subset(Stats20x.df, Attend == "No")
plot(Exam ~ Test,
    data = Absentees.df,
    xlim = c(0, 20),
    ylim = c(0, 100),
    pch = "N",
    cex = 0.7
)
_images/e0e13788f25b4bc7a3ad9e975300e64583f5764b577a285def34ae80d2a0069c.png _images/36cce64692f42711b43132500f34252c8cf4c67a18c5bd36538a6ccfc7b1d6ed.png

Also, there seems to be some non-attenders who do well in the test and exam so we could (and will) see whether we should include these people. They are identified in red (stars) with the code below. 此外,似乎有一些没有参加考试的人在考试和考试中表现很好,所以我们可以(也将)看看是否应该包括这些人。它们用红色(星号)标识,代码如下。

Absentees.df <- subset(Stats20x.df, Attend == "No")
plot(Exam ~ Test,
    data = Absentees.df, xlim = c(0, 20), ylim = c(0, 100),
    cex = 0.7, col = ifelse(Absentees.df$Test <= 16, "black", "red"),
    pch = ifelse(Absentees.df$Test <= 16, 1, 8)
)
_images/3203470d572f5102250b97819bdb6fc3b92dd441986a2d50d2a8d911b97ae77d.png

8.2. Fitting the linear model#

看起来我们需要符合两个不同取决于学生是否经常出席者。

We will call our indicator variable for greater convenience of notation: 我们将为了方便表示法而称之为指示变量:

## Boolean statement if Attend ="Yes" (TRUE) D=1, othwerwise 0 (FALSE);
Stats20x.df$D <- as.numeric(Stats20x.df$Attend == "Yes")
table(Stats20x.df$Attend, Stats20x.df$D) ## Check it is okay
     
        0   1
  No   46   0
  Yes   0 100

相当于说,No 为 0,Yes 为 1。

Our straight line model for the non-attenders (i.e., D = 0) students will be:

\[ Exam = β_0 + β_1 \times Test + ε\ where\ ε \sim N(0,σ^2) \]

For the non-attenders (D = 0) the slope(斜率) is:

\[ β_1 + D \times β_2 = β_1 \]

For the attenders (D = 1) the slope is:

\[ β_1 + D \times β_3 = β_1 + β_3 \]

So, our model is:

\[\begin{split} \begin{aligned}\text{Exam}&=\beta_0+\beta_2\times D+(\beta_1+\beta_3\times D)\text{Test}+\varepsilon\\ &=(\beta_0+\beta_2\times D)+(\beta_1+\beta_3\times D)\text{Test}+\varepsilon\\ &=\beta_0+\beta_1\times\text{Test}+\beta_2\times D+\beta_3\times D\times\text{Test}+\varepsilon\end{aligned} \end{split}\]

建模:

Stats20x.df$TestD <- with(Stats20x.df, {
    TestD <- D * Test
})
TestAttend.fit <- lm(Exam ~ Test + D + TestD, data = Stats20x.df)

Assumption checks 三步走:

plot(TestAttend.fit, which = 1)
normcheck(TestAttend.fit)
cooks20x(TestAttend.fit)
_images/64edc707f85613d8d088436d5b31a973e9b028f15d4f360a6211439d1020ddf2.png _images/d99112ec7091d24e2ed370f241a32a00f2193774fc0555509990cd2a102604e6.png _images/03efd117fc5dd482db733e26d85a75964b105795b2972b463b32c8170aa10b79.png

We can now trust the fitted . The summary output is:

summary(TestAttend.fit)
Call:
lm(formula = Exam ~ Test + D + TestD, data = Stats20x.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.3155  -6.5139   0.4383   7.3166  30.9383 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  14.4467     4.9443   2.922  0.00405 ** 
Test          2.7496     0.4603   5.973 1.78e-08 ***
D            -4.2582     6.3723  -0.668  0.50506    
TestD         1.1380     0.5577   2.040  0.04316 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.41 on 142 degrees of freedom
Multiple R-squared:  0.6347,	Adjusted R-squared:  0.627 
F-statistic: 82.25 on 3 and 142 DF,  p-value: < 2.2e-16

Note that the above Executive Summary is missing the confidence interval for the effect of test mark on attenders. To obtain this CI we need to change attenders to the baseline level of . 改变了基准线

让我们仔细看看我们刚刚安装模式。我们将会产生一个单独的情节为每个出席者。

coef(TestAttend.fit)[1:2]
# 分别为截距 和 Test 的系数
(Intercept)
14.4467499989379
Test
2.74956844608019

将我们的模型放到图上:

## Plot these data all together
b <- coef(TestAttend.fit) # easier to work with these terms
plot(Exam ~ Test,
    data = Stats20x.df,
    pch = substr(Attend, 1, 1), # "Y" or "N"
    cex = 0.7, # 缩放,默认为 1
    xlim = c(0, 20) # x 轴范围
)
## Red for "No" and blue for "Yes".
abline(b[1:2], lty = 2, col = "red")
# No 群体:beta0 做截距,beta1 做斜率
abline(b[1] + b[3], b[2] + b[4], lty = 2, col = "blue")
# Yes群体:beta0 + beta2 做截距,beta1 + beta3 做斜率
_images/020def9b76a62f8e07068c426d8d7162e2af10b035bd468fc6458e424c694e22.png

All that hard work we did with constructing and can be avoided D TestD since will automatically do this for us.

We were interested to see whether the effect of interacts with the Test students variable. Using we simply specify to Attend lm Test * Attend fit the model with interaction. That is,

TestAttend.fit2 = lm(Exam ~ Test * Attend, data = Stats20x.df)
summary(TestAttend.fit2)
Call:
lm(formula = Exam ~ Test * Attend, data = Stats20x.df)

Residuals:
     Min       1Q   Median       3Q      Max 
-30.3155  -6.5139   0.4383   7.3166  30.9383 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     14.4467     4.9443   2.922  0.00405 ** 
Test             2.7496     0.4603   5.973 1.78e-08 ***
AttendYes       -4.2582     6.3723  -0.668  0.50506    
Test:AttendYes   1.1380     0.5577   2.040  0.04316 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.41 on 142 degrees of freedom
Multiple R-squared:  0.6347,	Adjusted R-squared:  0.627 
F-statistic: 82.25 on 3 and 142 DF,  p-value: < 2.2e-16

We have the same outputs, but with slightly different names.

注意: Test * Attend 是速记符号。你可以用下边的写法更明确关于模型中各个方面:

TestAttend.fit2 <- lm(Exam ~ Test + Attend + Test:Attend, data = Stats20x.df)

8.3. Interpretting the fitted model#

We see that our intuition was correct. That is, the slope for of Test attenders is greater that for non-attenders. This is because the estimate of the difference in these slopes TestD.

coef(TestAttend.fit2)[4]
Test:AttendYes: 1.13799041829866

Confidence intervals may be needed for the coefficients:

confint(TestAttend.fit2)
A matrix: 4 × 2 of type dbl
2.5 %97.5 %
(Intercept) 4.6728751124.220625
Test 1.83956971 3.659567
AttendYes-16.85506294 8.338572
Test:AttendYes 0.03547053 2.240510

注:Statistical significance (at the 5% level) of a coefficient is NOT equivalent to the (95%) confidence interval containing zero. 统计学意义(5%)的系数并不等同于(95%)置信区间包含零。

Some predictions:

predTestAttend.df <- data.frame(
    Test = c(0, 10, 10, 20),
    Attend = factor(c("No", "No", "Yes", "Yes"))
)
predTestAttend.df
A data.frame: 4 × 2
TestAttend
<dbl><fct>
0No
10No
10Yes
20Yes

Let us estimate the expected exam scores for these values of test score and attendance 让我们这些值的估计预期的考试分数测试成绩和出勤:

predict(TestAttend.fit2, predTestAttend.df, interval = "confidence") # 区间估计
predict(TestAttend.fit2, predTestAttend.df, interval = "prediction") # 点估计
A matrix: 4 × 3 of type dbl
fitlwrupr
114.44675 4.67287524.22062
241.9424338.61637645.26849
349.0640946.41219451.71599
487.9396882.61010093.26926
A matrix: 4 × 3 of type dbl
fitlwrupr
114.44675-10.13028 39.02378
241.94243 19.14848 64.73639
349.06409 26.35871 71.76947
487.93968 64.76845111.11092

This is not the best model for predicting individual student exam scores as the intervals are too wide and in some case are meaningless (at the extremes of Test). 这不是最好的模型预测个体学生的考试成绩作为间隔太宽,在某些情况下是毫无意义的(在极端的 Test 成绩下)。

请注意,上述的执行摘要缺少有关测试成绩对参与者影响的置信区间。要获得此置信区间,我们需要将参与者更改为基线水平,即 Atten。

Stats20x.df$Attend2 <- relevel(Stats20x.df$Attend, ref = "Yes")
TestAttend.fit2b <- lm(Exam ~ Test * Attend2, data = Stats20x.df)
coef(summary(TestAttend.fit2b))
confint(TestAttend.fit2b)
A matrix: 4 × 4 of type dbl
EstimateStd. Errort valuePr(>|t|)
(Intercept)10.1885044.0199956 2.53445661.234648e-02
Test 3.8875590.314879212.34618983.020895e-24
Attend2No 4.2582466.3722922 0.66824395.050626e-01
Test:Attend2No-1.1379900.5577265-2.04040944.316270e-02
A matrix: 4 × 2 of type dbl
2.5 %97.5 %
(Intercept) 2.24173318.13527583
Test 3.265102 4.51001561
Attend2No-8.33857216.85506294
Test:Attend2No-2.240510-0.03547053

We estimate that each additional test mark will increase the expected exam mark of an attender by 3.3 to 4.5. 我们估计,每增加一个测试分数,参与者的预期考试成绩将增加3.3到4.5分。

8.4. Assessing influence of the atypical students#

在分析的探索性阶段,我们确定了三名可能异常的学生。回想一下,这些是那3个在测试中得分大于16分但没有参加考试的学生。

如果我们有理由认为这些学生“不典型”(我们有吗?),那么在分析时不考虑他们可能是合理的。

## Remove atypical points - Note that ! means ’not’
Subset.df <- subset(Stats20x.df, !(Test > 16 & Attend == "No"))
TestAttend.fit3 <- lm(Exam ~ Test * Attend, data = Subset.df)
summary(TestAttend.fit3)
Call:
lm(formula = Exam ~ Test * Attend, data = Subset.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.059  -6.817   0.439   6.938  32.008 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     22.6522     5.2776   4.292  3.3e-05 ***
Test             1.7590     0.5213   3.374 0.000960 ***
AttendYes      -12.4637     6.5505  -1.903 0.059146 .  
Test:AttendYes   2.1285     0.6034   3.527 0.000569 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.01 on 139 degrees of freedom
Multiple R-squared:  0.6495,	Adjusted R-squared:  0.6419 
F-statistic: 85.86 on 3 and 139 DF,  p-value: < 2.2e-16

R code to generate the plot of the full data with all three of the fitted Rlines:

## Plot these data all together
plot(Exam ~ Test, data = Subset.df, pch = substr(Attend, 1, 1), cex = 0.7)

## Remember that we’ve defined b in Slide 26
## Each abline() will have a different colour
abline(b[1:2], lty = 2, col = "red")
abline(b[1] + b[3], b[2] + b[4], lty = 2, col = "blue")

## The fitted line without the 3 atypical points
b2 <- coef(TestAttend.fit3) ## Easier to work with these terms
abline(b2[1:2], lty = 2, col = "green")

## Add a legend to help us differentiate between the lines for non-attenders
legend("topleft",
    legend = c("All non-attenders", "Without atypical non-attenders"),
    lty = 2,
    col = c("red", "green"),
    bty = "n"
)
_images/6504a7343d26871b2c8d671d859d84d624a7aa128c591aa7becba3e1f47735c9.png