Linear models with a categorical (factor) explanatory variable

5. Linear models with a categorical (factor) explanatory variable#

本节需要的包：

require(s20x)

5.1. Using categorical variables as explanatory variables by using indicator variables#

使用指标变量将分类变量用作解释变量

library(s20x)
## Importing data into R
Stats20x.df <- read.table("../data/STATS20x.txt", header = T)
## Change Attend from a character variable to a factor variable
Stats20x.df$Attend <- as.factor(Stats20x.df$Attend)
## Examine the data
Stats20x.df$Attend[1:20]

Yes
Yes
Yes
Yes
No
Yes
Yes
No
Yes
Yes
No
Yes
No
No
No
Yes
Yes
No
Yes
Yes

Levels:

'No'
'Yes'

简要分析数据集，确保有你需要的可能的关系：

summaryStats(Stats20x.df$Exam, Stats20x.df$Attend)
plot(Exam ~ Attend, data = Stats20x.df)

    Sample Size     Mean Median  Std Dev Midspread
No           46 42.21739   40.5 16.34206     20.50
Yes         100 57.78000   58.0 17.67757     28.25

_images/3f20304a48fe187577a6ddd5cf8d6f63e52cd61170f423440489977f06adb4b9.png

缺勤的确会让学生成绩变低，在数据分布上的确有一定的关系。

为了在后面进行更好的分析，我们将缺勤的 Yes 和 No 转换为 1 和 0：

# Make a new variable Attend2 which is 1 if Attend = "Yes" and 0 otherwise

# Note how we use two equal signs, ==, to test equality
Stats20x.df$Attend2 <- as.numeric(Stats20x.df$Attend == "Yes")
with(Stats20x.df, table(Attend, Attend2))

      Attend2
Attend   0   1
   No   46   0
   Yes   0 100

trendscatter(Exam ~ Attend2, data = Stats20x.df)

_images/7f950a308c40534da735c02ebe7b2410a7c2edb55fdc06b4ddc0956b04c6abbf.png

The linear model for the expected value of is

\[ E[Exam|Attend2] = \beta_0 + \beta_1 Attend2 \]

其中 \(\beta_0\) 是截距，即所有缺勤的均值；\(\beta_1\) 是考试成绩和缺勤的关系，由缺勤和出勤的成绩关系共同决定。

examattend2.fit <- lm(Exam ~ Attend2, data = Stats20x.df)
summary(examattend2.fit)

Call:
lm(formula = Exam ~ Attend2, data = Stats20x.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.780 -13.108  -0.217  12.642  46.783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.217      2.547  16.578  < 2e-16 ***
Attend2       15.563      3.077   5.058 1.27e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.27 on 144 degrees of freedom
Multiple R-squared:  0.1508,	Adjusted R-squared:  0.145 
F-statistic: 25.58 on 1 and 144 DF,  p-value: 1.271e-06

上述拟合代表 x 为 Attend2（0 和 1）时，y 的期望值，即考试成绩的期望值。

但注意事实上，直接使用 lm() 函数进行拟合，也能得出正确的结果，因为 lm() 函数会自动将分类变量转换为指标变量（AttendYes）：

examattend.fit <- lm(Exam ~ Attend, data = Stats20x.df)
summary(examattend.fit)

Call:
lm(formula = Exam ~ Attend, data = Stats20x.df)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.780 -13.108  -0.217  12.642  46.783 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   42.217      2.547  16.578  < 2e-16 ***
AttendYes     15.563      3.077   5.058 1.27e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.27 on 144 degrees of freedom
Multiple R-squared:  0.1508,	Adjusted R-squared:  0.145 
F-statistic: 25.58 on 1 and 144 DF,  p-value: 1.271e-06

让我们将拟合模型可视化。在这里，我们将使用虚拟变量拟合我们的模型得到的"最佳"估计直线绘制出来。

plot(Exam ~ Attend2, data = Stats20x.df)
## Add the lm estimated line to this plot where a=intercept, b=slope
abline(coef(examattend.fit), lty = 2, col = "blue")

_images/acdb2f014751d0cbf2dd87a740c8fe7284fb66fab5386642c2f840f2f2c53fc5.png

模型建立好了，接下来就是我们经典的模型检验三步走了：

残差均值接近于 0
残差满足正态分布
没有或排除了异常点

plot(examattend.fit, which = 1)
normcheck(examattend.fit)
cooks20x(examattend.fit)

_images/e11b6271d2a93e19e475e10b88817344a34149f198715f90ea8918485644a7ad.png

_images/05262ecbd39f48d6715ddbcadfcdb51b4b40460427cb990f04898ef5fc82c1ce.png

_images/c416250daecfa871c48b1af50e9ca091ab80f8a43c968e75bdac5ce9748573c9.png

## Create data frame of values of interest: Attend=="Yes" and "No"
## Make sure that the names of vars are exactly the same as in the data frame
preds.df <- data.frame(Attend = c("No", "Yes"))
predict(examattend.fit, preds.df, interval = "confidence")
predict(examattend.fit, preds.df, interval = "prediction")

A matrix: 2 × 3 of type dbl
	fit	lwr	upr
1	42.21739	37.18401	47.25077
2	57.78000	54.36619	61.19381

A matrix: 2 × 3 of type dbl
	fit	lwr	upr
1	42.21739	7.710259	76.72452
2	57.78000	23.471673	92.08833

再次强调：“confidence”是代表均值预测范围，而“prediction”是代表个体预测范围。

Linear models with a categorical (factor) explanatory variable

Contents

5. Linear models with a categorical (factor) explanatory variable#

5.1. Using categorical variables as explanatory variables by using indicator variables#