Regression modeling with a two-level categorical variable
Suppose that Z is a two-level categorical variable such that Z = A or B.
Define
$$X =
\begin{cases}
1, & \text{if } Z = A \\
0, & \text{otherwise}
\end{cases}$$
Then we can use the following regression model, $$ Y = \beta_0 + \beta_1X + \epsilon$$
- $\beta_0 = \mu_B$ (called the baseline)
- $\beta_1 = \mu_A - \mu_B$
- Consequently, $\beta_0 + \beta_1 = \mu_A$
Since $E(Y) = \beta_0 + \beta_1X$,
if Z = A, X = 1, $E(Y) = \beta_0 + \beta_1 = \mu_A$
if Z = B, X = 0, $E(Y) = \beta_0 = \mu_B$
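The identities above can be checked numerically. The following is a minimal sketch, assuming a small made-up data set (the values of `z` and `y` are hypothetical): it codes $X$ as defined above, computes the usual closed-form least squares estimates, and compares them with the sample group means.

```python
from statistics import mean

# Hypothetical data: group labels Z and responses Y (made up for illustration).
z = ["A", "A", "A", "B", "B", "B", "B"]
y = [5.0, 6.0, 7.0, 2.0, 3.0, 4.0, 3.0]

# Dummy coding: X = 1 if Z = A, 0 otherwise (B is the baseline).
x = [1.0 if level == "A" else 0.0 for level in z]

# Closed-form simple linear regression estimates.
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Sample means of each group.
mean_a = mean(yi for yi, level in zip(y, z) if level == "A")
mean_b = mean(yi for yi, level in zip(y, z) if level == "B")

print(b0, mean_b)           # intercept vs. sample mean of B
print(b1, mean_a - mean_b)  # slope vs. difference of sample means
```

Up to floating-point rounding, the estimated intercept matches the sample mean of group B and the estimated slope matches the difference of the two sample means.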
Suppose that Z is a three-level categorical variable such that Z = A, B or C.
Define
$$X_1 =
\begin{cases}
1, & \text{if } Z = A \\
0, & \text{otherwise}
\end{cases}$$
$$X_2 =
\begin{cases}
1, & \text{if } Z = B \\
0, & \text{otherwise}
\end{cases}$$
Then we can use the following regression model, $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon$$
- $\beta_0 = \mu_C$ (called the baseline)
- $\beta_1 = \mu_A - \mu_C$
- $\beta_2 = \mu_B - \mu_C$
Since $E(Y) = \beta_0 + \beta_1X_1 + \beta_2X_2$,
if Z = A, (1, 0), $E(Y) = \beta_0 + \beta_1 = \mu_A$
if Z = B, (0, 1), $E(Y) = \beta_0 + \beta_2 = \mu_B$
if Z = C, (0, 0), $E(Y) = \beta_0 = \mu_C$
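As a rough check of the three-level coding, the sketch below uses made-up data (the values of `z` and `y` are hypothetical), builds coefficients from the sample group means as in the list above, and verifies that they satisfy the normal equations $X^\top(y - Xb) = 0$, which characterize a least squares solution.

```python
from statistics import mean

# Hypothetical data (made up for illustration).
z = ["A", "A", "B", "B", "B", "C", "C", "C", "C"]
y = [10.0, 12.0, 7.0, 8.0, 9.0, 3.0, 4.0, 5.0, 4.0]

# Design rows (1, X1, X2) with C as the baseline level.
rows = [(1.0, float(level == "A"), float(level == "B")) for level in z]

# Candidate coefficients built from the sample group means.
m = {g: mean(yi for yi, level in zip(y, z) if level == g) for g in "ABC"}
beta = (m["C"], m["A"] - m["C"], m["B"] - m["C"])

# Residuals of the candidate fit.
resid = [yi - sum(b * xj for b, xj in zip(beta, row)) for yi, row in zip(y, rows)]

# Left-hand side of the normal equations, one entry per column of the design.
normal_eq = [sum(row[j] * r for row, r in zip(rows, resid)) for j in range(3)]

print(beta)
print(normal_eq)  # each entry should be numerically zero
```

Because the residuals sum to zero within every group, each column of the design is orthogonal to the residual vector, so the group-mean coefficients are the least squares estimates.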
Two categorical variables
Consider two categorical variables: one with 3 levels ($F_1, F_2, F_3$) and the other with 2 levels ($B_1, B_2$).
Then, the model can be written as $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon,$$
where
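$$X_1 =
\begin{cases}
1, & \text{if } F_2 \\
0, & \text{otherwise}
\end{cases}$$
$$X_2 =
\begin{cases}
1, & \text{if } F_3 \\
0, & \text{otherwise}
\end{cases}$$
$$X_3 =
\begin{cases}
1, & \text{if } B_2 \\
0, & \text{otherwise}
\end{cases}$$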
Note that $F_1$ and $B_1$ are the base levels. Here $\mu_{ij}$ denotes the mean of $Y$ at the combination of $F_i$ and $B_j$.
- $\beta_0 = \mu_{11}$ (mean of combination of base levels)
- $\beta_1 = \mu_{2j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
- $\beta_2 = \mu_{3j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
- $\beta_3 = \mu_{i2} - \mu_{i1}$ for any level $F_i$ (i = 1, 2, 3)
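To see why these differences do not depend on the level of the other variable, the sketch below evaluates $E(Y)$ for all six cells under the additive model. The coefficient values are hypothetical and chosen only for illustration.

```python
# Hypothetical coefficients (beta_0, ..., beta_3), chosen only for illustration.
beta = (10.0, 2.0, 5.0, -3.0)

def cell_mean(i, j):
    """E(Y) at level F_i (i = 1, 2, 3) and B_j (j = 1, 2) under the additive model."""
    x1, x2, x3 = float(i == 2), float(i == 3), float(j == 2)
    return beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3

mu = {(i, j): cell_mean(i, j) for i in (1, 2, 3) for j in (1, 2)}

# Differences between F levels, computed separately at each B_j.
for j in (1, 2):
    print(j, mu[2, j] - mu[1, j], mu[3, j] - mu[1, j])

# Difference between B levels, computed separately at each F_i.
for i in (1, 2, 3):
    print(i, mu[i, 2] - mu[i, 1])
```

The first loop prints the same pair ($\beta_1$, $\beta_2$) for both values of $j$, and the second prints $\beta_3$ for every $i$, which is exactly the "for any level" statement in the list above.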
Interaction model with two categorical variables
Consider an extended model as follows:
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5 + \epsilon,$$
where
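$$X_1 =
\begin{cases}
1, & \text{if } F_2 \\
0, & \text{otherwise}
\end{cases}$$
$$X_2 =
\begin{cases}
1, & \text{if } F_3 \\
0, & \text{otherwise}
\end{cases}$$
$$X_3 =
\begin{cases}
1, & \text{if } B_2 \\
0, & \text{otherwise}
\end{cases}$$
$$X_4 = X_1X_3 \quad \text{and} \quad X_5 = X_2X_3$$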
Note that $F_1$ and $B_1$ are the base levels.
- $\beta_0 = \mu_{11}$ (mean of combination of base levels)
- $\beta_1 = \mu_{21} - \mu_{11}$ (at level $B_1$ only)
- $\beta_2 = \mu_{31} - \mu_{11}$ (at level $B_1$ only)
- $\beta_3 = \mu_{12} - \mu_{11}$ (at level $F_1$ only)
- $\beta_4 = (\mu_{22} - \mu_{12}) - (\mu_{21} - \mu_{11})$
- $\beta_5 = (\mu_{32} - \mu_{12}) - (\mu_{31} - \mu_{11})$
For example, at $F_2$ and $B_1$ we have $X_1 = 1$ and $X_2 = X_3 = X_4 = X_5 = 0$, so $\mu_{21} = \beta_0 + \beta_1$. Since $\beta_0 = \mu_{11}$, we can write $\beta_1 = \mu_{21} - \mu_{11}$.
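The same substitution works for every cell. As a minimal sketch, assuming a made-up table of cell means (the values in `mu` are hypothetical), the code below builds $\beta_0, \dots, \beta_5$ from the formulas in the list above and confirms that the interaction model reproduces all six cell means.

```python
# Hypothetical cell means mu[i, j] for F_i (i = 1, 2, 3) and B_j (j = 1, 2).
mu = {(1, 1): 10.0, (2, 1): 12.0, (3, 1): 15.0,
      (1, 2): 7.0, (2, 2): 13.0, (3, 2): 11.0}

# Coefficients expressed through the cell means.
beta = [
    mu[1, 1],
    mu[2, 1] - mu[1, 1],
    mu[3, 1] - mu[1, 1],
    mu[1, 2] - mu[1, 1],
    (mu[2, 2] - mu[1, 2]) - (mu[2, 1] - mu[1, 1]),
    (mu[3, 2] - mu[1, 2]) - (mu[3, 1] - mu[1, 1]),
]

def expected_y(i, j):
    """E(Y) at F_i and B_j under the interaction model."""
    x1, x2, x3 = float(i == 2), float(i == 3), float(j == 2)
    x = (1.0, x1, x2, x3, x1 * x3, x2 * x3)
    return sum(b * xk for b, xk in zip(beta, x))

for cell, value in mu.items():
    print(cell, value, expected_y(*cell))
```

With six parameters and six cells the model is saturated, so any pattern of cell means can be represented exactly. A nonzero $\beta_4$ or $\beta_5$ means the effect of $F$ differs between $B_1$ and $B_2$.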
Example (Two categorical variables with interaction)
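Looking at this output, we can ask the following questions.
- Is the interaction significant?
- $H_0: \beta_4 = \beta_5 = 0$ vs. $H_1: \beta_4 \neq 0 \text{ or } \beta_5 \neq 0$
- In SAS you can add a test option to check this directly, but it can also be inferred from the t-tests.
- To compare against the model without interaction, compare $R_a^2$ (the adjusted $R^2$).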
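The comparison can be sketched outside SAS as well. The example below is an illustration under stated assumptions, not the output of the original example: the data are made up, `ols_sse` is a hypothetical helper that solves the normal equations in pure Python, and the test shown is the standard partial F statistic for nested models. The p-value is omitted because it needs an F distribution from a statistics library.

```python
def ols_sse(rows, y):
    """Residual sum of squares of a least squares fit (normal equations, Gauss-Jordan)."""
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(a[k][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for k in range(p):
            if k != c:
                f = a[k][c] / a[c][c]
                a[k] = [u - f * v for u, v in zip(a[k], a[c])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

# Hypothetical observations: (F level, B level, response), two per cell.
data = [(1, 1, 9.0), (1, 1, 11.0), (2, 1, 11.0), (2, 1, 13.0),
        (3, 1, 14.0), (3, 1, 16.0), (1, 2, 6.0), (1, 2, 8.0),
        (2, 2, 12.0), (2, 2, 14.0), (3, 2, 10.0), (3, 2, 12.0)]
y = [obs[2] for obs in data]
n = len(y)

def design(i, j, interaction):
    """Design row for F_i, B_j, with or without the interaction columns."""
    x1, x2, x3 = float(i == 2), float(i == 3), float(j == 2)
    row = [1.0, x1, x2, x3]
    return row + [x1 * x3, x2 * x3] if interaction else row

sse_full = ols_sse([design(i, j, True) for i, j, _ in data], y)
sse_reduced = ols_sse([design(i, j, False) for i, j, _ in data], y)
sst = sum((yi - sum(y) / n) ** 2 for yi in y)

# Partial F statistic for H0: beta_4 = beta_5 = 0, with 2 and n - 6 degrees of freedom.
f_stat = ((sse_reduced - sse_full) / 2) / (sse_full / (n - 6))

def adj_r2(sse, p):
    """Adjusted R^2 for a model with p coefficients (including the intercept)."""
    return 1 - (sse / (n - p)) / (sst / (n - 1))

print(f_stat, adj_r2(sse_full, 6), adj_r2(sse_reduced, 4))
```

A large F statistic relative to the $F_{2,\,n-6}$ distribution is evidence against $H_0$, and a higher $R_a^2$ for the interaction model points the same way.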
Conclusion: interactions can also be considered when running a regression analysis on categorical variables.