Qualitative variables as predictors

머성암 2024. 11. 22. 18:18

Regression modeling with a two-level categorical variable

Suppose that Z is a two-level categorical variable such that Z = A or B.

Define

$$X =
\begin{cases}
1, & \text{if } Z = A \\
0, & \text{otherwise}
\end{cases}$$

Then we can use the following regression model, $$ Y = \beta_0 + \beta_1X + \epsilon$$

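Writing $\mu_A$ and $\mu_B$ for the mean of $Y$ at $Z = A$ and $Z = B$, the coefficients are interpreted as follows:

$\beta_0 = \mu_B$ (called the baseline)
$\beta_1 = \mu_A - \mu_B$
Consequently, $\beta_0 + \beta_1 = \mu_A$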

Since $E(Y) = \beta_0 + \beta_1X$,

if Z = A, X = 1, $E(Y) = \beta_0 + \beta_1 = \mu_A$

if Z = B, X = 0, $E(Y) = \beta_0 = \mu_B$
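
As a quick numerical check of this interpretation, the following minimal sketch fits the model by least squares on a small made-up data set and compares the estimates with the group sample means. The data values and variable names are hypothetical and chosen only for illustration.

```python
from statistics import mean

# Hypothetical toy data: responses observed at levels A and B of Z.
z = ["A", "A", "A", "B", "B", "B", "B"]
y = [5.0, 7.0, 6.0, 2.0, 4.0, 3.0, 3.0]

# Indicator coding from the definition above: X = 1 if Z = A, else 0.
x = [1.0 if level == "A" else 0.0 for level in z]

# Closed-form least-squares estimates for simple linear regression.
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Sample means of each group, for comparison.
mean_a = mean(yi for yi, level in zip(y, z) if level == "A")
mean_b = mean(yi for yi, level in zip(y, z) if level == "B")

print(b0, mean_b)           # intercept vs. mean of the baseline group B
print(b1, mean_a - mean_b)  # slope vs. difference in group means
```

Up to floating-point rounding, the intercept matches the sample mean of group B and the slope matches the difference between the two group means.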

Suppose that Z is a three-level categorical variable such that Z = A, B or C.

Define

$X_1 =
\begin{cases}
1, & \text{if } Z = A \\
0, & \text{otherwise}
\end{cases}$

$X_2 =
\begin{cases}
1, & \text{if } Z = B \\
0, & \text{otherwise}
\end{cases}$

Then we can use the following regression model, $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon$$

$\beta_0 = \mu_C$ (called the baseline)
$\beta_1 = \mu_A - \mu_C$
$\beta_2 = \mu_B - \mu_C$

Since $E(Y) = \beta_0 + \beta_1X_1 + \beta_2X_2$,

if Z = A, $(X_1, X_2) = (1, 0)$, $E(Y) = \beta_0 + \beta_1 = \mu_A$

if Z = B, $(X_1, X_2) = (0, 1)$, $E(Y) = \beta_0 + \beta_2 = \mu_B$

if Z = C, $(X_1, X_2) = (0, 0)$, $E(Y) = \beta_0 = \mu_C$
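
The same check can be sketched for the three-level case. The example below builds the two indicator columns, solves the normal equations with a small hand-written solver, and prints the estimates. The data and the helper names `solve` and `ols` are hypothetical. In practice a statistics package would do this step.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical toy data with three levels; C is the baseline.
z = ["A", "A", "B", "B", "B", "C", "C"]
y = [10.0, 12.0, 7.0, 8.0, 9.0, 3.0, 5.0]

# Each design row is (1, X1, X2) with X1 = 1 if Z = A and X2 = 1 if Z = B.
rows = [[1.0, float(level == "A"), float(level == "B")] for level in z]
b0, b1, b2 = ols(rows, y)

print(b0, b1, b2)  # intercept, A-vs-C difference, B-vs-C difference
```

With these made-up numbers the group sample means are 11 (A), 8 (B), and 4 (C), so the estimates should agree with $\mu_C$, $\mu_A - \mu_C$, and $\mu_B - \mu_C$ up to rounding.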

Two categorical variables

Consider two categorical variables: one with three levels ($F_1, F_2, F_3$) and the other with two levels ($B_1, B_2$).

Then, the model can be written as $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon,$$

where

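$$X_1 =
\begin{cases}
1, & \text{if } F_2 \\
0, & \text{otherwise}
\end{cases}$$

$$X_2 =
\begin{cases}
1, & \text{if } F_3 \\
0, & \text{otherwise}
\end{cases}$$

$$X_3 =
\begin{cases}
1, & \text{if } B_2 \\
0, & \text{otherwise}
\end{cases}$$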

Note that $F_1$ and $B_1$ are the base levels.

$\beta_0 = \mu_{11}$ (mean at the combination of base levels), where $\mu_{ij}$ is the mean of $Y$ at $(F_i, B_j)$
$\beta_1 = \mu_{2j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
$\beta_2 = \mu_{3j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
$\beta_3 = \mu_{i2} - \mu_{i1}$ for any level $F_i$ (i = 1, 2, 3)
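
To see why these differences do not depend on the other factor, it may help to write out all six cell means implied by the model. The table below is a worked sketch, assuming $\mu_{ij}$ denotes $E(Y)$ at the combination $(F_i, B_j)$, which is how the subscripts appear to be used here.

$$\begin{array}{c|cc}
 & B_1 & B_2 \\ \hline
F_1 & \beta_0 & \beta_0 + \beta_3 \\
F_2 & \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_3 \\
F_3 & \beta_0 + \beta_2 & \beta_0 + \beta_2 + \beta_3
\end{array}$$

Each column differs from the other by the same amount $\beta_3$, and the row differences $\beta_1$ and $\beta_2$ are the same in both columns. In other words, this model is purely additive.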

Interaction model with two categorical variables

Consider an extended model as follows:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5 + \epsilon,$$

where

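$$X_1 =
\begin{cases}
1, & \text{if } F_2 \\
0, & \text{otherwise}
\end{cases}$$

$$X_2 =
\begin{cases}
1, & \text{if } F_3 \\
0, & \text{otherwise}
\end{cases}$$

$$X_3 =
\begin{cases}
1, & \text{if } B_2 \\
0, & \text{otherwise}
\end{cases}$$

$$X_4 = X_1X_3 \quad \text{and} \quad X_5 = X_2X_3$$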

Note that $F_1$ and $B_1$ are the base levels.

$\beta_0 = \mu_{11}$ (mean of combination of base levels)
$\beta_1 = \mu_{21} - \mu_{11}$ (the effect of $F_2$ at level $B_1$ only)
$\beta_2 = \mu_{31} - \mu_{11}$ (the effect of $F_3$ at level $B_1$ only)
$\beta_3 = \mu_{12} - \mu_{11}$ (the effect of $B_2$ at level $F_1$ only)
$\beta_4 = (\mu_{22} - \mu_{12}) - (\mu_{21} - \mu_{11})$
$\beta_5 = (\mu_{32} - \mu_{12}) - (\mu_{31} - \mu_{11})$

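For example, at the combination $(F_2, B_1)$ we have $X_1 = 1$ and every other indicator equal to 0, so $\mu_{21} = \beta_0 + \beta_1$. Since $\beta_0 = \mu_{11}$, we can write $\beta_1 = \mu_{21} - \mu_{11}$.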
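
The coefficient formulas above can be checked mechanically. The sketch below starts from a made-up table of six cell means, computes $\beta_0, \dots, \beta_5$ from the formulas, and confirms that the interaction model reproduces every cell mean. The numbers and the helper name `expected_y` are hypothetical.

```python
# Hypothetical cell means mu[(i, j)] for levels F_i (i = 1, 2, 3) and B_j (j = 1, 2).
mu = {(1, 1): 10.0, (1, 2): 12.0,
      (2, 1): 13.0, (2, 2): 18.0,
      (3, 1): 9.0,  (3, 2): 10.0}

# Coefficients computed from the formulas in the text.
b0 = mu[1, 1]
b1 = mu[2, 1] - mu[1, 1]
b2 = mu[3, 1] - mu[1, 1]
b3 = mu[1, 2] - mu[1, 1]
b4 = (mu[2, 2] - mu[1, 2]) - (mu[2, 1] - mu[1, 1])
b5 = (mu[3, 2] - mu[1, 2]) - (mu[3, 1] - mu[1, 1])


def expected_y(i, j):
    """E(Y) at (F_i, B_j) under the interaction model."""
    x1, x2, x3 = int(i == 2), int(i == 3), int(j == 2)
    x4, x5 = x1 * x3, x2 * x3  # interaction terms
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5


for cell, value in mu.items():
    print(cell, value, expected_y(*cell))  # the two values should agree
```

Because the model has six coefficients for six cells, it can match any table of cell means exactly. Nonzero $\beta_4$ or $\beta_5$ means the effect of $F$ changes between $B_1$ and $B_2$.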

Example (Two categorical variables with interaction)

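Looking at this, we can ask the following question.

Is the interaction significant?
- $H_0: \beta_4 = \beta_5 = 0$ vs. $H_1: \beta_4 \neq 0 \text{ or } \beta_5 \neq 0$
- In SAS this can be checked by adding the test option, but the answer can also be inferred from the t-tests.
When comparing against the model without interaction, compare the adjusted $R_a^2$ values.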
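
For reference, here is a minimal sketch of the two quantities involved, assuming the usual partial F-test for nested models and the usual definition of adjusted $R^2$. The sums of squares below are made up and do not come from the example's output. The function names `partial_f` and `adjusted_r2` are hypothetical.

```python
def partial_f(sse_reduced, sse_full, q, n, p_full):
    """Partial F statistic for q restrictions.

    p_full counts all coefficients of the full model, including the intercept.
    """
    return ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full))


def adjusted_r2(sse, sst, n, p):
    """Adjusted R^2 for a model with p coefficients, including the intercept."""
    return 1.0 - (sse / (n - p)) / (sst / (n - 1))


# Hypothetical sums of squares from fitting both models to n = 30 observations.
n, sst = 30, 500.0
sse_additive, sse_interaction = 120.0, 90.0

# H0: beta_4 = beta_5 = 0, so q = 2; the interaction model has 6 coefficients.
f_stat = partial_f(sse_additive, sse_interaction, q=2, n=n, p_full=6)
r2_additive = adjusted_r2(sse_additive, sst, n, p=4)
r2_interaction = adjusted_r2(sse_interaction, sst, n, p=6)

print(f_stat)  # compare with an F(2, 24) critical value
print(r2_additive, r2_interaction)
```

A large F statistic relative to the $F(2, n - 6)$ reference distribution is evidence against $H_0$. A higher adjusted $R^2$ for the interaction model points the same way.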

Conclusion: interactions can also be considered when running a regression analysis on categorical predictors.
