Regression modeling with a two-level categorical variable

Suppose that Z is a two-level categorical variable such that Z = A or B.

Define

$$X = 
\begin{cases} 
1, & \text{if } Z = A \\ 
0, & \text{otherwise} 
\end{cases}$$

 

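Then we can use the following regression model, $$ Y = \beta_0 + \beta_1X + \epsilon,$$ whose coefficients relate to the group means $\mu_A = E(Y \mid Z = A)$ and $\mu_B = E(Y \mid Z = B)$ as follows: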

  • $\beta_0 = \mu_B$ (called the baseline)
  • $\beta_1 = \mu_A - \mu_B$
  • Consequently, $\beta_0 + \beta_1 = \mu_A$

Since $E(Y) = \beta_0 + \beta_1X$,

if Z = A, then X = 1 and $E(Y) = \beta_0 + \beta_1 = \mu_A$;

if Z = B, then X = 0 and $E(Y) = \beta_0 = \mu_B$.
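As a quick numerical check of these relations, here is a minimal Python sketch. The data values and variable names are made up for illustration. It fits the one-predictor model with the usual closed-form least-squares formulas and compares the estimates with the group means.

```python
from statistics import mean

# Hypothetical data (invented for illustration): group labels Z and responses Y
z = ["A", "A", "A", "B", "B", "B"]
y = [5.0, 7.0, 9.0, 2.0, 3.0, 4.0]

# Dummy coding with B as the baseline: X = 1 if Z = A, otherwise 0
x = [1 if zi == "A" else 0 for zi in z]

# Closed-form least squares for a single predictor
x_bar, y_bar = mean(x), mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1 = sxy / sxx
beta0 = y_bar - beta1 * x_bar

# Sample means of each group, for comparison
mean_a = mean(yi for zi, yi in zip(z, y) if zi == "A")
mean_b = mean(yi for zi, yi in zip(z, y) if zi == "B")

print("beta0 =", beta0, "| mean of B =", mean_b)
print("beta1 =", beta1, "| mean of A - mean of B =", mean_a - mean_b)
```

With these numbers the intercept equals the sample mean of group B and the slope equals the difference between the two sample means, which mirrors $\beta_0 = \mu_B$ and $\beta_1 = \mu_A - \mu_B$.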

 

 

Suppose that Z is a three-level categorical variable such that Z = A, B or C. 

Define

$X_1 = 
\begin{cases} 
1, & \text{if } Z = A \\ 
0, & \text{otherwise} 
\end{cases}$

 

$X_2 = 
\begin{cases} 
1, & \text{if } Z = B \\ 
0, & \text{otherwise} 
\end{cases}$

 

Then we can use the following regression model, $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon$$

  • $\beta_0 = \mu_C$ (called the baseline)
  • $\beta_1 = \mu_A - \mu_C$
  • $\beta_2 = \mu_B - \mu_C$

 

Since $E(Y) = \beta_0 + \beta_1X_1 + \beta_2X_2$,

if Z = A, then $(X_1, X_2) = (1, 0)$ and $E(Y) = \beta_0 + \beta_1 = \mu_A$;

if Z = B, then $(X_1, X_2) = (0, 1)$ and $E(Y) = \beta_0 + \beta_2 = \mu_B$;

if Z = C, then $(X_1, X_2) = (0, 0)$ and $E(Y) = \beta_0 = \mu_C$.
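The same coding extends to any number of levels: a variable with k levels needs k - 1 indicators. The sketch below is only an illustration. `dummy_code` is a hypothetical helper and the group means are invented. It builds $(X_1, X_2)$ with C as the baseline and evaluates $\beta_0 + \beta_1X_1 + \beta_2X_2$ for each level.

```python
def dummy_code(z, levels, baseline):
    """Return one 0/1 indicator per non-baseline level (hypothetical helper)."""
    return [1 if z == lvl else 0 for lvl in levels if lvl != baseline]

levels = ["A", "B", "C"]
mu = {"A": 12.0, "B": 9.0, "C": 5.0}  # hypothetical group means

# Coefficients implied by the group means, with C as the baseline
beta0 = mu["C"]
betas = [mu[lvl] - mu["C"] for lvl in levels if lvl != "C"]  # [beta1, beta2]

expected = {}
for z in levels:
    x = dummy_code(z, levels, baseline="C")
    expected[z] = beta0 + sum(b * xi for b, xi in zip(betas, x))
    print(z, x, expected[z])
```

Each printed value matches the corresponding group mean, and the baseline level C is the one whose indicators are all zero.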

 

 

Two categorical variables

Consider two categorical variables: one with three levels ($F_1, F_2, F_3$) and the other with two levels ($B_1, B_2$).

Then the model without interaction can be written as $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon,$$

where

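$$X_1 = 
\begin{cases} 
1, & \text{if } F_2 \\ 
0, & \text{otherwise} 
\end{cases}$$

$$X_2 = 
\begin{cases} 
1, & \text{if } F_3 \\ 
0, & \text{otherwise} 
\end{cases}$$

$$X_3 = 
\begin{cases} 
1, & \text{if } B_2 \\ 
0, & \text{otherwise} 
\end{cases}$$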

 

Note that $F_1$ and $B_1$ are the base levels, and let $\mu_{ij}$ denote the mean of $Y$ at the combination $(F_i, B_j)$.

  • $\beta_0 = \mu_{11}$ (mean of combination of base levels)
  • $\beta_1 = \mu_{2j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
  • $\beta_2 = \mu_{3j} - \mu_{1j}$ for any level $B_j$ (j = 1, 2)
  • $\beta_3 = \mu_{i2} - \mu_{i1}$ for any level $F_i$ (i = 1, 2, 3)
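To see why each difference holds for any level of the other factor, it may help to write out $E(Y)$ for all six combinations under this model. The table below is a worked illustration derived from the indicator definitions, with rows $F_i$ and columns $B_j$:

$$\begin{array}{c|cc}
 & B_1 & B_2 \\ \hline
F_1 & \beta_0 & \beta_0 + \beta_3 \\
F_2 & \beta_0 + \beta_1 & \beta_0 + \beta_1 + \beta_3 \\
F_3 & \beta_0 + \beta_2 & \beta_0 + \beta_2 + \beta_3
\end{array}$$

The $F_2$ row exceeds the $F_1$ row by $\beta_1$ in both columns, the $F_3$ row exceeds it by $\beta_2$ in both columns, and the two columns differ by $\beta_3$ in every row. In other words, this model forces the effect of one factor to be the same at every level of the other.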

 

Interaction model with two categorical variables 

Consider an extended model as follows:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_4 + \beta_5X_5 + \epsilon,$$

where 

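$$X_1 = 
\begin{cases} 
1, & \text{if } F_2 \\ 
0, & \text{otherwise} 
\end{cases}$$

$$X_2 = 
\begin{cases} 
1, & \text{if } F_3 \\ 
0, & \text{otherwise} 
\end{cases}$$

$$X_3 = 
\begin{cases} 
1, & \text{if } B_2 \\ 
0, & \text{otherwise} 
\end{cases}$$

$$X_4 = X_1X_3 \quad \text{and} \quad X_5 = X_2X_3$$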

 

Note that $F_1$ and $B_1$ are the base levels.

  • $\beta_0 = \mu_{11}$ (mean of combination of base levels)
  • $\beta_1 = \mu_{21} - \mu_{11}$ (at level $B_1$ only)
  • $\beta_2 = \mu_{31} - \mu_{11}$ (at level $B_1$ only)
  • $\beta_3 = \mu_{12} - \mu_{11}$ (at level $F_1$ only)
  • $\beta_4 = (\mu_{22} - \mu_{12}) - (\mu_{21} - \mu_{11})$
  • $\beta_5 = (\mu_{32} - \mu_{12}) - (\mu_{31} - \mu_{11})$

 

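For example, at the combination ($F_2$, $B_1$) we have $X_1 = 1$ and $X_2 = X_3 = X_4 = X_5 = 0$, so $\mu_{21} = \beta_0 + \beta_1$. Since $\beta_0 = \mu_{11}$, we can write $\beta_1 = \mu_{21} - \mu_{11}$.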
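The six coefficients can be checked against the six cell means numerically. The following Python sketch uses invented cell means and a hypothetical function `expected_y`. It computes each $\beta$ from the formulas listed for the interaction model and confirms that the model reproduces every $\mu_{ij}$.

```python
# Hypothetical cell means mu[(i, j)] for levels F_i (i = 1, 2, 3) and B_j (j = 1, 2)
mu = {(1, 1): 11.0, (2, 1): 15.0, (3, 1): 19.0,
      (1, 2): 12.0, (2, 2): 20.0, (3, 2): 28.0}

# Coefficients from the formulas of the interaction model
b0 = mu[1, 1]
b1 = mu[2, 1] - mu[1, 1]
b2 = mu[3, 1] - mu[1, 1]
b3 = mu[1, 2] - mu[1, 1]
b4 = (mu[2, 2] - mu[1, 2]) - (mu[2, 1] - mu[1, 1])
b5 = (mu[3, 2] - mu[1, 2]) - (mu[3, 1] - mu[1, 1])

def expected_y(i, j):
    """E(Y) at (F_i, B_j) under the interaction model."""
    x1, x2, x3 = int(i == 2), int(i == 3), int(j == 2)
    x4, x5 = x1 * x3, x2 * x3  # interaction terms
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5

for (i, j), m in mu.items():
    print((i, j), expected_y(i, j), m)
```

Because the model has as many coefficients as there are cells, it can match any set of cell means. Here $\beta_4$ and $\beta_5$ measure how much the $F_2$ and $F_3$ effects change when moving from $B_1$ to $B_2$.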

 

 

Example (Two categorical variables with interaction)

 

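Looking at this, we can ask the following questions.

  1. Is the interaction significant?
    • $H_0: \beta_4 = \beta_5 = 0$ vs. $H_1: \beta_4 \neq 0 \text{ or } \beta_5 \neq 0$
    • You can confirm this in SAS by adding a test as an extra option, but it can also be inferred from the t-tests.
  2. To compare against the model without interaction, compare the adjusted $R^2$ values ($R_a^2$).

Conclusion: interaction can also be considered when running a regression analysis on categorical variables.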
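One common way to assess the joint hypothesis on $\beta_4$ and $\beta_5$ is a partial F-test that compares the models with and without interaction. The sketch below is an assumption-laden illustration, not the SAS procedure used in this post. The data are invented, `ols_sse`, `row`, and `adj_r2` are hypothetical helpers, and the least-squares fit is done by solving the normal equations in plain Python.

```python
def ols_sse(X, y):
    """Residual sum of squares from a least-squares fit of y on the columns of X."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

# Hypothetical data: (F level, B level, response), two observations per cell
data = [(1, 1, 10), (1, 1, 12), (2, 1, 14), (2, 1, 16), (3, 1, 18), (3, 1, 20),
        (1, 2, 11), (1, 2, 13), (2, 2, 19), (2, 2, 21), (3, 2, 27), (3, 2, 29)]
y = [float(v) for _, _, v in data]

def row(f, b, interaction):
    """Design-matrix row: intercept, X1, X2, X3 and optionally X4, X5."""
    x1, x2, x3 = int(f == 2), int(f == 3), int(b == 2)
    base = [1, x1, x2, x3]
    return base + [x1 * x3, x2 * x3] if interaction else base

n = len(y)
sse_full = ols_sse([row(f, b, True) for f, b, _ in data], y)
sse_reduced = ols_sse([row(f, b, False) for f, b, _ in data], y)
sst = sum((yi - sum(y) / n) ** 2 for yi in y)

# Partial F statistic for H0: beta_4 = beta_5 = 0 (2 restrictions, n - 6 error df)
f_stat = ((sse_reduced - sse_full) / 2) / (sse_full / (n - 6))

def adj_r2(sse, p):
    """Adjusted R^2 for a model with p coefficients (intercept included)."""
    return 1 - (sse / (n - p)) / (sst / (n - 1))

print("F =", round(f_stat, 3))
print("adjusted R^2 with interaction:", round(adj_r2(sse_full, 6), 3))
print("adjusted R^2 without interaction:", round(adj_r2(sse_reduced, 4), 3))
```

With these made-up numbers the F statistic is 8 on (2, 6) degrees of freedom, and the adjusted $R^2$ is higher for the model with interaction. In practice the statistic would be compared with the F distribution to obtain a p-value.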

 

 
