Predicting the Actions Needed to Retain Customers

머성암 2024. 1. 14. 20:31

Coaching Study 2024

1. Data composition

The Kaggle dataset ( https://www.kaggle.com/blastchar/telco-customer-churn ) contains information on customer churn at a telecom company.
It was provided by IBM, and analyzing the customer data helps in developing customer retention programs.
- Demographic info: the customer's gender, age range, and whether they have a partner or dependents (gender, SeniorCitizen, Partner, Dependents)
- Churn info: whether the customer has discontinued the service (Churn)
- Services subscribed: the services each customer signed up for, such as phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies (PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies)
- Customer account info: how long the customer has used the service, contract type, payment method, paperless billing, monthly charges, and total charges (customerID, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, tenure)

2. Loading the required libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

3. Loading the dataset

df = pd.read_csv("https://bit.ly/telco-csv", index_col="customerID")
df.shape

df.head() # preview the data

df.info() # check the dataset and the dtype of each column

# df.isnull().sum() gives the number of missing values per column.
print(df.isnull().sum())
# df.isnull().sum().sum() gives the total number of missing values.
df.isnull().sum().sum()
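
One caveat with the check above: isnull() only counts real NaN values. If a numeric column such as TotalCharges was read as text and some entries are blank strings, they will not show up. The following is a minimal sketch on a hypothetical three-row sample (the values are made up for illustration), assuming blanks appear as " ":

```python
import pandas as pd

# Hypothetical miniature of the Telco data: one TotalCharges entry is a blank string.
sample = pd.DataFrame(
    {
        "tenure": [1, 0, 34],
        "MonthlyCharges": [29.85, 52.55, 56.95],
        "TotalCharges": ["29.85", " ", "1889.5"],
    }
)

# isnull() reports nothing missing because " " is a valid string.
missing_before = int(sample["TotalCharges"].isnull().sum())

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
sample["TotalCharges"] = pd.to_numeric(sample["TotalCharges"], errors="coerce")
missing_after = int(sample["TotalCharges"].isnull().sum())

print(missing_before, missing_after)
```

After the conversion the column is numeric, so it would also be picked up by a numeric-only column selection.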

4. Splitting the training and prediction datasets

5. Columns used for training and prediction

# Name the columns to use for training and prediction.
# Categorical data (object, category) needs separate preprocessing,
# so only the numeric columns are used here.
feature_names = df.select_dtypes(include = "number").columns
feature_names
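
The categorical columns are skipped here, but the preprocessing they would need can be sketched briefly. One common option is one-hot encoding with pd.get_dummies. The rows below are hypothetical and only reuse column names from the dataset description; this is an illustration, not a step of this post's pipeline:

```python
import pandas as pd

# Hypothetical rows using column names from the dataset description.
sample = pd.DataFrame(
    {
        "tenure": [1, 34, 2],
        "Contract": ["Month-to-month", "One year", "Month-to-month"],
        "PaperlessBilling": ["Yes", "No", "Yes"],
    }
)

# One-hot encode the text columns; dtype=int keeps the new columns numeric,
# and the existing numeric column passes through unchanged.
encoded = pd.get_dummies(sample, columns=["Contract", "PaperlessBilling"], dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the encoded frame could be fed to a tree model alongside the original numeric features.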

# Set the cut-off row index used to split the data into train and test (the first 80%).
split_count = int(df.shape[0] * 0.8)
split_count

6. The label: the value to predict

label_name = "Churn"
label_name # the label (target value)

7. Building the training and prediction datasets

# Use the 80% cut-off index (split_count) to split the feature data (X) into train and test.
# Use the same cut-off index (split_count) to split the label data (y) into train and test.
train = df[:split_count].copy()
test = df[split_count:].copy()

X_train = train[feature_names]
y_train = train[label_name]

X_test = test[feature_names]
y_test = test[label_name]

X_train.shape, X_test.shape, y_train.shape, y_test.shape
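
Slicing by position works when the rows are already in random order. If that cannot be assumed, scikit-learn's train_test_split is a common alternative that shuffles and can keep the label ratio equal in both parts. A minimal sketch on synthetic data (the demo frame and its values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Telco frame: 100 rows, 25% churners, sorted by label
# so that a positional slice would be badly unrepresentative.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {
        "tenure": rng.integers(0, 72, size=100),
        "MonthlyCharges": rng.uniform(20, 120, size=100).round(2),
        "Churn": ["No"] * 75 + ["Yes"] * 25,
    }
)

X = demo[["tenure", "MonthlyCharges"]]
y = demo["Churn"]

# stratify=y keeps the Yes/No ratio the same in both parts;
# random_state makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```

With a positional 80/20 slice of this sorted frame, the test part would contain only churners, which is the situation stratification avoids.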

8. Importing the machine learning algorithm

DecisionTreeClassifier(
    *,
    criterion='gini', # split criterion {"gini", "entropy"}, default="gini"
    splitter='best',
    max_depth=None, # The maximum depth of the tree
    min_samples_split=2, # The minimum number of samples required to split an internal node
    min_samples_leaf=1, # The minimum number of samples required to be at a leaf node.
    min_weight_fraction_leaf=0.0, # The minimum weighted fraction of the sum total of weights
    max_features=None, # The number of features to consider when looking for the best split
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
    ccp_alpha=0.0,
)

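Key parameters
- criterion: the function used to measure the quality of a split
- max_depth: the maximum depth of the tree
- min_samples_split: the minimum number of samples required to split an internal node
- min_samples_leaf: the minimum number of samples required at a leaf node
- max_leaf_nodes: the upper limit on the number of leaf nodes
- random_state: controls the randomness of the estimator; fixing it gives the same result on every run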
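
To see what a few of these parameters do in practice, here is a small sketch on synthetic data from make_classification (not the Telco data; the parameter values are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the churn features.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Two trees with identical settings and the same random_state.
tree_a = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=42)
tree_b = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=42)
tree_a.fit(X_demo, y_demo)
tree_b.fit(X_demo, y_demo)

# max_depth caps the depth; the fitted tree may be shallower but never deeper.
print(tree_a.get_depth(), tree_a.get_n_leaves())

# Same seed, same data, same predictions.
print((tree_a.predict(X_demo) == tree_b.predict(X_demo)).all())
```

Limiting max_depth or raising min_samples_leaf is a usual first step against overfitting, since an unrestricted tree keeps splitting until its leaves are pure.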

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

9. Training

model.fit(X_train, y_train)

10. Prediction

# Predict on the test data with the trained model.
y_predict = model.predict(X_test)
y_predict[:5]
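
predict returns only the final class. When a score per customer is more useful, for example to rank who is most likely to churn, predict_proba gives the class probabilities. A minimal sketch on synthetic data, with labels mapped to "No"/"Yes" to mimic the Churn column:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic features; labels are mapped to "No"/"Yes" to mimic the Churn column.
X_demo, y_num = make_classification(n_samples=300, n_features=5, random_state=0)
y_demo = np.where(y_num == 1, "Yes", "No")

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Columns of predict_proba follow the order of clf.classes_.
proba = clf.predict_proba(X_demo[:5])
labels = clf.predict(X_demo[:5])

print(clf.classes_)
print(proba.round(2))
print(labels)
```

For a decision tree, these probabilities are the class fractions among the training samples in the leaf each row falls into.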

11. Analyzing the tree

from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))

tree = plot_tree(model,
                 feature_names=list(feature_names),  # pass a plain list; some scikit-learn versions reject a pandas Index
                 filled=True,
                 fontsize=10,
                 max_depth=4)
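
When a plot is hard to read or no display is available, the same tree can be printed as text rules with export_text. A minimal sketch on synthetic data; the feature names are borrowed from the dataset's numeric columns purely as labels:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with three features; the names are only labels for readability.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
names = ["SeniorCitizen", "tenure", "MonthlyCharges"]

clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# export_text returns the split rules as an indented string.
rules = export_text(clf, feature_names=names)
print(rules)
```

Each indented line is one split condition, and the lines starting with "class:" are the leaves.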

12. Measuring accuracy

# Visualize the feature importances.
sns.barplot(x = model.feature_importances_, y=feature_names)
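
The values behind that bar plot can also be read as numbers. A small sketch on synthetic data (the feature names are placeholders) that puts feature_importances_ into a sorted Series:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; the names are placeholders borrowed from the numeric columns.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)
names = ["SeniorCitizen", "tenure", "MonthlyCharges"]

clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort them so the largest comes first.
importances = pd.Series(clf.feature_importances_, index=names).sort_values(ascending=False)
print(importances)
```

A feature the tree never splits on gets an importance of 0.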

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)
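
Accuracy alone can look better than it is when one class dominates, which is typical for churn data where most customers stay. A quick sanity check is to compare against a baseline that always predicts the majority class. The labels below are hypothetical:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 8 customers stay, 2 churn.
y_true = np.array(["No"] * 8 + ["Yes"] * 2)

# A "model" that always predicts the majority class.
y_majority = np.array(["No"] * 10)

baseline = accuracy_score(y_true, y_majority)

# Rows are true labels, columns are predicted labels, in the order given by labels=.
cm = confusion_matrix(y_true, y_majority, labels=["No", "Yes"])

print(baseline)
print(cm)
```

This baseline scores 0.8 without finding a single churner, so a trained model is only useful if it beats that number and the second row of its confusion matrix shows churners actually being caught.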
