XGBoost

Machine Learning 2021. 4. 14. 14:46

XGBoost is one of the most popular tree-based ensemble learning algorithms. It is built on GBM, but it addresses GBM's main weaknesses: slow training time and the lack of overfitting regularization.

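Excellent predictive performance
Faster training than GBM
Overfitting regularization
Tree pruning: splits that yield no positive gain are pruned, which reduces the number of splits
Built-in cross-validation and early stopping
- The evaluation data set is scored after every boosting round, which makes it possible to find a suitable number of rounds (the library also provides a cross-validation helper, xgb.cv)
- Training does not have to run for the full number of rounds: it can stop early once the metric on the evaluation data set stops improving
Native handling of missing values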
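
The pruning rule in the list above can be illustrated with the split gain formula from the XGBoost documentation. This is only a sketch: `gamma` and `lambda` are real XGBoost regularization parameters, but the function names and the gradient and hessian sums below are made up for illustration.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of a single leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction obtained by splitting one leaf into two children."""
    before = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    after = (leaf_score(g_left, h_left, reg_lambda)
             + leaf_score(g_right, h_right, reg_lambda))
    # gamma is the fixed penalty paid for adding one more leaf
    return 0.5 * (after - before) - gamma


def should_prune(gain):
    """A split without positive gain is not worth keeping."""
    return gain <= 0.0


# Hypothetical gradient (G) and hessian (H) sums for the two children
gain_no_penalty = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=0.0)
gain_with_penalty = split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=1.0, gamma=6.0)

print(gain_no_penalty, should_prune(gain_no_penalty))
print(gain_with_penalty, should_prune(gain_with_penalty))
```

With `gamma=0` the split is kept. Raising `gamma` to 6 makes the same split unprofitable, so it would be pruned.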

In [5]:

# The main parameters are explained as we go.
# Using the Wisconsin breast cancer data set, classify tumors as malignant or benign.
import xgboost as xgb
from xgboost import plot_importance  # XGBoost's built-in plotting helper for feature importance
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import warnings 
warnings.filterwarnings('ignore')

dataset = load_breast_cancer()
X_features = dataset.data
y_label = dataset.target

# There are 30 features; adding the target column below gives 31 columns in total.
cancer_df = pd.DataFrame(data = X_features , columns = dataset.feature_names)
cancer_df['target'] = y_label
cancer_df.head()

Out[5]:

	mean radius	mean texture	mean perimeter	mean area	mean smoothness	mean compactness	mean concavity	mean concave points	mean symmetry	mean fractal dimension	...	worst texture	worst perimeter	worst area	worst smoothness	worst compactness	worst concavity	worst concave points	worst symmetry	worst fractal dimension
0	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.3001	0.14710	0.2419	0.07871	...	17.33	184.60	2019.0	0.1622	0.6656	0.7119	0.2654	0.4601	0.11890
1	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.0869	0.07017	0.1812	0.05667	...	23.41	158.80	1956.0	0.1238	0.1866	0.2416	0.1860	0.2750	0.08902
2	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.1974	0.12790	0.2069	0.05999	...	25.53	152.50	1709.0	0.1444	0.4245	0.4504	0.2430	0.3613	0.08758
3	11.42	20.38	77.58	386.1	0.14250	0.28390	0.2414	0.10520	0.2597	0.09744	...	26.50	98.87	567.7	0.2098	0.8663	0.6869	0.2575	0.6638	0.17300
4	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.1980	0.10430	0.1809	0.05883	...	16.67	152.20	1575.0	0.1374	0.2050	0.4000	0.1625	0.2364	0.07678

5 rows × 31 columns

In [11]:

print(dataset.target_names)
print(cancer_df['target'].value_counts())

['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64

In [15]:

# Use 80% of the data for training and 20% for testing
X_train , X_test , Y_train, Y_test = train_test_split(X_features, y_label, test_size = 0.2, random_state = 11)
print(X_train.shape, X_test.shape)

(455, 30) (114, 30)
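
The split above produces only a training set and a test set. When early stopping is used, a common precaution is to carve a separate validation set out of the training data so that the test set stays untouched until the final evaluation. A minimal sketch, assuming a 10% validation share; that ratio and the names `X_tr`, `X_val`, `y_tr`, `y_val` are choices made for this illustration, not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

dataset = load_breast_cancer()
X_features, y_label = dataset.data, dataset.target

# Same 80/20 split as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_label, test_size=0.2, random_state=11)

# Assumption: hold out 10% of the training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=11)

print(X_tr.shape, X_val.shape, X_test.shape)
```

The validation set could then be passed to `evals` in place of the test set, so that early stopping never sees the test data.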

In [18]:

# The native xgb.train() API expects its data as a DMatrix, so convert the arrays.
dtrain = xgb.DMatrix(data = X_train , label = Y_train)
dtest = xgb.DMatrix(data = X_test , label = Y_test)

In [23]:

# XGBoost parameter settings
params = {
    'max_depth': 3,  # maximum tree depth (0 means no limit, which raises the risk of overfitting)
    'eta': 0.1,  # learning rate, a value between 0 and 1
    'objective': 'binary:logistic',  # loss function to minimize; binary:logistic because this is binary classification
    'eval_metric': 'logloss'  # metric computed on the evaluation sets
}
# Early stopping is not a booster parameter; it is passed to xgb.train() as early_stopping_rounds.
num_rounds = 400

In [25]:

# Name the training set 'train' and the evaluation (test) set 'eval'
wlist = [(dtrain,'train'),(dtest,'eval')]
# Pass the hyperparameters and early_stopping_rounds to train():
# training stops once eval-logloss has not improved for 100 rounds
xgb_model = xgb.train(params = params , dtrain = dtrain, num_boost_round = num_rounds, early_stopping_rounds = 100, evals = wlist)

[0]	train-logloss:0.61325	eval-logloss:0.61282
[1]	train-logloss:0.54348	eval-logloss:0.54112
[2]	train-logloss:0.48665	eval-logloss:0.48139
... # both train-logloss and eval-logloss keep decreasing as the rounds progress
[269]	train-logloss:0.00623	eval-logloss:0.02773
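
The log ends at round 269 rather than running all 400 rounds, which is consistent with `early_stopping_rounds = 100` halting training after 100 rounds without improvement in eval-logloss. The stopping rule can be sketched in plain Python as follows; the function name and the metric values are hypothetical, and this is an illustration of the idea rather than XGBoost's actual implementation.

```python
def early_stopping_round(eval_scores, patience):
    """Return (stop_round, best_round) for a metric where lower is better.

    Training stops at the first round that lies `patience` rounds past
    the best score seen so far; otherwise every round is run.
    """
    best_round = 0
    best_score = float("inf")
    for round_idx, score in enumerate(eval_scores):
        if score < best_score:
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    return len(eval_scores) - 1, best_round


# Hypothetical eval-logloss values, one per boosting round
scores = [0.60, 0.50, 0.42, 0.40, 0.41, 0.43, 0.42, 0.44]
stop_round, best_round = early_stopping_round(scores, patience=3)
print(stop_round, best_round)  # 6 3
```

In this toy run the best score appears at round 3 and training stops at round 6, three rounds later. With early stopping enabled in `xgb.train()`, the returned Booster records the best round in its `best_iteration` attribute.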

In [31]:

# XGBoost's native predict() returns probabilities, so convert them to class labels (1 or 0).
pred_probs = xgb_model.predict(dtest)
print('First 10 predicted probabilities:')
print(np.round(pred_probs[:10], 3))

# Predict 1 if the probability is greater than 0.5, otherwise 0.
preds = [ 1 if x>0.5 else 0 for x in pred_probs]
print(preds)

First 10 predicted probabilities:
[0.    0.    0.    0.    0.    0.997 0.703 0.998 1.    0.999]
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
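
With the `binary:logistic` objective, the value returned by `predict()` is the sigmoid of the model's raw score, and the `logloss` metric scores those probabilities directly, before any thresholding. The sketch below shows the three steps using only the standard library; the raw scores and labels are hypothetical and the helper names are chosen for this illustration.

```python
import math


def sigmoid(margin):
    """Map a raw score (margin) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-margin))


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary log loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def to_labels(probs, threshold=0.5):
    """Same rule as the notebook: 1 if the probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]


margins = [-3.0, 0.0, 2.0]          # hypothetical raw scores
probs = [sigmoid(m) for m in margins]
labels = to_labels(probs)

print([round(p, 3) for p in probs], labels)
print(log_loss([0, 1, 1], probs))   # hypothetical true labels
```

A probability of exactly 0.5 maps to class 0 here because the comparison is strict. A model that always predicts 0.5 has a log loss of ln 2, about 0.693, which is a useful baseline when reading the training log.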

In [42]:

from sklearn.metrics import accuracy_score
print('Accuracy: {0:.4f}'.format(accuracy_score(Y_test,preds)))

Accuracy: 0.9825
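
Accuracy alone does not show which kind of error the model makes, and in this data set (0 = malignant, 1 = benign) the two kinds are not equally costly. A minimal sketch of the confusion counts, precision and recall in plain Python; the labels below are hypothetical and the function names are chosen for this illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for the given positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, not the notebook's actual predictions
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, tn, fn)
print(accuracy, precision, recall)
```

Passing `positive=0` instead would report the same quantities for the malignant class.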

In [43]:

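# Plot feature importances with plot_importance, imported earlier.
# Features are labeled f0, f1, ... because the DMatrix was built from NumPy arrays without feature names.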
from xgboost import plot_importance
import matplotlib.pyplot as plt
%matplotlib inline

fig,ax = plt.subplots(figsize=(10,12))
plot_importance(xgb_model, ax = ax)

Out[43]:

<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
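
Since the plot labels features by index (f0, f1, ...), it can help to map those labels back to column names. The sketch below assumes a dictionary of scores keyed by such labels, like the one `Booster.get_score()` returns. The scores are hypothetical, only the first three column names of the data set are used, and `rename_importance` is a helper written for this illustration.

```python
def rename_importance(scores, feature_names):
    """Map keys such as 'f2' to feature_names[2], sorted by score (descending)."""
    named = {}
    for key, value in scores.items():
        index = int(key[1:])  # 'f2' -> 2
        named[feature_names[index]] = value
    return sorted(named.items(), key=lambda item: item[1], reverse=True)


feature_names = ["mean radius", "mean texture", "mean perimeter"]
scores = {"f0": 12, "f2": 30, "f1": 7}  # hypothetical importance scores

ranked = rename_importance(scores, feature_names)
for name, score in ranked:
    print(name, score)
```

Alternatively, passing `feature_names` when the DMatrix is created makes the plot show the real names directly.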
