Machine Learning 2021. 4. 13. 21:51

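sklearn provides the KFold and StratifiedKFold classes for K-fold cross-validation. K-fold cross-validation splits the dataset into K folds and repeats training and evaluation K times, each time holding out a different fold as the validation set and training on the remaining K-1 folds.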
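As a minimal sketch of that splitting, assuming a toy array of 10 samples (the names `X` and `folds` are illustrative and not from the original post), `KFold.split()` yields one pair of training and validation indices per fold:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 samples with one feature each (illustrative only).
X = np.arange(10).reshape(-1, 1)

kfold = KFold(n_splits=5)  # no shuffling by default, so folds are contiguous

# Collect (training indices, validation indices) for each of the 5 folds.
folds = [(train_idx.tolist(), val_idx.tolist())
         for train_idx, val_idx in kfold.split(X)]

for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f'fold {i}: train={train_idx}, validation={val_idx}')
```

Every sample lands in the validation set exactly once across the 5 folds, and in the training set for the other 4.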

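KFold => the most common K-fold method. As described above, it splits the data into a training set and a validation set for each fold, without looking at the labels.

StratifiedKFold => a K-fold variant for datasets with an imbalanced label distribution. For example, suppose 10,000 samples are labeled True and only 10 are labeled False. If this data is split into 5 folds with plain K-fold, the label ratio can differ from fold to fold: a fold of about 2,000 samples may contain only True labels. A model trained without any False samples cannot learn the False class, and a validation fold without any False samples cannot measure how well the model detects it. StratifiedKFold instead keeps the label ratio of every fold close to that of the whole dataset.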
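The imbalance can be made concrete with a small sketch. It assumes the 10 False labels sit at the end of the array (an assumption for illustration; the post does not say how the data is ordered), and `false_counts` is a hypothetical helper, not part of sklearn:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Assumed layout: 10,000 True labels followed by 10 False labels.
y = np.array([True] * 10000 + [False] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here


def false_counts(splitter):
    """Return the number of False labels in each validation fold."""
    return [int((~y[val_idx]).sum()) for _, val_idx in splitter.split(X, y)]


kfold_counts = false_counts(KFold(n_splits=5))
skf_counts = false_counts(StratifiedKFold(n_splits=5))

print('KFold           :', kfold_counts)
print('StratifiedKFold :', skf_counts)
```

With this ordering, plain KFold puts all 10 False samples into the last validation fold, while StratifiedKFold gives each fold 2 of them.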

Splitting Training Data with K-Fold

In [23]:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
import numpy as np

iris = load_iris()
features = iris.data  # feature values of the iris data
print(features.shape)
label = iris.target
dt_clf = DecisionTreeClassifier(random_state = 11)

# Split the dataset into 5 folds with K-Fold.
Kfold = KFold(n_splits = 5)
cv_accuracy = []  # list that collects the accuracy of each fold

# The dataset has 150 samples, so each of the 5 folds validates on 30 samples and trains on the other 120.
# The accuracy of each fold is stored in cv_accuracy.
n_iter = 0
# Calling split() on the KFold object returns the row indices of the training and validation data for each fold as arrays.
for train_index , test_index in Kfold.split(features):
    # Build the training and validation data from the indices returned by Kfold.split()
    X_train , X_test  = features[train_index], features[test_index]
    Y_train , Y_test  = label[train_index], label[test_index]
    # Train on the training data and predict on the validation data
    dt_clf.fit(X_train, Y_train)
    pred = dt_clf.predict(X_test)
    n_iter += 1
    # Measure the accuracy on each iteration
    accuracy = np.round(accuracy_score(Y_test,pred), 4)
    cv_accuracy.append(accuracy)
    print('Fold {}. K-fold accuracy: {}, training set size: {}, validation set size: {}'.format(n_iter, accuracy, X_train.shape[0], Y_test.shape[0]))

(150, 4)
Fold 1. K-fold accuracy: 1.0, training set size: 120, validation set size: 30
Fold 2. K-fold accuracy: 0.9667, training set size: 120, validation set size: 30
Fold 3. K-fold accuracy: 0.8667, training set size: 120, validation set size: 30
Fold 4. K-fold accuracy: 0.9333, training set size: 120, validation set size: 30
Fold 5. K-fold accuracy: 0.8333, training set size: 120, validation set size: 30
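The loop reports one accuracy per fold, and a common way to summarize them is the mean. A small sketch, with the five values copied by hand from the output above rather than recomputed:

```python
import numpy as np

# Fold accuracies copied from the K-Fold output above.
cv_accuracy = [1.0, 0.9667, 0.8667, 0.9333, 0.8333]

mean_accuracy = np.round(np.mean(cv_accuracy), 4)
print('mean validation accuracy:', mean_accuracy)
```

The spread between the best fold (1.0) and the worst (0.8333) is worth noting alongside the mean, since it shows how much the score depends on which samples end up in the validation fold.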

Splitting Training Data with Stratified K-Fold

Stratified splitting divides the training and validation data according to the distribution of the labels, so split() must receive the label data as well as the feature data.
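To see what the label argument changes, the sketch below counts the classes in each validation fold of the iris data for both splitters. The names `kfold_dist` and `skf_dist` are illustrative; it relies on the iris rows being ordered by class (50 rows each of class 0, 1, and 2) and on neither splitter shuffling by default:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

iris = load_iris()
X, y = iris.data, iris.target  # rows are ordered by class: 0, then 1, then 2

# KFold only needs X; it splits by row position and never looks at the labels.
kfold_dist = [np.bincount(y[val_idx], minlength=3).tolist()
              for _, val_idx in KFold(n_splits=5).split(X)]

# StratifiedKFold takes y as well and keeps the class ratio the same in every fold.
skf_dist = [np.bincount(y[val_idx], minlength=3).tolist()
            for _, val_idx in StratifiedKFold(n_splits=5).split(X, y)]

print('KFold validation class counts          :', kfold_dist)
print('StratifiedKFold validation class counts:', skf_dist)
```

With plain KFold, the first validation fold holds only class 0 and the last only class 2, whereas every stratified fold holds 10 samples of each class.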

In [28]:

# Example of splitting the dataset / Stratified K-Fold
from sklearn.model_selection import StratifiedKFold
import pandas as pd

iris_df = pd.DataFrame(data = iris.data, columns = iris.feature_names)
# Add the target label
iris_df['label'] = iris.target
skf = StratifiedKFold(n_splits = 5)
n_iter = 0

print(iris_df.head())

# split() needs the labels as its second argument to stratify the folds
for train_index, test_index in skf.split(iris_df,iris_df['label']):
    n_iter += 1
    label_train = iris_df['label'].iloc[train_index]
    label_test = iris_df['label'].iloc[test_index]
    X_train, X_test = features[train_index], features[test_index]
    Y_train, Y_test = label[train_index], label[test_index]
    dt_clf.fit(X_train, Y_train)
    print('################# Cross-validation ################# : {}'.format(n_iter))
    print('Cross-validation accuracy : {}'.format(accuracy_score(Y_test, dt_clf.predict(X_test))))
    print('Training label (label_train) distribution : \n', label_train.value_counts())
    print('Validation label (label_test) distribution : \n', label_test.value_counts())
    print('\n')

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   label  
0      0  
1      0  
2      0  
3      0  
4      0  
################# Cross-validation ################# : 1
Cross-validation accuracy : 0.9666666666666667
Training label (label_train) distribution : 
 2    40
1    40
0    40
Name: label, dtype: int64
Validation label (label_test) distribution : 
 2    10
1    10
0    10
Name: label, dtype: int64


################# Cross-validation ################# : 2
Cross-validation accuracy : 0.9666666666666667
Training label (label_train) distribution : 
 2    40
1    40
0    40
Name: label, dtype: int64
Validation label (label_test) distribution : 
 2    10
1    10
0    10
Name: label, dtype: int64


################# Cross-validation ################# : 3
Cross-validation accuracy : 0.9
Training label (label_train) distribution : 
 2    40
1    40
0    40
Name: label, dtype: int64
Validation label (label_test) distribution : 
 2    10
1    10
0    10
Name: label, dtype: int64


################# Cross-validation ################# : 4
Cross-validation accuracy : 0.9666666666666667
Training label (label_train) distribution : 
 2    40
1    40
0    40
Name: label, dtype: int64
Validation label (label_test) distribution : 
 2    10
1    10
0    10
Name: label, dtype: int64


################# Cross-validation ################# : 5
Cross-validation accuracy : 1.0
Training label (label_train) distribution : 
 2    40
1    40
0    40
Name: label, dtype: int64
Validation label (label_test) distribution : 
 2    10
1    10
0    10
Name: label, dtype: int64
