Machine Learning (MACHINE LEARNING) / Brief Theory (Theory...) 2021. 4. 26. 14:08

Let's now write a brief Python implementation of the ID3 model we looked at last time.

If you are not sure what the ID3 model is, see the previous post:

https://guru.tistory.com/entry/Decision-Tree-%EC%97%90%EC%84%9C%EC%9D%98-ID3-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98


1. Importing modules

In [1]:

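import numpy as np
import pandas as pd
from numpy import log2 as log

# eps is the machine epsilon for float (about 2.22e-16): the gap between 1.0
# and the next representable float. It is not the smallest positive float.
# It is used below to guard against division by zero and log(0).
eps = np.finfo(float).eps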
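
As a quick check on what eps actually is, the sketch below compares it with another float limit. It assumes NumPy's default 64-bit float; the name `tiny` is used here only for illustration.

```python
import numpy as np

eps = np.finfo(float).eps    # machine epsilon for 64-bit floats
tiny = np.finfo(float).tiny  # smallest positive normal float, far smaller than eps

# eps is the gap between 1.0 and the next representable float
print(1.0 + eps > 1.0)        # adding eps changes 1.0
print(1.0 + eps / 2 == 1.0)   # adding half of eps is rounded away
print(tiny < eps)             # eps is not the smallest positive float
```

All three lines print True, which is why eps works as a small guard value without noticeably changing the fractions it is added to.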

2. Creating the dataset

In [2]:

# Create the dataset. The weather data holds the attributes outlook, temp (temperature),
# humidity, windy, and play (whether the game is played).
outlook = 'overcast,overcast,overcast,overcast,rainy,rainy,rainy,rainy,rainy,sunny,sunny,sunny,sunny,sunny'.split(',')
temp = 'hot,cool,mild,hot,mild,cool,cool,mild,mild,hot,hot,mild,cool,mild'.split(',')
humidity = 'high,normal,high,normal,high,normal,normal,normal,high,high,high,high,normal,normal'.split(',')
windy = 'FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE,FALSE,TRUE,FALSE,FALSE,TRUE'.split(',')
play = 'yes,yes,yes,yes,yes,yes,no,yes,no,no,no,no,yes,yes'.split(',')

dataset = {'outlook': outlook, "temp": temp, "humidity":humidity, "windy":windy, "play":play}
df = pd.DataFrame(dataset, columns = ['outlook','temp','humidity','windy','play'])

In [3]:

df

Out[3]:

	outlook	temp	humidity	windy	play
0	overcast	hot	high	FALSE	yes
1	overcast	cool	normal	TRUE	yes
2	overcast	mild	high	TRUE	yes
3	overcast	hot	normal	FALSE	yes
4	rainy	mild	high	FALSE	yes
5	rainy	cool	normal	FALSE	yes
6	rainy	cool	normal	TRUE	no
7	rainy	mild	normal	FALSE	yes
8	rainy	mild	high	TRUE	no
9	sunny	hot	high	FALSE	no
10	sunny	hot	high	TRUE	no
11	sunny	mild	high	FALSE	no
12	sunny	cool	normal	FALSE	yes
13	sunny	mild	normal	TRUE	yes

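3. Entropy of the current dataset

- This is the entropy of the current data when it is split only by the play label.

- Collect the class labels with df.play.unique(), then iterate over them, using df.play.value_counts() to get each label's proportion and accumulate the entropy.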

In [4]:

# First measure the entropy with respect to play; the result is about 0.94.
entropy_node = 0
values = df.play.unique()
for value in values:
    fraction = df.play.value_counts()[value] / len(df.play)
    entropy_node += (-fraction)*(np.log2(fraction))
entropy_node

Out[4]:

0.9402859586706311
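
The same calculation can be sketched without an explicit loop. This is only an illustrative variant, not the code used in the rest of the post; the helper name `entropy` and the variable `h` are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def entropy(labels):
    # Class proportions; value_counts only lists labels that occur,
    # so log2 is never evaluated at 0
    p = pd.Series(labels).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

play = 'yes,yes,yes,yes,yes,yes,no,yes,no,no,no,no,yes,yes'.split(',')
h = entropy(play)   # 9 'yes' and 5 'no' out of 14 rows
print(round(h, 4))  # 0.9403
```

With 9 'yes' and 5 'no' rows this is -(9/14)*log2(9/14) - (5/14)*log2(5/14), which agrees with Out[4].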

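4. Entropy after splitting on an attribute

- Here we compute the entropy that remains when the data is split on one of df's attributes (['outlook', 'humidity', ...]). Note that the function takes df as an argument.

- df is passed in because the entropy of each subset has to be multiplied by a fraction computed from the attribute's values within df.

- So, after splitting on the attribute, we compute the entropy of each subset and weight it by fraction2 (the number of rows in df with that attribute value / the total number of rows in df). Summing these weighted terms gives the entropy for that attribute.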

In [5]:

def ent(df, attribute):
    # Distinct class labels of the target column play
    target_variables = df.play.unique()
    # Distinct values of the attribute; df is split on these values and the
    # entropy remaining after the split is computed
    variables = df[attribute].unique()

    entropy_attribute = 0
    for variable in variables:
        entropy_each_feature = 0
        # Rows where the attribute takes this value
        subset = df[df[attribute] == variable]
        den = len(subset)
        for target_variable in target_variables:
            num = len(subset[subset.play == target_variable])
            # eps guards against division by zero and, inside the log, against log(0)
            fraction = num / (den + eps)
            entropy_each_feature += -fraction * log(fraction + eps)
        # Weight the subset entropy by the share of rows that fall in the subset
        fraction2 = den / len(df)
        entropy_attribute += fraction2 * entropy_each_feature
    return abs(entropy_attribute)

In [6]:

a_entropy = {k:ent(df,k) for k in df.keys()[:-1]}
a_entropy

Out[6]:

{'outlook': 0.6935361388961914,
 'temp': 0.9110633930116756,
 'humidity': 0.7884504573082889,
 'windy': 0.892158928262361}
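
As a cross-check on the outlook value, the weighted entropy can also be sketched with groupby. This is a hedged alternative, not the post's ent function: the names `split_entropy`, `data`, and `h_outlook` are introduced here for illustration, and only the outlook and play columns are rebuilt.

```python
import numpy as np
import pandas as pd

outlook = ['overcast'] * 4 + ['rainy'] * 5 + ['sunny'] * 5
play = 'yes,yes,yes,yes,yes,yes,no,yes,no,no,no,no,yes,yes'.split(',')
data = pd.DataFrame({'outlook': outlook, 'play': play})

def split_entropy(frame, attribute, target='play'):
    total = 0.0
    # One group per attribute value
    for _, group in frame.groupby(attribute):
        # Class proportions inside the group (only labels that occur)
        p = group[target].value_counts(normalize=True)
        subset_entropy = -(p * np.log2(p)).sum()
        # Weight by the share of rows that fall in this group
        total += len(group) / len(frame) * subset_entropy
    return float(total)

h_outlook = split_entropy(data, 'outlook')
print(round(h_outlook, 4))  # 0.6935
```

The overcast group is pure (4 'yes'), so it contributes 0, while the rainy and sunny groups each contribute (5/14) times about 0.971, which agrees with the 'outlook' entry in Out[6].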

5. Difference between the entropy before the split and the entropy after splitting on each attribute (information gain)

In [7]:

def ig(e_dataset, e_attr):
    return e_dataset - e_attr

In [8]:

# The original data has an entropy of about 0.94, so subtracting the entropy
# after splitting on each attribute gives the following information gain.
IG = {k:ig(entropy_node,a_entropy[k]) for k in a_entropy}
IG

Out[8]:

{'outlook': 0.24674981977443977,
 'temp': 0.029222565658955535,
 'humidity': 0.15183550136234225,
 'windy': 0.048127030408270155}
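
ID3 splits on the attribute with the largest information gain. A minimal sketch of that selection step, with the values copied from Out[8] so that it runs on its own (the name `best` is an assumption used only here):

```python
# Information gain values copied from Out[8]
IG = {'outlook': 0.24674981977443977,
      'temp': 0.029222565658955535,
      'humidity': 0.15183550136234225,
      'windy': 0.048127030408270155}

# Pick the key whose information gain is largest
best = max(IG, key=IG.get)
print(best)  # outlook
```

So on this dataset outlook would be chosen for the first split.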
