
[Machine Learning] Random Forest - Random Forest Code Implementation (feat. Python)

Alex, Yoon 2020. 10. 4. 22:38

1. Import the required libraries.

# Widen the Jupyter notebook container for easier viewing
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# Load the income evaluation dataset
df = pd.read_csv("data/income_evaluation.csv")
df.shape
(32561, 15)

2. Assign column names and identify the categorical and continuous variables.

col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names
# Distinguish continuous and categorical variables.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Six continuous variables.
df.describe().T
                  count           mean            std      min       25%       50%       75%        max
age             32561.0      38.581647      13.640433     17.0      28.0      37.0      48.0       90.0
fnlwgt          32561.0  189778.366512  105549.977697  12285.0  117827.0  178356.0  237051.0  1484705.0
education_num   32561.0      10.080679       2.572720      1.0       9.0      10.0      12.0       16.0
capital_gain    32561.0    1077.648844    7385.292085      0.0       0.0       0.0       0.0    99999.0
capital_loss    32561.0      87.303830     402.960219      0.0       0.0       0.0       0.0     4356.0
hours_per_week  32561.0      40.437456      12.347429      1.0      40.0      40.0      45.0       99.0
categorical = [var for var in df.columns if df[var].dtype=='O']
df[categorical].head()
          workclass  education      marital_status         occupation   relationship   race     sex  native_country  income
0         State-gov  Bachelors       Never-married       Adm-clerical  Not-in-family  White    Male   United-States   <=50K
1  Self-emp-not-inc  Bachelors  Married-civ-spouse    Exec-managerial        Husband  White    Male   United-States   <=50K
2           Private    HS-grad            Divorced  Handlers-cleaners  Not-in-family  White    Male   United-States   <=50K
3           Private       11th  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   United-States   <=50K
4           Private  Bachelors  Married-civ-spouse     Prof-specialty           Wife  Black  Female            Cuba   <=50K

3. Check the proportion of each value within the categorical variables, then decide which entries to treat as null or replace.

# proportion of each value within each categorical variable
for var in categorical:
    print(df[var].value_counts() / len(df))
 Private             0.697030
 Self-emp-not-inc    0.078038
 Local-gov           0.064279
 ?                   0.056386
 State-gov           0.039864
 Self-emp-inc        0.034274
 Federal-gov         0.029483
 Without-pay         0.000430
 Never-worked        0.000215
Name: workclass, dtype: float64
 HS-grad         0.322502
 Some-college    0.223918
 Bachelors       0.164461
 Masters         0.052916
 Assoc-voc       0.042443
 11th            0.036086
 Assoc-acdm      0.032769
 10th            0.028654
 7th-8th         0.019840
 Prof-school     0.017690
 9th             0.015786
 12th            0.013298
 Doctorate       0.012684
 5th-6th         0.010227
 1st-4th         0.005160
 Preschool       0.001566
Name: education, dtype: float64
 Married-civ-spouse       0.459937
 Never-married            0.328092
 Divorced                 0.136452
 Separated                0.031479
 Widowed                  0.030497
 Married-spouse-absent    0.012837
 Married-AF-spouse        0.000706
Name: marital_status, dtype: float64
 Prof-specialty       0.127146
 Craft-repair         0.125887
 Exec-managerial      0.124873
 Adm-clerical         0.115783
 Sales                0.112097
 Other-service        0.101195
 Machine-op-inspct    0.061485
 ?                    0.056601
 Transport-moving     0.049046
 Handlers-cleaners    0.042075
 Farming-fishing      0.030527
 Tech-support         0.028500
 Protective-serv      0.019932
 Priv-house-serv      0.004576
 Armed-Forces         0.000276
Name: occupation, dtype: float64
 Husband           0.405178
 Not-in-family     0.255060
 Own-child         0.155646
 Unmarried         0.105832
 Wife              0.048156
 Other-relative    0.030128
Name: relationship, dtype: float64
 White                 0.854274
 Black                 0.095943
 Asian-Pac-Islander    0.031909
 Amer-Indian-Eskimo    0.009551
 Other                 0.008323
Name: race, dtype: float64
 Male      0.669205
 Female    0.330795
Name: sex, dtype: float64
 United-States                 0.895857
 Mexico                        0.019748
 ?                             0.017905
 Philippines                   0.006081
 Germany                       0.004207
 Canada                        0.003716
 Puerto-Rico                   0.003501
 El-Salvador                   0.003255
 India                         0.003071
 Cuba                          0.002918
 England                       0.002764
 Jamaica                       0.002488
 South                         0.002457
 China                         0.002303
 Italy                         0.002242
 Dominican-Republic            0.002150
 Vietnam                       0.002058
 Guatemala                     0.001966
 Japan                         0.001904
 Poland                        0.001843
 Columbia                      0.001812
 Taiwan                        0.001566
 Haiti                         0.001351
 Iran                          0.001321
 Portugal                      0.001136
 Nicaragua                     0.001044
 Peru                          0.000952
 Greece                        0.000891
 France                        0.000891
 Ecuador                       0.000860
 Ireland                       0.000737
 Hong                          0.000614
 Trinadad&Tobago               0.000584
 Cambodia                      0.000584
 Thailand                      0.000553
 Laos                          0.000553
 Yugoslavia                    0.000491
 Outlying-US(Guam-USVI-etc)    0.000430
 Honduras                      0.000399
 Hungary                       0.000399
 Scotland                      0.000369
 Holand-Netherlands            0.000031
Name: native_country, dtype: float64
 <=50K    0.75919
 >50K     0.24081
Name: income, dtype: float64
# replace ' ?' with NaN (the raw values in this CSV carry a leading space)
df['workclass'].replace(' ?', np.nan, inplace=True)
df['occupation'].replace(' ?', np.nan, inplace=True)
df['native_country'].replace(' ?', np.nan, inplace=True)

4. Check how many missing values each variable has.

df[categorical].isnull().sum()
workclass         1836
education            0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
native_country     583
income               0
dtype: int64
# check the cardinality of the categorical variables
# the more labels a variable has, the higher its cardinality
for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')
workclass  contains  9  labels
education  contains  16  labels
marital_status  contains  7  labels
occupation  contains  15  labels
relationship  contains  6  labels
race  contains  5  labels
sex  contains  2  labels
native_country  contains  42  labels
income  contains  2  labels
numerical = [var for var in df.columns if df[var].dtype !='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :\n\n', numerical)
There are 6 numerical variables

The numerical variables are :

 ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
df[numerical].head()
   age  fnlwgt  education_num  capital_gain  capital_loss  hours_per_week
0   39   77516             13          2174             0              40
1   50   83311             13             0             0              13
2   38  215646              9             0             0              40
3   53  234721              7             0             0              40
4   28  338409             13             0             0              40
df[numerical].isnull().sum()
age               0
fnlwgt            0
education_num     0
capital_gain      0
capital_loss      0
hours_per_week    0
dtype: int64

5. Split the data into train and test sets, then fill the null values in the categorical variables with the most frequent value (mode).

X = df.drop(['income'], axis=1)
y = df['income']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# check the shape of X_train and X_test
X_train.shape, X_test.shape
((22792, 14), (9769, 14))
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']
# print percentage of missing values in the categorical variables in training set
X_train[categorical].isnull().mean()
workclass         0.055985
education         0.000000
marital_status    0.000000
occupation        0.056072
relationship      0.000000
race              0.000000
sex               0.000000
native_country    0.018164
dtype: float64
# print categorical variables with missing data
for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))
workclass 0.055984555984555984
occupation 0.05607230607230607
native_country 0.018164268164268166
# impute missing categorical variables with most frequent value
for df2 in [X_train, X_test]:
    df2['workclass'].fillna(X_train['workclass'].mode()[0], inplace=True)
    df2['occupation'].fillna(X_train['occupation'].mode()[0], inplace=True)
    df2['native_country'].fillna(X_train['native_country'].mode()[0], inplace=True)    
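
An equivalent approach, sketched here assuming scikit-learn >= 0.20, uses SimpleImputer with strategy='most_frequent' in place of the fillna loop above:

from sklearn.impute import SimpleImputer

# fit the most-frequent imputer on the training set only, then reuse it on test
imputer = SimpleImputer(strategy='most_frequent')
cols_with_na = ['workclass', 'occupation', 'native_country']
X_train[cols_with_na] = imputer.fit_transform(X_train[cols_with_na])
X_test[cols_with_na] = imputer.transform(X_test[cols_with_na])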
#!pip install category_encoders
import category_encoders as ce
# encode categorical variables with one-hot encoding
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 
                                 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
X_train.shape
(22792, 105)
X_test.shape
(9769, 105)
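
If category_encoders is unavailable, a similar encoding can be sketched with scikit-learn's built-in OneHotEncoder (assuming scikit-learn >= 0.20; the resulting column count may differ slightly from the 105 above):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical columns and pass the numerical ones through
ct = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), categorical)],
    remainder='passthrough')
X_train_alt = ct.fit_transform(X_train)  # learn the categories on train only
X_test_alt = ct.transform(X_test)        # unseen test categories encode as all zeros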
cols = X_train.columns

Tree models do not inherently require normalization, but it is applied here because it is needed for model-improvement steps such as hyperparameter tuning (see the tuning sketch after the scaling code below).

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train = pd.DataFrame(X_train, columns=cols)  # restore the column names after scaling
X_test = pd.DataFrame(X_test, columns=cols)
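
Since the note above motivates scaling by hyperparameter tuning, here is a minimal tuning sketch with GridSearchCV; the grid values are illustrative assumptions, not tuned choices:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}
# 3-fold cross-validated grid search over the forest's main hyperparameters
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)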

6. Train the Random Forest model.

# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# instantiate the classifier with 10 trees (the default in older scikit-learn versions;
# set explicitly here so the printed message below stays accurate on newer versions)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score with 10 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))
Model accuracy score with 10 decision-trees : 0.8446

7. Train with 100 decision trees (n_estimators=100).

rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set
rfc_100.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
# Predict on the test set results
y_pred_100 = rfc_100.predict(X_test)
# Check accuracy score 
print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))
Model accuracy score with 100 decision-trees : 0.8521
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_100))
              precision    recall  f1-score   support

       <=50K       0.89      0.92      0.90      7407
        >50K       0.73      0.62      0.67      2362

    accuracy                           0.85      9769
   macro avg       0.81      0.77      0.79      9769
weighted avg       0.85      0.85      0.85      9769
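
To see where the recall on the >50K class is lost, the confusion matrix can be visualized with the seaborn import from step 1; a minimal sketch:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred_100)
# rows are actual classes, columns are predicted classes
sns.heatmap(pd.DataFrame(cm), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()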

8. Check the feature importances.

# create the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores[:10]
fnlwgt              0.159772
age                 0.149074
capital_gain        0.091299
hours_per_week      0.086339
education_num       0.065130
marital_status_1    0.058860
relationship_1      0.045279
capital_loss        0.029235
marital_status_3    0.023500
occupation_9        0.018112
dtype: float64
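
A minimal sketch plotting these top-10 scores with seaborn, using the feature_scores Series computed above:

top10 = feature_scores[:10]
sns.barplot(x=top10.values, y=top10.index)  # horizontal bars, most important first
plt.xlabel('Feature importance')
plt.show()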