[Alex] 데이터 장인의 블로그
[Machine Learning] Random Forest: Random Forest Code Implementation (feat. python)
Alex, Yoon 2020. 10. 4. 22:38
1. Import the required libraries.
from IPython.display import display, HTML
display(HTML("<style>.container {width:90% !important;}</style>"))
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("data/income_evaluation.csv")
df.shape
(32561, 15)
2. Assign column names, then check which variables are categorical and which are continuous.
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']
df.columns = col_names
# Distinguish continuous variables from categorical ones.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 32561 non-null int64
1 workclass 32561 non-null object
2 fnlwgt 32561 non-null int64
3 education 32561 non-null object
4 education_num 32561 non-null int64
5 marital_status 32561 non-null object
6 occupation 32561 non-null object
7 relationship 32561 non-null object
8 race 32561 non-null object
9 sex 32561 non-null object
10 capital_gain 32561 non-null int64
11 capital_loss 32561 non-null int64
12 hours_per_week 32561 non-null int64
13 native_country 32561 non-null object
14 income 32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
# Six continuous variables.
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 32561.0 | 38.581647 | 13.640433 | 17.0 | 28.0 | 37.0 | 48.0 | 90.0 |
fnlwgt | 32561.0 | 189778.366512 | 105549.977697 | 12285.0 | 117827.0 | 178356.0 | 237051.0 | 1484705.0 |
education_num | 32561.0 | 10.080679 | 2.572720 | 1.0 | 9.0 | 10.0 | 12.0 | 16.0 |
capital_gain | 32561.0 | 1077.648844 | 7385.292085 | 0.0 | 0.0 | 0.0 | 0.0 | 99999.0 |
capital_loss | 32561.0 | 87.303830 | 402.960219 | 0.0 | 0.0 | 0.0 | 0.0 | 4356.0 |
hours_per_week | 32561.0 | 40.437456 | 12.347429 | 1.0 | 40.0 | 40.0 | 45.0 | 99.0 |
categorical = [var for var in df.columns if df[var].dtype=='O']
df[categorical].head()
| | workclass | education | marital_status | occupation | relationship | race | sex | native_country | income |
---|---|---|---|---|---|---|---|---|---|
0 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | United-States | <=50K |
1 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States | <=50K |
2 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | United-States | <=50K |
3 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | United-States | <=50K |
4 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | Cuba | <=50K |
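3. Check the proportion of each value in every categorical variable, then decide how to handle null values and which values to replace.
# Proportion of each value (np.float was removed in NumPy 1.24, so divide by len(df) directly)
for var in categorical:
    print(df[var].value_counts() / len(df))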
Private 0.697030
Self-emp-not-inc 0.078038
Local-gov 0.064279
? 0.056386
State-gov 0.039864
Self-emp-inc 0.034274
Federal-gov 0.029483
Without-pay 0.000430
Never-worked 0.000215
Name: workclass, dtype: float64
HS-grad 0.322502
Some-college 0.223918
Bachelors 0.164461
Masters 0.052916
Assoc-voc 0.042443
11th 0.036086
Assoc-acdm 0.032769
10th 0.028654
7th-8th 0.019840
Prof-school 0.017690
9th 0.015786
12th 0.013298
Doctorate 0.012684
5th-6th 0.010227
1st-4th 0.005160
Preschool 0.001566
Name: education, dtype: float64
Married-civ-spouse 0.459937
Never-married 0.328092
Divorced 0.136452
Separated 0.031479
Widowed 0.030497
Married-spouse-absent 0.012837
Married-AF-spouse 0.000706
Name: marital_status, dtype: float64
Prof-specialty 0.127146
Craft-repair 0.125887
Exec-managerial 0.124873
Adm-clerical 0.115783
Sales 0.112097
Other-service 0.101195
Machine-op-inspct 0.061485
? 0.056601
Transport-moving 0.049046
Handlers-cleaners 0.042075
Farming-fishing 0.030527
Tech-support 0.028500
Protective-serv 0.019932
Priv-house-serv 0.004576
Armed-Forces 0.000276
Name: occupation, dtype: float64
Husband 0.405178
Not-in-family 0.255060
Own-child 0.155646
Unmarried 0.105832
Wife 0.048156
Other-relative 0.030128
Name: relationship, dtype: float64
White 0.854274
Black 0.095943
Asian-Pac-Islander 0.031909
Amer-Indian-Eskimo 0.009551
Other 0.008323
Name: race, dtype: float64
Male 0.669205
Female 0.330795
Name: sex, dtype: float64
United-States 0.895857
Mexico 0.019748
? 0.017905
Philippines 0.006081
Germany 0.004207
Canada 0.003716
Puerto-Rico 0.003501
El-Salvador 0.003255
India 0.003071
Cuba 0.002918
England 0.002764
Jamaica 0.002488
South 0.002457
China 0.002303
Italy 0.002242
Dominican-Republic 0.002150
Vietnam 0.002058
Guatemala 0.001966
Japan 0.001904
Poland 0.001843
Columbia 0.001812
Taiwan 0.001566
Haiti 0.001351
Iran 0.001321
Portugal 0.001136
Nicaragua 0.001044
Peru 0.000952
Greece 0.000891
France 0.000891
Ecuador 0.000860
Ireland 0.000737
Hong 0.000614
Trinadad&Tobago 0.000584
Cambodia 0.000584
Thailand 0.000553
Laos 0.000553
Yugoslavia 0.000491
Outlying-US(Guam-USVI-etc) 0.000430
Honduras 0.000399
Hungary 0.000399
Scotland 0.000369
Holand-Netherlands 0.000031
Name: native_country, dtype: float64
<=50K 0.75919
>50K 0.24081
Name: income, dtype: float64
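The proportions printed above can also be obtained directly from pandas. The following is a minimal sketch on a made-up toy column (the data is an assumption for illustration, not the income dataset), using the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Toy stand-in for one categorical column (values are made up for illustration).
toy = pd.DataFrame({"workclass": ["Private", "Private", "Private", "?"]})

# normalize=True returns each value's share of the rows instead of a raw count.
shares = toy["workclass"].value_counts(normalize=True)
print(shares)
```

The shares sum to 1, so a placeholder such as `?` shows up immediately with its weight in the column.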
# Replace ' ?' with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
df['workclass'] = df['workclass'].replace(' ?', np.nan)
df['occupation'] = df['occupation'].replace(' ?', np.nan)
df['native_country'] = df['native_country'].replace(' ?', np.nan)
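The placeholder being matched is `' ?'` with a leading space, which suggests the raw text values carry leading whitespace. As a hedged alternative, the sketch below strips whitespace from every text column of a made-up toy frame and then converts the bare `?` to NaN in one pass. The toy data and the variable name `text_cols` are assumptions for illustration; note that stripping also changes the other values (for example `' Private'` becomes `'Private'`).

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the raw file, where text values keep a leading space.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" ?", " Sales", " Sales"],
    "age": [39, 50, 38],
})

# Pick the non-numeric (text) columns.
text_cols = toy.select_dtypes(exclude="number").columns

# Strip surrounding whitespace, then turn the "?" placeholder into NaN.
for col in text_cols:
    toy[col] = toy[col].str.strip().replace("?", np.nan)

print(toy.isnull().sum())
```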
4. Check how many missing values each variable has.
df[categorical].isnull().sum()
workclass 1836
education 0
marital_status 0
occupation 1843
relationship 0
race 0
sex 0
native_country 583
income 0
dtype: int64
# Check the cardinality of the categorical variables
# (the more distinct labels a variable has, the higher its cardinality)
for var in categorical:
    print(var, ' contains ', len(df[var].unique()), ' labels')
workclass contains 9 labels
education contains 16 labels
marital_status contains 7 labels
occupation contains 15 labels
relationship contains 6 labels
race contains 5 labels
sex contains 2 labels
native_country contains 42 labels
income contains 2 labels
numerical = [var for var in df.columns if df[var].dtype !='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :\n\n', numerical)
There are 6 numerical variables
The numerical variables are :
['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
df[numerical].head()
| | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week |
---|---|---|---|---|---|---|
0 | 39 | 77516 | 13 | 2174 | 0 | 40 |
1 | 50 | 83311 | 13 | 0 | 0 | 13 |
2 | 38 | 215646 | 9 | 0 | 0 | 40 |
3 | 53 | 234721 | 7 | 0 | 0 | 40 |
4 | 28 | 338409 | 13 | 0 | 0 | 40 |
df[numerical].isnull().sum()
age 0
fnlwgt 0
education_num 0
capital_gain 0
capital_loss 0
hours_per_week 0
dtype: int64
5. Split the data into train and test sets, then replace null values in the categorical variables with the most frequent value (mode).
X = df.drop(['income'], axis=1)
y = df['income']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
# check the shape of X_train and X_test
X_train.shape, X_test.shape
((22792, 14), (9769, 14))
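The target is imbalanced (about 76% `<=50K` versus 24% `>50K`), and the split above is purely random. One option, shown here only as a sketch on synthetic labels, is to pass `stratify` to `train_test_split` so both splits keep the same class ratio. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with roughly the same 76/24 imbalance as `income`.
y_toy = np.array(["<=50K"] * 760 + [">50K"] * 240)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (almost) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_share = (y_tr == ">50K").mean()
test_share = (y_te == ">50K").mean()
print(train_share, test_share)
```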
categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']
numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']
# print percentage of missing values in the categorical variables in training set
X_train[categorical].isnull().mean()
workclass 0.055985
education 0.000000
marital_status 0.000000
occupation 0.056072
relationship 0.000000
race 0.000000
sex 0.000000
native_country 0.018164
dtype: float64
# print categorical variables with missing data
for col in categorical:
    if X_train[col].isnull().mean() > 0:
        print(col, X_train[col].isnull().mean())
workclass 0.055984555984555984
occupation 0.05607230607230607
native_country 0.018164268164268166
# Impute missing categorical values with the most frequent value of the training set.
# Assigning the result back avoids relying on inplace=True on a column selection,
# which newer pandas versions no longer apply to the parent DataFrame.
for df2 in [X_train, X_test]:
    for col in ['workclass', 'occupation', 'native_country']:
        df2[col] = df2[col].fillna(X_train[col].mode()[0])
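The same idea (learn the mode on the training set, apply it to both sets) can be expressed with scikit-learn's `SimpleImputer`. This is a minimal sketch on made-up frames, not the code used in this post; the names `train`, `test`, `train_filled`, and `test_filled` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train/test frames with missing categorical values.
train = pd.DataFrame(
    {"workclass": ["Private", "Private", np.nan, "State-gov"]}, dtype=object
)
test = pd.DataFrame({"workclass": [np.nan, "Local-gov"]}, dtype=object)

# Learn the most frequent value from the training set only, then apply it to both.
imputer = SimpleImputer(strategy="most_frequent")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```

Fitting only on the training set keeps information from the test set out of the preprocessing step.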
#!pip install category_encoders
import category_encoders as ce
# encode categorical variables with one-hot encoding
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship',
'race', 'sex', 'native_country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
X_train.shape
(22792, 105)
X_test.shape
(9769, 105)
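Both sets end up with 105 columns because the encoder is fitted on the training set and only applied to the test set. The sketch below illustrates that fit/transform pattern with scikit-learn's `OneHotEncoder` instead of `category_encoders` (a swapped-in library, chosen only for illustration) on made-up data, including a test value that never appears in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up frames; "Other" appears only in the test set.
train = pd.DataFrame({"sex": ["Male", "Female", "Male"]})
test = pd.DataFrame({"sex": ["Female", "Other"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
train_enc = enc.fit_transform(train).toarray()
test_enc = enc.transform(test).toarray()

print(enc.categories_)
print(test_enc)
```

The number of output columns is fixed by the categories seen during `fit`, so train and test always share the same layout.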
cols = X_train.columns
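Tree models do not normally need feature scaling, but it is applied here with later steps such as hyperparameter tuning and other performance improvements in mind. (For the random forest itself, scaling should make little difference, since splits depend only on the ordering of each feature's values.)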
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Pass cols directly; wrapping it in a list ([cols]) would create MultiIndex columns.
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
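As a rough check of how much scaling matters for a random forest, the sketch below trains the same model on raw and on `RobustScaler`-transformed features. It uses synthetic data from `make_classification`, which is an assumption for illustration and not the income dataset. The two accuracies should come out the same or nearly so.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the scaler on the training split only.
scaler = RobustScaler().fit(X_tr)

# Same model and seed, once on raw features and once on scaled features.
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_tr), y_tr
)

acc_raw = raw.score(X_te, y_te)
acc_scaled = scaled.score(scaler.transform(X_te), y_te)
print(acc_raw, acc_scaled)
```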
6. Train the random forest model.
# import Random Forest classifier
from sklearn.ensemble import RandomForestClassifier
# instantiate the classifier with 10 trees
# (n_estimators defaulted to 10 before scikit-learn 0.22; the default is now 100)
rfc = RandomForestClassifier(n_estimators=10, random_state=0)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
from sklearn.metrics import accuracy_score
print('Model accuracy score with 10 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
Model accuracy score with 10 decision-trees : 0.8446
7. Use 100 decision trees: train with n_estimators=100.
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the training set
rfc_100.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=0, verbose=0,
warm_start=False)
# Predict on the test set results
y_pred_100 = rfc_100.predict(X_test)
# Check accuracy score
print('Model accuracy score with 100 decision-trees : {0:0.4f}'.format(accuracy_score(y_test, y_pred_100)))
Model accuracy score with 100 decision-trees : 0.8521
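The comparison between 10 and 100 trees can be generalized into a small sweep over `n_estimators`. The following is a sketch on synthetic data (an assumption for illustration, so the numbers will not match the results above); the dictionary name `scores` is likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the income dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Train one forest per tree count and record its test accuracy.
scores = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    scores[n] = accuracy_score(y_te, model.predict(X_te))

for n, acc in scores.items():
    print(f"{n:>3} trees: accuracy = {acc:.4f}")
```

More trees usually reduce the variance of the ensemble, at the cost of longer training time, so gains tend to flatten out as the count grows.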
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_100))
precision recall f1-score support
<=50K 0.89 0.92 0.90 7407
>50K 0.73 0.62 0.67 2362
accuracy 0.85 9769
macro avg 0.81 0.77 0.79 9769
weighted avg 0.85 0.85 0.85 9769
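The precision and recall columns in the report come from the confusion matrix. As a small worked sketch on made-up labels (not the model's actual predictions), the numbers for the `>50K` class can be derived by hand and checked against scikit-learn:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true and predicted labels for eight samples.
y_true = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
y_hat = [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]

# Rows are true labels, columns are predicted labels, in the order given.
labels = ["<=50K", ">50K"]
cm = confusion_matrix(y_true, y_hat, labels=labels)
tn, fp, fn, tp = cm.ravel()

# Precision: of everything predicted >50K, how much really is >50K.
precision = tp / (tp + fp)
# Recall: of everything that really is >50K, how much was found.
recall = tp / (tp + fn)

print(cm)
print(precision, precision_score(y_true, y_hat, pos_label=">50K"))
print(recall, recall_score(y_true, y_hat, pos_label=">50K"))
```

In the report above, recall for `>50K` (0.62) is noticeably lower than for `<=50K` (0.92), which means the model misses a sizeable share of the high-income cases even though overall accuracy is 0.85.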
8. Check feature importance.
# create the classifier with n_estimators = 100
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_scores[:10]
fnlwgt 0.159772
age 0.149074
capital_gain 0.091299
hours_per_week 0.086339
education_num 0.065130
marital_status_1 0.058860
relationship_1 0.045279
capital_loss 0.029235
marital_status_3 0.023500
occupation_9 0.018112
dtype: float64
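`feature_importances_` is impurity-based, and such scores are known to favor features with many distinct values, which may be part of why a continuous column like `fnlwgt` ranks first. One way to cross-check is permutation importance on held-out data. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f5`), so it only illustrates the comparison and does not reproduce the ranking above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"f{i}" for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Impurity-based importance, computed from the training process.
impurity = pd.Series(model.feature_importances_, index=names)

# Permutation importance: drop in test score when one feature is shuffled.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
permuted = pd.Series(result.importances_mean, index=names)

comparison = pd.DataFrame({"impurity": impurity, "permutation": permuted})
print(comparison.sort_values("permutation", ascending=False))
```

If a feature scores high on impurity but near zero on permutation importance, its apparent importance probably does not carry over to unseen data.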