[The Based Models 2] Boosting

bomishot 2023. 4. 14. 17:37

🌀 Bagging vs Boosting

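Bagging: sample with replacement → train the base models (each base model is trained independently and in parallel) → combine the base models' predictions with equal weight

Boosting: models are trained sequentially. Each new model focuses on the parts that the models trained so far predict poorly.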

	Bagging	Boosting
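Influence between base models	Base models do not affect each other and are built individually	Each model is built to focus on what the previous base models failed to predict
Dataset	Built by random sampling with replacement from the original dataset (bootstrap)	Built by giving higher weights to the data points with large errors in the previous round, then sampling at random
Variance and bias	Errors that show up differently across base models cancel out, reducing variance → addresses overfitting	Repeating the boosting step raises the final model's complexity and reduces bias → addresses underfitting
Final result	Average of the base models (regression) or majority vote (classification)	Combines the base models' outputs (typically a weighted sum) to predict
Representative algorithms	Random Forest	AdaBoost, GBM, XGBoost, LightGBM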

cf) XGBoost's slow training gets much better when you switch to LightGBM!

Let's look at the representative boosting algorithms, AdaBoost and Gradient Boost!

AdaBoost

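Suited to classification problems. More sensitive to outliers than Gradient Boost and generally weaker, so it is rarely used..

Gives higher weights to misclassified observations and samples with those weights. (weighted sampling)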
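
The reweighting step can be sketched in a few lines of plain Python. This is a minimal sketch of one round of the classic discrete AdaBoost update, assuming labels in {-1, +1} and weights that already sum to 1; the function name `adaboost_round` and the toy data are made up for illustration.

```python
import math


def adaboost_round(y_true, y_pred, weights):
    """One AdaBoost reweighting step for labels in {-1, +1}."""
    # Weighted error rate of the current weak learner.
    err = sum(w for y, p, w in zip(y_true, y_pred, weights) if y != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
    # Vote of this learner: larger when its weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified points (y * p == -1) are scaled up, correct ones down.
    new_w = [w * math.exp(-alpha * y * p)
             for y, p, w in zip(y_true, y_pred, weights)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # only the last sample is misclassified
weights = [0.2] * 5
alpha, new_weights = adaboost_round(y_true, y_pred, weights)
print(new_weights)
```

After this round the single misclassified sample carries about half of the total weight, so a weighted draw for the next learner's training set (for example with `random.choices(samples, weights=new_weights, k=...)`) would pick it far more often than the others.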

🌀 Gradient Boost

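Works for both regression and classification

Strong performance!

Very popular on Kaggle and in industry!

Many libraries implement it, so a model is easy to build

To focus on the data it got wrong, Gradient Boosting trains on the residuals.

Samples that the previous trees predicted badly get more attention from the next tree. This is not done by drawing those samples more often or by reweighting them: the residual error of each sample is computed and the next tree is fit to those residuals, so samples with large residuals dominate the fit.

It usually performs better than a single model. Whatever the first model fails to learn is left over as error, and each later model keeps training on that remaining error.

The effect is that observations with large residuals are learned more; each model directly learns how far off the previous ones were and complements them sequentially.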
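
The residual-fitting loop described above can be sketched in plain Python. This is a minimal sketch, assuming squared-error loss, one-dimensional inputs, and depth-1 regression trees (stumps) as weak learners; `fit_stump`, `gradient_boost`, `mse`, and the toy data are made up for illustration and are not how any particular library implements it.

```python
def fit_stump(x, targets):
    """Fit a depth-1 regression tree (stump) to 1-D inputs."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # splits that leave both sides non-empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds, learning_rate=0.5):
    """Boost stumps on the residuals of the current ensemble."""
    base = sum(y) / len(y)  # round 0: predict the mean
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # what is still wrong
        stump = fit_stump(x, residuals)                   # next model learns the residuals
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 5.2, 5.0, 4.8]
for rounds in (0, 1, 20):
    print(rounds, mse(gradient_boost(x, y, rounds), x, y))
```

The training error shrinks as rounds are added, because every new stump is fit to whatever the ensemble still gets wrong. On training data that is guaranteed here; on unseen data more rounds can eventually overfit.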

🌀 XGBoost

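Better performance and computation speed than plain Gradient Boosting!

XGBoost library: a Gradient Boosting Decision Tree implementation released in 2014 that has stayed popular on Kaggle and elsewhere.

The GradientBoostingRegressor and GradientBoostingClassifier classes in the scikit-learn ensemble module are also Gradient Boosting Decision Tree models, but they lag behind XGBoost in performance and computation speed, so they are not used as often.

Gradient Boosting Decision Tree follows the characteristics of Tree-Based models as they are.

Features need to be converted to numbers. (CatBoost and others can handle string-type features as they are.)
No scaling or normalization of features is needed.
Ordinal encoding is preferred over one-hot encoding.
- Especially for high-cardinality features, one-hot encoding consumes a lot of training time, memory, and compute, so take care.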
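
To see why one-hot encoding gets expensive as cardinality grows, here is a small plain-Python sketch comparing the two encodings on a toy column. The helper names and the city values are made up, and the integer order (alphabetical, starting at 1) is an arbitrary choice for the sketch rather than what any specific encoder library does.

```python
def ordinal_encode(values):
    """Map each category to one integer: the column stays a single column."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Map each category to its own 0/1 column: one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


cities = ["seoul", "busan", "daegu", "seoul", "incheon"]
ordinal, mapping = ordinal_encode(cities)
onehot, categories = one_hot_encode(cities)

print(ordinal)         # one integer per row
print(len(onehot[0]))  # number of columns created by one-hot encoding
```

With 4 distinct cities the one-hot version already needs 4 columns; a feature with thousands of distinct values would need thousands, while the ordinal version always stays at one.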

XGBoost parameters

booster

Parameter that sets the weak learner model
gbtree: uses a Decision Tree model
dart: uses a Decision Tree model, regularized with the DART algorithm
- A technique that drops some of the previously trained trees to prevent overfitting
- https://xgboost.readthedocs.io/en/latest/tutorials/dart.html

objective

Sets the objective function to minimize
(default) XGBClassifier: binary:logistic / XGBRegressor: reg:squarederror
https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters

eval_metric

When validation data is passed in as well, sets the metric used to evaluate it.
(default) regression: rmse, classification: logloss
eval_metric='error': evaluates with 1 - accuracy

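Key XGBoost hyperparameters

XGBoost's performance varies a lot with its hyperparameters, so consider every one of them when using it!!!

n_estimators

Sets the number of weak learners (for a random forest, this would be the number of decision trees).

learning_rate

Sets how much of each weak learner is applied at each step
Range 0~1
- Too large, and overfitting happens easily.
- Too small, and training becomes slow.
Usually searched in the range of about 0.05~0.3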
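
Why a small learning_rate makes training slow can be seen with a deliberately simplified sketch. It assumes a single target value and an idealized weak learner that predicts the current residual exactly, so only the shrinkage effect remains; `rounds_to_fit` is a made-up helper, not a library function.

```python
def rounds_to_fit(target, learning_rate, tol=0.01, max_rounds=10_000):
    """Count boosting rounds until one target is fit to within tol.

    Each idealized weak learner predicts the current residual exactly,
    and only a learning_rate fraction of that prediction is added.
    """
    pred = 0.0
    for n in range(1, max_rounds + 1):
        residual = target - pred
        pred += learning_rate * residual  # shrunken update
        if abs(target - pred) <= tol:
            return n
    return max_rounds


for lr in (0.3, 0.1, 0.05):
    print(lr, rounds_to_fit(1.0, lr))
```

Each round removes only a learning_rate fraction of the remaining error, so the residual after n rounds is (1 - learning_rate)^n of the original. Here 0.3 needs 13 rounds and 0.05 needs 90, which is why a lower learning_rate is usually paired with a larger n_estimators.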

max_depth

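Sets the maximum depth of each weak learner tree
One of the variables with the biggest impact on model performance!!!
Too large, and overfitting happens easily and memory usage grows.
Usually searched in the range of about 5-12

min_child_weight

Sets the minimum sum of instance weights (hessian) a leaf node must contain; for squared-error regression this is simply the number of observations in the leaf
The larger the value, the lower the complexity of the weak learners
Typically, when overfitting, try doubling the value: 1, 2, 4, 8…

subsample

When training each weak learner, only a sample of the full data is used, to prevent overfitting and improve generalization.
The subsample parameter sets the fraction of the data to sample.
Usually set to about 0.8; it can vary with the size of the data.

colsample_bytree

Fraction of columns to sample
Usually set to about 0.8; it can vary with the number of features. When there are very many features (a thousand or more), a very small value such as 0.1 is sometimes used.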
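
What subsample and colsample_bytree do for a single tree can be sketched with the standard library. This is a rough sketch assuming uniform sampling without replacement for both rows and columns; `sample_rows_and_columns` and the column names are made up, and the real library does this internally.

```python
import random


def sample_rows_and_columns(n_rows, columns, subsample, colsample_bytree, seed=0):
    """Pick the rows and columns one tree would be trained on."""
    rng = random.Random(seed)
    n_sampled_rows = max(1, int(n_rows * subsample))
    n_sampled_cols = max(1, int(len(columns) * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_sampled_rows))  # without replacement
    cols = sorted(rng.sample(columns, n_sampled_cols))
    return rows, cols


columns = ["age", "income", "city", "job", "score"]
rows, cols = sample_rows_and_columns(10, columns, subsample=0.8, colsample_bytree=0.8)
print(len(rows), len(cols))  # 8 4
```

Because every tree sees a slightly different slice of rows and columns, the trees are less correlated with each other, which is where the regularizing effect comes from.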

scale_pos_weight

For imbalanced targets!!
Setting it to sum(negative cases) / sum(positive cases) gives the same relative class weighting as scikit-learn's class_weight='balanced' option.
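
The ratio is easy to compute by hand. The sketch below uses a made-up label list with 75 negatives and 25 positives, and compares the result with the 'balanced' heuristic n_samples / (n_classes * class_count); both helper functions are illustrative, not library calls.

```python
def scale_pos_weight(labels):
    """sum(negative cases) / sum(positive cases) for 0/1 labels."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_neg / n_pos


def balanced_class_weights(labels):
    """n_samples / (n_classes * class_count), the 'balanced' heuristic."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


labels = [0] * 75 + [1] * 25
ratio = scale_pos_weight(labels)
weights = balanced_class_weights(labels)
print(ratio)
print(weights[1] / weights[0])
```

Both numbers come out at 3 (up to floating-point rounding): the positive class is weighted three times as heavily as the negative class either way. Only the absolute scale of the weights differs.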

In general, max_depth and learning_rate are the most important hyperparameters; the others are tuned additionally to prevent overfitting.

https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters

Early Stopping

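A way to stop training once performance has not improved for a set number of rounds, even before the specified n_estimators is reached
Convenient because n_estimators does not have to be adjusted every time another hyperparameter is tuned.
In the XGBoost library, set early_stopping_rounds to use it
You must provide eval_set, the reference dataset used to decide early stopping.
When several eval_sets are given, the last dataset is the reference.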

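from xgboost import XGBClassifier

# scale_pos_weight = (share of negatives) / (share of positives)
class_ratio = train[target].value_counts(normalize=True)

model = XGBClassifier(objective='binary:logistic',
                      eval_metric='error',
                      n_estimators=3244224,  # large upper bound; early stopping decides the actual count
                      n_jobs=-1,
                      max_depth=7,
                      learning_rate=0.1,
                      scale_pos_weight=class_ratio[0] / class_ratio[1],
                      reg_lambda=1,
                      early_stopping_rounds=50)  # stop training if there is no improvement for 50 rounds
watchlist = [(x_train, y_train), (x_val, y_val)]  # the last set is the one monitored

# xgboost >= 1.6 takes early_stopping_rounds in the constructor;
# older versions take it as a fit() argument instead.
model.fit(x_train, y_train,
          eval_set=watchlist)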
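
The stopping rule itself is simple enough to sketch without any library. This is a simplified illustration, assuming a lower-is-better validation metric such as 'error'; the `early_stopping` function and the list of validation errors are made up, and the library's internal bookkeeping may differ in details.

```python
def early_stopping(val_errors, patience):
    """Return (best_iteration, stop_iteration) for a lower-is-better metric."""
    best_iter, best_err = 0, float("inf")
    for i, err in enumerate(val_errors):
        if err < best_err:
            best_iter, best_err = i, err       # new best: reset the wait
        elif i - best_iter >= patience:
            return best_iter, i                # no improvement for `patience` rounds
    return best_iter, len(val_errors) - 1      # ran through every round


val_errors = [0.30, 0.25, 0.22, 0.23, 0.22, 0.24, 0.25, 0.21]
print(early_stopping(val_errors, patience=3))  # (2, 5)
```

Training stops at round 5 because rounds 3, 4, and 5 never beat the 0.22 reached at round 2, so the later 0.21 is never seen. A larger patience trades extra training time for a lower chance of stopping too early.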

Kaggle - H1N1 flu vaccine response

from sklearn.model_selection import train_test_split

# Split the data 80/20 into train / validation
train, val = train_test_split(train, test_size=0.2, random_state=42, stratify=train[target])

# Feature Engineering
def engineer(df):
  # drop high cardinality columns (keep columns with at most 30 unique values)
  selected_cols = df.select_dtypes(include=["number", "object"])
  labels = selected_cols.nunique()
  selected_features = labels[labels <= 30].index.tolist()
  df = df[selected_features].copy()  # copy so the assignment below does not write into a view

  # new feature: sum of the behavioral columns
  behaviorals = [col for col in df.columns if 'behavioral' in col]
  df['behaviorals'] = df[behaviorals].sum(axis=1)

  # drop feature
  dels = [col for col in df.columns if ('employment' in col or 'seas' in col)]
  df = df.drop(columns=dels)

  return df


train = engineer(train.copy())
val = engineer(val.copy())
test = engineer(test.copy())
train.shape, val.shape, test.shape
>> ((33723, 32), (8431, 32), (28104, 31))

# feature, label 분리
x_train, y_train = train.drop(target, axis=1), train[target]
x_val, y_val = val.drop(target, axis=1), val[target]
x_test = test

from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from xgboost import XGBClassifier

encoder = OrdinalEncoder()
x_train = encoder.fit_transform(x_train)
x_val = encoder.transform(x_val)
x_test = encoder.transform(x_test)

model = XGBClassifier(
    objective='binary:logistic',
    eval_metric='error',
    n_estimators=100,
    random_state=42,
    n_jobs=-1,
    max_depth=7,
    learning_rate=0.1,
    scale_pos_weight=3,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=16,  # minimum sum of instance weights (hessian) in a child; when overfitting, raise it 1, 2, 4, 8, ...
    reg_lambda=1,  # L2 regularization strength: larger values shrink the leaf weights and curb overfitting
    early_stopping_rounds=50)  # xgboost >= 1.6; older versions pass this to fit() instead

eval_set = [(x_train, y_train), (x_val, y_val)]

model.fit(x_train, y_train,
          eval_set=eval_set)


print('train score :', model.score(x_train, y_train))
print('val score :', model.score(x_val, y_val))
>> 
train score :0.7967
val score : 0.7771

Compared with the random forest model, the risk of overfitting went down.

When submitted to Kaggle, the XGBClassifier f1-score: 0.61997

Next time, let's push the score higher with another model!!

More study

Tree visualization

# import the plotting modules
import matplotlib.pyplot as plt
from xgboost import plot_tree

# plot
plot_tree(model,        # model to plot
          num_trees=0,  # index of the tree to show
          rankdir='TB') # direction: 'LR' or 'TB'
plt.gcf().set_size_inches(13, 13);

LightGBM

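import lightgbm as lgb

# Configure the LightGBM model
model = lgb.LGBMClassifier(
    objective='binary',     # LightGBM's name for binary log-loss classification
    metric='binary_error',  # 1 - accuracy, the counterpart of XGBoost's 'error'
    n_estimators=100,
    random_state=42,
    num_leaves=31,  # max number of leaf nodes (final regions that are not split further) per tree
    learning_rate=0.1,
    max_depth=5,
    reg_lambda=1,
    scale_pos_weight=3,
    subsample=0.8,  # only takes effect when subsample_freq > 0
    colsample_bytree=0.8,
    min_child_weight=16
)

# X_train_encoded / X_val_encoded: the ordinal-encoded features
watchlist = [(X_train_encoded, y_train), (X_val_encoded, y_val)]
# Train the LightGBM model; stop if the validation metric does not improve for 50 rounds
# (lightgbm >= 4.0 uses the callback; older versions also accepted early_stopping_rounds in fit())
model.fit(X_train_encoded, y_train, eval_set=watchlist,
          callbacks=[lgb.early_stopping(stopping_rounds=50)])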

The LightGBM model is definitely faster than XGB..!

However, on the dataset I used, the XGB model performed better.

[Reference]

Python libraries that implement Gradient Boosting

scikit-learn Gradient Tree Boosting: can be relatively slow.
- Anaconda: already installed
- Google Colab: already installed
xgboost: accepts missing values and can enforce monotonic constraints.
- Anaconda, Mac/Linux: conda install -c conda-forge xgboost
- Windows: conda install -c anaconda py-xgboost
- Google Colab: already installed
LightGBM: accepts missing values and can enforce monotonic constraints.
- Anaconda: conda install -c conda-forge lightgbm
- Google Colab: already installed
CatBoost: accepts missing values and can use categorical features without preprocessing.
- Anaconda: conda install -c conda-forge catboost
- Google Colab: pip install catboost

Bagging review

Bootstrap aggregating bagging

Boosting

AdaBoost
Gradient Boosting
Bagging vs Boosting
Single estimator versus bagging: bias-variance decomposition
Understanding AdaBoost
Friedman, Jerome H. "Greedy function approximation: a gradient boosting machine." Annals of statistics (2001): 1189-1232.
Gradient Boosting Diagram
Monotonic Constraint
DART
XGBoost Parameters
Avoid Overfitting By Early Stopping With XGBoost In Python
데이터가 뛰어노는 AI 놀이터, 캐글 상위 랭킹 진입을 위한 필살기 (Korean edition; by Daisuke Kadowaki, Ryuji Sakata, Keisuke Hosaka, Yuji Hiramatsu; translated by 대니얼WJ)