Accuracy Improvement: Ensemble Learning

Introduction

In everyday life, a single person is much weaker than a group of people. A simple example is tug-of-war: a team has far more strength than an individual and is much more likely to win. Similarly, when deciding whether a person should be punished, the judgement of a group of jurors is less biased than that of a single juror, since the group considers different aspects of the case.
In machine learning, ensemble methods apply the same idea: they combine the decisions of different models to improve accuracy. Three common ensemble methods are bagging, boosting, and stacking.

Bagging (bootstrap aggregation)

  • Main Idea:
    Its goal is to reduce variance by combining multiple classifiers, such as decision trees. It uses the bootstrap method to sample the data and train multiple classifiers, then averages the predictions over the collection of classifiers (for continuous-value prediction, i.e. regression) or returns the class with the most votes (for classification). A minimal code sketch is given at the end of this section.
    Random forest is a bagging approach that bags a set of decision trees together.


  • Assumptions:
    We have a training set of size D and a set of K models.

  • Training:

    1. As in the bootstrap, at each iteration i, a training set Di of d tuples is sampled with replacement from the original training set.
    2. A classifier model Mi is learned on each training set Di.
  • Prediction:

    1. Each classifier Mi returns a prediction for input X.
    2. Discrete-value output (classification): the bagged classifier counts the votes and assigns the class with the most votes to X.
    3. Continuous-value output (regression): take the average of the classifiers' predictions for the given test sample.
  • Advantages:

    1. Often better than any single classifier from the classifier set.
    2. More robust to noisy data and hence has smaller variance.
  • Disadvantages:

    1. Its training depends on the sampling technique, which can affect accuracy.
    2. The prediction may not be precise, since it averages the classifiers' predictions.

      For example, if the valid prediction values are 1, 2, or 3, then averaging the predictions from different models can produce a floating-point number.

  • Features of Random Forest

    1. Comparable in accuracy to AdaBoost, but more robust to errors and outliers.

    2. Insensitive to the number of attributes selected for consideration at each split, and faster than boosting.
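
To make the bagging workflow above concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the number of estimators, and the other hyperparameter values are illustrative assumptions, not part of the notes.

```python
# A minimal bagging / random-forest sketch with scikit-learn.
# The synthetic dataset and hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: K decision trees (the default base estimator), each trained on a
# bootstrap sample drawn with replacement; the bagged classifier predicts
# the class with the most votes.
bag = BaggingClassifier(
    n_estimators=50,   # K models
    max_samples=1.0,   # each bootstrap sample has d = D tuples
    bootstrap=True,    # sample with replacement
    random_state=0,
)
bag.fit(X_train, y_train)
print("bagging accuracy:", bag.score(X_test, y_test))

# Random forest: bagged decision trees plus a random subset of attributes
# considered at each split.
rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)
print("random forest accuracy:", rf.score(X_test, y_test))
```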


Boosting

  • Main Idea:
    Its goal is to improve accuracy by letting the models fit the training set better. It uses weighted votes from a collection of classifiers.


  • Training:

    1. Each training sample/tuple is given a weight, e.g. $w_i$ for the $i^{th}$ tuple. The training set is then $\{(X_1, y_1, w_1), \dots, (X_D, y_D, w_D)\}$,
      where $X_i$ and $y_i$ are the $i^{th}$ training sample and its target, and D is the number of training tuples.
    2. We have k classifiers $\{M_1, M_2, \dots, M_k\}$. Each classifier is learned from the whole training set iteratively: with k classifiers, we iterate over the training set at least k times (at the $i^{th}$ iteration we train the $i^{th}$ classifier), so that each classifier can learn from the training set.
    3. After classifier $M_i$ is learned on the training set, classifier $M_{i+1}$ pays more attention to the training samples that were misclassified by $M_i$.
  • Prediction:
    The final model combines the votes of the individual classifiers: it either returns the prediction with the largest sum of weights (classification) or the average of all predicted values (regression).

      The weight of each classifier's vote is a function of its accuracy.
    

  • Advantages

    1. Boosting can be extended to numeric prediction.
    2. It fits the training set better, since it adjusts the weights of the training samples and gives more attention to misclassified samples.
  • Disadvantages

    1. Easy to overfit; additional techniques are needed to avoid overfitting. (I will discuss methods for dealing with overfitting separately.)
  • Questions:

    1. How do we pay more attention to the misclassified samples? By giving them higher weights? But how are those weights computed?
      Answer: this depends on the specific boosting algorithm, e.g. gradient boosting or AdaBoost.
  • AdaBoost

    1. Assumption:
      Assume we have a training set of size D and a set of T classifier models.

    2. __Error of model $M_j$__

      $\mathrm{Error}(M_j) = \sum_{i=1}^D w_i \cdot err(X_i)$

      To keep the error in the range [0,1], normalize by the total weight:
      $\mathrm{Error}(M_j) = \frac{\sum_{i=1}^D w_i \cdot err(X_i)}{\sum_{i=1}^D w_i}$


      Note:
      In classification, $err(X_i) = 1(M_j(X_i) \neq Y_i)$, where $M_j(X_i)$ denotes the prediction of model $M_j$ on sample $X_i$. If the prediction is correct, $err(X_i) = 0$; otherwise it is 1.


    3. Weight of model $M_j$'s vote: $\alpha_j$
      $\alpha_j = \log\frac{1-\mathrm{Error}(M_j)}{\mathrm{Error}(M_j)} + \log(K-1)$

      Note:
      K is the number of classes in the dataset. When K = 2 (binary classification), $\log(K-1) = \log 1 = 0$, so that term vanishes.


    4. Update of weights
      $w_i \leftarrow w_i \cdot \exp(\alpha_j \cdot 1(M_j(X_i) \neq Y_i))$


      The weight $w_i$ of the $i^{th}$ training tuple $X_i$ is multiplied by the exponential of model $M_j$'s voting weight only when $M_j$ misclassifies $X_i$ (that is, when $M_j(X_i) \neq Y_i$ and hence $1(M_j(X_i) \neq Y_i) = 1$). The weights are then renormalized so that they sum to 1.


    5. Prediction
      $C(X) = \arg\max_{k} \sum_{j=1}^T \alpha_{j}\cdot 1(M_j(X) = k)$
      where $1(M_j(X) = k)$ equals 1 if model $M_j$ predicts class $k$ for $X$, and 0 otherwise.
      The ensemble predicts the class with the largest sum of model voting weights, NOT the weights of the training tuples!


    6. For more detail on AdaBoost, read this paper. A minimal code sketch of the update rules above is given right after this list.
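
The AdaBoost formulas above can be implemented in a few lines. Below is a minimal sketch of the multi-class (SAMME-style) update rules, using scikit-learn decision stumps as the base classifiers; the synthetic dataset, the number of rounds T, and the stump depth are illustrative assumptions, not part of the notes.

```python
# A minimal sketch of the AdaBoost update rules described above, using
# scikit-learn decision stumps as the base classifiers M_j.
# The synthetic dataset, T = 25 rounds, and stump depth are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
D, K, T = len(Y), len(np.unique(Y)), 25   # D tuples, K classes, T rounds

w = np.full(D, 1.0 / D)                   # normalized tuple weights
models, alphas = [], []

for j in range(T):
    M_j = DecisionTreeClassifier(max_depth=1, random_state=j)
    M_j.fit(X, Y, sample_weight=w)
    miss = (M_j.predict(X) != Y)              # err(X_i) = 1(M_j(X_i) != Y_i)

    error = np.sum(w * miss) / np.sum(w)      # Error(M_j)
    error = np.clip(error, 1e-10, 1 - 1e-10)  # keep the log finite
    alpha = np.log((1 - error) / error) + np.log(K - 1)  # voting weight alpha_j

    w = w * np.exp(alpha * miss)              # boost misclassified tuples
    w = w / np.sum(w)                         # renormalize to sum to 1
    models.append(M_j)
    alphas.append(alpha)

# Prediction: the class with the largest sum of voting weights alpha_j.
votes = np.zeros((D, K))
for alpha, M_j in zip(alphas, models):
    votes[np.arange(D), M_j.predict(X)] += alpha
print("training accuracy:", np.mean(votes.argmax(axis=1) == Y))
```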


Stacking

  • Main Idea:
    It combines and trains a set of heterogeneous classifiers in parallel.
    It consists of two levels of models (a minimal code sketch is given at the end of this section):
    level-0: base models
    These are fit on the training data, and their predictions are collected.
    level-1: meta-model
    It learns how best to combine the predictions of the base models.


  • Training:

    1. Split the training data into K folds.
    2. Fit one of the base models on K-1 folds and make predictions for the K-th fold.
    3. Repeat step 2 for each fold.
    4. Fit the base model on the whole training set so that it can later make predictions on the test set.
    5. Repeat steps 2-4 for each base model.
    6. The out-of-fold predictions on the training set are used as input features for the second-level (meta) model.
  • Prediction:
    The second-level model is used to make the final prediction on the test set, using the base models' predictions as its inputs.


  • Advantage:

    1. It harnesses the capabilities of a range of well-performing models on a classification or regression task and can make predictions with better performance than any single model in the ensemble.
  • Disadvantage:

    1. It can be computationally expensive, since it uses the k-fold procedure and multiple levels of models.
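
As a concrete illustration of the two-level setup, here is a minimal sketch using scikit-learn's StackingClassifier. The base models, the meta-model, and cv=5 are illustrative choices, not prescriptions from the notes.

```python
# A minimal stacking sketch with scikit-learn's StackingClassifier.
# Base models, the meta-model, and cv=5 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level-0: heterogeneous base models whose out-of-fold predictions
# become the features for the level-1 meta-model.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(probability=True, random_state=0)),
]

# Level-1: meta-model that learns how to combine the base predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,   # K-fold scheme used to build the out-of-fold features
)
stack.fit(X_train, y_train)
print("stacking accuracy:", stack.score(X_test, y_test))
```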

Comparison among bagging, boosting, and stacking

  1. The goal of bagging is to reduce variance and noise, while boosting aims to improve accuracy using weighted models. Stacking aims to improve accuracy by combining heterogeneous models.
  2. AdaBoost makes the classifiers pay more attention to misclassified samples, but if those misclassified samples are outliers or noisy data, they have a large effect and lead to higher variance.
    Bagging, in contrast, uses averaging and voting in which each classifier has equal weight, so it is less sensitive to noisy data and outliers.

