The base learning algorithms used by an ensemble can be of different types: for example, the individual base learners could be Bayesian classifiers, Decision Trees, SVMs, and so on. In other cases, the base learners are instances of the same algorithm with different tuning parameters and training sets; in a Random Forest ensemble, for example, each base learner is a Decision Tree.

I’m going to discuss the five key ensemble techniques (Voting, Stacking, Bagging, Random Forest, and Boosting) and will attempt to represent them using a simple graphic.

Some terminology, before we get to the ensembles:

- Training Set is the set of observations used to arrive at a hypothesis. The features (a.k.a. attributes) of the training set are used by the learning algorithm to arrive at the hypothesis.
- Learning algorithm, or learner, is an algorithm that identifies complex relationships among the features of the training set through a learning process.
- The hypothesis, or the model, is the outcome of the learning process. It can be used to predict unseen observations.
- Test Set is the set of observations that is kept aside to test, validate, and measure the error in a hypothesis. The model, or hypothesis, is made to predict values on the test set; the error is measured and reported as the "generalization error".
- Classification problem is the problem of predicting a class, or a category, in the test data set.
- Regression problem is the problem of predicting a numerical value in the test data set.

A *Voting Ensemble* uses the sheer power of democracy. To create one, train several base learners on the training set. Each of the base learners is then allowed to make a prediction on the test set.

For classification prediction, take majority vote on the predictions of base learners. For regression prediction, use the average of the predictions from the base learners.
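The two aggregation rules can be sketched in a few lines of Python. This is a minimal illustration: the function names are mine, and plain lists of predictions stand in for actual trained base learners.

```python
from collections import Counter

def vote_classify(predictions):
    """Majority vote over the class labels predicted by the base learners."""
    return Counter(predictions).most_common(1)[0][0]

def vote_regress(predictions):
    """Average the numeric predictions of the base learners."""
    return sum(predictions) / len(predictions)

# Three hypothetical base learners predict a class for one test observation:
print(vote_classify(["spam", "ham", "spam"]))  # → spam
# ...and a numeric value for a regression problem:
print(vote_regress([2.0, 3.0, 4.0]))           # → 3.0
```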

Creating a *Stacked Ensemble*:

- Train several base learners on the training set. The base learners could be different algorithms altogether.
- Train a meta-learner on the predictions of the base learners. The meta-learner learns when the base learners are right or wrong.
- The meta-learner makes final predictions, based on the predictions of the base learners on the test set.
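The steps above can be sketched with the simplest possible "meta-learner": here it merely weights each base learner by its accuracy on a validation set, as a stand-in for a real second-level model (such as logistic regression) trained on the base predictions. All names are illustrative.

```python
def learn_weights(val_preds, val_truth):
    """Meta-step: learn how much to trust each base learner.
    val_preds[i] holds base learner i's predictions on a validation set."""
    return [sum(p == t for p, t in zip(preds, val_truth)) / len(val_truth)
            for preds in val_preds]

def stacked_predict(test_preds, weights):
    """Weighted vote: test_preds[i] is base learner i's prediction
    for a single test observation."""
    scores = {}
    for label, w in zip(test_preds, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)

# Two hypothetical base learners, scored on a 4-observation validation set:
weights = learn_weights([["a", "a", "b", "b"], ["a", "b", "b", "a"]],
                        ["a", "a", "b", "b"])
print(weights)                            # → [1.0, 0.5]
print(stacked_predict(["b", "a"], weights))  # → b (the stronger learner wins)
```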

Creating a *Bagging Ensemble*:

- The key to bagging is the art of creating bootstrap training sets. Say there are m observations. Create t bootstrap sets of size n each, by choosing n observations (with replacement) from the m observations. When n equals m, each bootstrap set will contain, on average, about 63% of the distinct original observations; the rest are duplicates.
- The training observations that are not chosen in a specific bootstrap set are referred to as out-of-bag (OOB) observations. Each base learner can report errors on the OOB observations.
- Create t models using each of these bootstrap training sets.
- Each of the base learners is allowed to make predictions. As in the Voting Ensemble, for regression, average the predictions from the base learners; for classification, take a majority vote on their predictions.
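Bootstrap sampling and the out-of-bag set can be sketched as follows. This is a minimal sketch with illustrative names; a real bagging ensemble would train one base learner per sample.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) observations with replacement; also return the
    indices of the out-of-bag (OOB) observations that were never chosen."""
    n = len(data)
    chosen = [rng.randrange(n) for _ in range(n)]
    oob = set(range(n)) - set(chosen)
    return [data[i] for i in chosen], oob

rng = random.Random(42)
data = list(range(10))
sample, oob = bootstrap_sample(data, rng)
# With n == m, roughly 63% of the original observations are expected
# to appear in the sample; the OOB set holds the rest.
print(sample, oob)
```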

Two sources of randomness set a Random Forest apart from plain bagging of trees:

- A random training set for each base learner: a bootstrap set is chosen for each base learner, where each base learner is a Decision Tree.
- A random subset of features at each split: each node of the Decision Tree must use a feature to split and grow the tree to the next level, and this feature is chosen from a random subset of the training set's features.

Creating a *Random Forest*:

- Starting with the training set, create t bootstrap samples. This is similar to Bagging, but generally the size of bootstrap samples is chosen to be the same as the size of the training set.
- If there are a total of X features in the observations, then a number x is chosen such that x is much smaller than X. A common choice is x = sqrt(X). This value of x is held constant while the forest is growing.
- For the split of each node, in each of the base learners (Decision Trees), x features are randomly chosen out of the X. Then one feature out of the x is chosen to split the node. This is one of the reasons random forest trees are built faster than regular trees, where each split decision involves all the features.
- Each tree is grown to the largest extent possible; there is no pruning. Fully grown trees keep bias low, while the injected randomness keeps the correlation between any two base learners low.
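The per-split feature subsampling can be sketched as below, using x = sqrt(X) as mentioned above. The function name is illustrative; a real forest would repeat this at every node of every tree.

```python
import math
import random

def features_for_split(n_features, rng):
    """Randomly pick x = sqrt(X) of the X feature indices to
    consider when splitting one node."""
    x = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), x)

rng = random.Random(0)
feats = features_for_split(16, rng)
print(feats)  # 4 of the 16 feature indices, chosen at random
```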

The generalized error with Random Forest depends on two items:

- The *correlation* between any two base learners. Increasing the correlation between trees reduces the variance-averaging benefit of the ensemble, and this results in increased generalization error. With ~500 base learners, and the randomness discussed above, the correlation between trees can usually be kept low.
- The *strength* of the individual base learners. Increasing the strength of the base learners decreases the forest error rate. Better base learners result in a better ensemble.

AdaBoost is a popular boosting algorithm, and works as follows:

- Start with a uniform selection bias for each observation in the training set of size m.
- Choose a number t, which is the number of adaptive iterations AdaBoost will execute. On each iteration:

- Create a bootstrap sample, as in the Bagging Ensemble, but use the selection bias when sampling. On the first iteration, each observation has an equal probability of being chosen for the bootstrap sample.
- Train a new base learner on the bootstrap sample to create a hypothesis (h_i).
- Test the hypothesis on the training set. Use the results to increase the selection bias for each wrongly classified observation. The idea is to go on creating harder and harder samples for the subsequent base learners, as the iterations increase.
- For classification problems, the final classifier takes a weighted vote on the predictions of the base classifiers; weaker base learners get less weight in the final vote. For regression problems, the final predictor takes a weighted mean of the base learners' predictions.
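The per-iteration weight update can be sketched as below. This is a simplified sketch of one AdaBoost round: training the base learner itself is omitted, and the function names are mine. Misclassified observations get their selection bias increased, exactly as described above.

```python
import math

def update_weights(weights, correct):
    """One AdaBoost round: upweight misclassified observations.
    `correct[i]` is True if observation i was classified correctly."""
    eps = sum(w for w, c in zip(weights, correct) if not c)  # weighted error
    alpha = 0.5 * math.log((1 - eps) / eps)   # this learner's vote weight
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)                              # normalize back to a distribution
    return [w / z for w in new], alpha

# Four observations, uniform start; the last one was misclassified:
w = [0.25] * 4
w, alpha = update_weights(w, [True, True, True, False])
print(w, alpha)  # the misclassified observation now carries half the mass
```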


Copyright © 2009-2014 · Views expressed are my own · All Rights Reserved · Manish Maheshwari