# Ensemble of Weak Learners

There are various machine-learning algorithms, a.k.a. “learners”, for regression and classification problems. Ensembles allow us to create a more powerful learner from a set of base learners. Ensembles are known to produce better results than the individual algorithms and are better at reducing generalization error. In this context, the base learners are also referred to as weak learners.

The base learning algorithms used by an ensemble can be of different types; for example, the individual base learners could be Bayesian classifiers, Decision Trees, SVMs, etc. In other cases, the base learners are the same algorithm with different tuning parameters and training sets; for example, in a Random Forest ensemble, each base learner is a Decision Tree.

I’m going to discuss five key ensemble techniques (Voting, Stacking, Bagging, Random Forest, and Boosting) and will attempt to represent each of them with a simple graphic.

Some terminology, before we get to the ensembles:

1. Training Set is the set of observations used to arrive at a hypothesis. The features (a.k.a. attributes) of the training set are used by the learning algorithm to arrive at the hypothesis.
2. Learning algorithm, or the learner, is an algorithm that identifies complex relationships among the features in the training set through a learning process.
3. Hypothesis, or the model, is the outcome of the learning process. It can be used to make predictions for unseen observations.
4. Test Set is the set of observations kept aside to test, validate, and measure the error in a hypothesis. The model, or hypothesis, is made to predict the values on the test set; the error is measured and reported as the “generalization error”.
5. Classification problem is the problem of predicting a class, or category, in the test set.
6. Regression problem is the problem of predicting a numerical value in the test set.

## Voting Ensemble

A Voting Ensemble uses the sheer power of democracy. To create one, train several base learners on the training set. Each of the base learners then makes a prediction on the test set.

For classification, take a majority vote over the base learners’ predictions. For regression, use the average of the base learners’ predictions.
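As a minimal sketch of the two combination rules (the input lists stand in for predictions already produced by hypothetical trained base learners):

```python
from collections import Counter

def vote_classify(predictions):
    """Classification: majority vote over the base learners' predicted labels."""
    return Counter(predictions).most_common(1)[0][0]

def vote_regress(predictions):
    """Regression: mean of the base learners' numeric predictions."""
    return sum(predictions) / len(predictions)

# Three hypothetical base learners have predicted for one test observation:
label = vote_classify(["cat", "dog", "cat"])   # majority vote
value = vote_regress([2.0, 4.0, 6.0])          # average
```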

## Stacking

The idea in Stacking is to spot patterns of agreement and disagreement between the base learners. Stacking uses a meta-learner to discover these patterns. The meta-learner looks for patterns like “when learner A and learner B are in unison, the predictions are always correct”, or “when A, B, and C disagree, C is almost always correct”, etc.

Creating a Stacked Ensemble:

1. Train several base learners on the training set. The base learners could be different algorithms altogether.
2. Train a meta-learner on the predictions of the base learners. The meta-learner learns when the base learners are right or wrong.
3. The meta-learner makes the final predictions based on the predictions of the base learners on the test set.
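The steps above can be sketched as follows. Everything here is a hypothetical stand-in: the lambdas play the role of already-trained base learners, and a hand-written rule plays the role of a trained meta-learner:

```python
def meta_features(base_models, X):
    """One row per observation; each row holds the base learners'
    predictions, which become the meta-learner's training features."""
    return [[model(x) for model in base_models] for x in X]

def stacked_predict(base_models, meta_model, x):
    """The final prediction is the meta-learner applied to the
    base learners' predictions."""
    return meta_model([model(x) for model in base_models])

# Hypothetical, already-trained base learners for a toy binary problem:
model_a = lambda x: x > 0     # a loose classifier
model_b = lambda x: x > 10    # a strict classifier
# A hand-written stand-in for a trained meta-learner: it has "learned"
# to predict positive only when both base learners agree.
meta_model = lambda preds: preds[0] and preds[1]

prediction = stacked_predict([model_a, model_b], meta_model, 15)
```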

## Bagging (Bootstrap Aggregation)

The idea in Bagging is to create an “averaged out” learner. The trick is to train each base learner so that it completely over-fits its training sample: the bias of the individual learners tends to zero, while their variance is very high. With a sufficiently large number of base learners, the ensemble tends to cancel out the random variance of the individual learners. Bagging helps most when a slight change in the training data affects the model at large.

Creating a Bagging Ensemble:

1. The key to bagging is the art of creating bootstrap training sets. Say there are m observations. Create t bootstrap sets of size n each by choosing n observations (with replacement) from the m observations. When n equals m, each bootstrap set contains about 63% of the distinct original observations (roughly 1 − 1/e of them); the remaining draws are duplicates.
2. The training observations that are not chosen in a specific bootstrap set are referred to as out-of-bag (OOB) observations. Each base learner can report its error on its OOB observations.
3. Create t models using each of these bootstrap training sets.
4. Each of the base learners makes a prediction. As in the Voting Ensemble, for regression, average the predictions from the base learners; for classification, take a majority vote.
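The bootstrap-sampling step can be sketched as follows; `bootstrap_sample` and `bagged_predict` are illustrative names, not a library API:

```python
import random

def bootstrap_sample(data, rng):
    """Draw n = len(data) observations with replacement and report the
    out-of-bag (OOB) indices that were never chosen."""
    n = len(data)
    idx = [rng.randrange(n) for _ in range(n)]
    chosen = set(idx)
    oob = [i for i in range(n) if i not in chosen]
    return [data[i] for i in idx], oob

def bagged_predict(learners, x):
    """Regression: average the base learners' predictions
    (for classification, a majority vote would be taken instead)."""
    return sum(h(x) for h in learners) / len(learners)

rng = random.Random(0)
data = list(range(100))
sample, oob = bootstrap_sample(data, rng)
# With n = m, roughly 63% of the observations in the sample are unique;
# the rest of the draws are duplicates, and the never-chosen ones are OOB.
unique_fraction = len(set(sample)) / len(data)
```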

## Random Forest

Random Forest is an ensemble of Decision Trees. Random Forests are known to run efficiently on large datasets and can handle a large number of features. The randomness is introduced by two factors:

1. Choosing a random training set for each base learner: a bootstrap set is drawn for each base learner, where each base learner is a Decision Tree.
2. Each node of a Decision Tree must choose a feature on which to split and grow the tree to the next level. This feature is chosen from a random subset of the training set’s features.

Creating a Random Forest:

1. Starting with the training set, create t bootstrap samples. This is similar to Bagging, but generally each bootstrap sample is chosen to be the same size as the training set.
2. If there are a total of X features in each observation, choose a number x such that x is much smaller than X. A common choice is x = sqrt(X). This value of x is held constant while the forest grows.
3. For each node split, in each of the base learners (Decision Trees), x features are randomly chosen out of the X, and one of those x features is chosen to split the node. This is one of the reasons random forest trees are built faster than regular trees, where each split decision considers all the features.
4. Each tree is grown to the largest extent possible. There is no pruning, since we want high variance in the individual trees and low correlation between any two base learners.
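The two sources of randomness in the recipe above can be sketched like this; `grow_tree` is a hypothetical stand-in for unpruned decision-tree induction, not a real API:

```python
import math
import random

def random_feature_subset(n_features, rng):
    """Randomness source 2: at every node split, only x = sqrt(X) of the
    X features are considered as split candidates."""
    x = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), x)

def grow_forest(data, n_trees, rng, grow_tree):
    """Randomness source 1: each tree is trained on its own bootstrap
    sample, the same size as the training set. `grow_tree` is a
    hypothetical stand-in for growing one unpruned Decision Tree."""
    forest = []
    for _ in range(n_trees):
        sample = [data[rng.randrange(len(data))] for _ in data]
        forest.append(grow_tree(sample))
    return forest

rng = random.Random(42)
subset = random_feature_subset(16, rng)   # 4 of the 16 feature indices
```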

The generalization error of a Random Forest depends on two factors:

• The correlation between any two base learners. An increase in correlation reduces the variance-cancelling benefit of averaging, and this results in increased generalization error. With ~500 base learners, and the randomness discussed above, the correlation between trees can usually be kept low.
• The strength of the individual base learners. Increasing the strength of the base learners decreases the forest’s error rate. Better base learners result in a better ensemble.

## Boosting

The idea of Boosting is that a “weak” learning algorithm, one that performs just slightly better than random guessing, can be “boosted” into a “strong” learning algorithm through a series of iterative training rounds. Many different boosting algorithms have been proposed and implemented; the main variation between them is how they weight training observations and hypotheses.

AdaBoost is a popular boosting algorithm, and can be implemented as follows:

1. Start with a uniform selection bias for each observation in the training set of size m.
2. Choose a number t, which is the number of adaptive iterations AdaBoost will execute. On each iteration:
   • Create a bootstrap sample, as in the Bagging Ensemble, but use the selection bias when drawing observations. On the first iteration, every observation has an equal probability of being chosen.
   • Train a new base learner on the bootstrap sample to create a hypothesis (hi).
   • Test the hypothesis on the full training set, and use the results to increase the selection bias of each wrongly classified observation. The idea is to create harder and harder samples for the subsequent base learners as the iterations progress.
3. For classification problems, the final classifier takes a weighted vote over the predictions of the base classifiers; the weaker base learners get less weight in the vote. For regression problems, the final predictor takes a weighted mean of the base learners’ predictions.
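The loop above can be sketched in Python. Two hedges: this sketch uses the common reweighting variant of AdaBoost, training the weak learner directly on the weighted observations instead of drawing a biased bootstrap sample (equivalent in spirit to the steps above), and `train_stump` is a toy weak learner for 1-D inputs, not a real library API:

```python
import math

def train_stump(X, y, w):
    """Hypothetical weak learner: the single-threshold rule on 1-D inputs
    that minimises the *weighted* training error."""
    best_err, best_thr, best_sign = float("inf"), None, None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (sign if xi >= thr else -sign) != yi)
            if err < best_err:
                best_err, best_thr, best_sign = err, thr, sign
    return lambda x, t=best_thr, s=best_sign: s if x >= t else -s

def adaboost(X, y, t):
    """Minimal AdaBoost sketch for labels in {-1, +1} (reweighting variant)."""
    m = len(X)
    w = [1.0 / m] * m                              # step 1: uniform bias
    ensemble = []
    for _ in range(t):
        h = train_stump(X, y, w)
        err = sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # weaker learners weigh less
        w = [wi * math.exp(-alpha * yi * h(xi))    # raise bias on mistakes
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]               # renormalise the bias
        ensemble.append((alpha, h))
    # Final classifier: weighted vote over the base hypotheses.
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

clf = adaboost([0, 1, 2, 3, 4, 5], [-1, -1, -1, 1, 1, 1], t=5)
```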