
 

Ensemble learning


In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Overview


Supervised learning algorithms perform the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem.[4] Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner. The broader term multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. On the other hand, the alternative is to do much more learning with a single non-ensemble system. For the same increase in compute, storage, or communication resources, an ensemble system may improve overall accuracy more by spreading that increase across two or more methods than would be gained by spending it all on a single method. Fast algorithms such as decision trees are commonly used in ensemble methods (for example, random forests), although slower algorithms can benefit from ensemble techniques as well.
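
As a rough illustration of this trade-off, the following sketch (assuming scikit-learn is available; the synthetic dataset and parameter choices are arbitrary) trains a single decision tree and a 100-tree random forest on the same data, spending roughly 100 times the computation in exchange for a typically higher test accuracy.

```python
# A minimal sketch comparing a single decision tree with a random forest,
# an ensemble of fast base learners (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The forest costs roughly 100x the tree's training and prediction time,
# but usually recovers a noticeably higher test accuracy.
print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```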

By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.
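
As one concrete example, the sketch below (a hypothetical setup assuming a recent scikit-learn; the number of runs and clusters is arbitrary) performs a simple form of consensus clustering: several k-means runs with different seeds are combined through a co-association matrix, which is then clustered to produce the final partition.

```python
# A minimal sketch of consensus clustering via a co-association matrix.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
n = len(X)

# Run k-means several times with different seeds and record, for each pair
# of points, how often they land in the same cluster.
co = np.zeros((n, n))
for seed in range(10):
    labels = KMeans(n_clusters=3, n_init=10, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= 10.0

# Treat (1 - co) as a dissimilarity and cluster it to obtain the consensus
# partition (metric="precomputed" assumes scikit-learn >= 1.2).
final = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                linkage="average").fit_predict(1.0 - co)
print("consensus labels for first 10 points:", final[:10])
```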

Ensemble theory


Empirically, ensembles tend to yield better results when there is significant diversity among the models.[5][6] Many ensemble methods therefore seek to promote diversity among the models they combine.[7][8] Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[9] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.[10] It is possible to increase diversity during the training stage of the model using correlation for regression tasks[11] or using information measures such as cross-entropy for classification tasks.
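
To make the notion of diversity concrete, the sketch below (assuming scikit-learn; the choice of base learners and data is illustrative) trains a few deliberately different models and reports their pairwise disagreement rate on held-out data, one simple proxy for ensemble diversity.

```python
# A minimal sketch of one simple diversity measure: the pairwise
# disagreement rate between the predictions of ensemble members.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Deliberately different base learners tend to disagree more, i.e. be more diverse.
members = [
    DecisionTreeClassifier(random_state=0),
    ExtraTreeClassifier(random_state=0),   # uses more random splits
    LogisticRegression(max_iter=1000),
]
preds = [m.fit(X_train, y_train).predict(X_test) for m in members]

for (i, p), (j, q) in combinations(enumerate(preds), 2):
    print(f"disagreement between member {i} and member {j}: {np.mean(p != q):.3f}")
```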

 

Bootstrap aggregating (bagging)


Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is created by selecting from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features they consider (e.g., at the nodes of a decision tree) to encourage exploration of diverse features.[16] The variance of local information in the bootstrap sets and the feature restrictions promote diversity in the ensemble, and can strengthen it.[17] To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set).[18]
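
A minimal from-scratch sketch of this procedure follows (assuming NumPy and scikit-learn; the ensemble size and tree settings are arbitrary): each member is trained on a bootstrap sample, restricted to a random subset of features at each node, and validated on its out-of-bag examples.

```python
# A minimal sketch of bagging with out-of-bag (OOB) validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

members, oob_scores = [], []
for i in range(25):
    # Bootstrap set: sample n indices with replacement; some examples repeat,
    # others are left out and form the out-of-bag set.
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    # max_features="sqrt" limits the features considered at each node.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    members.append(tree)
    # Validate each member on the examples it never saw during training.
    oob_scores.append(tree.score(X[oob], y[oob]))

print("mean OOB accuracy of individual members:", np.mean(oob_scores))
```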

Inference is done by voting on the predictions of the ensemble members, a process called aggregation. For example, with an ensemble of four decision trees, a query example is classified by each tree; if three of the four predict the positive class, the ensemble's overall classification is positive. Random forests such as this are a common application of bagging.
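
The aggregation step can be sketched as follows (assuming scikit-learn; the four-tree bagged ensemble and the single query example are illustrative): each fitted member votes on the query, and the majority class becomes the ensemble prediction.

```python
# A minimal sketch of aggregation by majority vote over bagged trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Four bagged decision trees (the default base learner of BaggingClassifier),
# mirroring the four-tree example described above.
bag = BaggingClassifier(n_estimators=4, random_state=0).fit(X, y)

query = X[:1]                                    # a single query example
votes = [int(tree.predict(query)[0]) for tree in bag.estimators_]
majority = np.bincount(votes).argmax()           # e.g. a 3-of-4 vote decides the class
print("individual votes:", votes, "-> ensemble prediction:", int(majority))
```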





Bayesian model averaging


Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data.[19] BMA is known to generally give better answers than a single model obtained, e.g., via stepwise regression, especially where very different models have nearly identical performance on the training set but may otherwise perform quite differently.

The question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior. The R packages ensembleBMA[20] and BMA[21] use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995).[22] The R package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models, as well as priors over the coefficients.
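
As a numerical sketch of BIC-implied weighting (the BIC values and per-model predictions below are made-up placeholders), model k receives a weight proportional to exp(-BIC_k / 2), the standard approximation of its posterior probability under equal model priors; the BMA prediction is then the weighted average of the individual models' predictions.

```python
# A minimal sketch of BIC-based Bayesian model averaging weights.
import numpy as np

# Hypothetical BIC values for three candidate models (placeholders, not real data).
bic = np.array([1012.4, 1009.8, 1015.1])

delta = bic - bic.min()            # BIC differences, for numerical stability
weights = np.exp(-0.5 * delta)     # unnormalised approximate posterior model probabilities
weights /= weights.sum()

# Hypothetical point predictions of each model for the same case.
preds = np.array([2.1, 2.4, 1.9])
bma_prediction = weights @ preds   # posterior-weighted average prediction
print("model weights:", np.round(weights, 3))
print("BMA prediction:", round(float(bma_prediction), 3))
```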















