This post is a brief introduction to Random Forest.

Introduction

This section gives a brief overview of random forests and comments on some features of the method.

Ensemble Learning

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
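As a quick illustration of the idea, here is a minimal sketch (using scikit-learn on a synthetic dataset; the choice of base learners is an assumption made for the example) that combines three different classifiers by majority vote and compares the ensemble with each constituent model.

```python
# Minimal ensemble sketch (illustrative assumptions: synthetic data,
# three arbitrary base learners): combine the learners by majority vote
# and compare the ensemble with each constituent model.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

members = [
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
]
ensemble = VotingClassifier(estimators=members, voting="hard")

for name, model in members + [("ensemble", ensemble)]:
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```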

Two major branches of ensemble learning

1. Boosting

Typical algorithm: AdaBoost

2. Bagging

Typical algorithm: Random Forest

The idea behind bagging

Because bagging builds many trees, it overcomes the weak generalization ability of a single decision tree.
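Here is a minimal sketch of the bagging idea, assuming a synthetic dataset and 50 trees: each tree is trained on a bootstrap sample (drawn with replacement) from the training data, predictions are made by majority vote over the trees, and the result is compared against a single decision tree.

```python
# Bagging sketch (assumptions: synthetic binary-classification data,
# 50 trees): train each tree on a bootstrap sample and predict by
# majority vote, then compare with a single tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_trees = 50
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(
        DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx])
    )

# Majority vote across the trees (labels are 0/1).
votes = np.stack([t.predict(X_test) for t in trees])
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single tree accuracy :", (single.predict(X_test) == y_test).mean())
print("bagged trees accuracy:", (bagged_pred == y_test).mean())
```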

Example

Description: how to use a random forest to predict a person's income based on age, gender, highest educational qualification, industry and residence. (The country in this example is Mexico.)

Salary bands:

Band 1: Below $40,000

Band 2: $40,000 – $150,000

Band 3: More than $150,000

Mexico has a population of 118 MM. Say the Random Forest algorithm picks up 10k observations, with only one variable each (for simplicity), to build each CART model. In total, we are looking at 5 CART models built with different variables. In a real-life problem, you would have a larger population sample and different combinations of input variables. Following are the outputs of the 5 different CART models (a code sketch of this set-up appears after the list below).

CART 1: Variable Age

CART 2: Variable Gender

CART 3: Variable Education

CART 4: Variable Residence

CART 5: Variable Industry

Note that all five of these CART models are based on statistical results from the sampled data.
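The sketch below mirrors this set-up on synthetic data (the column values and the salary-band target are placeholders, not real data for Mexico): each CART model is a decision tree grown on a bootstrap sample of 10k rows using a single input variable.

```python
# Set-up sketch for the example (placeholder data; the columns stand in
# for the real Age/Gender/Education/Residence/Industry variables):
# each CART model is a decision tree built from a 10k bootstrap sample
# and a single input variable.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 118_000  # small stand-in for the full 118 MM population
df = pd.DataFrame({
    "Age": rng.integers(18, 70, n),
    "Gender": rng.integers(0, 2, n),
    "Education": rng.integers(0, 5, n),
    "Residence": rng.integers(0, 3, n),
    "Industry": rng.integers(0, 8, n),
})
salary_band = rng.integers(1, 4, n)  # placeholder target: bands 1..3

cart_models = {}
for var in ["Age", "Gender", "Education", "Residence", "Industry"]:
    idx = rng.integers(0, n, size=10_000)            # 10k sampled rows
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(df[[var]].iloc[idx], salary_band[idx])  # one variable per model
    cart_models[var] = tree
```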

Now, say we score the following profile in each CART model:

1. Age: 35 years
2. Gender: Male
3. Highest Educational Qualification: Diploma holder
4. Industry: Manufacturing
5. Residence: Metro

For each of these CART models, following is the distribution across salary bands:

The final probability is simply the average of the probabilities for the same salary band across the different CART models. As you can see from this analysis, there is a 70% chance of this individual falling in Band 1 (less than $40,000) and around a 24% chance of falling in Band 2.
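The averaging step itself is just a column-wise mean. In the sketch below the per-model distributions are placeholders standing in for the table above (chosen only to illustrate the calculation, not taken from real output); the ensemble probability for each salary band is the simple mean of that band's probability across the five CART models.

```python
# Final averaging step (per-model distributions are illustrative
# placeholders): the ensemble probability for each salary band is the
# mean of that band's probability across the five CART models.
import numpy as np

# Rows = CART models (Age, Gender, Education, Residence, Industry);
# columns = salary bands 1..3. Each row sums to 1.
per_model_probs = np.array([
    [0.70, 0.25, 0.05],
    [0.75, 0.20, 0.05],
    [0.65, 0.30, 0.05],
    [0.70, 0.20, 0.10],
    [0.70, 0.25, 0.05],
])

final_probs = per_model_probs.mean(axis=0)
for band, p in enumerate(final_probs, start=1):
    print(f"Band {band}: {p:.0%}")  # -> Band 1: 70%, Band 2: 24%, Band 3: 6%
```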