Explaining Random Forest
What is Random Forest?
Random Forest is a supervised machine learning method built as an ensemble of decision trees. It is used for both classification and regression.
Decision Trees
Let's quickly go over decision trees, since they are the base models used in a Random Forest. Fortunately, they are easy and intuitive. Almost everyone uses one at some point in life, knowingly or not.
Let's understand with an example. Imagine you are a student thinking about your future. A decision tree can lead you to an answer. First, do you want a job? If yes, is your CGPA more than 7? If both answers are yes, you can sit for on-campus placement; otherwise, try off-campus. If you don't want a job, there are many other paths to consider. Note that the question or condition at each level depends on the problem, and the same problem may admit several different decision trees.
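The student example above can be written as a small function, where each `if` is one node of the tree. This is only a sketch: the function name `campus_advice` and the returned strings are made up for illustration, and the CGPA cutoff of 7 is taken from the example.

```python
def campus_advice(wants_job: bool, cgpa: float) -> str:
    """Walk the example decision tree from the root to a leaf."""
    # Root node: do you want a job?
    if not wants_job:
        return "consider other options"
    # Second level: is the CGPA above the cutoff of 7?
    if cgpa > 7:
        return "sit for on-campus placement"
    return "try off-campus"


print(campus_advice(True, 8.2))   # follows the on-campus branch
print(campus_advice(True, 6.5))   # follows the off-campus branch
print(campus_advice(False, 9.0))  # follows the no-job branch
```

A learned decision tree has the same shape, except that the questions and cutoffs are chosen from the training data instead of being written by hand.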
Steps to make Random Forest Model
- Get a dataset.
- Sample rows with replacement (bootstrap sampling).
- Sample a subset of the features.
- Train a decision tree on each sample.
- Get the output of every tree for the query point.
- Aggregate the outputs (majority vote for classification, average for regression) to get the prediction.
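The steps above can be sketched by hand with NumPy and scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how a library builds its forest: the dataset is synthetic, `n_trees = 25` and the square-root rule for the number of features are arbitrary choices, and features are sampled once per tree as in the list above, whereas implementations such as scikit-learn's sample features at every split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Step 1: a toy dataset (synthetic, for illustration only).
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=4, random_state=0)

n_trees = 25                                # hypothetical choice
n_samples, n_features = X.shape
n_sub_features = int(np.sqrt(n_features))   # features given to each tree

forest = []
for _ in range(n_trees):
    # Step 2: row sampling with replacement (bootstrap).
    rows = rng.integers(0, n_samples, size=n_samples)
    # Step 3: feature sampling (without replacement, once per tree).
    cols = rng.choice(n_features, size=n_sub_features, replace=False)
    # Step 4: fit one decision tree on this sample.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))


def predict(query):
    """Predict the class of a single query point (1-D array)."""
    # Step 5: collect the output of every tree for the query point.
    votes = [int(tree.predict(query[cols].reshape(1, -1))[0])
             for tree, cols in forest]
    # Step 6: majority vote over the collected outputs.
    return int(np.bincount(votes).argmax())


print("predicted:", predict(X[0]), "actual:", y[0])
```

Each tree remembers which columns it was trained on, so the query point is reduced to the same columns before that tree votes.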
Random Forest Classifier
A Random Forest, as the name implies, is a large number of individual decision trees (which make up the forest) operating as an ensemble. Each tree in the forest predicts a class, and the class with the most votes becomes the model's prediction.
The concept is simple but very powerful: many relatively uncorrelated decision trees, taken together, give a more accurate result than any single tree. The trees protect each other from their individual mistakes, because an error made by a few trees is outvoted by the majority.
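The voting step itself is tiny. Below is a sketch using only the standard library; the helper name `majority_vote` and the five "tree" outputs are hypothetical. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this shows the idea, not that library's exact rule.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Five hypothetical trees classify one email; two of them disagree.
votes = ["spam", "spam", "ham", "spam", "ham"]
print(majority_vote(votes))  # spam
```

Even though two of the five trees disagree, the ensemble still returns the class chosen by the other three.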
Random Forest Regression
Here, unlike the Random Forest classifier, we take the average of the predictions from all decision trees. Other aggregates, such as the median, are also possible. The choice depends on which one works best for the model.
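The averaging can be checked directly with scikit-learn's `RandomForestRegressor`, whose fitted trees are available in the `estimators_` attribute. The sketch below uses a synthetic dataset and an arbitrary `n_estimators=20`; the median is computed by hand as an alternative aggregate and is not something the regressor reports itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0,
                       random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

query = X[:3]
# One row per tree, one column per query point.
per_tree = np.array([tree.predict(query) for tree in reg.estimators_])

mean_pred = per_tree.mean(axis=0)          # average over the trees
median_pred = np.median(per_tree, axis=0)  # alternative aggregate

# The forest's own prediction should match the per-tree average.
print(np.allclose(mean_pred, reg.predict(query)))
print("mean:  ", mean_pred)
print("median:", median_pred)
```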
Hyper-Parameter Tuning
Random Forest is forgiving in this respect. Even without hyperparameter tuning, it works well most of the time.
The hyperparameters of a Random Forest are used either to increase the predictive power of the model or to make the model faster. Two of the most commonly tuned are (1) the number of trees and (2) max_depth.
First, the number of trees is simply how many decision trees the algorithm builds before taking the majority vote or the average of the predictions. In general, a higher number of trees improves performance and makes the predictions more stable, but it also slows down computation.
Second, max_depth is the maximum depth each tree is allowed to grow to. Shallower trees are faster and less prone to overfitting, while deeper trees can capture more detail. It should not be confused with the maximum number of features considered when splitting a node, which is a separate hyperparameter (max_features in scikit-learn).
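One way to tune these two hyperparameters is a small grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV`, where the number of trees is called `n_estimators`; the dataset is synthetic and the candidate values in `param_grid` are arbitrary picks kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=300, n_features=6,
                           n_informative=3, random_state=0)

# Candidate values are arbitrary; None means no depth limit.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

# Try every combination with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best combination depends on the data, so the printed values are only meaningful for this toy dataset.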
Advantages
- It can be used for both Classification and Regression
- It has only a few hyperparameters, and even the default values often produce good results.
- With enough trees in the forest, the model is much less likely to overfit than a single decision tree.
- It handles large datasets with high dimensionality.
- It can handle missing values (depending on the implementation) while maintaining accuracy.
Disadvantages
- A large number of trees can make the algorithm too slow for real-time predictions.
- You have little insight into, or control over, what the model does (it is a black box).
Applications
The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine and e-commerce.
- Banking: It can be used to find loyal customers, meaning those who are likely to repay their debt on time or to use the bank's services more frequently. In this domain it is also used to detect fraudsters out to scam the bank.
- Trading: The algorithm can be used to estimate a stock's future behavior, as well as the expected loss or profit from purchasing a particular stock.
- Healthcare: It is used to identify the correct combination of components in medicine and to analyze a patient's medical history to identify diseases.
- E-commerce: Random Forest is used to predict whether a customer will actually like a product or not.
- Computer vision: Random Forest is used for image classification. Microsoft used it in Xbox for body part classification.
- Voice: It is also used for voice classification.
Sample Code
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
... n_informative=2, n_redundant=0,
... random_state=0, shuffle=False)
>>> clf = RandomForestClassifier(max_depth=2, random_state=0)
>>> clf.fit(X, y)
RandomForestClassifier(...)
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
Source: scikit-learn documentation