Introduction to Decision Tree Algorithm
When we have a classification or regression problem to solve, the decision tree is one of the most popular algorithms for building the model. Decision trees fall under the category of supervised learning, i.e. they are trained on labeled data.
What is Decision Tree Algorithm?
The decision tree algorithm is a supervised machine learning algorithm in which the data is repeatedly divided at each level, based on certain rules, until a final outcome is reached. Let’s take an example. Suppose you open a shopping mall and, of course, you want its business to grow with time. For that, you need returning customers as well as new customers, so you prepare business and marketing strategies such as sending emails to potential customers, creating offers and deals, and targeting new customers. But how do we know who the potential customers are? In other words, how do we classify customers into categories? Some customers visit once a week, others once or twice a month, and some only once a quarter. A decision tree is one such classification algorithm: it keeps dividing the data into groups until each group cannot usefully be divided any further.
In this way, a decision tree works its way down in a tree-structured format. The main components of a decision tree are:
- Decision Nodes, where the data is split; each node holds the attribute being tested.
- Decision Links, each of which represents a rule.
- Decision Leaves, which are the final outcomes.
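To make the three components concrete, here is a minimal Python sketch. The names `DecisionNode`, `Leaf`, and `classify`, as well as the customer attributes and outcomes, are hypothetical and chosen only to illustrate the mall example; they do not come from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Decision leaf: holds a final outcome."""
    outcome: str


@dataclass
class DecisionNode:
    """Decision node: tests one attribute; each link maps a value to a subtree."""
    attribute: str
    links: Dict[str, Union["DecisionNode", Leaf]] = field(default_factory=dict)


def classify(tree, sample):
    """Follow the link matching the sample's value at each node until a leaf."""
    node = tree
    while isinstance(node, DecisionNode):
        node = node.links[sample[node.attribute]]
    return node.outcome


# Hypothetical tree for the mall example (all values are made up).
customer_tree = DecisionNode(
    attribute="visit_frequency",
    links={
        "weekly": Leaf("regular customer"),
        "monthly": DecisionNode(
            attribute="responds_to_offers",
            links={
                "yes": Leaf("potential customer"),
                "no": Leaf("occasional customer"),
            },
        ),
        "quarterly": Leaf("occasional customer"),
    },
)

sample = {"visit_frequency": "monthly", "responds_to_offers": "yes"}
print(classify(customer_tree, sample))
```

Each dictionary key in `links` plays the role of a rule: a sample follows the link whose key matches its value for the node's attribute.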
Working of a Decision Tree Algorithm
The working of a decision tree involves the following steps:
1. Splitting – The process of partitioning the data into subsets. Splitting can be done on various attributes, for example on gender, on height, or on class.
2. Pruning – The process of shortening the branches of the decision tree, hence limiting the tree depth.
Pruning is of two types:
- Pre-Pruning – Here we stop growing the tree when we do not find any statistically significant association between the attributes and the class at a particular node.
- Post-Pruning – To post-prune, we first validate the performance of the model on a test set and then cut the branches that result from overfitting noise in the training set.
3. Tree Selection – The third step is the process of finding the smallest tree that fits the data.
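The splitting and pre-pruning steps above can be sketched in a few lines of Python. This is a simplified illustration, not a full algorithm: the attributes are tried in the order given rather than chosen by a quality measure, and a depth limit (`max_depth`) stands in for the statistical significance test used in pre-pruning. The function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def split_by_attribute(rows, attribute):
    """Partition rows (dicts) into subsets keyed by their value of `attribute`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row)
    return dict(subsets)


def majority_label(rows, target):
    """Most common value of the target column among the rows."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]


def grow(rows, attributes, target, max_depth):
    """Grow a tree as nested dicts.

    Growth stops (returning a leaf label) when the node is pure, when no
    attributes remain, or when max_depth reaches 0 (a simple pre-pruning rule).
    """
    labels = {row[target] for row in rows}
    if len(labels) == 1 or not attributes or max_depth == 0:
        return majority_label(rows, target)
    attribute = attributes[0]  # simplification: use attributes in the given order
    return {
        attribute: {
            value: grow(subset, attributes[1:], target, max_depth - 1)
            for value, subset in split_by_attribute(rows, attribute).items()
        }
    }


# Hypothetical rows, made up for illustration.
rows = [
    {"gender": "M", "height": "tall", "label": "yes"},
    {"gender": "M", "height": "short", "label": "no"},
    {"gender": "F", "height": "tall", "label": "no"},
    {"gender": "F", "height": "short", "label": "no"},
    {"gender": "M", "height": "tall", "label": "yes"},
]

full_tree = grow(rows, ["gender", "height"], "label", max_depth=2)
pruned_tree = grow(rows, ["gender", "height"], "label", max_depth=1)
print(full_tree)
print(pruned_tree)
```

With `max_depth=1` the tree splits only on `gender`, and the mixed "M" subset collapses into a single leaf holding its majority label. This shows how limiting depth trades detail for a smaller tree.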
Examples and Illustration of Constructing a Decision Tree
Now that we have learned the principles of a decision tree, let’s illustrate them with the help of an example.
Let’s say you want to play cricket on some particular day (e.g., Saturday). What are the factors that decide whether play is going to happen or not?
Clearly, the major factor is the weather; no other factor is as likely to interrupt play.
We have collected the data from the last 10 days which is presented below:
Let us now construct our decision tree based on the data that we have. We have divided the decision tree into two levels: the first is based on the attribute “Weather” and the second is based on “Humidity” and “Wind”. The image below illustrates the learned decision tree.
We can also set some threshold values if the features are continuous.
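As a rough sketch of how a threshold on a continuous feature might work, the Python snippet below lists candidate thresholds as the midpoints between consecutive distinct values and then splits the rows on one of them. The function names and the humidity readings are hypothetical; real implementations would also score each candidate to pick the best one.

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    distinct = sorted(set(values))
    return [(low + high) / 2 for low, high in zip(distinct, distinct[1:])]


def split_on_threshold(rows, feature, threshold):
    """Split rows into (at or below threshold, above threshold)."""
    at_or_below = [row for row in rows if row[feature] <= threshold]
    above = [row for row in rows if row[feature] > threshold]
    return at_or_below, above


# Hypothetical humidity readings in percent, made up for illustration.
days = [
    {"humidity": 65, "play": "yes"},
    {"humidity": 70, "play": "yes"},
    {"humidity": 70, "play": "yes"},
    {"humidity": 85, "play": "no"},
    {"humidity": 90, "play": "no"},
]

thresholds = candidate_thresholds([day["humidity"] for day in days])
low_humidity, high_humidity = split_on_threshold(days, "humidity", 77.5)
print(thresholds)
print(len(low_humidity), len(high_humidity))
```

The rule at such a node then reads "humidity <= 77.5" on one link and "humidity > 77.5" on the other.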
What is Entropy in Decision Tree Algorithm?
In simple words, entropy is a measure of how disordered your data is. You might have heard the term in your mathematics or physics classes; it means the same here.
Entropy is used in decision trees because the ultimate goal of a decision tree is to group similar data points into the same class, i.e. to tidy the data.
Consider the image below, where we have the initial dataset and need to apply the decision tree algorithm to group similar data points into one category.
After the decision split, as we can clearly see, most of the red circles fall under one class while most of the blue crosses fall under another. The split has thus classified the data points, and the rule behind it could be based on various attributes.
Now, let us do some math:
Let us say that we have a set of N items that fall into two categories, with n items in the first category and m = N - n items in the second. To group the data based on labels, we introduce the ratios:
$p = n / N, \quad q = m / N = 1 - p$
The entropy of our set is given by the following equation:
$E = -p \log_2(p) - q \log_2(q)$
Let us check out the graph for the given equation:
[Figure: entropy plotted against p; the curve reaches its maximum at p = 0.5 and q = 0.5]
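A short Python sketch can make the entropy calculation concrete. It assumes the usual two-class definition E = -p log2(p) - q log2(q), where p and q = 1 - p are the fractions of items in each category. The function names and the counts of circles and crosses are hypothetical, chosen only for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a two-class set where a fraction p is in class one."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a pure set is perfectly ordered
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


def entropy_after_split(groups):
    """Size-weighted average entropy of the groups produced by a split.

    Each group is a (count_class_one, count_class_two) pair and is assumed
    to be non-empty.
    """
    total = sum(a + b for a, b in groups)
    return sum((a + b) / total * binary_entropy(a / (a + b)) for a, b in groups)


# Hypothetical counts: 10 red circles and 10 blue crosses before the split.
before = binary_entropy(10 / 20)
# After the split, one group is mostly red and the other mostly blue.
after = entropy_after_split([(8, 2), (2, 8)])
print(f"before: {before:.3f}, after: {after:.3f}")
```

With these made-up counts the entropy drops from 1 bit before the split to roughly 0.72 bits after it, which is the sense in which a good split "tidies" the data.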
Advantages of Decision Tree Algorithm
1. A decision tree is simple to understand, and once it is understood, it is easy to construct.
2. We can use a decision tree on numerical as well as categorical data.
3. Decision trees have proven to be robust models with promising outcomes.
4. They are also time-efficient on large datasets.
5. They require relatively little effort to train.
Disadvantages of Decision Tree Algorithm
1. Instability – A decision tree delivers promising results only if the input data is precise and accurate. Even a slight change in the input data can cause large changes in the tree.
2. Complexity – If the dataset is huge, with many columns and rows, designing a decision tree with many branches is a very complex task.
3. Costs – Cost can also be a major factor, because constructing a complex decision tree requires advanced knowledge of quantitative and statistical analysis.
Conclusion
In this article, we learned about the decision tree algorithm and how to construct one. We also saw the important role that entropy plays in the decision tree algorithm and, finally, the advantages and disadvantages of decision trees.