A decision tree is a supervised machine learning algorithm. It uses a binary tree graph (each node has two children) to assign a target value to each data sample. The target values are stored in the tree leaves. To reach a leaf, the sample is propagated through the nodes, starting at the root node. In each node a decision is made about which descendant node the sample should go to next. The decision is based on a selected feature of the sample; usually a single feature is used to make the decision in each node. Decision tree learning is the process of finding the optimal decision rule in each internal node according to a selected metric.
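To make the traversal concrete, here is a minimal sketch of prediction in such a tree; the `Node` class and its fields are hypothetical, for illustration only:

```python
# Minimal sketch of prediction in a binary decision tree.
# The Node class and its fields are hypothetical, for illustration only.

class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, value=None):
        self.feature = feature      # index of the feature used for the decision
        self.threshold = threshold  # split threshold for that feature
        self.left = left            # child for samples with feature <= threshold
        self.right = right          # child for samples with feature > threshold
        self.value = value          # target value (set only in leaves)

def predict(node, sample):
    """Propagate a sample from the root to a leaf and return the leaf's value."""
    while node.value is None:       # internal node: keep descending
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value               # leaf reached

# Example: a tiny tree that thresholds feature 0 at 5.0.
root = Node(feature=0, threshold=5.0,
            left=Node(value="class A"), right=Node(value="class B"))
print(predict(root, [3.7]))  # -> "class A"
```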
With respect to the target values, decision trees can be divided into:
- classification trees, used to classify samples (assign them to a limited set of values, called classes)
- regression trees, used to assign samples a numerical value within a range
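Both kinds can be trained in a few lines with, for example, scikit-learn (assuming it is available); the sketch below uses its built-in toy datasets:

```python
# Classification and regression trees with scikit-learn.
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: predicts one of a limited set of classes.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)
print(clf.predict(X[:3]))  # class labels

# Regression tree: predicts a numerical value.
X, y = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X, y)
print(reg.predict(X[:3]))  # numerical targets
```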
Decision trees are a popular tool in decision analysis. They can support decision making thanks to the visual representation of each decision.
Decision tree advantages
- human-readable representation, which makes them a great algorithm for data mining and gaining insight into the data
- work with both categorical and numerical values - in principle there is no need to convert categorical values into numbers (although some implementations, such as scikit-learn's, do require numeric encoding)
- no need to scale numerical values into an arbitrary range, as is required for neural networks
- decision trees can work with large data sets, and prediction is fast
- decision trees have feature selection built in - they simply do not use irrelevant features for splits in the nodes (see the sketch after this list)
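As a quick illustration of that built-in feature selection, the sketch below (scikit-learn and NumPy assumed) appends random noise features to a dataset and inspects the fitted tree's feature importances; the irrelevant features end up with importance close to zero:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Append two irrelevant noise features to the four real ones.
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.rand(X.shape[0], 2)])

tree = DecisionTreeClassifier(random_state=0).fit(X_noisy, y)

# Irrelevant features are rarely chosen for splits,
# so their importance is (close to) zero.
print(tree.feature_importances_)
```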
Decision tree disadvantages
- usually worse predictive performance compared to more complex algorithms (like XGBoost, random forests, or neural networks)
- can easily overfit the training data, which is why many techniques are used to limit the tree size or to prune a fully grown tree (see the sketch after this list)
- decision trees are sensitive to data variations - even small changes in the data can result in a different tree structure
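As an example of such techniques, the sketch below (scikit-learn assumed) limits the tree depth during training and applies cost-complexity pruning; the `max_depth` and `ccp_alpha` values are arbitrary illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: tends to overfit the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Limit the tree size during training ...
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# ... or prune with cost-complexity pruning
# (ccp_alpha=0.01 is an arbitrary example value).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("max_depth=3", shallow), ("pruned", pruned)]:
    print(name, model.score(X_test, y_test))
```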