Decision Tree algorithm with an example
Basic algorithm (greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner.
At start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Partitioning stops when all samples at a given node belong to the same class, when there are no remaining attributes for further partitioning, or when there are no samples left.
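This greedy, recursive procedure can be sketched in a few lines of Python. The sketch below is only illustrative, not the source's own code; the `build_tree` and `select_attribute` names and the dict-based tree representation are my assumptions.

```python
from collections import Counter

def build_tree(rows, labels, attributes, select_attribute):
    """Greedy top-down, divide-and-conquer construction (illustrative sketch).

    rows             -- list of dicts mapping attribute name -> categorical value
    labels           -- list of class labels, parallel to rows
    attributes       -- attribute names still available for splitting
    select_attribute -- heuristic, e.g. information gain (see below)
    """
    if not rows:                         # no samples left at this branch
        return None                      # in practice: use the parent's majority class
    if len(set(labels)) == 1:            # all samples belong to the same class -> leaf
        return labels[0]
    if not attributes:                   # no remaining attributes -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]

    best = select_attribute(rows, labels, attributes)  # e.g. highest information gain
    remaining = [a for a in attributes if a != best]
    node = {best: {}}
    # Partition the examples on the chosen attribute and recurse on each subset
    for value in set(row[best] for row in rows):
        subset = [(r, c) for r, c in zip(rows, labels) if r[best] == value]
        node[best][value] = build_tree([r for r, _ in subset],
                                       [c for _, c in subset],
                                       remaining, select_attribute)
    return node
```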
Information Gain (ID3/C4.5)
Information gain is an attribute selection measure.
The basic idea is to select the attribute with the highest information gain.
Let D be a data partition (a set of training tuples and their class labels).

Expected information (entropy) needed to classify a tuple in D:

$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

Information needed (after using A to split D into v partitions) to classify D:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$
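A rough Python sketch of these three quantities, assuming D is represented as a list of class labels and A as a parallel list of attribute values (the `info`, `info_a`, and `gain` helper names are mine):

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) = -sum_i p_i * log2(p_i): entropy of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(labels).values())

def info_a(attr_values, labels):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j): expected information needed
    to classify D after splitting it on attribute A."""
    total = len(labels)
    result = 0.0
    for value in set(attr_values):
        subset = [c for v, c in zip(attr_values, labels) if v == value]
        result += len(subset) / total * info(subset)
    return result

def gain(attr_values, labels):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_a(attr_values, labels)
```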
Example
Suppose we have the database below (from Han, Kamber, & Pei, 2011) and want to build a decision tree for "who will buy the computer".
Class N: buys_computer = “yes”
Class M: buys_computer = “no”
First, select an attribute and calculate its gain. Here we choose age.
| age | # of "yes" | # of "no" | I(yes, no) |
|---|---|---|---|
| <=30 | 2 | 3 | 0.971 |
| 31-40 | 4 | 0 | 0 |
| >40 | 3 | 2 | 0.971 |
| total | 9 | 5 | 0.940 |
From the table, Info(D) = I(9, 5) = 0.940 and Info_age(D) = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.694, so Gain(age) = 0.940 - 0.694 = 0.246.

Similarly, we calculate the gains of the other attributes:

Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
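These numbers can be checked with the `info` helper sketched earlier (the 5/14 and 4/14 weights follow from the partition sizes in the table):

```python
# Entropy of each age partition, using the info() helper from the earlier sketch
print(round(info(["yes"] * 2 + ["no"] * 3), 3))   # I(2, 3) = 0.971
print(round(info(["yes"] * 4), 3))                # I(4, 0) = 0.0
print(round(info(["yes"] * 9 + ["no"] * 5), 3))   # I(9, 5) = 0.940 = Info(D)

# Gain(age) = Info(D) - Info_age(D)
info_age = 5/14 * 0.971 + 4/14 * 0.0 + 5/14 * 0.971   # = 0.694
print(round(0.940 - info_age, 3))                     # Gain(age) = 0.246
```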
Since age has the highest gain, it is chosen as the splitting attribute at the root; repeating the procedure on each branch, we finally get the decision tree below.
Gain Ratio (C4.5)
Gain ratio is another attribute selection measure.

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses gain ratio, a normalization of information gain, to overcome this problem:

$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$

$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$$

For example, the attribute income splits the 14 tuples into partitions of size 4, 6, and 4, so

$$SplitInfo_{income}(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557$$

$$GainRatio(income) = \frac{0.029}{1.557} = 0.019$$
The attribute with the maximum gain ratio is selected as the splitting attribute.
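A small sketch of the gain-ratio computation, reusing the `gain` helper from above; the `split_info` and `gain_ratio` names are mine, and the 4/6/4 partition sizes for income are taken from the textbook example rather than stated in the text:

```python
from collections import Counter
from math import log2

def split_info(attr_values):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    total = len(attr_values)
    return sum(-(n / total) * log2(n / total)
               for n in Counter(attr_values).values())

def gain_ratio(attr_values, labels):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D), with gain() from the earlier sketch."""
    return gain(attr_values, labels) / split_info(attr_values)

# income splits the 14 tuples into partitions of size 4, 6 and 4 (textbook counts)
income = ["low"] * 4 + ["medium"] * 6 + ["high"] * 4
print(round(split_info(income), 3))   # SplitInfo_income(D) = 1.557
```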
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, and p_j is the relative frequency of class j in D, the Gini index gini(D) is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

If a data set D is split on A into two subsets D1 and D2, the Gini index gini_A(D) is defined as

$$gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$$

Reduction in impurity:

$$\Delta gini(A) = gini(D) - gini_A(D)$$

The attribute that yields the smallest gini_A(D) (equivalently, the greatest reduction in impurity) is chosen to split the node; this requires enumerating all possible splitting points for each attribute.
In the example above, D has 9 tuples with buys_computer = "yes" and 5 with "no", so

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$

Suppose the attribute income partitions D into D1: {low, medium}, with 10 tuples, and D2: {high}, with 4 tuples. Then

$$gini_{income \in \{low, medium\}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2) = 0.443$$

The Gini index values for the other binary splits on income are 0.458 ({low, high} and {medium}) and 0.450 ({medium, high} and {low}), so {low, medium} (and {high}) is the best binary split on income because it has the lowest Gini index.
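The Gini numbers above can be reproduced with a short sketch; the per-subset class counts (7 "yes" / 3 "no" for {low, medium} and 2 "yes" / 2 "no" for {high}) come from the textbook example rather than from the text above:

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(d1_labels, d2_labels):
    """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2) for a binary split."""
    total = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / total * gini(d1_labels)
            + len(d2_labels) / total * gini(d2_labels))

d  = ["yes"] * 9 + ["no"] * 5    # the whole partition: 9 "yes", 5 "no"
d1 = ["yes"] * 7 + ["no"] * 3    # income in {low, medium} (textbook class counts)
d2 = ["yes"] * 2 + ["no"] * 2    # income in {high}

print(round(gini(d), 3))                         # gini(D) = 0.459
print(round(gini_split(d1, d2), 3))              # gini_income(D) = 0.443
print(round(gini(d) - gini_split(d1, d2), 3))    # reduction in impurity = 0.016
```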
- All attributes are assumed continuous-valued.
- Sometimes we may need other tools, e.g., clustering, to get the possible split values.
- The method can be modified for categorical attributes.
Comparison
Information gain: biased toward multivalued attributes.
Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
Gini index:
- Biased toward multivalued attributes.
- Has difficulty when the data set contains a large number of classes.
- Tends to favor tests that result in equal-sized partitions and purity in both partitions.
Reference
Han, J., Kamber, M., & Pei, J. (2011). Data Mining: Concepts and Techniques (3rd ed.). Burlington: Elsevier Science.