In this article by David Toth, the author of the book Data Science Algorithms in a Week, we will cover the following topics:

  • Concepts
  • Analysis

Concepts

A decision tree is an arrangement of the data in a tree structure where, at each node, the data is separated into different branches according to the value of the attribute at that node.

Analysis

To construct a decision tree, we will use the standard ID3 learning algorithm, which chooses the attribute that classifies the data samples in the best possible way, maximizing the information gain, a measure based on information entropy.

Information entropy

Information entropy of the given data measures the least amount of information necessary to represent a data item from that data. The units of information entropy are the familiar ones: a bit, a byte, a kilobyte, and so on. The lower the information entropy, the more regular the data is and the more pattern occurs in it, and thus the less information is necessary to represent it. That is why compression tools can take large text files and compress them to a much smaller size, as words and word expressions keep reoccurring, forming a pattern.

Coin flipping

Imagine we flip an unbiased coin. We would like to know whether the result is head or tail. How much information do we need to represent the result? Both words, head and tail, consist of 4 characters, and if we represent one character with one byte (8 bits), as is standard in the ASCII table, then we would need 4 bytes, or 32 bits, to represent the result.

But the information entropy is the least amount of data necessary to represent the result. We know that there are only two possible results: head or tail. If we agree to represent head with 0 and tail with 1, then 1 bit is sufficient to communicate the result efficiently. Here the data is the space of the possibilities of the result of the coin throw. It is the set {head,tail}, which can be represented as the set {0,1}. The actual result is a data item from this set. It turns out that the entropy of this set is 1. This is because the probabilities of head and tail are both 50%.

Now imagine that the coin is biased and throws head 25% of the time and tail 75% of the time. What would be the entropy of the probability space {0,1} this time? We could certainly represent the result with 1 bit of information. But can we do better? 1 bit is of course indivisible, but maybe we could generalize the concept of information to non-discrete amounts.

In the previous example, we know nothing about the result of the coin flip unless we look at the coin. But in the example with the biased coin, we know that the result tail is more likely to happen. If we recorded n results of coin flips in a file, representing heads with 0 and tails with 1, then about 75% of the bits there would have the value 1 and 25% of them would have the value 0. The size of such a file would be n bits. But since it is more regular (the pattern of 1s prevails in it), a good compression tool should be able to compress it to fewer than n bits.

To learn the theoretical bound on the compression and the amount of information necessary to represent a data item, we define information entropy precisely.

Definition of information entropy

Suppose that we are given a probability space S with the elements 1, 2, …, n. The probability that an element i would be chosen from the probability space is pi. Then the information entropy of the probability space is defined as:

E(S) = -p1 * log2(p1) - … - pn * log2(pn)

where log2 is the binary logarithm.

So the information entropy of the probability space of unbiased coin throws is E = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 0.5 + 0.5 = 1.

When the coin is biased, with a 25% chance of a head and a 75% chance of a tail, then the information entropy of such a space is:

E = -0.25 * log2(0.25) - 0.75 * log2(0.75) = 0.81127812445

which is less than 1. Thus, for example, if we had a large file with about 25% of 0 bits and 75% of 1 bits, a good compression tool should be able to compress it down to about 81.12% of its size.
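To make this concrete, here is a minimal Python sketch (the function name is my own, not taken from the book's code) that computes the entropy of a probability distribution and reproduces the two coin values above:

import math

def entropy_of_distribution(probabilities):
    # E = -p1*log2(p1) - ... - pn*log2(pn); terms with probability 0 contribute nothing.
    return -sum(p * math.log(p, 2) for p in probabilities if p > 0)

print(entropy_of_distribution([0.5, 0.5]))    # unbiased coin: 1.0
print(entropy_of_distribution([0.25, 0.75]))  # biased coin: ~0.81128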

Information gain

The information gain is the amount of information entropy gained as a result of a certain procedure. For example, if we would like to know the results of flipping 3 fair coins, the information entropy is 3 bits. But if we could look at the 3rd coin, then the information entropy of the result for the remaining 2 coins would be 2 bits. Thus, by looking at the 3rd coin, we gained 1 bit of information, so the information gain was 1.

We may also gain information entropy by dividing the whole set S into subsets, grouping the elements by a similar pattern. If we group the elements by their value of an attribute A, then we define the information gain as:

IG(S, A) = E(S) - Σ(v in values(A)) [(|Sv|/|S|) * E(Sv)]

where Sv is the set of the elements of S that have the value v for the attribute A.

For example, let us calculate the information gain for the 6 rows in the swimming example, taking swimming suit as the attribute. Because we are interested in whether a given row of data is classified as no or yes to the question whether one should swim, we will use swim preference to calculate the entropy and information gain. We partition the set S by the attribute swimming suit:

Snone = {(none,cold,no),(none,warm,no)}
Ssmall = {(small,cold,no),(small,warm,no)}
Sgood = {(good,cold,no),(good,warm,yes)}

The information entropy of S is:

E(S) = -(1/6)*log2(1/6) - (5/6)*log2(5/6) ~ 0.65002242164

The information entropy of the partitions is:

E(Snone) = -(2/2)*log2(2/2) = -log2(1) = 0, since all instances have the class no
E(Ssmall) = 0 for a similar reason
E(Sgood) = -(1/2)*log2(1/2) - (1/2)*log2(1/2) = 1

Therefore, the information gain is:

IG(S,swimming suit) = E(S) - [(2/6)*E(Snone) + (2/6)*E(Ssmall) + (2/6)*E(Sgood)]
= 0.65002242164 - (1/3) = 0.3166890883

If we chose the attribute water temperature to partition the set S, what would be the information gain IG(S,water temperature)? The water temperature partitions the set S into the following sets:

Scold={(none,cold,no),(small,cold,no),(good,cold,no)}
Swarm={(none,warm,no),(small,warm,no),(good,warm,yes)}

Their entropies are:

E(Scold) = 0, as all instances are classified as no
E(Swarm) = -(2/3)*log2(2/3) - (1/3)*log2(1/3) ~ 0.91829583405

Therefore, the information gain is:

IG(S,water temperature) = E(S) - [(3/6)*E(Scold) + (3/6)*E(Swarm)]
= 0.65002242164 - (1/2)*0.91829583405 = 0.19087450461

which is less than IG(S,swimming suit). Therefore, we can gain more information about the set S (the classification of its instances) by partitioning it by the attribute swimming suit instead of the attribute water temperature. This finding will be the basis of the ID3 algorithm constructing a decision tree in the next section.
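These two information gains can be checked with a short Python sketch (the helper names here are illustrative, not the book's implementation, which follows later in this article):

import math

S = [('none','cold','no'), ('small','cold','no'), ('good','cold','no'),
     ('none','warm','no'), ('small','warm','no'), ('good','warm','yes')]

# Entropy of the class labels (the last column, swim preference) in the given rows.
def entropy(rows, class_col=2):
    counts = {}
    for row in rows:
        counts[row[class_col]] = counts.get(row[class_col], 0) + 1
    return -sum((c / len(rows)) * math.log(c / len(rows), 2) for c in counts.values())

# IG(S, A) = E(S) minus the weighted entropies of the groups sharing a value at column col.
def information_gain(rows, col, class_col=2):
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row)
    return entropy(rows, class_col) - sum(
        (len(g) / len(rows)) * entropy(g, class_col) for g in groups.values())

print(information_gain(S, 0))  # swimming suit: ~0.31669
print(information_gain(S, 1))  # water temperature: ~0.19087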

ID3 algorithm

The ID3 algorithm constructs a decision tree from the data based on the information gain. In the beginning, we start with the set S. The data items in the set S have various properties, according to which we can partition the set S. If an attribute A has the values {v1, …, vn}, then we partition the set S into the sets Sv1, …, Svn, where the set Svi is the subset of the set S whose elements have the value vi for the attribute A.

If each element in the set S has the attributes A1, …, Am, then we can partition the set S according to any of the possible attributes. The ID3 algorithm partitions the set S according to the attribute that yields the highest information gain. Now suppose that this attribute is A1. Then for the set S we have the partitions Sv1, …, Svn, where A1 has the possible values {v1, …, vn}.

Since we have not constructed any tree yet, we first place a root node into the tree. For every partition of S, we place a new branch from the root. Every branch represents one value of the selected attribute. A branch has data samples with the same value for that attribute. For every new branch, we can define a new node that will have the data samples from its ancestor branch.

Once we have defined a new node, we choose another of the remaining attributes with the highest information gain for the data at that node to partition the data at that node further, and then define new branches and nodes. This process can be repeated until we run out of attributes for the nodes, or even earlier, until all the data at a node have the same class of interest. In the case of the swimming example, there are only two possible classes for swimming preference: class no and class yes. The last node is called a leaf node and decides the class of a data item from the data.

Tree construction by ID3 algorithm

Here we describe step by step how the ID3 algorithm would construct a decision tree from the given data samples in the swimming example. The initial set consists of 6 data samples:

S={(none,cold,no),(small,cold,no),(good,cold,no),(none,warm,no),(small,warm,no),(good,warm,yes)}

In the previous sections, we calculated the information gains for both of the non-classifying attributes, swimming suit and water temperature:

IG(S,swimming suit) = 0.3166890883
IG(S,water temperature) = 0.19087450461

Hence, we would choose the attribute swimming suit, as it has the higher information gain. There is no tree drawn yet, so we start from the root node. As the attribute swimming suit has 3 possible values {none, small, good}, we draw 3 branches out of the root, one for each value. Each branch will have one partition from the partitioned set S: Snone, Ssmall, Sgood. We add nodes to the ends of the branches. The Snone data samples all have the same class swimming preference = no, so we do not need to branch that node by a further attribute and partition that set. Thus, the node with the data Snone is already a leaf node. The same is true for the node with the data Ssmall.

But the node with the data Sgood has two possible classes for swimming preference. Therefore, we will branch the node further. There is only one non-classifying attribute left: water temperature. So there is no need to calculate the information gain for that attribute with the data Sgood. From the node Sgood we will have 2 branches, each with a partition of the set Sgood. One branch will have the set of data samples Sgood,cold = {(good,cold,no)}; the other branch will have the partition Sgood,warm = {(good,warm,yes)}. Each of these 2 branches will end with a node. Each node will be a leaf node, because each node has data samples with the same value for the classifying attribute swimming preference.

The resulting decision tree has 4 leaf nodes and is shown in the figure Decision tree for the swimming preference example.

Deciding with a decision tree

Once we have constructed a decision tree from the data with the attributes A1, …, Am and the classes {c1, …, ck}, we can use this decision tree to classify a new data item with the attributes A1, …, Am into one of the classes {c1, …, ck}.

Given a new data item that we would like to classify, we can think of each node, including the root, as a question for the data sample: What value does the data sample have for the selected attribute Ai? Based on the answer, we select a branch of the decision tree and move on to the next node. Then another question is answered about the data sample, and another, until the data sample reaches a leaf node. A leaf node has one of the classes {c1, …, ck} associated with it, say ci. The decision tree algorithm would then classify the data sample into the class ci.

Deciding a data sample with the swimming preference decision tree

We have constructed a decision tree for the swimming preference example with the ID3 algorithm. Now consider the data sample (good,cold,?); we would like to use the constructed decision tree to decide which class it should belong to.

We start with the data sample at the root of the tree. The first attribute that branches from the root is swimming suit, so we ask for the value of the attribute swimming suit of the sample (good,cold,?). We learn that the value of the attribute is swimming suit=good; therefore, we move down the rightmost branch with that value for its data samples. We arrive at the node with the attribute water temperature and ask the question: what is the value of the attribute water temperature for the data sample (good,cold,?)? We learn that for this data sample we have water temperature=cold; therefore, we move down the left branch into a leaf node. This leaf is associated with the class swimming preference=no. Therefore, the decision tree would classify the data sample (good,cold,?) to be in the class swimming preference=no, i.e. to complete it to the data sample (good,cold,no).

Therefore, the decision tree says that if one has a good swimming suit, but the water temperature is cold, then one would still not want to swim based on the data collected in the table.
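As a small illustration of this decision procedure (a hypothetical representation, not the book's program: the tree is written by hand as nested tuples and dictionaries keyed by attribute values), classifying the sample (good,cold,?) could look like this:

# The decision tree constructed above: an inner node is (attribute, branches),
# a leaf is simply the class label for swim preference.
swim_tree = ('swimming suit', {
    'none':  'no',
    'small': 'no',
    'good':  ('water temperature', {'cold': 'no', 'warm': 'yes'}),
})

def classify(tree, sample):
    # Follow the branch matching the sample's value for the node's attribute
    # until a leaf (a plain class label) is reached.
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[sample[attribute]]
    return tree

print(classify(swim_tree, {'swimming suit': 'good', 'water temperature': 'cold'}))  # prints: no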

Implementation

decision_tree.py

import math
import sys
# The anytree module is used to visualize the decision tree constructed by this ID3 algorithm.
from anytree import Node, RenderTree

import common

# Node for the construction of a decision tree.
class TreeNode:
    def __init__(self, var=None, val=None):
        self.children = []
        self.var = var
        self.val = val

    def add_child(self, child):
        self.children.append(child)

    def get_children(self):
        return self.children

    def get_var(self):
        return self.var

    def is_root(self):
        return self.var is None and self.val is None

    def is_leaf(self):
        return len(self.children) == 0

    def name(self):
        if self.is_root():
            return "[root]"
        return "[" + self.var + "=" + self.val + "]"

# Constructs a decision tree. heading is the heading of the table with the data,
# i.e. the names of the attributes. complete_data are data samples with a known
# value for every attribute. enquired_column is the index of the column (starting
# from zero) which holds the classifying attribute.
def construct_decision_tree(heading, complete_data, enquired_column):
    available_columns = []
    for col in range(0, len(heading)):
        if col != enquired_column:
            available_columns.append(col)
    tree = TreeNode()
    add_children_to_node(tree, heading, complete_data, available_columns, enquired_column)
    return tree

# Splits the data samples into groups, each having a different value for the attribute at the column col.
def split_data_by_col(data, col):
    data_groups = {}
    for data_item in data:
        if data_groups.get(data_item[col]) is None:
            data_groups[data_item[col]] = []
        data_groups[data_item[col]].append(data_item)
    return data_groups

# Adds a leaf node to node.
def add_leaf(node, heading, complete_data, enquired_column):
    node.add_child(TreeNode(heading[enquired_column], complete_data[0][enquired_column]))

# Adds all the descendants to the node.
def add_children_to_node(node, heading, complete_data, available_columns, enquired_column):
    if len(available_columns) == 0:
        add_leaf(node, heading, complete_data, enquired_column)
        return
    selected_col = select_col(complete_data, available_columns, enquired_column)
    for i in range(0, len(available_columns)):
        if available_columns[i] == selected_col:
            available_columns.pop(i)
            break

    data_groups = split_data_by_col(complete_data, selected_col)
    if len(data_groups) == 1:
        add_leaf(node, heading, complete_data, enquired_column)
        return

    for child_group, child_data in data_groups.items():
        child = TreeNode(heading[selected_col], child_group)
        add_children_to_node(child, heading, child_data, list(available_columns), enquired_column)
        node.add_child(child)

# Selects an available column/attribute with the highest information gain.
def select_col(complete_data, available_columns, enquired_column):
    selected_col = -1
    selected_col_information_gain = -1
    for col in available_columns:
        current_information_gain = col_information_gain(complete_data, col, enquired_column)
        if current_information_gain > selected_col_information_gain:
            selected_col = col
            selected_col_information_gain = current_information_gain
    return selected_col

# Calculates the information gain when partitioning complete_data according to the attribute
# at the column col and classifying by the attribute at enquired_column.
def col_information_gain(complete_data, col, enquired_column):
    data_groups = split_data_by_col(complete_data, col)
    information_gain = entropy(complete_data, enquired_column)
    for _, data_group in data_groups.items():
        information_gain -= (float(len(data_group)) / len(complete_data)) * entropy(data_group, enquired_column)
    return information_gain

# Calculates the entropy of the data classified by the attribute at the enquired_column.
def entropy(data, enquired_column):
    value_counts = {}
    for data_item in data:
        if value_counts.get(data_item[enquired_column]) is None:
            value_counts[data_item[enquired_column]] = 0
        value_counts[data_item[enquired_column]] += 1
    entropy = 0
    for _, count in value_counts.items():
        probability = float(count) / len(data)
        entropy -= probability * math.log(probability, 2)
    return entropy

# A visual output of a tree using text characters.
def display_tree(tree):
    anytree = convert_tree_to_anytree(tree)
    for pre, fill, node in RenderTree(anytree):
        print("%s%s" % (pre, node.name))

# A simple textual output of a tree without the visualization.
def display_tree_simple(tree):
    print('***Tree structure***')
    display_node(tree)
    sys.stdout.flush()

# A simple textual output of a node in a tree.
def display_node(node):
    if node.is_leaf():
        print('The node ' + node.name() + ' is a leaf node.')
        return
    sys.stdout.write('The node ' + node.name() + ' has children: ')
    for child in node.get_children():
        sys.stdout.write(child.name() + ' ')
    print('')
    for child in node.get_children():
        display_node(child)

# Converts a decision tree into the anytree module tree format to make it ready for rendering.
def convert_tree_to_anytree(tree):
    anytree = Node("Root")
    attach_children(tree, anytree)
    return anytree

# Attaches the children from the decision tree to the anytree tree format.
def attach_children(parent_node, parent_anytree_node):
    for child_node in parent_node.get_children():
        child_anytree_node = Node(child_node.name(), parent=parent_anytree_node)
        attach_children(child_node, child_anytree_node)

### PROGRAM START ###

if len(sys.argv) < 2:
    sys.exit('Please, input as an argument the name of the CSV file.')

csv_file_name = sys.argv[1]
(heading, complete_data, incomplete_data, enquired_column) = common.csv_file_to_ordered_data(csv_file_name)

tree = construct_decision_tree(heading, complete_data, enquired_column)
display_tree(tree)

common.py

import csv

# Reads the csv file into a table and then separates the table into the heading,
# the complete data and the incomplete data. It also produces the index number
# for the column that is not complete, i.e. contains a question mark.
def csv_file_to_ordered_data(csv_file_name):
    with open(csv_file_name) as f:
        reader = csv.reader(f)
        data = list(reader)
    return order_csv_data(data)

def order_csv_data(csv_data):
    # The first row in the CSV file is the heading of the data table.
    heading = csv_data.pop(0)
    complete_data = []
    incomplete_data = []
    # Let enquired_column be the column of the variable whose conditional probability
    # should be calculated. Here we set that column to be the last one.
    enquired_column = len(heading) - 1
    # Divide the data into the complete and the incomplete data. An incomplete row is
    # one that has a question mark in the enquired_column. The question mark will be
    # replaced by the calculated Bayesian probabilities from the complete data.
    for data_item in csv_data:
        if is_complete(data_item, enquired_column):
            complete_data.append(data_item)
        else:
            incomplete_data.append(data_item)
    return (heading, complete_data, incomplete_data, enquired_column)

# A row is complete if it does not contain a question mark in the enquired_column.
def is_complete(data_item, enquired_column):
    return data_item[enquired_column] != '?'

Program input

swim.csv

swimming_suit,water_temperature,swim
None,Cold,No
None,Warm,No
Small,Cold,No
Small,Warm,No
Good,Cold,No
Good,Warm,Yes

Program output

$ python decision_tree.py swim.csv

Root
├── [swimming_suit=Small]
│   ├── [water_temperature=Cold]
│   │   └── [swim=No]
│   └── [water_temperature=Warm]
│       └── [swim=No]
├── [swimming_suit=None]
│   ├── [water_temperature=Cold]
│   │   └── [swim=No]
│   └── [water_temperature=Warm]
│       └── [swim=No]
└── [swimming_suit=Good]
    ├── [water_temperature=Cold]
    │   └── [swim=No]
    └── [water_temperature=Warm]
        └── [swim=Yes]

Summary

In this article, we have learned the concept of a decision tree, the analysis behind constructing one with the ID3 algorithm, and its implementation.
