Decision Tree Classifiers | Let’s talk about them…
You have got a massive dataset, and now you are wondering how to make sense of it.
You are also thinking about ways to break that data down into some logical structure.
While you can always create a logical tree (you could also call it a KPI Tree) manually, what if I told you there is a machine learning algorithm that can take care of it on its own?
Alright! So first, let's see how you might create a KPI Tree out of a given dataset, manually.
We are looking at sample data for customers who visit an e-commerce website frequently. For each customer we have a few prominent web-navigation attributes: the device category they use, the browser they use, the channel (visit source) they arrived from, and the geo-network region they accessed the website from. Using these, we want to figure out whether they will convert or not.
I have created a sample dataset of 100 records (customers who visited the website, along with their web-navigation / behavioural attributes) :-
import pandas as pd
import random
# Sample data for each column
device_categories = ['Desktop', 'Android Mobile', 'Apple Mobile', 'Tablet', 'Laptop']
browsers = ['Chrome', 'Opera', 'Safari', 'Firefox', 'Edge']
visit_sources = ['Paid Social', 'Referral', 'Organic Search', 'Direct', 'Email Campaign']
geonetwork_regions = ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Chennai', 'Kolkata', 'Pune', 'Ahmedabad', 'Surat', 'Jaipur']
conversion_status = [0, 1]
# Generate random data
data = {
    'device_category': [random.choice(device_categories) for _ in range(100)],
    'browser': [random.choice(browsers) for _ in range(100)],
    'visit_source': [random.choice(visit_sources) for _ in range(100)],
    'geonetwork_region': [random.choice(geonetwork_regions) for _ in range(100)],
    'will_convert': [random.choice(conversion_status) for _ in range(100)]
}
# Create DataFrame
df = pd.DataFrame(data)
# Display the records
df
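Note that since the data is generated randomly, your records (and the exact numbers you see later in this post, such as the information gains and the model score) will differ from run to run. If you want a reproducible sample, you can seed the generator before generating the data; the seed value below is an arbitrary choice :-
# Optional: fix the random seed so the sampled records are reproducible across runs
random.seed(42)  # call this before generating the data; any fixed integer works, 42 is arbitrary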
Here’s how the data typically looks :-
Now, basically the moment we say — “whether they will convert or not”, it organically becomes a classification / prediction problem.
Though, the idea of a KPI Tree that I am used to from my previous organization requires some metric to be maintained across the different nodes of the tree. The metric itself stays the same across the nodes; what varies is, obviously, its value.
Now, I understand that we are diverging from the idea of a quintessential Decision Tree Classifier, which outputs the predicted class for a record at its leaf nodes and settles on the best logical structure to break the data into by looking at entropy / information gain, etc.
Yea - we will come to that.
But for now, coming back to the idea of a KPI Tree: it is more of a logical structure that simply dissects your data into the shape of a typical decision tree, except that the order / hierarchy of the nodes can be arbitrary, depending purely on which dimension you want to analyse at which level, based on your business requirements.
For instance, looking at the e-commerce example above, the Data Scientist / Analyst / any Dear John might want geo-network region as the first level of the KPI Tree, followed by device category, then browser, then visit source, and so on.
Which may spin up a KPI Tree that may look something like below :-
Ok, we haven’t fixed a KPI yet. Let’s call it Conversion Rate(%).
Looks good, aye?
So the KPI Tree shown below depicts the drill-down of Conversion Rate(%) observed for all the customers across the different geo-network regions, followed by the device categories, followed by the browsers, followed by the various visit sources.
So the way you interpret it is: let’s say Mumbai has a Conversion Rate(%) of X%; within Mumbai, Desktop as a device category has a Conversion Rate(%) of Y%; within Desktop, Chrome has a Conversion Rate(%) of Z%; and within Chrome, Paid Social has a Conversion Rate(%) of P%.
Similarly, you could read the tree for some other geo-network region, and under it some device category other than Desktop (or the same one). It all comes down to which path you want to traverse, as there are umpteen combinations / paths to follow.
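By the way, if you want to compute those X%, Y%, Z% numbers yourself, here is a minimal sketch (my own quick helper, not part of any KPI-Tree library) that walks one drill-down order with pandas; the column names are the ones from the sample dataframe above :-
# Conversion Rate(%) at each level of the drill-down: region -> device -> browser -> visit source
levels = ['geonetwork_region', 'device_category', 'browser', 'visit_source']
for depth in range(1, len(levels) + 1):
    # mean of the 0/1 target within each group = conversion rate; multiply by 100 for %
    rate = (df.groupby(levels[:depth])['will_convert'].mean() * 100).round(2)
    print(f"\nConversion Rate(%) by {' > '.join(levels[:depth])}:")
    print(rate.head())  # a few branches only; the full output is the KPI Tree, flattened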
How would the KPI Tree look after putting in the Conversion Rate(%) values?
Let me put in the values (time to revisit my Google Colab notebook).
https://colab.research.google.com/drive/1_h04I0NiD91jPDGe4V4I5gZ5HOEgoKVG#scrollTo=crVK5aaSJ8R1
You may request access if you want to look at the code.
We are going to solve a classification problem using the Decision Tree Algorithm.
When you have a dataset like this — it is easier to perform classification (i.e. draw a decision boundary using Logistic Regression) :-
Now, imagine a dataset as below where you may have to end up splitting the data multiple times in order to have very explicit decision boundaries.
Next, I’d request you to picture a decision tree in your head.
So, if by any chance you imagined a tree with Device Category as the root node, this is how the model might create the logical structure.
For what?
We’ll get to that in a while; for now, just look at the data snapshots below the tree. For every device category, we have created its sub-dataset. The red-marked records are the ones where the class is zero (0), which means that particular individual did not convert.
In every device category, the samples are mixed (what I mean is that records of both classes / colors are present).
The expectation is that until a node gives you records of a pure class (meaning all its records are either 1 (converted / green) or 0 (not converted / red)), the model keeps breaking the node down further, by asking more questions of the nodes that still have a mix of classes.
Criteria :- The customer converted / not converted (by looking at the column ‘will_convert’).
Now, if you look at the screenshots above, there is no device category such that, if you split the overall data on it, you get will_convert = 1 for all the customers regardless of browser, visit source & geo-network region.
So, to get to that “pure” view (if you will), where all the records in a node are either red or green (will_convert = 0 or will_convert = 1), you have to ask your data more questions.
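Just to make that “keep asking questions until the node is pure” idea concrete, here is a tiny sketch of the recursion (my own toy helper, not what sklearn does internally; note that it picks attributes in a fixed order rather than by information gain, which we will get to) :-
import pandas as pd

def grow_tree(records: pd.DataFrame, target: str, attributes: list):
    # A node becomes a leaf once it is "pure" (only one class left) or we run out of attributes
    classes = records[target].unique()
    if len(classes) == 1:
        return int(classes[0])                    # pure node: return the class as-is
    if not attributes:
        return int(records[target].mode()[0])     # mixed, but nothing left to split on: majority class
    attr = attributes[0]                          # naive choice; a real tree picks the best attribute
    return {f"{attr} = {value}": grow_tree(subset, target, attributes[1:])
            for value, subset in records.groupby(attr)}

kpi_tree = grow_tree(df, 'will_convert', ['device_category', 'browser', 'visit_source', 'geonetwork_region'])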
Let’s say I drill-down further into the set of records (customers) under Device Category = Desktop.
Alright, so the Yes and No you see on the nodes of the decision tree above are derived from the record-level views I have shown below for each decision / criterion. A decision is nothing but a set of rules / questions / criteria at a particular node.
PS: Please read the 0 and 1 on the leaf nodes as No and Yes respectively (they are just the classification labels). I used 0 and 1 because at the leaf level we are down to the most granular records in the data, where the target class is already 0 / 1. For the nodes above them, we derive Yes when all the records have target class (in our case, will_convert) = 1, and No when all the records have target class = 0.
So I basically just traversed the path of Device Category = Desktop and showed the drill-down till the end (leaf nodes).
However, the approach is the same when splitting the data for the other device categories, and for the dimensions (browser, visit source, geo-network region) underneath each of them.
The big question that arises is: how do we decide which attribute of the customer to start splitting the data with? In other words, what goes on the root node?
It could have been Browser too, no?
Ok, it is time to discuss Entropy (remember when teachers talked about randomness back in school? Yea, the same randomness, but in data, when we talk about it here in the case of Decision Tree Classifiers). You may get the urge to google it, aye?
Go ahead.
But just to give you a basic overview: low entropy (randomness) in the data subset below any particular device category or browser (depending on which one you are experimenting with as the root node) means that most of the records belong to one class, either red (0) or green (1).
High entropy (randomness) would mean, say, half the records are red (0) and half green (1), or something close like a 60% / 40% split. The idea is that the more evenly mixed the classes in a node, the higher the randomness and hence the entropy, and that is what we want to avoid.
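If you want a quick numerical feel for this before you google it, here is a tiny standalone check (not tied to our dataset) :-
from math import log2

def binary_entropy(p):
    # p = share of records in the node that belong to class 1 (converted)
    if p in (0, 1):
        return 0.0            # a pure node has zero entropy
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))    # 1.0    -> 50/50 split, maximum disorder
print(binary_entropy(0.9))    # ~0.469 -> 90/10 split, far less disorder
print(binary_entropy(1.0))    # 0.0    -> pure node, no disorder at all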
With that understanding, the screenshots below may also help you evaluate entropy when considering Device Category vs Browser as the root node.
Or maybe you want to go ahead with Visit Source or Geonetwork Region instead, maybe!!
— — — — — — — — — — — Let’s catch up once you have done your homework on entropy, aye?
— — — — — — — — — — — — — — — — — — — — — — — — — — — |
Alright, so it is Oct 02, 2024 and I am back to talk about Entropy again.
Now, last time when we spoke, we talked about deciding which attribute from your dataset should be used at the root node in a decision tree classifier, and for that here’s what I have to tell ya :-
You can calculate the information gain for each attribute. Information gain is based on the entropy of the system, and the attribute with the highest information gain is selected as the root node.
Here’s a step-by-step guide on how to calculate entropy and information gain in Python :-
Step 1: Calculate Entropy
Entropy is a measure of disorder or uncertainty. For a binary classification, it is calculated as:
Entropy(S) = −p1·log2(p1) − p2·log2(p2)
Where:
- p1 is the proportion of positive examples in the dataset.
- p2 is the proportion of negative examples in the dataset.
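For example (a made-up split, purely to show the arithmetic): if 40 of 100 customers converted, then p1 = 0.4, p2 = 0.6 and
Entropy(S) = −0.4·log2(0.4) − 0.6·log2(0.6) ≈ 0.971
which is close to the maximum of 1, i.e. a fairly mixed (high-disorder) dataset.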
Step 2: Calculate Information Gain
Information gain for an attribute is the difference between the entropy of the original dataset and the weighted entropy of the dataset after splitting on the attribute (the standard formula, consistent with the code below) :-
Gain(S, A) = Entropy(S) − Σv ( |Sv| / |S| ) · Entropy(Sv)
Where :-
- S is the dataset.
- A is the attribute.
- Sv is the subset of S for which attribute A has value v.
Now, I put together some Python code to come up with the attribute most eligible to be the root node of the decision tree (which, by the way, the Decision Tree Classifier algorithm chooses intrinsically on its own).
import numpy as np
import pandas as pd
from collections import Counter
from math import log2
# Function to calculate entropy
def entropy(target_column):
    elements, counts = np.unique(target_column, return_counts=True)
    entropy_value = np.sum([(-counts[i]/np.sum(counts)) * log2(counts[i]/np.sum(counts)) for i in range(len(elements))])
    return entropy_value
# Function to calculate information gain
def info_gain(data, split_attribute_name, target_name="will_convert"):
    total_entropy = entropy(data[target_name])
    # Values and counts for the split attribute
    vals, counts = np.unique(data[split_attribute_name], return_counts=True)
    # Weighted entropy for each subset
    weighted_entropy = np.sum([(counts[i]/np.sum(counts)) * entropy(data.where(data[split_attribute_name]==vals[i]).dropna()[target_name]) for i in range(len(vals))])
    # Information gain
    information_gain = total_entropy - weighted_entropy
    return information_gain
# Example usage:
# Use the sample dataset (df) created earlier
data = df
# Calculate information gain for each attribute
for col in data.columns[:-1]:  # Exclude the target column 'will_convert'
    print(f"Information Gain for {col}: {info_gain(data, col, 'will_convert')}")
This gives me the following output :-
Information Gain for device_category: 0.017176663225693
Information Gain for browser: 0.056722703285356446
Information Gain for visit_source: 0.07374935342424482
Information Gain for geonetwork_region: 0.07485460720313275
So, clearly, the information gain for geonetwork_region is the highest, making it the natural root node (the attribute on which to split the overall data) for the decision tree.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
Now, the part — where we make some predictions using the decision tree classifier model that we fit on the data.
Here’s the Google Colab link to the code I have written. Please request access if you’d like to view it.
https://colab.research.google.com/drive/1_h04I0NiD91jPDGe4V4I5gZ5HOEgoKVG#scrollTo=H8oejTX-g9WD
Step 01 :-
I create an inputs (independent variables) subset and a target column (dependent variable) subset below.
inputs = df.drop('will_convert', axis = 'columns')
target = df['will_convert']
Step 02 :-
I do some label encoding, since scikit-learn models work only with numerical data.
# Columns to label-encode: device_category, browser, visit_source, geonetwork_region
from sklearn.preprocessing import LabelEncoder
le_device_category = LabelEncoder()
le_browser = LabelEncoder()
le_visit_source = LabelEncoder()
le_geonetwork_region = LabelEncoder()
inputs['device_category_n'] = le_device_category.fit_transform(inputs['device_category'])
inputs['browser_n'] = le_browser.fit_transform(inputs['browser'])
inputs['visit_source_n'] = le_visit_source.fit_transform(inputs['visit_source'])
inputs['geonetwork_region_n'] = le_geonetwork_region.fit_transform(inputs['geonetwork_region'])
inputs.head(5)
Here’s how the inputs data-frame looks (printing the head of the inputs subset) :-
Looking at the target column subset (top five records / head) :-
target.head(5)
Step 03 : Next, I drop the original categorical columns, which I label-encoded in the previous step.
inputs_n = inputs.drop(['device_category','browser','visit_source','geonetwork_region'],axis='columns')
inputs_n
This is how the resultant data-frame looks :-
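If you want to check which number ended up mapping to which category (handy when reading the encoded frame), each fitted LabelEncoder exposes its learned classes; the index position is the encoded value :-
# e.g. le_device_category.classes_[0] is the category that was encoded as 0
print(le_device_category.classes_)
print(le_browser.classes_)
print(le_visit_source.classes_)
print(le_geonetwork_region.classes_)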
Step 04 : Next, I instantiate a Decision Tree Classifier model.
from sklearn import tree
model_ = tree.DecisionTreeClassifier()
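A small aside: by default, scikit-learn’s DecisionTreeClassifier splits using Gini impurity. If you’d like it to use the entropy / information-gain criterion we discussed above, you can say so explicitly (the random_state below is arbitrary, just for reproducibility) :-
# Optional alternative: use entropy instead of the default Gini impurity
model_ = tree.DecisionTreeClassifier(criterion='entropy', random_state=42)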
Step 05 : Next, I fit the Decision Tree Classifier model with the data (X, y). X being my independent variables feature set (label encoded) i.e. inputs_n and y being the target variable.
model_.fit(inputs_n, target)
Step 06 : I look at the model score (accuracy, basically).
model_.score(inputs_n,target)
I observe a score of 0.95.
Now, obviously this looks very promising, but take it with a pinch of salt.
The score is inflated primarily because I have been testing the model on the same data on which it was fit.
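A more honest estimate would hold out some records the model never sees during fitting; here is a quick sketch (the 80/20 split and random_state are arbitrary choices) :-
from sklearn.model_selection import train_test_split

# Hold out 20% of the records purely for evaluation
X_train, X_test, y_train, y_test = train_test_split(inputs_n, target, test_size=0.2, random_state=42)

model_heldout = tree.DecisionTreeClassifier()
model_heldout.fit(X_train, y_train)
print(model_heldout.score(X_test, y_test))  # typically noticeably lower than the training-set score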
Next, I make some predictions.
- Let us first look at the prediction for a consumer who used Apple Mobile as the device category, Firefox as the browser, Direct as the visit source & came from Delhi as his/her geo-network region.
PREDICTION : 0 (basically, will not convert, and it matches the original dataset)
- Next, the prediction for a consumer who used Apple Mobile as the device category, Edge as the browser, Direct as the visit source & came from Mumbai as his/her geo-network region.
PREDICTION : 1 (basically, will convert, and it matches the original dataset).
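For reference, here is roughly how such a prediction can be made (a sketch: we reuse the already-fitted label encoders to turn the raw categories into the numeric codes the model was trained on; the exact output depends on the randomly generated data) :-
# Encode one hypothetical customer with the same encoders used during training
sample = pd.DataFrame({
    'device_category_n': le_device_category.transform(['Apple Mobile']),
    'browser_n': le_browser.transform(['Firefox']),
    'visit_source_n': le_visit_source.transform(['Direct']),
    'geonetwork_region_n': le_geonetwork_region.transform(['Delhi'])
})
print(model_.predict(sample))  # e.g. array([0]) -> will not convert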
That’s it for today, again!
If you wish to practice with the Decision Tree Classifier, you can go to Kaggle and look for suitable datasets there.
My recommendation — Titanic Survival Dataset.