An Example of Creating a Decision Tree

Travis Smith

3 years ago

Every decision has the following steps:

Step 1: Determine the factor. In this example, it is the factor whose correlation with Y has the largest absolute value.
Step 1a: Determine the correlation.
Step 1b: Analyze the absolute values of the correlation.
Step 2: Determine the Split Value by taking the median.
Step 3: Add node to the decision tree.
Step 4: Split the data: left & right.

Repeat with both sides of the data (left first, then right because in the end you want to be right).

Given this data set, where X2, X10, and X11 are the factors in the decision tree and Y is the value to predict:

row #	X2	X10	X11	Y
0	0.885	0.330	9.100	4.000
1	0.725	0.390	10.900	5.000
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000
4	0.610	0.630	8.400	3.000
5	0.260	0.630	11.800	8.000
6	0.500	0.680	10.500	7.000
7	0.320	0.780	10.000	6.000

DECISION #0 - ROOT

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #	X2	X10	X11	Y
correl	-0.731	0.406	0.826

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	0.731	0.406	0.826

X11 has the largest absolute correlation (0.826), so it has the biggest impact and I will split on X11.
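Steps 1a and 1b can be reproduced in a few lines of Python. This is a minimal sketch rather than the author's own code: the `pearson` helper and the `columns` dictionary are names made up for illustration, and only the standard library is used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The eight rows of the data set, stored column by column (rows 0 to 7).
columns = {
    "X2":  [0.885, 0.725, 0.560, 0.735, 0.610, 0.260, 0.500, 0.320],
    "X10": [0.330, 0.390, 0.500, 0.570, 0.630, 0.630, 0.680, 0.780],
    "X11": [9.100, 10.900, 9.400, 9.800, 8.400, 11.800, 10.500, 10.000],
}
y = [4.0, 5.0, 6.0, 5.0, 3.0, 8.0, 7.0, 6.0]

# Step 1a: correlation of each factor with Y.
correl = {name: pearson(values, y) for name, values in columns.items()}

# Step 1b: the factor with the largest absolute correlation wins.
best = max(correl, key=lambda name: abs(correl[name]))

for name, r in correl.items():
    print(f"{name}: {r:.3f}")
print("split on", best)
```

Taking the absolute value matters here: X2 is strongly correlated with Y but in the negative direction, so it would rank last if the signed values were compared.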

Step 2: Determine the Split Value by taking the median.

The median of X11 is 9.9.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?

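Left and Right are relative offsets: they count how many rows below the current node the child node sits. Because the left tree is always built first, it starts on the very next row, so the left value will always be 1 (or nan for leaves). The right value stays unknown (?) until the left tree is complete.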

Step 4: Split the data. Rows with X11 below the split value of 9.900 go left (rows 4, 0, 2, 3); the rest go right (rows 7, 6, 1, 5):

row #	X2	X10	X11	Y
correl	-0.731	0.406	0.826
4	0.610	0.630	8.400	3.000
0	0.885	0.330	9.100	4.000
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000
7	0.320	0.780	10.000	6.000
6	0.500	0.680	10.500	7.000
1	0.725	0.390	10.900	5.000
5	0.260	0.630	11.800	8.000
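Steps 2 and 4 for the root can be sketched as follows. The `rows` dictionary and the `X11` index constant are illustrative names, and sending values equal to the split value to the left is an assumption (no row in this example sits exactly on a split value, so the walkthrough does not settle it).

```python
from statistics import median

# row id -> (X2, X10, X11, Y), copied from the data set above
rows = {
    0: (0.885, 0.330, 9.100, 4.0),
    1: (0.725, 0.390, 10.900, 5.0),
    2: (0.560, 0.500, 9.400, 6.0),
    3: (0.735, 0.570, 9.800, 5.0),
    4: (0.610, 0.630, 8.400, 3.0),
    5: (0.260, 0.630, 11.800, 8.0),
    6: (0.500, 0.680, 10.500, 7.0),
    7: (0.320, 0.780, 10.000, 6.0),
}
X11 = 2  # position of X11 inside each tuple

# Step 2: the split value is the median of the chosen factor.
split_val = median(values[X11] for values in rows.values())

# Step 4: walk the rows in ascending X11 order and send each one left or right.
# Assumption: a value equal to the split value goes left.
ordered = sorted(rows, key=lambda row_id: rows[row_id][X11])
left = [row_id for row_id in ordered if rows[row_id][X11] <= split_val]
right = [row_id for row_id in ordered if rows[row_id][X11] > split_val]

print(f"split value: {split_val:.3f}")
print("left rows:", left)
print("right rows:", right)
```

With an even number of rows, the median is the mean of the two middle values, which is why the split value falls between rows 3 and 7 rather than on one of them.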

DECISION #1 - LEFT TREE

With my subtree, I now have this data:

row #	X2	X10	X11	Y
4	0.610	0.630	8.400	3.000
0	0.885	0.330	9.100	4.000
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation.

row #	X2	X10	X11	Y
correl	-0.267	-0.149	0.808

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	0.267	0.149	0.808

The biggest impact will be X11 again.

Step 2: Determine the Split Value by taking the median.

The median of X11 in this subtree is 9.25.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	?

Since we don’t know where the right decision nodes are yet, we cannot update that.

Step 4: Split the data. Rows with X11 below 9.250 go left (rows 4, 0); the rest go right (rows 2, 3):

row #	X2	X10	X11	Y
correl	-0.267	-0.149	0.808
4	0.610	0.630	8.400	3.000
0	0.885	0.330	9.100	4.000
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000

DECISION #1.1 - LEFT TREE: LEFT SUBTREE

Step 0: The data

row #	X2	X10	X11	Y
4	0.610	0.630	8.400	3.000
0	0.885	0.330	9.100	4.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #	X2	X10	X11	Y
correl	1.000	-1.000	1.000

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	1.000	1.000	1.000

All the correlations are the same, so let’s take the first one, X2.
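The tie-break is worth a small sketch, because in floating point a correlation that is "really" 1 can come out as 0.9999999999999999 and silently stop being a tie. The version below rounds before comparing; the rounding to six decimals and the helper names are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The two rows left in this subtree (row 4, then row 0).
columns = {
    "X2":  [0.610, 0.885],
    "X10": [0.630, 0.330],
    "X11": [8.400, 9.100],
}
y = [3.0, 4.0]

# Round before comparing so that floating-point noise cannot break a tie.
scores = {name: round(abs(pearson(values, y)), 6)
          for name, values in columns.items()}

# max() returns the first maximal key it meets, and dicts keep insertion
# order, so a tie resolves to the leftmost factor.
best = max(scores, key=scores.get)
print(scores, best)
```

With only two rows, any factor whose two values differ is perfectly correlated with Y, so this tie appears at every node that sits directly above two leaves.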

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.748.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	?
2	X2	0.748	1	2

The right values of nodes 0 and 1 still cannot be filled in, because we don’t know where their right subtrees will start. For node 2, however, only two rows of data remain, so both relative values are already known: the left is 1 as always and the right is 2, which will always be the case for a node whose two children are leaves.

Step 4: Split the data. The row with X2 below 0.748 goes left (row 4); the other goes right (row 0):

row #	X2	X10	X11	Y
correl	1.000	-1.000	1.000
4	0.610	0.630	8.400	3.000
0	0.885	0.330	9.100	4.000

DECISION #1.1.1 - LEFT TREE: LEFT SUBTREE: LEFT LEAF

Step 0: The data

row #	X2	X10	X11	Y
4	0.610	0.630	8.400	3.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	?
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #1.1.2 - LEFT TREE: LEFT SUBTREE: RIGHT LEAF

Step 0: The data

row #	X2	X10	X11	Y
0	0.885	0.330	9.100	4.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	?
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan

This completes the left subtree of the left tree.

DECISION #1.1 - UPDATE

Now that we know where the right tree of the left tree will start, let’s update that tree node’s right relative value. Since the tree node is node 1 and the right tree will start on node 5, the value is 4 (5-1).

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
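The update rule is simply "row where the right subtree will start, minus the row of the parent". A minimal sketch, assuming the tree is kept as a Python list of `[Factor, SplitVal, Left, Right]` rows with `None` standing in for the "?" entries (that layout is an assumption made for illustration):

```python
nan = float("nan")

# Tree rows as [Factor, SplitVal, Left, Right]; None marks the "?" entries.
tree = [
    ["X11", 9.900, 1, None],    # node 0
    ["X11", 9.250, 1, None],    # node 1
    ["X2", 0.748, 1, 2],        # node 2
    ["LEAF", 3.000, nan, nan],  # node 3
    ["LEAF", 4.000, nan, nan],  # node 4
]

RIGHT = 3             # position of the Right column
node = 1              # the node whose left subtree was just finished
next_row = len(tree)  # the right subtree will start on this row
tree[node][RIGHT] = next_row - node

print(tree[node])
```

Put differently, the right offset is always the size of the left subtree plus one. Node 0 stays at `None` here because its own left subtree (everything under node 1) is not finished yet.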

DECISION #1.2 - LEFT TREE: RIGHT SUBTREE

Step 0: The data

row #	X2	X10	X11	Y
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #	X2	X10	X11	Y
correl	-1.000	-1.000	-1.000

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	1.000	1.000	1.000

All the correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.648.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2

Step 4: Split the data. The row with X2 below 0.648 goes left (row 2); the other goes right (row 3):

row #	X2	X10	X11	Y
correl	-1.000	-1.000	-1.000
2	0.560	0.500	9.400	6.000
3	0.735	0.570	9.800	5.000

DECISION #1.2.1 - LEFT TREE: RIGHT SUBTREE: LEFT LEAF

Step 0: The data

row #	X2	X10	X11	Y
2	0.560	0.500	9.400	6.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #1.2.2 - LEFT TREE: RIGHT SUBTREE: RIGHT LEAF

Step 0: The data

row #	X2	X10	X11	Y
3	0.735	0.570	9.800	5.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	?
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan

This completes the right subtree of the left tree.
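Because Left and Right are relative offsets, the seven rows of the left tree (nodes 1 to 7) already form a usable tree by themselves once they are re-indexed from 0. The sketch below walks that table for two of the training rows. The `query` function, the dictionary-based `point`, and the rule that values at or below the split value go left are all assumptions made for illustration; split values are kept unrounded (0.7475 rather than 0.748).

```python
nan = float("nan")

# Nodes 1 to 7 of the table, re-indexed from 0.
# Rows are (Factor, SplitVal, Left, Right).
left_tree = [
    ("X11", 9.25, 1, 4),
    ("X2", 0.7475, 1, 2),
    ("LEAF", 3.0, nan, nan),
    ("LEAF", 4.0, nan, nan),
    ("X2", 0.6475, 1, 2),
    ("LEAF", 6.0, nan, nan),
    ("LEAF", 5.0, nan, nan),
]

def query(tree, point):
    """Follow relative offsets from row 0 until a leaf is reached."""
    i = 0
    while tree[i][0] != "LEAF":
        factor, split_val, left, right = tree[i]
        # Assumption: values at or below the split value go left.
        i += left if point[factor] <= split_val else right
    return tree[i][1]  # a leaf stores its Y value in the SplitVal column

row_4 = {"X2": 0.610, "X10": 0.630, "X11": 8.400}
row_3 = {"X2": 0.735, "X10": 0.570, "X11": 9.800}
print(query(left_tree, row_4))
print(query(left_tree, row_3))
```

Note that the loop adds the offset to the current index instead of jumping to an absolute row, which is what lets a subtree be cut out of the table and used unchanged.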

DECISION #1 - UPDATE

Now that we know where the right tree of the main tree will start, let’s update the root tree node’s right relative value. Since the tree node is node zero (0) and the right tree will start on node 8, the value is 8 (8-0).

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8

DECISION #2 - RIGHT TREE

With my subtree, I now have this data:

row #	X2	X10	X11	Y
7	0.320	0.780	10.000	6.000
6	0.500	0.680	10.500	7.000
1	0.725	0.390	10.900	5.000
5	0.260	0.630	11.800	8.000

Step 1: Determine the factor.

Step 1a: Determine the correlation.

row #	X2	X10	X11	Y
correl	-0.750	0.484	0.542

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	0.750	0.484	0.542

The biggest impact will be X2.

Step 2: Determine the Split Value by taking the median.

The median of X2 in this subtree is 0.410.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	?

Since we don’t know where the right decision nodes are yet, we cannot update that.

Step 4: Split the data. Rows with X2 below 0.410 go left (rows 7, 5); the rest go right (rows 6, 1):

row #	X2	X10	X11	Y
correl	-0.750	0.484	0.542
7	0.320	0.780	10.000	6.000
5	0.260	0.630	11.800	8.000
6	0.500	0.680	10.500	7.000
1	0.725	0.390	10.900	5.000

DECISION #2.1 - RIGHT TREE: LEFT SUBTREE

Step 0: The data

row #	X2	X10	X11	Y
7	0.320	0.780	10.000	6.000
5	0.260	0.630	11.800	8.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #	X2	X10	X11	Y
correl	-1.000	-1.000	1.000

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	1.000	1.000	1.000

All the correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.290.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	?
9	X2	0.290	1	2

Step 4: Split the data. The row with X2 below 0.290 goes left (row 5); the other goes right (row 7):

row #	X2	X10	X11	Y
correl	-1.000	-1.000	1.000
5	0.260	0.630	11.800	8.000
7	0.320	0.780	10.000	6.000

DECISION #2.1.1 - RIGHT TREE: LEFT SUBTREE: LEFT LEAF

Step 0: The data

row #	X2	X10	X11	Y
5	0.260	0.630	11.800	8.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	?
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #2.1.2 - RIGHT TREE: LEFT SUBTREE: RIGHT LEAF

Step 0: The data

row #	X2	X10	X11	Y
7	0.320	0.780	10.000	6.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	?
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan
11	LEAF	6.000	nan	nan

This completes the left subtree of the right tree.

DECISION #2.1 - UPDATE

Now that we know where the right subtree of the right tree will start, let’s update that tree node’s right relative value. Since the tree node is node 8 and the right subtree will start on node 12, the value is 4 (12-8).

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	4
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan
11	LEAF	6.000	nan	nan
12

DECISION #2.2 - RIGHT TREE: RIGHT SUBTREE

Step 0: The data

row #	X2	X10	X11	Y
6	0.500	0.680	10.500	7.000
1	0.725	0.390	10.900	5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #	X2	X10	X11	Y
correl	-1.000	1.000	-1.000

Step 1b: Analyze the absolute values of the correlation

row #	X2	X10	X11	Y
correl	1.000	1.000	1.000

All the correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.613.

Step 3: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	4
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan
11	LEAF	6.000	nan	nan
12	X2	0.613	1	2

Step 4: Split the data. The row with X2 below 0.613 goes left (row 6); the other goes right (row 1):

row #	X2	X10	X11	Y
correl	-1.000	1.000	-1.000
6	0.500	0.680	10.500	7.000
1	0.725	0.390	10.900	5.000

DECISION #2.2.1 - RIGHT TREE: RIGHT SUBTREE: LEFT LEAF

Step 0: The data

row #	X2	X10	X11	Y
6	0.500	0.680	10.500	7.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	4
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan
11	LEAF	6.000	nan	nan
12	X2	0.613	1	2
13	LEAF	7.000	nan	nan

Every right value in the table is already filled in at this point, so there is nothing to update.

DECISION #2.2.2 - RIGHT TREE: RIGHT SUBTREE: RIGHT LEAF

Step 0: The data

row #	X2	X10	X11	Y
1	0.725	0.390	10.900	5.000

Now that we only have one row, we have a leaf.

Step Final: Add node to the decision tree.

Tree
node	Factor	SplitVal	Left	Right
0	X11	9.900	1	8
1	X11	9.250	1	4
2	X2	0.748	1	2
3	LEAF	3.000	nan	nan
4	LEAF	4.000	nan	nan
5	X2	0.648	1	2
6	LEAF	6.000	nan	nan
7	LEAF	5.000	nan	nan
8	X2	0.410	1	4
9	X2	0.290	1	2
10	LEAF	8.000	nan	nan
11	LEAF	6.000	nan	nan
12	X2	0.613	1	2
13	LEAF	7.000	nan	nan
14	LEAF	5.000	nan	nan

This completes the right tree.

CONCLUSION

This completes the decision tree. Now go forth and make this in Python.
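As a starting point, here is one possible sketch of the whole procedure as a recursive function, using only the standard library. It is not the author's implementation: the names `build_tree`, `pearson`, `DATA`, and `FACTORS` are made up for illustration, and three choices are assumptions on top of the walkthrough: values equal to the split value go left, correlations are rounded to six decimals before ties are compared, and a subset that the median cannot separate becomes a leaf holding the mean Y.

```python
import math
from statistics import median

NAN = float("nan")
FACTORS = ["X2", "X10", "X11"]

# (X2, X10, X11, Y) for rows 0 to 7 of the data set.
DATA = [
    (0.885, 0.330, 9.100, 4.0),
    (0.725, 0.390, 10.900, 5.0),
    (0.560, 0.500, 9.400, 6.0),
    (0.735, 0.570, 9.800, 5.0),
    (0.610, 0.630, 8.400, 3.0),
    (0.260, 0.630, 11.800, 8.0),
    (0.500, 0.680, 10.500, 7.0),
    (0.320, 0.780, 10.000, 6.0),
]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side has no variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0

def build_tree(rows):
    """Return a list of [Factor, SplitVal, Left, Right] rows for `rows`."""
    ys = [r[-1] for r in rows]
    if len(rows) == 1:
        # A single row is a leaf: SplitVal holds its Y value.
        return [["LEAF", ys[0], NAN, NAN]]

    # Steps 1a and 1b: absolute correlation per factor, rounded so that
    # floating-point noise cannot break a tie; index() keeps the first winner.
    scores = [round(abs(pearson([r[i] for r in rows], ys)), 6)
              for i in range(len(FACTORS))]
    best = scores.index(max(scores))

    # Step 2: the split value is the median of the chosen factor.
    split_val = median(r[best] for r in rows)

    # Step 4: split the data (assumption: values <= split value go left).
    left_rows = [r for r in rows if r[best] <= split_val]
    right_rows = [r for r in rows if r[best] > split_val]
    if not left_rows or not right_rows:
        # Assumption: if the median cannot separate the rows, stop here.
        return [["LEAF", sum(ys) / len(ys), NAN, NAN]]

    # Left first, then right.
    left = build_tree(left_rows)
    right = build_tree(right_rows)

    # Step 3: the node itself. Left is always 1; Right skips the left subtree.
    return [[FACTORS[best], split_val, 1, len(left) + 1]] + left + right

tree = build_tree(DATA)
print("node\tFactor\tSplitVal\tLeft\tRight")
for i, (factor, split_val, left, right) in enumerate(tree):
    print(f"{i}\t{factor}\t{split_val:.3f}\t{left}\t{right}")
```

Building each subtree as its own list and concatenating afterwards means the right offset never has to be patched in later: by the time the parent row is created, the length of the left subtree is already known. Split values are stored unrounded, so the third decimal printed for a value such as 0.7475 may differ from the hand-rounded figure.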