Travis Smith

An Example of Creating a Decision Tree

Every decision has the following steps:

  1. Determine the factor: In this example, it will be the factor whose correlation with Y has the largest absolute value.
    1. Determine the correlation
    2. Analyze the absolute values of the correlation
  2. Determine the Split Value by taking the median.
  3. Add node to the decision tree.
  4. Split the data: left & right.

Repeat with both sides of the data (left first, then right because in the end you want to be right).

Given this data set, where X2, X10, and X11 are the factors available to the decision tree and Y is the value to predict:

row #       X2     X10     X11       Y
0        0.885   0.330   9.100   4.000
1        0.725   0.390  10.900   5.000
2        0.560   0.500   9.400   6.000
3        0.735   0.570   9.800   5.000
4        0.610   0.630   8.400   3.000
5        0.260   0.630  11.800   8.000
6        0.500   0.680  10.500   7.000
7        0.320   0.780  10.000   6.000

DECISION #0 - ROOT

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #       X2     X10     X11       Y
correl  -0.731   0.406   0.826

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   0.731   0.406   0.826

The biggest impact is X11, so I will split on X11.
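
Steps 1a and 1b could be sketched in Python roughly like this. It is only an illustration, not the required implementation: the array layout, the `names` list, and the `correlations` helper are assumptions made for this example, and `np.corrcoef` supplies the Pearson correlation.

```python
import numpy as np

# Data from the table above; columns are X2, X10, X11, Y.
data = np.array([
    [0.885, 0.330,  9.100, 4.000],
    [0.725, 0.390, 10.900, 5.000],
    [0.560, 0.500,  9.400, 6.000],
    [0.735, 0.570,  9.800, 5.000],
    [0.610, 0.630,  8.400, 3.000],
    [0.260, 0.630, 11.800, 8.000],
    [0.500, 0.680, 10.500, 7.000],
    [0.320, 0.780, 10.000, 6.000],
])
names = ["X2", "X10", "X11"]  # hypothetical labels for the factor columns


def correlations(rows):
    """Pearson correlation of each factor column with Y (the last column)."""
    y = rows[:, -1]
    return np.array([np.corrcoef(rows[:, i], y)[0, 1]
                     for i in range(rows.shape[1] - 1)])


corr = correlations(data)               # Step 1a
best = int(np.argmax(np.abs(corr)))     # Step 1b: largest absolute value
print(np.round(corr, 3), names[best])
```

With this data the rounded correlations match the table above and the chosen factor is X11.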

Step 2: Determine the Split Value by taking the median.

The median of X11 is 9.9.
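
Taking the median is a one-liner with NumPy. A small sketch, assuming the X11 column has been copied into an array called `x11` (a name made up for this example):

```python
import numpy as np

# X11 column from the data table, in row order 0 to 7.
x11 = np.array([9.1, 10.9, 9.4, 9.8, 8.4, 11.8, 10.5, 10.0])

# With an even number of rows, np.median averages the two middle
# values of the sorted column (here 9.8 and 10.0).
split_val = float(np.median(x11))
print(round(split_val, 3))
```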

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?

Because the left subtree is always built first and node positions are recorded relative to the current node, the Left value of a decision node will always be 1 (or nan for leaves).

Step 4: Split the data (rows with X11 at or below 9.9 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -0.731   0.406   0.826
4        0.610   0.630   8.400   3.000  left
0        0.885   0.330   9.100   4.000  left
2        0.560   0.500   9.400   6.000  left
3        0.735   0.570   9.800   5.000  left
7        0.320   0.780  10.000   6.000  right
6        0.500   0.680  10.500   7.000  right
1        0.725   0.390  10.900   5.000  right
5        0.260   0.630  11.800   8.000  right
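
The split itself can be sketched with a boolean mask. This assumes that rows whose X11 is less than or equal to the split value go left; the array layout (row number in the first column) and the variable names are only for illustration.

```python
import numpy as np

# Columns: row #, X2, X10, X11, Y.
data = np.array([
    [0, 0.885, 0.330,  9.100, 4.000],
    [1, 0.725, 0.390, 10.900, 5.000],
    [2, 0.560, 0.500,  9.400, 6.000],
    [3, 0.735, 0.570,  9.800, 5.000],
    [4, 0.610, 0.630,  8.400, 3.000],
    [5, 0.260, 0.630, 11.800, 8.000],
    [6, 0.500, 0.680, 10.500, 7.000],
    [7, 0.320, 0.780, 10.000, 6.000],
])
X11 = 3  # column index of X11 in this particular array

split_val = np.median(data[:, X11])
go_left = data[:, X11] <= split_val   # assumption: ties with the median go left

left, right = data[go_left], data[~go_left]
left_rows = left[:, 0].astype(int).tolist()
right_rows = right[:, 0].astype(int).tolist()
print(left_rows, right_rows)
```

Rows 0, 2, 3, and 4 land on the left and rows 1, 5, 6, and 7 on the right. A fuller implementation would also need to decide what to do when every row lands on the same side, which cannot happen with this data set.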

DECISION #1 - LEFT TREE

With my subtree, I now have this data:

row #       X2     X10     X11       Y
4        0.610   0.630   8.400   3.000
0        0.885   0.330   9.100   4.000
2        0.560   0.500   9.400   6.000
3        0.735   0.570   9.800   5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation.

row #       X2     X10     X11       Y
correl  -0.267  -0.149   0.808

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   0.267   0.149   0.808

The biggest impact will be X11 again.

Step 2: Determine the Split Value by taking the median.

The median of X11 in this subtree is 9.25.

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      ?

Since we don’t know where the right decision nodes are yet, we cannot update that.

Step 4: Split the data (rows with X11 at or below 9.25 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -0.267  -0.149   0.808
4        0.610   0.630   8.400   3.000  left
0        0.885   0.330   9.100   4.000  left
2        0.560   0.500   9.400   6.000  right
3        0.735   0.570   9.800   5.000  right

DECISION #1.1 - LEFT TREE: LEFT SUBTREE

Step 0: The data

row #       X2     X10     X11       Y
4        0.610   0.630   8.400   3.000
0        0.885   0.330   9.100   4.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #       X2     X10     X11       Y
correl   1.000  -1.000   1.000

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   1.000   1.000   1.000

All the absolute correlations are the same, so let’s take the first one, X2.
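
One thing to watch for when doing this in code: with only two rows, the computed correlations can come out as 1.0 for one factor and 0.9999999999999998 for another, so the "first one" is not guaranteed to win on its own. A minimal sketch of one way around that, where rounding before comparing is an assumption of this example and not part of the steps above:

```python
import numpy as np

# The two rows of this subtree; columns are X2, X10, X11, Y.
rows = np.array([
    [0.610, 0.630, 8.400, 3.000],   # row 4
    [0.885, 0.330, 9.100, 4.000],   # row 0
])
names = ["X2", "X10", "X11"]  # hypothetical labels for the factor columns

y = rows[:, -1]
corr = np.array([np.corrcoef(rows[:, i], y)[0, 1] for i in range(3)])

# Round so tiny floating point differences do not decide the winner.
abs_corr = np.round(np.abs(corr), 6)

# np.argmax returns the first position of the maximum when there are ties.
best = int(np.argmax(abs_corr))
print(abs_corr, names[best])
```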

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.748.

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      ?
2     X2         0.748     1      2

Again, we cannot fill in any Right value still marked ? because we don’t know yet where those right subtrees will start. For this node, however, only two rows of data remain, so we already know both relative values: the left is 1 as always and the right is 2, which will always be the case for a node containing two leaves.

Step 4: Split the data (rows with X2 at or below 0.748 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl   1.000  -1.000   1.000
4        0.610   0.630   8.400   3.000  left
0        0.885   0.330   9.100   4.000  right

DECISION #1.1.1 - LEFT TREE: LEFT SUBTREE: LEFT LEAF

Step 0: The data

row #       X2     X10     X11       Y
4        0.610   0.630   8.400   3.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
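
In code, a leaf can be a single row with the same four fields as a decision node. This is a sketch under one assumption: since a numeric array cannot hold the text "LEAF", a made-up marker value of -1 stands in for it.

```python
import numpy as np

LEAF = -1  # hypothetical marker: stands in for the word "LEAF"


def make_leaf(y_value):
    """Return a one-row tree laid out as [Factor, SplitVal, Left, Right]."""
    return np.array([[LEAF, y_value, np.nan, np.nan]])


leaf = make_leaf(3.0)   # the leaf for row 4, whose Y is 3.000
print(leaf)
```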

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      ?
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #1.1.2 - LEFT TREE: LEFT SUBTREE: RIGHT LEAF

Step 0: The data

row #       X2     X10     X11       Y
0        0.885   0.330   9.100   4.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      ?
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan

This completes the left subtree of the left tree.

DECISION #1.1 - UPDATE

Now that we know where the right tree of the left tree will start, let’s update that tree node’s right relative value. Since the tree node is node 1 and the right tree will start on node 5, the value is 4 (5-1).

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
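
The update above does not have to be a separate pass in code. If the left subtree is built first, the right relative value is just the number of rows in the left subtree plus one. A small sketch, assuming each subtree is stored as an array of [Factor, SplitVal, Left, Right] rows, with factors written as column indices (0 for X2, 2 for X11) and -1 as a made-up leaf marker:

```python
import numpy as np

# Left subtree of node 1, as built so far (nodes 2, 3, and 4).
left_subtree = np.array([
    [0,  0.748, 1,      2],        # node 2: split on X2
    [-1, 3.000, np.nan, np.nan],   # node 3: leaf
    [-1, 4.000, np.nan, np.nan],   # node 4: leaf
])

# The right subtree starts right after the last row of the left one.
right_offset = left_subtree.shape[0] + 1

node_1 = np.array([[2, 9.250, 1, right_offset]])  # node 1: split on X11
print(node_1)
```

Here the left subtree has three rows, so the offset is 4, the same value as 5-1.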
     

DECISION #1.2 - LEFT TREE: RIGHT SUBTREE

Step 0: The data

row #       X2     X10     X11       Y
2        0.560   0.500   9.400   6.000
3        0.735   0.570   9.800   5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #       X2     X10     X11       Y
correl  -1.000  -1.000  -1.000

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   1.000   1.000   1.000

All the correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.648.

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2

Again, we cannot fill in any Right value still marked ? because we don’t know yet where those right subtrees will start. For this node, however, only two rows of data remain, so we already know both relative values: the left is 1 as always and the right is 2, which will always be the case for a node containing two leaves.

Step 4: Split the data (rows with X2 at or below 0.648 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -1.000  -1.000  -1.000
2        0.560   0.500   9.400   6.000  left
3        0.735   0.570   9.800   5.000  right

DECISION #1.2.1 - LEFT TREE: RIGHT SUBTREE: LEFT LEAF

Step 0: The data

row #       X2     X10     X11       Y
2        0.560   0.500   9.400   6.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #1.2.2 - LEFT TREE: RIGHT SUBTREE: RIGHT LEAF

Step 0: The data

row #       X2     X10     X11       Y
3        0.735   0.570   9.800   5.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      ?
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan

This completes the right subtree of the left tree.

DECISION #1 - UPDATE

Now that we know where the right tree of the main tree will start, let’s update the root tree node’s right relative value. Since the tree node is node zero (0) and the right tree will start on node 8, the value is 8 (8-0).

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8

DECISION #2 - RIGHT TREE

With my subtree, I now have this data:

row #       X2     X10     X11       Y
7        0.320   0.780  10.000   6.000
6        0.500   0.680  10.500   7.000
1        0.725   0.390  10.900   5.000
5        0.260   0.630  11.800   8.000

Step 1: Determine the factor.

Step 1a: Determine the correlation.

row #       X2     X10     X11       Y
correl  -0.750   0.484   0.542

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   0.750   0.484   0.542

The biggest impact will be X2.

Step 2: Determine the Split Value by taking the median.

The median of X2 in this subtree is 0.410.

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      ?

Since we don’t know where the right decision nodes are yet, we cannot update that.

Step 4: Split the data (rows with X2 at or below 0.410 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -0.750   0.484   0.542
5        0.260   0.630  11.800   8.000  left
7        0.320   0.780  10.000   6.000  left
6        0.500   0.680  10.500   7.000  right
1        0.725   0.390  10.900   5.000  right

DECISION #2.1 - RIGHT TREE: LEFT SUBTREE

Step 0: The data

row #       X2     X10     X11       Y
5        0.260   0.630  11.800   8.000
7        0.320   0.780  10.000   6.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #       X2     X10     X11       Y
correl  -1.000  -1.000   1.000

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   1.000   1.000   1.000

All the absolute correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.290.

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      ?
9     X2         0.290     1      2

Again, we cannot fill in any Right value still marked ? because we don’t know yet where those right subtrees will start. For this node, however, only two rows of data remain, so we already know both relative values: the left is 1 as always and the right is 2, which will always be the case for a node containing two leaves.

Step 4: Split the data (rows with X2 at or below 0.290 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -1.000  -1.000   1.000
5        0.260   0.630  11.800   8.000  left
7        0.320   0.780  10.000   6.000  right

DECISION #2.1.1 - RIGHT TREE: LEFT SUBTREE: LEFT LEAF

Step 0: The data

row #       X2     X10     X11       Y
5        0.260   0.630  11.800   8.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      ?
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan

Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.

DECISION #2.1.2 - RIGHT TREE: LEFT SUBTREE: RIGHT LEAF

Step 0: The data

row #       X2     X10     X11       Y
7        0.320   0.780  10.000   6.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      ?
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan
11    LEAF       6.000   nan    nan

This completes the left subtree of the right tree.

DECISION #2.1 - UPDATE

Now that we know where the right subtree of the right tree will start, let’s update that tree node’s right relative value. Since the tree node is node 8 and the right subtree will start on node 12, the value is 4 (12-8).

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      4
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan
11    LEAF       6.000   nan    nan
12

DECISION #2.2 - RIGHT TREE: RIGHT SUBTREE

Step 0: The data

row #       X2     X10     X11       Y
6        0.500   0.680  10.500   7.000
1        0.725   0.390  10.900   5.000

Step 1: Determine the factor.

Step 1a: Determine the correlation

row #       X2     X10     X11       Y
correl  -1.000   1.000  -1.000

Step 1b: Analyze the absolute values of the correlation

row #       X2     X10     X11       Y
correl   1.000   1.000   1.000

All the absolute correlations are the same, so let’s take the first one, X2.

Step 2: Determine the Split Value by taking the median.

X2’s median of this subtree is 0.613 (0.6125 rounded to three decimals).

Step 3: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      4
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan
11    LEAF       6.000   nan    nan
12    X2         0.613     1      2

This time no Right value is waiting to be filled in. Since there are only two rows of data remaining, we know what the left and right relative node values will be: the left is 1 as always and the right is 2, which will always be the case for a node containing two leaves.

Step 4: Split the data (rows with X2 at or below 0.613 go left; the rest go right):

row #       X2     X10     X11       Y  side
correl  -1.000   1.000  -1.000
6        0.500   0.680  10.500   7.000  left
1        0.725   0.390  10.900   5.000  right

DECISION #2.2.1 - RIGHT TREE: RIGHT SUBTREE: LEFT LEAF

Step 0: The data

row #       X2     X10     X11       Y
6        0.500   0.680  10.500   7.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      4
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan
11    LEAF       6.000   nan    nan
12    X2         0.613     1      2
13    LEAF       7.000   nan    nan

All the right relative values are already filled in, so there is nothing to update.

DECISION #2.2.2 - RIGHT TREE: RIGHT SUBTREE: RIGHT LEAF

Step 0: The data

row #       X2     X10     X11       Y
1        0.725   0.390  10.900   5.000

Now that we only have one row, we have a leaf.

With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.

Step Final: Add node to the decision tree.

Tree
node  Factor  SplitVal  Left  Right
0     X11        9.900     1      8
1     X11        9.250     1      4
2     X2         0.748     1      2
3     LEAF       3.000   nan    nan
4     LEAF       4.000   nan    nan
5     X2         0.648     1      2
6     LEAF       6.000   nan    nan
7     LEAF       5.000   nan    nan
8     X2         0.410     1      4
9     X2         0.290     1      2
10    LEAF       8.000   nan    nan
11    LEAF       6.000   nan    nan
12    X2         0.613     1      2
13    LEAF       7.000   nan    nan
14    LEAF       5.000   nan    nan

This completes the right tree.
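
A quick way to check a finished tree is to walk it for each row of the data set and see whether the row's own Y comes back. The sketch below is only an illustration. The tree is typed in by hand with factors as column indices (0 for X2, 2 for X11) and a made-up marker of -1 for leaves, and each leaf sits on the side that the "at or below the split value goes left" rule sends its row to. The `predict` function is a hypothetical helper.

```python
import numpy as np

LEAF = -1  # hypothetical marker for leaf rows

# Finished tree: [Factor, SplitVal, Left, Right], offsets relative to the node.
tree = np.array([
    [2,    9.900, 1,      8],
    [2,    9.250, 1,      4],
    [0,    0.748, 1,      2],
    [LEAF, 3.000, np.nan, np.nan],
    [LEAF, 4.000, np.nan, np.nan],
    [0,    0.648, 1,      2],
    [LEAF, 6.000, np.nan, np.nan],
    [LEAF, 5.000, np.nan, np.nan],
    [0,    0.410, 1,      4],
    [0,    0.290, 1,      2],
    [LEAF, 8.000, np.nan, np.nan],
    [LEAF, 6.000, np.nan, np.nan],
    [0,    0.613, 1,      2],
    [LEAF, 7.000, np.nan, np.nan],
    [LEAF, 5.000, np.nan, np.nan],
])


def predict(tree, point):
    """Walk the tree for one point given as (X2, X10, X11)."""
    node = 0
    while tree[node, 0] != LEAF:
        factor = int(tree[node, 0])
        go_left = point[factor] <= tree[node, 1]
        node += int(tree[node, 2] if go_left else tree[node, 3])
    return tree[node, 1]


print(predict(tree, [0.610, 0.630, 8.400]))    # row 4 of the data set
print(predict(tree, [0.260, 0.630, 11.800]))   # row 5 of the data set
```

Because every leaf holds exactly one row, each of the eight rows should get its own Y back.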

CONCLUSION

This completes the decision tree. Now go forth and make this in Python.
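
As a starting point, here is one possible recursive sketch of the four steps. It is not the only way to do it, and several choices are assumptions made for this example: factors are stored as column indices (0 for X2, 1 for X10, 2 for X11), -1 is a made-up leaf marker, correlations are rounded before comparing so that ties fall to the first factor, and rows at or below the median go left. It also assumes every split leaves rows on both sides and that Y varies within every subtree, which holds for this data set but not in general.

```python
import numpy as np

LEAF = -1  # hypothetical marker for leaf rows


def build_tree(data):
    """Build rows of [Factor, SplitVal, Left, Right] with relative offsets.

    `data` holds the factor columns first and Y in the last column.
    """
    y = data[:, -1]
    if data.shape[0] == 1:
        # One row left: the leaf stores its Y value as the split value.
        return np.array([[LEAF, y[0], np.nan, np.nan]])

    # Step 1: pick the factor with the largest absolute correlation to Y.
    n_factors = data.shape[1] - 1
    corr = [abs(np.corrcoef(data[:, i], y)[0, 1]) for i in range(n_factors)]
    factor = int(np.argmax(np.round(corr, 6)))

    # Step 2: the split value is the median of that factor.
    split_val = np.median(data[:, factor])

    # Step 4: split the rows, then build the left side before the right.
    go_left = data[:, factor] <= split_val
    left = build_tree(data[go_left])
    right = build_tree(data[~go_left])

    # Step 3: the left subtree starts on the next row; the right subtree
    # starts after every row of the left subtree.
    root = np.array([[factor, split_val, 1, left.shape[0] + 1]])
    return np.vstack([root, left, right])


# Columns: X2, X10, X11, Y (rows 0 to 7 of the data set).
data = np.array([
    [0.885, 0.330,  9.100, 4.000],
    [0.725, 0.390, 10.900, 5.000],
    [0.560, 0.500,  9.400, 6.000],
    [0.735, 0.570,  9.800, 5.000],
    [0.610, 0.630,  8.400, 3.000],
    [0.260, 0.630, 11.800, 8.000],
    [0.500, 0.680, 10.500, 7.000],
    [0.320, 0.780, 10.000, 6.000],
])

tree = build_tree(data)
print(np.round(tree, 3))
```

The result has 15 rows, one per node, in the same left-first order used throughout this example.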