Every decision has the following steps:
- Step 1: Determine the factor. In this example, the factor whose correlation with Y has the largest absolute value wins.
- Step 1a: Determine the correlation of each factor with Y.
- Step 1b: Analyze the absolute values of the correlations.
- Step 2: Determine the Split Value by taking the median of the chosen factor.
- Step 3: Add the node to the decision tree.
- Step 4: Split the data: left & right.
Repeat with both sides of the data (left first, then right, because in the end you want to be right).
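The whole recipe can be sketched as a recursive function (a sketch, not the official implementation: I'm assuming the data is a NumPy array with Y in the last column, ties in |correlation| go to the first factor, rows at or below the median go left, and a split that puts everything on one side collapses to a leaf holding the mean of Y):

```python
import numpy as np

def build_tree(data):
    """Build a decision tree as a list of rows, one per node:
    [factor, split_value, left_offset, right_offset].
    Decision nodes store the factor's column index; leaves store
    factor -1, the Y value as the split value, and nan offsets."""
    if data.shape[0] == 1:                        # one row left -> leaf
        return [[-1, float(data[0, -1]), np.nan, np.nan]]
    x, y = data[:, :-1], data[:, -1]
    # Step 1: pick the factor whose correlation with Y has the largest
    # absolute value; rounding guards the tie-break (first factor wins)
    # against floating-point noise.
    corr = [round(abs(np.corrcoef(x[:, i], y)[0, 1]), 9)
            for i in range(x.shape[1])]
    factor = int(np.argmax(corr))
    # Step 2: the split value is the median of that factor.
    split_val = float(np.median(x[:, factor]))
    mask = x[:, factor] <= split_val              # at/below median -> left
    if mask.all() or (~mask).all():               # degenerate split -> leaf
        return [[-1, float(y.mean()), np.nan, np.nan]]
    # Steps 3-4: build the left subtree first, then the right; the right
    # child's offset is the size of the left subtree plus one.
    left, right = build_tree(data[mask]), build_tree(data[~mask])
    return [[factor, split_val, 1, len(left) + 1]] + left + right
```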
Given this data set, where X2, X10, and X11 are the factors available to the decision tree and Y is the value to predict:
row # | X2 | X10 | X11 | Y |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
DECISION #0 - ROOT
Step 1: Determine the factor.
Step 1a: Determine the correlation
row # | X2 | X10 | X11 | Y |
correl | -0.731 | 0.406 | 0.826 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 0.731 | 0.406 | 0.826 |
The biggest impact is X11, so I will split on X11.
Step 2: Determine the Split Value by taking the median.
The median of X11 is 9.9.
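These numbers can be checked with NumPy (a sketch; `np.corrcoef` returns a correlation matrix, and the off-diagonal entry is the correlation between the two inputs):

```python
import numpy as np

# Columns of the data set above: the factors X2, X10, X11 and the target Y.
x2  = np.array([0.885, 0.725, 0.560, 0.735, 0.610, 0.260, 0.500, 0.320])
x10 = np.array([0.330, 0.390, 0.500, 0.570, 0.630, 0.630, 0.680, 0.780])
x11 = np.array([9.1, 10.9, 9.4, 9.8, 8.4, 11.8, 10.5, 10.0])
y   = np.array([4.0, 5.0, 6.0, 5.0, 3.0, 8.0, 7.0, 6.0])

# Step 1a/1b: correlation of each factor with Y.
corrs = {name: np.corrcoef(col, y)[0, 1]
         for name, col in [("X2", x2), ("X10", x10), ("X11", x11)]}
# rounded: X2 -> -0.731, X10 -> 0.406, X11 -> 0.826

# Step 2: X11 has the largest absolute value, so split on its median.
split_val = np.median(x11)   # 9.9
```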
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
Because the left subtree is always built first and child positions are stored as relative offsets, a decision node's Left value will always be 1 (or nan for leaves).
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -0.731 | 0.406 | 0.826 | |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
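The same split can be reproduced with a boolean mask (a sketch; the array holds the rows in their original order with columns X2, X10, X11, Y):

```python
import numpy as np

data = np.array([
    [0.885, 0.330,  9.1, 4.0],   # row 0
    [0.725, 0.390, 10.9, 5.0],   # row 1
    [0.560, 0.500,  9.4, 6.0],   # row 2
    [0.735, 0.570,  9.8, 5.0],   # row 3
    [0.610, 0.630,  8.4, 3.0],   # row 4
    [0.260, 0.630, 11.8, 8.0],   # row 5
    [0.500, 0.680, 10.5, 7.0],   # row 6
    [0.320, 0.780, 10.0, 6.0],   # row 7
])

# Rows whose X11 (column 2) is at or below the split value 9.9 go left.
mask = data[:, 2] <= 9.9
left, right = data[mask], data[~mask]
```

Rows 0, 2, 3, and 4 land on the left; rows 1, 5, 6, and 7 land on the right, matching the table above.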
DECISION #1 - LEFT TREE
With my subtree, I now have this data:
row # | X2 | X10 | X11 | Y |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation.
row # | X2 | X10 | X11 | Y |
correl | -0.267 | -0.149 | 0.808 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 0.267 | 0.149 | 0.808 |
The biggest impact will be X11 again.
Step 2: Determine the Split Value by taking the median.
The median of X11 in this subtree is 9.25.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | ? |
Since we don’t know where the right decision nodes are yet, we cannot update that.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -0.267 | -0.149 | 0.808 | |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
DECISION #1.1 - LEFT TREE: LEFT SUBTREE
Step 0: The data
row # | X2 | X10 | X11 | Y |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation
row # | X2 | X10 | X11 | Y |
correl | 1.000 | -1.000 | 1.000 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 1.000 | 1.000 | 1.000 |
All the correlations are the same, so let’s take the first one, X2.
Step 2: Determine the Split Value by taking the median.
X2’s median of this subtree is 0.748.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | ? |
2 | X2 | 0.748 | 1 | 2 |
Again, since we don’t know where the right decision nodes are yet, we cannot update that. However, since there are only two lines of data remaining, we know what the left and right relative node values will be. The left is 1 as always and the right here is 2, which will always be the case for a node containing two leaves.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | 1.000 | -1.000 | 1.000 | |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
DECISION #1.1.1 - LEFT TREE: LEFT SUBTREE: LEFT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
4 | 0.610 | 0.630 | 8.400 | 3.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | ? |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.
DECISION #1.1.2 - LEFT TREE: LEFT SUBTREE: RIGHT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
0 | 0.885 | 0.330 | 9.100 | 4.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | ? |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
This completes the left subtree of the left tree.
DECISION #1 - UPDATE
Now that we know where the right tree of the left tree will start, let’s update that tree node’s right relative value. Since the tree node is node 1 and the right tree will start on node 5, the value is 4 (5-1).
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
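The update rule generalizes: a decision node's Right offset is always the size of its left subtree plus one, because the right subtree is appended immediately after the left one finishes. A minimal sketch:

```python
def right_offset(left_subtree_size):
    # The left child sits 1 row below its parent, and the left subtree
    # occupies the next left_subtree_size rows, so the right child starts
    # left_subtree_size + 1 rows below the parent.
    return left_subtree_size + 1
```

Node 1's left subtree is nodes 2, 3, and 4 (three nodes), giving an offset of 4; later, node 0's left subtree will be nodes 1 through 7 (seven nodes), giving 8.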
DECISION #1.2 - LEFT TREE: RIGHT SUBTREE
Step 0: The data
row # | X2 | X10 | X11 | Y |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation
row # | X2 | X10 | X11 | Y |
correl | -1.000 | -1.000 | -1.000 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 1.000 | 1.000 | 1.000 |
All the correlations are the same, so let’s take the first one, X2.
Step 2: Determine the Split Value by taking the median.
X2’s median of this subtree is 0.648.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
Again, since we don’t know where the right decision nodes are yet, we cannot update that. However, since there are only two lines of data remaining, we know what the left and right relative node values will be. The left is 1 as always and the right here is 2, which will always be the case for a node containing two leaves.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -1.000 | -1.000 | -1.000 | |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
DECISION #1.2.1 - LEFT TREE: RIGHT SUBTREE: LEFT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
2 | 0.560 | 0.500 | 9.400 | 6.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.
DECISION #1.2.2 - LEFT TREE: RIGHT SUBTREE: RIGHT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
3 | 0.735 | 0.570 | 9.800 | 5.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | ? |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
This completes the right subtree of the left tree.
DECISION #0 - UPDATE
Now that we know where the right tree of the main tree will start, let’s update the root tree node’s right relative value. Since the tree node is node zero (0) and the right tree will start on node 8, the value is 8 (8-0).
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 |
DECISION #2 - RIGHT TREE
With my subtree, I now have this data:
row # | X2 | X10 | X11 | Y |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation.
row # | X2 | X10 | X11 | Y |
correl | -0.750 | 0.484 | 0.542 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 0.750 | 0.484 | 0.542 |
The biggest impact will be X2.
Step 2: Determine the Split Value by taking the median.
The median of X2 in this subtree is 0.410.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | ? |
Since we don’t know where the right decision nodes are yet, we cannot update that.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -0.750 | 0.484 | 0.542 | |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
DECISION #2.1 - RIGHT TREE: LEFT SUBTREE
Step 0: The data
row # | X2 | X10 | X11 | Y |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation
row # | X2 | X10 | X11 | Y |
correl | -1.000 | -1.000 | 1.000 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 1.000 | 1.000 | 1.000 |
All the correlations are the same, so let’s take the first one, X2.
Step 2: Determine the Split Value by taking the median.
X2’s median of this subtree is 0.290.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | ? |
9 | X2 | 0.290 | 1 | 2 |
Again, since we don’t know where the right decision nodes are yet, we cannot update that. However, since there are only two lines of data remaining, we know what the left and right relative node values will be. The left is 1 as always and the right here is 2, which will always be the case for a node containing two leaves.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -1.000 | -1.000 | 1.000 | |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
DECISION #2.1.1 - RIGHT TREE: LEFT SUBTREE: LEFT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
5 | 0.260 | 0.630 | 11.800 | 8.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | ? |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.
DECISION #2.1.2 - RIGHT TREE: LEFT SUBTREE: RIGHT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
7 | 0.320 | 0.780 | 10.000 | 6.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | ? |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
11 | LEAF | 6.000 | nan | nan |
This completes the left subtree of the right tree.
DECISION #2 - UPDATE
Now that we know where the right subtree of the right tree will start, let’s update that tree node’s right relative value. Since the tree node is node 8 and the right subtree will start on node 12, the value is 4 (12-8).
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | 4 |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
11 | LEAF | 6.000 | nan | nan |
12 |
DECISION #2.2 - RIGHT TREE: RIGHT SUBTREE
Step 0: The data
row # | X2 | X10 | X11 | Y |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
Step 1: Determine the factor.
Step 1a: Determine the correlation
row # | X2 | X10 | X11 | Y |
correl | -1.000 | 1.000 | -1.000 |
Step 1b: Analyze the absolute values of the correlation
row # | X2 | X10 | X11 | Y |
correl | 1.000 | 1.000 | 1.000 |
All the correlations are the same, so let’s take the first one, X2.
Step 2: Determine the Split Value by taking the median.
X2’s median of this subtree is 0.613.
Step 3: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | 4 |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
11 | LEAF | 6.000 | nan | nan |
12 | X2 | 0.613 | 1 | 2 |
Again, since we don’t know where the right decision nodes are yet, we cannot update that. However, since there are only two lines of data remaining, we know what the left and right relative node values will be. The left is 1 as always and the right here is 2, which will always be the case for a node containing two leaves.
Step 4: Split the data (the left group is listed first, then the right group):
row # | X2 | X10 | X11 | Y |
correl | -1.000 | 1.000 | -1.000 | |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
DECISION #2.2.1 - RIGHT TREE: RIGHT SUBTREE: LEFT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
6 | 0.500 | 0.680 | 10.500 | 7.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | 4 |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
11 | LEAF | 6.000 | nan | nan |
12 | X2 | 0.613 | 1 | 2 |
13 | LEAF | 7.000 | nan | nan |
Again, since we don’t know where the right decision nodes are yet, we cannot update that anywhere.
DECISION #2.2.2 - RIGHT TREE: RIGHT SUBTREE: RIGHT LEAF
Step 0: The data
row # | X2 | X10 | X11 | Y |
1 | 0.725 | 0.390 | 10.900 | 5.000 |
Now that we only have one row, we have a leaf.
With a leaf, there is no factor to determine and no need to split any further. So we create a leaf by taking the Y value as the Split Value. Since it’s a leaf, it’s the end of the line, so there is no value for left and right. The value we will enter is NAN.
Step Final: Add node to the decision tree.
Tree | ||||
node | Factor | SplitVal | Left | Right |
0 | X11 | 9.900 | 1 | 8 |
1 | X11 | 9.250 | 1 | 4 |
2 | X2 | 0.748 | 1 | 2 |
3 | LEAF | 3.000 | nan | nan |
4 | LEAF | 4.000 | nan | nan |
5 | X2 | 0.648 | 1 | 2 |
6 | LEAF | 6.000 | nan | nan |
7 | LEAF | 5.000 | nan | nan |
8 | X2 | 0.410 | 1 | 4 |
9 | X2 | 0.290 | 1 | 2 |
10 | LEAF | 8.000 | nan | nan |
11 | LEAF | 6.000 | nan | nan |
12 | X2 | 0.613 | 1 | 2 |
13 | LEAF | 7.000 | nan | nan |
14 | LEAF | 5.000 | nan | nan |
This completes the right tree.
CONCLUSION
This completes the decision tree. Now go forth and make this in Python.
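As a starting point, here is the finished tree as a NumPy array, plus a query function that walks the relative offsets to predict Y for one sample (a sketch; I encode factors by column index, 0 = X2, 1 = X10, 2 = X11, with -1 marking a leaf):

```python
import numpy as np

# One row per node: [factor, split_value, left_offset, right_offset].
TREE = np.array([
    [ 2, 9.900, 1, 8],             # node 0: split on X11
    [ 2, 9.250, 1, 4],             # node 1: split on X11
    [ 0, 0.748, 1, 2],             # node 2: split on X2
    [-1, 3.000, np.nan, np.nan],   # node 3: leaf
    [-1, 4.000, np.nan, np.nan],   # node 4: leaf
    [ 0, 0.648, 1, 2],             # node 5: split on X2
    [-1, 6.000, np.nan, np.nan],   # node 6: leaf
    [-1, 5.000, np.nan, np.nan],   # node 7: leaf
    [ 0, 0.410, 1, 4],             # node 8: split on X2
    [ 0, 0.290, 1, 2],             # node 9: split on X2
    [-1, 8.000, np.nan, np.nan],   # node 10: leaf
    [-1, 6.000, np.nan, np.nan],   # node 11: leaf
    [ 0, 0.613, 1, 2],             # node 12: split on X2
    [-1, 7.000, np.nan, np.nan],   # node 13: leaf
    [-1, 5.000, np.nan, np.nan],   # node 14: leaf
])

def query(tree, x):
    """Predict Y for one sample x = [X2, X10, X11] by following the
    relative offsets from the root down to a leaf."""
    node = 0
    while tree[node, 0] != -1:                # not a leaf yet
        factor = int(tree[node, 0])
        if x[factor] <= tree[node, 1]:        # at or below split -> left
            node += int(tree[node, 2])
        else:                                 # above split -> right
            node += int(tree[node, 3])
    return tree[node, 1]                      # leaf stores the Y value
```

Querying with row 4's factors (0.610, 0.630, 8.4) walks node 0 → 1 → 2 → 3 and returns its Y of 3.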