Ungraded Lab: Decision Trees
In this notebook you will visualize how a decision tree is split using information gain.
We will revisit the dataset used in the video lectures, which is shown in the table further below.
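As you saw in the lectures, in a decision tree we decide whether or not to split a node by looking at the information gain that the split would give us:

$$\text{Information Gain} = H(p_1^{\text{node}}) - \left(w^{\text{left}} H(p_1^{\text{left}}) + w^{\text{right}} H(p_1^{\text{right}})\right)$$

where $H$ is the entropy, defined as

$$H(p_1) = -p_1 \log_2(p_1) - (1 - p_1) \log_2(1 - p_1)$$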
Remember that log here is defined to be in base 2. Run the code block below to see for yourself how the entropy $H(p_1)$ behaves as $p_1$ varies.
Note that $H$ attains its highest value when $p_1 = 0.5$, that is, when the probability of the event is $0.5$. Its minimum value is attained at $p_1 = 0$ and $p_1 = 1$, i.e., when the outcome of the event is totally predictable. Thus, the entropy reflects how predictable an event is: the lower the entropy, the more predictable the event.
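A minimal sketch of this behaviour, tabulating the entropy on a small grid of probabilities rather than plotting it. The grid values and the name `curve` are assumptions chosen for illustration; this is not the lab's original plotting cell.

```python
import math

# Interior probabilities only: at p1 = 0 and p1 = 1 the entropy is taken to be 0
# by the convention 0 * log2(0) = 0, which the raw formula cannot evaluate.
grid = [0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99]  # hypothetical sample points

# Binary entropy in bits for each probability on the grid.
curve = {p: -p * math.log2(p) - (1 - p) * math.log2(1 - p) for p in grid}

for p, h in curve.items():
    print(f"p1 = {p:.2f}  H(p1) = {h:.3f}")
```

The printed values rise towards 1 bit at $p_1 = 0.5$ and fall symmetrically towards 0 near both ends of the interval.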
| Index | Ear Shape | Face Shape | Whiskers | Cat |
|---|---|---|---|---|
| 0 | Pointy | Round | Present | 1 |
| 1 | Floppy | Not Round | Present | 1 |
| 2 | Floppy | Round | Absent | 0 |
| 3 | Pointy | Not Round | Present | 0 |
| 4 | Pointy | Round | Present | 1 |
| 5 | Pointy | Round | Absent | 1 |
| 6 | Floppy | Not Round | Absent | 0 |
| 7 | Pointy | Round | Absent | 1 |
| 8 | Floppy | Round | Absent | 0 |
| 9 | Floppy | Round | Absent | 0 |
We will use one-hot encoding to encode the categorical features. They will be as follows:
Ear Shape: Pointy = 1, Floppy = 0
Face Shape: Round = 1, Not Round = 0
Whiskers: Present = 1, Absent = 0
Therefore, we have two sets:

X_train: for each example, contains 3 features:
- Ear Shape (1 if pointy, 0 otherwise)
- Face Shape (1 if round, 0 otherwise)
- Whiskers (1 if present, 0 otherwise)

y_train: whether the animal is a cat
- 1 if the animal is a cat
- 0 otherwise
For example, the first animal is encoded as [1, 1, 1]: it has pointy ears, a round face, and whiskers.
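Transcribing the table with the encoding above gives the two arrays below. Plain Python lists are used here as an assumption to keep the sketch dependency-free; the lab itself may store them differently (for example, as NumPy arrays).

```python
# One row per animal, in table order: [ear shape, face shape, whiskers].
X_train = [
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]

# 1 if the animal is a cat, 0 otherwise.
y_train = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

print(X_train[0])  # features of the first animal
```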
At each node, we compute the information gain for each feature and then split the node on the feature with the highest information gain. The information gain compares the entropy of the node with the weighted entropy of the two nodes produced by the split.
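So, the root node has every animal in our dataset. Remember that $p_1^{\text{node}}$ is the proportion of the positive class (cats) in the root node. Since 5 of the 10 animals are cats,

$$p_1^{\text{node}} = \frac{5}{10} = 0.5$$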
Now let's write a function to compute the entropy.
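One possible version of such a function, written with the standard `math` module. The name `entropy` and the explicit handling of the end points are assumptions of this sketch.

```python
import math

def entropy(p):
    """Entropy in bits of a binary event that occurs with probability p."""
    # 0 * log2(0) is treated as 0, so a pure node has zero entropy.
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Root node: 5 cats out of 10 animals.
print(entropy(0.5))  # → 1.0
```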
To illustrate, let's compute the information gain if we split the node for each of the features. To do this, let's write some functions.
So, if we choose Ear Shape to split on, then the left node must contain the animals with pointy ears (check the table above), that is, the indices 0, 3, 4, 5 and 7,
and the right node contains the remaining ones: 1, 2, 6, 8 and 9.
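The split can be sketched as follows. The function name `split_indices` and its argument names are assumptions made for illustration, and the feature matrix is transcribed from the table.

```python
# [ear shape, face shape, whiskers] for each of the 10 animals.
X_train = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
           [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]

def split_indices(X, index_feature):
    """Split the rows of X on one binary feature.

    Rows whose feature equals 1 go to the left node, all others to the right.
    Returns the two lists of row indices.
    """
    left_indices, right_indices = [], []
    for i, x in enumerate(X):
        if x[index_feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)
    return left_indices, right_indices

left, right = split_indices(X_train, 0)  # feature 0 is Ear Shape
print(left)   # → [0, 3, 4, 5, 7]
print(right)  # → [1, 2, 6, 8, 9]
```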
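Now we need another function to compute the weighted entropy in the split nodes. As you've seen in the video lecture, we must find:
$w^{\text{left}}$ and $w^{\text{right}}$, the proportion of animals in each node.
$p_1^{\text{left}}$ and $p_1^{\text{right}}$, the proportion of cats in each split.
Note the difference between these two definitions! To illustrate, if we split the root node on the feature of index 0 (Ear Shape), then in the left node, the one that has the animals 0, 3, 4, 5 and 7, we have:

$$w^{\text{left}} = \frac{5}{10} = 0.5 \quad \text{and} \quad p_1^{\text{left}} = \frac{4}{5}$$

and in the right node, with the animals 1, 2, 6, 8 and 9:

$$w^{\text{right}} = \frac{5}{10} = 0.5 \quad \text{and} \quad p_1^{\text{right}} = \frac{1}{5}$$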
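A minimal sketch of the weighted entropy for this split. The names `weighted_entropy`, `left_indices` and `right_indices` are assumptions, and the sketch assumes both nodes are non-empty.

```python
import math

y_train = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # 1 = cat, 0 = not a cat

def entropy(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def weighted_entropy(y, left_indices, right_indices):
    """Entropy of the two child nodes, weighted by their share of the animals."""
    n = len(left_indices) + len(right_indices)
    w_left = len(left_indices) / n    # proportion of animals in the left node
    w_right = len(right_indices) / n  # proportion of animals in the right node
    p_left = sum(y[i] for i in left_indices) / len(left_indices)     # cats in left
    p_right = sum(y[i] for i in right_indices) / len(right_indices)  # cats in right
    return w_left * entropy(p_left) + w_right * entropy(p_right)

# Split on Ear Shape: pointy ears on the left, floppy ears on the right.
left_indices = [0, 3, 4, 5, 7]
right_indices = [1, 2, 6, 8, 9]
result = weighted_entropy(y_train, left_indices, right_indices)
print(round(result, 2))  # → 0.72
```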
So, the weighted entropy in the 2 split nodes is 0.72. To compute the Information Gain we must subtract it from the entropy in the node we chose to split (in this case, the root node).
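Worked out for the Ear Shape split, taking the root-node proportion of cats as 5 out of 10 (so the root entropy is $H(0.5) = 1$) and the rounded weighted entropy of 0.72:

$$\text{Information Gain} = H(0.5) - 0.72 = 1 - 0.72 \approx 0.28$$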
Now, let's compute the information gain if we split the root node for each feature:
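The comparison can be sketched as below. The helper names (`node_entropy`, `information_gain`, `gains`) are assumptions made for illustration, and the data is transcribed from the table.

```python
import math

X_train = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
           [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
y_train = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
feature_names = ["Ear Shape", "Face Shape", "Whiskers"]

def node_entropy(y, indices):
    """Entropy in bits of the labels at `indices` (0 for an empty or pure node)."""
    if not indices:
        return 0.0
    p = sum(y[i] for i in indices) / len(indices)
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(X, y, node_indices, feature):
    """Node entropy minus the weighted entropy of the two children."""
    left = [i for i in node_indices if X[i][feature] == 1]
    right = [i for i in node_indices if X[i][feature] == 0]
    weighted = (len(left) * node_entropy(y, left)
                + len(right) * node_entropy(y, right)) / len(node_indices)
    return node_entropy(y, node_indices) - weighted

root_indices = list(range(len(X_train)))
gains = {name: information_gain(X_train, y_train, root_indices, f)
         for f, name in enumerate(feature_names)}

for name, gain in gains.items():
    print(f"{name}: information gain = {gain:.2f}")
```

Under these assumptions the sketch reports roughly 0.28 for Ear Shape, 0.03 for Face Shape and 0.12 for Whiskers.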
So, the best feature to split on is indeed Ear Shape. Run the code below to see the split in action. You do not need to understand the following code block.
[Figure not rendered: the cell builds the tree with `build_tree_recursive(X_train, y_train, [0,1,2,3,4,5,6,7,8,9], "Root", max_depth=1, current_depth=0, tree = tree)` and draws it with `generate_tree_viz([0,1,2,3,4,5,6,7,8,9], y_train, tree)`, but the run failed with `FileNotFoundError: [WinError 2] "dot" not found in path.`, i.e. the Graphviz `dot` executable was not found on the PATH.]
The process is recursive, which means we must perform these calculations for each node until one of the following stopping criteria is met:
If the tree depth after splitting exceeds a threshold
If the resulting node has only 1 class
If the information gain of splitting is below a threshold
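The recursion and its three stopping criteria can be sketched as follows. This is an illustrative stand-in, not the lab's `build_tree_recursive` helper: the names `grow_tree_sketch`, `best_split`, `node_entropy`, the `min_gain` threshold and the dictionary layout of the tree are all assumptions.

```python
import math

X_train = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
           [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
y_train = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]
FEATURES = ["Ear Shape", "Face Shape", "Whiskers"]

def node_entropy(y, indices):
    """Entropy in bits of the labels at `indices` (0 for an empty or pure node)."""
    if not indices:
        return 0.0
    p = sum(y[i] for i in indices) / len(indices)
    if p == 0 or p == 1:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_split(X, y, indices):
    """Return (gain, feature, left, right) for the highest information gain."""
    best = None
    for f in range(len(X[0])):
        left = [i for i in indices if X[i][f] == 1]
        right = [i for i in indices if X[i][f] == 0]
        weighted = (len(left) * node_entropy(y, left)
                    + len(right) * node_entropy(y, right)) / len(indices)
        gain = node_entropy(y, indices) - weighted
        if best is None or gain > best[0]:
            best = (gain, f, left, right)
    return best

def grow_tree_sketch(X, y, indices, depth=0, max_depth=2, min_gain=1e-9):
    """Recursively split a node until one of the stopping criteria applies."""
    leaf = {"leaf": True, "indices": indices}
    # Criterion 1: splitting would push the tree past the depth threshold.
    if depth >= max_depth:
        return leaf
    # Criterion 2: the node already contains a single class.
    if len({y[i] for i in indices}) <= 1:
        return leaf
    gain, f, left, right = best_split(X, y, indices)
    # Criterion 3: the best available split gains less than the threshold.
    if gain < min_gain:
        return leaf
    return {"leaf": False, "feature": FEATURES[f],
            "left": grow_tree_sketch(X, y, left, depth + 1, max_depth, min_gain),
            "right": grow_tree_sketch(X, y, right, depth + 1, max_depth, min_gain)}

tree = grow_tree_sketch(X_train, y_train, list(range(len(X_train))))
print(tree["feature"], "|", tree["left"]["feature"], "|", tree["right"]["feature"])
```

With a depth threshold of 2, this sketch splits the root on Ear Shape, the pointy-ear node on Face Shape and the floppy-ear node on Whiskers, and every resulting leaf contains a single class.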
The final tree looks like this:
[Figure not rendered: the cell builds the tree with `build_tree_recursive(X_train, y_train, [0,1,2,3,4,5,6,7,8,9], "Root", max_depth=2, current_depth=0, tree = tree)` and draws it with `generate_tree_viz([0,1,2,3,4,5,6,7,8,9], y_train, tree)`, but the run failed with `FileNotFoundError: [WinError 2] "dot" not found in path.`, i.e. the Graphviz `dot` executable was not found on the PATH.]
Congratulations! You completed the notebook!









