Python
by Mike Thompson, Updated 4/16/2016
There's code for three Python programs here. The first is a little calculator that I wrote to get a better understanding of parsing computer languages. As a calculator it's just a toy, but it was interesting to write. The second is a decision tree machine learner that was my capstone project for the University of Washington Data Science certificate program. I've named it Dc2 because it's the second decision tree machine learner that I've written, and because there was once a nice airplane that went by a similar name. The third is a brute force solution to the Travelling Salesman Problem (TSP), a classic graph traversal problem.
The calculator runs on either a Python 2 or Python 3 interpreter, with just a couple of minor changes that are noted in the source code. Dc2 and TSP were written for Python 3, but it should be only a little work to get them to run on Python 2 if so desired.
Calculator
This is a calculator, written in Python. You wouldn't pay much money for it - it only handles addition, subtraction, multiplication, division, and the use of parentheses to control the order of operations. No unary minus, no exponents, no statistical functions, no ...<the long list of things that a TI-84 calculator will do goes here>. But what it has done is give me a chance to explore some classic computer science problems while writing some Python code.
Parsing is one of those classic computer science problems, and I got to learn about grammars as a means of generating syntactically-correct sentences in a language like arithmetic expressions (or regular expressions, or Python, or Swahili). Then I got to learn about the types of computational machines (e.g. push-down automata) sufficient to parse them. "Introduction to Languages, Machines, and Logic" by Alan Parkes was a great read on these topics, and in it I learned a bit about the seminal work of Noam Chomsky in the field of linguistics, and of Alan Turing in the field of automata. I also got to explore Edsger Dijkstra's "shunting-yard algorithm" solution to the problem of converting an arithmetic expression from infix notation (the way we humans write 'em) to RPN (Reverse Polish Notation). I got to implement stacks with Python lists and play with Python's support for regular expressions, among other fun things.
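For the record, here is the entire grammar this calculator implements, in the same notation used in the source comments below. Each rule maps to one function, and the functions call each other recursively, making this a small recursive-descent parser:

Expr    <= Sum
Sum     <= Product (('+' | '-') Product)*
Product <= Value (('*' | '/') Value)*
Value   <= [0-9]+ | '(' Expr ')'

Because Product sits a level below Sum, multiplication and division bind more tightly than addition and subtraction, and a parenthesized Value restarts the whole grammar at Expr.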
Within the main loop, the first section is the scanner, implemented by using Python's "re" regular expressions library. The scanner chops up the user input into tokens (numbers and operators) that are the building blocks of arithmetic expressions. It rejects any symbols that don't have meaning in the language of arithmetic expressions, like "@" or "_". The second section is the parser, which determines whether the series of tokens entered by the user constitutes a syntactically-correct arithmetic expression. Just as a good grammar checker would reject the sentence, "Go to a the store.", because it doesn't make sense to have both "a" and "the" preceding the word "store", an arithmetic expression parser should reject "2 + * 3".
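To make the scanner's output concrete: given the input "2 + 3*(4-1)", it should produce the token list below. Note that numbers are converted to floats as they are scanned, and whitespace is dropped:

[('NUMBER', 2.0), ('OP_ADD', '+'), ('NUMBER', 3.0), ('OP_MUL', '*'),
 ('LPAREN', '('), ('NUMBER', 4.0), ('OP_SUB', '-'), ('NUMBER', 1.0),
 ('RPAREN', ')')]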
Given an arithmetic expression that contains no invalid symbols and is syntactically correct, the third section of the main loop evaluates it and returns the result. It first employs Dijkstra's shunting-yard algorithm to convert the arithmetic expression from infix notation (e.g. 2 + 2) to RPN (e.g. 2 2 +). If you've ever used an HP scientific calculator you're familiar with RPN, where you enter the operands before you enter the operator. The final step is to execute the RPN, the variable "a" representing the accumulator register.
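A quick worked example: the infix expression 2 + 3 * 4 converts to the RPN sequence 2 3 4 * +. The evaluator then consumes the RPN left to right, keeping the most recent operand in the accumulator "a" and parking earlier operands on a stack:

read 2    a = 2
read 3    push 2, a = 3
read 4    push 3, a = 4
read *    a = 3 * 4 = 12    (pop the 3)
read +    a = 2 + 12 = 14   (pop the 2)

The 2 waits on the stack until the higher-precedence multiplication has been folded into the accumulator, which is exactly why RPN needs no parentheses.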
Primitive though it is, writing this little calculator has provided a great excuse for diving into some classic computer science territory that I've wanted to explore for many years, but never got around to.
from __future__ import division # to avoid floor division in Python 2.x
import re # regular expressions
in_debug_mode = False # debug flag
def get_token_parse():
global curr_token
if input_expression:
curr_token = input_expression.pop()
return(True)
else:
return(False) # no next char - we've hit the end of the input
#
# Value <= [0-9]+ | '(' Expr ')'
#
def Value():
if get_token_parse():
if curr_token[TYPE] == 'NUMBER':
return(True)
elif curr_token[TYPE] == 'LPAREN':
if Expr():
if get_token_parse():
if curr_token[TYPE] == 'RPAREN':
return(True)
else:
return(False) # no closing )
else:
return(False) # no closing )
else:
return(False) # invalid expression
else:
input_expression.append(curr_token) # unexpected token: backtrack
return(False)
else:
return(False) # unexpected end of expression
#
# Product <= Value (('*' | '/') Value)*
#
def Product():
if Value():
if get_token_parse():
while (curr_token[TYPE] == 'OP_MUL' or
curr_token[TYPE] == 'OP_DIV'):
if not Value():
return(False) # no Value following operator
else:
if get_token_parse(): # possibly another * or /
continue
else:
return(True) # we've hit the end of the input
else:
input_expression.append(curr_token) # Not * or /, so backtrack
return(True) # we've found a single Value
else:
return(True) # we've hit the end of the input
else:
return(False)
#
# Sum <= Product (('+' | '-') Product)*
#
def Sum():
if Product():
if get_token_parse():
while (curr_token[TYPE] == 'OP_ADD' or
curr_token[TYPE] == 'OP_SUB'):
if not Product():
return(False) # no Product following operator
else:
if get_token_parse(): # possibly another + or -
continue
else:
return(True) # we've hit the end of the input
else:
input_expression.append(curr_token) # Not + or -, so backtrack
return(True) # we've found a single Product
else:
return(True) # we've hit the end of the input
else:
return(False) # no Product found
#
# Expr <= Sum
#
def Expr():
return(Sum())
token_specs = [
('NUMBER', r'\d+(\.\d*)?|\.\d+'), # Integer or decimal number
('OP_ADD', r'[+]'), # +
('OP_SUB', r'[\-]'), # -
('OP_MUL', r'[*]'), # *
('OP_DIV', r'[\/]'), # /
('LPAREN', r'[(]'), # (
('RPAREN', r'[)]'), # )
('WHITESPACE', r'[ \t]'), # Skip spaces & tabs
]
token_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specs)
get_token_scan = re.compile(token_regex).match
TYPE = 0 # index to token type
VALUE = 1 # index to token value
rpn_expression = [] # Reverse Polish Notation version of the expression
op_stack = [] # Operator stack for Dijkstra's shunting-yard algorithm
op_precedence = {'+': 1, '-': 1, '*': 2, '/': 2} # Op precedence lookup
eval_stack = [] # Stack for evaluating the RPN expression
a = None # Accumulator for evaluating the RPN expression
while True:
# If Python 2.x use: raw_input('Enter a...
# If Python 3.x use: input('Enter a...
user_input = input('Enter an arithmetic expression, or "Stop" to exit: ')
if user_input.upper() == 'STOP':
break
#
# Scan
#
input_expression = []
curr_pos = 0
match = get_token_scan(user_input)
while match is not None:
        tok_type = match.lastgroup    # avoid shadowing the built-in "type"
        if tok_type != 'WHITESPACE':
            if tok_type == 'NUMBER':
                value = float(match.group(tok_type))
            else:
                value = match.group(tok_type)
            input_expression.append((tok_type, value))
            if in_debug_mode:
                print('Type: %s Value: %s' % (tok_type, value))
curr_pos = match.end()
match = get_token_scan(user_input, curr_pos)
# If unable to tokenize all of the input, then it contains invalid chars
if curr_pos != len(user_input):
print('Unexpected character: %s' %(user_input[curr_pos]))
break
if in_debug_mode:
print('Input: %s' %(input_expression))
#
# Parse
#
save_input_expression = input_expression[:]
input_expression.reverse() # pop to read & push to backtrack
curr_token = []
if Expr():
if input_expression:
# Reached halt state but input not consumed, e.g. "2+3 4"
print('Extraneous input at ', input_expression[-1])
break
else:
# Unable to reach halt state, e.g. "2+-" or "2+"
if input_expression:
print('Parsing error at ', input_expression[-1])
break
else:
print('Parsing error: Unexpected end of expression.')
break
    #
    # Evaluate
    #
    input_expression = save_input_expression
    rpn_expression = []    # reset per-expression state so one expression's
    op_stack = []          # results can't leak into the next
    eval_stack = []
    a = None
for token in input_expression:
if token[TYPE] == 'NUMBER':
rpn_expression.append(token[VALUE])
elif token[TYPE] == 'LPAREN':
op_stack.append(token[VALUE])
elif token[TYPE] == 'RPAREN':
while op_stack and op_stack[-1] != '(':
rpn_expression.append(op_stack.pop())
if not op_stack:
print('Eval: Unmatched parenthesis (#1).')
raise BufferError
op_stack.pop()
elif token[VALUE] in op_precedence:
while op_stack and op_stack[-1] in op_precedence and \
op_precedence[token[VALUE]] <= op_precedence[op_stack[-1]]:
rpn_expression.append(op_stack.pop())
op_stack.append(token[VALUE])
else:
print('Eval: Token not recognized.')
raise LookupError
while op_stack:
if op_stack[-1] == '(' or op_stack[-1] == ')':
print('Eval: Unmatched parenthesis (#2).')
raise BufferError
rpn_expression.append(op_stack.pop())
# So we can use "pop" to read RPN, and "push" to store the accumulator
rpn_expression.reverse()
while rpn_expression:
if isinstance(rpn_expression[-1], int) or \
isinstance(rpn_expression[-1], float):
            if a is not None:    # "if a:" would wrongly discard an accumulator of 0
eval_stack.append(a)
a = rpn_expression.pop()
else:
if rpn_expression[-1] == '+':
a = eval_stack[-1] + a
elif rpn_expression[-1] == '-':
a = eval_stack[-1] - a
elif rpn_expression[-1] == '*':
a = eval_stack[-1] * a
else:
a = eval_stack[-1] / a
eval_stack.pop()
rpn_expression.pop()
    # If Python 2.x use: print('The answer is: %s' % round(a, 2))
# If Python 3.x use: print('The answer is: ' + str(a))
print('The answer is: ' + str(a))
print('Bye!')
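For the curious, a sample session looks like this:

Enter an arithmetic expression, or "Stop" to exit: 2 + 3 * (10 - 6)
The answer is: 14.0
Enter an arithmetic expression, or "Stop" to exit: stop
Bye!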
Dc2
My final project for the University of Washington Data Science Certificate Program was to write a decision tree machine learner in Python. That's a lot to do in a month, and my instructor advised me at the outset that I might be biting off more than I could chew, suggesting instead that I join one of the Kaggle project teams as a safer alternative. But I really wanted to write it, had been reading about and thinking about decision trees since spring of this year, and had found code snippets that showed how to solve some key parts of the problem. It was one of those "no guts, no glory" moments, and I decided to go for it! I did manage to finish it on time, and I named it Dc2.
How does it stack up against other popular decision tree machine learners? I wondered that myself, so I did the following. First I downloaded the Titanic training dataset from Kaggle, and then did a number of transformations on the data. I discretized the continuous attributes (e.g. Age, Fare), and binarized or threw out some attributes that constituted leaks (e.g. Cabin Number, Passenger ID, Name). Then I ran the data through both Dc2 and Weka's J48 decision tree machine learners. The results of the two runs were very close, with both classifiers yielding models with just over 80% accuracy on the training data. So Dc2's accuracy is close to that of popular, big-name products.
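For anyone who wants to reproduce that comparison, here is a rough sketch of the kind of preprocessing involved. It assumes the pandas library; the column names come from the Kaggle Titanic dataset, but the bin edges and labels are arbitrary illustrations, not necessarily the ones used for the run above:

# Hypothetical preprocessing sketch (not part of Dc2): bin the Titanic
# dataset's continuous attributes into nominal values, and drop columns
# that are identifiers or leaks.
import pandas as pd

df = pd.read_csv("Titanic.csv")    # the Kaggle Titanic training data
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
df["Age"] = pd.cut(df["Age"], bins=[0, 12, 18, 60, 120],
                   labels=["Child", "Teen", "Adult", "Senior"])
df["Fare"] = pd.cut(df["Fare"], bins=3, labels=["Low", "Mid", "High"])
df.to_csv("TitanicBinned.csv", index=False)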
You certainly wouldn't call it feature-rich, though. Here's a list of things it won't do that you might expect from a decision tree machine learner.
- No GUI - command line invocation only.
- Doesn't support continuous-valued attributes like weight or age, so continuous data must be binned into nominal values (e.g. Small/Medium/Large).
- No pruning or other sophisticated methods for balancing overfitting vs. underfitting.
- No ensemble methods like bagging or random forests.
- No cross-validation or other sophisticated methods for predicting how well the model will generalize.
Here's a list of what it will do.
- Read in a labeled dataset in .CSV format.
- Build a binary decision tree consisting of internal nodes containing attribute/value pairs for dataset splitting decisions, and of leaves containing class values.
- Save the decision tree to a file, and re-load it for the post-tree-building processing of test or production datasets.
- Generate a text representation of the decision tree and statistics on the number/type of nodes in the tree.
- Classify a dataset (training, test, or production), appending a column containing the predicted class values, and then write the results to a file.
- Generate statistics on the results of a classification run, including accuracy and a confusion matrix.
- Accept command line parameters for specifying: the name of the input/output files, whether a tree-building or classification run (or both) should be performed, the values of two parameters that can be used to avoid overfitting, and activating a debug mode to support the inclusion of print statements for debugging.
The one required command line parameter is filestem, which represents the root of the input and output files used by Dc2. For example, if "Titanic" is specified for the filestem parameter, Dc2 will read the input dataset from Titanic.csv, will save/re-load the decision tree to/from Titanic.pic, will write the results of the classification step to TitanicOut.csv, and will write a human-readable rendition of the decision tree to TitanicTree.csv. Specifying the -h command line parameter will cause the following help text to be displayed. (At your command prompt, type "python Dc2.py -h".)
usage: Dc2.py [-h] [-i I] [-w W] [-b] [-c] [-d] filestem
Dc2.py - learn a decision tree from input data and/or categorize the data
positional arguments:
- filestem =>filename base for input and output files
optional arguments:
- -h, --help =>show this help message and exit
- -i I =>threshold info gain for decision node (default=0.02)
- -w W =>threshold weight for decision node (default=0)
- -b =>build decision tree from input data (True if (Not -c))
- -c =>categorize input data using decision tree (True if (Not -b))
- -d =>turn on debug messages
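For reference, the input file is plain .CSV with a header row; as documented in the code below, the next-to-last column must hold a numeric case weight and the last column the class. A tiny, hypothetical example:

Sex,Pclass,WEIGHT,CLASS
female,First,1.0,Survived
male,Third,1.0,Died
female,Third,1.0,Survived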
"""
Dc2.py - A decision tree classifier written by Mike Thompson in 2015
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License, Version 3, which
can be found at http://www.gnu.org/licenses. This program is distributed
in the hope that it will be useful, but without any warranty expressed
or implied, including the implied warranties of merchantability or
fitness for a particular purpose.
Acknowledgements
----------------
1. "C4.5: Programs for Machine Learning" by Ross Quinlan (1993). A
classic, covering all aspects of decision trees, and a pleasure
to read.
2. "Machine Learning in Action" by Peter Harrington (2012). An excellent
do-it-yourself-in-Python, containing sample code covering a wide
variety of data science topics.
3. "Data Structures & Algorithms in Python" by Michael Goodrich, Roberto
Tamassia, and Michael Goldwasser (2013). Chapter 8 on trees was
invaluable.
This program was written in Python 3, specifically the Anaconda 2.3
distribution of Python 3.4.3.
Rev Date Remarks
---- ---- -------
1.0 12/07/2015 And away we go!
1.01 01/08/2016 Write decision tree to .CSV file.
1.02 02/04/2016 Compute accuracy both by case and by weight.
"""
import argparse
from math import log
import operator
import csv
import pickle
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
class Btree:
"""
I think that I shall never see a poem lovely as a tree.
A tree whose hungry mouth is prest
against the earth's sweet flowing breast;
A tree that looks at God all day, and lifts her leafy arms to pray;
A tree that may in summer wear a nest of robins in her hair;
Upon whose bosom snow has lain; who intimately lives with rain.
Poems are made by fools like me, but only God can make a tree.
-- Joyce Kilmer (1913)
"""
def __init__(self):
"""Create an initially-empty binary tree."""
self._root = None
self._size = 0
def get_root(self):
return self._root
def set_root(self, node):
self._root = node
def get_size(self):
return self._size
def set_size(self, size):
self._size = size
def is_empty(self):
"""Return True if the tree is empty."""
        return self.get_root() is None
def is_leaf(self, n):
"""
Return True if node does not have any children, else False.
"""
return self.num_children(n) == 0
def num_children(self, n):
"""
Return node's number of children (0, 1, or 2).
"""
count = 0
if n[NODE_LEFT] is not None: # left child exists
count += 1
if n[NODE_RIGHT] is not None: # right child exists
count += 1
return count
def children(self, n):
"""Generate an iteration of node's children."""
        if n[NODE_LEFT] is not None:
            yield n[NODE_LEFT]
        if n[NODE_RIGHT] is not None:
            yield n[NODE_RIGHT]
def preorder(self):
"""Generate a preorder iteration of all the nodes in the tree."""
if not self.is_empty():
for n in self.subtree_preorder(self.get_root()):
yield n
def subtree_preorder(self, n):
"""Generate a preorder iteration of nodes of subtree rooted at n."""
yield n # visit n before its subtrees
for c in self.children(n): # for each child, c
for other in self.subtree_preorder(c): # preorder of c's subtree
yield other
def add_root(self):
"""
Create and return the root of an empty tree.
Raise ValueError if tree nonempty.
"""
if self.get_root() is not None:
raise ValueError("Root already exists")
new = self.add_node()
self.set_root(new)
self.set_size(1)
return new
def add_node(self, attr=None, value=None,
parent=None, left=None, right=None):
"""
Create and return a new node.
"""
new = [parent, left, right, attr, value]
self._size += 1
return new
# Offsets to node elements
NODE_PARENT = 0 # index to parent
NODE_LEFT = 1 # index to left child
NODE_RIGHT = 2 # index to right child
NODE_ATTR = 3 # index to attr column num or, if leaf, None
NODE_VALUE = 4 # index to attr value or, if leaf, class
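# An illustration (added note, hypothetical values): a decision node such as
# [parent, left, right, 2, 'female'] routes rows whose column-2 value equals
# 'female' to the left child and all other rows to the right, while a leaf
# such as [parent, None, None, None, 'Survived'] simply predicts its class.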
def get_dataset():
"""
Read input dataset in .CSV (Comma Separated Values) format, in
which the first row contains column labels.
"""
labels = []
dataset = []
with open(filestem + ".csv") as csv_input_file:
csv_reader = csv.reader(csv_input_file, dialect="excel")
labels = next(csv_reader)
for row in csv_reader:
dataset.append(row)
labels = labels[: len(dataset[0]) - 2] # ditch labels for weight & class
for row in dataset:
row[-2] = float(row[-2]) # convert weight from string to numeric
return dataset, labels
def calc_entropy(dataset):
"""
Calculate the entropy of a data set in which the next-to-last column
contains weights and the last column contains the class values.
"""
label_weights = {}
total_weight = 0.0
for row in dataset:
current_label = row[-1]
if current_label not in label_weights.keys():
label_weights[current_label] = 0
label_weights[current_label] += row[-2]
total_weight += row[-2]
shannon_entropy = 0.0
for key in label_weights:
prob = label_weights[key] / total_weight
shannon_entropy -= prob * log(prob,2) #log base 2
return shannon_entropy
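# A sanity check on calc_entropy (illustrative numbers, not from a real run):
# a dataset whose total weight is split 50/50 between two classes has entropy
# -0.5*log2(0.5) - 0.5*log2(0.5) = 1.0 bit, while a pure, single-class
# dataset has entropy 0.0. Tree building favors splits that reduce entropy.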
def best_attr_val_for_split(dataset):
"""
Return the column index of the attribute, and the value of that
attribute, that will yield the greatest information gain.
"""
num_attrs = len(dataset[0]) - 2 # ignore last two columns
base_entropy = calc_entropy(dataset)
best_info_gain = 0.0
best_attr = -1
best_value = None
for i in range(num_attrs): # iterate over all the attribute columns
value_list = [row[i] for row in dataset] # all attr's values
unique_values = sorted(list(set(value_list))) # eliminate duplicates
for value in unique_values:
new_entropy = 0.0
sub_dataset_match = []
sub_dataset_no_match = []
for row in dataset:
if row[i] == value:
sub_dataset_match.append(row)
else:
sub_dataset_no_match.append(row)
if len(sub_dataset_match) != 0:
prob = sum(row[-2] for row in sub_dataset_match) \
/ sum(row[-2] for row in dataset)
new_entropy += prob * calc_entropy(sub_dataset_match)
if len(sub_dataset_no_match) != 0:
prob = sum(row[-2] for row in sub_dataset_no_match) \
/ sum(row[-2] for row in dataset)
new_entropy += prob * calc_entropy(sub_dataset_no_match)
info_gain = base_entropy - new_entropy
if (info_gain > best_info_gain): # best info gain so far?
best_info_gain = info_gain
best_attr = i
best_value = value
return best_attr, best_value, best_info_gain
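# In symbols: for a candidate split of dataset S into the matching subset S1
# and the non-matching subset S2, the loop above computes
#     gain = H(S) - (w(S1)/w(S)) * H(S1) - (w(S2)/w(S)) * H(S2)
# where H is calc_entropy and w() sums a subset's weight column.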
def majority_count(dataset):
"""
Determine a dataset's most common class
"""
class_count={}
for row in dataset:
if row[-1] not in class_count.keys():
class_count[row[-1]] = 0.0
class_count[row[-1]] += row[-2]
sortedClassCount = sorted(class_count.items(),
key=operator.itemgetter(1),
reverse=True)
return sortedClassCount[0][0]
def create_tree(n, dataset, labels):
"""
Recursively create a decision tree given a node, a data set and
a list of attribute column labels.
"""
class_list = [row[-1] for row in dataset]
if class_list.count(class_list[0]) == len(class_list):
n[NODE_ATTR] = None
n[NODE_VALUE] = class_list[0] # a leaf when all classes are equal
return
best_attr, best_value, best_info_gain = best_attr_val_for_split(dataset)
if best_info_gain < minimum_info_gain or \
sum(row[-2] for row in dataset) < minimum_weight:
n[NODE_ATTR] = None
n[NODE_VALUE] = majority_count(dataset) # a leaf when no progress
return
n[NODE_ATTR] = best_attr
n[NODE_VALUE] = best_value
sub_dataset_match = []
sub_dataset_no_match = []
for row in dataset:
if row[best_attr] == best_value:
sub_dataset_match.append(row)
else:
sub_dataset_no_match.append(row)
if len(sub_dataset_match) != 0:
left_child = my_tree.add_node(None, None, n)
n[NODE_LEFT] = left_child
create_tree(left_child, sub_dataset_match, labels)
if len(sub_dataset_no_match) != 0:
right_child = my_tree.add_node(None, None, n)
n[NODE_RIGHT] = right_child
create_tree(right_child, sub_dataset_no_match, labels)
return
def classify(n, row):
if my_tree.is_leaf(n):
return n[NODE_VALUE]
if row[n[NODE_ATTR]] == n[NODE_VALUE]:
return classify(n[NODE_LEFT], row)
else:
return classify(n[NODE_RIGHT], row)
def print_tree(n):
global depth
global csv_writer
if my_tree.is_leaf(n):
csv_writer.writerow([" "] * depth +
["class=" + str(n[NODE_VALUE])])
print(depth * indent,
"class =", n[NODE_VALUE])
depth -= 1
return
else:
csv_writer.writerow([" "] * depth +
[str(my_data_labels[n[NODE_ATTR]]) +
"=" +
str(n[NODE_VALUE])
])
print(depth * indent,
my_data_labels[n[NODE_ATTR]],
"=",
n[NODE_VALUE])
depth += 1
print_tree(n[NODE_LEFT])
csv_writer.writerow([" "] * depth +
["else"])
print(depth * indent,
"else")
depth += 1
print_tree(n[NODE_RIGHT])
depth -= 1
return
"""
Parse command line arguments. Run-time help text is provided for the user
via the -h argument, as found in the code below.
Sample Command Line        Action Taken
-------------------        ------------
python Dc2.py MyFile       Build decision tree from MyFile.csv,
                           saving the tree to MyFile.pic. Then
                           re-categorize the input, writing output to
                           MyFileOut.csv, and a text representation
                           of the tree to MyFileTree.csv.
python Dc2.py MyFile -c    Categorize MyFile.csv using the decision tree
                           stored in MyFile.pic (which was created on
                           a previous tree-building run). Write output
                           to MyFileOut.csv, and a text representation
                           of the tree to MyFileTree.csv.
"""
parser = argparse.ArgumentParser(description="Dc2.py - learn a decision tree \
from input data and/or categorize the data")
parser.add_argument("filestem",
help="filename base for input and output files")
parser.add_argument("-i",
help="threshold info gain for decision node (default=0)",
type=float)
parser.add_argument("-w",
help="threshold weight for decision node (default=0)",
type=float)
parser.add_argument("-b",
help="build decision tree from input data \
(True if (Not -c))", action="store_true")
parser.add_argument("-c",
help="categorize input data using decision tree \
(True if (Not -b))", action="store_true")
parser.add_argument("-d",
help="turn on debug messages", action="store_true")
args = parser.parse_args()
filestem = args.filestem
minimum_info_gain = 0.02 if args.i is None else args.i
minimum_weight = 0.0 if args.w is None else args.w
build_tree = args.b or not args.c    # build by default unless only -c was given
categorize = args.c or not args.b    # categorize by default unless only -b was given
debug_mode = args.d
indent = " " # tree printing - num spaces for indenting
depth = 0 # tree printing - indentation depth
"""
OK, so let's get on with it!
"""
if build_tree:
"""
Build a decision tree from the dataset
"""
my_data, my_data_labels = get_dataset()
my_tree = Btree()
root = my_tree.add_root()
create_tree(root, my_data, my_data_labels) # create the decision tree
fw = open(filestem + ".pic", "wb") # pickle the decision tree
pickle.dump(my_tree,fw)
fw.close()
"""
Display run parameters and a text representation of the decision tree
"""
print()
print("======>> Decision Tree")
print()
print("Input parameters: Minimum info gain =", minimum_info_gain,
" Minimum weight =", minimum_weight)
print()
root = my_tree.get_root() # print tree in if/else format
with open(filestem + "Tree.csv", "w", newline='') as csv_output_file:
csv_writer = csv.writer(csv_output_file, quoting=csv.QUOTE_MINIMAL)
print_tree(root)
print()
num_nodes = 0
num_leaves = 0
for node in my_tree.preorder(): # traverse tree and count nodes
num_nodes += 1
if my_tree.is_leaf(node):
num_leaves += 1
print("Tree structure: Total nodes =", num_nodes,
" Internal nodes =", num_nodes - num_leaves,
" Leaves =", num_leaves)
print()
if categorize:
"""
Classify the dataset based on the decision tree
"""
my_data, my_data_labels = get_dataset() # read input data from .csv file
    fr = open(filestem + ".pic", "rb") # un-pickle the decision tree
    my_tree = pickle.load(fr)
    fr.close()
root = my_tree.get_root() # point at the root node
for row in my_data: # classify the data via the tree,
row.append(classify(root, row)) # appending a new class column
    # write the output in .csv format to filestemOut.csv
with open(filestem + "Out.csv", "w") as csv_output_file:
csv_writer = csv.writer(csv_output_file, dialect="excel")
        csv_writer.writerow(my_data_labels + ["WEIGHT", "CLASS", "PREDICTED"])
for row in my_data:
csv_writer.writerow(row)
"""
Display statistics
"""
    my_data_predicted = []
    my_data_actual = []
    by_weight_correct = 0.0
    by_weight_total = 0.0
    for x in my_data:
        my_data_predicted.append(x[-1])    # predicted class
        my_data_actual.append(x[-2])       # actual class
        if x[-1] == x[-2]:
            by_weight_correct += x[-3]     # x[-3] is the case weight
        by_weight_total += x[-3]
"""
Display a confusion matrix if there are ten or fewer class values
"""
    if len(set(my_data_actual)) <= 10:
print("")
print("======>> Confusion Matrix (columns show predicted values, " +
"rows show actual)")
print("")
        cm = confusion_matrix(my_data_actual, my_data_predicted)
cm_s = [[str(e) for e in row] for row in cm] # make type=string
# add column and row labels to the confusion matrix
ucase_alpha = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
column_labels = [""]
count = 0
for row in cm_s:
row.insert(0, ucase_alpha[count])
column_labels.append(ucase_alpha[count])
count += 1
cm_s.insert(0, column_labels)
# find the max element length for each column
max_len = [max(map(len, col)) for col in zip(*cm_s)]
# build format string
fmt = "\t".join("{{:{}}}".format(x) for x in max_len)
# apply format to each row in the confusion matrix and print
table = [fmt.format(*row) for row in cm_s]
print("\n".join(table))
print("")
# print key to row/column labels
unique_values = sorted(list(set(my_data_actual)))
count=0
for x in unique_values:
print(ucase_alpha[count], " = ", unique_values[count])
count += 1
"""
Display accuracy
"""
print()
print("======>> Classification statistics")
print()
a_percent = accuracy_score(my_data_actual, my_data_predicted)
a_quantity = accuracy_score(my_data_actual, my_data_predicted,
normalize=False)
print(a_quantity, " cases correctly predicted")
print(len(my_data_actual) - a_quantity, " incorrectly predicted")
print("Accuracy (by case) = {0:.4f}".format(a_percent))
print("Accuracy (by weight) = {0:.4f}".format
(by_weight_correct / by_weight_total))
print("")
print("======>> END")
Travelling Salesman Problem - Graph Traversal
This is a brute force solution to the TSP. This classic problem is a difficult one for a computer to solve, and mathematicians have been searching for optimizations, with limited success, for the past 150 years. As the number of nodes increases, the number of possible paths grows factorially. No fancy math here - this algorithm explores every possible path through a graph that begins and ends at the starting point, choosing the path that covers the shortest distance.
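To put a number on that growth: in a complete graph with n nodes, even after fixing the starting point there are (n-1)! distinct orders in which to visit the remaining nodes, as this little loop shows. (The traversal below actually explores even more paths than that, because it permits revisiting a node so long as no directed edge is reused.)

from math import factorial
for n in (5, 10, 15):
    print(n, 'nodes ->', factorial(n - 1), 'orderings')
# 5 nodes -> 24 orderings
# 10 nodes -> 362880 orderings
# 15 nodes -> 87178291200 orderings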
'''
Let's have a little graph traversal fun with the TSP (Travelling Salesman
Problem)!
This program investigates all the possible ways to traverse a graph,
starting and ending at the same point, and visiting all of the nodes.
It's capable of navigating graphs with dead end nodes - nodes from which
you must back up before continuing to navigate through the rest of the
graph.
Below lies the GGB (Great Graph Boneyard). Pick one of these and use it
as the "Graph du Jour", or else build a graph of your own design. A
digraph (directed graph) is a collection of edges, each edge consisting
of two nodes connected by a path that leads from the first node to the
second. For example, "['A','B',3]" represents a path from node A to node
B of length 3.
# line segment AB
graph = [['A','B',3],
['B','A',3]]
# line segment ABC
graph = [['A','B',3],
['B','A',3],
['B','C',4],
['C','B',4]]
# right triangle, but with no path leading from C back to the start at A
graph = [['A','B',3],
['A','C',4],
['B','A',3],
['B','C',2],
['C','B',2]]
# equilateral triangle, plus one dead-end leg between C and D
graph = [['A','B',2],
['A','C',2],
['B','A',2],
['B','C',2],
['C','A',2],
['C','B',2],
['C','D',5],
['D','C',5]]
# triangle with a long (windy mountain road?) leg between B and C
graph = [['A','B',3],
['A','C',2],
['B','A',3],
['B','C',20],
['C','A',2],
['C','B',20]]
# 5 nodes, with all nodes connected to all the other nodes by edges
#
# This one runs for about 30 seconds on my laptop, demonstrating the
# complexity of the TSP for even small numbers of nodes. The output is...
# Shortest tour: ['A', 'B', 'C', 'D', 'E', 'A']
# Shortest distance: 5
# Number of complete tours: 1,015,968
# Number of aborted tours: 4,115,212
# Number of nodes visited: 82,538,496
# Elapsed time: 28.828
#
graph = [['A','B',1],
['A','C',1],
['A','D',1],
['A','E',1],
['B','A',1],
['B','C',1],
['B','D',1],
['B','E',1],
['C','A',1],
['C','B',1],
['C','D',1],
['C','E',1],
['D','A',1],
['D','B',1],
['D','C',1],
['D','E',1],
['E','A',1],
['E','B',1],
['E','C',1],
['E','D',1]]
'''
import time
def traverse(node, log):
global graph, start, tours, non_tours
log.append(node) # note that we came to this node
if (node == start and len(log) > 1): # have returned to the start?
if set(log) == set([edge[G_FROM] for edge in graph]):
tours.append(log) # have visited all nodes, and are
return # at the end of a complete tour
for next_node in [edge[G_TO] for edge in graph if edge[G_FROM] == node]:
# have previously traversed this edge in this direction?
been_there_done_that = False
for i in range(0, len(log) - 1):
if (log[i] == node and log[i + 1] == next_node):
been_there_done_that = True
non_tours.append(log)
if not been_there_done_that: # don't re-trace your steps
next_log = log[:]
traverse(next_node, next_log)
# Graph du Jour:
#
# right triangle, but with no path leading from C back to A
graph = [['A','B',3],
['A','C',4],
['B','A',3],
['B','C',2],
['C','B',2]]
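# For this graph the program should report a shortest complete tour of
# ['A', 'C', 'B', 'A'] with distance 4 + 2 + 3 = 9; the alternative
# ['A', 'B', 'C', 'B', 'A'] costs 3 + 2 + 2 + 3 = 10, since the missing
# C->A edge forces a retreat back through B.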
# Offsets to edge elements: from node, to node, distance
G_FROM = 0
G_TO = 1
G_LENGTH = 2
tours = [] # the list of complete tours
non_tours = [] # the list of failed attempts
start = graph[0][G_FROM] # first FROM node in graph is starting point
log = []
time_start = time.time()
traverse(start, log) # kick off all the possible traversals
time_end = time.time()
# determine which of all the successful tours was the shortest
tours_shortest = sum([edge[G_LENGTH] for edge in graph]) + 1 # too big
tours_best = []
for tour in tours:
length = 0
for i in range(0, len(tour) - 1):
for j in range(0, len(graph)):
if (tour[i] == graph[j][G_FROM] and
tour[i + 1] == graph[j][G_TO]):
length += graph[j][G_LENGTH]
if length < tours_shortest:
tours_shortest = length
tours_best = tour
# determine the total number of nodes visited
nodes_visited = 0
for tour in tours:
nodes_visited += len(tour)
for tour in non_tours:
nodes_visited += len(tour)
# show stats
print('Shortest tour: ', tours_best)
print('Shortest distance: ', tours_shortest)
print('Number of complete tours: ', len(tours))
print('Number of aborted tours: ', len(non_tours))
print('Number of nodes visited: ', nodes_visited)
print('Elapsed time: ', round(time_end - time_start, 3))