Python
by Mike Thompson, Updated 4/16/2016
There's code for three Python programs here. The first is a little calculator that I wrote to get a better understanding of parsing computer languages. As a calculator it's just a toy, but it was interesting to write. The second is a decision tree machine learner that was my capstone project for the University of Washington Data Science certificate program. I've named it Dc2 because it's the second decision tree machine learner that I've written, and because there was once a nice airplane that went by a similar name. The third is a brute force solution to the classic Travelling Salesman Problem (TSP), a graph traversal problem.
The calculator runs on either a Python 2 or Python 3 interpreter, with just a couple of minor changes that are noted in the source code. Dc2 and TSP were written for Python 3, but it should take only a little work to get them running on Python 2 if so desired.
Calculator
This is a calculator, written in Python. You wouldn't pay much money for it - it only handles addition, subtraction, multiplication, division, and the use of parentheses to control the order of operations. No unary minus, no exponents, no statistical functions, no ...<the long list of things that a TI-84 calculator will do goes here>. But what it has done is give me a chance to explore some classic computer science problems while writing some Python code.
Parsing is one of those classic computer science problems, and I got to learn about grammars as a means of generating syntactically-correct sentences in a language like arithmetic expressions (or regular expressions, or Python, or Swahili). Then I got to learn about the types of computational machines (e.g. push-down automata) sufficient to parse them. "Introduction to Languages, Machines, and Logic" by Alan Parkes was a great read on these topics, and in it I learned a bit about the seminal work of Noam Chomsky in the field of linguistics, and of Alan Turing in the field of automata. I also got to explore Edsger Dijkstra's "shunting-yard algorithm" solution to the problem of converting an arithmetic expression from infix notation (the way we humans write 'em) to RPN (Reverse Polish Notation). I got to implement stacks with Python lists and play with Python's support for regular expressions, among other fun things.
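Here, for reference, is the grammar the calculator actually implements, copied from the comments in the source below; each rule names a construct and what it may expand into:

Expr    <= Sum
Sum     <= Product (('+' | '-') Product)*
Product <= Value (('*' | '/') Value)*
Value   <= [0-9]+ | '(' Expr ')'

Because a Sum is built from Products, and a Product from Values, multiplication and division bind more tightly than addition and subtraction, and parentheses restart the expansion at the top.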
Within the main loop, the first section is the scanner, implemented using Python's "re" regular expressions library. The scanner chops the user input into tokens (numbers and operators) that are the building blocks of arithmetic expressions. It rejects any symbols that don't have meaning in the language of arithmetic expressions, like "@" or "_". The second section is the parser, which determines whether the series of tokens entered by the user constitutes a syntactically-correct arithmetic expression. Just as a good grammar checker would reject the sentence "Go to a the store.", because it doesn't make sense to have both "a" and "the" preceding the word "store", an arithmetic expression parser should reject "2 + * 3".
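Here's a minimal sketch of that scanning technique, using the same named-group trick as the full program below (the group names are abbreviated for the example):

import re

# Each alternative gets a named group, so match.lastgroup tells us the token type.
token_regex = r'(?P<NUMBER>\d+(\.\d*)?|\.\d+)|(?P<OP>[+\-*/()])|(?P<WS>[ \t])'
scan = re.compile(token_regex).match

def tokenize(text):
    tokens, pos = [], 0
    match = scan(text)
    while match is not None:
        if match.lastgroup != 'WS':          # skip whitespace
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
        match = scan(text, pos)              # resume scanning where we left off
    if pos != len(text):                     # leftover chars are invalid symbols
        raise ValueError('Unexpected character: %s' % text[pos])
    return tokens

print(tokenize('2 + 3 * 4'))
# [('NUMBER', '2'), ('OP', '+'), ('NUMBER', '3'), ('OP', '*'), ('NUMBER', '4')]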
Given an arithmetic expression that contains no invalid symbols and is syntactically correct, the third section of the main loop evaluates it and returns the result. It first employs Dijkstra's shunting-yard algorithm to convert the expression from infix notation (e.g. 2 + 2) to RPN (e.g. 2 2 +). If you've ever used an HP scientific calculator you're familiar with RPN, where you enter the operands before the operator. The final step is to execute the RPN, with the variable "a" serving as the accumulator register.
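Here's a stripped-down sketch of those two steps, simplified to numbers and the four operators with no parentheses (the full program below handles those too):

def to_rpn(tokens):
    """Dijkstra's shunting-yard, pared down to the four binary operators."""
    precedence = {'+': 1, '-': 1, '*': 2, '/': 2}
    output, ops = [], []
    for tok in tokens:
        if isinstance(tok, float):
            output.append(tok)
        else:
            # pop higher-or-equal precedence operators before pushing this one
            while ops and precedence[ops[-1]] >= precedence[tok]:
                output.append(ops.pop())
            ops.append(tok)
    while ops:                    # drain any remaining operators
        output.append(ops.pop())
    return output

def eval_rpn(rpn):
    """Evaluate RPN with a stack: operands push, operators pop two."""
    stack = []
    for tok in rpn:
        if isinstance(tok, float):
            stack.append(tok)
        else:
            b, a = stack.pop(), stack.pop()
            if tok == '+':
                stack.append(a + b)
            elif tok == '-':
                stack.append(a - b)
            elif tok == '*':
                stack.append(a * b)
            else:
                stack.append(a / b)
    return stack[0]

print(to_rpn([2.0, '+', 3.0, '*', 4.0]))    # [2.0, 3.0, 4.0, '*', '+']
print(eval_rpn([2.0, 3.0, 4.0, '*', '+']))  # 14.0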
Primitive though it is, writing this little calculator has provided a great excuse for diving into some classic computer science territory that I've wanted to explore for many years, but never got around to.
from __future__ import division  # to avoid floor division in Python 2.x
import re                        # regular expressions

in_debug_mode = None             # debug flag

def get_token_parse():
    global curr_token
    if input_expression:
        curr_token = input_expression.pop()
        return(True)
    else:
        return(False)  # no next char - we've hit the end of the input

#
# Value <= [0-9]+ | '(' Expr ')'
#
def Value():
    if get_token_parse():
        if curr_token[TYPE] == 'NUMBER':
            return(True)
        elif curr_token[TYPE] == 'LPAREN':
            if Expr():
                if get_token_parse():
                    if curr_token[TYPE] == 'RPAREN':
                        return(True)
                    else:
                        return(False)  # no closing )
                else:
                    return(False)      # no closing )
            else:
                return(False)          # invalid expression
        else:
            input_expression.append(curr_token)  # unexpected token: backtrack
            return(False)
    else:
        return(False)  # unexpected end of expression

#
# Product <= Value (('*' | '/') Value)*
#
def Product():
    if Value():
        if get_token_parse():
            while (curr_token[TYPE] == 'OP_MUL' or
                   curr_token[TYPE] == 'OP_DIV'):
                if not Value():
                    return(False)  # no Value following operator
                else:
                    if get_token_parse():  # possibly another * or /
                        continue
                    else:
                        return(True)  # we've hit the end of the input
            input_expression.append(curr_token)  # Not * or /, so backtrack
            return(True)   # we've found a single Value
        else:
            return(True)   # we've hit the end of the input
    else:
        return(False)

#
# Sum <= Product (('+' | '-') Product)*
#
def Sum():
    if Product():
        if get_token_parse():
            while (curr_token[TYPE] == 'OP_ADD' or
                   curr_token[TYPE] == 'OP_SUB'):
                if not Product():
                    return(False)  # no Product following operator
                else:
                    if get_token_parse():  # possibly another + or -
                        continue
                    else:
                        return(True)  # we've hit the end of the input
            input_expression.append(curr_token)  # Not + or -, so backtrack
            return(True)   # we've found a single Product
        else:
            return(True)   # we've hit the end of the input
    else:
        return(False)      # no Product found

#
# Expr <= Sum
#
def Expr():
    return(Sum())

token_specs = [
    ('NUMBER',     r'\d+(\.\d*)?|\.\d+'),  # Integer or decimal number
    ('OP_ADD',     r'[+]'),                # +
    ('OP_SUB',     r'[\-]'),               # -
    ('OP_MUL',     r'[*]'),                # *
    ('OP_DIV',     r'[\/]'),               # /
    ('LPAREN',     r'[(]'),                # (
    ('RPAREN',     r'[)]'),                # )
    ('WHITESPACE', r'[ \t]'),              # Skip spaces & tabs
]
token_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specs)
get_token_scan = re.compile(token_regex).match

TYPE = 0   # index to token type
VALUE = 1  # index to token value

rpn_expression = []  # Reverse Polish Notation version of the expression
op_stack = []        # Operator stack for Dijkstra's shunting-yard algorithm
op_precedence = {'+': 1, '-': 1, '*': 2, '/': 2}  # Op precedence lookup
eval_stack = []      # Stack for evaluating the RPN expression
a = None             # Accumulator for evaluating the RPN expression

while True:
    # If Python 2.x use: raw_input('Enter a...
    # If Python 3.x use: input('Enter a...
    user_input = input('Enter an arithmetic expression, or "Stop" to exit: ')
    if user_input.upper() == 'STOP':
        break
    #
    # Scan
    #
    input_expression = []
    curr_pos = 0
    match = get_token_scan(user_input)
    while match is not None:
        type = match.lastgroup
        if type != 'WHITESPACE':
            if type == 'NUMBER':
                value = float(match.group(type))
            else:
                value = match.group(type)
            input_expression.append((type, value))
            if in_debug_mode:
                print('Type: %s Value: %s' %(type, value))
        curr_pos = match.end()
        match = get_token_scan(user_input, curr_pos)
    # If unable to tokenize all of the input, then it contains invalid chars
    if curr_pos != len(user_input):
        print('Unexpected character: %s' %(user_input[curr_pos]))
        break
    if in_debug_mode:
        print('Input: %s' %(input_expression))
    #
    # Parse
    #
    save_input_expression = input_expression[:]
    input_expression.reverse()  # pop to read & push to backtrack
    curr_token = []
    if Expr():
        if input_expression:
            # Reached halt state but input not consumed, e.g. "2+3 4"
            print('Extraneous input at ', input_expression[-1])
            break
    else:
        # Unable to reach halt state, e.g. "2+-" or "2+"
        if input_expression:
            print('Parsing error at ', input_expression[-1])
            break
        else:
            print('Parsing error: Unexpected end of expression.')
            break
    #
    # Evaluate
    #
    rpn_expression = []  # reset the per-expression evaluation state
    op_stack = []
    eval_stack = []
    a = None
    input_expression = save_input_expression
    for token in input_expression:
        if token[TYPE] == 'NUMBER':
            rpn_expression.append(token[VALUE])
        elif token[TYPE] == 'LPAREN':
            op_stack.append(token[VALUE])
        elif token[TYPE] == 'RPAREN':
            while op_stack and op_stack[-1] != '(':
                rpn_expression.append(op_stack.pop())
            if not op_stack:
                print('Eval: Unmatched parenthesis (#1).')
                raise BufferError
            op_stack.pop()
        elif token[VALUE] in op_precedence:
            while op_stack and op_stack[-1] in op_precedence and \
                  op_precedence[token[VALUE]] <= op_precedence[op_stack[-1]]:
                rpn_expression.append(op_stack.pop())
            op_stack.append(token[VALUE])
        else:
            print('Eval: Token not recognized.')
            raise LookupError
    while op_stack:
        if op_stack[-1] == '(' or op_stack[-1] == ')':
            print('Eval: Unmatched parenthesis (#2).')
            raise BufferError
        rpn_expression.append(op_stack.pop())
    # So we can use "pop" to read RPN, and "push" to store the accumulator
    rpn_expression.reverse()
    while rpn_expression:
        if isinstance(rpn_expression[-1], int) or \
           isinstance(rpn_expression[-1], float):
            if a is not None:          # stash the previous operand
                eval_stack.append(a)   # (a may legitimately be 0)
            a = rpn_expression.pop()
        else:
            if rpn_expression[-1] == '+':
                a = eval_stack[-1] + a
            elif rpn_expression[-1] == '-':
                a = eval_stack[-1] - a
            elif rpn_expression[-1] == '*':
                a = eval_stack[-1] * a
            else:
                a = eval_stack[-1] / a
            eval_stack.pop()
            rpn_expression.pop()
    # If Python 2.x use: print('The answer is: %s') %round(a, 2)
    # If Python 3.x use: print('The answer is: ' + str(a))
    print('The answer is: ' + str(a))

print('Bye!')
Dc2
My final project for the University of Washington Data Science Certificate Program was to write a decision tree machine learner in Python. That's a lot to do in a month, and my instructor advised me at the outset that I might be biting off more than I could chew, suggesting instead that I join one of the Kaggle project teams as a safer alternative. But I really wanted to write it, had been reading about and thinking about decision trees since spring of this year, and had found code snippets that showed how to solve some key parts of the problem. It was one of those "no guts, no glory" moments, and I decided to go for it! I did manage to finish it on time, and I named it Dc2.
How does it stack up against other popular decision tree machine learners? I wondered that myself, so I ran a comparison. First I downloaded the Titanic training dataset from Kaggle, and then did a number of transformations on the data. I discretized the continuous attributes (e.g. Age, Fare), and binarized or threw out some attributes that constituted leaks (e.g. Cabin Number, Passenger ID, Name). Then I ran the data through both Dc2 and Weka's J48 decision tree machine learners. The results of the two runs were very close, both classifiers yielding models with just over 80% accuracy on the training data. So its accuracy is close to that of popular, big-name products.
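For the curious, discretizing a continuous attribute just means mapping numeric values into a small set of named bins. Here's a quick sketch; the cut points below are made up for illustration, not the ones I actually used on the Titanic data:

def bin_age(age):
    """Map a continuous Age value onto nominal bins (illustrative cut points)."""
    if age is None:       # Titanic has missing ages
        return 'Unknown'
    elif age < 13:
        return 'Child'
    elif age < 20:
        return 'Teen'
    elif age < 60:
        return 'Adult'
    else:
        return 'Senior'

print([bin_age(age) for age in (6, 15, 34, 71, None)])
# ['Child', 'Teen', 'Adult', 'Senior', 'Unknown']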
You certainly wouldn't call it feature-rich, though. Here's a list of things it won't do that you might expect from a decision tree machine learner.
- No GUI - command line invocation only.
- Doesn't support continuous-valued attributes like weight or age, so continuous data must be binned into nominal values (e.g. Small/Medium/Large).
- No pruning or other sophisticated methods for balancing overfitting vs. underfitting.
- No ensemble methods like bagging or random forests.
- No cross-validation or other sophisticated methods for predicting how well the model will generalize.
Here's a list of what it will do.
- Read in a labeled dataset in .CSV format.
- Build a binary decision tree consisting of internal nodes containing attribute/value pairs for dataset splitting decisions, and of leaves containing class values. (The node layout is sketched just after this list.)
- Save the decision tree to a file, and re-load it for the post-tree-building processing of test or production datasets.
- Generate a text representation of the decision tree and statistics on the number/type of nodes in the tree.
- Classify a dataset (training, test, or production), appending a column containing the predicted class values, and then write the results to a file.
- Generate statistics on the results of a classification run, including accuracy and a confusion matrix.
- Accept command line parameters specifying: the names of the input/output files, whether a tree-building or classification run (or both) should be performed, the values of two thresholds that can be used to avoid overfitting, and whether to activate a debug mode that enables diagnostic print statements.
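Internally, each tree node is a plain Python list, indexed by the constants defined in the source below; the example values in the last line are hypothetical:

# Offsets into a node list, as defined in the Dc2 source
NODE_PARENT = 0  # parent node
NODE_LEFT = 1    # left child (rows matching the attribute/value test)
NODE_RIGHT = 2   # right child (rows not matching)
NODE_ATTR = 3    # attribute column to test or, if leaf, None
NODE_VALUE = 4   # attribute value to match or, if leaf, the class

# e.g. a hypothetical internal node testing whether column 2 ("Sex") == "female";
# the parent and children get filled in as the tree grows
node = [None, None, None, 2, 'female']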
The one required command line parameter is filestem, which represents the root of the input and output files used by Dc2. For example, if "Titanic" is specified for the filestem parameter, Dc2 will read the input dataset from Titanic.csv, will save/re-load the decision tree to/from Titanic.pic, will write the results of the classification step to TitanicOut.csv, and will write a human-readable rendition of the decision tree to TitanicTree.csv. Specifying the -h command line parameter will cause the following help text to be displayed. (At your command prompt, type "python Dc2.py -h".)
usage: Dc2.py [-h] [-i I] [-w W] [-b] [-c] [-d] filestem
Dc2.py - learn a decision tree from input data and/or categorize the data
positional arguments:
  filestem    filename base for input and output files

optional arguments:
  -h, --help  show this help message and exit
  -i I        threshold info gain for decision node (default=0.02)
  -w W        threshold weight for decision node (default=0)
  -b          build decision tree from input data (True if (Not -c))
  -c          categorize input data using decision tree (True if (Not -b))
  -d          turn on debug messages
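So a typical pair of runs might look like this (the -i value here is just an illustration):

python Dc2.py Titanic -b -i 0.05   # build only, requiring 0.05 info gain per split
python Dc2.py Titanic -c           # later: classify using the saved Titanic.pic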
""" Dc2.py - A decision tree classifier written by Mike Thompson in 2015 This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License, Version 3, which can be found at http://www.gnu.org/licenses. This program is distributed in the hope that it will be useful, but without any warranty expressed or implied, including the implied warranties of merchantability or fitness for a particular purpose. Acknowledgements ---------------- 1. "C4.5: Programs for Machine Learning" by Ross Quinlan (1993). A classic, covering all aspects of decision trees, and a pleasure to read. 2. "Machine Learning in Action" by Peter Harrington (2012). An excellent do-it-yourself-in-Python, containing sample code covering a wide variety of data science topics. 3. "Data Structures & Algorithms in Python" by Michael Goodrich, Roberto Tamassia, and Michael Goldwasser (2013). Chapter 8 on trees was invaluable. This program was written in Python 3, specifically the Anaconda 2.3 distribution of Python 3.4.3. Rev Date Remarks ---- ---- ------- 1.0 12/07/2015 And away we go! 1.01 01/08/2016 Write decision tree to .CSV file. 1.02 02/04/2016 Compute accuracy both by case and by weight. """ import argparse from math import log import operator import csv import pickle from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score class Btree: """ I think that I shall never see a poem lovely as a tree. A tree whose hungry mouth is prest against the earth's sweet flowing breast; A tree that looks at God all day, and lifts her leafy arms to pray; A tree that may in summer wear a nest of robins in her hair; Upon whose bosom snow has lain; who intimately lives with rain. Poems are made by fools like me, but only God can make a tree. -- Joyce Kilmer (1913) """ def __init__(self): """Create an initially-empty binary tree.""" self._root = None self._size = 0 def get_root(self): return self._root def set_root(self, node): self._root = node def get_size(self): return self._size def set_size(self, size): self._size = size def is_empty(self): """Return True if the tree is empty.""" return self.get_root() == None def is_leaf(self, n): """ Return True if node does not have any children, else False. """ return self.num_children(n) == 0 def num_children(self, n): """ Return node's number of children (0, 1, or 2). """ count = 0 if n[NODE_LEFT] is not None: # left child exists count += 1 if n[NODE_RIGHT] is not None: # right child exists count += 1 return count def children(self, n): """Generate an iteration of node's children.""" if not n[NODE_LEFT] == None: yield n[NODE_LEFT] if not n[NODE_RIGHT] == None: yield n[NODE_RIGHT] def preorder(self): """Generate a preorder iteration of all the nodes in the tree.""" if not self.is_empty(): for n in self.subtree_preorder(self.get_root()): yield n def subtree_preorder(self, n): """Generate a preorder iteration of nodes of subtree rooted at n.""" yield n # visit n before its subtrees for c in self.children(n): # for each child, c for other in self.subtree_preorder(c): # preorder of c's subtree yield other def add_root(self): """ Create and return the root of an empty tree. Raise ValueError if tree nonempty. """ if self.get_root() is not None: raise ValueError("Root already exists") new = self.add_node() self.set_root(new) self.set_size(1) return new def add_node(self, attr=None, value=None, parent=None, left=None, right=None): """ Create and return a new node. 
""" new = [parent, left, right, attr, value] self._size += 1 return new # Offsets to node elements NODE_PARENT = 0 # index to parent NODE_LEFT = 1 # index to left child NODE_RIGHT = 2 # index to right child NODE_ATTR = 3 # index to attr column num or, if leaf, None NODE_VALUE = 4 # index to attr value or, if leaf, class def get_dataset(): """ Read input dataset in .CSV (Comma Separated Values) format, in which the first row contains column labels. """ labels = [] dataset = [] with open(filestem + ".csv") as csv_input_file: csv_reader = csv.reader(csv_input_file, dialect="excel") labels = next(csv_reader) for row in csv_reader: dataset.append(row) csv_input_file.close() labels = labels[: len(dataset[0]) - 2] # ditch labels for weight & class for row in dataset: row[-2] = float(row[-2]) # convert weight from string to numeric return dataset, labels def calc_entropy(dataset): """ Calculate the entropy of a data set in which the next-to-last column contains weights and the last column contains the class values. """ label_weights = {} total_weight = 0.0 for row in dataset: current_label = row[-1] if current_label not in label_weights.keys(): label_weights[current_label] = 0 label_weights[current_label] += row[-2] total_weight += row[-2] shannon_entropy = 0.0 for key in label_weights: prob = label_weights[key] / total_weight shannon_entropy -= prob * log(prob,2) #log base 2 return shannon_entropy def best_attr_val_for_split(dataset): """ Return the column index of the attribute, and the value of that attribute, that will yield the greatest information gain. """ num_attrs = len(dataset[0]) - 2 # ignore last two columns base_entropy = calc_entropy(dataset) best_info_gain = 0.0 best_attr = -1 best_value = None for i in range(num_attrs): # iterate over all the attribute columns value_list = [row[i] for row in dataset] # all attr's values unique_values = sorted(list(set(value_list))) # eliminate duplicates for value in unique_values: new_entropy = 0.0 sub_dataset_match = [] sub_dataset_no_match = [] for row in dataset: if row[i] == value: sub_dataset_match.append(row) else: sub_dataset_no_match.append(row) if len(sub_dataset_match) != 0: prob = sum(row[-2] for row in sub_dataset_match) \ / sum(row[-2] for row in dataset) new_entropy += prob * calc_entropy(sub_dataset_match) if len(sub_dataset_no_match) != 0: prob = sum(row[-2] for row in sub_dataset_no_match) \ / sum(row[-2] for row in dataset) new_entropy += prob * calc_entropy(sub_dataset_no_match) info_gain = base_entropy - new_entropy if (info_gain > best_info_gain): # best info gain so far? best_info_gain = info_gain best_attr = i best_value = value return best_attr, best_value, best_info_gain def majority_count(dataset): """ Determine a dataset's most common class """ class_count={} for row in dataset: if row[-1] not in class_count.keys(): class_count[row[-1]] = 0.0 class_count[row[-1]] += row[-2] sortedClassCount = sorted(class_count.items(), key=operator.itemgetter(1), reverse=True) return sortedClassCount[0][0] def create_tree(n, dataset, labels): """ Recursively create a decision tree given a node, a data set and a list of attribute column labels. 
""" class_list = [row[-1] for row in dataset] if class_list.count(class_list[0]) == len(class_list): n[NODE_ATTR] = None n[NODE_VALUE] = class_list[0] # a leaf when all classes are equal return best_attr, best_value, best_info_gain = best_attr_val_for_split(dataset) if best_info_gain < minimum_info_gain or \ sum(row[-2] for row in dataset) < minimum_weight: n[NODE_ATTR] = None n[NODE_VALUE] = majority_count(dataset) # a leaf when no progress return n[NODE_ATTR] = best_attr n[NODE_VALUE] = best_value sub_dataset_match = [] sub_dataset_no_match = [] for row in dataset: if row[best_attr] == best_value: sub_dataset_match.append(row) else: sub_dataset_no_match.append(row) if len(sub_dataset_match) != 0: left_child = my_tree.add_node(None, None, n) n[NODE_LEFT] = left_child create_tree(left_child, sub_dataset_match, labels) if len(sub_dataset_no_match) != 0: right_child = my_tree.add_node(None, None, n) n[NODE_RIGHT] = right_child create_tree(right_child, sub_dataset_no_match, labels) return def classify(n, row): if my_tree.is_leaf(n): return n[NODE_VALUE] if row[n[NODE_ATTR]] == n[NODE_VALUE]: return classify(n[NODE_LEFT], row) else: return classify(n[NODE_RIGHT], row) def print_tree(n): global depth global csv_writer if my_tree.is_leaf(n): csv_writer.writerow([" "] * depth + ["class=" + str(n[NODE_VALUE])]) print(depth * indent, "class =", n[NODE_VALUE]) depth -= 1 return else: csv_writer.writerow([" "] * depth + [str(my_data_labels[n[NODE_ATTR]]) + "=" + str(n[NODE_VALUE]) ]) print(depth * indent, my_data_labels[n[NODE_ATTR]], "=", n[NODE_VALUE]) depth += 1 print_tree(n[NODE_LEFT]) csv_writer.writerow([" "] * depth + ["else"]) print(depth * indent, "else") depth += 1 print_tree(n[NODE_RIGHT]) depth -= 1 return """ Parse command line arguments. Run-time help text is provided for the user via the -h argument, as found in the code below. Sample Command Line Action Taken ------------------- ------------ python Dc2.py MyFile Build decision tree from MyFile.csv, saving the tree to MyFile.pic. Then re-catagorize input, writing output to MyFileOut.csv, and a text representation of the tree to MyFileTree.csv. python Dc2.py MyFile -c Categorize MyFile.csv using decision tree stored in MyFile.pic (which was created on a previous -d execution). Write output to MyFileOut.csv, and a text representation of the tree to MyFileTree.csv. """ parser = argparse.ArgumentParser(description="Dc2.py - learn a decision tree \ from input data and/or categorize the data") parser.add_argument("filestem", help="filename base for input and output files") parser.add_argument("-i", help="threshold info gain for decision node (default=0)", type=float) parser.add_argument("-w", help="threshold weight for decision node (default=0)", type=float) parser.add_argument("-b", help="build decision tree from input data \ (True if (Not -c))", action="store_true") parser.add_argument("-c", help="categorize input data using decision tree \ (True if (Not -b))", action="store_true") parser.add_argument("-d", help="turn on debug messages", action="store_true") args = parser.parse_args() filestem = args.filestem minimum_info_gain = 0.02 if args.i == None else args.i minimum_weight = 0.0 if args.w == None else args.w build_tree = False if args.b == False and args.c == True else True categorize = False if args.c == False and args.b == True else True debug_mode = args.d indent = " " # tree printing - num spaces for indenting depth = 0 # tree printing - indentation depth """ OK, so let's get on with it! 
""" if build_tree: """ Build a decision tree from the dataset """ my_data, my_data_labels = get_dataset() my_tree = Btree() root = my_tree.add_root() create_tree(root, my_data, my_data_labels) # create the decision tree fw = open(filestem + ".pic", "wb") # pickle the decision tree pickle.dump(my_tree,fw) fw.close() """ Display run parameters and a text representation of the decision tree """ print() print("======>> Decision Tree") print() print("Input parameters: Minimum info gain =", minimum_info_gain, " Minimum weight =", minimum_weight) print() root = my_tree.get_root() # print tree in if/else format with open(filestem + "Tree.csv", "w", newline='') as csv_output_file: csv_writer = csv.writer(csv_output_file, quoting=csv.QUOTE_MINIMAL) print_tree(root) csv_output_file.close() print() num_nodes = 0 num_leaves = 0 for node in my_tree.preorder(): # traverse tree and count nodes num_nodes += 1 if my_tree.is_leaf(node): num_leaves += 1 print("Tree structure: Total nodes =", num_nodes, " Internal nodes =", num_nodes - num_leaves, " Leaves =", num_leaves) print() if categorize: """ Classify the dataset based on the decision tree """ my_data, my_data_labels = get_dataset() # read input data from .csv file fr = open(filestem + ".pic", "rb") # un-pickle the decision tree my_tree = pickle.load(fr) root = my_tree.get_root() # point at the root node for row in my_data: # classify the data via the tree, row.append(classify(root, row)) # appending a new class column # write the output in .csv format to filestem.txt with open(filestem + "Out.csv", "w") as csv_output_file: csv_writer = csv.writer(csv_output_file, dialect="excel") csv_writer.writerow(my_data_labels + [", WEIGHT"] + [", CLASS"] \ + [", PREDICTED"]) for row in my_data: csv_writer.writerow(row) csv_output_file.close() """ Display statistics """ my_data_predicted = [] my_data_actual = [] for x in my_data: my_data_predicted.append(x[-1]) my_data_actual.append(x[-2]) """ Display a confusion matrix if there are ten or fewer class values """ if len(set(my_data[-2])) <= 10: print("") print("======>> Confusion Matrix (columns show predicted values, " + "rows show actual)") print("") my_data_predicted = [] my_data_actual = [] by_weight_correct = 0 by_weight_total = 0 for x in my_data: my_data_predicted.append(x[-1]) my_data_actual.append(x[-2]) if x[-1] == x[-2]: by_weight_correct += x[-3] by_weight_total += x[-3] cm = confusion_matrix(my_data_predicted, my_data_actual) cm_s = [[str(e) for e in row] for row in cm] # make type=string # add column and row labels to the confusion matrix ucase_alpha = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"] column_labels = [""] count = 0 for row in cm_s: row.insert(0, ucase_alpha[count]) column_labels.append(ucase_alpha[count]) count += 1 cm_s.insert(0, column_labels) # find the max element length for each column max_len = [max(map(len, col)) for col in zip(*cm_s)] # build format string fmt = "\t".join("{{:{}}}".format(x) for x in max_len) # apply format to each row in the confusion matrix and print table = [fmt.format(*row) for row in cm_s] print("\n".join(table)) print("") # print key to row/column labels unique_values = sorted(list(set(my_data_actual))) count=0 for x in unique_values: print(ucase_alpha[count], " = ", unique_values[count]) count += 1 """ Display accuracy """ print() print("======>> Classification statistics") print() a_percent = accuracy_score(my_data_actual, my_data_predicted) a_quantity = accuracy_score(my_data_actual, my_data_predicted, normalize=False) print(a_quantity, " cases 
correctly predicted") print(len(my_data_actual) - a_quantity, " incorrectly predicted") print("Accuracy (by case) = {0:.4f}".format(a_percent)) print("Accuracy (by weight) = {0:.4f}".format (by_weight_correct / by_weight_total)) print("") print("======>> END")
Travelling Salesman Problem - Graph Traversal
This is a brute force solution to the TSP. This classic problem is a difficult one for a computer to solve, and mathematicians have been searching for optimizations, with limited success, for the past 150 years. As the number of nodes increases, the number of possible paths grows factorially. No fancy math here - this algorithm explores every possible path through a graph that begins and ends at the starting point, choosing the path that covers the shortest distance.
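To get a feel for that growth rate, consider the simplest case: a complete graph where each node is visited exactly once. Fixing the starting node leaves (n-1)! possible tours. (My program's counts run higher still, because it also explores routes that pass through nodes more than once.) A quick back-of-the-envelope sketch:

from math import factorial

for n in range(5, 21, 5):
    # fix the start city; the remaining n-1 cities can be visited in any order
    print('%2d cities -> %s tours' % (n, format(factorial(n - 1), ',')))

#  5 cities -> 24 tours
# 10 cities -> 362,880 tours
# 15 cities -> 87,178,291,200 tours
# 20 cities -> 121,645,100,408,832,000 tours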
'''
Let's have a little graph traversal fun with the TSP (Traveling Salesman
Problem)!

This program investigates all the possible ways to traverse a graph,
starting and ending at the same point, and visiting all of the nodes. It's
capable of navigating graphs with dead end nodes - nodes from which you must
back up before continuing to navigate through the rest of the graph.

Below lies the GGB (Great Graph Boneyard). Pick one of these and use it as
the "Graph du Jour", or else build a graph of your own design. A digraph
(directed graph) is a collection of edges, each edge consisting of two nodes
connected by a path that leads from the first node to the second. For
example, "['A','B',3]" represents a path from node A to node B of length 3.

# line segment AB
graph = [['A','B',3], ['B','A',3]]

# line segment ABC
graph = [['A','B',3], ['B','A',3], ['B','C',4], ['C','B',4]]

# right triangle, but with no path leading from C back to the start at A
graph = [['A','B',3], ['A','C',4], ['B','A',3], ['B','C',2], ['C','B',2]]

# equilateral triangle, plus one dead-end leg between C and D
graph = [['A','B',2], ['A','C',2], ['B','A',2], ['B','C',2], ['C','A',2],
         ['C','B',2], ['C','D',5], ['D','C',5]]

# triangle with a long (windy mountain road?) leg between B and C
graph = [['A','B',3], ['A','C',2], ['B','A',3], ['B','C',20], ['C','A',2],
         ['C','B',20]]

# 5 nodes, with all nodes connected to all the other nodes by edges
#
# This one runs for about 30 seconds on my laptop, demonstrating the
# complexity of the TSP for even small numbers of nodes. The output is...
#   Shortest tour:            ['A', 'B', 'C', 'D', 'E', 'A']
#   Shortest distance:        5
#   Number of complete tours: 1,015,968
#   Number of aborted tours:  4,115,212
#   Number of nodes visited:  82,538,496
#   Elapsed time:             28.828
#
graph = [['A','B',1], ['A','C',1], ['A','D',1], ['A','E',1],
         ['B','A',1], ['B','C',1], ['B','D',1], ['B','E',1],
         ['C','A',1], ['C','B',1], ['C','D',1], ['C','E',1],
         ['D','A',1], ['D','B',1], ['D','C',1], ['D','E',1],
         ['E','A',1], ['E','B',1], ['E','C',1], ['E','D',1]]
'''
import time

def traverse(node, log):
    global graph, start, tours, non_tours
    log.append(node)  # note that we came to this node
    if (node == start and len(log) > 1):  # have returned to the start?
        if set(log) == set([edge[G_FROM] for edge in graph]):
            tours.append(log)  # have visited all nodes, and are
            return             # at the end of a complete tour
    for next_node in [edge[G_TO] for edge in graph if edge[G_FROM] == node]:
        # have previously traversed this edge in this direction?
        been_there_done_that = False
        for i in range(0, len(log) - 1):
            if (log[i] == node and log[i + 1] == next_node):
                been_there_done_that = True
                non_tours.append(log)
        if not been_there_done_that:  # don't re-trace your steps
            next_log = log[:]
            traverse(next_node, next_log)

# Graph du Jour:
#
# right triangle, but with no path leading from C back to A
graph = [['A','B',3], ['A','C',4], ['B','A',3], ['B','C',2], ['C','B',2]]

# Offsets to edge elements: (1) from node, (2) to node, (3) distance
G_FROM = 0
G_TO = 1
G_LENGTH = 2

tours = []      # the list of complete tours
non_tours = []  # the list of failed attempts
start = graph[0][G_FROM]  # first FROM node in graph is starting point
log = []

time_start = time.time()
traverse(start, log)  # kick off all the possible traversals
time_end = time.time()

# determine which of all the successful tours was the shortest
tours_shortest = sum([edge[G_LENGTH] for edge in graph]) + 1  # too big
tours_best = []
for tour in tours:
    length = 0
    for i in range(0, len(tour) - 1):
        for j in range(0, len(graph)):
            if (tour[i] == graph[j][G_FROM] and
                    tour[i + 1] == graph[j][G_TO]):
                length += graph[j][G_LENGTH]
    if length < tours_shortest:
        tours_shortest = length
        tours_best = tour

# determine the total number of nodes visited
nodes_visited = 0
for tour in tours:
    nodes_visited += len(tour)
for tour in non_tours:
    nodes_visited += len(tour)

# show stats
print('Shortest tour:            ', tours_best)
print('Shortest distance:        ', tours_shortest)
print('Number of complete tours: ', len(tours))
print('Number of aborted tours:  ', len(non_tours))
print('Number of nodes visited:  ', nodes_visited)
print('Elapsed time:             ', round(time_end - time_start, 3))