Python

by Mike Thompson, Updated 4/16/2016

There's code for three Python programs here. The first is a little calculator that I wrote to get a better understanding of parsing computer languages. As a calculator it's just a toy, but it was interesting to write. The second is a decision tree machine learner that was my capstone project for the University of Washington Data Science certificate program. I've named it Dc2 because it's the second decision tree machine learner that I've written, and because there was once a nice airplane that went by a similar name. The third is a brute force solution to the Travelling Salesman Problem (TSP), the classic graph traversal problem.

The calculator runs on either a Python 2 or Python 3 interpreter, with just a couple minor changes that are noted in the source code. Dc2 and TSP were written for Python 3, but it should be only a little work to get them to run on Python 2 if so desired.


Calculator

This is a calculator, written in Python. You wouldn't pay much money for it - it only handles addition, subtraction, multiplication, division, and the use of parentheses to control the order of operations. No unary minus, no exponents, no statistical functions, no ...<the long list of things that a TI-84 calculator will do goes here>. But what it has done is give me a chance to explore some classic computer science problems while writing some Python code.

Parsing is one of those classic computer science problems, and I got to learn about grammars as a means of generating syntactically-correct sentences in a language like arithmetic expressions (or regular expressions, or Python, or Swahili). Then I got to learn about the types of computational machines (e.g. push-down automata) sufficient to parse them. "Introduction to Languages, Machines, and Logic" by Alan Parkes was a great read on these topics, and in it I learned a bit about the seminal work of Noam Chomsky in the field of linguistics, and of Alan Turing in the field of automata. I also got to explore Edsger Dijkstra's "shunting-yard algorithm" solution to the problem of converting an arithmetic expression from infix notation (the way we humans write 'em) to RPN (Reverse Polish Notation). I got to implement stacks with Python lists and play with Python's support for regular expressions, among other fun things.
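
For example, a plain Python list is all it takes to implement a stack, which is how every stack in the code below is built:

stack = []
stack.append('(')   # push
top = stack.pop()   # pop - returns '(' and removes it from the stack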

Within the main loop, the first section is the scanner, implemented by using Python's "re" regular expressions library. The scanner chops up the user input into tokens (numbers and operators) that are the building blocks of arithmetic expressions. It rejects any symbols that don't have meaning in the language of arithmetic expressions, like "@" or "_". The second section is the parser, which determines whether the series of tokens entered by the user constitutes a syntactically-correct arithmetic expression. Just as a good grammar checker would reject the sentence, "Go to a the store.", because it doesn't make sense to have both "a" and "the" preceding the word "store", an arithmetic expression parser should reject "2 + * 3".
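
As a quick illustration of the technique (a stripped-down sketch with a reduced token set, not the program's actual scanner, which appears in the full listing below):

import re

specs = [('NUMBER', r'\d+(\.\d*)?|\.\d+'), ('OP_ADD', r'[+]'),
         ('OP_MUL', r'[*]'), ('WHITESPACE', r'[ \t]')]
get_token = re.compile('|'.join('(?P<%s>%s)' % p for p in specs)).match

text = '2 + 3*4'
tokens = []
match = get_token(text)
while match is not None:
    if match.lastgroup != 'WHITESPACE':
        tokens.append((match.lastgroup, match.group()))
    match = get_token(text, match.end())
print(tokens)   # [('NUMBER', '2'), ('OP_ADD', '+'), ('NUMBER', '3'),
                #  ('OP_MUL', '*'), ('NUMBER', '4')]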

Given an arithmetic expression that contains no invalid symbols and is syntactically correct, the third section of the main loop evaluates it and returns the result. It first employs Dijkstra's shunting-yard algorithm to convert the arithmetic expression from infix notation (e.g. 2 + 2) to RPN (e.g. 2 2 +). If you've ever used an HP scientific calculator you're familiar with RPN, where you enter the operands before you enter the operator. The final step is to execute the RPN, the variable "a" representing the accumulator register.
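
As a worked example, 2 + 3 * 4 converts to the RPN sequence 2 3 4 * +. Here's a minimal sketch of evaluating it with a pure operand stack (the program below uses the accumulator-register variation instead):

rpn = [2.0, 3.0, 4.0, '*', '+']     # RPN form of 2 + 3 * 4
stack = []
for token in rpn:
    if isinstance(token, float):
        stack.append(token)         # operands go on the stack
    else:
        b, a = stack.pop(), stack.pop()
        stack.append(a * b if token == '*' else a + b)
print(stack[0])                     # 14.0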

Primitive though it is, writing this little calculator has provided a great excuse for diving into some classic computer science territory that I've wanted to explore for many years, but never got around to.

from __future__ import division  # to avoid floor division in Python 2.x
import re                        # regular expressions

in_debug_mode = False            # debug flag

def get_token_parse():
    global curr_token
   
    if input_expression:
        curr_token = input_expression.pop()
        return(True)
    else:
        return(False)   # no next char - we've hit the end of the input

#
#   Value <= NUMBER | '(' Expr ')'
#
def Value():
    if get_token_parse():
        if curr_token[TYPE] == 'NUMBER':
            return(True)
        elif curr_token[TYPE] == 'LPAREN':
            if Expr():
                if get_token_parse():
                    if curr_token[TYPE] == 'RPAREN':
                        return(True)
                    else:
                        return(False)   # no closing )
                else:
                    return(False)   # no closing )
            else:
                return(False)   # invalid expression
        else:
            input_expression.append(curr_token) # unexpected token: backtrack
            return(False)
    else:
        return(False)   # unexpected end of expression

#
#   Product <= Value (('*' | '/') Value)*
# 
def Product():
    if Value():
        if get_token_parse():
            while (curr_token[TYPE] == 'OP_MUL' or
                   curr_token[TYPE] == 'OP_DIV'):
                if not Value():
                    return(False)   # no Value following operator
                else:
                    if get_token_parse():  # possibly another * or /
                        continue
                    else:
                        return(True)    # we've hit the end of the input
            else:
                input_expression.append(curr_token) # Not * or /, so backtrack
                return(True)                # we've found a single Value
        else:
            return(True)    # we've hit the end of the input
    else:
        return(False)

#
#   Sum <= Product (('+' | '-') Product)*
#
def Sum():
    if Product():
        if get_token_parse():
            while (curr_token[TYPE] == 'OP_ADD' or
                   curr_token[TYPE] == 'OP_SUB'):
                if not Product():
                    return(False)   # no Product following operator
                else:
                    if get_token_parse():   # possibly another + or -
                        continue
                    else:
                        return(True)    # we've hit the end of the input
            else:
                input_expression.append(curr_token) # Not + or -, so backtrack
                return(True)                # we've found a single Product       
        else:
            return(True)    # we've hit the end of the input
    else:
        return(False)   # no Product found
  
#
#   Expr <= Sum
#
def Expr():
    return(Sum())


token_specs = [
        ('NUMBER',      r'\d+(\.\d*)?|\.\d+'),  # Integer or decimal number
        ('OP_ADD',      r'[+]'),                # +
        ('OP_SUB',      r'[\-]'),               # -
        ('OP_MUL',      r'[*]'),                # *
        ('OP_DIV',      r'[\/]'),               # /
        ('LPAREN',      r'[(]'),                # (
        ('RPAREN',      r'[)]'),                # )
        ('WHITESPACE',  r'[ \t]'),              # Skip spaces & tabs
]
token_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specs)
get_token_scan = re.compile(token_regex).match

TYPE = 0        # index to token type
VALUE = 1       # index to token value

rpn_expression = []     # Reverse Polish Notation version of the expression
op_stack = []           # Operator stack for Dijkstra's shunting-yard algorithm
op_precedence = {'+': 1, '-': 1, '*': 2, '/': 2}    # Op precedence lookup
eval_stack = []         # Stack for evaluating the RPN expression
a = None                # Accumulator for evaluating the RPN expression


while True:
    # If Python 2.x use: raw_input('Enter a...
    # If Python 3.x use: input('Enter a...
    user_input = input('Enter an arithmetic expression, or "Stop" to exit: ')
   
    if user_input.upper() == 'STOP':
        break
    #
    # Scan
    #
    input_expression = []
    curr_pos = 0
    match = get_token_scan(user_input)
    while match is not None:
        kind = match.lastgroup      # "kind" avoids shadowing built-in type()
        if kind != 'WHITESPACE':
            if kind == 'NUMBER':
                value = float(match.group(kind))
            else:
                value = match.group(kind)
            input_expression.append((kind, value))
            if in_debug_mode:
                print('Type: %s Value: %s' %(kind, value))
        curr_pos = match.end()
        match = get_token_scan(user_input, curr_pos)
        
    # If unable to tokenize all of the input, then it contains invalid chars
    if curr_pos != len(user_input):
        print('Unexpected character: %s' %(user_input[curr_pos]))
        break
    if in_debug_mode:
        print('Input: %s' %(input_expression))
    #
    # Parse
    #
    save_input_expression = input_expression[:]
    input_expression.reverse()      # pop to read & push to backtrack
    curr_token = []
    if Expr():
        if input_expression:
            # Reached halt state but input not consumed, e.g. "2+3 4"
            print('Extraneous input at ', input_expression[-1])
            break
    else:
        # Unable to reach halt state, e.g. "2+-" or "2+"
        if input_expression:
            print('Parsing error at ', input_expression[-1])
            break
        else:
            print('Parsing error: Unexpected end of expression.')
            break
    #
    # Evaluate
    #
    input_expression = save_input_expression
    for token in input_expression:
        if token[TYPE] == 'NUMBER':
            rpn_expression.append(token[VALUE])  
        elif token[TYPE] == 'LPAREN':
            op_stack.append(token[VALUE])
        elif token[TYPE] == 'RPAREN':
            while op_stack and op_stack[-1] != '(':
                rpn_expression.append(op_stack.pop())
            if not op_stack:
                print('Eval: Unmatched parenthesis (#1).')
                raise BufferError
            op_stack.pop()
        elif token[VALUE] in op_precedence:
            while op_stack and op_stack[-1] in op_precedence and \
                  op_precedence[token[VALUE]] <= op_precedence[op_stack[-1]]:
                rpn_expression.append(op_stack.pop())
            op_stack.append(token[VALUE])
        else:
            print('Eval: Token not recognized.')
            raise LookupError

    while op_stack:
        if op_stack[-1] == '(' or op_stack[-1] == ')':
            print('Eval: Unmatched parenthesis (#2).')
            raise BufferError
        rpn_expression.append(op_stack.pop())

    # So we can use "pop" to read RPN, and "push" to store the accumulator
    rpn_expression.reverse()
   
    while rpn_expression:
        if isinstance(rpn_expression[-1], int) or \
           isinstance(rpn_expression[-1], float):
            if a is not None:   # "if a" would wrongly skip a zero accumulator
                eval_stack.append(a)
            a = rpn_expression.pop()
        else:
            if rpn_expression[-1] == '+':
                a = eval_stack[-1] + a
            elif rpn_expression[-1] == '-':
                a = eval_stack[-1] - a
            elif rpn_expression[-1] == '*':
                a = eval_stack[-1] * a
            else:
                a = eval_stack[-1] / a
            eval_stack.pop()
            rpn_expression.pop()

    # If Python 2.x use: print('The answer is: %s' % round(a, 2))
    # If Python 3.x use: print('The answer is: ' + str(a))
    print('The answer is: ' + str(a))
    a = None    # reset the accumulator before the next expression
       
print('Bye!')

Dc2

My final project for the University of Washington Data Science Certificate Program was to write a decision tree machine learner in Python. That's a lot to do in a month, and my instructor advised me at the outset that I might be biting off more than I could chew, suggesting instead that I join one of the Kaggle project teams as a safer alternative. But I really wanted to write it, had been reading about and thinking about decision trees since spring of this year, and had found code snippets that showed how to solve some key parts of the problem. It was one of those "no guts, no glory" moments, and I decided to go for it! I did manage to finish it on time, and I named it Dc2.

How does it stack up against other popular decision tree machine learners? I wondered that myself, so I did the following. First I downloaded the Titanic training dataset from Kaggle, and then did a number of transformations on the data. I discretized the continuous attributes (e.g. Age, Fare), and binarized or threw out some attributes that constituted leaks (e.g. Cabin Number, Passenger ID, Name). Then I ran the data through both Dc2 and Weka's J48 decision tree machine learners. The results of the two runs were very close, both classifiers yielding models with just over 80% accuracy on the training data. So, on that dataset at least, Dc2's accuracy is close to that of a popular, big-name product.

You certainly wouldn't call it feature-rich, though. Here's a list of things it won't do that you might expect from a decision tree machine learner.

  1. No GUI - command line invocation only.
  2. Doesn't support continuous-valued attributes like weight or age, so continuous data must be binned into nominal values (e.g. Small/Medium/Large) before Dc2 sees it; a sketch of that kind of binning follows this list.
  3. No pruning or other sophisticated methods for balancing overfitting vs. underfitting.
  4. No ensemble methods like bagging or random forests.
  5. No cross-validation or other sophisticated methods for predicting how well the model will generalize.

Here's a list of what it will do.

  1. Read in a labeled dataset in .CSV format.
  2. Build a binary decision tree consisting of internal nodes containing attribute/value pairs for dataset splitting decisions, and of leaves containing class values.
  3. Save the decision tree to a file, and re-load it for the post-tree-building processing of test or production datasets.
  4. Generate a text representation of the decision tree and statistics on the number/type of nodes in the tree.
  5. Classify a dataset (training, test, or production), appending a column containing the predicted class values, and then write the results to a file.
  6. Generate statistics on the results of a classification run, including accuracy and a confusion matrix.
  7. Accept command line parameters for specifying: the name of the input/output files, whether a tree-building or classification run (or both) should be performed, the values of two parameters that can be used to avoid overfitting, and activating a debug mode to support the inclusion of print statements for debugging.

The one required command line parameter is filestem, which represents the root of the input and output files used by Dc2. For example, if "Titanic" is specified for the filestem parameter, Dc2 will read the input dataset from Titanic.csv, will save/re-load the decision tree to/from Titanic.pic, will write the results of the classification step to TitanicOut.csv, and will write a human-readable rendition of the decision tree to TitanicTree.csv. Specifying the -h command line parameter will cause the following help text to be displayed. (At your command prompt type, "python Dc2.py -h".)

usage: Dc2.py [-h] [-i I] [-w W] [-b] [-c] [-d] filestem

Dc2.py - learn a decision tree from input data and/or categorize the data

positional arguments:
  filestem    filename base for input and output files

optional arguments:
  -h, --help  show this help message and exit
  -i I        threshold info gain for decision node (default=0.02)
  -w W        threshold weight for decision node (default=0)
  -b          build decision tree from input data (True if (Not -c))
  -c          categorize input data using decision tree (True if (Not -b))
  -d          turn on debug messages

"""
    Dc2.py - A decision tree classifier written by Mike Thompson in 2015
    
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License, Version 3, which
    can be found at http://www.gnu.org/licenses. This program is distributed
    in the hope that it will be useful, but without any warranty expressed
    or implied, including the implied warranties of merchantability or
    fitness for a particular purpose.

    Acknowledgements
    ----------------
    1. "C4.5: Programs for Machine Learning" by Ross Quinlan (1993). A
         classic, covering all aspects of decision trees, and a pleasure
         to read.
    2. "Machine Learning in Action" by Peter Harrington (2012). An excellent
        do-it-yourself-in-Python, containing sample code covering a wide
        variety of data science topics.
    3. "Data Structures & Algorithms in Python" by Michael Goodrich, Roberto
        Tamassia, and Michael Goldwasser (2013). Chapter 8 on trees was
        invaluable.
    
    This program was written in Python 3, specifically the Anaconda 2.3
    distribution of Python 3.4.3.
    
    Rev     Date        Remarks
    ----    ----        -------
    1.0     12/07/2015  And away we go!
    1.01    01/08/2016  Write decision tree to .CSV file.
    1.02    02/04/2016  Compute accuracy both by case and by weight.
"""

import argparse
from math import log
import operator
import csv
import pickle
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

class Btree:
    """
    I think that I shall never see a poem lovely as a tree.
    A tree whose hungry mouth is prest
      against the earth's sweet flowing breast;
    A tree that looks at God all day, and lifts her leafy arms to pray;
    A tree that may in summer wear a nest of robins in her hair;
    Upon whose bosom snow has lain; who intimately lives with rain.
    Poems are made by fools like me, but only God can make a tree.
        -- Joyce Kilmer (1913)
    """
    def __init__(self):
        """Create an initially-empty binary tree."""
        self._root = None
        self._size = 0    
    
    def get_root(self):
        return self._root
        
    def set_root(self, node):
        self._root = node
        
    def get_size(self):
        return self._size
        
    def set_size(self, size):
        self._size = size
        
    def is_empty(self):
        """Return True if the tree is empty."""
        return self.get_root() is None
        
    def is_leaf(self, n):
        """
        Return True if node does not have any children, else False.
        """
        return self.num_children(n) == 0
        
    def num_children(self, n):
        """
        Return node's number of children (0, 1, or 2).
        """
        count = 0
        if n[NODE_LEFT] is not None:     # left child exists
            count += 1
        if n[NODE_RIGHT] is not None:    # right child exists
            count += 1
        return count

    def children(self, n):
        """Generate an iteration of node's children."""
        if n[NODE_LEFT] is not None:
            yield n[NODE_LEFT]
        if n[NODE_RIGHT] is not None:
            yield n[NODE_RIGHT]
                
    def preorder(self):
        """Generate a preorder iteration of all the nodes in the tree."""
        if not self.is_empty():
            for n in self.subtree_preorder(self.get_root()):
                yield n

    def subtree_preorder(self, n):
        """Generate a preorder iteration of nodes of subtree rooted at n."""
        yield n                     # visit n before its subtrees
        for c in self.children(n):                   # for each child, c
            for other in self.subtree_preorder(c):  # preorder of c's subtree
                yield other 
        
    def add_root(self):
        """
        Create and return the root of an empty tree.
        Raise ValueError if tree nonempty.
        """
        if self.get_root() is not None:
            raise ValueError("Root already exists")
        new = self.add_node()
        self.set_root(new)
        self.set_size(1)
        return new    
    
    def add_node(self, attr=None, value=None, 
                 parent=None, left=None, right=None):
        """
        Create and return a new node.
        """
        new = [parent, left, right, attr, value]
        self._size += 1
        return new      

# Offsets to node elements
NODE_PARENT = 0         # index to parent
NODE_LEFT = 1           # index to left child
NODE_RIGHT = 2          # index to right child
NODE_ATTR = 3           # index to attr column num or, if leaf, None
NODE_VALUE = 4          # index to attr value or, if leaf, class

def get_dataset():
    """
    Read input dataset in .CSV (Comma Separated Values) format, in
    which the first row contains column labels.
    """
    labels = []
    dataset = []
    with open(filestem + ".csv") as csv_input_file:
        csv_reader = csv.reader(csv_input_file, dialect="excel")
        labels = next(csv_reader)
        for row in csv_reader:
            dataset.append(row)
    labels = labels[: len(dataset[0]) - 2]  # ditch labels for weight & class
    for row in dataset:
        row[-2] = float(row[-2])    # convert weight from string to numeric
    return dataset, labels

def calc_entropy(dataset):
    """
    Calculate the entropy of a data set in which the next-to-last column
    contains weights and the last column contains the class values.
    """
    label_weights = {}
    total_weight = 0.0
    for row in dataset:
        current_label = row[-1]
        if current_label not in label_weights.keys():
            label_weights[current_label] = 0
        label_weights[current_label] += row[-2]
        total_weight += row[-2]
    shannon_entropy = 0.0
    for key in label_weights:
        prob = label_weights[key] / total_weight
        shannon_entropy -= prob * log(prob,2) #log base 2
    return shannon_entropy
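
#
# A quick sanity check (rows here are hypothetical [attr, weight, class]):
#   calc_entropy([['x', 1.0, 'yes'], ['x', 1.0, 'no']])   returns 1.0,
#   since an even two-class split carries a full bit of entropy, while
#   calc_entropy([['x', 1.0, 'yes'], ['x', 1.0, 'yes']])  returns 0.0,
#   since a pure set carries none.
#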

def best_attr_val_for_split(dataset):
    """
    Return the column index of the attribute, and the value of that
    attribute, that will yield the greatest information gain.
    """
    num_attrs = len(dataset[0]) - 2    # ignore last two columns
    base_entropy = calc_entropy(dataset)
    best_info_gain = 0.0
    best_attr = -1
    best_value = None
    for i in range(num_attrs):    # iterate over all the attribute columns
        value_list = [row[i] for row in dataset]        # all attr's values
        unique_values = sorted(list(set(value_list)))   # eliminate duplicates
        for value in unique_values:
            new_entropy = 0.0
            sub_dataset_match = []
            sub_dataset_no_match = []
            for row in dataset:
                if row[i] == value:
                    sub_dataset_match.append(row)
                else:
                    sub_dataset_no_match.append(row)
            if len(sub_dataset_match) != 0:
                prob = sum(row[-2] for row in sub_dataset_match) \
                        / sum(row[-2] for row in dataset)
                new_entropy += prob * calc_entropy(sub_dataset_match)
            if len(sub_dataset_no_match) != 0:
                prob = sum(row[-2] for row in sub_dataset_no_match) \
                        / sum(row[-2] for row in dataset)
                new_entropy += prob * calc_entropy(sub_dataset_no_match)
            info_gain = base_entropy - new_entropy
            if (info_gain > best_info_gain):    # best info gain so far?
                best_info_gain = info_gain 
                best_attr = i
                best_value = value
    return best_attr, best_value, best_info_gain
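
#
# A hypothetical worked example: given rows [['red', 1.0, 'yes'],
# ['red', 1.0, 'yes'], ['blue', 1.0, 'no']], the base entropy is about
# 0.918 bits, and splitting column 0 on 'blue' separates the classes
# perfectly, so the function returns (0, 'blue') with an information
# gain of about 0.918.
#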

def majority_count(dataset):
    """
    Determine a dataset's most common class
    """
    class_count={}
    for row in dataset:
        if row[-1] not in class_count.keys():
            class_count[row[-1]] = 0.0
        class_count[row[-1]] += row[-2]
    sorted_class_count = sorted(class_count.items(),
                                key=operator.itemgetter(1),
                                reverse=True)
    return sorted_class_count[0][0]

def create_tree(n, dataset, labels):
    """
    Recursively create a decision tree given a node, a data set and
    a list of attribute column labels.
    """
    class_list = [row[-1] for row in dataset]
    if class_list.count(class_list[0]) == len(class_list):
        n[NODE_ATTR] = None
        n[NODE_VALUE] = class_list[0]   # a leaf when all classes are equal
        return
    best_attr, best_value, best_info_gain = best_attr_val_for_split(dataset)
    if best_info_gain < minimum_info_gain or \
       sum(row[-2] for row in dataset) < minimum_weight:
        n[NODE_ATTR] = None
        n[NODE_VALUE] = majority_count(dataset)   # a leaf when no progress
        return
    n[NODE_ATTR] = best_attr
    n[NODE_VALUE] = best_value
    sub_dataset_match = []
    sub_dataset_no_match = []
    for row in dataset:
        if row[best_attr] == best_value:
            sub_dataset_match.append(row)
        else:
            sub_dataset_no_match.append(row)
    if len(sub_dataset_match) != 0:
        left_child = my_tree.add_node(None, None, n)
        n[NODE_LEFT] = left_child
        create_tree(left_child, sub_dataset_match, labels)
    if len(sub_dataset_no_match) != 0:
        right_child = my_tree.add_node(None, None, n)
        n[NODE_RIGHT] = right_child
        create_tree(right_child, sub_dataset_no_match, labels)
    return                            

def classify(n, row):
    if my_tree.is_leaf(n):
        return n[NODE_VALUE]
    if row[n[NODE_ATTR]] == n[NODE_VALUE]:
        return classify(n[NODE_LEFT], row)
    else:
        return classify(n[NODE_RIGHT], row)

def print_tree(n):
    global depth
    global csv_writer
    
    if my_tree.is_leaf(n):
        csv_writer.writerow([" "] * depth + 
                            ["class=" + str(n[NODE_VALUE])])               
        print(depth * indent, 
              "class =", n[NODE_VALUE])
        depth -= 1
        return
    else:
        csv_writer.writerow([" "] * depth +
                            [str(my_data_labels[n[NODE_ATTR]]) + 
                             "=" +
                             str(n[NODE_VALUE])
                            ])
        print(depth * indent,
              my_data_labels[n[NODE_ATTR]], 
              "=",
              n[NODE_VALUE])
        depth += 1
        print_tree(n[NODE_LEFT])
        csv_writer.writerow([" "] * depth + 
                            ["else"])
        print(depth * indent,
              "else")
        depth += 1
        print_tree(n[NODE_RIGHT])
        depth -= 1
        return
  
        
"""
Parse command line arguments. Run-time help text is provided for the user
via the -h argument, as found in the code below. 

Sample Command Line         Action Taken
-------------------         ------------
python Dc2.py MyFile        Build decision tree from MyFile.csv, 
                            saving the tree to MyFile.pic. Then
                            re-categorize input, writing output to 
                            MyFileOut.csv, and a text representation
                            of the tree to MyFileTree.csv.
python Dc2.py MyFile -c     Categorize MyFile.csv using decision tree
                            stored in MyFile.pic (which was created on
                            a previous tree-building run). Write output to
                            MyFileOut.csv, and a text representation
                            of the tree to MyFileTree.csv.                                            
"""
parser = argparse.ArgumentParser(description="Dc2.py - learn a decision tree \
          from input data and/or categorize the data")
parser.add_argument("filestem",
                    help="filename base for input and output files")
parser.add_argument("-i",
                    help="threshold info gain for decision node (default=0)",
                    type=float)
parser.add_argument("-w",
                    help="threshold weight for decision node (default=0)",
                    type=float)
parser.add_argument("-b",
                    help="build decision tree from input data \
                    (True if (Not -c))", action="store_true")
parser.add_argument("-c",
                    help="categorize input data using decision tree \
                    (True if (Not -b))", action="store_true")                   
parser.add_argument("-d",
                    help="turn on debug messages", action="store_true")              
args = parser.parse_args()   
    
filestem = args.filestem
minimum_info_gain = 0.02 if args.i is None else args.i
minimum_weight = 0.0 if args.w is None else args.w
build_tree = args.b or not args.c       # build unless only -c was given
categorize = args.c or not args.b       # categorize unless only -b was given
debug_mode = args.d

indent = "   "      # tree printing - num spaces for indenting
depth = 0           # tree printing - indentation depth

"""
OK, so let's get on with it!
"""
if build_tree:
    """
    Build a decision tree from the dataset
    """
    my_data, my_data_labels = get_dataset()
    my_tree = Btree()
    root = my_tree.add_root()
    create_tree(root, my_data, my_data_labels)  # create the decision tree
    fw = open(filestem + ".pic", "wb")          # pickle the decision tree
    pickle.dump(my_tree,fw)
    fw.close()
    """
    Display run parameters and a text representation of the decision tree
    """
    print()
    print("======>> Decision Tree")
    print()
    print("Input parameters: Minimum info gain =", minimum_info_gain,
          "  Minimum weight =", minimum_weight)
    print()
    root = my_tree.get_root()   # print tree in if/else format
    with open(filestem + "Tree.csv", "w", newline='') as csv_output_file:
        csv_writer = csv.writer(csv_output_file, quoting=csv.QUOTE_MINIMAL)
        print_tree(root)
    print()
    num_nodes = 0
    num_leaves = 0
    for node in my_tree.preorder(): # traverse tree and count nodes
        num_nodes += 1
        if my_tree.is_leaf(node):
            num_leaves += 1
    print("Tree structure: Total nodes =", num_nodes,
          " Internal nodes =", num_nodes - num_leaves,
          " Leaves =", num_leaves)
    print()
 
if categorize:
    """
    Classify the dataset based on the decision tree
    """
    my_data, my_data_labels = get_dataset() # read input data from .csv file
    fr = open(filestem + ".pic", "rb")      # un-pickle the decision tree
    my_tree = pickle.load(fr)
    root = my_tree.get_root()               # point at the root node
    for row in my_data:                     # classify the data via the tree,
        row.append(classify(root, row))     #  appending a new class column
    # write the output in .csv format to filestemOut.csv
    with open(filestem + "Out.csv", "w") as csv_output_file:
        csv_writer = csv.writer(csv_output_file, dialect="excel")
        csv_writer.writerow(my_data_labels + ["WEIGHT", "CLASS", "PREDICTED"])
        for row in my_data:
            csv_writer.writerow(row)
    """
  Display statistics
  """
    my_data_predicted = []
    my_data_actual = []
    for x in my_data:
        my_data_predicted.append(x[-1])
        my_data_actual.append(x[-2])
    """
    Display a confusion matrix if there are ten or fewer class values
    """
    if len(set(my_data_actual)) <= 10:
        print("")
        print("======>> Confusion Matrix (columns show predicted values, " +
              "rows show actual)")
        print("")
        cm = confusion_matrix(my_data_actual, my_data_predicted)
        cm_s = [[str(e) for e in row] for row in cm]    # make type=string
        # add column and row labels to the confusion matrix
        ucase_alpha = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
        column_labels = [""]
        count = 0
        for row in cm_s:
            row.insert(0, ucase_alpha[count])
            column_labels.append(ucase_alpha[count])
            count += 1
        cm_s.insert(0, column_labels)
        # find the max element length for each column
        max_len = [max(map(len, col)) for col in zip(*cm_s)]
        # build format string
        fmt = "\t".join("{{:{}}}".format(x) for x in max_len)
        # apply format to each row in the confusion matrix and print
        table = [fmt.format(*row) for row in cm_s]
        print("\n".join(table))
        print("")
        # print key to row/column labels
        unique_values = sorted(list(set(my_data_actual)))
        count=0
        for x in unique_values:
            print(ucase_alpha[count], " = ", unique_values[count])
            count += 1
    """
    Display accuracy
    """
    print()
    print("======>> Classification statistics")
    print()
    a_percent = accuracy_score(my_data_actual, my_data_predicted)
    a_quantity = accuracy_score(my_data_actual, my_data_predicted,
                                normalize=False)
    print(a_quantity, " cases correctly predicted")
    print(len(my_data_actual) - a_quantity, " incorrectly predicted")
    
    print("Accuracy (by case)   = {0:.4f}".format(a_percent))
    print("Accuracy (by weight) = {0:.4f}".format
           (by_weight_correct / by_weight_total))

print("")    
print("======>> END")

Travelling Salesman Problem - Graph Traversal

This is a brute force solution to the TSP. This classic problem is a difficult one for a computer to solve, and mathematicians have been searching for better-than-brute-force methods, with limited success, for well over a century. As the number of nodes increases, the number of possible paths grows factorially. No fancy math here - this algorithm explores every possible path through a graph that begins and ends at the starting point, choosing the path that covers the shortest distance.
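
To put a number on that growth: in the idealized case of a complete graph where a tour visits each of the other n-1 nodes exactly once, there are (n-1)! candidate tours from a fixed starting node. A quick sketch:

from math import factorial

for n in (5, 10, 15):
    print(n, factorial(n - 1))  # 24, then 362880, then 87178291200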

'''
Let's have a little graph traversal fun with the TSP (Traveling Salesman
Problem)!
 
This program investigates all the possible ways to traverse a graph,
starting and ending at the same point, and visiting all of the nodes.
It's capable of navigating graphs with dead end nodes - nodes from which
you must back up before continuing to navigate through the rest of the
graph.

Below lies the GGB (Great Graph Boneyard). Pick one of these and use it
as the "Graph du Jour", or else build a graph of your own design. A
digraph (directed graph) is a collection of edges, each edge consisting
of two nodes connected by a path that leads from the first node to the
second. For example, "['A','B',3]" represents a path from node A to node
B of length 3.

# line segment AB
graph = [['A','B',3],
         ['B','A',3]]

# line segment ABC
graph = [['A','B',3],
         ['B','A',3],
         ['B','C',4],
         ['C','B',4]]

# right triangle, but with no path leading from C back to the start at A
graph = [['A','B',3],
         ['A','C',4],
         ['B','A',3],  
         ['B','C',2],
         ['C','B',2]]

# equilateral triangle, plus one dead-end leg between C and D
graph = [['A','B',2],
         ['A','C',2],
         ['B','A',2],  
         ['B','C',2],
         ['C','A',2],
         ['C','B',2],
         ['C','D',5],
         ['D','C',5]]

# triangle with a long (windy mountain road?) leg between B and C
graph = [['A','B',3],
         ['A','C',2],
         ['B','A',3],  
         ['B','C',20],
         ['C','A',2],
         ['C','B',20]]
         
# 5 nodes, with all nodes connected to all the other nodes by edges
#
# This one runs for about 30 seconds on my laptop, demonstrating the
# complexity of the TSP for even small numbers of nodes. The output is...
#       Shortest tour:  ['A', 'B', 'C', 'D', 'E', 'A']
#       Shortest distance:  5
#       Number of complete tours:  1,015,968
#       Number of aborted tours:  4,115,212
#       Number of nodes visited:  82,538,496
#       Elapsed time:  28.828
#
graph = [['A','B',1],
         ['A','C',1],
         ['A','D',1],
         ['A','E',1],
         ['B','A',1],
         ['B','C',1],
         ['B','D',1],
         ['B','E',1],
         ['C','A',1],
         ['C','B',1],
         ['C','D',1],
         ['C','E',1],
         ['D','A',1],
         ['D','B',1],
         ['D','C',1],
         ['D','E',1],
         ['E','A',1],
         ['E','B',1],
         ['E','C',1],
         ['E','D',1]]
''' 

import time

def traverse(node, log):
    global graph, start, tours, non_tours
    
    log.append(node)                        # note that we came to this node
    if (node == start and len(log) > 1):    # have returned to the start?
        if set(log) == set([edge[G_FROM] for edge in graph]):
            tours.append(log)   # have visited all nodes, and are
            return              #  at the end of a complete tour
    for next_node in [edge[G_TO] for edge in graph if edge[G_FROM] == node]:
        # have previously traversed this edge in this direction?
        been_there_done_that = False
        for i in range(0, len(log) - 1):
            if (log[i] == node and log[i + 1] == next_node):
                been_there_done_that = True
                non_tours.append(log)
        if not been_there_done_that:    # don't re-trace your steps
            next_log = log[:]
            traverse(next_node, next_log)

# Graph du Jour:
#
# right triangle, but with no path leading from C back to A
graph = [['A','B',3],
         ['A','C',4],
         ['B','A',3],  
         ['B','C',2],
         ['C','B',2]]
        
# Offsets to edge elements: (1) from node, (2) to node, (3) distance
G_FROM = 0
G_TO = 1
G_LENGTH = 2

tours = []          # the list of complete tours
non_tours = []      # the list of failed attempts          
           
start = graph[0][G_FROM]    # first FROM node in graph is starting point
log = []
time_start = time.time()
traverse(start, log)        # kick off all the possible traversals
time_end = time.time()

# determine which of all the successful tours was the shortest
tours_shortest = sum([edge[G_LENGTH] for edge in graph]) + 1    # too big
tours_best = []
for tour in tours:
    length = 0
    for i in range(0, len(tour) - 1):
        for j in range(0, len(graph)):
            if (tour[i] == graph[j][G_FROM] and
               tour[i + 1] == graph[j][G_TO]):
                length += graph[j][G_LENGTH]
    if length < tours_shortest:
        tours_shortest = length
        tours_best = tour
        
# determine the total number of nodes visited
nodes_visited = 0
for tour in tours:
    nodes_visited += len(tour)
for tour in non_tours:
    nodes_visited += len(tour)

# show stats
print('Shortest tour: ', tours_best)
print('Shortest distance: ', tours_shortest)
print('Number of complete tours: ', len(tours))
print('Number of aborted tours: ', len(non_tours))
print('Number of nodes visited: ', nodes_visited)
print('Elapsed time: ', round(time_end - time_start, 3))