Creating and using decision trees

So, now that we’ve covered the basic usage of the library, I thought it time to show you how to create decision trees and use them to process data. I’d like to show two different variations: pattern-based and statistics-based algorithms. In the next installment, I’ll also show some ideas you can use to improve upon the different techniques. For source code, I’ll mostly be working with C# examples and occasionally also show you the translation to NNL.

About decision trees

Before we begin, perhaps a small word about decision trees and what you can use them for. A decision tree is, as the name says, a tree that allows you to make a decision about what to do with some input by matching that input against the tree. To construct the tree, you use previously recorded values together with their results; in other words, you use historical data to train your algorithm. So with a little crafting here and there, you can use decision trees to create self-learning algorithms.

A basic tree

Whether you are building statistically based decision trees or the simpler, pattern-based versions, the basis is always the same: a tree always starts with a root. Each data point in a single row of the historical data set (the training values) generates a single branch, determined by the value of the data point and starting from the branch of the previous value, or from the root. Finally, the last leaf points to the result value.
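To make that shape concrete, here is a minimal, library-independent sketch in plain Python: nested dicts stand in for the neurons and links, and the `RESULT` key is a made-up sentinel marking a leaf (none of this is the library's API).

```python
# Illustrative sketch only: the tree as nested dicts, not the library's API.
RESULT = "__result__"  # hypothetical sentinel key that marks a leaf

def add_row(tree, row, result):
    """Each value in the row grows a branch from the previous one (or the root);
    the last branch points to the result value."""
    node = tree
    for value in row:
        node = node.setdefault(value, {})  # reuse the branch if it already exists
    node[RESULT] = result

root = {}
add_row(root, ["male", "adult"], "group A")
add_row(root, ["male", "child"], "group B")
# both rows share the 'male' branch, which then forks into 'adult' and 'child'
```

Note how `setdefault` reuses an existing branch instead of creating a duplicate, which is exactly what gives the tree its shape.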

From patterns to statistics

The major difference between the two algorithms lies in how duplicate training data is handled: pattern-based matching can’t handle duplicates (or not properly), while the statistical version actually relies on them to work properly. The latter algorithm also keeps track of how many times each result value occurred during training, so it is able to return the ‘most encountered’ result. Even though at first glance this makes the statistical algorithm superior to the pattern-based one, there are use cases for both, and tons of variations on the same theme.

Sequence

There is one more important difference between the two techniques: the order of the data points within each row of the training data. For statistical approaches, the order is usually not very important; that is, the meaning of the data doesn’t change if you switch two values (the sequence can still play a role during the evaluation of new data, but more on that later). For instance, in Kaggle’s Titanic competition, it doesn’t matter whether you are first a male and then an adult or first an adult and then a male: you are both at the same time. Because of the ‘counting’ nature of the algorithm, the more data you train your system with, the better it becomes (there is a saturation point, though, past which accuracy no longer improves).

For pattern-based algorithms, on the other hand, the order of the values tends to be very important and should not be changed. For instance, while matching text, the order of the words needs to be (mostly) respected; otherwise it’s just another pattern.

Pattern matchers

Now that you have a basic idea of what decision trees are, let’s try to create one ourselves. We’ll start with a pattern-based decision tree, which is the easiest to create. In fact, the basic algorithm can be pretty simple. Our tree will be used to match text against a database of known statements (our patterns). The result of each statement is the output text that should be returned. In other words, we are going to create a very basic chatbot.

Why create a tree to match text instead of simply using a list of statements together with their outputs? There are many reasons. For one, scalability: the bigger the list gets, the slower it becomes. Decision trees also get a little slower when they grow huge, but not nearly as much. Furthermore, there are a lot of cool things you can do in the middle of a tree that are very hard or slow to do with regular string matching. More on that later.

Creation

To construct our tree from a set of known statements, we must first split each statement into smaller units that will form the branches. As we are working with sentences, there are two obvious candidates: split the sentence into letters or into words. While groups of letters are often used in n-grams, for this tutorial we’ll be using words, for which the library provides extensive support. We can use the ‘Tokenizer’ class (defined in its own library) to split the sentence into smaller groups: words, numbers and signs. We can then convert these shorter strings into neurons thanks to ‘BrainHelper.GetNeuronFor’. Using these two classes makes certain that all text is always converted to neurons in the same way, which in turn helps the grouping/clustering of neurons. Once you’ve got the neurons for the branches, it’s a matter of choosing a root object and finding/creating links. So here’s a small C# function to build a simple pattern matcher:

/// <summary>
/// Trains a decision tree.
/// </summary>
void Train(string value, string result, ref Neuron root)
{
   //make certain that there is a root object
   if (root == null)
   {
      root = NeuronFactory.GetNeuron();
      Brain.Current.Add(root);
   }
   //iCurrent keeps track of the current position in the tree.
   Neuron iCurrent = root;
   //when splitting the sentence, we don't want spaces, only the words
   string[] iWords = Tokenizer.Split(value, false);
   foreach (string i in iWords)
   {
      //get the neuron for the current text value.
      Neuron iRelationship = BrainHelper.GetNeuronFor(i);
      //perhaps the branch already exists?
      Neuron iNext = iCurrent.FindFirstOut(iRelationship.ID);
      if (iNext == null) //if it doesn’t exist, create the branch.
      {
         iNext = NeuronFactory.GetNeuron();
         Brain.Current.Add(iNext);
         Link.Create(iCurrent, iNext, iRelationship);
      }
      //the next new branch must start from the current one.
      iCurrent = iNext;
   }
   //get a neuron to represent the result value.
   Neuron iResult = BrainHelper.GetNeuronFor(result);
   //if the result was not yet recorded, add it as a leaf to the tree.
   //we use a predefined relationship meaning of 'PatternRule' so that we can
   //see that this is the end of the tree.
   if (iCurrent.FindFirstOut((ulong)PredefinedNeurons.PatternRule) == null)
      Link.Create(iCurrent, iResult, (ulong)PredefinedNeurons.PatternRule);
   else
      throw new InvalidOperationException();
}

And the same routine in NNL, which looks very similar to the C# version but is a bit shorter:

//declare a code cluster called 'train', which can be called like a function
cluster Train
{
   var Root;      //NNL doesn't know 'ref', but we can add a field to the function
   this(var value, var result)
   {
      //if there is no root yet, create it now. nnl can simply call 'new', which also adds to db
      if(count(root) == 0)
         root = new(Neuron);
      //icurrent keeps track of the current position in the tree. 
      var iCurrent = root;   
      //we can only attach 1 neuron to a link so create a cluster if need be.
      if(count(result) > 1)           
         result = MakeCluster(list, result);
      //although you can split a string with an instruction, it's usually better to do this before
      //pushing the data into the network. 'value' would then contain all the words instead of a single string.
      foreach(var i in SplitString(value))     
      {
         var iNext = GetFirstOut(iCurrent, i);     //perhaps the branch already exists?
         if(count(iNext) == 0)
         {
            iNext = new(neuron);
            AddLink(iCurrent, iNext, i);
         }
         iCurrent = iNext;
      }
      //only add the result if there is none yet at the current location. we use a predefined
      //relationship meaning of 'PatternRule' so that we can see that this is the end of the tree.
      if(ContainsLinksOut(iCurrent, Statics.PatternRule) == false)
         AddLink(iCurrent, result, Statics.PatternRule);
   }
}
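If you don’t have the library at hand, the same training logic can be sketched in plain Python: nested dicts stand in for neurons and links, `split()` is a crude stand-in for the Tokenizer, and `PATTERN_RULE` is a made-up stand-in for the PatternRule relationship.

```python
PATTERN_RULE = "__pattern_rule__"  # hypothetical stand-in for PredefinedNeurons.PatternRule

def train(root, value, result):
    """Walk/extend one branch per word, then attach the result as the leaf."""
    current = root
    for word in value.split():              # crude stand-in for Tokenizer.Split
        current = current.setdefault(word, {})
    if PATTERN_RULE in current:             # same guard as the C# version
        raise ValueError("this pattern already has a result")
    current[PATTERN_RULE] = result

root = {}
train(root, "what language do you speak", "I speak English.")
train(root, "what language", "Mostly words, numbers and signs.")
# both patterns share the 'what' -> 'language' branches
```

Just like the C# version, training the exact same pattern twice raises an error, because a pattern matcher has no sensible way to merge duplicate results.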

Usage

Once you have added a few statements to your decision tree, you can start using it to find matches. As this is a basic introduction, our decision tree will only support exact matching; that is, the input needs to be exactly the same as one of the previously trained values. Only the number of spaces between the words can differ, since we are skipping them. I’ll show you some tricks in the next post to make things more interesting.

The match code is very similar to the training code, except that we won’t be creating any new branches; we will only try to find existing ones and look for end points. Since we are doing a simple match without any variable or missing parts in the middle, we only need to traverse one branch until the end, which keeps the code pretty simple:

/// <summary>
/// Push the string through the tree and find the result.
/// </summary>
List<string> Match(string value, Neuron root)
{
   List<string> iRes = new List<string>();      //to store the result values
   Neuron iCurrent = root;                      //the current pos in the tree
   Neuron iRule;                                //any results that we find
   string[] iWords = Tokenizer.Split(value, false);
   foreach (string i in iWords)
   {
      //check if the path contains a result, if so, store.
      iRule = iCurrent.FindFirstOut((ulong)PredefinedNeurons.PatternRule);
      if (iRule != null)
         iRes.Add(BrainHelper.GetTextFrom(iRule));
      //get the neuron for the current text value.
      Neuron iRelationship = BrainHelper.GetNeuronFor(i);
      Neuron iNext = iCurrent.FindFirstOut(iRelationship.ID);
      //if we can't find the branch, we have reached the end of the match process
      if (iNext == null)
         return iRes;
      else
         iCurrent = iNext;
   }
   //1 last time to check at the very end of the path
   iRule = iCurrent.FindFirstOut((ulong)PredefinedNeurons.PatternRule);
   if (iRule != null)
      iRes.Add(BrainHelper.GetTextFrom(iRule));
   return iRes;
}

And in NNL again:

Match(var value, var root): var
{
   var iRes, iRule, iNext; 
   var iCur = root;
   foreach(var i in SplitString(value)) 
   {
      iRule = GetFirstOut(iCur, Statics.PatternRule);
      if(count(iRule) > 0)
         Add(iRes, iRule);
      iNext = GetFirstOut(iCur, i);
      if(count(iNext) == 0)
         return iRes;
      else
         iCur = iNext;
   }
   iRule = GetFirstOut(iCur, Statics.PatternRule);
   if(count(iRule) > 0)
      Add(iRes, iRule);
   return iRes;
}
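Continuing the nested-dict sketch from before (again, `PATTERN_RULE` is a made-up stand-in, not the library's API), the matching side looks like this in plain Python:

```python
PATTERN_RULE = "__pattern_rule__"  # hypothetical stand-in for PredefinedNeurons.PatternRule

def train(root, value, result):
    current = root
    for word in value.split():
        current = current.setdefault(word, {})
    current[PATTERN_RULE] = result

def match(root, value):
    """Collect every result found along the branch; stop when the branch ends."""
    results = []
    current = root
    for word in value.split():
        if PATTERN_RULE in current:        # a result on the path so far? store it.
            results.append(current[PATTERN_RULE])
        current = current.get(word)
        if current is None:                # no such branch: end of the match process
            return results
    if PATTERN_RULE in current:            # one last check at the very end of the path
        results.append(current[PATTERN_RULE])
    return results

root = {}
train(root, "hello there", "Hi! How can I help?")
print(match(root, "hello   there"))   # extra spaces don't matter, split() drops them
```

Note that, just like the C# `Match`, this collects a result for every trained prefix it passes along the way, not only for the full sentence.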

Statistical approaches

The previous pattern-matching algorithm was pretty simple and not all that powerful: all it can do is perform an exact word match. Once a certain input-output pair is learned, it can’t change anymore unless the previous result is overwritten. By changing the trainer so that it counts the number of times that the same result was learned for the same input, things already become a lot more interesting: the decision tree’s results become related to the number of times each result has been encountered, not just to the last or first one.

Creation

The structure of a statistical decision tree is almost identical to that of a pattern matcher. We only need to allow for multiple results on the same branch and store two more data points: the total number of input-output pairs that have been used to train the network, and the number of times each single input-output pair has been found. With these two values, we can calculate the percentage of times that a certain input-output pair was encountered in the tree. If we only use one tree, we don’t actually need the total number of learned items and can simply take the input-output pair with the highest count. But if you are using multiple trees that you want to test at the same time (forests), which can vary in the amount of training data, it can be useful to convert the counts to percentages. Also, sometimes you need to present the result as a percentage (like ‘70% of people like cola’).

The number of times that an input-output pair has been encountered can be stored as an IntNeuron in the ‘info’ section of the link, or as an extra step in the tree (make ‘PatternRule’ point to the result and use that result to point to the count; see the NNL example). The total number of learned inputs can be attached as an IntNeuron to the tree root. For the example, we take the code of the pattern-matching trainer and expand on it, which gives:

void Train(string value, string result, ref Neuron root)
{
   IntNeuron iTotal;          //stores the total number of learned values.
   if (root == null)
   {
      root = NeuronFactory.GetNeuron();
      Brain.Current.Add(root);
      iTotal = NeuronFactory.GetInt(1);
      Brain.Current.Add(iTotal);
      //use a textneuron 'total' to attach the value
      Link.Create(root, iTotal, TextNeuron.GetFor("total"));
   }
   else
   {
      iTotal = (IntNeuron)root.FindFirstOut(TextNeuron.GetFor("total").ID);
      iTotal.IncValue();      //thread-safe ++
   }
   Neuron iCurrent = root;
   string[] iWords = Tokenizer.Split(value, false);
   foreach (string i in iWords)
   {
      Neuron iRelationship = BrainHelper.GetNeuronFor(i);
      Neuron iNext = iCurrent.FindFirstOut(iRelationship.ID);
      if (iNext == null) //if it doesn’t exist, create the branch.
      {
         iNext = NeuronFactory.GetNeuron();
         Brain.Current.Add(iNext);
         Link.Create(iCurrent, iNext, iRelationship);
      }
      iCurrent = iNext;
   }
   Neuron iResult = BrainHelper.GetNeuronFor(result);
   Link iLink = Link.Find(iCurrent, iResult, Brain.Current[(ulong)PredefinedNeurons.PatternRule]);
   if (iLink == null)
   {
      iLink = Link.Create(iCurrent, iResult, (ulong)PredefinedNeurons.PatternRule);
      iTotal = NeuronFactory.GetInt(1);
      Brain.Current.Add(iTotal);
      using (IDListAccessor iInfo = iLink.InfoW)
         iInfo.Add(iTotal);
   }
   else
   {
      using (IDListAccessor iInfo = iLink.Info)
      {
         iTotal = Brain.Current[iInfo[0]] as IntNeuron;
         iTotal.IncValue();
      }
   }
}


This C# function is already a bit longer. In the NNL version, the input-output count is stored as an extra step in the tree, which looks like:

//declare a code cluster called 'train', which can be called like a function
cluster Train
{
   var Root;      //NNL doesn't know 'ref', but we can add a field to the function

   this(var value, var result)
   {
      if(count(root) == 0)
      {
         root = new(Neuron);
         var iTotal = new(IntNeuron, 1);
         AddLink(root, iTotal, 'total');
      }
      else
      {
         int iTotal = GetFirstOut(root, 'total');
         iTotal++;
      }
      var iCurrent = root;   
      //we can only attach 1 neuron to a link so create a cluster if need be.
      if(count(result) > 1)           
         result = MakeCluster(list, result);

      foreach(var i in SplitString(value))    
      {
         var iNext = GetFirstOut(iCurrent, i);     //perhaps the branch already exists? 
         if(count(iNext) == 0)
         {
            iNext = new(neuron);
            AddLink(iCurrent, iNext, i);
         }
         iCurrent = iNext;
      }
      //attach the result via 'PatternRule' if this particular result isn't recorded yet;
      //checking only for any PatternRule link would block every result after the first one.
      int iCount = GetFirstOut(iCurrent, result);
      if(count(iCount) == 0)
      {
         AddLink(iCurrent, result, Statics.PatternRule);
         iCount = new(IntNeuron, 1);
         AddLink(iCurrent, iCount, result);
      }
      else
         iCount++;
   }
}
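The counting variant translates to the dict sketch as well: the hypothetical `TOTAL` and `PATTERN_RULE` keys stand in for the ‘total’ link and the PatternRule relationship, and the leaf now stores a count per result instead of a single result.

```python
PATTERN_RULE = "__pattern_rule__"  # made-up stand-ins, not the library's API
TOTAL = "__total__"

def train(root, value, result):
    """Count each input-output pair, plus a grand total on the root."""
    root[TOTAL] = root.get(TOTAL, 0) + 1
    current = root
    for word in value.split():
        current = current.setdefault(word, {})
    counts = current.setdefault(PATTERN_RULE, {})   # allow multiple results per branch
    counts[result] = counts.get(result, 0) + 1

root = {}
for answer in ("yes", "yes", "no"):
    train(root, "will it rain", answer)
# 'yes' was seen 2 times out of 3 trained rows, so it would win with roughly 67%
```

Because the leaf holds a count per result, duplicate training data no longer causes an error; it is exactly what makes the tree statistical.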

Usage

Because the structure of the statistical decision tree is so similar to that of the pattern matcher, using it to find results is also very similar. The only difference from the pattern matcher is that, at the end, you look for the single result with the highest count and return that.

/// <summary>
/// Push the string through the tree and find the result.
/// </summary>
string Test(string value, Neuron root)
{
   Neuron iRes = null;                          //the current result
   int iMaxCount = 0;                           //the count of the current result.
   Neuron iCurrent = root;                      //the current pos in the tree
   string[] iWords = Tokenizer.Split(value, false);
   foreach (string i in iWords)
   {
      iRes = CollectResults(iRes, iCurrent, ref iMaxCount);
      //get the neuron for the current text value.
      Neuron iRelationship = BrainHelper.GetNeuronFor(i);
      Neuron iNext = iCurrent.FindFirstOut(iRelationship.ID);
      //if we can't find the branch, we have reached the end of the match process
      if (iNext == null)
         return BrainHelper.GetTextFrom(iRes);
      else
         iCurrent = iNext;
   }
   iRes = CollectResults(iRes, iCurrent, ref iMaxCount);
   return BrainHelper.GetTextFrom(iRes);
}

Neuron CollectResults(Neuron result, Neuron curPos, ref int count)
{
   using(LinksListAccessor iOut = curPos.LinksOut)
   {
      foreach(Link i in iOut)
      {
         if (i.MeaningID == (ulong)PredefinedNeurons.PatternRule)
         {
            using (IDListAccessor iInfo = i.Info)
            {
               IntNeuron iInt = (IntNeuron)Brain.Current[iInfo[0]];
               if (iInt.Value > count)
               {
                  result = i.To;
                  count = iInt.Value;
               }
            }
         }
      }
   }
   return result;
}


And again in NNL, but with the count attached to the end of the tree instead of stored in the link’s info. This example also shows a first use of the ‘split’ instruction: we let the system use the count as the weight of each result, which allows it to work out the result with the biggest weight/count automatically. This looks like:

//pushes the specified list of values through the decision tree and returns the result with the biggest weight.
//only call with ‘waitfor’ modifier.
Test(var values, var root): var
{
   var iCur, iNext; 
   //set up the call back and do iCur = root
   split(TestDone, ref(iCur), root);
   foreach(var i in SplitString(values)) 
   {
      GetResult(iCur);
      iNext = GetFirstOut(iCur, i);
      if(count(iNext) == 0)
         ExitSolve();
      else
         iCur = iNext;
   }
   GetResult(iCur);
   //don't need to return a value; this is done by 'TestDone', which is automatically
   //called when all split paths are done (only 1 in this case)
}

TestDone(): var
{
   return GetMaxWeight();
}

//adds all the results to the Result list
GetResult(var curPos)
{
   //get all the results for this stage.
   var iFound = GetOutgoing(curPos, Statics.PatternRule);  
   foreach(var i in iFound)
   {
      //assign the count of the result as its weight
      IncreaseWeightOf(i, GetFirstOut(curPos, i));
      //and add it as a splitresult. The one with the biggest weight will be the final result.
      AddSplitResult(i);
   }
}


Although this was a very short tutorial about a vast subject, I hope it has inspired you and made you think about trees. I find them an easy way to visualize algorithms internally, though sometimes a little help from the designer doesn’t hurt. In the next tutorial, I’ll try to present some ways to make the decision trees more flexible.
