
It’s All Just Counts

The idea of a memory base is simple. We define a memory as a matrix and keep counts in its cells, with the names of things on the rows and columns. Let’s say we had a memory of me, called “Person: Manny”. If I query the row called “City: London” and ask for associated columns for “Carrier: ?”, I’d see “AA” and “BA”, for American Airlines and British Airways. Moreover, AA would be returned with a count of 6 and BA with a count of 1. Beyond the mere existence of my travel relationships, we can also see the strength of my travel habits, at least in the context of going to London.
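To make the matrix picture concrete, here is a minimal sketch of such a memory in Python. It is not the SaffronMemoryBase implementation or its API: the `Memory` class, its `observe` and `query` methods, and the “Hotel: Savoy” entry are all made up for illustration. Only the London carrier counts come from the example above.

```python
from collections import defaultdict


class Memory:
    """Toy associative memory: a sparse, symmetric matrix of counts."""

    def __init__(self, name):
        self.name = name
        # counts[row][column] -> how often the two names were seen together
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, row, column, n=1):
        """Record n co-occurrences of two names (stored in both directions)."""
        self.counts[row][column] += n
        self.counts[column][row] += n

    def query(self, row, category):
        """Return (column, count) pairs for one category, strongest first."""
        prefix = category + ": "
        hits = [(col, c) for col, c in self.counts[row].items()
                if col.startswith(prefix)]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)


manny = Memory("Person: Manny")
manny.observe("City: London", "Carrier: AA", 6)
manny.observe("City: London", "Carrier: BA", 1)
manny.observe("City: London", "Hotel: Savoy", 2)  # hypothetical extra column

result = manny.query("City: London", "Carrier")
print(result)  # [('Carrier: AA', 6), ('Carrier: BA', 1)]
```

The query never scans raw records. It reads one row of the matrix and filters the columns by category, which is the point of keeping the counts ready in advance.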

The idea is simple but fundamental. When we started Saffron and began working with one of the big intelligence agencies, one true believer in what we were doing would provoke others by saying, “It’s all just counts. What else is there?” What did he mean? When dealing with the analysis of massive data, so much of what is computed needs to be computed over counts. More deeply, information and knowledge are based on the frequencies of what we see in the world. Counts are fundamental to knowing what we know.

Let’s start with even the most basic statistic, the mean or “average”, computed as the sum of the observed values divided by the number of observations. An average is useful as a measure of expected value (although other measures can also be derived from counts). Knowing nothing else, you can have an expectation about how tall a person is or how often a business person travels. Meeting someone who is 7 feet tall is rare. A million-mile traveler is unusual. In general, these deviations from expectation are surprising, unusual, and potentially interesting. This simple measure, based on count frequencies, is fundamental to many forms of analysis.
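As a rough illustration of expectation and surprise computed purely from counts, the sketch below derives a mean and a standard deviation from a frequency table and scores how far a new observation falls from the expected value. The height counts are invented for the example, and the z-score is just one common way to express a deviation from expectation, not a method taken from Saffron.

```python
import math

# Hypothetical frequency table: height in inches -> number of people observed
height_counts = {64: 2, 66: 5, 68: 8, 70: 5, 72: 2}


def mean_from_counts(counts):
    """Expected value: sum of value * count divided by the total count."""
    total = sum(counts.values())
    return sum(value * n for value, n in counts.items()) / total


def std_from_counts(counts):
    """Population standard deviation computed from the same counts."""
    total = sum(counts.values())
    mu = mean_from_counts(counts)
    variance = sum(n * (value - mu) ** 2 for value, n in counts.items()) / total
    return math.sqrt(variance)


def z_score(value, counts):
    """How many standard deviations a value lies from the expectation."""
    return (value - mean_from_counts(counts)) / std_from_counts(counts)


expected = mean_from_counts(height_counts)   # 68.0 inches
typical = z_score(68, height_counts)         # exactly at the expectation
seven_feet = z_score(84, height_counts)      # far above it
print(expected, typical, round(seven_feet, 2))
```

With these made-up counts, someone 84 inches tall lands more than seven standard deviations above the mean, which is the kind of deviation worth flagging as unusual.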

Now let’s look at a more advanced but still fundamental measure, such as entropy, a measure of the disorder (or information) contained in a frequency distribution. If we define information formally, as in Information Theory, then counts are critical to this definition and its application. Think of the “analogies” query in the SaffronMemoryBase REST API. How do we find entities that are similar to a given entity? The entire method starts with getting a signature of the most informative attributes for the given entity. What is most informative? If the analysis were over a “bad guy” database and in fact all the persons are described as “gender: male”, then being male doesn’t tell us much. In fact, it holds zero information as defined by an entropy measure. But suppose the given entity was born in Timbuktu and only one other person was born in Timbuktu? Such a rare attribute is highly informative when reasoning by similarity.
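The two cases in this paragraph can be checked with the standard Information Theory definitions. The sketch below is only an illustration under assumed numbers: the database size of 1024 persons is invented, and nothing here is claimed to be how the “analogies” query is actually computed. It uses Shannon entropy for a whole distribution of counts and self-information (surprisal) for a single attribute value.

```python
import math


def entropy(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    return sum((c / total) * math.log2(total / c) for c in counts if c > 0)


def surprisal(count, total):
    """Self-information, in bits, of a value seen `count` times out of `total`."""
    return math.log2(total / count)


# Everyone in a hypothetical 1024-person database is "gender: male".
all_male = entropy([1024])

# A perfectly even two-way split is worth one bit.
even_split = entropy([512, 512])

# Only two of the 1024 persons were born in Timbuktu.
timbuktu = surprisal(2, 1024)

print(all_male, even_split, timbuktu)  # 0.0 1.0 9.0
```

An attribute shared by everyone contributes nothing to a signature, while one shared by two people out of 1024 carries nine bits, which is why rare attributes dominate when matching by similarity.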

This example describes a method for entity resolution: determining who might actually be whom, for instance to resolve name variants and aliases. Or you might want to find other bad guys with a similar signature. If you are a manufacturer, you might need to find similar parts, such as for part replacement strategies. For predictive analytics, the “classifications” method in the REST API is a form of nearest neighbor classification, which also reasons by similarity. As with “analogies”, the memory base supports a larger process of easily and quickly computing over connections and counts, but computing entropy is part of the process. Whether the classifications are for buying and selling, sniffing out whether transactions are suspicious or not, or deciding whether a product is desirable or not, the differentiating power of each attribute can be weighted by the distribution of its frequencies. If the counts of the attribute are distributed over the classifications according to chance, then the attribute does not help. To the degree an attribute is more associated with one class, it is informative in helping us make the classification.
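One standard way to turn “distributed according to chance” into a number is the mutual information between an attribute and the class label, computed from a table of counts. The sketch below uses that measure as a stand-in; the source does not say which weighting the “classifications” method applies, and the attribute values, class names and counts are all invented.

```python
import math


def mutual_information(table):
    """Mutual information in bits; table[attribute_value][class_label] = count."""
    total = sum(sum(row.values()) for row in table.values())
    row_totals = {a: sum(row.values()) for a, row in table.items()}
    col_totals = {}
    for row in table.values():
        for label, count in row.items():
            col_totals[label] = col_totals.get(label, 0) + count

    bits = 0.0
    for a, row in table.items():
        for label, count in row.items():
            if count > 0:
                # Ratio of the observed count to the count expected by chance
                ratio = count * total / (row_totals[a] * col_totals[label])
                bits += (count / total) * math.log2(ratio)
    return bits


# Hypothetical attribute spread over the classes exactly as chance predicts
by_chance = {
    "yes": {"suspicious": 10, "normal": 90},
    "no": {"suspicious": 10, "normal": 90},
}

# Hypothetical attribute strongly associated with the "suspicious" class
associated = {
    "yes": {"suspicious": 18, "normal": 2},
    "no": {"suspicious": 2, "normal": 178},
}

weight_chance = mutual_information(by_chance)
weight_associated = mutual_information(associated)
print(weight_chance, round(weight_associated, 3))
```

The first attribute gets a weight of zero because every cell matches what chance would produce; the second gets a clearly positive weight. Everything the calculation needs is already in the counts.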

The information in counts comes from data but is stored in a memory “above” the data. I recently heard someone in the intelligence community say, “Everything we want to compute begins with collecting the counts. We spend too much query time just to get these counts” (using map-reduce to compute over massive data at query time). So why not start with a count-based representation? Why waste time searching and scanning for counts when they can always be at the ready for more advanced analysis? The requirements for massive data analysis, whether for entity analysis or predictive analysis, are growing. SaffronMemoryBase is a massive connection engine as well as a massive correlation engine, both of which are fundamental to analysis. Connections and counts. What else is there?


This entry was posted on Thursday, November 12th, 2009 at 1:40 pm and is filed under Natural Intelligence, SaffronSierra.
