The Need for Speed (Especially At Query Time)

It should be no surprise that there is growing interest in data analytics and a rise of “post-relational” analytic data stores. Traditional databases work well for loading data in, but getting answers out is becoming problematic as scale and complexity grow. This is the problem of Big Data, defined by Adam Jacobs in a recent ACM Queue article, “Pathologies of Big Data”: Big Data is “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time.” Given our growing data tsunami, many people are looking beyond the tried-and-true methods of relational databases. Billions of rows can easily be stored, but Jacobs describes the pathologies of query scalability, even when asking for basic counts!

You can store data all day, but the value of data is in its query: in its exploitation. I remember first hearing this word, “exploitation”, from the leader of a UK intelligence agency on one of my trips to London. The word struck me and has remained with me ever since. Whether for government or commercial interests, the time to analyze and act is what matters. Exploitation time is what matters. Furthermore, the number of queries is increasing faster than the data tsunami itself. Query time is what matters because response time is most critical, and critical times drive the massive load of many simultaneous requests.

The weakness of transactional databases in supporting fast and flexible analysis is now widely accepted. A plethora of emerging approaches includes multi-dimensional star schemas and column-oriented, in-memory, and value-based stores, to mention a few, as well as associative memories from Saffron. However, if query speed is the requirement for analysis, and analysis requires counting things at its core (see “It’s All Just Counts”), then the associative memory approach will be fastest to query. All the connections, along with their frequencies and frequency distributions, are already counted and ready for analysis.
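To make the idea of pre-counting concrete, here is a minimal sketch in Python. It is not Saffron's implementation; the `CountStore` class and its method names are hypothetical, chosen only to illustrate keeping co-occurrence counts up to date as data arrives so that a query becomes a lookup rather than a scan.

```python
from collections import Counter, defaultdict
from itertools import combinations


class CountStore:
    """Hypothetical pre-counted store (an illustration, not Saffron's design)."""

    def __init__(self):
        self.item_counts = Counter()             # how often each item was seen
        self.pair_counts = defaultdict(Counter)  # item -> co-occurring item -> count

    def observe(self, items):
        """Update all counts at write time, once per incoming record."""
        unique = sorted(set(items))
        self.item_counts.update(unique)
        for a, b in combinations(unique, 2):
            self.pair_counts[a][b] += 1
            self.pair_counts[b][a] += 1

    def count(self, a, b=None):
        """Answer a count query by lookup, without scanning any records."""
        if b is None:
            return self.item_counts[a]
        return self.pair_counts[a][b] if a in self.pair_counts else 0

    def distribution(self, a):
        """Return the items connected to `a`, most frequent first."""
        return self.pair_counts[a].most_common() if a in self.pair_counts else []


store = CountStore()
store.observe(["london", "agency", "report"])
store.observe(["london", "report"])
print(store.count("london", "report"))
print(store.distribution("london"))
```

The trade-off in this sketch is that every write does a little more work (one increment per pair in the record) so that no query has to touch the raw data.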

Because Hadoop, the massive data store derived from Google’s data store, is so well known, it serves as a good case in point. Hadoop and its MapReduce architecture are great for what they were intended to do, such as in Google’s search engine. On the other hand, Hadoop is now being pushed toward almost every Big Data application without an understanding of its limits. For example, Hadoop should not be used as a DBMS. A recent presentation, “The Big Data Revolution”, by Mike Stonebraker of Vertica with Mike Olson of Cloudera makes this point, positioning Hadoop for ETL but not as a DBMS. One report from Google of Hadoop’s use as a DBMS is criticized by Stonebraker as a “head fake”, a move without real effect. Google’s own earlier research report, the Sawzall Report in 2005, comments on how Hadoop is inappropriate for SQL and operations such as joins. But attempts to use Hadoop for any Big Data problem continue, and the debate goes on.

So how about storing counts in Hadoop? The JASONs, a scientific advisory group to the US Government, wrote a recently declassified report, “Data Analysis Challenges”, which comments on how Hadoop is fundamentally batch-oriented and non-transactional. In consequence, the JASONs mark it as inappropriate for data updates. If Hadoop is ill-advised for updating data, it is even worse for updating counts. By definition, counts are always being updated. If data arrives all the time and memory is represented in terms of counts, then a more transactional store is required. Our brains observe the world and update memories all the time, in real time.

Another approach using Hadoop attempts to run massively parallel MapReduce algorithms at query time. In other words, given a massively distributed data store, analysis is provided by running parallel and distributed data mining algorithms over the data store. This sounds great at first blush. But as I recently heard from people with direct experience, the first step of many (if not most) data mining algorithms is to scan for the frequency counts. This scanning wastes time during query, precisely when time matters most. Not only does it take time to count by scanning over all the data, but the dependence of algorithms on these counts before other computations can begin also weakens the parallelism of MapReduce. Several passes are required, each depending on the one before. Cloudera uses Hadoop for machine learning and data mining, but even Mike Olson admits, “Hadoop is a batch-oriented tool” and can take minutes if not hours to compute complex queries.
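The sequential dependency described above can be sketched with a simple Apriori-style frequent-pair search. This is an illustrative stand-in for “a data mining algorithm”, not one named in the presentations cited here, and the function names and `min_support` parameter are assumptions made for the example. The point is only that the second pass cannot start until the first full scan has finished.

```python
from collections import Counter


def count_by_scanning(records):
    """Pass 1: a full scan just to get per-item frequency counts."""
    counts = Counter()
    for record in records:
        counts.update(set(record))
    return counts


def frequent_pairs_by_scanning(records, min_support):
    """Find item pairs seen together in at least `min_support` records."""
    # Pass 1 must complete before pass 2 can begin: pass 2 needs the counts.
    item_counts = count_by_scanning(records)
    frequent = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: scan every record again, counting pairs of frequent items only.
    pair_counts = Counter()
    for record in records:
        kept = sorted(set(record) & frequent)
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


records = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
print(frequent_pairs_by_scanning(records, min_support=2))
```

Each pass can be parallelized across the data on its own, but the passes themselves run one after another, and both happen at query time. With counts maintained ahead of time, the first pass would be a lookup.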

The future of “Hybrid Data Centers” is also well articulated by all presenters of “The Big Data Revolution”. No one representation will do everything. Each approach to Big Data has its strengths and weaknesses. To be clear, Saffron, like Hadoop, is not appropriate as a DBMS. We joke about ourselves, saying, “Don’t use Saffron to run your accounting system!” SaffronMemoryBase is intended for a new kind of data analysis, including real-time updates and real-time query for complex analysis.

Achieving this speed requires more than a store of counts. Everything is pre-counted in a memory-base, but as I hope to explain in later postings, the pre-joining and pre-sorting of Saffron’s matrix representation further satisfy the need for speed. Otherwise, the lost time and CPU load of table scanning and table joining can be substantial, just when time and load are most critical.


This entry was posted on Wednesday, November 25th, 2009 at 11:18 am and is filed under Natural Intelligence.
