Wednesday, March 5, 2014

When would you use HANA?

Yesterday, I wrote about how HANA's architecture finesses the problem of using the same database to do queries and transaction processing.  When I wrote this, I was thinking mostly about other traditional SQL databases.

There are other types of databases designed to process extremely large data streams from the Internet. These are usually distributed, work better with unstructured data and are not traditional SQL databases.  I haven't worked with any of these systems myself. What follows is just me organizing my thoughts about HANA's competitive space.

One example of this is called Druid. Druid is open source, but comes from a company called Metamarkets, whose business is providing near real-time analytics on Twitter (and other) feeds. To do that, you need a database that can handle large quantities of transactions AND process queries at the same time. Metamarkets built their own technology and then released it as open source under the name Druid.

Druid processes data using a network of nodes.  There are several types of nodes. Real-time nodes process online transactions and persist data. Broker nodes handle queries.  Coordinator nodes and historical nodes work behind the scenes to manage data traffic. It isn't clear to me whether Druid would pass the so-called ACID test for transaction isolation and durability.

Druid uses a JSON-based query language instead of SQL.
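To give a feel for what a JSON-based query looks like, here is a minimal sketch of a Druid-style timeseries query built as a Python dictionary. The data source name (`tweets`), the column name (`retweet_count`) and the date range are made up for illustration; only the general shape of the query is the point.

```python
import json

# Hypothetical data source and column names, for illustration only.
query = {
    "queryType": "timeseries",
    "dataSource": "tweets",                     # assumed name
    "granularity": "hour",
    "intervals": ["2014-03-01/2014-03-02"],     # ISO 8601 interval
    "aggregations": [
        {"type": "count", "name": "rows"},
        {"type": "longSum", "name": "retweets", "fieldName": "retweet_count"},
    ],
}

# Serialize to the JSON text that would be sent to a broker node.
body = json.dumps(query, indent=2)
print(body)
```

Where SQL would say something like "SELECT COUNT(*) ... GROUP BY hour", the query here is a nested document describing the same request.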

Druid is optimized for insert transactions. Once data is stored, Druid does not expect it to change. Druid is capable of impressive throughput (350,000 rows/sec) and query scan rates (30 million rows/sec).  The latter was on a 100-node cluster.
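A quick back-of-the-envelope calculation puts the scan rate in perspective. This sketch assumes the work is spread evenly across the cluster, which is my simplification and not something the published figures state.

```python
# Figures quoted above for Druid.
scan_rate_rows_per_sec = 30_000_000   # across the whole cluster
nodes = 100

# Assumption: the scan work is divided evenly between nodes.
per_node_rows_per_sec = scan_rate_rows_per_sec / nodes

# Time for the whole cluster to scan one billion rows at that rate.
seconds_per_billion_rows = 1_000_000_000 / scan_rate_rows_per_sec

print(f"{per_node_rows_per_sec:,.0f} rows/sec per node")
print(f"{seconds_per_billion_rows:.1f} seconds to scan a billion rows")
```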

MongoDB is another example.  MongoDB stores documents in a JSON-like binary format (BSON) across a cluster of servers (it does not require Hadoop). The database is split into portions called shards that are distributed across the cluster. Indexing helps speed up the process of locating records. The data is stored on redundant nodes to make the system reliable.
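The idea of splitting a database into shards can be sketched in a few lines. This is a simplified hash-based routing scheme of my own, not MongoDB's actual algorithm; the function name `shard_for` and the keys are hypothetical.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a document key to a shard number with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical document keys spread over a 4-shard cluster.
keys = ["user:1", "user:2", "user:3", "user:4", "user:5"]
placement = {key: shard_for(key, 4) for key in keys}

for key, shard in placement.items():
    print(f"{key} -> shard {shard}")
```

Because the hash is stable, the same key always lands on the same shard, so a lookup only needs to visit one part of the cluster.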

Although MongoDB is well suited to very large datasets, it isn't clear that its performance is any better than that of traditional databases.  Like Druid, MongoDB uses a JSON-based query language instead of SQL.
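A MongoDB query is itself a document. The filter below uses MongoDB's `$gt` (greater than) operator; the sample documents and the `matches` function are a toy of my own that imitates how such a filter is evaluated, not MongoDB code.

```python
def matches(doc: dict, flt: dict) -> bool:
    """Toy evaluator supporting equality and the $gt / $lt operators."""
    for field, cond in flt.items():
        value = doc.get(field)
        if isinstance(cond, dict):
            for op, operand in cond.items():
                if value is None:
                    return False
                if op == "$gt" and not value > operand:
                    return False
                if op == "$lt" and not value < operand:
                    return False
                if op not in ("$gt", "$lt"):
                    raise ValueError(f"unsupported operator: {op}")
        elif value != cond:
            return False
    return True

# Hypothetical documents, for illustration only.
docs = [
    {"lang": "en", "retweets": 250},
    {"lang": "en", "retweets": 12},
    {"lang": "de", "retweets": 900},
]

# Roughly: SELECT * FROM docs WHERE lang = 'en' AND retweets > 100
flt = {"lang": "en", "retweets": {"$gt": 100}}
result = [d for d in docs if matches(d, flt)]
print(result)
```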

All these databases use clusters of cloud-based computers. I would imagine one of the challenges would be diminishing returns as the cluster gets larger. Every added computer comes at a cost, and past a certain point each one buys less additional performance than the one before. An interesting article from Gnip (one of the few companies that has every Tweet ever sent) hints at this.
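One common way to picture diminishing returns is Amdahl's law, which I am borrowing here as a stand-in model; none of the systems above is claimed to follow it exactly. The 95% parallel fraction is an arbitrary assumption chosen for illustration.

```python
def speedup(nodes: int, parallel_fraction: float) -> float:
    """Amdahl's law: overall speedup when only part of the work parallelizes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)

# Assumption: 95% of the work can be spread across the cluster.
p = 0.95
for n in (1, 10, 100, 1000):
    print(f"{n:>5} nodes -> {speedup(n, p):5.1f}x faster")
```

Under this model, going from 100 to 1,000 nodes multiplies the hardware bill by ten but adds far less than ten times the speed, because the serial 5% of the work caps the speedup at 20x.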

Considering all the traditional SQL databases on one hand and the new crop of "NoSQL" databases on the other, when would you use HANA?

I'm not sure I can answer that just yet, but keep reading.

Objective, controlled performance comparisons across these types of systems would be really helpful. It would also be helpful to compare the total costs to operate these systems. This would allow for a real cost/performance calculation.
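The cost/performance calculation I have in mind could be as simple as the sketch below. The system names and every number in it are placeholders, not measurements of any real product; the point is only the shape of the comparison.

```python
from dataclasses import dataclass

@dataclass
class System:
    name: str
    annual_cost: float      # total cost to operate per year (placeholder)
    rows_per_sec: float     # measured throughput (placeholder)

    def cost_per_row_per_sec(self) -> float:
        """Annual cost for each row/sec of sustained throughput."""
        return self.annual_cost / self.rows_per_sec

# Placeholder figures only; real values would come from controlled tests.
candidates = [
    System("System A", annual_cost=100_000, rows_per_sec=50_000),
    System("System B", annual_cost=300_000, rows_per_sec=200_000),
]

# Lower cost per unit of throughput ranks first.
ranked = sorted(candidates, key=System.cost_per_row_per_sec)
for s in ranked:
    print(f"{s.name}: {s.cost_per_row_per_sec():.2f} per row/sec")
```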

Right now, my guess is that HANA would be best suited when these things are true:

  • Large quantities of transactions occur on a daily basis. The transactions should be more than a simple stream of inserts.
  • There is a need for same-day analytics.
  • Existing data systems are not performing adequately.
  • A high level of transaction integrity (ACID) is needed.
