Tuesday, March 4, 2014

Thoughts on SAP HANA

The company I work for (Critigen) has a partnership with SAP and through that partnership, I've recently been introduced to the HANA database. Part of my job over the next few weeks is to learn as much about HANA as I can. Writing helps me organize my thoughts, so I thought I'd blog about it.

There are several things that make HANA different from most databases, but there is one particularly subtle difference that I'd like to focus on in this post.

Much has been written about the two most obvious of HANA's differences: the fact that it's an in-memory database and that its tables are stored as columns instead of rows. Everyone knows that running in memory is much faster than running from disk. Storing data in columns speeds queries because most queries filter based on column values, and it's much faster to scan a list of column values than it is to scan all the data in a table row by row. Early adopters have reported that some queries run up to 10,000 times faster in HANA than in a traditional database.
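To make the column-scan argument concrete, here's a toy sketch (mine, not HANA's actual storage format) of the same table stored row-wise and column-wise, with a filter on one column:

```python
# Row-oriented storage: a list of records. Filtering touches every
# field of every row.
rows = [
    {"id": 1, "region": "EMEA", "amount": 500},
    {"id": 2, "region": "APAC", "amount": 75},
    {"id": 3, "region": "EMEA", "amount": 120},
]
row_result = [r["id"] for r in rows if r["region"] == "EMEA"]

# Column-oriented storage: one contiguous list per column. The filter
# scans only the "region" column; matching positions then index into
# whichever other columns the query needs.
columns = {
    "id":     [1, 2, 3],
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [500, 75, 120],
}
matches = [i for i, v in enumerate(columns["region"]) if v == "EMEA"]
col_result = [columns["id"][i] for i in matches]

print(row_result)  # [1, 3]
print(col_result)  # [1, 3]
```

Both layouts return the same answer; the difference is how much data the filter has to touch, which is what dominates on large tables.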

One of the buzzwords these days is "Big Data," which I take to mean the problem of extracting meaning from the enormous quantities of data now available. Again, much has been written about how HANA's speed can help with this problem.

That's fine, but here's my problem: neither HANA's in-memory nor its column-oriented architecture makes it unique. In fact, all databases cache data in memory, and if you have enough RAM, you can effectively turn any database into an in-memory database. There are several column-oriented databases designed for super-fast queries of large data sets. The traditional knock on columnar databases is that while they run super-fast queries, they're useless for online transaction processing: inserting a row involves locking and changing each column's data file. Because of this, today's columnar databases are used strictly as data warehouses. They're taken offline at night and bulk loaded with the previous day's transactions.

So, what makes HANA worth its high cost, and why would anyone invest money to move a database into HANA? SAP actually ported its ERP to HANA, which must have been a huge and risky effort.

I struggled with this until yesterday. The answer is that HANA's engineers have solved the problem of efficient transaction processing using a columnar architecture.

Here's how it works:

  • When data is changed or inserted, it is written into a temporary row store that is connected to every table.  The transaction is committed at this point without any performance penalty.
  • A background process moves data from the temporary row store into the permanent column stores.
  • Queries examine the column store first and then the row store. If the row store is empty (which is most of the time), the query suffers no performance penalty. If the row store contains data, the query suffers a small performance penalty.
This elegant finesse is what makes HANA worthwhile.
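The mechanism above can be sketched in a few lines of Python. This is my own toy model of the idea (the class and method names are mine, not HANA's API): writes land in a row buffer and commit immediately, a merge step moves buffered rows into the column store, and queries read both.

```python
class HanaStyleTable:
    """Toy model: a read-optimized column store plus a write-optimized row buffer."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}  # permanent column store
        self.delta = []  # temporary row store attached to the table

    def insert(self, row):
        # Writes go to the row store; the transaction commits right away,
        # with no column files to lock or rewrite.
        self.delta.append(row)

    def merge(self):
        # The background process: move buffered rows into the column store.
        for row in self.delta:
            for name, value in row.items():
                self.columns[name].append(value)
        self.delta = []

    def select(self, col, predicate):
        # Queries scan the column store, then the (usually empty) row store.
        hits = [v for v in self.columns[col] if predicate(v)]
        hits += [row[col] for row in self.delta if predicate(row[col])]
        return hits

t = HanaStyleTable(["id", "amount"])
t.insert({"id": 1, "amount": 500})
t.insert({"id": 2, "amount": 75})
print(t.select("amount", lambda v: v > 100))  # [500] (read from the row store)

t.merge()  # background move into the column store
t.insert({"id": 3, "amount": 300})
print(t.select("amount", lambda v: v > 100))  # [500, 300] (one hit from each store)
```

The point of the design is that the expensive part, rewriting the column files, happens off the critical path of the transaction.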
