Friday, March 28, 2014

HANA GeoSpatial Support

Beginning with SPS 7, Revision 70, geospatial datatypes are generally available. I spent the day today learning what I could about HANA's geospatial datatype features.

Although geospatial data types are available, there are currently quite a few limitations:
  • You can't declare a spatial datatype in an .hdbtable file. You have to use SQL.
  • You can't use spatial datatypes in attribute, analytic or computed views.
  • You can't pass a spatial datatype as an argument to a stored procedure (you can pass a string representation, which is almost as good; see the sketch after this list).
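To illustrate that last point, here's a minimal sketch of what I mean by passing the string representation instead of the geometry itself. The procedure name, column names and 50 km cutoff are placeholders, and I haven't tested this exact code:

CREATE PROCEDURE STATIONS_NEAR( IN point_wkt NVARCHAR(200) )
LANGUAGE SQLSCRIPT READS SQL DATA AS
BEGIN
    -- build the geometry inside the procedure from the WKT string argument
    SELECT ID, NAME,
           LOCATION.ST_Distance( NEW ST_POINT( :point_wkt, 4326 ) ) / 1000 AS DIST_KM
    FROM <TABLENAME>
    WHERE LOCATION.ST_Distance( NEW ST_POINT( :point_wkt, 4326 ) ) / 1000 < 50;
END;

Calling it is then just CALL STATIONS_NEAR( 'POINT(-94.1167 29.7000)' ).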
I have an existing table with LATITUDE and LONGITUDE columns and want to use them to create a new ST_POINT column. From there, I hope to take advantage of the new geospatial functions. Getting it working took most of the day, and I learned a few lessons along the way.

Here's the first lesson learned: To use latitudes and longitudes, you must specify the SRID (coordinate system) of 4326 when you create the column:

alter table <TABLENAME> add( LOCATION st_point(4326) );

This SRID represents WGS84, which is a spherical model of the earth. The default SRID of zero gives you a "flat earth" Cartesian coordinate system. For most real-world applications, this is pretty useless.
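As a sanity check, there should be a system view that lists the spatial reference systems the database knows about. I believe it's called ST_SPATIAL_REFERENCE_SYSTEMS, but I'm going from memory here, so treat the view and column names below as assumptions:

SELECT SRS_ID, SRS_NAME
FROM ST_SPATIAL_REFERENCE_SYSTEMS
WHERE SRS_ID IN ( 0, 4326 );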

The second lesson learned: You also have to specify the SRID in the update statement.  

update <TABLENAME> set LOCATION = new st_point( 'POINT(' || LONGITUDE || ' ' || LATITUDE ||')',4326)

I'm not sure why this is necessary, but omitting the SRID results in point geometries that can't be used in distance calculations.

The third lesson I learned was that performance is disappointing. The table in question contains 91,000 weather stations. The query below returns stations within 50 km of a specified lat/long and takes almost 11 seconds to run (on an AWS-hosted HANA instance):

SELECT ID, name, longitude, latitude, location.st_distance(ST_GeomFromEWKT('SRID=4326;POINT(-94.1167 29.7000)'))/1000 as Dist_KM
FROM <TABLENAME>
where location.st_distance(ST_GeomFromEWKT('SRID=4326;POINT(-94.1167 29.7000)'))/1000 < 50

Almost all of the time went into the WHERE clause. Hopefully, the HANA developers will work on this.
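One workaround I want to try is pre-filtering on the raw LATITUDE and LONGITUDE columns so that ST_Distance only runs against nearby candidates. This is only a sketch: the one-degree box is a deliberately generous over-estimate of 50 km, and I haven't verified that the optimizer actually applies the cheap range filters before the expensive distance function:

SELECT ID, name, longitude, latitude, location.st_distance(ST_GeomFromEWKT('SRID=4326;POINT(-94.1167 29.7000)'))/1000 as Dist_KM
FROM <TABLENAME>
where LATITUDE between 29.7000 - 1 and 29.7000 + 1
and LONGITUDE between -94.1167 - 1 and -94.1167 + 1
and location.st_distance(ST_GeomFromEWKT('SRID=4326;POINT(-94.1167 29.7000)'))/1000 < 50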


Wednesday, March 19, 2014

HANA versus SQL Server

In previous posts, I reported some basic performance data I'd obtained from HANA.

I needed a frame of reference to understand what these numbers actually meant, so I set up a Microsoft SQL server in Azure, loaded the same weather data and timed the same query.  I repeated the query 8 times and computed the average CPU processing time.
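For anyone who wants to reproduce the measurement: SQL Server reports per-query CPU time when time statistics are switched on. Here's a minimal sketch, using placeholder table names for my imported weather data and the same query as the March 13 post:

SET STATISTICS TIME ON;

SELECT *
FROM DAILY_WEATHER_OBS d
INNER JOIN STATION s ON d.STATION_ID = s.ID
WHERE s.STATE = 'CO' AND d.ELEMENT = 'TMAX' AND d.STATION_ID = 'USC00053147'
ORDER BY d.YYYYMMDD;

SET STATISTICS TIME OFF;
-- CPU time and elapsed time appear in the Messages output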

On average, the query took 32.781 seconds.  The effective scan rate was 981,000 rows per second.

By comparison, the same query on the same data ran in 0.396 seconds in HANA.

In this experiment, HANA's scan rate was approximately 100 times greater than SQL Server's.

I had expected the SQL Server to do better, so some further investigation seemed warranted.

Could differences in hardware explain the poor SQL Server result?

I used SQL Server 2012 Standard running on an Azure "Extra Large" VM. This configuration has 8 cores and 16GB RAM.  The number of cores matches HANA's configuration, but the VM has less RAM. Although my HANA instance has 60GB RAM, it has never used more than 18GB.  I monitored RAM usage and paging closely during the experiment.  The SQL Server topped out at about 12GB RAM usage, and paging was negligible.

I also monitored disk utilization and noticed that the C: drive spent much of the time at 100% utilization. By default, SQL Server places database files on the C: drive, which means access to the database competes with the operating system for access to the hard drive. So, I moved the database to the D: drive and re-ran the experiment.  It didn't make any difference.

Azure only promises about 60 MB/s of throughput for disks attached to its VMs (link). This is, at best, a modest number, and I believe it significantly hampered the SQL Server.

A faster disk would make a big difference.

By comparison, the figures I found for Amazon EC2 ranged from 100 to 200 MB/s, and the solid-state disk in my laptop ranged from 287 MB/s to 476 MB/s. Building the SQL Server in Amazon might have doubled its performance; building it on actual hardware with solid-state disks or RAID could have quadrupled it.

Conclusion: Azure's limited disk performance significantly hampers SQL Server performance.

Would indexing make any difference?

I ran the query through the SQL Server optimizer, which suggested an index.  I created that index (which took over an hour) and repeated the experiment.  The index made the query run about three times slower.

The index suggested by the query optimizer involved five columns. I also tried two single-column indexes designed to speed up the join and the WHERE clause. These also slowed the query down.
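For the curious, the single-column indexes were of this general form (index names and column choices here are illustrative rather than the exact statements I ran):

CREATE INDEX IX_OBS_STATION_ID ON DAILY_WEATHER_OBS ( STATION_ID );
CREATE INDEX IX_OBS_ELEMENT ON DAILY_WEATHER_OBS ( ELEMENT );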

Conclusion: Indexing didn't help.  In fact, it made things worse.



Monday, March 17, 2014

HANA SQL View Versus Attribute View Performance

In addition to the normal SQL view, HANA has three new types of non-SQL views called Attribute Views, Analytic Views and Calculation Views.  SAP training makes a big deal out of these views, saying that they have capabilities not found in normal SQL views and, perhaps more importantly, that they're supposed to be faster, primarily because they bypass the SQL parser and query optimizer and go directly to the calculation engine.

After sitting through the training on Attribute Views, I was skeptical of their value. The tool used to build attribute views looks like a standard graphical query designer (similar to Microsoft Access), and I did not see any capability in them that could not also be expressed in SQL.

For me, the one thing left was the possibility that attribute views might be faster than their equivalent SQL.

So, using the weather observation data described in the two previous posts, I ran a direct comparison.

Here's the SQL:

select OBS_ID,
STATION_ID,
YYYYMMDD,
LATITUDE,
LONGITUDE,
ELEVATION,
VALUE,
QFLAG,
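-- VALUE is in tenths of a degree Celsius (the GHCN convention); the next line converts it to Fahrenheit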
(VALUE)/10*9/5+32 AS TEMPERATURE
from "WEATHER_DATA"."Critigen.WeatherData.Data::DAILY_WEATHER_OBS" d
inner join "WEATHER_DATA"."Critigen.WeatherData.Data::STATION" s on d.STATION_ID = s.ID
where s.STATE IS NOT NULL
and d.ELEMENT = 'TMAX'
and VALUE >= -500
and QFLAG = ''
order by d.YYYYMMDD

The Attribute View returns the same result set and is executed via a stored procedure.
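For anyone who wants to skip the wrapper procedure, an activated attribute view is also exposed as a generated column view in the _SYS_BIC schema, so it can be queried with ordinary SQL. The package path and view name here are made up for illustration:

select * from "_SYS_BIC"."Critigen.WeatherData/AT_WEATHER_OBS";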

I ran each query 8 times and the average execution time in seconds is shown below.  SQL takes about 7 times longer to execute than the Attribute View.

As Attribute View: 0.2585
As SQL: 1.77225

Attribute views are clearly much faster than SQL.




Thursday, March 13, 2014

HANA Query Performance

I loaded the 2013 weather observations today, so I finally have enough data to report basic query performance.

This query returns maximum temperature readings for a particular weather station. It joins the weather observation table (33 million rows) with the station table (91,000 rows).  There are no keys or indexes.

select * from "WEATHER_DATA"."Critigen.WeatherData.Data::DAILY_WEATHER_OBS" d
inner join "WEATHER_DATA"."Critigen.WeatherData.Data::STATION" s on d.STATION_ID = s.ID
where s.STATE = 'CO' and d.ELEMENT = 'TMAX' and d.STATION_ID = 'USC00053147'
order by d.YYYYMMDD

The query returned 396 rows in 335 ms. The overall scan rate was 99 million rows per second.

Tuesday, March 11, 2014

Bulk loading HANA

For the last few days, I've been experimenting with ways to get large chunks of data into HANA. I thought I'd share some performance results I've observed during these experiments.

About the Run-Time Environment

I've decided to work with the so-called Global Historical Climatology Network data. This is available from several sources, including NOAA.  For the US in the first two months of 2014, there are about 4 million observation records from 36,000 weather stations. Historical data goes back to 1763. I think this data set is large enough to give me a way to learn and evaluate HANA's big data capabilities.

There are two tables in my starter database: a weather observation table with 9 columns and a weather station info table with 8 columns.
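For reference, the columns that matter in the posts that follow look roughly like this. This is a partial sketch reconstructed from the queries I run later, not the actual table definitions, and the data types are guesses:

create column table DAILY_WEATHER_OBS (
    OBS_ID bigint,           -- surrogate key
    STATION_ID nvarchar(11), -- GHCN station identifier
    YYYYMMDD nvarchar(8),    -- observation date
    ELEMENT nvarchar(4),     -- e.g. TMAX, TMIN, PRCP
    VALUE integer,           -- e.g. temperature in tenths of a degree C
    QFLAG nvarchar(1)        -- quality flag
    -- plus a few more columns
);

create column table STATION (
    ID nvarchar(11),
    NAME nvarchar(50),
    STATE nvarchar(2),
    LATITUDE decimal(9,4),
    LONGITUDE decimal(9,4),
    ELEVATION decimal(7,1)
    -- plus a couple more columns
);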

Three Ways to Load Data

I've tried three ways to load data (there are others):

  • CSV files included in a HANA XS project.
  • A linked table in Microsoft Access, using the HANA ODBC driver.
  • A web service hosted in HANA combined with a SAPUI5 File Upload control.  (Update: Because the SAPUI5 control can only upload one file at a time, I switched to a drag-and-drop interface that can queue up lots of files for sequential upload. This works a lot better.)
Loading Data Using CSV in HANA Studio Project

When you create a HANA XS project, you can include CSV files that are loaded when the project is deployed and activated. While handy, I quickly discovered this method is useless for data sets larger than 3 or 4 megabytes.

Loading Data Using ODBC Driver

The HANA client includes an ODBC driver which can be used to load data. This makes it possible to use Microsoft Access as a data import tool.  The process goes like this:
  • Create a linked table that points to the destination table in HANA
  • Import the CSV file into a working table in Access
  • Write an update query that moves data from the working table to the linked table (a sketch follows this list)
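The last step amounts to an append query; something like this, where STATION_IMPORT is the local working table and HANA_STATION is the linked table (both names are placeholders):

INSERT INTO HANA_STATION
SELECT * FROM STATION_IMPORT;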
This is easy to set up, but performance is miserable.  In my case, the best INSERT performance I could get was about 3 records per second; the 91,000-record station table took about 6 hours to populate. While that's better than the CSV-in-project approach, 3 records per second isn't very useful.

I wondered if my internet connection speed was the culprit, so I moved the experiment to a VM on my company network where the network is much faster.  Performance went from 3 records per second to about 5 records per second.  

If you have more than a few thousand rows to load, avoid using the ODBC driver.  It's too inefficient.

Loading Data Using XS-hosted Web Service


Following instructions from the SAP site, I created an XS web service, hosted on the HANA server.  This web service takes an uploaded CSV file and processes it into the database.  I created two versions; the first version processed one row at a time (committing after each insert) and the second processed the entire file in one batch (using the executeBatch command).

The one-row-at-a-time version performs inserts at about 40 rows per second.  It took about 6 hours to upload 1 million rows of weather observation data.  Better, but still too slow.

The batch version processed the same million row insert in about 18 seconds. A 1-million-row CSV of weather observation data is about 34 megabytes. This took about 6 seconds to upload and about 12 seconds for the database to process the results.  The effective throughput is about 55,000 rows per second.

I tried to upload a 100 MB file, but that failed.  I suspect that exceeds the web server's file upload size limit.

To recap:

  1. Using CSV files in a HANA Studio project is useless for all but the smallest of data loads.
  2. Using ODBC, you can insert about 5 records per second.
  3. Using an XS web service, you can insert more than 50,000 records per second.

Wednesday, March 5, 2014

When would you use HANA?

Yesterday, I wrote about how HANA's architecture finesses the problem of using the same database to do queries and transaction processing.  When I wrote this, I was thinking mostly about other traditional SQL databases.

There are other types of databases designed to process extremely large data streams from the Internet. These are usually distributed, work better with unstructured data and are not traditional SQL databases.  I haven't worked with any of these systems myself. What follows is just me organizing my thoughts about HANA's competitive space.

One example of this is called Druid. Druid is open source, but it comes from a company called Metamarkets, whose business is providing near real-time analytics on Twitter (and other) feeds. To do that, you need a database that can handle large quantities of transactions AND process queries at the same time. Metamarkets built their own technology and then released it as open source under the name Druid.

Druid processes data using a network of nodes.  There are several types of nodes: real-time nodes process online transactions and persist data, broker nodes handle queries, and coordinator and historical nodes work behind the scenes to manage data traffic. It isn't clear to me whether Druid would pass the so-called ACID test for transaction isolation and durability.

Druid uses a JSON-based query language instead of SQL.

Druid is optimized for insert transactions; once stored, the data is not expected to change. Druid is capable of impressive throughput (350,000 rows/sec) and query scan rates (30 million rows/sec), although the latter figure was measured on a 100-node cluster.

MongoDB is another example.  MongoDB stores documents in a JSON-like format across a distributed cluster of machines. The database is split into portions called shards that are spread across the cluster. Indexing helps speed up the process of locating records, and the data is stored on redundant nodes to make the system reliable.

Although MongoDB is well suited to very large datasets, it isn't clear that its performance is any better than that of traditional databases.  Like Druid, MongoDB uses a JSON-based query language instead of SQL.

All these databases use clusters of cloud-based computers. I would imagine one of the challenges is diminishing returns as the cluster gets larger: adding computers comes at a cost, and you could quickly find yourself on the losing end of that curve. An interesting article from GNIP (one of the few companies that has every Tweet ever sent) hints at this.

Considering all the traditional SQL databases on one hand and the new crop of "noSQL" databases on the other, when would you use HANA?

I'm not sure I can answer that just yet, but keep reading.

Objective, controlled performance comparisons across these types of systems would be really helpful. It would also be helpful to compare the total costs to operate these systems. This would allow for a real cost/performance calculation.

Right now, my guess is that HANA would be best suited when these things are true:

  • Large quantities of transactions occur on a daily basis. The transactions should be more than a simple stream of inserts.
  • There is a need for same-day analytics.
  • Existing data systems are not performing adequately.
  • A high level of transaction integrity is needed (ACID).

Tuesday, March 4, 2014

Thoughts on SAP HANA

The company I work for (Critigen) has a partnership with SAP and through that partnership, I've recently been introduced to the HANA database. Part of my job over the next few weeks is to learn as much about HANA as I can. Writing helps me organize my thoughts, so I thought I'd blog about it.

There are several things that make HANA different from most databases, but there is one particularly subtle difference that I'd like to focus on in this post.

Much has been written about the two most obvious of HANA's differences: the fact that it's an in-memory database and that its tables are stored as columns instead of rows. Everyone knows that running in memory is much faster than running from disk.  Storing data in columns speeds up queries because most queries filter on column values, and it's much faster to scan a list of column values than to scan all the data in a table row by row. Early adopters have reported that some queries run up to 10,000 times faster in HANA than in a traditional database.

One of the buzzwords these days is "Big Data," which I take to mean the problem of extracting meaning from the enormous quantities of data now available. Again, much has been written about how HANA's speed can help with this problem.

That's fine, but here's my problem: neither HANA's in-memory architecture nor its column-oriented storage makes it unique. In fact, all databases cache data in memory, and if you have enough RAM, you can effectively turn any database into an in-memory database. There are several column-oriented databases designed for super-fast queries of large data sets. The traditional knock on columnar databases is that while they run super-fast queries, they're useless for online transaction processing: inserting a row involves locking and changing each column's data file. Because of this, today's columnar databases are used strictly as data warehouses. They're taken offline at night and bulk loaded with the previous day's transactions.

So, what makes HANA worth its high cost, and why would anyone invest money to move a database into HANA?  SAP actually ported its ERP to HANA, which must have been a huge and risky effort.

I struggled with this until yesterday. The answer is that HANA's engineers have solved the problem of efficient transaction processing using a columnar architecture.

Here's how it works:

  • When data is changed or inserted, it is written into a temporary row store that is connected to every table.  The transaction is committed at this point without any performance penalty.
  • A background process moves data from the temporary row store into the permanent column stores.
  • Queries first examine the column store and then examine the row store. If the row store is empty (which is most of the time), the query suffers no performance penalty.  If the row store contains data, the query suffers a small performance penalty.
This elegant finesse is what makes HANA worthwhile.
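If you want to watch this happen on your own tables, HANA exposes the mechanism through SQL. The first statement below forces the background merge described above, and the monitoring view shows how much data is sitting in the delta (row) store versus the main (column) store. The table name is a placeholder, and I'm going from memory on the exact column names, so treat them as approximate:

merge delta of <TABLENAME>;

select TABLE_NAME, RAW_RECORD_COUNT_IN_DELTA, MEMORY_SIZE_IN_MAIN, MEMORY_SIZE_IN_DELTA
from M_CS_TABLES
where TABLE_NAME = '<TABLENAME>';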