Tuesday, March 11, 2014

Bulk loading HANA

For the last few days, I've been experimenting with ways to get large chunks of data into HANA. I thought I'd share some performance results I've observed during these experiments.

About the Run-Time Environment

I've decided to work with the so-called Global Historical Climatology Network data. This is available from several sources, including NOAA.  For the US in the first two months of 2014, there are about 4 million observation records from 36,000 weather stations. Historical data goes back to 1763. I think this data set is large enough to give me a way to learn and evaluate HANA's big data capabilities.

There are two tables in my starter database: a weather observation table with 9 columns and a weather station info table with 8 columns.

Three Ways to Load Data

I've tried three ways to load data (there are others):

  • CSV files included in a HANA XS project.
  • A linked table in Microsoft Access, which uses the HANA ODBC driver.
  • A web service hosted in HANA combined with a SAPUI5 File Upload control.  (Update: Because the SAPUI5 control can only upload one file at a time, I switched to a drag-and-drop interface that can queue up lots of files for sequential upload.  This works a lot better.)

Loading Data Using CSV in HANA Studio Project

When you create a HANA XS project, you can include CSV files that are loaded when the project is deployed and activated. While handy, I quickly discovered this method is useless for data sets larger than 3 or 4 megabytes.

Loading Data Using ODBC Driver

The HANA client includes an ODBC driver which can be used to load data. This makes it possible to use Microsoft Access as a data import tool.  The process goes like this:
  • Create a linked table that points to the destination table in HANA
  • Import the CSV file into a working table in Access
  • Write an update query that moves data from the working table to the linked table
This is easy to set up, but performance is miserable.  In my case, the best INSERT performance I could get was about 3 records per second; the 91,000-record station table took about 6 hours to populate. While better than the CSV approach, 3 records per second isn't very useful.

I wondered if my internet connection speed was the culprit, so I moved the experiment to a VM on my company network where the network is much faster.  Performance went from 3 records per second to about 5 records per second.  

If you have more than a few thousand rows to load, avoid using the ODBC driver.  It's too inefficient.
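The gap between per-row and batched inserts isn't specific to HANA or ODBC; any per-record round trip with a commit per insert pays the same tax. Here's a small stand-in experiment (sqlite3 instead of HANA, so it runs anywhere) illustrating the effect:

```python
import sqlite3
import time

# Stand-in experiment: sqlite3 instead of HANA/ODBC, but the pattern
# (commit-per-row vs. one batched call) is the same.
rows = [("STN%05d" % i, 20.0 + i % 10) for i in range(10_000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, temp REAL)")

# Per-row: one statement and one transaction per record.
t0 = time.perf_counter()
for r in rows:
    conn.execute("INSERT INTO obs VALUES (?, ?)", r)
    conn.commit()
per_row = time.perf_counter() - t0

conn.execute("DELETE FROM obs")
conn.commit()

# Batched: one executemany call, one commit for the whole set.
t0 = time.perf_counter()
conn.executemany("INSERT INTO obs VALUES (?, ?)", rows)
conn.commit()
batched = time.perf_counter() - t0

print(per_row > batched)  # batching wins
```

Over a network link the difference is even larger, since each per-row insert also pays a full round trip.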

Loading Data Using XS-hosted Web Service

Following instructions from the SAP site, I created an XS web service, hosted on the HANA server.  This web service takes an uploaded CSV file and processes it into the database.  I created two versions; the first version processed one row at a time (committing after each insert) and the second processed the entire file in one batch (using the executeBatch command).
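Setting the XSJS plumbing aside, the core of the batch version is just turning the uploaded CSV body into parameter batches sized for a single prepared-statement batch call. A minimal Python sketch of that step (the four-column layout here is a simplification of the GHCN record format, and `csv_to_batches` is my own hypothetical name, not an SAP API):

```python
import csv
import io

def csv_to_batches(body, batch_size=10_000):
    """Split a CSV body into lists of parameter tuples, each list
    sized for one batched insert (e.g. executeBatch on the server)."""
    reader = csv.reader(io.StringIO(body))
    batch = []
    for station, date, element, value in reader:
        batch.append((station, date, element, int(value)))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

sample = "USW00094728,20140101,TMAX,28\nUSW00094728,20140102,TMAX,-33\n"
batches = list(csv_to_batches(sample, batch_size=1))
print(len(batches))  # 2
```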

The one-row-at-a-time version performs inserts at about 40 rows per second.  It took about 6 hours to upload 1 million rows of weather observation data.  Better, but still too slow.

The batch version processed the same million-row insert in about 18 seconds. A 1-million-row CSV of weather observation data is about 34 megabytes. This took about 6 seconds to upload and about 12 seconds for the database to process the inserts.  The effective throughput is about 55,000 rows per second.
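The throughput figure follows directly from the numbers above:

```python
rows = 1_000_000
upload_seconds = 6    # transferring the ~34 MB file
insert_seconds = 12   # server-side batch insert

end_to_end = rows / (upload_seconds + insert_seconds)
insert_only = rows / insert_seconds
print(round(end_to_end))   # ~55,556 rows/s end to end
print(round(insert_only))  # ~83,333 rows/s for the insert alone
```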

I tried to upload a 100 MB file, but that failed.  I suspect that exceeds the web server's file upload size limit.
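One workaround, given the drag-and-drop queue already handles multiple files, is to split a large CSV into chunks below the limit before upload. A sketch (the 30 MB cap is a guess based on the 34 MB file succeeding; `split_csv` is a hypothetical helper, not part of any SAP tooling):

```python
import io

def split_csv(lines, max_bytes=30 * 1024 * 1024):
    """Split a CSV into chunks no bigger than max_bytes, breaking
    only on line boundaries so each chunk stays parseable."""
    chunks, current, size = [], [], 0
    for line in lines:
        if size + len(line) > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks

data = io.StringIO("a,1\nb,2\nc,3\n")
print(len(split_csv(data, max_bytes=4)))  # 3 chunks of one line each
```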

To recap:

  1. Using CSV files in a HANA Studio project is useless for all but the smallest of data loads.
  2. Using ODBC, you can insert about 5 records per second.
  3. Using an XS web service, you can insert more than 50,000 records per second.
