HBase: one database for both analytical and operational workloads

Posted on 15 Feb 2015

Given its deep integration within the Hadoop platform, the promise of HBase is clear: one datastore for both analytical and operational workloads. But these two types of workloads call for very different data interaction patterns.

On the one hand, exploratory analytics is focused on complex analysis. Data analysts need to perform rich queries on the data, and typically generate analytic reports by combining standard SQL commands with BI visualization products. On the other hand, operational intelligence requires real-time access to the data for fast serving within a data-driven application. Developers want natural, efficient access to the data, with fast disk access, to deliver a truly interactive user experience.

Scaling difficulties aside, relational databases are perfectly geared to serve data analysts, but their structure of tables and relationships is very different from the logic of an application. HBase, by design, is more closely aligned with application developers' requirements. The difficulty, however, is that selecting the appropriate model (i.e. the rowkey and the column families) requires a good understanding of how those choices affect the way the data is sharded across the cluster. The good news is that by using the appropriate abstraction layer for each use case, it is now possible to serve the data analyst and the application developer equally well with HBase.
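To make the point about rowkeys and sharding more concrete, here is a minimal sketch in plain Python (no HBase client involved). It assumes a hypothetical table of user events and builds a composite rowkey with a hash-based salt prefix, a common way to spread writes across regions. The names `NUM_BUCKETS`, `salted_rowkey` and `user_scan_prefix` are made up for this illustration, and the key layout is only one possible design.

```python
import hashlib

# Hypothetical number of salt buckets; in practice this would be chosen
# to match how the table is pre-split into regions.
NUM_BUCKETS = 8

# Largest signed 64-bit value, used to reverse timestamps.
MAX_LONG = 2**63 - 1


def _salt(user_id: str) -> int:
    """Derive a stable bucket number from the user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def salted_rowkey(user_id: str, timestamp_ms: int) -> bytes:
    """Build a rowkey of the form <salt>|<user_id>|<reversed timestamp>."""
    # The salt depends only on the user id: rows of one user stay contiguous,
    # while different users are spread over NUM_BUCKETS key ranges.
    # Reversing the timestamp makes the most recent event of a user sort first.
    reversed_ts = MAX_LONG - timestamp_ms
    return f"{_salt(user_id):02d}|{user_id}|{reversed_ts:019d}".encode("utf-8")


def user_scan_prefix(user_id: str) -> bytes:
    """Prefix shared by every rowkey of a given user (useful for a prefix scan)."""
    return f"{_salt(user_id):02d}|{user_id}|".encode("utf-8")


if __name__ == "__main__":
    print(salted_rowkey("user42", 1423958400000))
    print(user_scan_prefix("user42"))
```

The trade-off in this sketch is typical: reading all events of one user remains a single contiguous scan, but a scan over a time range across all users would have to be issued once per bucket.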

On the data analyst front, HBase obviously benefits from the community investment in making Hadoop data accessible through SQL. Several options are available to run SQL queries on HBase, from Cloudera-sponsored Impala to MapR-driven Apache Drill. HBase also has its own read-write SQL skin in the Apache Phoenix project, which provides a bridge between HBase and a relational approach to data manipulation. Phoenix also adds familiar RDBMS features such as secondary indexes, making HBase look more like a traditional database. That said, let's not forget that Phoenix only extends HBase native functions and compiles SQL queries into a series of HBase scans. But of course the end user does not have to know about this!
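As a rough illustration of what "compiling a SQL query into a scan" means, the toy sketch below models a table as a list of rows sorted by rowkey and turns an equality predicate on the leading key column into a start/stop key range. This is not Phoenix's actual implementation; the table contents, the `|` separator and the helper names are assumptions made for the example.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows kept sorted by rowkey, as HBase does.
# Hypothetical rowkey layout: <user_id>|<order_date>
ROWS = sorted(
    [
        (b"u1|2015-01-03", {"amount": 12}),
        (b"u2|2015-01-01", {"amount": 30}),
        (b"u2|2015-02-10", {"amount": 7}),
        (b"u20|2015-01-15", {"amount": 99}),
        (b"u3|2015-01-20", {"amount": 5}),
    ],
    key=lambda row: row[0],
)


def prefix_stop_key(prefix: bytes) -> bytes:
    """Smallest key greater than every key starting with `prefix`.

    Simplification: assumes the last byte of the prefix is not 0xff.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def scan(rows, start: bytes, stop: bytes):
    """Return rows whose key is in [start, stop), like a range scan."""
    keys = [key for key, _ in rows]
    return rows[bisect_left(keys, start):bisect_left(keys, stop)]


def select_by_user(rows, user_id: str):
    """Roughly: SELECT * FROM orders WHERE user_id = :user_id"""
    prefix = user_id.encode("utf-8") + b"|"
    return scan(rows, prefix, prefix_stop_key(prefix))


if __name__ == "__main__":
    for key, cells in select_by_user(ROWS, "u2"):
        print(key, cells)
```

The point of the sketch is that a predicate on the leading part of the rowkey becomes a narrow range scan, whereas a predicate on any other column would require scanning the whole table, or a secondary index of the kind Phoenix provides.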

On the application developer front, a variety of options are now available. Both WibiData and Continuuity / Cask perceived as early as 2011 the need to reduce the plumbing and make HBase easier to consume for the application developer. We experienced it first-hand at Ubeeko as well, and HFactory leverages Scala, Akka, Spray and Docker to take the simplification of the application developer experience to a new level. There is no need here to add a SQL layer: it could even be counterproductive, as it would mask the imperative of aligning the HBase data model with the application characteristics and data access patterns. The focus is on delivering simplicity and modularity with 1) complete application scaffolding and a developer-friendly, entity-driven JSON API to HBase, and 2) open interfaces and easy integration with data processing engines such as Spark.

More information on the Apache Phoenix and HFactory projects is available at http://phoenix.apache.org and http://hfactory.io.