HFactory integrates Spark for advanced analytics

Posted on 17 May 2016

After using it in various customer projects over the last year, we have now finalised the integration of Apache Spark into the HFactory platform. From the very start, Spark was a natural extension of the work we had accomplished within HFactory: both solutions share the same functional programming principles and a focus on elegant, expressive APIs, and both are built on the powerful Scala programming language. So we are especially pleased to now offer a fully packaged solution for building Spark-based operational analytic applications.

Initiated as an ORM-like abstraction layer on top of HBase, HFactory abstracts away the complexity of HBase and delivers a full application stack on top of the Hadoop NoSQL database. In such a Hadoop-centric environment, Spark is the ideal HBase complement to deliver advanced analytic capabilities. On one side, Spark brings in core distributed processing and, unlike MapReduce, intelligent caching to limit the need for full dumps of massive datasets to disk. On the other side, HBase offers a highly scalable and flexible low-latency persistence layer for interactive business applications.

In our Hadoop / HFactory architecture, Spark is used to process the data across the cluster, while HFactory manages the storage into HBase and the servicing of the data via its Spray-based application server. Throughout the chain, data is manipulated via the high-level, strongly-typed entities generated with HFactory. There is no longer any need to keep switching between HBase byte arrays and Spark data frames, as HFactory provides programmatic access to the data through a single Scala interface for both analytical and operational functions.
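To give a feel for the byte-level plumbing that typed entities hide, here is a minimal sketch in plain Scala. It does not use the HFactory or HBase APIs: `SensorReading` and `SensorReadingCodec` are hypothetical names invented for this illustration, and the row key layout (machine id followed by timestamp) is only an assumption.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical entity, for illustration only; HFactory generates its own entity classes.
final case class SensorReading(machineId: String, timestamp: Long, value: Double)

// Hypothetical codec standing in for the conversions an entity layer takes care of:
// HBase itself only stores raw byte arrays.
object SensorReadingCodec {
  // Assumed row key layout: machine id bytes followed by the big-endian timestamp.
  def rowKey(r: SensorReading): Array[Byte] = {
    val id = r.machineId.getBytes(UTF_8)
    ByteBuffer.allocate(id.length + 8).put(id).putLong(r.timestamp).array()
  }

  // The cell value is the measurement encoded as 8 bytes.
  def encodeValue(r: SensorReading): Array[Byte] =
    ByteBuffer.allocate(8).putDouble(r.value).array()

  // Rebuilds the typed entity from the raw key and value bytes.
  def decode(key: Array[Byte], value: Array[Byte]): SensorReading = {
    val idLength = key.length - 8
    val machineId = new String(key, 0, idLength, UTF_8)
    val timestamp = ByteBuffer.wrap(key, idLength, 8).getLong()
    SensorReading(machineId, timestamp, ByteBuffer.wrap(value).getDouble())
  }
}
```

Application code then only handles `SensorReading` values, whether it runs inside an analytic job or behind a REST endpoint; the byte arrays stay confined to the codec.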

HFactory Spark Architecture

Going one step further, a key advantage of Spark is that it can cover a wide range of computing workloads. The framework unifies different data processing paradigms in a single execution environment, with dedicated libraries for streaming and machine learning (MLlib). As an example, Spark seamlessly joins streaming data to batch data: the same aggregates can be used in batch and stream mode, which significantly simplifies the overall architecture compared to using specialised data processing engines.
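The idea of reusing one aggregate in both modes can be sketched without any Spark dependency. The example below is plain Scala and only mimics the principle: `Stats` is a hypothetical aggregate, a `Seq` stands in for a batch dataset and an `Iterator` for an incoming stream.

```scala
// Hypothetical running aggregate: count, sum and max of the observed values.
final case class Stats(count: Long, sum: Double, max: Double) {
  // The single update rule shared by batch and stream processing.
  def add(x: Double): Stats = Stats(count + 1, sum + x, math.max(max, x))
  def mean: Double = if (count == 0) 0.0 else sum / count
}

object Stats {
  val empty: Stats = Stats(0L, 0.0, Double.NegativeInfinity)

  // Batch mode: fold the complete dataset in one pass.
  def batch(values: Seq[Double]): Stats =
    values.foldLeft(empty)((acc, x) => acc.add(x))

  // Stream mode: emit an updated aggregate after every incoming value.
  def stream(values: Iterator[Double]): Iterator[Stats] =
    values.scanLeft(empty)((acc, x) => acc.add(x)).drop(1)
}
```

Because both entry points delegate to the same `add` rule, the last aggregate emitted in stream mode equals the batch result over the same values. That property is what removes the need to maintain two separate implementations.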

In our experience, this capability has proved especially useful in Industrial IoT / predictive maintenance projects. Incoming time series data from remote machines is analysed as it is ingested into the HFactory cluster: the system flags possible anomalies in real time and warns end users of impending equipment failure. In parallel, both the raw data points and the Spark-computed statistical and analytic functions are stored in HBase for subsequent analysis and operational reporting.
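As a rough illustration of flagging anomalies in a time series, here is one simple technique in plain Scala: a value is flagged when it deviates from the mean of the preceding window by more than a given number of standard deviations. This is only an assumed example; the detection methods used in actual projects are not described here, and `AnomalySketch`, `window` and `threshold` are hypothetical names.

```scala
object AnomalySketch {
  // Returns one flag per input value. Index i is flagged when values(i) lies more
  // than `threshold` standard deviations away from the mean of the `window` values
  // immediately before it. The first `window` values are never flagged.
  def flag(values: Vector[Double], window: Int, threshold: Double): Vector[Boolean] =
    values.indices.toVector.map { i =>
      if (i < window) false
      else {
        val past = values.slice(i - window, i)
        val mean = past.sum / window
        val variance = past.map(v => (v - mean) * (v - mean)).sum / window
        val std = math.sqrt(variance)
        // A perfectly flat window (std == 0) gives no scale to compare against.
        std > 0 && math.abs(values(i) - mean) > threshold * std
      }
    }
}
```

With a window of 5 and a threshold of 3, a reading of 25.0 arriving after five readings close to 10.0 is flagged, while the normal reading that follows it is not, because the spike has widened the spread of the window.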

Overall, the combined solution makes batch analytics, real-time processing and data servicing work together harmoniously. HFactory not only transparently manages the persistence of Spark job output into HBase, but also gracefully stitches together all the components of a complete data-driven application, from the ingestion of data to its exposure via a standard RESTful API and its visualisation in a web application.