The GridGain Data Lake Accelerator speeds up data lake access by providing bi-directional integration with Apache™ Hadoop®. This integration brings historical data into the same in-memory computing layer as operational data, enabling real-time analytics and computing on the combined data to drive real-time business processes. It leverages the GridGain Unified API and a native Apache Spark™ connector to power real-time HTAP (hybrid transactional/analytical processing), in which transactions and analytics are performed on the same operational dataset.
“Many of today’s digital transformation and IoT use cases require real-time analytics against a combination of data lake and operational data,” said Abe Kleinfeld, president and CEO of GridGain. “The GridGain Data Lake Accelerator addresses the requirements of today’s businesses to gain instant insight, capitalize on opportunities as they arise and automate decision making.”
“Many companies have created Hadoop-based data lakes with a view to consolidating data from multiple data sources and serving the processing and analytics needs of multiple use-cases, but have then struggled to generate the expected value,” said Matt Aslett, Research VP, Data, AI and Analytics, 451 Research. “By bringing its in-memory compute functionality to the data lake, GridGain is providing an option for accelerating access to historical and live data to support real-time decision-making.”
Typical use cases for the GridGain Data Lake Accelerator include using historical data to enrich real-time data streams, calculating thresholds for real-time operational triggers from historical trends, and displaying historical and real-time data together in operational dashboards.
For example, a transportation company might collect a continuous stream of data from its vehicle engines. The data is ingested, processed and analyzed, then stored in a data lake, with only the most recent data retained in the operational data store. When an anomalous reading in the live data triggers an alert for a particular engine, the system needs to analyze that engine's data, both historical and live, to identify the root cause of the problem. An infrastructure powered by GridGain’s in-memory computing platform, Kafka, Spark and Hadoop makes this possible. Apache Kafka feeds the live streaming data to the GridGain in-memory computing platform and to the Hadoop data lake. Spark retrieves the required data from the data lake and delivers it to the in-memory computing platform. The GridGain in-memory computing platform maintains the combined dataset in memory and runs real-time queries across it. The result is deep and immediate insight into the causes of the anomalous reading.
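The threshold and root-cause pattern described above can be illustrated with a minimal sketch in plain Python. It does not use the GridGain, Kafka, Spark or Hadoop APIs. The sample data, the function names (`alert_threshold`, `find_anomalies`, `combined_view`) and the "mean plus three standard deviations" rule are all assumptions made for illustration, not part of the product.

```python
from statistics import mean, stdev

# Hypothetical historical readings per engine, standing in for data
# that would normally be retrieved from the data lake.
HISTORICAL = {
    "engine-1": [90.0, 92.0, 91.0, 89.0, 93.0],
    "engine-2": [70.0, 71.0, 69.0, 70.0, 70.0],
}


def alert_threshold(history, k=3.0):
    """Upper alert threshold: mean plus k sample standard deviations.

    The k-sigma rule is an assumed, illustrative choice.
    """
    return mean(history) + k * stdev(history)


def find_anomalies(live_readings, historical, k=3.0):
    """Return the (engine_id, value) live readings that exceed the
    threshold derived from that engine's historical readings."""
    thresholds = {
        engine: alert_threshold(history, k)
        for engine, history in historical.items()
    }
    return [
        (engine, value)
        for engine, value in live_readings
        if engine in thresholds and value > thresholds[engine]
    ]


def combined_view(engine_id, historical, live_readings):
    """Merge historical and live readings for one engine, mimicking the
    combined dataset a root-cause query would run against."""
    recent = [value for engine, value in live_readings if engine == engine_id]
    return historical.get(engine_id, []) + recent


# Hypothetical live stream: (engine_id, reading) pairs.
LIVE = [("engine-1", 92.5), ("engine-2", 70.5), ("engine-1", 104.0)]

anomalies = find_anomalies(LIVE, HISTORICAL)
```

With this sample data, only the `104.0` reading for `engine-1` exceeds its historical threshold, and `combined_view("engine-1", HISTORICAL, LIVE)` returns that engine's five historical readings followed by its two live ones. In a real deployment the in-memory platform would hold both datasets and the query engine would perform the equivalent join.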