MapR, in partnership with Databricks, has added a key new feature to its Apache Hadoop distribution for Big Data. Now, the MapR platform integrates the open source Apache Spark software stack, which dramatically increases the efficiency of large-scale data processing, especially in conjunction with in-memory computing.
Apache Spark, which originated at the University of California, Berkeley, and is now an Apache Software Foundation project with heavy involvement from Databricks, can run on top of the Hadoop Distributed File System (HDFS), the file system tailored for Hadoop Big Data deployments. But Spark processes data differently from other Hadoop software. Instead of adopting the two-stage MapReduce strategy, which writes results back to storage between jobs, Spark can run repeated queries against the same data while keeping it in memory, which can make iterative and interactive data analysis much more efficient.
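The difference between re-reading data for every query and holding it in memory can be illustrated with a small sketch. This is plain Python, not Spark code, and the class and function names (`DiskBackedDataset`, `CachedDataset`, `run_queries`, `demo`) are hypothetical, chosen only for illustration. It counts how many times a tiny sample file is read from storage when three queries are run against it, with and without an in-memory copy.

```python
import os
import tempfile


def write_sample_log(path):
    """Write a tiny, made-up log file to stand in for a large data set."""
    lines = [
        "INFO start",
        "ERROR disk full",
        "WARN slow response",
        "ERROR timeout",
        "INFO done",
    ]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


class DiskBackedDataset:
    """Re-reads the file from storage on every query and counts the reads."""

    def __init__(self, path):
        self.path = path
        self.reads = 0

    def records(self):
        self.reads += 1
        with open(self.path) as f:
            return [line.rstrip("\n") for line in f]

    def count_matching(self, keyword):
        return sum(1 for record in self.records() if keyword in record)


class CachedDataset(DiskBackedDataset):
    """Reads the file once, then serves every later query from memory."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def records(self):
        if self._cache is None:
            self._cache = super().records()  # the only trip to storage
        return self._cache


def run_queries(dataset, keywords):
    """Run one counting query per keyword against the same data set."""
    return {keyword: dataset.count_matching(keyword) for keyword in keywords}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.log")
        write_sample_log(path)
        keywords = ["ERROR", "WARN", "INFO"]

        disk = DiskBackedDataset(path)
        cached = CachedDataset(path)
        disk_results = run_queries(disk, keywords)
        cached_results = run_queries(cached, keywords)
        return disk_results, cached_results, disk.reads, cached.reads


if __name__ == "__main__":
    disk_results, cached_results, disk_reads, cached_reads = demo()
    print("results match:", disk_results == cached_results)
    print("storage reads without cache:", disk_reads)
    print("storage reads with cache:", cached_reads)
```

Both versions return the same answers; only the number of trips to storage differs. In Spark itself, an application asks for this behavior by calling `cache()` or `persist()` on a data set before querying it repeatedly.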
The result, according to Databricks, is application performance that is up to 100 times faster than Hadoop MapReduce when data is held in memory, and up to 10 times faster when it is read from traditional disk-based storage.
MapR is hoping to capitalize on those performance improvements to help the MapR Hadoop distribution appeal to enterprises that demand high-performance computing. "With this release, MapR extends its lead in the Hadoop market for high performance by enabling Spark applications to run on the world record-holding distribution for Hadoop, which uniquely allows streaming writes directly to the data platform," according to the company.
MapR is also pitching Spark integration as a way to improve data quality and derive better information from data, since Spark-powered applications "are operating on more real-time data, which ultimately enables faster fraud detection, better personalization of media, higher quality from manufacturing processes and other operational analytic use cases."
That Spark is open source means it's likely to continue evolving rapidly, expanding the applicability of Hadoop and making it more useful in high-performance environments.