Using Open Source to Reinvent the Data Warehouse
Add up all the money invested in data warehouses over the years and the total easily runs into the billions, perhaps even trillions, of dollars. But ask IT organizations whether their companies are generating much business value from those investments, and most would simply shake their heads and groan.
That's because most data warehouses are used to generate canned reports. The overall environment is simply too slow to respond to anything approaching a truly interactive series of queries launched in real time. In fact, by the time most IT organizations can respond to a question, the people who launched that query have long since forgotten why they asked it in the first place.
For that reason there's been rising interest in a set of complementary open source technologies that promise to enable the development of data warehouse applications capable of processing massive amounts of big data in real time. While most of that data is stored in Hadoop, the three core open source technologies that will enable these applications are Storm, a real-time stream processing engine; Spark, a general-purpose cluster computing framework for large-scale data processing; and Kafka, a distributed messaging system.
Massive amounts of venture capital funding are pouring into these platforms. For example, Confluent, the company founded by the creators of Kafka, announced this week it has raised an additional $24 million in funding. Confluent CEO Jay Kreps said interest in Kafka is growing by leaps and bounds because it provides an open source mechanism that makes it possible to share massive amounts of data in real time. In essence, Kafka is the pipeline that feeds data to Storm and Spark.
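The reason one Kafka pipeline can feed both Storm and Spark is Kafka's core design: each topic is an append-only log, and every consumer tracks its own read offset, so a streaming engine and a batch engine can read the same feed independently. The toy in-memory sketch below illustrates that idea only; it is not the real Kafka client API, and the class and sample events are hypothetical.

```python
# Toy sketch of Kafka's central idea: an append-only topic log that
# multiple consumers read independently at their own offsets.
# Illustration only -- not the actual Kafka API.

class TopicLog:
    """An append-only log; each consumer keeps its own offset."""

    def __init__(self):
        self.records = []

    def produce(self, record):
        # Producers only ever append; existing records are immutable.
        self.records.append(record)

    def consume_from(self, offset):
        # Return everything past the caller's offset, plus the new
        # offset it should resume from next time. Because offsets are
        # per-consumer, a stream engine (e.g. Storm) and a batch engine
        # (e.g. Spark) can share the same pipeline without interfering.
        return self.records[offset:], len(self.records)


log = TopicLog()
for event in ["click:home", "click:cart", "purchase:42"]:
    log.produce(event)

# Two independent consumers at different positions in the same log
full_feed, next_offset = log.consume_from(0)  # sees all three events
late_feed, _ = log.consume_from(2)            # sees only the newest
```

The key design choice this mimics is that the broker stays dumb and fast: it never deletes-on-read or tracks delivery per message, which is what lets Kafka move massive volumes to many downstream systems at once.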
While most IT organizations currently view Hadoop as an adjunct to their existing data warehouse investments, the rise of Storm, Spark and Kafka together represents a much more serious threat to the existence of the data warehouse as we currently know it. Instead of taking data from Hadoop and feeding it into a relational database, Storm, Spark and Kafka make it feasible to process all that data directly on top of Hadoop. In effect, the data warehouse becomes a logical entity running on top of a Hadoop platform that is capable of processing data in real time at a fraction of the cost of a traditional data warehouse.
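Concretely, "the warehouse becomes a logical entity" means classic warehouse queries run as code against the data where it already lives, rather than after an ETL load into a relational database. The sketch below shows the pattern with a plain in-memory group-by; Spark applies the same pattern distributed across a Hadoop cluster. The sample records and function name here are hypothetical, chosen only to illustrate the idea.

```python
# Toy sketch of querying data "in place" instead of loading it into a
# separate relational warehouse. Spark runs this kind of aggregation
# across a cluster; here it is a single-process stand-in.
from collections import defaultdict

# Hypothetical stand-in for records that would live in Hadoop
sales = [
    {"region": "east", "amount": 100},
    {"region": "west", "amount": 250},
    {"region": "east", "amount": 50},
]

def revenue_by_region(records):
    """Rough equivalent of: SELECT region, SUM(amount) GROUP BY region."""
    totals = defaultdict(int)
    for r in records:
        totals[r["region"]] += r["amount"]
    return dict(totals)

result = revenue_by_region(sales)  # {'east': 150, 'west': 250}
```

The cost argument in the paragraph above follows from this pattern: when the query engine comes to the data, there is no second copy of the data to license, store, and keep synchronized.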
It will still take a while for these technologies to become robust enough to be deployed in production environments. But at this juncture there's enough evidence to suggest that it's now definitely going to happen. For solution providers that have made a living off of data warehouse applications (which probably account for half of all the dollars spent on enterprise IT), the emergence of a new way to deliver that functionality at a much lower cost is cause for pause.
Naturally, there's a huge opportunity for solution providers to lead that transformation. On the other hand, solution providers with heavy investments in data warehouse technologies from Oracle, Teradata, IBM and Microsoft may want to think carefully about their strategic direction. There's a major bend in the proverbial road ahead, and solution providers that fail to make that turn may very well soon find themselves heading toward a dead end from which there will be no return.