Big Data just gets bigger
May 25, 2012 —
Why all the enthusiasm for Hadoop? Because there's no alternative at the moment if you have to deal in the petabyte range. Below that threshold a number of other solutions are available, but even vendors have realized that no matter how robust their solutions are, Hadoop integrations can only make them better.
John Bantleman, CEO of RainStor, said that Hadoop really became relevant around two years ago. “Go back two years, and the data management landscape was all [online transaction processing] relational databases, like Oracle,” he said. “Once you start to hit velocity and volume, you'll generate hundreds of terabytes of data and billions of records a day. Your general row-based relational database tops out. That's forcing customers to look at alternative solutions.
“Part of what's available in the market is data-warehousing technology in products like Teradata. Those do scale to petabytes, but they're extremely costly. They cost hundreds of millions of dollars to put that infrastructure in place. Hadoop has a cloud-like infrastructure, which allows you to manage that data at that scale at a fraction of the cost.”
And although RainStor essentially competes with Hadoop, the company still offers its products in forms that work on top of or in conjunction with Hadoop. So while RainStor's bread and butter is storing large amounts of data in its highly compressed database, it can also run analytics across a Hadoop cluster rather than within its own engine.
“My view is Hadoop is going to be a platform, kind of like Linux is a platform,” said Bantleman. “It's going to be a management system to manage Big Data. Generally, you have to build the stack out to meet enterprise requirements.”
Horton Hears a 2.0
Hortonworks (a producer of supplemental Hadoop software) is pushing forward efforts to do just this: turn Hadoop into a job-scheduling cluster-management system, rather than just a vehicle for MapReduce. As the company charged with maintaining and driving innovation in the Apache Hadoop Project, Hortonworks is focused on two things: training and improving open-source Hadoop.
On the other side of the coin, Cloudera is the commercial services and products company for Hadoop. It offers its own distribution of Hadoop, coupled with management tools to make cluster control simpler.
But it is Hortonworks that is thinking most heavily about the next version of Hadoop. Dubbed version 2.0 by some, this next edition will include complete rewrites of many aspects of the system.
Shaun Connolly, vice president of corporate strategy at Hortonworks, said that Hadoop 2.0 will be about updating some of the lagging portions of the project to be more in line with modern needs.
“Today's MapReduce is a job-processing architecture and a task tracker,” he said. “The Hadoop 2.0 effort really has been focused on separating the notions of the MapReduce framework paradigm from the resource-management capabilities of the platform, and generalizing the platform so MapReduce can just be one type of application framework that can run on the platform.”
That means Hadoop 2.0 is poised to be more like a data center operating system than simply a MapReduce bucket and scheduler. “Other types of applications we can foresee coming would be message-passing interface applications, graphic-processing applications, stream-processing systems, those types of things,” said Connolly. “At the end of the day, they begin to open up Hadoop. We view Hadoop as a data platform, and for the data platform to continue to be relevant, it needs to open itself up to other work cases and to effectively store the data across larger clusters.”
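Connolly's distinction between the MapReduce programming model and the platform that runs it is easier to see with the model stripped to its essence. The sketch below is plain Python, not Hadoop code, and the function names are our own; it shows the map, shuffle and reduce phases that, in Hadoop 2.0, become just one application framework among many running on the shared resource manager:

```python
from collections import defaultdict

# Map phase: each input record becomes zero or more (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

# Shuffle phase: pairs are grouped by key. On a real cluster this is
# the step that moves data between nodes.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are combined into one result.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data gets bigger", "big data needs hadoop"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In Hadoop 1.x, the scheduler that farms these phases out across machines is welded to this one programming model; the 2.0 work makes the scheduler generic so other models can share the same cluster.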
HDFS too is receiving an update in Hadoop 2.0, he said, and will become highly available in the next major release. Making the file system highly available will vastly improve the performance of HBase, the in-Hadoop database framework. HBase has been slowly evolving to become fast enough for front-line usage, with the ultimate goal of allowing Hadoop to serve all the data for your sites, not just provide long-term storage.