James Kobielus, senior analyst at Forrester Research, has been knee-deep in enterprise elephant experiments for well over a year now. As Forrester's point-man on Hadoop, Kobielus has a lot of insight into what works and what doesn't for enterprises undertaking Hadoop deployments.
I spoke with him today about the Hadoop 1.0 release and its focus on HBase as a stronger platform for enterprises. Here, then, are some of his thoughts on Hadoop 1.0:
It's now very appealing to use HBase. Last week, I published my Hadoop case studies on early adopters. In there is a case study I did with Yahoo!, and this past year I interviewed about a dozen early adopters, many of them online service providers. I'm seeing a growing number of deployments.
There is a strong push towards using HBase as the principal database in a Hadoop cluster for the near real-time applications that enterprises are building on Hadoop, like social media monitoring and sentiment monitoring applications. It's not that anyone is backing away from HDFS: we see hybrid deployments where there are two or more different databases or data storage platforms in use. HDFS is a file system. Massive amounts of unstructured content are brought into HDFS and extracted into HBase and other databases for further analysis. We see a fair amount of MySQL. We saw some that use MySQL and HBase and even NoSQL databases like CouchDB and MongoDB. These hybrid deployments are becoming more common.
EMC Greenplum is another type of hybrid deployment: a relational database plus Hadoop, specifically Greenplum HD, which is their own tweak to MapR's distribution. That's an example of what I see more and more often: using traditional data warehousing technology and Hadoop in the same implementation.
Oracle has this same song, but they have the next verse: a big appliance that incorporates Cloudera's distribution of Hadoop and Oracle's key-value store built on Berkeley DB.
The important thing here is that 1.0 is simply a milestone in the maturation of Hadoop. It jumped from 0.22 to 1.0. That jump was symbolic: Hadoop has come into its own, and has become more stable and mature.
So, in this release there are additional features in HBase: append, hflush and some security. There's WebHDFS, which I believe is a REST interface for HDFS. There are various performance enhancements and bug fixes. It's what you'd expect from a 1.0 release. It's been 6 years or so since the Hadoop project got going under Doug Cutting, and 5 years since it was open sourced by Yahoo! And we're now bringing the first-ever Forrester Wave study on Hadoop to market.
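[For readers who have not used a REST interface to HDFS, here is a minimal sketch of how a WebHDFS request URL is put together. This is an illustration, not something James described: the helper name webhdfs_url and the host name are hypothetical, and port 50070 is assumed to be the NameNode's HTTP port on the cluster in question. The /webhdfs/v1 path prefix and the op query parameter follow the published WebHDFS URL scheme.]

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS URL: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&...

    `webhdfs_url` is a hypothetical helper written for this example.
    `path` is an absolute HDFS path; `op` is a WebHDFS operation name
    such as LISTSTATUS or OPEN.
    """
    # The operation goes first in the query string, followed by any extras.
    query = urlencode({"op": op, **params})
    # quote() percent-encodes unsafe characters but leaves "/" untouched.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Assumed host and port, for illustration only.
print(webhdfs_url("namenode.example.com", 50070, "/user/demo", "LISTSTATUS"))
```

[An HTTP GET on a URL like this asks the cluster to list a directory. The sketch only builds the string and makes no network call.]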
All of these events are signaling that we're at the beginning of the maturation of this market.
Look at Oracle. They decided, for various reasons, to partner with Cloudera. Cloudera is the best established of the startups, with over 100 customers in deployment. Oracle obviously wants to get a Hadoop offering out now, while the market is continuing to build, so they can begin the process of duking it out with EMC. I've tweeted the same sentiment as others: I can't imagine Oracle relying on a partner forever. It's as likely now as it was last year that Oracle, maybe, is sizing Cloudera up for acquisition.
With Oracle having jumped into the market, that further validates Hadoop. Actually, they say it's generally available, but I haven't seen it demonstrated yet, so I'll reserve judgement on their deployment until I've seen it.
One of the things that impressed me about Oracle is that when I asked, point blank, “Are you committed to contributing code changes back to Apache?” they said “yes.” They indicated they are strongly committed to the core Apache Hadoop stack, and to participating in that community.
So, they're not going to fork Apache Hadoop.
But there are a number of distributions of Hadoop. There's Cloudera's, there's MapR, Greenplum, and IBM with BigInsights. Hortonworks is by-the-book, gospel open source Apache, and Microsoft is partnering with Hortonworks. Microsoft has made no bones about wanting to stay open source: they say they're committed to remaining open source. Oracle is indicating they also very much want to adhere closely to the open source core. These are all good signs politically, which means we won't get distracted, like the Linux community did, by things that won't be differentiating.
The distribution is not differentiating. What is differentiating is the degree of hardware integration and the modeling tools they offer. It's their ability to support both appliance-based and cloud-based, and also software-based packaging of Hadoop functionality. It's about the support for multiple form factors, and things like visual modeling and development tools that they roll out in conjunction with their offerings. It's things like support for the open source process.
Hortonworks is primarily developing the next generation of MapReduce and HDFS, and adding such things as a cluster management sub-project to the whole Apache project. I find it interesting that Cloudera and others have not engaged in sniping against Hortonworks as chief committer, now that it has inherited Yahoo!'s role. I've detected no covert attempts to undermine Hortonworks, and that is a good thing, actually.
I think the market recognizes that the role Hortonworks is playing is good for everybody. I see Hortonworks helping other vendors evaluate Hadoop. As all the vendors Hadoopicize [you read that word here first, folks: credit James] their products, Hortonworks is becoming an important partner for the ISVs. If you look at the Apache press release, it's written by an exec at Hortonworks. In many ways, Hortonworks is generally recognized as driving that side of it. I don't want to overstate their role, however, because Cloudera contributes a lot of innovative software, too.
The fact we have a 1.0 should be a clear signal to everyone that Hadoop has arrived at a critical milestone as a project, and as a marketplace.
One thing I'd like to see is the community coalesce around the road map: What are they actually evolving towards? What are the missing pieces that still need to be built for this community to be able to say “we've built it out”? It's open ended, and to a degree, it needs to be, but it's not the end-all of big data. I'd like to see a target. Another thing I'd like to see is a real standards group step in. That's the same bunch of players as the open source vendor community, plus users. I'd like to see a real equivalent to a W3C or OASIS begin to have a real formal process now for defining the various layers of Hadoop in terms of MapReduce, HDFS, HBase, Hive, Pig and Mahout, so we can get clarity around versions and a reference architecture, and so there can be the beginnings of some cross-vendor certification profiles.
That's what I'd like to see. Not that I see any real momentum in that direction, but there needs to be the beginnings of governance. Governance that's more than the open source process. The market will be strangled and momentum will die if we have too many vendor-proprietary deployments with no clear interoperability certifications.
Ideally, enterprises want to mitigate the risk of incompatibilities, but there are no clear standards. That's the next step in the evolution of the market.