SD TIMES BLOG
 
ahandy

On Hadoop 1.0: James Kobielus

by Alex Handy 01/17/2012 01:54 PM EST

 

James Kobielus, senior analyst at Forrester Research, has been knee-deep in enterprise elephant experiments for well over a year now. As Forrester's point-man on Hadoop, Kobielus has a lot of insight into what works and what doesn't for enterprises undertaking Hadoop deployments.

I spoke with him today about the Hadoop 1.0 release and its focus on HBase as a stronger platform for enterprises. Here, then, are some of his thoughts on Hadoop 1.0:

It's now very appealing to use HBase. Last week, I published my Hadoop case studies on early adopters. In there is a case study I did with Yahoo!, and this past year I interviewed about a dozen early adopters, many of them online service providers. I'm seeing a growing number of deployments.

There is a strong push towards using HBase as the principal database in a Hadoop cluster for the near real-time applications that enterprises are building on Hadoop, like social media monitoring and sentiment monitoring applications. It's not that anyone is backing away from HDFS: we see hybrid deployments where two or more different databases or data storage platforms are in use. HDFS is a file system. Massive amounts of unstructured content are brought into HDFS and extracted into HBase and other databases for further analysis. We see a fair amount of MySQL. We saw some that use MySQL and HBase, and even NoSQL databases like CouchDB and MongoDB. These hybrid deployments are becoming more common.

EMC Greenplum offers another type of hybrid deployment: a relational database plus Hadoop, specifically Greenplum HD, which is their own tweak to MapR's distribution. That's an example of what I see more and more often: using traditional data warehousing technology and Hadoop in the same implementation.

Oracle has this same song, but they have the next verse: a big appliance that incorporates Cloudera's distribution of Hadoop, and Oracle's key value store built on BerkeleyDB.

The important thing here is that 1.0 is simply a milestone in the maturation of Hadoop. It jumped from 0.22 to 1.0. That was a symbolic signal that Hadoop has come into its own, and has become more stable and mature.

So, in this release there are additional features for HBase: append, hflush and some security. There's WebHDFS, which I believe is a REST interface for HDFS. There are various performance enhancements and bug fixes. It's what you'd expect from a 1.0 release. It's been 6 years or so since the Hadoop project got going under Doug Cutting, and 5 years since it was open sourced by Yahoo! And we're now publishing the first-ever Forrester Wave study on the Hadoop market.
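To make the WebHDFS remark concrete, here is a minimal sketch of how a client might build a WebHDFS request URL. It assumes the documented URL layout `http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>`; the host name, the file paths and the helper name `webhdfs_url` are hypothetical, and port 50070 is assumed here as the NameNode's HTTP port. The sketch only constructs URLs and makes no network calls.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, port=50070, **params):
    """Build a WebHDFS-style REST URL (illustrative helper, not part of Hadoop).

    Assumed layout: http://<host>:<port>/webhdfs/v1/<path>?op=<OP>&<params>
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    # The operation name goes first in the query string, followed by any extras.
    query = urlencode({"op": op.upper(), **params})
    # quote() percent-encodes unsafe characters but leaves '/' separators intact.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


if __name__ == "__main__":
    # Hypothetical NameNode host and file paths, for illustration only.
    print(webhdfs_url("namenode.example.com", "/logs/2012/01/17.log", "open"))
    print(webhdfs_url("namenode.example.com", "/logs", "liststatus"))
```

Because the interface is plain HTTP, any language with an HTTP client can read from or list HDFS this way without Hadoop's Java libraries, which is presumably part of its appeal for enterprise integration.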

All of these events are signaling that we're at the beginning of the maturation of this market.

Look at Oracle. They decided, for various reasons, to partner with Cloudera, the best established of the startups; Cloudera has over 100 customers in deployment. Oracle obviously wants to get a Hadoop offering out now while the market is continuing to build so they can begin the process of duking it out with EMC. I've tweeted the same sentiment as others: I can't imagine Oracle relying on a partner forever. It's as likely now as it was last year that Oracle, maybe, is sizing Cloudera up for acquisition.

With Oracle having jumped in the market, that further validates Hadoop. Actually, they say it's generally available, but I haven't seen it demonstrated yet so I'll reserve judgement on their deployment until I've seen it.

One of the things that impressed me about Oracle is that when I asked, point blank, “Are you committed to contributing code changes back to Apache?” they said “yes.” They indicated they are strongly committed to the core Apache Hadoop stack, and to participating in that community.

So, they're not going to fork Apache Hadoop.

But there are a number of distributions of Hadoop. There's Cloudera's, there's MapR, Greenplum, and IBM with BigInsights. Hortonworks is by-the-book, gospel open source Apache, and Microsoft is partnering with Hortonworks. Microsoft has made no bones about wanting to stay open source: they say they're committed to remaining open source. Oracle is indicating they also very much want to adhere closely to the open source core. These are all good signs politically, which means we won't get distracted like the Linux community did by things that won't be differentiating.

The distribution is not differentiating. What is differentiating is the degree of hardware integration and the modeling tools they offer. It's their ability to support both appliance-based and cloud-based, and also software-based packaging of Hadoop functionality. It's about the support for multiple form factors, and things like visual modeling and development tools that they roll out in conjunction with their offerings. It's things like support for the open source process.

Hortonworks is primarily developing the next generation of MapReduce and HDFS, and adding such things as a cluster management sub-project to the whole Apache project. I find it interesting that Cloudera and others have not engaged in sniping against Hortonworks as chief committer, a role it inherited from Yahoo! I've detected no covert attempts to undermine Hortonworks, and that is a good thing, actually.

I think the market recognizes that the role Hortonworks is playing is good for everybody. I see Hortonworks helping other vendors evaluate Hadoop. As all the vendors Hadoopicize [you read that word here first, folks: credit James] their products, Hortonworks is becoming an important partner for the ISVs. If you look at the Apache press release, it's written by an exec at Hortonworks. In many ways, Hortonworks is generally recognized as driving that side of it. I don't want to overstate their role, however, because Cloudera contributes a lot of innovative software, too.

The fact we have a 1.0 should be a clear signal to everyone that Hadoop has arrived at a critical milestone as a project, and as a marketplace.

One thing I'd like to see is the community coalesce around the road map: What are they actually evolving towards? What are the missing pieces that still need to be built for this community to be able to say “we've built it out?” It's open ended, and to a degree, it needs to be, but it's not the end-all of big data. I'd like to see a target. Another thing I'd like to see is a real standards group step in. That's the same bunch of players as the open source vendor community plus users. I'd like to see a real equivalent to a W3C or OASIS begin to have a real formal process now for defining the various layers of Hadoop in terms of MapReduce, HDFS, HBase, Hive and Pig and Mahout.... So we can get clarity around versions and a reference architecture. So there can be the beginnings of some cross-vendor certification profiles.

That's what I'd like to see. Not that I see any real momentum in that direction, but there needs to be the beginnings of governance. Governance that's more than the open source process. The market will be strangled and momentum will die if we have too many vendor-proprietary deployments with no clear interoperability certifications.

Ideally, enterprises want to mitigate the risk of incompatibilities, but there are no clear standards. That's the next step in the evolution of the market.

Share this link: http://www.sdtimes.com/blog/1944

Tags: hadoop

Hadoop distro wars on the horizon

by Alex Handy 10/24/2011 05:24 PM EST

There has been a lot of enterprise Hadoop news in the last month or two. Everyone from Oracle to IBM is getting into the game, or has already been there. Everyone's got their own flavor. Sounds just like the Linux distro wars of the turn of the millennium.

Forrester Research is putting out five reports on Hadoop in the enterprise, and Forrester senior analyst James G. Kobielus said that Hadoop needs a standards body all its own. "What Hadoop needs is a clear reference framework with layers. Here's the storage layer, and your options are HDFS, HBase, Cassandra, etc. The data warehousing and access layer is things like Pig, Hive, HCatalog."

More from Kobielus on the need for Hadoop standards. "Ten years ago, when the whole SOA thing got going, solutions popped out of the ground like kudzu. They became so confusing they had to write standards, like WS-*, and use consortiums like OASIS, so vendors and users could have a greater degree of assurance that there was greater interoperability. The big data world needs something similar. Capabilities under Hadoop have developed to the point where we need a reference framework and something resembling a standards process. We need Map Reduce 1.0, Map Reduce 2.0, each with clear features, and clear APIs with certification or interoperability profiles fleshed out. It's an open source initiative, and they like to say, 'The open source process will solve all that!' But it won't. Think of it like Linux. We're starting into distribution wars. This is going to be a messy, competitive and cut-throat process, where some distributions become ubiquitous."

Kobielus, I feel, has hit the nail on the head. We're seeing a fracturing of Hadoop stacks.

Kobielus' blog is highly recommended, as are his reports. Hadoop is awesome, powerful and truly useful to business, but installing it into your business processes is a big project to undertake. That's why the early Hadoop companies are making the most money on consulting services, not products.

And just as an aside, while Kobielus says the distro wars are coming, I've already heard one Hadoop contributor say "they're already here."

Share this link: http://www.sdtimes.com/blog/1877

Tags: hadoop

Oracle's NoSQL, Hadoop plays

by Alex Handy 10/05/2011 06:24 PM EST

From Oracle OpenWorld, the real news this week is around Oracle's further playing the database Swiss army knife. Anything you need, from a high speed key-value store, to a hardware-integrated Hadoop cluster, they've got your "database." That word is becoming increasingly blurred by the variety of NoSQL solutions, and HBase. But Oracle now offers both.

The SleepyCat BerkeleyDB is now being touted as Oracle's NoSQL. As a key-value store, it can alleviate some of the pain associated with the scalability of a database in the cloud. Next door at the show, Oracle's showing off its Exalytics machines, which run both their analytics and intelligence packages, as well as Hadoop.

Will enterprises buy these offerings, rather than implementing their own open source stacks? Quite likely, actually. I've seen numerous complaints about the ease of administration for many of the new-world databases, and this is exactly where the various NoSQL and Hadoop related startups have been offering value-add.

Looks like their market just got a lot more competitive. What remains to be seen is whether Oracle can offer better controls and distributions of these tools than their leaner, meaner competitors.

Share this link: http://www.sdtimes.com/blog/1863

Tags: hadoop

We are the big data problem

by Alex Handy 11/19/2009 04:53 PM EST

Like a kid waiting for Christmas to come, I have been watching the Mahout project with great anticipation. When you toss around the concepts of map/reduce and machine learning, there's an awful lot of potential for radical ninjas to ensue. While the magical science-fiction world of artificial intelligence is still in a fetal state within our reality, it is, nonetheless, a growing science.

One of the things humans are discovering about making real thinking machines is that a wealth of experience goes a long way. It's the same for humans and machines: The more memories we have, the more we are able to structure thought processes based on those memories, and to learn from them. For computers, memories can be thought of as datasets. And the bigger the dataset, the more understanding can be extracted.

We all know about the BIG DATA PROBLEM. But if I were going to write something here in the guise of a souped-up VC, or grizzled startup veteran, I'd say the big data problem, when observed from the right angle, is more like a big data opportunity.

Imagine how much optimization information you could pull from six years of user logs. Need metrics, anyone? Try juicing all of your user stats. Not just this month's stats, but all of them. Why not throw in the databases of old user info you got from that newly acquired company you're still digesting? Hadoop is the place to put all of this stuff, as we should all know by now. But the data in Hadoop is only as good as the people who extract meaning from it. And with petabytes of data available at once, what human being could ever comprehend, much less query, that mound of information? We can, and do, poke at it, and pull massive amounts of data from it. But there is the potential for infinitely more meaning to be derived from our data.
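For readers who haven't seen the pattern, here is a minimal, single-machine sketch of the map/reduce idea applied to user logs. It is plain Python rather than Hadoop code, and the log format (a user ID followed by a URL on each line) and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map step: emit a (user_id, 1) pair for every well-formed log line."""
    for line in log_lines:
        parts = line.split()
        # Assumed format: "<user_id> <url>"; anything shorter is skipped.
        if len(parts) >= 2:
            yield parts[0], 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical log lines standing in for years of real user logs.
    logs = ["alice /home", "bob /cart", "alice /cart", "malformed"]
    print(reduce_phase(map_phase(logs)))
```

On a real cluster the map and reduce steps would run in parallel across many machines, with the framework grouping the emitted pairs by key in between; the shape of the computation stays the same.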

The big data problem is not a problem with the machines anymore. They're ready, thanks to Doug Cutting, the Apache Foundation, Yahoo, Facebook, Cloudera, and all the hordes of other Hadoop committers out there. The big data problem is a problem with us. These datasets are just getting too big, and we can't spend our days reformatting them, writing connectors for them, or passing them through oodles of filters.

Today, enterprises spend most of their time integrating, not coding. They're making this database work with that database, and wrapping this system in that one. Most of these activities are performed to transform data from one form into another. The customer database has to be able to exchange information with the new databases brought in through company mergers. The HR system has to talk to the ERP system, and both have to be backed up according to processes that change daily. It's all about taking information, transforming it programmatically, and passing it on. And the way we do it now is ludicrously inefficient.

No, the future is not in "integrating" big data. The future is in teaching the machines to understand big data for us. We've already done this in places that aren't generally associated with artificial intelligence. Take a look at any corporate firewall or load balancing system, and you can see how the rules have moved away from being simple laws into ever more complex Turing-complete languages of their own. Most of the business processes you work with every day are essentially organically grown machine rules; they've evolved out of experience with what works for the organization. Companies are, after all, organic hive minds with mechanical arms. Large distributed systems, as it were.

If we can build enough common tools to create machine learning on top of big data, humanity as a whole will be remarkably changed for the better. Imagine machines able to identify cancer trends through the analysis of patient data correlated with weather patterns and bird migration. Who knows what sort of broad connections like that could exist in our world? Perhaps someone will teach the machines to look through our software repositories and learn how to write code by understanding every check-in, every rewrite, every refactorization...

Of course, this is all still pie-in-the-sky. Mmmm, pie. Pie is good. But you can't make pie without first making dough. And before you make dough, you have to crush grain into flour. And harvest the grain. We all currently exist somewhere in the latter portion of that metaphor, as far as building big learning machines is concerned.

No matter: The folks behind the Mahout project are working on making that dough. Mahout began as a sub-Lucene project, and has morphed into the first real attempt at building a foundation for machine learning on top of Hadoop. Version 0.2 of Mahout was officially released on Wednesday, and there are a number of interesting new features:

  • Significant performance increase (and API changes) in the collaborative filtering engine
  • K-nearest-neighbor and SVD recommenders
  • Much code cleanup, bug fixing
  • Random forests, frequent pattern mining using parallel FP growth
  • Latent Dirichlet Allocation
  • Updates for Hadoop 0.20.x

The point is: Pay attention to Mahout. Maybe go contribute some code. It's an interesting project that, I feel, is really working towards an inevitable future.

Share this link: http://www.sdtimes.com/blog/1549

Tags: hadoop | apache

Hadoop World shows diverse uses

by Alex Handy 09/21/2009 04:12 PM EST

On Oct. 2, in New York, Cloudera is hosting Hadoop World, the first real conference around the Apache Hadoop project. It will certainly be an interesting show. I've looked over the schedule, and there are only three or four talks that would qualify as Cloudera-specific marketing pitches. The rest of the talks look to be about Hadoop development, the Hadoop ecosystem and the use of Hadoop in various industries.

There will be some interesting talks at Hadoop World, and though I will not be attending (Jeff will be, as he is already on that coast), I'm hoping to watch any video of these talks that Cloudera posts online. Expect an interesting new product from Cloudera, also.

I thought I'd post some of the interesting talks that will be given at the show. Take a gander at just how varied Hadoop usage can be:

  • Large Scale Transaction Analysis, Joe Cunningham, VISA
  • Data Processing for Financial Services, Peter Krey and Sih Lee, JP Morgan Chase
  • Protein Alignment, Paul Brown, Booz Allen
  • Real-Time Business Intelligence, Bradford Stephens, Visible Technologies
  • Understanding Natural Language, Charles Ward and Karthik Balaji, General Sentiment
  • Matchmaking in the Cloud, Ben Hardy, eHarmony
  • Hadoop for Bioinformatics, Deepak Singh, Amazon Web Services

Sounds like a nice array of use cases to investigate. I must say, I can't even think of another conference where real language processing and protein alignment were discussed, and especially not a conference where those topics were covered within the same software platform.

 

Share this link: http://www.sdtimes.com/blog/1526

Tags: cloud | cloud computing | hadoop

Today, FlightCaster launched. It's a site that can predict whether or not your flight will be delayed. Before it happens. Built by Jason Freedman, Bradford Cross and eight others in about four months, FlightCaster represents, I believe, the future of application development in numerous ways. Allow me to enumerate them.

  1. FlightCaster was built fast to address a need. Faster than fast. Four months fast.
  2. FlightCaster is built on Hadoop.
  3. FlightCaster uses a JVM to run other languages, specifically Clojure.
  4. Its Hadoop cluster is hosted in Amazon's cloud.
  5. The total costs to start this business came to less than seven digits.
  6. They use Cascading.
  7. The primary use case includes a mobile app. But it would work in a mobile browser, anyway.
  8. FlightCaster is built on top of data feeds and existing data. Data is the primary resource used to build the business. Publicly available data, no less.
  9. Their secret sauce is using predictive AI for forecasting. FlightCaster actually uses AI to solve a real problem.

Share this link: http://www.sdtimes.com/blog/1506

Tags: hadoop

Hadoop creator goes to Cloudera

by Alex Handy 08/10/2009 04:33 PM EST

For those of you who aren't too busy pondering the VMware acquisition of SpringSource, you might also like to know that the creator of Hadoop, Doug Cutting, just announced he's leaving Yahoo to join Cloudera. That's a big move for a big name. I've said for a while now that Cloudera is the company with the Hadoop banner firmly in its grasp, despite the fact that Yahoo and Facebook both contribute mountains of code to the project. Cutting has blogged about the move.

Having Cutting on board will mean that Cloudera can continue to churn out disc images for Amazon's cloud, and might even be able to start building some of the products based on Hadoop that everyone is anticipating. Doug, in his blog, basically said that Yahoo got Hadoop to this point, and now he'd like to work on the broader scale of things people are doing with the Hadoop framework, not just Web indexing.

Cutting joins a team of experienced Hadoop and map/reduce folks at Cloudera. I'm expecting great things from them now that they've got a lead singer, as it were, in their merry band.

Share this link: http://www.sdtimes.com/blog/1501

Tags: cloud | cloud computing | hadoop

Hadoop companies everywhere

by Alex Handy 07/14/2009 03:02 PM EST

I'm becoming more and more convinced that Hadoop is going to be its own ecosystem that I must cover as a whole beat unto itself. Call it the Map/Reduce beat, call it the big-data crunching beat. Call it the elephant with its footprints in the butter. But no matter what you call it, companies are generating more data every day, and many of them aren't deriving business intelligence from said data. And that spells opportunity.

This morning, I spoke to Chris Wensel, cofounder of Scale Unlimited. His company specializes in Hadoop training and consulting, and he offered some good insight into how the Hadoop world is expanding. He set me straight on the fact that Hadoop is already its own ecosystem, with numerous related projects making it up: HBase, Pig, HDFS and all the other things folks have built to expand Hadoop's capabilities.

He also pointed out that there is a big gap between what Hadoop can do and what most companies need right now: That is, Hadoop is for big, slow data crunching, and many companies need smaller, faster solutions. That seems to be the expectation of super-startup Aster Data.

Another thing Wensel pointed out to me was the fact that Hadoop should live in your data center. It's all fine and dandy to put up an Amazon instance and fill it with your data, but when you're crunching a petabyte, he said, it's just too expensive. That's why Amazon lets you mail them disks. And that's also why Hadoop should be inside the firewall. Try as we might, a petabyte isn't easy to push anywhere, even inside the network.

So I'll be watching Hadoop like a hawk now that it's on my radar. We'll have lots to talk about, I'm sure, and with companies like Cloudera and semi-quiet startup Stampede, I'm sure there will continue to be innovation around the edges of the project. Of course, the core will continue to evolve too, but third parties have a way of increasing the visibility and scope of an open-source ecosystem. Let's hope that all these companies can play well together and contribute upstream.

Share this link: http://www.sdtimes.com/blog/1472

Tags: hadoop

 
 