SD TIMES BLOG

We are the big data problem

by Alex Handy 11/19/2009 04:53 PM EST

Like a kid waiting for Christmas to come, I have been watching the Mahout project with great anticipation. When you toss around the concepts of map/reduce and machine learning, there's an awful lot of potential for radical ninjas to ensue. While the magical science-fiction world of artificial intelligence is still in a fetal state within our reality, it is, nonetheless, a growing science.

One of the things humans are discovering about making real thinking machines is that a wealth of experience goes a long way. It's the same for humans and machines: The more memories we have, the more we are able to structure thought processes based on those memories, and to learn from them. For computers, memories can be thought of as datasets. And the bigger the dataset, the more understanding can be extracted.

We all know about the BIG DATA PROBLEM. But if I were going to write something here in the guise of a souped-up VC, or grizzled startup veteran, I'd say the big data problem, when observed from the right angle, is more like a big data opportunity.

Imagine how much optimization information you could pull from six years of user logs. Need metrics, anyone? Try juicing all of your user stats. Not just this month's stats, but all of them. Why not throw in the databases of old user info you got from that newly acquired company you're still digesting? Hadoop is the place to put all of this stuff, as we should all know by now. But the data in Hadoop is only as good as the people who extract meaning from it. And with petabytes of data available at once, what human being could ever comprehend, much less query, that mound of information? We can, and do, poke at it, and pull massive amounts of data from it. But there is the potential for far more meaning to be derived from our data.
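To make the "juice all of your user stats" idea a little more concrete, here is a minimal sketch of the map/reduce pattern in plain Python, counting events per user across a pile of log lines. It does not use Hadoop or any Hadoop API; the log format, the sample lines, and every function name are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical log lines in an assumed "date user action" format.
LOG_LINES = [
    "2009-11-01 alice login",
    "2009-11-01 bob search",
    "2009-11-02 alice search",
    "2009-11-03 alice purchase",
]


def map_line(line):
    """Map step: emit one (user, 1) pair for each log line."""
    _date, user, _action = line.split()
    yield (user, 1)


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum all counts emitted for one user."""
    return key, sum(values)


def count_events_per_user(lines):
    """Run map, shuffle, and reduce in-process over the given lines."""
    pairs = chain.from_iterable(map_line(line) for line in lines)
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_events_per_user(LOG_LINES))
```

On a real cluster the map and reduce steps would run in parallel across many machines, which is what makes six years of logs tractable; the single-process version only shows the shape of the computation.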

The big data problem is not a problem with the machines anymore. They're ready, thanks to Doug Cutting, the Apache Software Foundation, Yahoo, Facebook, Cloudera, and all the hordes of other Hadoop committers out there. The big data problem is a problem with us. These datasets are just getting too big, and we can't spend our days reformatting them, writing connectors for them, or passing them through oodles of filters.

Today, enterprises spend most of their time integrating, not coding. They're making this database work with that database, and wrapping this system in that one. Most of these activities are performed to transform data from one form into another. The customer database has to be able to exchange information with the new databases brought in through company mergers. The HR system has to talk to the ERP system, and both have to be backed up according to processes that change daily. It's all about taking information, transforming it programmatically, and passing it on. And the way we do it now is ludicrously inefficient.

No, the future is not in "integrating" big data. The future is in teaching the machines to understand big data for us. We've already done this in places that aren't generally associated with artificial intelligence. Take a look at any corporate firewall or load balancing system, and you can see how the rules have moved away from being simple laws into ever more complex Turing-complete languages of their own. Most of the business processes you work with every day are essentially organically grown machine rules; they've evolved out of experience with what works for the organization. Companies are, after all, organic hive minds with mechanical arms. Large distributed systems, as it were.

If we can build enough common tools to create machine learning on top of big data, humanity as a whole will be remarkably changed for the better. Imagine machines able to identify cancer trends through the analysis of patient data correlated with weather patterns and bird migration. Who knows what sort of broad connections like that could exist in our world? Perhaps someone will teach the machines to look through our software repositories and learn how to write code by understanding every check-in, every rewrite, every refactorization...

Of course, this is all still pie in the sky. Mmmm, pie. Pie is good. But you can't make pie without first making dough. And before you make dough, you have to crush grain into flour. And harvest the grain. We all currently exist somewhere in the latter portion of that metaphor, as far as building big learning machines is concerned.

No matter: The folks behind the Mahout project are working on making that dough. Mahout began as a Lucene subproject, and has morphed into the first real attempt at building a foundation for machine learning on top of Hadoop. Version 0.2 of Mahout was officially released on Wednesday, and there are a number of interesting new features:

  • Significant performance increase (and API changes) in the collaborative filtering engine
  • K-nearest-neighbor and SVD recommenders
  • Much code cleanup, bug fixing
  • Random forests, frequent pattern mining using parallel FP growth
  • Latent Dirichlet Allocation
  • Updates for Hadoop 0.20.x
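For readers who haven't met the recommenders named in the list above, here is a small sketch of user-based k-nearest-neighbor collaborative filtering in plain Python. It is not Mahout's implementation or API; the ratings, the names, and the choice of cosine similarity over shared items are all assumptions picked to keep the example short.

```python
from math import sqrt

# Hypothetical user -> {item: rating} data, invented for this example.
RATINGS = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "D": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, k=2):
    """Predict ratings for items the user has not seen, using the k most similar users."""
    target = ratings[user]
    neighbors = sorted(
        (
            (cosine_similarity(target, other), name)
            for name, other in ratings.items()
            if name != user
        ),
        reverse=True,
    )[:k]

    scores, weights = {}, {}
    for similarity, name in neighbors:
        if similarity <= 0:
            continue
        for item, rating in ratings[name].items():
            if item in target:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity

    # Similarity-weighted average rating per item, best first.
    return sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)


if __name__ == "__main__":
    for predicted, item in recommend("alice", RATINGS):
        print(item, round(predicted, 2))
```

With this toy data, the only item alice has not rated is D, so it is the single recommendation, with a predicted rating pulled toward bob's score because his tastes match hers more closely than carol's do. The appeal of running this kind of thing on Hadoop is that the similarity computation, which compares every user with every other user, is exactly the part that stops fitting on one machine.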

The point is: Pay attention to Mahout. Maybe go contribute some code. It's an interesting project that, I feel, is really working towards an inevitable future.


apache | hadoop
