When it comes to Big Data, the big news stories swirl around the Apache Hadoop project. While there are many reasons for Hadoop’s popularity, its success hasn’t done much to make the Big Data puzzle any easier to solve. While Hadoop promises a place to put all your data, actually deriving business value from that data is another matter entirely.

After all, the Big Data revolution is not just about storing that data, said Luis Maldonado, director of product management for HP Vertica. Maldonado said that enterprises want to “query that data and have a conversation with it. It allows me to have conversations I haven’t thought about before. Understanding customers, no matter what your vertical, has been a big push.”

And with so much data being generated by those customers, there’s never been a better time to try to comprehend why they do what they do. “There’s a big focus on better understanding your customer,” said Maldonado. “People are starting to understand, ‘How are my customers segmented?’ ‘How effective are my campaigns in retaining customers and acquiring them?’ ‘If I have loyalty programs, how do I understand the effect these have?’”

Unfortunately, customers don’t keep their data in neatly ordered relational data stores. They communicate with enterprises through Twitter, Facebook, the corporate website, partner sites, and even the good old-fashioned telephone.

What if there were some magical place where you could store all of this unstructured data, from customer transaction records, to security camera footage, to tweets, to relational data stores, all the way down to voice recordings of tech-support calls? And what if you could build such a data store on open-source software and commodity hardware?

The Apache Hadoop project is, if nothing else, a place to put the data and to perform computations upon it, no matter its form. A Hadoop cluster is built upon the Hadoop Distributed File System (HDFS), which can spread petabytes of data across commodity hardware reliably, though not yet in a highly available (HA) fashion. With the help of Apache HBase, data from relational stores such as MySQL and Oracle can be loaded into Hadoop with much of its structure intact. And if you have a good Java developer, you can use all of this infrastructure to perform queries upon petabytes of data at a time.
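To make that concrete, the short Java sketch below writes a single customer record into an HBase table through the HBase client API. It assumes an HBase cluster is already reachable via the configuration on the classpath; the "customers" table, its "contact" column family and the row contents are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerLoader {
  public static void main(String[] args) throws Exception {
    // Reads the cluster location from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();

    // Hypothetical table: row key is a customer ID, one "contact" column family.
    HTable table = new HTable(conf, "customers");
    try {
      Put row = new Put(Bytes.toBytes("customer-42"));
      row.add(Bytes.toBytes("contact"), Bytes.toBytes("email"),
              Bytes.toBytes("jane@example.com"));
      row.add(Bytes.toBytes("contact"), Bytes.toBytes("phone"),
              Bytes.toBytes("555-0100"));
      table.put(row); // Write the row into the cluster.
    } finally {
      table.close();
    }
  }
}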

Untangling that data
In years past, analysts using R, SAS or some other data-analysis platform would write complex computations and statistical-analysis routines that ran against a more traditional data store holding uniformly coded data.

Hadoop, however, requires a Java developer to write what’s known as a Map/Reduce job in order to process data inside the cluster. While Map/Reduce was designed to save developers time—requiring them only to write the code needed for the problem they’re trying to solve on a large data set—writing Map/Reduce jobs is still a programming task suited to an actual Java developer, not to a business analyst, for example.
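For a sense of what that task involves, here is a minimal sketch of the canonical word-count Map/Reduce job written against the Hadoop Java API, using the Hadoop 2.x-style job setup; the input and output paths are passed as command-line arguments, and the whitespace tokenization is deliberately simplistic.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // The map step: emit (word, 1) for every token in every input line.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // The reduce step: sum the counts collected for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this toy job involves a compiled jar, Hadoop's Writable types, and an understanding of how keys and values flow from mappers to reducers, which is exactly the skills gap at issue.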
Maldonado said, “There’s not a lot of talent available. There’s high demand for people who can write Map/Reduce functions and can write out a rich analytics platform on Hadoop.”

There is hope, however, in the form of Cascading, Hive, Pig and about a dozen other new and maturing data-access and management layers for Hadoop. Pig, for example, is a platform built on Hadoop to provide a procedural data-access layer through a language called Pig Latin. Hive, on the other hand, is Facebook’s attempt to build a SQL-like query language on top of Hadoop.
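To give a flavor of the Hive approach, the sketch below issues a SQL-like HiveQL query from Java over Hive's JDBC driver. It assumes a HiveServer2 instance is running in front of the cluster; the connection URL, the credentials and the "clickstream" table are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CampaignReport {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver; the host, port and database are hypothetical.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hive-gateway:10000/default", "analyst", "");
    try {
      Statement stmt = conn.createStatement();
      // A SQL-like HiveQL query; Hive compiles it into Map/Reduce jobs on the cluster.
      ResultSet rs = stmt.executeQuery(
          "SELECT campaign, COUNT(*) AS responses "
          + "FROM clickstream GROUP BY campaign");
      while (rs.next()) {
        System.out.println(rs.getString("campaign") + "\t" + rs.getLong("responses"));
      }
    } finally {
      conn.close();
    }
  }
}

The trade-off is the one Wensel describes next: the query itself is familiar SQL, but anything beyond it still lands back in Java.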

Chris Wensel, founder of enterprise Big Data company Concurrent, said, “Hadoop, in theory, is cheaper than a Teradata database or Greenplum. But mainstream people out there don’t know Java. They’re not going to learn Cascalog, they know SQL, and Greenplum and Teradata are already SQL. The value is that you’ve already written applications in SQL, but you have a team competing for resources on the Teradata system, so migrating code onto Hadoop with as little code as possible is the first pain point.”

Wensel is the creator of Cascading, an API layer for Hadoop that he said smooths over this pain point. He added, “Historically, our adoption has been through attrition. Most people get seduced by the idea that they can write a couple lines of a SQL-like syntax and put things into production. But once you want to do something interesting, and want to base your business on it, you need to do things like write unit tests.”

Cascading was built to address many of the common headaches developers encountered building on Hadoop, Hive and Pig. For example, said Wensel, “If you’re processing and seeing bad data, Cascading stores it off to the side so you can inspect it postmortem without killing the application.”

He said that because Cascading is an API, “You can build frameworks on top of it. You can build reusable code that anyone on your team can use, and you can write your own languages, like PyCascading or the JRuby variant. There are lots of other internal projects to support other higher-level languages, like Scala and Clojure.”
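As a rough illustration of what building on that API looks like, here is a minimal word-count pipe assembly in the style of Cascading 2.x; the HDFS paths are placeholders, and class packaging may differ between Cascading releases.

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
  public static void main(String[] args) {
    // Source and sink taps on HDFS; both paths are placeholders.
    Tap docTap = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap wcTap = new Hfs(new TextDelimited(true, "\t"), args[1], SinkMode.REPLACE);

    // Split each line into tokens, group by token, then count each group.
    Fields token = new Fields("token");
    Pipe docPipe = new Each("wordcount", new Fields("line"),
        new RegexSplitGenerator(token, "\\s+"), Fields.RESULTS);
    Pipe wcPipe = new GroupBy(docPipe, token);
    wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL);

    // Wire the taps and pipes into a flow and hand it to the Hadoop planner.
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, CascadingWordCount.class);
    FlowDef flowDef = FlowDef.flowDef()
        .setName("wordcount")
        .addSource(docPipe, docTap)
        .addTailSink(wcPipe, wcTap);
    new HadoopFlowConnector(properties).connect(flowDef).complete();
  }
}

Because the taps, pipes and flows are ordinary Java objects, they can be wrapped in reusable methods and exercised with standard unit tests, which is Wensel's point about basing a business on this code.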
Analytics come to the business side
Naturally, there has been demand for more traditional data-access layers inside and alongside Hadoop. A common model is to use Hadoop to sort and bundle relevant subsets of data, then offload that into a smaller, more analytics-focused data store. Companies such as DataStax, HP and Revolution Analytics are taking this approach, working in tandem with Hadoop.

David Smith, vice president of corporate marketing at Revolution Analytics, said, “If you hire data scientists, there is an excellent chance they’re already trained in R. You can use all the flexibility of R to distill that data and, for example, extract sentiment from text. There are all sorts of unimagined applications sitting there in Hadoop that need a data science process to figure out what’s going on there. One of the big advantages of using the R Hadoop packages is that you only need to learn one language. The alternative is to learn Java and Pig and Hive just to be able to do an abstraction and analysis of the data.”

Maldonado said that HP Vertica can help bring Hadoop to the business analytics people as well. “We started with integrating at the Map/Reduce level,” he said. “We made it simple for the Map/Reduce developer to pull and push data from the Vertica analytics platform. This past fall, we introduced a simplified model for customers who don’t have expertise in Hadoop.”

But moving data to and from Hadoop, then controlling access to that data, is an essential piece of the enterprise Hadoop puzzle, and a solution is only just now forthcoming. According to Brian Christian, CTO of Hadoop security company Zettaset, “You have this weird storm of regulated industries. This even includes telcos, water supplies and utilities where they’re doing constant data sampling. They’re trying to create smart grids and they’re getting this data overflow, and there really is no good way to handle it. They turn to Hadoop, and again they can’t get Hadoop to comply. Now they’re in this catch-22 of ‘What do I do?’”

Christian said Zettaset offers enterprise-class security for Hadoop clusters, allowing teams locked down by HIPAA or Sarbanes-Oxley to maintain controls on that otherwise unruly blob of data inside of a Hadoop cluster.
A laundry list of changes
Amid all of the Hadoop hype is another freight train of promises. This train is known as Hadoop 2.0, and it should reach the station this summer. Headed up by the folks at Hortonworks, Hadoop 2.0 includes a number of fundamental changes that will mold Hadoop into something that looks less like a Map/Reduce cluster and more like a generic workload cluster with a highly available file system.

In version 2.0, HDFS is being improved to offer high availability and better redundancy of data. The result will be that HBase, the non-relational, column-oriented data store that runs on top of HDFS, will now be reliable enough to run as a front-end database rather than a back-end storage framework for dumping MySQL information.

The actual job-scheduling system behind Map/Reduce in Hadoop has been rewritten from the ground up in Hadoop 2.0. Known as MapReduce 2.0, or YARN (Yet Another Resource Negotiator), this project splits the original JobTracker daemon in two: a global ResourceManager and a per-application ApplicationMaster. The net effect will be better cluster sharing across jobs, as well as the ability to run non-Map/Reduce workloads on Hadoop. Thus, Hadoop 2.0 will be able to shoulder generic workloads, not just those expressed in Map/Reduce terms.

But while the Apache Foundation and Hortonworks push their open-source vision for Hadoop, MapR has spent the last three years fixing many of the problems Hadoop 2.0 addresses. Tomer Shiran, vice president of product management at MapR, said that his company’s distribution of Hadoop offers many features not available in other forms of the platform. “MapR is the only distribution that can do disaster recovery and incrementally send that to another data center,” he said.

“It’s the only one that replicates the metadata of the cluster. Having that full [high availability] across every layer of the stack is unique to MapR as well. You look at any other system in the enterprise, whether that’s Teradata, NetApp or Oracle; they all have these capabilities. They have to have these capabilities.”

Shiran said MapR can claim these capabilities because the company was “able to rearchitect that layer of the stack, and that took several years. [MapR] is a read/write system, it has full read/write support. One of the challenges of HDFS is it was built for a search engine, where the only use case was to crawl. Because Hadoop has evolved so much into all these other use cases, the limited use case of crawling is no longer the only one.”

Thus, MapR and Zettaset claim to have made Hadoop enterprise-ready now, while Hortonworks continues to push for solutions to problems inside of the Apache Foundation. While these functionalities may someday become standard in all installations of Apache Hadoop, for now, those enterprise features are still only available from enterprise software companies.