British Library uses IBM BigSheets to hold onto its digital history



March 18, 2010 (Page 1 of 2)
From Chaucer’s “The Canterbury Tales” to newspapers documenting 19th-century wars, the recent credit crunch and the Olympics, the British Library has worked for more than 250 years to preserve the history and social culture those published works hold. Now, as more information goes digital, those working to collect as much information as possible are beginning to worry about the “digital black hole.” So much information is created every day that older information is ultimately erased or lost, leaving generations to come with a potential void in their history.

“About 15 petabytes of information are created every day in the world,” said David Boloker, IBM’s CTO for emerging Internet technologies. A petabyte can be thought of as about eight times the amount of information held in all United States libraries today, he said.

Recent research by the British Library estimates the average life expectancy of a website at just 44 to 75 days, and suggests that every six months 10% of all U.K. websites are either lost through assimilation with other information or replaced by new data.

In an attempt to help the British Library harness an endless trove of information from the U.K. and Ireland, IBM worked with it this past December and installed code for IBM BigSheets. The program is a new technology prototype designed to build “a Web-insight engine that can deal with huge amounts of data and basically ingest it,” Boloker explained. “BigSheets can then in essence map and graph the data and reduce to what is understandable by a human being.”

BigSheets has taken about two years to develop and is on its fifth generation of code. It is built on the Apache Hadoop framework and is about two more years away from being a robust technology, Boloker said. In use, the program will crawl the roughly 8 million websites in the .uk domain (a number expected to reach 11 million by 2011) and take “snapshots” of Web pages by fetching each page and copying it into the WARC (Web ARChive) file format. BigSheets then stores the content in the Hadoop Distributed File System (HDFS) for further processing.
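For readers unfamiliar with WARC, the snapshot step can be pictured with a short Python sketch. It is not BigSheets code, which the article does not show. It wraps an already-fetched page body in a single, simplified WARC 1.0 "resource" record. The function name `build_warc_record` and the example URL are hypothetical, and both the crawl itself and the copy into HDFS are left out.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional


def build_warc_record(uri: str, payload: bytes,
                      content_type: str = "text/html",
                      when: Optional[datetime] = None) -> bytes:
    """Wrap an already-fetched page body in one simplified WARC 'resource' record.

    Hypothetical helper for illustration only; a production archiver
    would normally rely on a dedicated WARC library.
    """
    # Normalize the capture time to UTC so the trailing 'Z' is accurate.
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # payload is the page itself
        f"WARC-Target-URI: {uri}",                     # where the page came from
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique ID per record
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # size of the block in bytes
    ]
    # Headers end with a blank line; the record ends with two CRLF pairs.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


if __name__ == "__main__":
    record = build_warc_record("http://example.co.uk/",
                               b"<html><body>Hello</body></html>")
    print(record.decode("utf-8"))
```

Because each record carries its own length and ends with a fixed delimiter, many snapshots can be appended to one archive file, which suits bulk storage on a distributed file system.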


