Protocol Buffers: Reinventing the past

By Andrew Binstock

August 15, 2008

For decades before the widespread adoption of XML, the computer industry sought an easy way to transfer data across systems. Many options came into vogue, then disappeared or eventually found acceptance in small utilitarian niches.

The CORBA CDR format, Sun’s XDR (eXternal Data Representation) and the various EDI (Electronic Data Interchange) protocols are examples of the latter category. A few other standards, such as ASN.1 (Abstract Syntax Notation One), live on out of sight even though they are widely used; because they are unseen, they don’t come to mind when an internal or external data exchange protocol is needed.

Even after the wide embrace of XML, standards bodies such as the IETF have continued formulating new standards. One example is SDXF (Structured Data eXchange Format), a non-text-based format that has much the same structure as XML, although the data types are self-describing.

So many standards have been formulated in large part because of the balance that must be struck among three contending forces in the design of a data-interchange protocol. 1) Speed: How fast can the data be marshalled and then unmarshalled? 2) Ease of use: How simple is it to describe the data in the appropriate format from a variety of programming languages? 3) Expressivity and other factors: How well does the format express the data needs of a particular domain? There are also pragmatic concerns, such as security (can the data be encrypted when sent over the wire?) and compressibility.

Curiously, protocols that fail at one or more of these aspects can still become popular. XML is one example. XML is more of a data structure representation than an interchange format, as it requires definitions that both ends agree on (generally specified in the form of a schema or DTD) so that fields can be validated and the type of each data item identified.

Beyond the need for supplemental support, XML has some fairly horrid characteristics, the worst of which is speed. XML is an extremely wordy protocol that is very slow to parse. Sites that run enterprise Java and profile their applications frequently find that the single biggest time sink is the encoding and decoding of XML documents. In nearly all cases, it is by a wide margin the most time-consuming activity.

XML’s ease of use is not great, although it’s better than it once was because many good parsers are now available. Even so, the requirement to escape certain characters, and the consequent need for CDATA sections, means encoding can at times be complex.
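As a rough illustration of the escaping burden, the sketch below uses Python's standard library to escape reserved characters and to wrap text in a CDATA section. The helper name `wrap_cdata` is hypothetical, and the splitting trick for an embedded `]]>` is one common workaround, not something this column prescribes.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


def wrap_cdata(text):
    """Wrap text in a CDATA section (hypothetical helper).

    A CDATA section cannot contain the sequence ']]>', so any occurrence
    is split across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


if __name__ == "__main__":
    raw = "if (a < b && c > d)"
    # escape() replaces '&', '<' and '>' with entity references.
    print(escape(raw))
    # The CDATA form leaves the text readable but needs its own special case.
    doc = "<code>" + wrap_cdata("x]]>y") + "</code>"
    print(doc)
    print(ET.fromstring(doc).text)
```

Either route works, but the producer has to choose one and apply it to every text value it writes.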

Finally, of course, XML expressivity is abysmal. Data must be structured, items can at times be categorized in only one specific hierarchy, and each hierarchy level must encompass all lower levels. In addition, repeating patterns of fields, such as a list of employee records, must repeat all the field names and the hierarchy for every instance.
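To make the repetition concrete, here is a minimal sketch that serializes the same small list of employee records twice: once as XML, and once in an ad hoc length-prefixed binary layout. The record fields, the sample data and the binary layout are all assumptions made up for this comparison; the layout is not Protocol Buffers' format.

```python
import struct
import xml.etree.ElementTree as ET

# Invented sample data: (name, id) pairs.
EMPLOYEES = [("Ada", 1), ("Grace", 2), ("Linus", 3)]


def to_xml(records):
    """Serialize records as XML; every record repeats every tag name."""
    root = ET.Element("employees")
    for name, emp_id in records:
        emp = ET.SubElement(root, "employee")
        ET.SubElement(emp, "name").text = name
        ET.SubElement(emp, "id").text = str(emp_id)
    return ET.tostring(root)


def to_binary(records):
    """Serialize records as: 1-byte name length, UTF-8 name, 4-byte id."""
    out = bytearray()
    for name, emp_id in records:
        raw = name.encode("utf-8")
        out += struct.pack("<B", len(raw)) + raw + struct.pack("<I", emp_id)
    return bytes(out)


if __name__ == "__main__":
    xml_bytes = to_xml(EMPLOYEES)
    bin_bytes = to_binary(EMPLOYEES)
    print(xml_bytes.decode("ascii"))
    print("XML bytes:", len(xml_bytes), "binary bytes:", len(bin_bytes))
```

The binary form carries no field names at all, which is only workable because both ends agree on the layout in advance.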

I could go on about XML limitations, but the key objection is surely performance. Google, a company intensely focused on performance, examined XML and decided that it was far too slow for the company’s high-speed requirements. In true Google tradition, rather than use an existing alternative protocol, it came up with its own: Protocol Buffers, a new format for serializing data that the company open-sourced in July 2008.

The data structure for Protocol Buffers data items is described using a simple data description language, reminiscent of CORBA’s Interface Definition Language. This description is saved in a .proto file, which is then compiled into C++, Java or Python.

The Protocol Buffers meta-language is surprisingly extensive. Both sequential and hierarchical data structures can be defined, and data items, covering all the standard string and integral types, can be marked required or optional. Fields can also be enums, a common mechanism for tightly associating a specific value with a symbolic name.
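For a sense of what such a description looks like, the sketch below holds a small schema in a Python string and lists its field declarations with a regular expression. The message, its field names and its numbering are hypothetical, and the regex helper is only a reading aid, not part of any Protocol Buffers tooling. The `repeated` rule and the `syntax` line are proto2 features that this column does not itself discuss.

```python
import re

# Hypothetical schema for illustration; names and numbers are assumptions.
# Saved as employee.proto, it could be compiled with, for example:
#   protoc --python_out=. employee.proto
EMPLOYEE_PROTO = """
syntax = "proto2";

message Employee {
  enum Status {
    ACTIVE = 0;
    ON_LEAVE = 1;
  }
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
  optional Status status = 4 [default = ACTIVE];
  repeated string phone = 5;
}
"""

# Matches lines such as: required string name = 1;
FIELD_RE = re.compile(
    r"^\s*(required|optional|repeated)\s+(\w+)\s+(\w+)\s*=\s*(\d+)",
    re.MULTILINE,
)


def list_fields(proto_source):
    """Return (rule, type, name, number) for each field declaration."""
    return [
        (rule, ftype, name, int(number))
        for rule, ftype, name, number in FIELD_RE.findall(proto_source)
    ]


if __name__ == "__main__":
    for field in list_fields(EMPLOYEE_PROTO):
        print(field)
```

Each field carries a small integer number; that number, rather than the field name, is what identifies the field in the serialized data.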

Protocol Buffers is a binary format. This is crucial for performance, as it means that long text strings and field names do not have to be parsed for each item. In addition, the marshalling and unmarshalling functions are built directly into the code generated by the .proto compiler, essentially guaranteeing that every aspect is optimized for speed.
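The effect of a binary encoding can be sketched by hand. The code below encodes a single integer field the way the Protocol Buffers wire format does for varint fields: a key built from the field number and wire type, followed by the value in base-128 varint form. It is a simplified illustration that handles only non-negative integers; the function names are hypothetical, and real applications would use the generated code rather than anything like this.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_int_field(field_number, value):
    """Encode one integer field: key = (field_number << 3) | wire type 0."""
    key = (field_number << 3) | 0
    return encode_varint(key) + encode_varint(value)


if __name__ == "__main__":
    encoded = encode_int_field(1, 150)
    print(encoded.hex())                   # 089601
    print(len(encoded), "bytes versus", len(b"<a>150</a>"), "for <a>150</a>")
```

Field number 1 with value 150 takes three bytes, and the reader never has to scan or compare a field name to find it.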

Unlike earlier Google releases that still had an experimental feel to them, Protocol Buffers has been used by Google for years and in almost all aspects of the services it provides. So, this is a technology that has been extensively tested… and optimized. If you need to transfer data in C++, Java or Python and want a fast, simple solution that avoids the penalties imposed by XML, try Protocol Buffers. I think you’ll like what you find.

Andrew Binstock is the principal analyst at Pacific Data Works. Read his blog at binstock.blogspot.com.

