Tip: Compress XML files for efficient transmission

Who needs binary XML when we have good compression?

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.

Summary:  Binary XML has generated a lot of talk, and one of the motivators is the need for a less verbose transfer format, especially for use with Web services. One solution that is already at hand is data compression. This tip shows you how to use compression to prepare XML for transmission over Web services.

Date:  09 Apr 2004
Level:  Intermediate

The idea of binary XML has always hung around the margins of XML discourse. XML is very verbose because of its textual heritage and the many rules it imposes for friendliness toward internationalized text. An equivalent binary syntax would be much more compact. In an early (2000) article, "XML The future of EDI?" (see Resources), I demonstrated a translation of part of an ANSI EDI X12 purchase order transaction (which is binary) into XML. The XML result was more than eight times the size of the original EDI message (some other XML/EDI pilots were seeing only around three times). This verbosity is of some concern for storage of XML, but at least storage is cheap these days. Transmission capacity is usually more limited, and the loudest calls for binary XML have come from those using XML for message transport formats, including some Web services users.

One approach to achieving XML compression is to adopt a format that was designed for binary encoding from the start. The leading candidate is ISO/ITU ASN.1, a data transmission standard that predates XML. ASN.1 is being updated with several XML-related capabilities that allow XML formats to be reformulated into specialized forms such as the ASN.1 Packed Encoding Rules (PER), which define a very compact binary encoding. OASIS UBL is an example of an XML initiative that has taken the ASN.1 approach to XML data compression.

Compression for SOAP encoding

If you need to transmit XML over Web services, you may find that your payload is too verbose. If so, you can apply one of many text compression options to the XML content. Listing 1 is the XML/EDI example that I presented in the article mentioned earlier.


Listing 1. Sample XML document for Web service interchange
<?xml version="1.0" encoding="UTF-8"?>
<PurchaseOrder Version="4010">
<PurchaseOrderHeader>
  <TransactionSetHeader X12.ID="850">
    <TransactionSetIDCode code="850"/>
    <TransactionSetControlNumber>12345</TransactionSetControlNumber>
  </TransactionSetHeader>
  <BeginningSegment>
    <PurposeTypeCode Code="00 Original"/>
    <OrderTypeCode Code="SA Stand-alone Order"/>
    <PurchaseOrderNumber>RET8999</PurchaseOrderNumber>
    <PurchaseOrderDate>19981201</PurchaseOrderDate>
   </BeginningSegment>
  <AdminCommunicationsContact>
    <ContactFunctionCode Code="OC Order Contact"/>
    <ContactName>Obi Anozie</ContactName>
  </AdminCommunicationsContact>
</PurchaseOrderHeader>
<PurchaseOrderDetail>
  <Name1InformationLOOP>
    <Name>
      <EntityIdentifierCode Code="BY Buying Party"/>
      <EntityName>Internet Retailer Inc.</EntityName>
      <IdentificationCodeQualifier Code="91 Assigned by Seller"/>
      <IdentificationCode>RET8999</IdentificationCode>
    </Name>
    <Name>
      <EntityIdentifierCode Code="ST Ship To"/>
      <EntityName>Internet Retailer Inc.</EntityName>
    </Name>
    <AddressInformation>123 Via Way</AddressInformation>
    <GeographicLocation>
      <CityName>Milwaukee</CityName>
      <StateProvinceCode>WI</StateProvinceCode>
      <PostalCode>53202</PostalCode>
    </GeographicLocation>
  </Name1InformationLOOP>
  <BaselineItemData>
    <QuantityOrdered>100</QuantityOrdered>
    <Unit Code="EA Each"/>
    <UnitPrice>1.23</UnitPrice>
    <PriceBasis Code="WE Wholesale Price per Each"/>
    <ProductIDQualifier Code="MG Manufacturer Part Number"/>
    <ProductID Description="Fuzzy Dice">CO633</ProductID>
  </BaselineItemData>
</PurchaseOrderDetail>
</PurchaseOrder>
      

The original EDI example is 200 bytes long and the XML version is 1721 bytes long. Compressing the XML file gives the following results:

  • The well-known PK-ZIP routine: 832 bytes
  • The GNU gzip routine: 707 bytes
  • The open source routine in bzip2: 748 bytes

None of these is as compact as the specialized EDI format, which is understandable. bzip2 is famous for compressing many files better (if more slowly) than gzip, but the results here are typical of my observations, in which gzip handles XML better than bzip2.

Most platforms and languages these days have libraries for at least PK-ZIP and GNU gzip compression, so you can compress the payload programmatically before invoking a Web service.
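
As a rough sketch of what that can look like in Python, the following uses only standard-library modules (gzip, bz2, and zipfile) to compress a small XML fragment and report the sizes. The fragment is a shortened stand-in for Listing 1, and zip_bytes and compressed_sizes are hypothetical helper names. Sizes vary with the input, the compression level, and the library version, so don't expect this to reproduce the figures quoted above.

```python
import bz2
import gzip
import io
import zipfile

# Shortened stand-in for Listing 1 (an assumption, for illustration only).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<PurchaseOrder Version="4010">
  <BaselineItemData>
    <QuantityOrdered>100</QuantityOrdered>
    <Unit Code="EA Each"/>
    <UnitPrice>1.23</UnitPrice>
    <ProductID Description="Fuzzy Dice">CO633</ProductID>
  </BaselineItemData>
</PurchaseOrder>
"""


def zip_bytes(data, name="payload.xml"):
    """Return a one-member ZIP archive (DEFLATE) holding data."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(name, data)
    return buf.getvalue()


def compressed_sizes(data):
    """Map each format name to the size in bytes of data in that format."""
    return {
        "original": len(data),
        "zip": len(zip_bytes(data)),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


if __name__ == "__main__":
    for fmt, size in compressed_sizes(SAMPLE_XML).items():
        print(f"{fmt:>8}: {size} bytes")
```

On a fragment this small, the fixed header overhead of each format takes up a large share of the output, so run the comparison on documents of realistic size before choosing a format.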

Make sure you investigate whether canonicalization (C14N) would improve or degrade compression in your case. C14N is a standard method for generating a physical representation of an XML document, called the canonical form, that accounts for the variations allowed in XML syntax without change in meaning. As a rough rule of thumb, if the XML is hand-edited with a lot of potential variation in attribute order and use of spacing, C14N might improve the performance of compression on large documents. However, if the XML is machine-generated or uses a lot of empty elements, C14N may hurt. My example is closer to the latter category. I canonicalized it using the C14N module in the PyXML project. The Python code is as follows:

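>>> from xml.dom import minidom
>>> from xml.dom.ext import c14n   # C14N module from PyXML
>>> doc = minidom.parse('listing1.xml')
>>> # With no output argument, the canonical form is returned as a string
>>> c14n.Canonicalize(doc)
>>> # Pass a file object to write the canonical form to disk instead
>>> f = open('listing1-canonical.xml', 'w')
>>> c14n.Canonicalize(doc, output=f)
>>> f.close()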
      

The resulting file, listing1-canonical.xml, is 1867 bytes, and gzip shrinks it to 714 bytes. Compared with the original Listing 1, the canonical form is 146 bytes larger as plain text and 7 bytes larger after gzip. The main reason for this is that empty elements are expressed in their most verbose form after C14N. For example, the following line:

<Unit Code="EA Each"/> 

becomes

<Unit Code="EA Each"></Unit>     
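
PyXML is no longer maintained, so to observe the same effect on a current Python you could try xml.etree.ElementTree.canonicalize, which has been in the standard library since Python 3.8. This is a swapped-in technique, not the module used above: it implements C14N 2.0, so byte counts for a full document may differ from the figures quoted here, but it expands empty elements in the same way. The gzip_size helper is a hypothetical name.

```python
import gzip
import xml.etree.ElementTree as ET


def gzip_size(text):
    """Size in bytes of text after UTF-8 encoding and gzip compression."""
    return len(gzip.compress(text.encode("utf-8")))


original = '<Unit Code="EA Each"/>'

# canonicalize() returns the canonical form as a string when no
# output file is given.
canonical = ET.canonicalize(original)

print(canonical)  # <Unit Code="EA Each"></Unit>
print(len(original), "->", len(canonical), "characters")
print(gzip_size(original), "->", gzip_size(canonical), "bytes after gzip")
```

Running the same comparison on your own documents is the quickest way to find out whether C14N helps or hurts in your case.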

To bundle compressed XML such as the gzip result into SOAP, you have two options:

  • Use some form of attachments facility
  • Use an encoding such as Base64 for inclusion in the main body of the message

Base64 renders binary documents using only common textual characters. You should be able to do this using readily available libraries on any platform. There is even a W3C XML Schema type for Base64-encoded data (base64Binary), and your tools may be able to automate the Base64 encoding and decoding if you set up the Web service properly. Unfortunately, Base64 encoding undoes some of the effect of the compression: the encoded output is larger than its input by a ratio of 4:3. After Base64 encoding, the gzip result on Listing 1 is 957 bytes.
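
The 4:3 ratio comes from Base64 mapping every 3 input bytes to 4 output characters, padded up to a multiple of 4. A quick check in Python suggests one way the 957-byte figure can arise: unwrapped Base64 of 707 bytes is 944 characters, and breaking that into 76-character lines, as MIME-style encoders do, adds 13 newlines. That the figure was produced with line wrapping is an assumption, since the text does not say which tool was used. The zero-filled buffer is only a placeholder, because the encoded length depends on the input length alone.

```python
import base64
import math

GZIP_SIZE = 707  # size of the gzip result for Listing 1, taken from the text

# Placeholder payload: Base64 output length depends only on input length.
payload = bytes(GZIP_SIZE)

unwrapped = base64.b64encode(payload)   # one long line, no newlines
wrapped = base64.encodebytes(payload)   # newline after every 76 characters

# Every 3 input bytes become 4 output characters, rounded up.
assert len(unwrapped) == 4 * math.ceil(GZIP_SIZE / 3)

print(len(unwrapped))  # 944
print(len(wrapped))    # 957
```

If your SOAP toolkit accepts unwrapped Base64, skipping the line breaks saves a few more bytes.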


Wrap-up

In general, after gzip is applied to an XML file and the compressed result is then encoded with Base64 for delivery inline in SOAP, the result is often about half the size of the original XML. This may be enough to meet your needs for space savings in XML Web services. If not, do take a good look at ASN.1.
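
The whole round trip can be sketched in Python roughly as follows, using only the standard library. The SOAP 1.1 envelope namespace is real, but the application namespace, the CompressedPayload element name, and the wrap and unwrap helpers are all hypothetical; a real service would define its own message structure.

```python
import base64
import gzip
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
APP_NS = "urn:example:compressed-xml"  # hypothetical application namespace


def wrap(xml_bytes):
    """Gzip xml_bytes, Base64-encode it, and place it in a SOAP Body."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # CompressedPayload is a made-up element name for this sketch.
    payload = ET.SubElement(body, f"{{{APP_NS}}}CompressedPayload")
    payload.text = base64.b64encode(gzip.compress(xml_bytes)).decode("ascii")
    return ET.tostring(envelope, encoding="utf-8")


def unwrap(envelope_bytes):
    """Reverse wrap(): find the payload, Base64-decode it, and gunzip it."""
    envelope = ET.fromstring(envelope_bytes)
    payload = envelope.find(f"{{{SOAP_NS}}}Body/{{{APP_NS}}}CompressedPayload")
    return gzip.decompress(base64.b64decode(payload.text))


if __name__ == "__main__":
    document = b'<PurchaseOrder Version="4010"><Unit Code="EA Each"/></PurchaseOrder>'
    message = wrap(document)
    assert unwrap(message) == document
    print(len(document), "bytes of XML ->", len(message), "bytes of SOAP")
```

For very small documents the gzip header and the envelope itself can outweigh the savings, so the benefit only shows up once the payload is reasonably large.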


Resources
