The idea of binary XML has always hung around the margins of XML discourse. XML is very verbose because of its textual heritage and the many rules it imposes for friendliness toward internationalized text. An equivalent binary syntax would be much more compact. In an early (2000) article, "XML: The future of EDI?" (see Resources), I demonstrated a translation of part of an ANSI EDI X12 purchase order transaction (which is binary) into XML. The XML result was more than eight times the size of the original EDI message (some other XML/EDI pilots were seeing only around three times). This verbosity is of some concern for storage of XML, but at least storage is cheap these days. Transmission capacity is usually more limited, and the loudest calls for binary XML have come from those using XML for message transport formats, including some Web services users.
One approach to achieving XML compression is to adopt a format that was designed to be binary from the start. The leading candidate is ISO/ITU ASN.1, a data transmission standard that predates XML. ASN.1 is being updated with several XML-related capabilities that allow XML formats to be reformulated into specialized forms such as the ASN.1 Packed Encoding Rules (PER), which define a very compact binary encoding. OASIS UBL is an example of an XML initiative that has taken the ASN.1 approach to XML data compression.
If you need to transmit XML over Web services, you may find that your payload is too verbose. If so, you can apply one of many text compression options to the XML content. Listing 1 is the XML/EDI example that I presented in the article mentioned earlier.
Listing 1. Sample XML document for Web service interchange
<?xml version="1.0" encoding="UTF-8"?>
<PurchaseOrder Version="4010">
<PurchaseOrderHeader>
<TransactionSetHeader X12.ID="850">
<TransactionSetIDCode code="850"/>
<TransactionSetControlNumber>12345</TransactionSetControlNumber>
</TransactionSetHeader>
<BeginningSegment>
<PurposeTypeCode Code="00 Original"/>
<OrderTypeCode Code="SA Stand-alone Order"/>
<PurchaseOrderNumber>RET8999</PurchaseOrderNumber>
<PurchaseOrderDate>19981201</PurchaseOrderDate>
</BeginningSegment>
<AdminCommunicationsContact>
<ContactFunctionCode Code="OC Order Contact"/>
<ContactName>Obi Anozie</ContactName>
</AdminCommunicationsContact>
</PurchaseOrderHeader>
<PurchaseOrderDetail>
<Name1InformationLOOP>
<Name>
<EntityIdentifierCode Code="BY Buying Party"/>
<EntityName>Internet Retailer Inc.</EntityName>
<IdentificationCodeQualifier Code="91 Assigned by Seller"/>
<IdentificationCode>RET8999</IdentificationCode>
</Name>
<Name>
<EntityIdentifierCode Code="ST Ship To"/>
<EntityName>Internet Retailer Inc.</EntityName>
</Name>
<AddressInformation>123 Via Way</AddressInformation>
<GeographicLocation>
<CityName>Milwaukee</CityName>
<StateProvinceCode>WI</StateProvinceCode>
<PostalCode>53202</PostalCode>
</GeographicLocation>
</Name1InformationLOOP>
<BaselineItemData>
<QuantityOrdered>100</QuantityOrdered>
<Unit Code="EA Each"/>
<UnitPrice>1.23</UnitPrice>
<PriceBasis Code="WE Wholesale Price per Each"/>
<ProductIDQualifier Code="MG Manufacturer Part Number"/>
<ProductID Description="Fuzzy Dice">CO633</ProductID>
</BaselineItemData>
</PurchaseOrderDetail>
</PurchaseOrder>
The original EDI example is 200 bytes long and the XML version is 1721 bytes long.
The well-known PK-ZIP routine compresses the XML file to 832 bytes.
The GNU gzip routine compresses the file to 707 bytes.
The open source routine in bzip2 compresses the file to 748 bytes.
None of these is as compact as the specialized EDI format, which is understandable. bzip2 is famous for compressing many files better (if more slowly) than gzip, but the results here are typical of my observations, in which gzip handles XML better than bzip2.
Most platforms and languages these days have libraries for at least PK-ZIP and GNU gzip compression, which can be done programmatically before a Web service is invoked.
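As a rough sketch of doing this programmatically, the following Python uses only the standard library's `zipfile`, `gzip`, and `bz2` modules to compare compressed sizes for an XML payload. The function name `compressed_sizes`, the archive entry name, and the sample document are assumptions made for illustration. Exact byte counts depend on the library version and compression level, so they will not necessarily match the figures quoted above.

```python
import bz2
import gzip
import io
import zipfile


def compressed_sizes(xml_bytes: bytes) -> dict:
    """Return the size in bytes of xml_bytes under several compressors."""
    # Build a one-entry ZIP archive in memory, using the deflate method.
    zip_buffer = io.BytesIO()
    with zipfile.ZipFile(zip_buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("payload.xml", xml_bytes)  # hypothetical entry name
    return {
        "original": len(xml_bytes),
        "zip": len(zip_buffer.getvalue()),
        "gzip": len(gzip.compress(xml_bytes)),
        "bzip2": len(bz2.compress(xml_bytes)),
    }


# A small, repetitive sample document (an assumption, not Listing 1 itself).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n<Items>\n'
    + b'<Item><QuantityOrdered>100</QuantityOrdered><Unit Code="EA Each"/></Item>\n' * 20
    + b"</Items>\n"
)

if __name__ == "__main__":
    for name, size in compressed_sizes(SAMPLE).items():
        print(f"{name:>8}: {size} bytes")
```

Running a comparison like this on your own payloads is the most reliable way to choose a compressor, since the ranking can change with document size and structure.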
Make sure you investigate whether canonicalization (C14N) would improve or degrade compression in your case. C14N is a standard method for generating a physical representation of an XML document, called the canonical form, that accounts for the variations allowed in XML syntax without change in meaning. As a rough rule of thumb, if the XML is hand-edited with a lot of potential variation in attribute order and use of spacing, C14N might improve the performance of compression on large documents. However, if the XML is machine-generated or uses a lot of empty elements, C14N may hurt. My example is closer to the latter category. I canonicalized it using the C14N module in the PyXML project. The Python code is as follows:
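>>> # Requires the PyXML package, which provides xml.dom.ext
>>> from xml.dom import minidom
>>> from xml.dom.ext import c14n
>>> doc = minidom.parse('listing1.xml')
>>> # With no output argument, Canonicalize returns the canonical form as a string
>>> c14n.Canonicalize(doc)
>>> # Write the canonical form to a file instead
>>> f = open('listing1-canonical.xml', 'w')
>>> c14n.Canonicalize(doc, output=f)
>>> f.close()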
The resulting file, listing1-canonical.xml, is 1867 bytes, and gzip shrinks it to 714 bytes. The canonical text is therefore 146 bytes larger than the original, and its gzip result is 7 bytes larger. The main reason for this is that empty elements are expressed in their most verbose form after C14N. For example, the following line:
<Unit Code="EA Each"/>
becomes
<Unit Code="EA Each"></Unit>
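PyXML is no longer maintained, so readers on a current Python may prefer to observe the same effect with the standard library. The sketch below assumes Python 3.8 or later, where `xml.etree.ElementTree.canonicalize` is available. Note that this function implements C14N 2.0 rather than the version PyXML supported, so its output can differ in details from the PyXML result. The helper name `canonical_size_delta` is hypothetical.

```python
from xml.etree.ElementTree import canonicalize


def canonical_size_delta(xml_text: str) -> int:
    """Return how many bytes C14N adds to (or removes from) xml_text."""
    canonical = canonicalize(xml_text)
    return len(canonical.encode("utf-8")) - len(xml_text.encode("utf-8"))


if __name__ == "__main__":
    # Empty elements are written as explicit start/end tag pairs.
    print(canonicalize('<Unit Code="EA Each"/>'))

    # Attribute order and spacing inside tags are normalized.
    print(canonicalize("<Name  b = '2'   a = '1' >Fuzzy Dice</Name >"))

    # A positive delta means canonicalization made the text larger.
    print(canonical_size_delta('<Unit Code="EA Each"/>'))
```

A positive delta on a machine-generated document with many empty elements is a hint that C14N is more likely to hurt than to help compression for that payload.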
To bundle compressed XML such as the gzip result into SOAP, you have two options:
- Use some form of attachments facility
- Use an encoding such as Base64 for inclusion in the main body of the message
Base64 renders binary documents using only common textual characters. You should be able to do this using readily available libraries on any platform. There is even a W3C XML Schema type for Base64-encoded data (base64Binary), and your tools may be able to automate the Base64 encoding and decoding if you set up the Web service properly. Unfortunately, Base64 encoding undoes some of the effect of the compression. A Base64 encoding is larger than the original by a ratio of 4:3. After Base64 encoding, the gzip result on Listing 1 is 957 bytes.
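The 4:3 ratio can be checked with a little arithmetic. The sketch below uses Python's standard `base64` module; the helper `base64_length` is a hypothetical name. The 76-character line width is the MIME convention from RFC 2045, and it is an assumption that line wrapping of this kind accounts for the 957-byte figure.

```python
import base64
import math


def base64_length(n_bytes: int, line_width: int = 0) -> int:
    """Size of the Base64 encoding of n_bytes of input.

    With line_width > 0, one newline is counted per output line,
    as MIME-style encoders add.
    """
    encoded = 4 * math.ceil(n_bytes / 3)  # every 3 input bytes become 4 characters
    if line_width <= 0:
        return encoded
    return encoded + math.ceil(encoded / line_width)


# Stand-in for the 707-byte gzip result; the contents do not affect the length.
payload = bytes(707)

if __name__ == "__main__":
    # Unwrapped encoding: computed size, then the size the library produces.
    print(base64_length(707), len(base64.b64encode(payload)))
    # Wrapped at 76 characters per line, with a newline after each line.
    print(base64_length(707, 76), len(base64.encodebytes(payload)))
```

For 707 bytes of input this gives 944 characters unwrapped, and 957 once a newline is counted after each 76-character line.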
In general, after gzip is applied to an XML file and the compressed result is then encoded with Base64 for delivery inline in SOAP, the result is often about half the size of the original XML. This may be enough to meet your needs for space savings in XML Web services. If not, do take a good look at ASN.1.
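Putting the two steps together, a sender can gzip the XML and Base64-encode the result, and the receiver reverses the steps. This is a minimal sketch using Python's standard library; `pack_xml` and `unpack_xml` are hypothetical helper names, and how the resulting text is placed in the SOAP body depends on your toolkit.

```python
import base64
import gzip


def pack_xml(xml_bytes: bytes) -> str:
    """Compress XML with gzip, then encode it as Base64 text for inline use."""
    return base64.b64encode(gzip.compress(xml_bytes)).decode("ascii")


def unpack_xml(encoded: str) -> bytes:
    """Reverse pack_xml: decode the Base64 text, then decompress."""
    return gzip.decompress(base64.b64decode(encoded))


if __name__ == "__main__":
    # Hypothetical sample payload; substitute your own document.
    document = b"<PurchaseOrder>" + b"<Unit Code='EA Each'/>" * 50 + b"</PurchaseOrder>"
    packed = pack_xml(document)
    assert unpack_xml(packed) == document
    print(len(document), "bytes ->", len(packed), "characters")
```

Very small documents can come out larger after this treatment, because the gzip header and the Base64 expansion outweigh the savings, so it is worth measuring with representative messages.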
- See the ASN.1 Markup Language home page for information about ASN.1 initiatives towards XML support.
- Read the Cover page on XML and compression, an excellent resource for compression techniques to use on XML documents.
- For more on canonicalization, see "XML Canonicalization" and "XML Canonicalization, Part 2" by Bilal Siddiqui. For another example of the use of C14N in Web services, read "A Web Services Cache Architecture Based on XML Canonicalization" by Takase, Nakamura, Neyama, and Eto of IBM Research, Tokyo Research Laboratory.
- If you're interested in the details of Base64 encoding see Section 6.8 of RFC 2045: MIME Part One.
- Find a broad array of articles, columns, tutorials, and tips on these two popular technologies at the developerWorks Web services and XML zones.
- For a complete list of XML tips to date, check out the tips summary page.
- Browse for books on these and other technical topics.
- Learn how you can become an IBM Certified Developer in XML and related technologies.

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.




