Convert Excel 2007 XML to OASIS code lists

Working with Genericode

Genericode—an OASIS specification—is an XML vocabulary for encoding code lists or authority lists in XML. These lists are usually developed and maintained by business users who are familiar with Microsoft® Office solutions but are not proficient in XML technologies. In this article, gain insight into using Microsoft Office Excel® 2007 spreadsheets to create and maintain code lists, then use XSLT to convert the spreadsheets to compliant Genericode that applications such as online forms can easily consume.

Betty Harvey (harvey@eccnet.com), Information Architect, Electronic Commerce Connection, Inc.

Betty Harvey is president of Electronic Commerce Connection, Inc. She has participated with many government and commercial enterprises in planning and executing their migration to structured information. Betty is involved in the ebXML initiative and is the co-author of Professional ebXML Foundations (Wrox, 2001) as well as many other publications. Betty started and coordinates the Washington, D.C., area SGML/XML Users Group and is also a member of the XML Guild. You can reach Betty at harvey@eccnet.com.



11 January 2011

Also available in Chinese Japanese

Genericode started within the Organization for the Advancement of Structured Information Standards (OASIS) specification Universal Business Language (UBL) technical committee (see Resources). UBL originally followed the United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) Core Components Technical Specification (see Resources for a link)—a specification that embedded code lists, such as currency codes and country codes, directly in the W3C Schema using enumerations.

Frequently used acronyms

  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language
  • XSLT: Extensible Stylesheet Language Transformation

The UBL community quickly found that code lists change frequently. Although embedding the lists directly in the schema provided easy validation, when a code changed in the list, the schema had to be updated. Handling updated schemas within multiple trading partners can be a logistic nightmare. Also, many products require recompiling the schema and reinitializing databases based on updated schemas.

The UBL technical committee developed Genericode as an abstraction layer to resolve updating the schema every time code changed (see Resources for a link to more information). Other validation methods such as Schematron can be used as the validation layer for validating codes. Genericode has a wider audience than UBL, and OASIS decided to create a new technical committee to manage the Genericode specification.

This article shows how to use Microsoft Office Excel 2007 spreadsheets to create code lists. You can then use XSLT to convert the lists from an Excel spreadsheet to Genericode.

Code lists in Office Excel 2007

Although this article deals specifically with Genericode conversion, the concepts can be generalized to other types of Excel documents and XML vocabularies. For example, I have done many conversions to and from Microsoft Office Excel 2003 and have found the process to be fairly straightforward. To convert to XML from Office Excel 2003, you must save the spreadsheet to Office Excel 2003 XML. One of my conversions is a schema analysis spreadsheet that creates a spreadsheet from a W3C schema. The Excel format provides an easy way for nontechnical individuals to understand relationships among and definitions of schema components.

Another "go-to" conversion I created was Office Excel 2003 XML to Genericode, a conversion that works in place in a production environment. In this conversion, many of the code lists have empty cells. In the Office Excel 2003 XML, however, if a cell is blank, it was not output into the XML, and there wasn't an easy way to automatically analyze the row and determine which cell was left out. This situation is true for both Office Excel 2003 and Office Excel 2007.  Office Excel 2007, however, provides the ability to logically derive when a cell is blank.

For the purpose of this article, I use the simple structure in Figure 1, which represents an Excel code list.

Figure 1. Sample Excel code list
Screen capture of sample Excel code list with CodeList name, code value name (column label), and invidual values

OASIS Genericode and Office Excel 2007 structures

Before beginning any conversion project, you must understand the starting structure and the ending structure. Genericode can become complicated, but for 98 percent of all code lists that people use, they are relatively straightforward. A simple code list is broken into three sections:

  • Identification
  • Column specification
  • Code list data

Figure 2 provides a graphical view of the XML schema for Genericode. A CodeList document (Type: gc:CodeListDocument) has attributes to describe it. The gc:DocumentHeader contains Annotation (Type: gc:Annotation) and Identification (Type: gc:Identification). The gc:ColumnSetChoice contains ColumnSet (Type: gc:ColumnSet) and ColumnSetRef (Type: gc:ColumnSetRef). The gc:OuterCodeListChoice contains gc:SimpleCodeListSequence and SimpleCodeList (Type: gc:SimpleCodeList).

Figure 2. The Genericode structure
Diagram of the Genericode structure

Office Excel 2007 XML isn't quite as straightforward as Genericode. In fact, the structure is rather bizarre. The XLSX file is really a ZIP file containing multiple files:

  • Root:
    • _rels
    • docProps
    • xl
      • _rels
      • printerSettings
      • theme
      • worksheets

Each directory has many XML files. For the conversion, you don't need to concern yourself with all the files or their purpose. Several of them are used internally by Excel to control the Excel application. The files of interest for the conversion in this article are:

  • /docProps/app.xml. Contains organization information
  • /docProps/core.xml. Obtain metadata for time, author, and so on
  • /xl/worksheets/sheet1.xml. Contains mappings to the data
  • /xl/sharedStrings.xml. Contains the actual data

Although there is logic to the files, at a quick glance that logic isn't obvious.


Excel data mapping

The sharedStrings.xml and sheet1.xml files work together. Let's look at the sharedStrings.xml file from Figure 1: Listing 1 shows the file structure for this XML file.

Listing 1. The sharedStrings.xml file structure
<sst>
     <si>
          <t>My Data 25</t>
     </si>
     <si>
          <t>My Data 10</t>
     </si>
     ...
</sst>

The interesting fact about this file is that there doesn't seem to be a logical order to how the data is included in the sharedStrings.xml file. The file is basically a flat listing of all the cells in the Excel spreadsheet.

Depending on the datatype of a cell, the location of the data is stored differently:

  • String data. Lookup in sharedStrings.xml file
  • Integer data. Located in the sheet1.xml file
  • Formula data. Formula located in sheet1.xml

Note: It is unlikely that you will have formula data in a code list.

The sheet1.xml file provides a mapping to the data. Figure 3 shows the structure of the sheet1.xml file. (View a text-only of Figure 3.)

Figure 3. The structure of sheet1.xml
Graphical structure of the sheet1.xml

The meat of the data is within the <row> element. Table 1 shows the elements of the sheet1.xml file.

Table 1. Elements and attributes of sheet1.xml
Element or attributeDescription
rowThe beginning of a row in the Excel spreadsheet.
@rThe row number.
@spansThe number of columns the row spans. The first number is the beginning; the second number is the end.
cInformation about a cell.
@rThe physical location of the cell —for example, B10 would be column B, row 10.
@s="1"The actual value of the cell. In this case, a lookup is not required into the sharedStrings.xml file.
@tThe datatype—for example, s represents a string. When @t is available, the value of the <v> element is used as a lookup into the sharedStrings.xml file.
vContains the number of the <si> element +1, where the actual data is located in the sharedStrings.xml file.

The value in the element is the XPath number (+1) of the <si> element in the file sharedStrings.xml. The first cell represents string data. The t attribute value is set to s. The s represents a string value.

The second and third cells do not have a t attribute. The absence of the t attribute indicates that the data is included in this file. You can distinguish a calculated field from an integer by the inclusion of the plus sign (+) as the first character.


Developing the XSLT

Because there are multiple files and multiple ways of developing the conversion methodology, you must select an XML file as your base XML file. I chose to use the sheet1.xml file because it contains the mapping to the data. The data file (sharedStrings.xml) is static, and it would be difficult to deduce intelligent navigation from it. As stated earlier, Genericode has three main containers of information. Let's walk through the conversion for each container.

Global variables

You need to perform lookups in several files. Rather than using the full path name of the various files, I find it easier and more readable to create global variables.

The following parameter sets the root directory for the XML files. If you are converting multiple code lists using an XProc pipeline or a batch job, you can send the value of the parameter to the process:

<xsl:param name="ExcelRoot">MyCodelist</xsl:param>

You use the variable shown here to get the complete path and file name of the file that contains the information:

<xsl:variable name=”DataFile”>
  <xsl:value-of select=”$ExcelRoot”/>/xl/sharedStrings.xml</xsl:variable>

Root template

The <worksheet> template is the starting point for the conversion. Listing 2 provides a sample of the template for the <worksheet> element. Basically, my beginning template calls the conversion of each section. The file that you want to start the conversion is sheet1.xml. The namespace prefix is Excel (although you can change it, if you prefer). The root element is <worksheet>. Note that my template matches Excel:worksheet.

Listing 2. The beginning template
<xsl:template match="Excel:worksheet">
   <CodeList xmlns="http://docs.oasis-open.org/codelist/ns/genericode/1.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:era="http://archives.gov/era/schema"
         xsi:schemaLocation="http://docs.oasis-open.org/codelist/ns/genericode/1.0/
         http://docs.oasis-open.org/codelist/cs-genericode-1.0/xsd/genericode.xsd">

         <xsl:call-template name="commentBlock"/>
         <xsl:call-template name="identificationBlock"/>
         <xsl:call-template name="createColumnSet"/>
         <xsl:apply-templates/>
        </CodeList>
    </xsl:template>

You must also create a default template to suppress unwanted Excel-specific data:

<xsl:template match=”*”/>

For the purposes of this article, I concentrate on the body of the of the code list (<SimpleCodeList>) because when you have this logic, developing the other sections is simple. Each organization has its own methodology for Excel code list templates.


Create <SimpleCodeList> and <Row> output

The <sheetData> element is a direct child of the <worksheet> element. The <SimpleCodeList> element is sparked from the <sheetData> element. And the <Row> element is sparked from the <row> element. As you can see, this approach is straightforward. Listing 3 shows the XML for creating the <SimpleCodeList> and <Row> elements.

Listing 3. Create <SimpleCodeList> and <Row>
   <xsl:template match="Excel:sheetData">
        <SimpleCodeList>
            <xsl:apply-templates/>
        </SimpleCodeList>
    </xsl:template>

    <xsl:template match="Excel:row">
        <Row>
            <xsl:apply-templates/>
        </Row>
    </xsl:template>

Creating column references

Genericode requires that column references are used for every entry in the Genericode. This information is captured in the <ColumnSet> element. I don't show how to create the <Identification> and <ColumnSet> entries because they differ, depending on the template that your organization uses to maintain its code lists in Excel. The column reference information, however, is required to populate the @ColumnRef information in the <Value> element.

I manually set the variable for each heading. There is probably a more elegant way of setting these variables, but this mechanism works. Then, I create a variable for columns. In Listing 4, I select the value of the <v> element in cell A2. Remember that because the numbering starts with 0, you need to add 1 to the <v> value.

Listing 4. Create the column references
<xsl:variable name="ColumnALocation">
   <xsl:value-of select=
    "//Excel:worksheet/Excel:sheetData[1]/Excel:row[2]/Excel:c[1]/Excel:v + 1"
 />

From there, you need to get the actual value of the cell (see Listing 5). In this case, I perform a test to make sure the column location isn't null. If the value isn't null, then I perform a lookup in the sharedStrings.xml file, where the position axis (position()) equals the value of the value in the <v> element in sheet1.xml. I use this variable later to determine the @ColumnRef attribute.

Listing 5. Retrieving a cell's actual value
<xsl:variable name="Column-A">
    <xsl:choose>
        <xsl:when test="$ColumnALocation = ''"/>
        <xsl:otherwise>
            <xsl:value-of
                select="document($DataFile)/Excel:sst/Excel:si[position() =
            $ColumnALocation]"
            />
        </xsl:otherwise>
    </xsl:choose>
</xsl:variable>

Creating <Value> elements

If you look back at the example in Figure 1, you can see that the value of column A1 is a1. You want your Genericode representation to look like this:

<Value ColumnRef="ColumnA">
    <SimpleValue>a1</SimpleValue> 
</Value>

The @r attribute in the <c> element contains the cell coordinate. For example, A2 represents column A, row 2. The example shows that the column headings are A2, B2, and C2. You need to know in which cells the title and column headings are located.

Now, you set the following three variables:

  • column. The column number—for example, B represents the second column. The @r value contains this information.
  • dataLoc. The data location, which is the XPath + 1 position of where the data is located in the sharedStrings.xml file.
  • cellValue. The actual data of the cell. You set this variable with a straight lookup in the sharedStrings.xml file.

After setting the variables, you start populating the data. The second part of this process is to populate the @ColumnRef attribute. You could do so in the same template, but I choose to create a second template and send the column variable to find the name of the column, as in Listing 6.

Listing 6. Example <c> template
<xsl:template match="Excel:c">
    <xsl:variable name="column"><xsl:value-of select="substring(@r, 1, 1)"/>
    </xsl:variable>
    <xsl:variable name="dataLoc"><xsl:value-of select="number(Excel:v) + 1"/>
    </xsl:variable>
    <xsl:variable name="cellValue">
       <xsl:value-of 
         select="document($DataFile)/Excel:sst/Excel:si[position() = $dataLoc]"/>
    </xsl:variable>
   <xsl:choose>
      <xsl:when test="Excel:v">
         <Value>
            <xsl:attribute name="ColumnRef">
                 <xsl:call-template name="getColref">
                    <xsl:with-param name="column"><xsl:value-of select="$column"/>
                    </xsl:with-param>
                  </xsl:call-template>
                 </xsl:attribute>
                <SimpleValue><xsl:value-of select="$cellValue"/></SimpleValue>
              </Value>
           </xsl:when>
        <xsl:otherwise/>
    </xsl:choose>
</xsl:template>

Column reference name

The last piece of the puzzle is to obtain the column name. As long as you don't have code lists with many columns, the method in Listing 7 is adequate. Most code lists tend to be in the 2-to-5-column range.

Listing 7. Creating column references
  <xsl:template name="getColref">
        <xsl:param name="column"/>
        <xsl:choose>
            <xsl:when test="$column = 'A'">
                <xsl:value-of select="translate($Column-A, ' ', '')"/>
            </xsl:when>
            <xsl:when test="$column = 'B'">
                <xsl:value-of select="translate($Column-B, ' ', '')"/>
            </xsl:when>
            <xsl:when test="$column = 'C'">
                <xsl:value-of select="translate($Column-C, ' ', '')"/>
            </xsl:when>
  </xsl:template>

Conclusion

Owners of code lists tend to be business users who are familiar with Microsoft tools. Business users can update and manage code list data in a tool they know well. Using XSLT to translate the business requirements from the technical requirements makes perfect sense; with little pain, the workflow to enable separation from business subject matter experts and technical needs is an efficient and practical technical solution.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=607008
ArticleTitle=Convert Excel 2007 XML to OASIS code lists
publish-date=01112011