Genericode started within the Organization for the Advancement of Structured Information Standards (OASIS) specification Universal Business Language (UBL) technical committee (see Resources). UBL originally followed the United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT) Core Components Technical Specification (see Resources for a link)—a specification that embedded code lists, such as currency codes and country codes, directly in the W3C Schema using enumerations.
The UBL community quickly found that code lists change frequently. Although embedding the lists directly in the schema provided easy validation, when a code changed in the list, the schema had to be updated. Handling updated schemas within multiple trading partners can be a logistic nightmare. Also, many products require recompiling the schema and reinitializing databases based on updated schemas.
The UBL technical committee developed Genericode as an abstraction layer to resolve updating the schema every time code changed (see Resources for a link to more information). Other validation methods such as Schematron can be used as the validation layer for validating codes. Genericode has a wider audience than UBL, and OASIS decided to create a new technical committee to manage the Genericode specification.
This article shows how to use Microsoft Office Excel 2007 spreadsheets to create code lists. You can then use XSLT to convert the lists from an Excel spreadsheet to Genericode.
Code lists in Office Excel 2007
Although this article deals specifically with Genericode conversion, the concepts can be generalized to other types of Excel documents and XML vocabularies. For example, I have done many conversions to and from Microsoft Office Excel 2003 and have found the process to be fairly straightforward. To convert to XML from Office Excel 2003, you must save the spreadsheet to Office Excel 2003 XML. One of my conversions is a schema analysis spreadsheet that creates a spreadsheet from a W3C schema. The Excel format provides an easy way for nontechnical individuals to understand relationships among and definitions of schema components.
Another "go-to" conversion I created was Office Excel 2003 XML to Genericode, a conversion that works in place in a production environment. In this conversion, many of the code lists have empty cells. In the Office Excel 2003 XML, however, if a cell is blank, it was not output into the XML, and there wasn't an easy way to automatically analyze the row and determine which cell was left out. This situation is true for both Office Excel 2003 and Office Excel 2007. Office Excel 2007, however, provides the ability to logically derive when a cell is blank.
For the purpose of this article, I use the simple structure in Figure 1, which represents an Excel code list.
Figure 1. Sample Excel code list
OASIS Genericode and Office Excel 2007 structures
Before beginning any conversion project, you must understand the starting structure and the ending structure. Genericode can become complicated, but for 98 percent of all code lists that people use, they are relatively straightforward. A simple code list is broken into three sections:
- Identification
- Column specification
- Code list data
Figure 2 provides a graphical view of the XML schema for
Genericode. A CodeList document (Type: gc:CodeListDocument)
has attributes to describe it. The gc:DocumentHeader
contains Annotation (Type: gc:Annotation) and
Identification (Type: gc:Identification).
The gc:ColumnSetChoice contains ColumnSet
(Type: gc:ColumnSet) and ColumnSetRef
(Type: gc:ColumnSetRef).
The gc:OuterCodeListChoice
contains gc:SimpleCodeListSequence and
SimpleCodeList (Type: gc:SimpleCodeList).
Figure 2. The Genericode structure
Office Excel 2007 XML isn't quite as straightforward as Genericode. In fact, the structure is rather bizarre. The XLSX file is really a ZIP file containing multiple files:
- Root:
- _rels
- docProps
- xl
- _rels
- printerSettings
- theme
- worksheets
Each directory has many XML files. For the conversion, you don't need to concern yourself with all the files or their purpose. Several of them are used internally by Excel to control the Excel application. The files of interest for the conversion in this article are:
- /docProps/app.xml. Contains organization information
- /docProps/core.xml. Obtain metadata for time, author, and so on
- /xl/worksheets/sheet1.xml. Contains mappings to the data
- /xl/sharedStrings.xml. Contains the actual data
Although there is logic to the files, at a quick glance that logic isn't obvious.
The sharedStrings.xml and sheet1.xml files work together. Let's look at the sharedStrings.xml file from Figure 1: Listing 1 shows the file structure for this XML file.
Listing 1. The sharedStrings.xml file structure
<sst>
<si>
<t>My Data 25</t>
</si>
<si>
<t>My Data 10</t>
</si>
...
</sst>
|
The interesting fact about this file is that there doesn't seem to be a logical order to how the data is included in the sharedStrings.xml file. The file is basically a flat listing of all the cells in the Excel spreadsheet.
Depending on the datatype of a cell, the location of the data is stored differently:
- String data. Lookup in sharedStrings.xml file
- Integer data. Located in the sheet1.xml file
- Formula data. Formula located in sheet1.xml
Note: It is unlikely that you will have formula data in a code list.
The sheet1.xml file provides a mapping to the data. Figure 3 shows the structure of the sheet1.xml file. (View a text-only of Figure 3.)
Figure 3. The structure of sheet1.xml
The meat of the data is within the <row> element.
Table 1 shows the elements of the sheet1.xml file.
Table 1. Elements and attributes of sheet1.xml
| Element or attribute | Description |
|---|---|
row | The beginning of a row in the Excel spreadsheet. |
@r | The row number. |
@spans | The number of columns the row spans. The first number is the beginning; the second number is the end. |
c | Information about a cell. |
@r | The physical location of the cell —for example, B10 would be column B, row 10. |
@s="1" | The actual value of the cell. In this case, a lookup is not required into the sharedStrings.xml file. |
@t | The datatype—for example, s
represents a string. When @t is available, the
value of the <v> element is used as a
lookup into the sharedStrings.xml file.
|
v | Contains the number of the <si>
element +1, where the actual data is located in the sharedStrings.xml file.
|
The value in the element is the XPath number (+1) of the <si>
element in the file sharedStrings.xml. The first cell represents string data. The
t attribute value is set to s.
The s represents a string value.
The second and third cells do not have a t attribute. The
absence of the t attribute indicates that the data is
included in this file. You can distinguish a calculated field from an integer by the
inclusion of the plus sign (+) as the first character.
Because there are multiple files and multiple ways of developing the conversion methodology, you must select an XML file as your base XML file. I chose to use the sheet1.xml file because it contains the mapping to the data. The data file (sharedStrings.xml) is static, and it would be difficult to deduce intelligent navigation from it. As stated earlier, Genericode has three main containers of information. Let's walk through the conversion for each container.
You need to perform lookups in several files. Rather than using the full path name of the various files, I find it easier and more readable to create global variables.
The following parameter sets the root directory for the XML files. If you are converting multiple code lists using an XProc pipeline or a batch job, you can send the value of the parameter to the process:
<xsl:param name="ExcelRoot">MyCodelist</xsl:param> |
You use the variable shown here to get the complete path and file name of the file that contains the information:
<xsl:variable name=”DataFile”> <xsl:value-of select=”$ExcelRoot”/>/xl/sharedStrings.xml</xsl:variable> |
The <worksheet> template is the
starting point for the conversion. Listing 2 provides
a sample of the template for the <worksheet>
element. Basically, my beginning template calls the conversion of each section.
The file that you want to start the conversion is sheet1.xml. The namespace
prefix is Excel (although you can change it, if you
prefer). The root element is <worksheet>.
Note that my template matches Excel:worksheet.
Listing 2. The beginning template
<xsl:template match="Excel:worksheet">
<CodeList xmlns="http://docs.oasis-open.org/codelist/ns/genericode/1.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:era="http://archives.gov/era/schema"
xsi:schemaLocation="http://docs.oasis-open.org/codelist/ns/genericode/1.0/
http://docs.oasis-open.org/codelist/cs-genericode-1.0/xsd/genericode.xsd">
<xsl:call-template name="commentBlock"/>
<xsl:call-template name="identificationBlock"/>
<xsl:call-template name="createColumnSet"/>
<xsl:apply-templates/>
</CodeList>
</xsl:template>
|
You must also create a default template to suppress unwanted Excel-specific data:
<xsl:template match=”*”/> |
For the purposes of this article, I concentrate on the body of the of the code
list (<SimpleCodeList>) because when you
have this logic, developing the other sections is simple. Each
organization has its own methodology for Excel code list templates.
Create <SimpleCodeList> and <Row> output
The <sheetData> element is a direct child of the
<worksheet> element. The
<SimpleCodeList> element is sparked from the
<sheetData> element. And the
<Row> element is sparked from the
<row> element. As you can see, this approach is straightforward.
Listing 3 shows the XML for creating the
<SimpleCodeList> and
<Row> elements.
Listing 3. Create <SimpleCodeList> and <Row>
<xsl:template match="Excel:sheetData">
<SimpleCodeList>
<xsl:apply-templates/>
</SimpleCodeList>
</xsl:template>
<xsl:template match="Excel:row">
<Row>
<xsl:apply-templates/>
</Row>
</xsl:template>
|
Genericode requires that column references are used for every entry in the
Genericode. This information is captured in the
<ColumnSet> element. I don't show how to
create the <Identification> and
<ColumnSet> entries because they
differ, depending on the template that your organization uses to maintain its
code lists in Excel. The column reference information, however, is required
to populate the @ColumnRef information in the
<Value> element.
I manually set the variable for each heading. There is probably a more elegant
way of setting these variables, but this mechanism works. Then, I create a
variable for columns. In Listing 4, I select the value of
the <v> element in cell A2. Remember that
because the numbering starts with 0, you need to add 1 to the
<v> value.
Listing 4. Create the column references
<xsl:variable name="ColumnALocation">
<xsl:value-of select=
"//Excel:worksheet/Excel:sheetData[1]/Excel:row[2]/Excel:c[1]/Excel:v + 1"
/>
|
From there, you need to get the actual value of the cell (see Listing 5).
In this case, I perform a test to make sure the column location isn't null. If
the value isn't null, then I perform a lookup in the sharedStrings.xml file,
where the position axis (position()) equals the
value of the value in the <v> element in
sheet1.xml. I use this variable later to determine the
@ColumnRef attribute.
Listing 5. Retrieving a cell's actual value
<xsl:variable name="Column-A">
<xsl:choose>
<xsl:when test="$ColumnALocation = ''"/>
<xsl:otherwise>
<xsl:value-of
select="document($DataFile)/Excel:sst/Excel:si[position() =
$ColumnALocation]"
/>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
|
If you look back at the example in Figure 1, you can see that
the value of column A1 is a1. You want your Genericode
representation to look like this:
<Value ColumnRef="ColumnA">
<SimpleValue>a1</SimpleValue>
</Value>
|
The @r attribute in the <c>
element contains the cell coordinate. For example, A2
represents column A, row 2. The example shows that the column headings are
A2, B2, and C2. You need to
know in which cells the title and column headings are located.
Now, you set the following three variables:
column. The column number—for example,Brepresents the second column. The@rvalue contains this information.dataLoc. The data location, which is the XPath + 1 position of where the data is located in the sharedStrings.xml file.cellValue. The actual data of the cell. You set this variable with a straight lookup in the sharedStrings.xml file.
After setting the variables, you start populating the data. The second part of this process is
to populate the @ColumnRef attribute. You could do so in
the same template, but I choose to create a second template and send the column
variable to find the name of the column, as in Listing 6.
Listing 6. Example <c> template
<xsl:template match="Excel:c">
<xsl:variable name="column"><xsl:value-of select="substring(@r, 1, 1)"/>
</xsl:variable>
<xsl:variable name="dataLoc"><xsl:value-of select="number(Excel:v) + 1"/>
</xsl:variable>
<xsl:variable name="cellValue">
<xsl:value-of
select="document($DataFile)/Excel:sst/Excel:si[position() = $dataLoc]"/>
</xsl:variable>
<xsl:choose>
<xsl:when test="Excel:v">
<Value>
<xsl:attribute name="ColumnRef">
<xsl:call-template name="getColref">
<xsl:with-param name="column"><xsl:value-of select="$column"/>
</xsl:with-param>
</xsl:call-template>
</xsl:attribute>
<SimpleValue><xsl:value-of select="$cellValue"/></SimpleValue>
</Value>
</xsl:when>
<xsl:otherwise/>
</xsl:choose>
</xsl:template>
|
The last piece of the puzzle is to obtain the column name. As long as you don't have code lists with many columns, the method in Listing 7 is adequate. Most code lists tend to be in the 2-to-5-column range.
Listing 7. Creating column references
<xsl:template name="getColref">
<xsl:param name="column"/>
<xsl:choose>
<xsl:when test="$column = 'A'">
<xsl:value-of select="translate($Column-A, ' ', '')"/>
</xsl:when>
<xsl:when test="$column = 'B'">
<xsl:value-of select="translate($Column-B, ' ', '')"/>
</xsl:when>
<xsl:when test="$column = 'C'">
<xsl:value-of select="translate($Column-C, ' ', '')"/>
</xsl:when>
</xsl:template>
|
Owners of code lists tend to be business users who are familiar with Microsoft tools. Business users can update and manage code list data in a tool they know well. Using XSLT to translate the business requirements from the technical requirements makes perfect sense; with little pain, the workflow to enable separation from business subject matter experts and technical needs is an efficient and practical technical solution.
Learn
- OASIS Universal Business Language (UBL) TC: Learn more about the UBL technical committee.
- The UN/CEFACT Core Components Technical Specification: Read this guide in the interpretation or implementation of ebXML concepts.
- Genericode: Read this OASIS specification for encoding code lists in XML.
- Learn more about XProc: An XML Pipeline Language (W3C Recommendation 11 May 2010): Learn more about XProc, a language for describing operations to be performed on XML documents.
- XML area on developerWorks: Get the resources you need to advance your skills in the XML arena.
- My developerWorks: Personalize your developerWorks experience.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML Zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks. Also, read more XML tips.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Yahoo! Groups related to XML: Join the discussions.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Betty Harvey is president of Electronic Commerce Connection, Inc. She has participated with many government and commercial enterprises in planning and executing their migration to structured information. Betty is involved in the ebXML initiative and is the co-author of Professional ebXML Foundations (Wrox, 2001) as well as many other publications. Betty started and coordinates the Washington, D.C., area SGML/XML Users Group and is also a member of the XML Guild. You can reach Betty at harvey@eccnet.com.



