Tip: Use the Unicode database to find characters for XML documents

The Unicode standard database has a wealth of characters for maximum expressiveness and even for fun

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought Inc.
Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is also a lead developer of the Versa RDF query language. He is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can find more about Mr. Ogbuji at his Weblog Copia or contact him at uche@ogbuji.net.

Summary:  The Unicode consortium is dedicated to maintaining a character set that allows computers to deal with the vast array of human writing systems. When you think of computers that manage such a large and complex data set, you think databases, and this is precisely what the consortium provides for computer access to versions of the Unicode standard. The Unicode Character Database comprises files that present detailed information for each character and class of character. The strong tie between XML and Unicode means this database is very valuable to XML developers and authors. In this article Uche Ogbuji introduces the Unicode Character Database and shows how XML developers can put it to use.

Date:  07 Mar 2006
Level:  Introductory

From the earliest conception of Unicode, a character representation standard designed to cover the vast array of human writing systems, it was clear that computers would need a convenient way to query information about characters. Each version of the Unicode standard has therefore included a corresponding version of the Unicode Character Database (UCD). The UCD home page describes the database as follows:

The Unicode Character Database (UCD) consists of a number of data files listing character properties and related data along with a documentation file that explains the organization of the database and the format and meaning of the data in the files.

Character properties are information needed to understand and use a character. Some character properties are independent of the context in which the character is used, for example, whether or not a character is customarily used as a numerical digit. Some character properties depend on its role in a sequence of characters, such as directionality (some writing systems proceed from left to right on a page, some from right to left, and some in further variations).

Navigating the Unicode database

The third edition of XML 1.0 incorporates by reference The Unicode Standard, Version 3.2, so this is the version of the UCD that XML developers will be most concerned with. The UCD includes a couple dozen data files and about a half dozen documentation files. In this article I'll focus on the main file, UnicodeData-3.2.0.txt. You can find a link to the 3.2 UCD directory in Resources. If you open this file, you'll find one line for each character. Each line is a set of fields delimited by semicolons. As an example, the following is the line for the uppercase A character.

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;

The first field is the code point, 0041, written in hexadecimal. Obtain the conventional identifier for this character by prepending "U+" (in this case, U+0041). You can represent the character in XML, regardless of the document's encoding, with the numeric character reference &#x41;. Beware that code points can use up to six hexadecimal digits, although five digits is the limit you'll find in UnicodeData-3.2.0.txt. The second field, LATIN CAPITAL LETTER A, is the character name, which is very important in discussion of the character. The third field, Lu, is the general category, which is the most important key in Unicode's system for organizing characters. The value Lu is an abbreviation for "Letter, Uppercase". Examples of other categories are Nd ("Number, Decimal Digit"), Pd ("Punctuation, Dash"), Sc ("Symbol, Currency"), and Zs ("Separator, Space"). There are several more fields in each UnicodeData-3.2.0.txt line, but the ones I've mentioned are the most widely used, and give you a flavor of the sort of information you can find in the UCD.
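
The field layout just described can be sketched in a few lines of Python (Python 3 is assumed here). The function name parse_record and the dictionary keys are illustrative choices of mine, not names defined by the UCD; only the first three fields are interpreted.

```python
# Sketch: split one UnicodeData.txt record into its leading fields.
# parse_record and the dict keys are hypothetical names, not part of the UCD.

def parse_record(line):
    """Return the first three fields of a UnicodeData.txt line as a dict."""
    fields = line.strip().split(";")
    code_point = int(fields[0], 16)  # field 0 is hexadecimal, without "U+"
    return {
        "code_point": code_point,
        "identifier": "U+%04X" % code_point,      # conventional identifier
        "name": fields[1],                        # character name
        "category": fields[2],                    # general category, e.g. Lu
        "xml_reference": "&#x%X;" % code_point,   # encoding-independent XML form
    }

record = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
print(record["identifier"], record["name"], record["category"], record["xml_reference"])
# prints: U+0041 LATIN CAPITAL LETTER A Lu &#x41;
```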

Finding characters for fun and profit

Since the UCD files are simple text you can use all sorts of generic tools to process them. You can often find a character of interest by loading UnicodeData-3.2.0.txt in a text editor and searching for a key word in the character name. You can also use command line tools such as grep in UNIX. In the following example, I look for the dagger characters commonly used to mark notes.

$ grep -i "dagger" UnicodeData-3.2.0.txt
2020;DAGGER;Po;0;ON;;;;;N;;;;;
2021;DOUBLE DAGGER;Po;0;ON;;;;;N;;;;;


The -i option makes the search case insensitive. You can also find fun characters by name. Some that I have come across are pencil, skull-and-crossbones and snowman.

$ grep -i skull UnicodeData-3.2.0.txt
2620;SKULL AND CROSSBONES;So;0;ON;;;;;N;;;;;
$ grep -i pencil UnicodeData-3.2.0.txt 
270E;LOWER RIGHT PENCIL;So;0;ON;;;;;N;;;;;
270F;PENCIL;So;0;ON;;;;;N;;;;;
2710;UPPER RIGHT PENCIL;So;0;ON;;;;;N;;;;;
$ grep -i snowman UnicodeData-3.2.0.txt
2603;SNOWMAN;So;0;ON;;;;;N;;;;;
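
If grep is not available, a similar search is easy to sketch in Python 3. The helper find_by_name below is a hypothetical name of mine, and unlike grep it matches the keyword against the character name field only, not the whole line. The sample list reuses records shown above so the sketch runs without the data file.

```python
# Sketch: case-insensitive keyword search over UnicodeData.txt records.
# find_by_name is a hypothetical helper; it inspects only the name field.

def find_by_name(lines, keyword):
    """Yield (identifier, name) for records whose name contains keyword."""
    needle = keyword.upper()
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) > 1 and needle in fields[1].upper():
            yield "U+" + fields[0], fields[1]

sample = [
    "2020;DAGGER;Po;0;ON;;;;;N;;;;;",
    "2021;DOUBLE DAGGER;Po;0;ON;;;;;N;;;;;",
    "2603;SNOWMAN;So;0;ON;;;;;N;;;;;",
]
matches = list(find_by_name(sample, "dagger"))
for identifier, name in matches:
    print(identifier, name)
```

To search the real file, you could pass an open file object in place of the sample list, assuming UnicodeData-3.2.0.txt is in the current directory.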



Help from the library

Your programming language of choice might provide convenient tools for accessing the Unicode database, so that you do not need to parse it yourself. The unicodedata module in the Python standard library provides a thin layer over the UCD. At the time of writing it is built from the UnicodeData-3.2.0.txt file from Unicode 3.2; the unicodedata.unidata_version attribute reports which version your interpreter bundles. The following interactive Python session demonstrates a few queries.

>>> import unicodedata
>>> print(unicodedata.name(u'5'))
DIGIT FIVE
>>> print(unicodedata.lookup('DIGIT FIVE')) #The lookup is case-insensitive
5
>>> print(unicodedata.digit(u'5'))
5
>>> #There are many ways in the world for writing the digit five
>>> print(unicodedata.digit(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE')))
5
>>> #Get the code point in decimal and hex ...
>>> print(ord(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE')))
1637
>>> print(hex(ord(unicodedata.lookup('ARABIC-INDIC DIGIT FIVE'))))
0x665
>>>


In Python 2 you indicate a Unicode string by prepending a u, as you can see in the second line of the session. In Python 3 every string is Unicode, and the prefix is accepted but optional.
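
As a small illustration of putting the module to work for XML authoring, the following Python 3 sketch turns a character name into a hexadecimal character reference. The function name xml_reference is my own; unicodedata.lookup and unicodedata.category are the standard library calls.

```python
import unicodedata

def xml_reference(character_name):
    """Return a hexadecimal XML character reference for a Unicode character name."""
    character = unicodedata.lookup(character_name)  # raises KeyError if the name is unknown
    return "&#x%X;" % ord(character)

for name in ("DAGGER", "SNOWMAN", "SKULL AND CROSSBONES"):
    category = unicodedata.category(unicodedata.lookup(name))
    print(name, xml_reference(name), category)
```

The category printed for each character matches the third field of the corresponding UnicodeData line shown earlier (Po for the dagger, So for the snowman and the skull and crossbones).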


Wrap up

Once you are familiar with the UCD, you can use it in all sorts of advanced ways: sorting values in XML files in internationally sound ways, normalizing data in XML files for easier comparison and digital signing, and much more. I use the database a lot because I'm not good at remembering the code points for some uncommon characters I use in XML files, such as special bullet points and international symbols. With the UCD, the wealth of the world's writing systems is right at your fingertips.
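
To give a taste of the normalization use case, here is a minimal Python 3 sketch using unicodedata.normalize. The variable names are illustrative. It shows two spellings of the same accented letter that compare unequal until both are brought to one normalization form.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT

# The two strings render alike but differ code point by code point
print(composed == decomposed)

# After normalizing to NFC (composed form) they compare equal
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)
```

Which normalization form to apply (NFC here) depends on the application; the point is only that both sides of a comparison use the same one.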


Resources

Learn

Get products and technologies
