The multilingual domain name race: On your mark get set WAIT!
As the Web continues to expand globally, the English-oriented system needs updating -- but how?

Suzanne Topping (stopping@rochester.rr.com)
Vice president, BizWonk, Inc.
01 Dec 2000

It's high time the powers that be came up with a more effective means of supporting multilingual domain names -- few people question that. The challenge is deciding which method will work best. In this article, Suzanne Topping delves into the details of this thorny issue, including several of the more noteworthy proposals for solving this dilemma, as well as the realities of implementing a solution. She also helps guide you through the morass of abbreviations associated with the issue (see Terminology).

Introduction
The world of Internet domain names is changing fast -- maybe not faster than a speeding bit, but at least faster than it's ever changed before. For example, in November 2000, the Internet Corporation for Assigned Names and Numbers (ICANN) announced seven new top-level domains (TLDs): .aero, .biz, .coop, .info, .museum, .name, and .pro. We'll soon be seeing these domains in use.

While debate was fairly hot about which of the proposed TLDs would be accepted, an even hotter issue was also under discussion. And that debate is far from over. The question of what method should be used for supporting multilingual domain names has prompted a great deal of controversy. There are as many outlooks on how to encode and process non-ASCII strings as there are domain name registrars.

Terminology
The e-world is characterized by acronyms and abbreviations, and the Internet development community uses its fair share. Here are some of the acronyms and abbreviations used throughout this article, along with their definitions.

ACE -- ASCII Compatible Encoding
DNS -- Domain Name System
IAB -- Internet Architecture Board
IANA -- Internet Assigned Numbers Authority
IDN WG -- Internationalized Domain Name Working Group
IDNS -- International Domain Name System
IETF -- Internet Engineering Task Force
RFC -- Request for Comments (refers to IETF documents)
UCS -- Universal Character Set (refers to Unicode and ISO 10646)

In the early days of the Internet, the Domain Name System (DNS) was intentionally designed to support only a limited subset of ASCII. This worked fine at the time, when users were primarily academics and programmers. But the approach merely delayed the problem we face today. Millions of people around the world can't use native words or phrases in domain names like English speakers can. This is a significant issue given the tremendous expansion of Internet use outside the United States over the past few years. Imagine how different your Internet experience would be if you had to enter strings of gibberish characters or numbers instead of simply typing "eatatjoes.com."

Addressing this need is universally important in the Internet development community, but implementing a solution is much less straightforward. Why? Because a standard has not been selected for how to actually deal with multilingual domain names. But this reality hasn't stopped some companies from offering to register them. Network Solutions, Inc., for example, announced a test program in August 2000 that will register domain names in 55 languages including Chinese, Japanese, Arabic, Korean, and Hebrew. Customer companies are jumping on board, scooping up regional names before they are taken. The problem with this prerelease is that each provider may implement a different solution, and that could result in a domain name system that is no longer universal. Instead of the World Wide Web, we could very well end up with pockets or clusters of webs. Unless a standard approach is applied, the result will be international domain name systems that can only communicate with systems using the same approach.

While the goal of offering support for non-ASCII characters is to open up the Web for wider use, the reality may be that if haphazardly implemented, this solution could result in technical incompatibilities that cause isolation.

About DNS
Before discussing International Domain Name System (IDNS) issues, we need to talk a bit about the existing Domain Name System.

The DNS' primary function is to map word-based domain names to numeric IP addresses. The DNS is closely tied to the applications and application protocols that use it, often at a fairly low level. This means that any changes to it will have a far-reaching impact on applications that interact with it.

Little thought was given to the need to support personal names and e-mail addresses. The DNS wasn't really designed to identify people, company names, brands, etc. The designers did allow for new data types and structures by including the ability to add new record types to the initial "Internet" class. These fields can contain information other than just the restricted text forms from the host table.

Proposals for IDNS therefore tend to fall into two camps: one group that works within the existing ASCII-based functioning of the DNS, and another that takes advantage of the ability to work with an extended version of the DNS.

Layers
The DNS is made up of several layers. The bottom layer passes packets across the Internet using a DNS query and response. The format and meaning of bits and octets in a DNS packet are crucial at this layer.

The "DNS service" sits above the bottom layer, and is created by an infrastructure of DNS servers. A "root cache file" (named.cache) lists the root servers.

The service layer (with which a user might interact) is often called the resolver library. This layer may be embedded in the operating system or system libraries of client machines.

API calls, such as gethostbyname and gethostbyaddr, reside at the top of the service layer.

The concept of layers becomes important to the discussion because a decision must be made about whether handling should be done at the server level, or by resolvers on user workstations.
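As a rough illustration of the top of the service layer, Python's standard socket module wraps the system resolver's gethostbyname call. The helper name lookup_address is hypothetical, and the sketch resolves an address literal so that it does not depend on any particular DNS server being reachable.

```python
import socket


def lookup_address(name):
    """Ask the system resolver for the IPv4 address of a host name.

    Returns None when the resolver cannot map the name.
    """
    try:
        return socket.gethostbyname(name)
    except (OSError, UnicodeError):
        # OSError covers resolver failures; UnicodeError covers names
        # that cannot be encoded for the lookup at all.
        return None


# An address literal is handed back unchanged, without a network query.
print(lookup_address("127.0.0.1"))
```

Everything below this call (the resolver library, the DNS service, and the packet layer) stays hidden from the application, which is why the question of where to handle non-ASCII names matters so much.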

So what's different about an IDNS?
The infrastructure for handling a wide range of characters and scripts is becoming ubiquitous, and software support for these characters is also becoming very widespread. Clearly something has to be done to get the DNS up to speed.

The essential concept behind IDNS technology is that it allows people to use domain names for Web URLs, e-mail addresses, and FTP in their native language, no matter what that language is.

Martin Dürst sheds a little more light on what this means. In an expired Internet draft titled "Internationalization of Domain Names," (see Resources), he states: "For domain name I18N to work inside the tight restrictions of domain name syntax, one has to define an encoding that maps strings of UCS characters to strings of characters allowable in domain names, and a means to distinguish domain names that are the result of such an encoding from ordinary domain names."

The need for standardization is not merely an esoteric problem. The IAB's RFC 2825 says that the challenge is to decide how to represent the names in a way that is clear, technically feasible, and ensures that a name always means the same thing. They believe that the best path forward is one that takes into account current realities and deployment issues. The RFC states: "In the Internet's global context, it is not enough to update a few isolated systems, or even most of the systems in a country or region. Deployment must be nearly universal in order to avoid the creation of 'islands' of interoperation that provide users with less access to and connection from the rest of the world."

Some of the issues that need to be addressed in an IDNS solution are described below.

Prohibited characters
Regardless of what proposal is selected, there must be limits on the characters that can be included in domain names. Some proposals include specification for these characters, but there is also an Internet draft (see Resources) that lists prohibited characters in detail. These characters include identical and near-identical characters, separators, non-displaying and non-spacing characters, private use characters, punctuation, and symbols. For UTF-8 based systems, surrogate characters are also prohibited.

Marc Blanchet's Internet draft "Handling Versions of Internationalized Domain Names Protocols" (see Resources) describes an approach for dealing with prohibited characters. He suggests that the IANA maintain an ASCII table that would contain all of the allowable characters. Blanchet believes that it would be easier for implementers to work with a list of accepted characters rather than unacceptable ones, and that the table would help decrease the variation of behaviors between individual implementations.
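The following is a minimal sketch of a prohibited-character check. It assumes that Unicode general categories are a fair stand-in for the classes listed above; the actual tables are defined in the drafts, not here. The names PROHIBITED_PREFIXES and find_prohibited are hypothetical, and the sketch makes no attempt to detect identical or near-identical characters.

```python
import unicodedata

# Assumption: category prefixes standing in for the prohibited classes.
# Z = separators, P = punctuation, S = symbols,
# C = control, format, private use, surrogate, and unassigned.
PROHIBITED_PREFIXES = ("Z", "P", "S", "C")
# Assumption: non-spacing marks are rejected outright.
PROHIBITED_CATEGORIES = {"Mn"}
# The hyphen is punctuation, but it is legal in existing host names.
ALLOWED_EXCEPTIONS = {"-"}


def find_prohibited(label):
    """Return the characters in a label that fall into a prohibited class."""
    bad = []
    for ch in label:
        if ch in ALLOWED_EXCEPTIONS:
            continue
        category = unicodedata.category(ch)
        if category.startswith(PROHIBITED_PREFIXES) or category in PROHIBITED_CATEGORIES:
            bad.append(ch)
    return bad


print(find_prohibited("bücher"))   # letters only
print(find_prohibited("a b!"))     # contains a space and punctuation
```

A table of accepted characters, as Blanchet suggests, would invert this logic: the check would test membership in a published list instead of deriving a verdict from character properties.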

Normalization and canonicalization
Dürst has played a pivotal role in the ongoing development of an IDNS. In an article in Multilingual Computing and Technology magazine, he comments that, "the greatest risk and threat to the Internet as a whole is an IDNS where you don't know which machine you get to when you type in a domain name." Dürst's Internet draft "Character Normalization in IETF Protocols" (see Resources) further states: "Early normalization is of particular importance for... domain names... In order for the protocol to work, it has to be very well-defined when two protocol element values match and when not."

Normalization and canonicalization are processes for ensuring that there is no confusion about which IP address a name is intended to map to. When converting between pre-existing character encodings and UTF-8 (as required by RFC 2279), some characters end up with duplicate representations. The equivalence between duplicates is called canonical equivalence. The existence of these duplicates raises questions about which part of the Internet infrastructure should take responsibility for dealing with them, and how.
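To make canonical equivalence concrete, here is a small sketch using Python's standard unicodedata module. It assumes normalization form C as the common form (the form the UDNS proposal, summarized later, calls for); the function name canonical_key is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"       # "é" stored as a single code point
decomposed = "cafe\u0301"    # "e" followed by a combining acute accent


def canonical_key(name):
    """Normalize a name to form C so that canonical duplicates compare equal."""
    return unicodedata.normalize("NFC", name)


# The two strings look identical on screen but differ code point by code point.
print(composed == decomposed)
# After normalization they match, so both map to the same name.
print(canonical_key(composed) == canonical_key(decomposed))
```

Without a step like this somewhere in the chain, two users typing what appears to be the same name could be sent to different machines.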

Versioning
Blanchet's Internet draft "Handling Versions of Internationalized Domain Names Protocols" (see Resources) describes the need to detect what version of the IDN protocol is in use. Because the accepted tables of characters will probably change in the future, information needs to be exchanged about which version is currently in use. IDN processors must verify the version number before handling the name, and reject it if the sending version number is greater than its own version.

The way that version indications are handled will depend on the proposal that is selected as the standard. (Standards proposals are summarized later in this article.) Blanchet suggests that proposals based on extensions of the DNS protocol should include a version number in the bits. The IDNE proposal defines version handling as part of the proposal; however, similar definitions should be created for the other extension-based proposals.

Proposals based on ACE would use a different prefix/suffix for each version. One of the characters in the prefix should be used as a version number, beginning with the lowest possible ASCII character available and increasing the ASCII codepoint by 1 for each version change. For example, if the prefix is "ra", then the first version of the ASCII-based IDN protocol would be "ra" and the second version would be "rb".
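The prefix scheme and the rejection rule can be sketched as follows. This assumes that the last character of the prefix carries the version, as in the "ra"/"rb" example, and that the caller already knows the name part is an ACE part. The names BASE_PREFIX, prefix_for_version, version_of, and accept are all hypothetical.

```python
BASE_PREFIX = "ra"  # prefix of the first protocol version, from the example above


def prefix_for_version(version, base=BASE_PREFIX):
    """Return the ACE prefix for a protocol version (1 -> 'ra', 2 -> 'rb', ...)."""
    if version < 1:
        raise ValueError("versions start at 1")
    return base[:-1] + chr(ord(base[-1]) + version - 1)


def version_of(ace_part, base=BASE_PREFIX):
    """Read the version back from the prefix of an ACE name part, or None."""
    if len(ace_part) < len(base) or ace_part[: len(base) - 1] != base[:-1]:
        return None
    offset = ord(ace_part[len(base) - 1]) - ord(base[-1])
    return offset + 1 if offset >= 0 else None


def accept(ace_part, own_version):
    """Reject a name part whose version is newer than the processor's own."""
    version = version_of(ace_part)
    return version is not None and version <= own_version


print(prefix_for_version(1), prefix_for_version(2))
print(accept("rbxyz", own_version=2), accept("rcxyz", own_version=2))
```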

Risks of moving too quickly
Because a standard approach has not been decided upon, there are a number of risks related to the spread of multilingual domain names. The biggest overall risk is for potential disruption to the "World Wide" part of the Web.

The IAB's RFC 2825, "A Tangled Web: Issues of I18N, Domain Names, and the Other Internet Protocols" (see Resources), sums this concern up: "These services must interoperate worldwide, or we risk isolating components of the network from each other along locale boundaries. This type of isolation could impede not only communications among people, but opportunities of the areas involved to participate effectively in e-commerce, distance learning, and other activities at an international scale, thereby retarding economic development."

Clearly, proceeding without a standard is seen as a significant risk.

John Klensin's Internet draft "Role of the Domain Name System" (see Resources) describes further reasons for not moving too quickly. He writes: "...protocols tend to be deployed at a just-past-prototype level, typically including the types of expedient compromises typical with prototypes. If they prove useful, the nature of the network permits very rapid dissemination. But, once the vacuum is filled, the installed base provides its own inertia: unless the design is so seriously faulty as to prevent effective use (or there is a widely-perceived sense of impending disaster unless the protocol is replaced), future developments must maintain backward compatibility and workarounds for problematic characteristics rather than benefiting from redesign in the light of experience. Applications that are 'almost good enough' prevent development and deployment of high-quality replacements."

In other words, if we implement a half-baked solution just for the sake of getting something in place, we may never go back and fix it.

These are the overriding, umbrella concerns. Here are some more specific areas of risk:

E-mail
Electronic mail is one of the most widely used and important applications of the Internet, and has been for some time. Internet-based mail provides a bridge between varied local and proprietary e-mail systems. Standard mail protocols like SMTP and MIME don't permit the full range of characters allowed in the DNS specification. Some mail systems were built to conform to these specifications, but are known to fail for non-ASCII domain names in the address. Failure can include mail being discarded, misrouted, or damaged. RFC 2825 summarizes: "Thus, it's not possible to simply switch to internationalized domain names and expect global e-mail to continue to work until most of the servers in the world are upgraded."

Network management
The Simple Network Management Protocol (SNMP) allows for UTF-8 based domain names; however, few implementations of SNMP-compliant software actually support it. Network management tools may therefore be unable to display or accept internationalized domain names if a UTF-8 solution is selected.

Security
Internet public key technologies such as PKIX and IKE rely heavily on character restrictions for domain names and user e-mail addresses. If these restrictions aren't followed, security tools used to interact with them, such as the Transport Layer Security protocol (TLS) and IPsec, will fail. But the type of failure that occurs may not be obvious. For example, unless comparison of domain names is properly defined, the client may fail to match the domain name of a legitimate server, or may match the domain name of a server performing a security attack. A third alternative, described in the Normalization and canonicalization section above, is that deployment of non-standard systems might result in name strings that are not globally unique. This would result in "spoofing" of hosts from one domain in another. (Spoofing is described in RFC 2826.)

Paul Hoffman's Internet draft "Preparation of Internationalized Host Names" (see Resources) states: "Much of the security of the Internet relies on the DNS. Thus, any change to the characteristics of the DNS can change the security of much of the Internet.

"Host names are used by users to connect to Internet servers. The security of the Internet would be compromised if a user entering a single internationalized name could be connected to different servers based on different interpretations of the internationalized host name."

Status of standards
The Internet Engineering Task Force requires that future IETF protocols support UTF-8, an ASCII-compatible encoding of UCS. Given this requirement, one would assume that the solution selected would be based on UTF-8. However, even that discussion is far from over.

Standards bodies
A number of standards organizations are playing a role in defining the path forward. Some of the most involved groups are:

  • Internet Engineering Task Force (IETF)
  • IETF Internationalized Domain Name Working Group (IDN WG)
  • BIND Development Team
  • Internet Architecture Board (IAB)
  • Internet Society (umbrella organization for IETF and IAB)
  • Internet Assigned Numbers Authority (IANA)
  • Asia Pacific Networking Group (APNG)
  • Multilingual Internet Names Consortium (MINC)

While each of these organizations plays an important part in developing proposals and moving issues forward, it is the IDN Working Group's responsibility to determine an actual standard.

IDN Working Group
The IDN WG is charged with evaluating IDNS proposals to ensure proper operation within the global network. Evaluation is based not only on technical features, but also on the impact to existing standards and operation, and on end-user effect. The group is committed to the idea that solutions must not cause users to become more isolated from their global neighbors, even if they appear to solve a local problem.

The working group says that an IDNS should:

  • Coexist with the current ASCII domain name space
  • Be compatible with current DNS servers (ISC BIND, NT DNS), clients, and Internet protocols
  • Build upon industry standards (e.g. ISO 8859-11, TIS 620 2533 (TSC), EUC-KR, KSC5601, GB2312, Big5 SJIS, etc.)
  • Enable DNS to resolve multilingual characters
  • Require minimal changes to the client/server (easy installation for system administrator, no configuration for end users or ISPs)

The group's purpose statement says: "A fundamental requirement is to not disturb the current use and operation of the domain name system, and for the DNS to continue to allow any system anywhere to resolve any domain name."

Proposal summaries
The following table lists some of the proposals that have been receiving a lot of attention throughout the IDNS debate. For a full list of proposals, go to the IDN WG's Web site (see Resources).

Proposal                                    | ASCII | Binary | Changes required to current DNS? | Encoding | Impact on root servers
IDNE                                        | x     | x      | Yes                              | UTF-8    | Yes
KWAN (based on the original DÜRST proposal) |       | x      | Yes                              | UTF-8    | Yes
RACE                                        | x     |        | No                               | Base32   | No
SENG                                        | x     |        | No                               | UTF-5    | No
UDNS                                        | x     | x      | Yes                              | UTF-8    | Yes

IDNE: Internationalized domain names using EDNS
The IDNE proposal (see Resources) uses the DNS extension mechanism called EDNS. With IDNE, you can send names as ASCII or binary, and the binary format is UTF-8. IDNE requires a new extended label type, which would be assigned by IANA. This label must be encoded using EDNS.

The extension allows some IDNE labels to be longer than 63 characters and some IDNE names to be longer than 255 octets. These length differences cause special requirements for handling that other proposals do not create.

The IDN protocol version number must be included when using IDNE. An OPTION-CODE will be assigned by IANA for storing the IDNE protocol version number. All requesters must send this information as part of the OPT RR included in the EDNS packet.

Transition and deployment
In order to deploy IDNE, clients, servers, applications, and protocols must all be updated. It would be unrealistic to think that all of these components could be upgraded overnight; therefore, the proposal includes a transition strategy. The proposal states that it may take decades for DNS servers to handle IDNE, and that in the interim an ASCII-compatible encoding (ACE) format for IDN names is also needed as a transition. (However, the proposal foresees an eventual all-IDNE DNS.) If the IETF chooses to have an ACE mechanism in use at the same time as IDNE, the proposal recommends that the ACE method should allow as many characters as possible in the name parts and full names.

The issue of name length has also been discussed, because there is a possibility that IDNE names could be too long for ACE protocols to handle.

KWAN: Using the UTF-8 character set in the Domain Name System
The KWAN proposal (see Resources) expands the Domain Name System standard to allow the use of UTF-8 character encoding, which is a superset of ASCII. Some of the key differentiators of the KWAN proposal are described below.

Downcasing and case handling
A UTF-8-aware DNS server can load and store DNS names which contain UTF-8 characters. Uniform conversion of uppercase characters to lowercase (called downcasing) allows UTF-8-aware DNS implementations to work with those that are not UTF-8-aware.

The DNS protocol standard states that the original case should be preserved whenever possible as data is entered into the system. The KWAN proposal modifies this requirement to read: "A UTF-8-aware DNS server must downcase all names containing UTF-8 characters in both record names and record data before transmitting those names in any message. A UTF-8-aware DNS client/resolver must downcase all names containing UTF-8 characters before transmitting those names in any message."

Caution should be used by applications that allow uppercase UTF-8 characters to be passed to the resolver. DNS servers should apply similar caution when allowing uppercase UTF-8 characters to be entered in zone data. Because downcasing in UTF-8 is locale-sensitive, the result may depend on the locale at the point of code execution. Results as expected will be consistently achieved if both the application and server accept only lowercase characters.
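A minimal sketch of downcasing before transmission follows. It relies on Python's str.lower(), which applies locale-independent Unicode case mappings; that choice is an assumption of this example and is not something the KWAN proposal prescribes. The names downcase_name and to_wire are hypothetical.

```python
def downcase_name(name):
    """Lowercase a domain name using locale-independent Unicode case mappings."""
    return name.lower()


def to_wire(name):
    """Downcase a name and encode it as UTF-8 octets for transmission."""
    return downcase_name(name).encode("utf-8")


print(downcase_name("BÜCHER.Example"))
print(to_wire("ÉCOLE.example"))

# The locale problem in miniature: the default mapping turns "I" into "i",
# whereas a Turkish-locale mapping would produce a dotless "ı" (U+0131).
print("I".lower() == "i")
```

Two implementations that downcase the same name under different locale rules would transmit different octets, which is why accepting only lowercase input is the safer course.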

Interoperability
UTF-8 is ideal for use with existing protocol implementations that expect US-ASCII, because the representation of US-ASCII characters is identical in both encodings. The proposal suggests that DNS server authors could provide a configuration switch to allow or disallow UTF-8 characters for each server or zone.

A non-UTF-8-aware DNS server may accept transfer of a zone containing UTF-8 names, but it may not be able to write back the names to a zone file or reload the names from a zone file. Under this system, administrators should consider the potential impact of transferring a zone containing UTF-8 names to a non-UTF-8-aware DNS server.

RACE: Row-based ASCII Compatible Encoding for IDN
The RACE proposal (see Resources) is based on ASCII Compatible Encoding (ACE), which is a method that satisfies all existing Internet standards. RACE converts strings with internationalized characters into strings of US-ASCII that are acceptable as host name parts in current DNS host naming usage. Parts that do not include international characters are not changed. RACE is designed so that every internationalized host name part can be represented as one and only one DNS-compatible string.

RACE is different from other ACE protocols because it can include more international characters. Names in the Han, Yi, Hangul syllables, or Ethiopic scripts can include up to 17 characters per name part, and names in most other scripts can include up to 35 characters. Names that use a mix of Latin and non-Latin characters can include up to 33 characters.

The length is also dependent on which row the characters come from (based on ISO 10646 rows):

  • If the characters all come from the same row, up to 35 characters per name part are allowed.
  • If the characters come from two or more rows, neither of which is row 0, up to 17 characters per name part are allowed.
  • If the characters come from two rows, one of which is row 0, between 17 and 33 characters per name part are allowed.
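The row rules in the list above can be sketched as a small lookup. The sketch takes the row of a character to be the high octet of its UTF-16 code unit and returns a (minimum, maximum) pair of character limits; the function names are hypothetical, and any combination the list does not cover is reported as None instead of being guessed.

```python
def rows_used(name_part):
    """Return the set of ISO 10646 rows (high octets) used by a name part."""
    data = name_part.encode("utf-16-be")
    return {data[i] for i in range(0, len(data), 2)}


def race_length_limit(name_part):
    """Return (min, max) characters allowed per name part, or None if unlisted."""
    if not name_part:
        raise ValueError("empty name part")
    rows = rows_used(name_part)
    if len(rows) == 1:
        return (35, 35)   # all characters come from the same row
    if 0 not in rows:
        return (17, 17)   # two or more rows, none of which is row 0
    if len(rows) == 2:
        return (17, 33)   # two rows, one of which is row 0
    return None           # three or more rows including row 0: not listed above


print(race_length_limit("bücher"))   # every character is in row 0
print(race_length_limit("한글"))      # two Hangul rows, neither is row 0
print(race_length_limit("aあ"))       # row 0 mixed with one other row
```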

Conversion requirements
Checking for problems with name parts (prohibited characters, case-folding, or canonicalization) must be done before converting to an ACE name part. Characters with codepoints above U+FFFF must be represented using surrogates.

The preconverted string consists of characters from the ISO 10646 character set in big-endian UTF-16 encoding.
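As a small illustration of the preconverted form, Python's built-in "utf-16-be" codec produces big-endian UTF-16 and emits a surrogate pair for any character above U+FFFF. The helper name code_units is hypothetical.

```python
def code_units(name_part):
    """Return the big-endian UTF-16 code units of a name part as integers."""
    data = name_part.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A character inside the 16-bit range is a single code unit.
print([hex(unit) for unit in code_units("é")])
# U+20000 lies above U+FFFF, so it becomes a high and a low surrogate.
print([hex(unit) for unit in code_units("\U00020000")])
```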

The basic process for preparing the name is to compress it, encode it, and give it a name tag. These steps are briefly described below. When converting back to an internationalized version, the process is essentially reversed.

Compression
The proposal provides detailed descriptions of the compression and decompression processes. Compression reduces a full string to as few octets as possible. However, the resulting number of octets is dependent on which rows characters initially came from. Compression and decompression rules in the proposal must be followed exactly in order to ensure that no single host name can have two encodings.

Encoding
In order to encode non-ASCII characters in DNS-compatible host name parts, they must be converted into legal characters. This is done with Base32 encoding. The proposal provides a table that maps input bits to output characters, along with detailed instructions for encoding and decoding.

Name tagging
One unique feature of the RACE proposal is the use of tags for converted domain name parts. Each of these parts should be tagged with the string "bq--". These characters were chosen because they are unlikely to exist in "real" host parts. Names are checked for error conditions, compressed, encoded using Base32, and then given the "bq--" prefix.
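The shape of the encode-and-tag steps can be sketched as below. This is deliberately not RACE itself: the compression step is omitted, so the output will not match what the proposal would produce, and the lowercase Base32 alphabet without padding is an assumption of this example. The sketch places the "bq--" tag at the front of the name part, and the function names to_ace and from_ace are hypothetical.

```python
import base64

ACE_TAG = "bq--"


def to_ace(name_part):
    """Turn an internationalized name part into a tagged ASCII string."""
    if name_part.isascii():
        return name_part                       # ASCII-only parts are not changed
    octets = name_part.encode("utf-16-be")     # compression step omitted here
    encoded = base64.b32encode(octets).decode("ascii")
    return ACE_TAG + encoded.rstrip("=").lower()


def from_ace(name_part):
    """Reverse to_ace: strip the tag, decode Base32, and rebuild the text."""
    if not name_part.startswith(ACE_TAG):
        return name_part
    encoded = name_part[len(ACE_TAG):].upper()
    encoded += "=" * (-len(encoded) % 8)       # restore the stripped padding
    return base64.b32decode(encoded).decode("utf-16-be")


ace = to_ace("bücher")
print(ace)
print(from_ace(ace))
```

Because the tagged result uses only letters, digits, and hyphens, it passes through existing DNS software untouched; only the two ends of the conversation need to know how to convert it.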

SENG: UTF-5, a transformation format of Unicode and ISO 10646
The SENG proposal (see Resources) describes a transformation format called UTF-5 for Unicode. It was created to address a variety of legacy system issues, not merely internationalized domain names. The strings that result from the UTF-5 conversion fall within a [A-V][0-9] alphanumeric range. This allows legacy systems or protocols that previously supported only alphanumerical characters to be multilingual. The DNS is one example of the type of system that UTF-5 can assist.

The proposal concedes that UTF-8 is the preferred transformation format for all new IETF standards, and is not attempting to go against that proclamation. Instead, it was proposed to support legacy applications or protocols that cannot be modified easily to handle 8 bits using UTF-8 encoding. Detailed instructions for conversion are described in the proposal.
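The sketch below shows one way such a transformation could work. It assumes that each code point is split into 4-bit groups with leading zeros dropped, that the first group of every character is written with a letter from G through V, and that the remaining groups are written as 0-9 and A-F. This reading is an assumption of the example and is not taken from the article; the proposal itself is the authority on the exact rules. The function names are hypothetical.

```python
UTF5_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def utf5_encode(text):
    """Encode text so that every output character falls within 0-9 and A-V."""
    out = []
    for ch in text:
        nibbles = format(ord(ch), "X")                   # hex digits, no leading zeros
        out.append(UTF5_DIGITS[int(nibbles[0], 16) + 16])  # G-V marks a new character
        out.append(nibbles[1:])                          # remaining digits stay 0-9, A-F
    return "".join(out)


def utf5_decode(encoded):
    """Reverse utf5_encode by starting a new character at each G-V digit."""
    chars = []
    value = None
    for symbol in encoded:
        digit = UTF5_DIGITS.index(symbol)
        if digit >= 16:
            if value is not None:
                chars.append(chr(value))
            value = digit - 16
        else:
            if value is None:
                raise ValueError("continuation digit before any start digit")
            value = value * 16 + digit
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)


print(utf5_encode("a"))
print(utf5_decode(utf5_encode("日本語")))
```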

UDNS: Using the Universal Character Set in the Domain Name System
The UDNS proposal (see Resources) defines how the Universal Character Set can be used in DNS without extending the current protocol, and how DNS is extended to overcome length limits in the future. Detailed instructions are provided for a range of issues, including how to do name matching, the effect on other protocols, and handling long names.

Character data
Character data in the DNS protocol must use ISO 10646 (UCS) as its coded character set. It must also be normalized using form C as defined in Unicode technical report #15 [UTR15]. Lastly, it must be encoded using UTF-8.

Legacy support
Much existing software expects host and domain names to use only a subset of ASCII, and may work incorrectly if it receives a response containing non-ASCII characters. The proposal describes the processing that needs to take place in order to deal with this reality.

To support the transition to UTF-8 in resolver code, the proposal recommends that a server recognize local encodings for the zones it has authority over. This will allow clients to use the local character set even before the resolver code is upgraded.

Handling long names
Because UTF-8 takes more than one octet for some characters, a UTF-8 name cannot have 63 characters in a label like an ASCII name can. For example, a name using Hangul would have a maximum of 21 characters. In order to support longer names, the UDNS proposal describes a long label type using an extended label -- 0b000011, for example. (The actual label type will be assigned by IANA.)
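The arithmetic behind the Hangul figure is easy to check: each Hangul syllable takes three octets in UTF-8, and 63 divided by 3 is 21. The sketch below applies the character data rules described above (form C, then UTF-8) and tests the result against the 63-octet label limit; the names prepare_label and fits_in_label are hypothetical.

```python
import unicodedata

MAX_LABEL_OCTETS = 63  # the classic DNS limit for a single label


def prepare_label(label):
    """Normalize a label to form C and encode it as UTF-8 octets."""
    return unicodedata.normalize("NFC", label).encode("utf-8")


def fits_in_label(label):
    """Check whether the prepared label fits in an ordinary DNS label."""
    return len(prepare_label(label)) <= MAX_LABEL_OCTETS


print(len(prepare_label("한" * 21)))   # 21 syllables at 3 octets each
print(fits_in_label("한" * 21))
print(fits_in_label("한" * 22))        # one syllable too many
```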

And now for something completely different...
The proposals described above either work with the existing ASCII limitations, or take advantage of the DNS extension capability (EDNS) to support non-ASCII characters. But one recommended approach stands out from the crowd.

John Klensin's Internet draft titled "Role of the Domain Name System" (see Resources) proposes a completely different solution to the IDNS problem. Instead of working with the limitations of the existing system, he suggests adding a "...directory layer which would use a two-stage lookup. This is not unlike several of the IDN proposals, but would do the first lookup in a directory system, rather than in the DNS itself. This would permit us to relax several constraints and produce a more comprehensive system."

Many people view the DNS as a directory, and this false perception creates problems. Klensin suggests that there is a real need for an actual directory system, rather than a series of "DNS patches, kludges, or workarounds."

He writes: "A directory system could permit explicit association of attributes of, e.g., language and country, with a name, without having to utilize trick encodings to incorporate that information in DNS labels (or creating an artificial hierarchy for doing so)."

Most proposals require changing resolver API calls in almost all Internet applications. Because changes have to be made anyway, Klensin believes that it is a relatively small matter to change from calling into the DNS, to calling a directory service first and then the DNS. He suggests that both actions could be accomplished in a single API call.
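A two-stage lookup behind a single call might look like the sketch below. Everything in it is hypothetical: the in-memory DIRECTORY stands in for a real directory service, the entries and addresses are made up for illustration, and the DNS stage is passed in as a function so that the example runs without a network.

```python
# Hypothetical directory: native-language names mapped to ordinary host names.
DIRECTORY = {
    "bücher": "buecher.example",
}

# Hypothetical stand-in for the DNS stage (documentation-range addresses).
FAKE_DNS = {
    "buecher.example": "192.0.2.10",
    "www.example": "192.0.2.20",
}


def lookup(name, directory, dns_resolve):
    """Stage 1: consult the directory. Stage 2: resolve the result in the DNS."""
    host = directory.get(name, name)   # names without an entry pass straight through
    return dns_resolve(host)


print(lookup("bücher", DIRECTORY, FAKE_DNS.get))
print(lookup("www.example", DIRECTORY, FAKE_DNS.get))
```

In a real application, the dns_resolve argument could be an existing resolver call, which is the sense in which the change to application code stays small.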

This is an interesting perspective that contrasts with the rest of the proposals; however, it seems unlikely that this approach will be taken.

Timeline
The IDN WG's timeline shows that a protocol will have been selected and a first draft written for it by December 2000, but this probably won't happen on schedule. While the IETF's meeting in San Diego during the week of December 10 should help move things forward, there is enough contention that it is doubtful a resolution will be reached.

Two primary issues to be resolved are: what encoding method to use for the names, and where to do transformations (at the server or workstation level).

While there are many possible ways to deal with internationalized names, not all of them can work seamlessly with the wide range of existing tools and services. If some solutions were implemented right now, large groups of people would be cut off from being able to effectively use the worldwide capability of the Internet.

If you look at the deployment history of other protocols, it typically takes years before an enhancement becomes ubiquitous. The IDN WG has looked closely at proposed solutions from a variety of sources, including organizations that plan to sell multilingual name-related services. And despite the pressures from many sides to get something -- anything -- selected, the group will continue to work toward a single, scalable, and deployable solution that ensures continued global interoperation.

If the group can complete this task while somehow retaining their sanity, the next task will be to develop transition plans. The first draft of a transition document is due out in March 2001, and the plan should be finalized in September 2001. The team deserves a big round of applause, and the thanks of millions of users around the world.

Resources

About the author

Suzanne Topping is vice president of BizWonk, Inc., a provider of international e-business solutions. Before starting BizWonk, Suzanne ran a globalization consulting business called Localization Unlimited. She has written for Language International magazine, is a frequent contributor to Multilingual Computing & Technology, and authored a chapter in Translating into Success, published by John Benjamins this year. You can reach Suzanne at stopping@rochester.rr.com.

