The multilingual domain name race: On your mark, get set, WAIT!
As the Web continues to expand globally, the English-oriented system needs updating -- but how?
Introduction
While debate was fairly hot about which of the proposed TLDs would be accepted, an even hotter issue was also under discussion. And that debate is far from over. The question of what method should be used for supporting multilingual domain names has prompted a great deal of controversy. There are as many outlooks on how to encode and process non-ASCII strings as there are domain name registrars.
In the early days of the Internet, the Domain Name System (DNS) was intentionally designed to support only a limited subset of ASCII. This worked fine at the time, when users were primarily academics and programmers. But the approach merely delayed the problem we face today. Millions of people around the world can't use native words or phrases in domain names the way English speakers can. This is a significant issue given the tremendous expansion of Internet use outside the United States over the past few years. Imagine how different your Internet experience would be if you had to enter strings of gibberish characters or numbers instead of simply typing "eatatjoes.com."
The Internet development community broadly agrees that this need must be addressed, but implementing a solution is much less straightforward. Why? Because no standard has been selected for how to actually deal with multilingual domain names. But this reality hasn't stopped some companies from offering to register them. Network Solutions, Inc., for example, announced a test program in August 2000 that will register domain names in 55 languages including Chinese, Japanese, Arabic, Korean, and Hebrew. Customer companies are jumping on board, scooping up regional names before they are taken.
The problem with this prerelease is that each provider may implement a different solution, and that could result in a domain name system that is no longer universal. Instead of the World Wide Web, we could very well end up with pockets or clusters of webs. Unless a standard approach is applied, the result will be international domain name systems that can only communicate with systems using the same approach. The goal of supporting non-ASCII characters is to open up the Web for wider use, but a haphazard implementation could instead create technical incompatibilities that cause isolation.
About DNS
The DNS' primary function is to map word-based domain names to numeric IP addresses. The DNS is closely tied to the applications and application protocols that use it, often at a fairly low level. This means that any changes to it will have a far-reaching impact on applications that interact with it. Little thought was given to the need to support personal names and e-mail addresses. The DNS wasn't really designed to identify people, company names, brands, etc. The designers did allow for new data types and structures by including the ability to add new record types to the initial "Internet" class. These fields can contain information other than just the restricted text forms from the host table. Proposals for IDNS therefore tend to fall into two camps: one group that works within the existing ASCII-based functioning of the DNS, and another that takes advantage of the ability to work with an extended version of the DNS.
Layers
The "DNS service" sits above the bottom layer, and is created by an infrastructure of DNS servers. A "root cache file" (
The service layer (with which a user might interact) is often called the resolver library. This layer may be embedded in the operating system or system libraries of client machines. API calls, such as
The concept of layers becomes important to the discussion because a decision must be made about whether handling should be done at the server level, or by resolvers on user workstations.
So what's different about an IDNS?
The essential concept behind IDNS technology is that it allows people to use domain names for Web URLs, e-mail addresses, and FTP in their native language, no matter what that language is. Martin Dürst sheds a little more light on what this means. In an expired Internet draft titled "Internationalization of Domain Names" (see Resources), he states: "For domain name I18N to work inside the tight restrictions of domain name syntax, one has to define an encoding that maps strings of UCS characters to strings of characters allowable in domain names, and a means to distinguish domain names that are the result of such an encoding from ordinary domain names."
The need for standardization is not merely an esoteric problem. The IAB's RFC 2825 says that the challenge is to decide how to represent the names in a way that is clear, technically feasible, and ensures that a name always means the same thing. The IAB believes that the best path forward is one that takes into account current realities and deployment issues. The RFC states: "In the Internet's global context, it is not enough to update a few isolated systems, or even most of the systems in a country or region. Deployment must be nearly universal in order to avoid the creation of 'islands' of interoperation that provide users with less access to and connection from the rest of the world."
Some of the issues that need to be addressed in an IDNS solution are described below.
Prohibited characters
Marc Blanchet's Internet draft "Handling Versions of Internationalized Domain Names Protocols" (see Resources) describes an approach for dealing with prohibited characters. He suggests that the IANA maintain an ASCII table that would contain all of the allowable characters. Blanchet believes that it would be easier for implementers to work with a list of accepted characters rather than unacceptable ones, and that the table would help decrease the variation of behaviors between individual implementations.
Normalization and canonicalization
Normalization and canonicalization are processes for ensuring that there is no confusion about which IP address a name is intended to map to. When converting between pre-existing character encodings and UTF-8 (as required by RFC 2279), some characters end up with more than one possible representation. The equivalence between these duplicates is called canonical equivalence. The existence of these duplicates raises questions about which part of the Internet infrastructure should take responsibility for dealing with them, and how.
Versioning
The way that version indications are handled will depend on the proposal that is selected as the standard. (Standards proposals are summarized later in this article.) Blanchet suggests that proposals based on extensions of the DNS protocol should include a version number in the bits. IDNE defines version handling as part of the proposal; similar definitions should be created for the other extension-based proposals. Proposals based on ACE would use a different prefix/suffix for each version. One of the characters in the prefix should be used as a version number, beginning with the lowest possible ASCII character available and increasing the ASCII codepoint by 1 for each version change. For example, if the prefix is "ra", then the first version of the ASCII-based IDN protocol would be "ra" and the second version would be "rb".
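Two of the issues above are easy to see in a few lines of Python. This is only a sketch: the article does not say which normalization form a standard would pick, so the use of NFC here is an assumption, and the helper names canonically_equivalent and next_version_prefix are hypothetical. The version helper also assumes that the last character of the prefix carries the version, as in the "ra"/"rb" example.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two names after normalizing both (NFC is an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def next_version_prefix(prefix: str) -> str:
    """Raise the last prefix character by one ASCII code point (assumed position)."""
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


precomposed = "caf\u00e9"   # "é" stored as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but differ code point by code point,
# so their UTF-8 encodings differ as well.
print(precomposed == decomposed)                                   # False
print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))   # False
print(canonically_equivalent(precomposed, decomposed))             # True

print(next_version_prefix("ra"))                                   # rb
```

Without a normalization step somewhere in the chain, the two spellings of the same visible name would be looked up as two different names.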
Risks of moving too quickly
The IAB's RFC 2825, "A Tangled Web: Issues of I18N, Domain Names, and the Other Internet Protocols" (see Resources), sums this concern up: "These services must interoperate worldwide, or we risk isolating components of the network from each other along locale boundaries. This type of isolation could impede not only communications among people, but opportunities of the areas involved to participate effectively in e-commerce, distance learning, and other activities at an international scale, thereby retarding economic development." Proceeding without a standard is clearly perceived as a significant risk.
John Klensin's Internet draft "Role of the Domain Name System" (see Resources) describes further reasons for not moving too quickly. He writes: "...protocols tend to be deployed at a just-past-prototype level, typically including the types of expedient compromises typical with prototypes. If they prove useful, the nature of the network permits very rapid dissemination. But, once the vacuum is filled, the installed base provides its own inertia: unless the design is so seriously faulty as to prevent effective use (or there is a widely-perceived sense of impending disaster unless the protocol is replaced), future developments must maintain backward compatibility and workarounds for problematic characteristics rather than benefiting from redesign in the light of experience. Applications that are 'almost good enough' prevent development and deployment of high-quality replacements." In other words, if we implement a half-baked solution just for the sake of getting something in place, we may never go back and fix it.
These are the overriding, umbrella concerns. Here are some more specific areas of risk:
E-mail
Network management
Security
Paul Hoffman's Internet draft "Preparation of Internationalized Host Names" (see Resources) states: "Much of the security of the Internet relies on the DNS. Thus, any change to the characteristics of the DNS can change the security of much of the Internet. "Host names are used by users to connect to Internet servers. The security of the Internet would be compromised if a user entering a single internationalized name could be connected to different servers based on different interpretations of the internationalized host name."
Status of standards
Standards bodies
While each of these organizations plays an important part in developing proposals and moving issues forward, it is the IDN Working Group's responsibility to determine an actual standard.
IDN Working Group
The working group says that an IDNS should:
The group's purpose statement says: "A fundamental requirement is to not disturb the current use and operation of the domain name system, and for the DNS to continue to allow any system anywhere to resolve any domain name."
Proposal summaries
IDNE: Internationalized domain names using EDNS
The extension allows some IDNE labels to be longer than 63 characters and some IDNE names to be longer than 255 octets. These length differences cause special requirements for handling that other proposals do not create. The IDN protocol version number must be included when using IDNE. An OPTION-CODE will be assigned by IANA for storing the IDNE protocol version number. All requesters must send this information as part of the OPT RR included in the EDNS packet.
Transition and deployment
In order to deploy IDNE, clients, servers, applications, and protocols must all be updated. It would be unrealistic to think that all of these components could be upgraded overnight, so the proposal includes a transition strategy. The proposal states that it may take decades for DNS servers to handle IDNE, and that in the interim an ASCII-compatible encoding (ACE) format for IDN names is also needed as a transition. (However, the proposal foresees an eventual all-IDNE DNS.) If the IETF chooses to have an ACE mechanism in use at the same time as IDNE, the proposal recommends that the ACE method should allow as many characters as possible in the name parts and full names. The issue of name length has also been discussed, because there is a possibility that IDNE names could be too long for ACE protocols to handle.
KWAN: Using the UTF-8 character set in the Domain Name System
Downcasing and case handling
The DNS protocol standard states that the original case should be preserved whenever possible as data is entered into the system. The KWAN proposal modifies this requirement to read: "A UTF-8-aware DNS server must downcase all names containing UTF-8 characters in both record names and record data before transmitting those names in any message. A UTF-8-aware DNS client/resolver must downcase all names containing UTF-8 characters before transmitting those names in any message." Applications that allow uppercase UTF-8 characters to be passed to the resolver should do so with caution, as should DNS servers that allow uppercase UTF-8 characters to be entered in zone data. Because downcasing in UTF-8 is locale-sensitive, the result may depend on the locale at the point of code execution. Results are consistent when both the application and the server accept only lowercase characters.
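The downcasing rule and its locale pitfall can be illustrated with a short Python sketch. The function names downcase_for_wire and is_already_lowercase are hypothetical, and Python's str.lower() applies Unicode's default, locale-independent case mapping, which is only one of the possible behaviors the proposal warns about.

```python
def downcase_for_wire(name: str) -> bytes:
    """Lowercase with Unicode's default mapping, then encode as UTF-8."""
    return name.lower().encode("utf-8")


def is_already_lowercase(name: str) -> bool:
    """True if downcasing would leave the name unchanged."""
    return name == name.lower()


print(downcase_for_wire("BÜCHER.example"))   # b'b\xc3\xbccher.example'

# Locale sensitivity: the default mapping turns "I" into "i", but a Turkish
# reader would expect the dotless "ı" (U+0131), so two systems applying
# different rules would transmit different names.
TURKISH_DOTLESS_I = "\u0131"
print("I".lower() == TURKISH_DOTLESS_I)      # False

# Accepting only names that are already lowercase sidesteps the question.
print(is_already_lowercase("bücher.example"))  # True
print(is_already_lowercase("Bücher.example"))  # False
```

Rejecting mixed-case input at the edge, as in is_already_lowercase, is one way to follow the article's advice that applications and servers accept only lowercase characters.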
Interoperability
A non-UTF-8-aware DNS server may accept transfer of a zone containing UTF-8 names, but it may not be able to write back the names to a zone file or reload the names from a zone file. Under this system, administrators should consider the potential impact of transferring a zone containing UTF-8 names to a non-UTF-8-aware DNS server.
RACE: Row-based ASCII Compatible Encoding for IDN
RACE is different from other ACE protocols because it can include more international characters. Names in the Han, Yi, Hangul syllables, or Ethiopic scripts can include up to 17 characters per name part, and names in most other scripts can include up to 35 characters. Names that use a mix of Latin and non-Latin characters can include up to 33 characters. The length is also dependent on which row the characters come from (based on ISO 10646 rows):
Conversion requirements
The preconverted string consists of characters from the ISO 10646 character set in big-endian UTF-16 encoding. The basic process for preparing the name is to compress it, encode it, and give it a name tag. These steps are briefly described below. When converting back to an internationalized version, the process is essentially reversed.
Compression
Encoding
Name tagging
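The compress, encode, and tag sequence described under "Conversion requirements" can be sketched in Python as follows. This is a deliberately simplified illustration and not the RACE algorithm itself: the compression rule (store the shared row octet once, otherwise mark the data as uncompressed with 0xD8), the lowercase base32 alphabet, and the "bq--" tag are all assumptions made for this example, and the function names are hypothetical.

```python
import base64

TAG = "bq--"  # assumed name tag that marks an encoded name part


def compress(label: str) -> bytes:
    """Row-based compression over big-endian UTF-16 (simplified)."""
    utf16 = label.encode("utf-16-be")
    rows = set(utf16[0::2])              # high octet of every 16-bit unit
    if len(rows) == 1:
        # All characters share one ISO 10646 row: emit the row once,
        # followed by the low octets only.
        return bytes([rows.pop()]) + utf16[1::2]
    return b"\xd8" + utf16               # 0xD8 flags "not compressed"


def decompress(data: bytes) -> str:
    """Reverse of compress()."""
    if data[0] == 0xD8:
        return data[1:].decode("utf-16-be")
    row = data[0]
    utf16 = b"".join(bytes([row, low]) for low in data[1:])
    return utf16.decode("utf-16-be")


def to_ace(label: str) -> str:
    """Compress, base32-encode, and tag one name part."""
    encoded = base64.b32encode(compress(label)).decode("ascii")
    return TAG + encoded.rstrip("=").lower()


def from_ace(ace: str) -> str:
    """Strip the tag, decode, and decompress one name part."""
    if not ace.startswith(TAG):
        raise ValueError("not a tagged name part")
    body = ace[len(TAG):].upper()
    body += "=" * (-len(body) % 8)       # restore base32 padding
    return decompress(base64.b32decode(body))


single_row = "ελλάδα"   # all characters come from one row
mixed_rows = "日本語"    # characters come from different rows
for label in (single_row, mixed_rows):
    ace = to_ace(label)
    print(ace, from_ace(ace) == label)
```

Because the single-row label stores one octet per character while the mixed-row label stores two, the sketch also hints at why the permitted name length depends on which rows the characters come from.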
SENG: UTF-5, a transformation format of Unicode and ISO 10646
The proposal concedes that UTF-8 is the preferred transformation format for all new IETF standards, and does not attempt to go against that position. Instead, UTF-5 was proposed to support legacy applications or protocols that cannot easily be modified to handle 8-bit data in UTF-8 encoding. Detailed instructions for conversion are described in the proposal.
UDNS: Using the Universal Character Set in the Domain Name System
Character data
Legacy support
To support the transition to UTF-8 in resolver code, the proposal recommends that a server recognize local encodings for the zones it has authority over. This will allow clients to use the local character set even before the resolver code is upgraded.
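A minimal sketch of what recognizing a local encoding might look like on the server side, assuming each zone is configured with one legacy encoding. The function name to_utf8_name and the zone_encoding parameter are hypothetical and do not come from the proposal.

```python
def to_utf8_name(raw: bytes, zone_encoding: str) -> bytes:
    """Return the name as UTF-8, falling back to the zone's legacy encoding."""
    try:
        raw.decode("utf-8")          # already valid UTF-8: pass it through
        return raw
    except UnicodeDecodeError:
        # Assume an older client sent the name in the zone's local encoding.
        return raw.decode(zone_encoding).encode("utf-8")


legacy_query = "日本".encode("shift_jis")    # from a client not yet upgraded
modern_query = "日本".encode("utf-8")        # from an upgraded resolver

print(to_utf8_name(legacy_query, "shift_jis") == modern_query)   # True
print(to_utf8_name(modern_query, "shift_jis") == modern_query)   # True
```

Trying UTF-8 first is a heuristic: some legacy byte sequences happen to be valid UTF-8, so a real server would need a more careful way to tell the two apart.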
Handling long names
And now for something completely different...
John Klensin's Internet draft titled "Role of the Domain Name System" (see Resources) proposes a completely different solution to the IDNS problem. Instead of working with the limitations of the existing system, he suggests adding a "...directory layer which would use a two-stage lookup. This is not unlike several of the IDN proposals, but would do the first lookup in a directory system, rather than in the DNS itself. This would permit us to relax several constraints and produce a more comprehensive system."
Many people view the DNS as a directory, and this false perception creates problems. Klensin suggests that there is a real need for an actual directory system, rather than a series of "DNS patches, kludges, or workarounds." He writes: "A directory system could permit explicit association of attributes of, e.g., language and country, with a name, without having to utilize trick encodings to incorporate that information in DNS labels (or creating an artificial hierarchy for doing so)."
Most proposals require changing resolver API calls in almost all Internet applications. Because changes have to be made anyway, Klensin believes that it is a relatively small matter to change from calling into the DNS to calling a directory service first and then the DNS. He suggests that both actions could be accomplished in a single API call. This is an interesting perspective that contrasts with the rest of the proposals; however, it seems unlikely that this approach will be taken.
Timeline
Two primary issues to be resolved are: what encoding method to use for the names, and where to do transformations (at the server or workstation level). While there are many possible ways to deal with internationalized names, not all of them can work seamlessly with the wide range of existing tools and services.
If some solutions were implemented right now, large groups of people would be cut off from being able to effectively use the worldwide capability of the Internet. If you look at the deployment history of other protocols, it typically takes years before an enhancement becomes ubiquitous. The IDN WG has looked closely at proposed solutions from a variety of sources, including organizations that plan to sell multilingual name-related services. And despite the pressures from many sides to get something -- anything -- selected, the group will continue to work toward a single, scalable, and deployable solution that ensures continued global interoperation. If the group can complete this task while somehow retaining their sanity, the next task will be to develop transition plans. The first draft of a transition document is due out in March 2001, and the plan should be finalized in September 2001. The team deserves a big round of applause, and the thanks of millions of users around the world.