Preparing your Web site for machine translation

How to avoid losing (or gaining) something in the translation

Theresa A. (Teri) O’Connell has practiced usability engineering in both the private and the public sectors in North America and in Europe. She was a principal investigator in the DARPA Machine Translation Initiative. This program established guidelines for MT evaluation. In this article she combines her interest in machine translation with her strong background in human-computer interaction. Her current research interests are internationalization and human interaction with intelligent agents.

Summary:  Machine translation is a sophisticated technology. However, it is not as sophisticated as human language. Understanding how MT works on the Web helps designers and developers prepare Web pages for MT. Preparatory tactics improve the usability of MT output.

Date:  24 Jul 2001
Level:  Intermediate
Also available in:   Japanese


You have a wonderful new idea. You want to share it with the whole world. So, you publish it on your Web page -- in your own native language. This tactic makes sense because Web publications are accessible all over the globe. But what happens when people who do not read your language try to understand your message?

More and more often, what happens is that your users run your page through machine translation (MT). This technology has, in recent years, taken on the challenges of helping Web users share ideas across language barriers. The Web even offers free MT.

While MT is not always a suitable substitute for human translation, it is a valuable tool. This is especially true on the Web, where users are so often in a hurry to understand a page’s content. Every day, the Web-using population becomes more linguistically diverse. Every day, it becomes more probable that a significant number of users will run your Web pages through MT. Understanding how MT works and following a few writing guidelines can raise the usability of your Web site for your international audience.

What is MT?

Machine translation is the process of converting a string in one human language into a string in another human language. MT was one of computing’s earliest ambitions, but only in the last few years has it been widely available over the Web.


MT today

A quick look inside an MT system reveals the state of the technology. No matter how an MT system goes about its task, there are always at least two components: one to analyze source language input, and one to generate target language output.

Transfer systems

Often, in today’s systems, there is a third component called the transfer engine. To resolve differences between the source and target languages, the transfer component runs analysis output through bilingual rules. It then passes its output to the generation component (see Figure 1).


Figure 1: A transfer system has three components. The transfer engine resolves differences among languages.

Each transfer component addresses a specific language pair. It goes in one direction only -- for example, from English to Spanish, but not from Spanish to English. Transfer systems typically include several transfer engines. This way, they can translate in both directions across more than one language pair.
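As a rough illustration of the three-stage flow, the toy sketch below chains an analysis step, a one-directional English-to-Spanish transfer step, and a generation step. Every name, the three-word lexicon, and the single reordering rule are invented for this example. Real systems use far richer analysis and many more bilingual rules.

```python
# Toy transfer-style pipeline: analyze -> transfer -> generate.
# All names and data are hypothetical, for illustration only.

# Bilingual lexicon for ONE direction only (English -> Spanish).
EN_ES_LEXICON = {"the": "el", "red": "rojo", "car": "coche"}
ADJECTIVES = {"red"}


def analyze(sentence):
    """Analysis: split source text into (word, part-of-speech) pairs."""
    return [(w, "ADJ" if w in ADJECTIVES else "OTHER")
            for w in sentence.lower().split()]


def transfer_en_es(tagged):
    """Transfer: apply bilingual rules for the English -> Spanish pair."""
    out = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        if pos == "ADJ" and i + 1 < len(tagged):
            # Simplified rule: the word after an adjective is treated as
            # its noun, and Spanish places the noun first.
            noun = tagged[i + 1][0]
            out.append(EN_ES_LEXICON.get(noun, noun))
            out.append(EN_ES_LEXICON.get(word, word))
            i += 2
        else:
            out.append(EN_ES_LEXICON.get(word, word))
            i += 1
    return out


def generate(words):
    """Generation: produce the target-language string."""
    return " ".join(words)


def translate_en_es(sentence):
    return generate(transfer_en_es(analyze(sentence)))
```

Translating from Spanish to English would need a separate transfer function with its own lexicon and rules, which is why transfer systems bundle several engines.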

Interlingua systems

Another approach uses a language-independent meaning representation called an interlingua. Analysis translates the source language into the interlingua. Generation translates the interlingua strings into the target language.


Figure 2: There are two stages to interlingua MT. This approach depends on a language-independent notation.

Because the interlingua is language-independent, an interlingua system can accommodate more than one language pair. The difficulties of creating a totally language-independent interlingua limit the use of this approach.
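A little arithmetic shows why the interlingua idea is attractive. The sketch below assumes, for simplicity, that every language must translate into every other language. Under that assumption a transfer design needs one engine per ordered language pair, while an interlingua design needs one analyzer and one generator per language. The function names are made up for this example.

```python
def transfer_engines_needed(n_languages):
    """One one-directional transfer engine per ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def interlingua_components_needed(n_languages):
    """One analyzer (into the interlingua) and one generator (out of it) per language."""
    return 2 * n_languages


for n in (2, 5, 10):
    print(n, transfer_engines_needed(n), interlingua_components_needed(n))
```

The gap widens quickly as languages are added, although the count says nothing about how hard each interlingua component is to build.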

Statistical systems

Statistical systems compute the probability that a string in one language translates into a particular string in another language. Statistical systems require large collections of correct translations between the source and target languages. Expert human translators produce these texts.
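One simple way to picture this is a relative-frequency estimate: count how often human translators rendered a source word a particular way, then divide by the total. The sketch below does this over a tiny invented corpus. The data and function names are hypothetical, and real statistical systems use far larger corpora and more elaborate models.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus of aligned (English, Spanish) word pairs,
# standing in for a large collection of human translations.
PARALLEL = [
    ("bank", "banco"),
    ("bank", "banco"),
    ("bank", "orilla"),
    ("note", "nota"),
]


def translation_probabilities(pairs):
    """Estimate P(target | source) by relative frequency."""
    counts = defaultdict(Counter)
    for src, tgt in pairs:
        counts[src][tgt] += 1
    return {
        src: {tgt: c / sum(tgts.values()) for tgt, c in tgts.items()}
        for src, tgts in counts.items()
    }


def best_translation(probs, src):
    """Pick the most probable target string for a source string."""
    return max(probs[src], key=probs[src].get)


probs = translation_probabilities(PARALLEL)
```

In this toy corpus "bank" was translated as "banco" in two of three cases, so the estimate favors "banco" unless other evidence says otherwise.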


MT: A sophisticated technology

Today, most MT systems combine several of these approaches. Therefore, they are called hybrid systems. MT systems represent sophisticated technology. Yet, they do not usually achieve the quality of human translation, since MT technology is neither as complex as human language nor as sophisticated as human translators.

An MT system’s knowledge of words resides in a lexicon. A lexicon usually contains information on 250,000 to 500,000 words. Consider that a large unabridged English dictionary such as Merriam-Webster’s lists on the order of 500,000 entries, before counting much specialized jargon. Now you can see why out-of-the-box MT works best on common words and in subject areas that do not depend on jargon.

MT systems typically use between 500 and 1,000 grammar rules, depending on the languages involved and how sophisticated the specific system is.

It is not economically feasible to develop MT packages to translate across all the languages in the world, but products are available for most of the languages used in international business.


MT evaluation

So, how do you judge the quality of MT output? Researchers for the Defense Advanced Research Projects Agency (DARPA) developed three criteria: adequacy, informativeness, and fluency. Adequacy measures how much input meaning makes its way into the translation. Informativeness refers to whether or not a user can find the specific information he or she needs. Fluency assesses how smooth the translation is. Fluency accounts for factors such as spelling and word usage. Taken together, adequacy, informativeness, and fluency add up to user success and satisfaction.

Adequacy marks depend on factors as varied as subject matter and writing style. MT is usually adequate enough for the user to understand the gist of a message. Sometimes the user’s goal is to decide whether to send material to an expert for human translation. In this case, MT is generally sufficient. If, for example, the user needs to learn the location of a meeting, it is probable that MT will be informative enough. If the user requires literary quality output, it is equally probable that MT will not meet fluency requirements.

Because most users bring little patience to their Web sessions, effort becomes a fourth usability criterion. The less user effort required, the more users are satisfied with the Web MT session.

At present, raw MT output meets users’ needs only to a limited degree. To achieve high adequacy, informativeness, and fluency, human intervention is often necessary. This can take the form of user interaction during translation or post-editing. Preparing a text for machine translation can improve output quality and the user experience.


Back translation

A common misunderstanding about MT evaluation is the belief that back translation can disclose a system’s usability. Back translation consists of translating text into a target language and then back into the source language. For example, the system translates text from English into German and then it translates output from German back into English.

The theory is that if back translation returns English input exactly, the system performs well for this language pair. In reality, evaluators cannot tell if errors occurred during the passage to German or during the return passage to English. In addition, any errors that occur in the first translation into German cause more problems in the back translation.
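The toy sketch below makes the problem concrete with one-word dictionaries standing in for MT engines. All of the data is invented for illustration. In the first case two errors cancel out, so the round trip looks perfect. In the second case the German is correct, but the round trip still fails to match.

```python
def round_trip(word, forward, backward):
    """Translate a word into the target language and back again."""
    return backward[forward[word]]


# Case 1: the forward step is wrong (German "Gift" means "poison"),
# and the backward step is wrong in a way that hides the first error.
faulty_forward = {"gift": "Gift"}
faulty_backward = {"Gift": "gift"}

# Case 2: the forward step is correct, but the backward step returns
# an acceptable synonym instead of the original word.
good_forward = {"gift": "Geschenk"}
synonym_backward = {"Geschenk": "present"}

false_pass = round_trip("gift", faulty_forward, faulty_backward) == "gift"
false_fail = round_trip("gift", good_forward, synonym_backward) != "gift"
```

Both flags end up true: an exact round trip does not prove the German was right, and a mismatch does not prove it was wrong.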


Free Web MT and commercial MT

The companies that provide Web MT often also market a commercial version of machine translation software. For a fee, many of these vendors also provide post-editing by professional human translators. Professional post-editing achieves literary quality. This service is not generally used by Web surfers. A Web user in a hurry usually chooses either commercial off-the-shelf (COTS) MT or free Web MT. Free Web MT is available at no cost to the end user through a site such as a search provider.

When it comes to translating a Web page, there are important differences between COTS MT and free Web MT. In general, COTS versions offer more user control over translation quality. Web MT lends itself to quick informative gisting (gisting is the machine version of skimming: it conveys the general sense of a text rather than a polished translation) while COTS MT also offers functions to improve adequacy and fluency. COTS MT serves both monolingual and bilingual users. Free Web MT focuses on the simple information needs of monolingual users.

COTS users can add to the system’s store of linguistic knowledge by teaching the system new words or phrases. They can often add information about these words and phrases. For example, users can teach the system that the English word "cow" is a singular, animate noun with female gender. This information ensures that "cow" translates into Italian as "vacca" and not "vacche" (cows) or "toro" (bull). Free Web MT users must depend on the provider’s vocabulary selections.

In cases where a system needs to know many domain-specific terms, many commercial MT providers customize lexicons for their clients. At best, free Web MT offers users a small set of domain-specific lexicons. Web MT users whose topics do not fall into the offered lexicons do not have the advantages of professionally developed lexicons customized to their own specific needs.

COTS MT typically empowers the user to identify words that the system should not translate, such as a company’s name. In free Web MT, such words or phrases run the risk of literal translation or partial translation. Use free Web MT to translate "The Ajax Widget Company" into French. Notice that the result, "L'Ajax widget compagnie" leaves some words in English, but translates others into French.
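One way to picture a do-not-translate list is placeholder masking: swap each protected term for a neutral token before translation, then swap it back afterward. The sketch below is only an illustration of that idea. The function names and the token format are invented, `fake_mt` is a stand-in for a real engine, and whether a given MT engine leaves such tokens untouched is an assumption you would need to verify.

```python
def protect_terms(text, terms):
    """Replace each protected term with a placeholder token such as __KEEP0__."""
    mapping = {}
    for i, term in enumerate(terms):
        token = f"__KEEP{i}__"
        if term in text:
            text = text.replace(term, token)
            mapping[token] = term
    return text, mapping


def restore_terms(text, mapping):
    """Put the original terms back in place of their placeholder tokens."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


def fake_mt(text):
    # Stand-in for a real MT call; it only knows one phrase.
    return text.replace("makes gears", "fabrique des engrenages")


source = "The Ajax Widget Company makes gears."
masked, mapping = protect_terms(source, ["The Ajax Widget Company"])
translated = restore_terms(fake_mt(masked), mapping)
```

Here the company name passes through whole, instead of being translated in part.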

Some COTS packages empower users to participate in the translation process. Users can accept or modify text as it appears on the screen. These packages divide screen real estate into simultaneous displays of input and output. A bilingual user can easily check the system’s work and correct errors. COTS packages offer time-saving features such as lists of verb forms to help users post-edit output. Web MT typically does not offer post-editing or interaction tools. If users want to refine free Web MT output, they must resort to external resources, such as bilingual dictionaries.

Some COTS MT systems integrate with users’ e-mail; some even assist in chat sessions. This functionality is especially handy for users who only want information. These users can rely on MT to communicate simple facts, such as a shipping date.

And COTS MT is efficient. In two minutes, it can process the same 2,000 words that a human translator needs a day to translate. Speed on the Web is subject to the vagaries of the server. To accommodate many users, Web MT usually limits the number of words that a user can submit at one time.

Web MT lends itself to casual browsing. A user can simply paste in a URL and choose a language combination. With one more click, the system translates the page. If the user clicks on a link on the translated page, the system automatically translates the linked page. This functionality is now showing up in COTS as well.

COTS MT packages typically integrate with the user’s choice of word processing software. Some COTS products also integrate with spreadsheets and other applications. Web MT requires users to cut and paste text into their word processors to post-edit. This means more user effort.

Today, the differences between free Web MT and COTS are beginning to disappear. Although free Web MT still does not offer the full range of COTS functionality, users can subscribe to a free MT service with a choice of several lexicons. They can even suggest new additions to a lexicon.


Guidelines

Anyone who runs Web pages through free Web MT probably has an amusing story about the outcome. But there are some simple steps that you can take to increase adequacy, informativeness, and fluency. Some tactics address clashes between MT and Web technologies. Others align human languages that convey the same concepts in different ways. Using these tactics can lead to a better user experience.

You should always assume that your users are going to use free Web MT. This approach prepares your site for both COTS and Web MT.

Keep Web pages and paragraphs short

If you want your whole message to get through, keep your Web pages short. Do not expect Web MT to handle more than 200 words at a time. Of course, your users can copy and paste text, a few paragraphs at a time. But that sort of dedication requires effort.

Small pages mean a trade-off between a deeper site navigation schema and shorter text. To keep a site shallow, sometimes you must present everything on one Web page. In this case, organize material into logical chunks. Keep paragraphs well below 200 words. Section headings provide logical breaking points for users to navigate their cutting and pasting tasks.
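A quick script can flag paragraphs that exceed the 200-word guideline. The sketch below assumes that your page text is available as plain text with blank lines between paragraphs. The function name and the default limit are choices made for this example.

```python
def long_paragraphs(text, limit=200):
    """Return (index, word_count) for each paragraph longer than `limit` words."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [
        (i, len(p.split()))
        for i, p in enumerate(paragraphs)
        if len(p.split()) > limit
    ]
```

An empty result means that every paragraph fits within the limit. Any paragraph it reports is a candidate for splitting at a logical breaking point.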

No text in graphics

MT is most effective when it encounters straight HTML. Run this article through MT. Notice how some of the text remains in the source language? The untranslated text is locked within graphics.

MT skips over graphics; it only processes words. So banners, logos, graphic buttons, and tab labels typically remain untranslated. When these screen elements play a navigation role in the interface, the consequences for user success and satisfaction can be fatal. Having text in graphics fails the adequacy test.

Even if you have been thoughtful enough to provide alt text (the alt attribute on an image), the problem of untranslated text remains. Free Web MT does not translate alt text for visitors to your site.

The trade-off is that you need to give up some pizzazz. Keeping text out of graphics pays off later when you localize your Web page. Different locales do not always require different graphics. But you almost always have to translate text. An added bonus is that you are simultaneously preparing your site for screen readers, a tactic that makes it accessible to visually impaired users.
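To find images that may carry words, you can scan a page for image tags with non-empty alt text. The sketch below uses Python's standard html.parser module. The class and function names are invented for this example, and alt text is only a hint: the script cannot see words drawn inside an image that has no alt text.

```python
from html.parser import HTMLParser


class GraphicTextFinder(HTMLParser):
    """Collect the alt text of <img> tags; this text may stay untranslated."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:  # skip missing or empty alt attributes
                self.alt_texts.append(alt)


def find_graphic_text(html):
    finder = GraphicTextFinder()
    finder.feed(html)
    return finder.alt_texts
```

Each string it returns is a label that a reader who depends on MT will probably see in the source language, so it is a candidate for conversion to plain HTML text.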

.pdf files require extra effort

MT treats .pdf files the same way it treats graphics: It ignores them. Users have to run .pdf files through optical character recognition (OCR) before MT. You should evaluate carefully whether or not users need print-ready copy. The trade-off is an extra step of comparing OCR output to the source copy. To achieve adequacy and fluency, users must correct recognition errors. Otherwise, these errors propagate when OCR output runs through MT.

Frames fail the adequacy test

Frames can be a handy way to organize a Web page, but they present a challenge that MT simply avoids. Some Web MT engines translate only one frame. Others prompt the user to submit frame after frame, one at a time. Only those users who really want to know what the page says will exert this much effort.
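A similar scan can tell you whether a page relies on frames at all. The sketch below lists the source of every frame or iframe tag it finds, again with Python's standard html.parser module. The names are invented for this example.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.frame_sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            self.frame_sources.append(dict(attrs).get("src"))


def find_frames(html):
    finder = FrameFinder()
    finder.feed(html)
    return finder.frame_sources
```

Every entry in the result is a separate document that a Web MT user may have to submit by hand.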

Use simple grammatical structures

MT earns higher fluency marks on short sentences than on long sentences; aim for about 20 words per sentence. Help the MT engine solve the problem of understanding how sentence parts relate to each other. Whenever possible, turn clauses into separate sentences. Instead of writing "I check a Web traffic report every day and avoid all the worst traffic jams," try two sentences: "I check a Web traffic report every day. I avoid all the worst traffic jams."

Different languages make different distinctions among verb tenses. Tackle this problem by writing in the simple present tense whenever possible: Write "Dogs always chase cats" instead of "Dogs will always chase cats." Avoid the passive voice: "Exercise brings good health" works better than "Good health is brought about by exercise."
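The sentence-length guideline is easy to check automatically. The sketch below splits text at sentence-ending punctuation and reports sentences longer than 20 words. The splitting rule is deliberately naive (abbreviations such as "e.g." will confuse it), and the function name is invented for this example.

```python
import re


def long_sentences(text, limit=20):
    """Return the sentences in `text` that contain more than `limit` words."""
    # Naive split: a sentence ends at ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [s for s in sentences if len(s.split()) > limit]
```

Treat each sentence it returns as a candidate for splitting into two shorter sentences.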

Use everyday language

As designers and developers, we communicate in jargon. We use terms that only our colleagues understand. Jargon on a Web page risks mistranslation. It poses serious threats to adequacy and fluency. Such terms may not be in the system’s lexicon. Worse still, they may be there, but associated with another, more common, meaning.

Use the simplest words that tell your story. An MT engine is more likely to handle "stamp collecting" than "philately."

English is a challenging source or target language because it gives more than one meaning to many words. To improve adequacy and informativeness, use only the most common forms and meanings of words.

Be careful when using words that change meaning with domain. A musical note is very different from a bank note. A note of interest differs from those notes you passed back and forth when the teacher was not looking. Although English only uses one word, these four uses of the word "note" can translate into four different words. Give the system clues about the domain by modifying such words. Do not write "I held that note." Say either "I held that bank note in my hand" or "I held the high C note for five seconds."

Provide clues

Help the MT engine improve fluency by providing it with clues about relationships among words in a sentence. Connecting words such as "that," "which," and "who" signal that a clause follows. For example, instead of writing "I know my stock is good," write "I know that my stock is good."

Consider the sentence "Every morning, reading the paper, I check the technology stocks." MT prefers "Every morning, while I read the paper, I check the technology stocks." The word "while" gives the system information about how to translate the sentence. It tells the system that "reading the paper" is a temporal clause. The system does not run the risk of thinking that "reading the paper" modifies "morning." These temporal markers are not a requirement in English, but in some languages they are integral.

Follow source language conventions

Follow all the conventions of the source language, even if you know that the target languages use different approaches. For example, German writers capitalize nouns that other languages leave in lower case. If you know that your users translate your site into German, capitalizing nouns is not the route to follow: Web MT is likely to consider these words proper nouns and leave them untranslated. Chinese marks the end of a sentence with a different character (。) than the English period. Anticipating this by omitting periods leaves your text devoid of signals that are vital to MT.

Remember that a missing accent mark can change the meaning of a word.

Be specific

Be as specific as you can. The goal is to make sure that the MT engine does not have to resolve ambiguity. Consider the sentence "He took the book off the table and placed the paper on it." Where did that paper end up -- on the table or on the book? MT can have a problem translating "it" if "table" and "book" are different genders, as they are in many languages. Use short sentences to tell the engine where the paper is: "He took the book off the table. He placed the paper on the book."

Sometimes one extra word is all you need for successful MT. Avoid contractions. The MT engine may mistake a contraction for a possessive. Not all languages use contractions, but possessives are common. So, instead of writing "That’s a great idea," use "That is a great idea."
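A simple pattern match can list the contractions in a draft. The sketch below looks for a word followed by an apostrophe and a common contraction ending. The pattern and function name are invented for this example. Note that it also flags true possessives such as "the system's", so review each hit by hand.

```python
import re

# A word, a straight or curly apostrophe, then a common contraction ending.
CONTRACTION = re.compile(r"\b\w+(?:'|’)(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)


def find_contractions(text):
    """Return every apostrophe form that may be a contraction."""
    return CONTRACTION.findall(text)
```

Rewrite each true contraction in full, for example "That is" in place of "That's".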

To improve adequacy, avoid idioms. They can lose their meaning in a literal translation.

Avoid acronyms

It is unlikely that a Web MT engine can identify your acronyms for what they are, even if you explain them in parentheses. If you run "COTS" through Web MT from English into French, you will probably receive a literal translation of the word "cots," but all in caps: "LITS DE CAMP."

In other cases, the acronym does not translate at all. You end up with a translation of the explanation, but not of the acronym. Run "What you see is what you get (WYSIWYG)" from English into Spanish. The output is probably a Spanish phrase followed by the acronym, WYSIWYG.

Sometimes, word count forces you to use acronyms. In these cases, be sure to spell out the acronym in the text at least one time. This way, users can see, in their own language, what the acronym means.
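You can check a draft for acronyms that are never spelled out. The sketch below assumes the convention used in this article, where the full phrase comes first and the acronym follows in parentheses. It treats any run of two or more capital letters as an acronym, which is a simplification. The function name is invented for this example.

```python
import re


def unexplained_acronyms(text):
    """Return acronyms that never appear in parentheses after a spelled-out phrase."""
    acronyms = set(re.findall(r"\b[A-Z]{2,}\b", text))
    explained = set(re.findall(r"\(([A-Z]{2,})\)", text))
    return sorted(acronyms - explained)
```

Each acronym it reports needs either a spelled-out form somewhere in the text or a rewrite that avoids the acronym.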

Quick MT preparation guidelines

  • Keep Web pages and paragraphs short.
  • Use no text in graphics.
  • .pdf files require extra effort.
  • Frames fail the adequacy test.
  • Use simple grammatical structures.
  • Use everyday language.
  • Provide clues.
  • Follow source language conventions.
  • Be specific.
  • Avoid acronyms.
  • Avoid e-mail shortcuts.
  • Don’t depend on formatting.
  • Check spelling and grammar.

Avoid e-mail shortcuts

If you ask for feedback that you plan to run through MT, help your users. Provide hints on how to prepare e-mail for the experience. Everyday tactics such as no capitals, sentence fragments, run-on sentences, smiley faces, and casual punctuation are beyond MT’s abilities. There simply are not enough signals.

Don’t depend on formatting

In retaining formatting, COTS generally scores higher marks than free Web MT. So, on the Web, never depend on formatting to convey important information. You usually cannot depend on Web MT to retain italics, bolding, or table layouts. Decimal points and commas in numbers may migrate to new, meaningless positions. Paragraphs may merge.

Check spelling and grammar

In terms of adequacy, informativeness, and fluency, MT output cannot be any better than MT input. One common sense step is to use your spell-checker. Simply put, misspellings and typos can cause mistranslations. They result in words you did not intend or misspelled words that remain in the source language. Also check your grammar; ungrammatical input is sure to hurt output fluency.


Looking forward

The Web becomes more and more of an international forum every day as speakers of many languages gain access. Even within the US, a growing segment of Web surfers is more comfortable reading a language other than English.

Some day, factors such as faster processing, more comprehensive lexicons, and artificial intelligence will improve MT quality. Today, MT does not deliver literary-quality output. However, a few simple strategies can yield a more successful and satisfactory user experience.

Accommodating MT’s technical shortcomings is vital. Writing strategies improve all users’ experiences, whether they read in your language or require translation. These guidelines promote the most adequate, informative, and fluent MT output.


