Bidirectional script support

A primer

Level: Intermediate

Israel Gidali (gidali@il.ibm.com), Manager, IBM Globalization Center of Competency - Complex Text Languages, IBM
Matitiahu (Mati) Allouche (matial@il.ibm.com), Bidi Architect, IBM Globalization Center of Competency - Complex Text Languages, IBM

28 Sep 2005

Arabic, Hebrew, Urdu, and Farsi (Persian) are written from right to left, while numbers and segments of Latin (or Cyrillic or Greek) text are embedded in this text from left to right. The dual directionality of such bidirectional (bidi) text poses challenges to the way this text is processed and presented in computer applications. This article introduces the concepts and peculiarities of bidirectional scripts in computing systems, and so provides a basis for understanding how those scripts are implemented in specific systems. It covers directionality and Arabic character shaping, the prevalence of bidi text in different bidi layouts, the definition of the bidi attributes, and the need to transform bidi text to a common layout before processing it.

The basic globalization support, which is mandatory for all IBM® offerings and components, also includes bidirectional (bidi) support: the support for languages with a bidirectional script. A bidirectional script contains segments of text that are written from right to left as well as embedded numbers or segments of text in Western scripts (for example, Latin-based scripts such as English or French, Cyrillic-based scripts, or Greek), which are written from left to right.

Arabic and Hebrew are the two major languages that use bidirectional scripts. The Arabic script group includes Arabic, Farsi (Persian), and Urdu, as well as other languages. The Hebrew script group includes not only Hebrew, but also Yiddish and Ladino. Because both language groups have alphabets with a small number of characters (27 or 28), they can accommodate a single-byte encoding scheme. Of course, like most other scripts, they are also included in the Unicode repertoire.

Bidirectional language characteristics

There are two main characteristics that distinguish bidirectional script from Western language scripts (such as English, Russian, Greek and others). These two characteristics are bidirectionality and shaping.

Bidirectionality

Bidirectionality encompasses several key concepts:

  • segments (or directional runs)
  • nesting or embedding (and associated embedding levels)
  • global orientation (also known as writing order, paragraph orientation, or even the somewhat misleading "reading order")
  • logical vs. physical order of bidirectional text
  • text-types (also known as ordering scheme) and associated re-ordering methods
  • symmetrical swapping
  • widget mirroring of translated GUIs

Segments (or directional runs)

A directional segment (or directional run) is a portion of text within a string that has a single, homogeneous directionality. A string can have segments with a right-to-left directionality and other segments with a left-to-right directionality. An example of bidirectional segmentation is an address in the Middle East, written in Hebrew or Arabic, and included in an English note sent by e-mail:

my address is B ECNARTNE 25 TEERTS ELPAM.

IMPORTANT: In this and all other examples in this article, capitalized letters, such as DCBA, are used to represent Arabic or Hebrew letters.

In this example, the left-to-right segments are: "my address is" and "25". The rest are right-to-left segments.
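
The segmentation above can be sketched in a few lines of Python. This is only an illustration under the conventions of this article: capitalized words stand for Arabic or Hebrew, the input is given in the order in which it is read (logical order), and the function names word_direction and directional_runs are hypothetical. Numbers are reported as their own kind of run; like Latin text, they run from left to right.

```python
from itertools import groupby


def word_direction(word: str) -> str:
    """Classify one word using this article's convention (an assumption of
    this sketch): digits are numbers, capitalized words stand for Arabic or
    Hebrew, and everything else is treated as Latin text."""
    if word.isdigit():
        return "number"
    if word.isupper():
        return "RTL"
    return "LTR"


def directional_runs(logical_text: str) -> list[tuple[str, str]]:
    """Group consecutive words that share a direction into runs."""
    words = logical_text.split()
    return [(direction, " ".join(group))
            for direction, group in groupby(words, key=word_direction)]


# The address example, in the order in which it is read aloud.
for direction, text in directional_runs("my address is MAPLE STREET 25 ENTRANCE B"):
    print(f"{direction:6} {text}")
```

A real implementation would classify individual characters by their Unicode properties rather than whole words by letter case.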

Nesting

Nesting occurs when a text string with one directionality also contains segment(s) with the opposite directionality. In the above example, the right-to-left address "B ECNARTNE 25 TEERTS ELPAM" has embedded in it the left-to-right segment "25".

Global orientation

Global orientation, also referred to as writing order, reading order, or paragraph orientation, determines the side of the screen, window, page, or field where the rendering engine starts laying out directional segments. Subsequent segments progress in the direction of the global orientation.

If a bidirectional text has been created in storage with the intent to be presented in a right-to-left global orientation, and is instead rendered with a left-to-right global orientation, the relative order of the different segments (and of the punctuation) gets mixed up and the text does not make sense.

Example: The Hebrew sentence "I WORK AT ibm AND TRAVEL TO canada." should be rendered with right-to-left global orientation as ".canada OT LEVART DNA ibm TA KROW I". If rendered with left-to-right global orientation, it would be "TA KROW I ibm OT LEVART DNA canada." which is quite unreadable.

Physical and logical text ordering

Bidirectional text can be stored in either logical or physical order. In a bidirectional text, the physical order of adjacent characters when presented (on a screen or in print) is not always the order in which the characters are pronounced when read aloud. For instance in the example above, after having read the last character "s" in the segment "my address is", the next logical character is not "B" (which is next to it physically on the screen) but rather the rightmost character "M" which is the first character of the address "B ECNARTNE 25 TEERTS ELPAM".

In workstation environments, the preferred way of entering and processing bidirectional text is in logical order, because text is processed similarly to Latin text. When using logical order in storage, the code responsible for presenting the text (for instance the rendering engine of the operating system in the workstation) must have the means to reverse segments whose direction is opposite to the global orientation.

In mainframes, the traditional way to enter, store, and process bidirectional text is in physical order. Therefore, when integrating bidirectional text from mainframe and workstation environments, you must transform the text to a layout where all the text has the same order or "ordering scheme" (see below).

Text-type (or ordering scheme)

Text-type (or "ordering scheme") is defined as the order in which bidirectional text is stored and processed. There are three text-types used for recording: visual (or physical), logical (also called implicit) and explicit.

Visual text-type is the oldest ordering scheme and is more or less a simple copy of the entire screen (this is why it is called "visual"). In the visual ordering scheme, the programmer must know the exact structure of the data and handle every segment explicitly in application code. A large majority of vintage applications running on mainframes assume this text-type for the data they process and for the data held in mainframe databases and files.

Implicit text-type assumes that the letters of the Latin alphabet have inherent left-to-right directionality and that Arabic, Persian, Urdu, and Hebrew characters have inherent right-to-left directionality. To accommodate bidirectionality, an algorithm is used to recognize segments based on their inherent directional characteristics, allowing segment inversion to be performed automatically.
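
The following Python sketch illustrates the idea of an implicit algorithm on the examples used in this article. It is a heavily simplified illustration, not a conforming implementation of the Unicode bidirectional algorithm: it assumes this article's convention that capital letters stand for Arabic or Hebrew letters, treats every non-alphanumeric character as neutral, and ignores explicit controls. The names char_type, resolve_types, and to_visual are hypothetical.

```python
def char_type(ch: str) -> str:
    """Inherent direction, using this article's convention (an assumption):
    capitals stand for Arabic/Hebrew letters, lowercase for Latin letters."""
    if ch.isdigit():
        return "EN"                      # number
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"                           # neutral: spaces, punctuation


def resolve_types(text: str, base: str) -> list[str]:
    types = [char_type(c) for c in text]
    # Digits that follow a left-to-right letter behave like that letter.
    last_strong = base
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    def side(t: str) -> str:
        # For resolving neutrals, a number counts as right-to-left.
        return "R" if t == "EN" else t

    # A run of neutrals takes the direction of its neighbours when both
    # sides agree, and the global orientation otherwise.
    i, n = 0, len(types)
    while i < n:
        if types[i] != "N":
            i += 1
            continue
        j = i
        while j < n and types[j] == "N":
            j += 1
        before = side(types[i - 1]) if i > 0 else base
        after = side(types[j]) if j < n else base
        types[i:j] = [before if before == after else base] * (j - i)
        i = j
    return types


def to_visual(logical: str, base: str) -> str:
    """Reorder logical text for display; base is "L" or "R"."""
    levels = []
    for t in resolve_types(logical, base):
        if t == "R":
            levels.append(1)
        elif t == "EN":
            levels.append(2)
        else:                            # "L"
            levels.append(0 if base == "L" else 2)
    chars = list(logical)
    # From the highest level down to 1, reverse every contiguous run of
    # characters whose level is at least the current level.
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < len(chars):
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < len(chars) and levels[j] >= level:
                j += 1
            chars[i:j] = reversed(chars[i:j])
            i = j
    return "".join(chars)


sentence = "I WORK AT ibm AND TRAVEL TO canada."
print(to_visual(sentence, "R"))          # right-to-left global orientation
print(to_visual(sentence, "L"))          # left-to-right global orientation
```

The same stored string yields two different displays depending only on the global orientation passed in, which is why that attribute has to travel with the text.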

Explicit text-type is the last of the three text-types. The implicit ordering scheme has some limitations, such as the inability to correctly handle text with more than one level of nesting (for example, English within Hebrew within English). These cases are better handled by explicit controls: explicit text-type assumes that additional control characters, which are not visible when the text is presented, are embedded in a text string and direct the explicit algorithm to perform segment inversions, shaping or numeral selections, and other transformations. The limitation of the explicit ordering scheme is that every automatic process that handles the text must also be able to handle the embedded controls. One technique combines the advantages of the implicit and explicit ordering schemes: the basic display algorithm defined in the Unicode Standard bidirectional algorithm.
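
As one concrete illustration of explicit controls, the Unicode Standard defines invisible embedding characters. The Python sketch below wraps segments in those characters to mark the nested case mentioned above (English within Hebrew within English). Using the Unicode controls here is an assumption made for illustration; the helper names embed and strip_controls are hypothetical.

```python
import unicodedata

# Unicode directional formatting characters (invisible when rendered).
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed(text: str, direction: str) -> str:
    """Wrap text in an explicit embedding; direction is "ltr" or "rtl"."""
    opener = LRE if direction == "ltr" else RLE
    return opener + text + PDF


def strip_controls(text: str) -> str:
    """Remove the three embedding controls used in this sketch."""
    return "".join(ch for ch in text if ch not in (LRE, RLE, PDF))


# English within Hebrew within English (capitals stand for Hebrew letters).
inner = embed("hello", "ltr")
sentence = "he said " + embed("SHE WROTE " + inner + " TO ME", "rtl")

for ch in sentence:
    if ch in (LRE, RLE, PDF):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The limitation described above is visible here: any process that searches, truncates, or compares such a string has to know about the extra characters, for example by calling something like strip_controls first.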

Symmetrical swapping

Symmetrical swapping is performed by rendering programs for characters such as < ( [ { that have a symmetric counterpart with the opposite directional meaning: > ) ] }. It preserves the meaning of expressions such as A>B when they are presented from right to left, so that they appear as B<A (and not as B>A).
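
A minimal Python sketch of symmetrical swapping for a purely right-to-left string follows. The swap table contains only the pairs named above, and the function name present_right_to_left is hypothetical; a complete table would be taken from the Unicode character database, which Python exposes per character through unicodedata.mirrored.

```python
import unicodedata

# Only the pairs named in this article; a full table would be larger.
SWAP = str.maketrans("<>()[]{}", "><)(][}{")


def present_right_to_left(logical: str) -> str:
    """Lay out a purely right-to-left string visually: reverse the
    characters, then replace each symmetric character by its counterpart."""
    return logical[::-1].translate(SWAP)


print(present_right_to_left("A>B"))

# unicodedata.mirrored() reports whether a character is subject to swapping.
for ch in "<([{a":
    print(repr(ch), bool(unicodedata.mirrored(ch)))
```

Without the translate step, plain reversal would turn A>B into B>A, which states the opposite relation.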

Widget mirroring

When a GUI is translated to a language with a right-to-left script (such as Hebrew or Arabic), the entire geometry of the GUI must be mirrored to match the expectation of the users of these languages who read from right to left. This includes mirroring of all the widgets that constitute the GUI window.

For example, widget mirroring can move the menu buttons and navigation tree to the right instead of the left, and the navigation tree itself is horizontally mirrored. If the GUI is not translated to Arabic or Hebrew, the frames and windows must not be mirrored, even though the locale may be set to one of these languages and the programs may have to handle and present bidirectional text correctly.
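
Geometrically, mirroring a horizontal layout amounts to measuring each widget from the opposite edge of its container. The Python sketch below shows that arithmetic under simplifying assumptions (a single row of widgets, positions in pixels); the Widget class and the mirror function are hypothetical and do not correspond to any particular GUI toolkit.

```python
from dataclasses import dataclass


@dataclass
class Widget:
    name: str
    x: int       # left edge, measured from the container's left side
    width: int


def mirror(widgets: list[Widget], container_width: int,
           gui_is_translated: bool) -> list[Widget]:
    """Mirror the layout only when the GUI itself is translated to a
    right-to-left language; otherwise return the layout unchanged."""
    if not gui_is_translated:
        return list(widgets)
    return [Widget(w.name, container_width - w.x - w.width, w.width)
            for w in widgets]


layout = [Widget("navigation tree", 0, 200), Widget("content", 200, 600)]
for w in mirror(layout, 800, gui_is_translated=True):
    print(w)
```

The flag reflects the rule above: the decision to mirror depends on whether the GUI is translated, not on whether the text it displays is bidirectional.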

Figure 1 shows widget mirroring of a drop-down menu.


Figure 1. Widget-mirrored window showing bidirectional labels

Character shaping

The Arabic script is cursive. In most cases, adjacent characters are connected to one another. Some of the Arabic characters do not connect to the next character on the left. To accommodate the need for cursiveness of the Arabic characters, the Arabic characters can have up to four different shapes:

  • isolated shape, when there is no need to connect on either side
  • initial shape, when connection is required only to the next character on the left
  • middle shape, when connection is required on both sides
  • final shape, when connection is required only to the character on the right
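
The choice among the four shapes can be sketched as a lookup on two questions: does the letter connect to the previous letter, and does it connect to the next one? The Python sketch below assumes the input contains only basic Arabic letters without diacritics, and it uses a partial, illustrative list of letters that never connect to the following letter. The names RIGHT_JOINING, SHAPES, and shapes are hypothetical.

```python
# Letters that connect only to the preceding letter (partial list, assumed
# for illustration): alef, dal, thal, reh, zain, waw.
RIGHT_JOINING = set("\u0627\u062f\u0630\u0631\u0632\u0648")

# Key: (connects to previous letter, connects to next letter).
SHAPES = {
    (False, False): "isolated",
    (False, True): "initial",
    (True, True): "middle",
    (True, False): "final",
}


def shapes(word: str) -> list[str]:
    """Return the shape name for each letter of a word in logical order."""
    result = []
    for i, letter in enumerate(word):
        joins_previous = i > 0 and word[i - 1] not in RIGHT_JOINING
        joins_next = i < len(word) - 1 and letter not in RIGHT_JOINING
        result.append(SHAPES[(joins_previous, joins_next)])
    return result


# The word "kitab" (book): kaf, teh, alef, beh.
kitab = "\u0643\u062a\u0627\u0628"
print(shapes(kitab))
```

In this word the alef takes its final shape and, because alef does not connect to the letter that follows it, the last letter appears in its isolated shape.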

In some Arabic code pages (such as the IBM EBCDIC 420 code page), separate code points are allocated for each possible character shape. In other Arabic code pages (such as the Microsoft Arabic code page 1256), there is a single representative code point for each Arabic character, and thus Arabic text is stored in a shape-independent manner. Of course, at presentation time the rendering program must apply a shaping algorithm in order to choose the proper shape (and the appropriate glyph of the font) to correctly represent the Arabic script text.
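
Unicode contains an analogous pair of representations: one shape-independent code point per Arabic letter, plus compatibility "presentation form" code points for the individual shapes. The Python sketch below uses the letter beh to show both, and converts shaped code points back to the shape-independent one with compatibility normalization. It illustrates the distinction only; it does not convert between the code pages named above, and the function name to_shape_independent is hypothetical.

```python
import unicodedata

BEH = "\u0628"                            # shape-independent letter beh
BEH_SHAPES = "\ufe8f\ufe90\ufe91\ufe92"   # isolated, final, initial, medial

for ch in BEH_SHAPES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          "|", unicodedata.decomposition(ch))


def to_shape_independent(text: str) -> str:
    """Map presentation forms to base letters. Note that NFKC also folds
    other compatibility characters, so it is a blunt tool."""
    return unicodedata.normalize("NFKC", text)
```

Going in the other direction, from base letters to shaped code points, requires a shaping algorithm that looks at the neighbouring letters.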

Shaping does not apply to the Hebrew script.

Number Shapes

In Hebrew (like in English), numbers are represented using Arabic digits, which in the Unicode Standard are called "European numbers":

1 2 3 4 5 6 7 8 9 0

The Arabic digits (European numbers) are also used in the Arabic script, but in many cases users of this script prefer another set of digits called Hindi digits (known in the Unicode Standard as "Arabic-Indic"):


Figure 2. Arabic-Indic digits

There are some slight variations between the shapes of some of the Hindi numbers when used in Arabic, Urdu, and Farsi.

Numbers are usually stored as Arabic digits (European numbers) and are presented with either Arabic or Hindi digits, depending on the locale or the requirements of specific users.
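
The separation between stored digits and presented digits can be sketched as a simple character translation. The Python example below keeps the stored value in Arabic digits (European numbers) and substitutes Hindi (Arabic-Indic) digits, U+0660 through U+0669, only for presentation. The function name present_number and its flag are hypothetical, and the variant digit shapes used for Urdu and Farsi are not covered.

```python
EUROPEAN = "0123456789"
ARABIC_INDIC = "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669"

TO_ARABIC_INDIC = str.maketrans(EUROPEAN, ARABIC_INDIC)
TO_EUROPEAN = str.maketrans(ARABIC_INDIC, EUROPEAN)


def present_number(stored: str, use_hindi_digits: bool) -> str:
    """Choose the digit shapes at presentation time; storage is unchanged."""
    return stored.translate(TO_ARABIC_INDIC) if use_hindi_digits else stored


shown = present_number("2005", use_hindi_digits=True)
print(shown)
print(shown.translate(TO_EUROPEAN))   # back to the stored form
```

Because only the digit shapes change, the order of the digits is the same in both forms.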

Direction of numbers

Regardless of the script used, whether shown in Arabic digits or in Hindi digits, all numbers are always presented from left to right.

The direction in which mathematical formulae are written can differ from language to language. In Hebrew (and in Persian), formulae are written from left to right. In Arabic, mathematical formulae are written from right to left.


Layout transformations and attributes

Enabling bidirectional scripts calls for special attention to:

  • text layout
  • layout attributes
  • layout transformations

Text layout

Bidirectional text can have different layouts. Layouts differ in the ordering scheme, in the global orientation in which the text is expected to be seen, and in character and numeral shaping. Transformations between layouts require transformation APIs, also called layout functions or layout service functions.

Layout attributes

A bidirectional text layout is identified by the values of a set of bidirectional attributes. These attributes are usually external to the text and are stored in an external resource file. The five bidirectional attributes, whose meanings have already been introduced in this article, are:

  • orientation (also called global orientation, writing order, paragraph direction, basic embedding level, and reading order)
  • text-type (also called ordering scheme)
  • symmetric swapping
  • text shaping
  • numeric shaping
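
One way to picture a layout is as a small record holding the five attributes. The Python sketch below is an illustration only: the class BidiLayout, the attribute values, and the two sample layouts are assumptions, not the format of any IBM resource file or API. Comparing two such records shows which attributes a transformation between them would have to address.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class BidiLayout:
    orientation: str          # "ltr" or "rtl"
    text_type: str            # "visual", "implicit", or "explicit"
    symmetric_swapping: bool
    text_shaping: str         # for example "shaped" or "nominal"
    numeric_shaping: str      # for example "european" or "arabic-indic"


def attributes_to_transform(source: BidiLayout, target: BidiLayout) -> list[str]:
    """Names of the attributes whose values differ between two layouts."""
    return [f.name for f in fields(source)
            if getattr(source, f.name) != getattr(target, f.name)]


# Sample values chosen for illustration only.
host = BidiLayout("ltr", "visual", False, "shaped", "european")
workstation = BidiLayout("rtl", "implicit", True, "nominal", "european")
print(attributes_to_transform(host, workstation))
```

Keeping the attributes in a record separate from the text matches the point above that they are usually external to the text itself.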

Layout transformations

Bidirectional text is stored and processed in different environments (platforms) and with different layouts. Converting text from one layout to another requires a bidirectional "layout transformation". The IBM Bidi Competency Centers maintain different flavors of such transformation code for different programming languages and encoding schemes. The layout transformations use a bidirectional implicit algorithm that conforms to the Unicode bidirectional algorithm, which is described at http://www.unicode.org/reports/tr9/. The IBM bidirectional layout transformation is also implemented in IBM Java™ SDK 1.4.1: http://www-128.ibm.com/developerworks/java/jdk/bidirectional/JAVABIDI.html.

Copyright(c) IBM Corporation, 2005

Trademarks

IBM, Aptiva, DB2, and WebSphere are trademarks of International Business Machines Corporation in the United States, other countries, or both.

Microsoft and Windows are trademarks of Microsoft Corporation in the United States, other countries, or both.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product or service names may be trademarks or service marks of others.




About the authors

Israel Gidali

Israel Gidali is the manager of the IBM Globalization Center of Competency - Complex Text Languages and of the Hebrew Translation Services Center in IBM Israel, and he is a member of GATT (IBM Globalization Architecture Technology Team). Israel has been involved with IBM globalization and in particular with bidi issues for the last 13 years, providing extensive bidi education to IBM development teams as well as being the focal person to address all bidi-related issues in IBM products.


Mati Allouche

Mati Allouche is the bidi architect of IBM GCoC. Mati is the most authoritative bidi expert in IBM, specializing in bidi for several decades, and he is actively participating in and contributing to the development and update of bidi-related standards in professional bodies such as Unicode, IETF, and the Israel Institute of Standards.



