Bidirectional support in IBM SDK Version 1.4.1: A user guide

Introduction

Arabic shaping options

  • Arabic shaping options available in this release
  • Arabic shaping options to be implemented in future releases
  • The JAVABIDI system property

  • S Part
  • U Part
  • C Part
  • Examples of values for JAVABIDI
  • Known limitations

    Introduction

    Within Java, character data are manipulated as Unicode UTF-16 values. However, character data outside Java frequently conform to different encodings. For this reason, file input/output operations, along with conversion of bytes to characters and vice-versa, also involve conversion from an external encoding to UTF-16 and back. The external encoding may be explicitly specified (e.g. in the constructor of an InputStreamReader or OutputStreamWriter), or fall back to a default.
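
    The conversion between an external encoding and UTF-16 can be sketched as below. This is a minimal illustration, assuming a current JDK; the class and method names are hypothetical, and UTF-8 is used only because every JVM is guaranteed to provide it. Where the JVM supports a Bidi code page such as Cp1256, that name could be passed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical helper class, not part of the SDK.
class EncodingRoundTripSketch {

    /** Encodes UTF-16 text into bytes of the explicitly named external encoding. */
    static byte[] encode(String text, String encodingName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, encodingName)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes bytes of the explicitly named external encoding back into UTF-16 text. */
    static String decode(byte[] data, String encodingName) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(data), encodingName)) {
            char[] chunk = new char[256];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                text.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hebrew letters followed by digits: a small piece of Bidi data.
        String hebrew = "\u05E9\u05DC\u05D5\u05DD 123";
        byte[] external = encode(hebrew, "UTF-8");
        System.out.println(decode(external, "UTF-8").equals(hebrew));
    }
}
```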

    Via its implementation of Unicode, Java supports many languages with various alphabets or scripts, among them Arabic and Hebrew, whose scripts are written from right to left. Since Arabic and Hebrew text is frequently mixed with other languages and numbers that are written from left to right, there emerges the need to handle bidirectional (or Bidi) data.

    Bidi data raises the level of diversity, as compared to non-Bidi data, because it may be stored not only in various encodings, but also in various layouts, each layout being a combination of rules relative to ordering of the characters (Arabic and Hebrew) and shaping of Arabic letters (choosing the appropriate shape of an Arabic letter among several possible).

    For the same reasons that Java translates data from external encodings into the encoding used internally and vice-versa, it should transform Bidi data from external layouts to the layout used within Java, and vice-versa. For example, legacy applications store data in visual layout, while Java APIs assume an implicit (also known as a logical) layout.

    Release 1.4.1 of the Java SDK allows users to request that layout transformations be performed for Bidi data whenever encoding conversions are performed. In order to maintain compatibility with previous releases, these transformations are disabled by default. To enable them, users must assign an appropriate value to the system property JAVABIDI.

    Arabic shaping options

    Some Arabic characters need special handling during conversion between different code pages. Because they are not represented in all code pages, a normal conversion would result in substitute control characters (SUB) -- that is, a loss of data.

    The characters with different representation across code pages are:

    Lam-Alef
    This is represented as a single character in code pages 420, 864, and 1046, which are used for visual presentation, and in the Unicode Arabic Presentation Forms-B block (uFExx range). It is represented as two characters, Lam and Alef, in code pages 425, 1089, and 1256, which are used for implicit representation, and in the Unicode Arabic u06xx range.
     
    Tail of Seen family of characters
    The visual code pages 420, 864, and 1046 represent the final form of the Seen family of characters as two adjacent characters: the three quarters shape and the Tail. The implicit code pages 425, 1089, and 1256 and the Unicode Arabic u06xx range do not represent the Tail character. In the Unicode Arabic Presentation Forms-B block (uFExx range), the final form for characters in the Seen family is represented as one character.
     
    Yeh-Hamza final form
    Code pages 420 and 864 have no unique character for the Yeh-Hamza final form; it is represented as two characters, Yeh final form and Hamza. In other code pages (such as 425, 1046, 1089, and 1256) and in Unicode, the Yeh-Hamza final form is represented as one character or two, depending on whether the user entered it with one keystroke (the Yeh-Hamza key) or two (the Yeh key followed by the Hamza key). A plain conversion from these code pages to 420 or 864 would turn the Yeh-Hamza final form character into the Yeh-Hamza initial form; special handling is needed to convert it to the Yeh final form and Hamza instead.
     
    Tashkeel or diacritic characters except for Shadda
    These characters are not represented in code pages 420 and 864. Conversion of Tashkeel from code pages 425, 1046, 1089, 1256, and Unicode to 420 or 864 results in SUB.

    To avoid losing such characters during conversion, a set of Arabic shaping options is provided to handle them properly.

    Arabic shaping options available in this release

    For each character in the previous list, there is a set of available shaping options, described below.

    For Lam-Alef:
    1. Near
      • When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming the blank space next to it. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned next to each generated Lam-Alef character.
    2. At Beginning
      • When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the absolute beginning of the buffer*. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the absolute beginning of the buffer.
    3. At End
      • When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the absolute end of the buffer**. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the absolute end of the buffer.
    4. Auto
      • When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef, consuming a blank space at the beginning of the buffer with respect to the orientation, that is, buffer[0] for left-to-right and buffer[length - 1] for right-to-left. If no blank space is available, the Lam-Alef character remains as is in the Unicode uFExx range; it will become a substitute control character (SUB) when converted to implicit single-byte code pages. When converting from implicit to visual code pages, the space resulting from Lam-Alef compression is positioned at the beginning of the buffer with respect to the orientation.
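
    The space accounting of the At End option can be sketched as follows. This is an illustration only, not the SDK's implementation: the class and method names are hypothetical, only the plain isolated Lam-Alef ligature (U+FEFB) is handled, and the reordering and shaping that accompany a real visual/implicit transformation are deliberately left out.

```java
import java.util.Arrays;

// Hypothetical sketch of the "At End" space handling for Lam-Alef.
class LamAlefAtEndSketch {
    static final char LAM = '\u0644';
    static final char ALEF = '\u0627';
    static final char LAM_ALEF = '\uFEFB'; // isolated Lam-Alef ligature, Presentation Forms-B

    /** Implicit to visual: Lam + Alef becomes Lam-Alef; freed cells become blanks at the end. */
    static char[] contractAtEnd(char[] in) {
        char[] out = new char[in.length];
        int w = 0;
        for (int r = 0; r < in.length; r++) {
            if (in[r] == LAM && r + 1 < in.length && in[r + 1] == ALEF) {
                out[w++] = LAM_ALEF;
                r++; // skip the Alef that was merged into the ligature
            } else {
                out[w++] = in[r];
            }
        }
        Arrays.fill(out, w, out.length, ' '); // pad so the buffer length is unchanged
        return out;
    }

    /** Visual to implicit: Lam-Alef becomes Lam + Alef while trailing blanks remain to consume. */
    static char[] expandAtEnd(char[] in) {
        int spare = 0;
        for (int i = in.length - 1; i >= 0 && in[i] == ' '; i--) {
            spare++;
        }
        char[] out = new char[in.length];
        int limit = in.length; // shrinks by one for every trailing blank consumed
        int w = 0;
        for (int r = 0; r < limit; r++) {
            if (in[r] == LAM_ALEF && spare > 0) {
                out[w++] = LAM;
                out[w++] = ALEF;
                spare--;
                limit--;
            } else {
                out[w++] = in[r]; // no blank left: the ligature stays as is
            }
        }
        return out;
    }
}
```

    In both directions the buffer length stays constant, which is the point of the option: the blank at buffer[length - 1] is what pays for the extra character.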

     
    For Seen Tail:
    1. Near
      • When converting from visual to implicit, each two-character final form of the Seen family (the three quarters shape followed by the Tail character) is converted to the one-character final form; the Tail is replaced by a space, which is positioned next to the Seen final form. When converting from implicit to visual, each one-character final form of the Seen family is converted to the two-character final form, consuming the space next to the Seen character.
    2. Auto
      • Same behavior as Near.

     
    For Yeh-Hamza:
    1. Near
      • Conversion from visual to implicit converts each Yeh character followed by a Hamza character to a Yeh-Hamza character; the space resulting from the contraction is positioned next to the generated Yeh-Hamza character. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character.
    2. Auto
      • Same behavior as Near.

     
    For Tashkeel:
    1. Keep
      • No special processing is done.
    2. Customized At Beginning
      • All Tashkeel characters except for Shadda are replaced by spaces. The resulting spaces are moved to the absolute beginning of the buffer*.
    3. Customized At End
      • All Tashkeel characters except for Shadda are replaced by spaces. The resulting spaces are moved to the absolute end of the buffer**.
    4. Auto
      • Same behavior as Keep.
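
    The Customized At End option can be sketched as below. The sketch assumes that "Tashkeel" means the Unicode combining marks U+064B (Fathatan) through U+0652 (Sukun), with Shadda (U+0651) kept; the guide does not spell out the exact set, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of "Customized At End" for Tashkeel.
class TashkeelAtEndSketch {
    static final char SHADDA = '\u0651';

    /** Assumed Tashkeel range: U+064B through U+0652, excluding Shadda. */
    static boolean isTashkeelExceptShadda(char c) {
        return c >= '\u064B' && c <= '\u0652' && c != SHADDA;
    }

    /** Drops Tashkeel (except Shadda) and pads the end of the buffer with blanks. */
    static char[] customizedAtEnd(char[] in) {
        char[] out = new char[in.length];
        int w = 0;
        for (char c : in) {
            if (!isTashkeelExceptShadda(c)) {
                out[w++] = c;
            }
        }
        Arrays.fill(out, w, out.length, ' '); // one blank per removed mark, all at the end
        return out;
    }
}
```

    Customized At Beginning would differ only in where the blanks are placed: at buffer[0] onward instead of at the end.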

    Note:

    • For all Arabic shaping options, the behavior of the Auto value will be enhanced in future releases to provide optimized support in more situations.

    Arabic shaping options to be implemented in future releases

    The following Arabic shaping options are planned for implementation or enhancement in future releases of the JDK.

    For Lam-Alef:
    1. Resize Buffer
      • When converting from visual to implicit code pages, each Lam-Alef character is expanded to Lam plus Alef; the buffer is enlarged to make room for the newly added Alef characters. When converting from implicit to visual code pages, every sequence of Lam followed by Alef is contracted to a Lam-Alef character; the buffer is then reduced to eliminate the spaces resulting from the contraction.

     
    For Seen Tail:
    1. At Beginning
      • When converting from visual to implicit, each two-character final form of the Seen family (the three quarters shape followed by the Tail character) is converted to the one-character final form, and the Tail is replaced by a space. The resulting spaces are moved to the absolute beginning of the buffer*. When converting from implicit to visual, each one-character final form of the Seen family is converted to the two-character final form, consuming spaces at the absolute beginning of the buffer.
    2. At End
      • When converting from visual to implicit, each two-character final form of the Seen family (the three quarters shape followed by the Tail character) is converted to the one-character final form, and the Tail is replaced by a space. The resulting spaces are moved to the absolute end of the buffer**. When converting from implicit to visual, each one-character final form of the Seen family is converted to the two-character final form, consuming spaces at the absolute end of the buffer.

     
    For Yeh-Hamza:
    1. One Cell
      • In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned next to the generated character. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character.
    2. Near
      • To improve its behavior, this option will be modified in the next release so that, in conversion from visual to implicit, each Yeh character followed by a Hamza character remains as is (Yeh followed by Hamza). In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located next to the original Yeh-Hamza character.
    3. At Begin
      • In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned at the absolute beginning of the buffer*. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located at the absolute beginning of the buffer.
    4. At End
      • In conversion from visual to implicit, each Yeh character followed by a Hamza character is contracted to a Yeh-Hamza character (one character); the resulting space is positioned at the absolute end of the buffer**. In conversion from implicit to visual, each Yeh-Hamza character is expanded to two characters (Yeh and Hamza), consuming the space located at the absolute end of the buffer.

     
    For Tashkeel:
    1. Customized with Zero width
      • All Tashkeel characters are converted to their non-spacing (zero-width) counterparts.
    2. Customized with width
      • All Tashkeel characters are converted to their spacing counterparts. This option is not available for visual to implicit conversion because Tashkeel characters in the Arabic u06xx range exist only as non-spacing (zero-width) characters.
    *  The absolute beginning of the buffer is buffer[0].
    ** The absolute end of the buffer is buffer[bufferlength - 1].

    The JAVABIDI system property

    The JAVABIDI system property may be specified by adding -DJAVABIDI=xxxx to the command that launches Java, where xxxx represents parameters for the Bidi layout transformations.

    JAVABIDI may be set to "NO", the default, in which case no Bidi layout transformations are performed, which is compatible with the behavior of previous releases.
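
    An application can inspect the value it was launched with through the standard system-property API. The sketch below only reads the property; it performs no transformation itself, and on a JVM without this Bidi support setting the property has no effect. The class and method names are hypothetical, and treating an absent property as "NO" is an assumption based on "NO" being the documented default.

```java
import java.util.Properties;

// Hypothetical helper for inspecting the JAVABIDI setting.
class JavaBidiPropertySketch {

    /** Returns the JAVABIDI value, or "NO" when the property is absent (assumed default). */
    static String effectiveValue(Properties props) {
        return props.getProperty("JAVABIDI", "NO");
    }

    /** True when a value other than "NO" was supplied. */
    static boolean transformationsRequested(Properties props) {
        return !"NO".equals(effectiveValue(props));
    }

    public static void main(String[] args) {
        // Launch as, for example:  java -DJAVABIDI=C(Cp420) JavaBidiPropertySketch
        // (some shells require the parenthesized value to be quoted)
        Properties system = System.getProperties();
        System.out.println("JAVABIDI = " + effectiveValue(system));
        System.out.println("transformations requested: " + transformationsRequested(system));
    }
}
```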

    When JAVABIDI is not set to "NO", its value may contain 1 to 3 parts, separated by commas without intervening spaces. Each part starts with a letter identifier followed by a value within parentheses.

    The letter identifiers are:

    • S, for the SBCS part that describes the Bidi attributes of the SBCS data consumed or produced by the conversions. Note: SBCS stands for Single Byte Character Set and designates the data as stored outside Java.
       
    • U, for the Unicode part that describes the Bidi attributes of the Unicode data consumed or produced by the conversions.
       
    • C, for the codepage part that specifies one or more encodings: if this part is specified, only data with encodings listed in this part will be submitted to the Bidi layout transformation.
      If this part is omitted, the layout transformations will be performed for all encodings except Cp850.
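
    The syntax described above (comma-separated parts, each a letter followed by a parenthesized value) can be sketched as a small parser. This is an illustration of the format, not the JVM's own parsing code: the class and method names are hypothetical, and rejecting a repeated part is an assumption, since the guide says only that a value contains 1 to 3 parts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the JAVABIDI value syntax.
class JavaBidiValueSketch {

    /** Splits a JAVABIDI value into its S, U and C parts; "NO" yields an empty map. */
    static Map<Character, String> parse(String value) {
        Map<Character, String> parts = new LinkedHashMap<>();
        if (value == null || value.equals("NO")) {
            return parts;
        }
        // Commas separate parts; inside C(...) the separator is a semicolon, so this is safe.
        for (String part : value.split(",", -1)) {
            boolean wellFormed = part.length() >= 3
                    && "SUC".indexOf(part.charAt(0)) >= 0 // identifiers are case sensitive
                    && part.charAt(1) == '('
                    && part.endsWith(")");
            if (!wellFormed) {
                throw new IllegalArgumentException("Malformed part: " + part);
            }
            String inner = part.substring(2, part.length() - 1);
            if (parts.put(part.charAt(0), inner) != null) {
                // Assumption: each identifier may appear at most once.
                throw new IllegalArgumentException("Duplicate part: " + part);
            }
        }
        return parts;
    }
}
```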

    Note: Applications should not try to modify the value of the JAVABIDI property after the initialization of the Java Virtual Machine. For performance reasons, JVM implementations may choose to check the value of JAVABIDI only at start-up time, so that any change applied later will have no effect.

    S Part

    The S part has the format: S(TOSHNALEYZ) with the following meaning:

    Symbol  Meaning         Valid Values        Default  Applicability
    T       Text Type       I (implicit)        V        Arabic and Hebrew
                            V (visual)
    O       Orientation     L (LTR)             L        Arabic and Hebrew
                            R (RTL)
                            C (Contextual LTR)
                            D (Contextual RTL)
    S       Swapping        Y (yes)             N        Arabic and Hebrew
                            N (no)
    H       Text Shaping    N (Nominal)         S        Arabic only
                            S (Shaped)
                            I (Initial)
                            M (Middle)
                            F (final)
                            B (isolated)
    N       Numerals        N (Nominal)         N        Arabic only
                            H (National)
                            C (Contextual)
    A       Bidi Algorithm  U (Unicode)         U        Arabic and Hebrew
                            R (Roundtrip)
    L       Lam-Alef mode   R (Resize)          A        Arabic only
                            N (Near)
                            B (at Begin)
                            E (at End)
                            A (Auto)
    E       Seen Tail mode  N (Near)            A        Arabic only
                            B (at Begin)
                            E (at End)
                            A (Auto)
    Y       Yeh-Hamza mode  O (One cell)        A        Arabic only
                            N (Near)
                            B (at Begin)
                            E (at End)
                            A (Auto)
    Z       Tashkeel mode   K (Keep)            A        Arabic only
                            Z (Zero width)
                            W (with Width)
                            B (at Begin)
                            E (at End)
                            A (Auto)

    Notes:

    1. The part identifier and the values are case sensitive.
    2. Values for one or more symbols may be specified as hyphen ("-"), in which case the default value will be applied.

    U Part

    The U part has the format: U(TOSHNALEYZ), with the following meaning:

    Symbol  Meaning         Valid Values        Default  Applicability
    T       Text Type       I (implicit)        I        Arabic and Hebrew
                            V (visual)
    O       Orientation     L (LTR)             L        Arabic and Hebrew
                            R (RTL)
                            C (Contextual LTR)
                            D (Contextual RTL)
    S       Swapping        Y (yes)             Y        Arabic and Hebrew
                            N (no)
    H       Text Shaping    N (Nominal)         N        Arabic only
                            S (Shaped)
                            I (Initial)
                            M (Middle)
                            F (final)
                            B (isolated)
    N       Numerals        N (Nominal)         N        Arabic only
                            H (National)
                            C (Contextual)
    A       Bidi Algorithm  U (Unicode)         U        Arabic and Hebrew
                            R (Roundtrip)
    L       Lam-Alef mode   R (Resize)          A        Arabic only
                            N (Near)
                            B (at Begin)
                            E (at End)
                            A (Auto)
    E       Seen Tail mode  N (Near)            A        Arabic only
                            B (at Begin)
                            E (at End)
                            A (Auto)
    Y       Yeh-Hamza mode  O (One cell)        A        Arabic only
                            N (Near)
                            B (at Begin)
                            E (at End)
                            A (Auto)
    Z       Tashkeel mode   K (Keep)            A        Arabic only
                            Z (Zero width)
                            W (with Width)
                            B (at Begin)
                            E (at End)
                            A (Auto)

    Notes:

    1. The part identifier and the values are case sensitive.
    2. Values for one or more symbols may be specified as hyphen ("-"), in which case the default value will be applied.
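
    The hyphen rule can be sketched as a substitution against a string of defaults. The two default strings below are read off the S and U tables in TOSHNALEYZ order; the class and method names are hypothetical, and requiring exactly ten symbols is an assumption, since the guide does not say whether shorter values are accepted.

```java
// Hypothetical sketch: replacing hyphens in an S or U value with table defaults.
class BidiAttributeDefaultsSketch {
    // Defaults in TOSHNALEYZ order, taken from the S and U tables.
    static final String S_DEFAULTS = "VLNSNUAAAA";
    static final String U_DEFAULTS = "ILYNNUAAAA";

    /** Returns the value with every '-' replaced by the default at the same position. */
    static String resolve(String value, String defaults) {
        if (value.length() != defaults.length()) {
            // Assumption: all ten symbols must be present.
            throw new IllegalArgumentException("Expected " + defaults.length() + " symbols: " + value);
        }
        StringBuilder resolved = new StringBuilder(value);
        for (int i = 0; i < resolved.length(); i++) {
            if (resolved.charAt(i) == '-') {
                resolved.setCharAt(i, defaults.charAt(i));
            }
        }
        return resolved.toString();
    }

    public static void main(String[] args) {
        // The hyphens stand for the Bidi Algorithm, Lam-Alef and Seen Tail symbols.
        System.out.println(resolve("VLNSN---NK", S_DEFAULTS)); // prints VLNSNUAANK
    }
}
```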

    C Part

    The C part has the format: C(xxx;yyy;zzz) where xxx, yyy, zzz represent Bidi code pages.
    When more than one code page is listed, the code pages must be separated by semicolons (";") without intervening spaces.

    Bidi supported code pages
    Code page  Canonical name for NIO  Language
    Cp420      IBM-420                 Arabic
    Cp424      IBM-424                 Hebrew
    Cp856      IBM-856                 Hebrew
    Cp862      IBM-862                 Hebrew
    Cp864      IBM-864                 Arabic
    Cp867      IBM-867                 Hebrew
    Cp1046     IBM-1046                Arabic
    Cp1255     windows-1255            Hebrew
    Cp1256     windows-1256            Arabic
    ISO8859_6  ISO8859_6               Arabic
    ISO8859_8  ISO8859_8               Hebrew
    MacArabic  MacArabic               Arabic
    MacHebrew  MacHebrew               Hebrew
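
    The selection rule for the C part (listed encodings only, or every encoding except Cp850 when the part is omitted) can be sketched as below. The class and method names are hypothetical, and comparing names by exact string match is an assumption; the guide does not say whether aliases of the same code page are recognized automatically.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of how the C part selects encodings.
class CodePageFilterSketch {

    /** Splits the text inside C(...) on semicolons. */
    static Set<String> parseCodePages(String cPartValue) {
        return new HashSet<>(Arrays.asList(cPartValue.split(";")));
    }

    /**
     * Decides whether data in the given encoding would be submitted to the
     * layout transformation. A null set means the C part was omitted.
     */
    static boolean isSelected(String encoding, Set<String> listed) {
        if (listed == null) {
            return !"Cp850".equals(encoding); // omitted C part: everything except Cp850
        }
        return listed.contains(encoding); // assumption: exact name match
    }
}
```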

    Examples of values for JAVABIDI

    JAVABIDI=U(ILYNNUNNNK),S(VLNSNUNNNK),C(Cp420)

    JAVABIDI=C(Cp420),S(VLNSNUNNNK),U(ILYNNUNNNK)
    The order of the part specifications is not significant.

    JAVABIDI=U(ILYNNUNNNK),S(VLNSN---NK),C(Cp420;IBM-420)
    The hyphens in the S part represent default values for the corresponding symbols.

    JAVABIDI=C(Cp420)
    Since both the S and the U parts are omitted, they receive default values for all the symbols.

    Known limitations

    This is the first release where support for Bidi data is implemented, and limitations are known to exist.

    1. If an application reads or writes pieces of text that do not constitute a logical unit, the Bidi layout transformations will not provide the expected results. For instance, an application that reads or writes characters one at a time will not benefit from the new Bidi support. This limitation is not likely to be removed in future releases.
       
    2. When unmappable characters appear in SBCS data (characters that are not valid in the declared code page), they may cause previous and following data to be transformed independently from one another, which can lead to unexpected results.
       
    3. When an application reads or writes a unit of text (e.g. a line) that may cross the boundary between buffers used by the input or output file, the Bidi transformation may be done independently on the part of the text unit that is included in each buffer, leading to unexpected results. When the file is not too large, this can be avoided by setting the buffer size large enough to contain the whole file (e.g. by specifying the buffer size when constructing a BufferedInputStream or a BufferedOutputStream).
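
    The workaround for limitation 3 can be sketched as follows: size the BufferedInputStream to the whole file before wrapping it in a decoding reader. This is a minimal illustration with hypothetical class and method names; it follows the guide's suggestion literally and does not guarantee how a particular JVM buffers data inside the reader itself.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical sketch: a stream buffer large enough to hold the whole file.
class WholeFileBufferSketch {

    /** Picks a buffer size covering the whole file; BufferedInputStream rejects a size of 0. */
    static int bufferSizeFor(long fileLength) {
        if (fileLength > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File too large for a single buffer");
        }
        return (int) Math.max(1L, fileLength);
    }

    /** Opens the file for line-oriented reading in the named encoding. */
    static BufferedReader open(File file, String encodingName) throws IOException {
        int size = bufferSizeFor(file.length());
        return new BufferedReader(new InputStreamReader(
                new BufferedInputStream(new FileInputStream(file), size), encodingName));
    }
}
```

    Reading with readLine() rather than one character at a time also keeps each unit of text whole, which matters for limitation 1.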