IBM developerWorks > Java technology
Glossary of Unicode terms
July 1, 1999

accent A modifying mark on a character. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with diacritic.
alphabetic language A written language in which symbols represent vowels and consonants, and in which syllables and words are formed by a phonetic combination of symbols. Examples of alphabetic languages are English, Greek, and Russian. Contrast with ideographic language.
Arabic numerals The characters 1, 2, 3, 4, 5, 6, 7, 8, 9, and 0. Contrast with Chinese numerals, Indic numerals, and Roman numerals.
Arabic script A cursive script used in Arabic countries. Other writing systems such as Latin and Japanese also have a cursive handwritten form, but usually are typeset or printed in discrete letter form. Arabic script has only the cursive form. It is also used for Urdu (which is spoken in Pakistan, Bangladesh, and India) and for Farsi, or Persian (which is spoken in Iran, Iraq, and Afghanistan).
ASCII "American Standard Code for Information Interchange." A standard 7-bit character set used for information interchange. ASCII encodes the basic Latin alphabet and punctuation used in American English, but does not encode the accented characters used in many European languages.
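The 7-bit limit can be checked directly in code. The following is a minimal sketch; the class and method names are hypothetical and chosen only for illustration. It treats a string as ASCII when every character value is at most 0x7F.

```java
class AsciiSketch {
    // Returns true when every UTF-16 code unit in the text fits in 7 bits (0x00 to 0x7F).
    static boolean isAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("resume"));           // unaccented Latin letters only
        System.out.println(isAscii("r\u00E9sum\u00E9")); // contains e with acute accent
    }
}
```

The accented word fails the check, which matches the point that ASCII does not encode the accented characters used in many European languages.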
baseline A conceptual line with respect to which successive characters are aligned.
bidirectional Describes text in languages such as Arabic, Hebrew, and Yiddish, whose general flow proceeds horizontally from right to left, while numbers, English, and other embedded left-to-right text are written from left to right.
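As a rough illustration, the standard java.text.Bidi class can report whether a piece of text is purely left-to-right, purely right-to-left, or mixed. This is a sketch only; the class name BidiSketch and the helper describe are hypothetical, and the sample strings are assumptions chosen for the demonstration.

```java
import java.text.Bidi;

class BidiSketch {
    // Classifies text by running the Unicode bidirectional algorithm on it.
    // The base direction is taken from the first strongly directional
    // character, falling back to left-to-right if there is none.
    static String describe(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        if (bidi.isLeftToRight()) {
            return "left-to-right";
        }
        if (bidi.isRightToLeft()) {
            return "right-to-left";
        }
        return "mixed";
    }

    public static void main(String[] args) {
        System.out.println(describe("abc"));                      // Latin letters only
        System.out.println(describe("\u05D0\u05D1\u05D2"));       // Hebrew letters only
        System.out.println(describe("\u05D0\u05D1\u05D2 123"));   // Hebrew letters followed by digits
    }
}
```

The third string is classified as mixed because the digits run left to right inside a right-to-left line.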
character set A collection of characters in which a numeric code is assigned to each character so that it can be represented on a computer. Most traditional character sets contain characters from only one or two scripts.
Chinese numerals Chinese characters that represent numbers. For example, the Chinese characters for 1, 2, and 3 are written with one, two, and three horizontal brush strokes, respectively. Contrast with Arabic numerals, Indic numerals, and Roman numerals.
code page A synonym for character set.
collation Text comparison using language-sensitive rules as opposed to bitwise comparison of numeric character codes.
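The difference between the two kinds of comparison can be seen by sorting the same words twice. The sketch below uses the standard java.text.Collator class; the class name CollationSketch, its helper methods, and the sample words are hypothetical and chosen only for illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationSketch {
    // Sorts a copy of the words by numeric character code (String.compareTo).
    static String[] sortByCharacterCode(String[] words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the words with the collation rules of the given locale.
    static String[] sortByCollation(String[] words, Locale locale) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"zebra", "Apple", "apple", "Zoo"};
        System.out.println(Arrays.toString(sortByCharacterCode(words)));
        System.out.println(Arrays.toString(sortByCollation(words, Locale.ENGLISH)));
    }
}
```

Sorting by character code places every uppercase word before every lowercase word, so "Zoo" comes before "apple". The collator groups words by letter first, so "zebra" precedes "Zoo".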
cursive script A script whose adjacent characters touch or are connected to each other. For example, Arabic script is cursive.
diacritic A modifying mark on a character. For example, the accent marks in Latin script (acute, tilde, and ogonek) and the tone marks in Thai. Synonymous with accent.
double-byte character set (DBCS) A set of characters in which each character is represented by 2 bytes. Scripts such as Japanese, Chinese, and Korean contain more characters than can be represented by 256 code points, thus requiring two bytes to uniquely represent each character. The term DBCS is often used to mean MBCS (multibyte character set). See MBCS.
EBCDIC Extended Binary-Coded Decimal Interchange Code. A group of coded character sets that consists of eight-bit coded characters. EBCDIC-coded character sets map specified graphic and control characters onto code points, each consisting of 8 bits. EBCDIC is an extension of BCD (Binary-Coded Decimal), which uses only 6 bits for each character.
ECMA European Computer Manufacturers Association. A nonprofit organization formed by European computer vendors to develop and publish standards applicable to the functional design and use of data processing equipment.
encoding scheme A set of specific definitions that describe the philosophy used to represent character data. Examples of specifications in such a definition are: the number of bits, the number of bytes, the allowable ranges of bytes, maximum number of characters, and meanings assigned to some generic and specific bit patterns.
font A set of graphic characters that have a characteristic design, or a font designer's concept of how the graphic characters should appear. The characteristic design specifies the characteristics of its graphic characters. Examples of characteristics are shape, graphic pattern, style, size, weight, and increment.
globalization The process of developing, manufacturing, and marketing software products that are intended for worldwide distribution. This term combines two aspects of the work: internationalization (enabling the product to be used without language or culture barriers) and localization (translating and enabling the product for a specific locale).
glyph The actual shape (bit pattern, outline) of a character image. For example, an italic "A" and a roman "A" are two different glyphs representing the same underlying character. Strictly speaking, any two images that differ in shape constitute different glyphs. In this usage, glyph is a synonym for character image, or simply image.
graphic character A character, other than a control function, that has a visual representation normally handwritten, printed, or displayed.
GMT Greenwich mean time. In the 1840s the standard time kept by the Royal Greenwich Observatory located at Greenwich, England was established for all of England, Scotland, and Wales, replacing many local times in use in those days. Subsequently GMT became the official time reference for the world until 1972 when it was subsumed by the atomic clock-based coordinated universal time (UTC). GMT is also known as universal time.
Hangul The Korean alphabet that consists of fourteen consonants and ten vowels. Hangul was created by a team of scholars in the 15th century at the behest of King Sejong. See jamo.
Hanja The Korean term for characters derived from Chinese.
Hiragana A Japanese phonetic syllabary. The symbols are cursive or curvilinear in style. See Kanji and Katakana.
i18n Synonym for internationalization. (There are 18 letters between the "i" and the "n" in internationalization.)
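The counting rule behind this abbreviation is easy to express in code. The following is a small sketch; the class name NumeronymSketch and the method abbreviate are hypothetical names used only for this illustration.

```java
class NumeronymSketch {
    // Keeps the first and last letters and replaces the letters between
    // them with their count. Words shorter than three letters are returned
    // unchanged because there is nothing to count.
    static String abbreviate(String word) {
        if (word.length() < 3) {
            return word;
        }
        int lettersBetween = word.length() - 2;
        return "" + word.charAt(0) + lettersBetween + word.charAt(word.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("internationalization")); // prints i18n
    }
}
```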
ideographic language A written language in which each character (ideogram) represents a thing or an idea (but not necessarily a particular word or phrase). An example of such a language is written Chinese (Zhongwen). Contrast with alphabetic language.
Indic numerals A set of numerals used in India and many Arabic countries instead of, or in addition to, the Arabic numerals. Each of the ten Indic numeral shapes corresponds to one of the Arabic numeral shapes 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. Contrast with Arabic numerals, Chinese numerals, and Roman numerals. See numbers.
internationalization The process of producing an application that can be localized for a particular country without any changes to the program code. Internationalized applications store their text in external resources, and use locale-sensitive utilities for formatting and collation.
jamo A set of consonants and vowels used in Korean Hangul. The word jamo is derived from ja, which means consonant, and mo, which means vowel.
Kanji Chinese characters or ideograms used in Japanese writing. The characters may have different meanings from their Chinese counterparts. See Hiragana and Katakana.
Katakana A Japanese phonetic syllabary used primarily for foreign names and place names and words of foreign origin. The symbols are angular, while those of Hiragana are cursive. Katakana is written left to right, or top to bottom. See Kanji.
l10n Synonym for localization. (There are 10 letters between the "l" and the "n" in localization.)
language A set of characters, phonemes, conventions, and rules used for conveying information. The aspects of a language are pragmatics, semantics, syntax, phonology, and morphology.
legacy An inherited obligation. For example, a legacy database might contain strategic data that must be maintained for a long time after the database has become technologically obsolete.
localization The process of converting a program to run in a particular locale or country, so that all text is displayed in the native language, and native conventions are used for sorting, formatting, etc.
lowercase The small alphabetic characters, whether accented or not, as distinguished from the capital alphabetic characters. The concept of case applies to alphabets such as Latin, Cyrillic, and Greek, but not to Arabic, Hebrew, Thai, Japanese, Chinese, Korean, and many other scripts. Examples of lowercase letters are a, b, and c. Contrast with uppercase.
MBCS Multibyte Character Set. A set of characters in which each character is represented by 1 or more bytes. Contrast with DBCS and SBCS.
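To show what "1 or more bytes" means in practice, the sketch below counts the bytes needed for individual characters. It uses UTF-8 as a stand-in for a multibyte encoding, which is an assumption made for the example; the glossary does not name a particular encoding. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MultibyteSketch {
    // Returns the number of bytes the text occupies when encoded as UTF-8.
    static int encodedLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength("A"));      // Latin capital A: 1 byte
        System.out.println(encodedLength("\u00E9")); // e with acute accent: 2 bytes
        System.out.println(encodedLength("\u65E5")); // the Kanji for "day": 3 bytes
    }
}
```

Because the byte count varies from character to character, code that handles such text cannot assume that the number of bytes equals the number of characters.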
multilingual An application that can simultaneously display and manipulate text in multiple languages. For example, a word processor that allows Japanese and English in the same document is multilingual.
NLS National Language Support. The features of a product that accommodate a specific region, its language, script, local conventions, and culture. See internationalization and localization.
normalization The process of converting Unicode text into one of several standardized forms in which precomposed and combining characters are used consistently. See Unicode Technical Report #15 for details.
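A short sketch of why normalization matters: the same accented letter can be stored as one precomposed character or as a base letter plus a combining mark, and the two spellings do not compare equal until they are normalized. The example uses the standard java.text.Normalizer class; the class name NormalizationSketch and its helper methods are hypothetical.

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Converts text to the composed form (NFC), which prefers precomposed characters.
    static String toComposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Converts text to the decomposed form (NFD), which prefers base letters
    // followed by combining characters.
    static String toDecomposed(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // e with acute accent as a single character
        String combining = "e\u0301";  // plain e followed by a combining acute accent

        System.out.println(precomposed.equals(combining));                         // false
        System.out.println(toComposed(precomposed).equals(toComposed(combining))); // true
    }
}
```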
numbers Numbers express either quantity (cardinal) or order (ordinal). Many cultures have different forms for cardinal and ordinal numbers. For example, in French the cardinal number five is cinq, but the ordinal fifth is cinquième or 5eme or 5e. Numbers are written with symbols that are usually referred to as numerals. See Arabic numerals, Chinese numerals, Indic numerals, and Roman numerals.
pinyin A system to phonetically render Chinese ideograms in a Latin alphabet.
Roman numerals A system of writing numbers in which the characters I, V, X, L, C, D, and M have the value of 1, 5, 10, 50, 100, 500, and 1000, respectively. Lesser numbers in prefix positions indicate subtraction. For example MCMLXIV is 1964 in decimal, because M is 1000, CM is 900, LX is 60, and IV is 4. Contrast with Arabic numerals, Chinese numerals, and Indic numerals.
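The subtraction rule can be sketched as a short conversion routine: a symbol is subtracted when the symbol after it has a larger value, and added otherwise. The class and method names below are hypothetical, and the sketch does not check whether the input is a well-formed numeral.

```java
class RomanNumeralSketch {
    // Maps a single Roman numeral character to its value.
    static int valueOf(char symbol) {
        switch (symbol) {
            case 'I': return 1;
            case 'V': return 5;
            case 'X': return 10;
            case 'L': return 50;
            case 'C': return 100;
            case 'D': return 500;
            case 'M': return 1000;
            default: throw new IllegalArgumentException("Not a Roman numeral character: " + symbol);
        }
    }

    // Adds each symbol's value, except that a symbol followed by a larger
    // one is in a prefix position and is subtracted instead.
    static int toDecimal(String roman) {
        int total = 0;
        for (int i = 0; i < roman.length(); i++) {
            int current = valueOf(roman.charAt(i));
            int next = (i + 1 < roman.length()) ? valueOf(roman.charAt(i + 1)) : 0;
            if (current < next) {
                total -= current;
            } else {
                total += current;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("MCMLXIV")); // prints 1964
    }
}
```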
SBCS (Single-byte character set) A set of characters in which each character is represented by 1 byte.
script A set of characters used to write a particular set of languages. For example, the Latin (or Roman) script is used to write English, French, Spanish, and most other European languages; the Cyrillic script is used to write Russian and Serbian.
separator The thousands separator (or digit grouping separator) is the local symbol used to separate every third digit in large numbers or lengthy decimal fractions. The decimal separator is the local symbol used to indicate the decimal position in a number.
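In Java, both separators can be looked up per locale rather than hard-coded. The sketch below uses the standard java.text.NumberFormat and java.text.DecimalFormatSymbols classes; the class name SeparatorSketch, its helper methods, and the choice of the US and German locales are assumptions made for the illustration.

```java
import java.text.DecimalFormatSymbols;
import java.text.NumberFormat;
import java.util.Locale;

class SeparatorSketch {
    // Formats a number with the grouping and decimal separators of the locale.
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    // Returns the thousands (digit grouping) separator of the locale.
    static char groupingSeparator(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getGroupingSeparator();
    }

    // Returns the decimal separator of the locale.
    static char decimalSeparator(Locale locale) {
        return DecimalFormatSymbols.getInstance(locale).getDecimalSeparator();
    }

    public static void main(String[] args) {
        System.out.println(format(1234567.891, Locale.US));      // prints 1,234,567.891
        System.out.println(format(1234567.891, Locale.GERMANY)); // prints 1.234.567,891
    }
}
```

The two locales use the same pair of symbols with the roles swapped, which is why parsing or formatting numbers without a locale is a common source of errors.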
transcoding Conversion of character data from one character set to another.
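A minimal sketch of transcoding in Java: decode the bytes with the source character set, then encode the resulting string with the target one. The class name TranscodingSketch and the method transcode are hypothetical, and the choice of ISO-8859-1 and UTF-8 is an assumption made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodingSketch {
    // Decodes the input bytes using the source character set, then
    // re-encodes the characters using the target character set.
    // Characters the target cannot represent are silently replaced
    // by String.getBytes, so this sketch suits only compatible sets.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String decoded = new String(input, from);
        return decoded.getBytes(to);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // e with acute accent in ISO-8859-1
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // prints C3 A9
        }
        System.out.println();
    }
}
```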
translation The conversion of text from one human language to another. When localizing an application, one of the largest tasks is the translation of all text resources into the target language.
transliteration Transformation of text from one script to another, usually based on phonetic equivalencies. For example, Greek text might be transliterated into the Latin script so that it can be pronounced by English speakers.
Unicode A character set designed to encompass the scripts of all of the world's living languages. Unicode is the basis of most modern software internationalization.
