Level: Introductory
Tony Graham (tkg@menteith.com), Senior consultant, Mulberry Technologies
01 Mar 2001
Pango is an open-source framework for the layout and rendering of internationalized text, and is being included in the next generation of GTK+ and GNOME. In the first of a two-part series, Tony Graham introduces Pango and describes how it handles text, as well as the text attributes that you can specify for formatted text. The article concludes with a summary of Pango's processing pipeline for formatting and rendering a simple text string and a list of its attributes.
Pango is an open-source framework for the layout and rendering of
internationalized text, including right-to-left scripts and scripts
such as Tamil where glyphs are context-sensitive. Not surprisingly,
Pango uses Unicode characters internally (represented using
UTF-8), and Pango's interfaces also use UTF-8. Other encodings can be
supported by using a translation library such as GNU iconv to
convert the text to UTF-8 before processing.
Pango is designed as a modular, cross-platform, cross-toolkit,
low-level library that can be used in multiple contexts. It is also
intimately related to the GTK+ and GNOME projects; the Pango
project started because of the need for high-quality internationalized
text in GTK+ and GNOME. While Pango can be used separately, the
current Pango (0.13) is being included in the development branch 1.3.x
versions of GTK+ that are currently under heavy development; Pango
will ultimately be incorporated into GTK+ 2.0.
The name "Pango" comes from the Greek "Pan"
(Παν), meaning "All," and the Japanese "Go"
(語), meaning "Language."
What are GTK+ and GNOME?
GTK stands for GIMP Toolkit, and GTK+ is a library of functions
that, among other things, give an object-oriented flavor to the
lower-level functions in the GIMP Drawing Kit or GDK. GDK is
a library of functions that simplify programming the low-level
X library.
GNOME is the name of both a desktop environment and a programming
library. A GNOME application uses the objects and functions defined in
the GNOME library to interface with the desktop widgets. The
application may also mix in calls to GTK+ functions, or even to GDK,
X, or lower-level glib or C functions.
GTK+ and GNOME are object-oriented even though they are written in
C. While object orientation is not intrinsic to C, the libraries
achieve object orientation by convention: The structs representing
objects each reference the struct representing their superclass, and
objects' properties are changed and new objects are created by calling
the appropriate methods. All this requires restraint on the part of
the programmer, but the resulting code is more portable because nearly
every platform has a C compiler. In addition, GNOME and GTK+
interface bindings from other languages -- both object-oriented and
otherwise -- have been defined.
Why UTF-8?
Strings in Pango's interfaces are UTF-8 because of its
compatibility with existing 8-bit software, its pervasiveness on
UNIX platforms, its ability to handle characters outside Plane 0
without extra effort, and its independence from byte-order
concerns. Offsets into UTF-8 strings are counted in bytes, not characters.
The Pango documentation acknowledges that UTF-8's variable length
makes it harder to count characters in a string, but the documentation
also notes that, in Unicode, any non-spacing marks in the string break
any correspondence between character positions and strings, even for
fixed-width encodings.
The Pango documentation also acknowledges that UTF-8 has a 50%
overhead for CJKV ideographs compared to UTF-16 (three bytes per
ideograph instead of two).
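To make the byte-offset convention concrete, here is a minimal Python sketch. It is not Pango code, and the helper name `byte_offsets` is hypothetical. It lists the UTF-8 byte offset at which each character of a string starts, then compares the encoded size of one ideograph in UTF-8 and UTF-16.

```python
def byte_offsets(text):
    """Return the UTF-8 byte offset at which each character of text starts."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        position += len(ch.encode("utf-8"))
    return offsets


# "é" occupies two bytes in UTF-8, so every character after it starts one
# byte later than its character index would suggest.
print(byte_offsets("café au lait"))

# The ideograph U+8A9E takes three bytes in UTF-8 but two in UTF-16.
ideograph = "\u8a9e"
print(len(ideograph.encode("utf-8")), len(ideograph.encode("utf-16-be")))
```

For plain ASCII text the two kinds of offset coincide, which is why the difference is easy to overlook until the first accented or non-Latin character appears.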
Single characters as UCS-4
Single characters are represented with 32 bits for planned upward
compatibility with any characters to be defined in ISO/IEC
10646. While the ISO working group has recently committed to using
only the same million or so code points covered by UTF-16,
even that reduced range requires 21 bits, and 32 bits is still the
next highest standardized word size.
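The 21-bit figure can be checked with a line or two of arithmetic. The sketch below is only an illustration; the constant and function names are hypothetical, and 0x10FFFF is the highest code point that UTF-16 surrogate pairs can reach.

```python
# Highest code point reachable through UTF-16 surrogate pairs.
MAX_UTF16_CODE_POINT = 0x10FFFF


def bits_needed(code_point):
    """Number of bits needed to store code_point as an unsigned integer."""
    return max(1, code_point.bit_length())


print(MAX_UTF16_CODE_POINT + 1)           # number of code points in the range
print(bits_needed(MAX_UTF16_CODE_POINT))  # bits needed for the largest one
print(len("a".encode("utf-32-be")))       # bytes per character in a 32-bit form
```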
BiDi library
Pango uses Dov Grobgeld's FriBidi implementation of the Unicode
bidirectional algorithm (see Resources). When
Pango is compiled with the --with-fribidi option, it will
use a copy of FriBidi that you provide; otherwise the copy in the
Pango source is used. The minimal version included with Pango 0.13 is
an older version that supports Unicode 2.1.8, whereas the latest
FriBidi version as of this writing supports Unicode 3.0.1.
Language and other attributes
In addition to handling right-to-left text, Pango supports language
tagging, so, for example, it will attempt to use a Japanese font for
text marked as Japanese. Language tagging, like all Pango text
attribute tagging, is a Pango-specific scheme. Language tagging
does not use Unicode's Plane 14 language tags, nor does it relate to
the xml:lang and html:lang attributes defined by the
W3C, but those and other language markup schemes could easily be
translated into Pango language attributes.
The complete set of Pango text attributes is shown in the following
list:
- Language
- Font family: name of a font family or a comma-separated list of families
- Style: normal, oblique, or italic
- Weight: six possible values from ultralight to heavy
- Variant: normal or small caps
- Stretch: nine possible values from ultracondensed to ultraexpanded
- Size: font size in thousandths of a point
- Font description: shorthand label for a particular font family, style, variant, weight, stretch, and size
- Foreground color
- Background color
- Underline: whether the text is underlined with a single, double,
or low line
- Strikethrough: whether the text is struck through
- Rise: vertical displacement
- Shape: shape to impose on a glyph
- Scale
The following two figures show examples of Pango in action. Note
the use of German, Greek, Hebrew, Japanese, and Arabic text in the
first figure and the additional use of French, Korean, and Russian in
the second. Labels and text boxes containing German and French are
admittedly easy to achieve on most English or European computer
systems, but it is much less common for a computer system to be able
to handle those languages and the other languages shown in the
figures in combination.
Figure: Styled, multilanguage, and bidirectional text
Figure: Multiple languages in widget labels
Marking up text attributes
The different attributes for a sequence of characters, including
the language, are maintained separately from the text as a list of
structures, one structure for each span of each attribute type. Every
structure indicates a single attribute class and the start and end of the
character range to which the class applies. Particular attribute types
extend this with additional information; for example, the color
attributes also record the red, green, and blue components of the
color to apply to the span.
You can build the attribute list for some text yourself (for example, for a
widget label), but it can be a painstaking task when
there are a lot of attribute changes. Also, as the Pango documentation
notes, the character ranges in each attribute structure will surely
be invalid for any later translation of the original attributed
text. As a convenience measure for translators in particular, Pango
supports a simple HTML-like markup language for embedding attribute
changes in the text, and it provides the pango_parse_markup()
function for converting marked-up text into a plain string and a
separate attribute list.
The root element is <markup>,
but it can be omitted. (You can omit both the start tag
and the end tag, but omitting just one causes an error.)
The most versatile element, and the one likely to be used most
often, is <span>. Like the HTML element with
the same name, this marks a span of text, and its start tag may have
the following attributes whose values will be translated into Pango
text attribute values:
- font_desc: A shorthand font description, such as "Sans Italic 12" (any other span attributes override this description)
- font_family: A font family name
- face: Synonym for the font_family attribute
- size: Font size in thousandths of a point; a predefined absolute size keyword such as xx-small or xx-large; or one of the relative sizes smaller or larger
- style: One of normal, oblique, or italic, corresponding to the allowed values of the style text attribute
- weight: One of six keywords such as ultralight, normal, or heavy -- or a numeric weight
- variant: normal or smallcaps
- stretch: One of nine keywords such as ultracondensed, normal, and ultraexpanded that correspond to the allowed values of the stretch text attribute
- foreground: An RGB color specification such as #00FF00 or a color name such as red
- background: An RGB color specification such as #00FF00 or a color name such as red
- underline: One of single, double, low, or none
- rise: Vertical displacement, in ten thousandths of an em. Can be negative for subscript, positive for superscript
- strikethrough: true or false, whether to strike through the text
- lang: A language code (for example, fr)
The markup language also includes a handful of convenience elements
that do not have attributes:
- <b>: bold
- <big>: equivalent to <span size="larger">
- <i>: italic
- <s>: strikethrough
- <sub>: subscript
- <sup>: superscript
- <small>: equivalent to <span size="smaller">
- <tt>: monospace font
- <u>: underline
Successive steps of the absolute and relative values of
the size attribute, and the size increase or decrease produced by the
<big> or <small> elements, are in the ratio
1:1.2 (or 1.2:1); this is the same as the CSS scale factor between
its text sizes.
The markup language is case-sensitive, unlike HTML (but like
XML), and the only tags that can be omitted are the pair of the
<markup> start tag and end tag.
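The 1.2 ratio between size steps can be illustrated with a short Python sketch. This is not Pango's code: the function names are hypothetical, and both the full keyword list (the text above names only xx-small and xx-large) and the choice of "medium" as the base size are assumptions that follow the CSS convention.

```python
STEP = 1.2  # ratio between successive size steps, as stated in the text

# Assumed CSS-style keyword list in ascending order; "medium" is the base.
KEYWORDS = ["xx-small", "x-small", "small", "medium",
            "large", "x-large", "xx-large"]


def keyword_scale(keyword):
    """Scale factor of an absolute size keyword relative to "medium"."""
    steps = KEYWORDS.index(keyword) - KEYWORDS.index("medium")
    return STEP ** steps


def scaled_size(base_size, keyword):
    """Scale a size given in thousandths of a point, rounded to an integer."""
    return round(base_size * keyword_scale(keyword))


# A 12-point base size is 12000 in thousandths of a point.
for name in KEYWORDS:
    print(name, scaled_size(12000, name))
```

Under these assumptions a single <big> element corresponds to one step up, and a single <small> element to one step down.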
In the pipeline
Pango implements formatting and rendering in a staged pipeline.
The following example adds markup to an example used in both
Chapter 3 of The Unicode Standard, Version 3.0 and UAX #9 (see
Resources). The uppercase text in the example
stands for right-to-left text such as Arabic or Hebrew. The markup
makes some of the text underlined, some of it blue, and some
of it both underlined and blue.
<u>car </u><span foreground="blue"><u>is </u>THE CAR</span> in arabic
The effect of the markup is shown in the following table.
<table border="1">
<tr>
<td>String</td>
<td><code>car </code></td>
<td><code>is </code></td>
<td><code>THE CAR</code></td>
<td><code> in arabic</code></td>
</tr>
<tr>
<td>Foreground</td>
<td> </td>
<td colspan="2" align="center"><span style="color: blue">Blue</span></td>
<td> </td>
</tr>
<tr>
<td>Underline</td>
<td colspan="2" align="center"><u>True</u></td>
<td> </td>
<td> </td>
</tr>
</table>
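The separation of marked-up text into a plain string plus attribute spans can be approximated in Python with the standard XML parser. This is only a sketch of the idea, not Pango's parser: `parse_markup_sketch`, the `CONVENIENCE` table, and the tuple layout are hypothetical, only a few convenience elements are covered, and the number and order of spans need not match what pango_parse_markup() returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a few convenience elements to (attribute, value).
CONVENIENCE = {
    "b": ("weight", "bold"),
    "i": ("style", "italic"),
    "u": ("underline", "single"),
    "s": ("strikethrough", "true"),
}


def parse_markup_sketch(markup):
    """Split marked-up text into (plain_text, spans).

    Each span is (attribute, value, start, end); start and end are UTF-8
    byte offsets into plain_text, and end is exclusive.
    """
    root = ET.fromstring("<markup>" + markup + "</markup>")
    pieces = []  # text fragments in document order
    spans = []
    length = 0   # UTF-8 bytes emitted so far

    def emit(text):
        nonlocal length
        if text:
            pieces.append(text)
            length += len(text.encode("utf-8"))

    def walk(element):
        start = length
        emit(element.text)
        for child in element:
            walk(child)
            emit(child.tail)  # text after the child still belongs to element
        if element.tag == "span":
            attrs = list(element.attrib.items())
        elif element.tag in CONVENIENCE:
            attrs = [CONVENIENCE[element.tag]]
        else:
            attrs = []
        for name, value in attrs:
            spans.append((name, value, start, length))

    walk(root)
    return "".join(pieces), spans


EXAMPLE = '<u>car </u><span foreground="blue"><u>is </u>THE CAR</span> in arabic'
text, spans = parse_markup_sketch(EXAMPLE)
print(text)
for span in spans:
    print(span)
```

Because the sketch records one span per element, the underline on "car " and the underline on "is " come out as two separate spans even though they are adjacent.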
Itemization
The first step when laying out the text is to break the string
into portions with consistent attributes, including consistent
language tag, bidirectional category, color, etc.
Markup for the attributes is just a convenience feature, and the
pipeline really begins with text and a list of Pango attributes, so
step 0, as it were, is to call pango_parse_markup() with
the above example as input. This returns a single string containing
the text and a list of four Pango attributes -- one for each change in
the attributes. The table below shows the spans.
<table border="1">
<tr>
<td>String</td>
<td><code>car </code></td>
<td><code>is </code></td>
<td><code>THE CAR</code></td>
<td><code> in arabic</code></td>
</tr>
<tr>
<td>Foreground</td>
<td> </td>
<td align="center"><span style="color: blue">Blue</span></td>
<td align="center"><span style="color: blue">Blue</span></td>
<td> </td>
</tr>
<tr>
<td>Underline</td>
<td align="center"><u>True</u></td>
<td align="center"><u>True</u></td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Bidi Level</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
</tr>
</table>
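The itemization shown in the table can be sketched as a scan that starts a new item whenever the set of active attributes or the bidi level changes. The following Python is an illustration under stated assumptions, not Pango's algorithm: `itemize` and its data layout are hypothetical, the bidi levels are supplied by hand rather than computed, and the example text is ASCII so that byte and character offsets coincide.

```python
def itemize(text, spans, levels):
    """Split text into maximal runs with the same attributes and bidi level.

    spans:  list of (attribute, value, start, end), end exclusive
    levels: one bidi embedding level per character (supplied by hand here)
    Returns a list of (substring, attributes, level) tuples.
    """
    def state_at(index):
        attrs = {name: value for name, value, start, end in spans
                 if start <= index < end}
        return attrs, levels[index]

    items = []
    run_start = 0
    for index in range(1, len(text) + 1):
        # Close the current run at the end of the text or at any change.
        if index == len(text) or state_at(index) != state_at(run_start):
            attrs, level = state_at(run_start)
            items.append((text[run_start:index], attrs, level))
            run_start = index
    return items


TEXT = "car is THE CAR in arabic"
SPANS = [("underline", "single", 0, 7), ("foreground", "blue", 4, 14)]
LEVELS = [0] * 7 + [1] * 7 + [0] * 10  # uppercase stands for right-to-left text

for item in itemize(TEXT, SPANS, LEVELS):
    print(item)
```

With this input the scan yields the four items in the table: "car ", "is ", "THE CAR", and " in arabic".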
Reordering
The items are then reordered into visual
order, as the following table shows. Remember that for the purposes of
this example, uppercase text stands for right-to-left text such as
Arabic or Hebrew.
The "Bidi Level" in the table is the Unicode bidirectional
embedding level of the spans, where even numbers (including 0)
indicate left-to-right text and odd numbers indicate right-to-left
text. Bidi level is not recorded in Pango attributes, but it is
calculated by the FriBidi library.
<table border="1">
<tr>
<td>String</td>
<td><code>car </code></td>
<td><code>is </code></td>
<td><code>RAC EHT</code></td>
<td><code> in arabic</code></td>
</tr>
<tr>
<td>Foreground</td>
<td> </td>
<td align="center"><span style="color: blue">Blue</span></td>
<td align="center"><span style="color: blue">Blue</span></td>
<td> </td>
</tr>
<tr>
<td>Underline</td>
<td align="center"><u>True</u></td>
<td align="center"><u>True</u></td>
<td> </td>
<td> </td>
</tr>
<tr>
<td>Bidi Level</td>
<td align="center">0</td>
<td align="center">0</td>
<td align="center">1</td>
<td align="center">0</td>
</tr>
</table>
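The reordering step can be sketched with the reversal rule of the Unicode bidirectional algorithm (rule L2 in UAX #9): working from the highest level down to the lowest odd level, reverse every contiguous run of characters at that level or higher. The Python below is a simplified illustration, not Pango or FriBidi code: `reorder_visual` is a hypothetical name, it works on characters of a single line rather than on items, and the levels are supplied by hand.

```python
def reorder_visual(text, levels):
    """Reorder one line from logical to visual order.

    levels holds one bidi embedding level per character of text.
    """
    chars = list(text)
    odd_levels = [level for level in levels if level % 2 == 1]
    if not odd_levels:
        return text  # nothing is right-to-left, so nothing moves
    for level in range(max(levels), min(odd_levels) - 1, -1):
        index = 0
        while index < len(chars):
            if levels[index] >= level:
                end = index
                while end < len(chars) and levels[end] >= level:
                    end += 1
                # Runs at a higher level always lie inside runs at a lower
                # level, so the original level positions stay valid here.
                chars[index:end] = chars[index:end][::-1]
                index = end
            else:
                index += 1
    return "".join(chars)


LEVELS = [0] * 7 + [1] * 7 + [0] * 10
print(reorder_visual("car is THE CAR in arabic", LEVELS))
```

For the example string, only the level-1 run is reversed, which gives the "RAC EHT" shown in the table.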
Glyph selection
Pango then selects the appropriate glyphs
for the characters in each item.
Pango supports script-specific layout engines so, for example,
Tamil glyph selection is done by the Tamil engine and Thai glyph
selection is done by the Thai engine. There doesn't have to be one
engine per script, however, and, in practice, characters from the Basic
Latin, Latin-1 Supplement, Greek, Cyrillic, and several other blocks
are all handled by the "basic" engine.
Justification
The glyph strings are justified, for example, to the
right or to the left as shown by some of the labels in the previous figure.
Rendering
The glyphs are rendered onto an output
device. Pango is not a rendering system, but it does include a
rendering routine for X fonts. Other output devices will require
other, external rendering routines.
The following table shows how the
example might look when rendered.
<table border="1">
<tr>
<td><code><u>car <span style="color: blue">is </span></u><span
style="color: blue">RAC EHT</span> in arabic</code></td>
</tr>
</table>
More Pango
In the second installment, I'll show the code for the example and
discuss how Pango selects glyphs and renders text.
Resources
About the author
Tony Graham is the author of Unicode: A Primer, the first and currently
only book about the Unicode
Standard, Version 3.0, and its uses. An Australian, Tony is a
Specialist member of the Unicode Consortium. He can be reached at tkg@menteith.com.