The Unicode standard
The Unicode Standard is the specification of an encoding scheme for written characters and text. It is a universal standard that enables consistent encoding of multilingual text and allows text data to be interchanged internationally without conflict. The ISO standards for C and C++ refer to Information technology – Programming Languages – Universal Multiple-Octet Coded Character Set (UCS), ISO/IEC 10646:2003. (The term octet is used by ISO to refer to a byte.) The ISO/IEC 10646 standard is more restrictive than the Unicode Standard in the number of encoding forms: a character set that conforms to ISO/IEC 10646 is also conformant to the Unicode Standard.
The Unicode Standard specifies a unique numeric value and name for each character and defines three encoding forms for the bit representation of the numeric value. The name/value pair creates an identity for a character. The hexadecimal value representing a character is called a code point. The specification also describes overall character properties, such as case, directionality, alphabetic properties, and other semantic information for each character. Modeled on ASCII, the Unicode Standard treats alphabetic characters, ideographic characters, and symbols, and allows implementation-defined character codes in reserved code point ranges. According to the Unicode Standard, the encoding scheme of the standard is therefore sufficiently flexible to handle all known character encoding requirements, including coverage of all the world's historical scripts.
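As an illustration of how one code point maps onto an encoding form, the following is a minimal sketch of UTF-16 encoding in C. The function name `utf16_encode` is hypothetical and is not part of any compiler or library API. The sketch assumes the usual Unicode limits: the highest code point is 10FFFF, and the range D800 through DFFF is reserved for surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): writes the UTF-16 code
   units for code point cp into out and returns how many were written
   (1 or 2). Returns 0 if cp cannot be encoded, that is, if it is a
   surrogate value or lies above 10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {
        /* Code points below 10000 fit in one 16-bit code unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Higher code points are split into a surrogate pair:
       the upper 10 bits go to the high surrogate, the lower 10 bits
       to the low surrogate. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

The same code point always has one UTF-32 code unit, which is why the two forms differ only for values of 10000 and above.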
C99 allows the universal character name construct defined in ISO/IEC 10646 to represent characters outside the basic source character set. It permits universal character names in identifiers, character constants, and string literals.
To be compatible with C99, the XL C/C++ compiler supports universal character names as an IBM extension. In C++, you must compile with the -qlanglvl=ucs option for universal character name support.
The following table shows the two forms of a universal character name, where N is a hexadecimal digit.
| Universal character name | ISO/IEC 10646 short name |
|---|---|
| \UNNNNNNNN | NNNNNNNN |
| \uNNNN | 0000NNNN |
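The equivalence of the two forms can be checked with a short sketch like the one below. The function name `ucn_forms_match` is hypothetical. The sketch assumes a compiler that accepts universal character names in C source and an execution character set that can represent U+00E9. The bytes stored in the narrow string depend on that character set, so the example compares the two spellings with each other rather than against fixed byte values.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical check: returns 1 when the four-digit and eight-digit
   spellings of the same universal character name give the same result. */
int ucn_forms_match(void)
{
    /* \u00E9 is shorthand for \U000000E9. */
    const char *short_form = "caf\u00E9";
    const char *long_form  = "caf\U000000E9";

    wchar_t wide_short = L'\u00E9';
    wchar_t wide_long  = L'\U000000E9';

    return strcmp(short_form, long_form) == 0 && wide_short == wide_long;
}

/* Example use: the two spellings are expected to be interchangeable. */
void ucn_forms_example(void)
{
    assert(ucn_forms_match() == 1);
}
```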
C99 and C++ disallow the hexadecimal values representing characters in the basic character set (base source code set) and the code points reserved by ISO/IEC 10646 for control characters. The following characters are also disallowed:
- Any character whose short identifier is less than 00A0. The exceptions are 0024 ($), 0040 (@), and 0060 (`).
- Any character whose short identifier is in the code point range D800 through DFFF inclusive.
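The two range rules above can be expressed as a small predicate. This is only a sketch: `ucn_value_allowed` is a hypothetical name, not a compiler facility, and it covers only the two listed rules (values below 00A0 with three exceptions, and the range D800 through DFFF).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: true if the short identifier cp may be written as
   a universal character name under the two listed rules. */
bool ucn_value_allowed(unsigned long cp)
{
    /* D800 through DFFF inclusive is never allowed. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;

    /* Below 00A0, only 0024, 0040, and 0060 are allowed. */
    if (cp < 0x00A0)
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;

    return true;
}

/* Example use with a few representative values. */
void ucn_value_examples(void)
{
    assert(ucn_value_allowed(0x0024));   /* listed exception */
    assert(!ucn_value_allowed(0x0041));  /* below 00A0, not an exception */
    assert(ucn_value_allowed(0x00E9));   /* at or above 00A0 */
    assert(!ucn_value_allowed(0xDFFF));  /* inside D800 through DFFF */
}
```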
UTF literals (IBM extension)
The ISO C and ISO C++ Committees have approved the implementation of u-literals and U-literals to support Unicode UTF-16 and UTF-32 character literals, respectively.
| Syntax | Explanation |
|---|---|
| u'character' | Denotes a UTF-16 character. |
| u"character-sequence" | Denotes an array of UTF-16 characters. |
| U'character' | Denotes a UTF-32 character. |
| U"character-sequence" | Denotes an array of UTF-32 characters. |
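A minimal sketch of the four literal forms follows. It assumes a compiler that provides `char16_t` and `char32_t` through `<uchar.h>`, as C11 does. The variable names, the `COUNT` macro, and the function `utf_literal_examples` are illustrative only.

```c
#include <assert.h>
#include <uchar.h>   /* char16_t and char32_t */

/* Number of array elements, including the terminating null. */
#define COUNT(a) (sizeof (a) / sizeof (a)[0])

static const char16_t utf16_char   = u'a';
static const char32_t utf32_char   = U'a';
static const char16_t utf16_text[] = u"ab";
static const char32_t utf32_text[] = U"ab";

/* A character outside the 16-bit range: two UTF-16 code units,
   but a single UTF-32 code unit. */
static const char16_t utf16_wide[] = u"\U0001F600";
static const char32_t utf32_wide[] = U"\U0001F600";

void utf_literal_examples(void)
{
    assert(utf16_char == 0x0061);
    assert(utf32_char == 0x00000061);
    assert(COUNT(utf16_text) == 3);   /* 'a', 'b', null */
    assert(COUNT(utf32_text) == 3);
    assert(COUNT(utf16_wide) == 3);   /* surrogate pair, null */
    assert(COUNT(utf32_wide) == 2);   /* one code unit, null */
    assert(utf32_wide[0] == 0x1F600);
}
```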
String concatenation of u-literals
The u-literals and U-literals follow the same concatenation rule as wide character literals: when a normal character string literal is concatenated with a u-literal or U-literal, the normal string is widened to match. The following table shows the allowed combinations. All other combinations are invalid.
| Combination | Result |
|---|---|
| u"a" u"b" | u"ab" |
| u"a" "b" | u"ab" |
| "a" u"b" | u"ab" |
| U"a" U"b" | U"ab" |
| U"a" "b" | U"ab" |
| "a" U"b" | U"ab" |
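The allowed combinations can be exercised with a sketch like the following, which compares each concatenation with the single literal it should produce. It assumes a compiler with `<uchar.h>` and support for mixing a prefixed literal with an unprefixed one, as the table describes. The array and function names are illustrative.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>   /* char16_t and char32_t */

static const char16_t u_u[]     = u"a" u"b";
static const char16_t u_plain[] = u"a" "b";
static const char16_t plain_u[] = "a" u"b";
static const char16_t expect16[] = u"ab";

static const char32_t U_U[]     = U"a" U"b";
static const char32_t U_plain[] = U"a" "b";
static const char32_t plain_U[] = "a" U"b";
static const char32_t expect32[] = U"ab";

/* Returns 1 if every allowed combination matches its expected result. */
int utf_concat_matches(void)
{
    return sizeof u_u == sizeof expect16
        && sizeof u_plain == sizeof expect16
        && sizeof plain_u == sizeof expect16
        && memcmp(u_u, expect16, sizeof expect16) == 0
        && memcmp(u_plain, expect16, sizeof expect16) == 0
        && memcmp(plain_u, expect16, sizeof expect16) == 0
        && sizeof U_U == sizeof expect32
        && sizeof U_plain == sizeof expect32
        && sizeof plain_U == sizeof expect32
        && memcmp(U_U, expect32, sizeof expect32) == 0
        && memcmp(U_plain, expect32, sizeof expect32) == 0
        && memcmp(plain_U, expect32, sizeof expect32) == 0;
}

void utf_concat_example(void)
{
    assert(utf_concat_matches() == 1);
}
```

Combinations that mix a u-literal with a U-literal are not in the table and are left out of the sketch because they are invalid.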


