Basic UTF-8 literals

The format and rules for basic UTF-8 literals are listed in this section.

Format 1: Basic UTF-8 literals
U"character-data"
U'character-data'
U" or U'
Opening delimiters. The opening delimiter must be coded as single-byte characters. It cannot be split across lines.
" or '
The closing delimiter. The closing delimiter must be coded as a single-byte character. If a quotation mark is used in the opening delimiter, it must be used as the closing delimiter. Similarly, if an apostrophe is used in the opening delimiter, it must be used as the closing delimiter.
To include the quotation mark or apostrophe used in the opening delimiter in the content of the literal, specify a pair of quotation marks or apostrophes, respectively. For example:
  • U'This literal''s content includes an apostrophe';
  • U'This literal includes ", which is not used in the opening delimiter';
  • U"This literal includes "", which is used in the opening delimiter".
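The delimiter rules above can be modeled outside the compiler. The following is a minimal Python sketch, not the compiler's implementation: `literal_content` is a hypothetical helper that checks the opening and closing delimiters and collapses each doubled delimiter into one. It assumes the source text has already been decoded to a Python string, and it ignores escape sequences, DBCS shift codes, and continuation.

```python
def literal_content(source: str) -> str:
    """Return the content of a literal written as U"..." or U'...'.

    Hypothetical helper that models only the delimiter rules.
    """
    if len(source) < 4 or source[0] not in "Uu" or source[1] not in "\"'":
        raise ValueError("expected U\" or U' followed by content and a closing delimiter")
    quote = source[1]
    if source[-1] != quote:
        raise ValueError("closing delimiter must match the opening delimiter")
    body = source[2:-1]
    # Inside the content, the opening delimiter character may appear only in pairs.
    if quote in body.replace(quote * 2, ""):
        raise ValueError("unpaired delimiter inside the literal")
    # Each pair stands for one occurrence of the delimiter character.
    return body.replace(quote * 2, quote)
```

For instance, `literal_content("U'This literal''s content'")` yields `This literal's content`, while a quotation mark inside an apostrophe-delimited literal passes through unchanged.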
character-data
The source text representation of the content of the UTF-8 literal. character-data can include any combination of EBCDIC single-byte characters and double-byte characters encoded in the Coded Character Set ID (CCSID) specified by the CODEPAGE compiler option.

DBCS characters in the content of the literal must be delimited by shift-out and shift-in control characters.

character-data can contain the following Unicode escape sequences:
  • \uhhhh, where each h represents a hexadecimal digit in the range '0' to '9', 'a' to 'f', and 'A' to 'F', inclusive. This Unicode escape sequence represents a Unicode code point from the Basic Multilingual Plane (i.e., Unicode code points in the range U+0000 through U+FFFF).
  • \U00hhhhhh, where each h represents a hexadecimal digit in the range '0' to '9', 'a' to 'f', and 'A' to 'F', inclusive. This Unicode escape sequence can represent any legal Unicode code point, including code points from the Supplementary Planes, specifically, Unicode code points in the range U+10000 through U+10FFFF (e.g., an emoji symbol).
Note:
  1. Code points U+D800 through U+DFFF are reserved for the high and low halves of surrogate pairs used by UTF-16. There is no legal encoding of these Unicode code points in UTF-8 and hence \uD800 through \uDFFF and \U0000D800 through \U0000DFFF cannot be specified as Unicode escape sequences in UTF-8 literals.
  2. To avoid having a string of characters of the form \uhhhh or \U00hhhhhh in a UTF-8 literal be interpreted as a Unicode escape sequence, the escape character ‘\’ can itself be escaped with ‘\’ to cause it to be interpreted literally. Thus, the sequence \\u00E9 will not be treated as a Unicode escape sequence.

Wherever a Unicode escape sequence appears in a basic UTF-8 literal, it is replaced by the compiler with the corresponding UTF-8 encoding of the Unicode code point, which makes it convenient to represent general Unicode code points in the literal using only EBCDIC characters. For example, u'caf\u00E9' represents the string 'café'.
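The substitution described above can be illustrated with a short Python sketch. This is a model of the stated rules under assumptions, not the compiler's algorithm: `expand_escapes` is a hypothetical name, the input is taken to be character-data already decoded to a Python string, and the sketch assumes that an escaped backslash (`\\`) contributes a single backslash to the result, which the text above does not state explicitly. Malformed sequences (for example, `\u` followed by fewer than four hexadecimal digits) are simply left unchanged here.

```python
import re

# Matches an escaped backslash, \uhhhh, or \U00hhhhhh, scanning left to right.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U00([0-9A-Fa-f]{6})")

def expand_escapes(character_data: str) -> bytes:
    """Replace Unicode escape sequences and return the UTF-8 bytes."""
    def substitute(match):
        if match.group(0) == "\\\\":
            # Assumption: an escaped backslash yields one literal backslash.
            return "\\"
        code_point = int(match.group(1) or match.group(2), 16)
        if 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("surrogate code points have no UTF-8 encoding")
        if code_point > 0x10FFFF:
            raise ValueError("code point is beyond U+10FFFF")
        return chr(code_point)

    return _ESCAPE.sub(substitute, character_data).encode("utf-8")
```

With this model, `expand_escapes(r"caf\u00E9")` produces the five bytes `63 61 66 C3 A9`, and `expand_escapes(r"\\u00E9")` leaves the text `u00E9` unexpanded.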

Maximum length
The maximum number of UTF-8 characters that can be represented in a basic UTF-8 literal varies depending on the size (1 to 4 bytes) of each UTF-8 character being represented. However, a maximum of 160 bytes is allowed after conversion of the literal characters from the EBCDIC code page to UTF-8; content beyond that limit is truncated. Truncation is performed on a character boundary.
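To make the byte limit concrete, here is a small Python sketch of truncation on a character boundary. It is an illustration only: `truncate_utf8` and `MAX_LITERAL_BYTES` are hypothetical names, and the compiler's diagnostics and any other handling of over-long literals are not modeled.

```python
MAX_LITERAL_BYTES = 160  # limit stated in the text above

def truncate_utf8(text: str, limit: int = MAX_LITERAL_BYTES) -> bytes:
    """Encode text as UTF-8 and cut it to at most `limit` bytes
    without splitting a multi-byte character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return encoded
    cut = limit
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    # dropped byte is one, the cut falls inside a character, so back up to
    # that character's lead byte.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]
```

Under this model, 200 single-byte characters are cut to 160, while a string of two-byte characters that starts at an odd byte offset is cut to 159 bytes so that no character is split.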

If the source content of the literal contains one or more DBCS characters, the maximum length is limited by the available space in Area B of a single source line.

The literal must contain at least one character. Each single-byte character in the literal counts as one character position and each DBCS character in the literal counts as one character position. Shift-in and shift-out delimiters for DBCS characters are not counted.

Continuation rules
When the content of the literal includes DBCS characters, the literal cannot be continued.

When the content of the literal does not include DBCS characters, normal continuation rules apply.

The source text representation of character-data is automatically converted to UTF-8 for use at run time. For example, when the literal is moved to or compared with a data item of category UTF-8, it is automatically converted to UTF-8.