Using UTF-8 literals

Two types of UTF-8 literals are supported:
  1. Basic UTF-8 literal:
    • u'character-data'
      • character-data is converted from EBCDIC to UTF-8.
      • character-data may contain double-byte EBCDIC characters, but those characters must be delimited by shift-out and shift-in characters.
      • The maximum number of Unicode code points that a basic UTF-8 literal can hold depends on the encoded size of each character, because a UTF-8 character occupies from 1 to 4 bytes. The literal can occupy at most 160 bytes after conversion to UTF-8; truncation occurs beyond that length.
      • character-data can contain the following Unicode escape sequences:
        • \uhhhh, where each h represents a hexadecimal digit in the range ‘0’ to ‘9’, ‘a’ to ‘f’, and ‘A’ to ‘F’. This Unicode escape sequence represents a Unicode code point from the Basic Multilingual Plane (BMP) (that is, Unicode code points in the range U+0000 through U+FFFF).
        • \U00hhhhhh, where each h represents a hexadecimal digit in the range ‘0’ to ‘9’, ‘a’ to ‘f’, and ‘A’ to ‘F’. This Unicode escape sequence can represent any legal Unicode code point, including code points from the Supplementary Planes (that is, Unicode code points in the range U+10000 through U+10FFFF), such as an emoji symbol.
        Notes:
        • Code points U+D800 through U+DFFF are reserved for the high and low halves of surrogate pairs used by UTF-16. There is no legal encoding of these Unicode code points in UTF-8 and hence \uD800 through \uDFFF and \U0000D800 through \U0000DFFF cannot be specified as Unicode escape sequences in UTF-8 literals.
        • To prevent a string of characters of the form \uhhhh or \U00hhhhhh in a UTF-8 literal from being interpreted as a Unicode escape sequence, escape the escape character ‘\’ with another ‘\’ so that it is interpreted literally. Thus, the sequence \\u00E9 is not treated as a Unicode escape sequence.
      • Wherever a Unicode escape sequence appears in a basic UTF-8 literal, the compiler replaces it with the UTF-8 encoding of the corresponding Unicode code point. This makes it convenient to represent general Unicode code points in the literal using only EBCDIC characters. For example, u'caf\u00E9' represents the string ‘café’.
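
The escape-sequence rules above can be illustrated with a minimal Python sketch. It is not the compiler's implementation: the function name basic_literal_bytes is hypothetical, the input is an ordinary Python string (so the EBCDIC conversion and shift-out/shift-in handling are not modeled), the sketch assumes that \\ collapses to a single backslash, and it rejects an over-length literal instead of truncating it.

```python
import re

MAX_BYTES = 160  # byte limit after UTF-8 conversion, as stated above

# Matches an escaped backslash, \uhhhh, or \U00hhhhhh.
_ESCAPE = re.compile(r"\\\\|\\u([0-9A-Fa-f]{4})|\\U00([0-9A-Fa-f]{6})")


def _expand(match):
    """Return the character that one matched escape sequence stands for."""
    if match.group(0) == "\\\\":
        return "\\"  # assumption: an escaped backslash yields one backslash
    code_point = int(match.group(1) or match.group(2), 16)
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points have no UTF-8 encoding")
    if code_point > 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    return chr(code_point)


def basic_literal_bytes(character_data):
    """Hypothetical helper: UTF-8 bytes for the text between u' and '."""
    encoded = _ESCAPE.sub(_expand, character_data).encode("utf-8")
    if len(encoded) > MAX_BYTES:
        # The sketch raises here; it does not model truncation.
        raise ValueError("literal exceeds 160 bytes after UTF-8 conversion")
    return encoded


print(basic_literal_bytes(r"caf\u00E9"))   # b'caf\xc3\xa9'
print(basic_literal_bytes(r"\\u00E9"))     # b'\\u00E9'
```

In this sketch, caf\u00E9 yields five bytes for four code points, which shows why the number of code points that fit within the 160-byte limit varies.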
  2. Hexadecimal UTF-8 literal:
    • ux'hexadecimal-digits'
      • hexadecimal-digits are converted to a sequence of bytes in order to be used verbatim as the UTF-8 literal value.
      • A minimum of 2 and a maximum of 320 hexadecimal digits are allowed, which corresponds to 1 to 160 bytes.
        Note: The sequence of bytes represented by hexadecimal-digits is validated to ensure that it contains a legal sequence of UTF-8 bytes.
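
The digit-count limits and the UTF-8 validation described for hexadecimal literals can be sketched in Python as follows. This is an illustration under stated assumptions, not the compiler's logic: the function name hex_literal_bytes is hypothetical, the requirement for an even number of digits is an assumption (two digits per byte), and Python's strict UTF-8 decoder stands in for the compiler's validation.

```python
import string

MIN_DIGITS = 2    # minimum stated above
MAX_DIGITS = 320  # maximum stated above (160 bytes)


def hex_literal_bytes(hexadecimal_digits):
    """Hypothetical helper: bytes for the text between ux' and '."""
    count = len(hexadecimal_digits)
    if not MIN_DIGITS <= count <= MAX_DIGITS:
        raise ValueError("expected between 2 and 320 hexadecimal digits")
    if any(ch not in string.hexdigits for ch in hexadecimal_digits):
        raise ValueError("only hexadecimal digits are allowed")
    if count % 2 != 0:
        # Assumption: each byte is written as exactly two digits.
        raise ValueError("expected an even number of hexadecimal digits")
    raw = bytes.fromhex(hexadecimal_digits)
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass)
    # for truncated sequences, overlong forms, and encoded surrogates.
    raw.decode("utf-8")
    return raw


print(hex_literal_bytes("636166C3A9"))  # b'caf\xc3\xa9'
```

For example, the digits 636166C3A9 pass validation and give the same bytes as the text café, whereas C3 on its own is rejected because it is a lead byte with no continuation byte.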