Reference modifiers and UTF-8 data items

As with national data items, you can apply reference modifiers to UTF-8 data items. The start position and length of a UTF-8 reference modifier are specified in characters (Unicode code points), not bytes, so you do not need to account for the varying byte width of UTF-8 characters when you generate UTF-8 substrings. However, because UTF-8 characters vary in byte width, some special considerations apply:

  • The starting byte position of a reference modified UTF-8 data item cannot be determined at compile time and must be determined from its starting character position at run time, even when the starting character position is specified as a literal.
    Example 1
    01 u1 pic u(10) value u'\u00e9cran'. *> écran
    :
    display u1(3:2)
    

    In this example, u1(3:2) starts at character position 3 ('r'), which begins at the 4th byte of the data item. This is because the two preceding characters occupy 3 bytes: 'é' has a 2-byte encoding in UTF-8 and 'c' has a 1-byte encoding in UTF-8.

    Example 2
    01 u1 pic u(10) value u'ecran'.
    :
    display u1(3:2)
    

    In this example, u1(3:2) starts at character position 3 ('r'), which begins at the 3rd byte of the data item. This is because the two preceding characters, 'e' and 'c', each have a 1-byte encoding in UTF-8.

  • The actual byte length of a UTF-8 reference modification cannot be determined at compile time and must be determined from its character length at run time, even when the character length of the reference modifier is specified as a literal.
    Example 3
    01 u1 pic u(10) value u'caf\u00e9'.  *> café
    :
    display u1(3:2)
    

    In this example, u1(3:2) is 2 UTF-8 characters in length but 3 bytes in length because 'f' has a 1-byte encoding in UTF-8 and 'é' has a 2-byte encoding in UTF-8.

    Example 4
    01 u1 pic u(10) value u'cafe'.
    :
    display u1(3:2)
    

    In this example, u1(3:2) is 2 UTF-8 characters in length and also 2 bytes in length because both 'f' and 'e' have a 1-byte encoding in UTF-8.

  • When a reference-modified UTF-8 data item is the receiver in a MOVE statement, the number of bytes that make up the substring identified by the reference modification might differ before and after the move. For example, a 4-character substring might occupy 12 bytes before the move but as few as 4 bytes or as many as 16 bytes after the move, depending on the UTF-8 sender. When the byte count changes, the data that follows the substring is automatically shifted left if the new substring has fewer bytes than the original, and shifted right if it has more bytes. In the latter case, some characters in the remaining portion of the underlying data item might be truncated. However, truncation can occur only if the underlying UTF-8 data item is defined with the BYTE-LENGTH phrase of the PICTURE clause. A fixed character-length UTF-8 item always has enough space to accommodate the right shift without truncation.
    Example 5
    01 u1 pic u(13) value u'Ol\u00e9, Ol\u00e9, Ol\u00e9'. *> Olé, Olé, Olé
    :
    move 'abcdef' to u1(3:6)
    

    In this example, u1(3:6) represents 'é, Olé', which is 8 bytes long. After the move, that 8-byte substring is replaced with 'abcdef', which is only 6 bytes long in UTF-8. As a result, the remaining characters ', Olé', which immediately follow the reference-modified portion of the underlying data item, are shifted left in memory by 2 bytes.

    Example 6
    01 u1 pic u(13) value u'Ol\u00e9, Ol\u00e9, Ol\u00e9'. *> Olé, Olé, Olé
    :
    move u'\u00e9\u00e9\u00e9\u00e9\u00e9\u00e9' to u1(3:6)
    

    In this example, u1(3:6) represents 'é, Olé', which is 8 bytes long. After the move, that 8-byte substring is replaced with 'éééééé', which is 12 bytes long in UTF-8. As a result, the remaining characters ', Olé', which immediately follow the reference-modified portion of the underlying data item, are shifted right in memory by 4 bytes.
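
The byte arithmetic in Examples 1 through 6 can be checked outside COBOL. The following Python sketch is only a model of the accounting described in the considerations above. It is not the compiler's or the run time's implementation. The function names byte_span and move_shift are hypothetical. The model assumes that the sender has exactly as many characters as the reference modifier specifies, and it ignores both the trailing-space padding of the data item and any truncation for items defined with the BYTE-LENGTH phrase.

```python
def byte_span(text: str, start: int, length: int) -> tuple:
    """Return (1-based starting byte, byte length) of text(start:length)."""
    begin = start - 1  # COBOL positions are 1-based; Python indexes are 0-based
    # Bytes used by every character that precedes the starting character.
    prefix_bytes = len(text[:begin].encode("utf-8"))
    # Bytes used by the selected characters themselves.
    span_bytes = len(text[begin:begin + length].encode("utf-8"))
    return prefix_bytes + 1, span_bytes


def move_shift(text: str, start: int, length: int, sender: str) -> tuple:
    """Model MOVE sender TO text(start:length).

    Returns the new character content and the byte shift applied to the
    data that follows the substring (negative = left, positive = right).
    """
    if len(sender) != length:
        # Assumption of this sketch: no padding or truncation of the sender.
        raise ValueError("sender must have exactly `length` characters")
    begin = start - 1
    old = text[begin:begin + length]
    new_text = text[:begin] + sender + text[begin + length:]
    shift = len(sender.encode("utf-8")) - len(old.encode("utf-8"))
    return new_text, shift


print(byte_span("\u00e9cran", 3, 2))   # Example 1: (4, 2)
print(byte_span("ecran", 3, 2))        # Example 2: (3, 2)
print(byte_span("caf\u00e9", 3, 2))    # Example 3: (3, 3)
print(byte_span("cafe", 3, 2))         # Example 4: (3, 2)

ole = "Ol\u00e9, Ol\u00e9, Ol\u00e9"
print(move_shift(ole, 3, 6, "abcdef"))      # Example 5: ('Olabcdef, Olé', -2)
print(move_shift(ole, 3, 6, "\u00e9" * 6))  # Example 6: ('Oléééééé, Olé', 4)
```

Python strings are indexed by Unicode code point, which matches the character-based positions of UTF-8 reference modifiers. Both functions must encode the actual content before they can report a byte count, which mirrors why the byte position and byte length can be determined only at run time.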
