UVALID

UVALID returns a FIXED BINARY(31) value: zero if the string contains valid UTF data; otherwise, the index of the first invalid element.

x: Expression which must have CHARACTER, UCHAR, WIDECHAR or WIDEPIC type.

If x has CHARACTER type, then UVALID(x) will return 0 if the string contains valid UTF-8 data. Otherwise, it will return the index of the BYTE where the first invalid UTF-8 data starts.

If x has UCHAR type, then UVALID(x) will return 0 if the string contains valid UTF-8 data. Otherwise, it will return the index of the UCHAR where the first invalid UTF-8 data starts.

If x has WIDECHAR or WIDEPIC type, then UVALID(x) will return 0 if the string contains valid UTF-16 data. Otherwise, it will return the index of the WIDECHAR where the first invalid UTF-16 data starts.

Note that UVALID indicates only whether the string contains valid UTF data according to the rules below. It does not indicate whether these bytes have actually been allocated to represent any particular character.

For UTF-8 data, the validity of a byte varies as follows according to its range:

'00'x - '7f'x, it is valid
'80'x - 'c1'x, it is invalid
'c2'x - 'df'x, it is valid if followed by a second byte and if that byte is in the range '80'x to 'bf'x
'e0'x - 'ef'x, it is valid if followed by 2 more bytes and if
- when the first byte is 'e0'x, the second and third bytes must be in the ranges 'a0'x to 'bf'x and '80'x to 'bf'x, respectively.
- when the first byte is in the range 'e1'x to 'ec'x, the second and third bytes must both be in the range '80'x to 'bf'x
- when the first byte is 'ed'x, the second and third bytes must be in the ranges '80'x to '9f'x and '80'x to 'bf'x, respectively.
- when the first byte is in the range 'ee'x to 'ef'x, the second and third bytes must both be in the range '80'x to 'bf'x
'f0'x - 'f4'x, it is valid if followed by 3 more bytes and if
- when the first byte is 'f0'x, the second, third and fourth bytes must be in the ranges '90'x to 'bf'x, '80'x to 'bf'x and '80'x to 'bf'x, respectively.
- when the first byte is in the range 'f1'x to 'f3'x, the second, third and fourth bytes must be in the range '80'x to 'bf'x
- when the first byte is 'f4'x, the second, third and fourth bytes must be in the ranges '80'x to '8f'x, '80'x to 'bf'x and '80'x to 'bf'x, respectively.
'f5'x - 'ff'x, it is invalid
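
The UTF-8 rules above can be sketched outside PL/I as a small table-driven check. The following Python is an illustration only, not the UVALID implementation: the names second_byte_range and first_invalid_utf8 are hypothetical, and the sketch assumes that the reported index is the 1-based position of the lead byte of the first invalid sequence, and that a sequence cut short by the end of the string is invalid.

```python
def second_byte_range(lead):
    """Return (low, high) allowed for the byte that follows `lead`."""
    if lead == 0xE0:
        return (0xA0, 0xBF)
    if lead == 0xED:
        return (0x80, 0x9F)
    if lead == 0xF0:
        return (0x90, 0xBF)
    if lead == 0xF4:
        return (0x80, 0x8F)
    return (0x80, 0xBF)


def first_invalid_utf8(data):
    """Return 0 if `data` (a bytes object) is valid UTF-8 under the table,
    else the 1-based index of the lead byte of the first invalid sequence
    (an assumption about how the index is reported)."""
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:
            length = 1
        elif 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return i + 1  # '80'x-'c1'x or 'f5'x-'ff'x as a first byte
        if i + length > n:
            return i + 1  # not enough bytes left to complete the sequence
        if length > 1:
            low, high = second_byte_range(lead)
            if not low <= data[i + 1] <= high:
                return i + 1
            # Any third and fourth bytes are always '80'x to 'bf'x.
            for k in range(i + 2, i + length):
                if not 0x80 <= data[k] <= 0xBF:
                    return i + 1
        i += length
    return 0
```

For example, first_invalid_utf8(b"ab\xff") reports position 3, while a string of 7-bit bytes such as b"abc" reports 0.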

For UTF-16 data, the validity of a widechar varies as follows according to its range:

'0000'wx - '007f'wx, it is valid and would be 1 byte if UTF-8
'0080'wx - '07ff'wx, it is valid and would be 2 bytes if UTF-8
'0800'wx - 'd7ff'wx, it is valid and would be 3 bytes if UTF-8
'd800'wx - 'dbff'wx, it is valid if followed by a second widechar with a value greater than or equal to 'dc00'wx and less than or equal to 'dfff'wx. Together the two widechars form a Unicode surrogate pair, which would be 4 bytes if UTF-8
'dc00'wx - 'dfff'wx, it is valid only when it is the second half of a surrogate pair
'e000'wx - 'ffff'wx, it is valid and would be 3 bytes if UTF-8
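
The UTF-16 rules above reduce to a check on surrogate pairing. The following Python is an illustration only, not the UVALID implementation: the name first_invalid_utf16 is hypothetical, the input is assumed to be a sequence of 16-bit values (one per widechar), and the sketch assumes that an unpaired first-half surrogate is reported at its own 1-based position.

```python
def first_invalid_utf16(units):
    """Return 0 if `units` (a sequence of 16-bit integers) is valid UTF-16
    under the table, else the 1-based index of the first invalid widechar."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # First half of a surrogate pair: the next widechar must be
            # in the range 'dc00'wx to 'dfff'wx.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i + 1
        if 0xDC00 <= u <= 0xDFFF:
            # Second half of a pair with no first half before it.
            return i + 1
        i += 1  # every other widechar is valid on its own
    return 0
```

For example, the pair [0xD83D, 0xDE00] is valid and yields 0, while [0x0041, 0xD83D] yields 2 because the first-half surrogate has nothing following it.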