UVALID returns a FIXED BIN(31) value that is zero if a
string contains valid UTF data; otherwise, the value is the index of
the first invalid element in the string.
>>-UVALID(x)---------------------------------------------------><
- x: Expression that must have CHARACTER, WIDECHAR, or
WIDEPIC type.
If x has CHARACTER type, then UVALID(x)
will return 0 if the string contains valid UTF-8 data, and otherwise
it will return the index of the byte where the first invalid UTF-8
data starts.
If x has WIDECHAR or
WIDEPIC type, then UVALID(x) will return 0 if the string contains
valid UTF-16 data, and otherwise it will return the index of the widechar
where the first invalid UTF-16 data starts.
Note that UVALID indicates only whether the string is well-formed UTF
data (according to the rules below). It does not indicate whether the
encoded values have actually been assigned to represent any particular character.
For UTF-8 data, the validity of a byte varies as follows according
to its range:
- '00'x - '7f'x, it is valid
- '80'x - 'c1'x, it is invalid
- 'c2'x - 'df'x, it is valid if followed by a second byte and if
that byte is in the range '80'x to 'bf'x
- 'e0'x - 'ef'x, it is valid if followed by 2 more bytes and if
  those bytes satisfy the following:
  - when the first byte is 'e0'x, the second and third bytes must
    be in the ranges 'a0'x to 'bf'x and '80'x to 'bf'x, respectively
  - when the first byte is in the range 'e1'x to 'ec'x, the second
    and third bytes must both be in the range '80'x to 'bf'x
  - when the first byte is 'ed'x, the second and third bytes must
    be in the ranges '80'x to '9f'x and '80'x to 'bf'x, respectively
  - when the first byte is in the range 'ee'x to 'ef'x, the second
    and third bytes must both be in the range '80'x to 'bf'x
- 'f0'x - 'f4'x, it is valid if followed by 3 more bytes and if
  those bytes satisfy the following:
  - when the first byte is 'f0'x, the second, third and fourth bytes
    must be in the ranges '90'x to 'bf'x, '80'x to 'bf'x and '80'x to 'bf'x,
    respectively
  - when the first byte is in the range 'f1'x to 'f3'x, the second,
    third and fourth bytes must all be in the range '80'x to 'bf'x
  - when the first byte is 'f4'x, the second, third and fourth bytes
    must be in the ranges '80'x to '8f'x, '80'x to 'bf'x and '80'x to
    'bf'x, respectively
- 'f5'x - 'ff'x, it is invalid
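The UTF-8 rules above can be modelled in a few lines. The following is a minimal sketch in Python, not PL/I, and not the compiler's implementation; the function name `first_invalid_utf8` is hypothetical. It assumes the returned index is 1-based and that, for a bad or missing trailing byte, the reported position is that of the leading byte of the sequence ("where the first invalid UTF-8 data starts").

```python
def first_invalid_utf8(data: bytes) -> int:
    """Model of the UTF-8 rules: return 0 if data is well-formed,
    otherwise the 1-based index (assumed) of the byte that starts
    the first invalid sequence."""

    def second_byte_range(lead: int) -> tuple:
        # Leading bytes with a restricted second byte, per the list above.
        if lead == 0xE0:
            return (0xA0, 0xBF)
        if lead == 0xED:
            return (0x80, 0x9F)
        if lead == 0xF0:
            return (0x90, 0xBF)
        if lead == 0xF4:
            return (0x80, 0x8F)
        return (0x80, 0xBF)

    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                  # '00'x - '7f'x: valid on its own
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:          # needs 1 more byte
            extra = 1
        elif 0xE0 <= lead <= 0xEF:        # needs 2 more bytes
            extra = 2
        elif 0xF0 <= lead <= 0xF4:        # needs 3 more bytes
            extra = 3
        else:                             # '80'x - 'c1'x and 'f5'x - 'ff'x
            return i + 1
        if i + extra >= n:                # sequence is truncated
            return i + 1
        lo, hi = second_byte_range(lead)
        if not lo <= data[i + 1] <= hi:
            return i + 1
        for k in range(2, extra + 1):     # remaining bytes: '80'x - 'bf'x
            if not 0x80 <= data[i + k] <= 0xBF:
                return i + 1
        i += extra + 1
    return 0


print(first_invalid_utf8(b"abc"))             # well-formed ASCII
print(first_invalid_utf8(b"A\xc0\x80"))       # 'c0'x can never start a sequence
print(first_invalid_utf8(b"AB\xed\xa0\x80"))  # 'ed'x followed by 'a0'x
```

Under these assumptions the three calls print 0, 2 and 3. The restricted second-byte ranges are what reject overlong encodings, encoded surrogates and values above U+10FFFF.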
For UTF-16 data, the validity of a widechar varies as follows according
to its range:
- '0000'wx - '007f'wx, it is valid and would occupy 1 byte in UTF-8
- '0080'wx - '07ff'wx, it is valid and would occupy 2 bytes in UTF-8
- '0800'wx - 'd7ff'wx, it is valid and would occupy 3 bytes in UTF-8
- 'd800'wx - 'dbff'wx, it is valid if followed by a second widechar
with a value greater than or equal to 'dc00'wx and
less than or equal to 'dfff'wx. The two widechars together form a Unicode
surrogate pair, which would occupy 4 bytes in UTF-8
- 'dc00'wx - 'dfff'wx, it is valid only when it is
the second half of a surrogate pair
- 'e000'wx - 'ffff'wx, it is valid and would occupy 3 bytes in UTF-8
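The UTF-16 rules can be modelled the same way. This is again a hedged sketch in Python rather than PL/I: the names `first_invalid_utf16` and `utf8_length` are hypothetical, the input is taken to be a list of 16-bit widechar values, and the returned index is assumed to be 1-based.

```python
def first_invalid_utf16(units) -> int:
    """Model of the UTF-16 rules: return 0 if the widechars are
    well-formed, otherwise the 1-based index (assumed) of the first
    invalid widechar."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # First half of a surrogate pair: needs 'dc00'wx - 'dfff'wx next.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i + 1
        if 0xDC00 <= u <= 0xDFFF:
            # Second half of a pair with no first half before it.
            return i + 1
        i += 1
    return 0


def utf8_length(units) -> int:
    """Number of bytes the same text would occupy in UTF-8.
    Assumes the widechars have already been checked as valid."""
    total, i = 0, 0
    while i < len(units):
        u = units[i]
        if u <= 0x007F:
            total += 1
        elif u <= 0x07FF:
            total += 2
        elif 0xD800 <= u <= 0xDBFF:
            total += 4        # surrogate pair: count both widechars once
            i += 1
        else:
            total += 3
        i += 1
    return total


pair = [0x0061, 0xD83D, 0xDE00]        # 'a' followed by a surrogate pair
print(first_invalid_utf16(pair))       # pair is complete
print(first_invalid_utf16(pair[:2]))   # first half with nothing after it
print(utf8_length(pair))               # 1 byte + 4 bytes
```

Under these assumptions the calls print 0, 2 and 5.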