Character strings

A character string is a sequence of code units. The length of the string is the number of code units in the sequence. If the length is zero, the value is called the empty string, which should not be confused with the null value.

Fixed-length character string (CHAR)

All values in a fixed-length string column have the same length, which is determined by the length attribute of the column. The length attribute must be in the range 1 - 255, inclusive, unless the string unit is CODEUNITS32, which has a range of 1 - 63, inclusive.

Varying-length character strings

There are two types of varying-length character strings:

VARCHAR: A VARCHAR value can be up to 32,672 bytes long. If the string unit is CODEUNITS32, the length can be up to 8,168 string units.
CLOB: A character large object (CLOB) value can be up to 2 gigabytes minus 1 byte (2,147,483,647 bytes) long or, if the string unit is CODEUNITS32, up to 536,870,911 string units. A CLOB is used to store large SBCS or mixed (SBCS and MBCS) character-based data (such as documents written with a single character set) and, therefore, has an SBCS or mixed code page that is associated with it.

Special restrictions apply to expressions that result in a CLOB data type, and to structured type columns; such expressions and columns are not permitted in:

A SELECT list that is preceded by the DISTINCT clause
A GROUP BY clause
An ORDER BY clause
A subselect of a set operator other than UNION ALL
A basic, quantified, BETWEEN, or IN predicate
An aggregate function
VARGRAPHIC, TRANSLATE, and datetime scalar functions
The pattern operand in a LIKE predicate, or the search string operand in a POSSTR function
The string representation of a datetime value.

The functions in the SYSFUN schema taking a VARCHAR as an argument will not accept VARCHARs greater than 4,000 bytes long as an argument. However, many of these functions also have an alternative signature accepting a CLOB(1M). For these functions, the user can explicitly cast the greater than 4,000 VARCHAR strings into CLOBs and then recast the result back into VARCHARs of the required length.

NUL-terminated character strings that are found in C are handled differently, depending on the standards level of the precompile option.

Each character string is further defined as one of:

Bit data: Data that is not associated with a code page.
Single-byte character set (SBCS) data: Data in which every character is represented by a single byte.
Mixed data: Data that might contain a mixture of characters from a single-byte character set and a multi-byte character set (MBCS).
Unicode data: Data that contains characters that are represented by one or more bytes. Each Unicode character string is encoded by using UTF-8. The CCSID for UTF-8 is 1208.

Note: The LONG VARCHAR data type continues to be supported but is deprecated, not recommended, and might be removed in a future release.

String units specification for character strings

The unit of length for the character string data type is OCTETS or CODEUNITS32. The unit of length defines the counting method that is used to determine the length of the data.

OCTETS: Indicates that the units for the length attribute are bytes. This unit of length applies to all non-Unicode character string data types. For a Unicode character string data type, OCTETS can be explicitly specified or determined based on an environment setting.
CODEUNITS32: Indicates that the units for the length attribute are Unicode UTF-32 code units which approximate counting in characters. This unit of length does not affect the underlying code page of the data type. The actual length of a data value is determined by counting the UTF-32 code units as if the data was converted to UTF-32. A string unit of CODEUNITS32 can be used only for a Unicode character string data type. CODEUNITS32 can be explicitly specified or determined based on an environment setting.

For a non-Unicode character string data type, the string unit is always OCTETS and cannot be changed. For a Unicode character string data type, the string units can be explicitly specified with the length attribute of a character string data type, or it can default based on an environment setting. If FOR BIT DATA is also specified for the character string data type, CODEUNITS32 cannot be specified and an environment setting of CODEUNITS32 does not apply.

The environment setting for string units is based on the value for the NLS_STRING_UNITS global variable, or the string_units database configuration parameter. The database configuration parameter can be set to either SYSTEM or CODEUNITS32. The global variable can also be set to either SYSTEM or CODEUNITS32, but also can be set to NULL. The NULL value indicates that the SQL session should use the string_units database configuration parameter setting. If the value for the environment setting is SYSTEM, then OCTETS is used as the default string units setting.

String units in built-in functions

The ability to specify string units for certain built-in string functions allows you to process string data in a more "character-based manner" than a "byte-based manner". The string unit determines the unit that is used for length or position when you execute the function. You can specify CODEUNITS16, CODEUNITS32, or OCTETS as the string unit for some string functions. When no string unit is specified, the default string unit is usually determined by the string units of the source string argument (refer to the description of the function that you are using for details). The string units argument can be specified for string functions that support the parameter in Unicode or non-Unicode databases.

CODEUNITS16: Specifies that Unicode UTF-16 is the unit for the operation. CODEUNITS16 is useful when an application is processing data in code units that are 2 bytes in width. Note some characters, which are known as supplementary characters, require two UTF-16 code units to be encoded. For example, the musical symbol G clef requires two UTF-16 code units (X'D834' and X'DD1E' in UTF-16BE).
CODEUNITS32: Specifies that Unicode UTF-32 is the unit for the operation. CODEUNITS32 is useful for applications that process data in a simple, fixed-length format, and that must return the same answer regardless of the storage format of the data (ASCII, UTF-8, or UTF-16).; Note: The storage on disk uses UTF-8 encoding.

OCTETS: Specifies that bytes are the units for the operation. OCTETS is often used when an application is interested in allocating buffer space or when operations need to use simple byte processing.

The calculated length of a string computed using OCTETS (bytes) might differ from that computed using CODEUNITS16 or CODEUNITS32. When you are using OCTETS, the length of the string is determined by simply counting the number of bytes in the string. When you are using CODEUNITS16 or CODEUNITS32, the length of the string is determined by counting the number of 16-bit or 32-bit code units necessary to represent the string in UTF-16 or UTF-32. A length that is determined using CODEUNITS16 and CODEUNITS32 is identical unless the data contains supplementary characters (see Difference between CODEUNITS16 and CODEUNITS32).

For example, assume that NAME, a VARCHAR(128) column that is encoded in Unicode UTF-8, contains the value 'Jürgen'. The following two queries, which count the length of the string in CODEUNITS16 or CODEUNITS32, return the same value (6).

   SELECT CHARACTER_LENGTH(NAME,CODEUNITS16) FROM T1
     WHERE NAME = 'Jürgen'

   SELECT CHARACTER_LENGTH(NAME,CODEUNITS32) FROM T1
     WHERE NAME = 'Jürgen'

The next query, which counts the length of the string in OCTETS, returns the value 7.

   SELECT CHARACTER_LENGTH(NAME,OCTETS) FROM T1
     WHERE NAME = 'Jürgen'

These values represent the length of the string that is expressed in the specified string unit.

The following table shows the UTF-8, UTF-16BE (big-endian), and UTF-32BE (big-endian) representations of the name 'Jürgen':

Format    Representation of the name 'Jürgen'
--------  --------------------------------------
UTF-8	  X'4AC3BC7267656E'
UTF-16BE 	X'004A00FC007200670065006E'
UTF-32BE 	X'0000004A000000FC0000007200000067000000650000006E'

The representation of the character 'ü' differs among the three string units:

The UTF-8 representation of the character 'ü' is X'C3BC'.
The UTF-16BE representation of the character 'ü' is X'00FC'.
The UTF-32BE representation of the character 'ü' is X'000000FC'.

Specifying string units for a built-in function does not affect the data type, the string units, or the code page of the result of the function. If necessary, the data is converted to Unicode for evaluation when CODEUNITS16 or CODEUNITS32 is specified.

When OCTETS is specified for the LOCATE or POSITION function, and the code pages of the string arguments differ, the data is converted to the code page of the source-string argument. In this case, the result of the function is in the code page of the source-string argument. When OCTETS is specified for functions that take a single string argument, the data is evaluated in the code page of the string argument, and the result of the function is in the code page of the string argument.

Difference between CODEUNITS16 and CODEUNITS32

When CODEUNITS16 or CODEUNITS32 is specified, the result is the same except when the data contains Unicode supplementary characters. This is because a supplementary character is represented by two UTF-16 code units or one UTF-32 code unit. In UTF-8, a non-supplementary character is represented by 1 to 3 bytes, and a supplementary character is represented by 4 bytes. In UTF-16, a non-supplementary character is represented by one CODEUNITS16 code unit or 2 bytes, and a supplementary character is represented by two CODEUNITS16 code units or 4 bytes. In UTF-32, a character is represented by one CODEUNITS32 code unit or 4 bytes.

For example, the following table shows the hexadecimal values for the mathematical bold capital A and the Latin capital letter A. The mathematical bold capital A is a supplementary character that is represented by 4 bytes in UTF-8, UTF-16, and UTF-32.

Character	UTF-8 representation	UTF-16BE representation	UTF-32BE representation
Unicode value X'1D400' - 'A'; mathematical bold capital A	X'F09D9080'	X'D835DC00'	X'0001D400'
Unicode value X'0041' - 'A'; latin capital letter A	X'41'	X'0041'	X'00000041'

Assume that C1 is a VARCHAR(128) column, encoded in Unicode UTF-8, and that table T1 contains one row with the value of the mathematical bold capital A (X'F09D9080'). The following queries return different results:

Query                                                Returns
-----                                                -------
SELECT CHARACTER_LENGTH(C1,CODEUNITS16) FROM T1      2

SELECT CHARACTER_LENGTH(C1,CODEUNITS32) FROM T1      1

SELECT CHARACTER_LENGTH(C1,OCTETS) FROM T1           4