Character strings

A character string is a sequence of bytes. The length of the string is the number of bytes in the sequence. If the length is zero, the value is called the empty string. The empty string should not be confused with the null value.

Default CCSIDs for character strings

The following table shows how the value of the field MIXED DATA (on installation panel DSNTIPF) determines the default CCSIDs for a character string.
Table 1. Default CCSIDs for character strings
Encoding scheme | Value of MIXED DATA field | Default attribute
ASCII or EBCDIC | NO | Character: SBCS. The value of the ASCII CCSID or EBCDIC CCSID field on installation panel DSNTIPF determines the system CCSID for SBCS data.
ASCII or EBCDIC | YES | Character: MIXED. The value of the ASCII CCSID or EBCDIC CCSID field on installation panel DSNTIPF determines the system CCSIDs for SBCS, MIXED, and graphic data.
Unicode | Not applicable | Character: MIXED. The CCSIDs are 367 for SBCS data, 1208 for MIXED data, and 1200 for graphic data.

The MIXED DATA field does not apply to Unicode columns in EBCDIC tables. Those columns follow the same rules that are shown for the Unicode encoding scheme in the previous table. For more information, see Unicode columns in EBCDIC tables.
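
The defaults in Table 1 can be restated as a small lookup. The following Python sketch is only an illustration of the table; the function and constant names are hypothetical and are not Db2 identifiers.

```python
# Hypothetical names for illustration only; none of these are Db2 identifiers.
# CCSIDs used for the Unicode encoding scheme, as listed in Table 1.
UNICODE_CCSIDS = {"SBCS": 367, "MIXED": 1208, "GRAPHIC": 1200}


def default_character_attribute(encoding_scheme: str, mixed_data: bool = False) -> str:
    """Return the default character attribute (SBCS or MIXED) per Table 1."""
    scheme = encoding_scheme.upper()
    if scheme == "UNICODE":
        # The MIXED DATA field is not applicable to Unicode.
        return "MIXED"
    if scheme in ("ASCII", "EBCDIC"):
        return "MIXED" if mixed_data else "SBCS"
    raise ValueError(f"unknown encoding scheme: {encoding_scheme!r}")
```

For example, `default_character_attribute("EBCDIC", mixed_data=True)` returns `"MIXED"`, and the result for `"Unicode"` does not depend on the second argument.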

Fixed-length character strings

When fixed-length character string distinct types, columns, and variables are defined, the length attribute is specified, and all values have the same length. For a fixed-length character string, the length attribute must be in the range 1–255 inclusive.

Varying-length character strings

The types of varying-length character strings are VARCHAR and character large object (CLOB). A CLOB is a type of LOB. A CLOB column is useful for storing large amounts of character data, such as documents written with a single character set.

When varying-length character string distinct types, columns, and variables are defined, the maximum length is specified, and this length becomes the length attribute, except for C NUL-terminated strings. Actual values might have a smaller length. For varying-length character strings, the length specifies the number of bytes.

For a VARCHAR string, the length attribute must be in the range 1–32704 inclusive. For a VARCHAR column, the maximum for the length attribute is determined by the record size that is associated with the table, as described in Maximum record size in the description of the CREATE TABLE statement. For a CLOB string, the length attribute must be in the range 1–2147483647 inclusive. For more information about CLOBs, see Large objects (LOBs).
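
The length attribute ranges stated above for fixed-length strings, VARCHAR strings, and CLOB strings can be summarized in a short check. This is a minimal Python sketch with hypothetical names; it does not model the record-size limit that further restricts VARCHAR columns.

```python
# Hypothetical helper for illustration; ranges are the ones stated in the text.
# Lengths are in bytes. The per-table record-size limit on VARCHAR columns
# is NOT modeled here.
LENGTH_LIMITS = {
    "CHAR": (1, 255),            # fixed-length character strings
    "VARCHAR": (1, 32704),       # varying-length character strings
    "CLOB": (1, 2147483647),     # character large objects
}


def is_valid_length_attribute(data_type: str, length: int) -> bool:
    """Return True if length is within the stated range for data_type."""
    low, high = LENGTH_LIMITS[data_type.upper()]
    return low <= length <= high
```

For example, `is_valid_length_attribute("CHAR", 256)` returns `False`, while the same length is accepted for `"VARCHAR"`.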

Character string variables

  • Fixed-length character string variables can be used in all languages except REXX and Java™. In C, CHAR string variables are limited to a length of 1.
  • Varying-length character string variables can be used in all host languages with the following exceptions:
    • Fortran: varying-length non-LOB character strings cannot be used.
    • Assembler, C, and COBOL: varying-length non-LOB strings are simulated as described in the section for each language in Embedded SQL programming. In C, NUL-terminated strings can also be used.
    • REXX: CLOBs and DBCLOBs cannot be used.

Character string encoding schemes

Each character string is further defined as one of the following subtypes:
Bit data
Data that is not associated with a coded character set and, therefore, is never converted. The CCSID for bit data is X'FFFF' (65535). The bytes do not represent characters.

Bit data is a form of character data. The pad character is a blank for assignments to bit data; the pad character is X'00' for assignments to binary data. It is recommended that binary data be used instead of character data that is defined as bit data.

If both operands in a predicate are EBCDIC, both operands are padded with X'40'. Otherwise, both operands are padded with X'20'. For example, if both operands are ASCII, or if one operand is ASCII and the other operand is EBCDIC, both are padded with X'20'.

SBCS data
Data in which every character is represented by a single byte. Each SBCS string has an associated CCSID. If necessary, an SBCS string is converted before it is used in an operation with a character string that has a different CCSID.
Mixed data
Data that can contain a mixture of characters from a single-byte character set (SBCS) and a multiple-byte character set (MBCS). Each mixed string has an associated CCSID. If necessary, a mixed string is converted before an operation with a character string that has a different CCSID. If a mixed data string contains an MBCS character, it cannot be converted to SBCS data.

EBCDIC mixed data can contain shift characters, which are not MBCS data.

When the encoding scheme is Unicode or the Db2 installation is defined to support mixed data, Db2 recognizes MBCS sequences within mixed data strings when performing character-sensitive operations. These operations include parsing, character conversion, and the pattern matching specified by the LIKE predicate.

Character strings with a CLOB data type can only be SBCS or MIXED. BLOB should be used for binary strings.
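
The pad-character rule for predicates that is described under bit data (X'40' when both operands are EBCDIC, X'20' otherwise) can be sketched as follows. The function names are hypothetical, and padding the shorter operand to the length of the longer one is an assumption made for this illustration.

```python
# Hypothetical helpers for illustration; not Db2 functions.
EBCDIC_BLANK = 0x40
ASCII_BLANK = 0x20


def predicate_pad_byte(scheme_a: str, scheme_b: str) -> int:
    """X'40' only when both operands are EBCDIC; otherwise X'20'."""
    both_ebcdic = scheme_a.upper() == "EBCDIC" and scheme_b.upper() == "EBCDIC"
    return EBCDIC_BLANK if both_ebcdic else ASCII_BLANK


def pad_operands(a: bytes, scheme_a: str, b: bytes, scheme_b: str):
    """Pad both operands with the same byte.

    Assumption: the shorter operand is extended to the length of the longer.
    """
    pad = bytes([predicate_pad_byte(scheme_a, scheme_b)])
    width = max(len(a), len(b))
    return a.ljust(width, pad), b.ljust(width, pad)
```

For example, two EBCDIC operands X'C1' and X'C1C2' become X'C140' and X'C1C2', while an ASCII operand compared with an EBCDIC operand is padded with X'20'.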

The method of representing DBCS and MBCS characters within a mixed string differs among the encoding schemes.

  • ASCII reserves a set of code points for SBCS characters and another set as the first half of DBCS characters. When it encounters the first half of a DBCS character, the system reads the next byte in order to obtain the complete character.
  • EBCDIC makes use of two special code points:
    • A shift-out character (X'0E') to introduce a string of DBCS characters.
    • A shift-in character (X'0F') to end a string of DBCS characters.
    DBCS sequences within mixed data strings are recognized as the string is read from left to right. At any time, the reading of the string is in SBCS mode or DBCS mode. In SBCS mode, which is the initial mode, any byte other than a shift-out is interpreted as an SBCS character. When a shift-out is read, the mode switches to DBCS mode. In DBCS mode, the next byte and every second byte after that byte are interpreted as the first byte of a DBCS character unless the byte is a shift character. If the byte is a shift-out, an error occurs. If the byte is a shift-in, the mode returns to SBCS mode. An error occurs if the mode is still DBCS mode after processing the last byte of the string. Because of the shift characters, EBCDIC mixed data requires more storage than ASCII mixed data.
  • UTF-8 is a varying-length encoding of byte sequences. The high bits indicate the part of the sequence to which a byte belongs. The first byte indicates the number of bytes to follow in a byte sequence.
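
The left-to-right reading of EBCDIC mixed data described above can be sketched as a small state machine. This Python sketch follows only the rules stated in the text; the function name and the returned structure are hypothetical, and it is not how Db2 itself is implemented.

```python
from typing import List, Tuple

SHIFT_OUT = 0x0E  # introduces a string of DBCS characters
SHIFT_IN = 0x0F   # ends a string of DBCS characters


def split_ebcdic_mixed(data: bytes) -> List[Tuple[str, bytes]]:
    """Split EBCDIC mixed data into ("SBCS", 1 byte) and ("DBCS", 2 bytes) items.

    Shift characters are consumed and not returned. Raises ValueError for
    the error conditions described in the text.
    """
    characters: List[Tuple[str, bytes]] = []
    in_dbcs = False  # SBCS mode is the initial mode
    i = 0
    while i < len(data):
        byte = data[i]
        if not in_dbcs:
            if byte == SHIFT_OUT:
                in_dbcs = True
            else:
                # Any byte other than a shift-out is an SBCS character.
                characters.append(("SBCS", data[i:i + 1]))
            i += 1
        elif byte == SHIFT_OUT:
            raise ValueError("shift-out encountered in DBCS mode")
        elif byte == SHIFT_IN:
            in_dbcs = False
            i += 1
        else:
            if i + 1 >= len(data):
                # The string would end while still in DBCS mode.
                raise ValueError("truncated DBCS character at end of string")
            characters.append(("DBCS", data[i:i + 2]))
            i += 2
    if in_dbcs:
        raise ValueError("string ends in DBCS mode (missing shift-in)")
    return characters
```

Applied to the 13-byte string X'0E46950F8785950E45B90F9289', the sketch yields seven characters: one DBCS character, three SBCS characters, one DBCS character, and two SBCS characters. The four shift bytes account for the extra storage compared with the ASCII form.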

Examples of character encoding schemes

The same mixed data character string can be represented as character and hexadecimal data in different encoding schemes.

For the same mixed data character string, the following table shows character and hexadecimal representations of the character string in different encoding schemes. In EBCDIC, the shift-out and shift-in characters are needed to delineate the double-byte characters.
Table 2. Example of a character string in different encoding schemes
Data type and encoding scheme Character representation Hexadecimal representation (with spaces separating each character)
9 bytes in ASCII
Begin figure description. A string consists of a Kanji character, the Latin lowercase characters gen, another Kanji character, and the Latin lowercase characters ki. End figure description.
8CB3 67 65 6E 8B43 6B 69
13 bytes in EBCDIC
Begin figure description. A string is a shift-out, a Kanji character, a shift-in, the characters g e n, a shift-out, a Kanji character, a shift-in, and the characters k i. End figure description.
0E 4695 0F 87 85 95 0E 45B9 0F 92 89
11 bytes in Unicode UTF-8
Begin figure description. A string consists of a Kanji character, the Latin lowercase characters gen, another Kanji character, and the Latin lowercase characters ki. End figure description.
E58583 67 65 6E E6B097 6B 69
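
The byte counts in Table 2 can be checked directly from the hexadecimal values. The following Python sketch counts the bytes in each row and decodes the UTF-8 and ASCII rows. Decoding the ASCII row with Python's `shift_jis` codec is an assumption about the CCSID behind that row; the EBCDIC row is only counted, because the Python standard library has no codec for mixed EBCDIC Japanese data.

```python
# Hexadecimal values copied from Table 2.
TABLE_2_HEX = {
    "ASCII": "8CB3 67 65 6E 8B43 6B 69",
    "EBCDIC": "0E 4695 0F 87 85 95 0E 45B9 0F 92 89",
    "UTF-8": "E58583 67 65 6E E6B097 6B 69",
}


def to_bytes(spaced_hex: str) -> bytes:
    """Convert a space-separated hexadecimal string to bytes."""
    return bytes.fromhex(spaced_hex.replace(" ", ""))


# Number of bytes that each encoding scheme needs for the same string.
byte_lengths = {name: len(to_bytes(hex_text)) for name, hex_text in TABLE_2_HEX.items()}

# The UTF-8 row decodes to a Kanji character, "gen", a Kanji character, "ki".
utf8_text = to_bytes(TABLE_2_HEX["UTF-8"]).decode("utf-8")

# Assumption: the ASCII row uses a Shift-JIS compatible code page.
ascii_text = to_bytes(TABLE_2_HEX["ASCII"]).decode("shift_jis")

print(byte_lengths)             # {'ASCII': 9, 'EBCDIC': 13, 'UTF-8': 11}
print(utf8_text == ascii_text)  # True
```

Both decodable rows produce the same seven-character string, even though the stored lengths differ.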

Because of the differences in the representation of mixed data strings in ASCII, EBCDIC, and Unicode, mixed data is not transparently portable. To minimize the effects of these differences, use varying-length strings in applications that require mixed data and operate on ASCII, EBCDIC, and Unicode data.

String unit specifications

The ability to specify string units for certain built-in functions and on the CAST specification allows you to process string data in a character-based manner rather than a byte-based manner. The string unit determines the unit of length in which the operation occurs. For more information, see String unit specifications.
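
The difference between byte-based and character-based processing can be illustrated outside of Db2. The following Python sketch is an analogy only: the unit names `"bytes"` and `"characters"` are hypothetical labels and are not Db2 string unit keywords.

```python
# Analogy only: "bytes" and "characters" are hypothetical labels, not Db2 keywords.
# The sample string is a Kanji character, "gen", a Kanji character, "ki".
SAMPLE = "\u5143gen\u6c17ki"


def length_in(text: str, unit: str) -> int:
    """Return the length of text counted in the given unit (UTF-8 for bytes)."""
    if unit == "characters":
        return len(text)
    if unit == "bytes":
        return len(text.encode("utf-8"))
    raise ValueError(f"unknown unit: {unit!r}")


def first_two_by_bytes_is_valid(text: str) -> bool:
    """Check whether cutting the UTF-8 form after 2 bytes leaves whole characters."""
    try:
        text.encode("utf-8")[:2].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(length_in(SAMPLE, "characters"))      # 7
print(length_in(SAMPLE, "bytes"))           # 11
print(SAMPLE[:2] == "\u5143g")              # True: character-based cut keeps whole characters
print(first_two_by_bytes_is_valid(SAMPLE))  # False: byte-based cut splits the first character
```

The same string has a length of 7 when counted in characters and 11 when counted in UTF-8 bytes, and a byte-based cut can split a multiple-byte character.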