Deciding whether to store data as UTF-8 or UTF-16

If you create a Unicode database in DB2® for z/OS®, you need to decide whether to use UTF-8 or UTF-16. DB2 for z/OS does not support storing data as UTF-32. UTF-8 and UTF-16 can both represent every Unicode character, but each format has advantages and disadvantages depending on your situation.

Procedure

To decide whether to store data as UTF-8 or UTF-16:

Consider the following recommendations and guidelines:

Performance recommendation: Store your data in DB2 in the same format as your application. This setup ensures optimal performance, because character conversion is avoided.
This recommendation is especially important when the application is written in a language that runs on z/OS (for example, COBOL on z/OS), because character conversion on z/OS can consume a significant amount of CPU time.
Examples:
- COBOL and PL/I on z/OS use UTF-16 for Unicode data. Neither language supports UTF-8. So if you are using COBOL or PL/I applications on z/OS that process Unicode data, the optimal situation is to store your data in DB2 in UTF-16. In this case, even though UTF-16 data can potentially take more storage than UTF-8 data, no conversion occurs. Thus you avoid a significant performance impact.
- For Java applications that use the type 4 z/OS driver, which sends the data in UTF-8, store your data in DB2 as UTF-8 data.
If you have both local and remote applications on different operating systems, choose the format based on the encoding of the local application.
Storage recommendation: After you consider performance, consider your storage requirements. Store the data in the format that requires the least space for your data.
UTF-16 does not always require more storage than UTF-8. The amount of storage that is required depends on your data. For example, basic Latin (ASCII) characters take 1 byte in UTF-8 and 2 bytes in UTF-16, and the accented characters in the rest of the Latin-1 range take 2 bytes in either format. However, Japanese characters take 3 to 4 bytes in UTF-8 and 2 to 4 bytes in UTF-16.

Example: DB2 for z/OS uses UTF-8 for the catalog. Because the catalog contains mostly Latin-1 characters, this format uses considerably less space than UTF-16.
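
The byte counts in the storage recommendation can be checked outside of DB2 with a short script. The following Python sketch is illustrative only: the helper name encoded_sizes and the sample strings are assumptions made for this example, and the script measures raw encoded length, not how DB2 allocates column storage.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-be" fixes the byte order and does not prepend a byte order mark,
    # so the count reflects only the characters themselves.
    utf16_len = len(text.encode("utf-16-be"))
    return utf8_len, utf16_len


# Hypothetical sample data, one string per class of character.
samples = {
    "ASCII": "DB2",
    "Latin-1 accented": "\u00e9",            # e with acute accent
    "Japanese (BMP)": "\u65e5\u672c\u8a9e",  # three kanji
    "Japanese (supplementary)": "\U00020BB7",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Running a check like this against a representative sample of your own data gives a rough indication of which format needs less space.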
Recommendation for MQ, CICS® Transaction Gateway, and IMS™ Connect messages: When messages are passed from one technology to another, everything in the message is usually converted to characters. Consider the size of these messages when you decide where to use each Unicode encoding form. For example, suppose that you have COBOL applications, which use UTF-16, but you are concerned about the size of the messages. You might decide to convert the messages to UTF-8 before you put them on the wire. If the messages contain mostly ASCII characters, this conversion roughly halves their size.
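
The effect of converting a message from UTF-16 to UTF-8 can be sketched as follows. This Python example is an illustration and not part of any MQ, CICS, or IMS API: the function name utf16_to_utf8 and the message text are hypothetical, and the payload is assumed to be big-endian UTF-16 without a byte order mark.

```python
def utf16_to_utf8(payload: bytes) -> bytes:
    """Convert a big-endian UTF-16 payload to UTF-8 (assumed input format)."""
    return payload.decode("utf-16-be").encode("utf-8")


# Hypothetical message text, encoded the way the sending application holds it.
message_utf16 = "ORDER 12345 SHIPPED".encode("utf-16-be")
message_utf8 = utf16_to_utf8(message_utf16)

print(len(message_utf16), "bytes as UTF-16")
print(len(message_utf8), "bytes as UTF-8")
```

The saving depends on the content. A message that consists mainly of Japanese text would grow, not shrink, when converted to UTF-8.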

What to do next

If you choose a Unicode format for performance reasons and are concerned about the extra storage that the format requires, see Tips for handling any extra storage that Unicode data might require.