Deciding whether to store data as UTF-8 or UTF-16

If you create a Unicode database in Db2 for z/OS®, you need to decide whether to use UTF-8 or UTF-16. Db2 for z/OS does not support storing data as UTF-32. Both UTF-8 and UTF-16 can represent every Unicode character, but each format has advantages and disadvantages depending on your situation.

Procedure

To decide whether to store data as UTF-8 or UTF-16:

Consider the following recommendations and guidelines:
Performance
Store your data in Db2 in the same format as your application. This setup ensures optimal performance, because character conversion is avoided.

This recommendation is especially important when the application is written in a language that runs on z/OS (for example, COBOL on z/OS), because the CPU cost of character conversion on z/OS can be very high.

For example:

  • COBOL and PL/I on z/OS use UTF-16 or UTF-8 for Unicode data. Even if your compiler supports UTF-8 data, the optimal situation is to store your data in Db2 in UTF-16. UTF-16 data can potentially take more storage than UTF-8 data, but because no conversion occurs when you use UTF-16 data, you avoid a significant performance impact.
  • For Java™ applications that use IBM® Data Server Driver for JDBC and SQLJ type 4 connectivity, which sends the data in UTF-8, store your data in Db2 as UTF-8 data.
If you have both local and remote applications on different operating systems, choose the format based on the encoding of the local application.
Storage
After you consider performance, consider your storage requirements. Store the data in the format that requires the least space for your data.

UTF-16 does not always require more storage than UTF-8. The amount of storage that is required depends on your data. For example, ASCII characters (U+0000 through U+007F) always take 1 byte in UTF-8 and 2 bytes in UTF-16, and the accented Latin-1 characters (U+0080 through U+00FF) take 2 bytes in both formats. However, Japanese characters take 3 to 4 bytes in UTF-8 and 2 to 4 bytes in UTF-16.
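The byte counts for a given string can be checked directly. The following Python sketch is only an illustration and is not part of Db2: the helper name encoded_sizes and the sample strings are hypothetical, and the code assumes big-endian UTF-16 without a byte order mark.

```python
from typing import Tuple


def encoded_sizes(text: str) -> Tuple[int, int]:
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-be" is used so that no byte order mark is added to the count.
    utf16_len = len(text.encode("utf-16-be"))
    return utf8_len, utf16_len


# Hypothetical sample values chosen to cover each size class.
samples = {
    "ASCII": "SYSIBM.SYSTABLES",
    "Latin-1 with an accent": "caf\u00e9",
    "Japanese (BMP)": "\u65e5\u672c\u8a9e",
    "Supplementary kanji": "\U00020BB7",
}

for label, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Running the same comparison over a representative sample of your own data gives a rough estimate of which format needs less space.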

For example, Db2 for z/OS uses UTF-8 for the catalog. Because the catalog contains mostly Latin-1 characters, this format uses considerably less space than UTF-16.

MQ, CICS® Transaction Gateway, and IMS Connect messages
When messages are passed from one technology to another, everything in the message is usually converted to characters. Consider the size of these messages when you decide when and where to use each UTF. For example, suppose that you have COBOL applications, which use UTF-16, but you are concerned about the size of the messages. You might decide to convert the messages to UTF-8 before you put them on the wire. If the messages contain mostly characters that take 1 byte in UTF-8, this conversion reduces the message size.
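The effect of such a conversion on message size can be sketched as follows. This Python example is only an illustration under stated assumptions: the function utf16_to_utf8 and the message content are hypothetical, the payload is assumed to be big-endian UTF-16 without a byte order mark, and no MQ, CICS Transaction Gateway, or IMS Connect API is involved.

```python
def utf16_to_utf8(message: bytes) -> bytes:
    """Transcode a big-endian UTF-16 payload (no byte order mark) to UTF-8."""
    return message.decode("utf-16-be").encode("utf-8")


# Hypothetical message text, held in UTF-16 as a COBOL application might hold it.
utf16_payload = "ORDER=12345;STATUS=SHIPPED".encode("utf-16-be")
utf8_payload = utf16_to_utf8(utf16_payload)

print(f"UTF-16 payload: {len(utf16_payload)} bytes")
print(f"UTF-8 payload:  {len(utf8_payload)} bytes")
```

The saving depends on the content. A message that consists mainly of Japanese text, for example, can be larger in UTF-8 than in UTF-16, so measure with representative messages before you commit to the conversion.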

What to do next

If you choose a Unicode format for performance reasons and are concerned about the extra storage that the format requires, see Tips for handling any extra storage that Unicode data might require.