Selecting an encoding

Explains special considerations for selecting an encoding (also known as a code page).

CPLEX offers parameters that specify the encoding that CPLEX uses to represent data, whether as input or output. For details about these encoding parameters, see the documentation of the API string encoding switch and the file encoding switch in the CPLEX Parameters Reference Manual.

Tip:

These encoding parameters have no effect on IBM CPLEX Optimizer for z/OS, where only EBCDIC IBM-1047 encoding is available.

Default encoding

By default, CPLEX uses the encoding ISO-8859-1 (also known as Latin-1). The familiar encoding known as ASCII is a subset of ISO-8859-1. Beyond ASCII, ISO-8859-1 covers the accented letters and symbols of most Western European languages, with one byte per character, so this default is a reasonable choice for many users.
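The relationship between ASCII and ISO-8859-1 can be checked with a short sketch. It uses Python's built-in codecs rather than CPLEX itself, and the sample names are hypothetical; it only illustrates the encodings, not how CPLEX converts data internally.

```python
# Illustration only: Python's built-in codecs, not CPLEX.
# The names below are hypothetical sample data.

# Pure ASCII text produces identical bytes under ASCII and ISO-8859-1.
text = "maximize cost"
ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("iso-8859-1")
same_bytes = ascii_bytes == latin1_bytes

# Accented Western European letters fit in ISO-8859-1, one byte each.
accented = "co\u00fbt"  # "cout" with a circumflex on the u
accented_bytes = accented.encode("iso-8859-1")
one_byte_per_char = len(accented_bytes) == len(accented)

print(same_bytes, one_byte_per_char)
```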

Multi-byte encoding

However, the encoding ISO-8859-1 cannot represent character sets with a large number of characters, such as those used for Chinese, Japanese, Korean, Indian languages, or Vietnamese. Such character sets require a multi-byte encoding; that is, an encoding in which multiple bytes may be used to represent a single character. UTF-8 is such an encoding. It can represent every character in the Unicode character set, so it is sufficiently comprehensive for many purposes. It is also compatible with ASCII. It requires neither a byte-order mark (also known as BOM) nor a specification of big-endian or little-endian byte order. It does not include multi-byte characters that contain a NULL byte in their multi-byte encoding. In short, it is a useful encoding for many users whose needs reach beyond ASCII or Latin-1.
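As a minimal sketch of these properties, the following uses Python's built-in codecs (not CPLEX) on a hypothetical two-character Chinese name. It shows that ISO-8859-1 rejects the name, while UTF-8 encodes it with several bytes per character, leaves ASCII text unchanged, and emits no byte-order mark.

```python
# Illustration only: Python's built-in codecs, not CPLEX.
name = "\u53d8\u91cf"  # hypothetical name: two Chinese characters

# ISO-8859-1 has no representation for these characters.
try:
    name.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# UTF-8 uses three bytes for each of these two characters.
utf8_bytes = name.encode("utf-8")
bytes_per_char = len(utf8_bytes) // len(name)

# ASCII text is unchanged under UTF-8, and no byte-order mark is emitted.
ascii_unchanged = "x1".encode("utf-8") == b"x1"
has_bom = utf8_bytes.startswith(b"\xef\xbb\xbf")

print(latin1_ok, bytes_per_char, ascii_unchanged, has_bom)
```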

Take care if you choose a multi-byte encoding other than UTF-8: CPLEX routines such as CPXXmsgstr and CPXmsgstr do not work well with encodings that include characters that contain a NULL byte in their multi-byte representation. Such a NULL byte can be mistaken for the terminator of a string, so that the string appears truncated or its characters are handled incorrectly. For precisely those reasons, CPLEX does not guarantee support for such encodings (of which UTF-16 and UTF-32 are the best known).
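The hazard of embedded NULL bytes can be sketched as follows. The helper c_string_view is hypothetical; it merely imitates how a routine that treats a zero byte as the end of a string would see the data, and is not a model of any specific CPLEX routine.

```python
# Illustration only: Python's built-in codecs, not CPLEX.

def c_string_view(raw: bytes) -> bytes:
    """Hypothetical helper: return the bytes up to the first zero byte,
    as a reader of NULL-terminated strings would see them."""
    end = raw.find(b"\x00")
    return raw if end == -1 else raw[:end]

name = "x1"  # hypothetical variable name
utf8 = name.encode("utf-8")
utf16 = name.encode("utf-16-le")
utf32 = name.encode("utf-32-le")

# UTF-8 contains no zero byte here, so the whole name survives.
seen_utf8 = c_string_view(utf8)

# UTF-16 and UTF-32 pad ASCII characters with zero bytes,
# so the name appears cut off after the first character.
seen_utf16 = c_string_view(utf16)
seen_utf32 = c_string_view(utf32)

print(seen_utf8, seen_utf16, seen_utf32)
```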

Example: why one must be careful with encodings

To get an idea of the hazards of encodings, imagine a situation in which a user creates a model as a file in a favorite editor with the encoding cp424, which supports Hebrew characters. The unsuspecting user names the model gimel (a single Hebrew character, not reproduced here). In this model, the user names each variable with a distinct, single Hebrew character. The user, aware that the encoding used in the editor is not the default file encoding in CPLEX, carefully sets the CPLEX file encoding parameter to the value cp424 before reading the file into CPLEX. Unfortunately, our unlucky user then forgets to change the API encoding parameter of CPLEX from its default value, ISO-8859-1, before extracting names through the API. Since the Hebrew character gimel (the name of the model) cannot be represented in the Latin-1 code page, CPLEX silently replaces it with the ISO-8859-1 substitute character (hex value 0x1a). Similarly, the name of each variable is replaced, since none of those characters can be represented in Latin-1, so names that were distinct in the file are no longer distinguishable through the API. None of these substitutions appears to the software as an error.
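The scenario can be imitated outside CPLEX with Python's built-in codecs. This is a sketch under stated assumptions: the error handler substitute_1a is hypothetical and written here only to mimic the substitution described above (Python's own replacement policy uses a different character), and the decode and encode steps stand in for the file encoding and API encoding conversions.

```python
import codecs

# Hypothetical handler that mimics the silent substitution described
# in the text: each unrepresentable character becomes byte 0x1a.
def substitute_1a(exc):
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    return ("\x1a" * (exc.end - exc.start), exc.end)

codecs.register_error("substitute_1a", substitute_1a)

gimel = "\u05d2"  # Hebrew letter gimel, the model name in the example

# The editor saves the name in cp424; reading with the matching
# file encoding recovers the character correctly.
file_bytes = gimel.encode("cp424")
model_name = file_bytes.decode("cp424")

# Extracting the name as ISO-8859-1 cannot represent gimel,
# so the substitute byte appears and no exception is raised.
api_bytes = model_name.encode("iso-8859-1", errors="substitute_1a")

# Extracting the name as UTF-8 instead preserves the character.
api_bytes_utf8 = model_name.encode("utf-8")

print(api_bytes, api_bytes_utf8)
```

In this imitation, any two distinct Hebrew single-character names yield the same byte 0x1a on the ISO-8859-1 path, which is why the substitution is hard to notice and hard to undo.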

Note:

Because the IBM Java Virtual Machine and Runtime Environment and the International Components for Unicode (ICU) interpret newline characters differently in EBCDIC encoding and its variants, it may be necessary to append the option swaplfnl to the encoding name when you intend to use a form of EBCDIC encoding in file or API operations. For example,

"IBM1047,swaplfnl"

may need to be used instead of

"IBM1047"

to avoid anomalies.