Selecting an encoding
Explains special considerations about encoding, also known as code pages.
CPLEX offers parameters that specify the encoding (also known as the code page) that CPLEX uses to represent data, whether as input or output. For details about these encoding parameters, see the documentation of the API string encoding switch and the file encoding switch in the CPLEX Parameters Reference Manual.
These encoding parameters have no effect on IBM CPLEX Optimizer for z/OS, where only EBCDIC IBM-1047 encoding is available.
Default encoding
By default, CPLEX uses the encoding ISO-8859-1 (also known as Latin-1). The familiar encoding known as ASCII is a subset of ISO-8859-1. ISO-8859-1 is a single-byte encoding that covers the characters of most Western European languages, so this default is a reasonable choice for many users.
Multi-byte encoding
The encoding ISO-8859-1 cannot represent character sets with a large number of characters,
such as Chinese, Japanese, Korean, Indian, or Vietnamese. Such character sets require a multi-byte
encoding, that is, an encoding in which a single character may be represented by multiple bytes.
UTF-8 is such an encoding. It can represent every character in the Unicode character set, so it is
sufficiently comprehensive for many purposes. It is also compatible with ASCII. It requires neither
a byte-order mark (also known as BOM) nor specification of big-endian or little-endian byte order.
No multi-byte character in UTF-8 contains a NULL byte in its encoding. In short, it is a useful
encoding for many users whose needs reach beyond ASCII or Latin-1.
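The properties of UTF-8 listed above can be checked with a short sketch. It uses only the Python standard library and does not call CPLEX; the sample names and the helper utf8_properties are hypothetical and serve only as an illustration.

```python
# Illustration only: standard-library Python, no CPLEX calls.
# The sample names and the helper below are hypothetical.
samples = ["x1", "café", "変数", "\u05d2"]


def utf8_properties(text):
    """Report a few properties of the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return {
        "bytes": len(data),                       # may exceed len(text)
        "has_null_byte": 0 in data,               # False unless text contains U+0000
        "starts_with_bom": data.startswith(b"\xef\xbb\xbf"),
    }


for s in samples:
    print(repr(s), utf8_properties(s))

# ASCII text has the same bytes in ASCII, in ISO-8859-1, and in UTF-8.
ascii_same = (
    "x1".encode("ascii") == "x1".encode("latin-1") == "x1".encode("utf-8")
)
```

For these samples, the non-ASCII names occupy more bytes than they have characters, and none of the encoded byte sequences contains a NULL byte or begins with a byte-order mark.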
Take care if you choose a multi-byte encoding other than UTF-8: CPLEX routines such as
CPXXmsgstr and CPXmsgstr do not work well with
encodings that include characters that contain a NULL byte in their multi-byte
representation. The presence of these NULL bytes can lead to unfortunate
coincidences (for example, a NULL byte can be mistaken for the end of a string) and, in the worst
case, to errors in handling the characters. For precisely those reasons, CPLEX does not guarantee
support for such encodings (of which UTF-16 and UTF-32 are the best known).
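To see why embedded NULL bytes are hazardous, consider the following sketch. It uses only the Python standard library and does not call CPLEX; the helper c_string_view is hypothetical and merely imitates how code that expects NULL-terminated strings would read a buffer. It is not a description of how CPLEX routines are implemented.

```python
# Illustration only: standard-library Python, no CPLEX calls.
name = "x1"

utf8_bytes = name.encode("utf-8")       # b'x1'
utf16_bytes = name.encode("utf-16-le")  # b'x\x001\x00'
utf32_bytes = name.encode("utf-32-le")  # four bytes per character


def c_string_view(data):
    """Hypothetical helper: the bytes a reader of NULL-terminated strings sees."""
    return data.split(b"\x00", 1)[0]


print(c_string_view(utf8_bytes))   # the whole name survives
print(c_string_view(utf16_bytes))  # cut off after the first character
print(c_string_view(utf32_bytes))  # cut off after the first character
```

Even a plain ASCII name contains NULL bytes when it is encoded in UTF-16 or UTF-32, so a reader that stops at the first NULL byte sees only part of the name.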
Example: why one must be careful with encodings
To get an idea of the hazards of encodings, imagine a situation in which a user creates a model
as a file in a favorite editor with the encoding cp424, which supports Hebrew
characters. The unsuspecting user names the model gimel (a single Hebrew character, not reproduced
here). In this model, the user names each variable with a distinct, single Hebrew character. The
user, aware that the encoding used in the editor is not the default file encoding in CPLEX,
carefully sets the CPLEX file encoding parameter to the value cp424 before reading
the file into CPLEX. Unfortunately, our unlucky user then forgets to change the default value
ISO-8859-1 of the API encoding parameter of CPLEX when extracting names through the API. Since the
Hebrew character gimel (the name of the model) cannot be represented in the Latin-1 code page,
CPLEX silently replaces it with the ISO-8859-1 substitute character (hex value 0x1a). Similarly,
the name of each variable is replaced by the substitute character, since none of those characters
can be represented in Latin-1. None of these substitutions appears to the software to be an error.
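The same kind of silent substitution can be reproduced outside CPLEX. The following sketch uses only the Python standard library and does not call CPLEX; the variable names are hypothetical. Note one difference: Python's "replace" error handler substitutes a question mark, whereas the text above states that CPLEX substitutes the character with hex value 0x1a.

```python
# Illustration only: standard-library Python, no CPLEX calls.
gimel = "\u05d2"                       # Hebrew letter gimel

# Writing and reading the file with the matching encoding works.
file_bytes = gimel.encode("cp424")
decoded_name = file_bytes.decode("cp424")

# Extracting the name in Latin-1 cannot work: the character is not representable.
# With errors="replace", Python substitutes b'?' and raises no error.
api_bytes = decoded_name.encode("latin-1", errors="replace")

# Distinct single-character variable names all collapse to the same bytes.
variable_names = ["\u05d0", "\u05d1", "\u05d2"]   # alef, bet, gimel
api_names = [v.encode("latin-1", errors="replace") for v in variable_names]

print(api_bytes)   # b'?'
print(api_names)   # three identical entries
```

After the substitution, the three variable names can no longer be told apart, which is why matching the API encoding to the file encoding matters.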
Because the IBM Java Virtual Machine and Runtime Environment and the International Components
for Unicode (ICU) interpret newline characters differently in EBCDIC encoding and its variants,
it may be necessary to append the option swaplfnl to the encoding name
when you use a form of EBCDIC encoding in file or API operations. For example,
"IBM1047,swaplfnl"
may need to be used instead of
"IBM1047"
to avoid anomalies.