File encoding and UTF-8 as default charset

Starting with JDK 18, UTF-8 is the default charset across all operating systems.

JEP 400 - UTF-8 by default

JDK Enhancement Proposal (JEP) 400 made UTF-8 the default charset for standard Java APIs across all operating systems, starting with Java 18. The only exception is the console input and output encoding.

In IBM Semeru Runtime Certified Edition for z/OS 21 (Semeru 21), UTF-8 is the default value of the file.encoding property and the charset that is returned by the method java.nio.charset.Charset.defaultCharset(). Previous releases defaulted to an EBCDIC charset that was determined by locale settings.

This change impacts various Java SE APIs that use the default charset, such as:
  • the InputStreamReader, FileReader, OutputStreamWriter, FileWriter, and PrintStream APIs of the java.io package, and
  • the URLEncoder and URLDecoder APIs of the java.net package.
The UTF-8 default affects both how input files are decoded and how output files are encoded. The file.encoding property influences this behavior.
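
The following minimal sketch shows the effect of the default charset on these APIs. The class name and the file input.txt are hypothetical and serve only to illustrate the behavior.

import java.io.FileReader;
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) throws Exception {
        // On Semeru 21, this prints UTF-8 unless file.encoding overrides it.
        System.out.println(Charset.defaultCharset());

        // FileReader without an explicit charset decodes input.txt
        // by using the default charset printed above.
        try (FileReader reader = new FileReader("input.txt")) {
            int c;
            while ((c = reader.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}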

The file.encoding property

The file.encoding property can be used to influence the default charset behavior. Specifically, a new property value, COMPAT, reverts the default charset behavior to that of Java 17 and earlier releases. Alternatively, a specific charset can be set for the property on the command line, if necessary. The following example illustrates the usage of the property.

# The COMPAT value reverts the default charset selection to Java 17 behavior,
# which selects a native EBCDIC encoding dependent on locale settings.
java -Dfile.encoding=COMPAT MyApplication

# This specifies file.encoding to be the IBM-1047 charset.
java -Dfile.encoding=IBM-1047 MyApplication
Note: Configuring the file.encoding system property with the -D option on the java command line is not supported by the JEP itself; IBM supports this behavior on z/OS. The property must be set before JVM initialization.

Default UTF-8 behavior and auto-conversion of input files
If no specific file.encoding property value is set, the default charset is UTF-8. To improve compatibility between UTF-8 and EBCDIC on z/OS systems, Semeru 21 implements an automatic charset conversion feature that converts EBCDIC files to UTF-8. This feature uses the z/OS file tagging support, managed with the chtag utility, to identify the charset of the text within a file. When auto-conversion is enabled, the text contents of a tagged file are automatically converted on read operations and presented as UTF-8 characters to the Java application, subject to the limitations that are documented in the subsequent sections. Only conversions to the UTF-8 charset are supported.

Only files with txtflag=ON (set by using chtag -t) and a valid charset code are candidates for auto-conversion. The txtflag attribute denotes whether the file contains uniformly encoded text data. Untagged text files and binary files are not converted.

The following code snippet shows how to tag files on z/OS with the chtag utility.


chtag -t -c IBM-1047 EBCDIC-1047.txt          # Tag as text in IBM-1047 encoding
chtag -t -c ISO8859-1 ASCII-ISO8859-1.txt     # Tag as text in ISO8859-1 encoding
chtag -t -c UTF-8 UTF-8.txt                   # Tag as text in UTF-8 encoding

This auto-conversion feature is controlled by a new property, com.ibm.autocvt. The property defaults to true only if the file.encoding property is not set. If a specific charset or COMPAT is specified for the file.encoding property, auto-conversion is disabled by default.
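
The following command lines show how the property can be combined with file.encoding. MyApplication is a placeholder class name, as in the earlier examples.

# Enable auto-conversion even though file.encoding is set to UTF-8
java -Dfile.encoding=UTF-8 -Dcom.ibm.autocvt=true MyApplication

# Disable auto-conversion while keeping the UTF-8 default
java -Dcom.ibm.autocvt=false MyApplication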

Note: The com.ibm.autocvt property is entirely independent of the z/OS Enhanced ASCII support, and does not interact with any _BPXK_AUTOCVT settings.

The following table summarizes the interactions between the file.encoding property and auto-conversion.

Table 1. Interactions between the file.encoding property and auto-conversion
file.encoding value               | Auto-conversion (com.ibm.autocvt) | Behavior
Unset                             | Enabled                           | Default behavior that automatically converts tagged files to UTF-8. Untagged text files are assumed to be in UTF-8.
file.encoding=COMPAT              | Disabled                          | Reverts to Java 17 behavior: file.encoding defaults to a native EBCDIC encoding dependent on locale, and auto-conversion is disabled.
file.encoding=UTF-8               | Disabled (can be enabled)         | Enforces UTF-8 as the default charset. Auto-conversion is disabled to preserve compatibility with JDK 17 when file.encoding=UTF-8 is set. If auto-conversion is required, enable it by using -Dcom.ibm.autocvt=true.
file.encoding=<non-UTF-8 charset> | Disabled (cannot be enabled)      | Enforces the specified charset as the default. Auto-conversion cannot be enabled by setting the com.ibm.autocvt property to true. Behavior is compatible with JDK 17 with the same file.encoding setting applied.
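
To verify which default charset takes effect for a particular combination of settings, the standard -XshowSettings launcher option can be used as a convenient check, for example:

java -Dfile.encoding=COMPAT -XshowSettings:properties -version 2>&1 | grep file.encoding
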
Auto-conversion limitations
1. The FileChannel and RandomAccessFile APIs
While auto-conversion enables a seamless approach to UTF-8 support, the conversion can introduce multi-byte characters, so the UTF-8 data can contain a different number of bytes than the raw data. For Java classes that allow file contents to be read at specific offsets, such as java.nio.channels.FileChannel and java.io.RandomAccessFile, auto-conversion cannot be supported. Using these APIs with non-UTF-8 tagged text files results in an errno2 code of 0x05350651, and the following exception is thrown.
java.io.IOException: EDC5121I Invalid argument

The following example shows the exception that is seen when the API FileChannel.read() is called on a tagged text file with auto-conversion enabled.

Caused by: java.io.IOException: EDC5121I Invalid argument.
        at java.base/sun.nio.ch.UnixFileDispatcherImpl.pread0(Native Method)
        at java.base/sun.nio.ch.UnixFileDispatcherImpl.pread(UnixFileDispatcherImpl.java:57)
        at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:338)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:306)
        at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:283)
        at java.base/sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:984)
        at java.base/sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:967)

To resolve this problem, the input file should remain untagged, or tagged as binary, whichever is appropriate.
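
For example, the tag can be removed or replaced with a binary tag by using the chtag utility. The file name data.txt is hypothetical.

chtag -r data.txt    # Remove the file tag entirely
chtag -b data.txt    # Alternatively, tag the file as binary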

2. Implications with explicit charsets
The auto-conversion feature can transparently convert the contents of tagged text files to UTF-8. This support allows applications to assume that file contents are in UTF-8, in line with the goal of JEP 400 to create a uniform UTF-8 experience. However, when application code programmatically specifies a different character encoding, reading the converted UTF-8 contents with that encoding produces incorrect results. Consider the following FileReader example:
FileReader reader = new FileReader("EBCDIC.txt", Charset.forName("IBM-1047"));

If the input file EBCDIC.txt is tagged as IBM-1047 text and no file.encoding property is set, auto-conversion converts the contents of the EBCDIC.txt file to UTF-8 in the underlying FileInputStream. The specified IBM-1047 charset then causes the UTF-8 contents to be misinterpreted as IBM-1047 text, resulting in garbled data.

To resolve this problem, choose one of the following options based on your application's requirements.

  • Disable auto-conversion by setting com.ibm.autocvt=false. This setting disables auto-conversion for the entire Java application.
  • Use compatibility mode by using file.encoding=COMPAT. This setting forces the default charset to revert to Java 17 behavior, where native encoding is determined based on locale. Auto-conversion is disabled in this scenario.
  • Leave the input files untagged. This prevents the auto-conversion feature from converting your input file contents to UTF-8, and the raw bytes of the input file are presented.
  • Remove the specific charset from your source code as shown in the following example:
    FileReader reader = new FileReader("EBCDIC.txt");
    In this case, the underlying auto-conversion support converts the contents of the tagged file to UTF-8, and the text is interpreted with the UTF-8 default charset.
3. The pread() function might fail after UTF-8 auto-conversion
Opening a tagged non-UTF-8 file through the FileChannel class and reading or writing its contents fails. This failure occurs because the underlying pread() function is not supported on files that undergo auto-conversion between encodings that are not both single-byte.

The resulting error message identifies the file whose pread() call caused the problem.

Resolve this problem in one of the following ways, as shown in the example after this list.

  1. Untag the file used by the class FileChannel.
  2. Convert the file to UTF-8 and tag it as UTF-8.
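
The following commands sketch both options. The file name data.txt and the IBM-1047 source encoding are assumptions for illustration.

# Option 1: remove the tag from the file that is used by FileChannel
chtag -r data.txt

# Option 2: convert the file contents to UTF-8 and tag the result as UTF-8
iconv -f IBM-1047 -t UTF-8 data.txt > data.utf8.txt
chtag -tc UTF-8 data.utf8.txt
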
Impact of JEP 400 on javac command

With JEP 400, the javac command that is used to compile Java source files into class files expects UTF-8 encoded input source files. If your Java source files are encoded in EBCDIC, you can use one of the following options to change or set the encoding and successfully compile your Java source files:

  • Specify the source file charset by using the -encoding option of the javac command as shown in the following example.
    javac -encoding IBM-1047 MyJavaSource.java
  • Tag your EBCDIC encoded Java source files with the correct encoding by using the chtag utility as shown in the following example.
    chtag -tc IBM-1047 MyJavaSource.java
Other known issues with UTF-8 support
Override Charset in constructor of the class java.io.InputStreamReader
In the constructor InputStreamReader(InputStream in, Charset cs), when reading from the standard input stream on z/OS, the console.encoding charset overrides the specified charset if the ConsoleInputStream is not backed by a regular file. This override helps ensure that the input is interpreted correctly, avoiding garbled data.
Override Charset in constructor of the class java.io.OutputStreamWriter
In the constructor OutputStreamWriter(OutputStream out, Charset cs), the console.encoding charset overrides the supplied charset when writing to standard output or standard error on z/OS, provided that these streams are not redirected to regular files. This override prevents the output from becoming garbled by helping ensure that the correct charset is applied for text output.
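
The following minimal sketch illustrates both constructors. The explicit IBM-1047 charset is chosen only for illustration; when standard input and output are connected to the console rather than regular files, the console.encoding charset takes precedence.

import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class ConsoleOverrideDemo {
    public static void main(String[] args) throws Exception {
        Charset ebcdic = Charset.forName("IBM-1047");

        // If System.in is the console and not a regular file,
        // console.encoding overrides the IBM-1047 charset below.
        InputStreamReader reader = new InputStreamReader(System.in, ebcdic);

        // The same override applies to System.out and System.err.
        OutputStreamWriter writer = new OutputStreamWriter(System.out, ebcdic);
        writer.write("Read one character: " + (char) reader.read() + "\n");
        writer.flush();
    }
}
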
Migration from previous releases
The change to adopt UTF-8 might cause migration challenges for existing Java applications that interact with non-UTF-8 files. The following are possible solutions that enable your applications to run after migration to Semeru 21.
  • Convert and tag your files to UTF-8 encoding. This solution typically offers the best performance because it minimizes the processor usage that is required for charset conversion.
  • Tag any input files with the appropriate encoding by using the chtag utility. Auto-conversion is enabled by default if file.encoding is not set, and converts tagged text file contents to UTF-8.
    Note: Output files that are written in the default mode remain encoded in UTF-8.
  • Specify an explicit charset by using file.encoding=<charset>. This setting changes the default charset for both input and output files to the specified charset.
  • Use the compatibility mode by specifying file.encoding=COMPAT. This mode reverts the default charset to Java 17 behavior, where native encoding is determined based on locale.