Character Set Encoding in Syntax Files

A syntax file can use either Unicode or a code page encoding. A Unicode file can contain characters from many different character sets. A code page file is restricted to the characters supported by a specific language or locale. For example, a code page file in a Western European encoding cannot contain Japanese or Chinese characters.
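The restriction can be illustrated outside the product with a minimal Python sketch. It assumes that windows-1252 (Python codec name cp1252) stands in for "a Western European encoding". The helper name fits_code_page is hypothetical and is not part of any product API.

```python
# Sample text, written with escapes so the example itself is plain ASCII.
western = "caf\u00e9"               # "cafe" with an acute accent on the e
japanese = "\u65e5\u672c\u8a9e"     # three Japanese characters


def fits_code_page(text: str, code_page: str) -> bool:
    """Return True if every character in text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


# A Unicode encoding such as UTF-8 can hold both strings in one file.
utf8_bytes = (western + japanese).encode("utf-8")

# A Western European code page can hold only the first string.
western_ok = fits_code_page(western, "cp1252")
japanese_ok = fits_code_page(japanese, "cp1252")
```

Here western_ok is True and japanese_ok is False, which mirrors the Western European example in the paragraph above.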

Reading syntax files

To read syntax files correctly, the syntax editor needs to know the character encoding of the file.

  • Files with a UTF-8 byte order mark are read as Unicode UTF-8, regardless of any encoding selection you make. The byte order mark is stored at the beginning of the file but is not displayed.
  • By default, files without any encoding information are read as Unicode UTF-8 in Unicode mode or in the current locale's character encoding in code page mode. To override the default behavior, select Unicode (UTF-8) or Local Encoding.
  • As Declared is enabled if the syntax file contains a code page encoding identifier at the top of the file. Starting with release 23, a comment is automatically inserted in syntax files that are saved in code page encoding. For example, the first line in the file could be:
    * Encoding: en_US.windows-1252.
    If you select As Declared, that encoding is used to read the file.
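
The rules above can be sketched as a small decision function. This is an illustrative approximation, not the syntax editor's actual implementation. The function names, the selection values ("unicode", "local", "as_declared"), and the comment pattern are all assumptions modeled on the example line shown above. The sketch also assumes that a declared encoding is used only when As Declared is selected.

```python
import codecs
import re
from typing import Optional

# Assumed pattern for the encoding comment, modeled on the example
# "* Encoding: en_US.windows-1252." (the real grammar may differ).
ENCODING_COMMENT = re.compile(rb"^\*\s*Encoding:\s*(\S+?)\.?\s*$", re.IGNORECASE)


def declared_encoding(raw: bytes) -> Optional[str]:
    """Return the code page named on the first line, or None if absent."""
    first_line = raw.split(b"\n", 1)[0].rstrip(b"\r")
    match = ENCODING_COMMENT.match(first_line)
    if not match:
        return None
    identifier = match.group(1).decode("ascii", errors="replace")
    # "en_US.windows-1252" -> "windows-1252"; keep the whole identifier
    # when it has no locale prefix.
    return identifier.partition(".")[2] or identifier


def choose_read_encoding(raw: bytes, unicode_mode: bool, locale_encoding: str,
                         selection: Optional[str] = None) -> str:
    """Pick the encoding used to read a syntax file (hypothetical helper)."""
    # A UTF-8 byte order mark overrides any selection.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Python codec that also strips the mark
    declared = declared_encoding(raw)
    if selection == "as_declared" and declared:
        return declared
    if selection == "unicode":
        return "utf-8"
    if selection == "local":
        return locale_encoding
    # Default: depends on whether the session is in Unicode or code page mode.
    return "utf-8" if unicode_mode else locale_encoding
```

For example, a file that starts with a byte order mark is read as UTF-8 even when selection is "local", while a file whose first line is the comment shown above yields "windows-1252" when selection is "as_declared".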

Saving syntax files

By default, syntax files are saved as Unicode UTF-8 in Unicode mode or in the current locale's character encoding in code page mode. To override the default behavior, select Unicode (UTF-8) or Local Encoding in the Save Syntax As dialog.

  • If you save a new syntax file or save the file in a different encoding, a comment is inserted at the top of the file that identifies the encoding. If an encoding comment is already present, it is replaced.
  • If you save a syntax file and then save it again without closing it, the second save uses the same encoding as the first.
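
The insert-or-replace behavior of the encoding comment can be sketched as follows. This is an illustration under stated assumptions, not the product's own code. The function name with_encoding_comment is hypothetical, and the comment format is taken from the example line "* Encoding: en_US.windows-1252." in the previous section.

```python
import re

# Assumed shape of an existing encoding comment on the first line.
ENCODING_COMMENT = re.compile(r"^\*\s*Encoding:.*$", re.IGNORECASE)


def with_encoding_comment(text: str, identifier: str) -> str:
    """Return text whose first line is an encoding comment for identifier.

    An existing encoding comment is replaced; otherwise a new one is
    inserted above the first line.
    """
    comment = f"* Encoding: {identifier}."
    lines = text.split("\n")
    if ENCODING_COMMENT.match(lines[0].rstrip("\r")):
        lines[0] = comment
    else:
        lines.insert(0, comment)
    return "\n".join(lines)


# First save: no comment is present, so one is inserted.
first_save = with_encoding_comment("FREQUENCIES x.\n", "en_US.windows-1252")

# Saving in a different encoding replaces the comment instead of adding
# a second one. "UTF-8" is a hypothetical identifier used for illustration.
second_save = with_encoding_comment(first_save, "UTF-8")
```

After the second call the file still has exactly one encoding comment, which matches the rule that an existing comment is replaced rather than duplicated.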