String Formats (DATA LIST command)

String (alphanumeric) variables can contain any numbers, letters, or characters, including special characters and embedded blanks. Numbers entered as values for string variables cannot be used in calculations unless you convert them to numeric format (see RECODE). On DATA LIST, a string variable is defined with an A format if data are in standard character form or an AHEX format if data are in hexadecimal form.

  • For fixed-format data, the width of a string variable is either implied by the column location specification or specified by the w on the FORTRAN-like format. For freefield data, the width must be specified on the FORTRAN-like format.
  • AHEX format is available only for fixed-format data. Because each pair of hexadecimal characters represents one standard character, the width specification must be an even number. The output format for a variable in AHEX format is A format with half the specified width.
  • If a string in the data is longer than its specified width, the string is truncated and a warning message is displayed. If the string in the data is shorter, it is right-padded with blanks and no warning message is displayed.
  • For fixed-format data, all characters within the specified or implied columns, including leading, trailing, and embedded blanks and punctuation marks, are read as the value of the string.
  • For freefield data without a specified delimiter, string values in the data must be enclosed in quotes if the string contains a blank or a comma. Otherwise, the blank or comma is treated as a delimiter between values. See the topic String Values in Command Specifications for more information.
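
The width rules above can be illustrated outside the program with a short Python sketch. It is not SPSS code and only models the arithmetic: the function names decode_ahex and fit_to_width are hypothetical, the Latin-1 decoding is an assumption, and the width is counted in single-byte characters.

```python
def decode_ahex(field, encoding="latin-1"):
    """Decode an AHEX-style field: two hexadecimal digits per character.

    The encoding default is an assumption made for this illustration.
    """
    if len(field) % 2 != 0:
        raise ValueError("AHEX width must be an even number")
    return bytes.fromhex(field).decode(encoding)


def fit_to_width(value, width):
    """Truncate or right-pad a string with blanks to the given width.

    Returns the fitted value and a flag that is True only when the value
    was truncated, the one case in which the text says a warning appears.
    """
    truncated = len(value) > width
    return value[:width].ljust(width), truncated


print(decode_ahex("4A6F686E"))       # 8 hex columns give a 4-character value
print(fit_to_width("Shortstop", 5))  # longer than the width: truncated
print(fit_to_width("LF", 5))         # shorter than the width: blank-padded
```

In this model an 8-column AHEX field yields a 4-character value, which matches the rule that the output format is A format with half the specified width.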

Example

DATA LIST FILE="/data/wins.txt" FREE /POSTPOS NWINS * POSNAME (A24).
  • POSNAME is specified as a 24-byte string. The asterisk preceding POSNAME indicates that POSTPOS and NWINS are read with the default format. If the asterisk were omitted, the program would apply the A24 format to POSNAME and then issue an error message indicating that there are more variables than specified formats.

Example

DATA LIST FILE="/data/wins.txt" FREE /POSTPOS * NWINS (A5) POSWINS.
  • Both POSTPOS and POSWINS receive the default numeric format F8.2.
  • NWINS receives the specified format of A5.
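
The way the asterisk and an explicit format divide up a freefield variable list in the two examples can be modeled with a small Python sketch. This is a simplified illustration, not the program's actual parser: assign_formats and DEFAULT_FORMAT are hypothetical names, the variable list is assumed to be already split into tokens, and each pair of parentheses is assumed to hold exactly one format.

```python
DEFAULT_FORMAT = "F8.2"  # default numeric format named in the text


def assign_formats(tokens):
    """Map variable names to formats for a pre-split freefield variable list.

    Example input: ["POSTPOS", "*", "NWINS", "(A5)", "POSWINS"].
    """
    formats = {}
    pending = []  # variables seen since the last asterisk or format
    for tok in tokens:
        if tok == "*":
            # Everything listed so far takes the default format.
            for name in pending:
                formats[name] = DEFAULT_FORMAT
            pending = []
        elif tok.startswith("(") and tok.endswith(")"):
            if not pending:
                raise ValueError("format without a preceding variable")
            if len(pending) > 1:
                raise ValueError("more variables than specified formats")
            formats[pending[0]] = tok[1:-1]
            pending = []
        else:
            pending.append(tok)
    # Trailing variables with no explicit format take the default.
    for name in pending:
        formats[name] = DEFAULT_FORMAT
    return formats


print(assign_formats(["POSTPOS", "NWINS", "*", "POSNAME", "(A24)"]))
print(assign_formats(["POSTPOS", "*", "NWINS", "(A5)", "POSWINS"]))
```

Dropping the asterisk from the first list makes this model raise the "more variables than specified formats" error, mirroring the behavior described for the first example.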

String Width in Code Page, UTF-8, and UTF-16 Files

"Column" and width specifications for numbers are the same for all encodings. Each digit represents a single column. For example, a column specification of VAR1 1-4 (F) or a width specification of VAR1 (F4) defines a width wide enough for a 4-digit number.

For string variables, however, column and width specifications can be different for different encodings.

  • For code page and UTF-8 files, column and width specifications for alphabetic characters represent bytes.
  • In code page files, Latin (Roman) characters are each one byte. This includes characters with diacritical marks, such as accents. For example, resume and résumé both have a width of 6 bytes (or columns) in a code page file. Asian characters, however, require two bytes for each character.
  • In UTF-8 files, Latin characters without diacritical marks are each one byte, but characters with diacritical marks require two bytes. For example, resume requires a width of 6 bytes while résumé requires a width of 8 bytes.
  • Characters from the Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, and Tana alphabets each require two bytes in UTF-8 files.
  • Characters from Asian alphabets require three or more bytes for each character in UTF-8 files.
  • For UTF-16 files, column and width specifications represent characters. Therefore, each character requires one column for all alphabets.
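
When choosing a column range or A width for string data, it can help to compute the width under each encoding before writing the DATA LIST command. The Python sketch below does this under stated assumptions: string_width is a hypothetical helper, cp1252 stands in for a Western code page, accented letters are assumed to be precomposed (NFC), and the UTF-16 count assumes one column per character as described above.

```python
import unicodedata


def string_width(text, encoding):
    """Return the number of columns a string value occupies in a data file."""
    text = unicodedata.normalize("NFC", text)  # use precomposed accents
    name = encoding.lower().replace("-", "").replace("_", "")
    if name.startswith("utf16"):
        return len(text)  # UTF-16 files: one column per character
    return len(text.encode(encoding))  # code page and UTF-8 files: bytes


for word in ("resume", "r\u00e9sum\u00e9"):
    print(word,
          string_width(word, "cp1252"),
          string_width(word, "utf-8"),
          string_width(word, "utf-16"))
```

For résumé this gives 6 columns in the code page file, 8 in the UTF-8 file, and 6 in the UTF-16 file, so a width that fits the code page version of the file would truncate the same value in a UTF-8 file.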

For more information on Unicode encoding, go to http://www.unicode.org.