String Formats (DATA LIST command)
String (alphanumeric) variables can contain any numbers, letters, or characters, including special characters and embedded blanks. Numbers entered as values for string variables cannot be used in calculations unless you convert them to numeric format (see RECODE). On DATA LIST, a string variable is defined with an A format if data are in standard character form or an AHEX format if data are in hexadecimal form.
- For fixed-format data, the width of a string variable is either implied by the column location specification or specified by the w on the FORTRAN-like format. For freefield data, the width must be specified on the FORTRAN-like format.
- AHEX format is available only for fixed-format data. Since each set of two hexadecimal characters represents one standard character, the width specification must be an even number. The output format for a variable in AHEX format is A format with half the specified width.
- If a string in the data is longer than its specified width, the string is truncated and a warning message is displayed. If the string in the data is shorter, it is right-padded with blanks and no warning message is displayed.
- For fixed-format data, all characters within the specified or implied columns, including leading, trailing, and embedded blanks and punctuation marks, are read as the value of the string.
- For freefield data without a specified delimiter, string values in the data must be enclosed in quotes if the string contains a blank or a comma. Otherwise, the blank or comma is treated as a delimiter between values. See the topic String Values in Command Specifications for more information.
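The width rules above can be illustrated outside the program with a short Python sketch. It is not the program's implementation. The function names fit_to_width and decode_ahex are hypothetical, and the sketch assumes a single-byte code page (Latin-1) so that one character equals one byte.

```python
def fit_to_width(value: str, width: int) -> tuple[str, bool]:
    """Fit a string value to a declared width.

    Returns the stored value and a flag that is True when the value was
    truncated (the case in which a warning message is displayed).
    """
    if len(value) > width:
        return value[:width], True   # too long: truncated, warning
    return value.ljust(width), False  # too short: right-padded with blanks


def decode_ahex(field: str) -> str:
    """Decode a hexadecimal field in which two hex characters form one character.

    Assumption: the decoded bytes are interpreted as Latin-1.
    """
    if len(field) % 2 != 0:
        raise ValueError("AHEX width must be an even number")
    return bytes.fromhex(field).decode("latin-1")


print(fit_to_width("ABCDEFG", 5))  # truncated to the declared width
print(fit_to_width("AB", 5))       # padded on the right with blanks
print(decode_ahex("414243"))       # 6 hex characters become a 3-character string
```

A 6-column AHEX field decodes to 3 characters, which mirrors the rule that the output format is A with half the specified width.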
Example
DATA LIST FILE="/data/wins.txt" FREE /POSTPOS NWINS * POSNAME (A24).
- POSNAME is specified as a 24-byte string. The asterisk preceding POSNAME indicates that POSTPOS and NWINS are read with the default format. If the asterisk were not specified, the program would apply the A24 format to POSNAME and then issue an error message indicating that there are more variables than specified formats.
Example
DATA LIST FILE="/data/wins.txt" FREE /POSTPOS * NWINS (A5) POSWINS.
- Both POSTPOS and POSWINS receive the default numeric format F8.2.
- NWINS receives the specified format of A5.
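The effect of the asterisk in the two examples can be modeled with a small Python sketch. This is a simplified model for illustration, not the program's parser: the function assign_formats is hypothetical, the variable list is assumed to be already split into tokens, and only a single format inside each pair of parentheses is handled.

```python
DEFAULT_FORMAT = "F8.2"  # default numeric format named in the examples


def assign_formats(tokens: list[str]) -> dict[str, str]:
    """Map each variable name to a format.

    Simplified model: "*" gives the default format to the variables named
    before it; "(FMT)" must be preceded by exactly one pending variable.
    """
    formats: dict[str, str] = {}
    pending: list[str] = []
    for token in tokens:
        if token == "*":
            for name in pending:
                formats[name] = DEFAULT_FORMAT
            pending = []
        elif token.startswith("(") and token.endswith(")"):
            if len(pending) != 1:
                raise ValueError(
                    "number of variables does not match number of formats"
                )
            formats[pending[0]] = token[1:-1]
            pending = []
        else:
            pending.append(token)
    # Variables left without a format receive the default.
    for name in pending:
        formats[name] = DEFAULT_FORMAT
    return formats


print(assign_formats(["POSTPOS", "NWINS", "*", "POSNAME", "(A24)"]))
print(assign_formats(["POSTPOS", "*", "NWINS", "(A5)", "POSWINS"]))
```

Calling assign_formats(["POSTPOS", "NWINS", "POSNAME", "(A24)"]) raises ValueError in this model, which corresponds to the error described for the first example when the asterisk is omitted.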
String Width in Code Page, UTF-8, and UTF-16 Files
"Column" and width specifications for numbers are the same for
all encodings. Each digit represents a single column. For example, a column specification of VAR1 1-4 (F) or a width specification of VAR1 (F4) defines a field wide enough for a 4-digit number.
For string variables, however, column and width specifications can be different for different encodings.
- For code page and UTF-8 files, column and width specifications for alphabetic characters represent bytes.
- In code page files, Latin (Roman) characters are each one byte. This includes characters with diacritical marks, such as accents. For example, resume and résumé both have a width of 6 bytes (or columns) in a code page file. Asian characters, however, require two bytes for each character.
- In UTF-8 files, Latin characters without diacritical marks are each one byte, but characters with diacritical marks require two bytes. For example, resume requires a width of 6 bytes while résumé requires a width of 8 bytes.
- Characters from Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac, and Tana (Thaana) alphabets each require two bytes in UTF-8 files.
- Characters from Asian alphabets require 3 or more bytes for each character in UTF-8 files.
- For UTF-16 files, column and width specifications represent characters. Therefore, each character requires one column for all alphabets.
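The byte counts quoted above can be checked with a short Python sketch. The helper names byte_width and char_width are hypothetical. Windows code page 1252 stands in for a Latin code page and code page 932 for an Asian code page; both choices are assumptions made for illustration. For UTF-16 files the sketch simply counts Python characters, following the rule that each character takes one column.

```python
def byte_width(text: str, encoding: str) -> int:
    """Width in bytes (columns) needed for text in a code page or UTF-8 file."""
    return len(text.encode(encoding))


def char_width(text: str) -> int:
    """Width in characters (columns) needed for text in a UTF-16 file."""
    return len(text)


plain = "resume"
accented = "r\u00e9sum\u00e9"  # "résumé" written with precomposed e-acute

# Code page file: accented Latin characters are still one byte each.
print(byte_width(plain, "cp1252"), byte_width(accented, "cp1252"))  # 6 6

# UTF-8 file: each accented character takes two bytes.
print(byte_width(plain, "utf-8"), byte_width(accented, "utf-8"))    # 6 8

# Greek letter alpha: two bytes in UTF-8.
print(byte_width("\u03b1", "utf-8"))                                # 2

# An Asian character: two bytes in code page 932, three in UTF-8.
print(byte_width("\u65e5", "cp932"), byte_width("\u65e5", "utf-8")) # 2 3

# UTF-16 file: one column per character.
print(char_width(accented))                                         # 6
```

When the same DATA LIST command is used with both a code page file and a UTF-8 file, the string width that fits the code page data may therefore be too narrow for the UTF-8 data.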
For more information on Unicode encoding, go to http://www.unicode.org.