Topic
  • 13 replies
  • Latest Post - ‏2013-11-21T20:46:06Z by JonPeck
SystemAdmin
SystemAdmin
456 Posts

Pinned topic SPSS 21 and Unicode

‏2013-04-03T11:31:49Z |
When reading CSV files in Unicode before SPSS 21, I didn't have a problem as long as the options were set correctly (Options>General>Character Encoding for Data and Syntax>Unicode). In SPSS 21 this no longer works: CSV and similar files are only read correctly if a BOM (byte order mark) is included at the beginning of the file. Most CSV generators don't add a BOM, so you have to add it manually, for example using Notepad++ (the Encoding menu). This is impractical, especially when handling large numbers of CSV files.

This is annoying in languages such as French, Dutch and German, because letters like é, ë or ô are replaced with strange codes. I made a script that replaces those faulty codes with the correct letters, in case anyone has similar problems.
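The manual Notepad++ step could in principle be automated. The following is a minimal Python sketch, not the script mentioned above; the function names and the folder layout are assumptions made for illustration. It prepends a UTF-8 BOM to every CSV file in a folder and skips files that already start with one. It reads each file fully into memory, so very large files may need a streaming variant.

```python
import codecs
from pathlib import Path
from typing import List


def add_utf8_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM to a file unless it already starts with one.

    Returns True if the file was modified.
    """
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False  # already marked, leave the file untouched
    path.write_bytes(codecs.BOM_UTF8 + data)
    return True


def add_bom_to_folder(folder: str, pattern: str = "*.csv") -> List[str]:
    """Add a BOM to every matching file in `folder`; return the names changed."""
    changed = []
    for csv_path in sorted(Path(folder).glob(pattern)):
        if add_utf8_bom(csv_path):
            changed.append(csv_path.name)
    return changed
```

Running `add_bom_to_folder` twice on the same folder is harmless, because the second pass finds the BOM and changes nothing.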
Updated on 2013-04-03T12:53:07Z at 2013-04-03T12:53:07Z by SystemAdmin
  • SystemAdmin
    SystemAdmin
    456 Posts

    Re: SPSS 21 and Unicode

    ‏2013-04-03T12:53:07Z  
    There is an ENCODING subcommand for GET TEXT. Does this help?

    ENCODING subcommand
    You can use the optional ENCODING subcommand to specify the character encoding for the
    file. The subcommand name is followed by an equals sign (=) and one of the following values
    enclosed in quotes:
    • UTF8. The file is read in UTF-8 Unicode encoding. This is the default in Unicode mode
    (see SET command, Unicode subcommand).
    • Locale. The file is read in the current locale code page encoding. This is the default in code
    page mode (see SET command, LOCALE subcommand).
    The ENCODING subcommand is ignored if the file contains a byte order mark. If the byte order
    mark indicates UTF-8, the file is read as UTF-8. If the byte order mark indicates any other
    Unicode encoding, the file is not read and an error message is issued.
  • joostjakob
    joostjakob
    8 Posts

    Re: SPSS 21 and Unicode

    ‏2013-04-19T11:22:37Z  
    The files in question do not have a byte order mark, so I would expect the default or specified encoding to be used. I tested this again (see the syntax below) with both ENCODING='UTF8' and ENCODING='LOCALE'. Neither works, so I am still working around this with Notepad++.

    GET DATA
      /TYPE=TXT
      /FILE="file.csv"
      /DELCASE=LINE
      /ENCODING='UTF8'
      /DELIMITERS="|"
      /ARRANGEMENT=DELIMITED
      /FIRSTCASE=2
      /IMPORTCASE=ALL 
      /VARIABLES=

    PS it's funny to take part in a conversation between SystemAdmin and SystemAdmin.

  • Albert-Jan
    Albert-Jan
    13 Posts

    Re: SPSS 21 and Unicode

    ‏2013-04-19T19:30:07Z  

    The BOM is there so that SPSS knows the file is UTF-8, because characters 0-127 in UTF-8 are identical to ASCII. What encoding do you use to write the csv file? Are you sure it's not cp1252 (often loosely called latin-1)? In that case you could simply run SET UNICODE = OFF LOCALE = "nl_NL.cp1252" (or another locale, e.g. en_US), then open the file. If the locale was not set before, SPSS uses the locale of the host system (i.e., the regional settings, if you're using windoze).

     

    Albert-Jan

    ps: Didn't know Encoding can be specified with GET DATA; this seems to be new.
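One way to answer the "are you sure it's not cp1252?" question before opening the file in SPSS is to inspect the raw bytes. This is a rough sketch under stated assumptions: `sniff_encoding` is a hypothetical helper, and a failed UTF-8 decode only suggests a code page such as cp1252; it does not prove which one.

```python
import codecs


def sniff_encoding(raw: bytes) -> str:
    """Classify bytes as 'utf-8-sig', 'ascii', 'utf-8' or 'not-utf-8'."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # UTF-8 with a byte order mark
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"  # probably a code page; which one is a guess
    # Pure ASCII is valid in UTF-8 and cp1252 alike, so report it separately.
    if all(byte < 0x80 for byte in raw):
        return "ascii"
    return "utf-8"
```

For a real file, the bytes would come from something like `open("file.csv", "rb").read()`.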

     

  • JonPeck
    JonPeck
    398 Posts

    Re: SPSS 21 and Unicode

    ‏2013-04-19T20:28:54Z  

    The default behavior is to read the file according to the mode Statistics is in, but a BOM always overrides.  If Unicode is off, the code page of the current SPSS locale is used.

    P.S. The poster names were mostly scrambled in the recent site migration.  Because forums were not previously run via the IBM Connections software and now they are, the forum posts had to be migrated differently from the rest of the site, and most of the names were lost.  The dW administrators say that this will be largely fixed in about two weeks.

  • AlexEfremov
    AlexEfremov
    1 Post

    Re: SPSS 21 and Unicode

    ‏2013-04-26T14:59:25Z  

    Hi,

    I have the same problem. In my opinion, SPSS's behavior here is not ideal.

    I've changed my data-reading processes to DATA LIST, where I can specify the ENCODING myself.

    The only thing I miss in DATA LIST is the ability to set a text qualifier when reading data, and there seems to be some loss of efficiency...

    Good Luck!

    Alex

     

     

     

  • JonPeck
    JonPeck
    398 Posts

    Re: SPSS 21 and Unicode

    ‏2013-04-26T18:16:44Z  


    From the Command Syntax Reference (CSR):

    Value Delimiter. For freefield-format data (keywords FREE and LIST), you can specify the
    character(s) that separate data values, or you can use the keyword TAB to specify the tab character
    as the delimiter. Any delimiter other than the TAB keyword must be enclosed in quotation marks,
    and the specification must be enclosed in parentheses, as in DATA LIST FREE(",").

  • Albert-Jan
    Albert-Jan
    13 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T14:43:16Z  
    I have a csv file that is encoded in utf-8 encoding. I would like to write it as a cp1252 encoded file. I use SPSS v20 so I cannot use the /ENCODING subcommand of GET DATA.

    --Now I use DATA LIST. Does this cause the file encoding to be converted to code page encoding (I use SET UNICODE=OFF with a Dutch locale)? I realize I could use SET UNICODE=ON and then open the file, but I would like to change the encoding because I will be merging the data with cp1252-encoded files.

    --Why are the warnings issued? The accented characters are displayed properly. Is this *just in case* a character cannot be represented in cp1252? (I believe this dataset contains exactly one Eastern European character, I don't care about that one ;-).

    DATA LIST FILE = !INPUTDATA ENCODING = "utf-8" LIST (";") RECORDS = 1 SKIP = 1/
      id (T1, F7.0)
      property (T2, A4)
      url (T3, A100)
      price (T4, F8.0)
      street (T5, A300)
      number (T6, A12)
      suffix (T7, A30)
      postcode(T8, A6)
      city (T9, A80).
    cache.
    execute.

    Warning # 1183
    An input record contained an invalid Unicode character or one which is invalid
    in the current locale.
    Command line: 10208 Current case: 129736 Current splitfile group: 1

    Warning # 1183
    An input record contained an invalid Unicode character or one which is invalid
    in the current locale.
    Command line: 10208 Current case: 254877 Current splitfile group: 1

     

    regards,

    Albert-Jan
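To check outside SPSS whether the warnings could come from characters that cp1252 cannot represent, the conversion can be sketched in Python. This is only an illustration under assumptions: the function names and file paths are hypothetical, and it does not reproduce what DATA LIST does internally. It rewrites a UTF-8 file as cp1252, substitutes "?" for anything that does not fit, and reports which characters those were.

```python
from typing import List


def unrepresentable_chars(text: str, target: str = "cp1252") -> List[str]:
    """Return the distinct characters in `text` that `target` cannot encode."""
    bad = set()
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


def transcode_utf8_to_cp1252(src: str, dst: str) -> List[str]:
    """Rewrite a UTF-8 file as cp1252 and report characters that were lost."""
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="utf-8", newline="") as infile:
        text = infile.read()
    # errors="replace" writes "?" for anything outside cp1252.
    with open(dst, "w", encoding="cp1252", errors="replace", newline="") as outfile:
        outfile.write(text)
    return unrepresentable_chars(text)
```

An empty return value would mean every character survived the conversion; a short list (for instance a single Eastern European letter) would be consistent with a handful of warnings.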

  • Albert-Jan
    Albert-Jan
    13 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T14:58:21Z  

    Strange, on closer inspection quite a few street names are still not displayed properly. Maybe this explains the warnings. The first example below should (I think) contain a German ringel s. And these are properly displayed in other cases! Same for "ééN", which is displayed as ééN, but "é'" is displayed correctly in another value. Mixed encodings?

     Breyeller Straã?E
    Car¿No-Antoni Gaudipark
    Dã¶Rper Tore Patiowoning
    Dahliahof Twee-Onder-ééN Kapwoningen Type C
    Borné
    Bouweslân
    Braziliëhof
    Brookstraße
    Büllerlaan
    Burgermeister-Frye-Straße
    Burgerwaard 
    Camping Château Le Verdoyer-Chaletnummer
    Cantabriëstraat
     

  • JonPeck
    JonPeck
    398 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T16:51:09Z  

    In the first example, the mangled character should be shown as a German eszett, which would be C39F as two bytes in UTF-8 or DF in code page 1252. a+tilde in UTF-8 would be C3A3, so it looks like the text is being displayed, in part, as if it were code page 1252. If the input is really in 1252 and has a matching SPSS locale setting, the input is being transcoded incorrectly, and you should contact Technical Support (TS).
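The byte values quoted above can be checked with a few lines of Python. The variable names are illustrative only.

```python
eszett = "ß"

utf8_bytes = eszett.encode("utf-8")     # two bytes: C3 9F
cp1252_bytes = eszett.encode("cp1252")  # one byte: DF
a_tilde_utf8 = "ã".encode("utf-8")      # two bytes: C3 A3

# Reading the two UTF-8 bytes as if they were cp1252 produces two
# separate characters instead of one eszett.
misread = utf8_bytes.decode("cp1252")

print(utf8_bytes.hex(), cp1252_bytes.hex(), a_tilde_utf8.hex(), repr(misread))
```

The last value shows the typical symptom: one accented letter turning into a pair of unrelated characters.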

  • Albert-Jan
    Albert-Jan
    13 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T19:14:54Z  
    Hi Jon,

     

    Thanks for your reply. I was not familiar with the term "eszett", but it turns out it is the same as what I (and the Germans) referred to as "ringel s". The data are scraped from a website and are then preprocessed. After some searching, the values turn out to be mangled to begin with, e.g.: http://tinyurl.com/kymq6st Perhaps they're uploaded as cp1252 to a utf-8 based Content Management System. Do you know if there are any (Python) packages that can "un-mangle" characters?

    Regards,

    Albert-Jan

  • JonPeck
    JonPeck
    398 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T19:27:46Z  

    Interesting.  "Eszett" is what I learned when I studied German back in the dark ages - long s + z, but Wikipedia also refers to long s over round s, which matches your term, but goes on to say "Its German name is Eszett".

    I think you would have to know what the encoding process was in order to consider unmangling, but I doubt that it would be reversible in all cases.  If you had a (correctly encoded) street/place dictionary, a spell checker might work, or you could roll your own string distance metric.  This link, though, http://www.perlmonks.org/bare/?node_id=370892, doesn't hold out much hope.
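As a rough illustration of why reversal depends on knowing the encoding process, here is a minimal Python sketch. It assumes one specific scenario: UTF-8 bytes were decoded once as cp1252 and nothing else happened to the text afterwards. The helper name `unmangle` is hypothetical.

```python
def unmangle(text: str, wrong: str = "cp1252", right: str = "utf-8") -> str:
    """Undo a single wrong-code-page decode; return the input unchanged on failure."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text was never mangled this way, or a later step
        # (case conversion, character replacement) destroyed the bytes.
        return text


# Simulate the assumed mangling: UTF-8 bytes read as cp1252.
mangled = "Dörper".encode("utf-8").decode("cp1252")

repaired = unmangle(mangled)          # the original string comes back
untouched = unmangle("Borné")         # correctly encoded text is left alone
lost = unmangle(mangled.title())      # title-casing after mangling blocks the repair
```

The last case is the irreversible one: once the mangled text has been title-cased, the byte sequence no longer forms valid UTF-8 and the helper has to give up. Python's cp1252 codec also leaves a few byte values undefined, so some characters cannot make the round trip at all.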

  • Albert-Jan
    Albert-Jan
    13 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T20:25:09Z  
    You're right: the Dutch Wikipedia says that in Dutch the eszett is generally called "Ringel-s", which is actually a German term, but most Germans have never heard of it. In German it is called "Eszett [es-tset], scharfes s (sharp s) or Dreierles-s". http://nl.wikipedia.org/wiki/%C3%9F

    I found this script: https://gist.github.com/litchfield/1282752.

    I do have a reference file with all the street names. Often these will match exactly, but for at least 20 % (guesstimate) of the cases they won't, e.g. "Revd John Doe Street" vs. "Reverend J. Doe St" and a gazillion more variations and typos ;-) Luckily I can often use postcode + street number instead
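Matching against a reference list of street names could be sketched with Python's standard `difflib` module. This is one possible similarity measure, not necessarily the string distance metric suggested earlier; the helper name and the 0.6 cutoff are assumptions that would need tuning on real data.

```python
import difflib
from typing import List, Optional


def best_street_match(name: str, reference: List[str],
                      cutoff: float = 0.6) -> Optional[str]:
    """Return the reference street most similar to `name`, or None."""
    # casefold() ignores case and also maps the eszett to "ss".
    lookup = {ref.casefold(): ref for ref in reference}
    hits = difflib.get_close_matches(name.casefold(), list(lookup),
                                     n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


reference_streets = ["Brookstraße", "Büllerlaan", "Cantabriëstraat"]
match = best_street_match("Bullerlaan", reference_streets)
```

Abbreviation variants such as "Revd" versus "Reverend" would likely need their own normalisation step before a similarity score becomes reliable, so falling back to postcode plus street number remains the safer key where it is available.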

     

  • JonPeck
    JonPeck
    398 Posts

    Re: SPSS 21 and Unicode

    ‏2013-11-21T20:46:06Z  

    No harm in trying that script, but it's only going to work for one specific scenario.