bmwilli
41 Posts

Pinned topic csvTokenize(ustring) working differently than csvTokenize(rstring)

‏2013-01-28T19:33:54Z |
I have a situation where I am trying to use csvTokenize to parse a string into a list<rstring>.

One particular string fails to parse if the source string is an rstring, but if the field is a ustring it parses fine.

Here is the example:


composite TokenizeTest {
  graph
    () as SinkTest = Custom() {
      logic
        state : rstring sourceString = "field1,\"field2\",\"field\"3\"\",field4";
        onProcess : {
          ustring us = (ustring)sourceString;
          rstring rs = sourceString;
          println("-------- USTRING --------");
          println(us);
          list<rstring> ufields = csvTokenize(us);
          println(ufields);
          println("-------- RSTRING --------");
          println(rs);
          list<rstring> rfields = csvTokenize(rs);
          println(rfields);
        }
    }
}


When run, here is the output:


"-------- USTRING --------"
"field1,\"field2\",\"field\"3\"\",field4"
["field1","\"field2\"","\"field\"3\"\"","field4"]
"-------- RSTRING --------"
"field1,\"field2\",\"field\"3\"\",field4"
28 Jan 2013 14:14:03.662 [27023] ERROR #splapplog,J[0],P[0],SinkTest,spl_pe M[PEImpl.cpp:logTerminatingException:1172]  - CDISR5033E: An exception occurred during the execution of the SinkTest operator. Processing element number 0 is terminating.
28 Jan 2013 14:14:03.663 [27023] ERROR #splapptrc,J[0],P[0],SinkTest,#splapptrc,SinkTest,spl_operator M[OperatorThread.cpp:run:85]  - CDISR5030E: An exception occurred during the execution of the SinkTest operator. The exception is csv string ('field1,"field2","field"3"",field4') has incorrect format: token ending quote (") followed by a character other that comma (,) at index 23.
28 Jan 2013 14:14:03.663 [27021] ERROR #splapptrc,J[0],P[0],SinkTest,spl_operator M[PEImpl.cpp:process:654]  - CDISR5053E: Runtime failures occurred in the following operators: SinkTest.


Can anyone explain why there is a difference?

I prefer the behavior of the csvTokenize(ustring) version.

Thanks,
Brian
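
For readers trying to interpret the exception, the short Python check below (Python is used only as a neutral scratchpad here, it is not part of Streams) looks at what sits at the reported index 23 of the CSV text quoted in the error message. The variable names are illustrative.

```python
# The CSV text exactly as quoted in the exception message
# (the SPL backslash escapes are already resolved here).
source = 'field1,"field2","field"3"",field4'

# The exception complains about the character "at index 23".
index = 23
before = source[index - 1]   # the quote that the rstring version treats as the closing quote
at = source[index]           # the character that follows it

print(repr(before), repr(at))

# Show the text on either side of the reported position.
print(source[:index], "|", source[index:])
```

So the rstring version appears to treat the quote after `field` as the end of the third token and then objects to the `3` that follows it.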

Message was edited by: kjerick - Fixed problematic developerWorks rendering with matching {code} tags.
  • kjerick
    227 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-28T21:03:56Z  
    • bmwilli
    • ‏2013-01-28T19:36:02Z
    Paste of output ignored some of my characters...here is the output
    
    
    Hi Brian,

    It wasn't necessarily your paste that ignored some of your characters, but more likely the quirks of developerWorks forums when it comes to special characters. Whenever posting code snippets or output, or anything that might contain special characters, it is best to put it in {code} tags as follows:

    {code}
    Put the code or text you want to be left alone between these tags. Note that unlike markup languages, the start block and end block tags are identical.
    {code}

    I will go back and fix your previous two posts for better readability.

    Best regards,
    Kevin
  • bmwilli
    41 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-28T21:09:08Z  
    Paste of output ignored some of my characters...here is the output
    
    "-------- USTRING --------"
    "field1,\"field2\",\"field\"3\"\",field4"
    ["field1","\"field2\"","\"field\"3\"\"","field4"]
    "-------- RSTRING --------"
    "field1,\"field2\",\"field\"3\"\",field4"
    28 Jan 2013 14:14:03.662 [27023] ERROR #splapplog,J[0],P[0],SinkTest,spl_pe M[PEImpl.cpp:logTerminatingException:1172]  - CDISR5033E: An exception occurred during the execution of the SinkTest operator. Processing element number 0 is terminating.
    28 Jan 2013 14:14:03.663 [27023] ERROR #splapptrc,J[0],P[0],SinkTest,#splapptrc,SinkTest,spl_operator M[OperatorThread.cpp:run:85]  - CDISR5030E: An exception occurred during the execution of the SinkTest operator. The exception is csv string ('field1,"field2","field"3"",field4') has incorrect format: token ending quote (") followed by a character other that comma (,) at index 23.
    28 Jan 2013 14:14:03.663 [27021] ERROR #splapptrc,J[0],P[0],SinkTest,spl_operator M[PEImpl.cpp:process:654]  - CDISR5053E: Runtime failures occurred in the following operators: SinkTest.
    


  • hnasgaard
    200 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-29T13:15:22Z  
    I will investigate.
  • SystemAdmin
    1245 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-29T15:58:35Z  
    There is an error in your sourceString, in the way you are using the escape character "\":

    
    rstring sourceString = "field1,\"field2\",\"field\"3\"\",field4";
    


    Try using either this sourceString:

    
    rstring sourceString = "field1,field2,field3,field4";
    


    or this one:

    
    rstring sourceString = "\"field1\",\"field2\",\"field3\",\"field4\"";
    


    Either one should work. I've tested both with no errors on output.

    
    composite csvTokenizerTest {
      graph
        () as out = Custom() {
          logic
            state : rstring sourceString = "\"field1\",\"field2\",\"field3\",\"field4\"";
            onProcess : {
              ustring us = (ustring)sourceString;
              rstring rs = sourceString;
              println("---------USTRING---------");
              println(us);
              list<rstring> ufields = csvTokenize(us);
              println(ufields);
              println("---------RSTRING---------");
              println(rs);
              list<rstring> rfields = csvTokenize(rs);
              println(rfields);
            }
        }
    }
    


    Regards,
    Daniel Lopez
  • bmwilli
    41 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-29T18:21:41Z  
    Thanks, but I do not have an error in my source string; this is the way the data arrives. The third field has embedded quotes, and I had to escape them for the purposes of the example.
    The point of my email is that my string is tokenized correctly as a ustring but not as an rstring.

    In looking at the Streams .h files, I see the following:

    csvTokenize(ustring) simply calls tokenize(ustring,",",true)

    however

    csvTokenize(rstring) seems to be a declaration, meaning its body is implemented in the library.

    Perhaps this was an attempt at optimization, but the logic ended up different?

    Brian
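
To make the suspected difference concrete, here is a rough Python model of the two behaviors seen in the output above. It is only a sketch built from the observed results and the error text, not the actual Streams implementation: `tokenize_plain` stands in for a plain `tokenize(str, ",", true)`-style split, and `tokenize_strict` stands in for a quote-aware parser that insists on a comma after a closing quote. Both function names are made up for this illustration.

```python
def tokenize_plain(text, delimiter=","):
    """Split on every delimiter, keep empty tokens, leave quotes untouched."""
    return text.split(delimiter)


def tokenize_strict(text):
    """Quote-aware splitter: a closing quote must be followed by a comma."""
    tokens = []
    i, n = 0, len(text)
    while True:
        if i < n and text[i] == '"':
            # Quoted token: read up to the next quote character.
            end = text.find('"', i + 1)
            if end == -1:
                raise ValueError(f"unterminated quoted token starting at index {i}")
            tokens.append(text[i + 1:end])
            i = end + 1
            if i < n and text[i] != ",":
                raise ValueError(
                    f"ending quote followed by {text[i]!r} instead of a comma at index {i}"
                )
        else:
            # Unquoted token: read up to the next comma (or the end of the text).
            end = text.find(",", i)
            if end == -1:
                end = n
            tokens.append(text[i:end])
            i = end
        if i >= n:
            return tokens
        i += 1  # skip the comma separating two tokens


source = 'field1,"field2","field"3"",field4'

# Plain split: quotes are kept, nothing is validated.
print(tokenize_plain(source))

# Strict parse: fails on the character after the quote that closes "field".
try:
    tokenize_strict(source)
except ValueError as exc:
    print("strict parser rejected the input:", exc)
```

Under this model the plain split yields the same four tokens the ustring call printed (quotes included), while the strict parser stops at index 23, the same position the rstring call reported.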
  • SystemAdmin
    1245 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-29T20:12:33Z  
    I see what the structure of your CSV string is:

    
    field1,"field2","field"3"",field4
    


    But note that the correct way to write a CSV field that contains double quotes inside a double-quoted string is to double each embedded quote:

    
    field1,"field2","field""3""",field4
    


    But you're right. There must be something wrong in the implementation of csvTokenize(rstring) because it showed the same error with the corrected structure.

    Regards,
    Daniel Lopez
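
As a point of comparison outside Streams, Python's standard `csv` module follows the same doubled-quote convention. The sketch below only illustrates the quoting rule; it says nothing about how `csvTokenize` is implemented. The variable names are illustrative.

```python
import csv

# Embedded quotes doubled inside the quoted third field.
well_formed = 'field1,"field2","field""3""",field4'
# The data as it arrives in the original post.
malformed = 'field1,"field2","field"3"",field4'

# The well-formed line parses, and the doubled quotes collapse to single ones.
parsed = next(csv.reader([well_formed]))
print(parsed)

# With strict=True the reader refuses a closing quote that is not
# followed by the delimiter.
try:
    next(csv.reader([malformed], strict=True))
    rejected = False
except csv.Error as exc:
    rejected = True
    print("rejected:", exc)
```

Here the third field of the well-formed line comes back as `field"3"`, and the strict reader rejects the original data for the same kind of reason the rstring error message gives.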
  • hnasgaard
    200 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-31T18:33:55Z  
    I think this is a defect in the product. The behavior with rstring should be the same as that with ustring.
  • bmwilli
    41 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-31T18:39:53Z  
    Thanks for the response. I assume you will be entering the defect report.
  • hnasgaard
    200 Posts

    Re: csvTokenize(ustring) working differently than csvTokenize(rstring)

    ‏2013-01-31T19:01:40Z  
    Yes. Done.