Topic
  • 4 replies
  • Latest Post - ‏2013-04-01T14:14:59Z by SystemAdmin
SystemAdmin
1245 Posts

Pinned topic Whitespace Delimiter

‏2013-04-01T09:27:54Z |
I’m having trouble using the tokenize function with whitespace as the delimiter. As far as I can tell, tokenize works fine as long as the delimiter is only “one” whitespace character, but my input uses “eight” whitespace characters as the delimiter. These are the calls I tried for the “eight” whitespace delimiter. I based all my parameters on Java and C++.

tokens = tokenize (line, " ", true);
tokens = tokenize (line, "/s/s/s/s/s/s/s/s", true);
tokens = tokenize (line, "/ /", true);
tokens = tokenize (line, "\\s+\\s+\\s+\\s+\\s+\\s+\\s+\\s+", true);

Sample Input
MWTRF03030000 60.0009208438695 411.4609184139049 61.0000GSM 020130423 20130324193350 003232599589
  • hnasgaard
    200 Posts

    Re: Whitespace Delimiter

    ‏2013-04-01T10:36:29Z  
    The string passed to tokenize is a set of separators: each character in it is a possible separator on its own, and the characters are not matched together as one sequence. tokenize also doesn't support regular expressions. You could try regexMatch or regexMatchPerl.
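
    To illustrate the set-of-separators behaviour, here is a minimal sketch. It assumes the third argument of tokenize is the keepEmptyTokens flag described in the SPL standard library documentation. With a single space as the only separator, every one of the eight consecutive spaces ends a token, so passing true keeps the empty tokens between them, while passing false should drop them and leave only the fields. The composite name TokenizeDemo and the input file name in.dat are placeholders.

    ```spl
    composite TokenizeDemo {
        graph
            // Read the input file one line at a time (file name is a placeholder).
            stream<rstring s> In = FileSource() {
                param
                    file : "in.dat";
                    format : line;
            }

            () as Out = Custom(In) {
                logic
                    onTuple In : {
                        // " " is a set containing one separator character.
                        // Assumption: false means "do not keep empty tokens",
                        // so runs of spaces do not produce empty entries.
                        list<rstring> tokens = tokenize(s, " ", false);
                        for (rstring t in tokens) println(t);
                    }
            }
    }
    ```

    If the fields can also be separated by tabs, the separator string could list both characters, since each character is treated as its own separator.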
  • SystemAdmin
    1245 Posts

    Re: Whitespace Delimiter

    ‏2013-04-01T11:20:32Z  
    Hi, thank you so much for your suggestion. I tried the regexMatchPerl and regexMatch functions as you suggested, but I can't get the expression for "eight" whitespaces to compile: the regex "\s" cannot be compiled.

    tokens = regexMatchPerl(line, "\s{8}");

    I also tried other regexes, such as:

    tokens = regexMatchPerl(line, "\\s+{8}");
    tokens = regexMatchPerl(line, "\s{8}");
    tokens = regexMatchPerl(line, "[ ]|[ ]|[ ]|[ ]");

    But still nothing works.
  • hnasgaard
    200 Posts

    Re: Whitespace Delimiter

    ‏2013-04-01T12:58:12Z  
    Here's a sample program I tried out that seems to work:
    
    composite Main {
        param
            expression<rstring> $patt : getSubmissionTimeValue("patt");
        graph
            stream<rstring s> In = FileSource() {
                param
                    file : "in.dat";
                    format : line;
            }

            stream<list<rstring> l> C = Custom(In) {
                logic
                    onTuple In : {
                        mutable list<rstring> li = regexMatchPerl(s, $patt);
                        submit({l = li}, C);
                    }
            }

            () as O = Custom(C) {
                logic
                    onTuple C : {
                        for (rstring r in l) println(r);
                    }
            }
    }
    

    Compile as a standalone program
    
    sc -M Main -T
    

    and run it with the following string:
    
    output/bin/standalone Main.patt="(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)"
    

    If you hard-code the pattern string in the SPL source, you will have to escape each '\' as '\\'.
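
    As a sketch of that escaping rule, the regexMatchPerl line in the first Custom operator of the sample program could take the pattern as a string literal instead of $patt. This is only an illustration of the doubled backslashes, not a tested replacement.

    ```spl
    // Same seven-field pattern as on the command line, written as an SPL
    // string literal: every '\' of the regex becomes '\\'.
    mutable list<rstring> li = regexMatchPerl(s,
        "(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)");
    ```

    With the pattern hard-coded, the $patt parameter and the Main.patt argument on the command line would no longer be needed.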
  • SystemAdmin
    1245 Posts

    Re: Whitespace Delimiter

    ‏2013-04-01T14:14:59Z  
    Thank you very much, Mr. hnasgaard. This solves the problem. :)