Topic
  • 6 replies
  • Latest Post - 2013-10-17T15:15:20Z by david.cyr
david.cyr
20 Posts

Pinned topic odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

2013-09-26T16:20:18Z |

I have a project where I have added the Big Data toolkit (in dependencies) and an HDFS File Sink at the end to write my output to HDFS, and I have encountered a strange problem that I did not hit in my previous project using the Big Data toolkit.

The project compiles fine (I have the Big Data toolkit installed and listed in dependencies, and a valid HADOOP_HOME set). Another project in the same workspace compiles and runs fine in Standalone mode. When I try to run this particular project in Standalone mode, I get:

Exception in thread "Thread-11" java.lang.NoClassDefFoundError: org.apache.hadoop.conf.Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader.findClass(URLClassLoader.java:434)

This does not happen for my other project (I can run that project fine in Standalone mode). However, if I compile this project at the command line and run it distributed, it works fine. This project differs from the other in that it uses the com.ibm.streams.text toolkit (the other project uses the SPSS Analytics toolkit, but not the Text toolkit). I even tried going into 'Edit Configuration' from the launch dialog and hard-coding an additional environment setting for HADOOP_HOME, just to be even more explicit about the presence of the environment variable.

This isn't super critical since I have a workaround that can keep me moving forward, but development and debugging would be easier if I could conveniently work in standalone mode.

This (and the workaround of using distributed deployment rather than standalone deployment) was discussed towards the end of this thread: https://www.ibm.com/developerworks/community/forums/html/topic?id=d428aa44-06bb-455a-85f6-172baa3d4e33&ps=25, but there was no mention of a way to get this working in standalone mode (which is really convenient during the dev/debugging process), so I thought I'd ask in case anybody knows the answer. The theory in that thread was that two Java-based toolkits were somehow conflicting, but nothing was certain or definite.

Thanks in advance for any help or assistance you can provide!

d

  • Stan
    76 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-09-27T19:25:55Z

    Can you provide code that demonstrates this situation so we can investigate this further?

  • david.cyr
    20 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-09-27T20:51:47Z
    • Stan
    • 2013-09-27T19:25:55Z

    Can you provide code that demonstrates this situation so we can investigate this further?

    Here's a simple example to demonstrate the problem:

    --------------------- HERE IS THE getResult.aql file -----
    module getResult;

    create dictionary Dog as ('dog','puppy');

    create view MyAnswer as
    extract dictionary Dog
    on D.text as myResult
    from Document D;

    output view MyAnswer;
    ----------------------------------------------------------

    -----------------------HERE IS THE streams .spl file-----
    // Note: this runs fine without the HDFSFileSink,
    // and if I leave the HDFSFileSink in and take out the TextExtract
    // it works as well (i.e. wire straight from the Beacon to the
    // HDFSFileSink with no TextExtract in between).
    //
    // If both are present, though, I get this error:
    //   Exception in thread "Thread-11" java.lang.NoClassDefFoundError: org.apache.hadoop.conf.Configuration
    //   Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration

     


    namespace application ;

    use com.ibm.streams.text.analytics::TextExtract ;
    use com.ibm.streams.bigdata.hdfs::HDFSFileSink ;

    composite ScratchAQLandHDFS
    {
     graph
      (stream<rstring text> BeaconOutput) as Beacon_1 = Beacon()
      {
       param
        iterations : 1 ;
       output
        BeaconOutput : text = "The quick brown fox jumped over the lazy dog." ;
      }

      (stream<rstring myResult> TextExtract_2_out0) as TextExtract_2 =
       TextExtract(BeaconOutput)
      {
       param
        outputMode : 'multiPort' ;
        uncompiledModules : 'getResult' ;
        moduleOutputDir : 'moduleGetResult' ;
      }

      () as FileSink_3 = FileSink(TextExtract_2_out0)
      {
       param
        file : 'out.out' ;
      }

      () as HDFSFileSink_4 = HDFSFileSink(TextExtract_2_out0)
      {
       param
        file : 'junk/file%FILENUM.out' ;
        format : txt ;
        hdfsConfigFile : 'hdfsconfig.txt' ;
      }

    }

     

  • david.cyr
    20 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-10-10T21:02:31Z
    • david.cyr
    • 2013-09-27T20:51:47Z

    I just wanted to check if you were able to duplicate this problem or not. I do have a workaround (running in distributed mode/compiling at the command line and running that way), but if there was a good explanation for this I'd like to have it so when I transition code I can explain what's going on.

    Thanks in advance for any insight you can provide,

    d

  • Stan
    76 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-10-16T19:17:25Z
    • david.cyr
    • 2013-10-10T21:02:31Z

    Sorry for the delay on this. I wanted to give you a status update.

    I now have my environment set up, but I am getting AQL compile failures when launching the application (distributed). I will get a Text Analytics person to look at the error:

    Exception in thread "Thread-6" com.ibm.avatar.api.exceptions.CompilerException: Compiling AQL encountered 1 errors:
    null
            at com.ibm.avatar.api.CompileAQL.compile(CompileAQL.java:264)
            at com.ibm.streams.text.SystemTAdapter.initialize(SystemTAdapter.java:526)

    I am using the getResult.aql posted previously; it did not appear to require any modification.

  • SXVK_Roopa_v
    13 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-10-17T06:20:03Z
    • david.cyr
    • 2013-10-10T21:02:31Z

    David,

    This is a known issue and is documented in the InfoCenter: the Big Data toolkit operators can't be used with any other Java operators. The same application works in distributed mode because the PEs might not be deployed on the same server, whereas in standalone mode everything runs together on the same server (in a single process), hence the failure.

    Snippet from the InfoCenter:

    http://pic.dhe.ibm.com/infocenter/streams/v3r1/index.jsp

    Infocenter 3.1 > Reference > Toolkit reference > SPL specialized toolkits > Big Data Toolkit


    Operators in the Big Data Toolkit
    The Big Data Toolkit contains operators that work with Hadoop Distributed File System (HDFS) or IBM® InfoSphere® Data Explorer.

    Use the HDFS operators in an InfoSphere Streams job to read and write to a Hadoop Distributed File System.
    Note: These operators cannot be fused with InfoSphere Streams Java™ operators.
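
    In case it helps for distributed runs: below is a minimal, untested sketch (based on the HDFSFileSink invocation from the example above and the standard SPL placement config) that keeps the sink in its own PE so it is never fused with the Java-based TextExtract operator. This does not help standalone mode, since a standalone application always runs everything in a single process.

    () as HDFSFileSink_4 = HDFSFileSink(TextExtract_2_out0)
    {
     param
      file : 'junk/file%FILENUM.out' ;
      format : txt ;
      hdfsConfigFile : 'hdfsconfig.txt' ;
     config
      // assumption: isolating this operator in its own partition (PE) keeps it
      // from being fused with Java operators when the job is submitted in distributed mode
      placement : partitionIsolation ;
    }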

     

  • david.cyr
    20 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-10-17T15:15:20Z

    Thanks much for the follow-up and the clarification! Sorry, I didn't see that note in the InfoCenter (or fully internalize it).

    Fortunately, as you mentioned, there's an easy workaround (running in Distributed mode, or testing as a properly deployed app).

    Thanks again for the follow-up.

    d