Topic
6 replies · Latest post 2013-10-17T15:15:20Z by david.cyr
david.cyr
20 Posts

Pinned topic odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

2013-09-26T16:20:18Z

I have a project where I added the big data toolkit (in dependencies) and an HDFS File Sink at the end to write my output to HDFS, and I have encountered a strange problem that I did not hit in my previous project using the big data toolkit.

The project will compile fine (I have the bigdata toolkit installed, I have the bigdata toolkit in dependencies, I have a valid HADOOP_HOME set). Another project in the same workspace will compile and run fine in Standalone mode. When I try to run this particular project in "Standalone" mode, I get:

Exception in thread "Thread-11" java.lang.NoClassDefFoundError: org.apache.hadoop.conf.Configuration
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
 at java.net.URLClassLoader.findClass(URLClassLoader.java:434)

This does not happen for my other project (I can run the other project fine in Standalone mode). However, if I compile this project at the command line and run it distributed, it works fine. This project differs from the other one in that it uses the com.ibm.streams.text toolkit (the other project uses the SPSS analytics toolkit, but not the Text toolkit). I even tried going into 'Edit Configuration' from the Launch dialog and hard-coding an additional env setting for HADOOP_HOME, just to be even more explicit about the presence of the env variable.
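One way to separate a missing-env-variable problem from a classpath problem is a small generic Java probe (this is just a sketch, not part of the toolkit; the class name comes from the error message above). If the probe reports the class as not loadable inside the standalone process, then setting HADOOP_HOME alone cannot help, because the launcher is not putting the Hadoop jars on the classpath:

```java
// Minimal sketch: check whether a class can be loaded by the current
// classloader -- the same lookup that fails with ClassNotFoundException
// in the standalone run described above.
public class ClasspathProbe {

    static boolean isLoadable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // java.lang.String is always loadable; the Hadoop class is only
        // loadable if the Hadoop jars are actually on the classpath.
        System.out.println("java.lang.String loadable: "
                + isLoadable("java.lang.String"));
        System.out.println("org.apache.hadoop.conf.Configuration loadable: "
                + isLoadable("org.apache.hadoop.conf.Configuration"));
    }
}
```

Running this with the same classpath as the standalone executable makes it obvious whether the Hadoop jars are visible to that process at all.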

This isn't super critical since I have a workaround that can keep me moving forward, but development and debugging would be easier if I could conveniently work in standalone mode.

This (and the workaround of using the distributed deployment rather than standalone deployment) was discussed towards the end of this thread: https://www.ibm.com/developerworks/community/forums/html/topic?id=d428aa44-06bb-455a-85f6-172baa3d4e33&ps=25, but there was no mention of a way to get this working in standalone mode (which is really convenient during the dev/debugging process), so I thought I'd ask in case anybody knows the answer. The theory in that thread was along the lines that two Java-based toolkits were somehow conflicting, but nothing was certain or definite.

Thanks in advance for any help or assistance you can provide!

d


  • Stan
    76 Posts

    Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

    2013-09-27T19:25:55Z in response to david.cyr

    Can you provide code that demonstrates this situation so we can investigate further?

    • david.cyr
      20 Posts

      Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

      2013-09-27T20:51:47Z in response to Stan

      Here's a simple example to demonstrate the problem:

      --------------------- HERE IS THE getResult.aql file -----
      module getResult;

      create dictionary Dog as ('dog','puppy');

      create view MyAnswer as
      extract dictionary Dog
      on D.text as myResult
      from Document D;

      output view MyAnswer;
      ----------------------------------------------------------

      -----------------------HERE IS THE streams .spl file-----
      --Note: this runs fine without the HDFSFileSink.
      --It also works if I leave the HDFSFileSink in and take
      --out the TextExtract (i.e. wire straight from the Beacon
      --to the HDFSFileSink with no TextExtract in between).
      --------
      --If both are present, though, I get this error:
      --Exception in thread "Thread-11" java.lang.NoClassDefFoundError: org.apache.hadoop.conf.Configuration
      --Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration



      namespace application ;

      use com.ibm.streams.text.analytics::TextExtract ;
      use com.ibm.streams.bigdata.hdfs::HDFSFileSink ;

      composite ScratchAQLandHDFS
      {
       graph
        (stream<rstring text> BeaconOutput) as Beacon_1 = Beacon()
        {
         param
          iterations : 1 ;
         output
          BeaconOutput : text = "The quick brown fox jumped over the lazy dog." ;
        }

        (stream<rstring myResult> TextExtract_2_out0) as TextExtract_2 =
         TextExtract(BeaconOutput)
        {
         param
          outputMode : 'multiPort' ;
          uncompiledModules : 'getResult' ;
          moduleOutputDir : 'moduleGetResult' ;
        }

        () as FileSink_3 = FileSink(TextExtract_2_out0)
        {
         param
          file : 'out.out' ;
        }

        () as HDFSFileSink_4 = HDFSFileSink(TextExtract_2_out0)
        {
         param
          file : 'junk/file%FILENUM.out' ;
          format : txt ;
          hdfsConfigFile : 'hdfsconfig.txt' ;
        }

      }


      • david.cyr
        20 Posts

        Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

        2013-10-10T21:02:31Z in response to david.cyr

        I just wanted to check whether you were able to duplicate this problem. I do have a workaround (compiling at the command line and running in distributed mode), but if there is a good explanation for this I'd like to have it, so that when I hand off the code I can explain what's going on.

        Thanks in advance for any insight you can provide,

        d

        • Stan
          76 Posts

          Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

          2013-10-16T19:17:25Z in response to david.cyr

          Sorry for the delay on this. I wanted to give you a status update.

          I now have my environment set up, but I am getting AQL compile failures when launching the application (distributed). I will get a Text Analytics person to look at the error:

          Exception in thread "Thread-6" com.ibm.avatar.api.exceptions.CompilerException: Compiling AQL encountered 1 errors:
          null
                  at com.ibm.avatar.api.CompileAQL.compile(CompileAQL.java:264)
                  at com.ibm.streams.text.SystemTAdapter.initialize(SystemTAdapter.java:526)

          I am using the getResult module posted previously - it did not appear to require any modification.

        • SXVK_Roopa_v
          13 Posts

          Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

          2013-10-17T06:20:03Z in response to david.cyr

          David,

          This is a known issue, and it is documented in the InfoCenter: the Big Data Toolkit operators cannot be used with any other Java operators in the same partition. The same application works in distributed mode because the PEs might not be deployed in the same process, whereas in standalone mode all of the operators run in a single PE, hence the failure.

          Snippet from the infocenter.

          http://pic.dhe.ibm.com/infocenter/streams/v3r1/index.jsp

          Infocenter 3.1 -> Reference->Toolkit reference > SPL specialized toolkits > Big Data Toolkit


          Operators in the Big Data Toolkit
          The Big Data Toolkit contains operators that work with the Hadoop Distributed File System (HDFS) or IBM® InfoSphere® Data Explorer.

          Use the HDFS operators in an InfoSphere Streams job to read and write to a Hadoop Distributed File System.
          Note: These operators cannot be fused with InfoSphere Streams Java™ operators.
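          For distributed runs, this constraint can be made explicit with an SPL placement config (a sketch, assuming the standard `config placement` clause; `partitionIsolation` asks the compiler to keep the operator in its own PE, so it can never be fused with a Java operator):

          ```spl
          // Sketch: keep the HDFS sink in its own partition so it is never
          // fused with Java operators such as TextExtract.
          () as HDFSFileSink_4 = HDFSFileSink(TextExtract_2_out0)
          {
           param
            file : 'junk/file%FILENUM.out' ;
            format : txt ;
            hdfsConfigFile : 'hdfsconfig.txt' ;
           config
            placement : partitionIsolation ;
          }
          ```

          Note that this only helps in distributed mode; in standalone mode every operator runs in a single process regardless of placement configs, which is why the failure cannot be avoided there.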


          • david.cyr
            20 Posts

            Re: odd problem with HDFS File Sink not working in 'standalone' mode, but working distributed

            2013-10-17T15:15:20Z in response to SXVK_Roopa_v

            Thanks much for the follow-up and the clarification! Sorry, I didn't see that note in the InfoCenter (or fully internalize it).

            Fortunately, as you mentioned, there's an easy workaround (distributed mode, or testing as a properly deployed app).

            Thanks again for the followup.

            d