Topic
4 replies. Latest post 2010-08-23T17:43:08Z by SystemAdmin
Kryten
6 Posts
ACCEPTED ANSWER

Pinned topic Best way to Interface to running job from external application

2010-08-17T21:13:10Z
Hi,

Given the scenario in the attached PDF:

1) A job running in Streams collecting state information
2) HTTP clients wanting to get a dump of the current state on demand
3) An intermediate 'Shim' to provide data translation, AAA, caching etc

What is the most appropriate way to get at data being modified inside a Streams job from another machine (the Shim, which is running a web server for clients)?

Thanks,
Simon


  • kjerick
    227 Posts

    Re: Best way to Interface to running job from external application

    2010-08-18T02:22:32Z in response to Kryten
    Hi Simon, welcome to the InfoSphere Streams Forum! You have an interesting first question. I will check with some folks in development who are more in tune with the application development space, and either I or one of them will try to respond to your question as soon as we can.

    Thanks and best regards,
    Kevin
  • SystemAdmin
    1245 Posts

    Re: Best way to Interface to running job from external application

    2010-08-18T10:58:02Z in response to Kryten
    Hi Simon,
    You can evaluate the following file-, database-, and network-socket-based approaches to address what you described. All of them have merits and demerits.

    1) The Streams application could write the state information to a file via a Sink operator. Outside of the Streams instance, a script can periodically (a) open the Sink output file, (b) sequentially fetch the necessary blocks of state data, (c) optionally transform/enrich the data or perform any format conversion, and (d) publish it to the htdocs directory of any industry-standard web server (Apache, lighttpd, WEBrick, etc., with the necessary AAA and caching services enabled). The HTTP clients can then access the state information stored in those published files.
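    A minimal sketch of the polling script in steps (a)–(d), in Python rather than shell. The two-column key,value record layout, the file names, and the JSON output format are assumptions for illustration, not the Sink's actual output format:

```python
import json
import os
import tempfile

def publish_state(sink_path, htdocs_path):
    """Read comma-separated key,value records from the Sink output file
    (hypothetical layout) and publish the merged state as a JSON document
    under the web server's htdocs tree."""
    state = {}
    with open(sink_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            key, _, value = line.partition(",")
            state[key] = value  # later records overwrite earlier ones
    # Write to a temp file and rename, so HTTP clients never see a
    # half-written document.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(htdocs_path))
    with os.fdopen(fd, "w") as out:
        json.dump(state, out)
    os.replace(tmp, htdocs_path)
    return state
```

    Run from cron (or a loop with a sleep), this gives the periodic publish step; AAA and caching stay in the web server, as described.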

    2) The Streams application could periodically write the state information to a (DB2 or solidDB) database table (as well-defined fields or as a BLOB) with timestamps and other columns as necessary. Then a server-side web application (written in J2EE or in any of the simpler, popular web scripting languages) can read the required state information from the table and deliver it to the requesting HTTP clients. The underlying web application server infrastructure will provide the AAA and caching services.
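    For illustration, the table and the web application's read path in option 2 can be sketched like this, using Python's bundled sqlite3 purely as a stand-in for DB2/solidDB; the schema and column names are hypothetical:

```python
import sqlite3
import time

def init_state_table(conn):
    # Timestamped state rows, one per periodic write from the Streams side.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS job_state"
        " (job_id TEXT, ts REAL, state TEXT)")

def write_state(conn, job_id, state):
    # In the real design, the Streams application performs this insert.
    conn.execute("INSERT INTO job_state VALUES (?, ?, ?)",
                 (job_id, time.time(), state))
    conn.commit()

def latest_state(conn, job_id):
    # The web application serves the newest row for a job on each request.
    # rowid breaks ties when two writes share a timestamp.
    row = conn.execute(
        "SELECT state FROM job_state WHERE job_id = ?"
        " ORDER BY ts DESC, rowid DESC LIMIT 1", (job_id,)).fetchone()
    return row[0] if row else None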

    3) The Streams application could write the state information to a UDP client Sink operator. Outside of Streams, use a script or the netcat utility to open a UDP server socket and receive the information sent by the UDP client Sink, apply any data transformation, and write/publish it to appropriate file(s) in an htdocs directory of a web server (WebSphere, Apache, lighttpd, etc.). The HTTP clients can then pull the data from these published files.
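    The UDP server-socket step can be sketched as follows, as a Python stand-in for the netcat step; the datagram payloads and the record count are hypothetical:

```python
import socket

def receive_state(sock, max_records, timeout=5.0):
    """Drain up to max_records datagrams from an already-bound UDP
    server socket (the role netcat plays above) and return them as
    decoded strings, one per datagram, ready for transformation and
    publishing."""
    sock.settimeout(timeout)
    records = []
    while len(records) < max_records:
        data, _addr = sock.recvfrom(65535)  # one state record per datagram
        records.append(data.decode())
    return records
```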

    Please comment on whether the above approaches come close to addressing your need. If we can think of more ideas on this question, we will post them here.

    Regards,
    Senthil.
    • Kryten
      6 Posts

      Re: Best way to Interface to running job from external application

      2010-08-18T18:16:22Z in response to SystemAdmin
      Hi, answers inline below:

      > 1) Streams application could write the state information to a file via a Sink operator. Outside of the Streams instance, a script can periodically (a) open the Sink output file, (b) sequentially fetch the
      > necessary blocks of state data, (c) optionally transform/enrich or perform any format conversion and (d) publish it to the htdocs directory of any industry standard Web server (Apache, lighttpd,
      > WEBrick etc. with the necessary AAA and Caching services enabled). Now, the HTTP clients can access the state information stored in those published files.

      Within the deployment of my application, the web server would be on a separate machine, so file-based storage would not really work... I was thinking more of an RPC/TCP server solution.

      Also, data may be load-balanced across the inputs of the nodes in the Streams cluster (with 1/n-th of the data being processed on each node of an n-node cluster), so fetching the data for a whole job may require a number of merges. I was thinking that a TCP server could service the requests of the Shim: the TCP server would have a command-and-control stream going to all the running jobs. When a client connected to the server and gave a command to "dump state", the jobs would output data to the TCP server (via their output streams) and the TCP server would then reply to the Shim (assuming a forking TCP server). Does that seem possible/sensible?
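      For what it's worth, the server I have in mind could be sketched roughly like this in Python (threads rather than fork; the "dump state" command, the job names, and the line-oriented wire format are all hypothetical, and the jobs' output streams are simulated by calling update() directly):

```python
import socket
import threading

class StateServer:
    """Minimal threaded TCP server sketching the idea: jobs push state
    updates in (simulated here via update()), and a client that sends
    the hypothetical "dump state" command gets the merged state back."""

    def __init__(self, host="127.0.0.1", port=0):
        self._state = {}
        self._lock = threading.Lock()
        self._srv = socket.create_server((host, port))
        self.port = self._srv.getsockname()[1]
        threading.Thread(target=self._accept_loop, daemon=True).start()

    def update(self, job_id, state):
        # In the real design, this data would arrive on the jobs'
        # output streams via the command-and-control path.
        with self._lock:
            self._state[job_id] = state

    def _accept_loop(self):
        while True:
            conn, _addr = self._srv.accept()
            threading.Thread(target=self._serve, args=(conn,),
                             daemon=True).start()

    def _serve(self, conn):
        # One thread per client connection, standing in for fork().
        with conn:
            cmd = conn.makefile("r").readline().strip()
            if cmd == "dump state":
                with self._lock:
                    reply = "\n".join(f"{k}={v}" for k, v
                                      in sorted(self._state.items()))
                conn.sendall(reply.encode() + b"\n")
```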

      > 2) Streams application could periodically write the state information in a (DB2 or SolidDB) database table (as well-defined fields or as a BLOB) with timestamps and other columns as necessary.
      > Then, a server side Web application (written in J2EE or using any of the simpler and popular Web scripting languages) can provide logic to read the required state information stored in the table
      > and deliver it to the requesting HTTP clients. Underlying Web application server infrastructure will provide the AAA and Caching services.

      This would be OK apart from the update interval/lag and the need for a database. It would also cause lots of redundant information to be sent between the Streams cluster and the Shim... non-ideal if you have lots of running models that could be queried.

      > 3) Streams application could write the state information to a UDP client sink operator. Outside of Streams, use a script or the netcat utility to open a UDP server socket and receive information sent
      > from the UDP client sink, then apply data transformation, and write/publish it to appropriate file(s) in a htdocs directory of a Web server (WebSphere, Apache, lighttpd etc.). Now the HTTP clients
      > can pull the data from these published files.

      UDP is not really appropriate because under high I/O load the Linux kernel will readily drop datagrams (more often than you would think). All socket-based comms should be reliable.

      In terms of the web server/Shim, the thought so far is that it will provide dynamic content (maybe cached content, depending on the algorithms running in Streams) and a nice API for automated clients.

      Does the facility exist for the TCP Source/Sink operator behaviour I described?

      Thanks,
      Simon
      • SystemAdmin
        1245 Posts

        Re: Best way to Interface to running job from external application

        2010-08-23T17:43:08Z in response to Kryten
        Hi Simon,
        Sorry for the delayed reply as I was traveling.

        In my earlier reply, I should have mentioned the availability of TCP and UDP Source/Sink operators. InfoSphere Streams already provides built-in Source and Sink operators that can be configured to use either UDP or TCP, in client or server mode. You should be able to use the TCP-based Source and Sink operators as you described in your comments above.
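        On the Shim side, consuming from a TCP Sink configured in server mode could look roughly like this; the host/port and the assumption of newline-delimited text output are illustrative, not taken from the operator documentation:

```python
import socket

def read_sink_lines(host, port, n_lines, timeout=5.0):
    """Connect to the address where a TCP Sink in server mode is
    assumed to listen and read n_lines newline-delimited tuples."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        f = conn.makefile("r")
        return [f.readline().rstrip("\n") for _ in range(n_lines)]
```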

        Regards,
        Senthil.