oakstream

Pinned topic Best practice for eliminating data based on user provided input thru webapp

‏2012-08-26T19:47:35Z |
I am going to set up a filter operator to eliminate unwanted records prior to processing. Essentially, I only want to look through the data for certain keywords and exclude everything else. The issue is that these keywords need to be dynamic, as end users will control them through a web interface. A few dozen users will be able to update the keywords, and I anticipate a few keywords changing every few seconds. My thought is to store the keywords in a database (Postgres or Oracle), but that would mean querying them for every tuple. Does anyone have suggestions for best practices for performance and architecture? Essentially, data will be flowing through, and at any given point in time there will be active users who have an interest in the data via keywords. If none of the keywords is found in a piece of data, that data can be discarded. I know there is overhead with databases, so I'm hoping there is some type of cache or similar that can be updated dynamically.

I would appreciate any thoughts or advice.

Mike
  • travis2k4

    Re: Best practice for eliminating data based on user provided input thru webapp

    ‏2012-08-26T20:58:44Z  
    I've set up something similar to allow users to alter the behaviour of a job during run time. I have a database which a REST API interacts with. Users have a web page / javascript client which allows them to add / remove keywords to filter on. As well as updating the database with the changes to the list of keywords, the REST API connects to a TCPSource operator in Streams and forwards on the user's request. This TCPSource is connected to a DynamicFilter operator which keeps the current list of keywords to filter on in memory. This means that we don't need to keep querying the database for every single tuple. To keep everything in sync, I also use a slightly modified InetSource operator which connects to the REST API when the job first starts up and asks for the current full list of keywords it should be filtering for.
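
The in-memory approach described above can be illustrated outside of Streams. The following is only a minimal Python sketch of the idea, not the DynamicFilter operator itself: the class name, the `add:` / `remove:` command format, and the case-insensitive substring match are all assumptions made for illustration. The point it shows is that keyword updates arrive on a separate control path and each record is checked against a set held in memory, so no database query is needed per tuple.

```python
import threading


class DynamicKeywordFilter:
    """In-memory keyword set that can be updated while records flow through.

    Hypothetical illustration only; this is not the Streams DynamicFilter API.
    """

    def __init__(self, initial_keywords=()):
        self._lock = threading.Lock()
        # Seed with the full list fetched once at job startup.
        self._keywords = {k.lower() for k in initial_keywords}

    def apply_command(self, command):
        """Apply a control message such as 'add:storm' or 'remove:storm'.

        The 'action:keyword' format is an assumed wire format.
        """
        action, _, keyword = command.strip().partition(":")
        keyword = keyword.strip().lower()
        if not keyword:
            return  # ignore malformed messages
        with self._lock:
            if action == "add":
                self._keywords.add(keyword)
            elif action == "remove":
                self._keywords.discard(keyword)

    def matches(self, text):
        """Return True if any active keyword occurs in the text."""
        lowered = text.lower()
        with self._lock:
            return any(k in lowered for k in self._keywords)


def filter_stream(records, keyword_filter):
    """Yield only the records that contain at least one active keyword."""
    for record in records:
        if keyword_filter.matches(record):
            yield record
```

In this sketch the database remains the source of truth, while the control messages keep the in-memory copy current between restarts.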

    HTH
  • oakstream

    Re: Best practice for eliminating data based on user provided input thru webapp

    ‏2012-08-28T15:17:02Z  
    Hi Travis,
    Thanks a lot, this definitely helps. I've been able to try most of what you suggested and it works great. I just had a question about the InetSource. Is this the operator you would use to connect to a REST service? Are you just connecting once with the InetSource upon job startup to get the keywords, or do you periodically go back and check? It doesn't look like I need to do that if I'm writing the keywords to the TCPSource.
    Another thing I'm doing is inserting records into a database. My database is PostgreSQL, which doesn't look like it is supported, at least through the already developed operators, so I have to go through a REST service to update the data. I'm just wondering if the InetSource operator is what I would use for a REST connection from Streams, or if there is another one that should be used (and whether you know of any examples of doing this).

    Thanks again for your help. This is definitely helpful.
  • travis2k4

    Re: Best practice for eliminating data based on user provided input thru webapp

    ‏2012-08-28T23:33:21Z  
    InetSource isn't ideal for connecting to a REST API, but it was the only operator I knew about that allowed you to create an HTTP connection when I was doing this work. I had a couple of problems with it:

    • It queries the provided URLs every so often. In our case we only want it to load the URL once. To get round this, I copied the operator's code from the toolkits folder inside the Streams install and created a new version of the operator. I altered this new version so that if the "fetchIntervalSeconds" parameter is negative, the operator only requests the URL once.

    • InetSource outputs a stream with a single attribute, an rstring holding the contents of the requested URL, so it doesn't do any parsing of the returned data. The REST API I use returns JSON. Luckily there are some JSON parsing operators available in the Streams Exchange (https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/person/060002871K/file/d8bd5118-4587-4b3e-b43e-4b1717f8691f), so I pass the output from my altered InetSource operator into one of these operators to decode the JSON into a Streams tuple.
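
The two workarounds above (fetch once when the interval is negative, then decode the JSON body) can be sketched in Python. This is not the modified InetSource code or the Streams Exchange JSON operators: the function names, the injected `fetch` callable, the `max_fetches` argument, and the `{"keywords": [...]}` response shape are all assumptions for illustration, since the thread does not show the actual REST response format.

```python
import json
import time


def poll_url(fetch, url, fetch_interval_seconds, max_fetches=None, sleep=time.sleep):
    """Yield response bodies from repeated calls to fetch(url).

    A negative fetch_interval_seconds means "fetch exactly once", mirroring
    the behaviour described for the altered operator. `fetch` is any callable
    that returns the response body as a string (hypothetical helper).
    """
    count = 0
    while True:
        yield fetch(url)
        count += 1
        if fetch_interval_seconds < 0:
            return  # one-shot mode: load the URL once at startup
        if max_fetches is not None and count >= max_fetches:
            return
        sleep(fetch_interval_seconds)


def decode_keywords(body):
    """Decode a JSON body into a set of lower-cased keywords.

    Assumes a payload shaped like {"keywords": ["a", "b"]}.
    """
    payload = json.loads(body)
    return {str(k).lower() for k in payload.get("keywords", [])}
```

Passing the fetch function in as an argument keeps the polling logic independent of any particular HTTP client, so it can be exercised without a live REST service.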

    Looking at the Streams Exchange, it seems some developers have since contributed other operators to help with making HTTP requests, and there appears to be a toolkit for connecting to REST APIs too, so you might be in a better position than I was. Take a look at the inet_ssb_v1.0.6.tar.gz and HTTPUtils V1.0.tgz toolkits available at
    https://www.ibm.com/developerworks/mydeveloperworks/files/app?lang=en#/collection/09ddaa56-cd45-4e04-b880-d52a3ab630c0