9 replies Latest Post - ‏2013-05-15T10:32:46Z by Seamus Rooney
David Benes

API call to start full recrawl for web crawler

‏2013-04-19T07:54:35Z |

Hi,

How can I start a full recrawl for the web crawler in ICAwES 3.0?
For other crawlers it works fine, but if I try the equivalent REST call (/crawler?method=startCrawl) for the web crawler, I get the following error:

FFQEP0002E An error occurred when processing a remote API. The reason is : com.ibm.es.control.webcrawler.client.CtrlWebCrawler incompatible with com.ibm.es.control.crawler.client.CtrlCrawler

 

Thanks
David

  • Seamus Rooney

    Re: API call to start full recrawl for web crawler

    ‏2013-04-30T10:50:34Z  in response to David Benes

    Hi David,

    For all other crawlers, a call to startCrawl is required to start crawling. For a web crawler, however, this call is not needed, because the web crawler starts crawling automatically when the crawler session starts.

    Hope this helps,

     

    Regards,

    Seamus Rooney

    QA Engineer Content Analytics

    • David Benes

      Re: API call to start full recrawl for web crawler

      ‏2013-04-30T11:22:19Z  in response to Seamus Rooney

      Hi Seamus,

      Thank you very much for your response.

      I understand that the web crawler starts crawling automatically. What I need is to start a full recrawl, exactly as happens when I click "Start a full recrawl" in the Administration of the web crawler. So it seems that the full recrawl is not available through the API, am I right?

      In the meantime I have found another missing API call: the ability to add specific URLs to the crawler's queue, which is the function behind the "URLs to visit or revisit" link in the Administration of the web crawler. Is there any way to add specific URLs for the web crawler on demand programmatically?

       

      Thanks a lot for your help.

      Regards,
      David

      • Seamus Rooney

        Re: API call to start full recrawl for web crawler

        ‏2013-04-30T11:35:33Z  in response to David Benes

        Hi David,

        Yes, I think you are right, but I will check with one of the developers of this crawler to confirm the exact mechanism and what gets called when the "Start a full recrawl" button in the Administration is clicked.

        I will also check with a developer about your second query, but I think this would need to be a new feature request for the API. I can't guarantee that it will make it into the next release, but I can submit a new feature request for you once I get my answer back.

        Leave it with me and I will post a reply when I get more details from development.

        Regards,

        Seamus Rooney

        QA Engineer Content Analytics

         

        • David Benes

          Re: API call to start full recrawl for web crawler

          ‏2013-04-30T11:43:38Z  in response to Seamus Rooney

          Thanks a lot, Seamus.

          I'm looking forward to hearing what the developers say.

          I also completely understand that it might require a new feature request. Thank you very much for offering to submit that request.

           

          Best regards
          David

          • Seamus Rooney

            Re: API call to start full recrawl for web crawler

            ‏2013-05-08T10:09:03Z  in response to David Benes

            Hi David,

            Apologies for the delay, but I have some more answers for you that might help.

            Regarding the REST API to add specific URLs: there is no public API for that at the moment, but I have made a request for this to be developed in the future.

            Regarding a full recrawl on a web crawler: this is a little trickier, as the web crawler is designed to check periodically for new, updated, or removed pages and to crawl them when these changes happen. A web crawler also never finishes, as it constantly checks for web pages that might appear in the future.

            However, if you really do need a recrawl, and since you are doing this through code, you could delete the existing web crawler and create it again. This would have the same effect, and you would just need to call the method again, depending on how you structure your code.

            Another option is to use the SIAPI, which is now deprecated. I believe you can do a full recrawl by specifying http://* and https://* as the URLs to revisit.

            Personally, I would suggest deleting the web crawler using the REST API to future-proof your code, but the choice is yours.

            Would you be able to tell me why you want to do a full recrawl using the web crawler? The developer was curious, as it's not a frequent request, and he was wondering about a customer scenario they might need to think about in the future.

             

            Regards,

            Seamus Rooney

            QA Engineer Content Analytics

            • David Benes

              Re: API call to start full recrawl for web crawler

              ‏2013-05-09T07:41:17Z  in response to Seamus Rooney

              Hi Seamus,

              Thank you for submitting the request for a REST API call to add specific URLs. By the way, is this possible with SIAPI in the meantime? It seems so, since you stated later in your answer that a full recrawl might be possible by sending a request over SIAPI to crawl http://* and https://*. I assume it could be done in the administration part of the SIAPI. But according to the Infocenter, this part of SIAPI is not available in ICAwES 3.0. So is it still available, just not documented any more?

              Unfortunately, deleting the crawler is not an option, because deleting the crawler means all documents from that crawler will be deleted from the index.
              But maybe our need for a full recrawl just arises from not understanding how the web crawler works. Let's say we create a new web crawler, wait until it has crawled all pages, stop the crawler, and wait for as long as the maximum delay between two crawls is defined. Then, after starting the crawler, does it start the same process as if I had clicked the full recrawl button? So does a full recrawl for the web crawler mean that the date for the next crawl of each page is set to the current time?
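To make the question above concrete, here is a toy model of the behaviour being asked about: every page has a "next crawl" time, and a full recrawl simply sets that time to now for all known pages. This is only a sketch of the hypothesis in the question. It is not ICA's actual scheduler, and the class and method names (RecrawlScheduleModel, recordCrawl, fullRecrawl, dueUrls) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: NOT the real web crawler scheduler, just the hypothesis
// "full recrawl = set every page's next-crawl time to the current time".
final class RecrawlScheduleModel {
    // URL -> time (in arbitrary ticks) at which the page is next due.
    private final Map<String, Long> nextCrawlAt = new LinkedHashMap<>();

    // After a page is crawled, schedule its next visit one interval later.
    void recordCrawl(String url, long now, long recrawlInterval) {
        nextCrawlAt.put(url, now + recrawlInterval);
    }

    // Hypothesised "full recrawl": every known page becomes due immediately.
    void fullRecrawl(long now) {
        for (Map.Entry<String, Long> entry : nextCrawlAt.entrySet()) {
            entry.setValue(now);
        }
    }

    // Pages whose next-crawl time has been reached.
    List<String> dueUrls(long now) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> entry : nextCrawlAt.entrySet()) {
            if (entry.getValue() <= now) {
                due.add(entry.getKey());
            }
        }
        return due;
    }
}
```

In this model, stopping the crawler and waiting longer than the largest interval makes every page due as well, which is why the two approaches would look the same from the outside if the hypothesis holds.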

              The scenario for the full recrawl is that we have a website with business opportunities that are submitted during the day. Each evening, we need to crawl the server again to capture the current state of all opportunities. Each has its own page with details, including the state of the opportunity (open, closed, ...). Opportunities are added, removed, and may change state. The next business day, we need to provide up-to-date information on those opportunities to the end users.
               

              Thanks a lot for your help.

              Best regards
              David

              • Seamus Rooney

                Re: API call to start full recrawl for web crawler

                ‏2013-05-14T11:43:10Z  in response to David Benes

                Hi David,

                I believe it should be possible with the SIAPI, but I'm not 100% sure. I think the feature is still there but is now undocumented, and I cannot test it at the moment due to my current system configurations. I was able to find sample code from a previous release that demonstrates revisiting a URL. I have attached the code in a file called "PerformAdminCommand.java" for you to download and test.

                As I was unable to test this, I'm not 100% sure it will work for ICA, but you should know very quickly if it does not. If it does not work, then I would assume that the feature was removed from the SIAPI along with the documentation.

                With regard to the second part of your message, I think the date will be that of the last time the page was crawled (updated/modified/new), so if there are modifications to an existing page, I would think that the web crawler will pick that up and recrawl the page. Likewise, for a new web page, the web crawler will pick that up and do a full crawl.

                Hope this helps you,

                Best Regards,

                Seamus Rooney

                QA Engineer Content Analytics

                Attachments

                Updated on 2013-05-15T08:33:13Z by Seamus Rooney
                • David Benes

                  Re: API call to start full recrawl for web crawler

                  ‏2013-05-15T09:48:42Z  in response to Seamus Rooney

                  Hi Seamus,

                  Thanks a lot for the example. It took a while to figure out how to make it work and which arguments it requires, but I finally succeeded with the following command. After running it, I can see the web crawler start crawling the specified pages (by monitoring the recently crawled URLs in the web crawler details in the Administration console).

                  java "-DES_CFG=C:\Program Files\IBM\es\nodeinfo\es.cfg"  -classpath .;..\lib\siapi.jar;..\lib\es.siapi.jar;..\lib\es.oss.jar;..\lib\esctrl.jar;..\lib\es.dl.client.jar PerformAdminCommand <ICA_admin_user> <ICA_admin_pass> col_99835 "revisitURLs http://www.idnes.cz/*"
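For anyone who wants to trigger this from a scheduled job rather than by hand, the command above could be assembled and launched from Java roughly as follows. This is a minimal sketch, not tested against ICA: the jar names, the PerformAdminCommand class and the "revisitURLs" argument are copied from the command above, while the helper class RevisitCommandBuilder and its method names are hypothetical. It also assumes the working directory contains the compiled PerformAdminCommand.class.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: wraps the command line shown above, nothing more.
final class RevisitCommandBuilder {
    // Jar names taken from the command above; adjust to your installation.
    static final String[] JARS = {
        "siapi.jar", "es.siapi.jar", "es.oss.jar", "esctrl.jar", "es.dl.client.jar"
    };

    // Builds ".;<lib>\siapi.jar;..." using the platform's separators.
    static String classpath(String libDir) {
        StringBuilder cp = new StringBuilder(".");
        for (String jar : JARS) {
            cp.append(File.pathSeparator).append(libDir).append(File.separator).append(jar);
        }
        return cp.toString();
    }

    // Each list element is passed as ONE argument, so the admin command
    // "revisitURLs <pattern>" stays together without manual quoting.
    static List<String> build(String esCfg, String libDir, String user, String password,
                              String collectionId, String urlPattern) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-DES_CFG=" + esCfg);
        cmd.add("-classpath");
        cmd.add(classpath(libDir));
        cmd.add("PerformAdminCommand");
        cmd.add(user);
        cmd.add(password);
        cmd.add(collectionId);
        cmd.add("revisitURLs " + urlPattern);
        return cmd;
    }

    // Runs the command in workingDir and returns the process exit code.
    static int run(List<String> cmd, File workingDir) throws Exception {
        Process process = new ProcessBuilder(cmd)
                .directory(workingDir)
                .inheritIO()
                .start();
        return process.waitFor();
    }
}
```

A nightly job could then call RevisitCommandBuilder.build(...) with the collection ID and URL pattern and pass the result to run(...). What a non-zero exit code means depends on the sample program, so check its output rather than relying on the code alone.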

                  So those admin SIAPI functions are still available in ICAwES 3.0.0.2, even though the documentation says they are no longer available.

                  To summarize: it is not currently possible to start a full recrawl or to add or recrawl a specific page over the REST API, but it is possible using the undocumented part of SIAPI, until the requested function is added to the REST API.

                   

                  Thank you very much for your help

                  Best regards
                  David Benes

                  • Seamus Rooney

                    Re: API call to start full recrawl for web crawler

                    ‏2013-05-15T10:32:46Z  in response to David Benes

                    Hi David,

                    You're welcome, and thanks very much for the feedback on how you got the sample running, as it may help other users in the future too.

                    Best Regards,

                    Seamus Rooney

                    QA Engineer Content Analytics