Topic
  • 9 replies
  • Latest Post - ‏2013-05-15T10:32:46Z by Seamus Rooney
David Benes
9 Posts

Pinned topic API call to start full recrawl for web crawler

‏2013-04-19T07:54:35Z |

Hi,

how can I start a full recrawl for the web crawler in ICAwES 3.0?
For other crawlers it works fine, but if I try the similar REST call (/crawler?method=startCrawl) for the web crawler, I get the following error:

FFQEP0002E An error occurred when processing a remote API. The reason is : com.ibm.es.control.webcrawler.client.CtrlWebCrawler incompatible with com.ibm.es.control.crawler.client.CtrlCrawler

 

Thanks
David

  • Seamus Rooney
    5 Posts
    ACCEPTED ANSWER

    Re: API call to start full recrawl for web crawler

    ‏2013-05-14T11:43:10Z  

    Hi David,

    I believe it should be possible with the SIAPI, but I'm not 100% sure. I think the feature is still there but is now undocumented, and I cannot test it at the moment because of my current system configuration. I was able to find sample code from a previous release that demonstrates revisiting a URL. I have attached the code in a file called "PerformAdminCommand.java" for you to download and test.

    As I was unable to test this, I'm not 100% sure it will work for ICA, but you should know very quickly if it does not. If it does not work, I would assume that the feature was removed from the SIAPI along with the documentation.

    With regard to the second part of your message, I think the date will be that of the last time the page was crawled (Updated/Modified/New). So if there are modifications to an existing page, I would think that the web crawler will pick that up and recrawl the page; likewise, for a new web page, the web crawler will pick that up and do a full crawl.

    Hope this helps you,

    Best Regards,

    Seamus Rooney

    QA Engineer Content Analytics

    Attachments

    Updated on 2013-05-15T08:33:13Z by Seamus Rooney
  • Seamus Rooney
    5 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-04-30T10:50:34Z  

    Hi David,

    For all other crawlers, a call to startCrawl is required to start crawling. For a web crawler, however, this method is not required, as the web crawler starts a crawl automatically when the crawler session starts.
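
As a rough illustration of the kind of call discussed here, the sketch below only assembles a startCrawl URL and does not contact a server. The /crawler?method=startCrawl part comes from the original question; the base URL and the "collection" and "crawlerId" parameter names and values are placeholders assumed for illustration, not taken from the product documentation.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class StartCrawlUrl {

    // Joins the parameters into a URL-encoded query string and appends it
    // to <baseUrl>/crawler. Only the "/crawler" path and "method=startCrawl"
    // are taken from the thread; everything else is supplied by the caller.
    static URI build(String baseUrl, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
        return URI.create(baseUrl + "/crawler?" + query);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("method", "startCrawl");
        // Hypothetical parameter names and values, for illustration only.
        params.put("collection", "col_99835");
        params.put("crawlerId", "my_non_web_crawler");
        // Hypothetical base URL of the admin REST API.
        System.out.println(build("http://ica-host.example/admin-api", params));
    }
}
```

According to the error quoted in the first post, the same method is rejected for a web crawler, so a URL like this would only be expected to help with the other crawler types.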

    Hope this helps,

     

    Regards,

    Seamus Rooney

    QA Engineer Content Analytics

  • David Benes
    9 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-04-30T11:22:19Z  

    Hi Seamus,

    thank you very much for your response.

    I understand that the web crawler starts crawling automatically. But what I need is to start a full recrawl, exactly as happens when I click "Start a full recrawl" in the Administration of the web crawler. So it seems that the full recrawl is not available through the API, am I right?

    In the meantime I have found another missing API call: the possibility to add specific URLs to the crawler's queue, the function under the "URLs to visit or revisit" link in the Administration of the web crawler. Is there any possibility to add specific URLs for the web crawler on demand programmatically?

     

    Thanks a lot for your help.

    Regards,
    David

  • Seamus Rooney
    5 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-04-30T11:35:33Z  

    Hi David,

    Yes, I think you are right, but I will check with one of the developers of this crawler to confirm the exact mechanism and what gets called when the "Start a full recrawl" button in the Administration is clicked.

    I will also check with a developer regarding your second query, but I would think that this would need to be a new feature request for the API. I can't guarantee that it will make it into the next release, but I can make a new feature request for you once I get my answer back.

    Leave it with me and I will post a reply when I get more details from development.

    Regards,

    Seamus Rooney

    QA Engineer Content Analytics

     

  • David Benes
    9 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-04-30T11:43:38Z  

    Thanks a lot Seamus.

    I'm looking forward to hearing what the developers say.

    I also completely understand that it might require a new feature request. Thank you very much for offering to submit that request.

     

    Best regards
    David

  • Seamus Rooney
    5 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-05-08T10:09:03Z  

    Hi David,

    Apologies for the delay, but I have some more answers for you that might help.

    Regarding the REST API to add specific URLs, there is no public API for that at the moment but I have made a request for this to be developed in the future.

    Regarding how to do a full recrawl on a web crawler: this is a little more tricky, as the web crawler is designed to check periodically for new, updated, or removed pages and to crawl them when these changes happen. A web crawler also never finishes, as it constantly checks for web pages that might appear in the future.

    However, if you really need to do a recrawl, and since you are doing this through code, you could delete the existing web crawler and just create it again. This would have the same effect, and you would just need to call the method again, depending on how you construct your code.

    Another option is to use the SIAPI, which is now deprecated. I believe you can do a full recrawl by specifying http://* and https://* as the URLs to revisit.

    I would suggest deleting the web crawler using the REST API myself, to future-proof your code, but the choice is yours.

    Would you be able to tell me why you want to do a full recrawl using the web crawler? The developer was curious, as it's not a frequent request, and he was wondering about a customer scenario they might need to think about in the future.

     

    Regards,

    Seamus Rooney

    QA Engineer Content Analytics

  • David Benes
    9 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-05-09T07:41:17Z  

    Hi Seamus,

    thank you for sending the request to add a REST API call for adding specific URLs. By the way, is it possible with SIAPI in the meantime? It seems so, as you stated later in your answer that a full recrawl might be possible by sending a request over SIAPI to crawl http://* and https://*. I assume it could be done in the administration part of the SIAPI. But according to the Infocenter, this part of SIAPI is not available in ICAwES 3.0. So is it still available, just not documented any more?

    Unfortunately, deleting the crawler is not an option, because deleting the crawler means all documents from that crawler will be deleted from the index.
    But maybe our need for a full recrawl just arises from not understanding how the web crawler works. Let's say we create a new web crawler, wait until it has crawled all pages, stop the crawler, and wait as long as the maximum delay defined between two crawls. Then, after starting the crawler, does it start the same process as if I click the full recrawl button? So does a full recrawl for the web crawler mean that the date for the next crawl of each page is set to the current time?

    The scenario for the full recrawl is that we have a website with business opportunities that are submitted during the day. Each evening, we need to crawl the server again to track the current state of all opportunities; each has its own page with details, including the state of the opportunity (open, closed, ...). Opportunities are added and removed, and may change their state. The next business day we need to provide up-to-date information about those opportunities to the end users.
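
The evening timing in this scenario could be handled by a small scheduler. The sketch below only computes how long to wait until the next run; the 20:00 run time is an assumed value, and what the job actually triggers is left open.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

public class NightlyDelay {

    // Returns the time from 'now' until the next occurrence of 'runAt'.
    // If 'runAt' has already passed (or is exactly now), the next day is used.
    static Duration untilNextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    public static void main(String[] args) {
        // 20:00 is an assumed evening run time, not a value from the thread.
        Duration delay = untilNextRun(LocalDateTime.now(), LocalTime.of(20, 0));
        System.out.println("Next recrawl trigger in " + delay.toMinutes() + " minutes");
    }
}
```

The resulting delay can be passed as the initial delay to ScheduledExecutorService.scheduleAtFixedRate with a 24-hour period, or the whole job can be left to an operating-system scheduler instead.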
     

    Thanks a lot for your help.

    Best regards
    David

  • David Benes
    9 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-05-15T09:48:42Z  

    Hi Seamus,

    thanks a lot for the example. It took a while to figure out how to make it work and which arguments it requires. Finally I succeeded with the following command. After running it, I can see the web crawler start crawling the specified pages (monitoring the recently crawled URLs in the details of the web crawler in the Administration console).

    java "-DES_CFG=C:\Program Files\IBM\es\nodeinfo\es.cfg"  -classpath .;..\lib\siapi.jar;..\lib\es.siapi.jar;..\lib\es.oss.jar;..\lib\esctrl.jar;..\lib\es.dl.client.jar PerformAdminCommand <ICA_admin_user> <ICA_admin_pass> col_99835 "revisitURLs http://www.idnes.cz/*"

    So those admin SIAPI functions are still available in ICAwES 3.0.0.2, even though the documentation says they are not available anymore.

    So to summarize: it is not possible to start a full recrawl or to add/recrawl a specific page over the REST API at the moment, but it is possible to do so using the undocumented part of SIAPI, until the request to add this function to the REST API is implemented.
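
For anyone who wants to run this from code rather than by hand, the command line above could be wrapped in a small launcher. This is a minimal sketch, not a tested integration: the paths, jar list, collection ID, and URL pattern are copied from the command in this post, while the environment variable names ICA_ADMIN_USER and ICA_ADMIN_PASS are hypothetical. The thread does not say whether revisitURLs accepts several patterns in one call, so the sketch assumes one invocation per pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RevisitUrlsLauncher {

    // Values copied from the command line quoted in this post (Windows paths).
    static final String ES_CFG = "C:\\Program Files\\IBM\\es\\nodeinfo\\es.cfg";
    static final String CLASSPATH = ".;..\\lib\\siapi.jar;..\\lib\\es.siapi.jar;"
            + "..\\lib\\es.oss.jar;..\\lib\\esctrl.jar;..\\lib\\es.dl.client.jar";

    // Builds the argument list for one PerformAdminCommand run.
    // ProcessBuilder passes each element as a single argument, so the
    // space inside "revisitURLs <pattern>" needs no extra quoting.
    static List<String> buildCommand(String user, String password,
                                     String collectionId, String urlPattern) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-DES_CFG=" + ES_CFG);
        cmd.add("-classpath");
        cmd.add(CLASSPATH);
        cmd.add("PerformAdminCommand");
        cmd.add(user);
        cmd.add(password);
        cmd.add(collectionId);
        cmd.add("revisitURLs " + urlPattern);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical environment variable names for the admin credentials.
        String user = System.getenv("ICA_ADMIN_USER");
        String password = System.getenv("ICA_ADMIN_PASS");
        if (user == null || password == null) {
            System.err.println("Set ICA_ADMIN_USER and ICA_ADMIN_PASS first.");
            return;
        }
        // URL patterns come from the arguments; default is the one used above.
        List<String> patterns = args.length > 0
                ? Arrays.asList(args)
                : Arrays.asList("http://www.idnes.cz/*");
        for (String pattern : patterns) {
            List<String> cmd = buildCommand(user, password, "col_99835", pattern);
            int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
            System.out.println(pattern + " -> exit code " + exit);
        }
    }
}
```

Passing http://* and https://* as arguments would correspond to the full-recrawl suggestion made earlier in the thread, which was not tested here. The launcher has to be started from the same working directory as the original command, because the classpath is relative.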

     

    Thank you very much for your help

    Best regards
    David Benes

  • Seamus Rooney
    5 Posts

    Re: API call to start full recrawl for web crawler

    ‏2013-05-15T10:32:46Z  

    Hi David,

    You're welcome, and thanks very much for the feedback on how you got the sample running, as it may help other users in the future too.

    Best Regards,

    Seamus Rooney

    QA Engineer Content Analytics