3 replies Latest Post - ‏2012-03-06T06:40:53Z by bwchang
Dushyant
10 Posts

Controlling crawler plugin behavior

‏2012-02-27T14:15:55Z |
Hi All,

Quick question regarding crawler plugin behavior. Is there a way to make the plugin NOT ADD a document to the collection based on custom logic? We are planning to insert some conditional logic in init() or another method and, when certain conditions do not match, skip adding the document to the collection. Is that doable?

Thanks for the help!
Dushyant.
Updated on 2012-03-06T06:40:53Z by bwchang
  • bwchang
    146 Posts

    Re: Controlling crawler plugin behavior

    ‏2012-02-27T19:30:47Z  in response to Dushyant
    Dushyant,

    Here is a piece of code that keeps a document from being indexed: returning null from updateDocument() skips it. Billy.

    public CrawledData updateDocument(CrawledData crawledData) throws CrawlerPluginException {
        // Get the URI string and security tokens to base the decision on
        String uri = crawledData.getURI();
        String securityTokens = crawledData.getSecurityTokens();

        // Custom exclusion logic goes here; set skip to true to exclude the document
        boolean skip = false;
        if (skip) {
            return null; // skip this document
        }

        // If reached here, no exclusion
        return crawledData;
    }
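
The "custom logic for exclusion" step above could look something like the following. This is only a sketch: SkipRules, shouldSkip() and the list of excluded suffixes are hypothetical names and rules made up for illustration, not part of the crawler plugin API, and the real condition would depend on your own requirements.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: the rule set below is an illustration only,
// not part of the crawler plugin API.
final class SkipRules {
    // Assumed exclusion rule: URI suffixes that should never reach the collection.
    private static final List<String> EXCLUDED_SUFFIXES = Arrays.asList(".tmp", ".bak");

    private SkipRules() {}

    // Returns true when the document at this URI should be skipped.
    static boolean shouldSkip(String uri) {
        if (uri == null) {
            return true; // nothing to identify the document by, so exclude it
        }
        String lower = uri.toLowerCase(Locale.ROOT);
        for (String suffix : EXCLUDED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside updateDocument(), the placeholder "boolean skip = false;" would then become "boolean skip = SkipRules.shouldSkip(uri);". Keeping the rule in a separate class makes it possible to test it without a running crawler.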
    • SujataDe
      2 Posts

      Re: Controlling crawler plugin behavior

      ‏2012-03-02T14:23:31Z  in response to bwchang
      Hi,
      Is it possible to update the OmniFind collection while the crawler is running? Specifically, can we use the init() or term() methods to explicitly add a different document from a different data source to the collection using the add document REST SIAPI admin API, independently of the updateDocument() method?
      Thanks
      Sujata
      • bwchang
        146 Posts

        Re: Controlling crawler plugin behavior

        ‏2012-03-06T06:40:53Z  in response to SujataDe
        Technically speaking, you should be able to add documents to a collection from within the init() and term() methods of a custom crawler plugin, via the REST Admin API. However, you should be aware of the following.

        1) Documents added via the REST Admin API will not be processed by any custom crawler plugin, since a crawler is not involved (nor required) when a document is added via the REST Admin API.

        2) If documents are added via the REST Admin API within the init() method, they will be added every time the crawler is started, unless the code is wrapped in a conditional statement based on some persisted state information that outlives the process (e.g., a file). The same applies to adding documents within the term() method: it will happen every time the crawler is stopped, and only then.
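
As a rough illustration of point 2, a marker file is one way to persist that state outside the process. The sketch below is an assumption, not product code: OneTimeGuard, runOnce() and the marker path are hypothetical names, and the action passed in stands for whatever REST Admin API call you would make.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the crawler plugin API.
final class OneTimeGuard {
    private OneTimeGuard() {}

    // Runs the action only if the marker file does not exist yet.
    // Returns true if the action ran, false if it was skipped.
    static boolean runOnce(Path marker, Runnable action) {
        try {
            // Atomic check-and-create: fails if the marker already exists.
            Files.createFile(marker);
        } catch (FileAlreadyExistsException e) {
            return false; // the action already ran in an earlier crawler session
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        action.run();
        return true;
    }
}
```

From init(), this could be called as OneTimeGuard.runOnce(markerPath, () -> addDocuments()), where markerPath and addDocuments() are again placeholders. Note that the marker is created before the action runs, so a failed action is not retried on the next start; if you would rather retry, create the marker only after the action succeeds.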

        Billy