BoardReader crawler - configuration properties
The BoardReader crawler crawls social media data that has been collected by the BoardReader web service. BoardReader is an application that aggregates data from multiple social media sources across the Internet.
To run the BoardReader crawler, you need a BoardReader API key. Contact BoardReader to obtain this key.
The Create crawler: BoardReader screen is where you enter the configuration parameters for this crawler.
Crawler Properties
- Crawler name
- The name of the crawler. Alphanumeric characters, hyphens, underscores, and spaces are allowed.
- Crawler description
- A description of the crawler.
- Advanced options
- Time to wait between retrieval requests
- The time to wait between retrieval requests, expressed in milliseconds.
- Maximum number of active crawler threads
- The maximum number of crawler threads that can be active at the same time.
- Maximum document size
- The maximum size of a document, expressed in kilobytes. The maximum value is 131,071 kilobytes.
- When the crawler session is started
- Specifies which content is crawled when the crawler session is started.
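The wait time between retrieval requests acts as a simple rate limit on the crawler. As a minimal sketch of the idea (the property value and the `wait_before_request` helper here are illustrative assumptions, not part of the product):

```python
import time

# Hypothetical value mirroring "Time to wait between retrieval requests".
WAIT_BETWEEN_REQUESTS_MS = 500
_last_request_time = 0.0

def wait_before_request():
    """Sleep just long enough so that consecutive retrieval requests
    are at least WAIT_BETWEEN_REQUESTS_MS milliseconds apart."""
    global _last_request_time
    now = time.monotonic()
    elapsed_ms = (now - _last_request_time) * 1000.0
    if elapsed_ms < WAIT_BETWEEN_REQUESTS_MS:
        time.sleep((WAIT_BETWEEN_REQUESTS_MS - elapsed_ms) / 1000.0)
    _last_request_time = time.monotonic()
```

Each crawler thread would call such a helper before issuing a request, so raising the wait time or lowering the thread count both reduce load on the data source.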
Data Source Properties
- BoardReader license key
- The BoardReader license key that is used to call the BoardReader API.
- Crawl Duration
- Select the crawl duration.
- Start date
- The start date of the crawl duration.
- End date
- The end date of the crawl duration.
- Duration period type
- Select the crawl duration period type. This option is shown only when "The current time for a specified duration" is selected as the Crawl Duration.
- Duration period amount
- The crawl duration period amount. This option is shown only when "The current time for a specified duration" is selected as the Crawl Duration.
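The two duration modes above can be pictured as a single time window: either an explicit start/end pair, or a period counted back from the current time. This is an illustrative sketch only; the period-type names and the `crawl_window` helper are assumptions, not the crawler's actual implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical mapping of a "Duration period type" to a time unit.
PERIOD_UNITS = {"days": timedelta(days=1), "hours": timedelta(hours=1)}

def crawl_window(start=None, end=None, period_type=None, period_amount=None):
    """Return the (start, end) datetimes to crawl.

    Either an explicit start/end pair is given, or the window is the
    specified duration counted back from the current time."""
    if period_type is not None:
        end = datetime.now(timezone.utc)
        start = end - PERIOD_UNITS[period_type] * period_amount
    return start, end

# Explicit date range:
s1, e1 = crawl_window(start=datetime(2023, 1, 1, tzinfo=timezone.utc),
                      end=datetime(2023, 1, 31, tzinfo=timezone.utc))

# "The current time for a specified duration": the last 7 days.
s2, e2 = crawl_window(period_type="days", period_amount=7)
```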
- Domain Conditions
- The list of social media domains to crawl.
- Query Conditions
- BoardReader queries to limit how much content is crawled. The crawler applies Boolean OR logic to combine multiple queries.
- BoardReader API parameters
- Additional parameters to pass to the BoardReader API. For example, filter_language=ja&filter_country=jp limits the crawl to documents in Japanese that originate in Japan.
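Taken together, the query conditions and the API parameters shape the request sent to BoardReader. A hedged sketch of how such a request URL might be assembled, with multiple queries combined by Boolean OR (the endpoint URL, the parameter names, and the helper are illustrative assumptions, not the actual BoardReader API contract):

```python
from urllib.parse import urlencode

def build_request_url(base_url, license_key, queries, extra_params=""):
    """Combine multiple query conditions with Boolean OR and append
    raw extra parameters such as 'filter_language=ja&filter_country=jp'."""
    combined = " OR ".join(f"({q})" for q in queries)
    params = urlencode({"key": license_key, "query": combined})
    if extra_params:
        params += "&" + extra_params
    return f"{base_url}?{params}"

url = build_request_url(
    "https://api.example.com/search",   # hypothetical endpoint
    "MY-LICENSE-KEY",
    ["watson explorer", "oneWEX"],
    "filter_language=ja&filter_country=jp",
)
```

Because the queries are OR-combined, adding more query conditions widens rather than narrows the crawled content; the API parameters are what narrow it.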
- Default time zone
- The default time zone that is used to parse date string values to epoch time.
- Time zone list
- The time zones that are used to parse date string values that are crawled from the corresponding domains. For example, *fr.yahoo.com=WET.
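The default time zone and the per-domain time zone list determine how a naive date string is converted to epoch time. A minimal sketch under the assumption that the list behaves like a domain-to-zone lookup (the helper and the mapping structure are illustrative):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

DEFAULT_TZ = "UTC"                      # "Default time zone"
DOMAIN_TZ = {"fr.yahoo.com": "WET"}     # "Time zone list", e.g. *fr.yahoo.com=WET

def to_epoch(date_string, domain):
    """Parse a naive date string using the domain's time zone (or the
    default) and return seconds since the Unix epoch."""
    tz = ZoneInfo(DOMAIN_TZ.get(domain, DEFAULT_TZ))
    dt = datetime.strptime(date_string, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(dt.timestamp())
```

The same date string can map to different epoch values depending on which domain it came from, which is why the per-domain overrides matter for cross-region sources.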
- Proxy server host name
- The host name of the proxy server.
- Proxy server port
- The port of the proxy server.
- User ID for the proxy server
- The user name to access the proxy server.
- Password for the proxy server
- The password that is used to access the proxy server.
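The four proxy properties map naturally onto a standard HTTP proxy URL. A sketch of how a crawler-side HTTP client could consume them, using Python's urllib (the property-to-URL mapping and the helper name are assumptions for illustration):

```python
from urllib.request import ProxyHandler, build_opener

def make_proxied_opener(host, port, user=None, password=None):
    """Build a urllib opener that routes requests through the proxy,
    embedding credentials in the proxy URL when they are given."""
    auth = f"{user}:{password}@" if user else ""
    proxy_url = f"http://{auth}{host}:{port}"
    handler = ProxyHandler({"http": proxy_url, "https": proxy_url})
    return build_opener(handler)

opener = make_proxied_opener("proxy.example.com", 8080, "crawler", "secret")
```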
Crawl space Properties
You can find and add multiple crawl spaces for a BoardReader crawler. For instructions, see Finding and adding crawl spaces in a BoardReader crawler.
Crawler plug-in
Data source crawler plug-ins are Java™ applications that can change the content or metadata of crawled documents. You can configure a data source crawler plug-in for all non-web crawler types. For more information, see Crawler plug-ins.
- Enable the crawler plug-in
- Enable this option to use a crawler plug-in.
- Plug-in class name
- The class name for the crawler plug-in.
- Plug-in class path
- The JAR file location of the crawler plug-in. The folder that contains the JAR file must be mounted so it is available. For more information, see Providing access to the local filesystem from Watson Explorer oneWEX.