BoardReader crawlers

Configure a BoardReader crawler to add social media data that was collected by the BoardReader web service to a collection.

BoardReader is an application that aggregates data from multiple social media sources across the Internet. The BoardReader crawler connects to the BoardReader server to collect this data so that it can be indexed, searched, and analyzed by Watson Explorer Content Analytics. By using various BoardReader REST APIs, the crawler collects data from the following BoardReader aggregated feeds:
  • Blogs
  • Message boards
  • News
  • Reviews
  • Videos

By default, the BoardReader crawler collects data that was aggregated by BoardReader during the previous three months. You can specify a different crawling time period when you configure the crawler.
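
As a rough illustration of the default crawling period, the following Python sketch computes a window that ends at the start of a crawler session and begins three calendar months earlier. The function names and the rule for clamping the day at the end of a month are assumptions made for this example; the way the crawler itself calculates the period is not documented here.

```python
import calendar
from datetime import datetime
from typing import Tuple


def subtract_months(moment: datetime, months: int) -> datetime:
    """Go back a number of calendar months, clamping the day if needed."""
    total = moment.year * 12 + (moment.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    # Assumption: a day such as the 31st is clamped to the last day of a shorter month.
    last_day = calendar.monthrange(year, month)[1]
    return moment.replace(year=year, month=month, day=min(moment.day, last_day))


def default_crawl_window(session_start: datetime, months: int = 3) -> Tuple[datetime, datetime]:
    """Return (start, end) for a window that covers the previous `months` months."""
    return subtract_months(session_start, months), session_start


if __name__ == "__main__":
    start, end = default_crawl_window(datetime(2024, 5, 31, 12, 0))
    print(start.isoformat(), "to", end.isoformat())
```

Passing a different value for months, such as 6, models a custom relative crawling period.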

Crawler connection credentials

When you configure a BoardReader crawler, you must specify the value of the BoardReader license key that you received from BoardReader as part of your license agreement. The crawler requires this key to collect data from the BoardReader social media sources.

Alternatively, you can configure the connection credentials when you specify security settings for the system instead of specifying credentials for each crawler. By using this approach, multiple crawlers can use the same credentials.

BoardReader source properties and crawler metadata fields

BoardReader defines properties for each web source type, such as the subject, the crawled date, and author. The BoardReader crawler publishes these properties as the crawler metadata fields. The properties, and corresponding crawler fields, depend on the BoardReader source type:
  • The crawler publishes the BoardReader property Text as the Watson Explorer Content Analytics document body. The MIME type of the document body is text/plain.
  • The crawler publishes the BoardReader property Published as the document date.
  • For other BoardReader properties, the crawler field matches the BoardReader property name, prefixed by the source category name. For example, a property that is named AuthorInfo/Name in a message board is mapped to a crawler field that is named MessageBoard_AuthorInfo/Name.
  • For the document title, the crawler maps the BoardReader property Subject from blogs, message boards, and news sources to an index field named title. For reviews and videos, the crawler maps the BoardReader property Title to an index field named title.
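
The mapping rules above can be sketched in Python as follows. This is an illustration only, not the crawler's implementation: the function names are hypothetical, the returned descriptions for Text and Published are plain labels, and MessageBoard is the only source category name that this topic confirms.

```python
def crawler_field(source_category: str, property_name: str) -> str:
    """Describe where a BoardReader property is published."""
    if property_name == "Text":
        return "document body (text/plain)"
    if property_name == "Published":
        return "document date"
    # Other properties keep their name, prefixed by the source category name.
    return f"{source_category}_{property_name}"


def title_property(source_type: str) -> str:
    """Return the BoardReader property that is mapped to the title index field."""
    if source_type in ("Blogs", "Message Boards", "News"):
        return "Subject"
    if source_type in ("Reviews", "Videos"):
        return "Title"
    raise ValueError(f"Unknown source type: {source_type}")


if __name__ == "__main__":
    print(crawler_field("MessageBoard", "AuthorInfo/Name"))
    print(title_property("Reviews"))
```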

Identifying the content to crawl

Because the amount of data that can be obtained from BoardReader is large, configure one or more of the following crawler conditions to limit what is added to the collection. At a minimum, you must specify a query condition or a domain condition, and you must select both a top-level source type and lower-level sources to crawl.
  • You can limit the amount of data by specifying time limits. The crawler can collect content that was published between two specific dates, from a specific date to the start of the crawler session, or for a specific amount of time, such as six months before the start of a crawler session.
  • You can narrow the scope by specifying one or more BoardReader queries. The crawler applies the queries to limit the content that is collected from the selected BoardReader sources. If you specify more than one query, the crawler runs each query separately and then applies Boolean OR logic to the results. For example, if you specify two queries and select Message Boards as the source type, the crawler searches the message boards twice and then aggregates the results of both queries. Contact BoardReader for information about BoardReader query syntax.
  • You can specify a filter to limit the crawler to specific domains. For example, BoardReader handles Twitter accounts as sites. To crawl all Twitter site content that was collected by BoardReader, add twitter.com as a domain condition.
  • You can also narrow the scope by selecting specific BoardReader source types to crawl, such as message boards or specific message board sites and forums. Select at least one top-level source type and at least one lower-level source to control the amount of data that is crawled. For example, if you select Message Boards as the source type, select one or more message board sites or one or more of the forums that are hosted on a site. You can also enter queries to narrow the list of available sites and forums. For example, limit the list of sites to sites that include IBM® in their names or limit the list of forums to forums that contain Enterprise in their titles.
    The following table shows the levels available for the supported source types and the ways that you can use them to narrow the scope of the crawl. If you change the crawler conditions after you create the crawler, a full recrawl is required to apply the changes.
    Top-level (source type) | Second-level | Third-level
    Message Boards | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | Forum. You can enter a query to narrow the list of forums by forum name.
    Blogs | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | (none)
    News | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | (none)
    Videos | Site. You can click a button to see a list of the available video sites. | Author. You can enter a query to narrow the list of authors by author name.
    Reviews | Site. You can enter a query to narrow the list of sites by site name or site URL. | Author. You can enter a query to narrow the list of authors by author name.
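
The Boolean OR behavior that is described for multiple queries can be pictured with the following Python sketch. It runs each query separately against a stand-in search function and merges the results. The search function, the sample data, and the removal of duplicates by document ID are all assumptions made for illustration; they do not represent the BoardReader API or its query syntax.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Document = Tuple[str, str]  # (document ID, text)


def crawl_with_queries(
    queries: Iterable[str],
    search: Callable[[str], List[Document]],
) -> Dict[str, str]:
    """Run every query on its own and combine the results with OR logic."""
    combined: Dict[str, str] = {}
    for query in queries:
        for doc_id, text in search(query):
            # Assumption: a document that matches several queries is kept once.
            combined.setdefault(doc_id, text)
    return combined


# Hypothetical message board posts that stand in for a BoardReader source.
SAMPLE_POSTS: List[Document] = [
    ("post-1", "enterprise search tips"),
    ("post-2", "crawler scheduling question"),
    ("post-3", "enterprise crawler setup"),
]


def sample_search(query: str) -> List[Document]:
    """Stand-in search: a simple substring match on the post text."""
    return [post for post in SAMPLE_POSTS if query in post[1]]


if __name__ == "__main__":
    print(sorted(crawl_with_queries(["enterprise", "crawler"], sample_search)))
```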

Applying time zones and API filters

In BoardReader, dates for message board content are in the eastern United States time zone (EST/EDT), and dates for all other source types are in GMT. Because the crawler cannot detect the time zone for each BoardReader site, the dates that users see when they query a BoardReader collection might differ from the dates that appear on the original data source sites (such as yahoo.com sites that were crawled by the BoardReader service).

When you select the sources to be crawled, you can specify which time zone is to be used to determine the document dates. For best results, select a time zone based on the time zone for the site to be crawled:
  • You can select a default time zone. This time zone is used to convert the BoardReader Published date to the indexed document date.
  • You can specify patterns to apply a time zone to documents that match a URL pattern. For example, if you specify *.yahoo.co.jp=JST in the time zone list, Japan Standard Time is used to determine the document date for documents that are published on .yahoo.co.jp sites.
If you do not specify a time zone pattern, the default time zone is applied. If you do not specify a default time zone, the crawler uses the time zone that is configured for the collection.
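
The fallback order described above (matching pattern, then default time zone, then collection time zone) can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions made for this example: patterns are compared with the host name of the document URL, and the first matching pattern wins.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


def parse_time_zone_list(entries: List[str]) -> List[Tuple[str, str]]:
    """Split entries such as '*.yahoo.co.jp=JST' into (pattern, time zone) pairs."""
    rules = []
    for entry in entries:
        pattern, _, zone = entry.partition("=")
        rules.append((pattern.strip().lower(), zone.strip()))
    return rules


def resolve_time_zone(
    url: str,
    rules: List[Tuple[str, str]],
    default_zone: Optional[str],
    collection_zone: str,
) -> str:
    """Pick the time zone that is used to determine the document date."""
    host = (urlsplit(url).hostname or "").lower()
    for pattern, zone in rules:
        # Assumption: the pattern is matched against the host name only.
        if fnmatchcase(host, pattern):
            return zone
    # No pattern matched: use the default, or the collection setting if no default is set.
    return default_zone or collection_zone


if __name__ == "__main__":
    rules = parse_time_zone_list(["*.yahoo.co.jp=JST"])
    print(resolve_time_zone("http://news.yahoo.co.jp/item/1", rules, None, "GMT"))
```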

You can specify BoardReader API parameters to further limit the amount of data that the crawler collects. For example, specify filter_language=ja to collect only documents in the Japanese language, or specify filter_country=jp to collect only documents that originate in Japan. To specify more than one filter, combine the entries with an ampersand (&) character, for example, filter_language=ja&filter_country=jp. Contact BoardReader for information about available API parameters.
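
As a small illustration of combining filters, the following Python sketch joins parameter entries with an ampersand. The helper name is hypothetical, and only the two parameters from the example above are used; the sketch does not check whether a parameter is supported by BoardReader and does not apply any URL encoding.

```python
from typing import Dict


def build_api_filter(filters: Dict[str, str]) -> str:
    """Join name=value entries with '&', in the order in which they were given."""
    return "&".join(f"{name}={value}" for name, value in filters.items())


if __name__ == "__main__":
    print(build_api_filter({"filter_language": "ja", "filter_country": "jp"}))
```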

Configuration overview

When you configure a BoardReader crawler in the administration console, wizards help you complete the following tasks:
  • Specify properties that control how the crawler operates and uses system resources. The crawler properties control how the crawler collects content from all of the BoardReader sources that you add to the crawl space.
  • Specify connection credentials.
  • Select the BoardReader sources that you want to crawl by specifying time limits, queries, domains, and source types.
  • Specify options for how documents can be searched. For example, you can exclude certain types of documents from the crawl space.
  • Set up a schedule for crawling BoardReader sites.