Configure a BoardReader crawler
to add social media data that was collected by the BoardReader
web service to a collection.
BoardReader is an application that aggregates data from multiple
social media sources across the Internet. The
BoardReader crawler connects to
the BoardReader server to collect this data so that it can be
indexed, searched, and analyzed by
Watson Explorer Content Analytics. By using various BoardReader
REST APIs, the crawler collects data from the following BoardReader
aggregated feeds:
- Blogs (sites)
- Message boards (sites and forums)
- News (sites)
- Reviews (sites and authors)
- Videos (sites and authors)
By default, the BoardReader crawler
collects data that was aggregated by BoardReader during the
previous three months. You can specify a different crawling
time period when you configure the crawler.
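As a rough illustration of the default crawling period, the following Python sketch computes the date that lies three calendar months before the start of a crawler session. The function name crawl_window_start and the calendar-month reading of "three months" are assumptions made for this example; the crawler's actual date arithmetic is not described here.

```python
import calendar
from datetime import datetime


def crawl_window_start(session_start: datetime, months_back: int = 3) -> datetime:
    """Return the moment `months_back` calendar months before `session_start`.

    Assumption: "three months" means calendar months, with the day clamped
    to the last day of the target month (for example, May 31 -> Feb 28 or 29).
    """
    # Work with a zero-based month count so the subtraction can cross year boundaries.
    total_months = session_start.year * 12 + (session_start.month - 1) - months_back
    year, month_index = divmod(total_months, 12)
    month = month_index + 1
    # Clamp the day so that the result is always a valid calendar date.
    last_day = calendar.monthrange(year, month)[1]
    return session_start.replace(year=year, month=month, day=min(session_start.day, last_day))
```

Passing a different months_back value, such as 6, corresponds to specifying a different crawling time period.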
Crawler connection credentials
When you configure a BoardReader crawler,
you must specify the value of the BoardReader license key
that you received from BoardReader as part of your license
agreement. The crawler requires this key to collect data from the
BoardReader social media sources.
Alternatively,
you can configure the connection credentials when you specify security
settings for the system instead of specifying credentials
for each crawler. By using this approach, multiple crawlers
can use the same credentials.
BoardReader source properties and crawler metadata fields
BoardReader defines properties for each web
source type, such as the subject, the crawled date, and the author.
The BoardReader crawler
publishes these properties as crawler metadata fields.
The properties, and corresponding crawler fields, depend on
the BoardReader source type:
- The crawler publishes the BoardReader property Text as the Watson Explorer Content Analytics document body. The MIME
type of the document body is text/plain.
- The crawler publishes the BoardReader property Published as the
document date.
- For other BoardReader properties, the crawler field matches the
BoardReader property name, prefixed by the source category
name. For example, a property that is named AuthorInfo/Name
in a message board is mapped to a crawler field that is named
MessageBoard_AuthorInfo/Name.
- For the document title, the crawler maps the BoardReader property
Subject from blogs, message boards, and news sources to
an index field named title. For reviews and videos, the
crawler maps the BoardReader property Title to an index field named
title.
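The mapping rules above can be sketched in Python as follows. This is an illustration only, not the crawler's implementation: the function name map_properties and the shape of the returned dictionary are hypothetical, and the field prefix is passed in by the caller because MessageBoard is the only prefix named in this topic. The sketch also assumes that the title property is mapped only to title and is not additionally published as a prefixed field.

```python
# Property that supplies the document title for each source type,
# as stated in the mapping rules above.
TITLE_PROPERTY = {
    "blogs": "Subject",
    "message boards": "Subject",
    "news": "Subject",
    "reviews": "Title",
    "videos": "Title",
}


def map_properties(source_type: str, field_prefix: str, properties: dict) -> dict:
    """Map the BoardReader properties of one item to a document sketch.

    `field_prefix` is the source category name, for example "MessageBoard".
    """
    document = {
        "body": None,
        "body_mime_type": "text/plain",  # the document body is always plain text
        "date": None,
        "title": None,
        "fields": {},
    }
    title_property = TITLE_PROPERTY[source_type]
    for name, value in properties.items():
        if name == "Text":
            document["body"] = value
        elif name == "Published":
            document["date"] = value
        elif name == title_property:
            # Assumption: the title property is not also emitted as a prefixed field.
            document["title"] = value
        else:
            # All other properties keep their name, prefixed by the source category.
            document["fields"][f"{field_prefix}_{name}"] = value
    return document
```

For a message board item with the property AuthorInfo/Name, this sketch produces a field named MessageBoard_AuthorInfo/Name, which matches the example above.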
Identifying the content to crawl
Because the amount of data that can be obtained from BoardReader
is large, configure one or more of the following crawler conditions
to limit what is added to the collection. At a minimum, you
must specify a query condition or a domain condition, and you must
select both a top-level source type and lower-level sources
to crawl.
- You can limit the amount of data by specifying time limits. The
crawler can collect content that was published between
two specific dates, from a specific date to the start
of the crawler session, or for a specific amount of time, such as
six months before the start of a crawler session.
- You can narrow the scope by specifying one or more BoardReader
queries. The crawler applies the queries to limit the
content that is collected from the selected BoardReader
sources. If you specify more than one query, the crawler runs each
query separately and then applies Boolean OR logic to
the results. For example, if you specify two queries and
select Message Boards as the source type, the crawler searches the
message boards twice and then aggregates the results of
both queries. Contact BoardReader for information about
BoardReader query syntax.
- You can specify a filter to limit the crawler to specific domains.
For example, BoardReader handles Twitter accounts as sites.
To crawl all Twitter site content that was collected by
BoardReader, add twitter.com as a domain
condition.
- You can also narrow the scope by selecting specific BoardReader
source types to crawl, such as message boards or specific
message board sites and forums. Select at least one top-level
source type and at least one lower-level source to control the amount
of data that is crawled. For example, if you select Message
Boards as the source type, select one or more message
board sites or one or more of the forums that are hosted on a site.
You can also enter queries to narrow the list of available
sites and forums. For example, limit the list of sites
to sites that include IBM® in
their names or limit the list of forums to forums that contain Enterprise
in their titles.
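The way that multiple queries are combined can be sketched as follows. The crawler runs each query separately and then combines the results with Boolean OR logic, which this Python example models as a union. The names crawl_with_queries and run_query are hypothetical, and the assumption that a document matched by more than one query is kept only once (identified here by a document ID) is an interpretation of OR logic rather than documented behavior.

```python
from typing import Callable, Iterable, List


def crawl_with_queries(
    queries: Iterable[str],
    run_query: Callable[[str], Iterable[str]],
) -> List[str]:
    """Run each query separately and return the union of the matching document IDs.

    `run_query` stands in for one search of the selected BoardReader sources;
    it is a placeholder supplied by the caller, not a BoardReader API.
    """
    seen = set()
    merged = []
    for query in queries:
        # One separate pass over the selected sources per query.
        for doc_id in run_query(query):
            # Assumption: a document that matches several queries is kept once.
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```

With two queries and Message Boards as the source type, run_query is called twice, once per query, and the two result sets are merged.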
The following table shows the levels
available for the supported source types and the ways
that you can use them to narrow the scope of the crawl. If
you change the crawler conditions after you create the crawler,
a full recrawl is required to apply the changes.
| Top-Level (Source Type) | Second-Level | Third-Level |
| --- | --- | --- |
| Message Boards | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | Forum. You can enter a query to narrow the list of forums by forum name. |
| Blogs | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | |
| News | Site. You can enter a query to narrow the list of sites by site name, site domain, or site URL. | |
| Videos | Site. You can click a button to see a list of the available video sites. | Author. You can enter a query to narrow the list of authors by author name. |
| Reviews | Site. You can enter a query to narrow the list of sites by site name or site URL. | Author. You can enter a query to narrow the list of authors by author name. |
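The minimum crawler conditions and the source levels described in this section can be expressed as a small validation sketch in Python. The function validate_conditions, its parameters, and the (level, name) tuple format are hypothetical and exist only for illustration; the administration console performs its own checks.

```python
# Lower levels that are available for each top-level source type.
SOURCE_LEVELS = {
    "Message Boards": ("Site", "Forum"),
    "Blogs": ("Site",),
    "News": ("Site",),
    "Videos": ("Site", "Author"),
    "Reviews": ("Site", "Author"),
}


def validate_conditions(queries, domains, source_type, lower_level_sources):
    """Return a list of problems; an empty list means the minimum is met.

    `lower_level_sources` is a list of (level, name) tuples,
    for example [("Site", "example.com")].
    """
    problems = []
    # At a minimum, a query condition or a domain condition is required.
    if not queries and not domains:
        problems.append("specify at least one query condition or domain condition")
    if source_type not in SOURCE_LEVELS:
        problems.append(f"unknown top-level source type: {source_type!r}")
        return problems
    # At least one lower-level source must be selected for the source type.
    if not lower_level_sources:
        problems.append(f"select at least one lower-level source for {source_type}")
    allowed = SOURCE_LEVELS[source_type]
    for level, name in lower_level_sources:
        if level not in allowed:
            problems.append(f"{level} {name!r} is not a level of {source_type}")
    return problems
```

For example, selecting a forum under Blogs is reported as a problem because only the Message Boards source type has a forum level.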
Applying time zones and API filters
In BoardReader, dates are determined by the eastern United
States time zone (EST/EDT) for message board content, and
by GMT for all other source types. Because the crawler
cannot detect the time zone for each BoardReader site, the dates
that users see when they query a BoardReader collection might
be different from dates that appear on the original data source
sites (such as yahoo.com sites that were crawled by the BoardReader
service).
When you select the sources to be crawled,
you can specify which time zone is to be used to determine
the document dates. For best results, select a time zone based on
the time zone for the site to be crawled:
- You can select a default time zone. This time zone is used to
convert the BoardReader Published date to the indexed
document date.
- You can specify patterns to apply a time zone to documents that
match a pattern URL. For example, if you specify *.yahoo.co.jp=JST in
the time zone list, Japanese Standard Time is used to
determine the document date for documents that are published
on .yahoo.co.jp sites.
If you do not specify a time zone pattern, the default time
zone is applied. If you do not specify a default time zone,
the crawler uses the time zone that is configured for the
collection.
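The fallback order for time zones (a matching pattern, then the default time zone, then the collection time zone) can be sketched as follows. The function names are hypothetical. The sketch also assumes that patterns are matched against the host name of the document URL with shell-style wildcards and that the first matching entry wins; neither detail is confirmed by this topic.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def parse_time_zone_list(entries):
    """Parse entries such as "*.yahoo.co.jp=JST" into (pattern, zone) pairs."""
    rules = []
    for entry in entries:
        pattern, _, zone = entry.partition("=")
        rules.append((pattern.strip().lower(), zone.strip()))
    return rules


def resolve_time_zone(url, rules, default_zone, collection_zone):
    """Return the time zone ID that applies to a document URL.

    Order: first matching pattern, then the crawler default time zone,
    then the time zone that is configured for the collection.
    """
    # Assumption: the pattern is compared with the host name, ignoring case.
    host = (urlsplit(url).hostname or "").lower()
    for pattern, zone in rules:
        if fnmatchcase(host, pattern):
            return zone
    # No pattern matched: use the default, or the collection zone if no default is set.
    return default_zone or collection_zone
```

With the entry *.yahoo.co.jp=JST, a document that is published on a host under yahoo.co.jp resolves to JST, and any other document falls back to the default time zone or, if none is specified, to the collection time zone.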
You can specify BoardReader API parameters
to further limit the amount of data that the crawler collects.
For example, specify filter_language=ja to
collect only documents in the Japanese language, or specify filter_country=jp
to collect only documents that originate in Japan. To specify
more than one filter, combine the entries with an ampersand
(&) character, for example, filter_language=ja&filter_country=jp.
Contact BoardReader for information about available API parameters.
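If you generate the filter string programmatically, a minimal Python sketch is shown below. It uses only the two parameters named above, and the helper name build_api_filter is hypothetical. Note that urlencode percent-encodes special characters; whether the crawler configuration expects encoded values is an assumption that you should verify.

```python
from urllib.parse import urlencode


def build_api_filter(filters: dict) -> str:
    """Join filter parameters into one string with ampersand separators."""
    # urlencode keeps the insertion order of the dictionary entries.
    return urlencode(filters)


# Combine the language and country filters from the example above.
combined = build_api_filter({"filter_language": "ja", "filter_country": "jp"})
```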
Configuration overview
When you configure a BoardReader crawler
in the administration console, wizards help you provide required
information:
- Specify properties that control how the crawler operates and uses
system resources. The crawler properties control how the
crawler collects content from all of the BoardReader sources
that you add to the crawl space.
- Specify connection credentials.
- Select the BoardReader sources that you want to crawl by specifying
time limits, queries, domains, and source types.
- Specify options for how documents can be searched. For example,
you can exclude certain types of documents from the crawl
space.
- Set up a schedule for crawling BoardReader sites.