How to use IBM App Connect with Website Crawler

Website Crawler crawls web page URLs to retrieve links that are available on the web pages or download the HTML content of pages.

Availability:
  • A connector in IBM App Connect on IBM Cloud (Cloud-managed connector)

To retrieve page links, Website Crawler uses a basic website crawler algorithm that starts with a web page URL. The crawler goes to the URL and identifies all the hyperlinks in the statically defined content of each page. The crawler does not try to run any scripts; therefore, dynamically generated content is not crawled. You can customize the behavior of Website Crawler by setting filter parameters. For example, you can change the maximum depth of pages to be crawled, and force Website Crawler to exclude header and footer tags. Website Crawler can crawl public websites and websites in a private network (through the IBM® Secure Gateway).
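
The following TypeScript sketch illustrates the general idea of such a depth-limited crawl of static content. It is a minimal illustration only, not the Website Crawler implementation; the use of Node.js fetch and the cheerio module, and all names in the sketch, are assumptions made for the example.

// Illustrative sketch only: a breadth-first, depth-limited crawl of static HTML.
// Assumes Node.js 18+ (global fetch) and the cheerio module; this is not the
// Website Crawler implementation, just the general algorithm that it describes.
import * as cheerio from "cheerio";

interface CrawledPage {
  url: string;
  links: string[];
}

async function crawl(startUrl: string, maxDepth: number): Promise<CrawledPage[]> {
  const visited = new Set<string>();
  const results: CrawledPage[] = [];
  let frontier: string[] = [startUrl];

  // Depth 0 is the initial page; each later level crawls the links that were
  // discovered at the previous level, up to and including maxDepth.
  for (let depth = 0; depth <= maxDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const url of frontier) {
      if (visited.has(url)) continue;
      visited.add(url);

      const response = await fetch(url);
      const html = await response.text();

      // Extract hyperlinks from the statically defined content only;
      // no scripts are run, so dynamically generated links are never seen.
      const $ = cheerio.load(html);
      const links: string[] = [];
      $("a[href]").each((_, element) => {
        const href = $(element).attr("href");
        if (href) links.push(new URL(href, url).toString()); // resolve relative links
      });

      results.push({ url, links });
      next.push(...links); // candidates for the next depth level
    }
    frontier = next;
  }
  return results;
}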

The following information describes how to use IBM App Connect to connect Website Crawler to your other applications.

Connecting to Website Crawler

To connect App Connect to a Website Crawler account that you want App Connect to use, you need the following connection details:

  • Username: The username that is used to log in to Website Crawler.
  • Password: The password for the specified username.
  • Fully qualified web domain name to be crawled: The fully qualified domain of the website to be crawled, such as https://mydomain.com.
  • Network name: The name of the network that App Connect uses to access the website. This value is needed only when you connect to a website in a private network.

To connect to Website Crawler from the App Connect Designer Catalog page for the first time, under Website Crawler click Connect. A Website Crawler account is created in App Connect. For more information, see Managing accounts in App Connect.

When you click Connect to create the connection account, App Connect checks that it can connect to the specified domain. If the check fails, App Connect displays an error message. Check that the specified values are correct, then try again.

Tip:

Before you use the account that is created in App Connect in a flow, rename the account to something meaningful that helps you to identify it. To rename the account on the Catalog page, select the account, open its options menu (⋮), then click Rename Account.

What to consider first

Before you use the App Connect Designer with Website Crawler, take note of the following considerations.

  • (General consideration) You can see lists of the trigger events and actions that are available on the Catalog page of the App Connect Designer.

    For some applications, the events and actions in the catalog depend on the environment (IBM Cloud Pak for Integration or App Connect on IBM Cloud) and whether the connector supports configurable events and dynamic discovery of actions. If the application supports configurable events, you see a Show more configurable events link under the events list. If the application supports dynamic discovery of actions, you see a Show more link under the actions list.

  • (General consideration) If you are using multiple accounts for an application, the set of fields that is displayed when you select an action for that application can vary for different accounts. In the flow editor, some applications always provide a curated set of static fields for an action. Other applications use dynamic discovery to retrieve the set of fields that are configured on the instance that you are connected to. For example, if you have two accounts for two instances of an application, the first account might use settings that are ready for immediate use. However, the second account might be configured with extra custom fields.
  • The Web page / Retrieve pages action crawls the static content of web pages. It returns a JSON response that includes, for each web page, a collection of links that are extracted from the page. The action starts with an initial web page URL that is specified by the Fully qualified web URL filter property. The crawler then crawls the links that are discovered down to the specified maximum depth of pages to be crawled.
  • Crawling websites can take noticeable time, depending on the responsiveness of the website, the network, and other factors outside the control of App Connect. Therefore, for optimal behavior, and to crawl the largest websites that are supported, use Website Crawler actions in a batch process. In a batch process, a Web page / Retrieve pages action can crawl a maximum of 500,000 web pages. If the maximum number is reached, an error message is issued.
  • When a page is crawled, Website Crawler can process up to 1 MB of links that are discovered on that page. If more links are discovered, they are ignored, but a log message can be recorded.
  • The Web page / Download page content action downloads the content of a web page in Base64 format.

    If necessary, you can convert the downloaded web page content from Base64 format by using the $base64decode JSONata function on the content, in the format {{$base64decode($WebCrawlerDownloadpagecontent.Content)}}. For more information, see https://jsonata.org/.

  • Website Crawler can crawl public website domains that do not use authentication, or it can be configured to use basic authentication where needed to enable website domains to be crawled.
  • Website Crawler can be used to crawl public websites and websites on a private network (through the IBM Secure Gateway). When you configure a Website Crawler connection for a website on a private network, you must specify the hostname and port (for example, https://host:port). You must also use the IBM Secure Gateway to access the network. When you click Connect to create the connection account, App Connect checks whether it can connect to the specified domain. If you previously used the Secure Gateway Client to set up a network connection for an App Connect application on the same private network as the website, you can use this network connection with Website Crawler. If you do not have such a network connection in place, configure one as described in Configuring a private network for IBM App Connect.
  • To protect your data, you cannot crawl a URL where the hostname includes localhost or a loopback IP address. A loopback IP address is used by a system to communicate with itself and is typically in the 127.0.0.0/8 range (127.0.0.0 to 127.255.255.255).
  • For more considerations and details about Website Crawler, see Reference.
Tip: If the results of crawling a website do not match what you expect, consider the following possible causes.
  • If a URL is not getting crawled as you expect, the page might contain dynamic content (driven by JavaScript or another language) that is not seen by Website Crawler. Only static content is processed by Website Crawler.
  • If a robots.txt rule does not allow the website, or some of its pages, to be crawled, look for a message in the error log.
  • If a URL is not being crawled, even though it is under an allowed subtree, the URL might be redirected to a target that is not under an allowed subtree.

Events and actions

Website Crawler events

These events are for changes in this application that trigger a flow to start performing the actions in the flow.

Note: Events are not available for changes in this application. You can trigger a flow in other ways, such as at a scheduled interval or at specific dates and times.

Website Crawler actions

These are actions on this application that you want a flow to complete.

Web page
  • Retrieve pages
  • Download page content

Examples

A company wants to analyze the content of its vehicle website to find specific information and extract actionable insights. The company creates an event-driven flow that uses Website Crawler to download the content of appropriate web pages and upload that content as documents to IBM Watson™ Discovery for analysis. The company can then examine the documents and run various queries to find the specific information and actionable insights that it needs.

Figure 1. Website Crawler flow for analysis of data that is retrieved from page content on a vehicle website

Connecting Website Crawler: In the App Connect Designer Catalog, you create a Website Crawler account and set Fully qualified web domain name to be crawled to https://ibmmotors.com. The website does not need a username and password, and it is a public site so it does not need a secure network (the IBM Secure Gateway) to be configured.

Connect to Website Crawler parameters in the App Connect Catalog

Notes on the event-driven flow

  1. For ease of testing, a Scheduler event is used to trigger the flow.
  2. Website Crawler actions are run in a batch process for optimal behavior. The batch process completes a sequence of actions for each web page that is retrieved into its pagecollection array:
    Page collection data properties of the batch process

  3. The Website Crawler / Retrieve pages action is used to crawl the website from an initial page. For this website to be crawled, the action is configured with the following customized filter conditions:
    • Fully qualified web URL is set to https://ibmmotors.com/vehicle-collection/.

      This initial web page is the first in a sequence of hub pages. Each page lists a number of links to pages for specific vehicles.

    • Maximum depth of pages to be crawled is set to 20. For this example, the depth value is used to limit the number of pages that the Website Crawler crawls. It crawls from the initial page to its vehicle pages, and to the next hub page and its linked vehicle pages, and so on, to the last hub page and its linked vehicle pages. But it does not crawl any links to pages deeper than 20 levels in the website.
    • Blocklist strings is set to .jpg;.png, which tells Website Crawler to ignore web page URLs that end with .jpg or .png.
    • Filter by subtree is set to true, which tells Website Crawler to crawl only web pages with URLs that start with the URL of the initial page: https://ibmmotors.com/vehicle-collection/. Website Crawler ignores other web pages on the website, such as https://ibmmotors.com/contact-us/.
  4. The Website Crawler / Download page content action is used to download the content of each web page, which is identified by its crawled URL from the batch process pagecollection object.
    Download content action properties

  5. A Google Sheets / Create row action is used to create an index of page IDs, URLs, and links.
    Spreadsheet columns (properties)

  6. An IBM Watson Discovery / Update or create document action is used to upload the content of each web page as a separate document, where the source ID is the ID of the web page. The action updates any document that was previously uploaded into the IBM Watson Discovery environment (on a previous run of the event-driven flow). The action metadata indicates that the downloaded page content is in Base64-encoded format. The name of the document that is updated or created is the ID value of the web page that was crawled.
    Update or create document action properties

Example result: When the flow was run, each crawled web page was processed by the batch process. The App Connect batch status view listed the number of pages that were processed and whether they were processed successfully:

The App Connect batch status view listed the number of pages processed

For each web page that was crawled, the flow added a row to a Google Sheets spreadsheet.

Google spreadsheet showing rows added by the App Connect flow

The flow then uploaded the contents of the web pages to IBM Watson Discovery for analysis.

Data added to IBM Watson Discovery for analysis

In IBM Watson Discovery, queries can be run to find specific information and extract actionable insights.

IBM Watson Discovery analysis of results

Reference

Supported API versions

The Website Crawler application uses the simplecrawler node module and some additional node modules:

  • simplecrawler v1.1.6 is a flexible, event-driven crawler for Node.js.

Extra node modules:

  • cheerio 1.0.0-rc.2 is a fast, flexible, and lean implementation of core jQuery that is designed specifically for the server.
  • robots-parser is a NodeJS robots.txt parser with support for wildcard (*) matching.
  • uuid generates RFC-compliant UUIDs in JavaScript.

Rate limit and retry logic

The application provides the following configuration properties for limits that are associated with crawling websites:
  • Crawl interval in milliseconds
  • Timeout in milliseconds

Supported connection authentication types and objects

Website Crawler can crawl public website domains that do not use authentication, or it can be configured to use basic authentication where needed to enable website domains to be crawled.

Use of IBM Secure Gateway

When you configure the application connection property Fully qualified web domain name to be crawled for a private website domain to be accessed through the IBM Secure Gateway, you must include the port number that is used to access the website.

  • Typically, HTTP websites are accessed on port 80, and HTTPS websites are accessed on port 443. For example, use of these ports for a private website domain can be specified on Fully qualified web domain name to be crawled as http://www.example.com:80/ or https://www.example.com:443/.
  • However, the connection property Fully qualified web domain name to be crawled supports only one website domain (and port). The entire website to be crawled must be accessible on the same protocol (HTTP or HTTPS) and on the same port. Any discovered links that have a different protocol or port number are not crawled. For example, if the domain to be crawled is configured as https://www.example.com:9080/, and the web page https://www.example.com:9080/apage.html contains references to http://www.example.com:9080/somecontent or https://www.example.com:9081/somecontent, those URLs are not crawled. (A sketch that illustrates this protocol-and-port check follows this list.)
  • If you do not supply a valid port number, Website Crawler issues an error when you create the account.
  • The web domain of the connection property Fully qualified web domain name to be crawled must match the domain that is specified in the Fully qualified web URL action property. If the domains do not match, the crawler cannot crawl the requested web domain.
  • If Website Crawler encounters a link to a public web page while it is crawling, it crawls that page but not through the IBM Secure Gateway.
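
The same-protocol-and-port rule in the preceding list can be expressed as a simple origin comparison. The following sketch is an illustration only; the helper name and the use of the WHATWG URL API are assumptions for the example.

// Illustration only: under this rule, a discovered link is eligible for
// crawling only if its protocol, hostname, and port all match the configured
// web domain. The helper name is an assumption for the example.
function hasSameProtocolHostAndPort(configuredDomain: string, discoveredLink: string): boolean {
  const base = new URL(configuredDomain);   // for example, https://www.example.com:9080/
  const link = new URL(discoveredLink);
  return base.protocol === link.protocol
    && base.hostname === link.hostname
    && base.port === link.port;
}

// Using the example in the list above:
// hasSameProtocolHostAndPort("https://www.example.com:9080/", "https://www.example.com:9080/apage.html");  // true
// hasSameProtocolHostAndPort("https://www.example.com:9080/", "http://www.example.com:9080/somecontent");  // false (protocol)
// hasSameProtocolHostAndPort("https://www.example.com:9080/", "https://www.example.com:9081/somecontent"); // false (port)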

Supported objects and operations

Website Crawler can be triggered by or act on particular objects, and it provides certain operations (events and actions).

Web page

A web page is a document that is present on the website to be crawled. A web page might be a document that is based on HTML (hypertext markup language) or a document in another format like JPG, PNG, or PDF.

Supported operations (actions)
Web page / Retrieve pages action

The Web page / Retrieve pages action crawls web pages and returns a JSON pagecollection object. This object contains a record for each crawled web page (URL) that provides page properties and a collection of links that are extracted from the page. The action starts with an initial web page URL that is specified by the Fully qualified web URL filter property. From the static content of the initial HTML web page, the action extracts any links (<a href="..."> tags), then adds the web page URL and the extracted links to the links collection for the action response. The action then crawls the extracted links to other HTML web pages (at the next depth level), extracts any links from the static content of each of those pages, and adds them to the links collection for the action response. The action continues to crawl links to further HTML web pages down to the depth that is specified by the Maximum depth of pages to be crawled filter property.

Default filter properties:

The following properties are provided by default to filter the behavior of the action.
Allow self-signed certificate
  • To crawl a website that has self-signed certificates, set this property to true.
  • Supported values are true (the default) and false.
Crawl interval in milliseconds
  • This property specifies the interval that Website Crawler waits between crawling each page URL that is retrieved by the action. The interval is used to reduce the load on the website that is being crawled.
  • Supported values are slow (2500 ms), medium (1500 ms), and fast (700 ms). The default value is slow.
Exclude header and footer tags
  • To exclude header and footer tags, set this property to true. Links in header and footer tags are not crawled. (A sketch that illustrates this exclusion follows these default filter properties.)
  • Supported values are true and false (the default).
Filter by domain
  • If this property is set to true, Website Crawler does not crawl links that contain a host domain that is different from the domain that is configured in Fully qualified web URL. (Any such links are still listed in the links collection in the action response.)
  • To crawl a website on a private network through the IBM Secure Gateway, you must set Filter by domain to true because public website pages cannot be crawled through the IBM Secure Gateway.
  • Supported values are true (the default) and false.
Fully qualified web URL
  • Specify the URL of the initial page to be crawled.
  • Specify a URL that starts with the same protocol and domain as is specified on the Website Crawler connection parameter Fully qualified web domain name to be crawled.
Maximum depth of pages to be crawled
  • Set the maximum depth of pages to be crawled to an integer value, such as 1, 5 (the default), or 20. This property directs Website Crawler to crawl pages only until the specified depth level, as demonstrated in the following examples.
    • Maximum depth = 0: Crawl only the initial page (which is specified by the Fully qualified web URL property) to return a collection of the links that the page contains.
    • Maximum depth = 1: Crawl the initial page, then crawl any HTML page links that were discovered in the initial page. This setting returns a collection of the links that are contained in the initial page and in any discovered HTML page links (at 'depth 1').
    • Maximum depth = 2: Crawl the initial page, then crawl any HTML page links that were discovered in the initial page (at 'depth 1'), then crawl any HTML page links that were discovered in the web pages at 'depth 1' (at 'depth 2').
Timeout in milliseconds
  • This property specifies the maximum time that Website Crawler waits for a web page to respond before it stops trying to crawl that page and moves on to the next page in the list.
  • The default value is 30000 milliseconds.
Respect robots TXT rules set by Web master
  • This property specifies whether the connector adheres to the robots.txt rules that are defined by a website. It applies to all publicly accessible websites and to websites that are hosted behind a firewall.
  • If a website is behind a firewall and the Filter by domain option is set to true, the property can be overridden. If the site is public, changing this property has no effect.
  • Supported values are true (the default) and false.
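
As an illustration of the Exclude header and footer tags behavior, the following sketch removes header and footer elements before it collects links. It is a sketch only; the exact tags that the connector excludes, and the use of the cheerio module here, are assumptions for the example.

// Illustration only: remove header and footer content before collecting links,
// so that links inside those tags are never queued for crawling. The tags that
// are matched here ("header" and "footer" elements) are an assumption.
import * as cheerio from "cheerio";

function extractLinks(html: string, excludeHeaderAndFooter: boolean): string[] {
  const $ = cheerio.load(html);
  if (excludeHeaderAndFooter) {
    $("header, footer").remove();
  }
  const links: string[] = [];
  $("a[href]").each((_, element) => {
    const href = $(element).attr("href");
    if (href) links.push(href);
  });
  return links;
}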

Extra filter properties:

When you configure the Web page / Retrieve pages action, you can click (+) Add to add any of the following filters.
Blocklist strings
  • This filter specifies one or more substrings for URLs to be excluded from the website crawl. Separate multiple strings with a semicolon (;).
  • If a string matches part of a URL, that URL is not crawled.

For example, a website page contains the links: https://my.domain/afolder/open.html, https://my.domain/bfolder/private_ano.html, and https://my.domain/afolder/private.html. If you set Blocklist strings to /private, the response for the website page contains the links list with only https://my.domain/afolder/open.html. Website Crawler ignores the pages https://my.domain/bfolder/private_ano.html and https://my.domain/afolder/private.html. (A sketch that illustrates how this filter and Filter by subtree might be applied follows these extra filter properties.)

Filter by subtree
  • If this property is set to true, only pages under the same initial page URL are crawled. For example, if the page URL in Fully qualified web URL is specified as https://www.example.com/partners, and Filter by subtree is set to true, only URLs that begin with https://www.example.com/partners/ are crawled.
  • If the value is set to false, Website Crawler crawls all pages from the initial page URL. For example, the page URL in Fully qualified web URL is specified as https://www.example.com/partners, and Filter by subtree is set to false. In this case, all links with URLs that begin with https://www.example.com/ are crawled down to the specified maximum depth. (Links that begin with different URLs can also be crawled if Filter by domain is set to false and other website conditions allow crawling.)
  • Supported values are true and false (the default).
Note: If the nofollow and noindex directives are set in the header of the seed URL, only the seed URL is crawled; no other URLs are crawled.
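
The following sketch illustrates how the Blocklist strings and Filter by subtree values might be combined to decide whether a discovered URL is crawled. It is an illustration based on the descriptions above; the function name, parameters, and exact matching behavior are assumptions.

// Illustration only: apply the Blocklist strings and Filter by subtree values
// to a discovered URL. Names and matching behavior are assumptions based on
// the filter descriptions above.
function shouldCrawl(url: string, blocklist: string, initialUrl: string, filterBySubtree: boolean): boolean {
  // Blocklist strings: semicolon-separated substrings; any match excludes the URL.
  const blocked = blocklist
    .split(";")
    .filter(entry => entry.length > 0)
    .some(entry => url.includes(entry));
  if (blocked) return false;

  // Filter by subtree: crawl only URLs under the initial page URL.
  if (filterBySubtree && !url.startsWith(initialUrl)) return false;

  return true;
}

// shouldCrawl("https://my.domain/afolder/private.html", "/private", "https://my.domain/afolder/", true); // false
// shouldCrawl("https://my.domain/afolder/open.html", "/private", "https://my.domain/afolder/", true);    // true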

Response: The Web page / Retrieve pages action returns a JSON pagecollection object. This object contains a record for each crawled web page (URL) that provides page properties and a collection of links that are extracted from the page, in the following format.


{
    "Id": "Unique identifier of the crawled URL",
    "Url": "Crawled URL",
    "ContentType": "Content type of the response",
    "LastModified": "Last modified date",
    "Links": [
        {
            "id": "Unique identifier of an extracted link",
            "url": "URL of the extracted link"
        },
        ...
    ]
}
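
For reference when you map this response in downstream code, the record structure can be described by the following TypeScript interfaces. The field names are taken from the format above; typing LastModified as an ISO date string is an assumption based on the example response that follows.

// Illustrative typing of one record in the pagecollection response.
interface ExtractedLink {
  id: string;   // unique identifier of the extracted link
  url: string;  // URL of the extracted link
}

interface CrawledPageRecord {
  Id: string;            // unique identifier of the crawled URL
  Url: string;           // crawled URL
  ContentType: string;   // content type of the response, for example "text/html"
  LastModified: string;  // last modified date (an ISO 8601 string in the example below)
  Links: ExtractedLink[];
}
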
Example: A website contains the following pages and links to URLs in the same domain:
  • http://my.co.host.com/helloworld/web_page.html (depth 1) with links to site.css and page_2.html.
  • http://my.co.host.com/helloworld/page_2.html (depth 2) with links to site.css, page_3.html, and web_page.html.
  • http://my.co.host.com/helloworld/page_3.html (depth 3) with links to site.css and page_2.html.
The Web page / Retrieve pages action is configured with the following filter values:
  • Fully qualified web URL is set to http://my.co.host.com/helloworld/web_page.html.
  • Maximum depth of pages to be crawled is set to 3.
The action returned the following response:

{"response":{"payload":[
{"Id":"fc0af263-cbea-5e58-b3e0-52f3020e6d5e","Url":"http://my.co.host.com/helloworld/web_page.html","ContentType":"text/html","LastModified":"2018-11-17T17:50:52.000Z","Links":[{"id":"a02d2a25-77a3-5352-b612-59b2a965d55d","url":"http://my.co.host.com/helloworld/page_2.html"}]},
{"Id":"a02d2a25-77a3-5352-b612-59b2a965d55d","Url":"http://my.co.host.com/helloworld/page_2.html","ContentType":"text/html","LastModified":"2018-11-17T17:50:52.000Z","Links":[{"id":"05884593-6b22-59bb-8696-fbbcd86a0898","url":"http://my.co.host.com/helloworld/page_3.html"},{"id":"fc0af263-cbea-5e58-b3e0-52f3020e6d5e","url":"http://my.co.host.com/helloworld/web_page.html"}]},
{"Id":"05884593-6b22-59bb-8696-fbbcd86a0898","Url":"http://my.co.host.com/helloworld/page_3.html","ContentType":"text/html","LastModified":"2018-11-17T17:50:52.000Z","Links":[{"id":"a02d2a25-77a3-5352-b612-59b2a965d55d","url":"http://my.co.host.com/helloworld/page_2.html"}]}
]}}

Limitations:

  • During the website crawl, if the number of URLs to be crawled exceeds 500,000, App Connect issues a message and further discovered URLs are ignored.
  • When a page is crawled, Website Crawler can process up to 1 MB of links that are discovered on that page. If more links are discovered, they are ignored, but a log message can be recorded.
Web page / Download page content action

The Web page / Download page content action downloads the content of a web page in Base64 format.

Filter properties: You can use the following parameters to filter the behavior of the action.
URL
This property is mandatory, and specifies the fully qualified URL of the web page to be downloaded.
Ignore invalid SSL
  • To download the content of a web page from a website that uses self-signed certificates, set this property to true.
  • Supported values are true (the default) and false.
Content Type
  • This property is mandatory, and specifies whether the file contains binary or text content.
  • Supported values are Binary (the default) and Text. (You can use the Custom option to map a value from a preceding event or action.)

Response: The Web page / Download page content action returns a JSON pagecontent object. This object contains the content of a web page in Base64 format.

Tip: If necessary, you can convert the downloaded web page content from Base64 format by using the $base64decode JSONata function on the content: {{$base64decode($WebCrawlerDownloadpagecontent.Content)}}.
JSONata function used to convert downloaded web page content from Base64 format

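If you process the downloaded content in code outside App Connect instead, the same conversion is standard Base64 decoding. The following Node.js sketch is an illustration only; the sample value is a stand-in for the Content field of the action response.

// Illustration only: decode Base64 page content, as returned by the
// Download page content action, into text. The sample value is a stand-in.
const base64Content = "PGh0bWw+PGJvZHk+SGVsbG88L2JvZHk+PC9odG1sPg=="; // "<html><body>Hello</body></html>"
const decodedHtml = Buffer.from(base64Content, "base64").toString("utf8");
console.log(decodedHtml); // <html><body>Hello</body></html>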