How to use IBM App Connect with Website Crawler
Website Crawler crawls web page URLs to retrieve the links that are available on the web pages or to download the HTML content of pages.
Cloud-managed connector
To retrieve page links, Website Crawler uses a basic website crawler algorithm that starts with a web page URL. The crawler goes to the URL and identifies all the hyperlinks in the statically defined content of each page. The crawler does not try to run any scripts; therefore, dynamically generated content is not crawled. You can customize the behavior of Website Crawler by setting filter parameters. For example, you can change the maximum depth of pages to be crawled, and force Website Crawler to exclude header and footer tags. Website Crawler can crawl public websites and websites in a private network (through the IBM® Secure Gateway).
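The following sketch illustrates this kind of depth-limited, static-content crawl in Node.js. It is an illustration of the algorithm only, not the connector's implementation; the crawl function and its parameters are hypothetical, and it assumes Node.js 18+ (for the global fetch) and the cheerio module that the connector also uses (see Reference).

    const cheerio = require("cheerio");

    // Hypothetical depth-limited crawler: fetch a page, extract the
    // statically defined <a href="..."> links, and recurse until the depth
    // limit is reached. No scripts are run, so dynamically generated
    // content is never seen.
    async function crawl(url, maxDepth, seen = new Set()) {
      if (maxDepth <= 0 || seen.has(url)) return seen;
      seen.add(url);
      const html = await (await fetch(url)).text(); // static HTML only
      const $ = cheerio.load(html);
      const links = $("a[href]")
        .map((i, el) => new URL($(el).attr("href"), url).href) // resolve relative links
        .get();
      for (const link of links) {
        if (link.startsWith("http")) await crawl(link, maxDepth - 1, seen);
      }
      return seen;
    }

Because only the fetched HTML is inspected, links that a script would generate at run time never appear in the crawl.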
Connecting to Website Crawler
To connect App Connect to a Website Crawler account that you want App Connect to use, you need the following connection details:
- Username: The username that is used to log in to Website Crawler.
- Password: The password for the specified username.
- Fully qualified web domain name to be crawled: The fully qualified domain of the website to be crawled, such as https://mydomain.com.
- Network name: The name of the network that App Connect uses to access the website. This value is needed only when you connect to a website in a private network.
To connect to Website Crawler from the App Connect Designer Catalog page for the first time, under Website Crawler click Connect. A Website Crawler account is created in App Connect. For more information, see Managing accounts in App Connect.
When you click Connect to create the connection account, App Connect checks that it can connect to the specified domain. If the check fails, App Connect displays an error message. Check that the specified values are correct, then try again.
Before you use the account that is created in App Connect in a flow, rename the account to something meaningful that helps you to identify it. To rename the account on the Catalog page, select the account, open its options menu (⋮), then click Rename Account.
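For reference, the four connection values can be summarized as follows. This is an illustrative summary only (with example values), not an App Connect configuration format:

    // Hypothetical summary of the connection details described above
    const connection = {
      username: "crawler-user",             // only if the website requires a login
      password: "********",
      domain: "https://mydomain.com",       // fully qualified web domain name to be crawled
      network: "my-private-network"         // only for websites on a private network
    };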
What to consider first
Before you use the App Connect Designer with Website Crawler, take note of the following considerations.
- (General consideration) You can see lists of the trigger events and actions that are available on the Catalog page of the App Connect Designer.
For some applications, the events and actions in the catalog depend on the environment (IBM Cloud Pak for Integration or App Connect on IBM Cloud) and whether the connector supports configurable events and dynamic discovery of actions. If the application supports configurable events, you see a Show more configurable events link under the events list. If the application supports dynamic discovery of actions, you see a Show more link under the actions list.
- (General consideration) If you are using multiple accounts for an application, the set of fields that is displayed when you select an action for that application can vary for different accounts. In the flow editor, some applications always provide a curated set of static fields for an action. Other applications use dynamic discovery to retrieve the set of fields that are configured on the instance that you are connected to. For example, if you have two accounts for two instances of an application, the first account might use settings that are ready for immediate use. However, the second account might be configured with extra custom fields.
- The Web page / Retrieve pages action crawls the static content of web pages. It returns a JSON response that includes, for each web page, a collection of links that are extracted from the page. The action starts with an initial web page URL that is specified by the Fully qualified web URL filter property. The crawler then crawls the links that are discovered down to the specified maximum depth of pages to be crawled.
- Crawling websites can take noticeable time, depending on the responsiveness of the website, the network, and other factors outside the control of App Connect. Therefore, for optimal behavior and to crawl the largest supported websites, use Website Crawler actions in a batch process. In a batch process, a Web page / Retrieve pages action can crawl a maximum of 500,000 web pages. If the maximum number is reached, an error message is issued.
- When a page is crawled, Website Crawler can process up to 1 MB of links that are discovered on that page. If more links are discovered, they are ignored, but a log message can be recorded.
- The Web page / Download page content action downloads the content of a web page in Base64 format. If necessary, you can convert the downloaded web page content from Base64 format by using the $base64decode JSONata function on the content, in the format {{$base64decode($WebCrawlerDownloadpagecontent.Content)}}. For more information, see https://jsonata.org/. (A Node.js equivalent is sketched after this list.)
- Website Crawler can crawl public website domains that do not use authentication, or it can be configured to use basic authentication where needed to enable website domains to be crawled.
- Website Crawler can be used to crawl public websites and websites on a private network (through the IBM Secure Gateway). When you configure a Website Crawler connection for a website on a private network, you must specify the hostname and port (for example, https://host:port). You must also use the IBM Secure Gateway to access the network. When you click Connect to create the connection account, App Connect checks whether it can connect to the specified domain. If you previously used the Secure Gateway Client to set up a network connection for an App Connect application on the same private network as the website, you can use this network connection with Website Crawler. If you do not have such a network connection in place, configure one as described in Configuring a private network for IBM App Connect.
- To protect your data, you cannot crawl a URL where the hostname includes localhost or a loopback IP address. A loopback IP address is used by a system to communicate with itself and is typically in the range 127.0.0.0 to 127.255.255.255 (the 127.0.0.0/8 block).
- For more considerations and details about Website Crawler, see Reference.
- If a URL is not getting crawled as you expect, the page might contain dynamic content (driven by JavaScript or another language) that is not seen by Website Crawler. Only static content is processed by Website Crawler.
- If a robots rule does not allow the website, or some pages, to be crawled, look for a message in the error log.
- If a URL is not being crawled, even though it is under an allowed subtree, the URL might be redirected to a target that is not under an allowed subtree.
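For the Base64 consideration above, the following Node.js sketch shows the equivalent of the $base64decode JSONata expression outside of a flow. The content value here is an example payload; in a flow, the value would come from the Download page content response.

    // Decode Base64 page content (Node.js equivalent of $base64decode)
    const content = "PGh0bWw+PC9odG1sPg==";                       // example Base64 payload
    const html = Buffer.from(content, "base64").toString("utf8");
    console.log(html);                                            // prints: <html></html>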
Events and actions
Website Crawler events
These events are for changes in this application that trigger a flow to start performing the actions in the flow.
Website Crawler actions
These are actions on this application that you want a flow to complete.
- Web page
  - Retrieve pages
  - Download page content
Examples
A company wants to analyze the content of its vehicle website to find specific information and extract actionable insights. They create an event-driven flow to use Website Crawler to download the content of appropriate web pages and upload the content as documents to IBM Watson™ Discovery to analyze. The company can then examine the documents and run various queries to find specific information and extract actionable insights.

Connecting Website Crawler: In the App Connect Designer Catalog, you create a Website Crawler account and set Fully qualified web domain name to be crawled to https://ibmmotors.com. The website does not need a username and password, and it is a public site so it does not need a secure network (the IBM Secure Gateway) to be configured.

Notes on the event-driven flow
- For ease of testing, a Scheduler event is used to trigger the flow.
- Website Crawler actions are run in a batch process for optimal behavior. The batch process completes a sequence of actions for each web page that is retrieved into its pagecollection array:
- The Website Crawler / Retrieve pages action is used to crawl the website from an initial page. For this website to be crawled, the action is configured with the following customized filter conditions (summarized in the sketch after these notes):
  - Fully qualified web URL is set to https://ibmmotors.com/vehicle-collection/. This initial web page is the first in a sequence of hub pages. Each page lists a number of links to pages for specific vehicles.
  - Maximum depth of pages to be crawled is set to 20. For this example, the depth value is used to limit the number of pages that the Website Crawler crawls. It crawls from the initial page to its vehicle pages, and to the next hub page and its linked vehicle pages, and so on, to the last hub page and its linked vehicle pages. But it does not crawl any links to pages deeper than 20 levels in the website.
  - Blocklist strings is set to .jpg;.png, which tells Website Crawler to ignore web page URLs that end with .jpg or .png.
  - Filter by sub tree is set to true, which tells Website Crawler to crawl only web pages with URLs that start with the URL of the initial page: https://ibmmotors.com/vehicle-collection/. Website Crawler ignores other web pages on the website, such as https://ibmmotors.com/contact-us/.
- The Website Crawler / Download content action is used to download the content of the web page, which is identified by its crawled URL from the batch process pagecollection object.
- A Google Sheets / Create row action is used to create an index of page IDs, URLs, and links.
- An IBM Watson Discovery / Update or create document action is used to upload the content of each web page as a separate document, where the source ID is the ID of the web page. The action updates any document that was previously uploaded into the IBM Watson Discovery environment (on a previous run of the event-driven flow). The action metadata indicates that the downloaded page content is in base64encoded format. The name of the document that is updated or created has the ID value of the web page that was crawled.
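The filter conditions in these notes can be summarized as follows. This is an illustrative summary only; the property names are shorthand for the filter labels in the flow editor, not the connector's JSON schema.

    // Illustrative summary of the Retrieve pages filter values in this example
    const filters = {
      fullyQualifiedWebUrl: "https://ibmmotors.com/vehicle-collection/",
      maximumDepthOfPagesToBeCrawled: 20,
      blocklistStrings: ".jpg;.png",     // ignore URLs that end with .jpg or .png
      filterBySubTree: true              // crawl only URLs under the initial page
    };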
Example result: When the flow was run, each crawled web page was processed by the batch process. The App Connect batch status view listed the number of pages that were processed and whether they were processed successfully:
For each web page that was crawled, the flow added a row to a Google Sheets spreadsheet.
The flow then uploaded the contents of the web pages to IBM Watson Discovery for analysis.
In IBM Watson Discovery, queries can be run to find specific information and extract actionable insights.
Reference
Supported API versions
The Website Crawler application uses the simplecrawler node module and some additional node modules:
- simplecrawler v1.1.6 is a flexible, event-driven crawler for Node.js.
Extra node modules:
- cheerio 1.0.0-rc.2 is a fast, flexible, and lean implementation of core jQuery that is designed specifically for the server.
- robots-parser is a NodeJS robots.txt parser with support for wildcard (*) matching (a usage sketch follows this list).
- uuid generates RFC-compliant UUIDs in JavaScript.
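As a brief illustration of the robots-parser module (useful when you investigate the robots-rule consideration noted earlier), the following sketch uses the module's documented API with example values:

    const robotsParser = require("robots-parser");

    // Parse an example robots.txt and test whether a URL may be crawled
    const robots = robotsParser(
      "https://example.com/robots.txt",
      "User-agent: *\nDisallow: /private/"
    );
    console.log(robots.isAllowed("https://example.com/page.html", "mybot")); // true
    console.log(robots.isAllowed("https://example.com/private/x", "mybot")); // false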
Rate limit and retry logic
- Crawl interval in milliseconds
- Timeout in milliseconds
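The underlying simplecrawler module exposes similarly named properties. The following sketch shows how they look in simplecrawler v1 directly; the values are examples only, and this is not the connector's code:

    const Crawler = require("simplecrawler");

    const crawler = new Crawler("https://example.com/");
    crawler.interval = 250;    // crawl interval in milliseconds between requests
    crawler.timeout = 30000;   // timeout in milliseconds for each fetch
    crawler.maxDepth = 3;      // depth limit, comparable to the connector's filter

    crawler.on("fetchcomplete", (queueItem) => {
      console.log("Fetched:", queueItem.url);
    });
    crawler.on("complete", () => console.log("Crawl finished"));
    crawler.start();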
Supported connection authentication types and objects
Website Crawler can crawl public website domains that do not use authentication, or it can be configured to use basic authentication where needed to enable website domains to be crawled.
Use of IBM Secure Gateway
When you configure the application connection property Fully qualified web domain name to be crawled for a private website domain to be accessed through the IBM Secure Gateway, you must include the port number that is used to access the website.
- Typically, HTTP websites are accessed on port 80, and HTTPS websites are accessed on port 443. For example, use of these ports for a private website domain can be specified on Fully qualified web domain name to be crawled as http://www.example.com:80/ or https://www.example.com:443/.
- However, the connection property Fully qualified web domain name to be crawled supports only one website domain (and port). The entire website to be crawled must be accessible on the same protocol (HTTP or HTTPS) and on the same port. Any discovered links that have a different protocol or port number are not crawled. For example, if the domain to be crawled is configured as https://www.example.com:9080/, and the web page https://www.example.com:9080/apage.html contains references to http://www.example.com:9080/somecontent or https://www.example.com:9081/somecontent, those URLs are not crawled. (This rule is illustrated in the sketch after this list.)
- If you do not supply a valid port number, Website Crawler issues an error when you create the account.
- The web domain of the connection property Fully qualified web domain name to be crawled must match the domain of the action's Fully qualified web URL property. If the domains do not match, the crawler cannot crawl the requested web domain.
- If Website Crawler encounters a link to a public web page while it is crawling, it crawls that page but not through the IBM Secure Gateway.
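A small sketch of the matching rule described above, using Node's URL parsing (a hypothetical helper, not connector code):

    // Returns true only when the link uses the same protocol, host, and port
    // as the configured domain, mirroring the Secure Gateway restriction.
    function isCrawlable(configuredDomain, discoveredLink) {
      const base = new URL(configuredDomain);
      const link = new URL(discoveredLink);
      return link.protocol === base.protocol &&
             link.hostname === base.hostname &&
             link.port === base.port;
    }

    isCrawlable("https://www.example.com:9080/", "https://www.example.com:9080/apage.html");  // true
    isCrawlable("https://www.example.com:9080/", "http://www.example.com:9080/somecontent");  // false (protocol)
    isCrawlable("https://www.example.com:9080/", "https://www.example.com:9081/somecontent"); // false (port)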
Supported objects and operations
Website Crawler can be triggered by or act on particular objects, and it provides certain operations (events and actions).
Web page
A web page is a document that is present on the website to be crawled. A web page might be a document that is based on HTML (hypertext markup language) or a document in another format like JPG, PNG, or PDF.
- Web page / Retrieve pages action
  The Web page / Retrieve pages action crawls web pages and returns a JSON pagecollection object. This object contains a record for each crawled web page (URL) that provides page properties and a collection of links that are extracted from the page. The action starts with an initial web page URL that is specified by the Fully qualified web URL filter property. From the static content of an HTML web page, the action extracts any links (<a href="..."), then adds the web page URL and extracted links to the links collection for the action response. The action continues to crawl links from an initial page to other HTML web pages (at the next depth level). From the static content of each of those web pages, it extracts any links, then adds them to the links collection for the action response. The action crawls links to more HTML web pages down to the depth that is specified by the Maximum depth of pages to be crawled filter property. (An illustrative response shape is sketched after this list.)
- Web page / Download page content action
  The Web page / Download page content action downloads the content of a web page in Base64 format.
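The following sketch shows an illustrative shape for a pagecollection response. The field names are representative only, not the connector's exact schema:

    // Hypothetical Retrieve pages response for a two-page crawl
    const response = {
      pagecollection: [
        {
          url: "https://ibmmotors.com/vehicle-collection/",
          depth: 1,
          links: [
            "https://ibmmotors.com/vehicle-collection/vehicle-1/",
            "https://ibmmotors.com/vehicle-collection/vehicle-2/"
          ]
        },
        {
          url: "https://ibmmotors.com/vehicle-collection/vehicle-1/",
          depth: 2,
          links: []
        }
      ]
    };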