When you configure a Web crawler, you can specify whether the index
retains the anchor text of links that point to documents that the
crawler is forbidden to crawl.
About this task
Directives in a robots.txt file or
in the metadata of Web documents can prevent the Web crawler from
accessing documents on a Web site. If a document that the Web crawler
is allowed to crawl includes links to forbidden documents, you can
specify how you want to handle the anchor text for those links.
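For example, a site can forbid crawler access with a directive in
its robots.txt file or with a robots meta tag in a document's HTML
header, as in the following snippets (the path and values shown are
illustrative):

   User-agent: *
   Disallow: /private/

   <meta name="robots" content="noindex, nofollow">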
You specify whether to index the anchor text of links to forbidden
documents when you configure the Web crawler. For maximum security,
specify that you do not want to index this anchor text. If you do
not index the anchor text, however, the search results might not
include all of the documents that are potentially relevant to a query.
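For example, suppose that a page the crawler is allowed to crawl
contains the following link to a forbidden document (the URL and
wording are a hypothetical illustration):

   <a href="http://example.com/private/report.html">Quarterly earnings report</a>

If anchor text is indexed, a search for "quarterly earnings report"
can return a result that points to the forbidden page, even though
the crawler never read the content of that page.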
Procedure
To enable or disable the indexing of anchor text in links
to forbidden documents:
- In the Crawl and Import pane of the Collections view,
locate the Web crawler that you want to configure and click Edit
crawler properties.
- Expand the advanced options and click Edit advanced
Web crawler properties.
- To index the anchor text in all of the documents that this
crawler crawls, select the Index the anchor text in links
to forbidden documents check box.
Users can then
learn about pages that the Web crawler is not allowed to crawl by
searching for text that occurs in the anchor text of links that
point to those pages.
To exclude the anchor text of links to forbidden
documents from the index, clear this check box. Users will not be
able to learn about pages that the Web crawler is not allowed to crawl:
both the forbidden documents and the anchor text of the links that
point to them are excluded from the index.
- Click OK and then, on the Web
Crawler Properties page, click OK again.
- For the changes to become effective, stop and restart the
crawler.
What to do next
To apply the changes to documents that were previously indexed,
the documents must be crawled and indexed again.
If a previous crawl added information about forbidden documents to
the index, that information is removed from the index when the
documents are indexed again.