URL RFC Standards

Watson™ Explorer Engine strictly adheres to the Internet Engineering Task Force (IETF) Request for Comments (RFC) standards. Because web sites and web browsers do not strictly adhere to these standards, there are several things you should keep in mind when configuring your project.

  • Watson Explorer Engine percent-encodes non-ASCII characters in URLs and normalizes the hexadecimal digits of each percent-encoded triplet to lowercase. Case normalization strategies vary among software, but the IETF standard requires that these hexadecimal digits be treated as case-insensitive, so %3A and %3a are equivalent. For details, see RFC 3986, Section 6.2.2: Syntax-Based Normalization.
  • Watson Explorer Engine drops everything after the anchor symbol (#) when checking whether it has already crawled a URL.
  • Watson Explorer Engine does not recognize file paths (for example, C:\..) as URLs. Use a URL like the following to reference a resource on the local filesystem: file:///C%3a/Program%20Files/my%20file.txt
  • Watson Explorer Engine does not recognize Windows Universal Naming Convention (UNC) file paths (for example, \\sharehost\path\file) as URLs.
  • Domain Name System (DNS) aliases and host names that appear in URLs cannot include underscores (_).
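
The points above can be illustrated with a minimal Python sketch that uses only the standard library. It is not Watson Explorer Engine's implementation; the helper names (lowercase_percent_escapes, crawl_key, windows_path_to_file_url, host_has_underscore) are hypothetical and exist only to show how you might prepare or check URLs before adding them to a project.

```python
import re
from urllib.parse import quote, urldefrag, urlsplit

# Matches one percent-encoded triplet such as %3A or %3a.
_PERCENT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def lowercase_percent_escapes(url: str) -> str:
    """Lowercase the hex digits of every percent-encoded triplet."""
    return _PERCENT_TRIPLET.sub(lambda m: m.group(0).lower(), url)


def crawl_key(url: str) -> str:
    """Assumed comparison key: fragment dropped, escapes lowercased."""
    without_fragment, _fragment = urldefrag(url)
    return lowercase_percent_escapes(without_fragment)


def windows_path_to_file_url(path: str) -> str:
    """Convert a drive-letter path (C:\\dir\\file) to a file:/// URL."""
    # safe="/" keeps slashes; the colon and spaces are percent-encoded.
    encoded = quote(path.replace("\\", "/"), safe="/")
    return "file:///" + lowercase_percent_escapes(encoded)


def host_has_underscore(url: str) -> bool:
    """Report whether the host name part of a URL contains an underscore."""
    host = urlsplit(url).hostname or ""
    return "_" in host


if __name__ == "__main__":
    print(windows_path_to_file_url(r"C:\Program Files\my file.txt"))
    print(crawl_key("http://example.com/a%2Fb#section-2"))
    print(host_has_underscore("http://my_host.example.com/"))
```

Running a list of seed URLs through a check such as host_has_underscore before configuring a project is one way to spot host names that will not be accepted.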

You can relax the checks that Watson Explorer Engine performs on URLs by disabling URL normalizations. To do so, open the search collection's Configuration > Crawling tab, and then open the URL normalization section. See URL Normalizations for more information.