Advanced

These options are in the Advanced sub-section of the General Settings for a search collection:

  • Number of exec threads - Specifies the number of concurrent exec: requests that can be handled.
  • Converters queue size - Identifies the number of crawled URLs that can be queued, awaiting disposition by the link extractor.
  • DNS/Robots cache size - Identifies the number of computer name/network identifier mappings that will be cached.
  • DNS/Robots cache entries expiry - Identifies the number of milliseconds to keep a cached name server response and the robots.txt file for a specific web server. (The first sketch after this list illustrates how this expiry and the cache size above interact.)
  • Disable DNS resolution - Disables DNS lookups at the crawler level, causing the host system to do the lookups. The default value of this option is false. You may need to set this option if DNS lookups at a given site are handled by a proxy server.
  • FTP list only - Tells the crawler to request only the names in an FTP directory rather than a full directory listing that would include file sizes, dates, and so on. Enabling this option causes an FTP NLST command to be sent instead of LIST. Note that some FTP servers list only files in their response to NLST, omitting subdirectories and symbolic links. (An ftplib sketch after this list shows the two commands side by side.)
  • Status dump period - Specifies the time period at which the crawler should write status information to disk. This status information is used to resume the crawler if the crawler terminates abnormally or the machine crashes. When the crawler restarts, the status information that it resumes from may therefore be out of date by at most the number of seconds specified in this option. If you are crawling a large number of hosts, you will want to increase this value so that the status dumps don't interfere with the crawling process. You can also disable status dumps entirely by setting this value to -1. (A checkpoint/resume sketch after this list illustrates the mechanism.)
  • Link analysis period - Specifies the time period at which the crawler sends link analysis information (used to help determine the weight of each search result) to the indexer. The value of this option therefore represents the number of seconds during which newly crawled URLs are not yet accounted for in the link analysis score (the previous link analysis score is used in the interim). Because link analysis is a computationally expensive operation, changing this value may have a significant impact on the speed of crawling and indexing: increasing it sacrifices link analysis precision to improve overall performance, while decreasing it sacrifices performance for more accurate link analysis scores and more precise weighting.
  • Page size - The number of bytes in a page used by the search engine's storage facilities. Depending on your workload, changing this parameter may result in better crawling or browsing performance. The page size must be a power of two that is greater than or equal to 512 and less than or equal to 32768. (A one-line validity check for this constraint appears in the sketches after this list.)
  • Cache size - The maximum number of megabytes available for use as the storage engine's cache. Increasing this number may result in faster log browsing, particularly when sorting on an un-indexed column. Additionally, a larger cache value may speed up a crawl or refresh in the presence of I/O intensive competing workloads. Two caches are employed when a collection is being refreshed, which means that the specified cache size is effectively doubled during that time.
  • Transaction size (urls) - The maximum number of URLs written to the logs in a single transaction. A larger number may increase crawling speed. A smaller number will reduce memory usage and crash recovery time.
  • Transaction size (mbytes) - The maximum number of megabytes waiting to be written to the logs in a single transaction. A larger number may increase crawling speed. A smaller number will reduce memory usage and crash recovery time.
  • Synchronization mode - Specifies how tightly log transactions are synchronized with disk writes. (An fsync sketch after this list maps the three modes onto file-system calls.) Options are:
    • FULL - The storage engine will pause at critical moments to make sure that data has actually been written to the disk surface before continuing. This ensures that if the operating system crashes or if there is a power failure, the log will be uncorrupted after rebooting. This option is very safe, but it is also slow.
    • NORMAL - The storage engine will still pause at the most critical moments, but less often than in FULL mode. There is a very small (though non-zero) chance that a power failure at just the wrong time could corrupt the log in NORMAL mode. But in practice, you are more likely to suffer a catastrophic disk failure or some other unrecoverable hardware fault.
    • OFF - The storage engine continues without pausing as soon as it has handed data off to the operating system. If the crawler crashes, the data will be safe, but the log might become corrupted if the operating system crashes or the computer loses power before that data has been written to the disk surface. This option is very fast at the cost of data integrity.
  • Enable ACLs - This is true by default. You would only want to set this to false if you do not want to enforce security at the search engine level and you would like to save a small amount of space by not retrieving the ACLs for the data in this collection.
  • Authorize result views - If enabled, each result URL must be authorized before it is returned to the user. To perform this check, either the crawled URL itself is downloaded or, if specified, the special authorization URL is downloaded.
  • Authorization URL - An alternate URL to use for verifying that a user is able to view a given page. For example, if all pages on a site have the same authentication checks, you can specify a common authorization URL so that it only needs to be authorized once per search. (A sketch of this check appears after this list.)
  • Crawl Strategy - Enables you to specify how the crawler will process the links retrieved from the data source that is associated with this search collection. Options are:
    • DFS: (Depth First Search) Links that point to additional levels of hierarchy in the remote data source will be crawled before links without hierarchy. For example, when crawling a directory in a remote filesystem, subdirectories will be processed before files.
    • BFS: (Breadth First Search) Links that point directly to data in the remote data source will be crawled before links that point to additional levels of hierarchy. For example, when crawling a directory in a remote filesystem, files will be processed before subdirectories. This is the default value for this option; it expedites processing data so that it can be indexed and made available as potential search results.
    • NONE: Links are processed in the order that they are encountered in the remote data source.

    Changing this option can have a significant impact on the performance of a crawl, depending on the characteristics of the data that you are crawling, the characteristics of the system(s) on which that data is stored, and so on. Breadth First Search typically provides better performance when crawling data sources with relatively small amounts of hierarchy, references to other systems, and so on. Depth First Search often provides improved performance when crawling data sources where the actual data that you want to index is deeply nested, contains references to significant numbers of other systems (which effectively enables multiple systems to be crawled in parallel), and so on.

    Because the performance of the crawler with these options is so data-dependent, experimenting with them on a subset of the data that you want to crawl and measuring the performance of each option is the only reliable way to determine which one to use. (A frontier sketch after this list illustrates the three strategies.)

  • Duplicates hash table size - Sets the size of the hash table that is used for resolving duplicates. Be very careful when modifying this number. The value that you select should be prime, and larger sizes can provide faster lookups but will require more memory, while smaller sizes can slow down crawls but will substantially reduce memory usage.
  • Exact duplicates hash table size - Sets the size of the hash table used for resolving exact duplicates. As with the previous option, be very careful when modifying this number. The value that you select should be prime, and larger sizes can provide faster lookups but will require more memory, while smaller sizes can slow down crawls but will substantially reduce memory usage. (A next-prime helper appears in the sketches after this list.)
  • Remove logged input upon completion - Removes input information when a crawl-url reaches the crawled and indexed state. This option is off by default; setting it to on can improve performance when crawling or refreshing search collections that crawl a database.
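
The sketches below illustrate several of the options above. They are minimal Python illustrations, not the product's implementation; the class, function, host, and file names in them are hypothetical.

The first sketch shows how the DNS/Robots cache size and cache entries expiry options might interact: the size bounds how many mappings are kept at once, and the expiry (in milliseconds) controls how long any one entry is trusted before it must be looked up again.

    import time

    class TtlCache:
        """Illustrative size-bounded cache with per-entry expiry in milliseconds."""

        def __init__(self, max_entries, expiry_ms):
            self.max_entries = max_entries
            self.expiry_ms = expiry_ms
            self._store = {}  # key -> (value, time inserted)

        def get(self, key):
            entry = self._store.get(key)
            if entry is None:
                return None
            value, inserted = entry
            # An entry older than the expiry is treated as absent and re-fetched.
            if (time.time() - inserted) * 1000 > self.expiry_ms:
                del self._store[key]
                return None
            return value

        def put(self, key, value):
            # When the cache is full, evict the oldest entry to stay within bounds.
            if len(self._store) >= self.max_entries:
                oldest = min(self._store, key=lambda k: self._store[k][1])
                del self._store[oldest]
            self._store[key] = (value, time.time())

    # For example: keep up to 1024 host -> address mappings for 15 minutes each.
    dns_cache = TtlCache(max_entries=1024, expiry_ms=15 * 60 * 1000)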
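For the FTP list only option, the difference between the two listing commands can be seen directly with Python's standard ftplib module: its nlst() method sends NLST and its dir() method sends LIST. The server address here is hypothetical.

    from ftplib import FTP

    ftp = FTP("ftp.example.com")    # hypothetical server
    ftp.login()                     # anonymous login

    # "FTP list only" enabled: NLST returns bare names only.
    names = ftp.nlst("/pub")

    # "FTP list only" disabled: LIST returns sizes, dates, permissions, etc.
    ftp.dir("/pub")                 # prints one full listing line per entry

    ftp.quit()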
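The status dump period describes a periodic checkpoint. A minimal sketch of the idea, assuming a simple pickled state file (the file name and state layout are invented for illustration):

    import pickle
    import time

    STATUS_DUMP_PERIOD = 60           # seconds; -1 disables dumps entirely
    STATUS_FILE = "crawler.status"    # hypothetical path

    def crawl(state):
        last_dump = time.monotonic()
        while state["frontier"]:
            url = state["frontier"].pop()
            state["crawled"].add(url)  # stand-in for real fetch/convert/index work
            if STATUS_DUMP_PERIOD > 0 and time.monotonic() - last_dump >= STATUS_DUMP_PERIOD:
                with open(STATUS_FILE, "wb") as f:
                    pickle.dump(state, f)
                last_dump = time.monotonic()

    def resume():
        # After an abnormal termination, at most STATUS_DUMP_PERIOD seconds
        # of work is repeated when restarting from the last dump.
        with open(STATUS_FILE, "rb") as f:
            return pickle.load(f)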
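The page size constraint (a power of two between 512 and 32768) is easy to check with the standard bit trick; the function name here is illustrative.

    def valid_page_size(n):
        # n & (n - 1) == 0 holds exactly when n is a power of two.
        return 512 <= n <= 32768 and (n & (n - 1)) == 0

    assert valid_page_size(4096)
    assert not valid_page_size(3000)   # not a power of two
    assert not valid_page_size(256)    # below the minimum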
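The three synchronization modes map onto how often data handed to the operating system is forced to the disk surface. This sketch is only an analogy for the tradeoff, not the storage engine's actual write path:

    import os

    def append_to_log(path, data, mode):
        with open(path, "ab") as f:
            f.write(data)
            f.flush()                 # hand the data off to the operating system
            if mode == "FULL":
                os.fsync(f.fileno())  # force it to the disk surface every time
            elif mode == "NORMAL":
                pass                  # a real engine syncs only at the most critical moments
            # OFF: never sync; the operating system decides when data reaches disk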
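Authorize result views and the Authorization URL amount to a per-result permission check at search time. A sketch of the idea using the third-party requests library (the function and the status-code handling are assumptions, not the product's actual mechanism):

    import requests

    def user_may_view(result_url, authorization_url, credentials):
        # If an authorization URL is configured, check it instead of the result itself.
        check_url = authorization_url or result_url
        resp = requests.head(check_url, auth=credentials, allow_redirects=False)
        return resp.status_code == 200   # e.g. 401/403 means filter the result out

With a shared authorization URL, a site-wide check like this can be performed once per search and its outcome reused for every result from that site, instead of downloading each result URL separately.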
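The three crawl strategies can be pictured as different frontier disciplines. In this sketch, expand(link) is an assumed callback that yields (child, is_container) pairs, where is_container is True for links into deeper hierarchy (such as subdirectories) and False for links that point directly at data (such as files):

    from collections import deque

    def crawl(seeds, strategy, expand, process):
        frontier = deque(seeds)
        while frontier:
            link = frontier.popleft()
            process(link)   # stand-in for fetch/convert/index
            for child, is_container in expand(link):
                # DFS favors hierarchy links; BFS favors direct data links.
                favored = is_container if strategy == "DFS" else not is_container
                if strategy != "NONE" and favored:
                    frontier.appendleft(child)   # crawled ahead of everything queued
                else:
                    frontier.append(child)       # NONE keeps plain encounter order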
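Both duplicates hash table sizes should be prime. If you have a target size in mind, rounding it up to the nearest prime is straightforward; the helper below is illustrative.

    def next_prime(n):
        """Smallest prime greater than or equal to n."""
        def is_prime(k):
            if k < 2:
                return False
            i = 2
            while i * i <= k:
                if k % i == 0:
                    return False
                i += 1
            return True
        while not is_prime(n):
            n += 1
        return n

    print(next_prime(100000))   # -> 100003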