These options are in the Advanced sub-section of the General Settings for a
search collection:
- Number of exec threads - Specifies the number of concurrent exec: requests that can
be handled.
- Converters queue size - Identifies the number of crawled URLs that can be queued,
awaiting disposition by the link extractor.
- DNS/Robots cache size - Identifies the number of host name-to-network address
mappings that will be cached.
- DNS/Robots cache entries expiry - Identifies the number of milliseconds to keep a
cached name server response and the robots.txt file for a specific web server. (The DNS
cache sketch following this list illustrates this kind of size- and age-bounded cache.)
- Disable DNS resolution - Disables DNS lookups at the crawler level, causing the
host system to do the lookups. The default value of this option is false. You may
need to set this option if DNS lookups at a given site are handled by a proxy server.
- FTP list only - Enabling this option tells the crawler to request only the names
in an FTP directory instead of a full directory listing that would include file sizes,
dates, and so on. Activating this option causes an FTP NLST command to be sent. Some FTP
servers list only files in their response to NLST, omitting subdirectories and symbolic
links. (The ftplib sketch following this list shows the difference between a full listing
and a names-only listing.)
- Status dump period - Specifies the time period at which the crawler should write
status information to disk. This status information is used to resume the crawler if the
crawler terminates abnormally or the machine crashes. When the crawler restarts, the
resumed state may therefore be out of date by at most the number of seconds specified in
this option. If you
are crawling a large number of hosts, you will want to increase this value so that the
status dumps don't interfere with the crawling process. You can also disable status dumps
entirely by setting this value to -1.
- Link analysis period - Specifies the time period at which the crawler sends link
analysis information (used to help determine the weight of each search result) to the
indexer. The value of this option therefore represents the number of seconds by which newly
crawled URLs will not be accounted for in the link analysis score (the previous link
analysis score will be used). Because link analysis is a computationally expensive
operation, changing this value can significantly affect the speed of crawling and
indexing. Increasing this value will sacrifice link
analysis precision to improve overall performance, while decreasing this value will
sacrifice performance for more accurate link analysis scores and more precise
weighting.
- Page size - The number of bytes in a page used by the search engine's storage
facilities. Depending on your workload, changing this parameter may result in better
crawling or browsing performance. The page size must be a power of two that is greater than
or equal to 512 and less than or equal to 32768.
- Cache size - The maximum number of megabytes available for use as the storage
engine's cache. Increasing this number may result in faster log browsing, particularly when
sorting on an un-indexed column. Additionally, a larger cache value may speed up a crawl or
refresh in the presence of I/O intensive competing workloads. Two caches are employed when a
collection is being refreshed, which means that the specified cache size is effectively
doubled during that time.
- Transaction size (urls) - The maximum number of URLs written to the logs in a
single transaction. A larger number may increase crawling speed. A smaller number will
reduce memory usage and crash recovery time.
- Transaction size (mbytes) - The maximum number of megabytes waiting to be written
to the logs in a single transaction. A larger number may increase crawling speed. A smaller
number will reduce memory usage and crash recovery time. (The batching sketch following
this list illustrates how these two transaction limits interact.)
- Synchronization mode - This option specifies how tightly log transactions are
synchronized with disk writes. (The fsync sketch following this list illustrates the
trade-off.) Options are:
- FULL - The storage engine will pause at critical moments to make sure that data
has actually been written to the disk surface before continuing. This ensures that if
the operating system crashes or if there is a power failure, the log will be uncorrupted
after rebooting. This option is very safe, but it is also slow.
- NORMAL - The storage engine will still pause at the most critical moments, but
less often than in FULL mode. There is a very small (though non-zero) chance that a
power failure at just the wrong time could corrupt the log in NORMAL mode. But in
practice, you are more likely to suffer a catastrophic disk failure or some other
unrecoverable hardware fault.
- OFF - The storage engine continues without pausing as soon as it has handed
data off to the operating system. If the crawler crashes, the data will be safe, but the
log might become corrupted if the operating system crashes or the computer loses power
before that data has been written to the disk surface. This option is very fast at the
cost of data integrity.
- Disable resume - Enables or disables the ability to resume a crawl. If resume is
enabled, a crawl that was stopped or killed can be resumed later. If resume is disabled,
crawling speed may significantly increase, but you will be unable to browse any URLs in
the pending state.
- Disable URL browsing - Enables or disables URL browsing. URLs can be browsed in the
Crawled URLs section of the crawler. If URL browsing is disabled, crawling speed may
significantly increase.
- Disable graph logging - Enables or disables graph logging. If graph logging is
disabled, crawling speed may significantly increase, but the indexer will be unable to
perform link analysis. Additionally, disabling graph logging may have unintended
side-effects during a refresh operation.
- Disable log indexes - Select the log indexes to disable, which may result in a
faster crawl. Available indexes are:
- Log Browsing - disabling these indexes will make normal log browsing less
responsive
- Log Sorting - disabling these indexes will make sorting by columns in the log
browser much slower
- Refresh - disabling these indexes will result in much slower refresh
operations
- Enable ACLs - This is true by default. You would only want to set this to false if
you do not want to enforce security at the search engine level and you would like to save a
small amount of space by not retrieving the ACLs for the data in this collection.
- Authorize result views - If enabled, each result URL must be authorized for the
requesting user before it is returned. To perform this check, either the crawled URL
itself or the special authorization URL (if specified) is downloaded.
- Authorization URL - An alternate URL to use for verifying that a user is able to
view a given page. For example, if all pages on a site have the same authentication checks,
you can specify a common authorization URL so that it only needs to be authorized once per
search. (The authorization-check sketch following this list shows the general idea.)
- Crawl Strategy - Enables you to specify how the crawler will process the links
retrieved from the data source that is associated with this search collection. Options are:
- DFS: (Depth First Search) Links that point to additional levels of hierarchy in
the remote data source will be crawled before links without hierarchy. For example, when
crawling a directory in a remote filesystem, subdirectories will be processed before
files.
- BFS: (Breadth First Search) Links that point directly to data in the remote
data source will be crawled before links that point to additional levels of hierarchy.
For example, when crawling a directory in a remote filesystem, files will be processed
before subdirectories. This is the default value for this option; it expedites processing
data so that it can be indexed and made available as potential search results.
- NONE: Links are processed in the order that they are encountered in the remote
data source.
Changing this option can have a significant impact on the performance of a crawl,
depending on the characteristics of the data that you are crawling, the characteristics of
the system(s) on which that data is stored, and so on. Breadth First Search typically
provides better performance when crawling data sources with relatively small amounts of
hierarchy, references to other systems, and so on. Depth First Search often provides
improved performance when crawling data sources where the actual data that you want to
index is deeply nested, contains references to significant numbers of other systems (which
effectively enables multiple systems to be crawled in parallel), and so on.
Because the performance of the crawler with these options is so data-dependent,
experimenting with each option on a subset of the data that you want to crawl and
measuring the results is the only reliable way to determine which option to use. (The
frontier-ordering sketch following this list shows how the three strategies order newly
discovered links.)
- Duplicates hash table size - Sets the size of the hash table that is used for
resolving duplicates. Be very careful when modifying this number. The value that you select
should be prime, and larger sizes can provide faster lookups but will require more memory,
while smaller sizes can slow down crawls but will substantially reduce memory usage.
- Exact duplicates hash table size - Sets the size of the hash table used for
resolving exact duplicates. Be very careful when modifying this number. The value that you
select should be prime, and larger sizes can provide faster lookups but will require more
memory, while smaller sizes can slow down crawls but will substantially reduce memory
usage. (The prime-size helper following this list shows one way to round a desired size
up to a prime.)
- Remove logged input upon completion - Removes input information when a crawl-url
reaches the crawled and indexed state. This option is off by default. This option can
be set to on to improve performance when crawling or refreshing search collections
that crawl a database.
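
DNS cache sketch. The DNS/Robots cache size and cache entries expiry options bound a lookup
cache inside the crawler. The following is a minimal sketch of that kind of size- and
age-limited cache; the class name, the use of socket.gethostbyname, and the eviction policy
are illustrative assumptions, not the crawler's actual implementation.

```python
import socket
import time

class ExpiringCache:
    """Size- and age-bounded cache of the kind the DNS/Robots options suggest.

    max_entries corresponds to "DNS/Robots cache size"; ttl_ms corresponds to
    "DNS/Robots cache entries expiry" (milliseconds).
    """

    def __init__(self, max_entries=1024, ttl_ms=600_000):
        self.max_entries = max_entries
        self.ttl_ms = ttl_ms
        self._entries = {}  # key -> (value, insertion time in milliseconds)

    def get(self, key, compute):
        now = time.monotonic() * 1000
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self.ttl_ms:
            return hit[0]                     # fresh cached answer
        value = compute(key)                  # expired or missing: recompute
        if len(self._entries) >= self.max_entries:
            self._entries.pop(next(iter(self._entries)))  # evict oldest insert
        self._entries[key] = (value, now)
        return value

dns_cache = ExpiringCache(max_entries=1024, ttl_ms=600_000)
# Resolve a host name through the cache (placeholder host name).
address = dns_cache.get("www.example.com", socket.gethostbyname)
```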
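ftplib sketch. The FTP list only option asks servers for names only via NLST. The sketch
below shows, with Python's standard ftplib, the difference between a full listing and a
names-only listing; the host name and anonymous login are placeholders.

```python
from ftplib import FTP

# Placeholder host; substitute a real FTP server and credentials.
with FTP("ftp.example.com") as ftp:
    ftp.login()  # anonymous login

    # Full listing: one line per entry, including sizes, dates, permissions.
    full_listing = []
    ftp.retrlines("LIST", full_listing.append)

    # Names only, which is what "FTP list only" requests via the NLST command.
    # Some servers omit subdirectories and symbolic links from this response.
    names_only = ftp.nlst()
```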
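Batching sketch. The two Transaction size options cap a single log transaction by URL count
and by size. The following sketch shows how such dual limits typically interact; the class
and the commit callback are hypothetical stand-ins, not the crawler's API.

```python
class TransactionBuffer:
    """Buffer log records and commit them as one transaction when either
    limit is reached: max_urls mirrors "Transaction size (urls)" and
    max_mbytes mirrors "Transaction size (mbytes)"."""

    def __init__(self, commit, max_urls=1000, max_mbytes=8):
        self.commit = commit                  # callable that durably writes one batch
        self.max_urls = max_urls
        self.max_bytes = max_mbytes * 1024 * 1024
        self.batch = []
        self.batch_bytes = 0

    def add(self, url, record):
        self.batch.append((url, record))
        self.batch_bytes += len(record)
        # Larger limits mean fewer, bigger transactions (faster crawling);
        # smaller limits mean less buffered memory and less to redo after a crash.
        if len(self.batch) >= self.max_urls or self.batch_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.batch:
            self.commit(self.batch)
            self.batch = []
            self.batch_bytes = 0

# Usage with a stand-in commit function that just reports batch sizes.
buffer = TransactionBuffer(commit=lambda batch: print(len(batch), "records committed"))
for i in range(2500):
    buffer.add(f"http://example.com/page{i}", b"x" * 100)
buffer.flush()  # commit whatever is left
```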
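fsync sketch. The Synchronization mode option is about whether the storage engine waits for
data to reach the disk surface before continuing. The sketch below illustrates the
underlying operating-system distinction (calling fsync versus not); it illustrates the
trade-off only and is not the storage engine's code.

```python
import os

def append_record(path, data, sync_mode="NORMAL"):
    """Append a log record, illustrating the FULL/NORMAL/OFF trade-off.

    FULL and NORMAL call fsync so the data is on the disk surface before
    returning (FULL would do so at every critical point, NORMAL less often).
    OFF returns as soon as the operating system has accepted the write, so a
    power failure before the OS flushes its buffers could corrupt the log.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        if sync_mode in ("FULL", "NORMAL"):
            os.fsync(fd)  # pause until the bytes are physically on disk
    finally:
        os.close(fd)

append_record("crawl.log", b"url committed\n", sync_mode="FULL")
```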
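Authorization-check sketch. Authorize result views and Authorization URL together gate each
search result on a per-user fetch. The sketch below shows that kind of check; the function
name, the use of a HEAD request, and the header handling are assumptions for illustration.

```python
import urllib.error
import urllib.request

def user_may_view(url, user_headers):
    """Return True if fetching `url` with the user's credentials succeeds.

    `url` is the crawled URL itself or, if one is configured, the shared
    authorization URL that stands in for it; `user_headers` carries whatever
    credential the front end forwards (for example, a Cookie header).
    """
    request = urllib.request.Request(url, headers=user_headers, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError:
        return False  # 401/403 and similar: hide this result from the user

# Example: check one result with a placeholder session cookie.
allowed = user_may_view("http://www.example.com/private/report.html",
                        {"Cookie": "session=PLACEHOLDER"})
```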
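Frontier-ordering sketch. The Crawl Strategy option controls whether links into deeper
hierarchy are visited before or after links to data. The sketch below shows how DFS, BFS,
and NONE would order one batch of extracted links; the function and the (url, is_container)
representation are illustrative assumptions, not the crawler's internals.

```python
def order_links(links, strategy):
    """Order newly extracted links for the crawl frontier.

    links: iterable of (url, is_container) pairs, where is_container is True
    for links that lead to more hierarchy (for example, subdirectories).
    strategy: "DFS", "BFS", or "NONE".
    """
    links = list(links)
    if strategy == "DFS":
        # Hierarchy first: descend into containers before fetching data.
        return [u for u, c in links if c] + [u for u, c in links if not c]
    if strategy == "BFS":
        # Data first (the default): fetch leaf documents before descending.
        return [u for u, c in links if not c] + [u for u, c in links if c]
    # NONE: keep the order in which the links were encountered.
    return [u for u, _ in links]

# Links extracted from one directory of a hypothetical remote filesystem.
extracted = [
    ("file:///data/readme.txt", False),
    ("file:///data/archive/", True),
    ("file:///data/report.pdf", False),
]
print(order_links(extracted, "BFS"))  # files before the subdirectory
print(order_links(extracted, "DFS"))  # the subdirectory before the files
```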
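Prime-size helper. Both duplicates hash table options expect a prime size. The helper
below, a name of our own choosing rather than anything the product provides, rounds a
requested size up to the next prime by trial division, which is adequate for table sizes
in this range.

```python
def next_prime(n):
    """Return the smallest prime greater than or equal to n."""
    def is_prime(k):
        if k < 2:
            return False
        if k % 2 == 0:
            return k == 2
        factor = 3
        while factor * factor <= k:
            if k % factor == 0:
                return False
            factor += 2
        return True

    while not is_prime(n):
        n += 1
    return n

# Round a desired table size of roughly one million entries up to a prime.
table_size = next_prime(1_000_000)
```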