Disk Usage

The size listed on a collection's Overview page approximates the disk space that the collection currently uses. Several factors, such as compression, affect the reported size. To find out exactly how much space each collection uses, use your operating system's tools.

Under Linux, to get a list sorted by size, you can run the following command as a user who has access to the Watson Explorer Engine installation directory:

du -ks {INSTALL_DIR}/data/search-collections/*/* | sort -k1n
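
The command above prints one line for each subdirectory of each collection. To get a single total per collection, the same output can be grouped by the collection directory name. The following is a minimal sketch, not part of the product: the function name collection_totals is hypothetical, and it assumes that directory names contain no tab characters.

```shell
#!/bin/sh
# Hypothetical helper: total disk usage (in KB) per collection.
# $1 is the search-collections directory, for example
# {INSTALL_DIR}/data/search-collections with INSTALL_DIR filled in.
collection_totals() {
    base=$1
    # du -ks prints "<kilobytes><TAB><path>" for each collection subdirectory.
    du -ks "$base"/*/* 2>/dev/null |
    awk -F'\t' '
        {
            # The collection name is the second-to-last path component.
            n = split($2, parts, "/")
            total[parts[n - 1]] += $1
        }
        END {
            for (name in total) printf "%d\t%s\n", total[name], name
        }
    ' | sort -k1n
}
```

After the function is defined, running collection_totals with the path to the search-collections directory prints one line per collection, smallest first.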

Windows users can navigate to the {INSTALL_DIR}/data/search-collections/ directory and use Windows Explorer to view the sizes.

A collection consists of the data that was crawled (which is converted to a cacheable type and then compressed) and log files for the crawl and index.

A recrawl can use considerably more space, because the logs and crawled data are kept for both the Staging and Live collections. The search engine must keep all of the old data during the recrawl so that it can continue to serve queries and so that any documents that are being preserved because of transient errors, archiving, and so on can be found again.
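
Because old and new data exist side by side during a recrawl, it can help to check free space before starting one. The sketch below is only a rough check under a stated assumption: that the new crawl will need about as much space as the collection uses now. The function names available_kb and has_recrawl_headroom are hypothetical.

```shell
#!/bin/sh
# Print the available space, in KB, on the filesystem that holds path $1.
# df -P forces the one-line-per-filesystem POSIX format; column 4 is "Available".
available_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed (exit status 0) if the filesystem has at least as much free space
# as the directory $1 currently uses.
# Assumption: the recrawl produces roughly the same amount of data again.
has_recrawl_headroom() {
    used_kb=$(du -ks "$1" | awk '{ print $1 }')
    [ "$(available_kb "$1")" -ge "$used_kb" ]
}
```

A passing check does not guarantee that the recrawl will fit, since the indexer needs working space of its own on top of the crawled data.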

The indexer creates temporary segments that must be merged. The total size of all of the segments is approximately equal to the final size of the index. Thus, on top of the space used by the crawler's data and by the old index, the indexer requires approximately twice the final index size to build the new index.
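
The space requirements described in this topic can be added up into a rough peak estimate. The following sketch is one reading of the text rather than an official formula: it adds the old and new crawled data (including logs), the old index, and twice the expected final size of the new index. The function name and its four arguments are hypothetical values that you measure or estimate yourself, all in KB.

```shell
#!/bin/sh
# Rough peak disk usage during a recrawl, in KB.
#   $1  old crawled data and logs
#   $2  new crawled data and logs
#   $3  old index
#   $4  expected final size of the new index
estimate_recrawl_peak_kb() {
    old_crawl=$1
    new_crawl=$2
    old_index=$3
    new_index=$4
    # Temporary segments plus the merged result take about 2x the final index.
    echo $(( old_crawl + new_crawl + old_index + 2 * new_index ))
}
```

For example, with 100 KB of old crawled data, 100 KB of new crawled data, a 50 KB old index, and a 50 KB new index, the estimate is 100 + 100 + 50 + 2 * 50 = 350 KB.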