The machine used for the performance testing had dual 2.4 GHz Xeon processors with HyperThreading enabled, 8 GB of RAM, and two IDE disks attached to a RAID 1 controller.

To measure performance, we used a publicly available list of universities to seed a broad crawl. For each test collection, we limited the crawl to the specified number of documents and ran it with the default configuration. Starting from university homepages produced a crawl heavily dominated by HTML documents, with fewer PDF, Word, and PowerPoint documents than one might expect. Any URL in the .edu domain was eligible to be crawled, using the following seed URL:

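The domain restriction described above can be sketched as a simple URL filter. This is an illustrative stand-in, not the crawler's actual configuration; the function name and logic are assumptions showing one way to keep a crawl inside .edu:

```python
from urllib.parse import urlparse

def in_crawl_scope(url):
    """Return True if the URL's host falls in the .edu domain.

    Hypothetical filter illustrating the .edu restriction described
    in the text; the real crawler's scoping mechanism is not shown.
    """
    host = urlparse(url).hostname or ""
    return host == "edu" or host.endswith(".edu")
```

A crawler frontier would apply such a predicate to every discovered link, so pages under university hosts are fetched while off-domain links are discarded.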
The number of indexing threads was set to 4, since the dual-processor machine presents 4 virtual CPUs with HyperThreading enabled. The index data was stored on a raw partition on the RAID 1 device, and the search cache size was set to 600 MB.
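Concretely, these settings might be expressed in a configuration file along the following lines. The property names and the raw-device path are illustrative only, since the source does not show the actual configuration format:

```
indexer.threads = 4            # one per virtual CPU (2 CPUs x HyperThreading)
index.storage = /dev/raw/raw1  # raw partition on the RAID 1 device (path illustrative)
search.cache.mb = 600          # search cache size in megabytes
```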

The queries are based on actual queries issued against the IBM web site. A random sample of 30,000 queries was drawn from one day's logs and filtered by searching against the Open Directory data: any query that returned no results in the Open Directory was rejected, leaving a pool of approximately 10,000 queries. A separate pool of queries was used to randomly select the 200 queries that warm up the search.
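The query-selection procedure above can be sketched as a small pipeline. The `search` callable here is a hypothetical stand-in for querying the Open Directory index (returning a hit count), and the function names are assumptions, not the actual test harness:

```python
import random

def build_query_pool(log_queries, search, sample_size=30000):
    """Sample queries from a day's log and keep only those that
    return at least one result against the reference collection.

    `search(q)` is a stand-in that returns the number of results
    for query q in the Open Directory data.
    """
    sample = random.sample(log_queries, min(sample_size, len(log_queries)))
    # Reject queries with no Open Directory results.
    return [q for q in sample if search(q) > 0]

def pick_warmup_queries(pool, n=200):
    """Randomly select n warm-up queries from a separate pool."""
    return random.sample(pool, n)
```

Filtering against an independent collection like the Open Directory weeds out misspellings and site-specific queries that would trivially return nothing, so the measured workload reflects queries that actually exercise the index.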