As local storage has grown, organizing and searching files on the desktop has become more complex. imgSeek and Terrier are two desktop search tools that can help you find text and image files quickly, using capabilities that an operating system either lacks or offers only in limited form.

Judith M. Myerson, Systems Engineer and Architect

Judith M. Myerson is a systems architect and engineer. Her areas of interest include middleware technologies, enterprise-wide systems, database technologies, application development, network management, security, RFID technologies, and project management. She is the author of RFID in the Supply Chain and the editor of Enterprise Systems Integration, Second Edition Handbook.



19 April 2011


Benefits of open source desktop search engines

There was a time when finding the right content was simply a matter of retrieving the right floppy disk from the shoebox. Those days are long gone. Now the average desktop computer contains hundreds of gigs, and in some cases terabytes, of data! So much information is interrelated that a simple hierarchical arrangement of folders and files may no longer be sufficient to find what you need. You need a tool that can intelligently index your files and help you locate them in the right context. Google and others have created commercial desktop search engines. However, there are open source alternatives.

I'll look at two open source desktop search engines, imgSeek and Terrier, which are handy tools when I search image files and XML documents that contain text and references to images. With imgSeek, I can use a rough sketch or import an image to query similar images from a bucket of hundreds of disparate images, almost like finding a needle or two in a haystack of images. The search results can bring up duplicate images, each with a different file name. In addition to query by content, I can look for images by metadata keywords, such as filename, description, and creation date.

Terrier

I use Desktop Terrier to narrow my search to documents containing one or two words, including references to images I specify in the search field. I have several search choices. In the search field, I can specify a word that a document must contain, and another word the same document cannot contain. I can assign a weight to a word that is more important or relevant than another word is. Documents containing important words show up first in the search results. Usually, I am happy with the results. Terrier also comes in batch and interactive modes.

imgSeek

imgSeek's desktop version is a collection of free open source visual similarity projects. I can express the query as a rough sketch I paint or as another image I supply. imgSeek compares images using a multi-resolution wavelet decomposition of the query and database images. Use the server-side version if you're interested in integrating a content-based image database into an image-related website.
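To make the idea of a multi-resolution wavelet decomposition concrete, here is a minimal Python sketch. It assumes the Haar wavelet and a tiny square grayscale image whose side is a power of two; the article does not say which wavelet or which parameters imgSeek uses, so the function names (haar_1d, haar_2d, signature, similarity) and the keep parameter are illustrative only and are not imgSeek's API.

```python
def haar_1d(values):
    """Full 1-D Haar decomposition of a list whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        # Pairwise averages give the coarser resolution; differences are the details.
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_2d(image):
    """Decompose every row, then every column, of a square image."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def signature(image, keep=4):
    """Return (overall average, positions and signs of the largest coefficients).

    `keep` is an assumed tuning knob, not a documented imgSeek setting.
    """
    coeffs = haar_2d(image)
    size = len(coeffs)
    flat = [((r, c), coeffs[r][c])
            for r in range(size) for c in range(size) if (r, c) != (0, 0)]
    flat.sort(key=lambda item: abs(item[1]), reverse=True)
    top = {(pos, value > 0) for pos, value in flat[:keep] if value != 0}
    return coeffs[0][0], top


def similarity(sig_a, sig_b):
    """Count the coefficients that share both position and sign."""
    return len(sig_a[1] & sig_b[1])
```

A rough sketch and a photo with the same broad shapes tend to share their largest coefficients, which is why a crude drawing can still retrieve a matching image. A real tool would also compare the average colors and work on several color channels.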

Figure 1 shows what the desktop version's home page may look like when first initialized.

Figure 1. Home page: Options>Viewing
Screenshot shows the options for viewing in imgSeek, including a 'slideshow delay' and smoothing of thumbnails

You can change which page imgSeek shows at startup. If you want Search by Image content as the home page, click Search, and then Image. Exit imgSeek and restart it. Figure 2 shows the result.

Figure 2. imgSeek home page: Search by Image content
Screenshot shows the search by Image content, allowing you to import a sample image or draw a rough sketch

Search by image content

You have options to search by image content, keyword, or group. To search images by content, you must first import an image, or draw a sketch, that you can use to query and present images from a collection. If you do not have a collection, go to the Add tab to create one and then add images to it.

When you are done with the collection, return to the Search by Image content tab to import an image from the collection or create a sketch. Figure 3 shows that when you draw, you have the option to choose colors from the panel.

Figure 3. Rough sketch of a flower
Screenshot shows the draw functionality with a rough red outline of a flower with a brush-size slider and color selector

You can adjust the brush size for your sketch by sliding the bar to the left for a smaller size and to the right for a larger size. As you increase or decrease the brush size, the box containing the color you selected from the panel grows or shrinks accordingly. You can save the sketch for later review, put it in the trash, or reset the sketch history.

Then go to the Results box and click the query button (leftmost) to quickly search for similar images in a collection. Figure 4 shows the images retrieved by the query, which have shapes and colors similar to those of the sketch in Figure 3. A weight, shown as a percentage, represents the degree of similarity.

Figure 4. Resulting images of flowers
Screenshot shows three image results that matched the sketch with the similarity represented as percentages

Now let's look at the sketch in Figure 5, drawn with a larger brush size. The selected color box is larger than the one shown in Figure 3.

Figure 5. Rough sketch drawn with a larger brush
Screenshot shows the same flower sketch with the thicker lines

In Figure 6, I get different results than shown in Figure 4.

Figure 6. Different set of resulting images of flowers
Screenshot shows another set of image results with similarity percentages

In Figure 4 and Figure 6, the first image retrieved by the search query is labeled with the highest weight. The weight assigned to the first picture in Figure 6 is somewhat higher than that in Figure 4 because the shape and color of the sketch are slightly more similar to the images retrieved from the collection.

The second image in Figure 4 has become the first image in Figure 6. This is because of the change in the degree of similarity of the sketch (drawn with a larger brush) to the image. A photo of a flower in Figure 6 is not shown in Figure 4. The shape and thickness of the sketch in Figure 5 are somewhat similar to the photo. The thinness of the sketch in Figure 3 did not bring up the photo in the collection shown in Figure 4.

Build image collection

You need to build a collection of images you want to search and browse. To add files, go to Add. You can set the path of the files you want to add or ignore. You can choose to give a name to the collection or let the system create a collection automatically. You can restrict the files to certain files, dimensions, and extensions.

You may find it useful to activate the beeping sounds to let you know when imgSeek is finished adding files or extracting metadata. You can opt to hide progress and to add image files that have no extension. When you are ready, click Add to begin processing.

If you want to edit metadata items, such as the author's name, before adding the files, go to the Tools menu. You can choose whether to edit image metadata one by one or to apply the changes to the set of images in a previously built batch. You can edit the batch by choosing Work batch editor under the Tools menu. Figure 7 shows where to locate the batch editor.

Figure 7. Work batch editor
Screenshot shows the menu to select the Work batch editor

To populate a work batch, right-click an image that you want to include, and then add it to the batch. You can add images to the work batch from a system directory, a keyword group, or a database directory. You can find duplicate images and rename them. To store changes to the metadata, go to the Database menu to export metadata as shown in Figure 8.

Figure 8. Export metadata
Screenshot shows menu to select Export metadata
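If you want to check a folder for duplicates before importing it, a short script can do a rough version of the job. The sketch below is not how imgSeek finds duplicates; it only groups files whose bytes are identical, so a resized or re-encoded copy would be missed. The function name find_duplicates and the default extension list are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(folder, extensions=(".png", ".jpg", ".jpeg", ".gif")):
    """Group image files under `folder` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            # Identical files produce the same SHA-256 digest, whatever their names.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists files that differ only in name or location, which are candidates for renaming or removal.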

If you forget what images you've added to the collection, you can save time looking for them by going to the Maintenance menu to scan all directories for new images.

Image query by keyword

You can search images by keyword rather than by content. To begin, right click in the space in the Field column to bring up a small menu. Click New to insert Description as the first parameter by default. Type in the Value column to describe what the image is about.

The next step is to choose the logical AND or OR operator and a second parameter. The AND operator indicates imgSeek must use both the first and second parameters to query images in the collection. The OR operator indicates imgSeek can use either of the two parameters if it does not matter which parameter you use to query the collection.

To create the second parameter, right-click in the space under the Description parameter (the first one) and then click New parameter. Use the drop-down arrow to display a list of all parameters as shown below.

  • Description
  • Dimensions
  • Filename
  • File size
  • Format
  • Modify Date
  • Database Date
  • Mounted (Linux® only)

After you select a parameter, describe it in the Value column. If you are not sure how to describe each parameter, go to the Maintenance menu to look at image metadata. You will notice that the keyword parameter list is only a small part of the metadata stored for an image.
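The way two parameters combine under AND and OR can be sketched in a few lines of Python. This is an illustration of the logic only, assuming each image's metadata is a plain dictionary keyed by the parameter names listed above; the Parameter class and the matches and query functions are hypothetical names, not part of imgSeek.

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    field: str   # a parameter name such as "Description" or "Filename"
    value: str   # text the field should contain


def matches(metadata, parameter):
    """True if the metadata field contains the parameter value (case-insensitive)."""
    return parameter.value.lower() in str(metadata.get(parameter.field, "")).lower()


def query(collection, first, second, operator="AND"):
    """Return file names satisfying both parameters (AND) or at least one (OR)."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    combine = all if operator == "AND" else any
    return [
        image["Filename"]
        for image in collection
        if combine(matches(image, p) for p in (first, second))
    ]
```

With AND, an image is returned only when both the first and second parameters match; with OR, one match is enough, so the result list is usually longer.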

Image browsing by similarity

You can browse by files, groups, system, and similarity. Browsing by files and groups is nothing new. If you can remember where you have placed your files, then you can browse directly to the directory containing those files.

Before you can compare an image with another for similarity, you need to go to the Add tab to add images to your collection. When you are done, you return to the Similarity tab and indicate whether you want to browse images by date or filename.

If you have hundreds of files to browse, it would be more efficient to group files with similar dates and filenames, and choose the groups you want to browse. To generate similarity groups, click Group. If the collection is too small to browse, try adding more pictures to your collection. You can use the Export button next to the Group button to export similarity groups as logical groups.


Text desktop search engine: Terrier

After you are satisfied with the results of your image query, the next step is to use Terrier to search XML documents that reference those images. Unlike imgSeek, Terrier's GUI must be started from the command prompt. Make sure you have installed the correct version of Java™ on your computer.

In its main window, Terrier shows only two tabs: Search and Index. When you run Terrier for the first time, it focuses on the Index tab and shows a dialog box (see Figure 9) asking if you would like Terrier to index its own documents or the documents of your choice.

Figure 9. A dialog box when running Terrier for the first time
Screenshot shows Yes/No choices to index its own documentation

Choose a folder of XML documents to index. When you restart Terrier, it shifts its focus to the Search tab. You can switch to the Index tab to re-index your documents before you query them.

Indexing files

On the Index tab, select folders to bring up a window to specify which documents Terrier should index. When done, click OK to return to the Index tab and begin the process of creating an index.

Terrier does not support incremental indexing. Every time you create an index, Terrier will remove the old index and index all specified folders from scratch.
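The consequence of non-incremental indexing is easy to see in a toy model. The Python sketch below rebuilds an inverted index (word to set of files) from nothing on every call, so documents deleted since the last run simply disappear from the result. It is not Terrier's indexer: the build_index name, the *.xml pattern, and the crude tokenizer are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path


def build_index(folders, pattern="*.xml"):
    """Build a fresh inverted index for every matching file under `folders`."""
    index = defaultdict(set)  # starts empty: nothing from an older index is reused
    for folder in folders:
        for path in Path(folder).rglob(pattern):
            text = path.read_text(encoding="utf-8", errors="ignore")
            # Crude tokenizer (an assumption): keeps dots so "imgboat1.png" stays whole.
            for token in re.findall(r"[\w.]+", text.lower()):
                word = token.strip(".")
                if word:
                    index[word].add(str(path))
    return dict(index)
```

Because every run scans all specified folders, indexing time grows with the size of the whole collection rather than with the number of changed files.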

You can watch the progress of the indexing in the lower part of the window. When Terrier completes indexing, it changes its focus to the Search tab.


Terrier Query Language

The Search tab is very simple, containing only one field to enter a Terrier query. You can use the query language to search for words individually or in a phrase. Here are some example queries for documents that reference images found with imgSeek.

Example 1: word1 word2

This query returns documents containing either word or both. Let's suppose the first word is boat and the second word is imgboat1.png. The search results may show one document containing boat, but not imgboat1.png. The second document contains imgboat1.png, but not boat. The third document contains both words.

The search results may show documents in a random order. This may help identify which documents contain mislabeled images.

Example 2: word1^2.3 word2

The weight of the first word is multiplied by 2.3 while the weight of the second word remains 1. Don't forget to put the caret sign between the word and the weight when typing a query. Documents containing the weighted word show up first in the search results; they may or may not contain the second word.

Unlike those documents in the first example, documents containing boat, which now has an assigned weight of 2.3, will always be at the top of the results. A reference to the boat image may or may not be in those documents.

To further refine your search, enter a third word, such as flower, in the search field. The weight assigned to this third word may be higher or lower than that assigned to the first word. If the first word has a higher weight (for example, 7.2) than the third word (for example, 2.5), documents containing the first word are highly likely to show up first in the search results.
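The effect of a caret weight on ordering can be illustrated with a toy scorer. The Python sketch below multiplies each word's occurrence count by its query weight and sorts by the total. Terrier's real weighting models are more elaborate and may order results differently, so treat parse_weighted, score, and rank as hypothetical helpers rather than a description of Terrier.

```python
import re


def parse_weighted(query):
    """Split 'boat^2.3 imgboat1.png' into a word-to-weight mapping (default 1.0)."""
    weights = {}
    for token in query.lower().split():
        word, _, weight = token.partition("^")
        weights[word] = float(weight) if weight else 1.0
    return weights


def score(text, weights):
    """Sum of (occurrences of each query word) x (its weight); a toy ranking only."""
    words = [w.strip(".") for w in re.findall(r"[\w.]+", text.lower())]
    return sum(words.count(word) * weight for word, weight in weights.items())


def rank(documents, query):
    """Return the names of documents with a non-zero score, best match first."""
    weights = parse_weighted(query)
    scored = [(score(text, weights), name) for name, text in documents.items()]
    return [name for total, name in sorted(scored, key=lambda pair: -pair[0]) if total > 0]
```

In this model, raising the weight of boat pushes documents that mention boat above documents that mention only the image file name.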

Example 3: +word1 +word2

You can get documents containing both words by entering the plus sign as the word prefix. These two words can be in separate places throughout a document. They do not have to be next to one another as in a phrase. For example, flower and imgflower1.png are in separate places, but flower may not be associated with imgflower1.png. The image may be labeled with Flower in one document and Rose in another version of the same document.

Example 4: +word1 -word2

Use this example when seeking documents each containing the first word and not containing the second word. You do this by putting the plus sign as the prefix to the first word and the minus sign as the prefix to the second word. If a document contains both the first word and the second word, it is not retrieved in the search results.

For example, if you search with +boat +imgboat1.png -canoe, you will get documents containing boat and imgboat1.png, but not containing canoe.

Example 5: "word1 word2"

You can get documents where both words appear together as a phrase. Unlike the third example, the words must be next to one another rather than in separate places in a document. To indicate a phrase, put the words between double quotes, for example: "Flower picture".

Example 6: phrase1 -word1 word2^3.5

Let's suppose you combine a phrase with two words to find documents that contain the phrase, do not contain the first word, and contain the second word with a weight of 3.5. For example, the query "Figure 7. This is the picture of a flower" -boat stone^3.5 asks for documents containing that phrase, skipping documents that contain boat, and giving stone a weight of 3.5.
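The plus, minus, and phrase operators from the examples above can be modeled as a simple filter. The Python sketch below is a toy interpretation of those examples, not Terrier's query parser: the parse and matches functions are hypothetical, phrases are matched as plain substrings, and caret weights are ignored because they affect ordering rather than which documents qualify.

```python
import re


def parse(query):
    """Split a query into required words, excluded words, optional words, and phrases."""
    required, excluded, optional, phrases = [], [], [], []
    for phrase, word in re.findall(r'"([^"]+)"|(\S+)', query.lower()):
        if phrase:
            phrases.append(phrase)
        elif word.startswith("+"):
            required.append(word[1:])
        elif word.startswith("-"):
            excluded.append(word[1:])
        else:
            optional.append(word.partition("^")[0])  # drop any ^weight suffix
    return required, excluded, optional, phrases


def matches(text, query):
    """True if the text satisfies the query under this toy interpretation."""
    required, excluded, optional, phrases = parse(query)
    lowered = " ".join(text.lower().split())  # normalize whitespace for phrase checks
    words = {w.strip(".") for w in re.findall(r"[\w.]+", lowered)}
    if any(w not in words for w in required):
        return False
    if any(w in words for w in excluded):
        return False
    if any(p not in lowered for p in phrases):
        return False
    if not required and not phrases:
        # Plain words only (Example 1): at least one of them must be present.
        return any(w in words for w in optional)
    return True
```

For instance, matches("A boat in imgboat1.png", "+boat +imgboat1.png -canoe") is true, while the same query rejects a document that also mentions canoe.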


Conclusion

Users who must meet tight deadlines for obtaining and presenting information from a bucket of thousands of files will find imgSeek and Terrier useful. These two tools have capabilities that an operating system either lacks or offers only in limited form. imgSeek searches images by content, keyword, and similarity. Terrier searches the contents of documents with queries that can require, exclude, or weight individual words.
