Benefits of open source desktop search engines
There was a time when finding the right content was simply a matter of retrieving the right floppy disk from the shoebox. Those days are long gone. Now the average desktop computer contains hundreds of gigabytes, and in some cases terabytes, of data. So much information is interrelated that a simple hierarchical arrangement of folders and files may no longer be sufficient to find what you need. You need a tool that can intelligently index your files and help you locate them in the right context. Google and others have created commercial desktop search engines. However, there are open source alternatives.
I'll look at two open source desktop search engines, imgSeek and Terrier, which are handy tools when I search image files and XML documents that contain text and references to images. With imgSeek, I can use a rough sketch or import an image to query similar images from a bucket of hundreds of disparate images, almost like finding a needle or two in a haystack of images. The search results can bring up duplicate images, each with a different file name. In addition to querying by content, I can look for images by metadata keywords, such as filename, description, and creation date.
I use Desktop Terrier to narrow my search to documents containing one or two words, including references to images I specify in the search field. I have several search choices. In the search field, I can specify a word that a document must contain, and another word the same document cannot contain. I can assign a weight to a word that is more important or relevant than another. Documents containing important words show up first in the search results. Usually, I am happy with the results. Terrier also comes in batch and interactive modes.
imgSeek's desktop version is a collection of free open source visual similarity projects. I can express the query as a rough sketch I paint or as another image I supply. imgSeek's algorithm is based on a multi-resolution wavelet decomposition of the query and database images. Use the server-side version if you're interested in integrating a content-based image database into an image-related website.
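To give a feel for what a wavelet decomposition of an image involves, here is a minimal sketch using the Haar wavelet on a tiny grayscale grid. It is not imgSeek's code: the function names, the choice of the Haar wavelet, and the very simple signature comparison are all assumptions made for illustration.

```python
def haar_1d(values):
    """Full 1D Haar decomposition of a list whose length is a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        # Pairwise averages keep the coarse shape; differences keep the detail.
        averages = [(out[2 * i] + out[2 * i + 1]) / 2 for i in range(half)]
        details = [(out[2 * i] - out[2 * i + 1]) / 2 for i in range(half)]
        out[:n] = averages + details
        n = half
    return out


def haar_2d(image):
    """Standard 2D decomposition: transform every row, then every column."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def signature(image, keep=3):
    """Positions and signs of the largest non-zero detail coefficients.

    The overall average at position (0, 0) is skipped on purpose.
    """
    coeffs = haar_2d(image)
    flat = [((r, c), v) for r, row in enumerate(coeffs)
            for c, v in enumerate(row) if (r, c) != (0, 0)]
    flat.sort(key=lambda item: abs(item[1]), reverse=True)
    return {(pos, v > 0) for pos, v in flat[:keep] if v != 0}


def similarity(sig_a, sig_b):
    """Hypothetical score: number of coefficients two signatures share."""
    return len(sig_a & sig_b)
```

Two images with the same coarse layout share large coefficients at the same positions, which is why a rough sketch can stand in for a photo as a query.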
Figure 1 shows what the desktop version's home page may look like when first initialized.
Figure 1. Home page: Options>Viewing
You may change the way imgSeek appears on its home page when you start it up. If you want Search by Image content as the home page, click Search, and then Image. Exit imgSeek and restart. Figure 2 shows the result.
Figure 2. imgSeek home page: Search by Image content
Search by image content
You have options to search by image content, keyword, or group. To search by image content, you must first import an image, or draw a sketch, that you can use to query and present images from a collection. If you do not have a collection, go to the Add tab to create a collection and then add images to it.
When you are done with the collection, return to the Search by Image content tab to import an image from the collection or create a sketch. Figure 3 shows that when you draw, you have the option to choose colors from the panel.
Figure 3. Rough sketch of a flower
You can adjust the brush size for your sketch by sliding the bar to the left for smaller size and to the right for larger size. As you increase or decrease the brush size, the box containing the color you selected from the panel will increase or decrease. You can save the sketch for later review, put it in the trash, or reset the sketch history.
Then go to the Results box to click the query button (leftmost) to quickly search for similar images in a collection. Figure 4 shows the images retrieved by the query having similar shapes and colors to those of the sketch in Figure 3. A weight in percentage represents the degree of similarity.
Figure 4. Resulting images of flowers
Figure 5. Rough sketch drawn with a larger brush
Figure 6. Different set of resulting images of flowers
In Figure 4 and Figure 6, the first image retrieved by the search query is labeled with the highest weight. The weight assigned to the first picture in Figure 6 is somewhat higher than that for Figure 4 due to a slight increase in similarity of shape and color between the sketch and the images retrieved from the collection.
The second image in Figure 4 has become the first image in Figure 6. This is because of the change in the degree of similarity of the sketch (drawn with a larger brush) to the image. A photo of a flower in Figure 6 is not shown in Figure 4. The shape and thickness of the sketch in Figure 5 are somewhat similar to the photo. The thinness of the sketch in Figure 3 did not bring up the photo in the collection shown in Figure 4.
Build image collection
You need to build a collection of images you want to search and browse. To add files, go to Add. You can set the path of the files you want to add or ignore. You can choose to give a name to the collection or let the system create a collection automatically. You can restrict the files to certain files, dimensions, and extensions.
You may find it useful to activate the beeping sounds to let you know when imgSeek is finished adding files or extracting metadata. You can opt for hiding progress and adding image files without extension. When you are ready, click Add to begin processing.
If you want to edit metadata items, such as the author's name, before adding the files, go to the Tools menu. You can choose whether to edit image metadata one by one or to apply the changes to the set of images in a previously built batch. You can edit the batch by choosing Work batch editor under the Tools menu. Figure 7 shows where to locate the batch editor.
Figure 7. Work batch editor
To populate a work batch, right-click an image that you want to include, then add it to the batch. You can add to the work batch from a system directory, keyword group, or database directory. You can find duplicate images and rename them. To store changes to the metadata, go to the Database menu to export metadata as shown in Figure 8.
Figure 8. Export metadata
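Outside imgSeek, the simplest kind of duplicate (the same bytes stored under different file names) can be found with a content hash. The sketch below is only an illustration of that idea; the `find_duplicates` name is hypothetical, and unlike imgSeek's visual comparison it will not catch resized or re-encoded copies.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(folder):
    """Group files under `folder` whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # Only digests seen more than once point at duplicates.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists the paths that share one digest, so you can decide which copy to keep or rename.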
If you forget what images you've added to the collection, you can save time looking for them by going to the Maintenance menu to scan all directories for new images.
Image query by keyword
You can search images by keyword rather than by content. To begin, right click in the space in the Field column to bring up a small menu. Click New to insert Description as the first parameter by default. Type in the Value column to describe what the image is about.
The next step is to choose the logical AND or OR operator and a second parameter. The AND operator indicates imgSeek must use both the first and second parameters to query images in the collection. The OR operator indicates imgSeek can use either of the two parameters if it does not matter which parameter you use to query the collection.
To create the second parameter, right-click in the space under the Description parameter (the first one) and then click New parameter. Use the drop-down arrow to display a list of all parameters, as shown below.
- File size
- Modify Date
- Database Date
- Mounted (Linux® only)
After you select a parameter, describe it in the Value column. If you are not sure how to describe each parameter, go to the Maintenance menu to look at image metadata. You will notice the keyword parameter list is a small part of the larger picture of metadata in the image.
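The difference between combining two parameters with AND and with OR can be sketched in a few lines. The record fields and the `query` helper below are hypothetical and only mirror the idea of the keyword search; they are not imgSeek's data model.

```python
def query(records, conditions, operator="AND"):
    """Return records matching all (AND) or any (OR) of the conditions.

    Each condition is a (field, predicate) pair.
    """
    combine = all if operator == "AND" else any
    return [record for record in records
            if combine(pred(record[field]) for field, pred in conditions)]


# Made-up metadata for three images.
records = [
    {"name": "imgflower1.png", "description": "red flower", "file_size": 48_000},
    {"name": "imgboat1.png", "description": "boat on a lake", "file_size": 120_000},
    {"name": "imgflower2.png", "description": "flower field", "file_size": 300_000},
]
conditions = [
    ("description", lambda text: "flower" in text),
    ("file_size", lambda size: size < 100_000),
]

both = query(records, conditions, "AND")    # must satisfy both parameters
either = query(records, conditions, "OR")   # may satisfy either parameter
```

With AND, only the small flower image qualifies; with OR, both flower images qualify because each satisfies at least one parameter.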
Image browsing by similarity
You can browse by files, groups, system, and similarity. Browsing by files and groups is nothing new. If you can remember where you have placed your files, then you can browse directly to the directory containing those files.
Before you can compare an image with another for similarity, you need to go to the Add tab to add images to your collection. When you are done, you return to the Similarity tab and indicate whether you want to browse images by date or filename.
If you have hundreds of files to browse, it would be more efficient to group files with similar dates and filenames, and choose the groups you want to browse. To generate similarity groups, click Group. If the collection is too small to browse, try adding more pictures to your collection. You can use the Export button next to the Group button to export similarity groups as logical groups.
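As a rough picture of what grouping by date or filename means, the sketch below buckets a few made-up entries by month and by a shared filename stem. The helper names and the grouping rules are assumptions for illustration; imgSeek's own grouping logic may differ.

```python
import re
from collections import defaultdict
from datetime import date


def group_by(items, key):
    """Bucket items by the value that key(item) returns."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)


def name_stem(entry):
    """'imgflower12.png' -> 'imgflower' (drop trailing digits and extension)."""
    return re.sub(r"\d*\.\w+$", "", entry[0])


def month(entry):
    """Group key: the (year, month) of the entry's date."""
    return (entry[1].year, entry[1].month)


# Made-up (filename, date) pairs.
entries = [
    ("imgflower1.png", date(2010, 5, 2)),
    ("imgflower2.png", date(2010, 5, 20)),
    ("imgboat1.png", date(2010, 6, 1)),
]
by_name = group_by(entries, name_stem)
by_month = group_by(entries, month)
```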
Text desktop search engine: Terrier
After you are satisfied with the results of your image query, the next step is to use Terrier to search XML documents containing those images. Unlike imgSeek, you need to start the GUI of Terrier from the command prompt. Make sure you have installed the correct version of Java™ on your computer.
In its main window, Terrier shows only two tabs: Search and Index. When you run Terrier for the first time, it focuses on the Index tab and shows a dialog box (see Figure 9) asking if you would like Terrier to index its own documents or the documents of your choice.
Figure 9. A dialog box when running Terrier for the first time
Choose a folder of XML documents to index. When you restart Terrier, it shifts its focus to the Search tab. You can switch to the Index tab to re-index your documents before you query them.
On the Index tab, select folders to bring up a window to specify which documents Terrier should index. When done, click OK to return to the Index tab and begin the process of creating an index.
Terrier does not support incremental indexing. Every time you create an index, Terrier will remove the old index and index all specified folders from scratch.
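The consequence of non-incremental indexing is easiest to see in a toy inverted index that is always rebuilt from the full document set. This is a sketch only: the `build_index` function and its tokenizer are hypothetical and far simpler than Terrier's real index structures.

```python
import re


def build_index(documents):
    """Build a fresh inverted index: term -> {document name: term frequency}."""
    index = {}
    for doc_id, text in documents.items():
        # Keep dotted names such as imgboat1.png together as one term.
        for term in re.findall(r"\w+(?:\.\w+)*", text.lower()):
            postings = index.setdefault(term, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


documents = {
    "a.xml": "A boat. See imgboat1.png",
    "b.xml": "A flower",
}
index = build_index(documents)

# Any change to the collection means discarding the old index entirely
# and indexing every document again, not patching the existing index.
del documents["a.xml"]
index = build_index(documents)
```

After the rebuild, terms that appeared only in the removed document are gone because nothing from the old index is carried over.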
You can watch the progress of the indexing in the lower part of the window. When Terrier completes indexing, it changes its focus to the Search tab.
Terrier Query Language
The Search tab is very simple, containing only one field to enter a Terrier query. You can use the query language to search words individually or in a phrase. Here are some examples of queries for documents that reference images found with imgSeek.
Example 1: word1 word2
This query returns documents containing either or both words. Suppose the first word is boat and the second word is imgboat1.png. The search results may show one document that contains boat, but not imgboat1.png. The second document contains imgboat1.png, but not boat. The third document contains both.
The search results may show documents in a random order. This may help identify which documents contain mislabeled images.
Example 2: word1^2.3 word2
The weight of the first word is multiplied by 2.3 while the weight of the second word remains one. Don't forget to put the caret sign between the word and the weight when typing a query. The search results list documents containing the weighted word first; those documents may or may not contain the second word.
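A small scoring sketch shows how a weight can change the order of results. It assumes a deliberately naive score (weight times the number of occurrences); Terrier's actual ranking relies on more elaborate weighting models, and the function names here are hypothetical.

```python
def parse_weighted(query):
    """Split 'boat^2.3 imgboat1.png' into {'boat': 2.3, 'imgboat1.png': 1.0}."""
    weights = {}
    for token in query.lower().split():
        term, _, weight = token.partition("^")
        weights[term] = float(weight) if weight else 1.0
    return weights


def rank(documents, query):
    """Order document names by the naive score, highest first."""
    weights = parse_weighted(query)
    scores = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(w * words.count(term) for term, w in weights.items())
        if score > 0:  # documents matching no query word are left out
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up documents for illustration.
documents = {
    "a.xml": "boat on the lake",
    "b.xml": "see imgboat1.png imgboat1.png",
    "c.xml": "flower",
}
unweighted = rank(documents, "boat imgboat1.png")
weighted = rank(documents, "boat^2.3 imgboat1.png")
```

Without a weight, b.xml ranks first because it mentions imgboat1.png twice; with boat weighted at 2.3, a.xml moves to the top.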
Unlike the documents in the first example, documents containing boat, which now has an assigned weight of 2.3, will always be at the top of the results. A reference to the boat image may or may not be in those documents.
To further refine your search, enter a third word, such as flower, in the search field. The weight assigned to this third word may be higher or lower than that assigned to the first word. If the first word has a higher weight (for example, 7.2) than the third word (for example, 2.5), documents containing the first word are highly likely to show up first in the search results.
Example 3: +word1 +word2
You can get documents containing both words by entering the plus sign as the word prefix. These two words can be in separate places throughout a document. They do not have to be next to one another as in a phrase. For example, flower and imgflower1.png are in separate places, but flower may not be associated with imgflower1.png. The image may be labeled with Flower in one document and Rose in another version of the same document.
Example 4: +word1 -word2
Use this example when seeking documents each containing the first word and not containing the second word. You do this by putting the plus sign as the prefix to the first word and the minus sign as the prefix to the second word. If a document contains both the first word and the second word, it is not retrieved in the search results.
For example, if you search for three words, the last of which is -canoe, you will get documents containing imgboat1.png, but not containing canoe.
Example 5: "word1 word2"
You can get documents where both words appear in a phrase. These words are not located in separate places in a document as shown in the third example. To indicate a phrase, put the words between double quotes, as in "word1 word2".
Example 6: phrase1 -word1 word2^3.5
Let's suppose you want to find the documents that contain a phrase, do not contain one word, and contain another word with a weight of 3.5. For example, you want documents containing "Figure 7. This is the picture of a flower", not containing boat, and containing stone as the word with a weight of 3.5.
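The operators from the six examples can be summarized in a small parser. This is an illustrative sketch, not Terrier's parser: the function names are hypothetical, and it assumes that a quoted phrase must appear in the document, as the example above describes.

```python
import re

# A token is either a quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(query):
    """Sort query tokens into phrases, +required, -excluded and weighted words."""
    parsed = {"phrases": [], "required": [], "excluded": [], "optional": {}}
    for phrase, word in TOKEN.findall(query):
        if phrase:
            parsed["phrases"].append(phrase.lower())
        elif word.startswith("+"):
            parsed["required"].append(word[1:].lower())
        elif word.startswith("-"):
            parsed["excluded"].append(word[1:].lower())
        else:
            term, _, weight = word.partition("^")
            parsed["optional"][term.lower()] = float(weight) if weight else 1.0
    return parsed


def matches(text, parsed):
    """True if text has every phrase and +word and none of the -words."""
    lowered = text.lower()
    words = set(re.findall(r"\w+(?:\.\w+)*", lowered))
    return (all(phrase in lowered for phrase in parsed["phrases"])
            and all(word in words for word in parsed["required"])
            and not any(word in words for word in parsed["excluded"]))
```

For instance, `parse_query('"picture of a flower" -boat stone^3.5')` yields one phrase, one excluded word, and one optional word with weight 3.5. The optional words play no part in `matches`; they would only influence ranking.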
Users who must meet tight deadlines for obtaining and presenting information from a bucket of thousands of files will find imgSeek and Terrier useful. These two tools offer capabilities that an operating system provides only in limited form or not at all. imgSeek searches images by content, keyword, and similarity. Terrier lets you give a weight to one of the words to be searched in the contents of documents.
- Learn more about Terrier's batch and interactive versions. Check for updates to the Terrier Query Language.
- Get more details on imgSeek. Take a journey through screenshots of interesting images and photos.