Mark_Ashworth 27000019EC Tags:  and text extensibility geodetic timeseries search informix basic ibm spatial 3,228 Views
Welcome to my blog. Here I will focus on the Extensibility aspects of the IBM Informix Extensibility features and some key technologies available in the IBM Informix Server include the Spatial, Geodetic, Timeseries and Basic Text Search.
Mark_Ashworth 27000019EC Tags:  informix text lucene search analyzer standaard canonical_maps bts clucene basic 3,033 Views
I have received several questions about how BTS analysis the characters in a document to form tokens. I use token instead of words here because, as we will see, tokens do not necessarily map to what is considered a single word. Currently BTS uses the standard CLucene analyzer to perform this analysis. Here in this blog entry, I will show some examples of small input documents and resulting token stream. Each token is delineated by square brackets ([ ]).
For my examples, I will assume the index is using the default list of stop words, that include words like: "the" and "a"
My first example is a common English-language pangram:
The Quick Brown Fox Jumped Over The Lazy Dog
This is analyzed into a token stream, converted to lowercase and the stopwords are eliminated:
There are no surprises yet. Words with apostrophes (') are also handled:
Prequ'ile Mark's 'cause
Note that the
Now let us look at documents that contain digits:
-12 -.345 -898.2 -56. –
Note that the minus sign is included in the token only if it is followed by a digit.
Now we mix it up a little with alphabetic characters:
1abc 12abc abc1 abc12 -1abc -12abc abc-1 abc-12 -12.abc abc.321
Notice how numbers follow by characters are broken into two tokens but numbers with proceeding alphabetic characters are not.
The CLucene standard analyzer also recognizes other frequently found patterns in modern documents, such as email address and company names that have an ampersand (&) character or ip addresses"
firstname.lastname@example.org of XY&Z Corporation as an ip address 192.168.1.1
results in the token stream:
and the text:
their web site is http://www.xyz_corporation.com
The last example has other characters:
m re#23 () %$ @
I expect some of that behaviour may be a little surprising. But what if you want to break an email address or web page address into its component words? Let's say you are interested in finding any documents with the name 'ashworth' including email addresses. In my example email@example.com is indexed as one token and BTS will not find 'ashworth' in a simple term search. To help the user out, I recently recommended the following solution. You can create the index with canonical_maps parameter to map periods (.) and at-signs (@) to spaces. The mapping is done as a preprocessor to the analyzer and the component words of an email address or web site would be indexed:
So the document:
firstname.lastname@example.org of XY&Z Corporation as an ip address 192.168.1.1
is transformed to:
mark ashworth example.com of XY&Z Corporation as an ip address 192 168 1 1
and the following token stream is generated:
Now you could search for the 'ashworth'. You can also look for emails with example.com in them by using a phrase search:
So I will end this entry with output from dbaccess of an example of how canonical_maps can help with queries on email addresses::
That’s all for now, Enjoy
Its been a great year and a very busy December for me. I have been getting many questions on BTS in the past week and it good it being used so much. As this will be the last blog entry of 2010, I will discuss a question I was asked on an error that BTS can generate:
(BTSB0) - bts clucene error: Too Many Clauses
When CLucene rewrites a search expression with, by default, more than 1024 clauses, this error is generated.
What causes this error and how can it be addressed will be discuss here.
First, the cause: if it happens, it usually happens on a wildcard query. In a previous blog entry I discussed how the input stream of characters are analyzed into tokens. The CLucene index, on which BTS is based, maintains a dictionary of these tokens on a per field basis. When a wildcard or fuzzy query is specified on a field, its dictionary is searched for all the tokens that will match the wildcard or fuzzy expression and then its rewrites that expression with a number of simple term searches joined with a Boolean OR operator.
I have a data-set with a large number of distinct tokens that I use for testing. If I search this index with the wildcard search:
the query is rewritten as
(slob or slobber or sloe or slog or slogan or sloop)
In this example we see the wildcard search is replaced with the search expression of 6 simple term clauses. The wildcard search of sl* generates a search expression with some 956 simple terms. And finally if I search for just s*, there are 32798 terms and, by default, it will generate the Too Many Clauses error.
CLucene has this limit because, as we see with s*, some wildcard expansions can be very large and this can use a lot of virtual memory. CLucene has a parameter to limit and help control memory usage. The good news is that it is tunable and an index basis. In BTS, if you have indexes that you want to allow very large query rewrites, then you can specify the max_clause_count parameter. For my test index, if I want to search on the wildcard search s*, then I need to create the index with a max_clause_count larger that 32798. For example:
create index bts_idx on bts_tab (text bts_char_ops) using bts (max_clause_count="4000"); in sbspace1;
Keep in mind that these queries can result in more memory usage and you may see the server allocation more virtual segments. The number of virtual segments attached can be monitored with onstat -g seg.
For those celebrating the holiday season, I wish you all the best. I will be back here next year with some more interesting information on extensibility in the Informix Server.
BTS support Fuzzy Term search, but how does it work?
BTS is based on CLucene, a derived from Lucene and you can search the web for some of the details of the fuzzy term search used by Lucene, or you can read on :-)
A fuzzy term search is specified with a tilde (~) following a word with an optional parameter specifying the amount of similarity. The parameter is between 0 and 1 where 1 is similar (exact) is very similar and 0 being not very similar. The default that is used if the parameter is not specified is 0.5. For example in BTS, you can search for words similar to tone by specifying:
which is equivalent to:
Most internet searches on fuzzy term search will simple say that CLucene or Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. There is lots of information on this algorithm and I don't plan to go into the details of that algorithm here. Basically, this algorithm measures the amount of difference between two string and returns in a integer value greater than or equal to zero. Let us look that Edit distance between tone and the following list of words: tone, ton, tune, tones, once, tan, two, terrible, fundamental:
From these results, identical words have an edit distance of zero. Very similar words have small edit distances and the edit distance increase as the works are less and less similar.The question I have been asked several times is how this integer value relates to the similarity parameter which is a floating point number between 0 and 1. First, CLucene calculates this similarity value, sim, between the search term and a word in the index:
Here is the results of this calculation on our sample set:
It should not be a surprise that if both the term and the word are the same, sim = 1. If they are similar with an edit distance of 1, then sim is closer to 1. As they get less and less similar (ie the edit distance gets larger), the value gets closer to 0.
If the edit distance is greater than the minimum length, then sim becomes negative. Since the similarity parameter is between 0 and 1, you may have already guessed that any word with a negative sim will not be used in the search.
To perform the fuzzy term search, the query is rewritten to search for every
word that is similar to the term. Now we take into consideration the
similarity parameter between 0 and 1. CLucene will calculate the sim
value for every word in the field being searched and if its greater than the
similarity parameter, then that word is used in the query rewrite. In our
example, we are using a similarity parameter of 0.5. For 'tone~', the words
with a sim greater than 0.50 are: tone, ton, tune, and tones are
greater than 0.50 so the query is rewritten from:
Notice that text containing once will not be selected using this fuzzy query because its sim is 0.50 which is not greater than 0.50. If we wanted fuzzy term search to find text with this word using the fuzzy term search, we would need to specify a similarity parameter less that point 0.50. For example:
I hope this sheds some light on how fuzzy term search are handled. If you want to hear more about BTS and its features in the latest release, please consider attending my IIUG 2011 Conference session: Advanced Text Searching Topics in Informix.
Postscript: Here is some SQL for this example to try it out: