Mark_Ashworth 27000019EC Tags:  and text extensibility search timeseries geodetic informix basic ibm spatial 2,646 Views
Welcome to my blog. Here I will focus on the Extensibility aspects of the IBM Informix Extensibility features and some key technologies available in the IBM Informix Server include the Spatial, Geodetic, Timeseries and Basic Text Search.
Mark_Ashworth 27000019EC Tags:  bts composite lucene field informix text index xml clucene 2 Comments 4,034 Views
In Informix 11.70, we introduced a new feature in BTS to support creating a composite index. This is an index across several columns in a table.
Given the table:
create table address(The simplest case is to create a composite index on all these columns as if it were one column. With a single column, we specify a column and opclass pair. With a composite index, 2 or more column and opclass pairs are specified and the strings in each column are concatenated together to form one string and indexed in the default contents field. As with a single column, the opclass must match the data type of its column. Here is what the create index looks like:
create index bts_idx on address(A search may use any of the columns in the bts_contains() predicate:
select * from address where bts_contains(fname, 'john');would produce the same results as
select * from address where bts_contains(address1, 'john');
Now with an XML document we can index the tags as fields in the BTS index. Why not do the same with composite indexes and have BTS use the column in the bts_contains() as the field to search? Well you can!
Lets step back for a moment. When the clucene index is created on unstructured text, its indexed under a field called contents. This is the default field that is used when a query term does not specify a field. So the predicate: bts_contains(mycol, 'john') is equivalent as bts_contains(mycol, 'contents:john').
With XML, we index the tag information in a number of different fields. We can list a specific set of tags to index with the xmltags parameter. These tags are indexed as tags and by default, the first tag in that list is the default field for searching (instead of the field named contents). For example, xmltags="(name,address,comment)", the default field is name. In this case, bts_contains(mycol, 'john') is equivalent as bts_contains(mycol, 'name:john'). In the case of all_xmltags, there is no definitive list of tags so the default field for queries defaults back to contents. However, BTS provides a query_default_field parameter to explicitly specify the default field used for queries.
Now back to composite indexes. With a query, we need to specify the column name. In the simple case all columns are indexed into the contents field, so the column name is not used. But if you want to index each column as a field and used the field designated as the default column to search, then you can by specifying query_default_field="*". The * tells BTS to build an index with each the text in each column indexed as a separate field after the name of the column and that also becomes the default field on queires.
create index bts_idx on address(The query:
select * from address where bts_contains(fname, 'john');is equivalent to:
select * from address where bts_contains(fname, 'fname:john');This will only search text in the fname field and would not for example find any street name john in the address1 or address2 field.
Finally, the search is not limited to one column, like with XML,you can search different tags, you can do the same with composite indexes. Just specify the column name as a field to search.
select * from address where bts_contains(fname, 'nancy ANDcity:nipigon');Will search for a row that contains nancy in the fname column and nipigon in the city column.
A closing thought, what about composite BTS indexes on columns of XML. It does work and perhaps a topic for another blog entry.
Mark_Ashworth 27000019EC Tags:  clucene index panther bts provisioning storage lucene feature text cheetah continuous availability informix 2,437 Views
When the Basic Text Search DataBlade (BTS) was introduced in Informix Version 11.10, it only provided support extspaces, ie storing the index files outside the database.
With Informix Version 11.50, BTS introduced support for storing the index in a sbspace with the index "files" stored in smart blobs. Now the indexes were full class database objects being logged and backed up like other indexes in Informix. Logging also meant BTS could be and is supported in the Continuous Availability Feature (Mach 11).
However, creating or modifying a clucene based index caused a significant amount of logging which needed to be addressed before the feature was released. This issue was resolved doing the create index, insert, update, or delete in a temporary area. This temporary area is specified by the onconfig parameter SBSPACETEMP. This variable is similar to DBSPACETEMP and is described in the Onconfig.std as:
In its simplest form it may be specified as:
The space, in this case, mysbspacetemp, may either be a temporary sbspace or a normal sbspace. However we recommend only using temporary sbspaces here so that intermediate operations are not logged. If it is not a temporary sbspace, BTS can still use it and does turn off logging for these smart blobs values but there will still be logging for the smartblob metadata.
In 11.50, a temporary sbspace may be created with a -t flag to onspaces, for example:
In Informix Version 11.70, there are more convenient facilities. The first time a BTS index is created in a database without BTS registered, the server will first auto-configure BTS and among other things, if storage provisioning is enabled, a temporary sbspace. If necessary, the SBSPACETEMP is set and a temporary sbspace is created.
In 11.70, we also have the option of using the admin api to create sbspace. Since I usually have storage provisioning enabled, I like using the admin api to create additional temporary sbspace, for example:
SBSPACETEMP is a list, so multiple sbspaces can be specified using ith either colon or comma separators. For example:
With this example, BTS now has three sbspaces to choose from, which one is picked?
Yes, BTS will have a preference! First, BTS will prefer an sbspace in the list that is a temporary sbspace over a normal sbspace. Then from that subset, it picks the one with the most free space available. Note: for any given operation, BTS will not spread the work across multiple sbspaces.
Finally, there is a undocumented parameter in BTS called
Remember, temporary sbspaces are important to prevent unnecessary or excessive logging.
Mark_Ashworth 27000019EC Tags:  informix text lucene search analyzer standaard canonical_maps bts clucene basic 2,512 Views
I have received several questions about how BTS analysis the characters in a document to form tokens. I use token instead of words here because, as we will see, tokens do not necessarily map to what is considered a single word. Currently BTS uses the standard CLucene analyzer to perform this analysis. Here in this blog entry, I will show some examples of small input documents and resulting token stream. Each token is delineated by square brackets ([ ]).
For my examples, I will assume the index is using the default list of stop words, that include words like: "the" and "a"
My first example is a common English-language pangram:
The Quick Brown Fox Jumped Over The Lazy Dog
This is analyzed into a token stream, converted to lowercase and the stopwords are eliminated:
There are no surprises yet. Words with apostrophes (') are also handled:
Prequ'ile Mark's 'cause
Note that the
Now let us look at documents that contain digits:
-12 -.345 -898.2 -56. –
Note that the minus sign is included in the token only if it is followed by a digit.
Now we mix it up a little with alphabetic characters:
1abc 12abc abc1 abc12 -1abc -12abc abc-1 abc-12 -12.abc abc.321
Notice how numbers follow by characters are broken into two tokens but numbers with proceeding alphabetic characters are not.
The CLucene standard analyzer also recognizes other frequently found patterns in modern documents, such as email address and company names that have an ampersand (&) character or ip addresses"
email@example.com of XY&Z Corporation as an ip address 192.168.1.1
results in the token stream:
and the text:
their web site is http://www.xyz_corporation.com
The last example has other characters:
m re#23 () %$ @
I expect some of that behaviour may be a little surprising. But what if you want to break an email address or web page address into its component words? Let's say you are interested in finding any documents with the name 'ashworth' including email addresses. In my example firstname.lastname@example.org is indexed as one token and BTS will not find 'ashworth' in a simple term search. To help the user out, I recently recommended the following solution. You can create the index with canonical_maps parameter to map periods (.) and at-signs (@) to spaces. The mapping is done as a preprocessor to the analyzer and the component words of an email address or web site would be indexed:
So the document:
email@example.com of XY&Z Corporation as an ip address 192.168.1.1
is transformed to:
mark ashworth example.com of XY&Z Corporation as an ip address 192 168 1 1
and the following token stream is generated:
Now you could search for the 'ashworth'. You can also look for emails with example.com in them by using a phrase search:
So I will end this entry with output from dbaccess of an example of how canonical_maps can help with queries on email addresses::
That’s all for now, Enjoy
Its been a great year and a very busy December for me. I have been getting many questions on BTS in the past week and it good it being used so much. As this will be the last blog entry of 2010, I will discuss a question I was asked on an error that BTS can generate:
(BTSB0) - bts clucene error: Too Many Clauses
When CLucene rewrites a search expression with, by default, more than 1024 clauses, this error is generated.
What causes this error and how can it be addressed will be discuss here.
First, the cause: if it happens, it usually happens on a wildcard query. In a previous blog entry I discussed how the input stream of characters are analyzed into tokens. The CLucene index, on which BTS is based, maintains a dictionary of these tokens on a per field basis. When a wildcard or fuzzy query is specified on a field, its dictionary is searched for all the tokens that will match the wildcard or fuzzy expression and then its rewrites that expression with a number of simple term searches joined with a Boolean OR operator.
I have a data-set with a large number of distinct tokens that I use for testing. If I search this index with the wildcard search:
the query is rewritten as
(slob or slobber or sloe or slog or slogan or sloop)
In this example we see the wildcard search is replaced with the search expression of 6 simple term clauses. The wildcard search of sl* generates a search expression with some 956 simple terms. And finally if I search for just s*, there are 32798 terms and, by default, it will generate the Too Many Clauses error.
CLucene has this limit because, as we see with s*, some wildcard expansions can be very large and this can use a lot of virtual memory. CLucene has a parameter to limit and help control memory usage. The good news is that it is tunable and an index basis. In BTS, if you have indexes that you want to allow very large query rewrites, then you can specify the max_clause_count parameter. For my test index, if I want to search on the wildcard search s*, then I need to create the index with a max_clause_count larger that 32798. For example:
create index bts_idx on bts_tab (text bts_char_ops) using bts (max_clause_count="4000"); in sbspace1;
Keep in mind that these queries can result in more memory usage and you may see the server allocation more virtual segments. The number of virtual segments attached can be monitored with onstat -g seg.
For those celebrating the holiday season, I wish you all the best. I will be back here next year with some more interesting information on extensibility in the Informix Server.
BTS support Fuzzy Term search, but how does it work?
BTS is based on CLucene, a derived from Lucene and you can search the web for some of the details of the fuzzy term search used by Lucene, or you can read on :-)
A fuzzy term search is specified with a tilde (~) following a word with an optional parameter specifying the amount of similarity. The parameter is between 0 and 1 where 1 is similar (exact) is very similar and 0 being not very similar. The default that is used if the parameter is not specified is 0.5. For example in BTS, you can search for words similar to tone by specifying:
which is equivalent to:
Most internet searches on fuzzy term search will simple say that CLucene or Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. There is lots of information on this algorithm and I don't plan to go into the details of that algorithm here. Basically, this algorithm measures the amount of difference between two string and returns in a integer value greater than or equal to zero. Let us look that Edit distance between tone and the following list of words: tone, ton, tune, tones, once, tan, two, terrible, fundamental:
From these results, identical words have an edit distance of zero. Very similar words have small edit distances and the edit distance increase as the works are less and less similar.The question I have been asked several times is how this integer value relates to the similarity parameter which is a floating point number between 0 and 1. First, CLucene calculates this similarity value, sim, between the search term and a word in the index:
Here is the results of this calculation on our sample set:
It should not be a surprise that if both the term and the word are the same, sim = 1. If they are similar with an edit distance of 1, then sim is closer to 1. As they get less and less similar (ie the edit distance gets larger), the value gets closer to 0.
If the edit distance is greater than the minimum length, then sim becomes negative. Since the similarity parameter is between 0 and 1, you may have already guessed that any word with a negative sim will not be used in the search.
To perform the fuzzy term search, the query is rewritten to search for every
word that is similar to the term. Now we take into consideration the
similarity parameter between 0 and 1. CLucene will calculate the sim
value for every word in the field being searched and if its greater than the
similarity parameter, then that word is used in the query rewrite. In our
example, we are using a similarity parameter of 0.5. For 'tone~', the words
with a sim greater than 0.50 are: tone, ton, tune, and tones are
greater than 0.50 so the query is rewritten from:
Notice that text containing once will not be selected using this fuzzy query because its sim is 0.50 which is not greater than 0.50. If we wanted fuzzy term search to find text with this word using the fuzzy term search, we would need to specify a similarity parameter less that point 0.50. For example:
I hope this sheds some light on how fuzzy term search are handled. If you want to hear more about BTS and its features in the latest release, please consider attending my IIUG 2011 Conference session: Advanced Text Searching Topics in Informix.
Postscript: Here is some SQL for this example to try it out:
As many users are aware, BTS is based on the CLucene Search Engine and documents are returned from the index in order from the highest to lowest score. This score value is the relevance of the document as compared to the documents indexed, the documents returned and the nature of the query.
To get the score, you specify a Statement Local Variable as the third argument to the bts_contains() predicate. This variable may be used in a comparison predicate in the where clause or as a returned column value in the select clause. For example:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real);
I get many questions on how this value is calculated. Since CLucene is open source on the Web, you can get the source and have a look but it's not that easy to understand. CLucene, itself, is based on Lucene and it is also available as open source. You can find many web pages discussing scoring for either of these packages. My original intention was to point my readers to those sources but its not easy material to digest so I am going to begin here with some very simple basics to help you get started.
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given document is to a user's query (from http://lucene.apache.org/core/3_6_2/scoring.html)
I have seen reference to a formula like this:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
As I said, it is not so easy to understand! This calculation returns the score value between 0 and 1. BTS converts this value to a percentage. For the purposes of this discussion I will not be asking you to understand this equation.
One of the inputs to the score calculation is the number of documents in the index and another is the number of documents that contain a given term (or word). For this discussion, I will use some examples around the words 'snow' and 'rain' using 10000 documents, each with 5 to 20 words.
Let's do a simple query:
SELECT FIRST 1 text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow is the main ingredient to many winter sports score 100.0000000000
Here, I get back the first row which happens to have a score of 100%. This seems reasonable because I am searching for 'snow' and the document contains the word 'snow'. When initially using BTS on simple document repositories, a query tends to return some rows with a score of 100% and that becomes expected. Discard that idea immediately! This should not be the expected result because as we work through these examples, we will see that this document will get other score values by changing the contents of the table and not the query. My first point is that score value is a ranking, depending on many other factors besides the text in the document itself and the specified query and its all relative.
To show you a stark contrast, I am going to do the same query with 'rain'. In my 10000 rows, I have 3 rows with the term 'snow' and 9995 rows with the word 'rain' in the document.
When I perform the query with 'rain':
SELECT FIRST 1 text, score FROM bts_tab WHERE bts_contains(text, 'rain', score#real); text rain freezes in winter score 50.02000430000
The score value is 50%, not 100%. Not even close to 100%!! Bit it is like the example with 'snow' where each row returned contains the word 'rain'. When users see this result, it can be very confusing. This demonstrates one aspect of how the document is scored. A factor in the score value which is based on the total number of documents in the index versus the number of documents that match the term. With 'snow', we have 3 out 10000 documents that match and with 'rain', we have 9995 documents that match. So with one word term, we see that if the ratio between the number of documents found versus the total number of documents in the repository is low, it will tend to raise the score value. The ratio of 3:10000 gives score values nearer 100% and a ratio of 9995:10000 gives a much lower score value.
We can see this holds true for all rows with 'snow' and 'rain' as we see with the min and max score range:
SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'snow', score#real); (count(*)) (min) (max) 3 83.33334350000 100.0000000000 1 row(s) retrieved. SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'rain', score#real); (count(*)) (min) (max) 9995 25.01000210000 50.02000430000
To bring the point home, I ran the same queries on the table but with only 10 rows, the same three rows with 'snow' and the one row with 'rain'. Now the ratio for 'snow' is 3:10 and the ratio for 'rain' is 1:10. We can see that the min and max values have dramatically changed because the change in ratios:
SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'snow', score#real); (count(*)) (min) (max) 3 59.88408280000 71.86090090000 1 row(s) retrieved. SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'rain', score#real); (count(*)) (min) (max) 1 100.0000000000 100.0000000000 1 row(s) retrieved.
This means that as documents are added or deleted from the index, the score values will be affected.
Now back to the table with 10000 rows. You will recall that 3 rows had one occurrence of the word 'snow' and the 9995 rows had one occurrence of the word 'rain'. But we saw a significant range of score values one each query so this is something else to consider. This brings me to the next factor in the scoring.
The score value also factors in the number of occurrences of search term versus the number of terms in the document. Let us take a look at the 3 rows with 'snow' and their scores:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow is the main ingredient to many winter sports score 100.0000000000 text snow covered mountains are areas that skiers flock to in the winter time score 83.33334350000 text in the city, snow needs to be cleared promptly to keep traffic moving score 83.33334350000
Each row has only one occurrence of the term 'snow'. The first row, minus the stop words, has 6 words. Similarly, the next two rows have 8 words. 1 occurrence of the term out of 6 words is more relevant than 1 occurrence of the term out of 8 words so the score value tends to be larger. This is why we see the row with 1:6 ratio gets a higher score than the two rows with a 1:8 ration. In the next example, we add the words 'snow covered' to the row containing 'mountains' to give that row 2 occurrences of 'snow'. Now we perform the query on 'snow' again:
SELECT text,score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text in the snow covered city, snow needs to be cleared promptly to keep traffic moving score 99.99999240000 text snow is the main ingredient to many winter sports score 84.85280610000 text snow covered mountains are areas that skiers flock to in the winter time score 70.71067050000
Since 'snow' appears twice in that modified row, it now has 2 occurrences of the term out of 10 words it makes the document more relevant and the other documents with 'snow' less relevant.
In the next example, I go to the extreme and replace the document with the word 'mountains' with a 5-word document containing 4 occurrences of 'snow'. Given the ratio of 4 to 5, this greatly increases the relevance of this document compared to the relevance other 2 documents and the score values reflect this:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow snow snow snow mountains score 100.0000000000 text in the snow covered city, snow needs to be cleared promptly to keep traffic moving score 50.50762940000 text snow is the main ingredient to many winter sports score 42.85714340000
In conclusion, we see that the ratio between the number of documents selected and the total number of documents and ratio of the number of occurrences of the term and the number of terms (or words) in a given document are two factors that play an important role in determining the score value.
When dealing with multiple terms in a query, each term is given a score factor. These individual score factors are weighted and summed to give an overall score between 0 and 100.
When looking at scores, please keep a few other details in mind. First, stop words are not indexed and not included in counts. Also if you are using a stemmer analyzer or soundex, the stemmed words or soundex codes are counted, not the original words. Since many words may map to one stemmed word or one soundex code, it naturally increases number of occurrences in a document. Finally, in the real world, some terms are more important than other terms, so the query language does support the capability to boost a term's relevance in the calculation. (See "Boost a term" in BTS part of the IBM Informix Database Extensions User's Guide.)
This should help you understand the basics of how a document returned from a query on a BTS index is scored. There is an abundance of details on the Web on how CLucene / Lucene score documents.