It's been a great year and a very busy December for me. I have been getting many questions on BTS in the past week, and it is good to see it being used so much. As this will be the last blog entry of 2010, I will discuss a question I was asked about an error that BTS can generate:
(BTSB0) - bts clucene error: Too Many Clauses
When CLucene rewrites a search expression into more than 1024 clauses (the default limit), this error is generated.
This entry discusses what causes the error and how it can be addressed.
First, the cause: when this error occurs, it is usually on a wildcard query.
In a previous blog entry I discussed how the input stream of characters is analyzed into tokens. The CLucene index, on which BTS is based, maintains a dictionary of these tokens on a per-field basis. When a wildcard or fuzzy query is specified on a field, CLucene searches that field's dictionary for all the tokens that match the wildcard or fuzzy expression, and then rewrites the expression as a number of simple term searches joined with a Boolean OR operator.
I have a data-set with a large number of distinct tokens that I use for testing. If I search this index
with the wildcard search:
the query is rewritten as
(slob or slobber or sloe or slog or slogan or sloop)
In this example we see the wildcard search is replaced with the search expression of 6 simple term clauses. The wildcard search of sl* generates a search expression with some 956 simple terms. And finally if I search for just s*, there are 32798 terms and, by default, it will generate the Too Many Clauses error.
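The rewrite described above can be imitated in a few lines of Python. This is only a sketch of the idea, not CLucene's implementation: the function name, the exception class, the small dictionary, and the pattern `slo*` are all assumptions made for illustration. Only the default limit of 1024 comes from the text.

```python
from fnmatch import fnmatchcase

DEFAULT_MAX_CLAUSE_COUNT = 1024  # default limit mentioned in the text


class TooManyClauses(Exception):
    """Raised when a rewritten query exceeds the clause limit."""


def rewrite_wildcard(pattern, dictionary, max_clause_count=DEFAULT_MAX_CLAUSE_COUNT):
    """Expand a wildcard pattern into an OR of simple term clauses."""
    # Collect every token in the field dictionary that matches the pattern.
    terms = sorted(t for t in dictionary if fnmatchcase(t, pattern))
    if len(terms) > max_clause_count:
        raise TooManyClauses(
            f"{len(terms)} clauses exceeds the limit of {max_clause_count}"
        )
    return "(" + " or ".join(terms) + ")"


# Hypothetical field dictionary; a real index would hold far more tokens.
field_dictionary = {"slob", "slobber", "sloe", "slog", "slogan", "sloop", "slab", "sled"}

print(rewrite_wildcard("slo*", field_dictionary))
# prints: (slob or slobber or sloe or slog or slogan or sloop)
```

Calling `rewrite_wildcard("s*", field_dictionary, max_clause_count=4)` raises `TooManyClauses`, because all eight tokens match and that exceeds the limit of 4.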
CLucene has this limit because, as we see with s*, some wildcard expansions can be very large and can use a lot of virtual memory. The limit is a parameter that helps control memory usage. The good news is that it is tunable on a per-index basis. In BTS, if you have indexes for which you want to allow very large query rewrites, you can specify the max_clause_count parameter. For my test index, if I want to run the wildcard search s*, I need to create the index with a max_clause_count larger than 32798. For example:
create index bts_idx on bts_tab (text bts_char_ops)
using bts (max_clause_count="40000") in sbspace1;
Keep in mind that these queries can result in more memory usage, and you may see the server allocating more virtual segments. The number of virtual segments attached can be monitored with onstat -g seg.
For those celebrating the holiday season, I wish you all the best. I will be back here next year with some more interesting information on extensibility in the Informix Server.
From archive: December 2010
Mark_Ashworth Tags: informix text lucene search analyzer standaard canonical_maps bts clucene basic
I have received several questions about how BTS analyzes the characters in a document to form tokens. I use "tokens" instead of "words" here because, as we will see, tokens do not necessarily map to what is considered a single word. Currently BTS uses the standard CLucene analyzer to perform this analysis. In this blog entry, I will show some examples of small input documents and the resulting token streams. Each token is delineated by square brackets ([ ]).
For my examples, I will assume the index is using the default list of stop words, which includes words like "the" and "a".
My first example is a common English-language pangram:
The Quick Brown Fox Jumped Over The Lazy Dog
This is analyzed into a token stream in which the text is converted to lowercase and the stop words are eliminated:
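As a rough illustration of lowercasing and stop word removal, here is a minimal Python sketch. It is not the CLucene standard analyzer: the function name is hypothetical, the stop word set is an assumed subset (only "the" and "a" come from the text), and it ignores the apostrophe, number, and email handling covered later in this entry.

```python
import re

# Assumed subset of the default stop word list; only "the" and "a" come from the text.
STOP_WORDS = {"the", "a"}


def simple_analyze(text):
    """Lowercase the text, split on non-alphanumeric characters, drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


tokens = simple_analyze("The Quick Brown Fox Jumped Over The Lazy Dog")
print(" ".join(f"[{t}]" for t in tokens))
# prints: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
```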
There are no surprises yet. Words with apostrophes (') are also handled:
Prequ'ile Mark's 'cause
Note that the
Now let us look at documents that contain digits:
-12 -.345 -898.2 -56. –
Note that the minus sign is included in the token only if it is followed by a digit.
Now we mix it up a little with alphabetic characters:
1abc 12abc abc1 abc12 -1abc -12abc abc-1 abc-12 -12.abc abc.321
Notice how numbers followed by alphabetic characters are broken into two tokens, but numbers with preceding alphabetic characters are not.
The CLucene standard analyzer also recognizes other patterns frequently found in modern documents, such as email addresses, company names that contain an ampersand (&) character, and IP addresses:
firstname.lastname@example.org of XY&Z Corporation as an ip address 192.168.1.1
results in the token stream:
and the text:
their web site is http://www.xyz_corporation.com
The last example has other characters:
m re#23 () %$ @
I expect some of that behaviour may be a little surprising. But what if you want to break an email address or web page address into its component words? Let's say you are interested in finding any documents with the name 'ashworth', including in email addresses. In my example email@example.com is indexed as one token, and BTS will not find 'ashworth' in a simple term search. To help the user out, I recently recommended the following solution. You can create the index with the canonical_maps parameter to map periods (.) and at-signs (@) to spaces. The mapping is done as a preprocessor to the analyzer, so the component words of an email address or web site are indexed:
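The effect of that preprocessing step can be sketched in Python. This only imitates the character mapping, it does not show the canonical_maps syntax, and the function name and sample address are hypothetical.

```python
# Map periods and at-signs to spaces before the text reaches the analyzer.
CANONICAL_MAP = str.maketrans({".": " ", "@": " "})


def apply_canonical_map(document):
    """Return the document with every mapped character replaced by a space."""
    return document.translate(CANONICAL_MAP)


# Hypothetical address, used only for illustration.
print(apply_canonical_map("jane.doe@example.org at 192.168.1.1"))
# prints: jane doe example org at 192 168 1 1
```

Note that the mapping applies to every period in the document, so the IP address is split into its four numbers as well.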
So the document:
firstname.lastname@example.org of XY&Z Corporation as an ip address 192.168.1.1
is transformed to:
mark ashworth example.com of XY&Z Corporation as an ip address 192 168 1 1
and the following token stream is generated:
Now you could search for 'ashworth'. You can also look for emails with example.com in them by using a phrase search:
So I will end this entry with output from dbaccess of an example of how canonical_maps can help with queries on email addresses:
That's all for now. Enjoy!
Mark_Ashworth Tags: clucene index panther bts provisioning storage lucene feature text cheetah continuous availability informix
When the Basic Text Search DataBlade (BTS) was introduced in Informix Version 11.10, it only supported extspaces, i.e., storing the index files outside the database.
With Informix Version 11.50, BTS introduced support for storing the index in an sbspace, with the index "files" stored in smart blobs. With this change, the indexes became first-class database objects, logged and backed up like other indexes in Informix. Logging also meant that BTS could be, and is, supported in the Continuous Availability Feature (MACH 11).
However, creating or modifying a CLucene-based index caused a significant amount of logging, which needed to be addressed before the feature was released. This issue was resolved by doing the create index, insert, update, or delete in a temporary area. This temporary area is specified by the onconfig parameter SBSPACETEMP. This parameter is similar to DBSPACETEMP and is described in onconfig.std as:
In its simplest form it may be specified as:
The space, in this case mysbspacetemp, may be either a temporary sbspace or a normal sbspace. However, we recommend using only temporary sbspaces here so that intermediate operations are not logged. If it is not a temporary sbspace, BTS can still use it and does turn off logging for these smart blob values, but there will still be logging for the smart blob metadata.
In 11.50, a temporary sbspace may be created with a -t flag to onspaces, for example:
In Informix Version 11.70, there are more convenient facilities. The first time a BTS index is created in a database without BTS registered, the server will first auto-configure BTS and, among other things, if storage provisioning is enabled, set up a temporary sbspace. If necessary, SBSPACETEMP is set and a temporary sbspace is created.
In 11.70, we also have the option of using the admin API to create sbspaces. Since I usually have storage provisioning enabled, I like using the admin API to create additional temporary sbspaces, for example:
SBSPACETEMP is a list, so multiple sbspaces can be specified using either colon or comma separators. For example:
With this example, BTS now has three sbspaces to choose from. Which one is picked?
Yes, BTS will have a preference! First, BTS will prefer an sbspace in the list that is a temporary sbspace over a normal sbspace. Then from that subset, it picks the one with the most free space available. Note: for any given operation, BTS will not spread the work across multiple sbspaces.
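The selection rule just described can be sketched in Python. This is an illustration of the stated preference only, not BTS code: the class, function names, sbspace names, and free-space figures are all hypothetical, and falling back to normal sbspaces when no temporary one is listed is my assumption.

```python
from dataclasses import dataclass


@dataclass
class Sbspace:
    name: str
    is_temporary: bool
    free_pages: int


def parse_sbspacetemp(value):
    """Split an SBSPACETEMP-style list on colon or comma separators."""
    return [n.strip() for n in value.replace(":", ",").split(",") if n.strip()]


def pick_sbspace(names, catalog):
    """Prefer temporary sbspaces, then pick the one with the most free space."""
    candidates = [catalog[n] for n in names if n in catalog]
    if not candidates:
        return None
    temporary = [s for s in candidates if s.is_temporary]
    # Assumption: use normal sbspaces only when no temporary one is listed.
    pool = temporary or candidates
    # A single sbspace is chosen; work is not spread across several.
    return max(pool, key=lambda s: s.free_pages)


# Hypothetical sbspaces and free-space figures.
catalog = {
    "tmpsbs1": Sbspace("tmpsbs1", True, 500),
    "tmpsbs2": Sbspace("tmpsbs2", True, 2000),
    "sbspace1": Sbspace("sbspace1", False, 9000),
}

names = parse_sbspacetemp("tmpsbs1:tmpsbs2,sbspace1")
print(pick_sbspace(names, catalog).name)
# prints: tmpsbs2
```

Here sbspace1 has the most free space overall but is passed over because it is not temporary.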
Finally, there is an undocumented parameter in BTS called
Remember, temporary sbspaces are important to prevent unnecessary or excessive logging.