As many users are aware, BTS is based on the CLucene Search Engine and documents are returned from the index in order from the highest to lowest score. This score value is the relevance of the document as compared to the documents indexed, the documents returned and the nature of the query.
To get the score, you specify a Statement Local Variable as the third argument to the bts_contains() predicate. This variable may be used in a comparison predicate in the where clause or as a returned column value in the select clause. For example:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real);
I get many questions on how this value is calculated. Since CLucene is open source on the Web, you can get the source and have a look but it's not that easy to understand. CLucene, itself, is based on Lucene and it is also available as open source. You can find many web pages discussing scoring for either of these packages. My original intention was to point my readers to those sources but its not easy material to digest so I am going to begin here with some very simple basics to help you get started.
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given document is to a user's query (from http://lucene.apache.org/core/3_6_2/scoring.html)
I have seen reference to a formula like this:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
As I said, it is not so easy to understand! This calculation returns the score value between 0 and 1. BTS converts this value to a percentage. For the purposes of this discussion I will not be asking you to understand this equation.
One of the inputs to the score calculation is the number of documents in the index and another is the number of documents that contain a given term (or word). For this discussion, I will use some examples around the words 'snow' and 'rain' using 10000 documents, each with 5 to 20 words.
Let's do a simple query:
SELECT FIRST 1 text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow is the main ingredient to many winter sports score 100.0000000000
Here, I get back the first row which happens to have a score of 100%. This seems reasonable because I am searching for 'snow' and the document contains the word 'snow'. When initially using BTS on simple document repositories, a query tends to return some rows with a score of 100% and that becomes expected. Discard that idea immediately! This should not be the expected result because as we work through these examples, we will see that this document will get other score values by changing the contents of the table and not the query. My first point is that score value is a ranking, depending on many other factors besides the text in the document itself and the specified query and its all relative.
To show you a stark contrast, I am going to do the same query with 'rain'. In my 10000 rows, I have 3 rows with the term 'snow' and 9995 rows with the word 'rain' in the document.
When I perform the query with 'rain':
SELECT FIRST 1 text, score FROM bts_tab WHERE bts_contains(text, 'rain', score#real); text rain freezes in winter score 50.02000430000
The score value is 50%, not 100%. Not even close to 100%!! Bit it is like the example with 'snow' where each row returned contains the word 'rain'. When users see this result, it can be very confusing. This demonstrates one aspect of how the document is scored. A factor in the score value which is based on the total number of documents in the index versus the number of documents that match the term. With 'snow', we have 3 out 10000 documents that match and with 'rain', we have 9995 documents that match. So with one word term, we see that if the ratio between the number of documents found versus the total number of documents in the repository is low, it will tend to raise the score value. The ratio of 3:10000 gives score values nearer 100% and a ratio of 9995:10000 gives a much lower score value.
We can see this holds true for all rows with 'snow' and 'rain' as we see with the min and max score range:
SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'snow', score#real); (count(*)) (min) (max) 3 83.33334350000 100.0000000000 1 row(s) retrieved. SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'rain', score#real); (count(*)) (min) (max) 9995 25.01000210000 50.02000430000
To bring the point home, I ran the same queries on the table but with only 10 rows, the same three rows with 'snow' and the one row with 'rain'. Now the ratio for 'snow' is 3:10 and the ratio for 'rain' is 1:10. We can see that the min and max values have dramatically changed because the change in ratios:
SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'snow', score#real); (count(*)) (min) (max) 3 59.88408280000 71.86090090000 1 row(s) retrieved. SELECT count(*),min(score),max(score) FROM bts_tab WHERE bts_contains(text, 'rain', score#real); (count(*)) (min) (max) 1 100.0000000000 100.0000000000 1 row(s) retrieved.
This means that as documents are added or deleted from the index, the score values will be affected.
Now back to the table with 10000 rows. You will recall that 3 rows had one occurrence of the word 'snow' and the 9995 rows had one occurrence of the word 'rain'. But we saw a significant range of score values one each query so this is something else to consider. This brings me to the next factor in the scoring.
The score value also factors in the number of occurrences of search term versus the number of terms in the document. Let us take a look at the 3 rows with 'snow' and their scores:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow is the main ingredient to many winter sports score 100.0000000000 text snow covered mountains are areas that skiers flock to in the winter time score 83.33334350000 text in the city, snow needs to be cleared promptly to keep traffic moving score 83.33334350000
Each row has only one occurrence of the term 'snow'. The first row, minus the stop words, has 6 words. Similarly, the next two rows have 8 words. 1 occurrence of the term out of 6 words is more relevant than 1 occurrence of the term out of 8 words so the score value tends to be larger. This is why we see the row with 1:6 ratio gets a higher score than the two rows with a 1:8 ration. In the next example, we add the words 'snow covered' to the row containing 'mountains' to give that row 2 occurrences of 'snow'. Now we perform the query on 'snow' again:
SELECT text,score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text in the snow covered city, snow needs to be cleared promptly to keep traffic moving score 99.99999240000 text snow is the main ingredient to many winter sports score 84.85280610000 text snow covered mountains are areas that skiers flock to in the winter time score 70.71067050000
Since 'snow' appears twice in that modified row, it now has 2 occurrences of the term out of 10 words it makes the document more relevant and the other documents with 'snow' less relevant.
In the next example, I go to the extreme and replace the document with the word 'mountains' with a 5-word document containing 4 occurrences of 'snow'. Given the ratio of 4 to 5, this greatly increases the relevance of this document compared to the relevance other 2 documents and the score values reflect this:
SELECT text, score FROM bts_tab WHERE bts_contains(text, 'snow', score#real); text snow snow snow snow mountains score 100.0000000000 text in the snow covered city, snow needs to be cleared promptly to keep traffic moving score 50.50762940000 text snow is the main ingredient to many winter sports score 42.85714340000
In conclusion, we see that the ratio between the number of documents selected and the total number of documents and ratio of the number of occurrences of the term and the number of terms (or words) in a given document are two factors that play an important role in determining the score value.
When dealing with multiple terms in a query, each term is given a score factor. These individual score factors are weighted and summed to give an overall score between 0 and 100.
When looking at scores, please keep a few other details in mind. First, stop words are not indexed and not included in counts. Also if you are using a stemmer analyzer or soundex, the stemmed words or soundex codes are counted, not the original words. Since many words may map to one stemmed word or one soundex code, it naturally increases number of occurrences in a document. Finally, in the real world, some terms are more important than other terms, so the query language does support the capability to boost a term's relevance in the calculation. (See "Boost a term" in BTS part of the IBM Informix Database Extensions User's Guide.)
This should help you understand the basics of how a document returned from a query on a BTS index is scored. There is an abundance of details on the Web on how CLucene / Lucene score documents.