As many users are aware, BTS is based on the CLucene Search Engine and
documents are returned from the index in order from the highest to lowest
score. This score value is the relevance of the document as compared to the
documents indexed, the documents returned and the nature of the query.
To get the score, you specify a Statement Local Variable as the third argument
to the bts_contains() predicate. This variable may be used in a comparison
predicate in the where clause or as a returned column value in the select
clause. For example:
SELECT text, score
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
I get many questions on how this value is calculated. Since CLucene is open
source on the Web, you can get the source and have a look but it's not that
easy to understand. CLucene, itself, is based on Lucene and it is also
available as open source. You can find many web pages discussing scoring for
either of these packages. My original intention was to point my readers to
those sources but its not easy material to digest so I am going to begin here
with some very simple basics to help you get started.
Lucene scoring uses a combination of the Vector Space Model (VSM) of
Information Retrieval and the Boolean model to determine how relevant a given
document is to a user's query (from
http://lucene.apache.org/core/3_6_2/scoring.html)
I have seen reference to a formula like this:
score_d = sum_t(tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t * boost_t) * coord_q_d
As I said, it is not so easy to understand! This calculation returns the
score value between 0 and 1. BTS converts this value to a percentage. For
the purposes of this discussion I will not be asking you to understand this
equation.
One of the inputs to the score calculation is the number of documents in the
index and another is the number of documents that contain a given term (or
word). For this discussion, I will use some examples around the words 'snow'
and 'rain' using 10000 documents, each with 5 to 20 words.
Let's do a simple query:
SELECT FIRST 1 text, score
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
text snow is the main ingredient to many winter sports
score 100.0000000000
Here, I get back the first row which happens to have a score of 100%. This
seems reasonable because I am searching for 'snow' and the document contains
the word 'snow'. When initially using BTS on simple document repositories, a
query tends to return some rows with a score of 100% and that becomes
expected. Discard that idea immediately! This should not be the expected
result because as we work through these examples, we will see that this
document will get other score values by changing the contents of the table and
not the query. My first point is that score value is a ranking, depending on
many other factors besides the text in the document itself and the specified
query and its all relative.
To show you a stark contrast, I am going to do the same query with 'rain'. In
my 10000 rows, I have 3 rows with the term 'snow' and 9995 rows with the word
'rain' in the document.
When I perform the query with 'rain':
SELECT FIRST 1 text, score
FROM bts_tab
WHERE bts_contains(text, 'rain', score#real);
text rain freezes in winter
score 50.02000430000
The score value is 50%, not 100%. Not even close to 100%!! Bit it is like
the example with 'snow' where each row returned contains the word 'rain'.
When users see this result, it can be very confusing. This demonstrates one
aspect of how the document is scored. A factor in the score value which is
based on the total number of documents in the index versus the number of
documents that match the term. With 'snow', we have 3 out 10000 documents
that match and with 'rain', we have 9995 documents that match. So with one
word term, we see that if the ratio between the number of documents found
versus the total number of documents in the repository is low, it will tend to
raise the score value. The ratio of 3:10000 gives score values nearer 100%
and a ratio of 9995:10000 gives a much lower score value.
We can see this holds true for all rows with 'snow' and 'rain' as we see with
the min and max score range:
SELECT count(*),min(score),max(score)
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
(count(*)) (min) (max)
3 83.33334350000 100.0000000000
1 row(s) retrieved.
SELECT count(*),min(score),max(score)
FROM bts_tab
WHERE bts_contains(text, 'rain', score#real);
(count(*)) (min) (max)
9995 25.01000210000 50.02000430000
To bring the point home, I ran the same queries on the table but with only 10
rows, the same three rows with 'snow' and the one row with 'rain'. Now the
ratio for 'snow' is 3:10 and the ratio for 'rain' is 1:10. We can see that
the min and max values have dramatically changed because the change in ratios:
SELECT count(*),min(score),max(score)
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
(count(*)) (min) (max)
3 59.88408280000 71.86090090000
1 row(s) retrieved.
SELECT count(*),min(score),max(score)
FROM bts_tab
WHERE bts_contains(text, 'rain', score#real);
(count(*)) (min) (max)
1 100.0000000000 100.0000000000
1 row(s) retrieved.
This means that as documents are added or deleted from the index, the score
values will be affected.
Now back to the table with 10000 rows. You will recall that 3 rows had one
occurrence of the word 'snow' and the 9995 rows had one occurrence of the word
'rain'. But we saw a significant range of score values one each query so this
is something else to consider. This brings me to the next factor in the
scoring.
The score value also factors in the number of occurrences of search term
versus the number of terms in the document. Let us take a look at the 3 rows
with 'snow' and their scores:
SELECT text, score
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
text snow is the main ingredient to many winter sports
score 100.0000000000
text snow covered mountains are areas that skiers flock to in the winter time
score 83.33334350000
text in the city, snow needs to be cleared promptly to keep traffic moving
score 83.33334350000
Each row has only one occurrence of the term 'snow'. The first row, minus the
stop words, has 6 words. Similarly, the next two rows have 8 words. 1
occurrence of the term out of 6 words is more relevant than 1 occurrence of
the term out of 8 words so the score value tends to be larger. This is why we
see the row with 1:6 ratio gets a higher score than the two rows with a 1:8
ration. In the next example, we add the words 'snow covered' to the row
containing 'mountains' to give that row 2 occurrences of 'snow'. Now we
perform the query on 'snow' again:
SELECT text,score
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
text in the snow covered city, snow needs to be cleared promptly
to keep traffic moving
score 99.99999240000
text snow is the main ingredient to many winter sports
score 84.85280610000
text snow covered mountains are areas that skiers flock to in the winter time
score 70.71067050000
Since 'snow' appears twice in that modified row, it now has 2 occurrences of
the term out of 10 words it makes the document more relevant and the other
documents with 'snow' less relevant.
In the next example, I go to the extreme and replace the document with the
word 'mountains' with a 5-word document containing 4 occurrences of 'snow'.
Given the ratio of 4 to 5, this greatly increases the relevance of this
document compared to the relevance other 2 documents and the score values
reflect this:
SELECT text, score
FROM bts_tab
WHERE bts_contains(text, 'snow', score#real);
text snow snow snow snow mountains
score 100.0000000000
text in the snow covered city, snow needs to be cleared promptly
to keep traffic moving
score 50.50762940000
text snow is the main ingredient to many winter sports
score 42.85714340000
In conclusion, we see that the ratio between the number of documents selected
and the total number of documents and ratio of the number of occurrences of
the term and the number of terms (or words) in a given document are two
factors that play an important role in determining the score value.
When dealing with multiple terms in a query, each term is given a score
factor. These individual score factors are weighted and summed to give an
overall score between 0 and 100.
When looking at scores, please keep a few other details in mind. First, stop
words are not indexed and not included in counts. Also if you are using a
stemmer analyzer or soundex, the stemmed words or soundex codes are counted,
not the original words. Since many words may map to one stemmed word or one
soundex code, it naturally increases number of occurrences in a document.
Finally, in the real world, some terms are more important than other terms, so
the query language does support the capability to boost a term's relevance in
the calculation. (See "Boost a term" in BTS part of the IBM Informix Database
Extensions User's Guide.)
This should help you understand the basics of how a document returned from a
query on a BTS index is scored. There is an abundance of details on the Web
on how CLucene / Lucene score documents.