Better MySQL searches with Sphinx

Use Sphinx for non-full-text searching

Even though MySQL is a good general-purpose database, if your application is search-heavy, you may get better performance by using Sphinx instead. Despite Sphinx being a full-text search tool, it can increase the speed of your application even when working with non-full-text queries. This article shows how to configure Sphinx for this task, includes some example queries, measures their execution times, and demonstrates some of the trade-offs involved in the change considering what you need to use Sphinx in a general, systematic way.

Share:

Federico Kereki, Systems Engineer, Freelance

Photo of Federico KerekiFederico Kereki is a Uruguayan systems engineer with more than 20 years of experience developing systems, doing consulting work, and teaching at universities. He has been working with open source software for over 10 years and particularly appreciates the greater security of Linux. He recently wrote Essential GWT, a book about this open source tool.


developerWorks Contributing author
        level

08 November 2011

Also available in Chinese Russian Japanese

Introduction

MySQL is a good all-around database, but for search-heavy applications, you may fare better by looking at specific search utilities. This article considers Sphinx, a well-known full-text search package, as a MySQL replacement for searches, increasing speed even for non-full-text searches. The article studies the trade-offs and caveats involved in such a change, demonstrates some specific tests, and considers what is required to use Sphinx in a general way.

MySQL or Sphinx?

Why would you consider substituting Sphinx for MySQL? Consider, for instance, a bookseller's search application. A user might look for a book by specifying the title, author, status (new or used), edition (first or later), cover (hard or paperback), publisher, year of publication, whether it's signed by the author, price, and so on. MySQL typically doesn't use more than one index (an exception is index merge optimization; see Resources), so the only way to optimize all possible searches is by having a prohibitive number of multi-column indices, which is not good.

However, Sphinx is a search engine that can integrate well with MySQL and work on its own in a stand-alone fashion. It provides high indexing and searching performance and allows queries by means of SphinxQL, a Structured Query Language (SQL)-based language. Finally, Sphinx scales up well, being able to work with billions of documents spanning terabytes of data with distributed searching capabilities.

Sphinx works with documents (which can be simply records in a database table or view), text fields (which it indexes, providing full-text searches), and attributes (non-text values that you can use for filtering, sorting, and grouping results). Attributes are kept in random-access memory (RAM) for efficiency; see the Sphinx documentation in Resources for the actual size calculation formula.

To process queries, Sphinx depends on special index files. You have to define the data sources for the indexing procedure and then run the indexer program. Another possibility is using real-time index files, which you can update on the fly at the cost of somewhat lessened efficiency. I cover these in more detail later in the article.

A sample problem

To test Sphinx and MySQL search speeds, I looked for some interesting (in size) data sets and found a test database with about 3 million salary records, which seemed good enough for my purposes. (See Resources to get this data.) The scheme is simple: you have employees with job titles who work in departments and earn annual salaries. This sample works with the two largest tables: employees and salaries.

After installing the data, I found I needed to fix a couple details for the salaries table. First, the to_date field sometimes contained 9999-01-01 as a special marker; I substituted 2038-01-01 instead because UNIX® timestamps can't go beyond 2038. In addition, Sphinx requires every record to have a single field ID key, so I had to add a salaries_id auto-increment field to the salaries table. See Listing 1.

Listing 1. Fixing some details in the salaries table
ALTER TABLE salaries
  DROP PRIMARY KEY,
  ADD UNIQUE emp_from (emp_no, from_date) ;
ALTER TABLE salaries
  ADD salaries_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST ;
UPDATE salaries
  SET to_date="2038-01-01" WHERE to_date="9999-01-01";

Now consider the queries shown in Listing 2.

Listing 2. A couple of simple tests to compare MySQL and Sphinx
SELECT *
  FROM employees.employees
  WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30'
    AND hire_date >= '1998-01-01';
38 rows in set (0.19 sec)

SELECT *
  FROM employees.employees eee JOIN employees.salaries sss
  ON sss.emp_no=eee.emp_no
  WHERE eee.first_name='Yucel'
    AND sss.salary>120000
    AND sss.from_date >= '2000-01-01';
5 rows in set (0.15 sec)

The first search involves only employees and looks for people born in September of 1960 and hired in or after 1998. MySQL finds 38 records in approximately 0.19 seconds. (Of course, your results may vary.) The second search joins employees and salaries to find male workers named Yucel who earned over 120,000 in 2000 or later. MySQL finds five records in about 0.15 seconds.

To run these searches with Sphinx, you must set it up, index the data, and run a daemon. The next section covers these tasks.

Configuring Sphinx

Sphinx indexes data sources according to definitions in the sphinx.conf file, shown in Listing 3. Note the type conversions for the date fields. This article skims the main required parameters, but you should be aware that there are many more configuration settings you can work with.

Listing 3. The source definitions for Sphinx
source employeesSource
{
  type = mysql
  sql_host = localhost
  sql_user = mysqluser
  sql_pass = mysqlpass
  sql_db = employees
  sql_query= SELECT \
    emp_no as id, \
    TO_DAYS(birth_date) AS birth_date_td, \
    first_name, \
    last_name, \
    gender, \
    TO_DAYS(hire_date) AS hire_date_td \
    FROM employees
  sql_attr_uint = birth_date_td
  sql_field_string = first_name
  sql_field_string = last_name
  sql_field_string = gender
  sql_attr_uint = hire_date_td
}

source employeesSalariesSource
{
  type = mysql
  sql_host = localhost
  sql_user = mysqluser
  sql_pass = mysqlpass
  sql_db = employees
  sql_query= SELECT \
    ss.salaries_id AS id, \
    ee.emp_no AS emp_no, \
    TO_DAYS(ee.birth_date) AS birth_date_td, \
    ee.first_name AS first_name, \
    ee.last_name AS last_name, \
    ee.gender AS gender, \
    TO_DAYS(ee.hire_date) AS hire_date_td, \
    ss.salary AS salary, \
    UNIX_TIMESTAMP(ss.from_date) AS from_date_ts, \
    UNIX_TIMESTAMP(ss.to_date) AS to_date_ts \
    FROM employees ee JOIN salaries ss \
    ON ss.emp_no=ee.emp_no
  sql_attr_uint = emp_no
  sql_attr_timestamp = birth_date_ts
  sql_field_string = first_name
  sql_field_string = last_name
  sql_field_string = gender
  sql_attr_uint = hire_date_td
  sql_attr_uint = salary
  sql_attr_timestamp = from_date_ts
  sql_attr_timestamp = to_date_ts
}

The configuration file is divided into stanzas. The source part defines data sources. Sphinx can work with many types of files, including text, Hypertext Markup Language (HTML), and Extensible Markup Language (XML). However, this example uses only MySQL, and thus type=mysql. The sql_host, sql_user, sql_pass, and sql_db parameters define how to access the database and which schema to use. The sql_query parameter provides the SQL sentence that retrieves the data to be indexed. Essentially, the only restriction here is that the first field must be a unique, unsigned positive integer ID number. This is why I had to add a salaries_id field to the salaries table earlier. You can specify up to 32 text fields and an arbitrary number of attributes. Sphinx full-text indexes all columns except the ID (the first field) and attributes.

As you can see in Table 1, Sphinx supports several types of attributes but not all the possible MySQL data types. There are also some Sphinx-specific attribute types, but they aren't applicable if you are just looking for a MySQL replacement, so I don't include those here.

Table 1. Sphinx allows only a few attribute types
AttributeDescription
sql_attr_uint and sql_attr_bigint32-bit unsigned and 64-bit signed integer values. You use these two types for all integer database fields and also possibly as a DATE stand in.
sql_attr_float32-bit floating point values. If you want to store geographic coordinates, you use this attribute type. Also note that if you require more precision there's no solution; fields are rounded to about seven decimal digits.
sql_attr_boolA Boolean (single bit) value, similar to MySQL's tinyint values.
sql_attr_timestampA UNIX timestamp that can represent date/time values from 1970-01-01 to 2038-01-19. You cannot directly use DATE or DATETIME column types in Sphinx. You have to convert them to timestamps with the UNIX_TIMESTAMP() function. If you just require dates, you can use the TO_DAYS() function to convert a DATE field into an integer.
sql_attr_string and sql_field_stringStrings (obviously!), but the former are only for retrieval, while the latter are indexed as full text.

The index part of the configuration files describes the attributes of the sources (Listing 4).

Listing 4. Index definitions for Sphinx
index employeesIndex
{
  type = plain
  source = employeesSource
  path = /home/fkereki/bin/sphinx/var/data/sphinxFilesEI
  charset_type = utf-8
  preopen = 1
}

index employeesSalariesIndex
{
  type = plain
  source = employeesSalariesSource
  path = /home/fkereki/bin/sphinx/var/data/sphinxFilesESI
  charset_type = utf-8
  preopen = 1
}

Sphinx uses index files that are separate from the ones used by MySQL. The type=plain line means you are using the standard Sphinx index files. Other possibilities are distributed (when you have the index files distributed over several nodes in your network) and rt (standing for real time). You can update these indexes on the fly. The source= line relates a data source and an index. You can actually merge several data sources together in a single index, but this example doesn't do that. The path= line defines the index file name and where it is stored. The charset_type= line specifies if you will be working with a Single Byte Character Set (sbcs) or Universal Character Set (UCS) Transformation Format-8 bit (utf-8). Finally, preopen=1 tells the search daemon to open all index files upon load rather than waiting for the first queries to arrive.

The last stanzas in the configuration file have to do with the indexer and searchd applications (Listing 5).

Listing 5. Indexer and search daemon parameters
indexer
{
  mem_limit = 1024M
}

searchd
{
  listen = 127.0.0.1:9306:mysql41
  log = /home/fkereki/bin/sphinx/var/log/searchd.log
  query_log = /home/fkereki/bin/sphinx/var/log/query.log
  pid_file = /home/fkereki/bin/sphinx/var/log/searchd.pid
}

You have to define a mem_limit RAM size to give indexer enough memory to work with. The searchd definitions are rather self-explanatory, except for listen=. You use this parameter to specify at which IP address and port the SphinxQL binary network protocol is available. If you want to interact with the protocol directly, you can use the standard MySQL client by entering mysql -P 9306 and run queries without even running the MySQL mysqld program on your machine.

Now that everything is set up, you can simply index the data and start running the search daemon (Listing 6). The --all parameter means all index files will be generated.

Listing 6. Indexing data and running the search daemon
~/bin/sphinx/etc> ../bin/indexer --all
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
indexing index 'employeesSalariesIndex'...
collected 2844047 docs, 40.9 MB
sorted 8.5 Mhits, 100.0% done
total 2844047 docs, 40877736 bytes
total 24.842 sec, 1645464 bytes/sec, 114482.32 docs/sec
indexing index 'employeesIndex'...
collected 300024 docs, 4.3 MB
sorted 0.9 Mhits, 100.0% done
total 300024 docs, 4311224 bytes
total 0.914 sec, 4714945 bytes/sec, 328119.56 docs/sec
total 7 reads, 0.274 sec, 19695.2 kb/call avg, 39.2 msec/call avg
total 110 writes, 0.528 sec, 2351.0 kb/call avg, 4.8 msec/call avg

~/bin/sphinx/etc> ../bin/searchd
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file './sphinx.conf'...
listening on 127.0.0.1:9306
precaching index 'employeesSalariesIndex'
precaching index 'employeesIndex'
precached 2 indexes in 0.124 sec

Now that your data is indexed and the required daemon is running, you're ready to run some searches.

Running searches

The preferred method for querying Sphinx is to use SphinxQL. Any language that works with MySQL also works with SphinxQL. This example uses PHP, and the code is shown in Listing7.

Listing 7. A PHP program to query MySQL and Sphinx, timing their performances
$bd0 = mysql_connect('localhost:3306', 'mysqluser','mysqlpass');
$bd1 = mysql_connect('localhost:9306');

echo "FIRST TEST ... SINGLE TABLE\n\n";

$td1 = to_days("1960-09-01");
$td2 = to_days("1960-09-30");
$td3 = to_days("1998-01-01");

do_time("test 1 - MySQL ", "SELECT * FROM employees.employees ".
  "WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30' ".
  "AND hire_date >= '1998-01-01'", $bd0);

do_time("test 1 - Sphinx", "SELECT * FROM employeesIndex ".
  "WHERE birth_date_td BETWEEN {$td1} AND {$td2} ".
  "AND hire_date_td >= {$td3} LIMIT 0,10000 ", $bd1);

echo "\nSECOND TEST ... JOIN\n\n";

$ts1 = mktime(0,0,0,1,1,2000);

do_time("test 2 - MySQL ", "SELECT * ".
  "FROM employees.employees eee JOIN employees.salaries sss ".
  "ON sss.emp_no=eee.emp_no ".
  "WHERE eee.first_name='Yucel' ".
  "AND sss.salary>120000 ".
  "AND sss.from_date >= '2000-01-01'", $bd0);

do_time("test 2 - Sphinx", "SELECT * FROM employeesSalariesIndex ".
  "WHERE MATCH('@first_name Yucel') ".
  "AND salary>120000 ".
  "AND from_date_ts>={$ts1} limit 0,10000 ", $bd1);


function to_days($date) {
  return 719528 + floor(strtotime($date)/(60*60*24));
}


function do_time($description, $sentence, $bd) {
  $m0 = microtime(true);
  $res= @mysql_query($sentence, $bd);
  $m1 = microtime(true);
  $nr = mysql_num_rows($res);
  echo $description." ".$nr." rows in ".
    sprintf("%6.4f", $m1-$m0)." secs\n";
}

According to the specification in Listing 5, SphinxQL queries must be sent to port 9306. Because there is no PHP equivalent for the MySQL TO_DAYS() function, I wrote one of my own. Note that mktime works for UNIX_TIMESTAMP conversions. The do_time function executes and times a given query on a specified server.

SphinxQL queries differ from MySQL queries in the following ways:

  • You use MATCH to query text fields.
  • You have to convert all dates either to timestamps or to integers. This example uses both methods just for variety.
  • Sphinx can return full records or just ID fields, which is more efficient. Of course, if you use the latter option you must then use MySQL to retrieve the rest of the information.
  • The AND and OR operators have the same priority, so be careful and use parentheses where appropriate.
  • Not all MySQL numeric, string, and other functions are provided by Sphinx.

Running just a couple of tests isn't a thorough proof of concept, but the results shown in Listing 8 do suggest that the MySQL-to-Sphinx change may prove useful.

Listing 8. The results of the comparison between MySQL and Sphinx
~/bin/sphinx/etc> php test.php
FIRST TEST ... SINGLE TABLE

test 1 - MySQL  38 rows in 0.1912 secs
test 1 - Sphinx 38 rows in 0.0157 secs

SECOND TEST ... JOIN

test 2 - MySQL  5 rows in 0.1532 secs
test 2 - Sphinx 5 rows in 0.0020 secs

These results are good, but the examples so far only consider static searches, which assume constant tables. You still need to look to the problem of updating index files.

Updating your index files

What happens if the original data is updated? You must update your index files or searches will start producing the wrong results. You could re-index everything after each update, but that would probably be too costly! Sphinx provides two solutions: delta index files and live index updates.

Frequently, you have a large dataset with a small number of new records added every now and then. For this example, assume old records aren't modified after they are first written. You can have almost-real-time index updates by implementing a main+delta scheme. The idea is to have one index for the old, fixed data and another for the new data, which can be created quickly because of its smaller size. Then, all you need do is query both index files and use the union of both results. Consider using index merging to recreate the main index by merging a previous index and a delta index (see the Sphinx documentation in Resources for details).

What happens if the old data can be modified or if you truly need real-time updates? Sphinx's RT index files are the solution because they allow INSERT, REPLACE, and DELETE commands to affect the index files in real time. Whenever you update your main tables, you must be careful to also execute the corresponding index updates to ensure there are no differences between the MySQL and Sphinx data. To be safe, check out the caveats and restrictions in section 4.2 of the Sphinx documentation (see Resources).

Final evaluation

What conclusions can you draw? First, using Sphinx instead of MySQL can provide significant performance advantages. Sphinx is quite good for searching static tables. However, for tables that are frequently updated, you can't use the plain index files. Instead, you need to either implement delta files or change to real-time indexing, and both of these solutions carry an extra performance cost. Finally, using Sphinx efficiently requires some planning because you have to pre-define all required sources and index files— of course, this isn't a disadvantage, just common sense.

Substituting Sphinx for MySQL isn't trivial but also isn't so complicated as to preclude the option. If fast search speed is a requirement, it's worth contemplating a change from MySQL to Sphinx, even if you are not doing full-text searches.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source
ArticleID=769355
ArticleTitle=Better MySQL searches with Sphinx
publish-date=11082011