Better MySQL searches with Sphinx

Use Sphinx for non-full-text searching


MySQL is a good all-around database, but for search-heavy applications, you may fare better by looking at specific search utilities. This article considers Sphinx, a well-known full-text search package, as a MySQL replacement for searches, increasing speed even for non-full-text searches. The article studies the trade-offs and caveats involved in such a change, demonstrates some specific tests, and considers what is required to use Sphinx in a general way.

MySQL or Sphinx?

Why would you consider substituting Sphinx for MySQL? Consider, for instance, a bookseller's search application. A user might look for a book by specifying the title, author, status (new or used), edition (first or later), cover (hard or paperback), publisher, year of publication, whether it's signed by the author, price, and so on. MySQL typically doesn't use more than one index (an exception is index merge optimization; see Related topics), so the only way to optimize all possible searches is by having a prohibitive number of multi-column indices, which is not good.

However, Sphinx is a search engine that can integrate well with MySQL and work on its own in a stand-alone fashion. It provides high indexing and searching performance and allows queries by means of SphinxQL, a Structured Query Language (SQL)-based language. Finally, Sphinx scales up well, being able to work with billions of documents spanning terabytes of data with distributed searching capabilities.

Sphinx works with documents (which can be simply records in a database table or view), text fields (which it indexes, providing full-text searches), and attributes (non-text values that you can use for filtering, sorting, and grouping results). Attributes are kept in random-access memory (RAM) for efficiency; see the Sphinx documentation in Related topics for the actual size calculation formula.

To process queries, Sphinx depends on special index files. You have to define the data sources for the indexing procedure and then run the indexer program. Another possibility is using real-time index files, which you can update on the fly at the cost of somewhat lessened efficiency. I cover these in more detail later in the article.

A sample problem

To test Sphinx and MySQL search speeds, I looked for some interesting (in size) data sets and found a test database with about 3 million salary records, which seemed good enough for my purposes. (See Related topics to get this data.) The scheme is simple: you have employees with job titles who work in departments and earn annual salaries. This sample works with the two largest tables: employees and salaries.

After installing the data, I found I needed to fix a couple details for the salaries table. First, the to_date field sometimes contained 9999-01-01 as a special marker; I substituted 2038-01-01 instead because UNIX® timestamps can't go beyond 2038. In addition, Sphinx requires every record to have a single field ID key, so I had to add a salaries_id auto-increment field to the salaries table. See Listing 1.

Listing 1. Fixing some details in the salaries table
ALTER TABLE salaries
  ADD UNIQUE emp_from (emp_no, from_date) ;
ALTER TABLE salaries
UPDATE salaries
  SET to_date="2038-01-01" WHERE to_date="9999-01-01";

Now consider the queries shown in Listing 2.

Listing 2. A couple of simple tests to compare MySQL and Sphinx
  FROM employees.employees
  WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30'
    AND hire_date >= '1998-01-01';
38 rows in set (0.19 sec)

  FROM employees.employees eee JOIN employees.salaries sss
  ON sss.emp_no=eee.emp_no
  WHERE eee.first_name='Yucel'
    AND sss.salary>120000
    AND sss.from_date >= '2000-01-01';
5 rows in set (0.15 sec)

The first search involves only employees and looks for people born in September of 1960 and hired in or after 1998. MySQL finds 38 records in approximately 0.19 seconds. (Of course, your results may vary.) The second search joins employees and salaries to find male workers named Yucel who earned over 120,000 in 2000 or later. MySQL finds five records in about 0.15 seconds.

To run these searches with Sphinx, you must set it up, index the data, and run a daemon. The next section covers these tasks.

Configuring Sphinx

Sphinx indexes data sources according to definitions in the sphinx.conf file, shown in Listing 3. Note the type conversions for the date fields. This article skims the main required parameters, but you should be aware that there are many more configuration settings you can work with.

Listing 3. The source definitions for Sphinx
source employeesSource
  type = mysql
  sql_host = localhost
  sql_user = mysqluser
  sql_pass = mysqlpass
  sql_db = employees
  sql_query= SELECT \
    emp_no as id, \
    TO_DAYS(birth_date) AS birth_date_td, \
    first_name, \
    last_name, \
    gender, \
    TO_DAYS(hire_date) AS hire_date_td \
    FROM employees
  sql_attr_uint = birth_date_td
  sql_field_string = first_name
  sql_field_string = last_name
  sql_field_string = gender
  sql_attr_uint = hire_date_td

source employeesSalariesSource
  type = mysql
  sql_host = localhost
  sql_user = mysqluser
  sql_pass = mysqlpass
  sql_db = employees
  sql_query= SELECT \
    ss.salaries_id AS id, \
    ee.emp_no AS emp_no, \
    TO_DAYS(ee.birth_date) AS birth_date_td, \
    ee.first_name AS first_name, \
    ee.last_name AS last_name, \
    ee.gender AS gender, \
    TO_DAYS(ee.hire_date) AS hire_date_td, \
    ss.salary AS salary, \
    UNIX_TIMESTAMP(ss.from_date) AS from_date_ts, \
    UNIX_TIMESTAMP(ss.to_date) AS to_date_ts \
    FROM employees ee JOIN salaries ss \
    ON ss.emp_no=ee.emp_no
  sql_attr_uint = emp_no
  sql_attr_timestamp = birth_date_ts
  sql_field_string = first_name
  sql_field_string = last_name
  sql_field_string = gender
  sql_attr_uint = hire_date_td
  sql_attr_uint = salary
  sql_attr_timestamp = from_date_ts
  sql_attr_timestamp = to_date_ts

The configuration file is divided into stanzas. The source part defines data sources. Sphinx can work with many types of files, including text, Hypertext Markup Language (HTML), and Extensible Markup Language (XML). However, this example uses only MySQL, and thus type=mysql. The sql_host, sql_user, sql_pass, and sql_db parameters define how to access the database and which schema to use. The sql_query parameter provides the SQL sentence that retrieves the data to be indexed. Essentially, the only restriction here is that the first field must be a unique, unsigned positive integer ID number. This is why I had to add a salaries_id field to the salaries table earlier. You can specify up to 32 text fields and an arbitrary number of attributes. Sphinx full-text indexes all columns except the ID (the first field) and attributes.

As you can see in Table 1, Sphinx supports several types of attributes but not all the possible MySQL data types. There are also some Sphinx-specific attribute types, but they aren't applicable if you are just looking for a MySQL replacement, so I don't include those here.

Table 1. Sphinx allows only a few attribute types
sql_attr_uint and sql_attr_bigint32-bit unsigned and 64-bit signed integer values. You use these two types for all integer database fields and also possibly as a DATE stand in.
sql_attr_float32-bit floating point values. If you want to store geographic coordinates, you use this attribute type. Also note that if you require more precision there's no solution; fields are rounded to about seven decimal digits.
sql_attr_boolA Boolean (single bit) value, similar to MySQL's tinyint values.
sql_attr_timestampA UNIX timestamp that can represent date/time values from 1970-01-01 to 2038-01-19. You cannot directly use DATE or DATETIME column types in Sphinx. You have to convert them to timestamps with the UNIX_TIMESTAMP() function. If you just require dates, you can use the TO_DAYS() function to convert a DATE field into an integer.
sql_attr_string and sql_field_stringStrings (obviously!), but the former are only for retrieval, while the latter are indexed as full text.

The index part of the configuration files describes the attributes of the sources (Listing 4).

Listing 4. Index definitions for Sphinx
index employeesIndex
  type = plain
  source = employeesSource
  path = /home/fkereki/bin/sphinx/var/data/sphinxFilesEI
  charset_type = utf-8
  preopen = 1

index employeesSalariesIndex
  type = plain
  source = employeesSalariesSource
  path = /home/fkereki/bin/sphinx/var/data/sphinxFilesESI
  charset_type = utf-8
  preopen = 1

Sphinx uses index files that are separate from the ones used by MySQL. The type=plain line means you are using the standard Sphinx index files. Other possibilities are distributed (when you have the index files distributed over several nodes in your network) and rt (standing for real time). You can update these indexes on the fly. The source= line relates a data source and an index. You can actually merge several data sources together in a single index, but this example doesn't do that. The path= line defines the index file name and where it is stored. The charset_type= line specifies if you will be working with a Single Byte Character Set (sbcs) or Universal Character Set (UCS) Transformation Format-8 bit (utf-8). Finally, preopen=1 tells the search daemon to open all index files upon load rather than waiting for the first queries to arrive.

The last stanzas in the configuration file have to do with the indexer and searchd applications (Listing 5).

Listing 5. Indexer and search daemon parameters
  mem_limit = 1024M

  listen =
  log = /home/fkereki/bin/sphinx/var/log/searchd.log
  query_log = /home/fkereki/bin/sphinx/var/log/query.log
  pid_file = /home/fkereki/bin/sphinx/var/log/

You have to define a mem_limit RAM size to give indexer enough memory to work with. The searchd definitions are rather self-explanatory, except for listen=. You use this parameter to specify at which IP address and port the SphinxQL binary network protocol is available. If you want to interact with the protocol directly, you can use the standard MySQL client by entering mysql -P 9306 and run queries without even running the MySQL mysqld program on your machine.

Now that everything is set up, you can simply index the data and start running the search daemon (Listing 6). The --all parameter means all index files will be generated.

Listing 6. Indexing data and running the search daemon
~/bin/sphinx/etc> ../bin/indexer --all
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (

using config file './sphinx.conf'...
indexing index 'employeesSalariesIndex'...
collected 2844047 docs, 40.9 MB
sorted 8.5 Mhits, 100.0% done
total 2844047 docs, 40877736 bytes
total 24.842 sec, 1645464 bytes/sec, 114482.32 docs/sec
indexing index 'employeesIndex'...
collected 300024 docs, 4.3 MB
sorted 0.9 Mhits, 100.0% done
total 300024 docs, 4311224 bytes
total 0.914 sec, 4714945 bytes/sec, 328119.56 docs/sec
total 7 reads, 0.274 sec, 19695.2 kb/call avg, 39.2 msec/call avg
total 110 writes, 0.528 sec, 2351.0 kb/call avg, 4.8 msec/call avg

~/bin/sphinx/etc> ../bin/searchd
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (

using config file './sphinx.conf'...
listening on
precaching index 'employeesSalariesIndex'
precaching index 'employeesIndex'
precached 2 indexes in 0.124 sec

Now that your data is indexed and the required daemon is running, you're ready to run some searches.

Running searches

The preferred method for querying Sphinx is to use SphinxQL. Any language that works with MySQL also works with SphinxQL. This example uses PHP, and the code is shown in Listing7.

Listing 7. A PHP program to query MySQL and Sphinx, timing their performances
$bd0 = mysql_connect('localhost:3306', 'mysqluser','mysqlpass');
$bd1 = mysql_connect('localhost:9306');


$td1 = to_days("1960-09-01");
$td2 = to_days("1960-09-30");
$td3 = to_days("1998-01-01");

do_time("test 1 - MySQL ", "SELECT * FROM employees.employees ".
  "WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30' ".
  "AND hire_date >= '1998-01-01'", $bd0);

do_time("test 1 - Sphinx", "SELECT * FROM employeesIndex ".
  "WHERE birth_date_td BETWEEN {$td1} AND {$td2} ".
  "AND hire_date_td >= {$td3} LIMIT 0,10000 ", $bd1);

echo "\nSECOND TEST ... JOIN\n\n";

$ts1 = mktime(0,0,0,1,1,2000);

do_time("test 2 - MySQL ", "SELECT * ".
  "FROM employees.employees eee JOIN employees.salaries sss ".
  "ON sss.emp_no=eee.emp_no ".
  "WHERE eee.first_name='Yucel' ".
  "AND sss.salary>120000 ".
  "AND sss.from_date >= '2000-01-01'", $bd0);

do_time("test 2 - Sphinx", "SELECT * FROM employeesSalariesIndex ".
  "WHERE MATCH('@first_name Yucel') ".
  "AND salary>120000 ".
  "AND from_date_ts>={$ts1} limit 0,10000 ", $bd1);

function to_days($date) {
  return 719528 + floor(strtotime($date)/(60*60*24));

function do_time($description, $sentence, $bd) {
  $m0 = microtime(true);
  $res= @mysql_query($sentence, $bd);
  $m1 = microtime(true);
  $nr = mysql_num_rows($res);
  echo $description." ".$nr." rows in ".
    sprintf("%6.4f", $m1-$m0)." secs\n";

According to the specification in Listing 5, SphinxQL queries must be sent to port 9306. Because there is no PHP equivalent for the MySQL TO_DAYS() function, I wrote one of my own. Note that mktime works for UNIX_TIMESTAMP conversions. The do_time function executes and times a given query on a specified server.

SphinxQL queries differ from MySQL queries in the following ways:

  • You use MATCH to query text fields.
  • You have to convert all dates either to timestamps or to integers. This example uses both methods just for variety.
  • Sphinx can return full records or just ID fields, which is more efficient. Of course, if you use the latter option you must then use MySQL to retrieve the rest of the information.
  • The AND and OR operators have the same priority, so be careful and use parentheses where appropriate.
  • Not all MySQL numeric, string, and other functions are provided by Sphinx.

Running just a couple of tests isn't a thorough proof of concept, but the results shown in Listing 8 do suggest that the MySQL-to-Sphinx change may prove useful.

Listing 8. The results of the comparison between MySQL and Sphinx
~/bin/sphinx/etc> php test.php

test 1 - MySQL  38 rows in 0.1912 secs
test 1 - Sphinx 38 rows in 0.0157 secs


test 2 - MySQL  5 rows in 0.1532 secs
test 2 - Sphinx 5 rows in 0.0020 secs

These results are good, but the examples so far only consider static searches, which assume constant tables. You still need to look to the problem of updating index files.

Updating your index files

What happens if the original data is updated? You must update your index files or searches will start producing the wrong results. You could re-index everything after each update, but that would probably be too costly! Sphinx provides two solutions: delta index files and live index updates.

Frequently, you have a large dataset with a small number of new records added every now and then. For this example, assume old records aren't modified after they are first written. You can have almost-real-time index updates by implementing a main+delta scheme. The idea is to have one index for the old, fixed data and another for the new data, which can be created quickly because of its smaller size. Then, all you need do is query both index files and use the union of both results. Consider using index merging to recreate the main index by merging a previous index and a delta index (see the Sphinx documentation in Related topics for details).

What happens if the old data can be modified or if you truly need real-time updates? Sphinx's RT index files are the solution because they allow INSERT, REPLACE, and DELETE commands to affect the index files in real time. Whenever you update your main tables, you must be careful to also execute the corresponding index updates to ensure there are no differences between the MySQL and Sphinx data. To be safe, check out the caveats and restrictions in section 4.2 of the Sphinx documentation (see Related topics).

Final evaluation

What conclusions can you draw? First, using Sphinx instead of MySQL can provide significant performance advantages. Sphinx is quite good for searching static tables. However, for tables that are frequently updated, you can't use the plain index files. Instead, you need to either implement delta files or change to real-time indexing, and both of these solutions carry an extra performance cost. Finally, using Sphinx efficiently requires some planning because you have to pre-define all required sources and index files— of course, this isn't a disadvantage, just common sense.

Substituting Sphinx for MySQL isn't trivial but also isn't so complicated as to preclude the option. If fast search speed is a requirement, it's worth contemplating a change from MySQL to Sphinx, even if you are not doing full-text searches.

Downloadable resources

Related topics


Sign in or register to add and subscribe to comments.

Zone=Open source
ArticleTitle=Better MySQL searches with Sphinx