MySQL is a good all-around database, but for search-heavy applications, you may fare better by looking at specific search utilities. This article considers Sphinx, a well-known full-text search package, as a MySQL replacement for searches, increasing speed even for non-full-text searches. The article studies the trade-offs and caveats involved in such a change, demonstrates some specific tests, and considers what is required to use Sphinx in a general way.
Why would you consider substituting Sphinx for MySQL? Consider, for instance, a bookseller's search application. A user might look for a book by specifying the title, author, status (new or used), edition (first or later), cover (hard or paperback), publisher, year of publication, whether it's signed by the author, price, and so on. MySQL typically doesn't use more than one index (an exception is index merge optimization; see Resources), so the only way to optimize all possible searches is by having a prohibitive number of multi-column indices, which is not good.
However, Sphinx is a search engine that can integrate well with MySQL and work on its own in a stand-alone fashion. It provides high indexing and searching performance and allows queries by means of SphinxQL, a Structured Query Language (SQL)-based language. Finally, Sphinx scales up well, being able to work with billions of documents spanning terabytes of data with distributed searching capabilities.
Sphinx works with documents (which can be simply records in a database table or view), text fields (which it indexes, providing full-text searches), and attributes (non-text values that you can use for filtering, sorting, and grouping results). Attributes are kept in random-access memory (RAM) for efficiency; see the Sphinx documentation in Resources for the actual size calculation formula.
To process queries, Sphinx depends on special index files. You have to
define the data sources for the indexing procedure and then run the
indexer program. Another possibility is using
real-time index files, which you can update on the fly at the cost of
somewhat lessened efficiency. I cover these in more detail later in the
article.
To test Sphinx and MySQL search speeds, I looked for some interesting (in
size) data sets and found a test database with about 3 million salary
records, which seemed good enough for my purposes. (See Resources to get this data.) The scheme is
simple: you have employees with job titles who work in departments and
earn annual salaries. This sample works with the two largest tables:
employees and
salaries.
After installing the data, I found I needed to fix a couple details for the
salaries table. First, the
to_date field sometimes contained
9999-01-01 as a special marker; I substituted
2038-01-01 instead because UNIX®
timestamps can't go beyond 2038. In addition, Sphinx requires every record
to have a single field ID key, so I had to add a
salaries_id auto-increment field to the
salaries table. See Listing
1.
Listing 1. Fixing some details in the
salaries table
ALTER TABLE salaries
DROP PRIMARY KEY,
ADD UNIQUE emp_from (emp_no, from_date) ;
ALTER TABLE salaries
ADD salaries_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST ;
UPDATE salaries
SET to_date="2038-01-01" WHERE to_date="9999-01-01";
|
Now consider the queries shown in Listing 2.
Listing 2. A couple of simple tests to compare MySQL and Sphinx
SELECT *
FROM employees.employees
WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30'
AND hire_date >= '1998-01-01';
38 rows in set (0.19 sec)
SELECT *
FROM employees.employees eee JOIN employees.salaries sss
ON sss.emp_no=eee.emp_no
WHERE eee.first_name='Yucel'
AND sss.salary>120000
AND sss.from_date >= '2000-01-01';
5 rows in set (0.15 sec)
|
The first search involves only employees and
looks for people born in September of 1960 and hired in or after 1998.
MySQL finds 38 records in approximately 0.19 seconds. (Of course, your
results may vary.) The second search joins
employees and
salaries to find male workers named Yucel who
earned over 120,000 in 2000 or later. MySQL finds five records in about
0.15 seconds.
To run these searches with Sphinx, you must set it up, index the data, and run a daemon. The next section covers these tasks.
Sphinx indexes data sources according to definitions in the
sphinx.conf file, shown in Listing 3. Note the type conversions for the date fields. This
article skims the main required parameters, but you should be aware that
there are many more configuration settings you can work with.
Listing 3. The source definitions for Sphinx
source employeesSource
{
type = mysql
sql_host = localhost
sql_user = mysqluser
sql_pass = mysqlpass
sql_db = employees
sql_query= SELECT \
emp_no as id, \
TO_DAYS(birth_date) AS birth_date_td, \
first_name, \
last_name, \
gender, \
TO_DAYS(hire_date) AS hire_date_td \
FROM employees
sql_attr_uint = birth_date_td
sql_field_string = first_name
sql_field_string = last_name
sql_field_string = gender
sql_attr_uint = hire_date_td
}
source employeesSalariesSource
{
type = mysql
sql_host = localhost
sql_user = mysqluser
sql_pass = mysqlpass
sql_db = employees
sql_query= SELECT \
ss.salaries_id AS id, \
ee.emp_no AS emp_no, \
TO_DAYS(ee.birth_date) AS birth_date_td, \
ee.first_name AS first_name, \
ee.last_name AS last_name, \
ee.gender AS gender, \
TO_DAYS(ee.hire_date) AS hire_date_td, \
ss.salary AS salary, \
UNIX_TIMESTAMP(ss.from_date) AS from_date_ts, \
UNIX_TIMESTAMP(ss.to_date) AS to_date_ts \
FROM employees ee JOIN salaries ss \
ON ss.emp_no=ee.emp_no
sql_attr_uint = emp_no
sql_attr_timestamp = birth_date_ts
sql_field_string = first_name
sql_field_string = last_name
sql_field_string = gender
sql_attr_uint = hire_date_td
sql_attr_uint = salary
sql_attr_timestamp = from_date_ts
sql_attr_timestamp = to_date_ts
}
|
The configuration file is divided into stanzas. The source part defines
data sources. Sphinx can work with many types of files, including text,
Hypertext Markup Language (HTML), and Extensible Markup Language (XML).
However, this example uses only MySQL, and thus
type=mysql. The
sql_host, sql_user,
sql_pass, and sql_db
parameters define how to access the database and which schema to use. The
sql_query parameter provides the SQL sentence
that retrieves the data to be indexed. Essentially, the only restriction
here is that the first field must be a unique, unsigned positive integer
ID number. This is why I had to add a
salaries_id field to the
salaries table earlier. You can specify up to
32 text fields and an arbitrary number of attributes. Sphinx full-text
indexes all columns except the ID (the first field) and attributes.
As you can see in Table 1, Sphinx supports several types of attributes but not all the possible MySQL data types. There are also some Sphinx-specific attribute types, but they aren't applicable if you are just looking for a MySQL replacement, so I don't include those here.
Table 1. Sphinx allows only a few attribute types
| Attribute | Description |
|---|---|
sql_attr_uint and
sql_attr_bigint | 32-bit unsigned and 64-bit signed integer values. You use these
two types for all integer database fields and also possibly as a
DATE stand in. |
sql_attr_float | 32-bit floating point values. If you want to store geographic coordinates, you use this attribute type. Also note that if you require more precision there's no solution; fields are rounded to about seven decimal digits. |
sql_attr_bool | A Boolean (single bit) value, similar to MySQL's
tinyint values. |
sql_attr_timestamp | A UNIX timestamp that can represent date/time values from
1970-01-01 to 2038-01-19. You cannot directly use
DATE or
DATETIME column types in Sphinx. You
have to convert them to timestamps with the
UNIX_TIMESTAMP() function. If you just
require dates, you can use the
TO_DAYS() function to convert a
DATE field into an integer. |
sql_attr_string and
sql_field_string | Strings (obviously!), but the former are only for retrieval, while the latter are indexed as full text. |
The index part of the configuration files describes the attributes of the sources (Listing 4).
Listing 4. Index definitions for Sphinx
index employeesIndex
{
type = plain
source = employeesSource
path = /home/fkereki/bin/sphinx/var/data/sphinxFilesEI
charset_type = utf-8
preopen = 1
}
index employeesSalariesIndex
{
type = plain
source = employeesSalariesSource
path = /home/fkereki/bin/sphinx/var/data/sphinxFilesESI
charset_type = utf-8
preopen = 1
}
|
Sphinx uses index files that are separate from the ones used by MySQL. The
type=plain line means you are using the
standard Sphinx index files. Other possibilities are
distributed (when you have the index files
distributed over several nodes in your network) and
rt (standing for real time). You can
update these indexes on the fly. The source=
line relates a data source and an index. You can actually merge several
data sources together in a single index, but this example doesn't do that.
The path= line defines the index file name and
where it is stored. The charset_type= line
specifies if you will be working with a Single Byte Character Set
(sbcs) or Universal Character Set (UCS)
Transformation Format-8 bit (utf-8). Finally,
preopen=1 tells the search daemon to open all
index files upon load rather than waiting for the first queries to arrive.
The last stanzas in the configuration file have to do with the
indexer and searchd
applications (Listing 5).
Listing 5. Indexer and search daemon parameters
indexer
{
mem_limit = 1024M
}
searchd
{
listen = 127.0.0.1:9306:mysql41
log = /home/fkereki/bin/sphinx/var/log/searchd.log
query_log = /home/fkereki/bin/sphinx/var/log/query.log
pid_file = /home/fkereki/bin/sphinx/var/log/searchd.pid
}
|
You have to define a mem_limit RAM size to give
indexer enough memory to work with. The
searchd definitions are rather
self-explanatory, except for listen=. You use
this parameter to specify at which IP address and port the SphinxQL binary
network protocol is available. If you want to interact with the protocol
directly, you can use the standard MySQL client by entering
mysql -P 9306 and run queries without even
running the MySQL mysqld program on your
machine.
Now that everything is set up, you can simply index the data and start
running the search daemon (Listing 6). The
--all parameter means all index files will be
generated.
Listing 6. Indexing data and running the search daemon
~/bin/sphinx/etc> ../bin/indexer --all
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
indexing index 'employeesSalariesIndex'...
collected 2844047 docs, 40.9 MB
sorted 8.5 Mhits, 100.0% done
total 2844047 docs, 40877736 bytes
total 24.842 sec, 1645464 bytes/sec, 114482.32 docs/sec
indexing index 'employeesIndex'...
collected 300024 docs, 4.3 MB
sorted 0.9 Mhits, 100.0% done
total 300024 docs, 4311224 bytes
total 0.914 sec, 4714945 bytes/sec, 328119.56 docs/sec
total 7 reads, 0.274 sec, 19695.2 kb/call avg, 39.2 msec/call avg
total 110 writes, 0.528 sec, 2351.0 kb/call avg, 4.8 msec/call avg
~/bin/sphinx/etc> ../bin/searchd
Sphinx 2.0.1-beta (r2792)
Copyright (c) 2001-2011, Andrew Aksyonoff
Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file './sphinx.conf'...
listening on 127.0.0.1:9306
precaching index 'employeesSalariesIndex'
precaching index 'employeesIndex'
precached 2 indexes in 0.124 sec
|
Now that your data is indexed and the required daemon is running, you're ready to run some searches.
The preferred method for querying Sphinx is to use SphinxQL. Any language that works with MySQL also works with SphinxQL. This example uses PHP, and the code is shown in Listing7.
Listing 7. A PHP program to query MySQL and Sphinx, timing their performances
$bd0 = mysql_connect('localhost:3306', 'mysqluser','mysqlpass');
$bd1 = mysql_connect('localhost:9306');
echo "FIRST TEST ... SINGLE TABLE\n\n";
$td1 = to_days("1960-09-01");
$td2 = to_days("1960-09-30");
$td3 = to_days("1998-01-01");
do_time("test 1 - MySQL ", "SELECT * FROM employees.employees ".
"WHERE birth_date BETWEEN '1960-09-01' AND '1960-09-30' ".
"AND hire_date >= '1998-01-01'", $bd0);
do_time("test 1 - Sphinx", "SELECT * FROM employeesIndex ".
"WHERE birth_date_td BETWEEN {$td1} AND {$td2} ".
"AND hire_date_td >= {$td3} LIMIT 0,10000 ", $bd1);
echo "\nSECOND TEST ... JOIN\n\n";
$ts1 = mktime(0,0,0,1,1,2000);
do_time("test 2 - MySQL ", "SELECT * ".
"FROM employees.employees eee JOIN employees.salaries sss ".
"ON sss.emp_no=eee.emp_no ".
"WHERE eee.first_name='Yucel' ".
"AND sss.salary>120000 ".
"AND sss.from_date >= '2000-01-01'", $bd0);
do_time("test 2 - Sphinx", "SELECT * FROM employeesSalariesIndex ".
"WHERE MATCH('@first_name Yucel') ".
"AND salary>120000 ".
"AND from_date_ts>={$ts1} limit 0,10000 ", $bd1);
function to_days($date) {
return 719528 + floor(strtotime($date)/(60*60*24));
}
function do_time($description, $sentence, $bd) {
$m0 = microtime(true);
$res= @mysql_query($sentence, $bd);
$m1 = microtime(true);
$nr = mysql_num_rows($res);
echo $description." ".$nr." rows in ".
sprintf("%6.4f", $m1-$m0)." secs\n";
}
|
According to the specification in Listing 5,
SphinxQL queries must be sent to port 9306. Because there is no PHP
equivalent for the MySQL TO_DAYS() function, I
wrote one of my own. Note that mktime works for
UNIX_TIMESTAMP conversions. The do_time
function executes and times a given query on a specified server.
SphinxQL queries differ from MySQL queries in the following ways:
- You use
MATCHto query text fields. - You have to convert all dates either to timestamps or to integers. This example uses both methods just for variety.
- Sphinx can return full records or just ID fields, which is more efficient. Of course, if you use the latter option you must then use MySQL to retrieve the rest of the information.
- The
ANDandORoperators have the same priority, so be careful and use parentheses where appropriate. - Not all MySQL numeric, string, and other functions are provided by Sphinx.
Running just a couple of tests isn't a thorough proof of concept, but the results shown in Listing 8 do suggest that the MySQL-to-Sphinx change may prove useful.
Listing 8. The results of the comparison between MySQL and Sphinx
~/bin/sphinx/etc> php test.php
FIRST TEST ... SINGLE TABLE
test 1 - MySQL 38 rows in 0.1912 secs
test 1 - Sphinx 38 rows in 0.0157 secs
SECOND TEST ... JOIN
test 2 - MySQL 5 rows in 0.1532 secs
test 2 - Sphinx 5 rows in 0.0020 secs
|
These results are good, but the examples so far only consider static searches, which assume constant tables. You still need to look to the problem of updating index files.
What happens if the original data is updated? You must update your index files or searches will start producing the wrong results. You could re-index everything after each update, but that would probably be too costly! Sphinx provides two solutions: delta index files and live index updates.
Frequently, you have a large dataset with a small number of new records added every now and then. For this example, assume old records aren't modified after they are first written. You can have almost-real-time index updates by implementing a main+delta scheme. The idea is to have one index for the old, fixed data and another for the new data, which can be created quickly because of its smaller size. Then, all you need do is query both index files and use the union of both results. Consider using index merging to recreate the main index by merging a previous index and a delta index (see the Sphinx documentation in Resources for details).
What happens if the old data can be modified or if you truly need real-time
updates? Sphinx's RT index files are the solution because they allow
INSERT, REPLACE, and
DELETE commands to affect the index files in
real time. Whenever you update your main tables, you must be careful to
also execute the corresponding index updates to ensure there are no
differences between the MySQL and Sphinx data. To be safe, check out the
caveats and restrictions in section 4.2 of the Sphinx documentation (see
Resources).
What conclusions can you draw? First, using Sphinx instead of MySQL can
provide significant performance advantages. Sphinx is quite good for
searching static tables. However, for tables that are frequently updated,
you can't use the plain index files. Instead,
you need to either implement delta files or change to real-time indexing,
and both of these solutions carry an extra performance cost. Finally,
using Sphinx efficiently requires some planning because you have to
pre-define all required sources and index files— of course,
this isn't a disadvantage, just common sense.
Substituting Sphinx for MySQL isn't trivial but also isn't so complicated as to preclude the option. If fast search speed is a requirement, it's worth contemplating a change from MySQL to Sphinx, even if you are not doing full-text searches.
Learn
-
Sphinx documentation: Learn
more about its features.
- Build a custom search engine with
PHP (developerWorks, July 2007): Explore more about full-text
searches.
-
MySQL: Index merge
optimization: Find more about this method.
- developerWorks on
Twitter: Follow us for the latest news.
- developerWorks
Open source zone: Find extensive how-to information, tools, and
project updates to help you develop with open source technologies and use
them with IBM products.
- Events of interest: Check out upcoming conferences, trade shows,
and webcasts that are of interest to IBM open source
developers.
- developerWorks
podcasts: Tune into interesting interviews and discussions for
software developers
- developerWorks On demand demos: Watch our no-cost demos and learn
about IBM and open source technologies and product functions.
Get products and technologies
-
Sphinx: Get the latest version
and try out the code in this article.
- The
sample database used in this article: Download it from The Data Charmer website.
- Evaluate IBM
software products: From trial downloads to cloud-hosted products,
you can innovate your next open source development project using software
especially for developers.
Discuss
- developerWorks
community: Connect with other developerWorks users while exploring
the developer-driven blogs, forums, groups, and wikis. Help build the Real world open source group in the developerWorks
community.
Federico Kereki is a Uruguayan systems engineer with more than 20 years of experience developing systems, doing consulting work, and teaching at universities. He has been working with open source software for over 10 years and particularly appreciates the greater security of Linux. He recently wrote Essential GWT, a book about this open source tool.



