In "Build a custom search engine with PHP," I combined PHP and the open source Sphinx search engine to create a blazing-fast alternative to text-intensive database queries, such as LIKE and, in the case of MySQL, MATCH. (See Resources for Sphinx-related information.)
Sphinx is easy to install and maintain, and is quite capable. Moreover, recent releases of Sphinx now provide a native MySQL engine, eliminating the need to run a separate Sphinx daemon. V0.9.8 (the most recent release as of this writing) also added geodistance queries, which find records within a given distance of a location, and a feature named multi-query, an optimization that bundles multiple queries and sets of results in a single network connection.
Sphinx continues to improve with time and is ideal for shopping sites, blogs, and many other applications. According to the Sphinx site, one application now indexes 700 million documents, or roughly 1.2 terabytes of data. I recommend Sphinx without hesitation.
However, Sphinx does not yet support several features you might like to employ and offer as your application or site becomes popular and usage increases. In particular, Sphinx does not yet automatically replicate or distribute its indices, making its daemon a single point of failure. (As a workaround, several machines can index the same database, and you can cluster those systems.) Sphinx does not highlight search results (like Google does when it displays cached pages), does not retain or cache recent results, and does not support regular expression (regex) or date-based operations.
If you seek those features or are ready for an enterprise-grade solution, consider the Apache Software Foundation's Solr project. Based on the Lucene search engine and provided as open source under the terms of the liberal Apache Software License, Solr is (according to the Lucene site) "an open source enterprise search server based on the Lucene Java™ search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a Web administration interface."
Among other notable, highly trafficked Web sites, Netflix, Digg, and CNET's News.com and CNET Reviews use Solr to power search. A lengthy list of public Solr-powered sites can be found in the Solr wiki (see Resources).
Learn how to use Solr and PHP to create a small application to search a database of automobile parts. While the example database contains only a handful of records, it could just as easily include millions. All the source code used in this article is available from the Download section.
To combine Solr with PHP, you must install Solr, design an index, prepare your data to be indexed by Solr, load the index, write PHP code to execute queries, and present results. Much of the work required to create a searchable index can be performed from the command line. Of course, PHP's programmatic interface to Solr can also affect the contents of an index.
Solr is implemented in Java technology. To run Solr and its administrative tools, you must install a Java V1.5 software development kit (Java 5 SDK). Several vendors provide a Java V1.5 SDK (for example, Sun Microsystems, IBM®, and BEA Systems), and each implementation is capable of powering Solr. Simply choose the Java package suited for your operating system and follow the appropriate instructions to complete the installation.
In many cases, the installation of Java V1.5 is as simple as running a self-extracting archive and accepting the terms of a license agreement. A script in the archive does all the heavy lifting in a matter of seconds. Other operating systems, such as Debian, provide the Java 5 SDK in the APT repository. For example, if you use Debian or Ubuntu, you can install the Java V1.5 software with sudo apt-get install sun-java5-jdk.
Conveniently, APT also downloads all the dependencies required to use the Java 5 SDK automatically.
If the Java software is already installed and the Java executable file is in your PATH, run java -version to determine which version of Java you have.
Here, let's use the Mac OS X V10.5 Leopard operating system as the basis of the
demonstration. Apple's Leopard includes Java V1.5. With a small change to the default
Apache configuration, Leopard runs PHP applications, too. Running java -version in a Leopard terminal window produces the following.
Listing 1. Run java -version in a Leopard terminal window
$ which java
/usr/bin/java
$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)
Note: Leopard allows you to switch between Java V1.4 and V1.5 in the Java Preferences application in /Applications/Utilities/Java. If your installation of Leopard reports V1.4, open Java Preferences and change the settings to resemble Figure 1.
Figure 1. Java Preferences application in Leopard
To install Solr, visit Apache.org, click Resources > Download, select a convenient project mirror, and navigate within the folders shown to pick a tarball (a .tgz file) of Solr V1.2. The download transfers a file named something akin to apache-solr-1.2.0.tgz. Unpack the tarball with the following code.
Listing 2. Unpack tarball
$ tar xzf apache-solr-1.2.0.tgz
$ ls -F apache-solr-1.2.0
CHANGES.txt NOTICE.txt dist/ lib/
KEYS.txt README.txt docs/ src/
LICENSE.txt build.xml example/
In the newly created directory, the folder named dist contains the Solr code bundled as a Java archive (JAR). The subdirectory example/exampledocs contains examples of data that's formatted — typically as XML code — and ready for Solr to index.
The example directory contains a complete sample Solr application. To run it, simply launch the Java engine with the application archive: start.jar.
Listing 3. Launch Java engine
$ java -jar start.jar
2007-11-10 15:00:16.672::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
2007-11-10 15:00:16.866::INFO: jetty-6.1.3
...
INFO: SolrUpdateServlet.init() done
2007-11-10 15:00:18.694::INFO: Started SocketConnector @ 0.0.0.0:8983
The application is now available on port 8983. Start your browser and type
http://localhost:8983/solr/admin/ in the address bar. This is
the interface for administering Solr. (To stop the Solr server, use Ctrl+C at the command line.)
But there's no data in the Solr index to manage or query — yet.
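To confirm from PHP that the server is answering, you can request the administration URL and inspect the HTTP status line. The following is only a minimal sketch: the helper names solrUrl() and isSolrUp() are hypothetical (they are not part of Solr or of any client library), and the host, port, and path are assumed to be the example server's defaults described above.

```php
<?php
// Build a URL for a path under the Solr web application.
// The defaults are assumptions taken from the example server above.
function solrUrl($path, $host = 'localhost', $port = 8983, $base = '/solr')
{
    return sprintf(
        'http://%s:%d%s/%s',
        $host,
        $port,
        rtrim($base, '/'),
        ltrim($path, '/')
    );
}

// Return true if the URL answers with an HTTP 200 status line.
function isSolrUp($url, $timeoutSeconds = 2)
{
    $context = stream_context_create(array(
        'http' => array('timeout' => $timeoutSeconds),
    ));
    // Suppress the warning PHP raises when the connection is refused.
    $body = @file_get_contents($url, false, $context);
    if ($body === false || empty($http_response_header)) {
        return false;
    }
    // The first header line looks like "HTTP/1.1 200 OK".
    return (bool) preg_match('#^HTTP/\S+\s+200\b#', $http_response_header[0]);
}

$adminUrl = solrUrl('admin/');
echo $adminUrl, ': ', isSolrUp($adminUrl) ? 'up' : 'not responding', "\n";
```

If the server is stopped, the script reports "not responding" rather than raising an error, which makes it convenient as a pre-flight check before indexing.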
Solr is remarkably flexible out of the box, supporting a variety of data types and rules to create effective indices. If the broad set of standard components does not suffice, you can further customize Solr by writing new Java classes.
Given a set of data types and rules, you can then create a Solr schema to describe your data and control how the indices should be constructed. You then export your data to match the schema and load the data into Solr. Solr creates the indices on the fly, updating each index immediately as records are created, modified, or deleted.
The default Solr schema can be found at Apache.org as part of the Solr source code repository. For reference, a snippet of the default schema is shown below.
Listing 3. Default Solr schema snippet
<schema name="example" version="1.1">
...
<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true"/>
<field name="nameSort" type="string" indexed="true" stored="false"/>
<field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
...
</fields>
<uniqueKey>id</uniqueKey>
...
<copyField source="name" dest="nameSort"/>
...
</schema>
Much of the schema is self-explanatory, but some aspects warrant clarification:
- As shown, the field id is a string (type="string") and should be indexed (indexed="true"). It is also a required field (required="true"). Using this schema, every record loaded in Solr must have a value for this field. The <uniqueKey>id</uniqueKey> modifier further declares that the id field must be unique. (Solr does not require a unique ID field; this is merely a rule established in the default index schema.) The attribute stored="true" indicates that the id field should be retrievable.
  Why would you ever set stored to false? You can use a nonretrievable field to order results differently, as in the case of nameSort, which is a copy of the name field (because of the copyField command on the last line), but has different behaviors. Notice that nameSort is a string, while name is text. The default index schema treats those two types slightly differently.
- The field cat is multiValued. A record may define several values for this field. For instance, if your application manages content, a story may be assigned several topics. You could use the cat field (or define a similar field of your own) to capture all the topics.
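One way to check how a schema treats each field is to read the schema with PHP's SimpleXML extension and print the attributes discussed above. This is an illustrative sketch only: the schema fragment is copied by hand from the snippet shown earlier, and describeFields() is a hypothetical helper, not something Solr provides.

```php
<?php
// A hand-copied fragment of the default schema snippet shown above.
$schemaXml = <<<XML
<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="nameSort" type="string" indexed="true" stored="false"/>
    <field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <copyField source="name" dest="nameSort"/>
</schema>
XML;

// Summarize each field: its type plus the flags discussed above.
// An attribute that is absent from the XML is reported as false here.
function describeFields($xml)
{
    $schema  = simplexml_load_string($xml);
    $summary = array();
    foreach ($schema->fields->field as $field) {
        $summary[(string) $field['name']] = array(
            'type'        => (string) $field['type'],
            'stored'      => (string) $field['stored'] === 'true',
            'multiValued' => (string) $field['multiValued'] === 'true',
            'required'    => (string) $field['required'] === 'true',
        );
    }
    return $summary;
}

foreach (describeFields($schemaXml) as $name => $info) {
    printf(
        "%-9s type=%-7s stored=%s multiValued=%s\n",
        $name,
        $info['type'],
        var_export($info['stored'], true),
        var_export($info['multiValued'], true)
    );
}
```

The output makes the difference between name and nameSort visible at a glance: both are indexed, but only name is stored and therefore retrievable.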
Listing 4 shows the file example/exampledocs/ipod_other.xml, which represents two entries in a catalog of iPod accessories.
Listing 4. Data formatted for the default Solr index schema
<add>
<doc>
<field name="id">F8V7067-APL-KIT</field>
<field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter, white</field>
<field name="weight">4</field>
<field name="price">19.95</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
<doc>
<field name="id">IW-02</field>
<field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
<field name="manu">Belkin</field>
<field name="cat">electronics</field>
<field name="cat">connector</field>
<field name="features">car power adapter for iPod, white</field>
<field name="weight">2</field>
<field name="price">11.50</field>
<field name="popularity">1</field>
<field name="inStock">false</field>
</doc>
</add>
The add element is a Solr command to add the enveloped
records to the index. Each record is captured in a doc
element, which uses a series of named field elements to
specify field values. The fields weight, price, inStock, manu, features, and popularity are other fields defined in the default Solr index
schema. The features field has identical attributes to cat, but has a different semantic meaning: It enumerates the
(potentially many) capabilities of a product.
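When records come from a database rather than a hand-written file, the same add/doc/field structure can be generated programmatically. The sketch below uses PHP's DOM extension; buildAddXml() and solrFieldText() are hypothetical helper names, and the sample record is a shortened copy of one entry from Listing 4. Writing values as text nodes means characters such as & are escaped automatically.

```php
<?php
// Render a PHP value as the text of a Solr field.
// Booleans need special care: (string) false is an empty string in PHP.
function solrFieldText($value)
{
    if (is_bool($value)) {
        return $value ? 'true' : 'false';
    }
    return (string) $value;
}

// Convert an array of records into a Solr <add> message.
// An array value yields one <field> element per entry (a multivalued field).
function buildAddXml(array $records)
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->formatOutput = true;
    $add = $dom->appendChild($dom->createElement('add'));
    foreach ($records as $record) {
        $doc = $add->appendChild($dom->createElement('doc'));
        foreach ($record as $name => $values) {
            foreach ((array) $values as $value) {
                $field = $doc->appendChild($dom->createElement('field'));
                $field->setAttribute('name', $name);
                // A text node escapes &, <, and > when the XML is saved.
                $field->appendChild(
                    $dom->createTextNode(solrFieldText($value))
                );
            }
        }
    }
    return $dom->saveXML();
}

echo buildAddXml(array(
    array(
        'id'      => 'IW-02',
        'name'    => 'iPod & iPod Mini USB 2.0 Cable',
        'cat'     => array('electronics', 'connector'),
        'inStock' => false,
    ),
));
```

The resulting string can be saved to a file and loaded exactly like the files in example/exampledocs.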
This example indexes a collection of auto parts. Each auto part has several fields, with a sample of the most important fields shown in Table 1. The name of the field is listed in the first column. The second column provides a brief description, while the third column lists its logical type. The fourth column shows the name of the Solr field (as defined in the schema in Listing 5) used to represent the datum.
Table 1. The fields of an auto part record
| Name | Description | Type | Solr field |
|---|---|---|---|
| Part number (unique, mandatory) | An identifying number | String | partno |
| Name | A concise description | String | name |
| Model (required, multi-value) | A model, such as "Camaro" | String | model |
| Model year (multi-value) | A model year, such as 2001 | String | year |
| Price | Cost per unit | Float | price |
| In stock | Within inventory or not | Boolean | inStock |
| Features | Capabilities of part | String | features |
| Timestamp | Record of activity | String | timestamp |
| Weight | Shipping weight | Float | weight |
Listing 5 shows a portion of the Solr schema used for the auto parts index. It's largely based on the default Solr schema. The specific fields used (the names and attributes) simply replaced the fields element found in the default schema (shown earlier in the default Solr schema snippet).
Listing 5. The auto parts index schema
<?xml version="1.0" encoding="utf-8" ?>
<schema name="autoparts" version="1.0">
...
<fields>
<field name="partno" type="string" indexed="true"
stored="true" required="true" />
<field name="name" type="text" indexed="true"
stored="true" required="true" />
<field name="model" type="text_ws" indexed="true" stored="true"
multiValued="true" required="true" />
<field name="year" type="text_ws" indexed="true" stored="true"
multiValued="true" omitNorms="true" />
<field name="price" type="sfloat" indexed="true"
stored="true" required="true" />
<field name="inStock" type="boolean" indexed="true"
stored="true" default="false" />
<field name="features" type="text" indexed="true"
stored="true" multiValued="true" />
<field name="timestamp" type="date" indexed="true"
stored="true" default="NOW" multiValued="false" />
<field name="weight" type="sfloat" indexed="true" stored="true" />
</fields>
<uniqueKey>partno</uniqueKey>
<defaultSearchField>name</defaultSearchField>
</schema>
Given the fields above, a database of auto parts exported and formatted for uploading into Solr might look like Listing 6.
Listing 6. A database of auto parts formatted for indexing
<add>
<doc>
<field name="partno">1</field>
<field name="name">Spark plug</field>
<field name="model">Boxster</field>
<field name="model">924</field>
<field name="year">1999</field>
<field name="year">2000</field>
<field name="price">25.00</field>
<field name="inStock">true</field>
</doc>
<doc>
<field name="partno">2</field>
<field name="name">Windshield</field>
<field name="model">911</field>
<field name="year">1991</field>
<field name="year">1999</field>
<field name="price">15.00</field>
<field name="inStock">false</field>
</doc>
</add>
Let's install the new index schema and load the data into Solr. First, stop the Solr daemon (if it's still running) by using Ctrl+C. Make a backup copy of the existing Solr schema in example/solr/conf/schema.xml. Next, create a text file from Listing 5, save it to /tmp/schema.xml, and copy it to example/solr/conf/schema.xml. Create another file, /tmp/parts.xml, for the data shown in Listing 6. Now, you can start Solr again and use the posting utility provided with the example.
Listing 7. Launching Solr with a new schema
$ cd apache-solr-1.2.0/example
$ cp solr/conf/schema.xml solr/conf/default_schema.xml
$ chmod a-w solr/conf/default_schema.xml
$ vi /tmp/schema.xml
...
$ cp /tmp/schema.xml solr/conf/schema.xml
$ vi /tmp/parts.xml
...
$ java -jar start.jar
...
2007-11-11 16:56:48.279::INFO: Started SocketConnector @ 0.0.0.0:8983
$ java -jar exampledocs/post.jar /tmp/parts.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,
other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update...
SimplePostTool: POSTing file parts.xml
SimplePostTool: COMMITting Solr index changes...
Success! If you want to verify that the index exists and contains two documents, point your browser again to http://localhost:8983/solr/admin/. You should see "(autoparts)" at the top of the page. If so, click the query box at midpage and type partno: 1 OR partno: 2. (The OR operator must be uppercase.)
Your result should resemble this (shown here without its XML markup: the response header and query parameters first, then one line per matching document):
3 on 10 0 partno: 1 OR partno: 2 2.2
true Boxster 924 Spark plug 1 25.0 2007-11-11T21:58:45.899Z 1999 2000
false 911 Windshield 2 15.0 2007-11-11T21:58:45.953Z 1991 1999
Try some other queries. Solr accepts the Lucene query syntax.
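The same queries can be issued without the administration page by requesting Solr's select handler directly. The sketch below only builds the request URLs; buildSelectUrl() is a hypothetical helper, the base URL is assumed to be the example server's default, and q, start, and rows are the standard query parameters. http_build_query() takes care of URL-encoding colons and spaces.

```php
<?php
// Build a URL for Solr's select handler.
// $base is an assumption: the example server started earlier.
function buildSelectUrl($query, $start = 0, $rows = 10, $base = 'http://localhost:8983/solr')
{
    $params = array(
        'q'     => $query,  // Lucene query string
        'start' => $start,  // offset of the first result
        'rows'  => $rows,   // maximum number of results
    );
    // Pass '&' explicitly so the result does not depend on php.ini settings.
    return $base . '/select?' . http_build_query($params, '', '&');
}

$queries = array(
    'partno:1 OR partno:2',
    'model:Boxster',
    'name:plug',
);

foreach ($queries as $query) {
    echo buildSelectUrl($query), "\n";
}
```

Paste any of the printed URLs into a browser while Solr is running to see the raw XML response for that query.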
You should also try editing and loading the data again. Because the partno field is declared unique, repeated upload operations of the
same part number merely replace the old index record with a new record. In addition to
the add command, you can use commit, optimize, and delete. The last command can delete a specific record by ID or many
records through a query.
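Like add, these commands are small XML messages sent to the update URL. As a rough sketch, the helpers below build a delete-by-ID message, a delete-by-query message, and the commit and optimize messages. The function names are hypothetical, and postUpdate() assumes the update URL reported by the posting utility in Listing 7; it is defined here for illustration but not called.

```php
<?php
// Delete one document by its unique key (partno in the auto parts schema).
function deleteByIdXml($id)
{
    return '<delete><id>'
        . htmlspecialchars((string) $id, ENT_NOQUOTES, 'UTF-8')
        . '</id></delete>';
}

// Delete every document that matches a Lucene query.
function deleteByQueryXml($query)
{
    return '<delete><query>'
        . htmlspecialchars($query, ENT_NOQUOTES, 'UTF-8')
        . '</query></delete>';
}

// POST one message to the update handler and return the response body,
// or false on failure. The default URL is an assumption (example server).
function postUpdate($xml, $url = 'http://localhost:8983/solr/update')
{
    $context = stream_context_create(array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: text/xml; charset=utf-8\r\n",
        'content' => $xml,
    )));
    return @file_get_contents($url, false, $context);
}

$messages = array(
    deleteByIdXml(2),               // remove the windshield record
    deleteByQueryXml('model:924'),  // remove every part for one model
    '<commit/>',                    // make the changes visible to searches
    '<optimize/>',                  // merge index segments
);

foreach ($messages as $message) {
    echo $message, "\n";
}
```

Deletions, like additions, take effect for searchers only after a commit, which is why the commit message follows the two delete messages.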
Finally, PHP enters the example.
There are at least two PHP Solr APIs. The most robust implementation is Donovan Jimenez's PHP Solr Client (see Resources). The code is licensed under the same terms as Solr, has extensive documentation, and is compatible with Solr V1.2. The most recent release as of this writing is dated 2 Oct 2007.
Solr Client provides four PHP classes:
- Apache_Solr_Service represents a Solr server. Use these methods to ping the server, add and delete documents, commit changes, optimize the index, and run queries.
- Apache_Solr_Document embodies a Solr document. The methods of this class manage (key, value) pairs and multivalue fields. Field values can be accessed by direct dereferencing, such as $document->title = 'Something'; ... echo $document->title;.
- Apache_Solr_Response encapsulates a Solr response. This code depends on the json_decode() function, which is bundled with PHP V5.2.0 and later or can be installed with the PHP Extension Community Library (PECL; see Resources).
- Apache_Solr_Service_Balancer enhances Apache_Solr_Service, allowing you to connect to multiple Solr services in a distribution. This class is not covered here.
Download the PHP Solr Client (see Resources) and extract it to a working directory. Change to the SolrPhpClient directory. Next, check the file Apache/Solr/Service.php. At the time of this writing, line 335 was missing a trailing semicolon. Edit the file and add the semicolon, if necessary. Also, check the file Apache/Solr/Document.php. Lines 112-117 should read as follows.
if (!is_array($this->_fields[$key]))
{
$this->_fields[$key] = array($this->_fields[$key]);
}
$this->_fields[$key][] = $value;
After you correct the files, you can install the Apache directory alongside your other PHP libraries.
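If you prefer not to copy the library into a system-wide location, you can instead add its parent directory to PHP's include path at run time. This is a small sketch under stated assumptions: addLibraryPath() is a hypothetical helper, and /usr/local/lib/php is only a placeholder for whichever directory contains the Apache folder on your system.

```php
<?php
// Append a directory to the include path unless it is already listed.
function addLibraryPath($dir)
{
    $paths = explode(PATH_SEPARATOR, get_include_path());
    if (!in_array($dir, $paths, true)) {
        $paths[] = $dir;
        set_include_path(implode(PATH_SEPARATOR, $paths));
    }
    return get_include_path();
}

// Placeholder location: the directory that holds the Apache/ folder.
addLibraryPath('/usr/local/lib/php');

// With the path in place, the client loads by its relative name:
// require_once 'Apache/Solr/Service.php';
echo get_include_path(), "\n";
```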
The code below shows a PHP application that connects to a Solr service, adds two documents to the index, and runs the part number query used previously.
Listing 8. A sample PHP application to connect to, load, and query a Solr index
<?php
require_once( 'Apache/Solr/Service.php' );

//
// Try to connect to the named server, port, and url
//
$solr = new Apache_Solr_Service( 'localhost', '8983', '/solr' );
if ( ! $solr->ping() ) {
    echo 'Solr service not responding.';
    exit;
}

//
// Create two documents to represent two auto parts.
// In practice, documents would likely be assembled from a
// database query.
//
$parts = array(
    'spark_plug' => array(
        'partno' => 1,
        'name' => 'Spark plug',
        'model' => array( 'Boxster', '924' ),
        'year' => array( 1999, 2000 ),
        'price' => 25.00,
        'inStock' => true,
    ),
    'windshield' => array(
        'partno' => 2,
        'name' => 'Windshield',
        'model' => '911',
        'year' => array( 1999, 2000 ),
        'price' => 15.00,
        'inStock' => false,
    )
);

$documents = array();
foreach ( $parts as $item => $fields ) {
    $part = new Apache_Solr_Document();
    foreach ( $fields as $key => $value ) {
        if ( is_array( $value ) ) {
            foreach ( $value as $datum ) {
                $part->setMultiValue( $key, $datum );
            }
        }
        else {
            $part->$key = $value;
        }
    }
    $documents[] = $part;
}

//
// Load the documents into the index
//
try {
    $solr->addDocuments( $documents );
    $solr->commit();
    $solr->optimize();
}
catch ( Exception $e ) {
    echo $e->getMessage();
}

//
// Run some queries. Provide the raw path, a starting offset
// for result documents, and the maximum number of result
// documents to return. You can also use a fourth parameter
// to control how results are sorted and highlighted,
// among other options.
//
$offset = 0;
$limit = 10;
$queries = array(
    'partno: 1 OR partno: 2',
    'model: Boxster',
    'name: plug'
);

foreach ( $queries as $query ) {
    $response = $solr->search( $query, $offset, $limit );
    if ( $response->getHttpStatus() == 200 ) {
        // print_r( $response->getRawResponse() );
        if ( $response->response->numFound > 0 ) {
            echo "$query <br />";
            foreach ( $response->response->docs as $doc ) {
                echo "$doc->partno $doc->name <br />";
            }
            echo '<br />';
        }
    }
    else {
        echo $response->getHttpStatusMessage();
    }
}
?>
To begin, the code connects to the named Solr server on the port and path given, and
uses the ping() method to verify that the server is operational.
Next, the code translates the records represented as PHP arrays into Solr documents. If
a field has a single value, a simple accessor adds the (key, value) pair to the
document. If a field has multiple values, the list of values is assigned to the key
with the special function setMultiValue(). You can see that
this process closely resembles the XML representation of a Solr document.
As an optimization, addDocuments() inserts multiple
documents into the index. Subsequent commit() and optimize() functions finalize the additions.
At the bottom, several queries retrieve data from the index. You can view the results through two lenses: The getRawResponse() function yields the entire, unparsed result, while the docs property of the parsed response ($response->response->docs) holds an array of documents with named accessors.
If a query does not get the OK from Solr, the code prints an error message. An empty result set emits no output.
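To see what the parsed response looks like without a running server, you can decode a JSON string by hand with json_decode(), the function the client's response class depends on. The JSON below is a hand-written sample modeled on the two auto parts records, not output captured from Solr, and summarizeDocs() is a hypothetical helper.

```php
<?php
// Hand-written sample that imitates a JSON search response for the
// auto parts index. It is illustrative data, not real server output.
$json = '{"responseHeader":{"status":0,"QTime":3},'
      . '"response":{"numFound":2,"start":0,"docs":['
      . '{"partno":"1","name":"Spark plug","model":["Boxster","924"],'
      . '"price":25.0,"inStock":true},'
      . '{"partno":"2","name":"Windshield","model":["911"],'
      . '"price":15.0,"inStock":false}]}}';

$data = json_decode($json);

// Turn each result document into one printable line.
// Multivalued fields arrive as arrays, so join them for display.
function summarizeDocs($data)
{
    $lines = array();
    foreach ($data->response->docs as $doc) {
        $lines[] = sprintf(
            '%s %s [%s] %.2f',
            $doc->partno,
            $doc->name,
            implode(', ', $doc->model),
            $doc->price
        );
    }
    return $lines;
}

echo 'Found ', $data->response->numFound, " documents\n";
foreach (summarizeDocs($data) as $line) {
    echo $line, "\n";
}
```

The object paths used here ($data->response->numFound and $data->response->docs) mirror the ones Listing 8 reads from the client's response object.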
Solr is incredibly powerful, and the PHP API makes integration on any platform a snap. Better yet, Solr is easy to set up and operate, and you can enable advanced features as you need them. Best of all, Solr is free. Don't pay for a search engine. Save your greenbacks and go Solr.
Surf the Solr Web site to learn more about advanced configuration, including sorting, categorized results, and replication. The Lucene Web site is another source of information because it's the search technology beneath the Solr system.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample PHP and Solr application | os-php-apachesolr.src.zip | 109KB | HTTP |
Learn
- Read the "Make PHP apps fast, faster, fastest" series.
- Read an in-depth introduction of Solr in "Search smarter with Apache Solr, Part 1: Essential features and the Solr schema" and "Part 2: Solr for the enterprise" by Solr expert Grant Ingersoll.
- See Martin Streicher's article that introduces finely tuned local search systems titled "Build a custom search engine with PHP."
- Visit Solr at Apache.org to find resources and valuable information.
- Check out the Solr wiki, home to a great deal of documentation about Solr.
- Discover which public Web sites use Solr.
- PHP.net is the central resource for PHP developers.
- Check out the "Recommended PHP reading list."
- Browse all the PHP content on developerWorks.
- Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Using a database with PHP? Check out the Zend Core for IBM, a seamless, out-of-the-box, easy-to-install PHP development and production environment that supports IBM DB2 V9.
- Stay current with developerWorks' Technical events and webcasts.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
Get products and technologies
- Download Solr from one of the project's mirrors.
- Learn more about and download the Sphinx search engine.
- Download Donovan Jimenez's PHP Solr Client.
- Visit the PECL repository, your first stop for all known extensions and hosting facilities for downloading and developing PHP extensions.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Participate in developerWorks blogs and get involved in the developerWorks community.
- Participate in the developerWorks PHP Forum: Developing PHP applications with IBM Information Management products (DB2, IDS).





