Enterprise search with PHP and Apache Solr

Add an advanced search engine to your Web application

Discover how to combine an enterprise-worthy search engine — Apache Software Foundation's Solr — with your PHP application.


Martin Streicher (martin.streicher@gmail.com), Editor in Chief, McClatchy Interactive

Martin Streicher is chief technology officer for McClatchy Interactive, editor-in-chief of Linux Magazine, a Web developer, and a regular contributor to developerWorks. He earned a master's degree in computer science from Purdue University and has been programming UNIX-like systems since 1986.



15 January 2008


In "Build a custom search engine with PHP," I combined PHP and the open source Sphinx search engine to create a blazing-fast alternative to text-intensive database queries, such as LIKE and, in the case of MySQL, MATCH. (See Resources for Sphinx-related information.)

Sphinx is easy to install and maintain, and is quite capable. Moreover, recent releases of Sphinx now provide a native MySQL engine, eliminating the need to run a separate Sphinx daemon. V0.9.8 (the most recent release as of this writing) also added geodistance queries to find records encompassed by a distance from a given location and a feature named multi-query, an optimization that bundles multiple queries and sets of results in a single network connection.

Sphinx continues to improve with time and is ideal for shopping sites, blogs, and many other applications. According to the Sphinx site, one application now indexes 700 million documents, or roughly 1.2 terabytes of data. I recommend Sphinx without hesitation.

However, Sphinx does not yet support several features you might like to employ and offer as your application or site becomes popular and usage increases. In particular, Sphinx does not yet automatically replicate or distribute its indices, making its daemon a single point of failure. (As a workaround, several machines can index the same database, and you can cluster those systems.) Sphinx does not highlight search results (like Google does when it displays cached pages), does not retain or cache recent results, and does not support regular expression (regex) or date-based operations.

If you seek those features or are ready for an enterprise-grade solution, consider the Apache Software Foundation's Solr project. Based on the Lucene search engine and provided as open source under the terms of the liberal Apache Software License, Solr is (according to the Lucene site) "an open source enterprise search server based on the Lucene Java™ search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a Web administration interface."

Among other notable, highly trafficked Web sites, Netflix, Digg, and CNET's News.com and CNET Reviews use Solr to power search. A lengthy list of public Solr-powered sites can be found in the Solr wiki (see Resources).

Learn how to use Solr and PHP to create a small application to search a database of automobile parts. While the example database contains only a handful of records, it could just as easily include millions. All the source code used in this article is available from the Download section.

Installing Solr

To combine Solr with PHP, you must install Solr, design an index, prepare your data to be indexed by Solr, load the index, write PHP code to execute queries, and present results. Much of the work required to create a searchable index can be performed from the command line. Of course, PHP's programmatic interface to Solr can also affect the contents of an index.

Solr is implemented in Java technology. To run Solr and its administrative tools, you must install a Java V1.5 software development kit (Java 5 SDK). Several vendors provide a Java V1.5 SDK (for example, Sun Microsystems, IBM®, and BEA Systems), and each implementation is capable of powering Solr. Simply choose the Java package suited for your operating system and follow the appropriate instructions to complete the installation.

In many cases, the installation of Java V1.5 is as simple as running a self-extracting archive and accepting the terms of a license agreement. A script in the archive does all the heavy lifting in a matter of seconds. Other operating systems, such as Debian, provide the Java 5 SDK in the APT repository. For example, if you use Debian or Ubuntu, you can install the Java V1.5 software with sudo apt-get install sun-java5-jdk.

Conveniently, APT also downloads all the dependencies required to use the Java 5 SDK automatically.

If the Java software is already installed and the Java executable file is in your PATH, run java -version to determine which version of Java you have.

Here, let's use the Mac OS X V10.5 Leopard operating system as the basis of the demonstration. Apple's Leopard includes Java V1.5. With a small change to the default Apache configuration, Leopard runs PHP applications, too. Running java -version in a Leopard terminal window produces the following.

Listing 1. Run java -version in a Leopard terminal window
$ which java
/usr/bin/java

$ java -version
java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

Note: Leopard allows you to switch between Java V1.4 and V1.5 in the Java Preferences application in /Applications/Utilities/Java. If your installation of Leopard reports V1.4, open Java Preferences and change the settings to resemble Figure 1.

Figure 1. Java Preferences application in Leopard

To install Solr, visit Apache.org, click Resources > Download, select a convenient project mirror, and navigate within the folders shown to pick a tarball (a .tgz file) of Solr V1.2. The download transfers a file named something akin to apache-solr-1.2.0.tgz. Unpack the tarball with the following code.

Listing 2. Unpack tarball
$ tar xzf apache-solr-1.2.0.tgz

$ ls -F apache-solr-1.2.0
CHANGES.txt NOTICE.txt  dist/ lib/
KEYS.txt  README.txt  docs/   src/
LICENSE.txt build.xml example/

In the newly created directory, the folder named dist contains the Solr code bundled as a Java archive (JAR). The subdirectory example/exampledocs contains examples of data that's formatted — typically as XML code — and ready for Solr to index.

The example directory contains a complete sample Solr application. To run it, simply launch the Java engine with the application archive: start.jar.

Listing 3. Launch Java engine
$ java -jar start.jar
2007-11-10 15:00:16.672::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
2007-11-10 15:00:16.866::INFO:  jetty-6.1.3
...
INFO: SolrUpdateServlet.init() done
2007-11-10 15:00:18.694::INFO:  Started SocketConnector @ 0.0.0.0:8983

The application is now available on port 8983. Start your browser and type http://localhost:8983/solr/admin/ in the address bar. This is the interface for administering Solr. (To stop the Solr server, use Ctrl+C at the command line.)

But there's no data in the Solr index to manage or query — yet.


Loading data into Solr

Solr is remarkably flexible out of the box, supporting a variety of data types and rules to create effective indices. If the standard components, broad as they are, do not suffice, you can further customize Solr by writing new Java classes.

Given a set of data types and rules, you can then create a Solr schema to describe your data and control how the indices should be constructed. You then export your data to match the schema and load the data into Solr. Solr creates the indices on the fly, updating each index immediately as records are created, modified, or deleted.

The default Solr schema can be found at Apache.org as part of the Solr source code repository. For reference, a snippet of the default schema is shown below.

Listing 3. Default Solr schema snippet
<schema name="example" version="1.1">
  ...
  <fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" /> 
  <field name="name" type="text" indexed="true" stored="true"/>
  <field name="nameSort" type="string" indexed="true" stored="false"/>
  <field name="cat" type="text" indexed="true" stored="true" multiValued="true"/>
  ...
  </fields>
  
  <uniqueKey>id</uniqueKey>
  ...
  <copyField source="name" dest="nameSort"/>
  ...
</schema>

Much of the schema is self-explanatory, but some aspects warrant clarification:

  • As shown, the field id is a string (type="string") and should be indexed (indexed="true"). It is also a required field (required="true"). Using this schema, every record loaded in Solr must have a value for this field. The <uniqueKey>id</uniqueKey> modifier further declares that the id field must be unique. (Solr does not require a unique ID field; this is merely a rule established in the default index schema.) The attribute stored="true" indicates that the id field should be retrievable.

    Why would you ever set stored to false? You can use a nonretrievable field to order results differently, as in the case of nameSort, which is a copy of the name field (because of the copyField command on the last line), but has different behaviors. Notice that nameSort is a string, while name is text. The default index schema treats those two types slightly differently.

  • The field cat is multiValued. A record may define several values for this field. For instance, if your application manages content, a story may be assigned several topics. You could use the cat field (or define a similar field of your own) to capture all the topics.

Listing 4 shows the file example/exampledocs/ipod_other.xml, which represents two entries in a catalog of iPod accessories.

Listing 4. Data formatted for the default Solr index schema
<add>
<doc>
  <field name="id">F8V7067-APL-KIT</field>
  <field name="name">Belkin Mobile Power Cord for iPod w/ Dock</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter, white</field>
  <field name="weight">4</field>
  <field name="price">19.95</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>

<doc>
  <field name="id">IW-02</field>
  <field name="name">iPod &amp; iPod Mini USB 2.0 Cable</field>
  <field name="manu">Belkin</field>
  <field name="cat">electronics</field>
  <field name="cat">connector</field>
  <field name="features">car power adapter for iPod, white</field>
  <field name="weight">2</field>
  <field name="price">11.50</field>
  <field name="popularity">1</field>
  <field name="inStock">false</field>
</doc>
</add>

The add element is a Solr command to add the enveloped records to the index. Each record is captured in a doc element, which uses a series of named field elements to specify field values. The fields weight, price, inStock, manu, features, and popularity are other fields defined in the default Solr index schema. The features field has identical attributes to cat, but has a different semantic meaning: It enumerates the (potentially many) capabilities of a product.
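When the records live in a database rather than in hand-written files, the same add message can be generated from code. The following is a minimal sketch, assuming PHP's bundled XMLWriter extension is available; buildAddXml() is a hypothetical helper written for this illustration and is not part of Solr or of any client library. It repeats the field element for multivalued fields and escapes characters such as the ampersand, which must appear as &amp; in XML.

```php
<?php
// Sketch: serialize PHP arrays into a Solr <add> message.
// buildAddXml() is a hypothetical helper, not part of Solr or a client library.
function buildAddXml(array $records)
{
    $writer = new XMLWriter();
    $writer->openMemory();
    $writer->setIndent(true);
    $writer->startElement('add');

    foreach ($records as $record) {
        $writer->startElement('doc');
        foreach ($record as $name => $values) {
            // Treat every field as a list so multivalued fields repeat the element.
            foreach ((array) $values as $value) {
                $writer->startElement('field');
                $writer->writeAttribute('name', $name);
                // Spell out booleans; text() escapes &, <, and >.
                $writer->text(is_bool($value) ? ($value ? 'true' : 'false') : (string) $value);
                $writer->endElement(); // field
            }
        }
        $writer->endElement(); // doc
    }

    $writer->endElement(); // add
    return $writer->outputMemory();
}

$xml = buildAddXml(array(
    array(
        'id'      => 'IW-02',
        'name'    => 'iPod & iPod Mini USB 2.0 Cable',
        'cat'     => array('electronics', 'connector'),
        'inStock' => false,
    ),
));

echo $xml;
```

Using an XML writer rather than string concatenation keeps the escaping in one place, so a product name containing & or < cannot produce a malformed upload.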


Searching for auto parts

This example indexes a collection of auto parts. Each auto part has several fields, with a sample of the most important fields shown in Table 1. The name of the field is listed in the first column. The second column provides a brief description, while the third column lists its logical type. The fourth column shows the index type (as defined in the schema in Listing 5) used to represent the datum.

Table 1. The fields of an auto part record
Name                             Description                  Type     Solr type
Part number (unique, mandatory)  An identifying number        String   partno
Name                             A concise description        String   name
Model (required, multi-value)    A model, such as "Camaro"    String   model
Model year (multi-value)         A model year, such as 2001   String   year
Price                            Cost per unit                Float    price
In stock                         Within inventory or not      Boolean  inStock
Features                         Capabilities of part         String   features
Timestamp                        Record of activity           String   timestamp
Weight                           Shipping weight              Float    weight

Listing 5 shows a portion of the Solr schema used for the auto parts index. It's largely based on the default Solr schema: the specific fields used (their names and attributes) simply replace the fields element found in the default schema snippet shown earlier.

Listing 5. The auto parts index schema
<?xml version="1.0" encoding="utf-8" ?>
<schema name="autoparts" version="1.0">
  ...
  <fields>
    <field name="partno" type="string" indexed="true" 
    stored="true" required="true" /> 
    
    <field name="name" type="text" indexed="true" 
    stored="true" required="true" />
    
    <field name="model" type="text_ws" indexed="true" stored="true" 
    multiValued="true" required="true" />
    
    <field name="year" type="text_ws" indexed="true" stored="true" 
    multiValued="true" omitNorms="true" />
    
    <field name="price"  type="sfloat" indexed="true" 
    stored="true" required="true" />
    
    <field name="inStock" type="boolean" indexed="true" 
     stored="true" default="false" /> 
    
    <field name="features" type="text" indexed="true" 
    stored="true" multiValued="true" />
    
    <field name="timestamp" type="date" indexed="true" 
    stored="true" default="NOW" multiValued="false" />
    
    <field name="weight" type="sfloat" indexed="true" stored="true" />
  </fields>
  
  <uniqueKey>partno</uniqueKey>
  
  <defaultSearchField>name</defaultSearchField>
</schema>

Given the fields above, a database of auto parts exported and formatted for uploading into Solr might look like Listing 6.

Listing 6. A database of auto parts formatted for indexing
<add>
<doc>
  <field name="partno">1</field>
  <field name="name">Spark plug</field>
  <field name="model">Boxster</field>
  <field name="model">924</field>
  <field name="year">1999</field>
  <field name="year">2000</field>
  <field name="price">25.00</field>
  <field name="inStock">true</field>
</doc>
<doc>
  <field name="partno">2</field>
  <field name="name">Windshield</field>
  <field name="model">911</field>
  <field name="year">1991</field>
  <field name="year">1999</field>
  <field name="price">15.00</field>
  <field name="inStock">false</field>
</doc>
</add>

Let's install the new index schema and load the data into Solr. First, stop the Solr daemon (if it's still running) by using Ctrl+C. Make a backup copy of the existing Solr schema, example/solr/conf/schema.xml. Next, create a text file from Listing 5, save it to /tmp/schema.xml, and copy it to example/solr/conf/schema.xml. Create another file, /tmp/parts.xml, for the data shown in Listing 6. Now, you can start Solr again and use the posting utility provided with the example, as shown in Listing 7.

Listing 7. Launching Solr with a new schema
$ cd apache-solr-1.2.0/example
$ cp solr/conf/schema.xml solr/conf/default_schema.xml
$ chmod a-w solr/conf/default_schema.xml

$ vi /tmp/schema.xml
...
$ cp /tmp/schema.xml solr/conf/schema.xml

$ vi /tmp/parts.xml
...

$ java -jar start.jar 
...
2007-11-11 16:56:48.279::INFO:  Started SocketConnector @ 0.0.0.0:8983

# In a second terminal window, from the same example directory:
$ java -jar exampledocs/post.jar /tmp/parts.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8,    
  other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update...
SimplePostTool: POSTing file parts.xml
SimplePostTool: COMMITting Solr index changes...

Success! To verify that the index exists and contains two documents, point your browser again to http://localhost:8983/solr/admin/. You should see "(autoparts)" at the top of the page. If so, click the query box at midpage and type partno: 1 OR partno: 2.

Your result should resemble the following (field values only, shown here without the surrounding XML markup):

3 on 10 0 partno: 1 OR partno: 2 2.2
true Boxster 924 Spark plug 1 25.0 2007-11-11T21:58:45.899Z 1999 2000 
false 911 Windshield 2 15.0 2007-11-11T21:58:45.953Z 1991 1999

Try some other queries, using the Lucene query syntax.
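Queries need not go through the admin page. As a minimal sketch, assuming the example server is still listening on localhost:8983 and exposes its select handler at /solr/select, the following builds the request URL for the same query; buildSelectUrl() is a hypothetical helper written for this illustration. The wt=json parameter asks Solr for a JSON response rather than XML.

```php
<?php
// Sketch: build the URL for a Solr select request.
// buildSelectUrl() is a hypothetical helper; host, port, and path are
// assumed to match the example server started earlier.
function buildSelectUrl($query, $start = 0, $rows = 10)
{
    $params = array(
        'q'     => $query,
        'start' => $start,  // offset of the first document to return
        'rows'  => $rows,   // maximum number of documents to return
        'wt'    => 'json',  // request a JSON response instead of XML
    );

    // http_build_query() URL-encodes the colons and spaces in the query.
    return 'http://localhost:8983/solr/select?' . http_build_query($params, '', '&');
}

$url = buildSelectUrl('partno: 1 OR partno: 2');
echo $url, "\n";

// With the server running, the response could then be fetched with,
// for example: $json = file_get_contents($url);
```

Letting http_build_query() do the encoding avoids the most common mistake with hand-built URLs: an unescaped space or colon that silently changes the query.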

You should also try editing and loading the data again. Because the partno field is declared unique, repeated upload operations of the same part number merely replace the old index record with a new record. In addition to the add command, you can use commit, optimize, and delete. The last command can delete a specific record by ID or many records through a query.
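The delete, commit, and optimize commands are small XML messages posted to the same update URL that the posting utility uses. The following is a minimal sketch of what those messages look like; the helper names (deleteByIdXml(), deleteByQueryXml(), postUpdate()) are hypothetical and exist only for this illustration, and the update URL is assumed to be the one reported by the posting utility above. The actual POST is left commented out so that the script runs without a server.

```php
<?php
// Sketch: build the XML bodies for Solr's delete, commit, and optimize commands.
// All helper names here are hypothetical; only the XML they emit is sent to Solr.
function deleteByIdXml($id)
{
    // Escape &, <, and > so the ID cannot break the message.
    return '<delete><id>' . htmlspecialchars($id, ENT_NOQUOTES, 'UTF-8') . '</id></delete>';
}

function deleteByQueryXml($query)
{
    return '<delete><query>' . htmlspecialchars($query, ENT_NOQUOTES, 'UTF-8') . '</query></delete>';
}

// Hypothetical helper: POST one message to the update URL of the example server.
function postUpdate($xml, $url = 'http://localhost:8983/solr/update')
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'POST',
            'header'  => "Content-Type: text/xml; charset=utf-8\r\n",
            'content' => $xml,
        ),
    ));
    return file_get_contents($url, false, $context);
}

$messages = array(
    deleteByIdXml('2'),            // remove the record whose unique key is 2
    deleteByQueryXml('model:911'), // remove every record matching a query
    '<commit/>',                   // make the pending changes visible
    '<optimize/>',                 // optimize the index
);

foreach ($messages as $message) {
    echo $message, "\n";
    // postUpdate($message);  // uncomment when the Solr server is running
}
```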


And now for the PHP

Finally, PHP enters the example.

There are at least two PHP Solr APIs. The most robust implementation is Donovan Jimenez's PHP Solr Client (see Resources). The code is licensed under the same terms as Solr, has extensive documentation, and is compatible with Solr V1.2. The most recent release as of this writing is dated 2 Oct 2007.

Solr Client provides four PHP classes:

  • Apache_Solr_Service represents a Solr server. Use its methods to ping the server, add and delete documents, commit changes, optimize the index, and run queries.
  • Apache_Solr_Document embodies a Solr document. The methods of this class manage (key, value) pairs and multivalue fields. Field values can be accessed by direct dereferencing, such as $document->title = 'Something'; ... echo $document->title;.
  • Apache_Solr_Response encapsulates a Solr response. This code depends on the json_decode() function, which is bundled with PHP V5.2.0 and later or can be installed with the PHP Extension Community Library (PECL — see Resources).
  • Apache_Solr_Service_Balancer enhances Apache_Solr_Service, allowing you to connect to multiple Solr services in a distribution. This class is not covered here.

Download the PHP Solr Client (see Resources) and extract it to a working directory. Change to the SolrPhpClient directory. Next, check the file Apache/Solr/Service.php. At the time of this writing, line 335 was missing a trailing semicolon. Edit the file and add the semicolon, if necessary. Also, check the file Apache/Solr/Document.php. Lines 112-117 should read as follows.

if (!is_array($this->_fields[$key]))
{
  $this->_fields[$key] = array($this->_fields[$key]);
}

$this->_fields[$key][] = $value;

After you correct the files, you can install the Apache directory alongside your other PHP libraries.

The code below shows a PHP application that connects to a Solr service, adds two documents to the index, and runs the part number query used previously.

Listing 8. A sample PHP application to connect to, load, and query a Solr index
<?php
  require_once( 'Apache/Solr/Service.php' );
  
  // 
  // 
  // Try to connect to the named server, port, and url
  // 
  $solr = new Apache_Solr_Service( 'localhost', '8983', '/solr' );
  
  if ( ! $solr->ping() ) {
    echo 'Solr service not responding.';
    exit;
  }
  
  //
  //
  // Create two documents to represent two auto parts.
  // In practice, documents would likely be assembled from a 
  //   database query. 
  //
  $parts = array(
    'spark_plug' => array(
      'partno' => 1,
      'name' => 'Spark plug',
      'model' => array( 'Boxster', '924' ),
      'year' => array( 1999, 2000 ),
      'price' => 25.00,
      'inStock' => true,
    ),
    'windshield' => array(
      'partno' => 2,
      'name' => 'Windshield',
      'model' => '911',
      'year' => array( 1991, 1999 ),
      'price' => 15.00,
      'inStock' => false,
    )
  );
    
  $documents = array();
  
  foreach ( $parts as $item => $fields ) {
    $part = new Apache_Solr_Document();
    
    foreach ( $fields as $key => $value ) {
      if ( is_array( $value ) ) {
        foreach ( $value as $datum ) {
          $part->setMultiValue( $key, $datum );
        }
      }
      else {
        $part->$key = $value;
      }
    }
    
    $documents[] = $part;
  }
    
  //
  //
  // Load the documents into the index
  // 
  try {
    $solr->addDocuments( $documents );
    $solr->commit();
    $solr->optimize();
  }
  catch ( Exception $e ) {
    echo $e->getMessage();
  }
  
  //
  // 
  // Run some queries. Provide the raw query, a starting offset
  //   for result documents, and the maximum number of result
  //   documents to return. You can also use a fourth parameter
  //   to control how results are sorted and highlighted, 
  //   among other options.
  //
  $offset = 0;
  $limit = 10;
  
  $queries = array(
    'partno: 1 OR partno: 2',
    'model: Boxster',
    'name: plug'
  );

  foreach ( $queries as $query ) {
    $response = $solr->search( $query, $offset, $limit );
    
    if ( $response->getHttpStatus() == 200 ) { 
      // print_r( $response->getRawResponse() );
      
      if ( $response->response->numFound > 0 ) {
        echo "$query <br />";

        foreach ( $response->response->docs as $doc ) { 
          echo "$doc->partno $doc->name <br />";
        }
        
        echo '<br />';
      }
    }
    else {
      echo $response->getHttpStatusMessage();
    }
  }
?>

To begin, the code connects to the named Solr server on the port and path given, and uses the ping() method to verify that the server is operational.

Next, the code translates the records represented as PHP arrays into Solr documents. If a field has a single value, a simple accessor adds the (key, value) pair to the document. If a field has multiple values, the list of values is assigned to the key with the special function setMultiValue(). You can see that this process closely resembles the XML representation of a Solr document.

As an optimization, addDocuments() inserts multiple documents into the index. Subsequent commit() and optimize() functions finalize the additions.

At the bottom, several queries retrieve data from the index. You can view the results through two lenses: The getRawResponse() function yields the entire, unparsed result, while the response->docs property holds an array of documents with named accessors.

If a query does not get the OK from Solr, the code prints an error message. An empty result set emits no output.
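The two views of a result, raw and parsed, can be illustrated without a running server. The following is a minimal sketch, assuming PHP V5.2.0 or later for json_decode(); the JSON string is a hand-written sample shaped like a Solr response, not output captured from a real server, and the parsing shown is an illustration rather than the client library's actual implementation.

```php
<?php
// Sketch: turn a raw JSON response into objects with named accessors.
// $raw is a hand-written sample, not output captured from a real server.
$raw = '{"responseHeader":{"status":0,"QTime":3},'
     . '"response":{"numFound":2,"start":0,"docs":['
     . '{"partno":"1","name":"Spark plug"},'
     . '{"partno":"2","name":"Windshield"}]}}';

// json_decode() returns objects by default, so fields read as properties.
$parsed = json_decode($raw);

$lines = array();
if ($parsed !== null && $parsed->response->numFound > 0) {
    foreach ($parsed->response->docs as $doc) {
        $lines[] = $doc->partno . ' ' . $doc->name;
    }
}

echo implode("\n", $lines), "\n";
```

The raw string is what you would log when debugging; the decoded objects are what the display code iterates over.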


More power

Solr is incredibly powerful, and the PHP API makes integration on any platform a snap. Better yet, Solr is easy to set up and operate, and you can enable advanced features as you need them. Best of all, Solr is free. Don't pay for a search engine. Save your greenbacks and go Solr.

Surf the Solr Web site to learn more about advanced configuration, including sorting, categorized results, and replication. The Lucene Web site is another source of information because it's the search technology beneath the Solr system.


Download

Description                      Name                       Size
Sample PHP and Solr application  os-php-apachesolr.src.zip  109KB

Resources

Learn

Get products and technologies

  • Download Solr from one of the project's mirrors.
  • Learn more about and download the Sphinx search engine.
  • Download Donovan Jimenez's PHP Solr Client.
  • Visit the PECL repository, your first stop for all known extensions and hosting facilities for downloading and developing PHP extensions.
  • Innovate your next open source development project with IBM trial software, available for download or on DVD.
  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
