Create a browser-based PDF storage and search application with Bluemix services, Part 2

Use the Object Storage service to store your files

In Part 1 of this article series, I walked you through using two important services: Document Conversion and Keyword Extraction. I also showed you how to use these services in a real-world application that lets users store and intelligently index PDF documents for more efficient searches.

In this concluding part, I introduce you to the IBM® Bluemix™ Object Storage service, which provides a standards-compliant, reliable object storage infrastructure. I show you how to use the service with PHP to create a happy home for your PDF uploads.

I also show you how to build a search engine that leverages MongoDB's text indices to enable keyword search for your new document store, and then walk you through the process of deploying it to IBM Bluemix. Keep reading; there's lots in store!

Step 1. Understand and configure the Object Storage service

The Object Storage service enables easy storage and retrieval of unstructured data in the Bluemix cloud. It supports the OpenStack Swift API and follows Swift's three-tier hierarchy for organizing data: accounts, containers, and objects. Here's how this works:

  • The primary unit in the hierarchy is an account. Accounts correspond to users; to access an account, a user must provide authentication credentials.
  • An account can host multiple containers that are broadly equivalent to folders or sub-directories in a traditional file system.
  • Each container can store multiple objects that can be files or data. Objects can have additional user-defined metadata. Usually, you can store an unlimited number of objects.
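
Swift exposes this hierarchy directly in its URL paths: the account URL comes first, followed by the container name and then the object name. The following is a minimal sketch of that mapping. The swiftUrl() helper is hypothetical (it is not part of any SDK), and the account URL and file name are placeholders.

```php
<?php
// Sketch: map the account/container/object hierarchy to a Swift URL path.
// swiftUrl() is a hypothetical helper, not part of any SDK.
function swiftUrl($accountUrl, $container = null, $object = null)
{
    $url = rtrim($accountUrl, '/');
    if ($container !== null) {
        // Container names cannot contain "/", so encode the whole name.
        $url .= '/' . rawurlencode($container);
        if ($object !== null) {
            // Object names may contain "/" (pseudo-folders); encode each segment.
            $segments = array_map('rawurlencode', explode('/', $object));
            $url .= '/' . implode('/', $segments);
        }
    }
    return $url;
}

$account = 'https://xyz.objectstorage.softlayer.net/AUTH_123'; // placeholder
echo swiftUrl($account), "\n";                                   // the account
echo swiftUrl($account, 'documents'), "\n";                      // a container
echo swiftUrl($account, 'documents', 'annual report.pdf'), "\n"; // an object
```

Encoding each path segment separately keeps spaces and other special characters in file names from breaking the URL.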

To see this in action, log in to your Bluemix account, then search for and select the Object Storage service to initialize a new service instance.

Review the description of the service and click to launch it. Make sure the "Connect to" field is set to "Leave unbound" and that you're using the "Free Plan". Initially, this service instance runs in an unbound state. Similar to the Document Conversion service instance in Part 1, this unbound state allows the application to be developed locally with the service instance itself hosted remotely on Bluemix.

Figure 1. Object Storage service initialization

Once the service instance is initialized, you are presented with a service information page. Display the navigation bar on the left and click the "Service Credentials" link to view the URL, region, username, password, and other credentials for the service instance. Note all of these credentials, as you will need them in subsequent steps.

Figure 2. Object Storage service credentials

You can now take the Object Storage service for a spin by using a REST client like Postman to send it a few sample requests. Begin by sending a POST request to the authentication URL for the service with the service username and password. If authentication is successful, the server returns a success response code (201 Created in the Keystone v3 API) and an X-Subject-Token header containing an authentication token that must be used for subsequent requests.

For example, if the authentication URL for the service is https://identity.open.softlayer.com/v3/auth/tokens, send a POST request to that URL with a JSON content body containing the service username and password. Here's an example request and response:

Figure 3. Sample request/response for Object Storage service authentication
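
As a rough sketch of what the JSON content body can look like, the snippet below builds a payload in the standard Keystone v3 "password" format. It assumes the service accepts that format and that a project ID is among the "other credentials" listed for the service instance. The buildAuthPayload() helper and all credential values are placeholders.

```php
<?php
// Sketch: build the JSON body for a POST to the authentication URL.
// Assumption: the service accepts the standard Keystone v3 "password" method
// with a project scope; all credential values here are placeholders.
function buildAuthPayload($userId, $password, $projectId)
{
    return json_encode(array(
        'auth' => array(
            'identity' => array(
                'methods'  => array('password'),
                'password' => array(
                    'user' => array('id' => $userId, 'password' => $password),
                ),
            ),
            'scope' => array(
                'project' => array('id' => $projectId),
            ),
        ),
    ));
}

echo buildAuthPayload('USER_ID', 'PASSWORD', 'PROJECT_ID'), "\n";
```

Send the resulting string with a Content-Type: application/json header, and read the token from the X-Subject-Token response header.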

The response also includes a series of endpoints, one for each of the services available in this OpenStack deployment. Look through it until you find the endpoint for the Object Storage service, as shown below. This endpoint URL will be the target of all subsequent requests.

Figure 4. Object Storage service endpoint
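
Rather than scanning the response by eye, you could pick out the endpoint programmatically. The sketch below assumes the response body follows the Keystone v3 layout, in which token.catalog lists services and each service lists its endpoints. The findObjectStoreUrl() helper, the sample body, and the "dallas" region value are made up for illustration.

```php
<?php
// Sketch: find the public Object Storage URL in a decoded token response.
// Assumption: the response follows the Keystone v3 layout, where
// token.catalog lists services and each service lists its endpoints.
function findObjectStoreUrl(array $tokenResponse, $region)
{
    if (!isset($tokenResponse['token']['catalog'])) {
        return null;
    }
    foreach ($tokenResponse['token']['catalog'] as $service) {
        if (!isset($service['type'], $service['endpoints'])
            || $service['type'] !== 'object-store') {
            continue;
        }
        foreach ($service['endpoints'] as $endpoint) {
            if (isset($endpoint['interface'], $endpoint['region'], $endpoint['url'])
                && $endpoint['interface'] === 'public'
                && $endpoint['region'] === $region) {
                return $endpoint['url'];
            }
        }
    }
    return null; // no matching endpoint found
}

// Trimmed, made-up response body for illustration only.
$body = '{"token":{"catalog":[{"type":"object-store","endpoints":['
      . '{"interface":"public","region":"dallas",'
      . '"url":"https://xyz.objectstorage.softlayer.net/AUTH_123"}]}]}}';

echo findObjectStoreUrl(json_decode($body, true), 'dallas'), "\n";
```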

Once you have the authentication token and the endpoint URL, you can begin interacting with the Object Storage service using the Swift API. For example, to add a new container, send a PUT request to the endpoint URL, adding the new container name to the end of the URL and remembering to include an X-Auth-Token header in the request with the authentication token. So, if the endpoint URL is https://xyz.objectstorage.softlayer.net/AUTH_123, send a PUT request to https://xyz.objectstorage.softlayer.net/AUTH_123/container1 to create a new container named "container1". Here's an example:

Figure 5. Sample request/response for object storage container creation

Similarly, to list all containers in the account, send a GET request to the endpoint URL itself, without any parameters (in this example, https://xyz.objectstorage.softlayer.net/AUTH_123), as shown below:

Figure 6. Sample request/response for object storage container listing
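
The two container requests above can be summarized in code. This is only a sketch: containerRequest() and describePutContainerStatus() are hypothetical helpers that describe a request and interpret the usual Swift status codes for container creation. They do not send anything, and the token value is a placeholder.

```php
<?php
// Sketch: describe Swift container requests and interpret their status codes.
// Both helpers are hypothetical; send the described request with any HTTP client.
function containerRequest($method, $endpoint, $token, $container = null)
{
    $url = rtrim($endpoint, '/');
    if ($container !== null) {
        $url .= '/' . rawurlencode($container);
    }
    return array(
        'method'  => $method,
        'url'     => $url,
        'headers' => array('X-Auth-Token' => $token),
    );
}

function describePutContainerStatus($status)
{
    switch ($status) {
        case 201: return 'container created';
        case 202: return 'container already existed';
        case 401: return 'token missing, invalid, or expired';
        default:  return 'unexpected status ' . $status;
    }
}

$endpoint = 'https://xyz.objectstorage.softlayer.net/AUTH_123';
$token    = 'TOKEN_FROM_X_SUBJECT_TOKEN'; // placeholder

print_r(containerRequest('PUT', $endpoint, $token, 'container1')); // create a container
print_r(containerRequest('GET', $endpoint, $token));               // list containers
echo describePutContainerStatus(201), "\n";
```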

Step 2. Store uploaded files in the Object Storage service instance

Although it's certainly possible to interact with the Object Storage API using requests and responses like the ones shown above, an easier solution for application developers is to use php-opencloud, a PHP SDK for OpenStack-based deployments. This SDK provides a convenient PHP wrapper around the Swift API methods, so you simply need to call the appropriate method, for example, createContainer() or listContainers(). The client library will take care of formulating the request and decoding the response for you. You still need to know what's going on in the background, though, both to debug errors and in case you want to perform an operation that's not currently supported in the SDK.

The php-opencloud package was included in the Composer dependency file shown in Part 1 of this series, so you should already have it installed on your development system. Before using it, copy your Object Storage service credentials to the PHP application: edit the $APP_ROOT/config.php file and add the credentials to it, following the example below:

<?php
$config['settings']['object-storage']['url'] = "URL";
$config['settings']['object-storage']['region'] = "REGION";
$config['settings']['object-storage']['user'] = "USERNAME";
$config['settings']['object-storage']['pass'] = "PASSWORD";

Then, use this configuration to initialize a new OpenStack client using Slim's DI container by adding the code below to the $APP_ROOT/public/index.php file:

<?php
// Slim application initialization - snipped

// initialize dependency injection container
$container = $app->getContainer();

// add Object Storage service client
$container['objectstorage'] = function ($container) {
  $config = $container->get('settings');
  $openstack = new OpenStack\OpenStack(array(
    'authUrl' => $config['object-storage']['url'],
    'region'  => $config['object-storage']['region'],
    'user'    => array(
      'id'       => $config['object-storage']['user'],
      'password' => $config['object-storage']['pass']
  )));
  return $openstack->objectStoreV1();
};

Finally, update the /add POST request handler in the same file to use this client and save the uploaded PDF file to the Object Storage instance:

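<?php
// imports used by this handler (skip any already declared at the top of the file)
use GuzzleHttp\Psr7\Stream;
use GuzzleHttp\Exception\ClientException;

// Slim application initialization - snipped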

// upload processor
$app->post('/add', function (Request $request, Response $response) {

  // get configuration
  $config = $this->get('settings');
  

  try {
    // check for valid file upload
    if (empty($_FILES['upload']['name'])) {
      throw new Exception('No file uploaded');
    }
    
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = $finfo->file($_FILES['upload']['tmp_name']);
    if ($type != 'application/pdf') {
      throw new Exception('Invalid file format');    
    }

    // convert uploaded PDF to text
    // connect to Watson document conversion API  
    // transfer uploaded file for conversion to text format
    $apiResponse = $this->converter->post(
      'v1/convert_document?version=2015-12-15', array('multipart' => array(
        array('name' => 'config', 
          'contents' => '{"conversion_target":"normalized_text"}'),
        array('name' => 'file', 
          'contents' => fopen($_FILES['upload']['tmp_name'], 'r'))
    )));
    
    // store response
    $text = (string)$apiResponse->getBody();
    unset($apiResponse);

    // extract keywords from text
    // connect to Watson/Alchemy API for keyword extraction 
    // transfer text content for keyword extraction
    // request JSON output
    $apiResponse = $this->extractor->post('text/TextGetRankedKeywords', 
      array('form_params' => array(
      'apikey' => $config['alchemy']['apikey'],
      'text' => strip_tags($text),
      'outputMode' => 'json'
    )));

    // process response
    // create keyword array
    $body = $apiResponse->getBody(); 
    $data = json_decode($body);
    $keywords = array();
    foreach ($data->keywords as $k) {
      $keywords[] = $k->text;
    }
    
    // save keywords to MongoDB
    $collection = $this->db->docs;
    $q = trim($_FILES['upload']['name']);
    $params = $request->getParams();
    $result = $collection->findOne(array('name' => $q));
    $doc = new stdClass;
    if ($result !== null) {
      $doc->_id = $result['_id'];
    }
    $doc->name = trim($_FILES['upload']['name']);
    $doc->keywords = $keywords;
    $doc->description = trim(strip_tags($params['description']));
    $doc->updated = time();
    $collection->save($doc);
    
    // save PDF to object storage
    $service = $this->objectstorage;
    $containers = $service->listContainers();
    
    $containerExists = false;
    foreach($containers as $c) {
      if ($c->name == 'documents') {
        $containerExists = true;
        break;
      }
    }
    
    if ($containerExists == false) {
      $container = $service->createContainer(array(
        'name' => 'documents'
      )); 
    } else {    
      $container = $service->getContainer('documents');
    }
      
    $stream = new Stream(fopen($_FILES['upload']['tmp_name'], 'r'));
    $options = array(
      'name'   => trim($_FILES['upload']['name']),
      'stream' => $stream,
    );
    $container->createObject($options);

    $response = $this->view->render($response, 'add.phtml', 
      array('keywords' => $keywords, 
        'object' => trim($_FILES['upload']['name']), 
        'router' => $this->router
    ));
    return $response;
    
  } catch (ClientException $e) {
    // in case of a Guzzle exception
    // display HTTP response content
    throw new Exception((string) $e->getResponse()->getBody());
  }

});

You've already seen some of this code in Part 1: checking the uploaded file, converting it to normalized text, extracting keywords from it, and saving those keywords to a MongoDB database. The additional code begins by using the Object Storage client to list available containers using the listContainers() method. It then looks through the container list to check if a container named documents exists and if not, it invokes the createContainer() method to create a new container with this name. Or, if the container already exists, it uses the getContainer() method to obtain a reference to the container.

Once a reference to the documents container is obtained, the next step is to initialize a new stream from the uploaded PDF document. This stream is passed to the container's createObject() method as part of an options array, which includes the desired name for the object in the container. The createObject() method takes care of transferring and saving the document to the Object Storage instance as a named object. As you'll see in Step 4, you can use the object name as a key to retrieve the PDF document at any time.
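
The keyword-extraction step in the handler above loops over the keywords array of the decoded response without checking that it is present, so an error response would cause PHP notices. A more defensive version might look like the sketch below. The extractKeywords() helper is hypothetical, and both sample responses are made up. The only assumption about the response format is the one the handler already makes: a keywords array whose items have a text property.

```php
<?php
// Sketch: turn a keyword-extraction response into a plain keyword list.
// Assumption: a successful response carries a "keywords" array whose items
// have a "text" property, as the handler above expects.
function extractKeywords($json)
{
    $data = json_decode($json);
    if (!is_object($data) || !isset($data->keywords) || !is_array($data->keywords)) {
        return array(); // error response or malformed JSON: no keywords
    }
    $keywords = array();
    foreach ($data->keywords as $k) {
        if (isset($k->text)) {
            $keywords[] = $k->text;
        }
    }
    return $keywords;
}

// Made-up sample responses for illustration.
$ok    = '{"keywords":[{"text":"object storage","relevance":"0.97"},'
       . '{"text":"Bluemix","relevance":"0.81"}]}';
$error = '{"status":"ERROR","statusInfo":"invalid-api-key"}';

print_r(extractKeywords($ok));    // two keywords
print_r(extractKeywords($error)); // empty array
```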

Step 3. Build a search interface

Once you've got your documents and keywords saved, all that's left is to build a search interface that lets you quickly scan the keyword list and find matching documents.

Now, you'll remember that the extracted keywords for each PDF were saved as an array of string elements in the keywords property of the corresponding MongoDB document. To make it easier to search this array, add a text index to the keywords property, using a command like the one below:

db.docs.createIndex({ keywords: "text" })

Here's an example of doing this using the MongoLab interface:

Figure 7. Text index creation on MongoDB collection
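
If you would rather create the index from PHP than from a MongoDB console, a sketch along the following lines may help. It assumes that the collection object exposes a createIndex() method that accepts an array of keys, as the collection classes of the PHP MongoDB drivers do. The ensureKeywordIndex() helper and the RecordingCollection stand-in are hypothetical, and the stand-in exists only so that the example runs without a database.

```php
<?php
// Sketch: create the text index from PHP instead of the MongoDB console.
// Assumption: $collection exposes createIndex(array $keys).
function ensureKeywordIndex($collection)
{
    return $collection->createIndex(array('keywords' => 'text'));
}

// Stand-in for a real collection, used here only to show the call being made.
class RecordingCollection
{
    public $lastKeys = null;

    public function createIndex(array $keys)
    {
        $this->lastKeys = $keys;
        return true;
    }
}

$collection = new RecordingCollection();
ensureKeywordIndex($collection);
print_r($collection->lastKeys);
```

In the application, the real collection ($this->db->docs in the route handlers) would be passed in instead of the stand-in. Keep in mind that MongoDB allows only one text index per collection, and that multiple search terms in a $text query are combined with a logical OR.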

Then, add a /search route and callback handler to your Slim application at the $APP_ROOT/public/index.php file, as shown below:

<?php
// Slim application initialization - snipped

$app->get('/search', function (Request $request, Response $response) {
  $params = $request->getQueryParams();
  $results = array();
  if (isset($params['q'])) {
    $q = trim(strip_tags($params['q']));
    if (!empty($q)) {
      $where = array(
        '$text' => array('$search' => $q) 
      );  
      $collection = $this->db->docs;
      $results = $collection->find($where)->sort(array('updated' => -1));    
    }
  }
  $response = $this->view->render($response, 'search.phtml', 
    array('router' => $this->router, 'results' => $results));
  return $response;
})->setName('search');

This callback handles requests for the /search URL endpoint and checks for a query string in the request URL. If a query string is found, it initializes the MongoDB client and uses the client's find() method to generate and execute a MongoDB search query on the text index. The results are then sorted with the most recently updated documents first. The find() method returns a cursor to this result collection, and this cursor is passed on to the view script for display.

Here's what the view script looks like:

<div class="panel panel-default">
  <form method="get" 
    action="<?php echo $data['router']->pathFor('search'); ?>">
    <div class="input-group">
      <input type="text" name="q" class="form-control" 
        placeholder="Search for...">
      <span class="input-group-btn">
        <button type="submit" class="btn btn-default">Search</button>
      </span>
    </div>  
  </form>
</div>  

<?php if (isset($data['results']) && count($data['results'])): ?>
<h4>Search Results</h4>
<ul class="list-group row clearfix">
<?php foreach ($data['results'] as $doc): ?>
  <li class="list-group-item clearfix" style="border:none">
  <strong><?php echo htmlspecialchars($doc['name']); ?></strong> 
    <a href="<?php echo $data['router']->pathFor('download', 
      array('id' => $doc['name'])); ?>" 
      class="btn-sm btn-success">Download</a> <br /> 
  <?php echo htmlspecialchars($doc['description']); ?> <br /> 
  Last updated: <?php echo date('d M Y', $doc['updated']); ?> 
  <br /> 
  </li>
<?php endforeach; ?>
</ul>
<p>This operation made use of data generated through 
  <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/">IBM Watson</a>
  and <a href="https://www.alchemyapi.com/">AlchemyAPI</a> services.</p>
<?php endif; ?>
</div>

There are two main components to this view script:

  • A search form that contains a text input field for the user to enter one or more keywords. On submission, the data entered by the user is submitted to the /search URL endpoint as a GET request.
  • A search results panel that iterates over the MongoDB document collection returned by the /search handler and, for each result document, displays the document name, description, and last updated date. Each entry additionally contains a link to the /download URL endpoint that enables the user to download the corresponding PDF document from the Object Storage service.

Here's an example of what it looks like:

Figure 8. Search form and results

Step 4. Retrieve documents from storage

The view script shown in the previous section includes links to the /download URL endpoint, which lets users download a PDF document from the Object Storage service. You'll also notice that the /download URL includes the PDF document name as part of the URL.

Internally, the callback handler for the /download endpoint must initialize a new Object Storage client and then use the client's methods to download the corresponding binary object, using its name as key. This object must then be sent to the user's browser as a stream so that it can be saved locally.

Here's the code for the /download handler that does all of the above tasks:

<?php
// Slim application initialization - snipped

$app->get('/download/{id}', function (Request $request, Response $response, $args) {
  $service = $this->objectstorage;
  $stream = $service->getContainer('documents')
                  ->getObject(trim(strip_tags($args['id'])))
                  ->download();
  $response = $response->withHeader('Content-Type', 'application/pdf')
                       ->withHeader('Content-Disposition', 'attachment; filename="' . trim(strip_tags($args['id'])) .'"')
                       ->withHeader('Content-Length', $stream->getSize())
                       ->withHeader('Expires', '@0')
                       ->withHeader('Cache-Control', 'must-revalidate')
                       ->withHeader('Pragma', 'public');
  $response = $response->withBody($stream);
  return $response;
})->setName('download');

The handler above retrieves the Object Storage client from the dependency injection container and then uses the getContainer() and getObject() methods to obtain a reference to the documents container and the object named in the request URL. It then calls the object's download() method to obtain a stream containing the PDF file.

The Slim Response object is modified to include several headers. Content-Type identifies the payload as a PDF, Content-Disposition tells the browser to save it as an attachment under the original file name, and Content-Length gives its size. The stream is then attached to the Response object with the withBody() method and the complete response is sent to the requesting client.
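
Because the file name in the Content-Disposition header comes from the request URL, you may want to clean it before placing it inside the quoted header value. The sketch below shows one way to do that. The downloadHeaders() helper is hypothetical (it is not part of Slim), and the sanitization rules are an assumption about what is reasonable, not a complete defense.

```php
<?php
// Sketch: build the download headers with a sanitized file name.
// downloadHeaders() is a hypothetical helper, not part of Slim.
function downloadHeaders($name, $size)
{
    // Drop any path component, then remove characters that would
    // break the quoted header value.
    $safe = str_replace(array('"', "\r", "\n"), '', basename($name));
    return array(
        'Content-Type'        => 'application/pdf',
        'Content-Disposition' => 'attachment; filename="' . $safe . '"',
        'Content-Length'      => (string) $size,
    );
}

foreach (downloadHeaders('reports/q1 "final".pdf', 1024) as $header => $value) {
    echo $header, ': ', $value, "\n";
}
```

In the route handler, each entry of the returned array could be applied to the response with the withHeader() method.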

Step 5. Deploy to Bluemix

At this point, the application is complete and can be deployed to Bluemix. To do this, first create the application manifest file at $APP_ROOT/manifest.yml, remembering to use a unique host and application name by appending a random string to it (like your initials).

---
applications:
- name: pdf-keyword-search-[initials]
  memory: 256M
  instances: 1
  host: pdf-keyword-search-[initials]
  buildpack: https://github.com/cloudfoundry/php-buildpack.git
  stack: cflinuxfs2

The Cloud Foundry PHP build pack doesn't include the PHP MongoDB extension or the fileinfo extension (used for validating PDF file uploads) by default, so you must configure the build pack to enable these extensions during deployment. Similarly, you must configure the build pack to use the public directory of the application as the web server directory. Create a $APP_ROOT/.bp-config/options.json file with the following content:

{
    "WEB_SERVER": "httpd",
    "PHP_EXTENSIONS": ["bz2", "zlib", "curl", "mcrypt", "mongo", "fileinfo"],
    "WEBDIR": "public",
    "PHP_VERSION": "{PHP_56_LATEST}"
}

Also, if you'd like to have the service credentials for the Document Conversion and Object Storage services automatically sourced from Bluemix, update the $APP_ROOT/public/index.php script to use Bluemix's VCAP_SERVICES variable, as shown below:

<?php                
// include autoloader and configuration
require '../vendor/autoload.php';
require '../config.php';
                
// if Bluemix VCAP_SERVICES environment available
// overwrite with credentials from Bluemix
if ($services = getenv("VCAP_SERVICES")) {
  $services_json = json_decode($services, true);
  
  $config['settings']['document-conversion']['user'] = 
    $services_json["document_conversion"][0]["credentials"]["username"];
  $config['settings']['document-conversion']['pass'] = 
    $services_json["document_conversion"][0]["credentials"]["password"];

  $config['settings']['object-storage']['url'] = 
    $services_json["Object-Storage"][0]["credentials"]["auth_url"];
  $config['settings']['object-storage']['region'] = 
    $services_json["Object-Storage"][0]["credentials"]["region"];
  $config['settings']['object-storage']['user'] = 
    $services_json["Object-Storage"][0]["credentials"]["userId"];
  $config['settings']['object-storage']['pass'] = 
    $services_json["Object-Storage"][0]["credentials"]["password"];
} 

// initialize application
$app = new \Slim\App($config);

// route callbacks - snipped
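
The script above reads the credentials for both services unconditionally, so it produces PHP notices if one of them is not bound yet. A more defensive lookup might look like the following sketch. The serviceCredentials() helper is hypothetical, and the sample JSON is made up. Only the "Object-Storage" label and the credential key names are taken from the script above.

```php
<?php
// Sketch: read one service's credentials from VCAP_SERVICES without
// triggering notices when the service is not bound.
function serviceCredentials($vcapJson, $label)
{
    $services = json_decode($vcapJson, true);
    if (!is_array($services) || !isset($services[$label][0]['credentials'])) {
        return null;
    }
    return $services[$label][0]['credentials'];
}

// Made-up sample; real values come from getenv('VCAP_SERVICES').
$sample = '{"Object-Storage":[{"credentials":{'
        . '"auth_url":"https://identity.open.softlayer.com",'
        . '"region":"REGION","userId":"USER_ID","password":"PASSWORD"}}]}';

$creds = serviceCredentials($sample, 'Object-Storage');
echo $creds['auth_url'], "\n";

// The Document Conversion service is not bound in this sample.
var_dump(serviceCredentials($sample, 'document_conversion'));
```

When the helper returns null, the application can simply keep the values from config.php.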

You can now go ahead and push the application to Bluemix, then bind to it the Document Conversion service that you initialized in Part 1 and the Object Storage service that you initialized earlier in this article. Remember to use the correct ID for each service instance to ensure they are correctly bound to the application.

shell> cf api https://api.ng.bluemix.net
shell> cf login
shell> cf push
shell> cf bind-service pdf-keyword-search-[initials] "Document Conversion-[id]"
shell> cf bind-service pdf-keyword-search-[initials] "Object Storage-[id]"
shell> cf restage pdf-keyword-search-[initials]

You can start using the application by browsing to the host specified in the application manifest, for example, http://pdf-keyword-search-[initials].mybluemix.net. If you see a blank page or other errors, use the link at the top of this section to debug your PHP code and find out where things are going wrong.

Conclusion

This article focused on orchestrating various Watson and Bluemix services to solve a common problem: quickly filtering a large collection of documents to find only those matching certain keywords. By combining cognitive computing with reliable, scalable cloud storage and infrastructure and some PHP glue, it demonstrated how easy it is for developers to build powerful solutions for document storage and search in the cloud.

If you'd like to experiment with the services discussed in this article, start by trying out the live demo of the application. Remember that this is a public demo, so you should be careful not to upload confidential or sensitive information to it (there's also a handy "Reset System" button that you can use to erase all uploaded data). Then, download the code from its GitHub repository and take a closer look to see how it all fits together.

Of course, this article focused only on one particular use case but, by using other Bluemix and Watson services, you can mix things up to create other cognitive document handling apps. To learn more about the Bluemix Document Conversion service, the AlchemyAPI Keyword Extraction API, the Bluemix Object Storage service, the Slim micro-framework, and the other tools and techniques used in this article, refer to the links at the top of each section. Happy coding!

Published: June 13, 2016