Create a browser-based PDF storage and search application with Bluemix services, Part 1

Use the Document Conversion and Keyword Extraction services to convert and index files

This content is part of the series: Create a browser-based PDF storage and search application with Bluemix services, Part 1

Stay tuned for additional content in this series.

If you're anything like me, you probably have hundreds of documents on your computer right now. Documents you've created, documents you've been asked to review, documents you've been copied on as part of various projects...the list goes on. You get them over email, in Slack conversations, and as website downloads...and the more you have, the larger your haystack grows and the harder it is to find the needles you need, when you need them.

With existing Watson and Bluemix services, you can put together a useful document storage application with full keyword search support.

That's where this article comes in. In Part 1 of this two-part series, I show you how to create a powerful, browser-based, document storage and search application that makes it faster and easier to search for relevant content in your document haystack. Through this process, I also introduce you to some interesting new services from IBM® Watson and walk you through the process of hosting the final application on IBM Bluemix™.

Sound interesting? Keep reading!


The example application described in this article lets users select and upload PDF documents from their computer to an online document store. As each document is uploaded, it is automatically scanned for keywords, and those keywords are extracted and stored in a database. Users can later search by keyword to quickly identify and download documents relevant to their needs. The application is also mobile-optimized and suitable for use on both smartphones and desktop computers.

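Behind the scenes, the application works by orchestrating three services, some available directly through IBM Bluemix and others available as third-party services: the Watson Document Conversion service, which converts uploaded PDF documents to text; the AlchemyAPI service, which extracts keywords from that text; and the Object Storage service, which stores the uploaded files.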

On the client side of things, I'm using Bootstrap to create a mobile-friendly user interface for the application. On the server, I'm using the Slim PHP micro-framework to manage the application flow, and MongoDB to store the list of keywords. The application is hosted on IBM Bluemix and connects to the three services listed previously for document processing and storage.

What you'll need

There are a lot of technologies in use here, so here's what you'll need:

Note: Any application that uses the AlchemyAPI service must comply with the AlchemyAPI Terms and Conditions. Similarly, any application that uses the Document Conversion and Object Storage services must comply with each service's terms of use, as described in the service catalog pages. Before beginning your project, spend a few minutes reading these requirements and ensuring that your application complies with them.

1

Create the bare application

The first step is to create a bare application containing the Slim PHP micro-framework and various other dependencies, such as the OpenStack SDK for PHP and the Guzzle HTTP client, both needed for interacting with the various services described previously. These components can be easily downloaded and installed using Composer, the PHP dependency manager. Use this Composer configuration file, which should be saved to $APP_ROOT/composer.json where $APP_ROOT refers to your project directory.

{
    "require": {
        "php-opencloud/openstack": "*",
        "slim/slim": "*",
        "slim/php-view": "*",
        "guzzlehttp/guzzle": "~6.0"
    },
    "minimum-stability": "dev",
    "prefer-stable": true
}

Install Slim and other required components using Composer with the command:

  shell> php composer.phar install

Next, create the directories $APP_ROOT/public for all web-accessible files and $APP_ROOT/views for all views, and create the file $APP_ROOT/config.php for configuration information. You should end up with a directory structure that looks like this:

Figure 1. Application file structure

You should also add your MongoDB connection information to the $APP_ROOT/config.php file following the example below:

<?php
$config['settings']['db']['uri'] = "mongodb://USERNAME:PASSWORD@HOST:PORT/DATABASE";

To make it easier to access the application, you can also define a new virtual host in your development environment and point its document root to $APP_ROOT/public. This step, although optional, is recommended because it creates a closer replica of the target deployment environment on Bluemix.

To set up a named virtual host for the application in Apache, open the Apache configuration file (httpd.conf or httpd-vhosts.conf) and add these lines:

NameVirtualHost 127.0.0.1
<VirtualHost 127.0.0.1>
    DocumentRoot "/var/www/pdf-keyword-search/public"
    ServerName pdf-keyword-search.localhost
</VirtualHost>

These lines define a new virtual host, http://pdf-keyword-search.localhost/, whose document root corresponds to $APP_ROOT/public (remember to update the path to reflect your own local settings). Restart the web server to activate these new settings. Note that you might need to update your network's local DNS server, or your local hosts file, to tell it about the new host.

Then, add a .htaccess file to the $APP_ROOT/public directory with the following settings:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^ index.php [QSA,L]

Next, set up the main control script for the application. This script loads the Slim framework and initializes the Slim application. It also contains callbacks for each of the application's routes, with each callback defining the code to be executed when the route is matched to an incoming request.

Create the script at $APP_ROOT/public/index.php with the following content:

<?php
use \Psr\Http\Message\ServerRequestInterface as Request;
use \Psr\Http\Message\ResponseInterface as Response;
use GuzzleHttp\Psr7\Stream;
use GuzzleHttp\Client;

// set a long time limit to account 
// for large file uploads and processing time
set_time_limit(6000);

// include autoloader and configuration
require '../vendor/autoload.php';
require '../config.php';

// initialize application
$app = new \Slim\App($config);

// initialize dependency injection container
$container = $app->getContainer();

// add view renderer
$container['view'] = function ($container) {
  return new \Slim\Views\PhpRenderer("../views/");
};

$app->run();

This script loads all the necessary classes and the configuration file. It then initializes a new Slim application object and a dependency injection (DI) container, and adds the PHP view renderer to the DI container.

You also need to construct a base user interface that can be used for the various views rendered by the application. Here's an example, which is used for all the application views shown in subsequent code listings:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Documents</title>
    <link rel="stylesheet" 
      href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <link rel="stylesheet" 
      href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap-theme.min.css"
    >
    <!-- HTML5 shim and Respond.js for IE8 support of HTML5 
      elements and media queries -->
    <!-- WARNING: Respond.js doesn't work if you view the 
      page via file:// -->
    <!--[if lt IE 9]>
      <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js">
      </script>
      <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js">
      </script>
    <![endif]-->    
  </head>
  <body>

    <div class="container">
      <div class="panel panel-default">
        <div class="panel-heading clearfix">
          <h4 class="pull-left">Documents</h4>
        </div>
      </div>  
      

      <div>
          <!-- main content area -->
      </div>
    </div>
      
    <div class="container">
          <!-- footer area -->
    </div> 
    
  </body>
</html>

This view sets up Bootstrap from a CDN and defines a basic Bootstrap page containing a header, a main content area, and a footer.

With all the pieces in place, you can now start building the application itself.

2

Create a file upload form

Slim works by defining router callbacks for HTTP methods and endpoints. You do this by calling the method that corresponds to the HTTP verb (get() for GET requests, post() for POST requests, and so on) and passing the URL route to be matched as the first argument. The second argument is a function that specifies the actions to be taken when the route is matched to an incoming request.

To see this in action, create a file upload form through which users can submit documents. Begin by defining a /add route and corresponding callback function by adding the code below to $APP_ROOT/public/index.php:

<?php
// Slim application initialization - snipped
// upload form
$app->get('/add', function (Request $request, Response $response) {
  $response = $this->view->render($response, 'add.phtml', 
    array('router' => $this->router));
  return $response;
})->setName('add');

In essence, this tells Slim to respond to GET requests for the /add URL endpoint with the contents of the add.phtml view script. That view script, located at $APP_ROOT/views/add.phtml, should contain the HTML code needed for the file upload form that looks like this:

<?php if (!isset($_POST['submit'])): ?>
<div>
  <form method="post" enctype="multipart/form-data" 
    action="<?php echo $data['router']->pathFor('add'); ?>">
    <input type="hidden" name="MAX_FILE_SIZE" value="300000000" />
    <div class="form-group">
      <label for="upload">File</label>
      <span class="btn btn-default btn-file">
        <input type="file" name="upload" id="upload" />
      </span>
    </div>  
    <div class="form-group">
      <label for="description">Description</label>
      <textarea name="description" id="description" 
        class="form-control" rows="3"></textarea>
    </div>
  <div class="form-group">
    <button type="submit" name="submit" 
      class="btn btn-primary">Add</button>
  </div>          
  </form>
</div>
<?php endif; ?>

This defines a basic PHP file upload form (note the multipart/form-data encoding type and hidden MAX_FILE_SIZE variable). It includes a file upload field and an additional description field for user-specified text.
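
Note that MAX_FILE_SIZE is only a hint to PHP, and the outcome of each upload is reported as a numeric code in $_FILES['upload']['error']. The following is a minimal sketch of how those codes could be turned into readable messages before any further processing. The function name uploadErrorMessage() is hypothetical and not part of the application code in this article; the UPLOAD_ERR_* constants are PHP built-ins.

```php
<?php
// Hypothetical helper: translate a PHP upload error code into a message.
// Returns an empty string when the upload succeeded.
function uploadErrorMessage(int $code): string
{
  switch ($code) {
    case UPLOAD_ERR_OK:
      return '';
    case UPLOAD_ERR_INI_SIZE:
      return 'File exceeds the upload_max_filesize setting in php.ini';
    case UPLOAD_ERR_FORM_SIZE:
      return 'File exceeds the MAX_FILE_SIZE value set in the form';
    case UPLOAD_ERR_PARTIAL:
      return 'File was only partially uploaded';
    case UPLOAD_ERR_NO_FILE:
      return 'No file was uploaded';
    default:
      return 'Upload failed with error code ' . $code;
  }
}
```

A route callback could call uploadErrorMessage($_FILES['upload']['error']) and stop early whenever the result is not an empty string.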

And now, when you access http://pdf-keyword-search.localhost/add in your browser, you should see something like this:

Figure 2. File upload form

The infrastructure for uploading files is now complete. The next step is to process them, which requires learning more about two key services: the Watson Document Conversion service and the AlchemyAPI service.

3

Understand and configure the Document Conversion service

Processing an uploaded document consists of four steps:

  1. Converting it from PDF format to normalized text
  2. Analyzing the text to extract keywords
  3. Storing the keywords
  4. Storing the uploaded PDF

The Document Conversion service takes care of the first of these steps. Part of the Watson Developer Cloud, it provides an API to convert a document from one format to another. Common use cases include cleaning up incorrect HTML markup and processing content into JSON-formatted Answer units suitable for use with other Watson services.

The Document Conversion service accepts an HTML, PDF, or Microsoft Word document (for this article, I'll assume all documents are PDF documents) as input and returns a normalized HTML or plain text version that is easier to analyze. It accepts POST requests to the service endpoint https://gateway.watsonplatform.net/document-conversion/api/v1. Each POST request should contain the document to be converted as a POST body together with a JSON-formatted array that indicates the required output format.
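
As a rough illustration of the configuration part of such a request, the sketch below builds the JSON string with json_encode() instead of writing it by hand. The helper name conversionConfig() is hypothetical; the conversion_target key and the normalized_text value are the ones used in the application code later in this article.

```php
<?php
// Hypothetical helper: build the JSON configuration that accompanies
// the document in a conversion request.
function conversionConfig(string $target): string
{
  return json_encode(array('conversion_target' => $target));
}
```

Calling conversionConfig('normalized_text') produces the string {"conversion_target":"normalized_text"}, which avoids quoting mistakes if you later switch to a different output format.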

To see how this works, initialize a new Document Conversion service instance on Bluemix. Log in to your Bluemix account. Search for and select the Document Conversion service.

Review the description of the service and click to launch it. Ensure that the "Connect to" field is set to "Leave unbound" and that you're using the "Standard Plan". Initially, this service instance will run in an unbound state; this allows the application to be developed locally with the service instance itself hosted remotely on Bluemix.

Figure 3. Document Conversion service initialization

Once the service instance is initialized, you are presented with a service information page. Display the navigation bar on the left and click the "Service Credentials" link to view the username and password for the service instance. Note these credentials as well as the service name, as you will need them in subsequent steps.

Figure 4. Document Conversion service credentials

You can now take the Document Conversion service for a test drive by using a REST client like Postman to send it a sample request. The image below shows an example request and response.

Figure 5. Sample request/response for Document Conversion service

This is also a good time to copy your Document Conversion service credentials to the PHP application. Edit the $APP_ROOT/config.php file and add the credentials to it following the example below:

<?php
$config['settings']['document-conversion']['user'] = "USERNAME";
$config['settings']['document-conversion']['pass'] = "PASSWORD";

4

Understand and configure the AlchemyAPI Keyword Extraction service

Of course, converting the PDF document into plain text is only the beginning. The next step is to send the plain text output to the AlchemyAPI service for analysis and keyword extraction. You can access this service either from the Bluemix catalog (as shown in the previous section) or independently by requesting an API key using the AlchemyAPI website.

The AlchemyAPI service uses natural language processing to semantically analyze unstructured text and return information about it. This information can relate to sentiment or emotion (whether the text was positive, negative or neutral) or it can be a list of extracted keywords, concepts, or relations that can then be linked to news or other data sources. It's an incredibly powerful service to have in your toolkit when working with textual content of any kind.

The AlchemyAPI service works by accepting POST requests to the service endpoint http://gateway-a.watsonplatform.net/calls/text/TextGetRankedKeywords. Each POST request should contain the AlchemyAPI key, together with the text to be analyzed and the required response format, as an array of form parameters.
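
To make the shape of the request body concrete, here is a small sketch that encodes the three form parameters the way an HTML form submission would. The helper name keywordRequestBody() is hypothetical; the parameter names apikey, text and outputMode are the ones used in the application code later in this article, and http_build_query() is a PHP built-in.

```php
<?php
// Hypothetical helper: URL-encode the form parameters for a
// keyword extraction request.
function keywordRequestBody(string $apiKey, string $text): string
{
  return http_build_query(array(
    'apikey' => $apiKey,
    'text' => $text,
    'outputMode' => 'json'
  ), '', '&');
}
```

For example, keywordRequestBody('APIKEY', 'hello world') yields apikey=APIKEY&text=hello+world&outputMode=json. In the application itself, the Guzzle client performs this encoding when the parameters are passed through its form_params option.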

Here's an example of accessing the API with a REST client:

Figure 6. Sample request/response for AlchemyAPI service

Once you've had a chance to verify that it works, update the $APP_ROOT/config.php file with your API key:

<?php
$config['settings']['alchemy']['apikey'] = "APIKEY";

5

Extract keywords from uploaded documents

Once you've understood how both the Document Conversion and AlchemyAPI services work, it's time to write the application code that puts them together and saves the generated keywords to the MongoDB database. Begin by using the Slim DI container to create initialization routines for each service, plus your MongoDB database connection in the $APP_ROOT/public/index.php file:

<?php
// Slim application initialization - snipped

// initialize dependency injection container
$container = $app->getContainer();

// add PHP Mongo client
$container['db'] = function ($container) {
  $config = $container->get('settings');
  $dbn = substr(parse_url($config['db']['uri'], PHP_URL_PATH), 1);
  $mongo = new MongoClient($config['db']['uri'], 
    array("connectTimeoutMS" => 30000));
  return $mongo->selectDb($dbn);
};

// add Alchemy API client
$container['extractor'] = function ($container) {
  $config = $container->get('settings');
  return new Client(array(
    'base_uri' => 'http://gateway-a.watsonplatform.net/calls/',
    'timeout'  => 6000,
  ));
};

// add Watson document conversion client
$container['converter'] = function ($container) {
  $config = $container->get('settings');
  return new Client(array(
    'base_uri' => 'https://gateway.watsonplatform.net/document-conversion/api/',
    'timeout'  => 6000,
    'auth' => array($config['document-conversion']['user'], 
      $config['document-conversion']['pass'])
  ));
};

In the code above, a Guzzle Client object is created for the Document Conversion and AlchemyAPI services, initialized with the service endpoint and credentials where needed. Similarly, a MongoClient object is initialized for MongoDB database interaction using PHP's MongoDB extension. Service endpoints and credentials are sourced from the $APP_ROOT/config.php file.
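
The database name in the code above is derived from the path component of the connection URI. The sketch below isolates that step so you can see what it produces; the function name databaseNameFromUri() and the sample URIs are hypothetical, and the guard for a missing path is an addition that the original listing does not include.

```php
<?php
// Hypothetical helper: extract the database name from a MongoDB URI
// such as mongodb://USERNAME:PASSWORD@HOST:PORT/DATABASE.
function databaseNameFromUri(string $uri): string
{
  $path = parse_url($uri, PHP_URL_PATH);
  // parse_url() returns null when the URI has no path component
  // and false when the URI cannot be parsed at all
  if (!is_string($path)) {
    return '';
  }
  // drop the leading slash, leaving only the database name
  return ltrim($path, '/');
}
```

With the URI mongodb://user:secret@localhost:27017/docs the function returns docs; with a URI that has no path, such as mongodb://localhost:27017, it returns an empty string, which you could treat as a configuration error.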

Next, define a callback to handle POST requests to the /add endpoint responsible for handling PHP file uploads. This callback first checks for a valid file upload and, if found, processes it. Here's the code:

<?php
// Slim application initialization - snipped

// upload processor
$app->post('/add', function (Request $request, Response $response) {

  // get configuration
  $config = $this->get('settings');
  

  try {
    // check for valid file upload
    if (empty($_FILES['upload']['name'])) {
      throw new Exception('No file uploaded');
    }
    
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $type = $finfo->file($_FILES['upload']['tmp_name']);
    if ($type != 'application/pdf') {
      throw new Exception('Invalid file format');    
    }

    // convert uploaded PDF to text
    // connect to Watson document conversion API  
    // transfer uploaded file for conversion to text format
    $apiResponse = $this->converter->post(
      'v1/convert_document?version=2015-12-15', array('multipart' => array(
        array('name' => 'config', 
          'contents' => '{"conversion_target":"normalized_text"}'),
        array('name' => 'file', 
          'contents' => fopen($_FILES['upload']['tmp_name'], 'r'))
    )));
    
    // store response
    $text = (string)$apiResponse->getBody();
    unset($apiResponse);

    // extract keywords from text
    // connect to Watson/Alchemy API for keyword extraction 
    // transfer text content for keyword extraction
    // request JSON output
    $apiResponse = $this->extractor->post(
      'text/TextGetRankedKeywords', array('form_params' => array(
        'apikey' => $config['alchemy']['apikey'],
        'text' => strip_tags($text),
        'outputMode' => 'json'
    )));

    // process response
    // create keyword array
    $body = $apiResponse->getBody(); 
    $data = json_decode($body);
    $keywords = array();
    foreach ($data->keywords as $k) {
      $keywords[] = $k->text;
    }
    
    // save keywords to MongoDB
    $collection = $this->db->docs;
    $q = trim($_FILES['upload']['name']);
    $params = $request->getParams();
    $result = $collection->findOne(array('name' => $q));
    $doc = new stdClass;
    // findOne() returns null when no matching document exists
    if ($result !== null) {
      $doc->_id = $result['_id'];
    }
    $doc->name = trim($_FILES['upload']['name']);
    $doc->keywords = $keywords;
    $doc->description = trim(strip_tags($params['description']));
    $doc->updated = time();
    $collection->save($doc);
    
    $response = $this->view->render($response, 'add.phtml', 
      array('keywords' => $keywords, 
        'object' => trim($_FILES['upload']['name']), 
        'router' => $this->router));
    return $response;
    
  } catch (\GuzzleHttp\Exception\ClientException $e) {
    // in case of a Guzzle exception
    // display HTTP response content
    throw new Exception((string)$e->getResponse()->getBody());
  }

});

There's a lot going on here, so let's step through it:

  • The first few lines of code check that a file was actually uploaded and, if it was, that the file is a PDF document. If either of these conditions fails, an exception is generated.
  • Assuming a valid PDF upload, send a POST request to the Document Conversion service with the uploaded file included. If successful, the response is a normalized plain text version of the PDF stored in the $text variable.
  • Once a plain text version of the PDF content is available, send a POST request to the AlchemyAPI service. This request contains the plain text version of the PDF, the AlchemyAPI key, and the requested output format (JSON). If successful, the response is a JSON-encoded list of keywords, ranked by score.
  • Use the MongoDB client to save the keywords to the database. The code first checks whether a MongoDB document already exists for the given file, using the file name as the identifier. It then builds a new object containing properties for the file name (name), the file description (description), the date and time (updated) and the list of keywords (keywords). If an existing MongoDB document was found with the same file name (for example, if this is an update of an existing PDF document), the new object reuses its _id so that the existing document is overwritten rather than duplicated. The result is then saved to the docs collection in the database.
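
The keyword-processing step described above can be sketched in isolation as follows. This is a defensive variant, not the code from the listing: the function name keywordsFromJson() is hypothetical, and the sample body (including its status and relevance fields and all of its values) is an assumption made up for illustration. The only structure taken from the listing is a keywords array whose entries have a text property.

```php
<?php
// Hypothetical helper: pull the keyword strings out of a JSON response body.
// Returns an empty array if the body is not valid JSON or has no keywords.
function keywordsFromJson(string $json): array
{
  $data = json_decode($json);
  // json_decode() returns null for invalid JSON
  if (!is_object($data) || !isset($data->keywords) || !is_array($data->keywords)) {
    return array();
  }
  $keywords = array();
  foreach ($data->keywords as $k) {
    if (isset($k->text)) {
      $keywords[] = $k->text;
    }
  }
  return $keywords;
}

// Illustrative sample body (assumed shape, invented values)
$sample = '{"status":"OK","keywords":['
  . '{"text":"document storage","relevance":"0.95"},'
  . '{"text":"keyword search","relevance":"0.81"}]}';
```

Passing $sample to keywordsFromJson() returns an array containing "document storage" and "keyword search". Returning an empty array for unexpected bodies keeps the foreach loop in the callback from failing when the service reports an error instead of a keyword list.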

Once these steps are completed, the callback passes the view script a success message and a list of the keywords extracted and saved. Here's what the revised view script, located at $APP_ROOT/views/add.phtml, looks like after adjusting for this change:

<?php if (!isset($_POST['submit'])): ?>
<div>
  <form method="post" enctype="multipart/form-data" 
    action="<?php echo $data['router']->pathFor('add'); ?>">
    <input type="hidden" name="MAX_FILE_SIZE" value="300000000" />
    <div class="form-group">
      <label for="upload">File</label>
      <span class="btn btn-default btn-file">
        <input type="file" name="upload" id="upload" />
      </span>
    </div>  
    <div class="form-group">
      <label for="description">Description</label>
      <textarea name="description" id="description" 
        class="form-control" rows="3"></textarea>
    </div>
  <div class="form-group">
    <button type="submit" name="submit" 
      class="btn btn-primary">Add</button>
  </div>          
  </form>
</div>
<?php else: ?>
<div>
  <div class="alert alert-success">
    <strong>Success!</strong> Your document 
      <strong><?php echo htmlspecialchars($data['object']); ?></strong> was added. 
      <a role="button" class="btn btn-primary" 
      href="<?php echo $data['router']->pathFor('add'); ?>">
      Add another?</a>
  </div>
  <div>
   The following keywords were extracted and stored: 
    <ul>
    <?php foreach ($data['keywords'] as $k): ?> 
      <li><?php echo htmlspecialchars($k); ?></li> 
    <?php endforeach; ?> 
    </ul> 
  </div>
</div>   
<p>This operation made use of data generated through 
  <a href="http://www.ibm.com/smarterplanet/us/en/ibmwatson/">
  IBM Watson</a> and <a href="https://www.alchemyapi.com/">
  AlchemyAPI</a> services.</p>     
<?php endif; ?>

As the code above demonstrates, the view script merely iterates over the list of keywords and displays them, together with a brief success message.

To see this in action, go back to http://pdf-keyword-search.localhost/add in your browser and try uploading a PDF file. If all goes well, your PDF file is converted, analyzed and saved and you see a success message:

Figure 7. File upload results

Conclusion

Now, if you're sharp-eyed, you'll notice that at the beginning of Step 3, I listed four stages in processing a file upload. The code you've seen so far addresses the first three stages, but there's still one missing: saving the actual PDF file. In Part 2 of this article series, I introduce you to the missing piece, the IBM Bluemix Object Storage service that provides scalable, cloud-based infrastructure for file storage and retrieval.

I also walk you through the process of building a search interface for the application so you can begin using it to find documents matching your search keywords. Finally, I show you how to deploy the whole shebang to IBM Bluemix, so that you can get to your documents anytime and from anywhere. Make sure you come back for that!


Published: June 13, 2016