Building flexible apps from big data sources

It's no secret that a significant proportion of the need for big data has come from the explosion in Internet technologies. Until 10-20 years ago, the idea of a public-facing application with more than a few million users was unheard of. Today, even a modest website can have millions of users, and an active one can generate millions of data items every day. The irony is that the very infrastructure and systems that create big data can also work in reverse, providing some of the better ways to integrate and work with that data. Usefully, InfoSphere® BigInsights™ comes with support for managing and executing data jobs through a simple REST API. And through the Jaql interface, we can run queries and get information directly from a Hadoop cluster. This article looks at how these systems work together to give you a rich basis for capturing data and provide an interface to get the information back out again.

Martin C. Brown, Director of Documentation

A professional writer for more than 15 years, Martin (MC) Brown is the author of and contributor to more than 26 books covering an array of topics, including the recently published Getting Started with CouchDB. His expertise spans myriad development languages and platforms: Perl, Python, Java, JavaScript, Basic, Pascal, Modula-2, C, C++, Rebol, Gawk, shell script, Windows, Solaris, Linux, BeOS, Microsoft WP, Mac OS, and more. He currently works as the director of documentation for Continuent.

10 December 2013


Working with applications using REST technology

REST is a simple and easy structure for interacting with specific services and applications. It has its roots in many technologies, including XML-RPC, SOAP, and of course, HTTP, which is now the ubiquitous network transfer method of choice.

InfoSphere BigInsights comes with both Jaql and a suitable deployment interface to make it accessible through the REST interface. To get the Jaql interface running, you first need to install the sample Jaql application, which you can do through the InfoSphere BigInsights console.

InfoSphere BigInsights Quick Start Edition

InfoSphere BigInsights Quick Start Edition is a complimentary, downloadable version of InfoSphere BigInsights, IBM's Hadoop-based offering. Using Quick Start Edition, you can try out the features that IBM has built to extend the value of open source Hadoop, like Big SQL, text analytics, and BigSheets. Guided learning is available to make your experience as smooth as possible, including step-by-step, self-paced tutorials and videos to help you start putting Hadoop to work for you. With no time or data limit, you can experiment on your own time with large amounts of data. Watch the videos, follow the tutorials (PDF), and download BigInsights Quick Start Edition now.

Open the console by going to http://servername:8080 or http://localhost:8080 if you are on the same machine. Click the Applications tab, as shown in Figure 1.

Figure 1. The Applications tab

Now click the Manage link at the top left of the page, as shown in Figure 2, then select the Ad hoc Jaql query application, and click Deploy.

Figure 2. Manage link

Once the application has been deployed, find its end point. In REST, the end point is the URL of the application; InfoSphere BigInsights creates a unique application reference when the application is deployed.

To find the end point, use a REST call to obtain the list of configured applications. You can use any REST client, including a browser. The following examples use the command-line tool curl. First, get the list of applications by accessing the URL http://servername:8080/data/controller/catalog/applications:

$ curl -O http://servername:8080/data/controller/catalog/applications

This creates a file called applications, which contains all the configured and active application details. Viewing that file should show XML containing the application definitions. Look for the Ad hoc Jaql query application, as shown in Listing 1.

Listing 1. Applications file
  <column>Ad hoc Jaql query</column>
  <column>The Ad hoc Jaql Query application runs a custom query 
  entered in the UI to analyze data.</column>

The first <column> block in the application's entry contains the unique application ID, which will be needed for future queries.

To confirm that you've got the right application, request the detailed application information from the REST URL http://servername:8080/data/controller/catalog/applications/applicationID, as shown in Listing 2.

Listing 2. Getting detailed application information about an application
$ curl -O http://servername:8080/data/controller/catalog/applications/applicationID

The code in Listing 2 generates a more detailed XML description, as shown in Listing 3.

Listing 3. Detailed XML description
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<application-template xmlns="">
     <name>Ad hoc Jaql query</name>
     <description>The Ad hoc Jaql Query application runs a custom
 query entered in the UI to analyze data.</description>
      <property uitype="textfield" paramtype="TEXTAREA" 
name="script"  label="Jaql query"
        isRequired="true" isOutputPath="false" isInputPath="false"
		 description=" Ad hoc Jaql query"/>
     <asset type="WORKFLOW" id="Ad hoc Jaql query"/>
</application-template>

Use this detailed information to submit application requests that run a Jaql query. The pertinent information is the list of properties. The output shows that there is just one property: the Jaql query text. We're going to hijack this property so we can run arbitrary Jaql queries.

To run a query, construct an XML file with the property information. The basic structure is shown in Listing 4.

Listing 4. Basic structure of XML file with property information
<runconfig>
  <name>Hello Jaql</name>
  <appid>applicationID</appid>
  <properties>
    <property>
      <name>script</name>
      <value paramtype='TEXTAREA'>'Hello World';</value>
    </property>
  </properties>
</runconfig>

The <appid> is the application ID determined in the earlier step when getting a list of configured applications. The script <value> is the Jaql script you want to execute.

To submit a job, you must send this XML (URL-encoded) as a parameter value to a different REST endpoint, supplying the encoded XML to the runconfig parameter, as shown in Listing 5.

Listing 5. Submitting a job
$ curl -o t.out

To encode the XML, use one of many URL encoding tools or functions, such as the urlencode() function within PHP or the encodeURIComponent() function within JavaScript.
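As an illustration, the assembly and encoding step might look like the following Python sketch. The helper name and the applicationID placeholder are ours, for illustration only; the actual endpoint call is not shown.

```python
from urllib.parse import quote

def build_runconfig(name, appid, script):
    # Assemble the runconfig XML for the Ad hoc Jaql query application
    xml = ("<runconfig><name>" + name + "</name>"
           "<appid>" + appid + "</appid>"
           "<properties><property><name>script</name>"
           "<value paramtype='TEXTAREA'>" + script + "</value>"
           "</property></properties></runconfig>")
    # Percent-encode everything so the XML can travel as a URL parameter
    return quote(xml, safe="")

encoded = build_runconfig("Hello Jaql", "applicationID", "'Hello World';")
print(encoded.startswith("%3Crunconfig%3E"))  # -> True
```

The encoded string can then be supplied as the runconfig parameter value in the submission URL.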

The command in Listing 5 writes the response into a file, t.out, which contains the execution ID and the status as a JSON value, as shown in Listing 6.

Listing 6. Contents of t.out
   "result": {
      "oozie_id": "0000003-131017053452866-oozie-biad-W",
      "status": "OK"

If the status is anything other than "OK," there was a problem with the job that was submitted. Two common problems are that the application ID was invalid or the XML was badly structured or encoded.
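That check can be sketched in Python against the sample response above; the function name is ours, for illustration.

```python
import json

def parse_submit_response(text):
    # Pull the Oozie job ID out of the submission response,
    # failing loudly when the status is anything but "OK"
    result = json.loads(text)["result"]
    if result["status"] != "OK":
        raise RuntimeError("job submission failed: " + str(result["status"]))
    return result["oozie_id"]

t_out = '{"result": {"oozie_id": "0000003-131017053452866-oozie-biad-W", "status": "OK"}}'
print(parse_submit_response(t_out))
```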

The oozie_id that is returned is the job identifier that can be used to obtain the output from the job after it is completed. To get the job details, access the REST endpoint: http://<oozieHost>:<ooziePort>/oozie/v1/job/<oozieid>?show=info. For example, to get the status of the job just submitted, use the code in Listing 7.

Listing 7. Getting the status of the job just submitted
$ curl -o status.out "http://servername:8280/oozie/v1/job/0000003-131017053452866-oozie-biad-W?show=info"

This command produces the file status.out, which contains a JSON representation of the executed job, as shown in Listing 8.

Listing 8. Contents of status.out
   "actions" : [
         "retries" : 0,
         "externalStatus" : "SUCCEEDED",
         "externalId" : "job_201310170523_0004",
         "status" : "OK",
         "trackerUri" : "bivm:9001",
         "toString" : "Action name[jaql1] status[OK]",
         "errorCode" : null,
         "endTime" : "Thu, 17 Oct 2013 12:22:58 GMT",
         "id" : "0000003-131017053452866-oozie-biad-W@jaql1",
         "startTime" : "Thu, 17 Oct 2013 12:22:38 GMT",
         "consoleUrl" : "http://bivm:50030/jobdetails.jsp?jobid
         "transition" : "end",
         "stats" : null,
         "name" : "jaql1",
         "data" : "#\n#Thu Oct 17 08:22:58 EDT 2013\nhadoopJobs=\n",
         "errorMessage" : null,
         "conf" : "<jaql xmlns=\"uri:oozie:jaql-action:0.1\">\r\n  
<configuration>\r\n    <property>\r\n      
<value>true</value>\r\n    </property>\r\n    
<property>\r\n      <name></name>\r\n 
<value>default</value>\r\n    </property>\r\n  
</configuration>\r\n <script>adhoc.jaql</script>\r\n  
<eval>setOptions( { conf: { \"hadoop.job.ugi\": \"biadmin,\" 
 }} );\r\n\t\t\t\tsetOptions( { conf: { \"\": \"biadmin\" }} );\r\n    \t
  'Hello World';;</eval>\r\n</jaql>",
         "externalChildIDs" : null,
         "cred" : "null",
         "type" : "jaql"
   "appPath" : "hdfs://bivm:9000/user/applications/3d420497-e1a6-411f-9644
   "appName" : "jaql-adhoc",
   "externalId" : null,
   "status" : "SUCCEEDED",
   "lastModTime" : "Thu, 17 Oct 2013 12:22:59 GMT",
   "createdTime" : "Thu, 17 Oct 2013 12:22:38 GMT",
   "toString" : "Workflow id[0000003-131017053452866-oozie-biad-W] status[SUCCEEDED]",
   "group" : null,
   "run" : 0,
   "endTime" : "Thu, 17 Oct 2013 12:22:59 GMT",
   "user" : "biadmin",
   "id" : "0000003-131017053452866-oozie-biad-W",
   "startTime" : "Thu, 17 Oct 2013 12:22:38 GMT",
   "consoleUrl" : "http://bivm:8280/oozie?job=0000003-131017053452866-oozie-biad-W",
   "acl" : null,
   "progress" : 1,
   "conf" : "<configuration>\r\n  <property>\r\n    
   </property>\r\n  <property>\r\n    
</property>\r\n  <property>\r\n    
</property>\r\n  <property>\r\n    
<value>'Hello World';</value>\r\n  
</property>\r\n  <property>\r\n    
 </property>\r\n  <property>\r\n   
  </property>\r\n  <property>\r\n    
   "parentId" : null

The critical part of the status.out file is the status line, which shows that the job completed successfully. Also useful is consoleUrl, which links to further information about the job.

The final part of the remote access is to make use of the WebHDFS interface, which provides access to data stored within Hadoop through HTTP. This function is installed and available by default within InfoSphere BigInsights. To test that this function is working, access the following URL: http://servername:14000/webhdfs/v1?op=GETHOMEDIRECTORY&user.name=biadmin

The user.name argument is required and must match a valid user in your Hadoop installation. The op parameter, GETHOMEDIRECTORY, returns the home directory for the user. The information is returned as a JSON object: {"Path":"\/user\/biadmin"}.
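In Python, parsing that response and building a download URL for a file beneath the home directory might look like this sketch (the host, port, and user name are assumptions carried over from the examples above):

```python
import json

# Parse the GETHOMEDIRECTORY response; JSON's \/ escape decodes to /
response = '{"Path":"\\/user\\/biadmin"}'
home = json.loads(response)["Path"]

# Build an OPEN URL for a file beneath the home directory
openurl = ("http://servername:14000/webhdfs/v1" + home +
           "/chicago/chicago.csv?op=OPEN&user.name=biadmin")
print(openurl)
```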

To actually download a file from HDFS, use the OPEN operation. For example, to download the file chicago.csv from the chicago directory of the biadmin user, use Listing 9.

Listing 9. Downloading a file from HDFS
$ curl -o chicago.csv "http://servername:14000/webhdfs/v1/user/biadmin/chicago/chicago.csv?op=OPEN&user.name=biadmin"

With this basic set of REST and HTTP interfaces, you have a good sequence for running arbitrary Jaql queries on your data:

  1. Submit a job to the Jaql application service using XML through REST.
  2. Check for the job status through REST.
  3. Download the file generated using WebHDFS through REST.
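The sequence above can be sketched as a small driver in Python. The HTTP calls are injected as callables so the flow can be exercised without a live cluster; the function and parameter names here are ours, not part of the product.

```python
import json
import time

def run_jaql_job(submit, poll, download, outfile, interval=5, tries=60):
    # submit() posts the encoded runconfig; poll(jobid) fetches the Oozie
    # job info; download(name) reads the output file over WebHDFS
    result = json.loads(submit())["result"]
    if result["status"] != "OK":
        raise RuntimeError("submission failed")
    jobid = result["oozie_id"]
    for _ in range(tries):
        if json.loads(poll(jobid))["status"] == "SUCCEEDED":
            return download(outfile)
        time.sleep(interval)
    raise RuntimeError("job did not finish in time")

# Exercise the flow with canned responses
out = run_jaql_job(
    submit=lambda: '{"result": {"oozie_id": "job-1", "status": "OK"}}',
    poll=lambda jobid: '{"status": "SUCCEEDED"}',
    download=lambda name: '{"region": 26, "#1": 29.47}',
    outfile="output.json",
    interval=0)
print(out)
```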

Before trying out this sequence, take a quick look at processing data through Jaql.

Loading and reading data through Jaql

Jaql is a small language designed for data processing. It works by reading data directly from a source (through any accessible means), processing and parsing the content, running a query on that information, and writing the results back. It also includes support for reading from and writing to HDFS.

Although in practice Jaql can read and write to a variety of data stores, the best performance and processing is achieved when Jaql can read data in parallel from the store. Jaql actually accepts information from the I/O layer about whether the data is read in serial or parallel. This ability makes it ideal for processing information from HDFS, especially if the information is widely distributed across a large cluster.

At the most basic level, you can read and write data within Jaql using the read() and write() functions. The easiest way to try Jaql at this level is to use the jaqlshell (/opt/ibm/biginsights/jaql/bin/jaqlshell), which provides an interactive interface where statements can be executed from the Jaql prompt, as shown in Listing 10.

Listing 10. The Jaql prompt
[biadmin@bivm ~]$  /opt/ibm/biginsights/jaql/bin/jaqlshell

For example, to read a local file, use read('file:///chicago.csv');. To read the file from HDFS, use read(hdfs('chicago.csv'));.

By default, Jaql expects to read a Hadoop sequence file, but Jaql also includes specific parsers for handling different file formats, including CSV, JSON, and others. This flexible model offers significant advantages for reading and writing data in different file formats and different destinations. For example, to read in a delimited file, you use the del() function: read(del('chicago.csv'));. This reads the data in, identifies the content, and puts it into an internal JSON structure, as shown in Listing 11.

Listing 11. Putting it into an internal JSON structure
    "03/12/2011 12:20:44 AM",
    "03/12/2011 12:20:44 AM",

This format is useful, but to go one stage further, assign field names to the individual fields in the file. In the long run, field names make the data more malleable because they enable us to query and run code based on the field, rather than on an implied column number. Listing 12 shows how to assign field names.

Listing 12. Assigning field names to the fields in the file
read(del('chicago/chicago.csv',{ schema: schema { logdate: string, region: long, 
buscount: long, logreads: long, speed: double}}));

The benefit of using text or JSON files is that they can be read and written in a parallel fashion — ideal when used with HDFS and Hadoop. Note the use of different types here. The raw date string in the input (taken from the Chicago Traffic Tracker) is not date information in JSON format, so the date type cannot be used without further processing. Jaql includes a complete parser for handling this data if you need this level of detail.

The output then becomes an array of records, as shown in Listing 13.

Listing 13. An array of records
    "logdate": "02/11/2013 09:51:23 AM",
    "region": 26,
    "buscount": 54,
    "logreads": 910,
    "speed": 26.59
    "logdate": "02/11/2013 09:51:23 AM",
    "region": 27,
    "buscount": 27,
    "logreads": 336,
    "speed": 30.0

To use the information within Jaql for further processing, assign it to a variable, as shown in Listing 14.

Listing 14. Assigning the information to a variable
x = read(del('chicago/chicago.csv',{ schema: schema { logdate: string, 
region: long, buscount: long, logreads: long, speed: double}}));

The flexible nature of Jaql in this respect is useful outside of Hadoop. For example, Jaql can be used to read data from the local file system and write data into HDFS, or to convert a text or CSV file into JSON in the process. In our web application, we can take advantage of that flexibility to load data from HDFS, process it, and write out a JSON file, which the web interface can then consume and display more easily as the results of the query.

After you have the reference to the data and it has been transformed internally, it can be written out as a sequence file using x -> write(seq('chicago.seq'));, or explicitly converted to proper JSON in a file using the shorthand jsonText() function: x -> write(jsonText('chicago.json'));. Check the Jaql documentation for more complex examples of reading and writing binary, sequence, JSON, and other file formats.

We'll use the JSON output format when writing a query from the web interface to write a JSON file we can then access using WebHDFS.

Executing Jaql queries

Once we have the data within Jaql — by explicitly or implicitly reading the data — transformations and queries can be executed on the data structure. The transformations can be quite simple, or they can be full SQL-like queries based on our parsed data structure.

We'll skip the basic transformations as they are less useful to us than performing an SQL statement on the data. The defined fields from our processing of the data become the fields you can select and query on, and the variable (x in the above examples) is the table. Thus, we can perform basic queries that select specific fields, as shown in Listing 15.

Listing 15. Performing basic queries that select specific fields
jaql> SELECT region FROM x;
[
  {
    "region": 3
  },
  {
    "region": 4
  },
  {
    "region": 5
  },
  ...
]

More complex queries can involve functions and grouping, as shown in Listing 16.

Listing 16. Performing complex queries involving functions and grouping
jaql> SELECT region,avg(speed) FROM x GROUP BY region;
[
  {
    "region": 26,
    "#1": 29.466060606060772
  },
  {
    "region": 27,
    "#1": 28.768625178975057
  },
  {
    "region": 28,
    "#1": 21.32377419211978
  },
  {
    "region": 29,
    "#1": 19.889688925634466
  }
]

If you have loaded multiple files into internal structures, joins can be performed to combine the data.

The output can be assigned to a variable, and we can then write the result of the query out to a file by putting it all together in one script, as shown in Listing 17.

Listing 17. Assigning the output to a variable and writing the result of the query to a file
x = read(del('chicago/chicago.csv',{ schema: schema { logdate: string, 
region: long, buscount: long, logreads: long, speed: double}}));

y = SELECT region,avg(speed) FROM x GROUP BY region;

y -> write(jsonText('output.json'));

The script in Listing 17 performs three steps: It reads in the source file, performs the query, then writes the information out to a file in JSON format.
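For comparison, the GROUP BY average from the query can be sketched in plain Python over the two sample records from Listing 13. This is a toy stand-in to show what the query computes, not how BigInsights executes it.

```python
from collections import defaultdict

records = [
    {"logdate": "02/11/2013 09:51:23 AM", "region": 26, "buscount": 54,
     "logreads": 910, "speed": 26.59},
    {"logdate": "02/11/2013 09:51:23 AM", "region": 27, "buscount": 27,
     "logreads": 336, "speed": 30.0},
]

# Accumulate (sum, count) of speed per region, then divide
totals = defaultdict(lambda: [0.0, 0])
for r in records:
    totals[r["region"]][0] += r["speed"]
    totals[r["region"]][1] += 1
averages = {region: s / n for region, (s, n) in totals.items()}
print(averages)
```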

We now have the basic structure for how to submit jobs, access their status, write queries, and read the resulting output file.

Building a web interface to your big data store

With all the pieces in place, we can write a basic HTML interface to our InfoSphere BigInsights server that executes arbitrary Jaql scripts and writes out the data to be viewed within the page. The HTML for this interface is shown in Listing 18.

Listing 18. HTML for the interface
<title>JAQL Query for Chicago Traffic Tracker</title>
<script src="jquery.js"></script>
<script src="work.js"></script>
<h1>JAQL Query Executor</h1>

<div><textarea id="query" name="query" rows="10" cols="80"></textarea></div>
<div>Output Filename: <input id="filename" name="filename"/></div>
<a href="#" onclick="runquery();">Run Query</a>
<div id="status"></div>
<div id="result"></div>


We're going to use jQuery to perform AJAX-style loading of data.

The basic structure provides two input boxes: one for the Jaql query text and the other for the name of the file to hold the output information. We don't want to parse the filename out of the script, so we ask for it separately to be sure we load the right one. A simple link is used to execute the query, and a status and result DIV is used to hold the current activity and the result file information.

Figure 3 shows the page, populated with the script information, before Run Query is clicked.

Figure 3. Application before clicking Run Query

The JavaScript behind the application is split into four main blocks.

The first defines some global variables used by the application: the job ID (once it has been submitted), the filename (used to load the results), and a variable to hold the interval timer object that will be used while waiting for the job to complete.

The main function in this section is the runquery() function, called when Run Query is clicked. The runquery() function updates the status and calls the submitquery() function, which actually does the work, as shown in Listing 19.

Listing 19. Updating the status and calling the submitquery() function
var jobid;
var checkinterval;
var filename;

function runquery() {
    $('#status').html('Executing remote query');
    submitquery();
}

The second block contains the definition of the submitquery() function. This function reads the contents of the TEXTAREA containing the Jaql script to be executed, embeds it in the XML for the run configuration, then submits, through the jQuery ajax() function, the job to the application server.

If the submission fails, we report the error, but if it is successful, we extract the Oozie job ID from the returned JSON structure, run the checkjobstatus() function to get the current running status, and create an interval timer that will check the status every 5 seconds by calling the same checkjobstatus() function. Remember that the job submission is through a REST request containing the XML job specification and that the specification must be escaped (using encodeURIComponent()), as shown in Listing 20.

Listing 20. The submitquery() function
function submitquery() {
    var startfrag = "<runconfig><name>Jaql Remote</name>" +
        "<appid>applicationID</appid><properties><property>" +
        "<name>script</name><value paramtype='TEXTAREA'>";
    var endfrag = "</value></property></properties></runconfig>";

    var jobspec = startfrag + $('#query').val() + endfrag;

    // serverbase holds the application server's job-submission URL
    var remoteurl = serverbase +
        "?actiontype=run_application&runconfig=" + encodeURIComponent(jobspec);

    $.ajax({url: remoteurl,
            error: function() {
                $('#status').html("Error submitting job");
            },
            success: function(data) {
                jobid = data.result.oozie_id;
                $('#status').html("Job Submitted: " + data.result.status
                    + " ID: " + jobid);
                checkinterval = setInterval(checkjobstatus, 5000);
            }});
}

The checkjobstatus() function uses the REST interface for getting job status, using the job ID extracted in the previous step. Because this function is called on a 5-second interval, it must be self-contained: it submits the REST request, updates the status, and, if the job has succeeded, switches off the interval timer and calls getoutputfile() to retrieve the query output generated by the Jaql script, as shown in Listing 21.

Listing 21. checkjobstatus() function
function checkjobstatus() {
    // oozieBase holds the Oozie server's base URL
    var checkurl = oozieBase + "/oozie/v1/job/" + jobid + "?show=info";

    $.ajax({url: checkurl,
            error: function() {
                $('#status').html("Error getting status");
            },
            success: function(data) {
                var jobstatus = data.status;
                if (jobstatus) {
                    $('#status').html("Current Job Status: " + jobstatus);
                    if (jobstatus == 'SUCCEEDED') {
                        clearInterval(checkinterval);
                        getoutputfile();
                    }
                }
                else {
                    $('#status').html("Current Job Status: Unknown");
                }
            }});
}

The last function uses WebHDFS to access the file generated by the script, after the script has executed successfully, as shown in Listing 22.

Listing 22. getoutputfile() function
function getoutputfile() {
    filename = $('#filename').val();
    // webhdfsBase holds the WebHDFS endpoint URL
    var fileurl = webhdfsBase + "/webhdfs/v1/user/biadmin/" + filename +
        "?op=OPEN&user.name=biadmin";

    $.ajax({url: fileurl,
            error: function() {
                $('#status').html("Error getting result file");
            },
            success: function(data) {
                $('#result').html(data);
            }});
}

The sequence of execution is fundamentally straightforward:

  1. Enter the query.
  2. Click Run Query.
  3. Submit the job to the application server.
  4. Check the job status through REST.
  5. Repeat Step 4 until the job status is SUCCEEDED.
  6. Download the generated output file and display it.

Assuming that your Jaql script was OK, you should get output similar to that in Figure 4.

Figure 4. Output


InfoSphere BigInsights comes with an impressive array of applications to configure and run whatever scripts you need. By making use of the standard REST interfaces provided to these systems, we can build a completely web-based interface to Hadoop and to the underlying data processing function without having to write complex code or develop MapReduce functions. Using Jaql gives us the flexibility to run arbitrary queries on our data, even if the data is not formatted in a directly addressable format, by easily translating it into a structure that can be processed through the SQL-like interface. With a little JavaScript and jQuery magic, the entire interface is small and compact enough to run anywhere.


