Smarter collaboration for the education industry using Lotus Connections, Part 4: Use IBM Content Analytics to crawl, analyze, and display unstructured data

In this article, continue building an example application that augments IBM Lotus® Connections Profiles pages with information about research grant awards and academic research interests. Unstructured source data is gathered and persisted to a text analytics collection point. Learn about the server-side implementation of the custom widgets, and how they employ IBM Content Analytics analysis. You'll also build a custom crawler. With Lotus Connections and IBM Content Analytics you can crawl, analyze, and display unstructured data.

Ilya Afanasiev, Software Engineer, IBM

Ilya Afanasiev is a software engineer with over seven years of successful work experience and a rich skill set in various fields, including software research, development, support, and quality assurance. He joined IBM in 2007 as a z/OS Systems Programmer in the Russian Systems and Technology Laboratory. His current focus is web UI development and prototyping, information retrieval, and Java EE application development. His current research interests are text mining and unstructured text analysis.



Carl Osipov, Software Architect, IBM

Carl Osipov is an experienced software architect with the Strategy and Technology organization in IBM Software Group. His skills are in the areas of distributed computing, speech application development, and computational natural language understanding. He has published and presented on Service-Oriented Architecture and conversational dialog management to peers in industry and academia. His current focus is on the design of reuse techniques for composite business services.



01 March 2011


Introduction

Part 3 of this series, Use Profiles to share research interests and grant awards, explored how to augment IBM Lotus® Connections Profiles pages with information about research grant awards and academic research interests. An example application showed how to gather and persist source data to a text analytics collection point. The user interface (UI) was created with widgets and Lotus Connections. The presentation, or "web" tier of the architecture supported the application's interaction with a user.

This article expands the example Profiles application extension and its multiple conceptual tiers: web, business, integration, and data. Learn to integrate the server-side functions with IBM Content Analytics to analyze unstructured web content.

The Profiles application extension has two major parts:

Java™ EE application
A Java Enterprise Edition (Java EE) application is deployed to WebSphere Application Server as an EAR file. The Java EE application's functions, also called server-side functions, include:
  • Two Java servlets that process requests generated by the My Grant Awards and My Keywords widgets. Both servlets generate XML-formatted responses.
  • A custom crawler that works with the Tracking Accountability in Government Grants System (TAGGS) advanced search page, and generates HTTP POST requests to retrieve the list of grant awards for an arbitrary researcher.
  • Unstructured text analysis functions that interact with a Content Analytics standalone application through a Java API. The functions are used by the Java servlets to retrieve the results of unstructured text analysis performed by Content Analytics.
IBM Content Analytics
This product includes crawl, parse, index, search, and unstructured text analysis capabilities. Though Content Analytics provides a range of complex text analytics capabilities, this article focuses on simple term frequency analysis. At the time of writing, the authors used IBM Content Analytics version 2.1, which exposes its capabilities to the Java EE application through the Content Analytics Java API. Starting with release 2.2, IBM Content Analytics also offers convenient REST APIs for interacting with its functionality.

Figure 1 shows the data processing flow to retrieve grant titles used by the My Grant Awards widget. It also shows the processing flow to retrieve keywords with their associated frequencies used by the My Keywords widget. (See Part 3 of this series for details on the My Grant Awards and My Keywords widgets.)

Figure 1. Data processing flow
Data processing flow

Server-side functions

The server-side functions of the My Keywords and My Grant Awards widgets are implemented using Java Servlet technology. Widgets access server-side functions through standard HTTP GET or HTTP POST requests to servlets. The servlets then generate XML-formatted responses.

The My Grant Awards and My Keywords servlets employ a custom web crawler to retrieve titles of awarded grants from a third-party website. As shown in Figure 1, the retrieved information is used by both servlets in two ways:

  • To retrieve awarded grant titles by the My Grant Awards servlet.
  • To analyze the grant titles using Content Analytics to extract key words, and their associated frequencies, for further retrieval by the My Keywords servlet.

The My Keywords servlet employs Content Analytics analysis for two types of text sources:

  • Titles of awarded research grants, extracted from the TAGGS website.
  • Articles extracted from the Public Library of Science (PLoS) Biology Journal website.

Both sources represent unstructured text content.


My Grant Awards servlet

The My Grant Awards widget displays the list of awarded grant titles for a profile owner. In this article, the My Grant Awards servlet accepts only one parameter: the ID of the person whose profile is displayed on a Profiles page. For simplicity, we use the profile owner's full name in place of the ID throughout the rest of this article.

The interface for the My Grant Awards servlet has the following syntax.

{path_to_servlet}?investigator=<string>
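For example, a request for a hypothetical profile owner named John Smith, and a request that triggers the maintenance mode described later in this section, might look like the following. The name and path are illustrative only; spaces in the full name are URL-encoded.

{path_to_servlet}?investigator=John%20Smith
{path_to_servlet}?investigator=maintenance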

Servlet implementation

The Java source code in Listing 1 implements the doGet method for the My Grant Awards servlet. The My Grant Awards servlet accepts a single parameter, on line 2 below, whose value is stored in the principalInvestigator string variable. Further processing of the HTTP GET request is done according to the value of this variable.

Listing 1. My Grant Awards servlet, doGet implementation
1  protected void doGet(HttpServletRequest request, HttpServletResponse response)
       throws ServletException, IOException {
2      String principalInvestigator = request.getParameter(INVESTIGATOR);
3      response.setContentType(XMLUtil.CONTENT_TYPE);
4      PrintWriter pw = response.getWriter();
5      String grantsXml = "";
6      try {
7          if(db2Connection == null) {
8              /* ... handle error condition ... */
9          } else if(principalInvestigator == null || principalInvestigator.isEmpty()) {
10             /* ... handle error condition ... */
11         } else {
12             if(principalInvestigator.equals("maintenance")) {
13                 GAMaintenanceStatus st = mgr.maintainAwards(db2Connection);
14                 switch(st) {
15                     /* ... modify grantsXml string according to maintenance status ... */
16                 }
17             } else {
18                 grantsXml += mgr.retrieveAwardsXML(principalInvestigator);
19             }
20         }
21     } catch(Exception e) {
22         /* ... handle any occurred exception ... */
23     } finally {
24         /* ... update servlet response in any case ... */
25     }
26 }

All Profiles user names are stored in a DB2 table used by the server side of the Profiles application. To access the table, the My Grant Awards servlet uses a DB2 connection, as shown on lines 7 and 13 in Listing 1, with the variable name db2Connection.
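The article does not show this lookup code. The following is a minimal sketch of how db2Connection might be used for it, assuming hypothetical table and column names (EMPINST.EMPLOYEE, PROF_DISPLAY_NAME, PROF_UID, and the userKey variable are illustrative; the actual Profiles database schema may differ).

// Minimal sketch: resolve a Profiles user's full name from DB2 (uses java.sql.*).
// Table, column, and key names below are assumptions for illustration only.
String sql = "SELECT PROF_DISPLAY_NAME FROM EMPINST.EMPLOYEE WHERE PROF_UID = ?";
PreparedStatement stmt = db2Connection.prepareStatement(sql);
try {
    stmt.setString(1, userKey);               // key for the user, for example from LDAP
    ResultSet rs = stmt.executeQuery();
    if (rs.next()) {
        String fullName = rs.getString(1);    // full name used as the investigator value
        /* ... use fullName ... */
    }
    rs.close();
} finally {
    stmt.close();
}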

The servlet can process two types of requests:

  • To retrieve the list of awarded grants for a person (line 18 in Listing 1, method retrieveAwardsXML). This request type is used when the HTTP GET request contains the full name of a profile owner.
  • To get the most current list of awarded grant topics from the custom crawler and save it to the local file system (line 13 in Listing 1, method maintainAwards).

The maintenance is done when the HTTP GET request contains the maintenance keyword. Both of these requests execute the code in the custom crawler.

Custom crawler to perform HTTP POST

The custom crawler retrieves data from a third-party website by generating an HTTP POST request and parsing the returned data. Parsed data is stored in the local file system as HTML files, in which the list of awarded grant topics is represented by native HTML tables. A unique HTML file exists for each registered Profiles user.

Locally stored HTML files are used in two ways: to extract keywords with their associated frequencies from the files, and to display the contents of the files using the My Grant Awards widget.

We had to develop a custom crawler because Content Analytics doesn't support crawling of web pages generated dynamically in response to submitted HTML forms.


My Keywords servlet

As discussed in Part 3, Use Profiles to share research interests and grant awards, the My Keywords widget has two types of content: content used purely for visualization, and content for modifying the parameters of that visualization. In line with this classification, the My Keywords servlet processes requests in two groups:

  • HTTP GET requests to generate or refresh the data used to visualize components of the widget.
  • HTTP POST requests to modify the parameters used to generate this data.

GET request handler

The interface of the My Keywords servlet has the syntax to handle HTTP GET requests, as shown in Listing 2.

Listing 2. My Keywords servlet interface for HTTP GET request
{path_to_servlet}?investigator=<string>
 [&threshold=<integer>]
 [&extract_from=<comma-separated strings>]
 [&compare_with=<string>]

According to the interface, the only mandatory parameter passed to the servlet is investigator=<string> which, similar to the My Grant Awards servlet, specifies the name of a profile owner. The other parameters are optional, as outlined below.

Optional parameters and their roles:

extract_from=<comma-separated strings>
Determines the web sources used for keyword extraction. By default, only awarded grant topics are used for extraction.

threshold=<integer>
Determines the minimum threshold for a keyword's frequency of occurrence. For example, if the threshold equals 2, then threshold=2 is passed along with the other parameters to the servlet, and all keywords that occur fewer than 2 times in the text analyzed by IBM Content Analytics are excluded from the list of keywords and frequencies.

compare_with=<string>
Determines the name of another profile owner. Keywords extracted for this person are compared with keywords extracted for the person currently logged in to the Profiles application, so that words common to both profiles can be highlighted.
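Putting the parameters together, a request for a hypothetical profile owner named John Smith, with a frequency threshold of 2 and a comparison against another hypothetical profile owner, might look like the following. The extract_from values are assumed here to mirror the terms_sources values described in the next section.

{path_to_servlet}?investigator=John%20Smith&threshold=2
    &extract_from=grants,publications&compare_with=Jane%20Doe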

POST request handler

The My Keywords servlet accepts the parameters shown in Listing 3 to handle HTTP POST requests.

Listing 3. My Keywords servlet interface for HTTP POST request
investigator=<string>
 updated_stopwords_list=<comma-separated strings>
 terms_sources=<comma-separated strings>

All parameters are mandatory, as discussed below.

Mandatory parameters and their roles:

investigator=<string>
Identifies a profile owner.

updated_stopwords_list=<comma-separated sub-strings>
Identifies the user-defined stop words to be used for filtering out unwanted keywords.

terms_sources=<comma-separated sub-strings>
Identifies the updated sources to use for keyword extraction. Possible values are any combination of the grants and publications sub-strings, separated by a "," (comma) sign.
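For example, the body of an HTTP POST request for a hypothetical profile owner, with two user-defined stop words and both sources enabled, might look like the following (the name and stop words are illustrative only).

investigator=John+Smith
updated_stopwords_list=study,analysis
terms_sources=grants,publications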

Servlet implementation

Listing 4 shows the Java source code for the doGet method implementation for the My Keywords servlet. In Listing 4:

  • Lines 2 to 7: Initialize the variables that are assigned according to the parameter values received with the HTTP GET request.
  • Line 8: The termsFrequencyMap hash map is used to store keywords (or terms), with their frequency of occurrence, extracted from the text.
  • Line 13: The servlet generates a response in the form of XML-formatted text.
  • Lines 15 to 24: Initialize the set of stop words to be used for filtering unwanted terms, and the list of sources to be used for terms extraction.
  • Lines 26 to 45: Implement the interaction of server-side functions with IBM Content Analytics, encapsulated by the CcaAnalyzer class. Details of the class implementation are discussed below.
  • Lines 32 to 37: Handle the case where awarded grant topics were selected among the sources for extracting terms and their frequencies.

    A specific uniform resource identifier (URI) prefix is used to call the getTermsAndFreqs method (lines 33, 35). This URI prefix, combined with a file name built from the principalInvestigator variable, is used by IBM Content Analytics to access the analyzed documents. The URI prefix ensures that IBM Content Analytics uses the analysis results of the awarded grant topics during terms extraction. As soon as terms with their frequencies are extracted from the text, termsFrequencyMap is updated. This part of the source code is omitted in Listing 4 (line 36).

Listing 4. My Keywords servlet, doGet implementation
1  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
       throws ServletException, IOException {
2      String principalInvestigator = (req.getParameter(INVESTIGATOR) == null) ?
3          "" : req.getParameter(INVESTIGATOR);
4      String personToCompareWith = (req.getParameter(COMPARE_WITH) == null) ?
5          "" : req.getParameter(COMPARE_WITH);
6      String frequencyThreshold = (req.getParameter(THRESHOLD) == null) ?
7          "0" : req.getParameter(THRESHOLD);
8      HashMap<String, String> termsFrequencyMap = new HashMap<String, String>();
9
10     String usedSources;
11     Set<String> stopWords;
12
13     resp.setContentType(XMLUtil.CONTENT_TYPE);
14
15     try {
16         TermsInfo termsInfo = allPersonsInfo.get(principalInvestigator);
17         if(termsInfo.exists()) {
18             usedSources = termsInfo.getSources();
19             stopWords = termsInfo.getStopWords();
20         } else {
21             usedSources = DEFAULT_SOURCES;
22             stopWords = getDefaultStopWords();
23         }
24     } catch(Exception e) { /* ... handle exceptions ... */ }
25
26     try {
27         CcaAnalyzer ccaAnalyzer;
28         ccaAnalyzer = new CcaAnalyzer(Integer.parseInt(frequencyThreshold));
29         ccaAnalyzer.initialize(ccaConfigFile);
30
31         String normalizedName = principalInvestigator.replace(' ', '+');
32         if(usedSources.contains(AWARDS_SOURCES)) {
33             String uriPrefix = AWARDS_URI_PREFIX;
34             String terms_and_freqs;
35             terms_and_freqs =
                   ccaAnalyzer.getTermsAndFreqs(uriPrefix, normalizedName, stopWords);
36             /* ... code to update termsFrequencyMap is omitted ... */
37         }
38         if(usedSources.contains(OTHER_POSTS)) {
39             String uriPrefix = OTHER_URI_PREFIX;
40             /* code similar to AWARDS_SOURCES block */
41         }
42         if(usedSources.contains(JOURNAL_ARTICLES)) {
43             String uriPrefix = JOURNAL_ARTICLES_URI_PREFIX;
44             /* code similar to AWARDS_SOURCES block */
45         }

46         /* here the output is updated with an XML section
               describing extracted terms and frequencies */

47         String cloudJson = getCloudJSON(principalInvestigator, personToCompareWith);
48         /* code to add an XML section with JSON to be used with the Tag Cloud widget */
49     } catch (Exception e) { /* ... handle exceptions ... */
50     } finally {
51         /* ... code to finalize XML markup / release response is omitted ... */
52     }
53 }

Terms and their frequencies are extracted from different sources, as shown on lines 38 to 45 above. Source code representing terms extraction from different sources is almost identical to the code shown on lines 33-36, so it is not included in the listing.

As soon as extraction of terms and frequencies from different sources is finished (line 46 in Listing 4), the output is updated to include the pairs of extracted terms and their frequencies.

The Dojo Tag Cloud widget, which is part of the My Keywords widget, uses a JSON source to render its contents. Thus, terms with their frequencies in JSON format are included in the My Keywords servlet response (line 47 in Listing 4).
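The article does not show the getCloudJSON implementation. As a rough illustration, a simplified helper along the following lines could serialize the contents of termsFrequencyMap into JSON. The method name and the term/frequency field names are assumptions; the exact JSON layout expected by the Tag Cloud widget is defined on the client side (see Part 3 of this series).

import java.util.Map;
/* ... */
// Minimal sketch (assumed field names): serialize terms and frequencies as a JSON array.
private String buildCloudJson(Map<String, String> termsFrequencyMap) {
    StringBuilder json = new StringBuilder("[");
    boolean first = true;
    for (Map.Entry<String, String> entry : termsFrequencyMap.entrySet()) {
        if (!first) {
            json.append(",");
        }
        json.append("{\"term\":\"").append(entry.getKey())
            .append("\",\"frequency\":").append(entry.getValue()).append("}");
        first = false;
    }
    return json.append("]").toString();
}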

IBM Content Analytics API and code for terms frequency extraction

Listing 5 shows a simplified version of the Java source code for the getTermsAndFreqs method, which represents the core of the CcaAnalyzer class implementation. An instance of this class is used by the My Keywords servlet, as shown in Listing 4 on lines 27 to 29; an example of the getTermsAndFreqs method invocation is shown on line 35 of Listing 4.

Listing 5. Simplified source code for getTermsAndFreqs implementation
1  public Map<String, Integer> getTermsAndFreqs(String uriPrefix, String normalizedName,
       Set<String> stopWords) {
2      Class<?> fetchClass = Class.forName("com.ibm.es.api.fetch.RemoteFetchFactory");
3      FetchServiceFactory fetchFactory =
           (FetchServiceFactory) fetchClass.newInstance();
4      config = new Properties();
5      config.load(new FileInputStream(configPath));
6      ApplicationInfo applicationInfo =
           fetchFactory.createApplicationInfo(applicationName);
7      FetchService fetchService = fetchFactory.getFetchService(config);
8      Fetcher fetcher = fetchService.getFetcher(applicationInfo, collectionID);
9      fetcher.getCollection();
10
11     String resourceURI = uriPrefix + "/" + normalizedName;
12     SummaryRequest summaryRequest =
           fetchFactory.createSummaryRequest(resourceURI, null);
13     summaryRequest.setFormat("miml");
14     InputStream in = fetcher.getSummary(summaryRequest);
15     if(in != null) {
16         String inputString;
           /* [skipped]:
              build the string filled from the contents of the 'in' input stream */
17         InputSource source = new InputSource(new StringReader(inputString));

18         com.ibm.es.takmi.impl.std.document.DocumentHandler handler =
               new DocumentHandler();
19         SAXParser parser = new SAXParser();
20         parser.setContentHandler(handler);
21         parser.parse(source);

22         com.ibm.es.takmi.impl.std.document.Document mimlDoc =
               handler.getDocument();
23         if (mimlDoc.getDocumentID() != null) {
24             Map<String, Integer> m = determineFrequencyOfKeywords(mimlDoc, stopWords);
25             return m;
26         } else {
27             return new HashMap<String, Integer>(); // return empty map
28         }
29     }
30 }

The getTermsAndFreqs method in Listing 5 accepts the following arguments as input parameters:

  • String uriPrefix: Contains the first part of the document source's URI, which is to be used for terms and frequencies extraction from the document.
  • String normalizedName: Contains the full name of the person whose profile is currently observed, normalized so that all white spaces between the first, middle, and last names are replaced with "+" (plus) signs.
  • Set<String> stopWords: Contains the stop words used to refine the results of terms extraction by filtering out any extracted word that resides in the stopWords set.

The path to the configuration file, configPath, is initialized prior to the getTermsAndFreqs invocation, during the IBM Content Analytics analyzer initialization phase (line 29 in Listing 4). The same phase initializes the collectionID variable (used on line 8 of Listing 5), which identifies the IBM Content Analytics document collection to be used during keyword extraction. Each resource within the collection is identified by a URI.
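For instance, the resource URI built on line 11 of Listing 5 is simply the concatenation of the prefix and the normalized name. The values below are hypothetical and for illustration only.

// Hypothetical values for illustration only
String uriPrefix = "file:///C:/grantawards";            // assumed value of AWARDS_URI_PREFIX
String normalizedName = "John+A+Smith";                  // "John A Smith" with spaces replaced by '+'
String resourceURI = uriPrefix + "/" + normalizedName;   // "file:///C:/grantawards/John+A+Smith"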

As shown in Listing 5, the getTermsAndFreqs implementation starts by instantiating the FetchServiceFactory class into the fetchFactory variable (lines 2 and 3). The fetchFactory variable is then used to instantiate the:

  • com.ibm.siapi.common.ApplicationInfo interface (line 6), used for authentication and access control purposes, to verify access to IBM Content Analytics collections.
  • com.ibm.es.fetch.FetchService interface (line 7), which contains generic fetching services and is also used to create a Fetcher instance (line 8).
  • com.ibm.es.fetch.SummaryRequest interface (line 12), which is used to actually fetch a resource identified by its URI (lines 13 and 14).

Lines 4 and 5 in Listing 5 illustrate the instantiation of the java.util.Properties class that is used to load IBM Content Analytics configuration options from the local file system. The fetchService is then created using this initialized Properties instance (the config variable, line 7).

As soon as a non-empty resource summary is fetched using a pre-created summary request as the input stream (line 14), it is then processed to:

  • Retrieve terms.
  • Determine their frequencies from the resource summary.
  • Return the result to the caller as a map of terms to frequencies (lines 15 to 29).

The resource summary is formatted using the MIning Markup Language (MIML), which uses XML tags. Resource summary processing continues by parsing the input stream with a SAXParser (lines 19 to 21 in Listing 5). Parsing is followed by creation of the com.ibm.es.takmi.impl.std.document.Document class instance (named mimlDoc, line 22), which is used to retrieve terms with their associated frequencies, returned as a map (lines 24 and 25).

Listing 6 shows more detail of the terms and frequencies extraction process. Sample Java code uses three com.ibm.es.takmi.impl.* packages to process MIML-formatted XML documents.

Listing 6. Auxiliary method to extract terms and their frequencies
1  import com.ibm.es.takmi.impl.common.document.KeywordFeature;
2  import com.ibm.es.takmi.impl.common.document.KeywordFeatureElement;
3
4  /* ... */
5
6  private Map<String, Integer>
           determineFrequencyOfKeywords(Document doc, Set<String> stopWords) {
7      HashMap<String, Integer> keywordFrequencyMap =
           new HashMap<String, Integer>();
8      KeywordFeature keywordFeature =
           (KeywordFeature) doc.getFeature(KeywordFeature.class);
9      for (KeywordFeatureElement element : keywordFeature.getFeatureElements()) {
10         String category = element.getCategory();
11         if (category.startsWith("$._word")) {
12             String keyword = element.getValue();
13             if(stopWords.contains(keyword.toLowerCase())) {
14                 continue;
15             }
16             Integer freq = keywordFrequencyMap.get(keyword);
17             if(freq != null) {
18                 keywordFrequencyMap.put(keyword, freq + 1);
19             } else {
20                 keywordFrequencyMap.put(keyword, 1);
21             }
22         }
23     }
24     return keywordFrequencyMap;
25 }

The code in Listing 6 implements a traversal of all elements included in the MIML document. Each element has constituents extracted from the text. In cases where the constituent represents a word (line 11), the element is analyzed further: stop-word filtering is applied (lines 13 to 15), and word frequencies (also called keyword or term frequencies) are calculated using a Java HashMap (lines 16 to 21).


IBM Content Analytics configuration

This section dives into the configuration details of IBM Content Analytics, which enables unstructured text content analysis. In the example, text content will be extracted from various web sources.

IBM Content Analytics configuration encompasses several text analytics collections (or corpora, to text mining professionals). Collections support search, and various text mining capabilities, such as:

  • Exploring correlations or deviations in the data.
  • Exporting analysis results to data warehouse or business intelligence applications.

The data is retrieved for a collection by one or more crawlers that collect documents from data sources, either continually or according to a predefined schedule. Retrieved data is then passed through an analytics pipeline that includes parsing, indexing, linguistic analysis, and custom analysis on each crawled document.

In this article, the example text analytics collections use two types of crawlers:

  • Local file system crawler, configured to collect the data stored locally.
  • Web crawler, configured to collect publications from PLoS Biology Journal website.

Both types of crawlers are discussed in more detail below.

Custom crawler to retrieve awarded project topics

The awarded grant topics for a Profiles user come from the TAGGS website, are stored locally, and are crawled by a Content Analytics local file system crawler. The TAGGS site lets you retrieve the data by submitting a search form, so it is necessary to crawl a dynamically generated page. IBM Content Analytics does not provide capabilities to crawl such pages, so our Profiles extension application applies an alternative solution based on a custom crawler for dynamically generated web pages.

The custom crawler is implemented by the GAManager (grant awards manager) Java class and several auxiliary classes. In the servlet implementation for the My Grant Awards widget (Listing 1), there are two different types of HTTP GET request processing: one that maintains grant awards for a single person, and one that maintains awards for all registered persons at once. The two request types result in two different calls within the scope of doGet:

GAMaintenanceStatus st = mgr.maintainAwards(db2Connection);

and

grantsXml += mgr.retrieveAwardsXML(principalInvestigator);

In the code snippets above, the instance of the GAManager class (the mgr variable) is the handle for working with the custom-developed crawler. It is initialized during the My Grant Awards servlet initialization.
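The servlet initialization code is not shown in the article. A minimal sketch might look like the following, assuming that the LDAP URL and awards folder (the two GAManager constructor arguments described in the next section) are supplied as servlet init parameters; the init-parameter names are hypothetical.

// Minimal sketch: create the GAManager during servlet initialization.
// The init-parameter names "ldapURL" and "awardsFolder" are assumptions for illustration.
public void init(ServletConfig config) throws ServletException {
    super.init(config);
    String ldapURL = config.getInitParameter("ldapURL");
    String awardsFolder = config.getInitParameter("awardsFolder");
    mgr = new GAManager(ldapURL, awardsFolder);
}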

GAManager high-level implementation

Listing 7 shows a high-level implementation of the GAManager class. The GAManager constructor accepts two string parameters:

  • The first string parameter contains the URL of the LDAP server. This server is used to authenticate all registered Profiles users; GAManager uses it to retrieve the list of all registered users' full names.
  • The second string parameter contains the path to the local file system folder that is used to store the data retrieved from the TAGGS website -- one file per Profiles user.
Listing 7. High-level implementation of GAManager class
public class GAManager {
 public GAManager(String ldapURL, String awardsFolder) {...}
 public String retrieveAwardsXML(String userName) {...}
 public GAMaintenanceStatus maintainSingleAward(String userName) {...}
 public GAMaintenanceStatus maintainAwards(Connection db2Connection) {...}
 }

In Listing 7, the following methods are used.

retrieveAwardsXML
Given a Profiles user name, this method maintains the user's grant awards list in terms of the locally stored HTML file contents, then translates the HTML data into XML and returns it to the caller. Depending on the current status of the HTML file (for example, the file might not exist), fetching the awards HTML from the web might be invoked implicitly.

retrieveAwardsXML accepts the full name of the Profiles user as a parameter, and calls the maintainSingleAward method.

maintainSingleAward
Given a Profiles user name, this method performs maintenance of a profile grant awards list in terms of a single HTML file stored locally for a user. Depending on the current status of the HTML file (the file might not exist, or could be outdated), fetching the awards HTML from the web is invoked here.

maintainSingleAward accepts the full name of a Profiles user as a parameter and returns the status of an attempt to perform maintenance. Status is one of the following:

  • SUCCESS - maintenance was successfully executed.
  • INPROGRESS - another maintenance thread for this person is in progress, so maintenance was not executed.
  • ERROR - internal error occurred during the maintenance.
maintainAwards
Performs maintenance of grant awards data for all Profiles users. It first retrieves registered user entries from LDAP, and uses the entries as keys to retrieve user full names from DB2 using SQL queries. maintainAwards then calls the maintainSingleAward method for all registered users. maintainAwards accepts a Connection parameter, which represents a connection to a DB2 database.
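A minimal sketch of this flow is shown below. The loop structure, the helper names, and the way the aggregate status is derived are assumptions for illustration; the article does not show the actual implementation.

// Minimal sketch of the maintainAwards flow (helper names are hypothetical).
public GAMaintenanceStatus maintainAwards(Connection db2Connection) {
    try {
        // Hypothetical helper: read registered user entries from the LDAP server.
        for (String ldapEntry : fetchRegisteredUserEntries(ldapURL)) {
            // Hypothetical helper: resolve the user's full name from DB2, using the LDAP entry as the key.
            String fullName = lookupFullName(db2Connection, ldapEntry);
            GAMaintenanceStatus status = maintainSingleAward(fullName);
            if (status == GAMaintenanceStatus.ERROR) {
                return GAMaintenanceStatus.ERROR;
            }
        }
        return GAMaintenanceStatus.SUCCESS;
    } catch (Exception e) {
        return GAMaintenanceStatus.ERROR;
    }
}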

Fetching data from TAGGS

Fetching the data from the TAGGS website is done through the GAWebFetcher auxiliary class, which GAManager uses. A simplified version of the main method used by GAWebFetcher to fetch grant awards HTML is shown in Listing 8.

Listing 8. Implementation of fetchAwardsHTML method
/* ... */
1  private static final String GRANTS_REFERER =
       "http://taggs.hhs.gov/advancedsearch.cfm";
/* ... */
2  private String fetchAwardsHTML(String userName) throws IOException {
3      String awardsHTML = "<h2>" + userName + "</h2><hr><br>";
4      String investigator = getLastName(userName);
5      addRequestHeader("referer", GRANTS_REFERER);
6      setAdvancedSearchRequestDefaults(); // setting default parameters
7      addRequestParameter("sPIName", investigator); // fill in the name parameter
8      fillParameters(); // putting parameters into the request
9
10     try {
11         client.setTimeout(15000); // 15-second timeout
12         client.executeMethod(postMethod); // executing the request
13     } catch(Exception e) {
14         if(e.getMessage().contains("timed out")) {
15             /* ... notify the caller about the timeout ... */
16         }
17         /* ... handle other exceptions ... */
18     }
19
20     Tidy tidy = new org.w3c.tidy.Tidy();
21     tidy.setXmlOut(true);
22     tidy.setShowWarnings(false);
23     document = tidy.parseDOM(postMethod.getResponseBodyAsStream(), null);
24     awardsHTML += getAwardsHtmlTable();
25     postMethod.releaseConnection();
26     return awardsHTML;
27 }

The HTML section fetched from the TAGGS website is wrapped with additional, artificially generated HTML components, such as a header containing the faculty member's name (line 3 in Listing 8).

To fetch grant awards HTML data from a TAGGS advanced search page, an HTTP POST request needs to be generated. It includes request parameters with values representing the details of what needs to be retrieved. With the exception of the last name of the investigator, the data being passed along with HTTP POST requests is the same for all people.

The code in lines 4 to 7 in Listing 8 fills the HTTP POST request with meaningful parameters. Notice that the referer header, with the GRANTS_REFERER value, must be present in the HTTP POST request addressed to the TAGGS advanced search page in order to get a valid response from that page. Essentially, the fillParameters and setAdvancedSearchRequestDefaults methods add and initialize parameters of the object instantiated from the org.apache.commons.httpclient.methods.PostMethod class, as follows.

postMethod.addParameter(key, parameterValue);

Lines 10 to 18 in Listing 8 show how the postMethod object is used to invoke HTTP POST request execution. As soon as the HTTP POST request is executed, the response can be retrieved, as shown on line 23:

postMethod.getResponseBodyAsStream();
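Listing 8 does not show how the client and postMethod objects themselves are created. A minimal sketch using Jakarta Commons HttpClient 3.x might look like the following; the POST target URL is a placeholder, since the article does not give the actual TAGGS search-results URL.

// Minimal sketch (placeholder URL): creating the HttpClient objects used in Listing 8.
org.apache.commons.httpclient.HttpClient client =
    new org.apache.commons.httpclient.HttpClient();
org.apache.commons.httpclient.methods.PostMethod postMethod =
    new org.apache.commons.httpclient.methods.PostMethod("http://taggs.hhs.gov/...");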

IBM Content Analytics built-in crawlers

As mentioned, we use only two of the many types of IBM Content Analytics crawlers. The following are real-life configuration examples for each type of crawler.

Configuration for local file system crawler
The local file system crawler is configured to retrieve the files residing in a predefined Windows file system folder (discussed in Custom crawler to retrieve awarded project topics, GAManager constructor details, Listing 7). HTML files retrieved by the custom crawler using an HTTP POST request are stored to this folder and become available for the IBM Content Analytics Windows file system crawler.

The local file system crawler is configured to work with a single folder, and to retrieve only the HTML type files from that folder. This configuration guarantees that only proper HTML files, containing awarded grants topics for a Profiles user, will be analyzed by IBM Content Analytics.

As an added benefit, analysis results for each person could be subsequently retrieved using a unique URI that matches the path to the HTML file with awarded grant topics.

Configuration for web crawler: PLoS Biology Journal
The web crawler is configured to retrieve article abstracts from the PLoS Biology May 2010 issue on the PLoS Biology website. The website provides a structured catalog of articles with related materials (such as article abstracts, article citations, and so on).

To force the crawler to retrieve only the article abstracts linked from the May 2010 issue pages, and to prevent it from crawling any other documents, the following crawling rules were used.

Starting URL:

http://www.plosbiology.org/article/browseIssue.action?
    issue=info%3Adoi%2F10.1371%2Fissue.pbio.v08.i05

Domain rules:

allow domain www.plosbiology.org
forbid domain *

HTTP prefix rules:

allow prefix http://www.plosbiology.org/article/browseIssue.action?
    issue=info%3Adoi%2F10.1371%2Fissue.pbio.v08.i05
allow prefix http://www.plosbiology.org/article/browseIssue.action*
allow prefix http://www.plosbiology.org/article/info*
forbid prefix *

Using domain rules, almost all unrelated URLs, such as banner ads or links to partner web journals, are filtered out. Crawl space rules (that is, domain and prefix rules) are applied in a "first come, first applied" manner. According to the domain rules above, IBM Content Analytics first allows the www.plosbiology.org domain to be crawled, and then forbids crawling of any other domain.

HTTP prefix rules fine-tune the crawl space so that the starting web page is crawled first, followed by the articles it links to, while any other URLs are forbidden for crawling.
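For example, an article page whose URL begins with http://www.plosbiology.org/article/info matches the third allow prefix rule and is crawled, while a hypothetical page such as http://www.plosbiology.org/static/information matches none of the allow prefixes and is rejected by the final forbid prefix * rule.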


Conclusion

This article expanded on the implementation of a UI based on IBM Lotus Connections Profiles. The UI was extended with an iWidget-based widgets bundle to support unstructured data retrieval, analysis, and visualization. (Part 3 explored the client side.)

Server-side functions were integrated with IBM Content Analytics to analyze unstructured web content. The results of the analysis were used to display the widgets' contents in a friendly, useful manner.

Using a realistic example, this article demonstrated the value of integrating IBM Lotus Connections and IBM Content Analytics.
