Text analysis in InfoSphere Warehouse, Part 2: Dictionary-based information extraction combined with IBM Cognos reporting

Gain business insights from unstructured data

Unstructured information represents the largest, most current, and fastest growing source of information that is available today. This information exists in many different sources such as call center records, repair reports, product reviews, e-mails, and many others. The text analysis features of IBM InfoSphere Warehouse can help you uncover the hidden value in this unstructured data. This series of articles covers the general architecture and business opportunities of analyzing unstructured data with the text analysis capabilities of InfoSphere Warehouse. The integration of this capability with IBM Cognos reporting enables people across the company to exploit the text analysis results.

The first article of this series gave an overview of the text analysis capabilities in InfoSphere Warehouse and showed how to use regular expressions to extract concepts from free-form text. This second article shows you how to use dictionaries for concept extraction and how you can use taxonomies to structure them. It also explains how you can present the results in an interactive Cognos report.

Stefan Abraham (stefana@de.ibm.com), Software Engineer, IBM

Stefan Abraham is a Software Engineer at the IBM Research & Development Lab in Boeblingen, Germany. He works on text analysis components and on data mining related UI components in InfoSphere Warehouse.



Simone Daum (sdaum@de.ibm.com), Software Engineer, IBM

Simone Daum is a Software Engineer at the IBM Research & Development Lab in Boeblingen, Germany. She works on tooling for data preparation for data mining and on text analysis in InfoSphere Warehouse.



Benjamin G. Leonhardi (bleon@de.ibm.com), Software Engineer, IBM

Benjamin Leonhardi is a software engineer for InfoSphere Warehouse data mining at the IBM Research & Development Lab in Boeblingen, Germany. He works on mining visualization, text mining, and mining reporting solutions.



09 July 2009

Introduction

In information extraction, it is a common task to extract concepts such as persons, products, or email addresses from texts. If those concepts do not comply with a certain pattern (such as the patterns of phone numbers and email addresses), you can use dictionaries to list all terms that you would like to extract. Taxonomies can be used to structure the entries of dictionaries into categories of various levels.

In InfoSphere Warehouse, extracted concepts can be stored back to the database and, subsequently, they are made available for other business analytics tools, such as data mining applications or reporting tools.

IBM Cognos 8 Business Intelligence provides a full set of business intelligence capabilities, including reporting, analysis, dashboards, and scorecarding. Cognos 8 Reporting is able to consume structured information from many data sources, and it can be used to propagate the text analysis results to a wide audience.

The following sections explain the dictionary-based extraction capabilities of InfoSphere Warehouse and show you how to create taxonomies to structure the concepts in dictionaries. You will see a high-level overview of IBM Cognos 8 and how it is integrated with the text analysis capabilities of InfoSphere Warehouse.

Finally, in a step-by-step example, you will see how to combine the text analysis components of InfoSphere Warehouse and Cognos 8 BI to create a report highlighting the skills that are most frequently asked for in IT job offerings.

Understanding dictionary-based concept extraction in InfoSphere Warehouse

InfoSphere Warehouse contains rich tooling to extract concepts from text based on dictionaries. With the help of a dictionary editor, you can first build a dictionary and then use the dictionary lookup operator to embed the concept extraction in a flow. InfoSphere Warehouse dictionaries are much more than simple word lists. A dictionary entry consists of a base form and multiple optional variants. The variants can include acronyms, abbreviations, synonyms, or any other variation or association of the base form.

Table 1. Sample entries of the dictionary Country showing the base form and variants

    Base form                    Variants
    ...                          ...
    Germany                      BRD
    Sri Lanka                    Ceylon
    United States of America     USA, US
    ...                          ...

Consider the dictionary Country in Table 1 and the following text: Sri Lanka was known as Ceylon before 1972. Performing a lookup with the dictionary on this text would result in the following two annotations:

Table 2. Annotation for the first match (covered text "Sri Lanka")

    Type:          Country
    baseform:      Sri Lanka
    id:            17
    begin:         0
    end:           9
    covered text:  Sri Lanka

Table 3. Annotation for the second match (covered text "Ceylon")

    Type:          Country
    baseform:      Sri Lanka
    id:            17
    begin:         23
    end:           29
    covered text:  Ceylon

The type of the annotations created by a dictionary lookup is derived from the dictionary name. The begin and end attributes denote the beginning and end of the matching term as character offsets. The baseform attribute of the annotation is the base form of the matched dictionary entry, whereas the covered text shows the matching term as it occurs in the text. It might be the base form itself, a variant of the base form, or even a grammatical inflection of either, as explained in the list of pre-processing subtasks below. The id of the annotation is the internal ID of the matched dictionary entry. This ID is used internally to link dictionary entries with taxonomy entries, as shown later in this article.

As mentioned above, InfoSphere Warehouse does more than a character-by-character lookup of the base forms and their variants. Before the lookup, the text is pre-processed with the following subtasks:

  • Text segmentation

    Text segmentation splits text into individual words or tokens. Typically, this is done based on the blank characters between words. However, languages such as Chinese or Japanese do not mark word boundaries in this way. Therefore, text parsing in these languages requires the identification of word boundaries, which can be a very difficult task. InfoSphere Warehouse uses rules and language-specific lexical dictionaries for text segmentation. Those internal lexical dictionaries are available in over 20 languages. You specify the language of the text to be analyzed as an option in the Dictionary Lookup operator.

  • Stem derivation

    During text segmentation, each word or token is looked up in an internal lexical dictionary. Words that are found there are called in-vocabulary words. Stem derivation finds derived forms of in-vocabulary words based on grammatical inflection, including verb conjugations such as write, wrote, and written, or noun forms such as mouse and mice. During the dictionary lookup, inflections of the base forms and variants are found. If a dictionary entry consists of multiple terms, each individual term might be inflected.

    Stem derivation is not applied to out-of-vocabulary words. So you need to add all expected variations manually to the variants of the dictionary. In addition to the lexical dictionaries that are used during the dictionary lookup, there are also special internal inflection dictionaries. Those dictionaries help you during creation of your dictionary to determine whether a word is in-vocabulary or out-of-vocabulary. You can choose a language for your dictionary in the section Automatically detected inflections in the dictionary editor. The section shows you all inflections that will be found during a lookup. Figure 2 shows you a dictionary entry write with the variants scribble and wrte. With English selected as language, you can see the inflections for write and scribble, whereas wrte is out-of-vocabulary. Nevertheless you might want to include wrte in your dictionary, because it is a common misspelling in the texts that you want to analyze.

    Figure 2. Details of a dictionary entry and the Automatically detected inflections section of the dictionary editor
  • Upper- and lower-case handling

    The Dictionary Lookup operator finds the following spellings of in-vocabulary words:

    • An initial upper-case letter, such as Customer
    • All upper-case letters, such as CUSTOMER
    • Mixed-case letters, such as cUstomer

    This behavior also applies to out-of-vocabulary words if you enter them into your dictionary in lower-case letters, such as ipod. If you do not enter an out-of-vocabulary word in lower-case letters, it is only found if it is spelled exactly as in your dictionary. So if you add the word iPod to your dictionary, it will only be found with this exact spelling. The dictionary editor does not show the automatically detected upper- and lower-case variations of a term, but you can see in the inflections section whether a term is in-vocabulary or out-of-vocabulary for the selected language.

To enable the natural language processing, you have to specify the text language in the dictionary lookup operator. If you want to perform a lookup with multiple dictionaries simultaneously, you can create multiple output ports for the operator, each taking one dictionary.

Dictionaries can be created manually, or the entries can be imported from flat files, database tables, or other dictionaries. They can also be created from the results of a frequent term extraction. This extraction shows the most frequent nouns, verbs, or terms matching any specific part-of-speech pattern. In the results view, you can mark all interesting words and export them to a dictionary, which can save time during dictionary creation.
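
The lookup itself is performed by the operator's linguistic runtime, not by SQL, but a deliberately naive SQL analogy may help to make the roles of base forms, variants, and case handling concrete. All table and column names in this sketch (MY_DICT, DOCS, and so on) are hypothetical and exist only for illustration:

    -- Hypothetical tables for illustration only (not created by InfoSphere Warehouse).
    CREATE TABLE MY_DICT (
      ENTRY_ID INTEGER      NOT NULL,  -- internal entry ID, like the id annotation attribute
      BASEFORM VARCHAR(100) NOT NULL,  -- for example 'United States of America'
      VARIANT  VARCHAR(100) NOT NULL   -- for example 'USA'; the base form is repeated as its own variant
    );

    CREATE TABLE DOCS (
      DOC_ID   INTEGER NOT NULL,
      DOC_TEXT VARCHAR(4000)
    );

    -- Naive, case-insensitive substring matching: return the base form for every
    -- document that contains any variant. The real Dictionary Lookup operator
    -- additionally segments the text into tokens and matches grammatical
    -- inflections, which plain SQL cannot reproduce.
    SELECT D.DOC_ID, M.ENTRY_ID, M.BASEFORM
    FROM DOCS D
    JOIN MY_DICT M
      ON LOCATE(UPPER(M.VARIANT), UPPER(D.DOC_TEXT)) > 0;

A match on the variant USA would be reported with the base form United States of America, just as the annotations in Table 2 and Table 3 report the base form Sri Lanka for the covered text Ceylon.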


Using taxonomies to structure dictionary entries

Taxonomies can be used to structure or categorize dictionary entries. In Figure 3, you can see an example of a taxonomy where countries are categorized into different continents.

Figure 3. The taxonomy editor

You can create levels, like the continents in the example, and assign the dictionary entries by dragging and dropping them. Levels can also contain sublevels of any depth.

If you create a taxonomy from a dictionary, only the base forms are imported into the taxonomy, but their variants are also taken into account, because annotations are categorized in the taxonomy by the annotation id, which identifies a dictionary entry. This ID is also used to identify a taxonomy entry. After adding an entry to the dictionary or to the taxonomy, you can synchronize the two resources again.

The levels of a taxonomy can be used to drill down or roll up in a report. To make the taxonomy available to a reporting tool, export it to a database table. Each taxonomy level is exported to one column in the table. Figure 4 shows the table for the taxonomy in Figure 3.

Figure 4. Sample contents of the table with the exported taxonomy
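
Once exported, the taxonomy is an ordinary table that any SQL client or reporting tool can read. The following is a hedged sketch only: it assumes the country taxonomy was exported to a table named TXTANL.COUNTRY_TAX with a first-level column LEVEL1_NAME and a term column TERM (the skills example later in this article uses columns such as TERM_ID and LEVEL1_NAME, but the exact names are determined by the export wizard):

    -- Illustrative check of an exported taxonomy table:
    -- list every term together with its first-level category.
    SELECT LEVEL1_NAME, TERM
    FROM TXTANL.COUNTRY_TAX
    ORDER BY LEVEL1_NAME, TERM;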

Using IBM Cognos 8 Business Intelligence

IBM Cognos 8 Business Intelligence provides a full set of business intelligence capabilities. It is based on a flexible, service-oriented architecture. The main capabilities include reporting, analysis, dashboards, and scorecarding.

Reporting is used to aggregate relevant data about business processes and to distribute it to the people who benefit the most from the particular information. In an advanced analytics environment, this means transferring analysis results to the people who benefit the most from the generated business insights.

The components of Cognos 8 that are needed to create and publish business reports are as follows:

  • Cognos Connection is the Web portal and entry point to the functionality of Cognos 8. From here, the user can manage, organize, and view existing content, such as reports or dashboards. The Web-based authoring studios, such as Report Studio, can be started from Cognos Connection to create new resources or to edit existing ones. It is also used to administer the Cognos server, including changing access restrictions, importing content, or changing the list of people to whom a report is sent.
  • Cognos Framework Manager is the modeling tool for the data sources that are used in Cognos 8. Data from most databases and other sources, such as Web services for example, can be accessed in Cognos reports by describing a metadata layer in Framework Manager. Database tables, views, and stored procedures can be added to a Cognos package as query subjects. The Framework Manager is used to import and consolidate the information that is available in different data sources in the company so that the information can be used in a coherent way in the Cognos 8 BI tools, such as Report Studio. Note that the data itself is kept in the data source, and it is accessed as the report is created.
  • Cognos Report Studio is one of the Web-based authoring studios of Cognos 8 BI. It is used to create and edit sophisticated reports about the data that was described in Framework Manager. It provides powerful functionality, including drill-down, prompts, and a comprehensive charting library.

The following process makes data available in a report:

  1. In Framework Manager, the data modeler creates the Cognos metadata that describes the data in the database from a business perspective, including relations among tables and business names of values.
  2. After the metadata has been modeled, it is deployed as a package into the Cognos 8 content store. From there, it can be accessed through Cognos Connection and the authoring studios.
  3. The report designer enters Cognos Connection and creates a new report on the deployed package. After the report is created, the target group and distribution form (such as email or Web portal) can be administered.

The capability to create reports from relational data sources is the key to the integration of InfoSphere Warehouse Text Analysis and IBM Cognos.


Understanding integration architecture

The content of Cognos reports consists of result sets delivered by a relational data source, as described above. The content of a particular report is determined by a dynamic SQL query to one or more data sources. Because InfoSphere Warehouse Text Analysis reads information from database tables and writes the results back into result database tables, those results can easily be accessed with Cognos 8 BI to create reports. In contrast to InfoSphere Warehouse Data Mining, it is not easy to create dynamic, parameterized text analysis runs from the report, because the text analysis runtime does not run in the database and therefore has no SQL API. However, the result tables containing the text analysis results can be consumed in a Cognos report like any other database table.

The next section gives you a step-by-step example for this kind of integration: text analysis is used to extract structured information from a database table containing unstructured information. The results are stored in the database and are consumed by the Cognos report to propagate the information to a wider audience.
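
As a simple illustration of this point, once a text analysis flow has written its results to a table, the query behind a report is ordinary SQL against that table; nothing text-analysis-specific is needed on the Cognos side. The table and column names below are purely hypothetical:

    -- Hypothetical report query against a text analysis result table:
    -- count how often each extracted concept occurs.
    SELECT BASEFORM, COUNT(*) AS OCCURRENCES
    FROM MYSCHEMA.TEXT_RESULT
    GROUP BY BASEFORM
    ORDER BY OCCURRENCES DESC;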


Extracting information from text using dictionaries: An example

In this example, pretend you are part of the management of JK Superstore. You want to define the IT strategy of the company and determine whether the skills of your employees are still up to date. Therefore, you want to discover the current trends in the IT job market by analyzing job offerings posted on the Internet by eight major IT companies: Sigratech, Mics System, JTS World, Netrum, Tersa, Vernacle, Quantech, and Coratech.

You are interested in which skills (including programming languages) these companies ask for. More sophisticated queries are also possible, such as which skills show an increasing trend over time, broken down by company.

To achieve this goal, first create a dictionary that contains a list of the skills that you want to extract from the text of the job offerings. Then create a taxonomy using the dictionary entries, which enables you to get a higher-level view of the skills mentioned. Next, create a flow that analyzes the table containing the job offerings with the Dictionary Lookup operator. This operator creates a result set containing the job identification number and the skill mentioned. This result set can then be joined with the created taxonomy to get the final table, which combines existing structured information, such as the company name and the job offering ID, with the new structured information about the mentioned skills. The table can then be accessed in IBM Cognos 8 BI to create reports. Finally, you create a report showing the skill categories and detailed skills mentioned in the job offerings and which companies requested these skills.

Understanding the data

To perform an analysis of text in database tables, first explore the available information to select the relevant text columns for the analysis. With the Data Exploration View, you can browse samples of database tables, explore the contents of large text columns, or determine the average length of text strings.

For the example, use the Data Exploration View to explore the content of the table JOBS. The table can be found in the DWESAMP sample database shipped with InfoSphere Warehouse.

In the Data Source Explorer, navigate to the table JOBS in the database DWESAMP and schema TXTANL. Right-click the table, and select Distribution and Statistics > Data Exploration from the pop-up menu.

The table contains the following four columns:

  • COMPANY_NAME: the name of the company
  • TIME: the offer date
  • ID: a number that identifies the offer
  • JOB_DESC: the job description in free-form text

To view different job offerings, select the appropriate rows in the sample contents table.

Scroll down in the text description of a job offering to find the information about required skills. Looking at these enumerated skills, you can see that this task of extracting concepts can be handled best with dictionaries or lists of words.
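
If you prefer SQL over the Data Exploration View, a quick query gives a similar first impression of the data. This is just an alternative way to peek at the table; SUBSTR is used only to keep the long job descriptions readable:

    -- Show a few job offerings with a truncated description column.
    SELECT ID, COMPANY_NAME, "TIME", SUBSTR(JOB_DESC, 1, 200) AS JOB_DESC_EXCERPT
    FROM TXTANL.JOBS
    FETCH FIRST 5 ROWS ONLY;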

The text analysis sample project

The example in this article uses a text analysis example project available with the DesignStudio of InfoSphere Warehouse. Select File > New > Example from the menu. In the New Example wizard, expand the folder Data Warehouse examples and select Text Analysis Sample. Click Next. Specify a name for the project, and finish the wizard. This sample project contains the dictionary it_skills to use in this article's example.

Creating the dictionary and taxonomy for the skills mentioned in the job offerings

The Dictionary-Lookup operator processes the text using a dictionary. A dictionary is a list of words or expressions defined by the user. For each term, it is possible to add some variants. For example, Java™, JAVA, and J2EE will be recognized as a single reference. For this example, you will design a dictionary that describes the range of IT skills. A taxonomy is a hierarchical classification made with levels. You will use the taxonomy editor to build a taxonomy, which you will export as a table and use later for the construction of the report. The taxonomy will contain the terms from the it_skills dictionary, and it will look like the one depicted in Figure 5.

Figure 5. The skills taxonomy
  1. To create a Data Warehouse Project, right-click Project Explorer, and select New > Data Warehouse Project.
  2. In the wizard that appears, type the project name, such as Text Analytics, and click Finish.
  3. To create and populate the dictionary, expand the created Text Analytics project.
  4. Right-click the folder Text Analysis, and click New > Dictionary.
  5. In the resulting window, type a name for the dictionary, such as skills, and click Finish. The Dictionary Editor appears on the Editor View. On the left side of the editor, you can see a view of the dictionary, including each term and a sample of its variants. On the right side of the editor is the editing zone.
  6. Add any additional terms and their variants.
  7. To add a new entry to the dictionary, on the right side type Windows® as the base form, and add the variants Microsoft Windows, XP, and Windows XP Professional.
  8. To see if some inflections can be covered automatically, in the lower-right part, select English as the language for the inflection lookup. You can see inflections such as Microsoft Corporation Windows and Windows XP professionals, which will be covered automatically. Note: To make sure that you can find all occurrences of the specified terms, and because these terms are not known by the underlying lexical dictionary, write the terms in all lower-case letters. This ensures that terms such as WINdows or xp occur in the result as well.

    Instead of inserting all entries and their variants manually, you can usually start with a frequent terms analysis to view frequent terms and to decide which terms to add to your dictionary. This process of identifying the right terms can be very time consuming. Instead of creating the dictionary from scratch using frequent terms, use the sample it_skills dictionary from the text analysis sample project.

    If you want to start with frequent terms in order to find out-of-vocabulary terms such as j2ee or db2, the pattern Unidentified must be added to the list of term patterns to find, which in turn detects more terms, causing the analysis to be very time consuming.

  9. To import entries into the dictionary, click the import button in the upper-left area of the dictionary editor.
  10. Select dictionary as the import source, navigate to the project TextAnalysisSample, and select the dictionary it_skills.
  11. Click Finish to import the entries.
  12. Quit the Merge window by clicking OK.
  13. Press CTRL+S to save the dictionary.

To create and export the taxonomy, follow these steps:

  1. To create the taxonomy, in the Text Analytics project in the Project Explorer, right-click the Text Analysis folder, and click New > Taxonomy.
  2. In the Create Taxonomy wizard, specify a name for your taxonomy, such as skills_tax, and click Next.
  3. Select Create taxonomy from import source, check Dictionary, and click Next.
  4. Browse the subtree of your Text Analytics project, select the it_skills dictionary, and click Finish. The taxonomy editor opens. On the left is the Unassigned terms frame with the terms imported from the dictionary. On the right is the Taxonomy tree frame where you will build your classification.
  5. To define the taxonomy, create the first level. On the taxonomy Tree, right-click Root, and click Add level to add a new member for the selected level.
  6. Add five levels: Database, Development/Programming, Operating Systems, Web Oriented, and Others.
  7. Assign the unassigned terms to the categories you created by placing your cursor on the first category and using the arrow buttons to assign terms to this category. You can also use drag and drop to assign terms to a category.

    The final taxonomy is shown in Figure 6.

Figure 6. The taxonomy editor with assigned skills to defined categories
  8. Press CTRL+S to save the taxonomy.
  9. Click the Export Taxonomy button.
Figure 7. Export taxonomy
  10. On the Available taxonomies page, make sure that your taxonomy is selected and that Export the taxonomy to a table is checked.
  11. Click Next, and select the database DWESAMP.
  12. Click Next to specify the new table.
  13. Select TXTANL as the table schema, and specify SKILLS_TAX as the new table name.
  14. Click Next to view the detailed information of the table to be created.
  15. Click Next. The Show Table Format page summarizes the taxonomy table that will be created, showing a sample of the content, as shown in Figure 8.
Figure 8. Sample content of the taxonomy table
  16. Click Finish to create and populate the table.

Creating a table suitable for reporting

The Dictionary Lookup operator uses dictionaries that contain lists of words to extract concepts, such as names, products, or even skills, from textual database columns. With the Dictionary Lookup operator, you can find the text sections that contain the terms in the selected dictionary.

  1. To create an empty mining flow, right-click the Mining Flows folder in your Data Warehousing project, and select New > Mining Flow.
  2. In the wizard, type the mining flow name, such as JobsAnalysis.
  3. Specify to work against a database, and click Next.
  4. Select the DWESAMP database, and click Finish. The Mining Flow editor opens.

To define the mining flow, on the right side of the editor, you can see a palette with operators. With these operators you can build up a mining flow by dragging and dropping operators to the editor canvas.

To create the mining flow, follow these steps:

  1. In the palette, find the Sources and Targets section.
  2. Select a Table Source operator, and drag it onto the editor canvas.
  3. In the table selection dialog window, expand the TXTANL schema, and select the JOBS table.
  4. Click Finish.
  5. From the Text Operators section, drag a Dictionary Lookup operator onto the canvas. The Properties view of the operator is opened below the canvas. Note: If your editor area is maximized, you cannot see this view until you reduce the size of the editor.
  6. On the canvas, connect the output port of the Source Table operator with the input port of the Dictionary Lookup operator using a simple drag operation.
  7. Select the Dictionary Lookup operator, and specify the properties in the Properties view.
  8. On the Dictionary Settings page, select the JOB_DESC input-text column from the list, and set the language of the text to English (United States).
  9. On the Analysis Results page, select skills as the dictionary and as the annotation type.
  10. Delete BEGIN, END, and COVEREDTEXT from the resulting columns table, because this information is not needed for this analysis.
  11. Rename the remaining BASEFORM column to SKILL; it contains the base forms specified in the dictionary.
  12. Rename the ID column to SKILL_ID; it contains the unique identifier of the base form (the dictionary entry ID).

    Figure 9 shows the complete Analysis Results page.

Figure 9. The Analysis Results page
  13. On the Output Columns page of the Properties View, select the columns ID, COMPANY_NAME, and TIME from the list of available columns, and move them to the list of output columns on the right. Now you can relate the extracted concepts to the original rows with the key ID, because this column is also contained in the operator output.
  14. Place a DISTINCT operator on the canvas. Each time a concept is found, a new row is created. If a concept occurs multiple times in the same original text row, several rows are created that all point to the same original entry.

    To remove the duplicate entries and prevent counting them multiple times later, use the DISTINCT operator from the Advanced Transformations section of the editor palette. Connect the input port of the DISTINCT operator with the output port of the Dictionary Lookup operator.

Join the resulting table with the taxonomy table to allow higher-value analysis by following these steps:

  1. Drag a new Table Source Operator to the canvas, and select the table TXTANL.SKILLS_TAX, where you exported the taxonomy. If the table is not available in the list, refresh the database model by selecting Mining Flow > Refresh Database from the menu bar.
  2. Place a Table Join operator from the Transformations palette section on the canvas.
  3. Connect the first input port with the output port of the DISTINCT operator, and connect the second input port with the output port of the Table Source operator (TXTANL.SKILLS_TAX).
  4. Specify the condition for the Table Join operator by connecting the tables using the unique term number, which is SKILL_ID in the result set of the text analysis and TERM_ID in the taxonomy table. The resulting condition looks like this (the generated names of the virtual tables might differ): "IN_09_0"."SKILL_ID" = "IN1_09_1"."TERM_ID".
  5. Set the resulting columns for the Table Join operator.
  6. On the Select List page of the Properties view, remove all columns that are not needed to create a report. Keep the following columns for the example result: COMPANY_NAME, TIME, ID, SKILL, and LEVEL1_NAME, and rename LEVEL1_NAME to SKILL_CAT (this name is used in the report later). A rough SQL equivalent of the flow built so far is sketched after these steps.
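
Conceptually, the part of the flow built so far (Dictionary Lookup, DISTINCT, and the taxonomy join) computes something like the following SQL. This is a sketch only; the Dictionary Lookup output is not a physical table, so it is represented here by a hypothetical table TXTANL.LOOKUP_RESULT with the columns configured above (ID, COMPANY_NAME, TIME, SKILL, SKILL_ID):

    -- Rough SQL equivalent of the flow: remove duplicate skill mentions per job
    -- offering and attach the first taxonomy level as the skill category.
    SELECT DISTINCT
           L.COMPANY_NAME,
           L."TIME",
           L.ID,
           L.SKILL,
           T.LEVEL1_NAME AS SKILL_CAT
    FROM TXTANL.LOOKUP_RESULT L
    JOIN TXTANL.SKILLS_TAX    T
      ON L.SKILL_ID = T.TERM_ID;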

Create the table that receives the analysis results by completing the following steps:

  1. Right-click the output port (Inner) of the JOIN operator, and select Create suitable table from the context menu. Specify table name SKILL_REPORT and schema TXTANL.
  2. Click Finish.
  3. Save the mining flow by clicking into the editor area and pressing Ctrl+S.

Now, the mining flow is ready to execute, as shown in Figure 10.

Figure 10. The complete mining flow

Next, execute your mining flow. This will analyze the source columns in table TXTANL.JOBS using your dictionary, join the result with the exported taxonomy, and write this data set into the target table TXTANL.SKILL_REPORT you just created. Complete the following steps:

  1. From the menu, select Mining Flow > Execute.
  2. In the wizard, click the Execute button.
  3. Explore the content of the target table by right-clicking the table SKILL_REPORT and selecting Distribution and Statistics > Data Exploration. The sample contents are displayed in the Data Exploration view, as shown in Figure 11.
Figure 11. Sample contents of the table SKILL_REPORT showing the extracted skills

This table can now be used in Cognos 8 BI to create a report.
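
Before building the report, you can sanity-check the result table with a query that mirrors the aggregation the SkillOverview crosstab will compute later (a distinct count of job offering IDs per skill and skill category). This sketch assumes you renamed LEVEL1_NAME to SKILL_CAT as described above:

    -- Number of distinct job offerings that mention each skill, per category.
    SELECT SKILL_CAT, SKILL, COUNT(DISTINCT ID) AS NUM_OFFERINGS
    FROM TXTANL.SKILL_REPORT
    GROUP BY SKILL_CAT, SKILL
    ORDER BY SKILL_CAT, NUM_OFFERINGS DESC;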


Creating a Cognos report for the extracted skills

In this section, you will create a Cognos report that displays the text analysis results you extracted in the last section. The report consists of a main page that gives an overview of the skills and skill categories. From this overview, you can drill down to a list of job offerings containing the selected skills and skill categories. You can also select a job offering from the list overviews to see the complete job offering description. The data will be read from the SKILL_REPORT table you created in this article.

Importing skills data into Cognos

To use the SKILL_REPORT table that contains the analysis results in the report, you need to import it into Cognos metadata with the Cognos Framework Manager. You will also import the original JOBS table, and you will create a relationship between the two query subjects to display the complete job offering text in the report. To simplify the report creation, you will create a joined query subject SKILL_OVERVIEW that adds the job description of the JOBS query subject to the columns of the SKILL_REPORT query subject. This will be the query subject you use in the report.

The query subjects you will create are:

  • SKILL_REPORT: The table containing the text analysis results. It contains a list of all occurrences of skills in the job offerings, and it indicates the skill category to which each skill belongs.
  • JOBS: The table containing the complete job description. You will create a relationship with SKILL_REPORT through the ID column, which contains the job offering ID.
  • SKILL_OVERVIEW: The query subject containing all information from the SKILL_REPORT table joined with the job descriptions of the JOBS table.

First you need to create a Cognos Framework Manager project that is connected to the sample database DWESAMP of InfoSphere Warehouse.

To create the project, complete the following steps:

  1. Launch the Cognos Framework Manager.
  2. Click the Create a new project link.
  3. Specify the name as TxtDictMetaData, and click OK.
  4. Select English as the language, and click OK.
  5. To create metadata from a DB2 table, select a common Data Source, and click Next.
  6. Select the DWESAMP database, and click Next.
  7. Select the data object you want to import as metadata. Expand the TXTANL schema and the table folder, and select the JOBS and SKILL_REPORT tables.
  8. Click Next.
  9. Use the default settings, and click Import.
  10. Click Finish.

Now you have a Framework Manager project containing the two query subjects JOBS and SKILL_REPORT. To create a report that joins the information between these subjects, you need to create a relationship between them. The example uses the ID column as the join key.

To create a relationship between SKILL_REPORT and JOBS, complete the following steps:

  1. Select Create Relationship from the context menu of the SKILL_REPORT query subject.
  2. For the left query subject, select the ID column of the SKILL_REPORT subject, and set the cardinality to 1..n, because you could have multiple skills in one job offering.
  3. For the right query subject, add the JOBS query subject, select the ID column, and set the cardinality to 1..1, because there is exactly one row for each job offering.
  4. Click OK.

To make it easier to create Cognos reports using information from both tables, create a new query subject that contains all columns from the SKILL_REPORT table together with the job description query item from the JOBS query subject.

To create the SKILL_OVERVIEW query subject, complete the following steps:

  1. Create a new query subject in the DWESAMP namespace by right-clicking DWESAMP and selecting Create > Query Subject.
  2. Change the name to SKILL_OVERVIEW without changing the other settings, and click OK.
  3. Add all query items from the SKILL_REPORT query subject and the JOB_DESC query item from the JOBS query subject.
  4. Click OK.
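
In relational terms, the SKILL_OVERVIEW query subject corresponds to a join of the two tables on the job offering ID, roughly like the sketch below. The SQL that Cognos actually generates will differ in naming and structure:

    -- Approximation of the SKILL_OVERVIEW query subject: all SKILL_REPORT columns
    -- plus the job description from JOBS, joined on the job offering ID.
    SELECT R.COMPANY_NAME, R."TIME", R.ID, R.SKILL, R.SKILL_CAT, J.JOB_DESC
    FROM TXTANL.SKILL_REPORT R
    JOIN TXTANL.JOBS J
      ON R.ID = J.ID;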

You have created the query subjects you need for the Cognos report. Now you can deploy a TXTARTICLE package containing the query subjects of the project to the Cognos Content Store.

To create and deploy the package, complete the following steps:

  1. Go to the Project View, and right-click the Packages folder.
  2. Select Create > Package.
  3. In the name field, type TXTARTICLE, and click Next.
  4. Click Next.
  5. Add DB2 to the Available Function Sets list, and click Finish.
  6. Click Yes to continue with the publishing process.
  7. In the Select publish location part of the Publish Wizard, click Next to take the defaults.
  8. Click Next without specifying security settings.
  9. Click Publish.
  10. Click Finish in the dialog informing you that you have successfully published your package.

The resulting package looks like Figure 12.

Figure 12. The created resources in the Framework Manager

Creating a skills report with Cognos Report Studio

This section describes how to create the skills report that displays the results of the text analysis. Cognos Report Studio is a fully Web-based application, and it can be accessed from Cognos Connection. The report will have four report pages that are linked together. In Cognos reports, you can do this by adding links to other reports. If you also use variables or parameters, you can add powerful drill-through functionality to the report. Each Cognos report can contain parameters that can be used in parameterized queries. Either the user selects them manually when the report is executed, or another report supplies them through drill-through definitions. In this example, you will display a list of job offerings with a drill-through definition that shows the job description for the selected job offering.

The report pages include:

  • SkillOverview: This page gives an overview of the skills that have been found in all the job offerings. They will be displayed with a table that has the skill categories and the skills as rows and the number of job offerings containing the skills as columns. It will be possible to drill down to all job offerings for a skill or skill category. This is done by linking the page to the JobsSkill and JobsCategory report.
  • JobsSkill: This page contains a list displaying all occurrences of the selected skill in all job offerings. The skill is an input variable of the report page. You will be able to drill down to the job description for the job offering. This is done by linking to the JobDescription report page.
  • JobsCategory: This page is identical to the JobsSkill page but for a skill category. You already created the skill categories using the Taxonomy Editor in the DesignStudio section. JobsCategory uses the skill category as the input variable.
  • JobDescription: This report page displays the job description for a given job offering. It uses the job offering ID as the input variable.

The combined report will enable you to see the number of job offerings for a given skill or skill category. You can drill down to the job offerings that contain the skill occurrence, and you can see the job offering text. You will now create the four report pages, starting with the JobDescription report, because it doesn't reference one of the other reports.

To create the JobDescription report page, complete the following steps:

  1. In Cognos Connection, launch the Report Studio.
  2. In the Select a package explorer, click the TXTARTICLE package you published in the Framework Manager.
  3. In the welcome dialog, click Create a new report or template.
  4. Select the List template, and click OK.

    In the upper-left corner, you see the Insertable Objects view containing three tabs. The query subjects of your package are on the Sources tab. The created queries are on the Data Items tab. The Toolbox tab contains the report widgets.

  5. Open the Sources tab. You should see your TXTARTICLE package with the SKILL_OVERVIEW query subject.
  6. Drag the query items ID, TIME, COMPANY_NAME, and JOB_DESC to the list in the report page.
  7. Add a filter to the report to display only one job offering by selecting the list and clicking the Filters icon on the toolbar, or by selecting Data > Filters from the menu.
  8. In the wizard, add a Details filter using the Add icon.
  9. In the Detail Filter Expression window, add the following to the Expression Definition: [DWESAMP].[SKILL_OVERVIEW].[ID] = ?JOBID?

    Cognos automatically detects JOBID, which is surrounded by question marks, as a parameter and adds it to the parameter list of the report. (A sketch of the parameterized query that results from this filter follows Figure 13.)

  10. Confirm the filter by clicking OK.
  11. Change the report header text if needed, and save your report as JobDescription.
Figure 13. The Cognos Report Studio with the JobDescription report
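Conceptually, when the JobDescription page is run through a drill-through, Cognos turns the detail filter and the JOBID parameter into a parameterized query roughly like the following sketch against the underlying tables. The generated SQL will differ; the ? stands for the job offering ID passed by the calling report:

    -- Sketch of the parameterized query behind the JobDescription page.
    SELECT DISTINCT R.ID, R."TIME", R.COMPANY_NAME, J.JOB_DESC
    FROM TXTANL.SKILL_REPORT R
    JOIN TXTANL.JOBS J
      ON R.ID = J.ID
    WHERE R.ID = ?;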

The second report page you create is the JobsSkill page. In this page, you will show a list of all skill occurrences filtered by a skill input variable. You will also add the capability to open the JobDescription report page you already created for the job offerings in the report.

To create the JobsSkill report page, follow these steps:

  1. Create a new report for the same package. You can do this by selecting File > New in the Report Studio.
  2. From the template list, select List.
  3. From the SKILL_OVERVIEW query subject in the Insertable Objects view, drag the following query items to the List in the report: ID, COMPANY_NAME, TIME, SKILL, and JOB_DESC.
  4. Add a filter to the report to display only the occurrences of a specific skill. As you did for the JobDescription report, add a Detail filter to the list using the following Expression Definition: [DWESAMP].[SKILL_OVERVIEW].[SKILL]= ?SKILL?

    You have now added an input parameter SKILL to the report.

  5. Confirm the filter by clicking OK.

    The complete job description is shown in the JOB_DESC column for each row, which can make the list quite long.

  6. Select the JOB_DESC column (not the header, but the column below it), and navigate to the Data-Item > Expression field of the Properties View in the lower left.
  7. Open the Expression definition by clicking the ... button to the right of the property.
  8. Enter the following expression: substring([DWESAMP].[SKILL_OVERVIEW].[JOB_DESC], 1, 50) + ' ...'

    This will show only the first 50 characters of the job description followed by ... to indicate that it is only an excerpt.

  9. Click OK.

To enable the user to click on one of the job description fields and see the whole description, add a drill-through definition to the list by completing the following steps.

  1. Select the JOB_DESC column (not the header, but the column below it), and select Drill-Through Definitions from the right-click context menu.
  2. Add a new drill-through definition.
  3. In the Target Report tab of the properties, select the JobDescription report as target report.
  4. Select Run the report as action.
  5. Select the Open in new window checkbox.
  6. Add a new linked parameter with the Edit button below the parameters list.
  7. In the Parameters window, select Pass data item value as the method to link the parameter JOBID of the JobDescription report to a value of the row in the list.
  8. Select the ID query item as the data item source, and click OK.
  9. Change the report header and column headers if needed, and save the report as JobsSkill.
Figure 14. The Cognos Report Studio with the JobsSkill report

The JobsCategory report is similar to the JobsSkill report, but it will show all occurrences of a given skill category. Use the same steps you used to create the JobsSkill report, changing only the steps highlighted in this section. Everything that is not explained in detail here needs to be done as in the JobsSkill report above. To create the JobsCategory report, follow these steps:

  1. Create a new report for the same package by selecting File > New in the Report Studio.
  2. From the template list, select List.
  3. Add the following query items from the SKILL_OVERVIEW query subject to the list: ID, COMPANY_NAME, TIME, SKILL, JOB_DESC.
  4. In this report, you want to filter by skill category instead of skill by adding a detail filter to the list with the following Expression Definition: [DWESAMP].[SKILL_OVERVIEW].[SKILL_CAT] = ?CATEGORY?

    This adds the input parameter CATEGORY to the report.

  5. Change the expression of the JOB_DESC column to display only the first 50 characters, as described above.
  6. As in the JobsSkill report, add the drill-through definition to the JobDescription report for the JOB_DESC column.
  7. Change the report and column headers as needed, and save the report as JobsCategory.
Figure 15. The Cognos Report Studio with the JobsCategory report

The SkillOverview report shows all skill categories and skills in a cross tab together with the number of offerings for all skills. You will also add drill-through capability to the JobsSkill and JobsCategory reports you created above. To create the SkillOverview report page, follow these steps:

  1. Create a new report for the same package by selecting File > New in the Report Studio.
  2. From the template list, select Crosstab.
  3. From the Sources view, drag the SKILL_CAT and SKILL query items of the SKILL_OVERVIEW query subject to the rows field of the crosstab.
  4. Drag the ID query item of the SKILL_OVERVIEW query subject to the columns field of the crosstab.
  5. To display the number of job offerings containing a skill, change the aggregate function of the ID columns by selecting the ID columns of the crosstab (select the headers, not the rows below).
  6. In the Properties view in the lower left, navigate to the Data Item > Aggregate Function property, and change it to Count Distinct.

    To enable the user to click on one of the skill categories and see the JobsCategory report for this category, add a drill-through definition to the crosstab.

  7. Select the SKILL_CAT column, and select Drill-Through Definitions from the right-click context menu.
  8. Add a new Drill-Through Definition.
  9. In the Target Report tab of the properties, select the JobsCategory report as the target report.
  10. Select Run the report.
  11. Select the Open in new window checkbox.
  12. Add a new linked parameter using the Edit button below the parameters list.
  13. In the Parameters dialog, select Pass data item value as the method to link the parameter CATEGORY of the JobsCategory report to a value of the row in the list.
  14. Select the SKILL_CAT query item as the data item source, and click OK.
  15. To add a drill-through definition to the SKILL column, repeat the above steps for the SKILL column. This time select the JobsSkill report instead of the JobsCategory report. The parameter of the JobsSkill report should be SKILL, and the data item source should be the SKILL query item.
  16. Change the report and column headers as needed, and save the report as SkillOverview.
Figure 16. The Cognos Report Studio with the SkillOverview report

Now you have created everything that is needed for the skills report. When the SkillOverview report is run, it shows the skill categories and skills that have been found in the job offerings, as shown in Figure 17.

Figure 17. The skill overview report

The skills report also shows the number of job offerings for each skill. The user is then able to click on a skill category or skill to see a list of all occurrences of this skill or skill category in the job offerings, as shown in Figure 18.

Figure 18. The job offering list for a skill category, in this case Development

In this list, a job offering can be selected to see the whole job description, as shown in Figure 19.

Figure 19. The job description for the selected job offering

Conclusion

This article described how dictionaries can be used to extract concepts and how taxonomies can be used to aggregate the results. In addition, you saw how this result can be used within Cognos 8 BI and how a meaningful report using the previously existing structured information together with the new extracted concepts can be created.

The next article in the series goes beyond the text analysis capabilities provided within InfoSphere Warehouse. Using the text analyzer operator and an Apache UIMA-compatible annotator, you will see how to create a user-defined text analysis task and integrate it into the data transformation flows.
