Text analysis in InfoSphere Warehouse, Part 1: Architecture overview and example of information extraction with regular expressions

Gain business insights from unstructured data

Unstructured information represents the largest, most current, and fastest growing source of information that is available today. This information exists in many different sources such as call center records, repair reports, product reviews, e-mails, and many others. The text analysis features of IBM® InfoSphere™ Warehouse can help you uncover the hidden value in this unstructured data. This series of articles covers the general architecture and business opportunities of analyzing unstructured data with the text analysis capabilities of InfoSphere Warehouse. The integration of this capability with IBM Cognos® reporting enables people across the company to exploit the text analysis results. This first article introduces the basic architecture of the text analysis feature in InfoSphere Warehouse and includes a technical example showing how to extract concepts from text using regular expressions.

Stefan Abraham (stefana@de.ibm.com), Software Engineer, IBM

Stefan Abraham is a Software Engineer at the IBM Research & Development Lab Boeblingen, Germany. He works on text analysis components and on data mining related UI components in InfoSphere Warehouse.



Simone Daum (sdaum@de.ibm.com), Software Engineer, IBM

Simone Daum is a Software Engineer at the IBM Research & Development Lab Boeblingen, Germany. She works on tooling for data preparation for data mining and on text analysis in InfoSphere Warehouse.



Benjamin G. Leonhardi (bleon@de.ibm.com), Software Engineer, IBM

Benjamin Leonhardi is a software engineer for InfoSphere Warehouse data mining at the IBM Research & Development Lab in Boeblingen, Germany. He works on mining visualization, text mining, and mining reporting solutions.



04 June 2009


Introduction

In a recent TDWI survey, data management professionals were asked "Which types of data and source systems will feed your data warehouse three years from now?" Respondents said they expected a huge increase in unstructured data. This included e-mail, call center transcripts, documents from content management systems, and public content from forums or blogs. (See the Resources section for a link to the survey.)

This series of articles describes how text analysis technology can transform this unstructured, textual data into meaningful pieces of information that can be used within Business Intelligence applications. Unstructured data can improve the quality of existing BI analytics, or in some cases, it can be the key enabler for new types of insight.

Sample business scenarios

Following are two examples of business scenarios that illustrate the value of text analysis technology:

  • Reduce customer churn by identifying unhappy customers as early as possible: Companies in the telecommunication sector already have elaborate predictive analytic models for customer churn. However, these models are predominantly based on structured data. Adding information from unstructured data could enhance these predictive models significantly. For example, a company could detect dissatisfied customers who explicitly reference a competitor in a service call. By including this in the churn model, the company could then set up processes to trigger immediate action at the first sign of customer discontent.
  • Improve the quality of early warning systems: Internal problem reports, customer e-mail, or call center transcripts can yield important information about emerging product problems. Today, companies try to capture these insights using a fixed set of categories within problem taxonomies. Such taxonomies typically suffer from granularity problems. If the taxonomies contain only high-level categories, the company can't capture the actual reason for a problem. However, if the taxonomies try to capture all possible problems, they become too unwieldy for front-line personnel, such as call center employees, to use. The actual reason for a defect is often buried within technician comments or call center logs. So, for example, a company may be able to detect that there is a problem with a certain product, but may not realize that a particular part is causing the problem. Therefore, the company misses the opportunity to take appropriate actions such as issuing a product recall or checking other products that use the problematic part. By using the frequent terms analysis available in InfoSphere Warehouse, the company could create a report showing correlated terms extracted from customer complaints for a certain product model. This could provide direct insight into likely problem spots.

In both of the above scenarios, text is the main type of unstructured data. Companies may also have a need to analyze semi-structured text (such as XML content) or other data types (such as audio and video). However, the authors of this series of articles see that the bulk of content that is relevant for today’s applications comes in as free-form text from technician notes, customer comments through CRM applications or e-mail, or snippets from news services. Thus, we have chosen to focus these articles on free-form text.

Information Extraction tasks

The basic task behind text analysis is Information Extraction (IE). Information Extraction is an area of natural language processing that is concerned with examining unstructured text in order to extract concepts (referred to as entities) and relationships between these concepts.

Relevant information extraction tasks are:

  • Named Entity Recognition (NER): recognize and extract named entities. For example, person or place names, monetary expressions, and problem indicators.
  • Relationship Detection: detect relationships based on named entities. For example, part X causes problem Y.
  • Coreference resolution: identify expressions across a document that refer to the same entity. For example, the hotel named "Best Hotel" in the following text: I liked my stay at the Best Hotel. It has very bright rooms. The hotel also features…

List-based and rules-based named entity recognition

One approach to Named Entity Recognition is list-based extraction of entities. This would include extraction of things such as employee names (for example, from the company LDAP server) or product names and their attributes. Some domains already have an official vocabulary, such as the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) in the health care industry.

One advantage of list-based extraction is that the word lists often come from trusted sources, which means their creation and maintenance can be automated to a certain degree. For example, each time a new product name is added, you can have it trigger a batch update. Also, the extraction results are immediately plausible to the end user. Often, however, the terms in the list have variants and acronyms that a domain expert has to add to the list.
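To illustrate the idea in code, the following minimal Java sketch scans a text for entries from a term list. The product names are made up for this example, and the sketch deliberately ignores tokenization, word-stem reduction, and term variants, which the Dictionary Lookup operator described later handles for you.

import java.util.List;

// Toy illustration of list-based extraction: scan a text for entries from a
// term list. The product names below are invented for this example.
public class ListLookupSketch {
    public static void main(String[] args) {
        List<String> productList = List.of("Model X200", "Model X300", "PowerDrill 5");
        String callCenterNote =
                "Customer reports that the Model X200 overheats after two hours of use.";

        for (String product : productList) {
            int pos = callCenterNote.indexOf(product);
            if (pos >= 0) {
                System.out.println("Found '" + product + "' at offset " + pos);
            }
        }
    }
}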

Some types of entities, such as telephone numbers or monetary expressions, can’t be listed exhaustively. For these entities, rule-based extraction is the proper approach. One advantage of rules is generalization—one rule may cover a large range of entities. Another advantage is that rules can take the document context into account. This is crucial for tasks like sentiment detection where a negation word such as "not" flips the sentiment of a whole sentence.

The key challenge for rules is their complexity—users need help to create and maintain rules. The people with the appropriate domain knowledge are often non-technical. Therefore, configuration tooling that hides the intricacies of linguistics and rule languages from these users is necessary.
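As a simple, hypothetical illustration of a context-sensitive rule, the following Java regular expression treats a sentiment word as negated when a negation word occurs shortly before it in the same sentence. The word lists and the 30-character window are arbitrary choices made for this sketch; they are not InfoSphere Warehouse functionality.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of a context-sensitive rule: a positive sentiment word counts as
// negated when a negation word occurs up to 30 characters before it within
// the same sentence (no sentence-ending punctuation in between).
public class NegationRuleSketch {
    private static final Pattern NEGATED_POSITIVE = Pattern.compile(
            "\\b(?:not|never|no)\\b[^.!?]{0,30}?\\b(good|helpful|satisfied)\\b",
            Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        String sentence = "The support hotline was not helpful at all.";
        Matcher m = NEGATED_POSITIVE.matcher(sentence);
        if (m.find()) {
            System.out.println("Negated sentiment detected: '" + m.group(1) + "'");
        }
    }
}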

Article and series overview

The rest of this article briefly presents the basic architecture of InfoSphere Warehouse and its text analysis features. It then shows a simple, step-by-step example of how to use InfoSphere Warehouse to extract concepts using regular expressions.

The upcoming articles of the series will describe other text analysis features available in InfoSphere Warehouse, and show how these results can be used in reporting software products such as IBM Cognos 8 BI.


IBM InfoSphere Warehouse architecture

InfoSphere Warehouse is the IBM warehouse solution built on IBM DB2® for data storage. This article focuses on the text analytics capabilities of InfoSphere Warehouse, but the product also includes a wide array of other tools for warehouse management and analysis tasks, such as online analytical processing (OLAP), performance management, and workload management.

Figure 1. InfoSphere Warehouse architecture
Graphic depicting the architecture of InfoSphere Warehouse according to the design, deploy, and manage tasks; components are described below

As shown in the above architecture diagram, the main components of InfoSphere Warehouse are:

  • The DB2 database server contains the structured and unstructured (most often text) warehouse data.
  • The Design Studio is a tooling platform used by business analysts and warehouse administrators to design workload rules, data transformation flows, and analytical flows for data mining and text analysis. For example, a business analyst might create an analytics flow to extract structured information from customer e-mail or call center reports. These flows can then be deployed to the InfoSphere Warehouse Administration Console. In addition, the Design Studio provides tooling to better understand the data, create resources like dictionaries or regular expression rules used in the analytic flows, and much more.
  • The Administration Console is used to manage and monitor the warehouse. After you deploy flows that have been designed in the Design Studio, you can run, schedule, and monitor them. For example, you might schedule a weekly analysis of new call center reports to identify customers who are likely to churn, or run a search of recent technician notes aimed at finding potential product problems.

Unstructured analytics in InfoSphere Warehouse

InfoSphere Warehouse uses the Unstructured Information Management Architecture (UIMA) for the analysis of unstructured data. UIMA is an open, scalable, and extensible platform for creating, integrating, and deploying text-analysis solutions. UIMA is free software and provides a common foundation for industry and academia. UIMA-based components that are used to extract entities like names, sentiments, or relationships are called UIMA Annotators or Analysis Engines.

InfoSphere Warehouse provides operators and tooling for dictionary-based and regular expression-based named entity recognition. For other text analysis tasks, a generic text analysis operator is available that can be used to run Apache UIMA-compatible annotators in analytic flows:

  • Data understanding is important for successful information extraction from text data, so InfoSphere Warehouse provides the Data Exploration feature to find columns with relevant text information (Text Statistics view) and to browse through the text (Sample Contents view). For a deeper analysis, you can use the Frequent Terms Extraction feature to extract the most frequent terms that occur in a text column and explore them with advanced visualizations such as a cloud view. Frequent Terms Extraction is an important feature for the efficient creation of dictionaries that can be used in dictionary-based analysis.
  • Dictionary-based analysis is the extraction of keywords from text. Examples of entities that you can extract are names, companies, and products. You can also extract all entities that are contained in a list. InfoSphere Warehouse supports dictionary-based analysis of text columns through the Dictionary Lookup operator. The Dictionary Lookup operator is based on technology from IBM LanguageWare. It supports natural language processing like word stem reduction and tokenization in multiple languages. Dictionaries can be created and maintained with the Dictionary Editor in InfoSphere Warehouse. InfoSphere Warehouse also includes a Taxonomy Editor that categorizes dictionary entries in a taxonomy tree for use in data mining and OLAP. Dictionary-based analysis will be demonstrated in detail in another article in this series.
  • Rule-based analysis is the extraction of information from text through regular expression rules. Regular expressions are ideal for extracting concepts like phone or credit card numbers, addresses, and dates. InfoSphere Warehouse supports rule-based analysis through the Regular Expression Lookup operator. The operator uses rule files containing regular expression rules to extract concepts from text columns. You can create and modify these rule files with the Regular Expressions editor. A detailed example of this is provided in this article.
  • In addition to the common text analysis methods above, InfoSphere Warehouse allows the use of Apache UIMA-compatible annotators. You can import these into an InfoSphere Warehouse Data Warehousing project and use them in the Text Analyzer operator, for example, to extract higher-level concepts like relationships or sentiments. Advanced UIMA annotators are available from IBM custom solutions, IBM Research, other companies, and universities. You can also create them from scratch using the UIMA SDK; a minimal annotator sketch follows this list. (See the Resources section for a link to more information about UIMA.)
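To give a feel for what such an annotator looks like in code, here is a minimal, hypothetical sketch written against the Apache UIMA Java SDK (it assumes the uimaj-core library is on the classpath). A real annotator would normally declare its own annotation type in a type system descriptor instead of reusing the generic Annotation base type, and it would be packaged with a descriptor before being imported into the Text Analyzer operator.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

// Minimal UIMA analysis engine that marks telephone numbers of the form
// 001-555-7323 in the document text.
public class PhoneNumberAnnotator extends JCasAnnotator_ImplBase {

    private static final Pattern PHONE = Pattern.compile("001-\\d+-\\d+");

    @Override
    public void process(JCas jcas) throws AnalysisEngineProcessException {
        Matcher matcher = PHONE.matcher(jcas.getDocumentText());
        while (matcher.find()) {
            // Create an annotation covering the matched span and add it to the
            // CAS indexes so that downstream components can retrieve it.
            new Annotation(jcas, matcher.start(), matcher.end()).addToIndexes();
        }
    }
}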

Using the InfoSphere Warehouse Design Studio for text analysis

Figure 2 shows the InfoSphere Warehouse Design Studio.

Figure 2. InfoSphere Warehouse Design Studio
InfoSphere Warehouse Design Studio screen shot showing a mining flow named Jobs Dictionary Analysis


The Design Studio is the integrated tooling platform of InfoSphere Warehouse. It is built on Eclipse technology. The Design Studio lets you save your work into projects. You can see all your projects in the Project Explorer, which appears on the left side of the Design Studio interface. The default project for all data warehouse work is the Data Warehousing Project. This project contains a Text Analysis folder that holds the text analytic resources such as dictionaries, rule files, and taxonomies.

Information extraction is performed by the text operators in the data transformation flows (data flows and mining flows). Within these flows, you can sample, join, and modify tables. Text operators can then extract structured information from text columns and add it to the output as new columns containing the concepts that were found, such as names, skills, and dates.

Figure 3 depicts a scenario where concepts in free-form text are first annotated and later written to a database table together with existing structured information.

Figure 3. Unstructured to structured information
Left side of figure contains a dense block of text with highlighted product names and terms; the right side shows the terms integrated into a database table

Extracting information from text using patterns: an example

For source data, this example uses the World Factbook document collection that is produced by the United States Central Intelligence Agency. These documents contain information about all countries in the world, including the area of the country in square kilometers and its geographic location in longitude and latitude.

First, we will create a regular expression rule file to extract the concepts of area and location from text documents. We will then use this rule file in a mining flow to extract the concepts from text columns in relational database tables.

Regular expressions for pattern matching

A regular expression is a pattern of characters that describes a set of strings. InfoSphere Warehouse uses the regular expression syntax of Java™. Regular expression rules can consist of:

  • String literals a, b, c ...
  • Character classes such as [abc], which means that either a, b, or c can be in this position
  • Predefined character classes such as digits \d, which is equivalent to [0-9]
  • Quantifiers that allow you to specify the number of occurrences of a subpattern. For example, a* matches the letter "a" zero or more times, a+ matches it at least once, a{3} matches exactly "aaa", and a{1,3} matches the letter "a" at least once but not more than three times.
  • Groups, which are subpatterns in the regular expression that are surrounded by parentheses. For example, the regular expression A(B(C)) has the subgroups (B(C)) and (C).

For a complete description of the regular expression syntax, refer to the InfoSphere Warehouse documentation or the Java documentation (see Resources).
If you are not familiar with regular expressions, or do not remember the regular expression constructs, you can use the Regular Expression Builder available with the RegEx editor in InfoSphere Warehouse.

If you wanted to create a regular expression to match an international US telephone number (for example, 001-555-7323), you would look for a string that starts with "001", followed by a hyphen, followed by one or more digits, followed by another hyphen, followed again by one or more digits. The regular expression for this would be: (001)-(\d+)-(\d+)
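Because InfoSphere Warehouse follows the Java regular expression syntax, you can sanity-check such a pattern with a few lines of standalone Java before putting it into a rule file. The class name and sample sentence below are just for illustration:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhonePatternCheck {
    public static void main(String[] args) {
        // Literal "001", then one or more digits, then one or more digits,
        // separated by hyphens; each part is captured as its own group.
        Pattern phone = Pattern.compile("(001)-(\\d+)-(\\d+)");
        Matcher m = phone.matcher("Please call 001-555-7323 for assistance.");
        if (m.find()) {
            System.out.println("Match:   " + m.group(0)); // 001-555-7323
            System.out.println("Group 1: " + m.group(1)); // 001
            System.out.println("Group 2: " + m.group(2)); // 555
            System.out.println("Group 3: " + m.group(3)); // 7323
        }
    }
}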

The following example shows you how to use regular expressions to extract concepts such as coordinates and areas from text.

Exploring text in database tables

To perform an analysis of text in database tables, you first explore the available information to select the relevant text columns for the analysis. With the InfoSphere Warehouse Data Exploration view you can browse samples of database tables, browse the contents of large text columns, and determine the average length of text strings.

Following are instructions for how to use the Data Exploration view to explore the content of a table named FACTBOOK. The table can be found in the DWESAMP sample database shipped with InfoSphere Warehouse.

  1. From the InfoSphere Warehouse Data Source Explorer, navigate to the table FACTBOOK in the database DWESAMP and schema CIA. Right-click the table and select Distribution and Statistics -> Data Exploration from the context menu.

    The Data Exploration view in Figure 4 shows a sample of fifty random rows out of the 275 rows in the table. Each row contains the country name and a column named TEXT with information about the country. Below the table, you can select one of the text columns in the table to display the complete content. This allows you to inspect even large amounts of text that might be in the column. From the drop-down list, select the column TEXT.

    Figure 4. The Data Exploration view showing sample contents of the FACTBOOK table
    The Data Exploration view showing sample contents of the FACTBOOK table
  2. To view information about a specific country, select the appropriate row in the sample contents table.
  3. Scroll down in the text description of a country to find the information about its geographic location and its area in square kilometers. For example, by selecting Germany, as shown in Figure 4, you would see the following information: Geographic coordinates: 51 00 N, 9 00 E

    Map references: Europe

    Area: total: 357,021 sq km water: 7,798 sq km land: 349,223 sq km

    Looking at the sample contents table, you can see that a country's geographic coordinates and area are always given in the same format. The task of extracting concepts in this format can be easily handled with regular expression rules.

Creating a rule file with regular expressions for location and area

With rule files, you can define concepts like phone numbers or uniform resource locators. These concepts are called types. To find these concepts in text, you can specify rules that define a pattern to match these concepts.

A concept type can have features. For example, given a phone number that consists of a country code, an area code, and the extension number, you can create the type phone number and specify the features country code, area code, and extension number.

With the RegEx rules editor, you define concept types and their features and then assign rules with regular expression patterns to those types. When a pattern matches a part of the text, an annotation is created for the associated type. You can set the feature values of an annotation by assigning a sub-pattern of the regular expression rule—a match group—to the feature.

When the rule file is used within the Regular Expression Lookup operator in a mining flow or a data flow, the features can be mapped to columns of a relational table, which then hold the extracted concepts.

Create a Data Warehouse Project:

  1. Right-click in the Project Explorer and select New -> Data Warehouse Project from the context menu.
  2. In the following wizard, type the project name, for example, Text Analytics.
  3. Click Finish.

Create a new rule file:

  1. In the Text Analysis folder, right-click the Rules folder and select New -> Rules from the context menu. This displays the New Rules dialog.
  2. Select the Data Warehousing project that you have previously created.
  3. Specify Factbook_Concepts as the rule file name and click Finish. This displays the RegEx editor.

Create a type named Coordinates:

  1. In the Types section, the type Factbook_Concepts is displayed as the initial type. Delete this type and create a new type. Name the new type Coordinates. This automatically creates a rule with the same name.
  2. Expand the Coordinates type. The Features folder is empty and the Rules folder contains the rule named Coordinates, but no regular-expression pattern is defined yet.
  3. The first rule that you create should extract the concept type Coordinates with the features longitude and latitude from the CIA.FACTBOOK table. For example:
    Geographic coordinates: 51 00 N, 9 00 E

    In the RegEx editor, for the Coordinates type, select the Features folder and click New Feature.

  4. In the New Feature dialog, type longitude in the entry field, accept the default data type of String, and click OK.
  5. Repeat the previous step to add another feature named latitude with the data type String.

    As shown in Figure 5, the Types section should now have a definition for the type named Coordinates with two features named longitude and latitude.

    Figure 5. The Coordinates type with the longitude and latitude features
    The Types section tree hierarchy with type Coordinates and its features longitude and latitude
  6. Specify the regular expression pattern for the rule:
    1. Expand the Rules folder in the tree and click on the Coordinates rule to select it.
    2. The Test Rule section of the RegEx editor, which is shown in Figure 6, is used to test your rule on a set of sample text snippets. You enter an example of the text that you want to find in the Input text field. The Matched field then shows the parts of the rule that could be matched to the text in the Input text field.

      Enter the following text snippet into the Input text field:
      Geographic coordinates: 51 00 N, 9 00 E

      Figure 6. The Test Rule section of the RegEx editor
      Screen shot of the Test Rule section of the RegEx editor with the Input text field filled in as described above and the Matched field showing the parts of the rule that could be matched
    3. The Rule section of the RegEx editor, which is shown in Figure 7, is where you actually enter the regular expression.

      Enter the following regular expression pattern into the input field of the Rule section:
      Geographic coordinates: ([0-9]{1,2} [0-9]{1,2} [SN]), ([0-9]{1,3} [0-9]{1,2} [EW])

      Figure 7. The Rule section of the RegEx editor
      Screen shot of the Rule section of the RegEx editor with the input field filled in with the sample regular expression listed above

      The above rule matches the literal text "Geographic coordinates: ", followed by one or two digits, a blank, one or two digits, a blank, and S or N (for south or north); then a comma and a blank; then one to three digits, a blank, one or two digits, a blank, and E or W (for east or west).

      Alternatively, you can try to build the regular expression pattern yourself using the Regular Expression Builder. The Regular Expression builder provides syntax assistance for the regular expression notation.

    4. In this step you assign sub-patterns to the features longitude and latitude so that you are able to extract these concepts separately.

      The rule contains two subpatterns, or match groups, enclosed in parentheses. The first subpattern (Subpattern1) denotes the latitude (51 00 N) and the second subpattern (Subpattern2) denotes the longitude (9 00 E). To see all of this information, click on the Matched panel of the Test Rule section and scroll down using the down-arrow key. (A standalone Java check of the finished rule appears after these steps.)

      In the Features section, select the entry for latitude and click Add subpattern reference. You now see the Add Subpattern Reference dialog shown in Figure 8. For each choice listed on the dialog, the part of the regular expression that defines the sub-pattern is displayed in bold.

      Figure 8. Dialog to add a subpattern reference
      Screen shot of the Add Subpattern Reference dialog box with radio button entries for Entire regular expression, Subpattern 1, and Subpattern 2

      Select Subpattern1.

      Go back to the features section, select the entry for longitude and click Add subpattern reference. From the Add Subpattern Reference dialog, select Subpattern2.

      As shown in Figure 9, the Features section should now show the assigned subpattern for each feature.

      Figure 9. The Features section of the RegEx editor
      Screen shot of the Features section of the RegEx editor with the latitude feature assigned the value of $1 and the longitude feature assigned the value of $2
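If you would like to double-check the pattern and its match groups outside the editor, the same expression behaves identically in plain Java. The following is a small standalone sketch, not part of the tooling:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoordinatesRuleCheck {
    public static void main(String[] args) {
        // Same pattern as the Coordinates rule; subpattern 1 is the latitude,
        // subpattern 2 is the longitude.
        Pattern coordinates = Pattern.compile(
                "Geographic coordinates: ([0-9]{1,2} [0-9]{1,2} [SN]), ([0-9]{1,3} [0-9]{1,2} [EW])");
        Matcher m = coordinates.matcher("Geographic coordinates: 51 00 N, 9 00 E");
        if (m.find()) {
            System.out.println("latitude  = " + m.group(1)); // 51 00 N
            System.out.println("longitude = " + m.group(2)); // 9 00 E
        }
    }
}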

Create a type named Area:

  1. The second rule that you create should extract the concept type Area. For example:
    Area: total: 357,021 sq km water: 7,798 sq km land: 349,223 sq km

    Start by clicking New Type and entering Area as the type name. This also automatically creates a rule with the name Area.

  2. Click New Feature and enter value as the feature name.
  3. Change the data type of the new feature named value to Integer.
  4. Enter the following regular expression:
    Area: total: ([0-9]{1,3}(,[0-9]{1,3})*) sq km
  5. Use the following text snippet to test your rule:
    Area: total: 357,021 sq km water: 7,798 sq km land: 349,223 sq km
  6. Assign Subpattern1 to the feature named value. (A standalone Java check of the finished Area rule follows these steps.)
  7. Save the rule file by clicking in the editor area and pressing Ctrl+S.
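As with the Coordinates rule, you can verify the Area pattern in standalone Java. Stripping the thousands separators below merely illustrates how the captured string corresponds to an integer value; it is not meant to show how the operator itself performs that conversion:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AreaRuleCheck {
    public static void main(String[] args) {
        // Subpattern 1 captures the total area, including thousands separators;
        // the nested subpattern 2 only exists to allow repetition of ",NNN".
        Pattern area = Pattern.compile("Area: total: ([0-9]{1,3}(,[0-9]{1,3})*) sq km");
        Matcher m = area.matcher(
                "Area: total: 357,021 sq km water: 7,798 sq km land: 349,223 sq km");
        if (m.find()) {
            int value = Integer.parseInt(m.group(1).replace(",", ""));
            System.out.println("value = " + value); // 357021
        }
    }
}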

Creating a flow with a Regular Expression Lookup operator

The Regular Expression Lookup operator is based on rule files. The rule files contain regular expression patterns that let you extract concepts such as phone numbers or email addresses from database tables. With the Regular Expression Lookup operator you can find the text sections that match the expressions contained in the selected rule files.

Create an empty mining flow:

  1. Right-click the Mining Flows folder in your Data Warehousing project and select New -> Mining Flow from the context menu.
  2. In the New File Data Mining Flow wizard, type a name for the mining flow. For example, RegExLookup.
  3. Make the selection to work against a database and click Next.
  4. From the Select Connection page, select the DWESAMP database and click Finish.

    This opens the Mining Flow editor.

Define the mining flow:

  1. The right side of the Mining Flow editor contains a palette with operators. You can use these operators to build up a mining flow by dragging and dropping them onto the editor canvas.

    Find the Sources and Targets section of the palette. Select a Table Source operator and drag it onto the editor canvas.

  2. In the table selection dialog, expand the CIA schema, select the FACTBOOK table, and click Finish.
  3. From the Text Operators section, drag a Regular Expression Lookup operator onto the canvas.

    You can now see the Properties view of the operator below the canvas. (However, if your editor area is maximized, you cannot see this view until you reduce the size of the editor.)

  4. On the canvas, use a simple drag operation to connect the output port of the Source Table operator with the input port of the Regular Expression Lookup operator.
  5. On the Settings page of the Properties view, select the input-text column TEXT from the list of input-text columns.
  6. On the Analysis Result page, use the Add new Port icon (shown in Figure 10) to add two new output ports.
    Figure 10. Add new port from the Analysis Result page
    Screen shot of the Analysis Result page with a cursor pointing to the Add new Port icon

    For the first output port (Output):

    1. Select the Output tab on the analysis results page.
    2. Select the Factbook_Concepts as the rule file to be used in the text analysis.
    3. Select Coordinates from the drop-down list of annotation types.
    4. Delete the columns named begin and end from the Result Columns table. These columns contain the start and end position of the found concept in the text. You do not need this information for this analysis.

    For the second output port (Output1):

    1. Select the Output1 tab on the analysis results page.
    2. Select the Factbook_Concepts as the rule file to be used in the text analysis.
    3. Select Area from the drop-down list of annotation types.
    4. Delete the columns named begin and end from the Result Columns table.

    Figure 11 shows the complete Analysis Results page.

    Figure 11. The Analysis Results page
    Screen shot of the Analysis Results page with output port Output and output port Output1
  7. On the Output columns page of the Properties View, select the column named COUNTRY from the list of available columns and move it to the list of output columns on the right. Now you are able to relate the extracted concepts with the COUNTRY key, because this column is also contained in the operator output.
  8. Create the tables that receive the analysis results:
    1. Right-click the first output port (Output) of the Regular Expression Lookup operator and select Create suitable table from the context menu. Enter COUNTRY_COORDINATES for the table name, CIA for the schema, and click Finish.
    2. Right-click the second output port (Output1) and select Create suitable table from the context menu. Enter COUNTRY_AREA for the table name, CIA for the schema, and click Finish.
  9. Finally, save the mining flow by clicking in the editor area and pressing Ctrl+S.

Figure 12 shows the completed mining flow. It is now ready to execute.

Figure 12. The complete mining flow
Screen shot of the completed mining flow built according to the instructions above

Execute the mining flow:

  1. The execution process analyzes the source columns in the table CIA.FACTBOOK using your rule file and writes the results into the COUNTRY_COORDINATES and COUNTRY_AREA target tables you just created.

    To execute the mining flow, select Mining Flow -> Execute from the menu and then click Execute on the wizard.

  2. After executing the mining flow, you can explore the content of the target tables.

    Right-click the COUNTRY_COORDINATES table and select Sample contents of database table from the context menu. The sample contents are shown in the Data Output view, as shown in Figure 13. (A small JDBC sketch after these steps shows how an application could consume the joined result tables.)

    Figure 13. Sample contents of the COUNTRY_COORDINATES table showing the extracted longitude and latitude
    Screen shot of the COUNTRY_COORDINATES table being displayed in the Data Output view; table shows columns for country, longitude, latitude, and covered text
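Because the extracted concepts now live in ordinary relational tables, any SQL client or application can consume them. The following hypothetical JDBC sketch joins the two result tables on the COUNTRY key. The connection URL and credentials are placeholders, the column names are assumed to follow the feature names (as in Figure 13), and the IBM DB2 JDBC driver must be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical client that reads the extracted concepts back out of the
// warehouse; host, port, user, and password are placeholders.
public class ReadExtractedConcepts {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:db2://localhost:50000/DWESAMP";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT c.COUNTRY, c.LATITUDE, c.LONGITUDE, a.\"VALUE\" " +
                     "FROM CIA.COUNTRY_COORDINATES c " +
                     "JOIN CIA.COUNTRY_AREA a ON a.COUNTRY = c.COUNTRY")) {
            while (rs.next()) {
                System.out.printf("%s: %s / %s, %d sq km%n",
                        rs.getString(1), rs.getString(2), rs.getString(3), rs.getInt(4));
            }
        }
    }
}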

Conclusion and outlook

This article has described the basic architecture of InfoSphere Warehouse, and of its text analysis features in particular. It also described some possibilities for how text analysis results can be used to create new insights. A step-by-step example demonstrated how to build a simple named entity extraction task using regular expressions. The follow-up articles in this series will show you how to perform other tasks, such as extracting named entities using dictionaries, and how to use IBM Cognos 8 BI to view and analyze the text analysis results together with existing structured data.

Resources

