Add XML as a data mining tool

Delve into the role of XML in data mining with tips and techniques on how to use it

Examine the use and function of XML in data mining. Get tips and techniques that use XML for pattern matching, change detection, search and similarity detection, data annotation, and semantics.

J. Jeffrey Hanson, CTO, Max International

Jeff Hanson has more than 20 years of experience as a software engineer and architect, and he is the CTO for Max International. Jeff has written many articles and books, including Mashups: Strategies for the Modern Enterprise. You can reach Jeff at jjeffreyhanson@gmail.com.



31 May 2011


Frequently used acronyms

  • API: Application programming interface
  • HTML: Hypertext Markup Language
  • XML: Extensible Markup Language

Data mining is the process of applying algorithms to data to uncover patterns that match a given context or query. For many years, organizations have used data mining to analyze volumes of data so they can predict behavior, produce meaningful reports, overcome competition, and so on.

The proliferation of unstructured and semi-structured data across the web has increased the need for intelligent data mining, storage, and processing. Large data sets with escalating complexity have pushed traditional data mining techniques to new levels of processing demand. Mining data from the web involves applying structure to information that is typically presented in a semi-structured format at best.

Overview of data mining

Data mining seeks to extract patterns from large sets of data using, among other things, statistical methods, artificial intelligence, and standard database management techniques.

As storage capacity and processing power increase and the ability of devices to connect becomes ubiquitous, the importance of data mining becomes more apparent. Organizations seek to gain a competitive advantage as they strive to transform extraordinary quantities of data into valuable business knowledge. This knowledge leads to advantages and advances in science, marketing, fraud prevention, surveillance, and other areas. The benefits realized through data mining are creating a sharp rise in demand for more effective data mining techniques, technologies, and solutions. Effective solutions involve optimized techniques and technologies to extract, filter, and transform data. The transformed data is then made available for use through web services, messaging systems, and so on.

Data mining commonly involves a few standard tasks that include clustering, classification, regression, and association rule learning.

Clustering

Clustering, in the context of data mining, refers to the attempt to uncover similar subgroups (clusters) of data embedded within unstructured and semi-structured data. Table 1 lists and describes some typical types of data-mining clustering.

Table 1. Typical types of data-mining clustering

| Clustering type | Description |
|---|---|
| Grid-based | Uses thresholds to uncover matrices or "cells" of data that are combined to form clusters |
| Hierarchical | Involves finding successive groups or clusters of data using previously detected clusters to produce a hierarchy of clusters ranging from small to large |
| Locality or distance-based | Involves methods to uncover clusters of data based on a virtual or physical locale |
| Partitional | Recursively divides data objects into a fixed number of clusters |

Classification

Classification seeks to catalog data according to a predetermined taxonomy or organization such as height, color, and so on. One of the most common uses of classification is to identify spam or unwanted email versus wanted email.

Regression

Regression in data mining is a statistical method that seeks to make predictions about data. For example, you can predict the value of a home based on location, number of bedrooms, square footage, and so on. Formulas are developed from existing quantitative data and then applied to subsequent data to make predictions. Regression is commonly used for forecasting.
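
As a minimal sketch of this idea, the following Python fits a one-variable least-squares line to a handful of made-up (square footage, price) pairs and applies it to a new home. The figures and the fit_line name are assumptions chosen for illustration, not real market data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: square footage and sale price.
square_feet = [1000, 1500, 2000, 2500]
prices = [150000, 200000, 250000, 300000]

slope, intercept = fit_line(square_feet, prices)

# Apply the fitted formula to a home that was not in the training data.
predicted = slope * 1800 + intercept
print(round(predicted))
```

A realistic model would use several predictors (bedrooms, location, and so on) and a statistics library; a single variable keeps the arithmetic easy to follow.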

Association rule learning

Association rule learning searches for relationships between data objects to make predictions, position products, and so on. For example, a grocery store might use association rule learning to predict customer buying habits by determining that whenever a customer purchases hot dogs, buns, and charcoal, the customer also typically purchases paper plates.
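
One common way to quantify such a rule is with support and confidence. The Python below is a small sketch over invented shopping baskets; the basket contents and function names are assumptions made for this example.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets.
baskets = [
    {"hot dogs", "buns", "charcoal", "paper plates"},
    {"hot dogs", "buns", "charcoal", "paper plates", "soda"},
    {"hot dogs", "buns", "charcoal"},
    {"milk", "bread"},
]

rule_if = {"hot dogs", "buns", "charcoal"}
rule_then = {"paper plates"}

rule_support = support(baskets, rule_if | rule_then)
rule_confidence = confidence(baskets, rule_if, rule_then)
print(rule_support, rule_confidence)
```

The sketch assumes the antecedent occurs in at least one basket; otherwise the confidence calculation divides by zero.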


Overview of XML

XML is a text-based markup language that you can use to structure data in a variety of dialects. XML dialects are typically defined by schemas that you can reference externally or embed within an XML document. Most prominent programming languages provide some form of support for manipulating XML documents.

XML is a subset of the Standard Generalized Markup Language (SGML) and is designed to provide meta-information about the content of a given XML document. XML stores hardware- or software-independent data in plain text format. These attributes have made XML one of the most ubiquitous formats for communicating and sharing information across applications and systems on the web.

XML-based dialects are used quite extensively for data mining purposes. Table 2 lists and describes some of the more common dialects used in data mining.

Table 2. XML dialects for data mining

| Dialect | Description |
|---|---|
| CWM-DM | The Common Warehouse Metamodel for Data Mining (CWM-DM) specification seeks to define data-mining metadata such as a model description, algorithm settings, and attributes. CWM-DM models are XML documents generated using Unified Modeling Language (UML) tools and applications. |
| PMML | Predictive Model Markup Language (PMML) is an XML dialect for defining statistical and data mining models that can be shared between PMML-compliant services and applications. PMML allows heterogeneous services and applications to manipulate data-mining models in a standard manner. A PMML document consists of items such as a header, a data dictionary, data transformation mappings, the model definition, a mining schema, post-processing targets, and output fields. |
| XMLA | The XML for Analysis (XMLA) specification defines XML interfaces that use SOAP to provide access to analytical data from varied sources using two methods: Discover and Execute. XMLA is intended explicitly for data mining and online analytical processing (OLAP). |
| XPath | XML Path Language (XPath) is a common mechanism used to refer to elements and data within an XML document. It bears some similarities to mechanisms used to navigate file-system hierarchies. |
| XQL | XML Query Language (XQL) is a query language designed specifically for XML data, similar to using Structured Query Language (SQL) as the query language for relational data. |

Representing semi-structured data using XML

Semi-structured data refers to data with some form of structure but not enough structure to fit gracefully into a relational model. Semi-structured data generally contains tags or other elements to separate semantically related elements that imply some details about the associated data. HTML represents one common form of semi-structured data. Semi-structured data, primarily in the form of HTML, is enabling new prospects for mining data from the web.

XML can represent both tabular and hierarchical data. XML is also rich with embedded metadata and other descriptive entities such as schemas and Document Type Definitions (DTDs). XML can represent data as simple structures or as complex and elaborate structures. These attributes make XML dialects prime vocabularies for representing semi-structured data.

XML, by design, almost forces a service or application to apply structure to data. As a result, data-mining mechanisms must assign semantics to the data being processed in order to define a useful data model. The data designer using an XML dialect as the data format has complete control over the semantic model defining the data.

XML also presents enough common attributes to services and applications to facilitate generic access to the data from heterogeneous programming languages and environments. This access enables a programmer or user to focus on data manipulation and consumption rather than the algorithms and programming work needed to manipulate and consume the data.


Using XML to extract useful information from semi-structured data sets

Representing semi-structured data as an XML-based document requires a robust data-mining system to support XML consumption, manipulation, and output. This requirement leads a data-mining system to operate on the data in a common manner.

Semi-structured data represented by XML can be perceived as a labeled, directed graph containing one root vertex. Edges, leaves, and other nodes of an XML-based graph can be labeled with text. Each node in the graph can also be defined by a unique identifier.

Consider the example of an XML document in Listing 1 that represents a filmography.

Listing 1. An example of an XML document
<filmography>
  <director name="Scorsese">
    <year>
      2002
      <film>
        <title>Deuces Wild</title>
      </film>
    </year>
    <year>
      2003
      <film>
        <title>The Soul of a Man</title>
      </film>
      <film>
        <title>The Blues</title>
      </film>
    </year>
  </director>
</filmography>

You can diagram the XML document from Listing 1 as the graph illustrated in Figure 1.

Figure 1. XML document as a directed graph
XML document in Listing 1 as a directed graph
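
The graph view can be sketched in Python with the standard library's xml.etree.ElementTree module. The edges function below is a hypothetical helper that walks the document from Listing 1 and yields one (parent, child) pair per edge, giving every node a unique, position-based identifier. Text labels such as the year values are left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

FILMOGRAPHY = """<filmography>
  <director name="Scorsese">
    <year>
      2002
      <film><title>Deuces Wild</title></film>
    </year>
    <year>
      2003
      <film><title>The Soul of a Man</title></film>
      <film><title>The Blues</title></film>
    </year>
  </director>
</filmography>"""


def edges(node, node_id):
    """Yield (parent_id, child_id) for every parent-to-child edge."""
    counts = {}
    for child in node:
        # Number same-named siblings from 1, as XPath does.
        counts[child.tag] = counts.get(child.tag, 0) + 1
        child_id = f"{node_id}/{child.tag}[{counts[child.tag]}]"
        yield node_id, child_id
        yield from edges(child, child_id)


root = ET.fromstring(FILMOGRAPHY)
graph = list(edges(root, "/" + root.tag))
for parent, child in graph:
    print(parent, "->", child)
```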

XML-based query languages are constructed to reach any arbitrary position within an XML document's "graph" using sequences of edge labels and delimiters representing a path. Listing 2 shows a path, or XPath, expression for the Deuces Wild node from the XML graph in Figure 1. The path starts at the root label.

Listing 2. A simple XPath expression
/filmography/director[@name="Scorsese"]/year[1]/film/title/text()
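
To try a path like this without a full XPath processor, you can use Python's xml.etree.ElementTree, which implements a subset of XPath. The sketch below assumes the filmography from Listing 1. Because the subset has no text() step, it reads the .text property instead, and paths are written relative to the root element. As in XPath itself, positions in predicates start at 1.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<filmography><director name="Scorsese">'
    "<year>2002<film><title>Deuces Wild</title></film></year>"
    "<year>2003<film><title>The Soul of a Man</title></film>"
    "<film><title>The Blues</title></film></year>"
    "</director></filmography>"
)

# Paths are relative to the element they are called on. Here that is the
# <filmography> root, so the leading /filmography step is omitted.
first = doc.find("director[@name='Scorsese']/year[1]/film/title")
later = doc.findall("director[@name='Scorsese']/year[2]/film/title")

first_title = first.text
later_titles = [title.text for title in later]
print(first_title, later_titles)
```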

To overcome ambiguity in XML documents, use the context from which the data is derived. For example, if a person searches for a director with the last name White and the XML document also contains a film named White, the XPath expression /filmography/director[@name="White"]/@name restricts the match to director names. This context eliminates the ambiguity when compared with an expression such as /filmography/director[@name="White"]/year[1]/film[1]/title/text(), which addresses a film title rather than a director name.
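
The following Python sketch illustrates the difference between a context-free keyword search and path-based queries. The document, the names in it, and the keyword_hits helper are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Made-up data: one director named White, and another director's film titled White.
doc = ET.fromstring(
    "<filmography>"
    '<director name="White"><year>1999<film><title>Harbor</title></film></year></director>'
    '<director name="Gray"><year>2001<film><title>White</title></film></year></director>'
    "</filmography>"
)


def keyword_hits(root, word):
    """Context-free search: every attribute value or element text equal to `word`."""
    hits = []
    for element in root.iter():
        hits += [f"{element.tag}/@{k}" for k, v in element.attrib.items() if v == word]
        if (element.text or "").strip() == word:
            hits.append(f"{element.tag}/text()")
    return hits


# Without context, the search cannot tell a director from a film title.
ambiguous = keyword_hits(doc, "White")

# With a path as context, each query matches exactly one kind of node.
directors = [d.get("name") for d in doc.findall("director[@name='White']")]
films = [f.find("title").text for f in doc.findall("director/year/film[title='White']")]
print(ambiguous, directors, films)
```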

Extracting information from semi-structured data sources such as HTML documents on the web typically requires filtering, converting, and extracting the data. To change an HTML document to a structured form, convert it to an Extensible Hypertext Markup Language (XHTML) document. This typically involves a filtering process that includes logic to group related nodes by tag name, remove prohibited items such as deprecated tags and attributes, and so on. The next step is to convert the filtered document by declaring a single html root element, converting element and attribute names to lowercase, adding an end tag for every start tag, adding alt attributes to img tags, and so on.
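
A few of these conversion steps can be sketched with Python's standard html.parser module. The XhtmlConverter class and to_xhtml function below are hypothetical names for this illustration. The sketch lowercases names, closes unclosed tags, self-closes a small assumed set of empty elements, and adds a missing alt attribute. It does not filter deprecated markup or enforce a single html root.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Assumption: a partial list of HTML elements that never have content.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlConverter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased tag and attribute names.
        attrs = [(k, v if v is not None else k) for k, v in attrs]
        if tag == "img" and "alt" not in dict(attrs):
            attrs.append(("alt", ""))
        text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{text} />")
        else:
            self.out.append(f"<{tag}{text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # ignore stray or redundant end tags
        # Close anything left open inside this element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close(self):
        super().close()
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")


def to_xhtml(html):
    converter = XhtmlConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.out)


xhtml = to_xhtml('<HTML><BODY><P>Films<BR><IMG SRC="a.png"></BODY></HTML>')
ET.fromstring(xhtml)  # raises ParseError if the result is not well-formed
print(xhtml)
```

Production conversions usually rely on a dedicated HTML-cleaning tool, because real pages contain many more irregularities than this sketch handles.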

Using the filter/convert/extract process described earlier, you can consider the web to be a large XML-based data store. Therefore, you can facilitate extraction of data from the web using a query language for XML such as XQuery.

To extract and use data from an XHTML document using XQuery, you formulate a query to locate content in the document, creating new XML structures as needed, and, optionally, creating new XML documents altogether. Locating data in the document is an exercise of pattern matching using XPath expressions.

The XQuery in Listing 3 illustrates a simple XQuery expression to return the title for films from the year 2002 from the XML document in Listing 1.

Listing 3. An XQuery expression to locate the titles of films from 2002
for $y in doc("www.example.com/films.xml")/filmography/director/year
where normalize-space($y/text()[1]) = "2002"
return $y/film/title
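
For readers without an XQuery processor at hand, the same for/where/return selection can be approximated in Python. This is a sketch that assumes the document from Listing 1, where each year value is the leading text of its year element. The titles_for_year function is a hypothetical name.

```python
import xml.etree.ElementTree as ET

FILMOGRAPHY = """<filmography>
  <director name="Scorsese">
    <year>
      2002
      <film><title>Deuces Wild</title></film>
    </year>
    <year>
      2003
      <film><title>The Soul of a Man</title></film>
      <film><title>The Blues</title></film>
    </year>
  </director>
</filmography>"""


def titles_for_year(root, year):
    """Return the film titles found under <year> elements whose leading text is `year`."""
    titles = []
    for year_el in root.findall("director/year"):        # for
        if (year_el.text or "").strip() == year:          # where
            titles += [t.text for t in year_el.findall("film/title")]  # return
    return titles


root = ET.fromstring(FILMOGRAPHY)
print(titles_for_year(root, "2002"))
```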

Note that you can annotate results returned from an XQuery expression with XML tags, attributes, and other markup as needed by the consumer of the results. For example, note the snippet in Listing 4.

Listing 4. An annotated XQuery result
for $x in doc("www.example.com/films.xml")/filmography/director
return
  <director name="{$x/@name}">
    <other_data>...</other_data>
  </director>

In Listing 4, the result returned from the XQuery code is annotated as an XML fragment with data returned from the query.
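
The same annotation step can be sketched in Python by building a new XML fragment from the query results. The film_count element and the annotate function are assumptions introduced for this example, and the source document is a compact form of Listing 1.

```python
import xml.etree.ElementTree as ET

source = ET.fromstring(
    '<filmography><director name="Scorsese">'
    "<year>2002<film><title>Deuces Wild</title></film></year>"
    "<year>2003<film><title>The Soul of a Man</title></film>"
    "<film><title>The Blues</title></film></year>"
    "</director></filmography>"
)


def annotate(root):
    """Build a new <directors> fragment with one annotated entry per director."""
    result = ET.Element("directors")
    for director in root.findall("director"):
        entry = ET.SubElement(result, "director", name=director.get("name", ""))
        # film_count is a hypothetical annotation chosen for this sketch.
        count = ET.SubElement(entry, "film_count")
        count.text = str(len(director.findall("year/film")))
    return result


annotated = ET.tostring(annotate(source), encoding="unicode")
print(annotated)
```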

A feature of XQuery is its ability to indicate a specific document to which you can apply a query. You can enhance the XPath expression /filmography/director[@name="White"]/@name in XQuery to apply to a specific document as in the XQuery expression:

doc("www.example.com/films.xml")/filmography/director[@name="White"]/@name

You can apply similarity detection to XML data using various XQuery constructs to find data by a common element such as author name, source of data, and so on. More complex methods such as frequency weighting and frequency normalization go beyond the scope of this article.


Transforming relational data to XML data

As mentioned before, an XML document is structured as a nested hierarchical tree with a number of different node types, such as elements, entity references, comments, and so on. The document can have only one root element, which contains every other element in the document.

A relational database comprises a set of tables, containing a set of records or rows. A record or row contains a set of fields or columns containing data. All rows in a given table have the same number of columns. Therefore, you can model a relational database as a hierarchical XML structure composed of a database node containing a set of table nodes containing a set of row nodes containing a set of column nodes.

You can model a relational database (filmography) composed of one table (director) that contains one row as an XML document, as in Listing 5.

Listing 5. An XML representation of a database
<filmography>
  <director>
    <row>
      <name>Scorsese</name>
      <year>
        2002
        <film>
          <title>Deuces Wild</title>
        </film>
      </year>
      <year>
        2003
        <film>
          <title>The Soul of a Man</title>
        </film>
        <film>
          <title>The Blues</title>
        </film>
      </year>
    </row>
  </director>
</filmography>

Note that tables related by key can be nested, as in Listing 5, or referenced by a link embedded within the document as attributes on the related table name.
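
The database/table/row/column mapping can be sketched with Python's built-in sqlite3 module. The database_to_xml function is a hypothetical helper, and the sketch assumes that table and column names are also valid XML element names. It handles flat tables only and does not nest tables related by key.

```python
import sqlite3
import xml.etree.ElementTree as ET


def database_to_xml(conn, db_name):
    """Map database -> table -> row -> column onto nested XML elements."""
    root = ET.Element(db_name)
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        table_el = ET.SubElement(root, table)
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            row_el = ET.SubElement(table_el, "row")
            for column, value in zip(columns, row):
                ET.SubElement(row_el, column).text = "" if value is None else str(value)
    return root


# An in-memory database with one table containing one row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE director (name TEXT)")
conn.execute("INSERT INTO director VALUES ('Scorsese')")

xml_text = ET.tostring(database_to_xml(conn, "filmography"), encoding="unicode")
print(xml_text)
```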


Detecting changes in XML data

Consumers of XML data returned from data-mining queries are often interested only in the data changes made since a previous query. Solutions to this challenge are most often implemented around a store/index/diff process. In this process, XML data is extracted and stored. The data is then indexed using techniques such as XML pattern expression indexing, B-tree structural indexing, content or keyword indexing, and others. After the XML data is indexed, you can query it quickly on subsequent searches to access data to be used in a diff process. The diff process compares the stored data against the current data and produces a delta of the changes.

Change detection can be useful for identifying changes in data patterns that you can use for future analysis and prediction.
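
A much-simplified version of the store/index/diff process can be sketched in Python. The index function below is a hypothetical stand-in for a real index: it maps a position-based path for each element to that element's text. The diff function then compares two such indexes. Attributes are ignored, and because paths are positional, inserting an element ahead of existing siblings shows up as changes to those siblings.

```python
import xml.etree.ElementTree as ET


def index(root):
    """Map a position-based path for each element to its own stripped text."""
    entries = {}

    def walk(node, path):
        entries[path] = (node.text or "").strip()
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return entries


def diff(old_root, new_root):
    """Return the paths that were added, removed, or changed between snapshots."""
    old, new = index(old_root), index(new_root)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }


# Two made-up snapshots of the same query result.
old = ET.fromstring("<filmography><film><title>Deuces Wild</title></film></filmography>")
new = ET.fromstring(
    "<filmography><film><title>Deuces Wild (2002)</title></film>"
    "<film><title>The Blues</title></film></filmography>"
)

delta = diff(old, new)
print(delta)
```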


Summary

Data mining processes apply algorithms to data to uncover patterns matching a given context or query. Organizations use data mining to analyze large collections of data to produce meaningful reports that help to predict behavior, overcome competition, and more.

The sheer volume of unstructured and semi-structured data found on the web and in internal data stores increases the need for intelligent and effective data mining. Large, complex data sets push traditional data mining techniques to new levels of processing demand. Mining data from today's data stores requires that processors attempt to create structured data from data that is often totally unstructured or semi-structured at best.

In this article, you reviewed the use and roles of XML in data mining, including pattern matching, change detection, similarity search and detection, data annotation, and semantics. You also looked briefly at existing standards for using XML in the context of data mining.
