XML data mining, Part 1: Survey several approaches to XML data mining

Examine and address the challenges of static and dynamic documents

XML is used for data representation, storage, and exchange in many different arenas. This series explores one facet of XML data analysis: XML data mining. In this first article, get an introduction to some techniques and approaches for mining hidden knowledge from XML documents. Learn about mining data, the hierarchical structure of the information, and the relationships between elements. Subsequent articles will cover mining XML association rules and clustering multi-version XML documents.

13 Dec 2011 - Added sidebars with a link to Part 2 of this article in the Introduction, Summary, and Resources.

02 May 2012 - Added links to Part 3 of this article to sidebars in the Introduction and Summary, and in Resources.

Laura Irina Rusu, PhD MACS CP, Development Team Lead, DW and BI Consultant, Computershare Technology Services Australia, La Trobe University Australia

Photo of Laura Irina RusuLaura Rusu completed her Ph.D. in computer science in 2009 at La Trobe University, Melbourne, Australia, with a thesis on XML data warehousing and mining. She proposed several novel techniques for XML data warehousing and mining, presented them at a number of international conferences, and published extended papers in international peer-reviewed journals. She wrote a book chapter on mining association rules from XML documents and edited a book on data mining and warehousing technologies. Laura's experience is a blend of academic, research, and IT industry roles. You can reach her at Laura.Rusu@acsmail.net.au.



02 May 2012 (First published 15 November 2011)

Also available in Chinese Russian Japanese

Introduction

XML has increasingly become the language of choice for data representation, storage, and exchange in many domains—from technology to financial services, medical systems, bioinformatics, aviation, defense, and so on. There has been a rapid increase in the amount of information being represented in XML. Many folks are searching for sensible techniques to solve issues associated with XML document storage and analysis. In this three-part series, explore certain facets of XML data mining.

In this article, learn about the research and techniques available for XML data mining. Explore the challenges in mining association rules from static and dynamic XML documents.

Subsequent articles in this series will cover mining XML association rules and clustering multi-version XML documents.


XML mining

Frequently used abbreviations

  • W3C: World Wide Web Consortium
  • WSDL: Web Services Description Language

XML mining includes mining both the structure and the contents from XML documents. Mining of structure, which is essentially mining the XML schema, includes intra-structure mining (mining the structure inside an XML document) and inter-structure mining (mining the structures between XML documents). Mining of content involves content analysis and structure clarification. Content analysis is concerned with analyzing texts within the XML document. Structural clarification is concerned with determining the similar documents based on their content.

Static XML documents do not change their content and structure over time. For example, an XML document containing details of papers presented at a conference is a static document. Dynamic, or multi-versioned, XML documents change their structure or content over time. For example, if the content of an online bookshop were represented in XML format, it would change daily based on e-customer behavior.

The main characteristics differentiating static from dynamic XML documents are:

Representation of temporal coverage
A static XML document does not contain an element indicating how long the document is valid. Conversely, a dynamic XML document inherently contains at least one element that indicates the temporal coverage of the specific version of the document.
Persistence of the rendered information
Once it has been created, the information in a static XML document is always valid. Conversely, a version of a dynamic XML document is valid only for the period specified by the temporal elements. Once a new version emerges, the information contained in the previous version is superseded.

Figure 1 shows an example of a schema for a static XML document in WSDL format.

Figure 1. Schema for a static XML document
Example diagram showing a schema for a static XML document

The content of an XML document instance based on the schema in Figure 1 will never change because, once published, the papers presented at that particular conference will not be modified.

Figure 2 shows a schema example for a dynamic XML document in WSDL format.

Figure 2. Schema for a dynamic (multi-versioned) XML document
Example diagram showing a schema for a dynamic XML document

In Figure 2, the <AsAtDate> node contains the validity period for the data in each XML document version. Instead of using a single date node to specify a validity time, you might use two date nodes to specify a validity date range (for example, <from_date> and <to_date>). Generally, changes to XML documents make them dynamic. Given an initial version of an XML document, each set of changes applied to it produces a new XML document. The new XML document will be considered a new version of the old one.


Mining static XML documents

This section reviews several approaches for mining association rules from static XML documents and for clustering static XML documents.

Association rules from static XML documents

Most of the work with mining association rules from static XML documents uses XML-oriented algorithms based on the Apriori algorithm. There are several non-Apriori-based approaches, however.

The Apriori algorithm

The concept of association rules was introduced in 1993 and described in more detail in 1994 (see Resources). For example, an association rule is: "If the user buys product A, he or she also buys product B, and this happens in more than 80% of transactions." The support of the rule is 80%, which means that 80% of transactions contain both items A and B.

The algorithm to find the rules sets the minimum support and confidence required at the beginning of the mining process. Then, all the large k-itemsets are determined, starting from k = 1 (itemsets with only one item) and looping through the set of transactions to calculate the support and the confidence. If they are not validated against the minimum required, the k-itemsets are considered not large and are removed.

This algorithm has become known as the Apriori algorithm and has been applied to many subsequent approaches to discover association rules.

Discovering association rules involves finding the interesting relationships among items that appear together in a given data set. The relationship should occur often enough to be treated as a rule; the rule has to have support and confidence with regard to the data set.

If the concept of association rules is extended to XML documents, you're seeking relationships between parts of the XML documents. A part of an XML document can be either a simple node or a complex node, which consists of other simple nodes. Any complex node can be seen as a tree within the bigger tree that is the whole XML document. Hence, the task of mining association rules in an XML document entails finding relationships between subtrees (substructures) of XML documents. The list below summarizes three different methods.

  • One XML association rule extraction algorithm proposed by Wan and Dobbie (see Resources) is actually an implementation of the Apriori algorithm using the XQuery language. Using this algorithm, XML data can be mined directly, without the need to preprocess the XML document. Mapping to a different format is not required. However, discovering large itemsets in the mining process is slower when using the XQuery implementation than when using a C++ implementation. In the XQuery implementation, the update capabilities that the algorithm requires are limited.
  • A method proposed by Daniele Braga and colleagues (see Resources) is based on the Apriori algorithm and mines the association rules in three major steps: preprocessing data, extracting association rules, and post-processing association rules. The authors introduce the concepts of:
    • Context of the association rules
    • Context selection
    • Body
    • Head
    In a relationship X -->Y, X is the body and Y is the head of the association rule. Body and head are always defined with respect to the context of the rule. The support and the confidence will be calculated and relevant only with respect to the established context. Figure 3 shows an example of an XML document represented as a tree with the identified context, body, and head.
    Figure 3. XML document with context, body, and head
    Example diagram of an XML document with the identified context, body, and head
  • A technique proposed by Ling Feng and colleagues (see Resources), which is not based on the Apriori algorithm, proposes a different mapping of the concepts of transaction and item to the tree-like structure of the XML documents. The goal is to discover association rules from a collection of XML documents rather than from a single document. Hence, each XML document (tree) corresponds to a database record (transaction), where each XML fragment (subtree) corresponds to an item in the transaction. This work tries to discover association rules among trees in XML documents rather than among simple-structured items. Each tree is named a tree-structured item and is a rooted, ordered tree with nodes classified into basic nodes (no edges emanate from them) and complex nodes (internal nodes with one or more edges emanating from them).

The main difference between the Apriori-based and non-Apriori-based approaches is in the way they use the notions of transaction and item. Although the Apriori-based algorithms extract the items to be mined as a list of nodes by querying the XML document for a specific path, the non-Apriori algorithms see each subtree (substructure) in the XML documents as an item.

Various algorithms allow different degrees of complexity (different levels of nesting) in the mined documents. It is also possible that you need to mine only a single XML document or to extract association rules from multiple XML documents.

The Apriori-based XML association rules mining techniques described here work with XML documents of any complexity as long as the structure is known in advance. For XML documents that are complex and irregularly structured, consider using a set of transformations on the XML data to simplify identifying the mining context. In some of the algorithms, you need to define the context, body, and head of the association rules at the beginning of the algorithm. The methods, their strengths, and their weaknesses are summarized in Table 1.

Table 1. Mining association rules from static XML documents
WorksApriori-based mining algorithmNon-Apriori-based mining algorithmMine association rules from a single XML documentMine association rules from multiple XML documentsMine association rules from a simple XML documentMine association rules from complex XML documents
Wan & Dobbie, 2003; Wan & Dobbie, 2004YNYNYN
Braga et al., 2002; Braga et al., 2003YNYYYY
Feng et al., 2003NYYYYY

Clustering static XML documents

Static XML documents can be clustered based on the type of similarity—structural, semantic, or combined—each technique employs.

Clustering based on structural similarity focuses on grouping XML documents into clusters based on the similarity between their structures. Structural approaches include:

  • The S-GRACE algorithm (see Resources) clusters XML documents by structure using structural graphs (s-graphs). An s-graph is a directed graph that provides a summary of the nodes and edges in the XML documents together with the relationships between parent and child nodes. S-GRACE applies a hierarchical clustering algorithm on the s-graphs extracted from the given set of XML documents.
  • A generic, agglomerative hierarchical clustering algorithm proposed by De Francesca and colleagues (see Resources) addresses the XML clustering problem in a parametric way. Given a suitable distance measure, the algorithm finds optimal cluster representatives. A cluster representative is an XML document that mirrors the structural content of the cluster. Hence, the task is mainly to calculate a cluster representative through tree matching, tree merging, and pruning of the merged tree.

In addition to structural information, each node has a semantic meaning. You can cluster the XML documents based on both structural and semantic similarity.

  • Yoon proposed an approach (see Resources) that breaks down each XML document that needs to be clustered into ePaths. ePaths are full paths starting from the root of the XML document where the most nested element can only be a simple element (containing a value and no children elements).

    A three-dimensional bitmap index called a BitCube is then built. The dimensions are the list of documents, the list of ePaths extracted from the documents, and the list of words extracted from the simple content of all ePaths. From this cube, you can then determine which documents have certain words associated with them (among other things) and speed up the retrieval process when the user searches by specific terms.

  • A technique offered by Shen and Wang (see Resources) extracts macro-paths from the XML documents, which may or may not be based on an XML schema, where each macro-path has three elements: a path sequence, an attribute sequence, and a content sequence. The degree of similarity between these macro-paths indicates the degree of similarity between the XML documents analyzed.

Distance-based clustering techniques are an extension of the structural methods because they also consider both structural and semantic features of the given set of XML documents to cluster. Distance-based clustering techniques have a specific characteristic: They use the tree edit distance concept.

Tree edit distance

Given two XML documents represented as trees, the edit distance between them is the set of edit operations with the lowest total cost that is able to transform one document into the other. The difference between distance-based clustering approaches is in the way each defines the acceptable edit operations for edit tree distance.

  • A technique from Theodore Dalamagas and colleagues (see Resources) views XML documents as ordered trees. Their algorithm reduces nesting and repetition then calculates a tree edit distance between each pair of structural summaries. Clustering is based on the distance between structural summaries.
  • A method by Nierman and Jagadish (see Resources) calculates the distance between two XML documents by using five different operations on an XML tree: relabel, insert, delete, insertTree, and deleteTree. The last two operations allow the cutting and pasting of whole sections of an XML document. Allowable sequences of edit operations can also be defined and used by a dynamic programming algorithm to calculate the distances between pairs of XML documents in the given set.
  • A more recent technique from Xing, Xia, and Guo (see Resources) calculates the similarity between two XML documents by calculating the edit distance between XML documents and their schema. A schema is extracted from each document and the distance between two XML documents is then calculated as the average of the distances between each XML document and the schema extracted from the other XML document in the pair.

Table 2 summarizes the approaches involving clustering static XML documents.

Table 2. Clustering static XML documents
WorksCalculated distance as a set of edit operationsAssesses structural similarity onlyAssesses structural and semantic similarityAssesses semantic similarity only
De Francesca et al., 2003NYNN
Liang et al., 2004NYNN
Yoon et al., 2001NNYN
Shen & Wang, 2003NNYN
Nierman & Jagadish, 2002YNYN
Dalamagas et al., 2004YYNN
Xing et al., 2007YYNN

Mining dynamic XML documents

Mining dynamic XML documents brings new challenges to developers. In real-world applications, the differences between consecutive versions of an XML document fluctuate, so the mining techniques for static XML documents cannot be replicated for dynamic XML documents.

Discovering association rules from dynamic XML documents

Discovery of association rules from multi-versioned XML documents is still in its infancy. The Weighted-FPgrowth algorithm (see Resources) addresses the issue. This algorithm extracts the association rules from XML structural deltas, or XSD-ARs, in three steps:

  1. Build the structural deltas database.
  2. Discover frequent subtree patterns.
  3. Extract association rules.

XML structural deltas are structural differences between consecutive versions of a dynamic XML document. The Weighted-FPgrowth algorithm also defines the concepts of degree of change, frequency of change, and frequent subtree pattern.

I propose two different ways to mine association rules from dynamic XML documents (see Resources):

  • Interesting knowledge (in this case, association rules) from the collection of historic versions of the XML documents.

    Consider the changes, in both structure and content, affecting the multi-versioned XML documents in a given period of time. You can dynamically determine the most current valid association rules between elements or substructures of the XML documents involved without running the full discovery algorithm on the latest XML documents versions.

  • Association rules extracted from the actual changes between versions of XML documents.

    Look into the actual changes between versions of XML documents (stored as delta documents or a consolidated delta document) and find the associations between them.

Clustering dynamic XML documents

There has not been much work involving clustering dynamic XML documents. Two possibilities are outlined below.

  • One technique, which my colleagues and I proposed (see Resources), is to reassess the distances between clustered dynamic XML documents after they change and calculate the effects of the changes on the distances before documents have changed.
  • Another technique, proposed by Nayak and Xu (see Resources), is to cluster series of XML documents. Given a series, or streams, of XML documents, a distance is calculated between each incoming XML document and the existing clusters using the level structure. This distance is determined by matching the nodes from the incoming document to the nodes of the existing clusters. In this way, the similarity is determined at the cluster level, rather than pair-wise, for individual documents in the clusters.

Table 3 summarizes the approaches for mining dynamic XML documents.

Table 3. Mining dynamic XML documents
WorksMine association rules from multiple versionsMine association rules from deltasCluster series of XML documentsCluster multi-versioned (dynamic) XML documents
Chen et al., 2004NYNN
Nayak and Xu, 2006NNYN
Rusu et al., 2006a;YNNN
Rusu et al., 2006b;NYNN
Rusu et al., 2008NNNY

Summary

In this article, you learned about techniques in XML data mining to discover association rules from XML documents or to cluster XML documents based on structure or content. Most of the techniques deal with static XML documents. However, the pervasive dynamic information available nowadays requires techniques that can also work with dynamic (multi-versioned) XML documents.

Stay tuned for Part 2 in this series, which will go into detail about mining XML association rules.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=773464
ArticleTitle=XML data mining, Part 1: Survey several approaches to XML data mining
publish-date=05022012