XML data mining, Part 2: Mining XML association rules

Discover association rules from static and dynamic XML documents

In Part 2 of this series, learn about mining association rules from XML documents. Mining association rules from XML documents is different from mining rules from relational data. Information can be structured differently in XML because of the language's flexibility and hierarchical organization. This article also introduces the notion of dynamic association rules. You'll explore an approach to mining XML association rules when the underlying XML documents change without a full re-run of the association rules discovery algorithm.

02 May 2012 - Added links to Part 3 of this article to sidebars in the Introduction and Conclusion, and in Resources.

Share:

Laura Irina Rusu, PhD MACS CP, Development Team Lead, DW and BI Consultant, Computershare Technology Services Australia, La Trobe University Australia

Photo of Laura Irina RusuLaura Rusu completed her Ph.D. in computer science in 2009 at La Trobe University, Melbourne, Australia, with a thesis on XML data warehousing and mining. She proposed several novel techniques for XML data warehousing and mining, presented them at a number of international conferences, and published extended papers in international peer-reviewed journals. She wrote a book chapter on mining association rules from XML documents and edited a book on data mining and warehousing technologies. Laura's experience is a blend of academic, research, and IT industry roles. You can reach her at Laura.Rusu@acsmail.net.au.



02 May 2012 (First published 13 December 2011)

Also available in Chinese Russian Japanese

Introduction

XML has increasingly become the language of choice for data representation, storage, and exchange in many arenas. With the rapid increase in the amount of information represented in XML, people are searching for sensible techniques to solve issues with XML document storage and analysis. Part 1: Survey several approaches to XML data mining in this series introduced some techniques for mining hidden knowledge from XML documents.

In this article, learn about discovering XML association rules (as opposed to discovering association rules from relational data). Learn a technique to assess the validity of XML association rules extracted from XML documents that change their content in time. Several examples help illustrate the differences and the new XML data-mining technique.


Mining association rules from relational data

Part 1 discussed association rules, introduced by Agrawal and Srikant in 1993 (see Resources), for relational data to determine interesting relationships that can be extracted from a set of transactions in a market basket analysis. The algorithm to extract this type of knowledge is known as the Apriori algorithm. It employs the concepts of support and confidence, as follows.

Given a set of distinct items I and a set of transactions D, where any transaction T from D contains only items from I, an association rule R is an implication "X to Y," where X and Y are unrelated items from I. The rule R has the support s in D if s% of transactions in D contain both items X and Y, and it has the confidence c if c% of transactions in D that contain X also contain Y.

For example, an association rule in a market basket analysis is "Users who buy product X also buy product Y, and this happens in s% of transactions with c% confidence."

Transactions and items in the relational approach

The association rules concept was first introduced for relational data. In this context, a transaction is a record in the table and an item is the value of an attribute in the table. Consequently, the set of transactions is given by a set of records where all share the same attributes (columns and fields in the table).

Figure 1 is a visual representation of the set of transactions in a table where the task is to mine an association rule as a correlation between values taken by two items (attributes) X and Y. The sample transaction set shows several items that include X and Y in no particular order. Other items in the set include either X or Y.

Figure 1. A set of transaction and items in the relational approach
Concepts of set of transaction and items in the relational approach

Mining association rules from XML

Mining association rules from XML documents is different than mining rules in relational data because the information can be structured differently in the XML format. Generally, discovering XML association rules means finding a relationship between substructures of one or more XML documents.

Transactions and items for XML association rules

The current works in mining static XML association rules use the concepts of transaction and items for XML data. A set of transactions is given by a list of complex nodes formed by querying the XML documents for specific paths; a single complex node forms a transaction. The items are the simple nodes that are the children of the transaction node.

In relational data, a transaction extracted from a table always contains a fixed number of items: the number of attributes or columns existing in the table. An XML transaction can have a different number of items depending on the level of nesting in the document and whether it adheres to an XML schema. A similar view is that all nesting paths in an XML document are records, starting with the root. For any node in the document, each child is viewed as a record relative to the other records at the same depth or with similar tags (see Resources for more information).

Figure 2 shows an example where each <staff_member> node (here, John Brown and Mellisa White) is a transaction and each <publi_what>, <publi_where>, and <publi_when> node is an item. Each staff_member transaction contains one or more publications under the name of the staff member. Each publication includes a publi_what, publi_where, and publi_when item. (View a text version of the XML document in Figure 2.)

Figure 2. Transactions and items in an XML document
Example of transactions and items in an XML document

Context, body, and head for XML association rules

The context of the XML association rules refers to those sections of the XML documents that are actually mined. Sometimes there is no need to mine all the information contained in an XML document. Consequently, context selection refers to the users' ability to define constraints on the set of XML transactions.

Figure 3 shows an example of context selection. Only transactions representing staff members who published after 2006 are selected to be mined. In this case, only Mellissa White published after 2006.

Figure 3. Context selection in an XML document
Example of context selection in an XML document

If an XML association rule is an implication X to Y (similar to the relational approach but between two XML complex nodes), then X is the body of the rule and Y is the head of the rule. Body and head are always defined with respect to the context of the rule. Similarly, the support and the confidence of the rule are only calculated with respect to the defined context.

Suppose the context of the transactions in Figure 3 is set to include all staff members who published after 2000 instead of 2006. Figure 4 shows an example of the body and head of that XML association rule, where a hypothetical relationship between <publi_what> (the value "Data mining book" is the rule body) and <publi_where> (the value "ABC publishing" is the rule head) was found.

Figure 4. Body and head in an XML document
Example of body and head in an XML document

Examples of XML association rules

Figure 5 shows an example of an XML document (department.xml) in a tree representation. The document groups students' and staff members' research publications, including details about what, where, and when they were published. For this discussion of generic concepts for mining XML association rules, assume the <from_date> and <to_date> nodes are empty so the XML document is static, not dynamic.

Figure 5. Static XML document in a tree representation
Example of a static XML document in a tree representation

The following two XML association rules exemplify the transaction, item, context, body, and head concepts using the document in Figure 5.

XML association rule example 1
Assume that you want to discover an association rule from the information about staff research publications that were published after 2005. Suppose the following rule was obtained:

"Staff members who published after 2005 with XYZ journal also have projects in the ABC domain."

Figure 6 shows a representation of the concepts in this example:

  • Context is given by all <staff> complex nodes because the requirement was to mine staff publications, not student publications.
  • Context selection refers to selecting those publication records in which the value of <publi_when> is greater than 2005.
  • Each <staff> node is a transaction; the <staff> node is complex and has other children nodes.
  • All simple nodes from the <staff> complex node that relate to research publications and projects are the items—that is, the nodes found in the following paths:
    • staff/publication/publi_what
    • staff/publication/publi_where
    • staff/publication/publi_when
    • staff/project/name
    • staff/project/domain
    • staff/project/collab
  • Nodes found following the path staff/publication/publi_where form the head of the rule.
  • Nodes found following the path staff/project/domain form the body of the rule.
    Figure 6. XML association rules concepts in example 1
    Visual representation of XML association rules concepts in example 1
XML association rule example 2
Assume that you want to discover an association rule from the information about students and staff research publications overall. Suppose the following rule was obtained:

"Students who published with publisher ABC also collaborate with staff in research projects."

Figure 7 shows the concepts in this example:

  • Context is given by all <students> and <staff> complex nodes because there was no requirement to mine only staff or only student publications.
  • No context selection is defined, so the entire context as defined above will be considered.
  • Each <personal> complex node is a transaction.
  • All simple nodes from the <student> and <staff> complex nodes that relate to research publications and projects are the items, or nodes, found in the following paths:
    • student/publication/publi_what
    • student/publication/publi_where
    • student/publication/publi_when
    • staff/publication/publi_what
    • staff/publication/publi_where
    • staff/publication/publi_when
    • staff/project/name
    • staff/project/domain
    • staff/project/collab
  • Nodes found in the path student/publication/publi_where are the head of the rule.
  • Nodes found in the path staff/project/collab are the body of the rule.
    Figure 7. XML association rules concepts in example 2
    Visual representation of XML association rules concepts in example 2

As examples 1 and 2 show, the concepts of transaction and item are flexible when mining association rules from XML documents.

The document example in Figure 5, Figure 6, and Figure 7 is considered static and the elements <from_date> and <to_date> ignored. Assume that those two nodes are not empty but receive values based on the period of validity for the XML document. For example, a new version of the department.xml document can be created each year. The version for 2007 would have <from_date>01/01/2007</from_date> and <to_date>31/12/2007</to_date>. The association rules described previously would be valid for 2007 only. Any older rule (discovered from the previous years' versions of the XML document) might or might not be the same as the rules in 2007. Therefore, association rules from dynamic (multi-versioned) XML documents need a specific approach to assess their validity during successive document versions. These rules are called dynamic and are discussed in the next section.


Dynamic XML association rules

Association rules discovered from the initial versions of dynamic XML documents fluctuate in real-life applications based on the amount and intensity of changes between consecutive document versions. You can infer the association rules from dynamic XML documents after a set of changes by looking at the effect of the historical changes on the initial rules.

Given a set of XML association rules discovered from the initial versions of multi-versioned XML documents, the task is to determine how they perform after a number of changes. In other words, the problem is identifying which of the initial rules are still valid, which are no longer valid, or which new rules emerged after the changes.

One way to tackle the problem is by running an association rules mining algorithm on the document versions after the changes. However, this practice is not efficient in many cases, such as when changes between versions are not many or not critical, with no real significant impact on the initial association rules. The same applies when the changes produce only an increase or decrease in the support and confidence values for the rules but don't affect their overall validity. You would run the discovery algorithm redundantly without obtaining useful output every time.

By considering the effect of changes on the initial set of valid rules, you can determine the new, valid XML association rules after a set of changes without re-running the full association rules discovery algorithm on the final versions. Figure 8 shows the dynamic XML document used for this example. It depicts the catalog of an online shop for four different days. The catalog is different each day because of the activity of sellers and buyers on the bidding site (products are added or withdrawn, prices changed, details updated, and so on). (View a text version of Figure 8.)

Figure 8. A dynamic XML document in four consecutive versions
An example of a dynamic XML document in four consecutive versions

The purpose of this example is to show the steps in mining XML dynamic association rules—not how the initial XML rules (before changes) were mined. You can mine the initial rules using any XML-specific association rules algorithm from the existing work.

Step 1. Preparation

In this example, assume this step was performed at time T0:

  • Minimum support and confidence acceptable for the rules were set to min_sup = 22%, min_conf = 30%, prov_sup = 18%, and prov_ conf = 20%.
  • Two XML association rules were discovered, as follows:
    • Re = "when a walkman is available for sale, a mobile phone is available for sale as well", with support (Re) = 25% and confidence (Re ) = 33%.
    • Rp = "when a bbq tools set is available for sale, a suitcase is available for sale as well", with sup (Rp) = 20%, conf (Rp) = 25%.
  • Assume that the total number of transactions mined to get the above rules was 20.

From the results above, notice that:

  • Re is an effective rule because min_sup < support (Re) and min_conf < confidence (Re).
  • Rp is a provisional rule because prov_sup < support (Rp) < min_sup and prov_conf < confidence (Rp) < min_conf.

You want to check the validity of rules Re and Rp after the four-day period during which the bidding site catalog has changed, as in Figure 8.

Step 2. Mining version based association rules

The next task is to build the consolidated delta for the document catalogue.xml and extract the sets of changes C1 (between T0 and T1), C2 (between T1 and T2), and C3 (between T2 and T3). Figure 9 shows the consolidated delta for the four versions of the document in Figure 8. (View a text version of Figure 9.) For more details on how to build a consolidated delta, see Resources.

Figure 9. Consolidated delta for the XML document versions in Figure 8
Consolidated delta for the XML document versions in Figure 8

Table 1 shows the document changes contained in the consolidated delta. Note the Product changes as specific products are removed or added.

Table 1. Changes for the dynamic XML document in the example
Set of changesVersion intervalSummary of changes
C1T0 to T1Price – modified
Product 'mobile phone' – deleted
Price – deleted
C2T1 to T2Product 'suitcase' – inserted
Price – modified
Status – modified
C3T2 to T3Name – modified
Product 'mobile phone' – deleted
Price - modified
Product 'bbq set' – inserted

For clarity, this example had only four versions. Naturally, the use of consolidated delta during the step described above is more appropriate when there are more versions.

It was assumed at Step 1 that the total number of transactions (D) mined before the changes to get Re and Rp was 20. The set of changes C1, C2, and C3 contain four transactions. The new total number of transactions to be mined after the changes (D') is 20 + 4 = 24.

The new support and confidence values for the effective rule Re is calculated using the steps in Figure 10. (View a text version of Figure 10.)

Figure 10. Calculation of new support and confidence for an effective association rule
Calculation of new support and confidence for an effective association rule in the working example

The new support and confidence values for the provisional rule Rp is calculated using the steps in Figure 11. (View a text version of Figure 11.)

Figure 11. Calculation of new support and confidence for a provisional association rule
Calculation of new support and confidence for a provisional association rule in the working example

As shown in the results in Figure 10 and Figure 11, rule Re is no longer effective because its new support (12.5%) is under the minimum support required, even under the provisional level. The same behavior can be observed for rule Rp, which was provisional before but now has the new support (16.66%) under the provisional support level. However, the level of confidence for rule Rp (23.5%) is above the provisional confidence level.


Conclusion

The example in this article is small and did not cover all possible scenarios in a real-life mining application. However, it shows how to apply the proposed framework to determine version-based association rules after the mined XML documents change.

In this article, you learned about several concepts used in XML data mining to discover association rules from XML documents. You saw examples of static XML association rules. You also walked through an example of dynamic XML association rules where the underlying XML documents are multi-versioned and a fluctuating number of changes exist between one version and another.

Watch for Part 3 in this series, which will explore clustering multi-version XML documents.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=780186
ArticleTitle=XML data mining, Part 2: Mining XML association rules
publish-date=05022012