XML has increasingly become the language of choice for data representation, storage, and exchange in many arenas. With the rapid increase in the amount of information represented in XML, people are searching for sensible techniques to solve issues with XML document storage and analysis. Part 1: Survey several approaches to XML data mining in this series introduced some techniques for mining hidden knowledge from XML documents.
In this article, learn about discovering XML association rules (as opposed to discovering association rules from relational data). Learn a technique to assess the validity of XML association rules extracted from XML documents that change their content in time. Several examples help illustrate the differences and the new XML data-mining technique.
Part 1 discussed association rules, introduced by Agrawal and Srikant in 1993 (see Resources), for relational data to determine interesting relationships that can be extracted from a set of transactions in a market basket analysis. The algorithm to extract this type of knowledge is known as the Apriori algorithm. It employs the concepts of support and confidence, as follows.
Given a set of distinct items I and a set of transactions D, where any transaction T from D contains only items from I, an association rule R is an implication "X to Y," where X and Y are unrelated items from I. The rule R has the support s in D if s% of transactions in D contain both items X and Y, and it has the confidence c if c% of transactions in D that contain X also contain Y.
For example, an association rule in a market basket analysis is "Users who buy product X also buy product Y, and this happens in s% of transactions with c% confidence."
The association rules concept was first introduced for relational data. In this context, a transaction is a record in the table and an item is the value of an attribute in the table. Consequently, the set of transactions is given by a set of records where all share the same attributes (columns and fields in the table).
Figure 1 is a visual representation of the set of transactions in a table where the task is to mine an association rule as a correlation between values taken by two items (attributes) X and Y. The sample transaction set shows several items that include X and Y in no particular order. Other items in the set include either X or Y.
Figure 1. A set of transaction and items in the relational approach
Mining association rules from XML documents is different than mining rules in relational data because the information can be structured differently in the XML format. Generally, discovering XML association rules means finding a relationship between substructures of one or more XML documents.
The current works in mining static XML association rules use the concepts of transaction and items for XML data. A set of transactions is given by a list of complex nodes formed by querying the XML documents for specific paths; a single complex node forms a transaction. The items are the simple nodes that are the children of the transaction node.
In relational data, a transaction extracted from a table always contains a fixed number of items: the number of attributes or columns existing in the table. An XML transaction can have a different number of items depending on the level of nesting in the document and whether it adheres to an XML schema. A similar view is that all nesting paths in an XML document are records, starting with the root. For any node in the document, each child is viewed as a record relative to the other records at the same depth or with similar tags (see Resources for more information).
Figure 2 shows an example where each
<staff_member> node (here, John Brown and Mellisa
White) is a transaction and each
<publi_when> node is an item. Each staff_member
transaction contains one or more publications under the name of the staff member.
Each publication includes a publi_what, publi_where, and publi_when item.
(View a text version of the XML document in Figure 2.)
Figure 2. Transactions and items in an XML document
The context of the XML association rules refers to those sections of the XML documents that are actually mined. Sometimes there is no need to mine all the information contained in an XML document. Consequently, context selection refers to the users' ability to define constraints on the set of XML transactions.
Figure 3 shows an example of context selection. Only transactions representing staff members who published after 2006 are selected to be mined. In this case, only Mellissa White published after 2006.
Figure 3. Context selection in an XML document
If an XML association rule is an implication X to Y (similar to the relational approach but between two XML complex nodes), then X is the body of the rule and Y is the head of the rule. Body and head are always defined with respect to the context of the rule. Similarly, the support and the confidence of the rule are only calculated with respect to the defined context.
Suppose the context of the transactions in Figure 3 is set to include all
staff members who published after 2000 instead of 2006. Figure 4
shows an example of the body and head of that XML association rule, where
a hypothetical relationship between
(the value "Data mining book" is the rule body) and
<publi_where> (the value "ABC publishing" is the rule head) was found.
Figure 4. Body and head in an XML document
Figure 5 shows an example of an XML document (department.xml)
in a tree representation. The document groups students' and staff members' research
publications, including details about what, where, and when they were published. For
this discussion of generic concepts for mining XML association rules, assume the
nodes are empty so the XML document is static, not dynamic.
Figure 5. Static XML document in a tree representation
The following two XML association rules exemplify the transaction, item, context, body, and head concepts using the document in Figure 5.
- XML association rule example 1
- Assume that you want to discover an association rule from the information about staff
research publications that were published after 2005. Suppose the following rule
"Staff members who published after 2005 with XYZ journal also have projects in the ABC domain."
Figure 6 shows a representation of the concepts in this example:
- Context is given by all
<staff>complex nodes because the requirement was to mine staff publications, not student publications.
- Context selection refers to selecting those publication records in
which the value of
<publi_when>is greater than 2005.
<staff>node is a transaction; the
<staff>node is complex and has other children nodes.
- All simple nodes from the
<staff>complex node that relate to research publications and projects are the items—that is, the nodes found in the following paths:
- Nodes found following the path staff/publication/publi_where form the head of the rule.
- Nodes found following the path staff/project/domain form the body
of the rule.
Figure 6. XML association rules concepts in example 1
- Context is given by all
- XML association rule example 2
- Assume that you want to discover an association rule from the information about
students and staff research publications overall. Suppose the following rule was
"Students who published with publisher ABC also collaborate with staff in research projects."
Figure 7 shows the concepts in this example:
- Context is given by all
<staff>complex nodes because there was no requirement to mine only staff or only student publications.
- No context selection is defined, so the entire context as defined above will be considered.
<personal>complex node is a transaction.
- All simple nodes from the
<staff>complex nodes that relate to research publications and projects are the items, or nodes, found in the following paths:
- Nodes found in the path student/publication/publi_where are the head of the rule.
- Nodes found in the path staff/project/collab are the body of the rule.
Figure 7. XML association rules concepts in example 2
- Context is given by all
The document example in Figure 5, Figure 6,
and Figure 7 is considered static and the elements
ignored. Assume that those two nodes are not empty but receive values based
on the period of validity for the XML document. For example, a new version of the
department.xml document can be created each year. The version for 2007 would have
</to_date>. The association rules described previously would be valid for 2007 only.
Any older rule (discovered from the previous years' versions of the XML document) might
or might not be the same as the rules in 2007. Therefore, association rules from dynamic
(multi-versioned) XML documents need a specific approach to assess their validity
during successive document versions. These rules are called dynamic and are
discussed in the next section.
Association rules discovered from the initial versions of dynamic XML documents fluctuate in real-life applications based on the amount and intensity of changes between consecutive document versions. You can infer the association rules from dynamic XML documents after a set of changes by looking at the effect of the historical changes on the initial rules.
Given a set of XML association rules discovered from the initial versions of multi-versioned XML documents, the task is to determine how they perform after a number of changes. In other words, the problem is identifying which of the initial rules are still valid, which are no longer valid, or which new rules emerged after the changes.
One way to tackle the problem is by running an association rules mining algorithm on the document versions after the changes. However, this practice is not efficient in many cases, such as when changes between versions are not many or not critical, with no real significant impact on the initial association rules. The same applies when the changes produce only an increase or decrease in the support and confidence values for the rules but don't affect their overall validity. You would run the discovery algorithm redundantly without obtaining useful output every time.
By considering the effect of changes on the initial set of valid rules, you can determine the new, valid XML association rules after a set of changes without re-running the full association rules discovery algorithm on the final versions. Figure 8 shows the dynamic XML document used for this example. It depicts the catalog of an online shop for four different days. The catalog is different each day because of the activity of sellers and buyers on the bidding site (products are added or withdrawn, prices changed, details updated, and so on). (View a text version of Figure 8.)
Figure 8. A dynamic XML document in four consecutive versions
The purpose of this example is to show the steps in mining XML dynamic association rules—not how the initial XML rules (before changes) were mined. You can mine the initial rules using any XML-specific association rules algorithm from the existing work.
In this example, assume this step was performed at time T0:
- Minimum support and confidence acceptable for the rules were set to min_sup = 22%, min_conf = 30%, prov_sup = 18%, and prov_ conf = 20%.
- Two XML association rules were discovered, as follows:
- Re =
"when a walkman is available for sale, a mobile phone is available for sale as well", with support (Re) = 25% and confidence (Re ) = 33%.
- Rp =
"when a bbq tools set is available for sale, a suitcase is available for sale as well", with sup (Rp) = 20%, conf (Rp) = 25%.
- Re =
- Assume that the total number of transactions mined to get the above rules was 20.
From the results above, notice that:
- Re is an effective rule because min_sup < support (Re) and min_conf < confidence (Re).
- Rp is a provisional rule because prov_sup < support (Rp) < min_sup and prov_conf < confidence (Rp) < min_conf.
You want to check the validity of rules Re and Rp after the four-day period during which the bidding site catalog has changed, as in Figure 8.
The next task is to build the consolidated delta for the document catalogue.xml and extract the sets of changes C1 (between T0 and T1), C2 (between T1 and T2), and C3 (between T2 and T3). Figure 9 shows the consolidated delta for the four versions of the document in Figure 8. (View a text version of Figure 9.) For more details on how to build a consolidated delta, see Resources.
Figure 9. Consolidated delta for the XML document versions in Figure 8
Table 1 shows the document changes contained in the consolidated delta. Note the Product changes as specific products are removed or added.
Table 1. Changes for the dynamic XML document in the example
|Set of changes||Version interval||Summary of changes|
|C1||T0 to T1||Price – modified|
Product 'mobile phone' – deleted
Price – deleted
|C2||T1 to T2||Product 'suitcase' – inserted|
Price – modified
Status – modified
|C3||T2 to T3||Name – modified|
Product 'mobile phone' – deleted
Price - modified
Product 'bbq set' – inserted
For clarity, this example had only four versions. Naturally, the use of consolidated delta during the step described above is more appropriate when there are more versions.
It was assumed at Step 1 that the total number of transactions (D) mined before the changes to get Re and Rp was 20. The set of changes C1, C2, and C3 contain four transactions. The new total number of transactions to be mined after the changes (D') is 20 + 4 = 24.
Figure 10. Calculation of new support and confidence for an effective association rule
Figure 11. Calculation of new support and confidence for a provisional association rule
As shown in the results in Figure 10 and Figure 11, rule Re is no longer effective because its new support (12.5%) is under the minimum support required, even under the provisional level. The same behavior can be observed for rule Rp, which was provisional before but now has the new support (16.66%) under the provisional support level. However, the level of confidence for rule Rp (23.5%) is above the provisional confidence level.
The example in this article is small and did not cover all possible scenarios in a real-life mining application. However, it shows how to apply the proposed framework to determine version-based association rules after the mined XML documents change.
In this article, you learned about several concepts used in XML data mining to discover association rules from XML documents. You saw examples of static XML association rules. You also walked through an example of dynamic XML association rules where the underlying XML documents are multi-versioned and a fluctuating number of changes exist between one version and another.
Watch for Part 3 in this series, which will explore clustering multi-version XML documents.
- Part 1: Survey several approaches to XML data mining: In the first part of this series, start with an introduction to mining hidden knowledge from XML documents. Learn about mining data, the hierarchical structure of the information, and the relationships between elements.
- Part 3: Clustering XML documents to improve data mining: Check out a non-redundant methodology tp recalculate the new clusters of XML documents after changes. Work through a step-by-step case example to understand the technique and apply it in the third part of this series.
- Extracting Variable Knowledge from Multi-Versioned XML Documents (L.I. Rusu, W. Rahayu, and D. Taniar, 2006): Read about a novel technique of determining how the knowledge discovered from initial XML documents changes in time with the content and structure fluctuations of the documents.
- Mining Changes from Versions of Dynamic XML Documents (L.I. Rusu, W. Rahayu, and D. Taniar, 2006): Learn about an approach for mining association rules from changes between versions of dynamic XML documents, in a simple manner, by using the information contained in the consolidated delta.
- Mining Association Rules between Sets of Items in Large Databases (R. Agrawal, T. Imielinski, and A.N. Swami, 1993): Read about an efficient algorithm that generates all significant association rules between items in a large database.
- Deriving General Association Rules from XML Data (Q. Ding, K. Ricords, and J. Lumpkin, 2003): Learn how the authors address the problem of deriving general association rules form XML data and propose an approach to perform the task.
- New to XML? Get the resources you need to learn XML.
- XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- developerWorks profile: Create your profile today and set up a watchlist.
- XML zone discussion forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Laura Rusu completed her Ph.D. in computer science in 2009 at La Trobe University, Melbourne, Australia, with a thesis on XML data warehousing and mining. She proposed several novel techniques for XML data warehousing and mining, presented them at a number of international conferences, and published extended papers in international peer-reviewed journals. She wrote a book chapter on mining association rules from XML documents and edited a book on data mining and warehousing technologies. Laura's experience is a blend of academic, research, and IT industry roles. You can reach her at Laura.Rusu@acsmail.net.au.