XML data mining, Part 3: Clustering XML documents for improved data mining

Assess changes to clusters of dynamic XML documents to speed your data mining

Similar to the task of mining association rules from an XML document, clustering XML documents is different from clustering relational data because of the specific structure of the XML format, its flexibility, and its hierarchical organization. Learn about clustering XML documents as a major task in XML data mining in this third article in a series on XML data mining.

Share:

Laura Irina Rusu, PhD MACS CP, Development Team Lead, DW and BI Consultant, Computershare Technology Services Australia, La Trobe University Australia

Photo of Laura Irina RusuLaura Rusu completed her Ph.D. in computer science in 2009 at La Trobe University, Melbourne, Australia, with a thesis on XML data warehousing and mining. During her studies, she proposed a number of novel techniques for XML data warehousing and mining, presented them at several international conferences, and published extended papers in international peer-reviewed journals. She wrote a book chapter on mining association rules from XML documents and edited a book on data mining and warehousing technologies. Laura's experience is a blend of academic, research, and IT industry roles. You can reach her at Laura.Rusu@acsmail.net.au.



02 May 2012

Also available in Chinese Russian Japanese

This article, the third in a series on XML data mining, explains several concepts related to clustering XML documents and presents the task of clustering XML documents when their content or structure changes over time. In real-world applications, the number of changes from one version of an XML document to another cannot be predicted. It's always possible that an initial clustering solution becomes obsolete after the changes take place. To combat this built-in obsolescence, this article describes a non-redundant methodology for recalculating the new clusters of XML documents after the changes. You work through a step-by-step case example to help you understand the technique and apply it in practice.

Background concepts

Clustering is a data mining task—usually done using distance measurements—that looks for regions in a data set that are dense. In other words, clustering is a process that partitions data in such a way that similar data items are grouped into sets referred to as clusters.

Clustering techniques

A lot of work has been done in clustering metric or spatial data, and several types of algorithms have been proposed. A few of them are:

  • Partitioning algorithms
  • Hierarchical algorithms
    • Single-link clustering (also called the minimum distance method)
    • Complete-link clustering (also called the maximum distance method)
    • Average-link clustering
  • Density-based algorithms

For details on the specific techniques, see Resources.

XML document clustering differs from clustering any other type of data set because of the hierarchical structure of XML. Several XML clustering approaches have been proposed, such as clustering XML documents by structure, semantic XML clustering, clustering schema-less XML documents, and clustering linked XML documents. You can read more about different types of XML clustering in Part 1 of this series. This article focuses on clustering XML by structure using a hierarchical (distance-based) XML clustering technique.

In distance-based clustering techniques, each object from the given set is first assigned to a cluster. Then, distances between pairs of clusters are computed, and the closest clusters (the most similar) are grouped to form a new (bigger) cluster. In other words, when two XML documents are more similar compared to other pairs of XML documents, the distance between them is smaller; hence, they can become members of the same cluster.

To exemplify the concept of XML documents similarity, Figure 1 shows three XML documents, where two are highly similar (that is, documents DA and DB), while document DC is not similar to either DA or DB. Documents DA and DB) list information about two students, including year of study, subjects and exams, and student names. Document DC lists information about a book, including title, ISB number, and two author names.

Figure 1. Example of similar and dissimilar XML documents
Diagram showing an example of similar and dissimilar XML documents

In Figure 1, any queries regarding students' details are applicable only to the relevant documents (that is, DA and DB) and not to any other documents that contain a different kind of information, such as DC. Intuitively, documents DA and DB are grouped in a cluster, while DC forms a cluster by itself.

Distance between XML documents and XMLDelta

If you consider two XML documents—D1 and D2—together with their representation as two trees, then the distance between these two documents, noted as d(D1, D2), is determined by the set of basic operations (that is, insert, update, delete) that has the smallest total cost and is able to transform one document into another.

As an example, to determine the distance between documents DA and DB in Figure 1, you first find the set of basic operations that can transform DA into DB (forward); then, you find the set of basic operations that can transform DB into DA (backward). For both of these sets of operations, you calculate their costs; finally, you select the set that has the smallest total cost:

d(DA --> DB)={update(Student, John, Mary), update(year, 2, 3), insert(Exams), insert (Subject, Drama), insert (Subject, Music)} and d(DB --> DA)={ update(Student, John, Mary), update(year, 2, 3), delete(Exams), delete(Subject), delete(Subject)}

To calculate the minimum cost for each set of operations, you use a cost model based on node position in the XML document. (For an example of such a model, see Resources.) The cost in this example is:

d (DA --> D B) = d (DB --> DA) = 5

In this example, you can select either set of operations because they have the same total cost.

In the case of dynamic (also called multi-versioned) XML documents, each new document version is actually a new XML document obtained by updating by a certain degree the previous document version. This update is generally done by applying a mixture of basic operations (that is, insert, update, and delete) on the previous XML document version. If you look at these operations as a whole, they form what is known as a delta. When you speak about XML delta, you know that it refers to the difference between two consecutive versions of an XML document. The cost of delta is the total cost of the mixture of operations mentioned above and listed in the delta.


Clustering dynamic (multi-version) XML documents

When you cluster dynamic XML documents, you have two basic options:

  • Option A. The traditional method, option A, is straightforward. When one or more documents change, this option recalculates the new distances among all XML documents by performing a comparison for each pair of XML documents in the set. Obviously, this option can be expensive because it makes no distinction regarding the level of changes that have been applied to a document. Therefore, if a version of an XML document is only slightly different or not different at all from its previous version, comparing them would involve redundant and unnecessary effort.
  • Option B. This option is the one described in this article in more detail. It uses the distances calculated earlier for each pair of XML documents (that is, distances between documents before any change) plus the set of changes that occurred during a period of time. Figure 2 shows a summary of the steps involved in option B.
    Figure 2. Summary of the steps in reassessing distances between dynamic XML documents, based on changes
    Summary diagram of the steps in reassessing distances between dynamic XML documents based on changes

You can divide the methodology used for option B in Figure 2 into two main parts:

  • Part 1: The initial calculation of distances. Note that this phase is one-off, not repetitive. The steps are:
    1. Determine and store the distances between each pair of XML documents.
    2. Calculate and store the minimum cost for transforming each pair of documents in both directions—that is, backward and forward.
  • Part 2: Reassessing the distances. You can schedule this phase to run regularly after changes occur or run it as you need to know the latest distances between documents. The steps are:
    1. Get the matrix of distances calculated earlier during Part 1 or from the most recent run of Part 2.
    2. Get the set of changes between the timestamp of the latest known distances and the current timestamp.
    3. Recalculate each distance by including in the equation the previous or initial distance, the transformation direction (forward or backward) that has minimum cost, and the set of changes between documents.

From this phase, it is clear that the most critical step in Part 2 of the process is recalculating the new distance for each pair of XML document versions in the dataset. For any given two XML documents where distance needs to be reassessed, two different scenarios exist:

  • Only one document from the pair changes
  • Both documents of the pair change

For each scenario, further considerations exist related to the type of operations applied to the documents; however, they are not detailed in this article. You can find more information and detailed explanations in Resources.


Case study

The XML documents used in this example represent customer transactions in a bank. The bank creates one XML document for each customer who opens an account with the bank, and then the bank creates a new version each time a new transaction is entered. Each version has a transaction date, shown by the element >_date<. Suppose that three documents for three customers exist as at 01 January 2008 (that is, documents D1, D2, and D3). Listing 1 shows document D1.

Listing 1. XML document example (D1) for the dynamic clustering case study
<?xml version="1.0" encoding="UTF-8"?> 
<customer> 
    <name>cust1</name> 
    <street>street1</street> 
    <city>;city1</city> 
    <transaction> 
        <trans_date>01/01/2008</trans_date> 
        <deposit> 
            <bank_account>BA1</bank_account>>
            <BSB>BSB1</BSB>
            <acc_name>ACC1</acc_name>
            <amount>100</amount>
        </deposit> 
    </transaction> 
</customer>

Then, on 15 January 2008, one more transaction is entered for each of the second and third customers. As a result, documents D2 and D3 have new versions—D*2 and D*3, respectively (see Listing 2).

Listing 2. XML document example (D2) in two consecutive versions
<?xml version="1.0" encoding="UTF-8"?> 
<customer> 
    <name>cust2</name> 
    <street>street2</street> 
    <city>city2</city> 
    <transaction> 
        <trans_date>01/01/2008</trans_date> 
        <deposit 
            ><bank_account>BA2</bank_account> 
            <BSB>BSB2</BSB> 
            <acc_name>ACC2</acc_name> 
            <amount>200</amount> 
            <balance>1000</balance> 
        </deposit> 
    </transaction> 
</customer> 
******************************************************************************* 
<?xml version="1.0" encoding="UTF-8"?> 
<customer> 
    <name>cust2</name> 
    <street>street2</street> 
    <city>city2</city> 
    <transaction> 
        <trans_date>15/01/2008</trans_date> 
        <withdrawal> 
            <bank_account>BA2</bank_account> 
            <BSB>BSB2</BSB> 
            <acc_name>ACC2</acc_name> 
            <amount>400</amount> 
            <balance>800</balance> 
        </withdrawal> 
    </transaction> 
</customer>

Listing 3 shows the third document in the case study, before and after changes (that is, D3 and D*3).

Listing 3. XML document example (D3) in two consecutive versions
<?xml version="1.0" encoding="UTF-8"?> 
<customer> 
    <name>cust3</name> 
    <street>street3</street> 
    <city>city3</city> 
    <transaction> 
        <trans_date>01/01/2008</trans_date> 
        <withdrawal> 
            <bank_account>BA3</bank_account> 
            <BSB>BSB3</BSB> 
            <acc_name>ACC3<acc_name> 
            <amount>50</amount> 
        </withdrawal>
    </transaction> 
</customer>
******************************************************************************* 
<?xml version="1.0" encoding="UTF-8"?> 
<customer> 
    <name>cust3</name> 
    <street>street3</street> 
    <city>city3</city> 
    <transaction> 
        <trans_date>15/01/2008</trans_date> 
        <withdrawal> 
            <bank_account>BA3</bank_account> 
            <BSB>BSB3</BSB> 
            <acc_name>ACC3</acc_name> 
            <amount>50</amount> 
            <balance>500</balance>
        </withdrawal> 
    </transaction>
</customer>

The purpose of the case study is to show how the initial clusters that exist at 01 January 2008 change after the new transactions are entered for the customers on 15 January 2008.

Part 1: Initial calculation of distances

As mentioned, this phase calculates the distance between the documents before the set of changes to determine the initial clustering composition. For the purpose of this example, assume that each basic operation has a cost equal to 1.

Table 1, Table 2, and Table 3 show the operations required and the distance (that is, the total cost of operations) for each pair of documents in the case study.

Table 1. Initial pair-wise distance between the XML documents D1 and D2
d (D1 --> D2)
Update (<name>, 'cust1', 'cust2')
Update (<street>, 'street1', 'street2')
Update (<city>, 'city1', 'city2')
Update (<bank_account>, 'BA1', 'BA2')
Update (<BSB>, 'BSB1', 'BSB2')
Update (<acc_name>, 'ACC1', 'ACC2')
Update (<amount>, '100', '200')
Insert (<balance>, '1000')
d (D1 --> D2) = 8

For readability in Table 1, Table 2, and Table 3, unique identifiers attached to each node are omitted, and the operations are listed with their corresponding values.

Table 2. Initial pair-wise distance between the XML documents D1 and D3
d (D1 --> D3)
Update (<name>, 'cust1', 'cust3')
Update (<street>, 'street1', 'street3')
Update (<city>, 'city1', 'city3')
Delete (<deposit>)
Delete (<bank_account>, 'BA1')
Delete (<BSB>, 'BSB1')
Delete (<acc_name>, 'ACC1)
Delete (<amount>, '100')
Insert (<withdrawal>)
Insert (<bank_account>, 'BA3')
Insert (<BSB>, 'BSB3')
Insert (<acc_name>, 'ACC3')
Insert (<amount>, '50')
d (D1 --> D3) = 13

As you can easily see, the initial distance between D1 and D2 is the smallest.

Table 3. Initial pair-wise distance between the XML documents D2 and D3
d (D2 --> D3)
Update (<name>, 'cust2', 'cust3')
Update (<street>, 'street2', 'street3')
Update (<city>, 'city2', 'city3')
Delete (<deposit>)
Delete (<bank_account>, 'BA2')
Delete (<BSB>, 'BSB2')
Delete (<acc_name>, 'ACC2')
Delete (<amount>, '200')
Delete (<balance>, '1000')
Insert (<withdrawal>)
Insert (<bank_account>, 'BA3')
Insert (<BSB>, 'BSB3')
Insert (<acc_name>, 'ACC3')
Insert (<amount>, '50')
d (D2 --> D3) = 14

Based on a simple hierarchical clustering method, documents D1 and D2 can be grouped in a cluster (CBC1, where BC stands for before changes), while D3 will form a separated cluster—that is, CBC2.

Figure 3 shows the clustering composition at the end of Phase 1. The dotted lines indicate that documents D2 and D3 have new versions, and you want to determine the new clusters after changes. This takes you to the next phase of the method.

Figure 3. Initial clustering composition in the case study
Diagram of the initial clustering composition in the case study

Part 2: Reassessing the distances

As mentioned, you can apply this phase whenever it is required after a set of changes has affected the clustered XML documents.

First, you must obtain the set of changes (deltas) between the document versions. In this case, these are Δ(D2 -->D*2) and Δ(D3 -->D*3). Then, these deltas will be employed to determine the new distances between the modified XML documents.

Table 4 shows the set of operations for Δ(D2 -->D*2).

Table 4. Operations included in the delta for the XML document D2
Δ(D2 --> D*2)
Update (<trans_date>, '01/01/2008', '15/01/2008')
Delete <deposit>)
Delete (<bank_account>, 'BA2')
Delete (<BSB>, 'BSB2')
Delete (<acc_name>, 'ACC2')
Delete (<amount>, '200')
Delete (<balance>, '1000')
Insert (<withdrawal>)
Insert (<bank_account>, 'BA2')
Insert (<BSB>, 'BSB2')
Insert (<acc_name>, 'ACC2')
Insert (<amount>, '400')
Insert (<balance>, '800')

Table 5 shows the set of operations for Δ(D3 -->D*3).

Table 5. Operations included in the delta for the XML document D3
Δ(D3 --> D*3)
Update (<trans_date>, '01/01/2008', '15/01/2008')
Insert (<balance>, '500')

After you identify the operations included in deltas, the next step is to recalculate each new distance based on the old distance and the deltas. In other words, you recalculate the following distances:

  • d'(D1 --> D*2), using d(D1 --> D2) and Δ(D2 --> D*2)
  • d'(D1 --> D*3) using d(D1 --> D3) and Δ(D3 --> D*3)
  • d'(D*2 --> D*3) using d(D2 --> D3), Δ(D2--> D*2) and Δ(D3 --> D*3)

In reassessing these distances, a number of operations are matched or replacement operations are found—for example, if operation Update (<amount>, '100', '200') is matched with Delete (<amount>, '200'), the replacement operation is Delete (<amount>, '100'). In other words, the result of applying both matched operations on an XML document is the same as the result of applying the replacement operation on the same initial XML document. For more details on how matching or replacement operations are defined, check out Resources.

Figure 4 provides a visual representation of how the final cost is calculated for d'(D1 -->D*2).

Figure 4. Visual representation of the process to calculate new distance d' (D1-->D*2) after the changes
Visual representation of the process to calculate new distance d'(D1 -->D*2) after the changes

In this case, 11 operations could not be matched with other operations, so they are counted with their own cost, eight operations were replaced by four complementary operations, and two operations reversed each other. Consequently, the sum of individual costs of all operations participating in the new distance is 11 + 4 + 0 = 15. Hence, the new distance d'(D1 -->D*2) = 15.

Using a similar logic, the distance d'(D1 -->D*3) is calculated. Figure 5 shows the operations involved.

Figure 5. Visual representation of the process to calculate new distance d' (D1-->D*3) after the changes
Visual representation of the process to calculate new distance d'(D1 -->D*3) after the changes

The last step in reassessing distances is to calculate d'(D*2 --> D*3). Figure 6 shows the operations involved.

Figure 6. Visual representation of the process to calculate new distance d'(D*2 --> D*3) after the changes
Visual representation of the process to calculate new distance d'(D*2 -->D*3) after the changes

Based on the results obtained, Figure 7 shows the new clustering composition after the changes. Note that the clusters have changed, and now documents D2 and D3 in their new versions form a cluster, while the document D1 (which has not changed) forms a separate cluster by itself.

Figure 7. Clustering compositions after the changes
Diagram of clustering compositions after the changes

Conclusion

This article described a method for clustering dynamic XML documents using a distance-based technique. In particular, it described a technique that allows you to determine the distances between the dynamic XML document versions after the initial documents change. This method was proved as time and cost-effective because it greatly reduces the number of operations required to recalculate the distances and to reassess the clusters after the XML documents change.

For the purposes of this article and case study, the XML documents were intentionally kept simple and the number of versions small. For more complex documents, you follow the same steps with a larger number of versions and a larger number of differences between them.

Resources

Learn

  • Part 1: Survey several approaches to XML data mining: In the first part of this series, start with an introduction to mining hidden knowledge from XML documents. Learn about mining data, the hierarchical structure of the information, and the relationships between elements.
  • Part 2: Mining XML association rules: Dig into XML data mining, a facet of XML data analysis. Learn about static and dynamic XML association rules and how to create version-based association rules when the mined XML documents change in Part 2 of this three-part series.
  • Intelligent Dynamic XML Documents Clustering [L.I. Rusu, W. Rahayu, and D. Taniar, published in Proceedings of the 22nd International Conference on Advanced Information Networking and Applications (AINA 2008), IEEE Computer Society, Okinawa, Japan]: Check out an efficient way to avoid redundant calculations as you reassess pair-wise distances between clustered dynamic XML documents that change in time.
  • Storage Techniques for Multi-versioned XML Documents [L.I. Rusu, W. Rahayu, and D. Taniar, published in Proceedings of the 13th International Conference on Database Systems for Advanced Applications (DASFAA 2008), New Delhi, India]: Read about and evaluate four storage methods for dynamic XML documents.
  • Partitioning Methods for Multi-version XML Data Warehouses [L.I. Rusu, W. Rahayu, and D. Taniar, published in Distributed and Parallel Databases, 25(1-2), Springer, April 2009]: Explore how to build an XML data warehouse schema that uses several partitioning techniques and evaluate various query types for an XML data warehouse.
  • New to XML? Get the resources you need to learn XML.
  • XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
  • IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
  • developerWorks technical events and webcasts: Stay current with technology in these sessions.
  • developerWorks on Twitter: Join today to follow developerWorks tweets.
  • developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
  • developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into XML on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=XML
ArticleID=811241
ArticleTitle=XML data mining, Part 3: Clustering XML documents for improved data mining
publish-date=05022012