Practical data mining of vague and uncertain data

Introducing fuzzy association rule mining

With larger and larger amounts of data being generated both privately in business and publicly over the web, data mining is becoming increasingly more interesting and useful. However, traditional data mining is not suited to handle some forms of uncertainty and vagueness. This is where fuzzy set theory can help.

Daniel J Lewis, Researcher, University of Bristol

Daniel Lewis is a Ph.D. researcher in intelligent systems at the University of Bristol, where he is investigating rule mining and pattern finding within the context of fuzzy set theory and conceptual hierarchies and ontologies. His previous experience includes working in the fields of knowledge bases, Semantic Web, linked data, and open data for several companies, including OpenLink Software and the Open Knowledge Foundation. He has also spent some time as a web developer. Daniel has written many articles for IBM developerWorks on a variety of topics, primarily on database and Semantic Web technologies, and he is a published author in the field of intelligent systems.



28 January 2014


Databases are developed to store data, but more important is the ability to query the data after it has been stored. Database management systems, such as those based around relational models (that is, Structured Query Language (SQL)-based), document models (such as MongoDB and Apache CouchDB), and graph models (such as Resource Description Framework (RDF) triple stores), have the ability to accurately store and query the data stored within them.

When you query data in its pure form, however, you can only retrieve certain, specific information. A step above the basic query is statistical analysis, which is based on specific counts (percentages and ratios). Using statistics, you can predict possible futures, but such predictions are usually based only on the closed set of data stored in your database.

Data mining exists in several forms; many of its methods are specifically for statistical analysis and attempt to deduce possible new information based on the existing data. An increasing number of data-mining methodologies now handle uncertainty and vagueness in the existing data as well, and they allow for uncertain and vague answers in the output of data-mining queries.

What and where are uncertainties and vagueness?

Uncertainties and vagueness exist in day-to-day human life. For example, you might hear, "It's supposed to rain today." This would be an uncertainty, because it might not rain at all today and weather prediction is not (yet) infallible. You might also hear from your friend that he or she will be arriving at your house at "around 6 p.m.," the word around indicating an intended vagueness because there is a lack of precision with the specified time.

In fact, if you listen to most conversations in daily life, you'll notice uncertainties and vagueness peppering every sentence. It's quite likely that this very article contains uncertainties and vagueness throughout! It is important to note that data can be either unintentionally or intentionally "fuzzy." For example, if an opinion poll is taken on a sample of the population, you might unintentionally miss a strong opinion because it was not part of your sample; if you hear a weather forecast on the television or radio, the presenter uses intentionally fuzzy terminology such as a "light breeze" or "heavy rain." What does it mean to be a "light breeze" or "heavy rain"? How "light" or "heavy" will they be? Or, in terms of the opinion poll example, you might consider that strong opinion to be somewhat important in shaping your decisions, but by how much?

Traditional probability theory provides statistical methods to determine whether an event is likely to happen and can also successfully classify new information based on old information. This form of probability is more precisely known as subjective probability and attempts to answer the question, "How probable is X?" An alternative view is fuzzy set theory, which attempts to answer the question, "To what degree do I consider X to be a member of this set?" and can therefore be seen as a "possibility measure" rather than any kind of probabilistic measure. There has also been some work on translating fuzzy sets into probability distributions and vice versa.


Basics of fuzzy set theory

Classical set theory provides operations for combining and manipulating sets of data, including union (returning a combination of two or more sets), intersection (finding the common data in two or more sets), set difference (removing data found in one set from another set), and negation (finding everything that is known about but is not within that set). Various general laws (like physical laws rather than rules and regulations) exist for simplifying combinations of set theoretic operations. These laws always hold and are mathematically set in stone because of the nature of numbers, sets, and ordering. Classical set theory permeates the very nature of all common databases, where a record (or an object of data) can be seen as an item within one or more sets. Database queries relate these sets and filter these relations to provide a result. With classical set theory, an item of data is either "in" or "out" of a set: There is no half-way and no shades of grey.

Fuzzy set theory, in contrast, allows for an item of data to be "in" (or "out") of a set to a degree. So, you might consider Nick to have brown hair to a 0.9 membership but consider Joanne to have brown hair to a 0.5 membership. In classical set theory, you would have to choose whether to put them both in the "brown" set, which would be particularly difficult in the case of Joanne, as it is relatively uncertain what her hair color actually is. Because "fully in" corresponds to a membership of 1.0 and "fully out" to a membership of 0.0, it is possible to see classical set theory as a restricted version of fuzzy set theory, and in the literature on the subject, it is not uncommon to see classical set theory denoted as "crisp set theory" for this reason. Like classical set theory, fuzzy set theory has the same group of operations, including union, intersection, set difference, and negation, but there are some slight differences in the general laws, simply because some of the laws of classical set theory work only when you restrict membership to two values. Those readers who have knowledge of relational databases might understand union and intersection as particular types of join.

Various forms of union, intersection, negation, and difference are found within fuzzy set theory, but the following are the most common. They also have the benefit of working on crisp sets:

  • Fuzzy union:
    A union B = { max(mu(a), mu(b)) for a in A and b in B and if a == b }

    This means that the result of a fuzzy union is the maximum membership value of every unique item in both sets. So, if "Joanne" appeared in both sets, whichever of her two membership values was higher becomes her membership value in the union.

  • Fuzzy intersection:
    A intersection B = { min(mu(a), mu(b)) for a in A and b in B and if a == b }

    This means that the result of a fuzzy intersection is the minimum membership value of every unique item in both sets. If "Nick" has a membership of 0.9 in one set and 0.1 in the other, then 0.1 is chosen as the membership value for the intersection. If "Nick" had a membership of 0.0 (that is, he did not appear in one set), then he would be discarded from the intersection.

  • Fuzzy negation:
    -A = { 1 - mu(a) for a in A }

    This means that you process the entire set and take away each membership value from 1.0. For example, "Keith" might be in the set of "Grey Hair People" to a degree of 0.8; if you want to find out what he is in the set of "Not Grey Hair People," you perform 1.0 – 0.8, which of course equals 0.2.

  • Fuzzy set difference:
    A - B = { max(0, mu(a) - mu(b)) for a in A and b in B and if a == b }

    This means that you subtract the memberships of each element in one data set from each matched element in another data set. This particular form is more accurately termed bounded difference, because if the result from the subtraction is a negative number, then you bound it to 0. So, if "Nick" is in the set of brown-haired people to a degree of 0.9 and he is also in a set of blonde-haired people to a degree of 0.2, then the fuzzy set difference (BrownHairedPeople – BlondeHairedPeople) will include Nick to a degree of 0.7 (that is, 0.9 – 0.2). This could be used to assist in figuring out whether Nick is clearly brown haired through eliminating other potential properties.
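The four operations above can be sketched in a few lines of Python, representing a fuzzy set as a dictionary that maps each item to its membership value. This is an illustrative sketch, not a standard library API; the function names and sample data are invented for this example:

```python
def fuzzy_union(a, b):
    """Maximum membership of every item appearing in either set (absent => 0.0)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_intersection(a, b):
    """Minimum membership of every item appearing in both sets."""
    return {x: min(a[x], b[x]) for x in set(a) & set(b)}

def fuzzy_negation(a):
    """Complement: subtract each membership value from 1.0."""
    return {x: 1.0 - mu for x, mu in a.items()}

def fuzzy_difference(a, b):
    """Bounded difference: max(0, mu_A(x) - mu_B(x)) for each item in A."""
    return {x: max(0.0, mu - b.get(x, 0.0)) for x, mu in a.items()}

brown_hair = {"Nick": 0.9, "Joanne": 0.5}
blonde_hair = {"Nick": 0.2, "Keith": 0.7}

fuzzy_union(brown_hair, blonde_hair)       # Nick keeps his higher membership, 0.9
fuzzy_difference(brown_hair, blonde_hair)  # Nick: 0.9 - 0.2, roughly 0.7
```

Because crisp sets are just fuzzy sets restricted to memberships of 0.0 and 1.0, the same four functions work unchanged on crisp data.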


Developing fuzzy data mining

It is through the use of fuzzy set operations that you can build fuzzy algorithms, fuzzy databases, and fuzzy knowledgebases. Fuzzy data mining began to appear in the late 1990s and early 2000s, as machines became powerful enough to run data-mining and machine learning methods over large data sets and the need to handle uncertainty and vagueness became pressing. The form of fuzzy data mining primarily covered in this article is fuzzy association rule mining, but I briefly cover other fuzzy data-mining techniques as well.

Classical association rule mining

At IBM in the early 1990s, a group of researchers that included Rakesh Agrawal, Tomasz Imielinski, and Arun Swami conceptualised the idea of automatically calculating the "confidence" of relationships from one item to another in a transaction database. The transaction database in question was a supermarket database containing the transactions that customers made. The algorithm that Agrawal and his team developed, called APRIORI, matched items such as "milk" with items such as "bread" and calculated a confidence based on how often milk and bread were seen together in single transactions. These strings of (ordered) items became known as association rules, the number of transactions containing a given set of items became known as the support, and the result was a confidence, calculated by dividing the support for all items in the rule by the support for the initial items. Transactional association rules continued to be developed over the years to become more efficient, and researchers developed versions of the algorithm specifically for parallel processing computers. In essence, the following calculates a transaction association rule confidence:

transaction_association_confidence(A to B) = support(A union B)/support(A)

To clarify with an example, if A is milk and B is bread, then the confidence of "milk then bread" equals the number (or percentage) of transactions containing both milk and bread divided by the number (or percentage) of transactions containing milk. (Note that the union here is a union of itemsets: a transaction supports A union B only if it contains every item in both A and B.)
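The calculation can be sketched in Python, treating each transaction as a set of items. The function names and toy basket data here are invented for illustration, not taken from any library:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def transaction_association_confidence(transactions, a, b):
    """confidence(A to B) = support(A union B) / support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)

baskets = [
    {"milk", "bread"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

# Bread appears in 2 of the 3 baskets that contain milk, so roughly 0.667.
transaction_association_confidence(baskets, {"milk"}, {"bread"})
```

A full APRIORI implementation would also prune candidate itemsets by a minimum support threshold before computing confidences; this sketch shows only the confidence step.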

A slightly different form of association rule was born based not on transactions but on properties of an object. So instead of finding the confidence of "milk then bread," you could find the confidence of "brown hair to hazel eyes," where items (in this case, people) would appear in both the source set and a target set. In this case, because of the nature of objects, the union within the confidence calculation would need to become an intersection:

object_association_confidence(A to B) = support(A intersection B)/support(A)

To clarify this with an example, if A is brown hair and B is hazel eyes, then you divide the number of people who have brown hair and hazel eyes by the number of people who have brown hair (regardless of their eye colour).
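A minimal sketch of this object-oriented form, where each object is described by a set of properties (again, the function name and data are illustrative assumptions):

```python
def object_association_confidence(objects, source, target):
    """confidence(A to B) = |A intersection B| / |A| over crisp property sets."""
    a = {name for name, props in objects.items() if source in props}
    b = {name for name, props in objects.items() if target in props}
    return len(a & b) / len(a)

people = {
    "Nick":   {"brown hair", "hazel eyes"},
    "Joanne": {"brown hair", "blue eyes"},
    "Keith":  {"grey hair", "hazel eyes"},
}

# One of the two brown-haired people also has hazel eyes, so 0.5.
object_association_confidence(people, "brown hair", "hazel eyes")
```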

Fuzzy association rules how-to

One of the easiest ways to make the association rule mining algorithm more "fuzzy" is simply to perform cuts (known as alpha-cuts) at predetermined points between 0.0 and 1.0: members with a membership value at or above the cut level are treated as crisp (that is, full membership), and those below it are discarded (no membership). The result is a confidence level for the rule at each alpha value. This is one of the fuzziest methods and produces a two-dimensional graph rather than a single confidence value. Your fuzzy association calculation would look like this (and this could work for either a transactional form or an object-orientated form):

fully_fuzzy_association_confidence(A to B) = {classical_association_confidence(alphacut(A, alpha) 
    to alphacut(B, alpha)) for alpha in predetermined_alphas}

Although this is possibly the fuzziest option that is also easily implementable, it has the potential to be quite computationally expensive. It also requires membership values (or membership functions) to be attached to each item (or each class of item) in a database. Ideally, these membership values would be calculated by some formula: for example, "heavy rain" could be a fuzzy membership function based on millimetres of rainfall, associated with "strong wind," a fuzzy membership function based on kilometres per hour. The association could be built using instances from a particular day. For example, if you analyse a particular country on a particular day of the year over the past five years, you might find a strong association between "heavy rain" and "strong wind"; a forecaster, however, might be purposefully vaguer (using a higher alpha-cut) and state that if "heavy rain" occurs, then you might have "strong wind" on that day.
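The alpha-cut approach can be sketched as follows. The fuzzy sets, alpha levels, and function names below are illustrative assumptions; in practice, the membership values would come from membership functions over measurements such as millimetres of rainfall:

```python
def alpha_cut(fuzzy_set, alpha):
    """Crisp set of items whose membership is at least alpha."""
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}

def crisp_confidence(a, b):
    """|A intersection B| / |A|, guarding against an empty antecedent."""
    return len(a & b) / len(a) if a else 0.0

def fully_fuzzy_association_confidence(fa, fb, alphas=(0.25, 0.5, 0.75)):
    """One confidence value per alpha level: a curve, not a single number."""
    return {alpha: crisp_confidence(alpha_cut(fa, alpha), alpha_cut(fb, alpha))
            for alpha in alphas}

# Daily membership of "heavy rain" and "strong wind" over three days.
heavy_rain  = {"Mon": 0.9, "Tue": 0.6, "Wed": 0.3}
strong_wind = {"Mon": 0.8, "Tue": 0.4, "Wed": 0.7}

fully_fuzzy_association_confidence(heavy_rain, strong_wind)
# At the strict 0.75 cut, only Monday survives in both sets, so confidence is 1.0;
# at 0.5 it drops to 0.5, because Tuesday's wind no longer counts as "strong."
```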

What seems to be more prevalent in the fuzzy association rule mining literature is performing summations of the membership value and dividing the result by the number of transactions (or objects) in the database. Thus, to clarify the two forms:

fuzzy_sum_transaction_association_confidence(A to B) = sum({mu(t) 
    for t in (A union B)})/number_of_transactions



fuzzy_sum_object_association_confidence(A to B) = sum({mu(o) 
    for o in (A intersection B)})/number_of_objects

This form of summation simplifies the resulting confidence in that it provides a single value (rather than a two-dimensional graph), but it effectively "defuzzifies" the data: you lose the expressiveness that fuzziness provides. It also requires preset membership values or membership functions.
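The object-oriented summation form can be sketched like this, using the min-based fuzzy intersection from earlier (function names and data are again illustrative assumptions):

```python
def fuzzy_sum_object_association_confidence(fa, fb, number_of_objects):
    """Sum the intersection memberships (min of the two), divide by database size."""
    intersection = {o: min(fa[o], fb[o]) for o in set(fa) & set(fb)}
    return sum(intersection.values()) / number_of_objects

brown_hair = {"Nick": 0.9, "Joanne": 0.5}
hazel_eyes = {"Nick": 0.7, "Keith": 0.8}

# Only Nick is in both sets, contributing min(0.9, 0.7) = 0.7 over 3 objects.
fuzzy_sum_object_association_confidence(brown_hair, hazel_eyes, number_of_objects=3)
```

The transactional form is analogous, with max-based union memberships summed over the transactions and divided by the number of transactions.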

In 1998, a group of researchers from the Chinese University of Hong Kong, including Ada Wai-Chee Fu and Man Hon Wong, developed a technique for finding fuzzy sets from a given database. The academic paper presenting their work was entitled, "Finding Fuzzy Sets for the Mining of Fuzzy Association Rules for Numerical Attributes," and it was written for the Intelligent Data Engineering and Automated Learning Conference in 1998. This work is particularly interesting because it eliminates the requirement of knowing beforehand the relevant fuzzy membership values or functions. The downside is that it uses a rather complicated clustering algorithm known as CLARANS, which is beyond the scope of this article.


Other forms of fuzzy data mining

Association rule mining is not the only form of data mining. In fact, the team at IBM led by Rakesh Agrawal developed sequence pattern mining as an extension to association pattern mining. Instead of forming directional associations regardless of time, sequence patterns are time sensitive; the items simply have to occur in order. Sequence pattern mining has perhaps become more useful for transactional databases than association rules, and it also applies to any timestamped database records (or temporal data sets). Sequence pattern mining can likewise be "fuzzified": for example, you could ask, "Which items are associated in the early afternoon?" Classification is another form of data mining; it often uses clustering methods, attempts to discover similarity statistically, and then classifies the data accordingly. Fuzzy classification techniques exist and have been applied to subjects as diverse as geology and medicine.


Conclusion

This article provided a brief overview of how to handle uncertainties and vagueness in data mining. It explored the development and practice of fuzzy set theory as a solution to handling uncertainty and vagueness. It also investigated association rules, which were originally developed by IBM, and have since been extended and enhanced by the data-mining community. I explored various ways to handle fuzzy data in association rules and highlighted that "fuzzification" and the result of fuzzy association rule mining depends strongly on the domain and the user. Finally, the article offered a short discussion of other forms of fuzzy data mining, all of which have been successfully employed in various industries.

The algorithms found within data mining are becoming increasingly necessary to those who store data in databases. This becomes clear when you look at the businesses that sell database products (including IBM) and advertise that their products scale up and handle parallelised computing, as well as when you see researchers and developers discussing the pros and cons of big data. I am certain that fuzzy set theory, along with probability theory, will be crucial in the future development of data mining, because it has the potential not only to handle uncertainties and vagueness but also to allow computer interfaces to become more human through the implementation of "computing with words" rather than with code.

IBM has two data-mining products: IBM DB2® Intelligent Miner® and SPSS® Modeler. IBM has also laid some of the foundations of the Unstructured Information Management Architecture, which is an architecture for data mining. Other open source and commercial software products are available for data mining as well, but these systems are primarily statistical, and I would challenge businesses and the open source community to implement fuzzy techniques. For the time being, however, we as developers can implement fuzzy techniques on top of these systems, and I invite you to comment on this article and share your successes and any troubles you might have in implementing fuzzy data-mining algorithms.
