Databases are developed to store data, but more important is the ability to query the data after it has been stored. Database management systems, such as those based around relational models (that is, Structured Query Language (SQL)based), document models (such as MongoDB and Apache CouchDB), and graph models (such as Resource Description Framework (RDF) triple stores), have the ability to accurately store and query the data stored within them.
When you query data in its pure form, however, you can only retrieve certain and specific information. A step higher than the basic query is statistical analyses, which are based on certain specific counts (percentages and ratios). Using statistics, you can predict possible futures, but that is largely (usually) based on the closed data that is stored in your database.
Data mining exists in several forms; many of its methods are specifically for statistical analysis and attempt to deduce possible new information based on the existing data. An increasing number of datamining methodologies now handle uncertainty and vagueness in the existing data as well, and they allow for uncertain and vague answers in the output of datamining queries.
What and where are uncertainties and vagueness?
Uncertainties and vagueness exist in daytoday human life. For example, you might hear, "It's supposed to rain today." This would be an uncertainty, because it might not rain at all today and weather prediction is not (yet) infallible. You might also hear from your friend that he or she will be arriving at your house at "around 6 p.m.," the word around indicating an intended vagueness because there is a lack of precision with the specified time.
In fact, if you listen to most conversations you make in daily life, you'll notice uncertainties and vagueness peppering our every sentence. It's quite likely that this very article has uncertainties and vagueness throughout! It is important to note that data can be either unintentionally or intentionally "fuzzy." For example, if an opinion poll has been taken on a sample of the population, you might unintentionally miss a strong opinion if it were not part of your sample, but if you hear a weather forecast on the television or radio, the presenter uses intentionally fuzzy terminology such as a "light breeze" or "heavy rain." What does it mean to be a "light breeze" or "heavy rain"? How "light" or "heavy" will they be? Or, perhaps in terms of the opinion poll example, it might be that you could consider that strong opinion to be somewhat important in shaping your decisions—but by how much?
Traditional probability theory provides statistical methods to decipher whether an event is likely to happen and can also successfully classify new information based on old information. This form of probability is more precisely known as subjective probability and attempts to answer the question, "How probable is X?" An alternative view is to consider fuzzy set theory, which attempts to answer the question, "How much do I consider X to be a member of this set?" and could be seen therefore as a "possibility measure" rather than any kind of probabilistic measure. There has been some work on translating fuzzy sets into probability sets and vice versa.
Basics of fuzzy set theory
Classical set theory provides operations for combining and manipulating sets of data, including union (returning a combination from two or more sets), intersection (finding the common data in two or more sets), set difference (removing data found in one set from another set), and negation (specifying that you want to find everything that we know about but is not within that set). Various general laws (like physical laws rather than rules and regulations) exist for the simplification of combinations of set theoretic operations. These laws always hold and are mathematically set in stone because of the nature of numbers, sets, and ordering. Classical set theory permeates the very nature of all common databases, where a record (or an object of data) can be seen as an item within one or more sets. Database queries relate these sets and filter these relations to provide a result. With classical set theory, an item of data is either "in" or "out" of a set: There is no halfway and no shades of grey.
Fuzzy set theory, in contrast, allows for an item of data to be "in" (or "out") of a set to a degree. So, you might consider Nick to have brown hair to a 0.9 membership but consider Joanne to have brown hair to a 0.5 membership. In classical set theory, you would have to choose whether to put them both in the "brown" set, which would be particularly difficult in the case of Joanne, as it is relatively uncertain what her hair color actually is. As you can see "fully in" as a membership of 1.0, and "fully out" as a membership of 0.0, it is possible to see classical set theory as a restricted version of fuzzy set theory, and in the literature on the subject, it is not uncommon to see classical set theory denoted as "crisp set theory" for this reason. Like classical set theory, fuzzy set theory has the same group of operations, including union, intersection, set difference, and negation, but there are some slight differences in the general laws, simply because some of the laws of classical set theory work only when you restrict membership to two values. Those readers who have knowledge of relational databases might understand union and intersection as particular types of join.
Various forms of union, intersection, negation, and difference are found within fuzzy set theory, but the following are the most common. They also have the benefit of working on crisp sets:
 Fuzzy union:
A union B = { max(mu(a), mu(b)) for a in A and b in B and if a == b }
This means that the result of a fuzzy union is the maximum membership value of every unique item in both sets. So, if "Joanne" appeared in both lists, then whichever membership value were higher out of the two sets becomes the membership value for the union.
 Fuzzy intersection:
A intersection B = { min(mu(a), mu(b) for a in A and b in B and if a == b }
This means that the result of a fuzzy intersection is the minimum membership value of every unique item in both sets. If "Nick" has a membership of 0.9 in one set and 0.1 in the other, then 0.1 is chosen as the membership value for the intersection. If "Nick" had a membership of 0.0 (that is, he did not appear in one set), then he would be discarded from the intersection.
 Fuzzy negation:
A = { 1 – mu(a) for a in A }
This means that you process the entire set and take away each membership value from 1.0. For example, "Keith" might be in the set of "Grey Hair People" to a degree of 0.8; if you want to find out what he is in the set of "Not Grey Hair People," you perform 1.0 – 0.8, which of course equals 0.2.
 Fuzzy set difference:
A – B = { max(0, mu(a) – mu(b)) for a in A and b in B and if a == b }
This means that you subtract the memberships of each element in one data set from each matched element in another data set. This particular form is more accurately termed bounded difference, because if the result from the subtraction is a negative number, then you bound it to 0. So, if "Nick" is in the set of brownhaired people to a degree of 0.9 and he is also in a set of blondehaired people to a degree of 0.2, then the fuzzy set difference (BrownHairedPeople – BlondeHairedPeople) will include Nick to a degree of 0.7 (that is, 0.9 – 0.2). This could be used to assist in figuring out whether Nick is clearly brown haired through eliminating other potential properties.
Developing fuzzy data mining
It is through the use of fuzzy set operations that you can build fuzzy algorithms, fuzzy databases, and fuzzy knowledgebases. Fuzzy data mining began to appear in the late 1990s and early 2000s, after computational machines became powerful enough to implement datamining and machine learning methods on top of large data sets and the need to handle uncertainty and vagueness became essential. The form of fuzzy data mining primarily covered in this article is fuzzy association rule mining, but I briefly cover other fuzzy datamining techniques as well.
Classical association rule mining
At IBM in the early 1990s, a group of researchers that included Rakesh Agrawal, Tomasz Imielinski, and Arun Swami conceptualised the idea of automatically calculating the "confidence" of relationships from one item to another item in a transaction database. The transaction database in question was a supermarket database that contained transactions that customers would make. The algorithm that Agrawal and his team developed, called APRIORI, matched items such as "milk" with items such as "bread" and calculated a confidence based on how often milk and bread were seen together in single transactions. This string of (ordered) items became known as association rules, the number of transactions with one or more of those items in became known as the support, and the result was a confidence, which was calculated by dividing the support for all items in the rule by support for the initial items. Transactional association rules would continue to be developed over the years to become more efficient, and researchers developed versions of the algorithm specifically for parallel processing computers. In essence, the following would be done to calculate a transaction association rule confidence:
transaction_association_confidence(A to B) = support(A union B)/support(A)
To clarify with an example, this means that if A is milk and B is bread, then the confidence of "milk then bread" equals the number (or percentage) of transactions with either (or both) milk or bread over the number (or percentage) of transactions with just milk.
A slightly different form of association rule was born based not on transactions but on properties of an object. So instead of finding the confidence of "milk then bread," you could find the confidence of "brown hair to hazel eyes," where items (in this case, people) would appear in both the source set and a target set. In this case, because of the nature of objects, the union within the confidence calculation would need to become an intersection:
object_association_confidence(A to B) = support(A intersection B)/support(A)
To clarify this with an example, if A is brown hair and B is hazel eyes, then you divide the number of people who have brown hair and hazel eyes by the number of people who have brown hair (regardless of their eye colour).
Fuzzy association rules howto
One of the easiest ways to make the association rule mining algorithm more "fuzzy" is simply to perform cuts (known as alphacuts) at predetermined points between 0.0 and 1.0, where those members with a membership above that cut level are considered crisp (that is, full membership) and those below are with no membership. The result is confidence levels for rules at the various mu values. This is one of the fuzziest methods and results in a twodimensional graph rather than a single confidence value. So, your fuzzy association calculation would look like this (and this could work for either a transactional form or an objectorientated form):
fully_fuzzy_association_confidence(A to B) = {classical_association_confidence(alphacut(A, alpha) to alphacut(B, alpha)) for alpha in predetermined_alphas}
Although this is possibly the fuzziest option that is also easily implementable, it has the potential to be quite computationally expensive. It also requires membership values (or membership functions) to be appended to each item (or each class of item) in a database. Ideally, these membership values would be calculated by some form of formula—for example, "heavy rain" could be a fuzzy membership function based on millimetres of rainfall, and this could be associated with "strong wind," which could be a fuzzy membership function based on kilometres per hour. The association could be built using instances from a particular day. For example, if you analyse this country and this day for the past five years, you might find that there is a strong association between "heavy rain" and "strong wind," but if you heard a forecaster, he or she might be a little more purposefully vague (and have a higher alphacut) and state that if "heavy rain" occurs, then you might have "strong wind" on this day.
What seems to be more prevalent in the fuzzy association rule mining literature is performing summations of the membership value and dividing the result by the number of transactions (or objects) in the database. Thus, to clarify the two forms:
fuzzy_sum_transaction_association_confidence(A to B) = sum({mu(t) for t in (A union B)})/number_of_transactions _________________________________________ fuzzy_sum_object_association_confidence(A to B) = sum({mu(o) for o in (A intersection B)})/number_of_objects
This form of summation simplifies the resultant confidence in that it provides a single value (rather than a twodimensional graph) but effectively "defuzzifies" the data. You no longer have the meaningfulness that fuzzy provides. It also requires preset membership values or membership functions.
In 1998, a group of researchers from the Chinese University of Hong Kong, including Ada Waichee and Man Hon Wong, developed a technique for finding fuzzy sets from a given database. The academic paper presenting their work was entitled, "Finding Fuzzy Sets for the Mining of Fuzzy Association Rules for Numerical Attributes," and it was written for the Intelligent Data Engineering and Automated Learning Conference in 1998. This work is particularly interesting because it eliminates the requirement of knowing beforehand the relevant fuzzy membership values or functions. The downside is that it uses a rather complicated clustering algorithm known as CLARANS, which is beyond the scope of this article.
Other forms of fuzzy data mining
Association rule mining is not the only form of data mining. In fact, the team at IBM led by Rakesh developed sequence pattern mining as an extension to association pattern mining. Instead of forming directional associations regardless of time, these sequence patterns are time sensitive; they simply have to be in order. Sequence pattern mining perhaps became more useful for transactional databases than association rules and also has application in any timestamped database records (or temporal data sets). Sequence pattern mining also has the ability to be "fuzzified." For example, you could talk about, "What items are associated in the early afternoon?" Classification is another form of data mining that often uses clustering methods and attempts to discover similarity statistically and classifies that data. Fuzzy classification techniques exist and have been applied to such diverse subjects as geology and medicine.
Conclusion
This article provided a brief overview of how to handle uncertainties and vagueness in data mining. It explored the development and practice of fuzzy set theory as a solution to handling uncertainty and vagueness. It also investigated association rules, which were originally developed by IBM, and have since been extended and enhanced by the datamining community. I explored various ways to handle fuzzy data in association rules and highlighted that "fuzzification" and the result of fuzzy association rule mining depends strongly on the domain and the user. Finally, the article offered a short discussion of other forms of fuzzy data mining, all of which have been successfully employed in various industries.
The algorithms found within data mining are becoming increasingly necessary to those who are storing data in databases. It has become clear when you look at the businesses that sell database products (including IBM) and that advertise that their products scale up and can handle parallelised computing as well as when you see researchers and developers discussing the pros and cons of big data. I am certain that fuzzy set theory, along with probability theory, will be crucial in the future development of data mining, because it has the potential not only to handle uncertainties and vagueness but also to allow computer interfaces to become more human through the implementation of "computing with words" rather than with code.
IBM has two datamining products: IBM DB2® Intelligent Miner® and SPSS® Modeler. IBM has also laid some of the foundations of the Unstructured Information Management Architecture, which is an architecture for data mining. Other open source and commercial software products are available for data mining as well, but these systems are primarily statistical, and I would challenge businesses and the open source community to implement fuzzy techniques. For the time being, however, we as developers can implement fuzzy techniques on top of these systems, and I invite you to comment on this article and share your successes and any troubles you might have in implementing fuzzy datamining algorithms.
Resources
Learn
 Mining Your Own Business in Retail Using DB2 Intelligent Miner for Data (IBM Redbooks®) is a guide for using DB2 Intelligent Miner in the retail sector.
 "Statistical analysis of medical data with IBM SPSS Modeler" (Alex Torquato Souza Carneiro and Marcelo Franceschi de Bianchi, developerWorks, October 2012) is a guide on how to use the datamining and machine learning techniques in SPSS Modeler.
 "XML data mining, Part 2: Mining XML association rules" (Laura Irina Rusu, developerWorks, December 2011) is a stepbystep guide to finding association rules from XML files.
 Fuzzy Logic in the Stanford Encyclopedia of Philosophy is a philosophical description of fuzzy logic and fuzzy set theory.
 The video Fuzzy Logic is a practical tutorial for building fuzzy logic and fuzzy set applications, particularly for data analysis.
 Finding Fuzzy Sets for the Mining of Fuzzy Association Rules for Numerical Attributes by Ada Waichee, Man Hon Wong, et al., is about using clustering techniques to find fuzzy membership functions automatically for use in association rule mining.
 Fast Algorithms for Mining Association Rules by Rakesh Agrawal and Ramakrishnan Srikant, both of IBM: This 1994 article describes association rule mining and the APRIORI algorithm.
 Efficient and Effective Clustering Methods for Spatial Data Mining by Raymond Ng and Jiawei Han: This 1994 article discusses various clustering methods, but most importantly it introduces the CLARANS algorithm.
 The MongoDB website is a great resource for learning about the document store MongoDB and NoSQL in general.
 The CouchDB website at the Apache Foundation also has relevant information about NoSQL systems.
 Learn more about RDF triple stores from the W3C Wiki on large triple stores.
 Find the resources you need to improve outcomes and control risk in the developerWorks Business analytics zone.
 Browse the technology bookstore for books on these and other technical topics.
 Follow developerWorks on Twitter.
 Watch developerWorks ondemand demos ranging from product installation and setup demos for beginners to advanced functionality for experienced developers.
Get products and technologies
 Learn more about the SPSS family of products.
 Learn more about DB2 Intelligent Miner.
 Evaluate IBM products.
Discuss
 Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developerdriven blogs, forums, groups, and wikis.
Comments
Dig deeper into Big data and analytics on developerWorks

developerWorks Premium
Exclusive tools to build your next great app. Learn more.

dW Answers
Ask a technical question

Explore more technical topics
Tutorials & training to grow your development skills