Data growth and standards

An exploration of relevant open standards

This article examines the challenges presented by the explosion of data and its analysis, and introduces some standards relevant to those challenges. A sample scenario depicts a system where large amounts of data are ingested, understood, and manipulated, and where specific standards promote integration and interoperability.

Peter Haggar, Senior Technical Staff Member, IBM

Peter Haggar is a Senior Technical Staff Member with IBM in Research Triangle Park, North Carolina. He has most recently worked on business analytics, emerging software standards, XML, binary XML, and web services. He now works on emerging Internet technology focused on Watson and DeepQA. He has worked for IBM for more than 20 years. Contact Peter at haggar@us.ibm.com.



21 June 2011


Overview

Frequently used acronyms

  • GPS: Global Positioning System
  • HTML: HyperText Markup Language
  • IT: Information technology
  • OASIS: Organization for the Advancement of Structured Information Standards
  • OLAP: Online Analytical Processing
  • SEC: Securities and Exchange Commission
  • W3C: World Wide Web Consortium
  • XML: Extensible Markup Language

After years of investment in technology to record and store data from virtually every transaction and from a vast array of instrumented objects, customers want to get more value out of that information. Businesses want information that is more timely and useful, particularly if it can directly and positively affect growth and profitability.

Data analysis spans many problem domains, including retail sales, fraud detection, customer acquisition and retention, security, and financial services, and therefore many technologies. This article presents key standards and technologies used to build solutions in these domains, along with the value they deliver.

For years, the IT industry has spent untold time and money creating systems to record data and transactions. In addition, the number of devices that produce data that is collected is growing exponentially. Furthermore, vast data storage systems are available to store this data, and fast networks exist to transmit it between data centers and machines that process it. Businesses want to take advantage of the investment in the available data to gain timely and useful insights to feed growth and profitability.


What is business analytics?

Business analytics is technology that delivers immediate and actionable insights into how a business is performing. It enables you to spot and analyze trends, patterns, and anomalies so you can plan, budget, and forecast resources. The goal is to make smarter decisions that lead to better and more profitable outcomes. The opportunity to create business value through data is enhanced by the sheer volume of available data. The challenge lies in producing analytic output that creates this value in a cost-effective manner.

Business analytics refers to the analysis and organization of data and the delivery of meaningful business information on time and in convenient forms. For example, real-time alerts or executive dashboards are forms of presentation that show high-level measurements of corporate performance. By delivering information online, rather than in static reports, business analytic tools allow you to know relevant business facts sooner while allowing you to "drill down" to examine details by clicking a chart to see the numbers behind it.

Business analytics is not a single product or technology, but a technology domain that requires many products to interoperate. An analytics system analyzes data that is likely stored in disparate databases and warehouses in various data formats. In addition, the system might also incorporate real-time data feeds to analyze in conjunction with historical data. While the data is analyzed, rules might be applied, predictive or optimization models incorporated, and different forms of output produced depending on the scenario or problem being solved.

Consider a retail store trying to retain existing customers. The customer's product-buying history might be stored in one database while the customer's transaction history is in another. The retail store can glean what types of products are purchased, how much money a particular customer has spent on these products at different times of the year, how purchasing offers influence buying decisions, and so on. The retail store also has real-time data that is not stored in the aforementioned databases, such as what is moving onto and off of its shelves right now based on live sales data. Using all of this data, a predictive model can be built to determine, with a level of confidence, how likely a particular customer is to purchase incoming or existing products at the store. Based on these various factors, this model can be combined with business rules, customer demographics, and historical buying patterns and choices to make intelligent decisions. For example, a store might take action in real time through a special offer at the point of sale, or it might determine the best time to offer and advertise incentives and sales and whom to target with them. Analytics can yield interesting and useful insights that help a store understand customer trends and behavior and make sure that customers know about specific, targeted offers.

Such scenarios combine multiple databases of historical information, real-time data feeds, predictive or optimization models, business rules, and a user interface dashboard, all working in concert even though these components were not necessarily designed or developed to solve the particular problem at hand. Standards are the best way to address the complex interactions between the various products and systems because of the tight communication required. Standards benefit customers because customers know that their data, rules, predictive models, and so on are stored in an open format or are accessible in an open way and are not controlled by a single vendor. Standards give customers the freedom of action they desire: they are not locked into a particular tool set, data format, or protocol. In addition, standards allow disparate systems to work together without having been built with one another in mind.

The focus of business analytics is to develop new insights and understanding of a business by applying statistical methods and analysis to this data, leading to better and more informed decisions. Business analytics software can provide these kinds of actionable insights by analyzing huge amounts of data in a short period of time.


Analysis of data

Data analysis is not new; however, some of the challenges today include these:

  • The vast amount of data that you must process, or you can process, to produce accurate and actionable results
  • The speed at which you need to analyze data to produce results
  • The type of data that you analyze—structured versus unstructured

Amount of data

Analytic systems today must be able to handle Internet-scale data volumes. Online data is growing rapidly, and terms like terabyte, petabyte, and exabyte are commonly used. (See Table 1.)

Table 1. Definitions and estimations of data volumes
  • Gigabyte: 1024 megabytes. Example: 4.7 gigabytes is a single DVD.
  • Terabyte: 1024 gigabytes. Examples: 1 terabyte is about two years' worth of non-stop MP3s (assuming one megabyte per minute of music); 10 terabytes is the printed collection of the U.S. Library of Congress.
  • Petabyte: 1024 terabytes. Examples: 1 petabyte is the amount of data stored on a stack of CDs about 2 miles high, or 13 years of HD-TV video; 20 petabytes is the storage capacity of all hard disk drives created in 1995.
  • Exabyte: 1024 petabytes. Examples: 1 exabyte is one billion gigabytes; 5 exabytes is all words ever spoken by mankind.

In 2002, there were about five exabytes of data online. In 2009, that total increased to 281 exabytes, a growth rate of 56 times in seven years. According to Forrester Research Inc., the total amount of data warehoused by enterprises is doubling every three years.

Internet-scale refers to the terabyte and petabyte age of data sizes and the ability to scale to meet the processing requirements to handle this amount of data in a timely manner. The amount of data to be processed includes stored data, as well as real-time streaming data. Virtually everything is electronically recorded today: video and audio surveillance, banking transactions, purchasing transactions, email traffic, instant messaging traffic, Internet searches, medical images and records, and more.

For example, consider the simple scenario of driving home from work and stopping to buy gas. As you leave your place of work and walk to your car, you are likely recorded on video surveillance cameras. As you drive, your cell phone might be sending GPS location information that is recorded. You then receive a text message while driving home. The time and content of these messages are stored by your carrier. You wait to answer it until you pull into the gas station, where another set of video surveillance cameras records the activity. Your gas purchase transaction is then recorded, along with your frequent buyer card that you scanned at the pump. The gas station happens to be in a high-crime area that the city is monitoring with technology such as ShotSpotter (see Resources for a link). ShotSpotter uses microphones positioned in various locations to record and listen for gunshots. If a gunshot is heard, authorities are notified immediately and video surveillance is taken of the area. Therefore, while you are at the gas station, audio is being analyzed and recorded.

A sizeable portion of the rise in warehoused data is attributable to Electronic Medical Records (EMRs). EMRs and advances in medical imaging, along with the length of time such records must be retained (seven years according to U.S. federal law), will continue to contribute to the massive growth of warehoused data, creating data volumes at a scale previously unthinkable. In addition, video and audio feeds are extremely costly to store because of the large volumes collected and their poor compression characteristics. This high volume makes real-time analysis of this type of data important, because it enables storing only the pertinent parts.

Data is being recorded everywhere from virtually everything that moves, and many things that don't. In addition to a typically recorded transaction, many innocuous objects, such as parking lots, buildings, and street corners, are instrumented and record large volumes of data around the clock.

Speed

As the amount of stored data grows constantly and exponentially, so does the amount of data that a business analytics system must process to produce relevant results. Consider that Twitter processes seven terabytes of data every day, while Facebook processes 10 terabytes each day. The Large Hadron Collider at CERN generates 40 terabytes every second. Without analytic systems that scale to these volumes, the data collected loses value.

To put this volume in perspective, Yahoo! reported using Hadoop to sort one petabyte of data in about 16 hours (see Resources to learn more about these benchmarks). The sort required about 3800 nodes, each with two quad-core 2.5 GHz processors. All other things being equal, sorting an exabyte on the same cluster would take about 1000 times longer, or almost two years.

Business analytic systems also process real-time streaming data that has not yet been stored. The speed at which the large volumes of stored data and the real-time data are processed is critical to producing key insights in a timely manner. In some business analytics use cases, a correct insight or answer delivered too late is effectively the wrong answer. The business analytics system must be able to handle large volumes of data, process them efficiently, and come to its result in a window of time that is relevant to the user. For example, a facial recognition system working off a real-time video feed is of much higher value if the system indicates that a wanted suspect is at a specific location one minute, instead of one day, after the fact.

Structured versus unstructured data

Most data produced today is unstructured. Unstructured means that no semantic meaning is attached to the data in a form a computer program can use to understand what it represents. Structured data has semantic meaning attached, making it easier to understand. For example, the following text message or email contains unstructured data:

Hi Joe, call me...my numbers are home – 919-555-1212, office – 919-555-1213, 
cell – 919-555-1214.

By reading this message, a human understands the latent meaning of the data and can tell you what the home, office, and cell numbers are. If the same data is represented in HTML, it looks structured through its layout and the nested organization of the HTML. The data, however, is unstructured to an analytical system because no meaning is associated with it. HTML, emails, text messages, blogs, video, and audio all represent unstructured information. If the relevant phone number information is put into HTML, you might have this:

<h1>List of Numbers</h1>
<b>HNumber: 919-555-1212</b>
<b>ONumber: 919-555-1213</b>
<b>CNumber: 919-555-1214</b>

The HTML looks structured as described here, but it does not have the type of structure that attaches the latent meaning to the data. This data is still unstructured as far as an analytics processing system is concerned. Furthermore, if you used XML without a schema, it would also be unstructured in the same way that the HTML is:

<ListOfNumbers>
  <HNumber>919-555-1212</HNumber>
  <ONumber>919-555-1213</ONumber>
  <CNumber>919-555-1214</CNumber>
</ListOfNumbers>

XML is often referred to as semi-structured: there is structure in the relationships of the data, but the data is not structured with regard to its meaning. With a schema, you can say that the above XML is structured because you have a way to attach meaning to the data. With a schema, you know that the HNumber, ONumber, and CNumber elements represent different phone numbers for home, office, and cell, respectively. Databases contain structured data as well. Data stored in rows and columns with a schema allows the meaning of the data to be understood by a computer program.

Some of the value of different analytics products is their ability to process large amounts of unstructured data to discover the latent meaning. Consider the text message, HTML, and schema-less XML examples above. A computer program can figure out that those are likely phone numbers because they match a pattern of three digits, followed by a separator [in the form of a hyphen (-), a period (.), or a space ( )], followed by three more digits, a separator, then four digits. More processing can be done to infer that the three numbers are from North Carolina due to the 919 area code. You can imagine a similar algorithm for an international number with a country code.
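As a simplified illustration of this kind of pattern matching, the following sketch extracts the phone numbers from the earlier text message and looks up the area code. The regular expression and the tiny area-code table are illustrative only, not a complete solution.

# A simplified sketch of the pattern matching described above: find strings that
# look like North American phone numbers in unstructured text, then infer the
# region from the area code.
import re

AREA_CODES = {"919": "North Carolina"}   # illustrative lookup table

PHONE = re.compile(r"\b(\d{3})[-. ](\d{3})[-. ](\d{4})\b")

text = ("Hi Joe, call me...my numbers are home - 919-555-1212, "
        "office - 919-555-1213, cell - 919-555-1214.")

for area, exchange, line in PHONE.findall(text):
    region = AREA_CODES.get(area, "unknown region")
    print(f"{area}-{exchange}-{line} ({region})")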

Structured data is simpler to process because more information about its meaning is available to the program beforehand, which is more efficient than spending compute cycles to infer it. Much of today's data growth, however, is in unstructured data, making it critical for systems to process it efficiently and correctly determine the meaning it contains. For example, emails and text messages, as well as audio and video streams, are among the largest categories of unstructured data today. This type of unstructured data continues to grow unabated, making its efficient processing critical to the continued success of business analytic processing systems.

While the amount, speed, and type of data are all challenges facing business analytic systems, great strides are being made in addressing these issues. Processing of huge datasets that used to take weeks now takes minutes. Real-time feeds can be processed efficiently while the data is still in motion, on scale-out clusters with fail-over capability, all running on commodity machines. This kind of processing enables the creation of applications unthinkable just a few years ago. For this area of computing to have maximum benefit, software standards play an important role.


Definitions

Predictive analytics

Predictive analytics uses various historical data sources to make predictions about future events or behavior. Each prediction is provided with a level of confidence.

Data in motion analytics

Data "in motion" analytics is the analysis of data before it has come to rest on a hard drive or other storage medium. Due to the vast amount of data being collected today, it is often not feasible to store the data first before analyzing it. In addition, even if you have the space to store the data first, additional time is required to store and then analyze. This time delay is often not acceptable in some use cases.

Data at rest analytics

Because of the vast amounts of data stored, technology is needed to sift through it, make sense of it, and draw conclusions from it. Much data is stored in relational or OLAP stores, but more data today is not stored in a structured manner. With the explosive growth of unstructured data, technology is required to provide analytics on relational, non-relational, structured, and unstructured data sources.

Business rules

Rules are used to define or constrain some aspect of the business to make more intelligent decisions. Rules are stored outside of application logic, making it easy for a business person to add or modify rules without taking the system offline.

Reporting

Reports take the form of user interface dashboards of varying degrees of complexity.


Key standards

This section describes some of the key standards and their relevance and value to supporting data analysis.

UIMA

UIMA (Unstructured Information Management Architecture) is an OASIS standard for which IBM chaired the technical committee (see Resources). UIMA is a framework to process unstructured information, discover the latent meaning, relationships, and relevant facts contained in that data, and represent those findings in an open and standard form. For example, UIMA can be used to ingest plain text and determine the people, places, organizations, and relationships, such as "is friends with" or "is married to," contained in the data. These findings are represented in a data structure defined by the UIMA standard.

UIMA defines four terms to help in understanding its role and purpose:

  • Artifact—A piece of unstructured content
  • Analysis—Assigns semantics to an artifact
  • Analytic—Software that performs the analysis
  • Artifact metadata—The result of analysis of an artifact by an analytic

Consider a large collection of fast food restaurant surveys, which constitutes a large body of unstructured text. This information is analyzed to find the most common reasons for complaints, to identify the names and locations of stores with the most complaints, and, for each type of complaint, to see which stores generated the most complaints. You can use UIMA to glean this type of information so you can see trends and types of complaints. You can also see which complaint types become rarer and which increase.

Referring to Figure 1, the raw survey data represents the artifact (1), as it is unstructured content. The analysis assigns meaning to the artifacts (2). For example, stores 15 and 38 have the most complaints about the desserts, while store 27 has reduced its complaints by half since the last survey. The analytic is typically proprietary software that performs this analysis and produces the artifact metadata (3). The artifact metadata is contained in a data structure known as the Common Analysis Structure (CAS).

Figure 1. High-level view of UIMA
Diagram of a high-level view of UIMA
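To make these terms concrete, the following is a minimal sketch of the CAS idea: an artifact (raw text) plus typed annotations produced by a toy analytic. This is not the UIMA API or the actual CAS format; the class names and annotation types are illustrative.

# A minimal, illustrative stand-in for the CAS: the artifact is unstructured text,
# and artifact metadata is a list of typed annotations over character spans.
from dataclasses import dataclass, field
from typing import List
import re

@dataclass
class Annotation:
    type: str           # for example, "StoreNumber"
    begin: int          # character offset where the annotation starts
    end: int            # character offset where the annotation ends
    covered_text: str

@dataclass
class Cas:
    artifact: str                               # the unstructured content
    metadata: List[Annotation] = field(default_factory=list)

def store_number_analytic(cas: Cas) -> None:
    """A toy 'analytic' that finds store numbers such as 'Store 15' in the artifact."""
    for m in re.finditer(r"store\s+(\d+)", cas.artifact, re.IGNORECASE):
        cas.metadata.append(Annotation("StoreNumber", m.start(), m.end(), m.group(0)))

survey = "Store 15 dessert was stale. Store 38 dessert arrived melted."
cas = Cas(artifact=survey)
store_number_analytic(cas)
for a in cas.metadata:
    print(a.type, a.covered_text)               # StoreNumber Store 15 / StoreNumber Store 38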

One goal of UIMA is to support interoperability of analytics. The CAS allows for the sharing of these results across analytics. This approach benefits customers by allowing them to share the data representations and interfaces between various tools and products that support UIMA. Given the example in Figure 1, an analytic could interoperate with a tool that performs the analysis on the artifacts if both supported UIMA. This ability enables various tools to interoperate and allows customers to choose different vendors for the analysis of their unstructured data.

UIMA supports a common data representation of artifacts and artifact metadata independently of the original representation of the artifact. It also allows for platform-independent interchange of artifact and artifact metadata while allowing you to discover, reuse, and compose independently developed analytics. Furthermore, UIMA provides interoperability of independently developed analytics. UIMA is the leading technology in this area and is backed by Apache open source implementations. The 1.0 specification is complete as of March 2009, with no further work planned. (For a link to the UIMA specification, see Resources.)

PMML

PMML (Predictive Model Markup Language) is an XML-based markup language developed by the Data Mining Group (DMG), to which IBM contributes. (See Resources.) PMML represents a predictive model that is created after analyzing historical data for various insights.

For example, assume that a telecommunications company wants to analyze historical data to predict, with some level of certainty, whether customers will drop their land-line service in favor of cell service. The algorithm (1 in Figure 2) looks at historical data and produces parameters for an equation across multiple input fields (age, salary, marital status, home owner or renter, level of education, and so on) that can best predict whether the customer is likely to drop the service. The algorithm produces a PMML model (2), which is the input to a scoring process (3). The scoring process outputs a prediction (4) on whether a particular customer is likely to drop the service, along with an indicator of the confidence of this prediction. Higher confidence in the prediction that you will lose a customer might dictate a more aggressive response.

Figure 2. High-level view of PMML
Diagram of a high-level view of PMML

PMML is a model exchange standard for sharing models between vendors. PMML provides applications with vendor-independent models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications. This benefits users, who can develop models in one vendor's application and use another vendor's applications to visualize, analyze, evaluate, and apply those models. Because PMML is an XML-based standard, the specification comes in the form of an XML schema.
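To make the scoring step concrete, the following is a minimal sketch of what a scoring process does with a regression-style churn model. The field names, coefficients, and hand-coded logistic function are illustrative stand-ins; a real engine would load the model from a PMML document produced by a modeling tool rather than hard-code it.

import math

# Illustrative churn-model parameters; a real system would read these from a
# PMML document rather than embed them in code.
MODEL = {
    "intercept": -2.0,
    "coefficients": {"age": -0.03, "monthly_minutes": -0.002, "owns_cell": 1.4},
}

def score(customer):
    """Return (prediction, confidence) for one customer record."""
    z = MODEL["intercept"] + sum(
        MODEL["coefficients"][field] * value for field, value in customer.items()
    )
    p_drop = 1.0 / (1.0 + math.exp(-z))     # probability of dropping land-line service
    prediction = "will drop" if p_drop >= 0.5 else "will keep"
    confidence = max(p_drop, 1.0 - p_drop)
    return prediction, confidence

print(score({"age": 34, "monthly_minutes": 120, "owns_cell": 1}))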

Adoption of PMML in the industry is strong, as the following list of current adopters indicates. (For a link to a web page, see Resources.)

  • Augustus / Open Data Group
  • KNIME
  • MicroStrategy
  • Pervasive DataRush
  • Rapid-i
  • R/Rattle
  • Salford Systems
  • SAS
  • TIBCO
  • Weka
  • Zementis

RIF

RIF (Rule Interchange Format) is a W3C standard for which IBM co-chaired the working group. RIF represents, in XML, the executable form of a business rule. Business rules can be used in business analytic systems in various ways. Rules determine specific actions that the system takes based on various conditions and input. For example, a mortgage lending company would have rules to determine whether a person qualifies for a loan. Factors such as income, debt, and credit score would all play a role. The rules might be of the form: if the borrower has income above X, debt less than Y, and a credit score above Z, the borrower qualifies for a given loan amount. Different vendors have their own proprietary ways to write rules, but RIF provides a common, interoperable representation of their executable form.
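As a minimal sketch, the loan-qualification rule above might look like the following if coded directly in an application. The thresholds are illustrative; in practice the rule would be authored in a business rules management system and interchanged between engines as RIF rather than hard-coded.

# A minimal sketch of the loan-qualification rule; thresholds stand in for X, Y, and Z.
def qualifies_for_loan(income, debt, credit_score,
                       min_income=50_000, max_debt=20_000, min_score=680):
    return income > min_income and debt < max_debt and credit_score > min_score

print(qualifies_for_loan(income=72_000, debt=8_500, credit_score=710))   # True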

RIF was designed primarily for the interchange of rules between rule engines. RIF delivers value because it provides interoperability between rule execution systems while preventing lock-in by rule vendors. This interoperability lets users create their business rules with one set of tools and execute them on any rule execution system that supports RIF.

RIF became a W3C Recommendation in June 2010, so industry adoption is still developing, as this list of RIF implementations indicates. (For a link to a web page, see Resources.)

  • SILK
  • OntoBroker
  • fuxi
  • Eye
  • VampirePrime
  • RIFle
  • Oracle (OBR)
  • STI Innsbruck (IRIS)
  • riftr
  • WebSphere ILOG JRULES
  • TIBCO
  • FICO
  • Drools

These implementations targeted the RIF standard as it was being developed. Several of these vendors might eventually implement the full standard, although that is not assured.

XBRL

XBRL (eXtensible Business Reporting Language) is an XML-based standard from XBRL International used for financial reporting. XBRL is important because various governments have mandated or adopted it as the standard format for financial reports. As its use grows, the analysis of XBRL documents and the data they contain becomes increasingly relevant.

Traditionally, reports are produced in HTML or PDF. These formats, while easy for a human to read, are not structured. XBRL is structured because it is provided in XML with a well-known schema, although it is not very human readable. Because meaning can be attached to the data, a computer program can understand the document and make greater use of it.

Recently, the SEC began requiring 500 of the largest public companies to file their financial statements using XBRL. This requirement will gradually expand to include smaller public companies. Companies with market capitalization above $5 billion began filing in XBRL in 2009, but this year they must submit financial statements with more detailed tagging of footnotes. Those with market capitalization above $700 million must make their initial submission in XBRL without detailed tagging of footnotes. All publicly held Korean firms have been required since October 2007 to electronically file their periodic and other financial reports in the XBRL format. Required XBRL filings are also used in Japan by the Tokyo Stock Exchange (TSE), which accounts for 90% of all trades made on Japanese stock exchanges. Since 2008, the TSE has required all listed entities to file their financial information with the TSE in the XBRL format.

XBRL has been adopted and mandated across several of the most mature world economies. Table 2 identifies several of the XBRL adoptions across the globe.

Table 2. XBRL adoption
Country | Organization | Application/program
Netherlands | Dutch Tax Authority | Corporate tax returns
Australia | Australian Prudential Regulation Authority (APRA) | Prudential filings
Jamaica | Bank of Jamaica | Financial companies' registered filings
United States | Federal Financial Institutions Examination Council (FFIEC) | Call report modernization
United States | Securities and Exchange Commission | XBRL voluntary filer program
Belgium | National Bank of Belgium | Belgian companies' annual account filings
Japan | Bank of Japan | Financial services companies' filings
Spain | Bank of Spain | COREP filings
Canada | Ontario Securities Commission (OSC) | Voluntary filer program
Japan | Tokyo Stock Exchange (TSE) | TSE registrant financial report filings

OWL

Web Ontology Language (OWL) is a high-level language for representing ontologies of information or models. For example, Joe is a human, is married to Jane, and is a male. Sam is a human, is married to Sue, is a male, and is a husband. From these facts, you can deduce that Joe is also a husband. Such deductions are being explored because XML Schema carries little semantic information and requires more human intervention to derive similar facts. With OWL, you can more easily deduce knowledge programmatically, which makes OWL useful for exchanging models and using them in rule-based systems.
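The following minimal sketch illustrates this kind of deduction using hand-coded facts and a single rule. An OWL ontology and a reasoner would derive the same conclusion generically, rather than through application code written for one rule.

# Facts as (subject, predicate, object) triples; the rule "a male who is married
# to someone is a husband" is hand-coded here, whereas an OWL reasoner would
# derive it from the ontology.
facts = {
    ("Joe", "isA", "Human"), ("Joe", "marriedTo", "Jane"), ("Joe", "isA", "Male"),
    ("Sam", "isA", "Human"), ("Sam", "marriedTo", "Sue"),  ("Sam", "isA", "Male"),
    ("Sam", "isA", "Husband"),
}

def infer_husbands(facts):
    inferred = set()
    for subject, predicate, _ in facts:
        if predicate == "marriedTo" and (subject, "isA", "Male") in facts:
            inferred.add((subject, "isA", "Husband"))
    return inferred - facts          # only the newly deduced facts

print(infer_husbands(facts))         # {('Joe', 'isA', 'Husband')}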


Scenario

The following depicts a retail scenario that uses the various standards mentioned previously.


Overview

Figure 3 shows the high-level components in this scenario. The components consist of:

  • Databases that contain historical data (data at rest)
  • Feeds of real-time data (data in motion)
  • Engines that perform the analytics on that data
  • Predictive analytics
  • Business rules
  • User interfaces using dashboards to display results or alerts, while allowing user interactions
Figure 3. Components of the scenario
Diagram of the components of the sample scenario

Figure 4 shows current and future key integration points between the different components (in Figure 3) where the various standards discussed previously interact and provide interoperability benefits. Historical data uses a variety of formats and standards, such as XML, CSV, XLS, PDF, DITA, and XBRL. The analytic engines frequently use UIMA. Predictive analytics and business rules commonly use the PMML and RIF standards, respectively.

Figure 4. Key integration points
Diagram of integration points that use various standards

Scenario details

The next several figures step through the scenario and explain the value that the standards bring. The standards play an important role, especially when you deploy this type of solution into an existing heterogeneous customer environment. This scenario depicts a large retail store solution that is attempting to use historical and real-time data to increase sales, retain existing customers, and attract new ones.

Figure 5 shows the retail chain's historical data in different databases and stored in various data formats. This scenario includes data such as customer transaction data, preferences, purchasing history, demographic information, survey data, customer call center notes and recordings, and so on. In addition, a real-time data feed is provided. This feed might include data such as up-to-the-minute transactions per store or region, live transaction data per customer or group of customers, live customer call center feeds, video surveillance feeds, products en route to various store locations, and so on.

Figure 5. Historical and real-time data
Diagram of historical and real-time data

Each successive figure uses shading to indicate the new portion of the picture that was added. Figure 6 shows Hadoop used for historical data analysis to provide analytics on structured and unstructured data. For example, the analysis of this historical data might reveal information about buying patterns for particular customers, purchasing preferences, attitudes on competing retailers, and more. Note the introduction of the UIMA standard to share the analytical output with other systems to enable interoperability.

Figure 6. Historical data analysis
Diagram of historical data analysis

Figure 7 shows the introduction of a real-time analysis engine. These engines can ingest and process real-time in-motion data that is structured or unstructured. In addition, you can feed results from the historical analysis into the real-time engine to help discover additional insights. For example, consider a historical analysis that shows sales of a particular product are best during the weekend days but sluggish otherwise. Furthermore, the real-time analysis shows that the particular product is low in inventory and that the weekend is approaching. An alert can be raised about this situation in hopes of correcting it.

Figure 7 also shows a two-way connection between the real-time analysis engine and the historical data in the databases. The engine might use historical data to correlate with the real-time data and might also store data periodically. For example, assume that the real-time data contained audio feeds from customer call centers. You would not want to store every minute of every call, but maybe you want to store random calls for quality review later. Calls where the system detects an angry customer could be recorded for later review and analysis.

Figure 7. Real-time data analysis
Diagram of real-time data analysis
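To make the weekend-inventory example concrete, the following is a minimal sketch of how a real-time engine might combine a historical insight (a product that sells best on weekends) with a live inventory event to raise an alert. The SKU, threshold, and event format are illustrative.

from datetime import date

WEEKEND_SELLERS = {"SKU-1417"}       # learned from analysis of historical (at-rest) data
REORDER_THRESHOLD = 50               # units on hand below which an alert is raised

def process_inventory_event(event, today):
    """Handle one event from the real-time feed, e.g. {'sku': 'SKU-1417', 'on_hand': 32}."""
    weekend_is_near = today.isoweekday() >= 4        # Thursday or later
    if (event["sku"] in WEEKEND_SELLERS
            and event["on_hand"] < REORDER_THRESHOLD
            and weekend_is_near):
        return f"ALERT: restock {event['sku']} before the weekend ({event['on_hand']} left)"
    return None

print(process_inventory_event({"sku": "SKU-1417", "on_hand": 32}, date(2011, 6, 23)))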

Figure 8 shows predictive analytics as part of the scenario. Modeling tools can be used to create a predictive model in PMML. This PMML model can be stored in the database and understood by a real-time analysis engine. For example, you might use the predictive PMML model in this case to determine the likelihood that a particular set of facts from the real-time and historical data will lead to a customer switching loyalties and shopping at a competitor. As the real-time analysis engine processes data, it can use this model to score the facts it is uncovering. This scoring allows the engine to derive additional insights about the data it is processing.

Figure 8. Predictive analytics
Diagram of predictive analytics

Figure 9 shows that you can inject new PMML models into the analysis engine in real time. This injection is a powerful concept because you can create new models based on the data currently being collected and deploy them while the system is running.

Figure 9. Real-time PMML model injection
Diagram of real-time PMML model injection

Figure 10 depicts the introduction of business rules into the scenario. As the real-time analysis engine processes incoming and historical data looking for sales trends, it can invoke rules created with a business rules management system to make additional intelligent decisions. For example, a rule might say: "If customer A, B, or C (part of your Gold customers) hasn't had a purchasing transaction in the last N days, and if their survey data indicates that they may move to a competitor, offer them a specific discount."

Figure 10 also shows the RIF standard. RIF is used to represent an executable form of a rule. This form enables different vendors' rule systems to share rules, so customers are not locked into a particular rule vendor.

Like the real-time injection of new predictive PMML models depicted in Figure 9, Figure 10 shows how you can inject new rules in real time as well.

Figure 10. Business rules deployment
Diagram of business rules deployment
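As with the loan-qualification example earlier, the retention rule quoted above can be sketched directly in code. The Gold tier, the inactivity window, and the discount are illustrative; in the scenario this logic would live in the business rules management system and be interchanged as RIF.

from datetime import date, timedelta

def retention_offer(customer, today, inactivity_days=30, discount="10% off next purchase"):
    """customer is a record such as:
       {'tier': 'Gold', 'last_purchase': date(2011, 5, 1), 'may_defect': True}"""
    inactive = (today - customer["last_purchase"]) > timedelta(days=inactivity_days)
    if customer["tier"] == "Gold" and inactive and customer["may_defect"]:
        return discount
    return None

print(retention_offer({"tier": "Gold", "last_purchase": date(2011, 5, 1),
                       "may_defect": True}, today=date(2011, 6, 21)))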

Figure 11 shows how dashboards and visualization features are used. You can create these views by combining the real-time information being processed with historical data stored in traditional or OLAP databases, and surface the result as a real-time alert or as an informational dashboard.

Figure 11. Dashboards and visualization
Diagram of dashboards and visualization

Summary

With the explosion of collected and available data, coupled with the expectation of gaining new and additional insights from that data, the pressure is on to handle, efficiently process, and make sense of data in volumes previously unimaginable. Achieving these goals requires multiple systems and technologies, both legacy and new, working together. This integration calls for standards that enable the interoperability needed to bring data, products, and technologies together efficiently and to deliver the results that businesses and consumers expect.

Resources
