Big data architecture and patterns, Part 5: Apply a solution pattern to your big data problem and choose the products to implement it

Using a scenario-based approach, this article outlines solution patterns that can help define your big data solution. Each scenario starts with a business problem and describes why a big data solution is required. A specific solution pattern (made up of atomic and composite patterns) is applied to the business scenario. This step-by-step approach helps identify the components required for the solution. At the end of the article, some typical products and tools are suggested.

Divakar Mysore (mrdivakar@in.ibm.com), Senior IT Architect, IBM

Divakar Mysore is an IBM-certified senior IT architect with more than 15 years of experience in the IT industry. He has been part of multiple strategic initiatives for global corporations. He has extensive experience as an enterprise architect, application architect, system engineer, data modeler, and test architect. He leads the application architecture discipline for the Enterprise Architecture and Technology team in Global Delivery India. He drives technical vitality initiatives on mobile, front office, social, and big data.



Shrikant Khupat (skhupat1@in.ibm.com), Application Architect, IBM

Shrikant Khupat is an IBM application architect. He is experienced in defining enterprise-class, distributed, disconnected, client-server architectures and designs. He has exposure to a variety of domains, such as insurance and energy and utilities. He has worked on complex solutions involving distributed data processing using Apache Hadoop and unstructured data processing using machine-learning techniques. His current interests include defining big data architecture and patterns.



Shweta Jain (shweta.jain@in.ibm.com), IT Architect, IBM

Shweta Jain is an accredited IT architect with IBM AIS Global Delivery with more than 10 years of industry experience. She specializes in architecting SOA-based integration solutions using industry standards and frameworks. She has experience in the architecture, design, implementation, and testing of integration solutions based on the SOA framework, the SOMA methodology, and a method-based software development life cycle. As an integration architect, she is responsible for architecting the BPM/EAI layer using IBM tools, standards, processes, and methodologies, and for incorporating industry standards for complex integration and transformation projects. She also enjoys reading about and contributing to the latest technologies, such as big data.



17 December 2013

Introduction

Part 4 of this series describes atomic and composite patterns that address the most common and recurring big data problems and their solutions. This article suggests three solution patterns that can be used to architect a big data solution. Each solution pattern uses a composite pattern, which is made up of logical components (covered in Part 3). At the end of this article is a list of products and tools that map to the components of each solution pattern.


Solution patterns

The following sections describe three solution patterns that can be used to architect a big data solution. To illustrate the patterns, we apply them to a particular use case (how to detect healthcare insurance fraud), but the patterns can be used to address many other business scenarios. Each solution pattern takes advantage of a composite pattern. Table 1 lists the solution patterns covered here, along with the composite patterns they are based on.

Table 1. Composite pattern used by each solution pattern
Solution pattern | Composite pattern
Getting started | Store and explore
Gaining advanced business insight | Purposeful and predictive analytics
Take the next-best action | Actionable analysis

Description of the use case: Insurance fraud

Financial fraud poses a serious risk to all segments of the financial sector. In the United States, insurers lose billions of dollars annually. In India, losses in 2011 alone totaled INR 300 billion. Apart from the financial loss, insurers also lose business because of customer dissatisfaction. Although many insurance regulatory bodies have defined frameworks and processes to control fraud, they often react to fraud rather than take proactive steps to prevent it. Traditional approaches, such as circulating lists of black-listed customers, insurance agents, and staff, do not resolve the problem of fraud.

This article proposes solution patterns for a big data solution, based on the logical architecture described in Part 3 of this series and the composite patterns covered in Part 4.

Insurance fraud is an act or omission intended to gain dishonest or unlawful advantage, either for the party committing the fraud or for other related parties. Broad categories of fraud include:

  • Policyholder fraud and claims fraud — Fraud against the insurer in the purchase and execution of an insurance product, including fraud at the time of making an insurance claim.
  • Intermediary fraud — Fraud perpetrated by an insurance agent, corporate agent, intermediary, or third-party agent against the insurer or the policy holders.
  • Internal fraud — Fraud against the insurer by its director, manager, or any other officer or staff member.

Current fraud-detection process

The insurance regulatory boards have established anti-fraud policies, which include well-defined processes for monitoring fraud, for searching for potential fraud indicators (and publishing a list), and for coordinating with law enforcement agencies. The insurers have staff dedicated to analyzing fraudulent claims.

Issues with the current fraud-detection process

The insurance regulators have well-defined fraud-detection and mitigation processes. Traditional solutions use models based on historical fraud data, black-listed customers and insurance agents, and regional data about fraud peculiar to a certain area. The data available for detecting fraud is limited to the given insurer's IT systems and a few external sources.

Current fraud-detection processes are mostly manual and work on limited data sets. Insurers may not be able to investigate all the indicators. Fraud is often detected very late, and it is difficult for the insurer to adequately follow up on each fraud case.

Current fraud detection relies on what is known about existing fraud cases, so every time a new type of fraud occurs, insurance companies bear the consequences the first time it happens. Most traditional methods work within a particular data source and cannot accommodate the ever-growing variety of data from different sources. A big data solution can help address these challenges and play an important role in fraud detection for insurance companies.


Solution pattern: Getting started

This solution pattern is based on the store-and-explore composite pattern. It focuses on acquiring and storing the relevant data from various sources inside or outside the enterprise. The data sources shown in Figure 1 are examples only; domain experts can identify the appropriate data sources.

Because a large volume of varied data from many sources must be collected, stored, and processed, this business challenge is a good candidate for a big data solution.

The following diagram shows the solution pattern, mapped onto the logical architecture described in Part 3.

Figure 1. Solution pattern for getting started
[Image: logical-layers diagram showing the solution pattern for getting started]

Figure 1 uses the following components:

  • External data sources
  • Structured data storage
  • Transformed, structured data
  • Entity resolution
  • Big data explorer components

The data required for healthcare fraud detection can be acquired from various sources and systems such as banks, medical institutions, social media, and Internet agencies. It includes unstructured data from sources such as blogs, social media, news agencies, reports from various agencies, and X-ray reports. See the data sources layer in Figure 1 for more examples. With big data analytics, the information from these varied sources can be correlated and combined, and — with the help of defined rules — analyzed to determine the possibility of fraud.

In this pattern, the required external data is acquired from data providers who contribute preprocessed, unstructured data converted to structured or semi-structured format. This data is stored in the big data stores after initial preprocessing. The next step is to identify possible entities and generate ad-hoc reports from the data.

Entity identification is the task of recognizing named elements in the data. All entities required for analysis must be identified, including loose entities that do not have relationships to other entities. Entity identification is mostly performed by data scientists and business analysts. Entity resolution can be as simple as identifying individual entities or as complex as resolving entities based on data relationships and contexts. This pattern uses the simple form of the entity-resolution component.
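
To make the idea concrete, Listing 1 is a minimal Python sketch of simple-form entity identification: it pulls candidate entities out of free-text claim notes with regular expressions. The field names and ID formats (POL-..., PRV-..., phone numbers) are hypothetical assumptions, not part of any product; a real solution would typically use dedicated text-analytics tooling for this step.

Listing 1. Simple-form entity identification (illustrative sketch)

# Minimal sketch of simple-form entity identification (illustrative only).
# The field names and ID formats below are hypothetical assumptions.
import re

# Hypothetical patterns for the entities we want to recognize in claim notes.
ENTITY_PATTERNS = {
    "policy_id":   re.compile(r"\bPOL-\d{8}\b"),
    "provider_id": re.compile(r"\bPRV-\d{6}\b"),
    "phone":       re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def identify_entities(note_text):
    """Return a dict of entity type -> list of values found in the text."""
    return {name: pattern.findall(note_text)
            for name, pattern in ENTITY_PATTERNS.items()}

if __name__ == "__main__":
    note = ("Claim filed against POL-12345678; treatment billed by PRV-000042. "
            "Contact 555-867-5309 for follow-up.")
    print(identify_entities(note))
    # {'policy_id': ['POL-12345678'], 'provider_id': ['PRV-000042'], 'phone': ['555-867-5309']}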

Structured data can simply be converted into the format most appropriate for analysis and stored directly in structured big data storage.

Ad-hoc queries can be performed on this data to get information such as the following (a minimal sketch follows the list):

  • Overall fraud risk profile for a given customer, region, insurance product, agent, or approving staff in the given period
  • Inspection of past claims by certain agents or approvers or by the customer across insurers
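
Listing 2 is a minimal sketch of the kind of aggregation behind the first query: a fraud-risk profile per agent. The in-memory claim records and their fields (agent, amount, flagged) are hypothetical; in practice such ad-hoc queries would run against the big data store with a query tool such as Hive or Big SQL rather than in plain Python.

Listing 2. Ad-hoc fraud-risk profile by agent (illustrative sketch)

# Minimal sketch of an ad-hoc fraud-risk aggregation (illustrative only).
# Claim records and their fields (agent, amount, flagged) are hypothetical.
from collections import defaultdict

claims = [
    {"agent": "A-01", "amount": 12000, "flagged": True},
    {"agent": "A-01", "amount": 3000,  "flagged": False},
    {"agent": "A-02", "amount": 45000, "flagged": True},
    {"agent": "A-02", "amount": 52000, "flagged": True},
]

def risk_profile_by_agent(records):
    """Return agent -> (flagged claim count, total claims, flagged ratio)."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["agent"]] += 1
        flagged[rec["agent"]] += int(rec["flagged"])
    return {agent: (flagged[agent], totals[agent], flagged[agent] / totals[agent])
            for agent in totals}

if __name__ == "__main__":
    for agent, (n_flagged, n_total, ratio) in risk_profile_by_agent(claims).items():
        print(agent, n_flagged, n_total, round(ratio, 2))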

Typically, organizations get started with big data by adopting this pattern, as the name implies. Organizations employ an exploratory approach to assess what kind of insight can be generated from the data available. At this stage, organizations generally do not invest in advanced analytics techniques such as machine learning, feature extraction, and text analytics.


Solution pattern: Gaining advanced business insight

This pattern is more advanced than the getting-started pattern. It predicts fraud at three stages of claim processing:

  1. The claim has already been settled.
  2. The claim is being processed.
  3. A claims request has just been received.

For cases 1 and 2, the claims can be processed in batch, and the fraud-detection process can be initiated as part of the regular reporting process or as requested by the business. Case 3 can be processed in near-real time: the claims request interceptor intercepts the claim request and initiates the fraud-detection process; if the indicators report it as a possible fraud case, the system notifies the stakeholders identified in the system. The earlier the fraud is detected, the lower the risk or loss.
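
Listing 3 sketches the near-real-time path for case 3. The claim structure, the scoring stub, the threshold, and the notification target are all assumptions made for illustration; in a real system the interceptor would hand the claim to the fraud-detection engine and to a proper alerting component.

Listing 3. Claims request interceptor (illustrative sketch)

# Minimal sketch of a claims request interceptor (illustrative only).
# score_claim() stands in for the fraud-detection engine; the threshold
# and stakeholder list are hypothetical.

FRAUD_THRESHOLD = 0.7
STAKEHOLDERS = ["claims-investigation@example.com"]

def score_claim(claim):
    """Placeholder for the fraud-detection engine; returns a score in [0, 1]."""
    indicators = claim.get("indicators", [])
    return min(1.0, 0.2 * len(indicators))

def notify(recipients, message):
    """Placeholder notification; a real system would send email, SMS, or a task."""
    for recipient in recipients:
        print(f"NOTIFY {recipient}: {message}")

def intercept_claim(claim):
    """Score an incoming claim and alert stakeholders if it looks fraudulent."""
    score = score_claim(claim)
    if score >= FRAUD_THRESHOLD:
        notify(STAKEHOLDERS, f"Claim {claim['id']} flagged (score={score:.2f})")
        return "HOLD_FOR_REVIEW"
    return "CONTINUE_PROCESSING"

if __name__ == "__main__":
    claim = {"id": "CLM-001", "indicators": ["early_claim", "doc_authenticity",
                                             "aggressive_settlement", "loan_default"]}
    print(intercept_claim(claim))   # HOLD_FOR_REVIEW (score 0.80)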

Figure 2. Solution pattern for gaining advanced business insight
[Image: logical-layers diagram showing the solution pattern for gaining advanced business insight]

Figure 2 uses:

  • Unstructured data storage
  • Structured data storage
  • Transformed structured data
  • Preprocessed, unstructured data
  • Entity resolution
  • Fraud-detection engine
  • Business rules
  • Big data explorer
  • Alerts and notifications to users
  • Claims request interceptor

In this pattern, organizations can choose to preprocess unstructured data before analyzing it.

The data is acquired and stored, as-is, in unstructured data storage. It is then preprocessed into a format that can be consumed by the analysis layer. At times, the preprocessing can be complex and time-consuming. Machine-learning techniques can be used for text analytics, and the Hadoop Image Processing Framework can be useful for processing images. The preprocessed output is most commonly represented in JSON format. The preprocessed data is then stored in structured data storage, such as HBase.
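
Listing 4 is a minimal sketch of this preprocessing step: it converts raw, pipe-delimited claim notes (a hypothetical layout) into JSON documents that could then be loaded into a structured store such as HBase. Real pipelines would typically run such transformations as MapReduce or text-analytics jobs over much larger volumes.

Listing 4. Preprocessing unstructured notes into JSON records (illustrative sketch)

# Minimal sketch of preprocessing unstructured notes into JSON records
# (illustrative only; the note layout and fields are hypothetical).
import json
import re

raw_notes = [
    "2013-11-02 | POL-12345678 | Hospital bill of 45000 submitted 3 days after inception",
    "2013-11-05 | POL-87654321 | Routine claim, all documents verified",
]

def preprocess(note):
    """Turn one pipe-delimited note into a structured record."""
    date, policy_id, text = [part.strip() for part in note.split("|", 2)]
    amounts = [int(a) for a in re.findall(r"\b\d{4,}\b", text)]
    return {"date": date, "policy_id": policy_id,
            "claim_amount": max(amounts) if amounts else None,
            "raw_text": text}

if __name__ == "__main__":
    for note in raw_notes:
        print(json.dumps(preprocess(note)))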

The core component in this pattern is the fraud-detection engine, which is composed of the advanced analytics capabilities that help predict fraud. The engine relies on well-defined and frequently updated fraud indicators, and technology can be used to implement systems that check for them. Common fraud indicators include:

  • Claims are made shortly after the policy inception.
  • Serious underwriting lapses occur while processing a claim.
  • The insured person is overtly aggressive in pursuit of a quick settlement.
  • Insured parties are willing to accept a small settlement rather than document all losses.
  • The authenticity of documents is doubtful.
  • The insured person is behind in loan payments.
  • The injury incurred is not visible.
  • A high-value claim has no known casualty.
  • Relationships exist between clusters of individuals, including policy holders, medical institutions, associates, suppliers, and partners.
  • Links exist between licensed and non-licensed healthcare providers.

Traditional methods alone are not adequate to predict fraud. Social-network analytics are required to detect links between licensed and non-licensed healthcare providers and to detect relationships between policy holders, medical institutions, associates, suppliers, and partners. Validating the authenticity of documents and finding the credit score of individuals are difficult tasks to accomplish with traditional approaches.
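
Listing 5 hints at the social-network side of the analysis: it links policy holders and providers that share hypothetical contact attributes (a phone number or an address) and reports the resulting clusters. Production solutions would rely on dedicated graph or entity-analytics tooling, but the grouping logic is similar in spirit.

Listing 5. Detecting links between related parties (illustrative sketch)

# Minimal sketch of link detection between parties that share contact details
# (illustrative only; entities and shared attributes are hypothetical).
from collections import defaultdict

records = [
    {"entity": "policyholder:P1", "phone": "555-0001", "address": "12 Elm St"},
    {"entity": "provider:PRV-7",  "phone": "555-0001", "address": "90 Oak Ave"},
    {"entity": "policyholder:P2", "phone": "555-0002", "address": "90 Oak Ave"},
    {"entity": "policyholder:P3", "phone": "555-0003", "address": "7 Pine Rd"},
]

def clusters_by_shared_attributes(rows, keys=("phone", "address")):
    """Group entities that share any of the given attribute values."""
    graph = defaultdict(set)
    by_value = defaultdict(list)
    for row in rows:
        for key in keys:
            by_value[(key, row[key])].append(row["entity"])
    for entities in by_value.values():
        for a in entities:
            graph[a].update(e for e in entities if e != a)
    # Simple traversal to collect connected components of the link graph.
    seen, clusters = set(), []
    for row in rows:
        start = row["entity"]
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

if __name__ == "__main__":
    for cluster in clusters_by_shared_attributes(records):
        print(cluster)
    # ['policyholder:P1', 'policyholder:P2', 'provider:PRV-7'] then ['policyholder:P3']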

During analysis, the search for all of these indicators can occur simultaneously on a huge volume of data. Every indicator is weighted. The total weight across all indicators indicates the accuracy and severity of the predicted fraud.
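
Listing 6 is a minimal sketch of the weighting idea. The indicator names, weights, and severity bands are invented for illustration; in a real engine, the weights would be tuned from historical fraud data and the indicators would be evaluated by the analytics components described above.

Listing 6. Weighted fraud indicators and severity (illustrative sketch)

# Minimal sketch of weighted fraud indicators (illustrative only;
# indicator names, weights, and severity bands are hypothetical).

INDICATOR_WEIGHTS = {
    "claim_shortly_after_inception": 0.25,
    "underwriting_lapse":            0.15,
    "aggressive_settlement_pursuit": 0.10,
    "doubtful_documents":            0.30,
    "behind_on_loan_payments":       0.10,
    "provider_link_suspicious":      0.35,
}

def fraud_score(triggered_indicators):
    """Sum the weights of the indicators that fired for a claim."""
    return sum(INDICATOR_WEIGHTS.get(name, 0.0) for name in triggered_indicators)

def severity(score):
    """Map a raw score to a coarse severity band."""
    if score >= 0.6:
        return "high"
    if score >= 0.3:
        return "medium"
    return "low"

if __name__ == "__main__":
    fired = ["claim_shortly_after_inception", "doubtful_documents"]
    score = fraud_score(fired)
    print(round(score, 2), severity(score))   # 0.55 medium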

When the analysis is complete, alerts and notifications can be sent to relevant stakeholders, and reports can be generated to show the outcome of analysis.

This pattern is suitable for enterprises that need to perform advanced analytics using big data. It involves performing complex preprocessing so that the data can be stored in a form that can be analyzed using advanced techniques, such as feature extraction, entity resolution, text analytics, machine learning, and predictive analytics. This pattern does not involve taking any action or suggesting recommendations on the output of analysis.


Solution pattern: Take the next-best action

The fraud predictions made in the gaining-advanced-business-insight solution pattern normally lead to certain actions, such as rejecting the claim, putting it on hold until additional clarification and information is received, or reporting it for legal action. In this pattern, an action is defined for each outcome of the prediction. This mapping of prediction outcomes to actions is referred to as an action-decision matrix (a minimal sketch appears after the list of action types below).

Figure 3. Solution pattern for the next-best action
[Image: logical-layers diagram showing the solution pattern for taking the next-best action]

Figure 3 uses:

  • Unstructured data storage
  • Structured data storage
  • Transformed structured data
  • Preprocessed unstructured data
  • Entity resolution
  • Fraud-detection engine
  • Business rules
  • Decision matrices
  • Data exploration tools
  • Alerts and notifications to users
  • Claims request interceptor
  • Alerts and notifications to other systems and business process components

Typically, three kinds of actions can be taken:

  • A notification can be sent to stakeholders to take the necessary action — for example, to notify the user to take legal action against the claimant.
  • The system notifies the user and waits for the user's feedback before taking further action. The system can wait for the user to respond to a task or it can stop or put on hold a claim-processing transaction.
  • For scenarios that do not need manual intervention, the system can take an automated action. For example, the system can send a trigger to a process to stop the claims process and inform the legal department about the claimant, agent, and approver.
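
Listing 7 puts these pieces together: a hypothetical action-decision matrix maps the predicted severity to one of the three action types, and a small dispatcher carries out the corresponding placeholder action. The severity bands, action names, and notification messages are assumptions for illustration only.

Listing 7. Action-decision matrix and dispatcher (illustrative sketch)

# Minimal sketch of an action-decision matrix and dispatcher (illustrative only;
# severity bands, actions, and recipients are hypothetical).

# Action-decision matrix: predicted outcome -> action type.
ACTION_MATRIX = {
    "high":   "automated_stop",      # stop the claim process, inform legal
    "medium": "notify_and_wait",     # ask an investigator to confirm
    "low":    "notify_only",         # informational alert to stakeholders
}

def notify(message):
    print("NOTIFY:", message)        # placeholder for email/SMS/task creation

def take_action(claim_id, severity):
    """Look up and execute the next-best action for a scored claim."""
    action = ACTION_MATRIX.get(severity, "notify_only")
    if action == "automated_stop":
        notify(f"Claim {claim_id}: processing stopped, legal department informed")
        return "STOPPED"
    if action == "notify_and_wait":
        notify(f"Claim {claim_id}: on hold, awaiting investigator feedback")
        return "ON_HOLD"
    notify(f"Claim {claim_id}: possible fraud indicators logged for review")
    return "CONTINUE"

if __name__ == "__main__":
    print(take_action("CLM-001", "medium"))   # ON_HOLD
    print(take_action("CLM-002", "high"))     # STOPPED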

This pattern is suitable for enterprises that need to perform advanced analytics on big data and act on the results. It uses advanced capabilities to detect fraud, to notify and alert relevant stakeholders, and to initiate automatic workflows that take action based on the outcome of processing.


Products and technologies that form the backbone of a big data solution

The following diagram shows how big data software maps to the various components of the logical architecture described in Part 3. These are not the only products, technologies, or solutions that can be used in a big data solution; your own requirements and environment must shape the tools you choose to deploy.

Figure 4 shows big data appliances, such as IBM PureData™ System for Hadoop and IBM PureData System for Analytics, cutting across layers. These appliances provide features such as built-in visualization, built-in analytic accelerators, and a single system console, so using an appliance can simplify deployment and administration. (See Resources for more information about the IBM PureData System for Hadoop.)

Figure 4. Products and technologies mapped to the logical layers
[Image: logical-layers diagram showing products and technologies mapped to the layers]

Benefits of using big data analytics in fraud detection

Using big data analytics for detecting fraud has various benefits over traditional approaches. Insurance companies can build systems that include all relevant data sources. An all-encompassing system helps detect uncommon cases of fraud. Techniques such as predictive modeling thoroughly analyze instances of fraud, filter obvious cases, and refer low-incidence fraud cases for further analysis.

A big data solution can also help build a global perspective of the anti-fraud efforts throughout the enterprise. Such a perspective often leads to better fraud detection by linking associated information within the organization. Fraud can occur at a number of source points: claims processing, insurance surrender, premium payment, application for a new policy, or employee-related or third-party fraud. Combined data from various sources enables better predictions.

Analytics technologies enable an organization to extract important information from unstructured data. Although volumes of structured information are stored in data warehouses, most of the crucial information about fraud is in unstructured data, such as third-party reports, which are rarely analyzed. In most insurance agencies, social media data is not appropriately stored or analyzed.


Conclusion

Using business scenarios based on the use case of identifying fraud in the insurance industry, this article describes solution patterns that vary in complexity. The simplest pattern addresses storing data from various sources and doing some initial exploration. The most complex covers how to gain insight from the data and take action based on the analysis.

Each business scenario is mapped to the appropriate atomic and composite patterns that make up the solution pattern. Architects and designers can apply these solution patterns to define the high-level solution and functional components of the appropriate big data solution.

