Explore the advanced analytics platform, Part 7

The customer profile hub


This is the seventh tutorial in the Advanced Analytics Platform (AAP) tutorial series. Previous tutorials provided an understanding of use cases, algorithms, discovery patterns, and data or logic flows. These tutorials provide you with an understanding of where to use AAP and how it is assembled and integrated into end-to-end application architecture. The subject of the fifth tutorial, information architecture, covered aspects of data modeling and ontology for data discovery. The sixth tutorial discussed the ability to analyze large volumes of data at high velocity that results in real-time actionable steps. This tutorial illustrates the need to have an integrated customer profile pattern.

The customer profile pattern helps you to gain a deep understanding of the customer. Traditional master data management (MDM) either federated or consolidated customer profile data. The enriched profile was then combined with various transactional and demographic data sources from business unit silos. With big data, two new types of data sources are emerging: conversations and usage. These data sources add more dimensions to customer views and provide richer insight into customer behavior, preferences, and usage patterns. The resulting customer profile, built from a much larger data set, generates thousands of micro-segments that can be used across various use cases in sales and marketing, customer care, revenue and risk management, and resource management. With a 360-degree view of the customer, an organization can respond quickly to changing needs in areas such as fraud, customer care, and marketing.

A robust customer profile has several important characteristics:

  • The ability to integrate data from various different sources and types.
  • Models that can support multiple dimensions of data. These models need to be agile to accommodate changing customer needs. Entity resolution between the different sources is also essential.
  • Different types of data representations, such as data in motion or at rest, that are associated with the customer profile. The customer profile needs to be integrated effectively with all components of the architecture. For example, the predictive models require customer profiles, and the outputs are integrated with the campaign module.
  • Effective security and governance for the customer profile. Security breaches of customer profile data can have serious repercussions for the company. Ineffective governance can result in data that is inaccurate, stale, and untrustworthy. Processes need to be in place to ensure that security and governance are carefully managed by the company.

In short, a customer profile pattern is foundational for effective analytics, and numerous business use cases can be built by using the customer profile. This tutorial discusses the customer profile and its characteristics as well as how to implement them effectively in the Advanced Analytics Platform. Governance, data quality and veracity, identity resolution and metadata management, and security and privacy issues are discussed in Part 8.

Data sources

Most traditional analytics systems carry transactional data about their customers. For a typical communications service provider (CSP), the billing system is the best source of transactional customer data because it provides the customer's sales and billing transaction history. Customer relationship management (CRM) systems allow the CSPs to focus on demographics data. CSPs can enhance their existing CRM data by using census (or other third-party sources) to collect household income, neighborhood, or family-related information. With a number of transactional sources and their change logs, the data warehouses and data lakes are now the accumulation points for transactions and logs from hundreds of systems in a typical analytical store. This data is often correlated to master IDs and reorganized for analytical processing.

Big data added two new sources of data, which were previously hard to collect or analyze. The first is conversation data. The conversation between a customer and a CSP includes all contact points, including machines (IVR, website, and so on). Conversations between the customer and CSP employees (call centers, chat service, field repair, and so on) are also included in conversation data. Whether the conversation is with a machine or an employee, it can be captured in detail and analyzed for insights, including:

  • Collecting and analyzing click stream data from customers as they visit CSP websites.
  • Detailed information about the pages visited, specific objects clicked, and session successes and incomplete exits.
  • Chat session logs that are collected and analyzed.
  • Recorded customer conversations with call centers. New advancements have made it possible to transcribe the sessions and provide other metadata: tones, sentiments, and so on, regarding the conversation. This session data is a significant advancement over short text that is typed by the agents or trouble tickets available in the transaction data.

In addition to their conversations with CSPs, people discuss the CSP products with each other. Social media, third-party, and regulator sites (such as public utility commissions) are more sources of data you can use to understand customer sentiment and feedback. Unlike the internal data sources, these external sources might vary considerably in veracity, leading to concerns about quality and bias in the source information. In addition, data lakes often ingest the data without requiring strong governance, and apply governance on read. Big data governance is the topic of Part 8 in this tutorial series.

Figure 1 shows an example of data analysis that uses Twitter data. The text analytics engine used 200-250 tweets from one subscriber to generate a personality profile. Assuming that the subscriber agreed to provide the Twitter handle and link it to the mobile number, or that the CSP derives the information from additional sources, the transactional and demographic data can be augmented with a rich set of personality profile information. Marketers use the personality profile information to cluster subscribers into similar personality groups when offering video products.

Figure 1. Personality profile of a user using Twitter data
Circular image of individual personality profile

Usage information is available in large volumes from network probe logs and other sources. In the past, usage information was discarded because it was hard to collect and analyze. Big data analytics tools have made it easier to collect the data, correlate it to subscribers, and place it in large data lakes. The usage data can then be organized and correlated with conversation, demographics, or transaction data to get meaningful insights about the customers.

The most common usage information is the transaction detail records (TDR) for data communication and call detail records (CDR) for voice records. In this tutorial, xDR refers to any usage records. These records carry control and user plane information about the transactions or calls and provide detailed data about quality, latency, frequency, sources, targets, and so on. Because these records are associated with geo-specific locations for devices and network elements, you can also use them to estimate the approximate location of an individual. The location and usage data that is derived from xDRs has found many users across marketing, sales, care, network, security, and finance organizations in a typical CSP. Location information for marketing is the topic of the fourth tutorial. The sixth tutorial explores usage and location data use cases in counter-fraud management and customer care.

Figure 2 shows an example of correlated usage and mobility data. This data was accumulated by using the network probes from IBM's The Now Factory (TNF). The data was correlated across many transactions from the focused subscribers to deduce their mobility and usage information. The streaming engine correlated the data and organized it for processing. IBM SPSS® used the data lake in Hadoop to further analyze the data and establish mobility profiles (busy and active versus staying at home) based on location changes, streaming quality, usage lifestyle (derived from the type and frequency of data movements), URL analysis, and personality. The analytics engine also established social behaviors, such as people who are traveling together, by analyzing mobility patterns across the subscriber community.

Figure 2. Mobility and usage information using TNF probe
User movement where user hangs out and who with

Usage data is fairly rich in content and poses significant design challenges to analytics systems because:

  • Usage data is typically high velocity. A typical data or voice communication goes through one or more devices and a number of network elements. A single visit to a social media site might generate hundreds of TDRs. For a CSP with 50 million subscribers, the network probe data can easily exceed 2,000 Gbps (gigabits per second). Collecting, correlating, and organizing such high-velocity data is difficult and requires various techniques for correlating, aggregating, de-duplicating, and filtering the data based on use cases. This dynamic organization of data requires significant maturity in data modeling and matching of data-in-motion.
  • The data might include low-quality transactions. While it is important to collect and analyze clicks on a set-top box, a television left on the entire day with the household dog as the sole viewer does not constitute marathon viewing by the household. In this instance, it might be possible to infer lack of presence from the location of the cell phones. The data must be processed for data quality before its use.

While the detailed data might be useful to certain use cases, there are many more users for aggregated data. It is important to organize the data around levels of aggregation and distribute it at the appropriate granularity to diverse users across the organization. Otherwise, the data storage and transport can overwhelm the organization. While you can generate detailed latitude and longitude information every 30 seconds for 50 million subscribers, marketing might only care about a space and time box in an 88-byte geohash over a 5-minute interval, which would offer a significant reduction in storage and transport. This was a topic in the fourth tutorial.
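The aggregation idea in the previous paragraph can be illustrated with a short Python sketch. It encodes each location fix as a standard base-32 geohash and groups fixes into 5-minute space-time boxes, so ten 30-second fixes from a stationary subscriber collapse into one row. The geohash precision of 6 characters, the tuple layout of a fix, and the function names are assumptions made for illustration; they are not taken from the platform.

```python
from collections import defaultdict

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair as a standard base-32 geohash."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def space_time_boxes(fixes, precision=6, interval_s=300):
    """Collapse (subscriber, epoch_seconds, lat, lon) fixes into box counts."""
    boxes = defaultdict(int)
    for subscriber, ts, lat, lon in fixes:
        bucket_start = ts - ts % interval_s  # start of the 5-minute window
        key = (subscriber, geohash_encode(lat, lon, precision), bucket_start)
        boxes[key] += 1
    return dict(boxes)


# Ten hypothetical fixes, 30 seconds apart, from one stationary subscriber.
fixes = [("sub-1", 1200 + 30 * i, 57.64911, 10.40744) for i in range(10)]
boxes = space_time_boxes(fixes)
print(boxes)
```

A coarser precision or a longer interval reduces volume further, at the cost of spatial or temporal detail, so the right setting depends on the consuming use case.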

Profile hub and data sets

Organize this data from transactional, demographic, conversation, and usage-related sources into a profile hub. In our work, we identified four major dimensions:

  • Description
  • Interaction
  • Behavior
  • Attitudes

Table 1 shows examples of data sets associated with each data dimension.

Table 1. Sample data dimensions and data set
Data dimension | Example data sets
Description | Age, income, gender, education, family size, home location, occupation, ethnicity, nationality, religion, federated ID, CRM ID, billing telephone number, Government ID, family unit, organizational hierarchy, social media IDs, email address
Interaction | Channel interactions, third-party interactions, contact preferences, alerts, billing history, payment history, subscriptions, privacy preferences, NPS survey
Behavior | Locations, usage, device failure patterns, fraudulent patterns, socio-temporal patterns, usage lifestyle, personality, fraud alert, care alert, lifestyles, activities, media viewership patterns, customer lifetime value, favorites, social network, social media discussions, work location
Attitudes | Sentiments/opinions, brand loyalty, usage experience, social leadership, attitudes, derived net promoter score

Description: With primary or secondary data collection, or third-party data, a CSP can identify demographic information that is associated with their customers. For residential buyers, this information might represent gender, age, income group, residential communities, family size, and so on. For commercial buyers, this information might include industry codes, size of the organization, geographic location/distribution, and so on. This data can be derived or verified with usage or conversation information. By correlating conversation, usage, and demographics to IDs, you can start to evaluate the veracity of externally sourced data. Descriptive data includes indexes to other sources, such as CRM, billing, ERP, and so on. In addition, CSPs often standardize around enterprise-wide IDs representing common views across department or market unit silos. These IDs might not be at the same level of customer hierarchy. While a billing ID can represent a billing account, a CRM customer ID might represent a collection of billing accounts grouped under a customer contract. A household ID can represent a family while device IDs can represent individual tablets and smartphones used by members of the household.

Interaction: This dimension represents many data sets associated with the channel interactions. It includes both pre- and post-purchase channel interactions. Marketers increasingly use information about web click streams to study shopping behavior. A fair amount of data about the customer can be inferred by studying their channel interaction, order frequency, and billing history.

Behavior: After the service is purchased, the usage information and related understanding of its context can provide behavioral data. Some of the standard communications behaviors are statistical summaries of raw usage data. Examples of these communications behaviors are heavy data user, heavy voice user, frequent roamer, and so on. However, increasingly data scientists use usage and location data to develop new derived behavioral data sets, such as mobility patterns (work at home versus frequent traveler), media habits (sports fan versus daytime show viewer), or shopping behavior (discount shopper, quality shopper). Behaviors can also identify unauthorized use, such as fraudsters, or broken services, such as care alerts.

Attitudes: Attitudes represent customers' preferences, feelings, and sentiments toward products, service providers, or channels of communication. Savvy marketers personalize their communications with customers around customer attitudes and get far better conversion rates than with traditional methods. Consumer attitudes can be explicitly expressed to the service provider in their communications (such as privacy preferences), expressed to third parties (such as social conversations or complaints to regulators), or implicitly stated (such as leader/follower relationships in voice conversations).
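As a rough illustration of how the four dimensions might sit together in one profile hub entry, the following Python sketch models a profile as a small data class. The class name, field names, and sample values are hypothetical; an actual hub would more likely use a sparse or non-relational store.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerProfile:
    """One profile hub entry, organized by the four dimensions in Table 1."""
    master_id: str
    # Demographics plus indexes into other systems (CRM, billing, household).
    description: dict = field(default_factory=dict)
    # Channel interactions, contact and privacy preferences, billing history.
    interaction: dict = field(default_factory=dict)
    # Derived behavioral labels such as "heavy data user".
    behavior: set = field(default_factory=set)
    # Sentiments, loyalty, and derived scores.
    attitudes: dict = field(default_factory=dict)


# Hypothetical sample entry; IDs at different levels of the customer hierarchy
# live side by side in the description dimension.
profile = CustomerProfile(master_id="M-001")
profile.description.update(
    {"crm_id": "CRM-42", "billing_ids": ["B-1", "B-2"], "household_id": "H-7"}
)
profile.interaction["contact_preference"] = "email"
profile.behavior.update({"heavy data user", "frequent roamer"})
profile.attitudes["derived_nps"] = 8
print(profile)
```

Keeping behavior as a set of labels makes it cheap to add a newly discovered segment without changing the structure of the record.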

Representational diversity

Two powerful analytics processes, unification and behavioral exploration, create and use a profile hub. Unification is the process of creating a common view of the data from many diverse inputs. For example, the xDR data for usage analytics might get generated by a number of sources using their Network Equipment Provider (NEP) proprietary formats. Using unification, all of these NEP-specific formats can be combined into a common data representation. Customer MDM products routinely take the customer data from many CRM, billing, and ERP systems and unify this data into a customer master.
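A minimal Python sketch of unification follows. It assumes two made-up vendor record layouts and maps both into one common usage record; the vendor names, field names, and units are invented for illustration and do not correspond to any real NEP format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    """Common representation shared by all downstream analytics."""
    subscriber_id: str
    start_epoch_s: int
    total_bytes: int
    source: str


def from_vendor_a(raw: dict) -> UsageRecord:
    # Hypothetical format A: seconds since epoch, separate up/down byte counts.
    return UsageRecord(
        subscriber_id=raw["msisdn"],
        start_epoch_s=int(raw["start"]),
        total_bytes=int(raw["bytes_up"]) + int(raw["bytes_dn"]),
        source="vendor_a",
    )


def from_vendor_b(raw: dict) -> UsageRecord:
    # Hypothetical format B: milliseconds since epoch, volume in kilobytes.
    return UsageRecord(
        subscriber_id=raw["subscriber"],
        start_epoch_s=int(raw["ts_ms"]) // 1000,
        total_bytes=int(raw["volume_kb"]) * 1024,
        source="vendor_b",
    )


ADAPTERS = {"vendor_a": from_vendor_a, "vendor_b": from_vendor_b}


def unify(feed):
    """Convert (source_name, raw_record) pairs into common usage records."""
    return [ADAPTERS[source](raw) for source, raw in feed]


feed = [
    ("vendor_a", {"msisdn": "6590000001", "start": "1700000000",
                  "bytes_up": "500", "bytes_dn": "1000"}),
    ("vendor_b", {"subscriber": "6590000002", "ts_ms": "1700000005123",
                  "volume_kb": "2"}),
]
unified = unify(feed)
print(unified)
```

Adding a new source then means adding one adapter function and one entry in the adapter table, while consumers of the common record stay unchanged.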

Behavioral exploration is the process of taking unified or source data and creating many derived attributes, which exemplify customer behaviors. For example, the fourth tutorial in this series showed how device location information can be inferred by using the xDR data. Advanced analytics tools then use a number of statistical techniques to create behavioral clusters and segments, such as "work at home," "globe trotter," and so on, using the location data. Data scientists use web browsing data to assign labels, such as "soccer fan" or "political news reader," to the associated group of subscribers. Most of these data sets are represented as columns in sparsely populated databases, or stored in more efficient ways by using non-relational structures. A profile hub stores the result of unification and behavioral exploration. The profile hub unifies the source data into a single set of descriptive information. In doing so, it also stores a large number of labels that express behavioral or attitudinal derived data sets.
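The following Python sketch shows the general shape of deriving behavioral labels from location observations. It substitutes simple threshold rules for the statistical clustering that the text describes, so it is a stand-in rather than the platform's method. The observation layout, the working-hours window, and both thresholds are assumptions.

```python
def mobility_labels(observations, home_cell, min_home_share=0.8, min_countries=3):
    """Derive mobility labels from (hour_of_day, cell_id, country_code) tuples.

    Rule-based stand-in for clustering; thresholds are illustrative only.
    """
    labels = set()

    # Share of working-hour observations that were seen at the home cell.
    work_hour_cells = [cell for hour, cell, _ in observations if 9 <= hour < 17]
    if work_hour_cells:
        home_share = sum(1 for c in work_hour_cells if c == home_cell) / len(work_hour_cells)
        if home_share >= min_home_share:
            labels.add("work at home")

    # Number of distinct countries in which the device was observed.
    countries = {country for _, _, country in observations}
    if len(countries) >= min_countries:
        labels.add("globe trotter")

    return labels


# Hypothetical observations for two subscribers.
homebody = [(10, "cell_home", "SG"), (11, "cell_home", "SG"),
            (14, "cell_home", "SG"), (20, "cell_mall", "SG")]
traveler = [(10, "cell_a", "SG"), (12, "cell_b", "MY"),
            (15, "cell_c", "TH"), (22, "cell_d", "JP")]

print(mobility_labels(homebody, home_cell="cell_home"))
print(mobility_labels(traveler, home_cell="cell_a"))
```

Each returned label would be stored as one of the derived behavioral data sets in the profile hub.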

The profile hub is shared across the end-to-end architecture. For a typical targeted marketing solution, you might need to run these steps:

  • Use SPSS as the discovery engine for specifying targets.
  • Use Streams for real-time detection or scoring.
  • Use IBM Cognos® for reporting.
  • Communicate the results of the analysis through an API for use by campaign or customer interaction systems.

The underlying data representation across these tools is not the same, especially if the tools belong to different software companies and are not preintegrated.

These profiles are dynamic. The CSP's marketing, customer care, and risk management departments want to add new data sets in flexible ways so they can introduce new ways of dealing with subscribers. At the same time, they want to minimize their dependency on the IT application developers to reduce development time and effort. These are conflicting requirements that can be best met with an approach that divides the changes into two categories:

  • A set of frequent changes that can be made by business users
  • A set of infrequent structural changes that require IT application development

For example, a business user should be able to add a new mobility pattern, a usage pattern, a fraud pattern, or a device trouble symptom without requiring IT application development. If these data sets span real-time analytics on data-in-motion, as well as dashboards in structured reporting, multiple tools need notification as the data sets are changed. Changes can be captured and simultaneously communicated to all elements of an end-to-end architecture by using metadata and declarative algorithms.

Three illustrative ways to define a common information architecture that spans AAP are:

  • Table/graph-driven data set catalog: Profiles and data set categories are represented as tables or graphs in a common data store. The common catalog is shared across all tools. Each tool converts the catalog to its own explicit representation.
  • Metadata standards: Use industry standards, such as PMML, and corporate standards to establish a standard way of representing computing algorithms and associated data sets. Each tool uses the standard to interpret the metadata.
  • External operators: A single program orchestrates the behavior across platforms and provides the decision logic, which is executed in the target platform. The logic is changed in the orchestration program, and the operator in each target system translates the logic to the target environment. For example, a data scientist designs a scoring engine by using the IBM SPSS Modeler user interface, but keeps the historical data in Hadoop storage by using IBM BigInsights™. The resulting SPSS analytical decision model is then deployed to execute as an operator in Streams in real time.

Depending on the level of tool integration and application requirement, an information architect can use a combination of approaches to design the solution architecture for metadata integration.
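To make the table-driven catalog approach more concrete, here is a minimal Python sketch. A shared catalog holds declarative data set definitions, a business user adds a definition as data rather than code, and two stand-in "tools" each convert an entry into their own representation: an in-memory predicate for record-at-a-time scoring and a SQL-style filter string for reporting. The catalog schema, attribute names, and thresholds are all hypothetical.

```python
import operator

# Comparison operators that a catalog entry may reference.
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}

# Shared catalog: each row declares one derived data set.
catalog = [
    {"name": "heavy data user", "dimension": "behavior",
     "attribute": "monthly_gb", "op": ">=", "value": 50},
]


def to_predicate(entry):
    """Streaming-style tool: turn an entry into a per-record test."""
    compare = OPERATORS[entry["op"]]
    return lambda record: compare(record[entry["attribute"]], entry["value"])


def to_sql_filter(entry):
    """Reporting-style tool: turn the same entry into a filter expression."""
    sql_op = "=" if entry["op"] == "==" else entry["op"]
    return f"{entry['attribute']} {sql_op} {entry['value']}"


def labels_for(record):
    """Apply every catalog entry to one subscriber record."""
    return [entry["name"] for entry in catalog if to_predicate(entry)(record)]


# A business user adds a new pattern without any application change.
catalog.append({"name": "frequent roamer", "dimension": "behavior",
                "attribute": "roaming_days", "op": ">", "value": 10})

print(labels_for({"monthly_gb": 60, "roaming_days": 12}))
print([to_sql_filter(entry) for entry in catalog])
```

Because both tools read the same catalog, a single change is reflected everywhere; only structural changes, such as a new attribute source, would need development work.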

Data-in-motion and real-time analytics view

The sixth tutorial in this series discussed a counter-fraud analytics and management use case. That use case is used here to show how a counter-fraud data model is shared across ODM, Streams, and SPSS.

Many data sources are analyzed to detect fraudulent activities:

  • Network data:
    • Usage data from network probes
    • xDRs
    • Mobility profile
    • Infrastructure
  • External data that provides social leaderships, sentiments, and social networks, including any of these sources:
    • Third-party interactions
    • Call center conversations, email
    • Facebook
    • Twitter
  • Legacy data, including:
    • CRM system (channel interactions, contact preferences, onboarding and retention, personalization, permissions, and data privacy)
    • Billing system (subscriptions, financial and billing profile, and mobile payments)
    • Marketing system (customer lifetime value)
    • Financial system (alerts and red flags)
  • Third-party interactions

As many big data sources become available, two sets of rules are applied to determine fraudulent activities.

  • The first set of rules are elimination rules. These rules are written directly in Streams and are started on a regular interval (every 15 minutes, for example). These rules are relatively simple in structure and eliminate cases that do not need to be investigated. The elimination rules facilitate speedier processing of data by using a small number of attributes. Examples of these rules are:
    • Check whether the same Mobile Directory Number (MDN) is used in different devices
    • Check whether the device moved with unreasonably high speed

The Streams smart filter operator uses a simple set of patterns and eliminates most of the transactions by using thresholds and trigger points on a few attributes. Figure 3 shows this process in detail.

The second set of rules are investigation rules, which are written for the Operational Decision Manager (ODM) operator. These rules investigate the focused set of cases that remain. They use a larger number of parameters with combined historical and real-time data. For example, in the case of counter-fraud analytics, these rules might use a combination of billing and network data and look for the presence of fraudulent patterns. The ODM operator participates in the data flow by applying the rules that are promoted to it. Examples of these rules include:

  • Check whether there is a change in the usage pattern in a non-familiar mobility/location profile
  • Check for a sudden jump in consumption from a non-familiar hangout
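The two elimination checks named earlier (same MDN on different devices, and movement at an unreasonably high speed) can be sketched in Python as follows. This is not Streams code; it only illustrates the logic. The event fields, the 1,000 km/h speed threshold, and the function names are assumptions, and distance is computed with the standard haversine formula.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def elimination_check(prev, curr, max_speed_kmh=1000.0):
    """Compare two consecutive events for the same MDN.

    Returns the reasons the pair should go on to investigation;
    an empty list means the pair is eliminated from further analysis.
    """
    reasons = []
    if prev["device_id"] != curr["device_id"]:
        reasons.append("same MDN on different devices")

    hours = (curr["ts"] - prev["ts"]) / 3600.0
    distance_km = haversine_km(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    # Only evaluate speed when time has advanced; zero or negative gaps are skipped.
    if hours > 0 and distance_km / hours > max_speed_kmh:
        reasons.append("unreasonably high speed")
    return reasons


# Hypothetical events: a short local move, then a jump across continents.
e1 = {"device_id": "D1", "ts": 0, "lat": 1.3500, "lon": 103.8200}
e2 = {"device_id": "D1", "ts": 600, "lat": 1.3600, "lon": 103.8200}
e3 = {"device_id": "D2", "ts": 4200, "lat": 51.5000, "lon": -0.1200}

print(elimination_check(e1, e2))
print(elimination_check(e2, e3))
```

Pairs that return an empty list would be dropped from fraud processing, while the rest would be handed to the investigation rules with their reasons attached.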

In this counter-fraud example, an SPSS data mining model is developed on historical data and designed to accept data from real-time sources. The ODM rules are written by analyzing the data. IBM Streams can accept the IBM SPSS models in their native format by using special scoring nodes. The model is used to score new data in near real time and trigger actions based on the analysis.

New data sources can require IT help. Changes to data models across the integrated platform might be required each time a new data source is introduced. However, counter-fraud experts can add measures or new fraudulent patterns by using existing data sources. Counter-fraud experts cannot afford to wait for the next IT development and test cycle each time a pattern is introduced. There is a potential major revenue loss if the counter-fraud rules are not added quickly. SPSS and ODM can share the respective data types, with each tool performing its own role and using its private data representation to manipulate and use the data. SPSS might include statistical processes on the measures to compute average behaviors, while ODM might prescribe a pattern of measures that reflects a fraudulent use of the subscriber's device.

Figure 3 illustrates how elimination and investigation rules are applied. Data comes into the system and the smart filter uses simple patterns to initiate the elimination rules. The data that passes the elimination rules is placed in the data lake. The problematic data is sent to the investigation phase, which results in specific actions and recommendations with comprehensive KPIs to help the fraud analyst to understand what is really happening.

Figure 3. Fraud analytics in real time by using the Advanced Analytics Platform
Data capture, filtering, analysis, and outcome

Data discovery and pattern analysis view

The Advanced Analytics Platform is used for pattern discovery by using data from mobile devices such as mobile phones, connected cars, and even credit card transactions. Detailed events from the CSP's network are converted into a standard format. In the following example, call detail records are formatted and loaded into an EVENTS table, which stores the events. The events data is collected and then aggregated by using specialized algorithms to create a location affinity table. This table represents all locations where a device had an event, how often, the event time, and the duration and quality of the event. This location table is then passed to another algorithm to determine the home, work, and weekend locations of the device user. All of this activity takes place in the PureData System for Analytics (PDA) with SPSS Modeler issuing the SQL and managing the process. Figure 4 shows a high-level overview of the flow.
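The flow from events to a location affinity table to home, work, and weekend locations can be approximated with the following Python sketch. In the platform this work is done in SQL inside the PDA; the in-memory version here only shows the idea. The event layout, the time windows that define each slot, and the choice of total duration as the ranking measure are assumptions.

```python
from collections import defaultdict
from datetime import datetime


def time_slot(start: datetime) -> str:
    """Assign an event to a coarse slot; the windows are illustrative."""
    if start.weekday() >= 5:  # Saturday or Sunday
        return "weekend"
    if 9 <= start.hour < 17:
        return "work"
    if start.hour >= 21 or start.hour < 6:
        return "home"
    return "other"


def build_affinity(events):
    """Aggregate (device, location, start, duration_s) events per slot."""
    affinity = defaultdict(lambda: {"events": 0, "duration_s": 0})
    for device, location, start, duration_s in events:
        row = affinity[(device, location, time_slot(start))]
        row["events"] += 1
        row["duration_s"] += duration_s
    return affinity


def key_locations(affinity, device):
    """Pick the location with the longest total duration in each slot."""
    best = {}
    for (dev, location, slot), row in affinity.items():
        if dev != device or slot == "other":
            continue
        if slot not in best or row["duration_s"] > best[slot][1]:
            best[slot] = (location, row["duration_s"])
    return {slot: location for slot, (location, _) in best.items()}


# Hypothetical events for one device (8 January 2024 is a Monday).
events = [
    ("dev-1", "cell_home", datetime(2024, 1, 8, 23, 0), 3600),
    ("dev-1", "cell_office", datetime(2024, 1, 9, 10, 0), 7200),
    ("dev-1", "cell_cafe", datetime(2024, 1, 9, 12, 0), 600),
    ("dev-1", "cell_park", datetime(2024, 1, 13, 15, 0), 5400),
]
affinity = build_affinity(events)
print(key_locations(affinity, "dev-1"))
```

A production version would also carry event quality in the affinity rows and would need enough history per device before the derived locations could be trusted.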

Figure 4. Data discovery and pattern analysis from device data
Call Detail records to permit discovery and analysis

In this example, a data mining model is developed by using historical data in the platform. The model is designed to accept data from real-time sources to deploy into operational systems. The deployment of models into operational systems is how insights are turned into actions. IBM InfoSphere Streams can accept the IBM SPSS models in their native format by using special scoring nodes. The model is used to score new data in near real time and trigger actions based on the analysis. Figure 5 shows how the software components are integrated for this example.

Figure 5. Data discovery and pattern analysis implementation
Real time and at rest data for discovery and analysis

Columnar data reporting view

The detail data that is stored in the PDA is easily accessible for reporting and analysis by using Cognos reports. The data model, table design, and performance of the PDA allow for drill-down into the data for speed-of-thought analysis. In this example, the analyst can see there are data quality issues in Orchard Road, as shown in Table 2. The analyst can double-click that location to immediately drill down to the cell tower level analysis to spot the poorly performing towers.

Table 2. Sample KPI for data discovery and pattern analysis
QoS region | Location | Streaming app quality | Data quality | Weighted QoS
Central | Bugis Victoria Street | 98.7 | 99.3 | 98.9
Central | Chinatown | 99.2 | 97.4 | 97.5
Central | Orchard Road | 98.3 | 98.7 | 98.3
Central | Tanglin Road | 98.7 | 99.0 | 98.6
East | Gedo Chai Chee | 99.0 | 98.8 | 98.9
East | Changi Airport | 98.5 | 99.0 | 98.4

A drill-down into Orchard Road provides more details on what is happening at the cell tower level. These details are shown in Table 3.

Table 3. Drill downs for further discovery and pattern analysis
Location | Data event count | Streaming event count | Voice event count | Data quality | Streaming quality | Voice quality
Cell_S558 | 24.107 | 3,657 | 51 | 99.6% | 99.5% | 100%
Cell_S98 | 123.756 | 1,167 | 12 | 96.1% | 99.7% | 100%
Cell_S4790 | 118.186 | 83 | 73 | 98.8% | 99.5% | 100%
Cell_S2072 | 109.91 | 1048 | 12 | 99.9% | 98.2% | 100%

Key architectural components

The key architectural components that are used to create the profile hub are shown in Figure 6.

Figure 6. Key architectural components
Image showing the key architectural components

Most of the following products have been discussed in other tutorials in this series. In the context of profile hubs, the products are:

  • InfoSphere Streams: Real-time analytics to create or update profiles in real time
  • InfoSphere Information Server: Extract, transform, and load (ETL) of profile-related data for storage in the appropriate repository
  • InfoSphere BigInsights: Repository for profile-related data with related analytics tools for segmentation of profiles
  • SPSS: Predictive analytics tool to identify clusters and segments for profile refinement
  • PureData System: Repository for profile data, which when integrated with more analytics tools permits high performance analysis
  • Cognos: Dashboard for end-to-end profile insights

Additional products are needed depending on the data sources and complexity of the profile. For example, entity analytics is important if you have disparate data where relationships need to be found. There are various architectural options that need to be considered depending on the type of data accuracy that is needed for the resolution.


Conclusion

The customer profile pattern is an essential pattern in the Advanced Analytics Platform. In this tutorial, you learned how to use data from different sources, in real time and at rest, to develop the appropriate dimensions. After the core profile is in place, further discovery and data patterns can be recognized by integrating with the additional components of the platform that were described in earlier tutorials.

What's next?

Information governance is the topic of the next tutorial. Many tutorials, including this one, have referred to information governance at a high level. Information governance cannot be ignored and becomes even more important as we move toward low-latency decisions and high volumes of ungoverned external data. Because little data is available early in the data acquisition process, governance is hard to apply at that stage, and it can be tempting to ignore it. However, it is a critical step as data lakes are used for discovery and prediction and when insights are used to execute actionable steps. Lack of governance can also result in catastrophic problems as the data gets used in near-real-time, mission-critical business processes. It is, therefore, highly advisable to consider data governance at the start of the project. The next tutorial in this series provides more details.
