The Power of Profiling: How automated classification aids data governance

Intelligent, flexible classification capabilities within IBM Watson Data Platform help users govern, prepare, integrate and analyze data quickly

  • Instantly classify and understand the contents of each data set
  • Profile data and identify outliers to facilitate data preparation
  • Automatically apply appropriate policies to keep sensitive data safe

In the modern workplace, more and more jobs involve interacting with data on a day-to-day basis. For example:

  • Developers and data engineers, whose job is to build data-driven applications and pipelines.
  • Business analysts and data scientists, whose task is to turn data into actionable insight.
  • “Citizen analysts” from line-of-business teams, who need self-service analytics to support their day-to-day work.

These groups may be working on different types of tasks, have different levels of technical expertise, and use different tools to achieve their goals.

Nonetheless, when they start working with a new data set, they all face the same problem. They need to understand what the data set contains, how good the quality of that data is, and what steps they need to take to shape it to meet their objectives.

Why profiling matters

The first step in working with any structured data set is to profile it. Whether you’re dealing with a spreadsheet, a JSON document or a set of SQL tables, you need to work out what characteristics the data has, and decide what relevance those characteristics have to the task you wish to achieve.

Let’s assume that you can conceptualize your data set as a set of records (rows), where each record contains one or more fields (columns). Profiling is the process of analyzing each column and understanding what type of data it contains, as well as the number, frequency and distribution of values across its rows.
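To make that definition concrete, here is a minimal sketch of a column profiler in plain Python. It is an illustration only, not how Watson Data Platform works internally: the function names (`infer_type`, `profile_column`) and the coarse type labels are assumptions made for this example.

```python
from collections import Counter
from statistics import mean, median
from typing import Optional


def infer_type(value: Optional[str]) -> str:
    """Guess a coarse type label for one raw value (hypothetical labels)."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "string"


def profile_column(values):
    """Summarize one column: row count, types, frequencies, distribution."""
    types = Counter(infer_type(v) for v in values)
    present = [v for v in values if infer_type(v) != "missing"]
    frequencies = Counter(present)
    numeric = [float(v) for v in present
               if infer_type(v) in ("integer", "float")]

    profile = {
        "rows": len(values),
        "missing": types.get("missing", 0),
        "distinct": len(frequencies),
        "types": dict(types),
        "top_values": frequencies.most_common(3),
    }
    if numeric:
        # Basic distribution statistics for the numeric values only.
        profile["numeric"] = {
            "min": min(numeric),
            "max": max(numeric),
            "mean": mean(numeric),
            "median": median(numeric),
        }
    return profile


ages = ["34", "41", "", "34", "n/a", "29"]
profile = profile_column(ages)
```

For the sample `ages` column, the profile reports six rows, one missing value, four distinct values, and a stray string ("n/a") sitting among the integers, which is exactly the kind of outlier you would want to know about before preparing the data.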

Profiling in practice

Identifying the type of the data in the column is the first and most critical task. Let’s say your records all include a field that contains a 16-digit integer. Is that just a very large number? Or could it be a credit card number? The answer to this question will have a fundamental impact on how you understand and use the data.

If the field does seem to contain credit card numbers, other questions arise. Does every row contain a valid number, or are some of them just random digits? Can you tell which rows are Visa, Mastercard, or American Express? In general, how can you assess the quality and usefulness of the data?
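One common way to answer the first two questions is the Luhn checksum, which payment card numbers are designed to satisfy, combined with a look at the leading digits. The sketch below is an assumption about how such a check could be written, not a description of the profiling service itself; the function names are hypothetical, and the prefix rules cover only the three networks mentioned above.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def card_network(number: str) -> str:
    """Guess the card network from length and leading digits."""
    if not luhn_valid(number):
        return "invalid"
    length = len(number)
    if number[0] == "4" and length in (13, 16, 19):
        return "Visa"
    if length == 16 and (51 <= int(number[:2]) <= 55
                         or 2221 <= int(number[:4]) <= 2720):
        return "Mastercard"
    # American Express numbers are 15 digits long, not 16.
    if length == 15 and number[:2] in ("34", "37"):
        return "American Express"
    return "unknown"


# Widely published test numbers, not real accounts.
samples = ["4111111111111111", "5555555555554444",
           "378282246310005", "1234567890123456"]
networks = [card_network(s) for s in samples]
```

Note that a column of strictly 16-digit values would never contain an American Express number, so the length of each value is itself a useful signal when profiling.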

More importantly, if your data set does contain genuine credit card numbers, it’s probably pretty sensitive. How can you make sure the data is properly protected, so that only authorized users can access it?

Finding a better way

In an ideal world, every data set would be profiled before users start working with it. This would not only save time by giving the user a quick overview of the data; it would also reduce the risk of sensitive data falling into unauthorized hands.

However, there’s an obvious problem here: how can a data set be profiled without a user to profile it?

IBM® Watson® Data Platform provides an answer with its integrated profiling service. This service comes with built-in column-level classifiers that can recognize more than 160 common data types—from credit card and social security numbers to email addresses, and even more fuzzily defined concepts such as names and addresses.

Moreover, if your business uses a bespoke type of data that doesn’t fit any of these default classifiers (perhaps you use a special alphanumeric code for product or component numbers, for example), you can build a custom classifier that will automatically recognize that data type in any data set you create.

Even though the profiling engine itself is highly sophisticated, building custom classifiers is simple: you can either supply a list of valid values to match against (“Bill”, “Jane”, “Tom”, “Susan”, and so on), or use a regular expression. As a result, adding a new classifier takes only a matter of minutes.
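The two approaches can be sketched in a few lines of Python. This is not the Watson Data Platform API; the helper names (`value_list_classifier`, `regex_classifier`, `match_rate`) and the part-number format `PN-` plus two letters and four digits are all hypothetical, chosen only to show the idea.

```python
import re
from typing import Callable, Iterable


def value_list_classifier(valid_values: Iterable[str]) -> Callable[[str], bool]:
    """Build a classifier that matches against a fixed list of values."""
    allowed = {v.casefold() for v in valid_values}
    return lambda value: value.strip().casefold() in allowed


def regex_classifier(pattern: str) -> Callable[[str], bool]:
    """Build a classifier that requires the whole value to match a pattern."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value.strip()) is not None


def match_rate(classifier: Callable[[str], bool], column: Iterable[str]) -> float:
    """Fraction of values in the column that the classifier accepts."""
    values = list(column)
    if not values:
        return 0.0
    return sum(1 for v in values if classifier(v)) / len(values)


first_name = value_list_classifier(["Bill", "Jane", "Tom", "Susan"])
part_number = regex_classifier(r"PN-[A-Z]{2}\d{4}")  # hypothetical format

column = ["PN-AB1234", "PN-ZZ0001", "junk", "PN-AB12"]
rate = match_rate(part_number, column)
```

Using `fullmatch` rather than `search` is a deliberate choice here: a value such as "PN-AB12345" contains a valid code as a prefix, but should not be classified as one.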

Instant insight into data quality

The intelligence of the profiling service comes into play when a column doesn’t completely match a classifier. For example, a field might mostly contain valid email addresses, but a small proportion of records might be malformed, or a different type, or just junk data.

In these cases, the profiler automatically calculates a quality score, telling the user how reliable the data is and what different types the column includes. This helps them make quicker, better decisions about whether they can fix the problem in IBM Data Refinery, or whether the data is too corrupted to be useful.
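The exact formula behind the quality score is not described here, so the sketch below makes a simple assumption: the score is the share of rows that match the column's most common recognized type. The function names, the type labels, and the deliberately loose email pattern are all illustrative.

```python
import re
from collections import Counter

# Deliberately loose pattern for illustration; real email validation is harder.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def classify_value(value: str) -> str:
    """Assign one value to a coarse class (hypothetical labels)."""
    text = value.strip()
    if not text:
        return "empty"
    if EMAIL.fullmatch(text):
        return "email"
    if text.isdigit():
        return "integer"
    return "unrecognized"


def column_quality(values):
    """Report the dominant type, an assumed quality score, and a breakdown."""
    labels = Counter(classify_value(v) for v in values)
    recognized = {label: count for label, count in labels.items()
                  if label not in ("empty", "unrecognized")}
    if not recognized:
        return {"dominant_type": None, "quality_score": 0.0,
                "breakdown": dict(labels)}
    dominant, hits = max(recognized.items(), key=lambda item: item[1])
    return {
        "dominant_type": dominant,
        # Assumed definition: matching rows divided by all rows.
        "quality_score": hits / len(values),
        "breakdown": dict(labels),
    }


contacts = ["a@example.com", "b@example.org", "not-an-email", "12345",
            "", "c@example.net", "d@example.com", "e@example.com"]
report = column_quality(contacts)
```

For the sample column, five of the eight rows are emails, so the assumed score is 0.625, and the breakdown shows one integer, one unrecognized value and one empty row still to be dealt with.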

Automating data governance

In addition to making life easier for data users, automatic profiling is also an invaluable tool for chief data officers and data stewards. Classifiers can be used as a means of enforcing data governance, which saves time and reduces the risk of errors that could lead to data breaches.

For example, an organization’s data governance team could create a policy in IBM Data Catalog to prevent data sets containing credit card numbers from being accessed by unauthorized users. Whenever the profiler detects a credit card field in a new data set, that policy will automatically be applied to the whole data set, and access will be restricted to users who have appropriate clearance.
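In outline, that behavior amounts to a lookup from detected column classes to access requirements. The following sketch models it with plain Python data classes; it does not use or represent the IBM Data Catalog API, and every name in it (`Policy`, `DataSet`, `apply_policies`, `can_access`, the `pci` clearance) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Policy:
    name: str
    restricted_class: str      # column class that triggers the policy
    required_clearance: str    # clearance a user needs once triggered


@dataclass
class DataSet:
    name: str
    column_classes: Dict[str, str]  # column name -> class found by profiling
    required_clearances: Set[str] = field(default_factory=set)


def apply_policies(data_set: DataSet, policies: Iterable[Policy]) -> DataSet:
    """Restrict the whole data set if any column triggers a policy."""
    detected = set(data_set.column_classes.values())
    for policy in policies:
        if policy.restricted_class in detected:
            data_set.required_clearances.add(policy.required_clearance)
    return data_set


def can_access(user_clearances: Iterable[str], data_set: DataSet) -> bool:
    """A user needs every clearance the data set requires."""
    return data_set.required_clearances <= set(user_clearances)


policies = [Policy("Protect card data", "credit_card_number", "pci")]
payments = apply_policies(
    DataSet("payments", {"id": "integer", "card": "credit_card_number"}),
    policies,
)
catalog = apply_policies(
    DataSet("catalog", {"sku": "part_number", "price": "float"}),
    policies,
)
```

Because the policy is keyed on the classification rather than on a specific data set, any new data set with a detected credit card column picks up the restriction without anyone having to remember to apply it.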

Communicating data to a business audience

The profiling service also integrates with the business glossary service within Watson Data Platform, which provides a translation layer between technical, data-centric terminology and everyday business language.

Whenever a new custom classifier is created for the profiling service, a new technical term is automatically created in the glossary too, allowing data stewards to give the classifier a name and description that can be easily understood by less technical users.

As a result, when a data scientist or citizen analyst comes across a data set that contains an unfamiliar field, they can simply look it up in the business glossary. This helps them quickly understand what kind of data the field contains, and which other assets in the catalog contain similar data.

Getting to the goal

Whether you are assessing how to migrate data from one database to another, building a complex neural network, or just trying to plot some simple charts to visualize business performance, the first step is understanding what data you have, and how it can solve your problem.

With IBM Watson Data Platform, that initial profiling step is no longer a time-consuming manual task—it’s a fully automated part of the workflow that you can simply take for granted. From the second you load up your data set, you can take a first peek at what your data is telling you—and start making smarter decisions about how to achieve your goals.

If you’d like to learn more about how IBM Watson Data Platform can transform the way your organization looks at its data, sign up for a free trial today.
