Drowning in data sources: How data cataloging could fix your findability problems

Dangers of tunnel vision

We’ve all heard horror stories about companies that didn’t read the market effectively. Global book retailers who sleepwalked through the onset of the e-reader era. Mobile giants who were wrong-footed by the smartphone revolution. Brick-and-mortar video stores that ignored the onset of mail-order rentals, and later of on-demand streaming technology.

These companies all had something in common: their strategic decisions were not based on an adequate understanding of the market at the time. The lesson for data scientists and other knowledge workers is that you cannot base your analyses solely on your own company’s internal data—you must bring in external sources too, or risk missing out on make-or-break opportunities for your business.

Looking outwards — and inwards — for your data sources

However, the need for data scientists to access and analyze both internal and external data often poses practical problems. Accessing external sources means connecting to systems outside your firewall, which can create technical challenges around security. And since you don’t own and can’t control the source of the data, it may only be accessible via a complex, poorly documented or frequently changing API—putting further barriers between your knowledge workers and the information they need.

The challenges are not limited to external data sources: there are often obstacles to overcome when working with internal sources too. For example, if you want to extract information directly from an internal database, you will typically need the IT team to set up an account and provide you with access credentials.

Such requests may also require approvals from data stewards or the office of the Chief Data Officer (CDO). Requests to access data can hold up analytics projects for months at a time, as knowledge workers work out whom to ask and justify their requests, and data stewards locate and review the datasets in question. By the time the request is granted, it may no longer be worthwhile to complete the analysis, as the window of opportunity for decision-making may already have passed.

Even if you receive clearance to proceed in time, you will need to figure out how to extract, cleanse and transform the data for analysis. And your work isn’t finished there: if you are working on a copy of the data, how can you make sure that this copy remains up-to-date as new information gets added to the source system? Putting processes in place to refresh datasets regularly is yet another time-consuming task that limits the amount of time data scientists can spend on genuine analysis, as discussed in a previous blog, “Breaking the 80/20 rule: How data catalogs transform data scientists’ productivity”.
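To make the refresh problem concrete, here is a minimal sketch of the kind of incremental refresh logic teams end up maintaining by hand. All names and fields here are illustrative assumptions, not any real catalog or source-system API: it tracks a "last refreshed" watermark and pulls only rows changed since then.

```python
from datetime import datetime

# Hypothetical source rows, each carrying a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2017, 1, 10)},
    {"id": 2, "updated_at": datetime(2017, 2, 5)},
    {"id": 3, "updated_at": datetime(2017, 3, 1)},
]

def refresh(local_copy, source, watermark):
    """Merge into local_copy only the rows changed since the watermark,
    and return the updated copy plus the new watermark."""
    new_rows = [r for r in source if r["updated_at"] > watermark]
    by_id = {r["id"]: r for r in local_copy}
    for r in new_rows:
        by_id[r["id"]] = r  # insert new rows, overwrite stale ones
    new_watermark = max([watermark] + [r["updated_at"] for r in new_rows])
    return list(by_id.values()), new_watermark
```

Every dataset a team copies needs a job like this scheduled, monitored and fixed when the source changes shape, which is exactly the overhead a managed catalog aims to absorb.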

These factors can all impose a significant drag on analytics efforts. It’s no wonder that Gartner predicts that through 2017, 60 percent of big data projects will fail to go beyond piloting and experimentation.

So, what can you do?

The ideal scenario for knowledge workers is to be able to find and use data in seconds, regardless of whether it comes from internal or external sources, where it is stored, or what format it is in. Just imagine how much more effective your data scientists and data engineers could be in such a situation—and how much more successful their analytics projects could be!

To turn this dream into a reality, you need a solution such as IBM Data Catalog (currently in beta). Let’s take a look at some of the features that can help to resolve these challenges.

First, you need a tool that can provide an abstraction layer over multiple external and internal sources of data, whether it resides on-premises or in private or public cloud platforms. For example, IBM Data Catalog will support connections to approximately 32 different sources, including both IBM and popular third-party technologies such as Cloudera Impala, Salesforce.com, Apache Hive, Amazon Redshift, Microsoft SQL Server, Sybase, Oracle, PostgreSQL and more.
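The idea of an abstraction layer can be illustrated with a short sketch. The class and method names below are hypothetical (this is not the IBM Data Catalog API): the point is simply that analysis code talks to one common read interface, while per-source connection details stay hidden behind it.

```python
from abc import ABC, abstractmethod

class DataSource(ABC):
    """One common read interface over many heterogeneous backends."""
    @abstractmethod
    def read(self, query: str) -> str: ...

class PostgresSource(DataSource):
    def read(self, query: str) -> str:
        # A real connector would open a connection and run SQL here.
        return f"rows from PostgreSQL for: {query}"

class SalesforceSource(DataSource):
    def read(self, query: str) -> str:
        # A real connector would call the Salesforce API here.
        return f"records from Salesforce for: {query}"

def analyze(source: DataSource, query: str) -> str:
    # Callers never learn a source-specific API or credential scheme.
    return source.read(query)
```

With this shape, adding a thirty-third source means writing one new connector, not retraining every data scientist on a new API.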

Once you have connected to a source, you will be able to start creating data assets, either by copying the source data into cloud object storage connected to the catalog, or by creating a reference to the source within the catalog. You will also have the option to add comments and tags to the asset, making it easier for other users to find and utilize it.
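A catalog entry along these lines might be modeled as follows. This is an illustrative sketch, assuming made-up field names rather than the product’s actual data model: each asset records whether it is a copy in object storage or a reference back to the source, plus the tags and comments that make it findable.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogAsset:
    name: str
    mode: str       # "copy" (data duplicated into object storage) or "reference"
    location: str   # storage path for a copy, or a pointer to the source
    tags: list = field(default_factory=list)
    comments: list = field(default_factory=list)

def find_by_tag(catalog, tag):
    """Return every asset carrying the given tag."""
    return [asset for asset in catalog if tag in asset.tags]
```

Tag-based lookup is what turns a pile of connections into something other users can actually search.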

Dive into automation

Using its automatic data discovery capabilities, the catalog can proactively scan and inventory data assets across your entire range of repositories, inferring the schema, type and class of each data asset. IBM Data Catalog can then provide a common approach for accessing data from any source, abstracting away the complexity associated with connecting to diverse data sources. This means that you—or your data scientists—don’t need to spend time learning a different API for each data source, or understanding the complexities of reading and writing data in different databases. Instead, you have a single workspace with tools that let you move data between source and target systems in a few clicks.
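Schema inference of this kind can be sketched in a few lines. This is a deliberately toy version of what a discovery scan does, not the product’s algorithm: it samples values in each column and guesses the narrowest type that fits.

```python
def infer_schema(rows):
    """Guess a type for each column from its sample values (toy example)."""
    def classify(values):
        try:
            [int(v) for v in values]
            return "integer"
        except ValueError:
            pass
        try:
            [float(v) for v in values]
            return "float"
        except ValueError:
            return "string"

    columns = rows[0].keys()
    return {col: classify([row[col] for row in rows]) for col in columns}
```

A real scan would also sample far more rows, detect dates and identifiers, and classify sensitive fields, but the principle is the same: the catalog learns the shape of the data so users don’t have to.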

IBM Data Catalog also helps by automatically applying relevant data governance rules to each asset in the catalog, based on the attributes that the asset contains. As a result, knowledge workers don’t need to worry about whether they are allowed to access or use a given data set: access is granted or denied automatically, based on the governance rules that apply to that asset.
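The underlying idea is attribute-based access control: a rule inspects the asset’s attributes and the user’s entitlements, and the decision falls out automatically. The rule and attribute names below are illustrative assumptions, not product behavior.

```python
def allowed(user, asset):
    """Toy attribute-based rule: assets tagged as containing personal data
    are readable only by users holding the (hypothetical) pii_access
    entitlement; everything else is open."""
    if "PII" in asset.get("tags", []):
        return "pii_access" in user.get("entitlements", [])
    return True
```

Because the rule is evaluated per asset at access time, a newly cataloged dataset is governed the moment its attributes are known, with no manual approval queue.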

To explore how IBM Data Catalog can help you transform the way your data scientists, engineers and stewards work with internal and external data sources, visit our website and learn more about the beta.