Five steps to jumpstart your data integration journey
26 June 2020
5 min read

As British mathematician Clive Humby coined, “data is the new oil.” Like oil, data is valuable, but it must be refined before it can provide value. Organizations need to collect, organize, and analyze their data across multicloud and hybrid cloud environments and data lakes. Yet traditional ETL tools support only a limited number of delivery styles and involve a significant amount of hand-coding. As a result, enterprises are increasingly looking to machine-learning-powered integration tools to synchronize data across systems, improve employee productivity, and prepare data for analytics.

To see how, we will walk through five steps an analyst at Raviga, a fictitious financial services company, can take to run a successful data integration project using IBM DataStage on IBM Cloud Pak for Data.

1. Ingest the Data

Data must be ingested before it can be digested.

First, the Raviga analyst needs access to the company's data sources before any analytics can happen. They need to pull data from multiple sources (such as Google BigQuery, Amazon Redshift, and Azure Blob Storage). IBM DataStage has hundreds of pre-built, pluggable connectors for different data stores, relational databases, and Hadoop sources.

With these connectors, the analyst can quickly connect to both cloud and on-premises databases. They can also perform functions like importing metadata, reading and writing data in parallel, and working with various compression and file formats. DataStage also provides generic JDBC and ODBC connectors to query external databases for which it does not have a dedicated connector.
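In DataStage these connections are configured through connector stages rather than written by hand, but conceptually a generic ODBC pull works like the minimal Python sketch below. The DSN, credentials, table, and column names are hypothetical, chosen only to mirror the example datasets used in this walkthrough.

```python
import pyodbc

# Hypothetical DSN and credentials -- in DataStage this is configured
# through the ODBC connector stage rather than coded by hand.
conn = pyodbc.connect("DSN=raviga_warehouse;UID=analyst;PWD=********")

cursor = conn.cursor()
# Pull the three example datasets: scheduled transaction dates,
# customer state abbreviations, and zip codes.
cursor.execute(
    "SELECT transaction_id, scheduled_date, state_abbrev, zip_code "
    "FROM customer_transactions"
)
rows = cursor.fetchall()
print(f"Ingested {len(rows)} rows from the source system")
conn.close()
```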

For the purposes of this example, let’s suppose the analyst has imported data on scheduled transaction dates, customer state abbreviations, and zip codes.

2. Ensure Resource Availability

Before the Raviga analyst performs any transformations or analysis on the data, they must ensure performance and uptime will be consistent. With DataStage's built-in support for resiliency, data delivery is safer because the platform recognizes when compute nodes fail and dynamically spins up new nodes as needed. This is unlike bare-metal deployments, where humans must monitor closely and trigger failovers manually.
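The platform handles this automatically, but the idea it automates is simple: check node health and replace anything that stops responding. The toy sketch below only illustrates that concept; the node records and health check are stand-ins, not DataStage internals.

```python
# Toy illustration of the resiliency idea: poll worker nodes and replace
# any that stop responding. DataStage and Cloud Pak for Data do this for
# you; the node records and the health check here are stand-ins.

def is_healthy(node):
    # Placeholder check; a real system would ping the node or ask
    # the cluster orchestrator for its status.
    return node["alive"]

def spin_up_replacement(node_id):
    print(f"Node {node_id} failed -- spinning up a replacement")
    return {"id": node_id, "alive": True}

nodes = [{"id": 1, "alive": True}, {"id": 2, "alive": False}]

for i, node in enumerate(nodes):
    if not is_healthy(node):
        nodes[i] = spin_up_replacement(node["id"])
```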

3. Evaluate Data Quality

Now, the Raviga analyst must ensure the data is accurate, complete, and relevant. Bad data leads to bad analysis, which can in turn lead the team to make bad business decisions.

IBM QualityStage, a data cleansing tool fully integrated with DataStage, makes it easy for the analyst to detect quality issues in data. With this tool, the analyst can automatically import, profile, and assign a quality score to Raviga's customer data.

The analyst can create logical automation rules using a graphical interface to look for abnormalities in the data. For instance, they can verify that a scheduled transaction date isn't in the past, that customer state abbreviations are valid, and that zip codes conform to a standard format. Poor-quality data does not get loaded; it gets fixed, so the data can be trusted down the line.
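The rules themselves are built in QualityStage's graphical interface, but the logic they encode is roughly what this Python sketch captures. The record fields, the abbreviated state list, and the date format are assumptions made for illustration.

```python
import re
from datetime import date, datetime

VALID_STATES = {"CA", "NY", "TX", "WA"}        # abbreviated list for illustration
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # 5-digit or ZIP+4 format

def quality_issues(record):
    """Return a list of rule violations for a single customer record."""
    issues = []
    scheduled = datetime.strptime(record["scheduled_date"], "%Y%m%d").date()
    if scheduled < date.today():
        issues.append("scheduled transaction date is in the past")
    if record["state"].upper() not in VALID_STATES:
        issues.append("unknown state abbreviation")
    if not ZIP_PATTERN.match(record["zip_code"]):
        issues.append("zip code does not match the expected format")
    return issues

print(quality_issues({"scheduled_date": "20190101", "state": "ny", "zip_code": "1000"}))
# -> ['scheduled transaction date is in the past',
#     'zip code does not match the expected format']
```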

Let’s dive deeper into validating zip codes. To do this, the analyst can use the Address Verification Interface (AVI) in QualityStage. The AVI module ensures complete, consistent, and valid address information across Raviga’s databases. It parses, standardizes, validates, and geocodes addresses to automatically create a single best address record.

The analyst is able to parse addresses and assign address components, such as street or postal code, to a column. After the module finishes running, a report is available showing the total number of records processed, as well as the number and percentage of records that were validated, corrected, or failed. When appropriate, smart suggestions to update some addresses will be made available to the analyst.
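AVI generates that summary report itself; a stand-in for what it tallies might look like the sketch below, where the address records and status labels are made up for illustration.

```python
from collections import Counter

# Hypothetical verification outcomes -- AVI produces these statuses
# and the summary report for you.
results = [
    {"address": "123 Main St, San Jose, CA 95112", "status": "validated"},
    {"address": "456 Oak Ave, San Jose, CA 9511",  "status": "corrected"},
    {"address": "789 Unknown Rd, Nowhere, ZZ",     "status": "failed"},
]

counts = Counter(r["status"] for r in results)
total = len(results)
for status in ("validated", "corrected", "failed"):
    n = counts[status]
    print(f"{status:>9}: {n} record(s), {n / total:.0%}")
print(f"processed: {total} records total")
```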

AVI works across 240 countries and territories. Even as Raviga expands their business to other countries like China, AVI will help transliterate Chinese address data and convert the information into a common representation using a Latin character set. For example, the Chinese character 街 will be transliterated to “jie”.

4. Standardize the Data with Machine Learning

Now that the Raviga analyst has identified the poor-quality data, they want to fix these problems. DataStage provides Transformer stages to manipulate data, bring it into compliance with business and integrity rules, and raise its quality score. In a DataStage job, stages are represented as nodes on the canvas.

Transformations can be written to handle a variety of needs from ensuring data types are correctly mapped from source to target, to resolving anomalies and inconsistencies, to reducing the amount of redundant or duplicate data. After cleaning Raviga’s customer zip code data in QualityStage AVI, the analyst now wants a final list of unique customer zip codes. This can be easily achieved using DataStage’s Remove Duplicates stage. This stage takes the set of customer zip codes, removes all duplicates, and writes the results to an output data set.
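Conceptually, the Remove Duplicates stage does something like the short sketch below; the zip code values are made up for illustration.

```python
# Conceptual equivalent of the Remove Duplicates stage: take the set of
# customer zip codes and write out one row per unique value.
zip_codes = ["95112", "10001", "95112", "73301", "10001"]

# dict.fromkeys keeps the first occurrence and preserves input order
unique_zip_codes = list(dict.fromkeys(zip_codes))
print(unique_zip_codes)   # ['95112', '10001', '73301']
```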

As the analyst uses DataStage Flow Designer more frequently, DataStage can use machine learning to more accurately predict what they will do next in the job design.

To reduce developer friction, features like Smart Palette and Suggested Stages provide a ranked prediction of what the analyst may want to do next. After deduplicating customer zip codes, the analyst could also ensure scheduled transaction dates all use the same standard format (e.g., CCYYMMDD) and customer state abbreviations have uniform capitalization (e.g., CA vs. ca).
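Inside DataStage these transforms are built as stages on the canvas, but the underlying logic is straightforward. The sketch below normalizes a few assumed input date formats to CCYYMMDD and uppercases state codes; the accepted source formats are an assumption for this example.

```python
from datetime import datetime

INPUT_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")  # assumed source formats

def to_ccyymmdd(value):
    """Normalize a date string from any assumed source format to CCYYMMDD."""
    for fmt in INPUT_DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value}")

print(to_ccyymmdd("03/15/2020"))   # 20200315
print("ca".upper())                # CA
```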

This is all achieved within a browser-based interface. With IBM DataStage Flow Designer, anyone else within the Raviga organization can also solve data preparation and transformation needs without relying on a desktop environment.

5. Analyze the Data

After the Raviga analyst imports the data and identifies and resolves data quality issues, they are ready to get a more complete picture of the data and extract insights. On the IBM Cloud Pak for Data platform, the analyst may use other services like Watson Knowledge Catalog to create a smart data catalog that helps with data governance, or IBM Watson Studio AutoAI to quickly develop machine learning models and automate hyperparameter tuning.

Before using IBM DataStage, Raviga relied on a manual, hand-coded process that required tedious integrations with many other systems and a team of developers to maintain. Customers have reported savings of up to 87 percent in development costs after switching from hand-coding to DataStage.

Now, with IBM DataStage, Raviga has reduced the time needed to transform data and operationalize data delivery for analytics and AI use cases. Many manual processes have been automated, and teams across the organization can easily gather new insights, such as seeing changes in customer trends as they happen.

Next Steps

As demonstrated by Raviga, the services within IBM DataStage, from data integration to data quality, have the machine-learning-powered features enterprises need to power their data integration journey. To dive deeper into IBM DataStage, visit here, and to learn why IBM is recognized as a market leader in data integration tools, read the report here.

 
Author
Kevin Wei