January 14, 2015 | Written by: Chris Howard
Wasting time on data wrangling? Burning hours on munging data before it’s useful / consumable?
Well, rest assured, you are not alone. It is claimed that 70-80% of a data analyst's time is spent on data discovery, integration, transformation and cleansing before the real analysis can even start. Organisations, both large and small, face significant challenges when it comes to extracting insight from the enterprise data they already have at hand.
- Deeply skilled analytical / statistical experts are spending valuable time on data preparation, which is not the most effective use of their expertise.
- Data wrangling often exploits a mixed bag of highly technical tools and requires knowledge of scripting / development, using languages such as R and Python. This leaves many business users high and dry as they do not possess the skills to prepare their own data before analysis.
- This ad hoc, script-based approach lacks the lineage, governance rigour and transparency needed for data confidence and trust.
- Business users / analysts are often left waiting for IT to source and prepare the data for them. IT are under increasing pressure to facilitate access to “clean” data.
- In many cases the same data set (or subset) is being sourced and prepped for analysis by multiple users across the organisation. This results in redundant / duplicate storage and processing of like data.
Data Cleansing Effort
Often data requires significant cleansing to get to the point where it is useful—fixing missing, extreme, inconsistent, duplicate and erroneous values. Automated and semi-automated techniques are now being used to provide a certain level of data cleansing:
- Data profiling capabilities that can discover field data type, length, precision, scale, format patterns, completeness, and cardinality.
- Statistical analysis to identify outliers and erroneous entries.
- Web services to provide validation, completion and standardisation for entities such as addresses / locations.
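As a concrete illustration of the statistical technique above, here is a minimal sketch of outlier detection using Tukey's fences (values beyond 1.5 times the interquartile range). The function name and sample data are my own, purely for illustration; it is not part of any DataWorks API.

```python
import statistics

def find_outliers(values, k=1.5):
    """Flag values outside k * IQR of the middle quartiles (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

readings = [10.2, 10.5, 9.8, 10.1, 10.4, 9.9, 55.0, 10.3]
print(find_outliers(readings))  # [55.0] -- the 55.0 entry is flagged
```

Real profiling services apply the same idea at scale, typically combined with type inference and pattern analysis, so that suspect entries can be reviewed or corrected before analysis.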
Lambda Value / Data Decay
Beyond issues around data quality, time as a dimension is also becoming an issue. Much of the data we deal with exhibits a half-life-like characteristic, with the value of the data decaying over time: what may be useful this moment may be useless (or less useful) the next. The longer it takes to prepare and process our raw data, to the point that it's good enough to answer our questions, the more the value of that data erodes, and this presents a data confidence challenge.
Recognising this decay forces us to place further constraints on wrangling effort, as we cannot spend too much time on cleansing before we start answering questions. This raises the question of not just whether the data is good enough, but when it is good enough. For time-sensitive data we may even have to compromise on "good enough".
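The half-life analogy above can be made concrete with a toy decay model. This is purely an illustration of the concept, not a formula from the post: it simply assumes the value of an insight halves over some fixed period.

```python
def remaining_value(initial_value, hours_elapsed, half_life_hours):
    """Toy exponential-decay model: value halves every half_life_hours."""
    return initial_value * 0.5 ** (hours_elapsed / half_life_hours)

# If an insight is worth 100 units when fresh and halves every 24 hours,
# two days spent wrangling leaves only a quarter of its original value.
print(remaining_value(100, 48, 24))  # 25.0
```

Under a model like this, shaving a day off preparation time can matter more than squeezing out the last few percent of cleansing quality.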
So what if we could wrangle our data at speed – automating the essentials, yet still achieve acceptable results? Enter IBM DataWorks.
Launched as a Bluemix beta service in late 2014, IBM DataWorks™ provides data refinery services that speed application development by getting you the data you need, when you need it, and ensuring it is fit for purpose. The service provides a set of APIs that follow a standard REST model, allowing you to interoperate with feature-rich refinery capabilities. The performance and scalability of the IBM DataWorks engine ensure that your application runs efficiently. IBM DataWorks includes APIs that load data, cleanse US postal addresses, and classify data, all of which help you quickly develop data-rich applications. Access to good clean data, lickety-split.
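To give a feel for the REST style of interaction, here is a minimal sketch of assembling a request body for a profiling activity. The host name, payload fields and structure are illustrative placeholders of my own, not the documented DataWorks contract; consult the service documentation for the real endpoints and schemas.

```python
import json

BASE_URL = "https://example-dataworks.mybluemix.net"  # hypothetical host

def build_profile_request(source_url, columns):
    """Assemble a JSON body for a (hypothetical) profiling activity request."""
    return {
        "activity": "profile",                             # illustrative field
        "source": {"type": "objectstore", "url": source_url},
        "columns": columns,
    }

body = build_profile_request("swift://container/sales.csv", ["id", "email"])
print(json.dumps(body, indent=2))
```

In practice you would POST a body like this to the service, then poll a status endpoint until the activity completes, which is exactly the pattern the sample application below wraps up for you.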
If some of this sounds familiar, it should. The IBM DataWorks capabilities are also provided as baked-in features of IBM DashDB and IBM Watson Analytics, our premier in-cloud analytics experience.
Data Profiling Example
One of the core APIs included with the DataWorks service provides powerful data profiling capabilities, letting you quickly gain new insights about the data in your application. This API can be used to gather information about the data, such as column value distributions or data types. The service can also help to identify higher-value data attributes like e-mail addresses, Social Security Numbers, National IDs or credit card numbers so that your application can take action, such as masking sensitive data for HIPAA compliance.
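To show the kind of action an application might take once profiling has flagged a sensitive column, here is a small masking sketch for US Social Security Numbers. This is my own illustration of the masking step, not DataWorks code.

```python
import re

def mask_ssn(text):
    """Replace all but the last four digits of a US SSN (NNN-NN-NNNN) pattern."""
    return re.sub(r"\b\d{3}-\d{2}-(\d{4})\b", r"***-**-\1", text)

print(mask_ssn("Customer SSN: 123-45-6789"))  # Customer SSN: ***-**-6789
```

The point is that profiling tells you *where* the sensitive values live; a simple transformation like this (or an equivalent refinery activity) then keeps them out of downstream reports.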
As a quick Bluemix example, I wanted to see if I could profile some existing data in a spreadsheet and do something useful with the profiling results. As a starter, I leveraged the DataWorks sample application available on Bluemix. This sample provides a simple Liberty for Java backend that exposes a couple of REST APIs for submitting a new data profiling request and checking on the request's progress. For the front end, I used some simple Microsoft Office VBA scripts to control the movement of data and the calls to the REST APIs.
The application flow works as follows:
- The user selects a column/row range in Excel and clicks a toolbar button to request that the data be profiled using DataWorks.
- The data, along with the column names, is uploaded to the SoftLayer Object Store so that it can be made available to DataWorks. Currently DataWorks supports both Amazon Simple Storage Service (S3) and SoftLayer Object Storage as valid data sources.
- The Excel application then makes a call to the REST API exposed by the DataWorks sample application to begin profiling the data in SoftLayer.
- The data in the SoftLayer Object Store is accessed and the profiling task is started.
- Once complete, the profiling task returns a JSON-format data set that provides details of the column attributes, including format and data type (for more details, see Integrating the Data Profiling REST API into your application).
- Finally, we use the profiling results to add additional column details and to format the data columns back in Excel.
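The last two steps above boil down to parsing the profiling JSON into rows you can write back to the spreadsheet. The field names in this sample payload are a simplified stand-in of my own; the real response schema is described in "Integrating the Data Profiling REST API into your application".

```python
import json

# Simplified, illustrative profiling response -- not the actual DataWorks schema.
sample_response = """
{
  "columns": [
    {"name": "order_id", "inferredType": "INT32",  "nullCount": 0},
    {"name": "email",    "inferredType": "STRING", "nullCount": 3}
  ]
}
"""

def summarise_columns(payload):
    """Turn the profiling JSON into (name, type, nulls) rows for the spreadsheet."""
    doc = json.loads(payload)
    return [(c["name"], c["inferredType"], c["nullCount"]) for c in doc["columns"]]

for row in summarise_columns(sample_response):
    print(row)
```

In the sample application this parsing happens in VBA rather than Python, but the shape of the work is the same: walk the column list, pull out the inferred attributes, and apply them as headers and formats in Excel.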
Here is a sample of the Excel data before and after formatting / presentation. Not quite Watson Analytics (this is really just prettifying the data, not providing any real analysis), but it gives you an idea of what's possible.
Give IBM DataWorks a Try
This is a simple example that glues together a couple of cloud-based capabilities with traditional desktop processing to leverage the power of DataWorks. Why not give it a try yourself? The sample applications provide an easy entry point to data loading, US postal address cleansing and data profiling. Over time I am sure we will see additional services added to DataWorks to further extend the capability, freeing you up to spend more time analysing your data and less time wrangling 😉
Please reach out to me on Twitter (@kitard) if you have any questions / issues or post a comment below.