Data integration often splits the data science community into two factions: those who automate and those who prefer manual coding.
With demand for database services and data volumes both growing at an average annual rate of 20%, the strain on IT resources is growing right along with them. But what if you could spend your time doing something more interesting than endless hours of manual data entry?
A report by BI experts at Transforming Data With Intelligence (TDWI) showed that over a third of analysts (37%) spend the majority of their time on data prep rather than analysis and interaction. The main reason for this is difficulty accessing and integrating data across systems or application silos.
So while automated tools are still sometimes viewed with suspicion by data ‘purists’ who favor SQL or other programming or scripting languages, even the most hardened coders can benefit from automated data integration systems.
Sometimes it’s simply a case of: if it ain’t broke, don’t fix it. But what if you could do things better? Let’s look at some of the arguments for and against.
Automation: what’s holding you back?
First things first, let’s look at data integration. It’s the most crucial part of the data preparation process: it delivers clean, accurate data and ensures that every piece of information in the analytical database has a single, truthful source.
Knowing how crucial the ETL stage of data processing is, data analysts, scientists, and developers may choose manual coding to give them a feeling of control over the data, rather than using automated data preparation tools.
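To make the comparison concrete, here is a minimal sketch of the kind of hand-coded ETL step in question: extract rows from a CSV file, clean them, and load them into an analytical table. The file, table, and column names are hypothetical examples, not a specific product’s workflow.

```python
# Minimal hand-coded ETL sketch: extract rows from a CSV, transform
# (clean) them, and load them into a SQLite "analytical" table.
# All file, table, and column names here are hypothetical.
import csv
import sqlite3

def etl(csv_path: str, db_path: str) -> int:
    """Run one extract-transform-load pass; return the number of rows loaded."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT PRIMARY KEY, amount REAL)"
    )
    loaded = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            # Transform: skip rows with missing fields or non-numeric amounts
            try:
                order_id = row["order_id"].strip()
                amount = float(row["amount"])
            except (KeyError, ValueError):
                continue
            conn.execute(
                "INSERT OR REPLACE INTO orders VALUES (?, ?)",
                (order_id, amount),
            )
            loaded += 1
    conn.commit()
    conn.close()
    return loaded
```

Even this toy version shows why the hand-coded route grows quickly: every new source, column, or cleaning rule means more code to write and maintain.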
But automation can take away some of the drudgery of repetitive cutting and pasting – freeing up time for actual analysis and getting the results you need from your data.
Automated systems can take the heat off data analysts, offering an alternative way to get the most out of raw data, and generally producing quicker results, in line with increasing demand – and providing cleaner and more transparent results.
If you have qualms over performance and stability issues with automated data integration, it’s worth remembering that the engineers who build these commercial tools are under constant pressure to improve performance and stability when working with large or disparate datasets.
They have the resources to dig deep into low-level, below-the-OS optimizations, examining how data is mapped and moved across the various layers of the computer’s memory and CPU cache. They also have the luxury of parallel processing and SIMD computing resources at their disposal.
Of course, you could achieve similar results via hand-coding. But is it really worth the time and money it would take? Could you be certain that the high-performance system you’ve created would bear larger loads as data and use cases grow in complexity?
With automated data integration, the systems you use are built specifically to handle large and complex data, and to scale with your needs.
Let’s look at an example.
Say you’re an ecommerce company with several billion rows of data. Your manual system works day to day, but then you need fast results showing seasonal changes, for instance.
With the manual system, the raw data has to be “shrunk” by analyzing a sample and extrapolating the results. The system simply isn’t agile enough for quick queries, and what’s more, accuracy suffers because you’re relying on sample data.
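The “shrink the data” workaround described above can be sketched in a few lines: estimate a total from a random sample and scale up, rather than scanning every row. The dataset here is a hypothetical stand-in, and the sketch makes the accuracy trade-off visible: the extrapolated figure carries sampling error.

```python
# Sketch of the sample-and-extrapolate workaround: estimate a total
# (e.g. seasonal revenue) from a random sample instead of scanning
# every row. The input data is hypothetical.
import random

def estimate_total(rows, sample_size, seed=0):
    """Estimate sum(rows) by sampling sample_size rows and extrapolating."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    sample = rng.sample(rows, sample_size)
    # Scale the sample total up to the full population
    return sum(sample) / sample_size * len(rows)
```

The estimate is fast but approximate, which is exactly the accuracy issue noted above: two different samples can give two different answers for the same question.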
When it comes to transparency, nothing quite beats automated tools.
With manual coding, wrangling large volumes of structured and unstructured data always ends up producing many, many lines of code, which can be difficult to manage in complex scenarios. What happens when you need to locate a change?
If one of your developers is off sick or on leave, it could take you hours or even days to locate a change they made.
The visual layer offered with data integration systems gives transparency, so that if anything goes wrong, it’s easy to modify or troubleshoot. Every change is clearly documented, immediately noticeable, and can be retraced – eliminating the need for backbreaking legwork to find out what’s gone wrong.
Manual coding will never be able to replicate this visual environment where you can see where data is coming from and going to, and which calculations are performed on it.
Let’s look at another example.
Take schema changes – imagine that new data sources need to be added and the data model needs updating.
Coding manually, you’d have to go back to the start, make the necessary changes and then run a series of tests to make sure everything still works.
Even more frustrating, in larger data teams or deployments, even a simple modification requires multiple people who are intimately familiar with the code.
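The fragility behind that schema-change example is easy to demonstrate. In the hypothetical sketch below, a transform hard-wired to column positions silently produces wrong results when a new source column is inserted, while a name-based lookup survives the change; the column names are illustrative only.

```python
# Illustration of schema-change fragility: hard-coded column positions
# break silently when the source schema changes, while a name-based
# mapping tolerates added columns. All column names are hypothetical.

def transform_by_position(row):
    # Assumes the layout is [id, region, revenue] -- brittle:
    # inserting a new column shifts everything after it.
    return {"id": row[0], "revenue": row[2]}

def transform_by_name(row, header):
    # Look fields up by name -- tolerant of columns being added.
    idx = {name: i for i, name in enumerate(header)}
    return {"id": row[idx["id"]], "revenue": row[idx["revenue"]]}
```

The position-based version is the one that forces you back to the start after every schema change; the name-based version is a small step toward the resilience that integration tools build in by default.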
Simplify your life
Why not make things easier for yourself? If you could automate scheduled updates for instance, it would be another thing to tick off your to-do list.
This is just one example of a process that can be automated. Updates are a crucial yet menial, basic task and automated systems will not only take care of this for you, but if anything untoward happens, it will notify you.
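The notify-on-failure behavior described above can be sketched in a few lines. The `refresh` and `notify` callables here are hypothetical stand-ins for whatever a real system would plug in (a data refresh job, an email or chat alert).

```python
# Minimal sketch of an automated scheduled update with failure
# notification. refresh and notify are hypothetical stand-ins for the
# real refresh job and alerting channel.

def run_scheduled_update(refresh, notify):
    """Run a data refresh; notify only if something goes wrong."""
    try:
        refresh()
        return True
    except Exception as exc:
        notify(f"Update failed: {exc}")
        return False
```

A scheduler (cron, or the tool’s own) would call this on a timer; the point is that the human only hears about it when something untoward happens.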
To save even more time, you won’t have to struggle with API calls or manual exports because data integration solutions can do it all for you via natively supported data sources that can connect to and extract the data. You can also save and reuse certain processes for new projects or data models using simple drag and drop.
Et voilà – suddenly you have more time than you thought you did.
Let’s think of another scenario. A business user wants to do their own data modeling by connecting a common field between two tables.
Using Excel for manual data mashups, a large company would end up with a massive file, and joining all those tables would take more than one person several days.
A BI solution could slash the time to analysis, taking hours rather than days to prepare all the data. It also gives more power to the business user, who can perform a straightforward mash-up independently.
When all’s said and done
You can always find individual use cases where manual coding works every bit as well as a BI solution. Some companies may have large data science teams who are more than happy continuing to use tried-and-tested manual processes instead of implementing a new tool, which would take some getting used to.
Some developers may even be frustrated by visual interfaces if they’re used to thinking in code. And data integration systems have their own workflow and logic, which can feel limiting compared with what you could build manually.
Ultimately, it’s a choice – and whether you use manual coding or automate the process may depend on the complexity of your data and how extensive the required data preparation is.
The battle between manual coding and automated data integration tools will continue long into the future, no doubt. But automation is being recognized more and more as an essential tool in the data analysis armory that can help us reach our goal quicker – after all, it’s the insights that really matter.
Ilan Hertz is head of digital marketing at Sisense, the leader in simplifying business intelligence for complex data. He has close to a decade of experience in applying data-driven methodologies in senior marketing positions in the technology industry.