April 10, 2018 | Written by: Marco Brenner
Categorized: Analytics | Automation | GDPR | IBM Research
Within two years, most of today’s cybersecurity technologies will be obsolete.
Since the beginning of 2016, hackers have stolen more than 8 billion records – more than double the two previous years combined – and that doesn’t account for unreported intrusions.
The current system of patches, firewalls and blacklists isn’t working. It’s no match for the organized crime rings that carry out more than 80 percent of attacks, systematically probing for weaknesses, sharing tools and techniques, and continually developing countermeasures for even today’s most advanced security technologies.
The best course of action is to constantly innovate.
One method is known as fully homomorphic encryption, which makes it possible to perform computations on data while it remains encrypted, so the data in use never yields any private information. While this could be a great solution, it is still a few years from being practical due to processing speed.
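To make the homomorphic principle concrete, here is a toy sketch using textbook RSA, whose ciphertexts happen to be multiplicatively homomorphic: two ciphertexts can be multiplied without decryption, and the result decrypts to the product of the plaintexts. This is only an illustration of computing on encrypted data; it is not fully homomorphic encryption, the key sizes are deliberately tiny, and unpadded RSA is insecure in practice.

```python
# Toy demonstration of computing on encrypted data via the
# multiplicative homomorphism of textbook RSA. Insecure by design
# (tiny key, no padding); illustration only, not FHE.

n, e = 3233, 17          # demo public key (p = 61, q = 53)
d = 2753                 # matching private key

def encrypt(m: int) -> int:
    return pow(m, e, n)

def decrypt(c: int) -> int:
    return pow(c, d, n)

c1, c2 = encrypt(6), encrypt(7)
c_prod = (c1 * c2) % n   # multiply ciphertexts without ever decrypting
assert decrypt(c_prod) == 6 * 7   # the product is recovered: 42
```

A fully homomorphic scheme extends this idea to arbitrary additions *and* multiplications, which is what makes general-purpose computation on encrypted data possible, and also what makes it slow today.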
Another innovation is called pseudonymization, or, more plainly, desensitized data. The idea is simple, even obvious: transform data so that it looks and behaves like the real data, but isn't.
For the past several years, IBM cryptographers in Zurich have been developing this technology, which is now commercialized as the IBM High Assurance Desensitization Engine.
The timing couldn't be better, given the recent data privacy leaks and the EU's upcoming General Data Protection Regulation (GDPR), which seeks to create a harmonized data protection framework across the EU and imposes strict rules on anyone hosting, moving or processing this data, anywhere in the world.
The Pseudo Engine that Could
The technology works by creating replicas of production data that are significantly less sensitive than the original but maintain all desired characteristics for further use. Put simply, the data retains its utility while also being privacy-friendly.
The IBM tokenization solution works efficiently with different database technologies, provides consistent data across comprehensive application landscapes, includes advanced security functionality and scales to very large (production-size) volumes. These “tokenized” replicas can be used for activities such as data analytics and testing, to protect internal confidentiality and support regulatory compliance.
In fact, today we are announcing that Rabobank, the Dutch multinational bank and financial services company, is using the technology both for GDPR compliance and for performance testing in the development of innovative new technologies and services, such as mobile apps and payment solutions.
This is what Peter Claassen, Delivery Manager Radical Automation at Rabobank, said publicly about the application of the technology:
“It’s critical for our DevOps team to use data which is as close as possible to production during the testing phase, so when we go live, we are confident that our services will perform. Being able to test and iterate using pseudonymized data is going to unleash new innovations from our DevOps team bringing even more security, innovation and convenience to our clients.”
What Rabobank is referring to is not uncommon. In a world where data is considered a natural resource, many enterprises use production data, including personal client data, not only for its primary purpose but also to run analytics on it for better customer insight, or to use a copy of production data for testing, increasing the quality of software development and minimizing production incidents when deploying new releases.
Starting on 25 May, GDPR will impose stricter controls than its predecessor legislation on the use of personal data and the prevention of re-identification of individuals, in particular for use beyond the primary business need. As a result, these additional uses are no longer allowed in the same way; they are possible only under restrictive constraints.
Thankfully, this is where our technology helps. It provides data that is similar to the original in its behavior but bears a significantly lower risk of re-identifying individuals. For example, my name, birthday, address and bank account number would be converted to a completely random-looking set of identifiers.
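A minimal sketch of how such a conversion can work, assuming a keyed HMAC as the transformation: each identifier is deterministically mapped to a token, so the same input always produces the same token (which keeps joins across tables working), but without the secret key the token reveals nothing about the original value. The field names and key here are illustrative, not IBM's product API.

```python
# Deterministic pseudonymization sketch using a keyed HMAC.
# Hypothetical example; real deployments keep the key in an HSM.
import hmac
import hashlib

SECRET_KEY = b"demo-key-illustrative-only"

def pseudonymize(value: str) -> str:
    """Map a sensitive value to a stable, opaque 16-hex-char token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"name": "Marco Brenner", "iban": "NL91ABNA0417164300"}
token_record = {k: pseudonymize(v) for k, v in record.items()}

# Determinism preserves referential consistency across databases:
assert pseudonymize("Marco Brenner") == token_record["name"]
# The token itself no longer contains the original value:
assert token_record["iban"] != record["iban"]
```

The determinism is the key design choice: it is what lets analytics and test workloads join pseudonymized tables exactly as they would join the originals.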
The benefits are obvious. If this data were to fall into the wrong hands it would be completely useless. Therefore, the regulatory constraints for such data are considerably less restrictive, and the activities can be executed as before, under some basic operational and technical controls.
IBM’s Crypto Innovation
Traditional attempts to tokenize multiple application databases typically suffer from a tradeoff among tokenization security, cross-database interoperability, and scalability/efficiency. Our technology largely eliminates these dependencies and constraints, allowing for highly secure, high-performance tokenization that scales to large volumes.
The tokenization engine at its core provides not only advanced cryptography to protect the data, but also highly efficient functionality to maintain format and semantics of the original data, as well as the capability to cope with reserved values, consider black- and whitelists and manage exceptions and anomalies in existing data.
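To illustrate what format- and semantics-preserving tokenization with reserved values can look like, here is a simplified one-way sketch, assuming an HMAC-derived keystream: each digit is replaced by another digit, so the token keeps the length and character class of the original, while values on a reserved list (for example, test accounts) pass through unchanged. This is a didactic stand-in, not IBM's engine; production-grade format-preserving encryption uses vetted ciphers such as NIST's FF1.

```python
# Format-preserving, one-way tokenization sketch with reserved values.
# Hypothetical key and reserved list; illustration only.
import hmac
import hashlib

KEY = b"demo-key"
RESERVED = {"0000000000"}  # values that must survive tokenization as-is

def tokenize_digits(value: str) -> str:
    """Replace each digit with another digit, preserving length and format."""
    if value in RESERVED:
        return value  # reserved-value handling: pass through unchanged
    stream = hmac.new(KEY, value.encode(), hashlib.sha256).digest()
    return "".join(str((int(ch) + stream[i % len(stream)]) % 10)
                   for i, ch in enumerate(value))

acct = "4171643001"
tok = tokenize_digits(acct)
assert len(tok) == len(acct) and tok.isdigit()   # format preserved
assert tokenize_digits("0000000000") == "0000000000"  # reserved value kept
```

Because the token is all digits and the right length, downstream applications that validate account-number formats keep working against the tokenized replica without modification.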
The simultaneous availability of these capabilities enables us to process terabytes of data, tokenizing tens of billions of values quickly, and to exploit the built-in data consistency to support fully heterogeneous application landscapes.
All In When it Comes to Data Privacy
No industry is immune from the threat of a data privacy leak, which is why pseudonymization can be applied across any industry in support of regulatory compliance and the protection of company confidential information.
Typical use cases are the creation of production-grade test data (to increase quality of testing and maintain stable production systems during releases), secure analytics (development and execution of analytic queries on granular but less sensitive data, reducing the need for privileged access), or the exchange of sensitive data between parties for joint use but without disclosing the full information.
This technology is particularly good news for data scientists in sensitive fields, such as healthcare, who are looking to study aging demographics or the spread of diseases.
25 May Deadline for GDPR
Tokenization is a recurring activity and is best set up in a factory mode, preceded by a pilot project for general and client-specific configuration and tuning. Based on our experience, the definition of an initial tokenization configuration usually takes a few weeks and requires tight collaboration between client and IBM experts; the processing of a set of pilot databases depends on the availability of infrastructure and the accessibility of data in scope.
Overall, the preparation and setup of a data tokenization factory is expected to take between four and six months. In factory mode, each new database needs to go through some onboarding steps; once onboarded, a database can be reprocessed in a few days in a largely automated mode.