April 12, 2018 | Written by: Jeff Jonas
Categorized: Analytics | GDPR
Organizations implementing GDPR are going to discover, sooner or later, that finding someone’s data following a subject access request is much easier said than done.
Most organizations think they can individually search every system, either manually or via a federated crawler that automatically queries each system. In fact, neither of these approaches works well. Manual search tends to be inconsistent and incomplete, and automated crawlers run into problems of their own that automation alone does not solve.
Deficiencies of Manual Search
- Volume: searching every system consistently is time consuming and challenging—especially if there are hundreds or thousands of different systems to search. Will the person searching remember to search the payroll database?
- Variation: it’s unlikely the person searching will remember to search for every variation, e.g., Elizabeth, Beth, and Liz, or the many spellings of Muhammed, including Muhd.
- Variability: it’s unlikely the person searching will try dates of birth with month and day transposed (a common data quality problem) or address inconsistencies like 123 N Main Street vs. 123 Main St.
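To make the variation and variability problems concrete, here is a minimal Python sketch of the expansions a thorough searcher would have to apply to every query. The nickname table, the abbreviation map, and the function names are all hypothetical and far smaller than anything a real deployment would need; the sketch only illustrates the idea.

```python
from datetime import date

# Hypothetical nickname groups; a real table would be far larger.
NICKNAMES = [
    {"elizabeth", "beth", "liz"},
]

# Hypothetical abbreviation map for street-address tokens.
ADDRESS_TOKENS = {"street": "st", "road": "rd", "north": "n", "east": "e"}
DIRECTIONALS = {"n", "s", "e", "w"}


def name_variants(first_name):
    """Return every known variant that shares a nickname group with first_name."""
    key = first_name.lower()
    for group in NICKNAMES:
        if key in group:
            return set(group)
    return {key}


def dob_variants(dob):
    """Return the date as given, plus the month/day-transposed date when one exists."""
    variants = {dob}
    if dob.day <= 12:  # only then can the day be read as a month
        variants.add(date(dob.year, dob.day, dob.month))
    return variants


def normalize_address(address, drop_directionals=False):
    """Lowercase, strip periods, abbreviate known tokens, optionally drop N/S/E/W."""
    tokens = [ADDRESS_TOKENS.get(t, t) for t in address.lower().replace(".", "").split()]
    if drop_directionals:
        tokens = [t for t in tokens if t not in DIRECTIONALS]
    return " ".join(tokens)
```

With these helpers, "123 N Main Street" and "123 Main St." reduce to the same string once directionals are dropped, and a date of birth of March 7 also yields July 3 as a candidate. Every one of these expansions multiplies the number of searches a person would need to run against each system.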
While automated search could in theory remedy the above deficiencies, there are other serious issues that are not easily solved, even with automation.
Deficiencies of Automated Crawlers for Search
- Constraints: legacy systems often don’t support search on such things as address, phone, email, etc. For example, a hotel reservation system may provide for search by reservation number or arrival date and last name, but not a means to search by email or phone.
- Completeness: if a data subject’s access request includes a name, address, and email, how will the crawler find records containing only a maiden name and phone?
- Comingling: just because records look alike doesn’t mean they are alike. What if you find a matching record, based on an email address, but that email was periodically shared by a husband and wife? Knowing the email was shared is essential to understanding who is who in your data before releasing any information.
- Contamination: many systems write searches to audit logs, which means every search can create more copies of the subject’s personal data. Imagine that the first time a subject asks for access you report a few instances, but the second time they ask you have to report hundreds of instances (due to meticulously logged searches)!
Enter GDPR and the Missing Link
Organizations dependent on individual system search will miss records that could have easily been identified with a single subject search index. Imagine searching for this information: Liz Reston, 123 E Court Rd, email@example.com, and finding all of this info:
1. Liz Reston, 123 E Court Rd, firstname.lastname@example.org
2. Elizabeth Reston, email@example.com, +1 (301) 499-4900
3. Beth Reston, +1 (301) 499-4900, firstname.lastname@example.org, email@example.com
4. Beth Smith-Reston, firstname.lastname@example.org, POB 19557
5. Beth Smith, POB 19557, email@example.com
Meanwhile, during the search, you could be advised to use caution about these records:
6. firstname.lastname@example.org, 123 E Court Rd
7. email@example.com, +1 (301) 499-4900
That’s because of the existence of this record:
8. Bob Reston, firstname.lastname@example.org
Individual system search, whether manual or automated, is not going to find records 3, 4, and 5, and it is likely to accidentally include records 6 and 7, while at the same time sprinkling more personal data across the enterprise’s search logs!
A centralized, entity-resolved index provides a single subject search mechanism that instantly suggests that group 1 (records 1-5) are matches, that group 2 (records 6 and 7) are only possible matches, and that record 8 is different but related (helping to highlight the fact that an email address is being shared by two people).
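The grouping described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any Senzing or IBM product works: the records, the placeholder email addresses, the nickname table, and the matching rule (two named records belong to the same person when they share a feature and their first names are compatible) are all hypothetical.

```python
from itertools import combinations

# Hypothetical nickname table mapping variants to one canonical first name.
CANONICAL_FIRST = {"liz": "elizabeth", "beth": "elizabeth", "elizabeth": "elizabeth"}

# Hypothetical records: (record id, name or None, set of identifying features).
RECORDS = [
    (1, "Liz Reston", {"123 e court rd", "liz@example.com"}),
    (2, "Elizabeth Reston", {"liz@example.com", "+13014994900"}),
    (3, "Beth Reston", {"+13014994900", "beth.smith@example.com", "family@example.com"}),
    (4, "Beth Smith-Reston", {"beth.smith@example.com", "pob 19557"}),
    (5, "Beth Smith", {"pob 19557", "bsmith@example.com"}),
    (6, None, {"family@example.com", "123 e court rd"}),
    (7, None, {"family@example.com", "+13014994900"}),
    (8, "Bob Reston", {"family@example.com"}),
]


def canonical_first(name):
    first = name.split()[0].lower()
    return CANONICAL_FIRST.get(first, first)


def resolve(records):
    """Group named records into entities using a small union-find."""
    parent = {rid: rid for rid, name, _ in records if name}

    def find(rid):
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]  # path compression
            rid = parent[rid]
        return rid

    named = [r for r in records if r[1]]
    for (a, name_a, feats_a), (b, name_b, feats_b) in combinations(named, 2):
        # Assumed rule: a shared feature plus a compatible first name links two records.
        if feats_a & feats_b and canonical_first(name_a) == canonical_first(name_b):
            parent[find(a)] = find(b)

    entities = {}
    for rid in parent:
        entities.setdefault(find(rid), set()).add(rid)
    return list(entities.values())


def subject_search(records, query_name, query_features):
    """Classify records as matches, possible matches, or related to the subject."""
    by_id = {rid: (name, feats) for rid, name, feats in records}
    subject = set()
    for entity in resolve(records):
        for rid in entity:
            name, feats = by_id[rid]
            if feats & query_features and canonical_first(name) == canonical_first(query_name):
                subject = entity
    known = set().union(*(by_id[rid][1] for rid in subject)) | query_features
    possible, related = [], []
    for rid, name, feats in records:
        if rid in subject or not feats & known:
            continue
        # Unnamed records cannot be confirmed; named ones belong to someone else.
        (related if name else possible).append(rid)
    return {"matches": sorted(subject), "possible": sorted(possible), "related": sorted(related)}


result = subject_search(RECORDS, "Liz Reston", {"123 e court rd", "liz@example.com"})
```

On this hypothetical data, the search links records 1 through 5 by chaining shared features, leaves the unnamed records 6 and 7 as possible matches, and reports record 8 as related because Bob shares an email address with the subject. A production entity resolution engine weighs far more evidence than this single rule, but the sketch shows why the chain from record 1 to record 5 is invisible to a search that only compares each record with the original query.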
Without an entity-resolved index, organizations striving to comply with GDPR requirements are going to have difficulty answering subject access requests completely and accurately.
IBM provides rich single subject search technologies within its InfoSphere Master Data Management Advanced and Standard Editions, for enterprises. The solutions offer extensive capabilities around consent and management.
But because GDPR affects organizations of all sizes, we at Senzing wanted to democratize entity resolution. Our Senzing ER Workbench allows organizations, regardless of their size or the technical skill of their staff, to instantly perform single subject search for a very low monthly fee.
Larger organizations can also use our Senzing ER Enterprise product (tools for application developers) to integrate single subject search into real-time information flows, case management, and network visualization solutions, like IBM InfoSphere Streams, IBM Case Manager, and IBM i2 Analyst's Notebook, respectively.