Share information while protecting privacy using IBM InfoSphere Guardium Data Redaction

Overview of automated redaction and the redaction manager

IBM® InfoSphere® Guardium® Data Redaction enables effective disclosure of public information while protecting confidential information of the enterprise, as well as personal customer data. It redacts private information in free-text documents and forms, based on the role of the reader. It accelerates information sharing within organizations, and across organization boundaries by supporting controlled information disclosure. In this article, get an introduction to the capabilities of this product. See how to use automated redaction to process a large number of documents in batch mode, and learn how to use the GUI client to redact and review automated redactions.

Yasutomo Nakayama (nakayama@jp.ibm.com), Yamato Software Development Laboratory (YSL), IBM

Yasutomo Nakayama is a member of the Enterprise Content Management development team in Yamato Software Development Laboratory (YSL), IBM Japan.



Eisuke Kanzaki (JL17613@jp.ibm.com), Yamato Software Development Laboratory (YSL), IBM

Eisuke Kanzaki is a member of the Enterprise Content Management development team in Yamato Software Development Laboratory (YSL), IBM Japan.



September 2011 (First published 27 December 2010)


Introduction

Securing confidential information and protecting the personal data of customers are important objectives for every enterprise. Companies must also be able to promptly produce reports that are required by various regulations. At the same time, a company must share business information internally with its employees, and externally with customers and business partners. IBM InfoSphere Guardium Data Redaction is a solution that addresses all of these conflicting objectives. Its automated redaction functions let you protect the privacy-related information in unstructured documents easily and efficiently.

One way to remove private information from paper documents is to blacken it out with a pen. You can use a word processing application to perform similar actions on electronic documents. However, this must be done carefully, because even when the information is masked on display screens and printouts, it may still survive as hidden data in the document file. It is therefore necessary to use a method that securely removes any private information. Another important consideration is the efficient processing of large numbers of documents that may exist at many locations in a company.

This article gives an overview of IBM InfoSphere Guardium Data Redaction's automated redaction functions, and shows how you can redact documents semi-automatically by using the redaction manager, which is a GUI client for controlled redaction.

Overview of IBM InfoSphere Guardium Data Redaction

As shown in Figure 1, IBM InfoSphere Guardium Data Redaction Version 2.1 includes a redaction server and a redaction manager. The redaction server automatically extracts and removes privacy-related information from documents, and provides the results in specified formats. The server also provides these redaction services for the redaction manager and other custom applications. The redaction manager is a Flex-based Web application that lets you perform interactive redaction, and review batch-redacted documents.

Figure 1. Components of IBM InfoSphere Guardium Data Redaction
The redaction manager resides on the web client and provides redaction, review, and template creation. It is connected to the redaction server, which hosts the automated redaction process and the services for clients (the redaction manager web service and the API).

Automated redaction process

Figure 2 shows the key concepts of the automated redaction process of the redaction server.

Figure 2. Automated redaction process
Input documents go through entity extraction, policy models, and templates in the redaction process, and are then reconstructed and output as documents again.

In the first step, the system converts a document into an internal format that combines a graphical and a textual document representation. This internal format makes it possible to use the same analytic approach for documents of various types. The system then selects candidate words for redaction from the content by using the advanced entity extraction capabilities of the SystemT project from IBM Research.

To define the targets for entity extraction, a sophisticated SQL-like language is available, along with development tools.

One part of the entity extraction function is based on lists of words that should be redacted. The lists are prepared for categories such as people, locations, or organizations. The words that match any of the words in the lists are then selected to be redacted.

Another supported extraction technique is to use regular expressions to find words or patterns such as phone numbers, credit card numbers, or social security numbers. Words that match the patterns of the regular expressions are selected to be redacted.
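
The two techniques above can be illustrated with a short Python sketch. It is not SystemT and does not use its rule language; the word lists, the patterns, and the names `DICTIONARIES`, `PATTERNS`, and `extract_entities` are assumptions made purely for illustration.

```python
import re
from typing import List, NamedTuple


class Match(NamedTuple):
    start: int      # offset of the first matched character
    end: int        # offset one past the last matched character
    category: str   # entity type, for example "Person" or "SSN"
    text: str       # the matched text itself


# Hypothetical word lists, one list per category.
DICTIONARIES = {
    "Person": ["John Smith", "Mary Jones"],
    "Location": ["Boston"],
}

# Hypothetical, deliberately simple patterns (US-style formats only).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Phone number": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_entities(text: str) -> List[Match]:
    """Return redaction candidates found by word lists and by patterns."""
    matches = []
    # Dictionary-based extraction: whole-word, case-insensitive lookups.
    for category, words in DICTIONARIES.items():
        for word in words:
            pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
            for m in pattern.finditer(text):
                matches.append(Match(m.start(), m.end(), category, m.group()))
    # Pattern-based extraction: regular expressions per category.
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            matches.append(Match(m.start(), m.end(), category, m.group()))
    # Sort by position so that later steps can walk the text in order.
    return sorted(matches)
```

Each match carries its category, which is what a later policy step would need in order to decide whether a given reader may see the matched text.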

SystemT combines the dictionary and regular-expression techniques with sophisticated pattern recognition that uses context and syntax to achieve industry-leading accuracy rates. Sensitive data selected in the textual document representation is mapped to the graphical document representation. For each document page, an image is created in which the sensitive data is blacked out. Optionally, the semantic category name, or another label for the sensitive data, can be printed on top of the redacted areas. Finally, the modified page images are reconstructed into a redacted output document in one of the supported formats. PDF, Microsoft Word, TIFF, and plain text documents are supported as input, and any of these formats can be specified for output regardless of the format of the input document.

Policy model

InfoSphere Guardium Data Redaction uses a policy model that lets you get appropriate output that suits your requirements. The policy model defines what kinds of semantic entities should be redacted in the documents. The model states whether a given role, for example administrative staff, may see a given entity type, for example social security number, phone number, or address. Table 1 shows examples of entity types and roles that are used for the redaction of medical records.

Table 1. Examples of entity types and roles
Entity type     | Primary physician role | Consulting physician role | Administrative staff
SSN             | yes                    | no                        | yes
Phone number    | yes                    | no                        | yes
Address         | yes                    | no                        | yes
Medical history | yes                    | yes                       | no
Height          | yes                    | yes                       | no
Weight          | yes                    | yes                       | no

A primary physician can see all of the fields in the medical records of a patient. When a record is sent to another physician to ask for an opinion, topics such as medical history, height, and weight are important, but the consulting physician does not need to refer to the SSN, phone number, or address. In contrast, when an administrative staff member at the hospital looks at this record for payment processing, only the SSN, phone number, and address are needed. The primary physician can specify the role of the recipient to redact the appropriate words, and prepare a different version of the document that suits each recipient.
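
One way to picture the policy model is as a lookup from entity type to the roles that may see it. The following Python sketch encodes Table 1 that way. It is only an illustration: the `POLICY` mapping, the lowercase role names, and the function `entity_types_to_redact` are assumptions, not the product's configuration format or API.

```python
from typing import Set

# Table 1 as a mapping: entity type -> roles that are allowed to see it.
POLICY = {
    "SSN": {"primary physician", "administrative staff"},
    "Phone number": {"primary physician", "administrative staff"},
    "Address": {"primary physician", "administrative staff"},
    "Medical history": {"primary physician", "consulting physician"},
    "Height": {"primary physician", "consulting physician"},
    "Weight": {"primary physician", "consulting physician"},
}


def entity_types_to_redact(role: str) -> Set[str]:
    """Return the entity types that must be hidden from the given role."""
    return {entity for entity, roles in POLICY.items() if role not in roles}
```

In this sketch a role that does not appear in the table is allowed to see nothing, so every entity type is redacted for it. Failing closed in this way is a design choice of the example, not a documented behavior of the product.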

Forms redaction with templates

Templates provide another redaction method that is suitable for scanned forms and that does not require any semantic analysis. The template contains information that is used to identify the documents and the locations for redaction. The locations for redaction are specified by the coordinates and sizes of the redaction rectangles. The system must also deal with scanning inaccuracies such as skewing or scaling. Therefore, this method is most suitable for scanned images of forms in which the fields have fixed locations. For more information on the redaction of forms using templates, see the Creating templates section.
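
To make the idea of redaction rectangles concrete, here is a minimal Python sketch of a rectangle defined on a template page and mapped onto a scanned page of a different size. The `Rect` class, the `scale_rect` function, and the pixel values are hypothetical; the sketch handles uniform scaling only and does not attempt the skew correction that a real system would also need.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Rect:
    x: float       # left edge, measured from the left of the page
    y: float       # top edge, measured from the top of the page
    width: float
    height: float


def scale_rect(rect: Rect,
               template_size: Tuple[float, float],
               page_size: Tuple[float, float]) -> Rect:
    """Map a rectangle from template coordinates to scanned-page coordinates.

    Both sizes are (width, height) pairs in the same unit, such as pixels.
    """
    sx = page_size[0] / template_size[0]   # horizontal scale factor
    sy = page_size[1] / template_size[1]   # vertical scale factor
    return Rect(rect.x * sx, rect.y * sy, rect.width * sx, rect.height * sy)
```

For example, a rectangle defined on a 1000 x 2000 template page shrinks by half in each direction when the scanned page is 500 x 1000, so its position and size still cover the same field.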

Repository

A repository is a group of directories for input and output documents, as well as for working files. These can simply be on a disk, or an Enterprise Content Management system can be linked to the redaction server using its connector API.

There are two kinds of repositories. The first type is a batch repository. The redaction server will monitor the batch directory to detect incoming documents for processing. If documents are simply copied to the input directory of the batch repository, the server will automatically redact those documents and save the results in the output directory, as shown in Figure 3.

Figure 3. Batch process via repository
shows documents placed in the input repository, then going through automated redaction process, then output

To support different attributes such as roles and output formats, you can define several repositories, each configured separately. In this way, for example, you can create different output suited to readers with various roles. For the medical records example shown in the previous section, there should be separate repositories for the consulting physician role and the administrative staff role to prepare the documents for those roles.
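
The batch flow can be sketched in a few lines of Python. This is not how the redaction server is implemented or configured: the `BatchRepository` class, the `process_pending` function, the restriction to `.txt` files, and the choice to delete each input file after processing are all assumptions made for this example. The redaction step itself is passed in as a function so that the sketch stays independent of any real API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class BatchRepository:
    input_dir: Path    # directory where incoming documents are dropped
    output_dir: Path   # directory where redacted results are written
    role: str          # reader role that this repository is configured for


def process_pending(repo: BatchRepository,
                    redact: Callable[[str, str], str]) -> List[Path]:
    """Redact every pending text file once and return the output paths.

    `redact` takes (document_text, role) and returns the redacted text.
    """
    repo.output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(repo.input_dir.glob("*.txt")):
        redacted = redact(source.read_text(encoding="utf-8"), repo.role)
        target = repo.output_dir / source.name
        target.write_text(redacted, encoding="utf-8")
        source.unlink()  # assumption: remove the input so it is not processed twice
        written.append(target)
    return written
```

Calling `process_pending` periodically for each configured repository, for example one repository per role, approximates the monitoring behavior described above.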

The other type of repository is an on-demand repository. It is used to present documents for redaction by client applications, including the redaction manager or a client application that uses the Java or SOAP APIs. These APIs allow custom applications to send requests from a remote machine to redact documents in the repository, or documents transmitted by the client, as shown in Figure 4.

Figure 4. Redaction service for clients via an on-demand repository
shows a custom application connecting to the redaction server through an API, and the redaction manager connecting through a web service

Supported document languages

English, German, French, and Spanish documents are supported in IBM InfoSphere Guardium Data Redaction Version 2.1. The user interface, however, is in English only.

Redaction manager

The redaction manager is a client application that you use to access the redaction server from a Web browser to redact documents. When you open a document with this tool, you will see the image of your redacted document in the window. You can then further redact the document with the GUI. You can also use the redaction manager to create templates and review the results of the batch processing. The operations of the redaction manager are described and illustrated in more detail in the following section.


Using the redaction manager

Redacting documents

The redaction manager gives you access to the redaction server from a Web browser on a remote client machine. The browser provides a GUI for redacting documents that are stored either in the repository on the redaction server or on the remote client machine.

This section describes the operations of the redaction manager, beginning with logging into the server to see the document-selection menu shown in Figure 5.

Figure 5. Document-selection menu
Menu shows 2 buttons: redact documents or create template, and review batch-redacted documents

You can click the Redact documents or create template button to bring up the document selection panel. Clicking on the Repository folder tab also displays a list of sample documents in the repository, as shown in Figure 6.

Figure 6. Document selection list
shows the list of documents available for selection

InfoSphere Guardium Data Redaction has pre-defined policy types for 13 major categories, such as Person, Location, and Organization. The sample General and Restricted roles are provided out of the box, and you can also create custom vertical roles that best suit your organization's needs.

If a document is selected and opened from this panel, it is automatically redacted by the server based on the selected role. The result is then shown in the document redaction screen. Figure 7 shows a sample document redacted with the General role.

Figure 7. Document redaction screen
screen shows fields related to a person's medical report

Notice that Person, Social Security Number, and Phone Number are assigned to the General role as entity types that should be redacted. Text of these entity types is marked with a light blue rectangle on the screen. Once this redacted document is saved, these words are replaced with redaction rectangles. In this screen, you can make additional redactions, as well as remove unnecessary redactions. You can redact a word simply by selecting it with your mouse cursor. You can click Find to redact the same word throughout the document. You can also redact entire paragraphs, images, and pages by highlighting them with your mouse cursor. You can make the selected objects a template identifier by changing their attribute options in the property panel. For more information, see the Creating templates section.

You can click the Preview/Print button to check what the output document will look like. As Figure 8 shows, you will see black rectangles instead of the selected words or areas.

Figure 8. Preview screen
shows the previous medical report screen with selected areas blacked out

Once the redactions are finished, you can click Submit to save the document. In this example, the original document is loaded from the repository on the redaction server, and the output is saved in the same repository. It is also possible to open and save documents in the local file system on the remote client.

Creating templates

To redact forms, you can use the redaction manager to create templates to mark the relevant entities in the form, as shown in Figure 9.

Figure 9. Template creation
shows the steps: first define template identifiers, then define places to be redacted, then save as template

To create a template, open an example form in the redaction manager, preferably in PDF or TIFF format. The example can be blank or filled in. Mark at least two areas on the form as template identifiers to allow the server to distinguish instances of different forms. Ensure that the template identifiers are words or graphics that exist on all copies of this form, such as the title of the form, a document ID number, or a corporate logo. Drag your mouse to highlight the areas that you want redacted, then click Create a new template to save it as a template file.

Once you've created the template, you can apply it to other filled-in documents of the same form by opening them in the redaction manager and clicking Apply template. The server compares the template identifiers in the template with the document currently opened in the redaction manager, and if they match, the specified areas of the document page are redacted. You can verify the results on the screen.

A template is especially useful when redacting forms in batch processing. For example, if a repository is configured with appropriate templates, then a large number of scanned hand-filled or machine-printed forms can be processed in a batch without human supervision, as shown in Figure 10.

Figure 10. Automated redaction process with template
documents are input, then go through the automated redaction process with a template, then are output with the appropriate areas blocked

Reviewing redacted documents

In some cases, visual confirmation may be required for documents that are automatically redacted by the redaction server. You can configure the server to temporarily hold the redacted documents for review before they are saved in the output directory. You can then review each of the pending documents by clicking the Review batch-redacted documents menu in the redaction manager, and correcting or adjusting the redacted documents as required. After you finish your review, you can click Submit to save each document in the output directory, which automatically opens the next pending document for review. As Figure 11 shows, this makes the semi-automatic review operation continuous and rapid. You can also specify, as a percentage from 0% to 100%, how many documents are held for review; the remaining documents are saved without review. This setting supports anything from full inspection to random sampling of the redacted results.

Figure 11. Review of the batch results
shows redaction server process with the extra step of sampling after the redaction process, then review through the redaction manager, then output
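
The review percentage described in this section can be thought of as a per-document random decision. The following Python sketch shows one way to express that. It is an assumption about the behavior, not a description of the server's actual sampling method, and the function name `should_hold_for_review` is hypothetical.

```python
import random


def should_hold_for_review(review_percentage: float, rng: random.Random) -> bool:
    """Decide whether one redacted document is held for manual review.

    review_percentage is the share of documents to hold, from 0 to 100.
    """
    if not 0 <= review_percentage <= 100:
        raise ValueError("review_percentage must be between 0 and 100")
    # rng.random() is in [0.0, 1.0), so 0 never holds and 100 always holds.
    return rng.random() * 100 < review_percentage
```

With a value of 100 every document is held, which corresponds to full inspection. With a value of 0 every document goes straight to the output directory. Values in between give random sampling, so the share of held documents approaches the configured percentage only over a large batch.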

Document redaction workflow

Many workflow options are available to process documents. For example, you can redact one document at a time as needed, or if a company already has a large number of archival documents, they can be redacted using batch processing. If documents are created periodically or unpredictably and they must be redacted each time they are created, then the server can be configured to continuously watch for such incoming documents and automatically redact them.

In an automated redaction process, a template preparation step can be added as a preliminary workflow and a review workflow can be added as post-processing.

Also, if InfoSphere Guardium Data Redaction is integrated with your company's enterprise content management system, then you can include these workflows in the system as a part of the business processes, as shown in Figure 12.

Figure 12. Redaction workflows integrated with ECM system
after redacted documents are output, they go to the ECM system. Documents are also input from the ECM system to the redaction server process.

Conclusion

IBM InfoSphere Guardium Data Redaction is a solution to efficiently protect sensitive information in unstructured documents and forms. The redaction server redacts documents automatically by using advanced entity extraction software and templates. The redaction manager supports manual redaction and review of the redacted documents. This article gave an overview of the automated redaction functions, and provided examples of the redaction operations with the redaction manager.


Acknowledgements

The authors would like to thank Joshua Fox and Michael Pelts of the InfoSphere Guardium Data Redaction development team for their review and valuable advice.
