Securing confidential information and protecting customers' personal data are important objectives for every enterprise. Companies must also be able to promptly produce the reports that various regulations require. At the same time, a company must share business information internally with its employees, and externally with customers and business partners. IBM InfoSphere Guardium Data Redaction is a solution that addresses all of these competing objectives. Its automated redaction functions protect the privacy-related information in unstructured documents easily and efficiently.
One way to remove private information from paper documents is to blacken it out with a pen. You can use a word processing application to perform similar actions on electronic documents. However, this must be done carefully, because even when the information is masked on display screens and printouts, it may still survive as hidden data in the document file. It is therefore necessary to use a method that securely removes any private information. Another important consideration is the efficient processing of large numbers of documents that may exist at many locations in a company.
This article gives an overview of IBM InfoSphere Guardium Data Redaction's automated redaction functions, and shows how you can redact documents semi-automatically by using the redaction manager, which is a GUI client for controlled redaction.
Overview of IBM InfoSphere Guardium Data Redaction
As shown in Figure 1, IBM InfoSphere Guardium Data Redaction Version 2.1 includes a redaction server and a redaction manager. The redaction server automatically extracts and removes privacy-related information from documents, and provides the results in specified formats. The server also provides these redaction services for the redaction manager and other custom applications. The redaction manager is a Flex-based Web application that lets you perform interactive redaction, and review batch-redacted documents.
Figure 1. Components of IBM InfoSphere Guardium Data Redaction
Automated redaction process
Figure 2 shows the key concepts of the automated redaction process of the redaction server.
Figure 2. Automated redaction process
In the first step, the system converts a document into an internal format that combines a graphical and a textual document representation. This internal format makes it possible to use the same analytic approach for documents of various types. The candidate words to redact are then selected from the content by using the advanced entity extraction capabilities of the SystemT project from IBM Research.
To define the targets for entity extraction, a sophisticated SQL-like language is available, along with development tools.
One part of the entity extraction function is based on lists of words that should be redacted. The lists are prepared for categories such as people, locations, or organizations. The words that match any of the words in the lists are then selected to be redacted.
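As a rough illustration of list-based extraction, the following Python sketch matches words in a text against small per-category lists. The word lists, the function name `find_dictionary_matches`, and the simple tokenization are assumptions made for this example; they are not the SystemT language or any product API.

```python
import re

# Hypothetical word lists; real dictionaries would be far larger.
DICTIONARIES = {
    "Person": {"alice", "bob"},
    "Location": {"boston", "tokyo"},
    "Organization": {"ibm"},
}

def find_dictionary_matches(text, dictionaries=DICTIONARIES):
    """Return (start, end, category, word) for every word found in a list."""
    matches = []
    for m in re.finditer(r"[A-Za-z]+", text):
        word = m.group(0)
        for category, words in dictionaries.items():
            # Case-insensitive lookup against the category's word list.
            if word.lower() in words:
                matches.append((m.start(), m.end(), category, word))
    return matches
```

Each match carries its character offsets, so a later step can locate the word in the document and redact it.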
Another supported extraction technique is to use regular expressions to find words or patterns such as phone numbers, credit card numbers, or social security numbers. Words that match the patterns of the regular expressions are selected to be redacted.
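The pattern-based technique can be sketched in the same way. The regular expressions below are deliberately simplified, US-style formats chosen for illustration, and the names `PATTERNS` and `find_pattern_matches` are hypothetical; the patterns that the product actually ships with are not shown in this article.

```python
import re

# Simplified patterns for illustration only; real formats vary widely.
PATTERNS = {
    "Social Security Number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Phone Number": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "Credit Card Number": re.compile(r"\b(?:\d{4}[ -]){3}\d{4}\b"),
}

def find_pattern_matches(text, patterns=PATTERNS):
    """Return (start, end, category, matched_text) tuples sorted by position."""
    matches = []
    for category, pattern in patterns.items():
        for m in pattern.finditer(text):
            matches.append((m.start(), m.end(), category, m.group(0)))
    return sorted(matches)
```

Patterns like these find values that no word list could enumerate, but on their own they cannot tell a phone number from another number with the same shape, which is why context matters in practice.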
SystemT combines dictionary and regular expression techniques with sophisticated pattern recognition that uses context and syntax, in order to achieve high accuracy rates. Sensitive data selected in the textual document representation is mapped to the graphical document representation. For each document page, an image is created in which the sensitive data is blacked out. Optionally, the semantic category name, or another label for the sensitive data, can be printed on top of the redacted areas. Finally, the modified page images are reconstructed into a redacted output document in one of the supported formats. PDF, Microsoft Word, TIFF, and plain text documents are supported as input data, and any of these formats can also be specified for output without regard to the format of the input document.
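The mapping from the textual to the graphical representation can be pictured as follows. This sketch assumes, purely for illustration, that each word in the text carries its character offset and a bounding box on the page image; the `Word` type and `boxes_for_span` function are hypothetical and do not describe the product's internal format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    text: str
    start: int   # character offset of the word in the page text
    box: tuple   # (x, y, width, height) on the page image

def boxes_for_span(words, span_start, span_end):
    """Return the boxes of all words that overlap [span_start, span_end)."""
    return [
        w.box
        for w in words
        if w.start < span_end and w.start + len(w.text) > span_start
    ]
```

Given the character span of a sensitive value, the function returns the rectangles that would need to be blacked out on the page image.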
InfoSphere Guardium Data Redaction uses a policy model that lets you get appropriate output that suits your requirements. The policy model defines what kinds of semantic entities should be redacted in the documents. The model states whether a given role, for example administrative staff, may see a given entity type, for example social security number, phone number, or address. Table 1 shows examples of entity types and roles that are used for the redaction of medical records.
Table 1. Examples of entity types and roles
|Entity type|Primary physician role|Consulting physician role|Administrative staff|
A primary physician can see all of the fields in the medical records of his patient. When a record is sent to another physician to ask for an opinion, topics such as medical history, height, and weight are important, but the consulting physician does not need to refer to the SSN, phone number, or address. In contrast, when an administrative staff member at the hospital looks at this record for payment processing, only the SSN, phone number, and address information is needed. The primary physician can specify the role of the recipient to redact appropriate words, and prepare a different version of the document that suits each recipient.
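The role-based policy described above can be sketched as a mapping from each role to the entity types that the role may see. The data structure, the entity type names taken from the medical records example, and the function `entity_types_to_redact` are assumptions for illustration only, not the product's policy format.

```python
# Hypothetical policy: for each role, the entity types that role may see.
POLICY = {
    "primary physician": {
        "SSN", "Phone number", "Address",
        "Medical history", "Height", "Weight",
    },
    "consulting physician": {"Medical history", "Height", "Weight"},
    "administrative staff": {"SSN", "Phone number", "Address"},
}

def entity_types_to_redact(role, entity_types_in_document, policy=POLICY):
    """Return the entity types that must be hidden from the given role."""
    visible = policy.get(role, set())  # an unknown role sees nothing
    return sorted(set(entity_types_in_document) - visible)
```

With this model, preparing a version of a document for a different recipient only requires choosing a different role; the extraction results themselves do not change.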
Forms redaction with templates
You can use a template for another redaction method that is suitable for scanned forms, and which does not require any semantic analysis for redaction. The template contains information that is used to identify the documents and the locations for redaction. The locations for redaction are specified by coordinates and sizes of the redaction rectangles. The system must also deal with scanning inaccuracies such as skewing or scaling. Therefore this method is most suitable for scanned images of forms for which the fields have fixed locations. For more information on the redaction of forms using templates, see the Creating templates section.
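To make the idea of redaction rectangles concrete, the sketch below stores a rectangle as coordinates and a size, rescales it for a page scanned at a different size, and blacks out the covered pixels. The `Rect` type, the function names, and the representation of a page as a list of pixel rows are assumptions for this example. Skew correction is not handled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    width: int
    height: int

def scale_rect(rect, scale):
    """Scale a template rectangle for a page scanned at a different size."""
    return Rect(round(rect.x * scale), round(rect.y * scale),
                round(rect.width * scale), round(rect.height * scale))

def black_out(page, rect, fill=0):
    """Overwrite the pixels inside rect; page is a list of rows of pixel values."""
    top, bottom = max(rect.y, 0), max(rect.y + rect.height, 0)
    for row in page[top:bottom]:
        left = max(rect.x, 0)
        right = min(max(rect.x + rect.width, 0), len(row))
        for col in range(left, right):
            row[col] = fill
    return page
```

Because the rectangle is defined by position rather than by content, no text analysis is needed, which is why this approach only works when the fields sit at fixed locations.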
A repository is a group of directories for input and output documents, as well as for working files. These can simply be on a disk, or an Enterprise Content Management system can be linked to the redaction server using its connector API.
There are two kinds of repositories. The first type is a batch repository. The redaction server will monitor the batch directory to detect incoming documents for processing. If documents are simply copied to the input directory of the batch repository, the server will automatically redact those documents and save the results in the output directory, as shown in Figure 3.
Figure 3. Batch process via repository
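The batch behavior can be approximated with a short sketch that processes whatever files are waiting in an input directory. This is not the redaction server's implementation: `process_batch` is a hypothetical function, the `redact` callback stands in for the real redaction step, and removing each input file after processing is an assumption made so that repeated calls do not redo work.

```python
from pathlib import Path

def process_batch(input_dir, output_dir, redact):
    """Redact every file waiting in input_dir and write results to output_dir.

    `redact` is a caller-supplied function that maps bytes to bytes.
    Returns the names of the files that were processed.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for path in sorted(input_dir.iterdir()):
        if not path.is_file():
            continue
        (output_dir / path.name).write_bytes(redact(path.read_bytes()))
        path.unlink()  # assumption: clear the input so it is not redone
        processed.append(path.name)
    return processed
```

Calling a function like this on a timer would give the effect of a monitored directory: copying a file into the input directory is all a user has to do.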
To support different attributes such as roles and output formats, you can define several repositories, each configured separately. In this way, for example, output suited to readers in different roles can be created. For the medical records example shown in the previous section, there should be separate repositories for the consulting physician role and the administrative staff role to prepare the documents for those roles.
The other type of repository is an on-demand repository. It is used to present documents for redaction by client applications, including the redaction manager or a client application that uses the Java or SOAP APIs. These APIs allow custom applications to send requests from a remote machine to redact documents in the repository, or documents transmitted by the client, as shown in Figure 4.
Figure 4. Redaction service for clients via an on-demand repository
Supported document languages
English, German, French, and Spanish documents are supported in IBM InfoSphere Guardium Data Redaction Version 2.1. The user interface, however, is in English only.
The redaction manager is a client application used to access the redaction server from a Web browser to redact documents. When you open a document with this tool, you see the image of your redacted document in the window. You can then further redact the document with the GUI. You can also use the redaction manager to create templates and review the results of batch processing. The operations of the redaction manager are described and illustrated in more detail in the following section.
Using the redaction manager
The redaction manager gives you access to the redaction server from a Web browser on a remote client machine. The browser provides a GUI for redacting documents that are stored either on the redaction server or on the remote client machine.
This section describes the operations of the redaction manager, beginning with logging into the server to see the document-selection menu shown in Figure 5.
Figure 5. Document-selection menu
You can click the Redact documents or create template button to bring up the document selection panel. Clicking on the Repository folder tab also displays a list of sample documents in the repository, as shown in Figure 6.
Figure 6. Document selection list
InfoSphere Guardium Data Redaction has pre-defined policy types for 13 major categories, such as Person, Location, and Organization. Sample roles, including Restricted, are provided out of the box, and you can also create custom vertical roles that best suit your requirements.
If a document is selected and opened from this panel, it is automatically redacted by the server based on the selected role. The result is then shown in the document redaction screen. Figure 7 shows a sample document redacted with the General role.
Figure 7. Document redaction screen
Notice that Person, Social Security Number, and Phone Number are assigned to the General role as entity types that should be redacted. Text of these entity types is marked with a light blue rectangle on the screen. Once this redacted document is saved, these words are replaced with redaction rectangles. In this screen, you can make additional redactions, as well as remove unnecessary redactions. You can redact a word simply by selecting it with your mouse cursor. You can click Find to redact the same word globally in the document. You can also redact entire paragraphs, images, and pages by highlighting them with your mouse cursor. You can make the selected objects a template identifier by changing their attribute options in the property panel. For more information, see the Creating templates section.
You can click the Preview/Print button to check what the output document will look like. As Figure 8 shows, you will see black rectangles instead of the selected words or areas.
Figure 8. Preview screen
Once the redactions are finished, you can click Submit to save the document. In this example, the original document is loaded from the repository on the redaction server, and the output is saved in the same repository. It is also possible to open and save documents in the local file system on the remote client.
Creating templates
To redact forms, you can use the redaction manager to create templates to mark the relevant entities in the form, as shown in Figure 9.
Figure 9. Template creation
To create a template, open an example form in the redaction manager, preferably in PDF or TIFF format. The example can be blank or filled in. You must then mark at least two areas on the form as template identifiers to allow the server to distinguish instances of different forms. Ensure that the template identifiers are words or graphics that exist on all copies of this form, such as the title of the form, a document ID number, or a corporate logo. Drag your mouse to highlight the areas that you want redacted, then click Create a new template to save the result as a template file.
Once you've created the template, you can apply it to other filled-in documents of the same form by opening them in the redaction manager and clicking Apply template. The server compares the template identifiers in the template with the document currently opened in the redaction manager, and if they match, the specified areas of the document page are redacted. You can verify the results on the screen.
A template is especially useful when redacting forms in batch processing. For example, if a repository is configured with appropriate templates, then a large number of scanned hand-filled or machine-printed forms can be processed in a batch without human supervision, as shown in Figure 10.
Figure 10. Automated redaction process with template
Reviewing redacted documents
In some cases, visual confirmation may be required for documents that are automatically redacted by the redaction server. You can configure the server to temporarily hold the redacted documents for review before they are saved in the output directory. You can then review each of the pending documents by clicking the Review batch-redacted documents menu in the redaction manager, and correct or adjust the redacted documents as required. After you finish your review, you can click Submit to save each document in the output directory, which automatically opens the next pending document for review. Figure 11 shows how this semi-automatic review operation is continuous and rapid. You can also specify, as a percentage from 0% to 100%, the share of documents that are held for review rather than saved directly. This supports anything from full inspection to random sampling of the redacted results.
Figure 11. Review of the batch results
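The effect of the review percentage can be sketched as a per-document random decision. This example assumes that the configured value is the share of documents held for review; the function name `hold_for_review` and the use of a pseudo-random draw are illustrative guesses, not a description of how the server selects documents.

```python
import random

def hold_for_review(review_percent, rng=random):
    """Decide whether one redacted document is held for manual review."""
    if not 0 <= review_percent <= 100:
        raise ValueError("review_percent must be between 0 and 100")
    # rng.random() is in [0, 1), so 100 always holds and 0 never holds.
    return rng.random() * 100 < review_percent
```

A setting of 100 corresponds to full inspection, 0 to fully unattended processing, and anything in between to random sampling.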
Document redaction workflow
Many workflow options are available to process documents. For example, you can redact one document at a time as needed, or if a company already has a large number of archival documents, they can be redacted using batch processing. If documents are created periodically or unpredictably and they must be redacted each time they are created, then the server can be configured to continuously watch for such incoming documents and automatically redact them.
In an automated redaction process, a template preparation step can be added as a preliminary workflow and a review workflow can be added as post-processing.
Also, if InfoSphere Guardium Data Redaction is integrated with your company's enterprise content management system, then you can include these workflows in the system as a part of the business processes, as shown in Figure 12.
Figure 12. Redaction workflows integrated with ECM system
IBM InfoSphere Guardium Data Redaction is a solution to efficiently protect sensitive information in unstructured documents and forms. The redaction server redacts documents automatically by using advanced entity extraction software and templates. The redaction manager supports manual redaction and review of the redacted documents. This article gave an overview of the automated redaction functions, and provided examples of the redaction operations with the redaction manager.
The authors would like to thank Joshua Fox and Michael Pelts of the InfoSphere Guardium Data Redaction development team for their review and valuable advice.
- Learn about the SystemT project from IBM, an amalgam of two major research themes centered around analytics and search over unstructured content.
- Learn more about IBM Optim Data Redaction: Automatically recognize and remove sensitive content from documents and forms.
- In the IBM Optim Data Redaction: Reconciling openness with privacy white paper, learn about the characteristics of a complete automated redaction solution.