Advanced redaction for better document workflow

Programming and configuration tips for the enhanced features of IBM InfoSphere Guardium Data Redaction

IBM® InfoSphere® Guardium Data Redaction removes sensitive data from documents that enterprises share across departments or with the public. The product supports various document formats. This article describes several features of InfoSphere Guardium Data Redaction along with configuration and programming tips, starting with XML document redaction.

Yasutomo Nakayama (nakayama@jp.ibm.com), Advisory Software Engineer, IBM

Yasutomo Nakayama is a member of the Data Security Compliance and Optimization team in the Tokyo Software Development Laboratory (TDSL) at IBM Japan.



Eisuke Kanzaki (JL17613@jp.ibm.com), Advisory Software Engineer, IBM

Eisuke Kanzaki is a member of the Data Security Compliance and Optimization team in the Tokyo Software Development Laboratory (TDSL) at IBM Japan.



11 July 2013

Introduction

Every day, companies produce many documents, forms, and files, and share them across organizational boundaries or disclose them externally. However, sensitive data in the documents, such as personal information, must be carefully checked and fully removed before the documents are exposed to the public. IBM InfoSphere Guardium Data Redaction is a solution for secure disclosure. It offers enterprises efficient methods to remove sensitive data from the large number of documents that are produced in their daily business (see Figure 1).

The redaction server automatically extracts and removes privacy-related information from documents, and provides the results in specified formats. The server also provides these redaction services for the redaction manager and other custom applications. The redaction manager is a Flex-based web application that lets you perform interactive redaction and review batch-redacted documents. Previous articles have explained how the automated redaction works and how you can create your own custom applications using a redaction server as part of your business workflow. Since the first release, InfoSphere Guardium Data Redaction has been updated several times. Many features have been added and other functions have been enhanced to improve its usability.

This article covers the new functions and features that have been added to or enhanced in IBM InfoSphere Guardium Data Redaction V2.1 and V2.5. It is assumed that you have basic knowledge of the product. Refer to Resources for links to the other articles that discuss redaction.

Figure 1. Business document data redaction with IBM InfoSphere Guardium Data Redaction

XML document redaction

IBM InfoSphere Guardium Data Redaction V2.5 supports XML as a redactable format. The redaction client API has been enhanced so that you can redact XML documents in your custom applications. This is important for enterprise customers because XML documents are semi-structured, combining structured elements with free text. They are easy for computers to process and easy to extend, and they are well suited to exchanges between organizations that use different systems and applications. This section gives programming tips for the new API and explains how to configure the redaction server.

To use the XML redaction function, an XML Information Specification (XIS) file must be prepared. The XIS (itself an XML document) defines how the XML document will be redacted. Listing 1 shows an example XIS. Two types of elements are used in the XIS file to specify how the XML documents should be redacted: freeText elements and typedEntity elements. A freeText element can include one or more textPath elements. A typedEntity has one or more nodePath elements. Both textPath and nodePath elements contain XPath expressions defining target nodes.

Listing 1. Example of XML Information Specification (XIS)
<ns25:xmlInfoSpec xmlns:ns25="http://ibm.com/igdrxml25">
    <ns25:freeText>
        <ns25:textPath>//patientRecord/nurseNotes/text()</ns25:textPath>
    </ns25:freeText>	

    <ns25:typedEntity semanticCategoryId="6">
        <ns25:nodePath>//patientRecord/ssn</ns25:nodePath>
    </ns25:typedEntity>
</ns25:xmlInfoSpec>

If a target node is specified by a freeText element, then the content of the node is evaluated as free text by the redaction server. The server finds the words to be redacted by using the powerful AQL language, part of IBM's System T, which uses advanced techniques to extract information. The server then replaces the words with predefined strings (labels), representing the semantic category of the information. In this example, the nurseNotes elements in the target XML file are evaluated and any private information is redacted.

If a target node is specified by a typedEntity element, then the entire content of the node is identified as an expression of the semantic category that is specified by the semanticCategoryId attribute. For example, if a node is specified in a typedEntity element as a Social Security Number (semanticCategoryId=6), then the content of the node is identified as an SSN. If it is prohibited to show an SSN to the document's recipients, then the redaction server replaces the body of the node with the appropriate label (in this example, with "[SSN]").
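
To see which nodes the two kinds of path select, the XPath expressions from Listing 1 can be evaluated locally with the standard JDK XPath API. The following is a minimal sketch and not part of the product: the class name XisPathDemo and the helper select() are hypothetical, and the sample document is a shortened patient record of the kind shown in Listing 3. It only demonstrates node selection; the actual detection and replacement of sensitive data are performed by the redaction server.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Local illustration only: evaluates XIS-style XPath expressions with the JDK, not with the redaction server. */
final class XisPathDemo {

    // Shortened sample record (hypothetical test data modeled on the article's input document).
    static final String SAMPLE_XML =
        "<medicalData><patientRecord>"
        + "<department>Obstetrics</department>"
        + "<ssn>393-55-3113</ssn>"
        + "<nurseNotes nurseId=\"KJ8838383\">Mrs. Mary Jones was admitted at 9:27 AM.</nurseNotes>"
        + "</patientRecord></medicalData>";

    /** Returns the text content of every node that the XPath expression selects. */
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // textPath from Listing 1: selects the text nodes that would be scanned as free text.
        System.out.println("textPath: " + select(SAMPLE_XML, "//patientRecord/nurseNotes/text()"));
        // nodePath from Listing 1: selects the elements whose whole content is one typed entity.
        System.out.println("nodePath: " + select(SAMPLE_XML, "//patientRecord/ssn"));
    }
}
```

Running the expressions against a representative input document before deploying an XIS file is a simple way to confirm that each path matches the nodes you expect.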

XML redaction is not available through automated batch processing or the redaction manager, and redacted XML files cannot be displayed in the secure viewer. However, you can create a client application with batch redaction capabilities by using the APIs. A sample application that illustrates how to invoke these functions is provided, and you can use it as the starting point for your own application.

XML redaction can be invoked on the server from a custom application by using the redaction client API. Six methods of the RedactionToolkitClient class (see Table 1) and some related classes have been added to the API for XML document redaction. The first four methods are used in almost the same way as the corresponding methods for ordinary documents, such as redactDocumentByRules() or redactRepositoryDocumentForRole(). The only difference is that the new methods require XIS information. There are two ways to pass the XIS information to the methods: the body of an XIS file can be passed in a String variable, or an XIS file name can be passed in a RedactionAttributes class object.

Table 1. Enhanced methods for XML document redaction
Method                           Operation
redactXmlByRules()               Redact an XML document according to the specified rule
redactXmlForRole()               Redact an XML document for the specified recipient role
redactRepositoryXmlByRules()     Redact an XML document in the repository according to the specified rule
redactRepositoryXmlForRole()     Redact an XML document in the repository according to the permissions for the specified recipient role
redactRepositoryXmlsByRules()    Redact XML documents in the repository according to the specified rule
redactRepositoryXmlsForRole()    Redact XML documents in the repository according to the permissions for the specified recipient role

Listing 2 shows an example of how to use the redactXmlForRole() method. This program sends a request to a server to redact a document on a client machine. The input document has been loaded into the inputDocBytes variable (but the loading steps have been omitted from the listing). The XIS information is loaded from a file named sampleXmlInfoSpec.xis and passed as the third argument of the redactXmlForRole() method.

Listing 2. Example program using the redactXmlForRole() method (excerpted)
// Set Redaction Attributes
//     Set output format
RedactionAttributes redactionAttributes = new RedactionAttributes("application/xml");

// prepare Document class object
//     load input document data 
byte[] inputDocBytes; // binary document data 
...
//     Set input format and language  
DocumentAttributes documentAttributes = new DocumentAttributes("InputDocument.xml",
        "application/xml", null);
documentAttributes.addAttribute(DocumentAttributes.LANGUAGE, new AttributeValue("en"));
//     Set input document data, attributes and status 
Document document = new Document(inputDocBytes, documentAttributes,
        DOCUMENT_STATE.NOT_REDACTED);

// load xis
String xisContentFile = "sampleXmlInfoSpec.xis";
String xisContents = FileUtils.readTextFile(xisContentFile, "UTF-8");

// prepare role ID
int ROLE_RESTRICTED = 1000; // "RESTRICTED" role in the sample policy model

// redact a document
Document redactedDocument = client.redactXmlForRole(redactionAttributes, document,
        xisContents, ROLE_RESTRICTED);
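
The loading steps that Listing 2 omits could be written with the standard java.nio.file API, as in the following sketch. The class and method names (RedactionInputLoader, readDocumentBytes(), readXis()) are hypothetical and are not part of the redaction client API; the file names are the ones used in Listing 2. The XIS file is decoded as UTF-8, which matches the encoding argument in the listing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that prepares the two inputs used in Listing 2. */
final class RedactionInputLoader {

    /** Reads the whole input document as raw bytes, for use as inputDocBytes. */
    static byte[] readDocumentBytes(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read document " + file, e);
        }
    }

    /** Reads the XIS file as UTF-8 text, for use as xisContents. */
    static String readXis(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot read XIS file " + file, e);
        }
    }

    public static void main(String[] args) {
        byte[] inputDocBytes = readDocumentBytes(Paths.get("InputDocument.xml"));
        String xisContents = readXis(Paths.get("sampleXmlInfoSpec.xis"));
        System.out.println("Document bytes: " + inputDocBytes.length);
        System.out.println("XIS characters: " + xisContents.length());
    }
}
```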

Listing 3 is the sample XML input to be redacted. Listing 4 is a sample redacted output for the input document. This redaction is performed by the program in Listing 2 with the XIS in Listing 1. Certain words (personal names, times, and dates) and the ssn elements are redacted.

Listing 3. Example of XML document to be redacted (input)
<medicalData> 
	<patientRecord>
		<department>Obstetrics</department>
		<ssn>393-55-3113</ssn>
		<nurseNotes nurseId="KJ8838383">Mrs. Mary Jones was admitted at 9:27 AM on 1 November 2009.</nurseNotes> 
	</patientRecord> 
	<patientRecord>
		<ssn>371-22-3459</ssn>
		<department>Surgery</department>
		<nurseNotes nurseId="FJ82920909">The patient, James Smith, arrived at 7:17 AM on 31 August 2009.</nurseNotes>
	</patientRecord> 
</medicalData>

Listing 4. Example of redacted XML document (output)
<?xml version="1.0" encoding="UTF-8"?><medicalData> 
	<patientRecord>
		<department>Obstetrics</department>
		<ssn>[SSN]</ssn>
		<nurseNotes nurseId="KJ8838383">Mrs. [Person] was admitted at [Time] on [Date].</nurseNotes> 
	</patientRecord> 
	<patientRecord>
		<ssn>[SSN]</ssn>
		<department>Surgery</department>
		<nurseNotes nurseId="FJ82920909">The patient, [Person], arrived at [Time] on [Date].</nurseNotes>
	</patientRecord> 
</medicalData>
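
The typedEntity part of this result (the ssn elements) can be imitated locally to make the effect of a nodePath concrete. The sketch below is only a simulation with the JDK's DOM, XPath, and Transformer APIs; it is not how the redaction server is implemented, and the class name TypedEntityLabelDemo and the method replaceWithLabel() are hypothetical. The free-text redaction of the nurseNotes elements is not simulated, because it depends on the server's text analysis.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Local simulation of typedEntity output; the real redaction happens on the server. */
final class TypedEntityLabelDemo {

    /** Replaces the content of every node selected by nodePath with the given label. */
    static String replaceWithLabel(String xml, String nodePath, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(nodePath, doc, XPathConstants.NODESET);
            for (int i = 0; i < nodes.getLength(); i++) {
                // The whole body of the node is treated as one entity and replaced.
                nodes.item(i).setTextContent(label);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process " + nodePath, e);
        }
    }

    public static void main(String[] args) {
        String input = "<patientRecord><ssn>371-22-3459</ssn>"
            + "<department>Surgery</department></patientRecord>";
        System.out.println(replaceWithLabel(input, "//patientRecord/ssn", "[SSN]"));
    }
}
```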

The redactRepositoryXmlsByRules() and redactRepositoryXmlsForRole() methods are used to efficiently redact many documents in one repository at the same time. Supported repositories include FileNet P8 V4.5.1, V5.0, and V5.1, as well as IBM Content Manager V8. If a redaction processor for the repository is configured with the multi-thread option, then the documents are redacted in parallel for potentially faster throughput. Listing 5 shows how to use the redactRepositoryXmlsForRole() method. This example program searches for the input XML files in the "on-demand" repository. Unlike the program in Listing 2, the XIS information is loaded from an XIS file in the repository by referring to the file name, which is passed to the redaction method in a RedactionAttributes class object. In this case, the third argument of the method should be null. In the final phase, the XML files are redacted by one call of the redactRepositoryXmlsForRole() method.

Listing 5. Example program using the redactRepositoryXmlsForRole() method (excerpted)
// Set Redaction Attributes
// Set output format and xis information
RedactionAttributes redactionAttributes = new RedactionAttributes("application/xml"); 
redactionAttributes.addAttribute(RedactionAttributes.REDACT_WITH_INFO_SPEC_IN_REPOSITORY, 
new AttributeValue("sampleXmlInfoSpec.xis"));

// Set DocumentReference class objects
// Search and set information of documents
DocumentReference[] docRefs = client.queryDocumentRepository(
        new DocumentRepositorySearchCriteria("on-demand",
                DocumentRepositorySearchCriteria.SEARCH_WITHIN.NOT_REDACTED));

// prepare role ID
int role = 1000; // "RESTRICTED" role in the sample policy model

// redact the documents 
DocumentReference[] redactedDocRefs = client.redactRepositoryXmlsForRole(
        redactionAttributes, docRefs, null, role);

Secure viewer

In a typical redaction scenario, documents are processed in a batch by the redaction server in the first step. An administrator can review the results using the redaction manager GUI, as required (Figure 2). The redacted documents can be distributed to readers. For example, a redacted PDF file may be sent to an authorized reader by email. However, after it is sent, control of the document is lost, and the recipient may possibly forward the document to unauthorized persons.

The new secure viewer GUI supports need-to-know viewing. Using this GUI, users can view redacted documents that are permitted for their role or for lower roles. Users can open and read documents, but the GUI does not let them copy the text or save the documents directly to local storage. Therefore, if a user's credentials expire, access to the document can be blocked immediately.

The viewer also supports text-search, even for originally graphical documents such as TIFFs.

Figure 2. Document Data Redaction Workflow with Redaction Manager and Secure Viewer GUI

In addition to general output document formats, such as PDF, Microsoft Word, plain text, and TIFF images, the redaction server supports the secure document viewer format (.sdvf). This is a special internal format for the secure viewer. With the .sdvf output, the enhanced function "securely retrieve and fill in" is available.

In some cases, even a fully permitted type of information still needs extra precautions. The redaction system can be configured to redact the information, yet support need-to-know viewing with the secure viewer.

The configuration for the redaction policy is defined in the XmlPolicyModel.xml file on the server. Listing 6 shows an example of this configuration file. The permission elements define which semantic categories are redacted for which user roles.

Listing 6. Examples of definitions of redaction conditions (XmlPolicyModel.xml)
...
<ns21:permission userRoleId="1001" semanticCategoryId="3">
	<ns21:redact/>
</ns21:permission>
...
<ns21:permission userRoleId="1001" semanticCategoryId="10">
	<ns21:reveal/>
</ns21:permission>
...

If a redact element is defined, information in the specified category is securely and unconditionally deleted for the specified user role. If a reveal element is defined instead, the information is securely deleted, but can be securely retrieved from the original document on the server and displayed when an authorized reader requests to see it. Optionally, the requesting user can be required to specify their business purpose for the display.
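
The difference between the two permission types can be pictured with a small in-memory model. This is a hypothetical sketch for illustration only: the class PermissionModelDemo, its methods, and the Action enum are not part of the product, and the model does not read XmlPolicyModel.xml. In particular, treating a role and category pair with no entry as not revealable is an assumption of the sketch, not documented server behavior.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of permission entries; not the server's implementation. */
final class PermissionModelDemo {

    /** REDACT: removed unconditionally. REVEAL: removed, but retrievable on request. */
    enum Action { REDACT, REVEAL }

    private final Map<String, Action> permissions = new HashMap<>();

    /** Registers one permission entry, mirroring one permission element in Listing 6. */
    void define(int userRoleId, int semanticCategoryId, Action action) {
        permissions.put(userRoleId + ":" + semanticCategoryId, action);
    }

    /** True only when a REVEAL entry exists for the role and category (sketch assumption). */
    boolean canRevealOnRequest(int userRoleId, int semanticCategoryId) {
        return permissions.get(userRoleId + ":" + semanticCategoryId) == Action.REVEAL;
    }

    public static void main(String[] args) {
        PermissionModelDemo model = new PermissionModelDemo();
        model.define(1001, 3, Action.REDACT);   // first entry in Listing 6
        model.define(1001, 10, Action.REVEAL);  // second entry in Listing 6
        System.out.println("Category 3 revealable for role 1001: " + model.canRevealOnRequest(1001, 3));
        System.out.println("Category 10 revealable for role 1001: " + model.canRevealOnRequest(1001, 10));
    }
}
```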

For example, in the sample configuration, organization names (semanticCategoryId=3) are deleted and replaced by the label "[Organization]", while person names (semanticCategoryId=10) are also replaced but can be revealed by an authorized "General" role user (userRoleId=1001). To reveal a redacted word, the user clicks the label to select it and then clicks the "Reveal selected entity" icon. The user may then be asked (or required, depending on the configuration) to provide a business purpose, either by choosing one of the options that are configured in a combo box or by entering free text. Then, the redacted text is securely retrieved from the server and displayed (Figure 3). The requests to display redacted information are logged by the redaction server so that an auditor can review those records later.

Figure 3. Securely retrieving and displaying redacted information using the need-to-know viewing function

The advantage of role-based redaction and role-based need-to-know viewing is that the administrator does not need to prepare a different version of each document for each role. If a document is generated for the role with the lowest level of access, then all of the recipients can browse it, and readers whose roles have higher levels of access can still display the information that their roles permit.


Stamp for Bates Numbering

The stamp function has been extended. In addition to a fixed string, the date and time of the redaction can be inserted into the stamp text in a format of the user's choice. An auto-incremented Bates serial number, often specified in legal document management scenarios such as eDiscovery, is also supported in the stamp. This function is useful when processing documents in bulk. Listing 7 and Figure 4 show an example of the use of the stamp function. In addition, document attributes (such as file name or content type) can be specified for inclusion in the stamp. A globally unique identifier (GUID) for each stamp is also available.

Listing 7. Example of the stamp label definition
<redactionStamp enabled="true">
	<label>
		<text>REDACTED for external readers: </text>
		<date format="yyyy/MMM/dd HH:mm:ss"/>
		<text> Serial #: </text>
		<batesNumber startNumber="1"/>
	</label>
	<position>UpperLeft</position>
</redactionStamp>

Figure 4. Example of stamp output
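
The way the label in Listing 7 is assembled can be approximated in a few lines of Java. This sketch assumes that the format attribute follows Java date pattern syntax, which the pattern in the listing resembles but which is an assumption here; the class StampLabelDemo and its method are hypothetical and do not reproduce the server's stamp code. The counter starts at the value given by startNumber and increases by one for each stamped document.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

/** Hypothetical imitation of the stamp label from Listing 7. */
final class StampLabelDemo {

    // Same pattern string as the date element in Listing 7 (assumed to be a Java date pattern).
    private final DateTimeFormatter dateFormat =
        DateTimeFormatter.ofPattern("yyyy/MMM/dd HH:mm:ss", Locale.ENGLISH);

    private long nextBatesNumber;

    StampLabelDemo(long startNumber) {
        this.nextBatesNumber = startNumber;
    }

    /** Builds the label for one document and advances the Bates counter. */
    String nextLabel(LocalDateTime redactedAt) {
        return "REDACTED for external readers: "
            + dateFormat.format(redactedAt)
            + " Serial #: "
            + nextBatesNumber++;
    }

    public static void main(String[] args) {
        StampLabelDemo stamp = new StampLabelDemo(1);
        LocalDateTime now = LocalDateTime.now();
        System.out.println(stamp.nextLabel(now));
        System.out.println(stamp.nextLabel(now));
    }
}
```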

Adding a watermark to a document with a postprocessor plug-in

A postprocessor plug-in added to a redaction server can perform customized processing on the redacted documents. The product includes a sample configuration file for a graphical watermark. This section describes the configuration steps to add watermarks to documents.

Preparation

A watermark can be added if the redacted document is output as a PDF file, regardless of the input format. For this process, the iText PDF library should be obtained from its website (http://itextpdf.com/) and added to the <IBM_REDACTION_HOME>\server\plugins\redaction-postprocessor-watermark\lib directory on the redaction server. <IBM_REDACTION_HOME> is the program file directory that was specified during the installation process.

The watermark image that will be embedded into the documents must be created as a transparent PNG file. The background of the image must be fully transparent, and the foreground image (the actual watermark) must be partially transparent.
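
Any image editor can produce such a file. As one possible approach, the following sketch creates an image of this kind with the JDK's BufferedImage and ImageIO classes. The class name WatermarkImageDemo, the image size, the color, and the simple rectangle used as the foreground are all arbitrary choices for illustration; only the output file name WatermarkImg.png is taken from this article.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Illustrative generator for a PNG with a transparent background and a translucent foreground. */
final class WatermarkImageDemo {

    static BufferedImage createWatermark(int width, int height) {
        // A new TYPE_INT_ARGB image starts with every pixel fully transparent.
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = image.createGraphics();
        try {
            // Red with a low alpha value, so the foreground is only partially opaque.
            g.setColor(new Color(255, 0, 0, 64));
            int margin = Math.min(width, height) / 4;
            g.fillRect(margin, margin, width - 2 * margin, height - 2 * margin);
        } finally {
            g.dispose();
        }
        return image;
    }

    static void write(BufferedImage image, File target) {
        try {
            if (!ImageIO.write(image, "png", target)) {
                throw new IOException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        write(createWatermark(400, 200), new File("WatermarkImg.png"));
    }
}
```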

Enable the plug-in

Open the file <IBM_REDACTION_DATA>\server\conf\pdfWatermark.xml with a text editor and add the definition that is shown in Listing 8. <IBM_REDACTION_DATA> is the data file directory that contains configuration files, log files, and samples. Its location is defined by the installation program and depends on the operating system on which the product is installed. The XML file specifies the postprocessor name, a global flag, the watermark image file, its position, and the target pages to receive the watermark. In Listing 8, WatermarkImg.png is the file name of the watermark image file.

Listing 8. Example of a watermark plug-in configuration (pdfWatermark.xml)
<?xml version="1.0" encoding="UTF-8"?>
<postprocessor name="PdfRedactedWatermark">
 	<enableGlobally>false</enableGlobally>
	<watermarkFile>
	/server/conf/WatermarkImg.png
	</watermarkFile>
	<lowerLeft>
		<x>0</x>
		<y>0</y> 
	</lowerLeft>
	<pages>1</pages>
</postprocessor>
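
Before the server is restarted, a configuration like this can be checked for well-formedness and for the expected values with a short program. The sketch below uses only the JDK's DOM parser; the class WatermarkConfigCheck and the method read() are hypothetical, and the sketch does not reflect how the server itself reads the file. The embedded sample repeats the content of Listing 8, and the values are trimmed only for display because the watermarkFile element in the listing spans several lines.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical sanity check for a pdfWatermark.xml definition. */
final class WatermarkConfigCheck {

    // Same content as Listing 8.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<postprocessor name=\"PdfRedactedWatermark\">\n"
        + "  <enableGlobally>false</enableGlobally>\n"
        + "  <watermarkFile>\n  /server/conf/WatermarkImg.png\n  </watermarkFile>\n"
        + "  <lowerLeft><x>0</x><y>0</y></lowerLeft>\n"
        + "  <pages>1</pages>\n"
        + "</postprocessor>";

    /** Parses the definition and returns its settings with surrounding whitespace removed. */
    static Map<String, String> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, String> settings = new LinkedHashMap<>();
            settings.put("name", doc.getDocumentElement().getAttribute("name"));
            for (String tag : new String[] {"enableGlobally", "watermarkFile", "x", "y", "pages"}) {
                settings.put(tag, doc.getElementsByTagName(tag).item(0).getTextContent().trim());
            }
            return settings;
        } catch (Exception e) {
            throw new IllegalStateException("Malformed or incomplete watermark definition", e);
        }
    }

    public static void main(String[] args) {
        read(SAMPLE).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```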

Open the configuration file <IBM_REDACTION_DATA>\server\conf\plugins.xml with a text editor and remove the comment markers around the definition that is shown in Listing 9 to enable it. This causes the redaction server to load the pdfWatermark.xml configuration file when it starts.

Listing 9. Registration of the watermark plug-in in plugins.xml
<plugin>
    <pluginClass>com.ibm.nex.redaction.postprocessors.PdfWatermarker</pluginClass>
    <configFile>pdfWatermark.xml</configFile>
</plugin>

If the global flag is set to true, then the watermark is applied to all of the documents that are processed by any of the features of the redaction server. If the global flag is set to false, then the postprocessor name must be specified in the configuration file of each repository processor to which the watermark should be applied. To do so, add the definition that is shown in Listing 10 to those configuration files (for example, to the <IBM_REDACTION_DATA>\server\conf\batchFileSystemProcessor.xml file).

Listing 10. Postprocessor specification for the watermark plug-in in repository processor configuration file
<postprocessors>
<postprocessor>PdfRedactedWatermark</postprocessor> 
</postprocessors>
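
The rule that combines the global flag with the per-processor list can be summarized as a single condition. The following sketch only restates that rule in code; the class PostprocessorSelectionDemo and the method applies() are hypothetical and are not taken from the product.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Restates the selection rule for a postprocessor; illustration only. */
final class PostprocessorSelectionDemo {

    /**
     * A postprocessor runs for a repository processor when it is enabled globally,
     * or when that processor's configuration names it explicitly.
     */
    static boolean applies(boolean enableGlobally, String postprocessorName,
                           List<String> namedInProcessorConfig) {
        return enableGlobally || namedInProcessorConfig.contains(postprocessorName);
    }

    public static void main(String[] args) {
        String name = "PdfRedactedWatermark";
        List<String> none = Collections.emptyList();
        List<String> listed = Arrays.asList(name);
        System.out.println("global=true,  not listed: " + applies(true, name, none));
        System.out.println("global=false, not listed: " + applies(false, name, none));
        System.out.println("global=false, listed:     " + applies(false, name, listed));
    }
}
```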

The redaction server must be restarted for the new settings to take effect. The watermark is then added to each redacted PDF document (Figure 5).

Figure 5. Example of the redacted document output with a watermark

Conclusion

Sensitive data in XML documents can be removed with IBM InfoSphere Guardium Data Redaction to keep the information secure. IBM InfoSphere Guardium Data Redaction can be integrated into your ECM system to effectively perform mass redaction of document resources in your enterprise. In addition, a redaction system can use the enhanced features of the product, such as the secure viewer, the stamp extensions, or the watermark plug-in.


Acknowledgements

The authors would like to thank Joshua Fox and Michael Pelts of the InfoSphere Guardium Data Redaction development team for their reviews and valuable advice.

Resources

Learn

Get products and technologies

Discuss

  • Get involved in the developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
