Every day, companies are producing many documents, forms, and files, and sharing them across organizational boundaries or disclosing them externally. However, sensitive data in the documents, such as personal privacy information, must be carefully checked and fully eliminated before the documents are exposed to the public. IBM InfoSphere Guardium Data Redaction is a solution for secure disclosure. It offers enterprises efficient methods to remove sensitive data from the large number of documents that are produced in their daily business (see Figure 1).
The redaction server automatically extracts and removes privacy-related information from documents, and provides the results in specified formats. The server also provides these redaction services for the redaction manager and other custom applications. The redaction manager is a Flex-based web application that lets you perform interactive redaction and review batch-redacted documents. Previous articles have explained how the automated redaction works and how you can create your own custom applications using a redaction server as part of your business workflow. Since the first release, InfoSphere Guardium Data Redaction has been updated several times. Many features have been added and other functions have been enhanced to improve its usability.
This article covers the new functions and features that have been added to or enhanced in IBM InfoSphere Guardium Data Redaction V2.1 and V2.5. It is assumed that you have basic knowledge of the product. Refer to Resources for links to the other articles that discuss redaction.
Figure 1. Business document data redaction with IBM InfoSphere Guardium Data Redaction
XML document redaction
IBM InfoSphere Guardium Data Redaction V2.5 now supports XML as a redactable format. The redaction client API has been enhanced so that you can redact XML documents in your custom applications. This is important for enterprise customers because XML documents (data) are semi-structured, combining structured elements with free text. They are highly accessible for computers and are easy to enhance, while also being suitable for exchanges between organizations that are using different systems and applications. This section contains programming tips for the new API as well as how to configure the redaction server.
To use the XML redaction function, you must prepare an XML Information Specification (XIS) file. The XIS (itself an XML document) defines how the XML document will be redacted. Listing 1 shows an example XIS. Two types of elements are used in the XIS file to specify how the XML documents should be redacted: freeText elements and typedEntity elements. A freeText element can include one or more textPath elements. A typedEntity element has one or more nodePath elements. The textPath and nodePath elements contain XPath expressions defining target nodes.
Listing 1. Example of XML Information Specification (XIS)
<ns25:xmlInfoSpec xmlns:ns25="http://ibm.com/igdrxml25">
  <ns25:freeText>
    <ns25:textPath>//patientRecord/nurseNotes/text()</ns25:textPath>
  </ns25:freeText>
  <ns25:typedEntity semanticCategoryId="6">
    <ns25:nodePath>//patientRecord/ssn</ns25:nodePath>
  </ns25:typedEntity>
</ns25:xmlInfoSpec>
If a target node is specified by a freeText element, then the content of the node is evaluated as free text by the redaction server. The server finds the words to be redacted by using AQL, part of IBM's System T, which uses advanced techniques to extract information. The server then replaces the words with predefined strings (labels) that represent the semantic category of the information. In this example, the nurseNotes elements in the target XML file are evaluated and any private information is redacted.
If a target node is specified by a typedEntity element, then the entire content of the node is identified as an expression of the semantic category that is specified by the semanticCategoryId attribute. For example, if a node is specified in a typedEntity element as a Social Security Number (semanticCategoryId=6), then the content of the node is identified as an SSN. If it is prohibited to show an SSN to the document's recipients, then the redaction server replaces the body of the node with the appropriate label (in this example, with "[SSN]").
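To make the role of the XPath expressions more concrete, the following minimal Java sketch applies the nodePath from Listing 1 to a small document and overwrites each matched node with a label. It uses only the JDK's standard XML APIs and is an illustration of how a typedEntity target is selected, not the redaction server's implementation. The class name TypedEntitySketch and the method labelNodes are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only: mimics the typedEntity behavior locally with JDK XPath.
public class TypedEntitySketch {

    /** Replaces the text content of every node matched by nodePath with the given label. */
    public static String labelNodes(String xml, String nodePath, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Evaluate the XPath expression, as a nodePath element in the XIS would specify it.
            NodeList targets = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(nodePath, doc, XPathConstants.NODESET);
            for (int i = 0; i < targets.getLength(); i++) {
                targets.item(i).setTextContent(label);
            }

            // Serialize the modified DOM back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process the XML or the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<medicalData><patientRecord><department>Surgery</department>"
                + "<ssn>371-22-3459</ssn></patientRecord></medicalData>";
        System.out.println(labelNodes(xml, "//patientRecord/ssn", "[SSN]"));
    }
}
```

A freeText target differs in that the server analyzes the text inside the matched node and replaces only the sensitive words, which this sketch does not attempt.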
XML redaction is not possible with automated batch-processing or the Redaction Manager, nor can files that are processed in this way be displayed in the secure viewer. However, a client application with batch redaction capabilities can be created using the APIs. A sample application that illustrates how to invoke these functions is provided. The sample application can also be used as the starting point for your own application.
The XML redaction can be invoked on the server from a custom application by using the redaction client API. Six methods for the RedactionToolkitClient class (see Table 1) and some related classes have been added to the API for XML document redaction. Usage of the first four methods is almost the same as that of the corresponding methods for ordinary documents, such as redactRepositoryDocumentForRole(). The only difference is that the new methods require XIS information. There are two ways to pass the XIS information to the methods: the body of an XIS file can be passed in a String variable, or an XIS file name can be passed in a RedactionAttributes class object.
Table 1. Enhanced methods for XML document redaction
|Redact an XML document according to the specified rule|
|Redact an XML document for the specified recipient role|
|Redact an XML document in the repository according to the specified rule|
|Redact an XML document in the repository according to the permissions for the specified recipient role|
|Redact XML documents in the repository according to the specified rule|
|Redact XML documents in the repository according to the permissions for the specified recipient role|
Listing 2 shows an example of how to use the redactXmlForRole() method. This program sends a request to a server to redact a document on a client machine. The input document has been loaded into the inputDocBytes variable (but the loading steps have been omitted from the listing). The XIS information is loaded from a file named sampleXmlInfoSpec.xis and passed as the third argument of the redactXmlForRole() method.
Listing 2. Example program using the redactXmlForRole() method (excerpted)
// Set the redaction attributes (output format)
RedactionAttributes redactionAttributes = new RedactionAttributes("application/xml");

// Prepare the Document class object
// Load the input document data
byte[] inputDocBytes; // binary document data
...

// Set the input format and language
DocumentAttributes documentAttributes =
    new DocumentAttributes("InputDocument.xml", "application/xml", null);
documentAttributes.addAttribute(DocumentAttributes.LANGUAGE, new AttributeValue("en"));

// Set the input document data, attributes, and status
Document document =
    new Document(inputDocBytes, documentAttributes, DOCUMENT_STATE.NOT_REDACTED);

// Load the XIS
String xisContentFile = "sampleXmlInfoSpec.xis";
String xisContents = FileUtils.readTextFile(xisContentFile, "UTF-8");

// Prepare the role ID
int ROLE_RESTRICTED = 1000; // "RESTRICTED" role in the sample policy model

// Redact the document
Document redactedDocument =
    client.redactXmlForRole(redactionAttributes, document, xisContents, ROLE_RESTRICTED);
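Listing 2 omits the steps that load the input document, and it reads the XIS through a FileUtils.readTextFile helper whose origin the excerpt does not show. As a rough sketch, both inputs could be loaded with standard JDK classes alone. The class name RedactionInputLoader and its methods are hypothetical and are not part of the redaction client API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: loads the inputs that Listing 2 expects, using only the JDK.
public class RedactionInputLoader {

    /** Reads the whole file as raw bytes, suitable for the binary document data. */
    public static byte[] loadBytes(Path file) {
        try {
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read " + file, e);
        }
    }

    /** Reads the whole file as UTF-8 text, suitable for the body of an XIS file. */
    public static String loadText(Path file) {
        return new String(loadBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // File names follow Listing 2; both files are assumed to exist in the working directory.
        byte[] inputDocBytes = loadBytes(Paths.get("InputDocument.xml"));
        String xisContents = loadText(Paths.get("sampleXmlInfoSpec.xis"));
        System.out.println(inputDocBytes.length + " document bytes, "
                + xisContents.length() + " XIS characters");
    }
}
```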
Listing 3 is the sample XML input to be redacted. Listing 4 is the redacted output for that input document. The redaction is performed by the program in Listing 2 with the XIS in Listing 1. Certain words (personal names, times, and dates) and the content of the ssn elements are redacted.
Listing 3. Example of XML document to be redacted (input)
<medicalData>
  <patientRecord>
    <department>Obstetrics</department>
    <ssn>393-55-3113</ssn>
    <nurseNotes nurseId="KJ8838383">Mrs. Mary Jones was admitted at 9:27 AM on 1 November 2009.</nurseNotes>
  </patientRecord>
  <patientRecord>
    <ssn>371-22-3459</ssn>
    <department>Surgery</department>
    <nurseNotes nurseId="FJ82920909">The patient, James Smith, arrived at 7:17 AM on 31 August 2009.</nurseNotes>
  </patientRecord>
</medicalData>
Listing 4. Example of redacted XML document (output)
<?xml version="1.0" encoding="UTF-8"?>
<medicalData>
  <patientRecord>
    <department>Obstetrics</department>
    <ssn>[SSN]</ssn>
    <nurseNotes nurseId="KJ8838383">Mrs. [Person] was admitted at [Time] on [Date].</nurseNotes>
  </patientRecord>
  <patientRecord>
    <ssn>[SSN]</ssn>
    <department>Surgery</department>
    <nurseNotes nurseId="FJ82920909">The patient, [Person], arrived at [Time] on [Date].</nurseNotes>
  </patientRecord>
</medicalData>
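A custom application may want a simple sanity check on output such as Listing 4 before it distributes the result. The sketch below scans text for values that still look like U.S. Social Security Numbers. It is a hypothetical client-side check based on a plain regular expression, not a feature of the product, and it is far less capable than the server's own extraction. The class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-redaction check: flags text that still contains SSN-like values.
public class RedactionOutputCheck {

    // Three digits, two digits, four digits, separated by hyphens (for example, 393-55-3113).
    private static final Pattern SSN_LIKE = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    /** Returns every SSN-like value found in the text; an empty list means none was found. */
    public static List<String> findSsnLikeValues(String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = SSN_LIKE.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "<ssn>393-55-3113</ssn>";
        String redacted = "<ssn>[SSN]</ssn>";
        System.out.println("Input hits: " + findSsnLikeValues(input));
        System.out.println("Redacted hits: " + findSsnLikeValues(redacted));
    }
}
```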
The redactRepositoryXmlsForRole() method and its rule-based counterpart are used to efficiently redact many documents in one repository at the same time. Supported repositories include FileNet P8 V4.5.1, V5.0, and V5.1, as well as IBM Content Manager V8. If a redaction processor for the repository is configured with the multi-thread option, then the documents are redacted in parallel for potentially faster throughput. Listing 5 shows how to use the redactRepositoryXmlsForRole() method. This example program searches for the input XML files in the "on-demand" repository. Unlike the program in Listing 2, the XIS information is loaded from an XIS file in the repository by referring to the file name, which is passed to the redaction method using a RedactionAttributes class object. In this case, the third argument of the method should be null. In the final phase, the XML files are redacted by one call of the redactRepositoryXmlsForRole() method.
Listing 5. Example program using the redactRepositoryXmlsForRole() method (excerpted)
// Set the redaction attributes (output format and XIS information)
RedactionAttributes redactionAttributes = new RedactionAttributes("application/xml");
redactionAttributes.addAttribute(
    RedactionAttributes.REDACT_WITH_INFO_SPEC_IN_REPOSITORY,
    new AttributeValue("sampleXmlInfoSpec.xis"));

// Set the DocumentReference class objects
// Search for the documents and collect their references
DocumentReference[] docRefs = client.queryDocumentRepository(
    new DocumentRepositorySearchCriteria(
        "on-demand", DocumentRepositorySearchCriteria.SEARCH_WITHIN.NOT_REDACTED));

// Prepare the role ID
int role = 1000; // "RESTRICTED" role in the sample policy model

// Redact the documents (the third argument is null because the XIS is in the repository)
DocumentReference[] redactedDocRefs =
    client.redactRepositoryXmlsForRole(redactionAttributes, docRefs, null, role);
In a typical redaction scenario, documents are processed in a batch by the redaction server in the first step. An administrator can review the results using the redaction manager GUI, as required (Figure 2). The redacted documents can be distributed to readers. For example, a redacted PDF file may be sent to an authorized reader by email. However, after it is sent, control of the document is lost, and the recipient may possibly forward the document to unauthorized persons.
The new secure viewer GUI supports need-to-know viewing. Using this GUI, users can view redacted documents that are permitted for their role or lower. Users can open and read documents, but the GUI does not let them copy the text or save the documents directly to local storage. Therefore, if a user's credentials expire, access to the document can be blocked immediately.
The viewer also supports text-search, even for originally graphical documents such as TIFFs.
Figure 2. Document Data Redaction Workflow with Redaction Manager and Secure Viewer GUI
In addition to general output document formats, such as PDF, Microsoft Word, plain text, and TIFF images, the redaction server supports the secure document viewer format (.sdvf). This is a special internal format for the secure viewer. With the .sdvf output, the enhanced function "securely retrieve and fill in" is available.
In some cases, even a fully permitted type of information still needs extra precautions. The redaction system can be configured to redact the information, yet support need-to-know viewing with the secure viewer.
The configuration for the redaction policy is defined in the XmlPolicyModel.xml file on the server. Listing 6 shows an example of this configuration file. The permission elements define which semantic categories are redacted for which user roles.
Listing 6. Examples of definitions of redaction conditions (XmlPolicyModel.xml)
...
<ns21:permission userRoleId="1001" semanticCategoryId="3">
  <ns21:redact/>
</ns21:permission>
...
<ns21:permission userRoleId="1001" semanticCategoryId="10">
  <ns21:reveal/>
</ns21:permission>
...
If a redact element is defined, information in the specified category is securely and unconditionally deleted for the specified user role. If a reveal element is defined instead, the information is securely deleted, but can be securely retrieved from the original document on the server and displayed when an authorized reader requests to see it. Optionally, the requesting user can be required to specify their business purpose for the display.
For example, in the sample configuration, organization names (semanticCategoryId=3) are deleted and replaced by the label "[Organization]", while person names (semanticCategoryId=10) are also replaced but can be revealed by an authorized "General" role user (userRoleId=1001). To reveal a redacted word, the user clicks the label to select it and then clicks the "Reveal selected entity" icon. The user may then be asked (or required, depending on the configuration) to provide a business purpose, either by choosing one of the options that are configured in a combo box or by entering free text. Then, the redacted text is securely retrieved from the server and displayed (Figure 3). The requests to display redacted information are logged by the redaction server so that an auditor can review those records later.
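The difference between redact and reveal permissions can be modeled in a few lines. The following Java sketch is a simplified, hypothetical model of the decision described above and is not the server's implementation or API. It assumes that an unlisted role and category pair defaults to redact, that a business purpose is always required, and that every request is logged. All class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of need-to-know viewing; not part of the product API.
public class NeedToKnowSketch {

    public enum Action { REDACT, REVEAL }

    private final Map<String, Action> permissions = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();

    /** Mirrors one permission element: a role, a semantic category, and redact or reveal. */
    public void setPermission(int userRoleId, int semanticCategoryId, Action action) {
        permissions.put(userRoleId + ":" + semanticCategoryId, action);
    }

    /**
     * Returns the original text only if the policy says REVEAL and a purpose was given.
     * Otherwise the label is returned. Every request is recorded for later audit.
     */
    public String requestReveal(int userRoleId, int semanticCategoryId,
                                String originalText, String label, String businessPurpose) {
        // Assumption: pairs that are not listed are treated as REDACT.
        Action action = permissions.getOrDefault(
                userRoleId + ":" + semanticCategoryId, Action.REDACT);
        boolean hasPurpose = businessPurpose != null && !businessPurpose.trim().isEmpty();
        boolean granted = action == Action.REVEAL && hasPurpose;
        auditLog.add("role=" + userRoleId + " category=" + semanticCategoryId
                + " purpose=" + businessPurpose + " granted=" + granted);
        return granted ? originalText : label;
    }

    public List<String> getAuditLog() {
        return Collections.unmodifiableList(auditLog);
    }

    public static void main(String[] args) {
        NeedToKnowSketch policy = new NeedToKnowSketch();
        policy.setPermission(1001, 3, Action.REDACT);   // organization names
        policy.setPermission(1001, 10, Action.REVEAL);  // person names
        System.out.println(policy.requestReveal(1001, 10, "Mary Jones", "[Person]", "Claim review"));
        System.out.println(policy.requestReveal(1001, 3, "Example Corp", "[Organization]", "Claim review"));
        System.out.println(policy.getAuditLog());
    }
}
```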
Figure 3. Securely retrieving and displaying redacted information using the need-to-know viewing function
The advantage of role-based redaction and role-based need-to-know viewing is that the administrator does not need to prepare a different version of each document for each role. If a document is generated for the role with the lowest level of access, then all of the recipients can browse it, and readers whose roles have higher levels of access can still display the information that is related to their roles.
Stamp for Bates Numbering
The stamp function has been extended. In addition to a fixed string, the date and time of the redaction can be inserted into the stamp text in the user's preferred format. An auto-incremented Bates serial number, often required in legal document management scenarios such as eDiscovery, is also supported in the stamp. This function is useful when processing documents in bulk. Listing 7 and Figure 4 show an example of the use of the stamp function. In addition, document attributes (such as the file name or content type) can be included in the stamp. A globally unique identifier (GUID) for each stamp is also available.
Listing 7. Example of the stamp label definition
<redactionStamp enabled="true">
  <label>
    <text>REDACTED for external readers: </text>
    <date format="yyyy/MMM/dd HH:mm:ss"/>
    <text> Serial #: </text>
    <batesNumber startNumber="1"/>
  </label>
  <position>UpperLeft</position>
</redactionStamp>
Figure 4. Example of stamp output
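To show how the parts of the label in Listing 7 combine, the following Java sketch builds comparable stamp text locally. It assumes that the date format string follows Java date pattern syntax and that the Bates number is rendered as a plain integer that increases by one per document. Neither assumption is confirmed by the listing, and the class StampLabelSketch is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the stamp label: fixed text, formatted date, Bates number.
public class StampLabelSketch {

    private final DateTimeFormatter dateFormat;
    private final AtomicInteger nextBatesNumber;

    public StampLabelSketch(String datePattern, int startNumber) {
        // Locale.ENGLISH keeps the abbreviated month name independent of the platform locale.
        this.dateFormat = DateTimeFormatter.ofPattern(datePattern, Locale.ENGLISH);
        this.nextBatesNumber = new AtomicInteger(startNumber);
    }

    /** Builds the label for one document and advances the serial number. */
    public String nextLabel(LocalDateTime redactedAt) {
        return "REDACTED for external readers: " + dateFormat.format(redactedAt)
                + " Serial #: " + nextBatesNumber.getAndIncrement();
    }

    public static void main(String[] args) {
        StampLabelSketch stamp = new StampLabelSketch("yyyy/MMM/dd HH:mm:ss", 1);
        System.out.println(stamp.nextLabel(LocalDateTime.now()));
        System.out.println(stamp.nextLabel(LocalDateTime.now()));
    }
}
```

An atomic counter is used so that the serial numbers stay unique even if several documents are stamped concurrently.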
Adding a watermark to a document with a postprocessor plug-in
A postprocessor plug-in added to a redaction server can perform customized processing on the redacted documents. The product includes a sample configuration file for a graphical watermark. This section describes the configuration steps to add watermarks to documents.
A watermark can be added if the redacted document is output as a PDF file, regardless of the input format. For this process, the iText PDF library should be obtained from its website (http://itextpdf.com/) and added to the <IBM_REDACTION_HOME>\server\plugins\redaction-postprocessor-watermark\lib directory on the redaction server. <IBM_REDACTION_HOME> is the program file directory that was specified during the installation process.
The watermark image that will be embedded into the documents must be created as a PNG file with transparency. The background of the image must be fully transparent, and the foreground image (the actual watermark) must be partially transparent.
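One way to produce such an image is sketched below in Java using only the JDK. It creates an ARGB image whose background stays fully transparent and draws a diagonal gray band at a chosen opacity, then writes it as a PNG file. The band is a stand-in for real artwork such as a logo or text, and the class name, sizes, and opacity value are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical illustration: builds a PNG with a transparent background
// and a partially transparent foreground.
public class WatermarkImageSketch {

    /** Creates a square image with a diagonal gray band; alpha ranges from 0 to 255. */
    public static BufferedImage createWatermark(int size, int alpha) {
        // A new TYPE_INT_ARGB image starts with every pixel fully transparent.
        BufferedImage image = new BufferedImage(size, size, BufferedImage.TYPE_INT_ARGB);
        int argb = (alpha << 24) | 0x808080; // gray at the requested opacity
        int halfBand = Math.max(1, size / 10);
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                if (Math.abs(x - y) <= halfBand) {
                    image.setRGB(x, y, argb);
                }
            }
        }
        return image;
    }

    /** Writes the image as a PNG file, which preserves the alpha channel. */
    public static void writePng(BufferedImage image, File target) {
        try {
            ImageIO.write(image, "png", target);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not write " + target, e);
        }
    }

    public static void main(String[] args) {
        writePng(createWatermark(200, 64), new File("WatermarkImg.png"));
    }
}
```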
Enable the plug-in
Open the file <IBM_REDACTION_DATA>\server\conf\pdfWatermark.xml with a text editor and add the definition that is shown in Listing 8. <IBM_REDACTION_DATA> is the data file directory that contains configuration files, log files, and samples. Its location is defined by the installation program and depends on the operating system on which the product is installed. The XML file specifies the postprocessor name, a global flag, the watermark image file, its position, and the target pages to receive the watermark. In Listing 8, WatermarkImg.png is the file name for the watermark image file.
Listing 8. Example of a watermark plug-in configuration (pdfWatermark.xml)
<?xml version="1.0" encoding="UTF-8"?>
<postprocessor name="PdfRedactedWatermark">
  <enableGlobally>false</enableGlobally>
  <watermarkFile>
    /server/conf/WatermarkImg.png
  </watermarkFile>
  <lowerLeft>
    <x>0</x>
    <y>0</y>
  </lowerLeft>
  <pages>1</pages>
</postprocessor>
Open the configuration file <IBM_REDACTION_DATA>\server\conf\plugins.xml with a text editor and remove the comment tags to enable the definition that is shown in Listing 9. This causes the redaction server to load the pdfWatermark.xml configuration file when it starts.
Listing 9. Registration of the watermark plug-in in plugins.xml
<plugin>
  <pluginClass>com.ibm.nex.redaction.postprocessors.PdfWatermarker</pluginClass>
  <configFile>pdfWatermark.xml</configFile>
</plugin>
If the global flag is set to true, then the watermark is applied to all of the documents that are processed by any of the features of the redaction server. If the global flag is set to false, then you must specify the postprocessor name in the configuration file of each repository processor to which the watermark should be applied. Add the definition that is shown in Listing 10 to those configuration files (for example, the <IBM_REDACTION_DATA>\server\conf\batchFileSystemProcessor.xml file).
Listing 10. Postprocessor specification for the watermark plug-in in repository processor configuration file
<postprocessors>
  <postprocessor>PdfRedactedWatermark</postprocessor>
</postprocessors>
The redaction server must be restarted to enable the new settings. The watermark is then added to each redacted PDF document (Figure 5).
Figure 5. Example of the redacted document output with a watermark
Sensitive data in XML documents can be removed with IBM InfoSphere Guardium Data Redaction to keep the information secure. IBM InfoSphere Guardium Data Redaction can be integrated into your ECM system to effectively perform mass redaction of document resources in your enterprise. In addition, a redaction system can use the enhanced features of the product, such as the secure viewer, the stamp extensions, or the watermark plug-in.
The authors would like to thank Joshua Fox and Michael Pelts of the InfoSphere Guardium Data Redaction development team for their reviews and valuable advice.
- Learn more about sharing information while protecting privacy using "IBM InfoSphere Guardium Data Redaction" (developerWorks, September 2011).
- Learn more about Guardium and the redaction process in the article "Integrate a document data redaction process in your business workflow using IBM InfoSphere Guardium Data Redaction" (developerWorks, September 2011).