Integrate a document data redaction process in your business workflow using IBM InfoSphere Guardium Data Redaction

Redaction client API programming tips

IBM® InfoSphere® Guardium® Data Redaction provides a Java® API that allows your custom application to use the redaction server. This article describes how to use the API in three typical document-redaction scenarios.

Yasutomo Nakayama (nakayama@jp.ibm.com), Yamato Software Development Laboratory (YSL), IBM

Yasutomo Nakayama is a member of the Enterprise Content Management development team in Yamato Software Development Laboratory (YSL), IBM Japan.



Eisuke Kanzaki (JL17613@jp.ibm.com), Yamato Software Development Laboratory (YSL), IBM

Eisuke Kanzaki is a member of the Enterprise Content Management development team in Yamato Software Development Laboratory (YSL), IBM Japan.



15 September 2011


Introduction

IBM InfoSphere Guardium Data Redaction supports selective disclosure of information in documents. The redaction server automatically removes the private information in free-text documents and forms, as shown in Figure 1.

These functions help process large numbers of documents that may be scattered among various locations. Where there are regulatory or business requirements to maintain the privacy of certain information in the documents, while specifically revealing relevant information to authorized persons, the redaction system can help satisfy both goals.

This product includes browser-based applications with intuitive interfaces that let you redact documents or view the redacted documents. It also offers multiple options for automated access. These features can be especially valuable in Enterprise Content Management (ECM) workflows, where controlled redaction is an essential part of document management.

Figure 1. Overview of IBM InfoSphere Guardium Data Redaction
This figure shows an overview of InfoSphere Guardium Data Redaction, including the redaction manager, secure viewer, and custom application going to the redaction server, and then to the ECM system.

There are three methods for using these redaction services:

  • Through document repositories or a file system. You copy the original documents into an input folder and receive the redacted files in an output folder.
  • Through the SOAP API, recommended for non-Java applications.
  • Through the Java redaction-client API, the focus of this article.

IBM InfoSphere Guardium Data Redaction provides a redaction-client API so that your custom applications can directly control the redaction service through the API. This article describes the capabilities of the redaction-client API and how to use the API in a custom application. Three typical scenarios are described as examples in this article.

Note: This article is based on IBM InfoSphere Guardium Data Redaction V2.1.


Redaction-client API

The redaction-client API is a Java-class library to access the redaction server's services from a program. The API toolkit is included with IBM InfoSphere Guardium Data Redaction.

Sample code and documentation for the API are provided out of the box. When the product is installed, the API libraries and a sample program are copied to the <IBM_REDACTION_HOME>\client folder, where <IBM_REDACTION_HOME> is the installation folder. The Javadoc for the API is copied to the <IBM_REDACTION_HOME>\docs\apidocs folder.

There are two modes for using the redaction service from a custom application with the API: a remote client mode and an in-process client mode. In the remote client mode, a program on a remote (or local) client machine invokes the redaction server via a SOAP connection. In the in-process client mode, the entire redaction server runs as a part of the client application. The implementation of the program is almost the same in both modes. In this article, the remote client mode is shown first for each of the three typical scenarios. The differences between the two modes are described at the end.

The first scenario is the most basic one. A client program simply connects to the server and redacts a document. In the second scenario, many documents are sent to the server and they are redacted using a batch repository. The third scenario is an advanced case showing the high level of control that an application has over the redaction of a collection of documents.


Scenario 1: Local file system redaction

This is the simplest scenario and illustrates the basic usage of the API. In this scenario, a program on a remote client machine sends a redaction request to the redaction server, as shown in Figure 2.

Figure 2. Redaction request sent to redaction server
This figure shows the client machine sending a redaction request to the redaction server

The scenario involves these steps:

  1. Establish a connection to the redaction server
  2. Load a target document as binary data
  3. Submit a redaction request
  4. Save the result as a document

Establish connection to the redaction server

To connect to the redaction server, first create a RedactionToolkitClient class object. The createRemoteClient() method is used to create the object in the remote client mode. Listing 1 shows an example of how to use the createRemoteClient() method.

Listing 1. Use createRemoteClient() method
User user = new User("CustomApp01", "userid", "password");
RedactionToolkitClient client = RedactionToolkitClient.createRemoteClient("foo.bar.com", 
8097, user);

When the createRemoteClient() method is invoked, a SOAP connection between the remote client machine and the redaction server is established. Therefore the method requires the host name (or IP address) and port number of the redaction server. It also requires a valid user ID and password for the redaction server. Once the redaction server and the client are connected, your application can invoke the server's services through the SOAP engine. The API provides several methods to invoke the redaction services.

Load a target document as binary data

The redaction server supports several document formats such as PDF, Microsoft Word, TIFF images, and plain text. For all formats, the document content must be passed to the server as a byte array. Therefore, when redacting documents from a file system, the files must be read as raw bytes rather than as text.
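As a minimal sketch of this step, the following code reads a file into a byte array with the standard java.nio.file API. The class name DocumentLoader and the method loadDocumentBytes() are illustrative only and are not part of the redaction-client API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper; not part of the redaction-client API.
class DocumentLoader {

    // Reads the whole file as raw bytes, with no character decoding,
    // so PDF, Word, TIFF, and text files are all handled the same way.
    static byte[] loadDocumentBytes(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocumentLoader <file>");
            return;
        }
        byte[] inputDocBytes = loadDocumentBytes(Paths.get(args[0]));
        System.out.println("Loaded " + inputDocBytes.length + " bytes");
    }
}
```

Files.readAllBytes() holds the entire document in memory, which matches the byte[] that the Document class expects but may need care with very large files.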

Submit a redaction request

The redactDocumentByRules() method sends a redaction request to the server. This method requires several parameters to specify the options for the request. Table 1 below lists the classes that are used to specify the parameters for the request.

Table 1. Classes used for the redactDocumentByRules() method
Class Parameters to set
RedactionAttributes output format, masking rule, color mode, stamp and other redaction options
RedactionStamp stamp label and position
DocumentAttributes document name, language of the document, etc.
Document document data (binary), DocumentAttributes class object

The RedactionAttributes class object specifies how to redact the document. Listing 2 shows how to set the redaction attributes for this class object. In this sample, "image/png" is specified as the output content type, GRAY as the output color mode, and "Classified" as the stamp in the upper-left corner of the document pages. RedactionAttributes.ColorMode.GRAY is a static constant that specifies the color mode.

Listing 2. Set the redaction attributes
// Set Redaction Attributes
// Set output format
RedactionAttributes redactionAttributes = new RedactionAttributes("image/png");
// Set color mode
redactionAttributes.addAttribute(RedactionAttributes.ColorMode.COLOR_MODE , 
new AttributeValue(RedactionAttributes.ColorMode.GRAY.toString()));
// Set stamp
RedactionStamp stamp = new RedactionStamp(Boolean.TRUE, "Classified", 
RedactionStamp.StampPosition.UPPER_LEFT) ;
redactionAttributes.addRedactionStamp(stamp);

The Document class object is used to set the information about the input document and to get the information in the redacted document. The DocumentAttributes class is used to keep the Document class metadata such as the document name, content type, language, and others.

In Listing 3, the name and content type of the input document are specified by the constructor. When an empty string ("") is specified as the content type, the redaction server examines the document data and identifies the type. The language of the document (English) is set for the object by using the addAttribute() method. If the language is not specified, the system identifies it automatically during the redaction process. The binary data for the document, the documentAttributes instance, and the document status are set in the Document class object. In this sample code, the binary data of the document is stored in the inputDocBytes variable.

Listing 3. Use Document class object
byte[] inputDocBytes; //  binary document data 
...
// set file name and language   
DocumentAttributes documentAttributes = new DocumentAttributes("sales_results.pdf", "", 
null);
documentAttributes.addAttribute(DocumentAttributes.LANGUAGE, new AttributeValue("en"));
// set binary data, attributes and status 
Document document = new Document(inputDocBytes, documentAttributes, 
DOCUMENT_STATE.NOT_REDACTED);

IBM InfoSphere Guardium Data Redaction has 13 pre-defined major semantic categories, as shown in Table 2, so that you can specify which words should be redacted in the documents. Additional categories can be created as required.

Table 2. Predefined semantic categories
Category ID Semantic Category
2 Phone Number
3 Organization
4 Date
5 Time
6 SSN (Social Security number)
7 Email address
8 URL
9 Location
10 Person
11 Address
100 USD (US dollars)
101 Nationality
102 Bank Card Number

When a redaction request is sent to the server using the redactDocumentByRules() method, the target data types for redaction are specified. Listing 4 shows an example of how to prepare the redaction rules and call the method. This example specifies that the phone number, personal name, and address must be redacted.

Listing 4. Prepare redaction rules
RedactionRules rules = new RedactionRules();
rules.add(new RedactionRule( 2, RedactionRule.PERMISSION_TYPE.REDACT)); // ID  2: Phone Number
rules.add(new RedactionRule(10, RedactionRule.PERMISSION_TYPE.REDACT)); // ID 10: Person name
rules.add(new RedactionRule(11, RedactionRule.PERMISSION_TYPE.REDACT)); // ID 11: Address
Document outDocument = client.redactDocumentByRules(redactionAttributes, document, rules);
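The numeric category IDs in Listing 4 are easy to mistype. One possible way to keep them readable is to wrap the IDs from Table 2 in an enum of your own. The SemanticCategory type below is a hypothetical convenience written for this article; the API itself takes the plain int IDs, so id() would be passed to the RedactionRule constructor.

```java
// Hypothetical convenience enum; not part of the redaction-client API.
// The IDs are the predefined category IDs listed in Table 2.
enum SemanticCategory {
    PHONE_NUMBER(2),
    ORGANIZATION(3),
    DATE(4),
    TIME(5),
    SSN(6),
    EMAIL_ADDRESS(7),
    URL(8),
    LOCATION(9),
    PERSON(10),
    ADDRESS(11),
    USD(100),
    NATIONALITY(101),
    BANK_CARD_NUMBER(102);

    private final int id;

    SemanticCategory(int id) {
        this.id = id;
    }

    // The numeric ID expected wherever the API asks for a category ID.
    int id() {
        return id;
    }

    // Reverse lookup, for example when reading IDs from a configuration file.
    static SemanticCategory fromId(int id) {
        for (SemanticCategory c : values()) {
            if (c.id == id) {
                return c;
            }
        }
        throw new IllegalArgumentException("Unknown category ID: " + id);
    }

    public static void main(String[] args) {
        for (SemanticCategory c : values()) {
            System.out.println(c.id() + " " + c);
        }
    }
}
```

Categories that you add to your own policy model would need matching constants, so this approach suits installations where the category list changes rarely.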

The data types for redaction can be more easily specified by using roles. Permissions for the various semantic categories (as in whether or not to redact) are specified for each role. Together, the rules and roles are called a policy model, which should be defined in an XmlPolicyModel.xml file in the <IBM_REDACTION_HOME>\server\conf folder on the redaction server.

The redactDocumentForRole() method is used to specify the role as a redaction option. Listing 5 shows an example of how to use the redactDocumentForRole() method. The RESTRICTED role (ID 1000) is one of the product's sample roles. Typically, you should define your own roles that suit your business logic with the predefined semantic categories.

Listing 5. Use redactDocumentForRole() method
int ROLE_RESTRICTED = 1000; // RESTRICTED role in the sample policy model
// redact a document
Document outDocument = client.redactDocumentForRole(redactionAttributes, document, 
ROLE_RESTRICTED);

Save the result as a document

Both the redactDocumentByRules() method and the redactDocumentForRole() method return a Document class object as the redacted result. The binary data and the document name from this object are used to save the document to a local file.
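The save step can be sketched with the standard java.nio.file API. The helper below takes the document name and the redacted bytes as plain arguments, because this article does not list the accessor methods of the Document class; see the Javadoc for how to read them from the returned object. The class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper; not part of the redaction-client API.
class RedactedDocumentWriter {

    // Writes the redacted bytes into outputDir under the given document name.
    // Only the last element of the name is used, so a name that contains
    // directory parts cannot place the file outside outputDir.
    static Path save(Path outputDir, String documentName, byte[] redactedBytes)
            throws IOException {
        Path fileName = Paths.get(documentName).getFileName();
        if (fileName == null || fileName.toString().isEmpty()) {
            throw new IllegalArgumentException(
                    "Document name has no file part: " + documentName);
        }
        Files.createDirectories(outputDir);
        return Files.write(outputDir.resolve(fileName), redactedBytes);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("redacted");
        Path written = save(dir, "sales_results.png", new byte[] {1, 2, 3});
        System.out.println("Saved " + written);
    }
}
```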

This scenario is best for clients who require synchronous processing. The connection to the redaction server is kept alive until the end of the redaction. This allows clients to track the progress and control the execution of the redactions. The redaction parameters are passed with each individual document. For better throughput, clients can submit redaction requests from multiple threads. The second scenario is suitable for the redaction of large numbers of documents when the client application should not be blocked.
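Submitting requests from multiple threads can be sketched with a fixed-size thread pool from java.util.concurrent. In the sketch below, the redactOne function is a stand-in for a call such as redactDocumentByRules(); the demo uses a trivial string function so that the code runs without a server. Whether one RedactionToolkitClient object can be shared safely between threads is not covered in this article, so confirm that in the Javadoc or create one client per thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Illustrative helper; not part of the redaction-client API.
class ParallelRedaction {

    // Applies redactOne to every document using a pool of worker threads
    // and returns the results in the same order as the input list.
    static <D, R> List<R> redactAll(List<D> documents, Function<D, R> redactOne, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (D doc : documents) {
                tasks.add(() -> redactOne.apply(doc));
            }
            List<R> results = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get()); // rethrows a task failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in "redaction" that masks digits, only to make the sketch runnable.
        List<String> out = redactAll(
                Arrays.asList("call 555-0100", "no numbers here"),
                text -> text.replaceAll("[0-9]", "#"),
                2);
        System.out.println(out);
    }
}
```

The calling thread still waits for all results, so this keeps the synchronous character of Scenario 1 while letting the server work on several documents at once.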


Scenario 2: Redact documents in a batch repository

A repository is a specified folder, including subfolders, in a file system or in an ECM system. It is used for input and output documents as well as for other artifacts used in redaction. There are two types of repositories: batch and on-demand. A batch repository is used in this second scenario. The redaction server is set to monitor the input folder of the batch repository for incoming documents. When documents are copied to the input folder, the server automatically starts to redact the documents with the options that are defined in advance for the repository, and puts the results into the output folder.

In this scenario, the custom application is divided into two parts, as shown in Figure 3. The first part, the custom application A, simply sends the documents to the batch repository. After the server has processed the documents, the second part, custom application B, connects to the server and downloads the redacted documents.

Figure 3. Custom application divided into two parts
This figure shows the custom application A and B going to the server machine

Table 3 shows the repository-related methods of the RedactionToolkitClient class. This scenario is implemented by using some of these methods.

Table 3. Methods to operate on repositories
Method Operation
queryDocumentRepository() Get a list of the documents in the repository
addDocumentToRepository() Add a document to the repository
getRepositoryDocument() Get a document from the repository
deleteRepositoryDocument() Delete a document in the repository
redactRepositoryDocumentByRules() Redact a document in the repository according to the specified rule
redactRepositoryDocumentForRole() Redact a document in the repository according to the permission for the specified recipient role

Send the documents

The custom application A connects to the server and loads the target documents. These operations are similar to the first two steps in Scenario 1.

The addDocumentToRepository() method puts a document into the input folder of the specified repository. Listing 6 shows an example of how to use the addDocumentToRepository() method. Here, "batch" is the name of one of the sample batch repositories created when the product is installed.

Listing 6. Use addDocumentToRepository() method
DocumentAttributes documentAttributes = new DocumentAttributes("sales_results.pdf", "", 
null);
Document document = new Document(inputDocBytes, documentAttributes, 
DOCUMENT_STATE.NOT_REDACTED);
client.addDocumentToRepository(document, "batch");

The application continues uploading until all of the documents have been sent to the repository. When the last document has been uploaded, the program terminates.

The server is monitoring the input folder and automatically starts to process the documents. This makes it unnecessary to maintain a connection to the server during the processing. The documents are processed with the options that are defined in the configuration file that is stored on the redaction server for that repository.

Download the documents and clean up the repository

Later, the custom application B connects to the redaction server and checks the output folder of the repository. The queryDocumentRepository() method gets the references for the documents in the specified repository. Listing 7 shows an example of how to use the queryDocumentRepository() method.

Listing 7. Use queryDocumentRepository() method
String repositoryFolderName = "batch";
DocumentReference[] docRefs = null;
...
docRefs = client.queryDocumentRepository(new DocumentRepositorySearchCriteria
(repositoryFolderName, DocumentRepositorySearchCriteria.SEARCH_WITHIN.REDACTED));

To find the redacted documents, DocumentRepositorySearchCriteria.SEARCH_WITHIN.REDACTED should be specified as the type of the documents to search for, which means the documents that have been redacted rather than the original documents. The results of the query are returned in an array of the DocumentReference class. The references are used to specify documents in repository folders to be downloaded or deleted. The getRepositoryDocument() method gets the actual document data, the Document class object. Even after the data is downloaded, the original data still remains in the repository. The deleteRepositoryDocument() method is called to clean up the repository. Listing 8 shows an example of how to use the getRepositoryDocument() and the deleteRepositoryDocument() methods.

Listing 8. Use the getRepositoryDocument() and deleteRepositoryDocument() methods
for (int i = 0; i < docRefs.length; i++) {
    Document doc = client.getRepositoryDocument(docRefs[i]);
    ...
    // save the binary data to a local file
    ...
    // delete the document to ensure next query doesn't return it
    client.deleteRepositoryDocument(docRefs[i]);
}

A batch repository is designed to redact many documents at one time by a configurable number of threads. One set of redaction options, for example output format, color mode, role, and others, can be specified for each repository. If there are different redaction options for different documents, then a batch repository must be prepared for each set of the options. In contrast, with an on-demand repository, you can control how the server redacts each document in the repository.


Scenario 3: Redact documents with an on-demand repository

In the third scenario, the custom application is divided into three parts. The uploading and downloading parts are almost the same as in the second scenario, except that the documents are put into an on-demand repository. Unlike the batch repository, the on-demand repository does not start to redact the input document automatically. Therefore a new part, custom application C, must be running on the server machine to request that the server process the documents, as shown in Figure 4.

Figure 4. Custom application C run on the server machine to request the redaction process
This figure shows custom application C running on the server machine

Custom application A uploads the documents to the on-demand repository. This may be done repeatedly and as needed. The custom application C on the server machine runs on a regular schedule, such as once a day, when it checks the repository to find the documents.
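If application C is a long-running Java process rather than a job started by an operating system scheduler, the regular schedule can be sketched with ScheduledExecutorService. The redactionPass argument below is a placeholder for the code that queries the repository and redacts the documents; the class and method names are illustrative.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative helper; not part of the redaction-client API.
class RedactionScheduler {

    // Runs redactionPass immediately and then once per period.
    static ScheduledFuture<?> scheduleRepeating(ScheduledExecutorService scheduler,
                                                Runnable redactionPass,
                                                long period, TimeUnit unit) {
        Runnable guarded = () -> {
            try {
                redactionPass.run();
            } catch (RuntimeException e) {
                // Log and carry on; an exception that escapes here would
                // silently cancel every later run.
                System.err.println("Redaction pass failed: " + e);
            }
        };
        return scheduler.scheduleAtFixedRate(guarded, 0, period, unit);
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Short period so the demo ends quickly; a daily job would pass (1, TimeUnit.DAYS).
        scheduleRepeating(scheduler,
                () -> System.out.println("Checking the on-demand repository"),
                100, TimeUnit.MILLISECONDS);
        Thread.sleep(350);
        scheduler.shutdown();
    }
}
```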

Although application C runs on the server machine, it still connects to the server process via a SOAP connection, because it uses createRemoteClient() like the other custom applications. The references for the documents in the input folder of the repository are obtained by using the queryDocumentRepository() method with the DocumentRepositorySearchCriteria.SEARCH_WITHIN.NOT_REDACTED option. The redactRepositoryDocumentByRules() or redactRepositoryDocumentForRole() method is used to redact the documents in the on-demand repository folders.

The application can specify different options for each document by using the RedactionAttributes class and the roles or rules required by the business logic. For example, different roles can be specified for each document by referring to each document name.

Listing 9 shows an example of how to use the redactRepositoryDocumentForRole() method. In this code, documents whose file name starts with "ClassA" are redacted with the RESTRICTED role, and the "Classified" stamp is added to the pages. All other documents are redacted with the GENERAL role.

Listing 9. Use redactRepositoryDocumentForRole() method
for (int i = 0; i < docRefs.length; i++) {
    // Default: GENERAL role in the sample policy model
    int role = 1001;
    // Set output format to TIFF image
    RedactionAttributes redactionAttributes = new RedactionAttributes("image/tiff");

    if (docRefs[i].getDocumentName().startsWith("ClassA")) {
        // Add stamp and use RESTRICTED role
        RedactionStamp stamp = new RedactionStamp(Boolean.TRUE, "Classified",
                RedactionStamp.StampPosition.UPPER_LEFT);
        redactionAttributes.addRedactionStamp(stamp);
        role = 1000; // RESTRICTED role in the sample policy model
    }
    client.redactRepositoryDocumentForRole(redactionAttributes, docRefs[i], role);
}
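As the business rules grow, it can help to keep the decision about which role applies to a document separate from the API calls, so that the rule can be tested without a server. The sketch below isolates the rule used in Listing 9. The RoleSelector class and its method names are hypothetical; the role IDs are the ones from the sample policy model.

```java
// Illustrative helper; not part of the redaction-client API.
class RoleSelector {

    static final int ROLE_RESTRICTED = 1000; // RESTRICTED role in the sample policy model
    static final int ROLE_GENERAL = 1001;    // GENERAL role in the sample policy model
    static final String RESTRICTED_PREFIX = "ClassA";

    // Documents whose name starts with the prefix get the RESTRICTED role;
    // everything else gets the GENERAL role.
    static int roleFor(String documentName) {
        return documentName.startsWith(RESTRICTED_PREFIX) ? ROLE_RESTRICTED : ROLE_GENERAL;
    }

    // In Listing 9 the "Classified" stamp is added exactly when the
    // RESTRICTED role is used.
    static boolean needsClassifiedStamp(String documentName) {
        return roleFor(documentName) == ROLE_RESTRICTED;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"ClassA_contract.pdf", "newsletter.pdf"}) {
            System.out.println(name + " -> role " + roleFor(name)
                    + ", stamp " + needsClassifiedStamp(name));
        }
    }
}
```

Inside the loop, the value of roleFor(docRefs[i].getDocumentName()) would then be passed to redactRepositoryDocumentForRole().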

This scenario can be extended to integrate the document redaction process into the document workflow as required. The redaction server can access an ECM system and use its data store as a repository. Therefore custom applications can monitor and process documents on the ECM system with the required business logic.


Using in-process client mode

For on-the-spot redaction services, to avoid running the redaction server continually (thus conserving resources) or to reduce inter-process communication overhead, documents can be redacted by using the in-process client mode. This mode gives a client application control over the redaction server life cycle. The server is initialized along with the client. This means that the custom application and the redaction server are running in the same JVM process, as shown in Figure 5.

Figure 5. Redacting a document in the in-process client mode
This figure shows the custom application and redaction server running in the same JVM process

To use the redaction-client API in the in-process client mode, the createInProcessClient() method is used to create a RedactionToolkitClient class object instead of the createRemoteClient() method. Unlike the remote client, an in-process client does not use a SOAP connection. Instead, it starts the redaction server inside the client's own JVM process. Once the object is obtained, your program can use its methods in the same way as it does in the remote client mode. The only difference is that the cleanup() method must be called at the end of the program to release the unneeded resources before stopping the JVM process.

The program for the first scenario is modified to use the in-process client by simply replacing the createRemoteClient() method and adding the cleanup() method.

Listing 10 shows an example of how to use the in-process client.

Listing 10. Use in-process client
User user = new User("CustomApplication04", "userid", "password");
RedactionToolkitClient client = RedactionToolkitClient.createInProcessClient(user);
// redact the documents with the client object here
client.cleanup();
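Because cleanup() must run even when a redaction call fails, one common approach is to place it in a finally block. The sketch below shows the pattern with two Runnable arguments so that it runs without the product installed: work stands in for the redaction calls and cleanup stands in for client.cleanup(). If the real cleanup() method declares checked exceptions, the wrapper would need to be adjusted.

```java
// Illustrative pattern; not part of the redaction-client API.
class CleanupPattern {

    // Runs the work and always runs the cleanup afterwards,
    // whether the work returns normally or throws.
    static void runThenCleanup(Runnable work, Runnable cleanup) {
        try {
            work.run();
        } finally {
            cleanup.run();
        }
    }

    public static void main(String[] args) {
        try {
            runThenCleanup(
                    () -> {
                        System.out.println("Redacting documents");
                        throw new IllegalStateException("simulated redaction failure");
                    },
                    () -> System.out.println("Releasing server resources"));
        } catch (IllegalStateException e) {
            System.out.println("Work failed, but cleanup has already run: " + e.getMessage());
        }
    }
}
```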

The in-process client is more efficient than the remote client because it avoids the inter-process communication overhead. However, it couples the client and server more tightly than the remote client does. Here are some additional considerations for the in-process client.

Redaction server components

The program must be run on the same machine that has the redaction server components, and it must be running on the same JVM as the server components (<IBM_REDACTION_HOME>\ibm-java2-jre-60\jre\bin\java.exe).

The same configuration files are used for both the redaction server and the in-process client. However, the Web services (redaction manager and secure viewer), SOAP services, and batch processor are not available in the in-process client mode.

Coexistence with the redaction server

The in-process client can be used even when the redaction server is running, except when the redaction server is connected to ECM servers. When the redaction server is connected to the repositories (the data store) on the ECM servers, it locks the repositories. Therefore, if the in-process client starts and also tries to connect to the ECM servers, a conflict occurs.

Classpath

The following JAR files must be added to the classpath:

  • All jar files in <IBM_REDACTION_HOME>\server\lib
  • All jar files under <IBM_REDACTION_HOME>\server\plugins

In addition, if the Content Manager 8 object store is used as a repository, the following jar files and directory are also required in the classpath, where <IBMCMROOT> is the installation folder of the IBM DB2 Information Integrator for Content. This product is only a prerequisite if InfoSphere Guardium Data Redaction is to connect with the IBM Content Manager 8.

  • <IBMCMROOT>\lib\cmbcm81.jar
  • <IBMCMROOT>\lib\cmbicm81.jar
  • <IBMCMROOT>\lib\db2jcc_license_cisuz.jar
  • <IBMCMROOT>\lib\db2jcc_license_cu.jar
  • <IBMCMROOT>\lib\db2jcc.jar
  • <IBMCMROOT>\cmgmt
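Listing every JAR file by hand is error-prone, so the classpath string can be assembled programmatically. The sketch below assumes that "in server\lib" means the files directly in that folder and that "under server\plugins" includes subfolders; adjust the depth if your installation differs. The ClasspathBuilder class is illustrative, and the Content Manager entries would be appended in the same way when they are needed.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative helper; not part of the redaction-client API.
class ClasspathBuilder {

    // Collects *.jar files below root, descending at most maxDepth levels.
    static List<Path> findJars(Path root, int maxDepth) throws IOException {
        try (Stream<Path> paths = Files.walk(root, maxDepth)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Joins the JAR files of server/lib (top level only, an assumption) and
    // server/plugins (recursive) with the platform's path separator.
    static String buildClasspath(Path redactionHome) throws IOException {
        Path server = redactionHome.resolve("server");
        List<Path> jars = new ArrayList<>();
        jars.addAll(findJars(server.resolve("lib"), 1));
        jars.addAll(findJars(server.resolve("plugins"), Integer.MAX_VALUE));
        return jars.stream()
                .map(Path::toString)
                .collect(Collectors.joining(File.pathSeparator));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ClasspathBuilder <IBM_REDACTION_HOME>");
            return;
        }
        System.out.println(buildClasspath(java.nio.file.Paths.get(args[0])));
    }
}
```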

Java start option requirements

The custom application must be run with these options. It may also be necessary to modify the memory pool sizes.

-Xms256m -Xmx1536m -Djava.security.auth.login.config="%IBM_REDACTION_HOME%/server/conf/login.conf"
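When the custom application is started from another Java program, the options above can be passed as separate arguments to ProcessBuilder, which avoids shell quoting problems with paths that contain spaces. The following sketch only builds the argument list. The LaunchCommand class, the installation path, and the main class name com.example.CustomApp are hypothetical; the java.exe location is the one named in the "Redaction server components" section.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; not part of the redaction-client API.
class LaunchCommand {

    // Builds the argument list for starting a custom in-process application
    // with the JVM that ships with the redaction server.
    static List<String> build(String redactionHome, String classpath, String mainClass) {
        List<String> cmd = new ArrayList<>();
        cmd.add(redactionHome + "\\ibm-java2-jre-60\\jre\\bin\\java.exe");
        cmd.add("-Xms256m");
        cmd.add("-Xmx1536m"); // adjust the memory pool sizes if needed
        cmd.add("-Djava.security.auth.login.config=" + redactionHome + "/server/conf/login.conf");
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        return cmd;
    }

    public static void main(String[] args) {
        // Hypothetical installation folder, classpath, and main class.
        List<String> cmd = build("C:\\IBM\\Redaction", "app.jar;lib.jar", "com.example.CustomApp");
        for (String arg : cmd) {
            System.out.println(arg);
        }
    }
}
```

The resulting list can be passed to new ProcessBuilder(cmd).inheritIO().start() to launch the application.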


Conclusion

The basic usage of the redaction client API has been described with three typical redaction scenarios and two modes (remote client and in-process client). By using InfoSphere Guardium Data Redaction through this API in a custom application, the document workflow in the system can be improved and the business processes can be made more efficient.


Acknowledgements

The authors would like to thank Joshua Fox and Michael Pelts of the InfoSphere Guardium Data Redaction development team for their review and valuable advice.
