I'm Joshua Fox, and you can learn more about me at my personal website.
I've had the great experience of launching a new product at IBM from scratch: IBM Optim Data Redaction. I'll blog about what it was like to create this product, and about the functionality it provides: privacy for documents.
I entered the redaction project at its seed stage, when Michael Pelts was developing ideas about protecting unstructured data towards the end of 2007. Michael and I, working at the Jerusalem site of the Israel Software Labs, built out the project with the generous sponsorship and mentorship of the late IBM Fellow Ed Kahan.
During 2008, we created prototypes and demos, explored the market, and looked for a sponsor to turn the seed project into a product. Optim adopted us in January 2009, at the deepest trough of the recession, and from that point we had a real product to build, releasing the first General Availability version in March 2010--shortly before Ed passed away.
I really believe in this product, and in the need for it. The volume of stored and transmitted documents is growing tremendously. Regulations impose two contradictory pressures: Open up, but keep private information private. Organizations today often ignore this balance, but regulatory deadlines, multiply deferred, are now coming due.
Redaction software exists, but nothing with the degree of automation and enterprise-readiness that is needed.
I'm looking forward to blogging about it: Subscribe to the RSS for more!
I have a new article on redaction, live at Infosecurity magazine.
"People who live in glass houses... should put up some window-shades." Read it here.
After a good seven months with the "Right to Privacy" blog at developerWorks, this will likely be my last post.
You'll see more of me elsewhere on the Web; or I may revive this blog eventually.
Privacy and security are different things.
Security is the low-level lockdown: The password on the database, the encryption on the file-server, the lock on the door. Security means exposing data only to authenticated users, those who have the password, the thumbprint, the key, etc.
Privacy is protecting information based on business purpose: The need-to-know. Security is a prerequisite for privacy, but privacy is much harder to enforce: Once an employee or other user has a password to the application or a key to the door, how can we ensure that they see only what they need to see?
That's what the Redaction system does. The privacy policies, based on regulations, don't just say what documents the reader can see, or what sorts of text-strings need to be deleted. The policies protect semantic types, such as personal names or telephone numbers, cross-referencing them against the role of the person who will see the document.
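As a rough illustration of this idea, here is a minimal sketch of a role-aware redaction policy. The entity types, roles, and policy table are invented for the example; they are not the actual Optim Data Redaction configuration or API.

```python
# Hypothetical sketch: a policy maps semantic entity types to the
# roles allowed to see them unredacted. All names are illustrative.

POLICY = {
    "PERSON_NAME":  {"case_officer", "auditor"},
    "PHONE_NUMBER": {"case_officer"},
    "SSN":          {"auditor"},
}

def visible(entity_type, reader_role):
    """Return True if this semantic type may be shown to this role."""
    return reader_role in POLICY.get(entity_type, set())

def redact(entities, reader_role):
    """Replace disallowed entities with a black-box placeholder."""
    return [
        text if visible(etype, reader_role) else "\u2588" * len(text)
        for etype, text in entities
    ]

entities = [("PERSON_NAME", "John Doe"), ("PHONE_NUMBER", "555-0100")]
print(redact(entities, "auditor"))  # name visible, phone blacked out
```

The point of the sketch is that the policy is expressed in terms of semantic types crossed with reader roles, not in terms of specific documents or text strings.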
The Obama passport case
makes a good example. There was no security problem: State Department subcontractors had legitimate permission to access passport records. But they had no business reason to view them in this case, which is why there was a privacy problem.
Redaction means more than creating blacked-out copies of documents. Live, automated redaction, when used as part of a document viewer, becomes a form of fine-grained access control for documents: Users should be able to browse to documents in an Enterprise Content Management system, and view them with the sensitive information redacted.
If they do have a real reason to see the information, they can ask to see it. Then, if permissions allow, the blank rectangle is filled in. The request is logged, and since the user is authenticated, the auditor knows and the user knows that the auditor knows that they asked to see this information.
This approach has the side-benefit of balancing the risks of automated over-redaction and under-redaction. We can err on the side of over-redaction, knowing that users can ask to see the information if they have a valid business purpose.
In this way, redaction becomes a layer of privacy protection on top of a generic document viewer. No longer are we just blacking out text--we're controlling access to precise units of data, live, on a need-to-know basis.
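The reveal-on-request flow described above can be sketched in a few lines. The secret store, user identifiers, and permission flag are illustrative assumptions, not the product's actual interfaces; the essential point is that every request is logged whether or not it is granted.

```python
# Minimal sketch of audit-logged reveal-on-request. All names here
# are invented for illustration.
import datetime

SECRET_STORE = {"e42": "John Doe"}  # redacted values, keyed by entity id
audit_log = []

def request_reveal(user, entity_id, permitted):
    """Log every request; return the value only if permissions allow."""
    audit_log.append({
        "user": user,
        "entity": entity_id,
        "granted": permitted,
        "at": datetime.datetime.now().isoformat(),
    })
    return SECRET_STORE[entity_id] if permitted else None

print(request_reveal("alice", "e42", True))   # revealed, and logged
print(request_reveal("bob", "e42", False))    # denied, but still logged
```

Because the user is authenticated before the request, the log entry ties the access attempt to a known identity, which is what makes the "the user knows that the auditor knows" deterrent work.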
The need for good redaction keeps coming up around Barack Obama. See this earlier post
After Obama resigned from his Senate seat to become president, Rod Blagojevich, governor of Illinois, had the job of appointing a successor. Blagojevich was accused of selling the Senate seat, among a long list of other corrupt dealings.
In the trial, Blagojevich's lawyers tried to subpoena Obama -- whether because his testimony was material to the trial, or whether they just wanted to complicate things by dragging the president into the trial. Their motion was duly posted online, with key paragraphs redacted. Here is the actual file.
The political implications are one thing (see this) but we're interested in something else.
Two interesting points here:
A. Note the double level of redaction:
1. Personally identifying information, like personal names, is replaced with a semantic category like "labor union official." This leaves the text comprehensible, but makes it impossible to identify individuals.
2. Paragraphs are blacked out, eliminating the context even for redacted entities, so that the meaning of parts of the document can no longer be understood.
B. The apparently blacked-out paragraphs were simply hidden behind black layers. Selecting the section and hitting Control-C recovers the text.
You cannot redact with ad hoc tools! Your redacted document must lack any private data which is hidden from the human eye. Because a human reviewer doesn't know what appears in hidden sections, your redacted document must not contain any hidden data at all.
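The difference between covering text and deleting it can be shown abstractly. This is not a real PDF parser; the element structure is invented to illustrate why a black rectangle drawn over a text run is not redaction.

```python
# Illustrative sketch: overlaying a black rectangle hides text
# visually but leaves it in the file, so plain text extraction --
# the equivalent of select-and-copy -- recovers it.

doc = [
    {"kind": "text", "content": "Name: John Doe"},
    {"kind": "rect", "color": "black", "covers": 0},  # drawn on top
]

def extract_text(elements):
    """Naive extraction: concatenates every text run, visible or not."""
    return " ".join(e["content"] for e in elements if e["kind"] == "text")

def true_redact(elements):
    """Proper redaction deletes the covered text runs themselves."""
    covered = {e["covers"] for e in elements if e["kind"] == "rect"}
    return [e for i, e in enumerate(elements)
            if not (e["kind"] == "text" and i in covered)]

print(extract_text(doc))               # "John Doe" leaks
print(extract_text(true_redact(doc)))  # empty after real redaction
```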
During the 2008 presidential campaign, some State Department subcontractors who were working on passport-processing looked at the records of candidates, including Barack Obama.
This was not a security breach -- it was a privacy breach. They had passwords, and needed passwords, to do their work. But they were looking at the records out of simple curiosity.
Keeping personal information private is essential, but with the Birther ruckus, Obama's personal information is even touchier than most.
The solution is to present documents to employees who need them, but without exposing individual sensitive entities which they might not need for their job, such as the name, social security number, or place of birth. If the worker needs to see the information, they can click on the redaction rectangle, and then (if permissions allow) the information is securely retrieved into the document which the worker sees.
The access is logged, and the users know it, so they will be cautious about what they ask for. Analytic software will later track suspicious patterns of access.
More on Obama and the dangers of inadequate redaction in an upcoming post, to be published here
In a previous post, I wrote about the hidden information in PDF, Microsoft Word, and even TIFF, JPEG, or PNG files. If some of this is sensitive or personally identifying information (PII) as defined in regulations, it must be redacted. This severely complicates the job of a reviewer who is redacting the document manually, particularly if they are not using sophisticated tools but simply deleting text in a word processor or other software.
But there is also a lot of data which is in principle visible -- it is rendered as part of the document's visible image -- yet in fact not visible to the human reviewer:
- Text with the same color as its background: e.g., white on white
- Super-small text: 1 pixel high, effectively invisible
- Super-wide margins: The text is pushed off the edge
- Folded sections
- Layered sections
A human reviewer cannot see and redact all this quasi-visible information. For privacy, it is essential to eliminate such text, together with the metadata, and keep only the human-readable form of the document -- after suitable redaction, of course.
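An automated pass over the document's style attributes can flag what a human eye cannot see. Here is a hedged sketch; the run/style schema and page width are invented for illustration, and a real checker would work from the actual file format's layout model.

```python
# Sketch: flag "quasi-visible" text runs from their style attributes.
# The schema below is illustrative, not any product's data model.

PAGE_WIDTH = 612  # points, US Letter

def suspicious(run):
    """Return the list of reasons a run may be invisible to a reviewer."""
    reasons = []
    if run["color"] == run["background"]:
        reasons.append("same color as background")
    if run["font_size"] < 2:
        reasons.append("sub-visible font size")
    if run["x"] >= PAGE_WIDTH:
        reasons.append("positioned off the page")
    return reasons

runs = [
    {"text": "visible", "color": "black", "background": "white",
     "font_size": 11, "x": 72},
    {"text": "SSN 123-45-6789", "color": "white", "background": "white",
     "font_size": 11, "x": 72},
    {"text": "hidden note", "color": "black", "background": "white",
     "font_size": 1, "x": 700},
]

for r in runs:
    if suspicious(r):
        print(r["text"], "->", suspicious(r))
```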
See also this excellent White Paper: "The Risks of Metadata and Hidden Information."
Our redaction software deletes sensitive text and images from documents. That's easy to see. What's harder to see is the incredible morass of hidden data which can hide inside any rich-text document.
The phenomenon is best known in Microsoft Word. It's well known that Track Changes can hold deleted information, but so can many other features of the software. For example, the little-known Fast Save feature, developed in the days when hard drives were very slow, retains deleted blocks of data to accelerate synchronization between memory and the disk.
PDF, too, can carry hidden information. PDF presents text and graphics cleanly, but inside it's a mess of elements layered, hidden, and arranged in no obvious relation to the external appearance.
Even TIFF, a multipage graphical format, is a complex wrapper for multiple images and multiple text tags (TIFF stands for Tagged Image File Format). You might think that you are safe with PNG or JPEG, simple image formats, but these two also allow textual tags. The tags are mostly intended for simple metadata like author, date of creation, location, etc., but even these can be incriminating--and any other private text could in principle be hiding in the tags.
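To make this concrete, the sketch below builds a tiny PNG in memory containing a tEXt tag and then lists every tag, using only the standard library. The "document" is minimal (it has no image data, so it isn't a viewable picture), but the chunk layout follows the PNG specification.

```python
# Sketch: even "simple" image formats carry textual tags.
import struct, zlib

def chunk(ctype, data):
    """Assemble one PNG chunk: length, type, data, CRC."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

png = (b"\x89PNG\r\n\x1a\n"
       + chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
       + chunk(b"tEXt", b"Author\x00Jane Doe")
       + chunk(b"IEND", b""))

def text_tags(data):
    """Yield (keyword, value) pairs from every tEXt chunk."""
    pos = 8  # skip the PNG signature
    while pos < len(data):
        length, = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            key, _, value = body.partition(b"\x00")
            yield key.decode(), value.decode()
        pos += 12 + length  # length + type + data + CRC

print(list(text_tags(png)))  # [('Author', 'Jane Doe')]
```

A redaction tool has to walk these chunks too, not just the pixels: an "Author" or comment tag can carry exactly the kind of name or location that the visible redaction removed.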
If some of this hidden information is private or personally identifying, it must also be redacted; but this is problematic when a human reviewer is redacting the electronic document--hidden information is by definition not visible to the person viewing it.
In principle, while redacting visible text, we could also extract the invisible text or graphics and redact sensitive entities with our combined automated/human process, as we do for the visible part of the document. But the hidden data is, internally, a jumble of unordered fields; it is not meant to be read. In some cases it is nearly impossible to reconstruct how to present the text to a machine or human reader, as when an internal script builds up some text. For example, if a macro calculates a person's age from her birthdate, it is unlikely to be found by an automated system or even a human, yet it might be using birthdate data in fields which are also hidden in the document. In a scenario where ages are considered sensitive -- for example, where discrimination lawsuits are a risk -- such information needs to be found and deleted.
The hidden data mentioned above is sometimes called "Metadata." As a technical evangelist
for Unicorn, which built metadata management software, I learned to beware that word, which means different things to different people -- so we can just call it "invisible" data. But besides such invisible data, there is also information which is in principle part of the appearance of the document, but is not visible to the human eye. I'll write about that in an upcoming post.
Continued from Privacy on the Deeper Web, part I
Users of the Web have been empowered by the ability to freely browse masses of information.
Knowledge workers in organizations, who increasingly need to find patterns and draw connections between disparate sources of data, could derive tremendous business value if they had this degree of flexibility in using intranet sites--part of the trend known as "Enterprise 2.0."
There are various reasons for the current clumsiness in intranet access, including organizational boundaries, but the need to protect private data is among the most important. How can we let knowledge workers use intranet information as freely as the open Web, while guaranteeing that the right eyes are seeing each item?
In the case of structured content or pages generated from structured content, mostly relational databases,
we can avoid exposing certain units of data -- specific tables and columns -- based on the role of the user. Yet 80% of data in the "deeper Web" is simply
documents -- HTML, PDF, scans -- each of which holds a mix of data.
We could control access to each document
as a whole, but this removes the openness which made the
Web so successful. We want the same benefits for the Deeper Web.
To resolve the openness-privacy dilemma for documents,
fine-grained access control is needed.
Users of internal
secured systems need to be able to search and browse large numbers of documents smoothly, conveniently, flexibly, with no pre-approval, just as they do on the Web. At the same
time, the private information which occurs in such systems--and generally does not
occur on the Web--needs to be securely deleted in each document, for those who are not authorized to see it.
For example, doctors not treating a patient should be allowed to see medical documents to track the spread of diseases in their hospital, but personally identifying information in these documents should be redacted.
Such deletion still leaves something to be desired. Users may have a real
business need to view the deleted data and they should be allowed to ask to see some types of redacted information, as long as they can provide a good reason. Fortunately, such users already have an existing relationship with the organization and are authenticated (logged in), so when the information is (securely) retrieved and revealed, auditors know who asked to see certain information, and the users know that the auditors know.
To access these documents, users need a search engine
which indexes and returns links to all possible documents, which users
can see if authorized--yet private information must not appear in the
search-result summary pages.
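Redacting the result snippets themselves can be sketched simply. The pattern list and the authorization flag are illustrative assumptions; a real system would use the same semantic-type policy engine as the document viewer.

```python
# Sketch: search-result snippets must be redacted per reader, just
# like the documents themselves. Patterns here are illustrative.
import re

SENSITIVE = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # e.g. US SSNs

def redact_snippet(snippet, authorized=False):
    """Black out sensitive matches unless the reader is authorized."""
    if authorized:
        return snippet
    for pattern in SENSITIVE:
        snippet = pattern.sub("\u2588\u2588\u2588", snippet)
    return snippet

hit = "...patient SSN 123-45-6789 admitted on..."
print(redact_snippet(hit))         # SSN blacked out in the summary
print(redact_snippet(hit, True))   # full text for authorized readers
```

The key design point is that redaction happens before the snippet leaves the index layer, so unauthorized users never receive the private text at all, even in a summary.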
All phases of the browsing experience--searching for information, reading documents, and revealing private information where allowed--must be as seamless as Web browsing. Employees tolerate clunky IT systems, but the goal here is not just to allow employees to do what they've been told to do. Users need to freely browse large numbers of documents, learn new things, and draw new connections, just as they do on the open Web.
The Optim Data Redaction system has a number of competitors in the software industry.
But our biggest "competitor" is still the good old black marker. This may seem strange, since deadlines for ever stricter privacy regulations are now coming due. But the redaction software industry is still not mature, and doing redaction the old-fashioned way seems cheaper and safer.
In fact, guidelines from intelligence agencies sometimes recommend printing out electronic documents, redacting them manually, then scanning and uploading them back into Enterprise Content Management systems. This avoids some of the simpler mistakes when deleting information in electronic documents--like leaving information in the change-tracking of MSWord, or hiding text behind an easily removable black layer in PDF.
The black pen, however, just won't do.
Words are often visible through the black ink. Though scans of such pages may further obscure the faint impressions of the underlying text, certain scanners with image-enhancement features can actually heighten contrast and make these traces more legible.
Think about the labor cost per page for manually redacting printouts (and typical deployments have millions of documents). Employees who can do redaction are expensive, since (1) they must understand the domain of the documents as well as the regulations driving redaction, and (2) people with those skillsets would rather not spend their days blacking out paragraphs.
Regardless of what you pay, manual redaction falls prey to human weaknesses: People miss private text and their hands stray from the line of text.
But the worst problem with the marker is, in a word, workflow. Here's what you need to do.
1. Print out 10,000 pages (just to get going; there's a lot more where that came from).
2. Redact them manually. (Don't forget to keep everything well-ordered through every step.)
3. Pass some percentage of these to a reviewer.
4. Take the documents which the reviewer rejects and feed them back into stage 2.
5. Merge the successfully reviewed documents with the rest in the original order.
6. Feed everything into the scanner. Don't miss any.
7. Next, you'll want to shred the redacted pages, despite the redaction, to minimize risk.
8. Upload the scans in a repeat of the document capture process which was used for the original documents.
9. In your ECM system, match each page with its source in the archives, so that, depending on your requirements, you can either delete the original, or link the redacted version to the original.
Each of these steps is very hard to get right. You'll need iron discipline in your manual redaction workflows. The last step, matching redacted documents with originals after the print/scan cycle, is next to impossible.
The Optim Data Redaction system has some industry-leading capabilities in areas such as entity extraction. But for many customers, the most important reason to buy the product is in the workflow, keeping track of masses of documents as they go through the redaction process.
P.S. I've noticed this repeatedly in my time in the enterprise software industry. Algorithmic and technological innovations are important as differentiators. But the biggest value-add in financial terms, the motivation for most of the sales price, is in the "enterprise features": the workflow and information management wrapped around the core engine.
The Web has changed the way we view information: Much more is known, much more is available. We can learn about any topic through Wikipedia, discover the doings of our friends through social networking, see what's happening at any location through webcams and street views.
Openness is essential to the success of these systems: Wikipedia beat out for-pay encyclopedias, and gated newspapers have mostly been forced to open their information for general view.
Much of this data was always available, though it used to be harder to get: You had to buy an encyclopedia, subscribe to a newspaper, write your friends through snailmail, or visit a location to see what's happening there.
The open data is analyzed by both humans and machines, as new analytics approaches automate the extraction of facts from masses of data, making it even more useful -- IBM even opened a new business unit in its software group for this.
Today's openness is doing a lot of good in exposing corruption and empowering the populace with knowledge. Citizens can protect their freedom and safety by watching the
government and looking out for threats to the public.
Yet the openness itself poses threats.
In some cases, even previously available information can be dangerous, as the Web and analytics allow people to pry more effectively. Your home's deeds were always available at the local land registry, but prying eyes used to have to go down to the office; now they can get information in an instant and cross-reference it to learn more.
In the enveloping and yet sometimes stifling openness, we are returning, in a way, to the tribal or small-town life of earlier ages. See David Brin's The Transparent Society.
More serious is information which was historically private and should remain that way. Credit card numbers and computer account passwords, for example, sometimes leak onto the Web. Secure practices including the right software tools are essential to keep that from happening.
Most private information, however, is fortunately kept safe in secured systems.
Even deeper than the "Deep Web" of dynamically generated, unlinked, and password-protected sites is the "Deeper Web" of intranet content: an enormous range of medical, financial, security, and legal information residing in in-house systems. Thousands of employees and other authorized viewers use this information in their daily work.
As simple tasks are made routine and handled by machine, with the help of a small number of workers, the more complex analytical tasks become more important. Knowledge workers browse masses of data, with the aid of analytical tools, to detect patterns and extract new information. But they cannot do this for Intranet information as easily as with the Internet.
The challenge is to give intranet information the openness of the Web--for authorized users--while still keeping private everything that needs to be, even from people who have passwords but don't need the information in their work.
To be continued in Privacy on the Deeper Web, part II.