Right to Privacy
Privacy and security are different things.
Security is the low-level lockdown: The password on the database, the encryption on the file-server, the lock on the door. Security means exposing data only to authenticated users, those who have the password, the thumbprint, the key, etc.
Privacy is protecting information based on business purpose: The need-to-know. Security is a prerequisite for privacy, but privacy is much harder to enforce: Once an employee or other user has a password to the application or a key to the door, how can we ensure that they see only what they need to?
That's what the Redaction system does. The privacy policies, based on regulations, don't just say what documents the reader can see, or what sorts of text-strings need to be deleted. The policies protect semantic types, such as personal names or telephone numbers, cross-referencing them against the role of the person who will see the document.
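To make the idea of cross-referencing semantic types against roles concrete, here is a minimal Python sketch. It is only an illustration, not the Redaction system's actual implementation: the role names, the semantic type labels, the `POLICY` table, and the `Entity` class are all hypothetical, and the entity positions are assumed to come from some earlier detection step.

```python
from dataclasses import dataclass

# Hypothetical policy table: role -> semantic types that role may see.
POLICY = {
    "passport_clerk": {"PERSON_NAME"},
    "auditor": {"PERSON_NAME", "PHONE_NUMBER", "SSN"},
}


@dataclass(frozen=True)
class Entity:
    """A detected sensitive span: character offsets plus its semantic type."""
    start: int
    end: int
    semantic_type: str


def redact_for_role(text, entities, role):
    """Return the text with every entity the role may not see replaced by its type label."""
    allowed = POLICY.get(role, set())  # unknown roles see nothing sensitive
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])
        if ent.semantic_type in allowed:
            pieces.append(text[ent.start:ent.end])
        else:
            pieces.append("[" + ent.semantic_type + "]")
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

The same document yields a different view for each role, and a role that the policy does not mention gets every sensitive entity redacted.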
The Obama passport case makes a good example. There was no security problem: State Department subcontractors had legitimate permission to access passport records. But they had no business reason to look at these particular records, which is why there was a privacy problem.
Redaction means more than creating blacked-out copies of documents. Live, automated redaction, when used as part of a document viewer, becomes a form of fine-grained access control for documents: Users should be able to browse documents in an Electronic Content Management system, and view them with the sensitive information redacted.
If they do have a real reason to see the information, they can ask to see it. Then, if permissions allow, the blank rectangle is filled in. The request is logged, and since the user is authenticated, the auditor knows and the user knows that the auditor knows that they asked to see this information.
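The request-and-log flow might look something like the following Python sketch. The function name `request_reveal`, the shape of the log entries, and the in-memory `AUDIT_LOG` list are assumptions made for illustration; a real system would write to tamper-resistant storage and would authenticate the user before this point.

```python
from datetime import datetime, timezone

# Illustrative only: a real audit log would be append-only and tamper-evident.
AUDIT_LOG = []


def request_reveal(user, role, field_type, secret_value, allowed_types):
    """Log the request first, then return the value only if the role permits it."""
    granted = field_type in allowed_types.get(role, set())
    AUDIT_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "field_type": field_type,
        "granted": granted,
    })
    return secret_value if granted else None
```

Note that denied requests are logged too. An auditor is likely to be at least as interested in what a user tried to see as in what they were allowed to see.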
This approach has the side-benefit of balancing the risks of automated over-redaction and under-redaction. We can err on the side of over-redaction, knowing that users can ask to see the information if they have a valid business purpose.
In this way, redaction becomes a layer of privacy protection on top of a generic document viewer. No longer are we just blacking out text--we're controlling access to precise units of data, live, on a need-to-know basis.
The need for good redaction keeps coming up around Barack Obama. See this earlier post.
After Obama resigned from his Senate seat to become president, Rod Blagojevich, governor of Illinois, had the job of appointing a successor. Blagojevich was accused of selling the Senate seat, among a long list of other corrupt dealings.
In the trial, Blagojevich's lawyers tried to subpoena Obama -- whether because his testimony was material to the trial, or whether they just wanted to complicate things by dragging the president into the trial. Their motion was duly posted online, with key paragraphs redacted. Here is the actual file.
The political implications are one thing (see this) but we're interested in something else.
Two interesting points here:
A. Note the double level of redaction:
1. Personally identifying information, such as a personal name, is replaced with a semantic category like "labor union official." This leaves the text comprehensible, but makes it impossible to identify individuals.
2. Paragraphs are blacked out, eliminating the context even for redacted entities, so that the meaning of parts of the document can no longer be understood.
B. The apparently blacked-out paragraphs were simply hidden behind black layers. Selecting the section and hitting Control-C recovers the text.
You cannot redact with ad hoc tools! Your redacted document must lack any private data which is hidden from the human eye. Because a human reviewer doesn't know what appears in hidden sections, your redacted document must not have any unseen sections.
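The difference between covering text and removing it can be shown with a toy document model in Python. This is a deliberately simplified sketch: the list-of-dicts "document" and the function names are invented for illustration and do not correspond to any real PDF library.

```python
def extract_text(runs):
    """What copy-and-paste sees: every text run, regardless of any overlay."""
    return " ".join(run["text"] for run in runs)


def cover_with_black_box(runs, index):
    """Ad hoc 'redaction': the text stays in place, only a visual overlay is added."""
    copy = [dict(r) for r in runs]
    copy[index]["overlay"] = "black"
    return copy


def redact_run(runs, index, label="[REDACTED]"):
    """Real redaction: the sensitive text itself is replaced, so nothing is left to recover."""
    copy = [dict(r) for r in runs]
    copy[index] = {"text": label}
    return copy
```

Running `extract_text` on the output of `cover_with_black_box` still returns the covered words, which is what happened with the posted motion. Running it on the output of `redact_run` returns only the placeholder.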
During the 2008 presidential campaign, some State Department subcontractors who were working on passport-processing looked at the records of candidates, including Barack Obama.
This was not a security breach -- it was a privacy breach. They had passwords, and needed passwords, to do their work. But they were looking at the records out of simple curiosity.
Keeping personal information private is essential, but with the Birther ruckus, Obama's personal information is even touchier than most.
The solution is to present documents to employees who need them, but without exposing individual sensitive entities which they might not need for their job, such as the name, social security number, or place of birth. If the worker needs to see the information, they can click on the redaction rectangle, and then (if permissions allow), the information is securely retrieved into the document which the worker sees.
The access is logged, and the user knows it, so they will be cautious about what they ask for. Analytic software will later track suspicious patterns of access.
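As a very rough Python sketch of what tracking suspicious patterns could mean, the simplest possible rule is to flag users who ask to see redacted information unusually often. The log format, the function name, and the threshold are all assumptions for illustration; real analytic software would use far richer signals.

```python
from collections import Counter


def flag_heavy_requesters(log, threshold=3):
    """Return the users with more than `threshold` reveal requests, in sorted order."""
    counts = Counter(entry["user"] for entry in log)
    return sorted(user for user, n in counts.items() if n > threshold)
```

A clerk who reveals one or two fields in the course of a day's work would pass unnoticed under this rule, while someone browsing through the records of every candidate would stand out.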
More on Obama and the dangers of inadequate redaction in an upcoming post, to be published here.
In a previous post, I wrote about the hidden information in PDF, Microsoft Word, and even TIFF, JPEG, or PNG files. If some of this is sensitive or personally identifying information (PII) as defined in regulations, it must be redacted. This severely complicates the job of a reviewer who is redacting the document manually, particularly if they are not using sophisticated tools but simply deleting text in a word processor or other software.
But there is a lot of data which is in principle visible -- it is rendered as part of the document's visible image -- which is in fact not visible to the human reviewer.
A human reviewer cannot see and redact all this quasi-visible information. For privacy, it is essential to eliminate such text, together with the metadata, and keep only the human-readable form of the document -- after suitable redaction, of course.
- Text with the same color as its background, e.g., white on white
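The white-on-white case can be checked mechanically. The Python sketch below assumes that the foreground and background colors of a text run are already available as six-digit hex strings. How to get them out of a particular file format is a separate problem, and the tolerance value is an arbitrary choice for illustration.

```python
def hex_to_rgb(color):
    """Convert a six-digit hex color such as '#FFFFFF' to an (r, g, b) tuple."""
    digits = color.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def is_effectively_invisible(foreground, background, tolerance=16):
    """True if every color channel differs by at most `tolerance`,
    i.e. the text is too close to its background for a reviewer to notice."""
    fg = hex_to_rgb(foreground)
    bg = hex_to_rgb(background)
    return all(abs(a - b) <= tolerance for a, b in zip(fg, bg))
```

The tolerance matters because text does not have to match its background exactly to escape a reviewer's eye: very pale grey on white is just as easy to miss.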
See also this excellent White Paper: "The Risks of Metadata and Hidden Information."
JoshuaFox Tags: jpeg tiff security redaction word png pdf privacy
Our redaction software deletes sensitive text and images from documents. That's easy to see. What's harder to see is the incredible morass of hidden data which can hide inside any rich-text document.
The phenomenon is best known in Microsoft Word. It's well known that Track Changes can hold deleted information, but so can many other features of the software. For example, the little-known Fast Save feature, developed in the days when hard drives were very slow, retains deleted blocks of data to accelerate synchronization between memory and the disk.
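To show how easily tracked deletions survive in a file, here is a short Python sketch using only the standard library. It assumes the newer .docx format, which is a zip archive whose main text lives in `word/document.xml`, with tracked deletions stored in `w:delText` elements. It looks only at the main document body, not at headers, footnotes, or comments, and it does not cover the older binary .doc format in which Fast Save operated. The function name is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary used in .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tracked_deletions(docx_file):
    """Return text that was 'deleted' with Track Changes on but is still stored in the file.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [node.text or "" for node in root.iter("{%s}delText" % W_NS)]
```

If this returns anything at all for a document that is about to be released, the "deleted" material is still travelling with the file.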
PDF, too, can carry hidden information. PDF presents text and graphics cleanly, but inside it's a mess of elements layered, hidden, and arranged in no obvious relation to the external appearance.
In principle, while redacting visible text, we could also extract the invisible text or graphics and redact sensitive entities with our combined automated/human process, as we do for the visible part of the document. But the hidden data is, internally, a jumble of unordered fields; it is not meant to be read. In some cases, it is nearly impossible to reconstruct how to present the text to a machine or human reader, as when an internal script builds up some text. For example, if a macro calculates a person's age from her birthdate, it's unlikely to be found by an automated system or even a human, yet it might be using birthdate data in fields which are also hidden in the document. In a scenario where ages are considered sensitive, for example, where discrimination lawsuits are a risk, such information needs to be found and deleted.