Right to Privacy
Privacy and security are different things.
Security is the low-level lockdown: The password on the database, the encryption on the file-server, the lock on the door. Security means exposing data only to authenticated users, those who have the password, the thumbprint, the key, etc.
Privacy is protecting information based on business purpose: The need-to-know. Security is a prerequisite for privacy, but privacy is much harder to enforce: Once an employee or other user has a password to the application or a key to the door, how can we ensure that they see only what they need to?
That's what the Redaction system does. The privacy policies, based on regulations, don't just say what documents the reader can see, or what sorts of text-strings need to be deleted. The policies protect semantic types, such as personal names or telephone numbers, cross-referencing them against the role of the person who will see the document.
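To make the policy idea concrete, here is a minimal sketch in ActionScript (the language used in the Flex posts below) of how a policy might tie semantic entity types to the roles allowed to see them. All of the names here are hypothetical illustrations; the real product's policies are configured from regulations, not hand-coded like this.

    package
    {
        // Hypothetical sketch: map each semantic entity type to the roles
        // that may see it unredacted.
        public class RedactionPolicy
        {
            private var allowedRoles:Object = {
                PERSON_NAME:  ["auditor"],
                PHONE_NUMBER: ["auditor", "customer_service"],
                SSN:          []                      // visible to no one by default
            };

            // May a reader with this role see this entity type unredacted?
            public function maySee(entityType:String, role:String):Boolean
            {
                var roles:Array = allowedRoles[entityType] as Array;
                if (roles == null)
                    return true;                      // unlisted types are not sensitive
                return roles.indexOf(role) != -1;
            }
        }
    }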
The Obama passport case makes a good example. There was no security problem. State Department subcontractors had legitimate permission to access passport records. But they had no business reason in this case, which is why there was a privacy problem.
Redaction means more than creating blacked-out copies of documents. Live, automated redaction, when used as part of a document viewer, becomes a form of fine-grained access control for documents: Users should be able to browse to documents in an Enterprise Content Management system, and view them with the sensitive information redacted.
If they do have a real reason to see the information, they can ask to see it. Then, if permissions allow, the blank rectangle is filled in. The request is logged, and since the user is authenticated, the auditor knows and the user knows that the auditor knows that they asked to see this information.
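Viewed this way, a reveal request is a small protocol: log the request, check the requester's role against the policy, and only then fill in the rectangle. A hedged sketch, reusing the hypothetical RedactionPolicy above (again, illustrative names only, not the product's actual API):

    package
    {
        // Hypothetical sketch of the reveal protocol.
        public class RevealService
        {
            private var policy:RedactionPolicy = new RedactionPolicy();

            // Handle a click on a redaction rectangle for an authenticated user.
            public function requestReveal(userId:String, role:String,
                                          entityType:String, hiddenValue:String):String
            {
                audit(userId, entityType);            // log every request, granted or not
                if (policy.maySee(entityType, role))
                    return hiddenValue;               // fill in the blank rectangle
                return "[REDACTED]";                  // no permission for this entity type
            }

            private function audit(userId:String, entityType:String):void
            {
                // In a real system this would go to a secure audit store for later analytics.
                trace(new Date().toString(), userId, "asked to see", entityType);
            }
        }
    }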
This approach has the side-benefit of balancing the risks of automated over-redaction and under-redaction. We can err on the side of over-redaction, knowing that users can ask to see the information if they have a valid business purpose.
In this way, redaction becomes a layer of privacy protection on top of a generic document viewer. No longer are we just blacking out text--we're controlling access to precise units of data, live, on a need-to-know basis.
The need for good redaction keeps coming up around Barack Obama. See this earlier post.
After Obama resigned from his Senate seat to become president, Rod Blagojevich, governor of Illinois, had the job of appointing a successor. Blagojevich was accused of selling the Senate seat, among a long list of other corrupt dealings.
In the trial, Blagojevich's lawyers tried to subpoena Obama -- whether because his testimony was material to the trial, or whether they just wanted to complicate things by dragging the president into the trial. Their motion was duly posted online, with key paragraphs redacted. Here is the actual file.
The political implications are one thing (see this) but we're interested in something else.
Two interesting points here:
A. Note the double level of redaction:
1. Personally identifying information, such as personal names, is replaced with a semantic category like "labor union official." This leaves the text comprehensible, but makes it impossible to identify individuals.
2. Paragraphs are blacked out, eliminating the context even for redacted entities, so that the meaning of parts of the document can no longer be understood.
B. The apparently blacked-out paragraphs were simply hidden behind black layers. Selecting the section and hitting Control-C recovers the text.
You cannot redact with ad hoc tools! A redacted document must not contain private data that is merely hidden from the human eye. Since a human reviewer cannot know what lurks in hidden sections, the redacted document must not contain any unseen content at all.
During the 2008 presidential campaign, some State Department subcontractors who were working on passport processing looked at the records of candidates, including Barack Obama.
This was not a security breach -- it was a privacy breach. They had passwords, and needed passwords, to do their work. But they were looking at the records out of simple curiosity.
Keeping personal information private is essential, but with the Birther ruckus, Obama's personal information is even touchier than most.
The solution is to present documents to employees who need them, but without exposing individual sensitive entities which they might not need for their job, such as the name, social security number, or place of birth. If the worker needs to see the information, they can click on the redaction rectangle, and then (if permissions allow) the information is securely retrieved into the document which the worker sees.
The access is logged, and the users know it, so they will be cautious about what they ask for. Analytic software will later track suspicious patterns of access.
More on Obama and the dangers of inadequate redaction in an upcoming post, to be published here.
In a previous post, I wrote about the hidden information in PDF, Microsoft Word, and even TIFF, JPEG, or PNG files. If some of this is sensitive or personally identifying information (PII) as defined in regulations, it must be redacted. This severely complicates the job of a reviewer who is redacting the document manually, particularly if they are not using sophisticated tools but simply deleting text in a word processor or other software.
But there is a lot of data which is in principle visible -- it is rendered as part of the document's visible image -- yet is in fact not visible to the human reviewer: for example, text with the same color as its background, such as white text on white.
A human reviewer cannot see and redact all this quasi-visible information. For privacy, it is essential to eliminate such text, together with the metadata, and deliver only the human-readable form of the document -- after suitable redaction, of course.
See also this excellent White Paper: "The Risks of Metadata and Hidden Information."
Life is moving to the Internet. Work life and personal life merge. I keep my personal blog and IBM blog separate, but a few of the posts at adarti.blogspot.com are relevant to topics here. Here's the first of a series on important new directions in software.
The basic engineering of cars has not changed in decades. Are we software engineers doomed to end up like GM employees?
At least one type of software is developing and growing: Software which squeezes meaning out of large, messy datasets.
Most older data-oriented applications relied on data that was neatly lined up, with its meaning expressed in rigid arrangements. For example, First Name, Last Name, Date of Birth could appear as fields in a database. Applications which worked with text were limited to simple searches for known patterns. But the burgeoning new category of software finds patterns where none were known before.
I'll describe some variations on this concept, because I think that any software engineer needs to keep an eye on this as a potential career direction. I'll present this for software engineers developing applications, rather than for computer scientists focused on algorithms, or business people looking for ways to sell software.
That's because this type of software is in the ideal state of maturity for us software engineers to get into: Solutions exist in each one of these areas, having long since emerged from academia and entered real products. But the software is far from commoditized. Many of the leading solutions require lots of custom coding and configuration, and besides that, they are hard to use. If you are a software developer, engineer or entrepreneur, the time is right to create the ideal easy-to-use software package for a well-defined functional area; or at least to use the clunky libraries available today to build a specific application which blows away the competition.
See more articles in the series here: 1 2 3 4 5 6 7 8
I have asked a few questions on StackOverflow which apparently ventured into unexplored territory--or at least no one knew a simple answer.
This one, for example.
It turns out that Flash cannot handle HTTP status codes, even though such codes are essential to modern REST architectures. It is not even possible to pass error info as HTTP headers (see this) -- none are accessible to the Flash client!
You get only a generic error on the Flash side, with no way to read the status code or the response body.
The workaround is to always pass the 200 "OK" status code from your servlet or other server code, and use XML content in the body, e.g. with an error element that carries the failure details.
(By the way, you should always use POST rather than GET, DELETE, or PUT, since Flex passes more information with POST than with GET -- see this. Again, you'll have to create your own mechanism using a special HTTP parameter, to convey the HTTP-method used.)
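Putting the workaround together on the client side: here is a rough sketch (ActionScript in an MXML Script block) of how the Flex code might consume such a 200-wrapped response. The URL, the _method parameter, and the error element are assumptions for illustration, not a standard protocol.

    import mx.rpc.events.ResultEvent;
    import mx.rpc.http.HTTPService;

    private var service:HTTPService = new HTTPService();

    private function callServer():void
    {
        service.url = "/myapp/endpoint";         // hypothetical URL
        service.method = "POST";                 // always POST, as noted above
        service.resultFormat = "e4x";            // parse the body as XML
        service.addEventListener(ResultEvent.RESULT, onResult);
        service.send({ _method: "DELETE" });     // hypothetical parameter carrying the real verb
    }

    // Assumed response shape:  <response><error code="404">Not found</error></response>
    //                     or:  <response><data>...</data></response>
    private function onResult(event:ResultEvent):void
    {
        var xml:XML = XML(event.result);
        // The server always answers 200 OK; success or failure lives in the body.
        if (xml.error.length() > 0)
            trace("Server error", xml.error.@code, ":", xml.error.text());
        else
            trace("Success:", xml.toXMLString());
    }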
I've been working with Flex DataGrids, trying to validate the cells in them both from the "inside" -- an event handler which validates on exiting each cell -- and from the "outside" -- an "OK" button in a modal dialog, which validates the entire grid before "agreeing" to close it; if any cell is not valid, its errorString is set, so that we get a red outline and a special red-bordered tooltip.
For example, this grid has fields; each has a specific validation constraint.
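For the "outside" pass, here is a minimal sketch of what the OK button might call, assuming a DataGrid with id "grid" whose rows carry hypothetical name and age fields. Setting errorString on the grid produces the red outline and the red-bordered error tip.

    import mx.collections.ArrayCollection;

    // Called from the OK button of the modal dialog; returns true if the dialog may close.
    private function validateGrid():Boolean
    {
        var problems:Array = [];
        var rows:ArrayCollection = grid.dataProvider as ArrayCollection;
        for (var i:int = 0; i < rows.length; i++)
        {
            var row:Object = rows.getItemAt(i);
            if (row.name == null || row.name == "")
                problems.push("Row " + (i + 1) + ": name is required");
            if (isNaN(Number(row.age)) || Number(row.age) < 0)
                problems.push("Row " + (i + 1) + ": age must be a non-negative number");
        }
        // A non-empty errorString shows the red outline and tooltip; an empty string clears it.
        grid.errorString = problems.join("\n");
        return problems.length == 0;
    }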
Our redaction software deletes sensitive text and images from documents. That's easy to see. What's harder to see is the incredible morass of hidden data which can hide inside any rich-text document.
The phenomenon is best known in Microsoft Word. It's well known that Track Changes can hold deleted information, but so can many other features of the software. For example, the little-known Fast Save feature, developed in the days when hard drives were very slow, retains deleted blocks of data to accelerate synchronization between memory and the disk.
PDF, too, can carry hidden information. PDF presents text and graphics cleanly, but inside it's a mess of elements layered, hidden, and arranged in no obvious relation to the external appearance.
In principle, while redacting visible text, we could also extract the invisible text or graphics and redact sensitive entities with our combined automated/human process, as we do for the visible part of the document. But the hidden data is, internally, a jumble of unordered fields; it is not meant to be read. In some cases, it is nearly impossible to reconstruct how to present the text to a machine or human reader, as when an internal script builds up some text. For example, if a macro calculates a person's age from her birthdate, it is unlikely to be found by an automated system or even a human, yet it might be using birthdate data in fields which are also hidden in the document. In a scenario where ages are considered sensitive -- for example, where discrimination lawsuits are a risk -- such information needs to be found and deleted.
Flex is coded with ActionScript, which combines aspects of scripting languages and application languages in a balance which is just right for GUIs.
ActionScript supports numeric arrays and associative arrays (maps). Every object is an associative array, so that m["age"] is the same as m.age -- useful for dynamically building an object as easily as a map, and indeed such objects are often used as maps.
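A few lines (in a script block) show the flavor:

    var m:Object = {};              // a plain Object doubles as an associative array
    m["age"] = 42;
    trace(m.age);                   // 42 -- bracket and dot access are interchangeable
    m.name = "Dana";                // a field added dynamically (hypothetical data)
    for (var key:String in m)       // iterate the keys, as with any map
        trace(key, "=", m[key]);

    var a:Array = [10, 20, 30];     // a numeric array
    trace(a[1]);                    // 20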
GUIs are best expressed declaratively, and Flex lets you do this exactly when needed with a language called MXML, unlike, e.g., the basic forms of Swing or SWT. MXML mixes XML declarations of GUI structure and event-binding with ActionScript for event handlers and any code needed to support the GUI class.
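A toy example (with made-up component ids) of how MXML mixes declarative structure and event binding with an ActionScript handler:

    <?xml version="1.0" encoding="utf-8"?>
    <!-- Structure and event binding declared in XML; the handler is plain ActionScript. -->
    <mx:Application xmlns:mx="http://www.adobe.com/2006/mxml">
        <mx:Script>
            <![CDATA[
                private function greet():void {
                    output.text = "Hello, " + nameInput.text;
                }
            ]]>
        </mx:Script>
        <mx:TextInput id="nameInput"/>
        <mx:Button label="Greet" click="greet()"/>
        <mx:Label id="output"/>
    </mx:Application>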
I'm preparing some posts about overcoming some quirks in Flex. Stay tuned.
David Hill, an analyst with the Mesabi Group, wrote an article about the Redaction product. David also wrote an introduction to our White Paper, which I mentioned in a recent blog post.
Continued from Privacy on the Deeper Web, part I
Users of the Web have been empowered by the ability to freely browse masses of information.
Knowledge workers in organizations, who increasingly need to find patterns and draw connections between disparate sources of data, could derive tremendous business value if they had this degree of flexibility in using intranet sites--part of the trend known as "Enterprise 2.0."
There are various reasons for the current clumsiness in intranet access, including organizational boundaries, but the need to protect private data is among the most important. How can we let knowledge workers use intranet information as freely as the open Web, while guaranteeing that the right eyes are seeing each item?
In the case of structured content or pages generated from structured content, mostly relational databases, we can avoid exposing certain units of data -- specific tables and columns -- based on the role of the user. Yet 80% of data in the "deeper Web" is simply documents -- HTML, PDF, scans -- each of which holds a mix of data.
To resolve the openness-privacy dilemma for documents, fine-grained access control is needed.
Users of internal secured systems need to be able to search and browse large numbers of documents smoothly, conveniently, flexibly, with no pre-approval, just as they do on the Web. At the same time, the private information which occurs in such systems--and generally does not occur on the Web--needs to be securely deleted in each document, for those who are not authorized to see it. For example, doctors not treating a patient should be allowed to see medical documents to track the spread of diseases in their hospital, but personally identifying information in these documents should be removed.
Such deletion still leaves something to be desired. Users may have a real business need to view the deleted data and they should be allowed to ask to see some types of redacted information, as long as they can provide a good reason. Fortunately, such users already have an existing relationship with the organization and are authenticated (logged in), so when the information is (securely) retrieved and revealed, auditors know who asked to see certain information, and the users know that the auditors know.
To access these documents, users need a search engine which indexes and returns links to all possible documents, which users can see if authorized--yet private information must not appear in the search-result summary pages. All phases of the browsing experience--searching for information, reading documents, and revealing private information where allowed--must be as seamless as Web browsing. Employees tolerate clunky IT systems, but the goal here is not just to allow employees to do what they've been told to do. Users need to freely browse large numbers of documents, learn new things and draw new connections, just as they do on the open Web.
The Optim Data Redaction system has a number of competitors in the software industry.
But our biggest "competitor" is still the good old black marker. This may seem strange, since deadlines for ever stricter privacy regulations are now coming due. But the redaction software industry is still not mature, and doing redaction the old-fashioned way seems cheaper and safer.
In fact, guidelines from intelligence agencies sometimes recommend printing out electronic documents, redacting them manually, then scanning and uploading them back into Enterprise Content Management systems. This avoids some of the simpler mistakes made when deleting information in electronic documents--like leaving information in the change-tracking of MS Word, or hiding text behind an easily removable black layer in PDF.
The black pen, however, just won't do.
Words are often visible through the black ink. Though scans of such pages may further obscure the faint impressions of the underlying text, certain scanners with image-enhancement features can actually heighten contrast and make these traces more legible.
The Web has changed the way we view information: Much more is known, much more is available. We can learn about any topic through Wikipedia, discover the doings of our friends through social networking, see what's happening at any location through webcams and street views.
Openness is essential to the success of these systems: Wikipedia beat out for-pay encyclopedias, and gated newspapers have mostly been forced to open their information for general view.
Much of this data was always available, though it used to be harder to get: You had to buy an encyclopedia, subscribe to a newspaper, write your friends through snailmail, or visit a location to see what's happening.
The open data is analyzed by both humans and machines, as new analytics approaches automate the extraction of facts from the masses of data, making it even more useful -- IBM even opened a new business unit in its software group for this.
Today's openness is doing a lot of good in exposing corruption and empowering the populace with knowledge. Citizens can protect their freedom and safety by watching the government and looking out for threats to the public.
Yet the openness itself poses threats.
In some cases, even previously available information can be dangerous, as the Web and analytics allow people to pry more effectively. Your home's deeds were always available at the local land registry, but prying eyes used to have to go down to the office; now they can get information in an instant and cross-reference it to learn more.
In the enveloping and yet sometimes stifling openness, we are returning, in a way, to the tribal or small-town life of earlier ages. See David Brin's The Transparent Society for more.
More serious is information which was historically private and should remain that way. Credit card numbers and computer account passwords, for example, sometimes leak onto the Web. Secure practices including the right software tools are essential to keep that from happening.
Most private information, however, is fortunately kept safe in secured systems.
Even deeper than the "Deep Web" of dynamically generated, unlinked and password-protected sites is the "Deeper Web" of intranet content, an enormous range of medical, financial, security, and legal information residing in in-house systems. Thousands of employees and other authorized viewers use this information in their daily work.
As simple tasks are made routine and handled by machine, with the help of a small number of workers, the more complex analytical tasks become more important. Knowledge workers browse masses of data, with the aid of analytical tools, to detect patterns and extract new information. But they cannot do this for Intranet information as easily as with the Internet.
The challenge is to give intranet information the openness of the Web--for authorized users--while still keeping private everything that needs to be, even from people who have passwords but don't need the information in their work.
To be continued in Privacy on the Deeper Web, part II.