My last blog post was about the best paper winner at the ACM Symposium on Document Engineering, and I mentioned there that I'd make another post to give you an idea of the breadth and depth of the papers presented. The bad news is that you'll have to wait one more post to hear about my own papers, because I want to keep the focus here on other people's ideas; the good news is that this is that follow-on post.
I can't tell you about every paper, or even come close, and omissions here should in no way be construed as meaning anything; I just had to pick some. Did I mention it was a really interesting conference?
I was quite impressed with a pair of papers in which natural language processing techniques were used either to extract a meaningful summary of a text (a kind of lossy compression, where we expect high compaction and low information-theoretic loss) or to rewrite the text into a format that is easier to read (the intent being to increase understanding by persons with literacy challenges). It isn't just the computer science of natural language analysis that excites me here. These endeavors are ripe for formal statistical experiments on human subjects to prove out whether the computing techniques are actually benefiting humanity. That's exactly the kind of high-impact innovation for which IBM itself is world-renowned.
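Neither paper's machinery is something I can do justice to in a blog post, but to give a flavor of the summarization half, here's a toy frequency-based extractive summarizer. To be clear, this is my own illustration of the lossy-compression idea, not the authors' method:

```python
# Toy frequency-based extractive summarizer -- my own illustration of the
# "lossy compression" idea, not the method from either paper.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "that", "it"}

def summarize(text, max_sentences=2):
    """Keep the sentences whose content words are most frequent overall."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    # Re-emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)

print(summarize("Document engineering is fun. The weather is nice. "
                "Engineering documents means engineering fun."))
```

The real systems are far more sophisticated (and the text-simplification work is a different beast entirely), but even this toy exposes the compaction-versus-loss trade-off.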
There was a really interesting paper on how to reuse and repurpose unstructured content, entitled "Automated repurposing of implicitly structured documents." What interested me first was the work they do to analyze the style applied to character sequences in order to glean document structure like sections and subsections. But then I realized that what they were really up to was the repurposing angle: applying the content of one document to a new style format by mapping the implicit structure gleaned from the source document onto the implicit structure gleaned from the target.
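To make the structure-gleaning half concrete, here's a toy sketch; the run data and the "bigger bold text means higher-level heading" heuristic are stand-ins of mine, not the authors' actual analysis:

```python
# Toy sketch: infer heading levels from the styling of text runs.
runs = [  # (text, font_size, is_bold) -- stand-in for real word-processor data
    ("Introduction", 18, True),
    ("Documents often carry structure only in their styling.", 11, False),
    ("Background", 18, True),
    ("Prior work", 14, True),
    ("Earlier systems required explicit markup.", 11, False),
]

# Rank the distinct bold styles by font size: the biggest becomes level 1.
heading_styles = sorted({(size, bold) for _, size, bold in runs if bold},
                        key=lambda style: -style[0])
level = {style: depth + 1 for depth, style in enumerate(heading_styles)}

for text, size, bold in runs:
    depth = level.get((size, bold))
    if depth:
        print(f"{'#' * depth} {text}")  # inferred section / subsection heading
    else:
        print(text)                     # inferred body paragraph
```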
Another interesting work was called PrintMonkey, which isn't featured here because its authors include people from my alma mater (the University of Victoria), nor because it has a really cool name. The simple fact is that the problem it addresses is one I have personally struggled with, and it's nice to see progress being made on it. The issue is that sometimes you need a printed copy of a website, but the default web browser print algorithm is typically ill-suited to paper rendition: it spills over the edges of the paper, or takes chunks of the page and prints them on separate pages (ever tried to print a Google map?). What's neat about this work is not just that they solve the problem, but that they do so without requiring the participation of website authors AND that they bring a social computing aspect to the work. Specifically, the solution lets people share prior layouts of the same website. Who knows, maybe someone else has already laid out the site in a way that suits your purposes. Very Web 2.0.
A short paper I enjoyed presented a concept called content-based identifiers (CBIs) for documents. When you store a document in a content repository, how about storing it at a URI that includes the hash of the document? That way, retrieval of the document can immediately be followed by a check of whether the document you got has been altered. Moreover, a log of the document's history can simply consist of the list of its CBIs, with each entry possibly followed by a hash of that CBI combined with its predecessor in the log. This latter step establishes a chain of evidence, because the predecessor document must have existed in order for its hash to have been computed and combined with the hash of the current document.
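Here's a minimal sketch of the idea; I'm assuming SHA-256 as the hash, though nothing here hinges on that particular choice:

```python
# Minimal sketch of content-based identifiers and a hash-chained history.
# SHA-256 is my assumption; the scheme works with any strong hash.
import hashlib

def cbi(document: bytes) -> str:
    """A content-based identifier: the hash of the document's bytes."""
    return hashlib.sha256(document).hexdigest()

def chain_entry(predecessor: str, document: bytes) -> str:
    """Bind a new version to its predecessor: hash(predecessor entry + new CBI)."""
    return hashlib.sha256((predecessor + cbi(document)).encode()).hexdigest()

v1, v2 = b"first draft", b"second draft"
log = [cbi(v1)]
log.append(chain_entry(log[0], v2))

# On retrieval, re-hashing the bytes and comparing against the CBI in the URI
# immediately reveals any alteration.
assert cbi(v1) == log[0]
```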
Speaking of change tracking and document versioning, there was a good paper on "Merging Changes in XML Documents Using Reliable Context Fingerprints." The basic idea is that lots of XML documents contain element sequences, which can make it hard to identify where a change should be applied when merging changes from multiple contributors. The naive method of saying "change element /doc/e as follows" from one contributor breaks if another contributor adds a new e. This paper attempts to identify the location where a change should go by finding the best match to node structure and content within a given "radius" of nodes.
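In the spirit of the idea (though not the paper's actual fingerprint scheme), here's a toy version of locating a change target by matching its recorded neighborhood rather than a fixed index:

```python
# Toy sketch of context-based patch placement: score each position by how
# many of its neighbors match the fingerprint recorded when the change was
# made, instead of trusting a fixed path index.
def best_match(nodes, fingerprint):
    """`fingerprint[d]` is the node text recorded at offset d from the target."""
    def score(i):
        return sum(1 for d, text in fingerprint.items()
                   if 0 <= i + d < len(nodes) and nodes[i + d] == text)
    return max(range(len(nodes)), key=score)

nodes = ["alpha", "beta", "beta2", "gamma"]          # "beta2" inserted by someone else
fingerprint = {-1: "alpha", 0: "beta", 1: "gamma"}   # recorded before that insertion
print(best_match(nodes, fingerprint))                # -> 1: still points at "beta"
```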
A real brain-opener for me was a paper about improving query performance by generating XML schemas from system design artifacts that capture not only the conceptual information to be manipulated by the system but also the expected query workload, the idea being to reduce the number of nodes a query has to touch to produce a result.
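To show what "fewer nodes touched" means, here's a contrived before-and-after of my own devising, with the same data in two schema shapes and a naive whole-tree scan over each:

```python
# Contrived illustration (mine, not the paper's): the same 100 prices in two
# schema shapes, and how many nodes a naive whole-tree scan visits in each.
import xml.etree.ElementTree as ET

# Shape A: prices buried inside per-item detail blocks.
deep = ET.fromstring(
    "<catalog>"
    + "".join(f"<item><detail><meta/><price>{i}</price></detail></item>"
              for i in range(100))
    + "</catalog>")

# Shape B: prices hoisted into one shallow, query-friendly section.
shallow = ET.fromstring(
    "<catalog><prices>"
    + "".join(f"<price>{i}</price>" for i in range(100))
    + "</prices></catalog>")

def nodes_walked(root):
    """What a naive evaluator does: visit every node in the tree."""
    return sum(1 for _ in root.iter())

print(nodes_walked(deep))     # 401 nodes visited to answer a price query
print(nodes_walked(shallow))  # 102 nodes visited for the same answer
```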
The last paper I can reasonably present in this too-long blog entry deals with the adaptability of multimedia document content. This is a fascinating area that addresses the issue of presenting content in the face of spatial, temporal, or interactivity constraints (e.g., limited visual space, limited bandwidth, limited available CPU).
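To give a flavor of what adaptation can mean in practice, here's a toy variant-selection sketch of my own; the real work in this area goes much deeper:

```python
# Toy sketch (mine, not from the paper): pick the richest variant of a media
# item that still fits the current bandwidth and display budget.
variants = [
    {"name": "video-hd",    "kbps": 4000, "width": 1280},
    {"name": "video-sd",    "kbps": 800,  "width": 640},
    {"name": "still-image", "kbps": 50,   "width": 640},
]

def adapt(variants, max_kbps, max_width):
    fitting = [v for v in variants
               if v["kbps"] <= max_kbps and v["width"] <= max_width]
    # Degrade gracefully: richest variant that fits, or nothing at all.
    return max(fitting, key=lambda v: v["kbps"]) if fitting else None

print(adapt(variants, max_kbps=1000, max_width=800)["name"])  # -> "video-sd"
```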
Despite the number of things I did talk about above, I'm even more mindful now than before of how many good things I've had to leave out. Hopefully you'll at least have a better idea of the interesting concepts presented at this conference, and will take the time to peruse the DocEng conference series for other items of interest to you.
Next up in this blog, I will take the time to tell you about my own papers at this conference: "Interactive Office Documents and Web 2.0 Applications" and "An Office Document Mashup for Document-Centric Business Processes."