An important OSCON focus this year was open government, including the cooperation possible between open source developers and a more transparent government. I also attended sessions on improving the reliability of computer-based voting.
Keynote sessions at OSCON often do a good job setting the tone of the conference sessions. Tim O'Reilly, who certainly must have followed the convention planning, made some comments that spoke to the range of concerns of the next days and advanced his own current areas of interest. Like many in the open source movement, O'Reilly has been concentrating much of his recent efforts working with government initiatives and with open source developers hoping to get more out of government information resources.
Before launching into his work with what he called "Gov 2.0," O'Reilly touched on some other interesting issues of "federation" of data services and social networks. Foreshadowing a widely expressed sentiment, O'Reilly discussed the move to smaller devices—mainly smart phones—that both employ lots of sensors (touch screen, camera, microphone, motion, GPS) and utilize cloud data services for functions such as speech recognition, search, location, face recognition, and integration into social networks. The thing about the cloud is that the more cooperation that exists between data services, the more valuable each one is; the downside is that it creates a tendency towards monopoly.
Back to government, though. Now, in the current operation of government, information of all sorts is published much more openly and there has been a big move to an "Open Government Platform" in which data services provide SDKs and open APIs for citizens (and open source developers) to utilize. O'Reilly spoke particularly highly of the new Federal Chief Technology Officer, Aneesh Chopra, in this regard.
According to O'Reilly, the entity Gov 2.0 has several features. It emphasizes transparency of social networks and allows citizen contribution and collaboration. It treats government as a platform and a means of collective action, but it's about Benjamin Franklin's concept of citizen action, not what Donald Kent called "vending machine government." Open source developers bring an attitude to this process of "give us tools, and we can do it ourselves," which precisely characterizes an engaged citizenry.
There are, however, still obstacles in the move to Gov 2.0, the main problem being the byzantine procurement process that remains. O'Reilly offered the example of Brewster Kahle and archive.org offering to mirror the historical state of White House Web sites for minimal cost, but the existing procurement process makes it far more difficult to spend tens of thousands for an agile project than to spend millions (or billions) on inflexible and overdesigned (and closed source) ones. So now White House workers take fragile and time-consuming manual snapshots of sites, which is both inefficient and far more expensive than the right approach.
I listened to two talks that spoke directly to using open source for public elections. Long-time readers have seen my prior articles discussing open source for elections and my work with the Open Voting Consortium (OVC). In fact, I've presented technical results of OVC's demo/prototype systems at previous OSCONs.
One of the talks this year was by Neal McBurnett, who developed a system using Django for Election Verification for the city of Boulder, Colorado. Aside from the technical issues that he discussed, the wonderful and remarkable thing is that he managed to move the county to use these systems in 2008, thereby improving the reliability and transparency of those elections.
McBurnett's talk did not delve too deep into his specific code, but more interestingly looked at design principles for election auditing and more generally for elections systems. One of the most important principles he mentioned was Rivest and Wack's notion of "software independence" for elections. In short, this is the idea that the dependence on software is non-essential and importantly, each component is replaceable and operates on well-defined interfaces. Paper ballots and open formats are the main elements of this software independence.
Most of McBurnett's talk concerned the best principles for statistical verification of ballots and tallies. There are surprisingly subtle details of sampling and a difference between fixed-percentage and risk-limiting audits. Those details are better addressed in his slides, linked to in the Resources section.
The second talk, closely related to this, was by James Tillman and Richard Benham, who presented a system they had developed in Florida to conduct elections correctly. At this point, Tillman and Benham, who have worked with prominent elections activist Ion Sancho, have a proof-of-concept, a good system that has not yet been used in public elections (nor has the source code yet been released, though that was promised as coming soon).
As a number of independent projects have, Tillman and Benham came to the same conclusions as the OVC did in 2003. The best voting system is one that uses computer devices, probably touch screen interfaces, to produce voter-verifiable paper ballots. The advantages of using computers at all are:
- Enabling accessibility to voters with disabilities
- Providing multi-lingual capabilities
- Reducing ballot spoilage through overvoting or undervoting
In the end, the computer should just be a fancy printer. As with OVC's design, Tillman and Benham decided to use a 2D barcode as part of their printed ballots. OVC's code stack uses Python, PyGame, and Postscript to produce this result; Tillman and Benham instead use Java™ code with most of the work being done directly with XML technologies such as EML, XSLT, and XSL-FO. In both cases, an essential element of the design is the use of open source (and therefore voter-inspectable, in principle) tool stacks.
Although it did not focus centrally on elections, a session called "Hacking the Open Government" was very informative as well. This session featured Adina Levin, Silona Bonewald, Kevin Marks, Ilan Rabinovitch, and California Secretary of State Debra Bowen. Secretary Bowen's star appeal drew me to this one; it is heartening to see a prominent elected official who "gets" openness, and I was happy to have the opportunity to speak with Secretary Bowen briefly in the question period after the session.
The second aspect of the convention I paid special attention to was concurrency. As Moore's law has generally ended, remaining improvements in processor performance focus on more cores rather than greater clock speeds, so distributing work among cores and chips (and nodes and networks) becomes increasingly central to good theoretical application design. Plus, good distribution deserves good tools to implement and maintain it.
REvolution Computing's ParallelR
At the invitation of the company's press representative, I had a very nice talk with REvolution Computing's CEO, Richard Schultz. The company, following the model of many open source companies, has created an enhanced enterprise version that adds various features to the basic REvolution R product, with a pricing structure that allows the enterprise version to make sense for commercial application. I'm not specifically interested in REvolution Computing's marketing, though, but rather in the general product they have created.
In both versions, REvolution Computing has boosted the R programming language to greatly enhance performance, and the company has added domain-specific libraries for both computation and graphing. The performance improvements are of two sorts. On the one hand, they have simply improved algorithms and optimized the speed of linear computations. According to Schultz, they have managed 2-5x improvements as a rule. That is impressive, but of even more interest to me is how they have approached distributing work over heterogeneous resources, from cores to node clusters.
ParallelR is based on an approach to concurrency developed for the Linda
Coordination Language, based on allocation of segments of work from a
centrally managed tuple-space. R itself is well suited to such
parallelization, being largely functionally based. ParallelR adds packages
foreach and iterator,
which provide high-level constructs to parallelized computation, the
allocation being handled transparently from a programmer's point of view.
For the large-scale computation often performed using R, this high-level
concurrency is a big win.
A caveat that is almost automatic in discussion of heterogeneous concurrency is that it works best for problems that are "embarrassingly parallel." Fortunately, elementwise operations on large data sets are precisely this kind of problem and largely the type of problem R users usually face.
Ted Leung is always an enjoyable speaker—he's someone who has done a lot of work for the dynamic programming language communities, especially around Python. His talk was a detailed overview of multiple approaches to concurrency; some of it will be familiar to most readers, but some is relatively esoteric for mainstream computer science courses and practical programmers. Leung enumerated a half-dozen or so models or paradigms for concurrency.
The model that is probably most familiar to most programmers is threading. The problems in threading are almost equally familiar: lock contention and deadlocks. More generally, parallelization with locks always requires a lot of explicit planning by developers. Other approaches have their own drawbacks but mostly allow higher level abstraction away from explicitly locking resources.
One approach to concurrency is software transactional memory, which has been popularized (somewhat) by implementation in Haskell. In essence, this style treats memory similarly to transactional database systems in which operations maintain atomicity, with rollback and retry in cases of conflicting requests. Transactional memory can lead to duplication of computation, but in the frequent case where conflicting actions are uncommon, the overhead can still be less than that of locking.
Another model is that of Actors, as especially well known in Erlang (though Leung was careful to observe that the heritage of the Actor model and of Erlang are independent). The Actor model is basically a "shared nothing" architecture in which Actors always pass self-contained messages to request work from other Actors. I wrote about a presentation on Actors during last year's OSCON.
Leung discussed, though briefly as time was running out, the tuple-space model underlying Linda and also ParallelR. In this model, tasks are uncoupled in space and time, and Readers are decoupled from Writers.
Leung also mentioned in passing Bill Ackerman's notion of "dataflow," which allows declarative concurrency. The problem with this one is that while it inherently partitions problems, it cannot deal with non-determinism (which is to say, it is meant for pure functional languages); once you need to deal with real-world indeterminacy (say, input and output), bolted-on equivalents of Actors become needed.
A last model Leung discussed is persistent data structures. This is an incomplete model; it does not deal with all the issues in concurrency. It does a good measure, though. This model has long roots but has been popularized in the last year or two by its use in the language Clojure. The basic idea here is that all data structures are preserved (that is, immutable), but copies representing various modifications are made. Once you have immutability, all the issues of locking go away. Not every data structure is a persistent version, but the large majority of widely used ones have variants that preserve big-O efficiency of their mutable counterparts (see Resources for a link to more information on big-O notation).
Clojure provides functional concurrency for the JVM
Howard Lewis Ship introduced the language Clojure in a general way in his presentation. I have been hearing quite a lot of buzz around this new language for the JVM. Of course, if you have heard anything about it, you know that Clojure is really a very old language, known as Lisp (albeit with some improvements to built-in datatypes and minor syntax additions).
I have been excited to find an opportunity to do some concrete development in Clojure, but haven't quite found the chance yet. There are a number of nice features of Clojure, which Ship outlined. Being built atop the JVM, Clojure is an equal citizen with other JVM languages and in particular has access to all the standard libraries that come with the Java language (or for that matter, any classes you might happen to have already).
From a concurrency perspective, Clojure does a number of things right or at least in a plausibly right way. It introduces an extremely broad set of persistent data structures to avoid contention issues. Even in the narrow areas where conflicts can arise among operations, Clojure uses software transactional memory to manage concurrency. Moreover, an Agent framework is also built on top of Clojure, if you want to go that route. My understanding is that the inherent capability only allows Agents among threads on the same JVM, but I am certain someone has built or will build larger scale multinode agent systems (heck, just use one of the several Agent systems that exist already for Java code).
Learn
- Learn more about the
O'Reilly Open Source Convention.
- The project
Open Source for America
is a broad advocacy pushing for open source as a way to better the work of
government.
- The
Open Voting Consortium
is a great place to get involved in assuring accurate and transparent
public elections.
- The
Election Markup Language (EML)
is an attempt to deliver a reliable XML specification for data interchange
among hardware, software, and service vendors who provide election systems
and services. EML is brought to you by the OASIS Election and Voter
Services Technical Committee.
- View the
PowerPoint slides from Neal McBurnett's presentation
"ElectionAudits: a Django App for Advanced Election Auditing."
- The session
"Hacking the Open Government"
discussed how to use the volumes of data recently available in Internet
form.
- REvolution Computing's
ParallelR 2.0
provides parallelized routines for classification and bootstrapping
statistical methods, as well as the ability to easily parallelize custom R
programs.
- Wikipedia has a good backgrounder on
Linda's concurrency constructs.
- David's "Statistical programming with R"
series offers some good resources on R programming:
- "Dabbling with a wealth of statistical facilities" (developerWorks, September 2004) helps you get raw data into shape.
- "Functional programming and data exploration" (October 2004) shows you how to find and analyze anomalies.
- "Reusable and object-oriented programming" (January 2006) unveils R's underlying features.
- Read the PDFs from
Ted Leung's survey presentation
and
Howard Lewis Ship's presentation on Clojure.
- Enjoy using
Clojure to bring even more dynamic
programming to the Java Virtual Machine.
- Wikipedia has a good explanation of
big-O notation.
- Concurrency was also a big hit at OSCON
2008, which
David
also covered.
-
In the
developerWorks Linux zone,
find more resources for Linux developers, and scan our
most popular articles and
tutorials.
-
See all
Linux tips and
Linux tutorials on developerWorks.
-
Stay current with
developerWorks technical events and Webcasts.
Get products and technologies
-
With
IBM trial software,
available for download directly from developerWorks, build your next development
project on Linux.
Discuss
-
Get involved in the
My developerWorks community; with your personal profile and custom home page, you
can tailor developerWorks to your interests and interact with other developerWorks users.

David Mertz has been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book Text Processing in Python. For more on David, see his personal Web page.




