Language designer's notebook: First, do no harm

Sometimes, the bad code enabled by a new language feature outweighs the good

While some proposed language features are simply a solution in search of a problem, most have their roots in real-world situations in which the existing features do not enable programmers to express what they want to say as easily, clearly, succinctly, or safely as they'd like. Although having a use case in mind — "this feature enables me to write this code that I want to be able to write" — is good, language designers also need to evaluate language features in light of the bad code they might also enable.

Share:

Brian Goetz (brian.goetz@oracle.com), Java Language Architect, Oracle

Brian Goetz photoBrian Goetz is the Java Language Architect at Oracle and a veteran contributor to developerWorks. Brian's writings include the Java theory and practice column series published here from 2002 to 2008, and the definitive work on Java concurrency, Java Concurrency in Practice (Addison-Wesley, 2006).



19 July 2011

Also available in Chinese Japanese

About this series

Every Java™ developer probably has a few ideas about how the Java language could be improved. In this series, Java Language Architect Brian Goetz explores some of the language design issues that have presented challenges for the evolution of the Java language in Java SE 7, Java SE 8, and beyond.

When designing a language from scratch, we have the opportunity to evaluate language features as a group, adjusting them to interact synergistically or avoid negative interactions. And we have the opportunity to pick which programming idioms and mental models we want to encourage through the selection of language features. When considering new language features for existing languages, we have fewer options. We often can't (at least not easily) adjust other features to accommodate the new features, and certain programming idioms are already in the language's vernacular. In these cases, often the best we can do is to design around them.

Although some proposed features are inspired by blue-sky thinking, most have their roots in concrete use cases. They are often born of frustration with how clunky, verbose, or fragile the code for expressing a specific idiom is in the language now, accompanied by the thought that "wouldn't it be nice if I could just write ...." But it is a long way from enables this cool code to good feature. Clearly, for a language feature to be desirable, it must enable us to express some "good" new programs that were not expressible before — but new language features can also enable us to express some new "bad" programs too. And even if the new feature doesn't enable new "bad" programs, it might undermine existing language invariants, user expectations, or performance-model characteristics. Evolving an existing language requires trading off the benefit of increased expressiveness against the harm of reduced safety, feature interactions, or user confusion.

A simple example: Switching on objects

One of the language features introduced in Java SE 7 is to allow the switch statement to operate on variables of type String as well as on primitive types and enums. Extending the reach of the switch statement, not only to strings but also to other types, has been the subject of repeated enhancement requests over the years. (RFE 5012262, for example, requests switching not only on strings, but on any object, comparing via its equals() method [see Resources].) Another similar frequent request is to allow non-constant expressions to appear as case labels of switch statements.

At first glance, the switch statement appears to be nothing more than syntactic sugar for an equivalent nest of if ... else statements. In fact, developers frequently choose between switch and nests of if...else statements largely on the basis of which looks better in the code. But they are not equivalent. The restrictions inherent in the switch statement (case labels must be constant values, and the switch operand is restricted to types that behave like constants) exist for both performance and safety reasons. The restriction to constant labels enables the branch computation to be an O(1) operation rather than an O(n) operation. In the nest of if...else statements, reaching the else block requires performing all the comparisons, because the semantics of the if...else statement require sequential execution. The restriction of case labels to constant-like values (primitives, enums, and strings) ensures that the comparison operation is side-effect-free (and therefore permits certain optimizations that would not otherwise be possible). If we permitted arbitrary objects as case labels, then calling the equals() method might have unpredictable side-effects.

If we were designing the language from scratch, we would have more latitude to decide whether programmer convenience is more important than predictability here, and define the semantics (and restrictions) of the switch statement accordingly. But for the Java language, that ship has sailed. Extending switch beyond constant-like values would undermine the performance model that Java developers have built up over the years, and so the additional expressiveness of allowing arbitrary objects in switch does not outweigh the costs. Because the String class is immutable, and highly specified and controlled, it was practical to bring it into the fold of switch — but it was prudent to stop there.


A more controversial example

The most significant new language feature in Java SE 8 is lambda expressions (or closures). Closures are function literals — expressions that embody a deferred computation that can be treated as a value and invoked later. And they are lexically scoped, which means that the meanings of symbols inside a closure should be the same as their meanings outside (modulo local variable declarations inside the closure that might shadow variables from the lexical scope). The Java language has had a weak form of closures since Java SE 1.1 — inner classes — but their limitations and cumbersome syntax have discouraged the development of APIs that really exploit the abstractive power that such code-as-data mechanisms allow.

Having closures in the language enables APIs to express more-cooperative — and therefore richer — computations, by allowing the client to provide bits of the computation. The Collections APIs support a limited form of this behavior, such as passing a Comparator to Collections.sort(), but only for relatively heavyweight operations such as sorting. For simpler operations like "make a list of elements whose size is greater than 10," we instead force the client to unroll the operation manually, as in this example:

Collection<Element> bigOnes = new ArrayList<Element>();
for (Element e : collection)
    if (e.size() > 10)
        bigOnes.add(e);

Although this code is reasonably compact and readable, the collection API is not helping us very much, and we're forced into a fundamentally serial execution (because the semantics of the for loop are serial, and this is our only means of iterating the elements of a collection). This operation — extracting a desired subset of elements from a collection — is a common one. It would be nice if we could move all the control logic (serial or parallel) into a library routine, parameterized only by the predicate for which elements we want. Then the code reduces to something like:

Collection<Element> bigOnes
    = collection.filter(#{ Element e -> e.size() > 10 });

We could do this with inner classes, but they are so clunky to use that sometimes it seemed the cure was worse than the disease. Inner classes were available at the time the Collections framework was developed, but the syntactic overhead of inner classes made creating Collections APIs that encouraged using them less desirable. (The syntax here for lambda expressions, as well as the improvements to the Collections APIs, are provisional and are only meant to be suggestive of what code we might be able to write with Java SE 8.)

The lambda expression in the previous example is a particularly well-behaved kind of lambda expression — one that captures no values from its lexical scope. But it is often valuable to express a computation relative to other values already in scope, such as the capture of the local variable n in the following method:

public static<T> Collection<Element> biggerThan(Collection<Element> coll, int n) {
    return coll.filter(#{ Element e -> e.size() > n });
}

One limitation of inner classes — and lambda expressions too — is that they can only refer to final local variables from their lexical scope. Lambda expressions in Java SE 8 make this restriction slightly more palatable by also allowing capture of effectively final variables — those that are not declared final but are not modified after their initial assignment. (Inner classes can refer to mutable instance fields if the inner-class expression occurs in an instance context, but this is not the same thing. A good way to think of this is that a reference inside an inner class to the field x from the enclosing class is really shorthand for Outer.this.x, where Outer.this is an implicitly declared final local variable.) Among the several motivations for this restriction, not the least was that restricting local-variable capture to final fields allows the closure to copy the reference, thereby preserving the behavior that the lifetime of a local variable is the lifetime of the block in which it is declared.

There's no doubt that the restriction to only capturing immutable state from the lexical context is quite irritating to programmers. It's probably doubly irritating that it seems that while the Java language is finally getting closures, this aspect of closures seemingly missed the boat.

The canonical example of code that would want to capture a mutable local would be something like Listing 1:

Listing 1. Capture of a mutable local variable by a closure (illegal in Java SE 8)
int sum = 0;
collection.forEach(#{ Element e -> sum += e.size() });
System.out.printf("The sum is %d%n", sum);

This certainly seems a reasonable — and even obvious — thing to want to do, and this is certainly a common idiom in some other languages that support closures. Why wouldn't we want to allow this code in Java?

First, although it doesn't look that way initially, this is a major change to the semantics of local variables. The lifetime of a local variable is limited to the lifetime of the block in which it is declared. But lambda expressions are treated as values and can therefore be stored in a variable and executed after the block in which the captured variable was declared has gone out of scope. If capture of mutable locals were allowed, the platform would need to extend the local variable's lifetime for as long as the dynamic lifetime of any lambda expression that captures it. This would be a major change to programmer expectations about local variables, especially in the absence of any special declaration that this is one of those weird new long-lived local variables.

The problem gets worse when you consider that the forEach() method might want to call the lambda from other threads, so that the function could be applied to different elements of the collection in parallel. Now, we would have created a data race on the local variable sum, because multiple threads might want to update it simultaneously. Data races on local variables would be a new category of hazards, because we currently can always count on local-variable accesses being data-race-free. There is no straightforward way to make the code in Listing 1 thread-safe, making this idiom an accident waiting to happen.

At this point, prudence dictates that we retreat. Avoiding data races in concurrent Java programs is already far more difficult than we'd like. One haven of safety from this hazard has been that local variables are immune to data races, because they are only ever accessible from a single thread. But allowing mutable locals to be captured by lambda expression elevates them to behaving like fields, not locals — invisibly — and thereby exposes them to the hazards of data races. To evolve the language in 2011 in a way that makes concurrent and parallel operations even more dangerous would be foolish.

It might be possible to rescue this idiom — say, by defining a modifier on capturable mutable locals (thus differentiating them explicitly from ordinary locals) that would constrain the semantics of lambdas capturing them to be restricted to the thread and lexical scope in which the variable is defined. Such a feature would be a trade-off — adding complexity to the language to preserve the viability of a specific programming idiom (and an outdated — inherently serial — one at that).

A better solution

A reason to not embrace additional complexity at this time to support this idiom is that there's a better way to get the same result. This idiom is an example of a mapping operation combined with a reduction or folding operation, whereby an associative operator (such as sum or max) is applied pairwise to a sequence of values. Such reduction operations, by virtue of associativity, are amenable to parallelization. We can expose a mapReduce() method directly on the collection as such:

int sum = collection.mapReduce(0, #{ Element e -> e.size() },
                               #{ int left, int right -> left + right });

Here, the first lambda expression is the mapper (mapping each element to its size), and the second lambda expression is the reducer, which takes two sizes and adds them. This code computes the same result as the example in Listing 1, except in a parallel-friendly way. (The parallelism doesn't come for free — the library would have to provide the parallelization — but at least when the idiom is expressed in this way, the library can implement the operation in parallel.) Not only are mapping and reduction amenable to parallelization, but the map and reduce operations can be combined into a single parallel pass, which is even more efficient. (And this all can be done with no mutable state in the client code.)

In reality, we would probably express this more compactly using a method reference to the size() method for our mapper and a predefined reducer for integer summation:

int sum = collection.mapReduce(0, #Element.size, Reducers.INT_SUM);

Once you get used to the idea of specifying computations in this way, this code reads like the problem statement: apply integer summation to the result of the size() method on each element in the collection.

Don't fight it

It probably won't take most developers long to figure out that there is a "workaround" for the restriction on capture of mutable locals: replacing the local with a final reference to a one-element array, as shown in Listing 2:

Listing 2. Faking out the compiler with a final reference to a one-element array. Don't do this!
int[] sumH = new int[1];
collection.forEach(#{ Element e -> sumH[0] += e.size() });
System.out.printf("The sum is %d%n", sumH[0]);

This code passes the compiler — and therefore might offer the brief satisfaction of having "put one over on the system" — but it reopens the possibility of data races. It's not a good idea, and you should resist the temptation. Like removing the blade guard on your table saw, it increases the risk of an accident. But unlike with the table saw, any severed fingers are more likely to be someone else's than your own. Given that a safer (and possibly faster) idiom exists for this — map-reduce — there's no excuse for writing such unsafe code, even if it seems safe "in this case."


Conclusion

It's easy to look at a new language feature and see only the good code that it will enable. We should always be on the lookout for better ways to do things, but new language features also can enable some really bad things to happen too. Because the risk of introducing a bad language feature is so high, language designers need to be conservative when making the cost-benefit analysis of whether the good outweighs the bad. If in doubt, we should be heedful of the maxim Primum non nocere: first, do no harm.

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Java technology on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Java technology
ArticleID=733101
ArticleTitle=Language designer's notebook: First, do no harm
publish-date=07192011