Spice up collections with generics and concurrency

Use the Java Collections Framework enhancements in Java SE 6

The Java™ Collections Framework is an important aspect of the Java platform. Both desktop and enterprise applications typically collect items to work with. This article shows you how to work with collections while taking advantage of enhancements made to the framework in Java SE 6. You can go far beyond HashMap or TreeSet by using generics and concurrency features to make your applications more maintainable and scalable.

John Zukowski (jaz@zukowski.net), Consultant, JZ Ventures, Inc.

John ZukowskiJohn Zukowski does strategic Java technology consulting for JZ Ventures, Inc. He has been working with the Java platform since before its 1.0 release and has authored numerous articles and books, including Java Collections and Java 6 Platform Revealed, both from Apress.



08 April 2008

Also available in Chinese Japanese

The Java Collections Framework has evolved considerably since its initial release with the Java 2 platform, Version 1.2. With the Java SE 5 release, generics enhanced the framework, and the introduction of java.util.concurrent added direct concurrency support (see Resources). With Java SE 6, the framework adds better bidirectional collection access. This article introduces you to all of these aspects of the collections library and helps you to take advantage of the most popular concurrency-related features.

The high-level task for this article is a Web crawler: given a base URL for a Web site, you'll collect elements from the site that you can use for some purpose. You'll collect a series of links from a single Web page and then crawl a whole site. You'll break the high-level tasks down into subtasks that can be spun off into their own jobs. Along the way, you'll get to know and use generics and thread pools. To keep matters simple, you'll implement the task as a standalone client-side application. (Explaining how to deploy a Web application isn't central to the article's purpose. But feel free to create a Web application from which the task can be initiated as an additional exercise.)

You should be familiar with developing programs for the Java platform. This article assumes familiarity with the networking and I/O libraries for making socket connections and reading streams, respectively. You need a working developer installation of the Java SE 6 platform. It should be at least Update 5 of JDK 6 from Sun Microsystems or the latest SDK for Java, Version 6 from IBM.

Getting to know generics

The concept of generics has been part of the Java platform since the Java SE 5 release (see Resources). Simply put, generics provide compile-time type safety for collections. With earlier Java platform versions, you created a collection and added items into it, as in Listing 1:

Listing 1. Adding items to a collection — the old way
List buttonList = new LinkedList();
buttonList.add(new JButton("One"));
buttonList.add(new JButton("Two"));
buttonList.add(new JButton("Three"));
buttonList.add(new JButton("Four"));

To fetch elements from the collection, you had to know the type of object in the collection so you could cast it back to the appropriate local variable:

JButton first = (JButton)buttonList.get(0);

You didn't need to cast it back to the right type, but you did so on the assumption that you wanted to do something with the specific class type. This worked fine until you accidentally added the wrong type object to the collection:

buttonList.add(new JLabel("Five"));

Now, if you try to fetch the last element as a JButton, a class cast exception happens at run time:

Line 13: JButton last = (JButton)buttonList.get(4);
>java GetIt
Exception in thread "main" java.lang.ClassCastException:
  javax.swing.JLabel cannot be cast to javax.swing.JButton
        at GetIt.main(GetIt.java:13)

There's nothing wrong per se with putting a JLabel into a collection, but if the fetching code is expecting all the elements in the collection to be of the same type — JButton here — then fetching a JLabel out of it generates the ClassCastException. And the exception doesn't occur until run time; if you haven't done enough testing, it might not happen until after the application is deployed.

Generic collection use

Step now into the world of generics. Generics can help you resolve coding problems earlier in the development cycle. Instead of just having a collection and adding JButton objects into it, you say that you have a collection of JButton objects. Then, if you try to add a JLabel to that collection, the compiler picks up the discrepancy and complains at compile time.

Listing 2 contains a program that generates the compilation-time error message that happens when you try to add the wrong type of element to a generic collection — a List<JButton> in this case:

Listing 2. Sample code using generics (does not compile)
import java.util.*;
import javax.swing.*;

public class GetIt {
  public static void main(String args[]) {
    List<JButton> buttonList = new LinkedList<JButton>();
    buttonList.add(new JButton("One"));
    buttonList.add(new JButton("Two"));
    buttonList.add(new JButton("Three"));
    buttonList.add(new JButton("Four"));
    JButton first = buttonList.get(0);
    buttonList.add(new JLabel("Five"));
    JButton last = buttonList.get(4);
  }
}

When you save and compile the program, you'll notice the final call to add() fails:

>javac GetIt.java
GetIt.java:12: cannot find symbol
symbol  : method add(javax.swing.JLabel)
location: interface java.util.List<javax.swing.JButton>
                buttonList.add(new JLabel("Five"));
                          ^
1 error

The second line of the error message says you are trying to add a JLabel to what the third error line is reporting as a List of JButton objects. You then must decide if the collection should really be a collection of Component objects (or JComponents if you want to stick to Swing platform components) or if you shouldn't have tried to add the JLabel in the first place.

Notice in Listing 2 that fetching an item out of a collection doesn't require a cast to the right type. Because you've described the collection to be of a certain type, all calls to get items out of that collection return the given type.

The use of generics makes your code base much more maintainable, especially as it grows and you turn code elements into reusable libraries. Your library's users no longer need to worry that there are limitations on the types of objects in a collection. Properly defined methods should now include those types with their definitions. And the compiler warns you if and when you deviate from that type.


Generic compiler issues

The compiler complains when it compiles a class that uses a collection whose definition lacks a generic type, as it does for the code in Listing 1. For example, suppose you try to compile a class containing this line:

List buttonList = new LinkedList();

The compiler issues a warning:

>javac GetIt.java
Note: GetIt.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

You can ignore the warning and just leave things be. Assuming you don't accidentally add the wrong datatype to the collection, everything will work fine.

Seeing detailed warnings

To reveal the specific issues the compiler is warning you about, pass the -Xlint:unchecked command to the compiler. You'll see output like Listing 3's:

Listing 3. Detail from compilation with Xlint
>javac -Xlint:unchecked GetIt.java
GetIt.java:7: warning: [unchecked] unchecked call to add(E) as a member of
        the raw type java.util.List
    buttonList.add(new JButton("One"));
                  ^
GetIt.java:8: warning: [unchecked] unchecked call to add(E) as a member of
        the raw type java.util.List
    buttonList.add(new JButton("Two"));
                  ^
GetIt.java:9: warning: [unchecked] unchecked call to add(E) as a member of
        the raw type java.util.List
    buttonList.add(new JButton("Three"));
                  ^
GetIt.java:10: warning: [unchecked] unchecked call to add(E) as a member of
        the raw type java.util.List
    buttonList.add(new JButton("Four"));
                  ^
4 warnings

You can see in Listing 3 that the compiler doesn't specifically care that the List wasn't defined with a datatype. What it does say is that each call to add() is a problem because the List wasn't defined with a datatype.

Again, these are warnings, so you can ignore them. However, fixing the collection to specify the type explicitly will remove the chance that you'll miss a real error at compile time among all the warnings.

Suppressing compiler warnings

If you're using a library that you can't or don't want to change, you can suppress the compiler warnings. An @SuppressWarnings annotation tells the compiler that you know the code generates warnings, but you don't want to see them. If you include the following line right before the method whose unchecked warnings you want to ignore, the compiler stops complaining about that method:

@SuppressWarnings("unchecked")

Now, when you compile the class, you get no warning messages or error messages. You still run the risk of getting a ClassCastException if a datatype other than what's expected is handled. The choice is yours.


Reading a Web page

You should now have a good grasp of what generics can buy you and how they can make your programs more maintainable. Next up is creating a program to collect all the links on a particular Web page. Although you could write a program yourself that reads in the Web page and parses the content, that isn't necessary. The Swing component libraries provide this capability. All you need to do is ask for all the href attributes associated with the anchor (<a>) tags on a page.

Getting the document

The javax.swing.text.html package includes an HTMLEditorKit. If you provide a stream to it, it parses the associated Web page. From that parsed stream, you can tell the kit to iterate through all available tags and get the anchor tags' href attributes. This program could be more elaborate and collect image tags or Flash movies, but it will only collect those of the form <a href="...">xxx</a>.

All you need to do is create a new HTMLEditorKit instance and pass in a Reader to the content. Because the idea is to get the content from a remote Web site, you must get that Reader with a http:// string you enter on the command line. That string is then passed into the URL constructor, from which you get a URLConnection. The process sounds complicated but really isn't. Listing 4 shows how it works:

Listing 4. Reading a Web page
HttpURLConnection.setFollowRedirects(false);
EditorKit kit = new HTMLEditorKit();
Document doc = kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

String uri = args[0];
Reader reader = null;

if (uri != null && uri.startsWith("http")) {
  URLConnection conn = new URL(uri).openConnection();
  reader = new InputStreamReader(conn.getInputStream());
} else {
  System.err.println(
   "Usage: java ListUrls http://example.com/startingpage");
  System.exit(-1);
}

kit.read(reader, doc, 0);

The input stream associated with the connection is provided to the EditorKit's read() method. The other arguments to read() are a Document, which you create by calling the kit's createDefaultDocument() method, and a position to start reading, typically 0 to indicate start of stream.

Listing 4 throws in two additional helpful tasks. The HttpURLConnection class's setFollowRedirects() method is called to disable following redirect requests. And the IgnoreCharsetDirective property of the Document is set because apparently there's a bug in HTMLEditorKit when a charset attribute is in a page's <meta> tag.

Iterating through the elements

The next Swing class you'll use is ElementIterator, found in the javax.swing.text package. Given a Document like the one just created, you can iterate through the elements within it:

ElementIterator it = new ElementIterator(doc);
javax.swing.text.Element elem;
while ((elem = it.next()) != null) {
  // ...
}

By hunting for the <a> tags, you can get the associated href attributes to add to the collection of found links. The collection you use here is a Set, because it isn't necessary to collect duplicates:

Set<String> uriList = new TreeSet<String>();
// Below is inside of while loop
AttributeSet s = (AttributeSet)
  elem.getAttributes().getAttribute(HTML.Tag.A);
if (s != null) {
  String href = (String)
    s.getAttribute(HTML.Attribute.HREF);
  uriList.add(href); 
}

Although the steps you've performed so far are sufficient to collect all the links, you might choose to handle some special cases. For instance, you don't need to add cases where the found href is null — something that shouldn't happen with a well-formed document but sometimes does. Also, internal links don't have a leading http://. It is best to append these internal links to the document's base URL so that if you need to loop through the list again (as in the next task), you have a full URL. And, it's probably best not to follow javascript: tags. Many more enhancements are probably doable. Listing 5 shows the complete program:

Listing 5. Code for listing a single page's URLs
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class ListUrls {
  public static void main(String args[]) throws Exception {
    Set<String> uriList = new TreeSet<String>();
    HttpURLConnection.setFollowRedirects(false);
    EditorKit kit = new HTMLEditorKit();
    Document doc = kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

    String uri = args[0];
    Reader reader = null;
    
    if (uri != null && uri.startsWith("http")) {
      URLConnection conn = new URL(uri).openConnection();
      reader = new InputStreamReader(conn.getInputStream());
    } else {
      System.err.println(
        "Usage: java ListUrls http://example.com/startingpage");
      System.exit(-1);
    }

    kit.read(reader, doc, 0);
    ElementIterator it = new ElementIterator(doc);
    javax.swing.text.Element elem;
    while ((elem = it.next()) != null) {
      AttributeSet s = (AttributeSet)
        elem.getAttributes().getAttribute(HTML.Tag.A);
      if (s != null) {
        String href = (String)s.getAttribute(HTML.Attribute.HREF);
        if (href == null) {
          continue;
        } else if (href.startsWith("javascript:")) {
          continue; // skip it
        } else if (href.startsWith("https:")) {
          // add as is
        } else if (!href.startsWith("http:")) {
          href = uri + href;
        }
        uriList.add(href);
      }
    }
    for (String element: uriList) {
      System.out.printf(">>%s<<%n", element);
    }
  }
}

The program prints out the collected set of URLs. Download and compile the ListUrls program, and run it by passing in a URL on the command line (see the Download section for a link to this article's full source code). Your exact results will depend on the page you're collecting from.


Thread pools

The ListUrls program in Listing 5 collects all the outgoing links on a particular page. To take this program to the next level and make it crawl a whole site, it's best to break it apart into smaller tasks. Although you can keep the whole job in a single thread, the application is bound to be blocked by I/O delays because it must read Web pages in full before they can be processed. Network delays are another reason to break things up into multiple threads. Handling each element of the Set in its own thread should speed up the whole job considerably. Of course, you need to bound the number of threads, or else too many different tasks will execute, and more time will be spent swapping between them than doing any actual work.

Executor

Java SE 5 introduced the java.util.concurrent library as well as generics (see Resources). The Executor interface accepts Runnable objects to execute. This is just like passing Runnable objects into the Thread constructor, but with an Executor, a Thread can be reused to get a new Runnable after the prior one finishes. Thus, the program avoids the overhead of constantly throwing away and re-creating threads. The Executor interface has a single method —execute()— that accepts the Runnable parameter. What happens with it depends upon the specific implementation of the Executor interface.

One such implementation of Executor is ThreadPoolExecutor. You don't call the ThreadPoolExecutor constructor directly to create a thread pool but instead use the Executors utility class to create one. For a fixed-size thread pool, use newFixedThreadPool(int maxThreads); or use newFixedThreadPool(int maxThreads, ThreadFactor factory), which allows you to provide a factory for creating the underlying threads.

Once you've created your thread pool, you add tasks to run with the service(Runnable) method. For the Web crawler you're creating, you can call the awaitTermination() method to find out when all the tasks are done, or at least when the thread pool has terminated, as shown in Listing 6:

Listing 6. Working with a thread pool
String uri =...
ExecutorService service = Executors.newFixedThreadPool(5);
service.execute(new Crawler(service, uri, uri));
service.awaitTermination(300, TimeUnit.SECONDS);
for (String element: allUriList) {
  System.out.printf(">>%s<<%n", element);
}

The awaitTermination() method accepts a timeout. In Listing 6, the program is set to time out after five minutes. You can use a shorter or longer timeout based on how long you want the program to run, your network connection's speed, and the depth of the Web site you want to crawl.

Notice also that only the base URI string is added to the crawler. As each page is read, new URIs are added to the job queue.

Runnable

The Runnable task performed by the service is the bulk of the code from Listing 5. I've added some additional checks to improve building the URLs for the next page. A check at the end of the execute() method sees if the service should terminate. Typically, the pool would just run until the program ends, but this program ends when the pool is done, so the check is necessary.

Download the CollectUrls program and run it on a relatively small site, preferably your own, to get a dump of all the links from the site. You can also modify the program to maintain a multimap: if you know the source of each link, you can automatically generate a map of the site hierarchy and interconnections.


Other thread pools

The CollectUrls Web crawler program takes advantage of a fixed-size thread pool. It's not the only one available to you. Three other types can be created using the Executors utility class:

  • newCachedThreadPool() creates a pool of potentially unbounded size, but it kills the threads when they are idle for too long. Consider using it if you have a bunch of short asynchronous tasks. If the thread is available in the pool, it gets used. If no thread is available, a new one is created, and then if the thread from the pool is idle for 60 seconds, it goes away. When nothing is going on, no resources are used. In contrast, the fixed-size pool keeps all the threads around waiting for something to do when there are no tasks to complete.
  • newSingleThreadExecutor() creates a pool that's useful for jobs that need to be done sequentially. If the underlying thread dies, it gets re-created. This is similar to creating a fixed-size pool of one, but without being able to change the underlying pool size.
  • newScheduledThreadPool() creates a pool that works like the Timer object but is better at handling uncaught exceptions and thread starvation. With the Timer class, you can have a TimerTask that runs for a long time, blocking other tasks from running. Having multiple threads in a pool can prevent that from happening and still keep the threads' schedule.

Additional collection types are also available. As an alternative to the scheduled thread pool, consider using a DelayQueue. It allows you to add items to the collection that cannot be pulled out until their delay period has expired. It is a specific type of BlockingQueue: getting an item from the queue will block until the delay has passed, if an item isn't available immediately.


Conclusion

This article took you through the task of creating a Web crawler by:

  • Collecting the set of URI strings found in a generic Set
  • Spawning Runnable tasks to find more URIs on a site's pages
  • Using a thread pool to work its way through the Runnable operations

To expand the Web crawler, consider collecting image references or searching for specific text strings. You can do a lot with the program to enhance its functionality and learn more about working with concurrent collection techniques.


Download

DescriptionNameSize
Sample code for this articlej-collections-code.zip3KB

Resources

Learn

  • Java Collections Framework: API documentation, tutorials, and other resources for the Collections Framework.
  • "The Collections Framework" (John Zukowski, developerWorks, July 2005): This article in the author's Taming Tiger series covers the generics, concurrency, and other Java Collections Framework enhancements introduced in Java SE 5.
  • "Concurrency in JDK 5.0" (Brian Goetz, developerWorks, November 2004): This tutorial shows how the concurrency classes introduced in Java SE 5 help make code faster and more scalable, reliable, maintainable.
  • "Introduction to generic types in JDK 5.0" (Brian Goetz, developerWorks, December 2004): This tutorial, designed for intermediate and advanced Java developers, introduces generic types.
  • "Generics gotchas" (Brian Goetz, developerWorks, January 2005): Read this article if you want to avoid common pitfalls that can ensnare first-time users of generics.
  • Java Collections (John Zukowski, Apress, April 2001): This book covers the Collections Framework introduced with the Java 2 platform.
  • Browse the technology bookstore for books on these and other technical topics.
  • developerWorks Java technology zone: Find hundreds of articles about every aspect of Java programming.

Get products and technologies

  • Download IBM product evaluation versions and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Java technology on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Java technology
ArticleID=300200
ArticleTitle=Spice up collections with generics and concurrency
publish-date=04082008