The Java Collections Framework has evolved considerably since its initial release
with the Java 2 platform, Version 1.2. With the Java SE 5 release, generics
enhanced the framework, and the introduction of
java.util.concurrent added direct concurrency support
(see Resources). With Java SE 6, the framework adds
better bidirectional collection access. This article introduces you to all of
these aspects of the collections library and helps you to take advantage of the
most popular concurrency-related features.
The high-level task for this article is a Web crawler: given a base URL for a Web site, you'll collect elements from the site that you can use for some purpose. You'll collect a series of links from a single Web page and then crawl a whole site. You'll break the high-level tasks down into subtasks that can be spun off into their own jobs. Along the way, you'll get to know and use generics and thread pools. To keep matters simple, you'll implement the task as a standalone client-side application. (Explaining how to deploy a Web application isn't central to the article's purpose. But feel free to create a Web application from which the task can be initiated as an additional exercise.)
You should be familiar with developing programs for the Java platform. This article assumes familiarity with the networking and I/O libraries for making socket connections and reading streams, respectively. You need a working developer installation of the Java SE 6 platform. It should be at least Update 5 of JDK 6 from Sun Microsystems or the latest SDK for Java, Version 6 from IBM.
The concept of generics has been part of the Java platform since the Java SE 5 release (see Resources). Simply put, generics provide compile-time type safety for collections. With earlier Java platform versions, you created a collection and added items into it, as in Listing 1:
Listing 1. Adding items to a collection — the old way
List buttonList = new LinkedList();
buttonList.add(new JButton("One"));
buttonList.add(new JButton("Two"));
buttonList.add(new JButton("Three"));
buttonList.add(new JButton("Four"));
|
To fetch elements from the collection, you had to know the type of object in the collection so you could cast it back to the appropriate local variable:
JButton first = (JButton)buttonList.get(0); |
You didn't need to cast it back to the right type, but you did so on the assumption that you wanted to do something with the specific class type. This worked fine until you accidentally added the wrong type object to the collection:
buttonList.add(new JLabel("Five"));
|
Now, if you try to fetch the last element as a
JButton, a class cast exception happens at run time:
Line 13: JButton last = (JButton)buttonList.get(4);
>java GetIt
Exception in thread "main" java.lang.ClassCastException:
javax.swing.JLabel cannot be cast to javax.swing.JButton
at GetIt.main(GetIt.java:13)
|
There's nothing wrong per se with putting a JLabel
into a collection, but if the fetching code is expecting all the elements in the
collection to be of the same type —
JButton here — then fetching a
JLabel out of it generates the
ClassCastException. And the exception doesn't occur
until run time; if you haven't done enough testing, it might not happen until
after the application is deployed.
Step now into the world of generics. Generics can help you resolve coding
problems earlier in the development cycle. Instead of just having a collection and
adding JButton objects into it, you say that you have a
collection of JButton objects. Then, if you try to add
a JLabel to that collection, the compiler picks up the
discrepancy and complains at compile time.
Listing 2 contains a program that generates the
compilation-time error message that happens when you try to add the wrong type of
element to a generic collection — a
List<JButton> in this case:
Listing 2. Sample code using generics (does not compile)
import java.util.*;
import javax.swing.*;
public class GetIt {
public static void main(String args[]) {
List<JButton> buttonList = new LinkedList<JButton>();
buttonList.add(new JButton("One"));
buttonList.add(new JButton("Two"));
buttonList.add(new JButton("Three"));
buttonList.add(new JButton("Four"));
JButton first = buttonList.get(0);
buttonList.add(new JLabel("Five"));
JButton last = buttonList.get(4);
}
}
|
When you save and compile the program, you'll notice the final call to
add() fails:
>javac GetIt.java
GetIt.java:12: cannot find symbol
symbol : method add(javax.swing.JLabel)
location: interface java.util.List<javax.swing.JButton>
buttonList.add(new JLabel("Five"));
^
1 error
|
The second line of the error message says you are trying to add a
JLabel to what the third error line is reporting as a
List of JButton objects. You
then must decide if the collection should really be a collection of
Component objects (or
JComponents if you want to stick to Swing platform
components) or if you shouldn't have tried to add the
JLabel in the first place.
Notice in Listing 2 that fetching an item out of a collection doesn't require a cast to the right type. Because you've described the collection to be of a certain type, all calls to get items out of that collection return the given type.
The use of generics makes your code base much more maintainable, especially as it grows and you turn code elements into reusable libraries. Your library's users no longer need to worry that there are limitations on the types of objects in a collection. Properly defined methods should now include those types with their definitions. And the compiler warns you if and when you deviate from that type.
The compiler complains when it compiles a class that uses a collection whose definition lacks a generic type, as it does for the code in Listing 1. For example, suppose you try to compile a class containing this line:
List buttonList = new LinkedList(); |
The compiler issues a warning:
>javac GetIt.java Note: GetIt.java uses unchecked or unsafe operations. Note: Recompile with -Xlint:unchecked for details. |
You can ignore the warning and just leave things be. Assuming you don't accidentally add the wrong datatype to the collection, everything will work fine.
To reveal the specific issues the compiler is warning you about, pass the
-Xlint:unchecked command to the compiler. You'll see
output like Listing 3's:
Listing 3. Detail from compilation with
Xlint
>javac -Xlint:unchecked GetIt.java
GetIt.java:7: warning: [unchecked] unchecked call to add(E) as a member of
the raw type java.util.List
buttonList.add(new JButton("One"));
^
GetIt.java:8: warning: [unchecked] unchecked call to add(E) as a member of
the raw type java.util.List
buttonList.add(new JButton("Two"));
^
GetIt.java:9: warning: [unchecked] unchecked call to add(E) as a member of
the raw type java.util.List
buttonList.add(new JButton("Three"));
^
GetIt.java:10: warning: [unchecked] unchecked call to add(E) as a member of
the raw type java.util.List
buttonList.add(new JButton("Four"));
^
4 warnings
|
You can see in Listing 3 that the compiler doesn't specifically care that the
List wasn't defined with a datatype. What it does say
is that each call to add() is a problem because the
List wasn't defined with a datatype.
Again, these are warnings, so you can ignore them. However, fixing the collection to specify the type explicitly will remove the chance that you'll miss a real error at compile time among all the warnings.
If you're using a library that you can't or don't want to change, you can
suppress the compiler warnings. An @SuppressWarnings
annotation tells the compiler that you know the code generates warnings, but you
don't want to see them. If you include the following line right before the method
whose unchecked warnings you want to ignore, the compiler stops complaining about
that method:
@SuppressWarnings("unchecked")
|
Now, when you compile the class, you get no warning messages or error messages.
You still run the risk of getting a ClassCastException
if a datatype other than what's expected is handled. The choice is yours.
You should now have a good grasp of what generics can buy you and how they can
make your programs more maintainable. Next up is creating a program to collect all
the links on a particular Web page. Although you could write a program yourself
that reads in the Web page and parses the content, that isn't necessary. The Swing
component libraries provide this capability. All you need to do is ask for all the
href attributes associated with the anchor
(<a>) tags on a page.
The javax.swing.text.html package includes an
HTMLEditorKit. If you provide a stream to it, it parses
the associated Web page. From that parsed stream, you can tell the kit to iterate
through all available tags and get the anchor tags'
href attributes. This program could be more elaborate
and collect image tags or Flash movies, but it will only collect those of the form
<a href="...">xxx</a>.
All you need to do is create a new HTMLEditorKit
instance and pass in a Reader to the content. Because
the idea is to get the content from a remote Web site, you must get that
Reader with a http:// string
you enter on the command line. That string is then passed into the
URL constructor, from which you get a
URLConnection. The process sounds complicated but
really isn't. Listing 4 shows how it works:
Listing 4. Reading a Web page
HttpURLConnection.setFollowRedirects(false);
EditorKit kit = new HTMLEditorKit();
Document doc = kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
String uri = args[0];
Reader reader = null;
if (uri != null && uri.startsWith("http")) {
URLConnection conn = new URL(uri).openConnection();
reader = new InputStreamReader(conn.getInputStream());
} else {
System.err.println(
"Usage: java ListUrls http://example.com/startingpage");
System.exit(-1);
}
kit.read(reader, doc, 0);
|
The input stream associated with the connection is provided to the
EditorKit's read() method.
The other arguments to read() are a
Document, which you create by calling the kit's
createDefaultDocument() method, and a position to start
reading, typically 0 to indicate start of stream.
Listing 4 throws in two additional helpful tasks. The
HttpURLConnection class's
setFollowRedirects() method is called to disable
following redirect requests. And the
IgnoreCharsetDirective property of the
Document is set because apparently there's a bug in
HTMLEditorKit when a charset
attribute is in a page's <meta> tag.
Iterating through the elements
The next Swing class you'll use is ElementIterator,
found in the javax.swing.text package. Given a
Document like the one just created, you can iterate
through the elements within it:
ElementIterator it = new ElementIterator(doc);
javax.swing.text.Element elem;
while ((elem = it.next()) != null) {
// ...
}
|
By hunting for the <a> tags, you can get
the associated href attributes to add to the collection
of found links. The collection you use here is a Set,
because it isn't necessary to collect duplicates:
Set<String> uriList = new TreeSet<String>();
// Below is inside of while loop
AttributeSet s = (AttributeSet)
elem.getAttributes().getAttribute(HTML.Tag.A);
if (s != null) {
String href = (String)
s.getAttribute(HTML.Attribute.HREF);
uriList.add(href);
}
|
Although the steps you've performed so far are sufficient to collect all the
links, you might choose to handle some special cases. For instance, you don't need
to add cases where the found href is null —
something that shouldn't happen with a well-formed document but sometimes does.
Also, internal links don't have a leading http://. It
is best to append these internal links to the document's base URL so that if you
need to loop through the list again (as in the next task), you have a full URL.
And, it's probably best not to follow javascript: tags.
Many more enhancements are probably doable. Listing 5 shows the complete program:
Listing 5. Code for listing a single page's URLs
import java.io.*;
import java.net.*;
import java.util.*;
import javax.swing.text.*;
import javax.swing.text.html.*;
public class ListUrls {
public static void main(String args[]) throws Exception {
Set<String> uriList = new TreeSet<String>();
HttpURLConnection.setFollowRedirects(false);
EditorKit kit = new HTMLEditorKit();
Document doc = kit.createDefaultDocument();
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
String uri = args[0];
Reader reader = null;
if (uri != null && uri.startsWith("http")) {
URLConnection conn = new URL(uri).openConnection();
reader = new InputStreamReader(conn.getInputStream());
} else {
System.err.println(
"Usage: java ListUrls http://example.com/startingpage");
System.exit(-1);
}
kit.read(reader, doc, 0);
ElementIterator it = new ElementIterator(doc);
javax.swing.text.Element elem;
while ((elem = it.next()) != null) {
AttributeSet s = (AttributeSet)
elem.getAttributes().getAttribute(HTML.Tag.A);
if (s != null) {
String href = (String)s.getAttribute(HTML.Attribute.HREF);
if (href == null) {
continue;
} else if (href.startsWith("javascript:")) {
continue; // skip it
} else if (href.startsWith("https:")) {
// add as is
} else if (!href.startsWith("http:")) {
href = uri + href;
}
uriList.add(href);
}
}
for (String element: uriList) {
System.out.printf(">>%s<<%n", element);
}
}
}
|
The program prints out the collected set of URLs. Download and compile the
ListUrls program, and run it by passing in a URL on the
command line (see the Download section for a link to this
article's full source code). Your exact results will depend on the page you're
collecting from.
The ListUrls program in
Listing 5 collects all the outgoing links on a particular
page. To take this program to the next level and make it crawl a whole site, it's
best to break it apart into smaller tasks. Although you can keep the whole job in
a single thread, the application is bound to be blocked by I/O delays because it
must read Web pages in full before they can be processed. Network delays are
another reason to break things up into multiple threads. Handling each element of
the Set in its own thread should speed up the whole job
considerably. Of course, you need to bound the number of threads, or else too many
different tasks will execute, and more time will be spent swapping between them
than doing any actual work.
Java SE 5 introduced the java.util.concurrent library
as well as generics (see Resources). The
Executor interface accepts
Runnable objects to execute. This is just like passing
Runnable objects into the
Thread constructor, but with an
Executor, a Thread can be
reused to get a new Runnable after the prior one
finishes. Thus, the program avoids the overhead of constantly throwing away and
re-creating threads. The Executor interface has a
single method —
execute()
— that accepts the Runnable parameter. What
happens with it depends upon the specific implementation of the
Executor interface.
One such implementation of Executor is
ThreadPoolExecutor. You don't call the
ThreadPoolExecutor constructor directly to create a
thread pool but instead use the Executors utility class
to create one. For a fixed-size thread pool, use
newFixedThreadPool(int maxThreads); or use
newFixedThreadPool(int maxThreads, ThreadFactor factory),
which allows you to provide a factory for creating the underlying threads.
Once you've created your thread pool, you add tasks to run with the
service(Runnable) method. For the Web crawler you're
creating, you can call the awaitTermination() method to
find out when all the tasks are done, or at least when the thread pool has
terminated, as shown in Listing 6:
Listing 6. Working with a thread pool
String uri =...
ExecutorService service = Executors.newFixedThreadPool(5);
service.execute(new Crawler(service, uri, uri));
service.awaitTermination(300, TimeUnit.SECONDS);
for (String element: allUriList) {
System.out.printf(">>%s<<%n", element);
}
|
The awaitTermination() method accepts a timeout. In
Listing 6, the program is set to time out after five minutes. You can use a
shorter or longer timeout based on how long you want the program to run, your
network connection's speed, and the depth of the Web site you want to crawl.
Notice also that only the base URI string is added to the crawler. As each page is read, new URIs are added to the job queue.
The Runnable task performed by the service is the bulk
of the code from Listing 5. I've added some additional
checks to improve building the URLs for the next page. A check at the end of the
execute() method sees if the service should terminate.
Typically, the pool would just run until the program ends, but this program ends
when the pool is done, so the check is necessary.
Download the CollectUrls
program and run it on a relatively small site, preferably your own, to get a dump
of all the links from the site. You can also modify the program to maintain a
multimap: if you know the source of each link, you can automatically generate a
map of the site hierarchy and interconnections.
The CollectUrls Web crawler program takes advantage of
a fixed-size thread pool. It's not the only one available to you. Three other
types can be created using the Executors utility class:
-
newCachedThreadPool()creates a pool of potentially unbounded size, but it kills the threads when they are idle for too long. Consider using it if you have a bunch of short asynchronous tasks. If the thread is available in the pool, it gets used. If no thread is available, a new one is created, and then if the thread from the pool is idle for 60 seconds, it goes away. When nothing is going on, no resources are used. In contrast, the fixed-size pool keeps all the threads around waiting for something to do when there are no tasks to complete.
-
newSingleThreadExecutor()creates a pool that's useful for jobs that need to be done sequentially. If the underlying thread dies, it gets re-created. This is similar to creating a fixed-size pool of one, but without being able to change the underlying pool size.
-
newScheduledThreadPool()creates a pool that works like theTimerobject but is better at handling uncaught exceptions and thread starvation. With theTimerclass, you can have aTimerTaskthat runs for a long time, blocking other tasks from running. Having multiple threads in a pool can prevent that from happening and still keep the threads' schedule.
Additional collection types are also available. As an alternative to the
scheduled thread pool, consider using a DelayQueue. It
allows you to add items to the collection that cannot be pulled out until their
delay period has expired. It is a specific type of
BlockingQueue: getting an item from the queue will
block until the delay has passed, if an item isn't available immediately.
This article took you through the task of creating a Web crawler by:
- Collecting the set of URI strings found in a generic
Set - Spawning
Runnabletasks to find more URIs on a site's pages - Using a thread pool to work its way through the
Runnableoperations
To expand the Web crawler, consider collecting image references or searching for specific text strings. You can do a lot with the program to enhance its functionality and learn more about working with concurrent collection techniques.
| Description | Name | Size | Download method |
|---|---|---|---|
| Sample code for this article | j-collections-code.zip | 3KB | HTTP |
Information about download methods
Learn
-
Java Collections Framework:
API documentation, tutorials, and other resources for the Collections Framework.
-
"The Collections Framework"
(John Zukowski, developerWorks, July 2005): This article in the author's
Taming
Tiger
series covers the generics, concurrency, and other Java Collections Framework
enhancements introduced in Java SE 5.
-
"Concurrency in JDK 5.0"
(Brian Goetz, developerWorks, November 2004): This tutorial shows how the
concurrency classes introduced in Java SE 5 help make code faster and more
scalable, reliable, maintainable.
-
"Introduction to generic types in JDK 5.0"
(Brian Goetz, developerWorks, December 2004): This tutorial, designed for
intermediate and advanced Java developers, introduces generic types.
-
"Generics gotchas"
(Brian Goetz, developerWorks, January 2005): Read this article if you want to
avoid common pitfalls that can ensnare first-time users of generics.
-
Java Collections
(John Zukowski, Apress, April 2001): This book covers the Collections Framework
introduced with the Java 2 platform.
- Browse the
technology bookstore
for books on these and other technical topics.
-
developerWorks Java technology zone:
Find hundreds of articles about every aspect of Java programming.
Get products and technologies
- Download
IBM product evaluation versions
and get your hands on application development tools and middleware products from
DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- Check out
developerWorks blogs
and get involved in the
developerWorks community.
Comments (Undergoing maintenance)






