Help Web crawlers efficiently crawl your portal sites and Web sites

An introduction to the Sitemaps 0.90 protocol and how WebSphere Portal complements it

Web site administrators, as well as search technology providers, face the challenge of locating high-quality information, or of helping others locate it. With the new Sitemaps 0.90 protocol, Web site administrators can support this endeavor by providing crawlers with information about their sites in a more structured and efficient fashion. The protocol makes it possible to specify exactly what to crawl, how frequently the information is updated, and how important each page is relative to the rest of that site.

This article provides an overview of the Sitemaps 0.90 protocol and tells you how to use IBM® WebSphere® Portal Version 6 (hereafter called WebSphere Portal) to produce and deploy such a sitemap. You should have a good understanding of XML and of WebSphere Portal administration to get the most out of this article.

Andreas Prokoph (pkp@de.ibm.com), Software Architect, Search Technologies, WebSphere Portal development, IBM

Andreas Prokoph is a software architect at the IBM Development Lab in Boeblingen, Germany, and has worked in the field of text search and information retrieval for the past 18 years. He has held various positions as technical lead and architect for many search products and solutions, ranging from intranet search engines to client-side embedded search technologies. You can reach Andreas at pkp@de.ibm.com.



08 May 2007


Introduction

Until now, both Web crawler operators and site administrators have had to spend considerable time figuring out how to optimize the crawlability of a Web site so that the relevant information hosted there can still be discovered. Ideally, this happens without putting too much load on the hosting server or triggering unwanted actions, while still allowing repeated crawls of the Web site at appropriate intervals.

The Sitemaps 0.90 protocol (see Sitemaps.org in Resources) provides a convenient way for Web site administrators to feed crawlers the information they need to crawl a Web site safely and efficiently. Furthermore, it relies on established Web standards such as XML.

In short, you generate a list of page references (URLs) for a crawler to fetch and store it in a simple XML file that complies with Sitemaps 0.90. The XML file contains one entry for every URL, and the only mandatory input is the URL reference itself. Optionally, you can list additional information for each URL: its last modification date, its change frequency, and a priority value. (The priority value specifies the importance of that page relative to the other pages of that Web site only.) Once the sitemap XML file is complete, you make it available to Web crawlers by submitting its URL to the search engines that support the protocol. Google and Yahoo! already support the protocol today.
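For example, a minimal sitemap that lists a single page of a hypothetical site needs nothing more than the mandatory URL reference:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mycompany.com/</loc>
  </url>
</urlset>

All of the other elements described later in this article are optional hints for the crawler.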

Once the sitemap is registered, the search engine's Web crawlers use the information in the sitemap file to identify the set of pages that needs to be crawled, and then use the change frequency and last modification information to determine which of those pages need to be processed at this time. For example, a page marked with a monthly change frequency that was fetched only a week ago can safely be skipped. This efficiency provides relief to both sides, the hosting Web server and the crawler alike, by keeping the number of GET requests for pages at a minimum.

Overall, the Sitemaps 0.90 protocol offers an improvement in crawling efficiency that cannot be accomplished by combining conventional HTML site maps and robots directives.


About Sitemaps 0.90

Sitemaps 0.90 is a simple and intuitive way for webmasters to provide the right level of information to Web crawlers so that they can efficiently crawl a Web site. It is a big step forward compared to the old-style "crawl the Web site by following the hyperlinks" approach. The approach that many search engines still recommend for some level of crawling efficiency is to make an HTML-based site map page available for a crawler to pick up. In today's world of large and complex Web sites, however, that is not a complete solution. Sitemaps 0.90 goes one step further by allowing Web sites to specify information about their content or pages which crawlers would otherwise need to determine and store themselves, such as the update frequency of pages or whether a page has changed.

It is important to mention that the Sitemaps 0.90 protocol relies on Web standards and reuses existing concepts. The protocol is based on a straightforward, intuitively structured XML file composed of a list of URLs and their associated metadata. This information helps the Web crawler determine what the set of pages is and when to crawl them; the webmaster, or whoever is responsible for the Web site, provides the recommendations and information.

Sitemaps 0.90 generation tools are already showing up on the Internet as open source, freeware, and shareware offerings. This is possible because of the Beta program for Sitemaps run by Google. An example of such an open source offering is the Sitemaps 0.90 generator tool. Beyond these standalone tools, it is possible for systems that manage and generate Web content to automatically create and maintain Sitemaps 0.90 files, and for Web development tools to offer a save-as-sitemap function.

Sitemaps 0.90 is also very flexible in terms of the various sources that can contribute input to sitemap files. The protocol additionally offers the option of providing a Sitemaps index file to the crawler. The main advantage of sitemap index files is that they provide a means to partition the sitemaps of very large Web sites into smaller chunks. You can also use a sitemap index file to combine multiple sitemaps that have been generated by different content sources or content delivery applications, as shown in the following example.
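Here is a sketch of such a sitemap index file, assuming the two referenced sitemap files (the file names are hypothetical) were generated by two different applications:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.mycompany.com/sitemap-products.xml</loc>
    <lastmod>2006-08-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.mycompany.com/sitemap-news.xml</loc>
    <lastmod>2006-08-07</lastmod>
  </sitemap>
</sitemapindex>

The <lastmod> element is optional here as well; it tells the crawler when each referenced sitemap file was last regenerated.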

The importance of Sitemaps 0.90 for products like IBM WebSphere Portal lies in the nature of portals: portals generate dynamic content, and as a result the complex URLs they produce do not allow standard crawler directives like robots.txt to define the Web space which the crawler crawls. Sitemaps 0.90, together with the improved crawlability features provided with the latest release of WebSphere Portal, gives site administrators essential control over the crawling of their public portal sites.
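To illustrate the limitation, consider a robots.txt file for a portal site (the /wps/... prefixes below are shown only as a typical example of portal URL paths):

User-agent: *
# A prefix rule can only include or exclude entire URL subtrees:
Disallow: /wps/myportal/
# There is no way to enumerate the individual, dynamically generated
# pages under /wps/portal/ that are actually worth crawling.

A sitemap solves exactly this problem by listing those dynamic page URLs explicitly.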

The following sections provide details about the Sitemaps 0.90 protocol.

The Sitemaps 0.90 XML file

Listing 1 is a sample Sitemaps 0.90 file that lists four pages of a small Web site. Each of the four entries carries the mandatory URL reference plus all of the optional metadata elements:

Listing 1. A sample sitemap file
<?xml version="1.0" encoding="UTF-8"?> 

    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> 

    <url>
      <loc>http://www.mycompany.com/</loc>
       <lastmod>2006-08-07</lastmod>
       <changefreq>daily</changefreq>
       <priority>0.7</priority>
     </url>
    <url>
      <loc>http://www.mycompany.com/products/</loc>
       <lastmod>2006-08-01</lastmod>
       <changefreq>weekly</changefreq> 
       <priority>0.8</priority>
     </url>
    <url>
       <loc>http://www.mycompany.com/news/</loc>
       <lastmod>2006-08-07</lastmod>
       <changefreq>weekly</changefreq> 
       <priority>0.8</priority>
     </url>
    <url>
      <loc>http://www.mycompany.com/archive/</loc>
        <lastmod>2006-05-01</lastmod>
        <changefreq>monthly</changefreq> 
        <priority>0.3</priority>
     </url>
    </urlset></xml>

The following list explains the XML tags used in Listing 1. Of these, only <urlset>, <url>, and <loc> are mandatory:

XML tags explained

  • <urlset> -- Encapsulates the XML file and provides the reference to the protocol standard.
  • <url> -- The parent tag for each URL entry.
  • <loc> -- The URL of the Web page; it must be fully qualified.
  • <lastmod> -- The date of last modification, specified in W3C Datetime format. It can be shortened to the date only, for example, 2006-08-07 (YYYY-MM-DD).
  • <changefreq> -- Specifies how often a page changes. The accepted values are: always, hourly, daily, weekly, monthly, yearly, and never.
  • <priority> -- A value between 0.0 and 1.0 that specifies the importance of a page relative to the other pages of that site. It is used to prioritize the crawling sequence of the site's pages. Note that it has no influence on the relevance of the page in a search result.

You can find details about the XML schema for the Sitemap protocol on the Sitemaps 0.90 protocol page.

Making the sitemap accessible to crawlers

Once you create the sitemap XML file and make it available through a Web server, you can submit it through the URL registration services that the supporting search engines provide. This is comparable to the way traditional HTML-based site maps (or home pages) are registered. As of the publication of this article, both Google and Yahoo! support this protocol through their respective registration services.

Note: You'll want to follow additional rules regarding where the sitemap file resides on the Web server (see Sitemap location for details).

A sitemap may only reference URLs that reside at or below its own location on the server. The following is a valid combination of a site URL and a sitemap location:

Site:    http://www.mycompany.com/xyz/index.html
Sitemap: http://www.mycompany.com/xyz/sitemap.xml

The following site URLs would be invalid for that same sitemap, because they do not reside under the sitemap's location path, http://www.mycompany.com/xyz/:

Invalid site: http://www.mycompany.com/index.html
Invalid site: http://www.notmycompany.com/index.html
Invalid site: http://subdomain.mycompany.com/

Details about sitemap locations can be found at Sitemaps 0.90 protocol.
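At the time of writing, the protocol documentation on Sitemaps.org also describes a simple autodiscovery mechanism: you can point crawlers to your sitemap by adding a single line to the site's robots.txt file (check the protocol page to see which crawlers support this):

Sitemap: http://www.mycompany.com/sitemap.xml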


WebSphere Portal V6 support for the Sitemaps 0.90 protocol

The IBM Search Sitemap Utility portlet for generating sitemaps compliant with Sitemaps 0.90

The IBM Search Sitemap Utility portlet is available on the WebSphere Portal Catalog. The following section provides an overview of the functionalities of this portlet.

The IBM Search Sitemap Utility portlet is an extension to the Sitemap portlet delivered with WebSphere Portal. It is an enhancement that enables you to export public portal pages as a Sitemaps 0.90 compliant XML file. Figure 1 illustrates the main view of the portlet.

Viewing pages and portlets in the IBM Search Sitemap Utility portlet

Use the IBM Search Sitemap Utility to view a listing of the pages and portlets in the portal. Users with the appropriate administrative authorization can set the number of pages and portlets displayed on each page of the portlet. The default is to display 50 entries per page. The pages and portlets are displayed in a tree hierarchy.

Note: If you want to display both portlets and pages of a portal, configure the IBM Search Sitemap Utility portlet to do so.

Figure 1. Main view of the IBM Search Sitemap Utility portlet

From the main view you can initiate the following actions:

  • Navigate Search Sitemap -- If the sitemap list spans more than one page (50 entries per page by default), additional links are available to navigate through the list.
  • Access pages and portlets -- If you click on a portlet link, you view the page containing that portlet.
  • Change locales -- This displays the pages and portlets in the corresponding language.

Editing the preferences for the Sitemap Utility portlet

Use edit mode to filter portal sections:

Select Filter Portal Sections to enable filtering of portal pages.

  • When this is enabled, only the sections selected in the list are displayed in the Search Sitemap portlet.
  • If it is enabled but no sections are selected, the Search Sitemap portlet displays a message that no pages are available.
  • If Filter Portal Sections is not enabled, then all pages are displayed.
Figure 2. Edit view of the IBM Search Sitemap Utility portlet

Accessibility of the Sitemap XML file through a Web server

Once you have set up the sitemap so that it contains all the relevant pages to be crawled by the respective robots, the information is ready to be exported to the file system as a Sitemaps 0.90 compliant XML file.

Figure 3. Export Sitemap XML

To do this, click the export icon at the top of the Search Sitemap portlet. The browser displays an Open file dialog asking which action to take. Click Save to Disk, and in the next dialog box select the target location where you want to store the sitemap XML file. The final step is to make this XML document accessible to crawlers through a Web server. The easiest way to do this is to copy the file into the folder managed by your Web server; store it in the document root folder of the Web server (see Sitemap location for details and restrictions).
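For example, assuming an Apache-style document root of /usr/local/apache/htdocs (a hypothetical path; check your own Web server configuration), the mapping looks like this:

Stored as:  /usr/local/apache/htdocs/sitemap.xml
Served as:  http://www.mycompany.com/sitemap.xml

Because the file then sits at the root of the site, it may reference any URL on http://www.mycompany.com/.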


Looking toward the future

This article provided high-level information about the importance and mutual benefits of the Sitemaps 0.90 protocol. In particular, Sitemaps 0.90 addresses one of the biggest challenges facing Web crawlers today: the increasing complexity of modern Web sites and the growing use of techniques that defeat traditional crawlers. For example, JavaScript lets Web pages do neat things that users often appreciate. Crawlers, however, ignore JavaScript (and should), and are thus locked out from the information behind it, though in some cases that lockout is intentional.

Today, webmasters who choose techniques like this make a conscious decision to forgo easy crawlability. At the same time, Web crawler developers work to create new crawling approaches that account for the ever-increasing complexity of Web pages, often at the cost of processing speed. If this pattern continues, crawlers will be squeezed between the dual pressures of maintaining acceptable crawling performance and gathering constantly increasing volumes of information.

The importance of search as a primary means of navigation, whether on the Internet or on large Web sites, is ever increasing. It is therefore important that applications and solutions which deliver content and information become good citizens in the search world and contribute their share to making information searchable. Sitemaps 0.90 is an important step toward this goal, and it allows simple and efficient crawling of even the complex and dynamic Web sites of today.

Save as Sitemap will, hopefully, become an option that content management systems, document management applications, news feeds, and so forth provide in the near future. Together with appropriate editing capabilities for the resulting sitemaps, this will greatly improve the ability of Web crawlers to pick up the right information in a timely manner, and it will ensure that crawlers are not locked out from content sources which they cannot consume with today's methods.

Resources

Learn

  • Sitemaps.org: Read the Sitemaps 0.90 protocol specification, including the XML schema and the rules for sitemap locations.
Get products and technologies

  • IBM Search Sitemap Utility portlet: Download this portlet from the IBM WebSphere Portal catalog.
  • W3C Datetime: Review a profile of ISO 8601, the International Standard for the representation of dates and times.
  • WebSphere Portal Express: Download this trial version of WebSphere Portal Express and speed your creation of easily deployable and customizable Web sites.
  • IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
