Until now, both Web crawler developers and site administrators have had to spend considerable time figuring out how to make a Web site easy to crawl while still ensuring that the relevant information hosted on it is discovered. Ideally, this should happen without placing too much load on the hosting server or triggering unwanted actions, and it should allow the site to be crawled repeatedly at appropriate intervals.
The Sitemaps 0.90 protocol (see Sitemaps.org in Resources) gives Web site administrators a convenient way to feed crawlers the information they need to crawl a Web site safely and efficiently. Furthermore, it relies on Web standards such as XML.
What you need to do, in short, is to generate a list of page references (URLs) for a crawler to fetch. This list is stored in a simple XML file that complies with Sitemaps 0.90. The XML file contains one entry for every URL, and the only mandatory input is the URL reference itself. Optionally, you can list additional information for each URL: its last modification date, its change frequency, and a priority value. (The priority value specifies the importance of that page relative to other pages on the same Web site only.) Once the sitemap XML file is complete, you make it available to Web crawlers by submitting the URL of the sitemap to the search engines that support the protocol. Google and Yahoo! already support the protocol today.
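The steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular tool generates its output: the function name `build_sitemap` and the dictionary keys are assumptions made for this example, and only the element names and the namespace come from the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return a Sitemaps 0.90 document (bytes) for a list of page dicts.

    Each dict must contain "loc". The keys "lastmod", "changefreq", and
    "priority" are optional and are written only when present.
    (Hypothetical helper; the dict layout is an assumption of this sketch.)
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    # Prepend the XML declaration by hand so the output is the same
    # regardless of the Python version in use.
    body = ET.tostring(urlset, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    pages = [
        {"loc": "http://www.mycompany.com/", "changefreq": "daily", "priority": 0.7},
        {"loc": "http://www.mycompany.com/archive/"},  # only the mandatory URL
    ]
    print(build_sitemap(pages).decode("utf-8"))
```

Writing the returned bytes to a file and placing that file on the Web server gives you a sitemap that is ready to be submitted.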
Once the sitemap is registered, the search engine's Web crawlers use the information in the sitemap file as follows: they identify the set of pages that needs to be crawled, and then use the change frequency information to determine which of those pages need to be processed at this time. This more efficient site crawl relieves both sides, the hosting Web server and the crawler, by keeping the number of GET requests for pages to a minimum.
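To make the idea concrete, here is a rough sketch of how a crawler might use the change frequency to decide which pages to fetch. Search engines do not publish their scheduling logic, so everything here is an assumption: the function name `pages_due`, the mapping of change frequency values to time intervals, and the weekly fallback are purely illustrative.

```python
from datetime import datetime, timedelta

# Assumed mapping from <changefreq> values to a re-crawl interval.
# The protocol treats changefreq as a hint; these numbers are illustrative.
RECRAWL_INTERVAL = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}
DEFAULT_INTERVAL = timedelta(weeks=1)  # assumption for missing values


def pages_due(entries, last_crawled, now):
    """Return the URLs that should be fetched in this crawl.

    entries: list of (loc, changefreq) tuples taken from a sitemap.
    last_crawled: dict mapping loc to the datetime of the previous fetch.
    """
    due = []
    for loc, changefreq in entries:
        previous = last_crawled.get(loc)
        if previous is None:
            due.append(loc)  # never fetched before
        elif changefreq == "never":
            continue  # archived page: one fetch is enough in this sketch
        elif now - previous >= RECRAWL_INTERVAL.get(changefreq, DEFAULT_INTERVAL):
            due.append(loc)
    return due


if __name__ == "__main__":
    now = datetime(2006, 8, 8)
    entries = [
        ("http://www.mycompany.com/", "daily"),
        ("http://www.mycompany.com/archive/", "monthly"),
    ]
    last = {
        "http://www.mycompany.com/": datetime(2006, 8, 6),
        "http://www.mycompany.com/archive/": datetime(2006, 8, 1),
    }
    print(pages_due(entries, last, now))
```

Pages that are not due are skipped entirely, which is where the savings in GET requests come from.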
Overall, the Sitemaps 0.90 protocol offers an improvement in crawling efficiency that cannot be achieved with regular sitemap references and robot directives combined.
Sitemaps 0.90 is a simple and intuitive way for webmasters to provide the right level of information to Web crawlers so that they can crawl a Web site efficiently. It is a big step forward compared to the old-style "crawl the Web site by following the hyperlinks" approach. Today, the approach that many search engines still recommend for some level of crawling efficiency is to make an HTML-based site map page available for a crawler to pick up. However, in today's world of growing and complex Web sites, that is not a complete solution. Sitemaps 0.90 goes one step further by allowing Web sites to specify information about their content or pages that crawlers would otherwise need to determine and store themselves, such as the update frequency of pages or whether a page has changed. The Sitemaps 0.90 protocol also relies on Web standards and reuses existing concepts: it is based on a straightforward and intuitively structured XML file, which is composed of a list of URLs and their associated metadata. This information helps the Web crawler determine which pages to crawl and when to crawl them. The webmaster, or whoever takes responsibility for the Web site, supplies this information and these recommendations.
Sitemaps 0.90 generation tools are already showing up on the Internet as open source, freeware, and shareware offerings. This is possible because of the Beta program for Sitemaps run by Google. An example of such an open source tool is the Sitemaps 0.90 generator tool (see Resources). Beyond these standalone tools, systems that manage and generate Web content can create and maintain Sitemaps 0.90 files automatically, and Web development tools can offer "save as sitemap" functionality.
Sitemaps 0.90 is also very flexible in terms of the various sources that can contribute input to sitemap files. The protocol lets you provide a sitemap index file to the crawler as well. The main advantage of a sitemap index file is, of course, that it provides a means to partition the sitemap of a very large Web site into smaller chunks. You can also use the sitemap index file to combine multiple sitemaps that have been generated by multiple content sources or content delivery applications into a single sitemap (index) file.
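A sitemap index is itself a small XML file: a `<sitemapindex>` root element that contains one `<sitemap>` entry, with a `<loc>` child, for each sitemap file. The following sketch shows one way to partition a long URL list and to build the index. The function names and the sample file names are assumptions made for this example; the limit of 50,000 URLs per sitemap file comes from the protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file limit defined by the protocol


def chunk_urls(urls, size=MAX_URLS_PER_SITEMAP):
    """Split a long URL list into sitemap-sized chunks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_locs):
    """Return a sitemap index document (bytes) that points at each sitemap."""
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for loc in sitemap_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    body = ET.tostring(index, encoding="utf-8")
    return b'<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # Hypothetical file names: one sitemap per content source.
    locs = [
        "http://www.mycompany.com/sitemap-portal.xml",
        "http://www.mycompany.com/sitemap-news.xml",
    ]
    print(build_sitemap_index(locs).decode("utf-8"))
```

The same index structure serves both purposes described above: splitting one large sitemap and combining sitemaps from several sources.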
The importance of Sitemaps 0.90 for products like IBM WebSphere Portal lies in the nature of portals. Portals allow for the generation of dynamic content. As a result, the complex URLs they generate do not allow the use of standard crawler directives like robots.txt to define the Web space that the crawler crawls. Sitemaps 0.90, together with the improved crawlability features provided with the latest release of WebSphere Portal, is an essential improvement that lets site administrators control the crawling of their public portal sites.
The following sections provide details about the Sitemaps 0.90 protocol.
Listing 1 is a sample Sitemaps 0.90 file that lists four pages of a small Web site. Each entry contains the required URL and all of the optional attributes:
Listing 1. Required attributes
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mycompany.com/</loc>
    <lastmod>2006-08-07</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.7</priority>
  </url>
  <url>
    <loc>http://www.mycompany.com/products/</loc>
    <lastmod>2006-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.mycompany.com/news/</loc>
    <lastmod>2006-08-07</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.mycompany.com/archive/</loc>
    <lastmod>2006-05-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.3</priority>
  </url>
</urlset>
The following table explains the tags shown in Listing 1:
XML tags explained
|<urlset>||Required. Encapsulates the XML file and provides the reference to the protocol standard.|
|<url>||Required. This is the parent tag for each URL entry.|
|<loc>||Required. URL of the Web page. It must be fully qualified.|
|<lastmod>||Optional. Date of last modification, specified in W3C Datetime format. Can be shortened to the date only, for example, YYYY-MM-DD.|
|<changefreq>||Optional. Specifies how often a page changes. Accepted values are: always, hourly, daily, weekly, monthly, yearly, and never.|
|<priority>||Optional. A value between 0.0 and 1.0 that specifies the importance of a page relative to other pages on that site. Used to prioritize the crawling sequence of that site's pages. Note that this does not influence the relevance of the page in a search result.|
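The value ranges in the table lend themselves to a quick sanity check before a sitemap is published. The following is a minimal sketch and not a replacement for validating against the official XML schema; the function name `check_sitemap` and the wording of the messages are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
CHANGEFREQ_VALUES = {
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
}


def check_sitemap(xml_text):
    """Return a list of human-readable problems found in a sitemap document."""
    problems = []
    root = ET.fromstring(xml_text)
    for number, url in enumerate(root.findall("sm:url", NS), start=1):
        # <loc> is the only mandatory child of <url>.
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        if not loc:
            problems.append(f"entry {number}: missing <loc>")
        changefreq = url.findtext("sm:changefreq", namespaces=NS)
        if changefreq is not None and changefreq.strip() not in CHANGEFREQ_VALUES:
            problems.append(f"entry {number}: invalid <changefreq> {changefreq!r}")
        priority = url.findtext("sm:priority", namespaces=NS)
        if priority is not None:
            try:
                in_range = 0.0 <= float(priority) <= 1.0
            except ValueError:
                in_range = False
            if not in_range:
                problems.append(f"entry {number}: invalid <priority> {priority!r}")
    return problems
```

An empty list means that none of these basic checks failed. The `<lastmod>` format is not checked in this sketch.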
You can find details about the XML schema for the Sitemap protocol on the Sitemaps 0.90 protocol page.
Once you create the sitemap XML file and make it available through a Web server, you can submit it through the URL registration services that the supporting search engines provide. This is comparable to the way traditional HTML-based sitemaps (or homepages) are registered. As of the publication of this article, both Google and Yahoo! support this protocol through their respective registration services.
Note: You'll want to follow additional rules regarding where the sitemap file resides on the Web server (see Sitemap location for details).
The following is a valid sitemap location with respect to the URLs listed in the sitemap itself:
Site: http://www.mycompany.com/xyz/index.html
Sitemap: http://www.mycompany.com/xyz/sitemap.xml
The following are examples of invalid sitemap locations (assuming that the same sample sitemap listed above is registered):
bad Site: http://www.mycompany.com/index.html
bad Site: http://www.notmycompany.com/index.html
bad Site: http://subdomain.mycompany.com/
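The location rule illustrated by these examples can be expressed as a short check: a page may be listed only if it uses the same scheme and host as the sitemap and lives in the sitemap's directory or below it. The sketch below is an approximation of that rule for illustration; the function name `is_allowed` is an assumption, and the protocol page remains the authoritative description.

```python
from urllib.parse import urlsplit


def is_allowed(sitemap_url, page_url):
    """Return True if page_url may be listed in the sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # Scheme and host (including any port) must match exactly.
    if (sitemap.scheme, sitemap.netloc.lower()) != (page.scheme, page.netloc.lower()):
        return False
    # Directory that holds the sitemap, including the trailing slash.
    directory = sitemap.path.rsplit("/", 1)[0] + "/"
    return (page.path or "/").startswith(directory)


if __name__ == "__main__":
    sitemap = "http://www.mycompany.com/xyz/sitemap.xml"
    for candidate in (
        "http://www.mycompany.com/xyz/index.html",
        "http://www.mycompany.com/index.html",
        "http://www.notmycompany.com/index.html",
        "http://subdomain.mycompany.com/",
    ):
        print(candidate, is_allowed(sitemap, candidate))
```

Because of this rule, a sitemap stored in the document root of the Web server can list any page on that host.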
You can find details about sitemap locations on the Sitemaps 0.90 protocol page.
The IBM Search Sitemap Utility portlet is available on the WebSphere Portal Catalog. The following section provides an overview of the functionalities of this portlet.
The IBM Search Sitemap Utility portlet is an extension to the Sitemap portlet delivered with WebSphere Portal. It is an enhancement that enables you to export public portal pages as a Sitemaps 0.90 compliant XML file. Figure 1 illustrates the main view of the portlet.
Use the IBM Search Sitemap Utility to view a listing of the pages and portlets in the portal. Users with the appropriate administrative authorization can set the number of pages and portlets displayed on each page of the portlet. The default is to display 50 entries per page. The pages and portlets are displayed in a tree hierarchy.
Note: If you want to display both portlets and pages of a portal, configure the IBM Search Sitemap Utility portlet to do so.
Figure 1. Main view of the IBM Search Sitemap Utility portlet
From the main view you can initiate the following actions:
- Navigate Search Sitemap -- If the Sitemap list spans more than 50 portal pages, additional links are available to navigate through the list.
- Access pages and portlets -- If you click on a portlet link, you view the page containing that portlet.
- Change locales -- This displays the pages and portlets in the corresponding language.
Use edit mode to filter portal sections:
Select Filter Portal Sections to enable filtering of portal pages.
- When this is enabled, only the sections selected in the list are displayed in the Search Sitemap portlet.
- If it is enabled but no sections are selected, the Search Sitemap portlet displays a message that no pages are available.
- If Filter Portal Sections is not enabled, then all pages are displayed.
Figure 2. Edit view of the IBM Search Sitemap Utility portlet
Once you set up the sitemap to contain all relevant pages to be crawled by the respective robots, the information is ready to export to the file system as a Sitemaps 0.90 compliant XML file.
Figure 3. Export Sitemap XML
To do this, click the icon at the top of the Export Search Sitemap portlet. The browser displays an Open file dialog that asks what action to take. Click Save to Disk, and in the next dialog box select the target location where you want to store the Sitemap XML file. The final step is to make this XML document accessible to crawlers through a Web server. The easiest way to do this is to copy the file to the appropriate folder managed by your Web server. Store the file in the document root folder of the Web server (see Sitemap location for details and restrictions).
Today, webmasters must make a conscious choice to forego easy crawlability when they choose to use techniques like this. At the same time, Web crawler developers work to create new crawling approaches that account for the ever-increasing complexity of Web pages, often with impact on their processing speed. If this pattern continues, crawlers are squeezed between the dual pressures of maintaining acceptable crawling performance while gathering constantly-increasing volumes of information.
The importance of Search as a primary means of navigation -- be it on the Internet or large Web sites -- is ever increasing. It is, therefore, important that applications and solutions which deliver content and information become good citizens in the search world and contribute their share of enablement to provide searchable information. Sitemaps 0.90 is an important step toward this goal and allows simple and efficient crawling even of complex and dynamic modern Web sites.
Save as Sitemap will, hopefully, be an option that content management systems, document management applications, news feeds, and so forth provide in the near future. Together with appropriate editing capabilities for the resulting sitemaps, this will greatly improve the ability of Web crawlers to pick up the right information efficiently and in a timely manner. Just as important, it ensures that crawlers are not locked out of content sources that they cannot consume with the methods that exist today.
- Unleashing the power of WebSphere Portal V6 Search with the Portal Search Toolbox (David Konopnicki and Eitan Shapiro, developerWorks, January 2007): Add searching capabilities such as suggested links or faceted search navigation to your portal.
- WebSphere Portal product documentation: Find links to information about several WebSphere products.
- developerWorks WebSphere Portal zone: Access a multitude of information about WebSphere Portal, including demos, articles, and tutorials.
- Sitemaps 0.90 protocol: Read about the XML schema for the Sitemap protocol.
- Sitemaps 0.90 homepage: Learn what sitemaps are and how they improve the odds that your Web pages will be included in search engines.
- Sitemaps 0.90 generator: Check out an example of an open source tool that allows you to generate a Sitemaps XML file for your Web site.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- XML technical library: See the developerWorks XML zone for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
Get products and technologies
- IBM Search Sitemap Utility portlet: Download this portlet from the IBM WebSphere Portal catalog.
- W3C Datetime: Review a profile of ISO 8601, the International Standard for the representation of dates and times.
- WebSphere Portal Express: Download this trial version of WebSphere Portal Express and speed your creation of easily deployable and customizable Web sites.
- IBM trial software: Build your next development project with trial software available for download directly from developerWorks.
- Participate in the discussion forum.
- XML zone discussion forums: Participate in any of several XML-centered forums.
Andreas Prokoph is a software architect at the IBM Development Lab in Boeblingen, Germany, working in the field of text search and information retrieval for the past 18 years. He has held various positions as technical lead and architect for many search products and solutions ranging from Intranet search engines to client-side embedded search technologies. You can reach Andreas at firstname.lastname@example.org.