A successful e-business Web site gives special treatment to its repeat visitors who buy. Does yours? If it doesn't, you know it needs to. If it already does, you know it can do better. And even if it's pretty good, it could be faster. Providing special treatment in the form of information and applications matched to a visitor's interests, roles, and needs is known as personalization. A personalized e-business site is more likely to attract and retain visitors and to build sales. Personalized sites for employees improve their productivity by simplifying access to information and applications. Overall customer satisfaction is increased when less time is required to locate account information, and service is personalized to the customer's needs. Two common reasons for personalizing a site are to make the site easier to use and to increase sales.
Personalization is a process of gathering and storing information about site visitors, analyzing the information, and, based on the analysis, delivering the right information to each visitor at the right time. A number of personalization techniques, with more on the way, can enable your site to target advertising, promote products, personalize news feeds, recommend documents, offer appropriate advice, and target e-mail.
Providing personalization for real-time applications affects the system performance. How personalization is deployed is thus important and needs to be integrated into the overall system design. This is especially true for high-volume Web sites. As described in "Design for scalability" (see Resource 1), your selection of personalization techniques should be directed by your Web site type. In our work with high-volume Web sites, IBM determined there are generally five types of sites, distinguished by workload pattern: publish/subscribe, online shopping, customer self-service, trading, and business-to-business. Regardless of type, Web sites look increasingly to the use of personalization to increase repeat business.
This paper introduces personalization and describes some current techniques. It also explains how personalization affects the system performance and introduces techniques such as content caching, also called intelligent content distribution, for implementing appropriate, effective personalization while still meeting the performance requirements of high-volume e-business sites. Finally, the paper suggests what we believe to be the most effective personalization techniques for each type of Web site.
The information contained in this document has not been submitted to any formal IBM test and is distributed as is. The use of this information or the implementation of any of these techniques is a customer responsibility and depends on the customer's ability to evaluate and integrate the techniques into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.
Personalization is a process of gathering and storing information about site visitors, analyzing the information, and, based on the analysis, delivering the right information to each visitor at the right time. It is a key technology needed in various e-business applications, such as:
- Managing customer relationships
- Targeting advertisements and promoting products
- Managing marketing campaigns
- Managing Web site content
- Managing knowledge
- Managing personalized portals and channels
Although each application area may need tailoring, especially in the areas of user interface and data collection, the core techniques for personalization, depicted in Figure 1, are quite similar.
Figure 1. Elements of a personalization system
Personalization has gone through different phases. Initially, personalization was used to keep the visitor on the site, exploring more of the site, which provided opportunities to advertise and promote products. The next phase attempted to increase how much money a visitor spent at each visit by offering more expensive or related products. Today, personalization is increasingly used as a means to expedite the delivery of information to a visitor, making the site useful and attractive to return to.
In July 1999, Forrester Research published a report, "Smart Personalization" (see Resource 2), describing their research to date on why and how companies implement personalization. E-businesses want personalization to accomplish goals that range from making their sites easier to use to increasing sales. The overarching goal is to increase repeat business. Companies use different methods to personalize their e-business sites. The most common are tailored e-mail alerts, customized content, and account access.
True measurements of the results of installing personalization features are not available. Companies implement personalization simply because they think it's worth the investment. Depending on size and complexity of effort, some believe that an investment in personalization can be returned in less than 12 months. Successful sites, such as Amazon.com and Garden.com, use rich profile information as the basis for providing valuable services. These sites are considered models for those who want to personalize their sites.
Custom pricing, customized content, targeted marketing, and advertising are more advanced personalization methods that require sophisticated data mining. These methods rely on personalized Web pages and deliver business value by enabling site owners to determine how and when to change site content. However, dynamically building such pages requires additional resources and may affect overall system performance. Minimizing the impact of these pages requires a personalization engine that is scalable to handle a large number of requests, a large and complex content space, and the collection of customer information.
This section introduces current techniques for collecting and analyzing information. Figure 2 is an overview of personalization techniques. The major steps -- collecting visitor information, filtering, and developing recommendations -- may or may not be performed dynamically; part or all of some steps may be performed offline, in batch mode, or even manually.
Collecting visitor information
The objective of collecting visitor information is to develop a profile that describes a site visitor's interests, role in an organization, entitlements, purchases, or some other set of descriptors important to the site owner. The most common techniques are explicit profiling, implicit profiling, and using legacy data:
- Explicit profiling asks each visitor to fill out information or questionnaires. This method has the advantage of letting customers tell the site directly what they want to see. An example is MyYahoo, where the visitor is asked to specify profile information, including, for example, what stocks to track and what news categories to report. MyYahoo dynamically constructs a personalized Web page accordingly.
- Implicit profiling tracks the visitor's behavior. This technique is generally transparent to the visitor. Browsing and buying patterns are the behaviors most often assessed. The browsing pattern is usually tracked by saving specific visitor identification and behavior information in what is called a cookie that is kept at the browser and updated at each visit. The buying pattern is generally available in the customer purchase database. For example, Amazon.com logs each customer's buying history and, based on that history, recommends specific purchases.
- Using legacy data accesses legacy data for valuable profile information, such as credit applications and previous purchases. For existing customers and known visitors, legacy data often provides the richest source of profile information.
Figure 2. Overview of personalization techniques
The techniques can be combined to produce comprehensive profiles. Access to legacy data can be an important component of explicit or implicit profiling. Profile and legacy data become the metadata processed by the filtering techniques.
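As a minimal sketch of how the three sources might be combined into one profile, the following assumes a hypothetical `Profile` record and a hypothetical `build_profile` helper; the field names and the simple union of interests are illustrative assumptions, not a description of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical visitor profile assembled from several sources."""
    visitor_id: str
    interests: set = field(default_factory=set)
    purchases: list = field(default_factory=list)


def build_profile(visitor_id, explicit_interests, browsed_categories, legacy_purchases):
    """Combine explicit, implicit, and legacy data into a single profile."""
    profile = Profile(visitor_id)
    # Explicit profiling: what the visitor told the site directly.
    profile.interests |= set(explicit_interests)
    # Implicit profiling: categories observed in the browsing pattern.
    profile.interests |= set(browsed_categories)
    # Legacy data: purchase history from existing systems.
    profile.purchases.extend(legacy_purchases)
    return profile


example = build_profile("v1", ["stocks"], ["stocks", "sports"], ["book-123"])
```

The resulting profile is the kind of metadata that a rule or filter could then process.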
When the profile is available, the next step is to analyze the profile information in order to present or recommend documents, purchases, or actions specific to the visitor. Making such recommendations is the most challenging step. Many techniques for presenting content and making recommendations are in use or under development. Rule-based and filtering techniques are the best known.
Rule-based techniques
Rule-based techniques provide a visual editing environment for the business administrator to specify business rules to drive personalization. This requires the administrator, most likely with the help of a consultant, to figure out the appropriate rules. The rule-based approach provides a flexible mechanism to specify rules for business applications or marketing campaigns. IBM WebSphere provides a set of tools and services that enable an e-business development team to easily create personalized Web sites.
Cross-selling is an e-business example of the rule-based technique. A rule could be specified to offer product X to a customer who has just bought product Y; for example, a customer who buys a book might be interested in current or previous books by the same author or in books on the same subject.
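The cross-selling rule can be sketched in a few lines. This is an illustration only: the `Book` and `Rule` types, the two rules, and the sample catalog are hypothetical, and a real rule editor would let the administrator define rules without writing code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Book:
    title: str
    author: str
    subject: str


@dataclass(frozen=True)
class Rule:
    name: str
    # Returns True if `item` should be offered to someone who bought `bought`.
    matches: Callable[[Book, Book], bool]


# Hypothetical business rules: same author, or same subject.
RULES = [
    Rule("same-author", lambda bought, item: item.author == bought.author),
    Rule("same-subject", lambda bought, item: item.subject == bought.subject),
]


def cross_sell(bought, catalog, rules=RULES):
    """Return titles of catalog items that any rule says to offer."""
    offers = []
    for item in catalog:
        if item == bought:
            continue  # never offer the product just purchased
        if any(rule.matches(bought, item) for rule in rules):
            offers.append(item.title)
    return offers


CATALOG = [
    Book("Gardening Basics", "Lee", "gardening"),
    Book("Advanced Gardening", "Lee", "gardening"),
    Book("Roses", "Kim", "gardening"),
    Book("Chess Openings", "Park", "games"),
]
```

With this sample catalog, a customer who buys "Gardening Basics" would be offered the other book by the same author and the other book on the same subject, but not the chess title.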
Rule-based techniques can be used with filtering techniques, either before or after the filtering process, to develop the best recommendation.
Filtering techniques
Filtering techniques employ algorithms to analyze metadata and drive presentation and recommendations. The three most common filtering techniques -- simple filtering, content-based filtering, and collaborative filtering -- are introduced below. These techniques are described in more detail in Appendix A: More on filtering techniques.
Simple filtering relies on predefined groups, or classes, of visitors to determine what content is displayed or what service is provided. An example of simple filtering is managing access to corporate information. For example, employees identified with the Human Resources department would have personalized Web sites that give them access to information and applications specific to their job. Online brokerages often classify their accounts by asset value or age groups. Their sites could use simple filtering to provide preferential treatment to customers based on whether they are in the silver, gold, or platinum account class. Or, referring to the age group, the site could recommend savings accounts for college tuition or retirement.
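A minimal sketch of simple filtering for the brokerage example follows. The asset thresholds, class names mapped to content, and content labels are all hypothetical values chosen for illustration.

```python
# Hypothetical content offered to each predefined account class.
CONTENT_BY_CLASS = {
    "silver": ["market news"],
    "gold": ["market news", "analyst reports"],
    "platinum": ["market news", "analyst reports", "dedicated advisor"],
}


def account_class(asset_value):
    """Assign an account to a class; the thresholds are assumptions."""
    if asset_value >= 1_000_000:
        return "platinum"
    if asset_value >= 100_000:
        return "gold"
    return "silver"


def content_for(asset_value):
    """Simple filtering: content depends only on the visitor's class."""
    return CONTENT_BY_CLASS[account_class(asset_value)]
```

Because every visitor in a class receives the same content, the result for each class can be computed once and reused, which is part of why this technique is inexpensive.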
Content-based filtering works by analyzing the content of the objects to form a representation of the visitor's interests. Generally, the analysis needs to identify a set of key attributes for each object and then fill in the attribute values. One example is a document filtering system that analyzes documents based on keywords. Recommending video movie purchases is another example of content-based filtering. Content-based filtering is most suitable when the objects are easily analyzed by computer and the visitor's decision about object suitability is not subjective.
Collaborative filtering collects visitors' opinions on a set of objects, using either explicit or implicit ratings, to form like-minded peer groups and then learns from the peer groups to predict a particular visitor's interest in an item. Instead of finding objects similar to those a visitor liked in the past, as in content-based filtering, collaborative filtering develops recommendations by finding visitors with similar tastes. Recommendations produced by collaborative filtering are based on the peer group's response and are not restricted to a simple profile matching. For product recommendations, collaborative filtering is most suitable for homogeneous, simple products, such as books, CDs, or videos.
The numbers of Web site types, personalization goals, and personalization methods suggest that none of the current techniques can satisfy all needs (see Resources 5 and 6). Generally speaking, different personalization techniques are most suitable for different variables, such as type of Web site, Web site component, or product/services. Consider the case of product recommendations. Selling books or CDs requires techniques different from those required to sell groceries, computers, or apparel. A technique that improves on the best of the current techniques and offers additional options could satisfy a wider set of needs (see Resource 4). With a flexible architecture that allows for multiple recommendation engines, each engine would use specific personalization techniques to make its recommendations (see Resource 3). Such an architecture makes it easy to accommodate new techniques as technology evolves and new requirements develop.
Use content caching to maximize performance
Providing personalization for real-time applications, such as dynamically constructing Web pages based on the visitor's profile, affects system performance. How personalization is deployed is thus important and needs to be integrated into the overall system design. This is especially true for high-volume Web sites.
Caching techniques have long been used to improve the system performance. With content caching, frequently accessed pages do not need to be retrieved remotely or materialized at the server for each access. This can significantly reduce the latency for obtaining Web pages, as well as reduce the load on the server and network. In the Web environment, frequently accessed Web pages can be cached at the client browser, proxy servers, and server caches.
For caching to be effective, data needs to be reused frequently. With personalization, each Web page may be specific to each visitor. Personalization identifies the visitor using a cookie or session logon, and dynamically generates a page specific to the visitor. Dynamic pages are not cached at proxy servers and most server caches. Even if the page were cached at the server or proxy, the likelihood of reusing a personalized page is low, so caching such pages would significantly reduce cache hit rates. Note also that the CPU overhead at the Web server for creating personalized pages can be significantly higher than for serving static pages. There can thus be a performance penalty for introducing personalization to a Web site.
The basic approach to handling personalized and other dynamic pages is to serve the base HTML page from the server, while caching embedded image files. This doesn't require new technology and is how proxy serving typically works on the Web today. For example, IBM WebSphere Performance Pack, installed at Deutsche Telekom as a proxy server, caches the embedded images of popular pages. Since image files tend to outnumber HTML pages, reasonable proxy hit rates are still possible. The drawback is that, even if personalized HTML pages represent significantly less than 50% of the requests and bytes requested from a Web site, the CPU overhead for generating the personalized pages can still be significant and can affect the throughput of the Web site. Where SSL is used for secured pages, serve image files such as GIFs without encryption where possible; this avoids the cost of encrypting and decrypting them and increases reuse of cached images.
Other strategies to maximize performance
IBM is developing technologies and techniques for reducing the overhead of serving dynamic data, such as personalized data (see Resources 3, 4, and 5). Figure 3 shows a multi-tiered Web site and the caching and personalization techniques suitable for each Web site component. The caching levels show that performance is maximized when cache hits occur close to the browsers. Similarly, more complex and sophisticated personalization techniques are introduced as you move through the different tiers and get closer to the database layer. For example, at the ISP and router levels, rule-based and simple filtering may offer sufficient personalization capabilities for a relatively small investment of effort. When more is needed or wanted, more complex techniques can be implemented. Note that all the techniques could be employed at the application server, while just the most complex techniques are in use at the database server. When data mining is needed to develop business intelligence and offer highly sophisticated personalization, the processing occurs at the database layer.
Figure 3. Overview of Web site with personalization and intelligent content distribution
When database changes arrive rapidly, as they do during a sporting event or a trading day, a trigger monitor can be implemented to watch changes (see Resources 7, 8, and 9). Changes can then be propagated forward from the database server to the browser. When a certain number of changes, or certain changes, occur, the trigger monitor rebuilds the affected Web pages and distributes updated pages to the caches. This technique ensures data is current and performance is maximized, making it appropriate for use with dynamic personalized pages and ideally reducing a page's only dynamic content to, for example, personalized account information. The trigger monitor is the key technology at the heart of a robust implementation of intelligent content distribution.
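The trigger monitor idea can be sketched as a mapping from data items to the pages that depend on them. The `TriggerMonitor` class, its method names, and the `render_page` function below are hypothetical; for simplicity the sketch rebuilds pages on every change rather than after a certain number of changes.

```python
from collections import defaultdict


class TriggerMonitor:
    """Rebuilds cached pages when the data they depend on changes."""

    def __init__(self, cache, render):
        self.cache = cache                    # page URL -> rendered page
        self.render = render                  # callable(url, data) -> page text
        self.pages_by_key = defaultdict(set)  # data key -> dependent page URLs
        self.data = {}                        # latest known value per data key

    def register(self, url, keys):
        """Declare which data keys a page depends on."""
        for key in keys:
            self.pages_by_key[key].add(url)

    def on_change(self, key, value):
        """Record a database change and push rebuilt pages to the cache."""
        self.data[key] = value
        affected = sorted(self.pages_by_key.get(key, ()))
        for url in affected:
            self.cache[url] = self.render(url, self.data)
        return affected


def render_page(url, data):
    """Hypothetical renderer: lists the current data values on the page."""
    body = ", ".join(f"{k}={v}" for k, v in sorted(data.items()))
    return f"{url}: {body}"
```

Only pages registered against the changed key are rebuilt, so unaffected pages continue to be served from cache untouched.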
IBM's sports Web sites efficiently create and serve dynamic Web data, including personalized data. These Web sites use new techniques for caching dynamic data as well as for creating complex dynamic Web pages from simpler fragments. Figure 4 depicts the evolution of IBM's sporting event Web sites. Current sites use an integrated cache to serve dynamic pages. An externalized API enables the server to load and invalidate pages as needed. A trigger monitor keeps caches current while content is changing rapidly.
Figure 4. Evolution of IBM sporting event sites
Sites can benefit from content caches, as well as the trigger monitor. A content cache can build certain types of personalized pages from fragments stored in cache. For example, for the 2000 Olympics Web site, advertisements need to be based on the country of origin of the client. This could be done based on the IP address of the client and advertising fragments for each country. More generally, this could be done by partitioning clients into groups, with each group being served pages for a specific URL, where some of the page fragments are personalized based on the client group. The client group could be identified in various ways, for example, by source IP address, URL extensions identifying the client or group, or cookies. Then, based on the client group, the content cache combines specific page fragments, sometimes called tagged content, to compose the personalized page. A tagged content design facilitates managing and reusing content fragments (see Resource 2). The level of personalization possible with these intelligent content distribution techniques covers a significant subset of personalization requirements. However, the techniques at the cache are still more limited than personalization achievable at the server because available information about the client is limited, and performance requirements limit the degree of personalization.
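A minimal sketch of composing a page from cached fragments by client group is shown here. The fragment store, slot names, group codes, and the `client_group` and `assemble_page` helpers are hypothetical; in particular, the country lookup from the IP address is assumed to happen elsewhere and is passed in as a value.

```python
# Hypothetical fragment store: (slot, client group) -> cached HTML fragment.
FRAGMENTS = {
    ("header", "default"): "<h1>Event results</h1>",
    ("ad", "us"): "<p>Ad for visitors from the US</p>",
    ("ad", "de"): "<p>Ad for visitors from Germany</p>",
    ("ad", "default"): "<p>Generic ad</p>",
}
PAGE_SLOTS = ["header", "ad"]  # order of fragments on the page
page_cache = {}                # client group -> assembled page


def client_group(cookies, country_from_ip):
    """Pick a group from a cookie if present, else from the IP-derived country."""
    return cookies.get("group") or country_from_ip or "default"


def assemble_page(group):
    """Compose the page for a group once, then reuse it for later requests."""
    if group not in page_cache:
        parts = [
            FRAGMENTS.get((slot, group), FRAGMENTS[(slot, "default")])
            for slot in PAGE_SLOTS
        ]
        page_cache[group] = "\n".join(parts)
    return page_cache[group]
```

Every visitor in the same group receives the same assembled page, so the page is built once per group rather than once per visitor.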
The trigger monitor keeps track of the scores and statistics that arrive rapidly during sporting events. As the database updates arrive, the trigger monitor keeps track of changes, rebuilds the affected Web pages, and distributes updated pages to the caches, assuring they are kept current and the personalized Web sites are updated as well.
You can reduce the overhead of personalization by reducing the degree of personalization. For example, instead of creating pages specialized for each individual client, you could create sets of pages specialized (tagged) to groups of visitors. This could significantly reduce the total number of pages and allow reuse of some pages, thus increasing the utility of caching. This reduced level of personalization can be provided at a content cache.
You can also vary the degree of personalization based on server load. When servers are heavily loaded, the amount of personalization could be minimized. For example, for personalized advertisements, when the server is highly loaded, random advertisements can be included in the page, while at lower loads the advertisements can be highly targeted. You could combine this technique with content caches: when the server is highly loaded, pages with a lower degree of personalization could be served from the cache, while deep personalization could be done at the server when server load permits.
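Load-dependent advertisement selection might look like the following sketch. The ad inventory, the keyword-overlap scoring, and the 0.8 load threshold are assumptions made for illustration; `server_load` is taken to be a utilization figure between 0 and 1 supplied by the caller.

```python
import random

# Hypothetical ad inventory: ad name -> keywords it is relevant to.
ADS = {
    "rackets": {"tennis", "sports"},
    "mutual-funds": {"stocks", "retirement"},
    "cookbooks": {"cooking"},
}


def choose_ad(interests, server_load, high_load=0.8, rng=random):
    """Pick a random ad under heavy load, a targeted ad otherwise."""
    names = sorted(ADS)
    if server_load >= high_load:
        # Cheap path: no profile analysis at all.
        return rng.choice(names)
    # Targeted path: the ad whose keywords overlap most with the
    # visitor's interests (ties go to the first name alphabetically).
    return max(names, key=lambda name: len(ADS[name] & interests))
```

The cheap path never touches the visitor profile, which is the point: under load, the site trades targeting quality for throughput.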
Personalized Web pages can be assembled at the client if it is enabled with Java. Some sites even provide Java to the client to optimize personalization and performance.
Among the multiple recommendation engines, one uses a new content-based collaborative filtering approach, in which object content is taken into account when performing collaborative filtering. This technique achieves the advantages of both content-based and collaborative filtering approaches. The content-based collaborative filtering technique is applicable to both product and document recommendations.
Because collecting visitor information can be expensive and can also affect performance, you should be able to measure its effectiveness. The issue is not only what to recommend, but also when and how. The personalized recommendation engines deal with the issue of what to recommend given a set of alternatives, but a more sophisticated application would decide when to invoke the recommendation engine and how to apply it, for example, whether to send the customer an e-mail or e-coupon, or add a Web link on the personalized Web page.
Personalizing your site based on site classification
IBM's IT experts have been working with customers to analyze many of the world's largest Internet and intranet sites, including IBM's own, to determine which attributes affect scalability and to help customers implement scalable Web sites. IBM has determined that:
- Large sites are distinguished primarily by workload pattern
- Based on workload patterns, Web sites can generally be classified into five types: publish/subscribe, online shopping, customer self-service, trading, and business-to-business
- Scaling techniques must be selected and applied based on workload pattern
If you're unfamiliar with the Web site classifications, refer to Appendix B: Summary of high-volume Web site classifications.
In the same way that a workload pattern suggests appropriate scaling techniques, it can also suggest the most effective personalization techniques. While it's possible to implement any or all of the techniques at each site type, some techniques require significant effort and may degrade performance; you may or may not need that level of investment.
For each type of Web site, Figure 5 shows the personalization techniques that would be most effective. Note, for example, that rule-based techniques apply to all site types except publish/subscribe, while all techniques apply to the self-service and business-to-business sites. After you determine which type of site you have, use this table to identify the personalization techniques you should consider. Note that at least one effective, relatively simple technique is suggested for each type. From another perspective, consider Amazon.com, one of the most successful and "smartest" online shopping sites (see Resource 2). Given the volume and attributes of Amazon's objects, content-based filtering would require excessive effort and so would not be considered effective.
Figure 5. Personalization techniques mapped to workload patterns
| Site type | |||||
| Technique | Publish/subscribe | Online shopping | Self-service | Trading | Business-to-business |
| Rule based | | X | X | X | X |
| Simple filtering | X | X | X | X | X |
| Content-based filtering | X | X | X | X | |
| Collaborative filtering | X | X | X | ||
Quite simply, personalization has become a required, expected feature of an e-business Web site. The presence and quality of site personalization determines whether visitors find your site attractive and return to it with an intention to buy. The real question is not whether to personalize, but how and how much, and how to implement personalization while maximizing performance, which can be as important as the business effectiveness of the techniques you choose. In this paper, you've learned about current personalization techniques and the significance of intelligent content distribution and other techniques to maximize site performance. During site design, be sure to consider your workload pattern and to insist that your personalization and caching strategies be considered early and in relation to each other.
IBM has products and services that can help you get started today and position your site for enhancements as your business rules and requirements change and additional personalization techniques are developed.
Appendix A: More on filtering techniques
Content-based filtering works by analyzing the content of the objects to form a representation of the visitor's interests. Generally, the analysis needs to identify a set of key attributes for each object and then fill in the attribute values.
Recommending video purchases is an example of content-based filtering. The example below uses seven attributes to analyze video content: action, drama, sex, violence, suspense, humor, and offbeat. The rating goes from 0 to 10 indicating the intensity. For example, a violence rating of 10 means extreme violence and 0 means no violence.
| Figure 6. An example of content-based filtering | |||||||
| Video / Attribute | Action | Drama | Humor | Sex | Violence | Suspense | Offbeat |
| (A) Silence of the Lambs | 7 | 3 | 1 | 9 | 10 | ||
| (B) Seven | 5 | 5 | 1 | 2 | 10 | 9 | 5 |
| (C) Cape Fear | 5 | 7 | 4 | 5 | 9 | 9 | 3 |
| (D) Casablanca | 2 | 10 | 5 | 0 | 1 | 8 | |
| (E) Waterboy | 4 | 2 | 6 | 3 | 4 | 3 | 1 |
| (F) L.A. Confidential | 8 | 9 | 6 | 6 | 9 | 9 | 6 |
| (G) West Side Story | 3 | 5 | 4 | 0 | 1 | 3 | 1 |
Using a concept known as "Euclidean distance" or nearest neighbor, content-based filtering analyzes the ratings to determine, for any one video, which other video has the closest ratings and could therefore be recommended to a visitor who ordered the first video. For example, Silence of the Lambs could be found to come closest in content to Seven, in which case Seven could be a candidate to recommend to customers interested in Silence of the Lambs.
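The nearest neighbor calculation can be sketched as follows. As an assumption for this illustration, the sketch uses only the five rows of Figure 6 that list all seven ratings (Seven, Cape Fear, Waterboy, L.A. Confidential, and West Side Story); the function names are hypothetical.

```python
import math

# Ratings in column order: Action, Drama, Humor, Sex, Violence, Suspense, Offbeat.
# Only rows of Figure 6 with all seven values are included.
VIDEOS = {
    "Seven": (5, 5, 1, 2, 10, 9, 5),
    "Cape Fear": (5, 7, 4, 5, 9, 9, 3),
    "Waterboy": (4, 2, 6, 3, 4, 3, 1),
    "L.A. Confidential": (8, 9, 6, 6, 9, 9, 6),
    "West Side Story": (3, 5, 4, 0, 1, 3, 1),
}


def distance(a, b):
    """Euclidean distance between the attribute vectors of two videos."""
    return math.dist(VIDEOS[a], VIDEOS[b])


def nearest(video):
    """Return the other video whose ratings are closest to `video`."""
    others = [name for name in VIDEOS if name != video]
    return min(others, key=lambda name: distance(video, name))
```

On these rows, the closest match for Seven is Cape Fear, so Cape Fear would be a candidate to recommend to a visitor who ordered Seven.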
Collaborative filtering collects visitors' opinions on a set of items, using either explicit or implicit ratings, to form like-minded peer groups and then learns from the peer groups to predict a particular visitor's interest in an item. Instead of finding objects similar to those a visitor liked in the past, as in content-based filtering, collaborative filtering develops recommendations by finding visitors with similar tastes.
Below is an example of collaborative filtering. Assume each person can rate a video from 1 to 7, where 7 means strongly like, 4 is neutral, and 1 means strongly dislike. Videos A through G represent the seven videos shown in the previous table.
| Figure 7. An example of collaborative filtering | |||||||
| Video / Visitor | A | B | C | D | E | F | G |
| Adam | 7 | 6 | 2 | 2 | |||
| Bill | 7 | 1 | 2 | 5 | |||
| Jennifer | 4 | 2 | 2 | ||||
| John | 6 | 2 | 7 | 7 | |||
| Mary | 2 | 7 | 7 | ||||
| Rose | 1 | 7 | 6 | ||||
| Susan | 2 | 6 | 7 | 6 | |||
For ease of illustration, we again use a nearest neighbor measure of closeness. When measuring the distance between two persons, only videos both have rated are considered. For example, when considering the distance between Adam and John, only the ratings on videos A and C are considered. Adam's close peers are Bill and John. For Adam, we can recommend video E based on John's liking. The point to note here is that the content of video E, Waterboy, can be quite different from the content of videos rated highly by Adam. Although its content is similar to that of videos A and C, video B, Seven, will not be recommended to Adam, because his peer, John, does not like it. In contrast, content-based filtering would recommend video B to Adam, based solely on the fact that its content is similar to videos A and C, which are liked by Adam.
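The peer-based recommendation can be sketched as below. The ratings here are illustrative values chosen to agree with the statements in the text (Adam and John share ratings only on videos A and C, John likes E and dislikes B); the remaining entries, including Mary's row, are assumptions and should not be read as the exact contents of Figure 7. The "like" threshold of 5 on the 1-to-7 scale is also an assumption.

```python
import math

# Illustrative ratings (1 = strongly dislike, 4 = neutral, 7 = strongly like).
RATINGS = {
    "Adam": {"A": 7, "C": 6, "F": 2, "G": 2},
    "John": {"A": 6, "B": 2, "C": 7, "E": 7},
    "Mary": {"A": 2, "D": 7, "G": 7},
}


def distance(a, b):
    """Euclidean distance over videos both visitors rated; None if no overlap."""
    shared = RATINGS[a].keys() & RATINGS[b].keys()
    if not shared:
        return None
    return math.sqrt(sum((RATINGS[a][v] - RATINGS[b][v]) ** 2 for v in shared))


def nearest_peer(visitor):
    """Return the visitor with the smallest distance to `visitor`."""
    best_name, best_distance = None, None
    for other in RATINGS:
        if other == visitor:
            continue
        d = distance(visitor, other)
        if d is not None and (best_distance is None or d < best_distance):
            best_name, best_distance = other, d
    return best_name


def recommend(visitor, like_threshold=5):
    """Recommend videos the nearest peer liked that the visitor has not rated."""
    peer = nearest_peer(visitor)
    if peer is None:
        return []
    seen = RATINGS[visitor]
    return sorted(
        video for video, rating in RATINGS[peer].items()
        if rating >= like_threshold and video not in seen
    )
```

With these values John is Adam's nearest peer, so Adam is offered video E but not video B, which John rated low.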
Rule-based techniques and simple filtering offer significant personalization capabilities for an investment of effort relatively smaller than content-based and collaborative filtering.
Content-based filtering is most suitable when the objects are easily analyzed by computer and the visitor's decision about object suitability is not subjective. For some objects, such as the videos, analyzing content cannot be automated today, and the effort to identify attributes and evaluate each object can be considerable and require specific knowledge or skills. Recommendations are limited to objects related to those the visitor has tried, with no provision for visitor qualification. Relating these limitations to the video example, Adam cannot get a recommendation on a video that belongs to a group or category he has never rated or tried. If video B is closely related to video A on the content, it will always be recommended to visitors interested in video A. Whether anyone interested in video A actually finds video B worth viewing is not factored into the consideration for making the recommendation.
Even in the case of recommending documents or Web pages, which is most amenable to automation, content-based filtering remains an active area of research because of inherent redundancies and ambiguities in textual descriptions. The basic approach is to treat each document as a weighted vector of keywords and partition documents into clusters. From the documents of interest to each visitor, one can derive the relevant keywords or document clusters of interest to a visitor.
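A minimal sketch of the keyword-vector representation follows. It weights keywords by raw word counts and compares documents with cosine similarity; both choices are assumptions made for illustration (other weightings and similarity measures are common), and the clustering step itself is not shown.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Represent a document as keyword -> weight (here, a raw word count)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two keyword vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical documents for illustration.
visited = keyword_vector("stock market news today")
candidate_a = keyword_vector("stock market report")
candidate_b = keyword_vector("tennis scores today")
```

Here `candidate_a` shares more keywords with the visited document than `candidate_b` does, so it scores higher and would be the better document to recommend.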
Basic collaborative filtering addresses some of the shortcomings of pure content-based filtering. Recommendations produced by collaborative filtering are qualified based on the peer group's response and are not restricted to a simple profile matching. Let's go back to the video example. From content-based analysis, video F (L.A. Confidential) is close to video C (Cape Fear). Although John likes video C, video F will not be recommended to him because his peers (Adam and Bill) dislike video F. However, collaborative filtering requires a large customer base in order to find a peer group for each visitor. This also might imply a long learning curve, because in the beginning when the number of participating visitors is small, the quality of the recommendations will be low. The results improve gradually as the number of participating visitors increases. In any case, there always may be someone with such a unique taste that no other person will show similar behavior.
Collaborative filtering requires visitors to rate objects, introducing the biases of different visitors. Certain people tend to give ratings on the extreme ends of the scale, and others tend to give ratings around the middle. This can make the formation of a peer group difficult.
For product recommendations, collaborative filtering is most suitable for homogeneous, simple products, such as books, CDs, or videos. This is because a peer group is determined using some type of nearest neighbor algorithm, where each person is represented by a vector of ratings. The more objects two visitors have rated similarly, the closer the two visitors are. This implicitly assumes a homogeneous environment. In a computer store such as CompUSA, there are objects of vastly different characteristics and prices, from cable to memory chips to software CDs to PCs. Collaborative filtering to form peer groups and make recommendations for such a store would need to account not only for how many objects, but also for which type of objects are of common interest.
Appendix B: Summary of high-volume Web site classifications
Publish/subscribe Web sites provide visitors with information. Some examples include search engines, media sites such as newspapers and magazines, and event sites such as those for the Olympics and for the tennis championships at Wimbledon. Site content changes frequently, driving changes to page layouts. While search traffic is low volume, the number of unique items sought is high, resulting in the largest number of page views of all site types. As an example, the Wimbledon site successfully handled a peak volume of 430,000 hits per minute using IBM WebSphere Performance Pack. Security considerations are minor compared to other site types. Data volatility is low. This site type processes the fewest transactions and has little or no connection to legacy systems.
Online shopping sites let visitors browse and buy. Examples are typical retail sites where visitors buy books, clothes, and even cars. Site content can be relatively static, such as a parts catalog, or dynamic, with items frequently added and deleted as, for example, promotions and special discounts come and go. Search traffic is heavier than on publish/subscribe sites, though the number of unique items sought is not as large. Data volatility is low. Transaction traffic is moderate to high, and almost always grows. Typical daily volumes for many large retail customers running on IBM Net.Commerce range from less than 1 million to over 3 million hits per day, and from 100,000 to 700,000 transactions per day in the top range; typically between 1% and 5% of all transactions are buy transactions. When visitors buy, security requirements become significant and include privacy, nonrepudiation, integrity, authentication, and regulations. Shopping sites have more connections to legacy systems, such as fulfillment systems, than publish/subscribe sites, but generally fewer than the other site types.
Customer self-service sites let visitors help themselves. Sample sites include banking from home, tracking packages, and making travel arrangements. Data comes largely from legacy applications and often from multiple sources, thereby exposing data consistency issues. Security considerations are significant for home banking and purchasing travel services, less so for other uses. Search traffic is low volume; transaction traffic is low to moderate, but growing.
Trading sites let visitors buy and sell. Of all site types, trading sites have the most volatile content, the highest transaction volumes (with significant swing), the most complex transactions, and the most time sensitivity. Products like IBM's CICS high-volume transaction processing system play a key role at these sites. Trading sites are tightly connected to the legacy systems, for example, using IBM MQSeries for connectivity. Nearly all transactions interact with the back-end servers. Security considerations are high, equivalent to online shopping, with an even larger number of secure pages. Search traffic is low volume.
Business-to-business sites let businesses buy from and sell to each other. Many businesses are implementing a Web site for their purchasing applications. Such purchasing activity may also be characteristic of other site types, such as publish/subscribe sites and self-service sites. Data comes largely from legacy applications and often from multiple sources, thereby exposing data consistency issues. Security requirements are equivalent to online shopping. Transaction volume is low to moderate, but growing; transactions are typically complex, connecting multiple suppliers and distributors.
For more information, contact Willy Chiu, Director, High Volume Web Sites, at wchiu@us.ibm.com.
The High-Volume Web site team is grateful to the major contributors to this article: Charu Aggarwal, Jim Challenger, Daniel Dias, Paul Dantzig, Arun Iyengar, Carol Jones, Doug Riecken, Jacob P. Ukelson, Robert Will, Joel Wolf, Kun-lung Wu, and Philip S. Yu.
- Forrester Research, Inc., "Smart Personalization," July 1999.
- "The Intelligent Recommendation Analyzer," by C. Aggarwal, J.L. Wolf, K-L. Wu, and P.S. Yu, to appear in proceedings of ICDCS Workshop on Knowledge Discovery and Data Mining, April 2000.
- "On Text Mining Techniques for Personalization," by C. Aggarwal and P.S. Yu, in New Directions in Rough Sets, Data Mining, and Granular-Soft Computing (Lecture Notes in Artificial Intelligence 1711), ed. by A. Skowron and N. Zhong, Springer Verlag, 1999.
- "Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering," by C. Aggarwal, J.L. Wolf, K-L. Wu, and P.S. Yu, Proc. 1999 ACM SIGKDD Conference, San Diego, CA, Aug. 1999, pp. 201-212.
- IBM High-Volume Web Sites, "Design for Scalability," December 1999.
- "On the Merits of Building Categorization Systems by Supervised Clustering," by C. Aggarwal, S. Gates, and P.S. Yu, Proc. 1999 ACM SIGKDD Conference, San Diego, CA, Aug. 1999, pp. 352-356.
- "A Scalable System for Consistently Caching Dynamic Web Data," by Jim Challenger, Arun Iyengar, and Paul Dantzig, in Proceedings of IEEE INFOCOM'99, New York, New York, March 1999.
- "A Publishing System for Efficiently Creating Dynamic Web Content," by Jim Challenger, Arun Iyengar, Karen Witting, Cameron Ferstat, and Paul Reed, to appear in Proceedings of INFOCOM 2000, March 2000 (latest version); also IBM Research Report RC 21546(97091), July 1999 (an earlier version).





