Level: Introductory
Willy Chiu (wchiu@us.ibm.com), Vice-President, High Volume Web Sites, Software Group (AIM Division)
17 Apr 2001
This paper introduces current and future techniques for personalizing your Web site. Techniques for maximizing the performance of personalized Web sites, such as content caching, are also discussed.
A
successful e-business Web site gives special treatment
to its repeat visitors who buy. Does yours? If
it doesn't, you know it needs to. If it already
does, you know it can do better. And even if it's
pretty good, it could be faster. Providing special
treatment in the form of information and applications
matched to a visitor's interests, roles, and needs
is known as personalization. A personalized
e-business site is more likely to attract and
retain visitors and to build sales. Personalized
sites for employees improve their productivity
by simplifying access to information and applications.
Overall customer satisfaction is increased when
less time is required to locate account information,
and service is personalized to the customer's
needs. Two common reasons for personalizing a
site are to make the site easier to use and to
increase sales.
Personalization
is a process of gathering and storing information
about site visitors, analyzing the information,
and, based on the analysis, delivering the right
information to each visitor at the right time.
A number of personalization techniques, with more
on the way, can enable your site to target advertising,
promote products, personalize news feeds, recommend
documents, offer appropriate advice, and target
e-mail.
Providing
personalization for real-time applications affects
the system performance. How personalization is
deployed is thus important and needs to be integrated
into the overall system design. This is especially
true for high-volume Web sites. As described in
"Design for scalability" (see Resource
1), your selection of personalization techniques
should be directed by your Web site type. In our
work with high-volume Web sites, IBM determined
there are generally five types of sites, distinguished
by workload pattern: publish/subscribe, online
shopping, customer self-service, trading, and
business-to-business. Regardless of type, Web
sites look increasingly to the use of personalization
to increase repeat business.
This
paper introduces personalization and describes
some current techniques. It also explains how
personalization affects the system performance
and introduces techniques such as content caching,
also called intelligent content distribution,
for implementing appropriate, effective personalization
while still meeting the performance requirements
of high-volume e-business sites. Finally, the
paper suggests what we believe to be the most
effective personalization techniques for each
type of Web site.
The
information contained in this document has not
been submitted to any formal IBM test and is distributed
as is. The use of this information or the implementation
of any of these techniques is a customer responsibility
and depends on the customer's ability to evaluate
and integrate the techniques into the customer's
operational environment. While each item may have
been reviewed by IBM for accuracy in a specific
situation, there is no guarantee that the same
or similar results will be obtained elsewhere.
Customers attempting to adapt these techniques
to their own environments do so at their own risk.
Introducing personalization
Personalization is a process of gathering and
storing information about site visitors, analyzing
the information, and, based on the analysis, delivering
the right information to each visitor at the right
time. It is a key technology needed in various
e-business applications, such as:
- Managing
customer relationships
- Targeting
advertisements and promoting products
- Managing
marketing campaigns
- Managing
Web site content
- Managing
knowledge
- Managing
personalized portals and channels
Although
each application area may need tailoring, especially
in the areas of user interface and data collection,
the core techniques for personalization, depicted
in Figure 1, are quite
similar.
Figure 1. Elements of a personalization system
Personalization
has gone through different phases. Initially,
personalization was used to keep the visitor on
the site, exploring more of the site, which provided
opportunities to advertise and promote products.
The next phase attempted to increase how much
money a visitor spent at each visit by offering
more expensive or related products. Today, personalization
is increasingly used as a means to expedite the
delivery of information to a visitor, making the
site useful and attractive to return to.
In
July 1999, Forrester Research published a report,
"Smart Personalization" (see Resource
2), describing their research to date on why and
how companies implement personalization. E-businesses
want personalization to accomplish goals that
range from making their sites easier to use to
increasing sales. The overarching goal is to increase
repeat business. Companies use different methods
to personalize their e-business sites. The most
common are tailored e-mail alerts, customized
content, and account access.
True
measurements of the results of installing personalization
features are not available. Companies implement
personalization simply because they think it's
worth the investment. Depending on the size and complexity
of the effort, some believe that an investment in
personalization can be returned in less than 12
months. Successful sites, such as Amazon.com and
Garden.com, use rich profile information as the
basis for providing valuable services. These sites
are considered models for those who want to personalize
their sites.
Custom
pricing, customized content, targeted marketing,
and advertising are more advanced personalization
methods that require sophisticated data mining.
These methods rely on personalized Web pages and
deliver business value by enabling site owners
to determine how and when to change site content.
However, dynamically building such pages requires
additional resources and may affect overall system
performance. Minimizing the impact of these pages
requires a personalization engine that is scalable
to handle a large number of requests, a large
and complex content space, and the collection
of customer information.
Personalization
This section introduces current techniques for
collecting and analyzing information. Figure
2 is an overview of personalization techniques.
The major steps -- collecting visitor information,
filtering, and developing recommendations -- may
or may not be performed dynamically; part or all
of some steps may be performed offline, in batch
mode, or even manually.
Collecting visitor information
The objective of collecting visitor information
is to develop a profile that describes a site
visitor's interests, role in an organization,
entitlements, purchases, or some other set of
descriptors important to the site owner. The most
common techniques are explicit profiling, implicit
profiling, and using legacy data:
- Explicit
profiling asks each visitor to fill out
information or questionnaires. This method has
the advantage of letting customers tell the
site directly what they want to see. An example
is MyYahoo, where the visitor is asked to specify
profile information, including, for example,
what stocks to track and what news categories
to report. MyYahoo dynamically constructs a
personalized Web page accordingly.
- Implicit
profiling tracks the visitor's behavior.
This technique is generally transparent to the
visitor. Browsing and buying patterns are the
behaviors most often assessed. The browsing
pattern is usually tracked by saving specific
visitor identification and behavior information
in what is called a cookie that is kept at the
browser and updated at each visit. The buying
pattern is generally available in the customer
purchase database. For example, Amazon.com logs
each customer's buying history and, based on
that history, recommends specific purchases.
- Using
legacy data accesses legacy data for valuable
profile information, such as credit applications
and previous purchases. For existing customers
and known visitors, legacy data often provides
the richest source of profile information.
Figure 2. Overview of personalization techniques
The
techniques can be combined to produce comprehensive
profiles. Access to legacy data can be an important
component of explicit or implicit profiling. Profile
and legacy data become the metadata processed
by the filtering techniques.
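As a rough illustration of how the three sources might be combined into one profile, consider the minimal sketch below. The `VisitorProfile` class, its field names, and the precedence order on conflicting keys are assumptions made for this example; they are not taken from any IBM product.

```python
from dataclasses import dataclass, field


@dataclass
class VisitorProfile:
    """Hypothetical profile record; field names are illustrative only."""
    visitor_id: str
    explicit: dict = field(default_factory=dict)  # answers the visitor supplied
    implicit: dict = field(default_factory=dict)  # observed browsing and buying
    legacy: dict = field(default_factory=dict)    # data from existing systems

    def metadata(self) -> dict:
        """Flatten the three sources into one dict for the filtering step.

        Assumption: when keys clash, explicit answers override implicit
        observations, which in turn override legacy records.
        """
        merged = dict(self.legacy)
        merged.update(self.implicit)
        merged.update(self.explicit)
        return merged


profile = VisitorProfile(
    visitor_id="v-001",
    explicit={"news": ["technology"], "stocks": ["IBM"]},
    implicit={"last_viewed": "laptops", "news": ["sports"]},
    legacy={"previous_purchases": ["printer"]},
)
merged = profile.metadata()
```

Here the visitor's stated news preference wins over the category inferred from browsing, while the legacy purchase history is carried along unchanged.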
Analyzing visitor profiles
When the profile is available, the next step is
to analyze the profile information in order to
present or recommend documents, purchases, or
actions specific to the visitor. Making such recommendations
is the most challenging step. Many techniques
for presenting content and making recommendations
are in use or under development. Rule-based and
filtering techniques are the best known.
Rule-based
techniques
Rule-based techniques provide a visual editing
environment for the business administrator to
specify business rules to drive personalization.
This requires the administrator, most likely with
the help of a consultant, to figure out the appropriate
rules. The rule-based approach provides a flexible
mechanism to specify rules for business applications
or marketing campaigns. IBM WebSphere provides
a set of tools and services that enable an e-business
development team to easily create personalized
Web sites.
Cross-selling
is an e-business example of the rule-based technique.
For example, a rule could be specified to offer
product X to a customer who has just bought product
Y: a customer who buys a book might be
interested in current or previous books by the
same author or in books on the same subject.
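A cross-selling rule of this kind can be sketched as a simple lookup. The rule table and product names below are invented for illustration, and a real rule editor would offer far richer conditions than this.

```python
# Each rule maps a product just bought (Y) to products to offer next (X).
# The product names are made up for illustration.
CROSS_SELL_RULES = {
    "novel-by-author-a": ["earlier-novel-by-author-a", "anthology-same-subject"],
    "digital-camera": ["memory-card"],
}


def cross_sell(just_bought: str, already_owned: set) -> list:
    """Return offers triggered by a purchase, skipping items already owned."""
    offers = CROSS_SELL_RULES.get(just_bought, [])
    return [item for item in offers if item not in already_owned]


offers = cross_sell("novel-by-author-a", {"anthology-same-subject"})
```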
Rule-based
techniques can be used with filtering techniques,
either before or after the filtering process,
to develop the best recommendation.
Filtering
techniques
Filtering techniques employ algorithms to analyze
metadata and drive presentation and recommendations.
The three most common filtering techniques --
simple filtering, content-based filtering, and
collaborative filtering -- are introduced below.
These techniques are described in more detail
in Appendix A: More on filtering
techniques.
Simple
filtering relies on predefined groups, or classes,
of visitors to determine what content is displayed
or what service is provided. An example of simple
filtering is managing access to corporate information.
For example, employees identified with the Human
Resources department would have personalized Web
sites that give them access to information and
applications specific to their job. Online brokerages
often classify their accounts by asset value or
age groups. Their sites could use simple filtering
to provide preferential treatment to customers
based on whether they are in the silver, gold,
or platinum account class. Or, referring to the
age group, the site could recommend savings accounts
for college tuition or retirement.
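In code, simple filtering amounts to little more than a table lookup keyed by visitor class. The sketch below assumes hypothetical content names and a fallback to the lowest tier for unknown classes.

```python
# Hypothetical mapping from predefined account class to the content shown.
CONTENT_BY_CLASS = {
    "silver": ["market-news"],
    "gold": ["market-news", "analyst-reports"],
    "platinum": ["market-news", "analyst-reports", "personal-advisor"],
}


def content_for(account_class: str) -> list:
    """Simple filtering: content depends only on the visitor's class."""
    # Assumption: unknown classes fall back to the most basic tier.
    return CONTENT_BY_CLASS.get(account_class, CONTENT_BY_CLASS["silver"])
```

Because the result depends only on the class, the same rendered content can be shared by every visitor in that class.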
Content-based
filtering works by analyzing the content of the
objects to form a representation of the visitor's
interests. Generally, the analysis needs to identify
a set of key attributes for each object and then
fill in the attribute values. One example is a
document filtering system that analyzes documents
based on keywords. Recommending video purchases
is another example of content-based filtering.
Content-based filtering is most suitable when
the objects are easily analyzed by computer and
the visitor's decision about object suitability
is not subjective.
Collaborative
filtering collects visitors' opinions on a set
of objects, using either explicit or implicit
ratings, to form like-minded peer groups and then
learns from the peer groups to predict a particular
visitor's interest in an item. Instead of finding
objects similar to those a visitor liked in the
past, as in content-based filtering, collaborative
filtering develops recommendations by finding
visitors with similar tastes. Recommendations
produced by collaborative filtering are based
on the peer group's response and are not restricted
to a simple profile matching. For product recommendations,
collaborative filtering is most suitable for homogeneous,
simple products, such as books, CDs, or videos.
The
numbers of Web site types, personalization goals,
and personalization methods suggest that none
of the current techniques can satisfy all needs
(see Resources 5 and
6). Generally speaking, different personalization
techniques are most suitable for different variables,
such as type of Web site, Web site component,
or product/services. Consider the case of product
recommendations. Selling books or CDs requires
techniques different from those required to sell
groceries, computers, or apparel. A technique
that improves on the best of the current techniques
and offers additional options could satisfy a
wider set of needs (see Resource
4). With a flexible architecture that allows for
multiple recommendation engines, each engine would
use specific personalization techniques to make
its recommendations (see Resource
3). Such an architecture makes it easy to accommodate
new techniques as technology evolves and new requirements
develop.
Use content caching to maximize performance
Providing personalization for real-time applications,
such as dynamically constructing Web pages based
on the visitor's profile, affects system performance.
How personalization is deployed is thus important
and needs to be integrated into the overall system
design. This is especially true for high-volume
Web sites.
Caching
techniques have long been used to improve the
system performance. With content caching, frequently
accessed pages do not need to be retrieved remotely
or materialized at the server for each access.
This can significantly reduce the latency for
obtaining Web pages, as well as reduce the load
on the server and network. In the Web environment,
frequently accessed Web pages can be cached at
the client browser, proxy servers, and server
caches.
For
caching to be effective, data needs to be reused
frequently. With personalization, each Web page
may be specific to each visitor. Personalization
identifies the visitor using a cookie or session
logon, and dynamically generates a page specific
to the visitor. Dynamic pages are not cached at
proxy servers and most server caches. Even if
the page were cached at the server or proxy, the
likelihood of reusing a personalized page is low.
Caching such pages would therefore lower cache hit
rates significantly. Note also that the CPU overhead at the
Web server for creating personalized pages can
be significantly higher than that for serving static pages.
There can thus be a performance penalty for introducing
personalization to a Web site.
The
basic approach to handling personalized and other
dynamic pages is to serve the base HTML page from
the server, while caching embedded image files.
This doesn't require new technology and is how
proxy serving typically works on the Web today.
For example, IBM WebSphere Performance Pack, installed
at Deutsche Telekom as a proxy server, caches
the embedded images of popular pages. Since image
files tend to outnumber HTML pages, reasonable
proxy hit rates are still possible. The drawback
is that, even if personalized HTML pages represent
significantly less than 50% of the requests and
bytes requested from a Web site, the CPU overhead
for generating the personalized pages can still
be significant and can affect the throughput of
the Web site. Where SSL is used for secured pages,
avoid encrypting and decrypting GIF image files;
this improves performance by increasing reuse of
cached images.
Other strategies to maximize performance
IBM is developing technologies and techniques
for reducing the overhead of serving dynamic data,
such as personalized data (see Resources
3, 4, and 5). Figure 3
shows a multi-tiered Web site and the caching
and personalization techniques suitable for each
Web site component. The caching levels show that
performance is maximized when cache hits occur
close to the browsers. Similarly, more complex
and sophisticated personalization techniques are
introduced as you move through the different tiers
and get closer to the database layer. For example,
at the ISP and router levels, rule-based and simple
filtering may offer sufficient personalization
capabilities for a relatively small investment
of effort. When more is needed or wanted, more
complex techniques can be implemented. Note that
all the techniques could be employed at the application
server, while just the most complex techniques
are in use at the database server. When data mining
is needed to develop business intelligence and
offer highly sophisticated personalization, the
processing occurs at the database layer.
Figure 3. Overview of Web site with personalization and intelligent content distribution
When
database changes arrive rapidly, as they do during
a sporting event or a trading day, a trigger
monitor can be implemented to watch changes
(see Resources 7, 8,
and 9). Changes can then be propagated forward
from the database server to the browser. When
a certain number of changes, or certain changes,
occur, the trigger monitor rebuilds the affected
Web pages and distributes updated pages to the
caches. This technique ensures data is current
and performance is maximized, making it appropriate
for use with dynamic personalized pages and ideally
reducing a page's only dynamic content to, for
example, personalized account information. The
trigger monitor is the key technology at the heart
of a robust implementation of intelligent content
distribution.
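The following toy sketch shows the general idea of a trigger monitor: it records which pages depend on which data items and, when an item changes, rebuilds only the affected pages and pushes them into the cache. The class, its methods, and the `render` function are assumptions for illustration and do not represent IBM's implementation.

```python
class TriggerMonitor:
    """Toy trigger monitor: rebuilds cached pages affected by a data change."""

    def __init__(self, cache: dict, render):
        self.cache = cache        # page URL -> rendered page
        self.render = render      # callable: (url, data) -> page text
        self.dependencies = {}    # data key -> set of page URLs that use it
        self.data = {}            # current copy of the monitored data

    def register(self, url: str, keys: list) -> None:
        """Record which data keys a page depends on."""
        for key in keys:
            self.dependencies.setdefault(key, set()).add(url)

    def on_change(self, key: str, value) -> list:
        """Apply a database change and push rebuilt pages into the cache."""
        self.data[key] = value
        affected = sorted(self.dependencies.get(key, ()))
        for url in affected:
            self.cache[url] = self.render(url, self.data)
        return affected


def render(url, data):
    """Stand-in page builder that lists the current data values."""
    return f"{url}: " + ", ".join(f"{k}={v}" for k, v in sorted(data.items()))


cache = {}
monitor = TriggerMonitor(cache, render)
monitor.register("/scores", ["match1"])
monitor.register("/home", ["match1", "headline"])
rebuilt = monitor.on_change("match1", "2-1")
```

A production design would also batch changes, as described above, rather than rebuilding on every single update.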
IBM's
sports Web sites efficiently create and serve
dynamic Web data, including personalized data.
These Web sites use new techniques for caching
dynamic data as well as for creating complex dynamic
Web pages from simpler fragments. Figure
4 depicts the evolution of IBM's sporting
event Web sites. Current sites use an integrated
cache to serve dynamic pages. An externalized
API enables the server to load and invalidate
pages as needed. A trigger monitor keeps caches
current while content is changing rapidly.
Figure 4. Evolution of IBM sporting event sites
Sites
can benefit from content caches, as well as the
trigger monitor. A content cache can build certain
types of personalized pages from fragments stored
in cache. For example, for the 2000 Olympics Web
site, advertisements need to be based on the country
of origin of the client. This could be done based
on the IP address of the client and advertising
fragments for each country. More generally, this
could be done by partitioning clients into groups,
with each group being served pages for a specific
URL, where some of the page fragments are personalized
based on the client group. The client group could
be identified in various ways, for example, by
source IP address, URL extensions identifying
the client or group, or cookies. Then, based on
the client group, the content cache combines specific
page fragments, sometimes called tagged content,
to compose the personalized page. A tagged content
design facilitates managing and reusing content
fragments (see Resource
2). The level of personalization possible with
these intelligent content distribution techniques
covers a significant subset of personalization
requirements. However, the techniques at the cache
are still more limited than personalization achievable
at the server because available information about
the client is limited, and performance requirements
limit the degree of personalization.
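Composing a page from tagged fragments can be sketched as follows. The fragment names, the group identifiers, and the fallback order are hypothetical; the point is only that most fragments are shared and a small number are selected by client group.

```python
# Cached fragments keyed by (fragment name, client group). A group of None
# marks a fragment shared by every visitor. All names are illustrative.
FRAGMENT_CACHE = {
    ("header", None): "<h1>Results</h1>",
    ("scores", None): "<p>Latest scores</p>",
    ("ad", "us"): "<p>Ad for US visitors</p>",
    ("ad", "de"): "<p>Ad for German visitors</p>",
    ("ad", "default"): "<p>Generic ad</p>",
}

PAGE_LAYOUT = ["header", "scores", "ad"]


def compose_page(client_group: str) -> str:
    """Assemble a page from shared fragments plus group-tagged ones."""
    parts = []
    for name in PAGE_LAYOUT:
        # Prefer the group-specific fragment, then a shared one, then a default.
        fragment = (
            FRAGMENT_CACHE.get((name, client_group))
            or FRAGMENT_CACHE.get((name, None))
            or FRAGMENT_CACHE.get((name, "default"))
        )
        parts.append(fragment)
    return "\n".join(parts)
```

For example, `compose_page("de")` returns the shared header and scores followed by the German advertisement, while an unrecognized group receives the generic advertisement.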
The
trigger monitor keeps track of the scores and
statistics that arrive rapidly during sporting
events. As the database updates arrive, the trigger
monitor keeps track of changes, rebuilds the affected
Web pages, and distributes updated pages to the
caches, assuring they are kept current and the
personalized Web sites are updated as well.
You
can reduce the overhead of personalization by
reducing the degree of personalization. For example,
instead of creating pages specialized for each
individual client, you could create sets of pages
specialized (tagged) to groups of visitors. This
could significantly reduce the total number of
pages and allow reuse of some pages, thus increasing
the utility of caching. This reduced level of
personalization can be provided at a content cache.
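The effect on the cache can be seen by comparing cache keys. In this sketch, which assumes a made-up population of 1,000 visitors split into two groups, keying pages by group rather than by visitor shrinks the number of distinct pages from 1,000 to 2.

```python
def cache_key(url: str, visitor: dict, per_group: bool) -> tuple:
    """Build the key under which a rendered page is cached."""
    if per_group:
        return (url, visitor["group"])   # one page per visitor group
    return (url, visitor["id"])          # one page per individual visitor


# Hypothetical visitor population: 1,000 visitors in two account groups.
visitors = [
    {"id": f"v{i}", "group": "gold" if i % 2 else "silver"}
    for i in range(1000)
]

per_visitor_pages = {cache_key("/home", v, per_group=False) for v in visitors}
per_group_pages = {cache_key("/home", v, per_group=True) for v in visitors}
```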
You
can also vary the degree of personalization based
on server load. When servers are heavily loaded,
the amount of personalization could be minimized.
For example, for personalized advertisements,
when the server is highly loaded, random advertisements
can be included in the page, while at lower loads
the advertisements can be highly targeted. You
could combine this technique with content caches,
where pages with a lower degree of personalization
could be served from the cache when the server
is highly loaded, while deep personalization could
be done at the server when server load permits.
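A load-sensitive choice of advertisement might look like the sketch below. The load measure (a fraction between 0.0 and 1.0), the 0.8 threshold, and the advertisement names are all assumptions chosen for illustration.

```python
import random


def choose_ad(profile: dict, server_load: float, ads: dict,
              threshold: float = 0.8, rng=random) -> str:
    """Pick an ad: targeted when load permits, random when the server is busy.

    server_load is a fraction from 0.0 to 1.0; the 0.8 threshold is an
    arbitrary value chosen for illustration.
    """
    if server_load < threshold:
        targeted = ads.get(profile.get("interest"))
        if targeted is not None:
            return targeted
    # Heavily loaded, or no matching ad: skip targeting and pick at random.
    return rng.choice(sorted(ads.values()))


ads = {"golf": "golf-clubs-ad", "travel": "cruise-ad"}
quiet = choose_ad({"interest": "golf"}, server_load=0.2, ads=ads)
busy = choose_ad({"interest": "golf"}, server_load=0.95, ads=ads)
```

Under light load the golf enthusiast sees the golf advertisement; under heavy load the visitor sees whichever advertisement the random choice returns.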
Personalized
Web pages can be assembled at the client if it
is enabled with Java. Some sites even provide
Java to the client to optimize personalization
and performance.
Among
the multiple recommendation engines, one uses
a new content-based collaborative filtering approach,
in which object content is taken into account when performing
collaborative filtering. This technique achieves
the advantages of both content-based and collaborative
filtering approaches. The content-based collaborative
filtering technique is applicable to both product
and document recommendations.
Because
collecting visitor information can be an expensive
effort and also affect the performance, you should
be able to measure its effectiveness. The issue
is not only what to recommend, but also when and
how. The personalized recommendation engines deal
with the issue of what to recommend given a set
of alternatives, but a more sophisticated application
would decide when to invoke the recommendation
engine and how to apply it, for example, whether
to send the customer an e-mail or e-coupon, or
add a Web link on the personalized Web page.
Personalizing your site based on site classification
IBM's IT experts have been working with customers
to analyze many of the world's largest Internet
and intranet sites, including IBM's own, to determine
which attributes affect scalability and to help
customers implement scalable Web sites. IBM has
determined that:
- Large
sites are distinguished primarily by workload
pattern
- Based
on workload patterns, Web sites can generally
be classified into five types: publish/subscribe,
online shopping, customer self-service,
trading, and business-to-business
- Scaling
techniques must be selected and applied based
on workload pattern
If
you're unfamiliar with the Web site classifications,
refer to Appendix B: Summary
of high-volume Web site classifications.
In
the same way that a workload pattern suggests
appropriate scaling techniques, it can also suggest
the most effective personalization techniques.
While it's possible to implement any or all of
the techniques at each site type, some techniques
require significant effort and may degrade performance;
you may or may not need that level of investment.
For
each type of Web site, Figure
5 shows the personalization techniques that
would be most effective. Note, for example, that
rule-based techniques apply to all site types
except publish/subscribe, while all techniques
apply to the self-service and business-to-business
sites. After you determine which type of site
you have, use this table to identify the personalization
techniques you should consider. Note that at least
one effective, relatively simple technique is
suggested for each type. From another perspective,
consider Amazon.com, one of the most successful
and "smartest" online shopping sites (see Resource
2). Given the volume and attributes of Amazon's
objects, content-based filtering would require
excessive effort and so would not be considered
effective.
Figure 5. Personalization techniques mapped to workload patterns
| Technique / Site type | Publish/subscribe | Online shopping | Self-service | Trading | Business-to-business |
|---|---|---|---|---|---|
| Rule based | | X | X | X | X |
| Simple filtering | X | X | X | X | X |
| Content-based filtering | X | | X | X | X |
| Collaborative filtering | | X | X | | X |
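For readers who want to use the Figure 5 mapping programmatically, it can be written as a lookup table. The dictionary below transcribes the X marks in Figure 5; the variable and function names are our own.

```python
# Transcription of Figure 5: site type -> techniques marked with an X.
TECHNIQUES_BY_SITE_TYPE = {
    "publish/subscribe": {"simple filtering", "content-based filtering"},
    "online shopping": {"rule-based", "simple filtering",
                        "collaborative filtering"},
    "self-service": {"rule-based", "simple filtering",
                     "content-based filtering", "collaborative filtering"},
    "trading": {"rule-based", "simple filtering", "content-based filtering"},
    "business-to-business": {"rule-based", "simple filtering",
                             "content-based filtering",
                             "collaborative filtering"},
}


def suggested_techniques(site_type: str) -> list:
    """Return the techniques Figure 5 marks for a site type, sorted by name."""
    return sorted(TECHNIQUES_BY_SITE_TYPE[site_type])
```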
Summary
Quite simply, personalization has become a required,
expected feature of an e-business Web site. The
presence and quality of site personalization determines
whether visitors find your site attractive and
return to it with an intention to buy. The real
question is not whether to personalize, but how
and how much, and how to implement personalization
while maximizing performance, which can be as
important as the business effectiveness of the
techniques you choose. In this paper, you've learned
about current personalization techniques and the
significance of intelligent content distribution
and other techniques to maximize site performance.
During site design, be sure to consider your workload
pattern and to insist that your personalization
and caching strategies be considered early and
in relation to each other.
IBM
has products and services that can help you get
started today and position your site for enhancements
as your business rules and requirements change
and additional personalization techniques are
developed.
Appendix A: More on filtering techniques
Content-based filtering
Content-based filtering works by analyzing the
content of the objects to form a representation
of the visitor's interests. Generally, the analysis
needs to identify a set of key attributes for
each object and then fill in the attribute values.
Recommending
video purchases is an example of content-based
filtering. The example below uses seven attributes
to analyze video content: action, drama, sex,
violence, suspense, humor, and offbeat. Each rating
ranges from 0 to 10, indicating the intensity. For
example, a violence rating of 10 means extreme
violence and 0 means no violence.
Figure 6. An example of content-based filtering
| Video / Attribute | Action | Drama | Humor | Sex | Violence | Suspense | Offbeat |
|---|---|---|---|---|---|---|---|
| (A) Silence of the Lambs | | 7 | 3 | 1 | 9 | 10 | |
| (B) Seven | 5 | 5 | 1 | 2 | 10 | 9 | 5 |
| (C) Cape Fear | 5 | 7 | 4 | 5 | 9 | 9 | 3 |
| (D) Casablanca | 2 | 10 | 5 | 0 | 1 | 8 | |
| (E) Waterboy | 4 | 2 | 6 | 3 | 4 | 3 | 1 |
| (F) L.A. Confidential | 8 | 9 | 6 | 6 | 9 | 9 | 6 |
| (G) West Side Story | 3 | 5 | 4 | 0 | 1 | 3 | 1 |
Using
a concept known as "Euclidean distance" or nearest
neighbor, content-based filtering analyzes the
ratings to determine, for any one video, which
other video has the closest ratings and could
be recommended to a visitor who ordered the first
video. For example, Silence of the Lambs
could be found to come closest in content to Seven,
in which case Seven could be a candidate
to recommend to customers interested in Silence
of the Lambs.
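The nearest-neighbor calculation can be sketched with the ratings from Figure 6. One assumption is needed: some cells in the table are blank, so this sketch compares two videos only on the attributes that are rated for both. Other treatments of blank cells are possible and can change the result.

```python
import math

# Attribute order: action, drama, humor, sex, violence, suspense, offbeat.
# Ratings from Figure 6; None marks a cell that is blank in the table.
VIDEOS = {
    "A": [None, 7, 3, 1, 9, 10, None],  # Silence of the Lambs
    "B": [5, 5, 1, 2, 10, 9, 5],        # Seven
    "C": [5, 7, 4, 5, 9, 9, 3],         # Cape Fear
    "D": [2, 10, 5, 0, 1, 8, None],     # Casablanca
    "E": [4, 2, 6, 3, 4, 3, 1],         # Waterboy
    "F": [8, 9, 6, 6, 9, 9, 6],         # L.A. Confidential
    "G": [3, 5, 4, 0, 1, 3, 1],         # West Side Story
}


def distance(u, v):
    """Euclidean distance over the attributes rated for both videos."""
    shared = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))


def nearest(video_id):
    """Return the other video whose ratings are closest to video_id's."""
    others = (k for k in VIDEOS if k != video_id)
    return min(others, key=lambda k: distance(VIDEOS[video_id], VIDEOS[k]))
```

Under this assumption, `nearest("A")` returns `"B"`, so Seven is the candidate to recommend to customers interested in Silence of the Lambs.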
Collaborative filtering
Collaborative filtering collects visitors' opinions
on a set of items, using either explicit or implicit
ratings, to form like-minded peer groups and then
learns from the peer groups to predict a particular
visitor's interest in an item. Instead of finding
objects similar to those a visitor liked in the
past, as in content-based filtering, collaborative
filtering develops recommendations by finding
visitors with similar tastes.
Below
is an example of collaborative filtering. Assume
each person can rate a video from 1 to 7, where
7 means strongly like, 4 is neutral, and 1 means
strongly dislike. Videos A through G represent
the seven videos shown in the previous table.
Figure 7. An example of collaborative filtering
| Video / Visitor | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| Adam | 7 | | 6 | 2 | | 2 | |
| Bill | 7 | | | 1 | | 2 | 5 |
| Jennifer | 4 | 2 | | | | | 2 |
| John | 6 | 2 | 7 | | 7 | | |
| Mary | 2 | 7 | | 7 | | | |
| Rose | 1 | 7 | | | | 6 | |
| Susan | 2 | 6 | | 7 | | | 6 |
For
ease of illustration, we again use a nearest neighbor
measure of closeness. When measuring the distance
between two persons, only videos both have rated
are considered. For example, when considering
the distance between Adam and John, only the ratings
on videos A and C are considered. Adam's close
peers are Bill and John. For Adam, we can recommend
video E based on John's liking. The point to note
here is that the content of video E, Waterboy,
can be quite different from the content of videos
rated highly by Adam. Although its content is similar
to that of videos A and C, video B, Seven, will
not be recommended to Adam, because his peer,
John, does not like it. However, content-based
collaborative filtering will recommend video B
to Adam, based solely on the fact that its content
is similar to videos A and C, which are liked
by Adam.
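The peer-group reasoning above can be reproduced with the ratings from Figure 7. This is a minimal sketch: the peer-group size of two and the rule that a rating of 6 or higher counts as "liked" are assumptions made for the example, not part of the technique's definition.

```python
import math

# Ratings from Figure 7 (1 = strongly dislike, 4 = neutral, 7 = strongly like).
RATINGS = {
    "Adam": {"A": 7, "C": 6, "D": 2, "F": 2},
    "Bill": {"A": 7, "D": 1, "F": 2, "G": 5},
    "Jennifer": {"A": 4, "B": 2, "G": 2},
    "John": {"A": 6, "B": 2, "C": 7, "E": 7},
    "Mary": {"A": 2, "B": 7, "D": 7},
    "Rose": {"A": 1, "B": 7, "F": 6},
    "Susan": {"A": 2, "B": 6, "D": 7, "G": 6},
}


def distance(p, q):
    """Euclidean distance using only the videos both people have rated."""
    shared = RATINGS[p].keys() & RATINGS[q].keys()
    if not shared:
        return math.inf
    return math.sqrt(sum((RATINGS[p][v] - RATINGS[q][v]) ** 2 for v in shared))


def peers(person, k=2):
    """Return the k people closest to person (k=2 is an assumption)."""
    others = [p for p in RATINGS if p != person]
    return sorted(others, key=lambda p: distance(person, p))[:k]


def recommend(person, k=2, like=6):
    """Recommend unseen videos that a peer rated at or above the like level."""
    seen = RATINGS[person]
    picks = set()
    for peer in peers(person, k):
        for video, rating in RATINGS[peer].items():
            if video not in seen and rating >= like:
                picks.add(video)
    return sorted(picks)
```

With these settings, `peers("Adam")` returns Bill and John, and `recommend("Adam")` returns only video E; video B is excluded because John rated it 2.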
Comparing the techniques
Rule-based techniques and simple filtering offer
significant personalization capabilities for a
relatively smaller investment of effort than content-based
and collaborative filtering require.
Content-based
filtering is most suitable when the objects are
easily analyzed by computer and the visitor's
decision about object suitability is not subjective.
For some objects, such as the videos, analyzing
content cannot be automated today, and the effort
to identify attributes and evaluate each object
can be considerable and require specific knowledge
or skills. Recommendations are limited to objects
related to those the visitor has tried, with no
provision for visitor qualification. Relating
these limitations to the video example, Adam cannot
get a recommendation on a video that belongs to
a group or category he has never rated or tried.
If video B is closely related to video A in
content, it will always be recommended to visitors
interested in video A. Whether anyone interested
in video A actually finds video B worth viewing
is not factored into the consideration for making
the recommendation.
Even
in the case of recommending documents or Web pages,
which is most amenable to automation, content-based
filtering remains an active area of research because
of inherent redundancies and ambiguities in textual
descriptions. The basic approach is to treat each
document as a weighted vector of keywords and
partition documents into clusters. From the documents
of interest to each visitor, one can derive the
relevant keywords or document clusters of interest
to a visitor.
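A very small sketch of the keyword-vector idea follows. It weights keywords by raw term frequency and compares vectors with cosine similarity; both choices are assumptions made for illustration, since the text does not prescribe a weighting or a similarity measure, and the clustering step is omitted.

```python
import math
from collections import Counter


def keyword_vector(text: str) -> Counter:
    """Weight each keyword by raw term frequency (the simplest weighting)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two keyword vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def interest_profile(documents_read: list) -> Counter:
    """Derive a visitor's keywords of interest from the documents they read."""
    profile = Counter()
    for doc in documents_read:
        profile.update(keyword_vector(doc))
    return profile


visitor = interest_profile(["caching web pages", "web server caching"])
related = cosine(visitor, keyword_vector("caching proxy web"))
unrelated = cosine(visitor, keyword_vector("tennis scores"))
```

A candidate document that shares keywords with the visitor's reading history scores higher than one that shares none.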
Basic
collaborative filtering addresses some of the
shortcomings of pure content-based filtering.
Recommendations produced by collaborative filtering
are qualified based on the peer group's response
and are not restricted to a simple profile matching.
Let's go back to the video example. From content-based
analysis, video F (L.A. Confidential) is
close to video C (Cape Fear). Although
John likes video C, video F will not be recommended
to him because his peers (Adam and Bill) dislike
video F. However, collaborative filtering requires
a large customer base in order to find a peer
group for each visitor. This also might imply
a long learning curve, because in the beginning
when the number of participating visitors is small,
the quality of the recommendations will be low.
The results improve gradually as the number of
participating visitors increases. In any case,
there always may be someone with such a unique
taste that no other person will show similar behavior.
Collaborative
filtering requires visitors to rate objects, introducing
the biases of different visitors. Certain people
tend to give ratings on the extreme ends of the
scale, and others tend to give ratings around the
middle. This can make the formation of a peer
group difficult.
For product recommendations, collaborative filtering
is most suitable for homogeneous, simple products,
such as books, CDs, or videos. This is because
a peer group is determined using some type of
nearest neighbor algorithm, where each person
is represented by a vector of ratings. The more
objects two visitors have rated similarly, the
closer the two visitors are. This implicitly assumes
a homogeneous environment. In a computer store
such as CompUSA, there are objects of vastly different
characteristics and prices, from cables to memory
chips to software CDs to PCs. To form peer groups
and make recommendations for such a store,
collaborative filtering would need to account not
only for how many objects are of common interest,
but also for which types of objects they are.
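The nearest neighbor idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any specific product: the visitor names, the ratings, the one-point agreement tolerance, the peer group size, and the recommendation threshold are all hypothetical. Closeness is the count of objects that two visitors rated similarly, and an object is recommended only if the peer group rated it highly on average.

```python
# Hypothetical ratings on a 1-5 scale: visitor -> {object: rating}.
RATINGS = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "ben": {"A": 5, "B": 5, "C": 1, "D": 5, "E": 1},
    "cat": {"A": 4, "B": 4, "C": 2, "D": 4, "E": 2},
    "dan": {"A": 1, "B": 1, "C": 5, "E": 5},
}

def closeness(a, b, tolerance=1):
    """Count the objects both visitors rated within `tolerance` points of each other."""
    return sum(1 for obj in a.keys() & b.keys() if abs(a[obj] - b[obj]) <= tolerance)

def peer_group(ratings, visitor, k=2):
    """Return the k visitors closest to `visitor` (ties broken by name)."""
    others = [name for name in ratings if name != visitor]
    others.sort(key=lambda name: (-closeness(ratings[visitor], ratings[name]), name))
    return others[:k]

def recommend(ratings, visitor, k=2, min_avg=4.0):
    """Recommend unseen objects whose average rating among the peers is at least `min_avg`."""
    seen = ratings[visitor]
    candidates = {}
    for peer in peer_group(ratings, visitor, k):
        for obj, score in ratings[peer].items():
            if obj not in seen:
                candidates.setdefault(obj, []).append(score)
    return sorted(
        obj for obj, scores in candidates.items()
        if sum(scores) / len(scores) >= min_avg
    )

peers = peer_group(RATINGS, "ann")
picks = recommend(RATINGS, "ann")
```

Here the peers of `ann` are `ben` and `cat`, so object D is recommended and object E is not, because the peers dislike E. This mirrors the video example, in which video F is withheld from John because his peers dislike it. Note that `closeness` treats every object alike, which is the homogeneity assumption discussed above; a store with very different product types would need to weight or group objects by type.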
Appendix B: Summary of high-volume Web site classifications
Publish/subscribe Web sites provide visitors
with information. Some examples include search
engines, media sites such as newspapers and magazines,
and event sites such as those for the Olympics
and for the tennis championships at Wimbledon.
Site content changes frequently, driving changes
to page layouts. While search traffic is low volume,
the number of unique items sought is high, resulting
in the largest number of page views of all site
types. As an example, the Wimbledon site successfully
handled a peak volume of 430,000 hits per minute
using IBM WebSphere Performance Pack. Security
considerations are minor compared to other site
types. Data volatility is low. This site type
processes the fewest transactions and has little
or no connection to any legacy systems.
Online shopping sites let visitors browse and buy.
Examples are typical retail sites where visitors
buy books, clothes, and even cars. Site content
can be relatively static, such as a parts catalog,
or dynamic, where items are frequently added and
deleted, for example, as promotions and special
discounts come and go. Search traffic is heavier
than on publish/subscribe sites, though the number
of unique items sought is not as large. Data volatility
is low. Transaction traffic is moderate to high,
and almost always grows. Typical daily volumes
for many large retail customers running on IBM
Net.Commerce range from less than one million
hits per day to over 3 million hits per day, with
transaction volumes from 100,000 per day up to
700,000 per day in the top range. Of the total
transactions, typically between 1% and 5% are
buy transactions. When visitors buy,
security requirements become significant and include
privacy, nonrepudiation, integrity, authentication,
and regulations. Shopping sites have more connections
to legacy systems, such as fulfillment systems,
than publish/subscribe sites, but generally
fewer than the other site types.
Customer self-service sites let visitors help themselves.
Sample sites include banking from home, tracking
packages, and making travel arrangements. Data
comes largely from legacy applications and often
comes from multiple sources, thereby exposing
data consistency problems. Security considerations are
significant for home banking and purchasing travel
services, less so for other uses. Search traffic
is low volume; transaction traffic is low to moderate,
but growing.
Trading sites let visitors buy and sell. Of all site
types, trading sites have the most volatile content,
the highest transaction volumes (with significant
swings), the most complex transactions, and the
most time sensitivity. Products like IBM's CICS
high-volume transaction processing system play
a key role at these sites. Trading sites are tightly
connected to legacy systems, for example,
using IBM MQSeries for connectivity. Nearly all
transactions interact with the back-end servers.
Security considerations are high, equivalent to
online shopping, with an even larger number of
secure pages. Search traffic is low volume.
Business-to-business sites let businesses buy from
and sell to each other. Many businesses are implementing a
Web site for their purchasing applications. Such
purchasing activity may also be characteristic
of other site types, such as publish/subscribe
sites and self-service sites. Data comes largely
from legacy applications and often comes from
multiple sources, thereby exposing data consistency
problems. Security requirements are equivalent to online
shopping. Transaction volume is low to moderate,
but growing; transactions are typically complex,
connecting multiple suppliers and distributors.
Contact for more information
For more information, contact Willy Chiu, Director,
High Volume Web Sites, at wchiu@us.ibm.com.
Acknowledgments
The High-Volume Web site team is grateful to the
major contributors to this article: Charu Aggarwal,
Jim Challenger, Daniel Dias, Paul Dantzig, Arun
Iyengar, Carol Jones, Doug Riecken, Jacob P. Ukelson,
Robert Will, Joel Wolf, Kun-lung Wu, and Philip
S. Yu.
Resources
- Forrester Research, Inc., "Smart Personalization," July 1999.
- "The Intelligent Recommendation Analyzer," by C. Aggarwal, J.L. Wolf, K-L. Wu, and P.S. Yu, to appear in Proceedings of ICDCS Workshop on Knowledge Discovery and Data Mining, April 2000.
- "On Text Mining Techniques for Personalization," by C. Aggarwal and P.S. Yu, in New Directions in Rough Sets, Data Mining, and Granular-Soft Computing (Lecture Notes in Artificial Intelligence 1711), ed. by A. Skowron and N. Zhong, Springer Verlag, 1999.
- "Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering," by C. Aggarwal, J.L. Wolf, K-L. Wu, and P.S. Yu, Proc. 1999 ACM SIGKDD Conference, San Diego, CA, Aug. 1999, pp. 201-212.
- IBM High-Volume Web Sites, "Design for Scalability," December 1999.
- "On the Merits of Building Categorization Systems by Supervised Clustering," by C. Aggarwal, S. Gates, and P.S. Yu, Proc. 1999 ACM SIGKDD Conference, San Diego, CA, Aug. 1999, pp. 352-356.
- "A Scalable System for Consistently Caching Dynamic Web Data," by Jim Challenger, Arun Iyengar, and Paul Dantzig, in Proceedings of IEEE INFOCOM'99, New York, New York, March 1999.
- "A Publishing System for Efficiently Creating Dynamic Web Content," by Jim Challenger, Arun Iyengar, Karen Witting, Cameron Ferstat, and Paul Reed. Available in two versions: to appear in Proceedings of INFOCOM 2000, March 2000 (latest version), and IBM Research Report RC 21546(97091), July 1999 (an earlier version).
About the author
Willy Chiu is Vice-President, High Volume Web Sites, Software Group (AIM Division).