Web site user modeling with PHP

Paul Meagher (paul@datavore.com), CEO, Datavore Productions
Paul Meagher is a freelance Web developer, writer, and data analyst. Paul has a graduate degree in Cognitive Science and has spent the last six years developing Web applications. His current projects and interests center around e-learning, content management, and math-enabled Web applications. Paul resides in Truro, Nova Scotia.

Summary:  Web site user modeling, a mathematical discipline, is easier than you might expect. In this tutorial, Paul Meagher shows you how to construct a user-modeling platform with PHP and MySQL -- technologies well suited for a species of user-modeling called Web site user modeling. Even small Web-development shops can use clickstream data to build Web site user models.

Date:  30 Dec 2003
Level:  Introductory


Wrap up

Data issues

When Bucklin and Sismeiro prepared their Apache access logs for analysis, they removed all visits involving a single page request. They defined a site visit as including two or more page views. Their rationale was that they were interested in understanding browsing behavior and that such data was not informative on that score. They also pointed out that "If we were to retain visits with single-page requests, they would dominate the entire sample."

The fact that many visitors provide only a single data point should be kept in mind when you think about the results you might observe under real-world conditions. Only a small subset of visitors is likely to provide enough transition data for you to begin predicting the next pages they will view.

Another data issue that Bucklin and Sismeiro needed to address was an operational definition of a site visit. In particular, they asked how you should count sessions when a person is idle on your site for 50 minutes, then comes back and starts surfing again. They decided to count this as a new session because a 30-minute period of inactivity had elapsed. Other researchers have also used this 30-minute inactivity criterion to decide when a new session should be declared.

My access logging code does not currently start a new session when a visitor's last access was more than 30 minutes ago. One solution would be to record the time stamp of the last access with the transition matrix. When you retrieve the visitor's transition matrix at the start of each page, you could check the time stamp of the last access and, if more than 30 minutes have elapsed, create a new session ID for the visitor.
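The time stamp check described above might look roughly like the following. This is only a minimal sketch: the function name `resolve_session_id`, the `SESSION_TIMEOUT` constant, and the idea of passing the stored time stamp in as a Unix time value are all assumptions made for illustration, not part of the tutorial's logging code.

```php
<?php
// Hypothetical constant: the 30-minute inactivity criterion, in seconds.
const SESSION_TIMEOUT = 1800;

/**
 * Decide whether a page request continues the current session or starts a new one.
 * $lastAccess is the Unix time stamp stored with the visitor's transition matrix
 * (null for a first visit); $now is the time of the current request.
 */
function resolve_session_id(?string $sessionId, ?int $lastAccess, int $now): string
{
    $expired = $lastAccess === null || ($now - $lastAccess) > SESSION_TIMEOUT;
    if ($sessionId === null || $expired) {
        // Start a new session; this yields a 32-character hexadecimal ID.
        return bin2hex(random_bytes(16));
    }
    return $sessionId;
}

// A visitor who was idle for 50 minutes gets a new session ID.
$now = time();
$sessionId = resolve_session_id('abc', $now - 50 * 60, $now);
```

Passing `$now` in as a parameter, rather than calling `time()` inside the function, makes the rule easy to check against recorded access times.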

Bucklin and Sismeiro also needed to decide what method they would use to track users across sessions: cookies, IPs, or member IDs? They decided to remove records from their analysis that did not have a cookie identifier. They retained about 90 percent of their records after doing this. Basing user tracking on cookie identifiers is probably one of the most reliable ways to identify users across sessions; however, it has limitations (for example, users can delete their cookies). As indicated previously, I believe that the most reliable method to track users across sessions is probably a redundant approach that uses both cookie and IP data, with member ID data as a third backup if you have it.


The individual and the collective

It would be possible to use transition data to update both an individual and a collective transition matrix.

The collective transition matrix would aggregate the transition frequencies for all users of the site. Each user would contribute new data to the collective transition matrix every time they clicked from one page to another. It might also be possible to derive the collective transition matrix as the sum of the individual frequency transition matrices.
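Deriving the collective matrix as a sum of individual matrices could be sketched as follows. The nested-array layout (`$matrix[$fromPage][$toPage] = count`), the function name `sum_transition_matrices`, and the visitor data are assumptions chosen for illustration.

```php
<?php
/**
 * Sum a list of frequency transition matrices element by element.
 * Each matrix is assumed to be a nested array: $matrix[$fromPage][$toPage] = count.
 */
function sum_transition_matrices(array $matrices): array
{
    $collective = [];
    foreach ($matrices as $matrix) {
        foreach ($matrix as $from => $row) {
            foreach ($row as $to => $count) {
                $collective[$from][$to] = ($collective[$from][$to] ?? 0) + $count;
            }
        }
    }
    return $collective;
}

// Two made-up visitors on a made-up three-page site.
$alice = ['home' => ['products' => 3, 'about' => 1]];
$bob   = ['home' => ['products' => 1], 'products' => ['home' => 2]];

$collective = sum_transition_matrices([$alice, $bob]);
// $collective['home']['products'] is 4, $collective['home']['about'] is 1,
// and $collective['products']['home'] is 2.
```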

What would be the benefit of doing this? From a predictive point of view, having a collective transition matrix might provide additional information that your prediction algorithms can use to forecast what page a user is likely to visit next. The observed clickstream of a user can be regarded as a function of both Web site variables and user variables. The collective transition matrix might be a useful way to represent the influence of Web site variables.

The collective transition matrix represents the averaging out of all idiosyncratic tendencies of users. Because the collective transition matrix doesn't represent how any individual in particular uses the site, it might be regarded as estimating the general flowpath that the site tends to produce.

Predicting the next page a visitor will view might be improved if you combine information about the idiosyncratic tendencies of the user (measured by their individual transition matrix) and the general flowpath the site tends to produce (measured by the collective transition matrix). The relative weight of these factors as well as their method of combination are matters for future research.
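One simple way to combine the two sources, offered purely as an illustration, is a weighted average of the row probabilities for the current page. The linear blend, the `$weight` parameter, and the function names below are assumptions; as noted above, the right weighting and method of combination remain open questions.

```php
<?php
/** Convert one row of transition counts into probabilities that sum to 1. */
function normalize_row(array $counts): array
{
    $total = array_sum($counts);
    if ($total == 0) {
        return [];
    }
    return array_map(fn($c) => $c / $total, $counts);
}

/**
 * Blend individual and collective next-page probabilities for the current page.
 * $weight is the share given to the individual row (0.0 to 1.0); it is an
 * assumed tuning parameter, not a value taken from the tutorial.
 */
function blend_rows(array $individualCounts, array $collectiveCounts, float $weight): array
{
    $ind = normalize_row($individualCounts);
    $col = normalize_row($collectiveCounts);
    // Fall back entirely on whichever source has data when the other is empty.
    if ($ind === []) { $weight = 0.0; }
    if ($col === []) { $weight = 1.0; }

    $blended = [];
    foreach (array_keys($ind + $col) as $page) {
        $blended[$page] = $weight * ($ind[$page] ?? 0.0)
                        + (1 - $weight) * ($col[$page] ?? 0.0);
    }
    arsort($blended); // most likely next page first
    return $blended;
}

// Made-up counts for transitions out of the current page.
$individual = ['products' => 1, 'about' => 3];
$collective = ['products' => 8, 'about' => 2];
$prediction = blend_rows($individual, $collective, 0.5);
// With equal weights, 'products' (about 0.525) ranks ahead of 'about' (about 0.475).
```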


Intelligent flowpath management

My feeling is that Markov theory might be used, in various ways, for intelligent flowpath management purposes. In chapter 4 of John Lenker's book, Train of Thought, he defines Intelligent Flowpath Management Systems as follows:

Because Web enterprises seem to struggle so much to provide information in a "structure" that's effective for people, my assertion is that "architecture", both as metaphor and as a process, has outlived its usefulness. The time has come to enlist a more powerful vision that reaches farther. Instead of information architecture, I believe we need to begin developing Intelligent Flowpath Management Systems that procedurally analyze the needs of individuals and then release appropriate sequences of notions. These sequences must be programmed to guide people's minds naturally and in a manner that helps them to build anticipation of that which will succeed in fulfilling their expectations.

Web site user transition matrices offer a way to procedurally analyze the needs of individual Web site users. The transition matrix of an individual person might be interpreted as a measure of the expected value they assign to each next-page state. If you have enough interaction data for a visitor, you can use a cookie to recover that visitor's user model and deliver "creative" for the home page that uses the transition matrix to anticipate which modular content will fulfill the visitor's next-page expectations. The home page would be dynamically generated so that it conforms to the user's value expectancies about where to go next. In this way, you might increase the stickiness of your site, which has been shown to increase the likelihood of buying products or services.
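As a rough illustration of how a dynamically generated home page might choose what to feature, the sketch below picks the pages a visitor has most often moved to from the home page. The function name `top_next_pages`, the matrix layout (`$matrix[$from][$to] = count`), and the page names are hypothetical.

```php
<?php
/**
 * Return up to $limit pages the visitor most often moved to from $currentPage,
 * most frequent first. The layout $matrix[$from][$to] = count is assumed.
 */
function top_next_pages(array $matrix, string $currentPage, int $limit): array
{
    $row = $matrix[$currentPage] ?? [];
    arsort($row); // highest transition count first
    return array_keys(array_slice($row, 0, $limit, true));
}

// Made-up transition counts recovered for one returning visitor.
$visitor = ['home' => ['news' => 2, 'products' => 7, 'about' => 1, 'contact' => 4]];
$featured = top_next_pages($visitor, 'home', 2);
// $featured is ['products', 'contact']
```

The home page template could then place content modules for the `$featured` pages most prominently.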

If you are an advertising-driven site, then stickiness is also critically important. Eytan Adar and Bernardo A. Huberman expound further upon the economic importance of stickiness and the need for versioned information services in "The Economics of Surfing."

From the marketing side, James Hering discusses a similar notion in his recent discussion of behavioral targeting:

...most people think behavioral targeting involves tracking user behavior, then dynamically leveraging the knowledge to serve relevant messaging. I think of it from the customer perspective; how can I reward a potential or current customer with more relevant information, as indicated by their implied or expressed behavior? ... We use expressed behavior to trigger the ad server to queue your creative on the next page. Think of it as "sequentially relevant ad serving".

One correction is that you can in fact use expressed behavior to trigger the ad server for "this page" (instead of just the "next page") because predictive information about where a user will go next is available (in the header.php script) before a page is ever generated.

Markov process theory provides a framework for understanding how intelligent flowpath management and sequentially relevant ad serving might be implemented in practical terms. I see individual and collective transition matrices as providing the individual and collaborative filters needed to implement an intelligent flowpath management system.


Other applications

Markov process theory has many other areas of application besides Web site user modeling. I would be remiss not to mention a small sampling of these application areas:

  • Bioinformatics. In modern terms, Gregor Mendel's lasting legacy is the idea that you can use a transition matrix to model the transmission of features from parents to offspring. Indeed, Markov process theory and bioinformatics both have as their central concern the understanding of dependent sequences. Markov process theory offers a framework for organizing and thinking about patterns in sequential data, which are the hallmark of much biological data. Those interested in the emerging area of bioinformatics would do well to study Markov process theory further.
  • Queuing problems. A queue is a waiting line. Queues are ubiquitous not only in business contexts, but also in work-scheduling or transportation problems, for example. These problems share the feature that what happens to the queue at one moment in time determines the state of the queue at the next moment in time. Queuing theory relies extensively upon simulation to model the many factors that might affect queue dynamics. Markov process theory has been extensively applied to the simulation of queue dynamics. Indeed, simulation books in general often devote considerable space to Markov process models because they are a primary tool that simulation-based analysts use to gain insight into the temporal structure of the systems they study.
  • Human behavior. Markov process theory has been used to understand the dynamics of turn taking in meetings, the behavior of rats in mazes, the allocation of attention in the processing of faces, and other behaviors. You could, for example, replace the names of links in my demo Web site with the names of people participating in a meeting. As each person talks, you could record that fact by clicking on a link corresponding to the name of the person. If a person speaks, do they continue speaking? If they fall silent, do they remain silent? The answer is often yes, indicating that conversational dynamics exhibit the Markov property. The distribution of sounds in a word, words in speech, or words in text can also be approached through Markov process theory. The possible applications of Markov process theory to human behavior are limited only by one's imagination.
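To make the turn-taking example concrete, the sketch below builds a frequency transition matrix from a recorded sequence of speakers. The function name `count_transitions` and the speaker names are made up for illustration; the same function would work for any sequence of states, including page views.

```php
<?php
/**
 * Count transitions between consecutive states in an observed sequence.
 * Returns a nested array: $matrix[$from][$to] = number of times $to followed $from.
 */
function count_transitions(array $sequence): array
{
    $matrix = [];
    for ($i = 1, $n = count($sequence); $i < $n; $i++) {
        $from = $sequence[$i - 1];
        $to   = $sequence[$i];
        $matrix[$from][$to] = ($matrix[$from][$to] ?? 0) + 1;
    }
    return $matrix;
}

// A made-up record of who spoke in each successive time slot of a meeting.
$turns = ['Ann', 'Ann', 'Raj', 'Ann', 'Raj', 'Raj', 'Raj'];
$matrix = count_transitions($turns);
// $matrix['Ann'] is ['Ann' => 1, 'Raj' => 2]
// $matrix['Raj'] is ['Ann' => 1, 'Raj' => 2]
```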

Markov process theory was originated by Andrei Andreyevich Markov (1856-1922) in a 1907 paper called "Extension of the Limit Theorems of Probability Theory to a Sum of Variables Connected in a Chain". Semi-Markov processes, hidden Markov models, and Markov chain Monte Carlo (MCMC) methods are comparatively recent additions to Markov process theory that are being used to extend the power and reach of Markov theory to new domains. These theoretical additions, and the increase in computing power for Markov-based simulations, account for a renaissance of Markov process theory in many domains of inquiry.


Ethical deployment

I haven't talked much about ethical issues that might arise as a result of increased tracking and monitoring of individual Web site users. Some users might be offended at the idea that their individual actions are being monitored in such minute detail. Transition matrices, however, don't provide data that can't already be reconstructed from access log data. Transition matrices don't create new data; they just provide a way of structuring existing data that makes it more readily available for real-time use in determining what content to display.

What most users probably object to more is the possibility that their actions might be linked to their identity. In other words, as long as user tracking is carried out in such a way that the identity of the user is kept anonymous, users are less likely to raise ethical objections to an increased level of user surveillance if some user benefit might be realized.

I see few real ethical objections to anonymous Web site user modeling. I recognize, however, the real urge to associate a visitor cookie or IP with an actual identity so that you can leverage the additional information for usability, marketing, research, or commercial purposes.

If you decide you are going to incorporate identity information into your Web site user models, then in my opinion, you should consider obtaining the permission of your Web site users before doing so. This is easier said than done, as Web site user modeling often works best if it kicks in right away. At the very least, you should provide a document that indicates such tracking is occurring so that the user can elect not to visit the site in the future if this is a concern.

Some maintain that such user tracking is no different than a video camera mounted in a store and requires about as much permission, namely none. Even with this analogy, you can distinguish between the ethics of mounting hidden video cameras versus making them clearly visible to the customer so they know they are being monitored.


The next generation

In this tutorial, I defined what a Web site user model is and used Bucklin and Sismeiro's research to provide you with a concrete example of what a Web site user model might look like. Bucklin and Sismeiro identified two critical components of a Web site user model:

  • A Page-Request Model
  • A Page-View Duration Model

To construct a complete Web site user model, I suggested that a third component was required:

  • A Page-Choice Model

This tutorial was largely dedicated to elaborating upon how one might specify and implement the Page-Choice component of a Web site user model. Towards this end, I discussed and implemented transition matrices and parts of Markov process theory. The Markov process approach appears promising but I have no results to report as yet. I only recently developed the code as proof-of-concept that this real-time approach to Page-Choice modeling might work.

Others have looked into applying Markov process theory to clickstream data; however, it is generally in the context of mining Web logs to estimate Web site user models rather than implementing a real-time system for retrieving and updating Web site user models. This difference is critical in terms of what you can ultimately do with your Web site user models. I suggested that such real-time user models might be used to construct Intelligent Flowpath Management Systems.

Another difference between the current research and other research on Web site user modeling is that this research clearly addresses user behavior at the individual level. Many Web site user models are not really user models in this sense. Instead, they are more accurately described as consensus user models that presume that general parameter estimates apply to the individual. Web site user models based upon collaborative filtering, for example, presume that a user will want to see what other users in the same situation elected to see. This is quite different than tracking individual visitor clickstreams across sessions and using their individual transition matrices to predict what content they might want to see next. In the end, I recommend using both consensus models and individual Web site user models to arrive at useful predictions about what content a user might want to see next.

I hope this tutorial has at least convinced you that clickstream data can be used for purposes other than determining the volume and distribution of visitor traffic on your Web site. It can also be used to estimate Web site user models. In the next generation of Web sites, this latter role will become increasingly important.


Acknowledgements

I would like to thank Raymond Klein for giving me the opportunity to present an earlier version of this research to faculty and students at Dalhousie University.
