Andreas Weigend
Stanford University
Data Mining and Electronic Business
Stat 252 and MS&E 238
Spring 2008

Audio (as mp3):

Class 6


Quinn Slack
Alex Gleitz
Charles Tripp
Jaehyeok Heo
Myunghwan Kim
Bill Whiteley

What is the space recommendation systems live in?

Discovery and recommendations are key elements in online commerce.

Amazon Survey

Amazon makes 20-30% off its recommendation systems. A survey asked visitors what they planned to do on the site that day; it had a 3% response rate. The free-text responses were classified by category, with multiple categories possible per response:

35% - Research
31% - Browse
16% - Intent to Buy!
10% - Complain
9% - Post-buy Activities
7% - Community Features
7% - Price Check

Survey Data Analysis

To get insights, stated user preferences from the survey were compared with user actions:
  • Look at those who bought something
    • Only half of purchasers had the prior intent to buy something. Overall, 20-50% of online retail purchases are not planned. People discover interesting things or get recommendations at the e-commerce sites.
  • Look at those who said they wanted to buy something
    • Only one third of visitors with intent to buy ended up buying something -> Why?
      • Evaluating user actions on the website revealed that product search was broken

What other problems have the same underlying structure as recommender systems?

All manner of tasks, people or events can be couched as recommendation problems.

Recommendation history

Generally, items do not have any persistent history: a book on Amazon would be shown the same way to two different people. An individual, however, does have history. Example: Netflix, where a user is the sum of all his or her ratings. Socially, people listen to their friends - a book recommendation from a friend warrants at least a look.

Compare Amazon's recommendations with a social network's:
  • Amazon's recommendations take the person out of the equation: people who bought X also buy Y. The core purchasing matrix is updated whenever two products are purchased within the same 24-hour period, regardless of the person, so there is no persistent history
  • Social recommendations - history is kept and data is persistent. The Facebook feed is an excellent example: it is not aggregated and presents actual events
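
The "people who bought X also buy Y" matrix can be sketched as a table of pair counts. This is a minimal illustration, not Amazon's actual method: it assumes each basket holds the items purchased together within one 24-hour window, and the item names and the helpers `co_purchase_counts` and `also_bought` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_purchase_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def also_bought(counts, item, top_n=3):
    """Rank other items by how often they were bought together with `item`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [other for other, _ in related.most_common(top_n)]


# Hypothetical baskets; note that no customer identity is stored anywhere.
baskets = [
    ["book_x", "book_y"],
    ["book_x", "book_y", "dvd_z"],
    ["book_x", "book_y"],
    ["book_y", "cd_w"],
]
counts = co_purchase_counts(baskets)
print(also_bought(counts, "book_x"))
```

Because only pair counts are kept, the person really is out of the equation: two baskets from the same customer are indistinguishable from two baskets from different customers.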

What can $5 do for you?
For $5 Acxiom in Little Rock, AR will provide you 350 fields of information that summarizes what the world can see about you. How likely are you to buy a hot tub?... They have that answer :-) Here is an older article about the depth and breadth of Acxiom's involvement in economic, political, and security endeavors, not to mention security problems of their own.

What do all discussed tasks have in common?

Recommendations touch practically every aspect of our lives. Consequently, it is not surprising that almost any problem can be framed as a recommendation.

User Actions

Collecting Recommendation Data From Users
Speaker: **Toby Segaran** - Currently works at **Metaweb Technologies**
(Those on Stanford campus can read his book online here)

Ways of collecting recommendation data:


Voting data

  • This is exactly what it sounds like: some person, place or thing is voted on or rated. The vote or rating itself is important, but what people choose to vote on or rate is also a good indicator of a recommendation. Nor is it only the votes that count: users can also be rated on how well they recommend things. If users vote for things that are highly regarded by other people, they may be very reliable users (good recommenders).
    • Digg - News stories are rated and the most popular stories float to the top
    • Yelp - Rating and reviewing restaurants, dentists, bars, beauty salons, etc.
    • Hot or Not - Rate people based on looks. This site is pretty lame and poorly done, but I think it is one of the originals in this category

Consumer Data

  • Recommendation data collected from people as they spend their money. Not only what they buy, but what else they buy is very important. In other words, what people buy together is very powerful data.
    • Amazon - One of the first, biggest, most active and most successful players in this field.
    • Netflix - Movie suggestions are one of the most common types of recommendations. The ability to recommend good movies to users is so important that Netflix has offered the Netflix Prize.

Implicit Data

  • This is data collected often unbeknownst to the user; it is implied by the user's behavior rather than being the result of a conscious, direct action.
    • Long clicks / Short clicks - This data is simply the amount of time a person takes to make a click after being presented with a web page. Inferences can be made about the information presented based on the number of "long clicks" and "short clicks" made on a given page. You could reasonably surmise that a long click constitutes a recommendation of the page, while a short click would be considered a shun.
    • Link Structures - In a nutshell, link structures attempt to determine the relevance of a web page by looking at the number of pages that link to it. With this methodology, when one page links to another, it is essentially recommending the destination of the link. Larry Page and Sergey Brin popularized this technique with Google, and here is the paper that explains it.
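
The idea that a link acts as a recommendation can be sketched with a simplified PageRank-style iteration. This is an illustrative sketch only, not Google's production algorithm: the damping factor of 0.85, the iteration count, and the tiny four-page graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # every page gets a small baseline share of rank
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # a page splits its rank evenly among the pages it recommends
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page with no outgoing links spreads its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: pages a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page c ends up with the highest score because three pages recommend it, and page a benefits from being recommended by the highly ranked c.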


Forcing the User to Act

  • In this methodology the user is forced to take some non-voluntary action such as entering a description for some piece of data.
    • YouTube - YouTube requires that when a user uploads a video to the site that they enter a title. This allows the site to recommend the video based on the keywords in the title.
    • Misc Free Services - Often, to obtain some free service such as a subscription, users are required to enter information about their personal likes and dislikes, which basically forces them to recommend things.
    • Tagging - Tagging, although not always mandatory, is another way that people recommend various sites. If a person gives a certain tag to a web page, they are recommending that web page to people who are interested in the subject of the tag.

ASW Metrics / Cost Function

Getting to the right cost function is important. We need to evaluate the systems, and we need standards for evaluating their performance. In this part, we therefore discussed appropriate cost functions.

A/B Testing

A/B testing is a method for testing the performance of two or more different versions of a page or system. A/B testers show different versions to different groups of people and look at the differences in performance. This method can be used for several purposes.


  • Robert Schwartz's experiment (University of Michigan): He gave two groups of people a set of questions. One group was also asked, "If you wanted to buy a car, what car would you buy?" The other group got the same questions without the car question. The result was significant: people in the group that was asked the car question showed more intention to buy a car in the future.
  • Banner evaluation: A tester makes several versions of a banner and shows them to different visitors. By comparing the banners' performance, such as the number of clicks or the profit each generates, we can find out which banner works best. For instance, the following are two different versions of a banner. We can simply show them to users at random and compare their performance.
A/B Testing
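
One common way to check whether the difference between two banners is more than noise is a two-proportion z-test on their click-through rates. This is a standard statistical check brought in for illustration, not something prescribed in class; the click and view counts below are made up.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare the click-through rates of banners A and B.

    Returns (rate_a, rate_b, z, p_value), where p_value is two-sided.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # pooled rate under the null hypothesis that both banners perform equally
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    # two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical numbers: each banner was shown to 1000 randomly chosen visitors.
rate_a, rate_b, z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value suggests the banners really do perform differently; with few views, even a large gap in click rates can be chance.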

What to measure

  • Profits: Very business minded! We can evaluate systems by applying them to the same situation and measuring the profit each one generates.
  • Number of clicks: We can compare the number of clicks each system produces. However, we need to consider whether we are maximizing or optimizing the number of clicks. In other words, the number of clicks may mean different things depending on the point of view. We will discuss this in the next part.




Single click sessions - Bounce rate and etc

The bounce rate is the share of sessions in which a visitor views a single page and then leaves. It is usually read together with how much time people spend on a page. When the main function of a web page is to provide information, these measures matter more than the number of clicks: the longer someone stays on a page, the more interested in it he or she is likely to be. The amount of valuable information on a page and the type of media used to present it largely determine both measures. By observing them, we can evaluate the value of web pages and of the system.
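
As a small sketch, taking the bounce rate to be the share of single-page sessions (the usual definition, assumed here), it can be computed directly from session logs. The session data and the function name `bounce_rate` are hypothetical.

```python
def bounce_rate(sessions):
    """Fraction of sessions that consist of exactly one page view."""
    if not sessions:
        return 0.0
    bounces = sum(1 for pages in sessions if len(pages) == 1)
    return bounces / len(sessions)


# Hypothetical sessions, each a list of the pages viewed in order.
sessions = [
    ["/home"],
    ["/home", "/product"],
    ["/search"],
    ["/home", "/cart", "/checkout"],
]
print(bounce_rate(sessions))
```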

Maximization of the number of clicks VS optimization of clicks

Maximization and optimization are related but different concepts. Sometimes you need to maximize the number of clicks, and sometimes you need to optimize them. "Pay Per Click" (PPC) is one situation in which maximizing the number of clicks matters most. Yet even though web pages are mostly evaluated by the number of clicks, that number can be meaningless: if the service provider designs the page so that it forces visitors to click, the click count says little. In that case, optimizing clicks is what we need to consider.

A Google promotional graphic, highlighting AdWords, the largest PPC program


When you build a model of purchasing behavior, you might want to predict the probability that somebody clicks on an item. This is again different from maximizing or optimizing the number of clicks. There is no single metric; you may need a variety of metrics depending on what you want to look at.

Predict the rating

Suppose a new movie comes out and you are forced to rate it. It is hard to predict whether you would give it 5 stars or 1 star. Rating prediction is a very difficult metric, and there will be errors; it is even hard to predict whether the errors increase or decrease when the movie is really good. So predicting ratings is hard even when viewers are forced to rate. In reality, nobody is forced to rate particular items, which makes prediction harder still: the mere fact that someone rates an item already carries a lot of meaning, and in real life ratings tend to be given in the more extreme cases.

Use of rating prediction

  • Movie recommender sites (Netflix)
  • Dating sites
  • Online bookstores

Example - Cinematch and Netflix Prize

Netflix Prize Main Page

The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Netflix has a recommender system called Cinematch. Its job is to predict a person's rating for a movie based on his or her ratings for other movies. Netflix thinks it would be hard to improve the system by more than 10%, so it created the Netflix Prize, which will be given to the team whose algorithm performs very well at rating prediction. It is hard to predict ratings exactly, but many people are trying, and the Netflix Prize is one of these attempts. For more information, you can follow the link and see the rules.
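
The Netflix Prize measures accuracy by root mean squared error (RMSE) between predicted and actual ratings, and improvement is expressed relative to a baseline's RMSE. The sketch below shows both calculations; the ratings and the baseline figures are made up for illustration and are not Cinematch's real numbers.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional reduction in RMSE compared with a baseline system."""
    return (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical predictions and true star ratings for four movies.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 5]
print(round(rmse(predicted, actual), 3))

# Hypothetical baseline: lowering RMSE from 1.0 to 0.9 is a 10% improvement.
print(round(relative_improvement(1.0, 0.9), 3))
```

Because errors are squared before averaging, one badly missed rating hurts more than several small misses.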

Music - skipping songs

If a person likes a song, he or she will keep listening to it; if not, he or she will skip it. A skipped song is a negative signal for the recommender. This kind of information can be used to evaluate music recommenders.


You must not look at only one metric for the system, and you should choose the metrics carefully. Many firms do not do this well. Suppose a firm uses only the bounce rate as its metric. This might give a bad result for the firm: one way to reduce the bounce rate is CLOSING the website. To avoid this kind of effect, you need to consider several metrics together.

Data sources

Metadata (data about the data)

Metadata is often the first kind of data source to come to mind. In many cases, a correlation between two things' metadata indicates that there is a correlation between those two things.

Metadata examples for music (beyond the obvious artist/album/year/genre):

  • Which bands choose which other bands to open for them at concerts -- this indicates that the bands themselves see an overlap in fan base
  • Which school a classical music performer attended
  • Waveform analysis (determine "Je ne sais quoi" attributes of a song)
  • Instrument extraction (determine which musical instruments are in a song)

Session context

In the music example, a recommender could take into account the session context:

  • the songs that the user recently listened to
    • for example, if a user had recently listened to a bunch of upbeat music and a slow song came on, a negative rating might just mean "I want to hear upbeat stuff now" rather than "I don't like this song at all"

The session context could help build a Markov model, which would play songs according to probability given the current song.

Active learning

Some actions by the user will result in a lot more new information for the recommendation model. For example, if a user has rated all of the Star Wars movies but Return of the Jedi as 5s, the recommender can say with high certainty that the user would also rate Return of the Jedi a 5. Not much new data from which to make new recommendations would be gained.

But let's say the user rated the first Matrix movie a 5 and the last one zero stars. The recommender is likely to have little certainty as to how he would rate the second Matrix movie. It can scan through its recommendation model for cases like this and ask the user explicitly to rate The Matrix II, and in doing so fill in holes in its certainty.
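
The "ask where you are least certain" idea can be sketched as follows. This is a deliberately crude stand-in for a real recommendation model: uncertainty about an unrated movie is approximated by the variance of the user's ratings for related movies. The `related_items` grouping and the function name are hypothetical.

```python
from statistics import pvariance


def most_informative_item(user_ratings, related_items):
    """Pick the unrated item whose related, already-rated items disagree most."""
    best_item, best_spread = None, -1.0
    for item, neighbours in related_items.items():
        if item in user_ratings:
            continue  # already rated, nothing to ask
        known = [user_ratings[n] for n in neighbours if n in user_ratings]
        if len(known) < 2:
            continue  # not enough evidence to measure disagreement
        spread = pvariance(known)
        if spread > best_spread:
            best_item, best_spread = item, spread
    return best_item


# Ratings as in the example above (star values from 0 to 5).
user_ratings = {
    "Star Wars": 5,
    "The Empire Strikes Back": 5,
    "The Matrix": 5,
    "The Matrix Revolutions": 0,
}
# Hypothetical grouping of each unrated movie with related movies.
related_items = {
    "Return of the Jedi": ["Star Wars", "The Empire Strikes Back"],
    "The Matrix Reloaded": ["The Matrix", "The Matrix Revolutions"],
}
print(most_informative_item(user_ratings, related_items))
```

The Star Wars ratings agree perfectly, so asking about Return of the Jedi would add little; the conflicting Matrix ratings make the second Matrix movie the more informative question.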


Prof Weigend noted that sometimes you can just recommend items that the user has already told you he likes. Amazon will often recommend items from your wishlist. Not only are these effective recommendations, but they increase the users' confidence and trust in Amazon recommendations.

Efron's bootstrap paper

Prof Weigend says Bradley Efron's paper on bootstrapping is one of his favorites.


Distance Measures

Distance Measures can be used as a simple method of measuring the similarity (or difference) between various elements of a set. Distance Measures can be used to measure many different relationships, from preference similarity to the similarity of two documents or phrases.

Distance measures require a method for computing the distance between two points in a space. A space equipped with such a distance function is known as a metric space; a norm on a vector space is one common way to define the distance, as the norm of the difference between two points. There are many possible notions of distance, and the meaningfulness of each varies both by application and by the choice of dimensions.

Useful distances can be computed between many different entities, such as users, items for sale, documents, news headlines, etc. Distance metrics between click streams or pages visited can help to compare users.

Some great examples of using distance metrics in recommendation systems (with an emphasis on application using Ruby) can be found here.

Example: Preference Difference

The Euclidean norm of the difference between two users' ratings on a selection of movies can be used to measure how similar those two users are in terms of their movie preferences: the smaller the distance, the more similar the users.

For instance, here are three people's ratings for Iron Man and Smart People:
[Table: rows Steve, Jane, Mary; columns Rating for Iron Man | Rating for Smart People | Distance from Steve | Distance from Jane | Distance from Mary]
Which can be visualized graphically:

using a simple distance measure to compare users
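
A minimal sketch of this calculation follows. The rating values are hypothetical, chosen only to illustrate the arithmetic for Steve, Jane and Mary.

```python
import math


def euclidean_distance(ratings_a, ratings_b):
    """Euclidean distance over the movies that both users have rated."""
    shared = set(ratings_a) & set(ratings_b)
    return math.sqrt(sum((ratings_a[m] - ratings_b[m]) ** 2 for m in shared))


# Hypothetical star ratings, for illustration only.
ratings = {
    "Steve": {"Iron Man": 5, "Smart People": 2},
    "Jane": {"Iron Man": 4, "Smart People": 2},
    "Mary": {"Iron Man": 1, "Smart People": 5},
}

for name in ratings:
    row = [round(euclidean_distance(ratings[name], ratings[other]), 2)
           for other in ratings]
    print(name, row)
```

With these numbers Steve and Jane are close (distance 1.0), while Mary is far from both, which is what a scatter plot of the two ratings would show.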

Example: Linguistic Difference

One linguistic distance measure is to take the euclidean norm over a space for which each dimension is the number of times a word appeared in a document. By comparing the number of occurrences of words in two documents, via a norm, one can have a simple but effective method for measuring the difference in word-choice, and possibly the content, of two documents.
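
The word-count distance can be sketched in a few lines. The tokenizer here (lowercase runs of letters and apostrophes) is an assumption; a real system would likely also handle stop words, stemming and document length.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count occurrences of each lowercase word in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def document_distance(text_a, text_b):
    """Euclidean distance between the word-count vectors of two documents."""
    counts_a, counts_b = word_counts(text_a), word_counts(text_b)
    vocabulary = set(counts_a) | set(counts_b)
    # a Counter returns 0 for words that do not appear in a document
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocabulary))


print(document_distance("the cat sat on the mat", "the dog sat on the log"))
```

The two sentences differ in four words (cat, mat, dog, log), each contributing a squared difference of 1, so the distance is 2.0.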

Bayesian Filtering

Bayesian filters, using a naive Bayes classifier, can be used to classify items with gradually increasing accuracy as additional samples are added to the dataset. Bayesian classifiers are based around Bayes' rule, and attempt to compute the probability that a given item belongs to each of the data classes, given each independent (or, assumed to be independent) feature of the sample. Naive Bayes classifiers are called naive because they assume that each feature is independent of the other features, an assumption which is not generally true. Despite this strong assumption, naive Bayes classifiers tend to perform quite well on many sets of real world data, even when several features are correlated. The most famous use of such classifiers is spam filtering.

Influence diagram for a naive Bayes spam filter
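
A toy naive Bayes spam filter might look like the sketch below. It assumes whitespace tokenization, add-one (Laplace) smoothing, and that words never seen in training are ignored; the class name and the training messages are made up for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy naive Bayes text classifier with the labels 'spam' and 'ham'."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text, label):
        # Assumes at least one training example exists for every label.
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        # log prior: how common the label is among training documents
        score = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one smoothing keeps unseen word/label pairs from zeroing the product
            count = self.word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        return score

    def classify(self, text):
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


spam_filter = NaiveBayesFilter()
spam_filter.train("win cash now", "spam")
spam_filter.train("cheap pills win prize", "spam")
spam_filter.train("meeting notes attached", "ham")
spam_filter.train("lunch at noon tomorrow", "ham")

print(spam_filter.classify("win a cash prize"))
print(spam_filter.classify("meeting at noon"))
```

Summing log probabilities word by word is exactly where the independence assumption enters: each word contributes on its own, regardless of which other words appear. Each call to `train` refines the counts, which is how accuracy can improve as more samples arrive.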

Principal Components Analysis (PCA)
An example of PCA on 2-dimensional data

Principal Components Analysis is a method for reducing the dimensionality of a set of data. PCA finds a set of dimensions which are a linear combination of the original dimensions, and these dimensions are ranked in order of decreasing variance along the dimension. Thus, the highest variance dimensions can be selected, and the lower variance dimensions can be ignored, resulting in an efficient dimensional reduction which preserves the maximum amount of variance (when constrained to linear combinations of the original dimensions). PCA can be useful for mapping out the relationships and distributions between users, items, news stories, etc.

The largest advantage of PCA is that it can allow data to be viewed in a space with reduced dimensionality. This can be helpful in filtering out less useful or interesting dimensions, as well as visualizing data. However, because PCA generally removes the meaning of the dimensions, it can be difficult to interpret graphs constructed using PCA.
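
For two-dimensional data, PCA can be written out directly, because the eigenvalues and principal direction of a 2x2 covariance matrix have closed forms. The sketch below handles only this special case; general PCA is usually done with a linear algebra library. The function name and sample points are hypothetical.

```python
import math


def pca_2d(points):
    """PCA for 2-D points (needs at least two points).

    Returns (axis, variances, projected): the unit vector of the first
    principal component, the variances along both components (largest
    first), and each point's coordinate along the first component.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix in closed form
    middle = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    variances = (middle + radius, middle - radius)
    # angle of the direction with the largest variance
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(angle), math.sin(angle))
    projected = [(x - mean_x) * axis[0] + (y - mean_y) * axis[1]
                 for x, y in points]
    return axis, variances, projected


# Hypothetical data lying exactly on the line y = x.
axis, variances, projected = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
print(axis, variances)
```

Here all the variance lies along the diagonal, so the second component carries none and the data can be reduced to one dimension without loss. The new axis is a mixture of the original x and y, which is why PCA dimensions can be hard to interpret.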

Independent Component Analysis (ICA)

Independent Component Analysis is a method for separating independent sources from multiple, related data streams. For example, ICA can be used to separate music from speech given two simultaneous recordings that each contain both. Applications to electronic business include separating contributing factors from multiple trends such as click rates and purchasing behavior, as well as the possibility of separating multiple "shadow users" from a single account (for instance, multiple family members using the same IMDb account to rate multiple movies, or buy several books on

ICA can be a very powerful technique when appropriately applied, and one of the factors that affects its effectiveness is the particular type of ICA used. Several popular forms exist, including linear noisy and noiseless ICA, as well as a number of nonlinear ICA methods.

An example of ICA

Markov Models and Markov Chains

Markov chains are models of stochastic processes which are particularly well-suited to several electronic business problems. Markov chains model the probability of transitions from a particular state to each other state. For example, a Markov chain can be used to find the probabilities that a user will visit a page, given the previous three pages that they viewed. Other examples include predicting the next product a user might buy, or the next song they will play from a playlist. When the state is not directly observable, the model is referred to as a hidden Markov model (HMM). HMMs are harder to construct than standard Markov chains, but are also applicable to a much wider range of domains.

An example Markov chain using the previous 3 pages visited to predict the next page to be visited.
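
Such a chain can be estimated by counting. In the sketch below the state is the tuple of the last three pages, and transition probabilities are simple relative frequencies; the page names, session data and function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_markov_chain(sessions, order=3):
    """Count which page follows each run of `order` consecutive pages."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for i in range(len(pages) - order):
            state = tuple(pages[i:i + order])
            transitions[state][pages[i + order]] += 1
    return transitions


def next_page_probabilities(transitions, recent_pages):
    """Estimated probability of each next page, given the most recent pages."""
    counts = transitions.get(tuple(recent_pages))
    if not counts:
        return {}  # this sequence of pages was never observed
    total = sum(counts.values())
    return {page: n / total for page, n in counts.items()}


# Hypothetical click streams, one list of pages per session.
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "reviews"],
    ["search", "product", "cart", "checkout"],
]
chain = train_markov_chain(sessions)
print(next_page_probabilities(chain, ["home", "search", "product"]))
```

Longer histories make predictions more specific but need far more data, since most three-page sequences are seen rarely or never.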


When looking at data from an electronic business perspective, it can be quite useful to think about the different combinations of attributes, relationships, sets and instances that exist within a particular context. Some examples in the space of websites:
  • Grouping pages into different topics is an example of a set of pages and an associated attribute of these pages.
  • The entry page and exit page on a site are attributes of instances of a user's visit to a site.
  • The transition path through pages within a site is an example of the relationship between instances of pages on a site.

Within several spaces this type of division can be quite helpful; see the contents and tables below.

E-Business Perspective

What really is shopping?

Shopping as Process and Personal Activity

  • The process of creating, maintaining, and refining product-space awareness
  • This process has following features
    • Multi-session
    • Multi-site
    • Multi-channel
    • Punctuated by occasional purchases
    • Space as well as taste change over time. Thus, appropriate recommendations vary over time.

Shopping as Conversation (Dialog)

  • Attention gestures of the shopper
    • Clicks on link : Implicit way (e.g. navigation)
    • Enters keyword : Explicit way (e.g. search)
    • Selects product and puts it into cart : Implicit way
    • Gives ratings : Explicit way
    • Provides tags : Explicit way (e.g. bookmarking)

Shopping as Conversation (Multiple partners)

  • 10 years ago : Little activity across the site
  • Now : Significant activity across the site.
    • Retailers realize they do not own the visitors : Visitors are able to visit other retailers through search engines or shopping comparison sites.
    • Retailers realize they do not own the catalog : Search engines like Google know much more than retailers.

Shopping as Social Activity

  • Person to Product mediated through group
    • User generated content
    • Collaborative filtering
  • Person to Product mediated through individuals
    • Viral activity by individuals (Amazon : Amapedia)
    • Lists created by individuals (Amazon : Listmania)
  • Social shopping sites
    • Kaboodle: Share recommendations and discover new things from the community
Kaboodle Screen Shot

    • ThisNext: Explore good product recommendations / Get personalized shopping suggestions
ThisNext Screen Shot

Shopping spaces


What kind of technology is used?


  • Use JavaScript to extract attributes
  • Style sheet
  • Product catalog

Visit Modeling

  • Interactions based on relationships among people, sites, products
  • Adaptation in real-time, within session
  • Anticipate and respond to unique customer responses
  • Relational Modeling
  • Dynamic Tracking

Example : eStara

  • Recommendation system solution
  • Collecting a lot of information from each user's session
    • Each visitor's click
    • The web site's product information
    • The site structure
    • Other additional useful data
  • Using dynamic behavior models
  • Instantly analyzing all of the data in real time
  • Allowing merchandisers to designate any place on any web pages to display recommendations

Overview of Visit Modeling

Relational Modeling

  • Collaborative filtering (Reference : Wikipedia)
    • The process of filtering for information or patterns using techniques involving collaboration among multiple data sources.
    • Collaborative filtering usually takes two steps
      • Look for users who share the same rating patterns with the active user
      • Use the ratings found above to calculate a prediction for the active user
    • Alternatively, item-based collaborative filtering is possible (e.g. Slope One)
      • Build an item-by-item matrix determining relationships between pairs of items
      • Using the matrix and the data on the current user, infer his or her taste
    • A lot of commercial sites are using this collaborative filtering
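
The two steps of user-based collaborative filtering can be sketched as follows. This is one simple variant among many: similarity is taken as 1 / (1 + Euclidean distance) over commonly rated items, and the prediction is a similarity-weighted average. The user names, item names and ratings are hypothetical.

```python
import math


def similarity(ratings_a, ratings_b):
    """Step 1: score how closely two users' rating patterns match (0 to 1)."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((ratings_a[i] - ratings_b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)


def predict_rating(ratings, active_user, item):
    """Step 2: similarity-weighted average of other users' ratings for `item`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for user, their_ratings in ratings.items():
        if user == active_user or item not in their_ratings:
            continue
        weight = similarity(ratings[active_user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    if weight_total == 0:
        return None  # nobody comparable has rated this item
    return weighted_sum / weight_total


# Hypothetical ratings: ann agrees with bob and disagrees with cat.
ratings = {
    "ann": {"movie_a": 5, "movie_b": 1},
    "bob": {"movie_a": 5, "movie_b": 1, "movie_c": 4},
    "cat": {"movie_a": 1, "movie_b": 5, "movie_c": 2},
}
prediction = predict_rating(ratings, "ann", "movie_c")
print(round(prediction, 2))
```

The prediction for ann lands much closer to bob's rating of 4 than to cat's rating of 2, because bob's past ratings match ann's exactly.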

Dynamic Tracking

  • Behavioral Targeting (Reference : Wikipedia)
    • A technique used by online publishers and advertisers to increase the effectiveness of their campaigns
    • Uses information collected on an individual's web-browsing behavior
    • Onsite Targeting
      • Monitoring visitor's response to the site content and learning what is most likely to generate a desired event
      • Requires at least some amount of traffic before statistical confidence can be reached
      • e.g. Yahoo! Inc.
    • Network Behavioral Targeting
      • Advertising networks are able to build a picture of the likely demographic makeup of different site users
        • e.g. A user on football sites, business sites, and male fashion sites
      • Demographic analyses of individual sites allow the ad networks to sell audiences rather than sites.

  • Dynamic Tracking
    • Real-time recommendation
    • Tracking the user's behavior in real time and producing a new recommendation after each action


Components of Modeling

Dimension of Information Space

  • Instance or Abstraction(Set)?
    • Instance : Is it DELL XPS 1330?
    • Abstraction : Is it a 13" laptop?
  • Attribute or Relation?
    • Attribute : Is it a black car or silver one?
    • Relation(ship) : Which network do you belong to?
  • Structured or Unstructured?
    • Structured : Price $199.99
    • Unstructured : A fine starter camera for the casual photographer

Behavioral Component

  • "People" in above picture
  • Relationships abstracted from click-stream
    • Persistent : Registration info., Order history
    • Session-level : Referrer, IP info, Current time
    • Page-view : Page action, Time on page, Page characteristic
  • Dynamically track visitor within sessions and across sessions
  • Adaptation to current behavior in session
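
The three levels of click-stream data listed above (persistent, session-level, page-view) nest naturally. The sketch below shows one possible way to hold them; all class and field names are illustrative assumptions rather than any vendor's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageView:
    # page-view level
    page_action: str
    seconds_on_page: float
    page_characteristic: str


@dataclass
class Session:
    # session level
    referrer: str
    ip_address: str
    start_time: str
    page_views: List[PageView] = field(default_factory=list)

    def total_seconds(self):
        """Total time spent across all page views in this session."""
        return sum(view.seconds_on_page for view in self.page_views)


@dataclass
class Visitor:
    # persistent level
    registration_info: dict
    order_history: List[str] = field(default_factory=list)
    sessions: List[Session] = field(default_factory=list)


# Hypothetical visitor with one session of two page views.
session = Session(referrer="search engine", ip_address="192.0.2.1",
                  start_time="2008-04-21T10:00:00")
session.page_views.append(PageView("view", 42.0, "product page"))
session.page_views.append(PageView("add_to_cart", 8.0, "cart page"))
visitor = Visitor(registration_info={"name": "example"}, sessions=[session])
print(visitor.sessions[0].total_seconds())
```

Keeping the levels separate makes it straightforward to adapt to the current session while still drawing on the persistent history.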

Behavioral Targeting Model


Social Network

Location Component

  • "Site" in above picture
  • Page type (Static, pre-defined categorization)
  • Structured & Unstructured content
  • Length and type of content
  • Order viewed
  • URL visited

Page Type

Page Entry/Exit

Page Transition

Catalog Component

  • "Catalog" in above picture
  • Length and type of description
  • Structured & Unstructured text
  • Modeling customer-view of products (e.g. Attitude of reviews, negative? or positive?)
  • Taking behavior and website structure into account

Product Catalog
(Information at category, brand, or level)

Product Catalog
(Individual product attributes)

Collaborative Filtering

Other Contributions:

In the last class while we were discussing the problem of music recommendations and discovery, someone suggested that Twitter could be used as a tool for this. I came across this article while going through my feeds a few days afterwards and thought it was relevant enough to share:
In short, the article talks about the company Blip and how it provides a way to suggest and discuss thoughts on music through Twitter.
- Jaebock Lee