Andreas Weigend
Stanford University
Data Mining and Electronic Business
Stat 252 and MS&E 238
Spring 2008
Note: thanks to James Mao, we have an MP3 version of this class (33 MB). It is at

Class 2



HW2 (due Apr 20 by 5 pm)

  • The weekly homework is due by Sunday at 5 pm; please email it to Ryan Tibshirani and Jun. As of 4/15, you can start on Exercise 1 of Homework 2. (Thanks to Harry Wang for all his hard work to get this working!)
  • Purpose:
  • Understand the traces people leave (access log)

  • Understand how you can find data
  • (In the past used Web crawlers, now use RSS feeds)

HW3 out Apr 21

HW 4 out Apr 28

  • Python warm up
  • Read Horvitz

HW5: out May 5, due May 18: delicious recommender system - Build one

  • Programming Collective Intelligence, Building Smart Web 2.0 Applications by Toby Segaran

- Theory Netflix context feedback
Toby will come to class May 12

Class Topics

2 (Apr 14) Data. Why?

  • Relevance
  • All about relevance!
  • Examples:
  • Ads: showing the right ad to the right person at the right moment.
  • Search: which of the hits to the webpage are relevant.
  • Webpage: which features on the webpage are the most relevant

3 (Apr 21) Marketing, Virality

  • Virality: if each recipient forwards a message to more than one person, the number of people reached grows at an exponential rate.
  • Enrique: Learning from FB class
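
The exponential growth behind virality can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption: every recipient forwards the message to the same number of people, none of whom has seen it before. The function name and the forwarding factor are hypothetical, not from the class.

```python
def recipients_per_round(forward_factor, rounds):
    """Return the number of new recipients in each forwarding round.

    Assumption (for illustration only): one original sender, and every
    recipient forwards the message to `forward_factor` people who have
    not seen it yet.
    """
    counts = []
    current = 1  # the original sender
    for _ in range(rounds):
        current *= forward_factor
        counts.append(current)
    return counts


# Each person forwards to 2 others: the audience doubles every round.
print(recipients_per_round(2, 5))
# Fewer than one forward per person on average: the message dies out.
print(recipients_per_round(0.5, 5))
```

The point of the sketch is the threshold at one forward per person: above it the reach grows exponentially, below it the message fades.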

4 (Apr 28) Prediction markets, betting

  • Read Bo Cowgill's paper on the Prediction markets at:
  • Several startups deal with prediction markets. Predictify is a Bay Area startup with an interesting business model built on prediction markets.
  • A Scientific American article discusses the success of the Iowa prediction markets at predicting election outcomes.

5 (May 5) Economics, Incentives / Interaction. Email

  • Read Horvitz: Economics of Data
  • Talk by Colin Harrison: 15 years back, the next big thing was seen as (not the Internet!) the right of access to information on your own data

6 (May 12) Recommendations

7 (May 19) Mobile

8 (Jun 2) Health Visualization

  • Big data sets
  • 23andMe
  • Gapminder guy from GOOG

9 / Final

  • Your projects and ideas
  • What can I do to facilitate your projects?

Cool Technology Company: Tacit Knowledge has a product called Illumio that profiles the different topics that are indexed on your hard drive. Another user can query the community with a question related to a particular topic (say SIPs, Statistically Improbable Phrases), and if your "knowledge" of SIPs is high, you will get a message asking whether you want to answer the user's question. The user will not know that you were notified unless you choose to answer the question.


What Data Can We Now Collect? / Sources of data (mainly implicit data)

  • Qualitative: relationship data, sentiment data (opinmind, buzzlogic)
  • Quantitative: transactional data, intention data, attention data
  • Attention Stream of Customers

  • How?
    • Explicit, implicit
    • Whether: implicitly, explicitly, intentionally, unintentionally, on your platform, or outside your platform



  • Inventory, POS, loyalty (what do they do with it?), experiments (in store display), RFID: Process of creating and refining product space awareness
  • Example: Retail data is not very well known. For example, GAP stores can account for only 60-70% of their inventory, and this is similar for other retail stores. Point-of-sale data sometimes reveals interesting connections. For example, it was shown some time back that there is a strong correlation between beer sales and diaper sales. Presumably, men buying diapers for home on the weekend also purchased beer (more on this at
  • Another example, on the subject of diapers: purchase data for adult diapers revealed that the highest density of adult diaper purchase/usage is in New York City.
  • Point-of-sale data doesn’t reveal anything about the person. Loyalty cards use persistent history to reward shoppers with coupons; this is a weak reason to reward shoppers.
  • In-store displays: possible uses include a camera that detects whether a person is close by and measures the person’s attention. The impact of the display on the person’s purchasing decision can then be studied. For example, did switching the display ad from Coke to Pepsi have an impact on sales?
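
One common way to quantify a connection like the beer and diapers example is "lift": how much more often two items appear in the same basket than they would if purchases were independent. The sketch below is a minimal illustration, not the analysis used in the original study; the baskets are invented and the function name is hypothetical.

```python
def lift(baskets, item_a, item_b):
    """Lift = P(a and b) / (P(a) * P(b)).

    Values above 1 mean the two items are bought together more often
    than independence would predict; values below 1 mean less often.
    """
    n = len(baskets)
    count_a = sum(1 for basket in baskets if item_a in basket)
    count_b = sum(1 for basket in baskets if item_b in basket)
    count_ab = sum(
        1 for basket in baskets if item_a in basket and item_b in basket
    )
    if count_a == 0 or count_b == 0:
        return 0.0  # an item that never sells has no measurable lift
    return (count_ab / n) / ((count_a / n) * (count_b / n))


# Hypothetical point-of-sale baskets, invented for illustration only.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"bread"},
]

print(lift(baskets, "beer", "diapers"))
```

Note that this works on baskets alone, which matches the point above: point-of-sale data shows what was bought together without revealing who bought it.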



  • Internal use: site optimization (most easily in conjunction with experiments). For example, HW2 requires one to examine the access logs of a web page. The “HTTP referrer” field in the access log (spelled "Referer" in the HTTP header) indicates the web page from which the user arrived at your site.
  • Weak monetization: market research
  • Application: People who click on this eventually buy that
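
As a rough sketch of how the referrer field can be read out of an access log, the Python below counts referring pages. It assumes the log is in the Apache "combined" format, which may not match the HW2 server; the sample lines, hosts, and URLs are made up, and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern for the Apache "combined" log format (an assumption; adjust it
# if your server writes a different format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_referrers(lines):
    """Count how many requests arrived from each referring page."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the combined format
        referrer = match.group("referrer")
        if referrer and referrer != "-":  # "-" means no referrer was sent
            counts[referrer] += 1
    return counts


# Made-up sample lines, for illustration only.
sample_log = [
    '10.0.0.1 - - [14/Apr/2008:10:00:00 -0700] "GET /hw2.html HTTP/1.1" 200 512 "http://example.com/search?q=stat252" "Mozilla/5.0"',
    '10.0.0.2 - - [14/Apr/2008:10:01:00 -0700] "GET /hw2.html HTTP/1.1" 200 512 "http://example.com/search?q=stat252" "Mozilla/5.0"',
    '10.0.0.3 - - [14/Apr/2008:10:02:00 -0700] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '10.0.0.4 - - [14/Apr/2008:10:03:00 -0700] "GET /hw2.html HTTP/1.1" 200 512 "http://example.org/blog" "Mozilla/5.0"',
    'not a log line',
]

referrer_counts = count_referrers(sample_log)
for page, hits in referrer_counts.most_common():
    print(hits, page)
```

Requests with "-" as the referrer (typed URLs, bookmarks, or clients that suppress the header) are left out of the count, so the totals describe only visits whose origin is known.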


  • Future focus
  • Search
  • Strong Monetization: Ads (action clear and traceable: CPC, big difference to media without feedback loop)
  • Weak Monetization: Insights (not as clear actionability), competitive analysis (Hitwise, also clicks)
  • Trends: e.g., using search data to discover a disease outbreak: aggregate all the data and quickly find out that something is happening.
  • Economy: monetizes what people want. An example is Google Trends.


  • E.g., tag, save, flag, vote (e.g., digg), forward (also relationship data)
  • Further downstream from search
  • Use: Discovery (Attensa). When people come up with new technology, they tend to stick with the old paradigm. For example, in Google Reader or other readers, what does "mark as read" mean? It makes no sense. The underlying plumbing does not describe our intentions.
  • Attensa: a new RSS network. Some people are pretty good at discovery. When I get to work, I crank up my discovery box, and it maps out my reading behavior. Simplest case: you work in a law firm and you don't want opposing counsel to know what you are working on. You can share the information internally as a knowledge repository.
  • Distinction between discovery and maintenance. Changing your phone number is maintenance. Discovery is finding new people. Search allows you to find stuff you know exists. Discovery is for things we don't know exist. It uses collective intelligence to discover things.
  • Reading Notes, "The Attention Economy and the Net":
    • Illusory attention: creates an apparent equality of attention, e.g., the audience members can each feel they have not paid as much attention to a speaker as the speaker has paid personally to them, even though the reverse is true.
    • Attention can be passed on from someone who has it to someone else, and on and on. This is a vital feature of this economy, e.g., you pass the whole audience's attention on to someone else.
    • Suppose you get attention through some text you send out over the Internet. Would you want your audience to copy this and pass it on to others who might pay attention in turn? Of course you would.
    • Money now flows along with attention
    • Contrary to what you are sometimes urged to believe, money cannot reliably buy attention


Example: Phone company marketing: 4.8x response rate compared to state-of-the-art direct marketing without social network data. Foster Telecom: they looked at the people who made the calls rather than the people who bought the plans. If we make the calls, then we are in effect recommending the product. If you have ideas about how to capture the behavior of people, you will get better results even if you optimize for a small subset.


Example: Traffic. Dash is a two-way, Internet-connected GPS navigation system. Dash delivers traffic and destination information.

Google adds real-time traffic data: an article about data sources for Google Maps. It turns out they use multiple sources.
  • Mobile Marketing

    • Power of location
    • New data sources
    • Using consumers as allies and partners (UAL, vs participatory)
    • 101 Freeway sign that shows time till arrival at the destination city. Where does this data come from? One might think it is some guy typing in the number. However, there are numerous more scientific ways of measuring it: the rate at which mobile phones move from base station to base station, counting the number of cars going in and out, tracking data from large fleet companies with GPS on their trucks, etc.

Metrics for Discovery

Measures headline


  • Quote in their blog?
  • Forward to friend
  • Time on page
  • Tag
  • Number of subsequent searches in that space; number of searches it took someone else to get to that space
  • Subscribe to that RSS feed
  • Bookmarking
  • Emotionally: Pat on shoulder


  • Degree of difference to what you have already
  • How many other people have bookmarked this?

  • Benefits to the org
  • Increase creativity, innovation
  • Reduce cost of repetition


  • Rent the movie? Buy the item?
  • Profit? What is the probability that someone will buy? Very powerful. The way to really find out whether someone will buy is to run experiments: run two versions and see which one does better.
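
A minimal sketch of how "which one is doing better?" might be answered for two versions of a page: compare the two purchase rates with a two-proportion z-test. The test is a standard statistical choice swapped in here for illustration; the class notes do not name a method. The visitor and purchase counts are invented, and the function name is hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare the conversion rates of versions A and B.

    conv_a, conv_b: number of visitors who bought on each version.
    n_a, n_b: number of visitors shown each version.
    Returns (z, p_value), where p_value is two-sided under the normal
    approximation (reasonable for large samples).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: 1000 visitors per version, 100 vs. 130 purchases.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

A small p-value suggests the difference between the versions is unlikely to be noise alone; it says nothing about whether the difference is large enough to matter for profit.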

More Examples:

illumio, Tacit Knowledge

  • illumio groups let you share web feeds whose content is intelligently and automatically matched to each member based on their interests. Group members can also tap into the collective intelligence of the group by sending questions that are automatically delivered to the people with the most knowledge and expertise to respond.
  • Pay for premium membership and in-house discovery/content sharing, e.g., when you don't want your competitors knowing what you are thinking

Google, what data sources and what can you do with it?

  • Search: Query sequence, choice set (what you click is different from what you see), which link the user chooses, queries and their refinements; use for search quality improvement and trends
  • Gmail: content and social relationships, header information (who to whom, how long), opening up social graph (content used for AdSense), model response behavior, potential use for relevance ranking and discovery
  • Spam: why is it marked as spam (style of message, frequency)? Training data for what people choose. How does Gmail filter; can they use the social graph?
    • False negative: it comes through, but you didn't want to see it
    • False positive: marked as spam, but you wanted to see it
"Spam is like an arms race" between spammers and spam filters. Ultimately this is about collection of data that indicates that certain mail is spam. Cost of sending spam is still virtually zero, which makes it economically effective.

The purpose is to increase the cost of transaction and to decrease the cost of discovery.
Is there an economic solution? Make e-mail cost something, like a stamp?
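
The two kinds of spam-filter error described above can be counted directly once each message has a true label and a filter decision. The sketch below is illustrative only: the labels are invented, and the function name is hypothetical rather than anything Gmail exposes.

```python
def spam_filter_errors(actual_is_spam, marked_as_spam):
    """Count false positives and false negatives for a spam filter.

    actual_is_spam: list of booleans, True if the message really is spam.
    marked_as_spam: list of booleans, True if the filter marked it as spam.
    Returns (false_positives, false_negatives).
    """
    false_positives = 0  # marked as spam, but you wanted to see it
    false_negatives = 0  # came through, but you didn't want to see it
    for actual, marked in zip(actual_is_spam, marked_as_spam):
        if marked and not actual:
            false_positives += 1
        elif actual and not marked:
            false_negatives += 1
    return false_positives, false_negatives


# Invented labels for six messages, for illustration only.
actual = [True, True, False, False, True, False]
marked = [True, False, False, True, True, False]

fp, fn = spam_filter_errors(actual, marked)
print("false positives:", fp)
print("false negatives:", fn)
```

The two errors usually carry different costs: a lost legitimate message tends to hurt more than one spam message in the inbox, so a filter is often tuned to keep false positives especially low.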
  • Docs: content and social relationships
  • Toolbar: follow user across sites. Always with consumer, give data to get data
  • Analytics: JavaScript embedded by site owner. Gives insights, e.g., on how popular a given link is. Insight into what people do going through websites. Get number of “installs” from GOOG
  • Checkout: From the cradle to the grave. Especially for purchases of higher priced items, process of learning about product dimensions etc. Payment systems. Low-margin business that is there to drive the high-margin ad business, e.g., want more websites to be successful
  • Maps: Measure intent. UGC, being able to change info. Phone: my location feature. loopt GPS
  • Calendar: future intentions and interests.
  • Reader: Read, mark, forward to friend
➢ Trends, people wanting to get insights
➢ Discovery, including data across services
  • Notebook: Annotate web pages, increase efficiency of research
  • Google Earth: Repository of 3D objects. Modeling with geodata, e.g., weather/natural disaster modeling. Potential competitor with virtual worlds like Second Life?
  • Google groups: Bin’s friends and their friends finance and investment club vs Wiki
  • Images: Metrics of location, downloads, popularity, value, integration with Picasa
  • Google gears: Let web applications interact naturally with your desktop. Store data locally in a fully-searchable database. Run JavaScript in the background to improve performance
  • GTalk:
  • Interesting examples: obtain info from government to model traffic data. Integrate with CAD: make design elements links, create sales leads for window manufacturer


Social discovery (of people)

Content discovery (e.g., of events)

Sales lead generation

Marketing opportunities (San Luis Obispo)

Personal use (simplify life)

Improving search

Summary for Today:

  • Very happy with the good will among students.
  • Purpose of HW#2 is to understand traces and get a good mental model. Can abstract with Google Analytics, which gives richer information. Helps you to find information out there. Moved the homework to Yahoo Pipes.
  • Then we looked at data structures and new data sources to solve old problems. Talking to people showed the simple example of 4.8x better response rate.
  • Evaluating discovery systems: the generated knowledge requires a broad set of metrics. It cannot be boiled down to one; with every change you make, you want to know the tradeoffs.
  • Building relevance functions is key. Try to come up with the algorithms.
  • Statistics is knowing how to deal with noise.

Professor Weigend OPENED THE FEDEX BOX with the data collection device that costs about as much as 2TB of disk space... and

... INSIDE was: a saliva kit from 23andMe. Genetics just got personal!

Initial Contributors