Andreas Weigend
Stanford University
Data Mining and Electronic Business
Stat 252 and MS&E 238
Spring 2008
Note: Thanks to James Mao, we have an MP3 recording of the first class (33 MB). It is at

Class 1

This wiki page records and extends the first lecture of Data Mining and Electronic Business.
See also last year


Data is cheap, or approximately (asymptotically?) free.

Data Revolutions

There have been, roughly, three data revolutions:

1) Online Data Collection
Companies began to realize the potential of collecting data. was a big first mover in this revolution.
  • John Tukey: The more data and more complex the problem, the greater the need for simpler answers.

  • This revolution was characterized by an emphasis on, and advances in, new algorithms.
2) Shared Personal Data
Users enjoy sharing information about themselves, and this information is mineable, especially the links.
Facebook - how much personal information people share
NavTech - Sold to Nokia, who understood the value of good map data; $200MM to start, sold for $8.5Bn.
Why pay for data when people are willing to give it away for free? The question is how to create an incentive for people to provide the data.
3) Consumer Data Revolution
Companies can apply economic models to these emerging data dynamics.
New consumer data revolution - now the user is at the center.
Companies are now adding economic systems to data. Users are beginning to realize that the data they spread has value, and they want to be compensated for it.


Roughly organized (1) 20 years ago (2) 10 years ago (3) 5 years ago (4) Right Now

Communication is (essentially) free

1. Data collection - Change in time scale; collection time shrinks from months to minutes as the collection process becomes automated.
2. Experiments - Easy (and cheap) to run side-by-side experiments with web pages.
3. User contribution - Architectures of participation - users create both format and content instead of being "handed tablets" by editors.
4. User interaction - Users can now "connect the dots" by combining existing elements.
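
The side-by-side experiments in item 2 can be sketched as a simple A/B comparison. This is a minimal sketch, assuming each visitor is randomly shown one of two page versions and we record whether they convert. The function names and the visitor counts are hypothetical and do not come from the lecture.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of page A and page B.

    Returns (rate_a, rate_b, z), where z is the two-proportion z statistic.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both pages convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return rate_a, rate_b, z

def p_value_two_sided(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: 1000 visitors per page, 100 vs 150 conversions.
rate_a, rate_b, z = two_proportion_z(100, 1000, 150, 1000)
print(rate_a, rate_b, round(z, 2), p_value_two_sided(z))
```

A small p-value suggests the difference between the two pages is unlikely to be noise. In practice the baseline and the metric should be fixed before the experiment starts.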

Data - If people are paid for giving data, the data might not end up being truthful. A better approach is to give people an incentive by using the data they provide to improve their lives.


1. Wall Street
  • Willing to pay for reams of data and to spend heavily on cutting-edge methods. Realized that minute-by-minute data lent itself to better predictions than opening/closing data. The entire competitive advantage lies in having proprietary data that no one else has.
2. Clicks
  • Optimized for the company and focused on a specific site. The Internet enabled a rich and vast new set of data to be collected. The spread of RFID (unique identifiers) provides a similar type of information for the physical world.
3. Profile, blog
  • User-centric - focus on an individual -- unless you can see who someone is linking to and who is commenting on the blog. Then it becomes highly social.
4. Relationships


  • Metrics first: be clear about which metrics will be used.
  • Good practice: set a baseline first. Know the quantity of interest, collect data, and compare.
  • You must know which variable is the dependent variable in your regressions. Working this out might take a lot of your time.
  • Wall Street pioneered models and methods.
  • Stock price and similar metrics are too coarse and too noisy to connect directly to the success of our solution.
1. Trading models -
2. Company centric - Conversion focus, user viewed as passive target. (Think Amazon)
3. User centric - (Think Facebook) - metrics move to the point of engagement and away from the company-centric approach.
4. Relationships - This is the next step. We are not quite there yet. How can we get there?


1. “Idea” - products and services, not necessarily related to the web.
2. E-Business - The company is in the center.
3. Me-Business - Anti-Copernican - the user is in the center.
4. We-Business - Focus on the community - interactions, relationships, networks.


1. Expert
  • “Trust me”, feedback loops? Are experts really experts? High barriers to entry to become one.
  • “Conversation” with my fund manager. This is obviously something of a fraud.
2. Algorithm
  • To get insights into data. The farther past is given less weight than the recent past.
  • Reinforcement learning: Expected reward (state, action). Use co-purchasing behavior: people who bought X also bought Y.
  • The individual doesn't matter, only the time frame does.
3. Situation
  • Hidden Markov Model, unobserved states, proxies for the "truth". "The process of creating and maintaining product space awareness."
  • Dating site example.
  • Andreas says that speech recognition is a classic application of Hidden Markov Models (an HMM is a simple kind of dynamic Bayesian network, not a neural network). The Wikipedia page on Markov models agrees.
4. Social
  • Facebook feed - exactly what your friends are doing
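
The two ideas under "Algorithm" (weight the recent past more than the distant past, and use co-purchasing behavior) can be combined in a small sketch. It assumes orders arrive as (day, items) pairs and that a purchase loses half its weight every `half_life_days`. The exponential decay, the function names, and the sample orders are illustrative assumptions, not the method of any particular retailer.

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_scores(orders, now, half_life_days=30.0):
    """Build recency-weighted co-purchase scores.

    orders: iterable of (day, items) where day is a number and items a list.
    Returns a dict: item -> {other_item: weighted co-purchase count}.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for day, items in orders:
        age = now - day
        # Older orders count less: weight halves every half_life_days.
        weight = 0.5 ** (age / half_life_days)
        for x, y in combinations(sorted(set(items)), 2):
            scores[x][y] += weight
            scores[y][x] += weight
    return scores

def also_bought(scores, item, k=3):
    """Return up to k items most strongly co-purchased with `item`."""
    neighbours = scores.get(item, {})
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:k]]

# Hypothetical orders: three old book+lamp orders, one recent book+pen order.
orders = [
    (0, ["book", "lamp"]),
    (0, ["book", "lamp"]),
    (0, ["book", "lamp"]),
    (60, ["book", "pen"]),
]
scores = co_purchase_scores(orders, now=60, half_life_days=30.0)
print(also_bought(scores, "book"))
```

Note that no individual customer appears in the scores, only which items were bought together and when, which matches the remark that the individual does not matter and only the time frame does.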


1. None - Marketing agencies collect one-way data on opinions and preferences.
2. Push / Targeting - behavioral targeting
Ex) TV

3. Discovery (Pull) - people seeking their own content
4. True conversations - C2C - consumer to consumer

Companies mentioned in this section:
  • IMMI - Integrated Media Measurement, Inc. - "IMMI conducts custom research that helps the following businesses evaluate the effectiveness of their advertising."
  • Open Mind - "Open Mind collects information from people like you -- non-expert 'netizens' -- in order to teach computers the myriad things which we all know and which underlie our general intelligence but which we usually take for granted."
  • Fire Eagle - "Fire Eagle is the secure and stylish way to share your location with sites and services online while giving you unprecedented control over your data and privacy. We're here to make the whole web respond to your location and help you to discover more about the world around you."
  • Metro Group's Future Store - It uses RFID technology to gather data about customers and make purchase recommendations. Allows instant check-out/billing.


• Data mining (insights | data) → Data mining (data | problem)
  • New focus: previously we sought insights given the data; now we look for data given a problem to solve.
• Economics: Who pays whom? At some point, consumers will realize that their data is being used to make someone else money. Then what?

The 3 flavors of Collective Intelligence
1. Decomposition (parallel execution, similar to MTurk, results are added together)
2. Portfolio (prediction markets). Later we'll have a class about this.
3. Immersion (people creating the architecture of interaction)
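
The "Decomposition" flavor (item 1) can be sketched as follows: split a large job into small independent tasks, run them in parallel, and add the partial results together. In this sketch a function stands in for a human worker, and the task (counting items whose label starts with "cat") is a made-up placeholder. It does not use the Mechanical Turk API.

```python
from concurrent.futures import ThreadPoolExecutor

def count_flagged(chunk):
    """Stand-in for one small task: count the items a worker would flag."""
    return sum(1 for item in chunk if item.startswith("cat"))

def decompose(items, chunk_size):
    """Split the full job into independent chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def run_decomposed(items, chunk_size=2, workers=4):
    """Run every chunk in parallel, then add the partial results together."""
    chunks = decompose(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_flagged, chunks))
    return sum(partials)

# Hypothetical labels to be checked by the "workers".
labels = ["cat1", "dog1", "cat2", "bird", "cat3"]
print(run_decomposed(labels))
```

The key property is that the tasks do not depend on one another, so the combined result is just the sum of the parts.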

Data sources
  • D E Shaw
  • Bezos
  • Holden

Data economics
Proprietary → Peer-production
?? Data strategy?

Topics of Interest:
Maps- Ryan's Mashup
Reputation - property of the person, page, ... where should reputation point?
Collective Intelligence - Amazon's Mechanical Turk. Small example of an endeavor fuelled by MT: SheepMarket
Facebook - how to use social data

Digital Network Economy

production costs $high
distribution costs $0
In other words: the cost of the first item is high, the cost of duplicating and distributing is zero.
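
A short worked example of this cost structure, as a sketch: if the first item costs a fixed amount and each additional copy costs a marginal amount, the average cost per copy falls toward the marginal cost as the number of copies grows. The function name and the dollar figures are hypothetical.

```python
def average_cost(first_copy_cost, marginal_cost, copies):
    """Average cost per copy when the first copy is expensive and
    each additional copy costs `marginal_cost`."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    total = first_copy_cost + marginal_cost * (copies - 1)
    return total / copies

# Hypothetical digital good: $1,000,000 to produce, $0 to duplicate.
for n in (1, 1_000, 1_000_000):
    print(n, average_cost(1_000_000, 0.0, n))
```

With a marginal cost of zero, the average cost is simply the first-copy cost divided by the number of copies.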

ex) Walmart mandates RFID tags
  • new way to track individual items (RFID - radio frequency id) not just item types (SKU - stock keeping unit)
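
The difference between item-level and type-level tracking can be sketched with a small data structure. Here every physical item carries its own tag ID, while many items share one SKU. The class, field names, and sample values are hypothetical and do not reflect any real RFID or retail system.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedItem:
    tag_id: str    # unique per physical item (hypothetical RFID tag value)
    sku: str       # shared by all items of the same type
    location: str  # where the item was last seen

def stock_by_sku(items):
    """SKU-level view: how many items of each type exist."""
    return Counter(item.sku for item in items)

def locate(items, tag_id):
    """Item-level view: where one specific physical item is, or None."""
    for item in items:
        if item.tag_id == tag_id:
            return item.location
    return None

# Two physical TVs of the same type, in different places.
inventory = [
    TaggedItem("tag-001", "TV-42", "warehouse"),
    TaggedItem("tag-002", "TV-42", "truck"),
    TaggedItem("tag-003", "RADIO-7", "store"),
]
print(stock_by_sku(inventory)["TV-42"], locate(inventory, "tag-002"))
```

A SKU count can say that two TVs exist. Only the tag-level record can say which one is on the truck.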

Powerpoint slides - Set 9 : New business of Consumer Data - Who pays whom? Slide 17.

Consumer Data in the Digital Networked Economy
Economics of bits: prices have dropped by 5 orders of magnitude over the last 20 years.
Storage is free, communication is free.

Communication is the heart of this economy.
It used to be that distribution was something people got paid for, e.g., a Chinese TV factory that wants to sell in the US.
Now, distribution is easy because of standards.

It is now easier to collect data than before.
RFIDs: Walmart believes they save more than the revenues of Amazon ( numbers? ) just by knowing where their stuff is.
Amazon collects about a hundred terabytes of clicks per year.
  • However, individual customer data (e.g., addresses and phone numbers) is very small.
  • Orders (transaction data) are about 10 GB.
  • Session aggregate data (when the person came, what the HTTP referrer was, etc.) is about a terabyte.

Data Types

  • Clicks
  • Attention - engagement, measured by activities such as tagging
  • Intention - search terms
  • Interaction - relationships, e.g., comments and wall posts on Facebook
  • Situation
  • Location - GPS, location data.
bi-directional data flow to improve the GPS system.
estimate the flow of traffic and how much time you need to reach a place using data collection from vehicles sending information back to
Amazon makes $40-$50 from each review.
Experiment: strip the reviews from one of two very similar books and compare how much each earns.

All of these are examples of how data can be captured to help people make better decisions.

However, privacy can be a concern.
In pay-as-you-drive insurance, GPS data can be used to determine that a user was speeding, and thus make him ineligible for claims.
DNA results might lead to higher insurance rates if the person has a higher risk of developing cancer later in life.

Fast innovation through experimentation

  • First step – Data
  • Second step – Experimentation
  • Third step – Participation
  • Fourth step – Interaction
  • Fifth step – Community

Where do people get their information from?

  • The Old Paradigm - 20+ years ago - News produced by a small editorial body
  • 10 Years Ago - Search - Yahoo and Google
  • Now - Content is pulled actively by the user, or content is pushed smartly through recommendations and personalization

Data Silos and the Attention Economy

  • Previously you had silos of data about people - credit card companies, credit bureaus, and now even Facebook
  • Value was in having exclusive access to this data
  • Now these data silos are crumbling, and users feel the data belongs to them
  • This is a threat to companies that used to hold monopolies on this data, but it is turning into an opportunity to use the data better
  • Towards the Attention Economy: Will Attention Silos Ever Open Up? - Alex Iskold, Read/Write/Web - Good outside primer on data silos

Innovative Companies Exemplifying the New Economy (bidirectional communication, reducing asymmetries of information)

Initial Contributors

Ryan Mason (rkm3)
Shaun Maguire (shaunm1)
Hamilton Ulmer
Randal Truong (randaltruong)