1. Apr 07, 2008

[|Andreas Weigend]
Stanford University
Data Mining and Electronic Business
Stat 252 and MS&E 238
Spring 2008

Note: thanks to James Mao, we have an MP3 version of the first class (33MB). It is at http://www.weigend.com/files/teaching/stanford/recordings/WeigendStanford2008Class1.mp3

=Class 1=
This wiki is designed to note and extend the first lecture of [|Data Mining] and Electronic Business.
See also last year: http://aweigend.wikispaces.com

=Introduction=

Data is cheap, or approximately ([|asymptotically?]) free.

Data Revolutions
There have been, roughly, three data revolutions:

1) **Online Data Collection**
Companies began to realize the potential of collecting data. [|Amazon.com] was a big first mover in this revolution.
> [|John Tukey]: The more data and more complex the problem, the greater the need for simpler answers.
This revolution was characterized by an emphasis on, and advances in, new algorithms.

2) **Shared Personal Data**
Users enjoy sharing information about themselves, and this information is mineable, especially the links.
 * [|Facebook] - how much personal information people share
 * NavTech - Sold to Nokia, who understood the value of good map data; $200MM to start, sold for $8.5Bn.
Why pay for data when people are willing to give it for free? The question is how to create the incentive for people to provide the data.

3) **Consumer Data Revolution**
Companies can apply economic models to these emerging data dynamics. In the new consumer data revolution, the user is in the center, and economic systems are being added to data. Users are beginning to realize that the data they spread has value, and they want to be compensated for it.
> [|Free! Why $0.00 Is the Future of Business] - Chris Anderson, Editor of Wired.
> [|Beware of Freeconomics] - Alex Iskold, Read/Write/Web - A good response to the above article.



//Roughly organized (1) 20 years ago (2) 10 years ago (3) 5 years ago (4) Right Now//

Communication is (essentially) free
1. **Data collection** - Change in time scale; collection time shrinks from months to minutes as the collection process becomes automated.
2. **Experiments** - Easy (and cheap) to run side by side experiments with web pages.
3. **User contribution** - Architectures of participation - users create both format and content instead of being "handed tablets" by editors.
4. **User interaction** - Users can now "connect the dots" by combining existing elements.
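As a rough illustration of the "side by side experiments" point above, here is a minimal sketch of how two versions of a web page might be compared. The function name `two_proportion_z` and the visitor and conversion counts are assumptions made up for this example; they are not from the lecture.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of page A and page B.

    Returns (z, p_value) for a two-sided two-proportion z-test.
    conv_* are conversion counts, n_* are visitor counts.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both pages convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in the data; cannot compute z")
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 10,000 visitors per page.
z, p = two_proportion_z(conv_a=200, n_a=10_000, conv_b=250, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the two pages is unlikely to be noise; the threshold to act on is a business decision.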

Data - If people are paid for giving data, the data might not end up being truthful. A better way is to give them a different incentive: use the data they give to make their lives better.

Metrics
> Metrics first: there should be clarity on which metrics are to be used.
> Good practice: set the baseline first. You must know what the quantity of interest is, collect data, and compare.
> You must know which variable is the dependent variable in your regressions. Coming up with this might take a lot of your time.
> Wall St pioneered models and methods.
> Stock price and similar metrics are too coarse to derive a direct connection to the success of our solution. Too much noise.
1. **Trading models**
> [|Sharpe ratio]
> [|DE Shaw] Hedge Fund - better data -> higher frequency - different time scale
2. **Company centric** - Conversion focus, user viewed as passive target. (Think Amazon)
3. **User centric** - (Think Facebook) - metrics move to the point of engagement and away from the company-centric approach.
4. **Relationships** - This is the next step. We are not quite there yet. How can we get there?
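For reference, the Sharpe ratio mentioned under trading models is the mean excess return divided by the standard deviation of the excess return. A minimal sketch follows; the function name, the `periods_per_year` parameter, and the sample returns are assumptions for illustration only.

```python
import math
import statistics

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=1):
    """Mean excess return divided by its sample standard deviation.

    returns: per-period returns (e.g. 0.01 for 1%).
    risk_free: per-period risk-free rate (assumed constant here).
    periods_per_year: set to e.g. 252 to annualize daily returns.
    """
    excess = [r - risk_free for r in returns]
    ratio = statistics.mean(excess) / statistics.stdev(excess)
    return ratio * math.sqrt(periods_per_year)

# Hypothetical per-period returns.
print(sharpe_ratio([0.01, 0.02, -0.01, 0.03, 0.00]))
```

The same "baseline first" advice applies: the ratio is only meaningful relative to a benchmark strategy measured over the same period.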

Applications
1. **“Idea”** - products and services, not necessarily related to the web.
2. **E-Business** - The //company// is in the center.
3. **Me-Business** - Anti-[|copernican] - the //user// is in the center.
4. **We-Business** - Focus on the //community// - interactions, relationships, networks.

Recommendations
1. **Expert**
> “Trust me”, feedback loops? Are experts really experts? High barriers to entry to become one.
> “Conversation” with my fund manager. This is obviously something of a fraud.
2. **Algorithm**
> To get insights into data. The farther past is given less weight than the recent past.
> Reinforcement learning: Expected reward (state, action). Use co-purchasing behavior: people who bought X also bought Y.
> The individual doesn't matter, only the time frame does.
3. **Situation**
> [|Hidden Markov Model], unobserved states, proxies for the "truth". "The process of creating and maintaining product space awareness."
> Dating site example.
> Andreas says that speech recognition is a classic application of Hidden Markov Models (an HMM is a simple kind of dynamic Bayesian network, not a neural network). Wiki page for Hidden Markov model agrees.
4. **Social**
> Facebook feed - exactly what //your// friends are doing
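The algorithmic approach above (co-purchasing, with the farther past weighted less than the recent past) can be sketched as follows. This is a minimal illustration, not the method any particular company uses; the exponential half-life weighting, the function names, and the sample baskets are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_scores(baskets, now, half_life_days=30.0):
    """Build time-decayed co-purchase scores.

    baskets: iterable of (day, items) pairs, where day is a number.
    A basket bought half_life_days ago counts half as much as one bought today.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for day, items in baskets:
        weight = 0.5 ** ((now - day) / half_life_days)
        # Every unordered pair of distinct items in the basket co-occurs once.
        for x, y in combinations(sorted(set(items)), 2):
            scores[x][y] += weight
            scores[y][x] += weight
    return scores

def also_bought(scores, item, k=3):
    """Top-k items most often (and most recently) bought with `item`."""
    ranked = sorted(scores.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]

# Hypothetical purchase history: (day, items in the basket).
history = [
    (0, ["book", "lamp"]),
    (60, ["book", "pen"]),
    (60, ["book", "pen", "lamp"]),
]
scores = co_purchase_scores(history, now=60)
print(also_bought(scores, "book"))
```

Note that the sketch never looks at who bought what, matching the note that the individual does not matter, only the time frame does.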

Conversations
1. **None** - Marketing agencies collect one-way data on opinions and preferences.
2. **Push / Targeting** - behavioral targeting. Ex) TV
3. **Discovery (Pull)** - people seeking their own content
4. **True conversations** - C2C - consumer to consumer

Companies mentioned in this section:
 * [|IMMI] - Integrated Media Measurement, Inc. - "IMMI conducts custom research that helps the following businesses evaluate the effectiveness of their advertising."
 * [|Open Mind] - "Open Mind collects information from people like you -- non-expert 'netizens' -- in order to teach computers the myriad things which we all know and which underlie our general intelligence but which we usually take for granted."
 * [|Fire Eagle] - "Fire Eagle is the secure and stylish way to share your location with sites and services online while giving you unprecedented control over your data and privacy. We're here to make the whole web respond to your location and help you to discover more about the world around you."
 * Metro Group's [|Future Store] - It uses RFID technology to gather data about customers and make purchase recommendations. Allows instant check-out/billing.

=PART 2=
 * Data mining (insights | __data__) → Data mining (__data__ | problem)
> New focus: previously seeking insights given data, now looking for data to solve problems
 * Economics: Who pays whom? At some point, consumers will realize that their data is being used to make someone else money. Then what?

**The 3 flavors of Collective Intelligence**
1. Decomposition (parallel execution, similar to [|MTurk], results are added together)
2. Portfolio (prediction markets). Later we'll have a class about this.
3. Immersion (people creating the architecture of interaction)
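The "decomposition" flavor above (split the work, execute the pieces in parallel, add the results together) can be sketched as follows. Here threads stand in for human workers, and the labeling task, function names, and chunk size are assumptions for illustration; this does not use the actual Mechanical Turk API.

```python
from concurrent.futures import ThreadPoolExecutor

def count_label(chunk, label="cat"):
    """One worker's task: count how many items in its chunk carry the label."""
    return sum(1 for item in chunk if item == label)

def decompose_and_sum(items, n_workers=4, chunk_size=3):
    """Split items into chunks, process the chunks in parallel, add the partial results."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(count_label, chunks))
    return sum(partials)

labels = ["cat", "dog", "cat", "cat", "bird", "dog", "cat"]
print(decompose_and_sum(labels))
```

The pattern only works when the partial results are independent and can simply be added, which is what distinguishes decomposition from the portfolio and immersion flavors.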


**Data sources**
 * D E Shaw
 * Bezos
 * Holden

Data economics
 * Production: Proprietary → Peer-production ??
 * Data strategy?

Topics of Interest:
 * Maps - [|Ryan's Mashup]
> Google Maps [|now lets users move locations] up to 200 without supervision
 * Reputation - property of the person, page, ... where should reputation point?
 * Collective Intelligence - [|Amazon's Mechanical Turk]. Small example of an endeavor fuelled by MTurk: [|SheepMarket]
 * Facebook - how to use social data

Digital Network Economy
 * Production costs: $high
 * Distribution costs: $0
In other words: the cost of the first item is high, the cost of duplicating and distributing is zero.
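The cost structure above can be written down as a short worked formula. The symbols are assumptions introduced here, not from the lecture: $F$ is the one-time cost of producing the first item, $c$ is the cost of duplicating and distributing one more copy, and $n$ is the number of copies.

```latex
\[
  \text{Total cost: } C(n) = F + c\,n,
  \qquad
  \text{Average cost per copy: } \frac{C(n)}{n} = \frac{F}{n} + c .
\]
\[
  \text{With } c \approx 0:\quad \frac{C(n)}{n} \approx \frac{F}{n} \;\longrightarrow\; 0
  \quad \text{as } n \to \infty .
\]
```

So the average cost per copy is driven almost entirely by how many copies share the first-copy cost, which is why scale matters so much for digital goods.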

ex) [|Walmart mandates RFID for tags]
> New way to track individual items (RFID - radio frequency ID), not just item types (SKU - stock keeping unit)

Powerpoint slides - Set 9 : New business of Consumer Data - Who pays whom? Slide 17.

**Consumer Data in the Digital Networked Economy**

Economics of bits: prices have dropped by 5 orders of magnitude over the last 20 years. Storage is free, communication is free.
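To make "5 orders of magnitude over the last 20 years" concrete, here is the implied yearly rate, assuming (as a simplification) that the decline was a constant percentage each year:

```latex
\[
  \frac{p_{20}}{p_{0}} = 10^{-5}
  \quad\Longrightarrow\quad
  \text{yearly factor } r = \left(10^{-5}\right)^{1/20} = 10^{-0.25} \approx 0.56 ,
\]
\[
  \text{halving time } t_{1/2} = \frac{\log_{10} 2}{0.25} \approx 1.2 \text{ years} .
\]
```

In other words, under that assumption prices fell by roughly 44% per year, or halved about every 14 months.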

**Communication is the heart of this economy.**

It used to be that distribution was something people got paid for, e.g. by the Chinese TV factory that wants to sell to the US. Now distribution is easy because of standards.

**It is now easier to collect data than before.**

RFIDs: WalMart believes they save more than the revenues of Amazon (numbers?) just by knowing where their stuff is.

Amazon collects about a hundred terabytes of clicks per year. However:
 * **Individual customer data** is not that much, i.e. addresses, phone numbers, etc.; very small size.
 * **Orders (transaction data)** are about 10 gigabytes.
 * **Session aggregate data** (when did the person come, what was the HTTP referrer, etc.) is about a terabyte.

Data Types
 * Clicks
 * Attention - engagement, measured by activities such as tagging
 * Intention - search terms
 * Interaction - relationships, i.e. comments, wall posts on Facebook
 * Situation
 * Location - GPS, location data.

**Dash.net**
Bi-directional data flow to improve the GPS system: vehicles send information back to Dash.net, which uses the collected data to estimate the flow of traffic and how much time you need to reach a place.

**Amazon**
Amazon makes $40-$50 from each review. Experiment: strip the reviews from one of two very similar books, and measure how much each earns.
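The Dash.net idea in this section (vehicles report back, and the reports are used to estimate traffic flow and travel time) can be sketched roughly as below. This is not Dash.net's actual system; the segment names, the use of the median speed, the default speed for unreported segments, and the sample reports are all assumptions.

```python
from statistics import median

def segment_speeds(reports):
    """Aggregate vehicle reports into one speed estimate per road segment.

    reports: iterable of (segment_id, speed_kmh) pairs sent back by vehicles.
    The median is used so that a single odd report does not dominate.
    """
    by_segment = {}
    for segment, speed in reports:
        by_segment.setdefault(segment, []).append(speed)
    return {segment: median(speeds) for segment, speeds in by_segment.items()}

def travel_time_minutes(route, speeds, default_kmh=50.0):
    """Estimate travel time for a route given per-segment speed estimates.

    route: list of (segment_id, length_km). Segments without any reports
    fall back to default_kmh.
    """
    hours = sum(length / speeds.get(segment, default_kmh) for segment, length in route)
    return hours * 60

# Hypothetical reports from vehicles on segments A and B.
reports = [("A", 60), ("A", 40), ("A", 50), ("B", 20), ("B", 30)]
speeds = segment_speeds(reports)
print(travel_time_minutes([("A", 10), ("B", 5), ("C", 5)], speeds))
```

The flow is bi-directional in the sense that each driver both contributes reports and receives the improved estimate.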

All these are examples of how data can be captured to help people make better decisions.

**However, privacy can be a concern**

In Pay as you Drive insurance, GPS data can be used to determine that a user is speeding, and thus make him ineligible for claims. DNA results from 23andme.com might lead to higher insurance rates if the person has a higher risk of contracting cancer at a later age.

Fast innovation through experimentation

 * First step – Data
 * Second step – Experimentation
 * Third step – Participation
 * Fourth step – Interaction
 * Fifth step – Community

Where do people get their information from?

 * The Old Paradigm - 20+ Years Ago - News produced by a small editorial body
 * 10 Years Ago - Search - Yahoo and Google
 * Now - Content is pulled actively by the user, or content is pushed smartly through recommendations and personalization

Data Silos and the Attention Economy

 * Previously you had silos of data about people - credit card companies, credit bureaus, and now even Facebook
 * Value was in having exclusive access to this data
 * Now these data silos are crumbling, and the users feel the data belongs to them
 * Threats to companies which used to hold monopolies on this data, now turning into opportunities to use this data better
 * [|Towards the Attention Economy: Will Attention Silos Ever Open Up?] - Alex Iskold, Read/Write/Web - Good outside primer on data silos

Innovative Companies Exemplifying the New Economy (bidirectional communication - reducing asymmetries of information):

 * [[http://23andme.com|23andme.com]] - Personal Genome Service
 * [|Pay as you drive insurance Norwich Union]
 * http://www.jigsaw.com/ - Share contact information
 * [|Jobscore.com]

Initial Contributors

 * Ryan Mason: || [|rkm3@stanford.edu] || 1219388400 <- Next birthday ||
 * Shaun Maguire: || [|shaunm1@stanford.edu] || user:shaunm1 ||
 * Hamilton Ulmer: || [|ulmerham@stanford.edu] || username ||
 * Jackson: || jackotan@stanford.edu || username ||
 * Randal Truong: || [|rtruong@stanford.edu] || user:randaltruong ||