CourseDescr2008

Stanford University, Stat 252 and MS&E 238
Spring 2008 class time: Monday 3:15 - 6:05 pm
Spring 2008 class location: Gates B01

Data Mining and Electronic Business
This course is about **people and data**: Collecting data about behavior on the web, in social networks, in communication, on dating sites, etc. Mining the data, building predictive models, creating (and rejecting) hypotheses, designing cool experiments, and learning from them quickly. And figuring out what is truly new, what is similar to the past, and what the underlying drivers are. We will discuss the impact of the communication and data revolution on individuals, business, and society, i.e., on many aspects of the world we live in.

The 90’s, the decade of algorithms (data __mining__), focused on the question: "Given these data, what insights can you get?". Great algorithms were invented, refined, and their strengths and weaknesses understood. The current decade is the decade of data (__data__ mining), and the question has shifted to: "Given these problems, what data can you get?". Furthermore, economic aspects of data are becoming increasingly important, with the question becoming "Who pays whom?".

The first half of the course focuses on data: **Click** data (what can be collected and what it is useful for), **intention** data (such as the queries from the searches you do; we will also discuss social search), **attention** data (such as tags on social bookmarking sites, with their important application for discovery), and **interaction** data (such as email headers and social networking sites). The second half of the quarter focuses on models and on creating appropriate structures and incentives. We will discuss models for **products** (recommender systems), **people** (reputation systems), **situation** and **location**.

The second half also discusses applications. They range from **personalization**, **recommendations** and online marketing (behavioral and situational **targeting**), to the principles behind **collective intelligence**, **reputation systems** and peer-production, as well as **prediction markets** as yet another way of gleaning data from people and fostering interactions between them.

Students are expected to actively engage in class discussions, to have their assumptions challenged, and to bring their various backgrounds to bear to make it a great experience for themselves and everybody else. We will also have some great guest speakers come to class.

After each class, the students create a detailed write-up on the [|course wiki] (see 2007). To help prospective students decide whether to take this course, previous syllabi ([|2004], [|2005]) might also be useful.

> Note that the last class is our slot for finals: Friday, 12:15 - 3:15.

Meeting only once a week proved useful in the past since it makes it as easy as possible for students to **attend class in person**. This is a lot more fun than just watching it over the web, and you learn a lot more. Note that this explicitly includes [|SCPD] students who only signed up for remote access; just do not tell anyone :)
**Schedule**: We meet once a week (Monday afternoons) for 3 hours. The dates in Spring 2008 are:
 * Apr 7 The Business of Data
 * Apr 14 Click, Intention, and Attention Data
 * Apr 21 Social Networks and Viral Marketing
 * Apr 28 Prediction Markets
 * May 5 Reputation Systems, Instrumenting the Planet
 * May 12 Location Data (Mobile)
 * May 19 Discovery Systems (Products, People)
 * [no class on May 26, Memorial Day]
 * Jun 2 Personal Genome (tentative: guest from 23andMe)
 * Jun 6 Outlook, and project presentations by students


**Course wiki**: All students have full read/write access to the course wiki at [|stanford2008.wikispaces.com]. I encourage you to actively contribute -- the class //and you// will benefit.


**Grading**: The main goal is that you get insights and that you transfer them to your area, coming up with some interesting ideas and applications. To support this objective, your grade will be determined by the following:
 * Course wiki: We will form 8 groups. Each group is responsible for creating the initial wiki page for one of the classes by Friday 6pm (i.e., 4 days after class). These pages emphasize the key learnings of each class and link to other materials wherever useful. [40%]
 * Homework: There will be assignments. They are due the day before class at 5pm, so that we can look through them and give brief feedback in a timely manner. [40%]
 * Class participation. [20%]
 * Project: If you have a good and solid idea for an interesting project, I am happy to give feedback and jointly decide on whether it makes sense to do the project. I encourage projects in small groups. [optional]

There are also internship opportunities available for students who like to code, both in the Bay Area and abroad, ranging from Bangkok ([|Agoda], online travel) to Helsinki ([|Fruugo], e-business).

Some of the material is very recent and originates from several academic disciplines. Besides statistics and computer science, it draws on modern marketing techniques, behavioral economics, social network analysis, and other concepts. [|Readings] and [|mp3 recordings] of the classes are online at [|weigend.com/files/teaching/stanford/]. We also have a [|Facebook group] for the class. Depending on your specific background and interests, the following might be useful:
**Readings**
 * [|T. Segaran: Collective Intelligence] (2007) Hands-on, hacker mentality, includes Python code, useful for the del.icio.us recommendation engine homework
 * [|P. Baldi, P. Frasconi, and P. Smyth: Modeling the Internet and the Web] (2003) Background on web technology, solid statistical modeling of behavior, information retrieval
 * [|C. Shapiro and H.R. Varian: Information Rules] (1998) Short book with insights about the networked economy (network effects, economics of digital goods, pricing, etc.)
 * [|M.J.A. Berry and G.S. Linoff: Data Mining Techniques] ([|pdf]) (2004) Applications of data mining in marketing and in business in general (not just the web)
 * [|T. Hastie, R. Tibshirani, and J.H. Friedman: The Elements of Statistical Learning] (2003) The classic for more theoretical aspects in data mining
 * [|C.M. Bishop: Pattern Recognition and Machine Learning] (2006) Recent book on machine learning from a Bayesian perspective


**Teaching Assistants**, office hours, and other information are on the [|course wiki].