Andreas Weigend
Stanford University
Stat 252 and MS&E 238
Spring 2008

Homework 6

Choice of 3, announced in class May 19.

Homework

In Groups of 2, 3, or 4 students

Please choose one of the following three options and submit one write-up per group, indicating who the team members are. Everyone in the group will get the same number of points.

1. Social networks: Analyze our crowdvine


Crowdvine has given us a huge dump of their data. What kind of insights would you like to extract from it? This is an open-ended homework, much like the FriendsForSale assignment (HW4). The raw data dump is below.

stanford2008.log - This is a log of all activity on the site. It includes names and uniqueIDs. The first thing you might want to do is: Reformat this into CSV or some other format that can be easily read. There are about 5200 records. Notice that there is a unique id for each user printed immediately before his/her name.

stanford2008.log.summary.txt and stanford2008.stats.csv.txt - These contain a number of summary statistics.

2. Facebook all: Understanding your social graph

FB: How many of your friends have entered the gender they are looking for?
Who are the singles of your target gender who are looking for

How do you make people self-select?
App idea (not original): "StanfordTopModels", "StanfordSexiest", "HottestStanfordGirls"
Who is more likely to engage with this app etc.?
- Enrique.Allen@stanford.edu

Note:
http://wiki.developers.facebook.com/index.php/Users.getInfo
Privacy note: For any user submitted to this method, the following user fields are
visible to an application only if that user has signed up for that application:
  • meeting_for
  • meeting_sex
  • religion
  • significant_other_id


3. Geodata: Visualize location and movement

Hi, Andreas asked me to design the geodata homework question using data from the SPoT that I've been using the last few weeks. My goal is to get you to play with fresh geodata and to see what you can do with it. I'm giving you all of my data and trying to ask questions at a variety of levels. -Ryan rkm3@stanford.edu

The Data:
The workflow: SPOT unit -> Satellite -> Spot server (findmespot.com-XML) -> parse, reverse geotag, and store (5Pears.org) -> output (5Pears.org)
Spot has a Shared page that they create which you can view here: Spot Shared Page
The raw data comes from the Spot in an XML feed, which you can view here (it's buried in the source for the above page): raw XML feed
This PHP script processes the feed, does a reverse geotag on the latitude/longitude (using geonames.org), and stores each point in my database: updateLocation.php(txt)
Raw data as a CSV: Spot CSV directory
PHP used to create the KML: spotkml.php(txt)
KML output that groups the points by date: KML Output. View this (or any KML) in Google Maps.
To see specific days, you use http://5pears.org/252/spotkml.php?startDate=2008-05-20&endDate=2008-05-23

Important Topics from Andreas
There are two big areas to consider, time and space.
How are the points related? There is more meaning than just a location and time stamp.
Which are from the same set/trip?
What can you do to show direction?
How about speed? Can you calculate a rough average?
(You can use the Haversine formula to find the distance between two points.)
Using the given points, show their spatial and temporal relationships.
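
As a starting point for the distance and speed questions above, here is a minimal Python sketch of the Haversine formula and a rough average speed between two points. It assumes a spherical Earth with a mean radius of 6371 km. The function names and the (time, latitude, longitude) tuple layout are made up for illustration and do not come from the Spot feed, and the two sample points are invented rather than taken from the data.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def average_speed_kmh(p1, p2):
    """Rough average speed between two (datetime, lat, lon) points.

    Returns None when the time interval is zero or negative.
    """
    t1, lat1, lon1 = p1
    t2, lat2, lon2 = p2
    hours = (t2 - t1).total_seconds() / 3600.0
    if hours <= 0:
        return None
    return haversine_km(lat1, lon1, lat2, lon2) / hours


# Invented sample points (hypothetical, not from the Spot data)
a = (datetime(2008, 5, 20, 9, 0), 37.4275, -122.1697)
b = (datetime(2008, 5, 20, 9, 30), 37.4419, -122.1430)
print(round(haversine_km(a[1], a[2], b[1], b[2]), 2), "km")
print(round(average_speed_kmh(a, b), 2), "km/h")
```

This measures straight-line distance between fixes, so it underestimates the speed along any winding route, and the sparser the points, the worse the estimate.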

Some Sample Questions:
I hope you will develop more interesting ones.

What can you do with this data?
How can you view it? (by day, by place, by type of point, by group, by trip,...)
What can you tell about the resolution of the device?
Can you guess where I live?
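
One possible way to approach the "by trip" view is to start a new trip whenever the gap between consecutive points exceeds some threshold. The Python sketch below shows that idea. The two-hour threshold and the function name are assumptions chosen for illustration, and the timestamps are invented. The right threshold depends on how often the device actually reports.

```python
from datetime import datetime, timedelta


def split_into_trips(timestamps, max_gap=timedelta(hours=2)):
    """Group timestamps into trips; a gap longer than max_gap starts a new trip."""
    trips = []
    for t in sorted(timestamps):
        if trips and t - trips[-1][-1] <= max_gap:
            trips[-1].append(t)   # close enough in time: same trip
        else:
            trips.append([t])     # first point, or gap too long: new trip
    return trips


# Invented timestamps (hypothetical, not from the Spot data)
stamps = [
    datetime(2008, 5, 20, 9, 0),
    datetime(2008, 5, 20, 9, 30),
    datetime(2008, 5, 20, 15, 0),
    datetime(2008, 5, 21, 8, 0),
]
for i, trip in enumerate(split_into_trips(stamps)):
    print("trip", i, "has", len(trip), "point(s)")
```

A purely time-based split ignores distance. Combining it with a distance check would separate "stayed put for hours" from "moved while the device was off".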

Harnessing the Coordinates:
What can you find out about a location and its environs?
What services provide reverse geocoding?
Can you find out about the weather?
Nearby wifi spots?
Cafes?

Digging into the KML:
Can you suggest a color scheme for the points? Should they all be blue?
Should they be connected with a polyline?
How can you show more information than just a blue marker? Look at the image below.
What information should the text bubble show?
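
To make the questions above concrete, here is a small Python sketch that writes KML with one marker color per day and a polyline connecting the points. It is not the course's spotkml.php. The function name, the tuple layout, the three-color palette, and the sample points are all assumptions for illustration. Note that KML colors are written as aabbggrr (alpha, blue, green, red) and coordinates as longitude,latitude.

```python
from xml.sax.saxutils import escape

# KML colors are hex aabbggrr, not rrggbb: these are opaque red, green, blue
DAY_COLORS = ["ff0000ff", "ff00ff00", "ffff0000"]


def build_kml(points):
    """Build a KML string from (day_index, label, lat, lon) tuples in time order."""
    parts = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<kml xmlns="http://www.opengis.net/kml/2.2">',
        "<Document>",
    ]
    # One shared style per day color
    for i, color in enumerate(DAY_COLORS):
        parts.append(
            '<Style id="day%d"><IconStyle><color>%s</color></IconStyle></Style>'
            % (i, color)
        )
    # One marker per point; the label becomes the text bubble title
    for day, label, lat, lon in points:
        parts.append(
            "<Placemark><name>%s</name><styleUrl>#day%d</styleUrl>"
            "<Point><coordinates>%f,%f</coordinates></Point></Placemark>"
            % (escape(label), day % len(DAY_COLORS), lon, lat)
        )
    # A single polyline through all points, in the order given
    path = " ".join("%f,%f" % (lon, lat) for _, _, lat, lon in points)
    parts.append(
        "<Placemark><name>Path</name><LineString><coordinates>%s"
        "</coordinates></LineString></Placemark>" % path
    )
    parts += ["</Document>", "</kml>"]
    return "\n".join(parts)


# Invented sample points (hypothetical, not from the Spot data)
sample = [
    (0, "Start & first fix", 37.4275, -122.1697),
    (1, "Next day", 37.4419, -122.1430),
]
print(build_kml(sample))
```

Putting the timestamp or the reverse-geocoded place name in each Placemark's name or description is one way to make the text bubble more informative.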

Using R:
What can you learn from looking at the CSVs in R?
What statistics can you generate about the points? How can you group them?
Try using the Haversine formula to calculate the distance between two points, then divide that by the time interval to get a (not-so-good) estimate of average speed.

Fire Eagle / Facebook:
Are there any tools out there for sharing this information that you know of?
Would you want your Facebook page to show where you are?
What granularity would you tolerate: street address, city, state, country, or time zone?

What you should submit
Like the other homeworks, you should submit something interesting that shows you've put some thought and effort into the problem. Please submit what you create to the homework email.

More thoughts
These points are hard to read. What can you do to make this clearer?
stanford.jpg