HW5

[|Andreas Weigend], Stanford University, Stat 252 and MS&E 238, Spring 2008

= Homework 5: Recommender System for del.icio.us =

Note: This homework should be done individually, but feel free to discuss questions about Python programming.
In this assignment you will create a simple recommender system for del.icio.us. Here is the case: if you have just bookmarked [|www.weigend.com], the program will recommend URLs for you based on your own tagging history and that of others who, at some stage in the past, have also tagged [|www.weigend.com].

To design a recommendation system, let's start simple. Find the set of all users who have bookmarked [|www.weigend.com] or [|weigend.com] (a later version of del.icio.us will map these together; in the current version you need to do this manually). Looking at all of the URLs tagged by this set of users, how many URLs are tagged by only a single user, how many by 2 users, 3 users, etc.?

When there are many such multiply bookmarked URLs, you might consider ranking the results based on a "weighted" version that gives different users different weights before aggregation. Two examples of such weights are:
 * heavier weights for users whose tagging history is more similar to yours;
 * lower weights for users who have not been using del.icio.us for some time.

Another dimension to consider in weighting is time. How do you want to treat time
 * relative to when the user tagged weigend.com (i.e., do you want to weight URLs that came afterwards higher than those that came before?), and
 * relative to the current time (should a URL tagged yesterday be weighted more than a URL tagged a year ago?)

By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular ones) is biased towards the common. When is this property desired? What are its potentially negative side effects? To create an algorithm for bolder recommendations, biased towards uniqueness, how would you find the most "interesting" users who have tagged [|weigend.com]? And how would you derive recommendations from their most "interesting" bookmarks? Compare these two approaches and explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, and when would you use the second, biased towards individuality?

So far, you have only considered the fact that others have tagged the //URL//, and ignored the //specific tags// as well as the //comments//. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), or the uniqueness of tags? And how about the comments they leave?

Before you start working on this, first register an account on del.icio.us and tag some URLs of interest to you (this will be your history). **Please put your Python code and 25 ranked recommendations for the target URL on your webpage.** Carefully examine and discuss the change in results as you change your parameters. Post the description of the choices you made, the effects you investigated, and a set of 25 recommendations on your page, and submit the link to this webpage to the usual email address by the usual deadline.
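As a concrete (and entirely hypothetical) starting point, such a weighted aggregation might be sketched as below, using Jaccard similarity between bookmark histories as the user weight and an exponential half-life as the time decay. All function names, the decay form, and the 90-day half-life are illustrative choices, not part of the assignment:

```python
import math

def jaccard(history_a, history_b):
    """Jaccard similarity between two users' sets of bookmarked URLs."""
    a, b = set(history_a), set(history_b)
    if not a or not b:
        return 0.0
    return len(a & b) / float(len(a | b))

def score_urls(my_history, user_histories, url_ages, half_life_days=90.0):
    """Score candidate URLs: each user's vote is weighted by how similar
    their history is to mine, and each URL's vote decays with its age in
    days. Returns candidate URLs, best first."""
    scores = {}
    for user, history in user_histories.items():
        w = jaccard(my_history, history)
        for url in history:
            if url in my_history:
                continue  # don't recommend what I already bookmarked
            age = url_ages.get(url, 0)
            decay = math.exp(-math.log(2) * age / half_life_days)
            scores[url] = scores.get(url, 0.0) + w * decay
    return sorted(scores, key=scores.get, reverse=True)
```

A fresh URL from a very similar user then outranks an old URL from a dissimilar one; turning the half-life up or down is one of the parameters whose effect you can examine for the write-up.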

-- This exercise uses Python and the pydelicious (del.icio.us Python API) package. To check out this package, you need to install SVN and the feedparser package. Here are the details.

(1) **Install Python.** Python and its tutorials are available on [|www.python.org]. On Windows, after you have run the Python installer, you may have to manually add the installation directory to your system path environment variable (right-click on My Computer -> Properties -> Advanced -> Environment Variables).

(2) **Install SVN.** SVN is a source code version control system. Please download it (Windows binary: http://subversion.tigris.org/files/documents/15/36797/svn-1.4.3-setup.exe; Mac OS: http://metissian.com/projects/macosx/subversion/ ; Linux: you should know) and install it.

(3) **Install feedparser.** Please download it from http://code.google.com/p/feedparser/downloads/list, unpack it, go to its directory, and type the following to install it: // python setup.py build // // python setup.py install //

(4) **Install pydelicious.** Finally, check out "pydelicious" with SVN by following the instructions on http://code.google.com/p/pydelicious/source/checkout. After you have checked it out, go to its directory and type the following to install it: // python setup.py build // // python setup.py install //

Only after the above four steps will you be able to import the pydelicious package in Python. If you encounter errors saying that you need more packages preinstalled, please install them in the same way as (3).

For example, one student (Bo) was getting errors with the elementTree library used in pydelicious. Here's what to do:

code
# go to your homework directory
> curl "http://effbot.org/media/downloads/elementtree-1.2.6-20050316.zip" > elementtree-1.2.6-20050316.zip
> unzip elementtree-1.2.6-20050316.zip
> cd elementtree-1.2.6-20050316/
> python setup.py build
> python setup.py install
code

Now go to your pydelicious directory and build/install again using // python setup.py build // and // python setup.py install //.

//I did it without having to worry about SVN, by downloading the source for the two packages directly from code.google.com and then installing with ">>python setup.py install" from the directory with the downloaded files. -Ryan//

-- First: ssh into one of Stanford's many computer servers and type this:

code
svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious
cd pydelicious
python setup.py build
code

and now you are ready to run Python:

code
> python
>>> import pydelicious
code

//Because we cannot copy the pydelicious code into the Python directory due to access restrictions, you must be inside the pydelicious directory created by svn (use the pwd command in Linux to check your path). --jack//

-- Here are some code snippets to illustrate how to use pydelicious to get information from del.icio.us.

1) **Get the list of posts for a given URL** (each post corresponds to a user's bookmark entry for this URL). A post is a data object containing URL, user, timestamp, etc. The function pydelicious.get_urlposts returns a list of dictionaries, each corresponding to a post.

code format="python"
import pydelicious
up = pydelicious.get_urlposts('http://www.weigend.com/')
code

To see what a post looks like, print up[1] (note up[0] is the first post):

code
{'count': ,
 'extended': u'stat252 teacher',
 'hash': ,
 'description': u'[from junli07] Andreas S. Weigend, Ph.D.',
 'tags': u'amazon weigend',
 'href': u'http://del.icio.us/url/e78668880a3ac1f62edbd3449e095df3#junli07',
 'user': u'junli07',
 'dt': u'2008-05-05T20:48:48Z'}
code

You can get all the information you need from this list of posts. //Please note that you have to write the full URL with a leading "http://" and a trailing slash: www.weigend.com and http://www.weigend.com will not work; only http://www.weigend.com/ works.//

2) **Get the list of recent posts for a given user**

code format="python"
import pydelicious
up = pydelicious.get_userposts("junli07")
code

//You can only get recent posts, up to a limited number.//

3) **Get the list of recent posts for a given tag**

code format="python"
import pydelicious
up = pydelicious.get_tagposts("weigend")
code

For more information, please refer to pydelicious.html in your pydelicious doc directory.
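Putting these snippets together, the counting step of the simple recommender might look like the sketch below. It assumes the posts have already been fetched (e.g. via get_urlposts and get_userposts) into plain dictionaries, and that each post carries the bookmarked address under a 'url' key -- field names can vary between pydelicious calls, so print one post to check before relying on this:

```python
def count_urls(user_posts):
    """Given {user: [post dict, ...]}, count how many distinct users
    bookmarked each URL. A user is counted at most once per URL."""
    counts = {}
    for user, posts in user_posts.items():
        seen = set(p['url'] for p in posts)  # dedupe within one user
        for url in seen:
            counts[url] = counts.get(url, 0) + 1
    return counts

def histogram(counts):
    """How many URLs were tagged by exactly 1 user, 2 users, ...?"""
    hist = {}
    for n in counts.values():
        hist[n] = hist.get(n, 0) + 1
    return hist
```

Sorting the keys of count_urls by their counts gives the first (popularity-biased) recommendation list; histogram answers the "how many URLs are tagged by 1, 2, 3, ... users" question.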
 * **Note:** Instead of installing Python and SVN on your own personal Linux/Windows/Mac machine, you can use Stanford's clusters, which already have Python and SVN installed. (However, there might be differences between clusters, so if you find it does not work, please use your own computer.)


= Playing Nice with Delicious =

So comrades, you may have seen the ugly side of delicious (i.e. the 503 error). If you don't know what I am talking about, then don't worry about it. Below you will find some code that may help you deal with the delicious throttling mechanism. There are three decorators that I use to query delicious (or any web service). One of them is a simple in-memory cache (memoized). It's a convenient way of caching functions that query web services, to avoid hitting them multiple times for the same data. Another (rate_limited) lets you set a minimum number of seconds between successive calls to the same resource. This is better than always sleeping for the maximum amount of delay needed; it also makes it easy to change the rate limiting in one place, and it lets you share a rate-limited resource between different functions. (If you really want to have fun, you can easily modify it to use a pool of API keys / IP addresses and automatically balance the requests.) The third, the one I cobbled together for this particular problem, is a database-based cache (cached) which saves function results across executions or machines (if the cache file is on a shared file system). This makes it easier to debug your program without getting 503'ed, because you only pull the data once no matter how many times you run your program.

Hope this helps. - JДMES

code format="python"
"""
The first attempt:
    Try to play nice with Delicious by using a memoizing decorator and a
    rate limiting decorator to follow their rules.
The hack:
    While the Delicious API says that you should wait at least 1 second, it
    will still throttle/ban you if you do a posts/all type call that often.
    To handle the large amount of data that needs to be gathered, possibly
    from different hosts, we implement a SQLite based cache.
"""

from datetime import datetime, timedelta
import time
import pickle
import sqlite3 as sqlite
import pydelicious as deli


class CacheMissError(Exception):
    pass


class DBCache(object):
    """SQLite-backed cache that persists function results across runs."""

    def __init__(self, dbfile, debug=False):
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        try:
            self.__cursor.execute(
                'CREATE TABLE `cache` (id INTEGER PRIMARY KEY, '
                '`func` varchar(255), `args` blob, `result` blob);')
        except sqlite.OperationalError:
            pass  # table already exists

    def save(self, func, args, result):
        self.__cursor.execute(
            'INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)',
            (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print 'Saving %s(%s) -> %s!' % (func, args, result)

    def load(self, func, args):
        self.__cursor.execute(
            'SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;',
            (func, pickle.dumps(args)))
        row = self.__cursor.fetchone()
        if row:
            value = pickle.loads(row[0])
            if self.__debug:
                print 'Loaded %s(%s) -> %s!' % (func, args, value)
            return value
        raise CacheMissError


class memoized(object):
    """Decorator that caches a function's return value each time it is
    called. If called later with the same arguments, the cached value is
    returned and not re-evaluated."""

    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            self.cache[args] = value = self.func(*args)
            return value
        except TypeError:
            # Uncachable -- for instance, passing a list as an argument.
            # Better not to cache than to blow up entirely.
            return self.func(*args)


class cached(object):
    """Decorator that uses a DBCache to cache results across executions."""

    def __init__(self, cache):
        self.cache = cache

    def __call__(self, f):
        self.func = f
        def wrapper(*args):
            try:
                return self.cache.load(self.func.__name__, args)
            except CacheMissError:
                value = self.func(*args)
                self.cache.save(self.func.__name__, args, value)
                return value
            except TypeError:
                # Uncachable -- for instance, passing a list as an argument.
                # Better not to cache than to blow up entirely.
                return self.func(*args)
        wrapper.__name__ = f.__name__
        wrapper.__dict__.update(f.__dict__)
        wrapper.__doc__ = f.__doc__
        return wrapper


class rate_limited(object):
    """Decorator that limits the rate for a given resource. Waits at least
    min_delay seconds between consecutive calls to the resource."""

    def __init__(self, resource, min_delay, debug=False):
        self.resource = resource
        self.min_delay = timedelta(seconds=min_delay)
        self.last_run = None
        self.debug = debug

    def __call__(self, f):
        self.func = f
        def wrapper(*args, **kw):
            if self.last_run:
                now = datetime.now()
                delta = self.min_delay - (now - self.last_run)
                if self.debug:
                    print 'Now %s : Last %s' % (now, self.last_run)
                if delta > timedelta(0):
                    seconds = delta.seconds + float(delta.microseconds) / 1000000
                    if self.debug:
                        print 'Sleeping for %s' % seconds
                    time.sleep(seconds)
            self.last_run = datetime.now()
            return self.func(*args, **kw)
        wrapper.__name__ = f.__name__
        wrapper.__dict__.update(f.__dict__)
        wrapper.__doc__ = f.__doc__
        return wrapper


rate = 5        # minimum seconds between delicious calls
debug = False
cache = DBCache('tasty.db', debug)


@cached(cache)
@rate_limited('delicious', rate, debug)
def get_userposts(user):
    return deli.get_userposts(user)


@cached(cache)
@rate_limited('delicious', rate, debug)
def get_urlposts(url):
    return deli.get_urlposts(url)


def add_list_to_set(aset, alist):
    aset.update(alist)


def main():
    target_user = u'aweigend'
    user_set = set()
    url_set = set()

    get_userposts(target_user)
code
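To see why the decorator stacking order matters (the cache goes on the outside, the rate limiter on the inside, so cache hits never pay the delay), here is a self-contained toy version with simplified function-based decorators. fetch is a hypothetical stand-in for a pydelicious call, not the real API:

```python
import time

def memoized(func):
    """Minimal in-memory memoizer (toy stand-in for the class above)."""
    cache = {}
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

def rate_limited(min_delay):
    """Ensure at least min_delay seconds between underlying calls."""
    state = {'last': None}
    def deco(func):
        def wrapper(*args):
            if state['last'] is not None:
                wait = min_delay - (time.time() - state['last'])
                if wait > 0:
                    time.sleep(wait)
            state['last'] = time.time()
            return func(*args)
        return wrapper
    return deco

calls = []

@memoized                 # outer: cache hits return immediately
@rate_limited(0.2)        # inner: real requests are spaced out
def fetch(url):
    calls.append(url)     # stands in for an actual network request
    return 'posts for ' + url

fetch('http://www.weigend.com/')
fetch('http://www.weigend.com/')  # cache hit: no sleep, no second request
```

After the two calls, calls contains the URL only once: the repeated request was answered from the cache without ever reaching the rate limiter. If the decorators were stacked the other way around, every cache hit would still sleep.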