Andreas Weigend
Stanford University
Stat 252 and MS&E 238
Spring 2008


Homework 5: Recommender System for del.icio.us


Due: Sunday May 18th at 5pm. Send to stat252@gmail.com


Note: This homework should be done individually, but feel free to discuss questions about Python programming.


In this assignment you will create a simple recommender system for del.icio.us. Here is the scenario: you have just bookmarked www.weigend.com, and the program recommends URLs to you based on your own tagging history and that of other users who, at some point in the past, have also tagged www.weigend.com.
To design a recommendation system, let's start simple.
Find the set of all users who have bookmarked www.weigend.com or weigend.com (a later version of del.icio.us will map these together; in the current version you need to merge them manually).
Looking at all of the URLs tagged by this set of users, how many URLs are tagged by only a single user, how many by two users, by three, and so on?
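A minimal sketch of this counting step is below, using the pydelicious calls introduced later in this handout. No rate limiting is shown; see the "Playing Nice with Delicious" section before running anything like this at scale, and note the assumption that post['href'] in a user feed is the bookmarked URL itself.

import pydelicious

# the users who have bookmarked the target URL
users = set(post['user'] for post in pydelicious.get_urlposts('http://www.weigend.com/'))

# count, for each URL, how many of these users have bookmarked it
# (assumption: in a user feed, post['href'] is the bookmarked URL itself)
counts = {}
for user in users:
    for post in pydelicious.get_userposts(user):
        url = post['href']
        counts[url] = counts.get(url, 0) + 1

# histogram: how many URLs are bookmarked by exactly 1, 2, 3, ... of these users
histogram = {}
for n in counts.values():
    histogram[n] = histogram.get(n, 0) + 1
for n in sorted(histogram):
    print '%d URL(s) bookmarked by %d of these users' % (histogram[n], n)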
When there are many such multiply bookmarked URLs, you might consider ranking the results using a "weighted" version that gives different users different weights before aggregation. Two examples of such weights are:

  • heavier weights for users whose tagging history is more similar to yours;
  • lower weights for users who have not used del.icio.us for some time.
Another dimension to consider in weighting is time. How do you want to treat time
  • relative to when the user tagged weigend.com (i.e., do you want to weight URLs tagged afterwards higher than those tagged before?), and
  • relative to the current time (should a URL tagged yesterday be weighted more than a URL tagged a year ago?)?
A sketch of one such recency weight follows this list.
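This is only a sketch; the exponential form and the half_life_days default are assumptions to experiment with, not part of the assignment.

from datetime import datetime
import math

def time_weight(dt_string, half_life_days=90.0):
    # the 'dt' field of a post looks like u'2008-05-05T20:48:48Z'
    post_time = datetime.strptime(dt_string, '%Y-%m-%dT%H:%M:%SZ')
    age_days = (datetime.utcnow() - post_time).days
    # a bookmark made half_life_days ago counts half as much as one made today
    return math.pow(0.5, age_days / half_life_days)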
By design, an algorithm that removes noise by focusing on URLs that occur more than once (favoring more popular URLs over less popular ones) biases towards the common. When is this property desirable? What are its potential negative side effects?
To create an algorithm for bolder recommendations, biased towards uniqueness, how would you find the most "interesting" users who have tagged weigend.com? And how would you derive recommendations from their most "interesting" bookmarks?
Compare these two approaches and explain the differences. When would you use the first algorithm, biased towards more frequently tagged URLs, and when would you use the second, biased towards individuality?
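As one hedged illustration of the second approach: score each user by how rare their bookmarks are within the group. The rule below (a URL shared by n of the users contributes 1/n) is just one assumption among many you could make; counts is the URL-to-user-count dictionary from the earlier counting sketch.

def interestingness(user_posts, counts):
    # average inverse popularity of a user's bookmarks within the group
    urls = [post['href'] for post in user_posts]
    if not urls:
        return 0.0
    return sum(1.0 / counts.get(url, 1) for url in urls) / len(urls)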
So far, you have only considered the fact that others have tagged the URL, and ignored the specific tags as well as the comments. How would you use the number of tags (maybe relative to how many the person usually uses, maybe relative to an overall distribution), and how would you use the uniqueness of the tags? And what about the comments they leave?
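As a starting point for using the tags themselves, here is a sketch of a Jaccard similarity between two users' tag vocabularies. This is just one of many possible choices, and it assumes the 'tags' field is the space-separated string shown in the sample post below.

def tag_set(posts):
    # the 'tags' field is a space-separated string, e.g. u'amazon weigend'
    tags = set()
    for post in posts:
        tags.update(post['tags'].split())
    return tags

def tag_similarity(posts_a, posts_b):
    # Jaccard similarity: |intersection| / |union| of the two tag vocabularies
    a, b = tag_set(posts_a), tag_set(posts_b)
    if not a or not b:
        return 0.0
    return float(len(a & b)) / len(a | b)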
Before you start working on this, first register an account on del.icio.us and tag some URLs of interest to you (this will be your history). Please put your Python code and 25 ranked recommendations for the target URL on your webpage. Carefully examine and discuss how the results change as you vary your parameters. Post the description of the choices you made, the effects you investigated, and the set of 25 recommendations on your page, and submit the link to this webpage to the usual email address by the usual deadline.

------------------------------------------------------------------------------------------------------------------------------------------

This exercise uses Python and the pydelicious (del.icio.us Python API) package. To check out this package, you need to install SVN and the feedparser package. Here are some details.
(1) Install Python. Python and its tutorials are available at www.python.org. On Windows, after you have run the Python installer, you may have to manually add the installation directory to your system PATH environment variable (right-click on My Computer -> Properties -> Advanced -> Environment Variables).
(2) Install SVN. SVN (Subversion) is a version control system. Please download it (Windows binary: http://subversion.tigris.org/files/documents/15/36797/svn-1.4.3-setup.exe; Mac OS: http://metissian.com/projects/macosx/subversion/ ; Linux: you should know) and install it.
(3) After installing SVN, you need to install a package named “feedparser”. Please download it from http://code.google.com/p/feedparser/downloads/list , unpack it, go to its directory, and type the following commands to install it.
python setup.py build
python setup.py install
(4) Finally, check out “pydelicious”. Please follow the instructions on http://code.google.com/p/pydelicious/source/checkout . After you have checked it out, go to its directory and type the following commands to install it.
python setup.py build
python setup.py install
After completing the above four steps, you should be able to import the pydelicious package in Python.
If you encounter errors saying that you need more packages preinstalled, please install them in the same way as in step (3).

For example, one student (Bo) was getting errors with the elementTree library used in pydelicious. Here's what to do:

    # go to your homework directory
    > curl "http://effbot.org/media/downloads/elementtree-1.2.6-20050316.zip" > elementtree-1.2.6-20050316.zip
    > unzip elementtree-1.2.6-20050316.zip
    > cd elementtree-1.2.6-20050316/
    > python setup.py build
    > python setup.py install
Now go back to your pydelicious directory and build/install again using python setup.py build and python setup.py install

I did it without having to worry about SVN, by downloading the source for the two packages directly from code.google.com and then running python setup.py install from the directory with the downloaded files. -Ryan

------------------------------------------------------------------------------------------------------------------------------------------

note: Instead of installing Python and SVN on your own Linux/Windows/Mac machine, you can use Stanford's clusters, which already have Python and SVN installed. (There might be differences between clusters, so if you find it does not work, please use your own computer.)
First, ssh into one of Stanford's many compute servers and type this:
svn checkout http://pydelicious.googlecode.com/svn/trunk/ pydelicious
cd pydelicious

python setup.py build
and now you are ready to run python
>python
>>>import pydelicious
Because we cannot copy the pydelicious code into the Python directory due to access restrictions, you must be inside the pydelicious directory created by svn when you run Python (use the pwd command in Linux to check your path). --jack

------------------------------------------------------------------------------------------------------------------------------------------

Here are some code snippets to illustrate how to use pydelicious to get information from del.icio.us.
1) get the list of posts for a given URL (each post corresponds to a user's bookmark entry for this URL). A post is a data object containing the URL, user, timestamp, etc. The function pydelicious.get_urlposts returns a list of dictionaries, each corresponding to a post.
import pydelicious
up = pydelicious.get_urlposts('http://www.weigend.com/')
To see what a post looks like, print up[1] (note that up[0] is the first post):
{'count': '',
 'extended': u'stat252 teacher',
 'hash': '',
 'description': u'[from junli07] Andreas S. Weigend, Ph.D.',
 'tags': u'amazon weigend',
 'href': u'http://del.icio.us/url/e78668880a3ac1f62edbd3449e095df3#junli07',
 'user': u'junli07',
 'dt': u'2008-05-05T20:48:48Z'}
You can get all the information you need from this list of posts. Please note that you have to write the full URL with a leading "http://" and a trailing slash: www.weigend.com and http://www.weigend.com will not work; only http://www.weigend.com/ works.
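A small helper can enforce this form before each call. This is a sketch: blindly appending a slash is only safe for host-only URLs like www.weigend.com, not for URLs that already point at a specific page.

import pydelicious

def normalize_url(url):
    # add the leading "http://" and trailing slash that get_urlposts expects
    if not url.startswith('http://'):
        url = 'http://' + url
    if not url.endswith('/'):
        url += '/'
    return url

up = pydelicious.get_urlposts(normalize_url('www.weigend.com'))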
2) get the list of recent posts for a given user
import pydelicious
up = pydelicious.get_userposts("junli07")
Note that this returns only a limited number of the most recent posts.
3) get the list of recent posts for a given tag
import pydelicious
up = pydelicious.get_tagposts("weigend")
For more information, please refer to pydelicious.html in your pydelicious doc directory.





Playing Nice with Delicious

So comrades, you may have seen the ugly side of delicious (i.e. the 503 error). If you don't know what I am talking about, then don't worry about it. Below you will find some code that may help you deal with the delicious throttling mechanism. There are three decorators that I use to query delicious (or any webservice).

One of them is a simple in-memory cache (memoized). It's a convenient way of caching functions that query webservices, to avoid hitting them multiple times for the same data.

Another (rate_limited) lets you set a minimum number of seconds between successive calls to the same resource. This is better than always sleeping for the maximum amount of delay needed. It also makes it easy to change the rate limiting in one place, and it lets you share a rate-limited resource between different functions. (If you really want to have fun, you can easily modify it to use a pool of API keys / IP addresses and automatically balance the requests.)

The third, one that I cobbled together for this particular problem, is a database-based cache (cached) which saves function results across executions or machines (if the cache file is on a shared filesystem). This makes it easier to debug your program without getting 503'ed, because you only pull the data once no matter how many times you run your program.

Hope this helps. - JДMES


# The first attempt:
#   Try to play nice with Delicious by using a memoizing decorator and a
#   rate limiting decorator to follow their rules.
#
# The hack:
#   While the Delicious API says that you should wait at least 1 second, it
#   will still throttle/ban you if you do a posts/all type call that often.
#   To handle the large amount of data that needs to be gathered, possibly
#   from different hosts, we implement a SQLite-based cache.
#
 
from datetime import datetime, timedelta
import time
import pydelicious as deli
import sqlite3 as sqlite
import pickle
 
class CacheMissError(Exception):
    pass
 
class DBCache(object):
    def __init__(self, dbfile, debug = False):
        self.__state = {}
        self.__connection = sqlite.connect(dbfile)
        self.__connection.text_factory = str
        self.__cursor = self.__connection.cursor()
        self.__debug = debug
        try:
            self.__cursor.execute('CREATE TABLE `cache` (id INTEGER PRIMARY KEY, `func` varchar(255), `args` blob, `result` blob)')
        except sqlite.OperationalError:
            # the table already exists from a previous run
            pass
 
    def save(self, func, args, result):
        self.__cursor.execute('INSERT INTO `cache` (`func`, `args`, `result`) VALUES (?, ?, ?)', (func, pickle.dumps(args), pickle.dumps(result)))
        self.__connection.commit()
        if self.__debug:
            print 'Saving %s(%s) -> %s!' % (func, args, result)
 
    def load(self, func, args):
        self.__cursor.execute('SELECT `result` FROM `cache` WHERE `func` = ? AND `args` = ?;', (func, pickle.dumps(args)))
        result = self.__cursor.fetchone()
        if result:
            value = pickle.loads(result[0])
            if self.__debug:
                print 'Loaded %s(%s) -> %s!' % (func, args, value)
            return value
        else:
            raise CacheMissError
 
class memoized(object):
    """Decorator that caches a function's return value each time it is called.
    If called later with the same arguments, the cached value is returned, and
    not re-evaluated.
    """
    def __init__(self, func):
        self.func = func
        self.cache = {}
    def __call__(self, *args):
        try:
            return self.cache[args]
        except KeyError:
            self.cache[args] = value = self.func(*args)
            return value
        except TypeError:
            # uncachable -- for instance, passing a list as an argument.
            # Better to not cache than to blow up entirely.
            return self.func(*args)
 
class cached(object):
    """Decorator that uses a DBCache to cache function results across runs."""
    def __init__(self, cache):
        self.cache = cache
 
    def __call__(self, f):
        def wrapper(*args):
            try:
                return self.cache.load(f.__name__, args)
            except CacheMissError:
                value = f(*args)
                self.cache.save(f.__name__, args, value)
                return value
            except TypeError:
                # uncachable -- for instance, unpicklable arguments.
                # Better to not cache than to blow up entirely.
                return f(*args)
 
        wrapper.__name__ = f.__name__
        wrapper.__dict__.update(f.__dict__)
        wrapper.__doc__ = f.__doc__
        return wrapper
 
class rate_limited(object):
    """Decorator that limits the rate for a given resource.  Wait at least
    min_delay seconds between consecutive calls to the same resource.  The
    last-call time is kept per resource name, so different functions
    decorated with the same resource name share one rate limit."""
    last_runs = {}  # class-level: last call time, keyed by resource name
 
    def __init__(self, resource, min_delay, debug = False):
        self.resource = resource
        self.min_delay = timedelta(seconds=min_delay)
        self.debug = debug
 
    def __call__(self, f):
        def wrapper(*args, **kw):
            last_run = rate_limited.last_runs.get(self.resource)
            if last_run:
                now = datetime.now()
                delta = self.min_delay - (now - last_run)
                if self.debug:
                    print 'Now %s : Last %s' % (now, last_run)
                if delta > timedelta():
                    wait = delta.seconds + delta.microseconds / 1000000.0
                    if self.debug:
                        print 'Sleeping for %s' % wait
                    time.sleep(wait)
            rate_limited.last_runs[self.resource] = datetime.now()
            return f(*args, **kw)
 
        wrapper.__name__ = f.__name__
        wrapper.__dict__.update(f.__dict__)
        wrapper.__doc__ = f.__doc__
        return wrapper
 
rate = 5
debug = False
cache = DBCache('tasty.db', debug)
 
@cached(cache)
@rate_limited('delicious', rate, debug)
def get_userposts(user):
    return deli.get_userposts(user)
 
@cached(cache)
@rate_limited('delicious', rate, debug)
def get_urlposts(url):
    return deli.get_urlposts(url)
 
def add_list_to_set(aset, alist):
    # set.update adds every element of alist to aset in place
    aset.update(alist)
 
def main():
    # the target user whose history seeds the recommendations
    target_user = u'aweigend'
    user_set = set()
    url_set = set()
 
    # warm the cache with the target user's recent posts
    get_userposts(target_user)
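 
    # The handout's snippet ends at this point. A plausible continuation (an
    # assumption for illustration, not part of James's original code): collect
    # the users who bookmarked a target URL, then gather those users' bookmarks
    # as candidate recommendations.
    target_url = 'http://www.weigend.com/'
    add_list_to_set(user_set, [p['user'] for p in get_urlposts(target_url)])
    for user in user_set:
        add_list_to_set(url_set, [p['href'] for p in get_userposts(user)])
    print '%d users, %d candidate URLs' % (len(user_set), len(url_set))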