The data that I chose to use was movielens to cluster movies based on what users watched a movie. The data is available on : http://www.grouplens.org/node/73. I have chosen the one with 1 million ratings. But the structure of the data is the same across all different sizes of data.
There are 3 data files once you untar.
movie id :: movie title :: movie genre
user id :: gender :: age :: occupation :: zip code
user id :: movie id :: rating :: timestamp
Since, I needed data-points to be movies, i.e. for each movie would have the list of users that watched this movie. This requires massaging the RATINGS data a bit.
if the RATINGS data has :
you would want to have the data that you want k-medoids to fed something like this:
Please note that I am ignoring the following fields :
All I care is what users watched a particular movie. After performing the data transpose, I had in my input data 3,706 movies.
To accomplish this, I resorted to a one-time execution of a map-reduce job that aggregates the user list for each movie and outputs to a file that would serve as a input to the k-medoids algorithm.
So, as I mentioned earlier the only thing that separates k-medoids from k-means is the way the Centroids(medoids here) are re-calculated. I applied the PAM algorithm(Partitioning around medoids) to find k medoids, one for each of the k clusters.
The algorithm proceeds in two steps:
BUILD-step: This step sequentially selects k "centrally located" objects, to be used as initial medoids
SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This is continued till the objective function can no longer be decreased.
This entire process can be implemented in the reducer as follows :
MAPPER : (exactly as in k-means) Finds out the closest medoid for the data-point and emits as key the medoid id and the value as the data-point.
REDUCER : (implements the PAM)
for each movie in values
if making movie as new centroid, improves the best sse
swap movie with the current centroid
best sse = sse(movie, values)
I chose JaccardSimilarity as a measure of how similar 2 movies are. If A and B are 2 movies and the set of users that have watched movies A and B are a and b respectively, the JaccardSimilarity is given by
JaccardSimilarity(a,b) = |intersection(a, b)| / |union(a, b)|
Dissimilarity(a, b) or Distance(a, b) thats required for calculating SSE is simply 1 - JaccardSimilarity(a, b).
I chose k = 37, no apriori tuning, just since I wanted to form clusters of size 3706/20 = approx. 100.
Silhouette is an index [-1, 1] indicating how good the resulting clusters are. A value close to 1 for Silhouette index indicates the data-points are perfectly clustered. A Silhouette of -1 indicates all the data-points have been wrongly assigned to the clusters. A Silhouette close to 0 indicates that data-points are equally close to atleast one other cluster than the one they are assigned to. Silhouette for a data-point i is defined as :
Silhouette(i) = b(i) - a(i) / max(a(i), b(i))
b(i) : average distance from data-point i to all the data-points in the the closest cluster(other than the one i belongs to)
a(i) : average distance from data-point i to all the data-points in the cluster i belongs to
The initial execution with the setting of k=37 gave an overall Silhouette of 0.06555513; which indicated that most of the data-points were almost exactly within the same distance of atleast one other cluster than the cluster it belongs to. This situation can be overcome by two ways :
- Post-process the resultant clusters to merge clusters that improve the overall SSE
- Choose different values of k
I preferred for the moment going for tuning with different values of k.
I will be committing the code soon on github. Check back later and I will post a link to both the k-means and k-medoids implementations in Java.