I am starting off my blog with one of my favorite unsupervised class of algorithms - kmeans.
I just completed successfully implementing k-means on a simple data, data generated from a mixture of 3 Gaussian distributions with means spread out and standard deviation of 1. I have tested it both in a pseudo-distributed environment locally as well as on a small EC2 cluster on Amazon AWS.
Let me jot down the plain old k-means algorithm, even though I am sure most you reading this post might be aware of it.
Step 1 : Choose k random initial cluster centroids
Step 2 : Assign each data point to the closest cluster centroid
Step 3 : Re-calculate the cluster centroids based on some aggregate function applied on the data points that belong to the same cluster
Step 4 : Repeat Step 2, 3 until data points do not change clusters
The above iterative version of k-means can be converted to a map-reduce framework as follows:
A JVM is spawned for every block of data for the slave node. The map method gets called for every record with the block. In the initialization step, the mapper randomly picks a number between 0..k-1 and assigns it as the centroid id to the data-point. The map method hence emits the random number indicating the centroid id as key and the data-point as value. Please note there there may be other implementations where the mapper emits the centroid id as key and a two element tuple <sum, n> as value, where sum is the sum of all data-points belonging to the centroid and n is the number of data-points assigned to the cluster.
M1 :: map(k1, v1)
choose a random number c between 0..k-1
emit <c, v1>
In every iteration following the initialization step, the mapper calculates the nearest centroid for each data-point and emits the centroid id as key and the data point as a value.
M2 :: map(k1, v1)
find the nearest centroid c to data-point v1
emit <c, v1>
The actual work of reducer starts once all the mappers have completed executing. Reducer aggregates the list of data-points in list v and emits the centroid id as key and the aggregate as value. The aggregate(new centroid) in kmeans is the mean of all the data-points in the cluster. The same reducer can be used to aggregate the values at the initialization as well at each of the iteration step. In fact, in this scenario, the reducer can also be used to utilize the combiner or mapper-final functionality of map-reduce framework.
R1 :: reduce(k2, list v)
emit <k2, aggregate(v)>
Just on a side note, I am using the Amazon Elastic MapReduce Ruby Client to submit my job on AWS. For that, I need to create bucket on S3 and save my input data and jar file. s3cmd utility is a great command tool to performing some useful filesystem command. It shares a lot in common with the 'hadoop fs' command.
Amazon Elastic MR Ruby Client : http://aws.amazon.com/developertools/2264
s3 tools : http://s3tools.org/s3tools
One other thing to note here is that when you create a jobflow on Amazon AWS, it takes a while to go from 'starting' to 'running' mode and hence it sometime is irritating if you are developing a prototype or just fixing a minor bug in your program and want to test it out. It is for this reason why I won't really recommend the Amazon AWS API since your code would then be totally dependent that your inputs and outputs are all supposed to be on S3. If you pass your paths in some properties file or just pass them as command line arguments, you can then run the same code locally on a single node first using the hadoop command and then Amazon AWS using the Elastic Mapreduce ruby client. The only difference that I have really seen is that if you have any 'hadoop fs' commands in your code, for the single node execution using the hadoop command on command line, you need to specify the complete path i.e. '/usr/local/hadoop/bin/hadoop fs -copyToLocal ...' vs. when you execute on Amazon AWS you can just say 'hadoop fs -copyToLocal ...'.
I am currently in the process of trying out k-medoids, again a prototype based clustering method, however, unlike k-means, where the centroids are an approximation of the cluster data-points and not an actual data-point, k-mediod chooses an actual data-points as the centroids as representatives of the clusters. I am testing k-medoids on the movie-lens data with 1 million movie ratings. I will share more insight in the next blog about this.