“big data”

Integrating Spark with scikit-learn, visualizing eigenvectors, and fun!

Three topics in this post, to make up for the long hiatus!

1.

Apache Spark’s MLlib has built-in support for many machine learning algorithms, but not all of them. Fortunately, scikit-learn (sklearn) functions can be called from inside Spark jobs, so the work gets spread across the cluster, which can make things very efficient. That’s what I’m going to talk about here.

As a practical example, let’s consider k-Nearest-Neighbors (k-NN). Spark’s MLlib doesn’t have built-in support for this, but scikit-learn does.

So let’s talk about sklearn for a minute. If you have a large number of points, say a million or more, and you want to obtain nearest neighbors for all of them (as may be the case with a k-NN-based recommender system), sklearn’s NearestNeighbors on a single machine can be hard to work with. The fit() method isn’t what takes a long time; the expensive part is subsequently producing the results for the large number of queries with kneighbors().

In the most straightforward deployment, if you try to send kneighbors() all point vectors in a single large matrix and ask it to come up with nearest neighbors for all of them in one fell swoop, it quickly exhausts the RAM and brings the machine to a crawl. Alternatively, the batch iteration method that I mentioned before is a good solution: after performing the initial fit, you can break the large matrix into chunks and obtain their neighbors chunk by chunk. This eases memory consumption, but can take a long time.
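The chunk-by-chunk approach can be sketched roughly as follows. This is a minimal illustration, not necessarily the code used for this post: the helper name kneighbors_in_chunks, the chunk_size parameter, and the random stand-in data are all assumptions made up for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def kneighbors_in_chunks(knn, queries, chunk_size=10_000):
    """Query a fitted NearestNeighbors index chunk by chunk.

    chunk_size is a hypothetical tuning knob: smaller values lower peak
    memory, at the cost of more calls to kneighbors().
    """
    distances, indices = [], []
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        dist, idx = knn.kneighbors(chunk)
        distances.append(dist)
        indices.append(idx)
    return np.vstack(distances), np.vstack(indices)


# Small random data standing in for the real vectors
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 8))

# Fit once, then query in chunks of 128 rows
knn = NearestNeighbors(n_neighbors=5).fit(vectors)
chunk_dist, chunk_idx = kneighbors_in_chunks(knn, vectors, chunk_size=128)
```

Only one chunk’s worth of results is being computed at any moment, which is what keeps memory in check; the loop itself is still sequential, which is why it can be slow.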

There are of course approximate nearest-neighbor implementations such as Spotify’s Annoy. In my use case, Annoy actually did worse than sklearn’s exact neighbors, because Annoy does not have built-in support for matrices: if you want to evaluate nearest neighbors for n query points, you have to loop through your n queries one at a time. sklearn’s k-NN implementation, by contrast, can take in a single matrix containing many query points and return nearest neighbors for all of them in one call, relatively quickly. Your mileage may vary. I’ll talk about Annoy again a little later.

To summarize the problem:

  • sklearn has good support for k-NN; Spark doesn’t.
  • sklearn’s k-NN fit() isn’t a problem.
  • sklearn’s k-NN kneighbors() is a computational bottleneck for large data sets, and is a good candidate for parallelization.

This is where Spark comes in. All we have to do is insert kneighbors() into a Spark map function after setting the stage for it. This is especially neat if you’re already working in Spark and/or if your data is already in HDFS to begin with, as is commonly the case.

Below is a simplified Python (PySpark) code snippet to make this approach clear:

# Imports
from pyspark import SparkConf, SparkContext
from sklearn.neighbors import NearestNeighbors

# Reuse the running SparkContext, or create one
sc = SparkContext.getOrCreate(SparkConf())

# Let's say we already have an RDD containing
# all our vectors, called myvecs
myvecs.cache()

# Fit the k-NN index locally on the driver, and broadcast it
myvecscollected = myvecs.collect()
knnobj = NearestNeighbors().fit(myvecscollected)
bc_knnobj = sc.broadcast(knnobj)

# Get neighbors for each point, in parallel on the workers.
# kneighbors() expects a 2D array, so wrap each vector in a list.
results = myvecs.map(lambda x: bc_knnobj.value.kneighbors([x]))

Boom! That’s all you need. The key point in the above code is that we were able to call sklearn’s NearestNeighbors kneighbors() method inside of Spark’s map(), which means that Spark can run the queries in parallel across its workers.
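A possible variation, sketched here as an assumption rather than as part of the original approach, is to query once per partition instead of once per point, so that kneighbors() receives a whole matrix at a time. The helper name neighbors_for_partition is hypothetical, and the Spark call is shown only in a comment so that the example can run locally without a cluster.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbors_for_partition(rows, knn):
    """Yield (distances, indices) for every vector in one partition.

    rows is the iterator Spark hands to mapPartitions(); knn is a fitted
    NearestNeighbors object (for example bc_knnobj.value).
    """
    batch = np.array(list(rows))
    if batch.size == 0:
        return  # empty partition, nothing to yield
    distances, indices = knn.kneighbors(batch)
    for dist_row, idx_row in zip(distances, indices):
        yield dist_row, idx_row


# Hypothetical use inside Spark, with the names from the snippet above:
# results = myvecs.mapPartitions(
#     lambda rows: neighbors_for_partition(rows, bc_knnobj.value))

# Local check without a cluster: a plain iterator stands in for a partition
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
knn = NearestNeighbors(n_neighbors=2).fit(points)
local_results = list(neighbors_for_partition(iter(points), knn))
```

Whether this is faster than the per-point map() depends on partition sizes and worker memory, so treat it as something to benchmark rather than a guaranteed win.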

(You can do the same thing using Annoy instead of sklearn, except that instead of broadcasting the Annoy object to workers, you need to serialize it to a file and distribute the file to workers instead. This code shows you how.)

In my use case, harnessing Spark to distribute my sklearn code brought my runtime down from hours to minutes!

Update: between the time I first considered this problem and now, a Spark package for distributing sklearn functionality over Spark has also emerged, as well as a more comprehensive integration called sparkit-learn. So there are several solutions available now. I still like the approach shown above for its simplicity, and for not requiring any extraneous code.

2.

A beautiful interactive visualization of eigenvectors, courtesy of the wonderful folks at Setosa.

The thing that I love about this viz is that it doesn’t just show how eigenvectors are computed, it gives you an intuition for what they mean.
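If you want to poke at the same idea numerically, here is a tiny NumPy sketch. The matrix is an arbitrary one picked for illustration and has nothing to do with the Setosa page; it just checks that multiplying an eigenvector by the matrix rescales it without changing its direction.

```python
import numpy as np

# A small symmetric matrix chosen only for illustration
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Columns of `eigenvectors` pair up with the entries of `eigenvalues`
eigenvalues, eigenvectors = np.linalg.eig(A)

for lam, v in zip(eigenvalues, eigenvectors.T):
    # A @ v points along v, scaled by the eigenvalue lam
    print(lam, v, A @ v)
```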

3.

Lastly, and just for fun: Is it Pokemon or Big Data? ☺