Let’s say you have a data set with a million or more training points (“rows”). What’s a reasonable way to implement supervised learning?

One approach, of course, is to only use a subset of the rows. This has its merits, but there may be various reasons why you want to use the entire available data. What then?

Andy Müller created an excellent cheat sheet, thumbnailed below, showing which machine learning techniques are likely to work best in different situations (clickable version here). It’s obviously not meant to be a rigid rule, but it’s still a good place to start answering the question above, or most similar questions.

What we see from the above is that our situation points us towards **Stochastic Gradient Descent (SGD)** regression or classification.

Why SGD? The problem with standard (usually gradient-descent-based) regression/classification implementations, support vector machines (SVMs), random forests, etc. is that they do not scale effectively to the data sizes we are talking about, because of the need to load all the data into memory at once and/or nonlinear computation time. SGD, however, can deal with large data sets effectively by breaking up the data into chunks and processing them sequentially, as we will see shortly; this is often called **minibatch learning**. The fact that we only need to load one chunk into memory at a time makes it useful for large-scale data, and the fact that it works iteratively allows it to be used for **online learning** as well. SGD can be used for regression or classification with any regularization scheme (ridge, lasso, etc.) and any loss function (squared loss, logistic loss, etc.).

What *is* SGD? It’s been explained very nicely by Andrew Ng in his Coursera class (Week 10: Large Scale Machine Learning), and Léon Bottou has a somewhat more in-depth tutorial on it. Their explanations are excellent, and there’s no point in my duplicating them, so I’ll move on to implementation using Python and the scikit-learn (sklearn) library.

The key feature of sklearn’s SGDRegressor and SGDClassifier classes that we’re interested in is the *partial_fit()* method; this is what supports minibatch learning. Whereas other estimators need to receive the entire training data in one go, there is no such necessity with the SGD estimators. One can, for instance, break up a data set of a million rows into a thousand chunks, then successively execute *partial_fit()* on each chunk. Each time one chunk is complete, it can be thrown out of memory and the next one loaded in, so memory needs are limited to the size of one chunk, not the entire data set.

(It’s worth mentioning that the SGD estimators are not the only ones in sklearn that support minibatch learning; a variety of others are listed here. One can use this approach with any of them.)
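One wrinkle worth knowing for the classifier variant: the first call to *partial_fit()* must be told the full set of class labels via the *classes* argument, since any single chunk may not contain every class. A minimal sketch, with made-up toy chunks standing in for a real minibatch stream:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = np.array([0, 1])  # declare every label up front

# Toy chunks; note each one contains only a single class
chunks = [
    (np.array([[0.0], [0.1]]), np.array([0, 0])),
    (np.array([[0.9], [1.0]]), np.array([1, 1])),
]

for X_chunk, y_chunk in chunks:
    clf.partial_fit(X_chunk, y_chunk, classes=all_classes)
```

Without *classes*, the first call would fail (or the model would never learn about labels it hasn't seen yet).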

Finally, the use of a generator in Python makes this easy to implement.

Below is a piece of simplified Python code for instructional purposes showing how to do this. It uses a generator called ‘batcherator’ to yield chunks one at a time, to be iteratively trained on using *partial_fit()* as described above.

```python
from sklearn.linear_model import SGDRegressor

def iter_minibatches(chunksize):
    # Provide chunks one by one
    chunkstartmarker = 0
    while chunkstartmarker < numtrainingpoints:
        chunkrows = range(chunkstartmarker, chunkstartmarker + chunksize)
        X_chunk, y_chunk = getrows(chunkrows)
        yield X_chunk, y_chunk
        chunkstartmarker += chunksize

def main():
    batcherator = iter_minibatches(chunksize=1000)
    model = SGDRegressor()

    # Train model
    for X_chunk, y_chunk in batcherator:
        model.partial_fit(X_chunk, y_chunk)

    # Now make predictions with trained model
    y_predicted = model.predict(X_test)
```

We haven't said anything about the *getrows()* function in the code above, since it pretty much depends on the specifics of where the data resides. Common situations might involve the data being stored on disk, stored in distributed fashion, obtained from an interface etc.
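As one illustration, suppose the training data lives on disk as a single NumPy array with the label in the last column; a *getrows()* built on a memory map would then read only the requested rows from disk. The file name and column layout here are assumptions for the sketch, not part of the original code:

```python
import numpy as np

def getrows(chunkrows, path="train.npy"):
    # Memory-map the file so only the requested rows are read from disk
    data = np.load(path, mmap_mode="r")
    chunk = np.asarray(data[list(chunkrows)])  # materialize just this chunk
    return chunk[:, :-1], chunk[:, -1]         # features, label (last column)
```

Other storage backends (a database, HDF5, a distributed store) would swap in their own row-fetching logic behind the same interface.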

Also, while this simplistic code calls SGDRegressor with default arguments, this may not be the best thing to do. It is best to carry out careful cross-validation to determine the best hyperparameters to use, especially for regularization. There is a bunch more practical info on using sklearn’s SGD estimators here.
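For instance, one could tune the regularization strength and penalty type via cross-validation on a manageable subsample before committing to a full minibatch pass over all the data. The grid below is just an illustrative starting point, and the synthetic data stands in for a subsample of the real training set:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for a subsample of the real data
X_sample, y_sample = make_regression(n_samples=500, n_features=10, random_state=0)

param_grid = {
    "alpha": [1e-5, 1e-4, 1e-3],            # regularization strength
    "penalty": ["l2", "l1", "elasticnet"],  # ridge, lasso, or a mix
}
search = GridSearchCV(SGDRegressor(max_iter=1000), param_grid, cv=3)
search.fit(X_sample, y_sample)
best_params = search.best_params_  # reuse these when calling partial_fit at scale
```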

Hopefully this post, and the links within, give you enough info to get started. Happy large-scale learning!

Is a million rows really too much to fit in memory? It’s easy to check out 100-200GB hosts these days, so I would have expected the guideline limit to be much higher.


The answer, as with most questions of this type, is “It depends!” In my use case which inspired this post, I was handling data sets with millions of rows and trillions of features with varying sparsity using a machine with 24GB of RAM shared with other users, and in this situation out-of-core learning was necessary. YMMV.


Ashish – you are clearly getting lost in specifics, or want to debate, and are missing the more general point of the post. I’ll sum it up for you: the author is saying that, regardless of the data size, you can use sklearn to perform minibatch operations, so you don’t have to process in one go a dataset that could potentially be too big to fit into memory.


can you recommend a good book on such similar mathematical optimization?


i’m a bit lost. is numtrainingpoints a constant from the model?


numtrainingpoints is the total number of rows in the training dataset.


Each time I call partial_fit it only performs one iteration over the batch. How can I use partial_fit with more than one iteration per batch?


The link to the list of models supporting this is broken. The new link is https://sklearn.org/modules/scaling_strategies.html


Link fixed now. Thanks for pointing this out!
