*(tl;dr: Use reciprocal distributions in your scikit-learn randomized-search cross-validation. If you don’t believe that’s easy, scroll down to see how little Python code you need to do this.)*

Picking model parameters (“hyperparameters”) is a common problem. I’ve written before on a powerful online-learning approach to parameter optimization using Gaussian Process Regression. There are other similar approaches out there, like Spearmint.

But a lot of the time we don’t necessarily need such a powerful tool – we’d rather have something quick and easy that is available in scikit-learn (sklearn). For example, let’s say we want to use a classification model with one or two regularization parameters – what’s an easy way to pick values for them?

## Cross-validation and grid-search

Cross-validation (CV) has been explained well by other folks so I won’t rehash it here. But let’s talk about deciding which parameter value choices to try.

Let’s say we expect our regularization parameter to have its optimal value between 1e-7 and 1e2. In this case we might try ten choices, one per order of magnitude: 1e-7, 1e-6, and so on up to 1e2.
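As a quick sketch (assuming NumPy; the variable name `choices` is just illustrative), one way to lay out ten such logarithmically spaced values is with `np.logspace`:

```python
import numpy as np

# Ten logarithmically spaced choices from 1e-7 to 1e2,
# one per order of magnitude
choices = np.logspace(-7, 2, num=10)
print(choices)
```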

If we exhaustively try these, we’d have to run ten CVs. We could also try more values, which would take longer, or fewer values, which might miss optimality by quite a bit.

What if we have two parameters? If we again try ten choices per parameter, we’re now talking a hundred CVs. This kind of exhaustive search is called a grid search because we are searching over a grid of every combination of parameter choices.
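To see the blow-up concretely, here’s a minimal sketch that enumerates the full grid for two parameters (the parameter values themselves are hypothetical):

```python
import itertools

import numpy as np

# Ten hypothetical choices per parameter, spanning the same range for both
choices = np.logspace(-7, 2, num=10)

# A grid search tries every combination of the two parameters
grid = list(itertools.product(choices, choices))
print(len(grid))  # 100 combinations, i.e. 100 CVs to run
```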

If you have even more parameters, or are trying to do a search that is more fine-grained or over a larger range, you can see that the number of CVs to run will really balloon into a very time-consuming endeavor.

## Randomized search

Instead of a grid search exhaustively combing through every combination of parameter choices, what if we just picked a limited number of combinations – say, fifty – at random? Obviously, this would make the process quicker than running a hundred, or a thousand, or a million CVs. But surely the results would be worse, right?

In fact, it turns out that randomized search can do about as well as a much longer exhaustive search. This paper by Bergstra & Bengio explains why, and below is a beautiful figure from the paper that illustrates one mechanism of how this works:

In the figure above, the two parameters are shown on the vertical and horizontal axes, and their contributions are shown in green and yellow. You can see that randomized search does a better job of nailing the sweet spot for the parameter that really matters – *so long as we don’t just use the same grid points for the random search, but actually search in the continuous space*. We’ll see how to do this in a moment.

## Scikit-learn

Scikit-learn has very convenient built-in support for CV-based parameter search for both the exhaustive grid and randomized method. Both can be used with any sklearn model, or any sklearn-compatible model for that matter.

I’ll focus on the randomized search method, which is called *RandomizedSearchCV()*.

You’ll notice the documentation in the link above echoes what we said in the last section: *“it is highly recommended to use continuous distributions for continuous parameters.”* So let’s talk about how to do this.

## Choosing a continuous space for regularization parameters

Look at what we intuitively did for the grid search case: we laid out ten logarithmically spaced options, one per order of magnitude from 1e-7 to 1e2.

What kind of selection is this? Or to put it formally, if you think about these values as sample pulls from a distribution, what kind of distribution is this?

One way to think about this is: we want about an equal chance of ending up with a number of any order of magnitude within our range of interest. Let’s put this a little more concretely: we’d like equal chances of ending up with a number in the interval [1e-7, 1e-6] as in the interval [1e1, 1e2].

If you think about it a little, this is *not* a uniform distribution. A uniform distribution would be many orders of magnitude more likely to give you a number in the latter interval than in the former.
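A quick sketch makes this concrete, using scipy’s uniform distribution over our range of interest:

```python
import scipy.stats

# Uniform distribution over our range of interest [1e-7, 1e2]
u = scipy.stats.uniform(loc=1e-7, scale=1e2 - 1e-7)

# Probability of landing in [1e-7, 1e-6] vs. [1e1, 1e2]
p_low = u.cdf(1e-6) - u.cdf(1e-7)
p_high = u.cdf(1e2) - u.cdf(1e1)

print(p_low)   # ~9e-9: essentially never
print(p_high)  # ~0.9: the vast majority of draws
```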

Exponential then? Nope, not that either.

I had to figure this out on my own with some head-scratching and math-scribbling. It turns out that what we need here is a reciprocal distribution: this is a distribution with the probability density function (pdf) given by:

f(x) = 1 / (x · ln(b/a))

where *x* is limited to a specified range [a, b]. In our case, the range is our range of interest: [1e-7, 1e2].
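As a sanity check, here’s a sketch comparing this pdf, 1/(x · ln(b/a)), against scipy’s built-in implementation (`reciprocal_pdf` is just an illustrative helper name):

```python
import numpy as np
import scipy.stats

a, b = 1e-7, 1e2

def reciprocal_pdf(x):
    # pdf of the reciprocal distribution on [a, b]
    return 1.0 / (x * np.log(b / a))

# The hand-rolled pdf should match scipy's at a few sample points
for x in [1e-6, 1e-3, 1.0, 50.0]:
    assert np.isclose(reciprocal_pdf(x), scipy.stats.reciprocal.pdf(x, a=a, b=b))
```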

Defining this distribution for our regularization parameters will give us the kind of random picks we want – equiprobable for any order of magnitude within the range. Try it out:

```python
# Pick ten random choices from our reciprocal distribution
import scipy.stats

scipy.stats.reciprocal.rvs(a=1e-7, b=1e2, size=10)
```

## Putting it all together

Finally the fun part! Here’s the Python code for the whole thing:

```python
# Imports
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
import scipy.stats
from polylearn import FactorizationMachineClassifier

# Load and split data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Specify parameters and distributions to sample from
param_dist = {
    "alpha": scipy.stats.reciprocal(a=1e-7, b=1e2),
    "beta": scipy.stats.reciprocal(a=1e-7, b=1e2),
    "loss": ["squared_hinge", "logistic"],
    "degree": [2, 3],
}

# Model type: in this case, a Factorization Machine (FM)
fm = FactorizationMachineClassifier(max_iter=1000)

# Now do the search
random_search = RandomizedSearchCV(
    fm,
    param_distributions=param_dist,
    n_iter=50,
    scoring='roc_auc',
    return_train_score=False,
)
random_search.fit(X_train, y_train)

# Show key results; details are in random_search.cv_results_
print("\nBest model:", random_search.best_estimator_)
print("\nTest score for best model:", random_search.score(X_test, y_test))
```

#### Notes:

- Here the model I’m trying to design is polylearn’s FactorizationMachineClassifier. This may be overkill for this toy dataset, but I’m using it because:
  - It shows that you can use this approach not just for sklearn models but also for any sklearn-compatible model
  - It’s a good showcase for multiple parameters, some continuous and some discrete
- Of course, you could instead use any other model you like, such as LogisticRegression
- You can see how convenient it is to specify the continuous parameters *alpha* and *beta* as random variables. The 50 search iterations will automatically pull values to use from the specified distributions (reciprocal in this case).
- I’m using area under the ROC curve as my success criterion; you could use any other choice you like.
- For a convenient way to check out the search results in more detail, check out the sample code on this page (specifically the *report()* function).

Hope that helps. Happy reciprocal-distribution random-searching!

## Postscript

Sergey Feldman points out a simpler and more intuitive way to think about the reciprocal distribution: if we have a random variable *X* with a uniform distribution, then *Y* = 10^*X* has a reciprocal distribution.

In other words, the distribution we use above (a reciprocal distribution in the range [10^-7, 10^2]) gives us a uniform sampling of *exponents* in the range [-7, 2], which is what we want.
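A quick numerical sketch of this equivalence (sample size and variable names are illustrative): exponentiating uniform exponents should give samples whose median matches that of scipy’s reciprocal distribution.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Draw uniform exponents in [-7, 2], then exponentiate
exponents = rng.uniform(-7, 2, size=100_000)
samples = 10.0 ** exponents

# log10 of the samples is uniform, so the median exponent is near -2.5,
# matching the median of scipy's reciprocal distribution on [1e-7, 1e2]
print(np.log10(np.median(samples)))                              # close to -2.5
print(np.log10(scipy.stats.reciprocal(a=1e-7, b=1e2).median()))  # exactly -2.5
```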