# Picking regularization parameters the easy way

(tl;dr: Use reciprocal distributions in your scikit-learn randomized-search cross-validation. If you don’t believe that’s easy, scroll down to see how little Python code you need to do this.)

Picking model parameters (“hyperparameters”) is a common problem. I’ve written before on a powerful online-learning approach to parameter optimization using Gaussian Process Regression. There are other similar approaches out there, like Spearmint etc.

But a lot of the time we don’t necessarily need such a powerful tool – we’d rather have something quick and easy that is available in scikit-learn (sklearn). For example, let’s say we want to use a classification model with one or two regularization parameters – what’s an easy way to pick values for them?

## Cross-validation and grid-search

Cross-validation (CV) has been explained well by other folks so I won’t rehash it here. But let’s talk about deciding which parameter value choices to try.

Let’s say we expect our regularization parameter to have its optimal value between 1e-7 and 1e2. In this case we might try this set of ten choices:

$\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10^1, 10^2\}$

If we exhaustively try these, we’d have to run ten CVs. We could also try more values, which would take longer, or fewer values, which might miss optimality by quite a bit.

What if we have two parameters? If we again try ten choices per parameter, we’re now talking a hundred CVs. This kind of exhaustive search is called a grid search because we are searching over a grid of every combination of parameter choices.

If you have even more parameters, or are trying to do a search that is more fine-grained or over a larger range, you can see that the number of CVs to run will really balloon into a very time-consuming endeavor.
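To make the growth concrete, here is a small sketch that counts grid combinations for the ten choices above. The names `choices` and `grid_size` are made up for this illustration.

```python
from itertools import product

# The ten candidate values from the text: 10^-7 up to 10^2
choices = [10.0 ** k for k in range(-7, 3)]

def grid_size(n_params, values=choices):
    """Number of combinations an exhaustive grid search has to try."""
    return len(values) ** n_params

# Enumerate the two-parameter grid explicitly to confirm the count
two_param_grid = list(product(choices, repeat=2))
print("Two-parameter grid:", len(two_param_grid), "combinations")

for n in (1, 2, 3, 6):
    print(n, "parameter(s):", grid_size(n), "CV runs")
```

Each extra parameter multiplies the number of CV runs by ten, which is why the exhaustive approach gets out of hand so quickly.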

## Randomized search

Instead of a grid search that exhaustively combs through every combination of parameter choices, what if we just picked a limited number of combinations, say fifty, at random? This would clearly be quicker than running a hundred, or a thousand, or a million CVs. But the results would be worse, right?

In fact, it turns out that randomized search can do about as well as a much longer exhaustive search. This paper by Bergstra & Bengio explains why, and below is a beautiful figure from the paper that illustrates one mechanism of how this works:

In the figure above the two parameters are shown on the vertical and horizontal axis, and their contribution is shown in green and yellow. You can see that randomized search does a better job of nailing the sweet spot for the parameter that really matters – so long as we don’t just use the same grid points for the random search, but are actually searching in the continuous space. We’ll see how to do this in a moment.
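Here is a rough sketch of that mechanism, assuming a budget of nine trials over two parameters that each live in [0, 1] (both the budget and the range are arbitrary choices for illustration). It counts how many distinct values of a single parameter each strategy actually tests.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 9

# Grid search: 3 values per parameter, so 3 x 3 = 9 trials
grid_axis = np.linspace(0.0, 1.0, 3)
grid_points = np.array([(a, b) for a in grid_axis for b in grid_axis])

# Randomized search: 9 trials drawn from the continuous unit square
random_points = rng.uniform(0.0, 1.0, size=(n_trials, 2))

# How many distinct values of the first parameter did each strategy try?
grid_distinct = len(np.unique(grid_points[:, 0]))
random_distinct = len(np.unique(random_points[:, 0]))

print("Grid search tried", grid_distinct, "distinct values of parameter 1")
print("Random search tried", random_distinct, "distinct values of parameter 1")
```

With the same budget, the grid only ever probes three values of the parameter that matters, while the random draws probe nine. That advantage disappears if the random search just picks from the same three grid values.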

## Scikit-learn

Scikit-learn has very convenient built-in support for CV-based parameter search, for both the exhaustive grid and the randomized methods. Both can be used with any sklearn model, or any sklearn-compatible model for that matter.

I’ll focus on the randomized search method, which is called RandomizedSearchCV().

You’ll notice the documentation in the link above echoes what we said in the last section: “it is highly recommended to use continuous distributions for continuous parameters.” So let’s talk about how to do this.

## Choosing a continuous space for regularization parameters

Look at what we intuitively did for the grid search case: we laid out a few options like this: $\{10^{-7}, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10^1, 10^2\}$

What kind of selection is this? Or to put it formally, if you think about these values as sample pulls from a distribution, what kind of distribution is this?

One way to think about this is: we want about an equal chance of ending up with a number of any order of magnitude within our range of interest. Let’s put this a little more concretely: we’d like equal chances of ending up with a number in the interval [1e-7, 1e-6] as in the interval [1e1, 1e2].

If you think about it a little, this is not a uniform distribution. A uniform distribution would be many orders of magnitude more likely to give you a number in the latter interval than in the former.
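A quick way to convince yourself is to draw from a uniform distribution over the whole range and count where the samples land. This is just a sketch; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw uniformly over the full range of interest [1e-7, 1e2]
samples = rng.uniform(1e-7, 1e2, size=1_000_000)

# Count how many samples land in the smallest and largest decades
in_small = int(np.sum((samples >= 1e-7) & (samples <= 1e-6)))
in_large = int(np.sum((samples >= 1e1) & (samples <= 1e2)))

# For a uniform distribution the odds scale with interval length
length_ratio = (1e2 - 1e1) / (1e-6 - 1e-7)

print("Samples in [1e-7, 1e-6]:", in_small)
print("Samples in [1e1, 1e2]:  ", in_large)
print("Ratio of interval lengths:", length_ratio)
```

Roughly nine out of ten samples fall in the top decade, and the bottom decade is essentially never visited.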

Exponential then? Nope, not that either.

I had to figure this out on my own with some head-scratching and math-scribbling. It turns out that what we need here is a reciprocal distribution: this is a distribution with the probability density function (pdf) given by:

$f(x) = \text{constant}\times\cfrac{1}{x}$

where $x$ is limited to a specified range. In our case, the range is our range of interest: [1e-7, 1e2].
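As a quick sanity check, the constant can be worked out from the requirement that the pdf integrates to 1. Writing $a$ and $b$ for the two ends of the range (names introduced here just for illustration):

$\int_a^b \text{constant}\times\cfrac{1}{x}\,dx = \text{constant}\times\ln\cfrac{b}{a} = 1 \quad\Rightarrow\quad \text{constant} = \cfrac{1}{\ln(b/a)}$

The probability of landing in any sub-interval $[p, q]$ inside the range is then:

$P(p \le x \le q) = \cfrac{\ln(q/p)}{\ln(b/a)}$

With $a = 10^{-7}$ and $b = 10^{2}$, any single decade has $q/p = 10$, so its probability is $\ln 10 / (9 \ln 10) = 1/9$, no matter which of the nine decades it is.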

Defining this distribution for our regularization parameters will give us the kind of random picks we want – equiprobable for any order of magnitude within the range. Try it out:

    # Pick ten random choices from our reciprocal distribution
    import scipy.stats
    scipy.stats.reciprocal.rvs(a=1e-7, b=1e2, size=10)
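Ten samples are too few to judge by eye, so here is a sketch that draws many more and counts how many fall in each order of magnitude. The sample size and seed are arbitrary. As far as I know, newer SciPy releases also expose the same distribution under the name `scipy.stats.loguniform`.

```python
import numpy as np
import scipy.stats

# Draw a large sample from the reciprocal distribution on [1e-7, 1e2]
samples = scipy.stats.reciprocal.rvs(a=1e-7, b=1e2, size=90_000, random_state=0)

# Order of magnitude of each sample: -7 for [1e-7, 1e-6), ..., 1 for [1e1, 1e2]
exponents = np.floor(np.log10(samples)).astype(int)
exponents = np.clip(exponents, -7, 1)  # guard against edge effects at the ends

# Count the samples in each of the nine decades
counts = np.bincount(exponents + 7, minlength=9)

for k, count in zip(range(-7, 2), counts):
    print(f"[1e{k}, 1e{k + 1}): {count}")
```

Each of the nine decades should receive roughly a ninth of the samples, about 10,000 each here.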


## Putting it all together

Finally the fun part! Here’s the Python code for the whole thing:

    # Imports
    from sklearn.model_selection import RandomizedSearchCV, train_test_split
    import scipy.stats
    from polylearn import FactorizationMachineClassifier

    # X (features) and y (binary labels) are assumed to be loaded already
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    # Specify parameters and distributions to sample from
    param_dist = {
        "alpha": scipy.stats.reciprocal(a=1e-7, b=1e2),
        "beta": scipy.stats.reciprocal(a=1e-7, b=1e2),
        "loss": ["squared_hinge", "logistic"],
        "degree": [2, 3],
    }

    # Model type: in this case, a Factorization Machine (FM)
    fm = FactorizationMachineClassifier(max_iter=1000)

    # Now do the search
    random_search = RandomizedSearchCV(
        fm, param_distributions=param_dist, n_iter=50,
        scoring='roc_auc', return_train_score=False
    )
    random_search.fit(X_train, y_train)

    # Show key results; details are in random_search.cv_results_
    print("\nBest model:", random_search.best_estimator_)
    print("\nTest score for best model:", random_search.score(X_test, y_test))


#### Notes:

• Here the model I’m trying to design is polylearn’s FactorizationMachineClassifier. This may be overkill for this toy dataset, but I’m using it because:
  • It shows that you can use this approach not just for sklearn models but also for any sklearn-compatible model
  • It’s a good showcase for multiple parameters of which some are continuous and some discrete
• Of course you could instead use any other model you like, like LogisticRegression
• You can see how convenient it is to specify the continuous parameters alpha and beta as random variables. The 50 search iterations will automatically pull values to use from the specified distributions (reciprocal in this case).
• I’m using area under the ROC curve as my success criterion; you could use any other choice you like.
• For a convenient way to inspect the search results in more detail, see the sample code on this page (specifically the report() function).
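If you don’t have polylearn installed, here is a self-contained sketch of the same recipe using LogisticRegression on a synthetic dataset. The dataset from `make_classification`, the choice of 20 iterations, and the pandas summary are my assumptions for illustration, not part of the code above. Note that LogisticRegression’s `C` is the inverse of the regularization strength, but the same range of orders of magnitude still makes a reasonable search space.

```python
import pandas as pd
import scipy.stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# One continuous regularization parameter, sampled from a reciprocal distribution
param_dist = {"C": scipy.stats.reciprocal(a=1e-7, b=1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays, so it loads straight into a DataFrame
results = pd.DataFrame(search.cv_results_)
columns = ["param_C", "mean_test_score", "std_test_score"]
top = results.sort_values("rank_test_score")[columns].head(5)

print(top)
test_score = search.score(X_test, y_test)
print("Test ROC AUC for best model:", test_score)
```

Sorting by `rank_test_score` puts the best candidates first, which makes it easy to see whether the good values of `C` cluster in one part of the range.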

Hope that helps. Happy reciprocal-distribution random-searching!

## Postscript

Sergey Feldman points out a simpler and more intuitive way to think about the reciprocal distribution: if we have a random variable $X$ with a uniform distribution, then $Y = 10^X$ has a reciprocal distribution.

In other words, the distribution we use above (a reciprocal distribution in the range $[10^{-7}, 10^{2}]$) gives us a uniform sampling of exponents in the range $[-7, 2]$, which is what we want.
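Here is a small sketch of that equivalence: sample the exponents uniformly, raise 10 to those powers, and compare the result against SciPy’s reciprocal distribution using a Kolmogorov-Smirnov test. The sample size and seed are arbitrary choices.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# X: exponents drawn uniformly from [-7, 2]
exponents = rng.uniform(-7, 2, size=100_000)

# Y = 10^X should follow a reciprocal distribution on [1e-7, 1e2]
samples = 10.0 ** exponents

# Compare the empirical distribution of Y with the reciprocal CDF
reciprocal = scipy.stats.reciprocal(a=1e-7, b=1e2)
ks = scipy.stats.kstest(samples, reciprocal.cdf)

print("KS statistic:", ks.statistic)
print("p-value:", ks.pvalue)
```

A small KS statistic means the two distributions are practically indistinguishable, so in a pinch you can get the same kind of samples with nothing more than NumPy.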