Hyperparameter tuning is an important step for maximizing the performance of a model. Several Python packages have been developed specifically for this purpose. Scikit-learn provides a few options, with GridSearchCV and RandomizedSearchCV being two of the more popular. Outside of scikit-learn, the Optunity, Spearmint and hyperopt packages are all designed for optimization. In this post, I will focus on the hyperopt package, which provides algorithms that are able to outperform randomized search and can find results comparable to a grid search while fitting substantially fewer models.
The IMDB movie review sentiment classification problem will be used as an illustrative example. This dataset appears in the Kaggle tutorial Bag of Words Meets Bags of Popcorn. The purpose of this post is not to achieve record-breaking performance for this classification model; many people have already taken care of that (see the Kaggle discussion). Instead, I will discuss how you can use the hyperopt package to automate the hyperparameter tuning process. The ideas here are not limited to models built using scikit-learn; they can easily be adapted to any optimization problem. For models built with scikit-learn, hyperopt provides a package called hyperopt-sklearn that is designed to work with scikit-learn models. I will not be discussing that package here. Instead, I will work with the base hyperopt package, which allows greater flexibility.
The model I will use is a fairly simple model for text classification consisting of building a term frequency–inverse document frequency (TF-IDF) matrix from the corpus of training texts, then training the SGDClassifier from scikit-learn with modified-huber loss on the corpus. The scoring metric I will use to evaluate the performance of the model is the area under the receiver operating characteristic curve (ROC AUC) as this is used in the Kaggle tutorial. Training this model with the default hyperparameters already provides a good result with an AUC score of 0.9519 on the test data. Using the procedure that follows I was able to improve the performance to an AUC score of 0.9621.
The data for this tutorial can be downloaded from the Kaggle tutorial page. You should download the labeledTrainData.tsv.zip and testData.tsv.zip files. The unlabeled data will not be used.
First, we need to load our data into a pandas dataframe and add columns for the score (the rating of the review out of 10) and for the class (1 for positive, and 0 for negative). The test data will also be sorted by class, which will be needed later when plotting the ROC curve.
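The loading step can be sketched roughly as follows. This sketch assumes that each review id has the form `<number>_<rating>` (so the rating can be read from the part after the underscore) and that a rating of 7 or higher counts as positive. The helper name `add_score_and_class` and the two-row inline sample are hypothetical stand-ins; with the real files you would pass the path of the unzipped .tsv file to `pd.read_csv` instead.

```python
import io

import pandas as pd


def add_score_and_class(df):
    """Add 'score' (rating out of 10) and 'class' (1 positive, 0 negative).

    Assumption: ids look like '12311_10', where the part after the
    underscore is the rating.
    """
    out = df.copy()
    out["score"] = out["id"].str.split("_").str[1].astype(int)
    out["class"] = (out["score"] >= 7).astype(int)
    return out


# Tiny inline stand-in for testData.tsv (tab-separated, header row).
sample_tsv = "id\treview\n12311_10\tLoved it.\n8348_2\tDull.\n"
test_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Add the derived columns, then sort by class as described above.
test_df = add_score_and_class(test_df)
test_df = test_df.sort_values("class").reset_index(drop=True)
```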
The next step in any text classification problem is cleaning the documents.
When we build the model later, the TfidfVectorizer in sklearn.feature_extraction.text will be used to construct a TF-IDF matrix. TfidfVectorizer is able to preprocess the text itself, but I avoid relying on that here because the TF-IDF matrix will need to be reconstructed for each iteration of the hyperparameter tuning process. By cleaning the documents once, before building the TF-IDF matrix, we avoid repeating this work at every iteration.
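As a rough sketch of the kind of one-time cleaning meant here, the function below strips HTML tags, lowercases the text and keeps only letters. The name `clean_review` and the exact cleaning rules are assumptions for illustration, not necessarily the rules used to produce the results in this post.

```python
import re

# Precompiled patterns: HTML tags such as <br />, and runs of non-letters.
_TAG = re.compile(r"<[^>]+>")
_NON_LETTER = re.compile(r"[^a-z]+")


def clean_review(text):
    """Return a lowercase, letters-only version of a raw review."""
    text = _TAG.sub(" ", text)                  # drop HTML markup
    text = _NON_LETTER.sub(" ", text.lower())   # collapse everything else to spaces
    return text.strip()


cleaned = [clean_review(doc) for doc in ["<br />Great movie!!", "Not   good."]]
```

Because the cleaning happens once up front, each tuning iteration only pays for rebuilding the TF-IDF matrix.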
Finally, we separate the reviews and their classes into separate numpy arrays.
Now that the data is cleaned, we can begin building the model. The model will consist of 3 parts:
- Construct a matrix of term frequency–inverse document frequency (TF-IDF) features from the corpus of movie reviews
- Select the features with the highest values for the chi-squared statistic from the training data
- Train a linear model using modified-huber loss with stochastic gradient descent (SGD) learning
The SGDClassifier will use modified-huber loss with an elasticnet penalty. The elasticnet penalty is a convex combination of the L1 and L2 penalties. The SGDClassifier has a hyperparameter called l1_ratio that controls how much weight the L1 penalty receives. This will be one of the hyperparameters that is tuned, which lets the tuning process select between an L1 penalty, an L2 penalty, or a combination of the two.
A pipeline will be used to make fitting and testing the model easier.
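The three parts listed above could be assembled into a pipeline along these lines. The step names `tfidf`, `kbest` and `clf` and the helper `build_model` are assumptions chosen for this sketch; any names work as long as the same names are used later when setting hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def build_model():
    """Build the TF-IDF -> chi-squared selection -> SGD pipeline."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),           # text -> TF-IDF matrix
        ("kbest", SelectPercentile(chi2)),      # keep top features by chi-squared
        ("clf", SGDClassifier(loss="modified_huber",
                              penalty="elasticnet")),
    ])


model = build_model()
```

With a pipeline, a single call to `model.fit(docs, labels)` runs all three steps in order, and hyperparameters of any step can be addressed as `<step name>__<parameter>`.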
Before we begin the hyperparameter tuning process, we will train the model with the default parameters. The AUC score of the test data with the default hyperparameters is 0.9523. The ROC curve is plotted in Figure 1 below.
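Scoring a model on held-out data can be sketched as below. The helper name `holdout_auc` is hypothetical, and the six-point numeric dataset is a toy stand-in so the example runs on its own; in the post the same idea would be applied to the pipeline with the cleaned reviews.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score


def holdout_auc(model, X_train, y_train, X_test, y_test):
    """Fit on the training data and return the ROC AUC on the test data."""
    model.fit(X_train, y_train)
    # ROC AUC needs continuous scores rather than hard 0/1 predictions.
    scores = model.decision_function(X_test)
    return roc_auc_score(y_test, scores)


# Toy, linearly separable data: negative x is class 0, positive x is class 1.
X_train = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[-2.5], [-0.5], [0.5], [2.5]])
y_test = np.array([0, 0, 1, 1])

clf = SGDClassifier(loss="modified_huber", random_state=0)
auc = holdout_auc(clf, X_train, y_train, X_test, y_test)
```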
We are now ready to set up hyperopt to optimize the ROC AUC. hyperopt works by minimizing an objective function and sampling parameter values from various distributions. Since the goal is to maximize the AUC score, we will have hyperopt minimize 1 – AUC.
10-fold cross-validation will be used for hyperparameter tuning to avoid overfitting to the training data. For each iteration of hyperparameter tuning, the folds will be re-randomized, again to avoid overfitting to one particular set of 10 folds.
Here we will tune 8 hyperparameters:
| Hyperparameter | Pipeline step | Description | Search distribution |
|---|---|---|---|
| ngram_range | TfidfVectorizer | The range of n-grams to use. (1,1) corresponds to only unigrams, and (1,3) would be unigrams, 2-grams and 3-grams. | (1,1), (1,2), (1,3) |
| min_df | TfidfVectorizer | The minimum number (or frequency) of documents a term must appear in to be included. | 1, 2, 3, or 4 documents |
| max_df | TfidfVectorizer | The maximum number (or frequency) of documents a term may appear in and still be included. | Uniformly distributed between 70% and 100% |
| sublinear_tf | TfidfVectorizer | Applies sublinear term-frequency scaling (i.e. tf = 1 + log(tf)). | True or False |
| percentile | SelectPercentile | The percentile score for the chi-squared statistic; any feature not in this percentile is not used. | Uniformly distributed between 50% and 100% |
| alpha | SGDClassifier | The regularization strength, which also sets the learning rate under the default 'optimal' schedule. The SGDClassifier documentation recommends a range for this value; see http://scikit-learn.org/stable/modules/sgd.html#tips-on-practical-use | Log-uniformly distributed |
| n_iter | SGDClassifier | Number of passes over the data; a value of 1e6/n_samples is recommended. | Random integer between 20 and 80 that is divisible by 5 |
| l1_ratio | SGDClassifier | The elastic net mixing parameter. Penalty is (1 – l1_ratio) * L2 + l1_ratio * L1. | Uniform between 0 and 1 |
hyperopt provides several functions for defining various distributions. A full list can be found in the hyperopt documentation. Each hyperparameter is sampled from a distribution. To define the parameter space, a dictionary is used where the keys start with the name of the step in the pipeline, followed by a double underscore, then the name of the hyperparameter for that step. This allows us to use the set_params function of the pipeline easily. Each of the functions in hyperopt also requires a name; I have used the same names as the dictionary keys for simplicity.
Now that the hyperparameter space has been defined, we need to define an objective function for hyperopt to minimize. The objective function will start by randomly splitting the training data into 10 folds. KFold is used to ensure we get a new, random set of folds each time we call the objective function (otherwise we might overfit to the folds). The cross_val_score function with the roc_auc scoring metric is then called to run 10-fold cross-validation on the model. Since our goal is to maximize the ROC AUC, and hyperopt works by minimizing the objective function, we will return 1 – score.mean() from the objective function. To significantly decrease the execution time, you can set the n_jobs parameter in the cross_val_score function so the folds are trained in parallel.
I will note that I am using several of the variables as global variables here. There are a few ways to avoid this: you could either reload the data each time the objective function is called, or you could use a generator so the function loads the data once. When I initially drafted the code for this article I used a generator, but things became more complicated than I wanted for the tutorial because hyperopt will not work with generators directly. You can make hyperopt work with generators by setting up a simple producer-consumer model between two functions, where one function sends new hyperparameter values to the consumer and then returns the new average loss.
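The objective function described above can be sketched as follows. Instead of globals or a generator, this sketch uses a closure, which is one more way to hand the model and data to the objective; that choice, the helper name `make_objective`, and the synthetic numeric data standing in for the reviews are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score


def make_objective(model, X, y, n_splits=10, n_jobs=None):
    """Return an objective(params) function that hyperopt can minimize."""

    def objective(params):
        model.set_params(**params)
        # shuffle=True without a fixed seed gives new random folds per call.
        folds = KFold(n_splits=n_splits, shuffle=True)
        scores = cross_val_score(model, X, y, cv=folds,
                                 scoring="roc_auc", n_jobs=n_jobs)
        # hyperopt minimizes, so return 1 - AUC.
        return 1.0 - scores.mean()

    return objective


# Toy numeric data so the sketch runs without the review corpus.
X_toy, y_toy = make_classification(n_samples=200, n_features=10,
                                   random_state=0)
toy_model = SGDClassifier(loss="modified_huber", penalty="elasticnet")
objective = make_objective(toy_model, X_toy, y_toy)
loss = objective({"alpha": 1e-3, "l1_ratio": 0.5})
```

Passing `n_jobs` through to `cross_val_score` is what allows the folds to be trained in parallel.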
We are now ready to use hyperopt to find optimal hyperparameter values. Here I will use the fmin function from hyperopt. I will also use the Trials object to store information about each iteration. This is useful for processing the information from each iteration afterwards. Here I have set it to run for 10 iterations for illustrative purposes.
After hyperopt is done, you can train the model using the entire training set and test it on the test data. Note the use of the space_eval function. You must use this to get the optimal hyperparameter values. The best dictionary might look like it contains parameters, but it simply contains the encoding used by hyperopt to get the parameter values from the space dictionary.
Running this for 1000 iterations found an optimal training AUC score of 0.9672 corresponding to a test AUC of 0.9621. Using 1000 iterations was easily overkill as the best training AUC score found in the first 100 iterations was 0.9662, so little was gained by running the other 900 iterations. I ran the hyperparameter search on an AWS EC2 c4.4xlarge instance with the n_jobs parameter for cross_val_score set to 10 (i.e. each fold is trained in parallel). Execution took just under 15 hours. The ROC curve corresponding to a training AUC score of 0.9672 along with the ROC curve with no hyperparameter tuning can be seen in Figure 2 below.
The parameter_values.csv file available on the GitHub repository that complements this post contains the parameter values used at each of the 1000 iterations. Plotting each of these parameter values against the iteration gives some interesting results. Figure 3 shows the hyperparameter values that hyperopt sampled at each iteration. Many of the hyperparameters converge to regions surrounding their optimal values within the first 200 iterations. In the first 100 iterations, all of the hyperparameters appear to be randomly sampled before things begin to converge.
Overall, hyperopt provides a nice way to perform automated hyperparameter tuning without the cost associated with a grid search. The optimal hyperparameters should be better than what a randomized search will find, as the hyperparameters seem to converge to values that give improved results. Using hyperopt is fairly easy once you understand how it is set up. Although hyperopt will converge toward optimal values for the hyperparameters, it is still important to have a good understanding of the expected range of values for your hyperparameters.