Hyperparameter Tuning, Batch Normalization and Programming Frameworks
Tuning process
Hello everyone and welcome back. So far you have seen that training neural networks involves setting many different hyperparameters. Now, how do you find a good set of values for them? In this video, I would like to share some guidelines and tips on how to organize the hyperparameter tuning process systematically, which hopefully will help you converge on good hyperparameter settings more efficiently.
One of the hardest things about training deep networks is the sheer number of hyperparameters you have to deal with, from the learning rate $\alpha$ to the Momentum parameter $\beta$ or, if you use Adam as the optimization algorithm, $\beta_{1}$, $\beta_{2}$ and $\varepsilon$. Maybe you have to choose the number of layers, maybe you have to choose the number of hidden units in each layer, and maybe you also want to use learning rate decay, so that you are not using a single fixed learning rate $\alpha$. Then, of course, you may also need to choose the mini-batch size.
It turns out that some hyperparameters are more important than others. In my opinion, for most learning applications, the learning rate $\alpha$ is the most important hyperparameter to tune.
In addition to $\alpha$, there are a few other hyperparameters I would tune next, such as the Momentum parameter $\beta$, for which 0.9 is a good default value. I also tune the mini-batch size to make sure the optimization algorithm runs efficiently, and I often tune the number of hidden units. These three, which I have circled in orange, are the next most important to me after $\alpha$. Third in importance are the remaining factors: the number of layers can sometimes make a big difference, as can learning rate decay. When using the Adam algorithm, I in fact almost never tune $\beta_{1}$, $\beta_{2}$ and $\varepsilon$. I always use 0.9, 0.999 and $10^{-8}$, although you can tune them if you want.
But hopefully this gives you a rough idea of which hyperparameters matter most: $\alpha$ is definitely the most important, next are the ones I've circled in orange, then the ones I've circled in purple. This is not a hard and fast rule, though, and I think other deep learning researchers may disagree with me or have different intuitions.
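The ranking above can be written down as a small lookup table. This is only a sketch: the tiers and default values come from the lecture, but the key names and the helper `tuning_order` are hypothetical and chosen for illustration.

```python
# Tiers follow the priorities stated in the lecture.
# The key names are hypothetical labels, not identifiers from any library.
TUNING_PRIORITY = {
    1: ["learning_rate"],
    2: ["momentum_beta", "mini_batch_size", "hidden_units"],
    3: ["num_layers", "learning_rate_decay"],
}

# Values the lecturer says he uses as defaults and rarely tunes.
DEFAULTS = {
    "momentum_beta": 0.9,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}


def tuning_order(priority):
    """Flatten the tiers into one list, most important hyperparameter first."""
    return [name for tier in sorted(priority) for name in priority[tier]]


order = tuning_order(TUNING_PRIORITY)
```

Here `order` starts with the learning rate, followed by the second-tier and then the third-tier hyperparameters. As noted above, other practitioners may rank these differently.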
Now, if you are trying to tune some set of hyperparameters, how do you choose which values to explore? In earlier generations of machine learning algorithms, if you had two hyperparameters, which I'll call hyperparameter 1 and hyperparameter 2, it was common practice to sample points on a grid, like this, and then systematically explore those values. Here I have drawn a 5x5 grid. In practice the grid could be larger or smaller than 5x5, but in this example you would try all 25 points and pick whichever setting works best. This method works fine when the number of hyperparameters is relatively small.
In deep learning, what we tend to do, and what I recommend you do instead, is choose the points at random. So you can choose the same number of points, 25, and then try out the hyperparameters at these randomly selected points. The reason is that it is difficult to know in advance which hyperparameters will be the most important for the problem you are trying to solve, and as you saw earlier, some hyperparameters really are more important than others.
For example, suppose hyperparameter 1 is $\alpha$, the learning rate. To take an extreme example, suppose hyperparameter 2 is the $\varepsilon$ in the denominator of the Adam algorithm. In this case, the value of $\alpha$ matters a lot, but the value of $\varepsilon$ hardly matters at all. If you sample on the grid, you try 5 values of $\alpha$, and you will find that the result is basically the same regardless of the value of $\varepsilon$. So you have trained 25 models in total, but you have only tried 5 values of $\alpha$, which I think is the hyperparameter that matters.
In contrast, if you sample at random, you will have tried 25 distinct values of $\alpha$, so you are more likely to find one that works well.
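To make the comparison concrete, here is a minimal sketch that counts how many distinct values of $\alpha$ each strategy tries with 25 points. The ranges for $\alpha$ and $\varepsilon$, the function names, and the random seed are assumptions made for illustration and are not taken from the lecture.

```python
import itertools
import random


def grid_points(alphas, epsilons):
    """Every combination of the given alpha and epsilon values."""
    return list(itertools.product(alphas, epsilons))


def random_points(n, alpha_range, eps_range, seed=0):
    """n points drawn uniformly and independently from the two ranges."""
    rng = random.Random(seed)
    return [
        (rng.uniform(*alpha_range), rng.uniform(*eps_range))
        for _ in range(n)
    ]


def linspace(low, high, count):
    """count evenly spaced values from low to high (inclusive)."""
    step = (high - low) / (count - 1)
    return [low + i * step for i in range(count)]


# Illustrative ranges only (assumed, not from the lecture).
alpha_range = (0.0001, 0.1)
eps_range = (1e-9, 1e-7)

grid = grid_points(linspace(*alpha_range, 5), linspace(*eps_range, 5))
rand = random_points(25, alpha_range, eps_range)

# Both strategies train 25 models, but they differ in how many
# distinct learning rates those models cover.
distinct_grid_alphas = len({alpha for alpha, _ in grid})
distinct_rand_alphas = len({alpha for alpha, _ in rand})
```

With the 5x5 grid, `distinct_grid_alphas` is 5, whereas the random sample should give 25 distinct learning rates for the same budget of 25 models.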
I've explained the case of two hyperparameters. In practice, you may be searching over more than two. Suppose you have three hyperparameters: then you are no longer searching over a square but over a cube, where hyperparameter 3 is the third dimension. By sampling within this three-dimensional cube, you end up trying many more values of each of the three hyperparameters.
In practice, you may be searching over even more than three hyperparameters, and it is sometimes hard to know in advance which ones will turn out to be the most important for your specific application. Sampling at random rather than on a grid means you explore a richer set of possible values for the most important hyperparameters, whichever they turn out to be.
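Random sampling extends naturally to any number of hyperparameters. The sketch below draws 25 random configurations from a hypothetical search space; the names, ranges, and the `sample_config` helper are all assumptions for illustration. Uniform sampling is used here for simplicity, and choosing a suitable range and scale for each hyperparameter is a separate question.

```python
import random

# Hypothetical search space: names and ranges are illustrative only.
SEARCH_SPACE = {
    "learning_rate": ("float", 0.0001, 0.1),
    "momentum_beta": ("float", 0.8, 0.99),
    "hidden_units": ("int", 50, 100),
    "num_layers": ("int", 2, 4),
    "mini_batch_size": ("choice", [32, 64, 128, 256]),
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, independently of the others."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "float":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "int":
            # randint includes both end points.
            config[name] = rng.randint(spec[1], spec[2])
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown spec kind: {kind}")
    return config


rng = random.Random(42)
configs = [sample_config(SEARCH_SPACE, rng) for _ in range(25)]
```

Because every configuration draws a fresh value for every hyperparameter, no part of the 25-model budget is spent repeating the same value of whichever hyperparameter turns out to matter most.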
Another common practice is to use a coarse-to-fine scheme when sampling hyperparameter values.
For example, in the two-dimensional case, suppose you have sampled these points and found that one of them works best, and maybe a few other points around it also work well. The next thing to do is to zoom in on a smaller region around those points (the small blue box) and sample more densely within it, again possibly at random, so that more of your resources go into searching inside this blue box, if you suspect that the best hyperparameter setting lies in this region. So after a coarse search over the entire square, you know to focus on a smaller square next, and within that smaller square you can sample points more densely. This kind of coarse-to-fine search is also used frequently.
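The coarse-to-fine idea can be sketched in a few lines. This is a toy illustration under stated assumptions: `toy_dev_error` is a hypothetical stand-in for training a model and measuring its error, the hyperparameter names `h1` and `h2` are placeholders, and shrinking the box to a quarter of its width around the best point is one arbitrary choice among many.

```python
import random


def sample_box(n, box, rng):
    """Draw n points uniformly from a box given as {name: (low, high)}."""
    return [
        {name: rng.uniform(low, high) for name, (low, high) in box.items()}
        for _ in range(n)
    ]


def shrink_box(box, center, factor=0.25):
    """Smaller box around `center`, clipped so it stays inside `box`."""
    new_box = {}
    for name, (low, high) in box.items():
        half_width = (high - low) * factor / 2
        new_box[name] = (
            max(low, center[name] - half_width),
            min(high, center[name] + half_width),
        )
    return new_box


def toy_dev_error(point):
    # Hypothetical stand-in for "train a model, measure its error".
    return (point["h1"] - 0.3) ** 2 + (point["h2"] - 0.7) ** 2


rng = random.Random(0)

# Coarse pass: 25 random points over the whole square.
coarse_box = {"h1": (0.0, 1.0), "h2": (0.0, 1.0)}
coarse = sample_box(25, coarse_box, rng)
best_coarse = min(coarse, key=toy_dev_error)

# Fine pass: 25 more points inside a smaller box around the best coarse point.
fine_box = shrink_box(coarse_box, best_coarse)
fine = sample_box(25, fine_box, rng)
best_fine = min(fine + [best_coarse], key=toy_dev_error)
```

The fine pass spends the same number of trials on a much smaller region, so the points there are denser. Keeping `best_coarse` among the candidates ensures the second pass never returns something worse than the first.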
By trying out these different hyperparameter values, you can then pick whichever value does best on your training set objective, or best on your dev set, or on whatever you are trying to optimize in your hyperparameter search.
I hope this gives you a way to organize your hyperparameter search more systematically. The two key takeaways are to use random sampling with an adequate number of points, and optionally to consider a coarse-to-fine search process. But there is more to hyperparameter search than this, and in the next video I'll go on to explain how to choose a reasonable range of values for each hyperparameter.