An alternative to MAP or MLE estimates is to sample the hyperparameters (I'll call them $\theta$) from a distribution proportional to their probability, given your data and some prior assumptions. Basically, the idea is to draw lots of samples of your hyperparameters in proportion to how likely they are, given your data and some initial beliefs about what the hyperparameters might be. Usually this is accomplished using Markov-Chain Monte-Carlo (MCMC). This is often more robust than point estimates such as MAP or MLE, as it avoids assuming that the distribution of $\theta$ can be represented by a single point.

What this means for Bayesian optimization is that rather than computing the acquisition function for a single, fixed instance of the hyperparameters, $\theta^*$, one has access to $m$ samples $\theta_1, \dots, \theta_m$ drawn approximately from the posterior $p(\theta | \mathcal{D})$. One then "marginalises" the acquisition function over these hyperparameter samples:

$$\bar{\alpha}(x) = \frac{1}{m} \sum_{i=1}^{m} \alpha(x | \theta_i)$$

The disadvantage of this procedure relative to MAP or MLE estimates is a speed penalty: not only is MCMC a more computationally intensive procedure than gradient-based optimisation, but the acquisition function must now be calculated $m$ times rather than just once. (However, if you are willing to consider Bayesian optimization, this additional effort may be negligible in comparison to the time required to query your target function.)

In terms of MCMC sampling for Gaussian Processes, there is an excellent three-part series by Michael Betancourt (1, 2, 3) discussing how to do this with Stan, which explains this procedure (and its nuances) far more effectively than I am capable of here (skip to the third instalment if you are already comfortable with Gaussian Process Regression).
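To make the marginalisation step concrete, here is a minimal sketch in Python of averaging an acquisition function over posterior hyperparameter samples. The `acquisition` function below is a stand-in, not a real GP-based acquisition (that machinery is elided), and the lognormal draws stand in for samples that MCMC would produce from $p(\theta | \mathcal{D})$:

```python
import numpy as np

def acquisition(x, theta):
    # Hypothetical acquisition alpha(x | theta) for illustration only;
    # in practice this would be, e.g., expected improvement under a GP
    # whose predictive distribution depends on the hyperparameters theta.
    return np.exp(-(x - theta) ** 2 / (2.0 * theta ** 2))

def marginalised_acquisition(x, theta_samples):
    # alpha_bar(x) = (1/m) * sum_i alpha(x | theta_i):
    # average the acquisition over the m hyperparameter samples.
    return np.mean([acquisition(x, t) for t in theta_samples], axis=0)

rng = np.random.default_rng(0)
# Stand-in for m samples drawn from p(theta | D) by MCMC.
theta_samples = rng.lognormal(mean=0.0, sigma=0.5, size=200)

x_grid = np.linspace(-3.0, 3.0, 101)
alpha_bar = marginalised_acquisition(x_grid, theta_samples)
x_next = x_grid[np.argmax(alpha_bar)]  # next point to query
```

Note that the acquisition is evaluated once per sample, which is exactly the $m$-fold cost mentioned above; in exchange, `x_next` reflects the full hyperparameter posterior rather than a single point estimate.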