Hierarchical Bayesian estimation

From DDWiki


Hierarchical Bayesian estimation is a procedure for estimating the coefficients (βs) of behavioral models such as mixed logit. It is therefore not a new behavioral model alongside logit and probit, but an alternative to other estimation methods such as maximum likelihood. In some cases it is numerically more convenient than its classical competitor, maximum likelihood.

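The HB (hierarchical Bayesian) method has its roots in Bayesian probability. However, by the Bernstein-von Mises theorem its results can also be interpreted entirely from the classical perspective: under regularity conditions, the posterior mean is asymptotically equivalent to the maximum likelihood estimator.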


Motivation

There are two reasons for preferring the HB method over maximum likelihood in some cases:

  1. It does not require maximization. Maximization is numerically expensive. Moreover, in some cases (such as using a lognormal distribution in mixed logit) it is hard to reach the global maximum without being trapped in a local maximum. The maximum likelihood result is then very sensitive to the starting values, and there is no general way around this.
  2. Desirable estimation properties, such as efficiency and consistency, can be attained under more relaxed conditions for HB. Of course, these conveniences come at the cost of other difficulties, notably with convergence of the HB method. It is up to the researcher to decide which tool is more appropriate for her case.

Bayes Theorem

Bayes' theorem relates conditional and unconditional probabilities:

P(B \mid A)= P(A \mid B) \cdot \frac{P(B)}{P(A)}

It is worth noting that Bayes' theorem does not by itself imply a Bayesian view of statistics. We enter the Bayesian perspective only when we interpret the conditional probability P(B|A) as the posterior probability and the unconditional probability P(B) as the prior probability.
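The formula above can be checked on a small discrete case. The sketch below is only an illustration: the function name `posterior`, the two hypotheses and all numbers are assumptions made up for this example, not taken from the references.

```python
def posterior(prior, likelihood):
    """Apply Bayes' theorem over a discrete set of hypotheses B.

    prior:      dict mapping hypothesis -> P(B)
    likelihood: dict mapping hypothesis -> P(A | B)
    returns:    dict mapping hypothesis -> P(B | A)
    """
    # P(A) is the sum over all hypotheses of P(A | B) * P(B)
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}


# Hypothetical numbers: two candidate states, equally likely a priori,
# under which the observed event A has different probabilities.
prior = {"low": 0.5, "high": 0.5}
likelihood = {"low": 0.2, "high": 0.6}  # P(A | B)

post = posterior(prior, likelihood)
print(post)
```

Observing A shifts the probability toward the hypothesis under which A is more likely, and the posterior probabilities still sum to one.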

HB Procedure

We can divide the procedure into four steps:

  1. As in any estimation method, we start by identifying the unknown parameters we want to estimate. We call them θ (all variables are vectors in this notation). For example, in the mixed logit case the coefficients are distributed as \beta_{n} \sim N(b,W), and θ comprises the mean b and the covariance W together with the individual-level \beta_{n}.
  2. Set the prior density π(θ). This represents our prior belief about θ. Drawing on experience, and sometimes on numerical convenience, we assign a probability distribution to the unknown parameters. If we know nothing about them we can use a flat prior density; intuitively, this should give results similar to the classical perspective. Following Allenby (1997), for the mixed logit example we assume:
  • The prior on b is a normal distribution with unboundedly large variance.
  • The prior on W is inverted Wishart with K degrees of freedom and scale matrix I (the K-dimensional identity matrix).
  3. The next step is to observe what actually happens. A survey is carried out, and the actual outcomes Y of a sample of N choice tasks are observed. From these we construct the likelihood function L(Y|θ). For the mixed logit example, the likelihood of person n's observed choices, conditional on β, is:
 L(Y_{n} | \beta) = \prod_{t} \Bigg( \frac{e^{\beta'x_{ny_{nt}t}}}{\sum_{j}e^{\beta'x_{njt}}} \Bigg)
  4. We update our belief. Mathematically, we find the posterior density of θ, K(θ|Y), using Bayes' theorem:
  K(\theta | Y) = \frac{L(Y|\theta)\pi(\theta)}{L(Y)}
 where the normalizing constant is the marginal likelihood of the data:
 L(Y) = \int L(Y|\theta)\pi(\theta)\,d\theta
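Steps 3 and 4 can be sketched numerically for the simplest possible case. The code below is a minimal illustration under loud assumptions: a single fixed coefficient (K = 1) rather than a full mixed logit, a made-up data set of three choice tasks, a diffuse normal prior with standard deviation 10, and a grid approximation of the integral L(Y). The names `logit_likelihood` and `grid_posterior` are hypothetical helpers written for this example.

```python
import math


def logit_likelihood(beta, tasks):
    """L(Y_n | beta): product over tasks t of the logit probability
    of the alternative that was actually chosen.

    beta:  list of K coefficients
    tasks: list of (x, chosen), where x[j] is the attribute vector of
           alternative j and chosen is the index of the chosen one
    """
    like = 1.0
    for x, chosen in tasks:
        utilities = [sum(b * v for b, v in zip(beta, xj)) for xj in x]
        m = max(utilities)  # subtract the max for numerical stability
        expu = [math.exp(u - m) for u in utilities]
        like *= expu[chosen] / sum(expu)
    return like


def grid_posterior(tasks, grid, prior):
    """K(theta | Y) evaluated on an evenly spaced grid (scalar theta)."""
    unnorm = [logit_likelihood([b], tasks) * prior(b) for b in grid]
    step = grid[1] - grid[0]
    marginal = sum(unnorm) * step  # Riemann approximation of L(Y)
    return [u / marginal for u in unnorm]


# Hypothetical data: one attribute, two alternatives, three tasks.
tasks = [
    ([[1.0], [0.0]], 0),
    ([[2.0], [1.0]], 0),
    ([[0.5], [1.5]], 0),
]
grid = [-10.0 + 0.01 * i for i in range(2001)]
step = grid[1] - grid[0]


def diffuse_normal(b):
    # Unnormalized N(0, 10^2) prior; the constant cancels in the posterior.
    return math.exp(-0.5 * (b / 10.0) ** 2)


density = grid_posterior(tasks, grid, diffuse_normal)
post_mean = sum(b * d for b, d in zip(grid, density)) * step
print("posterior mean of beta:", round(post_mean, 3))
```

A grid only works for one or two parameters. With the many parameters of a real mixed logit the integral cannot be evaluated this way, which is why simulation is needed.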

In practice, implementing these steps requires taking draws from densities, because the integrals have no closed form. In the mixed logit case, for example, Metropolis-Hastings and Gibbs sampling are used. The final posterior is the probability density of the parameters we are looking for, so the estimate of θ is obtained by taking draws from the posterior and averaging them.
Finally, the procedure is normally called hierarchical Bayes (HB) because it involves a hierarchy of parameters and priors.
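The idea of drawing from the posterior and averaging can be sketched with a random-walk Metropolis-Hastings sampler. This is a deliberately simplified illustration, not the full sampler used for mixed logit: it assumes a single fixed coefficient, made-up data, and a diffuse normal prior, and it omits the Gibbs steps for b and W. The function names, step size, burn-in length and seed are all assumptions chosen for the example.

```python
import math
import random


def log_posterior(beta, tasks, prior_sd=10.0):
    """log L(Y | beta) + log pi(beta), up to an additive constant."""
    logp = -0.5 * (beta / prior_sd) ** 2  # diffuse normal prior (assumed)
    for x, chosen in tasks:
        utilities = [beta * xj for xj in x]
        m = max(utilities)
        log_denom = m + math.log(sum(math.exp(u - m) for u in utilities))
        logp += utilities[chosen] - log_denom
    return logp


def metropolis_hastings(tasks, n_draws=20000, burn_in=2000, step=2.0, seed=0):
    """Random-walk Metropolis-Hastings draws from K(beta | Y)."""
    rng = random.Random(seed)
    beta = 0.0
    current = log_posterior(beta, tasks)
    draws = []
    for i in range(n_draws + burn_in):
        proposal = beta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, tasks)
        # Accept with probability min(1, K(proposal | Y) / K(beta | Y)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            beta, current = proposal, candidate
        if i >= burn_in:
            draws.append(beta)  # keep draws only after the burn-in period
    return draws


# Hypothetical data: x[j] is the single attribute of alternative j.
tasks = [
    ([1.0, 0.0], 0),
    ([2.0, 1.0], 0),
    ([0.5, 1.5], 0),
]

draws = metropolis_hastings(tasks)
estimate = sum(draws) / len(draws)  # posterior mean as the point estimate
print("estimated beta:", round(estimate, 2))
```

No maximization is involved: the sampler only needs to evaluate the posterior up to a constant, so the integral L(Y) never has to be computed.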

Comparing HB and ML

Comparing the HB and ML (maximum likelihood) methods may sound like comparing the Bayesian and frequentist points of view. This is partially true but, as mentioned above, HB is fully justifiable within the classical approach, while the ML estimate can be read from the Bayesian perspective as the mode of the posterior under a flat prior.

From a practical, or rather computational, point of view, both methods are useful. For example, with a full variance-covariance matrix and normally distributed coefficients, the HB method converges much faster than ML. When the coefficients are lognormally distributed and uncorrelated, HB is not only faster but also converges more easily, while ML depends heavily on the starting point.

However, for distributions that are not transformations of the normal, such as the triangular and uniform distributions, ML is the faster method. The same holds when there are fixed coefficients.

References

  • Train, K. (2003), Discrete Choice Methods with Simulation, Cambridge University Press.
  • Allenby, G. (1997), ‘An introduction to hierarchical Bayesian modeling’, Tutorial Notes, Advanced Research Techniques Forum, American Marketing Association.