It’s like that poster Mulder had in his office on the X-Files: *I want to believe* in the MaxEnt approach to statistical mechanics.

Problem is, I don’t. I used to, certainly. It’s so elegant and simple! Take the Shannon entropy of a probability distribution as a measure of uncertainty. Then a good scheme for coming up with a prior probability distribution for whatever would seem to be this: pick the distribution which has the maximum entropy given all the relevant constraints which are known. This way, the distribution has the least bias possible since, apart from the constraints, it has maximal uncertainty.

The recipe in statistical mechanics calls for a constraint on the average energy, and MaxEnt obliges by producing the canonical ensemble. (The microcanonical ensemble follows from a constraint on the total energy, but it would sort of be overkill to call this MaxEnt, since it’s just the principle of indifference in this case.) Similarly, the grand canonical ensemble comes from finding the distribution maximizing the entropy given fixed average energy and average particle number. A good deal of the rest of stat mech flows more or less naturally from this derivation and interpretation, making it quite attractive, if for no other reason than to save on what you need to remember for exams.
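For readers who want the canonical-ensemble step spelled out, here is a sketch of the standard Lagrange-multiplier argument. The notation is an assumption introduced for illustration, not something fixed in the post: $p_i$ is the probability of microstate $i$ with energy $E_i$, $U$ is the prescribed average energy, and $\alpha$, $\beta$ are the multipliers.

```latex
\begin{align}
\text{maximize}\quad & S[p] = -\sum_i p_i \ln p_i
  \quad\text{subject to}\quad \sum_i p_i = 1, \qquad \sum_i p_i E_i = U, \\
\mathcal{L} &= -\sum_i p_i \ln p_i
  - \alpha\Big(\sum_i p_i - 1\Big)
  - \beta\Big(\sum_i p_i E_i - U\Big), \\
\frac{\partial \mathcal{L}}{\partial p_i} &= -\ln p_i - 1 - \alpha - \beta E_i = 0
  \;\Longrightarrow\;
  p_i = \frac{e^{-\beta E_i}}{Z(\beta)}, \qquad Z(\beta) = \sum_i e^{-\beta E_i}, \\
U &= -\frac{\partial \ln Z}{\partial \beta}.
\end{align}
```

The last line is what fixes $\beta$ in terms of $U$. Adding a second constraint on the average particle number, with its own multiplier, gives the grand canonical form by the same steps.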

It’s all well and good (at least if you’re a Bayesian, and interpret probability as a degree of belief, not as a frequency of any shape or kind), until you start to wonder where the constraint information comes from. Then you realize that an uncomfortable situation presents itself: **MaxEnt could be in conflict with Bayes’ rule**, which, being a Bayesian, you of course believe in above all worldly things. This could happen if the constraint information comes from measured data, for example if the average energy constraint for the system at hand is actually the statement that you cooked up many copies of this system according to the same recipe every time, measured the energy of each, and took the mean.

According to the Bayesian approach, the correct way to incorporate this information is to update your original prior probability distribution with the measured data by using, well, Bayes’ rule. This just amounts to finding the probability of whatever you’re interested in (call it $x$) given the observed data (energy measurements) by taking the joint probability (energy measurements and $x$) and dividing by the probability of observed data (energy measurements). Very simple, though in this case we have to forget the actual sequence of energy measurement results and pretend that we only remember the mean, so that we end up with $P(x \mid \text{mean})$. (This we do by adding up all the probabilities for $x$ given sequences having the same mean.) Using MaxEnt, on the other hand, we would forgo the original prior and just find the distribution maximizing the entropy under the constraint of the observed mean energy.

Do we get the same answer? Given that I said in the beginning of the post that I went from a state of MaxEnt belief to disbelief and have written the intervening paragraphs as I have, you can reasonably infer that the answer is no. (A word of congratulations to frequentist readers: You have just successfully completed a Bayesian inference!) Indeed, the answer is no, at least in the sense that we do not always get the same answer. In quite reasonable cases the two approaches give different answers.

In a nice paper detailing aspects of the constraint rule of MaxEnt, Uffink examines the two approaches for the case of rolling a die (in this context called the Brandeis dice problem). Suppose we roll the die many times and observe a mean number of 3.5. This is what one would expect for an unbiased die, i.e. one with probability 1/6 for each of the outcomes 1 to 6, and moreover this distribution has the largest entropy, so the MaxEnt probability is the uniform distribution. On the other hand, following the usual Bayesian prescription, if our prior distribution is that the die could be biased in any way, each way equally likely, then after going through a calculation similar to that of the rule of succession (after witnessing $m$ $k$s in $N$ rolls, the probability of $k$ on the next toss is $(m+1)/(N+6)$) and adding up the probabilities for sequences of results which all have an average of 3.5, Uffink obtains the posterior distribution for the next roll. It’s not uniform; rather it gives more weight to 3 and 4 than to 1 and 6. (2 and 5 are somewhere in between.)
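The Bayesian side of this comparison can be checked by brute force for a small number of rolls. The sketch below is not Uffink's calculation itself. It assumes a uniform prior over all six-sided biases, in which case every vector of face counts is equally likely a priori and the predictive probability given counts `c` is `(c[k] + 1) / (n_rolls + 6)`. The function names are made up for illustration.

```python
from fractions import Fraction


def count_vectors(n_rolls, n_faces=6):
    """Yield every tuple of non-negative face counts that sums to n_rolls."""
    def rec(remaining, faces_left):
        if faces_left == 1:
            yield (remaining,)
            return
        for c in range(remaining + 1):
            for rest in rec(remaining - c, faces_left - 1):
                yield (c,) + rest
    yield from rec(n_rolls, n_faces)


def bayes_predictive(n_rolls, total, n_faces=6):
    """Predictive distribution for the next roll, given only the sum of n_rolls rolls.

    Assumption: uniform prior over biases, so the posterior over count vectors
    is uniform on those vectors whose spots add up to `total`.
    """
    matches = [
        c for c in count_vectors(n_rolls, n_faces)
        if sum(face * n for face, n in enumerate(c, start=1)) == total
    ]
    if not matches:
        raise ValueError("no sequence of rolls has this total")
    # Average the rule-of-succession estimate over all matching count vectors.
    return [
        sum(Fraction(c[k] + 1, n_rolls + n_faces) for c in matches) / len(matches)
        for k in range(n_faces)
    ]


if __name__ == "__main__":
    for face, p in enumerate(bayes_predictive(12, 42), start=1):
        print(face, float(p))
```

Exact fractions are used so that the departure from 1/6 is not mistaken for rounding error. The enumeration grows quickly with the number of rolls, so this is only practical for small cases.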

To me this dashes any hope of applying MaxEnt all over the place as Jaynes is wont to do. But my immediate concern was with MaxEnt as a means to justify the various ensembles of statistical mechanics. Does it still work in this context? The answer here is yes, but only insofar as it just reduces to the principle of insufficient reason. And in this case MaxEnt and Bayes’ rule give the same result (again, for a “reasonable” prior). This is straightforward for the microcanonical ensemble, as mentioned above. But the usual textbook justification of the canonical ensemble, for instance, is that it results when studying a system which, together with a much larger reservoir, is described by the microcanonical ensemble. So we didn’t need anything else in the first place! One of the original appeals of MaxEnt for stat mech was that it seems to dispose of the need for a reservoir system, and implies that the canonical ensemble is appropriate in a considerably more general setting. Alas, this cannot be justified.

I should mention how this works out in the context of the Brandeis dice problem, which will happily lead into the subject of applying large deviation principles to stat mech, but this will have to be left for another post.

Nice blog, and cute blog name.

I’m interested in this principle of MaxEnt discussion. It seems to me that with the Bayesian interpretation, the maximum entropy principle is still valid, but the constraint must now be more complicated if you want to do it exactly.

In other words, using the empirical average as the constraint is a valid approximation so long as we have very sharply converged in our empirical process and are willing to ignore the sampling error.

If we have not, and we have observed our die only a smallish number of times (say many fewer than 6×10^23 :-)) then the MaxEnt process would need to be something like this:

We set up using the Bayesian inference a hyperparameter mu over which we are putting a probability distribution. The probability distribution is updated from the sequence of observed values. Then different values of mu have different probabilities in the Bayesian inference. To find the probability distribution that models the process, we can condition on mu.

For each mu there is an implied MaxEnt distribution, and each mu has a given bayesian probability. The overall distribution is therefore the mixture of each of the MaxEnt distributions for each of the possible mu values (possibly this is a continuous mixture model if mu is a continuous variable).

So, for example, the t distribution is the continuous mixture of Gaussians, each of which has the same mean but a different variance, with the variances weighted by a scaled inverse chi-squared distribution (equivalently, the precisions weighted by a chi-squared distribution). Each of the Gaussians is a MaxEnt distribution for the given mean and variance, but the variance is allowed to be uncertain.

Am I missing something about your concern or is this making sense and addressing the issue?

-Dan

Thanks for your interest in the post! Weirdly enough, though, maxent and the method of inverse probability actually agree less and less as N increases (Table 1 of Uffink’s paper), so even in the case of lots of data the two methods disagree. The trouble is, as Uffink says, having an average of 3.5 is also true of dice biased towards 3 and 4, and is even more likely in the sense that the variance is smaller. As I read it, your suggestion is to only allow maxent distributions in the inverse probability calculation, so instead of thinking of the setup as attempting to determine the “true” single-die probability distribution from the set of all possible distributions, we only take distributions which have maximum entropy given some constraint value into our calculations. This could well give a different answer, something more in line with maxent itself, but also seems to beg the question somewhat. Why only maxent distributions? For instance, what if the die has faces 2,2,3,4,5,5?

There’s another, perhaps more mundane reason why I think the two approaches are incompatible. In the case of inverse probability, we would almost certainly keep track of frequency data and not just simply the average value. (This would certainly help in identifying the 2,2,3,4,5,5 die, for instance.) However, if we consider the case which Jaynes originally used to justify maxent, then frequency data wouldn’t naturally be available, and in fact the two approaches give the same answer. There we have N identical dice (instead of one rolled in principle as many times as one likes), we know the sum of what is showing on all of them, and we’d like to calculate the probability of what any particular one is showing.
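The N-identical-dice setup can be worked out exactly by counting. The sketch below assumes, as an illustration, that all ordered outcomes with the known sum are equally likely, so the probability that one particular die shows k is proportional to the number of ways the other N-1 dice can make up the rest. The function names are hypothetical.

```python
from fractions import Fraction


def sum_counts(n_dice, faces=6):
    """counts[t] = number of ordered outcomes of n_dice dice whose faces sum to t."""
    counts = [1]  # zero dice: exactly one way to reach a sum of 0
    for _ in range(n_dice):
        new = [0] * (len(counts) + faces)
        for t, c in enumerate(counts):
            for k in range(1, faces + 1):
                new[t + k] += c
        counts = new
    return counts


def single_die_given_total(n_dice, total, faces=6):
    """P(one particular die shows k | the n_dice dice sum to total), k = 1..faces."""
    rest = sum_counts(n_dice - 1, faces)
    ways = [
        rest[total - k] if 0 <= total - k < len(rest) else 0
        for k in range(1, faces + 1)
    ]
    norm = sum(ways)
    if norm == 0:
        raise ValueError("this total cannot be reached")
    return [Fraction(w, norm) for w in ways]


if __name__ == "__main__":
    for n in (4, 20, 200):
        print(n, [round(float(p), 4) for p in single_die_given_total(n, 7 * n // 2)])
```

For a total of 3.5 per die, the small-N answers lean slightly toward 3 and 4 and flatten toward 1/6 as N grows.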

I’m not completely following Uffink, but I think of the “generalized” principle of maximum entropy as something like this (and this is how I interpret Jaynes in general):

If we have n possible outcomes of an experiment and we know information “I” about the experiment, then we should assign probability p_k to each of the k = 1..n possible outcomes in such a way that all of the information “I” is taken account of, and the entropy of the distribution is maximized subject to that information as constraints.

In Uffink’s paper he quotes Jaynes as “interpreting” the average of the dice being 4.5 as a constraint. Such an interpretation is very reasonable when we know for example that the die was thrown 10^18 times. Not so much when it was thrown 100 times. I’m quite sure that Jaynes would agree that the Bayesian could bring in prior information to interpret the 100 throws case as implying a different constraint.

If you have performed an experiment N times and observed the relative frequency of each of the n possible outcomes, then taking all of this information into account leaves you with essentially no degrees of freedom over which to maximize your entropy. In the limit of large N you should assign n(k)/N as p_k (for finite N perhaps you go with the rule of succession, which would be something like (n(k)+1)/(N+n)). Since there are no degrees of freedom, there is no maximization of the entropy, because you’re completely constrained.

If on the other hand you have only the sum of all the dice rolls and the count of dice rolls, then you only know the average, and the maximum entropy distribution is “broader” than the one Uffink gives. This is a “good” thing because you’re not “inventing” information. Uffink gets a tighter distribution because he has more than just the average: he has knowledge that the process is “rolling dice”, that there are 6 distinct faces with no duplicates, and therefore that none of the probabilities can be 0. In essence the rule of succession is “inventing” 6 extra dice rolls, one landing on each of the possible faces.

Basically, because the bayesian is using prior knowledge of the process, the bayesian’s “maximum entropy” distribution has lower entropy and therefore is more peaked. I don’t see this as contradictory, nor do I think Jaynes would have any trouble with it either.

I wish I could understand everything.

I know a bit of Bayesian inference (I am starting to fit multilevel models in applied social science) and took a course on Bayesian inference a year ago.

But I know nothing of physics! Anyway, I would like to better understand your irony (the congratulations part). Why is it a Bayesian inference? In other words, couldn’t a frequentist mind arrive at the same conclusion without Bayesianism?

It’s not, really, and yes, I hope frequentists would also arrive at the same conclusion. But since I mentioned frequency earlier, and inference here, it just seemed to fit…

Let’s see if we can do a finite sample size version of Jaynes’ calculation.

We have rolled 100 dice and gotten an empirical average of 3.5. We want to know what is the probability that one of the 100 dice has a given value face up. We know that all outcomes 1, 2, 3, 4, 5, 6 are possible (nonzero probability) and that we have rolled exactly 100. What does the principle of maximum entropy together with our Bayesian beliefs tell us about the probability of getting any given average value?

Find p_k, k = 1..6, such that -sum(p_k*ln(p_k),k,1,6) is maximized, subject to the constraint that p_k = (n_k)/100 for some integer value of n_k >= 0 and sum(p_k*k)=3.5

Here we are interpreting the situation as trying to find the best distribution to model the given finite set of 100 dice. I don’t have the solution to this problem at hand, but if we’re happy with exhaustive search I think we could get it from a simple prolog program… I’ll think about how to do that.
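The exhaustive search mentioned above could be sketched as follows, in Python rather than Prolog. This is one possible reading of the problem, not a worked-out solution from the thread: it loops over the counts of faces 1 to 4, solves the two linear constraints for the counts of faces 5 and 6, and keeps the feasible count vector with the largest entropy. The function names are made up for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in nats) of the frequencies counts[k] / sum(counts)."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def best_counts(n_dice, total):
    """Search integer face counts (n1..n6) with sum n_dice and total spots `total`.

    Returns (counts, entropy) for the highest-entropy feasible vector, or
    (None, -1.0) if the constraints cannot be met. Six faces are hard-coded.
    """
    best, best_h = None, -1.0
    for n1 in range(n_dice + 1):
        for n2 in range(n_dice - n1 + 1):
            for n3 in range(n_dice - n1 - n2 + 1):
                for n4 in range(n_dice - n1 - n2 - n3 + 1):
                    left = n_dice - (n1 + n2 + n3 + n4)             # n5 + n6
                    spots = total - (n1 + 2 * n2 + 3 * n3 + 4 * n4)  # 5*n5 + 6*n6
                    n6 = spots - 5 * left
                    n5 = left - n6
                    if n5 < 0 or n6 < 0:
                        continue
                    counts = (n1, n2, n3, n4, n5, n6)
                    h = entropy(counts)
                    if h > best_h:
                        best, best_h = counts, h
    return best, best_h


if __name__ == "__main__":
    print(best_counts(12, 42))
```

For 100 dice and a total of 350 the call would be `best_counts(100, 350)`. That is several million loop iterations in pure Python, so expect it to take a while. Since 100 is not divisible by 6, ties between equally good count vectors are possible, and this sketch simply keeps the first one found.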

Now suppose instead that we’ve rolled one die 100 times and gotten the 3.5 average but have not kept track of the individual outcomes. We’re interested in predicting the future rolls of this single die. We can no longer interpret the probabilities as ratios of integers with 100 in the denominator since in the long run that is no longer true.

Nevertheless we can imagine our prior information tells us that if dice are biased it is a small bias due to irregular manufacture, and that there are still 6 separate outcomes. Now we can imagine our “prior” for the sum of 100 rolls is a truncated discretized gaussian centered at 350 over the interval 100 to 600. Using our one observation of 350 on 100 rolls we update our prior for the average using some bayesian procedure (we need to choose a likelihood). Based on our posterior for the average, we calculate the maximum entropy distribution for the 6 spots for every possible value of the sum from 100 to 600 (that is, every average from 1 to 6), weight it by the posterior probability, and add them up to get a mixture model.

What do we get? It’s much more complicated than assuming the sharp average constraint. I’m not sure, but it does correspond to “using some information I” to assume some constraints and then maximizing the entropy over those constraints. Here, though, the constraint is itself probabilistic.
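The mixture construction described above can be sketched numerically. Several pieces here are assumptions rather than anything settled in the thread. The posterior over the sum is replaced by a hypothetical discretized Gaussian with a width picked by hand, since no likelihood was chosen. The MaxEnt distribution for each mean is taken to be the usual exponential form, p_k proportional to exp(lam*k), with lam found by bisection. The function names are illustrative only.

```python
import math

FACES = range(1, 7)


def _boltzmann(lam):
    """Distribution p_k proportional to exp(lam * k) over faces 1..6."""
    shift = max(lam * k for k in FACES)  # subtract the max exponent for stability
    w = [math.exp(lam * k - shift) for k in FACES]
    z = sum(w)
    return [x / z for x in w]


def maxent_die(mean):
    """MaxEnt distribution over faces 1..6 with the given mean (bisection on lam)."""
    if mean <= 1:
        return [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    if mean >= 6:
        return [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        p = _boltzmann(mid)
        if sum(k * q for k, q in zip(FACES, p)) < mean:
            lo = mid  # mean increases with lam, so move right
        else:
            hi = mid
    return _boltzmann(0.5 * (lo + hi))


def mixture_over_sums(n_rolls, centre, sigma):
    """Mix MaxEnt distributions over all possible sums.

    Hypothetical weights: a discretized Gaussian in the sum, centred at
    `centre` with width `sigma`, standing in for the unspecified posterior.
    """
    sums = range(n_rolls, 6 * n_rolls + 1)
    weights = [math.exp(-0.5 * ((s - centre) / sigma) ** 2) for s in sums]
    total = sum(weights)
    mix = [0.0] * 6
    for s, w in zip(sums, weights):
        p = maxent_die(s / n_rolls)
        for i in range(6):
            mix[i] += (w / total) * p[i]
    return mix


if __name__ == "__main__":
    print([round(x, 4) for x in mixture_over_sums(100, 350, 17.0)])
```

The width 17.0 is only a placeholder, roughly the standard deviation of the sum of 100 fair rolls. A different prior or likelihood would change the weights and therefore the mixture.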