So what if MaxEnt conflicts with Bayes’ rule (when the constraint information comes from measured data)? MaxEnt is far from the only inference-based stat mech game in town! We can just retreat to the good ol’ principle of indifference to derive the microcanonical ensemble and from there the canonical ensemble via the usual textbook construction of a microcanonical system + reservoir.

This means we can’t use the canonical ensemble when there isn’t a reservoir connected to the system of interest (such that the total energy is fixed). Except that sometimes we can. Sometimes the microcanonical and canonical ensembles are equivalent. And the simplest case of this is exactly when MaxEnt and Bayes’ rule turn out to give the same answer.

Let’s go back to the Brandeis dice problem of the previous post. Only now, instead of rolling one die $N$ times and then trying to predict what will happen on the next roll given the average value of the rolls, suppose that we instead have $N$ dice, roll them all at once, and then sum up the values we get. Call the sum $S$. What’s the probability that the first one is showing $k$, i.e. $P(x_1 = k)$? Depends on the prior, of course. Following the principle of indifference, we would initially assign a uniform prior to all possible dice sequences. Then comes the constraint, fixing the sum to $S$. We’re hoping to get the Gibbs distribution $P(x_1 = k) = e^{-\beta k}/Z$, for a value of $\beta$ such that $\sum_k k\, e^{-\beta k}/Z = S/N$, which is the canonical ensemble. This would also be the MaxEnt answer when directly given the constraint $\langle k \rangle = S/N$.
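To make the Gibbs form concrete, here is a minimal numerical sketch (not from the post; the target mean 4.5 and the bisection bounds are my own illustrative choices) that solves for the $\beta$ matching a given per-die mean:

```python
import math

# Sketch: solve for the beta in the Gibbs distribution p_k = exp(-beta*k)/Z
# over faces k = 1..6 whose mean equals a given target (here 4.5, i.e. S/N = 4.5;
# the target value is an arbitrary illustrative choice).

def gibbs(beta):
    """Gibbs distribution over faces k = 1..6."""
    w = [math.exp(-beta * k) for k in range(1, 7)]
    Z = sum(w)
    return [x / Z for x in w]

def mean(p):
    return sum(k * pk for k, pk in enumerate(p, start=1))

def solve_beta(target):
    """Bisect for beta with Gibbs mean equal to target.
    The mean is strictly decreasing in beta (3.5 at beta = 0)."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid)) > target:
            lo = mid  # mean too high: need larger beta
        else:
            hi = mid
    return 0.5 * (lo + hi)

beta = solve_beta(4.5)
p = gibbs(beta)  # beta comes out negative here, biasing toward high faces
```

Bisection works because the Gibbs mean is monotone in $\beta$; a mean above 3.5 forces $\beta < 0$.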

Here’s a roundabout way to intuitively see that this is indeed the correct result (readers of Uffink’s paper will already know that a rigorous derivation can be found in the paper by Van Campenhout and Cover). Suppose that, in accordance with the constraint, there are $N_1$ 1s, $N_2$ 2s and so on up to $N_6$ 6s. Clearly $\sum_j N_j = N$ and $\sum_j j N_j = S$. Dividing the $N_j$ by $N$ we get the relative frequencies $f_j = N_j/N$. How many different dice sequences realize a given set of frequencies? Simple, it’s just the multinomial coefficient $N!/(N_1! \cdots N_6!)$. Using Stirling’s approximation, we can write this as

$$\frac{N!}{N_1! \cdots N_6!} \approx e^{N H(f)},$$

where $f = (f_1, \dots, f_6)$ and $H(f)$ is the Shannon entropy $H(f) = -\sum_j f_j \log f_j$. Due to the exponential dependence on $N$, frequencies with higher entropy are vastly more likely than those with lower entropy. So let’s just approximate the situation by saying that the most likely relative frequency $f^*$ is the only one that matters and figure out what $P(x_1 = k)$ is for this case. Given the frequency $f^*$, the probability of the first die showing $k$ is just $f^*_k$. And $f^*$ comes from maximizing the multinomial coefficient (or for all practical purposes, its approximation involving the entropy) under the constraint $\sum_j j N_j = S$, or $\sum_j j f_j = S/N$. Since this is formally equivalent to the MaxEnt setup, we get the same answer $f^*_k = e^{-\beta k}/Z$, and $P(x_1 = k) = e^{-\beta k}/Z$ for the particular $\beta$ as intended.
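The Stirling estimate is easy to sanity-check numerically. A small sketch (the counts below are an arbitrary example, not from the post):

```python
import math

# Numerical check of the estimate log[N!/(N_1!...N_6!)] ≈ N*H(f).
# The counts are an arbitrary example.

def log_multinomial(counts):
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)

def shannon(freqs):
    return -sum(x * math.log(x) for x in freqs if x > 0)

counts = [100, 200, 300, 150, 150, 100]  # N_1..N_6 with N = 1000
N = sum(counts)
f = [c / N for c in counts]

exact = log_multinomial(counts)   # log of the multinomial coefficient
approx = N * shannon(f)           # its large-N exponent
# The error is only O(log N), so the relative error shrinks as N grows.
```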

In passing, I should mention that the current dice problem and the original version use *very* different priors. Here we use a uniform distribution over dice *sequences*. In the other case we effectively used a uniform distribution over *frequencies*. That is, sequences belonging to frequency sets with larger numbers of elements have correspondingly smaller prior probability. Using the uniform distribution on sequences for the original problem would be stupid, since then the probability of the next roll would always be 1/6, no matter what the results of the previous rolls (that is, you are utterly convinced that the die is unbiased, and continue to be so even if the result is 1 every time). On the other hand, the uniform-on-sequences distribution makes more sense in the present context since there are now $N$ actually different dice, and whether a particular one shows 4 doesn’t say anything about the probability of another die showing 4.
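The "utterly convinced" claim can be checked by brute force for a tiny example (the function name and sizes are my own choices): under a uniform prior over whole sequences, the predictive distribution for the next roll ignores the observed rolls entirely.

```python
from fractions import Fraction
from itertools import product

# Under a uniform prior over die *sequences*, the predictive distribution
# for the next roll is independent of what has been observed so far.

def predictive(observed, length):
    """P(next roll = k | observed prefix), with a uniform prior over
    all die sequences of the given total length."""
    faces = range(1, 7)
    nxt = len(observed)
    counts = {k: 0 for k in faces}
    for s in product(faces, repeat=length):
        if list(s[:nxt]) == list(observed):
            counts[s[nxt]] += 1
    total = sum(counts.values())
    return {k: Fraction(c, total) for k, c in counts.items()}

# Even after seeing nothing but 1s, every face still gets probability 1/6.
p = predictive([1, 1, 1], 4)
```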

Back to stat mech. What’s this about the ensembles being sometimes equivalent? The entire collection of dice (microcanonical) and one die picked at random from the collection (canonical) are two different things. But if all we care about are single-die properties, then the two ensembles are equivalent. To move to the statistical mechanics of something more interesting than dice, think instead of an ideal gas, i.e. $N$ identical molecules of some type confined to a box, so dilute that interactions via collisions are negligible. The energy is therefore a single-particle property, but of course the entire scheme is built on the fact that the total energy is fixed (and we know roughly what it is), so it doesn’t make sense to then turn around and determine its average value. Pressure is also a single-particle property, since it arises from collisions of the (non-interacting) molecules with the walls of the container. So calculating the pressure using the microcanonical and canonical ensembles should give the same answer.
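The equivalence can be watched happening in the dice example itself: condition $N$ dice on their total (microcanonical) and compare the marginal of one die with the Gibbs distribution at the matching mean (canonical). A sketch, where the sizes $N = 10, 40$ and per-die mean 4.5 are arbitrary illustrative choices:

```python
import math

# Compare the two ensembles on a single-die question: the marginal of die 1
# conditioned on the total versus the Gibbs distribution with matching mean.

def sum_counts(n):
    """counts[s] = number of n-dice sequences summing to s (simple DP)."""
    counts = {0: 1}
    for _ in range(n):
        new = {}
        for s, c in counts.items():
            for k in range(1, 7):
                new[s + k] = new.get(s + k, 0) + c
        counts = new
    return counts

def micro_marginal(n, total):
    """P(x_1 = k | sum of n dice = total), uniform prior on sequences."""
    rest = sum_counts(n - 1)
    w = [rest.get(total - k, 0) for k in range(1, 7)]
    Z = sum(w)
    return [x / Z for x in w]

def canonical_marginal(mean_target):
    """Gibbs distribution whose mean equals mean_target (bisection on beta)."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        beta = 0.5 * (lo + hi)
        w = [math.exp(-beta * k) for k in range(1, 7)]
        if sum(k * x for k, x in enumerate(w, 1)) / sum(w) > mean_target:
            lo = beta
        else:
            hi = beta
    w = [math.exp(-beta * k) for k in range(1, 7)]
    return [x / sum(w) for x in w]

def tv(p, q):  # total variation distance
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

d_small = tv(micro_marginal(10, 45), canonical_marginal(4.5))
d_large = tv(micro_marginal(40, 180), canonical_marginal(4.5))
# d_large < d_small: the two single-die marginals converge as N grows
```

The distance shrinks with $N$, which is the sense in which the two ensembles agree on single-die questions.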

Wait a minute, didn’t the title of this post say something about large deviations, whatever this means? When are we going to get to that? Well, we just did, sort of! Large deviations refers to situations in which some random variable, like the frequency in the above example, has a probability which goes like $e^{-N I(x)}$ for some (large) parameter $N$ and rate function $I$. (Technically it’s a statement in the limit $N \to \infty$.) In the above example, the rate function was the Shannon entropy of the relative frequency distribution, but rate functions are not always entropies. The name “large deviations” comes from the fact that we’re interested in the probability of events which deviate greatly (largely) from the mean. When a random variable obeys a large deviation principle, then large deviations are exponentially unlikely.
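Stated a little more carefully (this is my paraphrase of the standard textbook definition): a family of random variables $A_N$ satisfies a large deviation principle with rate function $I$ when

```latex
% Large deviation principle: probabilities decay exponentially in N,
% governed by a rate function I(x) >= 0.
P(A_N \approx x) \asymp e^{-N I(x)},
\qquad \text{i.e.} \qquad
\lim_{N \to \infty} \left[ -\frac{1}{N} \log P(A_N \approx x) \right] = I(x),
```

with $I(x) \geq 0$ and $I$ vanishing at the typical value of $A_N$, so typical fluctuations carry no exponential penalty.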

How is this relevant to stat mech? Inasmuch as stat mech is seen as an exercise in inference about mechanical systems given background data, it is not necessarily relevant at all. Background data about tidal activity over the last few days doesn’t allow me to make very precise predictions about solar flares next week. However, the cases of interest in stat mech, i.e. when it can provide a justification for thermodynamics, are precisely those in which we can make very sharp predictions. And large deviations is the art of doing this.

The dice setup provides a good example again, illustrating the basic idea. What’s the probability of frequency $f$ given total $S$? Let’s just calculate it via Bayes’ rule as

$$P(f|S) = \frac{P(S|f)\,P(f)}{P(S)}.$$

Now what we derived before is nearly the statement that $f$ satisfies a large deviation principle under the unconstrained uniform i.i.d. distribution over sequences. The probability for a frequency $f$ given i.i.d. probabilities $p_j$ for each die to roll a $j$ is just the multinomial distribution

$$P(f) = \frac{N!}{N_1! \cdots N_6!} \prod_j p_j^{N_j} \approx e^{-N D(f\|p)},$$

where the approximation is again Stirling’s and the rate function $D(f\|p) = \sum_j f_j \log(f_j/p_j)$ is the relative entropy. For a uniform $p$ as we have here, the relative entropy just reduces to the usual entropy, $D(f\|p) = \log 6 - H(f)$.
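Both statements are easy to verify numerically. A short sketch with arbitrary example numbers: the identity $D(f\|\text{uniform}) = \log 6 - H(f)$, and the exponential decay of the multinomial probability.

```python
import math

# (i) Check the identity D(f || uniform) = log 6 - H(f).
# (ii) Check that the multinomial probability of frequency f under fair
#      dice decays like exp(-N * D(f || uniform)), up to O(log N).

def shannon(f):
    return -sum(x * math.log(x) for x in f if x > 0)

def rel_entropy(f, p):
    return sum(x * math.log(x / q) for x, q in zip(f, p) if x > 0)

u = [1 / 6] * 6
f = [0.1, 0.2, 0.3, 0.15, 0.15, 0.1]  # arbitrary frequency vector

lhs = rel_entropy(f, u)
rhs = math.log(6) - shannon(f)   # the two agree exactly

N = 1000
counts = [round(N * x) for x in f]
log_prob = (math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
            + sum(c * math.log(q) for c, q in zip(counts, u)))
# log_prob ≈ -N * D(f||u), with only a subextensive (O(log N)) correction
```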

Now observe that the total $S$ is a function of the frequency, just $S(f) = N \sum_j j f_j$. One of the nice things about a random variable satisfying a large deviation principle is that functions of the random variable also obey a large deviation principle (this trick is called contraction). That is, since $f$ obeys a large deviation principle, we have

$$P(S) = \sum_{f:\, S(f) = S} P(f) \approx e^{-N D(f^* \| p)}$$

using Laplace’s approximation, where $f^*$ is the $f$ which maximizes $H(f)$ (equivalently, minimizes $D(f\|p)$) while satisfying $S(f) = S$. To keep things clear, we define $I(S) = D(f^*\|p)$. Since $f$ determines $S$, $P(S|f)$ is equal to one when $S(f) = S$ and zero otherwise, and we’re left with

$$P(f|S) = \frac{P(f)}{P(S)} \approx e^{-N\left[D(f\|p) - I(S)\right]}$$

for $f$ such that $S(f) = S$ and zero otherwise. Now following the same argument we made above, the dominant $f$ in probability is of course $f^*$, so we’re back to MaxEnt and the distribution of $x_1$ itself under the constraint is the Gibbs distribution.
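That the constrained maximizer really is the Gibbs distribution can be probed numerically. A minimal sketch (the $\beta$ value is an arbitrary illustrative choice, and the perturbation directions are my own construction): mean-preserving perturbations of a Gibbs distribution can only increase the relative entropy to uniform.

```python
import math

# Check that a Gibbs distribution is the constrained minimizer of
# D(f || uniform): perturbation directions of the form (1, -2, 1) on
# adjacent faces preserve both normalization and the mean, and by strict
# convexity of D every such perturbation must land above the minimum.

def gibbs(beta):
    w = [math.exp(-beta * k) for k in range(1, 7)]
    Z = sum(w)
    return [x / Z for x in w]

def d_to_uniform(f):
    return sum(x * math.log(6 * x) for x in f if x > 0)

p = gibbs(-0.37)        # biased toward high faces (arbitrary beta)
base = d_to_uniform(p)

eps = 0.01
directions = [(1, -2, 1, 0, 0, 0), (0, 1, -2, 1, 0, 0),
              (0, 0, 1, -2, 1, 0), (0, 0, 0, 1, -2, 1)]
perturbed = [
    d_to_uniform([x + eps * v for x, v in zip(p, d)]) for d in directions
]
# every perturbed value exceeds base: Gibbs sits at the constrained minimum
```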

The great thing about this is that it generalizes to other situations quite easily. It all revolves around finding a macrostate, a function of the microstate like the frequency above, which has two properties: 1. it satisfies a large deviation principle, and 2. the quantity to be constrained is a function of it. Then contraction comes into play, and due to the exponential we can use Laplace’s approximation everywhere to simplify the resulting expressions. We don’t need to start with the microcanonical ensemble; we could have started with the canonical ensemble (it’s important to realize that large deviations doesn’t provide a justification of the Gibbs distribution on its own; for that you use the textbook argument), and we can easily consider more complicated systems like mean field theories and (with a bit more work) interacting systems. It even covers some nonequilibrium systems (think Markov chains) and some of the recently discovered fluctuation theorems. And the mathematics of large deviations guides the stat mech argumentation the entire way; for instance, I can’t resist mentioning that the free energy arises naturally as a quantity called the scaled cumulant generating function, whose Legendre(-Fenchel) transform gives the rate function of the macrostate. The -Fenchel part of the transform makes sure that everything works out alright when there’s a first-order phase transition, automatically employing the Maxwell equal-area rule.
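For the single fair die, the scaled cumulant generating function route can be checked against the entropy route directly. A sketch (function names and numerics are mine): the Legendre transform of $\lambda(k) = \log \mathbb{E}[e^{kX}]$ reproduces the rate function $\log 6 - H(\text{Gibbs with mean } s)$.

```python
import math

# Two routes to the rate function of the sample mean of fair dice:
# (a) Legendre transform of the SCGF, (b) log 6 - H of the Gibbs
# distribution with matching mean. They should coincide.

def scgf(k):
    """Scaled cumulant generating function log E[exp(k*X)], X a fair die."""
    return math.log(sum(math.exp(k * j) for j in range(1, 7)) / 6)

def rate_legendre(s):
    """I(s) = sup_k [k*s - scgf(k)], by ternary search (concave objective)."""
    lo, hi = -20.0, 20.0
    for _ in range(300):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if m1 * s - scgf(m1) < m2 * s - scgf(m2):
            lo = m1
        else:
            hi = m2
    k = 0.5 * (lo + hi)
    return k * s - scgf(k)

def rate_entropy(s):
    """log 6 - H(p) for the Gibbs distribution p with mean s."""
    lo, hi = -50.0, 50.0
    for _ in range(200):
        beta = 0.5 * (lo + hi)
        w = [math.exp(-beta * j) for j in range(1, 7)]
        if sum(j * x for j, x in enumerate(w, 1)) / sum(w) > s:
            lo = beta
        else:
            hi = beta
    w = [math.exp(-beta * j) for j in range(1, 7)]
    p = [x / sum(w) for x in w]
    return math.log(6) + sum(x * math.log(x) for x in p)
```

The two routes agree, and the rate vanishes at the unconstrained mean 3.5, as a rate function should at the typical value.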

If you’re still reading, you’re sufficiently interested to look past my butchering of the subject, and it’s time to turn to literature by experts. I highly recommend a recent review by Hugo Touchette (the mathematics is not so dense that you can’t see the forest for the trees). Once you’re ready to look at the trees in more detail, turn to Ellis’ book *Entropy, Large Deviations, and Statistical Mechanics*. Both are probability-agnostic as far as I can tell; what you see here is my own Bayesian spin.