# Large Deviations and Statistical Mechanics

So what if MaxEnt conflicts with Bayes’ rule (when the constraint information comes from measured data)? MaxEnt is far from the only inference-based stat mech game in town! We can just retreat to the good ol’ principle of indifference to derive the microcanonical ensemble and from there the canonical ensemble via the usual textbook construction of a microcanonical system + reservoir.

This means we can’t use the canonical ensemble when there isn’t a reservoir connected to the system of interest (such that the total energy is fixed). Except that sometimes we can. Sometimes the microcanonical and canonical ensembles are equivalent. And the simplest case of this is exactly when MaxEnt and Bayes’ rule turn out to give the same answer.

Let’s go back to the Brandeis dice problem of the previous post. Only now, instead of rolling one die ${N-1}$ times and then trying to predict what will happen on the next roll given the average value of the ${N-1}$ rolls, suppose that we instead have ${N}$ dice, roll them all at once, and then sum up the values we get. Call the sum ${S}$. What’s the probability that the first one is showing ${i}$, i.e. ${p(i|S)}$? Depends on the prior, of course. Following the principle of indifference, we would initially assign a uniform prior to all possible dice sequences. Then comes the constraint, fixing the sum to ${S}$. We’re hoping to get the Gibbs distribution, ${p(i|S)\propto e^{-\beta i}}$ for a value of ${\beta}$ such that ${\langle i\rangle=\sum_i i p(i|S)=\frac{S}{N}}$, which is the canonical ensemble. This would also be the MaxEnt answer when directly given the constraint ${\langle i\rangle=\frac{S}{N}}$.

Here’s a roundabout way to intuitively see that this is indeed the correct result (readers of Uffink’s paper will already know that a rigorous derivation can be found in the paper by Van Campenhout and Cover). Suppose that, in accordance with ${S}$, there are ${N_1}$ 1s, ${N_2}$ 2s and so on up to ${N_6}$ 6s. Clearly ${\sum_i N_i=N}$ and ${\sum_i i N_i=S}$. Dividing the ${N_i}$ by ${N}$ we get the relative frequencies ${f_i=N_i/N}$. How many different dice sequences give these same counts ${N_i}$ (and hence the same constraint value ${S}$)? Simple, it’s just the multinomial coefficient ${\frac{N!}{N_1!N_2!\cdots N_6!}}$. Using Stirling’s approximation, we can write this as

$\displaystyle \frac{N!}{N_1!N_2!\cdots N_6!}\approx e^{NH(\vec{f})},$

where ${\vec{f}=(f_1,\dots,f_6)}$ and ${H}$ is the Shannon entropy ${H(\vec{f})=-\sum_{i=1}^6 f_i\log f_i}$. Due to the exponential dependence on ${N}$, frequencies with higher entropy are vastly more likely than those with lower entropy. So let’s just approximate the situation by saying that the most likely relative frequency is the only one that matters and figure out what ${p(i|S)}$ is for this case. Given the frequency ${\vec{f}}$, the probability of the first die showing ${i}$ is just ${f_i}$. And ${f_i}$ comes from maximizing the multinomial coefficient (or for all practical purposes, its approximation involving the entropy) under the constraint ${\sum_i iNf_i=S}$ or ${\sum_i if_i=\frac{S}{N}}$. Since this is formally equivalent to the MaxEnt setup, we get the same answer and ${f_i\propto e^{-\beta i}}$ for the particular ${\beta}$ as intended.
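As a quick numerical sanity check of the argument above, here is a minimal Python sketch (an illustration only, not taken from Van Campenhout and Cover). It computes the exact ${p(i|S)}$ by counting sequences under the uniform prior and compares it with the Gibbs distribution of the same mean. The function names, the values ${N=200}$ and ${S=900}$, and the bisection bracket for ${\beta}$ are all assumptions made for the example.

```python
import math

FACES = range(1, 7)

def count_sums(n_dice):
    """Map each total to the number of sequences of n_dice dice giving it."""
    counts = {0: 1}
    for _ in range(n_dice):
        new = {}
        for total, ways in counts.items():
            for face in FACES:
                new[total + face] = new.get(total + face, 0) + ways
        counts = new
    return counts

def exact_first_die(n_dice, total):
    """Exact p(i|S) under a uniform prior on sequences.

    The first die shows i exactly when the other n_dice - 1 dice sum to S - i.
    """
    rest = count_sums(n_dice - 1)
    weights = [rest.get(total - face, 0) for face in FACES]
    norm = sum(weights)
    return [w / norm for w in weights]

def gibbs(mean, lo=-50.0, hi=50.0, steps=200):
    """Distribution p_i proportional to exp(-beta*i) with the given mean.

    beta is found by bisection; the bracket [lo, hi] is an assumed range.
    """
    def dist(beta):
        weights = [math.exp(-beta * face) for face in FACES]
        z = sum(weights)
        return [w / z for w in weights]

    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        mean_mid = sum(face * p for face, p in zip(FACES, dist(mid)))
        if mean_mid > mean:   # the mean decreases as beta increases
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

if __name__ == "__main__":
    n_dice, total = 200, 900   # hypothetical values, so S/N = 4.5
    exact = exact_first_die(n_dice, total)
    approx = gibbs(total / n_dice)
    for face, pe, pg in zip(FACES, exact, approx):
        print(f"i={face}  exact={pe:.4f}  Gibbs={pg:.4f}")
```

The two columns should agree up to corrections that shrink as ${N}$ grows, which is all the "most likely frequency is the only one that matters" approximation promises.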

In passing, I should mention that the current dice problem and the original version use very different priors. Here we use a uniform distribution over dice sequences. In the other case we effectively used a uniform distribution over frequencies. That is, sequences belonging to frequency sets with larger numbers of elements have correspondingly smaller prior probability. Using the uniform distribution on sequences for the original problem would be stupid, since then the probability of ${i}$ on the next roll would always be 1/6, no matter what the results of the previous ${N-1}$ rolls (that is, you are utterly convinced that the die is unbiased, and continue to be so even if the result is 1 every time). On the other hand, the uniform-on-sequences distribution makes more sense in the present context since there are now actually ${N}$ different dice and whether a particular one shows 4 doesn’t say anything about the probability of another die showing 4.

Back to stat mech. What’s this about the ensembles being sometimes equivalent? The entire collection of dice (microcanonical) and one die picked at random from the collection (canonical) are two different things. But if all we care about are single-die properties, then the two ensembles are equivalent. To move to the statistical mechanics of something more interesting than dice, think instead of an ideal gas, i.e. ${N}$ identical molecules of some type confined to a box and so dilute that interactions via collision are negligible. The energy is therefore a single-particle property, but of course the entire scheme is built on the fact that the total energy is fixed (and we know what it is roughly), so it doesn’t make sense to then turn around and determine its average value. Pressure is also a single-particle property, since it arises from collisions of the (non-interacting) molecules with the walls of the container. So calculating the pressure using the microcanonical and canonical ensembles should give the same answer.

Wait a minute, didn’t the title of this post say something about large deviations, whatever this means? When are we going to get to that? Well, we just did, sort of! Large deviations refers to situations in which some random variable, like frequency in the above example, has a probability which goes like ${e^{-NR}}$ for some (large) parameter ${N}$ and rate function ${R}$. (Technically it’s a statement in the limit ${N\rightarrow \infty}$.) In the above example, the rate function was, up to a sign and an additive constant, the Shannon entropy of the relative frequency distribution, but rate functions are not always entropies. The name “large deviations” comes from the fact that we’re interested in the probability of events which deviate greatly (largely) from the mean. When a random variable obeys a large deviation principle, then large deviations are exponentially unlikely.

How is this relevant to stat mech? Inasmuch as stat mech is seen as an exercise in inference about mechanical systems given background data, it is not necessarily relevant at all. Background data about tidal activity over the last few days doesn’t allow me to make very precise predictions about solar flares next week. However, the cases of interest in stat mech, i.e. when it can provide a justification for thermodynamics, are precisely those in which we can make very sharp predictions. And large deviations is the art of doing this.

The dice setup provides a good example again, illustrating the basic idea. What’s the probability of frequency ${\vec{f}}$ given total ${S}$? Let’s just calculate it as

$\displaystyle p(\vec{f}|S)=\frac{p(\vec{f}\& S)}{p(S)}.$

Now what we derived before is nearly the statement that ${\vec{f}}$ satisfies a large deviation principle under the unconstrained uniform i.i.d. distribution over sequences. The probability for a frequency ${\vec{f}}$ given i.i.d. probabilities ${p_i}$ for each die to roll an ${i}$ is just the multinomial distribution

$\displaystyle p(\vec{f}|\vec{p})=\frac{N!}{N_1!\cdots N_6!}p_1^{N_1}\cdots p_6^{N_6}\approx e^{-N H(\vec{f}||\vec{p})},$

where the approximation is again Stirling’s and the rate function ${H(\vec{f}||\vec{p})=\sum_i f_i\log(f_i/p_i)}$ is the relative entropy. For a uniform ${p_i}$ as we have here, the relative entropy reduces to the usual entropy up to a sign and an additive constant, ${H(\vec{f}||\vec{p})\rightarrow \log 6-H(\vec{f})}$.
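To see how good the Stirling estimate is in practice, here is a small Python sketch (illustrative only; the function names and the particular counts are assumptions). It compares ${-\frac{1}{N}\log p(\vec{f}|\vec{p})}$, computed exactly from the multinomial distribution, with the relative entropy as ${N}$ grows at fixed ${\vec{f}}$.

```python
import math

def log_multinomial_prob(counts, probs):
    """Exact log-probability of the counts under i.i.d. face probabilities."""
    n = sum(counts)
    # log of N!/(N_1!...N_6!) via the log-gamma function
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coef + sum(c * math.log(p) for c, p in zip(counts, probs) if c > 0)

def entropy(freqs):
    """Shannon entropy H(f) in nats."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

def relative_entropy(freqs, probs):
    """Relative entropy H(f||p) in nats."""
    return sum(f * math.log(f / p) for f, p in zip(freqs, probs) if f > 0)

def empirical_rate(counts, probs):
    """-(1/N) log p(f|p), which should approach the relative entropy."""
    return -log_multinomial_prob(counts, probs) / sum(counts)

if __name__ == "__main__":
    base = (1, 2, 2, 3, 5, 7)          # hypothetical counts for N = 20
    uniform = [1 / 6] * 6
    freqs = [c / sum(base) for c in base]
    print("relative entropy:", relative_entropy(freqs, uniform))
    print("log 6 - H(f):    ", math.log(6) - entropy(freqs))
    for scale in (1, 10, 100, 1000):   # same frequencies, growing N
        counts = [scale * c for c in base]
        print(sum(counts), empirical_rate(counts, uniform))
```

The gap between the exact rate and the relative entropy shrinks roughly like ${\log N/N}$, which is why the statement is really about the limit ${N\rightarrow\infty}$.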

Now observe that the total ${S}$ is a function of the frequency, just ${S=s(\vec{f})=\sum_i iN f_i}$. One of the nice things about a random variable satisfying a large deviation principle is that functions of the random variable also obey a large deviation principle (this trick is called contraction). That is, since ${S=s(\vec{f})}$ we have

$\displaystyle p(S)=\sum_{\vec{f}:s(\vec{f})=S}p(\vec{f})\approx\sum_{\vec{f}:s(\vec{f})=S}e^{-N(\log 6-H(\vec{f}))}\approx e^{-N(\log 6-H(\vec{f}^*))},$

using Laplace’s approximation, where ${\vec{f}^*}$ is the ${\vec{f}}$ which maximizes ${H(\vec{f})}$ while satisfying ${s(\vec{f})=S}$. To keep things clear, we define ${R(S)=H(\vec{f}^*)}$. Since ${\vec{f}}$ determines ${S}$, ${p(\vec{f}\& S)=p(\vec{f})}$ and we’re left with

$\displaystyle p(\vec{f}|S)\propto e^{-N(R(S)-H(\vec{f}))}$

for ${\vec{f}}$ such that ${s(\vec{f})=S}$ and zero otherwise. Now following the same argument we made above, the dominant ${\vec{f}}$ in probability is of course ${\vec{f}^*}$, so we’re back to MaxEnt and the distribution of ${i}$ itself under the constraint is the Gibbs distribution.
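The contraction step can also be checked numerically. The following Python sketch (an illustration under assumed names and parameter values, not part of the original argument) computes ${p(S)}$ exactly by counting sequences and compares ${-\frac{1}{N}\log p(S)}$ with ${\log 6-R(S)}$, where ${R(S)}$ is obtained by maximizing the entropy at fixed mean ${S/N}$.

```python
import math

FACES = range(1, 7)

def log_prob_of_total(n_dice, total):
    """Exact log p(S) for n_dice fair dice, by counting sequences."""
    counts = {0: 1}
    for _ in range(n_dice):
        new = {}
        for t, ways in counts.items():
            for face in FACES:
                new[t + face] = new.get(t + face, 0) + ways
        counts = new
    return math.log(counts[total]) - n_dice * math.log(6)

def max_entropy(mean, lo=-50.0, hi=50.0, steps=200):
    """R = H(f*): the largest entropy among frequencies with the given mean.

    The maximizer has the Gibbs form exp(-beta*i)/Z; beta is found by
    bisection over an assumed bracket [lo, hi].
    """
    def dist(beta):
        weights = [math.exp(-beta * face) for face in FACES]
        z = sum(weights)
        return [w / z for w in weights]

    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        mean_mid = sum(face * p for face, p in zip(FACES, dist(mid)))
        if mean_mid > mean:   # the mean decreases as beta increases
            lo = mid
        else:
            hi = mid
    f_star = dist(0.5 * (lo + hi))
    return -sum(f * math.log(f) for f in f_star if f > 0)

if __name__ == "__main__":
    target = math.log(6) - max_entropy(4.5)
    print("log 6 - R(S):", target)
    for n_dice in (50, 100, 200, 400):   # hypothetical sizes, S/N = 4.5
        total = 9 * n_dice // 2
        print(n_dice, -log_prob_of_total(n_dice, total) / n_dice)
```

The exact values sit slightly above ${\log 6-R(S)}$ and approach it as ${N}$ increases, which is the sense in which Laplace’s approximation holds here.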

The great thing about this is that it generalizes to other situations quite easily. It all revolves around finding a macrostate, a function of the microstate like the frequency above, which has two properties: 1. it satisfies a large deviation principle, and 2. the quantity to be constrained is a function of it. Then contraction comes into play, and due to the exponential we can use Laplace’s approximation everywhere to simplify the resulting expressions. We don’t need to start with the microcanonical ensemble; we could have started with the canonical ensemble (it’s important to realize that large deviations doesn’t provide a justification of the Gibbs distribution on its own; for that you use the textbook argument). We can also easily consider more complicated systems like mean field theories and (with a bit more work) interacting systems, and even some nonequilibrium systems (think Markov chains) and some of the fluctuation theorems discovered recently. And the mathematics of large deviations guides the stat mech argumentation the entire way; for instance, I can’t resist mentioning that the free energy arises naturally as a quantity called the scaled cumulant generating function whose Legendre(-Fenchel) transform gives the rate function of the macrostate. The -Fenchel part of the transform makes sure that everything works out alright when there’s a first-order phase transition, automatically employing the Maxwell equal-area rule.
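To make the Legendre transform statement concrete for the dice, here is a short worked sketch. The symbols ${\lambda}$, ${k}$, ${s}$ and ${I}$ are introduced just for this illustration. For ${N}$ independent fair dice the scaled cumulant generating function of the sum is

$\displaystyle \lambda(k)=\frac{1}{N}\log\left\langle e^{kS}\right\rangle=\log\Big(\frac{1}{6}\sum_{i=1}^6 e^{ki}\Big),$

and the rate function for the mean ${s=S/N}$ is its Legendre transform

$\displaystyle I(s)=\sup_k\,\big[ks-\lambda(k)\big].$

The supremum sits where ${\lambda'(k)=s}$, i.e. where the tilted distribution ${f^*_i=e^{ki}/\sum_j e^{kj}}$ has mean ${s}$. That is the Gibbs distribution with ${k=-\beta}$. Its entropy is ${H(\vec{f}^*)=-\sum_i f^*_i\big(ki-\log\sum_j e^{kj}\big)=-ks+\log\sum_j e^{kj}}$, so

$\displaystyle I(s)=ks-\lambda(k)=\log 6-H(\vec{f}^*)=\log 6-R(S),$

which is the same exponent found for ${p(S)}$ by contraction. In this example ${\lambda}$ is smooth and convex, so the plain Legendre transform is enough and the -Fenchel refinement isn’t needed.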

If you’re still reading, you’re sufficiently interested to look past my butchering of the subject, and it’s time to turn to literature by experts. I highly recommend a recent review by Hugo Touchette (the mathematics is not so dense that you can’t see the forest for the trees). Once you’re ready to look at the trees in more detail, turn to Ellis’ book Entropy, Large Deviations, and Statistical Mechanics. Both are probability-agnostic as far as I can tell; what you see here is my own Bayesian spin.