Finally got around to putting the solutions I worked out to many of the problems in Baez and Muniain’s (masterful) tome on my website. Part I is essentially complete, Part II roughly half-complete, and I still haven’t done anything from Part III. Nevertheless I hope you find them useful!
And Sarah Palin sells copies. She is the country’s first WWE politician — a cartoon combatant who inspires stadiums full of frustrated middle American followers who will cheer for her against whichever villain they trot out, be it Newsweek, Barack Obama, Katie Couric, Steve Schmidt, the Mad Russian, Randy Orton or whoever. Her followers will not know that she is the perfect patsy for our system, designed as it is to channel popular anger in any direction but a useful one, and to keep the public tied up endlessly in pointless media melees over meaningless horseshit (melees of the sort that develop organically around Palin everywhere she goes). Like George W. Bush, even Palin herself doesn’t know this, another reason she’s such a perfect political tool.
After reading that, read John Cole’s response to the sorry, sorry state of our political discourse.
So what if MaxEnt conflicts with Bayes’ rule (when the constraint information comes from measured data)? MaxEnt is far from the only inference-based stat mech game in town! We can just retreat to the good ol’ principle of indifference to derive the microcanonical ensemble and from there the canonical ensemble via the usual textbook construction of a microcanonical system + reservoir.
This means we can’t use the canonical ensemble when there isn’t a reservoir connected to the system of interest (such that the total energy is fixed). Except that sometimes we can. Sometimes the microcanonical and canonical ensembles are equivalent. And the simplest case of this is exactly when MaxEnt and Bayes’ rule turn out to give the same answer.
Let’s go back to the Brandeis dice problem of the previous post. Only now, instead of rolling one die many times and then trying to predict what will happen on the next roll given the average value of the rolls, suppose that we instead have $N$ dice, roll them all at once, and then sum up the values we get. Call the sum $E$. What’s the probability that the first one is showing $k$, i.e. $p(k)$? Depends on the prior, of course. Following the principle of indifference, we would initially assign a uniform prior to all possible dice sequences. Then comes the constraint, fixing the sum to $E$. We’re hoping to get the Gibbs distribution $p(k) = e^{-\beta k}/Z$, for a value of $\beta$ such that $\sum_k k\,p(k) = E/N$, which is the canonical ensemble. This would also be the MaxEnt answer when directly given the constraint $\langle k \rangle = E/N$.
Here’s a roundabout way to intuitively see that this is indeed the correct result (readers of Uffink’s paper will already know that a rigorous derivation can be found in the paper by Van Campenhout and Cover). Suppose that, in accordance with $E$, there are $n_1$ 1s, $n_2$ 2s and so on up to $n_6$ 6s. Clearly $\sum_k n_k = N$ and $\sum_k k\,n_k = E$. Dividing the $n_k$ by $N$ we get the relative frequencies $f_k = n_k/N$. How many different dice sequences give the same frequencies $f$? Simple, it’s just the multinomial coefficient $N!/(n_1!\cdots n_6!)$. Using Stirling’s approximation, we can write this as

$$\frac{N!}{n_1!\cdots n_6!} \approx e^{N H(f)},$$
where $f = (f_1,\dots,f_6)$ and $H(f)$ is the Shannon entropy $H(f) = -\sum_k f_k \log f_k$. Due to the exponential dependence on $N$, frequencies with higher entropy are vastly more likely than those with lower entropy. So let’s just approximate the situation by saying that the most likely relative frequency $f^*$ is the only one that matters and figure out what $p(k)$ is for this case. Given the frequency $f^*$, the probability of the first die showing $k$ is just $f^*_k$. And $f^*$ comes from maximizing the multinomial coefficient (or for all practical purposes, its approximation involving the entropy) under the constraint $\sum_k k\,n_k = E$ or $\sum_k k\,f_k = E/N$. Since this is formally equivalent to the MaxEnt setup, we get the same answer, $f^*_k = e^{-\beta k}/Z$, for the particular $\beta$ as intended.
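As a quick numerical sanity check (a minimal sketch; the counts are arbitrary choices of mine), the exact logarithm of the multinomial coefficient is already close to $N H(f)$ for modest $N$:

```python
import math

def log_multinomial(counts):
    """Exact log of N!/(n_1! ... n_6!) via lgamma."""
    N = sum(counts)
    return math.lgamma(N + 1) - sum(math.lgamma(n + 1) for n in counts)

def shannon_entropy(freqs):
    """H(f) = -sum_k f_k log f_k (natural log)."""
    return -sum(f * math.log(f) for f in freqs if f > 0)

N = 6000
counts = [600, 900, 1200, 1500, 1100, 700]   # n_1 ... n_6, summing to N
freqs = [n / N for n in counts]

exact = log_multinomial(counts)
approx = N * shannon_entropy(freqs)
print(exact, approx)   # agree to within a fraction of a percent
```

The neglected Stirling correction terms grow only logarithmically in $N$, which is why the exponential entropy term dominates.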
In passing, I should mention that the current dice problem and the original version use very different priors. Here we use a uniform distribution over dice sequences. In the other case we effectively used a uniform distribution over frequencies. That is, sequences belonging to frequency sets with larger numbers of elements have correspondingly smaller prior probability. Using the uniform distribution on sequences for the original problem would be stupid, since then the probability of the next roll would always be 1/6, no matter what the results of the previous rolls (that is, you are utterly convinced that the die is unbiased, and continue to be so even if the result is 1 every time). On the other hand, the uniform-on-sequences distribution makes more sense in the present context since there are now actually $N$ different dice, and whether a particular one shows 4 doesn’t say anything about the probability of another die showing 4.
Back to stat mech. What’s this about the ensembles being sometimes equivalent? The entire collection of dice (microcanonical) and one die picked at random from the collection (canonical) are two different things. But if all we care about are single-die properties, then the two ensembles are equivalent. To move to the statistical mechanics of something more interesting than dice, think instead of an ideal gas, i.e. $N$ identical molecules of some type confined to a box, so dilute that interactions via collisions are negligible. The energy of any one molecule is therefore a single-particle property, but of course the entire scheme is built on the fact that the total energy is fixed (and we know roughly what it is), so it doesn’t make sense to then turn around and ask for its average value. Pressure is also a single-particle property, since it arises from collisions of the (otherwise non-interacting) molecules with the walls of the container. So calculating the pressure using the microcanonical and canonical ensembles should give the same answer.
Wait a minute, didn’t the title of this post say something about large deviations, whatever that means? When are we going to get to that? Well, we just did, sort of! Large deviations refers to situations in which some random variable, like the frequency $f$ in the above example, has a probability which goes like $e^{-N I(x)}$ for some (large) parameter $N$ and rate function $I$. (Technically it’s a statement in the limit $N \to \infty$.) In the above example, the rate function was essentially the Shannon entropy of the relative frequency distribution (up to a sign and an additive constant), but rate functions are not always entropies. The name “large deviations” comes from the fact that we’re interested in the probability of events which deviate greatly (largely) from the mean. When a random variable obeys a large deviation principle, large deviations are exponentially unlikely.
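To see the definition in action in the simplest possible setting (a fair coin rather than dice; the choice of $N$ and threshold is mine), compare $-\frac{1}{N}\log P$ of a tail event to the corresponding rate function, here the binary relative entropy:

```python
import math

# Large deviation event: the fraction of heads in N fair flips is at least a.
# The rate function is I(a) = a ln(2a) + (1-a) ln(2(1-a)).
def rate(a):
    return a * math.log(2 * a) + (1 - a) * math.log(2 * (1 - a))

def log_tail(N, k0):
    """Exact log P(at least k0 heads in N fair flips), via log-sum-exp."""
    logs = [math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
            - N * math.log(2) for k in range(k0, N + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(x - m) for x in logs))

N, k0 = 2000, 1400   # the event "fraction of heads >= 0.7"
print(-log_tail(N, k0) / N, rate(k0 / N))   # both close to 0.08
```

The exact tail probability here is astronomically small (around $e^{-168}$), yet the rate function captures its exponent to a couple of percent, which is the whole point of the theory.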
How is this relevant to stat mech? Inasmuch as stat mech is seen as an exercise in inference about mechanical systems given background data, it is not necessarily relevant at all. Background data about tidal activity over the last few days doesn’t allow me to make very precise predictions about solar flares next week. However, the cases of interest in stat mech, i.e. when it can provide a justification for thermodynamics, are precisely those in which we can make very sharp predictions. And large deviations is the art of doing this.
The dice setup provides a good example again, illustrating the basic idea. What’s the probability of frequency $f$ given total $E$? Let’s just calculate it as

$$P(f|E) = \frac{P(f,E)}{P(E)}.$$
Now what we derived before is nearly the statement that $f$ satisfies a large deviation principle under the unconstrained uniform i.i.d. distribution over sequences. The probability for a frequency $f$ given i.i.d. probabilities $q_k$ for each die to roll a $k$ is just the multinomial distribution

$$P(f) = \frac{N!}{n_1!\cdots n_6!}\,q_1^{n_1}\cdots q_6^{n_6} \approx e^{-N D(f\|q)},$$
where the approximation is again Stirling’s and the rate function is the relative entropy $D(f\|q) = \sum_k f_k \log(f_k/q_k)$. For a uniform $q_k = 1/6$ as we have here, the relative entropy just reduces to the usual entropy: $D(f\|q) = \log 6 - H(f)$.
Now observe that the total $E$ is a function of the frequency, just $E = N\sum_k k\,f_k$. One of the nice things about a random variable satisfying a large deviation principle is that functions of the random variable also obey a large deviation principle (this trick is called contraction). That is, since $E$ is fixed by $f$, we have

$$P(E) = \sum_{f:\,\sum_k k f_k = E/N} P(f) \approx e^{-N D(f^*\|q)}$$
using Laplace’s approximation, where $f^*$ is the $f$ which maximizes $H(f)$ (equivalently, minimizes $D(f\|q)$) while satisfying $\sum_k k\,f_k = E/N$. To keep things clear, we define $I(E) = D(f^*\|q)$. Since $f$ determines $E$, $P(f,E) = P(f)$, and we’re left with

$$P(f|E) \approx e^{-N\left[D(f\|q) - I(E)\right]}$$
for $f$ such that $\sum_k k\,f_k = E/N$, and zero otherwise. Now following the same argument we made above, the dominant $f$ in probability is of course $f^*$, so we’re back to MaxEnt, and the distribution of $k$ itself under the constraint is the Gibbs distribution.
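The whole chain of reasoning can be checked numerically for modest $N$ (a sketch with $N$ and $E$ chosen arbitrarily; the helper functions are mine, not from the papers mentioned): count sequences exactly with dynamic programming, condition on the total, and compare the marginal of a single die to the Gibbs distribution with matching mean:

```python
import math

def exact_marginal(N, E):
    """p(first die = k | sum of N dice = E), by exact counting:
    c[n][s] = number of length-n dice sequences summing to s."""
    c = [[0] * (6 * N + 1) for _ in range(N + 1)]
    c[0][0] = 1
    for n in range(1, N + 1):
        for s in range(n, 6 * n + 1):
            c[n][s] = sum(c[n - 1][s - k] for k in range(1, 7) if s >= k)
    return [c[N - 1][E - k] / c[N][E] for k in range(1, 7)]

def gibbs(mean):
    """p_k proportional to exp(-beta k) on outcomes 1..6, with beta
    found by bisection so that the average roll equals `mean`."""
    def avg(beta):
        w = [math.exp(-beta * k) for k in range(1, 7)]
        return sum(k * x for k, x in zip(range(1, 7), w)) / sum(w)
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if avg(mid) > mean:   # avg is decreasing in beta
            lo = mid
        else:
            hi = mid
    beta = (lo + hi) / 2
    w = [math.exp(-beta * k) for k in range(1, 7)]
    return [x / sum(w) for x in w]

N, E = 200, 500   # average roll E/N = 2.5
exact = exact_marginal(N, E)
approx = gibbs(E / N)
print(max(abs(a - b) for a, b in zip(exact, approx)))   # small, O(1/N)
```

This is the ensemble equivalence in miniature: the microcanonical conditioning (exact counting at fixed total) and the canonical form (Gibbs at matching mean) agree on single-die questions up to $O(1/N)$ corrections.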
The great thing about this is that it generalizes to other situations quite easily. It all revolves around finding a macrostate, a function of the microstate like the frequency above, which has two properties: 1. it satisfies a large deviation principle, and 2. the quantity to be constrained is a function of it. Then contraction comes into play, and due to the exponential we can use Laplace’s approximation everywhere to simplify the resulting expressions. We don’t need to start with the microcanonical ensemble; we could have started with the canonical ensemble (it’s important to realize that large deviations doesn’t provide a justification of the Gibbs distribution on its own — for that you use the textbook argument). We can easily consider more complicated systems like mean field theories and (with a bit more work) interacting systems, even some nonequilibrium systems (think Markov chains) and some of the recently discovered fluctuation theorems. And the mathematics of large deviations guides the stat mech argumentation the entire way; for instance, I can’t resist mentioning that the free energy arises naturally as a quantity called the scaled cumulant generating function, whose Legendre(-Fenchel) transform gives the rate function of the macrostate. The -Fenchel part of the transform makes sure that everything works out alright when there’s a first-order phase transition, automatically employing the Maxwell equal-area rule.
If you’re still reading, you’re sufficiently interested to look past my butchering of the subject, and it’s time to turn to literature by experts. I highly recommend a recent review by Hugo Touchette (the mathematics is not so dense that you can’t see the forest for the trees). Once you’re ready to look at the trees in more detail, turn to Ellis’ book Entropy, Large Deviations, and Statistical Mechanics. Both are probability-agnostic as far as I can tell; what you see here is my own Bayesian spin.
It’s like that poster Mulder had in his office on the X-Files: I want to believe in the MaxEnt approach to statistical mechanics.
Problem is, I don’t. I used to, certainly. It’s so elegant and simple! Taking the Shannon entropy of a probability distribution as a measure of uncertainty, a good scheme for coming up with a prior probability distribution for whatever you’re studying would seem to be to pick the distribution which has the maximum entropy given all the relevant known constraints. This way the distribution has the least bias possible since, apart from the constraints, it has maximal uncertainty.
The recipe in statistical mechanics calls for a constraint on the average energy, and MaxEnt obliges by producing the canonical ensemble. (The microcanonical ensemble follows from a constraint on the total energy, but it would sort of be overkill to call this MaxEnt, since it’s just the principle of indifference in this case.) Similarly, the grand canonical ensemble comes from finding the distribution maximizing the entropy given fixed average energy and average particle number. A good deal of the rest of stat mech flows more or less naturally from this derivation and interpretation, making it quite attractive, if for no other reason than to save on what you need to remember for exams.
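As a sketch of the MaxEnt recipe in a toy setting (the constraint value 4.5 and the comparison distribution are arbitrary choices of mine, with die outcomes standing in for energies), the constrained entropy maximizer has the Gibbs form, and any other distribution with the same mean has strictly lower entropy:

```python
import math

def entropy(p):
    """Shannon entropy, natural log."""
    return -sum(x * math.log(x) for x in p if x > 0)

def maxent_mean(mu):
    """Entropy-maximizing distribution on outcomes 1..6 with average mu.
    MaxEnt gives the Gibbs form p_k = exp(-beta k)/Z; beta by bisection."""
    def dist(beta):
        w = [math.exp(-beta * k) for k in range(1, 7)]
        return [x / sum(w) for x in w]
    lo, hi = -50.0, 50.0
    for _ in range(200):
        beta = (lo + hi) / 2
        if sum(k * p for k, p in zip(range(1, 7), dist(beta))) > mu:
            lo = beta   # the average decreases as beta grows
        else:
            hi = beta
    return dist((lo + hi) / 2)

mu = 4.5
g = maxent_mean(mu)
# Another distribution with the same average: a 50/50 mix of outcomes 3 and 6.
other = [0, 0, 0.5, 0, 0, 0.5]
print(entropy(g), entropy(other))   # the Gibbs form wins
```

Swapping the single mean constraint for mean energy plus mean particle number turns the same Lagrange-multiplier exercise into the grand canonical ensemble.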
It’s all well and good (at least if you’re a Bayesian, and interpret probability as a degree of belief, not as a frequency of any shape or kind), until you start to wonder where the constraint information comes from. Then you realize that an uncomfortable situation presents itself: MaxEnt could be in conflict with Bayes’ rule, which, being a Bayesian, you of course believe in above all worldly things. This could happen if the constraint information comes from measured data, for example if the average energy constraint for the system at hand is actually the statement that you cooked up many copies of this system according to the same recipe every time, measured the energy of each, and took the mean $\bar{E}$.
According to the Bayesian approach, the correct way to incorporate this information is to update your original prior probability distribution with the measured data by using, well, Bayes’ rule. This just amounts to finding the probability of whatever you’re interested in (call it $x$) given the observed data (energy measurements) by taking the joint probability (energy measurements and $x$) and dividing by the probability of observed data (energy measurements). Very simple, though in this case we have to forget the actual sequence of energy measurement results and pretend that we only remember the mean, so that we end up with $p(x|\bar{E})$. (This we do by adding up all the probabilities for $x$ given sequences having the same mean.) Using MaxEnt, on the other hand, we would forgo the original prior and just find the distribution maximizing the entropy under the constraint of the observed mean energy.
Do we get the same answer? Given that I said in the beginning of the post that I went from a state of MaxEnt belief to disbelief and have written the intervening paragraphs as I have, you can reasonably infer that the answer is no. (A word of congratulations to frequentist readers: You have just successfully completed a Bayesian inference!) Indeed, the answer is no, at least in the sense that we do not always get the same answer. In quite reasonable cases the two approaches give different answers.
In a nice paper detailing aspects of the constraint rule of MaxEnt, Uffink examines the two approaches for the case of rolling a die (in this context called the Brandeis dice problem). Suppose we roll the die many times and observe a mean number of 3.5. This is what one would expect for an unbiased die, i.e. one with probability 1/6 for each of the outcomes 1 to 6, and moreover this distribution has the largest entropy, so the MaxEnt probability is the uniform distribution. On the other hand, following the usual Bayesian prescription, if our prior distribution is that the die could be biased in any way, each equally likely, then after going through a calculation similar to that of the rule of succession (after witnessing $n$ $k$s in $N$ rolls, the probability of $k$ on the next toss is $(n+1)/(N+6)$) and adding up the probabilities for sequences of results which all have an average of 3.5, Uffink obtains the posterior distribution for the next roll. It’s not uniform; rather it gives more weight to 3 and 4 than to 1 and 6. (2 and 5 are somewhere in between.)
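A toy version of this calculation can be run exactly (just $N = 4$ rolls, so the $6^4$ sequences can be enumerated; the flat Dirichlet prior stands in for “biased in any way, each equally likely”, and the setup is my sketch, not Uffink’s actual computation). Even at this tiny $N$ the posterior for the next roll comes out non-uniform, favoring 3 and 4 over 1 and 6:

```python
from itertools import product
from math import factorial

N = 4
target_sum = 14   # mean 3.5

post = [0.0] * 7   # indexed by face 1..6
total = 0.0
for seq in product(range(1, 7), repeat=N):
    if sum(seq) != target_sum:
        continue
    counts = [seq.count(k) for k in range(1, 7)]
    # Under a flat Dirichlet prior on the bias, P(sequence) is
    # proportional to the product of the factorials of the counts.
    w = 1.0
    for n in counts:
        w *= factorial(n)
    total += w
    # Rule of succession: p(next = k | counts) = (n_k + 1)/(N + 6)
    for k in range(1, 7):
        post[k] += w * (counts[k - 1] + 1) / (N + 6)

post = [p / total for p in post[1:]]
print(post)   # symmetric about 3.5, but not uniform
```

The posterior is symmetric under $k \leftrightarrow 7-k$ (as it must be, since the constraint is), yet middle faces carry more weight than the extremes, exactly the qualitative behavior Uffink reports.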
To me this dashes any hope of applying MaxEnt all over the place as Jaynes is wont to do. But my immediate concern was with MaxEnt as a means to justify the various ensembles of statistical mechanics. Does it still work in this context? The answer here is yes, but only insofar as it just reduces to the principle of insufficient reason. And in this case MaxEnt and Bayes’ rule give the same result (again, for a “reasonable” prior). This is straightforward for the microcanonical ensemble, as mentioned above. But the usual textbook justification of the canonical ensemble, for instance, is that it results when studying a system which, together with a much larger reservoir, is described by the microcanonical ensemble. So we didn’t need anything else in the first place! One of the original appeals of MaxEnt for stat mech was that it seems to dispose of the need for a reservoir system, and implies that the canonical ensemble is appropriate in a considerably more general setting. Alas, this cannot be justified.
I should mention how this works out in the context of the Brandeis dice problem, which will happily lead into the subject of applying large deviation principles to stat mech, but this will have to be left for another post.
Without a public option, I feel like I’ve got a gun to my head telling me that I am forced by law to pay some CEO’s obscene salary with no guarantee that I’m not going to get the shaft if I get sick. Regulation alone, subject to K Street influence and corporate whores in the congress, will not get the job done.
How does this fit with Krugman’s point that a public option is not really necessary?
Look, it is possible to have universal care without a public option; Switzerland does. But there are some good reasons for the prominence of the public option in our debate.
One is substantive: to have a workable system without the public option, you need to have effective regulation of the insurers. Given the realities of our money-dominated politics, you really have to worry whether that can be done — which is a reason to have a more or less automatic mechanism for disciplining the industry.
Well, of course he doesn’t claim it can work without regulation, which both he and Digby agree isn’t likely to be enough. I agree, too, for what it’s worth. Why couldn’t regulation work? What Krugman doesn’t mention is that, at least in Germany (and I imagine something similar is true of Switzerland), “effective regulation” = price control. I cannot envision a scenario in which price controls are debated in Congress and the big insurance companies do not immediately spend (profit – epsilon) in order to kill it. Meaning it will never happen.
What I really don’t understand is how the same people who think the government can’t do anything (non-military) right also think the government would be able to run a medical coverage plan that is so competitive with existing insurance plans that it destroys them. If it were so bad, who would opt in?
One reason why one might expect quantum computers to be more powerful than classical computers has to do with the fact that the probability of a given output is the absolute square of the sum of the amplitudes associated with each of the “computational paths” (sequences of computational states, which are themselves strings of ones and zeros), and these amplitudes are complex numbers. This is essentially Feynman’s description of quantum mechanics via the path integral, applied to computers, a point touched on by Scott Aaronson in lecture 10 of his Quantum Computing Since Democritus course. The number of paths is huge, and to get the probability of ending in any particular bit string we’ve got to add up all these complex numbers for paths ending in the target state and then take the square of the absolute value. The trouble is that knowing just a few of the path amplitudes for fixed output doesn’t really tell us anything: the others can add to what we have, or cancel it out, or anything in between.
In contrast, if the probability of a given output were just the sum of the probabilities of all the paths that end with that output, as would be the case for a probabilistic classical computer, then knowing a few path probabilities gives us at least a lower bound on the output probability. Since useful probabilistic algorithms have to land on the correct output with probability greater than one half, we can sample the computational paths randomly and we’ll come up with the likely output with high probability. It’s a little perverse to talk about simulating a probabilistic algorithm since the way to sample computational paths is to run the algorithm, but the point is this won’t work for a quantum computer.
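The cancellation is easy to exhibit in the smallest possible case (a single qubit hit with two Hadamards; a sketch small enough to check by hand): each computational path has amplitude of magnitude 1/2, yet the paths to $|1\rangle$ cancel exactly, which no sum of path probabilities could ever do:

```python
import math

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]   # Hadamard; column index is the input state

def path_amplitudes(start, end):
    """Amplitudes of the two paths start -> mid -> end through H, H."""
    return [H[mid][start] * H[end][mid] for mid in (0, 1)]

amp0 = sum(path_amplitudes(0, 0))   # paths to |0>: 1/2 + 1/2 = 1
amp1 = sum(path_amplitudes(0, 1))   # paths to |1>: 1/2 - 1/2 = 0
print(amp0, amp1)
```

Sampling these paths the way one samples a probabilistic computation would suggest output |1⟩ occurs with probability 1/2, when in fact it never occurs at all.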
The connection with the path integral made me realize something I hadn’t appreciated before, namely that, for a single particle at least, the absolute value of the amplitude is the same for every path. That is, the important part of the amplitude is just its complex phase, and so nature makes it as hard as it can to predict where the particle is going to go. One might have thought that the amplitude for going to nearby points would have a larger magnitude than for distant points, a sort of diffusion, but no. Ok, I do remember this from Feynman’s book on QED, his picture of particles carrying little clocks as they travel on the various paths, but in the quantum computation context it doesn’t hold. That’s because in an ordinary quantum computation a lot of the qubits are “idle” at any given time, and aren’t subjected to any quantum gate (aren’t moving in the particle language).
This got me thinking about the standard universal gate set and whether or not there exist universal sets with the property that the amplitudes of outgoing computational basis states are also “pure phase”. In terms of the matrix entries of the gate (in the computational basis), this just means that all entries have the same magnitude. The usual gate set consists of the Hadamard (H), phase (T), and CNOT gates:

$$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad T = \begin{pmatrix} 1 & 0 \\ 0 & e^{i\pi/4} \end{pmatrix}, \quad \mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}.$$
Of course there are! Just multiply/conjugate T and CNOT respectively by H (H on qubit 1 for CNOT). The result is

$$HT = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & e^{i\pi/4} \\ 1 & -e^{i\pi/4} \end{pmatrix} \quad \text{and} \quad (H\otimes I)\,\mathrm{CNOT}\,(H\otimes I) = \frac{1}{2}\begin{pmatrix} 1 & 1 & 1 & -1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ -1 & 1 & 1 & 1 \end{pmatrix}.$$

So quantum algorithms are, in the path integral sense, purposefully trying to be difficult to simulate classically. At each step, the possible new computational basis states all have the same magnitude, frustrating efforts to calculate the ultimate output probabilities. In the same vein, it’s nice to see the quantum Fourier transform show up in almost all of the known quantum algorithms since it has the same constant-magnitude property, but now on n qubits instead of just one or two.
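The flat-magnitude claim is easy to verify directly (a bare-bones check with hand-rolled matrix helpers, using the standard matrix forms of the gates): every entry of H, H·T, and H-conjugated CNOT has the same magnitude, while T and CNOT themselves do not:

```python
import cmath

s = 1 / 2 ** 0.5
H = [[s, s], [s, -s]]
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
I2 = [[1, 0], [0, 1]]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kron(A, B):
    """Tensor product of square matrices A (n x n) and B (m x m)."""
    n, m = len(A), len(B)
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def flat_magnitude(M):
    """True if every entry of M has the same absolute value."""
    mags = [abs(x) for row in M for x in row]
    return max(mags) - min(mags) < 1e-12

HT = matmul(H, T)
HI = kron(H, I2)               # H on qubit 1
CNOT_conj = matmul(matmul(HI, CNOT), HI)

print(flat_magnitude(H), flat_magnitude(HT), flat_magnitude(CNOT_conj))
print(flat_magnitude(T), flat_magnitude(CNOT))   # the originals fail
```

Since H is kept in the set, T and CNOT are recoverable from the modified gates, so universality survives the substitution.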