Thursday // 2009 August 27

Large Deviations and Statistical Mechanics

So what if MaxEnt conflicts with Bayes’ rule (when the constraint information comes from measured data)? MaxEnt is far from the only inference-based stat mech game in town! We can just retreat to the good ol’ principle of indifference to derive the microcanonical ensemble and from there the canonical ensemble via the usual textbook construction of a microcanonical system + reservoir.

This means we can’t use the canonical ensemble when there isn’t a reservoir connected to the system of interest (such that the total energy is fixed). Except that sometimes we can. Sometimes the microcanonical and canonical ensembles are equivalent. And the simplest case of this is exactly when MaxEnt and Bayes’ rule turn out to give the same answer.

Let’s go back to the Brandeis dice problem of the previous post. Only now, instead of rolling one die {N-1} times and then trying to predict what will happen on the next role given the average value of the {N-1} rolls, suppose that we instead have {N} dice, roll them all at once, and then sum up the values we get. Call the sum {S}. What’s the probability that the first one is showing {i}, i.e. {p(i|S)}? Depends on the prior, of course. Following the principle of indifference, we would initially assign a uniform prior to all possible dice sequences. Then comes the constraint, fixing the sum to {S}. We’re hoping to get the Gibbs distribution, {p(i|S)\propto e^{-\beta i}} for a value of {\beta} such that {\langle i\rangle=\sum_i i p(i|S)=\frac{S}{N}}, which is the canonical ensemble. This would also be the MaxEnt answer when directly given the constraint {\langle i\rangle=\frac{S}{N}}.

Here’s a roundabout way to intuitively see that this is indeed the correct result (readers of Uffink’s paper will already know that a rigorous derivation can be found in the paper by Van Campenhout and Cover). Suppose that, in accordance with {S}, there are {N_1} 1s, {N_2} 2s and so on up to {N_6} 6s. Clearly {\sum_i N_i=N} and {\sum_i i N_i=S}. Dividing the {N_i} by {N} we get the relative frequencies {f_i=N_i/N}. How many different dice sequences give the same constraint value {S}? Simple, it’s just the multinomial coefficient {\frac{N!}{N_1!N_2!\cdots N_6!}}. Using Stirling’s approximation, we can write this as

\displaystyle \frac{N!}{N_1!N_2!\cdots N_6!}\approx e^{NH(\vec{f})},

where {\vec{f}=(f_1,\dots,f_6)} and {H} is the Shannon entropy {H(\vec{f})=-\sum_{i=1}^6 f_i\log f_i}. Due to the exponential dependence on {N}, frequencies with higher entropy are vastly more likely than those with lower entropy. So let’s just approximate the situation by saying that the most likely relative frequency is the only one that matters and figure out what {p(i|S)} is for this case. Given the frequency {\vec{f}}, the probability of the first die showing {i} is just {f_i}. And {f_i} comes from maximizing the multinomial coefficient (or for all practical purposes, its approximation involving the entropy) under the constraint {\sum_i iNf_i=S} or {\sum_i if_i=\frac{S}{N}}. Since this is formally equivalent to the MaxEnt setup, we get the same answer and {f_i\propto e^{-\beta i}} for the particular {\beta} as intended.

In passing, I should mention that the current dice problem and the original version use very different priors. Here we use a uniform distribution over dice sequences. In the other case we effectively used a uniform distribution over frequences. That is, sequences belonging to frequency sets with larger numbers of elements have correspondingly smaller prior probability. Using the uniform distribution on sequences for the original problem would be stupid, since then the probability {i} of the next roll would always be 1/6, no matter what the results of the previous {N-1} rolls (that is, you are utterly convinced that the die is unbiased, and continue to be so even if the result is 1 every time). On the other hand, the uniform on sequences distribution makes more sense in the present context since there are now actually {N} different dice and whether a particular one shows 4 doesn’t say anything about the probability of another die showing 4.

Back to stat mech. What’s this about the ensembles being sometimes equivalent? The entire collection of dice (microcanonical) and one die picked at random from the collection (canonical) are two different things. But if all we care about are single-die properties, then the two ensembles are equivalent. To move to the statistical mechanics of something more interesting than dice, think instead of an ideal gas, i.e. {N} identical molecules of some type confined to a box and so dilute that interactions via collision are negligible. The energy is therefore a single-particle property, but of course the entire scheme is built on the fact that the total energy is fixed (and we know what it is roughly), so it doesn’t make sense to then turn around and determine its average value. Pressure is also a single-particle property, since it arises from collisions of the (non-interacting) molecules with the walls of the container. So calculating the pressure using the microcanonical and canonical ensembles should give the same answer.

Wait a minute, didn’t the title of this post say something about large deviations, whatever this means? When are we going to get to that? Well, we just did, sort of! Large deviations refers to situation in which some random variable, like frequency in the above example, has a probability which goes like {e^{-NR}} for some (large) parameter {N} and rate function {R}. (technically it’s a statement in the limit {N\rightarrow \infty}.) In the above example, the rate function was the Shannon entropy of the relative frequency distribution, but rate functions are not always entropies. The name “large deviations” comes from the fact that we’re interested in the probability of events which deviate greatly (largely) from the mean. When a random variable obeys a large deviation principle, then large deviations are exponentially unlikely.

How is this relevant to stat mech? Inasmuch as stat mech is seen as an exercise in inference about mechanical systems given background data, it is not necessarily relevant at all. Given background data about tidal activity in the last few days doesn’t allow me to make very precise predictions about solar flares next week. However, the cases of interest in stat mech, i.e. when it can provide a justification for thermodynamics, are precisely those in which we can make very sharp predictions. And large deviations is the art of doing this.

The dice setup provides a good example again, illustrating the basic idea. What’s the probability of frequency {\vec{f}} given total {S}? Let’s just calculate it as

\displaystyle p(\vec{f}|S)=\frac{p(\vec{f}\& S)}{p(S)}.

Now what we derived before is nearly the statement that {\vec{f}} satisfies a large deviation principle under the unconstrained uniform i.i.d. distribution over sequences. The probability for a frequency {\vec{f}} given i.i.d. probabilities {p_i} for each die to roll an {i} is just the multinomial distribution

\displaystyle p(\vec{f}|\vec{p})=\frac{N!}{N_1!\cdots N_6!}p_1^{N_1}\cdots p_6^{N_6}\approx e^{-N H(\vec{f}||\vec{p})},

where the approximation is again Stirling’s and the rate function {H(\vec{f}||\vec{p})=\sum_i f_i\log f_i/p_i} is the relative entropy. For a uniform {p_i} as we have here, the relative entropy just reduces to the usual entropy {H(\vec{f}||\vec{p})\rightarrow \log 6-H(\vec{f})}.

Now observe that the total {S} is a function of the frequency, just {S=s(\vec{f})=\sum_i iN f_i}. One of the nice things about a random variable satisfying a large deviation principle is that functions of the random variable also obey a large deviation principle (this trick is called contraction). That is, since {S=s(\vec{f})} we have

\displaystyle p(S)=\sum_{\vec{f}:s(\vec{f})=S}p(\vec{f})=\sum_{\vec{f}:s(\vec{f})=S}e^{-N(\log 6-H(\vec{f}))}\approx e^{-N(\log 6-H(\vec{f}^*))},

using Laplace’s approximation, where {\vec{f}^*} is the {\vec{f}} which maximizes {H(\vec{f})} while satisfying {s(\vec{f})=S}. To keep things clear, we define {R(S)=H(\vec{f}^*)}. Since {\vec{f}} determines {S}, {p(\vec{f}\& S)=p(\vec{f})} and we’re left with

\displaystyle p(\vec{f}|S)\propto e^{-N(R(S)-H(\vec{f})}

for {\vec{f}} such that {s(\vec{f})=S} and zero otherwise. Now following the same argument we made above, the dominant {\vec{f}} in probability is of course {\vec{f}^*}, so we’re back to MaxEnt and the distribution of {i} itself under the constraint is the Gibbs distribution.

The great thing about this is that it generalizes to other situations quite easily. It all revolves around finding a macrostate, a function of the microstate like the frequency above, which has two properties: 1. it satisfies a large deviation principle, and 2. the quantity to be constrained is a function of it. Then contraction comes into play, and due to the exponential we can use Laplace’s approximation everywhere to simplify the resulting expressions. We don’t need to start with the microcanonical ensemble, we could have started with the canonical ensemble (it’s important to realize that large deviations doesn’t provide a justification of the Gibbs distribution on its own, for that you use the textbook argument), we can easily consider more complicated systems like mean field theories and (with a bit more work) interacting systems. Even some nonequlibrium systems (think Markov chains) and some of the fluctuation theorems discovered recently. And the mathematics of large deviations guides the stat mech argumentation the entire way; for instance, I can’t resist mentioning that the free energy arises naturally as a quantity called the scaled cumulant generating function whose Legendre(-Fenchel) transform gives the rate function of the macrostate. The -Fenchel part of the transform makes sure that everything works out alright when there’s a first order phase transition, automatically employing the Maxwell equal-area rule.

If you’re still reading, you’re sufficiently interested to look past my butchering of the subject, and it’s time to turn to literature by experts. I highly recommend a recent review by Hugo Touchette (the mathematics is not too dense that you can’t see the forest for the trees). Once you’re ready to look at the trees in more detail, turn to Ellis’ book Entropy, Large Deviations, and Statistical Mechanics. Both are probability-agnostic as far as I can tell; what you see here is my own Bayesian spin.

Tuesday // 2009 August 25

MaxEnt?

It’s like that poster Mulder had in his office on the X-Files: I want to believe in the MaxEnt approach to statistical mechanics.

Problem is, I don’t. I used to, certainly. It’s so elegant and simple! Taking the Shannon entropy of a probability distribution as a measure of uncertainty, a good scheme for coming up with a prior probability distribution for whatever would seem to be to pick the distribution which has the maximum entropy given all the relevant constraints which are known. This way, the distribution has the least bias possible since, apart from the constraints, it has maximal uncertainty.

The recipe in statistical mechanics calls for a constraint on the average energy, and MaxEnt obliges by producing the canonical ensemble. (The microcanonical ensemble follows from a constraint on the total energy, but it would sort of be overkill to call this MaxEnt, since it’s just the principle of indifference in this case.) Similarly, the grand canonical ensemble comes from finding the distribution maximizing the entropy given fixed average energy and average particle number. A good deal of the rest of stat mech flows more or less naturally from this derivation and interpretation, making it quite attractive, if for no other reason than to save on what you need to remember for exams.

It’s all well and good (at least if you’re a Bayesian, and interpret probability as a degree of belief, not as a frequency of any shape or kind), until you start to wonder where the constraint information comes from. Then you realize that an uncomfortable situation presents itself: MaxEnt could be in conflict with Bayes’ rule, which, being a Bayesian, you of course believe in above all worldly things. This could happen if the constraint information comes from measured data, for example if the average energy constraint for the system at hand is actually the statement that you cooked up many copies of this system according to the same recipe every time, measured the energy of each, and took the mean \langle E\rangle.

According to the Bayesian approach, the correct way to incorporate this information is to update your original prior probability distribution with the measured data by using, well, Bayes’ rule. This just amounts to finding the probability of whatever you’re interested in (call it x) given the observed data (energy measurements) by taking the joint probability (energy measurements and x) and dividing by the probability of observed data (energy measurements). Very simple, though in this case we have to forget the actual sequence of energy measurement results and pretend that we only remember the mean, so that we end up with p(x|\langle E\rangle). (This we do by adding up all the probabilities for x given sequences having the same mean.) Using MaxEnt, on the other hand, we would forgo the original prior and just find the distribution maximizing the entropy under the constraint of the observed mean energy.

Do we get the same answer? Given that I said in the beginning of the post that I went from a state of MaxEnt belief to disbelief and have written the intervening paragraphs as I have, you can reasonably infer that the answer is no. (A word of congratulations to frequentist readers: You have just successfully completed a Bayesian inference!) Indeed, the answer is no, at least in the sense that we do not always get the same answer. In quite reasonable cases the two approaches give different answers.

In a nice paper detailing aspects of the constraint rule of MaxEnt, Uffink examines the two approaches for the case of rolling a die (in this context called the Brandeis dice problem). Suppose we roll the die many times and observe a mean number of 3.5. This is what one would expect for an unbiased die, i.e. one with probability 1/6 for each of the outcomes 1 to 6, and moreover this distribution has the largest entropy, so the MaxEnt probability is the uniform distribution. On the other hand, following the usual Bayesian prescription, if our prior distribution is that the die could be biased in any way, each equally-likely, then after going through a calculation similar to that of the rule of succession (after witnessing N_i is in N rolls, the probability of i on the next toss is \frac{N_i+1}{N+6}) and adding up the probabilities for sequences of results which all have an average of 3.5, Uffink obtains the posterior distribution for the next roll. It’s not uniform; rather it gives more weight to 3 and 4 than to 1 and 6. (2 and 5 are somewhere in between.)

To me this dashes any hope of applying MaxEnt all over the place as Jaynes is wont to do. But my immediate concern was with MaxEnt as a means to justify the various ensembles of statistical mechanics. Does it still work in this context? The answer here is yes, but only insofar as it just reduces to the principle of insufficient reason. And in this case MaxEnt and Bayes’ rule give the same result (again, for a “reasonable” prior). This is straightforward for the microcanonical ensemble, as mentioned above. But the usual textbook justification of the canonical ensemble, for instance, is that it results when studying a system which, together with a much larger reservoir, is described by the microcanonical ensemble. So we didn’t need anything else in the first place! One of the original appeals of MaxEnt for stat mech was that it seems to dispose of the need for a reservoir system, and implies that the canonical ensemble is appropriate in a considerably more general setting. Alas, this cannot be justified.

I should mention how this works out in the context of the Brandeis dice problem, which will happily lead into the subject of applying large deviation principles to stat mech, but this will have to be left for another post.

Friday // 2009 August 21

Public Option

Digby says:

Without a public option, I feel like I’ve got a gun to my head telling me that I am forced by law to pay some CEO’s obscene salary with no guarantee that I’m not going to get the shaft if I get sick. Regulation alone, subject to K Street influence and corporate whores in the congress, will not get the job done.

How does this fit with Krugman’s point that a public option is not really necessary?

Look, it is possible to have universal care without a public option; Switzerland does. But there are some good reasons for the prominence of the public option in our debate.

One is substantive: to have a workable system without the public option, you need to have effective regulation of the insurers. Given the realities of our money-dominated politics, you really have to worry whether that can be done — which is a reason to have a more or less automatic mechanism for disciplining the industry.

Well, of course he doesn’t claim it can work without regulation, which both he and Digby agree isn’t likely to be enough. I agree, too, for what it’s worth. Why couldn’t regulation work? What Krugman doesn’t mention is that, at least in Germany (and I imagine something similar is true of Switzerland),  “effective regulation” = price control. I cannot envision a scenario in which price controls are debated in Congress and the big insurance companies do not immediately spend (profit – epsilon) in order to kill it. Meaning it will never happen.

What I really don’t understand is how the same people who think the government can’t do anything (non-military) right also think the government would be able to run a medical coverage plan that is so competitive with existing insurance plans that it destroys them. If it were so bad, who would opt in?

Wednesday // 2009 August 5

Path Integrals and Quantum Computation

One reason why one might expect quantum computers to be more powerful than classical computers has to do with the fact that the probability of a given output is the absolute square of the sum of the amplitudes associated with each of the “computational paths” (sequences of computational states, which are themselves strings of ones and zeros), and these amplitudes are complex numbers. This is essentially Feynman’s description of quantum mechanics via the path integral, applied to computers, a point touched on by Scott Aaronson in lecture 10 of his Quantum Computing Since Democritus course. The number of paths is huge, and to get the probability of ending in any particular bit string we’ve got to add up all these complex numbers for paths ending in the target state and then take the square of the absolute value. The trouble is that knowing just a few of the path amplitudes for fixed output doesn’t really tell us anything: the others can add to what we have, or cancel it out, or anything in between.

In contrast, if the probability of a given output were just the sum of the probabilities of all the paths that end with that output, as would be the case for a probabilistic classical computer, then knowing a few path probabilities gives us at least a lower bound on the output probability. Since useful probabilistic algorithms have to land on the correct output with probability greater than one half, we can sample the computational paths randomly and we’ll come up with the likely output with high probability. It’s a little perverse to talk about simulating a probabilistic algorithm since the way to sample computational paths is to run the algorithm, but the point is this won’t work for a quantum computer.

The connection with the path integral made me realize something I hadn’t appreciated before, namely that, for a single particle at least, the absolute value of the amplitude is the same for every path. That is, the important part of the amplitude is just its complex phase, and so nature makes it as hard as it can to predict where the particle is going to go. One might have thought that the amplitude for going to nearby points would have a larger magnitude than for distant points, a sort of diffusion, but no. Ok, I do remember this from Feynman’s book on QED, his picture of particles carrying little clocks as they travel on the various paths, but in the quantum computation context it doesn’t hold. That’s because in an ordinary quantum computation a lot of the qubits are “idle” at any given time, and aren’t subjected to any quantum gate (aren’t moving in the particle language).

This got me thinking about the standard universal gate set and whether or not there exist universal sets with the property that the amplitudes of outgoing computational basis states are also “pure phase”. In terms of the matrix entries of the gate (in the computational basis), this just means that all entries have the same magnitude. The usual gate set consists of the Hadamard (H), phase (T), and CNOT gates: H=\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1\end{pmatrix}, T=\begin{pmatrix}1 & 0 \\ 0 & e^{i\pi/4}\end{pmatrix}, {\rm CNOT}=\begin{pmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0\end{pmatrix}.

Of course there are! Just multiply/conjugate T and CNOT respectively by H (H on qubit 1 for CNOT). The result is T'=\frac{1}{\sqrt{2}}\begin{pmatrix}1 & e^{i\pi/4} \\ 1 & -e^{i\pi/4}\end{pmatrix} and {\rm CNOT}'=\begin{pmatrix}1 & 1 & 1 & -1 \\ 1 & 1 & -1 & 1 \\ 1 & -1 & 1 & 1 \\ -1 & 1 & 1 & 1\end{pmatrix}. So quantum algorithms are, in the path integral sense, purposefully trying to be difficult to simulate classically. At each step, the possible new computational basis states all have the same magnitude, frustrating efforts to calculate the ultimate output probabilities. In the same vein, it’s nice to see the quantum Fourier transform show up in almost all of the known quantum algorithms since it has the same constant magnitude property, but now on n qubits instead of just one or two.

Tuesday // 2009 August 4

Chile season is almost here!

chile1

Chiles!

Monday // 2009 June 1

Information Causality

In a recent preprint, Pawlowski et al. propose a natural physical principle called information causality which classical and quantum mechanics obeys but is violated by any theory in which exhibit CHSH correlations beyond Tsirelson’s bound of 2\sqrt{2}. Roughly speaking, information causality is the principle that if you send m classical bits about a random variable X to someone, their uncertainty about X should not decrease by more than m bits. For more on the background, see this excellent post by the his whole-iness, the Quantum Pontiff.

More precisely, it is defined in the following context: separated parties Alice and Bob play a game in which Alice receives a random n bit string A_1,A_2,\dots, A_n while Bob receives an index j=1,\dots n; Bob’s task is to guess the value of A_j. Additionally, Alice and Bob share a joint system  A'B (already used A), described by whatever theory you want, and can use this however they like (measure it in some way, etc.). If Alice sends Bob an m bit string X (presumably depending on A and some measurement on her half of the joint system), then information causality is the statement that no matter how Bob generates his guess \widehat{A}_j from X and measurement on his part of system A'B, I=\sum_j I(A_j:\widehat{A}_j)\leq m, where I(A_j:\widehat{A}_j) is the mutual information between Bob’s guess \widehat{A}_j and the actual bit A_j.

What does this mean, in more direct terms? Well, I limits the probability of Bob correctly guessing Alice’s bit, a fact which I imagine the authors had in mind based on the discussion surrounding equation A5 in the paper. Let me elaborate on the connection here. By using the Fano inequality, we can link the mutual information of two random variables X and Y to the probability of guessing X given Y: p_e>(H(X|Y)-1)/\log|X|, where |X| is the number of values that X can take.

Applied to the probability of error when Bob guesses Alice’s jth bit, averaged over the bits j, we end up with p_e(j) \geq 1-I(A_j:\widehat{A}_j), where A_j is the random variable corresponding to Alice’s jth bit ( H(A_j)=1 ), and \widehat{A}_j is Bob’s guess. The average probability of error is then p_e\geq 1-\frac{1}{n}\sum_{j=1}^nI(A_j:\widehat{A}_j). Since \widehat{A}_j is generated from X (the message), and B (the extra system that Bob has, be it for the moment classical or quantum), I(A_j:\widehat{A}_j)\leq I(A_j:XB) (data processing inequality/Holevo bound). In the course of showing information causality holds for classical and quantum mechanics, they prove that \sum_{j=1}^n I(A_j:XB) \leq I(A:XB) \leq m, so p_e\geq (n-m)/n. Thus, information causality says what we would have guessed intuitively: in order for Bob to be able to predict any of Alice’s bits with high probability, she’ll have to send him n bits.

I would argue this formulation is more operational than a claim about I itself, but of course they are very closely related. Phrasing things in terms of the number of bits required to complete the task is operational in the sense usually used in (quantum) information theory, e.g. the number of extractable random bits, number of distillable EPR pairs, smallest code that allows for reliable message recovery, etc. It also opens up the possibility of demonstrating that a particular theory obeys/disobeys information causality without making entropic arguments. On the other hand, it’s not operational in the sense that Alice and Bob can simply make a specified measurement on the joint system and compare with a bound like Bell/CHSH or Tsirelson. They first have to determine which protocol to use, and this may be difficult.

Thursday // 2009 May 21

On the usefulness of topology in everyday life

smallbike

Linking number = 0

Sunday // 2009 February 1

Book Review: Gauge Fields, Knots and Gravity

Up for review: Gauge Fields, Knots and Gravity by John Baez and Javier P. Muniain. Published by World Scientific; ISBN 9810220340 (pbk)
In short: Buy this book immediately!

Gauge Fields, Knots and Gravity is a surprisingly small book, given the hefty title, but its goal is to provide a solid introduction to these subjects, rather than attempt a complete and detailed treatment. The authors state in the preface that they “hope that both physicists who wish to learn more differential geometry and topology, and mathematicians who wish to learn more gauge theory and general relativity, will find this book a useful place to start.” Speaking as a physicist, I can report that they succeeded marvelously, and I further admit to having learned a lot of gauge theory and general relativity as well.

The book is divided into three parts, which you’d be forgiven for expecting are the three advertised topics. Gauge fields and knots are covered in part II, gravity in part III, while part I, under the heading of Electromagnetism, gives easily the best introduction to differential geometry that I have come across. By the end of the first part the reader can understand and appreciate Maxwell’s equations in the simple coordinate-independent form dF=0 and *d*F=J. The second part takes up the gauge theory aspects of Maxwell’s equations directly, treating fiber bundles, connections, and curvature while working up to the Yang-Mills equation, Chern-Simons theory, and the links to knot theory. Even more ambitious is part III, which (somewhat hurriedly) covers the standard mathematical apparatus of general relativity before moving on to the real goals, the ADM formalism and prospects for quantization in Ashtekar’s new variables.

There are almost surely hundreds of precise textbooks on differential geometry and fiber bundles, many bringing to mind the observation by C.N. Yang that “There are only two kinds of math books. Those you cannot read beyond the first sentence, and those you cannot read beyond the first page.” On the other end of the spectrum are the “[x] for physicists” books which often treat their chosen material intuitively but not precisely enough to be useful in calculating or deriving anything. The chief strength of this book is its ability to do both well, and in a non-cumbersome formalism. Concepts are explained in a clear, easy to read manner and then connected to precise definitions written in a useful formalism. Any one of these three can make for a useful book, but Baez and Muniain set a new standard by offering all three. And over 300 exercises.

Thursday // 2009 January 15

(Non)Additivity, Part II

In the last post I reported that regularization is a fact of life for capacities of quantum channels, at least when trying the same trick that worked for classical information, random coding. The need for regularization comes from the nonadditivity of the random-coding-capacities \chi(\mathcal{N}), P^1(\mathcal{N}), and Q^1(\mathcal{N}) for classical, private, and quantum information respectively.

But what about the additivity of the true capacities? For instance, (ir)regardless of Q^1, the quantum capacity Q could be additive. Of course, because of the regularization it certainly is additive for two identical channels: Q(\mathcal{N}\otimes \mathcal{N})=2Q(\mathcal{N}). But what about two different channels? Same question for classical and private capacities. Are they additive for dissimilar channels?

The quantum channel is spectacularly nonadditive, as shown by Smith and Yard [arxiv]. They give the most extreme example possible, two channels which are individually worthless for sending quantum information that can nevertheless be usefully combined. 0 + 0 > 0, if you will.

The recipe for making this work is quite interesting. It calls for one part “private Horodecki channel” \mathcal{N}_H, which is a special channel which can transmit secret classical information but no quantum information, and an erasure channel \mathcal{N}_E, which just throws the input away with probability one half. Each channel takes in two qubits and outputs two qubits (actually, the erasure channel has another output, because it is nice enough to tell you when the input was erased.) Now pull out the recipe for random coding and follow the directions, combining these two channels into one. The yield is Q^1(\mathcal{N}_H\otimes \mathcal{N}_E), which Smith and Yard show is positive.

So the quantum capacity is not additive, but what about the other two? To my knowledge, the additivity of the classical capacity is open. The additivity of the private capacity is not certain, but some facts have recently come to light. Smith and Smolin [arxiv] recently showed that when combining a certain class of channels (called retro- or echo-correctable) with erasure channels, either \mathcal{P} is not additive or else \chi isn’t, at least not for these channels. They believe that \chi is additive in this case, and so the private capacity is nonadditive, but the jury’s still out.

In summary, none of the random-coding capacities are additive, meaning regularization is unavoidable and the various capacities over quantum channels don’t have simple expressions similar to the capacity of a classical channel. Nor do the true capacities of a quantum channel seem to be additive. It seems reasonable to expect this to be the case (this is more convincing knowing the answer, of course!*): An additive capacity implies that the usefulness of the channel for the given purpose is independent of what other channels you have handy, but the phenomena of entanglement suggests that quantum information is not so easily divided. And so random coding, with its product-state codewords, may not be the best way of approaching the problem.
Here’s a table of the results

Additive? Capacity Random-Coding Capacity
Classical ? no [arxiv]
Private * [arxiv] no [arxiv]
Quantum no [arxiv] no [arxiv,arxiv]

*Which is precisely the point. As Penrose said: “The great thing about physical intuition is that it can be adjusted to fit the facts.” (from the preface to Quantum Field Theory by Mark Srednicki.)

Wednesday // 2009 January 14

(Non)Additivity, Part I

There have been a couple of very nice results recently on the (non)additivity of several important quantities in quantum information theory. I thought I’d try to explain a little of what’s going on, both for myself and for anyone reading.

At issue are various capacities of a quantum channel (any means of transmitting quantum states) \mathcal{N}, which quantify the usefulness of the channel for performing some sort of task. For instance, the classical capacity C(\mathcal{N}) tells you how much classical information can be transmitted by using the channel. Similarly, the quantum capacity Q(\mathcal{N}) characterizes how much quantum information can be sent, or equivalently, how much entanglement can be created between the sender and receiver. And the private capacity P(\mathcal{N}) gives the amount of classical information which can be secretly sent using the channel. Additivity refers to the question of whether or not combining two different channels gives a (classical/private/quantum) capacity which is the sum of the individual capacities or not.

Before diving into the quantum case, let’s look back at the classical case to see where this is all coming from. A classical channel N is essentially just a conditional probability distribution p(y|x) from the input X to output Y. Claude Shannon proved that the capacity of N to send (classical) information is given by the expression C(N)=\sup_{p_x} I(X:Y), where I(X:Y)=H(X)+H(Y)-H(XY) is the mutual information between the input and output and H is the Shannon entropy. This number tells you the ratio of how many bits of noiseless information you could send relative to the number of uses of the channel, in the limit of infinitely many uses of the channel. His proof method was slick, too, not just the result — by picking an error-correcting code at random you can (almost always) achieve the capacity!

The various capacities of the quantum channel \mathcal{N} are defined in the same way, as the ratio of whatever you’re trying to produce to the number of channel uses, in the limit of infinitely many channel uses. Unfortunately, the expressions for these capacities are a good deal more complicated that the classical capacity of a classical channel, and involve what’s called regularization. The classical capacity, for instance, is given by C(\mathcal{N})=\lim_{n\rightarrow\infty}\frac{1}{n}\chi(\mathcal{N}^{\otimes n}). Here \chi(\mathcal{N}), usually called the Holevo quantity after its discoverer, is the capacity when using random codes. To get the true capacity, we have to regularize by taking the limit of the Holevo quantity of n parallel uses of the channel. That is, we consider making arbitrary, yes entangled inputs to n channels, close our eyes and pretend that whole thing is a single use of a super-channel, and then use random coding on that.

None of this would be necessary if, like the classical mutual information, the Holevo quantity were additive, so that \chi(\mathcal{N}\otimes\mathcal{N})=2\chi(\mathcal{N}). The regularization would be redundant—we wouldn’t have to bother with super-channels and so on—and we could say that the classical capacity is single-letterizable, since the expression for it only involves a single use of the channel. However, per Hastings [arxiv], the Holevo quantity is not additive, so the classical capacity is not single-letterizable.

Nor are the private and quantum capacities!
They are similarly defined: P(\mathcal{N})=\lim_{n\rightarrow\infty}\frac{1}{n}P^1(\mathcal{N}^{\otimes n}), where P^1(\mathcal{N}) is the private capacity using random codes. Substitute Q for P and you’ve got the quantum capacity. The latter was shown not to be additive by DiVincenzo, Shor, and Smolin [arxiv, arxiv] and the former by Smith, yours truly, and Smolin [arxiv].

So we’re stuck with regularization, at least if we approach the question of capacity from the random coding angle, and nothing is additive. But this isn’t the only thing meant by nonadditivity, however—the capacities themselves may or may not be nonadditive! Since this post is already too long, I’ll consider this question in the next one.