It's been years since I took probability, but why start from this particular set of assumptions?
You've shown that the correct form and constant factors fall out from assuming rotational invariance and independent components. But why is this particular set of assumptions intuitively what we call the "normal distribution"?
The same distribution should fall out starting from other more useful equivalent definitions right? (e.g., the maximum entropy distribution for some mean/variance, or the limiting distribution under CLT assumptions, etc)
This was such an amazing read. The historical notes are awesome and not what I expected from a math textbook. I should really buy a copy of Jaynes!
Some teasers for others who might want to read it:
- The proof in this blog post is called the Herschel-Maxwell derivation. The weird assumptions were motivated by astronomy: finding the 2D distribution of errors in star position measurements
- Gauss was the one who saw the maximum likelihood application. That's why we call it the Gaussian distribution even though it was discovered earlier by others such as de Moivre. Laplace was the bro who expanded and popularized Gauss's work!
- ^ That was like the first two pages of the chapter
I'm glad you liked it. :-) The whole book is an amazing read. It's what sparked my interest in probability theory a decade ago. Unfortunately, it's an unfinished work because Jaynes died while writing it in 1998. A version edited by one of his students was posthumously published (the one you can buy on Amazon). There are other gems in his bibliography and unpublished work. I never really _got_ thermodynamics until I read some of his half-finished papers:
Thank you for the link to Jaynes' book! Really nice to see the different approaches.
I'm intrigued by your comment on maximum entropy, as I personally struggle with the maximum entropy derivation because it uses differential ("continuous") entropy to derive the Gaussian under constraints on the first and second moments. Differential entropy does not satisfy the same properties as entropy for a discrete distribution, some of which are the very properties that motivated entropy as a measure of information in the first place. Jaynes himself wrote a paper on this topic of continuous entropy in the 60s (can dig out the reference in the morning).
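For reference, the derivation I mean goes roughly like this (my own sketch of the standard Lagrange-multiplier argument, not from the post):

```latex
% Maximize h[p] = -\int p(x)\,\log p(x)\,dx subject to
%   \int p\,dx = 1, \quad \int x\,p\,dx = \mu, \quad \int (x-\mu)^2\,p\,dx = \sigma^2.
% Setting the functional derivative of the Lagrangian to zero gives
-\log p(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0
\;\Longrightarrow\;
p(x) \propto \exp\!\big(\lambda_1 x + \lambda_2 (x-\mu)^2\big),
% and imposing the three constraints forces \lambda_1 = 0 and \lambda_2 = -\tfrac{1}{2\sigma^2}:
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,\exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big).
```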
Even ignoring the continuous-entropy issue, I also struggle a bit with "we're only constraining the first and second moments". Why exactly the first two? Why not the first three, etc.? One could say it's motivated by the fact that the Gaussian is the only distribution with a finite number of non-zero cumulants, but that seems a bit handwavey?
Would genuinely appreciate some input here, as the Principle of Maximum Entropy is something I have a bit of trouble coming to terms with for the reasons described above (and in general, mainly because the choice of constraints is arbitrary).
There are kind of two issues here, at least. One is the continuous-discrete issue and the other is the moment issue.
As for the moment issue, the short story is that once you get into three or four moments, there isn't a general maximum entropy distribution anymore, except for some special idiosyncratic cases with three, I think. So the normal is, in some ways, the most conservative distribution you can use in a general, unspecified scenario. You can constrain more moments, but then there isn't a single maxent distribution that applies across all third- and fourth-moment scenarios the way one does for the first two moments.
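To make that a bit more concrete (rough sketch, ignoring edge cases): a maxent density on the whole real line with a third-moment constraint would have to take an exponential-family form that can't be normalized.

```latex
% A maxent density under constraints on the first three moments would have to look like
p(x) \propto \exp\!\big(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3\big),
% but for any \lambda_3 \neq 0 the cubic term dominates and the exponent \to +\infty
% in one direction (x \to +\infty or x \to -\infty), so \int p\,dx diverges:
% no such density exists on all of \mathbb{R}.
```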
As for the continuous versus discrete thing, there's some caution that's warranted, but a lot of the maxent principles apply, and there are similar, closely related principles (minimum description length, which has been shown to be equivalent to maximum entropy inferentially in a sense) that generalize in the continuous case. If you think of everything as discretized (as is the case with machine representation), there's some work showing that the discretized and continuous cases are sort of related up to a constant (doi: 10.1109/TIT.2004.836702).
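The discretization relationship I have in mind, roughly (standard quantization argument, my notation):

```latex
% Quantize X into bins of width \Delta and call the resulting discrete variable X_\Delta.
H(X_\Delta) \;\approx\; h(X) - \log \Delta \qquad (\Delta \to 0)
% The discrete entropy blows up like -\log\Delta, but that offset is the same for every
% distribution, so entropy comparisons at a fixed discretization match the differential ones.
```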
I realize this is a bit hand-wavy but it is a HN post.
Thank you, I really appreciate the response. This was useful.
I do see the reasoning for choosing the normal due to it being the only distribution with a finite number of non-zero cumulants, and thus, as you nicely pointed out, constraints on a finite number of higher-order moments will not give a unique distribution.
But, due to the issues we've now mentioned, I find myself a bit uneasy with maxent as a derivation of, and/or as an explanation of, the ubiquity of the normal distribution. So I find myself more comfortable with some of the other derivations demonstrated by Jaynes.
And thank you for the paper reference; will have a proper look at it sometime. It might be related to
I enjoyed reading the chapter, but I didn't have enough time to put into it to understand all his derivations as well as I would like. So I may be incorrect here, but I don't think he is proving that the gaussian distribution is correct, just that it is a good (or the best) one to use.
Does someone have a dart board? It would be nice to take a look at some real data. Maybe 20 throws? Or 200?
I don't think the results will fit a gaussian particularly well. I think there will be more darts at large distances than expected. For that matter, I would guess, if there is enough data, that the mean would be slightly below the point of maximum likelihood (as in, closer to the floor).
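Here's roughly how I'd check it once someone has coordinates. The "throws" below are made-up placeholders so the code runs, not real data; swap in measured (x, y) positions.

```python
# Sketch: compare the observed far-miss count against what a fitted gaussian predicts.
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Placeholder data: mostly gaussian scatter plus a few wild throws (purely illustrative).
throws = rng.normal(0.0, 1.0, size=(n, 2))
wild = rng.choice(n, size=10, replace=False)
throws[wild] *= 4.0

r = np.linalg.norm(throws, axis=1)           # radial miss distances
sigma_hat = np.sqrt(np.mean(throws ** 2))    # per-axis scale fitted from the data

# Under a 2-D isotropic gaussian, P(R > r0) = exp(-r0^2 / (2 sigma^2)) (Rayleigh tail).
r0 = 3 * sigma_hat
expected = n * np.exp(-r0 ** 2 / (2 * sigma_hat ** 2))
observed = np.sum(r > r0)
print(f"throws beyond {r0:.2f}: observed {observed}, gaussian model predicts {expected:.2f}")
```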
Jaynes usually tries to move away from an interpretation of probability distributions being "correct" as in representing a fact about the world, and towards a definition that is more about a state of knowledge and uncertainty about the world. Distributions are a property of a model or of a knowledgeable agent, not of an object or situation. See Chapter 10: "Physics of 'Random Experiments'". However the two definitions sorta become indistinguishable when you are dealing with experiments that are repeated enough times.
I'm not a mathematician but I think these are valid concerns. I'm only very superficially familiar with the continuous entropy debate. My understanding is that the continuous version is not quite as mathematically ironclad as the discrete version. Still, the concept of continuous entropy seems to be useful for reasoning about things. That's been good enough for me. I don't know if Shannon would approve.
As for the idea of using only the first two moments, to me that's just based on the very Bayesian idea of reducing the number of parameters you work with in order to make your models more easily learnable and computable. Most of the time, you only have enough data to do parameter estimates on a limited number of parameters. As you add more parameters, the model gets much more difficult to learn, as well as mathematically and computationally harder to manipulate. You also get diminishing returns in terms of predictive power. The "blessing of abstraction", i.e. reducing the number of parameters and possible states in our models, is the best tool we have against the "curse of dimensionality".
Or as Yudkowsky puts it:
"Our physics uses the same theory to describe an airplane, and collisions in a particle accelerator - particles and airplanes both obey special relativity and general relativity and quantum electrodynamics and quantum chromodynamics. But we use entirely different models to understand the aerodynamics of a 747 and a collision between gold nuclei. A computer modeling the aerodynamics of the 747 may not contain a single token representing an atom, even though no one denies that the 747 is made of atoms.
A useful model isn't just something you know, as you know that the airplane is made of atoms. A useful model is knowledge you can compute in reasonable time to predict real-world events you know how to observe. Physicists use different models to predict airplanes and particle collisions, not because the two events take place in different universes with different laws of physics, but because it would be too expensive to compute the airplane particle by particle."
However, when you do have enough data, easy enough equations and enough computing power to deal with higher moments, by all means do so!
Most certainly. I did not intend to question the usefulness of maxent models. I just find myself a bit uneasy with maxent as a derivation of and/or explanation of the ubiquity of the normal distribution, given the two issues mentioned above when we're talking about the continuous case. I was wondering if you might have some insight into the issue which could remedy this feeling of uneasiness :)
And regarding the moments, it's just that the normal distribution is the only distribution with a finite number of non-zero cumulants. Therefore, constraining higher-order moments is not so straightforward.
Also, it might be worth noting that, technically, if one were sufficiently UNreasonable, one could constrain the target distribution to take on specific values for specific inputs. This would not be very useful in any real-world application. Then choosing between constraining only the first moment ("minimal" constraints) and constraining each observed point to take on its normalized frequency ("maximal" constraints) becomes entirely up to you. Therefore I don't quite see how maxent models give us the tools for deciding between complexity and accuracy, since maxent models can sit at either end of the spectrum depending on what constraints we choose.
(Unsure if you were implying that it did, but nonetheless it might be something to note.)
The normal density function does fall out of proofs of the CLT. Usually those proofs stop at either the characteristic function or the moment-generating function of the normal distribution. From a proof that results in the moment-generating function of the limiting distribution [0], you can derive the normal density via an inverse Laplace transform [1]. You can probably do the same by deriving the characteristic function of the limiting distribution and taking the inverse Fourier transform, but I've never seen that proof.
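The skeleton of that argument, as far as I can sketch it (hand-waving the technical conditions):

```latex
% For iid X_i with mean 0 and variance \sigma^2, let S_n = (X_1 + \dots + X_n)/\sqrt{n}. Then
M_{S_n}(t) = \Big[M_X\!\big(t/\sqrt{n}\big)\Big]^n
           = \Big[1 + \tfrac{\sigma^2 t^2}{2n} + o(1/n)\Big]^n
           \;\longrightarrow\; e^{\sigma^2 t^2/2},
% the moment-generating function of N(0, \sigma^2). Inverting it (or the analogous
% characteristic-function limit e^{-\sigma^2 t^2/2} via an inverse Fourier transform) gives
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-x^2/(2\sigma^2)}.
```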
There are many ways to derive the normal distribution from first principles, depending on which first principles you start with, because it has many useful properties and shows up in many situations.
Rotational invariance makes sense if you're free to pick the scale and direction of your coordinates. If that's the case, then a large class of distributions can be made rotationally invariant (at a guess, I'd say all elliptical distributions [0], but I don't have a handy proof).
Independence of (x, y) components is a trickier one to justify IMO. Something like a Cauchy or Student t model could be made rotationally invariant, but the coordinates are no longer independent.
A Student t distribution behaves like a standard Gaussian, but with a second source of randomness that controls the size of its radial distance from the origin. So you get darts that are often on-target, but with occasional huge misses. If the X component is gigantic, that suggests that the Y component is also large, breaking independence.
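If you want to see that numerically, here's a quick simulation sketch (arbitrary parameters, nothing from the post): build a rotationally symmetric bivariate Student t as a Gaussian divided by a shared chi-square scale, and compare the chance of a big Y miss with and without conditioning on a big X miss.

```python
# Sketch: the two coordinates of a bivariate Student t share a random scale, so a huge
# X miss makes a huge Y miss more likely; for the gaussian, conditioning changes nothing.
import numpy as np

rng = np.random.default_rng(1)
n, nu = 1_000_000, 3                          # sample size and t degrees of freedom (arbitrary)

z = rng.normal(size=(n, 2))                   # independent standard gaussian components
s = np.sqrt(rng.chisquare(nu, size=n) / nu)   # one shared random scale per dart
t = z / s[:, None]                            # bivariate Student t, rotationally symmetric

for name, xy in [("gaussian", z), ("student-t", t)]:
    x, y = xy[:, 0], xy[:, 1]
    big_x = np.abs(x) > 3
    p_y = np.mean(np.abs(y) > 3)              # unconditional P(|Y| > 3)
    p_y_big_x = np.mean(np.abs(y[big_x]) > 3) # same, but given |X| > 3
    print(f"{name:10s} P(|Y|>3) = {p_y:.4f}   P(|Y|>3 | |X|>3) = {p_y_big_x:.4f}")
```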
G. C. Rota has an analysis of justifications of the univariate normal distribution in his Fubini lectures, problem 7 (DOI: https://doi.org/10.1007/978-88-470-2107-5_5), a great place to get a sense of borderline topics in the field.
As I understand it (or maybe I should say "in my opinion"), the magic of the gaussian distribution lies in the two assumptions you make. You have a rotationally invariant answer (X and Y are related), but you assume the distributions in X and Y are independent. And these are valid things to assume.
The gaussian distribution is not a particularly good representation of most real problems, in the sense that the probability of large errors decreases far too rapidly. Maybe there are ideal cases you can say are gaussian, but in any real problem there are some kind of outliers. We go into a calculation assuming we have gaussian noise, but really we don't, and we have to add additional logic to handle these "outlier" cases.
The thing that is magic is that the gaussian distribution factorizes. If we are evolving the state of a system after taking a measurement, as long as the system has gaussian errors and the measurement has a gaussian error, the system after the measurement will still be gaussian. We can parameterize our errors with two numbers, the center and the spread.
If we didn't have this factorization, the distribution would change shape after the measurement. We would have to keep a ton more information, the amount of which grows geometrically with the number of variables we have. It is just intractable.
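Concretely, the update I have in mind is the standard product-of-gaussians identity (my notation):

```latex
% A gaussian state N(x; \mu_0, \sigma_0^2) times a gaussian measurement likelihood N(z; x, \sigma_z^2)
% is again gaussian in x:
N(x;\,\mu_0,\sigma_0^2)\; N(z;\,x,\sigma_z^2) \;\propto\; N(x;\,\mu_1,\sigma_1^2),
\qquad
\frac{1}{\sigma_1^2} = \frac{1}{\sigma_0^2} + \frac{1}{\sigma_z^2},
\qquad
\mu_1 = \sigma_1^2\!\left(\frac{\mu_0}{\sigma_0^2} + \frac{z}{\sigma_z^2}\right).
% Still just a center and a spread, no matter how many measurements you fold in.
```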
So as I see it at least, we use this distribution because we can, more so than because it is the correct one. (But, of course, it also still does work pretty well!)
Interesting. The assumptions are that the normal distribution is two-dimensional, that it's rotationally invariant, and that the X and Y coordinates are statistically independent. From this, the polar form phi(r) is proportional to the Cartesian one, f(x) * f(y): since phi(r) = phi(sqrt(x^2 + y^2)), it is proportional to f(x) * f(y).
With that last step following from rotational invariance and the statistical independence of the two axes. I haven't followed the rest, but I assume (maybe with some other minor assumptions?) that the equation lambda f(sqrt(x^2 + y^2)) = f(x) * f(y) uniquely determines the Gaussian.
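For what it's worth, here's roughly how that equation pins down the Gaussian (my sketch, glossing over regularity conditions):

```latex
% Start from \lambda\, f\!\big(\sqrt{x^2+y^2}\big) = f(x)\,f(y) and set y = 0:
% this gives \lambda = f(0). Writing g(x) = \log\big(f(x)/f(0)\big), the equation becomes
g\!\big(\sqrt{x^2+y^2}\big) = g(x) + g(y).
% In the variable u = x^2 this is Cauchy's additive functional equation, so for any
% reasonably well-behaved f we get g(x) = -\alpha x^2, i.e.
f(x) = f(0)\, e^{-\alpha x^2},
% and normalizability forces \alpha > 0, which is the Gaussian.
```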
I've long struggled to find a clear explanation of where the Normal formula comes from. The best description I've seen derives the limiting distribution of sums of uniform distributions on a unit interval, say. The sum of independent and identically distributed random variables is a convolution, which can be 'de-convolved' by taking the Fourier transform. After the Fourier transform, the sum of the random variables turns into a product which can easily be approximated. There's extra work involved in proving that the Fourier transform of a Gaussian is itself a Gaussian, plus some other technicalities (not to mention this is only for identical uniform distributions), but this seems much better motivated to me than any other description I've heard, including this one.
As a benefit, if I remember correctly, the same trick works to derive the basics of Levy stable distributions as well.
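If anyone wants to see the convolution picture above numerically, here's a toy sketch (my own throwaway code, not from any source): repeatedly convolve the Uniform(0, 1) density on a grid and compare it to the Gaussian with the same mean and variance.

```python
# Sketch: the density of a sum of n iid Uniform(0, 1) variables, computed by repeated
# numerical convolution, rapidly approaches the matching Gaussian.
import numpy as np

dx = 0.001
u = np.ones(int(1 / dx))                  # Uniform(0, 1) density sampled on a grid

n = 6                                     # number of uniforms in the sum (arbitrary choice)
dens = u.copy()
for _ in range(n - 1):
    dens = np.convolve(dens, u) * dx      # density of the running sum

grid = np.arange(len(dens)) * dx          # support of the n-fold sum is [0, n]
mean, var = n / 2, n / 12                 # exact mean and variance of the sum
gauss = np.exp(-(grid - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(f"max |sum-of-uniforms density - gaussian| for n={n}: {np.max(np.abs(dens - gauss)):.4f}")
```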