
Yeah, a 95% confidence level (or approximately two standard deviations) is pretty standard for statistical tests.

You gotta draw the line somewhere. At the high-school statistics level, it's basically universally drawn at the 95% confidence level. If you wanna draw new lines elsewhere, you gotta make new rules yourself and recalculate all the rules of thumb.



I remember my high school AP Psychology teacher mocking p=0.05 as practically meaningless. In retrospect it's funny for a psychologist to say that, but I guess it was because he was from the more empirically minded behaviorist/cognitive school, and from time to time they have done actual rigorous experiments[1] (in rodents).

[1] For example as described by Feynman in Cargo Cult Science.


The problem is two-fold:

1. p=0.05 means that one result in 20 is going to be the result of chance.

2. It's generally pretty easy (especially in psychology) to do 20 experiments, cherry-pick -- and publish! -- the p=0.05 result, and throw away the others.

The result is that published p=0.05 results are much more likely than 1 in 20 to be the result of chance.
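
To put a number on that: if you run 20 independent tests of interventions that do nothing, the chance that at least one of them comes out "significant" at p=0.05 is about 64%. A minimal sketch (assuming independence, which is itself generous):

    # Chance of at least one false positive among 20 independent null tests
    alpha = 0.05
    n_tests = 20
    print(1 - (1 - alpha) ** n_tests)  # ~0.64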


> p=0.05 means that one result in 20 is going to be the result of chance.

You made the same mistake most people make here: you reversed the arrow of implication. It is not "successful experiment implies chance (probability 5%)" but "chance implies successful experiment (probability 5%)".

What does that mean in practice? Imagine a hypothetical scientist that is fundamentally confused about something important, so all hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is full 100%. Even without any cherry-picking.

The problem is not that p=0.05 is too high. The problem is, it doesn't actually mean what most people believe it means.
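
To make that concrete, here is a quick simulation sketch (purely illustrative, not anyone's actual study): every "treatment" below is useless by construction, yet roughly 5% of the experiments still clear p < 0.05, and 100% of those positives are spurious.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_per_group = 10_000, 30

    false_positives = 0
    for _ in range(n_experiments):
        # The null is true by construction: both groups come from the same distribution.
        control = rng.normal(0, 1, n_per_group)
        treatment = rng.normal(0, 1, n_per_group)
        if stats.ttest_ind(control, treatment).pvalue < 0.05:
            false_positives += 1

    print(false_positives / n_experiments)  # ~0.05 -- and every one of them is a false positive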


I think we're actually in violent agreement here, but I just wasn't precise enough. Let me try again:

    p=0.05 means that one POSITIVE result in 20 is going to be the result of chance and not causality
In other words: if I have some kind of intervention or treatment, and that intervention or treatment produces some result in a test group relative to a control group with p=0.05, then the odds of getting that result simply by chance and not because the treatment or intervention actually had an effect are 5%.

The practical effect of this is that there are two different ways of getting a p=0.05 result:

1. Find a treatment or intervention that actually works or

2. Test ~20 different (useless) interventions. Or test one useless intervention ~20 times.

A single p=0.05 result in isolation is useless because there is no way to know which of the two methods produced it.

This is why replication is so important. The odds of getting a p=0.05 result by chance are 5%. But the odds of getting TWO of them in sequential independent trials are 0.25%, and the odds of a positive result being the result of pure chance decrease exponentially with each subsequent replication.
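
A minimal sketch of that replication arithmetic (assuming the null is true and the trials are independent):

    # Probability that a useless intervention passes k independent replications at alpha = 0.05
    alpha = 0.05
    for k in (1, 2, 3):
        print(k, alpha ** k)  # 0.05, 0.0025, 0.000125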


> Let me try again:

> p=0.05 means that one POSITIVE result in 20 is going to be the result of chance and not causality

No, you still didn't get it. In the example above, a full 100% of positive results, 20 out of every 20, are the result of chance and not causality.

Your followup discussion is better, but your statement at the top doesn't work.

(Note also that there is an interaction between p-threshold and sample size which guarantees that, if you're investigating an effect that your sample size is not large enough to detect, any statistically significant result that you get will be several times stronger than the actual effect. Such results are also quite likely to have the wrong sign.)


> No, you still didn't get it. In the example above, a full 100% of positive results, 20 out of every 20, are the result of chance and not causality.

Yep, you're right. I do think I understand this, but rendering it into words is turning out to be surprisingly challenging.

Let me try this one more time: p=0.05 means that there is a 5% chance that any one particular positive result is due to chance. If you test a false hypothesis repeatedly, or test multiple false hypotheses, then 5% of the time you will get false positives (at p=0.05).

However...

> Imagine a hypothetical scientist that is fundamentally confused about something important, so all hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is full 100%.

This is not wrong, but it's a little misleading because you are presuming that all of the hypotheses being tested are false. If we're testing a hypothesis it's generally because we don't know whether or not it's true; we're trying to find out. That's why it's important to think of a positive result not as "confirmed experimentally" but rather as "not ruled out by this particular experimental result". It is only after failing to rule something out by multiple experiments that we can start to call it "confirmed". And nothing is ever 100% confirmed -- at best it is "not ruled out by the evidence so far".


> I do think I understand this, but rendering it into words is turning out to be surprisingly challenging.

A p-value of .05 means that, under the assumption that the null hypothesis you specified is true, you just observed a result which lies at the 5th percentile of the outcome space, sorted along some metric (usually "extremity of outcome"). That is to say, out of all possible outcomes, only 5% of them are as "extreme" as, or more "extreme" than, the outcome you observed.

It doesn't tell you anything about the odds that any result is due to chance. It tells you how often the null hypothesis gives you a result that is "similar", by some definition, to the result you observed.
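
For a concrete (and purely illustrative) instance of that definition: take "the coin is fair" as the null hypothesis and suppose you observe 60 heads in 100 tosses. The p-value is just the probability, under that null, of an outcome at least as extreme as the one you saw (here via scipy's exact binomial test).

    from scipy import stats

    # Null: the coin is fair (p = 0.5). Observation: 60 heads in 100 tosses.
    # Two-sided p-value: probability, under the null, of a count at least
    # this far from 50 in either direction.
    result = stats.binomtest(60, n=100, p=0.5, alternative="two-sided")
    print(result.pvalue)  # ~0.057: the null produces data this extreme about 5.7% of the time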


What do you think that "due to chance" means?


That is a very reasonable question, and in this context we might reasonably say that "this [individual] outcome is due to chance" means the same thing as "the null hypothesis we stated in our introduction is platonically correct".

But I don't really see the relevance to this discussion?

Suppose you nail down a null hypothesis, define a similarity metric for data, run an experiment, and get some data. The p-value you calculate theoretically tells you this:

If the above-mentioned hypothesis is true, then X% of all data looks like your data

It doesn't tell you this:

If you have data that looks like your data, then there is an X% chance that the above-mentioned hypothesis is true

Those are two unrelated claims; one is not informative -- at all -- as to the other. The direction of implication is reversed between them.

Imagine that you're considering three hypotheses. You collect your data and make this calculation:

1. Hypothesis A says that data looks like what I collected 20% of the time.

2. Hypothesis B says that data looks like what I collected 45% of the time.

3. Hypothesis C says that data looks like what I collected 100% of the time.

Based only on this information, what are the odds that hypothesis A is correct? What are the odds that hypothesis C is correct? What are the odds that none of the three is correct?
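
To spell out why those questions have no answer from the likelihoods alone: converting "how often this hypothesis produces data like mine" into "how probable this hypothesis is given my data" requires prior probabilities, which the experiment does not supply. A toy sketch -- the priors below are pure assumptions for illustration, as is the pretense that A, B and C exhaust the possibilities:

    likelihood = {"A": 0.20, "B": 0.45, "C": 1.00}  # P(data like mine | H)

    def posteriors(prior):
        evidence = sum(likelihood[h] * prior[h] for h in likelihood)
        return {h: round(likelihood[h] * prior[h] / evidence, 3) for h in likelihood}

    print(posteriors({"A": 1/3, "B": 1/3, "C": 1/3}))     # C looks most probable
    print(posteriors({"A": 0.98, "B": 0.01, "C": 0.01}))  # now A does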


This is getting deep into the weeds of the philosophy of science. It is crucially important to choose good hypotheses to test. For example:

> Hypothesis C says that data looks like what I collected 100% of the time.

What this tells you depends entirely on what hypothesis C actually is. For example, if C is "There is an invisible pink unicorn in the room, but everyone will deny seeing it because it's invisible" then you learn nothing by observing that everyone denies seeing the unicorn despite the fact that this is exactly what the theory predicts.

On the other hand, if C is a tweak to the Standard Model or GR that explains observations currently attributed to dark matter, that would be a very different situation.


> It is crucially important to choose good hypotheses to test.

But if you were able to do that, you wouldn't need to test the hypotheses. You'd already know they were good.

I'm intrigued as to why you picked those two examples. They differ in aesthetics without differing in implications, but you seem to want to highlight them as being different in an important way!


Seriously? You don't see any substantive difference between an explanation of dark matter and positing invisible pink unicorns? How do I even begin to respond to that?

Well, let's start with the obvious: there is actual evidence for the existence of dark matter -- that's the entire reason that dark matter is discussed at all. There is no evidence for the existence of invisible pink unicorns. Not only is there no evidence for IPU's, the IPU hypothesis is specifically designed so that there cannot possibly be any. The IPU hypothesis is unfalsifiable by design. That's the whole point.


If the invisible pink unicorn hypothesis was true, what about the world would be different?

If the MOND hypothesis was true, what about the world would be different?

The whole reason we have a constant supply of theories attempting to explain observations currently attributed to dark matter in terms other than "dark matter" is that people feel the dark matter theory is stupid. There's nothing else to it. I assume you feel the same way about unicorns. What's the difference supposed to be?

> There is no evidence for the existence of invisible pink unicorns.

You need to be careful here too. The fact that a theory is false does not mean there is no evidence for that theory.


> The whole reason we have a constant supply of theories attempting to explain observations currently attributed to dark matter in terms other than "dark matter" is that people feel the dark matter theory is stupid.

No, that's not true. The reason we have a "constant supply" of dark matter theories is that all of the extant theories have been falsified by observations, including MOND. If this were not the case, dark matter would be a solved problem and would no longer be in the news.

> The fact that a theory is false does not mean there is no evidence for that theory.

What makes you think the IPU theory is false? The whole point of the IPU hypothesis is that it is unfalsifiable.


You can't simply ignore the base rate, even if you don't know it.

In a purely random world, 5% of experiments are false positives, at p=0.05. None are true positives.

In a well ordered world with brilliant hypotheses, there are no false positives.

If more than 5% of experiments show positive results at p=0.05, some of them are probably true, so you can try to replicate them with lower p.
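
To put rough numbers on that (the power and base rates below are made up purely for illustration): the fraction of p < 0.05 positives that are real depends entirely on the base rate of true hypotheses, which is exactly the quantity we don't know.

    # Illustrative only: share of "positives" that are real vs. the base rate of true hypotheses
    alpha, power = 0.05, 0.8  # power is an assumption

    for base_rate in (0.0, 0.01, 0.1, 0.5):
        true_pos = power * base_rate
        false_pos = alpha * (1 - base_rate)
        ppv = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
        print(f"base rate {base_rate:.2f}: {ppv:.0%} of positives are real")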

p=0.05 is a filter for "worth trying to replicate" (but even that is modulated by cost of replication vs value of result).

The crisis in science is largely that people confuse "publishable" with "probably true". Anything "probably better than random guessing" is publishable to help other researchers, but that doesn't mean it's probably true.


> p=0.05 is a filter for "worth trying to replicate"

Yes, I think that is an excellent way to put it.

> The crisis in science is largely that people confuse "publishable" with "probably true".

I would put it slightly differently: people conflate "published in a top-tier peer-reviewed journal" with "true beyond reasonable dispute". They also conflate "not published in a top-tier peer-reviewed journal" with "almost certainly false."

But I think we're in substantial agreement here.


Do you know the difference between "if A then B" and "if B then A"?

This is the same thing, but with probabilities: "if A, then 5% chance of B" and "if B, then 5% chance of A". Those are two very different things.

p=0.05 means "if hypothesis is wrong, then 5% chance of published research". It does not mean "if published research, then 5% chance of wrong hypothesis"; but most people believe it does, including probably most scientists.


> if hypothesis is wrong, then 5% chance of published research

I would say "5% chance of positive result nonetheless" but yes, I do get this. I'm just having inordinate trouble rendering it into words.


> What does that mean in practice? Imagine a hypothetical scientist that is fundamentally confused about something important, so all hypotheses they generate are false. Yet, using p=0.05, 5% of those hypotheses will be "confirmed experimentally". In that case, it is not 5% of the "experimentally confirmed" hypotheses that are wrong -- it is full 100%. Even without any cherry-picking.

Well, that example is also introducing dependence, which is a tricky thing of course whenever we talk about chance and stats.

But there's also another issue - a statement like "5% of positive published results are by chance since we have a p<=0.05 standard" treats every set of results as if p=0.05, whereas some of them are considerably lower anyway. Though the point of bad actors cherry-picking to screw up the data also comes into play here.

(And of course, fully independent things in life are much harder to find than one might think at first.)


I agree that the point about the 'confused scientist' is important, even if that itself is not stated clearly enough. Here is my own reading:

Imagine that a scientist is making experiments of the form: Does observable variable A correlate with observable variable B? Now imagine that there are billions of observable variables and almost all of them are not correlated. And imagine that there is no better way to come up with plausible correlations to test than randomly picking variables. Then it will take a very long time and a very large number of experiments to find a pair that is truly correlated. It will be inevitable that most positive results are bogus.


So run a meta-study upon the results published by a set of authors and double-check to make sure that their results are normally distributed across the p-values associated with their studies.

These problems are solved problems in the scientific community. Just announce that regular meta-studies will be done, publish the expectation that authors' results be normally distributed across p-values, and publicly show off the meta-study.

-------------

In any case, the discussion point you're making is well beyond the high-school level needed for a general education. If someone needs to run their own experiment (A/B testing upon their website) and cannot afford a proper set of tests/statistics, they should instead rely upon high-school level heuristics to design their personal studies.

This isn't a level of study about analyzing other people's results and finding flaws in their (possibly maliciously seeded) data. This is a heuristic about how to run your own experiments and how to prove something to yourself at a 95% confidence level. If you want to get published in the scientific community, the level of rigor is much higher of course, but no one tries to publish a scientific paper on just a high school education (which is the level my original comment was aimed at).


> and double-check to make sure that their results are normally distributed across the p-values associated with their studies

What is the distribution of a set of results over a set of p-values?

If you mean that you should check to make sure that the p-values themselves are normally distributed... wouldn't that be wrong? Assuming all hypotheses are false, p-values should be uniformly distributed. Assuming some hypotheses can sometimes be true, there's not a lot you can say about the appropriate distribution of p-values - it would depend on how often hypotheses are correct, and how strong the effects are.
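
A quick sketch of that last point (illustrative, not a real meta-study): when every null is true, p-values from a correctly calibrated test come out roughly uniform on [0, 1], not normally distributed.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    # All nulls true: both samples always come from the same distribution.
    pvals = [
        stats.ttest_ind(rng.normal(size=30), rng.normal(size=30)).pvalue
        for _ in range(5_000)
    ]

    hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
    print(hist / len(pvals))  # each decile holds ~10% of the p-values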


First, I was specifically responding to this:

> I remember my high school AP Psychology teacher mocking p=0.05 as practically meaningless.

and trying to explain why the OP's teacher was probably right.

Second:

> So run a meta-study upon the results published by a set of authors and double-check to make sure that their results are normally distributed across the p-values associated with their studies.

That won't work, especially if you only run the meta-study on published results because it is all but impossible to get negative results published. Authors don't need to cherry-pick, the peer-review system does it for them.

> These problems are solved problems in the scientific community.

No, they aren't. These are social and political problems, not mathematical ones. And the scientific community is pretty bad at solving those.

> the discussion point you're making is well beyond the high-school level needed for a general education

I strongly disagree. I think everyone needs to understand this so they can approach scientific claims with an appropriate level of skepticism. Understanding how the sausage is made is essential to understanding science.

And BTW, I am not some crazy anti-vaxxer climate-change denialist flat-earther. I was an academic researcher for 15 years -- in a STEM field, not psychology, and even that was sufficiently screwed up to make me change my career. I have advocated for science and the scientific method for decades. It's not science that's broken, it's the academic peer-review system, which is essentially unchanged since it was invented in the 19th century. That is what needs to change. And that has nothing to do with math and everything to do with politics and economics.


> It's not science that's broken, it's the academic peer-review system, which is essentially unchanged since it was invented in the 19th century.

In my experience, it's not even this. Rather, it is that outside of STEM, very, very few people truly understand hypothesis testing.

At least in my experience, even basic concepts, such as "falsify the null hypothesis", are surprisingly hard, even with presumably intelligent people, such as MDs in PhD programmes.

They will still tend to believe that a "significant" result is proof of an effect, and often even believe it proves that the effect is causal in the direction they prefer.

At some point, stats just becomes a set of arcane conjurations for an entire field. At that point, the field as a whole tends to lose their ability to follow the scientific method and turns into something resembling a cult or clergy.


FWIW, I got through a Ph.D. program in CS without ever having to take a stats course. I took probability theory, which is related, but not the same thing. I had to figure out stats on my own. So yes, I think you're absolutely right, but it's not just "outside of STEM" -- sometimes it's inside of STEM too.


Yes. I was, however, not arguing that every student of the field would have to understand the scientific method well. It's enough that there is a critical mass of leaders with such an understanding, to ensure that students (including PhD students) work in ways that support it.

What I was arguing was that there is almost nobody with this understanding in many fields outside STEM.

As for your case, I don't know exactly what "probability theory" meant at your college. But in principle, if it's teaching about probability density functions and how to do integration on them to calculate various probabilities, you're a long way towards a basic understanding of stats surpassing many "stats" courses taught to social science students.

I myself only took a single "stats" course before graduating, which was mostly calculus applied to probability theory, without applications such as hypothesis testing baked in. Then I went on to do a lot of physics that was essentially applied probability theory (statistical mechanics and quantum mechanics).

Around that time, my GF (who was a bit older than me) was teaching a course in scientific methodology to a class of MD students who wanted to become "real" doctors (PhD programme for Medical Doctors), and the math and logic part was kind of hard for her (physicians may not learn a lot of stats until this level, but most of the MD PhD students are quite smart). Anyway, with a proper STEM background, picking up these applications was really easy.

Since then, I've had many encounters with people from various backgrounds that try to grapple with stats or adjacent spaces (data mining, machine learning, etc), and it seems that those who do not have a Math or Physics background, or at least a quite theoretical/mathematical Computer Science or Economics background, struggle quite hard.

Especially if they have to deal with a problem that is not covered by the set of conjurations they've been taught in their basic stats classes, since they only learned the "how" but not the "why".


There’s a professor of Human Evolutionary Biology at Harvard who only has a high school diploma[1]. Needless to say he’s been published and cited many times over.

[1] https://theconversation.com/profiles/louis-liebenberg-122680...


I don't know whether you're mocking them or being supportive of them or just stating a fact. Either way, education level has no bearing on subject knowledge. I know more about how computers, compilers, and software algorithms work than most post-docs and professors that I've run into in those subjects.

Am I smarter than them? Nope. Do I know as many fancy big words as them? Nope. Do I care about results and communicating complex topics to normal people? Yep. Do I care more about making the company money than chasing some bug-bear to go on my resume? Yep.

I fucking hate school and have no desire to ever go back. I can't put up with the bullshit, so I dropped out; I just never stopped studying and I don't need a piece of paper to affirm that fact.


To the people downvoting: at least offer a rebuttal.


The observation above is simply true. If you toss a coin 30 times, there's about a 5% chance that you'll end up with a 10-20 split or one more extreme.
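
You can check that figure with scipy's exact binomial CDF; the ~5% is the one-sided probability of 10 or fewer heads in 30 fair tosses.

    from scipy import stats

    # Probability of 10 or fewer heads in 30 tosses of a fair coin
    print(stats.binom.cdf(10, 30, 0.5))  # ~0.049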

NHST (null-hypothesis significance testing) inverts the probability logic, makes the 5% holy, and skims over the high probability of finding something that is not equal to a specific value. That procedure is then used for theory confirmation, while it was (in another form) meant for falsification. Everything is wrong about it, even if the experimental method is flawless. Hence the reproducibility crisis.



