
No, meaningfulness is not what the effect size (a number) measures; it's just a statistical metric. So in this case, I think it's clearly meaningful if humans have the power to affect other humans' inner states non-verbally. For example, if a single smile at a stranger makes their inner state more positive, that's impactful, and that person then has the same power to affect the emotional states of others. A trivial encounter can, in that way, make a significant positive or negative impact. This study shows that the effect is robust in both directions.

Effect sizes tend to get smaller as sample size grows for any kind of psychological/sociological study. This was a study with an N of over half a million; most studies use N < 1000. In other words, the effect size is impressive despite looking small.

Side note: it's interesting that paranormal researchers often get criticized for finding small effect sizes in their studies (partly on the reasoning that small effects are more likely to be due to chance or noise). For all intents and purposes, if you believe such an effect to be real but small, it would still have paradigm-changing consequences for how we view the way brains work, a radical departure from what mainstream science currently believes. It's another example of why a small effect size can be incredibly interesting, course-altering even.



I can't say I follow you entirely, but you seem to be confusing "effect size" with p-values. At least, that's what I conclude from "Effect size tends to get smaller the bigger the sample size for any kind of psychological/sociological study".

Effect size just refers to the size of the effect. So if you do an intervention on some children and it leads to them being taller as adults than a control group of children by an average of four inches, you've got an effect size of "four inches". It's difficult to see how this can be described as "just a statistical metric"; for one thing, it has a dimension (here, inches). The effect size tells you what to expect from something.
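
To make that concrete, here's a minimal sketch in Python (the heights are invented for illustration):

    # Hypothetical adult heights, in inches, for an intervention group and
    # a control group. The effect size is just the difference in means,
    # expressed in the units of the outcome itself.
    treatment = [70.1, 71.3, 69.8, 72.0, 70.9]
    control = [66.2, 67.0, 65.9, 66.8, 67.1]

    def mean(xs):
        return sum(xs) / len(xs)

    print(f"effect size: {mean(treatment) - mean(control):.1f} inches")  # ~4 inches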

A p-value answers the statistical question "assuming the true effect size is zero, what is the probability of seeing data at least as extreme as mine by chance?" With p-values, smaller is better. These do in fact get smaller with increasing sample size, whereas an effect size that diminishes with increasing sample size is solid evidence that the effect is not real.
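
A quick simulation sketch of the distinction (numpy/scipy, invented numbers): as N grows, the p-value collapses toward zero while the estimated effect just settles near its fixed true value.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.2  # fixed true difference in group means

    for n in (50, 500, 5000, 50000):
        a = rng.normal(true_effect, 1.0, n)  # "treatment" group
        b = rng.normal(0.0, 1.0, n)          # control group
        _, p = stats.ttest_ind(a, b)
        print(f"n={n:>6}: estimated effect {a.mean() - b.mean():+.3f}, p = {p:.2g}")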


> These do in fact get smaller with increasing sample size, whereas an effect size that diminishes with increasing sample size is solid evidence that the effect is not real

The realness of an effect is not judged by the effect size alone; it also depends on how much confidence you have in the result (lower p and/or higher N). You can rerun most accepted psychological effects that were established with small sample sizes on a much greater sample size and expect to see a smaller effect size. If the effect size dissipates and the p-value increases, that's where you have the biggest problem. If your confidence grows in proportion to the shrinking of the effect size, you should still have greater trust in the realness of the effect.

The core of my point, though, is that looking at the effect size and concluding it isn't meaningful distorts things a bit here. A small numerical value computed over a large data set does not capture the actual meaningfulness of the impact the effect has on a single person-to-person interaction.


> You can rerun most accepted psychological effects that were established with small sample sizes on a much greater sample size and expect to see a smaller effect size

This is because, not to put too fine a point on it, the effects aren't real. The statistical power of a study places a lower limit on the effect size the study is capable of reporting (for a given p-value threshold). So underpowered studies report absurdly high effect sizes for whatever phenomenon is being studied, because they're not capable of showing statistical significance for smaller effect sizes. Andrew Gelman likes to write about this.

The upshot is that, if an effect isn't real, a paper demonstrating statistical significance (traditionally, p < 0.05) will tend to report an effect size close to the minimum it can -- since the true effect size is small, you're much more likely to find data showing a large effect size than a gargantuan one. Then, obviously, a study with larger N will show a smaller effect size still, because it has the power to show p < 0.05 at that smaller effect size.

But the true effect size doesn't change from experiment to experiment -- the true effect size is the effect size you would measure at N = ∞. There is no reason to expect the effect size to diminish in a larger study unless you already expected it to be smaller than originally reported. I could measure the extra time required for an apple to hit the ground when released from 9 feet up instead of 4 feet up. The effect size, measured in seconds, will not steadily diminish as I measure more and more apples; it will stabilize at a quarter of a second.
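
Here's a rough simulation of that selection effect (all numbers invented): the true effect stays fixed and small, but if only studies clearing p < 0.05 get reported, the small-N studies that survive report badly inflated effect sizes, and the inflation melts away as N grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_effect = 0.1  # small, fixed true difference, in SD units

    for n in (20, 200, 20000):        # per-group sample size
        reported = []
        for _ in range(1000):         # simulate many independent studies
            a = rng.normal(true_effect, 1.0, n)
            b = rng.normal(0.0, 1.0, n)
            _, p = stats.ttest_ind(a, b)
            if p < 0.05:              # only "significant" results get reported
                reported.append(abs(a.mean() - b.mean()))
        print(f"n={n:>6}: {len(reported):>4}/1000 significant, "
              f"mean reported effect {np.mean(reported):.2f}")

With these settings, the tiny studies that reach significance should report effects several times the true 0.1, while the large-N runs should report roughly 0.1 itself.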

As your parent comment points out, this study, with its very large sample size, has the statistical power to report a small effect size, which is what it's doing. However, a very small effect size is just another way of saying "this effect, while just barely large enough to measure, is much too small to matter to anyone".


I think you continue to miss both my points. If you're measuring apples falling to the ground, of course you wouldn't expect the effect size to diminish with higher N. But social/psychological studies are not like physics experiments; you have far less control over variables. This is especially true with a large N, where the environment is noisier and more variables coalesce; typically, you get less precision in what you can say about the individual data points affected.

I wouldn't argue that a larger effect size here wouldn't be more impressive; of course it would. I'm just saying that a small effect size for a study of this kind does not diminish its meaningfulness, and that it's to be expected for these kinds of studies. There's an effect that we're very confident is real, works in both directions, and has real-world implications.


A noisier environment doesn't mean you should expect smaller effects. It means your measurement is unstable. This problem also occurs in physics: using the standard approximation of gravity as 32 feet per second per second, the extra time required to fall 9 feet instead of 4 feet is exactly 1/4 second. Should you actually try the experiment, you'll quickly notice that your measured time varies from attempt to attempt. There is a government office which (among other duties) measures the weight of a coin (the same physical coin) every day and records the result. Some days are anomalous. There's variation every day.

What larger N does is let you see past the noise. With a large sample, the effect of the noise in your measurements averages out toward zero, letting you estimate the effect you're looking for more accurately. So over 200,000 apple drops, I should see an average fall-time discrepancy very close to 0.25 seconds; whereas with 2 apple drops, I might for whatever reason measure the discrepancy as 2/3 of a second. That 0.67-second estimate is way off because of the small N.
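
A sketch of that convergence (Python with numpy; the 0.2-second timing noise is an invented stand-in for measurement error):

    import numpy as np

    rng = np.random.default_rng(0)
    g = 32.0                                             # ft/s^2, as above
    true_gap = (2 * 9 / g) ** 0.5 - (2 * 4 / g) ** 0.5   # exactly 0.25 s

    for n in (2, 200, 200_000):
        # Each drop's timing is off by a random error (assumed sd: 0.2 s).
        measured = true_gap + rng.normal(0.0, 0.2, n)
        print(f"{n:>7} drops: mean measured gap {measured.mean():.3f} s")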

If, as you work with larger and larger sample sizes, the effect you're measuring recedes steadily to zero, the obvious conclusion is that it's all noise.

However! We started this by talking about a different thing entirely. You say this:

> There's an effect that we're very confident is real, works in both directions, and has real-world implications.

This study has immense statistical power and a minuscule effect size. The immense statistical power means, yes, "that we're very confident [the effect] is real". That's measured (from a traditional perspective) by the p-value.

The effect size measures the real-world implications. A very small effect size means that the real-world implications are likewise very small.

As a toy example, suppose I do a study finding that feeding children between the ages of 4 and 7 meat with bones in it vs meat without bones increases their height as adults by three feet (p < 0.9). The real-world implications are huge. Our confidence in the study is low.


The physics measurements are interesting for different reasons. Those measure very objective things; even when the measurements vary, they vary in ways we can conceivably account for. Some physicists even raise the idea that certain constants aren't all that constant.

But it's still vastly different to appreciate statistical data coming from those kinds of experiments versus data that touch on psychological and social effects. Your height/nutrition example is convenient because we can all appreciate an effect expressed in objective units, such as inches, that we can see with our own eyes. It's much harder to weigh the effect that, say, emotional states have in pure numbers.

I could continue this discussion endlessly, but it's probably not going to get anywhere.



