> Insofar as causal inference has no such 'check', it's because there never was any. Causal inference is about dispelling that illusion.
Aye, and that's the issue I'm trying to understand. How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
We can focus on a particular philosophical point, like parsimony / Occam's razor, but as far as I can tell that isn't always sufficient.
There should be some way to determine the likelihood of a model's structure beyond "trust me, it works!" If there is, I'm trying to understand it!
> How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
I just want to second MJ's points here. You have to remember that 1) all models are wrong and 2) it's models all the way down. Your data is a model: it models the real-world distribution, what we might call the target distribution, which is likely intractable and often very different from your data under various conditions. Your metrics are models: obviously so given the previous point, but also, less obviously, even with perfect data they are still models. Your metrics all have limitations, and you must be careful to clearly understand what they are measuring, rather than what you think they are measuring. This is an issue of alignment, and the vast majority of people do not consider precisely what their metrics mean and instead rely on the general consensus (great ML example: FID does not measure fidelity, it is a distance measurement between distributions. But you shouldn't stop there; that's just the start). These get especially fuzzy in higher dimensions, where geometries are highly non-intuitive. It is best to remember that metrics are guides and not targets (Goodhart).
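To make the FID aside concrete, here is a minimal sketch of the computation on precomputed feature vectors (in practice the features come from an Inception network; here random vectors stand in, and `fid` is just an illustrative helper name). The score is the Fréchet distance between two Gaussians fitted to the feature sets, i.e. a distance between distributions, not a per-sample fidelity measure.

```python
import numpy as np
from scipy import linalg

def fid(feats_a, feats_b):
    # Frechet distance between the Gaussians fitted to each feature set:
    # ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # sqrtm can pick up tiny imaginary parts numerically
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Two feature sets with matching first and second moments score ~0,
# no matter what the underlying samples actually look like.
rng = np.random.default_rng(0)
a = rng.multivariate_normal(np.zeros(8), np.eye(8), size=5000)
b = rng.multivariate_normal(np.zeros(8), np.eye(8), size=5000)
print(fid(a, b))  # close to 0
```

Only the first two moments of the feature distribution enter the number, which is exactly why it says nothing about the fidelity of any individual sample.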
> There should be some way to determine the likelihood of a model's structure beyond "trust me, it works!" If there is, I'm trying to understand it!
I mean, we can use likelihood ;) if we model density, of course. But that's not the likelihood that your model is the correct model; it is the likelihood that, given the data you have, your model's parameterization can reasonably model the sampling distribution of that data. These are subtly different, and the difference comes from the point above. And then we've got to know whether you're actually operating in the right number of dimensions. Are you approximating PCA like a typical VAE? Is the bottleneck enough for a proper parameterization? Is your data in sufficient dimensionality? Does the fucking manifold hypothesis even hold for your data? What about the distributional assumptions? IID? And don't get me started on indistinguishability in large causal graphs (references in another comment).
So in practice it is best to try to make a model that is robust to your data but always maintain suspicion of it. After all, all models are wrong, and you're trying to model what the data came from, not just have a model of the data.
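To make the likelihood point concrete, here's a toy numpy/scipy sketch (the data, coefficients, and noise are all invented): two linear-Gaussian models with opposite causal directions reach essentially the same maximum log-likelihood on the same data, so the number tells you how well the parameterization fits the sampling distribution, not which structure is right.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000

# Ground truth: X causes Y.
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

def loglik(cause, effect):
    # Max log-likelihood of "effect = a*cause + b + Gaussian noise"
    # together with a Gaussian marginal on the cause, both fit by MLE.
    a, b = np.polyfit(cause, effect, 1)
    resid = effect - (a * cause + b)
    ll_marginal = stats.norm.logpdf(cause, cause.mean(), cause.std()).sum()
    ll_conditional = stats.norm.logpdf(resid, 0.0, resid.std()).sum()
    return ll_marginal + ll_conditional

print(loglik(x, y))  # model "X -> Y"
print(loglik(y, x))  # model "Y -> X": agrees to several decimal places
```

Both factorizations span the same bivariate-Gaussian family, so the maximized likelihood is identical by construction; it cannot arbitrate between the two structures.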
Evaluation is fucking hard (it is far too easy to make mistakes)
I always love finding out there are others in the tech world who care about the nuance of evaluation math and not just benchmarks. Often it feels like I'm alone. So thank you!
In general, you can't, and most of reality isn't knowable. That's a problem with reality, and us.
I'd take a Bayesian approach across an ensemble of models based on the risk of each being right/wrong.
Consider whether Drug A causes or cures cancer. If there's some circumstantial evidence of it causing cancer at rate X in population Y with risk factors Z -- and otherwise broad circumstantial evidence of it curing at rate A in pop B with features C...
then what? Then create various scenarios under these (likely contradictory) assumptions. Formulate an appropriate risk. Derive some implied policies.
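A toy sketch of that workflow, with every number invented purely for illustration: a prior over two contradictory causal models of Drug A, and the expected harm of each candidate policy under that ensemble.

```python
# All weights and probabilities below are made up for illustration.
models = {
    # name: (prior weight, P(bad outcome | treat), P(bad outcome | withhold))
    "A causes cancer in risk group Z": (0.3, 0.05, 0.02),
    "A cures cancer in population B":  (0.7, 0.01, 0.04),
}

def expected_harm(policy):
    idx = 1 if policy == "treat" else 2
    return sum(w[0] * w[idx] for w in models.values())

for policy in ("treat", "withhold"):
    print(policy, round(expected_harm(policy), 4))
# treat    0.022
# withhold 0.034  -> the decision follows the risk-weighted ensemble,
#                    not any single "true" model.
```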
This is the reality of how almost all actual decisions are made in life, and necessarily so.
The real danger is when ML is used to replace that, and you end up with extremely fragile systems that automate actions of unknown risk -- on the basis that they were "99.99% accurate", i.e., considered uncontrolled experimental condition E1 and not E2...E10_0000, which actually occur.
> How to know if model 1 or model 2 is more "real" or, for lack of a better term, more useful and reflective of reality?
You don't. Given observational data alone, it's typically only possible to determine which d-separation equivalence class you're in. Identifying the exact causal structure requires intervening experimentally.
> There should be some way to determine the likelihood of a model's structure
Why? If the information isn't there, it isn't there. No technique can change that.
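A small simulation makes the point (coefficients arbitrary): a chain X -> Y -> Z and a fork X <- Y -> Z imply the same conditional independence, X independent of Z given Y, so observational tests leave you in the same equivalence class, while an intervention on X separates them immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def chain(do_x=None):
    # X -> Y -> Z
    x = rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 0.8 * x + rng.normal(size=n)
    z = 0.8 * y + rng.normal(size=n)
    return x, y, z

def fork(do_x=None):
    # X <- Y -> Z
    y = rng.normal(size=n)
    x = (0.8 * y + rng.normal(size=n)) if do_x is None else np.full(n, do_x)
    z = 0.8 * y + rng.normal(size=n)
    return x, y, z

def partial_corr_xz_given_y(x, y, z):
    # Residualize X and Z on Y, then correlate the residuals.
    rx = x - np.polyval(np.polyfit(y, x, 1), y)
    rz = z - np.polyval(np.polyfit(y, z, 1), y)
    return np.corrcoef(rx, rz)[0, 1]

# Observationally indistinguishable: both give X independent of Z given Y.
print(partial_corr_xz_given_y(*chain()))  # ~ 0
print(partial_corr_xz_given_y(*fork()))   # ~ 0

# Intervening distinguishes them: do(X=2) shifts Z only under the chain.
print(np.mean(chain(do_x=2.0)[2]))  # ~ 0.8 * 0.8 * 2 = 1.28
print(np.mean(fork(do_x=2.0)[2]))   # ~ 0
```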
Acyclic structure on variables is a very strong presupposition that, honestly, many systems in engineering are not well described by, so I don't like this idea of boiling causality down solely to DAG-dependent phrases like "d-separation" or "exact causal structure". Exact causal structure, a.k.a. actual causality, is particular to one experimental run under one intervention.
D-separation still works for cyclic graphs; it just can't rule out causal relationships between variables that lie on the same cycle. And neither can any other functional-form-agnostic method, because in general feedback loops really do couple everything to everything else.
More rigorously: given a graph G for a structural equation model S, construct a DAG G' as follows:
- Find a minimal subgraph C_i transitively closed under cycle membership (so a cycle, all the cycles it intersects, all the cycles they intersect, and so on).
- Replace each C_i with a complete graph C'_i on the same number of vertices, preserving outgoing edges.
- Add edges from the parents of any vertices in C_i (if not in C_i themselves) to all vertices in C'_i.
- Repeat until acyclic.
d-separation in G' then entails independence in S, given reasonable smoothness assumptions whose details I don't remember off the top of my head.
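Here's a rough networkx sketch of that construction, under one interpretation of the steps (it assumes the "minimal cycle-closed subgraphs" are exactly the nontrivial strongly connected components, and records the "complete graph" on each component as symmetric adjacencies so the remaining directed part comes out acyclic in a single pass; `collapse_cycles` is just an illustrative name):

```python
import networkx as nx

def collapse_cycles(G: nx.DiGraph):
    """Return (directed, within): `directed` keeps every edge not inside a
    cycle-closed component, plus edges from each component's external parents
    to all of its vertices; `within` records the complete adjacency inside
    each component. Nontrivial strongly connected components are already
    closed under cycle membership, so one pass suffices."""
    components = [c for c in nx.strongly_connected_components(G) if len(c) > 1]

    directed = nx.DiGraph()
    directed.add_nodes_from(G.nodes)
    within = set()

    # Keep edges whose endpoints are not in the same cycle-closed component
    # (this preserves each component's outgoing edges).
    for u, v in G.edges:
        if not any(u in c and v in c for c in components):
            directed.add_edge(u, v)

    for c in components:
        # External parents of the component point into every vertex of it.
        parents = {p for v in c for p in G.predecessors(v)} - c
        directed.add_edges_from((p, v) for p in parents for v in c)
        # "Complete graph" on the component, stored as unordered pairs.
        within.update(frozenset((u, v)) for u in c for v in c if u != v)

    assert nx.is_directed_acyclic_graph(directed)
    return directed, within
```

The `within` pairs are exactly the variables on a shared cycle that, per the caveat above, no functional-form-agnostic method can separate.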
This isn't a quality of fit issue (and even if it were, linear models are not always sufficient). The problem is that different causal structures can entail the same set of correlations, which makes them impossible to distinguish through observation alone.
Grandparent commenter here -- I'm glad I communicated my concern well enough; I feel like you and mjburgess have nailed it. Fit metrics alone aren't sufficient to determine appropriate model use (even ignoring the issues of p-hacking and other ills).