Thanks for this submission. I hadn't seen it. A somewhat-related recent study is this National Academies report on simulation-guided discovery systems:
https://www.nationalacademies.org/our-work/realizing-opportu...
OP is a long read and suffers from being something of a catalog of promising ideas -- so it's hard to digest. But from all the examples given, you do get the feeling there's something really powerful and new coming into being that combines the large datasets you can obtain from simulation with machine learning/stats tools to build models, find posteriors, etc.
In science applications of remote sensing, my group at NASA/JPL has obtained huge speedups from replacing computationally-expensive physics-based forward models with emulators based on ANNs or GPs. In terms of OP, this is "Motif 2" (surrogates/emulators) combined with fitting based on "Motif 7" (differentiable programming).
You build a training set using selected runs of the physics-based forward model, and train an emulator that links (say) at-sensor radiances with ground or atmosphere conditions. Roughly a 400-variable to 400-variable function emulator.
Then you use the emulator for (say) each 30m x 30m pixel of a global satellite dataset, instead of running the forward model for each such pixel. Replacing the expensive forward model can reduce workload for a NASA science mission by factors of 10 or more.
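For concreteness, here is a minimal sketch of that workflow in Python (not the actual JPL code; the toy forward model, the 400-dimensional shapes, and the sampling of training runs are all placeholders):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
N_TRAIN, N_STATE, N_RADIANCE = 2000, 400, 400
W = rng.standard_normal((N_STATE, N_RADIANCE)) / N_STATE

def forward_model(state):
    """Stand-in for the expensive radiative-transfer forward model."""
    return np.tanh(state @ W)

# 1. Build a training set from selected runs of the physics-based model.
states = rng.uniform(-1.0, 1.0, size=(N_TRAIN, N_STATE))
radiances = np.array([forward_model(s) for s in states])

# 2. Train the emulator (surface/atmosphere state -> at-sensor radiance).
emulator = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=500)
emulator.fit(states, radiances)

# 3. Apply the cheap emulator per pixel instead of the forward model.
pixel_states = rng.uniform(-1.0, 1.0, size=(100_000, N_STATE))
predicted_radiances = emulator.predict(pixel_states)
```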
This capability really didn't exist 10 years ago, even though a lot of the algorithms were almost there, and the computational capacity was almost there. It's some kind of capacity-building thing where you have to have competence across several kinds of "data science" in order to make a system that's effective.
Similar things are happening in galaxy formation too; numerical models here cost tens of thousands to tens of millions of CPU hours to run, and have maybe 8-10 free parameters. We use emulators based on a training set of maybe 100-200 real simulations to calibrate those free parameters against scaling relations (e.g. sizes of galaxies vs. their mass).
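A hedged sketch of that kind of emulator-based calibration (not our actual pipeline; the toy simulation summary, the parameter ranges, and the observed target value are placeholders):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(1)
N_SIMS, N_PARAMS = 150, 10

def run_simulation_summary(theta):
    """Stand-in for one full simulation reduced to a scaling-relation
    summary, e.g. median galaxy size at fixed stellar mass."""
    return np.sin(theta).sum() + 0.1 * theta[0] ** 2

# Design of the (expensive) training simulations over the free parameters.
thetas = rng.uniform(0.0, 1.0, size=(N_SIMS, N_PARAMS))
summaries = np.array([run_simulation_summary(t) for t in thetas])

# Fit a Gaussian-process emulator: parameters -> scaling-relation summary.
kernel = ConstantKernel() * RBF(length_scale=np.ones(N_PARAMS))
emu = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
emu.fit(thetas, summaries)

# Calibrate: pick the parameters whose emulated summary best matches data.
observed_value = 4.2  # placeholder for the observed scaling relation
candidates = rng.uniform(0.0, 1.0, size=(50_000, N_PARAMS))
pred = emu.predict(candidates)
best_theta = candidates[np.argmin((pred - observed_value) ** 2)]
```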
Odd that they did not mention these when discussing sub-grid modelling, as afaik these astronomy simulations (along with planetary impact simulations) are probably among the first to have used them…
Galaxy formation is inherently chaotic, i.e. it exhibits sensitive dependence on initial conditions. My first thought upon reading your comment was that an ANN approach must be wrong because of chaos. But of course a purely numerical gravitational simulation will also be wrong because of chaos.
So how do you tell which method is ... less wrong?
Ah, we use the ML methods to understand what happens to a large collection of galaxies (maybe 100k to 1 million), all co-evolving together. Trying to use these methods vs. direct simulation on individual galaxies is a no-go; there are too many nonlinearities.
Your comment about a numerical gravitational simulation being 'wrong' because of chaos may be true at the level of an individual resolution element, yes, but our simulations are made up of many billions of individual particles. Like traditional CFD simulations, we can also demonstrate some level of convergence.
You raise several points which I’ll try to address.
We did develop a surrogate model for this problem, see the narrative and references below eq. 1 of [1]. The surrogate is perhaps 10x faster than the full physics model. It was developed by a postdoc working with the science team and took a few months. (We needed a lot of science team expertise because they advised on what computations could be simplified and what parameters could be lumped together.)
We use the surrogate for various Monte Carlo analyses, parameter sweeps, etc. And for MCMC as described in [1]. All tasks that the full physics model is too slow for.
But the emulator, developed more recently, is much, much faster than the surrogate. It did not require as much science team expertise to develop; it was more of a data science problem than an integrated physics/data science task. So it was faster to put together.
The increased speed of the emulator means that all those parameter sweeps and MCMC inversions can be much broader in scope.
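To make the MCMC point concrete, here is a minimal sketch (not the actual retrieval code from [1]; the toy emulator, data, prior bounds, and noise level are all placeholders) of a random-walk Metropolis sampler where each step costs only one emulator call:

```python
import numpy as np

rng = np.random.default_rng(2)
x_grid = np.linspace(0.0, 1.0, 50)

def emulator(theta):
    """Placeholder for a trained emulator: parameters -> predicted spectrum."""
    return theta[0] * x_grid + theta[1] * np.sin(6.0 * x_grid)

noise_sigma = 0.05
observed = emulator(np.array([0.3, 0.7])) + rng.normal(0.0, noise_sigma, 50)

def log_posterior(theta):
    if np.any(theta < 0.0) or np.any(theta > 1.0):   # flat prior on [0, 1]^2
        return -np.inf
    resid = observed - emulator(theta)
    return -0.5 * np.sum((resid / noise_sigma) ** 2)

# Random-walk Metropolis; cheap because each step is only an emulator call.
theta = np.array([0.5, 0.5])
logp = log_posterior(theta)
chain = []
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.02, size=2)
    logp_prop = log_posterior(proposal)
    if np.log(rng.uniform()) < logp_prop - logp:
        theta, logp = proposal, logp_prop
    chain.append(theta.copy())
chain = np.asarray(chain)   # posterior samples over the retrieved parameters
```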
Furthermore, the emulator is somewhat generic. We have used the same approach in other spectroscopic contexts. E.g., [2], for imaging spectroscopy, and there is a still-unpublished infrared water-vapor/cloud sounder example.
It seems like generic emulation for a broad class of far IR to UV spectroscopy radiative transfer is possible, without needing problem-specific physical modeling expertise to find model simplifications. (We have direct access to this expertise, it’s just that those folks have their own research to do.)
So: faster than a surrogate, broad in scope, faster turnaround.
About error control. We’re using a Gaussian process with a learned kernel, so we do get error estimates out of the approach, if we want to use them. See under “kernel flows” at [3].
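A tiny illustration of what those error estimates buy you (this uses ordinary marginal-likelihood kernel fitting in scikit-learn rather than the kernel-flows method of [3], and the 1-D toy function is a placeholder):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(3)
X_train = rng.uniform(0.0, 5.0, size=(40, 1))
y_train = np.sin(X_train[:, 0]) + 0.01 * rng.standard_normal(40)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
gp.fit(X_train, y_train)

# Predictive mean and standard deviation, including extrapolation regions.
X_test = np.linspace(-2.0, 8.0, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)

# Flag inputs where the emulator is uncertain, e.g. to fall back to the
# full physics model there instead of trusting the emulated value.
needs_full_model = std > 0.1
```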
The idea of improving the quality and relevance of simulations with machine learning makes a lot of sense to me, and seems like it might help avoid some of the problems that so-called AI can have where it draws dramatically incorrect conclusions from large but flawed data sets. At the same time, it seems like these technologies could be used to jump to completely bizarre conclusions that quickly become contaminated with imperfections, distortions, and other artifacts of the analysis. In a way it seems like this is starting to duplicate the problems humans have with the irregular borders between brilliance and madness.
From a pragmatic perspective, being able to use AI to accelerate or otherwise enhance computations for science and engineering has a pretty big value proposition. Being able to turn around high fidelity calculations in a fraction of the time would yield much better product designs and scientific results.
On the flip side, like you suggest, the cost of being wrong can be much higher than in many current uses of AI. This is where research into AI interpretability, robustness, and uncertainty quantification can really help.
You don't have to use AI approaches to replace the actual scientific and engineering calculations; you can use them more as aids in the process. You can use them as yet another way to search the space of possible simulations for the ones you may want to run and which seem most promising.
In the case of reducing compute demand, where you want efficient shortcuts to running high-fidelity simulations, you could use these methods to generate a candidate space of cases and then run the actual simulations against that subset. That still reduces the total space of simulations you need to run. Lots of potential here without having to make too many sacrifices.
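As a rough sketch of that screening idea (all of the names and numbers below are placeholders, not any particular tool):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def expensive_simulation(x):
    """Stand-in for the slow, high-fidelity simulation we want to run less."""
    return -np.sum((x - 0.3) ** 2)

# A small seed set of real simulation runs trains a cheap surrogate scorer.
seed_X = rng.uniform(0.0, 1.0, size=(64, 5))
seed_y = np.array([expensive_simulation(x) for x in seed_X])
surrogate = RandomForestRegressor(n_estimators=200).fit(seed_X, seed_y)

# Screen a much larger candidate space with the surrogate...
candidates = rng.uniform(0.0, 1.0, size=(100_000, 5))
scores = surrogate.predict(candidates)

# ...and spend the expensive simulations only on the top-ranked cases.
top_cases = candidates[np.argsort(scores)[-20:]]
confirmed = [expensive_simulation(x) for x in top_cases]
```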
The real issue we have in the simulation space though is overpromising. People give far too much credit to some of these models and simulations that, while sometimes rigorous, often don't fully capture the phenomena they intend to. At the end of the day, even the scientific and engineering simulations are just another pass/filter to narrow down the cases of experiments you want to try in reality.
I read recently about Francis Bacon’s scientific method. It was not the modern method; it focused more on creating lists of contrasting cases and on inductive reasoning. Honestly, I didn’t entirely understand it. Robert Hooke had a method, too, which he described as a philosophical “superstructure.”
Some variety in how knowledge is automatically/systematically developed seems important.
As I recall, Karl Popper is generally considered a major contributor to the modern scientific method. Having worked in simulation and simulation validation before, I'd say simulation is a powerful method, but it adds complexity: you're basically running a meta-verification loop on top of the direct experimentation loop, and people can get lost in that complexity if you aren't careful about keeping your simulation in touch with reality.
Bacon’s ideas in that area were very handwavy, so nobody had a clear idea of what he meant. The classic study of his philosophy is Francis Bacon: From Magic to Science (1957).
Thanks for sharing; this looks really interesting. Approaches that bring AI into the theory of science could well create new scientific paradigms. Things really are happening right now.
This looks wonderful, but I can't help but think the medium of a PDF is not optimal for something of this scope (24 coauthors! 700 references!) and potential impact - its goal is not just to share a new discovery but to get a bunch of people on board with a new way of thinking and working.
Feels like it'd be much better in wiki form, where it can live & grow and it's easier to navigate between discrete concepts.
The lead author is from a nonprofit called the Institute for Simulation Intelligence: https://simulation.science/
It must be fairly new, since their website is a skeleton compared to this article. I wonder what publication this is ultimately going to wind up in. Either way, this is exactly the kind of stuff I’m trying to integrate at my lab, and I’m glad more people are taking interest in this field!
Submission statement: I came across this article earlier today and it hit home on a lot of things that have been swirling in my head lately on the current direction of scientific computing methods. I work in R&D in this space, and would be curious to hear the perspectives of others on the methods surveyed in this paper.
Seems underwhelming to me. A bunch of useful ideas from computer science coupled with mostly physics applications. The paper itself includes plenty of references, which demonstrate why their manifesto is unnecessary.
I don’t get, nor appreciate, the low-effort dismissal.
In the world of traditional science and engineering that I come from, these kinds of methods are relatively unknown, let alone used to any real benefit. A document like this aggregates and distills the relevant methods, references, and beneficial applications in one place, which is helpful to decision makers and R&D leads who would otherwise not have known where to look or what to look at.
It can be useful, but the way they frame it is not clear to me. It is an overview and reference, which is indeed very important, but it is not the new paradigm they make it sound like.
Btw, really not low effort. I have invested time in this. If you can explain how this is more than a survey, please share.
It’s certainly true that much of this stuff has been around a while to those familiar, and I didn’t attempt to frame this article as anything beyond a survey in my original statement. The title is what it is. I’m most interested in hearing perspectives from people who are actually using these methods to do real work, and to hear where the cutting edge actually is these days.
That said, a review like this is still significant to people who are starting out in this field or otherwise not previously familiar, and this is the most recent distillation I’ve seen that is written at a level accessible to a broader science and engineering audience.