In game development there are already pretty mature photogrammetry techniques that reverse engineer the lighting and shadow conditions of a scene in order to "subtract" them from the objects in it, leaving the neutral albedo of their surfaces. This is needed for physically based rendering: the texture in the game engine shouldn't have highlights or shadows baked into it, so the engine can apply lighting dynamically based on the in-game surroundings. It'd be interesting to see whether a technique like that could be applied here too. And on the flip side, if you want to do this with ML, rendered images from a game engine/Blender could be used to generate large amounts of training data where the "ground truth" colors are already perfectly known.
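To make the synthetic-data idea concrete, here's a rough sketch of how such rendered pairs could be used. The file names, the use of a flat albedo pass, and the simple "lit image ≈ albedo × shading" (Lambertian) factorization are all my own assumptions, not how any particular pipeline does it:

    # Sketch: turn a lit render and its albedo pass into a de-lighting training pair.
    # Assumes 8-bit, roughly linear PNG exports and a purely diffuse material,
    # so per pixel: beauty ~= albedo * shading.
    import numpy as np
    import imageio.v3 as iio

    beauty = iio.imread("render_beauty.png").astype(np.float32) / 255.0  # final lit render
    albedo = iio.imread("render_albedo.png").astype(np.float32) / 255.0  # flat ground-truth color pass

    # The shading layer the model is supposed to "subtract" is recoverable exactly:
    shading = beauty / np.clip(albedo, 1e-4, None)

    # (input, target) pair for a de-lighting network; shading can serve as an
    # auxiliary target if you want the network to predict both factors.
    np.save("train_input.npy", beauty)
    np.save("train_target.npy", albedo)

Since the renderer knows the exact materials and lights, you can generate as many of these pairs as you like, under lighting conditions you could never capture in a real photo shoot.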
I always thought that these techniques only work because they use multiple viewpoints of the same object (under the same lighting), infer the topography from those, and only then, knowing the shape, "figure out" the lighting.
Would be nice to know how well this would work with just a few images, maybe even taken from different positions, let alone a single image.
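At least the second half of that, figuring out the lighting once you know the shape, is conceptually simple: under a single-directional-light Lambertian model it's just a least-squares fit of the light direction to the observed intensities. A toy sketch (the synthetic normals, known albedo, and the Lambertian model are all simplifying assumptions on my part):

    # Toy version of "know the topography, then solve for the light":
    # with per-point normals and observed diffuse intensities, a single
    # directional light falls out of a linear least-squares fit.
    import numpy as np

    rng = np.random.default_rng(0)

    # Fake "topography": unit normals for 500 surface points.
    normals = rng.normal(size=(500, 3))
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)

    true_light = np.array([0.3, 0.8, 0.5])
    true_light /= np.linalg.norm(true_light)

    # Observed shading I = max(0, n . l), plus a bit of noise.
    intensity = np.clip(normals @ true_light, 0.0, None)
    intensity += rng.normal(scale=0.01, size=500)

    # Keep points that are clearly lit (clamp inactive) and solve N l = I.
    lit = intensity > 0.05
    est_light, *_ = np.linalg.lstsq(normals[lit], intensity[lit], rcond=None)
    est_light /= np.linalg.norm(est_light)

    print("true:", true_light)
    print("estimated:", est_light)

The hard part is presumably the first half: getting reliable geometry at all when you only have a couple of photos, or just one.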