
>Stable-Diffusion is a latent diffusion model, which diffuses in a latent space instead of the original image space. Therefore, we need the loss to propagate back from the VAE's encoder

There is also an alternative way to handle this difference from the original paper's setup that should work:

Instead of working in voxel color space, you push the latent into the voxel grid (i.e. instead of a voxel grid of 3D RGB colors, you have a voxel grid of dim_latent-dimensional latents; you can still use spherical harmonics if you want, since they work just the same in n dimensions).

Only the color prediction network differs; the density is kept the same.

The NeRF then renders directly to the latent space (so there are fewer rays to render), which means you only need to decode with the VAE for visualization purposes, not inside the training loop.
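To make this concrete, here is a minimal sketch of what such a latent voxel grid could look like (the names, shapes and the choice of PyTorch are my own illustration, not from any particular NeRF codebase): each cell stores a density plus a dim_latent-dimensional latent instead of an RGB color, and both are queried by trilinear interpolation.

    import torch
    import torch.nn.functional as F
    from torch import nn

    class LatentVoxelGrid(nn.Module):
        """Voxel grid storing a density and a dim_latent-dim latent per cell
        instead of RGB (illustrative sketch, not a real NeRF implementation)."""

        def __init__(self, resolution=128, dim_latent=4):
            super().__init__()
            # one scalar density channel + dim_latent latent channels per voxel
            self.density = nn.Parameter(torch.zeros(1, 1, resolution, resolution, resolution))
            self.latent = nn.Parameter(torch.zeros(1, dim_latent, resolution, resolution, resolution))

        def forward(self, xyz):
            # xyz: (N, 3) sample points in [-1, 1]^3 along the rays
            grid = xyz.view(1, -1, 1, 1, 3)
            sigma = F.grid_sample(self.density, grid, align_corners=True).view(-1)   # (N,)
            z = F.grid_sample(self.latent, grid, align_corners=True)                 # (1, d, N, 1, 1)
            z = z.view(z.shape[1], -1).t()                                           # (N, dim_latent)
            return torch.relu(sigma), z

Rendering then composites these latents along each ray exactly as a classical NeRF composites colors; the VAE decoder is only applied to the final latent image when you want to look at it.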



This sounds really interesting, but I'm not sure I follow. I'm having a hard time expressing how I'm confused (maybe it's unfamiliar NeRF terminology), but if you have the time I'd be very interested if you could reformulate this alternative method somehow (I've been stuck on this very issue for two days now trying to implement this myself).


NeRF is Neural Radiance Fields (the neural 3D reconstruction method; NVIDIA's recent Instant NGP work is a fast implementation of it).

Basically, if I'm reading it right, this does the synthesis in latent space (which describes the scene rather than rendering voxels), then translates it into a NeRF. It sounds kind of like the Stable Diffusion description that was on here earlier.


Do you have references to NeRF papers that directly compute the field in the latent code? Since NeRF-based methods essentially solve the rendering equation to learn the mapping, what would the alternative equation be for directly learning the latent code? Your idea is interesting, could you elaborate on it?


Sorry, no references for NeRF papers, but the idea is not new. My experience was originally with point clouds and 3D keypoint-feature SLAM.

In these, you represent the world as a sparse latent representation: a collection of 3D coordinates and their corresponding SIFT feature descriptors. You rotate and translate these keypoints to obtain a novel view of the features in a 2D image. (The descriptor can be taken as an interpolation of the stored descriptors, weighted by the difference in orientation between views, since features only match for small viewing-angle differences, around 20°.) You could then invert the features to retrieve a pixel-space image (for example https://openaccess.thecvf.com/content_cvpr_2016/papers/Dosov... ), although it's never needed in practice.
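As a rough illustration of that keypoint setup (again my own hypothetical sketch, not code from any SLAM system): project the stored 3D keypoints into the novel view with the camera pose, and blend the per-view descriptors with weights that fall off with viewing-angle difference.

    import numpy as np

    def project_keypoints(points_3d, R, t, K):
        # points_3d: (N, 3) world coords, R/t: camera rotation and translation, K: intrinsics
        cam = points_3d @ R.T + t          # world -> camera frame
        uv = cam @ K.T                     # pinhole projection
        return uv[:, :2] / uv[:, 2:3]      # (N, 2) pixel coordinates of the keypoints

    def blend_descriptor(descs, view_dirs, query_dir, max_angle_deg=20.0):
        # descs: (V, 128) one SIFT descriptor per stored view of this keypoint
        # view_dirs/query_dir: unit vectors from the keypoint toward each camera
        cos_cutoff = np.cos(np.deg2rad(max_angle_deg))
        w = np.clip(view_dirs @ query_dir - cos_cutoff, 0.0, None)   # zero weight past ~20 deg
        return None if w.sum() == 0 else (w[:, None] * descs).sum(0) / w.sum()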

Coming back to NeRF, it's the same principle. When your NeRF has converged, if you don't have transparent objects, the density along a ray will be 0 except where it intersects the geometry, so effectively a single voxel is hit. In that case you fetch the latent stored in that voxel (as spherical harmonics), evaluated in the direction given by the ray, which was trained against a latent image.

The rendering equation is still the same, but instead of rendering a single ray it is analogous to rendering a group of close rays, i.e. a patch of the image, of which the latent is a compressed representation. You have to be careful not to make the patch too big, because, like with a lens in the real world, the spherical transform flips the patch image under translation, but the neural network should handle this transparently.
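Concretely, the compositing step can stay exactly the standard NeRF one, just applied to latent vectors instead of RGB samples. A minimal sketch for one ray (sample densities, latents and spacings assumed given):

    import torch

    def composite_latents(sigmas, latents, deltas):
        # sigmas: (S,) densities at the S samples along the ray
        # latents: (S, d) latent vector at each sample
        # deltas: (S,) distances between consecutive samples
        alphas = 1.0 - torch.exp(-sigmas * deltas)                    # per-sample opacity
        trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas + 1e-10])[:-1], dim=0)
        weights = alphas * trans                                      # standard NeRF weights
        return (weights[:, None] * latents).sum(dim=0)                # (d,) composited latent

Each composited latent is then one "pixel" of the latent image; the VAE decoder turns that latent patch into the corresponding block of RGB pixels.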

The converged representation is an approximation, based on linear interpolation along positions and ray directions. Provided that you have enough resolution, you can construct it manually from the solution and see how it behaves in the rendering.

Will the convergence process work? It will depend on how well latents mix along a ray. The light transport equation is usually linear, and latents usually mix well linearly (even more so when weighted by a density). But in case they don't mix well, you can learn a latent-mixing rule that helps it converge.
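If plain weighted averaging turns out not to be enough, one hypothetical way to "learn a mixing rule" is a small network that folds the samples along the ray into a running latent, still driven by the rendering weights (purely a sketch of the idea, nothing from an existing paper):

    import torch
    from torch import nn

    class LatentMixer(nn.Module):
        # Learned alternative to linear compositing along a ray.
        def __init__(self, dim_latent=4, hidden=64):
            super().__init__()
            self.step = nn.Sequential(
                nn.Linear(2 * dim_latent + 1, hidden), nn.ReLU(),
                nn.Linear(hidden, dim_latent))

        def forward(self, latents, weights):
            # latents: (S, dim_latent) samples along a ray, weights: (S,) rendering weights
            acc = torch.zeros(latents.shape[1])
            for z, w in zip(latents, weights):
                acc = acc + self.step(torch.cat([acc, z, w.view(1)]))
            return acc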

Also, once you have a latent NeRF, it won't directly give you an STL/OBJ mesh, but you should have 3D-consistent views from which you could train a classical NeRF. Alternatively (and probably better), you can optimize a classical voxel grid to fit the latent voxel grid, i.e. one that produces the same image patches once decoded.
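That last step, fitting a classical RGB grid so it reproduces the latent grid's decoded patches, could be a simple distillation loop. In this sketch latent_nerf, rgb_grid and vae are hypothetical objects with render/decode methods; only the structure of the loop matters:

    import torch

    def distill_rgb_grid(latent_nerf, vae, rgb_grid, ray_batches, steps=2000, lr=1e-2):
        opt = torch.optim.Adam(rgb_grid.parameters(), lr=lr)
        for i in range(steps):
            rays = ray_batches[i % len(ray_batches)]
            with torch.no_grad():
                target = vae.decode(latent_nerf.render(rays))   # decoded patch, no gradient needed
            pred = rgb_grid.render(rays)                        # classical RGB rendering
            loss = torch.nn.functional.mse_loss(pred, target)
            opt.zero_grad(); loss.backward(); opt.step()
        return rgb_grid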




