MIT Unveils Gen AI Tool That Generates High Res Images 30 Times Faster (hothardware.com)
188 points by mikhael on March 27, 2024 | 60 comments


Official news release rather than whatever this site is:

https://news.mit.edu/2024/ai-generates-high-quality-images-3...


>Our method is similar to GANs in that a critic is jointly trained with the generator to minimize a divergence between the real and fake distributions, but differs in that our training does not play an adversarial game that may cause training instability, and our critic can fully leverage the weights of a pretrained diffusion model.

Very glad to see GANs (or GAN-likes) coming back! Also I don't know if the examples were cherrypicked but a lot of the images look more realistic than SD. For example the dog in the snow[0] is severely oversaturated in the SD version giving it that distinct AI look (sometimes referred to as AIslop). Also the landscape scene[1] has a random lens flare in the SD version which makes it look oversaturated as well. The MIT images are much better in this regard.

[0] https://tianweiy.github.io/dmd/images/teaser/teaser2_Page_1_... vs https://tianweiy.github.io/dmd/images/teaser/teaser2_Page_1_...

[1] https://tianweiy.github.io/dmd/images/teaser/teaser2_Page_1_... vs https://tianweiy.github.io/dmd/images/teaser/teaser2_Page_1_...


Can GANs prevent the ">10 fingers" problem?


A lot of the problems of Stable Diffusion arise from using a text encoder that isn't up to the task of encoding the meaning sufficiently precisely. For example, if you mention the word "green" in the prompt, it often carries over to the whole image instead of just the object being described. Numbers suffer from a similar problem. Rendering text is compromised by tokenization.

One problem with hands might be that they are comparatively small. The model easily gets the big picture right (head, arms, legs), but hands are so-called high-frequency details and are additionally featured in lots of different positions, which are seldom sufficiently described in the captions of the training data.


The >10 fingers problem should be solved via higher quality data or conditioning rather than a better model architecture.


Are you suggesting AI generates images of people with more than 10 fingers because there are too many pictures of people with more than 10 fingers in the input data? That seems unlikely.


Higher quality could also mean the description attached to the original images in the dataset. If you were to describe a picture of someone, you would never specifically call out the fact that they have 5 fingers per hand; we take it for granted, so that kind of information may never appear in the dataset.

But I think what the grandparent means by conditioning is non-textual conditioning, like ControlNet. This will always be more powerful than trying to describe something by text. Think about the description of a character in a novel vs the movie adaptation.


In some pictures not all 5 fingers on a hand are visible, so maybe it appears that humans have a variable number of fingers?


Humans with less than 5 fingers per hand do indeed exist. How does that lead to a default of 7?


I meant more because people are holding objects or their hand is just at an angle where all fingers are not visible in the picture.

I don't understand how neural networks operate, but my layman's guess is that when you sometimes see hands with 5 fingers visible, sometimes 4, sometimes 3, sometimes 2, sometimes 1, and sometimes 0, then it's not immediately apparent that every hand has between 0 and 5 fingers.

Think of it this way: if the AI has only ever seen houses with a maximum of 10 windows in its entire training set, is it so unthinkable that it sometimes draws a house with 12 windows? That's a "sensible" generalization about how houses have a variable number of windows. It just doesn't work for fingers.

I'm sure the same issue would arise if humans had other body parts that came in large quantities, but almost everything else is either 1 or 2 like the nose or eyes.


Counting fingers is how humans do it, not necessarily how AI does. On a five-fingered hand, fingers are more likely to have neighbors than not. Why wouldn't the default be infinite fingers?


People holding hands, people high-fiving, etc.


Yes, but if AI cannot solve the fingers problem, then can it reliably generate images of things which we have relatively few example images of? We have an enormous amount of images of hands.


It's exactly the reverse: the issue with generative AI isn't the data anymore, but models that are not able to understand the data.


Outside of the work to study and improve the available technologies ("how far can we push a hammer"):

you do not draw an individual with anomalous hands, because you hold an ontological model in which "humans normally have five fingers per hand".

Knowing "how the world works" is the appropriate source for subsequent expression of a representation.

Somewhere in the architecture a world model should be formed.


> Very glad to see GANs (or GAN-likes) coming back!

Adobe has had GigaGAN out for a while now, though?


This is just an ad-dumpster, borderline-malware link.

Here is the link to the actual tool.

https://tianweiy.github.io/dmd/


Is the text prompt input the only common thing between SD and MIT models when comparing outputs?

If so, I am surprised at how SIMILAR the outputs of both models look in the general layout / framing / composition of the image.

How can, for example, both fox astronaut images have a near-identical backdrop of Earth, and Earth alone, on the same side of the image at the same apparent size? Virtually the same shade of deep blue for deep space.

"Lightshow at the Dolomites" output is virtually identical. So similar that they almost look like iPhone versus Samsung Galaxy camera comparison shots of the same scene (I am exaggerating of course, but almost).

What results in such a close similarity in outputs? Same training data set?

It is almost like they have the same DNA!


I believe this is similar to the Latent Consistency Modeling approach, where it’s a replacement for the “diffusion” process, not the underlying weights. Basically, they have a more efficient process for pulling images out of the weights, not necessarily a set of new weights.


The weights are different, because the model is different.

As jzbontar below mentions, the crucial point is that the random noise mask is the same. The diffusion models are trained to turn random noise into an image, and they are deterministic at that - the same noise leads to the same image.

What the authors did here was to find a smart way of training a new model able to "simulate" in a single step what diffusion achieves in many; to do so, they took many triplets of (prompt, noise, image) generated starting from random noise and a (fixed) pretrained stable diffusion checkpoint. The model is trained to replicate the results.

So, it is surprising that this works at all at creating meaningful images, but it would be _really_ surprising (i.e. probably impossible) if it generated meaningful images which were seriously different from the ones it was pretrained with!
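As a rough illustration of that training setup, here is a toy sketch (pure Python; the names and numbers are my own invention, not from the paper): a deterministic multi-step "teacher" denoiser stands in for the pretrained diffusion model, a dataset of (noise, output) pairs is generated from it, and a one-step "student" is fit to reproduce it.

```python
import random

N_STEPS = 100
TARGET = 3.0  # stands in for "the image this prompt describes"

def teacher(z, steps=N_STEPS):
    """Deterministic multi-step 'denoiser': same noise in, same output out."""
    x = z
    for _ in range(steps):
        x += (TARGET - x) / steps  # one small denoising step
    return x

# Build a distillation dataset of (noise, teacher output) pairs.
rng = random.Random(0)
pairs = [(z, teacher(z)) for z in (rng.gauss(0, 1) for _ in range(1000))]

# One-step "student": an affine map y = a*z + b fit by least squares.
n = len(pairs)
sz = sum(z for z, _ in pairs)
sy = sum(y for _, y in pairs)
szz = sum(z * z for z, _ in pairs)
szy = sum(z * y for z, y in pairs)
a = (n * szy - sz * sy) / (n * szz - sz * sz)
b = (sy - a * sz) / n

# The student now matches the 100-step teacher in a single step,
# even on noise it never saw during "training".
z_new = 1.234
assert abs((a * z_new + b) - teacher(z_new)) < 1e-6
```

The toy is exact only because this teacher happens to be affine in the noise; the point it illustrates is just the training signal - the student is supervised by the teacher's many-step output, not by real images.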


Oh, the images and prompts we see in the article are from the training data?

Pardon my ignorance ...

Does the MIT model then not work as a general text-to-image model that generates novel images from arbitrary new text prompts it has not seen before?


Nothing to pardon, asking questions is always the right thing to do :-) I also didn't look into the paper in great detail, and although I'm quite sure I am not fooling myself, still take this with a grain of salt.

My understanding is that this paper by MIT doesn't train any new model from scratch. It takes a pretrained model (e.g. StableDiffusion), which however is trained to do "a small step" only: you fix a number of steps (e.g. 1000 in the MIT paper), and ask the model to predict how to "enhance" an image by a certain step (e.g. of size 1/1000); the constants are adjusted so that, if the model is "perfect", you get from pure white noise to an image in the exact number of steps you set. If I remember correctly how diffusion works, in theory you could set this number to any value, including 1, but in practice you need several hundred to get a good result, i.e. the original StableDiffusion model is only able to fit a small adjustment.

This new paper shows how to "distil" the original model (in this case, StableDiffusion) into another model. However, unlike typical distillation, which is used to compress a big model into a smaller one, in this case the distilled model is basically the same size as the one you start with; but it has been trained with a different objective, namely to transform random noise into the prediction that the original model (StableDiffusion) would make in 1000 steps. To do so, it is trained on a very large number of triples (text, noise, image). But I don't think you can incorporate into this training procedure other "real" images that are not generated by the model you start with, because you don't have a corresponding noise (abstractly, there is no such concept as a "corresponding noise" for a given image, because the relation noise -> image depends on the specific model you start with, and this map is not anywhere near invertible, since not all images can be generated by StableDiffusion, or any other model).

Once the model is trained, you can of course give it a new prompt and, in theory, it should generate something rather similar to what StableDiffusion would generate with the same prompt (hopefully, the examples displayed on their web page are not from the training set! Otherwise it would be totally useless). But you should never obtain something "totally different" from what StableDiffusion would give you, so in that sense it's not "general"; it is "just" a model that imitates StableDiffusion very well while being much faster. Which is already great of course :-)
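The "no corresponding noise" point can be pictured with a toy generator (my own invention, not from the paper) whose outputs all lie on a low-dimensional manifold: an off-manifold "real image" simply has no preimage, no matter how hard you search the noise space.

```python
import math

def generator(z):
    """Toy 'model': every output it can produce lies on the unit circle."""
    return (math.cos(z), math.sin(z))

# A 'real image' off the circle has no corresponding noise: even a dense
# brute-force search over z never gets closer than the manifold allows.
target = (2.0, 0.0)
best = min(math.dist(generator(k / 1000), target) for k in range(-6283, 6284))
assert 0.99 < best < 1.01  # the gap to the manifold never closes
```

Real diffusion models are vastly higher-dimensional, but the same geometry applies: images the pretrained model cannot produce have no noise you could pair them with for this training procedure.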


The MIT model is something like a distillation model; maybe it was distilled from SD? Edit: it was, see section 3.1.


I assume those images were generated using the same text prompt as well as the same initial random noise.
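A minimal sketch of that idea (toy stdlib code, not the actual pipelines): fixing the seed fixes the initial noise, so two models that start from it share the same "latent layout" before any denoising happens.

```python
import random

def initial_noise(seed, n=8):
    """Stand-in for sampling the initial latent noise tensor."""
    rng = random.Random(seed)
    return [rng.gauss(0, 1) for _ in range(n)]

# Same seed -> bit-identical starting noise for both models; a different
# seed gives a completely different starting point (and composition).
assert initial_noise(42) == initial_noise(42)
assert initial_noise(42) != initial_noise(43)
```

Since the student model was trained to reproduce the teacher's mapping from that exact noise, matching the seed is what makes side-by-side comparisons like these so eerily similar.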


There are so many similar diffusion model distillation techniques these days that it's become hard to tell the difference between them. This is probably the fourth example I've seen that uses an adversarial loss to distill the model, the others being UFOgen, some other work by StabilityAI on SDXL-turbo and similar, and SDXL-Lightning by ByteDance. I found this blog post that explains some of the differences: https://sander.ai/2024/02/28/paradox.html


This is 10 times faster than almost all of the fast image generation models.


I mean, it's not the only one that can run SD-1.5 in one step, so no.


This is 30 times faster than SD (on its own, as opposed to LCM methods) right?

Noting the "50" steps they referenced, whereas I've been using LCMs with 3-8 steps for months.

Just trying to understand context of the paper / comparisons etc.

90ms is fast, but that's like a 2-3x speedup over what I've been seeing, which is still huge, but not 30x.

Can someone with more context explain?


It's 30x faster than 50-step SD with the same quality, unlike LCMs, which can result in substantially lower quality. The actual page of the work shows the difference.


30 times faster than SD1.5.

Which is in the same broad neighborhood as existing Stable Diffusion 1.5 LCM (and similar to the speedup, compared to SDXL, for the SDXL LCM, Turbo, and Lightning models.)



I'm glad to see another project tackling the challenge of speeding up image generation. After seeing the bang-for-buck jump from sdxl to sdxl-turbo, I assume there's tons more room for improvement on the speed side of things.


The word "confabulate" is underused in AI image generation circles.


Makes me wonder whether, on any given day, someone could find a way to improve something, and OpenAI's value would evaporate quickly.

Seems like these startups with high valuations are even more tenuous than the startups of years past.


OpenAI's value doesn't just come from their tech. Copyright Shield is a deal closer, according to business types I have spoken to. Telling a company, "if anyone we took data from has a problem, they have to deal with our lawyers and not you" is music to their ears, because what companies want most is stability and reliability. It's the entire reason SLAs are such a big deal in B2B.


https://openai.com/policies/business-terms

Section 10 deals with the indemnification they are offering. There are a lot of limitations. It's definitely not terrible, but it's not remotely close to what a business actually wants in an indemnification agreement.

10.1 (the indemnification from them to you) does not include "hold harmless", but 10.2 (the indemnification from you to them) does. That's not an accident, and those aren't meaningless words ;)

If you get sued and notify OpenAI of the suit, their lawyers take over completely. You must do anything they ask (as long as it is "reasonable"), including sitting for depositions, preserving evidence, participating in the discovery process, etc. If you want to have any involvement beyond being told what to do, you have to hire your own lawyers at your own expense. And at the end of it all, they will come to a settlement with the other side. As long as it is "reasonable", you must sign it.

https://openai.com/policies/service-terms

> If there is a conflict between the Service Terms and your Agreement, the Service Terms will control

> This indemnity does not apply where: (i) Customer or Customer’s End Users knew or should have known the Output was infringing or likely to infringe, (ii) Customer or Customer’s End Users disabled, ignored, or did not use any relevant citation, filtering or safety features or restrictions provided by OpenAI, (iii) Output was modified, transformed, or used in combination with products or services not provided by or on behalf of OpenAI, (iv) Customer or its End Users did not have the right to use the Input or fine-tuning files to generate the allegedly infringing Output, (v) the claim alleges violation of trademark or related rights based on Customer’s or its End Users’ use of Output in trade or commerce, and (vi) the allegedly infringing Output is from content from a Third Party Offering.

If you knew or should have known, you're on your own. If you didn't follow the rules exactly, you're on your own. If you used the output in commerce and they claim a trademark violation, you're on your own.

And lastly, remember from the business terms that OpenAI's lawyers take control and you must do whatever they ask in terms of depositions and discovery? While dealing with all that information, if they see any indication that you no longer qualify for indemnification, OpenAI is going to say goodbye and send all the legal bills to you.

It's better than nothing, for sure. But not all that confidence-inspiring.


There's more to startup value than whether or not they've implemented the latest nifty trick into their models. Their moat is architecture, research, distribution, design, and scale. Have all the local LLM hacks based off Llama 2 and the like done anything to diminish OpenAI's value?


I would say it has. If no one could demonstrate even GPT3.5 capabilities, then OpenAI could be worth a few hundred billion more in private valuations.

But multiple vendors have already demonstrated viable alternatives.


Their valuation tripled to $80 billion in 10 months. I'm pretty sure they're not losing much to open source.


It gives an idea in investors' minds that OpenAI's models can be matched or be "good enough" by open source models.


OpenAI has been at least one year ahead of everyone else in everything they have released so far. And there’s no sign they are stopping any time soon (e.g. GPT-5 is expected this year).


GPT-5 will likely only see improvements in deductive reasoning, like in math. I think there are diminishing returns if GPT-5 is just a larger transformer than GPT-4.


That’s probably what GPT-4.5 will be. The bar is much higher for GPT-5.


Also, Meta can spend billions building their own models and release them for free to kill your business.


And people will for sure use them to start new startups


Safe to say that FedEx has had to add another flight or two to carry documents from OpenAI to the USPTO.


Could be. Could also be that LLMs are to OpenAI what information retrieval is to Google. A lot is publicly known in the information retrieval space, but Google still dominates.


That is a good point. The original Google designs are public; those were papers that were published. Anybody could create a new 'Google'; it's just about getting name recognition and users. Good examples are DuckDuckGo or Bing: they throw money at advertising just to get users, but the underlying search isn't the problem.


Is the tool available to use somewhere? I've been looking around, but I am not familiar with the research paper format of these types of tools so maybe I am overlooking some obvious link to a repository or something?


Am I missing it or is the source code not available?


I don't think they open-sourced it. Thinking back, most AI breakthrough papers I've read don't include source code, unfortunately. Researchers want their names out there, explaining what they did and that they did it first, but MIT might want to license the implementation IP, or Adobe's lawyers (it sounds like their interns discovered this during a summer) might be hanging onto it for a business edge.


Pretty ironic given this comment at the end of the page

> While image generators like DALL-E and Meta AI's Imagine can produce extremely impressive results, these groups are highly protective of their technology and jealously guard it from curious public eyes. Meanwhile, you can go read about MIT's findings at this link.[0]

[0] https://tianweiy.github.io/dmd/


It's 100% Adobe.


The article only talks about 1-step generation. It doesn't say anything about whether 2 steps improve the image at all. The images look good enough, but not as good as 30-50 step generation by SD. The cat in water is a much clearer example of that.


With all this tech I really wish we'd see it brought to Apple's hardware; so much available unified memory, and yet generation times are shit in SD, even when converted to coreml.


I am waiting for the day I can hook up my webcam and do some live video-to-video :)


Given the speed increase, how much work would be needed to create videos?


So annoying to have articles not point to the original paper.


Diffusion models are exceptionally rich troves of statistical information; you just have to find it. The model will take any random noise and denoise it a bit. By guiding the denoising process, you can speed it up or improve the output tremendously.
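A toy sketch of "guiding the denoising process" (scalars instead of images, a classifier-free-guidance-style blend; the function names and numbers are made up for illustration):

```python
def denoise(x, uncond, cond, steps=50, guidance=1.0):
    """Toy denoising loop: each step moves x toward a guided target."""
    for _ in range(steps):
        # blend the unconditional and conditional 'denoising directions';
        # guidance > 1 exaggerates what the condition asks for
        target = uncond + guidance * (cond - uncond)
        x += (target - x) / steps
    return x

# Turning the guidance up pushes the result harder toward the condition.
plain = denoise(0.0, uncond=0.0, cond=1.0, guidance=1.0)
guided = denoise(0.0, uncond=0.0, cond=1.0, guidance=2.0)
assert guided > plain
```

Real guidance operates on predicted noise in a high-dimensional latent space, but the mechanism is the same: a small steering term applied at every denoising step compounds into a very different final output.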


I think it is wild that "good" outputs are always some of the most uncanny imagery. Sure, it is better than older versions, but "photorealistic" only makes sense if you are describing a photo to someone who can't see. I know it's "early", but they don't deserve this praise. It is cool that they can assemble things, but it really is a very low bar compared to reality. I know non-creative people don't care, but audiences do.



