>Our method is similar to GANs in that a critic is jointly trained with the generator to minimize a divergence between the real and fake distributions, but differs in that our training does not play an adversarial game that may cause training instability, and our critic can fully leverage the weights of a pretrained diffusion model.
Very glad to see GANs (or GAN-likes) coming back! I don't know whether the examples were cherry-picked, but a lot of the images look more realistic than SD's. For example, the dog in the snow[0] is severely oversaturated in the SD version, giving it that distinct AI look (sometimes referred to as "AI slop"). The landscape scene[1] has a random lens flare in the SD version, which makes it look oversaturated as well. The MIT images are much better in this regard.
A lot of the problems of Stable Diffusion arise from using a text encoder that isn't up to the task of encoding the meaning sufficiently precisely. For example, if you mention the word "green" in the prompt, it often carries over to the whole image instead of just the object being described. Numbers suffer from a similar problem, and rendering text is compromised by tokenization.
One problem with hands might be that they are comparatively small. The model easily gets the big picture right (head, arms, legs), but hands are so-called high-frequency details and are additionally featured in lots of different positions, which are seldom sufficiently described in the captions of the training data.
Are you suggesting AI generates images of people with more than 10 fingers because there are too many pictures of people with more than 10 fingers in the input data? That seems unlikely.
Higher quality could also mean the descriptions attached to the original images in the dataset. If you were to describe a picture of someone, you would never specifically call out the fact that they have 5 fingers per hand; we take it for granted. So that kind of information may never appear in the dataset.
But I think what the grandparent means by conditioning is non-textual conditioning, like ControlNet. That will always be more powerful than trying to describe something in text. Think of the description of a character in a novel versus the movie adaptation.
I meant more that people are holding objects, or their hand is just at an angle where not all fingers are visible in the picture.
I don't understand how neural networks operate, but my layman's guess is that when you sometimes see hands with 5 fingers visible, sometimes 4, sometimes 3, 2, 1, or 0, it's not immediately apparent that every hand has between 0 and 5 visible fingers.
Think of it this way: if the AI has only ever seen houses with a maximum of 10 windows in its entire training set, is it so unthinkable that it sometimes draws a house with 12 windows? That's a "sensible" generalization of houses having a variable number of windows. It just doesn't work for fingers.
I'm sure the same issue would arise if humans had other body parts that came in large quantities, but almost everything else comes as 1 or 2, like the nose or the eyes.
Counting fingers is how humans do it, not necessarily how AI does. On a five-fingered hand, fingers are more likely to have neighbors than not. Why wouldn't the default be infinite fingers?
Yes, but if AI cannot solve the fingers problem, can it reliably generate images of things of which we have relatively few examples? We have an enormous number of images of hands.
Is the text prompt input the only common thing between SD and MIT models when comparing outputs?
If so, I am surprised at how SIMILAR the outputs of both models look in the general layout / framing / composition of the image.
How can, for example, both fox-astronaut images have a nearly identical backdrop of Earth, and Earth alone, on the same side of the image at the same apparent size? Virtually the same shade of deep blue for deep space, too.
The "Lightshow at the Dolomites" output is virtually identical. So similar that they almost look like iPhone-versus-Samsung-Galaxy camera comparison shots of the same scene (I am exaggerating, of course, but only just).
What results in such a close similarity in outputs? Same training data set?
I believe this is similar to the Latent Consistency Modeling approach, where it’s a replacement for the “diffusion” process, not the underlying weights. Basically, they have a more efficient process for pulling images out of the weights, not necessarily a set of new weights.
The weights are different, because the model is different.
As jzbontar mentions below, the crucial point is that the random noise mask is the same. Diffusion models are trained to turn random noise into an image, and they are deterministic at that: the same noise leads to the same image.
What the authors did here was find a smart way of training a new model to "simulate" in a single step what diffusion achieves in many. To do so, they generated many triplets of (prompt, noise, image), starting from random noise and a fixed, pretrained Stable Diffusion checkpoint; the new model is trained to replicate those results.
So, it is surprising that this works at all at creating meaningful images, but it would be _really_ surprising (i.e. probably impossible) if it generated meaningful images which were seriously different from the ones it was pretrained with!
Oh, so the images and prompts we see in the article are from the training data?
Pardon my ignorance ...
Does the MIT model then not work as a general text-to-image model, generating novel images from arbitrary new text prompts it has not seen before?
Nothing to pardon, asking questions is always the right thing to do :-) I also didn't look into the paper in great detail, and although I'm quite sure I'm not fooling myself, still take this with a grain of salt.
My understanding is that this paper by MIT doesn't train any new model from scratch. It takes a pretrained model (e.g. Stable Diffusion), which is trained to do only "a small step": you fix a number of steps (e.g. 1000 in the MIT paper) and ask the model to predict how to "enhance" an image by one step (e.g. of size 1/1000); the constants are adjusted so that, if the model is "perfect", you get from pure white noise to an image in exactly the number of steps you set. If I remember correctly how diffusion works, in theory you could set this number to any value, including 1, but in practice you need several hundred to get a good result; i.e., the original Stable Diffusion model is only able to fit a small adjustment.
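As a toy numerical illustration of that fixed-step idea (my own sketch, not the paper's actual sampler; `teacher_step` is a made-up stand-in for the pretrained network):

```python
import numpy as np

rng = np.random.default_rng(0)
clean_image = rng.normal(size=8)  # stand-in for the image the sampler converges to

def teacher_step(x, t, n_steps):
    # Hypothetical stand-in for the pretrained model: it only knows how to
    # nudge the current sample a small fraction of the way toward the clean
    # image. In a real diffusion model, one such small step is all the
    # network can fit accurately.
    return x + (clean_image - x) / (n_steps - t)

def sample(n_steps, noise):
    # The fixed-step deterministic sampler: pure noise in, image out.
    x = noise.copy()
    for t in range(n_steps):
        x = teacher_step(x, t, n_steps)
    return x

noise = rng.normal(size=8)
image = sample(1000, noise)  # deterministic: the same noise gives the same image
```

With the constants chosen this way, the toy sampler lands exactly on `clean_image` after the final step, mirroring the point that the step sizes are tuned so the schedule finishes in exactly the number of steps you set. A real denoiser is of course far more complex than this linear toy.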
This new paper shows how to "distil" the original model (in this case, Stable Diffusion) into another model. Unlike typical distillation, which compresses a big model into a smaller one, here the distilled model is basically the same size as the one you start with; it has just been trained with a different objective, namely to transform random noise directly into the prediction that the original model (Stable Diffusion) would make in 1000 steps. To do so, it is trained on a very large number of triplets (text, noise, image). But I don't think you can incorporate other "real" images, not generated by the starting model, into this training procedure, because you don't have a corresponding noise. (Abstractly, there is no such thing as the "corresponding noise" of a given image: the noise -> image relation depends on the specific model you start with, and the map is nowhere near invertible, since not every image can be generated by Stable Diffusion, or any other model.)
Once the model is trained, you can of course give it a new prompt, and in theory it should generate something rather similar to what Stable Diffusion would generate for the same prompt (hopefully the examples displayed on their web page are not from the training set! Otherwise it would be totally useless). But you should never obtain something "totally different" from what Stable Diffusion would give you, so in that sense it's not "general"; it is "just" a model that imitates Stable Diffusion very well while being much faster. Which is already great, of course :-)
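A minimal sketch of that distillation setup, with (big!) simplifying assumptions: a toy affine map stands in for the real multi-step sampler, and a linear least-squares fit stands in for the student network (the paper trains a full network initialized from the teacher's weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Hypothetical frozen "teacher": a deterministic noise -> image map, standing
# in for running the full multi-step sampler of a pretrained diffusion model.
A = rng.normal(size=(d, d))

def teacher(noise):
    return noise @ A.T  # the many-step sampler collapsed into one function

# Build the distillation set of (noise, teacher image) pairs; a real pipeline
# would also store the text prompt alongside each pair.
noise_train = rng.normal(size=(1000, d))
images = teacher(noise_train)

# One-step "student": fit noise_train @ W ~= images by least squares.
W, *_ = np.linalg.lstsq(noise_train, images, rcond=None)

# The student now reproduces the teacher's output on unseen noise in one step.
noise_new = rng.normal(size=d)
one_step = noise_new @ W
many_step = teacher(noise_new)
```

The student can only ever imitate the teacher's noise -> image map; that is exactly why its outputs stay close to what the original model would produce for the same noise and prompt.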
There are so many similar diffusion-model distillation techniques these days that it's become hard to tell them apart. This is probably the fourth example I've seen that uses an adversarial loss to distill the model, the others being UFOGen, work by Stability AI on SDXL-Turbo and similar, and SDXL-Lightning by ByteDance. I found this blog post that explains some of the differences: https://sander.ai/2024/02/28/paradox.html
It's 30x faster than 50 step SD with the same quality, unlike LCMs which can result in substantially lower quality. The actual page of the work shows the difference.
Which is in the same broad neighborhood as existing Stable Diffusion 1.5 LCM (and similar to the speedup, compared to SDXL, for the SDXL LCM, Turbo, and Lightning models.)
I'm glad to see another project tackling the challenge of speeding up image generation. After seeing the bang-for-buck jump from sdxl to sdxl-turbo, I assume there's tons more room for improvement on the speed side of things.
OpenAI's value doesn't just come from their tech. Copyright Shield is a deal-closer for the business types I have spoken to. Telling a company "if anyone we took data from has a problem, they have to deal with our lawyers, not you" is music to their ears, because what companies want most is stability and reliability. It's the entire reason SLAs are such a big deal in B2B.
Section 10 deals with the indemnification they are offering. There are a lot of limitations. It's definitely not terrible, but it's not remotely close to what a business actually wants in an indemnification agreement.
10.1 (the indemnification from them to you) does not include "hold harmless", but 10.2 (the indemnification from you to them) does. That's not an accident, and those aren't meaningless words ;)
If you get sued and notify OpenAI of the suit, their lawyers take over completely. You must do anything they ask (as long as it is "reasonable"), including sitting for depositions, preserving evidence, participating in the discovery process, etc. If you want to have any involvement beyond being told what to do, you have to hire your own lawyers at your own expense. And at the end of it all, they will come to a settlement with the other side. As long as it is "reasonable", you must sign it.
> If there is a conflict between the Service Terms and your Agreement, the Service Terms will control
> This indemnity does not apply where: (i) Customer or Customer’s End Users knew or should have known the Output was infringing or likely to infringe, (ii) Customer or Customer’s End Users disabled, ignored, or did not use any relevant citation, filtering or safety features or restrictions provided by OpenAI, (iii) Output was modified, transformed, or used in combination with products or services not provided by or on behalf of OpenAI, (iv) Customer or its End Users did not have the right to use the Input or fine-tuning files to generate the allegedly infringing Output, (v) the claim alleges violation of trademark or related rights based on Customer’s or its End Users’ use of Output in trade or commerce, and (vi) the allegedly infringing Output is from content from a Third Party Offering.
If you knew or should have known, you're on your own. If you didn't follow the rules exactly, you're on your own. If you used the output in commerce and they claim a trademark violation, you're on your own.
And lastly, remember from the business terms that OpenAI's lawyers take control and you must do whatever they ask in terms of depositions and discovery? While digging through all that information, if they see any indication that you no longer qualify for indemnification, OpenAI is going to say goodbye and send all the legal bills to you.
It's better than nothing, for sure. But not all that confidence-inspiring.
There's more to startup value than whether or not they've implemented the latest nifty trick into their models. Their moat is architecture, research, distribution, design, and scale. Have all the local LLM hacks based on Llama 2 and the like done anything to diminish OpenAI's value?
OpenAI has been at least one year ahead of everyone else in everything they have released so far. And there's no sign they are stopping any time soon (e.g. GPT-5 is expected this year).
GPT-5 will likely only see improvements in deductive reasoning, like in math. I think there are diminishing returns if all GPT-5 is is a larger transformer than GPT-4.
Could be. Could also be that LLMs are to OpenAI what information retrieval is to Google. A lot is publicly known in the information-retrieval space, but Google still dominates.
That is a good point. The original Google designs are public; those papers were published. Anybody could create a new "Google", it's just about getting name recognition and users. Good examples are DuckDuckGo or Bing: they throw money at advertising just to get users, but the underlying search isn't the problem.
Is the tool available to use somewhere? I've been looking around, but I am not familiar with the research paper format of these types of tools so maybe I am overlooking some obvious link to a repository or something?
I don't think they open-sourced it. Thinking back, most AI breakthrough papers I've read don't include source code, unfortunately. Researchers want their names out there, explaining what they did and that they did it first, but MIT might want to license the implementation IP, or Adobe's lawyers (it sounds like their interns discovered this during a summer) might be hanging onto it for a business edge.
Pretty ironic given this comment at the end of the page:
> While image generators like DALL-E and Meta AI's Imagine can produce extremely impressive results, these groups are highly protective of their technology and jealously guard it from curious public eyes. Meanwhile, you can go read about MIT's findings at this link.[0]
The article only talks about 1-step generation. It doesn't say anything about whether 2 steps improve the image at all. The images look good enough, but not as good as 30-50 step generation by SD; the cat in water is a clear example of that.
With all this tech I really wish we'd see it brought to Apple's hardware; so much available unified memory, and yet generation times are shit in SD, even when converted to Core ML.
Diffusion models are exceptionally rich troves of statistical information; you just have to find it. The model will take any random noise and denoise it a bit. By guiding the denoising process, you can speed it up or improve the output tremendously.
I think it is wild that "good" outputs are always some of the most uncanny imagery. Sure, it is better than older versions, but "photorealistic" only makes sense if you are describing a photo to someone who can't see. I know it's "early", but they don't deserve this praise. It is cool that they can assemble things, but it really is a very low bar compared to reality. I know non-creative people don't care, but audiences do.
https://news.mit.edu/2024/ai-generates-high-quality-images-3...