Hey I work at Adept and helped make this!
Happy to answer questions.
The thing I think is especially neat/notable is how simple you can make the model architecture while still getting good performance.
I expect we'll continue to see bits of these models get deleted in the next few years.
What’s the situation with the license? Your blog post says you are open sourcing it, but it’s currently only available under a non-commercial license instead. Is an open source release forthcoming?
Yeah... in the blog post, they do explicitly mention "cc-by-nc", which I find disappointing.
Anything the community builds on top of it, the stuff Adept says it is "excited to see", would serve only Adept and no one else! What incentive does the community have to build on top of Fuyu when the community can't benefit from its own work? If Adept wants to benefit from word-of-mouth discussion of their models and from community contributions that make those models work better, as has happened dramatically with Llama 2, then they need to give the community the opportunity to benefit too.
Also weird: if you look at the tags on Hugging Face, you'll see it is listed as "cc". This comes from the README[0] metadata. "cc" is not really a license.
It's open source by their definition, that is, source-available (open). Everyone assumes the term "open source" is protected in some way, but the entity that established the commercial-usage requirement is the Open Source Initiative, and no one is forced to abide by its ideology.
The term FOSS captures the commercial-usage requirement much better; otherwise FOSS would be a redundant term.
I believe the copyright status of AI model weights in the US is not fully established, but so far it has been held that a list of numbers cannot be copyrighted, so the same likely applies to model weights. Note that you don't have to enter into an agreement with Adept to use the model.
Alternatively, download and use the weights in Japan, which explicitly has no copyright on AI models.
Any digital object can be represented as a list of numbers (this is precisely the origin of the term digital). Since there is clearly precedent for copyrighted digital objects (media, software, etc), reducing something to "a list of numbers" is not a useful distinction in regard to copyright law.
Model weights are akin to Markov chains and compressed data. They are direct representations of the data they were created from, in the same way that a Markov chain is a direct representation of the text it was built from and a zipped file is a direct representation of the original file.
Zipping a file does not grant the zipped output any copyright protection beyond the copyright of the original file.
If you take some copyrighted data, say a set of books, count the words in those books, and then plot the distribution of the top 100 word frequencies, the copyright for that new image would belong to you.
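To make that concrete, here is a rough sketch of the kind of derived artifact I mean (the texts are placeholders, obviously not real books):

    from collections import Counter
    import matplotlib.pyplot as plt

    # placeholder stand-ins for the copyrighted source texts
    books = ["text of book one ...", "text of book two ..."]

    # count every word across the corpus and keep the 100 most frequent
    counts = Counter(word for book in books for word in book.lower().split())
    top_words, top_counts = zip(*counts.most_common(100))

    # the resulting chart is a new artifact derived from the copyrighted data
    plt.bar(top_words, top_counts)
    plt.xticks(rotation=90, fontsize=4)
    plt.title("Top 100 word frequencies")
    plt.savefig("word_frequencies.png")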
Exactly. Data is not covered by American copyright, and artifacts generated by LLM and diffusion tools are not covered by copyright protection unless there was human involvement and the humans are transparent about how they participated in creating the artifacts.
For now there is a lot of human involvement. You pretty much need a team of engineers or an equivalent to get anything besides minor fine tuning done. And there is usually human labor involved at labeling, feedback and evaluation stages.
The issue circles back to their needing to be transparent about how they did the work.
When it comes to intellectual property, there are two ways to protect it: either you keep it as a trade secret and only use it in house (the secret-sauce approach), or you put things out in the open and seek copyright, patent, or trademark protection. You can't have it both ways, and even more so with AI co-created artifacts. If they are transparent about all the steps involved and what the humans did, then they can seek protection for the human-created parts. This also allows others to replicate those steps and create similar artifacts.
It sounds like they and many other "AI" teams want patent protection without having to register for it. These teams are trying to write their own licenses to rights they do not have.
I highly doubt that any of this will hold up in front of a court. For intellectual property, not just the result matters but also the creation process, and there is plenty of work going into the data science here.
Hey! Awesome work. It seems like in theory this encoding scheme should let a model like this generate images as well, by outputting image tokens. Is that right?
Neat idea! Are the patches encoded as tokens into the input sequence? This is something I really like about the multimodal PaLM papers, since it lets the multimodal tokens be referenced.
The architecture is quite compelling. I would not have expected it to work as well as it does. Glancing at the benchmarks it's basically on par with other VLMs in its class, despite having no separate image encoder.
Is there an associated paper? Or more specifically, details on the training dataset? It must have been a mix of text and VLM tasks, otherwise one or the other capability would have rotted during training. But I wonder if they trained on strictly VLM corpora, or also used plain image-text pair datasets like the ones CLIP was trained on. It would be interesting if it were only the former.
Also makes me wonder if it could be trained on something like CommonCrawl where all the images are retained and interspersed correctly throughout the text. This model could theoretically train just fine off that, and it would unlock a whole new dataset effectively.
And has there been an inspection of what the model is outputting for predicted image "tokens"? Is it correctly predicting projected image patches to any degree of accuracy? And could therefore also generate images inline with text if another de-projection layer was trained?
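My guess, purely speculative, is that such a "de-projection" would just be another linear layer mapping hidden states back to flattened patch pixels, something like this PyTorch sketch (the dimensions are assumptions on my part, not Fuyu's published config):

    import torch
    import torch.nn as nn

    hidden_size = 4096            # assumed transformer width
    patch_size, channels = 30, 3  # Fuyu reportedly uses 30x30 patches; RGB assumed
    patch_dim = patch_size * patch_size * channels

    # hypothetical head: hidden state at an image position -> flattened patch pixels
    deproject = nn.Linear(hidden_size, patch_dim)

    hidden_states = torch.randn(1, 16, hidden_size)   # 16 predicted image positions
    patches = deproject(hidden_states)                # (1, 16, 2700)
    patches = patches.view(1, 16, channels, patch_size, patch_size)

Whether the pre-trained model's predictions at image positions are anywhere near sensible patch values is exactly the open question, though.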
I too would like to know about the training dataset, as I just took a look at the one for LLaVA[0] and found out that they used a pretty big amount of BLIP auto-generated captions.
This seemed a bit surreal to me, like trying to train an LLM on the outputs of a smaller, worse-performing LLM.
The Fuyu pre-trained model is not open source. At best, it is source-available. It's also not the only multimodal model you can run locally.
A few other examples include LLaVA[0], IDEFICS[1][2], and CogVLM[3]. Mini-GPT[4] might be another one to look at. I'm pretty sure all of these have better licenses than Fuyu. Fuyu's architecture does sound really interesting, but the license on the pre-trained model is a complete non-starter for almost anything.
If it's for the win (?), the most permissive license is the one to choose. This is an extraordinarily competitive space. The sooner you make the choice and it's MIT, the sooner I personally put forth serious contribution time, and the faster you grow in this broad and competitive ecosystem. Your main options are the GNU All-permissive License, MIT License, BSD licenses, Apple Public Source License, and Apache License.
It depends a lot on what you want the license to do, so I don’t really want to say one way or another.
IANAL, but my understanding is that code without a license effectively has an “all rights reserved” license in the U.S., meaning that it can’t be used for anything at all — even non-commercial work.
Really cool that the image patches are converted to tokens with just a linear projection instead of a big embedding model! I wonder if that trick will prove viable for other multimodal media like audio.
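For anyone trying to picture it, the input side is roughly the following (my own sketch, not Adept's code; the patch and hidden sizes are assumptions): split the image into fixed-size patches, flatten each one, and push it through a single nn.Linear so the patches land in the same embedding space as the text tokens.

    import torch
    import torch.nn as nn

    hidden_size = 4096            # assumed transformer width
    patch_size, channels = 30, 3  # Fuyu reportedly uses 30x30 patches; RGB assumed
    patch_dim = patch_size * patch_size * channels

    project = nn.Linear(patch_dim, hidden_size)   # the entire "image encoder"

    image = torch.randn(1, channels, 240, 360)    # any resolution divisible by the patch size
    patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, patch_dim)  # (1, n_patches, 2700)

    patch_embeddings = project(patches)           # (1, n_patches, hidden_size)
    # these are interleaved with the text token embeddings and fed to the decoder

For audio you would presumably project fixed-length frames of the waveform or spectrogram the same way.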
Not using an embedding/lookup table means it can't generate images or audio, which to me is a severe limitation. Why bother going through the process of building a multimodal transformer if it can generate nothing but text?
This looks so cool, and from reading the Hugging Face model card it should be easy enough to run. I do almost all of my work with text, NLP, IR, etc., and I have wanted to try multi-modal models. I just bookmarked the model card page.
I am also getting even more excited by the explosion of work on open models. I still haven’t adjusted to how good mistral-7B is, and it runs on my Mac without breaking a sweat.
I gave it a shot on an M1 Max with 64GB RAM yesterday and it consumed all available RAM and hit a wall. I can run other, larger models without any problems so I assume it’s not an intrinsic limitation, but I didn’t spend any time debugging it.
This looks epic. Definitely going to explore adding it to Autodistill[1] this weekend. Any chance you'll be publicly releasing the internal OCR finetune?
Awesome! I can't wait to see how we can make local models for describing images offline, or even feeding in a few screenshots of, say, a video game and having it describe what's going on.
This looks great! Is there any software that supports these? llama.cpp, Ollama, LM Studio, etc. are really convenient, but I don't think they have image support yet?
Can this be used to click around in the browser with text prompts? Maybe after some fine-tuning on screen recordings of specific workflows in browsers.
Note that you can get the model weights on HuggingFace here: https://huggingface.co/adept/fuyu-8b
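With a recent transformers release, the usage pattern from the model card looks roughly like this (a sketch; the prompt and image path are placeholders, and it assumes a CUDA GPU with enough memory):

    from transformers import FuyuProcessor, FuyuForCausalLM
    from PIL import Image

    model_id = "adept/fuyu-8b"
    processor = FuyuProcessor.from_pretrained(model_id)
    model = FuyuForCausalLM.from_pretrained(model_id, device_map="cuda:0")

    image = Image.open("screenshot.png")            # placeholder image path
    prompt = "Generate a coco-style caption.\n"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)[0])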