It kinda works. I tried "kermit doing push ups" and it does look like Kermit, and he sorta looks like he's in push-up position. Other than that, it looks a lot like those early image-generation AIs that vaguely resemble what you asked for but come out all mutated. The animation itself is not very good for this prompt.
But hey, still pretty good for the early days. Maybe I need to figure out how to prompt-engineer and tune it. It seems very heavily based on existing image models? Wondering how easy it is to adapt to other image models. I think I need to read the paper. https://imgur.com/a/h3ciJNn
The examples on GitHub are so much better than the examples here. I wonder if the authors are cherry picking or we just don't know how to get good results.
Why are people downvoting this? The examples on GitHub are clearly doing the action described (unlike the dachshund above) and don't have the wild deformations that the "SpongeBob hugging a cactus" or "Will Smith eating spaghetti" animations have.
In my experience, AI generative processes always involve a lot of cherry-picking; rarely have I generated one image or one text that was perfect or very representative of what can be attained with some refining.
Seems fair to choose good results when you try to demonstrate what the software "can" do. Mentioning that the process is often iterative would be the most honest, but I think anyone familiar with these tools assumes this by now.
This is, unfortunately, standard practice in generative papers. Best practice is to show both cherry-picked samples and a random selection. But if you don't cherry-pick, you don't get published. Even including random samples can put you in a tough position.
To be fair, metrics for vision are limited and many don't know how to correctly interpret them. You want to know both the best images the network can produce and the average. Reviewers, unfortunately, just want to reject (there's no incentive to accept). That said, it can still be a useful signal because you learn the best (or near-best) that the model can produce. This is also why platforms like HuggingFace are so critical.
I do think we should ask more of the research community, but just know that this is in fact normal and even standard. Most researchers judge works by things like citations and by just reading them. The more hype in a research area, the noisier the signal for good research. (This is good research imo, fwiw.)
They're probably cherry-picking. I know on ModelScope text2video I have to generate 10 clips per prompt to get 2 or 3 usable ones, and out of those will come one really good one. But it's fast enough that I just sort of accept that as the workflow and generate in batches of 10. I assume it's likely the same for this.
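For reference, the loop I mean looks roughly like this (a minimal sketch assuming the ModelScope weights are run through the Hugging Face diffusers pipeline; the model ID and export helper are my assumptions, not anything from this repo):

    # Rough sketch of the batch-and-cherry-pick workflow, assuming the
    # ModelScope text-to-video weights are loaded via diffusers.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import export_to_video

    pipe = DiffusionPipeline.from_pretrained(
        "damo-vilab/text-to-video-ms-1.7b",  # assumed model ID
        torch_dtype=torch.float16,
    ).to("cuda")

    prompt = "a corgi surfing a wave"
    for i in range(10):  # generate 10 candidates per prompt, keep the best ones
        frames = pipe(prompt, num_inference_steps=25).frames
        export_to_video(frames, f"candidate_{i:02d}.mp4")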
It only matters to the impatient among us; give it a few weeks and things will only get better. The pace of AI innovation is frightening but exciting too.
As a phrase it's fine, it's just that in the context of "random internet commenter on a random HN thread", no one knows you or has any idea what you said recently.
My horse is also really high. We've been smoking weed together.
Question about models and weights. When an organization says they release the weights, how is that different from when an organization releases a model, say Whisper from OpenAI? In what way are the model releases different in these cases?
OpenAI also released the weights for Whisper. Some model releases, like LLaMA from Meta, just contain the code for training and the data. You can train the weights yourself, but for LLaMA that takes multiple weeks on very expensive hardware.
(Meta did release the LLaMA weights to researchers and they leaked online, but the weights are not open source.)
When a company releases a model including the weights, you can download their pre-trained weights and run inference with them, without having to train by yourself.
When you download a trained model for use in Python, I'm assuming the file contains both the architecture (the neural net, or even a boosted tree) as well as the weights / tree structure that makes the model actually usable for inference. When organizations release a trained model, I'm assuming that the weights are necessary to make use of that model? If not, then they're not really releasing the model, just the architecture and training data?
For example, with Lucene: the model would be the Java library, the data is text data like Wikipedia, and the weights are the Lucene index. If you have all three you can start searching right away. If you have model + data, you have to generate the index, which can take a lot of time; training/indexing takes longer than searching or using the model. If you have just the model, you need to get your own data and run training on it.
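To make the distinction concrete, here's a tiny PyTorch sketch (all names hypothetical): the class is the "model" (architecture), the state-dict file is the "weights", and the dataset is the "data" you'd need if you had to train the weights yourself.

    # Hypothetical example: "model" = architecture code, "weights" = learned
    # parameters, "data" = what you need to (re)train the weights.
    import torch
    import torch.nn as nn

    class TinyClassifier(nn.Module):  # the "model": architecture only, no knowledge
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)
            )

        def forward(self, x):
            return self.net(x)

    model = TinyClassifier()

    # If the weights are released, you load them and can run inference right away.
    model.load_state_dict(torch.load("tiny_classifier_weights.pt"))
    model.eval()

    # If only code + data are released, you have to run training yourself,
    # which is the expensive part (weeks on big hardware at LLaMA scale).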
With llama.cpp getting GPT4All inference support the same day it came out, I feel like llama.cpp might soon become a general-purpose high-performance inference library/toolkit. Excited!
Heh, things seem to be moving in this direction, but I think there's still a very long way to go. But who knows - the number of contributions to the project keeps growing. I guess when we have a solid foundation for LLM inference we can think about supporting SD as well.
I see that there is a 12GB VRAM requirement. Can my 6GB GPU do anything to at least provide some performance advantage instead of running it entirely on the CPU?
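Not sure about this repo specifically, but if it exposes a diffusers-style pipeline, the usual memory-saving knobs might be worth a try before falling back to pure CPU (untested here, so treat this as a sketch):

    # Common diffusers options for squeezing a pipeline into less VRAM.
    # The model ID is a placeholder; whether 6GB is enough depends on the model.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "some/text-to-video-model",      # placeholder, not this repo's actual ID
        torch_dtype=torch.float16,       # half precision roughly halves memory
    )
    pipe.enable_attention_slicing()      # lower peak memory at some speed cost
    pipe.enable_model_cpu_offload()      # keep idle submodules in system RAM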
Seems very limited. I wonder if the same can be achieved with just Stable Diffusion and latent walks between neighboring points with very small steps. On the other hand, the interpolation techniques with GigaGAN txt2img produce much higher-quality “videos” than this.
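The kind of latent walk I have in mind would look something like this (a rough sketch with plain Stable Diffusion via diffusers, not what the paper does; proper walks usually use slerp rather than lerp to stay on the Gaussian shell):

    # Sketch of a latent walk: keep the prompt fixed and nudge the initial
    # noise latents in small steps, rendering one frame per step.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    shape = (1, pipe.unet.config.in_channels, 64, 64)  # 64x64 latents -> 512x512 image
    start = torch.randn(shape, dtype=torch.float16, device="cuda")
    end = torch.randn(shape, dtype=torch.float16, device="cuda")

    prompt = "a lighthouse at sunset"
    for i, t in enumerate(torch.linspace(0.0, 1.0, steps=24)):
        latents = torch.lerp(start, end, t)  # very small steps between noise samples
        frame = pipe(prompt, latents=latents).images[0]
        frame.save(f"frame_{i:03d}.png")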
Weirdly, Corridor Digital had an AI-generated video and they suffered from what is slightly happening here - the image of the bear / panda or whatever is a different animal each time (i.e. it's a panda, just a different one "hallucinated" each frame).
Corridor Digital handled it by training their model on specific images of specific people - so they effectively said "video of the panda called Phil that we have trained you on images of".
Clearly this is not possible here - so I am missing how they got it this close.
OK... This won't upend the VFX industry. Maybe a subsequent generation that is vastly more precise, but this isn't even ballpark. The stick figure results are vastly less impressive than current methods made with less time and nearly as little effort. Look at Unreal Engine's talk at GDC and then consider that movie studios generally won't use those tools for anything visually important because they just aren't as good as pre-rendered 3D. Even in places like The Mandalorian it was only used for background imagery.
Someone will see these results, and think "if only it was 10x better, I could be rich by selling to the vfx industry". They'll then start training a 10x bigger model (there is plenty of video training data out there).
And in 3 months, it will be upending the vfx industry...
Once again, the hubris of developers when trying to evaluate art just kills me. The most important part of art is deciding exactly what needs to be in the image and how it should look-- not the act of actually creating it. Describing these things in plain English with the precision needed would take way longer than current experts take to make them with existing tools, and the refinement process is full of uncertainty. At most, these tools will provide a base layer for people to modify using the professional tools of the trade. Much like development work, the final 10% takes 90% of the time. In dev, you're figuring out how to make it. Usually in VFX, it's figuring out how it should look. That's a fundamentally human process. Whether it accomplishes its goal is evaluated by humans and it needs to satisfy human tastes. No matter how good the tools get at creating images, they just won't be able to speak to human needs and emotions like people will, maybe ever.
> they just won't be able to speak to human needs and emotions like people will, maybe ever.
Until the AI knows what you want and like by analyzing your browsing and viewing history. There is a part in the Three-Body Problem trilogy where the aliens create art that's on par with, if not better than, human-made art, and humans are shocked because the aliens don't resemble any sort of human intelligence at all.
Customizing marketing experiences for consumers is a vastly different task than creating art. Creating images is a vastly different process than creating art, though creating art might involve it. If you think art is merely customizing imagery to suit people's preferences, you don't understand art. No matter how hard sci-fi gets, what you're talking about is still very much "fi".
> If you think art is merely customizing imagery to suit people's preferences, you don't understand art.
No, art is whatever the artist wants to call art, and it's whatever people want to find meaning in. Sure, you can't equate it to producing images, but the vast majority of people won't really care; if they can see some cool images or videos, then that's what they'll want. This is the same argument that was used when Stable Diffusion was released, yet the inexorable wheel of progress turns on.
> No, art is whatever the artist wants to call art, and it's whatever people want to find meaning in.
Your non-definition of art has no relevance to this discussion. You assume this is all very simple because you have no idea what you're talking about. You're engaged in the sort of discussion that inspired the Dunning-Kruger research.
> the vast majority of people won't really care
If the vast majority of people won't care, why do all major movies and games spend tens or hundreds of millions of dollars on VFX when they could get lower quality versions of the same exact imagery for 1/20th the budget like SyFy productions do? It provides a huge ROI because people care. If your taste and/or perception isn't sophisticated enough for it to matter, you can't just assume that's the case for everybody else. Truthfully, you almost certainly do have the perceptual sophistication to care-- you just haven't spent much time breaking down all of the incredibly important details that escape your notice yet heavily influence your impression of the end product.
There's no shame in not knowing something, but there is shame in the hubris of trying to explain that thing to subject matter experts.
----
EDIT: I can't reply to your comment but it doesn't matter. You're clearly way outside of your depth and I have no interest in playing along to help you avoid having to confront that. Bye.
Also, not sure why you can't reply to me; I see my comment and your comment just fine. Are you being rate-limited by HN?
Either way, saying someone is "way out of their depth" simply because they disagree with your preconceived notions really isn't a way to communicate; if that's your approach, you might as well shut yourself off from any opinions at all. To onlookers, it just seems like an excuse not to engage in meaningful discourse.
I'm saying you're way out of your depth because you're treading the philosophical ground most committed artists are sick of within a few years of adolescence, and your assertions about the effect and creation of visuals in entertainment show even less understanding than that. I've had a billion conversations like this with developers, engineers, etc. so enamored with their own ability to reason that they incorrectly assume it gives them universal expertise.
I'm well beyond the age where I feel the need to explain basic aspects of something I have professional expertise in because someone insists their lack of knowledge is just as valid as my hard-won education and experience. So if you want to keep arguing about it, go ahead. You're just going to do it by yourself.
Good for you; perhaps you do have more knowledge than me. But as I mentioned, there are not just developers on this site; I've been doing art for far longer than I've been a developer, for example.
> You assume this is all very simple because you have no idea what you're talking about. You're engaged in the sort of discussion that inspired the Dunning-Kruger research.
Not really. Don't assume you're the only artist on this site; there are others too. Why don't you provide a valid definition of art, then? We'll use that one for this discussion.
> If the vast majority of people won't care, why do all major movies and games spend tens or hundreds of millions of dollars on VFX when they could get lower quality versions of the same exact imagery for 1/20th the budget like SyFy productions do?
That's...my point. This is basically what I said (or perhaps what I meant to say, if that didn't come across clearly). Then you said,
> Creating images is a vastly different process than creating art, though creating art might involve it. If you think art is merely customizing imagery to suit people's preferences, you don't understand art.
I understood this to mean that you see something beyond creating images (i.e., fancy VFX) and that there is some deeper meaning of "art." My point is that as long as people see pretty, expensive pictures on the screen, many of them won't care about some ideal artistic merit. See how much money Avatar or Transformers made compared to generally more highly regarded artistic films like Everything Everywhere All at Once, which didn't make nearly as much.
This is a completely different problem than image generation, and we do not know how to do it well. Don't just assume AI can do whatever you imagine it can.
Besides, humans have the power of boredom and so it's not possible to create "perfect" images that would hold their attention forever.
TikTok is the only recommender I've heard of that actually works for people. Netflix is tuned for what they want to show people, not what people want to watch. Algorithmic timelines generally aren't popular with users though.
(Although a lot of "recommendation algorithms" are actually spam fighting, not engagement promoting.)
> Algorithmic timelines generally aren't popular with users though.
Depends on how well they work. As you say, TikTok is a big one where I doubt many people would want their feed in chronological order instead. Technical people especially seem to dislike algorithmic feeds, but laypeople don't really mind them (indeed, they don't even think about it at all; based on personal experience, some of the people I've talked to don't even know that chronological feeds are a thing).
I'm sorry, but these replies are so funny. I'm an engineer and wish more of us understood that these aren't workflow replacements; we'd build much better tools. I've been in AI art for 2 years, so I also grok that we're not one more model away from reading your mind and always-perfect results.
Yeah. Convincing people that they don't know something they think they know is always a tough task. Convincing developers they don't know something they think they know just might be impossible. I feel like the kid in the Starfish parable.
It will help add video illustrations to ordinary presentations or talking-head videos. That's a big enough market for some attention from, say, Adobe, and for a bunch of less expensive offerings.
So, it doesn't upend a multibillion-dollar market because someone who has no idea how the market works made an overly enthusiastic comment on HN? Who would have thought.
That's like saying kids at home will be making enterprise software once code generation tools become smooth enough. ChatGPT won't replace novelists, either, no matter how solid its grammar gets.
People who don't mind having output that's an aesthetic amalgam of whatever it has already ingested won't mind using these sorts of tools to generate work. For a movie studio that lives and dies on presenting imagery so precisely and thoughtfully crafted that it leaves a lasting impression for decades, I doubt it will be anything more than a tool in the toolkit to smooth things along for the people who've made careers figuring out how to do that.
I think there are two reasons this sort of thinking is so prevalent. A) Since developers are only really exposed to the tools end of other people's professions, they tend to forget that the most important component is the brain that's dreaming up the imagery and ideas to begin with. Art school, or learning to be an artist, is a lot more about thinking, ideas, analyzing, interpreting, and honing your eye than about using tools... people without those skills who can use amazing generative tools will make smooth, believable-looking garbage in no time flat from the comfort of their own living rooms. Great. B) Most people, especially those from primarily STEM backgrounds, don't really understand what art communicates beyond what it physically represents and maybe some intangible vibe or mood. Someone who really knows what they're talking about would probably take longer to accurately describe an existing artistic image than would be reasonable to feed to a prompt. Once again, that will be fine for a lot of people-- even small game studios, for example, that need to generate assets for their 3rd mobile release this month-- but it's got miiiiles to go before it's making a movie scene or key game moment that people will remember for decades.
> For a movie studio that lives and dies on presenting imagery so precisely and thoughtfully crafted that it leaves a lasting impression for decades
I'd hope so, but a lot of studios are churning out the next Michael Bay Transformers-type movie, and those films get the vast majority of VFX work and make the most revenue, so upending the VFX industry could remove a lot of jobs.
You again vastly underestimate the amount of work that goes into the intellectual and artistic components of VFX, and vastly underestimate what the artistic components require of someone. Those movies are nearly entirely VFX. 50% of the budget in some cases. That's not because the tools are expensive or the pieces just take a long time to build-- it's repeatedly iterating and putting things in context to figure out what should be there. No matter how fast those iterations get, that's just not something machines will be able to do themselves. Ever. No machine will be able to say "this is the most emotionally impactful texture and drape for that superhero's cape in this scene." They might be able to give you a thousand different versions of it, but someone still has to look at all of that and respond to it. Replicants from Blade Runner couldn't do that and to say we're a ways off from that is a pretty huge understatement.
> No matter how fast those iterations get, that's just not something machines will be able to do themselves. Ever.
Why do people say stuff like this? How do you know what machines will ever be able to do; can you tell the future? And more fundamentally, do you believe humans are more than a biological machine? If you do (and thus aren't a physicalist), then sure, you can make statements like this, because you'd believe that humans have something fundamentally different that no machine can replicate. But if you don't (and thus are a physicalist), you'd believe that anything humans can do, machines will eventually be able to do, even if it takes a long time.
Now you're moving the goal posts from automatic plausible image generation to full on general AI and saying 'some day' it could happen. This is not the same discussion.
It doesn't really have to be general AI to iterate on the same loop as humans, as the parent mentions. Using words like "(n)ever" should be reserved for things that are provably impossible, like violating the laws of mathematics or physics, not for things we already see are feasible today (i.e., human and animal minds and physical movement). It's not a question of "some day"; it's more a question of which philosophical belief one holds that makes them think it could or couldn't happen. Like I mentioned, some people simply believe humans are somehow better than machines or animals, and based on what we've seen so far, that's simply not the case.
I'm not even sure what you are actually trying to say at this point.
At first you seemed to be insinuating that AI will "upend vfx" because a lot of movies are "like Transformers", and now you're saying "never is a long time and what if AI becomes like a human".
Sorry, I was mainly responding to the criticism that many non-technical people say, "AI will never come for our jobs," discounting the rate of change that technology generally experiences over and above other types of human endeavors. I'm just frustrated by people thinking they and their job are somehow special and laying back while real change happens all around them. My points are a bit disjointed; I could have been clearer. Sorry about that.
Your points came across perfectly. The problem is that they are extrapolations based on assumptions born from idle musings and suppositions. AI is already used heavily in VFX and its use is expanding. Tech artists in VFX, games, etc. are not non-technical people. Much, if not most, of any given job involves coding in Python, C++, or a handful of proprietary languages. I was a full-time developer for ten years, and I know one with a master's in computer science in addition to an MFA. If you knew things like that, or anything about the way professional VFX is done, the logistics of how movies get made, what intellectually and creatively goes into creating movie scenes, the pipelines involved and their purposes, both technical and artistic, or any of the other critical bits of knowledge needed to talk about how this stuff actually works, then your comments would have been a lot better. If that wasn't obnoxious enough by itself, it manifests in the urge to rub the perceived insecurity of people's careers in their faces. No matter how insanely ill-informed your statements are, that's just a messed up thing to do. It's not normal. Man. Get some help.
Okay, let's change my statement: many technical people (or just people with jobs in general) say that AI won't come for their jobs, using words like "n/ever" as you have done: "that's just not something machines will be able to do themselves. Ever. No machine will be able to say 'this is the most emotionally impactful texture and drape for that superhero's cape in this scene.'" My point is not about non-technical people particularly; it's the latter part of my statement, that people in general think they are somehow special and will be in for quite a rude surprise when AI and automation eventually do what they thought was only in the realm of mere humans. People in general underestimate the end-to-end automation that will occur. You do too, seeing as you've said what I've quoted above. You say AI is heavily used, great, but don't be surprised when AI is so heavily used that it becomes the only thing used, eventually.
All of your hand waving and jungle gym of qualifications boil down to you thinking AI will "solve" art as a technical problem. Which, of course, is ridiculous. The purpose of art is to communicate intangible, and sometimes indescribable, emotions and ideas about the human condition. It's the difference between a poem and prose describing the same thing using the same words in a different order. When AI becomes its own art critic, so perceptive that it can make better qualitative judgements on microscopic levels about what constitutes good and bad art, it will no longer be artificial. It will just be intelligent. Can it remix art? Pantomime it? Demonstrably. Perceive and reason about it at a human enough level to remove humans from the process entirely? Magical thinking. Even if that did happen, and your crowd of starry-eyed bandwagoneers isn't just confusing technology with magic as humans have always done, the fact that your first instinct is to literally rub it in the faces of people who stand to lose the most is unfathomably pathetic. I can't imagine the person who did that would have anything close to the emotional perception and sophistication to reason about how most people perceive something as fundamentally emotional as art. You don't just need therapy, you need to read philosophy.
I'm completely done with this pointless pedantic argument against your lack of understanding.
There it is: only humans can make art. Your paragraph is proving my own point; this is exactly the type of thing I was talking about when I mentioned non-technical people. The people who say this simply can't accept that there is nothing special about humans and our brains that can't be replicated. It's because I've read enough philosophy that I'm a physicalist and not a dualist. Continue thinking that humans are more than biological machines while other people continue to make more and more powerful AI.
And if you're done with this argument, as you've mentioned a few times over threads, I'm not sure why you continue to reply.
Well you made a liar of me again because I'm going to take a swing at yet another soft ball you've lobbed. If you don't think there's anything fundamentally different about human brains and current computing technology, you should actually try reading neurologists' research about the structure of the human brain. Or any neurons at all. To quote Christof Koch-- chief scientific officer at the Allen Institute for Brain Science in Seattle-- from a Wired article “The roundworm has exactly 302 neurons, and we still have no frigging idea how this animal works.” (Maybe he hasn't thought to ask an AI enthusiast!)
We're not even close to knowing what every part of the brain does, and we don't even have a complete model of how individual biological neurons work, let alone know how we would replicate them, let alone have the potential for doing so in my career.
Might this happen in a really, really long time? Maybe? We're sure as hell not going to do it with a cluster of GPUs. The actual functions of the neurochemical parts that drive emotion are one of the parts we know least about.
So I guess I'm in this for the long haul. What other aspect of this topic can you make bold declarations about without actually having checked if they're correct?
--------
EDIT: I once again can't reply because HN probably has better sense than I do.
The context of this conversation is the VFX industry. Your assertion was that people who work in the VFX industry shouldn't be surprised if AI takes their jobs. You've moved the goalpost continents from that argument, instead arguing that you're still theoretically correct because in some distant future, science will prevail in mimicking or creating biological machines that can perceive emotion.
Please re-read what I said. In the very first comment I made, I said, "Maybe a subsequent generation that is vastly more precise, but this isn't even ballpark." And then after that I said that when we get to the stage where machines can do this, it'll philosophically be actual intelligence and not artificial intelligence. Pretending that's in the scope of this conversation is just daft. And then in this very comment I mused that it might happen in a really long time. We both know that's not relevant to the current VFX industry, and you're just not capable of admitting you're wrong.
Any other topics you'd like to pretend the discussion is about so you can pretend you're still correct?
Don't make a God-of-the-gaps type argument; just because we don't know doesn't mean it's unknowable. If one believes in a scientific universe where laws govern reality, then one must accept the fact that brains are biological machines, where removing any part affects the qualia that an organism experiences. Also, I never said that only GPUs are going to create a general intelligence. For example, if we clone a human brain in the future, we'd get the same outcome.
You are again missing my entire point, which is not to say that something is "never" going to happen if we already observe it happening. I'm not sure how many times I can repeat this point. If you disagree that brains are biological machines or that they can be replicated by humans through technology, just tell me now and we can stop; that is a fundamental difference that is irreconcilable in just an HN conversation.
> I'm just frustrated by people thinking they and their job are somehow special and laying back while real change happens all around them.
What makes you think Industrial Light & Magic or anyone in the VFX industry thinks this way? It has been one of the most competitive and rapidly changing industries of the last 40 years.
I'd guess it's an abject lack of knowledge about the field, and a deep-seated insecurity that prevents him from feeling good about himself if he doesn't feel like he's superior to someone else.
It will create an entirely new low end to the industry (which may be used in prototyping workflows by the high end), but the high end is going to want much more detailed control of the avatars (I would say models, but that means different things in the colliding domains), and it already does skeletal animation with many more control points.
If you want to upend the VFX industry, you probably want text-to-3D-model and text-to-script-for-existing-animation-systems, both of which still support being fine-tuned manually with existing tools.
ControlNet is the new development that will really allow us to guide diffusion model outputs more granularly - this is the first time I'm seeing it used for video generation.
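For anyone who hasn't tried it, the basic ControlNet flow in diffusers looks roughly like this (the model IDs are the publicly released canny-edge checkpoint and SD 1.5, but treat the details as assumptions):

    # ControlNet-style guidance: an edge map from a reference frame constrains
    # the layout of the generated image instead of relying on the prompt alone.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    edge_map = load_image("frame_000_canny.png")  # precomputed Canny edges of a frame
    image = pipe("a robot dancing in the rain", image=edge_map).images[0]
    image.save("controlled_frame.png")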
[edit] _Almost_ everything I try other than the predefined examples returns "error". "a pencil with wings" returns something that looks nothing like a pencil but does, in fact, have wings.
I haven't seen a single example from this model that demonstrates video with any kind of temporal continuity. It appears every frame is independent of the others.
Yes, you can run inference with any model on a CPU, but here are some examples:
To create a single frame with Stable Diffusion 4-5B parameters, 512x512, 20 iterations takes 5-30 minutes depending on your CPU. On any modern GPU it's only 0.1-20 seconds!
Similarly with LLMs: to produce one token with a transformer of 30-50 layers and 7-12B parameters, you will wait several minutes on a CPU, while it takes a few seconds on a Pascal-generation GPU and a tiny fraction of a second on Ampere.
The *.cpp adaptations of popular models appear to have optimized them for the CPU; for example, llama.cpp and alpaca.cpp let me generate several tokens in a matter of seconds.
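If you want to sanity-check numbers like these on your own machine, something like this works (timings vary wildly with hardware, dtype, and step count, so take the figures above as ballpark):

    # Quick CPU-vs-GPU timing check for Stable Diffusion inference.
    import time
    import torch
    from diffusers import StableDiffusionPipeline

    for device in ("cpu", "cuda"):
        if device == "cuda" and not torch.cuda.is_available():
            continue
        dtype = torch.float16 if device == "cuda" else torch.float32
        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
        ).to(device)
        start = time.perf_counter()
        pipe("a red bicycle", num_inference_steps=20)
        print(f"{device}: {time.perf_counter() - start:.1f}s for 20 steps")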
> To create a single frame with Stable Diffusion 4-5B parameters, 512x512, 20 iterations takes 5-30 minutes depending on your CPU
Depends a lot on the CPU. Are you specifically talking about Text2Video or SD in general? IIRC, last time I tried SD on my CPU (10-core 10850K, not exactly cutting edge) it took less than one minute for more than 20 iterations. This was about 4-5 months ago; things might have gotten better.
The GPU (even a vintage 1070) was faster still of course.
https://github.com/Picsart-AI-Research/Text2Video-Zero/blob/...