[Edit 23.01.2026: use the latest version v2.0 now (see the version description).
Workaround for a small issue with the audio part in v2.0: go to the bottom right of the ‘01 Audio...’ group, move the ‘Any Switch’ node from subgroup ‘01.1.3’ to a free area in the ‘01’ group, and make sure the node is not bypassed.
I will fix this in the next version.]
Special thanks to:
@boinobin730 for a lot of testing, sharing knowledge and pushing this project 🙂
@SeoulSeeker for sharing his knowledge and giving the first crucial hints.
Features:
This workflow uses InfiniteTalk to generate videos of a talking/singing person or object. The resulting video is guided by a start image, an audio source (speech/voice/song) and a control video for the general movement. I designed it as an all-in-one workflow: you just need a start image and/or an optional audio/video source.
- Works perfectly on an RTX 3060 with 12 GB VRAM and 32 GB RAM plus a large swap file (min. 64 - 128 GB).
- Easy installation (all necessary models linked).
- Easy to use via switch options.
- High-quality outputs.
The workflow includes 4 simple steps (see the sketch after the list):
1. Audio generation or loading,
2. Video generation or loading for DWPose motion control,
3. InfiniteTalk: generates the final LQ video output (guided by DWPose and synchronised to the audio),
4. Upscaling and frame-rate multiplying for smooth HQ outputs.
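In pseudo-Python, the data flow looks roughly like this (the function names are placeholders for the node groups, not real ComfyUI calls):

```python
# Rough sketch of the data flow only - each function stands in for a
# whole node group; none of these are actual ComfyUI API calls.
def run_workflow(start_image, text_or_audio, control_video=None):
    audio = generate_or_load_audio(text_or_audio)                  # step 1
    motion = dwpose(control_video or generate_video(start_image))  # step 2
    lq_video = infinitetalk(start_image, audio, motion)            # step 3
    return upscale_and_interpolate(lq_video)                       # step 4
```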
Videos of around 5 seconds work well. Longer videos (around 10 seconds) are possible, but you might quickly run into known video issues like looping movements, OOM errors, etc.
This workflow is quite advanced now - I would say in an early beta state. Everything should work technically, so I believe it is a good basis for more advanced tests and hopefully some fun 🙂
My intention is to integrate the Step Audio EditX engine for easy-to-use, advanced audio control via tags as soon as possible. But currently there are some issues with the corresponding nodes.
A next step might be the integration of camera control.
Attention:
This workflow is intended for advanced ComfyUI users. Even though installation and usage should be simple, this workflow is primarily a basis for testing and development, and you might need some ComfyUI knowledge to use it. Please understand that I will not give basic installation and ComfyUI support here.
If you are a beginner with video generation and more complex workflows, I would recommend my other workflow for video generation. That one has been well tested and is already much better documented and commented.
About the basics:
This workflow is based on official templates and various already published workflows. I just put different parts together, created a hopefully easy-to-use “design” and optimized everything for 12 GB VRAM.
Description
First "alpha" release for testing.
Comments
Hi Arkinson, I tried really hard to get it to work in its alpha state. Initially I tried to run it by just inputting some text in the generate-audio group. It processed, but only gave me a 1-second motion preview. So I thought I would just bypass it using a Load Audio node with about 10 seconds of audio, a WAV file. The audio was a 10-second song with just vocals, no instruments. It ran through the motions and generated an excellent saved LQ motion video, but during the InfiniteTalk node group it created a black video, which then translated into an upscaled black video (the DWPose generation looks fine). Some notable error messages: unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']
Requested to load Wav2Vec2Model, plus a lot of "lora key not loaded" errors. I asked ChatGPT; it said there might be a length mismatch of frames, and to let InfiniteTalk dictate the frame length rather than truncating it, if that makes sense. I can provide you with its output and reasoning. I have to admit that logic might make sense: when I played with InfiniteTalk before, the fed-in sound file would finish but the animation still kept going for a few seconds. You can see it in one of my profile examples where she sings and stops, and the video plays on while she kind of just blinks and nods her head a little. I don't know, I'm just throwing it out there. This is a really fun workflow though, once it works for me. I'm going to get up to all sorts of mischief with it. Thanks Arkinson.
@boinobin730 Thank you so much for the first round of testing.
01 Audio part: Simply add more text or change the speed until you get around 5 seconds of audio preview 😉 Don't go too long (like 10 seconds) at first.
02 Motion control video: Yes, this part should work properly. After publishing I saw that I can reduce the resolution here drastically for faster generation and just upscale it before the DWPose processing. I will publish this in the next version soon.
03 InfiniteTalk LQ video: Black output: I have to check whether I provided the right model in the download links. There might be some mismatch, because I can't remember anymore where I downloaded the model I used. I'll sort it out shortly. Please wait until then.
04 Upscaling/Multiplying: Don't use it if you get a wrong output in step 03. Upscaling a black video will always result in a black video - unless there is some supernatural magic in the game 🤣
OK - I'll get back to you soon.
@boinobin730 OK, I published version v1.1 for faster generation in step 02, which includes some changes in the subgraphs for steps 02 and 03.
Down-/upscaling and determining the right resolution (divisible by 16) for the InfiniteTalk part seems to be essential (in the beginning I struggled a lot with this "simple" stuff, getting weird error messages).
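If you want to sanity-check a resolution yourself, the snapping logic is basically just this (a minimal sketch of the idea, not the actual node setup):

```python
# Snap a dimension down to the nearest multiple of 16 before the
# InfiniteTalk part (sketch only - check what your resize node does).
def snap16(value: int) -> int:
    return max(16, (value // 16) * 16)

print(snap16(854), snap16(480))  # 854x480 -> 848x480
```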
I checked the models. Everything should be fine. Could you please run a fresh test now:
- to exclude any other errors, please use step 01 to generate just 3 - 5 seconds of audio,
- don't use any additional LoRAs at first,
- go step by step.
@arkinson Terrific. It works perfectly now for me. It is a lot more versatile than I expected. Thank you.
@boinobin730 Perfect 🙂 If you have a hint for a better text-to-speech workflow, please let me know.
I already tested text-to-song. That workflow is very easy too, but the prompting is quite a bit more complex - to control it, you need a lot of musical terminology. I did just some quick tests. Prompting for vocals only can give funny effects: click here. Maybe I will give it some more tries....
I also tested IndexTTS, but only got out some strange noises and not one understandable word.
@arkinson It's a really great start. I just posted one example to the gallery; I subbed out the TTS for a snippet of audio from a song. Surprisingly it does work, but because of the instruments it doesn't look as convincing as if she were singing. (The gallery examples are frozen because, I think, copyrighted music - download it and you will hear and see it.) I am now using Audacity and stripping the instrumentals from the song to see if I get a better lip-sync output. I will try different songs as well. I still like VibeVoice, but I never got around to testing the 7B model, only the 1.5B model. There is the original 7B model (the unnerfed version) floating around on the internet. But it might OOM. I suppose if we curate a good voice output first in a workflow, we can then just take the output and feed it into this workflow. I haven't tried LTX2 yet. Too many AI toys to play with and not enough time.
@boinobin730 I already responded to you yesterday, but accidentally did not click "Comment" 😕 Oh my, I hate this. OK, once again 🙂 Thank you so much for your first examples. I am glad it is running on your side.
I see the same small video quality issues (color shifts, some stripes, etc.). I cut out the first video frames and the audio in the subgraph of step 04, but it is still visible....
Another issue seems to be: overall, the S2V model does not seem to generate movements as impressive as the “enhanced” models used for motion control.
Instruments/music/background noises: this should not be a big problem, because there is an "audio separation" node in the subgraph of step 03. So the movement should be generated by the voice only (in theory). The complete audio stream is simply added back to the finished video at the end. But maybe some tweaking of the "audio separation" values is necessary. I have no experience with it yet.
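Conceptually it works like this (hypothetical placeholder functions, just to show the idea, not real node calls):

```python
# Hypothetical sketch of the audio handling in subgraph 03
# (placeholder functions, not real ComfyUI node calls).
def step_03(start_image, motion, full_audio):
    vocals, rest = separate_audio(full_audio)             # "audio separation" node
    lq_video = infinitetalk(start_image, vocals, motion)  # lip sync driven by vocals only
    return mux(lq_video, full_audio)                      # complete mix re-attached at the end
```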
Audacity/VibeVoice: Do you have any working workflow links? I will definitely have a look. My general intention is to find the simplest yet flexible solution for audio generation that works in most use cases, is manageable for "common" users, and can be integrated into step 01 - hopefully 🙂 Things I have in mind would be: talking, crying, laughing, asking, singing - with different emphasis, etc. I believe this would be really cool....
If you would like to try text-to-song, just use the ComfyUI template. Installation is easy.
Do not try LTX2 🤣 I already gave it a try and was not able to get even the basics running. Snowshoe told me he runs it with >88 GB VRAM 🙄
@arkinson Yes, after playing with the workflow a lot yesterday, I can see what you mean by the S2V degradation. The enhanced motion, especially the initial motion, is very expressive, whilst the finished output has actions that are fairly muted in comparison. I guess it's just a natural limitation of what it can do at the moment.
I didn't know it separates the vocal audio in the workflow. I did notice a lip-syncing improvement when I had a song with much clearer vocals. For example, there were two versions of the song I used in my test. The first version was the remixed version, which had a lot more instrumentals; the second version is the original, which I have now been using. I get a better output from the second version. (I will post my example soon.)
Do you know why the upscale cuts off a few seconds of voice, whereas the InfiniteTalk output has the complete speech correct? Perhaps frame-rate changes?
I have to dig around for the VibeVoice workflow. It got lost when I migrated over. My C: drive also chose this moment to die, so I started from scratch with a full Windows reinstall. Fun times.....
I had no idea LTX2 was that hungry on RAM. I went and paid a small fortune for extra RAM a week or so ago. The prices are crazy in AI land. Everyone is trying to make AI cat videos. Or maybe something else....
@boinobin730 Since I'm just a newbie in the audio field, I did some research yesterday to see if we might be sailing in a completely wrong boat. There is a solution from Kijai with the WanVideoWrapper. Very interesting, but obviously on a completely different level and quite experimental. Overall, I believe we are currently on the right track with the S2V model and InfiniteTalk....
Yes, the movement restrictions are certainly due to the S2V model. I haven't looked yet to see if there are any better models. I just grabbed the first one I found....
Audio separation: Open the subgraph in step 03: the audio separation node has some parameters that you might be able to adjust – I haven't tried that yet....
Audio cut-off in step 04: Yes, silly mistake on my part. Open the subgraph and disable the two nodes for now: “Get Image...” and “Audio Cut.” I wanted to cut out the first three video frames (because they are often poor quality) and cut the audio accordingly, but I overlooked the fact that the audio cut is not done in frames, but in milliseconds. I still need to convert that.
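In the meantime, if you want to set the value yourself, the conversion is simple (fps = 16 here is only an example; use whatever your video nodes report):

```python
# Convert skipped video frames to the millisecond value the
# "Audio Cut" node expects (the fps value is an assumption).
def frames_to_ms(frames: int, fps: float = 16.0) -> int:
    return round(frames * 1000 / fps)

print(frames_to_ms(3))  # skipping 3 frames at 16 fps -> cut 188 ms of audio
```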
VibeVoice workflow: I have now installed tts_audio_suite and found it. However, I have only experimented with Chatterbox so far. Hmm, but it all seems much less controllable than I had naively imagined. But I'll keep at it. And text-to-song is really funny. Unfortunately, the sound quality is very poor and synthetic...
Reinstalling Windows??? God forbid 🙄
LTX2: No, not RAM -> more than 88 GB VRAM 😂
@arkinson Thanks. I will adjust the settings as noted.
Yes, it was the Chatterbox workflow that I used. I felt VibeVoice was superior to IndexTTS. I didn't spend much time on it, but I felt the sample from the one-shot was pretty accurate. Obviously you need a good few-second sample to use. But there is really nothing stopping us from curating good speech and then passing that input into this workflow.
88 GB VRAM? What the? Jealousy... much!!!
I have seen some low-VRAM GGUF LTX2 workflows. I will see if I can get anywhere with them.
@boinobin730 I am currently trying to get a little bit deeper into TTS Audio Suite. For a beginner this is all pretty complex, but I believe it is going in the right direction. I'm already able to simply add "emotions", which is really cool. I'm just struggling with the audio "tags". The documentation is mostly junk, and finding the right models is always hell... but if I get something useful out of it, I will publish it here for audio testing....
Yeah - I heard too that LTX2 should work with low VRAM, but I did not quickly find anything that works... If you get it running some day, let me know.
@arkinson Sounds very interesting. Every day presents new possibilities and development tools. I will let you know how I go with LTX2.
@boinobin730 New workflow version v1.2 is out. Only small changes:
- Easy option to load recorded audio.
- Video/audio syncing in step 04 should be fixed now, and you can select how many initial frames to skip (open the subgraph). My original calculation was right, but in the "Audio Cut" node I set the length too low. Because you generated very long videos, this ran into issues...
If you like, have a look at the ElevenLabs online audio generator (see the description in my new workflow). It seems to do everything I want so far and is free to use for testing. Just a first quick test for fun 😅
@arkinson Thanks. This looks good. I will try it.
@arkinson So this is really odd. The new workflow produces black again for me. I thought that was strange, fired up the v1.1 workflow, and got black again. It was working a day ago. I wasn't using it yesterday, as I was testing LTX workflows (p.s. they suck at the moment). So I am now trying to figure out what in the dickens happened between my working outputs and now. ComfyUI craziness. Other people also seem to be having issues, so there has got to be a common denominator. I think it might be SageAttention, but I will keep testing and let you know.
@arkinson I worked it out. ComfyUI Easy Install creates two batch files if you install SageAttention. The workflow gets some sort of mismatch when you try to run the SageAttention startup. It must be a global version mismatch with the Sage nodes. I tried disabling the nodes in the workflow, but that didn't fix the problem. It only works if you run the NON-SageAttention startup. I also believe that the NON-SageAttention startup still runs SageAttention, as the workflow has no issues whatsoever. I tested the speeds, and the workflow runs quicker with the nodes initialised vs the nodes turned off. As long as you run the standard ComfyUI startup without SageAttention, it runs fine. This also applies to the new workflow version 1.2. Possibly this is the reason others are having trouble at step 3 as well.
@boinobin730 Just had a look at your latest series - oh my gosh 😂🤣 Which sound engine did you use?
Your ComfyUI-Easy-Install: something seems to be wrong on your side. I'm on the latest 0.9.2 and both start bat files are working. But you are right - you can even use the standard bat file (without SageAttention) and it works too, because we explicitly start the Triton + SageAttention part via the nodes in the workflow. The SageAttention bat file just starts ComfyUI with SageAttention initialized "globally".
@arkinson I won't be surprised if I got a buggy ComfyUI install. Umm, I cheated. I ripped the voices from a game to test how good the lip sync is. But I have fed these same voices into VibeVoice and got very similar outputs. Not the same, but quite similar. I will play with VibeVoice some more and see. IndexTTS is not too bad, and I should spend some more time tinkering with it.
@boinobin730 Hi - thank you. I have tried most audio engines. In principle, they are all very similar. Some allow you to use very simple text tags (e.g., change speaker, simple emotions, etc.). As I understand it, only “Step Audio EditX Engine” works automatically and with more complex tags. Hence my other comment. I posted the error to the developer on GitHub, but have not yet received a response.
By the way, the workflow has recently stopped working for me as well. But I still need to investigate what is wrong here. It seems we all have buggy systems now 🙄
@arkinson No worries. I haven't tested today since I've been busy the whole day. I will try and see if it's still stable. I will also have a go at the audio workflow you posted. Chat soon.
@boinobin730 Don't hurry 🙂
@arkinson The sound workflow is still OK for me. I'm glad you got yours sorted in the end. I see version 4 of your original workflow. Outstanding! Thank you. I will play with it soon. I managed to get a VibeVoice workflow working that operates as a single speaker and works well as an 8-bit model: https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8 It is not perfect, and certain seed numbers are better than others, but it does a fairly good job for a single speaker. Workflow examples are provided and go in the VibeVoice_ComfyUI custom_nodes folder. I just posted an example where I used a VibeVoice artificial voice generated from a 4-second sample of the game voice I used in the purple-hair girl examples. I purposely made her perform an action whilst talking, to see the effect. The first example removed the cup on her second sip action. The second example just made her speak whilst her mouth is covered. Overall I would suggest people avoid any action that would cover the speaker's mouth, as it breaks immersion. The second example was a straight 17-second sound clip. Massive. It took 2 hours 32 minutes to render. I do not suggest long sound clips. Break them up and join them back together like I did in the song example. I'm going to keep testing.
@boinobin730 Hi - thank you so much. Your example is sooo cool 😂 Unfortunately I have not gotten any response on GitHub yet regarding the Step Audio EditX engine issue. But I'll keep at it....
@boinobin730 Btw, I don't know if you already use the custom node pack TTS_Audio_Suite. With it, it is very easy to use any audio engine you like - you can just switch between Chatterbox, VibeVoice, or whatever you want. There are simple example workflows on their GitHub page. I only have the issue with the Step Audio EditX engine. The problem is, all the other engines do not fully support easy tag usage.
@arkinson Oh OK... interesting. Thanks... I am only just getting up to speed with how it all can fit together. I will have a go and see if I can cobble together something usable. Along with your new 4.0 workflow, we are going to be able to generate some very interesting outputs. I didn't know MMAudio is NSFW-friendly. Fun times. LOL. I can then put the video and sound tracks together in DaVinci Resolve. I haven't managed to get your audio workflow going. Theoretically, it looks promising. It would definitely fill a gap in being able to control the dialogue instead of waiting for random generations that don't sound quite right.
@boinobin730 Believe me, 12 hours ago I hardly understood anything myself 🙄 To differentiate more clearly, I have renamed both workflows:
01. "Wan 2.2 Video + Sound" (the old one, currently v4.0): because MMAudio grabs the video and generates the sound.
02. "Wan 2.2 Video + Voice + Motion Control" (currently v1.2): here we do the opposite - the voice generates/manipulates the final video.
I believe 01 is perfect for all kinds of simpler (N)SFW sounds without speech/voice. But it also seems possible, for example, to generate music/singing when an instrument is played in the video....
02 I would primarily focus on spoken language and perhaps music/singing, and always as image/video to video+sound...
The easiest way to combine both worlds, I suspect, would be post video editing, maybe sound mixing too.
The voice worked, but beyond that, just blank windows. No error messages; the process just ended.
@camarcuson194 Stupid question: did you enable option 02 after generating the sound?
@arkinson Not stupid... yes! I activated each of the four, with the same response at each layer. I confirmed the proper model was showing in the selection and toggled on the second pair. The run starts, then immediately ends after the voice step. I know I'm missing something. Also, the model link list doesn't completely match the path placement list. And I think there's a typo in the audio link list?
@camarcuson194 Yes, the right Diffusion Model in the Loader node is fixed in version v1.1, but the link for the Audio Encoder is wrong. Thank you.
Let's go step by step with your issue. Use version v1.1. Upload a start image and enter a prompt. Keep all seeds fixed! Then proceed as described in the workflow:
1. Switch on step 01, switch off steps 02 - 04, and generate an audio preview of about 3 - 5 seconds.
2. Switch on step 02 (steps 01 and 02 are on now, 03 and 04 are off) and press Run.
In the console you should see at least a "got prompt" and of course some error message. If not, there might be something wrong with your ComfyUI installation.
Btw, have you installed Triton + SageAttention? If not, you have to disable the loader nodes in the subgraphs, of course.
@arkinson Well, no success with anything now, even audio. I'd guess it's a Python incompatibility.
@camarcuson194 I was also getting a black screen at step 3. I found that if I ran ComfyUI without the SageAttention command in the startup, it worked. ComfyUI Easy Install has two batch files; try the non-SageAttention version and then try the workflow. ComfyUI in Stability Matrix has a radio-button switch you need to turn off. Let us know how it went for you.
[Contact support: it is actually a known issue that the buzz summary on the model page is not displayed correctly.]
@lllionelll Thank you so much for your generous buzzing 😋🙂 Just a weird question: Civitai notified me that you tipped this model here (video+audio), but it seems you tipped my other model?? 🙄
Is there any way to make a video using audio that I already recorded in Suno?
@paulo Hi - I don't know Suno, but if you have a standard audio file like MP3 or whatever (or can convert to one), it should be no problem: disable step 01 and add an audio loader node to feed the given audio inputs of the next steps with your recorded audio. But keep in mind the length of your audio, because the video length for steps 02 - 04 is calculated from your initial audio. So far I have never tested audio/video much longer than about 10 seconds. And have a look at the conversation with boinobin730 for a workaround for the mismatched audio/video sync in step 04. I will fix this in the next version soon.
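The relationship is roughly this (the fps value and the 4n+1 frame constraint are assumptions based on the Wan model family; double-check against your own nodes):

```python
# Sketch: how many video frames a loaded audio clip implies
# (fps = 16 and the 4n+1 rule are assumptions - verify in your setup).
def frames_from_audio(duration_s: float, fps: float = 16.0) -> int:
    n = int(duration_s * fps)
    return (n // 4) * 4 + 1  # Wan-style models typically expect 4n+1 frames

print(frames_from_audio(5.0))  # 5 s of audio at 16 fps -> 81 frames
```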
@paulo Thanks for buzzing :-)
Steps 1 and 2 work fine for me, but Comfy is crashing on step 3.
I see the following:
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']
Requested to load Wav2Vec2Model
Unloaded partially: 872.48 MB freed, 8275.82 MB remains loaded, 42.22 MB buffer reserved, lowvram patches: 0
loaded completely; 684.81 MB usable, 617.65 MB loaded, full load: True
Unloaded partially: 2082.70 MB freed, 6193.13 MB remains loaded, 42.22 MB buffer reserved, lowvram patches: 0
In the workflow, there's a green box around the Load Diffusion Model node (Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors).
Any ideas?
Sorry - as mentioned, I am not able to give "ComfyUI support" here. But I would research the audio encoder message first.
I was also getting a black screen at step 3. I found that if I ran ComfyUI without the SageAttention command in the startup, it worked. ComfyUI Easy Install has two batch files; try the non-SageAttention version and then try the workflow. ComfyUI in Stability Matrix has a radio-button switch you need to turn off. Let us know how it went for you.
@boinobin730 It's not that I get a black screen at step 3; it's that Comfy crashes. I tried the newest workflow where you can select an audio file rather than create one, and it seemed to work (I also bypassed the SageAttention stuff). If I went back to creating the audio and going through each step, it crashed again during step 3. This was on a fresh install with everything updated.
The only thing I didn't try was creating the audio file and then switching to loading it manually to see if that worked.
@jonk999 You have not provided much information, so it is not easy to help you. Which OS, which ComfyUI version, which ComfyUI release? Really a completely fresh installation? No other nodes installed than those needed for this workflow? Any node conflicts? ComfyUI currently updates nearly daily - as of today we are on release v0.10.0. Are all custom nodes updated? Does ComfyUI really crash, or do you just get the above error message while ComfyUI only pauses? If it crashes, what about your swap file? Did you search for your error message, as I already mentioned?
And have a look at my last comment here on the model page, because I recently had trouble too and solved it systematically.
@arkinson Apologies. Windows 10. Installed via ComfyUI Easy Install, then loaded the workflow and installed the other nodes. Then I ran a Comfy update. I have 32 GB system RAM and a 32 GB swap file (system managed). Comfy definitely crashes, as I can't do anything further until I close the log window and run the bat file again (without Sage).
I'll do another fresh install, remove all installed nodes other than ComfyUI Manager, load the workflow, install the nodes it requires, and try again. And yes, I searched for the error and any possible solutions before posting here.
@jonk999 Hi - thank you. Did you read my troubleshooting comment here, as mentioned? You should follow exactly the same steps. If you do, you will see your swap file is much too small. Use only my suggested "fixed" values at first, because I ran into errors myself with the "system managed" setting. And yes, definitely update your GPU driver.
OK, if I understand you correctly, ComfyUI outputs your above-mentioned error message and just gets "stuck" loading the diffusion model, right? Please be exact, because we have had a lot of out-of-RAM (not VRAM) "crashes" without any error messages. In those cases the console closes the running task because the server has crashed.
What are the results of your research into the error message? Any useful information? Unspecific? Nothing found? Sorry, but just saying "I did" is really not helpful....
@arkinson Yes, I did see your comment. I haven't tried increasing the swap file size yet. I will when I get time and see how that goes. Worst case, I generate the audio file and then just load it manually, since when I loaded another audio file it worked. So for me it seems to be something with using step 01 to generate the file... which is perhaps a little weird.
I have searched for "unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']" and there was one result containing exactly that, but it was from a Mac user and didn't contain anything useful.
@jonk999 Argh - I didn't save my last comment. OK - sort out your swap file first. I'm pretty sure that is the main problem. And just a hint: do not generate/use audio longer than 3 - 5 seconds at first. Because the longer your audio, the longer the video, and the sooner you will run into RAM and/or VRAM errors.
@arkinson Yep. I will increase the swap file and test when I get a chance. The generated audio I was trying with was only around 2 seconds. The other audio I loaded manually, which appeared to work, was longer.