[Edit 23.01.2026: use the latest version v2.0 now (see the version description).
Workaround for a small issue with the audio part in v2.0: go to the bottom right of the ‘01 Audio...’ group, move the ‘Any Switch’ node from subgroup ‘01.1.3’ to a free area in the ‘01’ group, and make sure the node is not bypassed.
I will fix this in the next version.]
Special thanks to:
@boinobin730 for a lot of testing, sharing knowledge and pushing this project 🙂
@SeoulSeeker for sharing his knowledge and giving the first crucial hints.
Features:
This workflow uses InfiniteTalk to generate videos of a talking/singing person or object. The resulting video is guided by a start image, an audio source (speech/voice/song) and a control video for the general movement. I designed it as an all-in-one workflow: you just need a start image and/or an optional audio/video source.
- Works perfectly on an RTX 3060 with 12 GB VRAM and 32 GB RAM plus a large swap file (min. 64 - 128 GB).
- Easy installation (all necessary models linked).
- Easy to use via switch options.
- High-quality outputs.
The workflow includes 4 simple steps (see the sketch after the list):
1. Audio generation or loading,
2. Video generation or loading for DWPose motion control,
3. InfiniteTalk: generates the final LQ video output (guided by DWPose and synchronised to the audio),
4. Upscaling and frame-rate multiplying for smooth HQ outputs.
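In pseudo-Python, the data flow looks roughly like this (the function names are placeholders for the node groups, not real ComfyUI calls):

```python
# Rough sketch of the data flow only - each function stands in for a
# whole node group; none of these are actual ComfyUI API calls.
def run_workflow(start_image, text_or_audio, control_video=None):
    audio = generate_or_load_audio(text_or_audio)                  # step 1
    motion = dwpose(control_video or generate_video(start_image))  # step 2
    lq_video = infinitetalk(start_image, audio, motion)            # step 3
    return upscale_and_interpolate(lq_video)                       # step 4
```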
Videos of around 5 seconds work well. Longer videos (around 10 seconds) are possible, but you might quickly run into known video issues like looping movements, OOM errors, etc.
This workflow is quite advanced now - I would say in an early beta state. Everything should work technically, so I believe it is a good basis for more advanced tests and hopefully some fun 🙂
My intention is to integrate the Step Audio EditX engine for easy-to-use, advanced audio control via tags as soon as possible. But currently there are some issues with the corresponding nodes.
A next step might be the integration of camera control.
Attention:
This workflow is intended for advanced ComfyUI users. Even though installation and usage should be simple, this workflow is primarily a basis for testing and development, and you might need some ComfyUI knowledge to use it. Please understand that I will not give basic installation and ComfyUI support here.
If you are a beginner with video generation and more complex workflows, I would recommend my other workflow for video generation. That one has been well tested and is already much better documented and commented.
About the basics:
This workflow is based on official templates and various already published workflows. I just put different parts together, created a hopefully easy-to-use “design” and optimized everything for 12 GB VRAM.
Description
First "alpha" release for testing.
Comments
Hi Arkinson, I tried really hard to get it to work in its alpha state. Initially I tried to run it by just inputting some text in the generate-audio group. It processed, but only gave me a 1-second motion preview. So I thought I would just bypass it using a Load Audio node with about 10 seconds of audio, a WAV file. The audio was a 10-second song with just vocals, no instruments. It ran through the motions and generated an excellent saved LQ motion video, but during the InfiniteTalk node group it created a black video, which then translated into an upscaled black video (the DWPose generation looks fine). Some notable error messages: unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']
Requested to load Wav2Vec2Model, plus a lot of "lora key not loaded" errors. I asked ChatGPT; it said there might be a length mismatch of frames, and to let InfiniteTalk dictate the frame length rather than truncating it, if that makes sense. I can provide you with its output and reasoning. I have to admit that logic might make sense: when I played with InfiniteTalk before, the fed-in sound file would finish but the animation still kept going for a few seconds. You can see it in one of my profile examples where she sings and stops, and the video plays on while she kind of just blinks and nods her head a little. I don't know, I'm just throwing it out there. This is a really fun workflow though, once it works for me. I'm going to get up to all sorts of mischief with it. Thanks Arkinson.
@boinobin730 Thank you so much for the first round of testing.
01 Audio part: Simply add more text or change the speed until you get around 5 seconds of audio preview 😉 Don't go too long (like 10 seconds) at first.
02 Motion control video: Yes, this part should work properly. After publishing I saw that I can reduce the resolution here drastically for faster generation and just upscale it before the DWPose processing. I will publish this in the next version soon.
03 InfiniteTalk LQ video: Black output: I have to check whether I provided the right model in the download links. There might be some mismatch, because I can't remember anymore where I downloaded the model I used. I'll sort it out shortly. Please wait until then.
04 Upscaling/Multiplying: Don't use it if you get a wrong output in step 03. Upscaling a black video will always result in a black video - unless there is some supernatural magic in the game 🤣
OK - I'll get back to you soon.
@boinobin730 OK, I published version v1.1 for faster generation in step 02, which includes some changes in the subgraphs for steps 02 and 03.
Down-/upscaling and determining the right resolution (divisible by 16) for the InfiniteTalk part seems to be essential (in the beginning I struggled a lot with this "simple" stuff, getting weird error messages).
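If you want to sanity-check a resolution yourself, the snapping logic is basically just this (a minimal sketch of the idea, not the actual node setup):

```python
# Snap a dimension down to the nearest multiple of 16 before the
# InfiniteTalk part (sketch only - check what your resize node does).
def snap16(value: int) -> int:
    return max(16, (value // 16) * 16)

print(snap16(854), snap16(480))  # 854x480 -> 848x480
```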
I checked the models. Everything should be fine. Could you please run a fresh test now:
- to exclude any other errors, please use step 01 to generate just 3 - 5 seconds of audio,
- don't use any additional LoRAs at first,
- go step by step.
@arkinson Terrific. It works perfectly now for me. It is a lot more versatile than I expected. Thank you.
@boinobin730 Perfect 🙂 If you have a hint for a better text-to-speech workflow, please let me know.
I already tested text-to-song. That workflow is very easy too, but the prompting is quite a bit more complex - to control it, you need a lot of musical terminology. I did just some quick tests. Prompting for vocals only can give funny effects: click here. Maybe I will give it some more tries....
I also tested IndexTTS, but only got out some strange noises and not one understandable word.
@arkinson It's a really great start. I just posted one example to the gallery; I subbed out the TTS for a snippet of audio from a song. Surprisingly it does work, but because of the instruments it doesn't look as convincing as if she were singing. (The gallery examples are frozen because, I think, copyrighted music - download it and you will hear and see it.) I am now using Audacity and stripping the instrumentals from the song to see if I get a better lip-sync output. I will try different songs as well. I still like VibeVoice, but I never got around to testing the 7B model, only the 1.5B model. There is the original 7B model (the unnerfed version) floating around on the internet. But it might OOM. I suppose if we curate a good voice output first in a workflow, we can then just take the output and feed it into this workflow. I haven't tried LTX2 yet. Too many AI toys to play with and not enough time.
@boinobin730 I already responded to you yesterday, but accidentally did not click "Comment" 😕 Oh my, I hate this. OK, once again 🙂 Thank you so much for your first examples. I am glad it is running on your side.
I see the same small video quality issues (color shifts, some stripes, etc.). I cut out the first video frames and the audio in the subgraph of step 04, but it is still visible....
Another issue seems to be: overall, the S2V model does not seem to generate movements as impressive as the “enhanced” models used for motion control.
Instruments/music/background noises: this should not be a big problem, because there is an "audio separation" node in the subgraph of step 03. So the movement should be generated by the voice only (in theory). The complete audio stream is simply added back to the finished video at the end. But maybe some tweaking of the "audio separation" values is necessary. I have no experience with it yet.
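Conceptually it works like this (hypothetical placeholder functions, just to show the idea, not real node calls):

```python
# Hypothetical sketch of the audio handling in subgraph 03
# (placeholder functions, not real ComfyUI node calls).
def step_03(start_image, motion, full_audio):
    vocals, rest = separate_audio(full_audio)             # "audio separation" node
    lq_video = infinitetalk(start_image, vocals, motion)  # lip sync driven by vocals only
    return mux(lq_video, full_audio)                      # complete mix re-attached at the end
```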
Audacity/VibeVoice: Do you have any working workflow links? I will definitely have a look. My general intention is to find the simplest yet flexible solution for audio generation that works in most use cases, is manageable for "common" users, and can be integrated into step 01 - hopefully 🙂 Things I have in mind would be: talking, crying, laughing, asking, singing - with different emphasis, etc. I believe this would be really cool....
If you would like to try text-to-song, just use the ComfyUI template. Installation is easy.
Do not try LTX2 🤣 I already gave it a try and was not able to get even the basics running. Snowshoe told me he runs it with >88 GB VRAM 🙄
@arkinson Yes, after playing with the workflow a lot yesterday, I can see what you mean by the S2V degradation. The enhanced motion, especially the initial motion, is very expressive, whilst the finished output has actions that are fairly muted in comparison. I guess it's just a natural limitation of what it can do at the moment.
I didn't know it separates the vocal audio in the workflow. I did notice a lip-syncing improvement when I had a song with much clearer vocals. For example, there were two versions of the song I used in my test. The first version was the remixed version, which had a lot more instrumentals; the second version is the original, which I have now been using. I get a better output from the second version. (I will post my example soon.)
Do you know why the upscale cuts off a few seconds of voice, whereas the InfiniteTalk output has the complete speech correct? Perhaps frame-rate changes?
I have to dig around for the VibeVoice workflow. It got lost when I migrated over. My C: drive also chose this moment to die, so I started from scratch with a full Windows reinstall. Fun times.....
I had no idea LTX2 was that hungry on RAM. I went and paid a small fortune for extra RAM a week or so ago. The prices are crazy in AI land. Everyone is trying to make AI cat videos. Or maybe something else....
@boinobin730 Since I'm just a newbie in the audio field, I did some research yesterday to see if we might be sailing in a completely wrong boat. There is a solution from Kijai with the WanVideoWrapper. Very interesting, but obviously on a completely different level and quite experimental. Overall, I believe we are currently on the right track with the S2V model and InfiniteTalk....
Yes, the movement restrictions are certainly due to the S2V model. I haven't looked yet to see if there are any better models. I just grabbed the first one I found....
Audio separation: Open the subgraph in step 03: the audio separation node has some parameters that you might be able to adjust – I haven't tried that yet....
Audio cut-off in step 04: Yes, silly mistake on my part. Open the subgraph and disable the two nodes for now: “Get Image...” and “Audio Cut.” I wanted to cut out the first three video frames (because they are often poor quality) and cut the audio accordingly, but I overlooked the fact that the audio cut is not done in frames, but in milliseconds. I still need to convert that.
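In the meantime, if you want to set the value yourself, the conversion is simple (fps = 16 here is only an example; use whatever your video nodes report):

```python
# Convert skipped video frames to the millisecond value the
# "Audio Cut" node expects (the fps value is an assumption).
def frames_to_ms(frames: int, fps: float = 16.0) -> int:
    return round(frames * 1000 / fps)

print(frames_to_ms(3))  # skipping 3 frames at 16 fps -> cut 188 ms of audio
```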
VibeVoice workflow: I have now installed tts_audio_suite and found it. However, I have only experimented with Chatterbox so far. Hmm, but it all seems much less controllable than I had naively imagined. But I'll keep at it. And text-to-song is really funny. Unfortunately, the sound quality is very poor and synthetic...
Reinstalling Windows??? God forbid 🙄
LTX2: No, not RAM -> more than 88 GB VRAM 😂
@arkinson Thanks. I will adjust the settings as noted.
Yes, it was the Chatterbox workflow that I used. I felt VibeVoice was superior to IndexTTS. I didn't spend much time on it, but I felt the sample from the one-shot was pretty accurate. Obviously you need a good few-second sample to use. But there is really nothing stopping us from curating good speech and then passing that input into this workflow.
88 GB VRAM? What the? Jealousy... much!!!
I have seen some low-VRAM GGUF LTX2 workflows. I will see if I can get anywhere with them.
@boinobin730 I am currently trying to get a little bit deeper into TTS Audio Suite. For a beginner this is all pretty complex, but I believe it is going in the right direction. I'm already able to simply add "emotions", which is really cool. I'm just struggling with the audio "tags". The documentation is mostly junk, and finding the right models is always hell... but if I get something useful out of it, I will publish it here for audio testing....
Yeah - I heard too that LTX2 should work with low VRAM, but I did not quickly find anything that works... If you get it running some day, let me know.
@arkinson Sounds very interesting. Every day presents new possibilities and development tools. I will let you know how I go with LTX2.
@boinobin730 New workflow version v1.2 is out. Only small changes:
- Easy option to load recorded audio.
- Video/audio syncing in step 04 should be fixed now, and you can select how many initial frames to skip (open the subgraph). My original calculation was right, but in the "Audio Cut" node I set the length too low. Because you generated very long videos, this ran into issues...
If you like, have a look at the ElevenLabs online audio generator (see the description in my new workflow). It seems to do everything I want so far and is free to use for testing. Just a first quick test for fun 😅
@arkinson Thanks. This looks good. I will try it.
@arkinson So this is really odd. The new workflow produces black again for me. I thought that was strange, fired up the v1.1 workflow, and got black again. It was working a day ago. I wasn't using it yesterday, as I was testing LTX workflows (p.s. they suck at the moment). So I am now trying to figure out what in the dickens happened between my working outputs and now. ComfyUI craziness. Other people also seem to be having issues, so there has got to be a common denominator. I think it might be SageAttention, but I will keep testing and let you know.
@arkinson I worked it out. ComfyUI Easy Install creates two batch files if you install SageAttention. The workflow gets some sort of mismatch when you try to run the SageAttention startup. It must be a global version mismatch with the Sage nodes. I tried disabling the nodes in the workflow, but that didn't fix the problem. It only works if you run the NON-SageAttention startup. I also believe that the NON-SageAttention startup still runs SageAttention, as the workflow has no issues whatsoever. I tested the speeds, and the workflow runs quicker with the nodes initialised vs the nodes turned off. As long as you run the standard ComfyUI startup without SageAttention, it runs fine. This also applies to the new workflow version 1.2. Possibly this is the reason others are having trouble at step 3 as well.
@boinobin730 Just had a look at your latest series - oh my gosh 😂🤣 Which sound engine did you use?
Your ComfyUI-Easy-Install: something seems to be wrong on your side. I'm on the latest 0.9.2 and both start bat files are working. But you are right - you can even use the standard bat file (without SageAttention) and it works too, because we explicitly start the Triton + SageAttention part via the nodes in the workflow. The SageAttention bat file just starts ComfyUI with SageAttention initialized "globally".
@arkinson I won't be surprised if I got a buggy ComfyUI install. Umm, I cheated. I ripped the voices from a game to test how good the lip sync is. But I have fed these same voices into VibeVoice and got very similar outputs. Not the same, but quite similar. I will play with VibeVoice some more and see. IndexTTS is not too bad, and I should spend some more time tinkering with it.
@boinobin730 Hi - thank you. I have tried most audio engines. In principle, they are all very similar. Some allow you to use very simple text tags (e.g., change speaker, simple emotions, etc.). As I understand it, only “Step Audio EditX Engine” works automatically and with more complex tags. Hence my other comment. I posted the error to the developer on GitHub, but have not yet received a response.
By the way, the workflow has recently stopped working for me as well. But I still need to investigate what is wrong here. It seems we all have buggy systems now 🙄
@arkinson No worries. I haven't tested today since I've been busy the whole day. I will try and see if it's still stable. I will also have a go at the audio workflow you posted. Chat soon.
@boinobin730 Don't hurry 🙂
@arkinson The sound workflow is still OK for me. I'm glad you got yours sorted in the end. I see version 4 of your original workflow. Outstanding! Thank you. I will play with it soon. I managed to get a VibeVoice workflow working that operates as a single speaker and works well as an 8-bit model: https://huggingface.co/FabioSarracino/VibeVoice-Large-Q8 It is not perfect, and certain seed numbers are better than others, but it does a fairly good job for a single speaker. Workflow examples are provided and go in the VibeVoice_ComfyUI custom_nodes folder. I just posted an example where I used a VibeVoice artificial voice generated from a 4-second sample of the game voice I used in the purple-hair girl examples. I purposely made her perform an action whilst talking, to see the effect. The first example removed the cup on her second sip action. The second example just made her speak whilst her mouth is covered. Overall I would suggest people avoid any action that would cover the speaker's mouth, as it breaks immersion. The second example was a straight 17-second sound clip. Massive. It took 2 hours 32 minutes to render. I do not suggest long sound clips. Break them up and join them back together like I did in the song example. I'm going to keep testing.
@boinobin730 Hi - thank you so much. Your example is sooo cool 😂 Unfortunately I have not gotten any response on GitHub yet regarding the Step Audio EditX engine issue. But I'll keep at it....
@boinobin730 Btw, I don't know if you already use the custom node pack TTS_Audio_Suite. With it, it is very easy to use any audio engine you like - you can just switch between Chatterbox, VibeVoice, or whatever you want. There are simple example workflows on their GitHub page. I only have the issue with the Step Audio EditX engine. The problem is, all the other engines do not fully support easy tag usage.
@arkinson Oh OK... interesting. Thanks... I am only just getting up to speed with how it all can fit together. I will have a go and see if I can cobble together something usable. Along with your new 4.0 workflow, we are going to be able to generate some very interesting outputs. I didn't know MMAudio is NSFW-friendly. Fun times. LOL. I can then put the video and sound tracks together in DaVinci Resolve. I haven't managed to get your audio workflow going. Theoretically, it looks promising. It would definitely fill a gap in being able to control the dialogue instead of waiting for random generations that don't sound quite right.
@boinobin730 Believe me, 12 hours ago I hardly understood anything myself 🙄 To differentiate more clearly, I have renamed both workflows:
01. "Wan 2.2 Video + Sound" (the old one, currently v4.0): because MMAudio grabs the video and generates the sound.
02. "Wan 2.2 Video + Voice + Motion Control" (currently v1.2): here we do the opposite - the voice generates/manipulates the final video.
I believe 01 is perfect for all kinds of simpler (N)SFW sounds without speech/voice. But it also seems possible, for example, to generate music/singing when an instrument is played in the video....
02 I would primarily focus on spoken language and perhaps music/singing, and always as image/video to video+sound...
The easiest way to combine both worlds, I suspect, would be post video editing, maybe sound mixing too.
The voice worked, but beyond that, just blank windows. No error messages; the process just ended.
@camarcuson194 Stupid question: did you enable option 02 after generating the sound?
@arkinson Not stupid... yes! I activated each of the four, with the same response at each layer. I confirmed the proper model was showing in the selection and toggled on the second pair. The run starts, then immediately ends after the voice step. I know I'm missing something. Also, the model link list doesn't completely match the path placement list. And I think there's a typo in the audio link list?
@camarcuson194 Yes, the right Diffusion Model in the Loader node is fixed in version v1.1, but the link for the Audio Encoder is wrong. Thank you.
Let's go step by step with your issue. Use version v1.1. Upload a start image and enter a prompt. Keep all seeds fixed! Then proceed as described in the workflow:
1. Switch on step 01, switch off steps 02 - 04, and generate an audio preview of about 3 - 5 seconds.
2. Switch on step 02 (steps 01 and 02 are on now, 03 and 04 are off) and press Run.
In the console you should see at least a "got prompt" and of course some error message. If not, there might be something wrong with your ComfyUI installation.
Btw, have you installed Triton + SageAttention? If not, you have to disable the loader nodes in the subgraphs, of course.
@arkinson Well, no success with anything now, even audio. I'd guess it's a Python incompatibility.
@camarcuson194 I was also getting a black screen at step 3. I found that if I ran ComfyUI without the SageAttention command in the startup, it worked. ComfyUI Easy Install has two batch files; try the non-SageAttention version and then try the workflow. ComfyUI in Stability Matrix has a radio-button switch you need to turn off. Let us know how it went for you.
[Contact support: it is actually a known issue that the buzz summary on the model page is not displayed correctly.]
@lllionelll Thank you so much for your generous buzzing 😋🙂 Just a weird question: Civitai notified me that you tipped this model here (video+audio), but it seems you tipped my other model?? 🙄
Is there any way to make a video using audio that I already recorded in Suno?
@paulo Hi - I don't know Suno, but if you have a standard audio file like MP3 or whatever (or can convert to one), it should be no problem: disable step 01 and add an audio loader node to feed the given audio inputs of the next steps with your recorded audio. But keep in mind the length of your audio, because the video length for steps 02 - 04 is calculated from your initial audio. So far I have never tested audio/video much longer than about 10 seconds. And have a look at the conversation with boinobin730 for a workaround for the mismatched audio/video sync in step 04. I will fix this in the next version soon.
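The relationship is roughly this (the fps value and the 4n+1 frame constraint are assumptions based on the Wan model family; double-check against your own nodes):

```python
# Sketch: how many video frames a loaded audio clip implies
# (fps = 16 and the 4n+1 rule are assumptions - verify in your setup).
def frames_from_audio(duration_s: float, fps: float = 16.0) -> int:
    n = int(duration_s * fps)
    return (n // 4) * 4 + 1  # Wan-style models typically expect 4n+1 frames

print(frames_from_audio(5.0))  # 5 s of audio at 16 fps -> 81 frames
```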
@paulo Thanks for buzzing :-)
Steps 1 and 2 work fine for me, but Comfy is crashing on step 3.
I see the following:
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']
Requested to load Wav2Vec2Model
Unloaded partially: 872.48 MB freed, 8275.82 MB remains loaded, 42.22 MB buffer reserved, lowvram patches: 0
loaded completely; 684.81 MB usable, 617.65 MB loaded, full load: True
Unloaded partially: 2082.70 MB freed, 6193.13 MB remains loaded, 42.22 MB buffer reserved, lowvram patches: 0
In the workflow, there's a green box around the Load Diffusion Model node (Wan2_2-S2V-14B_fp8_e4m3fn_scaled_KJ.safetensors).
Any ideas?
Sorry - as mentioned, I am not able to give "ComfyUI support" here. But I would research the audio encoder message first.
I was also getting a black screen at step 3. I found that if I ran ComfyUI without the SageAttention command in the startup, it worked. ComfyUI Easy Install has two batch files; try the non-SageAttention version and then try the workflow. ComfyUI in Stability Matrix has a radio-button switch you need to turn off. Let us know how it went for you.
@boinobin730 It's not that I get a black screen at step 3; it's that Comfy crashes. I tried the newest workflow where you can select an audio file rather than create one, and it seemed to work (I also bypassed the SageAttention stuff). If I went back to creating the audio and going through each step, it crashed again during step 3. This was on a fresh install with everything updated.
The only thing I didn't try was creating the audio file and then switching to loading it manually to see if that worked.
@jonk999 You have not provided much information, so it is not easy to help you. Which OS, which ComfyUI version, which ComfyUI release? Really a completely fresh installation? No other nodes installed than those needed for this workflow? Any node conflicts? ComfyUI currently updates nearly daily - as of today we are on release v0.10.0. Are all custom nodes updated? Does ComfyUI really crash, or do you just get the above error message while ComfyUI only pauses? If it crashes, what about your swap file? Did you search for your error message, as I already mentioned?
And have a look at my last comment here on the model page, because I recently had trouble too and solved it systematically.
@arkinson Apologies. Windows 10. Installed via ComfyUI Easy Install, then loaded the workflow and installed the other nodes. Then I ran a Comfy update. I have 32 GB system RAM and a 32 GB swap file (system managed). Comfy definitely crashes, as I can't do anything further until I close the log window and run the bat file again (without Sage).
I'll do another fresh install, remove all installed nodes other than ComfyUI Manager, load the workflow, install the nodes it requires, and try again. And yes, I searched for the error and any possible solutions before posting here.
@jonk999 Hi - thank you. Did you read my troubleshooting comment here, as mentioned? You should follow exactly the same steps. If you do, you will see your swap file is much too small. Use only my suggested "fixed" values at first, because I ran into errors myself with the "system managed" setting. And yes, definitely update your GPU driver.
OK, if I understand you correctly, ComfyUI outputs your above-mentioned error message and just gets "stuck" loading the diffusion model, right? Please be exact, because we have had a lot of out-of-RAM (not VRAM) "crashes" without any error messages. In those cases the console closes the running task because the server has crashed.
What are the results of your research into the error message? Any useful information? Unspecific? Nothing found? Sorry, but just saying "I did" is really not helpful....
@arkinson Yes, I did see your comment. I haven't tried increasing the swap file size yet. I will when I get time and see how that goes. Worst case, I generate the audio file and then just load it manually, since when I loaded another audio file it worked. So for me it seems to be something with using step 01 to generate the file... which is perhaps a little weird.
I have searched for "unexpected audio encoder: ['lm_head.bias', 'lm_head.weight']" and there was one result containing exactly that, but it was from a Mac user and didn't contain anything useful.
@jonk999 Argh - I didn't save my last comment. OK - sort out your swap file first. I'm pretty sure that is the main problem. And just a hint: do not generate/use audio longer than 3 - 5 seconds at first. Because the longer your audio, the longer the video, and the sooner you will run into RAM and/or VRAM errors.
@arkinson Yep. I will increase the swap file and test when I get a chance. The generated audio I was trying with was only around 2 seconds. The other audio I loaded manually, which appeared to work, was longer.