    Wan 2.2 Video + Voice + Motion Control All-In-One workflow optimized for RTX 3060 12 GB VRAM GPU - v1.2
    NSFW

    [edit 23.01.2026: use the latest version, v2.0 (see the version description).

    Workaround for a small issue with the audio part in v2.0: go to the bottom right of the ‘01 Audio...’ group, simply move the ‘Any Switch’ node from the subgroup ‘01.1.3’ to a free area in the ‘01’ group, and make sure the node is not bypassed.

    I will fix this in the next version.]

    Special thanks to:

    @boinobin730 for a lot of testing, sharing knowledge and pushing this project 🙂

    @SeoulSeeker for sharing his knowledge and giving the first crucial hints.

    Features:

    This workflow uses InfiniteTalk to generate videos of a talking/singing person or object. The resulting video is guided by a start image, an audio source (speech/voice/song), and a control video that steers the general movement. I designed it as an all-in-one workflow: you just need a start image, plus optional audio/video sources.

    - Works perfectly on an RTX 3060 with 12 GB VRAM and 32 GB RAM + a large swap file (min. 64-128 GB).

    - Easy installation (all necessary models linked).

    - Easy to use via switch options.

    - High-quality outputs.

    The workflow includes 4 simple steps:

    1. Audio generation, or loading an existing audio file,

    2. Video generation, or loading an existing video, for DWPose motion control,

    3. InfiniteTalk: generates the final low-quality video output (guided by DWPose and synchronised to the audio),

    4. Upscaling and frame-rate multiplying for smooth high-quality outputs.

    Videos of around 5 seconds work well. Longer videos (around 10 seconds) are possible, but you might quickly run into known video issues like looping movements, OOM errors, etc. The sketch below shows the basic frame arithmetic.
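    For reference, clip length maps to frame count linearly. A minimal sketch of the arithmetic; the 25 fps base rate and the 81-frame generation window are illustrative assumptions, not values read from this workflow:

        import math

        def plan_clip(duration_s: float, fps: int = 25, window: int = 81) -> tuple[int, int]:
            """Return (total_frames, generation_windows) for a clip.

            fps and window are illustrative defaults - check your own
            sampler/InfiniteTalk settings for the real values.
            """
            frames = math.ceil(duration_s * fps)
            windows = math.ceil(frames / window)
            return frames, windows

        print(plan_clip(5))   # (125, 2) - comfortable
        print(plan_clip(10))  # (250, 4) - more windows, more risk of loops/OOM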

    This workflow is quite advanced now - I would say in an early beta state. Everything should work technically, so I believe it is a good basis for more advanced tests and hopefully some fun 🙂

    My intention is to integrate the Step Audio EditX engine for easy-to-use, advanced audio control via tags as soon as possible, but there are currently some issues with the corresponding nodes.

    A next step might be the integration of camera control.

    Attention:

    This workflow is intended for advanced ComfyUI users. Even though installation and usage should be simple, this workflow is really a basis for testing and development, and you might need some ComfyUI knowledge to use it. Please understand that I will not provide basic installation and ComfyUI support here.

    If you are a beginner with video generation and more complex workflows, I would recommend my other workflow for video generation. That one has been well tested and is already much better documented and commented.

    About the basics:

    This workflow is based on official templates and various already-published workflows. I just put the different parts together, created a hopefully easy-to-use “design”, and optimized everything for 12 GB VRAM.

    Description

    • Switch option added: you can now choose between simple audio generation or uploading an existing audio file,

    • audio/video syncing fixed in step 04 by cutting off the first video frames (see the sketch after this list),

    • extended and corrected documentation.
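    For reference, the step 04 fix boils down to dropping the leading video frames that have no matching audio and trimming the tail to the audio length. A minimal sketch of the idea, assuming the usual ComfyUI image-batch layout (the actual node may differ):

        import torch

        def sync_video_to_audio(frames: torch.Tensor, cut_first: int,
                                fps: float, audio_duration_s: float) -> torch.Tensor:
            """Drop the first `cut_first` frames, then trim the tail so the
            clip is no longer than the audio. frames: (N, H, W, C) batch."""
            frames = frames[cut_first:]
            max_frames = int(audio_duration_s * fps)
            return frames[:max_frames]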


    Comments (69)

    arkinson (Author) · Jan 19, 2026

    @boinobin730 and all others here. I'm pretty sure I found the right tools to generate the most flexible local audio that can be easily controlled by "tags". It needs just the "Step Audio EditX Engine" and "TTS Text" nodes from the "tts_audio_suite". Unfortunately, I can't run it on my end because I get an error message (I will open a GitHub issue about this). If you have the time and are interested, please take a look at my “Audio test workflow” published here. This is an excerpt from the official GitHub workflow. Models are downloaded automatically; you just have to install the "tts_audio_suite" via the manager. I tested it on 2 different ComfyUI systems, but had no luck getting it running 😕

    boinobin730 · Jan 19, 2026

    Thanks Arkinson. I will have a look in the next few days and see if I can help.

    arkinson (Author) · Jan 21, 2026

    @boinobin730 Good news so far: the creator of the tts_audio_suite reacted very fast to my issue on GitHub and published a fix. Unfortunately, this currently leads to a new error... So please be patient.

    arkinson (Author) · Jan 23, 2026

    @boinobin730 New workflow out here - version 2.0. I added text-2-song for audio generation. I think this is a really fun option and it can generate high-quality songs out of the box 🙂 I also added an option to load existing videos for motion generation.

    Unfortunately, the Step Audio EditX engine still has issues. I hope the creator will fix it. Testing and integrating would then be the next step...

    boinobin730 · Jan 24, 2026

    @arkinson Wow. Thank you. This is awesome. I can't wait to try it. I noticed previously that if my dialogue in VibeVoice read like a song, it had a really difficult time processing it. Doing it this way might make it sound a lot better. Great stuff!!

    boinobin730 · Jan 24, 2026

    @arkinson I upgraded to ComfyUI .10 yesterday. It was a lot easier than I expected. I want to run Flux 2 Klein. It looks impressive.

    arkinson (Author) · Jan 24, 2026

    @boinobin730 Flux-2-what-the-f**k???? 🤣 Oh my, this AI stuff moves much too fast. Thank you so much for the hint. I just had a short look. Lots of confusing new stuff at first and lots of new models: base/distilled, quantized, 9B, 4B, CLIPs and VAEs 🙄 and I never really got the basics 😅 Do you know which is "better" to use - base or distilled?? Will Flux 2 work with the "old" Flux 1 LoRAs??

    boinobin730 · Jan 24, 2026

    @arkinson I actually don't know. It took forever to load all the nodes from a workflow. I haven't checked it out this morning (for me). I need to test it and see. Development is so fast. I think Klein is the cut-down Flux 2 model, the full one being too big for us to be really usable. I never got into Flux except for inpainting and fixing hands. Flux excels at hand fixing.

    boinobin730 · Jan 24, 2026

    @arkinson I am trying to run your new workflow, but I am not sure where I should be getting the files. Do I get the ACE-Step file from here? https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B I can't see the safetensors file.

    arkinson (Author) · Jan 24, 2026

    @boinobin730 Simply click on the model link in the workflow (all models linked, see top left corner) 😉

    arkinson (Author) · Jan 24, 2026

    @boinobin730 Flux 2 Klein: my very, very quick first test with the 4B models (template t2i workflow):

    - Quality: can't see much difference from Flux 1 dev,

    - Prompting: not much tested yet,

    - runs without issues on 12 GB VRAM,

    - base model (20 steps): takes long, seems not to generate NSFW,

    - distilled model: runs very fast (4 steps only), seems to work with NSFW prompts,

    - Flux 1 LoRAs not working ☹️🥵 Oh my - this is really a no-go.

    boinobin730 · Jan 24, 2026

    @arkinson you are so quick!!! I am still generating on your workflow atm. I am interested in what Klein can do with consistent characters, inpaints, face swaps, etc. The fun never ends!!!!

    arkinson (Author) · Jan 25, 2026

    @boinobin730 Hi, I did some more serious side-by-side tests between Flux 1 dev and Flux 2 Klein. You can use the distilled flux-2-klein-9b-Q4_K_M.gguf model with 12 GB VRAM (Q8 works too, but is completely at the limit if you add a LoRA).

    Speed is really awesome - the longest part is upscaling. Image quality out of the box is not as good as Flux 1 dev, but much better than Z-Image. The Klein model mostly generates the same image if you do not change the prompt (like Z-Image does). To generate different "characters" you have to do detailed prompting. Prompt following seems to be better than Flux 1 dev in special situations. It is very interesting: some use cases work much better, but others simply don't work. I believe it will be great for NSFW too - it just needs completely new LoRAs... Generating sweaty skin works out of the box, for example 😅

    If you are interested in a simple t2i test workflow with upscaling, just give me a hint.

    I'm currently looking for a suitable LoRA trainer (hopefully one that will run with 12 GB VRAM), but information is still very scarce...

    boinobin730 · Jan 26, 2026

    @arkinson Interesting findings. Thanks for your rundown. I have been playing with it, using other people's workflows. I used a really simple one and the results are very promising. I generated a simple gym scene and then placed the character from image 1 into the image 2 gym, and the results were quite good without LoRAs. It even kept body dimensions relatively the same. I tried another workflow to generate a multishot of one character (front, back, side) and it had very mixed results; the Qwen results were much better. Possibly I didn't have enough steps. You are right, it is so quick. Great for just general images. It is still early days but definitely promising. I have also been trying to get some LTX2 workflows to work. Early days as well, but looking a bit better. I need to play with it more. It also seems to be a work in progress for the developer. https://civitai.com/models/2304098/ltx-2-19b-gguf-12gb-comfyui-workflows-5-total-t2vi2vv2via2vta2v

    arkinson (Author) · Jan 26, 2026

    @boinobin730 Thank you for the link. The video examples look brilliant and the audio is perfect. You've already convinced me - I will give it a new try 😉

    boinobin730 · Jan 26, 2026

    @arkinson Let me know how you go. I was trying to get the LTX2 image-with-audio to work (similar to your sound/audio clip workflow), but I got a static image with the sound clip on top of it. The general I2V works OK; I need to keep testing. It looks very complicated, and I think the developer is hopefully going to write some documentation.

    arkinson (Author) · Jan 26, 2026

    @boinobin730 Yeah - I did some first tests. I have to disable the preview node in all workflows, even though I did a fresh installation of the desired custom nodes as mentioned in the comments. But this is only a small issue.

    With t+a2v I got the same outputs as you (static image + sound), especially in portrait format, and the video quality was not so good. But I did not test much. LTX2 seems to need different prompting than Wan... There is one comment with the same issue using portrait, but the suggestion to simply use a special aspect ratio did not work in my case.

    Next I tested v2v, and this is absolutely crazy and worked out of the box - look here. This stuff is too cool 🤣 just extending an existing video with sound and motion - and the quality is really good without any upscaling and multiplying so far. I believe this has a lot of potential...

    I will have a look at i+a2v now. Did you try landscape or portrait? Maybe better prompting would help.

    If we get i+a2v running too, I will see if we can speed up generation a little bit with lower frame rates and final upscaling... Lots to do 🙂

    boinobin730 · Jan 26, 2026

    @arkinson Thanks for checking it out, Arkinson. I'm glad I'm not the only one getting static images. I tried changing a few settings and changing the prompts, and it did seem to get a better response, but I can't fathom exactly what it prefers. That is a super smooth continuation. I haven't tried this yet, but it is amazing what it can do.

    arkinson (Author) · Jan 26, 2026

    @boinobin730 OK, same issue here with i+a2v so far, even with landscape. A hint from the comments to use a "camera" guiding LoRA helps a little bit and at least generates lip syncing, but the general movement is really odd. Will try to dig deeper...

    boinobin730 · Jan 26, 2026

    @arkinson I think it is an audio issue. Basic I2V is fine for video, but the sound was muted. Previously an I2V example gave me sound. It is way too advanced for me; I don't even understand how it all works. I think I will try some other people's LTX2 workflows to get a better understanding of how it's put together.

    boinobin730 · Jan 26, 2026

    @arkinson Someone helped me. A camera control LoRA needs to be added. I tried this one and it worked like a charm: https://huggingface.co/Lightricks/LTX-2-19b-LoRA-Camera-Control-Static Funnily enough it's not on Civitai, but it does work. The other person used the dolly-in. This helps a lot, as we now have a baseline to work from.

    boinobin730 · Jan 27, 2026

    @arkinson Me again. The workflow author said this LoRA is good as well: https://huggingface.co/MachineDelusions/LTX-2_Image2Video_Adapter_LoRa/tree/main It also seems to do a good job. I dropped all strengths down to 0.5; that also included the ltx-2-19b-distilled-lora_resized_dynamic_fro09_avg_rank_175_fp8.safetensors and ltx-2-19b-ic-lora-detailer.safetensors LoRAs. I don't get the last image frame output, though - it's a broken image icon. Do you get the last frame?

    boinobin730 · Jan 27, 2026

    @arkinson I added a Video Helper Suite select-image node to get the last frame at the end. There are also new LoRAs coming out for Flux 2 Klein. I'm excited about this consistency LoRA https://civitai.com/models/1939453?modelVersionId=2634354 as even out of the box Flux 2 Klein is somewhat accurate with body and face consistency.

    arkinson (Author) · Jan 27, 2026

    @boinobin730 Oh my - 5 minutes away from the desk and the world moves on 😂 I have some reading to do now... Thank you so much.

    arkinson (Author) · Jan 27, 2026

    @boinobin730 I ran the tests with the additional ltx-2-19b-lora-camera-control-dolly-in.safetensors LoRA yesterday. As mentioned, the output was not satisfying.

    I2V workflow: I used your linked ltx-2-19b-lora-camera-control-static.safetensors LoRA now. With the distilled model ltx-2-19b-distilled_Q4_K_M.gguf I get blurry/pixelated outputs only. Using the standard model I get roughly the same results as with the dolly-in LoRA: some slow motion + completely unsynchronised lip movements and pretty poor video quality. No comparison to the V2V outputs.

    Could you give me a hint which model + which bundle of LoRAs you finally used?

    arkinson (Author) · Jan 27, 2026

    @boinobin730 Uhh - just found some speculations and possible workarounds for the slow/no-motion issue here.

    boinobin730 · Jan 27, 2026

    @arkinson All models are the same; I think the most important thing is to use the dev GGUF model, ltx-2-19b-dev_Q4_K_M.gguf. I have 3 LoRAs daisy-chained off each other in the LoRA group, all at a strength of 0.5, in this order: ltx-2-19b-distilled-lora_resized_dynamic_fro09_avg_rank_175_fp8.safetensors, then ltx-2-19b-ic-lora-detailer.safetensors, then LTX-2-Image2Vid-Adapter.safetensors. I did a 25-second clip; it took 30 minutes. I am going to post it to the examples.
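    (For context: a LoRA's strength scales the low-rank update added onto each affected weight, so chaining three LoRAs at 0.5 just sums three half-strength updates. A generic sketch of that rule, not LTX-specific code:)

        import torch

        def merge_loras(W: torch.Tensor,
                        loras: list[tuple[torch.Tensor, torch.Tensor, float]]) -> torch.Tensor:
            """Apply W' = W + sum(strength_i * B_i @ A_i).

            Each entry is (A, B, strength) with A: (rank, in_features)
            and B: (out_features, rank), so B @ A matches W's shape."""
            W = W.clone()
            for A, B, strength in loras:
                W += strength * (B @ A)
            return W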

    arkinson (Author) · Jan 27, 2026

    @boinobin730 OK, I tried the same model + LoRAs (except the detailer LoRA) with no luck. And you used the I2V workflow? Yes, it would be helpful to see your example.

    I am just checking/testing another low-VRAM workflow I found. Let's see...

    boinobin730 · Jan 27, 2026

    @arkinson I don't know if you can download my workflow here. Try it: https://limewire.com/d/tvAOj#SJWftpKge4

    boinobin730 · Jan 27, 2026

    @arkinson I posted my example, but there are always safety-check delays. It will probably be there soon.

    arkinson (Author) · Jan 27, 2026

    @boinobin730 Thank you. I just saw your sample video and loaded your JSON. OK, we were talking about different workflows 🙄 You are on the i+a2v workflow. I did a few short tests yesterday. This workflow generally works (lip-sync) and you get a few very simple motions. Yes, it can be improved a little with the additional LoRA like you already did. I'm quite impressed that you got a long clip generated in one run.

    I, and obviously others too, have trouble with the i2v workflow, which generates the speech by itself like the v2v workflow does. I'm really impressed by the capabilities of v2v and still hope to get it running with i2v too.

    Btw, did you get the preview running with taeltx_2.safetensors??

    boinobin730 · Jan 27, 2026

    @arkinson No worries. General i2v was no problem for me; it always seemed to work out of the box. Yes, taeltx_2 worked for me, but it was being referenced in the Easy-Install ComfyUI folders, not in my models folder. Even though I fixed the YAML, it just wouldn't pick it up. I just copied the folder to the Easy-Install folder and it worked; I couldn't be bothered trying to make it perfectly referenced. I was surprised it went to 25 sec. I ran it before I went to sleep, expecting an OOM; it did it in 30 minutes. I will try longer and see how far it goes before either being too long or hitting an OOM.

    arkinson (Author) · Jan 27, 2026

    @boinobin730 I2V: oh my, I finally got it running with your linked LTX-2-Image2Vid-Adapter.safetensors LoRA (without the LoRA = static video). I don't know what I did wrong before, because I tested exactly this several times. And yes, the preview works now too - I had accidentally saved the file in the VAE folder. I believe I need much more sleep 😅 Thank you for your help.

    I2V is really cool. I have to test much more. Using 16 fps for faster generation did not work. But I did some minor "tweaks": sliders, automatic frame and aspect ratio calculation, final upscaling and frame-rate multiplying... I'm not sure yet, but it could be interesting to combine the different workflows into one.
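    (For reference, the frame/aspect calculation can be sketched in a few lines. The divisible-by-32 dimensions and the 8·k + 1 frame rule are assumed conventions for LTX-style models, not values read from this workflow:)

        def snap_video_params(width: int, height: int, duration_s: float,
                              fps: int = 25, div: int = 32, frame_step: int = 8):
            """Round dimensions to the nearest multiple of `div` (keeping the
            aspect ratio roughly intact) and snap the frame count to the
            form frame_step * k + 1."""
            w = max(div, round(width / div) * div)
            h = max(div, round(height / div) * div)
            raw = max(1, round(duration_s * fps))
            frames = ((raw - 1) // frame_step) * frame_step + 1
            return w, h, frames

        print(snap_video_params(1283, 862, 5))  # (1280, 864, 121)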

    And back to the actual topic: the creator of the tts_audio_suite is very helpful and is trying to solve the issue.

    boinobin730 · Jan 28, 2026

    @arkinson Excellent. I'm glad you got it to work. I can't believe how rapid the development is. Every day there is some new model release, new workflow, new LoRA, new efficiency node. Damn!! It's like Christmas never ends. I'm glad you are getting a response with that audio package. There was a new Qwen TTS released a day or so ago. I couldn't load it, but people reckon the voice lacked emotion. Z-Image base looks like it was released as well. I've only played with it a little.

    I look forward to your version of the LTXV2 workflow. You seem to be able to get the maximum juice out of the model for us 12 GB people.

    arkinson (Author) · Jan 28, 2026

    @boinobin730 You are right - five months ago we didn't know "what a video is", and now we have the dolls talking, dancing, singing, laughing and jumping 🤣🙂 I'm afraid my old Wan workflow is now just a pile of rubbish 😥

    I did a 20-second test with I2V too and it ran like a charm (it did not use more VRAM than a short clip). Upscaling and multiplying took most of the time. I did not plot the generation time, but curiously it took not much longer than an 8-second video. I have seen this behaviour of untraceable generation times several times.

    I will do some tests with much higher start resolutions; maybe the final upscaling is not necessary.

    Btw, speech control with I2V is absolutely crazy. I tested a little bit with voice control, and even the timing is incredible: you can let the doll speak something while moving, then let her make a weird face, and then let her talk some completely different nonsense 🤣 I love this stuff 🙂

    boinobin730 · Jan 28, 2026

    @arkinson Your workflows are not rubbish now. They are still useful, especially since Wan has a lot of NSFW LoRAs behind it. It will take a while for LTX2 to catch up. LTX2 has a lot of potential; as long as it's easy to make LoRAs, it will remain a strong contender in the future. But there is always new competition. I just wonder if it will ever slow down, or if it gets to the stage where we can literally dictate and control exactly what we want to see.

    arkinson (Author) · Jan 28, 2026

    @boinobin730 Oh, come on, Wan is out now 🙃

    I did some tests with high-resolution generation. It's unbelievable: 1280 x 864 runs fast as hell and without any OOM issues, so upscaling is completely unnecessary. A test at 1024 x 704 and 30 seconds length took about 40 minutes. The longest part was the VAE decode. The tiled decoder is very slow; it seems to move a lot of GB to the swap file. I will test whether the simple decoder can handle it without OOMs.
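    (For context: a tiled decoder trades memory for speed by decoding the latent in overlapping spatial tiles instead of one pass, which is why it can swap so heavily on long clips. A generic sketch of the idea - the 8x spatial scale and the decode call are placeholders, not the actual ComfyUI node, and real implementations blend the overlaps instead of overwriting them:)

        import torch

        @torch.no_grad()
        def tiled_decode(latent: torch.Tensor, decode_fn,
                         tile: int = 64, overlap: int = 8) -> torch.Tensor:
            """Decode a (B, C, H, W) latent tile by tile to cap peak VRAM."""
            B, _, H, W = latent.shape
            scale = 8                      # assumed spatial upscale of the VAE
            out = None
            step = tile - overlap
            for y in range(0, H, step):
                for x in range(0, W, step):
                    dec = decode_fn(latent[:, :, y:y + tile, x:x + tile])
                    if out is None:        # allocate once channel count is known
                        out = torch.zeros(B, dec.shape[1], H * scale, W * scale)
                    out[:, :, y * scale:y * scale + dec.shape[2],
                              x * scale:x * scale + dec.shape[3]] = dec
            return out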

    I have already combined T2V and I2V in one workflow, and handling looks easy so far. I think it makes sense to put all the main parts together. And it's great that everything works with the same models.

    Did you find anything on LTX2 and first/last-frame-to-video?

    boinobin730 · Jan 28, 2026

    @arkinson I was trying to get frames to line up. There seems to be a delay in voice matching, especially if you cut off mid-sentence. Now that we can go to 20+ seconds of input, I don't think it will be a problem, as I just need to cut at natural language pauses. I haven't tested extensively lately; I am going on holiday soon and preparing. I won't be able to use ComfyUI, let alone a computer, so I will be gone for 2 weeks from Monday. I wonder what will happen in the AI field while I'm gone?

    arkinson (Author) · Jan 29, 2026

    @boinobin730 What the hell - holidays without a computer?? 🤣 Sounds like survival training 🙄 The last time I was without a computer was in Newfoundland, trying to get a ride to Goose Bay, Labrador 😂

    I wish you nice holidays and maybe time to read a good book instead 🙂

    boinobin730 · Jan 29, 2026

    @arkinson Ha ha, not as bad as that, but yeah, it's the things we do to keep the wife happy. I'm going to New Zealand to check out the North Island. Lots of driving, eating, sightseeing, very relaxed, just no computer. I looked up Goose Bay, Labrador - wow, that looks remote. I have never been to Canada. Thanks, I will talk soon.

    arkinson (Author) · Jan 31, 2026

    @boinobin730 Going to New Zealand just to keep the wife happy - I love it 🤣 I've never been there, but it sounds exciting. And as the Canadians say: "Take care boy!" 🙂

    arkinson (Author) · Jan 31, 2026

    @boinobin730 New LTX-2 all-in-one workflow published now 🙂

    boinobin730 · Feb 12, 2026

    @arkinson I just got back. We had a marvelous time in NZ. I think if I could, I'd swap living in Australia for living in New Zealand at times. But they are both similar countries. I can see you are doing great work again. I knew you would make an Arkinson-special tweaked LTX2. I haven't tried it out yet, but I will over the weekend.

    arkinson (Author) · Feb 13, 2026

    @boinobin730 Hey mate - you survived the wilderness??? 😂 I wasn't expecting you before the weekend. If you think afterwards that you could actually live in another country, then it sounds like a damn good trip 👍 I know that feeling, even though I've never sailed across the equator 🙂

    boinobin730 · Feb 13, 2026

    @arkinson It was, thank you. One day you may find yourself travelling far. New Zealand and Australia are the perfect travel destinations - so much scenery, places, buildings, culture. I took lots of shots of the countryside on my potato phone camera. I am now in the process of using Qwen Image with those photos and having my usual fun. I am then going to put the images through your new LTX2 workflow. I will post results and give you any feedback that I might have.

    boinobin730 · Feb 13, 2026

    @arkinson So how's life on your side of the equator? Are you working on any new AI projects?

    arkinson (Author) · Feb 13, 2026

    @boinobin730 Good idea to combine your travel pictures with AI. And yes, let's see some of your results 🙂

    Meanwhile, in the northern hemisphere?? Don't ask - mostly bad news, "as usual" over the last months and years... Yes, unfortunately 😣

    OK, something funny: after struggling for several weeks on GitHub with the creator of the tts_audio_suite to get Audio EditX running, I needed just half an hour to figure out that it consumes an extreme amount of VRAM and takes longer to generate three simple sentences than a 10-second LTX-2 video + audio 🙄 So I buried this idea. On the other hand, LTX-2 is very good at generating speech with cool/funny pronunciation, even in any "exotic" language you might think of. I did a short test for fun here.

    As planned, I will add the text + audio 2 video part to the LTX-2 workflow soon. This should be no problem. And maybe more interesting: somewhere I found a first/last-frame-2-video workflow for LTX-2. I have not tested it yet, but we shall 😉

    Text 2 song: I found a much better workflow than the one published in the Wan2 workflow (faster, and much better audio quality with a few steps). I love "composing" my "own" songs and music and using them in the video parts. I believe I will "outsource" the audio part from the Wan workflow into a separate workflow, so it is usable with LTX and Wan in one place...

    boinobin730 · Feb 13, 2026

    @arkinson Sorry to hear about that; I hope things get better. Your new endeavours sound interesting. First-frame/last-frame on LTX would be useful; I look forward to seeing what you produce. Over the last week, developers have released a new ACE-Step for song generation: https://huggingface.co/ACE-Step/Ace-Step1.5 Apparently it is getting better over time, and someone did a merge of the turbo model and the base model: https://huggingface.co/Aryanne/acestep-v15-test-merges/blob/main/acestep_v1.5_merge_sft_turbo_ta_0.5.safetensors I haven't tested it yet, but it's on my AI to-do list. The big corps went for the jugular on music generation, killing Udio, and now Suno is also seeing big changes. So the only real solution is to generate locally. It can only get better.

    boinobin730 · Feb 13, 2026

    @arkinson Check out this hyperlink for some of my doctored travel pictures: https://civitai.com/posts/26581432 I am not sure if the link works for you, as my pictures aren't coming up in the gallery. I used this Qwen workflow https://civitai.com/models/2386462?modelVersionId=2683492 and the consistency LoRA. I honestly don't even need my Pony LoRA to generate simple posed images.

    arkinson (Author) · Feb 13, 2026

    @boinobin730 Yes, I use the ACE-Step 1.5 turbo model in the new workflow. Song/music generation works very well. I'm still trying to generate speech, or at least a cappella singing.

    arkinson (Author) · Feb 13, 2026

    @boinobin730 Ahh, thank you for the Qwen link. Will try this soon. Did you ever find any useful t2v workflow for Qwen and 12 GB VRAM? Or could your linked workflow be modified?

    Just saw your pictures. I feel like I know that lady 🤣 It's really hard to tell whether it's a photo or AI.

    boinobin730 · Feb 14, 2026

    @arkinson I honestly haven't tried T2V on Qwen, as I am fixated atm on consistency. The tools have now gotten so good that you could literally take 1 photo and get a consistent SFW model pose from just that 1 photo. No LoRA necessary. The workflow uses this massive AIO checkpoint that surprisingly works really well on 12 GB VRAM, just a bit slow to get an output. But since I am then using that image as a source image for LTXV2 video, it's worth it.

    NSFW, I think, is more challenging, as you won't get consistent nipples, for example, when she disrobes. Hence the need for a character LoRA. Yes, she gets around. I now just need to find a distinct, unique voice for her and then she is complete. I will create a back story for her, flesh out her personality, and have her as a main character. As for the photos, they were real-life photos without anyone in them. It's actually easier to slip a character into an empty shot than it is to prompt that Qwen workflow to, say, remove the person in image 2 and place the person from image 1. That last photo, where she is sitting at a table about to eat the Wendy's burger, was actually me sitting at the table. I couldn't replace myself in the workflow, so I just said "remove the person in the photo" and I was gone. I then used that empty shot and prompted that the person in image 1 is sitting behind the table in image 2. Voila - it worked. Although it reduced her boobs a bit. Anyway, it's all fun and games at the moment. I then had her speaking a few lines and eating the burger with your LTX workflow. It worked well for 2 tries and then it slowed down, so I have to work out what went wrong. The workflow is very solid. Yes, a last-frame image would be so useful.

    boinobin730 · Feb 14, 2026

    @arkinson OK, so I spent the whole day enjoying the fruits of your workflow, specifically I2V. A few things to keep in mind: if you prompt "girl", you will get a high-pitched, almost kid-like voice (see the example with my girl with the hat on, pointing to the sign above her). She even acts a lot more kid-like when you write "girl". As soon as I refer to her as a young woman, her gait changes, her demeanor changes, and obviously the voice changes.

    LTXV2 has problems with characters turning. I literally had to retry multiple times to get her to turn without body-horror morphing. I feel that if you give the generation more time, it can turn her body a lot more easily.

    Natural speech has pauses between statements, so I just put "...." between sentences to give it some speech pauses. The last generation was a doozy: LTXV2 does not naturally know what a penis is. I needed an LTXV2 penis LoRA. Again, I had massive problems with her turning around and then turning back to the camera. I think there is an LTXV2 prompt structure that I should be respecting so that it is easier for the model to accurately reflect the user's wishes. Anyway, it is a great workflow as always. Thank you for making it. I will play with the music version tomorrow. Talk soon.

    arkinson (Author) · Feb 15, 2026

    @boinobin730 Hi - yesterday I woke up and "half" of my images/videos had been banned by the bots. Uhh - sometimes Civitai is really frustrating... 🙄

    I haven't found the time yet to look into the Qwen workflow, but I'll definitely take a closer look at it. The possibility of getting a character/style from a single image sounds good. It would be very interesting to have that with Flux 1. "Long" ago I tried something like IP-Adapter (can't remember exactly) with Flux, but in the end it did not work...

    LTX-2: Over the last days I tried out as many different concepts/ideas as possible. And yes, it's probably the same as always: some ideas work out of the box with brilliant or really unexpected results (especially with t2v), but others don't work at all, no matter what you try. Animating my graphics, for example, seems incredibly difficult. On the other hand, someone on Reddit used my workflow to make his own dog talk some funny stuff via v2v 😂

    Thank you for your hint with "..." to get a pause in speech. I tested "-", which didn't work. And yes, I have also noticed that LTX-2 is sometimes very sensitive to prompting. T2V and "...he speaks in plain English..." came across as really funny.

    I am currently trying to organise my ComfyUI. I just installed Comfyui-Lora-Manager and Prompt-Manager. I hope they work and aren't too buggy in the end. What I am also really missing is a useful workflow manager...

    boinobin730 · Feb 15, 2026

    @arkinson Yes, I heard about the banning of certain images on Civitai. Is it because there is no prompt?

    LTXV2, I think, is still a bit early in LoRA and workflow development. Wan has Wan Animate, which allows you to mimic the actions of a character and replace them with a character of your own choosing - great for TikTok-dance-type stuff. I have looked, but I cannot find this for LTXV2 yet.

    It's great that people use your tools and create some funny stuff. Do you have a link to the Reddit talking dog? I haven't tried V2V yet. Do you think it's possible to get an I2V output from LTX2 and then feed that in as input for a V2V? I wonder if you get some crazy-town-type stuff happening. I need to test more.

    arkinson (Author) · Feb 15, 2026

    @boinobin730 I have had this banning several times. Some of the pictured women might have looked too young. In other cases it seems to be the prompt: one wrong word and you are out. Btw, that's the reason I mostly use fake prompts instead. But in most cases it is not understandable.

    V2V: I found the Reddit video again: here. V2V works very well. One of my quick tests was this one. The start video was Wan22, then V2V. It doesn't matter where the start video comes from, as long as the quality is good - and the resolution should not be under the resize resolution, of course.

    boinobin730 · Feb 15, 2026

    @arkinson Ohh, I remember seeing this video when I was on holiday. It is very funny; I had no idea it was from your workflow. Good job. Crazy how good we can get on limited local resources. Your kiss video is great. I wonder how long we can keep generating from small snippets of video, or if the video fidelity starts to break down over time?

    I haven't had any banning yet; I guess women with big chests aren't considered children. If I did images of women with flat chests, perhaps the ban hammer might be enforced. I know they haven't got an AI bot for sound yet, as my swearing-girl Wan video is still up.

    arkinson (Author) · Feb 16, 2026

    @boinobin730 The probability of your stuff being banned increases exponentially with the amount of published material 😅

    V2V: If you have a look at the "kissing" video, you can clearly see that the faces quickly get distorted after heavy movements.

    Completely off-topic: I got deeper into ComfyUI Lora-Manager. This is one of the most professional tools for ComfyUI I have ever found. It organizes and automates everything around LoRAs and checkpoints, up to completely automatic metadata generation for Civitai. If you have more than "two" LoRAs, this is definitely a solution 🙂

    boinobin730 · Feb 16, 2026

    @arkinson I have been playing around with LTXV2 more, and I can see what you mean about the distortion, especially with fast movement such as a dance. I generated a few I2V videos and prompted "awkward dance", and the last frame of the face is quite terrible. I will add it to the gallery for you to see. I then tried to clean it up by putting the last frame into a Qwen workflow so that I could use it again for another go, but it's not really a satisfactory output. I'm pretty sure it's because of the fast action.

    Throw me a link to the LoRA manager? I might be using it already; I have a LoRA manager, but I am not sure if I am getting the full potential out of it.

    I was wondering if I could show you a workflow and ask your opinion on how they got it to work. It is a Wan Animate workflow that can go over a minute. It's good, but I am not sure how they managed to get it to work for so long, and the output is fair to average. If you have time to look - it does work on 12 GB VRAM: https://civitai.com/models/2018097/wan-animate-v2-unlimited-duration

    boinobin730 · Feb 16, 2026

    @arkinson My videos won't go to the gallery - must be because of copyrighted songs. https://civitai.com/posts/26655378 Is this the maximum output for movement we can expect on our 12 GB VRAM cards? I am going to try some movement-type LoRAs and see if it improves. I think it's because of the speed of movement of the character, especially the hands. It's too fast.

    boinobin730 · Feb 16, 2026

    @arkinson Yeah, further to my own understanding, LTXV2 still has limitations even if you have a large, VRAM-heavy card. This was a Reddit post I didn't read fully. The output is fair even though it's from an RTX 6000 with 96 GB. https://www.reddit.com/r/StableDiffusion/comments/1q9cy02/ltx2_i2v_quality_is_much_better_at_higher/

    I think I just have to be patient, not get too greedy, and curb my over-enthusiastic expectations.

    arkinson (Author) · Feb 17, 2026

    @boinobin730 Just use ComfyUI Manager and search for "comfyui-lora-manager".

    I2V: I can open your video, but it is without sound. At first view it doesn't look too bad, but you are right - hands and face get out of control 🙄

    We had a similar discussion about sound issues on the workflow page. I believe motion issues for i2v or v2v actually depend mostly on the start image/video, the prompting, the video length, and whether you try to force the model in a certain direction. Testing camera control LoRAs might help, but I'm afraid that if you find a solution for one use case/concept, it will not work for another, and vice versa...

    Uhh - I just saw we are posting in parallel. Just a few words on the Wan workflow: I had a short look at the YouTube video. Unfortunately, it is not understandable what they are doing there. And Wan is out 😆😂

    arkinson (Author) · Feb 17, 2026

    @boinobin730 I have just read your linked article. Very interesting. And yes, of course, we are completely at the low-end limit. But I must say I am quite happy with the quality we get out, and hey - two weeks ago it was unbelievable to generate something useful longer than 8 seconds 🙄 The hint about the landscape format might be important. I have not tried it yet.

    And yes, "Follow LTX-2 prompting guidelines closely" often seems to do the magic. But I struggle a lot with it myself...

    arkinson (Author) · Feb 17, 2026

    @boinobin730 Btw, using T2V seems to be the easiest part, and I am mostly pretty happy with the outputs - even if they do not follow the prompt 😂

    boinobin730 · Feb 17, 2026

    @arkinson All good. I like looking at all the new stuff and the new models and workflows. I was trying to run the video output through an upscaler, but at the moment it's garbage in, garbage out. I haven't seen you post at these hours of the day/night; it must be early for you.

    boinobin730 · Feb 18, 2026

    @arkinson So, I put aside my quest for better I2V generations and just installed this: https://civitai.com/models/2400306/ltx-2-easy-prompt-by-lora-daddy You might like it. Basically, it pimps your prompts, so if you are bad at prompt writing, it will breathe a bit of life into the prompt and supposedly give you a better output. It's definitely interesting and provides a lot of entertainment in terms of what sort of generation you can get from the LTXV2 model. I stuck it into your workflow and it works really well. I will post some examples in the gallery for reference. I just tested it more: it is giving me more variation in the output, mostly for the better.

    arkinson (Author) · Feb 18, 2026

    @boinobin730 I don't know if you saw my new post on the LTX-2 model page. Please let us talk about LTX-2 there (and anything else, of course).

    arkinson (Author) · Jan 20, 2026 (pinned)

    Troubleshooting ComfyUI issues. @boinobin730 and all others here.

    Yesterday, my ComfyUI Easy Install suddenly stopped working too (various error messages, including “swap file too small”, program crashes, etc.). Unfortunately, I can't remember if/what I had changed before. But I got the system and the video+audio workflow up and running again with the following steps:

    1. GPU driver update.

    2. Comfyui-Easy-Install update via bat file.

    3. Update all custom nodes via manager.

    4. Consistently uninstall all custom nodes that cause conflicts.

    5. Manually set the Windows swap file on a fast SSD: min = 64000 MB, max = 128000 MB.

    6. Start comfyui via run_nvidia_gpu.bat and not via run_nvidia_gpu_SageAttention.bat (as described by boinobin730).

    Perhaps not all steps are necessary, but this worked for me.
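    If you want to verify that step 5 took effect, a quick check with psutil (the 64 GB threshold simply mirrors the recommendation above):

        import psutil

        swap = psutil.swap_memory()
        ram = psutil.virtual_memory()
        print(f"RAM:  {ram.total / 2**30:.0f} GB")
        print(f"Swap: {swap.total / 2**30:.0f} GB")
        if swap.total < 64 * 2**30:
            print("Swap is below the 64 GB minimum recommended for this workflow.")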


    Details

    Downloads: 244
    Platform: CivitAI
    Platform Status: Available
    Created: 1/18/2026
    Updated: 4/30/2026
    Deleted: -

    Files

    wan22VideoVoiceMotionControlAll_v12.zip