Just an experimental lip-syncing tool I made a while ago. It requires a 16 or 32 fps video of a character and a voice audio input. The output is the same video with the character's lip motion added, synced to the audio. It works best for simple anime styles, but can also work great on detailed realistic videos (check the examples above).
The workflow lets you insert the inpainted lip motion into any part of the video with smooth in-and-out transitions. It uses SAM3 segmentation to precisely target the character's face and mouth and resample them at low denoise using Wan2.2 S2V.
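If you're curious what "resampling at low denoise" means under the hood, here's a conceptual sketch. This is illustrative Python, not the actual node graph, and `resample_fn` is a hypothetical stand-in for the Wan2.2 S2V KSampler pass:

```python
import numpy as np

# Masked low-denoise resampling in a nutshell: re-generate the latents with
# limited denoise so the model only reworks fine detail (the mouth motion),
# then composite the result back through the SAM3 mask so everything
# outside the face/mouth region stays identical to the source.
def masked_low_denoise(latents: np.ndarray, mask: np.ndarray,
                       resample_fn, denoise: float = 0.3) -> np.ndarray:
    resampled = resample_fn(latents, denoise)         # e.g. only the last ~30% of steps
    return mask * resampled + (1.0 - mask) * latents  # untouched outside the mask
```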
A detailed description of how to use the workflow can be found in the notes inside!
IMPORTANT:
All the custom nodes used in this workflow can be installed through the Manager. I recommend installing them one by one, in case some of them change the versions of your existing Python dependencies. As always, don't be lazy: check the requirements.txt of each custom node pack before you install it. Otherwise you risk conflicts with your other installed node packs, or even BRICKING your ComfyUI installation entirely! So, fat WARNING right here.
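If you want a quick sanity check before installing, here's a small helper script (my own sketch, not part of the workflow) that scans every pack's requirements.txt and flags pins that disagree with what's already installed. Run it from the ComfyUI root; it assumes the `packaging` library is available (`pip install packaging` if it isn't):

```python
# Sketch of a dependency-conflict check for ComfyUI custom node packs.
# Flags any requirements.txt pin that the currently installed package
# version does not satisfy.
from pathlib import Path
from importlib.metadata import version, PackageNotFoundError
from packaging.requirements import Requirement, InvalidRequirement

def check_requirements(custom_nodes_dir: str = "custom_nodes") -> None:
    for req_file in Path(custom_nodes_dir).glob("*/requirements.txt"):
        for line in req_file.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            try:
                req = Requirement(line)
            except InvalidRequirement:
                continue  # skip git URLs and other non-standard lines
            try:
                installed = version(req.name)
            except PackageNotFoundError:
                continue  # not installed yet, so nothing to clobber
            if req.specifier and installed not in req.specifier:
                print(f"{req_file.parent.name}: wants '{line}', "
                      f"but {req.name}=={installed} is installed")

if __name__ == "__main__":
    check_requirements()
```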
Required resources:
For the Wan2.2 S2V itself, I personally use DaSiWa's finetuned checkpoint because it works well with anime styles and has the Lightning LoRA embedded. You can check it out here:
https://civarchive.com/models/2151205/dasiwa-wan-22-14b-s2v
The normal Wan2.2 S2V would probably work just as well with some appropriate adjustments to the KSamplers. As for the VAE, text and audio encoders, I used the models provided by the official ComfyUI documentation. They can be found here:
https://docs.comfy.org/tutorials/video/wan/wan2-2-s2v
The key part of this workflow is the automatic segmentation, which is done using the SAM3 model. I used the model linked in the "ComfyUI-Easy-Sam3" node pack documentation (the full 3.21 GB one, but fp16 will probably work just fine). Find the download links and where to put the model here:
https://github.com/yolain/ComfyUI-Easy-Sam3
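For reference, here's the typical layout of the Wan2.2 S2V pieces inside the ComfyUI models folder. The filenames below are the ones from the official docs, so treat them as examples if you download other variants (where the SAM3 model goes is covered by the node pack's README):

```
ComfyUI/models/
├── diffusion_models/
│   └── wan2.2_s2v_14B_fp8_scaled.safetensors   <- or DaSiWa's finetune
├── text_encoders/
│   └── umt5_xxl_fp8_e4m3fn_scaled.safetensors
├── audio_encoders/
│   └── wav2vec2_large_english_fp16.safetensors
└── vae/
    └── wan_2.1_vae.safetensors
```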
System requirements:
I tried to add memory-clearing nodes after each step to make the workflow more memory-friendly, but those can be kinda buggy. Thanks to Comfy's smart native offloading, 12 GB of VRAM should probably be enough to run this; the workflow does remain very RAM-heavy, though. On my setup (RTX 4080 with 16 GB of VRAM and 64 GB of RAM) the total run time is around 3 minutes, with RAM/VRAM usage peaking at maximum. My honest bet is that if you can fit the normal Wan2.2 S2V, you can handle this workflow too. If not, try swapping the model loaders for GGUF ones.
Known issues:
If the input video has intense movement, the character's mouth in the output might sway slightly relative to their face.
The crossfading stage for the first and last frames can produce buggy mouth morphs. To avoid this, try using an input with a closed mouth, or change the number of frames to crossfade (see the sketch below).
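For intuition, that crossfade is conceptually just a linear blend between the source frames and the resampled frames over the first and last n frames, something like this (illustrative numpy, not the workflow's actual nodes):

```python
import numpy as np

# Blend the source video into the lip-synced segment over the first n
# frames and back out over the last n. Frames are float arrays of shape
# (frames, H, W, C) in [0, 1].
def crossfade(original: np.ndarray, resampled: np.ndarray, n: int = 8) -> np.ndarray:
    out = resampled.copy()
    ramp = np.linspace(0.0, 1.0, n)[:, None, None, None]           # 0 -> 1 over n frames
    out[:n] = original[:n] * (1 - ramp) + resampled[:n] * ramp     # ease into the lip-synced part
    out[-n:] = resampled[-n:] * (1 - ramp) + original[-n:] * ramp  # ease back out to the source
    return out
```

A larger n gives a smoother transition, but also more frames where two different mouth shapes get averaged together, which is likely where the buggy morphs come from; a closed mouth in the source minimizes that difference.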
Example videos above:
All of the example videos were generated using taek75799's Enhanced WAN 2.2, specifically the FASTMOVE V2 FP8 version. Check out this awesome model here:
https://civarchive.com/models/2053259/wan-22-enhanced-nsfw-or-svi-or-camera-prompt-adherence-lightning-edition-i2v-and-t2v-fp8-gguf?modelVersionId=2477539
The voice lines for the characters were generated using Qwen3-TTS 1.7B. It can be easily run in ComfyUI via the following node pack:
https://github.com/flybirdxx/ComfyUI-Qwen-TTS