    ComfyUI Multitalk Workflow | Multi-Speaker Lip-Synced Video Generator - v1.0

    Create multi-speaker lip-synced videos from portraits and voice tracks in one click!

    Who it's for: creators who want this pipeline in ComfyUI without assembling nodes from scratch. Not for: one-click results with zero tuning — you still choose inputs, prompts, and settings.

    Open preloaded workflow on RunComfy (browser)

    Why RunComfy first
    - Fewer missing-node surprises — run the graph in a managed environment before you mirror it locally.
    - Quick GPU tryout — useful if your local VRAM or install time is the bottleneck.
    - Matches the published JSON — the zip contains the same runnable workflow you can open on RunComfy.

    When downloading for local ComfyUI makes sense — you want full control over models on disk, batch scripting, or offline runs.

    How to use (local ComfyUI)
    1. Load inputs (images/video/audio) in the marked loader nodes.
    2. Set prompts, resolution, and seeds; start with a short test run.
    3. Export from the Save / Write nodes shown in the graph.

    Expectations — First run may pull large weights; cloud runs may require a free RunComfy account.


    Overview

    This workflow generates lip-synced videos from portraits and audio, supporting both single and multi-speaker outputs with detailed facial motion and speech alignment.

    Key nodes in the ComfyUI MultiTalk workflow

    MultiTalkWav2VecEmbeds (#79/#162)

    Converts one or more dialogue tracks into MultiTalk conversation embeddings. Start with one audio input for single-person or two for multi-person; add masks when you need per-face routing. Adjust only what matters: number of frames to match your planned clip length, and whether to provide ref_target_masks for precise speaker-to-face alignment.
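    Matching the frame count to your audio length is simple arithmetic. The sketch below is a hypothetical helper (not part of the workflow's nodes) assuming a 25 fps target and Wan 2.1's usual 4n+1 frame-count constraint — verify both against your own graph settings.

```python
# Sketch: pick a frame count for MultiTalkWav2VecEmbeds from the audio length.
# Assumptions (not from the workflow page): 25 fps output and Wan 2.1's
# 4n+1 frame-count constraint.
def frames_for_audio(duration_s: float, fps: int = 25) -> int:
    raw = int(duration_s * fps)
    return (raw // 4) * 4 + 1  # snap down to the nearest 4n+1 count

print(frames_for_audio(3.2))  # 3.2 s of audio at 25 fps -> 81 frames
```

For a 3.2-second dialogue clip this suggests setting the node's frame count to 81; longer clips scale the same way.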

    AudioSeparation (#88/#160/#161)

    Optional cleanup for noisy inputs. Route your noisy clip into this node and forward the Vocals output. Use it when field recordings include background music or chatter; skip it if you already have clean voice tracks.

    IndexTTSNode (#163/#164)

    Turns Speaker 1 - Text and Speaker 2 - Text into dialogue audio. Provide a short reference_audio to clone tone and pacing, then supply text lines. Keep sentences brief and natural for best lip timing in MultiTalk.
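    To keep lines brief, a quick pre-check on your script can help. This is a hypothetical helper, not a node in the workflow; the 12-word cap is an arbitrary assumption you should tune by ear.

```python
# Sketch (hypothetical helper, not part of the workflow): flag dialogue
# lines that are probably too long for clean lip timing in MultiTalk.
def long_lines(script: str, max_words: int = 12) -> list[str]:
    return [line for line in script.splitlines()
            if len(line.split()) > max_words]

script = ("Hi there, welcome back!\n"
          "Today we are going to talk about a very long and winding "
          "topic that never seems to end at all.")
for line in long_lines(script):
    print("Consider splitting:", line)
```

Run it over each speaker's text before feeding IndexTTSNode, then break any flagged line into two shorter sentences.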

    WanVideoTextEncodeSingle (#18)

    Encodes your scene prompt for Wan 2.1. Favor simple, concrete descriptions of place, lighting, and style. Avoid long lists; one or two concrete sentences are enough for the sampler.


    Notes

    ComfyUI Multitalk Workflow | Multi-Speaker Lip-Synced Video Generator — see RunComfy page for the latest node requirements.

    Description

    Initial release — Multitalk.

    Workflows
    Other

    Details

    Downloads
    34
    Platform
    CivitAI
    Platform Status
    Available
    Created
    4/2/2026
    Updated
    4/4/2026
    Deleted
    -

    Files

    comfyuiMultitalkWorkflow_v10.zip

    Mirrors