Turns any photo into a lifelike talking avatar with emotion and voice.
Who it's for: creators who want this pipeline in ComfyUI without assembling nodes from scratch. Not for: one-click results with zero tuning — you still choose inputs, prompts, and settings.
Open preloaded workflow on RunComfy
Why RunComfy first
- Fewer missing-node surprises — run the graph in a managed environment before you mirror it locally.
- Quick GPU tryout — useful if your local VRAM or install time is the bottleneck.
- Matches the published JSON — the zip follows the same runnable workflow you can open on RunComfy.
When downloading for local ComfyUI makes sense — you want full control over models on disk, batch scripting, or offline runs.
How to use (local ComfyUI)
1. Load inputs (images/video/audio) in the marked loader nodes.
2. Set prompts, resolution, and seeds; start with a short test run.
3. Export from the Save / Write nodes shown in the graph.
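The steps above can also be driven headlessly against a local ComfyUI instance via its HTTP API, which is handy for batch scripting. A minimal sketch, assuming you have exported the graph in API format; the filename is illustrative, and node id "80" refers to the WanVideoSampler mentioned below:

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # ComfyUI's default local address

def set_node_input(workflow: dict, node_id: str, name: str, value) -> None:
    """Override a single input on a node in an API-format workflow dict."""
    workflow[node_id]["inputs"][name] = value

def queue_workflow(workflow: dict) -> dict:
    """Submit the workflow to ComfyUI's /prompt endpoint and return its reply."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Typical use: load the API-format export, tweak inputs, queue a short test run.
# wf = json.load(open("character_ai_ovi_api.json"))  # filename is illustrative
# set_node_input(wf, "80", "seed", 42)               # "80" = WanVideoSampler
# queue_workflow(wf)
```

Overriding the seed this way before each submission makes short test runs reproducible without editing the graph in the UI.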
Expectations — First run may pull large weights; cloud runs may require a free RunComfy account.
Overview
This audiovisual workflow turns any portrait into a talking avatar. Designed for digital creators and storytellers, it needs only a single image to produce expressive animation with synchronized voice and lip motion. The system combines photo-to-video generation and voice synthesis in a single pass, so you can create character-driven clips for content, marketing, or creative projects quickly, without assembling the pipeline yourself. Each generated scene carries detailed motion and an engaging personality.
Key nodes in the ComfyUI Character AI Ovi workflow
WanVideoTextEncodeCached (#85)
Encodes the main positive prompt and the video negative prompt into embeddings used by both branches. Keep dialogue inside <S>…<E> and place sound design inside <AUDCAP>…<ENDAUDCAP>. For best alignment, avoid multiple sentences in one speech tag and keep each line concise.

WanVideoTextEncodeCached (#96)
Provides a dedicated negative text embedding for audio. Use it to suppress artifacts such as robotic tone or heavy reverberation without affecting visuals. Start with short descriptors and expand only if you still hear the issue.

WanVideoOviCFG (#94)
Blends the original text embeddings with the audio-specific negatives through Ovi-aware classifier-free guidance. Raise it when the spoken content drifts from the written line or lip motion feels off; lower it slightly if motion becomes stiff or over-constrained.

WanVideoSampler (#80)
The heart of Character AI Ovi. It consumes image embeds, joint text embeds, and optional guidance to sample a single latent that contains both video and audio. More steps increase fidelity but also runtime. If you see memory pressure or stalls, pair a higher block-swap value with caching enabled, and consider disabling torch compile for quick troubleshooting.

WanVideoEmptyMMAudioLatents (#125)
Initializes the audio latent timeline. The default length is tuned for a 121-frame, 24 fps clip. Adjusting it to change duration is experimental; change it only if you understand how it must track the frame count.

VHS_VideoCombine (#88)
Muxes decoded frames and audio to MP4. Set frame rate to match your sampling target and toggle trim-to-audio if you want the final cut to follow the generated waveform. Use the CRF control to balance file size and quality.
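To make the tag format from the text-encode node concrete, here is a small sketch that assembles a prompt with the `<S>…<E>` speech tag and `<AUDCAP>…<ENDAUDCAP>` audio caption; the helper function is our own convenience, not part of the workflow:

```python
def build_ovi_prompt(scene: str, speech: str, audio: str) -> str:
    """Wrap dialogue in <S>...<E> and sound design in <AUDCAP>...<ENDAUDCAP>."""
    return f"{scene} <S>{speech}<E> <AUDCAP>{audio}<ENDAUDCAP>"

prompt = build_ovi_prompt(
    "A woman smiles warmly at the camera, soft window light.",
    "Welcome back, everyone!",             # one short sentence per speech tag
    "clear female voice, quiet room tone",
)
```

Keeping the speech tag to a single concise sentence, as the node description advises, tends to give tighter lip alignment.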
Notes
Character AI Ovi in ComfyUI | Image2Video + Voice Workflow — see RunComfy page for the latest node requirements.