🎬 LTXV 2.3 Video-to-Audio — ComfyUI Workflow
Add synced audio to your silent videos using LTXV 2.3's native joint audio-video generation
Take any silent video and let LTXV 2.3 generate natural-sounding audio that actually matches what's happening on screen — footsteps, ambience, impacts, and more. Unlike tools that bolt on a separate audio model, LTXV 2.3 generates audio and video from within a single Diffusion Transformer model, so sync happens at the model level rather than in post.
✨ What This Does
This workflow wraps LTXV 2.3's native joint audio-video generation: audio and video come out of the same Diffusion Transformer (DiT), so sync happens at the model level. Feed it a silent MP4, describe the sound you want, and get back a video with synced audio baked in. No separate audio model, no manual alignment.
🚀 How to Use
1. **Load the workflow** — drag the `.json` file into ComfyUI or use Load from the menu.
2. **Input your video** — connect your silent video file to the Load Video node. MP4 works best; keep it under 10 seconds to start.
3. **Write your audio prompt** — describe the sounds you want (e.g. `"footsteps on gravel, distant wind, birds chirping"`). Be specific — the model responds well to descriptive prompts.
4. **Set your seed** — different seeds give noticeably different results, so it's worth exploring a few.
5. **Run** — hit Queue Prompt and grab a coffee. The output video with merged audio will appear in your `output/` folder.
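If you'd rather script the steps above than click through the UI, ComfyUI exposes an HTTP endpoint (`POST /prompt`) that accepts an exported workflow JSON. The sketch below patches the seed and prompt fields and queues the job. The node IDs (`seed_node`, `prompt_node`) are placeholders — look up the real IDs in your own exported `.json`, since they differ per workflow.

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # default local ComfyUI address


def build_payload(workflow: dict, seed: int, audio_prompt: str,
                  seed_node: str, prompt_node: str) -> dict:
    """Patch an exported workflow and wrap it for ComfyUI's /prompt endpoint.

    seed_node / prompt_node are node IDs from YOUR exported workflow JSON;
    the field names ("seed", "text") are the usual ones but worth verifying.
    """
    wf = json.loads(json.dumps(workflow))  # deep copy, leave the original intact
    wf[seed_node]["inputs"]["seed"] = seed
    wf[prompt_node]["inputs"]["text"] = audio_prompt
    return {"prompt": wf}


def queue(payload: dict) -> None:
    """Fire the job at a running ComfyUI instance; output lands in output/."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Calling `build_payload` in a loop over several seeds is an easy way to do the seed exploration from step 4 in one batch.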
🔄 Where This Fits in Your Workflow
For best results, run this after frame interpolation but before upscaling:
Generate Video → Frame Interpolation → 🎧 Add Audio (here) → Upscale

Why this order matters:
After interpolation — interpolation can subtly alter frame timing and motion cadence. Running audio generation on the interpolated video means the model is analyzing the final motion, giving you tighter sync.
Before upscaling — upscalers don't touch the audio track, so it doesn't matter where upscaling falls from an audio perspective. But keeping it last means you're not running a heavy upscale pass on a clip you might still tweak.
💡 If you're using a temporal upscaler (like LTXV's own latent upscaler), treat it the same as interpolation — add audio after it.
🍆💦 Sex Sounds LoRAs used in examples
🎧 Example Inputs & Outputs
| Input Video | Prompt | Result |
| --- | --- | --- |
| Person walking through forest | "footsteps on dry leaves, wind through trees, distant birdsong" | Footsteps land on every step, ambient nature bed underneath |
| Object falling and hitting floor | "heavy thud, small objects scattering, room reverb" | Impact synced to hit frame, natural room echo |
| Waves on a beach | "ocean waves crashing, gentle sea foam, seagulls" | Continuous wave sound matched to motion rhythm |
💡 Tip: The model pays attention to motion intensity. Fast movement + an energetic prompt = much better results than a mismatch between the two.
⚠️ Known Limitations & Tips
Limitations
- Works best on videos under 10 seconds — longer clips may drift out of sync toward the end.
- Speech and singing are not reliably reproduced — this shines on environmental and ambient sound, not voice.
Tips that actually help
🎯 **Be descriptive in your prompt** — `"sharp metallic clang with reverb"` beats `"metal sound"` every time.

🔁 **Run multiple seeds** — audio generation has real variance; what sounds off on seed 42 might be perfect on seed 7.
🎚️ Check your input video FPS — the workflow assumes 24fps by default. If your video is a different frame rate, adjust the FPS node accordingly or sync will be off.
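To see why the FPS setting matters, it helps to do the arithmetic: the model infers clip duration as frames ÷ fps, so a wrong fps stretches or squeezes the audio timeline. A quick sanity-check sketch:

```python
def expected_duration(num_frames: int, fps: float) -> float:
    """Clip duration in seconds as the model will infer it from the FPS node."""
    return num_frames / fps


def sync_error_seconds(num_frames: int, actual_fps: float, assumed_fps: float) -> float:
    """How far off the END of the generated audio lands when the FPS node is wrong.

    Positive = audio runs long (events land late); negative = audio runs short.
    """
    return expected_duration(num_frames, assumed_fps) - expected_duration(num_frames, actual_fps)
```

For example, a 120-frame clip shot at 30 fps is really 4 s of motion, but with the node left at the 24 fps default the model treats it as 5 s — every sound event after the first slides progressively later, up to a full second by the end.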
Description
This is the first release.