For a detailed video introduction, results showcase, and step-by-step tutorial, please refer to the video below:
You can try each workflow online via the links below.
Open-source repository: https://github.com/Soul-AILab/SoulX-FlashTalk
Workflow: AA--FlashTalk Digital Human - Twice the Speed
Experience Link: https://www.runninghub.ai/post/2030296318643544066/?inviteCode=rh-v1401
Workflow: AA--FlashTalk Digital Human + Qwen TTS Voice Presets - Generate Digital Anchor from One Image
Experience Link: https://www.runninghub.ai/post/2029760205487087617/?inviteCode=rh-v1401
Workflow: AA--FlashTalk Digital Human + Voice Cloning (Emotion Control)
Experience Link: https://www.runninghub.ai/post/2029525701015117826/?inviteCode=rh-v1401
Workflow: AA--Chasing Closed-Source Consistency! Xiaohongshu FireRed-Image-Edit 1.1 Image Editing + Upscaling Restoration
Experience Link: https://www.runninghub.ai/post/2029824095113711618/?inviteCode=rh-v1401
I built three ComfyUI workflows for different scenarios around the FlashTalk real-time digital human model, with the goal of quickly generating high-quality lip-sync videos from text or audio.
Here is a core introduction to these three workflows:
Original Basic Workflow:
This is the simplest version. The model is encapsulated inside the pipeline, so no prompt is needed: just upload a reference image and the driving audio, set the resolution (note that resolutions above 540p need the VRAM settings adjusted), and a lip-sync video is generated directly. It runs roughly twice as fast as InfiniteTalk, making it ideal for quickly validating results.
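If you prefer to drive this basic workflow from a script rather than the ComfyUI web page, here is a minimal sketch using ComfyUI's standard HTTP API. It assumes ComfyUI is running locally on port 8188, that the workflow was exported via "Save (API Format)", and that the reference image and driving audio are already in ComfyUI's input folder; the JSON file name and the node IDs are placeholders you would replace with the ones from your own export.

```python
# Minimal sketch: queue the FlashTalk basic workflow through ComfyUI's HTTP API.
# Assumes a local ComfyUI instance on port 8188 and a workflow exported in API
# format as "flashtalk_basic_api.json"; node IDs "10"/"11" are placeholders for
# the LoadImage / LoadAudio nodes in that export.
import json
import requests

COMFY_URL = "http://127.0.0.1:8188"

def queue_flashtalk(image_name: str, audio_name: str) -> str:
    with open("flashtalk_basic_api.json", "r", encoding="utf-8") as f:
        workflow = json.load(f)

    # Point the loader nodes at files already placed in ComfyUI's input folder.
    workflow["10"]["inputs"]["image"] = image_name   # reference image (node ID assumed)
    workflow["11"]["inputs"]["audio"] = audio_name   # driving audio (node ID assumed)

    # Queue the job; ComfyUI returns a prompt_id that can be polled via /history.
    resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow})
    resp.raise_for_status()
    return resp.json()["prompt_id"]

if __name__ == "__main__":
    print(queue_flashtalk("anchor.png", "voiceover.wav"))
```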
Qwen TTS + Voice Preset Workflow:
This combination suits scenarios that need a quick voiceover. It uses Qwen TTS's nine preset voices (covering Chinese, English, Japanese, Korean, and several dialects), so there is no need to source external voice actors. Paired with FlashTalk, it enables rapid conversion from plain text to a digital-human video.
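The same scripting approach works here if you want to batch-generate anchor clips: patch the script text and the voice preset in the exported workflow JSON before queuing it. This is only a sketch under my own assumptions; the node ID "5", the field names "text" and "voice", and the preset name are placeholders, so inspect your export for the actual Qwen TTS node and its input names.

```python
# Sketch: swap the script text and voice preset in the exported API-format
# workflow, then queue it. Node ID "5" and the "text"/"voice" field names are
# assumptions; check your own export.
import json
import requests

with open("flashtalk_qwen_tts_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

workflow["5"]["inputs"]["text"] = "Welcome to today's program."  # script for the digital anchor
workflow["5"]["inputs"]["voice"] = "Cherry"                      # preset voice name (assumed)

resp = requests.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
```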
F5-TTS + Voice Cloning + Emotion Control Workflow:
This is the advanced version I use most often. Upload a sample of the original voice to clone it, then combine it with eight emotion parameters (for example, raising the Happy or Sad values) to finely control the emotional tone of the generated speech. It is particularly suited to delicate, emotion-driven radio-style videos, such as the "Queen of the Kingdom of Women" series.
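If you script this workflow the same way, the emotion controls are simply more inputs to patch before queuing. The sketch below is illustrative only: the node ID "7", the field names, and six of the eight emotion keys are assumptions (only Happy and Sad are mentioned above), so map them to whatever your own export actually exposes.

```python
# Sketch: set the cloning reference audio and the eight emotion weights in the
# exported API-format workflow. Node ID "7", the field names, and most emotion
# keys are assumptions; only Happy/Sad come from the workflow description.
import json
import requests

emotions = {  # weights in [0, 1]; tune Happy/Sad etc. to shape the delivery
    "happy": 0.1, "sad": 0.7, "angry": 0.0, "surprised": 0.0,
    "afraid": 0.0, "disgusted": 0.0, "calm": 0.2, "melancholic": 0.0,
}

with open("flashtalk_voice_clone_api.json", "r", encoding="utf-8") as f:
    workflow = json.load(f)

workflow["7"]["inputs"]["reference_audio"] = "queen_sample.wav"  # voice to clone (field name assumed)
workflow["7"]["inputs"].update(emotions)                         # apply emotion weights

resp = requests.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
resp.raise_for_status()
```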
Additionally, when creating the intro for Queen of the Kingdom of Women – Emotional Radio, I used the FireRed image editing workflow to restore and redraw the character from a blurry screenshot, ensuring consistency and clarity. Finally, the video was upscaled to 4K resolution using an upscaling workflow.
The core advantages of this solution are: fast speed, no complex prompts required, simple nodes, and the ability to endow characters with rich emotional expression through voice cloning. As mentioned in the video, technological progress simplifies expression, but to create truly moving content, we still need a deep understanding of emotions and characters.