# LTX-2.3 Whisper & Soft-Spoken Audio LoRA
Base model: LTX-2.3 · Type: Audio-style LoRA · Rank: 32
---
## What this does
LTX-2.3 can generate dialogue, multi-speaker scenes, and full dynamic range audio including screaming — but it cannot whisper. This LoRA adds two quiet vocal registers to the model:
- Whispering — devoiced, breathy, close-mic delivery
- Soft-spoken — voiced but low-volume, intimate, relaxed
The LoRA targets only the three attention modules that write to the audio branch (audio_attn1, audio_attn2, video_to_audio_attn). Because no video-side weights are modified, video output is unchanged: no visual fighting, no style drift.
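One way to sanity-check that only the audio-writing modules are touched is to scan the LoRA's weight names and confirm that each one sits under one of the three targets. The sketch below is a minimal illustration, not the author's tooling: the key names in `example_keys` are hypothetical, and a real checkpoint may use a different naming layout depending on the trainer.

```python
# Target module names taken from the model card.
AUDIO_TARGETS = ("audio_attn1", "audio_attn2", "video_to_audio_attn")


def untargeted_keys(keys, targets=AUDIO_TARGETS):
    """Return the keys whose dotted path contains none of the target module names."""
    leftovers = []
    for key in keys:
        parts = key.split(".")
        # Match whole path components so that e.g. "attn1" does not count as "audio_attn1".
        if not any(part in targets for part in parts):
            leftovers.append(key)
    return leftovers


# Hypothetical key names for illustration only; real checkpoints may differ.
example_keys = [
    "transformer_blocks.0.audio_attn1.to_q.lora_A.weight",
    "transformer_blocks.0.audio_attn2.to_k.lora_B.weight",
    "transformer_blocks.0.video_to_audio_attn.to_v.lora_A.weight",
]

print(untargeted_keys(example_keys))  # prints []
```

An empty result means every weight in the list belongs to one of the three audio modules. Any key that shows up in the output would point to a module outside that set.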
---
## Usage
Load at strength 1.0. The register is controlled entirely by the manner keyword in your prompt — no special strength tuning needed.
### Trigger words (none, use natural language)
| Register | Female | Male |
|---|---|---|
| Whispering | (woman, whispering) | (man, whispering quietly) |
| Soft-spoken | (woman, speaking softly) | (man, speaking softly) |
> Note: Male whisper may require the extra word "quietly" to tip the model over. (man, whispering) alone produces soft-spoken, not true whisper.
### Prompt format
Follow the LTX-2.3 dialogue caption style:
```
a [scene description], ([gender], [manner]): "[what they say]", intimate ASMR
```
Examples:
```
a woman sitting close to a microphone in warm dim lighting, (woman, whispering): "close your eyes and listen"
a man at a desk late at night, (man, speaking softly): "I've been thinking about this all day"
a woman doing a skincare routine, (woman, whispering quietly): "this is my favourite step"
```
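For batch generation, the prompt format can be assembled programmatically. The helper below is a hypothetical sketch, not part of LTX-2.3 or any loader: `build_prompt` and its `register` and `asmr_tag` parameters are names invented here. It applies the male-whisper rule from the note above and makes the trailing "intimate ASMR" tag optional, since the template includes it and the examples omit it.

```python
def build_prompt(scene, gender, register, line, asmr_tag=True):
    """Assemble a caption in the `scene, (gender, manner): "line"` format."""
    if register == "whisper":
        # Per the note above, male whisper may need the extra word "quietly".
        manner = "whispering quietly" if gender == "man" else "whispering"
    elif register == "soft":
        manner = "speaking softly"
    else:
        raise ValueError(f"unknown register: {register!r}")
    prompt = f'{scene}, ({gender}, {manner}): "{line}"'
    if asmr_tag:
        prompt += ", intimate ASMR"
    return prompt


print(build_prompt("a man at a desk late at night", "man", "whisper",
                   "I've been thinking about this all day"))
```

Passing `asmr_tag=False` yields the shorter form used in the examples above.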
### Without manner keywords
Using the LoRA without any manner keyword defaults to soft-spoken — a subtle volume-softening effect on whatever the base model would have generated. Useful as a gentle "quieter audio" modifier.
---
## What it can't do
- No intra-clip register mixing. You can't have one character whisper and another speak normally in the same clip. The register applies to the whole generation. For mixed-register dialogue, generate each part separately and cut them together.
- No magic above the vocoder ceiling. The audio chain passes through a mel spectrogram bottleneck, so the high-frequency energy of a breathy whisper gets partially smoothed. Expect intimate and quiet, not studio-crisp ASMR.
- Video is untouched by design. If you want the visuals to also feel ASMR (soft lighting, close-up framing), describe that in the scene prompt — the LoRA won't help or hurt.
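The mixed-register workaround from the first limitation (generate each part separately, then cut them together) could be scripted roughly as follows. This is a sketch under assumptions: ffmpeg is installed, all clips share the same codec, resolution, and frame rate so they can be joined without re-encoding, and the file names are hypothetical. The functions only build the list file text and the command; they do not run ffmpeg.

```python
def concat_list(clip_paths):
    """Build the text of an ffmpeg concat-demuxer list file, one clip per line."""
    lines = []
    for path in clip_paths:
        # Escape single quotes the way the concat demuxer expects.
        escaped = path.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_file, output):
    """Return an ffmpeg argument list that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_file, "-c", "copy", output]


# Hypothetical file names: one whispered clip, one normal-register clip.
clips = ["whisper_part.mp4", "normal_part.mp4"]
print(concat_list(clips))
print(" ".join(concat_command("clips.txt", "dialogue.mp4")))
```

Write the output of `concat_list` to `clips.txt`, then run the printed command. If the clips were rendered with different settings, stream copy may fail or glitch, and re-encoding is the safer route.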
---
## Training details
| Setting | Value |
|---|---|
| Base model | LTX-2.3 dev |
| Steps | 2000 |
| Rank / Alpha | 32 / 32 |
| Target modules | audio_attn1, audio_attn2, video_to_audio_attn |
| Training resolution | 192×192, 97 frames (~4s @ 24fps) |
| Dataset | 74 clips, 8 voices (4F / 4M), 2 registers each |
Clips were 4-second segments sourced from ASMR content across 8 speakers — 4 female (2 soft-spoken, 2 whisper) and 4 male (2 soft-spoken, 2 whisper). Captions used Whisper ASR transcription in (gender, manner): "transcript", intimate ASMR format.
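The numbers in the table can be cross-checked with a little arithmetic. The sketch below assumes the common LoRA convention that the low-rank update is scaled by alpha / rank; the card does not state which convention the trainer used. `make_caption` is a hypothetical helper that mirrors the caption format described above.

```python
RANK = 32
ALPHA = 32
FRAMES = 97
FPS = 24

# Assumed convention: the low-rank update is multiplied by alpha / rank.
lora_scale = ALPHA / RANK

# Clip duration implied by the frame count and frame rate.
clip_seconds = FRAMES / FPS


def make_caption(gender, manner, transcript):
    """Format a training caption in the style described above."""
    return f'({gender}, {manner}): "{transcript}", intimate ASMR'


print(lora_scale)              # 1.0
print(round(clip_seconds, 2))  # 4.04
print(make_caption("woman", "whispering", "close your eyes and listen"))
```

Under that assumption, rank 32 with alpha 32 gives a scale of 1.0, and 97 frames at 24 fps match the roughly 4-second clip length listed in the table.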
Description
78 audio clips spanning 8 voices, both male and female, supporting whispering and softly spoken audio.
Comments (6)
Does this work combined with character loras?
I don't see why it wouldn't. It's purely audio-trained, so it didn't touch the video layer at all. However, it may fight with a character LoRA if that was trained with both video and audio.
You can also use "LTX2 Lora Loader Advanced" from Kj-Nodes. It lets you disable certain blocks (video, audio, other), so you can prevent the LoRA from affecting, e.g., video generation.
@OneBullet I don't actually use Comfy (MacOS here)
Thanks so much. This is a much needed lora. I've only just started testing but, so far, it's working great. It seems like you can use a low strength e.g. 0.3 to create a soft neutral voice or crank it all the way up to 1.0 or above to go for the full hypnosis voice!
Yeah, I've heard it's very situation-aware, as well. Like if the subject is further away from the camera, the strength should be adjusted. Pretty wild that LTX didn't just support this without this kind of LoRA, but it was a neat process creating it.