ACE-Step 1.5 is an open-source music model similar to Suno, Udio, Mureka, Lyria, etc.
Workflows are embedded in the videos
Just drag and drop one into ComfyUI to import it (each workflow includes an optional group of nodes that loads a static image and turns it into a video along with the generated song).
Features
Lightweight: The model runs locally with less than 4GB of VRAM.
Fast: a 2-minute song can be generated in about a minute, even on a low-end GPU (<= 8GB VRAM).
Uncensored: you can prompt any lyrics you want.
Multiple languages: over 50 languages are supported officially.
Models
SFT 1.7B AIO checkpoint
I've merged the following models into a single checkpoint file:
acestep-v15-sft.safetensors: the SFT model, which sounds better than the Turbo version, while requiring only a few more steps.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder, which is many times faster than the 4B alternative while being almost as good (in my own experiments).
ace_1.5_vae.safetensors: the VAE.
Simply download it to your ComfyUI/models/checkpoints folder.
Tips
Prompts
The model expects two optional prompts: caption/tags and lyrics/structure.
Caption/tags
Guide the model to the genre and instruments you want in your music, as well as vocals, mood, era, mixing style, etc.
Genres:
jazz, rap, rock, metal, hip hop, bossa nova, electronic, synthwave, blues, reggae, etc. A potential list of all 178k genres used during training can be found here.
Instruments:
acoustic guitar, piano, drums, bass, synths, electric guitar, violin, etc.
Vocals:
raspy male vocal, young female vocal, duet harmonies, whispering child, etc.
See more tags in the official tutorial.
Either comma-separated tags, or natural language.
Multiple genres may be provided, but conflicts will likely harm the quality.
Even small changes will impact the result substantially.
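To make this concrete, here is an illustrative tag-style caption of my own (not taken from the official docs):

```
synthwave, electronic, 80s, male vocal, dreamy, analog synths, driving bass, retro mixing
```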
Lyrics/structure
The model will sing better if each line contains between 6 and 10 syllables.
It's recommended to provide structure tags to organise your lyrics. Examples:
[Intro],[Verse],[Chorus],[Bridge],[Instrumental],[End], etc.
Add an empty line between structure blocks.
Inside a structure tag, you may add other hints. Examples:
[Intro - Dreamy],[Chorus - Layered vocals],[Instrumental - Guitar solo],[Bridge - Whispering], etc.
Some singing techniques/effects are recognised. Examples:
ACE-Step is heeere: hold the note for longer.
For your mind (your mind): backing vocals.
Stand up and SHOUT: sing with more power.
[pt] Obrigado, amigo: switches to a different language.
[whispering] Don't be afraid: attempts to add said effect.
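Putting these conventions together, a lyrics prompt could look like the following (my own made-up example):

```
[Intro - Dreamy]

[Verse]
Walking down the empty street
Neon lights below my feet

[Chorus - Layered vocals]
Stand up and SHOUT it out tonight
Hold me cloooser in the light

[Instrumental - Guitar solo]

[End]
```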
Metadata
Metadata fields help guide music attributes, but they may be overridden by the prompts.
The most relevant are listed below:
bpm: beats per minute, determines the tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–190.
duration: target duration in seconds. The model officially supports 10s–600s. Short songs (30–60s) and medium lengths (2–4 min) are stable; very long generations may have repetition or structure issues.
timesignature: 4/4 (most common), 3/4 (waltz), 6/8 (swing feel).
language: choose one of the many supported languages for the lyrics.
keyscale: affects overall pitch and emotional color. Usually Minor = darker mood; Major = brighter mood.
generate_audio_codes: when enabled (recommended), spends much more time on the text-encoder conditioning to improve song quality substantially.
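For instance (illustrative values only), a mid-tempo three-minute song could use:

```
bpm: 100
duration: 180
timesignature: 4/4
language: en
keyscale: A Minor
generate_audio_codes: enabled
```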
Sampling
Most of the music (melody, harmony, cadence, etc.) comes from the conditioning, so tweak the sampling parameters to explore variations.
steps: the SFT model requires at least 20, but I recommend 30-50 for good results. Sometimes 50-100+ is needed to fix failed parts.
cfg: 1.0 is good enough, even for the SFT model. Increasing it to 2.0 seems to improve vocals while reducing the presence of instruments. Going above 2.0 starts to harm the output (do your own experiments, though).
sampler: my favourites: sa_solver_pece, heun, dpmpp_sde, uni_pc_bh2, euler, etc.
scheduler: my favourites: beta, simple, kl_optimal, etc.
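As a starting point, these are the defaults I would pick from the recommendations above (adjust to taste):

```
steps: 30
cfg: 1.0
sampler: sa_solver_pece
scheduler: beta
```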
Description
Merge of the following ACE-Step 1.5 models into a single checkpoint file:
acestep-v15-sft.safetensors: the SFT model.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder for better audio codes.
ace_1.5_vae.safetensors: the VAE.
Comments
This... is a little interesting. Is this only music generation audio or is it like video with audio?
Is there a workflow for this?
ACE-Step 1.5 is for music generation only, but the demo videos are just a text2music workflow that I join with a static image to create the video. (The workflow is embedded in the demo videos, just drop it into ComfyUI to import it)
Very good! I follow your channel on YouTube and there's nothing but excellent content there. I'm waiting for a video of yours on how to create a LoRA.
Thanks, man!
Thanks for the support! The LoRA video is still coming this year haha. Cheers!
Wow. The quality is incredible. Goodbye music industry...lol Thank you! This is amazing to have.
Thank the folks from ACE Studio and Step Fun, who implemented the model, I'm just the messenger ;)
I tried loading the workflows into SwarmUI's Comfy setup, but it couldn't read them. Is it possible to have a JSON of one of them? I'm trying to figure out how to get audio-to-audio working with SFT and haven't found a good workflow.
After battling with Civitai, I finally managed to upload the workflow under "Training Data".
Does it also work on AMD GPUs? I tried the workflow from the reggae image; it runs correctly, but the output MP3 has no sound.
Good question. I'm afraid I don't have an AMD GPU to test it. Perhaps replace the "Save Audio (MP3)" with another node, like "Preview Audio" or even "Save Audio (FLAC)".