Aozora-XL: A V-Prediction SDXL Model
Aozora-XL is a v-prediction model based on NoobAI v-pred, fine-tuned for improved stability and coherence. It was trained with a custom script that allows full or partial fine-tuning on a 12GB consumer GPU, such as an RTX 3060. The training script is available on GitHub at Aozora_SDXL_Training for community use.
Never merged
No internally merged LoRAs
Version 0.15 Updates
This version builds on 0.1 by addressing specific issues in the v-prediction setup. It was trained on the v0.1 base to restore vibrant colors and reduce the slight whitewash effect present in earlier releases. Additional fine-tuning focused on fixing common v-prediction problems, such as inconsistencies in scene composition and detail rendering. It used a dataset of ~50,000 images consisting of visual novel and anime content with deep colors, trained for 5 epochs. Settings included:
- Base Model: Aozora V0.1
- Max Train Steps: 250000
- Gradient Accumulation Steps: 64
- Mixed Precision: bfloat16
- UNET Learning Rate: 8e-07
- LR Scheduler: Cosine with 10% warmup
- Features: Min-SNR Gamma (corrected variant, gamma 5.0), Zero Terminal SNR, IP Noise Gamma (0.1), Residual Shifting, Conditional Dropout (prob 0.1)
These changes result in better color fidelity and more reliable outputs across various prompts.
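The numbers in the settings list fit together if each training step consumes one image: ~50,000 images for 5 epochs gives the 250,000 max train steps, and 64 gradient accumulation steps then gives roughly 3,900 optimizer updates. The sketch below works through that arithmetic and shows one common formulation of a cosine schedule with 10% linear warmup. It is a minimal sketch, not the author's training script: the batch size of 1 is taken from the Training Specs section, the function name `cosine_with_warmup` is made up for illustration, and whether the real scheduler advances per step or per optimizer update is not stated in the source.

```python
import math

# Values taken from the settings list above.
DATASET_SIZE = 50_000      # approximate image count
EPOCHS = 5
GRAD_ACCUM_STEPS = 64
PEAK_LR = 8e-7
WARMUP_FRACTION = 0.10

# Assumption: batch size 1, so one training step consumes one image.
micro_steps = DATASET_SIZE * EPOCHS                    # 250_000
optimizer_updates = micro_steps // GRAD_ACCUM_STEPS    # 3_906


def cosine_with_warmup(step: int, total_steps: int,
                       peak_lr: float = PEAK_LR,
                       warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero.

    This is a common textbook formulation; the schedule used for
    Aozora-XL may differ in details such as the final learning rate.
    """
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 12_500, 25_000, 137_500, 250_000):
        print(s, cosine_with_warmup(s, micro_steps))
```

With these values the learning rate climbs linearly for the first 25,000 steps, peaks at 8e-07, and decays toward zero by step 250,000.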
- Note: All preview images were generated without any detailers or enhancers to show base capabilities.
Version 0.1 Overview
The initial release (v0.1 alpha) was a proof-of-concept, trained for 10 epochs on a dataset of ~18,500 images (50% ZZZ characters up to version 2.0, 50% top-rated Danbooru images). It maintains traits from the base model (NoobAI-XL/NAI-XL V-Pred 1.0) while showing gains in stability due to the training approach.
Project Goals
- Provide a GUI-based training script to enable SDXL fine-tuning on consumer hardware.
- Continue developing Aozora-XL into a stable, controllable model through ongoing training on diverse datasets.
Training Method
The method optimizes efficiency by training ~92% of the UNet. It includes adaptive Min-SNR gamma weighting for v-prediction stability and custom learning rate schedules.
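To make the Min-SNR idea concrete, here is a minimal sketch of the per-timestep loss weight in the form commonly used for v-prediction, min(SNR, gamma) / (SNR + 1). This is an assumption about what the weighting looks like, not a copy of the training script: the source does not spell out its "adaptive" or "corrected" variant, and the function names are hypothetical. The gamma of 5.0 is the value listed for version 0.15.

```python
def snr_from_alpha_bar(alpha_bar: float) -> float:
    """Signal-to-noise ratio for a cumulative alpha product, 0 <= alpha_bar < 1."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight_vpred(snr: float, gamma: float = 5.0) -> float:
    """Min-SNR loss weight in the form commonly used with v-prediction.

    The epsilon-prediction form, min(snr, gamma) / snr, divides by zero
    when snr is 0, which happens at the last timestep once Zero Terminal
    SNR is enforced. Dividing by (snr + 1) keeps the weight finite there.
    """
    return min(snr, gamma) / (snr + 1.0)


if __name__ == "__main__":
    for alpha_bar in (0.0, 0.1, 0.5, 0.9, 0.99):
        snr = snr_from_alpha_bar(alpha_bar)
        print(f"alpha_bar={alpha_bar:<5} snr={snr:8.3f} "
              f"weight={min_snr_weight_vpred(snr):.4f}")
```

Low-noise timesteps (high SNR) get their weight capped through gamma, while the fully noised timestep gets weight 0 instead of an undefined value.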
Training Specs:
- Hardware: 1x NVIDIA RTX 3060 12GB (VRAM usage: ~11.8 GB)
- Optimizer: Adafactor
- Batch Size: 1 with 64 Gradient Accumulation Steps
- UNet Params Trained: 2.3B
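Batch size 1 with 64 gradient accumulation steps means only one sample's activations are held in memory at a time, while the optimizer still sees a gradient averaged over 64 samples. The toy example below illustrates the principle on a one-parameter model in plain Python. It is not the author's training loop; the model, the loss, and the function names are stand-ins chosen for illustration.

```python
ACCUM_STEPS = 64  # from the Training Specs list above


def grad_single(w: float, x: float, y: float) -> float:
    """Gradient of the squared error (w * x - y) ** 2 with respect to w."""
    return 2.0 * (w * x - y) * x


def accumulated_gradient(w: float, samples, accum_steps: int = ACCUM_STEPS) -> float:
    """Process one sample at a time, scaling each gradient by 1 / accum_steps.

    After accum_steps samples the running total equals the mean gradient
    of a batch of that size, so the effective batch size is accum_steps
    even though only one sample is processed per step.
    """
    total = 0.0
    for x, y in samples[:accum_steps]:
        total += grad_single(w, x, y) / accum_steps
    return total


if __name__ == "__main__":
    data = [(float(i), 2.0 * i) for i in range(1, ACCUM_STEPS + 1)]
    acc = accumulated_gradient(0.0, data)
    mean = sum(grad_single(0.0, x, y) for x, y in data) / len(data)
    # An optimizer (Adafactor in the specs above) would apply one update
    # using this accumulated gradient, then reset it to zero.
    print(acc, mean)
```

The trade-off is time: 64 forward and backward passes are needed per optimizer update, which is why a larger batch size on stronger hardware trains faster.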
Recommended Settings
- Positive Prompt: very awa, masterpiece, best quality
- Negative Prompt: Optional; try (worst quality, low quality) if needed
- Sampler: DPM++ 3M SDE GPU or Euler (Euler for line art, SDE for details like hands/feet)
- Scheduler: SGM Uniform or Normal
- Steps: 25-35
- CFG Scale: 3-5 (works well at low values)
- Resolution: 1024x1024 or similar (up to 1152x1152)
- Hires. Fix: Use with upscalers like RealESRGAN at ~0.35 denoise
Experiment with settings, as v-prediction models can vary by system.
License
This model follows the license of its base, NoobAI-XL. Review and comply with those terms.
Description
Initial alpha release as a proof of concept.
Comments (20)
A very imaginative attempt!
This model is wonderful, keep it up! And big thanks for sharing that script! You are great!
Thanks. While it's certainly not as powerful as a full fine-tune, training on datasets of thousands often doesn't require retraining everything – most layers already possess significant knowledge. I see it as a middle ground between Dreambooth and a full fine-tune. Plus, with good hardware, you can use the same script for faster training by increasing the batch size.
@Hysocs I was thinking about getting into fine-tuning, but got lost in a ton of guides that are all completely different. I had 2-3 favorite models (pretty old by now, and all of them V-Pred), but for now I will use this model while trying to fine-tune something with Your script. Glad my GPU isn't too far from Yours (mine is a 3060 Ti 8GB). Anyway, huge thanks to You for the hard and nice work; it's definitely getting me back into models. I hope You'll keep this up in any way and passing
@candy_69
With 8GB of VRAM, you can remove the feed-forward layers (the "ff.net" modules) from training to save some VRAM; they account for about half of the trained parameters. This might let you fit the model into 8GB, though it would be pushing the limits.
Be aware: disabling ff.net will reduce the model's ability to learn new concepts effectively. However, this approach should still yield better results than training a LoRA and merging it later.
If you're not comfortable coding, try pasting the script into an AI assistant and ask it how to disable specific modules.
But I think 12GB of VRAM is the absolute limit. The good news is that a used 12GB 3060 is around $150-200, so you could easily build a machine to train for less than renting for a few hundred hours.
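As a rough illustration of the "ff.net" suggestion, the sketch below filters parameter names so that anything containing a given substring is left out of training. It is a hedged example, not code from the Aozora training script: the function name `select_trainable` and the example parameter names are hypothetical, written to resemble typical UNet naming.

```python
def select_trainable(param_names, excluded_substrings=("ff.net",)):
    """Return the parameter names that stay trainable.

    Any name containing one of excluded_substrings is dropped, which is
    one simple way to exclude a whole module group by name.
    """
    return [
        name for name in param_names
        if not any(marker in name for marker in excluded_substrings)
    ]


# Hypothetical names for illustration only; real names depend on the
# model implementation and on how the script loads the UNet.
example_names = [
    "down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.1.attentions.0.transformer_blocks.0.ff.net.0.proj.weight",
    "down_blocks.1.attentions.0.transformer_blocks.0.ff.net.2.weight",
    "mid_block.resnets.0.conv1.weight",
]

trainable = select_trainable(example_names)

if __name__ == "__main__":
    for name in trainable:
        print(name)
```

In a PyTorch setup, the same filter would typically be applied while iterating over `named_parameters()`, calling `requires_grad_(False)` on the excluded tensors so that no gradients or optimizer state are kept for them.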
@Hysocs I'll definitely look into it. I was planning to upgrade to something with 16GB, but maybe I'll try it on 8GB just to look around and get used to it. Anyway, Your reply is an absolutely massive help for understanding, not only for me but for other users too. I hope more people notice this beautiful model and Your approach to it. Big thanks for Your work and for the help. You are amazing!
@candy_69 I turned my script into a GUI-based script with auto-installation, so you should be able to just run it without needing to know anything code-related. You can check it out on my GitHub.
@Hysocs Already checked that. Man, that's a big piece of work done right there. Thanks a lot!
The part about the ZZZ characters is worded pretty weirdly. I could only get the characters to work that most other models know as well. So no Burnice, Caesar, Yanagi, or Lighter. Miyabi yes, but the 1.5, 1.6, and 1.7 characters are missing as well. Okay, so after further testing I'm kind of impressed by what this can do. I have tested a lot of different models now and I always come back to it.
How are you prompting for them? The model isn't finished training, so it will get better, but it's also trained on Danbooru data, so it would be prompted like "1girl, d, bandeau, black jacket, burnice white". I can do continuation training for a more standard format if you provide it.
For the next training run I will use non-cleaned Danbooru tags, as that seems to be an issue. Look out for my next release; character recognition for ZZZ should be a lot better. It should be done training in a few days.
Surprisingly, a very good model. Other fine-tuned models always influence styles a lot, or they degrade on some concepts. Can't wait to see the update.
I'm intentionally avoiding stylistic overfitting with this model. By selectively training only key UNET layers and minimizing changes to the base, I can mostly prevent catastrophic forgetting or overfitting. I'm still experimenting to find the optimal balance, which is a slow process.
The result is a model with high stylistic neutrality and flexibility. This makes it an ideal, non-opinionated base that should be highly receptive to LoRAs, unlike models that have been heavily fine-tuned for a specific look.
A lot of people like models with a baked-in style, but I find that I would much rather have a model I can slap a LoRA onto when I want it.
@Hysocs I agree. NoobAI has lots of artists in it; baking a base style into the model just breaks it.
@NunFish I’m currently improving my dataset to fix the model’s color issues. NoobAI has been flooded with so much Danbooru content that it’s hard to find images it hasn’t seen from the easiest place to get data. Many other fine-tunes overuse Danbooru data, reshowing images and causing overfitting. Colors should be better in v0.2, and afterward for 0.3 I’ll work on fixing the round faces that I’m sure you may notice.
Pretty nice, good job! Gonna take a look at your training GUI for finetuning.
Just wondering, do you have in mind some hyperparams to fine-tune for just new knowledge? I may want to update the noob dataset (i.e., up to June 2025), but I'm not sure how to do it without breaking the model itself.
I’ve noticed that this model produces significantly fewer anatomical distortions at high resolutions (e.g., 1536x1536) compared to others. I was wondering, was it specifically trained for high-resolution generation?
It was trained at a max of 1152, but images weren't stretched or warped thanks to my bucketing system. Most of them were odd resolutions like 500x1152, etc., so it could have reduced the guessing and warping you see from models at higher resolutions. But in no way was it trained specifically for 1536, although most images were on the larger side.
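For readers unfamiliar with bucketing, the idea can be sketched as follows: instead of squashing every image to a square, each image is assigned to the training resolution whose aspect ratio is closest to its own. This is a generic sketch, not the author's bucketing system. Only the 1152 maximum side comes from the comment above; the minimum side, the 64-pixel step, the area cap, and the function names are assumptions made for illustration.

```python
import math


def make_buckets(max_side=1152, min_side=512, step=64, max_area=1024 * 1024):
    """Enumerate (width, height) buckets within the side and area limits."""
    sides = range(min_side, max_side + 1, step)
    return [(w, h) for w in sides for h in sides if w * h <= max_area]


def nearest_bucket(width, height, buckets):
    """Pick the bucket with the closest aspect ratio to the image.

    Ratios are compared in log space so that tall and wide mismatches
    are treated symmetrically. Ties are broken in favour of the bucket
    with the larger area.
    """
    ratio = width / height
    return min(
        buckets,
        key=lambda b: (abs(math.log((b[0] / b[1]) / ratio)), -(b[0] * b[1])),
    )


if __name__ == "__main__":
    buckets = make_buckets()
    for size in ((500, 1152), (1000, 1000), (1920, 1080)):
        print(size, "->", nearest_bucket(*size, buckets))
```

Under these assumptions, a tall 500x1152 image lands in the 512x1152 bucket, so it only needs a slight resize instead of being stretched to a square.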
If you do some generations, you will sometimes second-guess whether it's a v-pred model. To be honest, I think training specific layers on the v-pred foundation is the cause; for some odd reason it has stabilized itself, and I'm not 100% sure why. It performs more like an eps model than a v-pred one, as it's less muddy. You might even call it broken, but in a good way. It's not perfect, but compared to other v-pred models I use, I would say this model produces better hands and features 60-70% of the time, compared to 40-50% for other models.
@Hysocs Thank you for the detailed explanation! Whatever the technical reason may be, it’s clear that this model produces significantly fewer distortions in hands, feet, and anatomy during high-resolution generation or upscaling compared to others.
This is going to be my go-to model for a while. Thanks again for sharing such a great model!