Qwen3-4B Trained Text Encoder for Z-Image
FP32
Full Finetune at FP32 (full-model finetune: all parameters, all layers)
FP32 finetune of QWEN3_4b trained on SFW/NSFW captions focused on describing human features.
Can be run in FP32 with no speed penalty on most machines that use CPU offloading.
BF16
Full Finetune at BF16 (20 Layers)
Trained on long text descriptions (500-1000 tokens) focused on describing human features.
For use with Z-Image or Z-Image Turbo
Comparison Images showing QWEN base VS Human Corpus HERE
Comments (26)
What is the difference from the abliterated version (I use the GGUF Q8 from hui_hui on HF)?
So unless you run with the high-VRAM flag, the GGUF TE gets cast to FP32 on the CPU; in most cases this is slower for inference.
To my knowledge, an FP32 training of QWEN3_4b has not been done prior to this.
I can confirm it's faster (unless you don't have the file on NVMe).
What's the difference between FP32 and BF16? Any qualitative distinction? Which is higher quality, I'm assuming FP32.
FP32 would be the choice for most cases, but you need to set the FP32 clip flag (--fp32-text-enc). The FP32 release has a higher level of training than the BF16, so it will naturally produce results with more divergence.
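If you want to confirm which precision a downloaded text-encoder file actually contains (rather than trusting the filename), here is a minimal sketch using the safetensors library; the file path below is a placeholder for whichever release you are checking:

```python
# Minimal sketch: report the storage dtypes inside a text-encoder .safetensors file.
# The path below is a placeholder - point it at whichever TE file you downloaded.
from collections import Counter
from safetensors import safe_open

path = "qwen3_4b_finetune_fp32.safetensors"  # hypothetical filename

dtypes = Counter()
with safe_open(path, framework="pt") as f:
    for key in list(f.keys())[:8]:  # sampling a few tensors is enough to see the storage dtype
        dtypes[str(f.get_tensor(key).dtype)] += 1

print(dtypes)  # expect torch.float32 for the FP32 release, torch.bfloat16 for the BF16 one
```

Note that what ends up in VRAM can still differ from the on-disk dtype if the UI autocasts, as discussed further down the thread.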
what is the difference between this fp16 and the normal Qwen3 4B fp16 one? does it have more prompt adherence or better knowledge than the default one?
In some cases, yes. It can help retrieve details on base Z, but it would be very well suited for NSFW finetunes or LoRAs.
From a technical standpoint it's impossible to achieve what you have set out to do. Z-Image is trained using embeddings from Qwen3_4b, that is what we call a "done deal". By finetuning the TE you are now conditioning the diffuser against vectors it wasn't trained on. The embedding geometry has changed so drastically that Z-Image is no longer able to access even some high-level concepts like "right or left" properly. The only thing I can say has slightly improved is waist-up female nudity, everything else is degraded because Z-Image doesn't know what to do with those new embedding strengths. While I admire your inquisitive mindset and knack for experimentation, ask yourself this: We already have single-concept checkpoints, do we really need single-concept text encoders? Save the compute and train a LoRA next, I implore you.
This is not a single-concept LLM. Respectfully, post the missing high-level concepts, seed to seed, and I will look into it, but regarding your conclusion: it is incorrect.
Training only the QWEN3_4b LLM: slightly better prompt interpretation, minor impact on image output fidelity.
Full Z-Image training (diffusion + conditioning): major improvements in visual quality, detail, style, etc.
@Felldude I've posted a comparison image which highlights just a handful of low- and high-level concept failures that randomly became apparent with the prompt I was working on today.
@InvictusAI I saw your images and, based on their description, they were generated with the limited version of BF16. It would be interesting to see the same result with the FP32 version.
@Diioo It would be, preferably in FP32 mode, as it gets cast to that for CPU offloading anyway (which costs speed and precision on all but the newest CPUs).
@Diioo The FP32 version is useless for 99.99% of users due to VRAM and RAM limitations. Most people don't have 64 GB of RAM to fit both the diffusion model and such a large text encoder at the same time.
@velanteg Ask ChatGPT about that: unless you have the highvram flag set or can safely fit both models into VRAM, the LLM gets offloaded to the CPU. If it is on the CPU, it is upcast to FP32 in most cases. Google or ask ChatGPT "what CPUs support FP16 and BF16" - the list is fairly small.
@Diioo The issue here is not the precision or size, it's boring old maths: the diffusion model is trained against a text encoder which produces a very specific embedding geometry during training, and that geometry is locked into the model. ANY change to the text encoder's weights will change the embedding geometry and cause a misalignment. The more tuning happens, the worse it gets, as seen here. It's not that the precision of the vector locations is lacking, it's that the locations of certain vectors have moved. Imagine you change the meaning of every word in a dictionary to the word that comes after it, except for a handful of words you want to focus on. You couldn't communicate with anyone anymore about anything but that one topic. That's why this model might excel at producing plain old nudes, but now struggles with more or less niche concepts. Still, I might try the full fp32 model just for posterity and see how that compares.
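One way to put a number on "the embedding geometry has changed" (my own illustration, not something either commenter ran) is to push the same prompt through the stock encoder and the finetune and compare per-token hidden states. A rough sketch with the transformers library; the model identifiers are assumptions, and the script needs enough RAM to hold two 4B models:

```python
# Rough sketch: quantify how far the finetuned TE's hidden states drift from stock
# Qwen3-4B for the same prompt. Model paths below are assumptions/placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

BASE = "Qwen/Qwen3-4B"                  # assumed HF id of the stock encoder
TUNED = "./qwen3_4b_finetune_fp32"      # hypothetical local path to the finetune

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModel.from_pretrained(BASE, torch_dtype=torch.float32).eval()
tuned = AutoModel.from_pretrained(TUNED, torch_dtype=torch.float32).eval()

prompt = "A woman raising her right hand, photographed from her left side."
ids = tok(prompt, return_tensors="pt")

with torch.no_grad():
    h_base = base(**ids).last_hidden_state[0]    # [tokens, hidden]
    h_tuned = tuned(**ids).last_hidden_state[0]

cos = torch.nn.functional.cosine_similarity(h_base, h_tuned, dim=-1)
print(f"mean per-token cosine similarity: {cos.mean().item():.4f} (1.0 = geometry unchanged)")
```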
@Felldude I really don't want to sound too negative here, I really appreciate your effort at this, being able to create improved text encoders would be a dream. Have you tried fine-tuning the image model with a varied dataset, but using your new TE instead of stock Qwen3 during training? Just to bring the two back into alignment? Theoretically it wouldn't even take a lot of training, just a bit to make sure that the embeddings match what Z-Image expects from the TE.
@InvictusAI If you're wondering what the end goal of the FP32 Caption Features is, it's currently stuck behind Civitai jank, but - https://civitai.com/models/2409672
@InvictusAI Regarding the math, I am training the full model, not just the hidden outputs; the idea is that the LLM will fill in some of the blanks so you will not need to spell out every detail. Yes, it is certainly possible to ruin an LLM with training, but I tracked predictions every 10 tokens to see what the semantic drift was... and yes, that did slow down the training, but it was worth it for a "visual" of what the model was seeing/drifting toward.
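For readers wondering what "tracking predictions to get a visual of the drift" can look like in practice, here is a rough illustration (mine, not the actual training code): every N steps, greedy-decode a fixed probe caption with the model being trained and print it next to the frozen base model's continuation.

```python
# Illustration only - not the author's training code. The probe prompt and interval are
# arbitrary; both models must be causal LMs with their LM heads attached.
import torch

PROBE = "Describe the subject: a person standing with their right arm"

@torch.no_grad()
def probe_continuation(model, tokenizer, max_new_tokens=24):
    ids = tokenizer(PROBE, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)

# Inside the training loop, with base_model a frozen copy of the starting weights:
# if step % 10 == 0:
#     print(f"step {step:>6}  base : {probe_continuation(base_model, tokenizer)}")
#     print(f"step {step:>6}  tuned: {probe_continuation(model, tokenizer)}")
```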
@Felldude I posted another image comparing your fp32 model with an fp8_scaled version of Qwen3. I've gotta say the rift is smaller here, but the concept dilution is still unmistakable. Next I'll compare your fp32 to the original Qwen recommended with Z-Image.
So it's entirely possible that I don't have a full grasp on how Z-Image handles its TE, but if it's like most diffusers, then checking for semantic drift every 10 token predictions is wasted effort. Unless you're using the TE in CoT mode, prediction never occurs: after the forward pass the "LLM" part has done its job, no autoregression takes place, there is nothing to predict. It's just Tokenizer -> Forward pass -> Hidden states/embedding, done. But I might be wrong.
Unless you're monitoring semantic drift in the hope that the embedding geometry is maintained - that's fine, I guess, but it seems like an impossibly large task to do comprehensively.
I do like that your model managed to, for example, improve the quality of generated nipples without changing anything about the diffuser weights at all. That's pretty impressive, but it does come at a clear cost, and you could get the same result with a LoRA without that tradeoff.
Surprisingly to me, fp32 indeed performs better than fp16, so hats off to that. I hope you'll find a way to train the LLM without the "losses" in the future, but keep in mind that any change in embedding geometry will cause the diffuser to expect something the TE can no longer provide; if you find a way around that, you will have won the game.
@InvictusAI DiT models do not use the LLM head, only the hidden state of the QWEN. A full finetune, unless I am wrong about this, could use a frozen QWEN or even pre-computed caption hidden states without QWEN loaded. The issue with a transformer-only model would be how deterministic the output is from the VAE-generated latent vs. the transformer.
The input is very much one latent, one caption feature.
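On the pre-computed caption hidden states point: since the DiT only consumes the encoder's hidden states, those can in principle be cached once per caption and the LLM dropped from memory for the rest of diffusion training. A hedged sketch of that caching step; the encoder id, output layer, and file names are assumptions, and Z-Image's actual conditioning hook may differ:

```python
# Hedged sketch: cache caption hidden states so the text encoder need not stay loaded
# during diffusion training. Whether Z-Image conditions on the last hidden state or an
# intermediate layer is an assumption here - check the reference pipeline first.
import torch
from transformers import AutoModel, AutoTokenizer
from safetensors.torch import save_file

ENCODER = "Qwen/Qwen3-4B"   # assumed HF id; swap in the finetuned TE path if desired
captions = ["a portrait of a woman, waist up", "a man waving with his left hand"]

tok = AutoTokenizer.from_pretrained(ENCODER)
enc = AutoModel.from_pretrained(ENCODER, torch_dtype=torch.float32).eval()

cache = {}
with torch.no_grad():
    for i, text in enumerate(captions):
        ids = tok(text, return_tensors="pt")
        cache[f"caption_{i}"] = enc(**ids).last_hidden_state[0].contiguous()  # [tokens, hidden]

save_file(cache, "caption_features.safetensors")  # the diffusion trainer loads these instead of QWEN
```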
The reason I trained QWEN was to align it with a vision model. I chose Llama to align it with, as that is more tenable than QWEN vision plus a large LLM.
Quick question, for VRAM that can comfortably fit the model, TE, and VAE, do you think there would be a noticeable difference between FP32 and BF16?
I know ComfyUI autocasts FP32 to BF16 on my GPU for models already unless I use Kijai's loader, but I have enough VRAM to never need CPU offloading.
With CLIP, yes; for an LLM it's not as likely. I did 1k images comparing CLIP in FP32 to FP16 or BF16, and it consistently had more errors with hand placement etc. But with a DiT and an LLM I would guess it matters less; granted, I have not tested it in the same way.
I think the 5090 could fit all the models in VRAM in NF4; this might be the only such case among the large models. The other case for advocating an FP8 or BF16 text model that is offloaded to CPU would be when your RAM is not sufficient to cover the GPU cache and fully load the FP32 text LLM.
Oh, or the third being that you have a CPU with AVX-512 support; PyTorch appears to have added support for this.
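If you want to see what your own PyTorch build reports for the CPU, recent versions expose a capability query; a quick check (the matmul timing is only a crude sanity test, not a benchmark):

```python
# Quick check of the CPU ISA level PyTorch dispatched to, plus a crude fp32-vs-bf16 matmul timing.
import time
import torch

print("CPU capability:", torch.backends.cpu.get_cpu_capability())  # e.g. "AVX2" or "AVX512"

a = torch.randn(2048, 2048)
for dt in (torch.float32, torch.bfloat16):
    x = a.to(dt)
    t0 = time.perf_counter()
    for _ in range(5):
        torch.matmul(x, x)
    print(dt, f"{time.perf_counter() - t0:.3f}s")
# If bf16 is not clearly faster than fp32 here, an offloaded bf16 TE gains little over fp32 on this CPU.
```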
Thanks for the reply. I have 47.8 GiB of VRAM. Loading the full weights for ZIT uses about 22GiB.
I tried the FP32 just to see how much space it would take, but even with the "--fp32-text-enc" flag it seems to be converting to BF16 on my Blackwell card. With and without the CLI flag I use about 29 GiB.
KJ has loader nodes to force weights for models and VAE, but nothing for CLIP/TE unfortunately.
All runs with the default qwen_3_4b and your FP32 show:
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ZImageTEModel_
Model ZImageTEModel_ prepared for dynamic VRAM loading. 15343MB Staged. 0 patches attached. Force pre-loaded 145 weights: 766 KB.
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load Lumina2
Model Lumina2 prepared for dynamic VRAM loading. 11739MB Staged. 0 patches attached.
With the only difference being 7671MB Staged with Qwen vs 15343MB Staged with your FP32.
* Sorry for using Gibibytes, but that's what NVTOP reports in and it's the easiest way for me to monitor these things. And apparently comments delete markdown formatting...
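For what it's worth, the two staged sizes above line up with plain parameter-count arithmetic for a roughly 4B-parameter encoder (the parameter count below is an approximation):

```python
# The staged sizes reported above (~7,671 MB vs ~15,343 MB) are roughly
# parameter count x bytes per weight for a ~4B-parameter encoder.
params = 4.02e9  # approximate Qwen3-4B parameter count (assumption)
for name, bytes_per_weight in (("BF16", 2), ("FP32", 4)):
    print(f"{name}: ~{params * bytes_per_weight / 1024**2:,.0f} MiB")
# -> BF16: ~7,668 MiB, FP32: ~15,335 MiB
```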
@070809 That is an interesting one; you could modify the model_management.py file to force CPU for CLIP if it's Comfy. For GPU accelerators and ECC there might be more at work.
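For anyone wanting to try that edit, a hedged sketch of what is meant: in recent ComfyUI versions the text-encoder device comes from text_encoder_device() in comfy/model_management.py (verify the exact name and logic against your local copy), and pinning it to CPU looks roughly like this:

```python
# Hedged sketch of a local ComfyUI hack in comfy/model_management.py - the function name
# and surrounding logic may differ between versions, so verify against your own copy.
import torch

def text_encoder_device():
    # The stock logic picks GPU or CPU based on free VRAM and launch flags;
    # returning CPU unconditionally forces the text encoder to stay offloaded.
    return torch.device("cpu")
```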