CivArchive
    Z-Image Trained Text Encoder - FP32
    NSFW

    Qwen 3_4_B Trained Text Encoder for Z-Image

    FP32

    • Full Finetune at FP32 (Full Model Finetune - All Parameters & All layers)

    • FP32 Finetune of QWEN3_4b focusing on describing human features, with SFW/NSFW captions.

    • Can be run in FP32 with no time loss on most machines that use CPU offloading.

    BF16

    • Full Finetune at BF16 (20 Layers)

    • Long text descriptions (500-1000 token length) focusing on describing human features.

    • For use with Z-Image or Z-Image Turbo


    • Comparison Images showing QWEN base VS Human Corpus HERE
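    The FP32 clip flag discussed in the comments is a ComfyUI launch option; without it, ComfyUI casts text encoders to half precision and the extra FP32 weights go unused. A minimal launch sketch, assuming a standard ComfyUI checkout:

```shell
# Keep text encoders in FP32 instead of ComfyUI's default half-precision cast
python main.py --fp32-text-enc
```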


    Comments (26)

    Brewce · Feb 16, 2026

    What is the difference with the abliterated version (I use the GGUF Q8 from hui_hui on hf)?

    Felldude
    Author
    Feb 16, 2026

    So unless you run with the highvram flag, the GGUF TE would be cast to FP32 by the CPU; in most cases this is slower for inference.

    To my knowledge, an FP32 training of QWEN_3_4b has not been done prior to this.

    Brewce · Feb 16, 2026 · 1 reaction

    I can confirm it's faster (unless you don't have the file on NVMe).

    Polygon · Feb 16, 2026

    What's the difference between FP32 and BF16? Any qualitative distinction? Which is higher quality? I'm assuming FP32.

    Felldude
    Author
    Feb 16, 2026

    FP32 would be the choice for most cases, but you need to set the FP32 clip flag. The FP32 has a higher level of training than the BF16, so it will naturally produce prompts with more divergence.

    lowkeylayers · Feb 16, 2026

    What is the difference between this FP16 and the normal Qwen3 4B FP16 one? Does it have more prompt adherence or better knowledge than the default one?

    Felldude
    Author
    Feb 16, 2026 · 1 reaction

    In some cases, yes. It can help retrieve details on base Z, but it would be very well suited for NSFW finetunes or LoRAs.

    InvictusAI · Feb 18, 2026 · 4 reactions

    From a technical standpoint it's impossible to achieve what you have set out to do. Z-Image is trained using embeddings from Qwen3_4b, that is what we call a "done deal". By finetuning the TE you are now conditioning the diffuser against vectors it wasn't trained on. The embedding geometry has changed so drastically that Z-Image is no longer able to access even some high-level concepts like "right or left" properly. The only thing I can say has slightly improved is waist-up female nudity, everything else is degraded because Z-Image doesn't know what to do with those new embedding strengths. While I admire your inquisitive mindset and knack for experimentation, ask yourself this: We already have single-concept checkpoints, do we really need single-concept text encoders? Save the compute and train a LoRA next, I implore you.

    Felldude
    Author
    Feb 18, 2026

    This is not a single-concept LLM. Respectfully, post the missing high-level concepts, seed to seed, and I will look into it; but your conclusion is incorrect.

    Training only the QWEN3_4b LLM: slightly better prompt interpretation, minor impact on image output fidelity.

    Full Z-Image training (diffusion + conditioning): major improvements in visual quality, detail, style, etc.

    InvictusAI · Feb 18, 2026

    @Felldude I've posted a comparison image which highlights just a handful of low- and high-level concept failures that randomly became apparent with the prompt I was working on today.

    Diioo · Feb 19, 2026 · 1 reaction

    @InvictusAI I saw your images and, based on their description, they were generated with the limited version of BF16. It would be interesting to see the same result with the FP32 version.

    Felldude
    Author
    Feb 19, 2026

    @Diioo It would be, preferably in FP32 mode, as it gets cast to that for CPU offloading anyway (which costs speed and precision on all but the newest CPUs).

    velanteg · Feb 20, 2026

    @Diioo The FP32 version is useless for 99.99% of users due to VRAM and RAM limitations. Most people do not have 64 GB of RAM to fit both the diffusion model and such a large text encoder at the same time.

    Felldude
    Author
    Feb 20, 2026

    @velanteg Query ChatGPT about that: unless you have the highvram flag set or can safely fit both models into VRAM, the LLM gets offloaded to the CPU. If it is on the CPU, it is upcast to FP32 in most cases. Google or ask ChatGPT "what CPUs support FP16 and BF16"; the list is fairly small.

    InvictusAI · Feb 21, 2026

    @Diioo The issue here is not the precision or size, it's boring old maths. The diffusion model is trained against a text encoder which produces a very specific embedding geometry during training, and that geometry is locked into the model. ANY change to the text encoder's weights will change the embedding geometry and cause a misalignment, and the more tuning happens, the worse it gets, as seen here. It's not that the precision of the vector location is lacking; it's that the locations of certain vectors have moved. Imagine you change the meaning of every word in a dictionary to the word that comes after it, except for a handful of words you want to focus on. You couldn't communicate with anyone anymore about anything but that one topic. That's why this model might excel at producing plain old nudes, but now struggles with more or less niche concepts. Still, I might try the full FP32 model just for posterity and see how that compares.
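    The "moved vectors" argument can be made concrete by measuring how far a token's embedding rotates between the stock and finetuned encoders. A toy sketch with made-up four-dimensional vectors (real Qwen3 hidden states are much higher-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for one token's embedding before/after tuning
base_embedding  = [0.2, -0.5, 0.8, 0.1]    # stock text encoder
tuned_embedding = [0.25, -0.45, 0.7, 0.3]  # finetuned text encoder

drift = 1.0 - cosine(base_embedding, tuned_embedding)
print(f"geometry drift: {drift:.4f}")  # 0 = unchanged; larger = more misalignment
```

    Averaging this score over many prompts would give a rough picture of how far the finetune has moved from the geometry the diffuser was trained against.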

    InvictusAI · Feb 21, 2026

    @Felldude I don't want to sound too negative here; I really appreciate your effort at this, as being able to create improved text encoders would be a dream. Have you tried fine-tuning the image model with a varied dataset, but using your new TE instead of stock Qwen3 during training, just to bring the two back into alignment? Theoretically it wouldn't even take a lot of training, just enough to make sure that the embeddings match what Z-Image expects from the TE.

    Felldude
    Author
    Feb 21, 2026

    @InvictusAI If you're wondering what the end goal of the FP32 caption features is, it's currently stuck behind CivitAI jank, but: https://civitai.com/models/2409672

    Felldude
    Author
    Feb 21, 2026

    @InvictusAI Regarding the math: I am training the full model, not just the hidden outputs. The idea is that the LLM will fill in some of the blanks so you will not need to fill in every detail. Yes, it is certainly possible to ruin an LLM with training, but I tracked every 10 token predictions to see what the semantic drift was... and yes, that did slow down the training, but it was worth it for a "visual" of what the model was seeing/drifting.

    InvictusAI · Feb 22, 2026 · 2 reactions

    @Felldude I posted another image comparing your fp32 model with a fp8_scaled version of Qwen3. I've gotta say the rift is smaller here, but the concept dilution is still immutable. Next I'll compare your fp32 to the original Qwen recommended with Z-Image.

    So it's entirely possible that I don't have a full grasp on how Z-Image handles its TE, but if it's like most diffusers, then checking for semantic drift every 10 token predictions is wasted effort. Unless you're using the TE in CoT mode, prediction never occurs; after the forward pass, the "LLM" part has done its job, no autoregression takes place, and there is nothing to predict. It's just Tokenizer -> Forward pass -> Hidden states/embedding, done. But I might be wrong.

    Unless you're monitoring semantic drift in the hope that the embedding geometry is maintained; that's fine, I guess, but it seems like an impossibly large task to do comprehensively.

    I do like that your model managed to, for example, improve the quality of generated nipples without changing anything about the diffuser weights at all. That's pretty impressive, but it does come at a clear cost, and you could get the same result with a LoRA without the same tradeoff.

    Surprisingly to me, FP32 indeed performs better than FP16, so hats off to that; I hope you'll find a way to train the LLM without the "losses" in the future. But keep in mind that any change in embedding geometry will cause the diffuser to expect something the TE can no longer provide. If you find a way around that, you will have won the game.

    Felldude
    Author
    Feb 22, 2026

    @InvictusAI DiT models do not use the LLM head, only the hidden state of the QWEN. A full finetune, unless I am wrong about this, could use a frozen QWEN or even pre-computed caption hidden states without QWEN loaded. The issue with a transformer-only model would be how deterministic the output is from the VAE-generated latent vs the transformer.

    The input is very much one latent, one caption feature.

    The reason I trained QWEN was to align it with a vision model. I chose Llama to align it with, as it is more tenable than QWEN vision plus a large LLM.
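    The "no autoregression" point both sides agree on can be sketched with a toy encoder (hypothetical stand-in functions, nothing like the real Qwen3 internals): when conditioning a diffusion model, the pipeline stops at the hidden states and the LM head is never invoked.

```python
def tokenize(prompt):
    # Stand-in tokenizer: one fake token id per word
    return [hash(word) % 1000 for word in prompt.split()]

def encoder_forward(token_ids):
    # Stand-in forward pass: one small feature vector per token.
    # These per-token hidden states are what the DiT cross-attends to.
    return [[(t * k) % 7 / 7.0 for k in range(1, 5)] for t in token_ids]

def lm_head(hidden_states):
    # Only needed for autoregressive text generation -- never called here
    raise RuntimeError("unused when the encoder only supplies conditioning")

cond = encoder_forward(tokenize("a photo of a cat"))
# Tokenizer -> forward pass -> hidden states, done: no next-token prediction
```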

    070809 · Mar 21, 2026

    Quick question, for VRAM that can comfortably fit the model, TE, and VAE, do you think there would be a noticeable difference between FP32 and BF16?

    I know ComfyUI autocasts FP32 to BF16 on my GPU for models already unless I use Kijai's loader, but I have enough VRAM to never need CPU offloading.

    Felldude
    Author
    Mar 21, 2026 · 1 reaction

    With CLIP, yes; for an LLM it's not as likely. I did 1k images comparing CLIP in FP32 to FP16 or BF16, and it consistently had more errors with hand placement etc. But with a DiT and an LLM I would guess it matters less; granted, I have not tested it in the same way.
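    For a sense of scale on the precision question, here is a stdlib-only sketch that emulates an FP32 -> BF16 cast by truncating the mantissa to bfloat16's 7 explicit bits, which bounds the relative rounding error at roughly 2^-7, i.e. under 1% per value:

```python
import struct

def to_bf16(x: float) -> float:
    """Round-to-zero emulation of a float32 -> bfloat16 cast:
    keep the sign bit, the 8 exponent bits, and the top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 0.1234567
error = abs(value - to_bf16(value))
print(f"absolute error: {error:.2e}, relative error: {error / value:.2e}")
```

    Each individual cast loses well under 1%, but errors like this compound across layers, which is one plausible reason small placement details shift between precisions.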

    Felldude
    Author
    Mar 21, 2026

    I think the 5090 could fit all models in VRAM in NF4; this might be the only such case among the large models. The other case for advocating an FP8 or BF16 text model offloaded to CPU would be when your RAM is not sufficient to cover the GPU cache and fully load the FP32 text LLM.

    Felldude
    Author
    Mar 21, 2026

    Oh, or the third being that you have a CPU with AVX-512 support; PyTorch appears to have added support for this.
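    The AVX-512 point is easy to verify locally. A Linux-only sketch (it simply reports False where /proc/cpuinfo does not exist, e.g. on macOS or Windows):

```python
from pathlib import Path

def cpu_has_avx512() -> bool:
    """Check the kernel's advertised CPU feature flags for any AVX-512 extension."""
    cpuinfo = Path("/proc/cpuinfo")
    if not cpuinfo.exists():
        return False  # unknown on non-Linux systems
    return "avx512" in cpuinfo.read_text()

print("AVX-512:", cpu_has_avx512())
```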

    070809 · Mar 21, 2026 · 1 reaction

    Thanks for the reply. I have 47.8 GiB of VRAM. Loading the full weights for ZIT uses about 22GiB.

    I tried the FP32 just to see how much space it would take but even with the "--fp32-text-enc" it seems to be converting to BF16 on my Blackwell card. With and without the cli flag I use about 29GiB.

    KJ has loader nodes to force weights for models and VAE, but nothing for CLIP/TE unfortunately.

    All runs with the default qwen_3_4b and your FP32 show:

    VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16

    CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16

    Requested to load ZImageTEModel_

    Model ZImageTEModel_ prepared for dynamic VRAM loading. 15343MB Staged. 0 patches attached. Force pre-loaded 145 weights: 766 KB.

    model weight dtype torch.bfloat16, manual cast: None

    model_type FLOW

    Requested to load Lumina2

    Model Lumina2 prepared for dynamic VRAM loading. 11739MB Staged. 0 patches attached.

    With the only difference being 7671MB Staged with Qwen vs 15343MB Staged with your FP32.

    * Sorry for using Gibibytes, but that's what NVTOP reports in and it's the easiest way for me to monitor these things. And apparently comments delete markdown formatting...

    Felldude
    Author
    Mar 21, 2026

    @070809 That is an interesting one. You could modify the model_management.py file to force CPU for CLIP if it's Comfy. For GPU accelerators and ECC there might be more at work.

    Checkpoint
    ZImageTurbo

    Details

    Downloads
    1,044
    Platform
    CivitAI
    Platform Status
    Available
    Created
    2/16/2026
    Updated
    5/4/2026
    Deleted
    -

    Files

    zImageTrainedText_fp32.safetensors

    Mirrors