What is this?
This is a simple workflow for Z-image base that produces high quality, extremely realistic images at high native resolution. There are also tips for using Z-image base below and some general info you might find helpful. There's now also a separate inpainting workflow.
The sampler settings are geared towards sharpness and clarity, but you can introduce grain and other defects through prompting.
All the images attached to the image-gen post were generated directly with this workflow with no further editing.
Read on for update details and all the info you need to get started.
Update 2026-03-29: Upgraded Inpainting Workflow
I found a node that blends images together with semi-transparent masks. This means we can blend the old image with the inpainted area much more effectively now, completely eliminating the seams that would sometimes be visible around the masked area.
TLDR: download the inpainting workflow again, and get the LayerStyle nodes from the list below. You'll get much better results than before.
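For the curious, the core of the technique is simple alpha compositing with a feathered mask. Here's a conceptual sketch in Python (torch + torchvision) of what the blend step does - this is just the general idea, not the LayerStyle node's actual code:

```python
import torch
import torchvision.transforms.functional as TF

def seamless_blend(original, inpainted, mask, blur_kernel=31):
    # original/inpainted: float tensors [C, H, W] in 0..1
    # mask: [1, H, W], 1.0 inside the inpainted area, 0.0 outside
    soft_mask = TF.gaussian_blur(mask, kernel_size=blur_kernel)  # feather the hard edge
    return inpainted * soft_mask + original * (1.0 - soft_mask)  # alpha composite
```

Because the mask fades out gradually instead of cutting off, there's no hard boundary between the old and new pixels, which is what kills the seams.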
Update 2026-03-12: New inpainting workflow
Turns out Z-image base is really good at inpainting! I've put together an easy-to-use workflow for it, which is an expansion of the original workflow. It's not as simple to follow what it's doing though, so start with the normal workflow first if you're new. You can also use the workflow for localised refinement (e.g. fixing faces or hands), or as a simple image-to-image generator too.
All the same info below applies, and the workflow itself has a bunch of info inside about how it works and how to use it. Long story short, I've made some small improvements to how inpainting is done (with any model) in ComfyUI, so you can apply the same methodology to other models if you want improvements there. Read the text boxes to see what it's doing!
You'll also need an additional custom node set to use it (more info below): ComfyUI_essentials
Here's an album of all the separate example images attached to the post, including the masks I used for them: g-drive
If you want examples/info on unsafe-for-work ways to use this, I made a post on reddit about it: https://www.reddit.com/r/unstable_diffusion/comments/1rrt0ny/degenerate_inpainting_with_zimage_base_workflow/
Nodes & Models
Custom Nodes:
RES4LYF - A very popular set of samplers & schedulers, and some very helpful nodes. These are needed to get the best z-image base outputs, IMO.
RGTHREE - (Optional) A popular set of helper nodes. If you don't want this you can just delete the seed generator and lora stacker nodes, then use the default comfy lora nodes instead. RES4LYF comes with a seed generator node as well, I just like RGTHREE's more.
ComfyUI GGUF - (Optional) Lets you load GGUF models, which for some reason ComfyUI still can't do natively. If you want to use a non-GGUF model you can just skip this, delete the UNET loader node and replace it with the normal 'load diffusion model' node.
ComfyUI Essentials - (Inpaint workflow only) Adds a bunch of very helpful nodes. We're using it specifically for its number comparison node so we can switch between the image-to-image and inpainting modes automatically.
ComfyUI LayerStyle - (Inpaint workflow only) Adds a ton of nodes for image transformations, similar to the tools in photoshop. We're using this for its image blending node, which allows us to blend two images using a semi-transparent mask.
Models:
Main model: Z-image base GGUFs - BF16 recommended if you have 16GB+ VRAM. Q8 will just barely fit on 8GB VRAM if you know what you're doing. Q6_k will fit easily in 8GB. Avoid using FP8, the Q8 gguf is better.
Text Encoder: Normal | gguf Qwen 3 4B Text Encoder - Grab the biggest one that fits in your VRAM: the full non-GGUF one if you have 10GB+ VRAM, or the Q8 GGUF if you have less. Some people say text encoder quality doesn't matter much & to use a smaller one, but it absolutely does matter and can drastically affect quality. For the same reason, do not use an abliterated text encoder unless you've tested it and compared outputs to ensure the quality doesn't suffer.
If you're using the GGUF text encoder, swap out the "Load CLIP" node for the "ClipLoader (GGUF)" node.
VAE: Flux 1.0 AE
Info & Tips
Sampler Settings
I've found that a two-stage sampler setup gives very good results for z-image base. The first stage does 95% of the work, and the second does a final little pass with a low noise scheduler to bring out fine details. It produces very clear, very realistic images and is particularly good at human skin.
CFG 4 works most of the time, but you can go up as high as CFG 7 to get different results.
This is all with shift 1. If you don't know what that is, don't worry - it's the default!
Stage 1:
Sampler - res_2s
Scheduler - beta
Steps - 22
Denoise: 1.00
Stage 2:
Sampler - res_2s
Scheduler - normal
Steps - 3
Denoise: 0.15
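If you prefer reading settings as code, here's a minimal sketch of the two-stage flow. Note that `sample()` is a hypothetical stand-in for a KSampler call, not a real ComfyUI function:

```python
def two_stage(model, latent, positive, negative, seed):
    # Stage 1: full denoise does ~95% of the work.
    latent = sample(model, latent, positive, negative, seed=seed,
                    sampler_name="res_2s", scheduler="beta",
                    steps=22, cfg=4.0, denoise=1.00)
    # Stage 2: short low-denoise pass to bring out fine details.
    latent = sample(model, latent, positive, negative, seed=seed,
                    sampler_name="res_2s", scheduler="normal",
                    steps=3, cfg=4.0, denoise=0.15)
    return latent
```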
Resolutions
High res generation
One of the best things about Z-image in general is that it can comfortably handle very high resolutions compared to other models. You can gen in high res and use an upscaler immediately without needing to do any other post-processing.
(info on upscalers + links to some good ones further below)
Note: high resolutions take a long time to gen. A 1280x1920 shot takes ~95 seconds on an RTX 5090, and a 1680x1680 shot takes ~110 seconds.
Different sizes & aspect ratios change the output
Different resolutions and aspect ratios can often drastically change the composition of images. If you're having trouble getting something ideal for a given prompt, try using a higher or lower resolution or changing the aspect ratio.
It will change the amount of detail in different areas of the image, make it more or less creative (depending on the topic), and will often change the lighting and other subtle features too.
I suggest generating in one big and one medium resolution whenever you're working on a concept, just to see if one of the sizes works better for it.
Good resolutions
The workflow has a variety of pre-set resolutions that work very well. They're grouped by aspect ratio, and they're all divisible by 16. Z-image base (as with most image models) works best when dimensions are divisible by 16, and some models require it or else they mess up at the edges.
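If you're rolling your own resolutions, snapping dimensions to a multiple of 16 is a one-liner (illustrative Python, not part of the workflow):

```python
def snap16(w, h):
    # Round each dimension down to the nearest multiple of 16.
    return (w // 16) * 16, (h // 16) * 16

print(snap16(1930, 1090))  # -> (1920, 1088)
```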
Here's a picture of the different resolutions if you don't want to download the workflow: imgbb | g-drive
You can go higher than 1920 to a side, but I haven't done it much so I'm not making any promises. Things do tend to get a bit weird when you go higher, but it is possible.
I do most of my generations at 1920 to a side, except for square images which I do at 1680x1680. I sometimes use a lower resolution if I like how it turns out more (e.g. the picture of the rat is 1680x1120).
Realism Negative Prompt
The negative prompt matters a lot with z-image base. I use the following to get consistently good realism shots:
3D, ai generated, semi realistic, illustrated, drawing, comic, digital painting, 3D model, blender, video game screenshot, screenshot, render, high-fidelity, smooth textures, CGI, masterpiece, text, writing, subtitle, watermark, logo, blurry, low quality, jpeg, artifacts, grainy

Prompt Structure
You essentially just want to write clear, simple descriptions of the things you want to see. Your first sentence should be a basic intro to the subject of the shot, along with the style. From there you should describe the key features of the subject, then key features of other things in the scene, then the background. Then you can finish with compositional info, lighting & any other meta information about the shot.
Use new lines to separate key parts out to make it easier for you to read & build the prompt. The model doesn't care about new lines, they're just for you.
If something doesn't matter to you, don't include it. You don't need to specify the lighting if it doesn't matter, you don't need to precisely say how someone is posed, etc; just write what matters to you and slowly build the prompt out with more detail as needed.
You don't need to include parts that are implied by your negative prompt. If you're using the realism negative prompt I mentioned earlier, you don't usually need to specify that it's a photograph.
Your structure should look something like this (just an example, it's flexible):
A <style> shot of a <subject + basic description> doing <something>. The <subject> has <more detail>. The subject is <more info>. There is a <something else important> in <location>. The <something else> is <more detail>.
The background is a <location>. The scene is <lit in some way>. The composition frames <something> and <something> from <an angle or photography term or whatever>.

Following that structure, here are a couple of the prompts for the images attached to this post. You can check the rest out by clicking on the images at the top.
The ballet woman
A shot of a woman performing a ballet routine. She's wearing a ballet outfit and has a serious expression. She's in a dynamic pose.
The scene is set in a concert hall. The composition is a close up that frames her head down to her knees. The scene is lit dramatically, with dark shadows and a single shaft of light illuminating the woman from above.

The rat on the fence post
A close up shot of a large, brown rat eating a berry. The rat is on a rickety wooden fence post. The background is an open farm field.

The woman in the water
A surreal shot of a beautiful woman suspended half in water and half in air. She has a dynamic pose, her eyes are closed, and the shot is full body. The shot is split diagonally down the middle, with the lower-left being under water and the upper-right being in air. The air side is bright and cloudy, while the water side is dark and menacing.

The space capsule
A woman is floating in a space capsule. She's wearing a white singlet and white panties. She's off-center, with the camera focused on a window with an external view of earth from space. The interior of the space capsule is dark.

Upscaling
Z-image makes very sharp images, which means you can directly upscale them very easily. Conventional upscale models rely on sharp/clear images to add detail, so you can't reliably use them on a model that doesn't make sharp images.
My favourite upscaler for NAKED PEOPLE or human face close-ups is 4xFaceUp. It's ridiculously good at skin detail, but has a tendency to make everything else look a bit stringy (for lack of a better word). Use it when a human being showing lots of skin is the main focus of the shot.
Here's a 6720x6720 version of the sitting bikini girl that was upscaled directly using the 4xFaceUp upscaler: imgbb | g-drive
For general shots you can use something like 4xNomos2.
Alternatively, you can use SeedVR2, which also has the benefit of working on blurry images (not a problem with z-image anyway). It's not as good at human skin as 4xFaceUp, but it's better at everything else. It's also very reliable and pretty much always works. I have a basic workflow for it here: https://pastebin.com/9D7sjk3z
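Under the hood, ComfyUI loads these upscale models with the spandrel library. If you want to run one outside ComfyUI, something like the following should work - the exact API is from memory, so double-check against spandrel's docs:

```python
import torch
from spandrel import ModelLoader

model = ModelLoader().load_from_file("4xFaceUp.pth")  # path is illustrative
model.cuda().eval()
with torch.no_grad():
    # image: float tensor [1, 3, H, W] in 0..1
    upscaled = model(image.cuda())  # a 4x model outputs [1, 3, 4H, 4W]
```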
ClownShark Ksampler - what is it?
It's a node from the RES4LYF pack. It works the same as a normal sampler, but with a few differences:
"ETA". This setting basically adds extra noise during sampling using fancy math, and it generally helps get more detail out of generations, and makes them more stable. It's insane how much of a difference this makes with some models. This is the main reason we're using the clownshark sampler. A value of 0.5 is usually good, but I've seen it be good up to 0.7 for certain models (like Klein 9B), and lower is good for some models too. You can range it from 0.1 to 0.9 for interesting results.
"bongmath". This setting turns on bongmath. It's some kind black magic that improves sampling results without any downsides - but only if you're using the sampler a certain way. It does nothing in these particular workflows. Someone tries to explain what it is here: https://www.reddit.com/r/StableDiffusion/comments/1l5uh4d/someone_needs_to_explain_bongmath/
It has access to a ton of alternative samplers/schedulers (we're not using anything interesting in these workflows, just res_2s)
It has some funky mechanisms for doing follow-on/continuous generation by chaining multiple samplers together (this is what the "sampler_mode" setting is for). We're not using that in these workflows though.
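As promised above, here's a rough idea of what an eta-style setting does mechanically. This is the classic ancestral-step split from k-diffusion; RES4LYF's ETA uses its own fancier math, so treat this as the general concept only:

```python
def get_ancestral_step(sigma_cur, sigma_next, eta=0.5):
    # Split one sampling step into a deterministic part (down to sigma_down)
    # plus fresh noise (sigma_up). eta=0 -> no extra noise; higher eta -> more.
    sigma_up = min(sigma_next,
                   eta * (sigma_next**2 * (sigma_cur**2 - sigma_next**2) / sigma_cur**2) ** 0.5)
    sigma_down = (sigma_next**2 - sigma_up**2) ** 0.5
    return sigma_down, sigma_up
```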
You don't need to use this sampler if you don't want to; you can use the res_2s/beta sampler/scheduler with a normal ksampler node as long as you have RES4LYF installed. But seeing as the clownshark sampler comes with RES4LYF anyway we may as well use it - the ETA setting is awesome.
Effect of CFG on outputs
CFG lower than 4 is bad. Other than that, going higher has pretty big and unpredictable effects on the output for z-image base. You can usually range from 4 to 7 without destroying your image. It doesn't seem to affect prompt adherence much.
Going higher than 4 will change the lighting, composition and style of images somewhat unpredictably, so it can be helpful to do if you just want to see different variations on a concept. You'll find that some stuff just works better at 5, 6 or 7. Play around with it, but stick with 4 when you're just messing around.
Going higher than 4 also helps the model adhere to realism sometimes, which is handy if you're doing something realism-adjacent like trying to make a shot of a realistic elf or something.
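For reference, CFG itself is just a linear extrapolation between the unconditional (negative) and conditional (positive) predictions at every step, which is why cranking it amplifies whatever the prompt is pulling toward:

```python
def cfg_combine(uncond_pred, cond_pred, cfg=4.0):
    # cfg=1.0 would return cond_pred and ignore the negative prompt entirely;
    # higher values push the output further toward the positive prompt.
    return uncond_pred + cfg * (cond_pred - uncond_pred)
```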
Base vs Distil vs Turbo
They're good for different things. I'm generally a fan of base models, so most workflows I post are / will be for base models. Generally they give the highest quality but are much slower and can be finicky to use at times.
What is distillation?
It's basically a method of narrowing the focus of a model so that it converges on what you want faster and more consistently. This allows a distil to generate images in fewer steps and more consistently for whatever subject/topic was chosen. They often also come pre-negatived (in a sense) so that you can use 1.0 CFG and no negative prompt. Distils can be full models or simple loras.
The downside of this is that the model becomes more narrow, making it less creative and less capable outside of the areas it was focused on during distillation. For many models it also reduces the quality of image outputs, sometimes massively. Models like Qwen and Flux have god-awful quality when distilled (especially human skin), but luckily Z-image distils pretty well and only loses a little bit of quality. Generally, the fewer steps the distil needs the lower the quality is. 4-step distils usually have very poor quality compared to base, while 8+ step distils are usually much more balanced.
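As a toy illustration of step distillation - real methods like progressive distillation are considerably more involved, and `teacher_step` here is a hypothetical helper:

```python
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, noisy_latent, cond, teacher_steps=4):
    # Teacher: several small denoising steps produce the target.
    with torch.no_grad():
        target = noisy_latent
        for _ in range(teacher_steps):
            target = teacher_step(teacher, target, cond)  # hypothetical helper
    # Student: trained to reach the same place in a single step.
    pred = student(noisy_latent, cond)
    return F.mse_loss(pred, target)
```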
Z-image turbo is just an official distil, and it's focused on general realism and human-centric shots. It's also designed to run in around 10 steps, allowing it to maintain pretty high quality.
So, if you're just doing human-centric shots and don't mind a small quality drop, Z-image turbo will work just fine for you. You'll want to use a different workflow though - let me know if you'd like me to upload mine.
Below are the typical pros and cons of base models and distils. These are pretty much always true, but not always a 'big deal' depending on the model. As I said above, Z-image distils pretty well so it's not too bad, but be careful which one you use - tons of distils are terrible at human skin and make people look plastic (z-image turbo is fine).
Base model pros:
Generally gives the highest quality outputs with the finest details, once you get the hang of it
Creative and flexible
Base model cons:
Very slow
Usually requires a lengthy negative prompt to get good results
Creativity has a downside; you'll often need to generate something several times to get a result you like
More prone to mistakes when compared to the focus areas of distils
e.g. z-image base is more likely to mess up hands/fingers or distant faces compared to z-image turbo
Distil pros:
Fast generations
Good at whatever it was focused on (e.g. people-centric photography for z-image turbo)
Doesn't need a negative prompt (usually)
Distil cons:
Bad at whatever it wasn't focused on, compared to base
Usually bad at facial expressions (not able to do 'extreme' ones like anger properly)
Generally less creative, less flexible (not always a downside)
Lower quality images, sometimes by a lot and sometimes only by a little (depends on the model, the specific distil, and the subject matter)
Can't have a negative prompt (usually)
You can get access to negative prompts using NAG (not covered in this info page)
Comments
Man, I really wanted to see how this worked, but I wasn't getting good results.
There are a bunch of things I don't quite get.
1. If you are using Res samplers, why would you not use sigmas? For what you are doing, you'd get better results from advanced samplers. If you do want to use her samplers, then set them up for continuation as they were designed.
2. You are processing the mask twice instead of using the second sampler as a stitch. Remove the latent noise mask and watch the results.
3. Hook an "image preview from latent" to the first ksampler, then an "image preview" to the vae decoder on the second one and put them next to each other.
4. The "mask settings" node you used is really meant for regional prompting. Use "grow mask with blur" and mess with those settings.
I'm really kinda digging calculating steps off of noise. Pretty cool idea.
All in all nice work.
1. The clownshark sampler is "designed" for more than one thing, and one of those things is normal image generation. You don't have to use the continuation functions; they're not a requirement. Likewise, you're not required to use custom sigmas with anything in the res4lyf pack if you don't want to.
You do not get the same or better results from using a normal advanced ksampler. Swap one in and see what happens. The clownshark sampler does more than just resampling; that's why it has a "standard" option in the sampler_mode selector.
You might be confused about the intention; the second sampler is being used as a low-noise refiner, it's not meant to be a continuation of the previous sampler. The only difference is that it uses the normal scheduler instead of beta, purely because it's less noisy on the tail end. I find it improves the clarity of the output with no downsides.
2. Fair point about processing the mask twice, that's an accidental holdover from an earlier revision. It's not causing any harm though so I'll remove it later as part of another update :)
3. ... did you even try what you're suggesting yourself? You'll find that the two outputs are in fact different. I don't understand why you'd ask me to check it without checking it yourself first.
4. The mask fix node is not for regional prompting, it's for whatever you need it for. It's called "mask fix" not "regional prompting mask fix".
I'm using the mask fix node because I only needed feathering and possibly expanding/contracting, which this does perfectly fine, and I didn't want to make people install another node pack for no reason. You're welcome to swap it out for your preferred mask manipulator if you like.
All that said, fair enough if you're not entirely happy with the results. I'm experimenting with the controlnet union model to see if I can improve it, but so far it's giving worse results. I'll update this post if I manage to improve the results with it. Also if you have suggestions (that you've actually tested) for how to use the clownshark sampler better in this scenario, please do share! I've never managed to get good results from fiddling with resampling/continuation. At least, not better than just doing a normal gen.
@nsfwVariant I don't know if you looked at my profile or not, but I spend most of my free time building workflows. I get excited to see people put up their works. I'm also a huge fan of sharing my knowledge. Making workflows is more fun than making the actual image.
I know how the samplers all work. Here's an article on them https://civitai.com/articles/24343/a-ksampler-by-any-other-name
I'm telling you that Sam's samplers are not designed for that purpose. The ETA makes zero difference in your application. You will very much get better results with a regular sampler and a denoise schedule.
The mask node you use was part of the Regional rollout. It's not meant to be used in this situation, but if you want to argue with me about it, you win.
I did put the two images next to each other. You are only doing two steps with no real noise difference. It's why I mentioned it. Chaining both samplers with a step skip would give you excellent results with this sampler.
I apologize if I came off a little harsh. I had legitimate questions and wanted to tell you about better ways to do it. I liked your flow. It was well put together, and using the noise to calculate steps is something I never thought to do. 🙏
Here's my inpainting/fixer/etc. workflow.
https://civitai.com/models/2323071/fix-that-image-sfwnsfw
@lonecatone23 Thank you for circling back with an explanation, I appreciate it. I'm a bit snappy when folks come in with lots of assertions but no evidence, so the red flags keep going off in my head - and they still are. I'm gonna say some more stuff that sounds snappy, but I assure you I'm not mad or anything - just confused by what you're saying as it contradicts what I've observed & read.
I'll provide practical examples below, and if anything I've said is incorrect from your perspective I'm going to need you to come back with real, practical evidence that contradicts it. If you do that, then I'm very happy to hear you out and change my mind. Sorry for the long reply, there's a lot to cover.
1. The clownshark sampler is not making "zero difference" in this situation. The ETA setting on the clownshark sampler significantly changes the output of generated images. I'll use the Anima model to demonstrate because the difference is more obvious with non-realism styles. Here are 5 images generated with Anima, the top row is all clownshark sampler @0.5 eta and the bottom row is all normal ksampler with the same seeds & all the same settings: https://ibb.co/gnrPfbk
There are two extremely obvious differences between the two rows.
Difference 1 - the clownshark row all generated in a much more consistent style; all five of them are in a smooth digital painting style. The normal ksampler row, on the other hand, has an anime style one, a semi-realistic digital painting one, two with the same style as the top row, and one in a flatter digital painting style.
Difference 2 - if you look through them all you'll see that the bottom row has consistently flatter shading and less general detail than the top row does. That is to say, the clownshark row is noticeably more detailed. Very noticeably. If you're interested, adjusting the ETA from 0.0 to 1.0 also shows that the effect varies quantifiably as it goes higher/lower.
This demonstrates very clearly that there is an easily observable difference between the normal ksampler and the clownshark sampler when it comes to standard image generation, through nothing more than the ETA setting (and any other differences there might be in the backend execution of the node).
2. Regarding the clownshark sampler not being "designed" for basic image generation. Firstly, it's referred to as an "all-in-one sampling node designed for convenience without compromising on control or quality" on the github page, and explains that most of the inputs are optional - indicating that the intended use case is essentially to replace the normal ksampler in any scenario. Secondly, and more directly, there are literally example workflows in the res4lyf repo that use a single sampler for normal image generation, without any of the fancy settings in use. If you do know the creator, and this is indeed a mistake, kindly ask them to delete those examples from the repo to prevent confusion. Given point #1 about there being a clear difference even in normal image generation, you'll need to firmly explain what the issue is with this use case before I go taking your word for it.
3. On the mask fix node: it's not that I don't believe the node was originally made for regional prompting - that's cool if it was. My question to you is... what bearing does that have on its use as a mask editor? None of the settings are specific to regional prompting. You would never know it was originally created for regional prompting unless someone randomly told you. It even has most of the same options as the KJNodes one you mentioned - is that node also for regional prompting, then? Is this node somehow unsuitable for mask editing unless it's for regional prompting? More than anything else you said, this one is driving me nuts; I just do not understand what you mean by "not meant to be used in this situation". It's like you're trying to tell me a particular pair of scissors isn't made to cut yellow paper.
Can you explain why this node is only suitable for masks related to regional prompting?
4. There are two distinct effects that come from the second sampler the way it's been set up; it's not there for no reason. I've generated three sets of images to demonstrate, which are here: https://ibb.co/wZS0JgVR
The images on the left are without the second sampler, and the images on the right are with the second sampler.
Effect 1 - While res_2s/beta does a really good job of composition and adding details, it tends to over-detail the image in a similar way to res_2m (but less extreme). This results in images that are, if anything, too detailed to be realistic. The second sampler using res_2s/normal smoothes away those extra details and brings back more realism. The realism part is subjective of course, but the effect I'm talking about is very real. This is most noticeable on human skin, and you can easily see the detail reduction I'm talking about on the man's forehead in each picture: https://ibb.co/3m4XKvrQ
Effect 2 - The second sampler sharpens the image noticeably. This is most noticeable when you've got high-contrast areas right next to each other, and you can see this with the edges of the irises of the eyes having sharper definition in each picture: https://ibb.co/hRWKSXnz
Here's an album with all of the pictures separately, in case you'd like to flick through them to compare easier: https://ibb.co/album/VgJfK7
I will say that a more advanced user might be able to accomplish the same thing with custom sigmas and a single sampler, but I'm not at that level so two samplers it is for me.
Lastly, I'm definitely happy to be educated on how to get better results with the chaining & step skip you mentioned - can you give any example settings I could start from? I'm cool to play around with it, but I don't even know where to begin. Like I said earlier, I really don't know what I'm doing with the fancy settings.
@nsfwVariant This is actually a brilliant response.
The eta is the same as add noise. Sam is just very anal about her labeling of stuff (you spend hours smoking weed and writing code).
You really can accomplish almost identical results from an advanced sampler. If you look at the code side by side, they don't vary much. Try setting up a normal ksampler then a clownshark sampler.
Look, I sit here all day and write workflows and code. Inpainting works best with certain nodes that work the mask a different way.
I have a lot of different workflows on my page for you to mess with with all kinds of sampler setups. Pick one.
Here's an example of a ClownShark Sampler setup. https://civitai.com/models/2231181/z-image-base-and-turbo-pro-grade-realism-workflow-low-or-high-vram
I built this today. It has chained samplers; the start/stop/step skip is what sets them apart. Look how I dragged the image over to Klein. https://civitai.com/models/2468755/zit-realism-wklein9b-enhancer
If you do more than small parts, it starts to break apart.
Is there any way to add a reference image?
Probably, but I haven't played around with that much. You could try looking into something like IP Adapter, or try pasting what you want on top & using lower denoise. e.g. you could copy-paste a car into a picture and then mask over it @ 0.7 denoise.
Or you could use an edit model like Klein for the first pass, then cut out the edit from that and paste it into the original so you can re-do it in Z-image base for the quality boost.
Hard to give more specific advice without knowing your use case :)
Thank you for sharing this workflow. I want to use it to fix only a very small part of a high-resolution (4K) image, but the processing is quite slow. Since this is my first time using the Base model, I am not sure whether this speed is normal or if the high resolution is the main cause.
If the slowness is caused by the high resolution, would it be better to use this node?
https://github.com/lquesada/ComfyUI-Inpaint-CropAndStitch?tab=readme-ov-file
If possible, could you add it to the workflow for me?
Size is definitely a big factor. A 1920x1440 image, for example, is 2.6x larger than a 1040x1040 image, and so takes 2.6x longer to process.
Using a crop node would definitely speed things up for a large image, but you do need to be careful about what exactly you crop; if you crop too much, the model can lose important context from the rest of the image. One of those things where practice helps.
I'll play around with the node you linked and add it if it's good, thanks for sharing! In the meantime, you can also manually crop your image before processing if you'd like the same benefit.
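For anyone curious, the crop-and-stitch idea boils down to padding a box around the mask before cropping, so the model keeps some surrounding context. A rough sketch (not the linked node's code):

```python
def padded_crop_box(mask_bbox, pad, width, height):
    # Expand the mask's bounding box by `pad` pixels on each side,
    # clamped to the image borders, so the model sees surrounding context.
    x0, y0, x1, y1 = mask_bbox
    return (max(0, x0 - pad), max(0, y0 - pad),
            min(width, x1 + pad), min(height, y1 + pad))
```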