What is this?
A tool for using JSON with Anima. The model does not require JSON, but JSON gives you added control, and the model also handles many plain English prompts that were quite weak or non-existent before.
The trigger is NOT the exact token "JSON"; it is literal JSON text written into the prompt as a string.
Prompt Directly
Use JSON > ENGLISH > BOORU.
You will get the best yield in this order. You can swap booru for English if you get hallucinations.
The model was trained with both English and booru JSON, so the processing should be okay.
90k Brent E1+E2 1.0
Temporary version, will be replaced with the full 1.0 train. All epochs available on huggingface.
https://huggingface.co/AbstractPhil/anima-90k
This is only the VLM half, which ran for about 1 epoch. The plan is 2 epochs VLM and 1 epoch animetimm. That should be enough. The final version will be uploaded tonight.
Have fun.
Epoch 2 Release
This version is stronger and more capable while still retaining the majority of the original model. It is more robust than v1 and better at plain English.
Epoch 3 Time Stage
Epoch 3 is roughly 375,000 samples and applies the full subject bucketing system to the animetimm system only. This has shown the most robust capacity with this model, while still learning the plain English associations necessary to use more of Qwen than before.
This will take roughly 74 hours, so by next weekend I'll have everything worked out for a full comfyui release.
10k Brent V0.5
{
  "subjects": [
    {
      "name": "subject's name here",
      "attributes": ["attributes", "go", "here however you want to divide them"],
      "actions": ["actions go here", "in english or broken sequences"]
    }
  ],
  "setting": "supports settings"
}
Down here, reinforce the system with plain English like this: explain the system and the situation.
1girl, here, do, the, booru, tags, like how, you, would
It probably doesn't need to be perfect; you can likely jank it, and the model will not care whether the JSON is valid.
Add up to 8 subjects. Bounding boxes are not supported yet; semantic offset and associative offset are both only partially working.
Attributes hallucinate without reinforcement from the booru tags, for now.
For this version, the higher the strength, the more heavily the output is biased toward QWEN.
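The layout above (JSON first, then plain English, then booru tags) can be assembled programmatically. This is a minimal sketch, not part of the release: `build_prompt` and the example values are hypothetical, and only the field names and the 8-subject limit come from the notes above.

```python
import json

MAX_SUBJECTS = 8  # limit stated in the notes above


def build_prompt(subjects, setting, english, booru_tags):
    """Assemble a prompt as: JSON block, plain English sentence, booru tags."""
    if len(subjects) > MAX_SUBJECTS:
        raise ValueError(f"at most {MAX_SUBJECTS} subjects are supported")
    payload = {"subjects": subjects, "setting": setting}
    json_block = json.dumps(payload, indent=2)
    # Round-trip check: the model may tolerate broken JSON, but valid JSON
    # is easier to debug when a prompt misbehaves.
    assert json.loads(json_block) == payload
    tag_line = ", ".join(booru_tags)
    return "\n".join([json_block, english, tag_line])


# Example values are made up for illustration.
prompt = build_prompt(
    subjects=[{
        "name": "a girl with a red umbrella",
        "attributes": ["long blue hair", "school uniform"],
        "actions": ["standing in the rain"],
    }],
    setting="city street at night",
    english="A girl with long blue hair stands in the rain holding a red umbrella.",
    booru_tags=["1girl", "long_hair", "blue_hair", "umbrella", "rain"],
)
print(prompt)
```

If attributes or actions hallucinate, repeating them in the English sentence and the tag line is the reinforcement described above.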
Strengths
Handles low step or high step models fairly well. Reduce strength for low steps and you'll still get some use of the json.
Weaknesses
Attributes hallucinate. Actions hallucinate. Names are pretty good.
1k Brent (Preview)
Similar format to V0.5.
Booru tags are MORE critical. Different biases.
Weaknesses
Strong, but will bias a different array of images. More rigid and smaller array.
Text has problems, increase strength to the negative if you have large problems.
Brent 10k V0.5 Release
Fully revamped trainer; a forked diffusion-pipe with a considerably faster parquet processing pipeline.
https://github.com/AbstractEyes/diffusion-pipe/tree/feat/parquet-hf-dataset-backend
Instead of the anima trainer.
https://huggingface.co/datasets/AbstractPhil/diffusion-pretrain-set-ft1
10,000 images instead of 1000.
I ran too many epochs; however, the balanced train allows the model to operate at lower strength. The next run will use considerably more images, a higher diversity of images, a better character controller, and considerably more complex JSON prompts.
Subject Bucketing upgrade
The bucketing system runs fast and uses a shared grab-bag for buckets, which reduces prep time and still produces more images than the model can ingest on 4 GPUs. The parquet pipeline processes images considerably faster and still handles AR bucketing quickly, all because of the random grab-bag processing of the parquet system.
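For readers unfamiliar with aspect-ratio (AR) bucketing, here is a generic sketch of the idea. It is not the trainer's code: the bucket resolutions, `nearest_bucket`, and `grab_bag_batches` are all hypothetical, and the "grab bag" is modeled simply as drawing single-bucket batches in random order.

```python
import random
from collections import defaultdict

# Hypothetical bucket resolutions (width, height); a real trainer defines its own.
BUCKETS = [(1024, 1024), (1216, 832), (832, 1216), (1344, 768), (768, 1344)]


def nearest_bucket(width, height):
    """Return the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(BUCKETS, key=lambda b: abs(b[0] / b[1] - ratio))


def grab_bag_batches(images, batch_size, seed=0):
    """Yield (bucket, names) batches; each batch holds images from one bucket only."""
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for name, width, height in images:
        by_bucket[nearest_bucket(width, height)].append(name)
    batches = []
    for bucket, names in by_bucket.items():
        rng.shuffle(names)
        for i in range(0, len(names), batch_size):
            batches.append((bucket, names[i:i + batch_size]))
    rng.shuffle(batches)  # the "grab bag": batches come out in random order
    yield from batches


images = [("a", 1000, 1000), ("b", 1920, 1080), ("c", 1024, 1024), ("d", 1080, 1920)]
for bucket, names in grab_bag_batches(images, batch_size=2):
    print(bucket, names)
```

Keeping each batch inside one bucket means every image in the batch can be resized to the same resolution, which is why AR bucketing pairs naturally with batched training.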
Improved Cache
The original caching system is much improved: it has been converted to parquet processing, which easily kept the 4 A40 GPUs at 100% utilization.
More Data
A much larger train of 10,000 dual-prompted images. Repeats are based on both buckets and their subject selectiveness frequency.
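One common way to derive repeats from bucket frequency is inverse-frequency weighting, where rare buckets are repeated more often. The sketch below illustrates that general idea only; it is not the rule the trainer uses, and `repeats_per_bucket`, `target_per_bucket`, and `max_repeats` are hypothetical names.

```python
from collections import Counter


def repeats_per_bucket(bucket_of_sample, target_per_bucket, max_repeats=8):
    """Map each bucket to a repeat count that pulls small buckets toward a target size."""
    counts = Counter(bucket_of_sample)
    return {
        bucket: max(1, min(max_repeats, round(target_per_bucket / n)))
        for bucket, n in counts.items()
    }


# Made-up bucket labels: one common subject bucket and two rarer ones.
labels = ["1girl"] * 100 + ["2girls"] * 25 + ["landscape"] * 5
print(repeats_per_bucket(labels, target_per_bucket=50))
```

The cap on repeats keeps a very small bucket from being shown so often that the model memorizes it.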
Suggested Use
I suggest reduced strength, which still keeps the LoRA's benefits without introducing the QWEN biases as strongly.
I've included trigger prompt assistance for using the built-in subject format.
Brent 1k (PREVIEW) Release
https://github.com/AbstractEyes/anima-trainer
Trained with the same trainer as Anima was trained with originally - diffusion-pipe, snapped together with a new dataset organization system so I could run it in either Runpod or notebooks.
https://huggingface.co/datasets/AbstractPhil/diffusion-pretrain-set-ft1
These are 1k images randomly sampled and subject-bucketed from the 80k image dataset "qwen_90k" that will be trained next.
https://huggingface.co/AbstractPhil/Qwen3.5-0.8B-json-captioner
Each image was captioned twice: once by the VLM's ViT, which outputs JSON directly, and once by a variant of the AnimeTIMM ViT, whose output was then processed into JSON as well.
12 epochs on the VLM JSON captions, then the same images back in for 8 more epochs with the AnimeTIMM JSON. These are the results of subject-bucketing with JSON.
More specifically
https://huggingface.co/blog/AbstractPhil/subject-bucketing
This is a subject-bucket trained JSON finetune.
The specific targets are meant to provide better accuracy and more fidelity to finetunes experimentally while simultaneously training a proof-of-concept paradigm related to subject-bucketing.
TLDR Subject Bucketing
Dataset balancing. Normally you end up with a series of problems from finetunes: breakpoints, kinks, issues, distortions, faults, and so on.
This is meant as an experiment to solve those exact problems. By finetuning a model with JSON, you provide a form of differentiated perspective to the AI. By grouping subjects into a more complex paradigm, as stated in the article, the differentiation becomes robust.
A little longer, still short.
Each token separator is another format of language that QWEN already understands and recognizes. The more you combine in sequence, the more QWEN will understand this process - providing more utilizable structure to the diffusion system.
The more robust and orderly the encodings provided to the diffusion system, combining differentiated lesser-used tokens with more common-use tokens, the more useful the training outcomes.
Why?
The smaller-scale non-bucketed variants were successful, so it's time to train the real thing. The tool itself, and the tool yields.
Now the first 1k image train for the direct tool has been successful. The results are yielding and powerful. This merits a full uptick in training.
Comments (14)
is this like bringing the json prompt capabilities from ideogram v4 to anima?
I haven't played with Ideogram V4 but I've been planning this one for a couple months. My dataset consists of over 700k fully prepared dual-prompt images with my shared QWEN 3.5 0.8b model as the catalyst for the entire system.
SDXL took to it like a bag of rocks, however Anima took it fairly clean.
What exactly does Lora do? Can I just use it to generate prompts in JSON format? What exactly does that look like?
It accepts plain English prompting as well as JSON prompting.
@AbstractPhila But if this LoRA is not for enhancing understanding of the JSON prompt structure, what is the idea behind it? What is it for?
@VKilko The model becomes more selective, with larger margins between the LLM inputs. The LLM itself isn't particularly smart, so sparser captions give it trouble. This both strengthens small chains of tokens by giving them scaffolding with JSON and trains subject symbolism from the LLM into the diffusion mechanism. That lets the model align to specifics in a different way; in this case JSON was the catalyst and plain English was the mechanism.
@VKilko https://huggingface.co/datasets/AbstractPhil/anima-90k-cache/tree/main/vlm This will give a good idea of what's in there.
Here is one with a viewer, same images.
https://huggingface.co/datasets/AbstractPhil/sdxl-qwen-phase0
@AbstractPhila What is the structure / format of the JSON?
I did some testing and ... I can't see any difference with or without this LoRA using Anima base.
Modern models, surprisingly, do understand JSON, some more than others; e.g. Anima gives 60/40 positive results but Krea2 jumps to 90/10.
I used the Ideogram JSON description from KJ and am surprised that it works so well for Krea2. It's not ideal, but that is the whole "AI" shtick these days ("good enough so we all should use it"), and it's much better than in Anima.
The most problematic part is bbox coordinates, which Anima seems to ignore about half the time.
I haven't trained bounding box coordinates yet; for now you need to use difference offsets such as "to the left of", "the upper right corner of the image", etc.
The next structure I create will be substantially more powerful. I'm scaling up to full VIT classification capacity; text identification, rotation, offset, depth, scale, bounding boxes, and considerably more identified capacities all packed into JSON.
In that sense I'm going to find the strongest VLM that can run on the rtx 6000 pro's 95 gigs of vram, and with that the version 2 will be considerably more powerful.
Version 1 is currently cooking, and the subject semantics association preview shows that it will in fact yield - but my eyes are now open to something much much more powerful.
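Until bounding boxes are trained, an existing box can be turned into the kind of plain English offset mentioned above. This is a hedged sketch only: `position_phrase` is a hypothetical helper, the 3x3 grid and the exact wording are assumptions, and the origin is assumed to be the top-left corner of the image.

```python
def position_phrase(box, image_size):
    """Describe a bounding box's center as a coarse 3x3 grid position in plain English."""
    x0, y0, x1, y1 = box
    width, height = image_size
    # Normalized center of the box; (0, 0) is assumed to be the top-left corner.
    cx = (x0 + x1) / 2 / width
    cy = (y0 + y1) / 2 / height
    col = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    row = "upper" if cy < 1 / 3 else "lower" if cy > 2 / 3 else "middle"
    if row == "middle" and col == "center":
        return "in the center of the image"
    if row == "middle":
        return f"on the {col} side of the image"
    if col == "center":
        return f"in the {row} part of the image"
    return f"in the {row} {col} corner of the image"


print(position_phrase((800, 0, 1000, 200), (1000, 1000)))
```

The resulting phrase can go into a subject's "actions" list or into the plain English sentence after the JSON.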
As the sample images don't show any JSON in their prompts, could you give us an example?
It's a bit barebones for now, but it'll get the model started for the next batch.
There's an actual qwen model you can use to translate your plain english prompt directly to the json format that this model learned.
https://huggingface.co/AbstractPhil/anima-prelim-1k-r64/tree/main/comfy-qwen-json
The qwen node works in comfyui but I haven't packaged it up into its own repo yet. It requires transformers >5.4.
I suggest appending the plain english + booru tags after the json formatted data, which provides the necessary solidity to the prompt.
What do you think about XML as an input structure, like NewbieAi has?
Example prompt:
<character_1>
<n>$character_1$</n>
<gender>1girl</gender>
<appearance>chibi, red_eyes, blue_hair, long_hair, hair_between_eyes, head_tilt, tareme, closed_mouth</appearance>
<clothing>school_uniform, serafuku, white_sailor_collar, white_shirt, short_sleeves, red_neckerchief, bow, blue_skirt, miniskirt, pleated_skirt, blue_hat, mini_hat, thighhighs, grey_thighhighs, black_shoes, mary_janes</clothing>
<expression>happy, smile</expression>
<action>standing, holding, holding_briefcase</action>
<position>center_left</position>
</character_1>
<character_2>
<n>$character_2$</n>
<gender>1girl</gender>
<appearance>chibi, red_eyes, pink_hair, long_hair, very_long_hair, multi-tied_hair, open_mouth</appearance>
<clothing>school_uniform, serafuku, white_sailor_collar, white_shirt, short_sleeves, red_neckerchief, bow, red_skirt, miniskirt, pleated_skirt, hair_bow, multiple_hair_bows, white_bow, ribbon_trim, ribbon-trimmed_bow, white_thighhighs, black_shoes, mary_janes, bow_legwear, bare_arms</clothing>
<expression>happy, smile</expression>
<action>standing, holding, holding_briefcase, waving</action>
<position>center_right</position>
</character_2>
<general_tags>
<count>2girls, multiple_girls</count>
<style>anime_style, digital_art</style>
<background>white_background, simple_background</background>
<atmosphere>cheerful</atmosphere>
<quality>high_resolution, detailed</quality>
<objects>briefcase</objects>
<other>alternate_costume</other>
</general_tags>















