- THIS MODEL HAS TWO FILES. YOU NEED TO USE THEM TOGETHER!!!
- The associated trigger words are for reference only and may need to be adjusted at times.
- The recommended weight for the embedding model is 1, which provides higher fidelity; if greater generalization is required, it can be lowered to 0.5.
- The recommended weight for the LoRA model is 0.85; if there's evidence of contamination, consider lowering it to 0.5.
- The preview images were generated using a few fixed test prompts and several prompts derived from clustering dataset features. Random seeds were used, ruling out cherry-picking. What you see is what you get.
- No specialized training was done for outfits. You can check our provided preview post to get the prompts corresponding to the outfits.
How to Use This Model
THIS MODEL HAS TWO FILES. YOU NEED TO USE THEM TOGETHER!!!
You need to download both yamato_kantaicollection.pt and
yamato_kantaicollection.safetensors, then use yamato_kantaicollection.pt as a textual inversion embedding and
yamato_kantaicollection.safetensors as a LoRA at the same time.
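Because a missing embedding file is easy to overlook, a small pre-flight check can help. The following is only a sketch: the function name `missing_model_files` and the two folder arguments are hypothetical, and where the embedding and LoRA folders live depends on your front end.

```python
from pathlib import Path

# File names as given on this page.
EMBEDDING_FILE = "yamato_kantaicollection.pt"
LORA_FILE = "yamato_kantaicollection.safetensors"


def missing_model_files(embeddings_dir, lora_dir):
    """Return the paths of the required files that are not present.

    embeddings_dir / lora_dir are hypothetical folder arguments; pass
    whatever folders your front end reads embeddings and LoRAs from.
    """
    required = [
        Path(embeddings_dir) / EMBEDDING_FILE,
        Path(lora_dir) / LORA_FILE,
    ]
    return [str(path) for path in required if not path.is_file()]
```

An empty list means both files were found; anything else names the file that still needs to be downloaded.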
The trigger word is yamato_kantaicollection, and the recommended tags are best quality, masterpiece, highres, solo, {yamato_kantaicollection:1.15}, long_hair, brown_hair, ponytail, hair_ornament, flower, hair_flower, brown_eyes, smile, breasts, cherry_blossoms, large_breasts, very_long_hair, blush, hair_between_eyes.
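As a rough illustration of combining the two files in one prompt, here is a minimal sketch. It assumes an AUTOMATIC1111-style WebUI, where `(text:weight)` weights a token and `<lora:name:weight>` loads a LoRA; this page does not name a front end, so treat that syntax as an assumption. The helper `build_prompt` is hypothetical, and its default weights are the ones recommended at the top of this page (1 for the embedding, 0.85 for the LoRA).

```python
def build_prompt(tags,
                 embedding="yamato_kantaicollection",
                 embedding_weight=1.0,
                 lora="yamato_kantaicollection",
                 lora_weight=0.85):
    """Join the embedding trigger, the tags, and the LoRA reference.

    Assumes AUTOMATIC1111-style syntax: (name:weight) for the embedding
    trigger word and <lora:name:weight> for the LoRA file.
    """
    parts = [f"({embedding}:{embedding_weight})"]
    parts.extend(tags)
    parts.append(f"<lora:{lora}:{lora_weight}>")
    return ", ".join(parts)


prompt = build_prompt(["best quality", "masterpiece", "solo", "long_hair"])
```

Lowering `embedding_weight` or `lora_weight` toward 0.5 follows the generalization and contamination advice given in the notes above.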
How This Model Is Trained
This model is trained with HCP-Diffusion, and the auto-training framework is maintained by the DeepGHS Team.
Why Some Preview Images Do Not Look Like Yamato Kantaicollection
All the prompt texts used on the preview images (which can be viewed by clicking on the images) are automatically generated using clustering algorithms based on feature information extracted from the training dataset. The seed used during image generation is also randomly generated, and the images have not undergone any selection or modification. As a result, there is a possibility of the mentioned issues occurring.
In practice, based on our internal testing, most models that experience such issues perform better in actual usage than what is seen in the preview images. The only thing you may need to do is adjust the tags you are using.
I Feel This Model May Be Overfitting or Underfitting. What Should I Do?
Our model has been published in the Hugging Face repository CyberHarem/yamatokantaicollection, where the models from all training steps are saved. We have also published the training dataset as the Hugging Face dataset CyberHarem/yamatokantaicollection, which may be helpful to you.
Why Not Just Use Better-Selected Images
Our model's entire process, from data crawling, training, to generating preview images and publishing, is 100% automated without any human intervention. It's an interesting experiment conducted by our team, and for this purpose, we have developed a complete set of software infrastructure, including data filtering, automatic training, and automated publishing. Therefore, if possible, we would appreciate more feedback or suggestions as they are highly valuable to us.
Why Can't the Desired Character Outfits Be Accurately Generated
Our current training data is sourced from various image websites, and for a fully automated pipeline, it's challenging to accurately predict which official images a character possesses. Consequently, outfit generation relies on clustering based on labels from the training dataset in an attempt to achieve the best possible recreation. We will continue to address this issue and attempt optimization, but it remains a challenge that cannot be completely resolved. The accuracy of outfit recreation is also unlikely to match the level achieved by manually trained models.
In fact, this model's greatest strengths lie in recreating the inherent characteristics of the characters themselves and its relatively strong generalization capabilities, owing to its larger dataset. As such, this model is well-suited for tasks such as changing outfits, posing characters, and, of course, generating NSFW images of characters!😉
We regret that this model is not recommended for the following groups:
- Individuals who cannot tolerate any deviations from the original character design, even in the slightest detail.
- Individuals whose use cases demand high accuracy in recreating character outfits.
- Individuals who cannot accept the potential randomness in AI-generated images based on the Stable Diffusion algorithm.
- Individuals who are not comfortable with the fully automated process of training character models using LoRA, or those who believe that training character models must be done purely through manual operations to avoid disrespecting the characters.
- Individuals who find the generated image content offensive to their values.
Comments (4)
Hello
Since I am working on Kantai Collection's Houshou in parallel, I have taken a look at the published dataset-raw on huggingface. And I feel I should raise a few concerns with your training pipeline.
1. Inclusion of cosplay outfits
These images can teach the model the underlying characteristics of the training target; however, it is at the artist's discretion how closely the character resembles the common understanding of that character. In this dataset there are several examples such as 4a435ac7fdb1ebe0841a7728af1f2bbaf2375673.png, 21a2b906c8bcce4b33d602d000d2c135d1e134da.png, 01ca78ef2a38fa17cae8ea58e9f3d0651694e01f.png, 43b35ad8ad02320fc132b937af4dbc6ef5c7f1c3.png and, to a lesser extent, a few others which depart from the visual concept of Houshou significantly, only bearing resemblance in hair color, hair length, general stature and maybe eye color. For an automated pipeline it is better if such images are not included, since it is also unlikely they will cluster meaningfully during data processing.
2. Inclusion of overly small cropped images
Namely, adf6a8dd9d5ace9fc403722ed9d803580f29ef5d.png is too small (137x179) to provide sufficient detail to be a net benefit to training, even if SoTA upscaling methods are applied. Cropping images is often desirable for training, but size filtering after cropping should also be applied more aggressively.
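To make the suggestion concrete, a post-crop size filter could look roughly like the sketch below. This is not the pipeline's actual code: the function name, the `(filename, width, height)` tuple layout and the 256-pixel thresholds are all assumptions chosen for illustration.

```python
def filter_small_crops(crops, min_width=256, min_height=256):
    """Keep only crops that meet both minimum dimensions.

    crops: iterable of (filename, width, height) tuples (assumed layout).
    The 256-pixel defaults are illustrative, not a recommended value.
    """
    return [
        (name, width, height)
        for name, width, height in crops
        if width >= min_width and height >= min_height
    ]


crops = [
    ("adf6a8dd9d5ace9fc403722ed9d803580f29ef5d.png", 137, 179),
    ("example_large_crop.png", 512, 768),  # hypothetical file
]
kept = filter_small_crops(crops)
```

With these thresholds the 137x179 crop is dropped and only the larger, hypothetical example remains.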
3. Missing important dataset sources
It appears most images are sourced from minor collection websites such as Zerochan and Anime-Pictures, while the most complete repositories such as Danbooru, Gelbooru or pixiv are not directly referenced in metadata as far as I could see. These databases offer much better selection of images with attached metadata, including data that can be used as proxy for aesthetic score (such as in WD1.5 pipeline). Having manually parsed about 13% of all Houshou images from Danbooru personally, I think this method will be overly limited if it can only use the sources I see.
4. Dataset is too small for effective cluster discovery
200 images may be more than sufficient with manual curation. But for an automated pipeline which is not as limited by dataset size, more should be included. For instance the only meaningful clusters that I think could be made out of the dataset I saw were Houshou base form, and then a general unspecific cluster. Houshou Kai Ni (1 image), Houshou Kai Ni Sen (0 images), Houshou smock/kappougi outfit (~8 images) were not represented enough to where training will learn meaningful characteristics. In the first place, with this little data concept segmentation may not be very effective as there are several similar images even to the smock/kappougi outfit (such as apron, nurse outfits) which could get clustered together with it.
I think that, to increase the chance of getting meaningful concept clusters out, at least 1000 images (but really as many as are available that meet the required significance/quality threshold) should be processed before clustering, and then only the top clusters should be used in actual training.
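The "only the top clusters" step could be sketched as follows, assuming the clustering stage has already produced one integer label per image. The function name, the `top_k` and `min_size` defaults, and the use of -1 as a noise label (a convention in density-based clustering) are assumptions, not details of the actual pipeline.

```python
from collections import Counter


def keep_top_clusters(labels, top_k=3, min_size=20):
    """Return indices of images that belong to the largest clusters.

    labels: one cluster label per image; -1 is treated as noise (assumed).
    Only the top_k largest clusters with at least min_size members are kept.
    """
    counts = Counter(label for label in labels if label != -1)
    kept = {
        label
        for label, size in counts.most_common(top_k)
        if size >= min_size
    }
    return [index for index, label in enumerate(labels) if label in kept]
```

Images in small or noisy clusters are simply left out of training rather than being merged into a catch-all group.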
Also while I'm at it, although it is not about training pipeline, I think to demonstrate generalization and also improve general image quality a post-training generation pipeline should be added to create image previews using more aesthetically pleasing checkpoint models, and perhaps several different popular models to demonstrate generalizability. Few people will use these models unless it's been demonstrated they can produce useful results, and almost no one will be using these models on NAI (or whatever the training checkpoint is).
[Addendum] After looking at some other comments, it seems only Anything v5 is used to generate previews. However, it is a fairly closely related model to NAI, so the generalizability demonstration is not too strong. Also, these previews don't seem to be making very good use of the model's capabilities (I use Anything v5 prt-RE often myself). I am not sure if it is due to the specific generation settings used or to inherent properties of the trained models, but more fine tuning on this front would also be beneficial.
I understand the utility of an automated training pipeline and even welcome it, but as the old programming adage goes, "garbage in, garbage out" ought to be carefully avoided. I have already personally encountered people who are generally dissatisfied with the quality of your Civit submissions.
I sincerely hope the above advice can be taken into consideration before significantly more human time is spent downloading and evaluating models produced by this pipeline.
Hello,
Thank you very much for your detailed and thoughtful suggestions; they have been of great help to us. Among the points you've raised:
1. Inclusion of cosplay outfits
Currently, our pipeline utilizes [CCIP](https://github.com/deepghs/imgutils#contrastive-character-image-pretraining) to filter the images scraped from various sources. This is a contrastive learning model trained on over 3000 characters and more than 400,000 images. Its purpose is to remove visually dissimilar characters from the dataset. However, since CCIP is trained on characters themselves, it predominantly focuses on features like hair color, eye color, and skin color, rather than clothing. As of now, there isn't a better solution to address this issue due to the inherent challenge of defining "out of character" visually.
2. Inclusion of overly small cropped images
Your point is entirely valid. We had not fully considered this scenario previously, and we will introduce stricter size filtering after cropping.
3. Missing important dataset sources
We have indeed developed pipelines for various sources including Danbooru, Gelbooru, and Pixiv. The reason we've chosen to use sites like Zerochan and Anime-Pictures is that the overall quality of images on more comprehensive platforms like Danbooru is not as expected. These sites often contain a significant number of low-quality images, particularly those tagged as NSFW, making the filtering process challenging.
Regarding your point about the comprehensive metadata on Danbooru, we would need more specific information to see whether there is any crucial data we have overlooked. Such information would be helpful to us.
4. Dataset is too small for effective cluster discovery
Your observation is accurate. Based on our team's testing, the current pipeline generally yields better generalization due to a higher volume of images (200) compared to the limited number (usually not exceeding 50) that most manually-trained LoRA models use. However, clustering for certain features like costumes remains unstable. Several challenges contribute to this issue:
a) As mentioned, we lack a model capable of extracting feature vectors for costume characteristics. While we've analyzed the "cosplay" tag on Danbooru, the limited number of images poses difficulties in achieving satisfactory results with contrastive learning, which typically requires training on hundreds of thousands or even millions of images.
b) For a fully automated pipeline, it's difficult to predict all official images or skins for a character in advance. Therefore, clustering is necessary, but less popular or recognized appearances often lack sufficient images.
c) While increasing the image count seems like a good idea, the actual effectiveness needs verification. Moreover, drastically increasing the image count would significantly increase the workload for data scraping, processing, and training, requiring a careful trade-off.
5. Use of Base Model
This issue has been under consideration for a while, and we acknowledge your concern. Currently, the HCP-Diffusion framework we're using for training can only use base models in the diffusers format. However, some models on CivitAI like Meniamix and Cetusmix face issues during format conversion, preventing their use. Therefore, we are limited to using NAI for training and then employing diffusers-format Anything v5 models for generating preview images. We plan to address this issue in the HCP-Diffusion framework soon.
6. Our Approach
It's important to clarify the purpose of our work, which has been discussed within our team for some time:
* As you mentioned in points 1 and 4, it's evident that the automated pipeline inherently faces difficulties in capturing the nuances of character images compared to the human understanding of a character's appearance. This is a natural disadvantage at the information level, so the limitation cannot be overcome solely through technological advancements.
* Even with a contrastive learning model specifically tailored for character outfits, its capability would be limited to balancing the proportions of various attires within the dataset. For the rarely occurring outfits in online images, the model would still face limitations. In other words, for a fully automated pipeline, the outfit issue can only be optimized to a certain extent but cannot be entirely resolved; it's an intractable problem.
* Therefore, in the foreseeable future, it's challenging for automated models to match the precision of manually curated and meticulously tagged datasets.
However, the advantages of automated pipeline models include:
* Adequate accuracy in capturing the core appearance of characters, ensuring their recognition.
* Strong generalization across various prompts and base models, making them suitable for costume changes, different poses, or even NSFW content (indeed, generating NSFW content is something our models excel at).
* The ability to produce large quantities of models quickly, covering the entire cast of a game or anime series.
Considering these factors, our work aims to produce models with decent quality and good generalization relatively inexpensively, covering nearly all characters, including many lesser-known ones. Our work is analogous to mass-producing garments in a clothing factory compared to the bespoke tailoring of high-end suits.
7. Future Plans
We have several planned projects in the pipeline, including:
* Training character models based on anime videos (currently in research, with some challenges).
Once again, we sincerely appreciate your genuine advice, which has been incredibly helpful to us. Thank you very much! Do you have a GitHub or Hugging Face account? We'd like to follow you for further technical discussions. [Heart emoji]
Our updates and announcements: https://civitai.com/articles/1897
I apologize for disturbing you again, but we need your assistance.
Recently, we have been investigating the quality issues concerning the LoRA models we have trained. However, there seems to be a significant discrepancy between the general evaluations on civitai and the actual results of our internal testing (our testing indicates that the majority of LoRA meet quality standards and exhibit higher generalization). This has left us quite puzzled. We currently believe that several factors might contribute to such evaluations:
* We have adopted a dual-model approach using embeddings (.pt files) and LoRA models (.safetensors files). This makes it different from the classical LoRA and requires loading both files to take full effect. We have noticed multiple instances where users forget to include the embedding model (in fact, according to our testing, not using the embedding model significantly impacts the fidelity of character recreation).
* The previous preview images used anything-v5, which had lower quality. This led to a poor initial impression and possibly resulted in negative reviews from those who hadn't used it before.
* The ability and stability to recreate outfits are not on par with manually trained LoRA models. This point is quite evident, and we are actively working on improvements.
However, based on our observations, it's likely that the negative reviews stem from more than just these reasons. Therefore, could you kindly describe, from the perspective of a LoRA user, the main differences in quality between our LoRA and manually trained LoRA models in as much detail as possible? If it's possible to share some image examples, it would greatly assist us.
Thank you very much!



