CivArchive
    HuMo for Wan - HuMo 14B fp8 e4m3fn
    NSFW

    HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning


    ✨ Key Features

    HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs, including text, images, and audio. It supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.

    • VideoGen from Text-Image - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.

    • VideoGen from Text-Audio - Generate audio-synchronized videos solely from text and audio inputs, removing the need for image references and enabling greater creative freedom.

    • VideoGen from Text-Image-Audio - Achieve a higher level of customization and control by combining text, image, and audio guidance.
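
    The three modes above differ only in which inputs accompany the text prompt. A minimal sketch of that mapping follows; the function name `select_humo_mode` and the returned mode labels are hypothetical and are not part of the HuMo codebase.

```python
from typing import Optional


def select_humo_mode(text: str,
                     image: Optional[str] = None,
                     audio: Optional[str] = None) -> str:
    """Map a set of inputs to one of the three listed generation modes.

    Hypothetical helper for illustration only; HuMo's real entry points
    may select modes differently.
    """
    if not text:
        # All three listed modes include a text prompt.
        raise ValueError("every listed mode requires a text prompt")
    if image and audio:
        return "text-image-audio"
    if image:
        return "text-image"
    if audio:
        return "text-audio"
    # Text alone is not one of the three modes listed above.
    raise ValueError("add a reference image, an audio clip, or both")
```

    For example, `select_humo_mode("a woman singing", audio="song.wav")` returns `"text-audio"`, the mode that needs no reference image.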

    Examples and models from the following sources are reuploaded here for your convenience:
    https://huggingface.co/bytedance-research/HuMo
    https://github.com/Phantom-video/HuMo


    Compatible with both 480P and 720P resolutions. 720P inference will achieve much better quality.

    Checkpoint
    Wan Video 14B t2v

    Details

    Downloads: 115
    Platform: CivitAI
    Platform Status: Available
    Created: 9/13/2025
    Updated: 4/28/2026
    Deleted: -

    Files

    humoForWan_humo14BFp8E4m3fn.safetensors
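
    To check what a downloaded file contains before loading it, the safetensors header can be read with the standard library alone. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical, and the expectation that an fp8 e4m3fn checkpoint reports mostly `F8_E4M3` entries is an assumption, not something verified against this file.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Parse only the JSON header of a .safetensors file.

    No tensor data is read, so this is cheap even for a 14B checkpoint.
    """
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header


def dtype_counts(header):
    """Count tensors per dtype string, skipping the optional metadata entry."""
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

    Calling `dtype_counts(read_safetensors_header("humoForWan_humo14BFp8E4m3fn.safetensors"))` on the downloaded file gives a quick view of how many tensors are stored in each precision.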