
OmniHuman-1: China's New AI Model Takes on OpenAI in Video Generation

OmniHuman is an end-to-end, multimodality-conditioned human video generation framework designed to produce human videos from a single image and motion signals, including audio alone, video alone, or a combination of both.


TikTok’s parent company ByteDance has unveiled a video-generating artificial intelligence (AI) model called OmniHuman-1. The model can generate realistic videos from a single image and a motion signal such as an audio clip, depicting people talking, dancing, singing and even playing instruments.

The model is designed to accurately replicate human speech, movement, and gestures. According to the company’s website, whether the input is a portrait, half-body shot, or full-body image, OmniHuman can generate lifelike movement and natural gestures with striking attention to detail. At its core, OmniHuman is a multimodality-conditioned human video generation model, meaning it combines multiple input types, such as images and audio clips, to create highly realistic videos.

OmniHuman is currently in the research phase and is not yet accessible to the public. The developers have shared demos and hinted at a possible code release in the future.

The release of this model marks another Chinese breakthrough in the AI industry following DeepSeek’s large language model (LLM) DeepSeek-V3. ByteDance’s OmniHuman-1 is a direct competitor to OpenAI’s video-generating model, Sora, which was released in December 2024, as well as other video-generating models like Runway’s Gen-3 Alpha and Luma AI’s Dream Machine.

How OmniHuman Works

OmniHuman is an end-to-end, multimodality-conditioned human video generation framework designed to produce human videos from a single image and motion signals, including audio alone, video alone, or a combination of both.

This framework incorporates a multimodality motion conditioning mixed training strategy, enabling the model to leverage the advantages of scaled-up data from mixed conditioning. By employing this approach, OmniHuman effectively addresses the challenges that previous end-to-end methods encountered due to the limited availability of high-quality data.

OmniHuman was trained on a vast dataset using an advanced AI framework. Researchers fed it over 18,700 hours of human video footage using a unique “omni-conditions” approach. This allows the model to learn simultaneously from text, audio, and body movements, resulting in more natural animations.
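
To make that idea concrete, here is a minimal sketch, in illustrative Python, of how an “omni-conditions” mixed-training loop might decide which motion signals condition a given training step. The field names and drop probabilities below are assumptions made for this sketch, not values taken from ByteDance’s paper.

```python
import random

# Purely illustrative drop probabilities: stronger motion signals (e.g. pose from a
# driving video) are dropped more often during training so that weaker signals such
# as audio or text still receive enough training signal. The ratios are invented.
DROP_PROB = {"pose": 0.5, "audio": 0.2, "text": 0.1}

def sample_conditions(example: dict) -> dict:
    """Pick the subset of available motion conditions used for one training step."""
    # The reference image (appearance) is always kept as a condition.
    active = {"reference_image": example["reference_image"]}
    for name in ("text", "audio", "pose"):
        signal = example.get(name)
        if signal is not None and random.random() > DROP_PROB[name]:
            active[name] = signal
    return active

# Example: a clip that has audio and a pose track but no text caption.
clip = {"reference_image": "portrait.png", "audio": "speech.wav", "pose": "pose_seq.npy"}
print(sample_conditions(clip))  # e.g. {'reference_image': 'portrait.png', 'audio': 'speech.wav'}
```

The practical benefit is that footage usable for only one signal, for example audio-driven talking-head clips, can still train the shared model, which is how mixed conditioning eases the shortage of high-quality multi-signal data described above.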

Key features of OmniHuman include multimodality motion conditioning, realistic lip sync and gestures, support for various inputs, versatility across formats, high-quality output, and animation beyond humans.
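
OmniHuman itself has no public release or API, so the snippet below is only a hypothetical sketch of what a single-image-plus-audio request to such a system could look like; the class, fields, and defaults are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    """Hypothetical request object; OmniHuman exposes no public API at the time of writing."""
    reference_image: str                # portrait, half-body or full-body shot
    audio: Optional[str] = None         # driving speech, singing, etc.
    motion_video: Optional[str] = None  # optional driving video for gestures or dance
    duration_seconds: float = 10.0
    aspect_ratio: str = "9:16"          # any aspect ratio is claimed to be supported

def describe(req: GenerationRequest) -> str:
    """Summarise which motion signals would condition the generated video."""
    signals = [name for name, value in (("audio", req.audio), ("video", req.motion_video)) if value]
    return (f"Animate {req.reference_image} ({req.aspect_ratio}) for {req.duration_seconds}s, "
            f"driven by: {', '.join(signals) if signals else 'image only'}")

print(describe(GenerationRequest("singer.png", audio="song.wav")))
# Animate singer.png (9:16) for 10.0s, driven by: audio
```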

OmniHuman’s Top Competitors

Sora is OpenAI’s text-to-video AI model, capable of generating high-quality videos up to one minute long from textual prompts. It stands out for its ability to maintain strong spatial and temporal consistency, demonstrating an advanced understanding of 3D environments, physics, and realistic motion.

Sora can extend existing videos, fill in missing frames, and generate dynamic camera movements, making it ideal for storytelling and creative video production. While the exact training details are proprietary, OpenAI has confirmed that the model was trained on a mix of publicly available and licensed datasets, ensuring diversity and accuracy in its outputs.

Runway’s Gen-3 Alpha is an advanced AI video model designed for high-quality and fast video generation, significantly improving upon its predecessors. It provides precise control over structure, style, and motion, allowing users to create visually consistent and intricate video sequences from text prompts.

The model particularly excels in maintaining character consistency and fluid motion, making it valuable for professional content creators. Trained on a massive dataset of 240 million images and 6.4 million video clips, Gen-3 Alpha leverages its extensive knowledge to produce realistic, detailed, and coherent videos efficiently.

Luma AI’s Dream Machine is a transformer-based video model built for scalability and efficiency, generating physically accurate and visually consistent footage. It supports multimodal input, meaning users can create realistic videos from both text prompts and images, offering greater flexibility in content creation.

The model features an intuitive interface with AI storyboards and reference styles, ensuring more predictable and customisable results. While specific training details are not fully disclosed, Dream Machine is known to have been trained on a large-scale dataset of video clips, equipping it with an extensive understanding of motion patterns and visual dynamics.

OmniHuman vs Sora

OpenAI's Sora and ByteDance's OmniHuman-1 use different architectures for video generation. Sora employs a diffusion transformer model that maintains temporal coherence and an emergent grasp of physics, excelling in realistic scene synthesis and spatial consistency in complex 3D environments.

OmniHuman-1, on the other hand, is a diffusion transformer-based framework trained with mixed motion conditions and optimised for human motion and character continuity, generating high-quality, lifelike human movements and expressions.

While Sora focuses on broad environmental realism, OmniHuman-1 excels in detailed character dynamics, making each model specialised in different aspects of video generation.

ByteDance, in a statement, said that OmniHuman "significantly outperforms existing methods, generating extremely realistic human videos based on weak signal inputs, especially audio."

In a research paper published on arXiv, the company claimed that this model can work with images of any aspect ratio, whether they are portraits, half-body, or full-body images, and deliver lifelike and high-quality results across various scenarios.

However, a head-to-head comparison between these models is not yet possible, as their developers have not released scores on common benchmarks. For now, any comparison rests on user experience, which remains subjective.
