Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation - Note
1. Framework
The framework first encodes the pose sequence with the Pose Guider, a stack of 2D convolutions. The encoded pose sequence is fused with multi-frame noise and passed through the Denoising UNet, which performs the denoising process for video generation. The computational blocks of the Denoising UNet consist of Spatial-Attention, Cross-Attention, and Temporal-Attention, as shown in the dashed box of the paper's framework figure. The reference image is integrated in two ways: detailed features are extracted by ReferenceNet and injected through spatial attention, while semantic features are extracted by a CLIP image encoder and injected through cross-attention. Temporal-Attention operates along the temporal dimension. Finally, a VAE decoder converts the result into a video clip.
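Below is a toy sketch of this data flow, using tiny stand-in modules with hypothetical names and shapes; it only illustrates how the pieces connect, not the authors' actual networks.

```python
# Toy sketch of the Animate Anyone data flow described above.
# All module names and shapes are illustrative stand-ins.
import torch
import torch.nn as nn

B, F, C, H, W = 1, 8, 4, 32, 32                       # batch, frames, latent channels, latent H/W

pose_guider = nn.Conv2d(3, C, 3, padding=1)           # paper: a stack of 2D convs; one conv as a stand-in
ref_net     = nn.Conv2d(C, C, 3, padding=1)           # stand-in for ReferenceNet (detailed features)
clip_enc    = lambda img: torch.randn(B, 1, 768)      # stand-in for the CLIP image encoder (semantic features)
unet        = nn.Conv3d(C, C, 3, padding=1)           # stand-in for the Denoising UNet blocks
vae_dec     = nn.ConvTranspose2d(C, 3, 8, stride=8)   # stand-in for the VAE decoder

pose  = torch.randn(B * F, 3, H, W)                   # rendered pose skeletons, one per frame
noise = torch.randn(B, C, F, H, W)                    # multi-frame noise
ref   = torch.randn(B, C, H, W)                       # VAE latent of the reference image

pose_feat = pose_guider(pose).view(B, F, C, H, W).permute(0, 2, 1, 3, 4)
ref_feat  = ref_net(ref)                              # consumed by spatial attention (Sec. 1.1)
clip_emb  = clip_enc(ref)                             # consumed by cross-attention (Sec. 1.2)
latents   = unet(noise + pose_feat)                   # pose fused with noise, then denoised
video     = vae_dec(latents.permute(0, 2, 1, 3, 4).reshape(B * F, C, H, W))
print(video.shape)                                    # torch.Size([8, 3, 256, 256])
```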

1.1 Spatial-Attention
In the Denoising UNet, each standard self-attention layer is replaced with a spatial-attention layer that also attends to features of the reference image. Here's how it works (a short code sketch follows this list):
- Feature Extraction:
- ReferenceNet, a symmetric UNet structure, extracts detailed features from the reference image.
- These features are then passed to the Denoising UNet, where they are combined with the noise latent features (multi-frame noise) used for video generation.
- Spatial Attention Integration:
- $x_1 \in \mathbb{R}^{t \times h \times w \times c}$: the feature map from the Denoising UNet, representing the multi-frame noise latent features.
- $x_2 \in \mathbb{R}^{h \times w \times c}$: the feature map from ReferenceNet, representing detailed features of the reference image.
- $x_2$ is replicated $t$ times (to match the number of frames) and concatenated with $x_1$ along the width dimension $w$.
- Self-attention is applied to the concatenated feature map, enabling the model to learn relationships between reference image features and noise latent features.
- Feature Selection:
- After applying self-attention, the first half of the feature map is extracted as the output of the Spatial-Attention layer. This allows the model to selectively focus on the most relevant spatial details in the reference image.
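A minimal sketch of this concatenate-attend-slice step, assuming single-head attention without the learned Q/K/V projections of the real SD self-attention layers; the shapes are illustrative.

```python
# Sketch of reference-conditioned spatial attention: replicate x2, concatenate
# along width, run self-attention over spatial tokens, keep the first half.
import torch
import torch.nn.functional as F

t, h, w, c = 8, 16, 16, 320                      # frames, latent height/width, channels

x1 = torch.randn(t, h, w, c)                     # Denoising UNet feature map (per frame)
x2 = torch.randn(h, w, c)                        # ReferenceNet feature map (reference image)

x2_rep = x2.unsqueeze(0).expand(t, -1, -1, -1)   # replicate x2 t times
x_cat  = torch.cat([x1, x2_rep], dim=2)          # concat along width -> (t, h, 2w, c)

tokens = x_cat.reshape(t, h * 2 * w, c)          # flatten spatial positions into tokens
out    = F.scaled_dot_product_attention(tokens, tokens, tokens)  # plain self-attention
out    = out.reshape(t, h, 2 * w, c)

out_first_half = out[:, :, :w, :]                # keep the first half -> (t, h, w, c)
print(out_first_half.shape)
```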
1.2 Cross-Attention
Cross-Attention is an attention mechanism designed to establish dynamic relationships between two different inputs. Its core idea is:
- Query (Q): The current feature from the generator (e.g., the current frame’s feature).
- Key (K) and Value (V): Features from the reference information (e.g., the reference image’s feature).
By computing the similarity between Query and Key, Cross-Attention dynamically adjusts the weighting of Value, effectively injecting reference information into the generation process.
- Similarity Matrix Calculation:
- Formula: $S = \dfrac{QK^{\top}}{\sqrt{d_k}}$, where $d_k$ is the key dimension.
- Input Shapes:
- Q: $(N_q, d_k)$, with $N_q$ the number of features in the generation target.
- K: $(N_{kv}, d_k)$, with $N_{kv}$ the number of features in the reference.
- Output Shape: $(N_q, N_{kv})$ (representing the similarity between each feature in the generation target and each feature in the reference image).
- Normalization to Generate Attention Weights:
- Formula: $A = \operatorname{softmax}(S)$, applied along the key dimension.
- Input Shape: $(N_q, N_{kv})$
- Output Shape: $(N_q, N_{kv})$ (normalized weights indicating how each feature in the generation target should extract information from the reference image).
- Weighted Value (V):
- Formula: $O = AV$
- Input Shapes:
- Weight Matrix: $A$, of shape $(N_q, N_{kv})$
- Value (V): $(N_{kv}, d_v)$, with $d_v$ the value dimension.
- Output Shape: $(N_q, d_v)$ (dynamically combined features of the generation target and reference image).
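A small self-contained sketch of the three steps above (similarity, softmax, weighted sum); the token counts and dimensions are illustrative placeholders.

```python
# Cross-attention: similarity matrix -> attention weights -> weighted sum of V.
import torch

N_q, N_kv, d_k, d_v = 64, 77, 320, 320        # illustrative sizes

Q = torch.randn(N_q, d_k)                     # features of the generation target
K = torch.randn(N_kv, d_k)                    # reference features (e.g., CLIP image embeddings)
V = torch.randn(N_kv, d_v)                    # reference values

S = Q @ K.T / d_k ** 0.5                      # similarity matrix, shape (N_q, N_kv)
A = S.softmax(dim=-1)                         # attention weights, each row sums to 1
O = A @ V                                     # reference-informed output, shape (N_q, d_v)
print(S.shape, A.shape, O.shape)
```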
1.3 Temporal-Attention
Research has shown that adding temporal layers to text-to-image (T2I) models can capture temporal dependencies between video frames, and this design facilitates transferring the image generation capabilities of pretrained T2I models. The temporal layer is applied only within the Res-Trans blocks of the Denoising UNet. ReferenceNet computes features for a single reference image and does not participate in temporal modeling. Because the Pose Guider already provides control over the character's continuous movement, experiments show that the temporal layer is sufficient to ensure temporal smoothness and continuity of appearance details, avoiding the need for complex motion modeling.
This idea originates from the AnimateDiff paper, which notes that videos add a time dimension compared to images. The original input tensor is 5D: $b \times c \times f \times h \times w$ (batch, channels, frames, height, width). The authors reshape the frame axis into the batch axis, transforming the shape to $(b \cdot f) \times c \times h \times w$ and reducing the tensor to 4D. When the tensor reaches the motion module, the spatial axes are folded into the batch axis instead, so its shape becomes $(b \cdot h \cdot w) \times f \times c$, making it 3D. This reshaping enables the motion module to perform attention across frames for smoother video movement and consistent content.
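A short sketch of the two reshapes described above, assuming a 5D latent laid out as (b, c, f, h, w):

```python
# Reshape a 5D video latent for the image (spatial) layers and for the motion module.
import torch

b, c, f, h, w = 1, 320, 8, 16, 16
x = torch.randn(b, c, f, h, w)                               # 5D video latent

# Image layers: fold the frame axis into the batch axis -> 4D
x_img = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)

# Motion module: fold the spatial axes into the batch axis -> 3D sequence of
# length f, so self-attention runs across frames at each spatial location.
x_tmp = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, f, c)
print(x_img.shape, x_tmp.shape)                              # (b*f, c, h, w), (b*h*w, f, c)
```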
The goal of the motion module’s network design is to enable efficient information exchange across frames. To expand the motion module’s receptive field, the authors insert motion modules at every resolution level of the U-shaped diffusion network. Additionally, sinusoidal positional encoding is added to the self-attention module, allowing the network to perceive the temporal position of the current frame in the animation.
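For reference, a standard sinusoidal positional encoding over the frame axis might look like the sketch below; the exact formulation used in AnimateDiff may differ in details.

```python
# Standard sinusoidal positional encoding over the frame axis, added to the
# (f, c) per-frame tokens so temporal self-attention can tell frames apart.
import math
import torch

def sinusoidal_pe(num_frames: int, dim: int) -> torch.Tensor:
    pos = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)                      # (f, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(num_frames, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

print(sinusoidal_pe(8, 320).shape)    # torch.Size([8, 320])
```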

2. Training Strategy
The training process is divided into two stages (a brief sketch of which modules are trained in each stage follows the lists below):
First Stage:
- Training is performed using a single video frame, with temporal layers excluded from the Denoising UNet.
- The model takes a single frame of noise as input and trains ReferenceNet and Pose Guider simultaneously.
- Reference images are randomly selected from the entire video clip.
- The Denoising UNet and ReferenceNet are initialized with pretrained weights from Stable Diffusion (SD), while the Pose Guider is initialized with Gaussian weights, except for its final projection layer, which uses zero convolution.
- The weights of the VAE encoder, decoder, and CLIP image encoder remain unchanged.
- The optimization goal is to generate high-quality animated images given a reference image and target pose.
Second Stage:
- The temporal layers are introduced into the previously pretrained model, with initialization from pretrained weights of AnimateDiff.
- The model’s input is a 24-frame video clip.
- Only the temporal layers are trained during this stage, while all other weights in the network remain fixed.
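A rough sketch of this two-stage parameter schedule, using hypothetical module handles; it only illustrates which parts are trainable in each stage.

```python
# Two-stage parameter schedule: which modules are trained vs. frozen.
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def stage1(denoising_unet, reference_net, pose_guider, vae, clip_image_encoder):
    # Stage 1: train the (temporal-layer-free) Denoising UNet, ReferenceNet and
    # Pose Guider; keep the VAE and the CLIP image encoder frozen.
    for m in (denoising_unet, reference_net, pose_guider):
        set_trainable(m, True)
    for m in (vae, clip_image_encoder):
        set_trainable(m, False)

def stage2(temporal_layers, frozen_modules):
    # Stage 2: only the newly inserted temporal layers (initialized from
    # AnimateDiff) are trained; everything else stays fixed.
    set_trainable(temporal_layers, True)
    for m in frozen_modules:
        set_trainable(m, False)
```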
3. Experimental Setup
The training dataset consists of 5,000 character video clips (2-10 seconds in duration) collected from the internet. Pose sequences (including body and hand poses) are extracted with DWPose and rendered as skeleton images in the OpenPose format. Training is performed on 4 NVIDIA A100 GPUs.
First Training Stage:
- A single video frame is sampled, resized, and center-cropped to a resolution of 768×768.
- Batch size is set to 64, and training is performed for 30,000 steps.
- Learning rate: 1e-5.
Second Training Stage:
- A 24-frame video sequence is used.
- Batch size is set to 4, and training is performed for 10,000 steps.
- Learning rate: 1e-5.
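For quick reference, the hyperparameters above collected into a plain dictionary (field names are illustrative, not from the authors' code):

```python
# Training setup summarized from the two stage lists above.
train_config = {
    "hardware": "4x NVIDIA A100",
    "stage1": {"input": "single frame", "resolution": (768, 768),
               "batch_size": 64, "steps": 30_000, "lr": 1e-5},
    "stage2": {"input": "24-frame clip",
               "batch_size": 4, "steps": 10_000, "lr": 1e-5},
}
```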
During inference, the driving pose skeleton is rescaled so that its length approximates that of the character in the reference image. The DDIM sampler is used with 20 denoising steps. To generate long videos, the authors adopt a temporal aggregation method that concatenates results from different batches.
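A hedged sketch of one simple way to aggregate overlapping windows into a long video by averaging predictions on overlapping frames; this is a generic scheme for illustration, not necessarily the exact aggregation method the paper adopts.

```python
# Average per-frame predictions from overlapping clips into one long sequence.
import torch

def aggregate(windows, starts, total_frames):
    """windows: list of (f, ...) tensors; starts: first frame index of each window."""
    acc, count = None, None
    for win, s in zip(windows, starts):
        if acc is None:
            acc = torch.zeros(total_frames, *win.shape[1:])
            count = torch.zeros(total_frames, *([1] * (win.dim() - 1)))
        acc[s:s + win.shape[0]] += win
        count[s:s + win.shape[0]] += 1
    return acc / count.clamp(min=1)          # averaged where windows overlap

# Example: two 24-frame windows with an 8-frame overlap -> 40 frames total.
w1, w2 = torch.randn(24, 4, 32, 32), torch.randn(24, 4, 32, 32)
video = aggregate([w1, w2], starts=[0, 16], total_frames=40)
print(video.shape)                           # torch.Size([40, 4, 32, 32])
```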
- Author: Sylvia
- Link: https://vibesylvia.top/article/8d09d850-ddb1-4e13-a917-3ff8b16ae934
- Notice: This article is licensed under CC BY-NC-SA 4.0. Please credit the source when reposting.