Diffusion Transformers

Related Articles from SNS

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Announce Type: new Abstract: Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive...

arXiv CS 9d ago

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation

arXiv:2606.02090v1 Announce Type: new Abstract: Diffusion transformer (DiT) has been widely adopted in the generative diffusion field, advancing the denoising of query tokens through attention and Feed-Forward (FFN) layers. FFN actually acts as the key-value vocabulary for decoding visual contents where the value embeds the visual semantical knowledge. We present that focusing on critical query tokens corresponding to more complex details and encouraging the model to improve these...

arXiv CS 8d ago

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

arXiv:2606.06875v1 Announce Type: new Abstract: Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in...

arXiv CS 2d ago

LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution

arXiv:2606.09250v1 Announce Type: new Abstract: Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tuning. ControlNet-style adapters lose their efficiency under modern Diffusion Transformers since the absence of encoder-decoder hierarchy forces...

arXiv CS 1d ago

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

arXiv:2606.07053v1 Announce Type: new Abstract: Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions.

arXiv CS 2d ago

Real-Time AttentionBender: Granular Interactive Network Bending of Video Diffusion Transformers

arXiv:2606.06497v1 Announce Type: new Abstract: Generative video models have achieved remarkable visual fidelity, yet their prompt-only interface offers thin creative agency and obscures the model's material process from the artists working with it. We present Real-Time AttentionBender, a tool that extends the practice of network bending across the full depth of the video diffusion transformer (DiT) and brings it into live, interactive generation. Built as a plugin within the DayDream Scope...

arXiv CS 2d ago

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

Announce Type: replace Abstract: Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to...

arXiv CS 6d ago

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

arXiv:2605.30409v1 Announce Type: new Abstract: Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs:...

arXiv CS 9d ago