Home › Knowledge Base › Autoregressive Visual Generation Needs

Autoregressive Visual Generation Needs

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Autoregressive Visual Generation Needs a Prologue

arXiv:2605.06137v2 Announce Type: replace Abstract: In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction.

arXiv CS 9d ago

Video-Mirai: Autoregressive Video Diffusion Models Need Foresight

arXiv:2606.03971v1 Announce Type: new Abstract: Causal video generators must predict from the past, but they need not learn only from it. In streaming autoregressive video diffusion, each emitted segment becomes a commitment that future segments must preserve. Standard training, however, only asks each causal state to explain the present.

arXiv CS 7d ago

On the Limits of Token Reduction for Efficient Unified Vision Language Training

arXiv:2606.01503v1 Announce Type: new Abstract: Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental...

arXiv CS 8d ago

SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

arXiv:2603.18599v2 Announce Type: replace Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework.

arXiv CS 7d ago

Representation Forcing for Bottleneck-Free Unified Multimodal Models

Announce Type: new Abstract: Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels.

arXiv CS 9d ago

Representation Forcing for Bottleneck-Free Unified Multimodal Models

arXiv:2605.31604v2 Announce Type: replace Abstract: Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels.

arXiv CS 6d ago

Nvidia Cosmos 3

Physical AI systems must understand the real world before they can act within it. Robots, autonomous vehicles, and smart spaces need to understand what’s happening in their world, predict what’s likely to happen next, and generate actions for specific environments, embodiments, and tasks. NVIDIA Cosmos 3 is a frontier foundation model for physical AI that combines physical reasoning, world generation, and action generation within a single open model.

Hacker News 9d ago

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

Announce Type: new Abstract: Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer...

arXiv CS 2d ago