
Unified Continuous and Discrete Visual Tokenization


Related Articles from SNS

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

arXiv:2605.30904v1. Announce type: new. Abstract: Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer...

arXiv CS 9d ago

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Announce type: new. Abstract: Autonomous driving requires reasoning about how ego actions shape the evolution of the surrounding world. However, most end-to-end methods rely on direct state-to-action mappings, capturing correlations without explicitly modeling action-conditioned dynamics. Conversely, continuous-latent world models often lack compositional structure for causal reasoning across counterfactual futures.

arXiv CS 5d ago

ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Discrete Diffusion Models

arXiv:2512.14099v3. Announce type: replace. Abstract: Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce ViewMask-1-to-3, formulating multi-view generation as a discrete sequence modeling problem where each viewpoint is represented as visual tokens from MAGVIT-v2. Through discrete diffusion via masked token prediction, our approach enables progressive multi-view...

arXiv CS 6d ago

Wavelet as Tokenizer: Preliminary Results on a Shared Wavelet Token Schema for Natural Signals

arXiv:2606.02631v1. Announce type: cross. Abstract: This paper studies whether audio, images, and video can share a common wavelet token schema rather than relying on separate modality-specific latent grids. It introduces a preliminary continuous-token model built around a one-level Haar DWT/IDWT frontend, a shared coefficient-token layout, optional structural metadata, lightweight modality value adapters, and a shared token-wise encoder-decoder trunk. On Speech Commands, EuroSAT RGB, and...

arXiv CS 7d ago

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Announce type: new. Abstract: Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for...

arXiv CS 1d ago

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

Announce type: replace. Abstract: Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces.

arXiv CS 2d ago

ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation

arXiv:2605.27852v3. Announce type: replace. Abstract: Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates...

arXiv CS 1d ago
