Visual Impressions
Related Articles from SNS
When Do Diffusion Models Learn to Generate Multiple Objects?
arXiv:2605.00273v2. Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself.
SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation
Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements and, in practice, often fall short. Symmetry is one such requirement: even subtle violations of symmetry can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation...
VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation
arXiv:2601.23286v4. While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised...
Enhancing Adversarial Robustness with Signed Distance Fields for Harmonizing Geometric Invariance and Texture
arXiv:2602.05175v2. Deep neural networks demonstrate impressive performance in visual recognition but remain highly vulnerable to imperceptible adversarial attacks. Existing defense strategies such as adversarial training and diffusion-based purification have achieved significant progress but are frequently constrained by high computational cost, information loss, and inference latency.
Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
arXiv:2602.11790v2. Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV...
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
arXiv:2606.04811v1. Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it...
Coding Agent Is Good As World Simulator
arXiv:2605.14398v2. World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible,...
OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning
arXiv:2606.08572v1. While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this...
Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation
Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding.