
Visual Impressions

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

When Do Diffusion Models learn to Generate Multiple Objects?

arXiv:2605.00273v2 (announce type: replace). Abstract: Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself.

arXiv CS 1d ago

SymTRELLIS: Symmetry-Enforced Voxel Latents for 3D Generation

Announce type: new. Abstract: Single-view 3D generative models have achieved impressive visual quality, yet they are not designed to satisfy structural or functional requirements and, in practice, often fall short. Symmetry is one such requirement: even subtle symmetry violations can render a model physically unusable. We present SymTRELLIS, a method that enforces arbitrary finite point group symmetries (rotational, reflectional, and polyhedral) during the flow-based 3D generation...

arXiv CS 6d ago

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

arXiv:2601.23286v4 (announce type: replace). Abstract: While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised...

arXiv CS 1d ago

Enhancing Adversarial Robustness with Signed Distance Fields for Harmonizing Geometric Invariance and Texture

arXiv:2602.05175v2 (announce type: replace). Abstract: Deep neural networks demonstrate impressive performance in visual recognition but remain highly vulnerable to imperceptible adversarial attacks. Existing defense strategies such as adversarial training and diffusion-based purification have achieved significant progress but are frequently constrained by high computational cost, information loss, and inference latency.

arXiv CS 1d ago

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

arXiv:2602.11790v2 (announce type: replace). Abstract: Although recent end-to-end video generation models demonstrate impressive performance in visually oriented content creation, they remain limited in scenarios that require strict logical rigor and precise knowledge representation, such as instructional and educational media. To address this problem, we propose LASEV, a hierarchical LLM-based multi-agent system for generating high-quality instructional videos from educational problems. LASEV...

arXiv CS 8d ago

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

arXiv:2606.04811v1 (announce type: new). Abstract: Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it...

arXiv CS 6d ago

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

arXiv:2606.04811v2 (announce type: replace). Abstract: Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion...

arXiv CS 5d ago

Coding Agent Is Good As World Simulator

arXiv:2605.14398v2 (announce type: replace). Abstract: World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible,...

arXiv CS 8d ago

OmniCap-IF: Benchmarking and Improving Instruction Following Abilities for Omni-Video Captioning

arXiv:2606.08572v1 (announce type: new). Abstract: While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this...

arXiv CS 1d ago

Seeing is Believing: Aligning Prompt Rewriting with Visual Anchors for Text-to-Image Generation

Announce type: new. Abstract: Despite the impressive capabilities of text-to-image (T2I) models, an intent-generation gap often persists due to the brevity and ambiguity of user prompts. Existing approaches primarily polish the prompt for fluency and readability. However, the enhancement process still lacks visual grounding.

arXiv CS 1d ago