Universal Vision-Language Models
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Defending Jailbreak Attacks on Large Language Models via Manifold Trajectory Kinetics
arXiv:2606.07335v1 Announce Type: new Abstract: Jailbreak prompts can bypass alignment guardrails in large language models (LLMs) and elicit unsafe outputs, making reliable deployment-time detection critical. Prior detection approaches largely rely on a fixed metric space, e.g., raw inputs, gradients, or hidden features, in which benign and jailbreak prompts are linearly separable. We show this assumption breaks under (i) pseudo-malicious prompts that are benign by intent but contain...
Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control
arXiv:2512.23292v4 Announce Type: replace Abstract: The prevailing paradigm in AI for physical systems: scaling general-purpose foundation models toward universal multimodal reasoning, confronts a barrier at the control interface. Frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. Safety-critical control demands outcome-space guarantees...
Agentic Physical AI toward a Domain-Specific Foundation Model for Energy Systems: A Case Study on Nuclear Reactor Control
arXiv:2512.23292v5 Announce Type: replace Abstract: The prevailing paradigm in AI for physical systems: scaling general-purpose foundation models toward universal multimodal reasoning, confronts a barrier at the control interface. Frontier vision-language models achieve only 50-53% accuracy on basic quantitative physics tasks, behaving as approximate guessers that preserve semantic plausibility while violating physical constraints. Safety-critical control demands outcome-space guarantees...
MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence
arXiv:2606.02463v1 Announce Type: new Abstract: In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality.
DiTTo: Scalable Order-aware All-in-One Image Restoration Agent
arXiv:2605.30915v1 Announce Type: new Abstract: Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal...
BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
arXiv:2604.10528v5 Announce Type: replace Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on...
BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs
Announce Type: replace Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it yet remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations...
DiTTo: Scalable Order-aware All-in-One Image Restoration Agent
Announce Type: replace Abstract: Real-world images rarely suffer from a single degradation, and the order in which degradations are removed substantially affects the final restoration quality, motivating agent-based image restoration (IR), where a vision-language model schedules a pool of pre-built restoration-experts. However, existing training-based agents require $\mathcal{O}((N^{\mathbf{D}})^{2})$ restoration-expert calls per image to construct the Optimal Restoration-action Trajectory...
ObjEmbed: Towards Universal Multimodal Object Embeddings
arXiv:2602.01753v3 Announce Type: replace Abstract: Aligning objects with corresponding textual descriptions is a fundamental challenge and a realistic requirement in vision-language understanding. While recent multimodal embedding models excel at global image-text alignment, they often struggle with fine-grained alignment between image regions and specific phrases. In this work, we present ObjEmbed, a novel MLLM embedding model that decomposes the input image into multiple regional...
VISTA: Vision-Grounded and Physics-Validated Adaptation of UMI data for VLA Training
Announce Type: new Abstract: Universal Manipulation Interface (UMI) enables scalable real-world robot data collection without hardware-specific teleoperation, yet leveraging UMI data to train large-scale Vision-Language-Action (VLA) models remains fundamentally challenging. We identify two critical mismatches: wrist-mounted fisheye views, with severe radial distortion and local gripper-centric perspectives, are out-of-distribution for pretrained VLMs; and human-collected trajectories...