VFM

Related Articles from SNS

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Abstract: Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the...

arXiv CS 6d ago

$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

Abstract: Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer.

arXiv CS 8d ago

Revisiting Model Stitching In the Foundation Model Era

arXiv:2603.12433v3. Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g.,...

arXiv CS 6d ago

DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

arXiv:2606.06918v1. Abstract: The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task.

arXiv CS 2d ago

FMplex: Model Virtualization for Serving Extensible Foundation Models

arXiv:2606.09643v1. Abstract: Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM...

arXiv CS 1d ago

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

arXiv:2605.15741v2. Abstract: Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifold. Diverging from...

arXiv CS 6d ago