SigLIP

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

CLIP-like Model as a Foundational Density Ratio Estimator

Abstract: Density ratio estimation is a core concept in statistical machine learning because it provides a unified mechanism for tasks such as importance weighting, divergence estimation, and likelihood-free inference, but its potential in vision and language models has not been fully explored. Modern vision-language encoders such as CLIP and SigLIP are trained with contrastive objectives that implicitly optimize log density ratios between joint and marginal image-text...

arXiv CS 8d ago

KODA: Contrastive Representation Comparison and Alignment for Vision-Language Foundation Models

Abstract: Vision-language foundation models such as CLIP and SigLIP provide widely used representations for multimodal learning systems. While these models are typically compared through downstream performance, such evaluations often do not explain how their representations differ structurally. In this work, we study this problem through the task of Contrastive Embedding Clustering: identifying sample subsets that are weakly clustered under one representation but strongly clustered under...

arXiv CS 6d ago

Unified Pix Token And Word Token Generative Language Model

arXiv:2605.14028v2 Abstract: Since its emergence, the Vision Transformer (ViT) has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT obtained by the CLIP or SigLIP method serves as the vision encoder backbone that helps them acquire visual understanding capabilities. But this method leads to limitations in visual understanding of details, such as difficulty in recognizing...

arXiv CS 5d ago

Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment

Abstract: Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that...

arXiv CS 8d ago

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

arXiv:2606.04457v1 Abstract: Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion...

arXiv CS 6d ago

Revisiting Model Stitching In the Foundation Model Era

arXiv:2603.12433v3 Abstract: Model stitching, connecting early layers of one model (source) to later layers of another (target) via a light stitch layer, has served as a probe of representational compatibility. Prior work finds that models trained on the same dataset remain stitchable (negligible accuracy drop) despite different initializations or objectives. We revisit stitching for Vision Foundation Models (VFMs) that vary in objectives, data, and modality mix (e.g.,...

arXiv CS 6d ago