CLIP Score
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
arXiv:2605.12969v3 Announce Type: replace Abstract: Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in...
GP-Adapter: Gaussian Process CLIP-Adapter for Few-Shot Out-of-Distribution Detection
arXiv:2606.07102v1 Announce Type: new Abstract: We propose GP-Adapter, a training-free framework that augments CLIP (Contrastive Language-Image Pre-training) with Gaussian Process (GP) uncertainty modeling for few-shot classification and out-of-distribution (OOD) detection. While CLIP achieves strong zero-shot recognition, it yields deterministic similarity scores and offers limited uncertainty information, which is critical under distribution shift and data scarcity. GP-Adapter constructs...
The Unreasonable Redundancy of Nature's Protein Folds
The Unreasonable Redundancy of Nature's Protein Folds Over the last few years, deep neural networks have made generative language modeling dramatically more powerful, giving us large language models. A similar leap happened for continuous modalities like images and videos.
SVHighlights: Towards Extremely Long Sport Video Highlight Detection
new Abstract: While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports...
Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation
Announce Type: new Abstract: Variational autoencoders (VAEs) compress high resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single...
Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning
Announce Type: new Abstract: T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5...
Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation
arXiv:2606.05290v1 Announce Type: new Abstract: Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety...
Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning
arXiv:2606.05816v2 Announce Type: replace Abstract: T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and...
The best player on the Dodgers not named Ohtani? A...
Andy Pages didn't just struggle in the Dodgers' playoff run last year -- he was historically bad. As a 24-year-old being counted on for the first time, he went 4-for-51 with 11 strikeouts and zero walks and was benched in the middle of the World Series. Among those with at least 50 plate appearances, his .211 OPS was the lowest ever for a single postseason.
ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions
arXiv:2606.05635v1 Announce Type: Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats.