Alternating Vision Transformer
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning
Announce Type: new Abstract: Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through:...
Vision Hopfield Memory Networks for Image Recognition
Announce Type: replace Abstract: Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates...
GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization
arXiv:2605.26092v4 Announce Type: replace Abstract: The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is...
The people who actually want AI to replace humanity
“I want AI to be a tool that allows human flourishing!” exclaimed Brad Carson, a former member of Congress. “There is an option out there where AI is just a tool for us.” The people who actually want AI to replace humanity We need to create a new humanism before the “AI successionists” win.
GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation
arXiv:2606.01412v1 Announce Type: new Abstract: Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is...
TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation
new Abstract: Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in an image sequence acquired during spacecraft manoeuvres. Few temporal approaches require full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters...
Vision-Language Models Mistake Head Orientation for Gaze Direction: Nonverbal Conversation Cues
arXiv:2506.05412v4 Announce Type: replace Abstract: Where someone looks is a nonverbal communication cue that children and adults readily use. How well can Vision-Language Models (VLMs) infer gaze targets? To construct evaluation stimuli, we captured 1,360 real-world photos of scenes in which a person gazes at one of several objects on a table.
Med-URWKV{\dag}: Toward Enhanced Pretrained Pure VRWKV Models for Medical Image Segmentation
arXiv:2506.10858v2 Announce Type: replace-cross Abstract: Medical image segmentation is a fundamental task in computer-aided diagnosis and treatment. Existing approaches based on CNNs, ViTs, Mamba, and hybrid models still suffer from limitations such as restricted receptive fields, high computational cost, or insufficient accuracy. Recently, Vision Receptive-field Weighted Key-Value (VRWKV) models have emerged as a promising alternative,delivering strong long-range dependency modeling for...
ChannelTok: Efficient Flexible-Length Vision Tokenization
Announce Type: new Abstract: Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone.
‘Happiness is not just about GDP’: ambitious plan or utopia?
Some will question its credibility. But the alternative future to the one imagined in the World Justice Report is far more bleak• Academics set out sweeping vision for planetary survivalIn our increasingly dystopian world, who wouldn’t want to at least be open to a utopian antidote? The World Justice Report, published on Thursday, outlines how to build a prosperous, equitable world within safe planetary boundaries.