Home › Knowledge Base › Vision Transformer

Vision Transformer

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

new Abstract: Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps,...

arXiv CS 2d ago

Phase Marginalization for Patch-Grid Instability in Vision Transformers

arXiv:2606.08132v1 Announce Type: new Abstract: Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in...

arXiv CS 1d ago

PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers

arXiv:2606.03257v1 Announce Type: new Abstract: Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performance. However, their large sizes limit their deployments for resource-constrained embedded platforms, underscoring the needs of model compression. One of prominent compression techniques is pruning, and the state-of-the-art works employ unstructured pruning techniques to compress SViT models.

arXiv CS 7d ago

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

Announce Type: replace Abstract: Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter,...

arXiv CS 1d ago

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

Announce Type: replace Abstract: Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new...

arXiv CS 6d ago

PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers

arXiv:2606.03428v1 Announce Type: new Abstract: The large sizes of Spiking Vision Transformers (SViTs) still hinder their embedded implementation, highlighting the need for model compression. State-of-the-art works compress SViT models through unstructured pruning, which needs specialized hardware accelerators for their specific sparsity patterns to maximize efficiency gains. Moreover, their manual approach requires a huge design time to find an appropriate pruning setting for each network,...

arXiv CS 7d ago

Vision Transformer Finetuning Benefits from Non-Smooth Components

arXiv:2602.06883v3 Announce Type: replace Abstract: The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their \emph{plasticity}.

arXiv CS 6d ago

Self-Supervised Vision Transformers for CBCT-Based Detection of Temporomandibular Joint Osteoarthritis

new Abstract: Temporomandibular joint osteoarthritis (TMJ OA) is a prevalent degenerative condition whose osseous changes are often subtle on cone-beam CT (CBCT), making automated detection challenging. We study how well the DINO family of self-supervised vision transformers -- DINOv1, DINOv2, DINOv2+reg, and RAD-DINO (a radiology-pretrained variant) -- transfers to CBCT, asking how much backbone adaptation is needed and of what kind. We propose a simple slice-based pipeline using Vision...

arXiv CS 1d ago

An Open-Source Two-Stage Computer Vision Pipeline for Fine-Grained Vehicle Classification using Vision Transformers

Announce Type: new Abstract: Vehicle body type is a significant determinant of cyclist injury severity in overtaking crashes, yet automated tools for classifying vehicles into injury-risk-relevant categories from naturalistic roadway video do not exist in the open literature. Standard object detection benchmarks provide only coarse vehicle labels (car, truck, bus, motorcycle), while existing fine-grained recognition systems are trained on controlled imagery and lack evaluation for deployment...

arXiv CS 6d ago

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

arXiv:2606.04373v2 Announce Type: replace Abstract: Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism compared to classical convolutional operation. However, previous DFQ arts for ViTs often suffer from a distribution mismatch between synthetic samples and input distribution expected by...

arXiv CS 2d ago