Tensor Cores

Related Articles from SNS

Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

arXiv:2606.05081v1. Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that...

arXiv CS 6d ago
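
The BLEST abstract above is truncated, so the framework's actual kernels are not shown here. As a rough illustration of the general idea of treating BFS as a bit-level sparse matrix-vector product, the sketch below stores each adjacency row as an integer bitmask and computes the next frontier by OR-ing the rows selected by the current frontier, then masking out visited vertices. The names `edges_to_bits` and `bfs_levels` are hypothetical, the code runs on the CPU in plain Python, and it does not use Tensor Cores or reproduce BLEST's method.

```python
def edges_to_bits(n, edges):
    """Build a bit-level adjacency structure: bit v of adj[u] is set for edge u -> v."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
    return adj


def bfs_levels(adj_bits, source):
    """Return the BFS level of every vertex (-1 if unreachable from source).

    Each iteration is a Boolean matrix-vector product: the next frontier is
    the OR of the adjacency rows of all frontier vertices, with already
    visited vertices masked out.
    """
    n = len(adj_bits)
    levels = [-1] * n
    frontier = 1 << source      # bit vector holding the current frontier
    visited = frontier          # bit vector holding every vertex seen so far
    depth = 0
    while frontier:
        nxt = 0
        f = frontier
        while f:
            low = f & -f                # isolate the lowest set bit
            u = low.bit_length() - 1    # index of that frontier vertex
            levels[u] = depth
            nxt |= adj_bits[u]          # OR in row u of the adjacency matrix
            f ^= low                    # clear the bit and continue
        frontier = nxt & ~visited       # keep only newly discovered vertices
        visited |= frontier
        depth += 1
    return levels


# Small directed graph; vertex 5 has no incoming edges and stays unreachable.
adj = edges_to_bits(6, [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
print(bfs_levels(adj, 0))
```

A GPU formulation would pack these bit vectors into machine words and process many rows per instruction; the bitmask integers here only stand in for that layout.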

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

arXiv:2606.08761v1. W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware...

arXiv CS 1d ago

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO...

arXiv CS 5d ago

Hierarchical Recursive Precision for Accelerating Symmetric Linear Solves on MXUs

Symmetric positive-definite system solvers based on Cholesky factorization are fundamental to many scientific applications, such as climate modeling. We present a portable, nested recursive mixed-precision solver designed for Matrix Processing Units (MXUs), including NVIDIA Tensor Cores (H200) and AMD Matrix Cores (MI300X), that assigns low-precision FP16 arithmetic to large off-diagonal blocks, while preserving high precision on diagonal blocks to ensure...

arXiv CS 8d ago

CART: Context-Anchored Recurrent Transformer -- A Parameter-Efficient Architecture with Learned Stability

arXiv:2606.01495v2. We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps the recurrence stable: its spectral...

arXiv CS 6d ago

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

arXiv:2605.30409v1. Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs:...

arXiv CS 9d ago

NVIDIA's RTX Spark is an AI "superchip" that will power Windows laptops and desktops

The company claims it offers 1 petaflop of AI computing power. It was only a matter of time before NVIDIA released a powerful system-on-a-chip (SOC) to take on AMD's Ryzen AI Max and Qualcomm's latest Snapdragon X2 chips. At Computex today, NVIDIA unveiled the RTX Spark, a "superchip" meant to give both laptops and small desktops fast AI and graphics performance.

Engadget 9d ago

Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment

arXiv:2606.06527v2. Energy-efficient neural-network inference at the edge requires reducing arithmetic cost, memory traffic, computation energy, and storage overhead while maintaining acceptable accuracy. This paper presents an ablation-focused study of NVFP4 quantization for edge-efficient neural networks, with emphasis on the relationship between activation precision, weight precision, block-size scaling, retraining, and model accuracy. NVFP4 activations are...

arXiv CS 1d ago

MixFP4: Enhancing NVFP4 with Adaptive FP4/INT4 Block Representations

As large language models continue to scale, fine-grained block-scaled low-precision formats such as NVFP4 are increasingly adopted for their substantial throughput and memory benefits. However, a single FP4 micro-format often mismatches heterogeneous block-level tensor statistics. To address this without changing the standard block-scaled MMA/GEMM execution path, we propose MixFP4, a mixed micro-format extension to NVFP4 that selects between two stored FP4...

arXiv CS 9d ago