CUDA Cores
Related Articles from SNS
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
arXiv:2606.08761v1. W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware...
Nvidia's Grace Blackwell superchips are officially coming to the PC with RTX Spark notebooks
COMPUTEX 2026: It only took a year and a half but the same silicon at the heart of Nvidia's DGX Spark AI workstations will soon be powering Windows PCs. During his GTC Taiwan keynote on Monday, Nvidia CEO Jensen Huang revealed the N1X, a high-end mobile processor that combines an Arm-based CPU co-designed with MediaTek with a Blackwell-based GPU on board. Marketed under the “RTX Spark” banner, Nvidia’s new notebooks and mini PCs signal a deeper push into the PC arena long dominated by...
I Put a Datacenter GPU in My Gaming PC for £200
I already had an RTX 4080. Good enough for gaming, not good enough for the models I wanted to run locally. The next step up in GPU land is either spend a fortune on a card with more VRAM, or find another way.
Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs
arXiv:2606.05081v1. Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that...
Nvidia’s RTX Spark Laptops Look Hell-Bent on Disruption
The moment many have been waiting years for has arrived. Nvidia has made the graphics cards that powered the Windows PC ecosystem for decades; now it wants to control the whole thing with “superchips,” starting with the RTX Spark. Announced over the weekend at the Computex tech expo in Taiwan, RTX Spark chips combine unified memory, RTX graphics, and the new part: the N1 CPU.
SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines
arXiv:2606.05495v1. Achieving peak GPU performance remains a significant challenge as the system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for...
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO...
Efficient Parallel Algorithms for Hypergraph Matching
arXiv:2602.22976v3. We present efficient parallel algorithms for computing maximal matchings in hypergraphs. Our algorithm finds locally maximal edges in the hypergraph and adds them in parallel to the matching. In the CRCW PRAM models our algorithms achieve $O(\log{\log{\Delta}}\log{m})$ time with $O(\kappa\log {m})$ work w.h.p. where $m$ is the number of hyperedges, and $\kappa$ is the sum and $\Delta$ is the maximum of all vertex degrees.
Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering
arXiv:2605.30583v1. We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library,...
AI Agent Guidelines for CS336 at Stanford
This file provides instructions for AI coding assistants (like ChatGPT, Claude Code, GitHub Copilot, Cursor, etc.) working with students in CS336. AI agents should function as teaching aids that help students learn through explanation, guidance, and feedback—not by completing assignments for them.