CUDA Cores

Related Articles from SNS

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

arXiv:2606.08761v1 Announce Type: new Abstract: W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware...

arXiv CS 1d ago

Nvidia's Grace Blackwell superchips are officially coming to the PC with RTX Spark notebooks

COMPUTEX 2026: It only took a year and a half, but the same silicon at the heart of Nvidia's DGX Spark AI workstations will soon be powering Windows PCs. During his GTC Taiwan keynote on Monday, Nvidia CEO Jensen Huang revealed the N1X, a high-end mobile processor that combines an Arm-based CPU co-designed with MediaTek with a Blackwell-based GPU on board. Marketed under the “RTX Spark” banner, Nvidia’s new notebooks and mini PCs signal a deeper push into a PC arena long dominated by...

The Register 9d ago

I Put a Datacenter GPU in My Gaming PC for £200

I already had an RTX 4080. Good enough for gaming, not good enough for the models I wanted to run locally. The next step up in GPU land is either spend a fortune on a card with more VRAM, or find another way.

Hacker News 10d ago

Graph Traversal on Tensor Cores: A BFS Framework for Modern GPUs

arXiv:2606.05081v1 Announce Type: new Abstract: Modern GPUs have Tensor Cores (TCs) capable of extremely high-throughput matrix operations, yet graph algorithms remain difficult to accelerate because of their irregular and data-dependent execution patterns. This work presents BLEST, a TC-accelerated framework that reformulates Breadth-First Search (BFS) as a bit-level sparse matrix-vector computation while addressing the load imbalance, memory inefficiency, and synchronization overheads that...

arXiv CS 6d ago

Nvidia’s RTX Spark Laptops Look Hell-Bent on Disruption

The moment many have been waiting years for has arrived. Nvidia has made the graphics cards that powered the Windows PC ecosystem for decades; now it wants to control the whole thing with “superchips,” starting with the RTX Spark. Announced over the weekend at the Computex tech expo in Taiwan, RTX Spark chips combine unified memory, RTX graphics, and the new part: the N1 CPU.

Wired 7d ago

SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines

arXiv:2606.05495v1 Announce Type: new Abstract: Achieving peak GPU performance remains a significant challenge as the system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for...

arXiv CS 5d ago

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Announce Type: replace Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO...

arXiv CS 5d ago

Efficient Parallel Algorithms for Hypergraph Matching

arXiv:2602.22976v3 Announce Type: replace Abstract: We present efficient parallel algorithms for computing maximal matchings in hypergraphs. Our algorithm finds locally maximal edges in the hypergraph and adds them in parallel to the matching. In the CRCW PRAM models our algorithms achieve $O(\log{\log{\Delta}}\log{m})$ time with $O(\kappa\log {m})$ work w.h.p. where $m$ is the number of hyperedges, and $\kappa$ is the sum and $\Delta$ is the maximum of all vertex degrees.

arXiv CS 2d ago

Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering

arXiv:2605.30583v1 Announce Type: new Abstract: We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library,...

arXiv CS 9d ago

AI Agent Guidelines for CS336 at Stanford

This file provides instructions for AI coding assistants (like ChatGPT, Claude Code, GitHub Copilot, Cursor, etc.) working with students in CS336. AI agents should function as teaching aids that help students learn through explanation, guidance, and feedback—not by completing assignments for them.

Hacker News 9d ago