CUDA Accelerator
Related Articles from SNS
Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering
arXiv:2605.30583v1. Abstract: We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library,...
Nvidia RTX Spark
RTX Spark Superchip with a Blackwell RTX GPU, an ultra-efficient CPU, FP4 AI performance, and unified memory. CUDA, the software that accelerates the world’s AI, runs natively on RTX Spark.
CodegenBench: Can LLMs Write Efficient Code Across Architectures?
arXiv:2606.04023v1. Abstract: While large language models (LLMs) have been extensively evaluated on code generation tasks for general-purpose programming and GPU-accelerated environments (e.g., PyTorch, CUDA), their capabilities in CPU-oriented high-performance computing (HPC) across diverse architectures remain underexplored. To bridge this gap, we introduce CodegenBench, a comprehensive benchmark suite designed to evaluate the generation of efficient parallel code across...
HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces
Abstract: Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes.
UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms
arXiv:2605.30313v3 (replaces v2). Abstract: Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption.
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO...
CNBC's The China Connection newsletter: China learns to build without Nvidia
Hi, this is Evelyn, writing to you from Beijing. Welcome to the latest edition of The China Connection — a succinct snapshot of what I'm seeing and hearing from local businesses. China's tech self-sufficiency push is rapidly becoming a reality as companies focus on business questions that run deeper than geopolitics.
RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models
Abstract: Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality.
AtlasRAN: Timing-Aware Evaluation of Open-source 5G Platforms for Integrated Wireless Testbeds
Abstract: Open-source 5G and O-RAN experimentation now spans discrete-event simulators, host-OS emulators, SDR hardware-in-the-loop testbeds, O-RU/Open Fronthaul deployments, wireless digital twins, and accelerator-backed RAN runtimes. These environments may expose similar protocol interfaces while preserving very different timing, I/O, synchronization, buffering, transport, and observability behavior. Thus, studies that appear to measure the same network property may...