CUDA
No mentions found
This entity hasn't been tracked yet, or Iris is still building its knowledge base.
Related Articles from SNS
LLM-Based Porting of Optimized C++ to CUDA Through Deoptimization and Reoptimization
arXiv:2606.06063v1 Announce Type: new Abstract: When porting high-performance computing (HPC) code from CPU to GPU, CPU-oriented optimizations may obstruct LLM-based CUDA translation. We design and evaluate a Deopt-Reopt workflow that first simplifies the input C++ code and then retranslates and reoptimizes it for CUDA, comparing it against direct translation (Direct) on twelve HPC kernels with two LLMs (gpt-oss-120b (O120) and qwen-3-235b-a22b-instruct-2507 (Q235)) in Single-shot (one pass)...
AgileOS: A GPU Operating System Layer for Protected CUDA Services
Announce Type: new Abstract: Modern GPU applications increasingly interact with storage systems, network devices, vendor libraries, and GPU-resident services rather than executing only isolated compute kernels. This shift creates a need for operating-system-like protection around GPU services, where service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels. However, today's CUDA programming model, by...
SET: Stream-Event-Triggered Scheduling for Efficient CUDA Graph Pipelines
arXiv:2606.05495v1 Announce Type: new Abstract: Achieving peak GPU performance remains a significant challenge as the system throughput is constrained by host-device synchronization delays and kernel scheduling overheads, even with aggressive kernel optimizations and batch processing. Furthermore, existing approaches often underutilize hardware resources such as compute cores and copy engines due to scheduling overheads. To address these problems, we propose a CUDA runtime framework for...
Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Announce Type: replace Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs. We present the first independent, cross-architecture evaluation of CuTile against established approaches such as cuBLAS, Triton, WMMA, and raw SIMT on three NVIDIA GPUs spanning Hopper and Blackwell: H100 NVL, B200, and RTX PRO...
Caspar: CUDA Accelerator for Symbolic Programming with Adaptive Reordering
arXiv:2605.30583v1 Announce Type: new Abstract: We present Caspar, a library that makes the power of modern GPUs more accessible in robotics and provides a state-of-the-art nonlinear GPU solver that can be applied to a wide range of different optimization problems. Caspar bridges the gap between expressive symbolic programming in Python and high-performance GPU runtimes in C++ by automatically generating optimized CUDA kernels from symbolic expressions. Building on the SymForce library,...
Fast-Vollib: A Fast Implied Volatility Library for Pythonwith PyTorch, JAX, and CUDA Fused-Kernel Backends
arXiv:2604.27210v2 Announce Type: replace-cross Abstract: We present fast-vollib, an open-source Python library that provides high-performance European option pricing, implied volatility (IV) computation, and Greeks under the Black-76, Black-Scholes, and Black-Scholes-Merton models. The library is designed as a drop-in alternative to the de-facto-standard py_vollib and py_vollib_vectorized packages, with pluggable PyTorch and JAX execution backends, a CUDA fused-kernel Triton contribution...
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Summary: The article introduces Tiny-vLLM, a high-performance LLM inference engine developed in C++ and CUDA. The engine is designed to provide efficient and scalable inference for large language models, making it suitable for various applications such as natural language processing and machine learning. The article highlights the key features and benefits of Tiny-vLLM, including its ability to handle large models and its compatibility with various hardware platforms.
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
arXiv:2606.08761v1 Announce Type: new Abstract: W4A4 quantization promises full utilization of INT4 Tensor Cores, yet group dequantization overhead on CUDA Cores has driven existing systems to mixed-precision fallbacks. We present the first systematic study of how intra-SM compute balance governs this bottleneck. Through controlled benchmarks across four GPUs from Ampere and Ada architectures, we identify the Tensor Cores to CUDA Cores throughput ratio ($\rho$) as the primary hardware...
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis
arXiv:2606.09682v1 Announce Type: new Abstract: AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch:...
Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation
Announce Type: replace Abstract: Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data are scarce. We present an automated dataset generation pipeline featuring a dual-LLM Questioner-Solver design that incorporates external knowledge from compilers and runtime feedback. Beyond traditional source-target code...