GPU
Related Articles from SNS
MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
arXiv:2606.04847v1. Abstract: Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends.
GPU accelerated population genetics statistics using pg_gpu
Population genetics summary statistics (diversity, divergence, linkage disequilibrium, selection scans, and dimensionality reduction) are fundamental across human, agricultural, and ecological genomics. As whole-genome sequencing datasets have grown to hundreds of thousands of individuals, the cost of computing these statistics on conventional CPU implementations has become a major bottleneck: windowed scans of a single chromosome arm can take hours to days, and computation of pairwise...
Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers
arXiv:2606.06910v1. Abstract: In this paper we describe a communication-strategy study for multi-GPU three-dimensional finite-difference time-domain computation with convolutional perfectly matched layer boundary conditions using CUDA. The metrics used to determine the most effective implementation include runtime, throughput in millions of output points per second, strong-scaling efficiency, CPML overhead, host-staged versus direct GPU-to-GPU exchange speedup, and...
GNStor: Design of GPU-Native High-Performance Remote All-Flash Array
arXiv:2606.04908v1. Abstract: GPU has become the leading computing device for a wide range of data-intensive applications, which tightly collaborates with remote all-flash array (AFA) to accommodate ever-expanding datasets, facilitate multi-client data sharing, and guarantee fault tolerance. Although GPU is the center of computation, all I/O processes in existing GPU-AFA systems are still CPU-centric. CPU orchestrates remote I/O requests and executes a centralized AFA...
AgileOS: A GPU Operating System Layer for Protected CUDA Services
Abstract: Modern GPU applications increasingly interact with storage systems, network devices, vendor libraries, and GPU-resident services rather than executing only isolated compute kernels. This shift creates a need for operating-system-like protection around GPU services, where service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels. However, today's CUDA programming model, by...
UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms
arXiv:2605.30313v3. Abstract: Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU.
UniLab: A Heterogeneous Architecture for Robot RL Beyond GPU-Dominant Paradigms
arXiv:2605.30313v2. Abstract: Simulation-based RL for contemporary robot control is increasingly organized around GPU-resident simulation: physics, rollout collection, and learning are placed on a single GPU-centric execution path. This paradigm has greatly improved training speed, but it has also encouraged a default assumption that efficient training requires physics to reside on the GPU. We revisit this assumption.
GPU Acceleration of Collinear and Noncollinear DFT Using a Numerical Atomic Orbital-Based DFT Code
Abstract: We implement GPU acceleration of collinear and noncollinear density functional theory (DFT) calculations in the numerical atomic orbitals (NAOs) code OpenMX by offloading matrix multiplications and eigenvalue solves (plus selected auxiliary steps) to cuBLAS/cuSOLVER and OpenACC. Benchmarks on the Pegasus supercomputer (per node: a 48-core Intel Xeon Platinum 8468 CPU and one NVIDIA H100 GPU) compare GPU-accelerated and CPU-only runs under identical settings. For a 512-atom...
I Put a Datacenter GPU in My Gaming PC for £200
I already had an RTX 4080. Good enough for gaming, not good enough for the models I wanted to run locally. The next step up in GPU land is either to spend a fortune on a card with more VRAM, or to find another way.
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
Abstract: GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches...