Speedup

Related Articles from SNS

Subspace-selective unitary manipulation based on the Hilbert-space symmetric structures in the multiple-quantum operator algebra spaces in the quantum-computing speedup theory

arXiv:2606.03859v2 Announce Type: replace-cross Abstract: The quantum-computing speedup theory considers the symmetric structures and properties of quantum systems as the fundamental Quantum-Computing-Speedup (QCS) resources which are responsible for exponentially speeding up quantum computing and simulating. At present a large and important problem is how to make use of the fundamental QCS resources to speed up essentially quantum computing and simulating. Here the author makes a great...

arXiv Physics 5d ago

Subspace-selective unitary manipulation based on the Hilbert-space symmetric structures in the multiple-quantum operator algebra spaces in the quantum-computing speedup theory

Announce Type: cross Abstract: The quantum-computing speedup theory considers the symmetric structures and properties of quantum systems as the fundamental Quantum-Computing-Speedup (QCS) resources which are responsible for exponentially speeding up quantum computing and simulating. At present a large and important problem is how to make use of the fundamental QCS resources to speed up essentially quantum computing and simulating. Here the author makes a great effort toward solving this...

arXiv Physics 7d ago

Communication Strategy Selection for Multi-GPU 3D FDTD with Convolutional Perfectly Matched Boundary Layers

arXiv:2606.06910v1 Announce Type: new Abstract: In this paper we describe a communication-strategy study for multi-GPU three-dimensional finite-difference time-domain computation with convolutional perfectly matched layer boundary conditions using CUDA. The metrics used to determine the most effective implementation include runtime, throughput in millions of output points per second, strong-scaling efficiency, CPML overhead, host-staged versus direct GPU-to-GPU exchange speedup, and...

arXiv CS 2d ago
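
The speedup and strong-scaling efficiency metrics named in the abstract above follow standard definitions. A minimal Python sketch, assuming hypothetical function names and timings that are illustrative only and not taken from the paper:

```python
def speedup(t_baseline: float, t_new: float) -> float:
    """Ratio of baseline runtime to new runtime; values above 1 mean faster."""
    if t_baseline <= 0 or t_new <= 0:
        raise ValueError("runtimes must be positive")
    return t_baseline / t_new


def strong_scaling_efficiency(t_one: float, t_n: float, n_devices: int) -> float:
    """Speedup over the single-device run divided by the device count.

    The problem size is held fixed; 1.0 would be perfect scaling.
    """
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return speedup(t_one, t_n) / n_devices


if __name__ == "__main__":
    # Hypothetical timings in seconds (assumed, not measured).
    print(speedup(10.0, 5.0))
    print(strong_scaling_efficiency(8.0, 2.5, 4))
```

The same speedup ratio would apply to a host-staged versus direct GPU-to-GPU comparison, with the host-staged runtime as the baseline.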

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

Announce Type: new Abstract: Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes.

arXiv CS 8d ago

Design Space Exploration of DMA based Finer-Grain Compute Communication Overlap

arXiv:2512.10236v3 Announce Type: replace Abstract: Modern ML workloads demand distributing training and inference across multiple GPUs. However, these parallelization techniques often suffer from exposed critical-path communication, leaving a potential 1.7x speedup on the table through compute-communication overlap. Prior overlapping methods harness the fact that ML model state and inputs are already sharded into the number of GPUs, and overlap the compute and communication at shard...

arXiv CS 6d ago
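
To see where a figure such as 1.7x can come from, a simple idealized model helps: if compute and communication run back to back, the step time is their sum; if they overlap perfectly, it is the larger of the two. The sketch below is in Python and uses that assumed model with hypothetical timings; it is not the paper's analysis.

```python
def ideal_overlap_speedup(compute_s: float, comm_s: float) -> float:
    """Upper bound on speedup from hiding communication behind compute.

    Assumes a serial step costs compute_s + comm_s and a perfectly
    overlapped step costs max(compute_s, comm_s).
    """
    if compute_s < 0 or comm_s < 0:
        raise ValueError("times must be non-negative")
    overlapped = max(compute_s, comm_s)
    if overlapped == 0:
        raise ValueError("at least one time must be positive")
    return (compute_s + comm_s) / overlapped


if __name__ == "__main__":
    # Hypothetical split: 1.0 s of compute, 0.7 s of exposed communication.
    print(ideal_overlap_speedup(1.0, 0.7))
```

Under this model the bound never exceeds 2x, which is reached only when compute and communication take equal time.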

Fast and Robust On-Device Speaker Diarization: Relative Minimum Cluster Size for Stride-Accelerated Pipelines

Announce Type: cross Abstract: Speech applications such as meeting transcription and voice agents would benefit from on-device speaker diarization, but practical adoption is limited by inference cost. We study how far a Pyannote 3.1-based pipeline can be accelerated on consumer hardware (an RTX 5070 Ti GPU and an Apple M4 laptop) while preserving diarization error rate (DER). A simple recipe (coarser segmentation stride and per-chunk embedding) yields multi-fold speedups and is DER-neutral...

arXiv CS 1d ago

Speedrunning Tabular Foundation Model Pretraining

arXiv:2606.03681v1 Announce Type: new Abstract: Pretraining cost is a major bottleneck for research on tabular foundation models, slowing the iteration cycle for new architectures, priors, and optimization ideas. Yet the community lacks a simple way to compare and accumulate pretraining speedups.

arXiv CS 7d ago

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

Announce Type: new Abstract: Speculative decoding accelerates generation by verifying multiple drafted tokens in a single target-model forward pass, reducing sequential decoding iterations. Model-free variants avoid auxiliary draft models by reusing text and model states already available during generation, but their speedup depends on the reliability of the constructed drafts. We identify two limitations of existing reuse-based methods: lexically anchored retrieval has limited recall under...

arXiv CS 5d ago
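
The model-free idea in the abstract above can be illustrated with a generic n-gram lookup drafter plus greedy verification. This Python sketch is not AdaPLD itself: `draft_by_lookup`, `verify_draft`, and the `next_token` callable are hypothetical names, and the per-token loop stands in for what a real system would score in a single target-model forward pass.

```python
from typing import Callable, List, Sequence


def draft_by_lookup(tokens: Sequence[int], ngram: int = 2,
                    max_draft: int = 4) -> List[int]:
    """Draft tokens by reusing text already in the context.

    Finds the most recent earlier occurrence of the last `ngram` tokens
    and copies up to `max_draft` tokens that followed it.
    """
    if len(tokens) < ngram:
        return []
    suffix = list(tokens[-ngram:])
    # Scan right to left, skipping the suffix's own position.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == suffix:
            return list(tokens[start + ngram:start + ngram + max_draft])
    return []


def verify_draft(tokens: Sequence[int], draft: Sequence[int],
                 next_token: Callable[[List[int]], int]) -> List[int]:
    """Accept the longest draft prefix that matches the target's greedy choice.

    On the first mismatch the target's own token is emitted instead; if the
    whole draft is accepted, one extra target token is appended.
    """
    context = list(tokens)
    accepted: List[int] = []
    for proposed in draft:
        target = next_token(context)
        if target != proposed:
            accepted.append(target)
            return accepted
        accepted.append(proposed)
        context.append(proposed)
    accepted.append(next_token(context))
    return accepted


if __name__ == "__main__":
    # Toy stand-in for a target model: cycles 0, 1, 2, 0, 1, 2, ...
    toy_target = lambda ctx: (ctx[-1] + 1) % 3
    history = [0, 1, 2, 0, 1]
    draft = draft_by_lookup(history)
    print(draft, verify_draft(history, draft, toy_target))
```

Every verification step yields at least one token, so an unreliable draft costs little in output but gains nothing; this is why the speedup depends on how often drafts are accepted.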

Spectral Anatomy of Quantum Gaussian Process Kernels

Announce Type: new Abstract: Two recent results have reshaped quantum Gaussian processes (QGPs). On the one hand, \citet{lowe2025assessing} rule out the exponential speedups claimed by HHL-based QGP regression in the typical, well-conditioned regime; on the other, an independent line of work shows that highly expressive quantum kernels suffer posterior pathologies that break Bayesian optimization.

arXiv CS 9d ago