Efficient Parallel Algorithms for
Related Articles from SNS
Efficient Parallel Algorithms for Hypergraph Matching
arXiv:2602.22976v3 Announce Type: replace Abstract: We present efficient parallel algorithms for computing maximal matchings in hypergraphs. Our algorithm finds locally maximal edges in the hypergraph and adds them in parallel to the matching. In the CRCW PRAM models our algorithms achieve $O(\log{\log{\Delta}}\log{m})$ time with $O(\kappa\log {m})$ work w.h.p. where $m$ is the number of hyperedges, and $\kappa$ is the sum and $\Delta$ is the maximum of all vertex degrees.
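The idea of repeatedly selecting locally maximal edges can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the paper's algorithm: the random priorities, the round structure, and the function name `maximal_hypergraph_matching` are hypothetical choices made for illustration. Here an edge counts as "locally maximal" when it has the highest priority among all live edges that share a vertex with it, and the sketch makes no claim about the stated time or work bounds.

```python
import random


def maximal_hypergraph_matching(edges, seed=0):
    """Simulate rounds of 'pick locally maximal edges' on a hypergraph.

    edges: list of vertex collections (one per hyperedge).
    Returns the sorted indices of the hyperedges in the matching.
    """
    rng = random.Random(seed)
    # Assumption: tie-free priorities from a random permutation of edge indices.
    priority = list(range(len(edges)))
    rng.shuffle(priority)

    live = {i for i, e in enumerate(edges) if e}  # ignore empty hyperedges
    matching = []
    while live:
        # For each vertex, record the live edge with the highest priority.
        best = {}
        for i in live:
            for v in edges[i]:
                if v not in best or priority[i] > priority[best[v]]:
                    best[v] = i
        # An edge is locally maximal if it wins at every one of its vertices.
        # Winners are pairwise disjoint, so a parallel machine could add
        # all of them in the same round.
        winners = [i for i in live if all(best[v] == i for v in edges[i])]
        matching.extend(winners)
        matched = {v for i in winners for v in edges[i]}
        # Drop every edge that touches a matched vertex (winners included).
        live = {i for i in live if not any(v in matched for v in edges[i])}
    return sorted(matching)
```

The live edge with the globally highest priority always wins its round, so every round removes at least one edge and the loop terminates with a maximal matching.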
FlashCP: Load-Balanced Communication-Efficient Context Parallelism for LLM Training
Announce Type: new Abstract: Context parallelism (CP) is essential for training large-scale, long-context language models, as it partitions sequences to reduce memory overhead. However, existing CP methods suffer from workload imbalance, inefficient kernels, and redundant communication due to static sequence sharding and key-value (KV) tensor communication. We present FlashCP, a load-balanced and communication-efficient framework for CP training.
Efficient Scaling of LLM Training with Flexible Context Parallelism
arXiv:2602.21788v2 Announce Type: replace Abstract: Scaling long-context capabilities is crucial for Large Language Models (LLMs). However, real-world data contain a large number of sequences with heterogeneous lengths. Existing training libraries for LLMs rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity.
Parallel Metric Skiplists and Nearest Neighbor Search
Announce Type: new Abstract: The metric skip-list is a data structure designed for efficient nearest and $k$-nearest neighbor search in metric spaces. For many real-world datasets with reasonable distributions - specifically, those with a constant expansion rate - it supports $\tilde{O}(n)$ construction time and $O(k\log n)$ query time, where $n$ is the input size and $k$ is the number of nearest neighbors in queries. Notably, unlike alternative approaches, it does not require a bounded...
How Much Parallelism Is "Free"? A Principle of Near-Free Parallelism for Parallel Decoding
arXiv:2605.30851v1 Announce Type: new Abstract: Parallel decoding improves generation efficiency by processing multiple decode positions within a single decode forward, but reported speedups conflate algorithmic token utilization with the system cost of executing multiple positions. We isolate the system side by introducing Near-Free Parallelism (NFP), the maximum number of positions executable at near-free latency. Analyzing Dense FFNs, MoE FFNs, and Attention against an idle-compute...
Functional design of efficient and parallelizable combinatorial generators using convolution
arXiv:2507.03980v4 Announce Type: replace Abstract: The application of program transformation and algebraic methods to the development of efficient combinatorial optimization (CO) algorithms relies on an exhaustive combinatorial generator for the problem specification, followed by the fusion of thinning or filtering processes into this specification. However, the effectiveness of such fusion transformations critically depends on the structural compatibility between the objective function and...
On GPU Implementation for Multi-Precision Integer Division
arXiv:2606.06386v1 Announce Type: new Abstract: This paper presents the issues arising in implementing a fast integer division algorithm on general purpose GPUs. The algorithm uses a Newton iteration based on the shifted inverse operation, keeping all arithmetic in the integer domain and relying on data-parallel operators. The principal contribution is an efficient GPU/CUDA implementation for integer precisions from $2^{15}$ to $2^{18}$ -- sizes not supported by CGBN division.
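To make the integer-only Newton iteration concrete, here is a minimal CPU sketch in plain Python. It is not the paper's GPU/CUDA implementation. It assumes that the "shifted inverse" of $v$ means an approximation of $\lfloor 2^k / v \rfloor$, and the names `shifted_inverse` and `divide`, the choice of $k$, and the final correction loop are all hypothetical choices made for illustration.

```python
def shifted_inverse(v, k):
    """Approximate floor(2**k / v) from below using integer-only Newton steps."""
    if v <= 0 or k < v.bit_length():
        raise ValueError("need v > 0 and k >= v.bit_length()")
    # Initial guess: a power of two at most a factor of two below 2**k / v.
    x = 1 << (k - v.bit_length())
    while True:
        # Newton step for 1/v scaled by 2**k: x <- x + x * (2**k - v*x) / 2**k.
        # Rounding down keeps x <= 2**k / v, so the step is never negative.
        step = (x * ((1 << k) - v * x)) >> k
        if step == 0:
            return x
        x += step


def divide(u, v):
    """Return (quotient, remainder) of u // v for u >= 0 and v > 0."""
    if u < 0:
        raise ValueError("need u >= 0")
    # Assumption: k is chosen so that both u and v are below 2**k.
    k = max(u.bit_length(), v.bit_length()) + 1
    q = (u * shifted_inverse(v, k)) >> k
    r = u - q * v
    # The inverse is an underestimate, so q can only be too small;
    # a short correction loop makes the result exact.
    while r >= v:
        q += 1
        r -= v
    return q, r
```

For example, `divide(10**50 + 3, 97)` agrees with Python's built-in `divmod(10**50 + 3, 97)`. The sketch uses only multiplication, subtraction, and shifts inside the iteration, which is the property the abstract highlights.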
FLARE: Diffusion for Hybrid Language Model
arXiv:2606.01774v1 Announce Type: new Abstract: Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models...
Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks
Announce Type: new Abstract: Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning.