Flash Attention
Related Articles from SNS
Radiation damage to normal mammalian tissue in vivo with laser-driven protons at ultra-high instantaneous dose rate
Abstract: The differential sparing of normal tissues relative to tumor control observed at ultra-high dose rates, referred to as the FLASH effect, has recently gained considerable attention. The therapeutic advantages of FLASH radiotherapy are expected to be further amplified through the use of protons and ions, which enable precise dose deposition at tumor depth while minimizing irradiation of healthy tissues proximal and distal to the target. Nevertheless, the...
A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
The previous post covered getting Gemma 4’s MTP drafters quantized and paired with a verifier. This one is about running the result on a machine that has no business running it. I have a recycled server.
Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection
arXiv:2602.03216v3 Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a...
LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents
Abstract: Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting.
Man 'hours away from losing his vision' urges people to check eyes for symptom
People are urged to 'get checked immediately' after a man says he was only a few hours away from losing his vision after noticing a change. A man who was 'hours away from losing his vision' is urging people not to ignore tiny specks or threads drifting across their field of vision, as they may be a warning sign of a serious condition. Eye floaters are something most people will have experienced, especially when...
The Last Evolution, by John W Campbell Jr. (1932)
The Project Gutenberg EBook of The Last Evolution, by John Wood Campbell. This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org
AURA: Action-Gated Memory for Robot Policies at Constant VRAM
Abstract: The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint.
Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents
arXiv:2606.06453v1 Abstract: Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a...
Auditing LLM Editorial Bias in News Media Exposure
arXiv:2510.27489v2 Abstract: Large Language Models (LLMs) increasingly act as gateways to web content, shaping how millions of users encounter online information. Unlike traditional search engines, whose retrieval and ranking mechanisms are well studied, the selection processes of web-connected LLMs add layers of opacity to how answers are generated.
From Storage to Steering: Memory Control Flow Attacks on LLM Agents
Abstract: Modern agentic systems allow Large Language Model (LLM) agents to tackle complex tasks through extensive tool usage, forming structured control flows of tool selection and execution. Existing security analyses often treat these control flows as ephemeral, one-off sessions, overlooking the persistent influence of memory. This paper identifies a new threat from Memory Control Flow Attacks (MCFA) that memory can dominate the control flow, forcing unintended tool...