Roofline
Related Articles from SNS
From Roofline to Ruggedness: Decomposing and Smoothing the GEMM Performance Landscape
arXiv:2605.29752v1 Announce Type: cross Abstract: Adjacent GEMM problems that differ by a single 128-element step in N can show 30% different throughput on the same GPU. This pervasive performance ruggedness - invisible to roofline analysis and peak-FLOPs intuition, yet dominant for every non-peak workload - is the subject of this paper. We propose performance ruggedness analysis as an analytical framework complementary to roofline: rather than summarizing GPU performance with a scalar...
Microarchitectural Co-Optimization for Sustained Throughput of RISC-V Multi-Lane Chaining Vector Processors
arXiv:2604.22314v2 Announce Type: replace Abstract: Modern RISC vector processors rely on multi-lane parallelism and chaining to achieve high sustained throughput, yet practical execution often deviates from the ideal reference due to microarchitectural inefficiencies. This work targets the open-source RVV processor Ara and analyzes its sustained-throughput loss under a fixed hardware configuration. We first establish an ideal multi-lane chaining model that decomposes ideal execution into...
From Chiang Mai’s black house to the Bangkok skyline: Inside 137 Pillars
A derelict teak house in Chiang Mai became the unlikely beginning of 137 Pillars, linking a heritage hotel in Wat Ket with a sky-high Sukhumvit address. Chiang Mai locals called it Baan Dam – the Black House. Stand in the grounds of 137 Pillars House today and you would never guess the name.
Why The Macallan remains one of the world’s most coveted single malt whiskies
From sherry-seasoned oak casks in Spain to its striking Speyside distillery, The Macallan has built a whisky world around patience, provenance and luxury. Scotland has always known how to command the global stage, lending its dramatic landscapes and finest exports to everything from Hollywood blockbusters to peak-TV dramas. At The Macallan Estate, that cinematic prestige is palpable.
DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
arXiv:2511.04791v2 Announce Type: replace Abstract: Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources...
vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models
arXiv:2606.08094v1 Announce Type: new Abstract: Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a...
Upcoming car launches in June 2026: New S-Class, BMW X6 & more
June 2026 is set to bring several new vehicle launches to the Indian market, spanning the luxury, performance, hybrid and electric segments. Automakers are preparing to introduce new models and powertrain options as they broaden their offerings across categories. The key launches expected this month include the new Mercedes-Benz S-Class, BYD’s new hybrid model, the BMW X6 M60i and the Skoda Kodiaq RS.
FP8 is All You Need (Part 1): Debunking Hardware FP64 as the HPC Holy Grail
arXiv:2606.06510v1 Announce Type: new Abstract: Conventional HPC dogma holds that native hardware FP64 silicon is the irreducible foundation of scientific computing -- the "holy grail" of double-precision simulation. This paper argues the dogma is wrong: on AI-optimised GPUs of the B300 generation and beyond, abundant FP8 tensor throughput combined with the Chinese Remainder Theorem-based Ozaki Scheme II recovers memory-roof execution at full FP64 accuracy across the canonical HPC kernel...
Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
arXiv:2606.04238v1 Announce Type: new Abstract: Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA -- a lightweight, data-free accuracy recovery method originally developed for general...
Lattice Boltzmann Methods for Compressible (Magneto)hydrodynamics
arXiv:2606.00641v1 Announce Type: new Abstract: The simulation of magnetohydrodynamic (MHD) flows presents a highly complex, tightly coupled transport problem that poses severe numerical and computational demands. Towards this, we propose a novel class of Lattice Boltzmann Methods (LBM) schemes capable of solving a wide range of transport equation systems with high computational efficiency and scalability. Our approach exploits the algorithmic structure of kinetic formulations to separately...