Home › Knowledge Base › Inference

Inference

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

What Type of Inference is Active Inference?

Announce Type: new Abstract: Active inference casts decision-making as inference, with the Expected Free Energy (EFE) unifying goal-directed and information-seeking behavior. Recent work showed that EFE minimization can be written as Variational Free Energy (VFE) minimization on a generative model augmented with epistemic priors. We prove that the VFE of the augmented model can be rewritten as the VFE of the predictive model plus explicit entropy-correction terms, making the EFE contribution...

arXiv CS 6d ago

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

arXiv:2511.17826v2 Announce Type: replace Abstract: Deterministic inference is increasingly critical for large language model (LLM) applications such as LLM-as-a-judge evaluation, multi-agent systems, and Reinforcement Learning (RL). However, existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the...

arXiv CS 9d ago

DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

Announce Type: replace Abstract: Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units.

arXiv CS 5d ago

Verifiable and Confidential DNN Inference on Low-End Edge Devices

Announce Type: new Abstract: Deploying deep neural network (DNN) inference on low-end edge devices raises two key challenges: protecting model confidentiality against a potentially compromised edge system and enabling verifiable inference without incurring prohibitive overhead. Existing approaches either house partial models and inference software within trusted execution environments (TEEs), resulting in high cost and an application-dependent trusted computing base (TCB), or execute in...

arXiv CS 2d ago

Fast Transformer Inference on ARM-Based HMPSoCs

arXiv:2606.02836v1 Announce Type: new Abstract: Transformer models have set new performance standards for machine learning (ML) tasks. However, their resource-intensive deployment on resource-constrained edge devices for cloud-free, on-chip transformer inference remains challenging. The ARM Compute Library (ARM-CL) framework provides low-latency CNN inference on ARM-based edge devices but lacks support for transformer inference.

arXiv CS 7d ago

Collaborative Edge-to-Server Inference for Vision-Language Models

arXiv:2512.16349v2 Announce Type: replace Abstract: We propose a collaborative edge-to-server inference framework for vision-language models (VLMs) that reduces communication cost while maintaining inference accuracy. In typical deployments, visual data captured at edge devices (clients) is transmitted to the server for VLM inference. However, transmitting full-resolution images incurs high communication cost.

arXiv CS 1d ago

Flow-based generative models for amortized Bayesian inference in regression and inverse PDE problems

Announce Type: new Abstract: Bayesian inference provides a principled framework for uncertainty quantification in scientific machine learning. However, conventional Bayesian approaches usually require solving a new inference problem for each observation set, causing substantial computational costs that hinder real-time applications like online monitoring and digital twins. Furthermore, inferring over infinite-dimensional function spaces with varying observation sets poses major challenges...

arXiv Physics 16h ago

DriftSched: Adaptive QoS-Aware Scheduling under Runtime Token Drift for Multi-Tenant GPU Inference

arXiv:2606.02982v1 Announce Type: new Abstract: The rapid growth of large language model (LLM) inference services has increased the demand for efficient multi-tenant GPU scheduling. While modern inference runtimes such as vLLM improve throughput through continuous batching and optimized memory management, accurately estimating the runtime cost of heterogeneous inference requests remains a significant challenge.

arXiv CS 7d ago

Generative Augmented Inference

arXiv:2604.14575v2 Announce Type: replace Abstract: Large language models enable inexpensive AI-generated annotations, but using them reliably for causal inference remains challenging. Naively pooling AI and human data induces bias, while existing methods such as Prediction-Powered Inference (PPI; Angelopoulos et al., 2023a) treat AI outputs as proxies of true labels -- an assumption often violated for generative model outputs in practice. We propose Generative Augmented Inference (GAI), a...

arXiv CS 6d ago

General Synthetic-Powered Inference

arXiv:2509.20345v3 Announce Type: replace-cross Abstract: The rapid proliferation of high-quality synthetic data -- generated by advanced AI models or collected as auxiliary data from related tasks -- presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around a broad class of statistical inference procedures to safely enhance sample efficiency by combining synthetic and real data. Our...

arXiv CS 5d ago