Long
Related Articles from SNS
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
arXiv:2505.19293v2 Announce Type: replace Abstract: Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper...
SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
arXiv:2505.11166v3 Announce Type: replace Abstract: Despite advances in pretraining with extended context sizes, large language models (LLMs) still face challenges in effectively utilizing real-world long-context information, primarily due to insufficient long-context alignment caused by data quality issues, training inefficiencies, and the lack of well-designed optimization objectives. To address these limitations, we propose a framework named Short-to-Long...
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
arXiv:2606.01629v1 Announce Type: new Abstract: As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of...
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning
Announce Type: replace Abstract: Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as (1) higher context window length often leads to stronger reasoning performance, and (2) failed reasoning cases resemble failed long-context cases. To test...
Orthogonal Learner for Estimating Heterogeneous Long-Term Treatment Effects
arXiv:2604.00915v2 Announce Type: replace Abstract: Estimation of heterogeneous long-term treatment effects (HLTEs) is relevant for personalized decision-making in marketing, economics, and medicine, where short-term observational datasets are often combined with long-term observational datasets. However, HLTE estimation is challenging due to limited overlap in treatment assignments or in long-term outcomes for certain subpopulations, which can lead to unstable HLTE estimates with large...
Trump says Iran has taken too long to negotiate, will 'pay the price'
Iran has "taken too long to negotiate a deal that would have been great for them, now they will have to pay the price", says US President Donald Trump. US President Donald Trump said on Wednesday (Jun 10) Iran had taken too long to negotiate a deal and would now "have to pay the price", while Tehran said it would reassess diplomatic engagement with Washington after tit-for-tat strikes overnight. Iran launched missile and...
T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences
arXiv:2406.00636v2 Announce Type: replace Abstract: In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., paragraph). Previous long-term motion generating approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step.
VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation
arXiv:2606.08091v1 Announce Type: new Abstract: Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation,...
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
arXiv:2605.21028v2 Announce Type: replace Abstract: Autoregressive long video generation often adopts bounded-memory streaming for efficiency, typically combining local windows for short-term continuity with static early-frame sinks as long-range anchors. However, this fixed allocation keeps early frames cached even when the current visual state has substantially diverged from them, while discarding potentially more relevant intermediate history. As a result, the retained long-range context...