Home › Knowledge Base › BIG-Bench

BIG-Bench

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

Pitfalls of Evaluating Language Models with Open Benchmarks

arXiv:2507.00460v3 Announce Type: replace Abstract: Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet, this openness also creates substantial risks of data leakage during LM testing--deliberate or inadvertent, thereby undermining the fairness and reliability of leaderboard rankings and leaving...

arXiv CS 5d ago

Learning What Matters: Probabilistic Task Selection via Mutual Information for Model Finetuning

arXiv:2507.12612v3 Announce Type: replace Abstract: Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous set of tasks. In practice, mixtures are often fixed using simple heuristics (e.g., uniform or size-proportional sampling) that ignore task interactions, which can hurt transfer and waste budget on redundant sources. We introduce TaskPGM, a framework for learning continuous task mixtures via an...

arXiv CS 5d ago

Learning Task Mixtures from Task Affinities: A Probabilistic Graphical Model for Supervised Fine-Tuning

arXiv:2507.12612v4 Announce Type: replace Abstract: Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous set of tasks. In practice, mixtures are often fixed using simple heuristics (e.g., uniform or size-proportional sampling) that ignore task interactions, which can hurt transfer and waste budget on redundant sources. We introduce TaskPGM, a framework for learning continuous task mixtures via an...

arXiv CS 1d ago