Home Knowledge Base MCQ

MCQ

No mentions found

This entity hasn't been tracked yet, or Iris is still building its knowledge base.

Related Articles from SNS

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

arXiv:2606.07167v1 Announce Type: new Abstract: Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs.

arXiv CS 2d ago

Bridging the Knowledge-Prediction Gap in LLMs on Multiple-Choice Questions

arXiv:2509.23782v4 Announce Type: replace Abstract: While large language models (LLMs) perform strongly on diverse tasks, their trustworthiness is limited by erratic behavior that is unfaithful to their internal knowledge. In particular, LLMs often fail on multiple-choice questions (MCQs) even if they encode correct answers in their hidden representations, revealing a misalignment between internal knowledge and output behavior. We investigate and mitigate this knowledge-prediction gap on...

arXiv CS 8d ago

Ctrl+alt+examine: Can India's biggest exams go online?

The alarm shrills at 4am. Clothes, ironed the night before, lie ready. The admit card — printed, laminated, triple-checked — waits by the door.

Times of India 3d ago

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

arXiv:2604.04944v2 Announce Type: replace Abstract: Multiple-choice questions (MCQs) are widely used to evaluate large language models (LLMs). However, LLMs remain vulnerable to the presence of plausible distractors. This often diverts attention toward irrelevant choices, resulting in unstable oscillation between correct and incorrect answers.

arXiv CS 6d ago

OmniHalluc-L: Counterfactual Benchmarking and Modality-Perturbation Reliability Calibration for Long-Form Omni Hallucination

Announce Type: new Abstract: Long-video Omni assistants often fail not by inventing content, but by misbinding real evidence: they hear the right utterance and see the right event, yet attach it to the wrong speaker, moment, or modality. These \emph{almost-true} errors evade standard video QA because local evidence remains valid, so item-level scoring can reward both a supported claim and its near-counterfactual. We introduce a counterfactual event-binding protocol that constructs paired...

arXiv CS 7d ago

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

arXiv:2606.08988v1 Announce Type: new Abstract: Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the...

arXiv CS 1d ago