$\mathrm{ECI}_{\mathrm{sem}}$: Semantic Residual Effective Contrastive Information for Evaluating Hard Negatives
arXiv:2603.20990v3 Announce Type: replace
Abstract: Hard-negative source selection for dense retrieval is usually decided only after fine-tuning and downstream evaluation. We propose $\mathrm{ECI}_{\mathrm{sem}}$, a semantic residual variant of Effective Contrastive Information (ECI) that ranks candidate negative sources using frozen target-encoder embeddings. $\mathrm{ECI}_{\mathrm{sem}}$ is training-free, not label-free: each scored example requires a query, a labeled positive, and an explicit candidate negative. $\mathrm{ECI}_{\mathrm{sem}}$ builds a weighted residual information matrix from target consistency, semantic locality, lexical residuality, and a log-determinant diversity objective. On MS MARCO negative sources, in-family $\mathrm{ECI}_{\mathrm{sem}}$ ranks LLM negatives highest among non-hybrid sources and Dense+LLM highest among hybrid sources, matching the strongest aggregate BEIR transfer results across DistilBERT, E5-base, and Contriever. Controlled ablations show that this alignment depends on using the target encoder family, while additional ablations show stability under sample-size, temperature, tokenizer, and IDF-corpus perturbations. The theory gives a local linearized link to loss reduction, while the empirical study treats downstream evaluation as the final test.
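As a rough illustration of the log-determinant diversity term mentioned in the abstract, the sketch below scores a set of candidate negatives from frozen embeddings. It should not be read as the paper's method: the residual definition (projecting out the positive's direction), the per-example weights, and the helper names `residuals_against_positive` and `logdet_information` are all assumptions made for illustration, since the abstract does not state the actual formulas.

```python
import numpy as np


def residuals_against_positive(negatives: np.ndarray, positive: np.ndarray) -> np.ndarray:
    """Remove from each negative embedding its component along the positive.

    ASSUMPTION: this projection is only one plausible reading of a
    "semantic residual"; the abstract does not define it.
    """
    p = positive / np.linalg.norm(positive)
    return negatives - np.outer(negatives @ p, p)


def logdet_information(residuals: np.ndarray, weights: np.ndarray) -> float:
    """Return log det(I + sum_i w_i r_i r_i^T) for residual rows r_i.

    ASSUMPTION: the weights stand in for whatever consistency, locality,
    and lexical terms a real scorer would combine. The determinant is
    evaluated on the k x k Gram form, which equals the d x d form by
    Sylvester's determinant identity.
    """
    scaled = residuals * np.sqrt(weights)[:, None]
    gram = scaled @ scaled.T
    _, logdet = np.linalg.slogdet(np.eye(len(scaled)) + gram)
    return float(logdet)


# Toy frozen embeddings (hypothetical values, not from the paper).
positive = np.array([0.0, 0.0, 1.0])
diverse = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
redundant = np.array([[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
weights = np.ones(2)

diverse_score = logdet_information(residuals_against_positive(diverse, positive), weights)
redundant_score = logdet_information(residuals_against_positive(redundant, positive), weights)
print(diverse_score, redundant_score)
```

In this toy case the two distinct negatives score higher than the duplicated pair, which is the behavior a log-determinant diversity objective is meant to reward: redundant residual directions add little volume to the information matrix.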