Jaccard
Related Articles from SNS
FOLD: Fuzzy Online Deduplication for Very Large Evolving Datasets via Approximate Nearest Neighbor Search
Announce Type: new Abstract: Fuzzy deduplication is key to constructing large language model training corpora. However, classic Locality-Sensitive Hashing pipelines scale poorly as corpora grow and are ill-suited to continuous ingestion. We present FOLD (Fuzzy Online Deduplication), an online fuzzy deduplication system that delivers high recall and throughput for evolving datasets.
AeroMesa: Efficient Data Management System for Multi-Dimensional Spatio-Temporal Trajectories
arXiv:2606.09581v1 Announce Type: new Abstract: The rapid growth of trajectory data -- especially the dense 4D traces generated by unmanned aerial vehicles (UAVs) -- is placing mounting pressure on spatio-temporal data management systems. Existing HBase-based trajectory indexes suffer from three limitations: coarse-grained temporal pruning, locality-unfriendly XZ2 spatial encodings with workload-blind ordering, and severe row-key interval fragmentation when altitude is jointly encoded with...
How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages, Scripts, and Rewordings
Announce Type: replace Abstract: Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed--the same language written in both Latin and Cyrillic via...
How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models
Announce Type: new Abstract: Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on those features. We ask whether...
Worker Utility as Hysteresis: A Preisach Model of Transaction Acceptance in Gig Labour Markets
arXiv:2606.04916v1 Announce Type: new Abstract: Worker utility is not observed -- only its consequence is. Each gig transaction produces a single bit: accepted or rejected.
Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Instance Segmentation, Semantic Segmentation, and Species Classification
arXiv:2511.06331v2 Announce Type: replace Abstract: Detailed structural and species information at the individual-tree level is increasingly important for supporting precision forestry and biodiversity conservation and for providing reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source for rapidly deriving such information at scale. Recent advancements in deep learning improved segmenting and classifying individual...
RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning
arXiv:2606.01566v1 Announce Type: new Abstract: Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where...
Rule Extraction in Machine Learning: Chat Incremental Pattern Constructor
arXiv:2208.00335v5 Announce Type: replace Abstract: Rule extraction is a central problem in interpretable machine learning because it seeks to convert opaque predictive behavior into human-readable symbolic structure. This paper presents Chat Incremental Pattern Constructor (ChatIPC), a lightweight incremental symbolic learning system that extracts ordered token-transition rules from text, enriches them with definition-based expansion, and constructs responses by similarity-guided candidate...
Perplexity Can Miss SAE Feature Damage Under Quantization
Announce Type: replace Abstract: Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding.
Gate AI: LLM Security Benchmark Evaluation Methodology and Results
arXiv:2606.02959v1 Announce Type: new Abstract: Published evaluations of prompt-injection and jailbreak detectors for Large Language Models often suffer from two systematic weaknesses: per-dataset threshold tuning and undisclosed operating points. We describe an evaluation harness that addresses both.