Dataset

Related Articles from SNS

Less Is More? When Dataset Context Hurts LLM-Generated Dataset Descriptions

Announce Type: new Abstract: Dataset search and reuse are strongly constrained by the quality of metadata such as natural language descriptions, which are often sparse or inconsistent. Although large language models (LLMs) can generate such descriptions automatically, little empirical guidance exists on what makes a good dataset description and what dataset context LLMs actually need. We study these questions through a literature-grounded framework of dataset description quality and a...

arXiv CS 8d ago

idSCD: Identifying Training Datasets through Semantic Correlation Descriptors

Announce Type: new Abstract: Can a dataset be recognized from the spurious correlations it induces during training? We argue that datasets leave dataset-specific traces in a model's learned semantic correlation structure: incidental regularities that are predictive within a dataset, but not causal for the underlying task, can be internalized during training. We use this insight to study dataset-level membership inference, moving beyond existing methods that rely on behavioral or...

arXiv CS 9d ago

Sound Effects Dataset Unification With the Universal Category System

Announce Type: new Abstract: Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation because incompatible taxonomies lead to siloed datasets that might require individualized approaches, result in non-comparable outcomes, and prevent data merging strategies. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS),...

arXiv CS 5d ago

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

Announce Type: replace Abstract: Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark.

arXiv CS 5d ago

Towards Personalized Bangla Book Recommendation: A Large-Scale Heterogeneous Book Graph Dataset

arXiv:2602.12129v2 Announce Type: replace Abstract: Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602...

arXiv CS 1d ago

Places in the Wild: A Large, High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research

arXiv:2606.02481v1 Announce Type: new Abstract: Large image datasets have accelerated progress in cognitive neuroscience and computer vision. However, most datasets are low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context. Places in the Wild is a dataset of 67,574 high-resolution photographs collected in situ across 810 physical locations spanning 260 basic-level scene categories, including indoor, urban, and natural environments.

arXiv CS 8d ago

FRED: A Multi-Modal Autonomous Driving Dataset for Flooded Road Environments

arXiv:2605.22018v2 Announce Type: replace Abstract: The Flooded Road Environments Dataset (FRED) is, to our knowledge, the first multi-modal autonomous driving dataset specifically targeting the collection of data from scenarios involving water hazards on the road. The dataset contains images from a 2.3 MP FLIR Blackfly USB3 camera, 64-beam 360 degree point clouds from an Ouster OS1-64 LiDAR, and data from an iXblue ATLANS-C IMU corrected by a Geoflex RTK GNSS, from five separate locations...

arXiv CS 7d ago

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

arXiv:2603.25726v3 Announce Type: replace Abstract: We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation. While recent works with foundation approaches have shown that scaling training data markedly improves hand pose estimation, existing real-world datasets are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our...

arXiv CS 1d ago

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

arXiv:2606.02273v1 Announce Type: new Abstract: Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset.

arXiv CS 8d ago

Video Understanding by Design: How Datasets Shape Video Models

arXiv:2509.09151v2 Announce Type: replace Abstract: Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys typically organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded. In this survey, we argue that the evolution of video understanding is fundamentally shaped by dataset structure.

arXiv CS 1d ago