A DACADE PASSED, ARE WE THERE YET?
张杨 SDS, Fudan University 2025-04-14
These pictures are randomly selected from three datasets(YCD i.e. YFCC, CC and DataComp). Can you name the dataset for each of them?
Difficulty Assessment:
Biases captured by human:
Different from social and stereotypical bias. This mostly concerns the proper coverage of concepts and objects, or in other words, how representative the dataset is for the real world.
All results were obtained with the ConvNeXt-T model:
High accuracy is observed across dataset combinations, architectures and sizes.
Moreover, dataset classification accuracy benefits from more training data and data augmentation(RandCrop, CutMix etc.).
The high accuracy may simply result from the presence of a certain signature.
Potential signatures could involve:
Train and Validation: the models learned for dataset classification exhibit generalization behaviors.
This again suggests that the model attempts to capture shared, generalizable patterns in the real dataset classification task.
Under linear probing protocol:
From dataset classfication to image classfication.
This reveals that the dataset bias discovered by neural networks is relevant to semantic features that are useful for image classification.
Here the task is: contrastive learning (MoCo v3)
Despite larger and more diversified datasets, cross-dataset generalization remains a problem. Interestingly, simply combining all datasets yields the best overall result.
Even in the context of modern large-scale datasets:
Limitation
THANKS