ZMIC Journal Club

Battle on Dataset Bias


A DECADE PASSED, ARE WE THERE YET?


张杨
SDS, Fudan University
2025-04-14

Paper Info

Table of Contents

  • Introduction
    • Dataset classification task
    • User Study
  • Related Work
    • Dataset bias
  • Experiments
    • Main Observation
    • Further Study
  • Insights
  • Inspiration

Task: Name The Dataset

These pictures are randomly selected from three datasets (YCD, i.e. YFCC, CC, and DataComp). Can you name the dataset each one comes from?
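
The task can be framed as ordinary N-way classification in which the label is the dataset an image came from. The following is a minimal sketch of how such a labeled set could be assembled, assuming each dataset is available as a list of image identifiers. The function name, the toy IDs, and the per-dataset sample size are hypothetical and are not taken from the paper.

```python
import random

def build_dataset_classification_set(sources, n_per_dataset, seed=0):
    """Sample n_per_dataset items from each source and label them by origin.

    sources: dict mapping dataset name -> list of image identifiers.
    Returns (examples, class_names), where examples is a shuffled list of
    (image_id, label) pairs and label indexes into class_names.
    """
    rng = random.Random(seed)
    class_names = sorted(sources)
    examples = []
    for label, name in enumerate(class_names):
        # Equal sample sizes keep the classes balanced,
        # so chance accuracy is 1 / len(class_names).
        for image_id in rng.sample(sources[name], n_per_dataset):
            examples.append((image_id, label))
    rng.shuffle(examples)
    return examples, class_names

# Toy stand-ins for the YFCC, CC, and DataComp image lists (hypothetical IDs).
toy_sources = {
    "YFCC": [f"yfcc_{i}" for i in range(50)],
    "CC": [f"cc_{i}" for i in range(50)],
    "DataComp": [f"datacomp_{i}" for i in range(50)],
}
examples, class_names = build_dataset_classification_set(toy_sources, n_per_dataset=10)
```

With three balanced classes, random guessing gives about 33.3% accuracy, which is the baseline against which both human and network performance should be read.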

Answer

User Study 1

  • Users. A group of 20 participants with an ML background.
  • Settings. Each user is asked to classify 100 validation images, with no time limit.
  • Results. Human accuracy is much lower than the neural network’s 84.7%.

User Study 2

Difficulty Assessment:

  • 15 participants describe the task as “difficult”.
  • No participant describes the task as “easy”.
  • 2 participants commented that they found the task “interesting”.

Biases captured by humans:

  • Some of them are meaningful: e.g. "white background" for DataComp.
  • Many of them are meaningless: e.g. "the inclusion of people" in images.

Dataset Bias

Dataset bias differs from social and stereotypical bias. It mostly concerns the proper coverage of concepts and objects, in other words, how representative the dataset is of the real world.

  • Torralba & Efros (2011) presented the dataset classification problem and examined dataset bias in the context of hand-crafted features with SVM classifiers.
  • Tommasi et al. (2015) studied the dataset classification problem using neural networks.
  • The idea of classifying which dataset (domain) a sample comes from has been further developed in domain adaptation methods (Tzeng et al., 2014; Ganin et al., 2016), e.g. through adversarial learning.

Torralba & Efros (2011)

Dataset used

  • Flickr: The best place to be a photographer online.
  • Common Crawl: an organization that crawls the web and provides the resulting data.

Main Observation: 84.7% accuracy with a neural network

All results were obtained with the ConvNeXt-T model:

Observations

High accuracy is observed across dataset combinations, model architectures, and model sizes.

Moreover, dataset classification accuracy benefits from more training data and from data augmentation (RandCrop, CutMix, etc.).

Low-level signatures?

The high accuracy may simply result from the presence of a certain signature.

Potential signatures could involve:

  • JPEG compression artifacts (e.g., different datasets may have different compression quality factors)
  • color quantization artifacts (e.g., colors are trimmed or quantized depending on the individual dataset).
  • camera intrinsic parameters (e.g., focal length, sensor size)
    • This is a question raised by the community.
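
As an illustration of what such a signature could look like, consider color quantization: if one dataset stored pixel values with fewer levels, a trivial statistic would reveal it. The following is a toy sketch under that assumption. The quantization step of 8 and the helper names are hypothetical and say nothing about how YFCC, CC, or DataComp are actually processed.

```python
def quantize(pixels, step):
    """Round each 0-255 pixel value down to a multiple of `step`."""
    return [(p // step) * step for p in pixels]

def looks_quantized(pixels, step):
    """A trivial 'signature detector': are all values multiples of `step`?"""
    return all(p % step == 0 for p in pixels)

original = [3, 17, 64, 129, 200, 255]  # toy pixel values
stored = quantize(original, step=8)     # hypothetical dataset-specific processing

print(stored)                        # [0, 16, 64, 128, 200, 248]
print(looks_quantized(stored, 8))    # True
print(looks_quantized(original, 8))  # False
```

If a classifier relied on a cue of this kind, its accuracy should collapse once the cue is destroyed, which is the motivation for testing with corrupted images.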

Corruptions to suppress signatures
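
The slide text does not list the corruptions, so the following is only an illustrative sketch of two common choices, additive Gaussian noise and low-resolution resampling, applied to a grayscale image stored as nested lists. The function names, `sigma`, and `factor` are hypothetical parameters chosen for the example.

```python
import random

def add_gaussian_noise(image, sigma, seed=0):
    """Add zero-mean Gaussian noise to each pixel and clip to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in image]

def low_resolution(image, factor):
    """Average-pool by `factor`, then upsample back with nearest neighbor.

    Assumes the height and width are divisible by `factor`.
    """
    h, w = len(image), len(image[0])
    pooled = [
        [
            sum(image[y + dy][x + dx] for dy in range(factor) for dx in range(factor))
            // (factor * factor)
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]
    return [[pooled[y // factor][x // factor] for x in range(w)] for y in range(h)]

# An 8x8 toy gradient image with values in [0, 140].
image = [[(x * 16 + y * 4) % 256 for x in range(8)] for y in range(8)]
noisy = add_gaussian_noise(image, sigma=10)
blurry = low_resolution(image, factor=4)
```

Both operations destroy fine pixel-level regularities such as compression or quantization patterns while leaving the coarse image content recognizable.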

Corruptions' Results

Memorization or Generalization?

Training vs. validation: the models learned for dataset classification exhibit generalization behavior.

This again suggests that the model attempts to capture shared, generalizable patterns in the real dataset classification task.

Self-supervised learning?

Under the linear probing protocol:
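
Linear probing keeps the pre-trained features frozen and trains only a linear classifier on top of them. The following is a minimal sketch of that idea using softmax regression in plain Python. The 2-D synthetic points stand in for frozen backbone features, and the function names, learning rate, and epoch count are assumptions made for the example, not the paper's settings.

```python
import math
import random

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Fit a softmax-regression head on fixed (frozen) feature vectors."""
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    n = len(features)
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(features, labels):
            logits = [sum(w * v for w, v in zip(W[c], x)) + b[c]
                      for c in range(num_classes)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            for c in range(num_classes):
                # Cross-entropy gradient w.r.t. logit c: softmax prob minus one-hot.
                g = exps[c] / total - (1.0 if c == y else 0.0)
                gb[c] += g
                for j in range(dim):
                    gW[c][j] += g * x[j]
        # Full-batch gradient step; only the linear head is updated.
        for c in range(num_classes):
            b[c] -= lr * gb[c] / n
            for j in range(dim):
                W[c][j] -= lr * gW[c][j] / n
    return W, b

def predict(W, b, x):
    scores = [sum(w * v for w, v in zip(W[c], x)) + b[c] for c in range(len(W))]
    return scores.index(max(scores))

# Hypothetical frozen features: three well-separated 2-D clusters, one per class.
rng = random.Random(0)
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0)]
features, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(30):
        features.append([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)])
        labels.append(label)

W, b = train_linear_probe(features, labels, num_classes=3)
accuracy = sum(predict(W, b, x) == y for x, y in zip(features, labels)) / len(labels)
```

Because the backbone is never updated, the probe's accuracy measures how linearly separable the classes already are in the frozen feature space.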

Transfer learning?

From dataset classification to image classification.

This reveals that the dataset bias discovered by neural networks is relevant to semantic features that are useful for image classification.

Cross-Dataset Generalization

Here the task is: contrastive learning (MoCo v3)

Despite larger and more diversified datasets, cross-dataset generalization remains a problem. Interestingly, simply combining all datasets yields the best overall result.
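
A cross-dataset evaluation of this kind can be organized as a matrix: train on each dataset (plus their union) and evaluate on every dataset. The following is a generic harness sketch. The `train_fn` and `eval_fn` callables are hypothetical placeholders, and the toy "model" (a training mean scored by negative mean squared error) only exercises the harness; it is not evidence for the paper's result.

```python
def cross_dataset_matrix(datasets, train_fn, eval_fn):
    """Return {train_name: {eval_name: score}}, including an 'all' row
    trained on the concatenation of every dataset."""
    train_sets = dict(datasets)
    train_sets["all"] = [x for data in datasets.values() for x in data]
    results = {}
    for train_name, train_data in train_sets.items():
        model = train_fn(train_data)
        results[train_name] = {
            name: eval_fn(model, data) for name, data in datasets.items()
        }
    return results

# Toy stand-ins: a "model" is just the training mean, and the score is
# negative mean squared error (higher is better).
def train_mean(data):
    return sum(data) / len(data)

def neg_mse(model, data):
    return -sum((x - model) ** 2 for x in data) / len(data)

toy = {"A": [0.0, 1.0, 2.0], "B": [4.0, 5.0, 6.0], "C": [10.0, 11.0, 12.0]}
matrix = cross_dataset_matrix(toy, train_mean, neg_mse)
averages = {name: sum(row.values()) / len(row) for name, row in matrix.items()}
for train_name, row in matrix.items():
    print(train_name, {k: round(v, 2) for k, v in row.items()},
          "avg:", round(averages[train_name], 2))
```

Reading the matrix row by row shows how far off-diagonal (cross-dataset) scores fall below the diagonal (in-dataset) ones, and the extra row shows what training on the union buys on average.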

ARE WE THERE YET? -- NO

Even in the context of modern large-scale datasets:

  • dataset bias can still be easily captured by modern neural networks, and this observation is robust across models, dataset combinations, and many other settings;
  • such bias may contain generalizable and transferable patterns, and it may not be easily noticed by human beings.

Limitation

Inspiration

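  • Kaiming He's writing is excellent and worth learning from
    • The story is told well, and the angle taken on the topic is very engaging
    • The experimental setup is very comprehensive; this conference paper even includes a user study
  • An old tree can sprout new buds
    • A research topic from long ago may still be valuable in a new era
  • Domain adaptation
    • Self-supervised pre-training through the "dataset classification" task
    • Can we train a model with strong generalization ability?
    • Can we actively model dataset bias?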

THANKS
