ZMIC Journal Club

Evading the Simplicity Bias


Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization


Yang Zhang (张杨)
SDS, Fudan University
2024-02-22


Introduction

  • Evading the Simplicity Bias: Training a Diverse Set of Models Discovers Solutions with Superior OOD Generalization
  • Publication: CVPR 2022
  • Abstract:
    • Neural networks trained with SGD are shown to have a simplicity bias, which may explain their lack of robustness out of distribution (OOD).
    • The authors train a set of similar models to fit the data in different ways, using a penalty on the alignment of their input gradients. They show theoretically and empirically that this induces the learning of more complex predictive patterns.
    • OOD generalization fundamentally requires information beyond i.i.d. examples. Their approach shows that this requirement can be deferred from the training stage to an independent model-selection stage.

Inductive Bias and OOD Generalization

  • At the core of every learning algorithm is a set of inductive biases. They define the learned function away from the training examples and thus determine how it extrapolates to novel test points.

[Figure: an image of a bird against a sky background]

  • Shape (bird) or background (sky)? This is where a learning algorithm’s inductive biases come into play.

  • OOD generalization is not achievable only through regularizers, network architectures, or other unsupervised control of inductive biases; information about the target distribution must enter somewhere.


Simplicity Bias

  • Simplicity is defined via the decision boundary a feature induces: the simplest features are those separable with the fewest linear pieces, e.g. a single linear boundary.
  • The bias is not a property of neural networks themselves: \cite{shah2020pitfalls} showed that neural networks trained with SGD are biased to learn the simplest predictive features in the data while ignoring others.
  • Pros: by promoting simpler decision boundaries, the bias can act as an implicit regularizer and improve in-distribution generalization.
  • Cons: the mechanisms we want the model to learn are likely to be overshadowed by simpler spurious patterns, leading to shortcut learning and poor OOD generalization.
    • CV: using the background rather than the shape of the object for image recognition.
    • NLP: using the presence of certain words rather than the overall meaning of a sentence for natural language understanding.

Method Overview

The regularizer is needed because trivial alternatives, such as training models with different initial weights, hyperparameters, architectures, or data orderings, do not prevent them from converging to very similar solutions, all affected by the simplicity bias.

  • A diversity loss penalizes pairwise similarity between models, measured with each classifier’s input gradients at the training points.
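The diversity penalty can be sketched in a few lines. A minimal numpy version, assuming linear classifier heads so that the input gradient of the top predicted score is available in closed form (for a linear head $c_m(x) = W_m x$ it is simply the weight row of the argmax class); the squared cosine similarity used here is illustrative, and the paper's exact similarity measure may differ:

```python
import numpy as np

def diversity_loss(heads, x):
    """Pairwise alignment of the models' input gradients at a point x.

    heads: list of (num_classes, dim) weight matrices, one linear model each.
    For a linear head, the input gradient of the top predicted score is the
    weight row of the argmax class, so no autograd is needed in this sketch.
    """
    grads = []
    for W in heads:
        scores = W @ x                      # class scores of this model
        grads.append(W[np.argmax(scores)])  # input gradient of the top score
    loss = 0.0
    for a in range(len(grads)):
        for b in range(a + 1, len(grads)):
            ga, gb = grads[a], grads[b]
            cos = ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb))
            loss += cos ** 2                # penalize alignment, sign-agnostic
    return loss

# Two heads whose top-class gradients are orthogonal incur no penalty.
W1 = np.array([[1.0, 0.0], [0.0, 0.0]])
W2 = np.array([[0.0, 1.0], [0.0, 0.0]])
x = np.array([1.0, 1.0])
print(diversity_loss([W1, W2], x))  # 0.0
```

Minimizing this penalty jointly with each model's task loss pushes the heads to rely on different input features.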

Setup

  • Dataset: $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$.
  • Model: a set of models $\{f_m\}_{m=1}^{M}$; suppose $f_m = c_m \circ h$, where $h$ is a feature extractor, $c_m$ is a classifier, and $h(x)$ is the hidden representation of the input data.
  • Train:

$\mathcal{L}_{\text{task}} = \sum_{m=1}^{M} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_m(x_i),\, y_i\big),$

where $f_m = c_m \circ h$ and $\ell$ is the cross-entropy loss.

  • Diversity loss: we compare the functions implemented by the classifiers using their input gradients:

$\mathcal{L}_{\text{div}} = \sum_{m \neq m'} \frac{1}{n} \sum_{i=1}^{n} \operatorname{sim}\big(\nabla_{x_i} \hat{y}_m,\, \nabla_{x_i} \hat{y}_{m'}\big),$

where $\nabla_{x} \hat{y}_m$ is the gradient of the model’s largest output component (its top predicted score) and $\operatorname{sim}$ measures the alignment of two gradient vectors.

  • Complete method: minimize $\mathcal{L}_{\text{task}} + \lambda\, \mathcal{L}_{\text{div}}$, where $\lambda$ controls the strength of the diversity regularizer.


FAQ

  • How can diversity induce complexity?
    • Under the simplicity-bias assumption, the model learned by default lies at the simple end of the space of solutions, so models forced to differ from it must rely on more complex patterns.
  • Why use input gradients to quantify diversity?
    • \cite{selvaraju2017grad} show that input gradients are indicative of the features used by the model.
    • Furthermore, $f(\tilde{x}) \approx f(x) + \nabla_{x} f(x)^{\top} (\tilde{x} - x)$, where $\tilde{x}$ is a test point and $x$ is a nearby training point, so the input gradients determine the model’s behavior around the training data.
  • See more in Appendix A:
  • Where to split a model into “feature extractor” and “classifier”?
  • Why design the diversity regularizer on the input gradients rather than on the activations of the models?
  • Is the introduction of more diversity just a fancy random search?
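The first-order expansion in the FAQ above can be checked numerically. A small sketch with a hand-picked smooth function standing in for a classifier score (the function and points are illustrative, not from the paper):

```python
import numpy as np

def f(x):
    # a smooth toy "score" function standing in for a classifier logit
    return np.sin(x[0]) + x[0] * x[1]

def grad_f(x):
    # analytic input gradient of f
    return np.array([np.cos(x[0]) + x[1], x[0]])

x = np.array([0.5, -0.3])             # "training point"
x_test = x + np.array([0.01, -0.02])  # nearby "test point"

exact = f(x_test)
linear = f(x) + grad_f(x) @ (x_test - x)  # first-order Taylor expansion
print(abs(exact - linear))  # small: second-order in the displacement
```

The error of the linearization shrinks quadratically with the displacement, which is why the input gradient at a training point characterizes the model's predictions on nearby test points.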

Biased activity recognition

  • This experiment tries to answer the question: are these patterns relevant for OOD generalization in computer vision tasks?


Multi-dataset collages

  • This experiment tries to answer the question: can we learn predictive patterns otherwise ignored by standard SGD and existing regularizers?



Domain generalization

  • The PACS dataset is a standard benchmark for visual domain generalization (DG). PACS contains 4 domains (Art, Cartoon, Photo, and Sketch), and each domain contains the same 7 categories.
  • VLCS is included for an additional cross-dataset evaluation, i.e. zero-shot transfer.

Domain generalization

[Table: the proposed method compared with existing methods on PACS]


Discussion

  • Limitations of the method
    • The settings of the main hyperparameters (the regularizer strength and the number of models trained) come with no guarantees.
  • Are model fitting and model selection equally hard?
    • In this approach the two steps can be completely decoupled.
  • Universality of inductive biases
    • No learning algorithm’s inductive biases can be universally superior to another’s.
    • This method does not affect inductive biases in a directed way. It only increases the variety of the learned models, so it could be seen as a “meta-regularizer”.
    • Experiments also show that the intuitive notions behind classical regularizers, such as smoothness (Jacobian regularization), sparsity (L1 norm), or simplicity (L2 norm), are sometimes detrimental.

References

  • Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., and Batra, D. (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626.
  • Shah, H., Tamuly, K., Raghunathan, A., Jain, P., and Netrapalli, P. (2020). The pitfalls of simplicity bias in neural networks. Advances in Neural Information Processing Systems, 33:9573–9585.
  • Teney, D., Abbasnejad, E., Lucey, S., and Van den Hengel, A. (2022). Evading the simplicity bias: Training a diverse set of models discovers solutions with superior OOD generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16761–16772.