Getting Started: Download MetaShift

Download the MetaShift GitHub Repo

Download the code from our GitHub repo.

Download Base Dataset: Visual Genome

We leveraged the natural heterogeneity of [Visual Genome](https://visualgenome.org) and its annotations to construct MetaShift. Download the pre-processed and cleaned version of Visual Genome provided by GQA.

wget -c https://nlp.stanford.edu/data/gqa/images.zip
unzip images.zip -d allImages
wget -c https://nlp.stanford.edu/data/gqa/sceneGraphs.zip
unzip sceneGraphs.zip -d sceneGraphs
  • Extract the files. After this step, the base dataset file structure should look like this:

/your_path/
    allImages/
        images/
            <ID>.jpg
            ...
    sceneGraphs/
        train_sceneGraphs.json
        val_sceneGraphs.json
  • Specify the local path of Visual Genome by setting IMAGE_DATA_FOLDER in dataset/Constants.py (e.g., IMAGE_DATA_FOLDER=/data/GQA/allImages/images/).
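
Before pointing IMAGE_DATA_FOLDER at the images, it can help to confirm that the extracted files match the layout above. The following is a minimal sketch, not part of the repository; the helper name missing_entries and the EXPECTED list are hypothetical and simply mirror the tree shown above.

```python
from pathlib import Path

# Entries expected under the download root, copied from the tree above.
EXPECTED = [
    "allImages/images",
    "sceneGraphs/train_sceneGraphs.json",
    "sceneGraphs/val_sceneGraphs.json",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling missing_entries("/your_path") returns an empty list when all three entries are present, and otherwise lists the ones that still need to be downloaded or unzipped.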

Understanding dataset/meta_data/full-candidate-subsets.pkl

The image IDs for each subset are provided as a Python dictionary in generate_dataset/meta_data/full-candidate-subsets.pkl in the GitHub repo. The Python script generate_dataset/meta_data/create_MetaShift.py provides the code for generating MetaShift.

The metadata file dataset/meta_data/full-candidate-subsets.pkl is the most important piece of metadata in MetaShift: it provides the full subset information of MetaShift. To facilitate understanding, we have provided a notebook, dataset/understanding_full-candidate-subsets-pkl.ipynb, that shows how to extract information from it.

Basically, the pickle file stores a collections.defaultdict(set) object, which contains 17,938 keys. Each key is a string of the subset name like dog(frisbee), and the corresponding value is the set of IDs of the images that belong to this subset. The image IDs can be used to retrieve the image files from the Visual Genome dataset that you just downloaded. In our current version, 13,543 out of 17,938 subsets have more than 25 valid images. In addition, dataset/meta_data/full-candidate-subsets.pkl is derived from the scene graph annotations, so check those annotations out if your project needs additional information about each image.
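
The description above can be explored with a few lines of Python. This is a minimal sketch that assumes only what is stated here: the pickle holds a mapping from subset names such as dog(frisbee) to collections of image IDs. The helper names (load_subsets, split_subset_name, large_subsets) are hypothetical and do not come from the repository or the notebook.

```python
import pickle
import re


def load_subsets(pkl_path):
    """Load the subset-name -> image-ID mapping from the pickle file."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


def split_subset_name(name):
    """Split a subset name such as 'dog(frisbee)' into ('dog', 'frisbee')."""
    match = re.fullmatch(r"(.+?)\((.+)\)", name)
    if match is None:
        raise ValueError(f"unexpected subset name: {name!r}")
    return match.group(1), match.group(2)


def large_subsets(subsets, min_images=25):
    """Keep only the subsets that have more than `min_images` image IDs."""
    return {name: ids for name, ids in subsets.items() if len(ids) > min_images}
```

For example, subsets = load_subsets("dataset/meta_data/full-candidate-subsets.pkl") followed by large_subsets(subsets) keeps the subsets with more than 25 images, and each image ID corresponds to a file named <ID>.jpg under IMAGE_DATA_FOLDER.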

Generate the Full MetaShift Dataset

Since the total number of all subsets is very large, all of the following scripts only generate a subset of MetaShift. As specified in [dataset/Constants.py](./dataset/Constants.py), we only generate MetaShift for the following classes (subjects). You can add any additional classes (subjects) into the list. See [dataset/meta_data/class_hierarchy.json](./dataset/meta_data/class_hierarchy.json) for the full object vocabulary and its hierarchy.

SELECTED_CLASSES = [
    'cat', 'dog',
    'bus', 'truck',
    'elephant', 'horse',
    'bowl', 'cup',
    ]

In addition, to save storage, all copied images are symbolic links. You can set use_symlink=False in the code to perform actual file copying. If you really want to generate the full MetaShift, then set ONLY_SELECTED_CLASSES = False in dataset/Constants.py.

cd dataset/
python generate_full_MetaShift.py

The following files will be generated by executing the script. Modify the global variable SUBPOPULATION_SHIFT_DATASET_FOLDER to change the destination folder.

/data/MetaShift/MetaDataset-full
├── cat/
    ├── cat(keyboard)/
    ├── cat(sink)/
    ├── ...
├── dog/
    ├── dog(surfboard)/
    ├── dog(boat)/
    ├── ...
├── bus/
├── ...
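
The symlink-versus-copy switch and the class filter discussed in this section can be pictured with the sketch below. It is an illustration under assumptions, not the code in generate_full_MetaShift.py: the function names place_image and keep_subject are hypothetical, and only the option names use_symlink and ONLY_SELECTED_CLASSES come from this README.

```python
import os
import shutil


def place_image(src, dst, use_symlink=True):
    """Create `dst` as a symbolic link to `src`, or as a real copy of it."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    if use_symlink:
        # A link stores only the path, so no image data is duplicated.
        os.symlink(os.path.abspath(src), dst)
    else:
        shutil.copyfile(src, dst)


def keep_subject(subject, selected_classes, only_selected_classes=True):
    """Decide whether subsets of a class (subject) should be generated."""
    return (not only_selected_classes) or subject in selected_classes
```

Symbolic links keep the output folder small but break if the Visual Genome images are later moved, which is the main reason to prefer real copies when the dataset has to be shipped elsewhere.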

Beyond the generated MetaShift dataset, the script also generates the meta-graphs for each class in dataset/meta-graphs.

.
├── README.md
├── dataset/
    ├── generate_full_MetaShift.py
    ├── meta-graphs/             (generated meta-graph visualization)
        ├──  cat_graph.jpg
        ├──  dog_graph.jpg
        ├──  ...
    ├── ...

Section 4.2: Evaluating Subpopulation Shifts

Run the python script dataset/subpopulation_shift_cat_dog_indoor_outdoor.py to reproduce the MetaShift subpopulation shift dataset (based on Visual Genome images) in the paper.

cd dataset/
python subpopulation_shift_cat_dog_indoor_outdoor.py

The python script generates a “Cat vs. Dog” dataset, where the general contexts “indoor/outdoor” have a natural spurious correlation with the class labels.

The following files will be generated by executing the python script dataset/subpopulation_shift_cat_dog_indoor_outdoor.py.

Output files (mixed version: for reproducing experiments)

/data/MetaShift/MetaShift-subpopulation-shift
├── imageID_to_group.pkl
├── train/
    ├── cat/             (more cat(indoor) images than cat(outdoor))
    ├── dog/             (more dog(outdoor) images than dog(indoor))
├── val_out_of_domain/
    ├── cat/             (cat(indoor):cat(outdoor)=1:1)
    ├── dog/             (dog(indoor):dog(outdoor)=1:1)

where imageID_to_group.pkl is a dictionary with 4 keys: 'cat(indoor)', 'cat(outdoor)', 'dog(indoor)', 'dog(outdoor)'. The value for each key is the list of the names of the images that belong to that subset. Modify the global variable SUBPOPULATION_SHIFT_DATASET_FOLDER to change the destination folder. You can tune NUM_MINORITY_IMG to control the amount of subpopulation shift.
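
A minimal sketch for inspecting imageID_to_group.pkl follows. It assumes the layout described above (group name mapped to a list of image names); if the file on disk is organized differently, the helpers need to be adapted. The function names are hypothetical.

```python
import pickle


def load_groups(pkl_path):
    """Load the group-name -> image-names mapping from the pickle file."""
    with open(pkl_path, "rb") as f:
        return pickle.load(f)


def group_sizes(group_to_images):
    """Count how many images fall into each group."""
    return {group: len(images) for group, images in group_to_images.items()}


def image_to_groups(group_to_images):
    """Invert the mapping: image name -> list of groups containing it."""
    inverted = {}
    for group, images in group_to_images.items():
        for image in images:
            inverted.setdefault(image, []).append(group)
    return inverted
```

The group sizes give a quick check of how strong the spurious correlation is for a given NUM_MINORITY_IMG, and the inverted mapping is convenient for reporting per-group accuracy at evaluation time.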

Output files (unmixed version, for other potential uses)

To facilitate other potential uses, we also output an unmixed version, where the 'cat(indoor)', 'cat(outdoor)', 'dog(indoor)', and 'dog(outdoor)' images are written into 4 separate folders. Modify the global variable CUSTOM_SPLIT_DATASET_FOLDER to change the destination folder.

/data/MetaShift/MetaShift-Cat-Dog-indoor-outdoor
├── imageID_to_group.pkl
├── train/
    ├── cat/             (all cat(indoor) images)
        ├── cat(indoor)/
    ├── dog/             (all dog(outdoor) images)
        ├── dog(outdoor)/
├── test/
    ├── cat/             (all cat(outdoor) images)
        ├── cat(outdoor)/
    ├── dog/             (all dog(indoor) images)
        ├── dog(indoor)/
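
One possible use of the unmixed folders is to build a training set with a custom amount of subpopulation shift. The sketch below is an assumption about how such a mix could be composed (all majority images plus a fixed number of sampled minority images); it is not taken from the repository, and mix_groups and its parameters are hypothetical.

```python
import random


def mix_groups(majority, minority, num_minority, seed=0):
    """Return all majority images plus `num_minority` sampled minority images."""
    if num_minority > len(minority):
        raise ValueError("not enough minority images to sample from")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sampled = rng.sample(sorted(minority), num_minority)
    return sorted(majority) + sorted(sampled)
```

For instance, passing the file names under cat(indoor)/ as the majority and those under cat(outdoor)/ as the minority yields a cat training list whose indoor/outdoor ratio is controlled by num_minority.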

Appendix D: Constructing MetaShift from COCO Dataset

The notebook dataset/extend_to_COCO/coco_MetaShift.ipynb reproduces the COCO subpopulation shift dataset in Appendix D of the paper. Executing the notebook constructs a “Cat vs. Dog” task based on COCO images, where the “indoor/outdoor” contexts are spuriously correlated with the class labels.

Install COCO Dependencies

Install pycocotools (for evaluation on COCO):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

COCO Data preparation

2017 Train/Val annotations [241MB]

2017 Train images [118K/18GB]

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

/home/ubuntu/data/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Modify the global variable IMAGE_DATA_FOLDER to change the COCO image folder.
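
To see which COCO images contain a given class before running the notebook, the annotation JSON can be read directly. This is a minimal sketch that relies only on the standard COCO instances format (top-level "categories" and "annotations" lists); the notebook itself may use pycocotools instead, and the function names here are hypothetical.

```python
import json


def load_instances(path):
    """Load a COCO instances annotation file, e.g. instances_train2017.json."""
    with open(path) as f:
        return json.load(f)


def image_ids_with_category(instances, category_name):
    """Return the IDs of images with at least one object of the category."""
    category_ids = {
        c["id"] for c in instances["categories"] if c["name"] == category_name
    }
    return {
        a["image_id"]
        for a in instances["annotations"]
        if a["category_id"] in category_ids
    }
```

With the directory structure above, the training annotations would be loaded from /home/ubuntu/data/coco/annotations/instances_train2017.json, and image_ids_with_category(instances, "cat") gives the candidate cat images.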

Output files (mixed version: for reproducing experiments)

The following files will be generated by executing the notebook.

/data/MetaShift/COCO-Cat-Dog-indoor-outdoor
├── imageID_to_group.pkl
├── train/
    ├── cat/
    ├── dog/
├── val_out_of_domain/
    ├── cat/
    ├── dog/

where imageID_to_group.pkl is a dictionary with 4 keys: 'cat(indoor)', 'cat(outdoor)', 'dog(indoor)', 'dog(outdoor)'. The value for each key is the list of the names of the images that belong to that subset. Modify the global variable CUSTOM_SPLIT_DATASET_FOLDER to change the destination folder.

Section 4.1: Evaluating Domain Generalization

Run the python script dataset/domain_generalization_cat_dog.py to reproduce the MetaShift domain generalization dataset (based on Visual Genome images) in the paper.

cd dataset/
python domain_generalization_cat_dog.py

Output files (cat vs. dog, unmixed version)

The following files will be generated by executing the python script dataset/domain_generalization_cat_dog.py. Modify the global variable CUSTOM_SPLIT_DATASET_FOLDER to change the destination folder.

/data/MetaShift/Domain-Generalization-Cat-Dog
├── train/
    ├── cat/
        ├── cat(sofa)/              (The cat training data is always cat(sofa + bed))
        ├── cat(bed)/               (The cat training data is always cat(sofa + bed))
    ├── dog/
        ├── dog(cabinet)/           (Experiment 1: the dog training data is dog(cabinet + bed))
        ├── dog(bed)/               (Experiment 1: the dog training data is dog(cabinet + bed))

        ├── dog(bag)/               (Experiment 2: the dog training data is dog(bag + box))
        ├── dog(box)/               (Experiment 2: the dog training data is dog(bag + box))

        ├── dog(bench)/             (Experiment 3: the dog training data is dog(bench + bike))
        ├── dog(bike)/              (Experiment 3: the dog training data is dog(bench + bike))

        ├── dog(boat)/              (Experiment 4: the dog training data is dog(boat + surfboard))
        ├── dog(surfboard)/         (Experiment 4: the dog training data is dog(boat + surfboard))

├── test/
    ├── dog/
        ├── dog(shelf)/             (The test set we used in the paper)
        ├── dog(sofa)/
        ├── dog(grass)/
        ├── dog(vehicle)/
        ├── dog(cap)/
    ├── cat/
        ├── cat(shelf)/
        ├── cat(grass)/
        ├── cat(sink)/
        ├── cat(computer)/
        ├── cat(box)/
        ├── cat(book)/
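
Because the train/ folder holds the dog subsets for all four experiments side by side, each experiment needs to select its own pair of dog folders. The sketch below encodes the experiment-to-folder mapping from the tree above; the names CAT_TRAIN, DOG_TRAIN, and train_folders are hypothetical and are not part of the repository.

```python
import os

# The cat training subsets are the same in every experiment.
CAT_TRAIN = ["cat(sofa)", "cat(bed)"]

# Dog training subsets per experiment, as annotated in the tree above.
DOG_TRAIN = {
    1: ["dog(cabinet)", "dog(bed)"],
    2: ["dog(bag)", "dog(box)"],
    3: ["dog(bench)", "dog(bike)"],
    4: ["dog(boat)", "dog(surfboard)"],
}


def train_folders(root, experiment):
    """List the training folders used by one experiment (1 to 4)."""
    folders = [os.path.join(root, "train", "cat", s) for s in CAT_TRAIN]
    folders += [
        os.path.join(root, "train", "dog", s) for s in DOG_TRAIN[experiment]
    ]
    return folders
```

For example, train_folders("/data/MetaShift/Domain-Generalization-Cat-Dog", 2) returns the two cat folders plus dog(bag)/ and dog(box)/.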

Code for Distribution Shift Experiments

The python script experiments/distribution_shift/main_generalization.py is the entry point for running the distribution shift experiments for Section 4.2 (Evaluating Subpopulation Shifts), Appendix D (Constructing MetaShift from COCO Dataset), and Section 4.1 (Evaluating Domain Generalization). As a running example, the default value for --data in argparse is /data/MetaShift/MetaShift-subpopulation-shift (i.e., for Section 4.2).

clear && CUDA_VISIBLE_DEVICES=3 python main_generalization.py --num-domains 2 --algorithm ERM
clear && CUDA_VISIBLE_DEVICES=4 python main_generalization.py --num-domains 2 --algorithm GroupDRO
clear && CUDA_VISIBLE_DEVICES=5 python main_generalization.py --num-domains 2 --algorithm IRM
clear && CUDA_VISIBLE_DEVICES=6 python main_generalization.py --num-domains 2 --algorithm CORAL
clear && CUDA_VISIBLE_DEVICES=7 python main_generalization.py --num-domains 2 --algorithm CDANN
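
The commands above launch one algorithm per GPU. On a machine with a single GPU, the same runs can be queued one after another; the following is a minimal sketch using only the flags mentioned in this README (--num-domains, --algorithm, --data). The helper names are hypothetical, and the script is assumed to be started from the directory that contains main_generalization.py.

```python
import os
import subprocess

ALGORITHMS = ["ERM", "GroupDRO", "IRM", "CORAL", "CDANN"]


def build_command(algorithm, num_domains=2, data=None):
    """Build the argument list for one main_generalization.py run."""
    cmd = [
        "python", "main_generalization.py",
        "--num-domains", str(num_domains),
        "--algorithm", algorithm,
    ]
    if data is not None:
        # Override the default dataset folder, e.g. for the COCO variant.
        cmd += ["--data", data]
    return cmd


def run_all(gpu="0", data=None):
    """Run every algorithm sequentially on a single GPU."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    for algorithm in ALGORITHMS:
        subprocess.run(build_command(algorithm, data=data), env=env, check=True)
```

Passing data="/data/MetaShift/COCO-Cat-Dog-indoor-outdoor" to run_all would point the same five runs at the COCO dataset from Appendix D instead of the default folder.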

Our code is based on DomainBed, which was introduced in "In Search of Lost Domain Generalization". The DomainBed codebase also provides many additional algorithms. Many thanks to the authors and developers!

Citation

@InProceedings{liang2022metashift,
title={MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts},
author={Weixin Liang and James Zou},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=MTex8qKavoS}
}