Compositionality¶

Number of tasks: 7

AROCocoOrder¶

Compositionality evaluation of images to their captions. Each caption has four hard negatives created by permuting its word order.

Dataset: mteb/ARO-COCO-order • License: mit • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	text_acc	eng	Encyclopaedic	expert-annotated	created
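As a rough illustration of what a `text_acc`-style score measures in this setup, the sketch below counts an image as correct when its true caption scores higher than every hard negative. The function name, the score matrix layout, and the convention that the true caption sits in column 0 are assumptions made for the example, not the benchmark's actual implementation.

```python
from typing import Sequence


def text_accuracy(scores: Sequence[Sequence[float]], correct_index: int = 0) -> float:
    """Fraction of images whose highest-scoring caption is the true one.

    scores[i][j] is a hypothetical image-caption similarity for image i and
    candidate caption j. By assumption, the true caption is at correct_index.
    """
    if not scores:
        raise ValueError("scores must contain at least one row")
    hits = 0
    for row in scores:
        # Index of the best-scoring candidate caption for this image.
        best = max(range(len(row)), key=lambda j: row[j])
        hits += int(best == correct_index)
    return hits / len(scores)


# Toy data: 3 images, each with 1 true caption (column 0) and 4 negatives.
toy_scores = [
    [0.9, 0.2, 0.3, 0.1, 0.4],  # true caption wins
    [0.5, 0.6, 0.1, 0.2, 0.3],  # a permuted negative wins
    [0.7, 0.1, 0.2, 0.3, 0.6],  # true caption wins
]
print(text_accuracy(toy_scores))
```

With one true caption and four negatives per image, a model that picks uniformly at random would land near 0.2 under this scheme.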

Citation

@inproceedings{yuksekgonul2023and,
  author = {Yuksekgonul, Mert and Bianchi, Federico and Kalluri, Pratyusha and Jurafsky, Dan and Zou, James},
  booktitle = {The Eleventh International Conference on Learning Representations},
  title = {When and why vision-language models behave like bags-of-words, and what to do about it?},
  year = {2023},
}

AROFlickrOrder¶

Compositionality evaluation of images to their captions. Each caption has four hard negatives created by permuting its word order.

Dataset: mteb/ARO-Flickr-Order • License: mit • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	text_acc	eng	Encyclopaedic	expert-annotated	created

Citation

@inproceedings{yuksekgonul2023and,
  author = {Yuksekgonul, Mert and Bianchi, Federico and Kalluri, Pratyusha and Jurafsky, Dan and Zou, James},
  booktitle = {The Eleventh International Conference on Learning Representations},
  title = {When and why vision-language models behave like bags-of-words, and what to do about it?},
  year = {2023},
}

AROVisualAttribution¶

Compositionality Evaluation of images to their captions.

Dataset: mteb/ARO-Visual-Attribution • License: mit • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	text_acc	eng	Encyclopaedic	expert-annotated	created

Citation

@inproceedings{yuksekgonul2023and,
  author = {Yuksekgonul, Mert and Bianchi, Federico and Kalluri, Pratyusha and Jurafsky, Dan and Zou, James},
  booktitle = {The Eleventh International Conference on Learning Representations},
  title = {When and why vision-language models behave like bags-of-words, and what to do about it?},
  year = {2023},
}

AROVisualRelation¶

Compositionality Evaluation of images to their captions.

Dataset: mteb/ARO-Visual-Relation • License: mit • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	text_acc	eng	Encyclopaedic	expert-annotated	created

Citation

@inproceedings{yuksekgonul2023and,
  author = {Yuksekgonul, Mert and Bianchi, Federico and Kalluri, Pratyusha and Jurafsky, Dan and Zou, James},
  booktitle = {The Eleventh International Conference on Learning Representations},
  title = {When and why vision-language models behave like bags-of-words, and what to do about it?},
  year = {2023},
}

ImageCoDe¶

Identify the correct image from a set of similar images based on a precise caption.

Dataset: mteb/imagecode-multi • License: cc-by-sa-4.0 • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image, text to image (it2i)	image_acc	eng	Web, Written	derived	found
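To make the `image_acc` idea concrete, here is a minimal sketch that picks, for each query, the candidate image whose embedding is most similar to the query embedding, and then reports the share of correct picks. The embeddings, the use of cosine similarity, and all function names are assumptions for illustration; the task's real scoring code may differ.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_image(query: Vector, candidates: Sequence[Vector]) -> int:
    """Index of the candidate image most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def image_accuracy(examples: Sequence[Tuple[Vector, Sequence[Vector], int]]) -> float:
    """Share of examples where the picked image matches the target index."""
    hits = sum(int(pick_image(q, c) == t) for q, c, t in examples)
    return hits / len(examples)


# Toy data: (query embedding, candidate image embeddings, target index).
toy_examples = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]], 0),  # correct pick
    ([0.0, 1.0], [[1.0, 0.0], [0.6, 0.8], [0.1, 0.9]], 1),  # picks index 2
]
print(image_accuracy(toy_examples))
```

Because the candidate images in this task are deliberately similar, small differences in similarity decide the pick, as in the second toy example.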

Citation

@article{krojer2022image,
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  journal = {arXiv preprint arXiv:2203.15867},
  title = {Image retrieval from contextual descriptions},
  year = {2022},
}

SugarCrepe¶

Compositionality Evaluation of images to their captions.

Dataset: mteb/SUGARCREPE_fmt • License: mit • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	text_acc	eng	Encyclopaedic	expert-annotated	created

Citation

@article{hsieh2024sugarcrepe,
  author = {Hsieh, Cheng-Yu and Zhang, Jieyu and Ma, Zixian and Kembhavi, Aniruddha and Krishna, Ranjay},
  journal = {Advances in neural information processing systems},
  title = {Sugarcrepe: Fixing hackable benchmarks for vision-language compositionality},
  volume = {36},
  year = {2024},
}

Winoground¶

Compositionality Evaluation of images to their captions.

Dataset: facebook/winoground • License: https://huggingface.co/datasets/facebook/winoground/blob/main/license_agreement.txt • Learn more →

Task category	Score	Languages	Domains	Annotations Creators	Sample Creation
image to text (i2t)	accuracy	eng	Social	expert-annotated	created
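For orientation, the cited Winoground paper pairs two captions with two images and defines a text score, an image score, and a group score over the four caption-image similarities. The sketch below follows those definitions; the dictionary keys and function name are hypothetical, and which of these quantities the `accuracy` column above corresponds to is not stated here.

```python
from typing import Dict


def winoground_scores(s: Dict[str, float]) -> Dict[str, bool]:
    """Per-example Winoground outcomes from four similarity scores.

    s["cX_iY"] is a hypothetical similarity between caption X and image Y,
    where caption 0 belongs to image 0 and caption 1 belongs to image 1.
    """
    # Text score: for each image, the matching caption beats the other caption.
    text_ok = s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]
    # Image score: for each caption, the matching image beats the other image.
    image_ok = s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]
    # Group score: both conditions hold for the same example.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


toy = {"c0_i0": 0.80, "c1_i0": 0.60, "c1_i1": 0.70, "c0_i1": 0.75}
print(winoground_scores(toy))
```

Averaging each boolean over all examples gives the corresponding dataset-level score.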

Citation

@misc{thrush2022winogroundprobingvisionlanguage,
  archiveprefix = {arXiv},
  author = {Tristan Thrush and Ryan Jiang and Max Bartolo and Amanpreet Singh and Adina Williams and Douwe Kiela and Candace Ross},
  eprint = {2204.03162},
  primaryclass = {cs.CV},
  title = {Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality},
  url = {https://arxiv.org/abs/2204.03162},
  year = {2022},
}