Skip to content

Any2AnyRetrieval

  • Number of tasks: 89

AudioCapsA2TRetrieval

Natural language description for any kind of audio in the wild.

Dataset: mteb/audiocaps_a2tLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng, zxx Encyclopaedic, Written derived found
Citation
@inproceedings{kim2019audiocaps,
  author = {Kim, Chris Dongjoo and Kim, Byeongchang and Lee, Hyunmin and Kim, Gunhee},
  booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  pages = {119--132},
  title = {Audiocaps: Generating captions for audios in the wild},
  year = {2019},
}

AudioCapsT2ARetrieval

Natural language description for any kind of audio in the wild.

Dataset: mteb/audiocaps_t2aLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng, zxx Encyclopaedic, Written derived found
Citation
@inproceedings{kim2019audiocaps,
  author = {Kim, Chris Dongjoo and Kim, Byeongchang and Lee, Hyunmin and Kim, Gunhee},
  booktitle = {Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)},
  pages = {119--132},
  title = {Audiocaps: Generating captions for audios in the wild},
  year = {2019},
}

AudioSetStrongA2TRetrieval

Retrieve all temporally-strong labeled events within 10s audio clips from the AudioSet Strongly-Labeled subset.

Dataset: mteb/audioset_strong_a2tLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng AudioScene derived found
Citation
@misc{hershey2021benefittemporallystronglabelsaudio,
  archiveprefix = {arXiv},
  author = {Shawn Hershey and Daniel P W Ellis and Eduardo Fonseca and Aren Jansen and Caroline Liu and R Channing Moore and Manoj Plakal},
  eprint = {2105.07031},
  primaryclass = {cs.SD},
  title = {The Benefit Of Temporally-Strong Labels In Audio Event Classification},
  url = {https://arxiv.org/abs/2105.07031},
  year = {2021},
}

AudioSetStrongT2ARetrieval

Retrieve audio segments corresponding to a given sound event label from the AudioSet Strongly-Labeled 10s clips.

Dataset: mteb/audioset_strong_t2aLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng AudioScene derived found
Citation
@misc{hershey2021benefittemporallystronglabelsaudio,
  archiveprefix = {arXiv},
  author = {Shawn Hershey and Daniel P W Ellis and Eduardo Fonseca and Aren Jansen and Caroline Liu and R Channing Moore and Manoj Plakal},
  eprint = {2105.07031},
  primaryclass = {cs.SD},
  title = {The Benefit Of Temporally-Strong Labels In Audio Event Classification},
  url = {https://arxiv.org/abs/2105.07031},
  year = {2021},
}

BLINKIT2IRetrieval

Retrieve images based on images and specific retrieval instructions.

Dataset: mteb/blink-it2iLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image (it2i) cv_recall_at_1 eng Encyclopaedic derived found
Citation
@article{fu2024blink,
  author = {Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2404.12390},
  title = {Blink: Multimodal large language models can see but not perceive},
  year = {2024},
}

BLINKIT2TRetrieval

Retrieve images based on images and specific retrieval instructions.

Dataset: mteb/blink-it2tLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) cv_recall_at_1 eng Encyclopaedic derived found
Citation
@article{fu2024blink,
  author = {Fu, Xingyu and Hu, Yushi and Li, Bangzheng and Feng, Yu and Wang, Haoyu and Lin, Xudong and Roth, Dan and Smith, Noah A and Ma, Wei-Chiu and Krishna, Ranjay},
  journal = {arXiv preprint arXiv:2404.12390},
  title = {Blink: Multimodal large language models can see but not perceive},
  year = {2024},
}

CIRRIT2IRetrieval

Retrieve images based on texts and images.

Dataset: mteb/mbeir_cirr_task7License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image (it2i) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{liu2021image,
  author = {Liu, Zheyuan and Rodriguez-Opazo, Cristian and Teney, Damien and Gould, Stephen},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {2125--2134},
  title = {Image retrieval on real-life images with pre-trained vision-and-language models},
  year = {2021},
}

CMUArcticA2TRetrieval

Retrieve the correct transcription for an English speech segment. The dataset is derived from the phonetically balanced CMU Arctic single-speaker TTS corpora. The corpora contains 1150 samples based on read-aloud segments from books, which are out of copyright and derived from the Gutenberg project.

Dataset: mteb/CMU_Arctic_a2tLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken derived found
Citation
@techreport{cmu-lti-03-177,
  author = {Clark, Rob and Richmond, Keith},
  institution = {Carnegie Mellon University, Language Technologies Institute},
  number = {CMU-LTI-03-177},
  title = {A detailed report on the CMU Arctic speech database},
  year = {2003},
}

CMUArcticT2ARetrieval

Retrieve the correct audio segment for an English transcription. The dataset is derived from the phonetically balanced CMU Arctic single-speaker TTS corpora. The corpora contains 1150 audio-text pairs based on read-aloud segments from public domain books originally sourced from the Gutenberg project.

Dataset: mteb/CMU_Arctic_t2aLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken derived found
Citation
@techreport{cmu-lti-03-177,
  author = {Clark, Rob and Richmond, Keith},
  institution = {Carnegie Mellon University, Language Technologies Institute},
  number = {CMU-LTI-03-177},
  title = {A detailed report on the CMU Arctic speech database},
  year = {2003},
}

CUB200I2IRetrieval

Retrieve bird images from 200 classes.

Dataset: mteb/cub200_retrievalLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@article{welinder2010caltech,
  author = {Welinder, Peter and Branson, Steve and Mita, Takeshi and Wah, Catherine and Schroff, Florian and Belongie, Serge and Perona, Pietro},
  month = {09},
  pages = {},
  title = {Caltech-UCSD Birds 200},
  year = {2010},
}

ClothoA2TRetrieval

An audio captioning datasetst containing audio clips and their corresponding captions.

Dataset: mteb/ClothoLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Encyclopaedic, Written derived found
Citation
@misc{drossos2019clothoaudiocaptioningdataset,
  archiveprefix = {arXiv},
  author = {Konstantinos Drossos and Samuel Lipping and Tuomas Virtanen},
  eprint = {1910.09387},
  primaryclass = {cs.SD},
  title = {Clotho: An Audio Captioning Dataset},
  url = {https://arxiv.org/abs/1910.09387},
  year = {2019},
}

ClothoT2ARetrieval

An audio captioning datasetst containing audio clips from the Freesound platform and their corresponding captions.

Dataset: mteb/ClothoLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Encyclopaedic, Written derived found
Citation
@misc{drossos2019clothoaudiocaptioningdataset,
  archiveprefix = {arXiv},
  author = {Konstantinos Drossos and Samuel Lipping and Tuomas Virtanen},
  eprint = {1910.09387},
  primaryclass = {cs.SD},
  title = {Clotho: An Audio Captioning Dataset},
  url = {https://arxiv.org/abs/1910.09387},
  year = {2019},
}

CommonVoiceMini17A2TRetrieval

Speech recordings with corresponding text transcriptions from CommonVoice dataset.

Dataset: mteb/common_voice_17_0_miniLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 ara, ast, bel, ben, bre, ... (50) Spoken human-annotated found
Citation
@inproceedings{ardila2019common,
  author = {Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
  pages = {4218--4222},
  title = {Common voice: A massively-multilingual speech corpus},
  year = {2020},
}

CommonVoiceMini17T2ARetrieval

Speech recordings with corresponding text transcriptions from CommonVoice dataset.

Dataset: mteb/common_voice_17_0_miniLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 ara, ast, bel, ben, bre, ... (50) Spoken human-annotated found
Citation
@inproceedings{ardila2019common,
  author = {Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
  pages = {4218--4222},
  title = {Common voice: A massively-multilingual speech corpus},
  year = {2020},
}

CommonVoiceMini21A2TRetrieval

Speech recordings with corresponding text transcriptions from CommonVoice dataset.

Dataset: mteb/common_voice_21_0_miniLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 abk, afr, amh, ara, asm, ... (114) Spoken human-annotated found
Citation
@inproceedings{ardila2019common,
  author = {Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
  pages = {4218--4222},
  title = {Common voice: A massively-multilingual speech corpus},
  year = {2020},
}

CommonVoiceMini21T2ARetrieval

Speech recordings with corresponding text transcriptions from CommonVoice dataset.

Dataset: mteb/common_voice_21_0_miniLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 abk, afr, amh, ara, asm, ... (114) Spoken human-annotated found
Citation
@inproceedings{ardila2019common,
  author = {Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M and Weber, Gregor},
  booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
  pages = {4218--4222},
  title = {Common voice: A massively-multilingual speech corpus},
  year = {2020},
}

EDIST2ITRetrieval

Retrieve news images and titles based on news content.

Dataset: mteb/mbeir_edis_task2License: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image, text (t2it) ndcg_at_10 eng News derived created
Citation
@inproceedings{liu2023edis,
  author = {Liu, Siqi and Feng, Weixi and Fu, Tsu-Jui and Chen, Wenhu and Wang, William},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages = {4877--4894},
  title = {EDIS: Entity-Driven Image Search over Multimodal Web Content},
  year = {2023},
}

EmoVDBA2TRetrieval

Natural language emotional captions for speech segments from the EmoV-DB emotional voices database.

Dataset: mteb/EmoV_DB_a2tLicense: https://github.com/numediart/EmoV-DB/blob/master/LICENSE.md • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken derived found
Citation
@misc{adigwe2018emotional,
  archiveprefix = {arXiv},
  author = {Adaeze Adigwe and Noé Tits and Kevin El Haddad and Sarah Ostadabbas and Thierry Dutoit},
  eprint = {1806.09514},
  primaryclass = {cs.CL},
  title = {The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems},
  url = {https://arxiv.org/abs/1806.09514},
  year = {2018},
}

EmoVDBT2ARetrieval

Natural language emotional captions for speech segments from the EmoV-DB emotional voices database.

Dataset: mteb/EmoV_DB_t2aLicense: https://github.com/numediart/EmoV-DB/blob/master/LICENSE.md • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken derived found
Citation
@misc{adigwe2018emotional,
  archiveprefix = {arXiv},
  author = {Adaeze Adigwe and Noé Tits and Kevin El Haddad and Sarah Ostadabbas and Thierry Dutoit},
  eprint = {1806.09514},
  primaryclass = {cs.CL},
  title = {The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems},
  url = {https://arxiv.org/abs/1806.09514},
  year = {2018},
}

EncyclopediaVQAIT2ITRetrieval

Retrieval Wiki passage and image and passage to answer query about an image.

Dataset: izhx/UMRB-EncyclopediaVQALicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image, text (it2it) cv_recall_at_5 eng Encyclopaedic derived created
Citation
@inproceedings{mensink2023encyclopedic,
  author = {Mensink, Thomas and Uijlings, Jasper and Castrejon, Lluis and Goel, Arushi and Cadar, Felipe and Zhou, Howard and Sha, Fei and Araujo, Andr{\'e} and Ferrari, Vittorio},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {3113--3124},
  title = {Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories},
  year = {2023},
}

FORBI2IRetrieval

Retrieve flat object images from 8 classes.

Dataset: mteb/forb_retrievalLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@misc{wu2023forbflatobjectretrieval,
  archiveprefix = {arXiv},
  author = {Pengxiang Wu and Siman Wang and Kevin Dela Rosa and Derek Hao Hu},
  eprint = {2309.16249},
  primaryclass = {cs.CV},
  title = {FORB: A Flat Object Retrieval Benchmark for Universal Image Embedding},
  url = {https://arxiv.org/abs/2309.16249},
  year = {2023},
}

Fashion200kI2TRetrieval

Retrieve clothes based on descriptions.

Dataset: mteb/mbeir_fashion200k_task3License: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{han2017automatic,
  author = {Han, Xintong and Wu, Zuxuan and Huang, Phoenix X and Zhang, Xiao and Zhu, Menglong and Li, Yuan and Zhao, Yang and Davis, Larry S},
  booktitle = {Proceedings of the IEEE international conference on computer vision},
  pages = {1463--1471},
  title = {Automatic spatially-aware fashion concept discovery},
  year = {2017},
}

Fashion200kT2IRetrieval

Retrieve clothes based on descriptions.

Dataset: mteb/mbeir_fashion200k_task0License: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{han2017automatic,
  author = {Han, Xintong and Wu, Zuxuan and Huang, Phoenix X and Zhang, Xiao and Zhu, Menglong and Li, Yuan and Zhao, Yang and Davis, Larry S},
  booktitle = {Proceedings of the IEEE international conference on computer vision},
  pages = {1463--1471},
  title = {Automatic spatially-aware fashion concept discovery},
  year = {2017},
}

FashionIQIT2IRetrieval

Retrieve clothes based on descriptions.

Dataset: mteb/mbeir_fashioniq_task7License: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image (it2i) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{wu2021fashion,
  author = {Wu, Hui and Gao, Yupeng and Guo, Xiaoxiao and Al-Halah, Ziad and Rennie, Steven and Grauman, Kristen and Feris, Rogerio},
  booktitle = {Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition},
  pages = {11307--11317},
  title = {Fashion iq: A new dataset towards retrieving images by natural language feedback},
  year = {2021},
}

FleursA2TRetrieval

Speech recordings with corresponding text transcriptions from the FLEURS dataset.

Dataset: mteb/fleursLicense: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 afr, amh, ara, asm, ast, ... (102) Spoken human-annotated found
Citation
@inproceedings{conneau2023fleurs,
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {2022 IEEE Spoken Language Technology Workshop (SLT)},
  organization = {IEEE},
  pages = {798--805},
  title = {Fleurs: Few-shot learning evaluation of universal representations of speech},
  year = {2023},
}

FleursT2ARetrieval

Speech recordings with corresponding text transcriptions from the FLEURS dataset.

Dataset: mteb/fleursLicense: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 afr, amh, ara, asm, ast, ... (102) Spoken human-annotated found
Citation
@inproceedings{conneau2023fleurs,
  author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {2022 IEEE Spoken Language Technology Workshop (SLT)},
  organization = {IEEE},
  pages = {798--805},
  title = {Fleurs: Few-shot learning evaluation of universal representations of speech},
  year = {2023},
}

Flickr30kI2TRetrieval

Retrieve captions based on images.

Dataset: mteb/flickr30ki2tLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Web, Written derived found
Citation
@article{Young2014FromID,
  author = {Peter Young and Alice Lai and Micah Hodosh and J. Hockenmaier},
  journal = {Transactions of the Association for Computational Linguistics},
  pages = {67-78},
  title = {From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  url = {https://api.semanticscholar.org/CorpusID:3104920},
  volume = {2},
  year = {2014},
}

Flickr30kT2IRetrieval

Retrieve images based on captions.

Dataset: mteb/flickr30kt2iLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Web, Written derived found
Citation
@article{Young2014FromID,
  author = {Peter Young and Alice Lai and Micah Hodosh and J. Hockenmaier},
  journal = {Transactions of the Association for Computational Linguistics},
  pages = {67-78},
  title = {From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions},
  url = {https://api.semanticscholar.org/CorpusID:3104920},
  volume = {2},
  year = {2014},
}

GLDv2I2IRetrieval

Retrieve names of landmarks based on their image.

Dataset: mteb/gld-v2License: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{Weyand_2020_CVPR,
  author = {Weyand, Tobias and Araujo, Andre and Cao, Bingyi and Sim, Jack},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  title = {Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval},
  year = {2020},
}

GLDv2I2TRetrieval

Retrieve names of landmarks based on their image.

Dataset: mteb/gld-v2-i2tLicense: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{Weyand_2020_CVPR,
  author = {Weyand, Tobias and Araujo, Andre and Cao, Bingyi and Sim, Jack},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  title = {Google Landmarks Dataset v2 - A Large-Scale Benchmark for Instance-Level Recognition and Retrieval},
  year = {2020},
}

GigaSpeechA2TRetrieval

Given an English speech segment, retrieve its correct transcription. Audio comes from the 10 000‑hour training subset of GigaSpeech, which originates from ≈40 000 hours of transcribed audiobooks, podcasts, and YouTube.

Dataset: mteb/gigaspeech_a2tLicense: https://github.com/SpeechColab/GigaSpeech/blob/main/LICENSE • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken human-annotated found
Citation
@inproceedings{GigaSpeech2021,
  author = {Chen, Guoguo and Chai, Shuzhou and Wang, Guanbo and Du, Jiayu and Zhang, Wei-Qiang and Weng, Chao and Su, Dan and Povey, Daniel and Trmal, Jan and Zhang, Junbo and Jin, Mingjie and Khudanpur, Sanjeev and Watanabe, Shinji and Zhao, Shuaijiang and Zou, Wei and Li, Xiangang and Yao, Xuchen and Wang, Yongqing and Wang, Yujun and You, Zhao and Yan, Zhiyong},
  booktitle = {Proc. Interspeech 2021},
  title = {GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
  year = {2021},
}

GigaSpeechT2ARetrieval

Given an English transcription, retrieve its corresponding audio segment. Audio comes from the 10 000‑hour training subset of GigaSpeech, sourced from ≈40 000 hours of transcribed audiobooks, podcasts, and YouTube.

Dataset: mteb/gigaspeech_t2aLicense: https://github.com/SpeechColab/GigaSpeech/blob/main/LICENSE • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken human-annotated found
Citation
@inproceedings{GigaSpeech2021,
  author = {Chen, Guoguo and Chai, Shuzhou and Wang, Guanbo and Du, Jiayu and Zhang, Wei-Qiang and Weng, Chao and Su, Dan and Povey, Daniel and Trmal, Jan and Zhang, Junbo and Jin, Mingjie and Khudanpur, Sanjeev and Watanabe, Shinji and Zhao, Shuaijiang and Zou, Wei and Li, Xiangang and Yao, Xuchen and Wang, Yongqing and Wang, Yujun and You, Zhao and Yan, Zhiyong},
  booktitle = {Proc. Interspeech 2021},
  title = {GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
  year = {2021},
}

GoogleSVQA2TRetrieval

Multilingual audio-to-text retrieval using the Simple Voice Questions (SVQ) dataset. Given an audio query, retrieve the corresponding text transcription.

Dataset: mteb/svqLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 acm, apc, arq, arz, ben, ... (20) Spoken human-annotated found
Citation
@inproceedings{heigold2025massive,
  author = {Georg Heigold and Ehsan Variani and Tom Bagby and Cyril Allauzen and Ji Ma and Shankar Kumar and Michael Riley},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  title = {Massive Sound Embedding Benchmark ({MSEB})},
  url = {https://openreview.net/forum?id=X0juYgFVng},
  year = {2025},
}

GoogleSVQT2ARetrieval

Multilingual text-to-audio retrieval using the Simple Voice Questions (SVQ) dataset. Given a text query, retrieve the corresponding audio recording.

Dataset: mteb/svqLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 acm, apc, arq, arz, ben, ... (20) Spoken human-annotated found
Citation
@inproceedings{heigold2025massive,
  author = {Georg Heigold and Ehsan Variani and Tom Bagby and Cyril Allauzen and Ji Ma and Shankar Kumar and Michael Riley},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  title = {Massive Sound Embedding Benchmark ({MSEB})},
  url = {https://openreview.net/forum?id=X0juYgFVng},
  year = {2025},
}

HatefulMemesI2TRetrieval

Retrieve captions based on memes to assess OCR abilities.

Dataset: mteb/MMSoc_HatefulMemesLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived found
Citation
@article{kiela2020hateful,
  author = {Kiela, Douwe and Firooz, Hamed and Mohan, Aravind and Goswami, Vedanuj and Singh, Amanpreet and Ringshia, Pratik and Testuggine, Davide},
  journal = {Advances in neural information processing systems},
  pages = {2611--2624},
  title = {The hateful memes challenge: Detecting hate speech in multimodal memes},
  volume = {33},
  year = {2020},
}

HatefulMemesT2IRetrieval

Retrieve captions based on memes to assess OCR abilities.

Dataset: mteb/MMSoc_HatefulMemesLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived found
Citation
@article{kiela2020hateful,
  author = {Kiela, Douwe and Firooz, Hamed and Mohan, Aravind and Goswami, Vedanuj and Singh, Amanpreet and Ringshia, Pratik and Testuggine, Davide},
  journal = {Advances in neural information processing systems},
  pages = {2611--2624},
  title = {The hateful memes challenge: Detecting hate speech in multimodal memes},
  volume = {33},
  year = {2020},
}

HiFiTTSA2TRetrieval

Sentence-level text captions aligned to 44.1 kHz audiobook speech segments from the Hi‑Fi Multi‑Speaker English TTS dataset. Dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg.

Dataset: mteb/hifi-tts_a2tLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken derived found
Citation
@article{bakhturina2021hi,
  author = {Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
  journal = {arXiv preprint arXiv:2104.01497},
  title = {{Hi-Fi Multi-Speaker English TTS Dataset}},
  year = {2021},
}

HiFiTTST2ARetrieval

Sentence-level text captions aligned to 44.1 kHz audiobook speech segments from the Hi‑Fi Multi‑Speaker English TTS dataset. Dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg.

Dataset: mteb/hifi-tts_t2aLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken derived found
Citation
@article{bakhturina2021hi,
  author = {Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
  journal = {arXiv preprint arXiv:2104.01497},
  title = {{Hi-Fi Multi-Speaker English TTS Dataset}},
  year = {2021},
}

ImageCoDeT2IRetrieval

Retrieve a specific video frame based on a precise caption.

Dataset: mteb/imagecodeLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) cv_recall_at_3 eng Web, Written derived found
Citation
@article{krojer2022image,
  author = {Krojer, Benno and Adlakha, Vaibhav and Vineet, Vibhav and Goyal, Yash and Ponti, Edoardo and Reddy, Siva},
  journal = {arXiv preprint arXiv:2203.15867},
  title = {Image retrieval from contextual descriptions},
  year = {2022},
}

InfoSeekIT2ITRetrieval

Retrieve source text and image information to answer questions about images.

Dataset: mteb/InfoSeekIT2ITRetrievalLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image, text (it2it) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{chen2023can,
  author = {Chen, Yang and Hu, Hexiang and Luan, Yi and Sun, Haitian and Changpinyo, Soravit and Ritter, Alan and Chang, Ming-Wei},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages = {14948--14968},
  title = {Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?},
  year = {2023},
}

InfoSeekIT2TRetrieval

Retrieve source information to answer questions about images.

Dataset: mteb/mbeir_infoseek_task6License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{chen2023can,
  author = {Chen, Yang and Hu, Hexiang and Luan, Yi and Sun, Haitian and Changpinyo, Soravit and Ritter, Alan and Chang, Ming-Wei},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages = {14948--14968},
  title = {Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?},
  year = {2023},
}

JLCorpusA2TRetrieval

Emotional speech segments from the JL-Corpus, balanced over long vowels and annotated for primary and secondary emotions.

Dataset: mteb/jl_corpus_a2tLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken derived found
Citation
@inproceedings{james2018open,
  author = {James, Jesin and Li, Tian and Watson, Catherine},
  booktitle = {Proc. Interspeech 2018},
  title = {An Open Source Emotional Speech Corpus for Human Robot Interaction Applications},
  year = {2018},
}

JLCorpusT2ARetrieval

Emotional speech segments from the JL-Corpus, balanced over long vowels and annotated for primary and secondary emotions.

Dataset: mteb/jl_corpus_t2aLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken derived found
Citation
@inproceedings{james2018open,
  author = {James, Jesin and Li, Tian and Watson, Catherine},
  booktitle = {Proc. Interspeech 2018},
  title = {An Open Source Emotional Speech Corpus for Human Robot Interaction Applications},
  year = {2018},
}

JamAltArtistA2ARetrieval

Given audio clip of a song (query), retrieve all songs from the same artist in the Jam-Alt-Lines dataset

Dataset: mteb/jam-alt-linesLicense: cc-by-nc-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to audio (a2a) ndcg_at_10 deu, eng, fra, spa Music derived found
Citation
@inproceedings{cifka-2024-jam-alt,
  author = {Ond{\v{r}}ej C{\'{\i}}fka and
Hendrik Schreiber and
Luke Miner and
Fabian{-}Robert St{\"{o}}ter},
  booktitle = {Proceedings of the 25th International Society for
Music Information Retrieval Conference},
  doi = {10.5281/ZENODO.14877443},
  pages = {737--744},
  publisher = {ISMIR},
  title = {Lyrics Transcription for Humans: {A} Readability-Aware Benchmark},
  url = {https://doi.org/10.5281/zenodo.14877443},
  year = {2024},
}

JamAltLyricA2TRetrieval

From audio clips of songs (query), retrieve corresponding textual lyric from the Jam-Alt-Lines dataset

Dataset: mteb/jam-alt-linesLicense: cc-by-nc-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) ndcg_at_10 deu, eng, fra, spa Music derived found
Citation
@inproceedings{cifka-2024-jam-alt,
  author = {Ond{\v{r}}ej C{\'{\i}}fka and
Hendrik Schreiber and
Luke Miner and
Fabian{-}Robert St{\"{o}}ter},
  booktitle = {Proceedings of the 25th International Society for
Music Information Retrieval Conference},
  doi = {10.5281/ZENODO.14877443},
  pages = {737--744},
  publisher = {ISMIR},
  title = {Lyrics Transcription for Humans: {A} Readability-Aware Benchmark},
  url = {https://doi.org/10.5281/zenodo.14877443},
  year = {2024},
}

JamAltLyricT2ARetrieval

From textual lyrics (query), retrieve corresponding audio clips of songs from the Jam-Alt-Lines dataset

Dataset: mteb/jam-alt-linesLicense: cc-by-nc-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) ndcg_at_10 deu, eng, fra, spa Music derived found
Citation
@inproceedings{cifka-2024-jam-alt,
  author = {Ond{\v{r}}ej C{\'{\i}}fka and
Hendrik Schreiber and
Luke Miner and
Fabian{-}Robert St{\"{o}}ter},
  booktitle = {Proceedings of the 25th International Society for
Music Information Retrieval Conference},
  doi = {10.5281/ZENODO.14877443},
  pages = {737--744},
  publisher = {ISMIR},
  title = {Lyrics Transcription for Humans: {A} Readability-Aware Benchmark},
  url = {https://doi.org/10.5281/zenodo.14877443},
  year = {2024},
}

LASSA2TRetrieval

Language-Queried Audio Source Separation (LASS) dataset for audio-to-text retrieval. Retrieve text descriptions/captions for audio clips using natural language queries.The original dataset is based on the AudioCaps dataset.The source audio has been synthesized by mixing two audio with their labelled snr ratio as indicated in the dataset.

Dataset: mteb/lass-synth-a2tLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng AudioScene derived found
Citation
@inproceedings{liu2022separate,
  author = {Liu, Xubo and Liu, Haohe and Kong, Qiuqiang and Mei, Xinhao and Zhao, Jinzheng and Huang, Qiushi and Plumbley, Mark D and Wang, Wenwu},
  booktitle = {INTERSPEEH},
  title = {Separate What You Describe: Language-Queried Audio Source Separation},
  year = {2022},
}

LASST2ARetrieval

Language-Queried Audio Source Separation (LASS) dataset for text-to-audio retrieval. Retrieve audio clips corresponding to natural language text descriptions/captions.The original dataset is based on the AudioCaps dataset.The source audio has been synthesized by mixing two audio with their labelled snr ratio as indicated in the dataset.

Dataset: mteb/lass-synth-t2aLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng AudioScene derived found
Citation
@inproceedings{liu2022separate,
  author = {Liu, Xubo and Liu, Haohe and Kong, Qiuqiang and Mei, Xinhao and Zhao, Jinzheng and Huang, Qiushi and Plumbley, Mark D and Wang, Wenwu},
  booktitle = {INTERSPEEH},
  title = {Separate What You Describe: Language-Queried Audio Source Separation},
  year = {2022},
}

LLaVAIT2TRetrieval

Retrieve responses to answer questions about images.

Dataset: izhx/UMRB-LLaVALicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) cv_recall_at_5 eng Encyclopaedic derived found
Citation
@inproceedings{lin-etal-2024-preflmr,
  address = {Bangkok, Thailand},
  author = {Lin, Weizhe  and
Mei, Jingbiao  and
Chen, Jinghong  and
Byrne, Bill},
  booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  doi = {10.18653/v1/2024.acl-long.289},
  editor = {Ku, Lun-Wei  and
Martins, Andre  and
Srikumar, Vivek},
  month = aug,
  pages = {5294--5316},
  publisher = {Association for Computational Linguistics},
  title = {{P}re{FLMR}: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers},
  url = {https://aclanthology.org/2024.acl-long.289},
  year = {2024},
}

LibriTTSA2TRetrieval

Given audiobook speech segments from the multi‑speaker LibriTTS corpus, retrieve the correct text transcription. LibriTTS is a 585‑hour, 24 kHz, multi‑speaker English TTS corpus derived from LibriVox (audio) and Project Gutenberg (text).

Dataset: mteb/LibriTTS_a2tLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng Spoken derived found
Citation
@misc{zen2019librittscorpusderivedlibrispeech,
  archiveprefix = {arXiv},
  author = {Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
  eprint = {1904.02882},
  primaryclass = {cs.SD},
  title = {LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
  url = {https://arxiv.org/abs/1904.02882},
  year = {2019},
}

LibriTTST2ARetrieval

Given an English text transcription, retrieve its corresponding audiobook speech segment from the multi‑speaker LibriTTS corpus. LibriTTS is a 585‑hour, 24 kHz, multi‑speaker English TTS corpus derived from LibriVox and Project Gutenberg.

Dataset: mteb/LibriTTS_t2aLicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Spoken derived found
Citation
@misc{zen2019librittscorpusderivedlibrispeech,
  archiveprefix = {arXiv},
  author = {Heiga Zen and Viet Dang and Rob Clark and Yu Zhang and Ron J. Weiss and Ye Jia and Zhifeng Chen and Yonghui Wu},
  eprint = {1904.02882},
  primaryclass = {cs.SD},
  title = {LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech},
  url = {https://arxiv.org/abs/1904.02882},
  year = {2019},
}

MACSA2TRetrieval

Audio captions and tags for urban acoustic scenes in TAU Urban Acoustic Scenes 2019 development dataset.

Dataset: mteb/MACS_a2tLicense: https://zenodo.org/records/5114771 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 eng AudioScene human-annotated found
Citation
@misc{martinmorato2021groundtruthreliabilitymultiannotator,
  archiveprefix = {arXiv},
  author = {Irene Martin-Morato and Annamaria Mesaros},
  eprint = {2104.04214},
  primaryclass = {eess.AS},
  title = {What is the ground truth? Reliability of multi-annotator data for audio tagging},
  url = {https://arxiv.org/abs/2104.04214},
  year = {2021},
}

MACST2ARetrieval

Audio captions and tags for urban acoustic scenes in TAU Urban Acoustic Scenes 2019 development dataset.

Dataset: mteb/MACS_t2aLicense: https://zenodo.org/records/5114771 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng AudioScene human-annotated found
Citation
@misc{martinmorato2021groundtruthreliabilitymultiannotator,
  archiveprefix = {arXiv},
  author = {Irene Martin-Morato and Annamaria Mesaros},
  eprint = {2104.04214},
  primaryclass = {eess.AS},
  title = {What is the ground truth? Reliability of multi-annotator data for audio tagging},
  url = {https://arxiv.org/abs/2104.04214},
  year = {2021},
}

METI2IRetrieval

Retrieve photos of more than 224k artworks.

Dataset: mteb/metLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@inproceedings{ypsilantis2021met,
  author = {Ypsilantis, Nikolaos-Antonios and Garcia, Noa and Han, Guangxing and Ibrahimi, Sarah and Van Noord, Nanne and Tolias, Giorgos},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  title = {The met dataset: Instance-level recognition for artworks},
  year = {2021},
}

MSCOCOI2TRetrieval

Retrieve captions based on images.

Dataset: mteb/mbeir_mscoco_task3License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{lin2014microsoft,
  author = {Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle = {Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13},
  organization = {Springer},
  pages = {740--755},
  title = {Microsoft coco: Common objects in context},
  year = {2014},
}

MSCOCOT2IRetrieval

Retrieve images based on captions.

Dataset: mteb/mbeir_mscoco_task0License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{lin2014microsoft,
  author = {Lin, Tsung-Yi and Maire, Michael and Belongie, Serge and Hays, James and Perona, Pietro and Ramanan, Deva and Doll{\'a}r, Piotr and Zitnick, C Lawrence},
  booktitle = {Computer Vision--ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13},
  organization = {Springer},
  pages = {740--755},
  title = {Microsoft coco: Common objects in context},
  year = {2014},
}

MemotionI2TRetrieval

Retrieve captions based on memes.

Dataset: mteb/MMSoc_MemotionLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{sharma2020semeval,
  author = {Sharma, Chhavi and Bhageria, Deepesh and Scott, William and Pykl, Srinivas and Das, Amitava and Chakraborty, Tanmoy and Pulabaigari, Viswanath and Gamb{\"a}ck, Bj{\"o}rn},
  booktitle = {Proceedings of the Fourteenth Workshop on Semantic Evaluation},
  pages = {759--773},
  title = {SemEval-2020 Task 8: Memotion Analysis-the Visuo-Lingual Metaphor!},
  year = {2020},
}

MemotionT2IRetrieval

Retrieve memes based on captions.

Dataset: mteb/MMSoc_MemotionLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived found
Citation
@inproceedings{sharma2020semeval,
  author = {Sharma, Chhavi and Bhageria, Deepesh and Scott, William and Pykl, Srinivas and Das, Amitava and Chakraborty, Tanmoy and Pulabaigari, Viswanath and Gamb{\"a}ck, Bj{\"o}rn},
  booktitle = {Proceedings of the Fourteenth Workshop on Semantic Evaluation},
  pages = {759--773},
  title = {SemEval-2020 Task 8: Memotion Analysis-the Visuo-Lingual Metaphor!},
  year = {2020},
}

MusicCapsA2TRetrieval

Natural language description for music audio.

Dataset: mteb/MusicCaps_a2tLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 zxx Music human-annotated found
Citation
@misc{agostinelli2023musiclmgeneratingmusictext,
  archiveprefix = {arXiv},
  author = {Andrea Agostinelli and Timo I. Denk and Zalán Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matt Sharifi and Neil Zeghidour and Christian Frank},
  eprint = {2301.11325},
  primaryclass = {cs.SD},
  title = {MusicLM: Generating Music From Text},
  url = {https://arxiv.org/abs/2301.11325},
  year = {2023},
}

MusicCapsT2ARetrieval

Natural language description for music audio.

Dataset: mteb/MusicCaps_t2aLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 zxx Music human-annotated found
Citation
@misc{agostinelli2023musiclmgeneratingmusictext,
  archiveprefix = {arXiv},
  author = {Andrea Agostinelli and Timo I. Denk and Zalán Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matt Sharifi and Neil Zeghidour and Christian Frank},
  eprint = {2301.11325},
  primaryclass = {cs.SD},
  title = {MusicLM: Generating Music From Text},
  url = {https://arxiv.org/abs/2301.11325},
  year = {2023},
}

NIGHTSI2IRetrieval

Retrieval identical image to the given image.

Dataset: mteb/mbeir_nights_task4License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) ndcg_at_10 eng Encyclopaedic derived created
Citation
@article{fu2024dreamsim,
  author = {Fu, Stephanie and Tamir, Netanel and Sundaram, Shobhita and Chai, Lucy and Zhang, Richard and Dekel, Tali and Isola, Phillip},
  journal = {Advances in Neural Information Processing Systems},
  title = {DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data},
  volume = {36},
  year = {2024},
}

OKVQAIT2TRetrieval

Retrieval a Wiki passage to answer query about an image.

Dataset: izhx/UMRB-OKVQALicense: cc-by-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) cv_recall_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{marino2019ok,
  author = {Marino, Kenneth and Rastegari, Mohammad and Farhadi, Ali and Mottaghi, Roozbeh},
  booktitle = {Proceedings of the IEEE/cvf conference on computer vision and pattern recognition},
  pages = {3195--3204},
  title = {Ok-vqa: A visual question answering benchmark requiring external knowledge},
  year = {2019},
}

OVENIT2ITRetrieval

Retrieval a Wiki image and passage to answer query about an image.

Dataset: mteb/mbeir_oven_task8License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to image, text (it2it) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{hu2023open,
  author = {Hu, Hexiang and Luan, Yi and Chen, Yang and Khandelwal, Urvashi and Joshi, Mandar and Lee, Kenton and Toutanova, Kristina and Chang, Ming-Wei},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {12065--12075},
  title = {Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities},
  year = {2023},
}

OVENIT2TRetrieval

Retrieval a Wiki passage to answer query about an image.

Dataset: mteb/mbeir_oven_task6License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{hu2023open,
  author = {Hu, Hexiang and Luan, Yi and Chen, Yang and Khandelwal, Urvashi and Joshi, Mandar and Lee, Kenton and Toutanova, Kristina and Chang, Ming-Wei},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages = {12065--12075},
  title = {Open-domain visual entity recognition: Towards recognizing millions of wikipedia entities},
  year = {2023},
}

ROxfordEasyI2IRetrieval

Retrieve photos of landmarks in Oxford, UK.

Dataset: mteb/r-oxford-easy-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

ROxfordHardI2IRetrieval

Retrieve photos of landmarks in Oxford, UK.

Dataset: mteb/r-oxford-hard-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

ROxfordMediumI2IRetrieval

Retrieve photos of landmarks in Oxford, UK.

Dataset: mteb/r-oxford-medium-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

RP2kI2IRetrieval

Retrieve photos of 39457 products.

Dataset: mteb/rp2kLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Web derived created
Citation
@article{peng2020rp2k,
  author = {Peng, Jingtian and Xiao, Chang and Li, Yifan},
  journal = {arXiv preprint arXiv:2006.12634},
  title = {RP2K: A large-scale retail product dataset for fine-grained image classification},
  year = {2020},
}

RParisEasyI2IRetrieval

Retrieve photos of landmarks in Paris, UK.

Dataset: mteb/r-paris-easy-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

RParisHardI2IRetrieval

Retrieve photos of landmarks in Paris, UK.

Dataset: mteb/r-paris-hard-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

RParisMediumI2IRetrieval

Retrieve photos of landmarks in Paris, UK.

Dataset: mteb/r-paris-medium-multiLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) map_at_5 eng Web derived created
Citation
@inproceedings{radenovic2018revisiting,
  author = {Radenovi{\'c}, Filip and Iscen, Ahmet and Tolias, Giorgos and Avrithis, Yannis and Chum, Ond{\v{r}}ej},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {5706--5715},
  title = {Revisiting oxford and paris: Large-scale image retrieval benchmarking},
  year = {2018},
}

ReMuQIT2TRetrieval

Retrieval of a Wiki passage to answer a query about an image.

Dataset: izhx/UMRB-ReMuQLicense: cc0-1.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) cv_recall_at_5 eng Encyclopaedic derived created
Citation
@inproceedings{luo-etal-2023-end,
  address = {Toronto, Canada},
  author = {Luo, Man  and
Fang, Zhiyuan  and
Gokhale, Tejas  and
Yang, Yezhou  and
Baral, Chitta},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  doi = {10.18653/v1/2023.acl-long.478},
  editor = {Rogers, Anna  and
Boyd-Graber, Jordan  and
Okazaki, Naoaki},
  month = jul,
  pages = {8573--8589},
  publisher = {Association for Computational Linguistics},
  title = {End-to-end Knowledge Retrieval with Multi-modal Queries},
  url = {https://aclanthology.org/2023.acl-long.478},
  year = {2023},
}

SOPI2IRetrieval

Retrieve product photos of 22634 online products.

Dataset: mteb/stanford-online-productsLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@inproceedings{oh2016deep,
  author = {Oh Song, Hyun and Xiang, Yu and Jegelka, Stefanie and Savarese, Silvio},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {4004--4012},
  title = {Deep metric learning via lifted structured feature embedding},
  year = {2016},
}

SciMMIRI2TRetrieval

Retrieve captions based on figures and tables.

Dataset: mteb/SciMMIRLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Academic derived found
Citation
@misc{wu2024scimmirbenchmarkingscientificmultimodal,
  archiveprefix = {arXiv},
  author = {Siwei Wu and Yizhi Li and Kang Zhu and Ge Zhang and Yiming Liang and Kaijing Ma and Chenghao Xiao and Haoran Zhang and Bohao Yang and Wenhu Chen and Wenhao Huang and Noura Al Moubayed and Jie Fu and Chenghua Lin},
  eprint = {2401.13478},
  primaryclass = {cs.IR},
  title = {SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval},
  url = {https://arxiv.org/abs/2401.13478},
  year = {2024},
}

SciMMIRT2IRetrieval

Retrieve figures and tables based on captions.

Dataset: mteb/SciMMIRLicense: mit • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Academic derived found
Citation
@misc{wu2024scimmirbenchmarkingscientificmultimodal,
  archiveprefix = {arXiv},
  author = {Siwei Wu and Yizhi Li and Kang Zhu and Ge Zhang and Yiming Liang and Kaijing Ma and Chenghao Xiao and Haoran Zhang and Bohao Yang and Wenhu Chen and Wenhao Huang and Noura Al Moubayed and Jie Fu and Chenghua Lin},
  eprint = {2401.13478},
  primaryclass = {cs.IR},
  title = {SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval},
  url = {https://arxiv.org/abs/2401.13478},
  year = {2024},
}

SketchyI2IRetrieval

Retrieve photos from sketches.

Dataset: mteb/sketchyLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@inproceedings{ypsilantis2021met,
  author = {Ypsilantis, Nikolaos-Antonios and Garcia, Noa and Han, Guangxing and Ibrahimi, Sarah and Van Noord, Nanne and Tolias, Giorgos},
  booktitle = {Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
  title = {The met dataset: Instance-level recognition for artworks},
  year = {2021},
}

SoundDescsA2TRetrieval

Natural language description for different audio sources from the BBC Sound Effects webpage.

Dataset: mteb/sounddescs_a2tLicense: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 zxx Encyclopaedic, Written derived found
Citation
@inproceedings{Koepke2022,
  author = {Koepke, A.S. and Oncescu, A.-M. and Henriques, J. and Akata, Z. and Albanie, S.},
  booktitle = {IEEE Transactions on Multimedia},
  title = {Audio Retrieval with Natural Language Queries: A Benchmark Study},
  year = {2022},
}

SoundDescsT2ARetrieval

Natural language description for different audio sources from the BBC Sound Effects webpage.

Dataset: mteb/sounddescs_t2aLicense: apache-2.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 zxx Encyclopaedic, Written derived found
Citation
@inproceedings{Koepke2022,
  author = {Koepke, A.S. and Oncescu, A.-M. and Henriques, J. and Akata, Z. and Albanie, S.},
  booktitle = {IEEE Transactions on Multimedia},
  title = {Audio Retrieval with Natural Language Queries: A Benchmark Study},
  year = {2022},
}

SpokenSQuADT2ARetrieval

Text-to-audio retrieval task based on SpokenSQuAD dataset. Given a text question, retrieve relevant audio segments that contain the answer. Questions are derived from SQuAD reading comprehension dataset with corresponding spoken passages.

Dataset: mteb/spoken-squad-t2aLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 eng Academic, Encyclopaedic, Non-fiction derived found
Citation
@inproceedings{li2018spokensquad,
  author = {Li, Chia-Hsuan and Ma, Szu-Lin and Zhang, Hsin-Wei and Lee, Hung-yi and Lee, Lin-shan},
  booktitle = {Interspeech},
  pages = {3459--3463},
  title = {Spoken SQuAD: A Study of Mitigating the Impact of Speech Recognition Errors on Listening Comprehension},
  year = {2018},
}

StanfordCarsI2IRetrieval

Retrieve car images from 196 makes.

Dataset: mteb/stanford_cars_retrievalLicense: not specified • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to image (i2i) cv_recall_at_1 eng Encyclopaedic derived created
Citation
@inproceedings{Krause2013CollectingAL,
  author = {Jonathan Krause and Jia Deng and Michael Stark and Li Fei-Fei},
  title = {Collecting a Large-scale Dataset of Fine-grained Cars},
  url = {https://api.semanticscholar.org/CorpusID:16632981},
  year = {2013},
}

TUBerlinT2IRetrieval

Retrieve sketch images based on text descriptions.

Dataset: mteb/tu-berlinLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived found
Citation
@article{eitz2012humans,
  author = {Eitz, Mathias and Hays, James and Alexa, Marc},
  journal = {ACM Transactions on graphics (TOG)},
  number = {4},
  pages = {1--10},
  publisher = {Acm New York, NY, USA},
  title = {How do humans sketch objects?},
  volume = {31},
  year = {2012},
}

UrbanSound8KA2TRetrieval

UrbanSound8K: Audio-to-text retrieval of urban sound events.

Dataset: mteb/Urbansound8K_a2tLicense: cc-by-nc-sa-3.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
audio to text (a2t) cv_recall_at_5 zxx AudioScene human-annotated found
Citation
@inproceedings{Salamon:UrbanSound:ACMMM:14,
  author = {Salamon, Justin and Jacoby, Christopher and Bello, Juan Pablo},
  booktitle = {Proceedings of the 22nd ACM international conference on Multimedia},
  organization = {ACM},
  pages = {1041--1044},
  title = {A Dataset and Taxonomy for Urban Sound Research},
  year = {2014},
}

UrbanSound8KT2ARetrieval

UrbanSound8K: Text-to-audio retrieval of urban sound events.

Dataset: mteb/Urbansound8K_t2aLicense: cc-by-nc-sa-3.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to audio (t2a) cv_recall_at_5 zxx AudioScene human-annotated found
Citation
@inproceedings{Salamon:UrbanSound:ACMMM:14,
  author = {Salamon, Justin and Jacoby, Christopher and Bello, Juan Pablo},
  booktitle = {Proceedings of the 22nd ACM international conference on Multimedia},
  organization = {ACM},
  pages = {1041--1044},
  title = {A Dataset and Taxonomy for Urban Sound Research},
  year = {2014},
}

VQA2IT2TRetrieval

Retrieve the correct answer for a question about an image.

Dataset: mteb/vqa-2License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) ndcg_at_10 eng Web derived found
Citation
@inproceedings{Goyal_2017_CVPR,
  author = {Goyal, Yash and Khot, Tejas and Summers-Stay, Douglas and Batra, Dhruv and Parikh, Devi},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {July},
  title = {Making the v in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering},
  year = {2017},
}

VisualNewsI2TRetrieval

Retrieval entity-rich captions for news images.

Dataset: mteb/mbeir_visualnews_task3License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image to text (i2t) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{liu2021visual,
  author = {Liu, Fuxiao and Wang, Yinghan and Wang, Tianlu and Ordonez, Vicente},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages = {6761--6771},
  title = {Visual News: Benchmark and Challenges in News Image Captioning},
  year = {2021},
}

VisualNewsT2IRetrieval

Retrieve news images with captions.

Dataset: mteb/mbeir_visualnews_task0License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image (t2i) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{liu2021visual,
  author = {Liu, Fuxiao and Wang, Yinghan and Wang, Tianlu and Ordonez, Vicente},
  booktitle = {Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
  pages = {6761--6771},
  title = {Visual News: Benchmark and Challenges in News Image Captioning},
  year = {2021},
}

VizWizIT2TRetrieval

Retrieve the correct answer for a question about an image.

Dataset: mteb/vizwizLicense: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
image, text to text (it2t) ndcg_at_10 eng Web derived found
Citation
@inproceedings{gurari2018vizwiz,
  author = {Gurari, Danna and Li, Qing and Stangl, Abigale J and Guo, Anhong and Lin, Chi and Grauman, Kristen and Luo, Jiebo and Bigham, Jeffrey P},
  booktitle = {Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages = {3608--3617},
  title = {Vizwiz grand challenge: Answering visual questions from blind people},
  year = {2018},
}

WebQAT2ITRetrieval

Retrieve sources of information based on questions.

Dataset: mteb/mbeir_webqa_task2License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to image, text (t2it) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{chang2022webqa,
  author = {Chang, Yingshan and Narang, Mridu and Suzuki, Hisami and Cao, Guihong and Gao, Jianfeng and Bisk, Yonatan},
  booktitle = {Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages = {16495--16504},
  title = {Webqa: Multihop and multimodal qa},
  year = {2022},
}

WebQAT2TRetrieval

Retrieve sources of information based on questions.

Dataset: mteb/mbeir_webqa_task1License: cc-by-sa-4.0 • Learn more →

Task category Score Languages Domains Annotations Creators Sample Creation
text to text (t2t) ndcg_at_10 eng Encyclopaedic derived created
Citation
@inproceedings{chang2022webqa,
  author = {Chang, Yingshan and Narang, Mridu and Suzuki, Hisami and Cao, Guihong and Gao, Jianfeng and Bisk, Yonatan},
  booktitle = {Proceedings of the IEEE/CVF conference on computer vision and pattern recognition},
  pages = {16495--16504},
  title = {Webqa: Multihop and multimodal qa},
  year = {2022},
}