Skip to content

Clustering

Includes tasks which evaluates if the embedding space is globally coherent with distances reflecting semantic groups.

AudioClustering

  • Number of tasks: 10

AmbientAcousticContextClustering

Clustering task based on a subset of the Ambient Acoustic Context dataset containing 1-second segments for workplace activities.

Dataset: mteb/ambient-acoustic-context-smallLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) eng Speech, Spoken human-annotated found v_measure
Citation
@inproceedings{10.1145/3379503.3403535,
  address = {New York, NY, USA},
  articleno = {33},
  author = {Park, Chunjong and Min, Chulhong and Bhattacharya, Sourav and Kawsar, Fahim},
  booktitle = {22nd International Conference on Human-Computer Interaction with Mobile Devices and Services},
  doi = {10.1145/3379503.3403535},
  isbn = {9781450375160},
  keywords = {Acoustic ambient context, Conversational agents},
  location = {Oldenburg, Germany},
  numpages = {9},
  publisher = {Association for Computing Machinery},
  series = {MobileHCI '20},
  title = {Augmenting Conversational Agents with Ambient Acoustic Contexts},
  url = {https://doi.org/10.1145/3379503.3403535},
  year = {2020},
}

CREMA_DClustering

Emotion clustering task with audio data for 6 emotions: Anger, Disgust, Fear, Happy, Neutral, Sad.

Dataset: mteb/crema-dLicense: http://opendatacommons.org/licenses/odbl/1.0/ • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) eng Speech human-annotated created v_measure
Citation
@article{Cao2014-ih,
  author = {Cao, Houwei and Cooper, David G and Keutmann, Michael K and Gur,
Ruben C and Nenkova, Ani and Verma, Ragini},
  copyright = {https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html},
  journal = {IEEE Transactions on Affective Computing},
  keywords = {Emotional corpora; facial expression; multi-modal recognition;
voice expression},
  language = {en},
  month = oct,
  number = {4},
  pages = {377--390},
  publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
  title = {{CREMA-D}: Crowd-sourced emotional multimodal actors dataset},
  volume = {5},
  year = {2014},
}

ESC50Clustering

The ESC-50 dataset contains 2,000 labeled environmental audio recordings evenly distributed across 50 classes (40 clips per class). These classes are organized into 5 broad categories: animal sounds, natural soundscapes & water sounds, human (non-speech) sounds, interior/domestic sounds, and exterior/urban noises. This task evaluates unsupervised clustering performance on environmental audio recordings.

Dataset: mteb/esc50License: cc-by-nc-sa-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) zxx Speech, Spoken human-annotated found v_measure
Citation
@inproceedings{piczak2015dataset,
  author = {Piczak, Karol J.},
  booktitle = {Proceedings of the 23rd {Annual ACM Conference} on {Multimedia}},
  date = {2015-10-13},
  doi = {10.1145/2733373.2806390},
  isbn = {978-1-4503-3459-4},
  location = {{Brisbane, Australia}},
  pages = {1015--1018},
  publisher = {{ACM Press}},
  title = {{ESC}: {Dataset} for {Environmental Sound Classification}},
  url = {http://dl.acm.org/citation.cfm?doid=2733373.2806390},
}

GTZANGenreClustering

Music genre clustering task based on GTZAN dataset with 10 music genres.

Dataset: mteb/gtzan-genreLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) zxx Music human-annotated found v_measure
Citation
@article{1021072,
  author = {Tzanetakis, G. and Cook, P.},
  doi = {10.1109/TSA.2002.800560},
  journal = {IEEE Transactions on Speech and Audio Processing},
  keywords = {Humans;Music information retrieval;Instruments;Computer science;Multiple signal classification;Signal analysis;Pattern recognition;Feature extraction;Wavelet analysis;Cultural differences},
  number = {5},
  pages = {293-302},
  title = {Musical genre classification of audio signals},
  volume = {10},
  year = {2002},
}

MusicGenreClustering

Clustering music recordings in 9 different genres.

Dataset: mteb/music-genreLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) zxx Music human-annotated found v_measure
Citation
@inproceedings{homburg2005benchmark,
  author = {Homburg, Helge and Mierswa, Ingo and M{\"o}ller, B{\"u}lent and Morik, Katharina and Wurst, Michael},
  booktitle = {ISMIR},
  pages = {528--31},
  title = {A Benchmark Dataset for Audio Classification and Clustering.},
  volume = {2005},
  year = {2005},
}

VehicleSoundClustering

Clustering vehicle sounds recorded from smartphones (0 (car class), 1 (truck, bus and van class), 2 (motorcycle class))

Dataset: mteb/Vehicle_sounds_classification_datasetLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) zxx Scene derived created v_measure
Citation
@inproceedings{inproceedings,
  author = {Bazilinskyy, Pavlo and Aa, Arne and Schoustra, Michael and Spruit, John and Staats, Laurens and van der Vlist, Klaas Jan and de Winter, Joost},
  month = {05},
  pages = {},
  title = {An auditory dataset of passing vehicles recorded with a smartphone},
  year = {2018},
}

VoiceGenderClustering

Clustering audio recordings based on gender (male vs female).

Dataset: mteb/VoiceGenderClusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) eng Spoken derived found v_measure
Citation
@inproceedings{Chung18b,
  author = {Joon Son Chung and Arsha Nagrani and Andrew Zisserman},
  booktitle = {Proceedings of Interspeech},
  title = {VoxCeleb2: Deep Speaker Recognition},
  year = {2018},
}

VoxCelebClustering

Clustering task based on the VoxCeleb dataset for sentiment analysis, clustering by positive/negative sentiment.

Dataset: mteb/Sentiment_Analysis_SLUE-VoxCelebLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) eng Speech, Spoken human-annotated found v_measure
Citation
@misc{shon2022sluenewbenchmarktasks,
  archiveprefix = {arXiv},
  author = {Suwon Shon and Ankita Pasad and Felix Wu and Pablo Brusco and Yoav Artzi and Karen Livescu and Kyu J. Han},
  eprint = {2111.10367},
  primaryclass = {cs.CL},
  title = {SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech},
  url = {https://arxiv.org/abs/2111.10367},
  year = {2022},
}

VoxPopuliAccentClustering

Clustering English speech samples by non-native accent from European Parliament recordings.

Dataset: mteb/voxpopuli-accent-clusteringLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) eng Speech, Spoken human-annotated found v_measure
Citation
@inproceedings{wang-etal-2021-voxpopuli,
  address = {Online},
  author = {Wang, Changhan  and
Riviere, Morgane  and
Lee, Ann  and
Wu, Anne  and
Talnikar, Chaitanya  and
Haziza, Daniel  and
Williamson, Mary  and
Pino, Juan  and
Dupoux, Emmanuel},
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
  doi = {10.18653/v1/2021.acl-long.80},
  month = aug,
  pages = {993--1003},
  publisher = {Association for Computational Linguistics},
  title = {{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  url = {https://aclanthology.org/2021.acl-long.80},
  year = {2021},
}

VoxPopuliGenderClustering

Subsampled Dataset for clustering speech samples by speaker gender (male/female) from European Parliament recordings.

Dataset: mteb/mini-voxpopuliLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
audio to audio (a2a) deu, eng, fra, pol, spa Speech, Spoken human-annotated found v_measure
Citation
@inproceedings{wang-etal-2021-voxpopuli,
  address = {Online},
  author = {Wang, Changhan  and
Riviere, Morgane  and
Lee, Ann  and
Wu, Anne  and
Talnikar, Chaitanya  and
Haziza, Daniel  and
Williamson, Mary  and
Pino, Juan  and
Dupoux, Emmanuel},
  booktitle = {Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
  doi = {10.18653/v1/2021.acl-long.80},
  month = aug,
  pages = {993--1003},
  publisher = {Association for Computational Linguistics},
  title = {{V}ox{P}opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation},
  url = {https://aclanthology.org/2021.acl-long.80},
  year = {2021},
}

Clustering

  • Number of tasks: 109

AlloProfClusteringP2P

Clustering of document titles and descriptions from Allo Prof dataset. Clustering of 10 sets on the document topic.

Dataset: mteb/AlloProfClusteringP2PLicense: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Encyclopaedic, Written human-annotated found v_measure
Citation
@misc{lef23,
  author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
  doi = {10.48550/ARXIV.2302.07738},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  publisher = {arXiv},
  title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
  url = {https://arxiv.org/abs/2302.07738},
  year = {2023},
}

AlloProfClusteringP2P.v2

Clustering of document titles and descriptions from Allo Prof dataset. Clustering of 10 sets on the document topic.

Dataset: mteb/AlloProfClusteringP2P.v2License: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Encyclopaedic, Written human-annotated found v_measure
Citation
@misc{lef23,
  author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
  doi = {10.48550/ARXIV.2302.07738},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  publisher = {arXiv},
  title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
  url = {https://arxiv.org/abs/2302.07738},
  year = {2023},
}

AlloProfClusteringS2S

Clustering of document titles from Allo Prof dataset. Clustering of 10 sets on the document topic.

Dataset: mteb/AlloProfClusteringS2SLicense: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Encyclopaedic, Written human-annotated found v_measure
Citation
@misc{lef23,
  author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
  doi = {10.48550/ARXIV.2302.07738},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  publisher = {arXiv},
  title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
  url = {https://arxiv.org/abs/2302.07738},
  year = {2023},
}

AlloProfClusteringS2S.v2

Clustering of document titles from Allo Prof dataset. Clustering of 10 sets on the document topic.

Dataset: mteb/AlloProfClusteringS2S.v2License: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Encyclopaedic, Written human-annotated found v_measure
Citation
@misc{lef23,
  author = {Lefebvre-Brossard, Antoine and Gazaille, Stephane and Desmarais, Michel C.},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International},
  doi = {10.48550/ARXIV.2302.07738},
  keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  publisher = {arXiv},
  title = {Alloprof: a new French question-answer education dataset and its use in an information retrieval case study},
  url = {https://arxiv.org/abs/2302.07738},
  year = {2023},
}

ArXivHierarchicalClusteringP2P

Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/arxiv-clustering-p2pLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived found v_measure

ArXivHierarchicalClusteringS2S

Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/arxiv-clustering-s2sLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived found v_measure

ArxivClusteringP2P

Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/arxiv-clustering-p2pLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived found v_measure
Citation
@misc{arxiv_org_submitters_2024,
  author = {arXiv.org submitters},
  doi = {10.34740/KAGGLE/DSV/7548853},
  publisher = {Kaggle},
  title = {arXiv Dataset},
  url = {https://www.kaggle.com/dsv/7548853},
  year = {2024},
}

ArxivClusteringP2P.v2

Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/arxiv-clustering-p2pLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived found v_measure
Citation
@misc{arxiv_org_submitters_2024,
  author = {arXiv.org submitters},
  doi = {10.34740/KAGGLE/DSV/7548853},
  publisher = {Kaggle},
  title = {arXiv Dataset},
  url = {https://www.kaggle.com/dsv/7548853},
  year = {2024},
}

ArxivClusteringS2S

Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/arxiv-clustering-s2sLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived found v_measure
Citation
@misc{arxiv_org_submitters_2024,
  author = {arXiv.org submitters},
  doi = {10.34740/KAGGLE/DSV/7548853},
  publisher = {Kaggle},
  title = {arXiv Dataset},
  url = {https://www.kaggle.com/dsv/7548853},
  year = {2024},
}

BeytooteClustering

Beytoote Website Articles Clustering

Dataset: MCINext/beytoote-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fas News derived found v_measure
Citation

BigPatentClustering

Clustering of documents from the Big Patent dataset. Test set only includes documents belonging to a single category, with a total of 9 categories.

Dataset: jinaai/big-patent-clusteringLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Legal, Written derived found v_measure
Citation
@article{DBLP:journals/corr/abs-1906-03741,
  author = {Eva Sharma and
Chen Li and
Lu Wang},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  eprint = {1906.03741},
  eprinttype = {arXiv},
  journal = {CoRR},
  timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
  title = {{BIGPATENT}: A Large-Scale Dataset for Abstractive and Coherent Summarization},
  url = {http://arxiv.org/abs/1906.03741},
  volume = {abs/1906.03741},
  year = {2019},
}

BigPatentClustering.v2

Clustering of documents from the Big Patent dataset. Test set only includes documents belonging to a single category, with a total of 9 categories.

Dataset: mteb/big-patentLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Legal, Written derived found v_measure
Citation
@article{DBLP:journals/corr/abs-1906-03741,
  author = {Eva Sharma and
Chen Li and
Lu Wang},
  bibsource = {dblp computer science bibliography, https://dblp.org},
  biburl = {https://dblp.org/rec/journals/corr/abs-1906-03741.bib},
  eprint = {1906.03741},
  eprinttype = {arXiv},
  journal = {CoRR},
  timestamp = {Wed, 26 Jun 2019 07:14:58 +0200},
  title = {{BIGPATENT}: A Large-Scale Dataset for Abstractive and Coherent Summarization},
  url = {http://arxiv.org/abs/1906.03741},
  volume = {abs/1906.03741},
  year = {2019},
}

BiorxivClusteringP2P

Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category.

Dataset: mteb/biorxiv-clustering-p2pLicense: https://www.biorxiv.org/content/about-biorxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived created v_measure

BiorxivClusteringP2P.v2

Clustering of titles+abstract from biorxiv across 26 categories.

Dataset: mteb/biorxiv-clustering-p2pLicense: https://www.biorxiv.org/content/about-biorxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived created v_measure

BiorxivClusteringS2S

Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category.

Dataset: mteb/biorxiv-clustering-s2sLicense: https://www.biorxiv.org/content/about-biorxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived created v_measure

BiorxivClusteringS2S.v2

Clustering of titles from biorxiv across 26 categories.

Dataset: mteb/biorxiv-clustering-s2sLicense: https://www.biorxiv.org/content/about-biorxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived created v_measure

BlurbsClusteringP2P

Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre.

Dataset: slvnwhrl/blurbs-clustering-p2pLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu Fiction, Written derived found v_measure
Citation
@inproceedings{Remus2019GermEval2T,
  author = {Steffen Remus and Rami Aly and Chris Biemann},
  booktitle = {Conference on Natural Language Processing},
  title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
  url = {https://api.semanticscholar.org/CorpusID:208334484},
  year = {2019},
}

BlurbsClusteringP2P.v2

Clustering of book titles+blurbs. Clustering of 28 sets, either on the main or secondary genre.

Dataset: slvnwhrl/blurbs-clustering-p2pLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu Fiction, Written derived found v_measure
Citation
@inproceedings{Remus2019GermEval2T,
  author = {Steffen Remus and Rami Aly and Chris Biemann},
  booktitle = {Conference on Natural Language Processing},
  title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
  url = {https://api.semanticscholar.org/CorpusID:208334484},
  year = {2019},
}

BlurbsClusteringS2S

Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre.

Dataset: slvnwhrl/blurbs-clustering-s2sLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu Fiction, Written derived found v_measure
Citation
@inproceedings{Remus2019GermEval2T,
  author = {Steffen Remus and Rami Aly and Chris Biemann},
  booktitle = {Conference on Natural Language Processing},
  title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
  url = {https://api.semanticscholar.org/CorpusID:208334484},
  year = {2019},
}

BlurbsClusteringS2S.v2

Clustering of book titles. Clustering of 28 sets, either on the main or secondary genre.

Dataset: slvnwhrl/blurbs-clustering-s2sLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu Fiction, Written derived found v_measure
Citation
@inproceedings{Remus2019GermEval2T,
  author = {Steffen Remus and Rami Aly and Chris Biemann},
  booktitle = {Conference on Natural Language Processing},
  title = {GermEval 2019 Task 1: Hierarchical Classification of Blurbs},
  url = {https://api.semanticscholar.org/CorpusID:208334484},
  year = {2019},
}

BuiltBenchClusteringP2P

Clustering of built asset item descriptions based on categories identified within industry classification systems such as IFC, Uniclass, etc.

Dataset: mehrzad-shahin/BuiltBench-clustering-p2pLicense: cc-by-nd-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Engineering, Written derived created v_measure
Citation
@article{shahinmoghadam2024benchmarking,
  author = {Shahinmoghadam, Mehrzad and Motamedi, Ali},
  journal = {arXiv preprint arXiv:2411.12056},
  title = {Benchmarking pre-trained text embedding models in aligning built asset information},
  year = {2024},
}

BuiltBenchClusteringS2S

Clustering of built asset names/titles based on categories identified within industry classification systems such as IFC, Uniclass, etc.

Dataset: mehrzad-shahin/BuiltBench-clustering-s2sLicense: cc-by-nd-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Engineering, Written derived created v_measure
Citation
@article{shahinmoghadam2024benchmarking,
  author = {Shahinmoghadam, Mehrzad and Motamedi, Ali},
  journal = {arXiv preprint arXiv:2411.12056},
  title = {Benchmarking pre-trained text embedding models in aligning built asset information},
  year = {2024},
}

CLSClusteringP2P

Clustering of titles + abstract from CLS dataset. Clustering of 13 sets on the main category.

Dataset: C-MTEB/CLSClusteringP2PLicense: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn Academic, Written derived found v_measure
Citation
@article{li2022csl,
  author = {Li, Yudong and Zhang, Yuqing and Zhao, Zhe and Shen, Linlin and Liu, Weijie and Mao, Weiquan and Zhang, Hui},
  journal = {arXiv preprint arXiv:2209.05034},
  title = {CSL: A large-scale Chinese scientific literature dataset},
  year = {2022},
}

CLSClusteringP2P.v2

Clustering of titles + abstract from CLS dataset. Clustering of 13 sets on the main category.

Dataset: mteb/CLSClusteringP2P.v2License: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn Academic, Written derived found v_measure
Citation
@misc{li2022csl,
  archiveprefix = {arXiv},
  author = {Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
  eprint = {2209.05034},
  primaryclass = {cs.CL},
  title = {CSL: A Large-scale Chinese Scientific Literature Dataset},
  year = {2022},
}

CLSClusteringS2S

Clustering of titles from CLS dataset. Clustering of 13 sets on the main category.

Dataset: C-MTEB/CLSClusteringS2SLicense: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn Academic, Written derived found v_measure
Citation
@article{li2022csl,
  author = {Li, Yudong and Zhang, Yuqing and Zhao, Zhe and Shen, Linlin and Liu, Weijie and Mao, Weiquan and Zhang, Hui},
  journal = {arXiv preprint arXiv:2209.05034},
  title = {CSL: A large-scale Chinese scientific literature dataset},
  year = {2022},
}

CLSClusteringS2S.v2

Clustering of titles from CLS dataset. Clustering of 13 sets on the main category.

Dataset: C-MTEB/CLSClusteringS2SLicense: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn Academic, Written derived found v_measure
Citation
@misc{li2022csl,
  archiveprefix = {arXiv},
  author = {Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
  eprint = {2209.05034},
  primaryclass = {cs.CL},
  title = {CSL: A Large-scale Chinese Scientific Literature Dataset},
  year = {2022},
}

ClusTREC-Covid

A Topical Clustering Benchmark for COVID-19 Scientific Research across 50 covid-19 related topics.

Dataset: Uri-ka/ClusTREC-CovidLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Medical, Written expert-annotated created v_measure
Citation
@inproceedings{katz-etal-2024-knowledge,
  address = {Miami, Florida, USA},
  author = {Katz, Uri  and
Levy, Mosh  and
Goldberg, Yoav},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  month = nov,
  pages = {8838--8855},
  publisher = {Association for Computational Linguistics},
  title = {Knowledge Navigator: {LLM}-guided Browsing Framework for Exploratory Search in Scientific Literature},
  url = {https://aclanthology.org/2024.findings-emnlp.516},
  year = {2024},
}

DigikalamagClustering

A total of 8,515 articles scraped from Digikala Online Magazine. This dataset includes seven different classes.

Dataset: mteb/DigikalamagClusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fas Web derived found v_measure
Citation

DutchNewsArticlesClusteringP2P

This dataset contains all the articles published by the NOS as of the 1st of January 2010. The data is obtained by scraping the NOS website. The NOS is one of the biggest (online) news organizations in the Netherlands.

Dataset: clips/mteb-nl-news-articles-clsLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld News, Written derived found v_measure

DutchNewsArticlesClusteringS2S

This dataset contains all the articles published by the NOS as of the 1st of January 2010. The data is obtained by scraping the NOS website. The NOS is one of the biggest (online) news organizations in the Netherlands.

Dataset: clips/mteb-nl-news-articles-clsLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld News, Written derived found v_measure

EightTagsClustering

Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology.

Dataset: PL-MTEB/8tags-clusteringLicense: gpl-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Social, Written derived found v_measure
Citation
@inproceedings{dadas-etal-2020-evaluation,
  address = {Marseille, France},
  author = {Dadas, Slawomir  and
Pere{\\l}kiewicz, Micha{\\l}  and
Po{\\'s}wiata, Rafa{\\l}},
  booktitle = {Proceedings of the Twelfth Language Resources and Evaluation Conference},
  editor = {Calzolari, Nicoletta  and
B{\'e}chet, Fr{\'e}d{\'e}ric  and
Blache, Philippe  and
Choukri, Khalid  and
Cieri, Christopher  and
Declerck, Thierry  and
Goggi, Sara  and
Isahara, Hitoshi  and
Maegaard, Bente  and
Mariani, Joseph  and
Mazo, H{\\'e}l{\\`e}ne  and
Moreno, Asuncion  and
Odijk, Jan  and
Piperidis, Stelios},
  isbn = {979-10-95546-34-4},
  language = {English},
  month = may,
  pages = {1674--1680},
  publisher = {European Language Resources Association},
  title = {Evaluation of Sentence Representations in {P}olish},
  url = {https://aclanthology.org/2020.lrec-1.207},
  year = {2020},
}

EightTagsClustering.v2

Clustering of headlines from social media posts in Polish belonging to 8 categories: film, history, food, medicine, motorization, work, sport and technology.

Dataset: PL-MTEB/8tags-clusteringLicense: gpl-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Social, Written derived found v_measure
Citation
@inproceedings{dadas-etal-2020-evaluation,
  address = {Marseille, France},
  author = {Dadas, Slawomir  and
Pere{\\l}kiewicz, Micha{\\l}  and
Po{\\'s}wiata, Rafa{\\l}},
  booktitle = {Proceedings of the Twelfth Language Resources and Evaluation Conference},
  editor = {Calzolari, Nicoletta  and
B{\\'e}chet, Fr{\\'e}d{\\'e}ric  and
Blache, Philippe  and
Choukri, Khalid  and
Cieri, Christopher  and
Declerck, Thierry  and
Goggi, Sara  and
Isahara, Hitoshi  and
Maegaard, Bente  and
Mariani, Joseph  and
Mazo, H{\\'e}l{\\`e}ne  and
Moreno, Asuncion  and
Odijk, Jan  and
Piperidis, Stelios},
  isbn = {979-10-95546-34-4},
  language = {English},
  month = may,
  pages = {1674--1680},
  publisher = {European Language Resources Association},
  title = {Evaluation of Sentence Representations in {P}olish},
  url = {https://aclanthology.org/2020.lrec-1.207},
  year = {2020},
}

GeoreviewClusteringP2P

Review clustering based on Yandex Georeview dataset

Dataset: ai-forever/georeview-clustering-p2pLicense: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) rus Reviews, Written derived found v_measure

HALClusteringS2S

Clustering of titles from HAL (https://huggingface.co/datasets/lyon-nlp/clustering-hal-s2s)

Dataset: lyon-nlp/clustering-hal-s2sLicense: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Academic, Written human-annotated found v_measure
Citation
@misc{ciancone2024extending,
  archiveprefix = {arXiv},
  author = {Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
  eprint = {2405.20468},
  primaryclass = {cs.CL},
  title = {Extending the Massive Text Embedding Benchmark to French},
  year = {2024},
}

HALClusteringS2S.v2

Clustering of titles from HAL (https://huggingface.co/datasets/lyon-nlp/clustering-hal-s2s)

Dataset: mteb/HALClusteringS2S.v2License: apache-2.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fra Academic, Written human-annotated found v_measure
Citation
@misc{ciancone2024extending,
  archiveprefix = {arXiv},
  author = {Mathieu Ciancone and Imene Kerboua and Marion Schaeffer and Wissam Siblini},
  eprint = {2405.20468},
  primaryclass = {cs.CL},
  title = {Extending the Massive Text Embedding Benchmark to French},
  year = {2024},
}

HUMEArxivClusteringP2P

Human evaluation subset of Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category

Dataset: mteb/mteb-human-arxiv-clusteringLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to text (t2t) eng Academic, Written derived found v_measure
Citation
@misc{arxiv_org_submitters_2024,
  author = {arXiv.org submitters},
  doi = {10.34740/KAGGLE/DSV/7548853},
  publisher = {Kaggle},
  title = {arXiv Dataset},
  url = {https://www.kaggle.com/dsv/7548853},
  year = {2024},
}

HUMERedditClusteringP2P

Human evaluation subset of Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.

Dataset: mteb/mteb-human-reddit-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to text (t2t) eng Social, Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

HUMESIB200ClusteringS2S

Human evaluation subset of Clustering of news article headlines from SIB-200. Clustering of 10 sets, each with 8 categories and 10 texts per category.

Dataset: mteb/mteb-human-sib200-clusteringLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to text (t2t) ara, dan, eng, fra, rus News, Written derived found v_measure
Citation
@inproceedings{adelani-etal-2023-sib,
  address = {Toronto, Canada},
  author = {Adelani, David Ifeoluwa  and
Hedderich, Michael A.  and
Zhu, Dawei  and
van den Berg, Esther  and
Klakow, Dietrich},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  doi = {10.18653/v1/2023.acl-long.660},
  month = jul,
  pages = {11784--11801},
  publisher = {Association for Computational Linguistics},
  title = {{SIB}-200: A Large-Scale News Classification Dataset for Over 200 Languages},
  url = {https://aclanthology.org/2023.acl-long.660},
  year = {2023},
}

HUMEWikiCitiesClustering

Human evaluation subset of Clustering of Wikipedia articles of cities by country from https://huggingface.co/datasets/wikipedia. Test set includes 126 countries, and a total of 3531 cities.

Dataset: mteb/mteb-human-wikicities-clusteringLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to text (t2t) eng Encyclopaedic, Written derived found v_measure
Citation
@online{wikidump2024,
  author = {Wikimedia Foundation},
  title = {Wikimedia Downloads},
  url = {https://dumps.wikimedia.org},
}

HamshahriClustring

These datasets have been extracted from the RSS feed of two Farsi news agency websites.

Dataset: community-datasets/farsi_newsLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fas News derived found v_measure
Citation

IconclassClusteringS2S

Iconclass is an iconographic thesaurus, which is widely used in the digital heritage domain to describe subjects depicted in artworks. The task is to classify the first layer of Iconclass

Dataset: clips/mteb-nl-iconclass-clsLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld Fiction, Written derived found v_measure
Citation
@article{banar2023transfer,
  author = {Banar, Nikolay and Daelemans, Walter and Kestemont, Mike},
  journal = {ACM Journal on Computing and Cultural Heritage},
  number = {2},
  pages = {1--16},
  publisher = {ACM New York, NY},
  title = {Transfer learning for the visual arts: The multi-modal retrieval of iconclass codes},
  volume = {16},
  year = {2023},
}

IndicReviewsClusteringP2P

Clustering of reviews from IndicSentiment dataset. Clustering of 14 sets on the generic categories label.

Dataset: mteb/IndicReviewsClusteringP2PLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) asm, ben, brx, guj, hin, ... (13) Reviews, Written human-annotated machine-translated and verified v_measure
Citation
@article{doddapaneni2022towards,
  author = {Sumanth Doddapaneni and Rahul Aralikatte and Gowtham Ramesh and Shreyansh Goyal and Mitesh M. Khapra and Anoop Kunchukuttan and Pratyush Kumar},
  doi = {10.18653/v1/2023.acl-long.693},
  journal = {Annual Meeting of the Association for Computational Linguistics},
  title = {Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages},
  year = {2022},
}

KlueMrcDomainClustering

this dataset is a processed and redistributed version of the KLUE-MRC dataset. Domain: Game / Media / Automotive / Finance / Real Estate / Education

Dataset: mteb/KlueMrcDomainClusteringLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) kor News, Written human-annotated found v_measure
Citation
@misc{park2021klue,
  archiveprefix = {arXiv},
  author = {Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
  eprint = {2105.09680},
  primaryclass = {cs.CL},
  title = {KLUE: Korean Language Understanding Evaluation},
  year = {2021},
}

KlueYnatMrcCategoryClustering

this dataset is a processed and redistributed version of the KLUE-Ynat & KLUE-MRC dataset. News_category: IT/Science, Sports, Media/Culture, Ecomomy/Finance, Real Estate

Dataset: mteb/KlueYnatMrcCategoryClusteringLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to text (t2t) kor News, Written human-annotated found v_measure
Citation
@misc{park2021klue,
  archiveprefix = {arXiv},
  author = {Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
  eprint = {2105.09680},
  primaryclass = {cs.CL},
  title = {KLUE: Korean Language Understanding Evaluation},
  year = {2021},
}

LivedoorNewsClustering

Clustering of the news reports of a Japanese news site, Livedoor News by RONDHUIT Co, Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics).

Dataset: mteb/LivedoorNewsClusteringLicense: cc-by-nd-2.1-jp • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) jpn News, Written derived found v_measure

LivedoorNewsClustering.v2

Clustering of the news reports of a Japanese news site, Livedoor News by RONDHUIT Co, Ltd. in 2012. It contains over 7,000 news report texts across 9 categories (topics). Version 2 updated on LivedoorNewsClustering by removing pairs where one of entries contain an empty sentences.

Dataset: mteb/LivedoorNewsClustering.v2License: cc-by-nd-2.1-jp • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) jpn News, Written derived found v_measure

MLSUMClusteringP2P

Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.

Dataset: mteb/mlsumLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu, fra, rus, spa News, Written derived found v_measure
Citation
@article{scialom2020mlsum,
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal = {arXiv preprint arXiv:2004.14900},
  title = {MLSUM: The Multilingual Summarization Corpus},
  year = {2020},
}

MLSUMClusteringP2P.v2

Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.

Dataset: mteb/mlsumLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu, fra, rus, spa News, Written derived found v_measure
Citation
@article{scialom2020mlsum,
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal = {arXiv preprint arXiv:2004.14900},
  title = {MLSUM: The Multilingual Summarization Corpus},
  year = {2020},
}

MLSUMClusteringS2S

Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.

Dataset: mteb/mlsumLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu, fra, rus, spa News, Written derived found v_measure
Citation
@article{scialom2020mlsum,
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal = {arXiv preprint arXiv:2004.14900},
  title = {MLSUM: The Multilingual Summarization Corpus},
  year = {2020},
}

MLSUMClusteringS2S.v2

Clustering of newspaper article contents and titles from MLSUM dataset. Clustering of 10 sets on the newpaper article topics.

Dataset: mteb/mlsumLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu, fra, rus, spa News, Written derived found v_measure
Citation
@article{scialom2020mlsum,
  author = {Scialom, Thomas and Dray, Paul-Alexis and Lamprier, Sylvain and Piwowarski, Benjamin and Staiano, Jacopo},
  journal = {arXiv preprint arXiv:2004.14900},
  title = {MLSUM: The Multilingual Summarization Corpus},
  year = {2020},
}

MasakhaNEWSClusteringP2P

Clustering of news article headlines and texts from MasakhaNEWS dataset. Clustering of 10 sets on the news article label.

Dataset: mteb/MasakhaNEWSClusteringP2PLicense: afl-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) amh, eng, fra, hau, ibo, ... (16) News, Non-fiction, Written derived found v_measure
Citation
@article{adelani2023masakhanews,
  author = {David Ifeoluwa Adelani and  Marek Masiak and  Israel Abebe Azime and  Jesujoba Oluwadara Alabi and  Atnafu Lambebo Tonja and  Christine Mwase and  Odunayo Ogundepo and  Bonaventure F. P. Dossou and  Akintunde Oladipo and  Doreen Nixdorf and  Chris Chinenye Emezue and  Sana Sabah al-azzawi and  Blessing K. Sibanda and  Davis David and  Lolwethu Ndolela and  Jonathan Mukiibi and  Tunde Oluwaseyi Ajayi and  Tatiana Moteu Ngoli and  Brian Odhiambo and  Abraham Toluwase Owodunni and  Nnaemeka C. Obiefuna and  Shamsuddeen Hassan Muhammad and  Saheed Salahudeen Abdullahi and  Mesay Gemeda Yigezu and  Tajuddeen Gwadabe and  Idris Abdulmumin and  Mahlet Taye Bame and  Oluwabusayo Olufunke Awoyomi and  Iyanuoluwa Shode and  Tolulope Anu Adelani and  Habiba Abdulganiy Kailani and  Abdul-Hakeem Omotayo and  Adetola Adeeko and  Afolabi Abeeb and  Anuoluwapo Aremu and  Olanrewaju Samuel and  Clemencia Siro and  Wangari Kimotho and  Onyekachi Raphael Ogbu and  Chinedu E. Mbonu and  Chiamaka I. Chukwuneke and  Samuel Fanijo and  Jessica Ojo and  Oyinkansola F. Awosan and  Tadesse Kebede Guge and  Sakayo Toadoum Sari and  Pamela Nyatsine and  Freedmore Sidume and  Oreen Yousuf and  Mardiyyah Oduwole and  Ussen Kimanuka and  Kanda Patrick Tshinu and  Thina Diko and  Siyanda Nxakama and   Abdulmejid Tuni Johar and  Sinodos Gebre and  Muhidin Mohamed and  Shafie Abdi Mohamed and  Fuad Mire Hassan and  Moges Ahmed Mehamed and  Evrard Ngabire and  and Pontus Stenetorp},
  journal = {ArXiv},
  title = {MasakhaNEWS: News Topic Classification for African languages},
  volume = {},
  year = {2023},
}

MasakhaNEWSClusteringS2S

Clustering of news article headlines from MasakhaNEWS dataset. Clustering of 10 sets on the news article label.

Dataset: mteb/MasakhaNEWSClusteringS2SLicense: afl-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) amh, eng, fra, hau, ibo, ... (16) News, Written human-annotated found v_measure
Citation
@article{adelani2023masakhanews,
  author = {David Ifeoluwa Adelani and  Marek Masiak and  Israel Abebe Azime and  Jesujoba Oluwadara Alabi and  Atnafu Lambebo Tonja and  Christine Mwase and  Odunayo Ogundepo and  Bonaventure F. P. Dossou and  Akintunde Oladipo and  Doreen Nixdorf and  Chris Chinenye Emezue and  Sana Sabah al-azzawi and  Blessing K. Sibanda and  Davis David and  Lolwethu Ndolela and  Jonathan Mukiibi and  Tunde Oluwaseyi Ajayi and  Tatiana Moteu Ngoli and  Brian Odhiambo and  Abraham Toluwase Owodunni and  Nnaemeka C. Obiefuna and  Shamsuddeen Hassan Muhammad and  Saheed Salahudeen Abdullahi and  Mesay Gemeda Yigezu and  Tajuddeen Gwadabe and  Idris Abdulmumin and  Mahlet Taye Bame and  Oluwabusayo Olufunke Awoyomi and  Iyanuoluwa Shode and  Tolulope Anu Adelani and  Habiba Abdulganiy Kailani and  Abdul-Hakeem Omotayo and  Adetola Adeeko and  Afolabi Abeeb and  Anuoluwapo Aremu and  Olanrewaju Samuel and  Clemencia Siro and  Wangari Kimotho and  Onyekachi Raphael Ogbu and  Chinedu E. Mbonu and  Chiamaka I. Chukwuneke and  Samuel Fanijo and  Jessica Ojo and  Oyinkansola F. Awosan and  Tadesse Kebede Guge and  Sakayo Toadoum Sari and  Pamela Nyatsine and  Freedmore Sidume and  Oreen Yousuf and  Mardiyyah Oduwole and  Ussen Kimanuka and  Kanda Patrick Tshinu and  Thina Diko and  Siyanda Nxakama and   Abdulmejid Tuni Johar and  Sinodos Gebre and  Muhidin Mohamed and  Shafie Abdi Mohamed and  Fuad Mire Hassan and  Moges Ahmed Mehamed and  Evrard Ngabire and  and Pontus Stenetorp},
  journal = {ArXiv},
  title = {MasakhaNEWS: News Topic Classification for African languages},
  volume = {},
  year = {2023},
}

MedrxivClusteringP2P

Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category.

Dataset: mteb/medrxiv-clustering-p2pLicense: https://www.medrxiv.org/content/about-medrxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Written derived created v_measure

MedrxivClusteringP2P.v2

Clustering of titles+abstract from medrxiv across 51 categories.

Dataset: mteb/medrxiv-clustering-p2pLicense: https://www.medrxiv.org/content/about-medrxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Medical, Written derived created v_measure

MedrxivClusteringS2S

Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category.

Dataset: mteb/medrxiv-clustering-s2sLicense: https://www.medrxiv.org/content/about-medrxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Medical, Written derived created v_measure

MedrxivClusteringS2S.v2

Clustering of titles from medrxiv across 51 categories.

Dataset: mteb/medrxiv-clustering-s2sLicense: https://www.medrxiv.org/content/about-medrxiv • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Academic, Medical, Written derived created v_measure

MewsC16JaClustering

MewsC-16 (Multilingual Short Text Clustering Dataset for News in 16 languages) is constructed from Wikinews. This dataset is the Japanese split of MewsC-16, containing topic sentences from Wikinews articles in 12 categories. More detailed information is available in the Appendix E of the citation.

Dataset: mteb/MewsC16JaClusteringLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) jpn News, Written derived found v_measure
Citation
@inproceedings{nishikawa-etal-2022-ease,
  address = {Seattle, United States},
  author = {Nishikawa, Sosuke  and
Ri, Ryokan  and
Yamada, Ikuya  and
Tsuruoka, Yoshimasa  and
Echizen, Isao},
  booktitle = {Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  month = jul,
  pages = {3870--3885},
  publisher = {Association for Computational Linguistics},
  title = {{EASE}: Entity-Aware Contrastive Learning of Sentence Embedding},
  url = {https://aclanthology.org/2022.naacl-main.284},
  year = {2022},
}

NLPTwitterAnalysisClustering

Clustering of tweets from twitter across 26 categories.

Dataset: hamedhf/nlp_twitter_analysisLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fas Social derived found v_measure
Citation

OpenTenderClusteringP2P

This dataset contains all the articles published by the NOS as of the 1st of January 2010. The data is obtained by scraping the NOS website. The NOS is one of the biggest (online) news organizations in the Netherlands.

Dataset: clips/mteb-nl-opentender-cls-prLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld Government, Written derived found v_measure
Citation
@misc{banar2025mtebnle5nlembeddingbenchmark,
  archiveprefix = {arXiv},
  author = {Nikolay Banar and Ehsan Lotfi and Jens Van Nooten and Cristina Arhiliuc and Marija Kliocaite and Walter Daelemans},
  eprint = {2509.12340},
  primaryclass = {cs.CL},
  title = {MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch},
  url = {https://arxiv.org/abs/2509.12340},
  year = {2025},
}

OpenTenderClusteringS2S

This dataset contains all the articles published by the NOS as of the 1st of January 2010. The data is obtained by scraping the NOS website. The NOS is one of the biggest (online) news organizations in the Netherlands.

Dataset: clips/mteb-nl-opentender-clst-s2s-prLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld Government, Written derived found v_measure
Citation
@misc{banar2025mtebnle5nlembeddingbenchmark,
  archiveprefix = {arXiv},
  author = {Nikolay Banar and Ehsan Lotfi and Jens Van Nooten and Cristina Arhiliuc and Marija Kliocaite and Walter Daelemans},
  eprint = {2509.12340},
  primaryclass = {cs.CL},
  title = {MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch},
  url = {https://arxiv.org/abs/2509.12340},
  year = {2025},
}

PlscClusteringP2P

Clustering of Polish article titles+abstracts from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.

Dataset: PL-MTEB/plsc-clustering-p2pLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Academic, Written derived found v_measure

PlscClusteringP2P.v2

Clustering of Polish article titles+abstracts from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.

Dataset: mteb/PlscClusteringP2P.v2License: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Academic, Written derived found v_measure

PlscClusteringS2S

Clustering of Polish article titles from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.

Dataset: PL-MTEB/plsc-clustering-s2sLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Academic, Written derived found v_measure

PlscClusteringS2S.v2

Clustering of Polish article titles from Library of Science (https://bibliotekanauki.pl/), either on the scientific field or discipline.

Dataset: PL-MTEB/plsc-clustering-s2sLicense: cc0-1.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) pol Academic, Written derived found v_measure

RedditClustering

Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.

Dataset: mteb/reddit-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Social, Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

RedditClustering-VN

A translated dataset from Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.

Dataset: GreenNode/reddit-clustering-vnLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) vie Social, Web, Written derived machine-translated and LM verified v_measure
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
  archiveprefix = {arXiv},
  author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
  eprint = {2507.21500},
  primaryclass = {cs.CL},
  title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  url = {https://arxiv.org/abs/2507.21500},
  year = {2025},
}

RedditClustering.v2

Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.

Dataset: mteb/reddit-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Social, Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

RedditClusteringP2P

Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.

Dataset: mteb/reddit-clustering-p2pLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Social, Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

RedditClusteringP2P-VN

A translated dataset from Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.

Dataset: GreenNode/reddit-clustering-p2p-vnLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) vie Social, Web, Written derived machine-translated and LM verified v_measure
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
  archiveprefix = {arXiv},
  author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
  eprint = {2507.21500},
  primaryclass = {cs.CL},
  title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  url = {https://arxiv.org/abs/2507.21500},
  year = {2025},
}

RedditClusteringP2P.v2

Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.

Dataset: mteb/reddit-clustering-p2pLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Social, Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

RomaniBibleClustering

Clustering verses from the Bible in Kalderash Romani by book.

Dataset: mteb/RomaniBibleClusteringLicense: mit • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) rom Religious, Written derived human-translated and localized v_measure

RuSciBenchGRNTIClusteringP2P

Clustering of scientific papers (title+abstract) by rubric

Dataset: ai-forever/ru-scibench-grnti-classificationLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) rus Academic, Written derived found v_measure

RuSciBenchOECDClusteringP2P

Clustering of scientific papers (title+abstract) by rubric

Dataset: ai-forever/ru-scibench-oecd-classificationLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) rus Academic, Written derived found v_measure

SIB200ClusteringS2S

SIB-200 is the largest publicly available topic classification dataset based on Flores-200 covering 205 languages and dialects annotated. The dataset is annotated in English for the topics, science/technology, travel, politics, sports, health, entertainment, and geography. The labels are then transferred to the other languages in Flores-200 which are human-translated.

Dataset: mteb/sib200License: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) ace, acm, acq, aeb, afr, ... (197) News, Written expert-annotated human-translated and localized v_measure
Citation
@article{adelani2023sib,
  author = {Adelani, David Ifeoluwa and Liu, Hannah and Shen, Xiaoyu and Vassilyev, Nikita and Alabi, Jesujoba O and Mao, Yanke and Gao, Haonan and Lee, Annie En-Shiun},
  journal = {arXiv preprint arXiv:2309.07445},
  title = {SIB-200: A simple, inclusive, and big evaluation dataset for topic classification in 200+ languages and dialects},
  year = {2023},
}

SIDClustring

Clustering of summariesfrom SIDClustring across categories.

Dataset: MCINext/sid-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) fas Academic derived found v_measure
Citation

SNLClustering

Webscraped articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.

Dataset: adrlau/navjordj-SNL_summarization_copyLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob Encyclopaedic, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

SNLHierarchicalClusteringP2P

Webscrabed articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.

Dataset: mteb/SNLHierarchicalClusteringP2PLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob Encyclopaedic, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

SNLHierarchicalClusteringS2S

Webscrabed articles from the Norwegian lexicon 'Det Store Norske Leksikon'. Uses articles categories as clusters.

Dataset: mteb/SNLHierarchicalClusteringS2SLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob Encyclopaedic, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

SpanishNewsClusteringP2P

Clustering of news articles, 7 topics in total.

Dataset: jinaai/spanish_news_clusteringLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) spa News, Written derived found v_measure
Citation
@misc{kevinmorgado2019spanish,
  author = {Kevin Morgado},
  howpublished = {Kaggle},
  title = {Spanish News Classification},
  url = {https://www.kaggle.com/datasets/kevinmorgado/spanish-news-classification},
  year = {2019},
}

StackExchangeClustering

Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.

Dataset: mteb/stackexchange-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

StackExchangeClustering-VN

A translated dataset from Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.

Dataset: GreenNode/stackexchange-clustering-vnLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) vie Web, Written derived machine-translated and LM verified v_measure
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
  archiveprefix = {arXiv},
  author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
  eprint = {2507.21500},
  primaryclass = {cs.CL},
  title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  url = {https://arxiv.org/abs/2507.21500},
  year = {2025},
}

StackExchangeClustering.v2

Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.

Dataset: mteb/stackexchange-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

StackExchangeClusteringP2P

Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.

Dataset: mteb/stackexchange-clustering-p2pLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

StackExchangeClusteringP2P-VN

A translated Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.

Dataset: GreenNode/stackexchange-clustering-p2p-vnLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) vie Web, Written derived machine-translated and LM verified v_measure
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
  archiveprefix = {arXiv},
  author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
  eprint = {2507.21500},
  primaryclass = {cs.CL},
  title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  url = {https://arxiv.org/abs/2507.21500},
  year = {2025},
}

StackExchangeClusteringP2P.v2

Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.

Dataset: mteb/stackexchange-clustering-p2pLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Web, Written derived found v_measure
Citation
@article{geigle:2021:arxiv,
  archiveprefix = {arXiv},
  author = {Gregor Geigle and
Nils Reimers and
Andreas R{\"u}ckl{\'e} and
Iryna Gurevych},
  eprint = {2104.07081},
  journal = {arXiv preprint},
  title = {TWEAC: Transformer with Extendable QA Agent Classifiers},
  url = {http://arxiv.org/abs/2104.07081},
  volume = {abs/2104.07081},
  year = {2021},
}

SwednClustering

The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.

Dataset: mteb/SwednClusteringLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) swe News, Non-fiction, Written derived found v_measure
Citation
@inproceedings{monsen2021method,
  author = {Monsen, Julius and J{\"o}nsson, Arne},
  booktitle = {Proceedings of CLARIN Annual Conference},
  title = {A method for building non-english corpora for abstractive text summarization},
  year = {2021},
}

SwednClusteringP2P

The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.

Dataset: mteb/SwednClusteringP2PLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) swe News, Non-fiction, Written derived found v_measure
Citation
@inproceedings{monsen2021method,
  author = {Monsen, Julius and J{\"o}nsson, Arne},
  booktitle = {Proceedings of CLARIN Annual Conference},
  title = {A method for building non-english corpora for abstractive text summarization},
  year = {2021},
}

SwednClusteringS2S

The SWE-DN corpus is based on 1,963,576 news articles from the Swedish newspaper Dagens Nyheter (DN) during the years 2000--2020. The articles are filtered to resemble the CNN/DailyMail dataset both regarding textual structure. This dataset uses the category labels as clusters.

Dataset: mteb/SwednClusteringS2SLicense: cc-by-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) swe News, Non-fiction, Written derived found v_measure
Citation
@inproceedings{monsen2021method,
  author = {Monsen, Julius and J{\"o}nsson, Arne},
  booktitle = {Proceedings of CLARIN Annual Conference},
  title = {A method for building non-english corpora for abstractive text summarization},
  year = {2021},
}

TenKGnadClusteringP2P

Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category.

Dataset: slvnwhrl/tenkgnad-clustering-p2pLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu Web, Written derived found v_measure

TenKGnadClusteringP2P.v2

Clustering of news article titles+subheadings+texts. Clustering of 10 splits on the news article category. v2 uses a faster evaluation method used in the MMTEB paper, which allow for notably faster evaluation.

Dataset: slvnwhrl/tenkgnad-clustering-p2pLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu News, Non-fiction, Written derived found v_measure

TenKGnadClusteringS2S

Clustering of news article titles. Clustering of 10 splits on the news article category.

Dataset: slvnwhrl/tenkgnad-clustering-s2sLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu News, Non-fiction, Written derived found v_measure

TenKGnadClusteringS2S.v2

Clustering of news article titles. Clustering of 10 splits on the news article category. v2 uses a faster evaluation method used in the MMTEB paper, which allow for notably faster evaluation.

Dataset: slvnwhrl/tenkgnad-clustering-s2sLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) deu News, Non-fiction, Written derived found v_measure

ThuNewsClusteringP2P

Clustering of titles + abstracts from the THUCNews dataset

Dataset: C-MTEB/ThuNewsClusteringP2PLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn News, Written derived found v_measure
Citation
@inproceedings{eisner2007proceedings,
  author = {Eisner, Jason},
  booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
  title = {Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
  year = {2007},
}

@inproceedings{li2006comparison,
  author = {Li, Jingyang and Sun, Maosong and Zhang, Xian},
  booktitle = {proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics},
  pages = {545--552},
  title = {A comparison and semi-quantitative analysis of words and character-bigrams as features in chinese text categorization},
  year = {2006},
}

ThuNewsClusteringP2P.v2

Clustering of titles + abstracts from the THUCNews dataset

Dataset: C-MTEB/ThuNewsClusteringP2PLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn News, Written derived found v_measure
Citation
@software{sun2016thuctc,
  author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
  note = {THU Chinese Text Classification Toolkit},
  publisher = {THU Natural Language Processing Lab},
  title = {THUCTC: An Efficient Chinese Text Classifier},
  url = {https://github.com/thunlp/THUCTC},
  year = {2016},
}

ThuNewsClusteringS2S

Clustering of titles from the THUCNews dataset

Dataset: C-MTEB/ThuNewsClusteringS2SLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn News, Written derived found v_measure
Citation
@inproceedings{eisner2007proceedings,
  author = {Eisner, Jason},
  booktitle = {Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)},
  title = {Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)},
  year = {2007},
}

@inproceedings{li2006comparison,
  author = {Li, Jingyang and Sun, Maosong and Zhang, Xian},
  booktitle = {proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics},
  pages = {545--552},
  title = {A comparison and semi-quantitative analysis of words and character-bigrams as features in chinese text categorization},
  year = {2006},
}

ThuNewsClusteringS2S.v2

Clustering of titles from the THUCNews dataset

Dataset: C-MTEB/ThuNewsClusteringS2SLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) cmn News, Written derived found v_measure
Citation
@software{sun2016thuctc,
  author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
  note = {THU Chinese Text Classification Toolkit},
  publisher = {THU Natural Language Processing Lab},
  title = {THUCTC: An Efficient Chinese Text Classifier},
  url = {https://github.com/thunlp/THUCTC},
  year = {2016},
}

TwentyNewsgroupsClustering

Clustering of the 20 Newsgroups dataset (subject only).

Dataset: mteb/twentynewsgroups-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng News, Written derived found v_measure
Citation
@incollection{LANG1995331,
  address = {San Francisco (CA)},
  author = {Ken Lang},
  booktitle = {Machine Learning Proceedings 1995},
  doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
  editor = {Armand Prieditis and Stuart Russell},
  isbn = {978-1-55860-377-6},
  pages = {331-339},
  publisher = {Morgan Kaufmann},
  title = {NewsWeeder: Learning to Filter Netnews},
  url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
  year = {1995},
}

TwentyNewsgroupsClustering-VN

A translated dataset from Clustering of the 20 Newsgroups dataset (subject only). The process of creating the VN-MTEB (Vietnamese Massive Text Embedding Benchmark) from English samples involves a new automated system: - The system uses large language models (LLMs), specifically Coherence's Aya model, for translation. - Applies advanced embedding models to filter the translations. - Use LLM-as-a-judge to scoring the quality of the samples base on multiple criteria.

Dataset: GreenNode/twentynewsgroups-clustering-vnLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) vie News, Written derived machine-translated and LM verified v_measure
Citation
@misc{pham2025vnmtebvietnamesemassivetext,
  archiveprefix = {arXiv},
  author = {Loc Pham and Tung Luu and Thu Vo and Minh Nguyen and Viet Hoang},
  eprint = {2507.21500},
  primaryclass = {cs.CL},
  title = {VN-MTEB: Vietnamese Massive Text Embedding Benchmark},
  url = {https://arxiv.org/abs/2507.21500},
  year = {2025},
}

TwentyNewsgroupsClustering.v2

Clustering of the 20 Newsgroups dataset (subject only).

Dataset: mteb/twentynewsgroups-clusteringLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng News, Written derived found v_measure
Citation
@incollection{LANG1995331,
  address = {San Francisco (CA)},
  author = {Ken Lang},
  booktitle = {Machine Learning Proceedings 1995},
  doi = {https://doi.org/10.1016/B978-1-55860-377-6.50048-7},
  editor = {Armand Prieditis and Stuart Russell},
  isbn = {978-1-55860-377-6},
  pages = {331-339},
  publisher = {Morgan Kaufmann},
  title = {NewsWeeder: Learning to Filter Netnews},
  url = {https://www.sciencedirect.com/science/article/pii/B9781558603776500487},
  year = {1995},
}

VABBClusteringP2P

This dataset contains the fourteenth edition of the Flemish Academic Bibliography for the Social Sciences and Humanities (VABB-SHW), a database of academic publications from the social sciences and humanities authored by researchers affiliated to Flemish universities (more information). Publications in the database are used as one of the parameters of the Flemish performance-based research funding system

Dataset: clips/mteb-nl-vabb-clsLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld Academic, Written derived found v_measure
Citation
@dataset{aspeslagh2024vabb,
  author = {Aspeslagh, Pieter and Guns, Raf and Engels, Tim C. E.},
  doi = {10.5281/zenodo.14214806},
  publisher = {Zenodo},
  title = {VABB-SHW: Dataset of Flemish Academic Bibliography for the Social Sciences and Humanities (edition 14)},
  url = {https://doi.org/10.5281/zenodo.14214806},
  year = {2024},
}

VABBClusteringS2S

This dataset contains the fourteenth edition of the Flemish Academic Bibliography for the Social Sciences and Humanities (VABB-SHW), a database of academic publications from the social sciences and humanities authored by researchers affiliated to Flemish universities (more information). Publications in the database are used as one of the parameters of the Flemish performance-based research funding system

Dataset: clips/mteb-nl-vabb-clsLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nld Academic, Written derived found v_measure
Citation
@dataset{aspeslagh2024vabb,
  author = {Aspeslagh, Pieter and Guns, Raf and Engels, Tim C. E.},
  doi = {10.5281/zenodo.14214806},
  publisher = {Zenodo},
  title = {VABB-SHW: Dataset of Flemish Academic Bibliography for the Social Sciences and Humanities (edition 14)},
  url = {https://doi.org/10.5281/zenodo.14214806},
  year = {2024},
}

VGClustering

Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.

Dataset: navjordj/VG_summarizationLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob News, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

VGHierarchicalClusteringP2P

Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.

Dataset: navjordj/VG_summarizationLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob News, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

VGHierarchicalClusteringS2S

Articles and their classes (e.g. sports) from VG news articles extracted from Norsk Aviskorpus.

Dataset: navjordj/VG_summarizationLicense: cc-by-nc-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) nob News, Non-fiction, Written derived found v_measure
Citation
@mastersthesis{navjord2023beyond,
  author = {Navjord, J{\\o}rgen Johnsen and Korsvik, Jon-Mikkel Ryen},
  school = {Norwegian University of Life Sciences, {\\AA}s},
  title = {Beyond extractive: advancing abstractive automatic text summarization in Norwegian with transformers},
  year = {2023},
}

WikiCitiesClustering

Clustering of Wikipedia articles of cities by country from https://huggingface.co/datasets/wikipedia. Test set includes 126 countries, and a total of 3531 cities.

Dataset: mteb/WikiCitiesClusteringLicense: cc-by-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Encyclopaedic, Written derived found v_measure
Citation
@online{wikidump2024,
  author = {Wikimedia Foundation},
  title = {Wikimedia Downloads},
  url = {https://dumps.wikimedia.org},
}

WikiClusteringP2P

Clustering of wikipedia articles inspired by BlubrbsClusteringP2P. Labels are taken from top-level categories of the respective languages (e.g., https://lv.wikipedia.org/wiki/Kategorija:Pamatkategorijas).

Dataset: ryzzlestrizzle/multi-wiki-clustering-p2pLicense: cc-by-sa-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) bos, cat, ces, dan, eus, ... (14) Encyclopaedic, Written derived created v_measure

WikiClusteringP2P.v2

Clustering of wikipedia articles inspired by BlubrbsClusteringP2P. Labels are taken from top-level categories of the respective languages (e.g., https://lv.wikipedia.org/wiki/Kategorija:Pamatkategorijas).

Dataset: mteb/WikiClusteringP2P.v2License: cc-by-sa-3.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) bos, cat, ces, dan, eus, ... (14) Encyclopaedic, Written derived created v_measure

WikipediaChemistryTopicsClustering

ChemTEB evaluates the performance of text embedding models on chemical domain data.

Dataset: BASF-AI/WikipediaEasy10ClusteringLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Chemistry derived created v_measure
Citation
@article{kasmaee2024chemteb,
  author = {Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
  journal = {arXiv preprint arXiv:2412.00532},
  title = {ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
  year = {2024},
}

WikipediaSpecialtiesInChemistryClustering

ChemTEB evaluates the performance of text embedding models on chemical domain data.

Dataset: BASF-AI/WikipediaMedium5ClusteringLicense: cc-by-nc-sa-4.0 • Learn more →

Category Languages Domains Annotations Creation Score
text to category (t2c) eng Chemistry derived created v_measure
Citation
@article{kasmaee2024chemteb,
  author = {Kasmaee, Ali Shiraee and Khodadad, Mohammad and Saloot, Mohammad Arshi and Sherck, Nick and Dokas, Stephen and Mahyar, Hamidreza and Samiee, Soheila},
  journal = {arXiv preprint arXiv:2412.00532},
  title = {ChemTEB: Chemical Text Embedding Benchmark, an Overview of Embedding Models Performance \& Efficiency on a Specific Domain},
  year = {2024},
}

ImageClustering

  • Number of tasks: 5

CIFAR100Clustering

Clustering images from 100 classes.

Dataset: mteb/cifar100License: not specified • Learn more →

Category Languages Domains Annotations Creation Score
image to text (i2t) eng Web derived created nmi
Citation
@techreport{Krizhevsky09learningmultiple,
  author = {Alex Krizhevsky},
  institution = {},
  title = {Learning multiple layers of features from tiny images},
  year = {2009},
}

CIFAR10Clustering

Clustering images from 10 classes.

Dataset: mteb/cifar10License: not specified • Learn more →

Category Languages Domains Annotations Creation Score
image to category (i2c) eng Web derived created nmi
Citation
@techreport{Krizhevsky09learningmultiple,
  author = {Alex Krizhevsky},
  institution = {},
  title = {Learning multiple layers of features from tiny images},
  year = {2009},
}

ImageNet10Clustering

Clustering images from an 10-class subset of ImageNet which are generally easy to distinguish.

Dataset: mteb/imagenet-10License: not specified • Learn more →

Category Languages Domains Annotations Creation Score
image to text (i2t) eng Web derived created nmi
Citation
@inproceedings{deng2009imagenet,
  author = {Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei},
  booktitle = {2009 IEEE Conference on Computer Vision and Pattern Recognition},
  doi = {10.1109/CVPR.2009.5206848},
  keywords = {Large-scale systems;Image databases;Explosions;Internet;Robustness;Information retrieval;Image retrieval;Multimedia databases;Ontologies;Spine},
  number = {},
  pages = {248-255},
  title = {ImageNet: A large-scale hierarchical image database},
  volume = {},
  year = {2009},
}

ImageNetDog15Clustering

Clustering images from a 15-class dogs-only subset of the dog classes in ImageNet.

Dataset: mteb/imagenet-dog-15License: not specified • Learn more →

Category Languages Domains Annotations Creation Score
image to category (i2c) eng Web derived created nmi
Citation
@inproceedings{deng2009imagenet,
  author = {Deng, Jia and Dong, Wei and Socher, Richard and Li, Li-Jia and Kai Li and Li Fei-Fei},
  booktitle = {2009 IEEE Conference on Computer Vision and Pattern Recognition},
  doi = {10.1109/CVPR.2009.5206848},
  keywords = {Large-scale systems;Image databases;Explosions;Internet;Robustness;Information retrieval;Image retrieval;Multimedia databases;Ontologies;Spine},
  number = {},
  pages = {248-255},
  title = {ImageNet: A large-scale hierarchical image database},
  volume = {},
  year = {2009},
}

TinyImageNetClustering

Clustering over 200 classes.

Dataset: mteb/tiny-imagenetLicense: not specified • Learn more →

Category Languages Domains Annotations Creation Score
image to category (i2c) eng Reviews derived found nmi